The International Telecommunication Union (ITU) recommends keeping mouth-to-ear latency under 400 milliseconds for natural conversation. This latency measures the time from the moment a speaker's words are uttered until the listener hears them, a critical benchmark for AI systems that aim to emulate human interaction. Delays beyond about one second disrupt conversational flow and feel unnatural. Many voice AI systems are now approaching the 400-millisecond mark as response times continue to improve.
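To see why the 400-millisecond ceiling is hard to hit, it helps to add up the stages a voice AI pipeline must pass through. The following sketch uses purely illustrative stage names and latency figures (they are assumptions, not measurements from any real system) to show how quickly a pipeline's budget is consumed:

```python
# Hypothetical per-stage latencies in milliseconds for a voice AI
# pipeline. The stage names and figures are illustrative assumptions,
# not measurements from any specific system.
STAGES_MS = {
    "audio_capture": 30,
    "speech_to_text": 120,
    "llm_response": 150,
    "text_to_speech": 80,
    "network_transit": 40,
}

ITU_CEILING_MS = 400  # ITU mouth-to-ear recommendation

total = sum(STAGES_MS.values())
verdict = "within" if total <= ITU_CEILING_MS else "exceeds"
print(f"mouth-to-ear latency: {total} ms ({verdict} the {ITU_CEILING_MS} ms ceiling)")
```

With these example numbers the pipeline already overshoots the budget, which is why shaving even tens of milliseconds per stage matters.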
In specific applications such as healthcare, latency problems can undermine effectiveness, especially when combined with inadequate language support. One Australian startup ran into both issues when using AI callers for elderly Cantonese-speaking patients: round trips to U.S.-based infrastructure added latency, and Cantonese text-to-speech (TTS) was unavailable. To cut latency, developers should favor streaming, end-to-end processing that overlaps input and output, rather than waiting for each stage to finish before the next begins.
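The gain from overlapping stages can be illustrated with simple arithmetic. This sketch compares time-to-first-audio for a sequential pipeline (generate the full reply, then synthesize it) against a streaming one (synthesize each chunk as it is generated); all token counts and per-token timings are assumed values for demonstration only:

```python
# Illustrative comparison of sequential vs streaming voice pipelines.
# All figures below are assumptions chosen for demonstration.

TOKENS = 20             # tokens in the model's full reply (assumed)
GEN_MS_PER_TOKEN = 40   # assumed text-generation speed
TTS_MS_PER_TOKEN = 10   # assumed speech-synthesis speed
CHUNK = 5               # tokens per streamed chunk (assumed)

# Sequential: nothing plays until generation AND synthesis of the
# entire reply are complete.
sequential_first_audio = TOKENS * (GEN_MS_PER_TOKEN + TTS_MS_PER_TOKEN)

# Streaming: the first chunk is synthesized as soon as it is
# generated, so audio starts after one chunk of each stage.
streaming_first_audio = CHUNK * (GEN_MS_PER_TOKEN + TTS_MS_PER_TOKEN)

print(f"sequential time-to-first-audio: {sequential_first_audio} ms")
print(f"streaming time-to-first-audio:  {streaming_first_audio} ms")
```

Under these assumed numbers, streaming cuts time-to-first-audio by a factor of four, which is the intuition behind overlapping input and output rather than processing them in sequence.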