Amazon is best known as an e-commerce giant. Somewhat further down its list of notable offerings is the Alexa AI voice assistant, which got a big intelligence upgrade last month thanks in part to Anthropic, a company Amazon has invested in.
Now Alexa will have to make space for a new Amazon voice AI sibling: today the company is introducing Amazon Nova Sonic, a new foundation model designed to let third-party app developers build real-time, naturalistic, conversational voice interactivity into their products using Amazon’s web platform, Bedrock.
It’s available now via a bi-directional streaming application programming interface (API).
Obvious use cases include customer support and service, guidance, information retrieval, and entertainment.
A unified approach
Nova Sonic addresses a key challenge in voice AI: the fragmentation of technologies.
Traditionally, building voice interfaces required combining separate models for speech recognition, language processing, and speech synthesis, Rohit Prasad, SVP and Head Scientist for Artificial General Intelligence (AGI) at Amazon, told VentureBeat yesterday in a video call interview conducted over Amazon’s Chime video service.
This complexity often results in robotic, unnatural interactions and increased development overhead.
Now, Sonic seeks to improve on this state of affairs by combining all three distinct model types into one.
Prasad explained the model’s core innovation: “Nova Sonic brings together three traditionally separate models—speech-to-text, text understanding, and text-to-speech—into one unified system that can model not just the ‘what’ but also the ‘how’ of communication.”
By retaining the acoustic context—such as tone, cadence, and style—Nova Sonic helps maintain the nuances of human conversation.
Recognizing the intricacies and quirks of live, two-way audio conversations
One of Nova Sonic’s defining capabilities is its ability to handle live, two-way conversations. It recognizes when users pause, hesitate, or interrupt—common behaviors in human speech—and responds fluidly while maintaining context.
“The real breakthrough here is real-time, interactive, low-latency voice interaction, which means you can interrupt the AI mid-sentence, and it will still maintain context and respond coherently,” said Prasad. This feature is especially relevant in scenarios like customer service, where responsiveness and adaptability are critical.
Nova Sonic is also designed to integrate seamlessly with other systems. It automatically generates transcripts of spoken input, which can be used to trigger APIs or interact with proprietary tools. This allows companies to build AI agents that can perform tasks such as booking appointments, retrieving live information, or answering complex customer inquiries.
“You can use Nova Sonic through Amazon Bedrock and connect it with any tools or proprietary data sources, even visual ones, as long as they’re wrapped as callable APIs,” said Prasad. This flexibility makes the model suitable for a wide range of industries, from education and travel to enterprise operations and entertainment.
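To make the "callable APIs" idea concrete, here is a minimal sketch of how an application might route a model's tool request to its own code. It does not use the Bedrock API: the request format (a JSON object with `name` and `arguments`), the `TOOLS` registry, and the `book_appointment` function are all hypothetical and exist only for illustration.

```python
import json
from typing import Callable, Dict

# Hypothetical registry mapping a tool name to the Python callable that wraps it.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("book_appointment")
def book_appointment(date: str, time: str) -> str:
    # Placeholder: a real application would call its booking service here.
    return f"Booked for {date} at {time}"


def dispatch(tool_request_json: str) -> str:
    """Run the tool named in an (assumed) JSON tool request and return its result."""
    request = json.loads(tool_request_json)
    fn = TOOLS.get(request["name"])
    if fn is None:
        return f"Unknown tool: {request['name']}"
    return fn(**request.get("arguments", {}))


if __name__ == "__main__":
    print(dispatch(
        '{"name": "book_appointment",'
        ' "arguments": {"date": "2025-05-01", "time": "10:00"}}'
    ))
```

In a setup like this, the text returned by `dispatch` would be handed back to the voice model so it can speak the outcome to the user; how that hand-off works in practice depends on the streaming API and is not shown here.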
Benchmark performance and industry comparisons
Nova Sonic has been benchmarked against other real-time voice models, including OpenAI’s GPT-4o and Google’s Gemini Flash 2.0. On the Common Eval data set, it achieved a 69.7% win-rate over Gemini Flash 2.0 and a 51.0% win-rate over GPT-4o for American English single-turn conversations using a masculine voice. Similar gains were seen with feminine and British English voices.
Prasad emphasized Nova Sonic’s strong performance in its primary language markets: “Nova Sonic is currently best-in-class in U.S. and British English, outperforming even GPT-4o real-time in both conversational naturalness and accuracy.” He added, “To the best of our knowledge, only two other models—GPT-4o real-time and a variant of GPT-4o mini—come close to what Nova Sonic does in combining speech understanding and generation in real time. This space is still very early and very hard.”
Multilingual capabilities and noisy environment handling
In speech recognition, Nova Sonic also excels in multilingual and real-world conditions. It recorded a word error rate (WER) of 4.2% on the Multilingual LibriSpeech benchmark, outperforming GPT-4o Transcribe by over 36% across English, French, German, Italian, and Spanish. In noisy, multi-speaker environments (measured using the AMI benchmark), Nova Sonic showed a 46.7% improvement in WER over GPT-4o Transcribe.
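For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance and shows how a relative improvement between two WER figures can be derived. It is a generic illustration, not the evaluation code used by Amazon or the benchmarks, and the sentences and numbers in it are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into zero words costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative sentences only: one substituted word out of five.
    print(word_error_rate("book a table for two", "book the table for two"))
    # Illustrative numbers only: halving a 10% WER is a 50% relative improvement.
    print(relative_improvement(0.10, 0.05))
```

Note that a relative improvement is measured against the baseline's error rate, so a large percentage gain can correspond to a difference of only a few words per hundred.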
Expressive voices and language expansion
Currently, the model supports multiple expressive voices, both masculine and feminine, in American and British English. Amazon noted that additional accents and languages are in development and will be released in future updates.
Low latency and enterprise-friendly cost
Speed and cost are also part of the appeal. Third-party benchmarking shows Nova Sonic delivers a customer-perceived latency of 1.09 seconds, compared to 1.18 seconds for OpenAI’s GPT-4o and 1.41 seconds for Google’s Gemini Flash 2.0.
From a pricing standpoint, Amazon positions Nova Sonic as an enterprise-ready solution. “We’re nearly 80% cheaper than GPT-4o real-time, and that superior price-performance is resonating with enterprises moving from experimentation to deployment,” said Prasad.
Early adoption across sectors
According to Amazon, companies across different sectors have already begun using or testing Nova Sonic.
ASAPP is applying the technology to optimize contact center workflows, praising its accuracy and natural dialog handling.
Education First (EF) uses the model to support language learners with real-time pronunciation feedback, especially for non-native speakers with varied accents.
Sports data provider Stats Perform is leveraging Nova Sonic’s low latency and simple setup to power rapid, data-rich interactions in its Opta AI Chat platform.
Responsible AI and safety commitment
Alongside performance and cost, Amazon is highlighting its commitment to responsible AI development. The Nova family of models includes built-in safeguards and is supported by AWS AI Service Cards that outline intended use cases, potential limitations, and ethical guidelines.
Prasad underscored Amazon’s focus on trust and safety: “Trust is paramount for us—developers can customize personality within limits, but we’ve put in strong guardrails to prevent voice cloning or unwanted mimicry.” He added, “We work extremely hard to eliminate hallucinations and voice drift. The bar we’ve set for release is high because speech generation must be trustworthy.”
Amazon Nova Sonic is now generally available through Amazon Bedrock. Developers and enterprises interested in exploring the model can get started by visiting https://aws.amazon.com/nova/.
