Announcing new products and features for Azure OpenAI Service including GPT-4o-Realtime-Preview with audio and speech capabilities

We are thrilled to announce the public preview of GPT-4o-Realtime-Preview for audio and speech, a major enhancement to Microsoft Azure OpenAI Service that adds advanced voice capabilities and expands GPT-4o’s multimodal offerings. This milestone further solidifies Azure’s leadership in AI, especially in the realm of speech technology. Azure’s legacy in this space is long established through its speech service, which has integrated speech-to-text, text-to-speech, neural voices, and real-time translation across core Microsoft products like Teams, Office 365, and Edge.

Now, GPT-4o-Realtime-Preview pushes the boundaries even further by integrating language generation with seamless voice interaction, giving developers the tools they need to craft more natural and conversational AI experiences. From creating virtual assistants to powering real-time customer support, this new model opens a vast array of possibilities for voice-driven applications. The new model is also integrated with Copilot, as part of the newly announced Copilot Voice product.

Building on recent Azure OpenAI announcements

This announcement continues a series of significant updates within Azure OpenAI Service, including:

o1 series: A new lineup of models designed for advanced reasoning over complex data. We are happy to make the API available to developers on Azure today, after a two-week preview in the Azure AI Studio Playground.

Data zones: Enabling regional data residency to support customer privacy and compliance.

Trustworthy AI: New tooling, including evaluations in Azure AI Studio to support proactive risk assessments, and watermarking on images generated by DALL·E.

Prompt caching (coming soon): Cheaper and faster inference through caching on GPT-4o and o1 models.

This continuous evolution demonstrates Azure’s commitment to providing the most comprehensive, secure, and versatile AI tools to customers worldwide. Bookmark our newsfeed to track all future announcements.

What’s new in GPT-4o-Realtime-Preview?

GPT-4o-Realtime API: With this release, GPT-4o evolves to support audio input and output, enabling real-time, natural voice-based interactions that go beyond traditional text-based AI conversations. This multimodal capability empowers developers to build innovative voice applications with ease.

Azure AI Studio Early Access playground: For developers eager to explore, this dedicated space allows early experimentation with GPT-4o-Realtime API for Audio capabilities. The studio provides an environment to test, fine-tune, and optimize voice interactions before launching them into production environments.

Performance that speaks for itself

Early customers using GPT-4o-Realtime API for Audio shared remarkable results, confirming its performance and impact:

Faster responses: GPT-4o-Realtime API for Audio provides voice responses significantly faster than many traditional text-to-speech engines, leading to reduced latency and smoother interactions.

Natural conversations: The model minimizes the robotic tone often associated with AI-generated speech, making conversations sound more engaging.

Multilingual support: The API supports a wide range of languages, allowing for natural, multilingual conversations that can be applied to global-facing applications.

Applications of GPT-4o-Realtime-Preview in Azure OpenAI Service

The potential of GPT-4o-Realtime-Preview spans various industries, transforming how businesses operate and how users interact with technology:

Customer service: Voice-based chatbots and virtual assistants can now handle customer inquiries more naturally and efficiently, reducing wait times and improving overall satisfaction.

Content creation: Media producers can revolutionize their workflows by leveraging speech generation for video games, podcasts, and films.

Real-time translation: Industries such as healthcare and legal services can benefit from real-time audio translation, breaking down language barriers and fostering better communication in critical contexts.

Use cases driving innovation

The versatility of GPT-4o-Realtime-Preview is already transforming operations across a variety of sectors. Here are a few early adopters and how they’re benefiting from this technology:

Bosch (Germany): Integrating GPT-4o-Realtime API for Audio for virtual reality training in automotive settings, allowing consumers and technicians to receive voice-guided instructions.

“AOAI is an ideal interface for our HeyBosch – Virtual Sales Executive Solution as it is a conversation first solution. We can easily integrate AOAI to our existing solution – Thanks for the reference samples. The response time from the virtual agent has improved substantially as we now have a single interface coupling both (speech and LLM). This helps in keeping latency minimal.  This integration shows the art of possibility of creating compelling user experiences combining GenAI, 3D tech and real time speech processing capabilities.”—Vamsidhar Sunkari Senior Expert Bosch Global Software Technologies Pvt Ltd.

Lyrebird Health (Australia): Using GPT-4o-Realtime-Preview as a medical copilot, summarizing patient information and automating follow-up tasks in real time.

“Lyrebird Health is excited to bring audio capabilities to the provider/patient relationship. The new GPT-4o-realtime-preview model will allow us to experiment and launch new experiences for our customers and end users. This will help us on our mission to provide the best people technology on the planet.”—Kai Van Lieshout, Co-founder and CEO of Lyrebird Health

Azure AI Search: VoiceRAG leverages Azure OpenAI’s GPT-4o real-time audio model and Azure AI Search to create an advanced voice-based generative AI application with Retrieval-Augmented Generation (RAG). The system integrates real-time audio streaming and function calling to perform knowledge base searches, ensuring responses are well-grounded without compromising latency. By securely handling model configurations and retrieval processes on the backend, VoiceRAG provides a natural, conversational interface that includes citations seamlessly displayed in the user experience. Take a deep dive into the VoiceRAG experience in a dedicated blog post on Microsoft Tech Community.
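To make the function-calling pattern described for VoiceRAG more concrete, here is a minimal Python sketch of a backend tool handler. It is an illustration under stated assumptions, not the VoiceRAG implementation: the in-memory KNOWLEDGE_BASE, the toy keyword search, the tool name "search", and the handle_tool_call helper are all hypothetical stand-ins. A real system would query Azure AI Search and exchange events with the Realtime API instead.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str  # identifier the UI can display as a citation
    text: str


# Hypothetical in-memory knowledge base standing in for a search index.
KNOWLEDGE_BASE = [
    Chunk("kb-1", "Employees accrue 1.5 vacation days per month."),
    Chunk("kb-2", "The health plan covers annual dental checkups."),
    Chunk("kb-3", "Remote work requires manager approval."),
]


def search(query: str, top: int = 2) -> list[Chunk]:
    """Toy keyword-overlap search (assumption: replaces a real retrieval call)."""
    terms = set(query.lower().split())
    scored = []
    for chunk in KNOWLEDGE_BASE:
        words = set(chunk.text.lower().replace(".", "").split())
        score = len(terms & words)
        if score:
            scored.append((score, chunk))
    # Highest overlap first; ties keep knowledge-base order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top]]


def handle_tool_call(name: str, arguments_json: str) -> str:
    """Run a model-requested tool on the backend and return a JSON result.

    The model only sees the returned text and chunk ids, so retrieval
    configuration stays on the server side.
    """
    if name != "search":
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    hits = search(args["query"])
    return json.dumps(
        {"sources": [{"id": c.chunk_id, "text": c.text} for c in hits]}
    )
```

In this sketch the backend would pass the returned JSON back to the model as the tool result, and the client could render the "id" values as citations alongside the spoken answer.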

Our commitment to Trustworthy AI

Azure remains steadfast in its commitment to responsible AI, with safety and privacy as default priorities. The Realtime API utilizes multiple layers of safety measures, including automated monitoring and human review, to prevent misuse.

The Realtime API has undergone rigorous evaluations guided by our commitments to Responsible AI. Check out the 2024 Responsible AI Transparency Report.

Azure OpenAI Service provides built-in Content Safety features at no extra cost, and Azure AI Studio offers tools to assess the safety of your AI applications, ensuring a secure and responsible AI experience.

What’s next with GPT-4o-Realtime API for Audio?

As we continue to innovate and expand the capabilities of GPT-4o-Realtime API for Audio, we are excited to see how developers and businesses will leverage this cutting-edge technology to create voice-driven applications that push the boundaries of what’s possible.

Whether you’re looking to integrate voice capabilities into your customer service operations or explore the possibilities of multilingual interactions, GPT-4o-Realtime API for Audio provides the flexibility and power to transform your AI solutions. Starting today, you can explore these new capabilities in the Azure OpenAI Studio, experiment with them in the Early Access Playground, or integrate the Realtime API, now in public preview, directly into your applications.

Be sure to review our documentation for the latest updates, dive into the available use cases, and start building with GPT-4o-Realtime API for Audio to bring your business to the next level of AI innovation.

Stay tuned for upcoming customer stories, detailed use case demos, and more as we continue to roll out updates in the weeks ahead!