As of March 22, 2026, it is no longer accurate to describe Voice AI with the old table of GPT-4o Realtime API / ElevenLabs Conversational AI / Sesame / Deepgram and the binary choice of pipeline vs native speech-to-speech. The current official landscape has become broader and more precise:
- OpenAI now centers not gpt-4o-realtime-preview but gpt-realtime and gpt-realtime-1.5;
- OpenAI also ships the dedicated speech models gpt-4o-mini-transcribe and gpt-4o-mini-tts;
- ElevenLabs has folded Conversational AI into ElevenLabs Agents and now positions the product as a multimodal agent platform;
- Deepgram ships Flux plus the Voice Agent API.

That is why today it is more useful to understand Voice AI as a market of realtime agent stacks, not as just a set of STT/TTS services.
The old dichotomy of GPT-4o Realtime vs STT→LLM→TTS is now too coarse. The current landscape includes gpt-realtime, Gemini Live, ElevenLabs Agents, Deepgram Flux, and dedicated transcribe/TTS models, with subtler tradeoffs around latency, tooling, deployment, and orchestration. Current Voice AI can no longer honestly be explained as just speech-to-text + LLM + speech synthesis.
Official sources now point to four distinct product categories:

- OpenAI's gpt-realtime family: a native realtime lane;
- Google's Gemini Live: a live multimodal lane;
- ElevenLabs Agents: a voice-rich agent platform;
- Deepgram Flux and the Voice Agent API: timing- and pipeline-optimized speech infrastructure.

Because of this, practical selection in 2026 has become harder, but also more meaningful.
The most important current update from OpenAI: official docs now put gpt-realtime at the center.
The model page states:
- gpt-realtime is the first GA realtime model;
- it connects over WebRTC, WebSocket, or SIP;
- it has a 32k context window.

OpenAI also surfaces gpt-realtime-1.5 as a stronger flagship audio lane for voice agents and customer support.
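For the WebSocket transport, the connection shape can be sketched as below. The `wss://api.openai.com/v1/realtime` endpoint and the `session.update` event follow OpenAI's published Realtime API conventions, but treat the exact model name and session field names here as assumptions to verify against the current docs.

```python
# Sketch: building a Realtime WebSocket URL and a session-config event.
# Endpoint and event shape follow OpenAI's Realtime API docs; the model
# name and the session fields are assumptions, not guaranteed current.
from urllib.parse import urlencode


def realtime_ws_url(model: str = "gpt-realtime") -> str:
    """Build the WebSocket URL for a realtime session."""
    base = "wss://api.openai.com/v1/realtime"
    return f"{base}?{urlencode({'model': model})}"


def session_update(voice: str = "alloy") -> dict:
    """A session.update event body (field names are assumptions here)."""
    return {
        "type": "session.update",
        "session": {"voice": voice, "modalities": ["audio", "text"]},
    }
```

In a real client you would open this URL with a WebSocket library, send the `session.update` event as JSON, and then stream audio frames; authentication headers are omitted here.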
This is an important shift:

- the gpt-4o-realtime-preview framing is no longer the best default;
- the center of gravity is now the gpt-realtime family plus dedicated speech models.

Current OpenAI docs also make the specialized speech stack clearer:
- gpt-4o-mini-transcribe for speech-to-text;
- gpt-4o-mini-tts for text-to-speech.

This matters practically because not every voice app needs a full native realtime model.
Use cases that genuinely need full-duplex, low-latency conversation belong on gpt-realtime. In 2026, the useful OpenAI framing is not "one audio model does everything", but a realtime lane plus dedicated speech utilities.
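The dedicated-utilities lane can be sketched as a plain STT → LLM → TTS loop. This is a minimal sketch with every stage stubbed out: the model names in the comments come from the text above, but none of the function bodies are real API calls.

```python
# Minimal sketch of the "dedicated speech utilities" lane:
# STT -> text LLM -> TTS, with each stage stubbed out.
from dataclasses import dataclass


@dataclass
class Turn:
    user_audio: bytes
    transcript: str = ""
    reply_text: str = ""
    reply_audio: bytes = b""


def transcribe(audio: bytes) -> str:
    # Placeholder for a dedicated STT call (e.g. gpt-4o-mini-transcribe).
    return "what are your opening hours?"


def think(transcript: str) -> str:
    # Placeholder for a text-only LLM call.
    return "We are open from 9am to 6pm."


def speak(text: str) -> bytes:
    # Placeholder for a dedicated TTS call (e.g. gpt-4o-mini-tts);
    # the encoded text stands in for synthesized audio.
    return text.encode("utf-8")


def run_turn(audio: bytes) -> Turn:
    """One conversational turn through the modular pipeline."""
    turn = Turn(user_audio=audio)
    turn.transcript = transcribe(turn.user_audio)
    turn.reply_text = think(turn.transcript)
    turn.reply_audio = speak(turn.reply_text)
    return turn
```

The point of the sketch is the shape, not the stubs: each stage is independently swappable, which is exactly what the native realtime lane trades away for latency.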
Google's Live API has become a serious current reference point in voice AI. Official Live API docs and the capability guide position it as a live multimodal lane, and the guide also covers native audio flows. This makes Gemini Live especially relevant when realtime, multimodal interaction is the core of the product rather than an add-on.
The current ElevenLabs product story has changed materially. Official sources now say that Conversational AI has been renamed to ElevenLabs Agents. This is a major shift in how ElevenLabs should be explained. It's no longer enough to describe ElevenLabs as a voice synthesis vendor. Current ElevenLabs is better described as a voice-rich agent platform, where voice quality is still a differentiator, but orchestration and deployment are now equally important.
Deepgram has also moved beyond the old framing of "fast STT + decent TTS".
Current docs show two important layers:

- Flux, described as conversational speech recognition built for voice agents;
- the Voice Agent API, which helps build interactive voice agents directly.

The Flux quickstart and related guides highlight its focus on conversational timing.
This matters because Deepgram's value now sits not only in raw transcription quality, but in conversational timing and voice-agent pipeline optimization.
The old distinction still matters:

- pipeline: STT -> LLM -> TTS;
- native speech-to-speech.

But in 2026 this binary is not enough.
A better practical taxonomy has three lanes.

1. Native realtime models. Examples:

- gpt-realtime;
- Gemini Live native audio flows.

Best when natural, full-duplex conversation with minimal latency is the core of the product.

2. Agent platforms. Examples:

- ElevenLabs Agents;
- Deepgram Voice Agent API.

Best when you want orchestration and deployment handled by the platform.

3. Dedicated speech components. Examples:

- gpt-4o-mini-transcribe;
- gpt-4o-mini-tts;
- Deepgram Nova / Aura.

Best when you want a modular pipeline and control over every stage.
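The taxonomy above can be turned into a toy selection helper. The criteria below are one possible reading of the tradeoffs in this article, not official vendor guidance.

```python
# Toy selector over the three lanes: native realtime models,
# agent platforms, and dedicated speech components.
# The decision criteria are this article's reading, not official guidance.

def pick_lane(needs_full_duplex: bool,
              wants_managed_orchestration: bool,
              needs_stage_control: bool) -> str:
    """Return the lane that best matches the stated requirements."""
    if needs_full_duplex:
        # Natural interruptions and the lowest conversational latency
        # favor a native realtime model (gpt-realtime, Gemini Live).
        return "native realtime"
    if wants_managed_orchestration:
        # Hosted agents (ElevenLabs Agents, Deepgram Voice Agent API)
        # trade flexibility for faster shipping.
        return "agent platform"
    if needs_stage_control:
        # Dedicated STT/TTS components keep every stage swappable.
        return "dedicated components"
    # Cheapest reasonable default for simple, non-realtime voice apps.
    return "dedicated components"
```

A real selection would also weigh price, data residency, and telephony needs (SIP support, for instance, points back to the realtime lane).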
In older voice-AI discussions, latency was treated as the whole story. Current product surfaces show a more useful practical truth: what users actually feel is turn-taking, interruptions, pause handling, and when the system decides to respond. This is why conversational timing, not raw latency alone, now separates good voice agents from merely fast ones.
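The "when the system decides to respond" part can be made concrete with a minimal end-of-turn detector. Real products (Deepgram Flux, for example) use learned models for this; the fixed silence threshold below is only an illustration of the mechanism.

```python
# Minimal end-of-turn sketch: respond only after the user has been
# silent long enough. A fixed energy threshold stands in for what
# production systems do with learned turn-detection models.

def end_of_turn(frame_energies, silence_threshold=0.02,
                min_silence_frames=5):
    """frame_energies: per-frame audio energy values.

    Returns True once the trailing run of silent frames is long
    enough to hand the turn over to the agent.
    """
    silent = 0
    for energy in frame_energies:
        if energy < silence_threshold:
            silent += 1
        else:
            # The user resumed speaking: reset, do not respond yet.
            silent = 0
    return silent >= min_silence_frames
```

Tuning `min_silence_frames` is exactly the tradeoff the text describes: too low and the agent interrupts mid-sentence, too high and it feels sluggish even if the model itself is fast.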
Current Voice AI especially fits conversational products such as voice agents and customer support. The strongest use cases are ones where turn-taking, interruption handling, and conversational timing carry real product value. Voice AI is still usually less suitable when those realtime dynamics add nothing and a text interface serves users just as well.
In other words, voice AI is powerful, but it still works best as an orchestrated system, not just a model with a microphone.
The most useful current framing is this:

- gpt-realtime = OpenAI's native realtime lane;
- Gemini Live = Google's live multimodal lane;
- ElevenLabs Agents = a voice-rich agent platform lane;
- Deepgram Flux / Voice Agent API = a timing- and pipeline-optimized speech infrastructure lane.

In other words, Voice AI in 2026 is a market of agent stacks and speech systems, not just a list of STT/TTS vendors.
1. What has aged worst in the old way of presenting Voice AI?
2. Which current official OpenAI realtime reference is most appropriate for voice agents?
3. Why does Deepgram Flux matter in the current voice landscape?