How OpenAI delivers low-latency voice AI at scale
- Sean-Der - 53685 seconds ago
Very grateful that OpenAI published the article and publicized their use of Pion [0], a library I work on. If you aren't familiar with WebRTC, it's a super fun space. I work on a book, WebRTC for the Curious [1], that details how it works.
- legohead - 53718 seconds ago
The low latency is more of a pain point than a benefit, the way they have implemented it. When trying to have a casual conversation, we humans naturally pause, and GPT takes this as a sign that you are "done" and starts blabbing away.
I also struggle to find the word I want as I've gotten older and slower, and this fast-voice GPT just ends up frustrating me more than helping. I have to sit there and think out the whole sentence in my head before I say anything, which is not very natural.
- Lucasoato - 50428 seconds ago
Wait a minute... I'm genuinely happy that they are sharing this, but keep in mind that OpenAI's realtime audio models are sadly still stuck at the 4o family in terms of capabilities. I still find them so useful; it's a pity that there's no real competitor in this segment. Having the experience of a real conversation has helped me so much in expressing ideas and concepts.
Still, it's worth keeping in mind that these are no longer frontier models, unlike when they were released.
(Please, Sam, if you read this, release the new realtime audio models.)
- maxglute - 1032 seconds ago
> feels natural if conversation moves at the speed of speech
As someone used to podcasts at 3x speed and SAPI text-to-speech at much higher rates, listening to AI at human speech speed is a chore.
- thimabi - 56633 seconds ago
> Voice AI only feels natural if conversation moves at the speed of speech […] At OpenAI’s scale, that translates into three concrete requirements: Global reach for more than 900 million weekly active users
Surely that number refers to the total users of ChatGPT overall, and the fraction of those who use voice features is considerably smaller, is it not?
That's the kind of thing that influences business decisions, like knowing how much hardware and software optimization to throw at a problem.
- Aeroi - 56333 seconds ago
If anyone is looking to get into this, pipecat is a great open-source repo and community: https://github.com/pipecat-ai/pipecat
- amirathi - 22680 seconds ago
I find OpenAI's speech-to-text model the best of the lot. It handles my and my 5-year-old daughter's Indian accents pretty well.
I wonder if they run the STT model's output through the current model (the one we're chatting with) as a final pass, since the text seems well aligned with the current conversation context.
For long prompts, I often speak to the OpenAI web/app and copy-paste the text into Claude / Gemini :)
- didibus - 54031 seconds ago
I wouldn't mind waiting longer for answers that go through a better model with more thinking, as long as it has good support for interrupting, doesn't start answering as soon as I pause for one second, and is smart about knowing when I'm done speaking.
- zerop - 6119 seconds ago
I have used voice mode on ChatGPT, Gemini, and Grok, as I use it while driving. OpenAI's is the best: natural conversation and smarter, more meaningful replies.
- tracyhenry - 21791 seconds ago
After all this, I still feel their voice AI interrupts quite a lot, especially when I pause for just 0.5 seconds. Interestingly, when I tell it to interrupt less, it seems to get better.
- deferredgrant - 22945 seconds ago
People are very sensitive to timing in conversation. Even if the words are good, a slightly wrong pause or interruption can make the system feel much less intelligent.
- vjay15 - 16425 seconds ago
This is such a good write-up; WebRTC is one of the coolest things ever! It's kind of genius to use the VIP approach. An SFU is also pretty scalable, but now they don't even have to do that.
- qrush - 53286 seconds ago
Am I reading this right that OpenAI is no longer using LiveKit for WebRTC/audio?
- logickkk1 - 51359 seconds ago
IMO this probably isn't just about latency. Keeping people in voice gives them training data that text never will. Is that why they were fine going with transceivers over an SFU and mostly ignoring multi-party?
- shevy-java - 5867 seconds ago
I don't like AI in general, and on YouTube there are so many horrible videos with AI voices. Having said that, I did notice AI has actually worked for some hobbyist-maintained games, for the most part. Example: BG2EE (Baldur's Gate 2 Enhanced Edition). Yes, this is a forgotten game, and I actually keep background music on rather than listen to the dialogue, save for testing it, but for the most part it worked here. So for resource-poor hobbyists, AI is actually not totally useless. On YouTube I find only horribly bad examples. I don't watch any AI-involved videos (when I can spot them; there is so much fakery on YouTube these days, and Google does not realise how AI is driving away many long-time users and visitors).
- hnav - 45530 seconds ago
RFC 9297 support can't come quickly enough in browsers. It would obviate having to deal with WebRTC in a client-server scenario.
- charisma123 - 53915 seconds ago
If a transceiver crashes during a stream, how is the active session recovered? Does the system automatically re-establish the context in a new WebRTC session?
- furyofantares - 54565 seconds ago
> Global reach for more than 900 million weekly active users
lol, we definitely didn't need to know there are 900M weekly users for this post. I mean, yes, there are a lot of users and they serve them globally; that's relevant. But this is just pulling out your biggest stat because you can. How many voice users you have would actually be relevant and interesting but, to baselessly speculate on motivation here, might be a number that doesn't add as much fuel to an upcoming IPO as reminding people that you're almost at a billion users does.
- whateveracct - 20621 seconds ago
Why is the "How" included here? It is often removed.
- hiroakiaizawa - 21315 seconds ago
Interesting. What are the main latency bottlenecks in practice?
- anzerarkin - 56786 seconds ago
I hate the voice AI, though; it's so much dumber.
- CrzyLngPwd - 53471 seconds ago
It's bad enough having to speed-read the waffle of its written answers; even when told to be concise, the thought of having to listen to it waffle on in its smarmy, sycophantic fashion makes me want to reach for the sick bag.
- doctorpangloss - 55468 seconds ago
What I learned from making a WebRTC + Kubernetes game streaming product:
- OpenAI is wrong. Almost all of the issues they described are issues with libwebrtc, not with WebRTC, Kubernetes, network architecture, etc. The clue was when they said "the conventional one-port-per-session WebRTC model."
- There are no alternatives worth trying. Everything else open source in the ecosystem, like Pion, coturn, and STUNner, is too immature.
- libwebrtc is the only game in town.
- They haven't discovered libwebrtc feature flags or how libwebrtc works with candidates, which directly fix a bunch of the latency issues they are discovering. The right feature flag can instantly reduce latency for free, compared to paying for Twilio Network Traversal-style solutions.
- 99% of low-latency voice end users will be in a network situation that can eliminate relays, transceivers, etc. It is totally first class on Kubernetes, but you have to know something :)
This is the first time I'm experiencing Gell-Mann amnesia with OpenAI! Look, those guys are brilliant, but there is hardly anyone in the world doing this stuff correctly.
- tom1IIIl1iIL - 50997 seconds ago
I think it's better to join some kind of club if you want to make friends?
- AIorNot - 56816 seconds ago
So is the answer WebRTC + Kubernetes?
- devopsengine - 35701 seconds ago
Inspired.
- rvz - 53211 seconds ago
OpenAI uses Go for the networking implementation of the relays and services, which makes a ton of sense, instead of something as immature as TypeScript / Node or whatever.
Yet another reason not to consider anything like that for low-latency networking. Go (or even Rust or C++) is unmatched for this use case.
- flakiness - 54685 seconds ago
Should I or shouldn't I be glad to see zero mention of Codex?
- jonahs197 - 52886 seconds ago
Who cares? Their company is dying.
- cdrnsf - 56643 seconds ago
It's missing the part where they explain how they obtained the training data for their voice AI.