
Posted by atgctg 1 day ago

OpenAI’s WebRTC problem (moq.dev)
344 points | 86 comments
sam1r 11 hours ago|
>> ... I say hi to <strike>Scarlett Johansson</strike>

Had a nice chuckle.

molszanski 9 hours ago||
I remember using webrtc data channel for p2p video. Browser to browser UDP is neat :) fun memories. Thank you for the read
spongebobstoes 11 hours ago||
this misses a few key things but hits on many others

webrtc is a bad protocol, without a doubt. I do like websockets as an easy alternative, but you do need to reinvent decent portions of webrtc as a result
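(The "reinvent decent portions of webrtc" part is mostly the media framing RTP gives you for free: sequence numbers, timestamps, payload types. A minimal sketch of the bookkeeping you end up bolting onto each WebSocket message — the field layout here is illustrative, not any real wire format:)

```python
import struct

# Illustrative RTP-style frame header: the metadata you reinvent once you
# stream audio over a plain WebSocket/TCP channel instead of WebRTC.
HEADER = struct.Struct("!HIB")  # uint16 seq, uint32 timestamp, uint8 payload type

def pack_frame(seq: int, timestamp: int, ptype: int, payload: bytes) -> bytes:
    # Wrap masks keep seq/timestamp in range, mirroring RTP's wraparound.
    return HEADER.pack(seq & 0xFFFF, timestamp & 0xFFFFFFFF, ptype) + payload

def unpack_frame(frame: bytes) -> tuple[int, int, int, bytes]:
    seq, ts, pt = HEADER.unpack_from(frame)
    return seq, ts, pt, frame[HEADER.size:]

frame = pack_frame(seq=7, timestamp=48000, ptype=111, payload=b"\x01\x02")
print(unpack_frame(frame))  # (7, 48000, 111, b'\x01\x02')
```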

I like the idea of MoQ but it's not widely used. probably worth experimenting with, especially as video enters the chat

> and then a GPU pretends to talk to you via text-to-speech

OpenAI is speech-to-speech, there is no TTS in voice mode

> It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection

signalling can be done long ahead of time, though I don't see this mentioned in the OpenAI blog. I also saw some new webrtc extensions that should reduce setup time further

ultimately though, it comes down to

> It’s not like LLMs are particularly responsive anyway

I expect to see a shift in how S2S models work to be lower latency like the new voice API models that OpenAI announced

to be fair, the new models were released the day after this MoQ blog was published

Terretta 9 hours ago|
> OpenAI is speech-to-speech, there is no TTS in voice mode

Which results in the interesting situation where the transcript isn't what was said:

Q: Why do the voice transcripts sometimes not match the conversation I had?

A: Voice conversations are inherently multimodal, allowing for direct audio exchange between you and the model. As a result, when this audio is transcribed, the transcription might not always align perfectly with the original conversation.

keizo 11 hours ago||
interesting read, albeit over my head. i spent half of yesterday comparing Gemini Live (websockets) vs gpt-realtime-2; gpt is super good and seemingly more robust, but Gemini connects faster.
perryizgr8 3 hours ago||
How is OpenAI Voice mode any different than a Whatsapp call? Ignoring the part that there is a GPU on the other side instead of a human. But what is the technical challenge in the voice call portion? It seems like that has been a solved problem for a long time now.
giancarlostoro 12 hours ago||
Probably because WebTransport is the lesser known alternative to WebRTC.
est 12 hours ago|
WebTransport requires some specific server setup.

Cloudflare doesn't support WebTransport well.

brcmthrowaway 10 hours ago||
This is interesting. Does niche knowledge in this area command $1mn salary?
hnav 9 hours ago||
It can, in general knowing how to shuffle packets according to RFCs is a pretty decent gig. Pretty much every hyperscaler ends up building various LBs and the learning curve is too steep to just toss randos at it unsupervised, but at the same time it's not necessarily inventing anything new most of the time.
ec109685 9 hours ago||
> “Here’s a million dollars to implement WebRTC for the fourth time”

“Hell no”

> “Umm…”

yugoslavia4ever 1 hour ago||
[dead]
Giefo6ah 12 hours ago||
Yet another victim of IPv4, and you still find countless detractors of IPv6 on every thread where it's mentioned.
spongebobstoes 11 hours ago||
IPv4 support is necessary, but IPv6 isn't
whattheheckheck 11 hours ago||
How would ipv6 handle it
tardedmeme 11 hours ago||
You just send packets to the other party's address and they send packets back to yours. Both parties know their address and you don't need a relay in the middle.
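(That's the whole trick — with both endpoints directly addressable, it's just datagrams back and forth, no TURN relay. A minimal sketch over IPv6 loopback, which stands in here for globally routable addresses; assumes the host has IPv6 enabled:)

```python
import socket

# Two UDP endpoints that each know the other's IPv6 address exchange
# packets directly -- no relay in the middle. ::1 stands in for real,
# globally routable v6 addresses.
a = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
b = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
a.bind(("::1", 0))  # port 0: let the OS pick a free port
b.bind(("::1", 0))

a.sendto(b"hello", b.getsockname()[:2])  # send straight to b's address
msg, addr = b.recvfrom(1024)
b.sendto(b"world", addr)                 # reply straight to the sender
reply, _ = a.recvfrom(1024)
print(msg, reply)  # b'hello' b'world'
a.close(); b.close()
```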
pocksuppet 9 hours ago||
It's not really relevant in this case since one endpoint is a massive server farm.
hnav 9 hours ago||
It is because most of their complexity is in routing packets. With IPv6 you can just have the thing handling the conversation directly addressable by the client. The last 64 bits of a v6 address let you have billions of instances in a region.
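(The arithmetic on that last point: a single /64 subnet leaves 64 bits of interface identifier, which is already far past "billions":)

```python
# The low 64 bits of an IPv6 address (the interface identifier within one
# /64 subnet) can address 2**64 hosts directly.
instances_per_subnet = 2 ** 64
print(instances_per_subnet)  # 18446744073709551616, roughly 1.8e19
```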
coalstartprob 11 hours ago|
[dead]