Posted by code_brian 1/14/2026

Show HN: Sparrow-1 – Audio-native model for human-level turn-taking without ASR (www.tavus.io)
For the past year at Tavus, I've been working to rethink how AI manages timing in conversation, and I've spent a lot of time listening to conversations. Today we're announcing the release of Sparrow-1, the most advanced conversational flow model in the world.

Some technical details:

- Predicts conversational floor ownership, not speech endpoints

- Audio-native streaming model, no ASR dependency

- Human-timed responses without silence-based delays

- Zero interruptions at sub-100ms median latency

- Beats all existing baseline models in real-world turn-taking benchmarks

I wrote more about the work here: https://www.tavus.io/post/sparrow-1-human-level-conversation...
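
To make the floor-ownership idea concrete, here's a minimal sketch of what consuming such a stream could look like. The event shape, field names, and threshold are hypothetical, not our actual API:

    # Hypothetical client loop: act on streaming floor-ownership
    # probabilities instead of waiting for a speech endpoint.
    async def respond_when_floor_yields(stream, speak, threshold=0.85):
        # stream yields events like {"t_ms": 1234, "p_floor_yield": 0.91}
        async for event in stream:
            if event["p_floor_yield"] >= threshold:
                await speak()  # triggered by the yield prediction, not by silence
                return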

123 points | 49 comments
allan_s 1/15/2026|
How does it compare with https://github.com/KoljaB/RealtimeVoiceChat, which is absent from the benchmark?
sippeangelo 1/15/2026||
That's not a turn-taking model, it's just a silence detection Python script based on whatever text comes out of Whisper...
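
For context, silence-based endpointing like that usually boils down to something like this (a generic sketch, not that repo's actual code):

    import time

    def wait_for_endpoint(get_transcript, silence_s=0.8, poll_s=0.05):
        # Naive turn detection: assume the user is done once the ASR
        # transcript stops changing for silence_s seconds. Breaks on
        # mid-sentence pauses, which is the case turn-taking models target.
        last_text, last_change = get_transcript(), time.monotonic()
        while time.monotonic() - last_change < silence_s:
            time.sleep(poll_s)
            text = get_transcript()
            if text != last_text:
                last_text, last_change = text, time.monotonic()
        return last_text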
bpanahij 1/15/2026||
I haven’t tried that one yet, I’ll check it out.
orliesaurus 1/15/2026||
Literally no way to sign up to try. I put in my email and password and it dropped me onto some waitlist, despite the video saying I could try the model today. That's what makes me mad about these kinds of releases: the marketing and the product don't talk to each other.
qfavret 1/15/2026|
Try signing up for the API platform on the site; you can access it there.
sourcetms 1/15/2026||
How do I try the demo for Sparrow-1? What is pricing like?
bpanahij 1/15/2026|
You can try Sparrow-1 with any of our PALs, or by signing up for a developer account.
ttul 1/15/2026||
I tried talking to Claude today. What a nightmare. It constantly interrupts you. I don’t mind if Claude wants to spend ten seconds thinking about its reply, but at least let ME finish my thought. Without decent turn-taking, the AI seems impolite and it’s just an icky experience. I hope tech like this gets widely distributed soon because there are so many situations in which I would love to talk with a model. If only it worked.
mavamaarten 1/15/2026||
Agreed. English is not my native language. And I do speak it well, it's just that sometimes I need a second to think mid-sentence. None of the live chat models out there handle this well. Claude just starts answering before I've even had the chance to finish a sentence.
Tostino 1/15/2026||
English is my native language, and I still have this problem all the time with voice models.
sigmoid10 1/15/2026|||
Anthropic doesn't have any realtime multimodal audio models available, they just use STT and TTS models slapped on top of Claude. So they are currently the worst provider if you actually want to use voice communication.
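
A cascaded setup is roughly this shape, which is why pauses get misread: the turn decision is made by the endpointing step before the LLM ever sees anything (generic sketch, not Anthropic's actual implementation):

    def cascaded_voice_turn(audio, stt, llm, tts):
        # Turn-taking lives entirely in the STT/VAD endpointing step; the
        # LLM gets text only, so prosody and timing cues never reach it and
        # a thinking pause looks identical to a finished turn.
        text = stt.transcribe(audio)   # endpointing decided here
        reply = llm.complete(text)     # text-only, no audio cues survive
        return tts.synthesize(reply)   # all three stages stack up latency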
code_brian 1/15/2026||
It's unfortunate though, because Anthropic's LLMs and ecosystem are the best IMHO. Tavus (we) and Anthropic should form a partnership.
sigmoid10 1/16/2026||
I think Anthropic currently has a slight edge for coding, but this changes constantly with every new model. For business applications, where tool calling and multi-modality matter a lot, OpenAI is and always has been superior. Only recently has Google started to put some small dents in their moat. OpenAI also has the best platform, less because it is good and more because Google and Anthropic are truly dismal in every regard when it comes to devx. I also feel like Google has accrued an edge in hard-core science, but that is just a personal feeling and I haven't seen any hard data on it yet.
MrDunham 1/15/2026|||
I love Anthropic's models but their realtime voice is absolutely terrible. Every time I use it there is at least once that I curse at it for interrupting me.

My main use case for OpenAI/ChatGPT at this point is realtime voice chats.

OpenAI has done a pretty great job w/ realtime (their realtime API is pretty fantastic out of the box... not perfect, but pretty fantastic and dead simple setup). I can have what feels like a legitimate conversation with AI and it's downright magical feeling.

That said, the output is created by OpenAI models so it's... not my favorite.

I sometimes use ChatGPT realtime to think through/work through a problem/idea, have it create a detailed summary, then upload that summary to Claude to let 4.5 Opus rewrite/audit and come up with a better final output.

code_brian 1/15/2026||
I use Claude Code for everything, and I love Anthropic's models. I don't know why, but it wasn't until reading this that I realized: I can use Sparrow-1 with Anthropic's models within CVI. Adding this to my todo list.
Taikonerd 1/15/2026|||
Agreed. I tried using Gemini's voice interface in their app. It went like this:

===

ME: "OK, so, I have a question about the economics of medicine. Uh..." [pauses to gather thoughts to ask question]

GEMINI: "Sure! Medical economics is the field of..."

===

And it's aggravated by the fact that all the LLMs love to give you page-long responses before it's your turn to talk again!

butlike 1/15/2026||
Am I not allowed to cut you off if you're ramble-y and incoherent?
BizarroLand 1/15/2026||
It's rude if you're a human, and entirely unacceptable if you're a computer.
code_brian 1/15/2026||
The one thing that really surprised me, the thing I learned that's affected my conversational abilities the most: turn-taking in conversation is a negotiation; there are no set rules. There are protocols:

- bids

- holds / stays

- implications (semantic / prosodic)

But then the actual flow of the best conversations is deeply semantic, and the rules are very much a "dance", a negotiation between partners.
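
A toy way to label those moves in code (illustrative only, not how Sparrow-1 represents them internally):

    from enum import Enum, auto

    class TurnSignal(Enum):
        # Illustrative labels for turn-taking moves, not Sparrow-1 internals.
        BID = auto()          # trying to take the floor ("so, um...", an inbreath)
        HOLD = auto()         # keeping the floor through a pause (flat pitch, "and...")
        YIELD = auto()        # ceding the floor (falling pitch, completed clause)
        IMPLICATION = auto()  # semantic/prosodic hint that a turn is winding down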

BizarroLand 1/15/2026||
That's an interesting way to think about it, I like that.

It also implies that being the person who has something to say but is unable to get into the conversation due to following the conversational semantics is akin to going to a dance in your nice clothes but not being able to find a dance partner.

code_brian 1/16/2026||
Yeah, I can relate to that. Maybe it's also because you're too shy to ask someone to dance. I think I learned that lesson: just ask, and be unafraid to fail. Things tend to work themselves out. Much of this is experimentation, and I think our models need to be open to that, which is one cool thing about Sparrow-1: it's a meta-in-context learner. That means that when it tries and fails, or you try and fail, it learns at runtime to adapt.
mentalgear 1/15/2026||
Metric    | Sparrow-1
Precision | 100%
Recall    | 100%

Common ...

bpanahij 1/15/2026||
The response timing chart in the blog post shows that even with perfect precision/recall, Sparrow-1 also has the fastest true-positive response times.

The turn-taking models were evaluated in a controlled environment with no additional cascaded steps (LLM, TTS, Phx). This matters for an apples-to-apples comparison, without the rest of the pipeline's variability influencing the measurements.

The video conversation examples are Sparrow-1 within the full pipeline. Those responses aren't as fast as Sparrow itself because the LLM, TTS, facial rendering, and network transport also take time; without Sparrow-1 they would be slower. Sparrow-1 is what makes the responses as fast as they are, and with a faster CVI pipeline configuration responses can come back in as little as 430ms in my testing.
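
As a back-of-envelope latency budget (the per-stage split below is made up for illustration; only the sub-100ms turn-taking figure and the 430ms total come from our measurements):

    # Hypothetical stage breakdown for a fast CVI-style pipeline.
    budget_ms = {
        "turn_taking (Sparrow-1)": 100,  # ~median upper bound from our benchmarks
        "LLM first token": 180,          # illustrative placeholder
        "TTS first audio": 90,           # illustrative placeholder
        "rendering + network": 60,       # illustrative placeholder
    }
    print(sum(budget_ms.values()), "ms")  # -> 430 ms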

reubenmorais 1/15/2026||
If you watch the demo video you can see how they would get this: the model is not aggressive enough. While it doesn't cut you off, which is nice, it also always waits an uncanny amount of time to chime in.
oersted 1/15/2026||
That should lead to a low recall: too many false negatives. I wonder how they are calculating it.
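
For reference, the standard definitions; a model that never interrupts but chimes in late or not at all piles up false negatives, which should pull recall below 100% (generic sketch):

    def precision_recall(tp, fp, fn):
        # False positives = interruptions; false negatives = missed or
        # overly late turn-takes. A too-cautious model keeps fp at 0 but
        # grows fn, so precision stays perfect while recall falls.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # Never interrupting but missing 20 of 100 turn ends:
    print(precision_recall(tp=80, fp=0, fn=20))  # (1.0, 0.8)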
vpribish 1/15/2026|
What is "ASR" - automatic speech recognition?
code_brian 1/16/2026|
Ah good question: Yes, ASR stands for Automatic Speech Recognition.