Posted by code_brian 1/14/2026
Some technical details:
- Predicts conversational floor ownership, not speech endpoints
- Audio-native streaming model, no ASR dependency
- Human-timed responses without silence-based delays
- Zero interruptions at sub-100ms median latency
- In benchmarks, Sparrow-1 outperforms all existing models on real-world turn-taking baselines
I wrote more about the work here: https://www.tavus.io/post/sparrow-1-human-level-conversation...
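To give a rough sense of how floor-ownership prediction differs from silence-based endpointing, here's a minimal sketch. The `FloorState` fields and the per-frame loop below are purely illustrative, not our actual API:

```python
# Illustrative only: FloorState and its fields are made-up names for this
# sketch, not Sparrow's real interface.
from dataclasses import dataclass

@dataclass
class FloorState:
    user_holds_floor: float   # P(user is still speaking / will continue)
    agent_may_speak: float    # P(the conversational floor has passed to the agent)

def should_respond(frame_states, threshold=0.8, consecutive=3):
    """Trigger a response as soon as a few consecutive audio frames agree the
    floor is ours, instead of waiting out a fixed silence timer."""
    streak = 0
    for state in frame_states:          # one FloorState per short audio frame
        streak = streak + 1 if state.agent_may_speak > threshold else 0
        if streak >= consecutive:
            return True                 # respond now, no silence-based delay
    return False
```

The point is that the response decision is driven by a prediction about who holds the floor, not by counting milliseconds of silence.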
dreaming
And can you share some information about the model size and FLOPS?
Could Sparrow instead be used to produce high-quality transcriptions that incorporate non-verbal cues?
Or even use Sparrow AND another existing transcription/ASR system together, so the transcript gets augmented with non-verbal cues?
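Something like this is what I'm imagining, i.e. merging a timestamped ASR transcript with timestamped cue events (all the names and data below are made up, just to show the merge):

```python
# Hypothetical: the word/cue tuples are invented to illustrate merging an ASR
# transcript with non-verbal cue events by timestamp.
def merge_transcript_and_cues(words, cues):
    """words: [(start_sec, text)] from an ASR system
    cues:  [(start_sec, label)] e.g. laughter, long pause, backchannel
    Returns a single time-ordered transcript with cues inlined as [LABEL] tags."""
    events = [(t, w) for t, w in words] + [(t, f"[{label.upper()}]") for t, label in cues]
    events.sort(key=lambda e: e[0])
    return " ".join(token for _, token in events)

# merge_transcript_and_cues([(0.0, "so"), (0.6, "anyway")], [(0.3, "laugh")])
# -> 'so [LAUGH] anyway'
```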
In both conversational approaches, the AI can respond with simple acknowledgements. When prompted by the user, it could go into longer discussions and explanations.
It might be nice for the AI to quickly confirm it hears me and to give me subtle cues that it's listening: verbal backchannels (“yeah”) and non-verbal ones (“mhmm”). I can imagine a developer assistant that feels more like working with another dev than working with a computer.
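Roughly the policy I have in mind, with made-up confidence fields standing in for whatever the turn-taking model actually exposes:

```python
# Hypothetical backchannel policy: the floor-confidence arguments are invented
# names, just to sketch the behaviour.
import random

BACKCHANNELS = ["mhmm", "yeah", "right"]

def react(user_holds_floor, agent_may_speak, secs_since_last_backchannel):
    if agent_may_speak > 0.8:
        return "TAKE_TURN"                      # give a full response
    if user_holds_floor > 0.9 and secs_since_last_backchannel > 4:
        return random.choice(BACKCHANNELS)      # subtle "I'm listening" cue
    return None                                 # keep quiet and keep listening
```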
That being said, there is room for all of these modes, sometimes simultaneously and sometimes shifting between them. A lot of the time I just don’t want to talk at all.