Posted by code_brian 1/14/2026
Some technical details:
- Predicts conversational floor ownership, not speech endpoints
- Audio-native streaming model, no ASR dependency
- Human-timed responses without silence-based delays
- Zero interruptions at sub-100ms median latency
- In benchmarks, Sparrow-1 outperforms all existing models on real-world turn-taking baselines
I wrote more about the work here: https://www.tavus.io/post/sparrow-1-human-level-conversation...
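To give a rough sense of how floor-ownership prediction differs from silence-based endpointing, here's a minimal sketch. The `FloorState` fields and the per-frame loop below are purely illustrative, not our actual API:

```python
# Illustrative only: FloorState and its fields are made-up names for this
# sketch, not Sparrow's real interface.
from dataclasses import dataclass

@dataclass
class FloorState:
    user_holds_floor: float   # P(user is still speaking / will continue)
    agent_may_speak: float    # P(the conversational floor has passed to the agent)

def should_respond(frame_states, threshold=0.8, consecutive=3):
    """Trigger a response as soon as a few consecutive audio frames agree the
    floor is ours, instead of waiting out a fixed silence timer."""
    streak = 0
    for state in frame_states:          # one FloorState per short audio frame
        streak = streak + 1 if state.agent_may_speak > threshold else 0
        if streak >= consecutive:
            return True                 # respond now, no silence-based delay
    return False
```

The point is that the response decision is driven by a prediction about who holds the floor, not by counting milliseconds of silence.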
dreaming
And can you share some information about the model size and FLOPS?
Could Sparrow instead be used to produce high-quality transcriptions that incorporate non-verbal cues?
Or even use Sparrow AND another existing transcription/ASR system together, so the transcript gets augmented with non-verbal cues?
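Something like this is what I'm imagining, i.e. merging a timestamped ASR transcript with timestamped cue events (all the names and data below are made up, just to show the merge):

```python
# Hypothetical: the word/cue tuples are invented to illustrate merging an ASR
# transcript with non-verbal cue events by timestamp.
def merge_transcript_and_cues(words, cues):
    """words: [(start_sec, text)] from an ASR system
    cues:  [(start_sec, label)] e.g. laughter, long pause, backchannel
    Returns a single time-ordered transcript with cues inlined as [LABEL] tags."""
    events = [(t, w) for t, w in words] + [(t, f"[{label.upper()}]") for t, label in cues]
    events.sort(key=lambda e: e[0])
    return " ".join(token for _, token in events)

# merge_transcript_and_cues([(0.0, "so"), (0.6, "anyway")], [(0.3, "laugh")])
# -> 'so [LAUGH] anyway'
```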
In both conversational approaches, the AI can respond with simple acknowledgements. When prompted by the user, it could go into longer discussions and explanations.
It might be nice for the AI to quickly confirm it hears me and to give me subtle cues that it's listening: verbal backchannels (“yeah”) and non-verbal ones (“mhmm”). I can imagine a developer assistant that feels more like working with another dev than working with a computer.
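Roughly the policy I have in mind, with made-up confidence fields standing in for whatever the turn-taking model actually exposes:

```python
# Hypothetical backchannel policy: the floor-confidence arguments are invented
# names, just to sketch the behaviour.
import random

BACKCHANNELS = ["mhmm", "yeah", "right"]

def react(user_holds_floor, agent_may_speak, secs_since_last_backchannel):
    if agent_may_speak > 0.8:
        return "TAKE_TURN"                      # give a full response
    if user_holds_floor > 0.9 and secs_since_last_backchannel > 4:
        return random.choice(BACKCHANNELS)      # subtle "I'm listening" cue
    return None                                 # keep quiet and keep listening
```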
That being said, there is room for all of these modes, sometimes simultaneously and sometimes shifting between them. A lot of the time I just don’t want to talk at all.