
Posted by ipotapov 15 hours ago

Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift (blog.ivan.digital)
345 points | 112 comments
jwr 14 hours ago|
As a heavy user of MacWhisper (for dictation), I'm looking forward to better speech-to-text models. MacWhisper with Whisper Large v3 Turbo model works fine, but latency adds up quickly, especially if you use online LLMs for post-processing (and it really improves things a lot).
atiorh 3 hours ago||
MacWhisper supports models like Parakeet v2 that are 10x faster with the same accuracy (it was the first app to ship them, 6-9 months ago). Have you tried those?
kavith 13 hours ago|||
Not sure if this will help but I've set up Handy [1] with Parakeet V2 for STT and gpt-oss-120b on Cerebras [2] for post-processing and I'm happy with the performance of this setup!

[1] https://handy.computer/ [2] https://www.cerebras.ai/

jiehong 13 hours ago||
parakeet v3 is also nice, and better for most languages.
vunderba 6 hours ago||
The latest build of Handy actually supports Parakeet V3 (among other models) under the covers. Agreed that it's a very solid multilingual model.

https://github.com/cjpais/Handy

regularfry 14 hours ago|||
If you haven't already, give the models that Handy supports a try. They're not Whisper-large quality, but some of them are very fast.
kermitime 12 hours ago|||
The Parakeet TDT models that are CoreML-optimized by FluidAudio are hands down the fastest local models I've tried; worth checking out!

(offloading to the NPU is where the edge is)

https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v2-c...

https://github.com/FluidInference/FluidAudio

The devs are responsive, active, and friendly on their Discord too. You'll find discussions on all the latest developments with VAD, TTS, EOU, etc.

smcleod 11 hours ago||
Handy with parakeet v2 is excellent
sgt 14 hours ago||
My problem with STT is that I've been struggling to find models that support less common use cases, like mixed bilingual Spanish/English speech or non-ideal audio conditions. Still haven't found anything great, to be honest.
spockz 14 hours ago||
Regarding the less-than-ideal audio conditions: there are already models with impressive noise cancellation, like DeepFilterNet (https://github.com/Rikorose/DeepFilterNet). If you chain one in series before the STT model, maybe you get better results?
pain_perdu 14 hours ago||
Hi. Our model at http://www.Gradium.ai has no problem with code-switching between Spanish and English, and we have excellent background noise suppression. Please feel free to give it a try and let me know what you think!
sgt 13 hours ago||
Looks interesting! How did you train it and how many hours of material did you use?
dubeye 12 hours ago||
It doesn't feel like speech recognition has been improving at the same rate as other generative AI. It had a big jump down to about 6% WER a year or two ago, but it seems to have plateaued since. Am I just using the wrong model? Or is human-level error rate, which I estimate to be about 5%, some kind of limit?
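For reference, WER is just word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch in plain Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count.

    Assumes a non-empty reference; real evaluations also normalize
    case and punctuation before scoring.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") over 6 reference words: 1/6 ~ 0.167
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Note that "5% WER" means one word in twenty is wrong, so on a 200-word dictation you'd still expect roughly ten errors; whether that's a floor depends heavily on audio quality and domain.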
Krisso 10 hours ago||
Awesome, but given the Apple Silicon install base and typical configurations, how does this fare on an M1 with 8GB of total RAM? I'd imagine this makes running another LLM for tool calls and inference tough to impossible.
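Back-of-envelope math (my numbers, not from the article) suggests the concern is justified. Weight footprint alone for a 7B-parameter model at common quantization levels, ignoring the KV cache, activations, and the OS's share of unified memory:

```python
# Rough weight footprint of a 7B-parameter model at common
# quantization levels. Ignores KV cache, activations, and the
# OS share of unified memory, so real usage is higher.
PARAMS = 7e9
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for fmt, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{fmt}: ~{gb:.1f} GB of weights")
# fp16: ~14.0 GB, int8: ~7.0 GB, int4: ~3.5 GB
```

So on an 8GB machine only a 4-bit quantization plausibly fits, and even then there's little headroom left for a second model running tool calls.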
ruhith 8 hours ago||
Cool demo but without tool calling this is basically a fast parrot. The traditional pipeline is slower but at least you can plug in a real brain.
mrtesthah 6 hours ago|
Voice-to-voice models can call tools; no need for TTS.
michelsedgh 14 hours ago||
It's really cool, but for real-life use cases I think it lacks the ability to emit a silent text stream alongside the audio, for example JSON, so that as it's talking it can run commands for you. Right now it can only listen and talk back, which limits what you can build with it a lot.
WeaselsWin 14 hours ago||
This full-duplex speech thing has been used by the big players for quite a long time already in whatever "conversation mode" their apps offer, right? Those modes always seemed fast enough that they surely aren't going through the STT->LLM->TTS pipeline?
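The speed intuition makes sense: a cascaded pipeline pays each stage's latency in sequence before the user hears anything. A sketch with illustrative, made-up stage budgets (the actual numbers vary a lot by vendor and model):

```python
# Illustrative (made-up) per-turn latency budget in milliseconds for
# a cascaded STT -> LLM -> TTS pipeline vs. one speech-to-speech model.
cascaded = {
    "VAD / endpointing": 300,  # waiting to decide the user stopped talking
    "STT": 250,                # transcribe the finished utterance
    "LLM first token": 400,    # time to first response token
    "TTS first audio": 200,    # synthesize the first audio chunk
}
full_duplex_first_audio = 250  # one model, streaming audio in and out

print("cascaded:", sum(cascaded.values()), "ms to first audio")
print("full duplex:", full_duplex_first_audio, "ms to first audio")
```

A full-duplex model also skips endpointing entirely, since it can start responding (or back-channel) while the user is still speaking, which is what makes the overlap in Moshi/PersonaPlex-style models possible.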
ilaksh 11 hours ago||
There are OpenAI gpt-realtime and Gemini Flash (or whatever), which are great, but they don't seem to reach quite the same level of overlapping, realistic full duplex as Moshi/PersonaPlex.
Tepix 14 hours ago||
Yes, OpenAI rolled out their advanced voice mode in September 2024. Since then it has recognized your emotions, tone of voice, etc.
ricardobeat 10 hours ago||
No mention of tool use. If the model cannot emit both text and audio at the same time, to enable tools, it’s not really useful at all for voice agents.
Serenacula 14 hours ago||
This is really cool. I think what I really wanna see though is a full multimodal Text and Speech model, that can dynamically handle tasks like looking up facts or using text-based tools while maintaining the conversation with you.
sigmoid10 14 hours ago|
OpenAI has been offering this for a while now, featuring text and raw audio input+output and even function calling. Google and xAI also offer similar models by now, only Anthropic still relies on TTS/STT engine intermediates. Unfortunately the open-weight front is still lagging behind on this kind of model.
nerdsniper 13 hours ago|
Do we have real-time (or close-enough) face-to-face models as well? I'd like to gracefully prove a point to my boss that some of our IAM procedures need to be updated.
ilaksh 11 hours ago|
tavus.io
nerdsniper 11 hours ago||
Hmm. Would this let me replace my own face in a live videoconferencing session? It seems like it's more of a video chatbot than a v-tuber style overlay.
ilaksh 10 hours ago||
Had no idea that was what you were asking for. Search for "Zoom face filter", "OBS face filter", "OBS deepfake live", etc.