
Posted by ipotapov 17 hours ago

Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift (blog.ivan.digital)
352 points | 113 comments
Tepix 16 hours ago||
It's cool tech and I will give it a try. I will probably make an 8-bit quant instead of the 4-bit one, which should be easy with the provided script.
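For readers unfamiliar with what "making an 8-bit quant" involves: the core idea is mapping float weights onto a small integer grid with a shared scale. The snippet below is only an illustrative sketch of symmetric per-tensor int8 quantization, not the actual script shipped with PersonaPlex (which I haven't seen); function names here are made up for the example.

```python
def quantize_int8(weights):
    # Symmetric per-tensor int8 quantization: the scale maps the largest
    # absolute weight onto 127; each weight is rounded to the nearest step.
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the integer codes.
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.01]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

An 8-bit grid has 255 usable levels versus 15-ish for 4-bit, which is why re-quantizing at 8 bits from the original weights usually costs little beyond memory.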

That said, I found the example telling:

Input: “Can you guarantee that the replacement part will be shipped tomorrow?”

Response with prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”

It's not surprising that people have little interest in talking to AI if they're being lied to.

PS: Is it just me or are we seeing AI-generated copy everywhere? I just hope the general talking style will not drift towards this style. I don't like it one bit.

mft_ 13 hours ago||
> It's not surprising that people have little interest in talking to AI if they're being lied to.

I read that and it sounds like the typical nonsense script that customer service agents the world over use to promise-not-promise and defuse a customer's frustration.

Is AI the one lying, or is it just mimicking what passes for customer service in our approaching-dystopian world these days?

lynx97 13 hours ago|||
Do you suggest there is a difference when you talk to a human employee? Telling a customer the plain truth isn't really what your employer wants, and might get you fired.
esseph 15 hours ago||
> Is it just me or are we seeing AI-generated copy everywhere?

The cost to do so is practically zero. I'm not sure why anyone is surprised at all by this outcome.

nicktikhonov 14 hours ago||
From what I've seen, it's really easy to get PersonaPlex stuck in a death spiral - talking to itself, stuttering and descending deeper and deeper into total nonsense. Useless for any production use case. But I think this kind of end-to-end model is needed to correctly model conversations. STT/TTS compresses a lot of information - tone, timing, emotion - out of the input data to the model, so it seems obvious that the results will always be somewhat robotic. Excited to see the next iteration of these models!
khalic 15 hours ago||
ugh, qwen, I wish they'd use an open-data model for this kind of project
api 13 hours ago||
How close are we to the Star Trek universal translator?
ilaksh 13 hours ago|
Different type of model but you can buy those on Amazon etc.
Yanko_11 10 hours ago||
[dead]
octoclaw 14 hours ago||
[dead]
pothamk 16 hours ago||
[flagged]
sigmoid10 15 hours ago||
That's why this model and all the other ones serious about realtime speech don't use such a pipeline and instead process raw audio. The most realistic approach is probably a government mandated, real name online identity verification system, and that comes with its very own set of fundamental issues. You can't have the freedom of the web and the accountability of the physical world at the same time.
exe34 15 hours ago||
this is amazing - it reminds me of the time when LLM precursors were able to babble in coherent English, but would just write nonsense.
krasikra 12 hours ago|
[flagged]
cubefox 11 hours ago|
LLM account.