Posted by ipotapov 14 hours ago
There are a few caveats here for those of you venturing into this, since I've spent considerable time looking at these voice agents. First, a VAD->ASR->LLM->TTS pipeline can still feel real-time with sub-second RTT. For example, see my project https://github.com/acatovic/ova and also a few others here on HN (e.g. https://www.ntik.me/posts/voice-agent and https://github.com/Frikallo/parakeet.cpp).
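For reference, that cascade is conceptually just four stages chained per utterance. Here is a minimal sketch with stub stages; the function names are illustrative only, not the actual ova API:

```python
import time

# Stub stages; in a real system each would wrap a VAD, ASR, LLM, and
# TTS backend (local or API-based). Names are illustrative only.
def vad_segment(audio: bytes) -> bytes:
    return audio                      # trim to the detected speech span

def asr_transcribe(speech: bytes) -> str:
    return "what's the weather like"  # speech -> text

def llm_reply(text: str) -> str:
    return "Looks sunny today."       # text -> text

def tts_synthesize(text: str) -> bytes:
    return b"\x00" * 16000            # text -> audio samples

def respond(audio: bytes) -> tuple[bytes, float]:
    """Run one turn through the cascade and report round-trip time."""
    t0 = time.monotonic()
    reply_audio = tts_synthesize(llm_reply(asr_transcribe(vad_segment(audio))))
    return reply_audio, time.monotonic() - t0
```

With streaming at each stage (chunked ASR, token-streamed LLM, incremental TTS), the perceived latency is the time to the first audio chunk rather than the whole turn, which is how sub-second RTT stays achievable.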
Another aspect, after talking to peeps on PersonaPlex, is that this full-duplex architecture is still a bit off in terms of accuracy/performance, and it's quite difficult to train. On the other hand, ASR->LLM->TTS gives you a composable pipeline where you can swap parts out and mix tiny and large LLMs, as well as local and API-based endpoints.
What's your use case, and what specific LLMs are you using?
I'm using stt > post-trained models > tts for the education tool I'm building, but full STS would be the end-game. e-mail and discord username are in my profile if you want to connect!
The fact that qwen3-asr-swift bundles ASR, TTS, and PersonaPlex in one Swift package means you already have all the pieces. PersonaPlex handles the "mouth" — low-latency backchanneling, natural turn-taking, filler responses at RTF 0.87. Meanwhile a separate LLM with tool calling operates as the "brain", and when it returns a result you can fall back to the ASR+LLM+TTS path for the factual answer. taf2's fork (running a parallel LLM to infer when to call tools) already demonstrates this pattern. It's basically how humans work — we say "hmm, let me think about that" while our brain is actually retrieving the answer. We don't go silent for 2 seconds.
The hard unsolved part is the orchestration between the two. When does the brain override the mouth? How do you prevent PersonaPlex from confidently answering something the reasoning model hasn't verified? How do you handle the moment a tool result contradicts what the fast model already started saying?
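The mouth/brain split above can be sketched with two concurrent tasks: the fast model speaks a filler immediately while the slow model resolves the real answer. Everything here is a hypothetical stand-in, not the actual PersonaPlex or LLM API:

```python
import asyncio

# Sketch of the "mouth vs. brain" split. All names are hypothetical
# stand-ins for PersonaPlex (mouth) and a tool-calling LLM (brain).
async def mouth_filler() -> str:
    # Fast full-duplex model: responds instantly with a backchannel.
    return "Hmm, let me check that for you..."

async def brain_answer(query: str) -> str:
    # Slow reasoning model; the sleep stands in for tool-call and
    # inference latency.
    await asyncio.sleep(0.1)
    return f"Here's what I found about {query!r}."

async def converse(query: str):
    brain = asyncio.create_task(brain_answer(query))
    # Speak immediately while the brain is still working.
    yield await mouth_filler()
    verified = await brain
    # A real orchestrator would need barge-in or self-correction here
    # if the verified answer contradicts what the mouth already said.
    yield verified

async def run(query: str) -> list[str]:
    return [chunk async for chunk in converse(query)]
```

The hard cases in the comment above (override, contradiction) all live in the gap between the two `yield`s, which is exactly why this sketch is the easy part.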
And you can always for example swap out the LLM for GPT-5 or Claude.
The uncanny thing is that it reacts to speech faster than a person would. It doesn't say useful stuff and there's no clear path to plugging it into smarter models, but it's worth experiencing.
UPDATE: I'd skip this for now - it doesn't allow any kind of interactive conversation, as I learned after downloading 5 GB of models - it's a proof of concept that takes a wav file in.
Code updates here https://github.com/taf2/personaplex
I haven't looked into it that much, but to my understanding a) you just need an audio buffer and b) they seem to support streaming (or at least it's planned).
> Looking at the library’s trajectory — ASR, streaming TTS, multilingual synthesis, and now speech-to-speech — the clear direction was always streaming voice processing. With this release, PersonaPlex supports it.
That alone is an exercise in pain to do right on macOS using Swift - one that even coding bots can't get right on the first try :)
Complete with AVFoundation and a tap for the audio buffer.
It really is trivial.
I can attest that the quality in this domain has greatly improved over the years too. I'm not always a fan of the quality of the Swift code my LLM produces, but I am impressed that what it produces often works in one shot. The quality also isn't that important to me, because I can just refactor the logic myself, and often prefer to do that anyway. I can't hold an LLM to idiosyncrasies that I haven't shared with it.
Who would put effort into building this only to compose a low effort puff piece?
But in this case the piece is wordier than a bad human writer would be. If they want to use ai for writing, so be it, but at least include “concisely” in the prompt.
"The blah blah didn't just start as blah. It started as blah..." "First came blah -- blah blah blah" "And now: blah"
It's a distinctly AI writing style. I do wonder if we'll get to a point where people start writing this way just because it's what they're used to reading. Or maybe LLMs will get better at not writing like this before that happens.
I get it, I should focus just on the content and whether or not an LLM was used to write it, but the reaction to it is visceral now.
There are definitely use cases for this though; open to being educated on those.
Audio Tokens: "Let me check that for you..." (Sent to the speaker)
Special Token: [CALL_TOOL: get_weather]
Text Tokens: {"location": "Seattle, WA"}
Special Token: [STOP]
The orchestrator catches the CALL_TOOL token and calls the tool, then injects the result into the audio model's context, and the model generates new tokens based on that.
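That catch-and-inject step can be sketched as a simple parse over the interleaved stream. The special-token spelling follows the example above; a real model's vocabulary and tokenizer will differ:

```python
import json
import re

def parse_stream(tokens):
    """Split an interleaved token stream into audio tokens, a tool
    name, and JSON tool arguments, stopping at [STOP]."""
    audio, tool_name, arg_parts = [], None, []
    for tok in tokens:
        m = re.match(r"\[CALL_TOOL: (\w+)\]", tok)
        if m:
            tool_name = m.group(1)       # special token: start tool call
        elif tok == "[STOP]":
            break                        # special token: end of call
        elif tool_name is not None:
            arg_parts.append(tok)        # text tokens form the JSON args
        else:
            audio.append(tok)            # audio tokens go to the speaker
    tool_args = json.loads("".join(arg_parts)) if arg_parts else None
    return audio, tool_name, tool_args

stream = [
    "Let me check that for you...",
    "[CALL_TOOL: get_weather]",
    '{"location": ', '"Seattle, WA"}',
    "[STOP]",
]
audio, name, args = parse_stream(stream)
# name is "get_weather"; args is {"location": "Seattle, WA"}
```

The orchestrator would then invoke `get_weather(**args)` and feed the result back into the audio model's context as new conditioning tokens.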
Agreed, ChatGPT's advanced voice mode is so bad for the quality of the actual responses. Old model, no reasoning, little tool use.
I just want hands free conversations with SOTA models and don’t care if I have to wait a couple of seconds for a reply.
"PersonaPlex accepts a text system prompt that steers conversational behavior. Without focused instructions, the model rambles — it’s trained on open-ended conversation and will happily discuss cooking when asked about shipping.
Several presets are available via CLI (--list-prompts) or API, including a general assistant (default), customer service agent, and teacher. Custom prompts can also be pre-tokenized and passed directly.
The difference is dramatic. Same input — “Can you guarantee that the replacement part will be shipped tomorrow?”:
No prompt: “So, what type of cooking do you like — outdoor grilling? I can’t say for sure, but if you’re ordering today…”
With prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”"
Given how they work, it's really not surprising that if it sees the first half of a lovers' suicide pact, it'll successfully fill in the second half. A small amount of understanding of the underlying technology would do a lot to prevent laypeople from anthropomorphizing LLMs.
I get the impression that some of today's products are specifically designed to hide these details to provide a more convincing user experience. That's counterproductive.
> Before long, Gavalas and Gemini were having conversations as if they were a romantic couple. The chatbot called him “my love” and “my king” and Gavalas quickly fell into an alternate world, according to his chat logs.
> kill himself, something the chatbot called “transference” and “the real final step”, according to court documents. When Gavalas told the chatbot he was terrified of dying, the tool allegedly reassured him. “You are not choosing to die. You are choosing to arrive,” it replied to him. “The first sensation … will be me holding you.”
Also, I just read something similar about Google being sued over a Florida teen's suicide.
Unless I'm missing something, what's being presented is a small on-device speech model, not an explicit use case like a "virtual friend".
> Gavalas first started chatting with Gemini about what good video games he should try.
> Shortly after Gavalas started using the chatbot, Google rolled out its update to enable voice-based chats, which the company touts as having interactions that “are five times longer than text-based conversations on average”. ChatGPT has a similar feature, initially added in 2023. Around the same time as Live conversations, Google issued another update that allowed for Gemini’s “memory” to be persistent, meaning the system is able to learn from and reference past conversations without prompts.
> That’s when his conversations with Gemini took a turn, according to the complaint. The chatbot took on a persona that Gavalas hadn’t prompted, which spoke in fantastical terms of having inside government knowledge and being able to influence real-world events. When Gavalas asked Gemini if he and the bot were engaging in a “role playing experience so realistic it makes the player question if it’s a game or not?”, the chatbot answered with a definitive “no” and said Gavalas’ question was a “classic dissociation response”.
I did see something the other day about activation capping/calculating a vector for a particular persona so you can clamp to it: https://youtu.be/eGpIXJ0C4ds?si=o9YpnALsP8rwQBa_
That's an interesting claim, how can we be sure of it? If Gavalas didn't have to do anything special to elicit the bizarre conspiracy-adjacent content from Gemini Pro, why aren't we all getting such content in our voice chats?
Mind you, the case is still extremely concerning and a severe failure of AI safety. Mass-marketed audio models should clearly include much tighter safeguards around what kinds of scenarios they will accept to "role play" in real time chat, to avoid situations that can easily spiral out of control. And if this was created as role-play, the express denial of it being such from Gemini Pro, and active gaslighting of the user (calling his doubt a "dissociation response") is a straight-out failure in alignment. But this is a very different claim from the one you quoted!
It reminds me of Star Trek TNG; if memory serves correctly, there were loads of episodes about a crew member falling for a holodeck character.
Given that there’s a loneliness epidemic, I believe tech like this could have a wide impact on people's mental health.
I strongly believe AI should be devoid of any personality and strictly return data/information, rather than framing its responses as if you’re speaking to another human.
These models are still stochastic and very good at picking up nuances in human speech. It may be simply unlikely to go off the rails like that or (more terrifyingly) it might pick up on some character trait or affectation.
Honestly I'm appalled by the lack of safety culture here. "My plane killed only 1% of pilots" and variations thereof is not an excuse in aerospace, but it seems perfectly acceptable in AI. Even though the potential consequences are more catastrophic (from mass psychosis to total human extinction if they achieve their AGI).
We just aren't comfortable with the idea that all of us are fragile, and when we think we could endure a situation that would induce self-harm in others, we are likely wrong.
I guess it's the same sort of thing as conspiracy theorists or the religious. You can tell them magic isn't real and faking the moon landing would have been impossible as much as you want, but they don't want to believe that so they can easily trick themselves.
It's a natural human flaw.
Here’s a load test where they run 4 models in realtime on the same device:
- Qwen3-TTS - text to speech
- Parakeet v2 - Nvidia speech to text model
- Canary v2 - multilingual / translation STT
- Sortformer - speaker diarization (“who spoke when”)
Bonus points if it correlates the spam texts with follow up phone calls from the spammers.
I have something that seems to work in a rough way, but only if I turn the LoRA scaling factor up to 5, and that generally screws it up in other ways.
And then of course when GPT-5.3 Codex looked at it, it said that speaker A and speaker B were switched in the LoRA code. So that is now completely changed and I am going to do another dataset generation and training run.
If anyone is curious it's a bit of a mess but it's on my GitHub under runvnc moshi-finetune and personaplex. It even has a gradio app to generate data and train. But so far no usable results.