Posted by RIshabh235 12 hours ago
Personally, I'm a bit surprised the DS chat app still doesn't offer its own text to speech and speech to text features (I know DS doesn't have any ASR model for example, but there are quite a few in the open).
As someone who would rather send a Slack message to a coworker than actually walk over and talk to them, the idea of having to talk with my laptop is not appealing at all, haha.
One problem has been that the ChatGPT/Claude apps don’t really do this well. They use weak and/or non-reasoning models for voice interaction, and the UX is not optimized for hands-free use.
I wrote an iOS chatbot app mainly for this purpose for myself and family/friends. It allows starting/sending voice prompts with the action button so I never have to look at the screen. It supports any model at any reasoning level so conversations are not dumbed down. I added a video transcription tool so any model can “read” YouTube/TikTok videos and chat about them. Great for discussing lectures on tech topics.
It takes slightly longer to use a reasoning model for voice interaction, but I prefer the intelligence. The latency can be minimized in a few ways; bidirectional streaming helps. It’s TTS agnostic: I’ve got a few selectable providers, and the output can be styled by prompt (“use a chill tone that’s not too eager”).
For some godawful reason, Apple Maps voice directions assume that you also understand what it omits. So if it says "turn right in 500 meters" "250 meters" and then you stop at an intersection after 150 meters and it says "turn right", it expects you to understand that it doesn't mean the immediate right at the intersection, but the next one [because you still haven't driven the full 250m]. It is nuts and I have no clue how that has ever gotten past testing.
What it should do is say nothing until I have to turn, or say "turn right in 100 meters" "turn right".
They also clearly show which voices can do street names (which is hugely helpful). For some reason the Australian and British accented voices feel more polite than the American ones.
For example, for voice, ChatGPT still uses a quantized GPT-4o non-reasoning model that hallucinates pretty frequently. It also doesn’t do much automatic search for updated information and fact checking.
I usually don’t find I need high reasoning; DeepSeek v4 with medium reasoning is usually sufficient.
However, if it’s an important chat, like brainstorming on complex topics, I sometimes bump it up.
OpenAI has a new voice API that supports adjustable reasoning, but ChatGPT is not using it currently.
My current flow is: Google Eloquent to capture 127 WPM (my typing, best case, is 65 WPM). This lets me get the thoughts out without thinking too much about structure or flow, the same way I would brain-dump type it.
Next I use AI to compress, summarize, and restructure to create a clear coherent message for my peer to read (which is way faster for them).
When communicating with AI, it's the same thing, except I skip the second step since AI does a good job at understanding my ramblings.
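A quick back-of-the-envelope for the dictation vs. typing numbers above (127 WPM spoken, 65 WPM typed). The 500-word note length is a made-up example, not a figure from the comment:

```python
def minutes_to_capture(words: int, wpm: float) -> float:
    """Minutes needed to get `words` words out at a steady `wpm` rate."""
    return words / wpm


DICTATION_WPM = 127  # speaking rate quoted in the comment
TYPING_WPM = 65      # best-case typing rate quoted in the comment
NOTE_WORDS = 500     # hypothetical brain-dump length, for illustration only

speedup = DICTATION_WPM / TYPING_WPM
spoken = minutes_to_capture(NOTE_WORDS, DICTATION_WPM)
typed = minutes_to_capture(NOTE_WORDS, TYPING_WPM)

print(f"dictated: {spoken:.1f} min, typed: {typed:.1f} min, speedup: {speedup:.2f}x")
```

So at those rates dictation is a bit under 2x faster for raw capture. This ignores the time spent on the later cleanup step.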
----
It drives me crazy that some cultures only send voice messages to each other. It drives me crazy they can't be respectful of my time and use STT+AI to convert their 90 second monologue to a few written sentences.
I kinda view myself as a wheelchair user. I'm bad at walking so I use a wheelchair so I can at least have a semblance of decent communication. I don't think my ideas are not worth sharing, but I'm just bad at writing them in an engaging way.
The scarier thing for me is coding. I am good at coding. But I don't even read a single line of code any more.
But I imagine if I'd been a native speaker I wouldn't mind using AI like OC does since it's a convenience. Same way I use a calculator for two digit multiplications in real life but spent years learning to do it manually in school.
I avoid using AI as a direct translation tool, but it's super useful for me to translate complex English ideas to Vietnamese.
I am loving the conversation here, though, about how people are (or aren't) using speech to talk to LLMs; it is something that no one talks about much.
I can either edit down my rambling by hand (costs about 10-30 min depending on the length) or I can ask ChatGPT for assistance, where I manually edit ChatGPT's edits for factual correctness and tone.
--- STT
Like, besides the lease risk, I think 30 to 50% of the business is going to end the moment the owner stops showing up and the motorbikes are gone. Either, I think it was Roger mentioned or you mentioned the Google reviews all mentioned that people go to the bar because it attracts other motorcycle people. And, you know, we don't have an existing, like, network to come in and grow this. And so we might see a 30 to 50% decline in revenue within a month or two with nothing ready to, like, backfill that with. And if our main goal is to make a private club or event space, ideally, I'd like to have some, like, something substantial to, like, commit to that spot. Like, right now, we're three fucking dudes with, you know, a little bit of a vision, but not really. And, you know, we're buying what will be a failing bar unless we, you know, figure out how to run events or use that backspace.
polish
---- gpt5.5
Polished version:
Separate from the lease risk, I think there is a real chance that 30–50% of the business disappears once the current owner stops showing up and the motorbikes are gone.
Either Roger mentioned this, or we discussed it separately, but the Google reviews seem to reinforce the same point: a meaningful part of the bar’s appeal is that it attracts motorcycle people. We do not currently have an existing network or community that can step in and replace that traffic.
That means we could see a 30–50% revenue decline within the first month or two, with no clear plan in place to backfill it. If the main goal is to turn the space into a private club or event venue, I would feel much better if we had something substantial already committed to that location.
Right now, we are three guys with a loose vision, but not much concrete traction. Without a clearer plan for events, memberships, or activating the back space, we may effectively be buying a bar that starts failing the moment the current identity and customer base disappear.
I have used Whisper to transcribe audio into text in the past. You could probably build a pipeline for that, whether running locally or in the cloud, and then run the transcription through the same summarization agent.
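Something like this is what I'd picture for that pipeline; a rough sketch, not a tested setup. It assumes the open-source `openai-whisper` package for the local transcription step. `voice_note_to_summary` and the `summarize` callable are hypothetical names standing in for whatever summarization agent you already have:

```python
from typing import Callable


def transcribe_with_whisper(audio_path: str, model_name: str = "base") -> str:
    """Transcribe one audio file locally with the `openai-whisper` package."""
    import whisper  # lazy import: the pipeline below also works with other transcribers

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return result["text"]


def voice_note_to_summary(
    audio_path: str,
    transcribe: Callable[[str], str],
    summarize: Callable[[str], str],
) -> str:
    """Audio file -> transcript -> summary.

    `summarize` is a placeholder for the existing summarization agent
    (hypothetical; plug in any function that maps text to text).
    """
    transcript = transcribe(audio_path).strip()
    if not transcript:
        return ""  # nothing was said, so skip the summarization call
    return summarize(transcript)
```

Usage would be roughly `voice_note_to_summary("note.m4a", transcribe_with_whisper, my_agent)`, where `my_agent` is your own function. Keeping the transcriber as a parameter means you can swap the local model for a cloud one without touching the rest.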
If you use AI to drive your communication with other humans, you suck.
Everything is almost instant, and it lets me work across multiple different agents/windows at the same time with cmux.
I use the same thing to talk to people on Slack, iMessage, etc now when I'm working from home instead of typing.
It also helps me articulate my thoughts better when I'm literally thinking them out loud instead of just sitting silent and typing them on a computer for hours.
It's just something that you need to try and get used to because I also thought it was something I wouldn't like at first.
edit: never mind, found info in the docs about how to enable post processing. Would still be interested in your prompt though if you don't mind sharing!
This is the prompt I use (it's probably overkill and can be condensed):
It's the model doing the work inside the wrapper that an app provides.
https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
It's almost instant on my new M5 Max w/ 36GB of memory, but I used both with Handy on my previous 2019 Intel Mac w/ 16GB memory and was completely surprised at just how fast it was for being on-device! Not instant, but only a couple seconds.
Transcription this good used to cost A LOT, now it rounds down to free.
Actually, my thoughts on this matter changed so much that it inspired me to get much more into voice controls, because I realized this same problem was basically why some people sucked at remote work or weren't able to properly use tools like Claude Code: it was essentially the same problem but worse (typing / messaging feeling too high-friction or raising the barrier for participation). I have a way to let Claude call me now to tell me stuff when I have a bunch of instances out doing stuff and then leave to go home.
I'm trying to get that better integrated in my devloop because I think it makes managing >4 agents simultaneously much more feasible and natural for some people (I used to play Starcraft a lot so I'm used to the multitasking, but it still takes sustained willpower to be constantly "driving" or monitoring things, or to field questions), especially ones who have never served as TLs or people managers before. IMO it's a big performance roadblock for a lot of developers to treat directing multiple agents simultaneously as some kind of high-stakes/high-cost thing. The kind of developer who would not say anything in a team meeting unless prompted, or who thinks everything is stupid by default (because they are afraid of making decisions / being wrong even if only briefly), is both very common and reluctant to work this way, but also really probably needs it to be as productive as more skilled developers.
"You're right to push back" has become the gold standard phrase I'm looking for from these things to assure myself that I'm covering all the bases and understanding what it's building (not that that's enough, and not that it isn't still going to build some ungodly blob anyway).
I kinda like using voice to jot down my next questions or iterate on things, but there's a clear danger to it, which is that you may inadvertently be signing off on stuff you haven't thoroughly read. If there's one thing about LLM-written code, it's that the devil is in the details.
But I love the ChatGPT voice interface, e.g. on a long drive when I can use it to learn about random stuff (btw, turn advanced voice off for such usage).
The other part, though, is Hacker News vs. the regular population, the majority of which would much, much rather talk and listen than type and read.
The other week I fixed a water valve. After planning the thing with ChatGPT I bought the new valve. Then I described what I was seeing as I swapped the old valve for the new one to make sure everything was right. Really cool experience!
I understand a bit of Spanish but I don’t speak Spanish yet, and they don’t speak English.
I speak English to the AI and end with “translate to Spanish, translation only”, and then the AI says the thing I was saying in Spanish (not perfect but good enough, and also it has a slightly weird accent that might be it using English or English influenced text to speech even when speaking Spanish sentences?).
Pity the managers with no one left to boss around besides the machines coming for their own jobs.
I was asked just yesterday if I could wire up [redacted] so that [redacted profession] could have a realtime voice interface while in the middle of performing [redacted]. My basic answer was yes, but it would be a bit slower than you want if something is going wrong, and it would probably be unethical for a whole lot of reasons.
This trivial fact of life is observed every day by e.g.:
- students taking notes and finding it necessary to only jot down key facts so that they can keep up,
- stenographers who require special training and equipment to keep up verbatim with live speech in the courtroom,
- annoying colleagues who insist on "hopping on a quick call" or arranging big, wasteful, and disruptive meetings instead of just writing down their problem / sending a message or email,
- friends who insist on sending short voice messages in DMs instead of typing, because it's more "personal" that way (which to be fair it is, but not to the extent proclaimed).
Nobody will cry when their AI girlfriend model gets revoked. You'll always have the weights.
Presumably for the low cost of spinning up an H200 or two you can use the weights forever.
No more claiming your LLM gets nerfed. No more claiming your video model can't do Spider-Man anymore.
Is it a new silent update?
I thought "thinking" is literally the model generating additional text in a human language that shows its "thought process". It's added to the model's context, which helps it reason better because it now has this self-generated context.
The "their own language" idea seems to come from some recent science fiction where LLMs develop their alien language and take over the world by 2037 or something.
Research only showed that thinking might be disconnected from the final output, but in my experience they are very strongly correlated in recent models.
It is trivial to regularly spot obvious contradictions and inconsistencies if you read carefully. For example I've encountered traces that amounted to "I can deduce X, therefore Y, so that means Z" but then the model turns around and outputs "the answer is W because X". It's even been demonstrated that having the model output placeholder tokens or other gibberish instead of "thoughts" still improves performance. However the thinking traces can still be useful to the end user regardless.
I think that for the DeepSeek problem (thinking and replying in Chinese) everything is kinda simpler: in their official chat, they're probably using some kind of system prompt which is (probably) written in Chinese, so that's why the model may prefer Chinese in its output.
They occasionally show snippets of CoT in papers they write, e.g. for o3/o4/GPT5 models [1] or Claude 3.5 Haiku [2].
[1]: https://openai.com/index/evaluating-chain-of-thought-monitor... [2]: https://transformer-circuits.pub/2025/attribution-graphs/bio...
Or hallucinated
I use the API however, not the chat interface.
It also happened a handful of times with Anthropic models.
I'm curious to know how they can offer it at such a cheap price. Some say it's electricity surplus in China and/or government subsidy. It'll be a very interesting read if there's an extensive study on their economics.
1.1B (cache reads) * $0.5 = ~576
39M (cache miss) * $5 = ~199
21M (output) * $25 = ~529
Opus 4.8 = 1304
1.1B (cache reads) * $0.003625 = ~4.17
39M (cache miss) * $0.435 = ~17.3
21M (output) * $0.87 = ~18.4
DeepSeek V4 Pro = ~40
Turns out, to use Claude Agents SDK, you need a vision-enabled API. If the DeepSeek API could see, it could fully drive Claude Code and Claude Agents SDK. A project I'm working on relies on a Claude-in-CloudflareWorker setup and I've been relying on Qwen and Gemini Flash Lite, both more expensive than DeepSeek.
Can't wait to have it available on DeepSeek.
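For anyone who wants to redo the Opus 4.8 vs. DeepSeek V4 Pro tally from the cost comparison above, here is a small sketch. It assumes the quoted prices are USD per million tokens (the comment doesn't state the unit) and uses the rounded token counts as written; the function and dictionary names are mine:

```python
def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` tokens at a price quoted per one million tokens (assumed unit)."""
    return tokens / 1_000_000 * price_per_million


# Rounded token counts as quoted in the comment.
USAGE = {
    "cache_read": 1_100_000_000,
    "cache_miss": 39_000_000,
    "output": 21_000_000,
}

# Prices as quoted in the comment, assumed to be USD per million tokens.
PRICES = {
    "Opus 4.8": {"cache_read": 0.5, "cache_miss": 5.0, "output": 25.0},
    "DeepSeek V4 Pro": {"cache_read": 0.003625, "cache_miss": 0.435, "output": 0.87},
}


def total_cost(usage: dict, prices: dict) -> float:
    """Sum the cost over every token category in `usage`."""
    return sum(cost_usd(usage[kind], prices[kind]) for kind in usage)


for name, prices in PRICES.items():
    print(f"{name}: ${total_cost(USAGE, prices):,.2f}")
```

With the rounded counts this comes out to about $1,270 and about $39.22, slightly under the ~1304 and ~40 quoted, presumably because the original figures were computed from unrounded token counts. Either way the ratio is a bit over 30x.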