Show HN: I built a <400ms latency voice agent that runs on a 4gb vram GTX 1650"

I built a Voice agent platform my drobotics lab of my university..which is already being cloned by 330+ people within 12hrs .. I am a first year cse student and so I tried to figure out a way to actually run everything on my laptop and working on it currently to completely transform to edge ai voice assistants for the robotics and 100% private and local control of robotics related project of my lab..

The intersting features are : 1> I used json rag with real time embeddings so that for a few specs and info we don't need to set a whole pipeline..

I have already built " Hierarchical Agentic Rag with Hybrid Search ( knowledge graph + vector search) u can view that on my profile ...

I am actively trying to share as much as possible related to it but that project is actually linked with a huge set of files it's 693k points of data with pgvector+ postgress .. give a visit u will get more idea from that

2> I had tried every sort of whisper models.. faster whisper .. turbo or anything u can u think of ..even with a self c++ engine .. but that model itself was hallucintion prone architecture..

Then I moved to parakeet tdt with silero vad and not parakeet rnn for better speed and optimisations .. repo has further details ..

3> fine tuned a dataset from anthropic rlhf through space and glinner and convert that to a perfect training dataset of the Lama 3.2 3b ..

I will attach the dataset of u need or will upload that to hugging face if u want to use it for yourself..

4> attached phonetic correctors for both output from parakeet and llama for better tts working .

5> I used setfit to route the queries and confidence based semantic search for faster and accurate as much as possible

6> I am using sherpa onxx and qued the tts and stt and everything but as a experimentation I have also achieved llama generating respond and kokora processing as a batch with a full nyc working as well and everything on my laptop...

7> along with these my frontend also relies on heavy three.js and 3d view files but I had applied optimisations there which works perfectly with everything together on the laptop..

8> I also applied glued interaction to the llm model .. implemented FIFO with 5 interactions and storing them for future fine tuning and phonetic words additions.

Pls give a visit it and let me know if I should learn something new ..

One kind note : as a enthusiast spending so much energy on these things things .. I have taken help from ai for the md files and expansion or explanations in the codes for better help of every single person...