Posted by bilsbie 9 hours ago
7B on 15W could be any of the Orin (TOPS): Nano (40), NX (100), AGX (275)
Curious if you've experimented with a larger model on the Thor (2070)
Huh? Why would industrial inspection, in particular, benefit from lower latency in exchange for accuracy? Sounds a bit backwards, but maybe I'm missing something obvious.
1. Cursor used online RL to get +28% approval rate: https://cursor.com/blog/tab-rl
2. Vercel used RFT for their AutoFix model for V0: https://vercel.com/blog/v0-composite-model-family
3. Perplexity's Sonar for Deep Research Reasoning I think was a finetuned model: https://docs.perplexity.ai/docs/getting-started/overview
4. Doordash uses LoRA, QLoRA for a "Generalized Attribute Extraction model" https://careersatdoordash.com/blog/unleashing-the-power-of-l...
5. NASA flood water detection https://earthdata.nasa.gov/news/nasa-ibm- openly-release-geospatial-ai-foundation-model-nasa-earth-observation-data6
6. Online RL for robotics - imagine you teaching a robot in the future via some mini finetuning
7. OpenAI's RFT page has more: https://developers.openai.com/api/docs/guides/rft-use-cases
8. For larger models - https://www.mercor.com/blog/expert-data-drives-model-perform...
I just ran a benchmark against haiku of a very simple document classification task that at the moment we farm out to haiku in parallel. very naive same prompt system via same api AWS bedrock, and can see that the a few of the 4b models are pretty good match, and could be easily run locally or just for cheap via a hosted provider. The "how much data and how much improvement" is a question i dont have a good intuition for anymore. I dont even have an order of magnitude guess on those two axis.
Heres raw numbers to spark discussion:
| Model | DocType% | Year% | Subject% | In $/MTok |
|---------------|----------|-------|----------|-----------|
| llama-70b -----| 83 | 98 | 96 | $0.72 |
| gpt-oss-20b --| 83 | 97 | 92 | $0.07 |
| ministral-14b -| 84 | 100 | 90 | $0.20 |
| gemma-4b ----| 75 | 93 | 91 | $0.04 |
| glm-flash-30b -| 83 | 93 | 90 | $0.07 |
| llama-1b ------| 47 | 90 | 58 | $0.10 |
percents are doc type (categorical), year, and subject name match against haiku. just uses the first 4 pages.
in the old world where these were my own in house models, id be interested in seeing if i could uplift those nubmers with traingin, but i haven't done that with the new LLMs in a while. keen to get even a finger to the air if possible.
Can easily generate tens of thousands of examples.
Might try myself, but always keen for an opinion.
_edit for table formatting_
Source: Consulted for a few companies to help them finetune a bunch of LLMs. Typical categorical / data extraction use cases would have ~10x fewer errors at 100x lower inference cost than using the OpenAI models at the time.
I did an experiment where I did very simple SFT on Mistral 7b and it was extremely good at converting receipt images into structured json outputs and I only used 1,000 examples. The difficulty is trying to get a diverse enough set of examples, evaling, etc.
If you have great data with simple input output pairs, you should really give it a shot.
like this | Model | DocType% | Year% | Subject% | In $/MTok |
|----------------|----|-----|----|-------|
| llama-70b -----| 83 | 98 | 96 | $0.72 |
| gpt-oss-20b ---| 83 | 97 | 92 | $0.07 |
| ministral-14b -| 84 | 100 | 90 | $0.20 |
| gemma-4b ------| 75 | 93 | 91 | $0.04 |
| glm-flash-30b -| 83 | 93 | 90 | $0.07 |
| llama-1b ------| 47 | 90 | 58 | $0.10 |I am not expert in this topic, but I am wondering if large cached context is actually cheap to run and frontier models would be cost efficient too in such setting?
Also for certain use cases there are constraints like embedded hardware systems with no internet access. These LLMs have to be trained to specialize for clearly defined use cases under hardware constraints.
Frontier LLMs also are rarely function in isolation instead are orchestrating a system of special units aka subsystems and agents.
While costs and effort are one thing, being able to downsize these monster LLMs through finetuning itself in the first place is extremly valuable.
As a result it's really hard to read about real-world use cases online. I think a lot of people would love to hear more details - at least I know I would!
Unless your game states have combinatoral exlosion, would it not be better to generate all of that pre-build? If templated you can generate a few hundreds of thousands of templates to use for any circumstance, then instantiate and stitch together those templates during the game runtime.
I dunno, for game prose I expect that a tiny highly quantized model would be sufficient (generating no more than a paragraph), so 300MB - 500MB maybe? Running on CPU not GPU is feasible too, I think.
There might be future optimizations. Like, have your small model do COT to find where to look for memory that is relevant.
I've tried too. Wasted a few days trying out even high end paid models.
You are correct if we are talking about knowledge.
However it is bad at hyper-idiosyncratic, gritty style transfer.
I first noticed the issue when asking claude code to draft email responses. The choice of register was off. ("Register in writing refers to the level of formality and tone chosen to suit a specific audience, purpose, and context.")
I decided to talk all my HN comments and rewrite them in various bad LLM prose, and see if I could use DSPy to optimize a prompt using in-context-learning (ICL, I give it 10 examples of my HN comments) and the results were abysmal. RHLF fine-tuned frontier LLMs have a deep seated aversion to the target stylistic distribution of my comments.
I tried fine-tuning qwen3, llama, and gemma models. Instruct models are already so tuned that they could not be tuned. This is using several hunded comments as gold targets and 5 different LLM degradations per gold as the input.
1. If we have robots at home, they need some sort of efficient continual learning, which could be on the go finetuning / RL via some small LoRA - this will need to do multimodal finetuning with sparse reward signals - one could also imagine all data is aggregated to one central processing center after anonymization, and training a larger model with more data + RL like that
2. Agreed images, audio, video etc is what still LoRA does well - the guide at https://unsloth.ai/docs/models/qwen3.5/fine-tune is actually a vision + text finetuning guide, so you can finetune the vision layers on your own use case
3. Model routing is going to be more the norm in the future - ie locally smallish models with LoRA for continuous finetuning can be used, but complex tasks can be offloaded to a large LLM in the cloud.
4. I also wrote about more use-cases below on the post - DoorDash, Vercel, Mercor, Stripe, NASA, Perplexity, Cursor and many others all do finetuning - for eg Cursor, Perplexity finetune large OSS LLMs themselves for their specific product lines - so there is definitely value if you have the data for it.
For example last year with Daniel/Unsloth's help we released a tiny specialized model that can get equivalent to Gemini level purpose specifically for FC. For folks that need efficient limited purpose models small models like this can fit a specific need.
https://blog.google/innovation-and-ai/technology/developers-...
Especially on device. https://developers.googleblog.com/on-device-function-calling...
It's the same with chips, we have general purpose CPUs but we still have specialized silicon for tasks that are smaller, more power efficient, cheaper, and because they're single purpose it simplifies and derisks certain designs.
And I have to add, if you want to learn about finetuning models efficiently the Unsloth guides are at the top of my list. They're practical, have all the technical details, and most importantly Daniel and the others are working around the clock to keep it up to date in what is an incredibly fast moving space of models and hardware. I am continually astounded by their work.
Nice work with Gemma and Gemini as usual! :) Excited for more cool models this year!
I make it sound like a rare perfect storm needs to exist to justify fine tuning, but these circumstances are not uncommon - to an extent (a), (c) and (d) were already prerequisites for deploying traditional ML systems.
Using the large model to generate synthetic data offline with the techniques you mentioned, then fine-tuning the small model on it, is an underrated technique.
a) qwen3-coder
b) qwen3.5 (general)
Because these models are good in general but their Latvian output is half-drivel, like the roots of the words are usually the right ones, but not the rest.
That, and EuroLLM is really slow to release new models that would be similarly good off the shelf.