Posted by EvanZhouDev 22 hours ago

MAI-Code-1-Flash(microsoft.ai)
https://microsoft.ai/models/mai-code-1-flash/

https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF

Launching seven new MAI models: https://microsoft.ai/news/building-a-hillclimbing-machine-la...

517 points | 243 comments
camelmel 21 hours ago|
Huh, according to that model card this is a 137B total parameter model.

Performance doesn't seem that good:

- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro

- Qwen3.6-35B-A3B = 49.5% on SWE-bench pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

They benchmark against Claude Haiku but Haiku is not good, it's worse than tiny open models you can run locally or via API at 10% the cost.

davecitron 11 hours ago||
Dave Citron here, from the MAI team. Thanks for the feedback, we're getting the model card updated to call out 5B active parameters (137B total).

On benchmarks: in the same VS Code harness, MAI-Code-1-Flash scored 51.2% on SWE-bench Pro vs. Haiku's 35.2% which we see as a pretty big leap. But going forward, we'll include additional models in our benchmarks, including models like Qwen 3.6 and Gemma 4.

easygenes 9 hours ago|||
Have you run it through DeepSWE? I understand that's probably a high ask for this class of model, but would be interesting to see regardless.

Even if it can't fully pass much, there are so many tests against most of the scenarios that you can get a fairly rich report beyond the pass@1 stat. See e.g. this DeepSWE report against the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/

kosolam 6 hours ago|||
Hey Dave, I’d love to add your new model in the harness I’m going to opensource very soonish. Going to publish benchmarks on real world tasks.
sfifs 7 hours ago|||
Qwen is definitely the model to beat as of mid-2026. I didn't benchmark with SWE, as my use cases are OpenClaw [1], but I found both Qwen 3.6 35B A3B and, more impressively, Qwen 3.5 122B A10B starting to be competitive with closed flash models. The NVFP4 quant of the latter is what I'm running now on DGX.

[1] https://srinathh.medium.com/mid-size-local-models-are-now-co...

giancarlostoro 21 hours ago|||
The takeaway is that this is a smaller model that competes with Haiku. I would hope they come out with a "Sonnet"-competing model, then Opus. I have been wondering why Microsoft is kind of "sleeping" on offering models they themselves have made on Copilot; maybe it was part of their deal with OpenAI? Not sure.
mdasen 20 hours ago|||
Yes, it's a "smaller" (137B) model that competes with Haiku, but it's basically the performance of Qwen3.6-35B-A3B which is 75% smaller and 98% smaller in terms of active parameters (since it's a mixture of experts model). Microsoft should be comparing its model to good smaller models, not Haiku 4.5.

Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.

Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.

IanCal 9 hours ago|||
> 98% smaller in terms of active parameters (since it's a mixture of experts model).

I don’t think that’s right, this flash model is 5B active params. Qwen3.6-35B-A3B is 3B so 40% smaller.
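
The disputed percentages are easy to check. A minimal sketch, using only the parameter counts quoted in this thread (137B total / 5B active for MAI-Code-1-Flash, 35B total / 3B active for Qwen3.6-35B-A3B); the helper name `percent_smaller` is made up for illustration:

```python
def percent_smaller(small: float, large: float) -> float:
    """How much smaller `small` is than `large`, as a percentage of `large`."""
    return (1 - small / large) * 100


# Parameter counts in billions, as quoted in the thread.
total_gap = percent_smaller(35, 137)   # total parameters
active_gap = percent_smaller(3, 5)     # active parameters per token

print(f"total parameters:  {total_gap:.1f}% smaller")
print(f"active parameters: {active_gap:.1f}% smaller")
```

So "75% smaller" is about right for total parameters, while the active-parameter gap is 40%, not 98%.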

stingraycharles 20 hours ago||||
I understand what you’re saying, but I am generally very careful when comparing models and their benchmarks; benchmarks often don’t really match “real world” quality.
yorwba 8 hours ago||
The technical report https://microsoft.ai/wp-content/uploads/2026/06/main_2026060... has a lot of detail about decontaminating their training data and developing new in-house benchmarks to ensure reliable evaluation. If other models were just overfit to public benchmarks while Microsoft produced something that generalizes better to unseen data, they could've used those in-house benchmarks to argue that point.

Instead, they only do cherry-picked comparisons against Anthropic's small models, and not the full spectrum of competitors.

Without evidence to the contrary, I'll interpret this as just what happens when you're late to the party and insist on doing everything from scratch.

Maybe coaxing reasoning behavior out of their base model without kickstarting it by distilling from existing models provided them with valuable experience that will help improve their future models, or maybe it was an unnecessary waste of time.

fmajid 3 hours ago||
If their model was trained purely on properly licensed data, the reduced legal liability could be a selling point
davecitron 10 hours ago|||
[dead]
minraws 21 hours ago|||
They did release MAI-Thinking-1 to compete with Sonnet. Not sure why that isn't at the top here.
ignoramous 6 hours ago|||
Can't yet use MAI-Thinking-1? [0] And no indication of it being made available in GitHub Copilot, either.

[0] Not even here: https://playground.microsoft.ai/

giancarlostoro 21 hours ago||||
Good question, and I missed that entirely!
lostmsu 17 hours ago|||
Compete? It is behind Kimi K2.6, which is in turn way behind Sonnet.
kristjansson 21 hours ago|||
> 137B-A5B

Yeah, not a 5B param model as the earlier title implied!

epolanski 10 hours ago|||
So what other models use less than half of Haiku's tokens while providing higher success rate?
akie 10 hours ago||
Why is Haiku the benchmark, though? With code generation, don't we primarily care about the quality of the code, not the speed or efficiency at which it's generated?
NitpickLawyer 9 hours ago|||
You would be surprised how much code haiku writes behind the scenes. With the whole 'plan w/ opus, spawn subagents w/ haiku' that cc does. And you'd be surprised how useful the small models can be under some guidance / hand holding. You can daily-drive gpt5-mini and still find it useful. They're not as good as the big ones, obviously, and can't handle a project start-to-finish on their own, but given a well-scoped task, they'll do it just fine.
epolanski 10 hours ago|||
I'm not sure I follow, but I'll give you a very fresh example.

I was implementing a re-print functionality in my warehouse management system.

It took Opus 4.8 high 24m1s and 87k tokens. Took Haiku 6m30s and 41k tokens.

After that time I had to provide (minor) adjustments to both. But Haiku allowed me to iterate faster. Code quality for that somewhat trivial use case was similar.

Actually, I would even say that Opus provided a subpar solution: instead of fixing an issue where the carrier label PDF wasn't saved as the state machine progressed to the latest step, it went for a much more complex solution of regenerating those from scratch. Which is also wrong, as it was de facto booking the carriers twice for the same order.

Haiku simply added another field on the terminal state that carried the already generated urls.

I don't think it's a good idea to default to highest effort/bigger model without taking into account the time it takes and the task complexity.

Imho we should experiment rather than assume that what the rest of the community does is the best practice.

vinzenzu 9 hours ago||
Totally agree. I've been using cheap Chinese open-source models via OpenCode Go, and they are faster, cheaper and in my experience arrive at the solution quicker because they are more pragmatic.

Yesterday Codex was making a big issue out of a new module that was upgraded in our cluster and because of which the same SSH key would be "regenerated" by Terraform. No big deal, it just truncates a newline at the end of the SSH key and it works all the same. But not being aware that this, as an example, is unimportant can cost a lot more time than using the big models saves.

easygenes 13 hours ago|||
While I agree directionally, I'll caveat that "cost per token" != "cost per task". In the case of Qwen3.6 it tends to think 1.6x more than Haiku, so the cost of Haiku on the same tasks tends to only be about double. More detail from comparing their Artificial Analysis metrics:

  Qwen3.6-35B-A3B   vs   Claude Haiku 4.5
    reasoning mode · AA Intelligence Index v4.0
  
  46.0 ┤   ↖ better — cheaper · smarter · faster
       │
       │
  44.0 ┤     ╭─────╮
       │     │  ●  │ Qwen3.6-35B-A3B
       │     ╰─────╯
  42.0 ┤
       │
       │
  40.0 ┤
       │
       │
  38.0 ┤                                       ╭───╮
       │                      Claude Haiku 4.5 │ ○ │
       │                                       ╰───╯
  36.0 ┤
       └┬─────────┬─────────┬─────────┬─────────┬────────┬
        $200    $300      $400      $500      $600    $700
  
    x → cost to run the index (USD)        lower is better
    y → AA intelligence index              higher is better
  
    bubble area = output speed (tokens / sec)
          ╭─────╮                  ╭───╮
          │  ●  │ Qwen ~196 t/s    │ ○ │ Haiku ~93 t/s
          ╰─────╯                  ╰───╯
  
    ┌─────────────────────┬──────────┬──────────┬───────────┐
    │ model               │ AA index │ run cost │ out speed │
    ├─────────────────────┼──────────┼──────────┼───────────┤
    │ Qwen3.6-35B-A3B    ●│   43.5   │   $280   │  196 t/s  │
    │ Claude Haiku 4.5   ○│   37.1   │   $620   │   93 t/s  │
    └─────────────────────┴──────────┴──────────┴───────────┘


    COST PER TOKEN   ≠   COST PER TASK  
    output tokens per index run:
       Haiku 4.5    87.3M   (79.3M reasoning + 8.0M answer)
       Qwen3.6     143.2M   (131.7M reasoning + 11.5M answer)
       → Qwen emits 1.64× more output
  
    ── output speed (tokens / sec) ──────────  raw rate · higher = faster
       Qwen3.6     100%   ~196 t/s
       Haiku 4.5   ~47%   ~93 t/s
                                                  → Qwen ~2.1× faster per token
  
          ╎   1.64× more tokens  <  2.1× faster rate
          ▼
  
    ── solution speed (per finished answer) ──  higher = faster
       Qwen3.6     100%
       Haiku 4.5   ~78%
                                                  → Qwen ~1.3× FASTER to a solution
  
    SCORECARD
                            intelligence    cost / task     speed to solution
     Qwen3.6-35B-A3B        43.5            $280            ~1.3× faster 
     Claude Haiku 4.5       37.1            $620            (slower)
  
     → Qwen wins all three. The reasoning blow-up (1.64×) is smaller than
       the raw-speed edge (2.1×), so Qwen stays ahead per task.
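
The scorecard's ratios can be reproduced from the figures in the table above. A rough sketch, assuming time to solution is simply output tokens divided by output speed (this ignores latency, input tokens, and per-task variance); the dictionary layout and variable names are illustrative:

```python
# Figures copied from the comparison above (output tokens in millions).
figures = {
    "Qwen3.6-35B-A3B": {"output_tokens_m": 143.2, "tokens_per_sec": 196, "run_cost_usd": 280},
    "Claude Haiku 4.5": {"output_tokens_m": 87.3, "tokens_per_sec": 93, "run_cost_usd": 620},
}
qwen = figures["Qwen3.6-35B-A3B"]
haiku = figures["Claude Haiku 4.5"]

# How many more output tokens Qwen emits for the same index run.
token_ratio = qwen["output_tokens_m"] / haiku["output_tokens_m"]

# How much faster Qwen emits each token.
speed_ratio = qwen["tokens_per_sec"] / haiku["tokens_per_sec"]

# Assumed model: time = tokens / speed, so the per-task speedup is the
# raw speed advantage divided by the extra tokens emitted.
solution_speedup = speed_ratio / token_ratio

# Cost of Haiku relative to Qwen for the whole index run.
cost_ratio = haiku["run_cost_usd"] / qwen["run_cost_usd"]

print(f"tokens: {token_ratio:.2f}x, speed: {speed_ratio:.2f}x, "
      f"speedup per task: {solution_speedup:.2f}x, Haiku cost: {cost_ratio:.2f}x")
```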
HarHarVeryFunny 4 hours ago||
How did you get that nicely formatted graph and table in your post ?!
Krysoph 4 hours ago||
> Text after a blank line that is indented by two or more spaces is formatted as code.

https://news.ycombinator.com/formatdoc

  crimes ↑
         │
   10.0  ┤                                           ● Airport burger
         │                                      ╭──────────────╮
    8.0  ┤                                      │  theft arc   │
         │                                      ╰──────────────╯
    6.0  ┤                         ● Five Guys
         │
    4.0  ┤              ● Food truck burger
         │
    2.0  ┤      ● McBurger
         │
    0.0  ┤ ● Homemade burger
         │
         └───────┬─────────┬─────────┬─────────┬─────────→ price
                $2        $8        $14       $22       $38

  ┌────────────────────┬────────┬──────────────┬────────────────────┐
  │ burger             │ price  │ crime index  │ expected behavior  │
  ├────────────────────┼────────┼──────────────┼────────────────────┤
  │ Homemade burger    │   $2   │          0.0 │ law-abiding citizen│
  │ McBurger           │   $6   │          1.4 │ steals extra napkin│
  │ Food truck burger  │  $11   │          3.1 │ lies about hunger  │
  │ Five Guys          │  $18   │          6.2 │ financial crime    │
  │ Airport burger     │  $34   │          9.7 │ enters villain arc │
  └────────────────────┴────────┴──────────────┴────────────────────┘

  conclusion: burger inflation is a gateway condiment
HarHarVeryFunny 3 hours ago||
Thanks, so in this case the value of "code formatting" is using a fixed-width font?

The next question is where did the "ASCII-art" graph and table come from? Are there sites to generate these?

Krysoph 1 hour ago|||
The code formatting puts the content into a <pre> which preserves spaces, indentation and line breaks.

Just built a tool for that: https://krysoph.github.io/UnicodeData/

It is a single HTML file with no dependencies; it takes JSON data and turns it into Unicode charts.

Source: https://github.com/Krysoph/UnicodeData

wetpaws 21 hours ago||
[dead]
bel8 21 hours ago||
It's a start and I welcome competition but I don't think I ever used small cloud models like Haiku 4.5. They are cute but for serious coding they tend to waste your expensive time.

And this certainly won't bring me back to GitHub Copilot, which I cancelled yesterday.

GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs: https://www.reddit.com/r/GithubCopilot

I have since changed to DeepSeek Flash on high, which is Sonnet+ level for almost free.

If I feel I still need smarter models I might sign up for $20/mo Codex to use GPT 5.5 which, in my opinion, is the best I can access right now.

fnordpiglet 19 hours ago||
I use larger models to organize work into a topologically sorted task graph and pin smaller models to the tasks depending on the complexity with a larger model evaluating the work and patching where necessary. This uses haiku quite often for routine work. I’m able to do multi hour highly complex work with superior results and a much lower bill as a result by doing this, with a parent orchestrator able to do a massive labor within a single context window by effectively organizing work and reviewing quality and integrating where needed. I don’t use haiku directly, but it’s often 30-40% of any major efforts token use. This further improves time to completion as well as cost - but I find haiku is better at following literal instructions and plans without “second guessing,” while opus class models second guess in their thinking constantly.

As such, haiku isn’t a waste of my time, it saves enormous amounts of time for me. But I spent a large amount of time building the orchestration system up front and iterating on it to get here. Interestingly, I found my experience as a director and later a distinguished engineer gave me the tools to build it and get it working well and reliably end to end: the dynamics of multi-agent workflows of varying capability are not a lot different from the dynamics of a 1000-engineer organization.
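
A minimal sketch of that kind of orchestration, not the actual system described above: tasks form a dependency graph, `graphlib.TopologicalSorter` from the Python standard library orders them, and each task is pinned to a model tier by a complexity score. The task names, the scores, the threshold, and the tier labels are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: name -> (complexity 1-5, set of prerequisite tasks).
TASKS = {
    "design_schema": (4, set()),
    "write_migration": (2, {"design_schema"}),
    "update_models": (2, {"design_schema"}),
    "add_endpoint": (3, {"update_models"}),
    "write_tests": (1, {"add_endpoint", "write_migration"}),
}


def pick_model(complexity: int, threshold: int = 3) -> str:
    """Assumed policy: simple tasks go to a small model, the rest to a large one."""
    return "small-model" if complexity < threshold else "large-model"


def plan(tasks):
    """Return (task, model) pairs in an order that respects every dependency."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: deps for name, (_, deps) in tasks.items()}
    order = TopologicalSorter(graph).static_order()
    return [(name, pick_model(tasks[name][0])) for name in order]


schedule = plan(TASKS)
for name, model in schedule:
    print(f"{name:16} -> {model}")
```

A real setup would also need the review step described above, where a larger model evaluates and patches the output of each task.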

pshirshov 18 hours ago|||
Everyone does that. But I don't find Haiku useful for actual coding tasks. Good to, ehm, generate commit messages and summaries.

In my tests, openweight Qwens and GLM are way better than it.

Phoenixhq 5 hours ago||||
Topologically sorted task graph is exactly right — the orchestrator/worker split maps cleanly to senior engineer delegating to juniors, where cheap models handle the leaf nodes fine.
lukevp 18 hours ago|||
Got anything from your orchestrator you could share that’s usable by others? Sounds like how I’d like to work but is difficult to get going from scratch
pshirshov 18 hours ago||
https://github.com/7mind/baboon - all the backends apart from C# and Scala ones were created automatically, same for LSP server, same for playground.
SwellJoe 18 hours ago|||
I've been doing benchmarking of various models for finding hard security bugs, and my faith in Haiku (and Sonnet, even) has dropped precipitously in the process. Self-hosted Qwen 3.6 27B consistently outperforms both for finding security bugs, which was a shocking result. I expected Qwen to be around Haiku level, maybe a little worse, and I definitely expected it to be worse than Sonnet.

And, DeepSeek and MiMo perform much better than Haiku and Sonnet, near Opus/GPT 5.5 levels, at a fraction of the cost.

There's seemingly no reason to ever use Haiku or Sonnet, if you're not getting it for free or as part of a subscription (that you don't usually saturate).

SyneRyder 2 hours ago|||
I don't suppose you've had a chance to benchmark MiniMax V3 yet? I've only just started testing other models after being an Anthropic fan. I haven't put MiniMax V3 to coding tasks yet, but something about my early simple tests has impressed me. The MiniMax API pricing is about 7% of Anthropic API prices (about matching Anthropic's subscription pricing).
gwerbin 18 hours ago||||
I don't think that's what these small models are for. They are for things like text summarization and generating a title for your AI session. Maybe Haiku occupies a weird zone where it's overpowered for those tasks but underpowered for anything more sophisticated. But for example I used it on an agentic reasoning task recently (reading a chunk of information and drawing a written conclusion, not writing code) and it did just fine. More powerful model would have been a waste of money.
SwellJoe 18 hours ago|||
Sure, but it's priced higher than many better models. I'm not saying use the biggest models for everything. I'm saying Haiku is not a great deal as small models go. You can even self-host a model that is competitive if you've got a pretty beefy machine.

Haiku costs $1/$5. DeepSeek V4 Flash, a stronger model, is only $0.0028/$0.14/$0.28. That first number is the cached input, and DeepSeek caching is crazy efficient. So, using DeepSeek V4 Flash costs about an order of magnitude less than Haiku and performs better.

I have a Claude subscription because I'm willing to pay a premium for the best model for coding, one that doesn't waste as much of my time doing dumb stuff. But, if I need something other than Claude Code, I'm using something other than Claude models. Why burn money for no benefit?

Oh, also, Haiku chews tokens like crazy. In my benchmarks it used three times more tokens than the next highest model. Of course, security bug hunting is not in its wheelhouse, so it's not fair to judge it based on that one thing, but if it's more expensive per token and burns a lot more tokens, it ends up being a lot more expensive.
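
To make the comparison concrete, here is a rough sketch of the cost of one workload at the prices quoted above. It assumes the prices are USD per million tokens and uses a made-up workload (10M input tokens, 1M output tokens, 80% cache hits). No cached price for Haiku is given in this thread, so all Haiku input is billed at the full rate, which may overstate its cost.

```python
def workload_cost(input_m, output_m, price_in, price_out,
                  price_cached=None, cache_hit_rate=0.0):
    """Cost in USD for a workload measured in millions of tokens."""
    if price_cached is None:
        # No cached rate known: bill every input token at the full price.
        price_cached, cache_hit_rate = price_in, 0.0
    cached = input_m * cache_hit_rate
    fresh = input_m - cached
    return fresh * price_in + cached * price_cached + output_m * price_out


# Hypothetical workload: 10M input tokens, 1M output tokens.
haiku = workload_cost(10, 1, price_in=1.00, price_out=5.00)
deepseek_uncached = workload_cost(10, 1, price_in=0.14, price_out=0.28)
deepseek_cached = workload_cost(10, 1, price_in=0.14, price_out=0.28,
                                price_cached=0.0028, cache_hit_rate=0.8)

print(f"Haiku:               ${haiku:.2f}")
print(f"DeepSeek (no cache): ${deepseek_uncached:.2f}")
print(f"DeepSeek (80% hits): ${deepseek_cached:.4f}")
```

Even with no cache hits at all, the gap on this workload is roughly ninefold, before counting any difference in how many tokens each model burns.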

hadlock 17 hours ago||
I suspect the outrageous pricing of haiku/sonnet is offsetting the cost of opus. The value proposition a year ago was they were cheaper than opus, not that they're a fantastic value (which they're not)
not_kurt_godel 18 hours ago|||
Haiku/Flash/small models are underpowered for literally anything where being non-false-positively correct on details matters at least like 25%. (That's not to say they are only correct 25% of the time, it's definitely more than that, but they're blatantly confidently wrong often enough that the wasted time is a significant net negative for me, even on relatively trivial tasks.)
canpan 11 hours ago||||
Same opinion. Opus is best for coding, but Qwen 3.6 27b Q8 is next, before Sonnet.

Sonnet might have more knowledge and is maybe good for making excel sheets, but it does not write good code and does not follow instructions well.

But 27b Q8 needs a very beefy PC (48GB VRAM or more), so it is not an option many people can use and DS4F is so cheap right now, if you are open to externally hosted models.

egeozcan 12 hours ago|||
DeepSeek competes with Sonnet, not significantly worse or better. It tends to do weird things in codebases on the bigger side.
SwellJoe 12 hours ago||
At $3/$15, Sonnet is more than an order of magnitude more expensive than DeepSeek at $0.435/$0.87 (with cached input pricing of $0.003625, DeepSeek is very good at caching, so it's very cheap to use). So, if they're equal in performance, DeepSeek is ten times better value.

But, from what I can tell DeepSeek is better than Sonnet, though I agree it is not at the level of current Opus or GPT 5.5 (but I think it probably beats Gemini Pro 3.1). I use the best model I can for code, because the cost of weaker performance is more than the $100/month I pay for Claude Opus, but it's worth knowing there are very cheap, very good, models for stuff I want to do that isn't Claude Code.

egeozcan 11 hours ago||
I think there are so many variables from harnesses to tasks, making it very hard to put the models to a pecking order unless one beats another in virtually every task (like in Opus vs DeepSeek).

But all in all, I don't think we disagree.

GaryBluto 20 hours ago|||
Almost exactly the same story here. I've also had little to no refusals from DeepSeek, with its Chinese values meaning substantially less friction when it comes to things like reverse engineering, finding copyrighted files, working with dubiously-sourced source code, et cetera. I don't think I'd go back to Copilot even if they dropped prices by 90%.
papascrubs 18 hours ago||
Are you purchasing directly from DeepSeek? Any concerns as far as privacy or data protection?
GaryBluto 17 hours ago|||
Using OpenRouter, going to migrate to DeepSeek's official API soon. I'm not using it for anything commercial or for private data so I have no privacy qualms.
papascrubs 17 hours ago||
Makes sense. Privacy is my only real hang-up with DeepSeek. Both of the big SOTA providers have become extremely filtered. Things that I could do one version ago are now getting refusals. Anthropic is almost unusable. ChatGPT is slightly better. Even with a "cyber exception" in place and a vetted account. They are going to force me to take my business elsewhere.
treesknees 15 hours ago|||
GitHub Copilot refuses to do any security testing or proof-of-concepts for exploits. While I understand why, we pay for Enterprise and I’m working on our proprietary code base. It’s incredibly annoying.

I’ve actually had luck taking the analysis from GHCP and pasting it into our M365 Copilot and getting a useful poc to stick into my bug reports.

yowlingcat 3 hours ago|||
You can always run deepseek yourself, v4-pro and flash are open weights. It's a little tricky to get the hang of self deploying open weight models but you do fully own your deployment substrate and privacy narrative at that point.
ignoramous 6 hours ago|||
> Any concerns as far as privacy or data protection?

We moved to OpenCode Go ($10/mo), so we could switch between DeepSeek v4, GLM 5.1, and Qwen 3.7 models run by providers in EU, US, & Singapore that OpenCode FAQ claims don't use retained data for training.

  What about data and privacy?

  The [OpenCode Go] plan is designed primarily for international users, with models hosted in the US, EU, and Singapore for stable global access. Our providers follow a zero-retention policy and do not use your data for model training.

I find that their rather verbose privacy policy does not make far-reaching guarantees about any of this, though: https://opencode.ai/legal/privacy-policy
lambda 19 hours ago|||
Yeah, seems like this is in the range of Qwen 3.6, Gemma 4, Nemotron 3 Super, and the like. There are a lot of models, including much smaller, cheaper ones (like Qwen 3.6 35B-A3B), that are similarly competitive with Haiku. I can run these on my laptop, I don't need to rent them from Microsoft.

I suppose if you're reeling at the new Copilot bill but want to stay in their ecosystem, this gives you something to use, but for most folks, there's a plethora of better options.

Hfuffzehn 7 hours ago|||
Agreed. Seems like this could have been a nice model if we were still in the old GitHub Copilot free request/premium multiplier mode. It could have been a good compromise to somehow rein in the costs for Microsoft.

But with Copilot now just charging per-token prices, I don't see how this is competitive with Chinese models.

It is probably telling that you can't find the costs in the announcement. Because Input $0.75, Cached input $0.075, Output $4.50 might be competitive with Haiku, but nobody in their right mind uses Haiku, and Anthropic has abandoned it, chasing the tokenmaxers who aren't thinking about budgets.

So I guess they are aiming for corporate customers that are bound to Microsoft through compliance approval, will soon start seeing their budgets explode, and have to find some corporate compromise.

hparadiz 20 hours ago|||
The $20/month ChatGPT plan that comes with Codex is good value. Even just having premium ChatGPT is nice. I get rate limited regularly but it still lets me do most things.
tedggh 20 hours ago||
The $100/month is excellent value. I don’t understand how’s that not the default option for all professional developers. Unless people don’t produce any value writing code, like playing around and experimenting with vibe coding, I understand. But if software development is your actual income, and assuming you live in a wealthy country, $100/month is nothing for a tool like Codex.
perching_aix 18 hours ago|||
Picked up the most recent SO developer survey that features relevant info, the 2024 release: https://survey.stackoverflow.co/2024/work#coding-outside-of-...

The supermajority of respondents did report that they do engage in some coding outside of working hours, for one reason or another. I'm impressed; I'm basically a zombie after hours, rarely in any shape to touch anything technical. Good for them.

But then only 19.3% of respondents ticked that they code for freelancing reasons, and only 15% said they're doing it in an attempt to bootstrap a business. These groups were the only types that suggested revenue generating after-hours activity, and they even overlap to a non-obvious-to-me extent. But even if we pretended they didn't, that adds up to like a third at best.

So when you say:

> I don’t understand how’s that not the default option for all professional developers.

that's in contradiction with this data (and imo common sense), which suggests that the supermajority of professional developers simply do not perform revenue generating software development activity outside of work hours, period. Therefore, for them, the ROI on any potential AI subscription is a flat and constant zero.

Unless you envision people working at "bring your own license" type shops, I don't know how this is supposed to make sense. These are work tools, corporate should be providing them already. But then I'm clearly not from a "wealthy" country either, so YMMV.

hparadiz 20 hours ago||||
Work pays for my work stuff and I have both claude and codex there. On the personal side I sometimes go days without using it. It's more like my assistant to do annoying terminal shit on my home computer and like personal projects I guess. It's plenty for that.
sebra 9 hours ago||||
It's because that price point is for individuals not for companies. So my company can't pay for the $100 plan unlike with Claude. Only pay-as-you go pricing is available for companies beyond the $29 plan which runs out for me in 2/5 hours. And pay-as-you-go is insanely expensive.
59nadir 8 hours ago||||
I don't use LLMs for code generation except for very simple, small things because they suck at it and I wouldn't want to ship what they write.

Since I use LLMs basically only for analysis and as a signal in bug discovery, debugging, research and general search, I don't need a very powerful model and I don't need high token counts. A $100 subscription would be entirely way too much for useful usage for me, and would border on just using tokens for the sake of using them.

veber-alex 16 hours ago||||
Every developer who writes code for a living should get an AI subscription from work and not have to pay for it himself.
KronisLV 4 hours ago|||
I don’t live in a wealthy country and my salary isn’t that great, but Anthropic’s 100 USD tier is still worth it for me. I’d probably go with a 50 USD tier if they had one but oh well. I’m also looking at DeepSeek since they permanently lowered their prices and feel like I could probably add the cheaper Codex tier to the list (you really feel the limits with the cheaper Anthropic one though).
Aperocky 3 hours ago|||
If you use claude-code Haiku is used under the hood for certain task. I'm not sure what it is, but there's some kind of routing that goes to Haiku automatically.
nate 20 hours ago|||
The small stuff has its place. I have this Safari extension and needed a way to quickly title people's chat histories. Haiku is the fast, cheap thing to come up with decent titles for blocks of text. I feel like there's a bunch of those little things lying around that you need a model for. I'm even finding Apple's Foundation Model is super useful for stuff like that. Even summarizing an article. It's like equally awful at doing it, but gets enough done to still be useful as a way to be like "oh yeah, this article is actually worth reading"
seanlinehan 20 hours ago||
Small models are super useful. But I'm skeptical of their use for coding in particular, which is what this model is advertised for.
alkonaut 20 hours ago|||
Won’t (presumably) all the market actors converge on similar pricing? If OpenAI stopped operating on subsidies and charge the true costs and their most token hungry customers are the ones that switch to Anthropic and others, then their pricing model switch will also be around the corner.

Unless of course we’re thinking Copilot will be more expensive than others longer term. But is that a reasonable assumption?

stefan_ 20 hours ago||
Anthropic & co charge API users much more, not least to demolish the middlemen low-effort plays like Cursor and Copilot. To not own the model is not viable in 2026.
swores 19 hours ago|||
Sorry, what do you mean by "To not own the model is not viable in 2026."

I assume I'm misunderstanding you (likely my fault), because the way I read that is that you're saying nobody should currently be using models owned & hosted by companies like OpenAI and Anthropic, while clearly a huge number of people are using those in 2026 despite not owning them.

roywiggins 13 hours ago||
It's that companies like copilot/cursor are in real trouble if they are in the business of reselling expensive Anthropic tokens
HarHarVeryFunny 4 hours ago||
But isn't the current understanding that the harness is as important as the model once you get above a certain threshold? So there seems to be room to add value there.

Cursor is potentially about to be acquired by X.ai (i.e. SpaceX), unless this is just some IPO game being played by Musk. They are certainly not just a token reseller since they have their own models in addition to their own vector database approach for code matching.

eli 17 hours ago|||
I think it’s more correct to say they charge subscription users much much less. I assume less even than the cost of providing the inference, if you actually are using it.
vidarh 19 hours ago|||
Haiku does quite well if given a detailed plan. That means much more detail than you otherwise would, but you can still save over e.g. having Opus or Sonnet do everything by having them expand their initial plans into more specific levels of detail and feed it to Haiku (or similar level models).

I personally wouldn't use models of that class directly, though - I'd use them in a harness as a "backend" for more capable models. And Haiku itself, as opposed to other smaller models, is still expensive.

eli 17 hours ago|||
Makes sense as part of a larger coding workflow, especially if it’s fast. Using a trillion parameter model to figure out how to call a targeted edit tool or generate a commit message is a waste. Also narrow tasks like “make the background darker” or “rename this function and update callers”
roywiggins 13 hours ago||
> “rename this function and update callers”

I'm old enough to remember when IDEs could do this without needing a couple gigabytes of matrices to do it

(LLMs are great for anything even slightly more complicated ofc)

Hfuffzehn 6 hours ago||
The first time I was impressed by AI coding was when I pointed it at some switch case monster code and told it to replace it with a strategy pattern.

And it did just fine.

So no matter what you think about vibe coding, using AI for these slightly more complicated use cases is genuinely useful.

verdverm 20 hours ago|||
I've been having really good results with DeepSeek-v4-flash, qwen-3.6-moe, and the older gemini-3-flash-preview. (recent geminis suck hard)

Small models are more than enough for the majority of tasks these days. Plan and review with the bigger ones, let the little ones explore and implement.
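
A minimal sketch of that split, assuming a harness that lets you pick a model per phase. The tier names `large-model` and `small-model` and the four phases are placeholders for illustration, not identifiers from any particular provider or tool:

```python
from enum import Enum


class Phase(Enum):
    PLAN = "plan"
    EXPLORE = "explore"
    IMPLEMENT = "implement"
    REVIEW = "review"


# Hypothetical tier names; substitute whatever model IDs your provider exposes.
LARGE_MODEL = "large-model"
SMALL_MODEL = "small-model"

# Planning and review go to the larger model,
# exploration and implementation go to the smaller one.
PHASE_TO_MODEL = {
    Phase.PLAN: LARGE_MODEL,
    Phase.REVIEW: LARGE_MODEL,
    Phase.EXPLORE: SMALL_MODEL,
    Phase.IMPLEMENT: SMALL_MODEL,
}


def pick_model(phase: Phase) -> str:
    """Return the model tier assigned to a workflow phase."""
    return PHASE_TO_MODEL[phase]


def route(tasks: list[tuple[Phase, str]]) -> list[tuple[str, str]]:
    """Pair each task description with the model that should handle it."""
    return [(pick_model(phase), description) for phase, description in tasks]
```

Keeping the mapping in one table makes it cheap to move a phase to a different tier when a small model turns out to be good enough (or not).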

OpenCode Go is $10/month for the open weight models with nice quotas: https://opencode.ai/go

chillfox 17 hours ago||
You don’t have to limit yourself to the tiny models with the OpenCode Go plan, you can get a lot of usage from the bigger models if you keep the cache hot.

I am about 85% through my quota with 9 days left before refresh and have just used over 1B tokens, mostly DeepSeek V4 Pro, but also a little mimo 2.5 pro and kimi k2.6

verdverm 16 hours ago||
For sure, I've been flipping between flash/pro (or the equivalent for other families), been trying to stick to one family per project as a way to test them out independently over longer periods and more realistic/diverse tasks. I've definitely spent more quota on pro and pushed more tokens through flash.
bbstats 17 hours ago|||
What application/UI are you using deep seek flash high on? Still copilot or something else
partiallypro 20 hours ago|||
> "GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs"

AI is expensive and it has been heavily subsidized. If you think $20/mo flat for Codex/Claude will hold up against a more usage-based model, you're in for a shock, especially once these companies go public and have to meet investor expectations.

epolanski 10 hours ago|||
> They are cute but for serious coding they tend to waste your expensive time.

90% of corporate job tasks are trivial enough that Haiku can handle them.

Just this morning I have been implementing a reprint functionality in our warehouse management system, which needed to print again carrier labels and delivery notes for a specific order.

It essentially had to do the same workflow of print, but instead of generating and uploading the pdfs, it only had to fetch and print them.

Took Opus 4.8 high 24m1 seconds and 87k tokens. Took Haiku 6m30 seconds and half the tokens.

So I'm not really sure what you mean by "wasting your expensive time" here. I think you don't really experiment with these tools and assume higher effort, bigger model => time saved, but that's true only when tasks are much bigger and complex enough that a smaller/less precise model would fail or land work of much lower quality.

bel8 9 hours ago||
Unfortunately there's no defending Haiku 4.5 at this point when cheaper and better options are available.

TLDR:

https://artificialanalysis.ai/models?models=gemini-3-5-flash...

and: https://i.imgur.com/nTu3VCZ.png

For starters, I did experiment a heck of a lot with models since GitHub Copilot gave me access to OpenAI, Gemini and Anthropic models. So I probably experimented more than the average LLMer. When GitHub Copilot had a generous quota I quite often ran the same tasks with many models to compare them (and pursue the best solution among them).

Now about my experience with Haiku, I think it was free for some time in GitHub Copilot, then it was 0.33x quota usage (when Sonnet was 1x and Opus was 3x, good times). I tried to use it for light coding for about a week.

In my tests I concluded that there was zero reason to use 0.33x priced Haiku in my coding workload because it constantly generated subpar solutions. Even when they worked, Sonnet at 1x and Opus at 3x quota usage had a lot less tech debt on average and my plan permitted continuous Sonnet/Opus usage for my workload, otherwise I would use Gemini Flash (the old one, not this 3.5 one) which was better than Haiku by a mile.

Then GPT 5.4 came at 1x quota usage and it was competitive with Opus at 3x quota usage. So I stopped using Opus in favor of GPT and by this time there was even less reason to use Haiku on my $39/mo GitHub Copilot plan.

And now we have DeepSeek v4 which is Sonnet+ levels in my tests because it has an actual 1 million token context window and their crazy alien caching tech (https://huggingface.co/blog/deepseekv4).

I urge you to throw $5 at OpenCode Go plan for 30 days and toy around with DeepSeek Flash on high setting (not max).

Or MiMo 2.5 Pro on the same OpenCode Go plan. 2 amazing models.

ignoramous 6 hours ago||
> DeepSeek Flash on high setting

In your experience, is max worse or you suggest it for less token use?

> MiMo 2.5 Pro on the same OpenCode Go

Xiaomi dropped MiMo 2.5 rates by 70%+ [0] & now it is cost competitive with DeepSeek v4 Pro. I haven't used MiMo, but since you have, do you find it to be better than DeepSeek v4? If so, for what tasks? How do you decide when to use which, if you have an intuition for it? Thanks.

[0] https://news.ycombinator.com/item?id=48282814

bel8 3 hours ago||
> In your experience, is max worse or you suggest it for less token use?

Yes. DS4 Flash max is incredibly chatty for minimal gain over DS4 high.

I asked the same question a month ago: https://news.ycombinator.com/item?id=47978820 and confirmed in my tests.

> ...MiMo, but since you have, do you find it to be better than DeepSeek v4?

I didn't test MiMo 2.5 enough to form a verdict, but from initial tests it is equivalent to DS4. MiMo 2.5 (non Pro) has the advantage of vision capability, and MiMo is now priced the same as DeepSeek v4 in the $10/mo OpenCode Go plan, after the discount you mentioned; see the yellow bars at https://opencode.ai/go

I'll start testing MiMo seriously next week.

LoganDark 17 hours ago|||
I really hope one day there is something like Opus 4.8 but with Cerebras' speed -- they reach over 1,000t/s on gpt-oss-120b but that model is seemingly not even properly trained for tool calling. But watching it slam out several entire screens of thinking/reasoning per second is amazing. I'd love that with Opus quality.
sheeshkebab 16 hours ago||
I like gpt oss - great model even if not too smart. It runs on my laptop at over 100 t/s and has a certain tone that I like over all these qwens stuck up their asses.
emsign 20 hours ago||
I wonder when THEY make it illegal to vote with your wallet.
hmokiguess 21 hours ago||
Does anyone actually use these smaller models for coding? If so, how? I usually Opus everything. Is the play to plan/design/architect with a heavier model, then delegate structured tasks to these smaller ones? I'd appreciate hearing from someone who has done and tested both paths.
linuxhansl 21 hours ago||
I am using Opus 4.x at work, and these "smaller" (20-80bn, 3-4bn active) models at home. Unfortunately there is no comparison, yet (IMHO anyway).

With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.

The smaller models seem to "try". They work for smaller tasks, but for more complex tasks it's often more work than doing it myself.

I wish it were different, and maybe in a year or two it will be.

therealdrag0 17 hours ago||
Ya when it’s your own token budget on the line the smaller/cheaper models are more attractive.

I’ve used GPT mini quite a bit and it’s decent.

motoboi 1 hour ago|||
Unless you are token rich, you'll have to find a way pretty soon.

For tasks (like kubernetes, linux, reports, database exploration and such) I use GLM5.1. Faster is actually smarter in those cases. And much cheaper too.

Opus 4.8 is for the unknown. Things I don't know how to do myself.

0123456789ABCDE 21 hours ago|||
>Is the play to plan/design/architect with a heavier model, then delegate structured tasks to these smaller ones?

always has been

claude code has opusplan — uses opus while in plan mode, switches to sonnet for execution.

https://code.claude.com/docs/en/model-config#opusplan-model-...

edit: you can make it work with sonnet for planning, and haiku for execution, or any other combination you fancy to work with.

https://code.claude.com/docs/en/model-config#control-the-mod...

hedgehog 20 hours ago|||
Yes. Divide execution of a change into separate responsibilities. Designate the main chat as the "orchestrator", Opus. You designate a goal, then tell it to grind until it gets there using the following sub-agents in sequence:

1. Step execution (Sonnet): Work for 30 minutes / 100k tokens at the direction of the Orchestrator

2. Review (Opus): Scrutinize the previous step's work for errors, fidelity to the instructions, fix those and record opportunities to improve the agent configuration + tools to reduce errors and token usage (record those to a file).

3. Self-improvement (Opus): Implement the highest impact self-improvement items that don't require user intervention.

Repeat: Until orchestrator session token budget exhausted (set it to 1M or whatever).

The underlying rationale is to keep each step manageable to maximize adherence to instructions and minimize cost (even cached tokens cost something). Prompt tokens are much cheaper than generated, so to the extent Opus mostly reviews rather than drives that saves a lot too. Self-improvement steps are very expensive but the improvements compound, if you're going to run a job for days or weeks it's way more expensive not to do them.

Edit: I do this in Claude Code with the Anthropic models as well as Qwen family models for offline use.
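
A rough sketch of the loop described above, not the actual configuration. `run_agent` is a hypothetical callable standing in for however your harness launches a sub-agent; it is assumed to return the tokens that step consumed. The per-step cap (30 minutes / 100k tokens) is left to that runner and is not modeled here:

```python
from dataclasses import dataclass, field
from typing import Callable

# (role, model tier) in the order described above; tier names are placeholders.
STEPS = [
    ("execute", "sonnet"),
    ("review", "opus"),
    ("self_improve", "opus"),
]

# A runner takes (role, model, goal) and returns the tokens it consumed.
AgentRunner = Callable[[str, str, str], int]


@dataclass
class Orchestrator:
    goal: str
    run_agent: AgentRunner  # hypothetical hook into your harness
    token_budget: int = 1_000_000
    tokens_used: int = 0
    log: list[tuple[int, str, str, int]] = field(default_factory=list)

    def run(self) -> int:
        """Cycle through the steps until the budget is spent.

        Returns the number of fully completed rounds.
        """
        rounds = 0
        while True:
            for role, model in STEPS:
                # Check the budget before every step, not just every round.
                if self.tokens_used >= self.token_budget:
                    return rounds
                used = self.run_agent(role, model, self.goal)
                if used <= 0:
                    raise ValueError("run_agent must report a positive token count")
                self.tokens_used += used
                self.log.append((rounds, role, model, used))
            rounds += 1
```

Checking the budget before each step means the loop can stop mid-round, so a run never starts an expensive review or self-improvement step it has no budget left for.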

pkaye 18 hours ago|||
Because the Haiku model is quite cheap but doesn't screw up too often I used it for interactive coding for my existing projects on the older copilot plans.

For simple features I don't have a full plan worked out. I write a bit of code, then tell the model in a one-line prompt what it should do. Sometimes I put temporary comments in the code to give it guidance. Generally if the code change is within a file or package, Haiku is good enough to follow what you ask and not mess up too much. I also have skills created over time to give it guidance. There were some months with GitHub Copilot when I had excess credits left at the end of the month that I would frantically try to use up.

Even the AI code completions can be pretty good on their own. Sometimes I write some temporary comments describing what the code should do and just press Tab-Tab-Tab and the entire function is done.

I think there is a tendency for people to go for the advanced models thinking they will screw up less, but if you really understand the code it's easier to do it interactively with a lesser model.

ojr 21 hours ago|||
I use Gemini 3 Flash, I've seen the Claude Code setups, bullish on Anthropic people are driving up tokens but I am able to produce outcomes with a fraction of the money.
hmokiguess 21 hours ago|||
Do you mind sharing your workflow? What do you mean by fraction of the money, in my case personally, I'm yet to reach a session limit on the subscription plan. I'm not "tokenmaxxing" as they say, so hard to see a scenario in which the plan is expensive for the value I get.
ojr 20 hours ago|||
I spend around $20 a month through API fees using my own harness, https://slidebits.com/isogen. Nothing too special: I prompt, it produces file changes using grep and vector search, and I can individually accept which files to change.

I also work on a consumer AI application https://apps.apple.com/us/app/slidebits-studio/id1138731130

For comparison someone showed me an internal company tool he was working on. He had Claude agents dangerously skipping permissions and firing up github branches through a vm sandbox just to make a single feature change. One agent to code and the other to review.

dist-epoch 21 hours ago|||
If you don't hit a limit running Opus, it means you are very much in the loop.

For example you probably don't have days where you ask Opus to review your whole code base and look for code duplication/technical debt/robustness issues, and then to fix some of the found issues, and do this 3-5 times until no big issues are found anymore.

hmokiguess 21 hours ago||
What’s your prompt for this, the way you described it made it seem like there’s a generalizable way I can go about this. I just rely on a testing pipeline instead so can’t think of why I would need to proactively find holes where tests haven’t already done that for me.
Marha01 21 hours ago|||
I use similar workflow. Here is my refactoring and code quality prompt that I regularly run:

    Perform a thorough analysis of the <project_name> project (the code and the documentation).
    - Explore the project, go over all important files one by one and look for any mistakes or possible bugs.
    - Look for refactoring opportunities and ways to improve code quality and organization.
    - Identify any potential cruft/bloat, to ensure our code is clean and logically laid out. Keep in mind that efficient and good quality code needs to avoid over-engineered constructs and needless complexity. Avoid complicated logic where simple solutions would be more elegant.
    - Pay attention to comments: There should be enough of them to document the intent and provide high-level overview of the code logic, but not too much; avoid/remove excessive comments that simply restate the code logic or do not provide any useful information.
      - Every important function should have a top-level docstring comment that clearly explains its purpose, high-level logic overview, arguments, and return values.
    - Analyze the names of constants/variables/functions/classes and other code elements: could some of them be renamed to make their purpose more clear?
    - Analyze the documentation, uncover any potential inaccuracies/omissions and ensure the docs reflect the code.
    - Brainstorm ideas for improvements of the code and docs.
    
    After you finish the analysis, save an analysis report into "<project_name>_analysis_report.md" in the project root folder.
hmokiguess 19 hours ago||
Thank you!
dist-epoch 21 hours ago|||
tests will not find inconsistent naming, duplicate functions, scenarios you have not thought about testing

I use quite plain prompts, nothing fancy:

> go over the tests and do a code review, focusing on how well they test inventory management, planner and controller. maybe some tests need to be deleted, maybe other tests need to be added. the end goal should be good coverage of the core features.

> do a code review, focusing on robustness/correctness issues. validate that the code correctly implements specification.md. focus on the async client.

> there was a big refactor. please do a code review, focusing on eliminating tech debt. look for unused, obsolete or duplicate code that can be removed, look for mismatched interfaces, inconsistent function/argument/variable names. do not output what is correct, just the issues you found. for each issue output instructions for a coding agent on how to fix it. do not nitpick.

hmokiguess 19 hours ago||
Noted, thank you! Appreciate it
bitexploder 10 hours ago|||
3 Flash is likely rather underrated here. It continues to impress me on few-shot tasks.
hocuspocus 4 hours ago||
GPT-5.4 mini seems noticeably better to me, token cost between Gemini 3 and 3.5 Flash.
veselin 21 hours ago|||
Claude code itself spins a lot of its subagents with Haiku. The model has low hallucination rate, so it is great for exploration tasks. I guess this is what the best purpose of this model here will be as well. Which is a lot of tokens - many tasks spin multiple exploration agents before the planning or fixing, that is then just a few tool calls.
killermouse0 21 hours ago|||
I was wondering the same. I guess it makes sense to use a heavy weight model to make the entire design and split the work so that smaller models (possibly local one?) would then do the coding... But how would I even do that? I'm using Claude Code. Would I need support for this within the harness ?
XCSme 18 hours ago|||
Not sure if it's considered small in any way, but DeepSeek V4 Flash is really decent.
axi0m 20 hours ago|||
From my experience, smaller models like Haiku 4.5 have indeed shown very convincing results on specific, scoped tasks (themselves generated by a more capable model such as Opus 4.6). We use this kind of workflow in production to optimize speed, efficiency, and costs.
lanthissa 21 hours ago|||
i used to use opus for everything, thats not an option once you move to a multi agent system unless you're working on like high end research. I could easily spend 3k a day if i was using opus as just a normal dev.

As we build a better and better harness and better feedback/verifiers we're switching more to 3.5 flash. I think chinese models would work too, but we cant use those atm.

Generally theres a coordinator running opus and an ever growing set of skills and subagents that take actions using weaker models and output feedback to the coordinator opus.

I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do and that will trend down as we better build harnesses for our own codebases.

cush 20 hours ago|||
Implicitly, yes. A lot of harnesses will invoke small models to do small changes, saving time and tokens.
newusertoday 21 hours ago|||
plan using opus execute using local
glaslong 21 hours ago|||
I keep trying to, because I really want to make qwen 3.6 35b work for end implementation of a fleshed out spec (mostly for local data privacy reasons).

...but I spend so much more time correcting it, or building pipelines to try, retry, and converge, that it's rarely worthwhile for me in either time or $ spent vs Opus.

ebbi 21 hours ago|||
I use it for smaller changes that I need to make, mainly on UI fixes or some easy logic fixes.
scotty79 21 hours ago|||
In DeepSWE anything from Anthropic is a whole class lower than what's achievable with gpt-5.5

So by using Opus you are using "smaller" model. Well, not really smaller, just worse. The actual smaller models can at least be faster.

altmanaltman 21 hours ago|||
I actually find planning/design easier with a smaller model and implementation with a larger one. I'm mostly manually working with the model on planning and design, the decisions are mine, and smaller models are faster. And when there's a clear design/way forward, the bigger models are usually better at understanding the overall context and applying the specific patch they were assigned to. I call it the 1-2 punch system, where you do the first light punch, then the harder punch when it's actually important to hit properly. I know it goes against the standard of throwing the biggest model at design, but in my experience the bigger models try to do TOO MUCH and take a lot of time, which is not good in the design/arch/boilerplate phase.
motoboi 1 hour ago||
To understand Microsoft's AI problems right now, observe that NONE of the models announced are available for use even in Microsoft Foundry, which is the place where you add models to your account.

I understand github copilot rollout takes time, but why can't we consume the models via microsoft own api after launching?

Anthropic models are available at foundry the same moment they are launched, but not Microsoft's own models.

chokolad 53 minutes ago|
To understand Microsoft's AI problems right now, observe the parent comment. It is literally false [1] but somehow creates a whole story of Microsoft ineptitude.

[1] https://github.blog/changelog/2026-06-02-mai-code-1-flash-is...

cwillu 19 hours ago||
What is with people reimplementing window scrolling badly?
illusive4080 17 hours ago||
Probably vibe coded. I use StopTheMadness to prevent it.
hankbond 17 hours ago||
Immediately noticed that and then closed out.
capten 22 hours ago||
It's so weird to me that the benchmarks remain so low, but the models are marketed as revolutionary. And if you say that low coding capabilities aren't a problem, say that to the token price hike and 'general use' model setup.

Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?

npn 20 hours ago||
from what I understand, it's because unlike the other models, MAI models haven't yet been fine-tuned against the synthetic datasets specifically designed to boost the benchmark scores.
redrove 21 hours ago||
It’s about bang for buck. That high a score for 5B params is pretty good, nigh unbelievable a short while ago.

It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.

Yet another reason the current buildout will feel like the railroads.

necubi 21 hours ago|||
It's 5B active params in MoE, not 5B total params (total is 137B).
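
To put numbers on that distinction, here is a rough sketch of the arithmetic using the 137B total / 5B active figures from this thread. The bit widths are illustrative assumptions (nothing here states what precision the weights ship in), and the size estimate covers weights only, ignoring KV cache and runtime overhead:

```python
def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of the parameters used for each token (both in billions)."""
    return active_params_b / total_params_b


def weight_size_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone in GB (1 GB = 1e9 bytes).

    params_b is in billions, so billions * bits / 8 gives gigabytes directly.
    """
    return params_b * bits_per_param / 8


TOTAL_B = 137.0   # total parameters, in billions (from the thread)
ACTIVE_B = 5.0    # active parameters per token, in billions (from the thread)

fraction = active_fraction(ACTIVE_B, TOTAL_B)
size_at_16_bit = weight_size_gb(TOTAL_B, 16)  # assumed precision
size_at_4_bit = weight_size_gb(TOTAL_B, 4)    # assumed quantization

print(f"active share per token: {fraction:.1%}")
print(f"all weights at 16-bit:  {size_at_16_bit:.1f} GB")
print(f"all weights at 4-bit:   {size_at_4_bit:.1f} GB")
```

Per-token compute scales with the active count, while the storage footprint scales with the total, which is why the two numbers tell different stories.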
bgirard 20 hours ago||||
> It’s about bang for buck.

Hard to know when they don't give the price per token. Presumably it will be comparable to a low-mid range model in terms of price. But otherwise their 'Ideal Zone' is meaningless without factoring in the price per token. I don't care how many tokens are being used, that's an implementation detail to me. I care about price / performance / latency.

Hfuffzehn 6 hours ago||
https://docs.github.com/en/copilot/reference/copilot-billing...

    Model              Input   Cached input   Output
    MAI-Code-1-Flash   $0.75   $0.075         $4.50

Flere-Imsaho 21 hours ago||||
Yeah the future is probably a number of highly specialised small models you can run on your own hardware rather than massive frontier models in the cloud.

That's what I'm betting on anyway.

thewebguyd 21 hours ago|||
That seems to be what Microsoft is betting on also based on what was shown at the BUILD keynote today + that new surface ultra and the surface mini PC with the new Nvidia chip. Nadella really played up local AI as the main use case they have in mind.
search_facility 21 hours ago||||
MoE basically works that way already. Qwen etc. with low active params (the A-number in the name) allow running inference on big models locally (only active params have to fit into memory)
girvo 19 hours ago|||
Step 3.7 Flash on my Asus GB10 based mini pc is incredibly close to that today. I’m very impressed, and that’s without MTP to boost performance
dist-epoch 21 hours ago|||
The SOTA models will not shrink, because the problems will get bigger, from "write me a C compiler" to "clone Stripe business and run it".
HarHarVeryFunny 4 hours ago||
There will always be tasks that are within reach of whatever the SOTA models are, but not of the cheaper, perhaps locally runnable ones. It seems that already people are finding Qwen 3.6 27B sufficient for many coding tasks (the llama.cpp author is now using it exclusively).

As models get better and smaller, I expect that we will rapidly (within a year?) get to the point where SOTA models are not needed for the vast majority of coding tasks, and even today it seems many people are just using them for the planning phase.

How many people drive Ferraris vs Fords? How many people driving a Ford would, on a utilitarian basis, be any better off driving a Ferrari?

So far there seems to be mainly two high volume use cases that have been found for LLMs - coding and business flow automation, and it seems neither of these need SOTA models. I wonder if there will continue to be enough market demand for massive expensive SOTA models to make them worthwhile developing?

AntiRush 22 hours ago||
The introductory blog post has a lot more information

https://microsoft.ai/news/introducingmai-code-1-flash/

and the model card

https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF

The broader announcement of 7 MAI models seems to be where the 5B active in the title comes from

https://microsoft.ai/news/building-a-hillclimbing-machine-la...

dang 21 hours ago|
Thanks! I've changed the top link to the blog post and put the other links in the toptext.
mekpro 2 hours ago||
The technical report is very detailed and will aid the 'reinforcement learning' of future researchers. Thanks, Microsoft!
eterevsky 17 hours ago||
They are comparing it to Haiku 4.5. Not Opus, not Sonnet, but Haiku, the smallest Anthropic model, 3 versions old.
lemonish97 16 hours ago|
4.5 is still the latest Haiku model
Hfuffzehn 4 hours ago|
So I guess the important link the marketing department forgot is this one: https://docs.github.com/en/copilot/reference/copilot-billing...

Model Input Cached input Output

MAI-Code-1-Flash $0.75 $0.075 $4.50

Comparing to

Claude Haiku 4.5 $1.00 $0.10 $5.00

looks fine.

But they also forgot to include the benchmarks comparing to

GPT-5.4 mini $0.75 $0.075 $4.50

Those would have been helpful.
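
For anyone who wants to turn the rates above into a per-job cost, a quick sketch. It assumes the prices are in USD per 1M tokens, which the table as pasted here does not state, and the token counts in the example are made up:

```python
# (input, cached input, output) prices copied from the table above.
# ASSUMPTION: USD per 1,000,000 tokens; the pasted table gives no unit.
PRICES = {
    "MAI-Code-1-Flash": (0.75, 0.075, 4.50),
    "Claude Haiku 4.5": (1.00, 0.10, 5.00),
    "GPT-5.4 mini": (0.75, 0.075, 4.50),
}

TOKENS_PER_PRICE_UNIT = 1_000_000


def job_cost(model: str, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one job, given fresh input, cached input and output tokens."""
    input_price, cached_price, output_price = PRICES[model]
    total = (
        input_tokens * input_price
        + cached_tokens * cached_price
        + output_tokens * output_price
    )
    return total / TOKENS_PER_PRICE_UNIT


# Hypothetical job: 1M fresh input, 4M cached input, 200k output tokens.
for name in PRICES:
    print(f"{name}: ${job_cost(name, 1_000_000, 4_000_000, 200_000):.2f}")
```

For that hypothetical job this comes to about $1.95 on MAI-Code-1-Flash versus $2.40 on Haiku 4.5, if the per-million assumption holds. Agentic sessions tend to be dominated by cached input, so the cached rate matters more than the headline input price.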

Hfuffzehn 4 hours ago|
And as I am on holiday today I will try to help them out:

                        GPT-5.4 mini   Haiku 4.5   MAI-Code
    SWE-Bench Pro       54.4%          35.2%       51.2%
    Terminal-Bench 2.0  60.0%          41.6%       54.8%

Source: https://openai.com/index/introducing-gpt-5-4-mini-and-nano/
