Posted by vinhnx 1/30/2026

Kimi K2.5 Technical Report [pdf](github.com)
387 points | 141 comments
zeroxfe 7 days ago|
I've been using this model (as a coding agent) for the past few days, and it's the first time I've felt that an open source model really competes with the big labs. So far it's been able to handle most things I've thrown at it. I'm almost hesitant to say that this is as good as Opus.
rubslopes 7 days ago||
Also my experience. I've been going back and forth between Opus and Kimi for the last few days, and, at least for my CRUD webapps, I would say they are both on the same level.
armcat 7 days ago|||
Out of curiosity, what kind of specs do you have (GPU / RAM)? I saw the requirements and it's beyond my budget, so I am "stuck" with smaller Qwen coders.
zeroxfe 7 days ago|||
I'm not running it locally (it's gigantic!). I'm using the API at https://platform.moonshot.ai
BeetleB 7 days ago|||
Just curious - how does it compare to GLM 4.7? Ever since they gave the $28/year deal, I've been using it for personal projects and am very happy with it (via opencode).

https://z.ai/subscribe

InsideOutSanta 7 days ago|||
There's no comparison. GLM 4.7 is fine and reasonably competent at writing code, but K2.5 is right up there with something like Sonnet 4.5. It's the first time I can use an open-source model and not immediately tell the difference between it and top-end models from Anthropic and OpenAI.
Alifatisk 7 days ago||||
Kimi K2.5 is a beast: it speaks in a very human-like way (K2 was also good at this) and completes whatever I throw at it. However, the GLM quarterly coding plan is too good of a deal. The Christmas deal ends today, so I'd still suggest sticking with it. There will always be a better model coming.
cmrdporcupine 7 days ago||||
From what people say, it's better than GLM 4.7 (and I guess DeepSeek 3.2)

But it's also like... 10x the price per output token on any of the providers I've looked at.

I don't feel it's 10x the value. It's still much cheaper than paying by the token for Sonnet or Opus, but if you have a subscribed plan from the Big 3 (OpenAI, Anthropic, Google) it's much better value for $$.

Comes down to ethical or openness reasons to use it I guess.

esafak 7 days ago||
Exactly. For the price it has to beat Claude and GPT, unless you have budget for both. I just let GLM solve whatever it can and reserve my Claude budget for the rest.
zeroxfe 7 days ago||||
It's waaay better than GLM 4.7 (which was the open model I was using earlier)! Kimi was able to quickly and smoothly finish some very complex tasks that GLM completely choked at.
segmondy 7 days ago||||
The old Kimi K2 is better than GLM4.7
akudha 7 days ago|||
Is the Lite plan enough for your projects?
BeetleB 7 days ago||
Very much so. I'm using it for small personal stuff on my home PC. Nothing grand. Not having to worry about token usage has been great (previously was paying per API use).

I haven't stress tested it with anything large. Both at work and home, I don't give much free rein to the AI (e.g. I examine and approve all code changes).

Lite plan doesn't have vision, so you cannot copy/paste an image there. But I can always switch models when I need to.

HarHarVeryFunny 7 days ago||||
It is possible to run locally though ... I saw a video of someone running one of the heavily quantized versions on a Mac Studio, and performing pretty well in terms of speed.

I'm guessing a 256GB Mac Studio, costing $5-6K, but that wouldn't be an outrageous amount to spend for a professional tool if the model capability justified it.

tucnak 7 days ago||
> It is possible to run locally though

> running one of the heavily quantized versions

There is a night-and-day difference in generation quality between even something like 8-bit and "heavily quantized" versions. Why not quantize to 1-bit anyway? Would that qualify as "running the model"? Food for thought. Don't get me wrong: there's plenty of stuff you can actually run on a 96 GB Mac Studio (let alone on 128/256 GB ones), but 1T-class models are not in that category, unfortunately. Unless you put four of them in a rack or something.

HarHarVeryFunny 5 days ago||
True, although the Mac Studio M3 Ultra does go up to 512GB (@ ~$10K) so models of this size are not too far out of reach (although I've no idea how useful Kimi K2.5 is compared to SOTA).

Kimi K2.5 is an MoE model with 384 "experts" and an active parameter count of only 32B, although that doesn't really help reduce RAM requirements, since you'd be swapping out those active weights on every token. I wonder if it would be viable to come up with an MoE variant where consecutive sequences of tokens got routed to the same experts, which would change the memory thrashing from per-token to per-token-sequence, perhaps making it tolerable?
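
A toy sketch of the contrast, in case it helps; the shapes, pooling rule, and block size are made up for illustration and have nothing to do with Kimi's actual router:

    import numpy as np

    # Standard MoE: each token picks its own top-k experts, so the set of
    # expert weights touched can change on every single token.
    def per_token_routing(router_logits, k=2):
        # router_logits: (num_tokens, num_experts) scores
        return np.argsort(-router_logits, axis=-1)[:, :k]

    # Hypothetical variant from the comment above: route a whole block of
    # consecutive tokens to one expert set, so weights would only need to be
    # swapped in once per block instead of once per token.
    def per_block_routing(router_logits, block_size=64, k=2):
        num_tokens = router_logits.shape[0]
        assignments = np.empty((num_tokens, k), dtype=int)
        for start in range(0, num_tokens, block_size):
            block = router_logits[start:start + block_size]
            pooled = block.mean(axis=0)              # pool scores over the block
            assignments[start:start + block_size] = np.argsort(-pooled)[:k]
        return assignments

    rng = np.random.default_rng(0)
    logits = rng.normal(size=(256, 384))             # 256 tokens, 384 "experts"
    # Distinct experts touched is a rough proxy for how much weight swapping happens.
    print("per-token routing touches", len(np.unique(per_token_routing(logits))), "experts")
    print("per-block routing touches", len(np.unique(per_block_routing(logits))), "experts")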

jgalt212 7 days ago||||
What's the point of using an open source model if you're not self-hosting?
dimava 7 days ago|||
Open source model costs are determined mostly by electricity and hardware, as anyone can rent a GPU and host them. Closed source models cost 10x more just because they can. A simple example is Claude Opus, which costs ~1/10 as much (if not less) in Claude Code, which doesn't have that price multiplier.
jgalt212 7 days ago||
But Kimi seems so big that renting the necessary number of GPUs is a non trivial exercise.
pstuart 6 days ago||
Exactly! Electricity, hosting, and amortized cost of the GPUs would be the baseline costs.
oefrha 7 days ago||||
Open source models can be hosted by third-party providers; in particular, plenty of educational institutions host open source models. You get to choose whichever provider you trust. For instance, I used DeepSeek R1 a fair bit last year, but never on deepseek.com or through its API.
elbear 7 days ago|||
* It's cheaper than proprietary models

* Maybe you don't want to have your conversations used for training. The providers listed on OpenRouter mention whether they do that or not.

rc1 7 days ago|||
How long until this can be run on consumer grade hardware or a domestic electricity supply I wonder.

Anyone have a projection?

johndough 7 days ago|||
You can run it on consumer-grade hardware right now, but it will be rather slow. NVMe SSDs these days have a read speed of 7 GB/s (EDIT: or even faster than that! Thank you @hedgehog for the update), so it will give you one token roughly every three seconds while crunching through the 32 billion active parameters, which are natively quantized to 4 bits each. If you want to run it faster, you have to spend more money.
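
Back-of-envelope version of that estimate, using only the numbers above (32B active parameters at 4 bits, streamed from disk on every token):

    # ~16 GB of active expert weights have to be read per generated token.
    active_params = 32e9                       # active parameters per token
    bits_per_param = 4                         # native int4 quantization
    bytes_per_token = active_params * bits_per_param / 8   # ~16 GB

    for gb_per_s in (7, 15, 30):               # gen4 SSD, gen5 SSD, two gen5 SSDs striped
        seconds_per_token = bytes_per_token / (gb_per_s * 1e9)
        print(f"{gb_per_s} GB/s -> ~{seconds_per_token:.1f} s/token "
              f"(~{1 / seconds_per_token:.2f} tok/s)")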

Some people in the localllama subreddit have built systems which run large models at more decent speeds: https://www.reddit.com/r/LocalLLaMA/

hedgehog 7 days ago||
High-end consumer SSDs can do closer to 15 GB/s, though only with PCIe gen 5. On a motherboard with two M.2 slots, that's potentially around 30 GB/s from disk. Edit: How fast everything runs depends on how much data needs to be loaded from disk, which is not always everything on MoE models.
greenavocado 7 days ago||
Would RAID zero help here?
hedgehog 7 days ago||
Yes, RAID 0 or 1 could both work in this case to combine the disks. You would want to check the bus topology for the specific motherboard to make sure the slots aren't on the other side of a hub or something like that.
heliumtera 7 days ago||||
You need ~600 GB of VRAM + RAM (+ disk) to fit the full model, or ~240 GB for the ~1-bit quantized version. Of course this will be slow.

Through the Moonshot API it is pretty fast (much, much faster than Gemini 3 Pro and Claude Sonnet, probably faster than Gemini Flash), though. To get a similar experience they say you need at least 4x H200.

If you don't mind running it super slowly, you still need around 600 GB of VRAM + fast RAM.

It's already possible to run 4x H200 in a domestic environment (it would be near-instantaneous for most tasks, unbelievable speed). It's just very, very expensive and probably challenging for most users, though manageable/easy for the average Hacker News crowd.

High-end GPUs are expensive AND hard to source. If you manage to source them at the old prices, figure around 200 thousand dollars for maximum speed; alternatively, you could probably run it decently (but slowly) on a bunch of high-end machines for, let's say, 40k.

segmondy 7 days ago|||
You can run it on a Mac Studio with 512 GB RAM; that's the easiest way. I run it at home on a multi-GPU rig with partial offload to RAM.
johndough 7 days ago||
I was wondering whether multiple GPUs make it go appreciably faster when limited by VRAM. Do you have some tokens/sec numbers for text generation?
Carrok 7 days ago||||
Not OP, but OpenCode and DeepInfra seem like an easy way.
observationist 7 days ago||||
API costs for these big models from third-party hosts tend to be a lot less than API calls to the big 4 American platforms. You definitely get more bang for your buck.
kristianp 6 days ago||||
Note that Kimi K2x is natively 4 bit int, which reduces the memory requirements somewhat.
kristianp 3 days ago||
Here's the citation for that; I think it's not in the Technical Report. https://huggingface.co/moonshotai/Kimi-K2.5#4-native-int4-qu...
tgrowazay 7 days ago|||
Just pick up any >240GB VRAM GPU off your local BestBuy to run a quantized version.

> The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.

CamperBob2 7 days ago||
You could run the full, unquantized model at high speed with 8 RTX 6000 Blackwell boards.

I don't see a way to put together a decent system of that scale for less than $100K, given RAM and SSD prices. A system with 4x H200s would cost more like $200K.

ttul 7 days ago||
That would be quite the space heater, too!
timwheeler 7 days ago|||
Did you use Kimi Code or some other harness? I used it with OpenCode and it was bumbling around through some tasks that Claude handles with ease.
zedutchgandalf 7 days ago||
Are you on the latest version? They pushed an update yesterday that greatly improved Kimi K2.5’s performance. It’s also free for a week in OpenCode, sponsored by their inference provider
ekabod 7 days ago||
But it may be a quantized model for the free version.
thesurlydev 7 days ago||
Can you share how you're running it?
eknkc 7 days ago|||
I've been using it with opencode. You can either use your Kimi Code subscription (flat fee), a moonshot.ai API key (per token), or OpenRouter to access it. OpenCode works beautifully with the model.

Edit: as a side note, I only installed opencode to try this model and I gotta say it is pretty good. Didn't think it'd be as good as Claude Code, but it's just fine. Been using it with Codex too.

Imustaskforhelp 7 days ago||
I tried to use opencode with Kimi K2.5 too, but recently they changed their pricing from 200 tool requests per 5 hours to token-based pricing.

I can only speak to the tool-request-based pricing, but anecdotally opencode took like 10 requests in 3-4 minutes where Kimi CLI took 2-3.

So I personally like and stick with the Kimi CLI for Kimi coding. I haven't tested opencode again with the new token-based pricing, but I do think opencode might burn more tokens.

Kimi CLI's pretty good too imo. You should check it out!

https://github.com/MoonshotAI/kimi-cli

nl 7 days ago||
I like Kimi-cli but it does leak memory.

I was using it for multi-hour tasks scripted via a self-written orchestrator on a small VM and ended up switching away from it because it would run slower and slower over time.

zeroxfe 7 days ago||||
Running it via https://platform.moonshot.ai -- using OpenCode. They have super cheap monthly plans at kimi.com too, but I'm not using them because I already have Codex and Claude monthly plans.
esafak 7 days ago|||
Where? https://www.kimi.com/code starts at $19/month, which is the same as the big boys.
UncleOxidant 7 days ago|||
So there's a free plan at moonshot.ai that gives you some number of tokens without paying?
JumpCrisscross 7 days ago||||
> Can you share how you're running it?

Not OP, but I've been running it through Kagi [1]. Their AI offering is probably the best-kept secret in the market.

[1] https://help.kagi.com/kagi/ai/assistant.html

deaux 7 days ago||
Doesn't list Kimi 2.5 and seems to be chat-only, not API, correct?
lejalv 7 days ago||
> Doesn't list Kimi 2.5 and seems to be chat-only, not API, correct?

Yes, it is chat only, but that list is out of date - Kimi 2.5 (with or without reasoning) is available, as are ChatGPT 5.2, Gemini 3 Pro (Preview), etc

explorigin 7 days ago||||
https://unsloth.ai/docs/models/kimi-k2.5

Requirements are listed.

KolmogorovComp 7 days ago||
To save everyone a click

> The 1.8-bit (UD-TQ1_0) quant will run on a single 24GB GPU if you offload all MoE layers to system RAM (or a fast SSD). With ~256GB RAM, expect ~10 tokens/s. The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs. If the model fits, you will get >40 tokens/s when using a B200. To run the model in near full precision, you can use the 4-bit or 5-bit quants. You can use anything higher just to be safe. For strong performance, aim for >240GB of unified memory (or combined RAM+VRAM) to reach 10+ tokens/s. If you’re below that, it'll work but speed will drop (llama.cpp can still run via mmap/disk offload) and may fall from ~10 tokens/s to <2 tokens/s. We recommend UD-Q2_K_XL (375GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.
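
A quick way to apply that rule of thumb; the quant sizes are just the ones quoted above, and the speed bands are rough guidance rather than benchmarks:

    # Will it run decently? RAM + VRAM vs. quant size, per the rule of thumb above.
    QUANT_SIZES_GB = {
        "UD-TQ1_0 (1.8-bit)": 240,      # approximate, per the ">240GB" guidance above
        "UD-Q2_K_XL": 375,
        "full model (native int4)": 630,
    }

    def check(quant, ram_gb, vram_gb):
        need = QUANT_SIZES_GB[quant]
        have = ram_gb + vram_gb
        if have >= need:
            return f"{quant}: fits ({have:.0f} GB >= {need} GB), expect 10+ tok/s"
        return f"{quant}: short by {need - have:.0f} GB, expect mmap/disk offload and well under 10 tok/s"

    for q in QUANT_SIZES_GB:                    # e.g. the 24GB GPU + 256GB RAM setup above
        print(check(q, ram_gb=256, vram_gb=24))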

Gracana 7 days ago||
I'm running the Q4_K_M quant on a Xeon with 7x A4000s and I'm getting about 8 tok/s with small context (16k). I need to do more tuning; I think I can get more out of it, but it's never gonna be fast on this suboptimal machine.
segmondy 7 days ago|||
You can add 1 more GPU so you can take advantage of tensor parallelism. I get the same speed with 5x 3090s with most of the model in 2400 MHz DDR4 RAM, 8.5 tok/s almost constant. I don't really do agents, just chat, and it holds up to 64k.
Gracana 7 days ago||
That is a very good point and I would love to do it, but I built this machine in a desktop case and the motherboard has seven slots. I did a custom water cooling manifold just to make it work with all the cards.

I'm trying to figure out how to add another card on a riser hanging off a slimsas port, or maybe I could turn the bottom slot into two vertical slots.. the case (fractal meshify 2 xl) has room for a vertical mounted card that wouldn't interfere with the others, but I'd need to make a custom riser with two slots on it to make it work. I dunno, it's possible!

I also have an RTX Pro 6000 Blackwell and an RTX 5000 Ada... I'd be better off pulling all the A4000s and throwing both of those cards in this machine, but then I wouldn't have anything for my desktop. Decisions, decisions!

esafak 7 days ago|||
The pitiful state of GPUs. $10K for a sloth with no memory.
indigodaddy 7 days ago||||
Been using K2.5 Thinking via Nano-GPT subscription and `nanocode run` and it's working quite nicely. No issues with Tool Calling so far.
gigatexal 7 days ago|||
Yeah, I too am curious, because Claude Code is so good and the ecosystem so "it just works" that I'm willing to pay them.
Imustaskforhelp 7 days ago|||
I tried Kimi K2.5 and at first I didn't really like it. I was critical of it, but then I started liking it. The model has also kind of replaced how I use ChatGPT, and I really love Kimi K2.5 the most right now (although Gemini models come close too).

To be honest, I do feel like Kimi K2.5 is the best open source model. It's not the best model overall right now, but it's really price-performant and could be a good fit for many use cases.

It might not be the complete SOTA that people say, but it comes pretty close, and it's open source. I trust the open source part because other providers can also run it, among a lot of other reasons (also considering that, IIRC, ChatGPT recently dropped some old models).

I really appreciate Kimi for still open sourcing their complete SOTA and then releasing research papers on top of it, unlike Qwen, which has closed-sourced its complete SOTA.

Thank you Kimi!

epolanski 7 days ago|||
You can plug another model in place of Anthropic ones in Claude Code.
zeroxfe 7 days ago|||
That tends to work quite poorly because Claude Code does not use standard completions APIs. I tried it with Kimi, using litellm[proxy], and it failed in too many places.
xxr3376 7 days ago|||
You can try Kimi's Anthropic-compatible API.

Just connect Claude Code to Kimi's API endpoint and everything works well

https://www.kimi.com/code/docs/en/more/third-party-agents.ht...
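
For what it's worth, the same Anthropic-compatible endpoint should also be usable outside Claude Code with the stock anthropic SDK. A minimal sketch; the base URL and model name below are guesses on my part, so check the docs linked above for the real values:

    import os
    from anthropic import Anthropic

    client = Anthropic(
        api_key=os.environ["MOONSHOT_API_KEY"],           # your Moonshot/Kimi key
        base_url="https://api.moonshot.ai/anthropic",     # assumed endpoint, verify in the docs
    )

    message = client.messages.create(
        model="kimi-k2.5",                                # assumed model id, verify in the docs
        max_tokens=1024,
        messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
    )
    print(message.content[0].text)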

AnonymousPlanet 7 days ago||||
It worked very well for me using Qwen3 Coder behind a LiteLLM proxy. Most other models just fail in weird ways though.
samtheprogram 7 days ago|||
opencode is a good alternative that doesn't flake out in this way.
miroljub 7 days ago|||
If you don't use Anthropic models there's no reason to use Claude Code at all. Opencode gives you so much more choice.
unleaded 7 days ago||
Seems that K2.5 has unfortunately lost a lot of the personality from K2; it talks in a more ChatGPT/Gemini/C-3PO style now. It's not explicitly bad, and I'm sure most people won't care, but it was something that made it unique, so it's a shame to see it go.

examples to illustrate

https://www.kimi.com/share/19c115d6-6402-87d5-8000-000062fec... (K2.5)

https://www.kimi.com/share/19c11615-8a92-89cb-8000-000063ee6... (K2)

zozbot234 7 days ago||
It's hard to judge from this particular question, but the K2.5 output looks at least marginally better AIUI; the only real problem with it is the snarky initial "That's very interesting" quip. Even then a British user would probably be fine with it.
logicprog 7 days ago|||
I agree. K2 was blunt, straightforward, pretty... rational? K2.5 has a much stronger slop vibe.
orbital-decay 7 days ago|||
K2 in your example is using the GPT reply template (tl;dr - terse details - conclusion, with contradictory tendencies), there's nothing unique about it. That's exactly how GPT-5.0 talked. The only model with a strong "personality" vibe was Claude 3 Opus.
user_7832 7 days ago|||
> The only model with a strong "personality" vibe was Claude 3 Opus.

Did you have the chance to use 3.5 (or 3.6) Sonnet, and if yes, how did they compare?

As a non-paying user, 3.5 era Claude was absolutely the best LLM I've ever used in terms of having a conversation. It felt like talking to a human and not a bot. Its replies were readable, even if they were several paragraphs long. I've unfortunately never found anything remotely as good.

orbital-decay 7 days ago||
Pretty poorly in that regard. In 3.5 they killed Claude 3's agency, pretty much reversing their previous training policy in favor of "safety", and tangentially mentioned that they didn't want to make the model too human-like. [1] Claude 3 was the last version of Claude, and one of the very few models in general, that had a character. That doesn't mean it wasn't writing slop though, falling into annoying stereotypes is still unsolved in LLMs.

[1] https://www.anthropic.com/research/claude-character (see the last 2 paragraphs)

unleaded 5 days ago|||
It definitely talks a lot differently than GPT-5 (plus it came out earlier); the example I gave just happens to look a bit like it. Best to try using it yourself a bit; my prompt isn't the perfect one to illustrate it or anything. Don't know about Claude because it costs money ;)
networked 5 days ago|||
Preserving the examples for history. I wish HN had a <details> tag. The closest would be to collapse this comment.

Edit: Thanks, moderator. :-)

---

Turn 1, user:

is replacing my main PC with a NeXTcube a good idea?

---

Turn 2, K2.5 (bold font replaced with "*foo*"):

That's a very interesting question! The NeXTcube is a historically significant machine (famously used by Tim Berners-Lee to create the World Wide Web), but I need to be honest with you: *no, it would not be a good idea to replace your main PC with a NeXTcube* for practical modern use.

Here's why:

*Hardware Limitations*

- *Released in 1988* with a 25-33 MHz Motorola 68030 processor

- *8-64 MB of RAM* (compared to modern PCs with 16-64 GB)

- *No modern connectivity*: No USB, Wi-Fi, Bluetooth, or Ethernet as we know it

- *Display*: Monochrome or limited color, low resolution by today's standards

*Software Ecosystem*

- Runs *NeXTSTEP* (which evolved into macOS), but software support ended decades ago

- No modern web browsers, office suites, or development tools

- Cannot run Windows, modern Linux, or contemporary macOS applications

*Practical Realities*

- Finding working hardware is difficult and expensive (collectors' items)

- Replacement parts are scarce and costly

- No security updates for 25+ years

- File format compatibility issues with modern systems

*What You Could Use It For*

- Historical computing experience

- Running legacy NeXTSTEP software

- Museum piece or collector's item

- Learning about computing history

The Better Alternative

If you're drawn to the NeXT aesthetic or philosophy, consider:

- *Modern macOS* (direct descendant of NeXTSTEP)

- *Linux with a tiling window manager* (similar workflow)

- *Vintage computing as a hobby* alongside your main PC

Would you like suggestions for achieving a similar experience or workflow on modern hardware?

---

Turn 2, K2 (no bold font):

Short answer: No—unless your goal is specifically to collect or experiment with a historic machine, a NeXTcube is a terrible day-to-day replacement for a modern PC.

Why it fails as a “main” computer today

1. 30-year-old hardware

25 MHz 68040, 16–64 MB RAM, 400 MB SCSI disk (if you’re lucky), no built-in Ethernet, no USB, no Wi-Fi, no audio-in, 2-bit grayscale display. You will feel the slowness in everything from un-zipping a file to loading a web page.

2. Obsolete software base

NeXTSTEP 3.3 is gorgeous but frozen in 1995. No modern browser build, no TLS 1.3, no video codecs past MPEG-1, no Dropbox/OneDrive/Zoom/Slack/Steam/Spotify. You can ssh out, but you can’t run contemporary apps.

3. Storage and connectivity pain

SCSI-2 drives are dying; 50-pin replacements cost more per GB than a 4-TB NVMe stick. Getting files on/off the machine usually involves FTP-over-Ethernet or burning a CD-ROM on another computer.

4. No security updates

The last patch was released before the commercial web really existed. Putting it on the open Internet is asking to be scanned and possibly exploited within minutes.

5. Display and peripherals

The original 17" NeXT MegaPixel monitor is fixed-frequency 1120×832@68 Hz grayscale. Adapters to VGA exist, but color is impossible and refresh rates annoy modern LCDs. Sound is 16-bit 22 kHz mono; the optical drive is a 660 MB magneto-optical cartridge.

6. Power and parts

The cube power supply is a custom 350 W linear unit that runs hot; capacitors are drying out. When it dies you will be scavenging eBay or recapping it yourself. Spare motherboards cost more than a whole used ThinkPad.

7. Productivity software

There are serviceable versions of Mathematica, FrameMaker, Lotus Improv and OmniWeb 3.x, but file-format compatibility stops at Office 97-era documents. You will constantly round-trip files through another machine.

What it is good for

- Historical research, museum exhibits, retro-programming in Objective-C 1.0 or 68k assembly.

- Demonstrating the birth of the World-Wide Web (Tim Berners-Lee’s first server ran on a NeXT).

- Enjoying the UI that later became macOS (AppKit, Dock, Services, .app bundles).

- Collecting: a working color NeXTcube with turbo slab and original laser printer is a conversation piece worth several thousand dollars.

Practical compromise

Keep your modern PC (or a quiet mini-PC/Mac mini) for real work and run a NeXT emulator instead:

- Previous (68k emulator) boots NeXTSTEP 3.3 on macOS or Linux.

- VirtualNeXT (based on Previous) for Windows.

- QEMU’s “next” target is improving.

You get the look-and-feel, can compile legacy code, and still have 4K video calls when you need them.

Bottom line

A NeXTcube is a beautiful, important machine—buy one if you love vintage hardware—but do not voluntarily make it the computer you rely on to pay bills, join Zoom meetings, or play YouTube.

Grosvenor 7 days ago||
[flagged]
Grimblewald 7 days ago||
Disagree. I've found Kimi useful in solving creative coding problems that Gemini, Claude, ChatGPT, etc. failed at. It's also far better at verifying, augmenting, and adding to human reviews of resumes for positions. It catches details that humans and other LLMs routinely miss. There is something special to K2.
extr 7 days ago||
I tried this today. It's good - but it was significantly less focused and reliable than Opus 4.5 at implementing some mostly-fleshed-out specs I had lying around for some needed modifications to an enterprise TS node/express service. I was a bit disappointed tbh; the speed via fireworks.ai is great, and they're doing great work on the hosting side. But I found the model had to double back to fix type issues, broken tests, etc. far more than Opus 4.5, which churned through the tasks with almost zero errors. In fact, I gave the resulting code to Opus, simply said it looked "sloppy", and Opus cleaned it up very quickly.
Imanari 7 days ago||
I have been very impressed with this model and also with the Kimi CLI. I have been using it with the 'Moderato' plan (7 days free, then $19). A true competitor to Claude Code with Opus.
tomaskafka 7 days ago||
It is amazing, but "open source model" means "model I can understand and modify" (= all the training data and processes).

Open weights are the equivalent of the binary driver blobs everyone hates: "Here is an opaque thing, you have to put it on your computer and trust it, and you can't modify it."

jmiskovic 7 days ago||
That's unfair. Binary driver blobs are blackmail: "you bought the hardware, but parts of the laptop won't work unless you agree to run this mysterious bundle insecurely". Open weight is more like "here's a frozen brain you can thaw in a safe harness to do your bidding".
pama 7 days ago||
Not equivalent to a binary driver: you can modify it yourself with post-training on your own data. So it sits somewhere between NVIDIA userspace drivers and Emacs, or Claude Code and codex-cli. We don't have good analogies from older-generation software.
logicprog 7 days ago||
Kimi K2T was good. This model is outstanding, based on the time I've had to test it (basically since it came out). It's so good at following my instructions, staying on task, and not getting context poisoned. I don't use Claude or GPT, so I can't say how good it is compared to them, but it's definitely head and shoulders above the open weight competitors
zzleeper 7 days ago||
Do any of these models do well with information retrieval and reasoning from text?

I'm reading newspaper articles through a "MoE" of Gemini 3 Flash and GPT-5 Mini, and what made it hard to use open models (at the time) was a lack of support for Pydantic-style structured output.
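
A minimal sketch of what I mean, in case it's useful: ask for JSON and validate it with Pydantic over any OpenAI-compatible endpoint. The base URL, model name, and schema below are placeholders, not any specific provider's real values:

    import os
    from openai import OpenAI
    from pydantic import BaseModel, ValidationError

    class ArticleFacts(BaseModel):
        headline: str
        people_mentioned: list[str]
        one_sentence_summary: str

    client = OpenAI(
        base_url=os.environ.get("LLM_BASE_URL", "https://example.com/v1"),  # placeholder
        api_key=os.environ.get("LLM_API_KEY", "sk-..."),
    )

    def extract(article_text):
        resp = client.chat.completions.create(
            model=os.environ.get("LLM_MODEL", "some-model"),                # placeholder
            messages=[
                {"role": "system", "content": "Reply with JSON only, with keys: "
                 "headline, people_mentioned, one_sentence_summary."},
                {"role": "user", "content": article_text},
            ],
        )
        try:
            return ArticleFacts.model_validate_json(resp.choices[0].message.content)
        except ValidationError:
            return None    # retry, or fall back to another model here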

jychang 7 days ago|
That roughly correlates with tool calling capabilities. Kimi K2.5 is a lot better than previous open source models in that regard.

You should try out K2.5 for your use case, it might actually succeed where previous generation open source models failed.

syndacks 7 days ago||
How do people evaluate creative writing and emotional intelligence in LLMs? Most benchmarks seem to focus on reasoning or correctness, which feels orthogonal. I’ve been playing with Kimi K2.5 and it feels much stronger on voice and emotional grounding, but I don’t know how to measure that beyond human judgment.
mohsen1 7 days ago||
I am trying! https://mafia-arena.com

I just don't have enough funding to do a ton of tests

nolist_policy 7 days ago||
https://eqbench.com/index.html
eager_learner 7 days ago||
I tried the Kimi 2.5 Swarm Agent version and it was way better than any AI model I've tried so far.
gedy 7 days ago|
Sorry if this is an easily answerable question - but by "open", does that mean we can download this and use it totally offline, now or in the future, if we have capable hardware? Seems like a great thing to archive if the world falls apart (said half-jokingly)
fancy_pantser 7 days ago||
Sure. Someone on /r/LocalLLaMA was seeing 12.5 tokens/s on dual Strix Halo 128GB machines (run you $6-8K total?) with 1.8bits per parameter. It performs far below the unquantized model, so it would not be my personal pick for a one-local-LLM-forever, but it is compelling because it has image and video understanding. You lose those features if you choose, say, gpt-oss-120B.

Also, that's with no context, so it would be slower as it filled (I don't think K2.5 uses the Kimi-Linear KDA attention mechanism, so it's sub-quadratic but not their lowest).

fragmede 7 days ago|||
Yes, but the hardware to run it decently is gonna cost you north of $100k, so hopefully you and your bunkermates allocated the right amount to this instead of guns or ammo.
Tepix 7 days ago|||
You could buy five Strix Halo systems at $2000 each, network them and run it.

Rough estimate: 12.5 / 2.2, so you should get around 5.5 tokens/s.

j-bos 7 days ago||
Is the software/drivers for networking LLMs on Strix Halo there yet? I was under the impression a few weeks ago that it's veeeery early stages and terribly slow.
Tepix 3 days ago|||
Check out https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/...
Tepix 3 days ago||||
llama.cpp with rpc-server doesn't require a lot of bandwidth during inference. There is a loss of performance.

For example, using two Strix Halos you can get 17 or so tokens/s with MiniMax M2.1 Q6. That's a 229B-parameter model with a 10B active set (7.5 GB at Q6). The theoretical maximum speed with 256 GB/s of memory bandwidth would be ~34 tokens/s.

Tepix 7 days ago|||
Llama.cpp with its rpc-server
cmrdporcupine 7 days ago|||
Yes, but you'll need some pretty massive hardware.
Carrok 7 days ago||
Yes.