Posted by ricardbejarano 11 hours ago

Can I run AI locally? (www.canirun.ai)
781 points | 216 comments
comrade1234 3 hours ago|
I can't tell at a glance what this page is showing, but I am curious about the licenses on the various models that let me run them locally and make money off them. A while ago only DeepSeek let you do that; not sure now.
mind_heist 3 hours ago|
Nice, this is an interesting idea. Can you elaborate on the licensing issue? How do you get blocked from using the models commercially?
comrade1234 3 hours ago||
Just read the license agreements. Last time I looked into this, the only model I could run locally and do what I wanted with was DeepSeek. I think it was under the MIT license. The others had various restrictions that just didn't make it worth it.

I stopped researching this because buying the hardware to run the full DeepSeek model just isn't practical right now. Our customers will have to be happy with us sending data to OpenAI/DeepSeek/etc. if they want to use those features.

singpolyma3 2 hours ago||
Qwen3.5 is just Apache-licensed.
rcarmo 6 hours ago||
This is kind of bogus, since some of the S- and A-tier models are pretty useless for reasoning or tool calls and can't run with any sizable system prompt… it seems to be ranked solely on tokens per second?
orthoxerox 7 hours ago||
For some reason it doesn't react to changing the RAM amount in the combo box at the top. If I open this on my Ryzen AI Max+ 395 with 32 GB of unified memory, it thinks nothing will fit, because I've set it up to reserve 512 MB of RAM for the GPU.
bityard 6 hours ago|
Yeah, this site is iffy at best. I didn't even see Strix Halo on the list, but I selected 128 GB and bumped up the memory bandwidth. It says gpt-oss-120b "barely runs" at ~2 t/s.

In reality, gpt-oss-120b fits great on the machine with plenty of room to spare and easily runs inference north of 50 t/s depending on context.
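
Back-of-envelope: decode speed on these machines is roughly memory bandwidth divided by the bytes read per token. A quick sketch in Python (the bandwidth, active-parameter, and efficiency figures below are my assumptions; plug in your own):

  # Rough decode-speed estimate for a bandwidth-bound MoE model.
  # Assumed: Strix Halo ~256 GB/s, gpt-oss-120b ~5.1B active params,
  # ~4.5 bits/weight (MXFP4 plus overhead), ~60% of peak bandwidth.
  bandwidth_gbs = 256.0     # memory bandwidth, GB/s
  active_params_b = 5.1     # active parameters per token, billions
  bits_per_weight = 4.5     # quant width incl. overhead
  efficiency = 0.6          # real kernels reach ~50-70% of peak

  gb_per_token = active_params_b * bits_per_weight / 8
  print(f"~{efficiency * bandwidth_gbs / gb_per_token:.0f} tok/s")  # ~54

That lines up with what the machine actually does; the site's ~2 t/s looks like it prices all 120B parameters as if they were read on every token (256 * 0.6 / 67.5 GB is about 2.3 t/s).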

urba_ 3 hours ago||
Man, I wonder when there will be AI server farms made from iCloud-locked jailbroken iPhone 16s with backported macOS
am17an 6 hours ago||
You can still run larger MoE models by off-loading the expert weights to the CPU for token generation. They are by and large usable; I get ~50 tok/s on a Kimi Linear 48B (3B active) model on a potato PC plus a 3090.
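
The arithmetic behind that: per token, only the active experts stream from system RAM, not the whole model, so the CPU side is bounded by RAM bandwidth over active-expert bytes. A rough sketch (every figure here is an assumption; measure your own box):

  # Why expert offload stays usable: per token you read only the
  # *active* experts from system RAM. All numbers are assumptions.
  ram_bandwidth_gbs = 80.0        # e.g. dual-channel DDR5
  active_expert_params_b = 2.5    # active params kept in system RAM
  bits_per_weight = 4.5           # Q4-ish quant

  gb_per_token = active_expert_params_b * bits_per_weight / 8
  print(f"CPU-side ceiling: ~{ram_bandwidth_gbs / gb_per_token:.0f} tok/s")  # ~57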
John23832 9 hours ago||
RTX Pro 6000 is a glaring omission.
embedding-shape 8 hours ago||
Yeah, that's weird: it seems to have later models, and earlier ones, but specifically not the Pro 6000? Also, based on my experience, the given numbers seem to be at least an order of magnitude off, which is a lot, when I plug in approximate values for a Pro 6000 (96 GB VRAM, 1792 GB/s).
schaefer 8 hours ago||
The Nvidia DGX Spark workstation is another omission.
azmenak 5 hours ago||
From my personal testing, running various agentic tasks with a bunch of tool calls on an M4 Max with 128 GB, I've found that quantized versions of larger models produce the best results, which this site completely ignores.

Currently, Nemotron 3 Super with Unsloth's UD Q4_K_XL quant handles nearly everything I do locally (replacing Qwen3.5 122b)
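
If anyone wants to reproduce the setup, a minimal llama-cpp-python loader looks something like this (the file name is illustrative; substitute whichever quant you actually pulled):

  # Minimal llama-cpp-python sketch for running a local GGUF quant.
  from llama_cpp import Llama

  llm = Llama(
      model_path="nemotron-3-super-UD-Q4_K_XL.gguf",  # illustrative path
      n_gpu_layers=-1,  # offload every layer that fits to Metal/GPU
      n_ctx=32768,      # agentic tool-call loops want a big context
  )
  out = llm("List three uses for a local LLM.", max_tokens=64)
  print(out["choices"][0]["text"])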

AstroBen 7 hours ago||
This doesn't look accurate to me. I have an RX 9070 and I've been messing around with Qwen 3.5 35B-A3B. According to this site I can't even run it, yet I'm getting 32 tok/s ^.-
mongrelion 3 hours ago||
Which quantization are you running, and at what context size? 32 tok/s for that model on that card sounds pretty good to me!
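
For reference, here is a crude fit check; the parameter count, quant widths, and overhead allowance are all assumptions:

  # Crude VRAM-fit check for a quantized model. KV cache and runtime
  # overhead are guessed; real usage grows with context length.
  def fits(total_params_b, bits_per_weight, vram_gb, overhead_gb=2.0):
      weights_gb = total_params_b * bits_per_weight / 8
      return weights_gb + overhead_gb <= vram_gb, weights_gb

  for bits in (8.0, 4.5, 3.0):
      ok, gb = fits(35, bits, vram_gb=16)
      print(f"{bits} bits/weight -> {gb:.1f} GB weights, fits: {ok}")

A 35B doesn't fully fit in 16 GB at Q4, but with only ~3B parameters active per token, spilling some experts to system RAM costs little, which would explain still seeing 32 tok/s.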
misnome 7 hours ago||
It seems to be missing a whole load of the quantized Qwen models; Qwen3.5:122b works fine on the 96 GB GH200 (a machine that is also missing here...)
amelius 6 hours ago||
It would be great if something like this were built into ollama, so you could easily list the models available for your current hardware setup straight from the CLI.
rootusrootus 6 hours ago|
Someone linked to llmfit. That would be a great tool to integrate with ollama. Just highlight the one you want and tell it to install.

Quick, someone go vibe code that.
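
Half of it is a dozen lines against Ollama's /api/tags endpoint. A sketch, noting that the 0.7 headroom factor is a guess and this only covers models you've already pulled:

  # Sketch: list locally pulled Ollama models that plausibly fit in RAM.
  import urllib.request, json, os

  resp = urllib.request.urlopen("http://localhost:11434/api/tags")
  models = json.load(resp)["models"]
  # Physical RAM with ~30% headroom for KV cache, OS, everything else.
  budget = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") * 0.7

  for m in models:
      if m["size"] <= budget:
          print(f'{m["name"]}: {m["size"] / 2**30:.1f} GiB -- fits')

Extending that to the whole registry is the vibe-coding part.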

dugidugout 4 hours ago||
The latest level of abstraction! You just release your half-baked ideas into some internet-connected box and wake up with products! Yahoo! Onwards into the Gestell!
SXX 5 hours ago|
Sorry if this has already been answered, but will there be a metric for latency, aka time to first token?

I've been considering buying an M3 Ultra, and it seems to be the most often discussed Apple hardware for running local LLMs: generation speed might be okay, but prompt processing can take ages.
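TTFT is roughly prompt length divided by prefill speed, and prefill is compute-bound, which is exactly where Apple silicon trails discrete GPUs. A sketch with illustrative numbers only (both figures are assumptions, not benchmarks):

  # Illustrative time-to-first-token estimate.
  prompt_tokens = 32_000       # a large agentic / RAG prompt
  prefill_tok_per_s = 250.0    # assumed prompt-processing speed

  print(f"TTFT: ~{prompt_tokens / prefill_tok_per_s / 60:.1f} minutes")  # ~2.1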

teaearlgraycold 5 hours ago|
Wait for the M5 Ultra. It should get the 4x prompt-processing speedup from the rest of the M5 product line. I hear rumors it will be released this year.