Posted by jetter 8 hours ago
Gave it a short prompt and it gave me an OpenSCAD model with everything parametrized. I printed it in TPU with no changes and it was nearly perfect on the first try. Claude put in a 0.3mm subtraction in the x/y dimensions; I lowered it to 0.1 and it's perfect.
Much easier shape than ancient Roman architecture but still very cool how easy it was.
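For anyone wondering what that kind of tolerance parameter amounts to, here is a minimal sketch in Python rather than OpenSCAD. The dimensions and the `Part` / `apply_xy_clearance` names are made up for illustration, and it assumes the 0.3mm is a total subtraction per axis (not per side), which the comment above does not specify.

```python
from dataclasses import dataclass


@dataclass
class Part:
    width: float   # nominal x size in mm
    depth: float   # nominal y size in mm
    height: float  # z size in mm, left untouched


def apply_xy_clearance(part: Part, clearance: float) -> Part:
    """Shrink the x/y footprint by a total clearance; z is unchanged."""
    return Part(part.width - clearance, part.depth - clearance, part.height)


# Hypothetical nominal dimensions, not taken from the actual model
nominal = Part(width=40.0, depth=25.0, height=10.0)
first_try = apply_xy_clearance(nominal, 0.3)  # value the model chose
tuned = apply_xy_clearance(nominal, 0.1)      # value after one test print
print(first_try)
print(tuned)
```

Keeping the clearance as a single named parameter is what makes the one-line fix after a test print possible.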
I've had similar experiences with making simple functional parts off a 3d printer with OpenSCAD + LLMs. I'm very aware that the models are worse at it than say, generating react code, and I'm also the antithesis of a skilled pilot. It's still cool and has resulted in me starting to learn a new skill at a hobby level.
“Reproducible build” already usually implies bit-by-bit reproducibility.
'cause you'd start with the flat shape, then set some constraints that certain edges are collinear
That is seriously really impressive. I looked at the 3D model and didn't even think to LOOK INSIDE the building before reading this.
Here's [1] the 3D model with `show_cutaway` enabled.
[1] https://modelrift.com/models/pantheon-benchmark-antigravity-...
My Antigravity (forced) replacement for Gemini CLI requires me to log on via browser every time I use it, and my Antigravity IDE won't update at all, so:
If it's ok I'd prefer they just work on reaching a baseline acceptable rollout before worrying about being Top in anything.
Ps actual title:
OpenSCAD LLM Benchmark: Building the Pantheon
I was actually hoping for "Opus level intelligence at Haiku costs" model or "Sonnet level performance in Gemini 3.0 pricing", either of these would have been a workhorse, plus a competitor to Claude/Codex (1 app to do things). I got neither.
This seems very similar to mobile data limits (remember those years?), where there wasn't enough tower bandwidth to serve everyone unlimited data, so telcos were in constant tension between data caps and bandwidth throttling.
It wasn't until 5G came along with 100x network capacity that they could finally give everyone "unlimited" data.
I get you have to change limits, but reducing limits in a way which both applies retroactively and has a really long reset period is just infuriating. If they'd applied the new limits more gently or at the next billing period I'd probably have continued paying.
I don't mind paying a fair price for a service that provides value, but I really hate having a service I think I'm paying for rug-pulled with no clear justification.
So far I like it much more than Gemini CLI (my previous daily driver for personal projects). Seems more mature and "feels more intelligent" (very subjective ofc)
If you're on WSL, getting dbus to work is a PITA. There may be other OS-level issues that folks are running into.
I'm guessing that most harnesses/tools will resize an image before processing and in doing so will lose enough detail to make it much harder to reason about - especially wireframe images.
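A rough way to see why thin wireframe lines would suffer: a minimal sketch that downscales a synthetic image by block averaging. This is only an assumption about what a harness might do; real tools may use other resampling filters, and the 64x64 image and factor of 8 are arbitrary.

```python
def box_downscale(img, factor):
    """Downscale a 2D list of 0..1 values by averaging factor x factor blocks."""
    h, w = len(img), len(img[0])
    out = []
    for by in range(0, h - h % factor, factor):
        row = []
        for bx in range(0, w - w % factor, factor):
            block = [
                img[y][x]
                for y in range(by, by + factor)
                for x in range(bx, bx + factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# Synthetic "wireframe": one 1-pixel-wide white vertical line on black
size = 64
image = [[1.0 if x == 20 else 0.0 for x in range(size)] for _ in range(size)]

small = box_downscale(image, 8)
peak_before = max(max(r) for r in image)
peak_after = max(max(r) for r in small)
print(peak_before, peak_after)
```

In this toy case the line keeps only one eighth of its original contrast after an 8x reduction, while a filled shape of the same extent would keep most of it.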
I'm sure I'm holding it wrong, but this test didn't really test this. It was just a one off. That breaks down pretty quickly and especially if you don't have reference pictures of what you are trying to create.
- Models are very jagged (might excel in one type of 3d model, but not another)
- Gemini models are the least jagged in my experience and have the best image understanding
- Gemini models are also the most creative (which may be undesirable if you want precise CAD part)
- Overall this benchmark doesn't prove much because one 3d model (and one attempt) is just not enough. I am usually testing on at least a dozen models each generated 3 times, but should really do much more, but it's too pricey for a solo dev.
Still, thanks for publishing this. Will definitely run Flash 3.5 soon to see how it performs.
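To make the "many models, several attempts each" idea concrete, here is a minimal sketch of how such runs could be summarized. The LLM names, target names, and scores are placeholders only, not real results, and using the standard deviation across targets as a "jaggedness" number is my own assumption rather than an established metric.

```python
from statistics import mean, pstdev


def summarize(scores):
    """scores: {llm: {target: [score per attempt]}} -> per-LLM summary dict."""
    summary = {}
    for llm, per_target in scores.items():
        # Average the repeated attempts first, then compare across targets
        target_means = [mean(attempts) for attempts in per_target.values()]
        summary[llm] = {
            "overall": mean(target_means),
            "jaggedness": pstdev(target_means),  # spread across targets
            "worst": min(target_means),
        }
    return summary


# Placeholder scores on a 0..10 scale; NOT measured data
example = {
    "llm_a": {"bracket": [8, 9, 7], "gear": [3, 4, 2], "vase": [9, 9, 9]},
    "llm_b": {"bracket": [6, 6, 6], "gear": [6, 5, 7], "vase": [6, 7, 5]},
}
result = summarize(example)
for llm, stats in result.items():
    print(llm, stats)
```

With one target and one attempt, none of these numbers can be computed meaningfully, which is the limitation described in the last bullet.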
Just totally subjective grading criteria of a single poorly defined example with no end use case in mind to guide how to even do evaluation.
As a side note Autodesk released an agentic assistant back in December for Fusion. Six months later it is still quite bad.
At this point I'm not even sure if it can properly create a simple primitive solid.
SCAD needs unit tests. It would be powerful to assert that a profile doesn't have a slope greater than 45°, that the intersection of two objects is null, or that a part has a specific volume.
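A minimal sketch of what those three assertions could look like, written in Python against plain numbers rather than real OpenSCAD geometry. All function names are hypothetical, the slope is measured from the x axis (an assumed convention), and axis-aligned bounding boxes are only a conservative stand-in for a true solid intersection test.

```python
import math


def max_slope_deg(profile):
    """Largest segment angle, in degrees from the x axis, along a 2D polyline."""
    return max(
        math.degrees(math.atan2(abs(y1 - y0), abs(x1 - x0)))
        for (x0, y0), (x1, y1) in zip(profile, profile[1:])
    )


def boxes_overlap(a, b):
    """a, b: ((minx, miny, minz), (maxx, maxy, maxz)). True if interiors overlap."""
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))


def cylinder_volume(radius, height):
    """Analytic volume of a cylinder, used as the expected value."""
    return math.pi * radius ** 2 * height


# Illustrative values only
profile = [(0, 0), (10, 5), (20, 15)]
box_a = ((0, 0, 0), (10, 10, 10))
box_b = ((10, 0, 0), (20, 10, 10))  # touches box_a on one face

assert max_slope_deg(profile) <= 45 + 1e-9
assert not boxes_overlap(box_a, box_b)
assert math.isclose(cylinder_volume(5, 10), 785.398, rel_tol=1e-5)
```

A real version would need the volume and intersection computed from the rendered mesh; the sketch only shows the shape of the assertions.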
It also needs cut away views. I got okay results using boxes to remove everything except a sliver, to view a slice and internal details. But without hash marks, texture, or outlines it can be hard to tell the forms.
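The sliver trick can be scripted. Here is a minimal sketch of a Python helper that emits OpenSCAD source which intersects a model with a thin slab. Note this uses `intersection()` with one slab instead of subtracting several boxes, which should give the same slice; the `slice_view` helper and the `pantheon()` module name are hypothetical.

```python
def slice_view(model_call, axis="y", position=0.0, thickness=1.0, extent=1000.0):
    """Return OpenSCAD source that keeps only a thin slab of the model.

    model_call: OpenSCAD module invocation as text, e.g. "pantheon()"
    axis:       axis the slab is thin along ("x", "y" or "z")
    extent:     slab size along the other two axes; must exceed the model
    """
    idx = "xyz".index(axis)
    size = [extent] * 3
    size[idx] = thickness
    offset = [0.0] * 3
    offset[idx] = position
    # Python list formatting ([a, b, c]) is also valid OpenSCAD vector syntax
    return (
        "intersection() {\n"
        f"    {model_call};\n"
        f"    translate({offset}) cube({size}, center=true);\n"
        "}\n"
    )


src = slice_view("pantheon()", axis="y", position=2.5, thickness=0.5)
print(src)
```

This does nothing about the missing hatch marks or outlines, so the slice can still be hard to read.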
I would be more interested in benchmarking the modeling of an anonymous structure based on provided references alone. It kind of feels like the shallow magic of watching an LLM one-shot a to-do app.