Posted by meetpateltech 14 hours ago
[1] The photo of the outfit: https://share.google/mHJbchlsTNJ771yBa
EDIT: After reading the prompt translation, this was more just like a “year of the horse is going to nail white engineers in glorious rendered detail” sort of prompt. I don’t know how SD1.5 would have rendered it, and I think I’ll skip finding out
From the article it seems the name is 马启仁, not 马骑人, so the guy's name sounds the same as 'horse riding man', but that isn't a literal translation of his name.
He also claimed that LLMs were a failure because of prompts that GPT-3.5 couldn't parse, after the launch of GPT-4, which handled them with aplomb.
For example I think there would be a lot of businesses in the US that would be too afraid of backlash to use AI generated imagery for an itinerary like the one at https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwe...
Ha! An American would have no such qualms.
This is sending me. I don't know what's funnier: this translation being accurate, or inaccurate.
This problem is infamous because it persisted (unlike other early problems, like creating the wrong number of fingers) in much more capable models, and the Qwen Image people are certainly very aware of this difficult test. Even Imagen 4 Ultra, which might be the most advanced pure diffusion model without an editing loop, fails at it.
And obviously an astronaut is similar to a man, which connects this benchmark to the Chinese meme.
But on the one picture that honestly looks like a man getting ass-raped by a horse, it's a white man.
I mean even in the west where you can hardly see an ad with a white couple anymore, they don't go that far (at least not yet).
White people are a minority on earth and anti-white racism sure seems to be alive and well (btw my family is of all the colors and we speak three languages at home, so don't even try me).
You act as though they first decided to make an image representing Westerners and then chose that particular scene as an intentional insult, but you need to consider that they likely made thousands of test images, most of which were just playing around with the model's capabilities and not specifically crafted for the announcement post.
So why did this one get picked? I think it boils down to the visual gag being funny and the movie-like quality.
Which is really apt because in Serbian "konj", or horse, is a colloquial word for moron. So, horses riding people is a perfect representation of the reality of the Serbian government.
Another fun fact, the parliament building in HL2's City 17 was modelled from that building.
1. I’d wager that given their previous release history, this will be open‑weight within 3-4 weeks.
2. It looks like they’re following suit with other models like Z-Image Turbo (6B parameters) and Flux.2 Klein (9B parameters), aiming to release models that can run on much more modest GPUs. For reference, the original Qwen-Image is a 20B-parameter model.
3. This is a unified model (both image generation and editing), so there’s no need to keep separate Qwen-Image and Qwen-Edit models around.
4. The original Qwen-Image scored the highest among local models for image editing in my GenAI Showdown (6 out of 12 points), and it also ranked very highly for image generation (4 out of 12 points).
Generative Comparisons of Local Models:
https://genai-showdown.specr.net/?models=fd,hd,kd,qi,f2d,zt
Editing Comparison of Local Models:
https://genai-showdown.specr.net/image-editing?models=kxd,og...
I'll probably be waiting until the local version drops before adding Qwen-Image-2 to the site.
Qwen 2512 (December edition of Qwen Image)
* 19B parameters, which was a 40GB file at FP16 and fit on a 3090 at FP8. Anything less than that and you were in GGUF format at Q6 to Q4 quantizations… which were slow, but still good quality.
* used Qwen 2.5 VL. So a large model and a very good vision model.
* And IIRC, their own VAE, which had known and obvious issues with high-frequency artifacts. Some people would take the image and pass it through another VAE (like the WAN video model's) or upscale-downscale to remove them.
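The arithmetic behind those "fits on a 3090" claims is simple: parameter count times bits per weight. A minimal sketch (the bits-per-weight figures for the GGUF quants are rough averages, and this counts weights only, not activations or KV/overhead):

```python
def weight_footprint_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Qwen-Image (19B) at common precisions
for name, bits in [("FP16", 16), ("FP8", 8), ("Q6_K (~6.6 bpw)", 6.6), ("Q4_K (~4.5 bpw)", 4.5)]:
    print(f"{name:>16}: ~{weight_footprint_gb(19, bits):.0f} GB")
```

FP16 comes out around 38 GB (matching the ~40GB file once you add non-weight tensors), and FP8 around 19 GB, which is why it squeaks onto a 24GB 3090.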
Qwen 2 now is
* a 7B-param model, sitting right among Flux.2 Klein 9B (non-commercial), Z-Image 7B (Apache), and Klein 4B (Apache); the license for this one is unknown. Direct competition, and it will fit on many more GPUs even at FP16.
* upgrades to Qwen 3 VL, I assume this is better than the already great 2.5 VL.
* Unknown on the new VAE. Flux.2's new 128-channel VAE is excellent, but it hasn't been out long enough for even a frontier Chinese model to pick it up.
Overall, you’re right this is on the trend to bring models on to lower end hardware.
Qwen was already excellent and now they rolled Image and Edit together for an “Omni” model.
Z-Image was the model to beat a couple of weeks ago… and now it looks like both Klein and Qwen will beat it! It has been disappointing to see how Z-Image just refuses to take on multiple new training concepts. Maybe they tried to pack it too tightly.
Open weights for this will be amazing. THREE direct competitors all vying to be “SDXL2” at the same time.
The Qwen convention was confusing! You had Image, 2509, Edit, 2511 (Edit), 2512 (Image) and then the Lora compatibility was unspecified. It’s smart to just 2.0 this mess.
I'm really looking forward to running the unified model through its paces.
If I were to guess, I would say that Z-Image's life is shorter than it initially appeared, even as a refiner, which is really just a workaround for model issues.
What's interesting is that the bottleneck is no longer the model — it's the person directing it. Knowing what to ask for and recognizing when the output is good enough matters more than which model you use. Same pattern we're seeing in code generation.
The fight right now, outside of API SOTA, is over who will replace SDXL as the "community preference".
It's a three-way race between Flux.2 Klein, Z-Image, and now Qwen2.
I want the ability to lean into any image and tweak it like clay.
I've been building open source software to orchestrate the frontier editing models (skip to halfway down), but it would be nice if the models were built around the software manipulation workflows:
In my (very personal) opinion, they're part of a very small group of organizations that sell inference under a sane and successful business model.
I was a mod on MJ for its first few years and got to know MJ's founder through discussions there. He already had "enough" money for himself from his prior sale of Leap Motion to do whatever he wanted. And, he decided what he wanted was to do cool research with fun people. So, he started MJ. Now he has far more money than before and what he wants to do with it is to have more fun doing more cool research.
1. real time world models for the "holodeck". It has to be fast, high quality, and inexpensive for lots of users. They started on this two years ago before "world model" hype was even a thing.
2. some kind of hardware to support this.
David Holz talks about this on Twitter occasionally.
Midjourney still has incredible revenue. It's still the best looking image model, even if it's hard to prompt, can't edit, and has artifacting. Every generation looks like it came out of a magazine, which is something the other leading commercial models lack.
Even something like Flux.1 Dev which can be run entirely locally and was released back in August of 2024 has significantly better prompt understanding.
How? By magic? You fell for 'DeepSeek V3 is as good as SOTA'?
What Linux tools are you using for image generation models like Qwen's diffusion models, since LM Studio only supports text gen?
Sad state of affairs and seems they're enshittifying quicker than expected, but was always a question of when, not if.
Other people gave you the right answer, ComfyUI. I’ll give you the more important why and how…
There is a huge push by people to use everything but Comfy because of its intimidating barrier to entry. It's not that bad. Learn it once and be done; you won't have to keep learning the UI of the week endlessly.
The how: go to Civitai, find an image you like, and drag-and-drop it into Comfy. If it has a workflow attached, Comfy will show you. Install any missing nodes they used, click the loaders to point at your models instead of theirs, hit run, and get the same or a similar image. You don't need to know what any of the things do yet.
If for some reason that just does not work for you… SwarmUI is a front end to Comfy. You can change things and it will show you, on the Comfy side, what they're doing. It's a gateway drug to learning Comfy.
EDIT: the most important thing no one will tell you outright… DO NOT FOR ANY REASON try to skip the venv or miniconda virtual environment when using Comfy! You must make a new, clean setup. You will never get the right combination of Python, torch, diffusers, and drivers on your system install.
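For anyone who hasn't done this before, a minimal sketch of that isolated setup (paths and names here are examples; the repo URL is ComfyUI's official GitHub, and note the last two steps need network access):

```shell
# Create an isolated environment so ComfyUI's pinned torch/diffusers
# never touch your system Python install.
python3 -m venv ~/comfy-venv
source ~/comfy-venv/bin/activate

# Install ComfyUI *inside* the venv.
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt

# Launch; the UI serves on http://127.0.0.1:8188 by default.
python main.py
```

Everything pip installs after `activate` stays inside `~/comfy-venv`, so a broken torch upgrade is a `rm -rf` away from fixed instead of a system reinstall.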
Engine:
* https://github.com/LostRuins/koboldcpp/releases/latest/
Kcppt files:
LinkedIn is filled with them now.
Yes, cringeworthy, but at least not addictive! It's like Facebook all those years ago: I can IM friends from high school without having to pay any attention to the feed.
Much like the pointless ASCII diagrams in GitHub readmes (big rectangle with bullet points flows to another...), the diagrams are cognitive slurry.
See Gas Town for non-Qwen examples of how bad it can get:
https://news.ycombinator.com/item?id=46746045
(Not commenting on the other results of this model outside of diagramming.)
Thank you for this phrase. I don't think that bad diagrams are limited to the AI in any way and this perfectly describes all "this didn't make things any clearer" cases.
When I used the exact prompt from the post, the chat works: it gives me the exact output from the blog post.
Then I used Google Translate to understand the prompt format. The prompt is: A 4x6 panel comic, four lines, six panels per line. Each panel is separated by a white dividing line.
The first row, from left to right: Panel 1: Panel 2: .....
But when I try to change the inputs, the comic example fails miserably. It keeps creating random grids, sometimes 4x5, other times 4x6, but by the third row the model gets confused and the output has only 3 panels. Other times the English dialogue is replaced with Chinese dialogue. So, not very reliable in my book.
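If the working prompt's trick is spelling out the grid row by row, that structure can at least be generated mechanically instead of retyped for each variation. A tiny sketch (the helper name and exact wording are mine, modeled on the translated prompt above; no claim this fixes the model's counting):

```python
def comic_prompt(rows: int, cols: int, panels: list[str]) -> str:
    """Build an explicit row-by-row panel prompt, one description per panel."""
    assert len(panels) == rows * cols, "need exactly one description per panel"
    lines = [f"A {rows}x{cols} panel comic, {rows} rows, {cols} panels per row. "
             "Each panel is separated by a white dividing line."]
    for r in range(rows):
        row = panels[r * cols:(r + 1) * cols]
        parts = [f"Panel {r * cols + c + 1}: {p}" for c, p in enumerate(row)]
        lines.append(f"Row {r + 1}, from left to right: " + " ".join(parts))
    return "\n".join(lines)

print(comic_prompt(2, 3, [f"scene {i}" for i in range(1, 7)]))
```

Keeping the panel numbering global and sequential (Panel 1 through Panel N) mirrors the blog post's format, which is the one layout the commenter found the model handles reliably.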
"""A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky. Mid-ground, eye-level composition: A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat, his short, messy dark brown hair plastered to his forehead, his thick beard slightly damp; he wears a badly worn, grey-green medieval-style robe, the fabric torn and stained with mud in several places, a thick hemp rope tied around his waist, and scratched ankle-high leather boots; his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight. The background is a range of undulating grey-blue mountains, their outlines stark, their peaks hidden beneath a low-hanging, leaden-grey, cloudy sky. The thick clouds diffuse a soft, diffused light, which pours down naturally from the left front at a 45-degree angle, casting clear and voluminous shadows on the horse's belly, the back of the man's hands, and the cracked ground. The overall color scheme is strictly controlled within the earth tones: the horsehair is warm brown, the robe is a gradient of gray-green-brown, the soil is a mixture of ochre, dry yellow earth, and charcoal gray, the dust is light brownish-gray, and the sky is a transition from matte lead gray to cool gray with a faint glow at the bottom of the clouds. 
The image has a realistic, high-definition photographic quality, with extremely fine textures—you can see the sweat on the horse's neck, the wear and tear on the robe's warp and weft threads, the skin pores and stubble, the edges of the cracked soil, and the dust particles. The atmosphere is tense, primitive, and full of suffocating tension from a struggle of biological forces."""
(I don’t even know if I’m being sarcastic)