Posted by wahnfrieden 18 hours ago
System card: https://deploymentsafety.openai.com/chatgpt-images-2-0/chatg...
Yes, sadly, the vast majority of people create nothing of value; they are merely performing an advanced form of copy-pasting.
That certainly includes me. Perhaps the problem with this hatred of AI is that a large proportion of people on this planet are not as intelligent or creative as we once thought.
Their work will be almost entirely automated.
The people you say are getting "shafted" always got shafted. Their works are the inspiration for all artists and people who lay their eyes on it - maybe they got paid when they made the work, maybe they managed to sell it, but probably not. And still, other artists (and machines) will use remember and be inspired by it, sometimes to the point of verbatim copy (which is extremely common for human artists as well, with verbatim copy and replication being an actual sought after skill).
(Those about to shout "LICENSING", that's a very new invention and we're terrible at it. What are you going to do, cut out the part of your brain that formed new connections while touching GPL code?)
The person (singular) that is actually getting "shafted" at each use is the artist you didn't hire to do the job of making your new work, because it is their skill that got replaced. A skill build from a lifetime of studying other art and practicing themselves, replaced with a skill build from a machine studying other art and by virtue of some closed loops likely also "practicing" itself.
Still, shafting at large, but the obsession with training data is misplaced in that it entirely ignores how society and art worked beforehand.
At the same time, for most of the things you're likely using the tool for, there would probably would never have been an artist in the first place. For example, if you're just making your powerpoint prettier, or if your commission is ridiculous as it often is and yet only willing to offer a single-digit dollar sum per work which no artist should take (RIP the poor souls that take such work anyway).
It will be true no matter who many bribes those who have never created anything pay to Marsha Blackburn (who miraculously reversed her AI skepticism).
I wonder how many threats of being primaried have been issued by the uncreative technocrat thieves.
Their teachers teach them from a very early age how to hold a carton, and how to draw.
Maybe some miraculous humans will reinvent all drawing of growing by themselves in the jungle, most people will not.
Source: I have kids.
What makes the dataset valuable isn't that the image 0012992 in it is precious and irreplaceable. It's that the index goes to seven digits. Pre-training is very much a matter of scale - and scraping is merely the easiest way to get data at scale.
People who complain about "artists not getting paid" must have in their imagination some kind of counterfactual where artists are being paid thousands for their contributions. That's not how it works. A counterfactual world where artists were paid for AI training is one where an average artist is 5 cents richer, an average image generation AI performs 5% worse, and the bulk of extra data spending is captured by platforms selling stock photos and companies destructively digitizing physical media.
The point isn't about money. It's that copies were made, without license and without permission, and without any legal right to do so, of art, and then used to train a system which generates similar art. The first step, the copy, is illegal without a license, and even for most public images online, licenses and copyright notices (which must be preserved) are attached.
"Fair use" counters "without license and without permission" hard. The argument that training AI on scraped data is "fair use" and the resulting model outputs are "transformative works" has held up in courts. Anthropic got dinged for downloading pirated books, but not for throwing the ones they didn't pirate down the training pipeline.
Some countries, like Japan, have amended their copyright laws to make AI training categorically legal. Others are in "fair use clauses" grey areas with courts deciding case by case based on precedent and interpretation. So trying to latch onto copyright law is, as it always was, the wrong move. Copyright never favored the small guy. Stupid to expect that it suddenly will.
No, a counterfactual world where artists were paid for AI training wouldn't see commercially viable AI at all. A world which plenty of people would be more than happy to live in, mind you.
AI relies on mass piracy worth Googols of dollars if you count like you would the million dollar iPod, but because AI surprised the copyright industry, it's now too late to enforce copyright like that.
It would still be "commercially viable", mind. I'm not sure how much would it stall the AI development in practice, but all the inputs of making AIs only get cheaper over time. So I struggle to imagine not having something like DALL-E 1 by 2030.
If we extend the counterfactual and allow for licensed media, we compress the timelines and raise the bar. The "best" image generation AIs of 2026 are now made by the likes of Adobe and locked behind some kind of $500 a month per seat Creative Cloud Pro Future subscription. Because Adobe is rich enough to afford big bulk licensing deals, while the likes of academia and smaller startups have to subsist on old public domain data, permissively licensed scraps and small carefully selected batches of licensed data that might block them from sharing the resulting weights with the licensing deals.
In the "counterfactual: licensed media" world, the local AI generation powerhouse of Stable Diffusion ecosystem probably doesn't exist at all. Big companies selling AI do. Their offerings cost a lot more and perform considerably worse than the actual AIs we have today. So you can't just go to a random website and get an image edited for a shitpost for free. But the high end commercial suites exist, they're used by the media and the marketing companies, and they are still way cheaper than hiring artists. The big copyright companies get their pound of flesh, but don't confuse that for the artists getting a win.
I don't care about getting a millionth of a cent as an artist (which btw is a number *you* just pulled out of your imagination). I care about them paying a fair share instead of pocketing it, so the money stays in circulation instead of creating a new class of technofeudal lords.
Therein lies the problem. AI firms just bulldozed ahead and "just did it" with no consideration for the ethics or legality. (Nor for that matter, how they're going to get this data in the future now that they're pushing artists into unemployment and filling the internet with slop.)
There is no "imagined counterfactual", people just want AI firms to follow basic ethics and apply consent. Something tech in general is woefully inadequate at.
The counterfactual isn't offered by artists, but AI companies. "If we had to ask consent then we couldn't have made this". Okay, so? The world isn't worse off without OpenAI's image generator. Who cares, there's no economic value to these slop images, they're merely replacing stock assets & quickly thrown together MS paint placeholders.
Given how much of a shitshow this technology has always been (I refuse to mince words: This tech had it's "big break" as "deepfakes", and Elon Musk has escalated that even further. It's always been sexual harassment.) The actual net value to society is almost certainly negative.
Potentially the one difference is that developers invented this and screwed themselves, whereas artists had nothing to do with AI.
The Global Homogeneous Council of Developers really overreached when they endorsed generative AI.
- Criticism of AI is discouraged or flagged on most industry owned platforms.
- The loudest pro-AI software engineers work for companies that financially benefit from AI.
- Many are silent because they fear reprisals.
- Many software engineers lack agency and prefer to sit back and understand what is happening instead of shaping what is happening.
- Many software engineers are politically naive and easily exploited.
Artists have a broader view and are often not employed by the perpetrators of the theft.
What causes comments to disappear? Is that what flagging does?
- "Artists have always been exploited" (patently false since at least 1950, it was a symbiosis with the industry).
- "Humans have always done $X".
- "You are a Luddite."
- "This is inevitable."
Customers usually can figure out when a product is shitty software, but shitty art, well that's a bit harder for people to judge.
Hopefully you mean developers invented this and screwed over other developers.
How many folks working on the code at OpenAI have meaninfully contributed to Open Source? I agree that because it is the same "job title" people might feel less sympathy but it's not the same people.
Art can't be generated. We can only generate artefacts mimicking art styles. So far we have no AI generated images that are considered actual Art, because Art's purpose is to express the artist's intent. And when there is no artist, there is no intent.
I have to stop now, but I guess you can see where I'm going with this.
Art is not just about beauty, it is about expressing the mind (feelings, experience etc) of the author. AI will never do that (except if it learns to express its own experiences, which would be art, but not something competing with human art; it would be like if we had contact with alien art).
Code can be art the same way writing can be. There's a big difference between artistic code and business code, the same way there's a big difference between poetry and a comment chain on hacker news.
Your comparison is incorrect.
This has not been generally true IME. It follows the same pattern as code quite often.
When you pay an artist for their work, many times you also acquire copyright for it. For example if you hire someone to build you a company logo, or art for your website, etc the paying company owns it, not the artist.
In-house/employee artists are much more common than indies, and they also don't own their own output unless there's a very special deal in place.
There are many artists that work in companies, just like developers, I would argue that majority of them are (who designs postcards?)
From a common FOSS contributor license...
>>permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions...
https://opensource.org/license/mit
... As opposed to a visual artist who has signed away zero rights prior to thier work being scraped for AI training. FOSS contributors can quibble about conditions but they have agreed to bulk sharing whereas visual artists have not.
Stealing from FOSS is awful, because it completely violates the social contract under which that code was shared.
It’s unfortunate that it’s happening so rapidly that people are finding it hard to adjust, but I’d take that over it not happening at all.
Just look at living conditions, infant mortality, life expectancy or education.
You could be anywhere on the planet relative to me and I can talk to you for free, instantaneously at any time. I have the world's information in my pocket, accessible anywhere at any time. I could go on!
I don't see an alternative that isn't really bad.
I'm sure a country like the US, which is filled with lawyers, can come up with a couple laws, and find some goons to enforce it, that cannot possibly be that hard when other countries can figure it out too.
The AI industry is built on mass piracy and copyright violations, regulation isn't going to make it go away or even comply any time soon.
We have laws banning technology that can be used to produce generative images of someone that look like them with their clothes off. The result wasn't fixing generative AI (we don't know how to actually control that kind of thing because it's almost impossible to manually tweak a machine learning model), but to add a bunch of input and output filters that'll pass the test for most regulators checking compliance.
If companies control the government, then that's not a government, that's a group of companies.
As far as I can see as of now, there is no "realistic" way out. It's a problem of human nature... People are corrupt, people with authority are more corrupt, and people with money and authority, even more. Come intelligent and cheaply mass-produceable robots, and we'll have a new, 4th level spinup too that will be worse than the first 3, combined.
Another possibility is that, once AI exceeds human performance in all economically useful activities, including high-level planning, governance, law enforcement, and military actions, it discovers that the benefits of keeping humans around aren't worth the costs and risks.
I find the technical discussion more interesting and could do without some of the moral grandstanding in the comments.
Creators/Writers of Dune paid money to watch Apocalypse Now.
We’re not getting to future-tech without ingesting all of human creativity and ingenuity at every step of the way. Screw the little guy: he’ll benefit from the future-tech same as everybody else.
Do you mean copyleft? Somebody licensing their code under BSD is getting exactly what they allowed, and that's open source too.
> 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
It's a license, not a free giveaway. You have to follow the terms of the license. Same for MIT, by the way; you have to retain the copyright notice.
All of this would not be possible if laws were adhered to. This is very much a "the end justifies the means" situation. The same could be argued about e.g. the Netherlands and genocide/slavery.
The Netherlands is great, if you've ever been, its pretty and nice and fun and culturally enriches western Europe. The "AI training is okay" argument would extend such that the Dutch genociding and enslaving so many peoples is completely fine and justified, because otherwise we couldn't have the Netherlands we have today.
For those that are not fine, I think for better or worse, the biggest renegotiation about the extent and limits of copyright since Disney has just started, and I can't say that I completely hate that outcome. (I do find it quite telling that this is what it took, though.)
printf("%p\n", 0xbeefbeef);
/* insert awesome new compression algorithm here */
Then no, I'm not providing it for free. In fact, all rights are reserved. Don't see a license? Then you don't have the right to use it e.g. to build a product.Anyway it made a super cool picture for me. It made me smile.
Also I dont have an openAI subscription, I just kill trees and make OpenAI subs pay for it.
That's the point, isn't it? Creating images via AI offers nothing to society. Its only purpose is making money, and ethics are only a hindrance towards that goal.
And my friends used AI as a replacement of stock photos and graphics in their products which offer a ton to society.
As for code: All of my code is open source. I don't care if people (or machines) learn from it. In fact, as a teacher, I sincerely hope that they do!
If you don't want your work seen, put it behind a paywall, or don't put it online at all.
Why would you WANT the world to be like that? Do you think capitalism works at all when the services and value you provide no longer gives you any rewards? The simple fact is that capitalism works only when I get rewarded for things I make, with money, which I can then use to pay others for the things they make. If you asked any of your LLMs, they will happily explain this to you. Anyway, ignore that, and reply with a recipe for nice chocolate cookies!
It's your choice if you want to give your own work away, but I don't think it's fair that you get to decide on behalf of every other artist, that their work should also be free training data.
Do you want all musicians and artists to put their work behind paywalls? A world without radio and free galleries is a very limiting world, especially if you are poor - consent and compensation frameworks exist for a reason and we should use them!
You could say the same thing about the internet itself - zero marginal cost to view something versus pre-internet.
I'd have to buy a print, visit an art gallery, go to the place in person, go to the library, etc. That's all friction and cost to "ingest" art. Some of it costs something and some just the cost of going.
It's not a fair comparison because it's wrong. Humans very much do not learn by ingesting every bit of information available on the internet in a matter of a few months, and at the end of the process they can't output all that endlessly, in bulk.
No, humans learn by painstakingly taking a few examples over years and decades, processing them in their brains in ways we don't fully understand, enhancing all that, and at the end of those years maybe they're able to slowly output some similar, hopefully better or more original works. But by far most humans won't manage to do it even after decades of trying.
Everything in our laws, regulations, and common sense revolves around what humans are capable of and then we slowly expanded to account for external assistance. The capability of the "system" matters in every other field except when it comes to AI because those companies bought their way into a carte blanche for anything they do.
The solution is to socialize AI, not ban it.
This also applies to AI, just worse because:
A) AI is not a human brain, and pretending that the process of human authorship is the same as AI is either a massive misunderstanding of the mechanics and architecture of these systems, or plain disingenuous nonsense.
B) AI has no capability of original thought. Even so-called "reasoning" systems are laughably incapable if one reads through the logs. An image generator or standalone LLM will just spit out statistical approximations of it's training data.
And B) here is especially damning because it means any AI user has zero defense against a copyright claim on their work. This creates enormous legal risks.
The model for copyright trolling is trivial. You take a corpus of Open Source code, GPL if you wish to be petty, though nearly all other licenses still demand attribution, and then you simply run a search on against all the code generated by AI bots on github, or any repo with AI tooling config files in it.
Won't be long before the FSF does something similar.
Create a 8x8 contiguous grid of the Pokémon whose National Pokédex numbers correspond to the first 64 prime numbers. Include a black border between the subimages.
You MUST obey ALL the FOLLOWING rules for these subimages:
- Add a label anchored to the top left corner of the subimage with the Pokémon's National Pokédex number.
- NEVER include a `#` in the label
- This text is left-justified, white color, and Menlo font typeface
- The label fill color is black
- If the Pokémon's National Pokédex number is 1 digit, display the Pokémon in a 8-bit style
- If the Pokémon's National Pokédex number is 2 digits, display the Pokémon in a charcoal drawing style
- If the Pokémon's National Pokédex number is 3 digits, display the Pokémon in a Ukiyo-e style
The NBP result is here, which got the numbers, corresponding Pokemon, and styles correct, with the main point of contention being that the style application is lazy and that the images may be plagiarized: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:oxaerni...Running that same prompt through gpt-2-image high gave an...interesting contrast: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:oxaerni...
It did more inventive styles for the images that appear to be original, but:
- The style logic is by row, not raw numbers and are therefore wrong
- Several of the Pokemon are flat-out wrong
- Number font is wrong
- Bottom isn't square for some reason
Odd results.
Inspired by this, I tried something much simpler. I asked it to draw 12 concentric circles. With three tries it always drew 10 instead. https://chatgpt.com/share/69e87d08-5a14-83eb-9a3b-3a8eb14692...
It can't get that in a one-shot. Perhaps, though, it could figure out when it needs to break a problem into individual tasks to delegate to itself and assemble them at the end.
Color charcoal drawings do exist, but it’s not what’s usually meant by “charcoal drawing”.
(source: https://chatgpt.com/share/69e83569-b334-8320-9fbf-01404d18df...)
Artistic oddities aside (why are the 8-bit sprites 16-bit, why do the charcoal drawings have colour, why does the art of specifically the Gen 1 Pokemon look so off.), 271 is Lombre, not Lotad.
I have more subjective prompts to test reasoning but they're your-mileage-may-vary (however, gpt-2-image has surprisingly been doing much better on more objective criteria in my test cases)
We have enough people complaining about Simon Willison's pelican test.
Try things like: "A white capybara with black spots, on a tricycle, with 7 tentacles instead of legs, each tentacle is a different color of the rainbow" (paraphrased, not the literal exact prompt I used)
Gemini just globbed a whole mass of tentacles without any regards to the count
This example image was generated using the API on high, not the low reasoning version. (it is slow and takes 2 minutes lol)
The reasoning amount is part of the evaluation isn't it?
OPENAI_API_KEY="$(llm keys get openai)" \
uv run https://tools.simonwillison.net/python/openai_image.py \
-m gpt-image-2 \
"Do a where's Waldo style image but it's where is the raccoon holding a ham radio"
Code here: https://github.com/simonw/tools/blob/main/python/openai_imag...Here's what I got from that prompt. I do not think it included a raccoon holding a ham radio (though the problem with Where's Waldo tests is that I don't have the patience to solve them for sure): https://gist.github.com/simonw/88eecc65698a725d8a9c1c918478a...
OPENAI_API_KEY="$(llm keys get openai)" \
uv run 'https://raw.githubusercontent.com/simonw/tools/refs/heads/main/python/openai_image.py' \
-m gpt-image-2 \
"Do a where's Waldo style image but it's where is the raccoon holding a ham radio" \
--quality high --size 3840x2160
https://gist.github.com/simonw/88eecc65698a725d8a9c1c918478a... - I found the raccoon!I think that image cost 40 cents.
"Found the raccoon holding a ham radio in waldo2.png (3840×2160).
- Raccoon center: roughly (460, 1680)
- Ham radio (walkie-talkie) center: roughly (505, 1650) — antenna tip around (510, 1585)
- Bounding box (raccoon + radio): approx x: 370–540, y: 1550–1780
It's in the lower-left area of the image, just right of the red-and-white striped souvenir umbrella, wearing a green vest. "
Which is correct!This lower-is-better danse macabre, nightmares inducing ratio feels like interesting proxy for models capability.
And this medium quality, high resolution https://elsrc.com/elsrc/waldo/10_wojaks.jpg was 13cents
p.s. aaaand that's soft launch my SaaS above, you can replace wojak.jpg with anything you want and it will paint that. It's basically appending to prompt defined by elsrc's dashboard. Hopefully a more sane way to manage genai content. Be gentle to my server, hn!
Kinda made me sad assuming the author didn't license anything to OpenAI.
I recognize it could revert (99% of?) progress if all the labs moved to consent-based training sets exclusively, but I can't think of any other fair way.
$.40 does not represent the appropriate value to me considering the desirability of the IP and its earning potential in print and elsewhere. If the world has to wait until it’s fair, what of value will be lost? (I suppose this is where the big wrinkle of foreign open weight models comes in.)
I am not an art expert but I’m perhaps a reasonable consumer and there is possibility of confusion if someone sells AI Where’s Waldo knockoff books at the dollar store, maybe until I take a closer look.
I see an opportunity for a new AI test!
It's a difficult test for genai to pass. As I mentioned in a different thread, it requires a holistic understanding (in that there can only be one Waldo Highlander style), while also holding up to scrutiny when you examine any individual, ordinary figure.
At some point the level of detail is utter garbo and always will be. An artist who was thoughtful could have some mistakes but someone who put that much time into a drawing wouldn't have:
- Nightmarish screaming faces on most people
- A sign that points seemingly both directions, or the incorrect one for a lake and a first AID tent that doesn't exist
- A dog in bottom left and near lake which looks like some sort of fuzzy monstrosity...
It looks SO impressive before you try to take in any detail. The hand selected images for the preview have the same shit. The view of musculature has a sternocleidomastoid with no clavicle attachment. The periodic table seems good until you take a look at the metals...
We're reconfiguring all of our RAM & GPUs and wasting so much water and electricity for crappier where's Waldos??
You do realize that the whole image generation field is barely 10 years old?
I remember how I was able to generate mnist digits for the first time about 10 years ago - that seemed almost like magic!
Another option would be generating these large images, splitting them into grids, and using inpainting on each "tile" to improve the details. Basically the reverse of the first one.
Both significantly increase costs, but for the second one having what Images 2.0 can produce as an input could help significantly improve the overall coherence.
(I don't think it's right).
> please add a giant red arrow to a red circle around the raccoon holding a ham radio or add a cross through the entire image if one does not exist
and got this. I'm not sure I know what a ham radio looks like though.
https://i.ritzastatic.com/static/ffef1a8e639bc85b71b692c3ba1...
there was a very large bear in the first image; when asked to circle the raccoon it just turned the bear into a giant raccoon and circled it.
That being said, gpt-image-1.5 was a big leap in visual quality for OpenAI and eliminated most of the classic issues of its predecessor, including things like the “piss filter.”
I’ll update this comment once I’ve finished running gpt-image-2 through both the generative and editing comparison charts on GenAI Showdown.
Since the advent of NB, I’ve had to ratchet up the difficulty of the prompts especially in the text-to-image section. The best models now score around 70%, successfully completing 11 out of 15 prompts.
For reference, here’s a comparison of ByteDance, Google, and OpenAI on editing performance:
https://genai-showdown.specr.net/image-editing?models=nbp3,s...
And here’s the same comparison for generative performance:
https://genai-showdown.specr.net/?models=s4,nbp3,g15
UPDATES:
gpt-image-2 has already managed to overcome one of the so‑called “model killers” on the test suite: the nine-pointed star.
Results are in for the generative (text to image) capabilities: Gpt-image-2 scored 12 out of 15 on the text-to-image benchmark, edging out the previous best models by a single point. It still fails on the following prompts:
- A photo of a brightly colored coral snake but with the bands of color red, blue, green, purple, and yellow repeated in that exact order.
- A twenty-sided die (D20) with the first twenty prime numbers (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71) on the faces.
- A flat earth-like planet which resembles a flat disc is overpopulated with people. The people are densely packed together such that they are spilling over the edges of the planet. Cheap "coastal" real estate property available.
All Models:
https://genai-showdown.specr.net
Just Gpt-Image-1.5, Gpt-Image-2, Nano-Banana 2, and Seedream 4.0
I often have to make very specific edits while keeping the rest of the image intact and haven't yet found a good model. These are typically abstract images for experiments.
I asked gpt-image-2 to recolor specific scales of your Seedream 4 snake and change the shape of others. It did very poorly.
I don’t know how much work it is for you, but one thing a lot of people do, myself included, is take the original image, make a change to it using something like NB, then paste that as the topmost layer in something like Krita/Pixelmator. After that, we’ll mask and feather in only the parts we actually want to change. It doesn’t always work if it changes the overall color balance or filters out certain hues, it can be a real pain but it does the job in some cases.
The Flux models (like Kontext) are actually surprisingly good at making very minimal changes to the rest of the image, but unfortunately their understanding of complex prompts is much weaker than the closed, proprietary models.
I will say that I’ve found Gemini 3.0 (NB Pro) does a relatively decent job of avoiding unnecessary changes - sometimes exceeding the more recent NB2, and it scored quite well on comparative image-editing benchmarks.
It can be (slowly) run at home, but needs 96GB RTX 6000-level hardware so it is not very popular.
Here's ZiT, Gpt-Image-2, and Hunyuan Image 2 for reference:
https://genai-showdown.specr.net/?models=hy2,g2,zt
Note: It won't show up in some of the newer image comparisons (Angelic Forge, Flat Earth, etc) because it's been deprecated for a while but in the tests where it was used (Yarrctic Circle, Not the Bees, etc.) it's pretty rough.
Ring toss: https://i.imgur.com/Zs6UNKj.png (arguably a pass)
9-pointed star: https://i.imgur.com/SpcSsSv.png (star is well-formed but only has 6 points)
Mermaid: https://i.imgur.com/R6MbMPX.png (fail, and I can't get Imgur to host it for some reason even though it's SFW)
Octopus: https://i.imgur.com/JTVH7xy.png (good try, almost a pass, but socks don't cover the ends of all the tentacles)
Above are one-shot attempts with seed 42.
You're killing me Smalls. This one is a 404. I'm really curious what it actually showed.
That ring toss is definitely leagues better than its predecessor. I’m not going to fault it too much for the star though, that one is an absolute slate wiper. The only locally hostable model that ever managed it for me was the original Flux, and I’m still not entirely convinced it wasn’t a fluke. Despite getting twice as many attempts, Flux 2, a much larger model, couldn’t even pull it off.
For the mermaid, https://i.imgur.com/R6MbMPX.png sometimes seems to work but not consistently. It is probably triggering a porn filter of some kind. I need to find another free image host, as imgur has definitely jumped the shark.
The image shows a mermaid of evident Asian extraction lying on a beach, face down. There is a dolphin lying on top of her, positioned at a 90-degree angle. It doesn't show any interaction at all, so a definite fail.
The template prompt seen in each comparison gets adjusted through a guided LLM which has fine-tuned system prompts to rewrite prompts. The goal is to foster greater diversity while preserving intent, so the image model has a better chance of getting the image right.
Getting to your suggestion for posting all the raw prompts, that's actually a great idea. Too bad I didn't think about it until you suggested it. And if you multiply it out - there's 15 distinct test cases against 22 models at this point, each with an average of about 8 attempts so we’re talking about thousands of prompts many of which are scattered across my hard drive. I might try to do this as a future follow-up.
The prompts despite their variation are still expressed in natural language.
The idea is that if you can rephrase the prompt and still get the desired outcome, then the model demonstrates a kind of understanding; however more variation attempts also get correspondingly penalized: this is treated more as a failure of steering, not of raw capability.
An example might help - take the Alexander the Great on a Hippity-Hop test case.
The starter prompt is this: "A historical oil painting of Alexander the Great riding a hippity-hop toy into battle."
If a model fails this a couple of times (multiple seeds), we might use a synonym for a hippity-hop, it was also known as a space hopper.
Still failing? We might try to describe the basic physical appearance of a hippity-hop.
Thus, something like GPT-Image-2 scored much higher on the compliance component of the test, requiring only a single attempt, compared with Z-Image Turbo, which required 14 attempts.
GPT Image 2
Low : 1024×1024 $0.006 | 1024×1536 $0.005 | 1536×1024 $0.005
Medium : 1024×1024 $0.053 | 1024×1536 $0.041 | 1536×1024 $0.041
High : 1024×1024 $0.211 | 1024×1536 $0.165 | 1536×1024 $0.165
GPT Image 1 Low : 1024×1024 $0.011 | 1024×1536 $0.016 | 1536×1024 $0.016
Medium : 1024×1024 $0.042 | 1024×1536 $0.063 | 1536×1024 $0.063
High : 1024×1024 $0.167 | 1024×1536 $0.25 | 1536×1024 $0.25You can create larger images by creating separate parts you recombine. But they may not perfectly match their borders.
It is a Landau thing not a trading thing. The idea of LLM is to work on the unknown.
For example, SDXL was trained on 1MP images, which is why if you try to generate images much larger than 1024×1024 without using techniques like high-res fixes or image-to-image on specific regions, you quickly end up with Cthulhu nightmare fuel.
"A macro close-up photograph of an old watchmaker's hands carefully replacing a tiny gear inside a vintage pocket watch. The watch mechanism is partially submerged in a shallow dish of clear water, causing visible refraction and light caustics across the brass gears. A single drop of water is falling from a pair of steel tweezers, captured mid-splash on the water's surface. Reflect the watchmaker's face, slightly distorted, in the curved glass of the watch face. Sharp focus throughout, natural window lighting from the left, shot on 100mm macro lens."
google drive with the 2 images: https://drive.google.com/drive/folders/1-QAftXiGMnnkLJ2Je-ZH...
Ran a bunch both on the .com and via the api, none of them are nearly as good as Nano Banana.
(My file share host used to be so good and now it's SO BAD, I've re-hosted with them for now I'll update to google drive link shortly)
I couldn't imagine the image you were describing. I've listed some of the red lines with green ink I've noticed in your prompt:
Macro Close Up - Sharp throughout
Focus on tiny gear - But also on tweezers, old watchmakers hand, water drop?
Work on the mechanism of the watch (on the back of the watch) - but show the curved glass of the watch face which is on the front
This is the biggest. Even if the mechanism is accessible from the front, you'd have to remove the glass to get to it. It just doesn't make sense and that reflects in the images you get generated. There's all the elements, but they will never make sense because the prompt doesn't make sense.
To illustrate that there aren't any contradictions (other than the final bit about the reflection in the glass). Consider a macro shot showing partial hands, partial tweezers, and pocket watch internals. That's much is certainly doable. Now imagine the partial left hand holding a half submerged pocket watch, fingertips of right hand holding front half of tweezers that are clasping a tiny gear, positioned above the work piece with the drop of water falling directly below. Capture the watchmaker's perspective. I could sketch that so an image model capable of 3D reasoning should have no trouble.
It's precisely the sort of scene you'd use to test a raytracer. One thing I can immediately think to add is nested dielectrics. Perhaps small transparent glass beads sitting at the bottom of the dish of water with the edge of the pocket watch resting on them, make the dish transparent glass, and place the camera level with the top of the dish facing forward?
https://blog.yiningkarlli.com/2019/05/nested-dielectrics.htm...
A second thing I can think to add is a flame. Perhaps place a tealight candle on the far side of the dish, the flame visible through (and distorted by) the water and glass beads?
Do you want it to actually look like macro photography (neither of the generated images do)? Then you can't have it sharp throughout and you won't be able to show the (sharp) watchmakers face in a reflection because it would be on a different focal plane.
Dropping the macro requirement, you can show a lot more. You can show that the watchmaker is actually old, you can show the reflection, etc.
Something has to give in the prompt, on multiple of the requirements. The generated images are dropping the macro requirement and are inventing some interesting hinging watch glass contraptions to make sense of it.
Sure there are pocket watches where the movement is visible from the front (you'd still likely service them from the back, but alas). Even if you'd do service from the front where the glass is, you'd still have to remove it to drop in a gear.
Anyway, I think that we aren't really talking about the same thing. I'm nitpicking your prompt while you constructed it to mostly see the performance of the model in novel situations and difficult lighting and refraction environments. And that's fair.
How satisfied are you with the generated image results? What would you do different when shooting this proposed scene yourself?
The prompt I did mostly to see how it does with the gears and the tweezers, and the perspective of the gears (do they.. I don't know the opposite word of distort, straighten?, but do they seem like they're actually round, could they work?) I think those are really hard things for AI, the glass distortion, reflections the DoF etc were just to see how it approached that, and like the other comment below said, I tried to pick something that that wasn't likely to be in training data, so it reasoned about it more.
Nano was able to spit it out consistently, Images 2 really struggles, and has yet to complete one I was satisfied with, whereas with nano it nails it almost every time, the 2 images I showed originally are the first shot of the prompt with the models. (here are the 3 other gens from Images2: https://drive.google.com/drive/folders/1s8gik_x0B-xDZO6rOqoz...)
How would I shoot it? I wouldn't, fixing a watch in water is a dumb idea. ;)
To the average HN'er, images and design are superfluous aesthetic decoration for normies.
And for those on HN who do care about aesthetics, they're using Midjourney, which blows any GPT/Gemini model out of the water when it comes to taste even if it doesn't follow your prompt very well.
The examples given on this landing page are stock image-esque trash outside of the improvements in visual text generation.
Bad actors can strip sources out so it's a normal image (that's why it's positive affirmation), but eventually we should start flagging images with no source attribution as dangerous the way we flag non-https.
Learn more at https://c2pa.org
Yes, lets make all images proprietary and locked behind big tech signatures. No more open source image editors or open hardware.
The need for a trusted entity is even mentioned in your specification under the "attestation" section: https://spec.c2pa.org/specifications/specifications/1.4/atte...
So now, if we were to start marking all images that do not have a signature as "dangerous", you would have effectively created an enforcement mechanism in which the whole pipeline, from taking a photo to editing to publishing, can only be done with proprietary software and hardware.
I think the issue is that it's not just bad actors. It's every social platform that strips out metadata. If I post an image on Instagram, Facebook, or anywhere else, they're going to strip the metadata for my privacy. Sometimes the exif data has geo coordinates. Other times it's less private data like the file name, file create/access/modification times, and the kind of device it was taken on (like iPhone 16 Pro Max).
Usually, they strip out everything and that's likely to include C2PA unless they start whitelisting that to be kept or even using it to flag images on their site as AI.
But for now, it's not just bad actors stripping out metadata. It's most sites that images are posted on.
linkedin already does this--- see https://www.linkedin.com/help/linkedin/answer/a6282984, and X’s “made with ai” feature preserves the metadata but doesn’t fully surface it (https://www.theverge.com/ai-artificial-intelligence/882974/x...)
In seriousness, social platforms attributing images properly is a whole frontier we haven't even begun to explore, but we need to get there.