Posted by trq_ 3 days ago
The historical progression from text to still images to audio to moving images will hold true for AI as well.
Just look at OpenAI's own progression from LLM to multimodal to the Realtime API.
A co-worker almost 20 years ago said something interesting to me as we were discussing Al Gore's Current TV project: the history of information is constrained by "bandwidth". He mentioned how broadcast television went from 72 hours of "bandwidth" (3 channels x 24h) per day to now having so much bandwidth that we could have a channel with citizen journalists. Of course, this was also the same time that YouTube was taking off.
The pattern holds true for AI.
AI is going to create "infinite bandwidth".
You'll have to explain what you mean by this. Direct speech, text, illustrations, photos, abstract sounds, music, recordings, videos, circuits, programs, cells... these are all just different mediums with different characteristics. There is no "progression" apparent among them. Why should there be? They each fulfill different ends, and each is best suited to different occasions.
We seem to have discovered a new family of tools that help lossily transform content or intent from one of these mediums to some of the others, which is sure to be useful in its own ways. But it's not a medium like the above in the first place, and since none of them represents a progression, it certainly doesn't either.
> You'll have to explain what you mean by this
The progression of distribution: printing press, photos, radio, movies, television. The early web was text, then came images, then audio (the Napster age), and then video (remember that Netflix used to ship DVDs?).

The flip side of that is production and the ratio of producers to consumers. As the bandwidth for distribution increases, the cost and complexity for producers decreases, and naturally we see the same progression with producers on each new platform and distribution technology: text, still images, audio, moving images.
And as a sibling commenter noted, the individual histories of these media are disjoint and not really in the sequence you suggest.
Regardless, generative AI isn't a medium like any of these. It's a means to transform media from one type to another, at some expense and with the introduction of loss/noise. There's something revolutionary about how easily and how generally it can perform those transformations, but it's fundamentally more like a screwdriver than a video.
Pretty much all of them have followed the same pattern:
Text, images, audio, video, and maybe hologram as the end goal.
We are getting the same with AI today.
Your history is incorrect, though. Still images predate text, by a lot.
Cave paintings came before writing. Woodcuts came before the printing press.
It is entirely logical to say that AI development will follow the same progression as the early internet or as broadcast since they all fall under the same data constraints.
Ever since the release of Whisper and others, text-to-speech and speech-to-text have been more or less solved, while image generation seems to still sometimes have trouble. Earlier this week was a thread about how no image model could draw a crocodile without a tail.
Meanwhile, the first photographs predate the first sound recordings. And moving images without sound, of course, predate moving images with sound.
The original poster was trying to sound profound as though there was some set sequence of things that always happens through human development. But the reality is a much more mundane "less complex things tend to be easier than more complex things".
> The original poster was trying to sound profound
I'm just here trying to justify why NVDA is still a growth stock; we're nowhere near peak gen AI.

Toddlers also learn to recognize drawings, and to do simple drawings themselves, way before they learn to read or write.
Image definitely predates text.
There is overlap between text-to-image and text-to-video -- image models would help video models animate interesting or complex prompts; video would help image models learn to differentiate features, since there are additional clues in how the image changes and what remains the same.
There's overlap among audio, text transcripts, and video around learning to animate speech, e.g. by learning how faces move with the corresponding audio/text.
There's overlap between sound and video -- e.g. being able to associate sounds like a dog barking without direct labelling of either.
That's not what LLMs do. More like AI art.
Vision and audio play a nice role, but that's because of humans and reality. A real world <-> vision|audio <-> processing pipeline makes sense. But a processing <-> data <-> vision|audio <-> data <-> processing cycle is just nonsense and a waste of resources.
There have been a lot of attempts over the years, with varying degrees of accuracy, but I don't know if you can go as far as to "formalize" it. Beyond the syntax (tokenising, syntactic chunking, and beyond) there is the intent, and that is super hard to measure. And possibly the problem with these prompts is that they get things right a lot of the time but wrong, say, 5% of the time, purely because they couldn't formalize it. My web hosting has 99.99% uptime, which is a bit more reassuring than 95%.
I think of how the USA had cable TV and hundreds of channels projecting all kinds of whatever in the 80s, while here in the UK we were limited to a handful of channels. To be fair, those few channels gave people something to talk about the next day, because millions of people saw the same thing. Surely a lot of what mankind has done is to tame entropy, like steam engines etc.
With AI and everyone having a prompt, it's surely a game changer. How it works out, we'll see.
Was listening to a Seth Godin interview where he pointed out that there was a time when you had to purchase a slice of spectrum to share your voice on the radio. Nowadays you can put your thoughts on a platform, but that platform is owned by corporations who can and will put their thumb on thoughtcrime or challenges.
I really do love your comment. Cheers.
There's a related concept as well, which is that as "bandwidth" increases, the ratio of producers to consumers pushes upwards towards 1. My take is that generative AI will accelerate this.
I write a bit more in depth about it here: https://chrlschn.dev/blog/2024/10/im-a-gen-ai-maximalist-and...
Three channels of television over 8 hours was already more than anyone had time to take in.
AI might be able to create summarizing layers and relays that help manage that.
AI isn't going to create infinite bandwidth. It's as likely to increase entropy and introduce noise.
RF: text (telegram), audio, still images (fax), moving images
Web had the same progression: text, still images (inverted here), audio age (MP3s, Napster), video (Netflix, YouTube)
AI: text, images, audio (realtime API), ...?
Vision is the obvious next medium.
Of course, I can use AI tools to get approximations of such things, and they'll probably get better, which means we will now be using this increased bandwidth, or progression, to produce more video to be pushed out through the pipes and distilled by an AI tool into shorter video or something like hypertext.
Progress!
Also: thanks for tuning in, raid shadow legends, many people ask, but how... Anyway, you need these two lines of text (a 20-minute YouTube video could have been half a page of text).
Finally: Huge output of bad quality and very, very limited input capacity. So "infinite bandwidth in" and then horrible traffic jam out.
If you and I prompt OpenAI to generate an image of a woman holding a candle, we'll get two totally novel instances.
For whom?
If you mean infinite outpouring, then yes, but it will drown us in a sea of noise. We've constructed a Chinese Room for the mind. The computer was a bicycle, but this is something different.
Bandwidth is carrying ability, and the current incarnation of "AI" does not increase signal. It takes vastly more resources to produce something close enough, but not quite... it.
That visual interface will watch as you prep your pancakes and give you tips, suggest a substitute if you are missing an ingredient. Your experience with that recipe will be one of "infinitely" many.
https://prirai.github.io/books/unix-koans/#master-foo-discou...
Imagine OpenAI can not only read the inflection in your voice, but also nuances in your facial expressions and how you're using your hands to understand your state of mind.
And instead of merely responding as an audio stream, a real-time avatar.
or maybe my own HAL9000..
A little bit ambivalent on this haha, looking forward to seeing what comes of it either way though :)
I’ve worked in AI for more than 30 years and I have no idea what you mean by this. Can you explain?
There's two ways to think about bandwidth. One is the physical capacity. The other is the content that can be produced and distributed.
We once had 3 channels equating to a maximum of 72h of content in a 24h period. Now we have YouTube which is orders of magnitude more content and bandwidth. The constraint now is the ratio of producers to consumers. Some creator had to create the exact content that you want.
What if gen AI can create the exact content and media experience that you want, effectively pushing the ratio of producers to consumers towards 1 so that every experience is unique? It is as if there were infinite bandwidth to create and distribute content. You are no longer constrained by physical bandwidth and no longer constrained by production bandwidth (actual creators making the content).
You want your AI generated news reel delivered by Walter Cronkite. I want mine delivered by Barbara Walters wearing a fake mustache while standing on one hand on the moon. It is as if there are infinite producers.
I write a bit more on this topic here: https://chrlschn.dev/blog/2024/10/im-a-gen-ai-maximalist-and...
This is a great place for people to start caring about accessibility annotations. All serious UI toolkits allow you to tell the computer what's on the screen. This allows things like Windows Automation https://learn.microsoft.com/en-us/windows/win32/winauto/entr... to see a tree of controls with labels and descriptions without any vision/OCR. It can be inspected by apps like FlaUInspect https://github.com/FlaUI/FlaUInspect?tab=readme-ov-file#main... But see how the example shows a statusbar with (Text "UIA3" "")? It could've been (Text "UIA3" "Current automation interface") instead, for both a good tooltip and an accessibility label.
Now we can kill two birds with one stone - actually improve the accessibility of everything and make sure custom controls adhere to the framework as well, and provide the same data to the coming automation agents. The text description will be much cheaper than a screenshot to process. Also it will help my work with manually coded app automation, so that's a win-win-win.
As a side effect, it would also solve issues with UI weirdness. Have you ever had Windows open something on a screen which is not connected anymore? Or under another window? Or minimised? Screenshots won't give enough information here to progress.
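For illustration, here is a minimal sketch of reading that control tree instead of a screenshot, assuming Python with the pywinauto package (which wraps the UIA tree; exact attribute names may vary by version):

    # Hypothetical sketch: dump the UIA control tree with its accessibility
    # labels, assuming the pywinauto package is installed (pip install pywinauto).
    from pywinauto import Desktop

    desktop = Desktop(backend="uia")

    # Top-level windows are enumerated even if minimised, occluded, or on a
    # screen that is no longer connected -- the tree doesn't care about pixels.
    for window in desktop.windows():
        print(window.window_text())
        # Walk the control tree and print each element's type and accessible name.
        for element in window.descendants():
            info = element.element_info
            print(f"  {info.control_type}: {info.name!r}")

An agent fed this text gets the same labels a screen reader would, at a fraction of the cost of processing a screenshot.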
The ultimate API is "all the raw data you can acquire from your environment".
So you still rely on developers to make reasonable GUIs
- A list of applications that are open
- Which application has active focus
- What is focused inside the application
- Function calls to specifically navigate those applications, as many as possible
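The first two items are cheap to gather without any vision at all. A hypothetical sketch, assuming Python with the pywin32 bindings on Windows (illustrative only, not any particular product's client code):

    # Hypothetical sketch: collect the open applications and the focused window,
    # assuming the pywin32 package (win32gui) on Windows.
    import win32gui

    def visible_windows():
        """Titles of all visible top-level windows."""
        titles = []

        def handler(hwnd, _extra):
            if win32gui.IsWindowVisible(hwnd) and win32gui.GetWindowText(hwnd):
                titles.append(win32gui.GetWindowText(hwnd))

        win32gui.EnumWindows(handler, None)
        return titles

    context = {
        "open_windows": visible_windows(),
        "focused_window": win32gui.GetWindowText(win32gui.GetForegroundWindow()),
    }
    print(context)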
We’ve found the same thing while building the client for testdriver.ai. This info is in every request.
It’s actually a super cool development and I’m already very excited to let my computer use any software like a pro in front of me. Paint me a canvas of a savanna sunset with animal silhouettes, produce me a track of UK garage house, etc., everything with all the layers and elements in the software, not just a finished output.
I don’t understand why people make a point about energy consumption as if it were something bad.
"the Earth has only one mechanism for releasing heat to space, and that’s via (infrared) radiation. We understand the phenomenon perfectly well, and can predict the surface temperature of the planet as a function of how much energy the human race produces. The upshot is that at a 2.3% growth rate, we would reach boiling temperature in about 400 years. And this statement is independent of technology. Even if we don’t have a name for the energy source yet, as long as it obeys thermodynamics, we cook ourselves with perpetual energy increase."
Obviously we need to find ways to not harm and destroy our environment further, but we are on a good path there. But technically we need much, much more energy.
We do things fast and expensive that could be done slow but cheap.
The problem is we are running out of time.
If you want more energy, you first build clean energy sources, and then you can pump up consumption, not the other way around.
It’s usually the other way around. If we only did things when the resources were already there, we wouldn’t have the progress we have.
We built computers to be used by humans and humans overwhelmingly operate computers with GUIs. So if you want a machine that can potentially operate computers as well as humans then you're going to have to stick to GUIs.
It's the same reason we're trying to build general purpose robots in a human form factor.
The fact that a car is about as wide as a two horse drawn carriage is also no coincidence. You can't ignore existing infrastructure.
(1) Automated testing of apps that use traditional UIs, and
(2) Automating legacy apps that it is not practical or cost effective to update.
You don't and that's fine but certainly many people are interested in such a thing.
>maybe I’m missing the point of this but I just can’t imagine a usecase where this is a good solution.
If it could operate computers robustly and reliably, then why wouldn't you? Plenty of what someone does on a computer is a task they would like to automate away but can't with current technology.
>For everything browser based, the burden of making an API is probably relatively small
It's definitely not less effort than sticking to a GUI.
>and if the page is simple enough, you could maybe even get away with training the AI on the page HTML and generating a response to send.
Sure in special circumstances, it may be a good idea to use something else.
>And for everything that’s not browser-based, I would either want the AI embedded in the software (image editors, IDEs…) or not there at all.
AI embedded in software and AI operating the computer itself are entirely different things. The former is not necessarily a substitute for the latter.
Having access to SORA is not at all the same thing as AI that can expertly operate Blender. And right now at least, studios would actually much prefer the latter.
Even if they were equivalent (they're not), then you wouldn't be able to operate most applications without developers explicitly supporting and maintaining it first. That's infeasible.
But there are tons of applications locked behind a website or desktop GUI with no API that are nevertheless accessible via vision.