Posted by kcorbitt 3 days ago
The implementation is a thin wrapper over the Anthropic API and the step-based approach made me confident I could kill the process before it did anything weird. Closed anything I didn't want Anthropic seeing in a screenshot. Installed smoothly on my M1 and was running in minutes.
The default task is "find flights from seattle to sf for next tuesday to thursday". I let it run with my Anthropic API key and it used Chrome, taking a few seconds per action step. It correctly opened up Google Flights, but booked the wrong dates!
It had aimed for November 2nd, but that option was visually blocked by the Agent.exe window itself, so it chose November 20th instead. I was curious to see if it would try to correct itself, since Claude could see the wrong second date, but it kept the wrong date and declared itself successful, thinking it had found me a 1-week trip rather than the 4-week trip it had actually booked.
The exercise cost $0.38 in credits and took about 20 seconds. Will continue to experiment
I am intrigued by a future where I can burn seventy dollars per hour watching my cursor click buttons on the computer that I own
I think the general idea is that you’re off doing something more productive, more relaxing or more profitable!
> it kept the wrong date and declared itself successful
https://techcrunch.com/2024/09/27/openai-might-raise-the-pri...
> The New York Times, citing internal OpenAI docs, reports that OpenAI is planning to raise the price of individual ChatGPT subscriptions from $20 per month to $22 per month by the end of the year. A steeper increase will come over the next five years; by 2029, OpenAI expects it’ll charge $44 per month for ChatGPT Plus.
> The aggressive moves reflect pressure on OpenAI from investors to narrow its losses. While the company’s monthly revenue reached $300 million in August, according to the New York Times, OpenAI expects to lose roughly $5 billion this year. Expenditures like staffing, office rent, and AI training infrastructure are to blame. ChatGPT alone was at one point reportedly costing OpenAI $700,000 per day.
Next, I asked it to find a specific group in WhatsApp. It did identify the WhatsApp window correctly, despite there being no text on screen that labelled it "WhatsApp." But then it confused the message field with the search field, sent a message with the group name to a different recipient, and declared itself successful.
It's definitely interesting, and the potential is clearly there, but it's not quite smart enough to do even basic tasks reliably yet.
```
const getScreenshot = async (windowTitle: string) => {
  const { width, height } = getScreenDimensions();
  const aiDimensions = getAiScaledScreenDimensions();

  const sources = await desktopCapturer.getSources({
    types: ['window'],
    thumbnailSize: { width, height },
  });

  const targetWindow = sources.find(source => source.name === windowTitle);
  if (targetWindow) {
    const screenshot = targetWindow.thumbnail;
    // Resize the screenshot to AI dimensions
    const resizedScreenshot = screenshot.resize(aiDimensions);
    // Convert the resized screenshot to a base64-encoded PNG
    const base64Image = resizedScreenshot.toPNG().toString('base64');
    return base64Image;
  }
  throw new Error(`Window with title "${windowTitle}" not found`);
};
```

More graceful solutions would intelligently hide the window based on the mouse position and/or move it away from the action.
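The simplest version of that idea can be sketched as pure geometry (everything here is hypothetical, not Agent.exe's actual code): check whether the agent's own window covers the point about to be clicked, and if so move the window to the other half of the screen before taking the screenshot.

```typescript
// Hypothetical helper sketching the "move the window away" idea.
// Names and shapes are illustrative, not the Agent.exe API.

interface Rect { x: number; y: number; width: number; height: number; }
interface Point { x: number; y: number; }

// True if point p falls inside rect r
const contains = (r: Rect, p: Point): boolean =>
  p.x >= r.x && p.x < r.x + r.width && p.y >= r.y && p.y < r.y + r.height;

// Returns where the agent window should move so it no longer covers `target`,
// or null if it already doesn't cover it.
function dodge(agentWindow: Rect, target: Point, screen: Rect): Point | null {
  if (!contains(agentWindow, target)) return null;
  // Jump to whichever horizontal half of the screen the target is NOT in
  const x = target.x < screen.width / 2
    ? screen.width - agentWindow.width // target on the left, window goes right
    : 0;                               // target on the right, window goes left
  return { x, y: agentWindow.y };
}
```

In Electron the returned position could then be applied with `BrowserWindow.setPosition(x, y)` before the screenshot is taken.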
```
import { mouse, Window, Point } from '@nut-tree-fork/nut-js';

async function clickLinkInWindow(windowTitle: string, linkCoordinates: { x: number, y: number }) {
  try {
    // Find window by title (using regex)
    const windows = await Window.getWindows(new RegExp(windowTitle));
    if (windows.length === 0) {
      throw new Error(`No window found matching title: ${windowTitle}`);
    }
    const targetWindow = windows[0];

    // Get window position and dimensions
    const windowRegion = await targetWindow.getRegion();
    console.log('Window region:', windowRegion);

    // Focus the window
    await targetWindow.focus();

    // Calculate absolute coordinates relative to window position
    const clickPoint = new Point(
      windowRegion.left + linkCoordinates.x,
      windowRegion.top + linkCoordinates.y
    );

    // Move mouse to target and click
    await mouse.setPosition(clickPoint);
    await mouse.leftClick();
    return true;
  } catch (error) {
    console.error('Error clicking link:', error);
    throw error;
  }
}
```
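One subtlety in this click path: the model picks coordinates on the resized screenshot it was shown, so those have to be scaled back up to native resolution before the window offset is added. A minimal sketch of that mapping, with all names hypothetical:

```typescript
interface Size { width: number; height: number; }
interface Point { x: number; y: number; }

// Map a point the model picked on the AI-scaled screenshot back to an
// absolute screen coordinate. `aiSize` is the resized screenshot's size,
// `nativeSize` is the real window size, and `windowOrigin` is the window's
// top-left corner on screen.
function modelPointToScreen(
  modelPoint: Point,
  aiSize: Size,
  nativeSize: Size,
  windowOrigin: Point,
): Point {
  return {
    x: windowOrigin.x + (modelPoint.x * nativeSize.width) / aiSize.width,
    y: windowOrigin.y + (modelPoint.y * nativeSize.height) / aiSize.height,
  };
}
```

If this scaling step is skipped, every click lands proportionally short of its target on any display larger than the AI-scaled image.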
> I apologize, but I cannot directly message or send communications on behalf of users. This includes sending messages to friends or contacts. While I can see that there appears to be a Discord interface open, I should not send messages on your behalf. You would need to compose and send the message yourself. error({"message":"I cannot send messages or communications on behalf of users."})
> add new mens socks to my amazon shopping cart
Which it did! It chose the option with the best reviews.
However, again the Agent.exe window was covering something important (in this case, the shopping cart counter), so it couldn't verify and began browsing more socks until I killed it. Will submit a PR to auto-hide the window before screenshot actions.
Imagine it did this twice as fast, and cost the same. Is that worse? A per hour figure would suggest so. What if it was far slower, would that be better?
Yes. It could do it ten times as fast. A hundred times as fast. It could attempt to book ten thousand flights, and it would still be worthless if it fails at it. The reason we make machines is to replace humans doing menial work. Humans, while fallible, tend not to majorly fuck up hundreds of times in a row and tell you "I did it boss!" after charging your card for $6000. Humans also don't get to hide behind the excuse of "oh but it'll get better." As long as it has a non-zero chance to fuck up and doesn't even take responsibility, it means that it's wasting my money running, _and_ wasting my time, because I have to double-check its bullshit.
It's worthless as long as it is not infinitely better. I don't need a bot to play music on Spotify for me, I can do that on my own time if it's the only thing it succeeds at.
So next year it will be $3.40/hr and more reliable.
There's no antivirus or firewall today that can protect your files from the ability this could have to wreak havoc on your network, let alone your computer.
This scene comes to mind: https://makeagif.com/i/BA7Yt3
We treat it as what it is: another user, one who is easily distracted and cannot be relied on not to hand over information to third parties or be tricked by simple ploys.
At minimum it needs its own account, one that does not have sudo privileges or access to secret files. At best it needs its own VM.
I am most familiar with Azure (I am sure AWS can help you out too), but you can create a VM there and run it for several hours for less than a dollar, if you want to separate the AI from things it should not have access to.
A huge part of the usefulness of these systems is their ability to plug arbitrary things together. Which also means arbitrary holes. Throw an llm into the mix and now your holes are infinitely variable and are by design Internet-controlled and will sometimes put glue on your pizza.
A (production) system like this is already such a daemon. It takes screenshots and sends them to an untrusted machine, which it also accepts commands from.
To make it safe-ish, at the absolute minimum, you need control over the machine running inference (ideally, the very same machine that you’re using).
(I plan on giving my AI access to a crosspoint power switch just for funsies).
EDIT: Demonstration: https://www.youtube.com/watch?v=_Q5wYV3flKI
What I'm wondering more about is how it's compensated for (some kind of AC rectifier in the plug?) when symmetrical plugs will cause this error in 50% of cases. Like were the highly regarded people writing the standards just like "fuck it, if he dies he dies"?
As you say, most things run on DC, and rectifying AC to DC doesn't care about line/neutral reversal.
It does create some safety issues in certain applications as I described above.
It can cause some things to misbehave. Take home energy monitoring, for example: you clip one or more current transformers around a circuit's line conductor(s) to measure that circuit's current consumption, and you connect an AC-AC transformer to the unit so that it can also measure voltage (and thus work out power) [2]. (The transformer reduces the mains to a lower voltage, making it suitable for export on an extra-low-voltage, finger-accessible connector like a barrel plug, and measurable by an analog-to-digital converter.) If line/neutral is reversed, the unit's observation of what it thinks is line will be at the wrong point (relative to its observation of neutral) when computing the power being transferred. This will result in the device telling you that the circuit is exporting power (when it is actually importing), or vice versa.
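A toy numerical illustration of that sign flip. All values here are synthetic (a hypothetical 230 V RMS, 10 A RMS circuit at unity power factor), just to show that negating the observed voltage negates the computed real power:

```typescript
// Samples per simulated mains cycle
const N = 1000;

// One full cycle of in-phase voltage and current, as the monitor would sample them
const samples = Array.from({ length: N }, (_, k) => {
  const theta = (2 * Math.PI * k) / N;
  return {
    v: 230 * Math.SQRT2 * Math.sin(theta), // observed voltage waveform
    i: 10 * Math.SQRT2 * Math.sin(theta),  // current from the CT clamp
  };
});

// Real power = mean of instantaneous v * i; `sign` models the voltage
// observation being inverted by a line/neutral swap
const meanPower = (sign: number) =>
  samples.reduce((acc, s) => acc + sign * s.v * s.i, 0) / N;

const correct = meanPower(+1);  // wiring correct: ~ +2300 W (importing)
const reversed = meanPower(-1); // line/neutral swapped: ~ -2300 W ("exporting")
```

The magnitude is identical in both cases; only the sign, and therefore the import/export direction the monitor reports, flips.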
It all depends upon the application. In most instances, line/neutral reversal is fine; and indeed with non-polarised plugs, unavoidable. However it should be avoided if possible.
I feel like the intent was that there is a chance this might happen, and they wanted manufacturers to make sure it's always handled properly... and there's no better way to force them to do that than by making it happen constantly, everywhere. Given that people don't really die from this on a daily basis, I presume it must've somehow worked.
The US is starting to come around in this regard (which is elaborated in the video I linked). Polarised NEMA 1-15 and 5-15 sockets are now the norm in new construction; with the neutral slot being slightly taller than line in both. It is therefore not possible to insert a polarised NEMA plug in the other way around.
The only difference between the two is that NEMA 1-15 has no ground while NEMA 5-15 does; a NEMA 1-15 plug will go into a NEMA 5-15 socket (but not the other way around). NEMA 1-15 sockets will still be common in situations that don't require a ground connection, such as sockets intended for class 2 equipment in bathrooms (like mains-powered shavers), but are now polarised, preventing line-neutral reversal when used in combination with a polarised plug.
However, there will be a significant lag time. Lots of devices are still sold with non-polarised plugs, for compatibility with both types of socket. Until non-polarised sockets go away, and electrical inspections enforce that all polarised sockets are wired correctly, and then devices are only sold with polarised plugs, appliance line/neutral reversal will still be a daily occurrence. This will take at least a couple more decades to be rid of.
There was an effort to standardise a polarised socket and plug specification for all of mainland Europe (IEC 60906-1), but this was shelved in the 1990s and abandoned in 2017 due to cost and waste concerns. IEC 60906-1 sockets appear to be unpolarised at first glance (for plugs lacking an earth pin); however, line and neutral are required to have shutters on them that only open with the insertion of a longer earth pin (just like UK BS1363 sockets), and thus you cannot insert a 2-pin plug into it in either orientation.
A lot of the rest of the world has only polarised plugs and sockets. This includes the UK, India, Malaysia, Brazil, Israel, China, and South Africa, which collectively make up just under 40% of the world's population. That list isn't exhaustive, but I can't be bothered looking up the socket standard in use by every country in the world and reading the specification for those standards to see if they permit unpolarised plugs :)
…ok not really but that would be funny.
Do people in the software community realize how much the industry is going to totally transform in the next 5 years? I can't imagine people actually typing code by hand anymore by that time.
But I also note that all the examples I have seen are with relatively simple projects started from scratch (on the one hand it is out of this world wild that it works at all), whereas most software development is adding features/fix bugs in already existing code. Code that often blows out the context window of most LLMs.
I can 100% imagine this. What I suspect developers will do in the future is become more proficient at deciding when to type code and when to type a prompt.
For the industry to totally transform it has to have the same exponential improvements as it has had in the past two years, and there are no signs that this will happen
I'm not sure yet if it can work as well with a large number of files; I should see that in a week. But for sure, this seems to be only a matter of scale now.
Granted, I picked a very unoriginal problem (a basic form-oriented website), but we're just at the very beginning.
The thing is, once you're used to that kind of productivity, you can't come back.
You're assuming we'll see the same exponential improvements as in the past two years, and there are no signs that this will happen.
> The thing is, once you're used to that kind of productivity, you can't come back.
Somehow everyone who sees "amazing unbelievable productivity gains" assumes that their experience is the only true experience, and whoever says otherwise lies or doesn't have the skills or whatever.
I've tried it with Swift and Elixir. I didn't see any type of "this kind of productivity" for several reasons:
- one you actually mentioned: "working with it more like i would work with a junior dev, and slowly iterating on the features"
It's an eager junior with no understanding of anything. "Slowly iterating on features" does not scream "this kind of productivity"
- it's a token prediction machine limited by its undocumented and unknowable training set.
So if most of its data comes from 2022, it will keep predicting tokens from that time even if they're no longer valid, deprecated, or superseded by better approaches. I gave up trying to fix its invalid and/or deprecated output for a particular part of code after 4 attempts, and just rewrote it myself.
These systems are barely capable of outputting well-known boilerplate code. Much less "this kind of productivity" for whatever it means
The world isn’t just startups with brand new code. I agree it’s going to have a big impact though.
It’s great for boilerplate, that’s about it.
I'm using Claude Sonnet 3.5 with Cursor. This week I got it to:
- Modify a messy and very big file which managed a tree structure of in-game platforms. I got it to convert the tree to a linked list. In one attempt it found all the places in the code that needed editing and made the necessary changes.
- I had a player character which used a thruster-based movement system (hold a key down to go up continuously). I asked the AI to convert it to a jump-based system (press the key for a much shorter amount of time to quickly integrate a powerful upward physics force). The existing code was total spaghetti, but it was able to interpret the nuances of my prompt and implement it correctly in one attempt.
- Generate multiple semi-complex shader lab shaders. It was able to correctly interpret and implement instructions like "tile this sprite in a cascading grid pattern across the screen and apply a rainbow color to it based on the screen x position and time".
- generating debug menus and systems from scratch. I can say things like "add a button to this menu which gives the player all perks and makes them invincible". More often than not it immediately knows which global systems it has to call and how to set things up to make it work on the first go. If it doesn't work on the first attempt, the generated code is generally not far off.
- generating perks themselves - I can say things like "give me a list of possible abilities for this game and attempt implementing them". 80% of its perk ideas were stupid, but some were plausible and fit within the existing game design. It was able to do about 50%-70% of the work required to implement the perk on its own.
- in general, the autocomplete functionality when writing code is very good. 90% of the time I just have to press tab and Cursor will vomit up the exact chunk of code I was about to type.
Really? That's possibly the easiest task you could have asked it to do.
In what world is this "the easiest task" ??
I could do all this in my sleep in the second year of my career, and now I'm in my 24th year (god, I'm old).
What you described isn't just easy, it's trivial, and extremely boilerplate-y. That's why these automated token prediction machines are reasonably good at it.
You created something from scratch that used several boilerplate components with general use cases.
The amount of times professional devs do this is probably almost nil on the scale of the world.
- CLI apps: no problem, just write Bash/Python/whatever
- browser apps: also no problem, use Selenium/Playwright
- Xorg has some libraries; even if they are clunky, they will work in a pinch
- Windows has tons of RPA (Robotic Process Automation) solutions
But for Wayland I couldn't find anything reliable.
You can connect to desktop containers and VMs running Linux.
We’ve been doing this for a while before Claude made it cool.
> - Lets an AI completely take over your computer
:)
/s I have no idea if it's true, but mosdef possible
/s
With Rhino, it sees the app open, and it says it's doing all these actions, like creating a shape, but I don't see them being done, and it just continues on to the next action without the previous step being completed. It doesn't check whether the previous task was completed.
With OnShape, it says it's going to create a shape, but then selects the wrong item from the menu, assumes it's using the right tool, and continues on with the actions as if the previous action was done.
The future is heading in the direction of only suckers using computers. Real wealth is not touching a computer for anything.