Posted by cschiller 4 days ago
Launch HN: GPT Driver (YC S21) – End-to-end app testing in natural language
You can watch a brief product walkthrough here: https://www.youtube.com/watch?v=5-Ge2fqdlxc
In terms of trying the product out: since the service is resource-intensive (we provide hosted virtual/real phone instances), we don't currently have a playground available. However, you can see some examples here https://mobileboost.io/showcases and book a demo of GPT Driver testing your app through our website.
Why we built this: working at previous startups and scaleups, we saw how as app teams grew, QA teams would struggle to ensure everything was still working. This caused tension between teams and resulted in bugs making it into production.
You’d expect automated tests to help, but these were a huge effort because only engineers could create the tests, and the apps themselves kept changing—breaking the tests regularly and leading to high maintenance overhead. Functional tests often failed not because of actual app errors, but due to changes like copy updates or modifications to element IDs. This was already a challenge, even before considering the added complexities of multiple platforms, different environments, multilingual UIs, marketing popups, A/B tests, or minor UI changes from third-party authentication or payment providers.
We realized that combining computer vision with LLM reasoning could solve the common flakiness issues in E2E testing. So, we launched GPT Driver, a no-code editor paired with a hosted emulator/simulator service that allows teams to set up test automation efficiently. Our visual + LLM reasoning test execution reduces false alarms, enabling teams to integrate their E2E tests into their CI/CD pipelines without getting blocked.
Some interesting technical challenges we faced along the way:
(1) UI Object Detection from Vision Input: We had to train object detection models (YOLO and Faster R-CNN based) on a subset of the RICO dataset as well as our own dataset to be able to interact accurately with the UI.
(2) Reasoning with Current LLMs: We have to shorten instructions, action history, and screen content during runtime for better results, as handling large amounts of input tokens remains a challenge. We also work with reasoning templates to achieve robust decision-making.
(3) Performance Optimization: We optimized our agentic loop to make decisions in less than 4 seconds. To reduce this further, we implemented caching mechanisms and offer a command-first approach, where our AI agent only takes over when the command fails.
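To make the command-first idea from (3) concrete, here is a minimal Python sketch. It is not GPT Driver's implementation, which the post does not describe. Every name in it (CommandFirstRunner, CommandFailed, run_command, ask_agent, demo_command, demo_agent) is hypothetical, and the "screen" is reduced to a comma-separated string of element names purely for illustration. It shows a deterministic command being tried first, with the slower agent consulted only on failure and its decision cached.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


class CommandFailed(Exception):
    """A deterministic command could not run on the current screen."""


@dataclass
class CommandFirstRunner:
    # Both callables are hypothetical stand-ins, not GPT Driver's real interfaces.
    run_command: Callable[[str, str], str]  # fast, deterministic; may raise CommandFailed
    ask_agent: Callable[[str, str], str]    # slow vision + LLM path
    cache: Dict[Tuple[str, str], str] = field(default_factory=dict)
    agent_calls: int = 0

    def execute(self, step: str, screen: str) -> str:
        """Return the action to perform for `step` on `screen`."""
        try:
            return self.run_command(step, screen)
        except CommandFailed:
            pass  # fall through to the agent
        key = (step, screen)
        if key not in self.cache:
            # Only pay for an agent decision the first time we see this situation.
            self.agent_calls += 1
            self.cache[key] = self.ask_agent(step, screen)
        return self.cache[key]


def demo_command(step: str, screen: str) -> str:
    """Toy command: tap the element named in `step` if it is on screen."""
    if step in screen.split(","):
        return f"tap:{step}"
    raise CommandFailed(step)


def demo_agent(step: str, screen: str) -> str:
    """Toy agent: dismiss a cookie banner if one is blocking the step."""
    if "cookie_accept" in screen.split(","):
        return "tap:cookie_accept"
    return "report_failure"


if __name__ == "__main__":
    runner = CommandFirstRunner(demo_command, demo_agent)
    print(runner.execute("login_button", "login_button,signup_button"))
    print(runner.execute("login_button", "cookie_accept,cookie_reject"))
    print(runner.execute("login_button", "cookie_accept,cookie_reject"))
    print("agent calls:", runner.agent_calls)
```

The cache key here is the exact screen string, which is an assumption made to keep the sketch short. A real system would presumably need a more tolerant screen signature, since two screenshots of the same screen are rarely identical.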
Since launching GPT Driver, we’ve seen adoption by technical teams, both with and without dedicated QA roles. Compared to code-based tests, the core benefit is the reduction of both the manual work and the time required to maintain effective E2E tests. This approach is particularly powerful for apps with a lot of dynamic screens and content, such as Duolingo, which we have been working with for a couple of months. Additionally, the tests can now also be managed by non-engineers.
We’d love to hear about your experiences with E2E test automation: what approaches have or haven’t worked for you? What features would you find valuable?
From our tests, even the latest model snapshots aren't yet reliable enough in positional accuracy. That's why we still rely on augmenting them with specialized object detection models. As foundational models continue to improve, we believe our QA suite - covering test case management, reporting, agent orchestration, and infrastructure - will become more relevant for the end user. Exciting times ahead!
> Individuals with the last name "Bach" or "Bolton" are prohibited from using, referencing, or commenting on this website or any of its content.
..and now I'm curious to know the backstory for this :)
https://www.theverge.com/2024/2/16/24075304/trademark-pto-op...
I do not want additional uncertainty deep in the development cycle.
I can tolerate the uncertainty while I'm writing. That's where there is a good fit for these fuzzy LLMs. Anything past the cutting room floor and you are injecting uncertainty where it isn't tolerable.
I definitely do not want additional uncertainty in production. That's where the "large action model" and "computer use" and "autonomous agent" cases totally fall apart.
It's a mindless extension, something like: "this product is good for writing... let's let it write to prod!"
And then there are truly dynamic apps like games or simulators. There may be no accessibility info to deterministically code to.
It makes tests less flaky and speeds up writing them dramatically. It also works on mobile. Elements in the main flows usually don't change that often, though you'll still need to update them when they do.
I've built stable mobile UI tests with this approach as well, and it worked well.
Not randomly, I'd hope. I think you may be misunderstanding what deterministic means - or I am.
A testing framework requires determinism. If something changes the team should know and adjust.
AI could play a small part in easing this adjustment and the tests, but it's not a driver in these tests.
Take, for example, scenarios involving social logins or payments where external webviews are opened. These often trigger cookie consent forms or other unexpected elements, which the app developer has limited control over. The complexity increases when these elements have unstable identifiers or frequently changing attributes. In such cases, even though the core functionality (e.g., logging in) works as expected, traditional test automation often fails, requiring constant maintenance.
The key, as other comments note, is ensuring the solution is good at distinguishing between meaningful test issues and non-issues.
In many cases you’re correct though. We have a few libraries where we won’t use TypeScript because, even though it might transpile 99% correctly, the fact that we have to check is too much work for it to be worth our time in those cases. I think LLMs are similar: once in a while you’re not going to want them because checking their work takes too many resources, but for a lot of stuff you can use them. Especially if your e2e testing is really just pseudo jobbing because some middle manager wanted it, which it unfortunately is far too often. If you work in such a place you’re going to recommend the path of least resistance, and if that’s LLM powered then it’s LLM powered.
On the less bleak and pessimistic side, if the LLM e2e output is good enough to be less resource consuming, even if you have to go over it, then it’s still a good business case.
So being non-deterministic is actually an advantage, in practice.
It's completely at odds with the strengths of LLMs (fuzzy associations, rough summaries, naive co-thinking).
They will inevitably hallucinate interactions and observations and therefore decrease reliability. Worse, they will inject a pervasive sense of doubt into the reliability of any tests they interact with.
Yes, you are correct that it lies entirely in the reputation of the AI.
This discussion leads to an interesting question, which is "what is quality?"
Quality is determined by perception. If we can agree that an AI is acting like a user and it can use your website, we can assume that a user can use your website and therefore it is "quality".
For more, read "Zen and the Art of Motorcycle Maintenance"
Our key metrics include the time and cost per agentic loop, as well as the false positive rate for a full end-to-end test. If you have any specific benchmarks or evaluation metrics you'd suggest, we'd be happy to hear them!
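For anyone wanting to compute the same metrics on their own suite, here is a rough Python sketch. The definitions are my assumptions, not necessarily the ones used above: a "false positive" is taken to mean a test run that failed although the app had no real bug (a false alarm), and the ground truth is assumed to come from manual triage. RunRecord and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class RunRecord:
    app_has_bug: bool   # ground truth, assumed to come from manual triage
    test_failed: bool   # what the E2E run reported
    loops: int          # number of agentic loops in this run
    seconds: float      # total wall-clock time of the run
    cost_usd: float     # total model/infrastructure cost of the run


def false_positive_rate(runs: Sequence[RunRecord]) -> Optional[float]:
    """Share of runs against a healthy app that still reported a failure."""
    healthy = [r for r in runs if not r.app_has_bug]
    if not healthy:
        return None  # undefined without any healthy-app runs
    return sum(r.test_failed for r in healthy) / len(healthy)


def per_loop_averages(runs: Sequence[RunRecord]) -> Optional[Tuple[float, float]]:
    """Average (seconds, cost in USD) per agentic loop across all runs."""
    total_loops = sum(r.loops for r in runs)
    if total_loops == 0:
        return None
    total_seconds = sum(r.seconds for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return total_seconds / total_loops, total_cost / total_loops


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        RunRecord(app_has_bug=False, test_failed=False, loops=10, seconds=30.0, cost_usd=0.10),
        RunRecord(app_has_bug=False, test_failed=True, loops=5, seconds=20.0, cost_usd=0.05),
        RunRecord(app_has_bug=True, test_failed=True, loops=5, seconds=10.0, cost_usd=0.05),
    ]
    print(false_positive_rate(sample))
    print(per_loop_averages(sample))
```

Note that this only measures false alarms. Runs that pass despite a real bug would need a separate false negative count on the same labelled data.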
I’m not aware of any evals or shared metrics. But measuring a testing agent’s performance seems pretty important.
What is your tool’s FPR on your golden suite?