Posted by cschiller 10/23/2024

Launch HN: GPT Driver (YC S21) – End-to-end app testing in natural language

Hey HN, we are Chris and Chris from MobileBoost (https://mobileboost.io/). We’re building GPT Driver, an AI-native approach to create and execute end-to-end (E2E) tests on mobile applications. Our solution allows teams to define tests in natural language and prevents test flakiness by taking a visual approach paired with LLM (Large Language Model) reasoning. This helps achieve E2E test coverage with a fraction of the usual effort.

You can watch a brief product walkthrough here: https://www.youtube.com/watch?v=5-Ge2fqdlxc

In terms of trying the product out: since the service is resource-intensive (we provide hosted virtual/real phone instances), we don't currently have a playground available. However, you can see some examples here https://mobileboost.io/showcases and book a demo of GPT Driver testing your app through our website.

Why we built this: working at previous startups and scaleups, we saw how, as app teams grew, QA teams struggled to ensure everything was still working. This caused tension between teams and resulted in bugs making it into production.

You’d expect automated tests to help, but these were a huge effort because only engineers could create the tests, and the apps themselves kept changing—breaking the tests regularly and leading to high maintenance overhead. Functional tests often failed not because of actual app errors, but due to changes like copy updates or modifications to element IDs. This was already a challenge, even before considering the added complexities of multiple platforms, different environments, multilingual UIs, marketing popups, A/B tests, or minor UI changes from third-party authentication or payment providers.

We realized that combining computer vision with LLM reasoning could solve the common flakiness issues in E2E testing. So, we launched GPT Driver: a no-code editor paired with a hosted emulator/simulator service that allows teams to set up test automation efficiently. Our visual + LLM reasoning test execution reduces false alarms, enabling teams to integrate their E2E tests into their CI/CD pipelines without getting blocked.

Some interesting technical challenges we faced along the way:

(1) UI Object Detection from Vision Input: We had to train object detection models (YOLO and Faster R-CNN based) on a subset of the RICO dataset as well as our own dataset to be able to interact accurately with the UI.

(2) Reasoning with Current LLMs: We have to shorten instructions, action history, and screen content during runtime for better results, as handling large amounts of input tokens remains a challenge. We also work with reasoning templates to achieve robust decision-making.

(3) Performance Optimization: We optimized our agentic loop to make decisions in less than 4 seconds. To reduce this further, we implemented caching mechanisms and offer a command-first approach, where our AI agent only takes over when the command fails.
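
For illustration, here is a rough sketch of the kind of agentic loop described above (screenshot → object detection → compact prompt → LLM decision → action). The interfaces and prompt format below are hypothetical stand-ins, not GPT Driver's actual internals:

```typescript
// Hypothetical sketch of an agentic loop:
// screenshot -> UI object detection -> compact prompt -> LLM decision -> action.
// Device, Detector, and Llm are illustrative stand-ins, not a real vendor API.

interface UiElement { label: string; box: [number, number, number, number] } // x, y, w, h

interface Device {
  screenshot(): Promise<Uint8Array>;
  tap(x: number, y: number): Promise<void>;
  typeText(text: string): Promise<void>;
}

interface Detector { detect(image: Uint8Array): Promise<UiElement[]> } // e.g. a YOLO-based model
interface Llm {
  decide(prompt: string): Promise<{ action: 'tap' | 'type' | 'done'; target?: UiElement; text?: string }>;
}

async function runStep(step: string, device: Device, detector: Detector, llm: Llm): Promise<void> {
  const history: string[] = [];

  for (let i = 0; i < 20; i++) {                    // hard cap on loop iterations
    const screen = await device.screenshot();
    const elements = await detector.detect(screen); // detected UI elements, no element IDs needed

    // Keep the prompt small: the goal, only the last few actions, and a terse screen summary.
    const prompt = [
      `Goal: ${step}`,
      `History: ${history.slice(-5).join(' | ')}`,
      `Screen: ${elements.map(e => e.label).join(', ')}`,
    ].join('\n');

    const decision = await llm.decide(prompt);
    if (decision.action === 'done') return;

    if (decision.action === 'tap' && decision.target) {
      const [x, y, w, h] = decision.target.box;
      await device.tap(x + w / 2, y + h / 2);       // tap the center of the detected element
      history.push(`tapped ${decision.target.label}`);
    } else if (decision.action === 'type' && decision.text) {
      await device.typeText(decision.text);
      history.push(`typed "${decision.text}"`);
    }
  }
  throw new Error(`Step did not complete: "${step}"`);
}
```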

Since launching GPT Driver, we’ve seen adoption by technical teams, both with and without dedicated QA roles. Compared to code-based tests, the core benefit is the reduction of both the manual work and the time required to maintain effective E2E tests. This approach is particularly powerful for apps with a lot of dynamic screens and content, such as Duolingo, which we have been working with for the past couple of months. Additionally, the tests can now also be managed by non-engineers.

We’d love to hear about your experiences with E2E test automation—what approaches have worked or didn’t work for you? What features would you find valuable?

129 points | 82 comments
tomatohs 10/23/2024|
Curious what happened to the other YC Mobile AI E2E company, CamelQA (YC W24). They pivoted to AI assistants. Could be good lessons there if you're not already in touch with them.
bluelightning2k 10/23/2024||
Genuinely curious: is the timing on this, immediately after Claude computer use, a coincidence? Or was that like the last missing piece, or a kind of threat that expedited things?
cschiller 10/23/2024|
Good call! The timing was actually a coincidence, but not unexpected. OpenAI had already announced their plans to work on a desktop agent, so it was only a matter of time.

From our tests, even the latest model snapshots aren't yet reliable enough in positional accuracy. That's why we still rely on augmenting them with specialized object detection models. As foundational models continue to improve, we believe our QA suite - covering test case management, reporting, agent orchestration, and infrastructure - will become more relevant for the end user. Exciting times ahead!

doublerebel 10/23/2024||
How does this compare with Test.ai (now aka Testers.ai) who have offered basically this same service for the last 5 years?
tauntz 10/23/2024|
Totally offtopic but I looked at testers.ai and noticed the following from the terms of service:

> Individuals with the last name "Bach" or "Bolton" are prohibited from using, referencing, or commenting on this website or any of its content.

..and now I'm curious to know the backstory for this :)

LeFever 10/24/2024||
Michael Bolton and James Bach are the founders of RST [1] and generally big names in the “formal” software testing space. Presumably the testers.ai folks aren’t fans. :p

[1] https://rapid-software-testing.com/authors/

archerx 10/23/2024||
Curious question: whatever happened with the OpenAI drama over trademarking “GPT”? I’m guessing they were not successful?
chrtng 10/23/2024|
From what we understand, the term GPT was deemed too general for OpenAI to claim as its own.

https://www.theverge.com/2024/2/16/24075304/trademark-pto-op...

archerx 10/25/2024||
Thank you.
alexwordxxx 10/23/2024||
Hey https://google.com
101008 10/23/2024||
Still interesting how a lot of companies offer an LLM (non-deterministic) solution for deterministic problems.
chairhairair 10/23/2024||
This fundamental issue seems to be totally lost on the LLM-heads.

I do not want additional uncertainty deep in the development cycle.

I can tolerate the uncertainty while I'm writing. That's where there is a good fit for these fuzzy LLMs. Anything past the cutting room floor and you are injecting uncertainty where it isn't tolerable.

I definitely do not want additional uncertainty in production. That's where the "large action model" and "computer use" and "autonomous agent" cases totally fall apart.

It's a mindless extension, something like: "this product is good for writing... let's let it write to prod!"

usernameis42 10/23/2024||
Same goes for real people: we can all make mistakes. AI agents will get better over time and will be ahead of many specialists pretty soon, but probably not perfect before AGI, just as we are.
layer8 10/23/2024|||
One of the advantages of automation has traditionally been that it cuts out the indeterminacy and variability inherent in real people.
conorjh 10/23/2024|||
your software has real people in it?
SkyBelow 10/23/2024||
Ideally it does. Users, super users, admins, etc. Though one might point out exactly how much effort we put into locking down what they can do. I think one might be able to expand this to build up a persona for how LLMs should interface with software in production, but too many applications give them about the same level of access as a developer coding straight into production. Then again, how many company leaders would approve of that as well if they thought it would get things done faster and at lower cost?
aksophist 10/23/2024|||
It’s only deterministic for each version of the app. Versions change: UI elements move, change their title slightly. Irrelevant promo popups appear, etc. For a deterministic solution, someone has to go and update the tests to handle all of that. Good ‘accessibility hygiene’ can help, but many apps lack that.

And then there are truly dynamic apps like games or simulators. There may be no accessibility info to deterministically code to.

usernameis42 10/23/2024|||
There is a great approach based on a test-ID strategy: basically, it's a requirement for the frontend teams to cover all interactive elements with test IDs.

It makes tests less flaky and dramatically speeds up writing them, and it works with mobile as well. Elements for the main flows usually don't change that often, though you'll still need to update them occasionally.

I built stable mobile UI tests with this approach as well, and it worked well.
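
For illustration, a minimal sketch of what this looks like, assuming a React Native app tested with WebdriverIO + Appium; the test IDs and screens here are made up:

```typescript
// Minimal sketch of the test-ID approach, assuming a React Native app tested with
// WebdriverIO + Appium (Mocha-style globals). The IDs used here are invented.

// In the app, every interactive element gets a stable testID, e.g.:
//   <TextInput testID="login-email" ... />
//   <Button testID="login-submit" title="Log in" ... />

describe('login flow', () => {
  it('logs in with valid credentials', async () => {
    // '~id' is WebdriverIO's accessibility-id selector, which is where
    // React Native's testID typically surfaces on iOS.
    await $('~login-email').setValue('user@example.com');
    await $('~login-submit').click();

    // Assert on another test ID instead of copy text, so copy changes
    // don't break the test.
    await expect($('~home-greeting')).toBeDisplayed();
  });
});
```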

digging 10/23/2024|||
> Versions change: UI elements move, change their title slightly

Not randomly, I'd hope. I think you may be misunderstanding what deterministic means - or I am.

MattDaEskimo 10/23/2024|||
It's crazy to have people so out of their league try to argue against well-established meanings.

A testing framework requires determinism. If something changes, the team should know and adjust.

AI could play a part in easing this adjustment and in updating tests, but it's not the driver of these tests.

minhaz23 10/23/2024|||
Ever worked with extjs? :/
cschiller 10/23/2024|||
I agree that it can seem counterintuitive at first to apply LLM solutions to testing. However, in end-to-end testing, we’ve found that introducing a level of flexibility can actually be beneficial.

Take, for example, scenarios involving social logins or payments where external webviews are opened. These often trigger cookie consent forms or other unexpected elements, which the app developer has limited control over. The complexity increases when these elements have unstable identifiers or frequently changing attributes. In such cases, even though the core functionality (e.g., logging in) works as expected, traditional test automation often fails, requiring constant maintenance.

The key, as noted in other comments, is ensuring the solution is good at distinguishing between meaningful test issues and non-issues.

worldsayshi 10/23/2024|||
I would assume that the test runner translates the natural language instruction into a deterministic selector and only re-does that translation when the selector fails. At least that's how I would try to implement it.
tomatohs 10/23/2024||
This is the right idea and how we do it at TestDriver.ai. The deterministic selector still has about a 20% fuzz-matching rate, and if it fails, it tries to recover.
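
A rough sketch of that cache-and-recover idea, with hypothetical helpers (resolveWithLlm, query) rather than any vendor's actual API:

```typescript
// Resolve the natural-language step to a concrete selector once, reuse it on
// later runs, and only ask the LLM to re-resolve when the cached selector
// stops matching. All names here are illustrative.

type SelectorCache = Map<string, string>; // natural-language instruction -> selector

async function findElement<E>(
  instruction: string,
  cache: SelectorCache,
  resolveWithLlm: (instruction: string) => Promise<string>, // vision/LLM resolution
  query: (selector: string) => Promise<E | null>            // deterministic lookup
): Promise<E> {
  const cached = cache.get(instruction);
  if (cached) {
    const el = await query(cached);
    if (el) return el;                                      // fast, deterministic path
  }

  // Cached selector missing or stale: let the LLM pick the element from the
  // current screen, then cache the freshly resolved selector for future runs.
  const fresh = await resolveWithLlm(instruction);
  cache.set(instruction, fresh);

  const el = await query(fresh);
  if (!el) throw new Error(`Could not resolve step: "${instruction}"`);
  return el;
}
```
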
devjab 10/23/2024|||
I think it’s less of an issue for e2e testing because e2e testing sucks. If teams did it well in general you would be completely correct, but in many places an LLM will be better even if it hallucinates. As such I think there will be a decent market for products like this, even if they may not really be testing what you think they are testing, simply because that may well be way better than the e2e testing many places already do.

In many cases you’re correct though. We have a few libraries where we won’t use TypeScript because even though it might transpile 99% correctly, the fact that we have to check is too much work for it to be worth our time in those cases. I think LLMs are similar: once in a while you’re not going to want them because checking their work takes too many resources, but for a lot of stuff you can use them. Especially if your e2e testing is really just a pseudo-job because some middle manager wanted it, which it unfortunately is far too often. If you work in such a place you’re going to recommend the path of least resistance, and if that’s LLM-powered then it’s LLM-powered.

On the less bleak and pessimistic side, if the LLM e2e output is good enough to be less resource consuming, even if you have to go over it, then it’s still a good business case.

batikha 10/23/2024|||
I work in the field and built a tool that has way less flakiness than deterministic solutions. The issue is testing environments are always imperfect because (a) they are stateful and (b) there's always some randomness in actual production software. Some teams have very clean testing environment but most don't.

So being non-deterministic is actually an advantage, in practice.

joshuanapoli 10/23/2024|||
I think that the hope/dream here is to make end-to-end tests less flaky. It would be great to have navigation and assertion commands that are robust against simple changes in the app that aren't relevant to the test case.
chairhairair 10/23/2024||
It's just a dream then.

It's completely at odds with the strengths of LLMs (fuzzy associations, rough summaries, naive co-thinking).

yorwba 10/23/2024||
Fuzzy associations seem relevant? Interact with the UI based on what it looks like, not the specific implementation details.
chairhairair 10/23/2024||
No. Both of the requirements "to interact" and "based on what it looks like" require unshakable foundations in reality - which current models clearly do not have.

They will inevitably hallucinate interactions and observations and therefore decrease reliability. Worse, they will inject a pervasive sense of doubt into the reliability of any tests they interact with.

tomatohs 10/23/2024||
> unshakable foundations in reality

Yes, you are correct that it lies entirely in the reputation of the AI.

This discussion leads to an interesting question, which is "what is quality?"

Quality is determined by perception. If we can agree that an AI is acting like a user and it can use your website, we can assume that a user can use your website and therefore it is "quality".

For more, read "Zen and the Art of Motorcycle Maintenance"

dartos 10/23/2024|||
Tbf, users are also non-deterministic, so if LLM testing like this does catch on, it’ll be in the same realm as chaos testing.
aksophist 10/23/2024||
how do you evaluate your tool, and have you published your evaluation along with the metrics?
chrtng 10/23/2024|
Thank you for your question! While we haven't published a formal evaluation yet, it's something we are working toward. Currently, we rely mostly on human reviews to monitor and assess LLM outputs. We also maintain a golden test suite that is run against every release to ensure consistency and quality over time, using regex-based evaluations.

Our key metrics include the time and cost per agentic loop, as well as the false positive rate for a full end-to-end test. If you have any specific benchmarks or evaluation metrics you'd suggest, we'd be happy to hear them!
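
For illustration, a toy version of a regex-based check over an agent's recorded actions, in the spirit of the golden suite described above; the case format and field names are invented:

```typescript
// Toy regex-based evaluation of a recorded agent run against a "golden" case.
// The GoldenCase shape and the example patterns are made up for this sketch.

interface GoldenCase {
  name: string;
  expectedSteps: RegExp[]; // each pattern must match a recorded action, in order
}

function runMatchesGolden(recordedActions: string[], golden: GoldenCase): boolean {
  let cursor = 0;
  for (const pattern of golden.expectedSteps) {
    const idx = recordedActions.slice(cursor).findIndex(a => pattern.test(a));
    if (idx === -1) return false; // an expected action never happened
    cursor += idx + 1;
  }
  return true;
}

// Example: the agent should open settings and toggle dark mode.
const darkModeCase: GoldenCase = {
  name: 'toggle dark mode',
  expectedSteps: [/tapped .*settings/i, /tapped .*dark mode/i],
};

// In this framing, a false positive is a run where runMatchesGolden returns false
// (the suite reports a failure) even though the app actually behaved correctly.
```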

aksophist 10/24/2024||
What is a false positive rate? Is it when the agent falsely passes or falsely “finds a bug”? And regardless of which: why don’t you include the other as a key metric?

I’m not aware of any evals or shared metrics. But measuring a testing agent’s performance seems pretty important.

What is your tool’s FPR on your golden suite?

iknownthing 10/23/2024||
no logo?
lihua919 10/23/2024|
interesting