Posted by Klaster_1 4 days ago
Something I couldn't see was how those examples actually work: there are no actions specified. Do they watch a user, default to randomly hitting the keyboard, or neither, so that you need to specify some actions to take?
What about rerunning things?
Is there shrinking?
edit - a suggestion for examples, have a basic UI hosted on a static page which is broken in a way the test can find. Like a thing with a button that triggers a notification and doesn't actually have a limit of 5 notifications.
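To make the suggested example concrete, here is a minimal sketch in plain Python of what "broken in a way the test can find" could look like. It is not Bombadil's API: `BrokenNotifier`, `find_violation`, and the action names are all hypothetical stand-ins for a real page and a real action generator.

```python
import random

MAX_NOTIFICATIONS = 5  # the limit the UI is supposed to enforce


class BrokenNotifier:
    """Hypothetical stand-in for the page: the button never enforces the limit."""

    def __init__(self):
        self.visible = 0

    def click_notify(self):
        self.visible += 1  # bug: no check against MAX_NOTIFICATIONS

    def click_dismiss(self):
        if self.visible > 0:
            self.visible -= 1


def find_violation(seed, max_steps=200):
    """Run random actions; return the trace that breaks the property, or None."""
    rng = random.Random(seed)
    app = BrokenNotifier()
    trace = []
    for _ in range(max_steps):
        action = rng.choice(["notify", "dismiss"])
        trace.append(action)
        if action == "notify":
            app.click_notify()
        else:
            app.click_dismiss()
        # Property: never more than MAX_NOTIFICATIONS visible at once.
        if app.visible > MAX_NOTIFICATIONS:
            return trace
    return None
```

Calling `find_violation(seed)` over a handful of seeds should turn up a failing trace fairly quickly; a given seed is not guaranteed to find one within `max_steps`.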
Rerunning things: nothing built for that yet, but I do have some design ideas. Repros are notoriously shaky in testing like this (unless run against a deterministic app, or inside Antithesis), but I think Bombadil should offer best-effort repros if it can at least detect and warn when things diverge.
Shrinking: also nothing there yet. I'm experimenting with a state machine inference model as an aid to shrinking. It connects to the prior point about shaky repros, but I'm cautiously optimistic. Because the speed of browser testing isn't great, shrinking is also hard to do within reasonable time bounds.
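For readers unfamiliar with shrinking, a rough sketch of the general idea in plain Python follows. This is a simple greedy chunk-removal shrinker, not the state machine inference approach mentioned above and not anything Bombadil ships; `violates_limit` is a hypothetical deterministic replay function, which is exactly the assumption that is hard to satisfy in a real browser.

```python
def violates_limit(actions, limit=5):
    """Replay actions against a hypothetical buggy notifier.

    Returns True if more than `limit` notifications are ever visible.
    """
    visible = 0
    for action in actions:
        if action == "notify":
            visible += 1
        elif action == "dismiss" and visible > 0:
            visible -= 1
        if visible > limit:
            return True
    return False


def shrink(actions, fails):
    """Greedily drop chunks of actions while the trace still fails."""
    current = list(actions)
    chunk = max(len(current) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if fails(candidate):
                current = candidate  # keep the smaller failing trace, retry at i
            else:
                i += chunk  # this chunk is needed, move on
        chunk //= 2  # try finer-grained removals next
    return current
```

Every candidate costs one full replay, which is why slow browser runs make shrinking expensive. The result is a smaller failing trace, not necessarily the smallest possible one.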
Thanks for the questions and feedback!
Should be pretty easy to make it deterministic if you follow that precondition.
(The way I had my review apps wired up, I dumped the staging DB nightly and containerized it. I believe Neon etc. make it easy to do this kind of thing.)
Ages ago I wired up something much more basic than this for a Python API using Hypothesis, and made the state machine explicit as part of the action generator (with the transitions library). What do you think about modeling state machines in your tests? (I suppose one risk is that you don’t want to copy the state machine implementation from inside the app, but a nice fluent builder for simple state machines in tests could be a win.)
Regarding state machines: yeah, the spec can often become an equally complex mirror of the system you're testing, if the system has a large, complicated surface. If, on the other hand, the API is simple and encapsulates a lot of complexity (like Ousterhout's "Deep Modules"), state machine specs and model-based testing make more sense. Testing a key-value store is a great example of this.
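As an illustration of the key-value store case, here is a minimal model-based test sketch in plain Python. The `LogKV` class is a made-up system under test (an append-only log with tombstones), and the model is just a `dict`; a real setup would more likely use a library such as Hypothesis rather than this hand-rolled loop.

```python
import random


class LogKV:
    """Hypothetical system under test: an append-only log with tombstones."""

    _TOMBSTONE = object()

    def __init__(self):
        self._log = []

    def put(self, key, value):
        self._log.append((key, value))

    def delete(self, key):
        self._log.append((key, self._TOMBSTONE))

    def get(self, key):
        # The most recent entry for the key wins.
        for k, v in reversed(self._log):
            if k == key:
                return None if v is self._TOMBSTONE else v
        return None


def run_model_test(seed, steps=500):
    """Apply random operations to the store and a dict model; compare reads."""
    rng = random.Random(seed)
    store, model = LogKV(), {}
    for step in range(steps):
        op = rng.choice(["put", "delete", "get"])
        key = rng.choice("abc")  # few keys, so operations collide often
        if op == "put":
            value = rng.randrange(100)
            store.put(key, value)
            model[key] = value
        elif op == "delete":
            store.delete(key)
            model.pop(key, None)
        else:
            assert store.get(key) == model.get(key), (seed, step, key)
    return True
```

The model stays trivial because the API is small, even though the implementation behind it could be arbitrarily complicated.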
If you're curious about it, here's a very detailed spec for TodoMVC in Bombadil: https://github.com/owickstrom/bombadil-playground/blob/maste... It's still work-in-progress but pretty close to the original Quickstrom-flavored spec.
Microsoft had a remotely similar tool named Pex [1], but instead of randomly generating inputs, it instrumented the code so that it could also be executed symbolically, and then used their Z3 theorem prover to systematically find inputs that make each encountered condition either true or false, thereby incrementally exploring all possible execution paths. If I remember correctly, it then generated a unit test for each discovered input with the corresponding output, and you could then judge whether the output is what you expected.
[1] https://www.microsoft.com/en-us/research/publication/pex-whi...
UI tests like:
* if there is one or more items on the page one has focus
* if there is more than one then hitting tab changes focus
* if there is at least one, focusing on element x, hitting tab n times and then shift tab n times puts me back on the original element
* if there are n elements, n>0, hitting tab n times visits n unique elements
Are pretty clear and yet cover a remarkable range of issues. I had these for a UI library, starting from “given a UI built with arbitrary calls to the API, those things remain true”.
Now it’s rare it’d catch very specific edge cases, but it was hard to write something wrong accidentally and still pass the tests. They actually found a bug in the specification which was inconsistent.
I think they can often be easier to write than specific tests and clearer to read, because they say what you are actually testing (with specific tests you usually have a generic property in mind, but you have to express it through a few explicit examples).
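The tab-order properties listed above can be sketched roughly like this in plain Python. `FocusRing` is a hypothetical model of a page's focus behaviour standing in for the real UI library, and the random loop is a hand-rolled substitute for a proper property-based testing library.

```python
import random


class FocusRing:
    """Hypothetical model of a page: focusable element ids in tab order."""

    def __init__(self, items):
        self.items = list(items)
        self.index = 0 if self.items else None

    @property
    def focused(self):
        return None if self.index is None else self.items[self.index]

    def tab(self):
        if self.items:
            self.index = (self.index + 1) % len(self.items)

    def shift_tab(self):
        if self.items:
            self.index = (self.index - 1) % len(self.items)


def check_focus_properties(seed):
    rng = random.Random(seed)
    n = rng.randrange(0, 8)
    ring = FocusRing(range(n))
    if n == 0:
        assert ring.focused is None
        return True
    for _ in range(rng.randrange(n)):  # start from an arbitrary element
        ring.tab()
    start = ring.focused
    assert start is not None  # one or more items: one has focus
    if n > 1:
        ring.tab()
        assert ring.focused != start  # tab changes focus
        ring.shift_tab()
    k = rng.randrange(0, 20)
    for _ in range(k):
        ring.tab()
    for _ in range(k):
        ring.shift_tab()
    assert ring.focused == start  # k tabs, then k shift-tabs, round-trips
    seen = set()
    for _ in range(n):
        ring.tab()
        seen.add(ring.focused)
    assert len(seen) == n  # n tabs visit n unique elements
    return True
```

Each assertion maps to one bullet, which is most of the appeal: the test reads like the specification.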
What you could add though is code coverage. If you don’t go through your extremely specific branch that’s a sign there may be a bug hiding there.
I work at Antithesis now so you can take that with a grain of salt, but everything changed for me over a decade ago when I started applying PBT techniques broadly and widely. I have found so many bugs that I wouldn't have otherwise found until production.
https://github.com/papers-we-love/san-francisco/blob/master/...
Recently evaluated other testing tools/frameworks and if you're not already running the npm-dependencyhell-shitshow for your projects, most tools will pull in at least 100 dependencies.
I might be old fashioned but that's just too much for my taste. I love single-use tools with limited scope like e.g. esbuild or now this.
Will give this a try, soon.
Sometimes, only thing you can do is let the plague spread, and hope that the people who survive start showering and washing their hands.
[0]: I once interviewed at a company that sold a kind of on-prem VM hosting and storage product. They were shipping a physical machine with Linux and a custom filesystem (so not ZFS), and they bragged about how their filesystem was very fast, much faster than ZFS or Btrfs on SSDs. I asked them if they were allowed to tell me how they achieved such consistent numbers. They listed a few things, one of which was: "we disabled block-level check-summing". I asked: "how do you prevent corruption?". They said: "we only enable check-summing during the nightly tests". So, a little unsettled, I asked: "you do not do _any_ check-summing at any point in production"? They replied: "Exactly. It's not necessary". So, throwing caution to the wind (at this point I did not care to get the job), I asked: "And you've never had data corruption in production"? They said: "Never. None". To which I replied: "But how do you _know_"? My no-longer-future-coworker thought for a few seconds, and realization flashed across his face. This was a company that had actual customers on 2 continents, and was pulling in at least millions per year. They were probably silently corrupting customer data, while promising that they were the solution -- a hospital selling snake-oil, while thinking it really is medicine.
You should report this to the SQLite developers - they are very smart and very interested in fixing SQLite correctness bugs!
Nice name, now who is he?
Is there a video showing someone spinning this up and finding a bug in a simple app?
A broken counter app maybe?
It's helpful to know what the tool maintainers see as upcoming or incomplete work. It also saves a consultant like me a lot of time to evaluate new tools for clients if I also know the limitations before diving in. Maybe a section in the manual for "What Bombadil can't do".
Great work!
Jokes aside, great project and documentation (manual)! Getting started was really simple, and I quickly understood what it can and cannot do.