Posted by dhorthy 4 days ago
So I set out to document what I've learned about building production-grade AI systems: https://github.com/humanlayer/12-factor-agents. It's a set of principles for building LLM-powered software that's reliable enough to put in the hands of production customers.
In the spirit of Heroku's 12 Factor Apps (https://12factor.net/), these principles focus on the engineering practices that make LLM applications more reliable, scalable, and maintainable. Even as models get exponentially more powerful, these core techniques will remain valuable.
I've seen many SaaS builders try to pivot towards AI by building greenfield projects on agent frameworks, only to find that they couldn't get past the 70-80% reliability bar with out-of-the-box tools. The ones that did succeed tended to take small, modular concepts from agent building and incorporate them into their existing product, rather than starting from scratch.
The full guide goes into detail on each principle with examples and patterns to follow. I've seen these practices work well in production systems handling real user traffic.
I'm sharing this as a starting point—the field is moving quickly so these principles will evolve. I welcome your feedback and contributions to help figure out what "production grade" means for AI systems!
I'd love to work on stuff like this full-time. If anyone is interested in a chat, my email is on my profile (US/EU).
I've been saying that forever, and I think that anyone who actually implements AI in an enterprise context has come to the same conclusion. Using the Anthropic vernacular, AI "workflows" are the solution 90% of the time and AI "agents" maybe 10%. But everyone wants the shiny new object on their CV and the LLM vendors want to bias the market in that direction because running LLMs in a loop drives token consumption through the roof.
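To make the workflow/agent distinction concrete, here's a minimal sketch. `call_llm` is a hypothetical stand-in for any real LLM client, with canned replies only so the sketch runs end to end; the function names and prompts are illustrative, not from any particular framework:

```python
# Hypothetical stand-in for a real LLM client. Canned replies exist
# only so this sketch is self-contained and runnable.
def call_llm(prompt: str) -> str:
    if prompt.startswith("Summarize:"):
        return "a short summary"
    if prompt.startswith("Translate"):
        return "un court résumé"
    return "DONE: finished"

# "Workflow": control flow lives in your code. Each LLM call is one
# bounded, testable step in a fixed pipeline.
def summarize_and_translate(doc: str) -> str:
    summary = call_llm(f"Summarize:\n{doc}")
    return call_llm(f"Translate to French:\n{summary}")

# "Agent": control flow lives in the model. The LLM runs in a loop,
# choosing tools until it decides it's done -- and every iteration
# re-sends the growing history, which is where token costs explode.
def agent_loop(task: str, tools: dict, max_steps: int = 10) -> str:
    history = [task]
    for _ in range(max_steps):
        action = call_llm("\n".join(history))
        if action.startswith("DONE:"):
            return action.removeprefix("DONE:").strip()
        name, _, arg = action.partition(" ")
        history.append(tools[name](arg))
    raise RuntimeError("agent hit step limit")
```

The workflow version is boring on purpose: you can unit-test each step and reason about worst-case cost, which is exactly why it covers the 90% case.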
You should watch the CDO-squared scene from the Big Short again.
> look if you don't trust the LLM to make the thing right in the first place, how are you gonna trust PROBABLY THE SAME LLM to fix it?
yes, I know multiple passes improve performance, but they don't guarantee anything. for a lot of tools you might wanna call, 90% or even 99% accuracy isn't enough
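The back-of-envelope math is why per-call accuracy matters so much: if each step succeeds independently, failure compounds over a run. A quick sketch (the 90%/99% figures just echo the numbers above; independence is an assumption):

```python
# If each tool call succeeds independently with probability p,
# a chain of n calls succeeds with probability p ** n.
def chain_success(p: float, n: int) -> float:
    return p ** n

# Even 99%-accurate steps degrade quickly over a long run:
print(f"{chain_success(0.90, 10):.3f}")  # ~0.349
print(f"{chain_success(0.99, 10):.3f}")  # ~0.904
print(f"{chain_success(0.99, 50):.3f}")  # ~0.605
```

So a 50-step agent built from 99%-reliable calls finishes cleanly only about 60% of the time, which is roughly the 70-80% ceiling the post describes.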
Of course a new breakthrough could happen any day and get through that ceiling. Or "common sense" may be something that's out of reach for a machine without life experience. Until that shakes out, I'd be reluctant to make any big bets on any AI-for-everything solutions.
Maybe Doug Lenat's idea of a common sense knowledge base wasn't such a bad one.
I could see one of the twelve factors being around observability beyond just "what's the context" - that may be a good thing to incorporate for version 1.1
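One possible shape for that factor: wrap every LLM call and emit a structured event with latency, payload sizes, and outcome, rather than only dumping the context. A minimal sketch, assuming nothing about any particular client (`llm_fn` is a hypothetical callable, and `log.append` stands in for shipping to a real tracing backend):

```python
import time

# Wrap an LLM call and record a structured observability event:
# payload sizes, outcome, and latency -- not just "what's the context".
def traced_call(llm_fn, prompt: str, log: list) -> str:
    start = time.monotonic()
    event = {"prompt_chars": len(prompt)}
    try:
        reply = llm_fn(prompt)
        event["status"] = "ok"
        event["reply_chars"] = len(reply)
        return reply
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        event["latency_s"] = round(time.monotonic() - start, 4)
        log.append(event)  # ship to your tracing backend instead

# Usage with a toy "model":
log: list = []
reply = traced_call(lambda p: p.upper(), "hello", log)
```

Token counts and tool-call traces would slot into the same event dict once a real client is plugged in.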
but since you asked, to name a few:

- ts: mastra, gensx, vercel ai, many others!
- python: crew, langgraph, many others!