Posted by dhorthy 4 days ago
So I set out to document what I've learned about building production-grade AI systems: https://github.com/humanlayer/12-factor-agents. It's a set of principles for building LLM-powered software that's reliable enough to put in the hands of production customers.
In the spirit of Heroku's 12 Factor Apps (https://12factor.net/), these principles focus on the engineering practices that make LLM applications more reliable, scalable, and maintainable. Even as models get exponentially more powerful, these core techniques will remain valuable.
I've seen many SaaS builders try to pivot toward AI by building new greenfield projects on agent frameworks, only to find they couldn't get past the 70-80% reliability bar with out-of-the-box tools. The ones that did succeed tended to take small, modular concepts from agent building and incorporate them into their existing product, rather than starting from scratch.
The full guide goes into detail on each principle with examples and patterns to follow. I've seen these practices work well in production systems handling real user traffic.
I'm sharing this as a starting point—the field is moving quickly so these principles will evolve. I welcome your feedback and contributions to help figure out what "production grade" means for AI systems!
5. Unify execution state and business state
8. Own your control flow
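For what it's worth, a minimal sketch of how I read factor 5 (the class and field names are my own illustration, not from the repo): keep the business state (what the run is about) and the execution state (what the agent has seen and done) in one serializable object, so launch/pause/resume is just saving and reloading that one thing.

    from dataclasses import dataclass, field, asdict
    import json

    @dataclass
    class AgentThread:
        # business state: what this run is actually about
        customer_id: str
        order_id: str
        status: str = "pending_approval"
        # execution state: everything the agent has seen and done so far
        events: list = field(default_factory=list)

        def to_json(self) -> str:
            return json.dumps(asdict(self))

        @classmethod
        def from_json(cls, raw: str) -> "AgentThread":
            return cls(**json.loads(raw))

    # pause = serialize the one object; resume = load it and keep looping
    thread = AgentThread(customer_id="c_123", order_id="o_456")
    thread.events.append({"type": "tool_call", "name": "check_inventory"})
    resumed = AgentThread.from_json(thread.to_json())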
That is exactly what SecAI does: it's a graph control-flow library at its core (multigraph instead of DAG), and LLM calls are embedded into the graph's nodes. The flow is reinforced with negotiation, cancellation, and stateful relations, which make it more "organic". Another thing often missed by other frameworks is dedicated devtools (dbg, repl, svg): programming for failure, inspecting every step in detail, automatic data exporters (metrics, traces, logs, sql), and dead-simple integrations (bash). I've released the first tech demo [1], which showcases all the devtools using a reference implementation of deepresearch (ported from AtomicAgents). You may especially like the Send/Stop button, which is nothing else than "Factor 6: Launch/Pause/Resume with simple APIs". Oh, and it's network transparent, so it can scale. Feel free to reach out.
From my experience, PydanticAI really nailed it with Logfire—debugging[0] agents was significantly easier and more effective compared to the other frameworks and libraries I tested.
The approach is to shape behavior out of chaos by exclusion, instead of defining all possible transitions. With LLMs, this process could be automated, and effectively an agent would be dynamically creating itself using a DSL (state schema and predefined states). The great thing about LLMs is being charged by tokens instead of by the number of requests: we can interrogate them about every detail separately and build a flow graph with transparent (and debuggable) reasoning. I also have API sketches for proactive scenarios (originally made for an ML prototype) [0]. (Toy sketch of the exclusion idea below.)
[0] https://github.com/pancsta/secai/blob/474433796c5ffbc7ec5744...
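To make "exclusion instead of enumerating transitions" concrete, here's a toy sketch in plain Python (my own illustration, not SecAI's actual API): each state only declares which states it kicks out, and anything not excluded can coexist.

    # Toy state schema: activating a state removes its conflicts instead of
    # requiring every legal transition to be spelled out up front.
    SCHEMA = {
        "Idle":         {"excludes": set()},
        "Researching":  {"excludes": {"Idle", "Done"}},
        "Summarizing":  {"excludes": {"Researching"}},
        "AwaitingUser": {"excludes": set()},
        "Done":         {"excludes": {"Researching", "Summarizing"}},
    }

    def activate(active: set[str], state: str) -> set[str]:
        """Add `state` and drop whatever it excludes."""
        nxt = set(active) - SCHEMA[state]["excludes"]
        nxt.add(state)
        return nxt

    active = {"Idle"}
    active = activate(active, "Researching")   # {"Researching"}
    active = activate(active, "AwaitingUser")  # {"Researching", "AwaitingUser"}
    active = activate(active, "Done")          # {"AwaitingUser", "Done"}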
Like you, the biggest one I didn't include but would now is owning the lowest-level planning loop. It's fine to have some dynamic planning, but you should own an OODA loop (observe, orient, decide, act) and have heuristics for determining whether you're converging on a solution (e.g. scoring) or should break out (e.g. max loops).
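Roughly, as a sketch (observe/orient/decide/act and the scoring heuristic are stubs standing in for your own model and tool calls):

    from typing import Any

    MAX_LOOPS = 10
    GOOD_ENOUGH = 0.9

    # Placeholder steps: in a real agent these would call your model/tools.
    def observe(state: dict) -> str:
        return f"observation {len(state['observations'])}"

    def orient(state: dict) -> str:
        return " | ".join(state["observations"][-3:])  # keep only recent context

    def decide(context: str) -> str:
        return "search" if "observation 0" in context else "summarize"

    def act(action: str, state: dict) -> str:
        return f"did {action}"

    def score(state: dict) -> float:
        # Heuristic stand-in: pretend we converge after a few useful actions.
        return min(1.0, 0.25 * len(state["observations"]))

    def run(task: str) -> dict:
        state: dict[str, Any] = {"task": task, "observations": [], "result": None}
        for _ in range(MAX_LOOPS):
            state["observations"].append(observe(state))  # observe
            context = orient(state)                       # orient
            action = decide(context)                      # decide
            state["result"] = act(action, state)          # act
            if score(state) >= GOOD_ENOUGH:               # converging: stop
                return {"status": "done", **state}
        return {"status": "max_loops", **state}           # bail out instead of spinning

    print(run("write a summary"))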
I would also potentially bake in a workflow engine. Then, have your model build a workflow specification that runs on that engine (where workflow steps may call back to the model) instead of trying to keep an implicit workflow valid/progressing through multiple turns in the model.
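A toy version of what I mean (the spec format, tool names, and `call_model` are all made up for illustration): the model emits a declarative workflow spec once, and a small engine drives it, calling back into the model only for the steps that need it.

    import json

    # Hypothetical spec the model would emit as JSON, instead of keeping an
    # implicit plan alive across many chat turns.
    SPEC = json.loads("""
    {
      "steps": [
        {"id": "fetch",    "type": "tool", "tool": "fetch_ticket"},
        {"id": "classify", "type": "llm",  "prompt": "Classify this ticket: {fetch}"},
        {"id": "reply",    "type": "llm",  "prompt": "Draft a reply for a {classify} ticket"}
      ]
    }
    """)

    def call_model(prompt: str) -> str:          # placeholder for your LLM client
        return f"<model output for: {prompt}>"

    TOOLS = {"fetch_ticket": lambda: "printer on fire, sev1"}

    def run_workflow(spec: dict) -> dict:
        results: dict[str, str] = {}
        for step in spec["steps"]:
            if step["type"] == "tool":
                results[step["id"]] = TOOLS[step["tool"]]()
            elif step["type"] == "llm":
                # steps can call back to the model, with earlier results filled in
                results[step["id"]] = call_model(step["prompt"].format(**results))
        return results

    print(run_workflow(SPEC))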
As I was reading, I saw a mention of BAML:

> (the above example uses BAML to generate the prompt ...
In my experience, hand-writing prompts for extracting structured information from unstructured data has never been easy. With DSPy [0], my experience has been quite good so far.
Since you've shown the raw prompt from BAML, what do you think of the raw prompts from DSPy [2]? (Sketch of what I mean below.)
[0] https://dspy.ai/
[1] https://github.com/humanlayer/12-factor-agents/blob/main/con...
[2] https://dspy.ai/tutorials/observability/#using-inspect_histo...
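For context, this is roughly the DSPy workflow I mean; a minimal sketch, assuming an OpenAI key is configured (the model name and signature fields are just examples):

    import dspy

    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    class ExtractInvoice(dspy.Signature):
        """Extract structured fields from raw invoice text."""
        text: str = dspy.InputField()
        vendor: str = dspy.OutputField()
        total: float = dspy.OutputField()

    extract = dspy.Predict(ExtractInvoice)
    result = extract(text="ACME Corp invoice, total due $1,234.50")
    print(result.vendor, result.total)

    dspy.inspect_history(n=1)  # shows the raw prompt/completion DSPy actually sent [2]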
I don't fully agree with this article https://www.chrismdp.com/beyond-prompting/ but the comparison of punch cards -> assembly -> C -> higher-level languages is quite useful here.
I just don't know when we'll get the right abstraction. I don't think LangChain or DSPy are the "C programming language" of AI yet (they could get there!).
For now I'll stick to my "close to the metal" workbench where I can inspect tokens, reorder special tokens like system/user/JSON, and dynamically keep up with the idiosyncrasies of new models without being locked up waiting for library support.
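Concretely, the "close to the metal" loop is something like this; a sketch, assuming a Hugging Face tokenizer with a chat template (the model name is just an example, any chat model works):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

    messages = [
        {"role": "system", "content": "You are a terse assistant. Reply in JSON."},
        {"role": "user", "content": "Summarize: the build failed twice, then passed."},
    ]

    # Render the exact token sequence the model will see, special tokens included.
    ids = tok.apply_chat_template(messages, add_generation_prompt=True)
    print(tok.convert_ids_to_tokens(ids))  # inspect/reorder special tokens by hand
    print(tok.decode(ids))                 # the raw prompt string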
However, I think the vast majority of use cases will not require this level of control, and we will abandon prompts once the tools improve.
LangChain and DSPy aren't there for me either. I think the whole idea of prompting + evals needs a rethink.
(full disclaimer: I'm working on such a tool right now!)
Here's a take I adapted from someone on the NotebookLM team on swyx's podcast:
> the only way to build really impressive experiences in AI, is to find something right at the edge of the model's capability, and to get it right consistently.
So in order to build something very good / better than the rest, you will always benefit from being able to bring in every optimization you can.
That's certainly what I found in games. The games that felt magical to play were never the ones with the best hand-rolled engine.
The tools aren't there yet to ignore prompts, and you'll always need to drop down to raw prompting sometimes. I'm looking forward to a future where wrangling prompts is only needed for 1% of my system.
“… you can find frameworks not just in software, but also in ordinary life. If you buy package holidays, you're buying a framework - they transport you to some place, put you in a hotel, feed you and your activities have to fit into the shape provided by the framework (say, go into the pool and swim there). If you travel independently, you are composing libraries. You have to book your flights, find your accommodation and arrange your program (all using different libraries). It is more work, but you are in control - and you can arrange things exactly the way you need.”
Hmm, that hit a bit of a nerve. My experience with switch blocks is that they can be a gateway drug for teams A, B, and C to add their special-case code to team D's repo inside a `switch(calling_service)` block. My read of the presentation is more: factor your stuff so that any "switch" is a higher-level concern that consumers can handle in their own services. Then, if you start to see all your consumers write very similar consumption logic, start thinking about how to pull that down into the library/service itself.
But beyond that trigger nerve, agreed.
Agreed that big switch statements can be an anti-pattern, e.g. when an interface is clearly better suited.
And I don't mean to imply that frameworks are always bad. Things like security best practices out of the box can be worth it. But especially in AI right now, nobody knows what those best practices are going to be. So it's best to spend this time learning how to do things at a low level rather than attaching to some framework that may be obsolete in a year.
If we had the right interface, we would set up the black box, and then put holes/knobs on the box to allow anyone to change the things they should actually need to change.
If we have the wrong interface, then the knobs aren't interesting, and instead we keep opening the box, or reaching into the holes at weird angles to do things that nobody knew we'd want to do, but that are obviously the right things to do to maximize performance.
Someday we'll have the right interface, but for now it's better to skip the box and spend the extra cycles. You're an engineer; you can write a for loop and a switch statement. Don't outsource your prompts and give up control flow to save a few hundred lines that will eventually become pretty customized anyway.
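In code, the "for loop and a switch statement" version of an agent is about this much (the intents, the tool, and `determine_next_step` are illustrative stand-ins for a structured-output model call):

    # The whole control flow, owned: a loop and a switch.
    def determine_next_step(thread: list) -> dict:
        # stand-in for an LLM call that returns a structured next step
        if not thread:
            return {"intent": "list_git_tags"}
        return {"intent": "done_for_now", "message": "tags listed"}

    def list_git_tags() -> list[str]:
        return ["v1.0.0", "v1.1.0"]

    def handle(thread: list) -> list:
        while True:
            step = determine_next_step(thread)
            thread.append(step)
            match step["intent"]:
                case "list_git_tags":
                    thread.append({"result": list_git_tags()})
                case "request_approval" | "done_for_now":
                    # break out and hand control back to the caller / a human
                    return thread
                case _:
                    thread.append({"error": f"unknown intent {step['intent']}"})
                    return thread

    print(handle([]))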
These things aren't cheap at scale, so whenever something might be handled by a deterministic component, try that first. Not only do you save on hallucinations and latency, it can also make a huge difference to your bottom line.
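e.g. a cheap deterministic guard in front of the model call (the regex and routing here are obviously just an illustration):

    import re

    ORDER_ID = re.compile(r"\border\s+#?(\d{6})\b", re.IGNORECASE)

    def route(message: str) -> dict:
        # Deterministic path first: no tokens, no latency, no hallucination.
        m = ORDER_ID.search(message)
        if m:
            return {"handler": "order_lookup", "order_id": m.group(1)}
        # Only fall back to the model for genuinely ambiguous input.
        return {"handler": "llm", "prompt": f"Classify this request: {message}"}

    print(route("where is order #123456?"))    # handled deterministically
    print(route("my thing never arrived :("))  # goes to the model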
Definitely wanna evolve this in the open with the community
I am inspired by the simplicity of these 12 factors and definitely want to learn more with an example that embraces these factors.
Personally I've had success with LangGraph + pydantic schemas. Curious to know what others have found useful.
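For reference, the shape of what's worked for me; a simplified sketch with the node logic stubbed out (the state fields and node names are just examples):

    from pydantic import BaseModel
    from langgraph.graph import StateGraph, START, END

    class State(BaseModel):
        question: str
        draft: str = ""
        approved: bool = False

    def draft_answer(state: State) -> dict:
        # would call the model here; stubbed for the sketch
        return {"draft": f"Answer to: {state.question}"}

    def review(state: State) -> dict:
        return {"approved": len(state.draft) > 0}

    builder = StateGraph(State)
    builder.add_node("draft_answer", draft_answer)
    builder.add_node("review", review)
    builder.add_edge(START, "draft_answer")
    builder.add_edge("draft_answer", "review")
    builder.add_edge("review", END)

    graph = builder.compile()
    print(graph.invoke({"question": "What broke the deploy?"}))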
> I have learned 80% the hard way
because the other working title for this was "Agents the Hard Way" (in the spirit of https://github.com/kelseyhightower/kubernetes-the-hard-way)
I've been tinkering with an idea for an audiovisual sandbox[1] (like vvvv[2] but much simpler of course, barebones).
The idea is to have a way to insert LM (or some simple locally run neural net) "nodes" which are given specific tasks and whose output is expected to be very constrained. Hence your example:
"question -> answer: float"
Is very attractive here. Of course, some questions in my case would be quite abstract, but anyway. Multistage pipelines are also very interesting. (Rough sketch of such a constrained node below.)

[1]: a loose, unorganised set of bullet points brainstorming the idea, if curious: https://kfs.mkj.lt/#audiovisllm (click to expand description)
[2]: https://vvvv.org/
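A tiny sketch of what such a constrained node could look like (the parse/retry loop is the point; `ask_model` is a stand-in for whatever LM gets wired in):

    # A "node" whose contract is question -> answer: float.
    def ask_model(prompt: str) -> str:
        return "0.72"  # stand-in for a local or hosted LM call

    def float_node(question: str, retries: int = 2) -> float:
        prompt = f"{question}\nAnswer with a single float between 0 and 1."
        for _ in range(retries + 1):
            raw = ask_model(prompt).strip()
            try:
                value = float(raw)
                if 0.0 <= value <= 1.0:
                    return value
            except ValueError:
                pass
            prompt += f"\nYour previous answer '{raw}' was not a bare float. Try again."
        raise ValueError("node failed to produce a constrained answer")

    # e.g. drive a visual parameter from an abstract question
    brightness = float_node("How 'warm' does this scene description feel?")
    print(brightness)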