Posted by blumpy22 1 day ago
It feels like most of the “rules” are “don’t be an ass to your consumer”.
Doing stuff for the machine: the behaviour of a pragmatic, nuanced builder. A forward-thinking agentic AI pioneer, executing and shipping at the unexplored boundary of modes of human creativity #building #shipping #executing
I found myself nodding along to the linked tweet/article. Recently I did many rounds of iterative user-centered design with an agent to improve the CLI interface in Jobs [0], a task manager for LLMs. The resulting CLI follows most of these principles.
One great idea from the tweet that I will be adding: a `feedback` subcommand, for the agent to capture feedback while they work.
> There's a deeper assumption underneath all of it. The classic Command Line Interface Guidelines treat a human at a terminal as the primary user, with agents as a tolerated secondary audience. That's no longer the right default. Cloudflare puts it directly in their post: "Increasingly, agents are the primary customer of our APIs." Their whole schema approach is built around that. HeyGen launched their CLI with "agent" in the marketing copy. Design for agents first, and humans benefit. Designing for humans first and bolting on agent support is what produces the inconsistent, prompt-prone, stdout-only CLIs the first five principles exist to correct.
I don't think that's true at all. If you're someone who has lived in the terminal for a few years, you will have a sense of taste that naturally leads you to do the right thing. If you've used Git and systemctl and you know why p7zip feels alien on Unix and you have cursed a command where `-h` doesn't mean help, nobody needs to tell you basically any of this. If you've ever met jq, you don't need anyone to tell you that `--json` is a very valuable thing to have. You also don't need anyone to tell you what a uniform hierarchy of flags and options with different scopes should look like; if you've used a program that uses subcommands, even a shitty one, you know what a good one should look like.
When command-line tools (or inconsistent collections thereof) are difficult for AI in the ways the article describes, it's because they're shit. When command-line tools are shit, it's because nobody is taking the design of those interfaces seriously at all, typically some combination of:
- the interface isn't "designed" at all, it's just naively evolved.
- you're leaving writing a CLI tool to someone who tolerates the command-line but doesn't live in it
- the object is treated as only a human/interactive interface or only a programming interface when in fact it's always both
- your suite of tools has diffuse ownership and nobody thinks command-line interfaces are important enough to have standards for
If you treat a GUI as unseriously as that, it invariably turns into a pile of shit, too!

Anybody who ought to be writing one has already internalized all the right norms. Most of it comes for free from living in the shell. Put one person in charge and it'll be uniform. If you can't, writing a style guide and enforcing it with linters and tests is a great idea. But this is just taking command-line interfaces seriously as interfaces. It has pretty much nothing to do with AI except at the edges (e.g., a JSON-flavored companion to --help).
When I worked at Heroku, basically all of these principles held (though usually described slightly differently or for different reasons) back then too. These are just good CLI design principles; there's nothing agent-native about them:
- Build small, sharp commands that don't require interactivity.
- Follow *nix conventions so users can pipe results in and out to build workflows beyond what you initially imagined.
- Provide useful help and examples.
- If there's a reasonable guess about the next thing a person should do, offer it as a suggestion.
- Be consistent in your terminology.
- Be consistent in data format (e.g., don't expect a short-form name of a resource as input in one place and the integer ID in another).
- For information that sets the context of a command (e.g., which user, which org), provide an environment-level config and a per-command option.
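Those conventions can be sketched as a toy dispatcher. Everything here (`mytool`, `MYTOOL_ORG`, the subcommand names) is hypothetical, just illustrating the shape: small subcommands, a consistent `--json` flag, and an env-level default with a per-command override.

```shell
# Toy "mytool" dispatcher (all names made up for illustration):
# consistent verbs, a --json flag everywhere, and an environment-level
# default (MYTOOL_ORG) that a per-command --org flag can override.
mytool() {
  cmd="$1"; shift
  org="${MYTOOL_ORG:-default-org}"   # environment-level config
  json=false
  resource=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --json) json=true ;;
      --org)  org="$2"; shift ;;     # per-command override
      *)      resource="$1" ;;
    esac
    shift
  done
  echo "$cmd $resource org=$org json=$json"
}

mytool get app-123 --json
```

The point isn't the parsing; it's that every subcommand accepts the same flags with the same meanings, so nothing has to be re-learned per command.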
Just lots of generally helpful advice for people. Turns out it's helpful to agents too.
Something that seems like agent-specific conventional wisdom that I'm not fully bought into: JSON as the output format. For all but the most trivial outputs, the LLM does not actually seem to want JSON and will instead jump through various hoops to turn it into something it can parse more easily. We experimented with TOON [1] as a format and immediately confirmed the reduced-token-output claims. However, when benchmarking actual real use cases, TOON performed worse than both JSON and having the LLM just consume the human output.

Digging further into that was eye-opening: the reason JSON did so well had less to do with the LLM understanding JSON and more to do with its knowledge of the extensive ecosystem that already exists around JSON as a format. Looking at all the various tool calls, we could see it'd make heavy use of piping JSON data to `jq`, `cut`, `awk`, `sort`, `wc`, etc. to get the data into the shape it needed. Failing that, it would fall back to writing temporary Python scripts to get it into the correct shape.
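For a flavor of what those tool calls looked like, here's a coreutils-only sketch with fabricated sample data. The real pipelines usually started with `jq` on JSON output, but the reshaping pattern is the same: pull out a column, then aggregate.

```shell
# Fabricated sample output (the real CLI's fields differ); the agent's
# move is always the same: extract a column, then count/sort/filter it.
printf 'id,state\n1,done\n2,failed\n3,failed\n' \
  | cut -d, -f2 | grep -c failed
```

Which prints `2` — the kind of "how many jobs failed?" question that, as noted below, the CLI should probably be able to answer directly.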
Capturing all of those logs to understand the performance differences felt like a form of the usability testing we used to do at Heroku too. I suddenly saw the way someone (something, in this case) was using the tool in ways I didn't entirely expect. Many of them were essentially getting answers to perfectly reasonable questions that we should be surfacing in a better way to humans and agents alike. It's like I managed to squash hundreds of usability tests into a couple of days. It was pretty simple to add additional flexibility to the CLI commands and clearer messaging in other places, which drastically reduced the need for the LLM to post-process the data no matter what format it received it in.
So we still support JSON as a data format because it's genuinely useful for a bunch of reasons. But we also have something more LLM friendly (TOON-like, but not entirely compliant in specific circumstances where we can see it's inefficient) to be as efficient with token usage as we can be. That's about the only agent-only addition to the CLI in the end. Despite building it agent-first, it's helped us get to a better human product.
> 1. Non-interactive by default
> Commands have to run without interactive prompts when an agent invokes them. When a subagent spawns a background process, there's nothing answering the prompt. The command hangs.
If a command's stdin/stdout are attached to a terminal, it can run in interactive mode. If they aren't, it can run in non-interactive mode. This isn't even wrong, it's just nonsense.
> 2. Structured, parseable output
> A nicely aligned table with ANSI colors is for humans. An agent extracting a post ID needs JSON.
It's a natural language agent. It obviously doesn't need JSON. And JSON is extremely wasteful in terms of tokens.
> 3. Errors that teach, and enumerate
> The original principle was "fail fast with actionable errors." That still holds, with one refinement I missed the first time. When the failure is "you passed an invalid value for X," the error should include the valid set.
Except that the valid set can be huge. You're much better off describing what's valid. It's a natural language agent. Talk to it!
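A sketch of that distinction, with an entirely made-up `check_region` validator: when the valid set is small, enumerating it in the error is cheap; when it's huge, describe the shape of valid input instead.

```shell
# Hypothetical validator: for a small valid set, enumerate it in the error.
check_region() {
  case "$1" in
    us-east|us-west|eu-central) echo "ok" ;;
    *) echo "error: invalid region '$1' (valid: us-east, us-west, eu-central)" ;;
  esac
}

# For a huge set, describe rather than enumerate, e.g.:
#   error: unknown user 'bob' (expected a numeric user id; see 'mytool users list')

check_region mars
```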
> 4. Safe retries and explicit mutation boundaries
> Agents retry. Humans glance at a duplicate row and notice; agents don't.
What does this have to do with agents? Yeah, if possible, make your operations idempotent. If not, well... then don't? Humans will make exactly the same mistakes as agents here.
> 5. Bounded responses, at every layer
> Tokens cost money and context. Big outputs are sometimes justified, but the default should be narrow.
How do you know what your agent needs? Let the agent bound and select. Agents are perfectly capable of sending your output through grep or through head/tail.
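Bounding on the consumer side really is one pipe away, and every agent has seen billions of examples of it:

```shell
# The consumer decides how much it needs, with tools it already knows:
seq 1 100000 | head -n 3    # first three lines only
seq 1 100000 | tail -n 1    # just the last line
```

`head` closing the pipe early even stops the producer (via SIGPIPE), so nothing upstream keeps burning cycles on output nobody reads.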
> 6. Cross-CLI vocabulary consistency
> This is the principle I'm most certain about, and the one most under-stated in the original.
> Agents don't memorize one CLI at a time. They build a generalized model of what CLIs do, drawn from every CLI they've seen. When your tool uses info for what every other tool calls get, the agent doesn't fail; it succeeds slowly, with extra retries, after burning tokens on --help. Multiply that across thousands of agent invocations per week and the cost is real.
Agents need to deal with hundreds of CLIs that are all inconsistent. What matters is that you describe to the agent how each CLI works. It doesn't matter if they're consistent.
> 7. Three-layer introspection
> The original principle here was "progressive help discovery": top-level --help lists commands, subcommand --help shows usage. That's still true, but it's now the bottom layer of a three-layer stack. Each layer answers a different question.
Truly the worst advice. Take a simple, clean output that's easy to understand in one go and turn it into crazy-complex JSON that requires multiple inferences to understand. Not only are you wasting compute, you're wasting your time waiting for that compute.
> 8. Async-aware execution
> Most CLIs treat async APIs the way the underlying HTTP endpoint does: submit returns a job ID, poll returns a status, that's the agent's problem. Two failure modes follow. Either the agent writes its own poll loop (wasting tokens and getting it subtly wrong), or it doesn't, and the workflow fails because the result wasn't ready when the next step ran.
No, it got worse. Horrific advice. Take a simple API that an agent can easily wait for in the background and turn it into a stateful monster that can clog everything up with junk. Oh and now when you have multiple agents they get to have a fun conflict.
> 9. Persistent identity through profiles
> Agents don't show up once. They show up tomorrow, and the day after, and a week from now, in a different shell, with the same underlying intent and a different specific input. Stateless leaf-shaped CLIs make every invocation re-specify the same eight flags.
Ok, I was wrong. 8 was bad. 9 is much much worse. It's a guarantee that your agents will get things wrong. Why? Because agents forget all the time!
> 10. Two-way I/O
> The original principle 6 (composable and predictable structure) covered stdin/stdout pipelining. That's still true. But agents don't only consume CLIs through pipes, and the CLI doesn't only emit through stdout. There are two new mechanisms worth adding: a way for the CLI to emit artifacts where the agent actually needs them, and a way for the agent to report friction back.
This literally exists in a form that every single agent knows: bash pipes and redirects. They've been trained on billions of examples of this. Now instead of just using that, you're adding a custom version that will just confuse the agent.
I'm not sure I could have written a worse list if I tried.
That said, I will say I personally love an optional --wait flag. I've written so many bash scripts where I have to do the status looping manually when all I want is to just do the operation, then do something else once it's complete. For the most part I'm willing to sacrifice a little control there for simplicity.
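For contrast, the manual version of that loop — a runnable sketch where `check_status` is a stub standing in for the real CLI's status query (any real tool would hit an API here):

```shell
# The status loop you end up writing by hand when there's no --wait flag.
# check_status is a stub so the sketch runs; a real one would query the job.
check_status() { echo done; }

until [ "$(check_status)" = "done" ]; do
  sleep 5
done
echo "job finished"
```

A `--wait` flag folds exactly this into the tool, without making the command itself stateful.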
I 100% agree with your take on the "Two Way I/O". I hate having to figure out how to coerce tools to give me the right output file when all I want is for them to cleanly write the output to stdout, the progress messages and errors to stderr, and let me deal with how they get redirected. This is a core principle that's existed in CLI tools since forever. Agents and humans are both very capable of stringing together other tools to get the results you want.
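That principle in its smallest runnable form — results on stdout, chatter on stderr, routing left entirely to the caller (`emit` is a made-up stand-in for any well-behaved tool):

```shell
# Results go to stdout, progress/errors to stderr; the caller routes them.
emit() {
  echo "progress: working..." >&2   # human-facing chatter
  echo "result-data"                # machine-consumable result
}

# Discard the chatter, keep the result:
emit 2>/dev/null
```

The same invocation works as `emit >out.txt 2>log.txt` or in a pipeline; no bespoke artifact mechanism needed.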