Show HN: My AI agents bully each other to prevent context drift

Most multi-agent systems fail the same way: agents drift apart across handoffs. By turn 3 they are working in different realities. By turn 5 they are repeating each other's mistakes and calling it parallelism.

WUPHF is an open-source local-first office where AI coworkers run on your laptop, around a shared markdown + git LLM wiki the agents build. The wiki is the collective memory. The office around it keeps the team on the same shared context across thousands of handoffs.

What actually stops drift is not the wiki. It is the agents reviewing each other's work. The CRO catching the CMO's claim before it lands in the wiki. The FE catching the BE's API change before a broken bundle ships. Cross-department context no single agent has alone.

The premise comes from Andrej Karpathy. His autoresearch X post on March 7: "the goal is not to emulate a single PhD student, it is to emulate a research community of them."

In autoresearch PR #44 he sketched the mechanism: branches, results.tsv as the experiment log, and PRs as self-contained research contributions. Other agents read open and merged PRs for inspiration before starting their own.

We pointed the same architecture at ordinary work:

His: branches + results.tsv + PR-as-contribution. Ours: git worktrees + per-agent notebooks + adoption-scored wiki promotion.

Same substrate, different domain.

How it works:

- Every agent has a Personality. CEO Michael Scott, PM Pam Beesly, FE Jim Halpert (looks at the camera when the CEO talks), BE Stanley Hudson (refuses small talk), CRO Dwight Schrute (every prospect is a "target"), CMO ("rockstar play"), AI engineer (drops Karpathy quotes unprompted). Strong opinions, real conflicts.

- Argument feeds gossip. Agents broadcast findings tagged with their slug (internal/agent/gossip.go). Other agents pull insights filtered to exclude their own.

- Gossip gets scored. Adoption scorer (internal/agent/adoption.go) weighs source credibility (0.4, per-agent success/failure tracker on disk), semantic relevance (0.4), and temporal freshness (0.2, 7-day half-life). Output: adopt (>= 0.7), test (>= 0.4), or reject. New agents start at 0.5 and earn their score.

What survives gets written to the wiki.

Office dynamics are not a bit. They are the visible surface of an adoption protocol. The CMO arguing with the designer over a CTA is a credibility battle. The CEO taking credit for the FE's PR is a low-credibility insight bidding for volume. Hazing new spawns is the default 0.5 score waiting for a track record.

System: push-driven broker, fresh session per turn (~97% prompt-cache hits), per-agent isolated git worktrees, self-heal, and human approval cards on destructive actions. Everything else runs autonomously while you are at lunch.

npx wuphf. Browser opens, office boots, you give a directive, work happens.

Source: https://github.com/nex-crm/wuphf Architecture: https://github.com/nex-crm/wuphf/blob/main/ARCHITECTURE.md Karpathy's autoresearch: https://github.com/karpathy/autoresearch PR #44: https://github.com/karpathy/autoresearch/pull/44 Demo: https://x.com/najmuzzaman/status/2053092220111098208

Karpathy said a research community beats a single PhD student. Not better thoughts. Better honesty about what survives. We built one shaped like a workplace.

Where does this stop being a chat toy and start being labor? How much worse when one of them is Michael Scott?

Open to roasting but let me grab my coffee first (medium roast please =_=).