At human.tech we’ve tried a number of experiments with AI and have 30xed productivity for some jobs. At first we used AI like everyone did: developers used Claude Code extensively, while everyone else asked ChatGPT, Gemini, or Claude for help with miscellaneous tasks. AI mostly benefited engineers: nearly all of their output came from AI, while non-engineers were perhaps 30% more effective at most. But then we took some major risks to experiment with a new AI-native company architecture. And the results were surprising.
The motivating problem for this experiment was: “what makes Claude Code so effective for programming but not other tasks?” We always assumed Anthropic would just release a “Claude PM,” “Claude CMO,” and “Claude Ops” equivalent for 10x efficiency boosts in departments besides engineering…but neither Anthropic nor anybody else made any good tools that could 10x anybody’s output. I kept wondering: what was special about engineering?
The first hypothesis was that it was due to Claude Code’s three-step harness:
- Plan
- Execute
- Verify
These steps are far easier to give an agent for engineering tasks than for non-engineering ones. Planning can utilize thousands of lines of organized code. Execution is just modifying files. And verifying is running tests. For non-engineering tasks, an agent can’t easily get this harness.
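To make the loop concrete, here is a minimal Python sketch of a plan, execute, verify cycle. It is an illustration only, not how Claude Code is actually implemented: `run_harness`, `HarnessResult`, and the toy `plan`/`execute`/`verify` functions are hypothetical names, and a real harness would call an agent, edit files, and run tests where the toy functions manipulate strings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HarnessResult:
    output: str
    passed: bool
    attempts: int


def run_harness(
    task: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[List[str], str], str],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> HarnessResult:
    """Plan once, then execute and verify until the check passes or attempts run out."""
    steps = plan(task)
    output = ""
    for attempt in range(1, max_attempts + 1):
        # Feed the previous output back in so execute can revise it.
        output = execute(steps, output)
        if verify(output):
            return HarnessResult(output, True, attempt)
    return HarnessResult(output, False, max_attempts)


# Toy stand-ins (hypothetical): they only exist to exercise the loop.
def toy_plan(task: str) -> List[str]:
    return [f"draft: {task}", "add summary"]


def toy_execute(steps: List[str], previous: str) -> str:
    # First pass writes a draft; later passes append the missing summary.
    if not previous:
        return steps[0]
    return previous + "\n" + steps[1]


def toy_verify(output: str) -> bool:
    return "add summary" in output


result = run_harness("release notes", toy_plan, toy_execute, toy_verify)
print(result.passed, result.attempts)
```

For code, `verify` is usually "run the test suite." The difficulty for non-engineering work is that there is often no equally cheap function to pass in for `plan` or `verify`.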
Planning and verification are the hard parts for non-engineers. The performance of AI tools on a codebase varies tremendously with how organized it is — which suggests context management, not raw volume, is what matters: the win is letting agents easily find comprehensive material about exactly what they need without polluting their context window.
If so, this becomes a solvable problem for non-engineering departments: organize all the information about your organization, products, people, etc. (anything you would ever need an agent to know to do comprehensive work for you) so that agents can quickly grab whatever they need without polluting their context with irrelevant information. I.e.: scrap Notion and remake all internal documentation, project planning, etc. in markdown. We spent days planning the architecture of this information. It involved extensive indexing, verification (a.k.a. “self-healing”) to avoid stale facts and broken file indexes or references, and YAML frontmatter to give agents a semi-structured way to perform this verification. We even put the roadmap, OKRs, and everyone’s tasks in the internal docs so that agents could help strategize and distribute tasks.
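As a rough illustration of what such a verification pass could look like, the Python sketch below scans markdown files for two of the problems mentioned above: stale facts and broken references. Everything specific here is an assumption, not our actual schema: the `last_verified` frontmatter key, the 90-day threshold, and the file names are hypothetical, and the parser only handles flat `key: value` frontmatter rather than full YAML.

```python
import re
import tempfile
from datetime import date, timedelta
from pathlib import Path

# Captures the target of a markdown link, without any "#anchor" suffix.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)")


def parse_frontmatter(text: str) -> dict:
    """Read flat `key: value` pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def check_doc(path: Path, today: date, max_age_days: int = 90) -> list:
    """Return a list of human-readable problems found in one document."""
    text = path.read_text(encoding="utf-8")
    problems = []
    verified = parse_frontmatter(text).get("last_verified")  # hypothetical key
    if verified is None:
        problems.append("missing last_verified")
    elif today - date.fromisoformat(verified) > timedelta(days=max_age_days):
        problems.append(f"stale: last verified {verified}")
    for target in LINK_RE.findall(text):
        if "://" in target or target.startswith("mailto:"):
            continue  # only relative links between docs are checked
        if not (path.parent / target).exists():
            problems.append(f"broken link: {target}")
    return problems


# Demo on a throwaway directory with one healthy and one unhealthy doc.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pricing.md").write_text(
        "---\nlast_verified: 2025-01-10\n---\nPricing notes.\n", encoding="utf-8"
    )
    (root / "index.md").write_text(
        "---\nowner: ops\nlast_verified: 2024-01-01\n---\n"
        "See [pricing](pricing.md) and [roadmap](roadmap.md).\n",
        encoding="utf-8",
    )
    report = {
        p.name: check_doc(p, today=date(2025, 2, 1))
        for p in sorted(root.glob("*.md"))
    }

print(report)
```

A report like this can be handed to an agent as a to-do list: re-verify the stale file, then fix or remove the dead link.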
Post-internal-docs results
This was planned to be step one of three in an attempt to build Claude Code-like harnesses for non-engineers. After over a week of work on the docs I thought “that was fun but we need to do the other (far more difficult) steps to actually do anything useful.” The same day, Daniel and I caught up. Daniel runs our developer relations, go-to-market, and business development. Before I could share any next-step plans for non-engineer harness experiments, he told me: “I’ve been using the internal docs the past three days and I just did three months of work.” This was surprising to me because it felt like only step one. It turns out that if your tasks involve creating developer relations materials, decks, GTM strategies, and other generative work, you often don’t need the other steps of the harness.
We’ve since put our CRM in internal-docs too, so agents can pull customer and pipeline context just like everything else. They can also generate branded materials like invoices, sales enablement decks, and marketing graphics directly from the brand assets and instructions stored there.
Agents used to hallucinate multiple facts about our company nearly every time they were prompted. Now mistakes are exceedingly uncommon; the only one we tend to find is feature liveness, because for some strange reason AI is not good at telling when planned features are actually in production. I’m sure this could be fixed with a simple skill, but it is rare enough that it has not justified the effort. With so few mistakes, the verification step of the harness is less important.
Next
This was just step one. It 30xed Daniel’s productivity, and we believe every AI experiment has the potential to pay off rapidly. Today, we focus most of our effort on step two of the three-step harness: execution. Content, decks, and strategies are great, but how about doing something that’s not purely generative and also requires action, e.g. updating production deployments or posting on Twitter? Well, it’s hard for anyone who’s not technical to set up and maintain 10 MCP servers or skills. And as a security-conscious company that plans to launch a token, there’s no way we’re giving everyone and their agent the ability to post on X or Discord or change our DNS, cloud infrastructure, or production deployments, even if it would be the most convenient thing in the world to have an agent hooked up to every tool we might ever use.
We have been experimenting with a new way to give everyone and their agent access to “dangerous” credentials: AWS, Cloudflare, Twitter, email, databases, etc. We have it hooked up to 25 credentials. Yesterday, we used it to make sensitive DNS changes across all our websites on multiple cloud providers, which would have taken a lot of time manually and likely caused downtime. We’ve used it for cloud infrastructure cost audits and fixes, production database migration, marketing, etc. And it has completely changed how we work.
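This post does not describe how that tool works internally, so the following is only a generic sketch of one common pattern for this kind of access: a broker process holds the secrets, agents submit named action requests, and sensitive actions wait on human approval. All names here (`Broker`, `dns.update_record`, `analytics.read`, the placeholder tokens, and the `approve` callback) are hypothetical and are not the interface of our internal tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Broker:
    """Holds secrets; agents only see action results, never the secrets."""
    secrets: Dict[str, str]                          # service name -> credential
    actions: Dict[str, Callable[[str, dict], str]]   # "service.verb" -> handler
    sensitive: Set[str]                              # actions that need approval
    approve: Callable[[str, dict], bool]             # stands in for a human
    audit_log: List[str] = field(default_factory=list)

    def request(self, agent: str, action: str, params: dict) -> str:
        if action not in self.actions:
            self.audit_log.append(f"{agent} {action} unknown")
            return "rejected: unknown action"
        if action in self.sensitive and not self.approve(action, params):
            self.audit_log.append(f"{agent} {action} denied")
            return "rejected: not approved"
        service = action.split(".", 1)[0]
        # The secret is passed to the handler and never returned to the agent.
        result = self.actions[action](self.secrets[service], params)
        self.audit_log.append(f"{agent} {action} ok")
        return result


# Toy handlers (hypothetical): real ones would call the provider's API.
def update_record(token: str, params: dict) -> str:
    return f"updated {params['name']} -> {params['value']}"


def read_report(token: str, params: dict) -> str:
    return f"report for {params['site']}"


broker = Broker(
    secrets={"dns": "placeholder-dns-token", "analytics": "placeholder-token"},
    actions={"dns.update_record": update_record, "analytics.read": read_report},
    sensitive={"dns.update_record"},
    # Stand-in for a human clicking "approve": only the staging change is approved.
    approve=lambda action, params: params.get("name") == "staging.example.com",
)

r1 = broker.request("agent-1", "analytics.read", {"site": "example.com"})
r2 = broker.request("agent-1", "dns.update_record",
                    {"name": "example.com", "value": "192.0.2.10"})
r3 = broker.request("agent-1", "dns.update_record",
                    {"name": "staging.example.com", "value": "192.0.2.10"})
print(r1, r2, r3, sep="\n")
```

The design choice in this sketch is that credentials live in one place with one audit log, so nobody has to install and maintain a separate integration per tool, and read-only actions can skip approval while destructive ones cannot.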
As a founder, I find it especially useful to have 15 different credentials hooked up to one place, so I can check all my important emails, messages, tweets, tasks, and analytics, and draft responses to people across all platforms, all in one text chat. I no longer need to visit a dozen web pages and apps to do my daily tasks. Giving agents access to credentials that would otherwise be too scary to hand over unlocks a ton. Due to demand, we will launch the internal tool for this, and you can email me at nanak@holonym.id if you’d like to try it out.
The bigger picture
In addition to their efficiency gains (and fun), these tools also opened up plenty of potential next paths. E.g. we have experimented with AI-driven companies, putting all OKRs, roadmap items, and task boards in the internal documentation so that agents can effectively strategize and delegate tasks (with human approval). We now have all tasks in GitHub as issues so agents can help decide what to prioritize and even in some cases begin tasks themselves, only asking for human approval.
Now that agents can plan on behalf of our company and perform actions on our behalf, even the most sensitive ones (with human approval :)), the question is how to actually keep them in check. Human approval is now the bottleneck. AI rarely hallucinates now that the internal docs exist, but its performance is not always stellar for certain kinds of tasks. I believe the majority of tasks where AI is still ineffective are just a question of verification. We have experimented with a few new verification techniques to complete the harness loop, so that an agent can assess its own performance and iterate until it does better.
Happy Hacking
I hope you found this useful in thinking about AI-native organizations. Not only has efficiency been amazing, but this has been extremely fun. And overall the mindset we have adopted on AI tooling is “Find your biggest blockers that AI cannot automate. Then figure out how to make AI automate them.” It’s a challenge, and it typically works. People find a way.