Bhavik Mehta

The AI Agent Gold Rush: Miners and Shovel Sellers

2026-04-26 · 7 min read
#AI Agents · #LLMs · #Software Engineering · #AI Hype

Human hand reaching toward a robot hand — the AI handshake everyone imagines, rarely what ships

In 2024, someone at a hackathon built an agent that books their calendar. By 2025, there were 47 startups doing exactly that. In 2026, one of them raised $4M.

The "Automation" store is empty. Everyone is in the "AI Agent" store. Nobody stops to ask whether those two stores were selling different things.


What Everyone Thinks an AI Agent Is

The pitch is always the same. An autonomous system that reads your email, manages your calendar, deploys your code, files your taxes, negotiates your rent, and—somewhere in slide 7 of the deck—replaces three of your employees.

The demos look incredible. Devin, Cognition's "AI software engineer," launched in March 2024 and became the most hyped thing on developer Twitter within 48 hours. Cognition raised $175 million at a $2 billion valuation within six months of founding. By 2026, they were reportedly in talks at a $25 billion valuation.

The fantasy version of an agent is autonomous, self-healing, and runs on vibes and API calls. OpenAI's Operator promised to browse the web, fill out forms, and book restaurant reservations. The demo looked like science fiction. The product, as always, is more complicated.


What Most AI Agents Actually Are

Here's the implementation under the hood: a for-loop, a try/catch, and a GPT-4 call in the middle. Maybe a retry. Possibly a timeout. Ship it. Call it an agent. Write the blog post.
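To make the point concrete, here is a deliberately minimal sketch of that loop. The `llm_call` function and the `tools` registry are placeholders standing in for whatever model API and tool set a real system would use; this is a caricature of the pattern, not a recommendation.

```python
def run_agent(task: str, tools: dict, llm_call, max_steps: int = 10):
    """The typical 'agent' under the hood: a for-loop, a try/except,
    and an LLM call in the middle. llm_call is assumed to return a
    dict like {"tool": ..., "args": ...} or {"done": True, "result": ...}."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):              # the for-loop
        try:
            action = llm_call(history)      # the LLM call in the middle
        except Exception:
            continue                        # the retry, more or less
        if action.get("done"):
            return action.get("result")
        tool = tools.get(action.get("tool"))
        if tool is None:                    # hallucinated tool call
            history.append({"role": "system", "content": "unknown tool"})
            continue
        result = tool(**action.get("args", {}))
        history.append({"role": "tool", "content": str(result)})
    return None                             # gave up: the part the demo omits
```

Everything hard about agents lives outside this function: what happens when the model confidently picks the wrong tool, loops forever, or returns "done" without doing the work.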

Devin's independent task resolution rate on real-world software issues—measured on the SWE-bench benchmark—is 13.86%. Impressive relative to prior baselines. Less impressive if you were expecting the autonomous engineer from the launch video.

OpenAI Operator's earliest evaluation showed a 1% success rate on real-world task suites. Not 10%. Not 50%. One percent. The company's own system card acknowledged it "did not reach human-level accuracy." The demos don't show you the 99 failed runs before the one that made the video.

AutoGPT peaked in April 2023 with 175,000 GitHub stars—one of the fastest climbs in the platform's history. The engineers who actually built on it discovered the same thing everyone discovers: LLMs hallucinate tool calls, get stuck in loops, and confidently take wrong actions without flagging anything. The repo's star count stayed. The production deployments didn't.


The Hype Lifecycle (The Real One)

It goes like this, every time.

Week 1: "I built an agent in two hours, here's the demo thread." 800 retweets. Three recruiters in the DMs.

Week 2: "Why does it keep deleting the wrong files?" Zero retweets. One reply from someone who had the same problem six months ago.

Week 3: The repo's last commit is "fix: added better error handling." It was three months ago. The author is now building an MCP server.

The AI agent market grew from $5.25 billion in 2024 to $7.84 billion in 2025. A significant share of that is shovel-selling: frameworks, orchestration layers, "agentic RAG" wrappers, and evaluation tools for agents that don't work well enough to need evaluating yet. The Klondike comparison isn't a metaphor. It's a business model.


When You Actually Should Build an Agent

Here's the part that matters. Some agents work extremely well. The pattern they share is narrow, boring, and unsexy compared to the pitch deck.

Cursor hit $2 billion in annual recurring revenue by February 2026. Over 1 million daily users, 50,000 businesses. It did not get there by building the most ambitious agent possible. It got there by building a very good code-editing assistant with a tight feedback loop, a clear success state (does the code work?), and a human in the loop the entire time.

Harvey AI runs across 100,000 lawyers at 1,300 law firms. $190 million in ARR. Its agents draft, review, and analyze legal documents—read contract, flag clauses, summarize holdings. The reason Harvey works is that every output goes back to a human before it leaves the building. Harvey isn't replacing lawyers. It's automating the parts of legal work that lawyers didn't want to do anyway.

The common thread across every production agent that actually works: narrow input format, verifiable output, a domain where errors surface before they cause harm, and a human review step before anything consequential happens.
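That human-review step can be sketched in a few lines. The names here (`dispatch`, `human_approves`) are hypothetical, standing in for whatever review queue or approval UI a real system would use:

```python
from typing import Callable, Optional

def dispatch(draft: str, consequential: bool,
             human_approves: Callable[[str], bool]) -> Optional[str]:
    """Hypothetical gate: agent output that costs money or sends a
    message never ships without a human approval step."""
    if consequential and not human_approves(draft):
        return None      # rejected drafts never leave the building
    return draft         # low-stakes output can pass straight through
```

The design choice is the asymmetry: the agent is free to draft anything, but the set of actions it can take unilaterally is kept deliberately small.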


The Checklist Nobody Runs Before Building

Before you write a single line of agent code, answer these:

  1. Does this task have a clear success/failure state? If you can't write a test for what "done" looks like, the agent will never know when to stop—so it won't.

  2. Is the input format reliable and structured? Agents collapse on messy input. If the data is inconsistent, the agent produces inconsistent outputs. Garbage in, confident garbage out.

  3. Are the tool calls idempotent? If the agent calls a function twice by accident, does something break? If yes, that's a failure mode you need to design around before you ship.

  4. What does silent failure look like? Not "the agent throws an error"—that's easy to handle. What happens when it does the wrong thing and returns success? Model that case explicitly.

  5. Are you okay if it's wrong 15% of the time? Most production LLM pipelines hover between 85–95% accuracy on well-defined tasks. On multi-step agentic workflows, that compounds with each hop. Five sequential steps at 90% reliability is a 59% end-to-end success rate.

  6. Does a human see the output before it costs money or sends a message? If no, you need a very good reason why. Most teams don't have that reason. They just haven't thought it through yet.

  7. Could a simpler pipeline—an API call, a cron job, a form—solve 80% of this? If yes, build that first. Earn the complexity of an agent by exhausting the simple options.
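The compounding math in item 5 is worth working through once, because it is the single most underestimated number in agent design. Assuming the steps fail independently, per-step reliability multiplies:

```python
def end_to_end_success(step_reliability: float, steps: int) -> float:
    """Independent sequential steps: reliability compounds multiplicatively."""
    return step_reliability ** steps

# Five steps at 90% each leaves roughly a 59% chance the whole
# workflow succeeds: 0.9 ** 5 = 0.59049.
print(round(end_to_end_success(0.90, 5), 2))  # 0.59
```

Run it the other way and the bar gets brutal: to hit 95% end-to-end over five steps, each step needs to be right about 99% of the time.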

This checklist is the thing that separates engineers who build things that run in production from engineers who build things that run in demos.


The Real Flex

The engineers who'll win the next five years aren't the ones who built the most agents. They're the ones who knew when not to.

The instinct to reach for an agent has become the default. That's the problem. "I'll just have an agent handle it" is the new "I'll just use a spreadsheet"—sometimes right, often lazier than the situation deserves, occasionally catastrophic.

The engineers who are genuinely dangerous right now can look at a problem, recognize it as a three-line script masquerading as an agent use case, and write the three lines. Then, when a problem actually warrants an agent—narrow task, reliable tools, human in the loop, verifiable output—they build it right the first time.

The gold rush metaphor holds all the way down. The miners who made money in 1849 weren't the ones who showed up first. They were the ones who could tell gold from fool's gold before they started digging.

Most of the ground right now is fool's gold. That's not pessimism. That's signal.


Disclaimer: This blog post was researched, written, and published with the assistance of AI. The content reflects general information on the topic and does not represent the personal opinions, beliefs, professional advice, or endorsements of Bhavik Mehta. Nothing in this post should be construed as legal, financial, technical, or professional advice. Readers should independently verify any information before acting on it.