AI Traffic Broke GitHub Twice in One Week
The Week GitHub Could Not Keep Up
Two incidents. Four days apart. Neither caused by a misconfigured deployment, a supply chain compromise, or a cloud provider failure. Both caused by an architecture that was engineered for the pace of human developers, now running under the full weight of machine-generated traffic.
On April 23rd, 2026, 658 repositories quietly had code reverted. Developers opened their repos and found commits gone. GitHub's own monitoring did not alert. Customers filed support tickets before any internal alarm fired. Four days later, on April 27th, GitHub search went dark platform-wide. Pull request lists returned empty. Issues disappeared. Project boards showed nothing. Git itself was healthy the entire time; the git layer was never the problem. But because search threads through so many surface areas of the GitHub UI, the platform looked completely broken from the outside.
Understanding why both incidents happened within the same week requires understanding what AI tooling has done to the traffic profile hitting GitHub's infrastructure every second.
What AI Agents Did to GitHub's Traffic
GitHub processed 90 million pull requests in a single month in early 2026. 1.4 billion commits. 20 million new repositories. GitHub engineering leader Solomon Neas described this growth as fundamentally unlike the demand patterns the platform was originally designed to absorb.

The underlying dynamic is straightforward once you see it. When a human developer commits code, they follow a recognizable rhythm: think, write, review, iterate, merge. Cycle times measured in hours or days, with natural pauses built into the process. An AI coding agent has no such rhythm. It does not wait for review feedback before opening the next task. It does not batch work to avoid write contention. It does not sleep. A single AI coding pipeline can generate more concurrent merge queue entries in an hour than a medium-sized engineering team produces in a week.
This is what "machine-pace development" means in practice. The concurrency profile of inbound writes scales at a rate that human-pace systems were never stress-tested against. GitHub's merge queue, search indexing pipeline, notification fanout system, and branch protection enforcement were all designed and load-tested around the assumption that commits arrived at human pace. AI agents broke that assumption, and the architecture had no defensive layer to catch it.
The 90 million PRs per month figure is not just a large number. It represents a step change in the shape of write operations hitting GitHub's data layer. Human-generated PRs tend to cluster around business hours, show natural batching around sprint cycles, and have predictable review-then-merge patterns. AI agent PRs do none of that. They arrive uniformly, at all hours, with minimal clustering, and with merge operations following immediately after automated checks pass. Every architectural weakness in GitHub's concurrency handling was now being exercised continuously.
Incident 1: 658 Repositories, Zero Alerts
The April 23rd incident was caused by a merge queue bug in how GitHub handled the squash merge strategy when multiple PRs were queued simultaneously.
When two or more pull requests sat in a merge queue together and at least one of them used the squash merge strategy, the system could silently revert commits from earlier entries in that queue. No error was surfaced to the user. No failed status check appeared. No notification went to the repository owner. The affected commits simply disappeared from the branch as if they had never been pushed.
The final count: 658 repositories affected, 2,092 pull requests touched.
The most operationally significant detail was not the scale; it was the silence. GitHub's internal monitoring did not detect the issue. The incident was identified through customer support reports. Developers noticed missing code and filed tickets before any internal telemetry produced an alert.
Silent data corruption is the most dangerous failure mode in distributed systems. A loud failure (a 500-series error, a failed health check) gives you something to page on. Silent corruption accumulates undetected across every affected repository in parallel. By the time a customer notices and files a report, the blast radius has been growing for however long the bug has been running.
The underlying mechanism was a race condition in the merge queue's state machine. Under low concurrency, the queue processed operations correctly. Under the concurrency pressure that AI agent pipelines generate, it hit an edge case that had never been exercised in testing. Two concurrent squash merge operations could produce an inconsistent intermediate state, where the earlier commit was dereferenced from the branch head without being explicitly reverted. The result was functionally identical to a revert from the end user's perspective, but left no trace in the commit history.
This is the class of bug that testing tends to miss precisely because it only manifests above a concurrency threshold. At human-pace commit rates, two PRs being squash-merged simultaneously in the same queue is uncommon. At machine-pace rates, it is the default state.
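To make the mechanism concrete, here is a minimal Go sketch of the general failure pattern: a read-modify-write on the branch head with no compare-and-swap guard. The names here (refStore, CompareAndSwap) are mine for illustration, not GitHub's internals; the point is the shape of the bug and the shape of the fix, which mirrors git update-ref with an expected old value.

```go
// Minimal sketch of the read-modify-write race that a compare-and-swap
// guard prevents. Illustrative only; not GitHub's actual merge queue code.
package mergequeue

import (
	"errors"
	"sync"
)

var ErrStaleRef = errors.New("branch moved since it was read; retry the merge")

// refStore maps branch names to the commit SHA they point at.
type refStore struct {
	mu   sync.Mutex
	refs map[string]string
}

// UnsafeUpdate is the racy pattern: read the head, build a squashed commit
// on top of what was read, then write the new head unconditionally. If two
// merges interleave, the second write silently discards the first one.
func (s *refStore) UnsafeUpdate(branch, newSHA string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.refs[branch] = newSHA // last writer wins; the earlier merge vanishes
}

// CompareAndSwap only advances the branch if it still points at the head the
// caller based its squashed commit on. A concurrent merge now forces a
// visible retry instead of a silent revert.
func (s *refStore) CompareAndSwap(branch, expectedOldSHA, newSHA string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.refs[branch] != expectedOldSHA {
		return ErrStaleRef
	}
	s.refs[branch] = newSHA
	return nil
}
```

With the guard in place, two concurrent squash merges produce a loud, retryable conflict rather than a quietly dropped commit.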
Incident 2: Search Goes Dark
Four days later, GitHub's Elasticsearch cluster became overloaded. GitHub's post-incident analysis pointed to botnet-like traffic patterns as a likely contributing factor, though the exact source was not conclusively identified.
The technical failure itself was narrowly scoped: the search index cluster became saturated and stopped returning results. The blast radius was anything but.
Pull request lists, issue trackers, and project boards all depend on GitHub's search infrastructure in ways that are not visible to most users. When the search layer returned empty results, the UI rendered empty PR lists. Users saw blank issue trackers and concluded their data was gone. This was not what happened; every git repository was intact and fully accessible throughout the incident. git clone, git push, and git pull all worked normally. The problem existed entirely in the search and rendering layer.
But the user experience was indistinguishable from a total data loss event. Engineering teams reported being unable to conduct code reviews. On-call engineers spent significant time in the first hour of the incident determining whether their repositories had actually been deleted before tracing the failure to the search layer. The confusion itself generated support load and compounded the incident response timeline.
This is a textbook blast radius problem. Search is conceptually a separate concern from core version control operations. But at the infrastructure level, the two were not isolated. No circuit breaker existed between "search is unavailable" and "PR list renders empty." The Elasticsearch failure propagated directly into the user-facing surface of every feature that used search as a data source, which turned out to be most of the product interface.
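What the missing boundary could look like is easier to see in code. The sketch below is a hedged Go example, not GitHub's internal API: the types SearchClient, PRCache, and PRListResult are assumptions for illustration. The idea is that the listing path distinguishes "no results" from "search is down," and falls back to stale cached data with an explicit degraded flag instead of rendering an empty state.

```go
// Illustrative sketch of a search-backed listing that degrades to stale
// cached results instead of rendering an empty state when search fails.
package listing

import (
	"context"
	"time"
)

type SearchClient interface {
	ListPRs(ctx context.Context, repo string) ([]string, error)
}

type PRCache interface {
	Get(repo string) (prs []string, age time.Duration, ok bool)
}

// PRListResult distinguishes "empty" from "unknown because search is down".
type PRListResult struct {
	PRs      []string
	Degraded bool // true: render a "search unavailable, results may be stale" banner
}

func ListPRs(ctx context.Context, search SearchClient, cache PRCache, repo string) PRListResult {
	ctx, cancel := context.WithTimeout(ctx, 800*time.Millisecond)
	defer cancel()

	prs, err := search.ListPRs(ctx, repo)
	if err == nil {
		return PRListResult{PRs: prs}
	}
	// Search failed: fall back to the last known good answer rather than
	// returning an empty list that looks like data loss.
	if cached, _, ok := cache.Get(repo); ok {
		return PRListResult{PRs: cached, Degraded: true}
	}
	return PRListResult{Degraded: true}
}
```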
Why One PR Touches Everything
Both incidents are expressions of the same architectural reality: GitHub's core infrastructure was built as a Rails monolith, and that monolith has been extended, scaled, and patched over nearly two decades without fully decoupling its internal dependencies.
When a single pull request is created, reviewed, or merged, the following systems are all involved in some form: git object storage, branch protection rule evaluation, GitHub Actions trigger dispatch, search index updates, notification fanout to subscribers, webhook delivery to external integrations, API response caching, and database writes across multiple tables. Under human-pace commit rates, this coupling is manageable. Operations queue behind each other, consistency is achievable because concurrency stays low, and the window for any given race condition to manifest is narrow.
Under AI-agent-pace commit rates, every one of these systems receives concurrent load simultaneously. The merge queue race condition from April 23rd had, in theory, existed for a long time at human-pace commit rates. It never manifested because human developers never exercised the concurrent path at sufficient scale. AI coding agents found it within months of becoming widely used.
The Elasticsearch failure was a different mechanism but the same root problem: a subsystem that was not isolated from the rest of the platform's user-facing surface area. No one had ever implemented a hard requirement that a search failure must produce a graceful degradation rather than a cascading empty state across the entire product interface. That requirement was never written because the scenario had never happened at sufficient severity to force it.
The real cost of architectural coupling is not visible until something fails. At that point, the cost is paid all at once.

GitHub's 30x Rebuild Roadmap
GitHub had already started addressing this before the April incidents hit. In October 2025, the engineering team announced a 10x infrastructure capacity plan in response to the steady rise in AI-generated traffic. The April 2026 incidents proved that 10x was not enough. The target was revised upward to 30x after internal modeling showed that AI-generated traffic was still growing faster than the original estimate accounted for.
The roadmap has four structural pillars.
Service isolation is the highest priority. Every major subsystem needs its own health boundary, its own circuit breaker, and its own fallback behavior. Search going down should surface as "search temporarily unavailable" with stale cached results visible, not as blank PR lists. A notification delivery failure should not affect merge queue processing. Each subsystem failure should be a scoped, visible, recoverable event rather than a silent cascade. GitHub has identified at least twelve internal service boundaries where current coupling creates unnecessary blast radius.
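A circuit breaker at a subsystem boundary does not need to be elaborate. The Go sketch below shows the basic mechanics under my own simplifying assumptions; it is not GitHub's implementation, and production systems usually reach for an existing library such as sony/gobreaker, but the shape is the same: stop calling a dependency that is already failing, and make the caller take its fallback path.

```go
// Minimal circuit breaker sketch for a subsystem boundary. Illustrative only.
package isolation

import (
	"errors"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open: subsystem marked unhealthy, use fallback")

type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	openedAt    time.Time
	cooldown    time.Duration
}

func New(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open, in which case the caller must take
// its fallback path instead of hammering an already-saturated dependency.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // open (or re-open after a failed probe)
		}
		return err
	}
	b.failures = 0 // a success closes the circuit again
	return nil
}
```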
Go migration addresses the concurrency ceiling of the Ruby monolith. Go's goroutine model handles high-throughput parallel workloads significantly better than Ruby's threading model, particularly for the merge queue and search indexing pipeline, the two hottest paths under AI agent traffic. GitHub is migrating these critical components to Go microservices. This is not a full application rewrite; it is a targeted extraction of the systems where Ruby's concurrency characteristics produce the most risk under load. The migration is estimated to take 18 to 24 months for the core components.
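The appeal is easy to see in miniature. A bounded worker pool over goroutines and channels, like the sketch below, puts an explicit ceiling on concurrent index writes instead of letting inbound traffic set it. This is not GitHub's code, just an illustration of the pattern Go makes cheap.

```go
// Sketch of a bounded worker pool draining a high-volume indexing queue.
package indexer

import "sync"

type Job struct{ RepoID, CommitSHA string }

// Run fans work out to `workers` goroutines, capping concurrent index writes.
func Run(jobs <-chan Job, workers int, index func(Job) error, onErr func(Job, error)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				if err := index(j); err != nil {
					onErr(j, err) // surface failures; never drop them silently
				}
			}
		}()
	}
	wg.Wait()
}
```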
Multi-cloud deployment reduces single-provider failure risk. Both April incidents had contributing components tied to single infrastructure dependencies. Moving critical systems to multi-cloud redundancy means that a regional capacity exhaustion or availability zone failure at one provider does not cascade into a platform-wide event.
Data-layer observability directly addresses the failure mode from April 23rd. Monitoring that only checks service health metrics misses silent data corruption entirely. GitHub is adding integrity checks one level below the service health layer: merge state consistency verification, commit graph validation after queue operations, and branch protection audit trails with cryptographic integrity. The goal is to catch corruption before a customer does, not after.
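One way to picture that kind of check, as a hedged Go sketch: after the queue advances a branch, verify that every squash commit it reports having created is still reachable from the new head, and page if any is not. The isAncestor parameter here is an assumed wrapper over something like git merge-base --is-ancestor; none of this is GitHub's actual tooling.

```go
// Hedged sketch of a post-merge integrity check on the commit graph.
package integrity

import "fmt"

// VerifyQueueMerge confirms that every commit the queue claims to have added
// is still reachable from the branch head after the operation completed.
func VerifyQueueMerge(branchHead string, queueCommits []string,
	isAncestor func(ancestor, descendant string) (bool, error)) error {
	for _, sha := range queueCommits {
		ok, err := isAncestor(sha, branchHead)
		if err != nil {
			return fmt.Errorf("integrity check failed for %s: %w", sha, err)
		}
		if !ok {
			// This is the April 23rd failure mode: the merge reported success
			// but the commit is no longer on the branch. Alert before a
			// customer notices.
			return fmt.Errorf("commit %s missing from %s after queue merge", sha, branchHead)
		}
	}
	return nil
}
```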
GitHub has not committed to a public completion date for the full roadmap. The service isolation work is being delivered on an accelerated timeline relative to the Go migration, since it addresses blast radius risk without requiring the full architecture overhaul. The 30x capacity target is expected to be provisionally achieved through a combination of Go migration on critical paths plus infrastructure horizontal scaling, with the service isolation work making the system failure-safe rather than just faster.
What Your Team Should Take From This
Most engineering teams will never operate at GitHub's scale. But the failure patterns from these two incidents are not scale-specific. They appear at much lower traffic levels the moment non-human clients start hitting your write paths.
Atomic feature flag rollouts eliminate partial-state bugs. The merge queue race condition was a partial-state problem: under the original code, a queue containing mixed merge strategies could produce inconsistent outcomes depending on operation ordering. Atomic feature flags, where a behavior is either completely on or completely off for a given operation context, eliminate the class of bug where old and new code paths interact against the same piece of state. Any time you are rolling out new merge, write, or queue behavior, verify there is no path through your system where old and new logic can execute against the same data simultaneously.
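A minimal sketch of what "atomic for a given operation context" can look like in Go: the flag is resolved once when the operation starts and pinned in its context, so a mid-operation flag flip cannot mix old and new code paths against the same branch state. The flagEnabled parameter is a stand-in for whatever flag system you already run.

```go
// Sketch of pinning a rollout decision for the lifetime of one operation.
package rollout

import "context"

type ctxKey struct{}

// Begin resolves the flag once and pins the decision in the operation's context.
func Begin(ctx context.Context, flagEnabled func(repoID string) bool, repoID string) context.Context {
	return context.WithValue(ctx, ctxKey{}, flagEnabled(repoID))
}

// UseNewMergePath reads the pinned decision. Every step of the operation
// consults this, never the live flag, so the path cannot switch midway.
func UseNewMergePath(ctx context.Context) bool {
	v, _ := ctx.Value(ctxKey{}).(bool)
	return v
}
```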
Isolation is an architectural requirement, not a monitoring task. The Elasticsearch incident produced a platform-wide user-facing failure because nobody had implemented a hard architectural boundary between the search layer and the PR list rendering layer. Monitoring would have detected the Elasticsearch failure; it cannot prevent blast radius. Only explicit architectural isolation does that. This means designing each subsystem with a defined failure behavior: what does this subsystem return when it is down? The answer should never be "the calling service renders empty state." It should be a stale cache return, a degraded-but-functional fallback, or an explicit unavailability indicator that does not corrupt the user's understanding of their data.
Design for machine-pace traffic before your first incident. This is the hardest recommendation because it requires spending engineering capacity on a problem that has not happened yet. GitHub's architecture was sensible for the traffic it was built to handle. The problem was that the traffic profile changed before the architecture did. If AI agents, automation pipelines, or any non-human client can reach your write paths today, the race conditions that AI traffic finds are already in your codebase. Load-test your concurrency assumptions now, identify where your state machines have untested concurrent paths, and determine what your blast radius looks like if any given subsystem saturates. GitHub paid for that knowledge on April 23rd and April 27th. You do not have to.
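A cheap starting point is a probe test that hammers one write path from many goroutines and asserts that nothing was silently dropped, run under Go's race detector (go test -race). The sketch below is deliberately generic: enqueueMerge and listBranchCommits are placeholders for whatever entry points your own system exposes, wired in by the caller.

```go
// Generic concurrency probe for a write path. Wire in your own system's
// enqueue and listing functions, then run with `go test -race`.
package loadtest

import (
	"fmt"
	"sync"
	"testing"
)

func RunConcurrencyProbe(t *testing.T, concurrent int,
	enqueueMerge func(branch, prID string) error,
	listBranchCommits func(branch string) []string) {

	var wg sync.WaitGroup
	for i := 0; i < concurrent; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			if err := enqueueMerge("main", fmt.Sprintf("pr-%d", n)); err != nil {
				t.Errorf("merge %d failed: %v", n, err) // t.Errorf is safe from goroutines
			}
		}(i)
	}
	wg.Wait()

	merged := listBranchCommits("main")
	if len(merged) != concurrent {
		t.Fatalf("expected %d merged PRs, found %d: something was silently dropped",
			concurrent, len(merged))
	}
}
```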
The 30x rebuild GitHub has committed to is not a response to an unusual failure. It is the cost of not having designed for machine-pace traffic before shipping to it. The architectural debt was always there. AI agents were just the first clients fast enough to find it.
For more on how AI agents interact with existing developer infrastructure, see my post on building AI code review systems.
References
- GitHub Engineering Blog: Merge Queue Incident April 2026
- The Register: GitHub Search Outage Takes Down Pull Requests and Issues
- GitHub Plans 30x Infrastructure Rebuild to Handle AI-Driven Traffic Growth
I use AI tools to help research and draft posts. The ideas, opinions, and takes are mine. Verify anything technical or time-sensitive before acting on it.