Building an AI Code Review Tool with GPT-4

Last quarter I built an automated code review pipeline that runs GPT-4 against every pull request before a human ever looks at it. This post is the honest write-up: the architecture that worked, the prompts that didn't, and the three problems that ate 80% of the build time — none of which were the AI part.

The Problem with Manual Code Review

Every developer eventually looks at their pull request queue and cringes. Reviewing thousands of lines of code is exhausting, error-prone, and slow — and the failure mode is predictable. Humans are good at catching design problems and terrible at staying vigilant through the 40th file of a refactor. By the time a reviewer reaches the bottom of a large diff, they are pattern-matching on indentation, not reading logic.

Linters and static analyzers cover the mechanical layer, but there is a gap between "the linter passed" and "a senior engineer read this carefully." That gap — logical flaws, missed edge cases, security-sensitive changes, code that contradicts the module's existing conventions — is exactly the territory where a language model earns its keep.

Why GPT-4?

I chose GPT-4 because it uses context that linters and static analyzers ignore. It doesn't just look for syntax errors; it reads the diff the way a reviewer does: inferring intent, noticing when a function's new behavior contradicts its name, and flagging error paths that were quietly dropped. In early testing it caught an unawaited promise inside a transaction block and a SQL query that interpolated a user-controlled value, both of which had sailed past the linter.

The Architecture

The pipeline is deliberately boring. Every exotic design I sketched lost to this one:

1. Developer opens a PR. Nothing special: the trigger is the standard pull_request event.
2. GitHub Action fires. A small script grabs the git diff against the merge base, plus a limited amount of surrounding file context.
3. Chunking. Large PRs are split logically by file, never mid-function. Each chunk carries the PR title and description so the model knows the intent of the change.
4. Inference. GPT-4 reviews chunks in parallel with a structured output contract.
5. Filtering and posting. Findings below a severity threshold are dropped, duplicates are merged, and what survives gets posted as inline PR comments.
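The "in parallel" part of the inference step is worth bounding so a 60-file PR doesn't fire 60 requests at once. Here is a minimal sketch of one way to do that. The helper name `mapWithConcurrency`, the idea of a fixed limit, and the choice to turn a failed chunk into an empty findings list are all assumptions for illustration, not the pipeline's actual code.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results come back in input order. A chunk whose worker throws
// yields [] so one failed request cannot sink the whole review
// (that fallback is an assumption of this sketch).
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function runner() {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no two runners share it
      try {
        results[index] = await worker(items[index], index);
      } catch (err) {
        results[index] = [];
      }
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Usage would look something like `mapWithConcurrency(chunks, 4, (chunk) => analyzeChunk(chunk, prContext))`, where the limit of 4 is arbitrary and should be tuned to your rate limits.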

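import OpenAI from "openai";

// The client reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

// REVIEWER_PROMPT is the system prompt, defined elsewhere. JSON mode
// requires the word "JSON" to appear somewhere in the messages.
export async function analyzeChunk(chunk, prContext) {
  const response = await openai.chat.completions.create({
    // response_format is only accepted by models with JSON mode
    // (GPT-4 Turbo and later); the base "gpt-4" alias rejects it.
    model: "gpt-4-turbo",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: REVIEWER_PROMPT },
      {
        role: "user",
        content: JSON.stringify({
          intent: prContext.title,
          description: prContext.body,
          file: chunk.path,
          diff: chunk.patch,
        }),
      },
    ],
  });
  // Fall back to an empty list if the model omits the findings key.
  return JSON.parse(response.choices[0].message.content).findings ?? [];
}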

Two details in that snippet carry most of the weight. temperature: 0 because review is a precision task — creativity here just manufactures false positives. And a JSON response contract, because free-text reviews are unusable in CI: you can't threshold them, dedupe them, or attach them to a line number.
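One caveat on the JSON contract: JSON mode promises parseable JSON, not any particular shape, and a reply cut off at the token limit may not parse at all. A defensive parser along these lines can sit between the model and the rest of the pipeline. The field names (`line`, `severity`, `message`, `breaksWhen`) and the three severity levels are assumptions here, since the real schema isn't shown in this post.

```javascript
// Assumed severity vocabulary for this sketch.
const SEVERITIES = ["low", "medium", "high"];

// Parse the model's raw reply and keep only well-formed findings.
// Returns [] instead of throwing so a bad reply cannot fail the CI job.
function parseFindings(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // truncated or otherwise invalid JSON
  }
  if (parsed === null || typeof parsed !== "object") return [];
  if (!Array.isArray(parsed.findings)) return [];

  return parsed.findings.filter(
    (f) =>
      f !== null &&
      typeof f === "object" &&
      Number.isInteger(f.line) &&
      f.line > 0 &&
      SEVERITIES.includes(f.severity) &&
      typeof f.message === "string" &&
      // A finding without a concrete failure scenario is dropped.
      typeof f.breaksWhen === "string" &&
      f.breaksWhen.trim() !== ""
  );
}
```

Silently dropping malformed findings is a design choice that favors a quiet bot over a complete one; logging the rejects somewhere is probably worth it while tuning the prompt.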

Prompt Engineering Is the Hard Part

Note: If you give GPT-4 generic instructions, you get generic, useless code reviews. "You are a senior engineer, review this code" produces the review equivalent of a horoscope — plausible-sounding observations that apply to any diff ever written.

The prompt went through roughly a dozen revisions. The changes that actually moved quality:

Give it a taxonomy, not a vibe. The final prompt enumerates exactly what to look for — correctness bugs, security issues, dropped error handling, concurrency hazards, contract violations — and explicitly forbids style commentary. The moment I banned opinions on naming and formatting, signal jumped.

Force a severity and a failure scenario. Every finding must include a concrete "this breaks when…" description. If the model can't articulate the failing input, the finding is almost always noise, and the schema requirement makes it self-filter.

Show it two examples. One real bug with the exact output format, one benign diff with an empty findings array. The second example matters more — it teaches the model that "no findings" is an acceptable answer, which is the single best defense against invented problems.
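Put together, those three changes might look roughly like the sketch below. To be clear, this is not the production prompt: the wording, the category names, the field names, and both example diffs are made up for illustration, and `REVIEWER_PROMPT_SKETCH`, `FEW_SHOT`, and `buildMessages` are hypothetical names.

```javascript
// Illustrative taxonomy; the category names are assumptions.
const CATEGORIES = [
  "correctness",
  "security",
  "dropped-error-handling",
  "concurrency",
  "contract-violation",
];

// Hypothetical system prompt: explicit taxonomy, no style commentary,
// and a required failure scenario for every finding.
const REVIEWER_PROMPT_SKETCH = [
  'Review the diff and reply with a JSON object shaped {"findings": [...]}.',
  `Report only these categories: ${CATEGORIES.join(", ")}.`,
  "Never comment on naming, formatting, or style.",
  "Each finding has: line, category, severity (low|medium|high), message, breaksWhen.",
  "breaksWhen must name a concrete input or event sequence that fails.",
  'If nothing qualifies, reply with {"findings": []}.',
].join("\n");

// Two made-up few-shot examples: one real bug, one benign diff.
const FEW_SHOT = [
  {
    role: "user",
    content: JSON.stringify({
      file: "src/users.js",
      diff: '+ const sql = "SELECT * FROM users WHERE id = " + req.query.id;',
    }),
  },
  {
    role: "assistant",
    content: JSON.stringify({
      findings: [
        {
          line: 1,
          category: "security",
          severity: "high",
          message: "User input is concatenated into a SQL string.",
          breaksWhen: "req.query.id is '1 OR 1=1'",
        },
      ],
    }),
  },
  {
    role: "user",
    content: JSON.stringify({
      file: "src/config.js",
      diff: "- const retries = 3;\n+ const retries = 5;",
    }),
  },
  // The empty array teaches that "no findings" is a valid answer.
  { role: "assistant", content: JSON.stringify({ findings: [] }) },
];

// Assemble the full message list for one chunk.
function buildMessages(payload) {
  return [
    { role: "system", content: REVIEWER_PROMPT_SKETCH },
    ...FEW_SHOT,
    { role: "user", content: JSON.stringify(payload) },
  ];
}
```

Passing the examples as prior user/assistant turns, rather than pasting them into the system prompt, is one option among several; either way the benign example should use exactly the same output format as the buggy one.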

The Three Problems Nobody Warns You About

False positive fatigue. The first version commented on everything, and within a week developers were scrolling past the bot the way you scroll past a flaky test. The fix was brutal thresholding: only high-severity findings post by default. A review bot gets one chance at a first impression, and trust, once spent, doesn't come back with a config change.
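A minimal sketch of that gate, combining the severity threshold with duplicate merging. The finding fields (`file`, `line`, `category`, `severity`) and the choice of `file:line:category` as the duplicate key are assumptions for illustration; the real dedupe rule may be fuzzier.

```javascript
const SEVERITY_RANK = { low: 0, medium: 1, high: 2 };

// Keep findings at or above `minSeverity`, collapsing duplicates that
// share a file, line, and category. When duplicates differ in severity,
// the higher-severity one wins.
function selectFindingsToPost(findings, minSeverity = "high") {
  const threshold = SEVERITY_RANK[minSeverity];
  if (threshold === undefined) {
    throw new Error(`Unknown severity: ${minSeverity}`);
  }

  const byKey = new Map();
  for (const finding of findings) {
    const rank = SEVERITY_RANK[finding.severity];
    // Unknown severities are treated as noise and dropped.
    if (rank === undefined || rank < threshold) continue;

    const key = `${finding.file}:${finding.line}:${finding.category}`;
    const existing = byKey.get(key);
    if (!existing || rank > SEVERITY_RANK[existing.severity]) {
      byKey.set(key, finding);
    }
  }
  return [...byKey.values()];
}
```

Defaulting to `"high"` mirrors the "only high-severity findings post by default" rule; lowering it per repository would be a one-argument change.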

Chunking large PRs. Naive token-based splitting cuts functions in half and destroys the model's ability to reason. Splitting by file, keeping whole hunks intact, and skipping generated files (lockfiles, snapshots, migrations) fixed most of it. PRs that still exceed the budget get a summary review with an explicit "this PR is too large for line-level review" note — which, quietly, is also social pressure toward smaller PRs.
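Here is a rough sketch of file-level splitting with a generated-file skip list and a size budget. The patterns, the function names, and the character-count budget (a crude stand-in for a real token count) are all assumptions for illustration.

```javascript
// Example skip list; adjust to whatever your repo generates.
const GENERATED_PATTERNS = [
  /(^|\/)package-lock\.json$/,
  /(^|\/)yarn\.lock$/,
  /\.snap$/,
  /(^|\/)migrations\//,
];

// Split `git diff` output into one chunk per file. Splitting only at
// "diff --git" headers keeps every hunk of a file together.
function splitDiffByFile(diffText) {
  const chunks = [];
  for (const part of diffText.split(/^(?=diff --git )/m)) {
    const header = part.match(/^diff --git a\/(.+?) b\/(.+)$/m);
    if (!header) continue; // text before the first header, if any
    chunks.push({ path: header[2], patch: part });
  }
  return chunks;
}

// Drop generated files, then decide between line-level and summary review.
function planReview(diffText, maxChars = 60000) {
  const chunks = splitDiffByFile(diffText).filter(
    (chunk) => !GENERATED_PATTERNS.some((re) => re.test(chunk.path))
  );
  const total = chunks.reduce((sum, chunk) => sum + chunk.patch.length, 0);
  return total > maxChars
    ? { mode: "summary", files: chunks.map((chunk) => chunk.path), chunks: [] }
    : { mode: "line-level", files: chunks.map((chunk) => chunk.path), chunks };
}
```

The header regex is deliberately simple and will misread paths that contain the literal text " b/"; a diff-parsing library is the safer choice if that matters for your repo.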

Cost control. Reviewing every push to every PR adds up fast. Three changes cut spend by roughly 70%: only re-reviewing changed files on new commits, caching identical chunk results by content hash, and skipping draft PRs entirely.
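The content-hash cache is the easiest of the three to sketch. The version below is a simplified illustration, not the real implementation: `chunkKey` and `withCache` are hypothetical names, folding a prompt version into the key is my assumption (so a prompt change invalidates old results), and the in-memory `Map` would need to be replaced by something persistent to survive between CI runs.

```javascript
const { createHash } = require("node:crypto");

// Stable key for a chunk: same prompt version, path, and patch text
// always hash to the same value.
function chunkKey(chunk, promptVersion) {
  return createHash("sha256")
    .update(promptVersion)
    .update("\0") // separator so adjacent fields cannot run together
    .update(chunk.path)
    .update("\0")
    .update(chunk.patch)
    .digest("hex");
}

// Wrap a review function so identical chunks are only reviewed once.
// `cache` is any object with Map-style has/get/set methods.
function withCache(reviewChunk, cache = new Map(), promptVersion = "v1") {
  return async function cachedReview(chunk) {
    const key = chunkKey(chunk, promptVersion);
    if (cache.has(key)) return cache.get(key);
    const findings = await reviewChunk(chunk);
    cache.set(key, findings);
    return findings;
  };
}
```

Because an unchanged file hashes to the same key on the next commit, this also covers much of the "only re-review changed files" case without separate bookkeeping.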

What It Catches — and What It Doesn't

After a few months of real usage, a clear pattern: the bot is excellent at local reasoning — null paths, missing awaits, swallowed exceptions, off-by-one boundaries, injection-shaped string building. It is mediocre at global reasoning — architectural drift, duplicated concepts across services, "this whole approach is wrong." Human reviewers remain load-bearing for the second category, and the honest framing for the team was never "AI review," but "the human reviewers start where the bot stops."

The pipeline pattern here — structured outputs, parallel inference, aggressive filtering — is the same one that shows up in building autonomous agents with LangChain; code review just happens to be the version with the clearest ROI.

Where I'd Start If Doing It Again

Start with the filtering layer, not the prompt. The prompt gets you from nothing to a working demo in an afternoon; the trust machinery — severity gates, dedupe, knowing when to say nothing — is what gets a bot from demo to something engineers actually read. Build that first, and the rest is plumbing.
