The PR review skill that fights agent assumptions

Every time I asked Claude to self-review a PR, it told me everything looked good. Solid code, tests covered, no concerns.
GitHub Copilot found issues. Human reviewers found issues. Claude, the tool that had written the code in the first place, found nothing.
That's not a coincidence. It's a structural problem.
When an agent writes code, reviews it, and has full context of what the code was supposed to do, it isn't reviewing — it's confirming. It knows the intent. It checks that the implementation matches the intent it already holds, finds that it does, and calls it clean. The gap between "what I meant to write" and "what I actually wrote" is exactly where bugs live, and it's exactly what this kind of review can't see.
A pull request has three distinct information sources: the linked ticket, the PR description, and the diff. If an agent reads any of those before reading the code — which is the natural order — it already knows what the code is supposed to do before it reads what the code actually does. From that point it's not analyzing. It's pattern-matching against a story it's already bought into.
This is not a model quality problem. A better model confirms its own assumptions more confidently.
The Anchoring Problem
I kept asking Claude why it was missing such obvious mistakes. And it would admit it. It had hallucinated a fix, or assumed something based on the PR description, or filled in a gap with what it thought I intended. Honest after the fact. Just not catching things before.
So I pushed further. I asked Claude to analyze its own reasoning, to trace back where the wrong assumptions were coming from. The answer was consistent: it was reading the PR description early, building a model of what the code was supposed to do, and then evaluating the diff against that model instead of reading the code on its own terms.
This is anchoring. Once you hold a prior belief, every ambiguous piece of evidence gets interpreted in its favor. The PR description is written by the person who wrote the code. It describes intent, not outcome. An agent that reads it first doesn't review the diff so much as confirm it, resolving every ambiguous line charitably against what the description implies was intended.
The same principle extends to existing review comments. If another reviewer left a note before you run your analysis, that signal bleeds in. The agent shapes its findings around what's already been said rather than forming an independent read.
The instinct is to fix this with prompting. "Ignore the PR description while you analyze the code." It doesn't work. The description is already in the context window. The agent has already read it.
The fix is sequencing. We worked through the ideal processing order together, asking Claude to analyze where its own reasoning was breaking down. Treat the code changes as the only source of truth first. Read the diff, infer what actually changed and what functionality was added or modified, evaluate those changes on their own merits against best practices and established patterns in the codebase. Only then compare that independent read against the PR description, and then against the ticket.
When all three agree, great. When they disagree, you've found something worth flagging. Read them together upfront and the disagreements disappear into interpretation.
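That sequencing can be sketched as a tiny comparison step. Everything below is a hypothetical illustration with invented names (`IndependentRead`, `compareSources`), not the skill's actual implementation — the point is only that the diff-derived read exists first, and the description and ticket enter purely as comparison targets afterwards:

```typescript
// Sketch: the agent's independent read of the diff is formed first.
// The PR description and ticket are compared against it, never read before it.

interface IndependentRead {
  // What the agent concluded the code actually does, from the diff alone.
  observedBehaviors: string[];
}

function compareSources(
  read: IndependentRead,
  descriptionClaims: string[], // claims the PR description makes
  ticketRequirements: string[], // requirements from the linked ticket
): string[] {
  const flags: string[] = [];
  // A claim with no matching observed behavior is a disagreement worth
  // flagging -- not an ambiguity to resolve charitably.
  for (const claim of descriptionClaims) {
    if (!read.observedBehaviors.includes(claim)) {
      flags.push(`description claims "${claim}" but the diff does not show it`);
    }
  }
  for (const req of ticketRequirements) {
    if (!read.observedBehaviors.includes(req)) {
      flags.push(`ticket requires "${req}" but the diff does not implement it`);
    }
  }
  return flags;
}
```

Exact string matching is obviously naive; the real comparison is judgment. The structural point survives the simplification: disagreement between the sources is the output, not something smoothed over during reading.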
The Six Groups (and Why You Can't Skip Them)
My solution to the ordering problem is a rigid structure. Six analysis groups, run in sequence, where each group depends on the conclusions of the one before it. You cannot skip a group. You cannot run them in a different order.
Group 1 is entirely about understanding the change from the code itself. Before looking at anything else, the agent reads the diff and derives behavior. What do these changed functions actually do? What paths through the code are new or different? The PR description plays no part in this step.
Group 2 uses the findings from Group 1 to evaluate test coverage. Not test coverage in general — coverage of the specific paths that changed. This ordering matters: without Group 1, test feedback tends to float free of the actual change, flagging gaps that aren't relevant to what the PR touched.
Groups 3 and 4 move into code quality and system-level analysis. Logic correctness, pattern consistency, failure modes, N+1 risks. Still no PR description.
Group 5 is platform-specific checks. For a Next.js and Supabase project, that means things like server vs. client component boundaries, RLS implications, and cache invalidation. For a Rails project it shifts to things like eager loading, callback chains, and strong parameter coverage. Different stack, different file.
Group 6 is where existing review comments finally come in. By this point the agent has a complete independent read of the code. It can look at what others said and take a real position: agree, disagree, flag something they missed, or note overlap with its own findings. It is not shaping its analysis around what someone else already wrote. It's evaluating it.
Group 1: Understand the Change
Group 2: Test Coverage
Group 3: Code Quality
Group 4: System-Level Analysis
Group 5: Platform-Specific
Group 6: Reviewer Feedback
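The rigid ordering can be expressed as a runner where each group only ever receives the results that already exist — a minimal sketch with invented names, not the skill file itself. Skipping or reordering becomes impossible by construction:

```typescript
// Sketch: groups run in sequence, and each group's input is the accumulated
// output of every group before it. Group 2 literally cannot run without
// Group 1's conclusions in hand.

type GroupResult = { group: number; findings: string[] };

const groups: Array<(prior: GroupResult[]) => GroupResult> = [
  () => ({ group: 1, findings: ["derived behavior from diff"] }),
  (prior) => ({
    group: 2,
    findings: [`coverage of: ${prior[0].findings[0]}`],
  }),
  // Groups 3-6 follow the same shape: each receives everything before it.
];

function runReview(): GroupResult[] {
  const results: GroupResult[] = [];
  for (const group of groups) {
    results.push(group(results));
  }
  return results;
}
```

The dependency is data flow, not an instruction in a prompt — which is exactly why it holds where "please analyze in this order" does not.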
The Path-Spec Map
Before any analysis starts, the skill builds a path-spec map. It reads the diff, lists every changed production file and function, and then maps each one to the test files that exercise it.
This sounds like a small bookkeeping step. It isn't.
Before this was in place, the review would regularly flag tests as missing when they weren't. They existed; they just weren't part of the change set, and the agent hadn't done the work to find them. The result was a list of test gaps that were actually gaps in the agent's knowledge, not in the codebase.
The path-spec map forces that work upfront. By the time test coverage analysis runs in Group 2, the agent has already located the relevant tests. Every gap it flags from that point is a real one: something this PR introduced or left uncovered in code it actually touched.
Path-spec map

app/api/posts/[id]/route.ts — PATCH handler
  app/api/posts/__tests__/route.test.ts
    ✓ "PATCH returns 200 and updates post"
    ✓ "PATCH returns 401 if unauthenticated"
    ✗ Missing: no test for PATCH with unknown id (404 path)

components/admin/PostEditor.tsx — handleSave()
  components/admin/__tests__/PostEditor.test.tsx
    ✓ "calls PATCH with updated body on save"
    ✓ "disables save button while request is in flight"
The constraint is strict on purpose. If a behavior isn't in the map, test feedback for it doesn't belong in the report. That cuts a lot of noise. And in a code review, noise is what hides the things that matter.
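The mapping itself can be sketched mechanically. This is a hypothetical illustration, assuming a co-located `__tests__` directory convention like the one in the example above — the actual skill does this lookup through the agent, not a hardcoded rule:

```typescript
// Sketch: pair each changed production file with the test file that
// exercises it, using a naming convention as the lookup rule.
// testFile === null means a genuine gap, not a failed lookup.

interface PathSpecEntry {
  productionFile: string;
  testFile: string | null;
}

function buildPathSpecMap(
  changedFiles: string[],
  allTestFiles: string[],
): PathSpecEntry[] {
  return changedFiles.map((file) => {
    // Derive the expected test location from the production path, e.g.
    // components/admin/PostEditor.tsx -> components/admin/__tests__/PostEditor.test.tsx
    const dir = file.slice(0, file.lastIndexOf("/"));
    const base = file.slice(file.lastIndexOf("/") + 1).replace(/\.tsx?$/, "");
    const candidate = allTestFiles.find((t) =>
      t.startsWith(`${dir}/__tests__/${base}.test.`),
    );
    return { productionFile: file, testFile: candidate ?? null };
  });
}
```

The value isn't the lookup logic; it's that the lookup is forced to happen before any coverage claim is made.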
The Report as a Triage Interface
The output isn't a wall of observations. It's a structured report designed for a specific job: getting a reviewer up to speed quickly and making it obvious what needs action.
Every finding gets one tag. [required] means it needs to be addressed before or right after merge. [suggestion] is substantive but not blocking. [follow-up] is worth doing in a later PR. [nit] is minor polish. The tags are ordered in the report, so the things that matter most are always at the top.
Every item includes a file and line number, plus a short snippet of the exact code being flagged. That detail matters more than it seems. Line numbers shift as you iterate on a PR. A snippet lets you find the issue even after you've pushed three more commits.
A finding looks like this:
[suggestion] app/api/posts/[id]/route.ts:47

  const post = await supabaseAdmin
    .from('posts')
    .select()
    .eq('id', id)
    .single()

Query fetches all columns but only title and body are used downstream.
Add .select('id, title, body, slug') to avoid overfetching.
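The tag ordering is the one part of the report that's purely mechanical. A sketch, assuming the four tags above and a fixed priority table (the `Finding` shape here is illustrative, not the skill's actual schema):

```typescript
// Sketch: findings are sorted by a fixed tag priority so [required]
// items always surface at the top of the report.

type Tag = "required" | "suggestion" | "follow-up" | "nit";

interface Finding {
  tag: Tag;
  location: string; // file:line
  snippet: string;  // exact code flagged -- stays findable after line numbers shift
  note: string;
}

const TAG_PRIORITY: Record<Tag, number> = {
  required: 0,
  suggestion: 1,
  "follow-up": 2,
  nit: 3,
};

function sortFindings(findings: Finding[]): Finding[] {
  return [...findings].sort(
    (a, b) => TAG_PRIORITY[a.tag] - TAG_PRIORITY[b.tag],
  );
}
```

Carrying the snippet alongside the location is the design choice that matters: the location goes stale as the PR evolves, the snippet doesn't.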
The skill also actively looks for things done well and calls them out in the report. Not a generic sign-off, but specific praise: an elegant solution to a tricky problem, a test added for an edge case nobody asked for, a safe navigation operator quietly preventing a nil error three steps downstream. If the author did something smart, the review says so and names it.
After the initial report, the process is iterative. You can ask for comment text for any specific item and get something ready to post. When you're done, you ask for the final verdict message. That message is scoped to exactly what you acted on. If you commented on two items, it says so. It doesn't imply you addressed four.
That last rule is about honesty in the review process as much as anything else. A reviewer who posts an approval saying "see my comments" when they only left one nit is creating a false impression. The skill doesn't let you do that by accident.
You Still Have to Read the Code
None of this replaces a human review. That needs to be said plainly.
The skill structures the process, surfaces things you might miss, and does the work of tracking what was said and by whom. What it doesn't do is understand what you're looking at. It doesn't know the history of a function, why a particular pattern exists in this codebase, or whether a change that looks correct is actually solving the right problem.
You still read the diff. You still form your own understanding of what changed and why. The skill makes that faster and harder to shortcut — the structured output gives you a starting point, and the path-spec map means you're not hunting for context from scratch. But skipping the read and trusting the report is exactly the mistake the skill was designed to prevent in the first place.
There's also a feedback loop worth using. When the skill misses something a human catches, you can add it as a learned check. It gets filed under the relevant analysis group and applied on every future review. The skill gets more useful the more real PRs run through it.
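A learned check can be as simple as a record filed under its analysis group and replayed on every future run. The shape below is hypothetical — the skill's actual storage format isn't shown here — but it captures the mechanism:

```typescript
// Sketch: a miss that a human caught is recorded against the group where
// it should have been found, and applied on all future reviews.

interface LearnedCheck {
  group: 1 | 2 | 3 | 4 | 5 | 6; // which analysis group runs this check
  description: string;           // what to look for next time
  origin: string;                // where a human caught the miss (illustrative)
}

const learnedChecks: LearnedCheck[] = [];

function addLearnedCheck(check: LearnedCheck): void {
  learnedChecks.push(check);
}

function checksForGroup(group: LearnedCheck["group"]): LearnedCheck[] {
  return learnedChecks.filter((c) => c.group === group);
}
```

Filing by group matters: the check runs at the point in the sequence where it could have caught the original miss, not as a free-floating reminder.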
That's the version of AI-assisted review worth building toward. One where the human reviewer's judgment compounds over time rather than atrophies.