What Review Tools Still Miss
Most AI code review tools work the same way. They read the diff, feed chunks of it to an LLM, and comment on the lines they see.
That setup is good enough for style issues, missing checks, and the occasional obvious bug. It is not good enough for the question that actually matters in review: what does this change break downstream?
That is why so many review bots feel noisy even when the models are strong. They can talk about the text in front of them. They cannot see the dependency structure around it.
Without that structure, they miss the thing developers actually care about. Rename a function in one file, and the tool may have no idea that three other files import it. Or worse, it may flag a break that was already fixed elsewhere in the same pull request.
That is not mainly a model problem. It is an information problem.
Why Lines Aren't Enough
Once you frame review that way, the right abstraction is not lines. It is a graph.
Treat the codebase as a directed graph G = (V, E) where each file is a vertex and each dependency is an edge. A change touches a subset of vertices C ⊆ V. For a changed file f ∈ C, its impact set is the set of files that depend on it:

Impact(f) = { g ∈ V : (g, f) ∈ E }
From there the logic gets simple.
A change is breaking if some downstream consumer sits in Impact(f) but not in C. A change is safe if all of its downstream consumers are co-changed in the same diff, that is, Impact(f) ⊆ C.
That gives you a much better workflow:
- Find reverse dependencies for each changed file.
- Split them into co-changed and untouched consumers.
- Check whether the co-changed files actually accommodate the break.
- Flag only the untouched consumers that still look exposed.
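The steps above can be sketched in a few lines of Python against a toy reverse-dependency map. The file names and data structure here are illustrative, not Clewso's API:

```python
# Toy reverse-dependency map: file -> files that depend on it.
reverse_deps = {
    "utils.py": {"api.py", "cli.py", "worker.py"},
    "api.py": {"cli.py"},
}

def review(changed: set[str]) -> dict[str, dict[str, set[str]]]:
    """For each changed file, split its downstream consumers into
    co-changed (also in the diff) and untouched (potentially exposed)."""
    report = {}
    for f in changed:
        consumers = reverse_deps.get(f, set())
        report[f] = {
            "co_changed": consumers & changed,  # handled in the same diff
            "untouched": consumers - changed,   # candidates for a flag
        }
    return report

# utils.py changes and api.py is updated alongside it; cli.py and
# worker.py are consumers the diff never touches.
flags = review({"utils.py", "api.py"})["utils.py"]["untouched"]
print(sorted(flags))
# ['cli.py', 'worker.py']
```

The point of the split is that only the `untouched` set ever needs to reach the model.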
The important part is not just the graph itself. It is the division of labor. The graph tells you what matters. The LLM reasons about a much smaller and better-defined question.
How Clewso Sees the Codebase
That is the architecture behind Clewso.
The indexing pipeline parses source files with tree-sitter, extracts imports, calls, and definitions, and stores the dependency graph in Neo4j with embeddings in Qdrant. The review step then queries Neo4j directly for reverse dependencies.
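To make the extraction step concrete, here is what pulling imports out of source code looks like for Python files using the stdlib `ast` module. This is a single-language stand-in for the idea; Clewso itself does this with tree-sitter so it works across languages:

```python
import ast

def extract_imports(source: str) -> set[str]:
    """Collect the module names a Python source file imports.
    An illustrative stand-in for tree-sitter-based extraction."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules

print(sorted(extract_imports("import os\nfrom json import loads\n")))
# ['json', 'os']
```

Each extracted module name becomes an `IMPORTS` edge from the file's vertex in the graph.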
There are three main query strategies.
Module-stem matching finds files importing modules that begin with the changed module stem:
```cypher
MATCH (f:File {repo_id: $repo_id})-[:IMPORTS]->(m:Module)
WHERE m.name STARTS WITH $stem AND f.path <> $path
RETURN DISTINCT f.path
```
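In plain terms, the query asks: which other files import any module whose name starts with the changed module's stem? A hypothetical in-memory equivalent of those semantics:

```python
# imports: file path -> modules it imports (illustrative data).
imports = {
    "src/api.py": {"app.utils.strings", "json"},
    "src/cli.py": {"app.utils", "argparse"},
    "src/app/utils.py": {"re"},
}

def module_stem_consumers(stem: str, changed_path: str) -> set[str]:
    """Files importing any module that starts with `stem`,
    excluding the changed file itself (mirrors the WHERE clause)."""
    return {
        path
        for path, mods in imports.items()
        if path != changed_path and any(m.startswith(stem) for m in mods)
    }

print(sorted(module_stem_consumers("app.utils", "src/app/utils.py")))
# ['src/api.py', 'src/cli.py']
```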
The function-callers query finds files that call functions defined in the changed file:
```cypher
MATCH (:File {repo_id: $repo_id, path: $path})-[:DEFINES]->(cb:CodeBlock)
WITH cb.name AS def_name
MATCH (caller:File {repo_id: $repo_id})-[:CALLS]->(fn:Function {name: def_name})
WHERE caller.path <> $path
RETURN DISTINCT caller.path
```
The symbol-imports query handles patterns like Rust's `use crate::module::Symbol`, matching imported symbols back to definitions in the changed file.
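As a rough sketch of that matching step (the regex and data are illustrative; Clewso resolves symbols from the tree-sitter parse, not from regexes):

```python
import re

def rust_use_symbols(source: str) -> set[str]:
    """Pull the trailing symbol out of simple `use crate::...::Symbol;`
    lines. A deliberately crude illustration of the matching idea."""
    return set(re.findall(r"use\s+crate(?:::\w+)*::(\w+)\s*;", source))

# Symbols the changed file defines (illustrative):
defined = {"Tokenizer", "Span"}

consumer_src = "use crate::lex::Tokenizer;\nuse std::fmt;\n"
hits = rust_use_symbols(consumer_src) & defined
print(sorted(hits))
# ['Tokenizer']
```

Any non-empty intersection makes the consumer file a downstream dependent of the changed file.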
Each downstream consumer is then tagged with its diff status:
- co-changed: modified in the same diff
- co-deleted: removed in the same diff
- unaddressed: not touched in the diff
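That tagging step is simple to sketch, assuming the diff exposes its modified and deleted paths (the names here are illustrative):

```python
def tag_consumers(consumers, modified, deleted):
    """Label each downstream consumer by how the same diff treats it."""
    tags = {}
    for path in consumers:
        if path in deleted:
            tags[path] = "co-deleted"
        elif path in modified:
            tags[path] = "co-changed"
        else:
            tags[path] = "unaddressed"
    return tags

print(tag_consumers(
    consumers=["a.rs", "b.rs", "c.rs"],
    modified={"a.rs"},
    deleted={"b.rs"},
))
# {'a.rs': 'co-changed', 'b.rs': 'co-deleted', 'c.rs': 'unaddressed'}
```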
That context goes to the LLM with explicit reasoning rules. If all consumers are co-changed and the diffs accommodate the break, the change is safe. If all consumers are co-deleted, the teardown is coherent. If a removed public symbol has no remaining references, that is safe too. What gets flagged are the cases with specific, evidenced downstream risk.
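A deterministic pre-filter along those lines might look like this. It is a simplification of the rules described above, with hypothetical names, and it deliberately leaves the "does the co-change actually accommodate the break?" judgment to the model:

```python
def prefilter(tags: dict[str, str]) -> list[str]:
    """Keep only consumers worth surfacing as downstream risk.
    tags maps consumer path -> 'co-changed' | 'co-deleted' | 'unaddressed'."""
    if not tags:
        return []  # removed symbol with no remaining references: safe
    values = set(tags.values())
    if values <= {"co-changed"}:
        return []  # every consumer updated in the same diff
    if values <= {"co-deleted"}:
        return []  # coherent teardown
    # Otherwise, surface only the consumers the diff never touched.
    return [p for p, t in tags.items() if t == "unaddressed"]

print(prefilter({"a.rs": "co-changed", "c.rs": "unaddressed"}))
# ['c.rs']
```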
Where the Signal Came From
We tested the system across five iterations on a 30-file refactoring PR in a 10-crate Rust workspace called `monastic`.
The refactor deleted a content module, changed a constructor from static to dynamic initialization, updated function signatures across crate boundaries, and moved embedded data into CSV files.
| Iteration | Approach | True Positives | False Positives | False Negatives |
|---|---|---|---|---|
| 1 | Vector search for node lookup | 0 | 0 | 5 |
| 2 | Direct Neo4j graph queries | 5 | 4 | 0 |
| 3 | + Same-diff co-change detection | 5 | 2 | 0 |
| 4 | + Deletion coherence rules | 5 | 1 | 0 |
| 5 | + Workspace detection + symbol grep | 5 | 0 | 0 |
That progression tells the story.
The first version failed because embedding similarity is not good enough for reliable graph node lookup. The second found the right dependencies, but lacked same-diff awareness, so it still over-warned. Each later version tightened the context and cut false positives. By iteration five, the system found all real issues and stopped inventing new ones.
The last false positive was a `pub mod` removal in a crate's `lib.rs`. The model hedged about possible external consumers. Adding two deterministic signals, workspace membership detection and a codebase-wide symbol grep, resolved the ambiguity. The model did better once it had facts instead of a hole to guess into.
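The symbol-grep signal is easy to picture: after a public symbol is removed, scan the rest of the workspace for any surviving reference. A minimal sketch with file contents inlined for illustration (the real check runs over the repository on disk):

```python
def remaining_refs(symbol: str, files: dict[str, str], removed_from: str) -> list[str]:
    """Paths outside the changed file that still mention the symbol."""
    return [
        path
        for path, text in files.items()
        if path != removed_from and symbol in text
    ]

workspace = {
    "crates/core/src/lib.rs": "pub mod content;",  # the removal site
    "crates/app/src/main.rs": "fn main() {}",
}
print(remaining_refs("content", workspace, "crates/core/src/lib.rs"))
# []
```

An empty result is exactly the fact the model needed: no other file in the workspace still references the removed symbol.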
Why Same-Diff Awareness Matters
The biggest jump came from one idea: same-diff awareness.
A lot of breaking changes are fixed in the same pull request that introduces them. Developers rename a function and update the callers. They delete a module and remove its consumers. They change a constructor signature and patch the call sites in one sweep.
Line-level review tools cannot take advantage of that because they look at files in isolation. A graph-aware tool can. It knows the co-change set. It can tell the difference between a break that was handled and a break that was left behind.
That changes the job you give the model. Instead of asking, "Is this change breaking?" you ask, "Did this specific co-change handle the break?" That is a narrower question, and a much more reliable one.
Why It Stays Quiet More Often
On that same 30-file PR, a typical line-level review tool would have produced dozens of comments: style notes, defensive suggestions, generic warnings.
The graph-aware system produced zero flags in the final iteration, because every potentially breaking change had already been addressed in the same diff.
That is the real standard for review tooling. Not whether it can always say something, but whether it can keep quiet when the change is already coherent.
And when it does speak, the signal is sharper. The tool can point to the exact unaddressed consumer, the relationship type, and whether that consumer was modified, deleted, or untouched. That is not vague review noise. That is an architectural claim you can inspect.
Where It Still Falls Short
The approach still has limits.
- **Semantic dependencies.** Tree-sitter captures syntax, not full type-aware behavior. Trait-mediated or interface-mediated dependencies can remain invisible.
- **Cross-repository edges.** A service calling another service's API does not automatically become a graph edge. That remains a major gap for larger systems.
- **LLM reliability.** The final judgment about whether a co-change actually handles a break still relies on a probabilistic model. The good news is that structured context makes this much more reliable. The bad news is that it does not make it deterministic.
- **Runtime dependencies.** Data files like CSV, TOML, or JSON often matter deeply at runtime but do not appear in the static import graph unless you layer in framework-specific heuristics.
None of that breaks the value of the approach. It just marks the boundary of what static graph awareness can do on its own.
The Bet
The real change here is not that graph-aware review is a little better than line-level review. It is that it asks a better question.
Line-level tools ask whether the changed code looks right in isolation. Graph-aware review asks whether the change breaks anything downstream, and if it does, whether the break was handled.
That is a much better fit for how real code review works.
Clewso is open source under AGPL-3.0 at github.com/clewnet/clewso. The graph is built automatically from tree-sitter parsing. The review runs locally. And the argument is simple: if your review tool cannot see the dependency graph, it is blind to the most important part of the job.
That is the bet here too. Not that review can be fully automated. Just that the tooling should at least be looking at the thing that makes review hard.