Deterministic repository graph intelligence research for Code Mesh.
repo-graph-rag is the public research artifact behind the Code Mesh line of
work. It is not a productized service and it is not presented as a complete
cross-language graph engine. The supported public path is a Python-first,
deterministic snapshot pipeline that turns repository source into reproducible
graph outputs with explicit provenance.
This repository is built around one narrow belief:
repository understanding becomes much more reliable when context is deterministic, traversable, and evidence-backed instead of purely probabilistic.
That means:
- graph state should be inspectable and reproducible
- node and edge identity should be stable
- relations should explain how they were extracted
- public claims should stay narrower than what the implementation can actually prove
The repo makes more sense as part of a sequence:
- llama-github explored GitHub-native retrieval as a substrate.
- LlamaPReview validated the practical value of high-quality code context.
- llamapreview-context-research formalized the failure mode: context instability.
- repo-graph-rag pushes the idea toward deterministic graph construction and traversal-first repository intelligence.
From the repository root:
python3.11 -m venv .venv
.venv/bin/pip install -r repo_kg_maintainer/requirements.txt
PYTHONPATH=repo_kg_maintainer .venv/bin/python repo_kg_maintainer/main_v2.py \
--tenant tenant-demo \
--repo examples/python-demo \
--commit demo-commit \
--source examples/python_demo_repo \
--output /tmp/python_demo_snapshot_v2.json

Expected stable result for the committed demo:
{
"graph_version": "2.0",
"nodes": 14,
"edges": 11,
"schema_hash": "a4bc762e8e4e2d91c3c52f3dda836ef818c438c78c184a0d948249328a6a47a9",
"snapshot_hash": "1c6493238faab5970ec76770a1ddafed05099c21a8d4b411776aa6111aecea1e"
}

The committed reference artifact lives at
examples/python_demo_snapshot_v2.json.
Comparison instructions live in docs/validation.md.
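As a rough illustration of what such a comparison could look like (docs/validation.md remains the authoritative procedure), the sketch below checks the three identity fields from the expected result. It assumes those fields are top-level keys of the snapshot JSON, which the summary above suggests but does not guarantee. `compare_snapshots` and `load_snapshot` are hypothetical helper names, not part of the repository.

```python
import json
from pathlib import Path

# Field names taken from the expected-result summary above. That they are
# top-level keys of the snapshot file is an assumption of this sketch.
IDENTITY_FIELDS = ("graph_version", "schema_hash", "snapshot_hash")


def load_snapshot(path: str) -> dict:
    """Read a snapshot JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def compare_snapshots(generated: dict, reference: dict) -> list[str]:
    """Return the identity fields whose values differ between two snapshots."""
    return [
        field
        for field in IDENTITY_FIELDS
        if generated.get(field) != reference.get(field)
    ]
```

Under those assumptions, `compare_snapshots(load_snapshot("/tmp/python_demo_snapshot_v2.json"), load_snapshot("examples/python_demo_snapshot_v2.json"))` would return an empty list when the local run reproduced the committed artifact.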
The tiny demo repository is intentionally small but non-trivial. It exercises:
- file and symbol extraction
- local import resolution
- class instantiation
- method and function calls
- provenance-bearing edges
This is the public proof surface for the supported Python mainline.
From examples/python_demo_repo/service.py:
from helpers import finalize
from workers import Worker


class TaskService:
    def execute(self, raw_value: str) -> str:
        worker = Worker()
        result = worker.work(raw_value)
        return finalize(result)

The resulting graph includes:
- service.py::_file_ --IMPORTS--> Worker
- service.py::_file_ --IMPORTS--> finalize
- TaskService.execute --INSTANTIATES--> Worker
- TaskService.execute --CALLS--> Worker.work
- TaskService.execute --CALLS--> finalize
Those edges also carry provenance such as:
- imports.module.symbol
- instantiates.class.call
- calls.function.dispatch
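To give a feel for where such relations come from, here is a deliberately simplified sketch built on Python's standard `ast` module. It is not the repository's analyzer: it only recovers `from ... import` relations and call sites whose callee is a bare name, and it uses a neutral `CALL_SITE` label (an invented name) because telling instantiation apart from a function call, or resolving `worker.work` to `Worker.work`, needs the symbol table and type inference that this sketch does not have.

```python
import ast

# The service.py demo source, reproduced so the sketch is self-contained.
SOURCE = '''
from helpers import finalize
from workers import Worker


class TaskService:
    def execute(self, raw_value: str) -> str:
        worker = Worker()
        result = worker.work(raw_value)
        return finalize(result)
'''


def extract_edges(source: str, filename: str) -> list[tuple[str, str, str]]:
    """Collect (source, relation, target) triples for imports and bare-name calls."""
    tree = ast.parse(source)
    edges = []
    for node in tree.body:
        if isinstance(node, ast.ImportFrom):
            # One IMPORTS edge per imported symbol.
            for alias in node.names:
                edges.append((filename, "IMPORTS", alias.name))
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if not isinstance(item, ast.FunctionDef):
                    continue
                owner = f"{node.name}.{item.name}"
                for sub in ast.walk(item):
                    # Only calls like name(...); attribute calls such as
                    # worker.work(...) are skipped because their target
                    # cannot be resolved without type information.
                    if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                        edges.append((owner, "CALL_SITE", sub.func.id))
    return edges
```

Running `extract_edges(SOURCE, "service.py")` yields the two import relations plus call sites for `Worker` and `finalize`. The gap between that output and the five edges listed above is what the later import-resolution and inference passes have to close.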
Example edge excerpt:
{
"relation_type": "CALLS",
"source_id": "tenant-demo|examples/python-demo|demo-commit|Method|service.py::TaskService.execute",
"target_id": "tenant-demo|examples/python-demo|demo-commit|Method|workers.py::Worker.work",
"provenance": {
"extractor_pass": "relation_extraction",
"rule_id": "calls.function.dispatch",
"source_span": [8, 18],
"confidence": 0.9
}
}

This is the central idea of the repo: not just extracting relations, but making their origin visible and reproducible.
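The identifiers in the excerpt appear to be five `|`-separated segments that line up with the CLI flags (`--tenant`, `--repo`, `--commit`) followed by a node kind and a qualified name. The sketch below splits an ID on that basis. The layout is inferred from the single example above, and the `NodeId` type and its field names are hypothetical labels rather than the repository's own.

```python
from typing import NamedTuple


class NodeId(NamedTuple):
    """Hypothetical view of a node ID; field names are illustrative only."""
    tenant: str
    repo: str
    commit: str
    kind: str
    qualified_name: str


def parse_node_id(node_id: str) -> NodeId:
    """Split a node ID into its five segments, assuming '|' as the separator."""
    # maxsplit=4 keeps any further '|' characters inside the qualified name.
    parts = node_id.split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 segments, got {len(parts)}: {node_id!r}")
    return NodeId(*parts)
```

Because every segment is derived from explicit inputs (tenant, repository, commit) or from the source itself (kind, qualified name), the same input yields the same identifier on every run, which is what keeps node and edge identity stable.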
The public support boundary is intentionally narrow:
- Supported: Python v2 deterministic snapshot generation and its in-memory query / MCP parity foundations
- Legacy: Python + Arango full-build path kept for historical compatibility
- Experimental: Go analyzer subtree
- Archived: broken or environment-coupled research modules removed from the public runtime surface
The Python v2 path is the supported public contract.
Public interfaces:
| Interface | Purpose |
|---|---|
| repo_kg_maintainer/main_v2.py | Local CLI for deterministic snapshot generation |
| repo_kg_maintainer/v2/analyzer/pipeline.py | Pass-based Python graph extraction |
| repo_kg_maintainer/v2/api/service.py | Service contract for indexing and querying |
| repo_kg_maintainer/v2/graph/store.py | In-memory and Arango-backed snapshot stores |
| repo_kg_maintainer/v2/mcp/toolset.py | MCP-friendly deterministic graph queries |
| repo_kg_maintainer/v2/serializer.py | Canonical serialization and snapshot hashing |
This repo does not ask readers to trust a vague story. The supported path has:
- executable tests
- a committed demo repo
- a committed expected snapshot
- a deterministic snapshot hash for comparison
flowchart LR
A[Repository Source] --> B[Parse / Normalize]
B --> C[Symbol Table]
C --> D[Import Resolution]
D --> E[Type Inference]
E --> F[Relation Extraction]
F --> G[Relation Validation]
G --> H[Deterministic Graph Snapshot v2]
H --> I[GraphServiceV2]
H --> J[GraphMCPToolsetV2]
The mainline pipeline is deliberately simple:
- collect Python source files from a local repository root
- parse and normalize files deterministically
- extract file and symbol entities
- resolve imports and infer relation targets
- emit nodes and provenance-bearing edges
- canonicalize the snapshot before saving or serving it
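The last step is what makes the snapshot hash comparable across runs. A minimal sketch of the general technique follows: sort nodes and edges, serialize with sorted keys and fixed separators, then hash the bytes. The canonical form actually used is defined by repo_kg_maintainer/v2/serializer.py. The use of SHA-256, the node field `id`, and the helper names here are assumptions, so this digest is not expected to equal the published `snapshot_hash`.

```python
import hashlib
import json


def canonicalize(snapshot: dict) -> dict:
    """Return a copy with nodes and edges in a fixed, content-derived order."""
    ordered = dict(snapshot)
    # "id" as the node key is an assumption; the edge keys come from the
    # edge excerpt shown earlier in this README.
    ordered["nodes"] = sorted(snapshot.get("nodes", []), key=lambda n: n["id"])
    ordered["edges"] = sorted(
        snapshot.get("edges", []),
        key=lambda e: (e["source_id"], e["relation_type"], e["target_id"]),
    )
    return ordered


def snapshot_digest(snapshot: dict) -> str:
    """Hash the canonical JSON form so equal graphs give equal digests."""
    payload = json.dumps(
        canonicalize(snapshot),
        sort_keys=True,          # key order no longer depends on insertion order
        separators=(",", ":"),   # no whitespace variation
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

With this shape, two snapshots that contain the same nodes and edges in a different order, or with dictionary keys inserted in a different order, produce the same digest.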
| Area | Status | Notes |
|---|---|---|
| Python v2 snapshot pipeline | Supported | Primary public surface |
| GraphServiceV2 / GraphMCPToolsetV2 | Supported | Deterministic query layer over snapshots |
| Python + Arango legacy path | Legacy | Full-build only; keeps llama-github retrieval/filtering in the loop |
| Go analyzer subtree | Experimental | Kept for research value, not default adoption |
| Document graph enrichment path | Archived | Removed from runtime support surface |
Start here if you want the supported path:
Boundary documents:
Full docs index:
- Python is the only supported public extraction path.
- Java / JS / TS extraction exists, but relation extraction is not positioned as complete or public-mainline ready.
- The legacy Arango path still depends on llama-github plus GitHub/Arango credentials, but the current OSS cut validates it on published llama-github==0.4.0 and its modern LangChain provider stack.
- Incremental updates on the legacy path were an unfinished experiment and are intentionally not exposed as a public capability.
- The Go subtree is experimental and outside the default CI and support contract.
- Large generated graph artifacts and SVG outputs were removed from HEAD.
- Public docs now center the deterministic Python mainline and the committed demo proof surface.
- The legacy dependency story now tracks published llama-github==0.4.0 directly instead of duplicating LangChain-family pins in this repository.
- Historical modules removed from the runtime tip remain discoverable in git history, but they are not part of the public support boundary.
Apache 2.0. See LICENSE.
See CONTRIBUTING.md.
See SECURITY.md.