Every LLM tool right now (Cursor, Claude Code, Copilot, all of them) decides what to put in the context window using heuristics: grep for some symbols, run an embedding-based cosine-similarity search, or just cram as many files as will fit. Nobody has an actual algorithm for this.
This project is an attempt to build one.
It runs entirely locally, with no LLM calls, API keys, or cloud dependencies, and supports Python, JavaScript, TypeScript, Go, Rust, Java, Ruby, C, and C++.
Think of it this way:
| Classic OS | LLM Equivalent | Current State |
|---|---|---|
| RAM | Context window | Manually managed |
| Virtual memory / page swaps | Context eviction + retrieval | Crude summarization |
| Process scheduler | Agent orchestration | Hand-coded loops |
| File system cache | Knowledge retrieval | Cosine similarity |
| Memory allocator | Token budget allocation | Nobody does this |
Early computers had programmers managing memory addresses by hand. Then virtual memory arrived: one algorithm that unlocked everything we know as modern computing.
The LLM ecosystem is at the "manual memory management" stage right now. Context is the single most important resource where reasoning happens, yet every tool manages it with heuristics instead of optimization.
Cognitive-cache is building the "virtual memory" for LLM reasoning.
Given a task (like a GitHub issue) and a codebase, cognitive-cache picks which files to include in the LLM's context window across nine languages. It combines multiple signals (symbol matching, dependency graph distance, git recency, semantic similarity, redundancy penalties, file role awareness) into a scoring function and runs greedy submodular optimization to select the highest-value set of files that fits within a token budget.
The key insight is treating context selection as a constrained optimization problem rather than a retrieval problem. RAG systems ask "what's most similar to the query?" but what you actually want is "what maximizes the chance the model gets this right?" Those are different questions.
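Stated a bit more formally (this is a paraphrase of the framing, not the exact objective implemented in `core/value_function.py`): retrieval scores each file independently against the query, while context selection picks a *set* of files under a budget,

$$
S^{*} = \arg\max_{S \subseteq \text{files}} \; f(S \mid \text{task}) \quad \text{subject to} \quad \sum_{s \in S} \text{tokens}(s) \le B
$$

where $B$ is the token budget and the set function $f$ rewards relevance but discounts files that duplicate what is already in $S$. That diminishing-returns structure is what makes $f$ (approximately) submodular, and it is why the greedy selection described below is a reasonable approximation to the best achievable set rather than just a list of the most similar individual files.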
Benchmarked on 23 real bug-fix PRs across 8 open-source repositories. For each issue, we run every strategy with a 12K token budget and measure file recall: did the algorithm select the files that were actually modified in the fix?
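Concretely, recall for one issue is the fraction of the PR's changed files that the strategy managed to include. A minimal sketch of that metric (the benchmark's `evaluator.py` is the source of truth; the function below is an illustrative paraphrase, not the actual code):

```python
def file_recall(selected_paths: set[str], changed_paths: set[str]) -> float:
    """Fraction of the fix PR's changed files that the strategy selected.

    Illustrative only: the real evaluator may normalize paths or handle
    renames differently.
    """
    if not changed_paths:
        return 0.0
    return len(selected_paths & changed_paths) / len(changed_paths)

# Example: the fix touched two files and the strategy picked one of them -> 0.5
file_recall({"src/sessions.py", "docs/changelog.md"},
            {"src/sessions.py", "tests/test_sessions.py"})
```

The tables below report this number per strategy, both averaged and per issue.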
| Strategy | Avg Recall | Median | Head-to-head vs cognitive-cache |
|---|---|---|---|
| cognitive-cache | 34.7% | 40% | |
| llm-triage | 25.7% | 20% | 8W / 11T / 4L |
| embedding (RAG) | 25.9% | 0% | 9W / 9T / 5L |
| grep | 22.8% | 0% | 7W / 15T / 1L |
| random | 4.7% | 0% | 12W / 11T / 0L |
| full-stuff | 2.2% | 0% | 13W / 10T / 0L |
cognitive-cache never loses to random or full-stuff, and it has a winning record against every baseline, including llm-triage (asking the LLM to pick its own files), which is the hardest one to beat.

Broken down by repository:
| Repository | cognitive-cache | llm-triage | embedding | grep | Issues |
|---|---|---|---|---|---|
| Textualize/rich | 100% | 0% | 50% | 100% | 1 |
| pallets/werkzeug | 62% | 38% | 44% | 44% | 4 |
| psf/requests | 50% | 25% | 25% | 25% | 2 |
| pallets/flask | 38% | 36% | 23% | 22% | 3 |
| pallets/click | 33% | 33% | 0% | 33% | 1 |
| pallets/jinja | 20% | 40% | 20% | 20% | 5 |
| fastify/fastify | 17% | 0% | 25% | 0% | 6 |
| tiangolo/fastapi | 0% | 50% | 0% | 0% | 1 |
Every issue, every strategy, full transparency:
| Issue | CC | LLM | Emb | Grep | Rand | Full |
|---|---|---|---|---|---|---|
| Textualize/rich#4006 | **100%** | 0% | 50% | **100%** | 0% | 0% |
| fastify/fastify#6013 | 0% | 0% | 0% | 0% | 0% | 0% |
| fastify/fastify#6021 | **50%** | 0% | 0% | 0% | 0% | 0% |
| fastify/fastify#6026 | 0% | 0% | 0% | 0% | 0% | 0% |
| fastify/fastify#6030 | 0% | 0% | **50%** | 0% | 0% | 0% |
| fastify/fastify#6064 | 0% | 0% | 0% | 0% | 0% | 0% |
| fastify/fastify#6613 | 50% | 0% | **100%** | 0% | 0% | 0% |
| pallets/click#3225 | **33%** | **33%** | 0% | **33%** | **33%** | 0% |
| pallets/flask#5899 | **50%** | **50%** | **50%** | 0% | 0% | 0% |
| pallets/flask#5917 | **40%** | 20% | 20% | **40%** | 0% | 0% |
| pallets/flask#5928 | 25% | **38%** | 0% | 25% | 0% | 0% |
| pallets/jinja#1663 | 0% | **100%** | 50% | 50% | 0% | 0% |
| pallets/jinja#1665 | 0% | 0% | 0% | 0% | 0% | 0% |
| pallets/jinja#1706 | 0% | 0% | **50%** | 0% | 0% | 0% |
| pallets/jinja#1852 | 0% | **50%** | 0% | 0% | 0% | 0% |
| pallets/jinja#2061 | **100%** | 50% | 0% | 50% | 0% | 0% |
| pallets/werkzeug#3038 | 50% | 50% | **75%** | 25% | 25% | 0% |
| pallets/werkzeug#3078 | **50%** | **50%** | **50%** | **50%** | 0% | 0% |
| pallets/werkzeug#3080 | **50%** | 0% | 0% | 0% | 0% | 0% |
| pallets/werkzeug#3129 | **100%** | 50% | 50% | **100%** | 0% | 0% |
| psf/requests#7308 | **50%** | **50%** | 0% | 0% | 0% | **50%** |
| psf/requests#7309 | **50%** | 0% | **50%** | **50%** | **50%** | 0% |
| fastapi/fastapi#15139 | 0% | **50%** | 0% | 0% | 0% | 0% |
Bold = best recall for that issue. All runs use Qwen 3.5 9B (Q4_K_M) via llama.cpp at zero API cost.
- random: picks files at random until the token budget is full
- full-stuff: crams files alphabetically until the budget is full, which is roughly what most tools approximate
- embedding (RAG): TF-IDF cosine similarity (scikit-learn, up to 5K features) between the issue text and file contents. This uses TF-IDF rather than neural embeddings, so it's a simpler version of what most RAG tools do; real-world RAG with a proper embedding model would likely score somewhat higher. A rough sketch of this baseline follows the list
- grep: searches for symbols and identifiers mentioned in the issue
- llm-triage: gives the LLM (same Qwen 3.5 9B) the issue plus a full file listing and asks it to pick the most relevant files. This simulates what tools like Cursor and Claude Code do when they decide what to read, and is the hardest baseline to beat
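For reference, the embedding baseline boils down to something like the following. This is a minimal sketch using scikit-learn, with illustrative names, not the baseline's actual code:

```python
# Illustrative sketch of a TF-IDF retrieval baseline; not the project's exact code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_files_by_tfidf(issue_text: str, files: dict[str, str], top_k: int = 15):
    """Rank files by TF-IDF cosine similarity to the issue text."""
    paths = list(files)
    vectorizer = TfidfVectorizer(max_features=5000)
    # Fit on the issue plus every file so they share one vocabulary.
    matrix = vectorizer.fit_transform([issue_text] + [files[p] for p in paths])
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return sorted(zip(paths, scores), key=lambda x: x[1], reverse=True)[:top_k]
```

The top-ranked files are then added in order until the 12K token budget is exhausted; that budget-filling step is omitted here.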
Six signals score each file, with configurable weights:
| Signal | Weight | What it does |
|---|---|---|
| symbol overlap | 0.45 | Does this file define or mention identifiers from the task? Dominant signal when it fires |
| graph distance | 0.20 | How many imports away from task-mentioned files? Built with networkx from the dependency graph (resolves imports for all nine languages, including TS path aliases from tsconfig.json) |
| change recency | 0.03 | Recently changed in git? Only fires when the file also has structural relevance, to prevent recently-touched-but-unrelated files from flooding results |
| redundancy penalty | 0.10 | Already selected a file with similar symbols? This one is worth less, preventing budget waste on duplicate context |
| embedding similarity | 0.15 | TF-IDF cosine similarity between task and file content |
| file role prior | 0.07 | Source files score higher than test files by default. Test files get boosted only when the task mentions testing |
These get combined into a weighted score, and a two-phase greedy selector picks files: first by absolute score (core files), then by value-per-token (supporting context). Redundancy is re-evaluated after each pick. If a file is too large to fit (like a 13K token app.py when your budget is 12K), it gets chunked to extract just the relevant functions.
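A simplified sketch of that selection loop (chunking of oversized files is omitted, `core_threshold` is an assumed illustrative cutoff, and the real implementation lives in `core/`):

```python
# Simplified sketch of two-phase greedy selection under a token budget.
# Illustrative only: names and thresholds here are assumptions, not the library's internals.
def select_greedy(candidates, budget: int, core_threshold: float = 0.5):
    """candidates: objects with .tokens and .score(selected), where .score returns
    the weighted combination of signals with the redundancy penalty applied
    against the already-selected set."""
    selected, used = [], 0

    def best(key):
        remaining = [c for c in candidates
                     if c not in selected and used + c.tokens <= budget]
        return max(remaining, key=key, default=None)

    # Phase 1: core files, picked by absolute score.
    while True:
        pick = best(lambda c: c.score(selected))
        if pick is None or pick.score(selected) < core_threshold:
            break
        selected.append(pick)
        used += pick.tokens

    # Phase 2: supporting context, picked by value per token.
    while True:
        pick = best(lambda c: c.score(selected) / max(c.tokens, 1))
        if pick is None or pick.score(selected) <= 0:
            break
        selected.append(pick)
        used += pick.tokens

    return selected
```

The important property is that `score(selected)` is recomputed against the growing selection on every pick, so a file loses value as soon as a near-duplicate has already made it in.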
Test files are automatically excluded unless the task mentions testing keywords (test, spec, coverage, fixture, mock, stub), though you can override this with include_tests=True/False.
```bash
pip install cognitive-cache
```

Or with uv:

```bash
uv add cognitive-cache
```

```python
from cognitive_cache import select_context_from_repo

result = select_context_from_repo(".", "fix the login redirect bug")
for item in result.selected:
    print(f"{item.source.path} (score: {item.score:.3f})")
```

For repeated queries against the same repo, build the index once and reuse it:
```python
from cognitive_cache import RepoIndex, select_context

index = RepoIndex.build(".")
r1 = select_context(index, "fix the login bug")
r2 = select_context(index, "add rate limiting to the API")
```

All the scoring parameters are exposed if you need control over them:
```python
from cognitive_cache import RepoIndex, select_context
from cognitive_cache.core.value_function import WeightConfig

index = RepoIndex.build(".")
result = select_context(
    index,
    "add test coverage for the auth module",
    budget=20000,
    include_tests=True,
    max_files=10,
    min_score=0.15,
    weights=WeightConfig(symbol_overlap=0.50, graph_distance=0.25),
)
```

```bash
cognitive-cache select --repo . --task "fix the login redirect bug"
cognitive-cache select --repo . --task "fix login" --json               # machine-readable
cognitive-cache select --repo . --task "fix login" --output ctx.txt     # dump context to file
cognitive-cache select --repo . --task "fix login" --include-tests no   # exclude test files
cognitive-cache select --repo . --task "fix login" --max-files 5 --min-score 0.2
```
Claude Code — registers the server at user scope (available in all projects):
```bash
claude mcp add --scope user cognitive-cache -- uvx --from "cognitive-cache[mcp]" cognitive-cache-mcp
```

Cursor / Windsurf / other editors — add to your MCP config file:

```json
{
  "mcpServers": {
    "cognitive-cache": {
      "command": "uvx",
      "args": ["--from", "cognitive-cache[mcp]", "cognitive-cache-mcp"]
    }
  }
}
```

The tool is most useful when the relevant files aren't obvious: cross-cutting bugs, unfamiliar parts of a large codebase, or tasks that span multiple layers. For small repos or tasks where you already know which files to touch, it doesn't add much over grep.
Add this to your CLAUDE.md (or equivalent system prompt / rules file):
```markdown
## context selection

When a task spans multiple files or you're not sure where to look, call `select_context_tool`
from the `cognitive-cache` MCP server before reading files manually:

- `repo_path`: absolute path to the repo root
- `task`: specific description of the task (the more precise, the better the results)
- `budget`: token budget for returned context (default 12000; raise for complex tasks)
- `include_tests`: true to always include test files, false to exclude, null for auto-detection
- `max_files`: cap on number of files returned (default 15)
- `min_score`: minimum relevance score threshold (default 0.0)

The tool returns file contents directly, so use them instead of separate file reads.
Call it at the start of investigation; the index is cached and follow-up calls are fast.
Skip it when you already know which files to read.
```

A precise task description outperforms a vague one because symbol overlap and embedding similarity both depend on the exact words used. "users get 401 after OIDC callback when session token is present" will score better than "fix auth bug".
The included workflow (.github/workflows/context-suggest.yml) automatically comments on new issues with the most relevant files. It runs entirely in CI with no API keys required.
```bash
uv sync --dev
uv run pytest tests/   # 137 tests
```
To run the benchmark with a local llama.cpp server:
```bash
LLAMACPP_BASE_URL=http://localhost:8080 PYTHONPATH=. uv run python benchmark/run_local.py   # full 23-issue benchmark
PYTHONPATH=. uv run python benchmark/run_test.py                                            # quick 3-issue test
```
Configure the llama.cpp connection with environment variables:
| Variable | Default | Description |
|---|---|---|
| `LLAMACPP_BASE_URL` | `http://localhost:8080` | llama.cpp server URL |
| `LLAMACPP_MODEL` | `Qwen3.5-9B-Q4_K_M` | Model name to request |
To expand the benchmark dataset (needs a GitHub token for API access):
```bash
GITHUB_TOKEN=ghp_xxx uv run python benchmark/curate_dataset.py
```
```
src/cognitive_cache/
  api.py          # public API: RepoIndex, select_context
  cli.py          # CLI entry point
  mcp_server.py   # MCP server for Claude Code / Cursor
  models.py       # core types (Source, Task, ScoredSource, SelectionResult)
  indexer/        # turns a repo directory into a list of Source objects
  signals/        # the six scoring signals
  core/           # value function, greedy selector, file chunker
  baselines/      # the five baseline strategies we benchmark against
  llm/            # adapters for calling LLMs (claude, openai, llama.cpp)
benchmark/
  dataset/        # curated github issues with known fixes
  runner.py       # orchestrates benchmark runs
  evaluator.py    # computes recall and efficiency metrics
```
- Adaptive replanning that re-optimizes context mid-conversation after the model calls a tool or asks a follow-up question
- Task-aware compression that goes beyond chunking to actually compress file content while preserving what's relevant
- Expanding the benchmark dataset to include Go, Rust, and Java repositories now that multi-language indexing is in place
Context windows are getting bigger (1M tokens, soon more), but bigger doesn't mean the problem goes away. More capacity means more choices, and stuffing everything in wastes compute while actively degrading output quality because the model's attention gets diluted. The optimization becomes more valuable as windows grow, not less.
If there's going to be one algorithm at the center of how LLM tools work, it should probably be this one. Or something like it.
Apache 2.0. See LICENSE.
