Add MCP and skill reliability report by ozymandiashh · Pull Request #357 · getagentseal/codeburn

ozymandiashh · 2026-05-18T23:17:39Z

Summary

Adds an MCP/skill reliability report to codeburn optimize.
Flags MCP servers and skills whose edit turns are disproportionately retry-heavy.
Uses turn-level capability evidence and shared-turn token accounting so one retry-heavy turn is not counted twice when both an MCP server and a skill appear together.

Why

Existing optimize findings can show broad MCP waste, context bloat, and low-worth sessions, but they do not answer a capability-level reliability question:

When a specific MCP server or skill is used during edit work, does that work repeatedly need retries?

This is useful for MCP and skill tuning because the right action is often not removal. A retry-heavy skill may need clearer instructions. A retry-heavy MCP server may need narrower project scope, a smaller tool set, or better usage guidance. The finding is intentionally framed as correlation, not causation, so users inspect the sessions before changing config.

For example:

Skill reviewer appears in 5 edit turns, and 3 of those edit turns need retries.
mcp__ci__run maps to MCP server ci, and the same 3/5 edit-turn retry pattern appears there.
If both the skill and MCP server appear in the same retry-heavy turns, the savings estimate counts those turns once instead of reporting two independent piles of waste.

What changed

Adds detectCapabilityReliability() to src/optimize.ts.
Aggregates per-capability edit reliability from existing parsed turn data:
- MCP server evidence comes from call.mcpTools, normalized from names like mcp__ci__run to server ci
- skill evidence comes from call.skills
- turn.subCategory is deliberately not treated as skill evidence, avoiding legacy or classifier-derived false labels
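
For illustration, the evidence-gathering step above can be sketched roughly as follows. This is not the code from src/optimize.ts: the CallLike shape, the helper names, and the mcp:/skill: key prefixes are assumptions, and it assumes MCP tool names follow the mcp__<server>__<tool> pattern seen in mcp__ci__run.

```typescript
// Hypothetical shape: only the mcpTools and skills fields are named in the PR.
interface CallLike {
  mcpTools?: string[];
  skills?: string[];
}

// Assumes MCP tool names look like mcp__<server>__<tool>.
// Returns null for anything that does not match that pattern.
function mcpServerFromToolName(toolName: string): string | null {
  const parts = toolName.split("__");
  if (parts.length < 3 || parts[0] !== "mcp" || parts[1] === "") return null;
  return parts[1];
}

// Collects the distinct capabilities evidenced by one turn's calls.
// Only call-level metadata is read; no turn-level label is consulted.
function capabilitiesForTurn(calls: CallLike[]): string[] {
  const found = new Set<string>();
  for (const call of calls) {
    for (const tool of call.mcpTools ?? []) {
      const server = mcpServerFromToolName(tool);
      if (server !== null) found.add(`mcp:${server}`);
    }
    for (const skill of call.skills ?? []) found.add(`skill:${skill}`);
  }
  return Array.from(found).sort();
}

// Two tools from the same server collapse into one server entry.
console.log(
  capabilitiesForTurn([
    { mcpTools: ["mcp__ci__run", "mcp__ci__status"] },
    { skills: ["reviewer"] },
  ]),
);
```

Keying on the server rather than the individual tool means several tools from one server accumulate evidence together, which matches the server-level framing of the finding.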
Emits a finding only when a capability has enough evidence:
- at least 5 edit turns
- at least 3 retry-heavy edit turns
- at least 50% of edit turns are retry-heavy
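
A minimal sketch of how the three thresholds combine, assuming a per-capability counter object. CapabilityStats, the constant names, and meetsEvidenceThreshold are hypothetical names for illustration and may not match the detector's internals.

```typescript
// Hypothetical per-capability counters; field names are assumptions.
interface CapabilityStats {
  editTurns: number;
  retryHeavyEditTurns: number;
}

// Threshold values taken from the PR description.
const MIN_EDIT_TURNS = 5;
const MIN_RETRY_HEAVY_TURNS = 3;
const MIN_RETRY_RATE = 0.5;

// All three conditions must hold before a finding is emitted.
function meetsEvidenceThreshold(stats: CapabilityStats): boolean {
  if (stats.editTurns < MIN_EDIT_TURNS) return false;
  if (stats.retryHeavyEditTurns < MIN_RETRY_HEAVY_TURNS) return false;
  return stats.retryHeavyEditTurns / stats.editTurns >= MIN_RETRY_RATE;
}

// 3 of 5 edit turns are retry-heavy: flagged.
console.log(meetsEvidenceThreshold({ editTurns: 5, retryHeavyEditTurns: 3 })); // true
// 1 of 5 edit turns is retry-heavy: below the retry-rate threshold.
console.log(meetsEvidenceThreshold({ editTurns: 5, retryHeavyEditTurns: 1 })); // false
```

The absolute minimums keep a capability with very few edit turns from being flagged on a high rate alone, and the rate check keeps a heavily used capability from being flagged on a handful of retries.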
Ignores read-only turns so the report stays focused on edit reliability.
Estimates recoverable tokens from retry-heavy edit turns with a 50% ceiling.
De-duplicates shared retry-heavy turns across all flagged capabilities before computing tokensSaved.
Adds focused regression coverage in tests/optimize.test.ts.
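
The de-duplicated savings estimate can be sketched like this. It is an illustration under stated assumptions, not the shipped implementation: RetryHeavyTurn, its id and effectiveTokens fields, and estimateTokensSaved are hypothetical, and rounding the result is an assumption.

```typescript
// Hypothetical turn record; the id is assumed to uniquely identify a turn.
interface RetryHeavyTurn {
  id: string;
  effectiveTokens: number;
}

// 50% recovery ceiling, as stated in the PR description.
const RECOVERY_CEILING = 0.5;

// Takes the retry-heavy turns of every flagged capability and counts each
// distinct turn once, even when several capabilities share it.
function estimateTokensSaved(turnsByCapability: RetryHeavyTurn[][]): number {
  const tokensByTurnId = new Map<string, number>();
  for (const turns of turnsByCapability) {
    for (const turn of turns) tokensByTurnId.set(turn.id, turn.effectiveTokens);
  }
  let total = 0;
  tokensByTurnId.forEach((tokens) => {
    total += tokens;
  });
  return Math.round(total * RECOVERY_CEILING);
}

// Skill reviewer and MCP server ci share the same 3 retry-heavy turns.
const shared: RetryHeavyTurn[] = [
  { id: "t1", effectiveTokens: 1000 },
  { id: "t2", effectiveTokens: 1000 },
  { id: "t3", effectiveTokens: 1000 },
];
console.log(estimateTokensSaved([shared, shared])); // 1500, not 3000
```

Summing per capability instead would report 3 * 1000 * 0.5 twice, which is the inflated 3,000 figure the shared-turn accounting is meant to avoid.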

Validation

I validated the behavior by running the real detectCapabilityReliability() export against controlled ProjectSummary fixtures via npx tsx --eval. This exercises the actual detector path and prints the expected edge cases.

{
  "skill_retry_report": {
    "title": "1 skill correlates with retry-heavy edits",
    "tokensSaved": 1500,
    "expectedTokensSaved": 1500,
    "includesSkill": true,
    "proof": "5 edit turns using Skill reviewer; 3 retry-heavy turns at 1,000 effective tokens each; shared recovery ceiling is 50%, so 3*1000*0.5 = 1,500 tokens"
  },
  "mcp_retry_report": {
    "title": "1 MCP server correlates with retry-heavy edits",
    "tokensSaved": 1500,
    "expectedTokensSaved": 1500,
    "includesMcpServer": true,
    "proof": "same retry pattern attributed from mcp__ci__run to MCP server ci"
  },
  "shared_mcp_skill_turn_cap": {
    "title": "2 MCP/skill capabilities correlate with retry-heavy edits",
    "tokensSaved": 1500,
    "expectedTokensSaved": 1500,
    "includesBoth": true,
    "proof": "MCP server ci and Skill reviewer share the same 3 retry-heavy turns; tokensSaved stays 1,500 instead of doubling to 3,000"
  },
  "healthy_guard": {
    "finding": null,
    "proof": "1/5 retry-heavy edit turns is below the 50% retry-rate threshold"
  },
  "subcategory_guard": {
    "finding": null,
    "proof": "turn.subCategory without actual call.skills metadata does not create a skill reliability finding"
  },
  "readonly_guard": {
    "finding": null,
    "proof": "read-only retry turns are ignored because the detector is scoped to edit reliability"
  }
}

What this proves:

A retry-heavy skill emits a skill reliability finding with the expected 1,500 token estimate.
A retry-heavy MCP tool invocation is attributed to its MCP server namespace.
A shared MCP+skill retry pattern reports both capabilities but keeps tokensSaved at 1,500, not the inflated 3,000.
Healthy capabilities below the retry-rate threshold emit no finding.
turn.subCategory alone does not create a skill finding.
Read-only retry turns are ignored.

Supporting checks:

./node_modules/.bin/tsc --noEmit --pretty false
npx vitest run tests/optimize.test.ts — 82 tests passed
npm run build
npm test -- --run — 62 files / 877 tests passed
git diff --check
Claude Opus 4.7 (max effort) review returned PASS
Gemini 3.1 Pro Preview review returned PASS

Notes

The finding text explicitly says this is correlation, not proof of causation.
The suggested fix is a prompt to audit the retry-heavy capability; it does not edit MCP or skill config automatically.
This PR is independent of "Add MCP and skill ROI optimize insights" (#354) and "Add MCP project profile advisor" (#356); it is based on current main and uses the existing parsed turn metadata.

Add MCP skill reliability optimizer

8945f5c

ozymandiashh marked this pull request as ready for review May 18, 2026 23:21

ozymandiashh mentioned this pull request May 18, 2026

Add MCP project profile advisor #356

Open

ozymandiashh wants to merge 1 commit into getagentseal:main from ozymandiashh:codex/mcp-skill-reliability
