Add MCP and skill reliability report #357

Open
ozymandiashh wants to merge 1 commit into getagentseal:main from ozymandiashh:codex/mcp-skill-reliability

@ozymandiashh
Contributor

Summary

  • Adds an MCP/skill reliability report to codeburn optimize.
  • Flags MCP servers and skills whose edit turns are disproportionately retry-heavy.
  • Uses turn-level capability evidence and shared-turn token accounting so one retry-heavy turn is not counted twice when both an MCP server and a skill appear together.

Why

Existing optimize findings can show broad MCP waste, context bloat, and low-worth sessions, but they do not answer a capability-level reliability question:

When a specific MCP server or skill is used during edit work, does that work repeatedly need retries?

This is useful for MCP and skill tuning because the right action is often not removal. A retry-heavy skill may need clearer instructions. A retry-heavy MCP server may need narrower project scope, a smaller tool set, or better usage guidance. The finding is intentionally framed as correlation, not causation, so users inspect the sessions before changing config.

For example:

  • Skill reviewer appears in 5 edit turns, and 3 of those edit turns need retries.
  • mcp__ci__run maps to MCP server ci, and the same 3/5 edit-turn retry pattern appears there.
  • If both the skill and MCP server appear in the same retry-heavy turns, the savings estimate counts those turns once instead of reporting two independent piles of waste.
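The shared-turn accounting in the last bullet can be sketched as follows. This is a minimal illustration, not the code in src/optimize.ts: the `RetryHeavyTurn` and `FlaggedCapability` shapes, the `turnId` field, and the `estimateTokensSaved` helper are all hypothetical names chosen for the example. Only the 50% ceiling and the count-once behavior come from the description.

```typescript
// Hypothetical shapes for illustration; the real codeburn types may differ.
type RetryHeavyTurn = { turnId: string; effectiveTokens: number };
type FlaggedCapability = { label: string; retryHeavyTurns: RetryHeavyTurn[] };

// The PR describes a 50% ceiling on recoverable tokens.
const RECOVERY_CEILING = 0.5;

// Sum each retry-heavy turn once, even if several capabilities share it.
function estimateTokensSaved(capabilities: FlaggedCapability[]): number {
  const tokensByTurn = new Map<string, number>();
  for (const capability of capabilities) {
    for (const turn of capability.retryHeavyTurns) {
      if (!tokensByTurn.has(turn.turnId)) {
        tokensByTurn.set(turn.turnId, turn.effectiveTokens);
      }
    }
  }
  let total = 0;
  tokensByTurn.forEach((tokens) => {
    total += tokens;
  });
  return Math.round(total * RECOVERY_CEILING);
}

// Three retry-heavy turns at 1,000 effective tokens each, seen by both capabilities.
const sharedTurns: RetryHeavyTurn[] = [
  { turnId: "t1", effectiveTokens: 1000 },
  { turnId: "t2", effectiveTokens: 1000 },
  { turnId: "t3", effectiveTokens: 1000 },
];
const skillOnly: FlaggedCapability[] = [
  { label: "Skill reviewer", retryHeavyTurns: sharedTurns },
];
const skillAndMcp: FlaggedCapability[] = [
  { label: "MCP server ci", retryHeavyTurns: sharedTurns },
  { label: "Skill reviewer", retryHeavyTurns: sharedTurns },
];

console.log(estimateTokensSaved(skillOnly)); // 1500
console.log(estimateTokensSaved(skillAndMcp)); // 1500, not 3000
```

Keying on a turn identifier rather than summing per capability is what keeps the estimate from doubling when a skill and an MCP server appear in the same turns.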

What changed

  • Adds detectCapabilityReliability() to src/optimize.ts.
  • Aggregates per-capability edit reliability from existing parsed turn data:
    • MCP server evidence comes from call.mcpTools, normalized from names like mcp__ci__run to server ci
    • skill evidence comes from call.skills
    • turn.subCategory is deliberately not treated as skill evidence, avoiding legacy or classifier-derived false labels
  • Emits a finding only when a capability has enough evidence:
    • at least 5 edit turns
    • at least 3 retry-heavy edit turns
    • at least 50% of edit turns are retry-heavy
  • Ignores read-only turns so the report stays focused on edit reliability.
  • Estimates recoverable tokens as at most 50% of the effective tokens spent in retry-heavy edit turns.
  • De-duplicates shared retry-heavy turns across all flagged capabilities before computing tokensSaved.
  • Adds focused regression coverage in tests/optimize.test.ts.
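The evidence rules and thresholds listed above could look roughly like this. It is a sketch under stated assumptions, not the actual detectCapabilityReliability() implementation: the `Call` and `Turn` shapes, the `isEdit` and `retryHeavy` flags, and the helper names are hypothetical, and how a turn is classified as retry-heavy is left out because the description does not define it. Only the `mcp__ci__run` to `ci` mapping and the three thresholds are taken from the list.

```typescript
// Hypothetical shapes; the real parsed turn data in codeburn may differ.
type Call = { mcpTools: string[]; skills: string[] };
type Turn = { isEdit: boolean; retryHeavy: boolean; calls: Call[] };
type Stats = { editTurns: number; retryHeavyEditTurns: number };

// Thresholds as listed in the PR description.
const MIN_EDIT_TURNS = 5;
const MIN_RETRY_HEAVY_TURNS = 3;
const MIN_RETRY_RATE = 0.5;

// Maps "mcp__ci__run" to "ci". Assumes the mcp__<server>__<tool> naming pattern.
function mcpServerFromToolName(toolName: string): string | null {
  const parts = toolName.split("__");
  if (parts.length < 3 || parts[0] !== "mcp" || parts[1] === "") return null;
  return parts[1];
}

// Collects capability labels from call-level evidence only (no subCategory).
function capabilitiesInTurn(turn: Turn): Set<string> {
  const found = new Set<string>();
  for (const call of turn.calls) {
    for (const tool of call.mcpTools) {
      const server = mcpServerFromToolName(tool);
      if (server !== null) found.add(`MCP server ${server}`);
    }
    for (const skill of call.skills) found.add(`Skill ${skill}`);
  }
  return found;
}

function flaggedCapabilities(turns: Turn[]): string[] {
  const stats = new Map<string, Stats>();
  for (const turn of turns) {
    if (!turn.isEdit) continue; // read-only turns are ignored
    capabilitiesInTurn(turn).forEach((label) => {
      const s = stats.get(label) ?? { editTurns: 0, retryHeavyEditTurns: 0 };
      s.editTurns += 1;
      if (turn.retryHeavy) s.retryHeavyEditTurns += 1;
      stats.set(label, s);
    });
  }
  const flagged: string[] = [];
  stats.forEach((s, label) => {
    const enoughEvidence =
      s.editTurns >= MIN_EDIT_TURNS &&
      s.retryHeavyEditTurns >= MIN_RETRY_HEAVY_TURNS &&
      s.retryHeavyEditTurns / s.editTurns >= MIN_RETRY_RATE;
    if (enoughEvidence) flagged.push(label);
  });
  return flagged.sort();
}

// Each fixture turn uses both the ci MCP server and the reviewer skill.
const makeTurn = (retryHeavy: boolean, isEdit = true): Turn => ({
  isEdit,
  retryHeavy,
  calls: [{ mcpTools: ["mcp__ci__run"], skills: ["reviewer"] }],
});
const retryHeavyHistory = [true, true, true, false, false].map((r) => makeTurn(r));
const healthyHistory = [true, false, false, false, false].map((r) => makeTurn(r));
const readOnlyHistory = [true, true, true, true, true].map((r) => makeTurn(r, false));

console.log(flaggedCapabilities(retryHeavyHistory)); // [ 'MCP server ci', 'Skill reviewer' ]
console.log(flaggedCapabilities(healthyHistory)); // []
console.log(flaggedCapabilities(readOnlyHistory)); // []
```

Counting a capability at most once per turn (via the `Set`) matches the turn-level framing of the report: a turn that calls the same MCP server several times still contributes a single edit turn to that server's statistics.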

Validation

I validated the behavior by running the real detectCapabilityReliability() export against controlled ProjectSummary fixtures via npx tsx --eval. This exercises the actual detector path, and the output covers each expected edge case:

{
  "skill_retry_report": {
    "title": "1 skill correlates with retry-heavy edits",
    "tokensSaved": 1500,
    "expectedTokensSaved": 1500,
    "includesSkill": true,
    "proof": "5 edit turns using Skill reviewer; 3 retry-heavy turns at 1,000 effective tokens each; shared recovery ceiling is 50%, so 3*1000*0.5 = 1,500 tokens"
  },
  "mcp_retry_report": {
    "title": "1 MCP server correlates with retry-heavy edits",
    "tokensSaved": 1500,
    "expectedTokensSaved": 1500,
    "includesMcpServer": true,
    "proof": "same retry pattern attributed from mcp__ci__run to MCP server ci"
  },
  "shared_mcp_skill_turn_cap": {
    "title": "2 MCP/skill capabilities correlate with retry-heavy edits",
    "tokensSaved": 1500,
    "expectedTokensSaved": 1500,
    "includesBoth": true,
    "proof": "MCP server ci and Skill reviewer share the same 3 retry-heavy turns; tokensSaved stays 1,500 instead of doubling to 3,000"
  },
  "healthy_guard": {
    "finding": null,
    "proof": "1/5 retry-heavy edit turns is below the 50% retry-rate threshold"
  },
  "subcategory_guard": {
    "finding": null,
    "proof": "turn.subCategory without actual call.skills metadata does not create a skill reliability finding"
  },
  "readonly_guard": {
    "finding": null,
    "proof": "read-only retry turns are ignored because the detector is scoped to edit reliability"
  }
}

What this proves:

  • A retry-heavy skill emits a skill reliability finding with the expected 1,500 token estimate.
  • A retry-heavy MCP tool invocation is attributed to its MCP server namespace.
  • A shared MCP+skill retry pattern reports both capabilities but keeps tokensSaved at 1,500, not the inflated 3,000.
  • Healthy capabilities below the retry-rate threshold emit no finding.
  • turn.subCategory alone does not create a skill finding.
  • Read-only retry turns are ignored.

Supporting checks:

  • ./node_modules/.bin/tsc --noEmit --pretty false
  • npx vitest run tests/optimize.test.ts — 82 tests passed
  • npm run build
  • npm test -- --run — 62 files / 877 tests passed
  • git diff --check
  • Claude Opus 4.7 (max effort) review returned PASS
  • Gemini 3.1 Pro Preview review returned PASS


@ozymandiashh ozymandiashh marked this pull request as ready for review May 18, 2026 23:21