
Add MCP and skill ROI optimize insights #354

Open
ozymandiashh wants to merge 1 commit into getagentseal:main from ozymandiashh:codex/mcp-skill-roi

@ozymandiashh ozymandiashh commented May 18, 2026

Summary

  • Adds two new optimize findings for capability-level MCP/skill analysis:
    • low edit ROI for invoked MCP servers and skills in implementation-like turns
    • retry impact for capabilities whose edit turns retry materially more than same-category baseline turns
  • Keeps the findings diagnostic: CodeBurn asks users to inspect sessions, and to narrow or remove capabilities only after confirming that the sessions justify it.
  • Counts shared-turn token/cost/retry impact only once when multiple candidate capabilities appear in the same turn, so co-occurring MCP + skill usage cannot inflate ranking.

Why

optimize already reports unused MCP inventory and ghost skills, but until now it could not answer two higher-level questions:

  1. "This MCP server/skill is being used, but is it helping implementation work?"
  2. "When this capability is involved, does the agent need more retry loops than comparable work?"

These questions differ from ghost/unused checks: a capability can be invoked frequently and still be low-signal for edit-producing work, or correlate with repeated fix/test loops. The new findings surface that as a review signal without pretending correlation is causation.

What changed

  • Adds turn-keyed capability aggregation over parsed session summaries:
    • MCP capabilities are grouped by server from canonical mcp__<server>__<tool> names.
    • Skill capabilities come from parsed call.skills only; generic turn labels such as subCategory are not treated as skills.
    • Capabilities are counted once per turn even if the same MCP server/tool appears multiple times.
  • Adds a low-edit-ROI detector for implementation-like categories (coding, debugging, feature, refactoring, testing).
    • Example output shape:
      • MCP docs: 1/4 implementation turns produced edits (25% edit rate), $2.00 touched
      • skill api-review: 0/3 implementation turns produced edits (0% edit rate), $1.50 touched
  • Adds a retry-impact detector using same-category edit baselines.
    • Example output shape:
      • skill planner: 2.0 retries/edit turn vs 0.0 baseline in the same task categories (3 edit turns, baseline 3)
  • Reuses a single capability aggregation pass inside scanAndDetect, alongside existing MCP coverage aggregation.
  • Adds regression tests for:
    • ROI sample thresholds
    • MCP/skill findings together
    • per-turn MCP deduplication
    • shared-turn savings caps for co-occurring MCP + skill candidates
    • no subCategory false skill labels
    • same-category retry baselines
    • retry sample thresholds
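
To make the aggregation and the low-edit-ROI count above concrete, here is a minimal sketch. It is not the code in src/optimize.ts: the SketchTurn and SketchCall shapes, the `mcp:`/`skill:` key prefixes, and the function names are assumptions made for illustration, and the real detector also applies sample thresholds that are omitted here.

```typescript
// Hypothetical shapes; the real parsed session summary types may differ.
type SketchCall = { tools: string[]; skills?: string[] };
type SketchTurn = { category: string; hasEdits: boolean; subCategory?: string; calls: SketchCall[] };

const IMPLEMENTATION_CATEGORIES = new Set(["coding", "debugging", "feature", "refactoring", "testing"]);

// Return the server part of a canonical mcp__<server>__<tool> name, or null for other tools.
function mcpServer(toolName: string): string | null {
  const match = /^mcp__(.+?)__(.+)$/.exec(toolName);
  return match ? match[1] : null;
}

// Collect each capability at most once per turn. turn.subCategory is deliberately ignored.
function capabilitiesForTurn(turn: SketchTurn): Set<string> {
  const caps = new Set<string>();
  for (const call of turn.calls) {
    for (const tool of call.tools) {
      const server = mcpServer(tool);
      if (server) caps.add(`mcp:${server}`);
    }
    for (const skill of call.skills ?? []) caps.add(`skill:${skill}`);
  }
  return caps;
}

type RoiStats = { turns: number; editTurns: number };

// Count implementation-like turns and edit-producing turns per capability.
function editRoi(turns: SketchTurn[]): Map<string, RoiStats> {
  const stats = new Map<string, RoiStats>();
  for (const turn of turns) {
    if (!IMPLEMENTATION_CATEGORIES.has(turn.category)) continue;
    capabilitiesForTurn(turn).forEach((cap) => {
      const entry = stats.get(cap) ?? { turns: 0, editTurns: 0 };
      entry.turns += 1;
      if (turn.hasEdits) entry.editTurns += 1;
      stats.set(cap, entry);
    });
  }
  return stats;
}

const sampleTurns: SketchTurn[] = [
  { category: "coding", hasEdits: true, calls: [{ tools: ["mcp__docs__search", "mcp__docs__fetch"] }] },
  { category: "coding", hasEdits: false, calls: [{ tools: ["mcp__docs__search"] }] },
  { category: "feature", hasEdits: false, calls: [{ tools: ["mcp__docs__search"] }, { tools: ["mcp__docs__search"] }] },
  { category: "testing", hasEdits: false, subCategory: "frontend", calls: [{ tools: ["mcp__docs__fetch", "Read"] }] },
  { category: "conversation", hasEdits: false, calls: [{ tools: ["mcp__docs__search"] }] },
];

const roiStats = editRoi(sampleTurns);
const docsStats = roiStats.get("mcp:docs");
console.log(docsStats); // { turns: 4, editTurns: 1 }, i.e. a 25% edit rate
```

In this sketch the repeated `mcp__docs__*` calls inside one turn count once, the non-implementation turn is skipped, and the `frontend` subCategory never becomes a skill key.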

Validation

I validated the behavior by running the real src/optimize.ts detector exports against controlled ProjectSummary fixtures via npx tsx --eval. This goes beyond a green test run: the harness constructs the exact edge cases this PR is meant to handle and prints the detector output.

{
  "roi_shared_turn_cap": {
    "title": "2 MCP/skill capabilities with low edit ROI",
    "tokensSaved": 1650,
    "expectedTokensSaved": 1650,
    "containsMcpCombo": true,
    "containsSkillCombo": true,
    "proof": "3 shared non-edit turns * 2,200 effective tokens * 25%; not doubled to 3,300 when MCP and skill co-occur"
  },
  "retry_same_category_baseline_and_shared_turn_cap": {
    "title": "2 MCP/skill capabilities correlated with high retries",
    "tokensSaved": 3300,
    "expectedTokensSaved": 3300,
    "containsSameCategoryBaseline": true,
    "ignoresHighRetryDebuggingBaseline": true,
    "proof": "3 shared coding edit turns * 2,200 effective tokens * 50%; debugging retry turns do not contaminate coding baseline"
  },
  "subCategory_false_skill_guard": {
    "finding": null,
    "proof": "turn.subCategory=frontend with no call.skills and no MCP tools emits no capability finding"
  }
}
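
The arithmetic in the two proof strings above can be reproduced with a short sketch. It assumes the 25% and 50% figures act as a savings fraction applied to each turn's effective tokens; the CandidateTurn shape, the capability keys, and both function names are hypothetical and are not taken from src/optimize.ts.

```typescript
// Hypothetical shape: one turn touched by a candidate capability.
type CandidateTurn = { turnId: string; effectiveTokens: number };

// Naive version: sums per capability, so a turn shared by two candidates counts twice.
function naiveSavings(byCapability: Record<string, CandidateTurn[]>, fraction: number): number {
  let total = 0;
  for (const key of Object.keys(byCapability)) {
    for (const turn of byCapability[key]) total += turn.effectiveTokens * fraction;
  }
  return Math.round(total);
}

// Capped version: sums over the union of turn ids, so each shared turn counts once.
function cappedSavings(byCapability: Record<string, CandidateTurn[]>, fraction: number): number {
  const tokensByTurn = new Map<string, number>();
  for (const key of Object.keys(byCapability)) {
    for (const turn of byCapability[key]) tokensByTurn.set(turn.turnId, turn.effectiveTokens);
  }
  let total = 0;
  tokensByTurn.forEach((tokens) => {
    total += tokens * fraction;
  });
  return Math.round(total);
}

// Three turns in which an MCP server and a skill co-occur, 2,200 effective tokens each.
const sharedTurns: CandidateTurn[] = [
  { turnId: "t1", effectiveTokens: 2200 },
  { turnId: "t2", effectiveTokens: 2200 },
  { turnId: "t3", effectiveTokens: 2200 },
];
const comboFixture = { "mcp:combo": sharedTurns, "skill:combo": sharedTurns };

console.log(cappedSavings(comboFixture, 0.25)); // 1650
console.log(naiveSavings(comboFixture, 0.25)); // 3300 (double count)
console.log(cappedSavings(comboFixture, 0.5)); // 3300
console.log(naiveSavings(comboFixture, 0.5)); // 6600 (double count)
```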

What this proves:

  • ROI emits both the MCP and skill candidates, but shared non-edit turns are capped once: 1,650 tokens instead of the buggy 3,300 double count.
  • Retry impact emits both the MCP and skill candidates, but shared edit turns are capped once: 3,300 tokens instead of the buggy 6,600 double count.
  • Retry baseline is same-category: high-retry debugging edit turns are present in the fixture, but they do not contaminate the coding baseline.
  • subCategory alone does not create a fake skill frontend finding.
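
The same-category baseline can be illustrated with a small sketch. The EditTurn shape and the retryImpact function are assumptions for illustration only; the detector in this PR also enforces sample thresholds, which are left out here.

```typescript
// Hypothetical shape: one edit-producing turn with its retry count and capability keys.
type EditTurn = { category: string; retries: number; capabilities: string[] };

function meanRetries(turns: EditTurn[]): number {
  if (turns.length === 0) return 0;
  return turns.reduce((sum, turn) => sum + turn.retries, 0) / turns.length;
}

// Compare a capability's edit turns with edit turns that lack the capability,
// restricted to the categories in which the capability actually appears.
function retryImpact(turns: EditTurn[], capability: string) {
  const withCapability = turns.filter((t) => t.capabilities.includes(capability));
  const categories = new Set(withCapability.map((t) => t.category));
  const baseline = turns.filter(
    (t) => !t.capabilities.includes(capability) && categories.has(t.category),
  );
  return {
    capabilityRate: meanRetries(withCapability),
    baselineRate: meanRetries(baseline),
    editTurns: withCapability.length,
    baselineTurns: baseline.length,
  };
}

const editTurns: EditTurn[] = [
  { category: "coding", retries: 2, capabilities: ["skill:planner"] },
  { category: "coding", retries: 2, capabilities: ["skill:planner"] },
  { category: "coding", retries: 2, capabilities: ["skill:planner"] },
  { category: "coding", retries: 0, capabilities: [] },
  { category: "coding", retries: 0, capabilities: [] },
  { category: "coding", retries: 0, capabilities: [] },
  // High-retry debugging turns: excluded, because the planner skill never appears in debugging.
  { category: "debugging", retries: 5, capabilities: [] },
  { category: "debugging", retries: 5, capabilities: [] },
];

const plannerImpact = retryImpact(editTurns, "skill:planner");
console.log(plannerImpact);
// { capabilityRate: 2, baselineRate: 0, editTurns: 3, baselineTurns: 3 }
```

Had the baseline pooled all categories, the two debugging turns would have raised it to 2.0 retries per edit turn and hidden the difference.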

Supporting checks:

  • ./node_modules/.bin/tsc --noEmit --pretty false
  • npx vitest run tests/optimize.test.ts — 87 tests passed
  • npm run build
  • npm test -- --run — 62 files / 882 tests passed
  • git diff --check origin/main...HEAD
  • GitHub checks on this PR: assess, check, semgrep all passed
  • Claude Opus 4.7, effort max review:
    • first pass found shared-turn double-counting and subCategory skill-label risks
    • after fixing both, final review returned PASS
  • Gemini 3.1 Pro Preview review returned PASS on the amended diff

Notes

  • This intentionally does not auto-edit MCP or skill config. The findings are audit prompts because high retries or low edit rate can also reflect hard tasks, read-only investigation, or intentionally exploratory work.
  • Low MCP tool coverage still owns the "large server inventory is mostly unused" case; the ROI detector suppresses MCP servers already flagged there to avoid duplicate findings.
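
The suppression described in the last note could look roughly like the sketch below. The RoiCandidate shape and the suppressCoverageFlagged helper are hypothetical names, not exports of src/optimize.ts.

```typescript
// Hypothetical shape for an ROI candidate before findings are emitted.
type RoiCandidate = { kind: "mcp" | "skill"; name: string };

// Drop MCP candidates whose server is already flagged by the low-coverage finding.
// Skills are never suppressed here, even if a skill shares a name with a flagged server.
function suppressCoverageFlagged(
  candidates: RoiCandidate[],
  lowCoverageServers: ReadonlySet<string>,
): RoiCandidate[] {
  return candidates.filter((c) => !(c.kind === "mcp" && lowCoverageServers.has(c.name)));
}

const remainingCandidates = suppressCoverageFlagged(
  [
    { kind: "mcp", name: "docs" },
    { kind: "mcp", name: "bigserver" },
    { kind: "skill", name: "bigserver" },
  ],
  new Set(["bigserver"]),
);
console.log(remainingCandidates.map((c) => `${c.kind}:${c.name}`)); // [ 'mcp:docs', 'skill:bigserver' ]
```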

@ozymandiashh ozymandiashh marked this pull request as ready for review May 18, 2026 22:27