From 3aad53c56376bf89951843933dcd6e515b5ea295 Mon Sep 17 00:00:00 2001 From: Shahzaib Date: Wed, 25 Mar 2026 11:33:53 -0700 Subject: [PATCH 1/5] Add copilot review analyst skill --- .github/copilot-instructions.md | 1 + .../skills/copilot-review-analyst/SKILL.md | 195 ++++++++ ...e-Review-Effectiveness-Report-Outlook.html | 432 +++++++++++++++++ ...ilot-Code-Review-Effectiveness-Report.html | 411 ++++++++++++++++ ...opilot-Code-Review-Effectiveness-Report.md | 270 +++++++++++ .../references/account-map.json | 22 + .../references/classification-rules.md | 133 ++++++ .../references/manual-audit-template.json | 17 + .../references/report-formatting.md | 175 +++++++ .../scripts/analyze.ps1 | 347 ++++++++++++++ .../scripts/final-classification.ps1 | 376 +++++++++++++++ .../scripts/precise.ps1 | 450 ++++++++++++++++++ 12 files changed, 2829 insertions(+) create mode 100644 .github/skills/copilot-review-analyst/SKILL.md create mode 100644 .github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html create mode 100644 .github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report.html create mode 100644 .github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report.md create mode 100644 .github/skills/copilot-review-analyst/references/account-map.json create mode 100644 .github/skills/copilot-review-analyst/references/classification-rules.md create mode 100644 .github/skills/copilot-review-analyst/references/manual-audit-template.json create mode 100644 .github/skills/copilot-review-analyst/references/report-formatting.md create mode 100644 .github/skills/copilot-review-analyst/scripts/analyze.ps1 create mode 100644 .github/skills/copilot-review-analyst/scripts/final-classification.ps1 create mode 100644 .github/skills/copilot-review-analyst/scripts/precise.ps1 diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 678689e4..8bd49e68 100644 --- 
a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -130,6 +130,7 @@ For complex investigation tasks, use these skills (read the skill file for detai | **design-reviewer** | `.github/skills/design-reviewer/SKILL.md` | "address review comments", "handle my review", "review comments on" | | **pbi-dispatcher** | `.github/skills/pbi-dispatcher/SKILL.md` | "dispatch PBIs to agent", "assign to Copilot", "send work items to coding agent" | | **test-planner** | `.github/skills/test-planner/SKILL.md` | "create test plan", "write test cases", "add tests to ADO", "export test plan", "E2E tests for" | +| **copilot-review-analyst** | `.github/skills/copilot-review-analyst/SKILL.md` | "analyze Copilot reviews", "Copilot review effectiveness", "review analysis report", "how helpful are Copilot reviews" | ## 13. Azure DevOps Integration diff --git a/.github/skills/copilot-review-analyst/SKILL.md b/.github/skills/copilot-review-analyst/SKILL.md new file mode 100644 index 00000000..c2935799 --- /dev/null +++ b/.github/skills/copilot-review-analyst/SKILL.md @@ -0,0 +1,195 @@ +--- +name: copilot-review-analyst +description: Analyze GitHub Copilot code review effectiveness across Android Auth repositories. Collects all Copilot inline review comments via GitHub API, classifies them as helpful/not-helpful/unresolved through reply analysis, diff verification, and AI-assisted classification, then generates a report with per-repo and per-engineer statistics. Use this skill when asked to "analyze Copilot reviews", "measure Copilot review effectiveness", "generate Copilot review report", "how helpful are Copilot reviews", "run review analysis", or any request to measure/report on GitHub Copilot code review quality and adoption. +--- + +# Copilot Review Analyst + +Analyze GitHub Copilot code review effectiveness across the Android Auth repositories by collecting all inline review comments, classifying each one, and producing a comprehensive report. 
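The core classification idea can be sketched in a few lines. The sketch below is an illustrative Python reduction of the Phase 1 keyword matching performed by `scripts/analyze.ps1`; the keyword lists are examples taken from the report's quoted replies, not the script's exact lists:

```python
# Illustrative sketch only: the real classifier is the PowerShell in
# scripts/analyze.ps1, and its keyword lists may differ from these examples.
POSITIVE = ("good catch", "fixed", "addressed", "added unit test", "@copilot")
NEGATIVE = ("won't fix", "this is fine", "not applicable", "incorrect")

def classify_reply(reply):
    """Map a human reply (or None) to a Phase 1 bucket."""
    if not reply:
        return "no-response"
    text = reply.lower()
    positive = any(k in text for k in POSITIVE)
    negative = any(k in text for k in NEGATIVE)
    if positive and negative:
        return "mixed-response"
    if positive:
        return "helpful-acknowledged"
    if negative:
        return "unhelpful-dismissed"
    return "replied-unclear"
```

Anything that matches neither list falls into `replied-unclear` and is carried forward to the manual audit in Phase 3.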
+ +## Prerequisites + +- **GitHub CLI (`gh`)** authenticated with access to all target repos +- For public repos (common, msal): personal GitHub account +- For private repos (broker): EMU account (e.g. `shjameel_microsoft`) +- **Output directory**: `~/.copilot-review-analysis/` for final artifacts, `$env:TEMP\copilot-review-analysis\` for intermediate data + +## Repository Configuration + +Default repos (update in scripts if changed): + +| Label | Slug | Auth | +|-------|------|------| +| common | `AzureAD/microsoft-authentication-library-common-for-android` | Personal | +| msal | `AzureAD/microsoft-authentication-library-for-android` | Personal | +| broker | `identity-authnz-teams/ad-accounts-for-android` | EMU | + +## Analysis Pipeline + +The analysis runs in 5 sequential phases. Scripts and templates are bundled in this skill: +- **Scripts:** `scripts/` (3 core pipeline scripts) +- **Assets:** `assets/` (report templates — Markdown, HTML, Outlook HTML) +- **References:** `references/` (classification rules, report formatting guide) + +### Phase 1: Data Collection + Keyword Classification + +**Script:** `scripts/analyze.ps1` + +Run to collect all Copilot inline review comments from human-authored PRs: + +```powershell +# Default: last 60 days +.\.github\skills\copilot-review-analyst\scripts\analyze.ps1 + +# Custom date range: +.\.github\skills\copilot-review-analyst\scripts\analyze.ps1 -StartDate "2026-01-23" +``` + +**Parameters:** +- `-StartDate` — Start date for PR search (default: 60 days ago). Format: `YYYY-MM-DD` +- `-OutputDir` — Output directory (default: `$env:TEMP\copilot-review-analysis`) + +What it does: +1. Fetch all PRs created after `-StartDate` via `gh pr list` +2. Filter out bot-authored PRs (Copilot, copilot-swe-agent, dependabot, github-actions) +3. For each PR, call `repos/{slug}/pulls/{prNum}/comments` to get inline comments +4. Filter to Copilot comments (user.login = "Copilot") that are top-level (not replies) +5. 
Find human replies to each Copilot comment (matched via `in_reply_to_id`) +6. Classify replies via keyword matching into: `helpful-acknowledged`, `unhelpful-dismissed`, `mixed-response`, `replied-unclear`, `no-response` + +**Outputs:** +- `$env:TEMP\copilot-review-analysis\raw_results.json` — all comments with initial classification +- `$env:TEMP\copilot-review-analysis\review_summaries.json` — PR-level summary comments (for reference, not classified) + +### Phase 2: Diff-Level Verification + +**Script:** `scripts/precise.ps1` + +For every `no-response` comment, verify whether the engineer silently acted on the feedback: + +```powershell +.\.github\skills\copilot-review-analyst\scripts\precise.ps1 +``` + +What it does: +1. Load `raw_results.json`, filter to `no-response` comments +2. For each comment, get the commit SHA it was left on and the PR head SHA +3. Use `repos/{slug}/compare/{commitA}...{commitB}` to get the diff +4. For **suggestion blocks**: extract code tokens, check if they appear as `+` lines in the diff +5. For **prose comments**: check if diff hunk line ranges overlap the comment's line range (±5 line tolerance) + +**Verdicts assigned:** +- `suggestion-applied` — suggestion tokens match diff + lines overlap +- `suggestion-likely-applied` — tokens match but lines don't overlap +- `exact-lines-modified` — prose comment's lines were modified +- `lines-modified-different-fix` — nearby lines modified, different code +- `file-changed-elsewhere` — file modified but at different lines +- `file-changed-no-line-info` — file modified but comment had no line number +- `file-not-changed` — file untouched after the comment +- `no-subsequent-commits` — PR merged without any commits after the review + +**Output:** `$env:TEMP\copilot-review-analysis\precise.json` + +### Phase 3: AI-Assisted Reply Classification + +This phase is **manual** — performed by the agent (you) in conversation. 
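The first step of this manual pass is mechanical and can be sketched as below. The field names (`Classification`, `HumanReplyText`) are assumptions mirroring the Phase 1 output; check `raw_results.json` for the actual schema:

```python
# Illustrative sketch: select the comments that need a human/agent verdict.
# The field name "Classification" is an assumption about the Phase 1 schema.
def ambiguous_comments(raw_results):
    """Return the comments classified as replied-unclear in Phase 1."""
    return [c for c in raw_results
            if c.get("Classification") == "replied-unclear"]

# Works the same whether the list comes from json.load() or memory:
sample = [
    {"Classification": "helpful-acknowledged", "HumanReplyText": "Fixed."},
    {"Classification": "replied-unclear", "HumanReplyText": "Hmm, maybe."},
]
pending = ambiguous_comments(sample)  # one comment left to judge by hand
```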
Read the `replied-unclear` comments from `raw_results.json` and classify each one based on the reply text. + +See [references/classification-rules.md](references/classification-rules.md) for the full classification hierarchy and patterns. + +**Process:** +1. Load `raw_results.json` and filter to `Classification -eq "replied-unclear"` +2. For each comment, read the `HumanReplyText` and `CommentBody` +3. Apply the classification cascade from the rules reference +4. Group results: which are helpful (acknowledged action), which are not helpful (explained away / dismissed), which are genuinely ambiguous +5. For genuinely ambiguous ones, apply domain context judgment +6. Also review `file-changed-elsewhere` and `file-changed-no-line-info` verdicts from Phase 2 to identify re-audit flips + +**Output:** Write results to `$env:TEMP\copilot-review-analysis\manual-audit.json` using the template at `references/manual-audit-template.json`. This file is consumed by Phase 4. + +### Phase 4: Final Classification + +**Script:** `scripts/final-classification.ps1` + +Merge all results into a single authoritative dataset: + +```powershell +.\.github\skills\copilot-review-analyst\scripts\final-classification.ps1 ` + -AccountMapFile ".github\skills\copilot-review-analyst\references\account-map.json" ` + -ManualAuditFile "$env:TEMP\copilot-review-analysis\manual-audit.json" +``` + +**Parameters:** +- `-OutputDir` — Directory with `raw_results.json` and `precise.json` (default: `$env:TEMP\copilot-review-analysis`) +- `-AccountMapFile` — Path to JSON mapping GitHub logins to display names. See `references/account-map.json` for the current team. If omitted, raw GitHub logins are used. +- `-ManualAuditFile` — Path to JSON with Phase 3 manual audit decisions. See `references/manual-audit-template.json` for the schema. If omitted, all ambiguous comments default to "not-helpful". + +What it does: +1. Load `raw_results.json` and `precise.json` +2. 
Load account mapping and manual audit decisions from external JSON files +3. Classify every comment using the full hierarchy: + - Replied + positive → helpful + - Replied + negative → not-helpful + - Replied + delegated (@copilot) → helpful + - Replied + acknowledged action → helpful + - Replied + explained-away → not-helpful + - Replied + genuinely unclear → check manual audit file, else not-helpful + - No response + suggestion-applied/exact-lines-modified → helpful + - No response + file-changed-elsewhere → check re-audit list from manual audit file + - No response + file-not-changed/no-subsequent-commits → not-helpful (**note: conservative — see Key Principle below**) +4. Produce per-engineer and per-repo statistics + +**Output:** `$env:TEMP\copilot-review-analysis\final_classification.json` + +### Phase 5: Report Generation + +Generate both Markdown and Outlook-compatible HTML reports. + +**Style/structure references** (in `assets/` — these contain data from the Jan-Mar 2026 analysis and serve as structural templates, NOT to be copied verbatim): +- `assets/Copilot-Code-Review-Effectiveness-Report.md` — Markdown reference +- `assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html` — Outlook HTML reference +- `assets/Copilot-Code-Review-Effectiveness-Report.html` — Standard HTML reference + +**Important:** The asset templates contain hardcoded numbers (557 comments, specific percentages, engineer names, etc.) from the first analysis. For each new run, generate fresh reports using the same section structure and formatting patterns but with statistics computed from `final_classification.json`. + +**Generate two versions of each report:** +1. **Team-internal** — uses real engineer names (from account map). For the team. +2. **Org-wide** — anonymizes engineers as "Engineer A", "Engineer B", etc., sorted by helpfulness descending. For sharing outside the team. + +**Process:** +1. Load `final_classification.json` +2. 
Compute aggregate statistics (total, per-repo, per-engineer) +3. Generate reports using the section structure and formatting from the asset templates +4. Collect notable examples for "What Copilot Is Good At" and "What Copilot Struggles With" +5. Save to `~/.copilot-review-analysis/`: + - `Copilot-Code-Review-Effectiveness-Report.md` (team, real names) + - `Copilot-Code-Review-Effectiveness-Report-Anonymous.md` (org-wide) + - `Copilot-Code-Review-Effectiveness-Report-Outlook.html` (team, real names) + - `Copilot-Code-Review-Effectiveness-Report-Outlook-Anonymous.html` (org-wide) + +See [references/report-formatting.md](references/report-formatting.md) for the report structure and Outlook HTML formatting rules. + +## Key Principle: "Unresolved" ≠ "Not Helpful" + +Comments with no reply and no diff evidence are **Unresolved**, not assumed unhelpful. This is a critical distinction: +- **Confirmed Helpful** = positive evidence (explicit acknowledgment OR verified fix in diff) +- **Confirmed Not Helpful** = positive evidence (explicit dismissal with stated reason OR comment on stale code) +- **Unresolved** = insufficient evidence either way (engineer never engaged) + +The `final-classification.ps1` script classifies no-response/no-diff-evidence comments as "not-helpful" for conservative stats, but the report should present the three-way breakdown to be honest about uncertainty. + +## Copilot Comment Identification + +- Copilot inline review comments use `user.login = "Copilot"` +- Legacy bot: `copilot-pull-request-reviewer[bot]` +- Bot PR authors to exclude: `app/copilot-swe-agent`, `dependabot[bot]`, `github-actions[bot]` +- Only count top-level comments (`in_reply_to_id` is null/0), not Copilot's own replies + +## Rate Limiting + +The scripts call the GitHub API heavily. 
Built-in mitigations: +- PR comment caching (fetched once per PR) +- Diff caching (fetched once per commit range) +- Sleep every 15 PRs (300ms) and every 25 diff checks (200ms) +- Use `--paginate` for repos with many comments + +If hitting rate limits, increase sleep intervals or use `GH_TOKEN` with higher rate limits. diff --git a/.github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html b/.github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html new file mode 100644 index 00000000..4963fed1 --- /dev/null +++ b/.github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html @@ -0,0 +1,432 @@ + + + + + + +Copilot Code Review Effectiveness Analysis + +
+ +

Copilot Code Review Effectiveness Analysis

+

Android Auth Platform  |  January 23 – March 23, 2026  |  Common, MSAL, Broker Repositories

+ + +

 

Background
+ +

The Android Auth Platform team builds and maintains the authentication libraries used by Microsoft's mobile apps — including MSAL (client-side), Common (shared IPC and utilities), and Broker (the brokered authentication service running on Android devices). Our codebase spans three repositories with 10 active contributors.

+ +

Earlier this year, we enabled GitHub Copilot code reviews across all three repositories. We also added custom review instructions via copilot-instructions.md to give the AI context about our architecture, coding conventions, and multi-repo structure — aiming to make its feedback more relevant than generic suggestions.

+ +

This report answers the question: Is Copilot code review actually useful for our team? We looked at every comment Copilot left over 2 months and determined — through reply analysis, commit diff verification, and AI-assisted classification of ambiguous cases — whether each piece of feedback led to a real code improvement, was dismissed as irrelevant, or was simply never evaluated by the engineer.

+ + +

 

At a Glance
+ +

We analyzed every inline code review comment left by GitHub Copilot on human-authored pull requests across our three Android Auth repositories over the past two months. For each of the 557 comments, we determined whether the feedback led to a concrete code improvement.

+ + + + + + + + + +
+ + +
+
57%
+
of comments received
no response
+
+
+ + +
+
41%
+
confirmed
helpful
+
+
+ + +
+
18%
+
confirmed
not helpful
+
+
+ + +
+
41%
+
unresolved
(inconclusive)
+
+
+ + + + + + +
+The biggest finding isn't about AI quality — it's about adoption. +Of the 557 comments Copilot left, only 239 (43%) received any response from an engineer — the other 318 were never acknowledged. Of those 239 responded-to comments, 144 (60%) led to confirmed code improvements. That's a strong signal: when engineers read and evaluate Copilot's feedback, the majority of it is useful. The challenge is that 57% of comments are never evaluated at all, leaving 226 (41%) in an unresolved state where we can't determine whether they were helpful or not. +
+ +

Scope

+ + + + + + + + +
+ + +
+
163
+
Human PRs
scanned
+
+
+ + +
+
113
+
PRs received
Copilot review (69%)
+
+
+ + +
+
557
+
Inline review
comments
+
+
+ + +
+
4.9
+
Avg comments
per reviewed PR
+
+
+ + +

 

Overall Results
+ +

Engineer Response Rate

+

Before looking at helpfulness, it's important to understand how engineers interact with Copilot reviews — because a comment can only demonstrate value if someone reads it.

+ + + + + + + +
43% replied57% ignored
+ + + + + + + +
 
Engineer replied (239)
 
No response (318)
+ +

More than half of all Copilot review comments receive no human response. The majority of AI feedback enters a void — it may be valid, but we can never confirm its value if no one engages with it.

+ +

Helpfulness Verdict

+ + + + + + + + +
41.3% Helpful18.1%40.6% Unresolved
+ + + + + + + + + +
 
Confirmed Helpful (230)
 
Confirmed Not Helpful (101)
 
Unresolved (226)
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
VerdictCount%Definition
Confirmed Helpful23041.3%The comment led to a code change — engineer explicitly acknowledged it, or the suggested fix was verified in a subsequent commit diff.
Confirmed Not Helpful10118.1%The engineer explicitly dismissed the comment with a reason why the feedback was incorrect, irrelevant, or by design.
Unresolved22640.6%No reply and no definitive diff evidence. All 226 are comments where the engineer did not respond. Includes comments on the final commit before merge, files modified at different lines, and comments with no line number. Could be valid feedback that was never evaluated.
+ + + + + + +
+226 comments (41%) are unresolved — not because the AI was wrong, but because no one engaged. +All 226 are comments that received no reply from the engineer. This includes 122 comments on the final commit before merge (where the engineer had no subsequent commits), 45 where the file was modified but at different lines, 50 where the comment had no line number for verification, and 9 where the file was never modified. If engineers had responded — even just to dismiss — we would know whether the feedback was useful. +
+ +

How Helpful Comments Were Delivered

+ + + + + + + + + +
PathCountDescription
Engineer replied and acknowledged144"good catch", "fixed", "addressed", "added unit test", "@copilot apply changes"
Engineer silently applied the fix86No reply, but suggestion code or exact line range verified as modified in subsequent commit
Total Confirmed Helpful230
+ +

How Not Helpful Comments Were Identified

+ + + + + + + + + +
PathCountDescription
Engineer replied and dismissed95"won't fix", "this is fine", "not applicable", "Copilot is incorrect"
Comment on stale/outdated code6Comment was on code already changed in a different commit
Total Confirmed Not Helpful101
+ + +

 

Results by Repository
+ + +

Broker (293 comments)

+ + + + + + +
49%16%35%
+ + +

Common (188 comments)

+ + + + + + +
36%18%46%
+ + +

MSAL (76 comments)

+ + + + + + +
26%25%49%
+ + + + + + + + + + +
 
Helpful
 
Not Helpful
 
Unresolved
+ + + + + + + + + + + +
RepositoryCommentsResponse RateHelpfulNot HelpfulUnresolved
Broker29356.0%142 (48.5%)48 (16.4%)103 (35.2%)
Common18829.8%68 (36.2%)34 (18.1%)86 (45.7%)
MSAL7625.0%20 (26.3%)19 (25.0%)37 (48.7%)
+ + +

 

Results by Engineer
+

Each engineer has two GitHub accounts (personal + EMU). These have been merged. Names are anonymized.

+ + + + + + + + + + + + + + + + + + + + + + +
EngineerCommentsRepliedResponse RateHelpfulNot HelpfulUnresolvedHelpfulness
Engineer A2020100%146070.0%
Engineer B837590.4%5726068.7%
Engineer C15426.7%82553.3%
Engineer D403075.0%1914747.5%
Engineer E1103733.6%44184840.0%
Engineer F992020.2%36115236.4%
Engineer G632234.9%20123131.7%
Engineer H1002424.0%27126127.0%
Engineer I24625.0%501920.8%
Engineer J3133.3%0030.0%
+ + + + + +
+Engagement drives value — and visibility. +Engineer A (100% response rate) and Engineer B (90%) have the highest helpfulness and zero unresolved comments. When you engage, you know exactly what the AI got right and wrong. Engineer F (20%) and Engineer H (24%) have 52 and 61 unresolved comments respectively — over half their feedback goes into a black hole. +
+ + +

 

What Copilot Is Good At
+ +

Catching real bugs:

+
+PR #3050 (Common): Copilot flagged that "$it" string wrapping doesn't JSON-escape the content, which could break consumers.
+Engineer reply: "You're right. Making the change." +
+ +

Stale documentation and naming inconsistencies:

+
+PR #64 (Broker): Copilot identified four locations where KDoc still referenced the old flight constant after it was renamed. All four were silently fixed in the next commit. +
+ +

CI/pipeline configuration issues:

+
+PR #3038 (Common): Copilot warned that using vmImage: 'windows-latest' makes the CD pipeline non-deterministic. The engineer changed to a pinned image version. +
+ +

The @copilot apply workflow:

+
+In 16 instances (2.9%), engineers validated the feedback and delegated the fix back: @copilot open a new pull request to apply changes based on [this feedback]. An efficient pattern in which the AI both identifies the issue and fixes it. +
+ + +

 

What Copilot Struggles With
+ +

Lacking domain context:

+
+Copilot: "shared_device_id could be used for tracking — consider hashing before emission."
+Engineer: "The shared_device_id is a random UUID and is not PII." +
+ +

Suggesting tests for trivial code:

+
+Copilot: "New telemetry attributes lack test coverage..."
+Engineer: "These are just telemetry related changes and adding unit tests will be overdo here." +
+ +

Misunderstanding APIs:

+
+Copilot: "00000003-0000-0ff1-ce00-000000000000 is the resource ID for Microsoft Graph, not SharePoint Online."
+Engineer: "00000003-0000-0ff1-ce00-000000000000 is SharePoint Online." +
+ + +

 

Key Takeaways
+ + + + + + + + + + +
#Finding
157% of Copilot review comments receive no response from engineers. The majority of AI review feedback is never acknowledged. Of the ignored comments, only 27% are silently addressed; 71% are unresolved, and the remaining 2% were comments on stale code.
241% of all comments led to a confirmed code improvement. But 41% are unresolved, meaning the true helpfulness rate lies between 41% (floor) and 82% (ceiling). We cannot narrow this range without engineer engagement.
3When engineers engage, 60% of comments are helpful. This suggests the AI review quality itself is decent — the bottleneck is adoption, not accuracy.
4Only 18% of comments are confirmed not helpful. When we restrict "not helpful" to comments with actual evidence of poor quality, the rate is surprisingly low.
5Engagement is the strongest predictor of value. Engineers who reply to 75%+ of comments see 47-70% helpfulness with few or no unresolved comments. Engineers who reply to under 35% typically see 20-40% helpfulness with large unresolved buckets.
638% of ignored comments are on the final commit before merge. These 122 comments represent the last review round being skipped — the feedback had zero chance of impact regardless of its quality.
7Broker gets the most value (49%), Common is middling (36%), MSAL is lowest (26%). This correlates with response rate: Broker 56%, Common 30%, MSAL 25%.
+ + +

 

How We Measured This
+ + + + + + + + +
PhaseMethod
1. Data collectionGitHub API extraction of all 557 Copilot inline review comments from 163 human-authored PRs. Excluded PRs by Copilot coding agent.
2. Reply classificationAutomated keyword matching on 239 replied comments to classify as positive, negative, or ambiguous.
3. Diff verificationFor 318 unreplied comments, used GitHub compare API to check if suggestion tokens appeared as diff additions, or if exact line ranges were modified.
4. AI-assisted classificationAll 133 ambiguous replies were read and classified by GitHub Copilot, based on reply text and domain context. These classifications were reviewed by the report author but not independently verified by the original PR engineers.
5. Cross-validationAll initially "not helpful" classifications were re-examined against diff evidence. 18 were reclassified where evidence was strong.
+ +

Classification rules: "Confirmed Helpful" requires positive evidence (engineer acknowledgment or verified code change). "Confirmed Not Helpful" requires positive evidence of poor quality (explicit engineer dismissal). Comments where the engineer did not engage are classified as "Unresolved" rather than assumed unhelpful.

+ + +

 

What We'd Recommend
+ +

Based on this analysis, we see three opportunities:

+ +

1. Engage with the feedback. The single biggest improvement would be for engineers to respond to Copilot's review comments — even if just to dismiss them. When engineers engage, 60% of comments are helpful. When they don't, 71% of those comments become a black hole of unknown value. A quick "not applicable" is more useful than silence — it tells us whether the AI is giving good feedback, and it builds the data we need to improve the review instructions.

+ +

2. Review before merging. 38% of ignored comments were on the final commit — the engineer merged without looking at Copilot's last round. Building a habit of checking review comments before clicking merge would give this feedback a chance to have impact.

+ +

3. Improve the review instructions. The most common dismissal reasons — "lacks domain context," "suggests tests for trivial code," "misunderstands our APIs" — are things we can address by refining our copilot-instructions.md. Adding guidance like "don't suggest tests for telemetry-only changes" or providing context about specific domain patterns could cut the noise and improve the signal-to-noise ratio for everyone.

+ + + + +
+

Analysis conducted March 23–24, 2026. Data covers all PRs created January 23 – March 23, 2026 in the Common, MSAL, and Broker repositories.

+

Raw data (557 comments with full text, replies, diff evidence, and verdicts) available for independent verification.

+
+ +
+ + + + + + + + + + + + + + + + + + diff --git a/.github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report.html b/.github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report.html new file mode 100644 index 00000000..72759b00 --- /dev/null +++ b/.github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report.html @@ -0,0 +1,411 @@ + + + + + +Copilot Code Review Effectiveness Analysis + + + + +

Copilot Code Review Effectiveness Analysis

+

Android Auth Platform  |  January 23 – March 23, 2026  |  Common, MSAL, Broker Repositories

+ + +

Executive Summary

+ +

We analyzed every inline code review comment left by GitHub Copilot on human-authored pull requests across our three Android Auth repositories over the past two months. For each of the 557 comments, we determined whether the feedback led to a concrete code improvement.

+ +
+
+
57%
+
of comments received
no response
+
+
+
41%
+
confirmed
helpful
+
+
+
18%
+
confirmed
not helpful
+
+
+
41%
+
unresolved
(not evaluated)
+
+
+ +
+ The biggest finding isn't about AI quality — it's about adoption. + When engineers do engage with Copilot's comments, 60% turn out to be helpful. But 57% of comments receive no response at all, and 41% of all comments are unresolved — we can't tell if they were useful because no one evaluated them. The true helpfulness rate lies between 41% (confirmed floor) and 82% (if all unresolved were helpful). +
+ + +

Scope

+ +
+
+
163
+
Human PRs
scanned
+
+
+
113
+
PRs received
Copilot review (69%)
+
+
+
557
+
Inline review
comments
+
+
+
4.9
+
Avg comments
per reviewed PR
+
+
+ + +

Overall Results

+ +

Engineer Response Rate

+

Before looking at helpfulness, it's important to understand how engineers interact with Copilot reviews — because a comment can only demonstrate value if someone reads it.

+ +
+
43% replied
+
57% ignored
+
+
+ Engineer replied (239) + No response (318) +
+ +

More than half of all Copilot review comments receive no human response. The majority of AI feedback enters a void — it may be valid, but we can never confirm its value if no one engages with it.

+ +

Helpfulness Verdict

+ +
+
41.3%
+
18.1%
+
40.6%
+
+
+ Confirmed Helpful (230) + Confirmed Not Helpful (101) + Unresolved (226) +
+ + + + + + +
VerdictCount%Definition
Confirmed Helpful23041.3%The comment led to a code change — engineer explicitly acknowledged it, or the suggested fix was verified in a subsequent commit diff.
Confirmed Not Helpful10118.1%The engineer explicitly dismissed the comment with a reason why the feedback was incorrect, irrelevant, or by design.
Unresolved22640.6%No reply and no definitive diff evidence. Includes comments on the final commit before merge, files modified at different lines, and comments with no line number. Could be valid feedback that was never evaluated.
+ +
+ 226 comments (41%) are unresolved — not because the AI was wrong, but because no one looked. + This includes 122 comments on the final commit before merge where the engineer had no subsequent commits, 45 where the file was modified at different lines, 50 with no line number for verification, and 9 where the file was never modified. If engineers had engaged — even just to dismiss — we would know. +
+ +

How Helpful Comments Were Delivered

+ + + + + +
PathCountDescription
Engineer replied and acknowledged144"good catch", "fixed", "addressed", "added unit test", "@copilot apply changes"
Engineer silently applied the fix86No reply, but suggestion code or exact line range verified as modified in subsequent commit
Total Confirmed Helpful230
+ +

How Not Helpful Comments Were Identified

+ + + + + +
PathCountDescription
Engineer replied and dismissed95"won't fix", "this is fine", "not applicable", "Copilot is incorrect"
Comment on stale/outdated code6Comment was on code already changed in a different commit
Total Confirmed Not Helpful101
+ + +

Results by Repository

+ +
+
+
Broker
+
+
48.5%
+
16%
+
35%
+
+
293
+
+
+
Common
+
+
36.2%
+
18%
+
46%
+
+
188
+
+
+
MSAL
+
+
26%
+
25%
+
49%
+
+
76
+
+
+
+ Helpful + Not Helpful + Unresolved + (number = total comments) +
+ + + + + + +
RepositoryCommentsResponse RateHelpfulNot HelpfulUnresolved
Broker29356.0%142 (48.5%)48 (16.4%)103 (35.2%)
Common18829.8%68 (36.2%)34 (18.1%)86 (45.7%)
MSAL7625.0%20 (26.3%)19 (25.0%)37 (48.7%)
+ +

Broker has the highest response rate (56%) and the highest confirmed helpfulness (49%). In Common and MSAL — where response rates are below 30% — nearly half of all comments are unresolved.

+ + +

Results by Engineer

+

Each engineer has two GitHub accounts (personal + EMU). These have been merged. Names are anonymized.

+ + + + + + + + + + + + + +
EngineerCommentsRepliedResponse RateHelpfulNot HelpfulUnresolvedHelpfulness
Engineer A2020100%146070.0%
Engineer B837590.4%5726068.7%
Engineer C15426.7%82553.3%
Engineer D403075.0%1914747.5%
Engineer E1103733.6%44184840.0%
Engineer F992020.2%36115236.4%
Engineer G632234.9%20123131.7%
Engineer H1002424.0%27126127.0%
Engineer I24625.0%501920.8%
Engineer J3133.3%0030.0%
+ +
+ Engagement drives value — and visibility. + Engineer A (100% response rate) and Engineer B (90%) have the highest helpfulness and zero unresolved comments. When you engage, you know exactly what the AI got right and wrong. Engineer F (20%) and Engineer H (24%) have 52 and 61 unresolved comments respectively — over half their feedback goes into a black hole. +
+ + +

Response Behavior Deep Dive

+ +

What happens when engineers reply? (239 comments)

[Split bar: 60% helpful / 40% not helpful]

When engineers engage, the majority of Copilot feedback turns out to be useful. This is the strongest signal that review quality is decent — the bottleneck is adoption, not accuracy.

+ +

What happens when engineers don't reply? (318 comments)

[Split bar: 27% silently fixed / 71% unresolved]
| What happened | Count | % of ignored | Verdict |
|---------------|-------|--------------|---------|
| Suggestion code silently applied (verified via diff) | 50 | 15.7% | Confirmed Helpful |
| Exact commented lines modified in subsequent commit | 7 | 2.2% | Confirmed Helpful |
| Evidence of fix at nearby lines (re-audit) | 29 | 9.1% | Confirmed Helpful |
| Merged without any subsequent commits | 122 | 38.4% | Unresolved |
| File modified but not at the commented lines | 45 | 14.2% | Unresolved |
| File modified but no line number to verify | 50 | 15.7% | Unresolved |
| File never modified after the comment | 9 | 2.8% | Unresolved |
| Comment on stale/outdated code | 6 | 1.9% | Confirmed Not Helpful |

The single largest category is the 122 comments (38% of ignored) where the engineer merged the PR without pushing any additional commits after Copilot's review. These comments may have been valid, but we cannot tell because the engineer never evaluated them.
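This bucket is detectable mechanically: if the commit a comment was left on is still the PR head at merge time, no commits followed the review. A minimal sketch (SHA values in the test are illustrative, and the real check runs in the skill's PowerShell scripts):

```python
def review_round_skipped(comment_commit_sha: str, pr_head_sha: str) -> bool:
    """True when no commits were pushed after Copilot's review,
    i.e. the comment sits on the final commit before merge."""
    return comment_commit_sha == pr_head_sha
```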

+ + +

What Copilot Is Good At

+ +

Catching real bugs:

+
+

PR #3050 (Common): Copilot flagged that "$it" string wrapping doesn't JSON-escape the content, which could break consumers if contract values contain special characters.

+

Engineer reply: "You're right. Making the change."

+
+ +

Stale documentation and naming inconsistencies:

+
+

PR #64 (Broker): Copilot identified four locations where KDoc still referenced the old flight constant USE_TEE_ONLY_FOR_TOKEN_BINDING after it was renamed. All four were silently fixed in the next commit.

+
+ +

Dead code and unused imports:

+
+

PR #3040 (Common): Copilot spotted an unused local variable enabledSettingRaw. Verified as fixed via diff — the suggested replacement code appeared in the commit additions.

+
+ +

CI/pipeline configuration issues:

+
+

PR #3038 (Common): Copilot warned that vmImage: 'windows-latest' makes the CD pipeline non-deterministic. The engineer changed to a pinned image version.

+
+ +

The @copilot apply workflow:

+
+

In 16 instances (2.9%), engineers validated Copilot's feedback and then delegated the fix back with: @copilot open a new pull request to apply changes based on [this feedback]. This is an efficient pattern in which the AI both identifies and fixes the issue.

+
+ + +

What Copilot Struggles With

+ +

Lacking domain context:

+
+

Copilot: "shared_device_id could be used for tracking across apps — consider hashing before emission."

+

Engineer: "The shared_device_id is a random UUID generated by one of the participating apps and is not PII."

+
+ +

Suggesting tests for trivial code:

+
+

Copilot: "New telemetry attributes lack test coverage..."

+

Engineer: "These are just telemetry related changes and adding unit tests will be overdo here."

+
+ +

Misunderstanding APIs:

+
+

Copilot: "00000003-0000-0ff1-ce00-000000000000 is the resource ID for Microsoft Graph, not SharePoint Online."

+

Engineer: "00000003-0000-0ff1-ce00-000000000000 is SharePoint Online."

+
+ +

Commenting on intentional design choices:

+
+

Copilot: "getPackageInfo() != null is redundant since it either returns PackageInfo or throws..."

+

Engineer: "This is fine. The verbosity makes the code clearer to understand."

+
+ + +

Key Takeaways

| # | Finding |
|---|---------|
| 1 | 57% of Copilot review comments receive no response from engineers. The majority of AI review feedback is never acknowledged. Of the ignored comments, only 27% are silently addressed — the remaining 71% are unresolved. |
| 2 | 41% of all comments led to a confirmed code improvement. But 41% are unresolved, meaning the true helpfulness rate lies between 41% (floor) and 82% (ceiling). We cannot narrow this range without engineer engagement. |
| 3 | When engineers engage, 60% of comments are helpful. This suggests the AI review quality itself is decent — the bottleneck is adoption, not accuracy. |
| 4 | Only 18% of comments are confirmed not helpful. When we restrict "not helpful" to comments with actual evidence of poor quality — explicit engineer dismissals — the rate is surprisingly low. |
| 5 | Engagement is the strongest predictor of value. Engineers who reply to 75%+ of comments see 47-70% helpfulness with zero unresolved. Engineers who reply to <35% see 20-40% helpfulness with massive unresolved buckets. |
| 6 | 38% of ignored comments are on the final commit before merge. These 122 comments represent the last review round being skipped — the feedback had zero chance of impact regardless of its quality. |
| 7 | Broker gets the most value (49% helpful), Common is middling (36%), MSAL is lowest (26%). This correlates with response rate: Broker 56%, Common 30%, MSAL 25%. |
| 8 | At ~5 comments per PR, the signal-to-noise question can't be settled without engagement. We confirmed ~2 useful comments per PR on average, but with 41% unresolved, the actual number could be higher. |

How We Measured This

| Phase | Method |
|-------|--------|
| 1. Data collection | GitHub API extraction of all 557 Copilot inline review comments from 163 human-authored PRs. Excluded PRs by Copilot coding agent. |
| 2. Reply classification | Automated keyword matching on 239 replied comments to classify as positive, negative, or ambiguous. |
| 3. Diff verification | For 318 unreplied comments, used the GitHub compare API to check if suggestion tokens appeared as diff additions, or if the exact line range was modified in subsequent commits. |
| 4. AI-assisted classification | All 133 ambiguous replies were individually read and classified by the AI conducting this analysis (Claude), based on reply text and domain context. These classifications were reviewed by the report author but not independently verified by the original PR engineers. |
| 5. Cross-validation | All initially "not helpful" classifications were re-examined against diff evidence. 18 were reclassified where evidence was strong (e.g., typo fix + matching -1 line file change). |

Classification rules: A comment is "Confirmed Helpful" only with positive evidence (engineer acknowledgment or verified code change). "Confirmed Not Helpful" only with positive evidence of poor quality (explicit engineer dismissal). Comments where the engineer did not engage — including the final review round merged without evaluation — are classified as "Unresolved" rather than assumed unhelpful.
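Under these rules, confirmed-helpful comments set a floor on the true helpfulness rate, and unresolved comments set how far above that floor it could reach. A small sketch reproducing the report's 41%/82% bounds:

```python
def helpfulness_bounds(helpful: int, not_helpful: int, unresolved: int):
    """Return (floor, ceiling) helpfulness rates as fractions.

    Floor assumes every unresolved comment was unhelpful;
    ceiling assumes every unresolved comment was helpful.
    """
    total = helpful + not_helpful + unresolved
    floor = helpful / total
    ceiling = (helpful + unresolved) / total
    return floor, ceiling

# Report totals: 230 helpful, 101 not helpful, 226 unresolved (557 comments)
floor, ceiling = helpfulness_bounds(230, 101, 226)
```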

+ +
+ + + + diff --git a/.github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report.md b/.github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report.md new file mode 100644 index 00000000..e0764b36 --- /dev/null +++ b/.github/skills/copilot-review-analyst/assets/Copilot-Code-Review-Effectiveness-Report.md @@ -0,0 +1,270 @@ +# Copilot Code Review Effectiveness Analysis + +**Android Auth Platform | January 23 – March 23, 2026** + +--- + +## Executive Summary + +We analyzed **every inline code review comment** left by GitHub Copilot on human-authored pull requests across our three Android Auth repositories (Common, MSAL, Broker) over the past two months. For each of the **557 comments**, we determined whether the feedback led to a concrete code improvement — either through an explicit engineer response, or by verifying that the suggested change appeared in subsequent commit diffs. + +**Key findings:** + +- **57% of Copilot's comments received no response from engineers.** This is the single most important number in this report. The majority of AI review feedback is never even acknowledged. +- **Of comments that engineers engaged with, 60% were helpful** — a strong signal that the review quality itself is decent. +- **41% of all comments led to a confirmed code improvement.** But 41% are unresolved — comments that were ignored and where we lack evidence to judge either way. The true helpfulness rate could be significantly higher, but we can't confirm it because no one looked. +- **Engineers who reply to 75%+ of comments see 47-70% helpfulness.** Engineers who reply to <35% see 20-40%. The biggest lever for improving Copilot review value is not the AI — it's engineer engagement with the feedback. + +--- + +## How We Measured This + +This analysis went through five phases to ensure accuracy: + +1. 
**Data collection.** We used the GitHub API to extract all 557 Copilot inline review comments from 163 human-authored PRs (excluding PRs authored by Copilot coding agent). We also recorded which comments received human replies and what those replies said. + +2. **Reply-based classification.** For the 239 comments (43%) that received a human reply, we classified the reply as positive (e.g., "good catch", "fixed", "addressed"), negative (e.g., "won't fix", "not applicable", "by design"), or ambiguous. + +3. **Diff-level verification.** For the 318 comments (57%) that received no reply, we checked whether the engineer acted on the feedback silently. We used the GitHub compare API to examine the diff between the commit Copilot reviewed and the final PR head. For comments containing GitHub suggestion blocks, we checked if the suggested code tokens appeared as additions in the diff. For prose comments, we checked if the exact line range was modified in a subsequent commit. + +4. **AI-assisted reply classification.** All 133 comments with ambiguous replies were individually read and classified by the AI conducting this analysis (Claude), based on the reply text and domain context. For example, "this is just telemetry" was classified as dismissed, "@copilot apply changes" was classified as helpful (engineer delegated the fix back to Copilot), and "Added unit test for this" was classified as helpful (engineer acted on the suggestion). These classifications were reviewed for accuracy by the report author but were not independently verified by the original PR engineers. + +5. **Cross-validation.** All comments initially classified as "not helpful" were re-examined against the diff evidence. 18 were reclassified as helpful where the evidence was strong (e.g., a typo fix suggestion with a corresponding -1 line file change, or an unused import suggestion with a matching -2 line change). 
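Phase 2 is straightforward keyword matching. The skill implements it in its PowerShell scripts; a minimal Python sketch, using a small illustrative subset of the patterns, shows the three-way split:

```python
# Illustrative subsets only; the full pattern lists are in
# references/classification-rules.md.
POSITIVE = ["good catch", "fixed", "addressed", "you're right", "will fix"]
NEGATIVE = ["won't fix", "not applicable", "by design", "false positive"]

def classify_reply(reply: str) -> str:
    """Classify a human reply as positive, negative, or ambiguous."""
    text = reply.lower()
    pos = any(p in text for p in POSITIVE)
    neg = any(n in text for n in NEGATIVE)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return "ambiguous"  # mixed or no match: escalate to phase 4
```

Anything that comes back "ambiguous" flows into the AI-assisted classification of phase 4.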
+ +The final dataset classifies each of the 557 comments as **confirmed helpful**, **confirmed not helpful**, or **unresolved** (insufficient evidence to determine). No comment is left without a classification. + +--- + +## Overall Results + +| Metric | Value | +|--------|-------| +| Human PRs scanned | 163 | +| PRs that received Copilot review | 113 (69%) | +| Total inline review comments | 557 | +| PR-level summary comments | 205 | +| Average comments per reviewed PR | 4.9 | + +### Engineer Response Rate + +Before looking at helpfulness, it's important to understand how engineers interact with Copilot reviews — because a comment can only demonstrate value if someone reads it. + +| Behavior | Count | Percentage | +|----------|-------|------------| +| **Engineer replied** (any response — acceptance, dismissal, or discussion) | 239 | **42.9%** | +| **Engineer did not reply** | 318 | **57.1%** | + +More than half of all Copilot review comments receive no human response at all. This means the majority of AI feedback enters a void — it may be valid, but we can never confirm its value if no one engages with it. + +### Helpfulness Verdict + +Each comment was classified into one of three categories: + +| Verdict | Count | Percentage | Definition | +|---------|-------|------------|------------| +| **Confirmed Helpful** | **230** | **41.3%** | The comment led to a code change — either the engineer explicitly acknowledged it, or the suggested fix was verified in a subsequent commit diff. | +| **Confirmed Not Helpful** | **101** | **18.1%** | The engineer explicitly dismissed the comment with a reason why the feedback was incorrect, irrelevant, or by design. We have positive evidence that the comment was *wrong*, not merely that it was *ignored*. | +| **Unresolved** | **226** | **40.6%** | The comment received no reply AND we could not confirm whether it was addressed. 
This includes cases where the engineer merged without any subsequent commits (the final review round was never evaluated), where the file was modified but not at the specific lines Copilot flagged, or where the comment had no line number making verification impossible. | + +The **unresolved** category is the most important number in this report. These 226 comments — **41% of all feedback** — are not confirmed failures of the AI. They are comments where we simply lack evidence either way because no one engaged with them. Many may be perfectly valid feedback that was never read. If engineers had engaged with them (even just to dismiss), we would know. The true helpfulness rate likely falls somewhere between 41% (confirmed floor) and 82% (if all unresolved were helpful). + +### How Each Category Breaks Down + +**Confirmed Helpful (230):** + +| Path | Count | Description | +|------|-------|-------------| +| Engineer replied and acknowledged | 144 | Engineer explicitly confirmed the feedback was useful (e.g., "good catch", "fixed", "addressed", "added unit test") | +| Engineer silently applied the fix | 86 | No reply, but the suggestion code or exact line range was verified as modified in a subsequent commit | + +**Confirmed Not Helpful (101):** + +| Path | Count | Description | +|------|-------|-------------| +| Engineer replied and dismissed | 95 | Engineer explicitly explained why the comment was wrong, irrelevant, or by design (e.g., "won't fix", "this is fine", "Copilot is incorrect", "false positive") | +| Comment on code already changed | 6 | Comment was on stale/outdated code that had already been modified in a different commit | + +**Unresolved (226):** + +| Path | Count | Description | +|------|-------|-------------| +| Merged without any subsequent commits | 122 | No commits after Copilot's review — the engineer merged the PR without acting on the final review round. The comment may have been valid, but we cannot tell because the engineer never evaluated it. 
| +| File modified but not at the commented lines | 45 | The file was changed after the review, but the diff shows the changes were at different lines than what Copilot flagged. Possibly addressed via a different approach, or possibly coincidental. | +| File modified but no line number to verify | 50 | Copilot's comment had no line number metadata, and the file was modified. We cannot confirm whether the specific concern was addressed. | +| File never modified after the comment | 9 | The file was not touched in any commit after the review, and no reply was left. The comment may have been valid but was ignored. | + +--- + +## Results by Repository + +| Repository | Comments | Response Rate | Confirmed Helpful | Confirmed Not Helpful | Unresolved | +|------------|----------|---------------|------|------|------| +| **Broker** | 293 | **56.0%** | 142 (48.5%) | 48 (16.4%) | 103 (35.2%) | +| **Common** | 188 | **29.8%** | 68 (36.2%) | 34 (18.1%) | 86 (45.7%) | +| **MSAL** | 76 | **25.0%** | 20 (26.3%) | 19 (25.0%) | 37 (48.7%) | + +Broker has the highest response rate (56%) and correspondingly the highest confirmed helpfulness (49%). But even in Broker, 35% of comments are unresolved. In Common and MSAL — where response rates are below 30% — nearly half of all comments are unresolved, meaning we cannot determine whether the AI's feedback was useful because engineers didn't evaluate it. + +Coverage across the three repos: + +| Repository | Human PRs | PRs Reviewed by Copilot | Coverage | +|------------|-----------|------------------------|----------| +| Common | 68 | 47 | 69% | +| MSAL | 31 | 19 | 61% | +| Broker | 64 | 47 | 73% | + +--- + +## Results by Engineer + +Each engineer has two GitHub accounts (a personal account for public repos and an EMU account for the private broker repo). These have been merged. Names are anonymized. 
+ +| Engineer | Comments | Replied | Ignored | Response Rate | Confirmed Helpful | Confirmed Not Helpful | Unresolved | Helpfulness | +|----------|----------|---------|---------|---------------|------|------|------|-------------| +| **Engineer A** | 20 | 20 | 0 | **100%** | 14 | 6 | 0 | **70.0%** | +| **Engineer B** | 83 | 75 | 8 | **90.4%** | 57 | 26 | 0 | **68.7%** | +| **Engineer C** | 15 | 4 | 11 | **26.7%** | 8 | 2 | 5 | **53.3%** | +| **Engineer D** | 40 | 30 | 10 | **75.0%** | 19 | 14 | 7 | **47.5%** | +| **Engineer E** | 110 | 37 | 73 | **33.6%** | 44 | 18 | 48 | **40.0%** | +| **Engineer F** | 99 | 20 | 79 | **20.2%** | 36 | 11 | 52 | **36.4%** | +| **Engineer G** | 63 | 22 | 41 | **34.9%** | 20 | 12 | 31 | **31.7%** | +| **Engineer H** | 100 | 24 | 76 | **24.0%** | 27 | 12 | 61 | **27.0%** | +| **Engineer I** | 24 | 6 | 18 | **25.0%** | 5 | 0 | 19 | **20.8%** | +| **Engineer J** | 3 | 1 | 2 | **33.3%** | 0 | 0 | 3 | **0%** | + +*Helpfulness = Confirmed Helpful / Total Comments. Response Rate = Replied / Total Comments. Unresolved = comments with no reply and no definitive diff evidence either way.* + +**Key observation:** There is a strong correlation between response rate and helpfulness. Engineer A (100% response rate) and Engineer B (90%) have the highest helpfulness (70% and 69%) — and crucially, **zero unresolved comments**. When engineers engage, we know exactly what's helpful and what's not. Engineers with low response rates (Engineer F at 20%, Engineer H at 24%) have massive unresolved buckets (52 and 61 comments respectively) — over half their comments go into a black hole where we can't tell if the AI was right or wrong. + +--- + +## Response Behavior Deep Dive + +Of the 557 total comments: + +- **239 (42.9%) received a reply.** Of those, **60.3% were helpful** and 39.7% were not helpful. When engineers engage, the majority of Copilot feedback turns out to be useful. 
+- **318 (57.1%) were ignored.** Of those, **27.0% were silently addressed** (verified via diff), **1.9% were on stale code** (confirmed not helpful), and the remaining **71.1% are unresolved** — we cannot determine whether the comment was useful because the engineer never evaluated it. + +### What happens to ignored comments + +| What happened | Count | % of ignored | Verdict | +|---------------|-------|-------------|---------| +| Suggestion code silently applied (verified via diff) | 50 | 15.7% | Confirmed Helpful | +| Exact commented lines modified in subsequent commit | 7 | 2.2% | Confirmed Helpful | +| Re-audit: evidence of fix at nearby lines | 29 | 9.1% | Confirmed Helpful | +| Merged without any subsequent commits | 122 | 38.4% | **Unresolved** | +| File never modified after the comment | 9 | 2.8% | **Unresolved** | +| Comment on stale/outdated code | 6 | 1.9% | Confirmed Not Helpful | +| File modified but not at the commented lines | 45 | 14.2% | **Unresolved** | +| File modified but no line number to verify | 50 | 15.7% | **Unresolved** | + +The single largest category is the **122 comments (38.4% of ignored)** where the engineer merged the PR without pushing any additional commits after Copilot's review. These represent the final review round being skipped entirely — the feedback had zero chance of impact regardless of its quality. + +--- + +## What Copilot Is Good At + +The most common categories of helpful comments, with real examples from our PRs: + +**Catching real bugs:** +> *PR #3050 (Common):* Copilot flagged that `"$it"` string wrapping doesn't JSON-escape the content, which could break consumers if contract values contain special characters. +> *Engineer reply: "You're right. 
Making the change."* + +**Stale documentation and naming inconsistencies:** +> *PR #64 (Broker):* Copilot identified four locations where KDoc still referenced the old flight constant `USE_TEE_ONLY_FOR_TOKEN_BINDING` after it was renamed to `USE_TEE_ONLY_FOR_HARDWARE_BOUND_KEYS`. All four were silently fixed in the next commit. + +**Dead code and unused imports:** +> *PR #3040 (Common):* Copilot spotted an unused local variable `enabledSettingRaw` that was assigned but never read. Verified as fixed via diff analysis — the suggested replacement code appeared in the commit additions. + +**CI/pipeline configuration issues:** +> *PR #3038 (Common):* Copilot warned that using `vmImage: 'windows-latest'` makes the CD pipeline non-deterministic. The engineer changed to a pinned image version. + +**The `@copilot apply` workflow:** +> In 16 instances (2.9%), engineers validated Copilot's feedback and then delegated the fix back to Copilot with: `@copilot open a new pull request to apply changes based on [this feedback]`. This is an efficient pattern where AI identifies and fixes the issue end-to-end. + +--- + +## What Copilot Struggles With + +The most common categories of unhelpful comments, with real examples: + +**Lacking domain context:** +> *Copilot:* "`shared_device_id` could be used for tracking across apps — consider hashing before emission." +> *Engineer:* "The shared_device_id is a random UUID generated by one of the participating apps and is not PII." +> +> Copilot applied a general security heuristic without understanding that this particular identifier is already random and non-linkable. + +**Suggesting tests for trivial code:** +> *Copilot:* "New telemetry attributes lack test coverage..." +> *Engineer:* "These are just telemetry related changes and adding unit tests will be overdo here." +> +> This was a recurring theme — Copilot frequently requests tests for logging/telemetry code that the team considers low-risk. 
+ +**Misunderstanding APIs:** +> *Copilot:* "`00000003-0000-0ff1-ce00-000000000000` is the resource ID for Microsoft Graph, not SharePoint Online." +> *Engineer:* "00000003-0000-0ff1-ce00-000000000000 is SharePoint Online." +> +> Copilot was factually wrong about a well-known Microsoft service resource ID. + +**Commenting on intentional design choices:** +> *Copilot:* "`getPackageInfo() != null` is redundant since it either returns PackageInfo or throws..." +> *Engineer:* "This is fine. The verbosity makes the code clearer to understand." + +**Over-engineering suggestions:** +> *Copilot:* "Use a bounded min-heap of size `PRT_ARTIFACT_LIMIT` to reduce overhead..." +> *Engineer:* (The list can only ever contain 3 items — there are only 3 broker apps on Android.) + +--- + +## Most Reviewed Files + +| Rank | File | Comments | +|------|------|----------| +| 1 | `.github/workflows/copilot-issue-response.yml` | 27 | +| 2 | `broker4j/.../AttributeName.java` | 19 | +| 3 | `.github/workflows/copilot-ci-feedback.md` | 16 | +| 4 | `common/.../AuthorizationFragment.java` | 15 | +| 5 | `broker4j/.../MultipleWorkplaceJoinDataStore.java` | 11 | +| 6 | `broker4j/.../AbstractBrokerController.java` | 11 | +| 7 | `common/.../AzureActiveDirectoryWebViewClient.java` | 11 | +| 8 | `AADAuthenticator/.../BrowserSsoProvider.kt` | 10 | +| 9 | `broker4j/.../BrokerFlight.java` | 9 | +| 10 | `broker4j/.../DeviceRegistrationRequestHandler.java` | 9 | + +CI/workflow files and large Java classes with many change touchpoints attract the highest volume of comments. `AttributeName.java` appears frequently because many PRs add telemetry attributes. + +--- + +## Key Takeaways + +1. **57% of Copilot review comments receive no response from engineers.** This is the most significant finding. The majority of AI review feedback is never acknowledged. Some of it is silently acted on (27% of ignored comments show diff evidence of a fix), but the vast majority — 71% of ignored comments — are unresolved. 
We simply don't know if they were useful because no one evaluated them. + +2. **41% of all comments led to a confirmed code improvement.** But 41% are unresolved, meaning the true helpfulness rate lies between 41% (floor) and 82% (ceiling). We cannot narrow this range without engineer engagement. + +3. **When engineers engage, 60% of comments are helpful.** Of the 239 comments that received a reply, 144 (60%) led to acknowledged improvements. This suggests the AI review quality itself is decent — the bottleneck is adoption, not accuracy. + +4. **Only 18% of comments are confirmed not helpful.** When we restrict "not helpful" to comments where the engineer explicitly dismissed the feedback — the only cases where we have positive evidence of low quality — the rate is surprisingly low. + +5. **Engagement is the strongest predictor of value.** Engineers who reply to 75%+ of comments see 47-70% confirmed helpfulness with zero unresolved comments. Engineers who reply to <35% see 20-40% helpfulness with massive unresolved buckets (50-60% of their comments). The tool works better when engineers work with it. + +6. **38% of ignored comments are on the final commit before merge.** These 122 comments represent the last review round being skipped entirely — the engineer merged without pushing any further changes. These comments may have been perfectly valid, but the feedback had zero chance of impact. + +7. **Broker gets the most value (49%), Common is middling (36%), MSAL is lowest (26%).** This correlates with response rate: Broker engineers reply to 56% of comments, while Common (30%) and MSAL (25%) reply far less. + +8. **At ~5 comments per PR, the signal-to-noise conversation can't be settled without engagement.** We confirmed ~2 useful comments per PR on average, but with 41% of comments unresolved, the actual number could be higher. The only way to know is for engineers to evaluate the feedback. 
+ +--- + +## Methodology Notes + +- **Scope.** Only inline code review comments from the `Copilot` and `copilot-pull-request-reviewer[bot]` users were counted. PR-level summary comments (205 total) were excluded from the helpfulness analysis. +- **Bot exclusions.** PRs authored by `copilot-swe-agent` (Copilot coding agent) were excluded. Only PRs authored by human engineers were analyzed. +- **Diff verification.** For suggestion blocks, we extracted the suggested code tokens and checked if they appeared as `+` (addition) lines in the compare diff between the comment's commit and the PR head. For prose comments, we checked if the diff hunk line ranges overlapped with the comment's target line range (±5 line tolerance). This is a conservative approach — some fixes that refactored code differently than suggested may be missed. +- **AI-assisted reply classification.** Every comment that could not be definitively classified by automated methods was individually read and classified by the AI conducting this analysis (Claude), based on the reply text and domain context. These AI classifications were reviewed for accuracy by the report author but were not independently verified by the original PR engineers. +- **Conservative approach.** We only classify a comment as "Confirmed Helpful" when there is positive evidence: an explicit engineer acknowledgment, or verified code changes at the exact lines/tokens suggested. We only classify as "Confirmed Not Helpful" when there is positive evidence that the comment was *wrong or irrelevant*: an explicit engineer dismissal with a stated reason. Comments where the engineer simply did not engage — including the final review round that was merged without evaluation, files that were never modified, and files modified at different lines — are classified as "Unresolved" rather than assumed to be unhelpful. 
+- **Account merging.** Engineers with separate public GitHub accounts and EMU (Enterprise Managed User) accounts were merged based on known identity mappings. +- **Data availability.** Raw data for all 557 comments (including comment text, engineer replies, diff verification evidence, and final verdicts) is stored at `%TEMP%\copilot-review-analysis\` for independent verification. + +--- + +*Analysis conducted March 23-24, 2026. Data covers all PRs created January 23 – March 23, 2026 in the Common, MSAL, and Broker repositories.* diff --git a/.github/skills/copilot-review-analyst/references/account-map.json b/.github/skills/copilot-review-analyst/references/account-map.json new file mode 100644 index 00000000..c0d87b27 --- /dev/null +++ b/.github/skills/copilot-review-analyst/references/account-map.json @@ -0,0 +1,22 @@ +{ + "shahzaibj": "Shahzaib", + "shjameel_microsoft": "Shahzaib", + "Prvnkmr337": "Praveen", + "prsaminathan_microsoft": "Praveen", + "siddhijain": "Siddhi", + "siddhijain_microsoft": "Siddhi", + "mohitc1": "Mohit", + "mchand_microsoft": "Mohit", + "rpdome": "Dome", + "rapong_microsoft": "Dome", + "p3dr0rv": "Pedro", + "pedroro_microsoft": "Pedro", + "melissaahn": "Melissa", + "melissaahn_microsoft": "Melissa", + "fadidurah": "Fadi", + "fadidurah_microsoft": "Fadi", + "somalaya": "Sowmya", + "somalaya_microsoft": "Sowmya", + "wzhipan": "Zhipan", + "zhipanwang_microsoft": "Zhipan" +} diff --git a/.github/skills/copilot-review-analyst/references/classification-rules.md b/.github/skills/copilot-review-analyst/references/classification-rules.md new file mode 100644 index 00000000..866700df --- /dev/null +++ b/.github/skills/copilot-review-analyst/references/classification-rules.md @@ -0,0 +1,133 @@ +# Classification Rules + +Complete classification hierarchy for Copilot review comment analysis. + +## Classification Cascade (Priority Order) + +Apply rules in this exact order. First match wins. 
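A first-match-wins cascade is naturally expressed as an ordered list of (predicate, verdict) rules. A minimal Python sketch (the predicates and the comment fields here are placeholders for illustration, not the full rule set; the skill's actual cascade is implemented in PowerShell):

```python
def classify(comment: dict, rules) -> str:
    """Apply (predicate, verdict) rules in order; first match wins."""
    for predicate, verdict in rules:
        if predicate(comment):
            return verdict
    return "unresolved"

# Hypothetical rule order mirroring the cascade's priority idea
rules = [
    (lambda c: c.get("reply_positive"), "helpful"),
    (lambda c: c.get("reply_negative"), "not-helpful"),
    (lambda c: c.get("suggestion_applied"), "helpful"),
]
```

Because earlier rules shadow later ones, rule order encodes the priority column of the tables below.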
+ +### Replied Comments (has human reply) + +| Priority | Condition | Verdict | +|----------|-----------|---------| +| 1 | Reply matches positive keyword patterns | **Helpful** | +| 2 | Reply matches negative keyword patterns | **Not Helpful** | +| 3 | Both positive and negative matched (mixed) | Verdict from `mixedResponseVerdict` in manual audit file (default: **Not Helpful**) | +| 4 | Reply contains `@copilot` (delegated fix) | **Helpful** | +| 5 | Reply matches acknowledged-action patterns | **Helpful** | +| 6 | Reply matches explained-away patterns | **Not Helpful** | +| 7 | Reply matches outdated/dismissed patterns | **Not Helpful** | +| 8 | Genuinely unclear — apply AI judgment | See AI Classification below | + +### No-Response Comments (no human reply) + +| Priority | Condition | Verdict | +|----------|-----------|---------| +| 9 | `suggestion-applied` or `suggestion-likely-applied` (from diff verification) | **Helpful** | +| 10 | `exact-lines-modified` (from diff verification) | **Helpful** | +| 11 | `lines-modified-different-fix` | **Helpful** (nearby lines modified with a different approach — engineer addressed the concern) | +| 12 | `file-changed-elsewhere` or `file-changed-no-line-info` | **Check re-audit list** — helpful if evidence found, else **Unresolved** | +| 13 | `file-not-changed`, `no-subsequent-commits`, `not-applied` | **Unresolved** | +| 14 | Comment on stale/outdated code | **Not Helpful** | + +## Keyword Patterns + +### Positive Patterns (→ Helpful) + +``` +good catch, fixed, done, addressed, will fix, will address, +thanks, thank you, agreed, makes sense, updated, nice catch, +you're right, you are right, correct, valid point, great catch, +resolved, will do, good point, fair point, acknowledged, +applied, changed, modified, yep, absolutely, +i'll update, i will update, i'll fix, i will fix, +good suggestion, great suggestion, nice suggestion, +will change, will update, pushed a fix, committed, +good find, great find, indeed, +making the 
change, i've updated, i've fixed +``` + +### Negative Patterns (→ Not Helpful) + +``` +not applicable, n/a, won't fix, wontfix, by design, +intentional, false positive, not relevant, ignore, +doesn't apply, not needed, unnecessary, nah, no need, +disagree, incorrect, wrong, not accurate, hallucin, +not a real issue, not an issue, this is fine, it's fine, +already handled, already done, not applicable here, +copilot is wrong, bot is wrong, misunderstanding, +out of scope, does not apply, not a concern, not a problem, +doesn't matter, won't happen, can't happen, impossible +``` + +### Acknowledged-Action Patterns (→ Helpful, for unclear replies) + +These indicate the engineer acted on the feedback even if they didn't use standard positive keywords: + +``` +added tests?, refactored, removed, reverted, renamed, +implemented, reworked, update signature, log warning, +move check, add test, add unit test, nice job, good bot +``` + +Use word-boundary matching (`\b`) for these. + +### Explained-Away Patterns (→ Not Helpful, for unclear replies) + +These indicate the engineer explained why the comment is not relevant: + +``` +this is, we don't, we do not, we aren't, nope, has been, +it's a, they're meant, this has, only used, never been, +was consciously, just telemetry, is just, original behavior, +overdo, legacy, can only, can never, doesn't need, +suffix was, timing is not, skip, most of the, empty is fine, +no longer, will stick, keep the current, consciously +``` + +### Outdated/Dismissed Patterns (→ Not Helpful) + +``` +outdated, dismissed +``` + +## AI Classification for Genuinely Unclear Replies + +When no keyword pattern matches, read the reply text with domain context: + +1. **Is the engineer confirming they'll act?** Even indirect signals like "addressing in a later commit", "implemented something similar", or linking a commit hash → **Helpful** +2. 
**Is the engineer explaining why the feedback doesn't apply?** Phrases like "this is just telemetry", "we consciously chose this", "legacy code" → **Not Helpful** +3. **Is the reply tangential or administrative?** E.g., "will consider in another PR" → **Helpful** if they acknowledge the issue, **Not Helpful** if they're deflecting +4. **Does the reply contain a commit SHA or link?** → **Helpful** (engineer is showing they applied a fix) + +## Diff Verification Logic + +### For Suggestion Blocks + +1. Extract code between `` ```suggestion `` and `` ``` `` markers +2. Tokenize: keep lines >3 chars, skip punctuation-only lines +3. Compare tokens against `+` (addition) lines in the diff +4. Token match ratio ≥ 50% AND line range overlap → `suggestion-applied` +5. Token match ratio ≥ 50% without line overlap → `suggestion-likely-applied` + +### For Prose Comments + +1. Get the comment's line range (`start_line` to `line`) +2. Parse diff hunk headers (`@@ -old,count +new,count @@`) +3. Check if any hunk's old-line range overlaps the comment range ±5 lines +4. Overlap found → `exact-lines-modified` + +### No-Subsequent-Commits Check + +If the commit SHA the comment was left on equals the PR head SHA → `no-subsequent-commits`. This means the PR was merged without any further changes after the review. + +## Account Mapping + +Engineers have separate personal GitHub accounts and EMU (Enterprise Managed User) accounts. Merge them for per-engineer statistics: + +``` +personal_login → emu_login → Display Name +``` + +The mapping is defined in `references/account-map.json` (external JSON file). Update for new team members by editing the JSON directly — no script changes needed. 
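The suggestion-block steps above can be sketched in PowerShell. This is a minimal illustration, not the actual implementation in `scripts/precise.ps1` — the function name, parameters, and the way the line-range overlap is passed in are all assumptions:

```powershell
# Sketch of the suggestion-block diff check (steps 1-5 above).
# Names are illustrative; the real logic lives in scripts/precise.ps1.
function Test-SuggestionApplied {
    param(
        [string]$CommentBody,     # full comment text, may contain a ```suggestion block
        [string[]]$DiffLines,     # unified-diff lines from commits after the comment
        [bool]$LineRangeOverlaps  # precomputed hunk/comment line-range overlap
    )

    # 1. Extract code between the ```suggestion fences
    if ($CommentBody -notmatch '(?s)```suggestion\r?\n(.*?)```') { return 'not-applied' }
    $suggestion = $Matches[1]

    # 2. Tokenize: keep lines longer than 3 chars, skip punctuation-only lines
    $tokens = @($suggestion -split '\r?\n' |
        ForEach-Object { $_.Trim() } |
        Where-Object { $_.Length -gt 3 -and $_ -match '\w' })
    if ($tokens.Count -eq 0) { return 'not-applied' }

    # 3. Compare tokens against '+' (addition) lines in the diff,
    #    excluding the '+++ b/...' file header
    $added = @($DiffLines | Where-Object { $_ -match '^\+[^+]' } |
        ForEach-Object { $_.Substring(1).Trim() })
    $hits = @($tokens | Where-Object { $added -contains $_ }).Count

    # 4./5. Match ratio >= 50% decides applied vs. likely-applied
    if (($hits / $tokens.Count) -ge 0.5) {
        if ($LineRangeOverlaps) { return 'suggestion-applied' }
        return 'suggestion-likely-applied'
    }
    return 'not-applied'
}
```

Prose comments skip the token step entirely and rely only on the hunk-overlap check described above.
<imports>
</imports>
<test>
$body = @'
Consider applying:
```suggestion
val x = computeThing()
return x
```
'@
$diff = @('+++ b/File.kt', '+val x = computeThing()', '+return x', '-val x = oldThing()')
if ((Test-SuggestionApplied -CommentBody $body -DiffLines $diff -LineRangeOverlaps $true) -ne 'suggestion-applied') { throw 'expected suggestion-applied' }
if ((Test-SuggestionApplied -CommentBody $body -DiffLines $diff -LineRangeOverlaps $false) -ne 'suggestion-likely-applied') { throw 'expected suggestion-likely-applied' }
if ((Test-SuggestionApplied -CommentBody 'no suggestion here' -DiffLines $diff -LineRangeOverlaps $true) -ne 'not-applied') { throw 'expected not-applied' }
</test>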
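The personal-to-EMU merge can be sketched as follows. The JSON shape shown is an assumption for illustration, not the actual `references/account-map.json` schema:

```powershell
# Sketch of merging personal and EMU logins into one engineer identity.
# The JSON field names ("engineers", "personal", "emu", "name") are hypothetical.
$accountMap = @'
{
  "engineers": [
    { "personal": "octocat", "emu": "octocat_contoso", "name": "Octo Cat" }
  ]
}
'@ | ConvertFrom-Json

# Build a lookup: any known login -> canonical display name
$loginToName = @{}
foreach ($e in $accountMap.engineers) {
    $loginToName[$e.personal] = $e.name
    $loginToName[$e.emu]      = $e.name
}

function Resolve-Engineer([string]$login) {
    if ($loginToName.ContainsKey($login)) { return $loginToName[$login] }
    return $login  # unmapped logins pass through unchanged
}
```

Per-engineer statistics then group on the resolved display name, so a fix pushed from an EMU account and a reply from the matching personal account count toward the same engineer.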
diff --git a/.github/skills/copilot-review-analyst/references/manual-audit-template.json b/.github/skills/copilot-review-analyst/references/manual-audit-template.json new file mode 100644 index 00000000..8400daee --- /dev/null +++ b/.github/skills/copilot-review-analyst/references/manual-audit-template.json @@ -0,0 +1,17 @@ +{ + "_comment": "This file is produced by Phase 3 (AI-assisted classification). For each new analysis run, the agent reads unclear replies, classifies them, and writes the results here. Phase 4 (final-classification.ps1) reads this file to finalize verdicts.", + + "genuineUnclearHelpful": [ + "_comment: Reply text patterns (lowercased) that indicate the engineer acted on the feedback, even though keyword matching didn't catch it. These are discovered during Phase 3 manual/AI audit of 'replied-unclear' comments." + ], + + "genuineUnclearHelpfulExtra": [ + "_comment: Additional patterns like commit SHAs or unique strings that confirm a fix was applied." + ], + + "reauditFlipKeys": [ + "_comment: Keys in format 'repo/prNumber/filePattern' for no-response comments where diff verification returned 'file-changed-elsewhere' or 'file-changed-no-line-info' but manual re-audit confirmed the comment was helpful. Discovered during Phase 3." + ], + + "mixedResponseVerdict": "not-helpful" +} diff --git a/.github/skills/copilot-review-analyst/references/report-formatting.md b/.github/skills/copilot-review-analyst/references/report-formatting.md new file mode 100644 index 00000000..04e75a7b --- /dev/null +++ b/.github/skills/copilot-review-analyst/references/report-formatting.md @@ -0,0 +1,175 @@ +# Report Formatting Guide + +Rules for generating Copilot Code Review Effectiveness reports in Markdown and Outlook-compatible HTML. + +## Report Structure + +Generate both formats. Templates are in `assets/` within this skill folder. 
+ +| # | Section | Content | +|---|---------|---------| +| 1 | **Background** | Team context, what repos are covered, what was enabled | +| 2 | **At a Glance** | 4 summary cards (no-response %, helpful %, not-helpful %, unresolved %) + callout about adoption | +| 3 | **Overall Results** | Response rate bar, helpfulness verdict bar, breakdown tables | +| 4 | **Results by Repository** | Per-repo bars + table (comments, response rate, helpful/not/unresolved) | +| 5 | **Results by Engineer** | Table with colored columns (anonymize names for org-wide sharing) | +| 6 | **Response Behavior Deep Dive** | What happens to ignored comments (silently applied, merged without commits, etc.) | +| 7 | **What Copilot Is Good At** | 4-5 real examples with PR references and engineer quotes | +| 8 | **What Copilot Struggles With** | 4-5 real examples showing false positives, domain gaps | +| 9 | **Most Reviewed Files** | Top 10 files by comment count | +| 10 | **Key Takeaways** | 7-8 numbered findings | +| 11 | **Recommendations** | 3 actionable next steps | +| 12 | **Methodology Notes** | How data was collected, classified, and validated | + +## Statistics to Compute + +From `final_classification.json`: + +```powershell +# Overall +$total = $data.Count +$helpful = ($data | Where-Object { $_.Verdict -eq "helpful" }).Count +$notHelpful = ($data | Where-Object { $_.Verdict -eq "not-helpful" }).Count +$unresolved = $total - $helpful - $notHelpful +$replied = ($data | Where-Object { $_.Replied -eq $true }).Count +$responseRate = [math]::Round(($replied / $total) * 100, 1) + +# Per-repo +$repoStats = $data | Group-Object Repo | ForEach-Object { ... } + +# Per-engineer +$engStats = $data | Group-Object Engineer | ForEach-Object { ... } +``` + +## Outlook HTML Formatting Rules + +Outlook strips most modern CSS. 
Follow these rules strictly:

### Layout
- Wrap entire body in a centered `<table>` for consistent margins
- Use **table-based layouts only** — no flexbox, no grid, no float
- All styles must be **inline** — Outlook strips `