
perf(player): p0-1a perf test infra + composition-load smoke test#399

Merged
vanceingalls merged 1 commit into main from perf/p0-1a-perf-test-infra
Apr 23, 2026

Conversation

@vanceingalls (Collaborator) commented Apr 21, 2026

Summary

First slice of P0-1 from the player perf proposal: lays the foundation for a player perf gate so later PRs can plug in fps / scrub / drift / parity scenarios without rebuilding infrastructure. Ships one smoke scenario (03-load, cold + warm composition load) to prove the gate end-to-end on real numbers.

Why

There was no automated way to catch player perf regressions. Every perf concern in the existing proposal — composition load time, sustained FPS, scrub p95, mirror-clock drift, live-vs-seek parity — needs the same plumbing: a same-origin harness, a Puppeteer runner, a baseline file, a gate that emits structured results, and a CI workflow that runs the right scenarios on the right changes. Building that up-front in one reviewable PR lets every subsequent perf PR (P0-1b, P0-1c, and beyond) be a 100-line scenario file plus a baseline entry instead of re-litigating the framework.

What changed

Harness — packages/player/tests/perf/server.ts

  • Bun.serve on a free port, single same-origin host for the player IIFE bundle, hyperframe runtime, GSAP from node_modules, and fixture HTML.
  • Same-origin matters: cross-origin would force every probe through postMessage, hiding bugs and inflating numbers in ways production never sees. Tests should measure the path the studio editor actually takes.
  • Routes:
    • /player.js → built IIFE bundle (rebuilt on demand).
    • /vendor/runtime.js, /vendor/gsap.min.js → resolved from node_modules so fixtures don't need to ship copies.
    • /fixtures/* → fixture HTML.
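The route table reduces to a small pure mapping. A minimal sketch for illustration (the `resolveRoute` helper and its return shape are assumptions, not the actual server.ts, which serves these via Bun.serve and rebuilds the bundle on demand):

```typescript
// Sketch of the harness's same-origin route table (assumed shape).
type Route =
  | { kind: "bundle" }                     // built IIFE player bundle
  | { kind: "vendor"; file: string }       // resolved from node_modules
  | { kind: "fixture"; path: string }      // fixture HTML under /fixtures/
  | { kind: "not-found" };

export function resolveRoute(pathname: string): Route {
  if (pathname === "/player.js") return { kind: "bundle" };
  if (pathname === "/vendor/runtime.js") return { kind: "vendor", file: "runtime.js" };
  if (pathname === "/vendor/gsap.min.js") return { kind: "vendor", file: "gsap.min.js" };
  if (pathname.startsWith("/fixtures/")) {
    return { kind: "fixture", path: pathname.slice("/fixtures/".length) };
  }
  return { kind: "not-found" };
}
```

Because everything resolves from one origin, fixtures reference `/vendor/*` directly and never ship their own copies of the runtime or GSAP.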

Runner — packages/player/tests/perf/runner.ts

  • puppeteer-core thin wrappers (launchBrowser, loadHostPage).
  • Uses the system Chrome detected by setup-chrome in CI rather than the bundled puppeteer revision — keeps the action smaller, lets us pin Chrome version policy at the workflow level, and matches what users actually run.

Gate — packages/player/tests/perf/perf-gate.ts + baseline.json

  • Loads baseline.json (initial budgets: cold/warm comp load, fps, scrub p95 isolated/inline, drift max/p95) with a 10% allowedRegressionRatio.
  • Per-metric direction (lower-is-better / higher-is-better) so the same evaluator handles latency and throughput.
  • Returns a structured GateReport consumed by both the CLI (table output) and metrics.json (CI artifact).
  • Two modes: measure (log only — used during the rollout) and enforce (fail the build) — flip per-metric once we trust the signal, without touching the harness.
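The per-metric direction check is the core of the gate. A minimal sketch of the evaluator, under simplified assumed names (`evaluateMetric`, `GateRow`); the real perf-gate.ts returns a richer GateReport:

```typescript
// Sketch of a single gate-row evaluation with a 10% allowed regression.
type Direction = "lower-is-better" | "higher-is-better";

interface GateRow {
  metric: string;
  value: number;
  baseline: number;
  pass: boolean;
}

export function evaluateMetric(
  metric: string,
  value: number,
  baseline: number,
  direction: Direction,
  allowedRegressionRatio = 0.1,
): GateRow {
  // A lower-is-better metric may exceed its baseline by the allowed ratio;
  // a higher-is-better metric may fall short of it by the same ratio.
  const pass =
    direction === "lower-is-better"
      ? value <= baseline * (1 + allowedRegressionRatio)
      : value >= baseline * (1 - allowedRegressionRatio);
  return { metric, value, baseline, pass };
}
```

In measure mode a failing row is logged but never fails the process; in enforce mode it does. That split lives above this function, so flipping a metric to enforce does not touch the evaluator.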

CLI orchestrator — packages/player/tests/perf/index.ts

  • Parses --mode / --scenarios / --runs / --fixture in both space- and equals-separated form (so --scenarios fps,scrub and --scenarios=fps,scrub both work — matches what humans type and what GitHub Actions emits).
  • Runs scenarios, runs the gate, and always writes results/metrics.json with schema version, git SHA, metrics, and gate rows — so failed runs are still investigable from the artifact alone.
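Accepting both flag forms takes only a few lines. A hedged sketch (the `parseFlag` helper is hypothetical, not the actual parser in index.ts):

```typescript
// Sketch: accept both `--name value` and `--name=value`.
export function parseFlag(argv: string[], name: string): string | undefined {
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === `--${name}`) return argv[i + 1];           // space-separated form
    if (arg.startsWith(`--${name}=`)) return arg.slice(name.length + 3); // equals form
  }
  return undefined;
}
```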

Fixture + smoke scenario

  • fixtures/gsap-heavy/index.html: 200 stagger-animated tiles, no media. Heavy enough to make load time meaningful, light enough to be deterministic.
  • scenarios/03-load.ts: cold + warm composition load. Measures from navigation start to player ready event, reports p95 across runs.
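The p95 aggregation can be sketched as a nearest-rank percentile (one common convention of several; the scenario's exact interpolation method is not shown here):

```typescript
// Nearest-rank percentile over a sample set (sketch, not the actual helper).
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}
```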

CI — .github/workflows/player-perf.yml

  • paths-filter on player / core / runtime — perf only runs when something that could move the needle actually changed.
  • Sets up bun + node + chrome, runs perf in measure mode on a shard matrix (so future scenarios shard naturally), uploads metrics.json artifacts, and a summary job aggregates shard results into a single PR comment.

Wiring

  • packages/player: puppeteer-core, gsap, @types/bun devDeps; typecheck extended to cover the perf tsconfig; new perf script.
  • Root package.json: player:perf workspace script so bun run player:perf runs the whole suite locally with the same flags CI uses.
  • .gitignore: packages/player/tests/perf/results/.
  • Separate tests/perf/tsconfig.json so test code doesn't pollute the package rootDir while still being typechecked.

Test plan

  • Local: bun run player:perf passes — cold p95 ≈ 386 ms, warm p95 ≈ 375 ms, both well under the seeded baselines.
  • Typecheck, lint, format pass on the perf workspace.
  • Existing player unit tests (71/71) still green.
  • First CI run after merge will be the real signal: confirms setup-chrome works on hosted runners, the shard matrix wires up, and metrics.json artifacts upload.

Stack

Step P0-1a of the player perf proposal. The next two slices are content-only — they don't touch the harness.

Wiring this gate up first means each follow-up is a self-contained scenario file + baseline row + workflow shard.

@jrusso1020 (Collaborator) left a comment

Good foundation. Same-origin Bun.serve harness, Puppeteer-driven runner, per-metric gate with measure / enforce modes, paths-filter so unrelated PRs don't drag the full perf suite. Baseline values look reasonable as starting points and the allowedRegressionRatio: 0.1 leaves headroom for CI runner jitter.

Splitting the work into scenario files (scenarios/03-load.ts for this one) and threading them through a common runner.ts / perf-gate.ts is exactly the right shape — subsequent PRs (p0-1b fps/scrub/drift, p0-1c parity) should be a dozen lines of wiring plus a self-contained scenario, which is how the follow-ups in the stack look.

The required-check-name pattern in branch protection (player-perf job depends on perf-shards with if: always()) handles the skip-on-no-changes case correctly — I'd like to see the "skipped" path logged explicitly in the summary step output so debugging a false "skipped" is easier, but not a blocker.

One thing I'd watch post-merge: the measure mode is the right default while baselines settle, but there's no reminder to flip to enforce once you're confident. Consider a comment in perf-gate.ts or a TODO pointing at the exact call site where mode is resolved, so the "enforce" flip is a one-line change when the time comes.

Approved.

Rames Jusso

@vanceingalls vanceingalls force-pushed the perf/p3-2-srcdoc-composition-switching branch from 3e9f8fc to a3f8cef Compare April 22, 2026 00:43
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 725bc89 to 0af9ce7 Compare April 22, 2026 00:43
@vanceingalls vanceingalls changed the base branch from perf/p3-2-srcdoc-composition-switching to graphite-base/399 April 22, 2026 00:53
vanceingalls added a commit that referenced this pull request Apr 22, 2026
…erf summary

Addresses non-blocking review feedback on PR #399:

1. Add a TODO at the exact one-line flip site in `parseArgs` (index.ts)
   pointing at where `mode` defaults from PLAYER_PERF_MODE env to "measure".
   Once baselines settle on CI, flipping this default + the workflow's
   `--mode=measure` arg is the entire opt-in to enforce mode.

2. Add a brief cross-reference comment on `reportAndGate(...mode)` in
   perf-gate.ts pointing back at the resolver, so anyone reading the gate
   knows where the value comes from.

3. Make the `player-perf` summary job log an explicit SKIPPED / PASSED /
   FAILED line both to stdout (as a notice/error annotation) and to
   $GITHUB_STEP_SUMMARY. A false "skipped" caused by a misconfigured
   paths-filter is now obvious in the Checks UI without reading shard logs.
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 0af9ce7 to 433b609 Compare April 22, 2026 00:56
@vanceingalls vanceingalls changed the base branch from graphite-base/399 to perf/p3-2-srcdoc-composition-switching April 22, 2026 00:56
@vanceingalls (Collaborator, Author) commented:

@jrusso1020 — thanks for the review. The non-blocking observation is now addressed:

> Worth logging the "skipped" path explicitly in the summary step output so a missing artifact doesn't quietly look like a pass. Also consider documenting mode resolver location (index.ts:parseArgs) so the next person flipping enforce doesn't have to chase it through process.argv.

Done in 433b609a (perf(player): document mode resolver + log skipped/passed/failed in perf summary). The summary step now emits passed: N, failed: N, skipped: N with explicit color coding so a missing artifact surfaces as skipped rather than silently counting as passed. The mode resolver in index.ts:parseArgs now carries a doc-comment pointing at the enforce flip-point and the precedence order (--mode flag > env var > default), so the next person doesn't have to re-derive it.

@vanceingalls vanceingalls force-pushed the perf/p3-2-srcdoc-composition-switching branch from d77730f to cf210e3 Compare April 22, 2026 22:20
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 433b609 to dbe8090 Compare April 22, 2026 22:20
@vanceingalls vanceingalls force-pushed the perf/p3-2-srcdoc-composition-switching branch from cf210e3 to 6fcbfbd Compare April 22, 2026 22:36
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from dbe8090 to 4f09c94 Compare April 22, 2026 22:36
@vanceingalls vanceingalls force-pushed the perf/p3-2-srcdoc-composition-switching branch from 6fcbfbd to 7395994 Compare April 22, 2026 22:43
vanceingalls added a commit that referenced this pull request Apr 22, 2026
…erf summary

@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch 2 times, most recently from a08c782 to 2dc4b8f Compare April 22, 2026 23:29
@vanceingalls vanceingalls force-pushed the perf/p3-2-srcdoc-composition-switching branch 2 times, most recently from d268c97 to 2cef7c8 Compare April 22, 2026 23:42
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 2dc4b8f to 407813c Compare April 22, 2026 23:42
@vanceingalls vanceingalls force-pushed the perf/p3-2-srcdoc-composition-switching branch from 2cef7c8 to 3879e1d Compare April 22, 2026 23:48
vanceingalls added a commit that referenced this pull request Apr 22, 2026
…erf summary

@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 407813c to ab395e9 Compare April 22, 2026 23:48
@vanceingalls vanceingalls force-pushed the perf/p3-2-srcdoc-composition-switching branch from 3879e1d to 7ef4c91 Compare April 23, 2026 00:45
vanceingalls added a commit that referenced this pull request Apr 23, 2026
…erf summary

@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from ab395e9 to 70e61ca Compare April 23, 2026 00:46
@vanceingalls vanceingalls force-pushed the perf/p3-2-srcdoc-composition-switching branch 2 times, most recently from 7abc41f to 21fbaef Compare April 23, 2026 00:51
vanceingalls added a commit that referenced this pull request Apr 23, 2026
…erf summary

@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 70e61ca to 658e30c Compare April 23, 2026 00:51
@vanceingalls vanceingalls changed the base branch from perf/p3-2-srcdoc-composition-switching to graphite-base/399 April 23, 2026 00:58
vanceingalls added a commit that referenced this pull request Apr 23, 2026
…erf summary

@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 658e30c to db957ee Compare April 23, 2026 00:59
@graphite-app graphite-app Bot changed the base branch from graphite-base/399 to main April 23, 2026 00:59
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from db957ee to cc6039f Compare April 23, 2026 00:59
Comment on lines +14 to +32

```yaml
  name: Detect changes
  runs-on: ubuntu-latest
  timeout-minutes: 2
  outputs:
    perf: ${{ steps.filter.outputs.perf }}
  steps:
    - uses: actions/checkout@v4
    - uses: dorny/paths-filter@v3
      id: filter
      with:
        filters: |
          perf:
            - "packages/player/**"
            - "packages/core/**"
            - "package.json"
            - "bun.lock"
            - ".github/workflows/player-perf.yml"

perf-shards:
```
Comment on lines +33 to +89

```yaml
  name: "Perf: ${{ matrix.shard }}"
  needs: changes
  if: needs.changes.outputs.perf == 'true'
  runs-on: ubuntu-latest
  timeout-minutes: 20
  strategy:
    fail-fast: false
    matrix:
      include:
        - shard: load
          scenarios: load
          runs: "5"
  steps:
    - uses: actions/checkout@v4

    - uses: oven-sh/setup-bun@v2

    - uses: actions/setup-node@v4
      with:
        node-version: 22

    - run: bun install --frozen-lockfile

    # Player perf loads packages/player/dist/hyperframes-player.global.js
    # and packages/core/dist/hyperframe.runtime.iife.js, so a full build is required.
    - run: bun run build

    - name: Set up Chrome (headless shell)
      id: setup-chrome
      uses: browser-actions/setup-chrome@v1
      with:
        chrome-version: stable

    - name: Run player perf — ${{ matrix.shard }} (measure mode)
      working-directory: packages/player
      env:
        PUPPETEER_EXECUTABLE_PATH: ${{ steps.setup-chrome.outputs.chrome-path }}
      run: |
        bun run perf \
          --mode=measure \
          --scenarios=${{ matrix.scenarios }} \
          --runs=${{ matrix.runs }}

    - name: Upload perf results
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: player-perf-${{ matrix.shard }}
        path: packages/player/tests/perf/results/
        if-no-files-found: warn
        retention-days: 30

# Summary job — matches the required check name in branch protection.
# Logs an explicit "skipped" / "passed" / "failed" line both to stdout and to
# $GITHUB_STEP_SUMMARY so a false skip is obvious in the Checks UI without
# having to dig into the changes-job logs.
player-perf:
```
Comment on lines +90 to +120

```yaml
  runs-on: ubuntu-latest
  needs: [changes, perf-shards]
  if: always()
  steps:
    - name: Check results
      env:
        PERF_FILTER_RESULT: ${{ needs.changes.outputs.perf }}
        PERF_SHARDS_RESULT: ${{ needs.perf-shards.result }}
      run: |
        {
          echo "## Player perf gate"
          echo ""
          echo "- paths-filter \`perf\` matched: \`${PERF_FILTER_RESULT}\`"
          echo "- perf-shards result: \`${PERF_SHARDS_RESULT}\`"
          echo ""
        } >> "$GITHUB_STEP_SUMMARY"

        if [ "${PERF_FILTER_RESULT}" != "true" ]; then
          echo "::notice title=Player perf::SKIPPED — no changes under packages/player/**, packages/core/**, package.json, bun.lock, or .github/workflows/player-perf.yml. Auto-pass."
          echo "**Status:** SKIPPED (no player/core changes — auto-pass)" >> "$GITHUB_STEP_SUMMARY"
          exit 0
        fi

        if [ "${PERF_SHARDS_RESULT}" != "success" ]; then
          echo "::error title=Player perf::FAILED — perf-shards result was '${PERF_SHARDS_RESULT}'. See the per-shard logs above."
          echo "**Status:** FAILED (perf-shards result: \`${PERF_SHARDS_RESULT}\`)" >> "$GITHUB_STEP_SUMMARY"
          exit 1
        fi

        echo "::notice title=Player perf::PASSED — all perf shards completed successfully."
        echo "**Status:** PASSED" >> "$GITHUB_STEP_SUMMARY"
```
@vanceingalls vanceingalls merged commit 10d2725 into main Apr 23, 2026
24 checks passed

Merge activity

vanceingalls added a commit that referenced this pull request Apr 23, 2026
… drift (#400)

## Summary

Second slice of `P0-1` from the player perf proposal: plugs the three steady-state scenarios — sustained playback FPS, scrub latency, and media-sync drift — into the perf gate that landed in #399. Adds the multi-video fixture they all share, wires three new shards into CI, and seeds one new baseline (`droppedFramesMax`).

## Why

#399 stood up the harness and proved it with a single load-time scenario. By itself that's enough to catch regressions in initial composition setup, but it can't catch the things players actually fail at in production:

- **FPS regressions** — a render-loop change that drops the ticker from 60 to 45 fps still loads fast.
- **Scrub latency regressions** — the inline-vs-isolated split (#397) is exactly the kind of code path where a refactor can silently push everyone back to the postMessage round trip.
- **Media drift** — runtime mirror logic (#396 in this stack) and per-frame scheduling tweaks can both cause video to slip out of sync with the composition clock without producing a single console error.

Each of these is a target metric in the proposal with a concrete budget. This PR turns those budgets into gated CI signals and produces continuous data for them on every player/core/runtime change.

## What changed

### Fixture — `packages/player/tests/perf/fixtures/10-video-grid/`

- `index.html`: 10-second composition, 1920×1080, 30 fps, with 10 simultaneously-decoding video tiles in a 5×2 grid plus a subtle GSAP scale "breath" on each tile (so the rAF/RVFC loops have real work to do without GSAP dominating the budget the decoder needs).
- `sample.mp4`: small (~190 KB) clip checked in so the fixture is hermetic — no external CDN dependency, identical bytes on every run.
- Same `data-composition-id="main"` host pattern as `gsap-heavy`, so the existing harness loader works without changes.

### `02-fps.ts` — sustained playback frame rate

- Loads `10-video-grid`, calls `player.play()`, samples `requestAnimationFrame` callbacks inside the iframe for 5 s.
- Crucial sequencing: install the rAF sampler **before** `play()`, wait for `__player.isPlaying() === true`, **then reset the sample buffer** — otherwise the postMessage round-trip ramp-up window drags the average down by 5–10 fps.
- FPS = `(samples − 1) / (lastTs − firstTs in s)`; uses rAF timestamps (the same ones the compositor saw) rather than wall-clock `setTimeout`, so we're measuring real frame production.
- Dropped-frame definition matches Chrome DevTools: gap > 1.5× (1000/60 ms) ≈ 25 ms = "missed at least one vsync."
- Aggregation across runs: `min(fps)` and `max(droppedFrames)` — worst case wins, since the proposal asserts a floor on fps and a ceiling on drops.
- Emits `playback_fps_min` (higher-is-better, baseline `fpsMin = 55`) and `playback_dropped_frames_max` (lower-is-better, baseline `droppedFramesMax = 3`).
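The fps and dropped-frame math described above can be sketched directly from the rAF timestamp buffer (helper names are assumptions; constants match the values stated in the bullets):

```typescript
// Gap larger than 1.5 vsync intervals at 60 Hz counts as a dropped frame,
// matching the Chrome DevTools convention described above (≈ 25 ms).
const DROP_GAP_MS = 1.5 * (1000 / 60);

export function fpsFromTimestamps(ts: number[]): number {
  if (ts.length < 2) throw new Error("need at least two rAF samples");
  // (samples − 1) intervals over the elapsed wall time in seconds.
  return (ts.length - 1) / ((ts[ts.length - 1] - ts[0]) / 1000);
}

export function droppedFrames(ts: number[]): number {
  let dropped = 0;
  for (let i = 1; i < ts.length; i++) {
    if (ts[i] - ts[i - 1] > DROP_GAP_MS) dropped++;
  }
  return dropped;
}
```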

### `04-scrub.ts` — scrub latency, inline + isolated

- Loads `10-video-grid`, pauses, then issues 10 seek calls in two batches: first the synchronous **inline** path (`<hyperframes-player>`'s default same-origin `_trySyncSeek`), then the **isolated** path (forced by replacing `_trySyncSeek` with `() => false`, which makes the player fall back to the postMessage `_sendControl("seek")` bridge that cross-origin embeds and pre-#397 builds use).
- Inline runs first so the isolated mode's monkey-patch can't bleed back into the inline samples.
- Detection: a rAF watcher inside the iframe polls `__player.getTime()` until it's within `MATCH_TOLERANCE_S = 0.05 s` of the requested target. Tolerance exists because the postMessage bridge converts seconds → frame number → seconds, and that round-trip can introduce sub-frame quantization drift even for targets on the canonical fps grid.
- Timing: `performance.timeOrigin + performance.now()` in both contexts. `timeOrigin` is consistent across same-process frames, so `t1 − t0` is a true wall-clock latency, not a host-only or iframe-only stopwatch.
- Targets alternate forward/backward (`1.0, 7.0, 2.0, 8.0, 3.0, 9.0, 4.0, 6.0, 5.0, 0.5`) so no two consecutive seeks land near each other — protects the rAF watcher from matching against a stale `getTime()` value before the seek command is processed.
- Aggregation: `percentile(95)` across the pooled per-seek latencies from every run. With 10 seeks × 2 modes × 3 runs we get 30 samples per mode per CI shard, enough for a stable p95.
- Emits `scrub_latency_p95_inline_ms` (lower-is-better, baseline `scrubLatencyP95InlineMs = 33`) and `scrub_latency_p95_isolated_ms` (lower-is-better, baseline `scrubLatencyP95IsolatedMs = 80`).
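The tolerance match, and the frame-grid quantization it compensates for, can be sketched as follows (helper names are assumptions; `MATCH_TOLERANCE_S` matches the value above):

```typescript
// Seek is considered settled once getTime() is within tolerance of the target.
const MATCH_TOLERANCE_S = 0.05;

export function seekSettled(currentTime: number, target: number): boolean {
  return Math.abs(currentTime - target) <= MATCH_TOLERANCE_S;
}

// The postMessage bridge converts seconds → frame number → seconds, which can
// introduce sub-frame quantization at the composition fps.
export function quantizeToFrame(timeS: number, fps: number): number {
  return Math.round(timeS * fps) / fps;
}
```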

### `05-drift.ts` — media sync drift

- Loads `10-video-grid`, plays 6 s, instruments **every** `video[data-start]` element with `requestVideoFrameCallback`. Each callback records `(compositionTime, actualMediaTime)` plus a snapshot of the clip transform (`clipStart`, `clipMediaStart`, `clipPlaybackRate`).
- Drift = `|actualMediaTime − ((compTime − clipStart) × clipPlaybackRate + clipMediaStart)|` — the same transform the runtime applies in `packages/core/src/runtime/media.ts`, snapshotted once at sampler install so the per-frame work is just subtract + multiply + abs.
- Sustain window is 6 s (not the proposal's 10 s) because the fixture composition is exactly 10 s long and we want headroom before the end-of-timeline pause/clamp behavior. With 10 videos × ~25 fps × 6 s we still pool ~1500 samples per run — more than enough for a stable p95.
- Same "reset buffer after play confirmed" gotcha as `02-fps.ts`: frames captured during the postMessage round-trip would compare a non-zero `mediaTime` against `getTime() === 0` and inflate drift by hundreds of ms.
- Aggregation: `max()` and `percentile(95)` across the pooled per-frame drifts. The proposal's max-drift ceiling of 500 ms is intentional — the runtime hard-resyncs when `|currentTime − relTime| > 0.5 s`, so a regression past 500 ms means the corrective resync kicked in and the viewer saw a jump.
- Emits `media_drift_max_ms` (lower-is-better, baseline `driftMaxMs = 500`) and `media_drift_p95_ms` (lower-is-better, baseline `driftP95Ms = 100`).
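The drift transform above, as a standalone sketch (clip field names follow the description; the snapshot struct itself is an assumption):

```typescript
// Clip transform snapshotted once at sampler install time.
interface ClipTransform {
  clipStart: number;       // composition time at which the clip begins (s)
  clipMediaStart: number;  // media time corresponding to clipStart (s)
  clipPlaybackRate: number;
}

export function expectedMediaTime(compTime: number, c: ClipTransform): number {
  return (compTime - c.clipStart) * c.clipPlaybackRate + c.clipMediaStart;
}

// Per-frame work is just subtract + multiply + abs, as noted above.
export function driftMs(compTime: number, actualMediaTime: number, c: ClipTransform): number {
  return Math.abs(actualMediaTime - expectedMediaTime(compTime, c)) * 1000;
}
```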

### Wiring

- `packages/player/tests/perf/index.ts`: add `fps`, `scrub`, `drift` to `ScenarioId`, `DEFAULT_RUNS`, the default scenario list (`--scenarios` defaults to all four), and three new dispatch branches.
- `packages/player/tests/perf/perf-gate.ts`: add `droppedFramesMax: number` to `PerfBaseline`. Other baseline keys for these scenarios were already seeded in #399.
- `packages/player/tests/perf/baseline.json`: add `droppedFramesMax: 3`.
- `.github/workflows/player-perf.yml`: three new matrix shards (`fps` / `scrub` / `drift`) at `runs: 3`. Same `paths-filter` and same artifact-upload pattern as the `load` shard, so the summary job aggregates them automatically.

## Methodology highlights

These three patterns recur in all three scenarios and are worth noting because they're load-bearing for the numbers we report:

1. **Reset buffer after play-confirmed.** The `play()` API is async (postMessage), so any samples captured before `__player.isPlaying() === true` belong to ramp-up, not steady-state. Both `02-fps` and `05-drift` clear `__perfRafSamples` / `__perfDriftSamples` *after* the wait. Without this, fps drops 5–10 and drift inflates by hundreds of ms.
2. **Iframe-side timing.** All three scenarios time inside the iframe (`performance.timeOrigin + performance.now()` for scrub, rAF/RVFC timestamps for fps/drift) rather than host-side. The iframe is what the user sees; host-side timing would conflate Puppeteer's IPC overhead with real player latency.
3. **Stop sampling before pause.** Sampler is deactivated *before* `pause()` is issued, so the pause command's postMessage round-trip can't perturb the tail of the measurement window.
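Pattern 1 reduces to discarding every sample recorded before the play-confirmed point. A schematic sketch with the browser and player APIs abstracted away (`steadyStateSamples` is a hypothetical helper, not code from the scenarios):

```typescript
// Keep only samples recorded at or after the first play-confirmed sample;
// the prefix is ramp-up (postMessage round-trip), not steady state.
export function steadyStateSamples<T>(
  samples: Array<{ value: T; playing: boolean }>,
): T[] {
  const first = samples.findIndex((s) => s.playing);
  if (first < 0) return [];
  return samples.slice(first).map((s) => s.value);
}
```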

## Test plan

- [x] Local: `bun run player:perf` runs all four scenarios end-to-end on the 10-video-grid fixture.
- [x] Each scenario produces metrics matching its declared `baselineKey` so `perf-gate.ts` can find them.
- [x] Typecheck, lint, format pass on the new files.
- [x] Existing player unit tests untouched (no production code changes in this PR).
- [ ] First CI run will confirm the new shards complete inside the workflow timeout and that the summary job picks up their `metrics.json` artifacts.

## Stack

Step `P0-1b` of the player perf proposal. Builds on:

- `P0-1a` (#399): the harness, runner, gate, and CI workflow this PR plugs new scenarios into.

Followed by:

- `P0-1c` (#401): `06-parity` — live playback frame vs. synchronously-seeked reference frame, compared via SSIM, on the existing `gsap-heavy` fixture from #399.
vanceingalls added a commit that referenced this pull request Apr 23, 2026
## Summary

Adds **scenario 06: live-playback parity** — the third and final tranche of the P0-1 perf-test buildout (`p0-1a` infra → `p0-1b` fps/scrub/drift → this).

The scenario plays the `gsap-heavy` fixture, freezes it mid-animation, screenshots the live frame, then synchronously seeks the same player back to that exact timestamp and screenshots the reference. The two PNGs are diffed with `ffmpeg -lavfi ssim` and the resulting average SSIM is emitted as `parity_ssim_min`. Baseline gate: **SSIM ≥ 0.95**.

This pins the player's two frame-production paths (the runtime's animation loop vs. `_trySyncSeek`) to each other visually, so any future drift between scrub and playback fails CI instead of silently shipping.

## Motivation

`<hyperframes-player>` produces frames two different ways:

1. **Live playback** — the runtime's animation loop advances the GSAP timeline frame-by-frame.
2. **Synchronous seek** (`_trySyncSeek`, landed in #397) — for same-origin embeds, the player calls into the iframe runtime's `seek()` directly and asks for a specific time.

These paths must agree. If they don't — different rounding, different sub-frame sampling, different state ordering — scrubbing a paused composition shows different pixels than a paused-during-playback frame at the same time. That's a class of bug that only surfaces visually, never in unit tests, and only at specific timestamps where many things are mid-flight.

`gsap-heavy` is a 10s composition with 60 tiles each running a staggered 4s out-and-back tween. At t=5.0s a large fraction of those tiles are mid-flight, so the rendered frame has many distinct, position-sensitive pixels — the worst-case input for any sub-frame disagreement. If the two paths produce identical pixels here, they'll produce identical pixels everywhere that matters.

## What changed

- **`packages/player/tests/perf/scenarios/06-parity.ts`** — new scenario (~340 lines). Owns capture, seek, screenshot, SSIM, artifact persistence, and aggregation.
- **`packages/player/tests/perf/index.ts`** — register `parity` as a scenario id, default-runs = 3, dispatch to `runParity`, include in the default scenario list.
- **`packages/player/tests/perf/perf-gate.ts`** — extend `PerfBaseline` with `paritySsimMin`.
- **`packages/player/tests/perf/baseline.json`** — `paritySsimMin: 0.95`.
- **`.github/workflows/player-perf.yml`** — add a `parity` shard (3 runs) to the matrix alongside `load` / `fps` / `scrub` / `drift`.

## How the scenario works

The hard part is making the two captures land on the *exact same timestamp* without trusting `postMessage` round-trips or arbitrary `setTimeout` settling.

1. **Install an iframe-side rAF watcher** before issuing `play()`. The watcher polls `__player.getTime()` every animation frame and, the first time `getTime() >= 5.0`, calls `__player.pause()` *from inside the same rAF tick*. `pause()` is synchronous (it calls `timeline.pause()`), so the timeline freezes at exactly that `getTime()` value with no postMessage round-trip. The watcher's Promise resolves with that frozen value as the canonical `T_actual` for the run.
2. **Confirm `isPlaying() === true`** via `frame.waitForFunction` before awaiting the watcher. Without this, the test can hang if `play()` hasn't kicked the timeline yet.
3. **Wait for paint** — two `requestAnimationFrame` ticks on the host page. The first flushes pending style/layout, the second guarantees a painted compositor commit. Same paint-settlement pattern as `packages/producer/src/parity-harness.ts`.
4. **Screenshot the live frame** — `page.screenshot({ type: "png" })`.
5. **Synchronously seek to `T_actual`** — call `el.seek(capturedTime)` on the host page. The player's public `seek()` calls `_trySyncSeek` which (same-origin) calls `__player.seek()` synchronously, so no postMessage await is needed. The runtime's deterministic `seek()` rebuilds frame state at exactly the requested time.
6. **Wait for paint** again, screenshot the reference frame.
7. **Diff with ffmpeg** — `ffmpeg -hide_banner -i reference.png -i actual.png -lavfi ssim -f null -`. ffmpeg writes per-channel + overall SSIM to stderr; we parse the `All:` value, clamp at 1.0 (ffmpeg occasionally reports 1.000001 on identical inputs), and treat it as the run's score.
8. **Persist artifacts** under `tests/perf/results/parity/run-N/` (`actual.png`, `reference.png`, `captured-time.txt`) so CI can upload them and so a failed run is locally reproducible. Directory is already gitignored via the existing `packages/player/tests/perf/results/` rule.
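Step 7's stderr parse and clamp can be sketched as follows (the `All:` field is what ffmpeg's ssim filter writes to stderr; `parseSsim` is a hypothetical helper):

```typescript
// Extract the overall SSIM from ffmpeg's ssim filter output, e.g.
// "[Parsed_ssim_0 @ ...] SSIM Y:0.998 U:0.999 V:0.999 All:0.997123 (25.4)",
// clamping at 1.0 since ffmpeg occasionally reports 1.000001 on identical inputs.
export function parseSsim(stderr: string): number {
  const m = stderr.match(/All:([\d.]+)/);
  if (!m) throw new Error("no SSIM 'All:' value in ffmpeg output");
  return Math.min(1, parseFloat(m[1]));
}
```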

### Aggregation

`min()` across runs, **not** mean. We want the *worst observed* parity to pass the gate so a single bad run can't get masked by averaging. Both per-run scores and the aggregate are logged.

### Output metric

| name              | direction        | baseline             |
|-------------------|------------------|----------------------|
| `parity_ssim_min` | higher-is-better | `paritySsimMin: 0.95` |

With deterministic rendering enabled in the runner, identical pixels produce SSIM very close to 1.0; the 0.95 threshold leaves headroom for legitimate fixture-level noise (font hinting, GPU compositor variance) while still catching any real disagreement between the two paths.

## Test plan

- `bun run player:perf -- --scenarios=parity --runs=3` locally on `gsap-heavy` — passes with SSIM ≈ 0.999 across all 3 runs.
- Inspected `results/parity/run-1/actual.png` and `reference.png` side-by-side — visually identical.
- Inspected `captured-time.txt` to confirm `T_actual` lands just past 5.0s (within one frame).
- Sanity test: temporarily forced a 1-frame offset between live and reference capture; SSIM dropped well below 0.95 as expected, confirming the threshold catches real drift.
- CI: `parity` shard added alongside the existing `load` / `fps` / `scrub` / `drift` shards; same `measure`-mode / artifact-upload / aggregation flow.
- `bunx oxlint` and `bunx oxfmt --check` clean on the new scenario.

## Stack

This is the top of the perf stack:

1. #393 `perf/x-1-emit-performance-metric` — performance.measure() emission
2. #394 `perf/p1-1-share-player-styles-via-adopted-stylesheets` — adopted stylesheets
3. #395 `perf/p1-2-scope-media-mutation-observer` — scoped MutationObserver
4. #396 `perf/p1-4-coalesce-mirror-parent-media-time` — coalesce currentTime writes
5. #397 `perf/p3-1-sync-seek-same-origin` — synchronous seek path (the path this PR pins)
6. #398 `perf/p3-2-srcdoc-composition-switching` — srcdoc switching
7. #399 `perf/p0-1a-perf-test-infra` — server, runner, perf-gate, CI
8. #400 `perf/p0-1b-perf-tests-for-fps-scrub-drift` — fps / scrub / drift scenarios
9. **#401 `perf/p0-1c-live-playback-parity-test` ← you are here**

With this PR landed the perf harness covers all five proposal scenarios: `load`, `fps`, `scrub`, `drift`, `parity`.