perf(player): p0-1a perf test infra + composition-load smoke test (#399)
vanceingalls merged 1 commit into `main`.
Conversation
jrusso1020 left a comment:
Good foundation. Same-origin `Bun.serve` harness, Puppeteer-driven runner, per-metric gate with measure/enforce modes, and a paths-filter so unrelated PRs don't drag in the full perf suite. The baseline values look reasonable as starting points, and the `allowedRegressionRatio: 0.1` leaves headroom for CI runner jitter.
Splitting the work into scenario files (scenarios/03-load.ts for this one) and threading them through a common runner.ts / perf-gate.ts is exactly the right shape — subsequent PRs (p0-1b fps/scrub/drift, p0-1c parity) should be a dozen lines of wiring plus a self-contained scenario, which is how the follow-ups in the stack look.
The required-check-name pattern in branch protection (player-perf job depends on perf-shards with if: always()) handles the skip-on-no-changes case correctly — I'd like to see the "skipped" path logged explicitly in the summary step output so debugging a false "skipped" is easier, but not a blocker.
One thing I'd watch post-merge: the measure mode is the right default while baselines settle, but there's no reminder to flip to enforce once you're confident. Consider a comment in perf-gate.ts or a TODO pointing at the exact call site where mode is resolved, so the "enforce" flip is a one-line change when the time comes.
Approved.
— Rames Jusso
Force-pushed from 3e9f8fc to a3f8cef.
…erf summary

Addresses non-blocking review feedback on PR #399:

1. Add a TODO at the exact one-line flip site in `parseArgs` (index.ts) pointing at where `mode` defaults from the `PLAYER_PERF_MODE` env var to `"measure"`. Once baselines settle on CI, flipping this default plus the workflow's `--mode=measure` arg is the entire opt-in to enforce mode.
2. Add a brief cross-reference comment on `reportAndGate(...mode)` in perf-gate.ts pointing back at the resolver, so anyone reading the gate knows where the value comes from.
3. Make the `player-perf` summary job log an explicit SKIPPED / PASSED / FAILED line both to stdout (as a notice/error annotation) and to `$GITHUB_STEP_SUMMARY`. A false "skipped" caused by a misconfigured paths-filter is now obvious in the Checks UI without reading shard logs.
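A minimal sketch of the resolver shape this commit describes — `resolveMode` is a hypothetical stand-in for the logic inside `parseArgs`; only the `PLAYER_PERF_MODE` env var, the `"measure"` default, and the TODO come from the commit message above:

```typescript
// Hypothetical sketch of the mode resolution described in the commit message.
type PerfMode = "measure" | "enforce";

function resolveMode(
  cliValue: string | undefined,
  env: Record<string, string | undefined>,
): PerfMode {
  // Explicit CLI flag wins over the environment.
  const raw = cliValue ?? env.PLAYER_PERF_MODE ?? "measure";
  // TODO(perf-gate): flip this default to "enforce" (and drop the workflow's
  // --mode=measure arg) once baselines settle on CI — this is the one-line flip site.
  if (raw !== "measure" && raw !== "enforce") {
    throw new Error(`invalid perf mode: ${raw}`);
  }
  return raw;
}
```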
@jrusso1020 — thanks for the review. The non-blocking observation is now addressed; done in the commit above.
From `.github/workflows/player-perf.yml`:

```yaml
changes:
  name: Detect changes
  runs-on: ubuntu-latest
  timeout-minutes: 2
  outputs:
    perf: ${{ steps.filter.outputs.perf }}
  steps:
    - uses: actions/checkout@v4
    - uses: dorny/paths-filter@v3
      id: filter
      with:
        filters: |
          perf:
            - "packages/player/**"
            - "packages/core/**"
            - "package.json"
            - "bun.lock"
            - ".github/workflows/player-perf.yml"

perf-shards:
  name: "Perf: ${{ matrix.shard }}"
  needs: changes
  if: needs.changes.outputs.perf == 'true'
  runs-on: ubuntu-latest
  timeout-minutes: 20
  strategy:
    fail-fast: false
    matrix:
      include:
        - shard: load
          scenarios: load
          runs: "5"
  steps:
    - uses: actions/checkout@v4

    - uses: oven-sh/setup-bun@v2

    - uses: actions/setup-node@v4
      with:
        node-version: 22

    - run: bun install --frozen-lockfile

    # Player perf loads packages/player/dist/hyperframes-player.global.js
    # and packages/core/dist/hyperframe.runtime.iife.js, so a full build is required.
    - run: bun run build

    - name: Set up Chrome (headless shell)
      id: setup-chrome
      uses: browser-actions/setup-chrome@v1
      with:
        chrome-version: stable

    - name: Run player perf — ${{ matrix.shard }} (measure mode)
      working-directory: packages/player
      env:
        PUPPETEER_EXECUTABLE_PATH: ${{ steps.setup-chrome.outputs.chrome-path }}
      run: |
        bun run perf \
          --mode=measure \
          --scenarios=${{ matrix.scenarios }} \
          --runs=${{ matrix.runs }}

    - name: Upload perf results
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: player-perf-${{ matrix.shard }}
        path: packages/player/tests/perf/results/
        if-no-files-found: warn
        retention-days: 30

# Summary job — matches the required check name in branch protection.
# Logs an explicit "skipped" / "passed" / "failed" line both to stdout and to
# $GITHUB_STEP_SUMMARY so a false skip is obvious in the Checks UI without
# having to dig into the changes-job logs.
player-perf:
  runs-on: ubuntu-latest
  needs: [changes, perf-shards]
  if: always()
  steps:
    - name: Check results
      env:
        PERF_FILTER_RESULT: ${{ needs.changes.outputs.perf }}
        PERF_SHARDS_RESULT: ${{ needs.perf-shards.result }}
      run: |
        {
          echo "## Player perf gate"
          echo ""
          echo "- paths-filter \`perf\` matched: \`${PERF_FILTER_RESULT}\`"
          echo "- perf-shards result: \`${PERF_SHARDS_RESULT}\`"
          echo ""
        } >> "$GITHUB_STEP_SUMMARY"

        if [ "${PERF_FILTER_RESULT}" != "true" ]; then
          echo "::notice title=Player perf::SKIPPED — no changes under packages/player/**, packages/core/**, package.json, bun.lock, or .github/workflows/player-perf.yml. Auto-pass."
          echo "**Status:** SKIPPED (no player/core changes — auto-pass)" >> "$GITHUB_STEP_SUMMARY"
          exit 0
        fi

        if [ "${PERF_SHARDS_RESULT}" != "success" ]; then
          echo "::error title=Player perf::FAILED — perf-shards result was '${PERF_SHARDS_RESULT}'. See the per-shard logs above."
          echo "**Status:** FAILED (perf-shards result: \`${PERF_SHARDS_RESULT}\`)" >> "$GITHUB_STEP_SUMMARY"
          exit 1
        fi

        echo "::notice title=Player perf::PASSED — all perf shards completed successfully."
        echo "**Status:** PASSED" >> "$GITHUB_STEP_SUMMARY"
```
Merge activity
… drift (#400)

## Summary

Second slice of `P0-1` from the player perf proposal: plugs the three steady-state scenarios — sustained playback FPS, scrub latency, and media-sync drift — into the perf gate that landed in #399. Adds the multi-video fixture they all share, wires three new shards into CI, and seeds one new baseline (`droppedFramesMax`).

## Why

#399 stood up the harness and proved it with a single load-time scenario. By itself that's enough to catch regressions in initial composition setup, but it can't catch the things players actually fail at in production:

- **FPS regressions** — a render-loop change that drops the ticker from 60 to 45 fps still loads fast.
- **Scrub latency regressions** — the inline-vs-isolated split (#397) is exactly the kind of code path where a refactor can silently push everyone back to the postMessage round trip.
- **Media drift** — runtime mirror logic (#396 in this stack) and per-frame scheduling tweaks can both cause video to slip out of sync with the composition clock without producing a single console error.

Each of these is a target metric in the proposal with a concrete budget. This PR turns those budgets into gated CI signals and produces continuous data for them on every player/core/runtime change.

## What changed

### Fixture — `packages/player/tests/perf/fixtures/10-video-grid/`

- `index.html`: 10-second composition, 1920×1080, 30 fps, with 10 simultaneously-decoding video tiles in a 5×2 grid plus a subtle GSAP scale "breath" on each tile (so the rAF/RVFC loops have real work to do without GSAP dominating the budget the decoder needs).
- `sample.mp4`: small (~190 KB) clip checked in so the fixture is hermetic — no external CDN dependency, identical bytes on every run.
- Same `data-composition-id="main"` host pattern as `gsap-heavy`, so the existing harness loader works without changes.

### `02-fps.ts` — sustained playback frame rate

- Loads `10-video-grid`, calls `player.play()`, samples `requestAnimationFrame` callbacks inside the iframe for 5 s.
- Crucial sequencing: install the rAF sampler **before** `play()`, wait for `__player.isPlaying() === true`, **then reset the sample buffer** — otherwise the postMessage round-trip ramp-up window drags the average down by 5–10 fps.
- FPS = `(samples − 1) / (lastTs − firstTs in s)`; uses rAF timestamps (the same ones the compositor saw) rather than wall-clock `setTimeout`, so we're measuring real frame production.
- Dropped-frame definition matches Chrome DevTools: gap > 1.5× (1000/60 ms) ≈ 25 ms = "missed at least one vsync."
- Aggregation across runs: `min(fps)` and `max(droppedFrames)` — worst case wins, since the proposal asserts a floor on fps and a ceiling on drops.
- Emits `playback_fps_min` (higher-is-better, baseline `fpsMin = 55`) and `playback_dropped_frames_max` (lower-is-better, baseline `droppedFramesMax = 3`).

### `04-scrub.ts` — scrub latency, inline + isolated

- Loads `10-video-grid`, pauses, then issues 10 seek calls in two batches: first the synchronous **inline** path (`<hyperframes-player>`'s default same-origin `_trySyncSeek`), then the **isolated** path (forced by replacing `_trySyncSeek` with `() => false`, which makes the player fall back to the postMessage `_sendControl("seek")` bridge that cross-origin embeds and pre-#397 builds use).
- Inline runs first so the isolated mode's monkey-patch can't bleed back into the inline samples.
- Detection: a rAF watcher inside the iframe polls `__player.getTime()` until it's within `MATCH_TOLERANCE_S = 0.05 s` of the requested target. Tolerance exists because the postMessage bridge converts seconds → frame number → seconds, and that round-trip can introduce sub-frame quantization drift even for targets on the canonical fps grid.
- Timing: `performance.timeOrigin + performance.now()` in both contexts. `timeOrigin` is consistent across same-process frames, so `t1 − t0` is a true wall-clock latency, not a host-only or iframe-only stopwatch.
- Targets alternate forward/backward (`1.0, 7.0, 2.0, 8.0, 3.0, 9.0, 4.0, 6.0, 5.0, 0.5`) so no two consecutive seeks land near each other — protects the rAF watcher from matching against a stale `getTime()` value before the seek command is processed.
- Aggregation: `percentile(95)` across the pooled per-seek latencies from every run. With 10 seeks × 2 modes × 3 runs we get 30 samples per mode per CI shard, enough for a stable p95.
- Emits `scrub_latency_p95_inline_ms` (lower-is-better, baseline `scrubLatencyP95InlineMs = 33`) and `scrub_latency_p95_isolated_ms` (lower-is-better, baseline `scrubLatencyP95IsolatedMs = 80`).

### `05-drift.ts` — media sync drift

- Loads `10-video-grid`, plays 6 s, instruments **every** `video[data-start]` element with `requestVideoFrameCallback`. Each callback records `(compositionTime, actualMediaTime)` plus a snapshot of the clip transform (`clipStart`, `clipMediaStart`, `clipPlaybackRate`).
- Drift = `|actualMediaTime − ((compTime − clipStart) × clipPlaybackRate + clipMediaStart)|` — the same transform the runtime applies in `packages/core/src/runtime/media.ts`, snapshotted once at sampler install so the per-frame work is just subtract + multiply + abs.
- Sustain window is 6 s (not the proposal's 10 s) because the fixture composition is exactly 10 s long and we want headroom before the end-of-timeline pause/clamp behavior. With 10 videos × ~25 fps × 6 s we still pool ~1500 samples per run — more than enough for a stable p95.
- Same "reset buffer after play confirmed" gotcha as `02-fps.ts`: frames captured during the postMessage round-trip would compare a non-zero `mediaTime` against `getTime() === 0` and inflate drift by hundreds of ms.
- Aggregation: `max()` and `percentile(95)` across the pooled per-frame drifts. The proposal's max-drift ceiling of 500 ms is intentional — the runtime hard-resyncs when `|currentTime − relTime| > 0.5 s`, so a regression past 500 ms means the corrective resync kicked in and the viewer saw a jump.
- Emits `media_drift_max_ms` (lower-is-better, baseline `driftMaxMs = 500`) and `media_drift_p95_ms` (lower-is-better, baseline `driftP95Ms = 100`).

### Wiring

- `packages/player/tests/perf/index.ts`: add `fps`, `scrub`, `drift` to `ScenarioId`, `DEFAULT_RUNS`, the default scenario list (`--scenarios` defaults to all four), and three new dispatch branches.
- `packages/player/tests/perf/perf-gate.ts`: add `droppedFramesMax: number` to `PerfBaseline`. Other baseline keys for these scenarios were already seeded in #399.
- `packages/player/tests/perf/baseline.json`: add `droppedFramesMax: 3`.
- `.github/workflows/player-perf.yml`: three new matrix shards (`fps` / `scrub` / `drift`) at `runs: 3`. Same `paths-filter` and same artifact-upload pattern as the `load` shard, so the summary job aggregates them automatically.

## Methodology highlights

These three patterns recur in all three scenarios and are worth noting because they're load-bearing for the numbers we report:

1. **Reset buffer after play-confirmed.** The `play()` API is async (postMessage), so any samples captured before `__player.isPlaying() === true` belong to ramp-up, not steady-state. Both `02-fps` and `05-drift` clear `__perfRafSamples` / `__perfDriftSamples` *after* the wait. Without this, fps drops 5–10 and drift inflates by hundreds of ms.
2. **Iframe-side timing.** All three scenarios time inside the iframe (`performance.timeOrigin + performance.now()` for scrub, rAF/RVFC timestamps for fps/drift) rather than host-side. The iframe is what the user sees; host-side timing would conflate Puppeteer's IPC overhead with real player latency.
3. **Stop sampling before pause.** The sampler is deactivated *before* `pause()` is issued, so the pause command's postMessage round-trip can't perturb the tail of the measurement window.

## Test plan

- [x] Local: `bun run player:perf` runs all four scenarios end-to-end on the 10-video-grid fixture.
- [x] Each scenario produces metrics matching its declared `baselineKey` so `perf-gate.ts` can find them.
- [x] Typecheck, lint, format pass on the new files.
- [x] Existing player unit tests untouched (no production code changes in this PR).
- [ ] First CI run will confirm the new shards complete inside the workflow timeout and that the summary job picks up their `metrics.json` artifacts.

## Stack

Step `P0-1b` of the player perf proposal. Builds on:

- `P0-1a` (#399): the harness, runner, gate, and CI workflow this PR plugs new scenarios into.

Followed by:

- `P0-1c` (#401): `06-parity` — live playback frame vs. synchronously-seeked reference frame, compared via SSIM, on the existing `gsap-heavy` fixture from #399.
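The fps and dropped-frame arithmetic described for `02-fps.ts` can be sketched as a pure function over rAF timestamps — `summarizeRafSamples` is a hypothetical name; only the `(samples − 1) / elapsed` formula and the 1.5× vsync gap threshold come from the description above:

```typescript
// Sketch: derive fps and dropped-frame count from a buffer of
// requestAnimationFrame timestamps (milliseconds, as the compositor saw them).
const DROP_THRESHOLD_MS = 1.5 * (1000 / 60); // ≈ 25 ms — missed at least one vsync

function summarizeRafSamples(timestamps: number[]): { fps: number; droppedFrames: number } {
  if (timestamps.length < 2) return { fps: 0, droppedFrames: 0 };
  const elapsedS = (timestamps[timestamps.length - 1] - timestamps[0]) / 1000;
  // FPS = (samples − 1) intervals over the elapsed window, in seconds.
  const fps = (timestamps.length - 1) / elapsedS;
  let droppedFrames = 0;
  for (let i = 1; i < timestamps.length; i++) {
    if (timestamps[i] - timestamps[i - 1] > DROP_THRESHOLD_MS) droppedFrames++;
  }
  return { fps, droppedFrames };
}
```

The worst-case aggregation across runs would then be `Math.min(...fpsPerRun)` and `Math.max(...dropsPerRun)`.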
## Summary

Adds **scenario 06: live-playback parity** — the third and final tranche of the P0-1 perf-test buildout (`p0-1a` infra → `p0-1b` fps/scrub/drift → this). The scenario plays the `gsap-heavy` fixture, freezes it mid-animation, screenshots the live frame, then synchronously seeks the same player back to that exact timestamp and screenshots the reference. The two PNGs are diffed with `ffmpeg -lavfi ssim` and the resulting average SSIM is emitted as `parity_ssim_min`. Baseline gate: **SSIM ≥ 0.95**.

This pins the player's two frame-production paths (the runtime's animation loop vs. `_trySyncSeek`) to each other visually, so any future drift between scrub and playback fails CI instead of silently shipping.

## Motivation

`<hyperframes-player>` produces frames two different ways:

1. **Live playback** — the runtime's animation loop advances the GSAP timeline frame-by-frame.
2. **Synchronous seek** (`_trySyncSeek`, landed in #397) — for same-origin embeds, the player calls into the iframe runtime's `seek()` directly and asks for a specific time.

These paths must agree. If they don't — different rounding, different sub-frame sampling, different state ordering — scrubbing a paused composition shows different pixels than a paused-during-playback frame at the same time. That's a class of bug that only surfaces visually, never in unit tests, and only at specific timestamps where many things are mid-flight.

`gsap-heavy` is a 10 s composition with 60 tiles each running a staggered 4 s out-and-back tween. At t = 5.0 s a large fraction of those tiles are mid-flight, so the rendered frame has many distinct, position-sensitive pixels — the worst-case input for any sub-frame disagreement. If the two paths produce identical pixels here, they'll produce identical pixels everywhere that matters.

## What changed

- **`packages/player/tests/perf/scenarios/06-parity.ts`** — new scenario (~340 lines). Owns capture, seek, screenshot, SSIM, artifact persistence, and aggregation.
- **`packages/player/tests/perf/index.ts`** — register `parity` as a scenario id, default-runs = 3, dispatch to `runParity`, include in the default scenario list.
- **`packages/player/tests/perf/perf-gate.ts`** — extend `PerfBaseline` with `paritySsimMin`.
- **`packages/player/tests/perf/baseline.json`** — `paritySsimMin: 0.95`.
- **`.github/workflows/player-perf.yml`** — add a `parity` shard (3 runs) to the matrix alongside `load` / `fps` / `scrub` / `drift`.

## How the scenario works

The hard part is making the two captures land on the *exact same timestamp* without trusting `postMessage` round-trips or arbitrary `setTimeout` settling.

1. **Install an iframe-side rAF watcher** before issuing `play()`. The watcher polls `__player.getTime()` every animation frame and, the first time `getTime() >= 5.0`, calls `__player.pause()` *from inside the same rAF tick*. `pause()` is synchronous (it calls `timeline.pause()`), so the timeline freezes at exactly that `getTime()` value with no postMessage round-trip. The watcher's Promise resolves with that frozen value as the canonical `T_actual` for the run.
2. **Confirm `isPlaying() === true`** via `frame.waitForFunction` before awaiting the watcher. Without this, the test can hang if `play()` hasn't kicked the timeline yet.
3. **Wait for paint** — two `requestAnimationFrame` ticks on the host page. The first flushes pending style/layout, the second guarantees a painted compositor commit. Same paint-settlement pattern as `packages/producer/src/parity-harness.ts`.
4. **Screenshot the live frame** — `page.screenshot({ type: "png" })`.
5. **Synchronously seek to `T_actual`** — call `el.seek(capturedTime)` on the host page. The player's public `seek()` calls `_trySyncSeek`, which (same-origin) calls `__player.seek()` synchronously, so no postMessage await is needed. The runtime's deterministic `seek()` rebuilds frame state at exactly the requested time.
6. **Wait for paint** again, then screenshot the reference frame.
7. **Diff with ffmpeg** — `ffmpeg -hide_banner -i reference.png -i actual.png -lavfi ssim -f null -`. ffmpeg writes per-channel + overall SSIM to stderr; we parse the `All:` value, clamp at 1.0 (ffmpeg occasionally reports 1.000001 on identical inputs), and treat it as the run's score.
8. **Persist artifacts** under `tests/perf/results/parity/run-N/` (`actual.png`, `reference.png`, `captured-time.txt`) so CI can upload them and a failed run is locally reproducible. The directory is already gitignored via the existing `packages/player/tests/perf/results/` rule.

### Aggregation

`min()` across runs, **not** mean. We want the *worst observed* parity to pass the gate so a single bad run can't be masked by averaging. Both per-run scores and the aggregate are logged.

### Output metric

| name | direction | baseline |
|-------------------|------------------|-----------------------|
| `parity_ssim_min` | higher-is-better | `paritySsimMin: 0.95` |

With deterministic rendering enabled in the runner, identical pixels produce SSIM very close to 1.0; the 0.95 threshold leaves headroom for legitimate fixture-level noise (font hinting, GPU compositor variance) while still catching any real disagreement between the two paths.

## Test plan

- `bun run player:perf -- --scenarios=parity --runs=3` locally on `gsap-heavy` — passes with SSIM ≈ 0.999 across all 3 runs.
- Inspected `results/parity/run-1/actual.png` and `reference.png` side-by-side — visually identical.
- Inspected `captured-time.txt` to confirm `T_actual` lands just past 5.0 s (within one frame).
- Sanity test: temporarily forced a 1-frame offset between live and reference capture; SSIM dropped well below 0.95 as expected, confirming the threshold catches real drift.
- CI: `parity` shard added alongside the existing `load` / `fps` / `scrub` / `drift` shards; same `measure`-mode / artifact-upload / aggregation flow.
- `bunx oxlint` and `bunx oxfmt --check` clean on the new scenario.

## Stack

This is the top of the perf stack:

1. #393 `perf/x-1-emit-performance-metric` — performance.measure() emission
2. #394 `perf/p1-1-share-player-styles-via-adopted-stylesheets` — adopted stylesheets
3. #395 `perf/p1-2-scope-media-mutation-observer` — scoped MutationObserver
4. #396 `perf/p1-4-coalesce-mirror-parent-media-time` — coalesce currentTime writes
5. #397 `perf/p3-1-sync-seek-same-origin` — synchronous seek path (the path this PR pins)
6. #398 `perf/p3-2-srcdoc-composition-switching` — srcdoc switching
7. #399 `perf/p0-1a-perf-test-infra` — server, runner, perf-gate, CI
8. #400 `perf/p0-1b-perf-tests-for-fps-scrub-drift` — fps / scrub / drift scenarios
9. **#401 `perf/p0-1c-live-playback-parity-test` ← you are here**

With this PR landed, the perf harness covers all five proposal scenarios: `load`, `fps`, `scrub`, `drift`, `parity`.
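The stderr parsing in step 7 of the parity scenario could look like this — `parseSsim` is a hypothetical helper; the `All:` token and the ≥ 1.0 clamp come from the description, and ffmpeg's `ssim` filter prints a stderr line of the shape `SSIM Y:… U:… V:… All:0.976543 (16.1)`:

```typescript
// Sketch: pull the overall SSIM out of ffmpeg's `-lavfi ssim` stderr output.
// Clamp at 1.0 because ffmpeg occasionally reports 1.000001 on identical inputs.
function parseSsim(stderr: string): number {
  const match = stderr.match(/\bAll:([0-9.]+)/);
  if (match === null) {
    throw new Error("no SSIM `All:` value found in ffmpeg output");
  }
  return Math.min(Number.parseFloat(match[1]), 1.0);
}
```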

## Summary

First slice of `P0-1` from the player perf proposal: lays the foundation for a player perf gate so later PRs can plug in fps / scrub / drift / parity scenarios without rebuilding infrastructure. Ships one smoke scenario (`03-load`, cold + warm composition load) to prove the gate end-to-end on real numbers.

## Why

There was no automated way to catch player perf regressions. Every perf concern in the existing proposal — composition load time, sustained FPS, scrub p95, mirror-clock drift, live-vs-seek parity — needs the same plumbing: a same-origin harness, a Puppeteer runner, a baseline file, a gate that emits structured results, and a CI workflow that runs the right scenarios on the right changes. Building that up-front in one reviewable PR lets every subsequent perf PR (`P0-1b`, `P0-1c`, and beyond) be a 100-line scenario file plus a baseline entry instead of re-litigating the framework.

## What changed
### Harness — `packages/player/tests/perf/server.ts`

- `Bun.serve` on a free port: a single same-origin host for the player IIFE bundle, the hyperframe runtime, GSAP from `node_modules`, and fixture HTML. A cross-origin host would route everything through `postMessage`, hiding bugs and inflating numbers in ways production never sees. Tests should measure the path the studio editor actually takes.
- `/player.js` → built IIFE bundle (rebuilt on demand).
- `/vendor/runtime.js`, `/vendor/gsap.min.js` → resolved from `node_modules` so fixtures don't need to ship copies.
- `/fixtures/*` → fixture HTML.
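A hedged sketch of how those routes could be resolved — `resolveRoute` and the `Route` type are hypothetical (the real `server.ts` isn't shown here); only the four route shapes come from the list above. `Bun.serve`'s `fetch` handler would map the resolved route to a `Response`:

```typescript
// Sketch: map harness URL paths to the artifacts listed above.
// Names are illustrative, not the real server.ts implementation.
type Route =
  | { kind: "player-bundle" }        // /player.js → built IIFE bundle
  | { kind: "vendor"; file: string } // /vendor/*  → resolved from node_modules
  | { kind: "fixture"; file: string }// /fixtures/* → fixture HTML
  | { kind: "not-found" };

function resolveRoute(pathname: string): Route {
  if (pathname === "/player.js") return { kind: "player-bundle" };
  if (pathname.startsWith("/vendor/")) {
    return { kind: "vendor", file: pathname.slice("/vendor/".length) };
  }
  if (pathname.startsWith("/fixtures/")) {
    return { kind: "fixture", file: pathname.slice("/fixtures/".length) };
  }
  return { kind: "not-found" };
}
```

Keeping the routing pure like this makes it unit-testable without starting a server.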
### Runner — `packages/player/tests/perf/runner.ts`

- `puppeteer-core` thin wrappers (`launchBrowser`, `loadHostPage`).
- Chrome comes from `setup-chrome` in CI rather than the bundled puppeteer revision — keeps the action smaller, lets us pin Chrome version policy at the workflow level, and matches what users actually run.
### Gate — `packages/player/tests/perf/perf-gate.ts` + `baseline.json`

- Per-metric baselines in `baseline.json` (initial budgets: cold/warm comp load, fps, scrub p95 isolated/inline, drift max/p95) with a 10% `allowedRegressionRatio`.
- Each metric declares a direction (`lower-is-better` / `higher-is-better`) so the same evaluator handles latency and throughput.
- Structured `GateReport` consumed by both the CLI (table output) and `metrics.json` (CI artifact).
- Two modes: `measure` (log only — used during the rollout) and `enforce` (fail the build) — flip per-metric once we trust the signal, without touching the harness.
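The direction-aware check can be sketched as a single evaluator — `passesGate` is a hypothetical name; the 10% `allowedRegressionRatio` and the two direction strings come from the bullets above:

```typescript
// Sketch: one evaluator for both latency-style and throughput-style metrics.
type Direction = "lower-is-better" | "higher-is-better";

function passesGate(
  value: number,
  baseline: number,
  direction: Direction,
  allowedRegressionRatio = 0.1, // 10% headroom for CI runner jitter
): boolean {
  return direction === "lower-is-better"
    ? value <= baseline * (1 + allowedRegressionRatio)
    : value >= baseline * (1 - allowedRegressionRatio);
}
```

In `measure` mode a failing row would be logged rather than failing the build; `enforce` mode would turn the same boolean into a non-zero exit code.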
### CLI orchestrator — `packages/player/tests/perf/index.ts`

- Parses `--mode` / `--scenarios` / `--runs` / `--fixture` in both space- and equals-separated form (so `--scenarios fps,scrub` and `--scenarios=fps,scrub` both work — matches what humans type and what GitHub Actions emits).
- Writes `results/metrics.json` with schema version, git SHA, metrics, and gate rows — so failed runs are still investigable from the artifact alone.
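Accepting both `--flag value` and `--flag=value` can be sketched like this — `parseFlags` is a hypothetical helper; only the flag names and the two accepted forms come from the bullet above:

```typescript
// Sketch: parse --mode/--scenarios/--runs/--fixture in both
// space-separated ("--runs 3") and equals-separated ("--runs=3") form.
function parseFlags(argv: string[]): Record<string, string> {
  const flags: Record<string, string> = {};
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (!arg.startsWith("--")) continue;
    const eq = arg.indexOf("=");
    if (eq !== -1) {
      flags[arg.slice(2, eq)] = arg.slice(eq + 1); // --scenarios=fps,scrub
    } else {
      flags[arg.slice(2)] = argv[++i] ?? "";       // --scenarios fps,scrub
    }
  }
  return flags;
}
```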
### Fixture + smoke scenario

- `fixtures/gsap-heavy/index.html`: 200 stagger-animated tiles, no media. Heavy enough to make load time meaningful, light enough to be deterministic.
- `scenarios/03-load.ts`: cold + warm composition load. Measures from navigation start to the player `ready` event; reports p95 across runs.
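The p95 reporting mentioned above could use a nearest-rank percentile like this — `percentile` is a hypothetical helper, and nearest-rank is only one common choice; the interpolation the harness actually uses isn't shown in this PR:

```typescript
// Sketch: nearest-rank percentile over pooled samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: smallest value with at least p% of the samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}
```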
### CI — `.github/workflows/player-perf.yml`

- `paths-filter` on player / core / runtime — perf only runs when something that could move the needle actually changed.
- Runs in `measure` mode on a shard matrix (so future scenarios shard naturally), uploads `metrics.json` artifacts, and a summary job aggregates shard results into a single PR comment.
### Wiring

- `packages/player`: `puppeteer-core`, `gsap`, `@types/bun` devDeps; typecheck extended to cover the perf tsconfig; new `perf` script.
- Root `package.json`: `player:perf` workspace script so `bun run player:perf` runs the whole suite locally with the same flags CI uses.
- `.gitignore`: `packages/player/tests/perf/results/`.
- `tests/perf/tsconfig.json` so test code doesn't pollute the package `rootDir` while still being typechecked.
## Test plan

- `bun run player:perf` passes — cold p95 ≈ 386 ms, warm p95 ≈ 375 ms, both well under the seeded baselines.
- First CI run will confirm `setup-chrome` works on hosted runners, the shard matrix wires up, and `metrics.json` artifacts upload.
## Stack

Step `P0-1a` of the player perf proposal. The next two slices are content-only — they don't touch the harness:

- `P0-1b` (perf(player): p0-1b perf tests for fps, scrub latency, and media sync drift — #400): adds `02-fps`, `04-scrub`, `05-drift` scenarios on a 10-video-grid fixture.
- `P0-1c` (perf(player): p0-1c live-playback parity test — #401): adds `06-parity` (live playback vs. synchronously-seeked reference, compared via SSIM).

Wiring this gate up first means each follow-up is a self-contained scenario file + baseline row + workflow shard.