investigate(dbic): t/52leaks.t schema-detached intermittent + diagnostic knob by fglock · Pull Request #635 · fglock/PerlOnJava

fglock · 2026-04-30T09:48:05Z

Summary

This PR investigates the intermittent DBIx::Class t/52leaks.t "schema-detached" failure, which surfaces under ./jcpan -t DBIx::Class once the Storable fixes in #644 get the run past the early t/84serialize.t crashes.

This PR is diagnostic-and-plan only: it contains no walker fix yet, and its only code change is an env-gated debug knob. The code-change PR for the walker fix should land on top once Steps A–C of the plan have produced concrete evidence about which seeding gate is missing.

| Commit | What |
|---|---|
| `docs(dbic)`: document new t/52leaks.t schema-detached harness regression | Describes the symptom and where it appears in the test |
| `docs(dbic)`: pinpoint root cause of schema-detached t/52leaks.t failure | Initial hypothesis (later partially disconfirmed) |
| `sandbox(walker)`: walker blind spot reproducer attempts + handoff doc | Two minimal reproducers (`*_PASSES.t`) that do NOT trigger the bug; they show that the simple lexical-seeding case works correctly |
| `docs(dbic)`: plan to make t/52leaks.t schema-detached bug reliably reproducible | Tiered repro strategy + diagnostic infrastructure design |
| `fix(MortalList)`: `JPERL_FORCE_SWEEP_EVERY_FLUSH` debug knob + corrected walker plan | The actual code change: env-gated debug knob to bypass the 5-s auto-sweep throttle, making timing-dependent walker bugs deterministic for the next investigator |
| `docs(dbic)`: concrete next-steps plan for the walker investigation | Steps A–E for whoever picks up the actual fix |

What we learned today (and what we don't yet know)

✅ The bug is the auto-sweep walker prematurely clearing the weak ref from ResultSource → Schema while DBIC still expects to dereference it.

✅ JPERL_NO_AUTO_GC=1 removes the crash but exposes 14/23 leak-tracer failures, so disabling the sweep is NOT the fix.

✅ The walker DOES seed `my $scalar = $ref` lexicals. Verified: both dev/sandbox/walker_blind_spot/lexical_scalar_root_PASSES.t and dev/sandbox/walker_blind_spot/dbic_real_pattern_PASSES.t pass under JPERL_FORCE_SWEEP_EVERY_FLUSH=1.

❌ We do NOT yet know which specific seeding gate the walker is missing in DBIC's actual code path — likely tied to Moo / Class::C3::XS / Sub::Quote / accessor-magic / Storable seen-table interaction. Need JPERL_WALKER_TRACE instrumentation under a real DBIC failure to find out.
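To make the symptom concrete, here is a minimal Java sketch of a weak back-reference that is cleared while its target is still strongly held. `Schema`, `ResultSource` and `simulatePrematureClear` are hypothetical stand-ins written for illustration; they are not PerlOnJava or DBIx::Class code, and the real clear happens inside the auto-sweep rather than through an explicit call.

```java
import java.lang.ref.WeakReference;

final class DetachedSourceDemo {
    public static void main(String[] args) {
        Schema schema = new Schema("DBICTest");              // strong ref held by the caller
        ResultSource artist = new ResultSource("Artist", schema);
        System.out.println(artist.schema().name);            // works while the weak ref is intact
        artist.simulatePrematureClear();                     // stands in for the faulty sweep
        try {
            artist.schema();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        System.out.println(schema.name);                     // the schema itself is still alive
    }
}

// Hypothetical stand-in for a schema object.
final class Schema {
    final String name;
    Schema(String name) { this.name = name; }
}

// Hypothetical stand-in for a result source with a weakened back-ref to its schema.
final class ResultSource {
    private final String sourceName;
    private final WeakReference<Schema> schemaRef;

    ResultSource(String sourceName, Schema schema) {
        this.sourceName = sourceName;
        this.schemaRef = new WeakReference<>(schema);
    }

    // Fails in the same shape as the reported error once the back-ref reads as null.
    Schema schema() {
        Schema s = schemaRef.get();
        if (s == null) {
            throw new IllegalStateException(
                "detached result source (source '" + sourceName + "' is not associated with a schema)");
        }
        return s;
    }

    // Simulates a sweep that wrongly judges the schema unreachable.
    void simulatePrematureClear() { schemaRef.clear(); }
}
```

The point of the sketch is that the failing dereference says nothing about whether the schema is alive; only the weak edge was cut.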

Next steps

See dev/modules/dbix_class.md § Next steps (concrete, in order).

Summary: add JPERL_WALKER_TRACE to ReachabilityWalker.sweepWeakRefs(), then run JPERL_FORCE_SWEEP_EVERY_FLUSH=1 JPERL_WALKER_TRACE=1 ./jcpan -t DBIx::Class and find the first WALKER_CLEAR line with a DBIx::Class::Schema/ResultSource/Storage::DBI target. The seeding state recorded in that line tells us which gate to fix.
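Once a trace exists, picking out that first line could look roughly like the sketch below. The line shape `WALKER_CLEAR target=<class>@<id> ...` is an assumption made for this example; the PR names the `WALKER_CLEAR` marker but does not define a format. `FirstDbicClear` is a hypothetical helper, and the pattern may need widening if the cleared objects are blessed into subclasses.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

final class FirstDbicClear {
    // Assumed line shape: "WALKER_CLEAR target=<Perl class>@<id> ..." (not defined by the PR).
    private static final Pattern DBIC_CLEAR = Pattern.compile(
        "^WALKER_CLEAR\\s+target=DBIx::Class::(Schema|ResultSource|Storage::DBI)\\b.*");

    // Returns the first trace line whose cleared target is one of the DBIC classes of interest.
    static Optional<String> find(List<String> transcript) {
        return transcript.stream()
            .filter(line -> DBIC_CLEAR.matcher(line).matches())
            .findFirst();
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "WALKER_CLEAR target=Some::Other::Class@1f seeds=globals",
            "WALKER_CLEAR target=DBIx::Class::Schema@2a seeds=globals",
            "WALKER_CLEAR target=DBIx::Class::ResultSource@3b seeds=globals");
        System.out.println(find(sample).orElse("no DBIC target cleared"));
    }
}
```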

Test plan

make (unit tests) — green
JPERL_FORCE_SWEEP_EVERY_FLUSH=1 opt-in: doesn't change behaviour when env var unset (verified — full DBIC t/52leaks.t runs the same as without the knob).
Walker fix and DBIC 0/314 — deferred to follow-up PR after Step A–C produce evidence.

Dependencies

Depends on / shares context with:

#644 (Storable hook cookie + leaf seen-tag fix; AGENTS timeout rule; jperl orphan watchdog). Without #644, the t/52leaks.t failure is masked by earlier Storable crashes.

Generated with Devin

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Related issue: #646

…me -p`

Adds a new mandatory rule for investigative agents:

- ALL `jperl` / `jcpan` / `prove` invocations that could hang must be wrapped in `timeout N`, never `/usr/bin/time -p` (which only measures) and never bare `./jperl …`.
- Explains why: `./jperl` ends with `exec java …`, so when the agent's bash exits, hung JVMs get reparented to PID 1 and keep running at 100% CPU forever; there is no SIGHUP propagation and no JVM-side watchdog. A handful of these orphans silently starves the whole machine.
- Includes WRONG/RIGHT examples and the post-investigation cleanup-check command (`ps aux | awk '$3>20 {...}'` + `pkill -f "perlonjava-.*\.t"`).

Adds an Incident Log entry for today's PR-#635 work, where this exact trap caused phantom `t/76joins.t` / `t/96_is_deteministic_value.t` SIGKILLs in `./jcpan -t DBIx::Class` runs: the symptom looked like a real DBIx::Class perf regression, but was actually CPU starvation from ~14 orphan JVMs left behind by an earlier investigative agent.

Generated with [Devin](https://devin.ai)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Adds an Investigation Plan section to dev/modules/dbix_class.md for the NEW failure mode observed today under `./jcpan -t DBIx::Class`:

    DBIx::Class::ResultSource::schema(): Unable to perform storage-dependent operations with a detached result source (source 'Artist' is not associated with a schema). at t/52leaks.t line 430

This is distinct from the existing "tests 12-18 leak detection at line 526" entry: that's a leak (objects not getting destroyed), this is the opposite (a schema getting destroyed too eagerly while a child resultset still expects it).

Test passes standalone (11/11 in 46s); only fails when ~20+ prior DBIC tests have run through the same harness JVM.

Suspected cause: the walker-gate property fix in PR #618 (commit ce8186e) widened DESTROY gating to every storedInPackageGlobal object; under cumulative state pressure, the gate fails to rescue a Schema/ResultSource pair, causing the weak ref from RS → Schema to read as undef.

The plan section includes:

- exact symptom + reproducer
- code path that triggers it
- hypothesis
- 4-step diagnostic plan (bisect prefix, instrument Java side, reachability check, c4db69e-baseline verification)
- what's NOT the cause (parent harness JVM is 99.7% idle in select polling)
- "why we can't ship": DBIx::Class is published as PASS in the CPAN compatibility report

Generated with [Devin](https://devin.ai)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Confirmed via experiment: the failure is a timing-dependent walker blind spot in `MortalList.maybeAutoSweep()`.

Diagnostic table added to dev/modules/dbix_class.md:

| Mode | t/52leaks.t under harness |
|---|---|
| default (auto-GC every 5 s) | crashes mid-test: "detached result source" at line 430 |
| JPERL_NO_AUTO_GC=1 | runs to completion; 14/23 subtests fail at leak-detection |

So:

- WITH auto-sweep: walker incorrectly decides the Schema is unreachable (it isn't: `my $schema = DBICTest->init_schema()` in the test's top-level scope holds a strong ref). Auto-sweep clears the Schema's weak refs from each ResultSource → row's `->result_source->resultset` then dereferences a now-undef weak back-ref → "detached result source" exception.
- WITHOUT auto-sweep: schema stays alive (so no crash), but the underlying t/52leaks.t tests 12-18 leak-detection failures surface; those are the documented "deep refcount inflation" blockers from the existing plan.

Fix path is narrower than disabling the sweep: fix ReachabilityWalker so it correctly seeds JVM-stack lexicals as roots. Currently it only walks from global symbol tables; closures following captures works but lexicals themselves aren't seeded.

Plan section now includes:

- exact symptom + experiment confirming the timer dependency
- ref-graph diagram of the schema/RS/row chain
- 3-step audit checklist for ReachabilityWalker (lexical seeding, capture-following, identity matching)
- explicit "don't disable the sweep" note (breaks leak detection)

Generated with [Devin](https://devin.ai)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
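To illustrate the seeding hypothesis stated in this commit message, here is a minimal Java sketch of a mark phase over a toy reference graph. The node names and the `reachable` helper are hypothetical and do not correspond to the real ReachabilityWalker API. It only shows why an unseeded root would make a live Schema look unreachable; later commits in this PR report that simple `my` lexicals are in fact seeded.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RootSeedingDemo {
    // Marks everything reachable from the given roots by following strong edges only.
    static Set<String> reachable(Set<String> roots, Map<String, List<String>> strongEdges) {
        Set<String> seen = new HashSet<>(roots);
        Deque<String> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            for (String next : strongEdges.getOrDefault(work.pop(), List.of())) {
                if (seen.add(next)) {
                    work.push(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // "lexical:$schema" stands for `my $schema` in the test's top-level scope.
        // The weak back-ref ResultSource -> Schema is deliberately absent: weak edges do not mark.
        Map<String, List<String>> edges = Map.of(
            "lexical:$schema", List.of("Schema"),
            "Schema", List.of("ResultSource"),
            "global:%main::", List.of("Other"));
        Set<String> globalsOnly = reachable(Set.of("global:%main::"), edges);
        Set<String> withLexicals = reachable(Set.of("global:%main::", "lexical:$schema"), edges);
        System.out.println("globals only sees Schema: " + globalsOnly.contains("Schema"));
        System.out.println("with lexical roots sees Schema: " + withLexicals.contains("Schema"));
    }
}
```

Any object missing from the marked set would have its weak refs cleared by a sweep, which is how a missing seed turns into a "detached result source" error.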

Adds dev/sandbox/walker_blind_spot/ with:

- README.md explaining the bug (linking to the full plan in dev/modules/dbix_class.md), what we tried, and concrete next steps for the next investigator.
- simple_lexical_repro.t: minimal Schema/ResultSource pair with one weakened back-ref, exercises auto-sweep over 7s.

Status of the simple reproducer: passes in both modes (with and without JPERL_NO_AUTO_GC=1). The DBIC failure must depend on a more complex pattern (closure captures, JVM-stack temporaries during DBIC's accessor chain, etc.) that the walker's seeding gates incorrectly exclude.

The next investigator needs to either:

1. Add `ReachabilityWalker.sweepWeakRefs()` diagnostic logging to pinpoint which gate drops the schema, or
2. Mirror DBIC's accessor-chain pattern more precisely in the reproducer.

Generated with [Devin](https://devin.ai)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

…roducible

Today's testing of the schema-detached bug is flaky:

- Different victim test on every full DBIC run.
- Simple reproducers don't fail (walker handles trivial my-lexicals fine).
- Even with explicit Internals::jperl_gc() x 50 the bug doesn't trip.

This is intrinsic: the bug only fires when the auto-sweep 5-s timer expires at a precise moment relative to Perl's statement boundaries inside DBIC's accessor chain. Naive standalone reproducers are either too short (no sweep) or too simple (lexical too easy for the walker).

Adds a "How to make this reliably reproducible" section to the plan with four pieces of infrastructure:

1. JPERL_FORCE_SWEEP_EVERY_FLUSH=1: debug env var that fires the auto-sweep on every MortalList.flush() call, bypassing the 5-s throttle and the weakRefsExist gate. Converts the stochastic race into deterministic "sweep here → next access dies".
2. JPERL_WALKER_TRACE=1: structured log of every weak-ref the sweep clears: target classname + identity, findPathTo() output, snapshot of seeding sources active. The first cleared Schema in the transcript is the bug.
3. Tiered reproducers T1..T6: graduate from "1 schema + 1 weakened ref" (current simple_lexical_repro.t, passes) up to a DBIC-shape pattern (closures + @_ temporaries + overloaded "" + thousands of unrelated weakened scalars + interleaved dclone). Smallest tier that fails under (1) becomes the unit test.
4. Prefix bisection on the full DBIC suite: find the shortest sequence of test files that triggers a failure under (1)+(2). That sequence is the deterministic harness reproducer.

Plan ordering: implement (1)+(2) first (~30 min), then (4) prefix bisection (~1 h), then inspect transcripts to identify the failing seeding gate, fix in ReachabilityWalker, promote smallest failing reproducer to src/test/resources/unit/refcount/walker_blind_spot.t.

This gets us off the flaky-repro treadmill we've been stuck on today.
Generated with [Devin](https://devin.ai) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
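The prefix bisection in item 4 can be sketched as a binary search over prefix lengths. This is an illustration only: `PrefixBisect` and the `fails` predicate are hypothetical, and the search assumes that once a prefix fails every longer prefix also fails, which is only plausible when the sweep has been made deterministic by the knob in item 1.

```java
import java.util.List;
import java.util.function.Predicate;

final class PrefixBisect {
    // Returns the length of the shortest failing prefix, or -1 if even the full list passes.
    // Assumes monotonicity: if a prefix fails, every longer prefix fails too.
    static <T> int shortestFailingPrefix(List<T> tests, Predicate<List<T>> fails) {
        if (!fails.test(tests)) {
            return -1;
        }
        int lo = 0;             // longest prefix length assumed to pass (empty prefix)
        int hi = tests.size();  // shortest prefix length known to fail so far
        while (hi - lo > 1) {
            int mid = (lo + hi) / 2;
            if (fails.test(tests.subList(0, mid))) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return hi;
    }

    public static void main(String[] args) {
        // Hypothetical file names; a real predicate would run the harness on the prefix.
        List<String> suite = List.of("t/a.t", "t/b.t", "t/c.t", "t/d.t", "t/e.t", "t/f.t");
        int n = shortestFailingPrefix(suite, prefix -> prefix.contains("t/d.t"));
        System.out.println("shortest failing prefix length: " + n);
    }
}
```

In practice the predicate would launch the harness on the given prefix under a timeout and report whether the "detached result source" error appeared, so each probe costs one partial suite run.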

…lker plan

Adds the deterministic-sweep debug knob the "How to make this reliably reproducible" section of dev/modules/dbix_class.md committed to needing:

    if (System.getenv("JPERL_FORCE_SWEEP_EVERY_FLUSH") != null) {
        // bypass weakRefsExist gate AND the 5-s throttle on every
        // MortalList.flush(): every Perl statement boundary runs
        // a full sweepWeakRefs walk
    }

This converts timing-dependent walker bugs (like the DBIC "detached result source" mid-test crash on t/52leaks.t line 430) into deterministic "sweep here → next access dies" sequences for diagnostic work.

Hypothesis testing under this knob disconfirms the earlier "walker doesn't seed `my $scalar` lexicals" theory:

- `dev/sandbox/walker_blind_spot/lexical_scalar_root_PASSES.t`: `my $obj = bless` + weakened back-ref + 20× Internals::jperl_gc() → PASSES under JPERL_FORCE_SWEEP_EVERY_FLUSH=1.
- `dev/sandbox/walker_blind_spot/dbic_real_pattern_PASSES.t`: DBIC-shape with schema in global %REGISTRY and a chain replacing $phantom each iteration → also PASSES.

So the walker DOES correctly seed both `my $scalar` lexicals and globally-registered schemas. The actual DBIC blind spot is somewhere else: Moo/MRO, accessor magic, Storable's seen-table, or some other DBIC-specific structural cycle.

The fix path in dev/modules/dbix_class.md is updated: stop speculating about which seeding gate; the next investigator should add `JPERL_WALKER_TRACE=1` instrumentation to `ReachabilityWalker.sweepWeakRefs()` and capture an actual DBIC failure to identify the real gate.

Generated with [Devin](https://devin.ai)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
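A minimal sketch of how the knob interacts with the two gates it bypasses, written as a pure function so the decision can be checked in isolation. `SweepGate` and `shouldSweep` are hypothetical names for this example and are not the actual MortalList code; only the environment variable name, the 5-second throttle and the weakRefsExist gate come from the commit message.

```java
final class SweepGate {
    // The 5-second auto-sweep throttle described in the commit message.
    static final long THROTTLE_MILLIS = 5_000;

    // Decides whether a flush should trigger a sweep (hypothetical helper).
    static boolean shouldSweep(boolean forceEveryFlush, boolean weakRefsExist,
                               long nowMillis, long lastSweepMillis) {
        if (forceEveryFlush) {
            return true;                 // debug knob bypasses both gates
        }
        if (!weakRefsExist) {
            return false;                // nothing a sweep could clear
        }
        return nowMillis - lastSweepMillis >= THROTTLE_MILLIS;
    }

    public static void main(String[] args) {
        // Presence of the variable is what matters, matching the != null check above.
        boolean force = System.getenv("JPERL_FORCE_SWEEP_EVERY_FLUSH") != null;
        System.out.println("sweep 1 s after the last one: " + shouldSweep(force, true, 1_000, 0));
    }
}
```

With the variable unset the call in `main` is throttled; with it set, every flush sweeps, which is what turns the race into a repeatable sequence.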

Adds a "Next steps (concrete, in order)" section to dev/modules/dbix_class.md so whoever picks up this PR can act without re-reading the whole investigation history:

- Step A: Add JPERL_WALKER_TRACE=1 instrumentation (env-gated System.err.println in sweepWeakRefs that logs cleared-target identity + refcount/state + findPathTo output + seed-stats snapshot)
- Step B: Run jcpan -t DBIx::Class with the new trace + the JPERL_FORCE_SWEEP_EVERY_FLUSH knob already in this PR
- Step C: Identify the failing seeding gate from the trace (3 most-likely candidates listed)
- Step D: Promote the smallest reproducer to a unit test
- Step E: Verify on full DBIC suite

Step A is small (~20 lines in ReachabilityWalker), Step B is one command, Step C is the actual diagnosis once we have the trace. No more speculating about which seeding gate is at fault.

Generated with [Devin](https://devin.ai)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
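Step A could be sketched along these lines. Everything here is an assumption except the environment variable name and the list of fields to log: `WalkerTrace`, its method signatures and the exact line format are hypothetical, and the real instrumentation would sit inside `sweepWeakRefs` with whatever state that method actually has available.

```java
final class WalkerTrace {
    // Read once at class load; when unset, tracing costs one boolean check per cleared ref.
    static final boolean ENABLED = System.getenv("JPERL_WALKER_TRACE") != null;

    // Hypothetical line format; the plan only lists which fields to log.
    static String format(String perlClass, int identity, int refCount,
                         String path, String seedStats) {
        return String.format("WALKER_CLEAR target=%s@%x refcount=%d path=%s seeds=%s",
            perlClass, identity, refCount, path, seedStats);
    }

    // Would be called at the point where the sweep clears a weak ref to `target`.
    static void cleared(Object target, String perlClass, int refCount,
                        String path, String seedStats) {
        if (!ENABLED) {
            return;
        }
        System.err.println(
            format(perlClass, System.identityHashCode(target), refCount, path, seedStats));
    }

    public static void main(String[] args) {
        // Sample values are made up purely to show the line shape.
        System.out.println(
            format("DBIx::Class::Schema", 0x2a, 1, "<none>", "globals=120,lexicals=0"));
    }
}
```

Keeping the formatter separate from the env check makes the line shape testable without setting the variable.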

…114519

fglock force-pushed the fix/storable-hooks-and-leaf-tags-20260430-114519 branch 2 times, most recently from 718b1d4 to 2b4576b Compare April 30, 2026 12:56

fglock mentioned this pull request Apr 30, 2026

Storable hook cookie + leaf seen-tag fix; AGENTS timeout rule; jperl orphan watchdog #644

Merged

5 tasks

fglock and others added 6 commits April 30, 2026 17:06

fglock force-pushed the fix/storable-hooks-and-leaf-tags-20260430-114519 branch from 3db93f5 to d036674 Compare April 30, 2026 15:07

fglock changed the title ~~fix(storable): hook cookie UTF-8 + leaf seen-tag drift~~ investigate(dbic): t/52leaks.t schema-detached intermittent + diagnostic knob Apr 30, 2026

fglock mentioned this pull request Apr 30, 2026

DBIx::Class t/52leaks.t intermittent: detached result source from premature weak-ref clear by ReachabilityWalker auto-sweep #646

Open

4 tasks

Merge branch 'master' into fix/storable-hooks-and-leaf-tags-20260430-…

86f3d90

…114519

fglock merged commit 0e0e849 into master May 1, 2026
2 checks passed

fglock deleted the fix/storable-hooks-and-leaf-tags-20260430-114519 branch May 1, 2026 07:40

fglock merged 7 commits into master from fix/storable-hooks-and-leaf-tags-20260430-114519
