investigate(dbic): t/52leaks.t schema-detached intermittent + diagnostic knob#635

Merged
fglock merged 7 commits into master from fix/storable-hooks-and-leaf-tags-20260430-114519
May 1, 2026

fglock (Owner) commented Apr 30, 2026

Summary

Investigation work on the DBIx::Class t/52leaks.t schema-detached intermittent failure that surfaces under ./jcpan -t DBIx::Class once #644's storable fixes unblock visibility past the early t/84serialize.t crashes.

This PR is diagnostic-and-plan only; it contains no walker fix yet. The actual code-change PR for the walker fix should land on top once Steps A–C of the plan have produced concrete evidence about which seeding gate is missing.

Contents

| Commit | What |
|---|---|
| docs(dbic): document new t/52leaks.t schema-detached harness regression | Describes the symptom and where it appears in the test |
| docs(dbic): pinpoint root cause of schema-detached t/52leaks.t failure | Initial hypothesis (later partially disconfirmed) |
| sandbox(walker): walker blind spot reproducer attempts + handoff doc | Two minimal reproducers (`*_PASSES.t`) that do NOT trigger the bug, which shows that the simple lexical-seeding case works correctly |
| docs(dbic): plan to make t/52leaks.t schema-detached bug reliably reproducible | Tiered repro strategy + diagnostic infrastructure design |
| fix(MortalList): JPERL_FORCE_SWEEP_EVERY_FLUSH debug knob + corrected walker plan | The actual code change: env-gated debug knob to bypass the 5-s auto-sweep throttle, making timing-dependent walker bugs deterministic for the next investigator |
| docs(dbic): concrete next-steps plan for the walker investigation | Steps A–E for whoever picks up the actual fix |

What we learned today (and what we don't yet know)

✅ The bug is the auto-sweep walker prematurely clearing the weak ref from ResultSource → Schema while DBIC still expects to dereference it.

JPERL_NO_AUTO_GC=1 removes the crash but exposes 14/23 leak-tracer failures, so disabling the sweep is NOT the fix.

✅ The walker DOES seed my $scalar = $ref lexicals (verified — both dev/sandbox/walker_blind_spot/lexical_scalar_root_PASSES.t and dbic_real_pattern_PASSES.t pass under JPERL_FORCE_SWEEP_EVERY_FLUSH=1).

❌ We do NOT yet know which specific seeding gate the walker is missing in DBIC's actual code path — likely tied to Moo / Class::C3::XS / Sub::Quote / accessor-magic / Storable seen-table interaction. Need JPERL_WALKER_TRACE instrumentation under a real DBIC failure to find out.

Next steps

See dev/modules/dbix_class.md § Next steps (concrete, in order).

Summary: add JPERL_WALKER_TRACE to ReachabilityWalker.sweepWeakRefs(), then run JPERL_FORCE_SWEEP_EVERY_FLUSH=1 JPERL_WALKER_TRACE=1 ./jcpan -t DBIx::Class and find the first WALKER_CLEAR line with a DBIx::Class::Schema/ResultSource/Storage::DBI target. The seeding state recorded in that line tells us which gate to fix.
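
The "find the first WALKER_CLEAR line" step could be automated roughly as follows. This is only a sketch: JPERL_WALKER_TRACE does not exist yet, so the `WALKER_CLEAR target=<class>@<id> ...` line shape and the `WalkerTraceFilter` class are assumptions made for illustration, not the format the walker will actually emit.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical trace filter. The "WALKER_CLEAR target=<class>@<id> ..." line
// shape is an assumption; adjust it to whatever the real trace prints.
final class WalkerTraceFilter {
    private static final List<String> SUSPECT_PREFIXES = List.of(
            "DBIx::Class::Schema",
            "DBIx::Class::ResultSource",
            "DBIx::Class::Storage::DBI");

    /** Returns the first WALKER_CLEAR line whose target class starts with a suspect prefix. */
    static Optional<String> firstSuspectClear(List<String> traceLines) {
        for (String line : traceLines) {
            if (!line.startsWith("WALKER_CLEAR ")) continue;
            int at = line.indexOf("target=");
            if (at < 0) continue;
            String target = line.substring(at + "target=".length());
            for (String prefix : SUSPECT_PREFIXES) {
                if (target.startsWith(prefix)) return Optional.of(line);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample lines, only to exercise the filter.
        List<String> sample = List.of(
                "WALKER_CLEAR target=Some::Cache@1f seeds=globals",
                "WALKER_CLEAR target=DBIx::Class::Schema@2a seeds=globals",
                "WALKER_CLEAR target=DBIx::Class::ResultSource::Table@3b seeds=globals");
        System.out.println(firstSuspectClear(sample).orElse("no suspect clear found"));
    }
}
```

Objects blessed into subclasses with other package names (the test schema, for instance) would need an extra matching rule; the prefix list above only covers the three class families named in the summary.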

Test plan

  • make (unit tests) — green
  • JPERL_FORCE_SWEEP_EVERY_FLUSH=1 opt-in: doesn't change behaviour when env var unset (verified — full DBIC t/52leaks.t runs the same as without the knob).
  • Walker fix and DBIC 0/314 — deferred to follow-up PR after Steps A–C produce evidence.

Dependencies

Depends on / shares context with:

Generated with Devin

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Related issue: #646

fglock added a commit that referenced this pull request Apr 30, 2026
…me -p`

Adds a new mandatory rule for investigative agents:
- ALL `jperl` / `jcpan` / `prove` invocations that could hang must be
  wrapped in `timeout N`, never `/usr/bin/time -p` (which only measures)
  and never bare `./jperl …`.
- Explains why: `./jperl` ends with `exec java …`, so when the agent's
  bash exits, hung JVMs get reparented to PID 1 and keep running at 100%
  CPU forever — there is no SIGHUP propagation and no JVM-side watchdog.
  A handful of these orphans silently starves the whole machine.
- Includes WRONG/RIGHT examples and the post-investigation cleanup-check
  command (`ps aux | awk '$3>20 {...}'` + `pkill -f "perlonjava-.*\.t"`).
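
The rule above is about the shell `timeout N` wrapper. For a harness that launches children from Java, the same "never wait unbounded, always kill on expiry" idea might look like the sketch below. `BoundedRun` is a hypothetical helper, not part of PerlOnJava, and the command in `main` is a placeholder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

// Sketch only: BoundedRun is a hypothetical helper, not part of PerlOnJava.
final class BoundedRun {
    /**
     * Runs the command and returns its exit code, or an empty OptionalInt
     * if it had to be killed because it outlived timeoutSeconds.
     */
    static OptionalInt runWithTimeout(List<String> command, long timeoutSeconds) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                    .start();
            if (process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                return OptionalInt.of(process.exitValue());
            }
            // Best effort: kill the children that exist right now, then the process itself.
            process.descendants().forEach(ProcessHandle::destroyForcibly);
            process.destroyForcibly().waitFor();
            return OptionalInt.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder command; substitute the real test invocation.
        System.out.println(runWithTimeout(List.of("sh", "-c", "sleep 30"), 1));
    }
}
```

Because `./jperl` ends with `exec java …`, the launched process and the JVM share one PID, so `destroyForcibly()` on the `Process` reaches the JVM directly; the `descendants()` pass is only there for commands that fork further.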

Adds an Incident Log entry for today's PR-#635 work, where this exact
trap caused phantom `t/76joins.t` / `t/96_is_deteministic_value.t`
SIGKILLs in `./jcpan -t DBIx::Class` runs — the symptom looked like a
real DBIx::Class perf regression, but was actually CPU starvation from
~14 orphan JVMs left behind by an earlier investigative agent.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock force-pushed the fix/storable-hooks-and-leaf-tags-20260430-114519 branch 2 times, most recently from 718b1d4 to 2b4576b Compare April 30, 2026 12:56
fglock and others added 6 commits April 30, 2026 17:06
Adds an Investigation Plan section to dev/modules/dbix_class.md for the
NEW failure mode observed today under `./jcpan -t DBIx::Class`:

  DBIx::Class::ResultSource::schema(): Unable to perform storage-
  dependent operations with a detached result source (source 'Artist'
  is not associated with a schema). at t/52leaks.t line 430

This is distinct from the existing "tests 12-18 leak detection at line
526" entry — that's a leak (objects not getting destroyed), this is the
opposite (a schema getting destroyed too eagerly while a child resultset
still expects it).

Test passes standalone (11/11 in 46s); only fails when ~20+ prior DBIC
tests have run through the same harness JVM. Suspected cause: the
walker-gate property fix in PR #618 (commit ce8186e) widened DESTROY
gating to every storedInPackageGlobal object — under cumulative state
pressure, the gate fails to rescue a Schema/ResultSource pair, causing
the weak ref from RS → Schema to read as undef.

The plan section includes:
- exact symptom + reproducer
- code path that triggers it
- hypothesis
- 4-step diagnostic plan (bisect prefix, instrument Java side,
  reachability check, c4db69e-baseline verification)
- what's NOT the cause (parent harness JVM is 99.7% idle in select polling)
- "why we can't ship" — DBIx::Class is published as PASS in the CPAN
  compatibility report

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Confirmed via experiment: the failure is a timing-dependent walker
blind spot in `MortalList.maybeAutoSweep()`.

Diagnostic table added to dev/modules/dbix_class.md:

  Mode                          | t/52leaks.t under harness
  ------------------------------|---------------------------
  default (auto-GC every 5 s)   | crashes mid-test:
                                |   "detached result source" at line 430
  JPERL_NO_AUTO_GC=1            | runs to completion;
                                |   14/23 subtests fail at leak-detection

So:
- WITH auto-sweep: walker incorrectly decides the Schema is
  unreachable (it isn't: `my $schema = DBICTest->init_schema()`
  in the test's top-level scope holds a strong ref). Auto-sweep
  clears each ResultSource's weak ref to the Schema; a row's
  `->result_source->resultset` call then dereferences the now-undef
  weak back-ref and throws the "detached result source" exception.

- WITHOUT auto-sweep: schema stays alive (so no crash), but the
  underlying t/52leaks.t tests 12-18 leak-detection failures
  surface — those are the documented "deep refcount inflation"
  blockers from the existing plan.

Fix path is narrower than disabling the sweep: fix
ReachabilityWalker so it correctly seeds JVM-stack lexicals as
roots. Currently it only walks from global symbol tables; following
closure captures works, but lexicals themselves aren't seeded.
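
To make the failure class concrete, here is a toy mark-then-clear sweep in which a live object loses its weak back-ref purely because one root was not seeded. It is a minimal model under stated assumptions: `Node`, `strong`, and `weakTarget` are illustrative names, the row-to-source strong edge is assumed, and none of this reflects ReachabilityWalker's actual data structures.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy model only: not PerlOnJava's ReachabilityWalker.
final class ToyWalker {
    static final class Node {
        final String name;
        final List<Node> strong = new ArrayList<>(); // strong references held by this node
        Node weakTarget;                             // at most one weak reference; null once cleared
        Node(String name) { this.name = name; }
    }

    /** Marks everything strongly reachable from roots, then clears weak refs to unmarked nodes. */
    static int sweepWeakRefs(List<Node> roots, List<Node> allNodes) {
        Set<Node> marked = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            Node n = work.pop();
            if (marked.add(n)) work.addAll(n.strong);
        }
        int cleared = 0;
        for (Node n : allNodes) {
            if (n.weakTarget != null && !marked.contains(n.weakTarget)) {
                n.weakTarget = null; // the weak ref now reads as undef
                cleared++;
            }
        }
        return cleared;
    }

    public static void main(String[] args) {
        Node schema = new Node("Schema");
        Node source = new Node("ResultSource");
        Node row = new Node("Row");
        schema.strong.add(source);  // Schema owns its sources
        source.weakTarget = schema; // weak back-ref ResultSource -> Schema
        row.strong.add(source);     // assumed: a row keeps its source alive
        List<Node> all = List.of(schema, source, row);

        // Both lexicals seeded as roots: nothing is cleared.
        System.out.println(sweepWeakRefs(List.of(schema, row), all));
        // The $schema lexical missing from the root set: the back-ref is cleared
        // even though the schema object is still alive.
        System.out.println(sweepWeakRefs(List.of(row), all));
        System.out.println(source.weakTarget == null);
    }
}
```

The second sweep is the interesting one: the graph is unchanged, only the root set shrank, which is why the audit focuses on what the walker seeds rather than on how it traverses.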

Plan section now includes:
- exact symptom + experiment confirming the timer dependency
- ref-graph diagram of the schema/RS/row chain
- 3-step audit checklist for ReachabilityWalker (lexical seeding,
  capture-following, identity matching)
- explicit "don't disable the sweep" note (breaks leak detection)

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Adds dev/sandbox/walker_blind_spot/ with:

- README.md explaining the bug (linking to the full plan in
  dev/modules/dbix_class.md), what we tried, and concrete next
  steps for the next investigator.
- simple_lexical_repro.t — minimal Schema/ResultSource pair with
  one weakened back-ref, exercises auto-sweep over 7s.

Status of the simple reproducer: passes in both modes (with and
without JPERL_NO_AUTO_GC=1). The DBIC failure must depend on a more
complex pattern (closure captures, JVM-stack temporaries during
DBIC's accessor chain, etc.) that the walker's seeding gates
incorrectly exclude. The next investigator needs to either:

1. Add `ReachabilityWalker.sweepWeakRefs()` diagnostic logging to
   pinpoint which gate drops the schema, or
2. Mirror DBIC's accessor-chain pattern more precisely in the
   reproducer.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…roducible

Today's testing of the schema-detached bug is flaky:
- Different victim test on every full DBIC run.
- Simple reproducers don't fail (walker handles trivial my-lexicals fine).
- Even with explicit Internals::jperl_gc() x 50 the bug doesn't trip.

This is intrinsic — the bug only fires when the auto-sweep 5-s timer
expires at a precise moment relative to Perl's statement boundaries
inside DBIC's accessor chain. Naive standalone reproducers are either
too short (no sweep) or too simple (lexical too easy for the walker).

Adds a "How to make this reliably reproducible" section to the plan
with four pieces of infrastructure:

1. JPERL_FORCE_SWEEP_EVERY_FLUSH=1 — debug env var that fires the
   auto-sweep on every MortalList.flush() call, bypassing the 5-s
   throttle and the weakRefsExist gate. Converts the stochastic race
   into deterministic "sweep here → next access dies".

2. JPERL_WALKER_TRACE=1 — structured log of every weak-ref the
   sweep clears: target classname + identity, findPathTo() output,
   snapshot of seeding sources active. The first cleared Schema in
   the transcript is the bug.

3. Tiered reproducers T1..T6 — graduate from "1 schema + 1 weakened
   ref" (current simple_lexical_repro.t, passes) up to a DBIC-shape
   pattern (closures + @_ temporaries + overloaded "" + thousands
   of unrelated weakened scalars + interleaved dclone). Smallest tier
   that fails under (1) becomes the unit test.

4. Prefix bisection on the full DBIC suite — find the shortest
   sequence of test files that triggers a failure under (1)+(2).
   That sequence is the deterministic harness reproducer.
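
Item 4 can be sketched as a binary search over prefix lengths. The sketch assumes the failure is monotonic (once a prefix fails, every longer prefix fails too), which is plausible only when the sweep is forced deterministic by (1); `PrefixBisect` and the predicate in `main` are hypothetical stand-ins for "run these files in one harness JVM and check for the crash".

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch only: PrefixBisect is a hypothetical helper, not part of the repo.
final class PrefixBisect {
    /**
     * Returns the length of the shortest prefix of files for which fails.test(prefix)
     * is true, or -1 if even the full list passes. Assumes monotonic failure:
     * if a prefix fails, every longer prefix fails as well.
     */
    static <T> int shortestFailingPrefix(List<T> files, Predicate<List<T>> fails) {
        if (files.isEmpty() || !fails.test(files)) return -1;
        int lo = 1;
        int hi = files.size(); // invariant: the prefix of length hi fails
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (fails.test(files.subList(0, mid))) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Made-up ordering; the stand-in predicate "fails" once t/52leaks.t is included.
        List<String> files = List.of(
                "t/84serialize.t", "t/76joins.t", "t/52leaks.t", "t/96_is_deteministic_value.t");
        int n = shortestFailingPrefix(files, prefix -> prefix.contains("t/52leaks.t"));
        System.out.println(files.subList(0, n));
    }
}
```

Each predicate call corresponds to one harness run, so the search costs about log2(N) runs plus the initial full run, rather than N.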

Plan ordering: implement (1)+(2) first (~30 min), then (4) prefix
bisection (~1 h), then inspect transcripts to identify the failing
seeding gate, fix in ReachabilityWalker, promote smallest failing
reproducer to src/test/resources/unit/refcount/walker_blind_spot.t.

This gets us off the flaky-repro treadmill we've been stuck on today.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…lker plan

Adds the deterministic-sweep debug knob the
"How to make this reliably reproducible" section of
dev/modules/dbix_class.md committed to needing:

  if (System.getenv("JPERL_FORCE_SWEEP_EVERY_FLUSH") != null) {
      // bypass weakRefsExist gate AND the 5-s throttle on every
      // MortalList.flush() — every Perl statement boundary runs
      // a full sweepWeakRefs walk
  }

This converts timing-dependent walker bugs (like the DBIC
"detached result source" mid-test crash on t/52leaks.t line 430)
into deterministic "sweep here → next access dies" sequences for
diagnostic work.
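
For readers who have not seen the gate, the decision the knob short-circuits can be modelled roughly like this. `SweepGate` is a hypothetical stand-in for the logic in `MortalList.maybeAutoSweep()`; in particular, whether the very first eligible flush sweeps immediately is an assumption of this sketch, and the clock is passed in only to keep the model testable.

```java
// Sketch only: a hypothetical model of the auto-sweep gate, not the MortalList code.
final class SweepGate {
    static final long THROTTLE_NANOS = 5_000_000_000L; // the 5-s throttle

    private final boolean forceEveryFlush;
    private boolean sweptOnce;
    private long lastSweepNanos;

    SweepGate(boolean forceEveryFlush) {
        this.forceEveryFlush = forceEveryFlush;
    }

    /** Builds a gate whose override follows the environment variable added in this commit. */
    static SweepGate fromEnvironment() {
        return new SweepGate(System.getenv("JPERL_FORCE_SWEEP_EVERY_FLUSH") != null);
    }

    /** Decides whether a flush at time nowNanos should run a full sweep. */
    boolean shouldSweep(long nowNanos, boolean weakRefsExist) {
        if (forceEveryFlush) return true;  // debug knob: bypass both gates
        if (!weakRefsExist) return false;  // nothing a sweep could clear
        if (sweptOnce && nowNanos - lastSweepNanos < THROTTLE_NANOS) return false;
        sweptOnce = true;
        lastSweepNanos = nowNanos;
        return true;
    }

    public static void main(String[] args) {
        SweepGate gate = fromEnvironment();
        System.out.println(gate.shouldSweep(System.nanoTime(), true));
        System.out.println(gate.shouldSweep(System.nanoTime(), true));
    }
}
```

With the variable unset the second call in `main` is throttled; with it set, both calls sweep, which is the property that turns the race into a repeatable sequence.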

Hypothesis testing under this knob disconfirms the earlier
"walker doesn't seed `my $scalar` lexicals" theory:

- `dev/sandbox/walker_blind_spot/lexical_scalar_root_PASSES.t` —
  `my $obj = bless` + weakened back-ref + 20× Internals::jperl_gc()
  → PASSES under JPERL_FORCE_SWEEP_EVERY_FLUSH=1.

- `dev/sandbox/walker_blind_spot/dbic_real_pattern_PASSES.t` —
  DBIC-shape with schema in global %REGISTRY and a chain replacing
  $phantom each iteration → also PASSES.

So the walker DOES correctly seed both `my $scalar` lexicals and
globally-registered schemas. The actual DBIC blind spot is somewhere
else — Moo/MRO, accessor magic, Storable's seen-table, or some other
DBIC-specific structural cycle.

The fix path in dev/modules/dbix_class.md is updated: stop
speculating about which seeding gate; the next investigator should
add `JPERL_WALKER_TRACE=1` instrumentation to
`ReachabilityWalker.sweepWeakRefs()` and capture an actual
DBIC failure to identify the real gate.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Adds a "Next steps (concrete, in order)" section to
dev/modules/dbix_class.md so whoever picks up this PR can act
without re-reading the whole investigation history:

  Step A — Add JPERL_WALKER_TRACE=1 instrumentation
           (env-gated System.err.println in sweepWeakRefs that logs
            cleared-target identity + refcount/state + findPathTo
            output + seed-stats snapshot)
  Step B — Run jcpan -t DBIx::Class with the new trace + the
           JPERL_FORCE_SWEEP_EVERY_FLUSH knob already in this PR
  Step C — Identify the failing seeding gate from the trace
           (3 most-likely candidates listed)
  Step D — Promote the smallest reproducer to a unit test
  Step E — Verify on full DBIC suite

Step A is small (~20 lines in ReachabilityWalker), Step B is one
command, Step C is the actual diagnosis once we have the trace —
no more speculating about which seeding gate is at fault.
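
Step A might take a shape like the following. Everything here is a sketch under assumptions: the `WalkerTrace` class, the parameter list, and the `WALKER_CLEAR …` line layout are hypothetical, and the real hook inside `sweepWeakRefs` will have different inputs available.

```java
// Sketch only: hypothetical shape for the Step A instrumentation.
final class WalkerTrace {
    // Read once; the trace stays silent unless the variable is set.
    private static final boolean ENABLED = System.getenv("JPERL_WALKER_TRACE") != null;

    /** Formats one trace line for a weak-ref target that the sweep is about to clear. */
    static String formatClear(Object target, String className, int refCount,
                              String pathFromRoot, String seedStats) {
        return "WALKER_CLEAR target=" + className
                + "@" + Integer.toHexString(System.identityHashCode(target))
                + " refcount=" + refCount
                + " path=" + pathFromRoot
                + " seeds=" + seedStats;
    }

    /** Prints the trace line to stderr when tracing is enabled; otherwise does nothing. */
    static void logClear(Object target, String className, int refCount,
                         String pathFromRoot, String seedStats) {
        if (ENABLED) {
            System.err.println(formatClear(target, className, refCount, pathFromRoot, seedStats));
        }
    }

    public static void main(String[] args) {
        // Placeholder values; a real call site would pass the walker's own state.
        logClear(new Object(), "DBIx::Class::Schema", 1, "none", "globals=0,lexicals=0");
    }
}
```

Keeping formatting separate from printing means the line layout can be unit-tested without setting the environment variable.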

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock force-pushed the fix/storable-hooks-and-leaf-tags-20260430-114519 branch from 3db93f5 to d036674 Compare April 30, 2026 15:07
@fglock fglock changed the title fix(storable): hook cookie UTF-8 + leaf seen-tag drift investigate(dbic): t/52leaks.t schema-detached intermittent + diagnostic knob Apr 30, 2026
@fglock fglock merged commit 0e0e849 into master May 1, 2026
2 checks passed
@fglock fglock deleted the fix/storable-hooks-and-leaf-tags-20260430-114519 branch May 1, 2026 07:40