
Clean up tutorial notebook output noise and version drift#170

Open
dimitri-yatsenko wants to merge 3 commits into main from docs/notebook-output-hygiene
@dimitri-yatsenko
Member

Summary

Tutorial and how-to notebooks are committed with their executed cell outputs, but the kernel image used to execute them was leaking three forms of noise into the rendered docs:

  • TqdmWarning: IProgress not found — visible at /tutorials/examples/blob-detection/. tqdm.auto falls back to text mode because the execution image had no ipywidgets, so the warning got serialized into the notebook output and is rendered on the live site.
  • scikit-image download chatter (Downloading file 'data/mitosis.tif' from gitlab.com ...) — appears in the same cell because the dataset isn't pre-cached.
  • Stale DataJoint 2.1.1 connected banners in 21 notebooks: mkdocs.yaml declares extra.datajoint_version: "2.2" and the current release is 2.2.2. The committed banners are over a year behind.

What's in this PR

  • docker-compose.yaml — add ipywidgets to the pip-install line in both MODE=EXECUTE and MODE=EXECUTE_PG branches. This silences the TqdmWarning at the source (lets tqdm.auto resolve to tqdm.notebook).
  • scripts/execute_notebooks.py — warm the scikit-image cache (hubble_deep_field, human_mitosis) before nbconvert spawns its kernel, so the one-time download message doesn't get captured into any future re-executed output.
  • scripts/check_notebook_versions.py (new) — scans every committed notebook's DataJoint X.Y.Z connected banner and fails if the major.minor doesn't match extra.datajoint_version in mkdocs.yaml. Currently flags all 21 stale notebooks.
  • README.md — short "Notebook execution policy" section documenting that outputs are intentionally committed and how to refresh them.

The notebook output refresh itself (re-executing all 21 notebooks against DataJoint 2.2.2) lands in a separate follow-up PR so the dependency/infrastructure change here stays reviewable in isolation.
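
For reviewers who want the shape of the version guard without opening the script, here is a minimal sketch of the comparison it performs. This is not the code in scripts/check_notebook_versions.py: the regular expressions and the function names target_version and is_stale are assumptions made for illustration. It reads datajoint_version from the raw mkdocs.yaml text with a regex instead of a YAML parser, then compares only major.minor.

```python
import re

# Hypothetical patterns; the real script may match these strings differently.
VERSION_RE = re.compile(
    r"""^\s*datajoint_version:\s*["']?(\d+\.\d+)["']?\s*$""", re.MULTILINE
)
BANNER_RE = re.compile(r"DataJoint (\d+)\.(\d+)\.(\d+) connected")


def target_version(mkdocs_text: str) -> str:
    """Pull datajoint_version out of mkdocs.yaml text without a YAML parser."""
    match = VERSION_RE.search(mkdocs_text)
    if match is None:
        raise ValueError("datajoint_version not found in mkdocs.yaml")
    return match.group(1)


def is_stale(banner: str, target: str) -> bool:
    """Return True when the banner's major.minor differs from the target."""
    match = BANNER_RE.search(banner)
    if match is None:
        return False  # no banner in this text, nothing to compare
    major, minor, _patch = match.groups()
    return f"{major}.{minor}" != target


mkdocs_text = 'extra:\n  datajoint_version: "2.2"\n'
target = target_version(mkdocs_text)
print(is_stale("DataJoint 2.1.1 connected to ...", target))  # True
print(is_stale("DataJoint 2.2.2 connected to ...", target))  # False
```

Comparing major.minor only means a patch release (2.2.1 to 2.2.2) does not force a refresh, which matches the "target: 2.2.x" behavior described in this PR.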

Test plan

  • python scripts/check_notebook_versions.py runs from a checkout without PyYAML (so Material's custom YAML tags in mkdocs.yaml are not a problem); it currently reports the 21 stale notebooks and exits 1.
  • MODE=EXECUTE_PG docker compose up --build brings the stack up; the docs container logs Pre-caching scikit-image datasets... and cached: hubble_deep_field / cached: human_mitosis before any [N/21] notebook execution starts.
  • After a refresh-PR is merged, grep -rln 'TqdmWarning\|IProgress not found' src/ returns empty.
  • After a refresh-PR is merged, grep -rln 'Downloading file.*scikit-image' src/ returns empty.
  • After a refresh-PR is merged, python scripts/check_notebook_versions.py exits 0.

Tutorial notebooks committed with executed outputs were leaking three
forms of build-environment noise into the rendered docs:

  - TqdmWarning ("IProgress not found, please update ipywidgets")
    because the EXECUTE/EXECUTE_PG image had no ipywidgets installed,
    so tqdm.auto fell back to text mode and emitted the warning.
  - scikit-image "Downloading file mitosis.tif from gitlab.com ..."
    chatter because the dataset wasn't cached in the kernel image.
  - Stale DataJoint connection banners (2.1.1) in 21 notebooks
    because outputs hadn't been refreshed since the last DJ release.

Changes:
  - docker-compose.yaml: add ipywidgets to both EXECUTE branches.
  - scripts/execute_notebooks.py: warm the scikit-image cache before
    nbconvert spawns kernels.
  - scripts/check_notebook_versions.py: new guard that compares each
    notebook's "DataJoint X.Y.Z connected" banner against
    extra.datajoint_version in mkdocs.yaml and fails on drift.
  - README.md: document the committed-outputs policy and the guard.

Refreshed .ipynb outputs will land in a follow-up PR.
All 23 tutorial and how-to notebooks have been re-executed against the
current released DataJoint version (2.2.2). The committed cell outputs
were stale: 21 notebooks still showed "DataJoint 2.1.1 connected" in
their connection banner, and blob-detection additionally rendered a
spurious TqdmWarning and a scikit-image dataset download line.

Built on top of #170 (which added ipywidgets to the executor image and
pre-cached the scikit-image datasets), so the regenerated outputs are
free of the previous noise:

  - No "TqdmWarning: IProgress not found" stderr blocks.
  - No "Downloading file 'data/mitosis.tif' from gitlab.com ..." stdout.
  - All connection banners read "DataJoint 2.2.2 connected to ...".

scripts/check_notebook_versions.py now exits 0.
Collaborator

@MilagrosMarin MilagrosMarin left a comment


Thanks @dimitri-yatsenko! Verified the three forms of noise end-to-end:

✅ Reproduced TqdmWarning: IProgress not found in src/tutorials/examples/blob-detection.ipynb cell 6 — same cell also has the scikit-image Downloading file 'data/mitosis.tif' from gitlab.com... chatter.
✅ python scripts/check_notebook_versions.py runs from a clean checkout (no PyYAML / Material-tag headaches), flags 21 stale notebooks on DataJoint 2.1.1 (target: 2.2.x), exits 1. Matches the PR description exactly.
✅ Pre-cache loaders match the skimage calls in blob-detection.ipynb cell 6 — both data.hubble_deep_field() and data.human_mitosis() are used there, and blob-detection is the only notebook with the download chatter.
✅ ipywidgets is correctly scoped to EXECUTE / EXECUTE_PG only; LIVE/BUILD render committed outputs via mkdocs-jupyter and don't need it.
✅ Scope is well-isolated: infrastructure here, notebook refresh as a follow-up PR.

A few small things worth thinking about — none blocking:

1. The guard script isn't wired into CI. check_notebook_versions.py is a manual step right now, so the next time extra.datajoint_version bumps, stale banners can land unflagged again. Consider adding it to .github/workflows/development.yml so the check fails the PR rather than relying on a contributor remembering to run it. (Could be a follow-up — flagging it so we don't lose track.)

2. Pre-cache failure is swallowed silently. In execute_notebooks.py:

except Exception as _e:
    print(f"  pre-cache warn: {_loader.__name__}: {_e}")

If gitlab.com is down or the dataset URL changes, pre-cache fails quietly and the "Downloading file ..." chatter returns in the next refresh — exactly the noise this PR is trying to prevent. A non-zero exit (or at minimum a print(..., file=sys.stderr) so it's visible in CI logs) would catch that case. Minor.
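
To make the suggestion concrete, here is one possible shape for a pre-cache step that reports failures on stderr and hands a failure count back to the caller. It is a sketch, not a patch against execute_notebooks.py: warm_cache and the two demo loaders are hypothetical names, and in the real script the loaders would be the skimage.data functions this PR pre-caches.

```python
import sys
from typing import Callable, Iterable


def warm_cache(loaders: Iterable[Callable[[], object]]) -> int:
    """Call each loader once and return how many of them failed."""
    failures = 0
    for loader in loaders:
        try:
            loader()
            print(f"  cached: {loader.__name__}")
        except Exception as exc:
            failures += 1
            # Write to stderr so the failure stands out in CI logs.
            print(f"  pre-cache FAILED: {loader.__name__}: {exc}", file=sys.stderr)
    return failures


# Stand-in loaders for the demo; they do not touch the network.
def good_loader():
    return None


def flaky_loader():
    raise OSError("gitlab.com unreachable")


failed = warm_cache([good_loader, flaky_loader])
print(f"{failed} loader(s) failed")
```

The caller can then decide whether a non-zero count should abort the run, for example with sys.exit(1) before any notebook is executed.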

3. skimage imported at module-execute time inside main(). Fine inside docker (always pip-installed), but python scripts/execute_notebooks.py --help would now error out if scikit-image isn't installed locally. Wrapping the import in a try/except and skipping the pre-cache block when unavailable would make the script friendlier to local invocations. Probably overengineering — your call.
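
If it helps, a guarded import could look roughly like the sketch below. module_available and maybe_precache are hypothetical helper names, not functions in the current script; the two skimage.data loaders are the ones this PR already pre-caches.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def maybe_precache() -> bool:
    """Warm the scikit-image cache when the package is installed; skip otherwise."""
    if not module_available("skimage"):
        print("scikit-image not installed; skipping pre-cache")
        return False
    from skimage import data  # deferred so --help works without scikit-image

    data.hubble_deep_field()
    data.human_mitosis()
    return True
```

With the import deferred into maybe_precache(), argument parsing and --help never touch scikit-image; only an actual execution run does.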

4. had_banner short-circuits after the first stale match. If a notebook somehow has multiple banners (e.g., a cell re-running dj.conn()), only the first stale one is reported. Unlikely in practice — leaving as a note.
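
For reference, reporting every stale banner instead of stopping at the first could be done along these lines. This is a sketch under assumptions: stale_banners is a hypothetical name, the banner is assumed to appear in stream outputs (the "text" field of a cell output), and the current script may walk the notebook differently.

```python
import json
import re

# Captures major.minor; the patch component is matched but not compared.
BANNER_RE = re.compile(r"DataJoint (\d+\.\d+)\.\d+ connected")


def stale_banners(notebook_json: str, target: str) -> list[tuple[int, str]]:
    """Return (cell_index, banner_text) for every banner not matching target."""
    nb = json.loads(notebook_json)
    stale = []
    for idx, cell in enumerate(nb.get("cells", [])):
        for output in cell.get("outputs", []):
            # Stream text may be a single string or a list of lines.
            text = output.get("text", "")
            if isinstance(text, list):
                text = "".join(text)
            for match in BANNER_RE.finditer(text):
                if match.group(1) != target:
                    stale.append((idx, match.group(0)))
    return stale


demo = json.dumps({"cells": [
    {"cell_type": "code", "outputs": [
        {"output_type": "stream", "name": "stderr",
         "text": ["DataJoint 2.1.1 connected to db\n"]}]},
    {"cell_type": "markdown"},
    {"cell_type": "code", "outputs": [
        {"output_type": "stream", "name": "stderr",
         "text": "DataJoint 2.1.1 connected to db\n"}]},
]})
found = stale_banners(demo, "2.2")
print(found)
```

Here both code cells are reported, so a notebook that calls dj.conn() more than once would list each stale banner with its cell index.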

Otherwise this is clean — happy to approve once you've decided on the CI question.

Refresh tutorial notebook outputs against DataJoint 2.2.2