zalkowitsch · hbmartin · May 15, 2026
diff --git a/.agents/skills/debug-linkedin-sample-pdfs/SKILL.md b/.agents/skills/debug-linkedin-sample-pdfs/SKILL.md
@@ -0,0 +1,91 @@
+---
+name: debug-linkedin-sample-pdfs
+description: Use automatically when working in this repo and the user or developer discusses a specific PDF file/path, sample PDF, parser JSON, generated JSON output, baseline JSON, or whether JSON accurately reflects the original PDF. Also use for LinkedIn PDF extraction bugs, parser misses, section or column errors, unpdf/pdfplumber/Poppler comparisons, source evidence bundles, and sample completeness audits.
+---
+
+# Debug LinkedIn Sample PDFs
+
+Use source-derived artifacts as the authority. Parser JSON and sample baselines are useful regression outputs, but they are not proof of what the PDF contains.
+
+Default to the repo-local `samples/` directory when the user does not provide a PDF path, sample directory, or specific parser symptom. Do not ask which PDF to inspect before running the default repo-wide sample pass.
+
+## Workflow
+
+1. If `samples/` contains PDFs but no JSON files yet, generate initial JSON before checking:
+
+   ```bash
+   pnpm run samples:verify
+   ```
+
+   The generated JSON is not golden output. Treat it as suspect parser output that exists only to make coverage, diffing, and review workflows possible. Debug questionable values against the original PDFs with CLI PDF tools and the scripts in `scripts/`.
+
+2. Generate evidence before diagnosing. When no PDF path or sample directory is provided, inspect the repo-local `samples/` directory by default:
+
+   ```bash
+   pnpm run source:inspect
+   ```
+
+   For a specific PDF:
+
+   ```bash
+   pnpm run source:inspect -- <pdf-path>
+   ```
+
+   For a custom output folder:
+
+   ```bash
+   pnpm run source:inspect -- <pdf-path> --output .debug/<short-case-name>
+   ```
+
+3. Inspect source artifacts first:
+   - `poppler.layout.txt` for readable columns and visible line order.
+   - `overlay.html` for visual page geometry and text box placement.
+   - `unpdf.items.json` for the extractor input the parser actually receives.
+   - `pdfplumber.words.json` for independent word geometry.
+   - `parser-lines.json` and `parser.structural.json` for parser reconstruction.
+   - `parser-output.json` for current parser output, including `warnings` and `diagnostics`.
+   - `parser-source-coverage.json` or `baseline-source-coverage.json` for section-aware coverage prompts.
+   - `fieldMismatchOutputMatches` in coverage reports for high-confidence field-role mistakes inside a section, such as experience location or duration lines captured as descriptions.
+
+4. Decide whether the failure is source extraction, layout reconstruction, section assignment, field parsing, or fixture expectation drift. Cite artifact filenames and source lines/items when explaining the diagnosis.
+
+5. If changing parser behavior, add focused unit tests for the failing shape. Use a small synthetic text item or structural-line fixture unless the bug requires an end-to-end PDF fixture.
+
+6. Run the repo-required verification after changes:
+
+   ```bash
+   pnpm run check
+   pnpm run samples:verify
+   ```
+
+   `samples/` is local and gitignored, so `samples:verify` is intentionally separate from the default check. If no JSON files are present, `samples:verify` writes initial suspect JSON baselines before checking. After `samples:verify`, report its result and make no further changes based on that output unless the user explicitly asks.
+
+## Required Final Report
+
+After using this skill, clearly document:
+
+- Which PDF files produced incorrect or incomplete parser output, with the source evidence used to identify each problem.
+- What code changes specifically address each failure case. Tie each fix to the PDF symptom it resolves rather than describing changes only by file name.
+- How the generated JSON should differ after the changes, including the fields or sections expected to be added, removed, moved, or normalized.
+- Any `warnings` and `diagnostics` present in generated output JSON, including warnings that remain after the fix.
+- Any generated JSON that remains suspect and still needs source-level review. Generated JSON is never golden output just because it was written by the CLI.
+
+## Batch Audit
+
+Use the section-aware audit to scan all samples or compare a candidate fix:
+
+```bash
+pnpm run samples:audit-coverage -- --samples samples/
+```
+
+Use strict mode when validating the local sample corpus:
+
+```bash
+pnpm run samples:audit-coverage -- --samples samples/ --strict
+```
+
+Strict mode fails on unmatched source, loose source matches, untraced output, high-confidence field mismatches, and section warnings. Treat `crossSectionOutputMatches` as informational review prompts: the output was traced to source text, but not in the section inferred from its JSON path. Treat `fieldMismatchOutputMatches` as stronger evidence: the output traces to the right broad section but the source line has an inferred field role that conflicts with the JSON path. Verify suspicious rows against `poppler.layout.txt`, `overlay.html`, and source geometry before changing parser code.
+
+## Artifact Reference
+
+Read [references/source-evidence.md](references/source-evidence.md) when you need artifact meanings, coverage-report interpretation, or a triage checklist.
diff --git a/.agents/skills/debug-linkedin-sample-pdfs/agents/openai.yaml b/.agents/skills/debug-linkedin-sample-pdfs/agents/openai.yaml
@@ -0,0 +1,7 @@
+interface:
+  display_name: 'Debug LinkedIn Sample PDFs'
+  short_description: 'Debug PDF and parser JSON output'
+  default_prompt: 'Use $debug-linkedin-sample-pdfs when I mention a specific PDF file, sample directory, parser symptom, or generated JSON output.'
+
+policy:
+  allow_implicit_invocation: true
diff --git a/.agents/skills/debug-linkedin-sample-pdfs/references/source-evidence.md b/.agents/skills/debug-linkedin-sample-pdfs/references/source-evidence.md
@@ -0,0 +1,42 @@
+# Source Evidence Reference
+
+## Artifact Inventory
+
+- `manifest.json`: Bundle index and tool failures. Check this first.
+- `pdfinfo.txt`: Page count, producer, metadata, encryption, and page size.
+- `pdffonts.txt`: Embedded font data; useful for odd glyph or spacing behavior.
+- `pdfimages.txt`: Confirms whether visible content is text or image-backed.
+- `poppler.layout.txt`: Best first read for visible line order and column breaks.
+- `poppler.raw.txt`: Poppler extraction without layout preservation.
+- `poppler.bbox.xhtml`: Poppler word and block coordinates.
+- `pdfplumber.words.json`: Independent word-level geometry with font and size.
+- `pdfplumber.chars.json`: Character-level geometry for split glyphs, ligatures, or wrapped tokens.
+- `unpdf.items.json`: Raw unpdf/PDF.js text items before parser normalization.
+- `parser.structural.json`: Parser debug export with detected layout, raw text, text items, and structural lines.
+- `parser-lines.json`: Reconstructed structural lines consumed by section parsers.
+- `parser-output.json`: Current parser output with `rawText`, `warnings`, and `diagnostics`.
+- `source-segments.json`: Poppler layout text split into inferred source sections.
+- `parser-source-coverage.json`: Source coverage of the current parser output.
+- `baseline-source-coverage.json`: Source coverage of the adjacent sample JSON baseline, when present.
+- `page-*.png`: Rendered pages used by the overlay.
+- `overlay.html`: Rendered pages with unpdf text item boxes overlaid.
+
+## Coverage Signals
+
+- `unmatchedSourceSegments`: PDF text in an inferred source section that did not appear in same-section JSON. Verify before changing code; common causes are section inference mistakes, parser omissions, or intentionally unmodeled fields.
+- `looseSourceMatches`: Source matched only by token containment, not exact normalized text. Use these to find punctuation, spacing, URL wrapping, or normalization issues.
+- `crossSectionOutputMatches`: JSON values traced to PDF text in a different inferred section. Treat these as review prompts for section inference or intentional duplicated content, not as untraced output failures.
+- `fieldMismatchOutputMatches`: JSON values traced to the same inferred section but to a source line with a conflicting field role. These are high-confidence prompts for values like standalone experience locations or dates being captured as descriptions.
+- `untracedOutputValues`: JSON values not traceable to same-section PDF text. These can reveal hallucinated/misassigned fields, normalized URLs, derived date fields, or text assigned to the wrong section.
+- `sectionWarnings`: Parser warnings from generated or baseline JSON. Treat `section_parse_warning` as higher priority than heuristic coverage noise.
+- `warnings` and `diagnostics`: Parser self-reporting in output JSON. Include these in the investigation notes even when the visible source text looks correct.
+
+## Triage Checklist
+
+1. Confirm visible truth in `poppler.layout.txt` and `overlay.html`.
+2. Compare Poppler, pdfplumber, and unpdf geometry if text is missing or split unexpectedly.
+3. Compare `unpdf.items.json` to `parser-lines.json` when columns, page transitions, or wrapped lines are wrong.
+4. Check `fieldMismatchOutputMatches` before accepting section coverage as sufficient; a same-section match can still be a field-level parse error.
+5. Compare `parser-lines.json` to `parser-output.json` when parser input is correct but fields are wrong.
+6. Use `baseline-source-coverage.json` only to audit fixture completeness; do not treat the baseline as source truth.
+7. Keep generated artifacts in `.debug/` for ad hoc investigation and `.debug-dist/` for reproducible script output.
diff --git a/.attw.json b/.attw.json
@@ -0,0 +1,3 @@
+{
+	"profile": "node16"
+}
diff --git a/.claude/CLI_IMPLEMENTATION_SUMMARY.md b/.claude/CLI_IMPLEMENTATION_SUMMARY.md