Closed
71 commits
a662cda
Replaced pdfjs-dist with unpdf.
hbmartin May 15, 2026
315b31b
package.json (line 41): npm run -> pnpm run, npx tsx -> pnpm dlx tsx
hbmartin May 15, 2026
82d024f
Configure attw and knip and update CI job
hbmartin May 15, 2026
d09a96a
CI now runs duplicate checks, type coverage, knip, publint, and attw.
hbmartin May 15, 2026
5803b5a
Removed the hardcoded company/person knowledge from parser source and…
hbmartin May 15, 2026
818eb48
Key changes landed in parser heuristics, section boundaries, skill li…
hbmartin May 15, 2026
62b69ea
CLI JSON baseline subcommands in src/cli.ts
hbmartin May 15, 2026
5e42de8
Added bundlephobia.yml (line 1).
hbmartin May 15, 2026
e7ba152
fix release publishing auth in .github/workflows/release.yml by addin…
hbmartin May 15, 2026
39834e9
Contact.email, name, headline, and location are now optional in src/i…
hbmartin May 15, 2026
3bad002
Improve parser robustness
hbmartin May 15, 2026
a8dd764
Reworked ParsedDateRange from isCurrent: boolean to a kind discrimina…
hbmartin May 15, 2026
74351a3
Tightened ISO day parsing so YYYY-MM-00 is rejected, with regression …
hbmartin May 15, 2026
3136809
src/parsers/basic-info.ts: added intent comments and split the bounda…
hbmartin May 16, 2026
86ffab3
Replace esbuild minification with Rollup
hbmartin May 16, 2026
0267195
Build now runs tsc --noEmit before Rollup.
hbmartin May 16, 2026
a9cf7c9
rollup.config.js (line 22) now builds both library artifacts and dist…
hbmartin May 16, 2026
b98e923
Updated strict fixture expectations in tests/unit/library.test.ts (li…
hbmartin May 16, 2026
3551dad
Extract the reusable write-json / verify-json batch logic out of src/…
hbmartin May 16, 2026
6b835b5
Removed pdf-parse completely from package.json, pnpm-lock.yaml, and E…
hbmartin May 16, 2026
37cf60b
Added import type { LinkedInProfile } in tests/unit/library.test.ts (…
hbmartin May 16, 2026
53fb1e1
fixing fixture json
hbmartin May 16, 2026
7adb65a
First pass at extraction improvements based on the fixtures
hbmartin May 16, 2026
3ce01ee
Structural extra sections now merge wrapped sidebar entries, so split…
hbmartin May 16, 2026
f287d47
education.ts (line 324): stopped arbitrary structural lines from bein…
hbmartin May 16, 2026
2d20030
src/parsers/structural-parser.ts: compact sidebars now detect as two-…
hbmartin May 16, 2026
c5177e8
Improved parser heuristics:
hbmartin May 16, 2026
4255679
Education locations now accept dots and hyphens, e.g. Washington, D.C…
hbmartin May 16, 2026
b42f0a9
experience-structural.ts: named description thresholds, documented tr…
hbmartin May 16, 2026
5921aa5
Expanded education degree detection for associate, certificate, and c…
hbmartin May 16, 2026
618d860
Tightened structural education degree continuation in education.ts, w…
hbmartin May 16, 2026
95f255a
Add support for publication extraction
hbmartin May 16, 2026
b610e82
src/utils/date-parser.ts: normalizes May to canonical capitalization.
hbmartin May 16, 2026
1947ea2
Changed experience-structural.ts (line 497) so a dotted title like Ma…
hbmartin May 16, 2026
60c1113
Raised coverage above 95%
hbmartin May 16, 2026
4a6893e
Targeted unit tests for parser/util files
hbmartin May 16, 2026
55aacb3
pin pnpm to 11.1.3
hbmartin May 19, 2026
e54dd26
Replace the brittle global column heuristic with page-aware structura…
hbmartin May 25, 2026
8f769a4
Fix the high-severity mixed-page layout invariant first: build pageLa…
hbmartin May 25, 2026
f9c3e46
Added typed/schema support for profile.honors_awards, contact.links, …
hbmartin May 25, 2026
af8ca77
Contact parsing in basic-info.ts: split adjacent links correctly, all…
hbmartin May 25, 2026
198d2a4
Updated education.ts to precompile the education date regexes as stat…
hbmartin May 25, 2026
e0f0309
Updated src/parsers/basic-info.ts (line 549) to trim once and use the…
hbmartin May 25, 2026
fd93038
Changed experience-structural.ts to use context-aware boundary scans,…
hbmartin May 25, 2026
97ddcd7
experience-structural.ts (line 47): extracted combined org/title rege…
hbmartin May 26, 2026
166b9c9
extraction fixes in experience-structural.ts: wrapped title/org recon…
hbmartin May 26, 2026
5549365
fixes: single-letter acronym spacing was being collapsed, descriptor …
hbmartin May 26, 2026
68cb348
Fixed extraction for:
hbmartin May 26, 2026
ca42b10
scripts/inspect-pdf-source.mjs writes a per-PDF evidence bundle under…
hbmartin May 26, 2026
a3dbb88
Created repo-local skill: debug-linkedin-sample-pdfs
hbmartin May 26, 2026
16202b1
Add scripts/verify-samples.mjs plus package.json script samples:verify
hbmartin May 26, 2026
b85c7c4
Structural PDF parsing only uses an actual structural Summary section…
hbmartin May 26, 2026
34a3a82
Added diagnostics to ParseResult via src/diagnostics.ts and wired it …
hbmartin May 26, 2026
d577715
make short plain-text input return a text-specific parse error messag…
hbmartin May 26, 2026
0988307
Centralized warning/diagnostic section names in warning-sections.ts (…
hbmartin May 26, 2026
e2f55e1
src/formatter.ts (line 58) now uses optional chaining for contact lin…
hbmartin May 26, 2026
b44ce37
src/formatter.ts: added a single guard for nullish contact links befo…
hbmartin May 26, 2026
88bf82c
Markdown output for formatLinkedInProfile.
hbmartin May 26, 2026
96e88b4
Add field-level mismatch detection so strict audits fail on high-conf…
hbmartin May 26, 2026
2357f4e
Changed experience-structural.ts to treat high-confidence standalone …
hbmartin May 26, 2026
bfbca4c
short title-case place-shaped lines are only treated as locations whe…
hbmartin May 26, 2026
b5469be
Improve verify-json mismatch reporting with a compact context diff by…
hbmartin May 26, 2026
0b4b280
The main change is location-classifier.ts (line 1): it scores indepen…
hbmartin May 26, 2026
bea346d
scripts/lib/source-coverage-helpers.mjs (line 85) includes ct, while …
hbmartin May 26, 2026
1fbb23d
location-classifier.ts (line 293): early returns for hard rejects, ZI…
hbmartin May 26, 2026
b5ae8af
Changed experience-structural.ts (line 356) to remove the redundant b…
hbmartin May 26, 2026
5d179ea
experience-structural.ts (line 356) to remove the redundant boolean c…
hbmartin May 26, 2026
2ef048e
location regex now strips trailing US/U.S./USA/U.S.A. suffixes even w…
hbmartin May 26, 2026
f3999e4
Added a canonical experience-header pass in experience-structural.ts …
hbmartin May 27, 2026
54a22b0
Contact extraction now builds a contact-only email search string that…
hbmartin May 27, 2026
89c9e5f
Changed src/parsers/basic-info.ts to use the canonical {2,63} TLD lim…
hbmartin May 27, 2026
91 changes: 91 additions & 0 deletions .agents/skills/debug-linkedin-sample-pdfs/SKILL.md
@@ -0,0 +1,91 @@
---
name: debug-linkedin-sample-pdfs
description: Use automatically when working in this repo and the user or developer discusses a specific PDF file/path, sample PDF, parser JSON, generated JSON output, baseline JSON, or whether JSON accurately reflects the original PDF. Also use for LinkedIn PDF extraction bugs, parser misses, section or column errors, unpdf/pdfplumber/Poppler comparisons, source evidence bundles, and sample completeness audits.
---

# Debug LinkedIn Sample PDFs

Use source-derived artifacts as the authority. Parser JSON and sample baselines are useful regression outputs, but they are not proof of what the PDF contains.

Default to the repo-local `samples/` directory when the user does not provide a PDF path, sample directory, or specific parser symptom. Do not ask which PDF to inspect before running the default repo-wide sample pass.

## Workflow

1. If `samples/` contains PDFs but no JSON files yet, generate initial JSON before checking:

```bash
pnpm run samples:verify
```

The generated JSON is not golden output. Treat it as suspect parser output that exists only to make coverage, diffing, and review workflows possible. Debug questionable values against the original PDFs with CLI PDF tools and the scripts in `scripts/`.

2. Generate evidence before diagnosing. When no PDF path or sample directory is provided, inspect the repo-local `samples/` directory by default:

```bash
pnpm run source:inspect
```

For a specific PDF:

```bash
pnpm run source:inspect -- <pdf-path>
```

For a custom output folder:

```bash
pnpm run source:inspect -- <pdf-path> --output .debug/<short-case-name>
```

3. Inspect source artifacts first:
- `poppler.layout.txt` for readable columns and visible line order.
- `overlay.html` for visual page geometry and text box placement.
- `unpdf.items.json` for the extractor input the parser actually receives.
- `pdfplumber.words.json` for independent word geometry.
- `parser-lines.json` and `parser.structural.json` for parser reconstruction.
- `parser-output.json` for current parser output, including `warnings` and `diagnostics`.
- `parser-source-coverage.json` or `baseline-source-coverage.json` for section-aware coverage prompts.
- `fieldMismatchOutputMatches` in coverage reports for high-confidence field-role mistakes inside a section, such as experience location or duration lines captured as descriptions.

4. Decide whether the failure is source extraction, layout reconstruction, section assignment, field parsing, or fixture expectation drift. Cite artifact filenames and source lines/items when explaining the diagnosis.

5. If changing parser behavior, add focused unit tests for the failing shape. Use a small synthetic text item or structural-line fixture unless the bug requires an end-to-end PDF fixture.

6. Run the repo-required verification after changes:

```bash
pnpm run check
pnpm run samples:verify
```

`samples/` is local and gitignored, so `samples:verify` is intentionally separate from the default check. If no JSON files are present, `samples:verify` writes initial suspect JSON baselines before checking. After `samples:verify`, report its result and make no further changes from that output unless the user explicitly asks.

## Required Final Report

After using this skill, clearly document:

- Which PDF files produced incorrect or incomplete parser output, with the source evidence used to identify each problem.
- What code changes specifically address each failure case. Tie each fix to the PDF symptom it resolves rather than describing changes only by file name.
- How the generated JSON should appear different after the changes, including the fields or sections expected to be added, removed, moved, or normalized.
- Any `warnings` and `diagnostics` present in generated output JSON, including warnings that remain after the fix.
- Any generated JSON that remains suspect and still needs source-level review. Generated JSON is never golden output just because it was written by the CLI.

## Batch Audit

Use the section-aware audit to scan all samples or compare a candidate fix:

```bash
pnpm run samples:audit-coverage -- --samples samples/
```

Use strict mode when validating the local sample corpus:

```bash
pnpm run samples:audit-coverage -- --samples samples/ --strict
```

Strict mode fails on unmatched source, loose source matches, untraced output, high-confidence field mismatches, and section warnings. Treat `crossSectionOutputMatches` as informational review prompts: the output was traced to source text, but not in the section inferred from its JSON path. Treat `fieldMismatchOutputMatches` as stronger evidence: the output traces to the right broad section but the source line has an inferred field role that conflicts with the JSON path. Verify suspicious rows against `poppler.layout.txt`, `overlay.html`, and source geometry before changing parser code.
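
The strict-mode rule above can be sketched as a small gate. This is only an illustration, not the repo's audit script: the `CoverageReport` shape (each signal stored as an array of rows) and the `strictFailures` helper are assumptions; only the signal names come from this document.

```typescript
// Hypothetical report shape. The key names appear in the coverage reports,
// but modeling each one as an array of rows is an assumption.
interface CoverageReport {
  unmatchedSourceSegments: unknown[];
  looseSourceMatches: unknown[];
  untracedOutputValues: unknown[];
  fieldMismatchOutputMatches: unknown[];
  sectionWarnings: unknown[];
  crossSectionOutputMatches: unknown[]; // informational only
}

// Signals that fail a strict audit, following the description above.
const STRICT_FAILURE_KEYS = [
  'unmatchedSourceSegments',
  'looseSourceMatches',
  'untracedOutputValues',
  'fieldMismatchOutputMatches',
  'sectionWarnings',
] as const;

// Returns the names of the failing signals; an empty array means the report passes.
function strictFailures(report: CoverageReport): string[] {
  return STRICT_FAILURE_KEYS.filter((key) => report[key].length > 0);
}

// Usage: cross-section rows alone do not fail, a loose match does.
const sampleReport: CoverageReport = {
  unmatchedSourceSegments: [],
  looseSourceMatches: ['contact.links[0]'],
  untracedOutputValues: [],
  fieldMismatchOutputMatches: [],
  sectionWarnings: [],
  crossSectionOutputMatches: ['profile.summary'],
};
console.log(strictFailures(sampleReport)); // only looseSourceMatches is reported
```

Keeping `crossSectionOutputMatches` out of the failure list mirrors its role as a review prompt rather than an error.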

## Artifact Reference

Read [references/source-evidence.md](references/source-evidence.md) when you need artifact meanings, coverage-report interpretation, or a triage checklist.
7 changes: 7 additions & 0 deletions .agents/skills/debug-linkedin-sample-pdfs/agents/openai.yaml
@@ -0,0 +1,7 @@
interface:
display_name: 'Debug LinkedIn Sample PDFs'
short_description: 'Debug PDF and parser JSON output'
default_prompt: 'Use $debug-linkedin-sample-pdfs when I mention a specific PDF file, sample directory, parser symptom, or generated JSON output.'

policy:
allow_implicit_invocation: true
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
# Source Evidence Reference

## Artifact Inventory

- `manifest.json`: Bundle index and tool failures. Check this first.
- `pdfinfo.txt`: Page count, producer, metadata, encryption, and page size.
- `pdffonts.txt`: Embedded font data; useful for odd glyph or spacing behavior.
- `pdfimages.txt`: Confirms whether visible content is text or image-backed.
- `poppler.layout.txt`: Best first read for visible line order and column breaks.
- `poppler.raw.txt`: Poppler extraction without layout preservation.
- `poppler.bbox.xhtml`: Poppler word and block coordinates.
- `pdfplumber.words.json`: Independent word-level geometry with font and size.
- `pdfplumber.chars.json`: Character-level geometry for split glyphs, ligatures, or wrapped tokens.
- `unpdf.items.json`: Raw unpdf/PDF.js text items before parser normalization.
- `parser.structural.json`: Parser debug export with detected layout, raw text, text items, and structural lines.
- `parser-lines.json`: Reconstructed structural lines consumed by section parsers.
- `parser-output.json`: Current parser output with `rawText`, `warnings`, and `diagnostics`.
- `source-segments.json`: Poppler layout text split into inferred source sections.
- `parser-source-coverage.json`: Source coverage of the current parser output.
- `baseline-source-coverage.json`: Source coverage of the adjacent sample JSON baseline, when present.
- `page-*.png`: Rendered pages used by the overlay.
- `overlay.html`: Rendered pages with unpdf text item boxes overlaid.

## Coverage Signals

- `unmatchedSourceSegments`: PDF text in an inferred source section that did not appear in same-section JSON. Verify before changing code; common causes are section inference mistakes, parser omissions, or intentionally unmodeled fields.
- `looseSourceMatches`: Source matched only by token containment, not exact normalized text. Use these to find punctuation, spacing, URL wrapping, or normalization issues.
- `crossSectionOutputMatches`: JSON values traced to PDF text in a different inferred section. Treat these as review prompts for section inference or intentionally duplicated content, not as untraced output failures.
- `fieldMismatchOutputMatches`: JSON values traced to the same inferred section but to a source line with a conflicting field role. These are high-confidence prompts for values like standalone experience locations or dates being captured as descriptions.
- `untracedOutputValues`: JSON values not traceable to same-section PDF text. These can reveal hallucinated/misassigned fields, normalized URLs, derived date fields, or text assigned to the wrong section.
- `sectionWarnings`: Parser warnings from generated or baseline JSON. Treat `section_parse_warning` as higher priority than heuristic coverage noise.
- `warnings` and `diagnostics`: Parser self-reporting in output JSON. Include these in the investigation notes even when the visible source text looks correct.
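
The difference between an exact normalized match and a loose token-containment match can be sketched as follows. This is a minimal illustration under stated assumptions: the normalization (lowercase, collapsed whitespace), the ASCII-only tokenizer, and the `classifyMatch` helper are hypothetical and are not taken from the repo's coverage scripts.

```typescript
type MatchKind = 'exact' | 'loose' | 'none';

// Assumed normalization: lowercase, collapse runs of whitespace, trim.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Assumed tokenizer: split on anything that is not an ASCII letter or digit.
function tokens(text: string): string[] {
  return normalize(text)
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

// 'exact' when the normalized strings are equal, 'loose' when every source
// token also appears in the output value, 'none' otherwise.
function classifyMatch(source: string, output: string): MatchKind {
  if (normalize(source) === normalize(output)) return 'exact';
  const outputTokens = new Set(tokens(output));
  const sourceTokens = tokens(source);
  const contained =
    sourceTokens.length > 0 && sourceTokens.every((token) => outputTokens.has(token));
  return contained ? 'loose' : 'none';
}

// Usage: a URL wrapped across two lines matches only loosely.
console.log(classifyMatch('linkedin.com/in/jane-doe', 'linkedin.com/in/ jane-doe'));
```

A loose result like this one points at spacing or URL-wrapping differences rather than missing content.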

## Triage Checklist

1. Confirm visible truth in `poppler.layout.txt` and `overlay.html`.
2. Compare Poppler, pdfplumber, and unpdf geometry if text is missing or split unexpectedly.
3. Compare `unpdf.items.json` to `parser-lines.json` when columns, page transitions, or wrapped lines are wrong.
4. Check `fieldMismatchOutputMatches` before accepting section coverage as sufficient; a same-section match can still be a field-level parse error.
5. Compare `parser-lines.json` to `parser-output.json` when parser input is correct but fields are wrong.
6. Use `baseline-source-coverage.json` only to audit fixture completeness; do not treat the baseline as source truth.
7. Keep generated artifacts in `.debug/` for ad hoc investigation and `.debug-dist/` for reproducible script output.
3 changes: 3 additions & 0 deletions .attw.json
@@ -0,0 +1,3 @@
{
"profile": "node16"
}
214 changes: 0 additions & 214 deletions .claude/CLI_IMPLEMENTATION_SUMMARY.md

This file was deleted.
