
Wrapped email fix followup #4

Closed
hbmartin wants to merge 71 commits into zalkowitsch:main from hbmartin:wrapped-email-fix-followup

Conversation

@hbmartin

No description provided.

hbmartin added 30 commits May 15, 2026 08:05
parseLinkedInPDF now accepts ArrayBuffer | Uint8Array | string with no Buffer type in public declarations.
StructuralParser.extractStructuredText now uses getDocumentProxy + extractTextItems.
Removed TextItem.transform.
Externalized unpdf in Rollup/esbuild.
Added CJS .d.cts generation and fixed the export map so attw passes.
Updated README and added Buffer, Uint8Array, ArrayBuffer, and string input tests.
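
A minimal sketch of how the three accepted input types could be told apart before parsing. `normalizeInput` and `NormalizedInput` are hypothetical names used only for illustration, and treating a string as already-extracted text (rather than a file path) is an assumption, not something this PR text confirms.

```typescript
// Hypothetical illustration; not the library's actual implementation.
type ParserInput = ArrayBuffer | Uint8Array | string;

type NormalizedInput =
  | { kind: "binary"; bytes: Uint8Array }
  | { kind: "text"; text: string };

function normalizeInput(input: ParserInput): NormalizedInput {
  if (typeof input === "string") {
    // Assumption: a string is already-extracted text, not a file path.
    return { kind: "text", text: input };
  }
  if (input instanceof Uint8Array) {
    // Node's Buffer is a Uint8Array subclass, so it lands here without
    // Buffer appearing in any public type declaration.
    return { kind: "binary", bytes: input };
  }
  // Remaining case: a plain ArrayBuffer, wrapped in a view without copying.
  return { kind: "binary", bytes: new Uint8Array(input) };
}
```

Keeping `Buffer` out of the union is what lets the same declarations work in runtimes that have no Node globals.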
ci.yml (line 20) and release.yml (line 21): added pnpm/action-setup, switched cache/install/run/publish commands to pnpm
pnpm-workspace.yaml (line 1): approved esbuild and unrs-resolver builds so pnpm 11 does not block install/build
Add test and update readme
Registered the CLI package surface and package metadata.
Removed/satisfied unused dependency findings.
Added focused structural parser coverage.

Added Jest global coverage thresholds in jest.config.cjs.
Added pnpm run dupes and pnpm run type-coverage to CI in .github/workflows/ci.yml.
Added type-coverage --strict --at-least 100 and wired dupes/type coverage into quality:check in package.json.
publint now uses pnpm via package.json (line 62), and the profile fixture resolves Profile.pdf relative to the test file in profile-fixture.test.ts (line 6).

The README now includes:
Sample parsed JSON output at README.md (line 119)
A real Vercel Edge route example at README.md (line 199)
A link from the Experience interface to docs/work-experience-semantics.md (line 1)
a Developing and Testing the CLI section
… replaced it with generic helpers in profile-text.ts. Updated the experience, structural experience, basic info, and skills parsers to use those heuristics. Added tests for generic organization detection, suffix preservation, fallback text parsing, and skill filtering.

Updated README.md:
Replaced stale @zalko/linkedin-parser and old clone URL references.
Clarified npx usage with the real CLI command: linkedin-pdf-parser.
Consolidated duplicated Quick Start / Basic Usage content.
Switched Development commands to pnpm.
Added Node.js 22.0.0+ and supported runtime notes.
Moved local CLI testing into the Development section.
…mits, localized structural parsing, education year cleanup, headline/email handling, and focused unit coverage.

Main files include profile-text.ts, lists.ts, and experience-structural.ts.
Docs in README.md and expanded tests in tests/unit/cli.test.ts.
It runs on PRs to main, sets up pnpm@11.1.2 with Node 22, installs, builds, then reports compressed size diffs for dist/**/*.{js,mjs,cjs} using preactjs/compressed-size-action@v2.

Changed src/cli.ts (line 122) to classify symlinked files correctly, keep symlinked directories out of PDF parsing, check directoryExists() before fileExists(), and match PDF/JSON stems case-insensitively. Added coverage in tests/unit/cli.test.ts (line 177) for the directory contract, symlinked PDFs, case-insensitive pairing, verify-json --raw-text, and empty text baselines. Updated @jest/globals to ^30.4.1 in package.json (line 75) and pnpm-lock.yaml (line 18).
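
The case-insensitive PDF/JSON stem pairing described above could look roughly like this. `stemOf` and `pairPdfWithJson` are hypothetical names; the actual logic in src/cli.ts is not shown in this conversation and may differ.

```typescript
// Hypothetical helpers for illustration only.
function stemOf(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  // Lowercase the stem so "Profile.PDF" and "profile.json" share a key.
  return (dot > 0 ? fileName.slice(0, dot) : fileName).toLowerCase();
}

function pairPdfWithJson(fileNames: string[]): Map<string, string> {
  const jsonByStem = new Map<string, string>();
  for (const name of fileNames) {
    if (name.toLowerCase().endsWith(".json")) jsonByStem.set(stemOf(name), name);
  }
  const pairs = new Map<string, string>();
  for (const name of fileNames) {
    if (!name.toLowerCase().endsWith(".pdf")) continue;
    const json = jsonByStem.get(stemOf(name));
    if (json !== undefined) pairs.set(name, json); // unpaired PDFs are skipped
  }
  return pairs;
}
```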

Updated release.yml (line 1) to publish on GitHub Release published, use Node 22.x, request id-token: write, and publish with npm publish --provenance using latest for normal releases and next for prereleases.
…g NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }} to the npm publish step.

pin bundlephobia.yml actions to full commit SHAs
…ndex.ts.

ParseResult now always includes warnings.
Added certifications, volunteer_work, and projects arrays.
Added structural line normalization, structural identity parsing, extra section parsing, and structural education parsing.
Simplified text fallback name/location/email parsing and added Portuguese degree support.
Add shared parser line normalization, structured date parsing, section-level parse warnings, and exported Zod schemas while preserving existing public string fields.

Model: GPT-5 Codex Desktop

Thread: 019e2d0f-478e-77e1-9bd9-571c14ec39ce
…ted union in src/types/profile.ts and src/schemas.ts.

Fixed date parsing in src/utils/date-parser.ts: ISO/month hyphens no longer split as ranges, compact year ranges still parse, regexes are precompiled, and formatIsoDate now enforces precision invariants.
Fixed parser warning accuracy in basic info, structural experience, extra sections, and language parsing.
Exported the esbuild config for regression testing and added tests/unit/build-config.test.ts.
Updated README schema example import and date range docs.
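
As a rough illustration of the range-splitting rule in the date-parser fix: a hyphen inside an ISO or month token should not count as a range separator, while a compact year range should. The function name and regexes below are assumptions for this sketch, not the contents of src/utils/date-parser.ts.

```typescript
// Precompiled once at module load, in the spirit of the "regexes are precompiled" note.
const COMPACT_YEAR_RANGE = /^(\d{4})\s*[-\u2013\u2014]\s*(\d{4})$/;
const SPACED_RANGE = /^(.+?)\s+[-\u2013\u2014]\s+(.+)$/;

// Hypothetical helper: returns [start, end] or null when the text is a single date.
function splitDateRange(text: string): [string, string] | null {
  const trimmed = text.trim();
  const compact = COMPACT_YEAR_RANGE.exec(trimmed);
  if (compact) return [compact[1], compact[2]]; // "2019-2021"
  const spaced = SPACED_RANGE.exec(trimmed);
  if (spaced) return [spaced[1], spaced[2]]; // "Jan 2020 - Present"
  return null; // "2020-01" or "Jan-2020" stays one date
}
```

Requiring whitespace around the separator in the general case is the design choice that keeps `2020-01 - 2021-03` splitting at the middle dash only.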
…coverage in tests/unit/date-parser.test.ts (line 70).

Fixed basic-info header scanning so contact and summary warnings can both be detected before later sections in src/parsers/basic-info.ts (line 414).
Added duration whitespace normalization coverage in tests/unit/experience.test.ts (line 124).
Preserved non-section extra-section warnings in src/parsers/extra-sections.ts (line 118).
Logged esbuild errors before exit in esbuild.config.js (line 25).
Updated README ParsedDateRange docs to the discriminated union in README.md (line 368).
…ry/section-transition branches.

src/parsers/extra-sections.ts: exported the warning filter for focused unit coverage.
Model: GPT-5; reasoning effort: medium; Thread: 019e30d2-7ec3-7f81-b2ae-4787a3d0d93f
…/cli.js; the CLI artifact imports ./index.js instead of duplicating the full bundle.

Added verify:artifacts, verify:package, and budgeted size:check scripts in package.json (line 59).
Added artifact/package verification scripts under scripts (line 1).
Wired checks into PR CI, release CI, and bundlephobia workflow; main CI runs both artifact and packed-package verification after build.
…ne 28) for Arkady Zalkowitsch’s profile.

Replaced loose truthy / array-shape checks with exact expected values across the library tests.
Updated both JS e2e scripts to use the new fixture values and fixed the broken dist import in tests/e2e/e2e-test.js (line 3).
Reworked tests/e2e/full-e2e-test.js (line 6) to run strict fixture checks against both independent PDF text extraction and the built parser.
…cli.ts into an internal TypeScript module, then use it from both the CLI and Jest-based end-to-end fixture tests.
…2E usage.

Updated E2E scripts to validate the built unpdf-backed parser output directly.
Fixed two-column PDF section parsing so left-column Languages no longer leaks into the right-column summary.
Added structural language parsing, now extracting:
Português (Native or Bilingual)
Inglês (Professional Working)
Espanhol (Elementary)
Centralized the expected test-resume fixture via tests/fixtures/expected-test-resume-profile.js.
Updated scripts/verify-packed-package.mjs so the CJS consumer catches rejected promises and exits nonzero.
…line 4)

Added the pronoun-preserving regex intent comment in src/utils/structural-lines.ts (line 114)
Added per-column section-state comments in src/utils/structural-sections.ts (line 32)
package.json: added pnpm run check, which runs format, dupes, test, build, then knip.
AGENTS.md: updated the required verification command to pnpm run check.
… certifications like MITx ... in + Social Science become one item.

Structural identity/top skills flow now prefers already-parsed sidebar skills instead of re-running text fallback and emitting false warnings.
Structural experience parsing now handles:
person-shaped organization names after descriptions, like Boba Joy and Partiu Vantagens!
contractor-style titles with parentheticals
Greater Rio de Janeiro locations
split address/postcode location lines
description lines that previously got misclassified as titles
Structural education parsing now joins wrapped degree lines before extracting dates, including slash-wrapped text like Technology/ + Technician.
…g appended to existing degrees, while preserving wrapped degree/date continuations.

experience-structural.ts (line 398): fixed dotted address-prefix matching and normalized joined location strings as a whole.
profile-text.ts (line 203): constrained title parentheticals to exactly one allowlisted trailing suffix.
…column layouts.

src/utils/profile-text.ts: producer titles and Bay Area locations are recognized; sentence fragments like Manager. are not titles.
src/parsers/experience-structural.ts: wrapped description fragments are preserved without turning them into bogus roles.
experience-structural.ts and education.ts and profile-text.ts
…. and Winston-Salem, NC.

No-date wrapped structural degree fragments now append when they look like short academic continuations.
…icky heuristics, fixed short description continuation after short location lines, and guarded punctuation continuations so orgs like Golden Angle Productions, LLC. and Partiu Vantagens! still start new entries.

structural-parser.ts: extracted column/layout magic numbers.
…ertificação.

Removed the unused lower local in education.ts (line 215).
Moved description-continuation detection after structural checks in experience-structural.ts (line 254).
Added regression tests for adjacent-job splitting and short at/by/on continuation fragments.
hbmartin added 28 commits May 25, 2026 16:58
… avoid promoting no-date prose into positions, split combined org/title rows like Robert Bosch GmbH Business Controller, handle page-footer gaps between title/date, and prevent campaign bullets from being read as dates/titles. Updated profile-text.ts with narrow support for Palo Alto, GmbH, International, and fixed-term consulting title parentheticals.
…xes and reject suffix-only “titles” like LLC and Inc.

experience-structural.ts (line 551): included + in the relevant noise-prefix checks.
profile-text.ts (line 45): moved gmbh before group.
…struction, stricter whole-line duration detection, stricter location detection, and organization-boundary precedence for title/date-shaped entries.
…lines with multi-word values were skipped, a one-word role (Venture) caused a whole experience entry to drop, and organization-looking location/descriptor lines could swallow following descriptions, short media descriptors promoted to fake organizations, society as an organization cue
media-production descriptions like Directed by..., Feature Film, Television Series, and bullet lists
short descriptors such as Spatial AI, Audit, Consulting
multi-word Client: labels
one-word role title Venture without breaking Stage Venture Partners
wrapped bilingual organization names
hyphenated organization/tagline names like WhereTo - Business Travel Reimagined
short summary continuation lines
education continuations such as Minor in Speech Communication
single-letter acronym spacing such as Series A interest, Model Y production, Gen Z brand, S/S collection, Formula E and

Follow-up changes:
src/parsers/experience-structural.ts (line 42): extracted repeated inline param object shapes into named interfaces.
src/parsers/experience-structural.ts (line 82): expanded organization connector handling and dotted GmbH. matching.
src/parsers/experience-structural.ts (line 1429) and src/utils/profile-text.ts (line 459): allowed 3-letter uppercase country codes in location shapes.

Added debugging scripts
… .debug/<pdf-stem>/

scripts/lib/sample-script-helpers.mjs (line 15): optionValue() now rejects missing values or next-flag values, and shared warning-failure formatting was added.
scripts/check-sample-warnings.mjs (line 21): per-PDF read/parse errors are now aggregated and reported together.
src/parsers/experience-structural.ts (line 95): GmbH. is handled consistently in comma-separated suffixes and boundary guards.
src/parsers/basic-info.ts (line 313) and AGENTS.md (line 5): added the requested intent comment and fixed the typo.
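
The `optionValue()` behavior mentioned above (rejecting a missing value or a value that is really the next flag) might be sketched as follows. The signature and error message are assumptions; the real helper in scripts/lib/sample-script-helpers.mjs is plain JavaScript and is not shown here.

```typescript
// Hypothetical signature for illustration only.
function optionValue(args: string[], flag: string): string | undefined {
  const index = args.indexOf(flag);
  if (index === -1) return undefined; // flag not supplied at all
  const value = args[index + 1];
  // Reject "--samples" at the end of argv and "--samples --json".
  if (value === undefined || value.startsWith("-")) {
    throw new Error(`Missing value for ${flag}`);
  }
  return value;
}
```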
source-evidence.md for artifact meanings and coverage audit triage.
Trimmed AGENTS.md and scripts/README.md to point to the skill instead of carrying all workflow guidance inline.
scripts/inspect-pdf-source.mjs (line 119): multi-file output dirs now disambiguate duplicate PDF stems with a stable short path hash.
scripts/lib/source-coverage-helpers.mjs (line 35): volunteer heading regex now matches Volunteering Experience.
src/index.ts (line 79): binary parse path now reuses debugArtifacts.structuralLines instead of recomputing them.
scripts/inspect-pdf-source.mjs (line 745): failure manifest entries now include relativePath.
Normalize unpdf items in writeUnpdfArtifacts by deriving x/y from transform.
Add focused unit coverage for raw PDF.js items without direct x/y.
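
For context on deriving x/y from `transform`: a raw PDF.js text item carries a six-element matrix `[a, b, c, d, e, f]` whose last two entries are the horizontal and vertical translation. The sketch below assumes simplified item shapes and a fallback of 0 when neither source is present; `withPosition` is a hypothetical name, not the function in writeUnpdfArtifacts.

```typescript
// Simplified, assumed shapes for illustration.
interface RawTextItem {
  str: string;
  transform?: number[];
  x?: number;
  y?: number;
}

interface PositionedTextItem {
  str: string;
  x: number;
  y: number;
}

function withPosition(item: RawTextItem): PositionedTextItem {
  // Prefer direct x/y; otherwise read the translation from transform[4] / transform[5].
  const x = item.x ?? item.transform?.[4] ?? 0;
  const y = item.y ?? item.transform?.[5] ?? 0;
  return { str: item.str, x, y };
}
```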
Tighten AGENTS.md and the repo-local debugging skill so agents follow one clear workflow for parser robustness work.

Harden source coverage matching:

- normalize punctuation spacing so London , England matches London, England exactly;
- allow adjacent same-section source segments to satisfy wrapped values like Limited + Working);
- treat output values found in another source section as traced cross-section matches, not untraced hallucinations;
- add crossSectionOutputMatches and a total count to audit reports as informational review prompts.
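
The punctuation-spacing rule in the first bullet can be sketched as a small normalizer applied to both the source text and the output value before comparison. The function name is hypothetical and the punctuation set is an assumption.

```typescript
// Hypothetical helper: collapse whitespace and drop spaces before closing punctuation.
function normalizePunctuationSpacing(value: string): string {
  return value
    .replace(/\s+([,.;:!?)])/g, "$1") // "London , England" -> "London, England"
    .replace(/\s+/g, " ")
    .trim();
}

function matchesSource(sourceText: string, outputValue: string): boolean {
  return normalizePunctuationSpacing(sourceText) === normalizePunctuationSpacing(outputValue);
}
```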
… for profile.summary.

The plain-text long-line fallback still exists for non-structural parsing, but it no longer pulls experience descriptions into summary fields for PDFs without an About/Summary section.
Changed scripts/inspect-pdf-source.mjs (line 358) so normalizeUnpdfTextItem returns an overlay-safe blank item for null or undefined input, avoiding the spread/runtime crash while preserving downstream fields.
Changed source-coverage-helpers.mjs (line 166) to precompute combined source text per section once and reuse that map for same-section and cross-section output matching
Updated the workflow so an empty-JSON samples/ corpus can bootstrap itself, and made the skill explicit that generated JSON is only a suspect baseline for debugging, not golden truth.
Updated the skill in .agents/skills/debug-linkedin-sample-pdfs/SKILL.md with a new Required Final Report section.
…through src/index.ts.

Added formatLinkedInProfile in src/formatter.ts.
Added typed library errors and strict/safe parser variants in src/errors.ts and src/index.ts.
Updated schemas/types, README, checked-in fixture JSON, and unit/e2e coverage for the new result shape.
…e, not the default PDF unreadable wording.

make contact link formatting handle label-only and URL-only links without emitting undefined.
add blank-line separation between formatted experience and education entries.
…line 1), reused by schemas.ts (line 102) and the profile type.

Updated contact link formatting in formatter.ts (line 55) so labeled links without URLs are omitted instead of rendering bad text.
Updated text-input error normalization in errors.ts (line 29) so generic plain-text parse failures no longer use the PDF-only default message.
Added/updated unit coverage in formatter, schema, and public API tests.
…k label/URL reads.

src/types/profile.ts (line 1) re-exports the locally imported WarningSection type.
src/schemas.ts (line 102) exports WarningSectionSchema, and src/index.ts (line 72) re-exports it publicly.
pnpm run source:inspect now defaults to inspecting samples/ when no PDF path or --samples value is passed, while explicit PDF paths still take precedence.
updated the skill docs to use that default and to require noting warnings and diagnostics from generated output JSON.
…re reading label and url.

scripts/inspect-pdf-source.mjs: now rejects --samples combined with positional PDF paths.
Generated plain-text formatter baselines alongside the checked-in PDF/JSON fixtures
Added an E2E fixture test in json-fixtures.test.ts (line 45) that parses each checked-in PDF and verifies both formatLinkedInProfile(profile) and formatLinkedInProfile(profile, { includeContact: true }) against the new .txt baselines.
Updated docs/migrate-2.1.0.md (line 156) to document app-facing plain text via formatLinkedInProfile, the full FormatLinkedInProfileOptions shape, the includeContact default/behavior, and when to use includeRawText/result.rawText instead.
Changed src/formatter.ts to add outputFormat?: 'plainText' | 'markdown', keep plain text as the default, and render Markdown with # Name plus ## section headings. Exported the new LinkedInProfileOutputFormat type from src/index.ts.
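
A minimal sketch of the plain-text versus Markdown switch described above. `MiniProfile` and `formatProfile` are simplified stand-ins invented for this example; the real `formatLinkedInProfile` takes the library's profile type and a fuller options object.

```typescript
// Assumed, simplified types for illustration only.
type OutputFormat = "plainText" | "markdown";

interface MiniProfile {
  name: string;
  sections: { title: string; lines: string[] }[];
}

function formatProfile(profile: MiniProfile, outputFormat: OutputFormat = "plainText"): string {
  const md = outputFormat === "markdown";
  // Markdown uses "# Name" and "## Section"; plain text keeps bare headings.
  const parts: string[] = [md ? `# ${profile.name}` : profile.name];
  for (const section of profile.sections) {
    parts.push("", md ? `## ${section.title}` : section.title, ...section.lines);
  }
  return parts.join("\n");
}
```

Defaulting the parameter to `"plainText"` mirrors the note that plain text stays the default, so existing callers see no change.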
…idence cases like standalone location lines being swallowed by description.
…city lines (Los Angeles, San Diego) as experience locations, and to strip trailing commas only when classifying/finalizing locations so address continuations still keep their punctuation.

update the labeled email matcher in basic-info.ts (line 590) so Email john@example.com is accepted, not only Email: ... or Email - ....
replace the dynamic new RegExp(...) calls in isEmailSearchLine with static regex literals to avoid recompilation and the flagged variable-regex pattern.
fix the structural contact fallback in basic-info.ts (line 350)
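
One way to express the labeled email matcher as a static regex literal, accepting `Email john@example.com` as well as the colon and dash forms. The pattern and the name `extractLabeledEmail` are assumptions for illustration, not the code in basic-info.ts.

```typescript
// Static literal: compiled once, no dynamic new RegExp(...) per call.
// The label must be followed by a separator or at least one space.
const LABELED_EMAIL = /^e-?mail(?:\s*[:\-\u2013]\s*|\s+)([^\s@]+@[^\s@]+\.[^\s@]+)$/i;

// Hypothetical helper: returns the address or undefined when the line does not match.
function extractLabeledEmail(line: string): string | undefined {
  return LABELED_EMAIL.exec(line.trim())?.[1];
}
```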
…n they immediately follow a duration/date line, with punctuation/org/title-shaped false positives rejected

src/formatter.ts (line 77): normalizes LinkedIn URLs for dedupe across protocol, www, case, and trailing slash variants, with the requested intent comment.
scripts/lib/source-coverage-helpers.mjs (line 476): allows whitelisted single-word locations in source-coverage role detection.
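
The LinkedIn URL dedupe noted for src/formatter.ts could be sketched like this. Both function names are hypothetical, and lowercasing the whole URL (including the path) is an assumption about what "case" covers.

```typescript
// Hypothetical normalizer used only as a dedupe key, never for display.
function normalizeLinkedInUrl(url: string): string {
  return url
    .trim()
    .toLowerCase()
    .replace(/^(?:https?:\/\/)?(?:www\.)?/, "") // protocol and www in a single pass
    .replace(/\/+$/, ""); // trailing slashes
}

function dedupeLinks(urls: string[]): string[] {
  const seen = new Set<string>();
  return urls.filter((url) => {
    const key = normalizeLinkedInUrl(url);
    if (seen.has(key)) return false;
    seen.add(key);
    return true; // keep the first spelling encountered
  });
}
```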
… default, plus a new --json-paths flag that prints semantic JSON keypath changes.
…dent signals instead of relying on hardcoded location regexes: known place/admin/country data, region codes, comma-region evidence, qualified area terms, structural context, and negative title/org/prose signals.
…src/utils/location-classifier.ts (line 148) does not. That can make coverage validation disagree with the parser

also fixes:
Severity 2: cubic identified verify-json short flags like -x being treated as positional args in src/cli.ts (line 260).
Severity 2: cubic identified unbounded quadratic LCS memory/time behavior in src/json-fixtures.ts (line 575).
Severity 1: formatter URL normalization can combine two prefix replacements in src/formatter.ts (line 126).
Severity 1: cubic identified duplicated added/removed JSON traversal recursion in src/json-fixtures.ts (line 824).
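
To illustrate the kind of bound the LCS finding calls for, here is a sketch that refuses oversized inputs and keeps only two rows of the table in memory. The cap value and the function name are assumptions; the actual fix in src/json-fixtures.ts is not shown in this conversation.

```typescript
// Hypothetical cap on table cells; the caller would fall back to a cheaper diff on null.
const MAX_LCS_CELLS = 1_000_000;

function lcsLength(a: string[], b: string[]): number | null {
  if (a.length * b.length > MAX_LCS_CELLS) return null;
  // Two rolling rows sized by the shorter input keep memory linear.
  const [longer, shorter] = a.length >= b.length ? [a, b] : [b, a];
  let prev = new Array<number>(shorter.length + 1).fill(0);
  let curr = new Array<number>(shorter.length + 1).fill(0);
  for (let i = 1; i <= longer.length; i++) {
    for (let j = 1; j <= shorter.length; j++) {
      curr[j] = longer[i - 1] === shorter[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1]);
    }
    [prev, curr] = [curr, prev]; // reuse the old row as scratch space
  }
  return prev[shorter.length];
}
```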
…P+4 support, alternate dash duration detection.

source-coverage-helpers.mjs (line 573): diacritic-normalized lookup seeds/runtime values, empty phrase guard, standalone UK/USA/U.S. handling.
experience-structural.ts (line 1458): removed metadata context for experience location classification and kept wrapped org names like University de Paris intact.
…oolean check and support dotted U.S. / U.S.A. suffixes. Changed location-classifier.ts (line 486) to use Unicode dash punctuation for duration rejection.

SKILL.md (line 3) now explicitly says to use the skill automatically when a specific PDF file/path, parser JSON, generated JSON output, baseline JSON, or PDF-vs-JSON accuracy question is discussed.
openai.yaml (line 3) now mirrors that intent and sets policy.allow_implicit_invocation: true
…heck and support dotted U.S. / U.S.A. suffixes.

experience-structural.ts (line 967): organization-name cue now only applies when the line is not location-shaped.
experience-structural.ts (line 1535): greater-area country suffix cleanup now handles comma, dotted, spaced, and trailing-dot US variants.
location-classifier.ts (line 540): contextual region-code detection now uses comma-separated segment evidence instead of any comma.
location-classifier.ts (line 486): now uses Unicode dash punctuation for duration rejection.
profile-text.ts (line 461): removed the unused hasCommaSeparatedOrganizationSuffix
scripts/lib/source-coverage-helpers.mjs (line 636): comma evidence now comes from comma-separated segments with known-place or unambiguous region-code evidence, and ambiguous codes no longer count in comma-region scoring.
…(line 454) before normal line classification.

It now scores organization -> title -> duration -> optional location and organization -> total duration -> title -> duration blocks using visual alignment, font hierarchy, organization cues, description self-reference, and a person-name penalty.
Removed the narrow Angels organization cue. Nordic Angels now parses because the block is structurally and visually strong.
Prevented optional location detection from swallowing the next organization header.
… joins adjacent wrapped email domain lines like stephan.agerman@slvventure. + com, without changing link/phone parsing.
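
A sketch of the wrapped-email stitching idea, using the `stephan.agerman@slvventure.` + `com` example from the commit message. The function name, the regexes, and the rule that the tail must be a bare alphabetic fragment are all assumptions for illustration.

```typescript
// Head: an address whose domain ends at a dangling dot. Tail: a bare TLD-like fragment.
const WRAPPED_EMAIL_HEAD = /^[^\s@]+@[^\s@]+\.$/;
const WRAPPED_EMAIL_TAIL = /^[a-z]{2,24}$/i;

// Hypothetical helper: joins a head line with the tail line that follows it.
function stitchWrappedEmails(lines: string[]): string[] {
  const out: string[] = [];
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    const next = lines[i + 1]?.trim();
    if (next !== undefined && WRAPPED_EMAIL_HEAD.test(line) && WRAPPED_EMAIL_TAIL.test(next)) {
      out.push(line + next);
      i++; // consume the tail so it is not parsed again as its own line
    } else {
      out.push(line);
    }
  }
  return out;
}
```

Requiring the dangling dot on the head line keeps ordinary lines that merely follow an email from being glued onto it.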

precompute lineTexts once in createCanonicalHeaderLineTypes, then pass it through candidate creation and canonicalHeaderLocationLine.
Replace repeated nextContentLine slicing in descriptionMentionsOrganization with index iteration while preserving current behavior.
Reorder cheap checks before expensive helper calls in canStartCanonicalExperienceHeader and short-circuit title detection before organization-boundary analysis.
…it, stitch wrapped email lines while excluding those same fragments from contact-link parsing, and added the requested intent comment.

Changed src/utils/location-classifier.ts so comma-region evidence ignores ambiguous codes like IN, ME, and OR.
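
A sketch of what ignoring ambiguous codes in comma-region evidence might look like. The code sets below are small illustrative subsets and `hasCommaRegionEvidence` is a hypothetical name; the real data and scoring in src/utils/location-classifier.ts are not shown here.

```typescript
// Illustrative subsets only.
const REGION_CODES = new Set(["CA", "NY", "TX", "NC", "IN", "ME", "OR"]);
const AMBIGUOUS_CODES = new Set(["IN", "ME", "OR"]); // also common English words

function hasCommaRegionEvidence(line: string): boolean {
  const segments = line.split(",").map((segment) => segment.trim());
  if (segments.length < 2) return false;
  const last = segments[segments.length - 1];
  // Case-sensitive on purpose: region codes are uppercase in this sketch.
  return REGION_CODES.has(last) && !AMBIGUOUS_CODES.has(last);
}
```

Under this rule a line such as `Portland, OR` no longer qualifies on the comma signal alone and would need other evidence, such as a known place name.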
@hbmartin hbmartin closed this May 27, 2026
@hbmartin hbmartin deleted the wrapped-email-fix-followup branch May 27, 2026 15:34