Bibliography sections flattened from vertical bulleted list to horizontal paragraph #8

@sentientsergio

Reproducer

Tested against tomd master at commit ad567e3.

Reproduce

tomd p3984r0.pdf --outdir out/
sed -n '/^### 6\. References/,/^### 7/p' out/p3984r0.md | head -20

The References section is rendered as a single line with • separators rather than as a vertical list.

Symptom

Bibliography entries that should be on individual lines are joined onto a single line. Excerpt:

• [BS21] B. Stroustrup: [Type-and-resource safety in modern C++](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2410r0.pdf) • [BS22b] B. Stroustrup and G. Dos Reis: [Design Alternatives for Type-and-Resource Safe C++](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2687r0.pdf). WG21 P2687R0. 2022. • [BS22c] B. Stroustrup: A Tour of C++ (3rd Edition). Addison-Wesley 2022. ISBN 978-0136816485. • [BS23] B. Stroustrup and G. Dos Reis: [Safety Profiles: Type-and-resource Safe](url) [programming in ISO Standard C++](url). P2816R0. 2023-02-16 • [BS23a] B. Stroustrup and G. Dos Reis: [Safety Profiles: Type-and-resource Safe](url) [programming in ISO Standard C++](url). P2816R0. 2023-02-16. • ...

Expected

Each bibliography entry on its own line, prefixed with - (or • if preserving the original bullet character). docling on the same PDF produces:

- [BS21] B. Stroustrup:  Type-and-resource safety in modern C++
- [BS22b] B. Stroustrup and G. Dos Reis: Design Alternatives for Type-and-Resource Safe C++. WG21 P2687R0. 2022.
- [BS22c] B. Stroustrup: A Tour of C++ (3rd Edition). Addison-Wesley 2022. ISBN 9780136816485.
- [BS23] B. Stroustrup and G. Dos Reis: Safety Profiles: Type-and-resource Safe programming in ISO Standard C++. P2816R0. 2023-02-16
- [BS23a] B. Stroustrup and G. Dos Reis: Safety Profiles: Type-and-resource Safe programming in ISO Standard C++. P2816R0. 2023-02-16.
- [BS23b] B. Stroustrup: Concrete suggestions for initial Profiles. P3038R0. 2023-12-16.
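As a stopgap until the classifier is fixed, the flattened line can be split back into one entry per line in post-processing. The sketch below is not part of tomd; `split_flattened_bibliography` is a hypothetical helper, and it assumes every entry begins with `• [KEY]` where the key contains no spaces, which holds for the excerpt above but may not hold for other papers.

```python
import re

# Hypothetical helper, not part of tomd.
# Assumption: each entry starts with "• [KEY] " and KEY contains no whitespace.
# The lookahead keeps the "[KEY]" text in the entry and only consumes the bullet.
ENTRY_START = re.compile(r"\s*•\s*(?=\[[^\]\s]+\]\s)")


def split_flattened_bibliography(line: str) -> list[str]:
    """Split one flattened bibliography line into Markdown list items."""
    parts = ENTRY_START.split(line)
    return [f"- {part.strip()}" for part in parts if part.strip()]


flattened = (
    "• [BS22c] B. Stroustrup: A Tour of C++ (3rd Edition). Addison-Wesley 2022. "
    "• [BS23b] B. Stroustrup: Concrete suggestions for initial Profiles. "
    "P3038R0. 2023-12-16."
)

for item in split_flattened_bibliography(flattened):
    print(item)
```

Because the split only fires on a bullet followed by a bracketed key, Markdown links such as `[title](url)` inside an entry are left intact.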

Impact

  • Duplicate-entry detection becomes structurally harder when entries cannot be compared line by line. p3984r0 has real, reviewer-confirmed bibliography findings: PeterTurcan flagged [BS23] and [BS23a] as identical entries, and two [BS24] entries that point at different works. The discovery LLM missed both when scanning the flattened-paragraph version produced by tomd.
  • Any downstream tool that walks bibliography entries by line will fail.
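To illustrate why one entry per line matters, here is a minimal sketch of the kind of line-based check a downstream tool might run. It is an assumption about such a tool, not code from tomd or from the discovery pipeline; `find_bibliography_issues` is a hypothetical name, and the `[X1]` entries are placeholders standing in for a reused key, not entries from the paper.

```python
import re
from collections import defaultdict

# Matches "- [KEY] rest of entry" as produced by a one-entry-per-line list.
ENTRY_RE = re.compile(r"^-\s*\[([^\]]+)\]\s*(.*)$")


def find_bibliography_issues(lines):
    """Return (groups of different keys with the same body, keys reused for different bodies)."""
    keys_by_body = defaultdict(list)
    bodies_by_key = defaultdict(set)
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if not match:
            continue  # a flattened paragraph would fall through here
        key, body = match.group(1), match.group(2)
        # Normalize whitespace, case, and a trailing period before comparing.
        normalized = " ".join(body.split()).rstrip(".").lower()
        keys_by_body[normalized].append(key)
        bodies_by_key[key].add(normalized)
    same_body = [keys for keys in keys_by_body.values() if len(set(keys)) > 1]
    reused_keys = sorted(k for k, bodies in bodies_by_key.items() if len(bodies) > 1)
    return same_body, reused_keys


entries = [
    "- [BS23] B. Stroustrup and G. Dos Reis: Safety Profiles: Type-and-resource "
    "Safe programming in ISO Standard C++. P2816R0. 2023-02-16",
    "- [BS23a] B. Stroustrup and G. Dos Reis: Safety Profiles: Type-and-resource "
    "Safe programming in ISO Standard C++. P2816R0. 2023-02-16.",
    "- [X1] Placeholder work A.",  # hypothetical entry, not from the paper
    "- [X1] Placeholder work B.",  # hypothetical entry, not from the paper
]

same_body, reused_keys = find_bibliography_issues(entries)
print(same_body)
print(reused_keys)
```

Given the flattened output, the same check sees at most one matching line, so neither kind of finding can surface without first re-splitting the paragraph.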

Uncertainty signal

p3984r0.prompts.md is written for this paper (1KB), but it flags the Acknowledgements section on page 11 — not the References section. The bibliography flattening is not surfaced as an uncertain region.

Hypothesis on root cause

This looks like one of three symptoms of the same classifier-confidence bug; see the two companion issues filed alongside this one (bullets becoming deep headings, and definition prose wrapped word-by-word in backticks). All three involve over-aggressive structural classification of ambiguous content.
