An automated literature processing platform for scientific researchers.
Batch retrieve, fetch, parse, and structure papers from PubMed, arXiv, bioRxiv, and DOI-based sources.
From paper retrieval to knowledge internalization, automate the heavy lifting and keep the judgment human.
If this project helps you, please consider giving it a Star ⭐. Thank you!
- pyPaperFlow
This tool focuses on information extraction and knowledge discovery stages, enabling researchers to efficiently complete the entire workflow from literature retrieval to knowledge internalization through a 7-stage automated process.
Core Objectives
- Rapid Domain Entry: Batch retrieve and access all available literature in a specific field
- Batch Knowledge Extraction: Utilize AI long-text processing capabilities to extract structured knowledge from massive amounts of text
- Research Trend Tracking: Quickly grasp the latest research methods, conclusions, and core papers in a field
Positioning
This tool is designed to complement rather than replace reference management software like Zotero. We focus on the two key steps of "Information Extraction" and "Knowledge Discovery" to build a structured knowledge base for you, laying the foundation for subsequent semantic search, content analysis, and review generation.
- Automated Retrieval from Multiple Sources: Automatically search and retrieve paper metadata and full-text records from PubMed/Medline, arXiv, medRxiv, chemRxiv and bioRxiv. The repository focuses primarily on biomedical research and computational interdisciplinary fields (Biomedicine + Computational Biology).
- Full-Text Access: Enable automatic downloading of open-access full texts in XML/Text format from PMC. For preprints and other publications without accessible PMC full texts, alternative acquisition modules are integrated to fetch original PDFs, with Sci-Hub set as the fallback provider.
- Structured Storage:
  - Metadata: Preserved in well-structured detailed JSON files.
  - Full Text: Stored in multiple formats including parsed JSON and Markdown for versatile downstream usage: JSON for programmatic data analysis, and Markdown optimized for LLM comprehension and processing.
- Standardized Structured Parsing: All papers are parsed and organized into standardized JSON schemas. The schema classifies content into metadata fields (title, year, authors) and canonical academic sections (abstract, introduction, results, discussion, methods, conclusion, supplementary, availability, funding, acknowledgements, author contributions, references, other). Custom section parsing is supported, allowing users to apply self-defined JSON schemas for semantic parsing of papers with special formatting structures. Dedicated modules are provided to extract designated sections from bulk topic-related papers and assemble them into source-verified Markdown literature corpora, facilitating subsequent literature investigation and systematic review writing.
- LLM & Agent Empowerment: Integrate LLM skills and intelligent Agent capabilities to streamline the entire workflow of literature investigation and in-depth reading.
- CLI Tool: Provide a user-friendly command-line interface, paperflow, that supports all core operations out of the box.
See Design.md for more details about our design philosophy.
The project is designed around a 7-stage workflow:
flowchart TD
A[Retrieval &<br>Collection] --> B[Processing &<br>Parsing]
B --> C[Structured<br>Extraction]
C --> D[Deep Encoding &<br>Vectorization]
D --> E[Dynamic Knowledge<br>Base Storage]
E --> F[Intelligent Interaction &<br>Discovery]
F --> G[Final Output &<br>Internalization]
style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#ffebee
style F fill:#f1f8e9
subgraph A [Stage 1: Highly Automatable]
direction LR
A1[Requirement Analysis] --> A2[Platform Search]
A2 --> A3[Initial Screening]
end
subgraph B [Stage 2: Highly Automatable]
direction LR
B1[Batch Download] --> B2[Format Parsing<br>PDF/HTML/XML]
B2 --> B3[Text Preprocessing]
end
subgraph C [Stage 3: Human-AI Collaboration Core]
direction LR
C1[Metadata Extraction] --> C2[Core Content Extraction<br>Abstract/Methods/Conclusion]
C2 --> C3[Relation & Viewpoint Extraction]
end
subgraph D [Stage 4: Fully Automatable]
direction LR
D1[Text Slicing] --> D2[Vector Embedding]
end
subgraph E [Stage 5: Fully Automatable]
direction LR
E1[Database Storage] --> E2[Vector Indexing]
end
subgraph F [Stage 6: Human-AI Collaboration Core]
direction LR
F1[Semantic Search] --> F2[Association Rec.] --> F3[Knowledge Graph Analysis] --> F4[Review & QA]
end
subgraph G [Stage 7: Human-Led]
direction LR
G1[Critical Reading] --> G2[Inspiration Generation] --> G3[Exp. Design &<br>Paper Writing]
end
# 1. install our tool
## ✏️1️⃣ option1: Install via pip (Recommended)
pip install pyPaperFlow
## ✏️2️⃣ option2: Install from source
git clone https://github.com/MaybeBio/pyPaperFlow.git
cd pyPaperFlow
pip install -e .
--------------------------------------------------------
# 2. install MinerU
# follow the official installation guide: https://github.com/opendatalab/MinerU
# verify installation: mineru --help
pip install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple
pip install uv -i https://mirrors.aliyun.com/pypi/simple
uv pip install -U "mineru[all]" -i https://mirrors.aliyun.com/pypi/simple
--------------------------------------------------------
# 3. install AI backend
pip install openai anthropic
--------------------------------------------------------
# 4. install paperscraper backend
# follow the official installation guide: https://github.com/jannisborn/paperscraper
pip install paperscraper
⚠️ For typical usage, you only need pyPaperFlow itself and MinerU, which are steps 1 and 2.
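After installing, it can be useful to confirm that the two executables named above are actually on your PATH. The following is a minimal sketch, not part of pyPaperFlow: `check_tools` is a hypothetical helper, and it only checks that the `paperflow` and `mineru` commands can be found, not that they run correctly.

```python
import shutil


def check_tools(names):
    """Return a mapping of executable name -> whether it is found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


if __name__ == "__main__":
    # "paperflow" and "mineru" are the command names used in this README.
    for tool, found in check_tools(["paperflow", "mineru"]).items():
        print(f"{tool}: {'found' if found else 'missing'}")
```

If either tool is reported missing, re-run the corresponding installation step in the same environment.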
We designed pyPaperFlow as a versatile academic research tool built strictly around the real‑world workflow of researchers conducting literature investigation, paper reading, literature comprehension and analysis, and corpus utilization.
Therefore, please follow our step‑by‑step operations, which mirror your full literature research process. Through this hands‑on experience, you will fully grasp the design philosophy and usage of this tool.
The platform provides a CLI tool named paperflow.
Currently available modules (the list is updated continuously):
❯ paperflow --help
Usage: paperflow [OPTIONS] COMMAND [ARGS]...
pyPaperFlow CLI
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the installation. │
│ --help Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ pubmed-search Search PubMed using Your customized query and return PMIDs. │
│ pubmed-meta Fetch paper metadata from PubMed using Your customized query, pmid list file and save to storage. │
│ pubmed-content Download full text (PMC) for given PMIDs if the paper has a PMC ID. │
│ pubmed-all Fetch BOTH metadata and full text (if available) for papers. │
│ Also extracts URLs from full text and updates metadata links. │
│ pubmed-merge-json Create a merged JSON (or JSONL) file from PubMed paper directories. │
│ pubmed-export-md Export a single Markdown view from a merged JSON file using optional YAML config. │
│ arxiv-search Search arXiv and write matching IDs to a text file. │
│ arxiv-fetch Fetch arXiv metadata and attempt to download PDFs. │
│ biorxiv-search Search bioRxiv and write matching IDs to a text file. │
│ biorxiv-fetch Fetch bioRxiv metadata and attempt to download PDFs. │
│ paper-fetch Fetch PDFs by DOI — passes through to the paper-fetch engine. │
│ pdf-parse Parse a PDF file using MinerU engine, and clean up the output directory. │
│ mineru-parse Parse mineru output content_list_v2.json into canonical sectioned JSON. │
│ mineru-export-md Export structured mineru JSON to a clean Markdown file for LLM processing. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯Classify these modules according to the workflow stages:
PubMed Modules:
- pubmed-search # search papers and return PMIDs
- pubmed-meta # fetch paper metadata from PubMed
- pubmed-content # download full text (PMC) for given PMIDs if the paper has a PMC ID
- pubmed-all # fetch BOTH metadata and full text (if available) for papers
- pubmed-merge-json # Batch merge a collection of PubMed papers of the same topic
- pubmed-export-md # export PubMed paper collections as Markdown files, supporting batch export of specific sections (🌟 e.g., batch export of introductions as your research background)
arXiv Modules:
- arxiv-search # search arXiv and return matching IDs
- arxiv-fetch # fetch arXiv metadata and attempt to download PDFs
bioRxiv Modules:
- biorxiv-search # search bioRxiv and return matching IDs
- biorxiv-fetch # fetch bioRxiv metadata and attempt to download PDFs
Third-party Modules:
- paper-fetch # fetch PDFs by DOI
- pdf-parse # parse PDF files into JSON, Markdown format using the MinerU engine
- mineru-parse # Based on your custom section configuration, re-parse the MinerU output file into a structured JSON format clustered by standard literature sections
- mineru-export-md # Based on your custom section configuration, export the structured mineru JSON to a clean Markdown file for LLM processing (🌟 e.g., batch export of introductions as your research background)
⚠️ Other preprint platforms modules are under development, please stay tuned!
The first step in a literature review is collecting and organizing the literature. When your existing knowledge is insufficient, you need to gather academic materials to get a systematic picture of the current state of research in the relevant fields.
First, the intended research topic must be defined. At the initial stage of research, you may only have scattered preliminary ideas, a few unrelated papers, rough investigation drafts, or even no prior materials at all, merely several core keywords.
In this phase, the research direction and scope are defined in a preliminary way based on all available information. Only broad research boundaries need to be determined here; there is no need to finalize the ultimate research objective in the first iteration.
Accordingly, brainstorming is required, either before or after you have collected material. This tool features dedicated built-in modules to help you organize existing ideas and information, and refine them into well-defined research directions and scopes.
Inputs:
- Research Direction: The intended research topic or problem domain
- Existing Information: Related literature, investigation drafts, keywords and other prior materials you have obtained, with attachments supported
Outputs:
- Research Scope: An explicit definition covering core topics and boundary constraints. More intuitively, it can be regarded as preliminary research questions or the overall research orientation, uniformly defined as the Starting Point of Research in this document.
- Output is mainly presented as a keyword list guiding subsequent literature retrieval, or as standardized research question statements. Constraints can be supplemented through multiple iterations according to research requirements.
Core Note:
The Starting Point of Research is not finalized once and for all. It can be continuously updated and refined through multiple iterations with newly acquired information and research progress.
You may leverage state‑of‑the‑art large language models, combined with all materials and information at hand, to repeatedly verify and refine the Starting Point of Research until it is sufficiently clear and specific, or meets the criteria to proceed to the next step of literature retrieval.
🌟 Here we provide a few brainstorming skills for literature review: Skills List
Once the starting point of research is finalized (or any intermediate brainstorming stage requiring supplementary literature review), you may proceed with paper retrieval.
This tool does not generate search queries for you. Instead, we highly recommend crafting grammatically standardized and high‑relevance queries prior to using our search module.
Our literature database primarily covers biomedical research and computational interdisciplinary fields, with core data sources as follows:
- PubMed/Medline
- arXiv
- bioRxiv, medRxiv, chemRxiv
We recommend that you proactively learn and master the search syntax of these databases, as our built‑in search module functions similarly to the search bar on official web portals.
For instance, here is a typical complex query example tailored for PubMed:
"""
(
"Intrinsically Disordered Proteins"[Mesh] OR
"Intrinsically Disordered Protein"[Title/Abstract] OR
"Intrinsically Disordered Proteins"[Title/Abstract] OR
"Intrinsically Disordered Region"[Title/Abstract] OR
"Intrinsically Disordered Regions"[Title/Abstract] OR
"Natively Unfolded Protein"[Title/Abstract] OR
"Natively Unfolded Proteins"[Title/Abstract] OR
"Unstructured Protein"[Title/Abstract] OR
"Unstructured Proteins"[Title/Abstract] OR
"IDR"[Title/Abstract] OR
"IDP"[Title/Abstract]
)
AND
(
"Protein Interaction Maps"[Mesh] OR
"Protein Interaction Maps"[Title/Abstract] OR
"Protein Interaction Networks"[Title/Abstract] OR
"Protein-Protein Interaction Map"[Title/Abstract] OR
"Protein-Protein Interaction Network"[Title/Abstract] OR
"Protein Interaction Mapping"[Mesh] OR
"Protein Interaction Mapping"[Title/Abstract] OR
"Binding Sites"[Title/Abstract] OR
"Protein Binding"[Title/Abstract] OR
"Protein Interaction Domains and Motifs"[Title/Abstract] OR
"Protein Interaction Maps"[Title/Abstract] OR
"Protein Interaction Domains and Motifs"[Mesh] OR
"Protein Interaction"[Title/Abstract] OR
"Protein-Protein Interaction"[Title/Abstract] OR
"PPI"[Title/Abstract] OR
"Interaction"[Title/Abstract] OR
"Binding"[Title/Abstract] OR
"Interface"[Title/Abstract] OR
"Complex"[Title/Abstract]
)
AND
(
"Artificial Intelligence"[Mesh] OR
"Deep Learning"[Mesh] OR
"Machine Learning"[Mesh] OR
"Neural Networks, Computer"[Mesh] OR
"Artificial Intelligence"[Title/Abstract] OR
"Deep Learning"[Title/Abstract] OR
"Machine Learning"[Title/Abstract] OR
"Neural Network"[Title/Abstract]
)
AND (
"2023/01/01"[Date - Publication] : "2026/12/31"[Date - Publication]
)
"""
Once you finish constructing your search query, you can start searching for papers. We will use the PubMed-related API as an example.
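Long queries like the one above are easier to maintain when they are assembled from term lists instead of being edited by hand. The sketch below is one possible way to do this and is not part of pyPaperFlow: `or_group` and `build_query` are hypothetical helpers that only perform string formatting, using the same `[Mesh]`, `[Title/Abstract]` and `[Date - Publication]` tags that appear in the example query.

```python
def or_group(terms_by_field):
    """Join '"term"[Field]' clauses with OR and wrap the group in parentheses."""
    clauses = [
        f'"{term}"[{field}]'
        for field, terms in terms_by_field.items()
        for term in terms
    ]
    return "(" + " OR ".join(clauses) + ")"


def build_query(groups, start, end):
    """AND together several OR-groups plus a publication date range."""
    parts = [or_group(group) for group in groups]
    parts.append(f'("{start}"[Date - Publication] : "{end}"[Date - Publication])')
    return " AND ".join(parts)


# A shortened version of the query shown above (hypothetical term selection).
query = build_query(
    [
        {"Mesh": ["Intrinsically Disordered Proteins"], "Title/Abstract": ["IDR", "IDP"]},
        {"Mesh": ["Deep Learning"], "Title/Abstract": ["Machine Learning"]},
    ],
    start="2023/01/01",
    end="2026/12/31",
)
print(query)
```

The resulting string can be passed as the QUERY argument of the search command. Keeping each concept in its own group makes it simple to add or drop synonyms between iterations.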
❯ paperflow pubmed-search --help
Usage: paperflow pubmed-search [OPTIONS] QUERY
Search PubMed using Your customized query and return PMIDs.
Notes:
- 1, This command only searches and returns PMIDs, it does not fetch paper metadata.
- 2, This command will print the found PMIDs and also save them to 'pubmed_searched_ids.txt' in the specified output
directory.
If --output-dir is not specified, it will default to the storage directory.
- 3, Note that storage_dir is used to initialize the fetcher for consistency, while output_dir is where the PMIDs are saved.
They are different parameters!
Example usage:
- 1. Search for papers related to "machine learning" and return up to 500 PMIDs/per batch:
paperflow pubmed-search "machine learning" --retmax 500 --output-dir ./MyPapers --email "YOUR_EMAIL@example.com" --api-key
"YOUR_NCBI_API_KEY"
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * query TEXT PubMed search query. [required] │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --retmax -n INTEGER Max number of PMIDs to return every batch, must less than 10000. [default: 500] │
│ * --email TEXT Entrez Email. [required] │
│ --api-key TEXT NCBI API Key (recommended). │
│ --storage-dir -s TEXT Directory in Repository-level to store paper data for Initialization. │
│ [default: ./Papers] │
│ --output-dir -o TEXT Directory in result-level to store output IDs. │
│ --max-retries INTEGER Maximum number of retries for Entrez API calls. [default: 3] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
At this stage, we recommend retrieving paper metadata (primarily abstracts) via literature search.
Literature collection is an iterative process. You can often identify target papers using only abstracts, then proceed to download the required papers in the next step. In some cases, you may still need to download all retrieved papers.
It is important to emphasize that you can re-enter the brainstorming phase at any stage. The output of each phase can serve as the input for subsequent literature research. Based on the output of this phase, you can conduct further brainstorming to refine your research starting point and define your research questions more precisely.
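Because the search step writes PMIDs to a text file and the metadata step reads a file with one PMID per line, the list is easy to screen between the two steps. The following is a minimal sketch under that assumption: `load_pmids` and `write_pmids` are hypothetical helpers, not pyPaperFlow functions, and the file name simply reuses 'pubmed_searched_ids.txt' from the help text.

```python
import tempfile
from pathlib import Path


def load_pmids(path):
    """Read one PMID per line; skip blank or non-numeric lines, drop duplicates, keep order."""
    seen = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        pmid = line.strip()
        if pmid.isdigit():
            seen.setdefault(pmid, None)
    return list(seen)


def write_pmids(pmids, path):
    """Write PMIDs back out, one per line, for use with the --file option."""
    Path(path).write_text("\n".join(pmids) + "\n", encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ids_file = Path(tmp) / "pubmed_searched_ids.txt"
        ids_file.write_text("12345\n\n67890\n12345\n", encoding="utf-8")
        pmids = load_pmids(ids_file)
        # Keep only the papers you selected after reading abstracts (hypothetical choice).
        write_pmids([p for p in pmids if p != "67890"], Path(tmp) / "pmid_list.txt")
        print(pmids)
```

The screened file can then be handed to the metadata or full-text commands, so each iteration of brainstorming narrows the list instead of restarting the search.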
❯ paperflow pubmed-meta --help
Usage: paperflow pubmed-meta [OPTIONS]
Fetch paper metadata from PubMed using Your customized query, pmid list file and save to storage.
Notes:
- 1, You must provide one of --query, or --file to specify which papers to fetch. Note that they are mutually exclusive.
- 2, -f can be used to fetch one or more PMIDs listed in a text file (one PMID per line).
Example usage:
- 1. Fetch papers for a query and save to storage:
paperflow pubmed-fetch --query "machine learning" --output-dir ./MyPapers --email "YOUR_EMAIL@example.com" --api-key "YOUR_NCBI_API_KEY"
- 2. Fetch papers from a list of PMIDs in a file:
paperflow pubmed-fetch --file ./pmid_list.txt --output-dir ./MyPapers --email "YOUR_EMAIL@example.com" --api-key "YOUR_NCBI_API_KEY"
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --query -q TEXT PubMed search query. │
│ --file -f TEXT Text file containing PMIDs (one per line), -q and -f are mutually exclusive. │
│ --batch-size -b INTEGER Batch size for fetching. [default: 50] │
│ * --email TEXT Entrez Email. [required] │
│ --api-key TEXT NCBI API Key (recommended). │
│ --storage-dir -s TEXT Directory in Repository-level to store paper data for Initialization. [default: ./Papers] │
│ --max-retries INTEGER Maximum number of retries for Entrez API calls. [default: 3] │
│ --output-dir -o TEXT Directory in result-level to store output papers, default is current directory. If not specified, will be set to root │
│ directory of the repository-level which is storage_dir. 🌟 We will create a '/pubmed' subfolder under the output │
│ directory to save all pubmed related data │
│ [default: .] │
│ --help Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Once you have confirmed your target papers, or, in the worst case, found that the metadata obtained during the search phase is insufficient for further evaluation and you need every full text, you may start downloading the papers.
Take PubMed as an example: for PubMed papers, we prioritize downloading full texts from PMC if available. If no PMC full text exists, we only retrieve PubMed metadata (mainly abstracts) and basic paper information.
Additionally, we provide a dedicated PDF‑crawling module as a fallback strategy for paper acquisition. Manual retrieval of PDF files is only recommended when all aforementioned methods fail to obtain PubMed paper data.
Output files from the PubMed database are available in two formats: JSON and Markdown. JSON is recommended for subsequent analysis, while Markdown serves as input data for Large Language Models (LLMs). Our tool generates both file formats for your selection simultaneously.
❯ paperflow pubmed-content --help
Usage: paperflow pubmed-content [OPTIONS]
Download full text (PMC) for given PMIDs if the paper has a PMC ID.
Notes:
- 1, This currently only supports PMC full text fetching if the paper has a PMC ID.
Example usage:
- 1. Download full text for PMIDs listed in a file:
paperflow download-fulltext --file ./pmid_list.txt --email "YOUR_EMAIL@example" --api-key "YOUR_NCBI_API_KEY" --output-dir ./MyPapers
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --file -f TEXT File containing PMIDs (one per line). │
│ * --email TEXT Entrez Email. [required] │
│ --api-key TEXT NCBI API Key (recommended). │
│ --storage-dir -s TEXT Directory in Repository-level to store paper data for Initialization. [default: ./Papers] │
│ --max-retries INTEGER Maximum number of retries for Entrez API calls. [default: 3] │
│ --output-dir -o TEXT Directory in result-level to store output full texts, default is current directory. If not specified, will be set to root │
│ directory of the repository-level which is storage_dir. 🌟 We will create a '/pubmed' subfolder under the output directory │
│ to save all pubmed related data │
│ [default: .] │
│ --pmid -p TEXT Single PMID to download full text for, can be repeated. │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯Alternatively, you may perform metadata retrieval and content fetching in two separate steps; we recommend handling them separately.
❯ paperflow pubmed-all --help
Usage: paperflow pubmed-all [OPTIONS]
Fetch BOTH metadata and full text (if available) for papers. Also extracts URLs from full text and updates metadata links.
Example usage:
- 1. Fetch full papers for a query:
paperflow pubmed-all --query "machine learning" --output-dir ./MyPapers --email "YOUR_EMAIL"
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --query -q TEXT PubMed search query. │
│ --file -f TEXT Text file containing PMIDs (one per line), -q and -f are mutually exclusive. │
│ --pmid -p TEXT Single PMID to download full text for, can be repeated. │
│ --batch-size -b INTEGER Batch size for fetching. [default: 50] │
│ --max-retries INTEGER Maximum number of retries for Entrez API calls. [default: 3] │
│ * --email TEXT Entrez Email. [required] │
│ --api-key TEXT NCBI API Key (recommended). │
│ --storage-dir -s TEXT Directory in Repository-level to store paper data for Initialization. [default: ./Papers] │
│ --output-dir -o TEXT Directory in result-level to store output papers. If not specified, defaults to storage-dir. │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯For PubMed papers without PMC full texts, or papers from other databases where only the DOI is available (the pubmed‑meta module guarantees DOI acquisition), you may directly download full texts by DOI (if open‑access versions exist).
❯ paperflow paper-fetch --help
usage: paper-fetch [-h] [--title TITLE] [--batch FILE] [--out DIR] [--dry-run] [--format {json,text}] [--pretty] [--stream] [--overwrite]
[--idempotency-key KEY] [--timeout SECONDS] [--version]
[doi]
Fetch legal open-access PDFs by DOI via Unpaywall, Semantic Scholar, arXiv, PMC, and bioRxiv/medRxiv.
positional arguments:
doi DOI to fetch (e.g. 10.1038/s41586-020-2649-2). Use '-' to read from stdin.
options:
-h, --help show this help message and exit
--title TITLE paper title; resolved to a DOI via Crossref before download. Mutually exclusive with positional DOI / --batch.
--batch FILE file with one DOI per line for bulk download. Use '-' to read from stdin.
--out DIR output directory (default: pdfs)
--dry-run resolve sources without downloading; preview the PDF URL and filename
--format {json,text} output format. json for agents, text for humans. Default: json when stdout is not a TTY, text otherwise.
--pretty pretty-print JSON output (2-space indent)
--stream emit one NDJSON result per line on stdout as each DOI resolves (batch mode)
--overwrite re-download even if the destination file already exists
--idempotency-key KEY
safe-retry key; re-running with the same key replays the original envelope from <out>/.paper-fetch-idem/
--timeout SECONDS HTTP timeout in seconds per request (default: 30)
--version show program's version number and exit
exit codes:
0 all DOIs resolved successfully
1 unresolved (some DOIs had no OA copy; no transport failure)
3 validation error (bad arguments)
4 transport error (network / download / IO failure; retryable class)
subcommands:
schema print the machine-readable CLI schema and exit (no network)
stdin:
paper-fetch - read a single DOI from stdin
paper-fetch --batch - read DOIs line-by-line from stdin
output:
stdout emits one JSON object per invocation (NDJSON with --stream).
stderr emits NDJSON progress events when --format json, prose when --format text.
stdout format auto-detects TTY: json when piped/captured, text in a terminal.
examples:
paper-fetch 10.1038/s41586-020-2649-2
paper-fetch 10.1038/s41586-020-2649-2 --dry-run
paper-fetch --batch dois.txt --out ./papers --format text
echo 10.1038/s41586-020-2649-2 | paper-fetch --batch -
paper-fetch schema
We acknowledge the work of paper-fetch! We have modified, refactored, and encapsulated one of its core scripts for tailored integration into our pipeline.
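The exit codes listed in the help text make it possible to script batch downloads and decide when a retry is worthwhile. The sketch below is illustrative only: `describe`, `should_retry` and `fetch_batch` are hypothetical helpers, and it assumes that `paperflow paper-fetch` passes the engine's exit code through unchanged, which the help text does not state explicitly.

```python
import subprocess

# Exit codes as documented in the paper-fetch help text above.
PAPER_FETCH_EXIT_CODES = {
    0: "all DOIs resolved successfully",
    1: "unresolved: some DOIs had no OA copy",
    3: "validation error: bad arguments",
    4: "transport error: network / download / IO failure",
}


def describe(exit_code):
    """Translate an exit code into the documented meaning."""
    return PAPER_FETCH_EXIT_CODES.get(exit_code, f"undocumented exit code {exit_code}")


def should_retry(exit_code):
    """Only transport errors (code 4) are described as retryable."""
    return exit_code == 4


def fetch_batch(batch_file, out_dir):
    """Run the documented batch mode and return the process exit code.

    Assumption: the paperflow wrapper propagates the engine's exit code.
    """
    result = subprocess.run(
        ["paperflow", "paper-fetch", "--batch", str(batch_file), "--out", str(out_dir)],
        check=False,
    )
    return result.returncode


if __name__ == "__main__":
    for code in (0, 1, 3, 4):
        print(code, describe(code), "(retry)" if should_retry(code) else "")
```

A wrapper script could call `fetch_batch("dois.txt", "./papers")` and re-run only when `should_retry` is true, treating code 1 as a signal to look for the paper through other channels.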
The workflow of our paper acquisition module is outlined below:
┌─────────────────────────────────────────┐
│ Input: DOI / Paper Title / Batch File │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ Title‑based Resolution? → Crossref → Semantic Scholar
│ (Resolves to DOI with confidence score) │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 1. Unpaywall (requires UNPAYWALL_EMAIL) │
│ → Fastest open‑access (OA) links with metadata
└─────────────────────────────────────────┘
Failure / Skip ↓
┌─────────────────────────────────────────┐
│ 2. Semantic Scholar │
│ → PDF URLs + external identifiers (arXiv / PMCID)
└─────────────────────────────────────────┘
Failure ↓
┌─────────────────────────────────────────┐
│ 3. arXiv (via S2 externalIds.ArXiv) │
│ 4. Europe PMC → PMC (via PMCID) │
│ 5. bioRxiv / medRxiv (DOI prefix: 10.1101/)
└─────────────────────────────────────────┘
Total Failure ↓
┌─────────────────────────────────────────┐
│ 6. Publisher Direct Links (Institutional Mode Only)
│ Nature / Science / Elsevier / Springer, etc.
│ Requires institutional IP / subscription / EZproxy authorization
└─────────────────────────────────────────┘
Persistent Failure ↓
┌─────────────────────────────────────────┐
│ 7. Sci‑Hub Mirror Fallback (enabled by default, configurable)
│ → 1 request‑per‑second rate‑limiting to prevent CAPTCHA triggers
│ → Automatic discovery of active new mirrors
└─────────────────────────────────────────┘
Resolution Priority Sequence
1. Unpaywall: The optimal open-access source covering the broadest range of publishers with the highest hit rate.
2. Semantic Scholar: Retrieves OA PDF links and cross-platform external identifiers.
3. arXiv: Activated when an arXiv identifier is available for the target paper.
4. PubMed Central (PMC) OA Subset: Activated when a PMCID is associated with the paper.
5. bioRxiv / medRxiv: Triggered for preprints with the DOI prefix 10.1101/.
6. Publisher Direct Links: Enabled only under institutional mode (PAPER_FETCH_INSTITUTIONAL=1), authorized via the caller's institutional subscription IP, cookies, or EZproxy access.
7. Sci-Hub Mirror Fallback: Enabled by default as the final retrieval backup.
   - Mirrors are attempted in the order specified by the environment variable PAPER_FETCH_SCIHUB_MIRRORS (default list: sci-hub.ru, sci-hub.st, sci-hub.su, sci-hub.box, sci-hub.red, sci-hub.al, sci-hub.mk, sci-hub.ee).
   - If all predefined mirrors fail, the module fetches the latest live mirror list from https://www.sci-hub.pub/ and retries.
   - Set PAPER_FETCH_NO_SCIHUB=1 to disable Sci-Hub retrieval.
If all sources fail, metadata is returned with a recommendation for interlibrary loan (ILL) acquisition.
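To make the effect of the environment variables concrete, the following sketch lists which sources would be candidates for a given DOI and configuration. It is a simplified model of the sequence described above, not the module's actual code: `planned_sources` and the source labels are hypothetical, and skipping Unpaywall when UNPAYWALL_EMAIL is unset is an assumption based on the "Failure / Skip" step in the diagram.

```python
import os


def planned_sources(doi, env=None):
    """Return the ordered list of candidate sources for one DOI (illustrative only)."""
    env = os.environ if env is None else env
    sources = []
    # Assumption: Unpaywall is skipped when no contact email is configured.
    if env.get("UNPAYWALL_EMAIL"):
        sources.append("unpaywall")
    sources.append("semantic_scholar")
    # arXiv and PMC are only usable if an identifier is found for the paper.
    sources.extend(["arxiv", "pmc"])
    # bioRxiv / medRxiv preprints share the DOI prefix 10.1101/.
    if doi.startswith("10.1101/"):
        sources.append("biorxiv_medrxiv")
    # Publisher links are tried only in institutional mode.
    if env.get("PAPER_FETCH_INSTITUTIONAL") == "1":
        sources.append("publisher_direct")
    # The final fallback is on by default and can be switched off.
    if env.get("PAPER_FETCH_NO_SCIHUB") != "1":
        sources.append("scihub")
    return sources


if __name__ == "__main__":
    config = {"UNPAYWALL_EMAIL": "you@example.com", "PAPER_FETCH_NO_SCIHUB": "1"}
    print(planned_sources("10.1101/2020.01.01.123456", env=config))
```

Passing the configuration as a dictionary keeps the function easy to test; with `env=None` it reads the real process environment instead.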
⚠️ Prior to using the paper‑fetch module, configure your Unpaywall contact email via environment variable:
export UNPAYWALL_EMAIL=you@example.com
Unlike PubMed papers with PMC full text, non-PubMed papers can only be obtained as PDF files via the paper-fetch module.
We recommend standardizing all paper information into Markdown or JSON formats.
Given subsequent requirements for paragraph segmentation and information extraction, JSON is the most suitable intermediate format for programmatic processing.
We provide a pdf-parse module that parses input PDFs into preliminary Markdown and JSON files using MinerU.
Refer to the official MinerU documentation for details. Since typical users lack sufficient GPU resources for acceleration, we use the basic parsing mode (pipeline backend) by default.
❯ paperflow pdf-parse --help
Usage: paperflow pdf-parse [OPTIONS]
Parse a PDF file using MinerU engine, and clean up the output directory.
Notes:
- 1, MinerU generates a subfolder /auto under --output with .md, .json, .pdf, and images/. Use --clear to strip anything
unnecessary,
note that we only use .md files and _content_list_v2.json/_content_list.json files for further processing like structuring.
- 2, ⚠️ Remember to switch to domestic mirror source when you can not access huggingface.
Example usage:
paperflow pdf-parse -i paper.pdf -o ./output
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --input -i TEXT Input PDF file path. [required] │
│ * --output -o TEXT Output directory for parsed output. [required] │
│ --clear After conversion, keep only the .md files and necessary .json │
│ files(_content_list_v2.json/_content_list.json). │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
🌟 Regarding the PDF paper retrieval module, we also provide a suite of reference scripts, which can be integrated into existing skills or implemented independently: Paper pdf fetch
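After parsing, only the Markdown file and the content-list JSON are needed for later structuring, as the pdf-parse notes explain. The sketch below shows one way to locate them under the output directory. `find_mineru_outputs` is a hypothetical helper, and preferring `_content_list_v2.json` over `_content_list.json` is an assumption based on mineru-parse being described as reading content_list_v2.json.

```python
import tempfile
from pathlib import Path


def find_mineru_outputs(output_dir):
    """Return (markdown_files, content_list_file) found anywhere under output_dir."""
    root = Path(output_dir)
    markdown = sorted(root.rglob("*.md"))
    v2 = sorted(root.rglob("*_content_list_v2.json"))
    v1 = sorted(root.rglob("*_content_list.json"))
    # Prefer the v2 content list when both variants exist (assumption).
    content_list = (v2 or v1 or [None])[0]
    return markdown, content_list


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Mimic the 'auto' subfolder mentioned in the pdf-parse notes.
        auto = Path(tmp) / "auto"
        auto.mkdir()
        for name in ("paper.md", "paper_content_list.json", "paper_content_list_v2.json"):
            (auto / name).write_text("", encoding="utf-8")
        markdown, content_list = find_mineru_outputs(tmp)
        print([p.name for p in markdown], content_list.name)
```

Searching recursively avoids hard-coding the exact folder layout, which may differ between MinerU versions.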
In the preceding stage, we acquired metadata and textual content of academic papers:
- For PubMed papers: Metadata is retrieved, and full‑text content is downloaded from the PubMed Central (PMC) database (when available), then parsed into Markdown and JSON formats.
- For non‑PubMed papers: PDF files are obtained via Digital Object Identifiers (DOIs) and parsed using the MinerU parsing engine, with outputs standardised to Markdown and JSON formats.
The generated Markdown files from both sources serve as viable full‑text alternatives for direct literature reading, yet they are not amenable to chapter‑level extraction and standardised processing.
By contrast, JSON files retain raw parsed outputs with intricate structures, containing comprehensive textual content and positional metadata, but lack standardisation for direct downstream utilisation.
This stage processes raw JSON files by parsing and classifying textual segments to produce standardised, chapter‑organised JSON outputs.
Specifically, content is extracted and partitioned into canonical academic sections as listed below (with minor configurable variations in section delineation):
metadata(title,year,authors)
abstract
introduction
results
discussion
methods
conclusion
supplementary
availability
funding
acknowledgements
author contributions
references
other
Our objective is to segment every paper into a fixed set of canonical sections that matches both the internal structural conventions of individual publications and the core downstream analytical demands of researchers. This standardised partitioning lets scholars review and reuse literature knowledge within a consistent cognitive framework.
For PubMed papers, textual data is sourced from the PMC database; accordingly, our parsing workflow commences with JSON outputs generated from PMC parsing responses.
To preserve complete data provenance (not all PubMed papers have PMC full‑text access), we implement two modular components for structured extraction and representation of PubMed literature.
First, metadata and textual content (where PMC full‑text exists) are merged to generate a single JSON file encapsulating complete paper information:
❯ paperflow pubmed-merge-json --help
Usage: paperflow pubmed-merge-json [OPTIONS]
Create a merged JSON (or JSONL) file from PubMed paper directories.
This produces a canonical merged JSON representation per paper and is
intended as the first stage in a two-stage pipeline (merge-json -> export-md).
Example usage:
- 1. Merge JSON files for all papers in a directory:
paperflow pubmed-merge-json --input ./MyPapers --output ./MyPapers
- 2. Merge JSON files for PMIDs listed in a file:
paperflow pubmed-merge-json --input ./MyPapers --output ./MyPapers --pmid-file ./pmid_list.txt --jsonl
--stats-path ./MyPapers/stats
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --input -i TEXT Directory containing paper data │
│ ({INPUT_PAPER_DIR_HERE}/pubmed/year/pmid/structure). │
│ [required] │
│ * --output -o TEXT Output directory or file path. If a directory or path without extension is given, │
│ the merged file is auto-named as │
│ <input-directory-base-name>_<datetime>.json/.jsonl. │
│ [required] │
│ --pmid-file -p TEXT File containing PMIDs to merge (one per line). │
│ --jsonl Write output as JSONL, one JSON per line. │
│ --stats-path -s TEXT Optional path to save merge statistics file, defaults to current directory. │
│ [default: .] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Designed primarily for batch‑processing workflows to enable bulk content extraction from standardised section schemas, this module also supports single‑paper processing via file‑level specification.
By default, independent merging is executed for all PubMed papers within the input directory, and JSON files corresponding to specified PMID inventories are further aggregated into a single consolidated JSON file. This workflow is particularly suited for compiling papers on a unified research topic to construct preliminary literature knowledge bases.
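As a rough illustration of how a merged JSONL file could be consumed downstream, the sketch below reads one JSON record per line and keeps only the requested PMIDs. The helper names (`iter_merged_records`, `filter_by_pmid`) and the sample records are hypothetical; the `identity.pmid` key layout is assumed from the metadata schema described in this README, not taken from the tool's source.

```python
import json
from typing import Iterable, Iterator


def iter_merged_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one paper record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_pmid(records: Iterable[dict], pmids: set) -> list:
    """Keep records whose identity.pmid (assumed key layout) is requested."""
    return [
        record
        for record in records
        if str(record.get("identity", {}).get("pmid")) in pmids
    ]


# Two hypothetical records standing in for the lines of a merged .jsonl file.
sample_lines = [
    json.dumps({"identity": {"pmid": "111", "title": "Paper A"}}),
    json.dumps({"identity": {"pmid": "222", "title": "Paper B"}}),
]
selected = filter_by_pmid(iter_merged_records(sample_lines), {"222"})
titles = [record["identity"]["title"] for record in selected]
print(titles)
```

With a real file, the same functions could be fed an open file handle instead of `sample_lines`, since a text file iterates line by line.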
This aggregated JSON file serves as the input for subsequent structured classification and extraction:
❯ paperflow pubmed-export-md --help
Usage: paperflow pubmed-export-md [OPTIONS]
Export a single Markdown view from a merged JSON file using optional YAML config.
Notes:
- 1, The input merged JSON/JSONL should be produced by the pubmed-merge-json command, which
creates a canonical representation of paper metadata and content.
- 2, The optional YAML config can specify which metadata fields and content sections to include
in the Markdown output. If not provided, it defaults to including basic metadata and the FULL
content.
Example usage:
- 1. Export Markdown for all papers in a merged JSON:
paperflow pubmed-export-md --input ./MyPapers/merged.jsonl --output ./MyPapers/exported.md
--config ./config.yaml
- 2. Export Markdown for PMIDs listed in a file:
paperflow pubmed-export-md --input ./MyPapers/merged.jsonl --output ./MyPapers/exported.md
--config ./config.yaml --pmid-file ./pmid_list.txt
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────╮
│ * --input -i TEXT Path to merged JSON or JSONL produced by pubmed-merge-json. │
│ [required] │
│ * --output -o TEXT Output Markdown file path. [required] │
│ --config -c TEXT YAML config file specifying metadata_fields and │
│ content_sections. If not provided, defaults to basic metadata │
│ and FULL content. │
│ --pmid-file -p TEXT Optional PMID file to filter exported papers. │
│ --help Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────────────────────────╯
Metadata key‑value pairs for each paper follow a fixed schema:
content
abstract # abstract text, 🌟 important
keywords # keywords, 🌟 important
mesh_terms # mesh terms, 🌟 important
pub_types # article or review, can be used for filtering, 🌟 important
contributors
medline # contributors parsed from medline format, MIXED PERSONS PER DICT, LESS DETAILED
affiliations # affiliations of contributors
auids # ORCID
full_names # full names of contributors
short_names # short names of contributors, 🌟 important for citation
xml # contributors parsed from xml format, ONE PERSON PER DICT, MORE DETAILED
affiliations # same as above
full_name
identifiers
short_name
identity
doi # DOI of the paper, 🌟 important, can be used for DOI-based fetching module
pmid # PubMed ID, 🌟 important
title # title of the paper, 🌟 important
links
cites # cite this paper, 🌟 important
entrez # other entrez links
external # other external database links, ONE LINK PER DICT, MORE DETAILED (⚠️ there may be Full text source)
attribute
category
linkname
provider
url # URL of the external database link, 🌟 important
pmc # PMC ID used to download full text, 🌟 important
refs # (pmid) cited by this paper, 🌟 important
review # (pmid) All review articles highly relevant to the theme of this paper , 🌟 important
similar # (pmid) topic-similar papers, 🌟 important
text_mined # links mined from PMC full text(if available), 🌟 important (there may be github links or other sources)
metadata
entrez_date # date when the paper was added to PubMed
fetched_at # date when the paper was fetched by our tool
source
journal_abbrev # abbreviation of the journal
journal_title # full name of the journal
pub_date # publication date
pub_types # publication types, similar to pub_types in content above
pub_year # publication year, 🌟 important for citation
Semantic segmentation and classification are applied exclusively to textual content.
Within the batch‑export module pubmed-export-md, the -c parameter accepts a YAML configuration file for section extraction (pubmed export yaml), enabling bulk extraction of designated sections, for instance batch retrieval of introduction sections for background research.
⚠️ Keys within this YAML file are fixed; users may only comment out specific keys to extract targeted sections, or retain default settings to export all sections.
metadata_fields:
- identity.title
- identity.pmid
- identity.doi
- content.keywords
- content.mesh_terms
- content.pub_types
- content.abstract # abstract in metadata first, fall back in content sections(deprecated)
- contributors.medline
- contributors.xml
- links.cites
- links.entrez
- links.external
- links.pmc
- links.refs
- links.review
- links.similar
- links.text_mined
- metadata.entrez_date
- metadata.fetched_at
- source.journal_abbrev
- source.journal_title
- source.pub_date
- source.pub_types
- source.pub_year
content_sections:
- abstract
- introduction
- methods
- results
- discussion
- conclusion
- supplementary
- availability
- funding
- acknowledgements
- author_contributions
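The dotted keys under metadata_fields read naturally as paths into the nested metadata schema. The sketch below shows one way such a path could be resolved against a record; `resolve_field` and the sample record are hypothetical and only illustrate the idea, they are not the tool's actual implementation.

```python
from typing import Any


def resolve_field(record: dict, dotted_key: str) -> Any:
    """Walk a nested dict along a dotted path such as 'identity.title'.

    Returns None as soon as any step of the path is missing.
    """
    node: Any = record
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


# Hypothetical record following the schema layout described above.
record = {
    "identity": {"title": "Example paper", "pmid": "12345"},
    "source": {"pub_year": "2024"},
}
metadata_fields = ["identity.title", "identity.pmid", "links.pmc"]
selected = {key: resolve_field(record, key) for key in metadata_fields}
print(selected)
```

Returning None for a missing path keeps the export tolerant of papers without, for example, a PMC link.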
The core parsing logic is illustrated below:
flowchart TD
A[Initiate Markdown Export] --> B{YAML Config Provided?}
B -- Yes --> C[Load yaml_cfg]
C --> D[Parse metadata_fields / content_sections]
D --> E[Write paper‑level title and metadata]
E --> F[Extract section tree from content.body]
F --> G[_extract_section_records: raw sections → structured records]
G --> H[_normalize_section_title: map to canonical_type]
H --> I[_order_section_records: sort per content_sections]
I --> J[_aggregate_section_records: merge identical canonical_type entries]
J --> K{canonical_type in content_sections?}
K -- No --> L[Skip section]
K -- Yes --> M[_render_section_records: format as Markdown headings]
M --> N[Insert paper separator]
L --> N
B -- No --> O[Omit section mapping]
O --> P[Write paper‑level title and metadata]
P --> Q{content.body Exists?}
Q -- Yes --> R[Recursively expand raw section tree]
R --> S[render_raw_content_tree: output title/content/subsections directly]
Q -- No --> T[Supplement abstract from metadata]
T --> U[Output metadata fields + abstract]
S --> N
U --> N
N --> V[Process Next Paper]
V --> W[Terminate Export]
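The config-driven branch of the flowchart can be approximated in a few lines. The sketch below is a simplification under stated assumptions: it merges records that share a canonical type, emits them in the order given by content_sections, skips types that are not listed, and renders second-level Markdown headings. The function names, the `(canonical_type, text)` record shape, and the heading level are illustrative choices, not the project's internal helpers.

```python
def aggregate_sections(records, content_sections):
    """Merge records sharing a canonical type, then order and filter them.

    records: list of (canonical_type, text) pairs in document order.
    content_sections: canonical types to keep, in output order.
    """
    merged = {}
    for canonical_type, text in records:
        merged.setdefault(canonical_type, []).append(text)
    return [
        (name, "\n\n".join(merged[name]))
        for name in content_sections
        if name in merged
    ]


def render_markdown(sections):
    """Render each aggregated section under a Markdown heading."""
    return "\n\n".join(
        f"## {name.replace('_', ' ').title()}\n\n{text}" for name, text in sections
    )


# Hypothetical records: two 'methods' fragments and one unselected type.
records = [
    ("methods", "M1"),
    ("introduction", "I1"),
    ("methods", "M2"),
    ("funding", "F1"),
]
sections = aggregate_sections(records, ["introduction", "methods"])
markdown = render_markdown(sections)
print(markdown)
```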
The above workflow describes structured extraction for PubMed papers. For non‑PubMed publications, parsing commences with the preliminary JSON output (content_list_v2.json) generated by the MinerU parsing engine.
The content_list_v2.json file generated by processing PDFs with MinerU organizes data on a page-by-page basis: an outer array represents all pages, and each element is a list of rendered blocks for that page. These blocks include diverse types such as paper titles, paragraphs, interline equations, images/charts, tables, page headers, footers, and footnotes, which are mixed together and cannot be directly used for downstream semantic analysis or LLM input.
Our goal is to convert this raw JSON into a unified, structured JSON organized by standard sections in the literature domain.
Input JSON structure:
[
[ // page 0
{"type": "title", "content": {"title_content": [...], "level": 1}},
{"type": "paragraph", "content": {"paragraph_content": [...]}},
{"type": "title", "content": {"title_content": [...], "level": 2}},
{"type": "paragraph", "content": {"paragraph_content": [...]}},
{"type": "page_header", ...}, // noise
{"type": "page_footnote", ...}, // noise
...
],
[ // page 1
...
]
]
Common block types (categorized by content values):
| Type | Is Main Content | Text Extraction Path |
|---|---|---|
| `title` | Yes (Section Anchor) | content.title_content[*].content + level (1 = Paper Title, 2 = Primary Section) |
| `paragraph` | Yes (Main Text) | content.paragraph_content[*].content, supports equation_inline sub-items |
| `equation_interline` | Yes (Interline Equation) | content.math_content (LaTeX) |
| `table` | Partial | content.html (HTML Table) + content.table_caption |
| `image` / `chart` | No (Caption Preserved) | content.image_caption[*].content / content.chart_caption |
| `page_header` / `page_footer` / `page_footnote` | Noise (Discarded) | Used for metadata scanning (year/DOI/journal name) |
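To make the extraction paths in the table concrete, here is a minimal sketch that flattens the page-by-page array, drops the noise block types, and joins the text of title and paragraph blocks. It assumes each item in title_content and paragraph_content is a dict with a string under "content", as the paths above suggest; the function names and sample data are hypothetical, and other block types are ignored for brevity.

```python
NOISE_TYPES = {"page_header", "page_footer", "page_footnote"}


def flatten_pages(pages):
    """Concatenate all pages into one block list, skipping noise blocks."""
    return [
        block
        for page in pages
        for block in page
        if block.get("type") not in NOISE_TYPES
    ]


def block_text(block):
    """Join the text items of a title or paragraph block; '' for other types."""
    content = block.get("content", {})
    if block.get("type") == "title":
        items = content.get("title_content", [])
    elif block.get("type") == "paragraph":
        items = content.get("paragraph_content", [])
    else:
        return ""
    return "".join(item.get("content", "") for item in items)


# Hypothetical single-page input mirroring the structure shown above.
pages = [
    [
        {"type": "title", "content": {"title_content": [{"content": "1. Introduction"}], "level": 2}},
        {"type": "page_header", "content": {}},
        {"type": "paragraph", "content": {"paragraph_content": [{"content": "Proteins fold."}]}},
    ]
]
blocks = flatten_pages(pages)
texts = [block_text(block) for block in blocks]
print(texts)
```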
Our parsing pipeline is as follows:
content_list_v2.json
│
───────────────── Step 1: Flattening ─────────────────
│
_flatten() — Remove noise blocks
(page_header/footer/footnote)
Preserve title / paragraph / table, etc.
│
────────────── Step 2: Metadata Extraction ────────────────
│
┌─ title ← First level=1 title block
├─ authors ← First short line after title (contains commas, <400 characters)
├─ year ← Extract "2025" from page_footer
├─ doi ← Match "10.1002/..." from page_footnote
└─ journal ← Select all-uppercase short name from page_header
│
────────────── Step 3: Abstract Extraction ──────────────────
│
_extract_abstract()
Skip author lines → Collect all paragraphs before the first section
│
─────────┐ Step 4: Section Segmentation ─────────────────────
│
│ Split paragraphs by title blocks:
│ level=1 → Skip (Paper Title)
│ level=2 → New Primary Section
│ level>=3 or numbered "2.1." → Subsection, grouped under parent section
│
─────────┤ Step 5: Title Normalization ─────────────────────
│
│ normalize_section_title()
│ Remove numeric prefixes "2.2. IDPFold..." → "IDPFold..."
│ Match CANONICAL_TYPES table → "results"
│
─────────┤ Step 6: Section Aggregation ───────────────────────
│
│ _aggregate_sections()
│ Merge content with the same canonical_type
│ Preserve subsections list
│
─────────┘ Step 7: Table Extraction ─────────────────────
│
_extract_tables()
Collect html + caption of all table blocks
│
▼
Structured Output JSON
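Steps 4 and 5 can be sketched as follows, assuming the blocks have already been flattened into `(kind, level, text)` tuples. The tuple shape, the function names, and the two regular expressions are illustrative assumptions rather than the project's actual code; paragraphs that appear before the first section are simply dropped here, whereas the pipeline above routes them to abstract extraction.

```python
import re

NUMERIC_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+")
NUMBERED_SUBSECTION = re.compile(r"^\s*\d+\.\d+")


def strip_numeric_prefix(title):
    """Remove a leading section number such as '2.' or '2.2.' from a title."""
    return NUMERIC_PREFIX.sub("", title)


def split_sections(blocks):
    """Group paragraphs under primary sections and their subsections."""
    sections = []
    current = None
    for kind, level, text in blocks:
        if kind == "title":
            if level == 1:
                continue  # paper title, not a section
            if level == 2 and not NUMBERED_SUBSECTION.match(text):
                current = {"raw_title": text, "paragraphs": [], "subsections": []}
                sections.append(current)
            elif current is not None:
                current["subsections"].append({"raw_title": text, "paragraphs": []})
        elif kind == "paragraph" and current is not None:
            target = current["subsections"][-1] if current["subsections"] else current
            target["paragraphs"].append(text)
    return sections


blocks = [
    ("title", 1, "Paper Title"),
    ("title", 2, "2. Results"),
    ("paragraph", None, "Overview."),
    ("title", 2, "2.1. Global Features"),
    ("paragraph", None, "Details."),
]
sections = split_sections(blocks)
clean_title = strip_numeric_prefix(sections[0]["subsections"][0]["raw_title"])
print(sections[0]["raw_title"], "->", clean_title)
```

The numbered-subsection check lets a heading like "2.1. Global Features" nest under "2. Results" even when MinerU reports both at level 2.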
This JSON schema is more complex and less straightforward to parse than PMC‑derived JSON files.
Analogous to the PubMed processing pipeline, two sequential modules are deployed for structured extraction of non‑PubMed JSON outputs.
The combination mineru‑parse + mineru‑export‑md serves as an enhanced counterpart to pubmed‑merge‑json + pubmed‑export‑md.
❯ paperflow mineru-parse --help
Usage: paperflow mineru-parse [OPTIONS]
Parse mineru output content_list_v2.json into canonical sectioned JSON.
Extracts metadata (title, authors, year, DOI, journal),
and sections normalised to canonical types (abstract, introduction, results,
discussion, methods, etc.). Tables are preserved as HTML.
Notes:
- 1, Two backends: 'regex' (pattern + context, no API) and 'ai' (LLM batch classification).
- 2, AI backend supports Anthropic native, OpenAI native, and any OpenAI-compatible
endpoint via --base-url (DeepSeek, university proxies, self-hosted, etc.).
- 3, Set the appropriate API key env var (ANTHROPIC_API_KEY, OPENAI_API_KEY,
DEEPSEEK_API_KEY) or pass --api-key.
- 4, Configure provider/model via --model, --base-url, or a YAML config file.
Examples:
paperflow mineru-parse -i content_list_v2.json -o paper.json
paperflow mineru-parse -i content_list_v2.json -o paper.json --backend ai
paperflow mineru-parse -i content_list_v2.json -o paper.json --backend ai \
--base-url https://api.deepseek.com --model deepseek-v4-pro --api-key sk-xxx
paperflow mineru-parse -i content_list_v2.json -o paper.json --backend ai \
--base-url https://models.sjtu.edu.cn/api/v1 --model deepseek-chat
paperflow mineru-parse -i content_list_v2.json -o paper.json --backend regex --config custom.yaml
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --input -i TEXT Path to mineru content_list_v2.json. [required] │
│ * --output -o TEXT Output path for the structured JSON file. [required] │
│ --backend -b TEXT Section classification backend: 'regex' (default, no API needed) or 'ai'. │
│ [default: regex] │
│ --config -c TEXT Path to YAML config file for canonical types, aliases, and AI settings. │
│ --api-key TEXT API key for AI backend. Overrides config file and env var. │
│ --model TEXT Override AI model (e.g. 'deepseek-v4-pro', 'claude-haiku-4-5', 'gpt-4o-mini'). │
│ --base-url TEXT Custom API base URL for OpenAI-compatible endpoints (e.g. │
│ 'https://api.deepseek.com'). │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
mineru-parse transforms flat JSON outputs from MinerU into standardised structured JSON, classifying each segment into canonical academic sections while extracting metadata (title, authors, year, DOI, journal) and figure captions.
Two backends are provided for textual segment parsing:
Two Backends
| Backend | How it works | API needed? | Best for |
|---|---|---|---|
| regex (default) | Pattern matching: exact string → regex → context keyword. Configurable via YAML. | No | Common papers, batch processing |
| ai | Sends all section titles + context to an LLM in one batch API call. | Yes | Non-standard titles, multi-publisher |
1. Regex matching layers:
1. strong (exact match) → "Introduction" == "introduction" ✓
2. weak (regex search) → "1. Introduction" matches r"introduction" ✓
3. context_keywords → "Overview" → check text for "we used..." → methods
4. fallback → classify as "other"
A sliding positional pointer tracks document sequence to minimise misclassification: subsequent section matching initiates from the endpoint of the preceding matched segment rather than the document start.
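The four matching layers can be sketched as a simple cascade. The alias table below is a hypothetical two-entry stand-in for the rules in mineru_config.yaml, and `classify_title` is an illustrative name; the sliding positional pointer described above is intentionally left out to keep the example short.

```python
import re

# Hypothetical alias rules; the real ones live in mineru_config.yaml.
ALIASES = {
    "introduction": {
        "strong": ["introduction", "background"],
        "weak": [r"introduction"],
        "context_keywords": [],
    },
    "methods": {
        "strong": ["methods", "materials and methods"],
        "weak": [r"\bmethods?\b"],
        "context_keywords": ["we used"],
    },
}


def classify_title(title, context="", aliases=ALIASES):
    """Classify a section title: strong -> weak -> context keywords -> 'other'."""
    normalized = title.strip().lower()
    # Layer 1: exact match against the strong aliases.
    for canonical_type, rules in aliases.items():
        if normalized in rules["strong"]:
            return canonical_type
    # Layer 2: regex search against the weak patterns.
    for canonical_type, rules in aliases.items():
        if any(re.search(pattern, normalized) for pattern in rules["weak"]):
            return canonical_type
    # Layer 3: keyword lookup in the text that follows the title.
    lowered_context = context.lower()
    for canonical_type, rules in aliases.items():
        if any(keyword in lowered_context for keyword in rules["context_keywords"]):
            return canonical_type
    # Layer 4: fallback.
    return "other"


print(classify_title("Introduction"))
print(classify_title("1. Introduction"))
print(classify_title("Overview", context="We used a diffusion model."))
print(classify_title("Appendix Z"))
```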
2. AI workflow:
content_list_v2.json
→ extract all titles + surrounding text (~200 chars)
→ build JSON payload: [{index, title, context_preview}, ...]
→ one API call → AI returns {classifications: [{index, canonical_type}]}
→ merge classifications into structured JSON
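The payload and merge steps of this workflow could look roughly like the sketch below. No API call is made: `mock_response` is a hand-written stand-in for whatever the model returns, and the function names and the fallback to "other" for unclassified titles are assumptions made for illustration.

```python
import json


def build_payload(titles, context_chars=200):
    """Build the batch payload: one entry per title with a short context preview.

    titles: list of (title, following_text) pairs in document order.
    """
    return [
        {"index": index, "title": title, "context_preview": context[:context_chars]}
        for index, (title, context) in enumerate(titles)
    ]


def merge_classifications(titles, response):
    """Attach the returned canonical_type to each title, defaulting to 'other'."""
    by_index = {
        item["index"]: item["canonical_type"]
        for item in response["classifications"]
    }
    return [
        {"raw_title": title, "canonical_type": by_index.get(index, "other")}
        for index, (title, _) in enumerate(titles)
    ]


titles = [
    ("1. Introduction", "Proteins are ..."),
    ("Model Architecture", "We used a diffusion model ..."),
]
payload = build_payload(titles)
request_body = json.dumps(payload)  # what would be sent in the single API call

# Hand-written stand-in for the model's reply.
mock_response = {
    "classifications": [
        {"index": 0, "canonical_type": "introduction"},
        {"index": 1, "canonical_type": "methods"},
    ]
}
merged = merge_classifications(titles, mock_response)
print(merged)
```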
⚠️ The regex backend is enabled by default; the AI backend is under active development.
🌟 For the `-c` parameter of `mineru-parse`, please refer to the provided template configuration file (mineru config file). Default settings suffice for general usage without modification. This configuration file is engineered for compatibility with both regex and AI backends, with documentation and revision guidelines embedded within the file.
All matching rules are encapsulated within mineru_config.yaml, with sensible defaults preconfigured. Modifications are only required for journal‑specific adaptation.
Users may globally customise section categorisation and individually classify arbitrary textual segments according to personal reading and downstream analytical requirements.
🌟 This enables highly personalised section parsing: theoretically, custom section schemas and parsing logic can be tailored for any paper type.
Config file layout:
| Section | Purpose |
|---|---|
| `ai` | model, api_key, base_url for AI backend |
| `canonical_order` | Which types exist + their output order |
| `display_names` | Human-readable labels (can be Chinese, etc.) |
| `aliases` | Matching rules: strong (exact), weak (regex), context_keywords |
Common customization scenarios:
| Scenario | Where to edit |
|---|---|
| Title misclassified as "other" | Add to matching type's strong or weak |
| Need a new section type | Add to canonical_order + display_names + aliases |
| Switch AI model | Edit ai.model and ai.base_url |
| Chinese labels | Edit display_names |
A representative structured JSON output is provided below:
{
"source": "mineru",
"file": "paper_content_list_v2.json",
"backend": "regex",
"metadata": {
"title": "Accurate Generation of Conformational Ensembles...",
"authors": "Junjie Zhu, Zhengxin Li, ...",
"year": 2025,
"doi": "10.1002/advs.202511636",
"journal": "Advanced Science"
},
"sections": [
{
"canonical_type": "abstract",
"raw_title": "Abstract",
"display_title": "Abstract",
"level": 2,
"paragraphs": ["In this paper, we..."],
"subsections": []
},
{
"canonical_type": "introduction",
"raw_title": "1. Introduction",
"display_title": "Introduction",
"paragraphs": ["...", "[Figure: Figure 1. Architecture overview...]"],
"subsections": []
},
{
"canonical_type": "results",
"raw_title": "2. Results",
"display_title": "Results",
"subsections": [
{"raw_title": "2.1. Global Features", "paragraphs": ["..."]}
]
}
]
}
Approximately 15 standard section types are supported, consistent with conventional academic paper structure:
abstract introduction results discussion methods conclusion supplementary availability funding acknowledgements author_contributions keywords conflicts references other
Following generation of structured JSON files, targeted bulk section export can be performed on demand.
Functionally, the pubmed‑export‑md module for PubMed papers integrates the capabilities of mineru‑parse and mineru‑export‑md.
❯ paperflow mineru-export-md --help
Usage: paperflow mineru-export-md [OPTIONS]
Export structured mineru JSON to a clean Markdown file for LLM
processing.
Reads one or more JSON files produced by ``mineru-parse`` and
writes a
single Markdown file. Metadata (title, authors, year, DOI,
journal) is
always included. Content sections are included based on the
optional
YAML config.
YAML config format:
content_sections:
- abstract
- introduction
- methods
- results
- discussion
- conclusion
Examples:
paperflow mineru-export-md -i paper.json -o paper.md
paperflow mineru-export-md -i paper.json -o paper.md --config extract.yaml
paperflow mineru-export-md -i ./parsed_dir -o all_papers.md
╭─ Options ────────────────────────────────────────────────────────╮
│ * --input -i TEXT Path to structured JSON file (from │
│ mineru-parse), or a directory of such │
│ files. │
│ [required] │
│ * --output -o TEXT Output Markdown file path. [required] │
│ --config -c TEXT YAML config specifying │
│ content_sections to include. If not │
│ provided, all sections are included. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────╯
🌟 Similarly, the `-c` parameter of `mineru-export-md` accepts a dedicated YAML configuration file (mineru export config file) for bulk section export configuration, with embedded documentation and revision guidelines.
⚠️ Section types defined in this export configuration file must be pre‑declared in canonical_order within mineru_config.yaml. Custom section types (e.g., ethics) defined during parsing may only be invoked in the export phase if pre‑registered upstream. In short, mineru export config file and mineru config file must be mutually consistent.
mineru_config.yaml mineru_export_config.yaml
┌──────────────────────┐ ┌──────────────────────┐
│ canonical_order: │ │ content_sections: │
│ - abstract │── defines → │ - abstract │
│ - introduction │ type pool │ - introduction │
│ - results │ │ - results │
│ - ... │ │ - discussion │
│ - ethics ← custom │ │ - methods │
└──────────────────────┘ │ - ethics ← referenced │
└──────────────────────┘
For instance, if ethics is added to canonical_order with corresponding aliases in mineru_config.yaml, the heading "Ethics Statement" within papers will be classified under the ethics section type during parsing. This type may then be selected in the export configuration file to extract relevant content. Unregistered section types cannot be recognised in the export phase.
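One way to catch a mismatch between the two files before exporting is a small consistency check. The sketch below works on plain Python lists standing in for the parsed canonical_order and content_sections entries; loading the YAML files themselves is left out, and `undeclared_sections` is a hypothetical helper, not part of the CLI.

```python
def undeclared_sections(canonical_order, content_sections):
    """Return export sections that are not declared in canonical_order."""
    declared = set(canonical_order)
    return [name for name in content_sections if name not in declared]


# Stand-ins for the parsed contents of the two YAML files.
canonical_order = ["abstract", "introduction", "results", "ethics"]
content_sections = ["abstract", "ethics", "limitations"]

missing = undeclared_sections(canonical_order, content_sections)
if missing:
    print("Not declared in canonical_order:", ", ".join(missing))
```

In this example "ethics" passes because it is registered upstream, while "limitations" is reported because it appears only in the export configuration.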
Engineered for batch processing workflows, mineru‑export‑md scans all .json files within a specified non‑PubMed paper directory (it is recommended to store only mineru‑parse outputs in an isolated directory to avoid extraneous JSON files). Files are sorted by name, with individual papers separated by ---, and consolidated into a single merged Markdown file.
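The batch behaviour described above (scan, sort by file name, separate papers with ---) can be sketched as follows. The rendering is deliberately minimal and the function names are hypothetical; the JSON keys follow the representative structured output shown earlier, and the temporary directory with two tiny papers exists only to make the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def render_paper(paper, wanted=None):
    """Render one structured paper as Markdown, optionally filtering sections."""
    lines = [f"# {paper['metadata']['title']}"]
    for section in paper.get("sections", []):
        if wanted is not None and section["canonical_type"] not in wanted:
            continue
        lines.append(f"## {section['display_title']}")
        lines.extend(section.get("paragraphs", []))
    return "\n\n".join(lines)


def export_directory(directory, wanted=None):
    """Merge all .json files in a directory, sorted by name, separated by ---."""
    papers = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.json"))
    ]
    return "\n\n---\n\n".join(render_paper(paper, wanted) for paper in papers)


with tempfile.TemporaryDirectory() as tmp:
    # Written out of order on purpose to show the name-based sorting.
    for name, title in [("b.json", "Second"), ("a.json", "First")]:
        paper = {
            "metadata": {"title": title},
            "sections": [
                {"canonical_type": "abstract", "display_title": "Abstract", "paragraphs": ["Text."]},
                {"canonical_type": "funding", "display_title": "Funding", "paragraphs": ["Grant."]},
            ],
        }
        Path(tmp, name).write_text(json.dumps(paper), encoding="utf-8")
    merged = export_directory(tmp, wanted={"abstract"})
print(merged)
```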
The preceding Steps 1‑4 are illustrated using PubMed as a representative literature database. The same processing logic applies to other academic platforms, such as arXiv, bioRxiv, medRxiv, chemRxiv, and more.
In theory, all DOI‑driven literature workflows can be standardised following the pipeline described above:
Retrieve PDF via DOI → Preliminary PDF Parsing → Content Extraction and Structured Processing
Modules dedicated to the aforementioned preprint platforms are still under development and refinement. Preprint‑related subcommands are provided for testing purposes only. For detailed test cases, refer to Cases
Upon completing literature retrieval, parsing, and structured processing as outlined above, users obtain chapter‑organised Markdown files and structured JSON files, which serve as the fundamental inputs for subsequent critical reading and knowledge graph analysis.
Whether conducting continuous parsing of cutting‑edge individual papers or batch‑processing literature for thematic research, Markdown files form the unified starting point. State‑of‑the‑art (SOTA) text‑processing and logical‑analysis models can be leveraged to assist knowledge graph construction or straightforward real‑time literature reading.
🌟 As the most subjective downstream task, literature reading can still be transformed into quantifiable, repeatable workflows. Highly customised reading skills are commonly adopted to facilitate paper analysis. Relevant references are provided at paper reading skill
We provide a set of test cases in Test Documentation, covering multiple types of literature data including PubMed, arXiv, bioRxiv, and more.
It also contains highly detailed step‑by‑step execution logs of script workflows arranged in the logical order of literature research.
You may directly run the test scripts to verify the correctness and completeness of all functionalities.
🌟 By combining the aforementioned `usage instructions` with these `test cases`, users can quickly get started with our tool.
1. Starting Point for Research
- Extend the BrainStorm skill and explore programmable integration of background prior knowledge.
2. Literature Search (and Metadata Scraping)
- Supplement query syntax for various literature databases and implement skill‑based support. Currently only partial MeSH‑aware syntax priors for PubMed are integrated.
- Maintain and update the BioPython library (E‑utilities API) for PubMed parsing from this stage onward. Current version: BioPython 1.87; see biopython Repository for details.
3. Literature Acquisition (and Full‑Text Download)
- Refine and encapsulate the `paper-fetch` module. Refer to 2026‑05‑08 paper‑fetch Encapsulation; evaluate integration or replacement with more robust modules offering higher hit rates.
- The `pdf-parse` module currently wraps basic MinerU parsing commands with the CPU backend (`-b pipeline`). GPU‑accelerated features are planned for future integration; see MinerU Repository for details.
4. Literature Content Extraction and Structured Processing
- Improve JSON‑structured parsing of PMC plain‑text content within the `pubmed-export-md` module: enhance semantic boundary validation by expanding regular‑expression matching ranges, or introduce an AI backend analogous to the one in the `mineru-parse` module.
- The `mineru-parse` module parses `content_list_v2.json`. Official documentation indicates this output format is still evolving; ongoing tracking and maintenance are required. See MinerU Output File Documentation.
- Enhance semantic boundary validation for the regex backend of `mineru-parse` by expanding regular‑expression matching ranges.
- Deepen integration of the AI backend within the `mineru-parse` module.
- Optimize coordination between YAML configuration files for the `mineru-parse` and `mineru-export-md` modules to achieve efficient mapping.
- Design a standalone skill for segment extraction and structured processing of raw parsed Markdown content. Current workflows default to JSON files and underutilize Markdown outputs.
5. Processing for Other Literature Databases
- Develop a unified `search-fetch-parse` pipeline for non‑PubMed databases and complete corresponding modules. Refer to open‑source implementations such as paperscraper and paper‑tracker.
6. Critical Reading and Knowledge Graph Analysis: Downstream End‑Use
- Develop highly customized skills for in‑depth literature analysis, preferably integrated into downstream workflows.
- Introduce persistent databases to scale and deepen functionality beyond a pure Python‑based project.
