Documentation request: where does the input TSV for data_loader come from? #22

@gaurav

We're trying to figure out how to refresh the database that backs DocumentMetadataAPI, and we got stuck on the input format expected by data_loader.process_file / data_loader.lambda_handler.

The loader reads a tab-separated file with (at least) ten columns in this exact order:

document_id  pub_year  pub_month  pub_day  journal_name  journal_abbrev  volume  issue  article_title  abstract

…with - used as the sentinel for a missing pub_day, and with document_id values carrying a PMID: prefix. We were hoping this was a standard PubMed bulk-export format we could regenerate from scratch, but our searches turned up no such format.

Our tentative conclusion is that the TSV is produced by an internal/private job (probably whatever uploads to the GCS bucket the Lambda variant reads from), and isn't documented or open-sourced anywhere. We've written up what we figured out in a downstream README update here: ncats#23
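
To make the contract concrete, here is a minimal Python sketch of how we currently read it. It is not taken from data_loader: parse_fields and read_tsv are hypothetical names, and ignoring any columns past the tenth is our assumption, not confirmed behavior.

```python
import csv
import io

# Column order as we understand it from reading the loader.
COLUMNS = [
    "document_id", "pub_year", "pub_month", "pub_day", "journal_name",
    "journal_abbrev", "volume", "issue", "article_title", "abstract",
]


def parse_fields(fields):
    """Map one split TSV line onto the ten expected column names."""
    if len(fields) < len(COLUMNS):
        raise ValueError(f"expected at least {len(COLUMNS)} columns, got {len(fields)}")
    # ASSUMPTION: columns beyond the tenth are ignored (zip stops at COLUMNS).
    record = dict(zip(COLUMNS, fields))
    if not record["document_id"].startswith("PMID:"):
        raise ValueError(f"document_id lacks PMID: prefix: {record['document_id']!r}")
    # "-" is the sentinel for a missing publication day.
    if record["pub_day"] == "-":
        record["pub_day"] = None
    return record


def read_tsv(text):
    """Parse TSV text into a list of records, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [parse_fields(row) for row in reader if row]
```

All values are kept as strings here because we don't know which types the loader expects for pub_year, pub_month, volume, or issue.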

It would be really helpful to have, in this repo's README or a CONTRIBUTING.md:

  1. A pointer to whatever script/pipeline produces the TSV (even if it's a private repo or an internal job — just a name).
  2. Confirmation of the column contract: column order, the - sentinel for missing pub_day, whether len > 1 empty-string handling is intentional, and what happens if more than 10 columns are present.
  3. Guidance for someone who wants to regenerate the database from a fresh PubMed XML download (baseline + updates).
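
For item 3, this is roughly what we would try in the absence of guidance: a minimal sketch that turns PubMed XML records into lines matching the column order above. The element paths follow the public PubMed XML layout, but everything else is an assumption on our part: the function names are hypothetical, missing fields other than pub_day are written as empty strings, pub_month is passed through exactly as PubMed provides it, and records that use MedlineDate instead of Year/Month/Day are not handled.

```python
import xml.etree.ElementTree as ET

ARTICLE = "MedlineCitation/Article"
PUB_DATE = ARTICLE + "/Journal/JournalIssue/PubDate"


def _clean(node):
    """Flatten an element's text (including inline markup) to one line."""
    if node is None:
        return ""
    # split()/join collapses tabs and newlines so the value is TSV-safe.
    return " ".join("".join(node.itertext()).split())


def _text(article, path):
    return _clean(article.find(path))


def article_to_row(article):
    """Build the ten fields for one PubmedArticle element."""
    abstract_parts = [_clean(n) for n in article.findall(ARTICLE + "/Abstract/AbstractText")]
    return [
        "PMID:" + _text(article, "MedlineCitation/PMID"),
        _text(article, PUB_DATE + "/Year"),
        _text(article, PUB_DATE + "/Month"),
        _text(article, PUB_DATE + "/Day") or "-",  # sentinel for missing day
        _text(article, ARTICLE + "/Journal/Title"),
        _text(article, ARTICLE + "/Journal/ISOAbbreviation"),
        _text(article, ARTICLE + "/Journal/JournalIssue/Volume"),
        _text(article, ARTICLE + "/Journal/JournalIssue/Issue"),
        _text(article, ARTICLE + "/ArticleTitle"),
        " ".join(p for p in abstract_parts if p),
    ]


def xml_to_tsv_lines(xml_text):
    """Yield one tab-separated line per PubmedArticle in the XML text."""
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        yield "\t".join(article_to_row(article))
```

Whether output like this would actually load cleanly is exactly what we can't verify without the confirmation requested in item 2.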

It might be easier to load the PubMed data into a new database, but if a data ingest tool already exists for the current version of this tool, a pointer to it would be really helpful. Thank you!
