HeartBioPortal DataHub is the data integration and publishing layer behind HeartBioPortal. It standardizes heterogeneous cardiovascular genomics and omics datasets, preserves provenance, performs raw-level integration, and emits both legacy-compatible analyzed outputs and newer serving artifacts.
Comprehensive documentation lives under docs/ and can also be served as a documentation website.
- Start with: docs/index.md
- Script standards: SCRIPT_MANIFESTO.md
- Architecture guide: docs/architecture/
- Pipeline guides: docs/pipelines/
- Extension/contributor guides: docs/extending/ and docs/contributing.md
Local docs preview:
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[docs]"
mkdocs serve
Production docs site:
- mkdocs.yml defines the site.
- .github/workflows/docs.yml builds and deploys to GitHub Pages.
- Enable GitHub Pages in the repository and choose GitHub Actions as the source.
git clone https://github.com/HeartBioPortal/DataHub.git
cd DataHub
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e ".[test]"
python -m pytest
For script-only environments, requirements.txt is still available. The package metadata in pyproject.toml is the canonical development install path.
pip install -r requirements.txt
Current runtime requirements:
- jsonschema
- jsonschema2md
- PyGithub
- pandas
- duckdb
- requests
Test dependencies live under the test optional extra in pyproject.toml.
- scripts/prepare_association_raw.py
- scripts/build_legacy_association.py
- scripts/run_ingestion.py
- scripts/run_structural_variant_ingestion.py
- scripts/dataset_specific_scripts/mvp/run_mvp_pipeline.py
- scripts/dataset_specific_scripts/unified/run_unified_pipeline.py
- scripts/dataset_specific_scripts/unified/run_secondary_analyses.py
- scripts/dataset_specific_scripts/unified/run_gene_profile_pipeline.py
- scripts/dataset_specific_scripts/unified/build_dbsnp_frequency_index.py
- scripts/dataset_specific_scripts/unified/build_dbsnp_frequency_parquet.py
- scripts/dataset_specific_scripts/unified/canonicalize_variant_viewer_artifacts.py
- scripts/dataset_specific_scripts/expression/run_expression_pipeline.py
- scripts/report_artifact_qa.py
Editable installs also expose console commands such as:
- datahub-run-ingestion
- datahub-run-unified-pipeline
- datahub-ingest-mvp-duckdb-fast
- datahub-publish-unified-from-duckdb
- datahub-build-serving-duckdb
- datahub-run-secondary-analyses
- datahub-run-gene-profile-pipeline
- datahub-build-dbsnp-frequency-index
- datahub-build-dbsnp-frequency-parquet
- datahub-run-expression-pipeline
- datahub-report-artifact-qa
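After an editable install, it can be useful to confirm that the console commands actually resolve on the current PATH. The following is a minimal sketch using only the standard library; the helper name find_missing_commands is hypothetical, and the command list is a subset of the names listed above.

```python
import shutil

# A subset of the console commands listed above; extend as needed.
DATAHUB_COMMANDS = [
    "datahub-run-ingestion",
    "datahub-run-unified-pipeline",
    "datahub-report-artifact-qa",
]


def find_missing_commands(commands):
    """Return the commands that cannot be resolved on the current PATH."""
    return [name for name in commands if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_commands(DATAHUB_COMMANDS)
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All listed console commands are on PATH.")
```

If commands are reported missing, the virtual environment is probably not activated or the editable install has not been run in it.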
Expression v3 uses Python for orchestration, metadata, validation, and DataHub artifact generation. GEO microarray differential-expression execution uses GEOquery, limma, and Biobase from Bioconductor through a local R library.
On Ubuntu/AWS, install the system libraries required by the R curl and xml2 packages before installing GEOquery:
cd /data/DataHub
mkdir -p .r-lib
sudo apt-get update
sudo apt-get install -y libcurl4-openssl-dev libxml2-dev libssl-dev
R_LIBS_USER="$PWD/.r-lib" Rscript -e ".libPaths(c(Sys.getenv('R_LIBS_USER'), .libPaths())); if (!requireNamespace('BiocManager', quietly=TRUE)) install.packages('BiocManager', repos='https://cloud.r-project.org')"
R_LIBS_USER="$PWD/.r-lib" Rscript -e ".libPaths(c(Sys.getenv('R_LIBS_USER'), .libPaths())); install.packages(c('curl','xml2'), repos='https://cloud.r-project.org')"
R_LIBS_USER="$PWD/.r-lib" Rscript -e ".libPaths(c(Sys.getenv('R_LIBS_USER'), .libPaths())); BiocManager::install(c('GEOquery','limma','Biobase'), ask=FALSE, update=FALSE)"
Verification:
R_LIBS_USER="$PWD/.r-lib" Rscript -e ".libPaths(c(Sys.getenv('R_LIBS_USER'), .libPaths())); pkgs <- c('BiocManager','GEOquery','limma','Biobase'); print(setNames(vapply(pkgs, requireNamespace, logical(1), quietly=TRUE), pkgs))"
All four packages must print TRUE before we run approved GEO/limma jobs.
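Because orchestration is done in Python, the verification output can also be checked programmatically. The sketch below is not part of DataHub; it assumes the usual way R prints a named logical vector (a line of names followed by a line of TRUE/FALSE values, repeated if the vector wraps), and the function names are hypothetical.

```python
def parse_r_named_logical(output: str) -> dict:
    """Parse R's printed named logical vector into {name: bool}.

    Assumes names appear on one line and their values on the next,
    with that pair of lines repeated when the vector wraps.
    """
    lines = [line.split() for line in output.splitlines() if line.strip()]
    result = {}
    # Walk the output two lines at a time: names first, then values.
    for names, values in zip(lines[0::2], lines[1::2]):
        if len(names) != len(values):
            raise ValueError("names and values lines do not align")
        for name, value in zip(names, values):
            result[name] = value == "TRUE"
    return result


def missing_packages(output: str) -> list:
    """Return package names whose check did not print TRUE."""
    return [name for name, ok in parse_r_named_logical(output).items() if not ok]


if __name__ == "__main__":
    sample = (
        "BiocManager    GEOquery       limma     Biobase \n"
        "       TRUE        TRUE       FALSE        TRUE \n"
    )
    print(missing_packages(sample))
```

In practice the output string could come from running the verification command with subprocess.run(..., capture_output=True, text=True) and reading its stdout; a non-empty result would then block the GEO/limma job.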
- src/datahub/: reusable pipeline, adapter, config, validation, storage, and publisher code
- config/: profiles, manifests, runtime configs, phenotype hierarchy, output contracts, and export manifests
- raw_data/: small checked-in standalone source files organized by source ID
- analyzed_data/: curated analyzed artifacts and merge/metadata seed payloads organized by source ID
- scripts/: operational entrypoints for preparation, ingest, publish, and orchestration
- tests/: focused coverage for adapters, manifests, publishers, runners, and serving builders
- docs/: contributor-facing documentation published at https://heartbioportal.github.io/DataHub/
Config JSON files are validated by JSON Schemas in config/schemas/.
- Keep biological/analytical logic in DataHub, not in downstream application layers.
- Preserve provenance as early as possible and keep source detail through normalization.
- Make source-specific behavior explicit through config and adapters, not hidden conditionals.
- Keep published outputs stable for consumers while allowing additive metadata evolution.
- Separate concerns between raw preparation, canonical ingestion, analyzed publication, and serving artifacts.
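The provenance and separation-of-concerns principles can be illustrated with a small sketch. This is not DataHub's actual data model: the class names, field names, and functions below are hypothetical and only show provenance being attached at ingestion and carried unchanged through normalization.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Provenance:
    source_id: str    # hypothetical field: identifier of the originating source
    source_file: str  # hypothetical field: file the row was read from


@dataclass(frozen=True)
class Record:
    gene: str
    phenotype: str
    provenance: Provenance


def ingest(raw_row: dict, source_id: str, source_file: str) -> Record:
    """Attach provenance at the earliest step, before any normalization."""
    return Record(
        gene=raw_row["gene"],
        phenotype=raw_row["phenotype"],
        provenance=Provenance(source_id, source_file),
    )


def normalize(record: Record) -> Record:
    """Normalize field values while carrying provenance through unchanged."""
    return replace(
        record,
        gene=record.gene.strip().upper(),
        phenotype=record.phenotype.strip().lower(),
    )


if __name__ == "__main__":
    row = {"gene": " ttn ", "phenotype": "Cardiomyopathy"}
    print(normalize(ingest(row, "example_source", "rows.tsv")))
```

Keeping ingestion and normalization as separate functions means each stage can be tested on its own, and the frozen dataclasses make it harder for a later stage to drop or overwrite source detail by accident.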
DataHub still supports legacy HeartBioPortal-compatible analyzed payloads, but the codebase now also maintains a newer serving-artifact path based on DuckDB. The legacy path exists for compatibility; the unified DuckDB-first path is the strategic direction.
DataHub is the canonical HBP 3.0 data-owner repository. It prepares source manifests, normalizes source-specific fields, publishes association and secondary-analysis artifacts, and builds serving datamarts consumed by the HeartBioPortal backend and frontend.
Related HBP 3.0 repositories:
- HeartBioPortal organization: https://github.com/HeartBioPortal
- Live site: https://heartbioportal.org/
- HCG guideline extraction resource: https://github.com/HeartBioPortal/HCG
- HCG-KG guideline knowledge graph resource: https://github.com/HeartBioPortal/HCG-KG
This repository supports the HeartBioPortal 3.0 NAR Database Issue manuscript release (v3.0.0-nar). The release-support files in this repository describe source provenance, licensing constraints, generated artifacts, reproducibility expectations, and which files we include or exclude from a public archive.
Release metadata and manifests:
- CITATION.cff
- .zenodo.json
- RELEASE_NOTES.md
- MANIFEST.md
- DATA_SOURCES.tsv
- DATA_SOURCES.md
- ARTIFACT_MANIFEST.tsv
- BUILD_METADATA.json
- LICENSES.md
- PROVENANCE_SCHEMA.md
- docs/schemas/*.md
- scripts/generate_checksums.sh
Use this checklist before creating a GitHub release or Zenodo archive:
- Confirm the release branch and commit:
  git status --short --branch
  git rev-parse HEAD
- Validate the code and docs in the target environment:
  python -m pytest
  mkdocs build --strict
- Regenerate or review manifest files:
  - Review DATA_SOURCES.tsv and DATA_SOURCES.md against the current source configs and pipeline inputs.
  - Review ARTIFACT_MANIFEST.tsv against production QA reports, generated artifact directories, and serving DB tables.
  - Update BUILD_METADATA.json with the final release commit, build date, schema version, and verified production metrics.
- Verify counts from production artifacts where available:
  datahub-report-artifact-qa --help
  Counts that we cannot verify from committed local artifacts remain TBD; verify from production QA.
- Generate release checksums for release-relevant static files:
  scripts/generate_checksums.sh
- Include in Zenodo:
  - repository source code
  - config schemas and manifests
  - documentation
  - small examples or seed metadata that are redistributable
  - generated release manifests and checksum files
- Exclude from Zenodo unless redistribution has been confirmed:
  - controlled individual-level human data
  - API keys, credentials, tokens, or secrets
  - raw DrugBank full database files
  - large source datasets with unclear redistribution rights
  - controlled-access or license-restricted third-party source files
  - massive generated artifacts unless they are intended, permitted, and documented for the release package
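For readers who want to spot-check individual files from the checksum step in the checklist above, the following is a minimal Python sketch. It does not reproduce scripts/generate_checksums.sh, whose exact behaviour is not described here; it assumes SHA-256 and the two-space "digest  path" line layout used by the sha256sum tool, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(paths):
    """Format one '<hex digest>  <path>' line per file, sorted by path."""
    return [f"{sha256_of(Path(p))}  {p}" for p in sorted(paths, key=str)]


if __name__ == "__main__":
    # Files named in the release metadata list; missing ones are skipped.
    candidates = ["CITATION.cff", "DATA_SOURCES.tsv", "ARTIFACT_MANIFEST.tsv"]
    existing = [p for p in candidates if Path(p).is_file()]
    print("\n".join(checksum_lines(existing)))
```

Reading in chunks keeps memory use flat for large files, and sorting the paths makes the output stable across runs so that two checksum listings can be compared directly.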
Controlled individual-level human data, API keys, credentials, protected data, tokens, and restricted source data stay out of this repository. Source-specific licensing controls redistribution of third-party data; when redistribution rights are uncertain, we document the source in DATA_SOURCES.tsv or LICENSES.md rather than committing the data.
See LICENSE.