pg_arrow

Status: Work in progress. Public API, error types, and on-disk format coverage may change. Not yet production-ready.

Current implementation: reads PostgreSQL heap files directly from disk (no shared buffer pool yet). A buffer-pool / page-cache layer is on the roadmap.

PostgreSQL version: only tested against PostgreSQL 18. Older versions may work, but multi-version testing is WIP.

Low-level library for reading PostgreSQL data files directly and converting them to Apache Arrow format. Used by pgfusion as the page-parsing and Arrow conversion layer.
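To give a sense of what "page parsing" involves, here is a minimal sketch that decodes the fixed 24-byte header at the start of a PostgreSQL heap page. It is not pg_arrow's API: the `PageHeader` struct and the `parse_page_header` and `line_pointer_count` functions are hypothetical names chosen for illustration. It also assumes the data files were written on a little-endian machine, since PostgreSQL stores page headers in native byte order.

```rust
/// Size in bytes of PageHeaderData, which precedes the line-pointer array.
const PAGE_HEADER_SIZE: usize = 24;
/// Size in bytes of one line pointer (ItemIdData).
const LINE_POINTER_SIZE: usize = 4;

/// Hypothetical decoded form of the on-disk page header (illustration only).
#[derive(Debug, PartialEq)]
struct PageHeader {
    lsn: u64,           // pd_lsn: WAL position of the last change to this page
    checksum: u16,      // pd_checksum
    flags: u16,         // pd_flags
    lower: u16,         // pd_lower: offset to the start of free space
    upper: u16,         // pd_upper: offset to the end of free space
    special: u16,       // pd_special: offset to the special space
    page_size: u16,     // high byte of pd_pagesize_version
    layout_version: u8, // low byte of pd_pagesize_version
    prune_xid: u32,     // pd_prune_xid
}

// Assumption: little-endian on-disk layout (e.g. x86-64, aarch64).
fn u16_at(buf: &[u8], off: usize) -> u16 {
    u16::from_le_bytes([buf[off], buf[off + 1]])
}

fn u32_at(buf: &[u8], off: usize) -> u32 {
    u32::from_le_bytes([buf[off], buf[off + 1], buf[off + 2], buf[off + 3]])
}

/// Decode the page header, or return None if the buffer is too short.
fn parse_page_header(page: &[u8]) -> Option<PageHeader> {
    if page.len() < PAGE_HEADER_SIZE {
        return None;
    }
    // pd_lsn is stored as two 32-bit halves: high word first, then low word.
    let lsn = ((u32_at(page, 0) as u64) << 32) | u32_at(page, 4) as u64;
    let size_version = u16_at(page, 18);
    Some(PageHeader {
        lsn,
        checksum: u16_at(page, 8),
        flags: u16_at(page, 10),
        lower: u16_at(page, 12),
        upper: u16_at(page, 14),
        special: u16_at(page, 16),
        page_size: size_version & 0xFF00,
        layout_version: (size_version & 0x00FF) as u8,
        prune_xid: u32_at(page, 20),
    })
}

/// Number of line pointers on the page, derived from pd_lower.
fn line_pointer_count(header: &PageHeader) -> usize {
    (header.lower as usize).saturating_sub(PAGE_HEADER_SIZE) / LINE_POINTER_SIZE
}
```

A real reader would go on to validate `lower <= upper <= special <= page_size`, walk the line pointers, and decode each tuple; this sketch stops at the header.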

Prerequisites

  • Rust (install via rustup.rs)
  • just — command runner for all recipes
# macOS
brew install just

# Linux / Windows (via cargo)
cargo install just

# All platforms (pre-built binary)
curl --proto '=https' --tlsv1.2 -sSf https://just.systems/install.sh | bash -s -- --to ~/.local/bin

For flamegraph and profiling recipes:

cargo install cargo-flamegraph  # flamegraph-* recipes
cargo install samply            # samply-* recipes

Quick start

# Build
cargo build

# Run the table_reader example
just example-table-reader /path/to/pgdata           # defaults to db "postgres"
just example-table-reader /path/to/pgdata pgbench_test

# Run tests
just test

Common commands

just build                    # Debug build
just release                  # Release build
just test                     # Unit tests
just bench                    # Criterion benchmarks
just bench-iai                # iai instruction-count benchmarks
just bench-io                 # File I/O latency benchmarks
just flamegraph-bench         # Flamegraph for criterion bench
just flamegraph-example /path/to/pgdata  # Flamegraph for table_reader example
just doc                      # Open rustdoc
just --list                   # Show all available recipes

PostgreSQL Setup for Testing

Prerequisites

PostgreSQL setup uses the pg-test-harness scripts. Clone the pg-arrow/utils repo and point PG_HARNESS_DIR at the harness subdirectory:

git clone https://github.com/pg-arrow/utils /path/to/utils
export PG_HARNESS_DIR=/path/to/utils/pg-test-harness

Add the export to your shell profile (~/.zshrc, ~/.bashrc) to persist it.

Quick Setup

Setup writes testdata/ and pg-test-config.toml under $PG_HARNESS_DIR. The just recipes inherit that env var — no extra flags needed:

# Full setup: build from source, init cluster, load test data
just pg-setup pg18            # or pg17 / latest

# Full setup with simple schema (no pgbench tables)
just pg-setup-simple pg18

# Individual steps
just pg-build pg18            # Build PostgreSQL source only
just pg-init pg18             # Init cluster (source must be built)
just pg-testdata pg18         # Load test data into initialised cluster

Or invoke the harness script directly:

bash "$PG_HARNESS_DIR/scripts/setup-postgres.sh" -b pg18 -B -i -t

Script options

Flag                    Description
-b, --branch VERSION    pg18, pg17, pg16, latest, or full branch name
-B, --build             Build PostgreSQL locally (meson/ninja)
-i, --init              Initialize database cluster
-t, --test-data         Create test database with sample data
-s, --simple-schema     Single-table schema instead of full e-commerce schema
-p, --pgbench           Create a pgbench_test database with pgbench data

What the script does

  1. Clones PostgreSQL from https://git.postgresql.org/git/postgresql.git into $PG_HARNESS_DIR/testdata/postgres/
  2. Creates a git worktree under $PG_HARNESS_DIR/testdata/postgres-{version}/
  3. Optionally builds PostgreSQL locally (installs to testdata/postgres-{version}/install/)
  4. Optionally initializes the database cluster (no root/postgres user needed)
  5. Optionally creates a test database and loads schema + sample data
  6. Writes paths to $PG_HARNESS_DIR/pg-test-config.toml for use in Rust tests

Directory structure after setup

$PG_HARNESS_DIR/                  # = pg-test-harness/ in the pg-arrow/utils clone
├── pg-test-config.toml           # one config, shared by pg_arrow + pgfusion
├── testdata/
│   ├── postgres/                 # main PostgreSQL git repository
│   ├── postgres-latest/          # worktree for master branch
│   │   ├── data/
│   │   ├── build/
│   │   └── install/bin/
│   └── postgres-pg18/
│       ├── data/
│       ├── build/
│       └── install/bin/
└── scripts/
    ├── setup-postgres.sh         # PostgreSQL build/init/test-data setup
    └── pgbackrest-backup.sh      # WAL archiving and backup management

pg-test-config.toml format

Generated by the setup script. Paths inside are relative to $PG_HARNESS_DIR (the config file's parent directory). In tests, read the config programmatically via pg_test_harness::read_pg_config(); never hardcode paths.

[postgres.pg18]
version = "REL_18_STABLE"
source_dir = "testdata/postgres-pg18"
data_dir = "testdata/postgres-pg18/data"
bin_dir = "testdata/postgres-pg18/install/bin"
initialized = true
test_db_created = true
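Because the paths in the config are relative, they need to be joined onto the harness directory before use. The sketch below shows that resolution step using only the standard library. It is not the `pg_test_harness` API (whose signatures are not shown here): `harness_dir_from_env` and `resolve_config_path` are hypothetical helpers for illustration, and in tests `read_pg_config()` should be preferred.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical helper: read the harness directory from PG_HARNESS_DIR.
/// Returns None when the variable is not set.
fn harness_dir_from_env() -> Option<PathBuf> {
    env::var_os("PG_HARNESS_DIR").map(PathBuf::from)
}

/// Hypothetical helper: resolve a path taken from pg-test-config.toml
/// (for example the `data_dir` value) against the harness directory.
/// Note that `Path::join` returns `configured` unchanged if it is absolute.
fn resolve_config_path(harness_dir: &Path, configured: &str) -> PathBuf {
    harness_dir.join(configured)
}
```

For example, with `PG_HARNESS_DIR=/path/to/utils/pg-test-harness`, the `data_dir` value above resolves to `/path/to/utils/pg-test-harness/testdata/postgres-pg18/data`.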

Test database schemas

Simple schema (-s flag): single test_types table covering all common PostgreSQL datatypes — ideal for basic parsing tests.

Full e-commerce schema (default): 5 tables (categories, products, customers, orders, order_items) with foreign keys, multiple index types, and ~20 rows of sample data.

pgBackRest

just backup-setup             # Configure WAL archiving
just backup-full              # Full backup
just backup-incr              # Incremental backup
just backup-info              # Show backup info
just backup-restore /path     # Restore to directory
