LanceDB_VectorSearch

A high-performance Rust-based benchmark tool for testing LanceDB vector search capabilities.

LanceDB on S3 — Benchmarks

Two Rust binaries to benchmark LanceDB against an S3-compatible object store:

  1. lancedb-qps-10m — end-to-end pipeline benchmark: ingest → compact → IVF_PQ index → concurrent QPS sweep.
  2. lancedb-atomic-demo — side-by-side demo showing why LanceDB's conditional-PUT commits prevent the silent data loss you get with a naive last-writer-wins S3 PUT.

Prerequisites

1. Install Rust / Cargo

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
rustc --version    

2. S3 credentials

Both binaries read credentials from environment variables. Nothing is baked in. You need an endpoint, an access key, a secret key, and a bucket you can write to.
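For illustration, a minimal sketch of the startup check (variable names match the tables further down; the binaries' actual parsing code may differ):

use std::env;

// Fail fast with a clear message if a required setting is missing.
fn required(name: &str) -> String {
    env::var(name).unwrap_or_else(|_| panic!("{name} must be set"))
}

fn main() {
    let endpoint = required("S3_ENDPOINT");
    // S3_BUCKET is optional and falls back to the documented default.
    let bucket = env::var("S3_BUCKET").unwrap_or_else(|_| "lance-demo".into());
    println!("S3 endpoint {endpoint}, bucket {bucket}");
}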

3. Clone

git clone https://github.com/PureStorage-OpenConnect/LanceDB_VectorSearch.git
cd LanceDB_VectorSearch

4. Build

From the repository root (where Cargo.toml lives):

cargo build --release --bin lancedb-qps-10m --bin lancedb-atomic-demo

Binaries land in ./target/release/. The first build downloads and compiles all dependencies and takes roughly 5-10 minutes; incremental rebuilds finish in seconds.


Scenario 1 — lancedb-qps-10m (end-to-end pipeline benchmark)

Runs the full lifecycle of a vector table against S3 and prints a summary.

Phases

Phase 1: INGEST     Read .arrow IPC or .parquet files → push to S3 (parallel writers)
Phase 2: COMPACT    Merge fragments to target_rows_per_fragment + prune old versions
Phase 3: INDEX      Build IVF_PQ index (256 partitions, 64 sub-vectors, cosine)
Phase 4: WARMUP     Fire 50 concurrent queries to warm the connection pool + S3 caches
Phase 5: QPS SWEEP  Ramp concurrency through multiple tiers, 30s each, record p50/p99/p99.9

Each tier uses shared atomic counters and a lock-free latency window. Only queries that return actual result batches count as OK; failures are reported separately with an error rate.
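As a rough sketch of that mechanism (assumed structure, not the benchmark's actual code: run_query stands in for a real LanceDB search, and the snippet needs the tokio crate), one concurrency tier looks roughly like this:

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

// Placeholder for one vector search; the real binary issues a LanceDB query here
// and counts it as OK only if a result batch actually comes back.
async fn run_query() -> bool {
    tokio::time::sleep(Duration::from_millis(2)).await;
    true
}

// Requires tokio = { version = "1", features = ["full"] }.
#[tokio::main]
async fn main() {
    let threads = 10;                       // one concurrency tier
    let ok = Arc::new(AtomicU64::new(0));
    let errors = Arc::new(AtomicU64::new(0));
    let tier_len = Duration::from_secs(30);
    let deadline = Instant::now() + tier_len;

    let workers: Vec<_> = (0..threads)
        .map(|_| {
            let (ok, errors) = (ok.clone(), errors.clone());
            tokio::spawn(async move {
                let mut lat_us = Vec::new();            // per-worker latency samples, merged below
                while Instant::now() < deadline {
                    let t0 = Instant::now();
                    if run_query().await {
                        ok.fetch_add(1, Ordering::Relaxed);
                        lat_us.push(t0.elapsed().as_micros() as u64);
                    } else {
                        errors.fetch_add(1, Ordering::Relaxed);
                    }
                }
                lat_us
            })
        })
        .collect();

    let mut all = Vec::new();
    for w in workers {
        all.extend(w.await.unwrap());
    }
    all.sort_unstable();
    let pct = |p: f64| all[((all.len() - 1) as f64 * p) as usize] as f64 / 1000.0;
    let ok = ok.load(Ordering::Relaxed);
    println!(
        "threads={threads} qps={:.1} p50={:.2}ms p99={:.2}ms errors={}",
        ok as f64 / tier_len.as_secs_f64(),
        pct(0.50),
        pct(0.99),
        errors.load(Ordering::Relaxed)
    );
}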

Environment variables

Required:

Variable        Description
S3_ENDPOINT     S3 endpoint URL (e.g. http://xx.0.0.xx or AWS regional)
S3_ACCESS_KEY   Access key
S3_SECRET_KEY   Secret key
DATA_DIR        Directory of .arrow or .parquet files to ingest
QUERY_FILE      .arrow or .parquet file with query vectors

Optional:

Variable                  Default              Description
S3_BUCKET                 lance-demo           Bucket name
DB_PATH                   lance-data-10m       Lance DB path inside bucket
TABLE_NAME                wiki_10m             Table name
SELECT_COL                id                   Metadata column returned with each hit
PHASES                    ingest,index,search  Any comma-separated subset
K                         10                   Top-k nearest neighbors
NPROBES                   5                    IVF partitions to probe
REFINE_FACTOR             1                    Re-ranking factor
TARGET_ROWS_PER_FRAGMENT  1000000              Rows per fragment after compaction

How to run

export S3_ENDPOINT="http://<your-endpoint>"
export S3_ACCESS_KEY="<access-key>"
export S3_SECRET_KEY="<secret-key>"
export S3_BUCKET="wiki-bench"
export DB_PATH="lance-data-10m"
export TABLE_NAME="wiki_10m"
export DATA_DIR="/path/to/wiki-embeddings"
export QUERY_FILE="/path/to/wiki-embeddings/000.parquet"
export SELECT_COL="id"
export PHASES="ingest,index,search"

./target/release/lancedb-qps-10m

To re-run just the search phase against an existing table:

PHASES=search ./target/release/lancedb-qps-10m

Sample output

--- Phase: Ingest ---
Ingestion complete: 10000000 rows, ~100 fragments in xx.xxs

--- Phase: Compaction + Cleanup ---
  target_rows_per_fragment: 1000000
Fragments before: ~100
Fragments after:  10
Rows after:       10000000

--- Phase: IVF_PQ Index ---
Index built in xx.xxs

--- Phase: QPS Benchmark ---
  k=10, nprobes=5, refine_factor=1, select=[<element you wanted to retrieve>]

Threads    QPS          p50(ms)      p99(ms)      p99.9(ms)    OK         Errors
--------------------------------------------------------------------------------
10         xxx.x        xx.xx        xx.xx        xx.xx        xxxxx      0
25         xxx.x        xx.xx        xx.xx        xx.xx        xxxxx      0
50         xxx.x        xx.xx        xx.xx        xx.xx        xxxxx      0
100        xxx.x        xx.xx        xx.xx        xx.xx        xxxxx      0
150        xxx.x        xx.xx        xx.xx        xx.xx        xxxxx      0

================================================================================
                        BENCHMARK SUMMARY
================================================================================
  Total Rows:           10000000
  Ingest Throughput:    xxxxxx rows/s
  Index Build Time:     xx.xxs
  Best QPS:             xxxx.x (at 150 threads)
  Best p50:             xx.xx ms (at 10 threads)
  Best p99:             xx.xx ms (at 10 threads)
  Error rate:           x.xxxx%
================================================================================

Scenario 2 — lancedb-atomic-demo (conditional-PUT safety demo)

Demonstrates LanceDB's conditional_put = etag storage option and why it matters when multiple writers can commit concurrently. It runs the same N-writer race twice against the same S3 key:

  • Test 1 — plain PUT (no conditional header): every writer gets HTTP 200, but only one body survives. The rest are silently lost.
  • Test 2 — PUT with If-None-Match: *: exactly one writer gets HTTP 200, the rest get HTTP 412 Precondition Failed. No silent loss.

This is the primitive LanceDB uses to make manifest commits on S3 safe under concurrent writers and compaction.
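For context, here is a sketch of what that primitive looks like with the Rust object_store crate (assuming a recent object_store, roughly 0.10-era API, plus tokio; this is not the demo binary's own code). PutMode::Create is the crate's way of sending If-None-Match: *.

use object_store::aws::AmazonS3Builder;
use object_store::{path::Path, Error, ObjectStore, PutMode, PutOptions, PutPayload};

// Build a client from the same environment variables the demo uses.
fn s3_from_env() -> object_store::Result<impl ObjectStore> {
    AmazonS3Builder::new()
        .with_endpoint(std::env::var("S3_ENDPOINT").expect("S3_ENDPOINT"))
        .with_access_key_id(std::env::var("S3_ACCESS_KEY").expect("S3_ACCESS_KEY"))
        .with_secret_access_key(std::env::var("S3_SECRET_KEY").expect("S3_SECRET_KEY"))
        .with_bucket_name(std::env::var("S3_BUCKET").expect("S3_BUCKET"))
        .with_allow_http(true) // needed for plain-http endpoints
        .build()
}

// One writer's attempt to commit the manifest key.
async fn try_commit(store: &dyn ObjectStore, body: Vec<u8>) -> object_store::Result<bool> {
    let key = Path::from("atomic_test/manifest.json");
    // PutMode::Create is sent as `If-None-Match: *`: the PUT succeeds only if the
    // key does not already exist, so exactly one racing writer can win.
    let opts = PutOptions { mode: PutMode::Create, ..Default::default() };
    match store.put_opts(&key, PutPayload::from(body), opts).await {
        Ok(_) => Ok(true),                             // winner: HTTP 200
        Err(Error::AlreadyExists { .. }) => Ok(false), // loser: HTTP 412, explicit, nothing silently lost
        Err(e) => Err(e),
    }
}

#[tokio::main]
async fn main() -> object_store::Result<()> {
    let store = s3_from_env()?;
    let won = try_commit(&store, b"writer-0 payload".to_vec()).await?;
    println!("commit {}", if won { "accepted (200)" } else { "rejected (412)" });
    Ok(())
}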

Environment variables

Variable        Required  Description
S3_ENDPOINT     yes       S3 endpoint URL
S3_ACCESS_KEY   yes       Access key
S3_SECRET_KEY   yes       Secret key
S3_BUCKET       yes       Bucket name
NUM_WRITERS     no        Concurrent writers (default 50)

How to run

export S3_ENDPOINT="http://<your-endpoint>"
export S3_ACCESS_KEY="<access-key>"
export S3_SECRET_KEY="<secret-key>"
export S3_BUCKET="wiki-bench"
export NUM_WRITERS=50

./target/release/lancedb-atomic-demo

The bucket must exist and be writable. The demo writes and then deletes a single key, atomic_test/manifest.json.

Sample output

  S3 Atomic Commit Demo  ·  50 concurrent writers  ·  same key

──────  TEST 1  ·  STANDARD S3 PUT  (no conditional headers)  ──────
  [W 00]  →  HTTP 200 OK
  [W 01]  →  HTTP 200 OK
  ... (all 50 return 200) ...

  HTTP 200 OK returned : 50 / 50   ← every writer "succeeded"
  Actually on S3       :  1        ← only one writer's data survived
  SILENTLY LOST        : 49        ← zero errors, zero warnings
  ✗  SILENT DATA CORRUPTION

──────  TEST 2  ·  CONDITIONAL PUT  (If-None-Match: *)  ──────
  [W 17]  →  HTTP 200 OK   ← WINNER
  [W 00]  →  HTTP 412 rejected
  ... (49 rejections) ...

  HTTP 200 OK returned :  1 / 50   ← exactly one winner
  HTTP 412 rejected    : 49 / 50   ← immediate, explicit failure
  Silently lost        :  0
  ✓  SAFE ATOMIC COMMIT

╔══════════════════════════════════════════════════════════════════╗
║         50 CONCURRENT WRITERS  ·  RACING TO COMMIT               ║
╠════════════════════════════════╦═════════════════════════════════╣
║  STANDARD S3 PUT               ║  CONDITIONAL PUT (ETag)         ║
║  Accepted       : 50 / 50      ║  Accepted       :  1 / 50       ║
║  Survived       :  1 / 50      ║  Rejected (412) : 49 / 50       ║
║  SILENTLY LOST  : 49           ║  Silently lost  :  0            ║
║  ✗  SILENT CORRUPTION          ║  ✓  SAFE ATOMIC COMMIT          ║
╚════════════════════════════════╩═════════════════════════════════╝
