A Rust library for streaming data sketches — compact data structures that give approximate answers (counts, distinct counts, percentiles) over data streams too large to store exactly.
- Fast. Up to 8–14× higher insertion throughput than comparable libraries on frequency sketches, 2–3× on cardinality sketches, and 2–4× on quantile sketches. Rust-native with no language-boundary overhead. See benchmarks.
- High coverage. Supports frequency, cardinality, quantile, and distribution sketches (`CountMin`, `Count`, `HyperLogLog`, `KLL`, `DDSketch`). Also includes algorithms not found in other libraries: `UnivMon` for estimating a broad class of streaming statistics (L1/L2 norms, entropy) in a single pass, `Hydra` for answering sketch queries over arbitrary subpopulations without per-group sketches, and `NitroBatch` for accelerating sketch updates through batching. Unique sketch frameworks for sliding windows (`ExponentialHistogram`) and subpopulation queries (`Hydra`).
- Easy to use. Most sketches provide a unified API style, while some (such as `KLL`) use `update`/`quantile`; the crate also offers typed inputs via `DataInput`, pluggable hashing via `SketchHasher`, and multi-sketch composition with shared hashing (`HashSketchEnsemble`).
| Goal | Sketch | When to pick it | What it does | Polars equivalent |
|---|---|---|---|---|
| Frequency estimation | `CountMin`, `Count` | Fast approximate counts for high-volume keys | Estimates how often each key appears in a stream | `df.group_by("key").agg(pl.len())` |
| Cardinality estimation | `HyperLogLog` (`Classic`, `ErtlMLE`, `HIP`) | Approximate distinct counts with bounded memory | Estimates the number of unique elements | `df["col"].n_unique()` |
| Quantiles / distribution | `KLL`, `DDSketch` | Percentile / latency summaries over streams | Approximates arbitrary quantiles (e.g. p50, p99) of a value distribution | `df["col"].quantile(0.99)` |
| Subpopulation queries | `Hydra` | Hierarchical / filtered sketch queries | Answers sketch queries over arbitrary subpopulations without maintaining per-group sketches | No direct equivalent — requires per-group aggregation |
| Universal monitoring | `UnivMon` | G-sum queries (L1/L2 norms, cardinality, entropy) | Estimates a broad class of streaming statistics in a single pass | No direct equivalent — requires custom multi-pass pipelines |
| Update acceleration | `NitroBatch` | Batch-accelerated sketch updates | Speeds up sketch insertions by batching updates | No direct equivalent |
Full sketch status and API details: APIs Index.
Minimum Supported Rust Version (MSRV): 1.85 (Rust 2024 edition)
This crate is not published on crates.io yet.
For now, install it from GitHub and pin a tag for a stable revision:
```toml
[dependencies]
asap_sketchlib = { git = "https://github.com/ProjectASAP/asap_sketchlib", tag = "v0.1.0" }
```

If you want the latest repository state instead, you can depend on the default branch:

```toml
[dependencies]
asap_sketchlib = { git = "https://github.com/ProjectASAP/asap_sketchlib", branch = "main" }
```

After the first crates.io release, installation will instead be:
```sh
cargo add asap_sketchlib
```

or, in `Cargo.toml`:

```toml
[dependencies]
asap_sketchlib = "0.1"
```

```rust
use asap_sketchlib::{ErtlMLE, HyperLogLog, DataInput};

// HyperLogLog estimates the number of distinct items in a stream using fixed memory.
// ErtlMLE is one of the HLL variants we offer — it tends to be more accurate than
// the `Classic` variant, especially at very low or very high cardinalities.
let mut hll = HyperLogLog::<ErtlMLE>::default();

// Insert some user IDs — HLL handles distinct counting and deduplicates items.
for user_id in [101, 202, 303, 101, 404, 202, 505, 101] {
    hll.insert(&DataInput::U64(user_id));
}

let unique_users = hll.estimate();
println!("estimated unique users: {unique_users}"); // ≈ 5
```
```rust
use asap_sketchlib::{CountMin, FastPath, Vector2D, DataInput};

// Count-Min Sketch estimates how often each item appears in a stream.
// It may over-count but never under-counts.
//
// Vector2D<i32> is the backing storage (a 2D array of 32-bit counters).
// FastPath uses a single hash with bit-masking to pick row indices — faster
// than the default RegularPath which hashes once per row.
let mut cms = CountMin::<Vector2D<i32>, FastPath>::with_dimensions(3, 2048);

// Simulate an event stream with known frequencies.
let events = [
    ("page_view", 1000),
    ("click", 500),
    ("signup", 100),
    ("purchase", 50),
];
for &(event, count) in &events {
    for _ in 0..count {
        cms.insert(&DataInput::Str(event));
    }
}

// Estimates are close to the true counts (CMS may over-count, but never under-counts).
for &(event, true_count) in &events {
    let est = cms.estimate(&DataInput::Str(event));
    println!("{event:>10}: estimate = {est}, true = {true_count}");
}
```
```rust
use asap_sketchlib::KLL;

// KLL is a quantile sketch — it tracks the distribution of values so you can
// ask questions like "what is the median?" without storing every data point.
let mut sketch = KLL::<f64>::default();

// Simulate 1000 latency samples in milliseconds
for i in 0..1000 {
    let ms = (i as f64) * 0.5 + 1.0;
    sketch.update(&ms);
}

let p50 = sketch.quantile(0.50);
let p99 = sketch.quantile(0.99);
println!("median ≈ {p50:.1} ms, p99 ≈ {p99:.1} ms");
```
```rust
use asap_sketchlib::{ErtlMLE, HyperLogLog, DataInput};

// Sketches are mergeable — you can build one per node and combine them later
// to get a global answer without shipping raw data.
let mut node_a = HyperLogLog::<ErtlMLE>::default();
let mut node_b = HyperLogLog::<ErtlMLE>::default();

// Each node sees different (and some overlapping) users
for id in [1, 2, 3, 4, 5] { node_a.insert(&DataInput::U64(id)); }
for id in [4, 5, 6, 7, 8] { node_b.insert(&DataInput::U64(id)); }

node_a.merge(&node_b);
println!("total unique users ≈ {}", node_a.estimate()); // ≈ 8
```

Several sketches address the same goal with different trade-offs — for example, `CountMin` vs `Count` for frequency, or `KLL` vs `DDSketch` for quantiles.
We are building SketchPlan, a profiler that analyzes a representative sample of your data and recommends the best sketch configuration (algorithm, memory budget, error tolerance) for your workload. Until SketchPlan is ready, the APIs Index lists guarantees, error bounds, and caveats for each sketch to help you decide.
Insertion throughput on 10M Zipf-distributed values, averaged over 10 runs:
- Frequency sketches: up to 8–14× higher insertion throughput than comparable libraries
- Cardinality sketches: roughly 2–3× higher insertion throughput
- Quantile sketches: roughly 2–4× higher insertion throughput
Benchmark methodology, tuning notes, and performance details (including cache-friendly layouts and FastPath single-hash mode) are in Performance Notes.
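As background for the FastPath single-hash mode mentioned above, the general trick can be sketched in standalone Rust: hash the key once, then carve the 64-bit result into one bit-masked index per row, instead of hashing once per row. This is a toy illustration with our own names (`row_indices` is not a crate API), not the crate's actual code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive one column index per row from a single 64-bit hash by
/// consuming `log2(width)` bits per row with a bit mask.
/// Requires `rows * log2(width) <= 64`.
fn row_indices(key: &str, rows: u32, width: usize) -> Vec<usize> {
    assert!(width.is_power_of_two());
    let bits = width.trailing_zeros();
    assert!(rows * bits <= 64);

    // Hash the key once.
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    let mut x = h.finish();

    (0..rows)
        .map(|_| {
            let idx = (x as usize) & (width - 1); // bit mask instead of `% width`
            x >>= bits; // consume those bits so the next row gets fresh ones
            idx
        })
        .collect()
}

fn main() {
    // Three rows of width 2048 consume 3 * 11 = 33 of the 64 hash bits.
    let idxs = row_indices("page_view", 3, 2048);
    assert_eq!(idxs.len(), 3);
    assert!(idxs.iter().all(|&i| i < 2048));
    println!("{idxs:?}");
}
```

This is why the mode needs power-of-two widths: the modulo becomes a mask, and one hash call serves every row, at the cost of the per-row indices no longer being fully independent.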
| Doc | Contents |
|---|---|
| APIs Index | Per-sketch API reference with status and error guarantees |
| Advanced Use Cases | Hierarchical queries, windowed sketching, multi-sketch coordination |
| Docs Index | Full documentation index |
If you are evaluating the crate for production use, start with the APIs Index: it calls out which APIs are stable today and which are still feature-gated or experimental.
```sh
cargo build --all-targets
cargo test --all-features
```

`--all-targets` builds everything: the library, binaries, and tests. `--all-features` enables every Cargo feature, so all feature-gated code is compiled and tested. The features include:

- `experimental` — enables sketches and APIs that are still under development and may change without notice.
- `octo-runtime` — enables the Octo multi-threaded runtime (pulls in `core_affinity` and `crossbeam-channel`).
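Downstream crates can opt into these features in their own `Cargo.toml`; a sketch using the git dependency form from the installation section:

```toml
[dependencies]
asap_sketchlib = { git = "https://github.com/ProjectASAP/asap_sketchlib", tag = "v0.1.0", features = ["experimental"] }
```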
To build or test with a specific feature:
```sh
cargo build --features experimental
cargo test --features "experimental octo-runtime"
```

This project compiles `.proto` files at build time via `prost-build` in `build.rs`.
The required Protocol Buffers compiler (`protoc`) is provided through the vendored
`protoc-bin-vendored` build dependency, so a separate system installation is usually
not needed on common development platforms.

If you prefer to use a system-installed compiler instead, that works too. Install
`protoc` with your platform package manager:
```sh
# macOS (Homebrew)
brew install protobuf

# Ubuntu / Debian
sudo apt-get update && sudo apt-get install -y protobuf-compiler

# Windows (Chocolatey)
choco install protoc
```

Verify installation:

```sh
protoc --version
```

If you need to override the compiler for a custom environment, set the `PROTOC`
environment variable to the path of your preferred `protoc` binary before running
`cargo build` or `cargo test`.
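For example (the path below is illustrative, not a required location):

```shell
# Point the build at a specific protoc binary for this invocation only.
PROTOC=/opt/protobuf/bin/protoc cargo build
```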
Consider Apache DataSketches instead if:

- You need its broader algorithm catalog (CPC, Theta/Tuple with set operators, REQ, VarOpt/Reservoir, FM85).
- You need cross-language binary compatibility with existing DataSketches deployments in Java, C++, or Python.
- You need long-running production maturity and an Apache-governed release cycle.
Copyright 2025–present ProjectASAP
Licensed under the MIT License. See LICENSE.