Sketchlib CountMin Support #215
Conversation
- Merged latest main (includes backend abstraction from PR #207)
- Set Count-Min Sketch to use sketchlib backend by default
- Keep KLL and Count-Min-With-Heap in legacy mode (not yet implemented)
- UDFs already correctly configured: CMS uses sketchlib
milindsrivastava1997
left a comment
Can you verify that the new asap-summary-ingest UDFs can be compiled by Arroyo? Use the validate_udfs.py script.
@milindsrivastava1997, the validation script does not yet handle the new implementation-mode template parameter, which was introduced to allow selecting either Sketchlib or the existing implementation. For now, I set the default to Sketchlib so validation can run without changing the validator script. After this update, validation passed in Sketchlib mode for the CMS UDFs with no errors. I still observed a separate failure in HydraKLL, which is unrelated to the changes in this PR. I will address that in the KLL PR, or open a dedicated follow-up PR if needed.
@milindsrivastava1997, please let me know if you need any other changes for this PR.
@GnaneshGnani, I will check tomorrow and let you know. Thanks.
@GnaneshGnani Can you please run the quick start in your branch and check that it works? Change the @zzylol pls help if needed.
@milindsrivastava1997, I ran quickstart on my branch and updated quickstart queries from quantile to count in Regenerated dashboards and launched quickstart cleanly. I verified
@GnaneshGnani Thank you for this. @zzylol pointed out that the quickstart uses pre-built images, so it will not actually be running the code in your branch. Sorry, I didn't realize this. For now, can you change the quickstart docker-compose.yml to not use the pre-built Do this for each asap image except
@milindsrivastava1997, I tested with the local images.
@GnaneshGnani LGTM. Pls fix conflicts and then we can merge. Thanks. |
Summary
Integrates sketchlib-rust as a Count-Min Sketch implementation, introducing an enum-based backend that allows switching between the legacy and sketchlib implementations at runtime. This PR adds the sketchlib-rust dependency and refactors Count-Min Sketch to support both backends.
Changes
New Files
- `asap-common/sketch-core/src/count_min_sketchlib.rs`
  - `SketchlibCms = CountMin<Vector2D<f64>, RegularPath>`
  - `new_sketchlib_cms()`: Create fresh sketch
  - `sketchlib_cms_from_matrix()`: Build from existing matrix
  - `matrix_from_sketchlib_cms()`: Convert to legacy format
  - `sketchlib_cms_update()`: Update sketch with weighted key
  - `sketchlib_cms_query()`: Query frequency estimate
Modified Files
- `asap-common/sketch-core/Cargo.toml`
  - `sketchlib-rust` dependency from GitHub
- `asap-common/sketch-core/src/lib.rs`
  - `pub mod count_min_sketchlib;`
- `asap-common/sketch-core/src/count_min.rs`
  - `CountMinBackend` enum (Legacy/Sketchlib)
  - `WireFormat` struct for serialization
  - `CountMinSketch`:
    - `sketch: Vec<Vec<f64>>` field to `backend: CountMinBackend`
    - `sketch()` method to get matrix view
    - `sketch_mut()` method (returns `Some` only for Legacy)
    - `from_legacy_matrix()` constructor for deserialization
    - `use_sketchlib_for_count_min()` at construction time
    - `new()`, `update()`, `query_key()`, `merge()`
- `asap-query-engine/src/main.rs`
  - `use sketch_core::config::{self, ImplMode};`
  - `sketch_cms_impl`, `sketch_kll_impl`, `sketch_cmwh_impl`
  - `config::configure()` at startup before any sketch operations
- `asap-query-engine/src/lib.rs`
  - `sketchlib-tests` feature
- `asap-query-engine/src/precompute_operators/count_min_sketch_accumulator.rs`
  - `CountMinSketch::from_legacy_matrix()`
  - `sketch()` method instead of direct field
- `asap-summary-ingest/templates/udfs/countminsketch_count.rs.j2`
  - `sketchlib-rust` dependency
  - `ImplMode` enum and `IMPL_MODE` constant (set to `Sketchlib`)
  - `SketchlibCms` with integer counters
- `asap-summary-ingest/templates/udfs/countminsketch_sum.rs.j2`
- `asap-query-engine/Cargo.toml`
  - `sketchlib-tests` feature flag
  - `ctor` dev dependency
- `asap-common/sketch-core/src/config.rs`
- `asap-query-engine/tests/test_both_backends.rs`
- `asap-common/sketch-core/src/bin/sketchlib_fidelity.rs`
- `asap-common/sketch-core/report.md`
- `Cargo.lock`
Technical Approach
Backend Abstraction Pattern
All operations dispatch through the backend enum, so calls resolve statically (no trait objects or vtable lookups); the only added runtime cost is the branch on the enum variant.
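The dispatch pattern described here can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `StandInSketch` is a hypothetical placeholder for the sketchlib-rust type (its real API is not reproduced), the `column` helper uses the standard library hasher purely for demonstration, and the constructor signature taking an `ImplMode` is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Which implementation to use; the name mirrors `ImplMode` from the PR.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ImplMode {
    Legacy,
    Sketchlib,
}

/// Hypothetical stand-in for the sketchlib-rust sketch type.
#[derive(Clone, Debug)]
pub struct StandInSketch {
    cells: Vec<f64>, // row-major storage, rows * cols entries
    cols: usize,
}

#[derive(Clone, Debug)]
pub enum CountMinBackend {
    Legacy(Vec<Vec<f64>>),
    Sketchlib(StandInSketch),
}

#[derive(Clone, Debug)]
pub struct CountMinSketch {
    rows: usize,
    cols: usize,
    backend: CountMinBackend,
}

/// Column index for `key` in `row` (illustrative hashing only).
fn column(key: &str, row: usize, cols: usize) -> usize {
    let mut h = DefaultHasher::new();
    row.hash(&mut h);
    key.hash(&mut h);
    (h.finish() % cols as u64) as usize
}

impl CountMinSketch {
    /// The backend is chosen once, at construction time.
    pub fn new(rows: usize, cols: usize, mode: ImplMode) -> Self {
        let backend = match mode {
            ImplMode::Legacy => CountMinBackend::Legacy(vec![vec![0.0; cols]; rows]),
            ImplMode::Sketchlib => CountMinBackend::Sketchlib(StandInSketch {
                cells: vec![0.0; rows * cols],
                cols,
            }),
        };
        Self { rows, cols, backend }
    }

    /// Add `weight` to one cell per row; the `match` is the only dispatch.
    pub fn update(&mut self, key: &str, weight: f64) {
        for r in 0..self.rows {
            let c = column(key, r, self.cols);
            match &mut self.backend {
                CountMinBackend::Legacy(m) => m[r][c] += weight,
                CountMinBackend::Sketchlib(s) => s.cells[r * s.cols + c] += weight,
            }
        }
    }

    /// Count-Min estimate: the minimum over the key's cell in each row.
    pub fn query_key(&self, key: &str) -> f64 {
        (0..self.rows)
            .map(|r| {
                let c = column(key, r, self.cols);
                match &self.backend {
                    CountMinBackend::Legacy(m) => m[r][c],
                    CountMinBackend::Sketchlib(s) => s.cells[r * s.cols + c],
                }
            })
            .fold(f64::INFINITY, f64::min)
    }
}
```

Callers only see `CountMinSketch`; the variant is an internal detail, which is what lets the legacy path stay available while the new backend is rolled out.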
Wire Format Compatibility
The msgpack serialization format remains unchanged:
Both backends serialize to and deserialize from this common format, ensuring UDF-to-QueryEngine compatibility.
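A minimal sketch of how two backends might share one wire representation is shown below. The field names in `WireFormat` here are hypothetical (the PR's actual layout is not reproduced in this description), `Backend::Flat` is a stand-in for a sketchlib-backed sketch, and the msgpack encoding step itself is omitted.

```rust
/// Hypothetical shared wire representation: a plain row-by-column matrix.
#[derive(Clone, Debug, PartialEq)]
pub struct WireFormat {
    pub rows: usize,
    pub cols: usize,
    pub sketch: Vec<Vec<f64>>,
}

#[derive(Clone, Debug)]
pub enum Backend {
    /// Legacy storage: one Vec per row.
    Legacy(Vec<Vec<f64>>),
    /// Stand-in for a sketchlib-backed sketch: flat row-major storage.
    Flat { cells: Vec<f64>, cols: usize },
}

impl Backend {
    /// Convert either backend into the common matrix form.
    pub fn to_wire(&self) -> WireFormat {
        let sketch: Vec<Vec<f64>> = match self {
            Backend::Legacy(m) => m.clone(),
            Backend::Flat { cells, cols } => cells
                .chunks((*cols).max(1)) // guard against a zero chunk size
                .map(|row| row.to_vec())
                .collect(),
        };
        let cols = sketch.first().map_or(0, |row| row.len());
        WireFormat { rows: sketch.len(), cols, sketch }
    }

    /// Rebuild a backend from the common form; `flat` selects the variant.
    pub fn from_wire(wire: &WireFormat, flat: bool) -> Backend {
        if flat {
            Backend::Flat {
                cells: wire.sketch.iter().flatten().copied().collect(),
                cols: wire.cols,
            }
        } else {
            Backend::Legacy(wire.sketch.clone())
        }
    }
}
```

Because both variants round-trip through the same matrix, a producer and a consumer can run different backends as long as they agree on this intermediate form.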
Hash Function Difference
Note: QueryEngineRust uses `xxhash-rust::xxh32` while Arroyo UDF templates historically used `twox-hash::XxHash32`. The UDF templates now use sketchlib-rust's internal hashing (MurmurHash3), which matches neither. This is acceptable as the sketches are probabilistic and all hash functions provide good distribution.
Testing
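Given the hash difference noted above, one property that may be worth covering in tests is that a Count-Min matrix is only meaningful when queried with the same hash family that populated it. The sketch below illustrates this under stated assumptions: the `family` seed is a hypothetical way to model two different hash functions, and FNV-1a is used as a stand-in (it is neither xxh32 nor MurmurHash3).

```rust
/// FNV-1a over the key bytes, with `seed` folded into the starting state.
/// Stand-in only; not the hashing used by any of the libraries named above.
fn fnv1a(key: &[u8], seed: u64) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325 ^ seed;
    for &b in key {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Column for `key` in `row`, under the hash family identified by `family`.
fn column(key: &str, row: usize, cols: usize, family: u64) -> usize {
    let seed = family.wrapping_mul(31).wrapping_add(row as u64);
    (fnv1a(key.as_bytes(), seed) % cols as u64) as usize
}

pub struct Sketch {
    cols: usize,
    cells: Vec<Vec<f64>>,
}

impl Sketch {
    pub fn new(rows: usize, cols: usize) -> Self {
        Self { cols, cells: vec![vec![0.0; cols]; rows] }
    }

    /// Writer side: increments one cell per row using `family`.
    pub fn update(&mut self, key: &str, weight: f64, family: u64) {
        for (r, row) in self.cells.iter_mut().enumerate() {
            row[column(key, r, self.cols, family)] += weight;
        }
    }

    /// Reader side: minimum over the cells selected by `family`.
    pub fn query(&self, key: &str, family: u64) -> f64 {
        self.cells
            .iter()
            .enumerate()
            .map(|(r, row)| row[column(key, r, self.cols, family)])
            .fold(f64::INFINITY, f64::min)
    }
}
```

With a single key inserted under family 1, querying with family 1 returns the inserted weight exactly, while querying with another family reads whatever cells that family happens to select. A round-trip test that writes with one backend and reads with the other would catch any such mismatch.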
Fidelity Results
CountMinSketch achieves near-identical accuracy between Legacy and sketchlib-rust:
See `asap-common/sketch-core/report.md` for detailed results.