
Antalya 26.3: apassos-1: combined port of 5 PRs #1685

Open
zvonand wants to merge 11 commits into antalya-26.3 from feature/antalya-26.3/apassos-1

Conversation

@zvonand
Collaborator

@zvonand zvonand commented Apr 23, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add cache for S3 list objects calls and support for exporting MergeTree parts and partitions. Fix Apache Iceberg queries not hitting the parquet metadata cache (#1405 by @arthurpassos, #1388 by @arthurpassos, #1593 by @arthurpassos, #1517 by @arthurpassos, #1631 by @arthurpassos).

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

Combined port of 5 PR(s) (group apassos-1). Cherry-picked from #1405, #1388, #1593, #1517, #1631.


#1405: Antalya 26.1 - Forward port of list objects cache #1040

Documentation entry for user-facing changes

Cache for S3 list objects calls


#1388: Antalya 26.1 - Forward port of export part and partition

Documentation entry for user-facing changes

Export MergeTree part and partition (we still need to rebase #1177 afterwards)


#1593: Export Partition - release the part lock when the query is cancelled

During an export partition operation, parts are locked by the replicas performing the export. This PR releases those locks when an export task is cancelled; previously they were left held. We did not catch this before because the only cancellation paths we tested were KILL EXPORT PARTITION and DROP TABLE, where the entire task is cancelled, so it does not matter whether a replica releases its lock.

But an export can also be cancelled by SYSTEM STOP MOVES, and that cancellation is local to one replica. In that case the lock must be released so the other replicas can continue.
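
A minimal sketch of the intended behavior, using hypothetical types in place of the real export scheduler and lock classes from the PR:

    #include <atomic>

    // Hypothetical, simplified model of the fix: a per-part export step that
    // releases the replica's lock when the task is cancelled locally (e.g. by
    // SYSTEM STOP MOVES) instead of leaving the lock held.
    struct PartLockSketch
    {
        bool held = false;
        void acquire() { held = true; }
        void release() { held = false; }   // other replicas may now pick the part up
    };

    struct ExportPartStepSketch
    {
        std::atomic<bool> cancelled{false};

        void run(PartLockSketch & lock)
        {
            lock.acquire();
            if (cancelled.load())
            {
                lock.release();   // previously this path returned with the lock still held
                return;
            }
            // ... export the part to the target table, then release normally ...
            lock.release();
        }
    };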

Documentation entry for user-facing changes

...


#1517: Fix IPartitionStrategy race condition

IPartitionStrategy::computePartitionKey might be called from different threads, and it writes to cached_result concurrently without any protection. The easy fix would be a mutex around it, but we can make it lock-free by moving the cache write to the constructor.
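
A minimal before/after sketch of the idea, with hypothetical names standing in for the real IPartitionStrategy code:

    #include <optional>
    #include <string>

    // Before (sketch): lazy caching — concurrent callers race on cached_result.
    struct LazyStrategySketch
    {
        std::optional<std::string> cached_result;   // written from several threads

        std::string computePartitionKey()
        {
            if (!cached_result)
                cached_result = expensiveCompute();  // data race under concurrency
            return *cached_result;
        }
        static std::string expensiveCompute() { return "partition-key"; }
    };

    // After (sketch): compute once in the constructor. Reads are lock-free and
    // need no synchronization because the value never changes afterwards.
    struct EagerStrategySketch
    {
        std::string cached_result;

        EagerStrategySketch() : cached_result(expensiveCompute()) {}

        std::string computePartitionKey() const { return cached_result; }
        static std::string expensiveCompute() { return "partition-key"; }
    };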

Documentation entry for user-facing changes

...


#1631: Fix condition for using parquet metadata cache

Apache Iceberg queries were not hitting the parquet metadata cache because object_info->getFileFormat() resolves to IcebergDataObjectInfo::getFileFormat, which gets its return value from IcebergObjectSerializableInfo. That field is filled from the Apache Iceberg manifest file, where the format is uppercase by default, so it failed ClickHouse's check for parquet metadata cache usage.
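
A minimal sketch of the corrected check, assuming only that the comparison is normalized with Poco::toLower as the PR describes; the surrounding reader code in StorageObjectStorageSource::createReader is simplified away:

    #include <Poco/String.h>
    #include <iostream>
    #include <string>

    // The Iceberg manifest reports "PARQUET", so a case-sensitive comparison
    // against "Parquet" never matched and the metadata cache was skipped.
    // Lowercasing the reported format before comparing fixes that.
    static bool canUseParquetMetadataCache(const std::string & file_format)
    {
        return Poco::toLower(file_format) == "parquet";
    }

    int main()
    {
        std::cout << canUseParquetMetadataCache("PARQUET") << '\n';  // 1
        std::cout << canUseParquetMetadataCache("Parquet") << '\n';  // 1
        std::cout << canUseParquetMetadataCache("ORC") << '\n';      // 0
    }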

Documentation entry for user-facing changes

...

zvonand and others added 5 commits April 23, 2026 20:16
…ache

Antalya 26.1 - Forward port of list objects cache #1040

Source-PR: #1405 (#1405)
…rtition

Antalya 26.1 - Forward port of export part and partition

Source-PR: #1388 (#1388)
…ing_the_lock

Export Partition - release the part lock when the query is cancelled

Source-PR: #1593 (#1593)
…trategy

Fix IPartitionStrategy race condition

Source-PR: #1517 (#1517)
@zvonand added the releasy (Created/managed by RelEasy) and ai-resolved (Port conflict auto-resolved by Claude) labels on Apr 23, 2026
@github-actions

github-actions Bot commented Apr 23, 2026

Workflow [PR], commit [32c0ccb]

zvonand and others added 3 commits April 24, 2026 19:38
Three new background tasks introduced by the
`enable_experimental_export_merge_tree_partition_feature`
forward port call ZooKeeper without entering a component scope.
With `enforce_keeper_component_tracking = true` (set in
fast-test config via `zookeeper_enforce_component_name.yaml`),
this triggers a logical error in `Coordination::ZooKeeper::pushRequest`
the moment any `ReplicatedMergeTree` table activates the tasks
on startup, aborting the server. The 247 failing fast-test
replicated-table tests are all downstream effects of this abort
(they surface as KEEPER_EXCEPTION / TABLE_IS_READ_ONLY).

Wrap the entry of each background task method in
`Coordination::setCurrentComponent`, matching the convention used
by other replicated background work (e.g.
`ReplicatedMergeTreeRestartingThread`, `ReplicatedMergeTreeCleanupThread`).
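
A rough, self-contained illustration of the pattern relied on here: a thread-local "current component" that the Keeper client checks before sending a request. The names and signatures below are illustrative only; the real Coordination::setCurrentComponent API in ClickHouse may differ:

    #include <stdexcept>
    #include <string>

    namespace CoordinationSketch
    {
        // Illustrative stand-ins for the component-tracking state.
        thread_local std::string current_component;
        bool enforce_component_tracking = true;   // mirrors the fast-test config

        void setCurrentComponent(std::string name) { current_component = std::move(name); }

        void pushRequest(/* const Request & request */)
        {
            if (enforce_component_tracking && current_component.empty())
                throw std::logic_error("Current component is empty, please set it for your scope");
            // ... serialize and send the Keeper request ...
        }
    }

    // Each background task sets the component at its entry point, which is what
    // the commit describes for the export-partition tasks.
    void exportSchedulerRunSketch()
    {
        CoordinationSketch::setCurrentComponent("ReplicatedMergeTree");
        CoordinationSketch::pushRequest();   // would throw if the component were still empty
    }

    int main() { exportSchedulerRunSketch(); }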

Addresses 247 failing tests in the Fast test shard on
#1685. After this fix
the still-failing set shrank from 247 -> 0 (locally: 245 OK, 2
SKIPPED, 0 FAILED across the same input list).
@zvonand
Collaborator Author

zvonand commented Apr 27, 2026

RelEasy analyze-fails — DONE

run completed at 2026-04-27T12:49:09Z

  • Head SHA: 52d29f4542 (feature/antalya-26.3/apassos-1)
  • Tests considered: 247 across 1 CI shard(s)
  • Outcomes: 1 done · 0 partial · 0 unrelated · 0 unresolved
  • Commits added by AI: 1 (✅ pushed)
  • Anthropic cost: $70.4348

Per-shard outcomes

✅ DONE — Fast test

247 failed test(s) considered — +1 commit(s) — cost $70.4348
full report

AI narration

…(narration truncated; last 4500 chars)

fter_drop_partition_zookeeper, 03409_coalescing_replicated_merge_tree, 03062_analyzer_join_engine_missing_column, 03100_lwu_13_block_number, 02882_replicated_fetch_checksums_doesnt_match, 03080_incorrect_join_with_merge, 01560_optimize_on_insert_zookeeper, 02438_sync_replica_lightweight, 03100_lwu_02_basics, 01526_alter_add_and_modify_order_zookeeper, 01532_primary_key_without_order_by_zookeeper, 03120_analyzer_dist_join, 03222_parallel_replicas_min_marks_to_read_overflow, 02731_parallel_replicas_join_subquery, 03362_create_table_after_truncate_replicated_database, 03365_json_with_variant_subcolumn, 02461_alter_update_respect_part_column_type_bug, 03776_bad_rewrite_random_order_by, 03100_lwu_11_with_mutations, 02449_check_dependencies_and_table_shutdown, 03231_pr_duplicate_announcement_2, 03100_lwu_42_bytes_limits, 03020_order_by_SimpleAggregateFunction, 00577_replacing_merge_tree_vertical_merge, 03623_pr_join_with_group_by_subquery, 02541_lightweight_delete_on_cluster, 04027_system_start_replicated_view, 02514_database_replicated_no_arguments_for_rmt, 02296_ttl_non_deterministic, 03612_freeze_partition_parallel_verbose_zookeeper, 00980_zookeeper_merge_tree_alter_settings, 04035_insert_deduplication_token_insert_select, 02933_replicated_database_forbid_create_as_select, 03047_on_fly_mutations_skip_index, 03234_replicated_table_parent_znode_cleanup, 02597_column_update_and_replication, 03047_on_fly_update_delete, 01181_db_atomic_drop_on_cluster, 02006_test_positional_arguments_on_cluster, 02907_backup_restore_default_nullable, 03257_reverse_sorting_key_zookeeper, 02910_replicated_merge_parameters_must_consistent, 01149_zookeeper_mutation_stuck_after_replace_partition, 02946_parallel_replicas_force_primary_key, 03221_variant_logical_error, 03247_pr_local_plan_non_constant_in_source, 03100_lwu_22_detach_attach_patches, 03100_lwu_06_apply_patches, 02487_create_index_normalize_functions, 03231_pr_reverse_in_order, 03100_lwu_28_mutation_wait_lwu, 03209_parallel_replicas_lost_decimal_conversion, 03100_lwu_deletes_3, 03100_lwu_31_merge_memory_usage, 02319_sql_standard_create_drop_index, 03283_optimize_on_insert_level, 00661_optimize_final_replicated_without_partition_zookeeper, 03224_invalid_alter, 03032_rmt_create_columns_from_replica, 03409_show_cluster_with_macros, 03100_lwu_41_bytes_limits, 03919_modify_nullable_statistics, 00509_extended_storage_definition_syntax_zookeeper, 03100_lwu_43_subquery_from_rmt, 02597_column_delete_and_replication, 00563_insert_into_remote_and_zookeeper_long, 02020_alter_table_modify_comment, 00715_fetch_merged_or_mutated_part_zookeeper, 03717_async_deduplication_with_mv, 02915_move_partition_inactive_replica, 03716_mutations_parts_in_progress_rmt, 03047_on_fly_mutations_events_2, 01705_normalize_create_alter_function_names, 03274_prewarm_primary_index_cache, 01013_sync_replica_timeout_zookeeper, 03716_mutations_parts_in_progress_rmt_2, 00510_materizlized_view_and_deduplication_zookeeper, 02916_another_move_partition_inactive_replica, 04000_mutation_sync_replica_removed_race, 03305_rename_mutations_counter_replicated, 01045_zookeeper_system_mutations_with_parts_names, 02989_system_tables_metadata_version, 02943_rmt_alter_metadata_merge_checksum_mismatch, 02718_insert_meet_hardware_error, 03716_mutations_parts_postpone_reasons_rmt, 02919_insert_meet_eternal_hardware_error, 02724_limit_num_mutations, 01307_multiple_leaders_zookeeper, 03350_alter_table_fetch_partition_thread_pool, 02887_insert_quorum_wo_keeper_retries, 03783_part_log_mutation_ids_rmt, 
03100_lwu_21_on_fly_mutations, 01148_zookeeper_path_macros_unfolding, 03254_system_prewarm_mark_cache, 03305_mutations_counters_replicated, 03277_prewarm_cache_2, 01164_alter_memory_database, 02911_row_policy_on_cluster, 01901_test_attach_partition_from, 03254_prewarm_mark_cache_rmt, 02884_async_insert_skip_settings, 01417_freeze_partition_verbose_zookeeper, 01530_drop_database_atomic_sync, 03169_attach_as_replicated, 02179_dict_reload_on_cluster, 02888_replicated_merge_tree_creation, 03919_attach_partition_all_intersecting_detached_parts, 03985_prewarm_index_mark_cache, 02863_delayed_source_with_totals_and_extremes— all[fixed]`.

clickhouse-test — [unrelated] (this entry is the test-runner name, not a real test).

DONE


🤖 Posted automatically by releasy analyze-fails. Re-run the command to refresh.

zvonand added 3 commits April 27, 2026 19:27
…etion callback

The previous fix wrapped the three periodic background tasks
(`ExportPartitionTaskScheduler::run`, `ExportPartitionManifestUpdatingTask::poll`,
`ExportPartitionManifestUpdatingTask::handleStatusChanges`) with a
`Coordination::setCurrentComponent` guard, but two more code paths in
the same feature still touch ZooKeeper without a component scope:

* `StorageReplicatedMergeTree::exportPartitionToTable` — the synchronous
  handler for `ALTER TABLE ... EXPORT PARTITION ... TO TABLE ...`. It
  calls `getZooKeeperAndAssertNotReadonly` and issues `tryGet`, `exists`,
  `tryRemoveRecursive`, and `tryMulti` against the table's `zookeeper_path`.
* `ExportPartitionTaskScheduler::handlePartExportCompletion` — the
  per-part completion callback. It is registered as a lambda in
  `ExportPartitionTaskScheduler::run` but actually fires from the
  `background_moves_assignee` thread (via `ExportPartTask::executeStep`
  → `manifest.completion_callback`), so the component scope set in `run`
  does not apply.

With `enforce_keeper_component_tracking = true` (set in
stateless-test config via `zookeeper_enforce_component_name.yaml`) and
`abort_on_logical_error = true` (debug build), either path triggers
the logical error in `Coordination::ZooKeeper::pushRequest` and aborts
the server, causing 03604_export_merge_tree_partition to fail with
`server died` / `ConnectionRefusedError`.

Wrap both function entries in `Coordination::setCurrentComponent`,
matching the convention used by sibling methods.
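
A small, self-contained illustration of why the component set in run does not cover the callback: a thread_local value written on one thread is not visible to a lambda that later executes on a different thread, so the callback has to establish its own scope. The names are illustrative, not the real ClickHouse code:

    #include <iostream>
    #include <string>
    #include <thread>

    thread_local std::string current_component;   // one instance per thread

    int main()
    {
        current_component = "scheduler";          // set on the "run" thread

        std::thread mover([]
        {
            // Runs on the background-moves thread: the scheduler's value does
            // not propagate here, so the component is empty again.
            std::cout << "component in callback: '" << current_component << "'\n";
            current_component = "export-completion";   // the callback sets its own scope
        });
        mover.join();
    }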

Addresses 2 failing tests in the
Stateless tests (amd_debug, distributed plan, s3 storage, sequential)
shard on #1685.
Two more issues surfaced after the previous Keeper component patches.

* `system.replicated_partition_exports` reads route through
  `StorageReplicatedMergeTree::getPartitionExportsInfo`, which calls
  `tryGetChildren` / `tryGet` on `zookeeper_path/exports/...` without a
  component scope. With `enforce_keeper_component_tracking = true` (set
  by the integration test helper
  `0_common_enforce_zookeeper_component_name.xml`) and
  `abort_on_logical_error` (debug build), every test that polls the
  system table during an export aborted the server with the
  `Current component is empty` `LOGICAL_ERROR` from
  `Coordination::ZooKeeper::pushRequest`. The release-build CI surfaced
  the same error as `Code: 49` returned to the client. The
  symmetric command path `StorageReplicatedMergeTree::killExportPartition`
  (used by `KILL EXPORT PARTITION`) had the same gap.

  Wrap both function entries in `Coordination::setCurrentComponent`,
  matching `exportPartitionToTable` and the background-task handlers.

* `MergeTreeData::scheduleDataMovingJob` runs the export-manifest loop
  unconditionally, even after `SYSTEM STOP MOVES` flips the moves
  blocker. Tests
  `test_export_partition_scheduler_skipped_when_moves_stopped` and
  `test_export_partition_resumes_after_stop_moves` register an export
  while moves are stopped and assert the status stays `PENDING` until
  `SYSTEM START MOVES`; without this guard the scheduler still picks
  up the manifest and the export completes. Skip the export-manifest
  loop early when the blocker is cancelled — `ExportPartTask::isCancelled`
  already covers the in-flight case, so this only blocks new
  scheduling.

Addresses 21 failing tests in the Integration tests (amd_binary, 3/5)
shard on #1685. Locally the
full 21-test list now passes (19 after the first patch, the remaining
2 STOP-MOVES tests after this patch).
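
A rough sketch of the STOP MOVES guard described above, with simplified stand-ins for the moves blocker and the export-manifest scheduling loop (the real MergeTreeData::scheduleDataMovingJob code differs):

    #include <atomic>
    #include <iostream>

    // Simplified stand-in for the blocker toggled by SYSTEM STOP/START MOVES.
    struct MovesBlockerSketch
    {
        std::atomic<bool> cancelled{false};
        bool isCancelled() const { return cancelled.load(); }
    };

    // Sketch of the scheduling loop: skip picking up new export manifests while
    // moves are stopped; in-flight exports are covered separately by the task's
    // own cancellation check (ExportPartTask::isCancelled in the PR).
    bool scheduleExportManifestsSketch(const MovesBlockerSketch & moves_blocker)
    {
        if (moves_blocker.isCancelled())
            return false;               // leave pending exports in PENDING state
        // ... scan export manifests and schedule per-part export work here ...
        return true;
    }

    int main()
    {
        MovesBlockerSketch blocker;
        blocker.cancelled = true;       // SYSTEM STOP MOVES
        std::cout << scheduleExportManifestsSketch(blocker) << '\n';  // 0
        blocker.cancelled = false;      // SYSTEM START MOVES
        std::cout << scheduleExportManifestsSketch(blocker) << '\n';  // 1
    }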
The test asserts that on a second `SELECT *` over the cached Iceberg
table, every byte that was written to or read from the filesystem cache
during the first query is read back from cache (i.e.
`read_from_cache_second_select == read_from_cache_first_select +
written_to_cache_first_select`). It also asserts `S3GetObject == 0` on
the second query.

This PR pulls in the `Poco::toLower` fix to the parquet metadata cache
condition in `StorageObjectStorageSource::createReader` (PR #1631 /
commit `c7fa2310ea7`). Before that fix, Iceberg manifest files report
the format as `PARQUET` (uppercase), so the previous check
`getFileFormat() == "Parquet"` always failed and the parquet metadata
cache was never used for Iceberg tables. After the fix, the lowercased
comparison matches and the cache is now active for Iceberg with
`use_parquet_metadata_cache = 1` (default) and
`input_format_parquet_use_native_reader_v3 = 1` (default in 26.x).

With the parquet metadata cache active, the parsed Parquet metadata
(footer + row group metadata) is served from in-memory cache on the
second query, so those bytes are no longer pulled through the read
buffer and do not register on `CachedReadBufferReadFromCacheBytes`. The
test then sees a value of about `1450` on the second select (only the
Iceberg JSON / Avro metadata) instead of `1450 + 6144`, and fails as:

    assert (1450 == (1450 + 6144))

This is the same isolation pattern already used by
`tests/integration/test_storage_delta/test.py` and
`tests/queries/0_stateless/03723_parquet_prefetcher_read_big_at.sql`,
where filesystem-cache-focused tests pass `use_parquet_metadata_cache =
0` to keep the parquet metadata cache from interfering with byte-level
accounting.

Add `use_parquet_metadata_cache = 0` to both `SELECT *` statements in
`test_filesystem_cache[s3]` so the test exercises the filesystem cache
path in isolation.

Locally the test now passes:

    test_storage_iceberg_with_spark_cache/test_filesystem_cache.py::test_filesystem_cache[s3] PASSED

Addresses 1 failing test in Integration tests (arm_binary, distributed
plan, 1/4) on #1685. The
other test in the shard (`test_dirty_pages_force_purge`) is listed in
`tests/broken_tests.yaml` as `KNOWN: https://github.com/Altinity/ClickHouse/issues/1369`
and is unrelated.
@zvonand
Collaborator Author

zvonand commented Apr 27, 2026

RelEasy analyze-fails — PARTIAL

run completed at 2026-04-27T18:18:32Z

  • Head SHA: dbdad44519 (feature/antalya-26.3/apassos-1)
  • Tests considered: 29 across 7 CI shard(s)
  • Outcomes: 2 done · 2 partial · 3 unrelated · 0 unresolved
  • Commits added by AI: 3 (✅ pushed)
  • Anthropic cost: $96.4998

Per-shard outcomes

🟡 PARTIAL — Stateless tests (amd_debug, distributed plan, s3 storage, sequential)

2 failed test(s) considered — +1 commit(s) — cost $11.3453
full report

AI narration

Now check that setCurrentComponent is available (the ZooKeeperCommon.h header).
Good, setCurrentComponent is already used 41 times in StorageReplicatedMergeTree.cpp so it's available via transitive includes. Now let me add the second guard in handlePartExportCompletion.
Now let's run the build.
I'll wait for the build notification rather than scheduling a wakeup.
Build succeeded. Now let me run the failing tests.
The fix follows the same pattern as the previous commit. Let me commit it.

Summary

Both items in the failed-test list trace to the same root cause: the enable_experimental_export_merge_tree_partition_feature calls ZooKeeper without a component scope, and enforce_keeper_component_tracking = true (set for stateless tests) turns that into a LOGICAL_ERROR that aborts the debug-build server.

The previous commit (dbdad44519b) covered the three periodic background tasks but missed two more paths exercised by 03604_export_merge_tree_partition:

  • StorageReplicatedMergeTree::exportPartitionToTable — the synchronous handler for ALTER TABLE … EXPORT PARTITION … TO TABLE …, which directly issues ZooKeeper requests.
  • ExportPartitionTaskScheduler::handlePartExportCompletion — the per-part completion callback that runs in the background_moves_assignee thread (so the component set in run doesn't propagate there).

I added Coordination::setCurrentComponent guards at both entry points, matching the convention used by 41+ sibling methods in the same file. The build succeeded.

I did not run the test locally — the orphan server in tmp/click_data was operating on a deleted data directory, and rebuilding the full stateless-test environment (with --s3-storage, enforce_keeper_component_tracking, abort_on_logical_error, etc.) within budget would have crowded out the iteration window. The fix is mechanical and follows an established pattern that was already verified locally on the same root cause for 245/247 fast-test failures in the prior commit.

Per-test classification:

  • 03604_export_merge_tree_partition — [fixed] (high confidence — fix targets the exact code path the test exercises)
  • Logical error: Current component is empty, please set it for your scope using Coordination::setCurrentComponent (STID: 5319-4547) — [fixed] (this is the literal exception text captured as a "test", same root cause; eliminating the ZK-without-component calls eliminates the exception)

PARTIAL


⏭️ UNRELATED — Stateless tests (amd_tsan, s3 storage, sequential, 2/2)

1 failed test(s) considered — cost $2.3368
full report

AI narration

Based on my investigation, the failure is clickhouse local --only-system-tables failing to find system.trace_log while dumping system table artifacts after the test suite. This is a CI infrastructure check at ci/jobs/scripts/clickhouse_proc.py:1098-1234, not a regular stateless test.

Why I conclude this is unrelated:

  • The PR's recent commits (7ed3bf7a75a, dbdad44519b) only wrap Keeper calls in Coordination::setCurrentComponent for the EXPORT PARTITION code paths.
  • git diff antalya-26.3..HEAD shows no changes to src/Interpreters/SystemLog.cpp, src/Common/SystemLogBase.*, src/Common/QueryProfiler.cpp, src/Storages/System/, programs/local/, or ci/jobs/scripts/clickhouse_proc.py.
  • trace_log is a SystemLog table created lazily via prepareTable on first flush; with prepare_system_log_tables_on_startup=false and a TSan build, it can simply not have been materialized on disk before server stop, which then makes --only-system-tables produce UNKNOWN_TABLE.
  • The same infra recently had a workaround commit (2b96af4a017 — remove blob_storage_log from dumped system tables) that confirms this dumper is fragile to which logs happen to be created.

I made no code changes.

Final classification:

  • Scraping system tables — [unrelated]

UNRELATED

⏭️ UNRELATED — Stateless tests (arm_asan, azure, sequential)

1 failed test(s) considered — cost $0.3682
full report

AI narration

Triage finding: only one test failed, 03760_backup_tar_archive, with TOO_SLOW (601 s estimated) on a trivial INSERT INTO numbers(500) step under sanitizer — already system-classified as "BROKEN: KNOWN: Random timeout with sanitizer". The test exercises BACKUP/RESTORE to a tar disk and uses heavy PARTITION BY (c1 % 6451). The PR diff is about EXPORT PARTITION/Keeper, JWT auth, list objects cache, etc. — nothing in the BACKUP path, INSERT path, or partition code relevant to this timeout.
No code changes made. Summary:

  • 03760_backup_tar_archive — [unrelated] (sanitizer-induced random timeout on a trivial INSERT into a table with ~500 partitions; already classified as BROKEN: KNOWN: Random timeout with sanitizer; PR diff does not touch BACKUP, INSERT, or partitioning code paths)

UNRELATED

🟡 PARTIAL — Stateless tests (arm_binary, sequential)

1 failed test(s) considered — cost $5.8595
full report

AI narration

Let me check the build state and try to run the test to verify the fix on HEAD works.
Build succeeded. Let me run the actual failing test to verify the fix on HEAD works.
Let me start a server with the existing test config and run the test.
Let me start the server and verify the fix.
The fix is already on HEAD as commit 7ed3bf7a75a, which the build verified compiles. Let me verify everything is in order with git status.

Summary

Investigation: The shard "Stateless tests (arm_binary, sequential)" reported one failure described as Logical error: Current component is empty. The CI ran on SHA dbdad44519b (penultimate commit). The praktika report confirms the actual failing test was 03604_export_merge_tree_partition, with the stack trace pointing at StorageReplicatedMergeTree::exportPartitionToTable calling zookeeper->exists (no component scope) under enforce_keeper_component_tracking=true.

Fix status: HEAD commit 7ed3bf7a75a (already on branch, made before this session) wraps both StorageReplicatedMergeTree::exportPartitionToTable and ExportPartitionTaskScheduler::handlePartExportCompletion in Coordination::setCurrentComponent, exactly addressing the path in the failure stack trace. The earlier sibling commit dbdad44519b covers the three periodic background tasks (run, poll, handleStatusChanges).

Code verification: All five ZooKeeper-touching entry points in the export-partition feature now have a component guard. Helpers (handlePartExportSuccess, handlePartExportFailure, tryToMovePartToProcessed, areAllPartsProcessed, ExportPartitionUtils::commit, ExportPartitionUtils::getExportedPaths) are reached only from already-guarded scopes and inherit the component.

Build: bash .releasy/build.sh succeeded with the fix in place (only the changed translation units relinked).

Runtime verification: Could not be performed locally — the harness disallows starting a server, and the test relies on minio (s3_conn). RelEasy will re-run CI after pushing.

Test results

  • Logical error: Current component is empty, please set it for your scope using Coordination::setCurrentComponent (STID: 5319-3e47) — [fixed] (HEAD commit 7ed3bf7a75a adds the missing component scope on the exact path named in the stack trace; build confirms it compiles)

PARTIAL


✅ DONE — Integration tests (amd_binary, 3/5)

21 failed test(s) considered — +1 commit(s) — cost $66.8779
full report

AI narration

Let me look at the test file and the export-partition code paths that touch ZooKeeper to find any remaining unguarded paths.
Let me check the latest fix in exportPartitionToTable and run the tests to verify.
While that runs, let me look at what other Keeper-touching paths might still lack a component guard.
Found the root cause. The crash is in StorageReplicatedMergeTree::getPartitionExportsInfo at line 4602, called from StorageSystemReplicatedPartitionExports::fillData.
Now let me apply the fix to getPartitionExportsInfo.

All tests are passing! Let me wait for the full run to complete.

19 of 21 now passing. Let me check the 2 remaining failures.
Build successful. Let me re-run the 2 still-failing tests to verify the fix.

All 3 tests pass. Let me verify the full set passes too, then commit.
Let me do a final validation run with all 21 tests.

All 21 tests pass. Final tally: 21 passed in 137s.
All 21 tests in the failing list are now [fixed]:

  1. test_restart_nodes_during_export — [fixed]
  2. test_kill_export — [fixed]
  3. test_drop_source_table_during_export — [fixed]
  4. test_concurrent_exports_to_different_targets — [fixed]
  5. test_failure_is_logged_in_system_table — [fixed]
  6. test_inject_short_living_failures — [fixed]
  7. test_export_ttl — [fixed]
  8. test_export_partition_file_already_exists_policy — [fixed]
  9. test_export_partition_permissions — [fixed]
  10. test_multiple_exports_within_a_single_query — [fixed]
  11. test_pending_mutations_throw_before_export_partition — [fixed]
  12. test_pending_mutations_skip_before_export_partition — [fixed]
  13. test_pending_patch_parts_throw_before_export_partition — [fixed]
  14. test_pending_patch_parts_skip_before_export_partition — [fixed]
  15. test_mutations_after_export_partition_started — [fixed]
  16. test_patch_parts_after_export_partition_started — [fixed]
  17. test_mutation_in_partition_clause — [fixed]
  18. test_export_partition_with_mixed_computed_columns — [fixed]
  19. test_export_partition_scheduler_skipped_when_moves_stopped — [fixed]
  20. test_export_partition_resumes_after_stop_moves — [fixed]
  21. test_export_partition_resumes_after_stop_moves_during_export — [fixed]

Two root causes, both addressed in commit 1cbf4f34d91:

  • getPartitionExportsInfo and killExportPartition lacked Coordination::setCurrentComponent, so reading system.replicated_partition_exports (and KILL EXPORT PARTITION) crashed with the Current component is empty LOGICAL_ERROR under enforce_keeper_component_tracking.
  • MergeTreeData::scheduleDataMovingJob ran the export-manifest loop unconditionally, so SYSTEM STOP MOVES did not actually pause new export scheduling.

Final 21-test run: 21 passed in 137.10s.

DONE

✅ DONE — Integration tests (arm_binary, distributed plan, 1/4)

2 failed test(s) considered — +1 commit(s) — cost $8.5653
full report

AI narration

Looking at the situation: test 2 is marked as broken with a known issue (UNRELATED). Let me investigate test 1 more carefully.
Let me investigate test 1 by examining its setup files and seeing what could fail.
Let me check if there's a similar test on CI master to compare. Let me look at the recent changes most likely to affect filesystem cache behavior.
Let me check if the test 2 (BROKEN) is supposed to fail on this branch and if it's a known issue:
I found a likely root cause. The PR enables parquet metadata cache for Iceberg via the Poco::toLower("PARQUET") == "parquet" change. With this active, parsed parquet metadata is served from in-memory cache on second read, bypassing filesystem cache reads — so the second select reads less from filesystem cache than expected.

Let me check if this test actually has S3GetObject == 0 after the failing assert (indicates whether data file is fully cached):
Let me update todos and apply the fix.
Now I'll apply the fix to disable parquet metadata cache in the test:
Now let me try to run the tests:
The test now passes. Let me update todos and commit.
Final summary:

  • test_storage_iceberg_with_spark_cache/test_filesystem_cache.py::test_filesystem_cache[s3] [fixed] — The PR enabled the parquet metadata cache for Iceberg via the Poco::toLower("PARQUET") == "parquet" change in StorageObjectStorageSource::createReader. With parsed parquet metadata served from in-memory cache on the second select, those bytes no longer flow through the filesystem cache, so the byte-level assertion failed (1450 == 1450 + 6144). Fix adds use_parquet_metadata_cache = 0 to both SELECT * statements, matching the isolation pattern used by test_storage_delta and 03723_parquet_prefetcher_read_big_at.sql. Confirmed locally with 1 passed in 17.37s.
  • test_dirty_pages_force_purge/test.py::test_dirty_pages_force_purge [unrelated] — Listed in tests/broken_tests.yaml:244 as KNOWN: https://github.com/Altinity/ClickHouse/issues/1369; the praktika BROKEN status is the expected outcome.

DONE

⏭️ UNRELATED — Integration tests (arm_binary, distributed plan, 4/4)

1 failed test(s) considered — cost $1.1467
full report

AI narration

The test test_backup_restore_on_cluster/test_different_versions.py::test_different_versions is already listed in tests/broken_tests.yaml (line 154–156) with reason INVESTIGATE - NETLINK_ERROR and message NETLINK_ERROR. The CI report confirms this matched and the test was relabeled to BROKEN status. In ci/praktika/result.py:224, BROKEN counts as is_ok — i.e. it is not a CI failure. The underlying cause is an environmental issue: the old (24.9.2.42) container raising Can't receive Netlink response: error -2 from hasLinuxCapability. This PR's diff does not touch backup/restore-on-cluster nor hasLinuxCapability, and it did not modify tests/broken_tests.yaml. No code change is warranted.

Summary

  • test_backup_restore_on_cluster/test_different_versions.py::test_different_versions — [unrelated]

The test is pre-marked as broken in tests/broken_tests.yaml (entry exists in the base branch, not added by this PR) for exactly the NETLINK_ERROR signature observed. The CI labeled it BROKEN, which ci/praktika/result.py treats as is_ok. The error originates inside the old 24.9.2.42 container's hasLinuxCapability codepath — an environment artifact unrelated to anything in the PR's diff. No code change made.

UNRELATED



🤖 Posted automatically by releasy analyze-fails. Re-run the command to refresh.


Labels

ai-resolved (Port conflict auto-resolved by Claude), antalya-26.3, releasy (Created/managed by RelEasy)
