
Antalya 26.3: apassos-2: combined port of 12 PRs #1699

Closed

zvonand wants to merge 4 commits into antalya-26.3 from feature/antalya-26.3/apassos-2

Conversation

zvonand (Collaborator) commented Apr 28, 2026

This PR needs manual intervention.
Cherry-pick of #1484 could not be resolved automatically (AI resolver was disabled, exhausted its iteration budget, or gave up).
The branch contains the first 4 commit(s) of the group; 7 later PR(s) were not attempted.
Conflicted files at the failure point:

  • src/Processors/Formats/Impl/ParquetFileMetaDataCache.cpp
  • src/Processors/Formats/Impl/ParquetFileMetaDataCache.h
Resolve the conflict locally, push the fix, and mark this PR ready for review.

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Forward port of export part and partition #1041, #1083, #1086, #1090, #1106, #1124, #1144, #1147, #1150, #1157, #1158, #1161, #1167, #1229, #1294, #1320, #1324 and #1330 (#1388 by @arthurpassos, #1405 by @arthurpassos, #1478 by @arthurpassos, #1402 by @arthurpassos, #1484 by @arthurpassos, #1490 by @arthurpassos, #1500 by @arthurpassos, #1517 by @arthurpassos, #1499 by @arthurpassos, #1593 by @arthurpassos, #1618 by @arthurpassos, #1631 by @arthurpassos).

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

Combined port of 12 PR(s) (group apassos-2). Cherry-picked from #1388, #1405, #1478, #1402, #1484, #1490, #1500, #1517, #1499, #1593, #1618, #1631.


#1388: Antalya 26.1 - Forward port of export part and partition

Documentation entry for user-facing changes

Export merge tree part and partition (we still need to rebase #1177 afterwards)


#1405: Antalya 26.1 - Forward port of list objects cache #1040

Documentation entry for user-facing changes

Cache for ListObjects calls
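A ListObjects cache of this kind typically maps a bucket/prefix pair to the listing result with an expiry, so repeated listings of the same prefix skip the object-store round trip. A minimal language-neutral sketch (all names are hypothetical, not the actual ClickHouse implementation):

```python
import time

class ListObjectsCache:
    """TTL cache keyed by (bucket, prefix); illustrative only."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # (bucket, prefix) -> (timestamp, listing)

    def get(self, bucket, prefix, now=None):
        now = time.monotonic() if now is None else now
        hit = self.entries.get((bucket, prefix))
        if hit is None or now - hit[0] > self.ttl:
            return None  # miss or expired entry
        return hit[1]

    def put(self, bucket, prefix, listing, now=None):
        now = time.monotonic() if now is None else now
        self.entries[(bucket, prefix)] = (now, listing)

cache = ListObjectsCache(ttl_seconds=60)
cache.put("b", "data/", ["data/a.parquet"], now=0.0)
assert cache.get("b", "data/", now=30.0) == ["data/a.parquet"]  # fresh hit
assert cache.get("b", "data/", now=120.0) is None               # expired
```

The real cache also needs invalidation on writes to the prefix; the sketch shows only the TTL lookup path.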


#1478: Skip remote suite on export tests that block minio

Documentation entry for user-facing changes

...


#1402: Improvements to partition export

Documentation entry for user-facing changes

As of now, this PR does the following things:

  1. Limit the number of export part operations that can be scheduled, based on the BackgroundMovesExecutor. This applies to both partition and part exports. The limit is no longer memory bound; this solves the problem of one replica locking all parts in a task just because it ran faster, even when it did not have the bandwidth to execute all of them.
  2. Introduce ZooKeeper metrics specific to export partition requests (I suppose we'll remove these later; for now they are good to have for benchmarking different approaches).
  3. Introduce a "lock the data part inside the task" strategy, as opposed to locking first and only then scheduling a task. This is controlled by export_merge_tree_partition_lock_inside_the_task. I don't think we want users to enable it; it exists for experimentation and benchmarking. See ExportPartFromPartitionExportTask.
  4. Only run the scheduler if we have available slots, and only schedule as many tasks as the slots allow. This is subject to TOCTOU: the background executor has an internal pending queue, so even with available slots a task might end up pending. Tackling that would require writing a new background executor, which is not a must for now.
  5. Add a local system.replicated_partition_exports option. Refactor querying system.replicated_partition_exports to use multi_read requests instead of several separate read requests. Throws iff multi_read is not supported (this was vibe coded).
  6. Shuffle the parts to export in a given partition task before choosing a part to work on, to avoid locking collisions.
  7. Save the entire Settings object instead of only FormatSettings, so that more settings are preserved (part export only).
  8. Clear part references in the partition export manifest once it is no longer pending.

lock_inside_the_task is not production ready as of now (the entire feature is not production ready, to be honest): there is a possible crash if the user schedules an export, changes the schema of the destination table, and the export then executes, because validation is not performed on the fly for this setting. I think it is okay to ignore this corner case for now.
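Item 4 above amounts to capping how many exports get scheduled per pass by the number of free executor slots. A hedged sketch of that slot-limited scheduling decision (hypothetical names, not the actual executor code):

```python
def schedule_exports(pending_parts, max_slots, in_flight):
    """Schedule at most as many exports as there are free executor slots.

    Mirrors item 4: the scheduler only submits work when slots are free
    and never submits more than the free count. As the PR notes, this
    check is still subject to TOCTOU because the executor keeps its own
    internal pending queue.
    """
    free_slots = max(0, max_slots - in_flight)
    to_schedule = pending_parts[:free_slots]
    remaining = pending_parts[free_slots:]
    return to_schedule, remaining

# Two of four slots busy: only two parts are scheduled this pass.
scheduled, left = schedule_exports(["p1", "p2", "p3"], max_slots=4, in_flight=2)
assert scheduled == ["p1", "p2"] and left == ["p3"]
```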

This partially tackles the following:


#1484: Use serialized metadata size to calculate the cache entry cell

Documentation entry for user-facing changes

...
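The PR title suggests the cache entry's cost is derived from the serialized metadata size rather than a fixed per-entry count. A hedged sketch of size-weighted LRU eviction along those lines (this is not the actual ParquetFileMetaDataCache code):

```python
from collections import OrderedDict

class SizeWeightedCache:
    """LRU cache bounded by total serialized bytes, not by entry count."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self.entries = OrderedDict()  # key -> (serialized_size, value)

    def put(self, key, value, serialized_size):
        if key in self.entries:
            self.current_bytes -= self.entries.pop(key)[0]
        self.entries[key] = (serialized_size, value)
        self.current_bytes += serialized_size
        while self.current_bytes > self.max_bytes:
            _, (size, _) = self.entries.popitem(last=False)  # evict LRU
            self.current_bytes -= size

cache = SizeWeightedCache(max_bytes=100)
cache.put("a.parquet", "meta_a", serialized_size=60)
cache.put("b.parquet", "meta_b", serialized_size=60)  # pushes total to 120, evicts a.parquet
assert "a.parquet" not in cache.entries
assert cache.current_bytes == 60
```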


#1490: add setting to define filename pattern for part exports

Documentation entry for user-facing changes

...


#1500: Fix local replicated_partition_exports table might miss entries

Documentation entry for user-facing changes

...


#1517: Fix IPartitionStrategy race condition

IPartitionStrategy::computePartitionKey might be called from different threads, and it writes to cached_result concurrently without any sort of protection. It would be easier to add a mutex around it, but we can actually make it lock-free by moving the cache write to the constructor.
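The fix described above can be sketched in a language-neutral way: if the only write to the cached result happens in the constructor, before any other thread can see the object, concurrent callers are pure readers and need no mutex (illustrative sketch, not the actual C++ class):

```python
import threading

class PartitionStrategy:
    """Sketch of the lock-free fix: cached_result is written exactly once,
    in the constructor, so compute_partition_key is a read-only call that
    is safe from any number of threads."""

    def __init__(self, partition_expr):
        # The single write happens here, before the object is shared.
        self.cached_result = f"key({partition_expr})"

    def compute_partition_key(self):
        return self.cached_result  # read-only: no mutex required

strategy = PartitionStrategy("toYYYYMM(date)")
results = []
threads = [threading.Thread(target=lambda: results.append(strategy.compute_partition_key()))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert results == ["key(toYYYYMM(date))"] * 8
```

The trade-off is that the key is computed eagerly even if never used; the PR evidently judged that cheaper than taking a lock on every call.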

Documentation entry for user-facing changes

...


#1499: Bump scheduled exports count only in case it has been scheduled

Documentation entry for user-facing changes

...


#1593: Export Partition - release the part lock when the query is cancelled

During export partition, parts are locked by replicas for export. This PR releases those locks when an export task is cancelled; previously, the lock was not released. We did not catch this bug earlier because the only cancellation cases we tested were KILL EXPORT PARTITION and DROP TABLE. In those cases the entire task is cancelled, so it does not matter whether a replica releases its lock.

But a query can also be cancelled with SYSTEM STOP MOVES, which is a local operation. In that case the lock must be released so that other replicas can continue.
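The fixed behavior amounts to releasing the part lock on every exit path, including cancellation. A hedged sketch of that shape (hypothetical names, not the actual task code):

```python
class PartLock:
    def __init__(self):
        self.held = False
    def acquire(self):
        self.held = True
    def release(self):
        self.held = False

class QueryCancelled(Exception):
    pass

def export_part(lock, do_export):
    """Release the part lock even when the export is cancelled mid-flight,
    e.g. by SYSTEM STOP MOVES (sketch of the fixed behavior)."""
    lock.acquire()
    try:
        do_export()
    finally:
        lock.release()  # before the fix, cancellation skipped this step

lock = PartLock()
def cancelled_export():
    raise QueryCancelled("SYSTEM STOP MOVES")
try:
    export_part(lock, cancelled_export)
except QueryCancelled:
    pass
assert not lock.held  # other replicas can now pick up the part
```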

Documentation entry for user-facing changes

...


#1618: Export partition to apache iceberg

Export partition mechanics changes:

  1. Ping the restarting thread in case of a ZooKeeper session failure.
  2. Add a few failpoints to make testing better.
  3. Make export_merge_tree_partition_system_table_prefer_remote_information false by default (I am considering removing it completely).
  4. Add a commit retry count / max retries to prevent a task from living forever when the commit keeps failing. Fail the entire task if commit retries exceed max retries.
  5. Fix a race condition in ExportPartitionManifestUpdatingTask by draining the status queue while holding only the status lock, instead of holding both the status lock and the export partition lock.
  6. Abstract away common functions like getContextCopyWithTaskSettings to avoid code duplication.
  7. Add a task timeout. If the task exceeds the timeout, it is killed with reason "timeout exceeded". This helps with Apache Iceberg idempotency vs. old manifest cleanup, and with tasks stuck in the pending state forever due to missing parts or no destination table.
  8. Rename enable_experimental_export_merge_tree_partition_feature to allow_experimental_export_merge_tree_partition.
  9. Throw on exports if allow_experimental_insert_into_iceberg is not enabled.
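Item 4 in the list above is a bounded-retry loop: keep retrying the commit, but once the retry count exceeds the maximum, fail the whole task rather than letting it live forever. A hedged sketch (hypothetical names):

```python
def commit_with_retries(commit, max_retries):
    """Retry commit up to max_retries times; re-raise on final failure so
    the whole task fails instead of living forever (sketch of item 4)."""
    for attempt in range(max_retries + 1):
        try:
            return commit()
        except RuntimeError:
            if attempt == max_retries:
                raise  # give up: fail the entire task

attempts = []
def flaky_commit():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("commit failed")
    return "committed"

assert commit_with_retries(flaky_commit, max_retries=5) == "committed"
assert len(attempts) == 3  # succeeded on the third try
```

The production version would also need the retry count persisted in the ZooKeeper task so it survives replica restarts, which this local sketch does not show.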

Apache Iceberg specifics:

  1. Store the Apache Iceberg metadata JSON in the ZooKeeper task.
  2. Derive destination partition values from the source MergeTree part (no recalculation).
  3. Preserve write_full_path_in_iceberg_metadata in the ZooKeeper task.
  4. ExportPartTask now has a commit step that is executed only when it is not part of an export partition, because we need to commit even a single part. Maybe I should rethink this architecture.
  5. Some vibe-coded structures for Iceberg stats.
  6. Write f_clickhouse_export_partition_transaction_id to the Apache Iceberg manifest so we can check it before committing twice.
  7. Copy, paste, and adapt the IcebergStorageSink commit phase into IcebergMetadata so we can commit export partition operations.
  8. Create a sidecar file to persist file-level statistics so they can be used at commit time; these are downloaded/read at commit time.
  9. Create a simple IcebergImportSink.
  10. Add per-file stats to MultiFileWriter.

Documentation entry for user-facing changes

...


#1631: Fix condition for using parquet metadata cache

Apache Iceberg queries were not hitting the parquet metadata cache because object_info->getFileFormat() resolves to IcebergDataObjectInfo::getFileFormat, which takes its return value from IcebergObjectSerializableInfo. That field is filled from the Apache Iceberg manifest file and is upper case by default, which then fails ClickHouse's check for parquet metadata cache usage.
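The failure mode is a plain string-equality check: the Iceberg manifest reports "PARQUET" while the cache condition expects the ClickHouse-style "Parquet". Comparing the format name case-insensitively (or normalizing it first) restores the cache hit. A minimal sketch of the corrected condition (hypothetical function name, not the actual ClickHouse code):

```python
def parquet_metadata_cache_usable(file_format):
    """Case-insensitive format check: the Iceberg manifest stores the
    format upper case ("PARQUET"), so a naive equality test against
    "Parquet" always misses the cache (sketch of the fix)."""
    return file_format.casefold() == "parquet"

assert parquet_metadata_cache_usable("PARQUET")   # value from the Iceberg manifest
assert parquet_metadata_cache_usable("Parquet")   # ClickHouse-style format name
assert not parquet_metadata_cache_usable("ORC")
```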

Documentation entry for user-facing changes

...

zvonand added 4 commits April 28, 2026 16:40

  • Antalya 26.1 - Forward port of export part and partition (Source-PR: #1388)
  • Antalya 26.1 - Forward port of list objects cache #1040 (Source-PR: #1405)
  • Skip remote suite on export tests that block minio (Source-PR: #1478)
  • Improvements to partition export (Source-PR: #1402)
zvonand added the labels ai-needs-attention (Releasy stopped on a conflict it could not resolve — needs human review) and releasy (Created/managed by RelEasy), Apr 28, 2026
@github-actions

Workflow [PR], commit [8f7b8d8]

zvonand closed this Apr 28, 2026