Add design for ALTER TABLE EXPORT + design review skill #1673
hodgesrm wants to merge 5 commits into Altinity:antalya-26.1
Conversation
Force-pushed 06d5b84 to 227f7b1
arthurpassos left a comment
I see some mistakes made by Claude here regarding plain object storage vs Iceberg exports. I assume it is because it consumed the antalya-26.1 branch docs, which do not mention Iceberg at all. Those docs are in my PR #1618.
Halfway through, still more to come.
| -- Split large part across multiple Parquet files
| ALTER TABLE big EXPORT PART '2025_0_32_3' TO TABLE big_dest
| SETTINGS allow_experimental_export_merge_tree_part = 1,
|     export_merge_tree_part_max_bytes_per_file = 10000000,
`export_merge_tree_part_max_bytes_per_file`
We have a funny situation with this. When I added this setting, ClickHouse did not have the ability to split output Parquet files. Plus, we did not support exporting to Iceberg.
The Iceberg engine now has its own settings. Which one should we respect?
Good point. I vote we go with the Iceberg engine settings for consistency and to simplify EXPORT commands. Questions:

1. `iceberg_insert_max_bytes_in_data_file` should supersede `export_merge_tree_part_max_bytes_per_file`, correct?
2. Export should also obey `iceberg_insert_max_rows_in_data_file`, as it does in your example?
3. What about settings for row groups (e.g., `output_format_parquet_row_group_size` and `output_format_parquet_row_group_size_bytes`)? I assume we observe all of these.

It also seems like you would want to set them at the table level for sanity, since it's a lot to remember if you export more than once.
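For concreteness, here is what observing all of these at once might look like. This is a sketch only: the values are illustrative, and which settings actually take effect is exactly the open question above.

```sql
-- Sketch: illustrative values; precedence among these is the open question.
ALTER TABLE big EXPORT PART '2025_0_32_3' TO TABLE big_dest
SETTINGS allow_experimental_export_merge_tree_part = 1,
         iceberg_insert_max_bytes_in_data_file = 10000000,      -- Iceberg data file byte cap
         iceberg_insert_max_rows_in_data_file = 1000000,        -- Iceberg data file row cap
         output_format_parquet_row_group_size = 100000,         -- rows per row group
         output_format_parquet_row_group_size_bytes = 10000000; -- bytes per row group
```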
TLDR: Your expectations are correct, but that's not quite how it works today.

Full answer: you touched on an important topic. There is a subtle difference in how settings are handled between PART and PARTITION export.

For PART export, the settings in effect at the moment of the query are all preserved and stored in the background export task.

For PARTITION export, it is harder to preserve the settings because we need to serialize them in ZooKeeper. I have done that only for a few very important settings, including `export_merge_tree_part_max_bytes_per_file`.

Some settings are not preserved simply because I did not have the time; I was trying to find a better way to handle this case, which I have not found yet.
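To illustrate the asymmetry with a sketch (the PARTITION syntax and the partition ID are assumptions patterned on the PART example above):

```sql
-- PART export: every setting in effect at query time is preserved and
-- replayed in the background export task.
ALTER TABLE big EXPORT PART '2025_0_32_3' TO TABLE big_dest
SETTINGS export_merge_tree_part_max_bytes_per_file = 10000000;

-- PARTITION export: settings must be serialized into ZooKeeper, so only a
-- small preserved subset (including export_merge_tree_part_max_bytes_per_file)
-- survives into the background task; the rest currently fall back to defaults.
ALTER TABLE big EXPORT PARTITION '2025' TO TABLE big_dest
SETTINGS export_merge_tree_part_max_bytes_per_file = 10000000;
```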
Just pushed a change to #1618 that preserves the Iceberg max-rows and max-bytes settings.
What do you think about renaming `export_merge_tree_max_bytes_per_file` to `output_format_parquet_file_size_bytes`?
Hmm, I don't like this idea very much. It implies the setting would have to be implemented for regular inserts as well, which is extra work and introduces some conflicts with upstream.
Additionally, by putting "parquet" in the setting name you are scoping it to Parquet files. The code that splits the files is agnostic to the format. It would also conflict with the Iceberg max size setting, etc.
OK. `export_merge_tree_max_bytes_per_file` overrides more specific values (in the design).
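In other words, a sketch of the agreed precedence, using the full setting name from the examples above (values are illustrative):

```sql
ALTER TABLE big EXPORT PART '2025_0_32_3' TO TABLE big_dest
SETTINGS export_merge_tree_part_max_bytes_per_file = 10000000, -- generic cap: takes effect
         iceberg_insert_max_bytes_in_data_file = 50000000;     -- more specific: overridden
```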
Force-pushed 227f7b1 to bcb2b8c
arthurpassos left a comment
I'll stop my review here (https://github.com/Altinity/ClickHouse/pull/1673/changes#diff-59e17f10a30e7277ca6b00d367036de94c72a8e1ddcdb063ad9255e3aa707a81R516) as we'll discuss these things in a meeting shortly. It is a long doc lol.
Signed-off-by: Robert Hodges <rhodges@altinity.com>
Also incorporated design feedback. Signed-off-by: Robert Hodges <rhodges@altinity.com>
Signed-off-by: Robert Hodges <rhodges@altinity.com>
Signed-off-by: Robert Hodges <rhodges@altinity.com>
Force-pushed ebbd2e2 to 915df13
Signed-off-by: Robert Hodges <rhodges@altinity.com>
arthurpassos left a comment
LGTM - I did not review the testing section, will leave that for QA
| - EXPORT PARTITION for MergeTree tables. Must work without Keeper installation.
| - Export history. Provide a system table to track the history of part exports.
| - Flexible casting that addresses issues like the following.
|   - Handling potentially lossy casts like INSERT SELECT: int64 -> int32.
By handling, do you mean throw, accept, or a setting that allows the user to decide?
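For concreteness, a minimal illustration of the cast in question, with hypothetical table names:

```sql
-- Int64 -> Int32 narrows the type; values outside the Int32 range are the
-- problem case the design needs to decide on.
CREATE TABLE src (v Int64) ENGINE = MergeTree ORDER BY v;
CREATE TABLE dst (v Int32) ENGINE = MergeTree ORDER BY v;

INSERT INTO src VALUES (5000000000);  -- does not fit in Int32
INSERT INTO dst SELECT v FROM src;    -- throw? wrap around? setting-controlled?
```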
| swap in `DatabaseIceberg` or an `iceberg(...)` catalog-backed table
| function to route through REST / Glue / Unity.
|
| **Open Issue** We need to specify how to commit via a catalog.
What do you mean by open issue? Exporting to Iceberg tables backed by a catalog is already supported.
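For example, something along these lines should already work. This is a sketch only: the engine spelling, endpoint, and arguments are assumptions, not the reviewed design.

```sql
-- Hypothetical catalog-backed Iceberg database; the export then targets one
-- of its tables, committing through the catalog (REST in this example).
CREATE DATABASE ice
ENGINE = DatabaseIceberg('http://rest-catalog:8181/v1')  -- endpoint is made up
SETTINGS catalog_type = 'rest', warehouse = 'warehouse';

ALTER TABLE big EXPORT PART '2025_0_32_3' TO TABLE ice.big_dest;
```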
| The following notes expand on expected behavior of commands.
|
| 1. When writing to object storage using `partition_strategy = 'wildcard'`, either wildcard
Writing here is weird:
"when writing with partition_strategy = 'wildcard', either wildcard and hive can be used"
I think you meant "when writing to plain object storage, either wildcard or hive can be used" - or something of the sort.
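To spell out the two write layouts, a sketch (the S3 engine arguments and paths are illustrative assumptions, not from the design doc):

```sql
-- 'wildcard': the path template itself places each partition via {_partition_id}.
CREATE TABLE dest_wildcard (key UInt32, v String)
ENGINE = S3('https://bucket.s3.amazonaws.com/export/{_partition_id}/data.parquet', 'Parquet')
PARTITION BY key;

-- 'hive': key=value directories are generated under a plain prefix.
CREATE TABLE dest_hive (key UInt32, v String)
ENGINE = S3('https://bucket.s3.amazonaws.com/export/', 'Parquet', partition_strategy = 'hive')
PARTITION BY key;
```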
| - For best results `EXPORT PARTITION` requires a ZooKeeper / `clickhouse-keeper` ensemble with
|   the `multi_read` feature flag. This reduces API calls and ensures transactional consistency
|   when reading multiple fields.
|
| governed by the destination table's Iceberg partition spec instead.
Minor formatting issue.
| - **Plain object storage.** One Parquet data file per part (or per chunk when split by
|   size/rows) plus one commit file per transaction — `<dest>/<partition>/<part>_<checksum>.<N>.parquet`
|   and `<dest>/commit_<...>`. Readers that want atomicity filter by commit. This is the default path,
|   but it can be customized to handle sharding, which is not covered by the default case. See the
|
| export artifacts; the Iceberg sidecars (`*_clickhouse_export_part_sidecar.avro`)
| are safe to delete once the commit has landed.
|
| 9. **Settings in `PART` vs `PARTITION` export.** There is a subtle difference in
|
| Both commands accept two destination families:
|
| 1. **Plain object storage** (`S3`, `AzureBlobStorage`, equivalents).
Is AzureBlobStorage something we want to explicitly test? Most of the regression tests are S3-centric.
| This design only covers user-visible behavior. It does not cover internal implementation
| details. The implementation section is omitted.
|
| ## 4. Test plan
Is this section targeted at tests written by devs, a guide for regression test development, or both?
I've been wondering for a while, actually: what is the difference between the regression repo and the tests written by devs in the CH repo under tests/? I'm aware that regression has deeper coverage and some overlap with the dev tests, but beyond that, why do we have both, and what is the purpose of each?
> why do we have both, what is the purpose of each

I summon @vzakaznikov in attack position
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added a skill for Antalya design creation and review. Also added a first-cut design spec for ALTER TABLE EXPORT based on current documentation and tests.