---
kind:
  - How To
products:
  - Alauda Container Platform
  - Alauda Application Services
ProductsVersion:
  - 4.3.x
---
# How to Back Up and Restore ClickHouse with clickhouse-backup

## Purpose

This document explains how to back up and restore a ClickHouse instance deployed by the ClickHouse Operator on Alauda Container Platform by using a `clickhouse-backup` sidecar and S3-compatible object storage.

The procedure covers:

- Deploying ClickHouse with a `clickhouse-backup` sidecar.
- Creating a remote backup in S3-compatible storage.
- Running backups from a Kubernetes Job or CronJob.
- Restoring the backup into a separate ClickHouseInstallation.
- Verifying that restored tables contain the expected data.

## Resolution

### 1. Overview

`clickhouse-backup` runs as a sidecar in the ClickHouse Pod. It connects to ClickHouse through `localhost:9000`, freezes MergeTree table parts, writes backup metadata, and uploads the backup to S3-compatible storage.

For this workflow to back up table data correctly, the sidecar must mount the same ClickHouse data volume as the main `clickhouse` container at `/var/lib/clickhouse`. If the sidecar does not see the ClickHouse data directory, the backup may contain only metadata files.

Distributed tables are backed up as schema only. The actual data is stored in the underlying MergeTree-family local tables, such as `events_local`.

### 2. Prerequisites

Prepare the following items before starting:

| Item | Description | Example |
|------|-------------|---------|
| Namespace | Namespace for the ClickHouse instances and Jobs | `` |
| Source ClickHouseInstallation name | Source instance name | `` |
| Restore ClickHouseInstallation name | Restore instance name | `` |
| ClickHouse cluster name | Cluster name in the ClickHouseInstallation spec | `` |
| Source ClickHouse Pod name | Source Pod generated by the ClickHouse Operator | `` |
| Source ClickHouse service name | Service for the source Pod or shard | `` |
| Restore ClickHouse Pod name | Restore Pod generated by the ClickHouse Operator | `` |
| Restore ClickHouse service name | Service for the restore Pod or shard | `` |
| S3 endpoint | S3-compatible object storage endpoint | `http://:` |
| S3 bucket | Bucket for backup storage | `` |
| S3 access key | Access key for the bucket | `` |
| S3 secret key | Secret key for the bucket | `` |
| S3 credential Secret name | Kubernetes Secret that stores S3 credentials | `` |
| Backup Job name | Kubernetes Job used to create a backup | `` |
| Backup CronJob name | Kubernetes CronJob used for scheduled backups | `` |
| Restore Job name | Kubernetes Job used to restore a backup | `` |

Set local variables for the commands in this document. If these variables are already exported in your shell, the manifest templates below can be rendered directly with `envsubst`.

Use explicit variable lists with `envsubst`. This prevents runtime variables inside Job scripts, such as `$BACKUP_NAME`, `$COMMAND`, and `$STATUS`, from being replaced on your workstation.
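For example, rendering a template with an explicit list substitutes only the listed variables and leaves everything else untouched. A minimal illustration (the file names here are placeholders, not part of this procedure):

```bash
# Only ${NAMESPACE} and ${SOURCE_CHI} are substituted; any other ${VAR}
# references in the template, such as ${BACKUP_NAME}, pass through unchanged.
envsubst '${NAMESPACE} ${SOURCE_CHI}' < example.yaml.tmpl > example.yaml
```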
+ +```bash +export NAMESPACE="" +export SOURCE_CHI="" +export RESTORE_CHI="" +export CLUSTER_NAME="" +export SOURCE_POD="" +export SOURCE_SERVICE="" +export RESTORE_POD="" +export RESTORE_SERVICE="" +export S3_ENDPOINT="http://:" +export S3_BUCKET="" +export S3_ACCESS_KEY="" +export S3_SECRET_KEY="" +export S3_SECRET_NAME="" +export BACKUP_JOB_NAME="" +export BACKUP_CRONJOB_NAME="" +export RESTORE_JOB_NAME="" +``` + +Create the S3 bucket before running the backup workflow. + +```bash +mc alias set backup-s3 "$S3_ENDPOINT" "$S3_ACCESS_KEY" "$S3_SECRET_KEY" +mc mb --ignore-existing "backup-s3/$S3_BUCKET" +``` + +### 3. Create a Namespace and S3 Credential Secret + +```bash +kubectl create namespace "$NAMESPACE" + +kubectl -n "$NAMESPACE" create secret generic "$S3_SECRET_NAME" \ + --from-literal=access-key="$S3_ACCESS_KEY" \ + --from-literal=secret-key="$S3_SECRET_KEY" +``` + +### 4. Deploy ClickHouse with a Backup Sidecar + +Create `clickhouse-source.yaml.tmpl`. The template uses the environment variables defined in the prerequisites section and can be rendered directly with `envsubst`. + +```yaml +apiVersion: clickhouse.altinity.com/v1 +kind: ClickHouseInstallation +metadata: + name: ${SOURCE_CHI} + namespace: ${NAMESPACE} +spec: + configuration: + clusters: + - name: ${CLUSTER_NAME} + layout: + shardsCount: 1 + replicasCount: 1 + templates: + podTemplate: clickhouse-with-backup + templates: + podTemplates: + - name: clickhouse-with-backup + spec: + containers: + - name: clickhouse + volumeMounts: + - name: clickhouse-data + mountPath: /var/lib/clickhouse + - name: clickhouse-backup + image: docker-mirrors.alauda.cn/altinity/clickhouse-backup:2.6.3 + imagePullPolicy: IfNotPresent + args: + - server + env: + - name: LOG_LEVEL + value: debug + - name: ALLOW_EMPTY_BACKUPS + value: "false" + - name: API_LISTEN + value: 0.0.0.0:7171 + - name: API_CREATE_INTEGRATION_TABLES + value: "true" + - name: BACKUPS_TO_KEEP_REMOTE + value: "3" + - name: REMOTE_STORAGE + value: s3 + - name: S3_ACL + value: private + - name: S3_ENDPOINT + value: ${S3_ENDPOINT} + - name: S3_BUCKET + value: ${S3_BUCKET} + - name: S3_PATH + value: backup/shard-{shard} + - name: S3_ACCESS_KEY + valueFrom: + secretKeyRef: + name: ${S3_SECRET_NAME} + key: access-key + - name: S3_SECRET_KEY + valueFrom: + secretKeyRef: + name: ${S3_SECRET_NAME} + key: secret-key + - name: S3_FORCE_PATH_STYLE + value: "true" + - name: S3_DISABLE_SSL + value: "true" + ports: + - containerPort: 7171 + name: backup-rest + volumeMounts: + - name: clickhouse-data + mountPath: /var/lib/clickhouse + securityContext: + runAsUser: 101 + runAsGroup: 101 + volumeClaimTemplates: + - name: clickhouse-data + spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 10Gi +``` + +Notes: + +- `ALLOW_EMPTY_BACKUPS="false"` is recommended for production because it fails when no data is present. Set it to `"true"` only for initial setup, CI, testing, or reusable templates where an empty database is expected. +- `S3_DISABLE_SSL="true"` disables TLS for S3 traffic and sends backups without transport encryption. Use it only for local testing or trusted isolated networks. For production or untrusted networks, set it to `"false"` or omit it and configure TLS for the S3 endpoint. +- Keep the `clickhouse` container entry in the Pod template. If it is omitted, the generated Pod may contain only the sidecar container. + +Apply the manifest and wait for the Pod to become ready. 
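Optionally, preview the rendered manifest before creating anything on the cluster. This is a quick sanity check, not a required step; it uses the same `envsubst` variable list as the commands below:

```bash
# Render to stdout and run a client-side dry run to confirm the placeholders
# were substituted and the manifest is structurally valid.
envsubst '${SOURCE_CHI} ${NAMESPACE} ${CLUSTER_NAME} ${S3_ENDPOINT} ${S3_BUCKET} ${S3_SECRET_NAME}' \
  < clickhouse-source.yaml.tmpl | kubectl apply --dry-run=client -f -
```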
+ +```bash +envsubst '${SOURCE_CHI} ${NAMESPACE} ${CLUSTER_NAME} ${S3_ENDPOINT} ${S3_BUCKET} ${S3_SECRET_NAME}' \ + < clickhouse-source.yaml.tmpl > clickhouse-source.yaml + +kubectl apply -f clickhouse-source.yaml +kubectl -n "$NAMESPACE" wait --for=condition=Ready pod/"$SOURCE_POD" --timeout=10m +kubectl -n "$NAMESPACE" get pod "$SOURCE_POD" +``` + +The Pod must show two ready containers. + +```text +NAME READY STATUS + 2/2 Running +``` + +Verify that the backup integration tables were created. + +```bash +kubectl -n "$NAMESPACE" exec "$SOURCE_POD" -c clickhouse -- \ + clickhouse-client -q "SHOW TABLES FROM system LIKE 'backup_%'" +``` + +Expected output includes: + +```text +backup_actions +backup_list +backup_version +``` + +### 5. Insert Test Data + +Create a local MergeTree table and a Distributed table. + +```bash +kubectl -n "$NAMESPACE" exec "$SOURCE_POD" -c clickhouse -- \ + clickhouse-client -mn --query " + CREATE TABLE events_local + ( + event_date Date, + event_type Int32, + article_id Int32, + title String + ) + ENGINE = MergeTree() + PARTITION BY toYYYYMM(event_date) + ORDER BY (event_type, article_id); + + CREATE TABLE events AS events_local + ENGINE = Distributed('$CLUSTER_NAME', default, events_local, rand()); + + INSERT INTO events_local + SELECT today(), rand() % 3, number, 'backup test' FROM numbers(1000); + + SELECT count() FROM events_local; + " +``` + +The expected count is: + +```text +1000 +``` + +Confirm that ClickHouse created active data parts for the local table. + +```bash +kubectl -n "$NAMESPACE" exec "$SOURCE_POD" -c clickhouse -- \ + clickhouse-client -q " + SELECT table, sum(rows), sum(bytes_on_disk) + FROM system.parts + WHERE database = 'default' AND table = 'events_local' AND active + GROUP BY table + " +``` + +### 6. Create and Upload a Backup Manually + +The `create_remote` command creates a local backup and uploads it to the configured remote storage in one step. + +```bash +BACKUP_NAME="full-$(date +%Y%m%d%H%M%S)" + +kubectl -n "$NAMESPACE" exec "$SOURCE_POD" -c clickhouse-backup -- \ + clickhouse-backup create_remote "$BACKUP_NAME" +``` + +List the backup from the sidecar. + +```bash +kubectl -n "$NAMESPACE" exec "$SOURCE_POD" -c clickhouse-backup -- \ + clickhouse-backup list | grep "$BACKUP_NAME" +``` + +Expected output contains both local and remote entries. + +```text + ... local regular + ... remote tar, regular +``` + +Verify that the remote storage contains both metadata and data part objects. + +```bash +mc find "backup-s3/$S3_BUCKET/backup/shard-0/$BACKUP_NAME" --maxdepth 5 +``` + +Expected objects include a `shadow` path for MergeTree data parts. + +```text +backup/shard-0//metadata.json +backup/shard-0//metadata/default/events.json +backup/shard-0//metadata/default/events_local.json +backup/shard-0//shadow/default/events_local/default_202605_1_1_0.tar +``` + +### 7. Run Backup from a Kubernetes Job + +The sidecar exposes `system.backup_actions`. This allows backup automation from a Kubernetes Job that only needs `clickhouse-client` network access to the ClickHouse service. + +Create `clickhouse-backup-job.yaml.tmpl`. Render only the manifest variables with `envsubst`; do not render runtime variables such as `$BACKUP_NAME`, `$COMMAND`, or `$STATUS` inside the Job script. 
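If you are unsure which variables a template references, you can list them before deciding what belongs in the `envsubst` variable list; a small helper sketch using standard `grep`, nothing specific to `clickhouse-backup`:

```bash
# Print the distinct ${VAR} references in the template so deployment-time
# variables can be separated from runtime variables used inside the Job script.
grep -oE '\$\{[A-Za-z_][A-Za-z0-9_]*\}' clickhouse-backup-job.yaml.tmpl | sort -u
```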
+ +```yaml +apiVersion: batch/v1 +kind: Job +metadata: + name: ${BACKUP_JOB_NAME} + namespace: ${NAMESPACE} +spec: + backoffLimit: 1 + template: + spec: + restartPolicy: Never + containers: + - name: run-backup + image: clickhouse/clickhouse-client:latest + imagePullPolicy: IfNotPresent + env: + - name: CLICKHOUSE_HOST + value: ${SOURCE_SERVICE} + - name: CLICKHOUSE_PORT + value: "9000" + command: + - bash + - -ec + - | + BACKUP_NAME="full-$(date +%Y%m%d%H%M%S)" + COMMAND="create_remote ${BACKUP_NAME}" + + clickhouse-client --host="$CLICKHOUSE_HOST" --port="$CLICKHOUSE_PORT" -mn \ + --query="INSERT INTO system.backup_actions(command) VALUES('${COMMAND}')" + + while true; do + STATUS=$(clickhouse-client --host="$CLICKHOUSE_HOST" --port="$CLICKHOUSE_PORT" -mn \ + --query="SELECT status FROM system.backup_actions WHERE command='${COMMAND}' ORDER BY start DESC LIMIT 1 FORMAT TabSeparatedRaw") + echo "${COMMAND}: ${STATUS}" + if [ "$STATUS" != "in progress" ]; then + break + fi + sleep 2 + done + + if [ "$STATUS" != "success" ]; then + clickhouse-client --host="$CLICKHOUSE_HOST" --port="$CLICKHOUSE_PORT" -mn \ + --query="SELECT command,status,error FROM system.backup_actions WHERE command='${COMMAND}' ORDER BY start DESC LIMIT 1" + exit 1 + fi + + echo "BACKUP_NAME=${BACKUP_NAME}" +``` + +Apply and verify the Job. + +```bash +envsubst '${BACKUP_JOB_NAME} ${NAMESPACE} ${SOURCE_SERVICE}' \ + < clickhouse-backup-job.yaml.tmpl > clickhouse-backup-job.yaml + +kubectl apply -f clickhouse-backup-job.yaml +kubectl -n "$NAMESPACE" wait --for=condition=complete job/"$BACKUP_JOB_NAME" --timeout=20m +kubectl -n "$NAMESPACE" logs job/"$BACKUP_JOB_NAME" +``` + +### 8. Schedule Backups with a CronJob + +After the manual Job succeeds, use the same command body in a CronJob. + +Create `clickhouse-backup-cronjob.yaml.tmpl` and render only the manifest variables with `envsubst`. + +```yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: ${BACKUP_CRONJOB_NAME} + namespace: ${NAMESPACE} +spec: + schedule: "0 0 * * *" + concurrencyPolicy: Forbid + successfulJobsHistoryLimit: 3 + failedJobsHistoryLimit: 3 + jobTemplate: + spec: + backoffLimit: 1 + template: + spec: + restartPolicy: Never + containers: + - name: run-backup + image: clickhouse/clickhouse-client:latest + imagePullPolicy: IfNotPresent + env: + - name: CLICKHOUSE_HOST + value: ${SOURCE_SERVICE} + - name: CLICKHOUSE_PORT + value: "9000" + command: + - bash + - -ec + - | + BACKUP_NAME="full-$(date +%Y%m%d%H%M%S)" + COMMAND="create_remote ${BACKUP_NAME}" + clickhouse-client --host="$CLICKHOUSE_HOST" --port="$CLICKHOUSE_PORT" -mn \ + --query="INSERT INTO system.backup_actions(command) VALUES('${COMMAND}')" + while true; do + STATUS=$(clickhouse-client --host="$CLICKHOUSE_HOST" --port="$CLICKHOUSE_PORT" -mn \ + --query="SELECT status FROM system.backup_actions WHERE command='${COMMAND}' ORDER BY start DESC LIMIT 1 FORMAT TabSeparatedRaw") + echo "${COMMAND}: ${STATUS}" + if [ "$STATUS" != "in progress" ]; then + break + fi + sleep 2 + done + test "$STATUS" = "success" +``` + +Render and apply the CronJob. + +```bash +envsubst '${BACKUP_CRONJOB_NAME} ${NAMESPACE} ${SOURCE_SERVICE}' \ + < clickhouse-backup-cronjob.yaml.tmpl > clickhouse-backup-cronjob.yaml + +kubectl apply -f clickhouse-backup-cronjob.yaml +``` + +For multi-shard clusters, run the same command once against one replica service per shard, and use a unique `S3_PATH` or backup name per shard. + +### 9. 
Create a Restore ClickHouse Instance

Create another ClickHouseInstallation with the same sidecar and S3 configuration. Use a different instance name.

Create `clickhouse-restore.yaml.tmpl` by copying `clickhouse-source.yaml.tmpl` and changing the metadata name variable from `${SOURCE_CHI}` to `${RESTORE_CHI}`. Then render the template.

```bash
envsubst '${RESTORE_CHI} ${NAMESPACE} ${CLUSTER_NAME} ${S3_ENDPOINT} ${S3_BUCKET} ${S3_SECRET_NAME}' \
  < clickhouse-restore.yaml.tmpl > clickhouse-restore.yaml

kubectl apply -f clickhouse-restore.yaml
kubectl -n "$NAMESPACE" wait --for=condition=Ready pod/"$RESTORE_POD" --timeout=10m
```

### 10. Restore the Backup

Use the backup name from step 6 or step 7. Export it so that `envsubst` can render it into the restore Job manifest later in this step.

```bash
export BACKUP_NAME=""

kubectl -n "$NAMESPACE" exec "$RESTORE_POD" -c clickhouse-backup -- \
  clickhouse-backup restore_remote "$BACKUP_NAME"
```

The same restore can be automated through `system.backup_actions`.

Create `clickhouse-restore-job.yaml.tmpl`. Set `BACKUP_NAME` to the backup created in step 6 or step 7, and render only the manifest variables with `envsubst`.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ${RESTORE_JOB_NAME}
  namespace: ${NAMESPACE}
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: run-restore
        image: clickhouse/clickhouse-client:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: CLICKHOUSE_HOST
          value: ${RESTORE_SERVICE}
        - name: CLICKHOUSE_PORT
          value: "9000"
        - name: RESTORE_BACKUP
          value: ${BACKUP_NAME}
        command:
        - bash
        - -ec
        - |
          COMMAND="restore_remote ${RESTORE_BACKUP}"
          clickhouse-client --host="$CLICKHOUSE_HOST" --port="$CLICKHOUSE_PORT" -mn \
            --query="INSERT INTO system.backup_actions(command) VALUES('${COMMAND}')"
          while true; do
            STATUS=$(clickhouse-client --host="$CLICKHOUSE_HOST" --port="$CLICKHOUSE_PORT" -mn \
              --query="SELECT status FROM system.backup_actions WHERE command='${COMMAND}' ORDER BY start DESC LIMIT 1 FORMAT TabSeparatedRaw")
            echo "${COMMAND}: ${STATUS}"
            if [ "$STATUS" != "in progress" ]; then
              break
            fi
            sleep 2
          done
          test "$STATUS" = "success"
```

Apply and verify the restore Job.

```bash
envsubst '${RESTORE_JOB_NAME} ${NAMESPACE} ${RESTORE_SERVICE} ${BACKUP_NAME}' \
  < clickhouse-restore-job.yaml.tmpl > clickhouse-restore-job.yaml

kubectl apply -f clickhouse-restore-job.yaml
kubectl -n "$NAMESPACE" wait --for=condition=complete job/"$RESTORE_JOB_NAME" --timeout=20m
kubectl -n "$NAMESPACE" logs job/"$RESTORE_JOB_NAME"
```

### 11. Verify Restored Data

```bash
kubectl -n "$NAMESPACE" exec "$RESTORE_POD" -c clickhouse -- \
  clickhouse-client -mn --query "
    SHOW TABLES;
    SELECT count() FROM events_local;
    SELECT count() FROM events;
  "
```

Expected output:

```text
events
events_local
1000
1000
```

### 12. Troubleshooting

#### The Pod shows only one container

If the Pod shows `1/1` and only the `clickhouse-backup` container exists, the custom Pod template replaced the default ClickHouse container. Add the `- name: clickhouse` container entry to the Pod template.

```bash
kubectl -n "$NAMESPACE" get pod "$SOURCE_POD"
```

#### The backup contains only metadata files

Check whether both containers mount the same volume at `/var/lib/clickhouse`.
+ +```bash +kubectl -n "$NAMESPACE" get pod "$SOURCE_POD" -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{range .volumeMounts[*]}{.name}{" -> "}{.mountPath}{"\n"}{end}{end}' +``` + +A valid data backup contains `shadow/...tar` objects in S3. A backup that only contains `metadata/*.json` objects is schema-only or cannot see the ClickHouse data directory. + +```bash +mc find "backup-s3/$S3_BUCKET/backup/shard-0/" --maxdepth 5 +``` + +#### The backup integration tables do not exist + +Check the `clickhouse-backup` container logs. + +```bash +kubectl -n "$NAMESPACE" logs "$SOURCE_POD" -c clickhouse-backup --tail=100 +``` + +Also confirm that `API_CREATE_INTEGRATION_TABLES` is set to `true`. + +#### Distributed table data is not included + +This is expected. Distributed tables are backed up as schema only. Back up and restore the underlying MergeTree-family local tables. + +#### Restore cannot find the backup + +Confirm that the restore instance uses the same `S3_ENDPOINT`, `S3_BUCKET`, and `S3_PATH` as the source instance. + +```bash +kubectl -n "$NAMESPACE" exec "$RESTORE_POD" -c clickhouse-backup -- \ + clickhouse-backup list +``` + +### 13. Validation Result + +This procedure was validated on an Alauda Container Platform 4.3 environment with ClickHouse Operator 4.3 and `altinity/clickhouse-backup:2.6.3`. + +Validated results: + +- The source Pod ran with two containers: `clickhouse` and `clickhouse-backup`. +- Both containers mounted the same `/var/lib/clickhouse` volume. +- `system.backup_actions`, `system.backup_list`, and `system.backup_version` were created. +- `clickhouse-backup create_remote` uploaded metadata and MergeTree data part archives to S3. +- `system.backup_actions` successfully triggered `create_remote `. +- `clickhouse-backup restore_remote ` restored the backup into a separate ClickHouseInstallation. +- The restored `events_local` and `events` tables both returned `1000` rows. + +## Related Information + +- `clickhouse-backup` supports backup and restore for MergeTree-family table engines. +- Use a separate restore ClickHouseInstallation when validating a backup to avoid modifying the source instance. +- For multi-shard clusters, run backup and restore operations once per shard and avoid restoring the same shard data on multiple replicas.
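The last item above describes the per-shard pattern without showing it. The following is a minimal sketch of one way to trigger one backup per shard from inside the cluster; the service naming convention `chi-<chi>-<cluster>-<shard>-<replica>` and the shard count are assumptions, so adjust both for your environment.

```bash
# Hypothetical per-shard backup trigger, run from a pod or Job that has
# in-cluster DNS and clickhouse-client available. One replica per shard is
# contacted and each shard gets a unique backup name; the S3_PATH configured
# in the sidecar already separates shards in the bucket.
SHARDS=2
TS="$(date +%Y%m%d%H%M%S)"
for SHARD in $(seq 0 $((SHARDS - 1))); do
  HOST="chi-${SOURCE_CHI}-${CLUSTER_NAME}-${SHARD}-0.${NAMESPACE}.svc"
  clickhouse-client --host="$HOST" --port=9000 \
    --query="INSERT INTO system.backup_actions(command) VALUES('create_remote full-shard${SHARD}-${TS}')"
done
```

Poll `system.backup_actions` on each shard afterwards, as in the Job and CronJob scripts above, before treating the run as successful.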