ClickHouse backup and restore
Why back up ClickHouse
Although ClickHouse has replicated-table engines and some built-in safeguards (by default you cannot manually drop a MergeTree-family table containing more than 50 GB of data), none of that protects against human error, an unlucky cluster or hardware failure, or compliance requirements. As with any database, backups are worth having as an extra layer of insurance.
The backup methods and tools below come from the official documentation: https://clickhouse.com/docs/zh/operations/backup
It is recommended to get the relevant scripts working in a test or development environment first, before rolling them out to production.
Manual backup
A manual backup uses ClickHouse's ALTER TABLE ... FREEZE PARTITION ... command, which creates hard links to the table's data parts under /var/lib/clickhouse/shadow. To restore, the saved parts are kept under /var/lib/clickhouse/backup with the chosen backup name, copied from there into the table's detached directory, and re-attached with ALTER TABLE ... ATTACH PARTITION (see the filesystem sketch after the restore example below).
Note that manual backup and restore only cover the data, not the table structure, so the schema must be archived and re-created separately.
Based on the above, first create /var/lib/clickhouse/{shadow,backup}; both directories must be owned by the clickhouse user and group. If a custom data directory is set in the configuration file, create them under that directory instead.
Backup example
alter table tableName freeze
# run it non-interactively instead of entering the interactive client
echo 'alter table tableName freeze ' | clickhouse-client -u xxxx --password xxxx
Restore example
alter table tableName attach partition Namexxx
# run it non-interactively instead of entering the interactive client
echo 'alter table tableName attach partition Namexxx ' | clickhouse-client -u xxxx --password xxxx
# check the restored table's row count
echo 'select count() from tableName' | clickhouse-client -u xxxx --password xxxx
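For reference, here is a minimal filesystem-level sketch of the manual flow, assuming an Ordinary-engine database named default, a table tableName, and a partition Namexxx (all placeholders); with Atomic databases the frozen parts live under shadow/<N>/store/<uuid>/ rather than shadow/<N>/data/<db>/<table>/, so adjust the paths to your setup.
# 1) after `alter table tableName freeze`, copy the hard-linked parts out of shadow/
#    shadow/1/ is the increment FREEZE just created; check `ls /var/lib/clickhouse/shadow/`
BACKUP_NAME=manual_$(date +%F)
mkdir -p /var/lib/clickhouse/backup/$BACKUP_NAME
cp -rp /var/lib/clickhouse/shadow/1/data/default/tableName/. /var/lib/clickhouse/backup/$BACKUP_NAME/
chown -R clickhouse:clickhouse /var/lib/clickhouse/backup/$BACKUP_NAME
# 2) to restore, place the saved parts into the table's detached directory ...
cp -rp /var/lib/clickhouse/backup/$BACKUP_NAME/. /var/lib/clickhouse/data/default/tableName/detached/
chown -R clickhouse:clickhouse /var/lib/clickhouse/data/default/tableName/detached
# 3) ... then re-attach them: alter table tableName attach partition Namexxx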
Tool-based backup
The officially recommended tool is clickhouse-backup, available at https://github.com/AlexAkulov/clickhouse-backup
It backs up not only the data but also the table schema, and it can upload backups to third-party storage such as S3 (OSS).
Download a package from the releases page; the rpm package can simply be installed with rpm -ivh and used globally. Here I use the tar package.
Installation and deployment
tar -zxvf clickhouse-backup.tar.gz # extract the archive
cd clickhouse-backup
cp clickhouse-backup /usr/local/bin/ # copy the binary to /usr/local/bin
mkdir /etc/clickhouse-backup # create a directory under /etc to hold the config file config.yml
cp config.yml /etc/clickhouse-backup/
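As a quick sanity check after copying the binary (exact command names can vary between clickhouse-backup versions, so treat this as a sketch):
/usr/local/bin/clickhouse-backup --version # confirm the binary runs
/usr/local/bin/clickhouse-backup default-config # print the built-in defaults as a reference for editing config.yml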
The configuration file config.yml is as follows:
general:
remote_storage: none # REMOTE_STORAGE, if `none` then `upload` and `download` command will fail
max_file_size: 1099511627776 # MAX_FILE_SIZE, 1TB by default, useless when upload_by_part is true, use for split data parts files by archives
disable_progress_bar: true # DISABLE_PROGRESS_BAR, show progress bar during upload and download, makes sense only when `upload_concurrency` and `download_concurrency` is 1
backups_to_keep_local: 1 # BACKUPS_TO_KEEP_LOCAL, how many latest local backup should be kept, 0 means all created backups will be stored on local disk
# You shall run `clickhouse-backup delete local <backup_name>` command to remove temporary backup files from the local disk
backups_to_keep_remote: 2 # BACKUPS_TO_KEEP_REMOTE, how many latest backup should be kept on remote storage, 0 means all uploaded backups will be stored on remote storage.
# If old backups are required for newer incremental backup then it won't be deleted. Be careful with long incremental backup sequences.
log_level: info # LOG_LEVEL, a choice from `debug`, `info`, `warn`, `error`
allow_empty_backups: false # ALLOW_EMPTY_BACKUPS
# concurrency means parallel tables and parallel parts inside tables
# for example 4 means max 4 parallel tables and 4 parallel parts inside one table, so equals 16 concurrent streams
download_concurrency: 1 # DOWNLOAD_CONCURRENCY, max 255, by default, the value is round(sqrt(AVAILABLE_CPU_CORES / 2))
upload_concurrency: 1 # UPLOAD_CONCURRENCY, max 255, by default, the value is round(sqrt(AVAILABLE_CPU_CORES / 2))
# RESTORE_SCHEMA_ON_CLUSTER, execute all schema related SQL queries with `ON CLUSTER` clause as Distributed DDL.
# Check `system.clusters` table for the correct cluster name, also `system.macros` can be used.
# This isn't applicable when `use_embedded_backup_restore: true`
restore_schema_on_cluster: ""
upload_by_part: true # UPLOAD_BY_PART
download_by_part: true # DOWNLOAD_BY_PART
use_resumable_state: true # USE_RESUMABLE_STATE, allow resume upload and download according to the <backup_name>.resumable file
# RESTORE_DATABASE_MAPPING, restore rules from backup databases to target databases, which is useful when changing destination database, all atomic tables will be created with new UUIDs.
# The format for this env variable is "src_db1:target_db1,src_db2:target_db2". For YAML please continue using map syntax
restore_database_mapping: {}
retries_on_failure: 3 # RETRIES_ON_FAILURE, how many times to retry after a failure during upload or download
retries_pause: 30s # RETRIES_PAUSE, duration time to pause after each download or upload failure
clickhouse:
username: default # CLICKHOUSE_USERNAME
password: "" # CLICKHOUSE_PASSWORD
host: localhost # CLICKHOUSE_HOST, to backup data clickhouse-backup requires access to the same file system as clickhouse-server; the host could be the address of another docker container on the same machine, or an IP address bound to some network interface on the same host
port: 9000 # CLICKHOUSE_PORT, don't use 8123, clickhouse-backup doesn't support HTTP protocol
# CLICKHOUSE_DISK_MAPPING, use this mapping when your `system.disks` are different between the source and destination clusters during backup and restore process
# The format for this env variable is "disk_name1:disk_path1,disk_name2:disk_path2". For YAML please continue using map syntax
disk_mapping: {}
# CLICKHOUSE_SKIP_TABLES, the list of tables (pattern are allowed) which are ignored during backup and restore process
# The format for this env variable is "pattern1,pattern2,pattern3". For YAML please continue using map syntax
skip_tables:
- system.*
- INFORMATION_SCHEMA.*
- information_schema.*
timeout: 5m # CLICKHOUSE_TIMEOUT
freeze_by_part: false # CLICKHOUSE_FREEZE_BY_PART, allow freezing by part instead of freezing the whole table
freeze_by_part_where: "" # CLICKHOUSE_FREEZE_BY_PART_WHERE, allow parts filtering during freezing when freeze_by_part: true
secure: false # CLICKHOUSE_SECURE, use TLS encryption for connection
skip_verify: false # CLICKHOUSE_SKIP_VERIFY, skip certificate verification and allow potential certificate warnings
sync_replicated_tables: true # CLICKHOUSE_SYNC_REPLICATED_TABLES
tls_key: "" # CLICKHOUSE_TLS_KEY, filename with TLS key file
tls_cert: "" # CLICKHOUSE_TLS_CERT, filename with TLS certificate file
tls_ca: "" # CLICKHOUSE_TLS_CA, filename with TLS custom authority file
log_sql_queries: true # CLICKHOUSE_LOG_SQL_QUERIES, enable logging `clickhouse-backup` SQL queries on `system.query_log` table inside clickhouse-server
debug: false # CLICKHOUSE_DEBUG
config_dir: "/etc/clickhouse-server" # CLICKHOUSE_CONFIG_DIR
restart_command: "systemctl restart clickhouse-server" # CLICKHOUSE_RESTART_COMMAND, use this command when restoring with --rbac or --config options
ignore_not_exists_error_during_freeze: true # CLICKHOUSE_IGNORE_NOT_EXISTS_ERROR_DURING_FREEZE, helps to avoid backup failures when running frequent CREATE / DROP tables and databases during backup, `clickhouse-backup` will ignore `code: 60` and `code: 81` errors during execution of `ALTER TABLE ... FREEZE`
check_replicas_before_attach: true # CLICKHOUSE_CHECK_REPLICAS_BEFORE_ATTACH, helps avoiding concurrent ATTACH PART execution when restoring ReplicatedMergeTree tables
use_embedded_backup_restore: false # CLICKHOUSE_USE_EMBEDDED_BACKUP_RESTORE, use BACKUP / RESTORE SQL statements instead of regular SQL queries to use features of modern ClickHouse server versions
azblob:
endpoint_suffix: "core.windows.net" # AZBLOB_ENDPOINT_SUFFIX
account_name: "" # AZBLOB_ACCOUNT_NAME
account_key: "" # AZBLOB_ACCOUNT_KEY
sas: "" # AZBLOB_SAS
use_managed_identity: false # AZBLOB_USE_MANAGED_IDENTITY
container: "" # AZBLOB_CONTAINER
path: "" # AZBLOB_PATH, `system.macros` values could be applied as {macro_name}
compression_level: 1 # AZBLOB_COMPRESSION_LEVEL
compression_format: tar # AZBLOB_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for upload data part folders as is
sse_key: "" # AZBLOB_SSE_KEY
buffer_size: 0 # AZBLOB_BUFFER_SIZE, if less or eq 0 then it is calculated as max_file_size / max_parts_count, between 2Mb and 4Mb
max_parts_count: 10000 # AZBLOB_MAX_PARTS_COUNT, number of parts for AZBLOB uploads, for properly calculate buffer size
max_buffers: 3 # AZBLOB_MAX_BUFFERS
s3:
access_key: "" # S3_ACCESS_KEY
secret_key: "" # S3_SECRET_KEY
bucket: "" # S3_BUCKET
endpoint: "" # S3_ENDPOINT
region: us-east-1 # S3_REGION
acl: private # S3_ACL
assume_role_arn: "" # S3_ASSUME_ROLE_ARN
force_path_style: false # S3_FORCE_PATH_STYLE
path: "" # S3_PATH, `system.macros` values could be applied as {macro_name}
disable_ssl: false # S3_DISABLE_SSL
compression_level: 1 # S3_COMPRESSION_LEVEL
compression_format: tar # S3_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for upload data part folders as is
# look details in https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html
sse: "" # S3_SSE, empty (default), AES256, or aws:kms
sse_kms_key_id: "" # S3_SSE_KMS_KEY_ID, if S3_SSE is aws:kms then specifies the ID of the Amazon Web Services Key Management Service
sse_customer_algorithm: "" # S3_SSE_CUSTOMER_ALGORITHM, encryption algorithm, for example, AES256
sse_customer_key: "" # S3_SSE_CUSTOMER_KEY, customer-provided encryption key
sse_customer_key_md5: "" # S3_SSE_CUSTOMER_KEY_MD5, 128-bit MD5 digest of the encryption key according to RFC 1321
sse_kms_encryption_context: "" # S3_SSE_KMS_ENCRYPTION_CONTEXT, base64-encoded UTF-8 string holding a JSON with the encryption context
# Specifies the Amazon Web Services KMS Encryption Context to use for object encryption.
# This is a collection of non-secret key-value pairs that represent additional authenticated data.
# When you use an encryption context to encrypt data, you must specify the same (an exact case-sensitive match)
# encryption context to decrypt the data. An encryption context is supported only on operations with symmetric encryption KMS keys
disable_cert_verification: false # S3_DISABLE_CERT_VERIFICATION
use_custom_storage_class: false # S3_USE_CUSTOM_STORAGE_CLASS
storage_class: STANDARD # S3_STORAGE_CLASS, by default allow only from list https://github.com/aws/aws-sdk-go-v2/blob/main/service/s3/types/enums.go#L787-L799
concurrency: 1 # S3_CONCURRENCY
part_size: 0 # S3_PART_SIZE, if less or eq 0 then it is calculated as max_file_size / max_parts_count, between 5MB and 5Gb
max_parts_count: 10000 # S3_MAX_PARTS_COUNT, number of parts for S3 multipart uploads
allow_multipart_download: false # S3_ALLOW_MULTIPART_DOWNLOAD, allow faster download and upload speeds, but will require additional disk space, download_concurrency * part size in worst case
# S3_OBJECT_LABELS, allow setup metadata for each object during upload, use {macro_name} from system.macros and {backupName} for current backup name
# The format for this env variable is "key1:value1,key2:value2". For YAML please continue using map syntax
object_labels: {}
# S3_CUSTOM_STORAGE_CLASS_MAP, allow setup storage class depending on the backup name regexp pattern, format nameRegexp > className
custom_storage_class_map: {}
debug: false # S3_DEBUG
gcs:
credentials_file: "" # GCS_CREDENTIALS_FILE
credentials_json: "" # GCS_CREDENTIALS_JSON
credentials_json_encoded: "" # GCS_CREDENTIALS_JSON_ENCODED
bucket: "" # GCS_BUCKET
path: "" # GCS_PATH, `system.macros` values could be applied as {macro_name}
compression_level: 1 # GCS_COMPRESSION_LEVEL
compression_format: tar # GCS_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for upload data part folders as is
storage_class: STANDARD # GCS_STORAGE_CLASS
# GCS_OBJECT_LABELS, allow setup metadata for each object during upload, use {macro_name} from system.macros and {backupName} for current backup name
# The format for this env variable is "key1:value1,key2:value2". For YAML please continue using map syntax
object_labels: {}
# GCS_CUSTOM_STORAGE_CLASS_MAP, allow setup storage class depends on backup name regexp pattern, format nameRegexp > className
custom_storage_class_map: {}
debug: false # GCS_DEBUG
cos:
url: "" # COS_URL
timeout: 2m # COS_TIMEOUT
secret_id: "" # COS_SECRET_ID
secret_key: "" # COS_SECRET_KEY
path: "" # COS_PATH, `system.macros` values could be applied as {macro_name}
compression_format: tar # COS_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for upload data part folders as is
compression_level: 1 # COS_COMPRESSION_LEVEL
ftp:
address: "" # FTP_ADDRESS
timeout: 2m # FTP_TIMEOUT
username: "" # FTP_USERNAME
password: "" # FTP_PASSWORD
tls: false # FTP_TLS
path: "" # FTP_PATH, `system.macros` values could be applied as {macro_name}
compression_format: tar # FTP_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for upload data part folders as is
compression_level: 1 # FTP_COMPRESSION_LEVEL
debug: false # FTP_DEBUG
sftp:
address: "" # SFTP_ADDRESS
username: "" # SFTP_USERNAME
password: "" # SFTP_PASSWORD
key: "" # SFTP_KEY
path: "" # SFTP_PATH, `system.macros` values could be applied as {macro_name}
concurrency: 1 # SFTP_CONCURRENCY
compression_format: tar # SFTP_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for upload data part folders as is
compression_level: 1 # SFTP_COMPRESSION_LEVEL
debug: false # SFTP_DEBUG
custom:
upload_command: "" # CUSTOM_UPLOAD_COMMAND
download_command: "" # CUSTOM_DOWNLOAD_COMMAND
delete_command: "" # CUSTOM_DELETE_COMMAND
list_command: "" # CUSTOM_LIST_COMMAND
command_timeout: "4h" # CUSTOM_COMMAND_TIMEOUT
api:
listen: "localhost:7171" # API_LISTEN
enable_metrics: true # API_ENABLE_METRICS
enable_pprof: false # API_ENABLE_PPROF
username: "" # API_USERNAME, basic authorization for API endpoint
password: "" # API_PASSWORD
secure: false # API_SECURE, use TLS for listen API socket
certificate_file: "" # API_CERTIFICATE_FILE
private_key_file: "" # API_PRIVATE_KEY_FILE
integration_tables_host: "" # API_INTEGRATION_TABLES_HOST, allow using DNS name to connect in `system.backup_list` and `system.backup_actions`
allow_parallel: false # API_ALLOW_PARALLEL, enable parallel operations, this allows for significant memory allocation and spawns go-routines, don't enable it if you are not sure
create_integration_tables: false # API_CREATE_INTEGRATION_TABLES, create `system.backup_list` and `system.backup_actions`
complete_resumable_after_restart: true # API_COMPLETE_RESUMABLE_AFTER_RESTART, after API server startup, if `/var/lib/clickhouse/backup/*/(upload|download).state` present, then operation will continue in the background
Creating backups locally and remotely
Backing up data locally
Since I am using the tar package here, the config file has to be passed explicitly when running commands; if you do not want to pass it every time, place it at /etc/clickhouse-backup/config.yml. By default, backups are written under /var/lib/clickhouse/backup.
If a custom data directory is set in the configuration file, backups are created under that directory instead, i.e. path/clickhouse/backup.
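Once a backup has been created (see the commands below), you can confirm it landed in that local path, for example:
ls -lh /var/lib/clickhouse/backup/ # each backup is a folder named after the backup name or timestamp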
List the tables that can be backed up
/usr/local/bin/clickhouse-backup tables
Backup commands
# back up all tables
/usr/local/bin/clickhouse-backup -c config.yml create
# back up all tables and give the backup a name; by default the current timestamp is used, and each backup is stored as a folder
/usr/local/bin/clickhouse-backup -c config.yml create bak_Namexxx
# back up a specific table
/usr/local/bin/clickhouse-backup -c config.yml create -t default.tableName bak_Namexxx
# back up multiple tables
/usr/local/bin/clickhouse-backup -c config.yml create -t default.tableName1,default.tableName2 bak_Namexxx
# list existing backups
/usr/local/bin/clickhouse-backup -c config.yml list
# delete a local backup
/usr/local/bin/clickhouse-backup -c config.yml delete local bak_Namexxx
Restore commands
# restore data and schema from a named backup
/usr/local/bin/clickhouse-backup -c config.yml restore bak_Namexxx
# restore data and schema for a specific table
/usr/local/bin/clickhouse-backup -c config.yml restore -t default.tableName bak_Namexxx
# restore the table schema only
/usr/local/bin/clickhouse-backup -c config.yml restore -s bak_Namexxx
# restore the data only
/usr/local/bin/clickhouse-backup -c config.yml restore -d bak_Namexxx
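After a restore it is worth sanity-checking the result, for example by comparing row counts the same way as in the manual section (default.tableName is a placeholder):
echo 'select count() from default.tableName' | clickhouse-client -u xxxx --password xxxx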
Creating a remote backup
The clickhouse-backup tool supports several remote storage backends, such as s3, ftp, sftp and so on. For remote backups, change the remote_storage parameter in config.yml; see the official documentation for details. A minimal sftp example is sketched below.
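A minimal sketch of the change, assuming sftp as the remote target; the address, credentials, and path are placeholders, and everything else keeps the defaults shown in the full config above:
general:
  remote_storage: sftp             # switch from `none` to the chosen backend
sftp:
  address: "backup.example.com"    # placeholder backup server
  username: "backupuser"           # placeholder credentials
  password: "xxxx"
  path: "/data/clickhouse-backup"  # remote directory that will hold the backups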
# create a remote backup (create it locally, then upload it to the configured remote storage)
/usr/local/bin/clickhouse-backup -c config.yml create_remote
# download a remote backup to local disk
/usr/local/bin/clickhouse-backup -c config.yml download bak_Namexxx
# restore from a remote backup (download it, then restore data and schema)
/usr/local/bin/clickhouse-backup -c config.yml restore_remote bak_Namexxx
#----------------------------------------------------------------------
# upload the backup to another server with scp; if sftp is configured in config.yml, this step is not needed
scp -rp /var/lib/clickhouse/backup/<backup_name> name@host:/data/clickhouse-backup/
# alternatively, upload the backup to the server via the sftp settings in the config file
/usr/local/bin/clickhouse-backup upload bak_Namexxx
Scheduled backups with crontab
clickhouse_backup.sh
#!/bin/bash
BACKUP_NAME=ch_backup_$(date +%Y-%m-%dT%H-%M-%S)
/usr/local/bin/clickhouse-backup create $BACKUP_NAME # create a local backup
/usr/local/bin/clickhouse-backup upload $BACKUP_NAME # after the local backup completes, upload it to the remote server
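To schedule it, add a crontab entry for the appropriate user; the schedule, script path, and log path below are only examples:
# run the backup script every day at 02:00 and keep its output in a log file
0 2 * * * /bin/bash /opt/scripts/clickhouse_backup.sh >> /var/log/clickhouse_backup.log 2>&1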