Backup and restore of a Kubernetes cluster involves the following components: cluster spec, secrets, etcd database content, and persistent volume content.
Kublr implements the full cluster backup procedure described in the Backup section. However, this solution might not fit the customer's environment.
To help customers implement the backup procedure that best suits their requirements, we provide low-level tools for application-level etcd snapshot and restore.
To create a snapshot, run the following command as root on any of the master nodes:
/opt/kublr/bin/kublr etcd backup --file file.db
This will create an application-level snapshot of the etcd database and place it in the file.db file. This command is intended to be run as part of a script that generates a timestamped name for the file and/or uploads it to its final destination.
As with any backup, it is advisable to store it somewhere outside of the node file system.
The file is a standard application-level etcd snapshot, as created by the etcdctl snapshot save command or the equivalent etcd API call.
The snapshot contains consensus data, so it does not matter which master node is used for the snapshot. However, to avoid a single point of failure, you might want to schedule snapshotting on several nodes.
The snapshot made in the previous step can be restored manually according to the procedure described in the etcd disaster recovery document.
Kublr uses etcdX (where X is the master node ordinal) for etcd instance names and https://etcdX.kublr.local:2380 for peer URLs.
You can find the etcd data volume location and other aspects of the Kublr etcd environment in the file /etc/kubernetes/manifests/etcd.manifest. See section Addendum: locating etcd data volume for more information.
Note that etcdctl requires peer URLs to be resolvable during the restore. The names etcdX.kublr.local are not part of the Kubernetes DNS (which will not be operational while etcd is down). You must use /etc/hosts or other means to make them resolvable.
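One way to make the peer names resolvable for the duration of the restore is static /etc/hosts entries. The IP addresses below are placeholders for a hypothetical three-master cluster; substitute your masters' real addresses.

```shell
# Append placeholder entries to /etc/hosts (run as root);
# replace the addresses with your actual master node IPs.
cat >> /etc/hosts <<'EOF'
10.0.1.10 etcd0.kublr.local
10.0.1.11 etcd1.kublr.local
10.0.1.12 etcd2.kublr.local
EOF
```

Remember to remove the entries after the restore if they conflict with your normal name resolution.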
Also note that Kublr marks etcd as a critical pod, so you cannot stop the etcd instance manually (Kublr will forcefully start it again).
Even if you somehow manage to run the restore without stopping Kubernetes (perhaps during the time window of an etcd restart), replacing the etcd database under an active Kubernetes API server will render Kubernetes inoperable, so a cold restart will be needed anyway.
You must stop all Kublr/Kubernetes services by issuing the following commands as root:
service kublr stop
service kublr-kubelet stop
service docker stop
# at this point you can perform the restore
# wait until all other master nodes reach this point
service kublr start
# no need to manually start kublr-kubelet and docker,
# the kublr service will start them automatically
After a successful restore, worker nodes will also need to be restarted.
Warning: restoring the etcd database does not by itself restore the content of persistent volumes. This must be done separately, preferably before attempting to start the node.
To avoid the tedious task of finding node ordinals and constructing the correct etcdctl environment and command arguments for every node, we provide a kublr etcd restore subcommand.
To restore the etcd database using this command, issue the following commands as root on every master node:
# distribute the snapshot file to every master node
service kublr stop
service kublr-kubelet stop
service docker stop
/opt/kublr/bin/kublr etcd restore --file file.db
# wait until all other master nodes reach this point
service kublr start
As with manual restore, all master nodes must be restored from the same snapshot file.
After a successful restore, worker nodes will also need to be restarted.
Warning: restoring the etcd database does not by itself restore the content of persistent volumes. This must be done separately, preferably before attempting to start the node.
The kublr etcd restore command does not perform the actual restore; it just schedules the restore to be performed on etcd pod startup.
To find the output of the actual restore operation, check the logs of the etcd container using the docker logs command. The equivalent kubectl command will be available only if the restore was successful.
The etcd container has a name starting with k8s_etcd_k8s-etcd-.
The restore operation output will most likely be at the top of the log, before the output of the etcd process.
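The container lookup and log inspection can be combined as follows; this is a sketch that assumes the container name prefix stated above and a standard Docker CLI.

```shell
# Find the etcd container by its name prefix (it may be stopped if the
# restore failed, hence -a), then show the top of its log, where the
# restore operation output appears before etcd's own output.
CID=$(docker ps -a --filter 'name=k8s_etcd_k8s-etcd-' --format '{{.ID}}' | head -n 1)
docker logs "$CID" 2>&1 | head -n 50
```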
To abort a scheduled restore, remove the file /mnt/master-pd/etcd/restore.db (see section Addendum: locating etcd data volume to find the etcd data volume location for your instance).
Warning: the etcd restore is a destructive operation, so avoid dry-running kublr etcd restore.
When only one of the etcd instances has failed, there is no need to restore the entire cluster database from a backup. In etcd 3.2 and higher, a single failed node can be restored by replicating the data from the cluster quorum.
The procedure for restoring a single node is also described in the etcd disaster recovery document. In a Kublr environment, this procedure is inconvenient because it requires stopping the etcd instance and reproducing the etcd environment for the etcdctl command.
To aid in the recovery of a single etcd instance, Kublr 1.11.2 and higher implements a control mechanism for scheduling commands to be performed by the running etcd pod.
To schedule the restore of a single node by replication from the cluster quorum, create a file named command in the root directory of the etcd data volume. This file must contain the string reinitialize.
Example:
echo reinitialize > /mnt/master-pd/etcd/command
See section Addendum: locating etcd data volume to find the etcd data volume location for your instance.
The command will be performed by the etcd pod several seconds after the file is created, or on the next restart if the pod is in a crash loop. The command file will be removed after execution. The results of the execution can be checked in the pod/container logs. Some information will also be available in the command-result file in the same directory.
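A small check along these lines tells whether the pod has picked up the command yet. The ETCD_DATA default assumes the standard data volume location; see the addendum if your instance uses a different path.

```shell
# If the command file is still present, the etcd pod has not consumed
# it yet; otherwise print the result file written by the pod.
ETCD_DATA=${ETCD_DATA:-/mnt/master-pd/etcd}
if [ -f "${ETCD_DATA}/command" ]; then
    echo "command not yet picked up by the etcd pod"
else
    cat "${ETCD_DATA}/command-result"
fi
```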
This procedure does not involve replacing the content of the etcd database, so it does not require restarting Kublr and Kubernetes services.
Warning: in the future, additional commands may be added to the pod control mechanism, so avoid creating the command file with any other content.
By default, Kublr starts the etcd container using the host directory /mnt/master-pd/etcd for the etcd data volume. However, this path can be overridden by a custom cluster spec or by platform-specific defaults.
The actual host path of the etcd data volume can be found by two methods:
1. Run the command /opt/kublr/bin/kublr validate. The output of this command is a YAML data stream; the host path of the etcd data volume is controlled by the parameter etcd_storage.path.
2. Use the kubectl describe pod command, or read the YAML file /etc/kubernetes/manifests/etcd.manifest on the master node. The path is in hostPath.path of the volume named data; volume parameters are located in the spec.volumes section of the manifest.
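As a sketch of the second method, the hostPath stanza can be pulled out of the static pod manifest with a simple grep. This assumes the default manifest layout; a YAML-aware tool would be more robust.

```shell
# Show the hostPath block of the etcd manifest; the volume named
# "data" carries the etcd data directory, so its path: line is the
# host location of the etcd data volume.
grep -A2 'hostPath' /etc/kubernetes/manifests/etcd.manifest
```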