The default etcd database size limit in a Kublr environment is 4 GB. The size of the etcd data volume depends on the cloud provider's configuration; typically it is the smallest available SSD storage option.
It is important to note that the 4 GB limit applies to the main database file (by default, /mnt/master-pd/etcd/data/member/snap/db). The data volume also contains WAL and snapshot files, so the total disk space usage is typically several times larger than the database itself.
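For example, on a master node you can compare the size of the main database file with the total size of the data directory (a quick check using the default paths mentioned above; adjust if your data directory differs):
# Size of the main database file vs. the whole etcd data directory (WAL and snapshots included)
$ du -sh /mnt/master-pd/etcd/data/member/snap/db
$ du -sh /mnt/master-pd/etcd/data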
The size of the main database file is guarded by the etcd parameter --quota-backend-bytes, which can be controlled through the etcd_flag custom specification section. By default it is 4 GB; it cannot be set to unlimited, so you must set a specific value.
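As a rough way to see which quota a running etcd member was actually started with, you can inspect its command line on a master node (a sketch, assuming etcd is visible as a host process and pgrep is available; the value shown is illustrative and corresponds to 4 GB):
# Show the quota flag of the running etcd process; if nothing is printed,
# the member is running with etcd's built-in default quota
$ tr '\0' '\n' < /proc/$(pgrep -o etcd)/cmdline | grep quota-backend-bytes
--quota-backend-bytes=4294967296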
There are several processes that contribute to etcd file size in a Kubernetes cluster.
During normal cluster operation, the garbage collector and scheduled compaction do a good job of maintaining the etcd database. However, we have seen instances of rapidly repeating errors, for example pods being evicted and rescheduled every second or so. In such cases the database size can hit the limit before scheduled compaction takes place. This raises an etcd disk space alarm and stops all write operations on the cluster database: you cannot modify cluster data, including deleting cluster objects, and scheduled compaction will not work either.
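A quick way to confirm that the quota has been exceeded is to list active etcd alarms from inside the etcd pod (how to open a shell in the pod is described in the steps below); a NOSPACE alarm indicates this condition:
# List active etcd alarms; NOSPACE means the database quota has been exceeded
$ ETCDCTL_API=3 etcdctl alarm list
memberID:13803658152347727308 alarm:NOSPACE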
This condition can be detected by log messages in apiserver.log, like:
compact.go:124] etcd: endpoint (https://172.16.4.149:2379) compact failed: etcdserver: mvcc: database space exceeded
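If you have shell access to a master node, you can search for this signature directly; the log file path below is only an assumption and depends on how logging is configured in your environment:
# Search the API server log for the quota failure signature
# (replace the path with the actual apiserver.log location in your setup)
$ grep "mvcc: database space exceeded" /var/log/kube-apiserver.log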
Fixing this condition requires operator intervention.
Set up kubectl to access your cluster.
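For example (the kubeconfig path below is a placeholder for your actual admin kubeconfig):
# Point kubectl at the cluster and verify connectivity
$ export KUBECONFIG=/path/to/cluster-admin.kubeconfig
$ kubectl cluster-info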
Find the name of the etcd pod:
$ kubectl get pods -n kube-system | grep etcd
k8s-etcd-09e9eb5f99507d780db842afa1158ef3f77d9586d354031038ae34e63f2025d9-ip-172-16-4-149.ec2.internal 2/2 Running 0 3h41m
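To avoid copying the long pod name by hand, you can capture it in a shell variable and reuse it in the following steps (a convenience sketch; it assumes exactly one etcd pod matches the filter):
# Store the etcd pod name for reuse in subsequent commands
$ ETCD_POD=$(kubectl get pods -n kube-system --no-headers | awk '/k8s-etcd/ {print $1; exit}')
$ echo "$ETCD_POD"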
Launch an interactive shell in the etcd pod (replace k8s-etcd-09… with the actual pod name found in the previous step):
$ kubectl exec -n kube-system k8s-etcd-09e9eb5f99507d780db842afa1158ef3f77d9586d354031038ae34e63f2025d9-ip-172-16-4-149.ec2.internal -c etcd -it /bin/sh
/ #
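Alternatively, individual etcdctl commands can be run non-interactively, without keeping a shell open in the pod (using the $ETCD_POD variable from the sketch above):
# Run a single etcdctl command inside the etcd container
$ kubectl exec -n kube-system "$ETCD_POD" -c etcd -- sh -c 'ETCDCTL_API=3 etcdctl --write-out=table endpoint status'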
In the etcd pod shell, check the etcd disk space usage (in this example the disk quota was lowered; in your case the DB SIZE value would match your quota-backend-bytes setting):
$ ETCDCTL_API=3 etcdctl --write-out=table endpoint status
| ENDPOINT       | ID               | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
|----------------|------------------|---------|---------|-----------|-----------|------------|
| 127.0.0.1:2379 | aadebc09b69225c2 | 3.2.24  | 271 MB  | true      | 2         | 33778      |
In the etcd pod shell, follow the standard etcd procedure for recovering from an exceeded space quota: compact the keyspace, defragment the database, and disarm the alarm, as shown below.
Note: in our setup you do not need to specify the --endpoints option.
# get current revision
$ rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# compact away all old revisions
$ ETCDCTL_API=3 etcdctl compact $rev
compacted revision 1516
# defragment away excessive space
$ ETCDCTL_API=3 etcdctl defrag
Finished defragmenting etcd member[127.0.0.1:2379]
# disarm alarm
$ ETCDCTL_API=3 etcdctl alarm disarm
memberID:13803658152347727308 alarm:NOSPACE
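After disarming the alarm, it is worth confirming, still in the etcd pod shell, that the database size is back under the quota and that no alarms remain (the exact numbers will differ in your cluster):
# DB SIZE should now be well below the quota
$ ETCDCTL_API=3 etcdctl --write-out=table endpoint status
# Empty output means no active alarms
$ ETCDCTL_API=3 etcdctl alarm list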
Leave the etcd pod shell by typing exit or pressing Ctrl-D.
You do not need to restart etcd or other Kubernetes or Kublr pods or services.
Repeat the procedure for all master nodes.
Alternatively, the same procedure can be performed directly on a master node using the Kublr CLI. Obtain a root shell on a master node.
Check the etcd disk space usage (the output should be similar to the previous variant of the procedure):
$ /opt/kublr/bin/kublr etcd ctl -- endpoint status --write-out=table
Compact the keyspace and defragment the database. Note that you need to supply a double-dash (--) separator before the first etcdctl argument that itself starts with a double dash.
# get current revision
$ rev=$(/opt/kublr/bin/kublr etcd ctl -- endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# compact away all old revisions
$ /opt/kublr/bin/kublr etcd ctl compact $rev
compacted revision 1516
# defragment away excessive space
$ /opt/kublr/bin/kublr etcd ctl defrag
Finished defragmenting etcd member[127.0.0.1:2379]
# disarm alarm
$ /opt/kublr/bin/kublr etcd ctl alarm disarm
memberID:13803658152347727308 alarm:NOSPACE
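As in the first variant of the procedure, you can verify the result with the same wrapper (the -- separator is again needed only before arguments that start with a double dash):
# Confirm the reduced database size and the absence of active alarms
$ /opt/kublr/bin/kublr etcd ctl -- endpoint status --write-out=table
$ /opt/kublr/bin/kublr etcd ctl alarm list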