This may happen if the virtual machines in the cluster were re-created, e.g. to recover after a VM failure.
The cluster-manager logs on the master will show a message similar to:
Unable to find VM by UUID. VM UUID: 4215cbe6-2a4f-b8a8-9178-6219df59cd40
The problem is due to the way kubelet registers nodes in the Kubernetes master: after a node is registered in the master for the first time, the providerID field of the Kubernetes Node object cannot be changed.
To confirm that you are hitting exactly this issue, run the following command:
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID,UUID:.status.nodeInfo.systemUUID
NAME PROVIDER_ID UUID
cluster-247-vsp1-group1-worker-0 vsphere://42152218-c14d-f00d-dc13-84176e986471 42152218-C14D-F00D-DC13-84176E986471
cluster-247-vsp1-group1-worker-1 vsphere://42155ab0-e5a1-7bc3-e769-8d8cb56f2c2b 42155AB0-E5A1-7BC3-E769-8D8CB56F2C2B
cluster-247-vsp1-master-0 vsphere://42155d4b-9ee2-344e-e610-b80db41130f0 42155D4B-9EE2-344E-E610-B80DB41130F0
cluster-247-vsp1-master-1 vsphere://4215cbe6-2a4f-b8a8-9178-6219df59cd40 4215DE4E-975F-1DCC-57DD-67330B1653D5
cluster-247-vsp1-master-2 vsphere://4215f891-37f9-8cf3-8bb4-8fde7762db6b 42150441-AB5B-A87E-A2EA-A01AB7560430
Note that the cluster-247-vsp1-master-1 and cluster-247-vsp1-master-2 nodes have different UUID values in the .spec.providerID and .status.nodeInfo.systemUUID fields.
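In larger clusters, comparing the two columns by eye is error-prone. The following pipeline is a minimal sketch of how mismatched nodes could be listed automatically; it assumes every node has a providerID with the vsphere:// prefix shown above and compares the UUIDs case-insensitively:
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.providerID}{" "}{.status.nodeInfo.systemUUID}{"\n"}{end}' \
    | awk '{pid=$2; sub("vsphere://", "", pid); if (tolower(pid) != tolower($3)) print $1}'
Only the names of nodes with inconsistent UUID values are printed, so an empty output means no node is affected.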
To fix the issue:
Remove the Kubernetes nodes with incorrect UUID values by running the following commands:
$ kubectl delete node cluster-247-vsp1-master-1
$ kubectl delete node cluster-247-vsp1-master-2
Restart the kubelet on these nodes via SSH by running the following command, or simply restart the nodes:
# systemctl restart kublr-kubelet
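After the kubelet restarts, the node re-registers and its providerID is derived from the current VM UUID. As a quick sanity check (a minimal sketch, using cluster-247-vsp1-master-1 from the example above), compare the two fields again; both values should now refer to the same UUID, differing only in letter case:
$ kubectl get node cluster-247-vsp1-master-1 -o jsonpath='{.spec.providerID}{"\n"}{.status.nodeInfo.systemUUID}{"\n"}'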
When any operation on a vSphere or VCD cluster results in removing nodes, such as scaling a node group down or removing a node group, Kubernetes does not remove the corresponding node from the Kubernetes API automatically, and the node is reported to be in an error state.
The current workaround is to remove the node from the Kubernetes API manually.
To identify the problematic nodes, run the following command:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
cluster-247-vsp1-group1-worker-0 Ready <none> 98d v1.18.15
cluster-247-vsp1-group1-worker-1 NotReady <none> 98d v1.18.15
cluster-247-vsp1-master-0 Ready master 27d v1.18.18
cluster-247-vsp1-master-1 Ready <none> 259d v1.18.15
cluster-247-vsp1-master-2 Ready <none> 259d v1.18.15
Note that the cluster-247-vsp1-group1-worker-1 node has the NotReady status.
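In clusters with many nodes, the stale entries can also be filtered out of the kubectl output. The following one-liner is a sketch that assumes the STATUS column of the affected nodes contains NotReady (the pattern match also catches combined statuses such as NotReady,SchedulingDisabled):
$ kubectl get nodes --no-headers | awk '$2 ~ /NotReady/ {print $1}'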
Remove the node from Kubernetes API:
$ kubectl delete node cluster-247-vsp1-group1-worker-1
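If several nodes are affected, the listing and the deletion can be combined into a single pipeline. Treat this only as a sketch and double-check the list first, because it deletes every matching node object at once; the -r flag (GNU xargs) simply skips the delete when nothing matches:
$ kubectl get nodes --no-headers \
    | awk '$2 ~ /NotReady/ {print $1}' \
    | xargs -r kubectl delete node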