We cannot always share details about our work with customers, but it is nice to show our technical achievements and share some of the solutions we have implemented.
If you are using Rancher-managed Kubernetes, you might have come across the Rancher reset/clean-up article. A node "reset" is necessary if the node was previously used in a Kubernetes cluster and should be (re-)used in another cluster. Otherwise the remaining config files, which are spread across many different paths, will interfere with the cluster registration phase and can cause very weird problems.
One part of the mentioned clean-up is the removal of all container data in /var/lib/docker/ after the Docker service was stopped. At the end of the clean-up tasks, the Docker service is started again.
Modifying the Kubernetes cluster on which Rancher runs (the so-called "local" cluster) requires a modification of the Rancher YAML file, followed by a run of RKE.
After a cleaned-up and Ubuntu-upgraded node (192.168.253.16) was added to the YAML config (3-node-rancher-n200.yml), RKE was run. But after a few seconds, RKE stopped with an error:
$ ./rke up --config RANCHER2_n200/3-node-rancher-n200.yml
INFO[0000] Running RKE version: v1.3.1
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates
INFO[0000] [certificates] Generating Kubernetes API server certificates
INFO[0000] [certificates] Generating admin certificates and kubeconfig
INFO[0000] [certificates] Generating kube-etcd-192-168-253-15 certificate and key
INFO[0000] [certificates] Generating kube-etcd-192-168-253-16 certificate and key
INFO[0000] [certificates] Generating kube-etcd-192-168-253-17 certificate and key
INFO[0000] Successfully Deployed state file at [RANCHER2_n200/3-node-rancher-n200.rkestate]
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [192.168.253.15]
INFO[0000] [dialer] Setup tunnel for host [192.168.253.16]
INFO[0000] [dialer] Setup tunnel for host [192.168.253.17]
INFO[0002] [network] Deploying port listener containers
INFO[0002] [network] Starting stopped container [rke-etcd-port-listener] on host [192.168.253.15]
INFO[0002] Starting container [rke-etcd-port-listener] on host [192.168.253.15], try #1
INFO[0002] [network] Starting stopped container [rke-etcd-port-listener] on host [192.168.253.17]
INFO[0002] Starting container [rke-etcd-port-listener] on host [192.168.253.17], try #1
INFO[0002] Pulling image [rancher/rke-tools:v0.1.78] on host [192.168.253.16], try #1
INFO[0004] Pulling image [rancher/rke-tools:v0.1.78] on host [192.168.253.16], try #1
INFO[0009] Pulling image [rancher/rke-tools:v0.1.78] on host [192.168.253.16], try #1
WARN[0010] Failed to create Docker container [rke-etcd-port-listener] on host [192.168.253.16]: Error response from daemon: No such image: rancher/rke-tools:v0.1.78
WARN[0010] Failed to create Docker container [rke-etcd-port-listener] on host [192.168.253.16]: Error response from daemon: No such image: rancher/rke-tools:v0.1.78
WARN[0010] Failed to create Docker container [rke-etcd-port-listener] on host [192.168.253.16]: Error response from daemon: No such image: rancher/rke-tools:v0.1.78
FATA[0010] [Failed to create [rke-etcd-port-listener] container on host [192.168.253.16]: Failed to create Docker container [rke-etcd-port-listener] on host [192.168.253.16]: Error response from daemon: No such image: rancher/rke-tools:v0.1.78]
We have been using Rancher Kubernetes and RKE since 2018, and this was the first time we had seen this kind of error. What happened here? Time to put the troubleshooting suit on!
Once logged in on the node to be added to the cluster (192.168.253.16), we manually pulled the image mentioned in the rke output. But this failed, too:
root@192.168.253.16:~# docker pull rancher/rke-tools:v0.1.78
v0.1.78: Pulling from rancher/rke-tools
540db60ca938: Pulling fs layer
0ae30075c5da: Pulling fs layer
9da81141e74e: Pulling fs layer
b2e41dd2ded0: Pulling fs layer
7f40e809fb2d: Pulling fs layer
758848c48411: Pulling fs layer
4aa8101d7589: Pulling fs layer
68bd44136930: Pulling fs layer
ae3790b81ced: Pulling fs layer
f9c62a2baf95: Pulling fs layer
759049e20249: Pulling fs layer
6713b6a87a77: Pulling fs layer
e1631138f00b: Pulling fs layer
063b41c39c12: Pulling fs layer
c7844f3999c4: Pulling fs layer
d8473758fa62: Pulling fs layer
f83b3d05af4e: Pulling fs layer
open /var/lib/docker/tmp/GetImageBlob573083372: no such file or directory
The error at the end mentions "no such file or directory" in combination with the path /var/lib/docker/tmp/. Is there a permission problem? Let's check the contents of /var/lib/docker:
root@192.168.253.16:~# ls -la /var/lib/docker/
total 0
What? It's empty?!
Let's restart the service and check again:
root@192.168.253.16:~# service docker restart
root@192.168.253.16:~# ls -la /var/lib/docker/
total 44
drwx--x--x 4 root root 4096 Nov 11 13:49 buildkit
drwx--x--- 2 root root 4096 Nov 11 13:49 containers
drwx------ 3 root root 4096 Nov 11 13:49 image
drwxr-x--- 3 root root 4096 Nov 11 13:49 network
drwx--x--- 3 root root 4096 Nov 11 13:49 overlay2
drwx------ 4 root root 4096 Nov 11 13:49 plugins
drwx------ 2 root root 4096 Nov 11 13:49 runtimes
drwx------ 2 root root 4096 Nov 11 13:49 swarm
drwx------ 2 root root 4096 Nov 11 13:49 tmp
drwx------ 2 root root 4096 Nov 11 13:49 trust
drwx-----x 2 root root 4096 Nov 11 13:49 volumes
Now the directories are back, weird...
Let's test rke again:
$ ./rke up --config RANCHER2_n200/3-node-rancher-n200.yml
INFO[0000] Running RKE version: v1.3.1
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates
INFO[0000] [certificates] Generating admin certificates and kubeconfig
INFO[0000] Successfully Deployed state file at [RANCHER2_n200/3-node-rancher-n200.rkestate]
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [192.168.253.17]
INFO[0000] [dialer] Setup tunnel for host [192.168.253.15]
INFO[0000] [dialer] Setup tunnel for host [192.168.253.16]
INFO[0000] [network] Deploying port listener containers
INFO[0000] [network] Starting stopped container [rke-etcd-port-listener] on host [192.168.253.17]
INFO[0000] Starting container [rke-etcd-port-listener] on host [192.168.253.17], try #1
INFO[0000] [network] Starting stopped container [rke-etcd-port-listener] on host [192.168.253.15]
INFO[0000] Starting container [rke-etcd-port-listener] on host [192.168.253.15], try #1
INFO[0000] Pulling image [rancher/rke-tools:v0.1.78] on host [192.168.253.16], try #1
INFO[0006] Image [rancher/rke-tools:v0.1.78] exists on host [192.168.253.16]
[...]
This time, rke was able to pull the image and start the containers on the new node, and it finished the process.
Further analysis showed that a "service docker start" indeed does not create missing directories. It does not matter whether you use this command or "systemctl start docker", as both use Systemd in the background. A "service docker restart", however, does create missing directories in /var/lib/docker/.
Going back and installing Docker 19.03.x from the Ubuntu repositories shows that a "start" used to create the directories:
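To spot this situation before running rke, the data root can be checked for its usual sub-directories. The following is only a minimal sketch: the function name docker_root_missing_dirs is hypothetical, and the list of directories it checks is an assumption based on the listings in this article, not an official list.

```shell
#!/bin/sh
# Hypothetical helper: print every expected sub-directory that is missing
# below the given Docker data root (default: /var/lib/docker).
# The directory list is an assumption taken from the listings in this article.
docker_root_missing_dirs() {
    root="${1:-/var/lib/docker}"
    for dir in containers image tmp volumes; do
        # Print the name only when the directory does not exist.
        [ -d "$root/$dir" ] || printf '%s\n' "$dir"
    done
}

# Example usage on a real node (needs root privileges):
#   if [ -n "$(docker_root_missing_dirs)" ]; then
#       service docker restart
#   fi
```

An empty output means all checked directories exist. Any output would be a hint that a restart of the Docker service is needed before the node can pull images.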
root@focal:~# docker info | head
Client:
Debug Mode: false
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 19.03.8
root@focal:~# service docker stop
root@focal:~# rm -rf /var/lib/docker/*
root@focal:~# service docker start
root@focal:~# ll /var/lib/docker/
drwx--x--x 14 root root 4096 Nov 12 07:33 ./
drwxr-xr-x 29 root root 4096 Oct 29 14:06 ../
drwx------ 2 root root 4096 Nov 12 07:33 builder/
drwx--x--x 4 root root 4096 Nov 12 07:33 buildkit/
drwx------ 2 root root 4096 Nov 12 07:33 containers/
drwx------ 3 root root 4096 Nov 12 07:33 image/
drwxr-x--- 3 root root 4096 Nov 12 07:33 network/
drwx------ 3 root root 4096 Nov 12 07:33 overlay2/
drwx------ 4 root root 4096 Nov 12 07:33 plugins/
drwx------ 2 root root 4096 Nov 12 07:33 runtimes/
drwx------ 2 root root 4096 Nov 12 07:33 swarm/
drwx------ 2 root root 4096 Nov 12 07:33 tmp/
drwx------ 2 root root 4096 Nov 12 07:33 trust/
drwx------ 2 root root 4096 Nov 12 07:33 volumes/
With Docker 20.10.x installed, a start no longer re-creates missing directories; only a restart of the Docker service does.
As this is most likely a regression in the docker.io package, Ubuntu bug #195071 was reported.
The Rancher node clean-up/reset procedure was adjusted with the workaround: use "restart" instead of "start".
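The Docker part of such an adjusted clean-up could look roughly like the sketch below. This is not the actual Rancher clean-up script: the function name reset_docker_data and the variables DOCKER_ROOT and SERVICE_CMD are hypothetical and only exist to make the sketch configurable. The rest of the clean-up (removing the Kubernetes config files) is left out.

```shell
#!/bin/sh
# Hypothetical sketch of the Docker part of a node reset.
# DOCKER_ROOT and SERVICE_CMD are illustrative variables, not part of any
# Rancher or Docker tooling.
reset_docker_data() {
    docker_root="${DOCKER_ROOT:-/var/lib/docker}"
    service_cmd="${SERVICE_CMD:-service}"

    # Stop Docker before touching its data directory.
    "$service_cmd" docker stop || return 1

    # Remove all container data but keep the data root itself.
    # ${docker_root:?} aborts if the variable is empty, to avoid "rm -rf /*".
    rm -rf "${docker_root:?}"/*

    # Workaround: use "restart" instead of "start" so that the missing
    # directories are created again.
    "$service_cmd" docker restart
}

# Example usage on a real node (needs root privileges):
#   reset_docker_data
```

Keeping the service command in a variable is merely a convenience to try the function with a stand-in command before running it on a real node.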
Although currently called "the de-facto container infrastructure", Kubernetes is anything but easy. The complexity adds additional problems and considerations. We at Infiniroot love to share our troubleshooting knowledge when we need to tackle certain issues - but we also know this is not for everyone ("it just needs to work"). So if you are looking for a managed and dedicated Kubernetes environment, managed by Rancher 2, with server location Switzerland, check out our Private Kubernetes Container Cloud Infrastructure service at Infiniroot.