Of course we cannot always share details about our work with customers, but it is still nice to show our technical achievements and share some of the solutions we have implemented.
In the past weeks we have been busy upgrading several Rancher-managed Kubernetes clusters. We had already faced Kubernetes upgrade issues related to a missing calico-node role, which we needed to analyze and solve. But on one particular cluster, additional errors appeared right after Kubernetes was upgraded to 1.20 with RKE.
Kubernetes' internal services, such as coredns, are deployed in the kube-system namespace. Right after the RKE-triggered Kubernetes upgrade, all kinds of services in the kube-system namespace started to restart and fail in a loop, without ever recovering.
Our monitoring, using the monitoring plugin check_rancher2, confirmed that workloads in the "System" project were having problems.
The Rancher user interface helps with a visual verification in this case:
However, for deeper analysis (the WHY), the Rancher UI is not sufficient. We needed to dig into the containers themselves to find out more. As we had already encountered similar upgrade issues in the past, we also knew where to look first: the kubelet container.
Looking at the kubelet logs, we quickly identified lots of errors related to the issues seen in the kube-system namespace:
root@rancher1:~# docker logs --tail 50 --follow kubelet
E1122 07:13:47.396336 21820 kuberuntime_manager.go:923] Failed to stop sandbox {"docker" "dabd0aab4750c5540413eb4477adf4bcd9889be88264d89f9e4621126be926b4"}
E1122 07:13:47.396408 21820 kuberuntime_manager.go:702] killPodWithSyncResult failed: failed to "KillPodSandbox" for "06f94bd4-3c48-424c-9287-2b7550882dcf" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"coredns-55b58f978-dd8rw_kube-system\" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope"
E1122 07:13:47.396474 21820 pod_workers.go:191] Error syncing pod 06f94bd4-3c48-424c-9287-2b7550882dcf ("coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)"), skipping: failed to "KillPodSandbox" for "06f94bd4-3c48-424c-9287-2b7550882dcf" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"coredns-55b58f978-dd8rw_kube-system\" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope"
I1122 07:13:50.308662 21820 kuberuntime_manager.go:469] Sandbox for pod "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)" has no IP address. Need to start a new one
W1122 07:13:50.312959 21820 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "d9edf42d36c1e7eb813054b41a8db2decb210cb7e7b6be73846ab9d33fb24fdc"
E1122 07:13:50.479316 21820 cni.go:387] Error deleting kube-system_calico-kube-controllers-7d5d95c8c9-qhnmt/d9edf42d36c1e7eb813054b41a8db2decb210cb7e7b6be73846ab9d33fb24fdc from network calico/k8s-pod-network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope
E1122 07:13:50.480337 21820 remote_runtime.go:143] StopPodSandbox "d9edf42d36c1e7eb813054b41a8db2decb210cb7e7b6be73846ab9d33fb24fdc" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope
E1122 07:13:50.480396 21820 kuberuntime_manager.go:923] Failed to stop sandbox {"docker" "d9edf42d36c1e7eb813054b41a8db2decb210cb7e7b6be73846ab9d33fb24fdc"}
E1122 07:13:50.480524 21820 kuberuntime_manager.go:702] killPodWithSyncResult failed: failed to "KillPodSandbox" for "bf77ce71-2e12-48d8-b062-7c6908c36070" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system\" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope"
E1122 07:13:50.480581 21820 pod_workers.go:191] Error syncing pod bf77ce71-2e12-48d8-b062-7c6908c36070 ("calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)"), skipping: failed to "KillPodSandbox" for "bf77ce71-2e12-48d8-b062-7c6908c36070" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system\" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope"
I1122 07:13:59.307226 21820 kuberuntime_manager.go:469] Sandbox for pod "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)" has no IP address. Need to start a new one
W1122 07:13:59.309160 21820 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "dabd0aab4750c5540413eb4477adf4bcd9889be88264d89f9e4621126be926b4"
E1122 07:13:59.375890 21820 cni.go:387] Error deleting kube-system_coredns-55b58f978-dd8rw/dabd0aab4750c5540413eb4477adf4bcd9889be88264d89f9e4621126be926b4 from network calico/k8s-pod-network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope
[...]
These error messages looked eerily similar to our previous upgrade error analysis (again, check out our article Kubernetes upgrade issues related to a missing calico-node role), yet no missing role was mentioned in the logs. But obviously the user "system:node" was not authorized to retrieve the needed resources (clusterinformations) from the Kubernetes API:
connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope"
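For context: such a request only succeeds if some ClusterRole bound to the requesting user contains a matching rule. A generic grant for this resource would look roughly like the following sketch (the names are made up for illustration; this is not the fix we applied, which follows further below):

```yaml
# Illustrative only - the generic shape of a grant for a CRD-backed resource.
# The name "calico-clusterinfo-reader" is invented for this example.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-clusterinfo-reader
rules:
- apiGroups:
  - crd.projectcalico.org      # the API group from the error message
  resources:
  - clusterinformations        # the resource the kubelet was denied
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-clusterinfo-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-clusterinfo-reader
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: system:node            # the denied user from the error message
```

If no rule bound to the user matches the (apiGroup, resource, verb) triple of a request, the API server answers with exactly the kind of "forbidden" error seen in the logs above.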
To find out more, the Kubernetes cluster roles needed to be analyzed at a deeper level. Using kubectl to take a look at the cluster roles (these are cluster-scoped, so no namespace is needed) and grepping for "node" reveals a couple of roles:
# kubectl get clusterrole | grep node
calico-node 2021-11-22T07:04:59Z
cattle-globalrole-nodedrivers-manage 2018-11-08T14:14:36Z
system:certificates.k8s.io:certificatesigningrequests:nodeclient 2018-11-08T14:04:11Z
system:certificates.k8s.io:certificatesigningrequests:selfnodeclient 2018-11-08T14:04:11Z
system:controller:node-controller 2018-11-08T14:04:11Z
system:node 2018-11-08T14:04:11Z
system:node-bootstrapper 2018-11-08T14:04:11Z
system:node-problem-detector 2018-11-08T14:04:11Z
system:node-proxier 2018-11-08T14:04:11Z
And yes, the system:node role is also part of this list. Let us take a closer look at the rules of this role:
# kubectl get clusterrole system:node -o json
{
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {
        "annotations": {
            "rbac.authorization.kubernetes.io/autoupdate": "true"
        },
        "creationTimestamp": "2018-11-08T14:04:11Z",
        "labels": {
            "kubernetes.io/bootstrapping": "rbac-defaults"
        },
        "name": "system:node",
        "resourceVersion": "89740098",
        "uid": "2af4bab6-e35f-11e8-8a5a-0244812063b0"
    },
    "rules": [
        {
            "apiGroups": [
                "authentication.k8s.io"
            ],
            "resources": [
                "tokenreviews"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                "authorization.k8s.io"
            ],
            "resources": [
                "localsubjectaccessreviews",
                "subjectaccessreviews"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "services"
            ],
            "verbs": [
                "get",
                "list",
                "watch"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "nodes"
            ],
            "verbs": [
                "create",
                "get",
                "list",
                "watch"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "nodes/status"
            ],
            "verbs": [
                "patch",
                "update"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "nodes"
            ],
            "verbs": [
                "delete",
                "patch",
                "update"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "events"
            ],
            "verbs": [
                "create",
                "patch",
                "update"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "pods"
            ],
            "verbs": [
                "get",
                "list",
                "watch"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "pods"
            ],
            "verbs": [
                "create",
                "delete"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "pods/status"
            ],
            "verbs": [
                "patch",
                "update"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "pods/eviction"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "configmaps",
                "secrets"
            ],
            "verbs": [
                "get",
                "list",
                "watch"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "persistentvolumeclaims",
                "persistentvolumes"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "endpoints"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                "certificates.k8s.io"
            ],
            "resources": [
                "certificatesigningrequests"
            ],
            "verbs": [
                "create",
                "get",
                "list",
                "watch"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "persistentvolumeclaims/status"
            ],
            "verbs": [
                "get",
                "patch",
                "update"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "volumeattachments"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "serviceaccounts/token"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csidrivers"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csidrivers"
            ],
            "verbs": [
                "list"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csidrivers"
            ],
            "verbs": [
                "watch"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csinodes"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csinodes"
            ],
            "verbs": [
                "delete"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csinodes"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csinodes"
            ],
            "verbs": [
                "patch"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csinodes"
            ],
            "verbs": [
                "update"
            ]
        },
        {
            "apiGroups": [
                "coordination.k8s.io"
            ],
            "resources": [
                "leases"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                "coordination.k8s.io"
            ],
            "resources": [
                "leases"
            ],
            "verbs": [
                "delete"
            ]
        },
        {
            "apiGroups": [
                "coordination.k8s.io"
            ],
            "resources": [
                "leases"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                "coordination.k8s.io"
            ],
            "resources": [
                "leases"
            ],
            "verbs": [
                "patch"
            ]
        },
        {
            "apiGroups": [
                "coordination.k8s.io"
            ],
            "resources": [
                "leases"
            ],
            "verbs": [
                "update"
            ]
        },
        {
            "apiGroups": [
                "node.k8s.io"
            ],
            "resources": [
                "runtimeclasses"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                "node.k8s.io"
            ],
            "resources": [
                "runtimeclasses"
            ],
            "verbs": [
                "list"
            ]
        },
        {
            "apiGroups": [
                "node.k8s.io"
            ],
            "resources": [
                "runtimeclasses"
            ],
            "verbs": [
                "watch"
            ]
        }
    ]
}
Wow, this is quite a large output! But there is one major problem: rules are missing! For example, the errors mentioned the API group "crd.projectcalico.org", which does not show up in any rule of this cluster role. During our research we came across a thread in the Rancher forums, which discussed the exact same problem we were facing. Luckily one of the commenters (user 1117) also mentioned the missing rules.
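Instead of eyeballing 400+ lines of JSON, the absence of a rule can also be checked mechanically with a simple grep over a saved copy of the role. The following sketch uses a shortened stand-in file (made up for this example) so it can run anywhere; on the cluster you would first dump the real role with `kubectl get clusterrole system:node -o json > /tmp/system-node.json`:

```shell
# Stand-in for the real role dump; on a cluster, create this file with:
#   kubectl get clusterrole system:node -o json > /tmp/system-node.json
cat > /tmp/system-node.json <<'EOF'
{"rules": [{"apiGroups": ["storage.k8s.io"], "resources": ["csinodes"], "verbs": ["get"]}]}
EOF

# Check whether any rule mentions the API group from the error message
if grep -q '"crd.projectcalico.org"' /tmp/system-node.json; then
  echo "rule for crd.projectcalico.org present"
else
  echo "no rule for crd.projectcalico.org found"
fi
```

On our cluster this check came up empty, matching what the manual inspection showed.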
The missing rules can be added on the fly to the existing cluster role using kubectl edit. This opens up an editor (in our case vim):
# kubectl edit clusterrole system:node
[...]
- apiGroups:
  - crd.projectcalico.org
  resources:
  - clusterinformations
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
After pasting the missing rules at the end of the rules list and saving (:wq), the cluster role is immediately updated. Note that this default role carries the rbac.authorization.kubernetes.io/autoupdate annotation, meaning the API server reconciles it at startup; reconciliation adds missing default permissions but should leave additionally granted rules such as these in place.
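As a non-interactive alternative to kubectl edit, the same two rules can be kept in a file and appended with a JSON patch. The following is a sketch; the file name is made up:

```yaml
# add-calico-rules.yaml (hypothetical file name)
# JSON Patch operations appending the two missing rules
# to the end of the existing rules array ("/rules/-").
- op: add
  path: /rules/-
  value:
    apiGroups: ["crd.projectcalico.org"]
    resources: ["clusterinformations"]
    verbs: ["get"]
- op: add
  path: /rules/-
  value:
    apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get"]
```

This would then be applied with `kubectl patch clusterrole system:node --type=json -p "$(cat add-calico-rules.yaml)"`. kubectl should accept the YAML form of the patch and convert it internally; if your version does not, convert the patch to JSON first. This approach has the advantage that the change can be kept in version control and re-applied on other clusters.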
Almost immediately after the system:node cluster role was altered, the workloads in the kube-system namespace came back to life. kubelet stopped showing errors; instead, the containers were correctly created again:
I1122 07:28:24.463682 21820 kubelet.go:1921] SyncLoop (UPDATE, "api"): "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)"
I1122 07:28:25.197914 21820 kubelet.go:1952] SyncLoop (PLEG): "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)", event: &pleg.PodLifecycleEvent{ID:"06f94bd4-3c48-424c-9287-2b7550882dcf", Type:"ContainerDied", Data:"55ac5928079fd25b8b4f97e09be875e7604a1943170d76d539c77e79580ec077"}
W1122 07:28:25.198006 21820 pod_container_deletor.go:79] Container "55ac5928079fd25b8b4f97e09be875e7604a1943170d76d539c77e79580ec077" not found in pod's containers
I1122 07:28:25.375735 21820 kubelet.go:1921] SyncLoop (UPDATE, "api"): "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)"
I1122 07:28:25.395950 21820 provider.go:102] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
I1122 07:28:26.213275 21820 kubelet.go:1952] SyncLoop (PLEG): "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)", event: &pleg.PodLifecycleEvent{ID:"06f94bd4-3c48-424c-9287-2b7550882dcf", Type:"ContainerStarted", Data:"55ac5928079fd25b8b4f97e09be875e7604a1943170d76d539c77e79580ec077"}
I1122 07:28:28.329497 21820 kube_docker_client.go:347] Stop pulling image "rancher/mirrored-coredns-coredns:1.8.0": "Status: Downloaded newer image for rancher/mirrored-coredns-coredns:1.8.0"
I1122 07:28:29.289388 21820 kubelet.go:1952] SyncLoop (PLEG): "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)", event: &pleg.PodLifecycleEvent{ID:"06f94bd4-3c48-424c-9287-2b7550882dcf", Type:"ContainerStarted", Data:"0309eb4bb0a0fdf6fbf862710cb67459449c8647e211eac042a22f4723f302cd"}
I1122 07:28:36.307721 21820 kuberuntime_manager.go:469] Sandbox for pod "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)" has no IP address. Need to start a new one
W1122 07:28:36.323794 21820 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "d9edf42d36c1e7eb813054b41a8db2decb210cb7e7b6be73846ab9d33fb24fdc"
I1122 07:28:36.451630 21820 kubelet.go:1921] SyncLoop (UPDATE, "api"): "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)"
I1122 07:28:37.403661 21820 kubelet.go:1952] SyncLoop (PLEG): "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)", event: &pleg.PodLifecycleEvent{ID:"bf77ce71-2e12-48d8-b062-7c6908c36070", Type:"ContainerStarted", Data:"b59986342cd88d5e773f7879a4411c7f517ca8815d6a4745f654c4033017b241"}
I1122 07:28:37.497271 21820 kubelet.go:1921] SyncLoop (UPDATE, "api"): "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)"
I1122 07:28:40.810715 21820 kube_docker_client.go:347] Stop pulling image "rancher/mirrored-calico-kube-controllers:v3.17.2": "Status: Downloaded newer image for rancher/mirrored-calico-kube-controllers:v3.17.2"
I1122 07:28:41.460486 21820 kubelet.go:1952] SyncLoop (PLEG): "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)", event: &pleg.PodLifecycleEvent{ID:"bf77ce71-2e12-48d8-b062-7c6908c36070", Type:"ContainerStarted", Data:"3745f80270c91bc2fdeff813e27892ed16ae1af2580ebafdd8e831bbedaa9784"}
This went on for a few minutes, until finally all services were successfully deployed and running again:
Although often called "the de-facto container infrastructure", Kubernetes is anything but easy. The complexity adds additional problems and considerations. We at Infiniroot love to share our troubleshooting knowledge when we need to tackle certain issues - but we also know this is not for everyone ("it just needs to work"). So if you are looking for a managed and dedicated Kubernetes environment, managed by Rancher 2, with server location in Switzerland or even in your own datacenter, check out our Private Kubernetes Container Cloud Infrastructure service at Infiniroot.