Of course we cannot always share details about our work with customers, but it is still nice to show our technical achievements and share some of the solutions we have implemented.
In the past weeks we have been busy upgrading several Rancher-managed Kubernetes clusters. We had already faced Kubernetes upgrade issues related to a missing calico-node role, which we needed to analyze and solve. But on one particular cluster, additional errors appeared right after Kubernetes was upgraded to 1.20 with RKE.
Kubernetes' internal services, such as coredns, are deployed in the kube-system namespace. Right after the RKE-triggered Kubernetes upgrade, all kinds of services in the kube-system namespace started to restart and fail in a loop, without ever recovering.
Our monitoring, using the monitoring plugin check_rancher2, confirmed that workloads in the "System" project were having problems.
The Rancher user interface helps with a visual verification in this case:
However, for deeper analysis (the WHY), the Rancher UI is not sufficient. We needed to dig into the containers themselves to find out more. As we had already encountered similar upgrade issues in the past, we also knew where to look first: the kubelet container.
Looking at the kubelet logs, we quickly identified lots of errors related to the issues seen in the kube-system namespace:
root@rancher1:~# docker logs --tail 50 --follow kubelet
E1122 07:13:47.396336 21820 kuberuntime_manager.go:923] Failed to stop sandbox {"docker" "dabd0aab4750c5540413eb4477adf4bcd9889be88264d89f9e4621126be926b4"}
E1122 07:13:47.396408 21820 kuberuntime_manager.go:702] killPodWithSyncResult failed: failed to "KillPodSandbox" for "06f94bd4-3c48-424c-9287-2b7550882dcf" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"coredns-55b58f978-dd8rw_kube-system\" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope"
E1122 07:13:47.396474 21820 pod_workers.go:191] Error syncing pod 06f94bd4-3c48-424c-9287-2b7550882dcf ("coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)"), skipping: failed to "KillPodSandbox" for "06f94bd4-3c48-424c-9287-2b7550882dcf" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"coredns-55b58f978-dd8rw_kube-system\" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope"
I1122 07:13:50.308662 21820 kuberuntime_manager.go:469] Sandbox for pod "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)" has no IP address. Need to start a new one
W1122 07:13:50.312959 21820 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "d9edf42d36c1e7eb813054b41a8db2decb210cb7e7b6be73846ab9d33fb24fdc"
E1122 07:13:50.479316 21820 cni.go:387] Error deleting kube-system_calico-kube-controllers-7d5d95c8c9-qhnmt/d9edf42d36c1e7eb813054b41a8db2decb210cb7e7b6be73846ab9d33fb24fdc from network calico/k8s-pod-network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope
E1122 07:13:50.480337 21820 remote_runtime.go:143] StopPodSandbox "d9edf42d36c1e7eb813054b41a8db2decb210cb7e7b6be73846ab9d33fb24fdc" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope
E1122 07:13:50.480396 21820 kuberuntime_manager.go:923] Failed to stop sandbox {"docker" "d9edf42d36c1e7eb813054b41a8db2decb210cb7e7b6be73846ab9d33fb24fdc"}
E1122 07:13:50.480524 21820 kuberuntime_manager.go:702] killPodWithSyncResult failed: failed to "KillPodSandbox" for "bf77ce71-2e12-48d8-b062-7c6908c36070" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system\" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope"
E1122 07:13:50.480581 21820 pod_workers.go:191] Error syncing pod bf77ce71-2e12-48d8-b062-7c6908c36070 ("calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)"), skipping: failed to "KillPodSandbox" for "bf77ce71-2e12-48d8-b062-7c6908c36070" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system\" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope"
I1122 07:13:59.307226 21820 kuberuntime_manager.go:469] Sandbox for pod "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)" has no IP address. Need to start a new one
W1122 07:13:59.309160 21820 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "dabd0aab4750c5540413eb4477adf4bcd9889be88264d89f9e4621126be926b4"
E1122 07:13:59.375890 21820 cni.go:387] Error deleting kube-system_coredns-55b58f978-dd8rw/dabd0aab4750c5540413eb4477adf4bcd9889be88264d89f9e4621126be926b4 from network calico/k8s-pod-network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope
[...]
These error messages looked eerily similar to our previous upgrade error analysis (again, check out our article Kubernetes upgrade issues related to a missing calico-node role), yet no missing role was mentioned in the logs. But obviously the user "system:node" was not authorized to retrieve the needed resources (clusterinformations) from the Kubernetes API:
connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope"
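For context: such a request only succeeds if some ClusterRole bound to the requesting user contains a matching rule. A generic grant for this resource would look roughly like the following sketch (the names are made up for illustration; this is not the fix we applied, which follows further below):

```yaml
# Illustrative only - the generic shape of a grant for a CRD-backed resource.
# The name "calico-clusterinfo-reader" is invented for this example.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-clusterinfo-reader
rules:
- apiGroups:
  - crd.projectcalico.org      # the API group from the error message
  resources:
  - clusterinformations        # the resource the kubelet was denied
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-clusterinfo-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-clusterinfo-reader
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: system:node            # the denied user from the error message
```

If no rule bound to the user matches the (apiGroup, resource, verb) triple of a request, the API server answers with exactly the kind of "forbidden" error seen in the logs above.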
To find out more, the Kubernetes cluster roles needed to be analyzed at a deeper level. Using kubectl to take a look at the cluster roles (these are cluster-scoped, so no namespace is needed) and grepping for "node" reveals a couple of roles:
# kubectl get clusterrole | grep node
calico-node 2021-11-22T07:04:59Z
cattle-globalrole-nodedrivers-manage 2018-11-08T14:14:36Z
system:certificates.k8s.io:certificatesigningrequests:nodeclient 2018-11-08T14:04:11Z
system:certificates.k8s.io:certificatesigningrequests:selfnodeclient 2018-11-08T14:04:11Z
system:controller:node-controller 2018-11-08T14:04:11Z
system:node 2018-11-08T14:04:11Z
system:node-bootstrapper 2018-11-08T14:04:11Z
system:node-problem-detector 2018-11-08T14:04:11Z
system:node-proxier 2018-11-08T14:04:11Z
And yes, the system:node role is also part of this list. Let us take a closer look at the rules of this role:
# kubectl get clusterrole system:node -o json
{
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {
        "annotations": {
            "rbac.authorization.kubernetes.io/autoupdate": "true"
        },
        "creationTimestamp": "2018-11-08T14:04:11Z",
        "labels": {
            "kubernetes.io/bootstrapping": "rbac-defaults"
        },
        "name": "system:node",
        "resourceVersion": "89740098",
        "uid": "2af4bab6-e35f-11e8-8a5a-0244812063b0"
    },
    "rules": [
        {
            "apiGroups": [
                "authentication.k8s.io"
            ],
            "resources": [
                "tokenreviews"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                "authorization.k8s.io"
            ],
            "resources": [
                "localsubjectaccessreviews",
                "subjectaccessreviews"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "services"
            ],
            "verbs": [
                "get",
                "list",
                "watch"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "nodes"
            ],
            "verbs": [
                "create",
                "get",
                "list",
                "watch"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "nodes/status"
            ],
            "verbs": [
                "patch",
                "update"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "nodes"
            ],
            "verbs": [
                "delete",
                "patch",
                "update"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "events"
            ],
            "verbs": [
                "create",
                "patch",
                "update"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "pods"
            ],
            "verbs": [
                "get",
                "list",
                "watch"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "pods"
            ],
            "verbs": [
                "create",
                "delete"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "pods/status"
            ],
            "verbs": [
                "patch",
                "update"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "pods/eviction"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "configmaps",
                "secrets"
            ],
            "verbs": [
                "get",
                "list",
                "watch"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "persistentvolumeclaims",
                "persistentvolumes"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "endpoints"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                "certificates.k8s.io"
            ],
            "resources": [
                "certificatesigningrequests"
            ],
            "verbs": [
                "create",
                "get",
                "list",
                "watch"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "persistentvolumeclaims/status"
            ],
            "verbs": [
                "get",
                "patch",
                "update"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "volumeattachments"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                ""
            ],
            "resources": [
                "serviceaccounts/token"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csidrivers"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csidrivers"
            ],
            "verbs": [
                "list"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csidrivers"
            ],
            "verbs": [
                "watch"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csinodes"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csinodes"
            ],
            "verbs": [
                "delete"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csinodes"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csinodes"
            ],
            "verbs": [
                "patch"
            ]
        },
        {
            "apiGroups": [
                "storage.k8s.io"
            ],
            "resources": [
                "csinodes"
            ],
            "verbs": [
                "update"
            ]
        },
        {
            "apiGroups": [
                "coordination.k8s.io"
            ],
            "resources": [
                "leases"
            ],
            "verbs": [
                "create"
            ]
        },
        {
            "apiGroups": [
                "coordination.k8s.io"
            ],
            "resources": [
                "leases"
            ],
            "verbs": [
                "delete"
            ]
        },
        {
            "apiGroups": [
                "coordination.k8s.io"
            ],
            "resources": [
                "leases"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                "coordination.k8s.io"
            ],
            "resources": [
                "leases"
            ],
            "verbs": [
                "patch"
            ]
        },
        {
            "apiGroups": [
                "coordination.k8s.io"
            ],
            "resources": [
                "leases"
            ],
            "verbs": [
                "update"
            ]
        },
        {
            "apiGroups": [
                "node.k8s.io"
            ],
            "resources": [
                "runtimeclasses"
            ],
            "verbs": [
                "get"
            ]
        },
        {
            "apiGroups": [
                "node.k8s.io"
            ],
            "resources": [
                "runtimeclasses"
            ],
            "verbs": [
                "list"
            ]
        },
        {
            "apiGroups": [
                "node.k8s.io"
            ],
            "resources": [
                "runtimeclasses"
            ],
            "verbs": [
                "watch"
            ]
        }
    ]
}
Wow, this is quite a large output! But there is one major problem: rules are missing! For example, the errors mentioned the API group "crd.projectcalico.org", which does not show up in any rule of this cluster role. During our research we came across a thread in the Rancher forums, which discussed the exact same problem we were facing. Luckily one of the commenters (user 1117) also mentioned the missing rules.
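Instead of eyeballing 400+ lines of JSON, the absence of a rule can also be checked mechanically with a simple grep over a saved copy of the role. The following sketch uses a shortened stand-in file (made up for this example) so it can run anywhere; on the cluster you would first dump the real role with `kubectl get clusterrole system:node -o json > /tmp/system-node.json`:

```shell
# Stand-in for the real role dump; on a cluster, create this file with:
#   kubectl get clusterrole system:node -o json > /tmp/system-node.json
cat > /tmp/system-node.json <<'EOF'
{"rules": [{"apiGroups": ["storage.k8s.io"], "resources": ["csinodes"], "verbs": ["get"]}]}
EOF

# Check whether any rule mentions the API group from the error message
if grep -q '"crd.projectcalico.org"' /tmp/system-node.json; then
  echo "rule for crd.projectcalico.org present"
else
  echo "no rule for crd.projectcalico.org found"
fi
```

On our cluster this check came up empty, matching what the manual inspection showed.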
The missing rules can be added on the fly to the existing cluster role using kubectl edit. This opens up an editor (in our case vim):
# kubectl edit clusterrole system:node
[...]
- apiGroups:
  - crd.projectcalico.org
  resources:
  - clusterinformations
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
After pasting the missing rules at the end of the rules list and saving (:wq), the cluster role is immediately updated. Note that this default role carries the rbac.authorization.kubernetes.io/autoupdate annotation, meaning the API server reconciles it at startup; reconciliation adds missing default permissions but should leave additionally granted rules such as these in place.
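As a non-interactive alternative to kubectl edit, the same two rules can be kept in a file and appended with a JSON patch. The following is a sketch; the file name is made up:

```yaml
# add-calico-rules.yaml (hypothetical file name)
# JSON Patch operations appending the two missing rules
# to the end of the existing rules array ("/rules/-").
- op: add
  path: /rules/-
  value:
    apiGroups: ["crd.projectcalico.org"]
    resources: ["clusterinformations"]
    verbs: ["get"]
- op: add
  path: /rules/-
  value:
    apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get"]
```

This would then be applied with `kubectl patch clusterrole system:node --type=json -p "$(cat add-calico-rules.yaml)"`. kubectl should accept the YAML form of the patch and convert it internally; if your version does not, convert the patch to JSON first. This approach has the advantage that the change can be kept in version control and re-applied on other clusters.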
Almost immediately after the system:node cluster role was altered, the workloads in the kube-system namespace came back to life. kubelet stopped showing errors; instead, the containers were correctly created again:
I1122 07:28:24.463682 21820 kubelet.go:1921] SyncLoop (UPDATE, "api"): "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)"
I1122 07:28:25.197914 21820 kubelet.go:1952] SyncLoop (PLEG): "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)", event: &pleg.PodLifecycleEvent{ID:"06f94bd4-3c48-424c-9287-2b7550882dcf", Type:"ContainerDied", Data:"55ac5928079fd25b8b4f97e09be875e7604a1943170d76d539c77e79580ec077"}
W1122 07:28:25.198006 21820 pod_container_deletor.go:79] Container "55ac5928079fd25b8b4f97e09be875e7604a1943170d76d539c77e79580ec077" not found in pod's containers
I1122 07:28:25.375735 21820 kubelet.go:1921] SyncLoop (UPDATE, "api"): "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)"
I1122 07:28:25.395950 21820 provider.go:102] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
I1122 07:28:26.213275 21820 kubelet.go:1952] SyncLoop (PLEG): "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)", event: &pleg.PodLifecycleEvent{ID:"06f94bd4-3c48-424c-9287-2b7550882dcf", Type:"ContainerStarted", Data:"55ac5928079fd25b8b4f97e09be875e7604a1943170d76d539c77e79580ec077"}
I1122 07:28:28.329497 21820 kube_docker_client.go:347] Stop pulling image "rancher/mirrored-coredns-coredns:1.8.0": "Status: Downloaded newer image for rancher/mirrored-coredns-coredns:1.8.0"
I1122 07:28:29.289388 21820 kubelet.go:1952] SyncLoop (PLEG): "coredns-55b58f978-dd8rw_kube-system(06f94bd4-3c48-424c-9287-2b7550882dcf)", event: &pleg.PodLifecycleEvent{ID:"06f94bd4-3c48-424c-9287-2b7550882dcf", Type:"ContainerStarted", Data:"0309eb4bb0a0fdf6fbf862710cb67459449c8647e211eac042a22f4723f302cd"}
I1122 07:28:36.307721 21820 kuberuntime_manager.go:469] Sandbox for pod "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)" has no IP address. Need to start a new one
W1122 07:28:36.323794 21820 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "d9edf42d36c1e7eb813054b41a8db2decb210cb7e7b6be73846ab9d33fb24fdc"
I1122 07:28:36.451630 21820 kubelet.go:1921] SyncLoop (UPDATE, "api"): "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)"
I1122 07:28:37.403661 21820 kubelet.go:1952] SyncLoop (PLEG): "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)", event: &pleg.PodLifecycleEvent{ID:"bf77ce71-2e12-48d8-b062-7c6908c36070", Type:"ContainerStarted", Data:"b59986342cd88d5e773f7879a4411c7f517ca8815d6a4745f654c4033017b241"}
I1122 07:28:37.497271 21820 kubelet.go:1921] SyncLoop (UPDATE, "api"): "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)"
I1122 07:28:40.810715 21820 kube_docker_client.go:347] Stop pulling image "rancher/mirrored-calico-kube-controllers:v3.17.2": "Status: Downloaded newer image for rancher/mirrored-calico-kube-controllers:v3.17.2"
I1122 07:28:41.460486 21820 kubelet.go:1952] SyncLoop (PLEG): "calico-kube-controllers-7d5d95c8c9-qhnmt_kube-system(bf77ce71-2e12-48d8-b062-7c6908c36070)", event: &pleg.PodLifecycleEvent{ID:"bf77ce71-2e12-48d8-b062-7c6908c36070", Type:"ContainerStarted", Data:"3745f80270c91bc2fdeff813e27892ed16ae1af2580ebafdd8e831bbedaa9784"}
This went on for a few minutes, until finally all services were successfully deployed and running again:
Although often called "the de-facto container infrastructure", Kubernetes is anything but easy. The complexity adds additional problems and considerations. We at Infiniroot love to share our troubleshooting knowledge when we need to tackle certain issues - but we also know this is not for everyone ("it just needs to work"). So if you are looking for a managed and dedicated Kubernetes environment, managed by Rancher 2, with server location in Switzerland or even in your own datacenter, check out our Private Kubernetes Container Cloud Infrastructure service at Infiniroot.