Fix operator panic when checking for pod crashloop status #115

SatyaKuppam · 2023-10-23T04:54:11Z

Description

Note: I am facing this issue with v1.0.0 of the operator on a K8s cluster running a version below 1.25. I believe the same issue exists in later versions of the operator, so this change is based on the main branch.

Our Druid cluster uses StatefulSets with the OrderedReady pod management policy. We observed that the operator panics with an index-out-of-range runtime error and enters an unrecoverable state when a pod is in the Pending state and none of its containers have been created.

In my case it occurred when:

The operator incorrectly deleted the PVC before it could be mounted to the pod, so the pod remained in the Pending state (this is fixed in later versions by "Delay removal of orphanPVC to avoid the removal of PVC in use" #67).
The Kubernetes cluster didn't have the requested resources, so the pod remained in the Pending state.

In both cases no containers had been created, and hence the pod had no ContainerStatuses.
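
To illustrate the failure mode, here is a minimal sketch of why reading the first entry of an empty status list panics and how a length guard avoids it. The `ContainerStatus` type and the `firstRestartCount` helper are hypothetical stand-ins written for this example; they are not the operator's code and only mirror the shape of the Kubernetes container status fields.

```go
package main

import "fmt"

// ContainerStatus is a simplified, hypothetical stand-in for the
// Kubernetes container status type; only the fields used here are modeled.
type ContainerStatus struct {
	Ready        bool
	RestartCount int32
}

// firstRestartCount returns the restart count of the first container.
// It reports ok=false instead of panicking when the slice is empty,
// which is the situation for a Pending pod with no containers created.
func firstRestartCount(statuses []ContainerStatus) (count int32, ok bool) {
	if len(statuses) == 0 {
		// statuses[0] here would panic with
		// "index out of range [0] with length 0".
		return 0, false
	}
	return statuses[0].RestartCount, true
}

func main() {
	var pending []ContainerStatus // no containers created yet
	if _, ok := firstRestartCount(pending); !ok {
		fmt.Println("no container statuses yet; skipping crash check")
	}
}
```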

Error log:

Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator INFO Observed a panic in reconciler: runtime error: index out of range [0] with length 0	{"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"druid-cluster","namespace":"druid-operator-system"}, "namespace": "druid-operator-system", "name": "druid-cluster", "reconcileID": "728791e5-630d-4ece-8f1c-940d7694f719"}
Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator error panic: runtime error: index out of range [0] with length 0 [recovered]
	panic: runtime error: index out of range [0] with length 0
Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator goroutine 453 [running]:
Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0x1fa
Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator panic({0x16761a0, 0xc000780030})
	/usr/local/go/src/runtime/panic.go:884 +0x212
Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator github.com/datainfrahq/druid-operator/controllers/druid.checkCrashStatus({0x19c5e38, 0xc00025c5a0}, 0xc0005e8000, {0x19c43a0, 0xc0001d7ba0})
	/workspace/controllers/druid/handler.go:597 +0x732

Possible Solution presented in this PR

The solution is to split the PodStatus and ContainerStatus checks and to stop addressing the ContainerStatuses list by index. This change:

first checks whether the pod is in the Failed or Unknown state, and deletes the pod if it is
then loops through the ContainerStatus of every container. If any container is not ready and has restarted more than once, the pod is killed.
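
The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in handler.go: the `PodPhase`, `PodStatus`, and `ContainerStatus` types are simplified local stand-ins for the Kubernetes types, and `shouldDeletePod` is a hypothetical helper that only returns the decision instead of deleting anything.

```go
package main

import "fmt"

// PodPhase, ContainerStatus, and PodStatus are simplified, hypothetical
// stand-ins for the Kubernetes types of the same names.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
	PodUnknown PodPhase = "Unknown"
)

type ContainerStatus struct {
	Name         string
	Ready        bool
	RestartCount int32
}

type PodStatus struct {
	Phase             PodPhase
	ContainerStatuses []ContainerStatus
}

// shouldDeletePod (hypothetical helper) applies the two checks in order.
func shouldDeletePod(status PodStatus) bool {
	// Step 1: the pod-level check needs no container information.
	if status.Phase == PodFailed || status.Phase == PodUnknown {
		return true
	}
	// Step 2: ranging over the slice is safe when it is empty,
	// so a Pending pod with no containers simply falls through.
	for _, cs := range status.ContainerStatuses {
		if !cs.Ready && cs.RestartCount > 1 {
			return true
		}
	}
	return false
}

func main() {
	pending := PodStatus{Phase: PodPending}
	crashing := PodStatus{
		Phase: PodRunning,
		ContainerStatuses: []ContainerStatus{
			{Name: "druid", Ready: false, RestartCount: 3},
		},
	}
	fmt.Println(shouldDeletePod(pending))  // false
	fmt.Println(shouldDeletePod(crashing)) // true
}
```

Because the loop uses `range` rather than a fixed index, the empty ContainerStatuses case needs no special handling.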

This PR has:

been tested on a real K8s cluster to ensure creation of a brand new Druid cluster works.
been tested for backward compatibility on a real K8s cluster by applying the changes introduced here on an existing Druid cluster. If there are any backward incompatible changes then they have been noted in the PR description.
added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
added documentation for new or modified features or behaviors.

Key changed/added files in this PR

handler.go

SatyaKuppam added 3 commits October 21, 2023 18:06

check index out of bounds to prevent panic

c56d4ad

simplify ordered ready crash loop checks

627a8a7

add else statement

00ba60d

SatyaKuppam changed the title ~~Fix operator panic when checking pod crashloop status on statefulset OrderedReady~~ Fix operator panic when checking for pod crashloop status Oct 23, 2023

SatyaKuppam marked this pull request as ready for review October 23, 2023 05:44

AdheipSingh merged commit c20c2d4 into datainfrahq:master Oct 24, 2023
1 check passed

SatyaKuppam deleted the fix-index-out-of-bounds branch October 24, 2023 05:30
