Fix operator panic when checking for pod crashloop status #115

Merged: 3 commits, Oct 24, 2023

@SatyaKuppam SatyaKuppam commented Oct 23, 2023

Description

Note: I am seeing this issue with v1.0.0 of the operator on a Kubernetes cluster older than 1.25. I believe the same issue exists in later versions of the operator, so this change is based on the main branch.

Our Druid cluster runs StatefulSets with the OrderedReady pod management policy. We observed that the operator panics with an index-out-of-range runtime error and enters an unrecoverable state when a pod is in the Pending state and none of its containers have been created.

In my case it occurred when:

  1. the operator incorrectly deleted the PVC before it could be mounted to the pod, so the pod remained in the Pending state (this is fixed in later versions by #67, "Delay removal of orphanPVC to avoid the removal of PVC in use")
  2. the Kubernetes cluster didn't have the requested resources, so the pod stayed in the Pending state

In both cases no containers had been created, so the pod's ContainerStatuses list was empty.
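
The failure mode can be reproduced in isolation. The sketch below is an illustration only, not the operator's code: `PodStatus` and `ContainerStatus` are simplified stand-ins for the Kubernetes API types of the same name, and `firstContainerRestarts` and `tryFirstContainerRestarts` are hypothetical helpers. It shows that indexing element 0 of an empty status slice panics, which is what happens for a Pending pod with no containers.

```go
package main

import "fmt"

// ContainerStatus and PodStatus are simplified stand-ins (assumption) for the
// Kubernetes API types; only the fields needed for this sketch are modeled.
type ContainerStatus struct {
	Ready        bool
	RestartCount int32
}

type PodStatus struct {
	Phase             string
	ContainerStatuses []ContainerStatus
}

// firstContainerRestarts is a hypothetical helper that shows the unsafe
// pattern: it indexes element 0 without checking the slice length.
func firstContainerRestarts(s PodStatus) int32 {
	return s.ContainerStatuses[0].RestartCount
}

// tryFirstContainerRestarts calls the unsafe helper and converts a panic
// into an error so the program can report it instead of crashing.
func tryFirstContainerRestarts(s PodStatus) (restarts int32, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return firstContainerRestarts(s), nil
}

func main() {
	// A Pending pod for which no containers have been created yet.
	pending := PodStatus{Phase: "Pending"}
	_, err := tryFirstContainerRestarts(pending)
	fmt.Println(err)
	// prints "recovered: runtime error: index out of range [0] with length 0"
}
```

The recovered message is the same runtime error that appears in the operator log.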

Error log:

Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator INFO Observed a panic in reconciler: runtime error: index out of range [0] with length 0	{"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"druid-cluster","namespace":"druid-operator-system"}, "namespace": "druid-operator-system", "name": "druid-cluster", "reconcileID": "728791e5-630d-4ece-8f1c-940d7694f719"}
Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator error panic: runtime error: index out of range [0] with length 0 [recovered]
	panic: runtime error: index out of range [0] with length 0
Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator goroutine 453 [running]:
Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0x1fa
Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator panic({0x16761a0, 0xc000780030})
	/usr/local/go/src/runtime/panic.go:884 +0x212
Oct 19 12:53:53 druid-operator-65867879c9-j896f druid-operator github.com/datainfrahq/druid-operator/controllers/druid.checkCrashStatus({0x19c5e38, 0xc00025c5a0}, 0xc0005e8000, {0x19c43a0, 0xc0001d7ba0})
	/workspace/controllers/druid/handler.go:597 +0x732

Possible Solution presented in this PR

The solution is to separate the pod status check from the container status checks and to stop indexing into the ContainerStatuses list. This change:

  1. first checks whether the pod is in the Failed or Unknown state, and deletes the pod if it is
  2. then loops through the ContainerStatus of every container. If any container is not ready and has restarted more than once, the pod is deleted.
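
The two steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in handler.go: the types are simplified stand-ins for the Kubernetes API types, phases are plain strings, and `shouldDeletePod` is a hypothetical name. The point it shows is that ranging over an empty slice is a no-op, so a Pending pod with no containers can no longer trigger an out-of-range panic.

```go
package main

import "fmt"

// ContainerStatus and PodStatus are simplified stand-ins (assumption) for the
// Kubernetes API types; only the fields needed for this sketch are modeled.
type ContainerStatus struct {
	Ready        bool
	RestartCount int32
}

type PodStatus struct {
	Phase             string
	ContainerStatuses []ContainerStatus
}

// shouldDeletePod is a hypothetical helper that mirrors the two-step check
// described in the PR: a pod-level phase check, then a per-container check.
func shouldDeletePod(s PodStatus) bool {
	// Step 1: pod-level check, independent of any container status.
	if s.Phase == "Failed" || s.Phase == "Unknown" {
		return true
	}
	// Step 2: range over all container statuses. With an empty slice the
	// loop body never runs, so no index is ever dereferenced.
	for _, cs := range s.ContainerStatuses {
		if !cs.Ready && cs.RestartCount > 1 {
			return true
		}
	}
	return false
}

func main() {
	pending := PodStatus{Phase: "Pending"} // no containers created yet
	failed := PodStatus{Phase: "Failed"}
	crashing := PodStatus{
		Phase:             "Running",
		ContainerStatuses: []ContainerStatus{{Ready: false, RestartCount: 5}},
	}
	fmt.Println(shouldDeletePod(pending), shouldDeletePod(failed), shouldDeletePod(crashing))
	// prints "false true true"
}
```

In this sketch a Pending pod without containers is simply left alone, which matches the intent of avoiding the panic rather than deleting pods that are still waiting to be scheduled.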

This PR has:

  • been tested on a real K8S cluster to ensure creation of a brand new Druid cluster works.
  • been tested for backward compatibility on a real K8s cluster by applying the changes introduced here on an existing Druid cluster. If there are any backward incompatible changes then they have been noted in the PR description.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Key changed/added files in this PR
  • handler.go

@SatyaKuppam SatyaKuppam changed the title from "Fix operator panic when checking pod crashloop status on statefulset OrderedReady" to "Fix operator panic when checking for pod crashloop status" Oct 23, 2023
@SatyaKuppam SatyaKuppam marked this pull request as ready for review October 23, 2023 05:44
@AdheipSingh AdheipSingh merged commit c20c2d4 into datainfrahq:master Oct 24, 2023
1 check passed
@SatyaKuppam SatyaKuppam deleted the fix-index-out-of-bounds branch October 24, 2023 05:30