[bug] failed to delete large pv, thus making node unschedulable. #181

Open
bernardgut opened this issue Apr 30, 2024 · 3 comments

@bernardgut
Contributor

Describe the bug:
Purposefully create a large PVC (80% of the node's ephemeral storage) to test the behavior of the localpv-provisioner. The provisioner creates the PV successfully, but fails to delete it after the PVC is removed, leaving the node with diskPressure=true and preventing further scheduling of pods on that node. Manually deleting the PV in Kubernetes leaves the data on disk, and the issue persists.
This is on Talos 1.7.0, using openebs-localpv-provisioner (Helm) and the default Talos deployment instructions in the docs.
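For reference, the condition can be confirmed with kubectl describe node v2, or with kubectl get node v2 -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}' (node name as it appears in the logs below).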

Expected behaviour:
The provisioner deletes the PV after the PVC is deleted, and/or deletes the data after the PV is manually deleted; the diskPressure=true condition is then cleared and the node resumes operation.

Steps to reproduce the bug:

  • Create a Talos cluster
  • Apply a patch to mount /var/openebs/local in the kubelet as per the docs (a sketch of such a patch follows this list)
  • Create a workload with a PVC that will bring ephemeral storage usage to 85+%
  • Delete the workload/PVC

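For reference, the kubelet mount patch described in the Talos docs looks along these lines (a sketch; check the docs version you followed for the exact options):

machine:
  kubelet:
    extraMounts:
      - destination: /var/openebs/local   # hostpath base dir used by the provisioner
        type: bind
        source: /var/openebs/local
        options:
          - bind
          - rshared
          - rw

And a minimal PVC sketch for step 3 (the name and size are hypothetical, picked to land around 80% of the 22 GB system disk listed under "Environment details"):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: big-claim                         # hypothetical name, for illustration
spec:
  storageClassName: openebs-hostpath      # default hostpath StorageClass from the Helm chart
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 18Gi                       # ~80% of a 22 GB disk

Note that hostpath volumes do not enforce the requested size as a quota, so diskPressure only appears once the workload actually writes that much data.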
The output of the following commands will help us better understand what's going on:
These are the logs of the localpv-provisioner container after the deletion; they repeat the following loop:

...
I0429 21:12:46.177550       1 controller.go:1509] delete "pvc-600bcff4-c26f-43c4-bebb-6b989110c715": started
2024-04-29T21:12:46.177Z	INFO	app/provisioner_hostpath.go:270	Get the Node Object with label {map[kubernetes.io/hostname:v2]}
I0429 21:12:46.181181       1 provisioner_hostpath.go:282] Deleting volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715 at v2:/var/openebs/local/pvc-600bcff4-c26f-43c4-bebb-6b989110c715
2024-04-29T21:14:46.664Z	ERROR	app/provisioner.go:188		{"eventcode": "local.pv.delete.failure", "msg": "Failed to delete Local PV", "rname": "pvc-600bcff4-c26f-43c4-bebb-6b989110c715", "reason": "failed to delete host path", "storagetype": "local-hostpath"}
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).Delete
	/go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/provisioner.go:188
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).deleteVolumeOperation
	/go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:1511
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).syncVolume
	/go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:1115
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).syncVolumeHandler
	/go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:1045
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).processNextVolumeWorkItem.func1
	/go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:987
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).processNextVolumeWorkItem
	/go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:1004
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).runVolumeWorker
	/go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:905
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).Run.func1.3
	/go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:857
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:157
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:158
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:135
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:92
E0429 21:14:46.664896       1 controller.go:1519] delete "pvc-600bcff4-c26f-43c4-bebb-6b989110c715": volume deletion failed: failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: clean up volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715 failed: create process timeout after 120 seconds
E0429 21:14:46.664948       1 controller.go:995] Giving up syncing volume "pvc-600bcff4-c26f-43c4-bebb-6b989110c715" because failures 15 >= threshold 15
E0429 21:14:46.664972       1 controller.go:1007] error syncing volume "pvc-600bcff4-c26f-43c4-bebb-6b989110c715": failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: clean up volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715 failed: create process timeout after 120 seconds
I0429 21:14:46.665321       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-600bcff4-c26f-43c4-bebb-6b989110c715", UID:"4f6f7084-bdb3-4ea5-89cc-ed217aa78da1", APIVersion:"v1", ResourceVersion:"931637", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: clean up volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715 failed: create process timeout after 120 seconds
I0429 21:27:46.178478       1 controller.go:1509] delete "pvc-600bcff4-c26f-43c4-bebb-6b989110c715": started
...
  • kubectl get pods -n <openebs_namespace> --show-labels
NAME                                           READY   STATUS    RESTARTS   AGE   LABELS
openebs-localpv-provisioner-6b8bff68bd-p9d9f   1/1     Running   0          17h   app=localpv-provisioner,chart=localpv-provisioner-4.0.0,component=localpv-provisioner,heritage=Helm,name=openebs-localpv-provisioner,openebs.io/component-name=openebs-localpv-provisioner,openebs.io/version=4.0.0,pod-template-hash=6b8bff68bd,release=openebs
  • kubectl logs <upgrade_job_pod> -n <openebs_namespace>
k get job -n openebs                                                                         
No resources found in openebs namespace.

Anything else we need to know?:
NA

Environment details:

  • OpenEBS version (use kubectl get po -n openebs --show-labels): see above
  • Kubernetes version (use kubectl version): Server Version: v1.29.3
  • Cloud provider or hardware configuration: Talos 1.7.0. on Proxmox nodes with "ssd emulation":
talosctl -n v1 disks                                                                    
NODE       DEV        MODEL           SERIAL   TYPE   UUID   WWID   MODALIAS      NAME   SIZE     BUS_PATH                                                                   SUBSYSTEM          READ_ONLY   SYSTEM_DISK
10.2.0.8   /dev/sda   QEMU HARDDISK   -        SSD    -      -      scsi:t-0x00   -      22 GB    /pci0000:00/0000:00:05.0/0000:01:01.0/virtio1/host2/target2:0:0/2:0:0:0/   /sys/class/block               *
  • OS (e.g: cat /etc/os-release): Talos 1.7.0
  • kernel (e.g: uname -a): Linux v1 6.6.28-talos #1 SMP Thu Apr 18 16:21:02 UTC 2024 x86_64 Linux
  • others:
@bernardgut bernardgut changed the title [bug] failed to delete large pv on Talos, making nodes unschedulable. [bug] failed to delete large pv, thus making node unschedulable. May 1, 2024
@niladrih niladrih self-assigned this Jun 12, 2024
@tiagolobocastro

@niladrih do we need to add toleration for DiskPressure to the cleanup Pod?
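For reference, a node under disk pressure carries the standard node.kubernetes.io/disk-pressure taint, so the cleanup Pod would need a toleration along these lines (a sketch of the standard Kubernetes toleration shape, not current provisioner behavior):

tolerations:
  - key: node.kubernetes.io/disk-pressure   # taint applied by the kubelet under disk pressure
    operator: Exists
    effect: NoSchedule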

@D1StrX

D1StrX commented Oct 27, 2024

I'm getting the same error on openebs/provisioner-localpv:4.1.1:

E1027 15:04:33.648888       1 controller.go:1007] error syncing volume "pvc-43c22848-1f2c-4471-9201-77ff5179c25c": failed to delete volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c: failed to delete volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c: clean up volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c failed: create process timeout after 120 seconds
I1027 15:04:33.648953       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-43c22848-1f2c-4471-9201-77ff5179c25c", UID:"a1fdb419-d2c6-4a05-90cd-7c437d439bab", APIVersion:"v1", ResourceVersion:"159262224", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' failed to delete volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c: failed to delete volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c: clean up volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c failed: create process timeout after 120 seconds
2024-10-27T15:04:33.653Z        ERROR   app/provisioner.go:174          {"eventcode": "local.pv.delete.failure", "msg": "Failed to delete Local PV", "rname": "pvc-0c25df70-d565-4172-ae84-c79432cac3f5", "reason": "failed to delete host path", "storagetype": "local-hostpath"}
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).Delete
        /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/provisioner.go:174
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).deleteVolumeOperation
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:1511
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).syncVolume
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:1115
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).syncVolumeHandler
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:1045
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).processNextVolumeWorkItem.func1
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:987
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).processNextVolumeWorkItem
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:1004
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).runVolumeWorker
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:905
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).Run.func1.3
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:857
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:157
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:158
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:135
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:92
E1027 15:04:33.653151       1 controller.go:1519] delete "pvc-0c25df70-d565-4172-ae84-c79432cac3f5": volume deletion failed: failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: clean up volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5 failed: create process timeout after 120 seconds
I1027 15:04:33.653273       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-0c25df70-d565-4172-ae84-c79432cac3f5", UID:"1a09afdb-1288-4428-ac7f-c00dd6f0800d", APIVersion:"v1", ResourceVersion:"161476499", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: clean up volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5 failed: create process timeout after 120 seconds
W1027 15:04:33.653187       1 controller.go:992] Retrying syncing volume "pvc-0c25df70-d565-4172-ae84-c79432cac3f5" because failures 0 < threshold 15
E1027 15:04:33.653752       1 controller.go:1007] error syncing volume "pvc-0c25df70-d565-4172-ae84-c79432cac3f5": failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: clean up volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5 failed: create process timeout after 120 seconds

@avishnu
Member

avishnu commented Nov 28, 2024

Scoping the investigation for v4.3.

@avishnu avishnu added this to the v4.3 milestone Nov 28, 2024