Race condition: two PVCs get the same project quota #155

andreasreinhardt opened this issue Mar 31, 2023 · 1 comment
Describe the bug: We use localpv with ext4 hard quotas. They work quite well, but from time to time the quota is reported as exceeded even though the folder contains less than the defined quota (10GiB). Today I was able to track the problem down to 2 PVCs that obviously had the same project quota ID set:

/nvme/disk# ls
lost+found  pvc-2fabebc9-8143-4b60-beef-563180845e64  pvc-6d3a015a-c547-4292-9ed6-95b35a7aea41

/nvme/disk/pvc-6d3a015a-c547-4292-9ed6-95b35a7aea41# du -h --max-depth=1
4.2G	./workspace
33M	./remoting
8.0K	./caches
4.3G	.

/nvme/disk# du -h --max-depth=1
6.1G	./pvc-2fabebc9-8143-4b60-beef-563180845e64
16K	./lost+found
4.3G	./pvc-6d3a015a-c547-4292-9ed6-95b35a7aea41
11G	.

/nvme/disk# repquota -avugP
*** Report for project quotas on device /dev/md0
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
Project         used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
#0        --      20       0       0              2     0     0       
#1        --       0 10737419 10737419              0     0     0       
#2        --       0 10737419 10737419              0     0     0       
#3        --       0 10737419 10737419              0     0     0       
#4        -- 10737416 10737419 10737419           6122     0     0       
#5        --       0 10737419 10737419              0     0     0       
#6        --       0 10737419 10737419              0     0     0       

I think the problem occurs because of a race condition when determining the project id:
https://github.com/openebs/dynamic-localpv-provisioner/blob/e797585cb1e2c3578b914102bfe0e8768b04d950/cmd/provisioner-localpv/app/helper_hostpath.go#L294+L295
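
To make the race concrete, here is a minimal, self-contained Go sketch (illustrative only, not the provisioner's actual code): if two quota jobs each compute the next project ID as "current maximum + 1" before either of them has applied its quota, both end up with the same ID.

package main

import (
    "fmt"
    "sync"
)

// nextProjectID mimics a "highest existing project ID + 1" scheme.
func nextProjectID(existing map[int]bool) int {
    max := 0
    for id := range existing {
        if id > max {
            max = id
        }
    }
    return max + 1 // not atomic: a concurrent job can compute the same value
}

func main() {
    // Project IDs already in use on the node (as a project quota report would show them).
    existing := map[int]bool{1: true, 2: true, 3: true}

    var wg sync.WaitGroup
    ids := make([]int, 2)
    for i := range ids {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            // Both jobs read the same snapshot before either applies its quota.
            ids[i] = nextProjectID(existing)
        }(i)
    }
    wg.Wait()
    fmt.Println(ids) // prints [4 4]: two PVCs would share one project quota
}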

I see two possible workarounds: either make sure that only one create-quota pod can run at a time on a single node, or apply a random project number instead of trying to increment them.
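
As a rough illustration of the first workaround (serializing quota assignment per node), the ID selection and quota application could be wrapped in an advisory file lock. This is only a sketch under assumed names (withNodeLock, the lock-file location on the hostpath), not the provisioner's API:

package main

import (
    "os"

    "golang.org/x/sys/unix"
)

// withNodeLock runs fn while holding an exclusive advisory lock on lockPath,
// so only one quota-assignment step can run at a time on this filesystem.
func withNodeLock(lockPath string, fn func() error) error {
    f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
    if err != nil {
        return err
    }
    defer f.Close()

    if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
        return err
    }
    defer unix.Flock(int(f.Fd()), unix.LOCK_UN)

    return fn()
}

func main() {
    _ = withNodeLock("/nvme/disk/.quota.lock", func() error {
        // determine the next project ID and apply the quota here
        return nil
    })
}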

Expected behaviour: Each PVC has the quota it is configured with.

Steps to reproduce the bug:
Unfortunately, it is really hard to reproduce the bug, as it only happens now and then. During tests I scaled a deployment with a PVC up and down very quickly to check the create and cleanup paths and had no problem. Maybe you can reproduce it with more than one deployment scaled up in parallel.

The output of the following commands will help us better understand what's going on:

  • kubectl get pods -n <openebs_namespace> --show-labels
    nvme-provisioner-localpv-provisioner-68f8494cf7-84hdv 1/1 Running 80 (12h ago) 32d app=localpv-provisioner,chart=localpv-provisioner-3.3.0,component=localpv-provisioner,heritage=Helm,name=openebs-localpv-provisioner,openebs.io/component-name=openebs-localpv-provisioner,openebs.io/version=3.3.0,pod-template-hash=68f8494cf7,release=nvme-provisioner

Anything else we need to know?:
The provisioner pod has lots of restarts and we don't know why; there is no error in the pod log, but it does not seem to be related.

Environment details:

  • OpenEBS version (use kubectl get po -n openebs --show-labels): 3.3.0
  • Kubernetes version (use kubectl version): 1.23.15
  • Cloud provider or hardware configuration: AWS
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • kernel (e.g: uname -a): 5.4.228-131.415.amzn2.x86_64
niladrih added the bug label on Jun 26, 2024
@niladrih (Member) commented:

The provisioning jobs are asynchronous, so the issue makes sense to me. I understand it'd be difficult to reproduce, so I'm not going to try it; it is clearly apparent that the race exists. Thank you for reporting this!
