
velero restore create stuck in status "InProgress" when restore pv and pvc only #488

Open
jhuisss opened this issue Oct 28, 2022 · 18 comments

Comments

@jhuisss

jhuisss commented Oct 28, 2022

Describe the bug

The "velero restore create" command does not work with the flag "--include-resources pv,pvc"; the restore gets stuck in the "InProgress" phase.

To Reproduce

First, I successfully created a backup of the whole workload cluster, including the PV:

# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                      STORAGECLASS   REASON   AGE
pvc-de661d9c-40d3-483e-a106-a0c2a3c06eea   50Mi       RWO            Delete           Bound    nginx-example/nginx-logs   vsphere-csi             4m33s

# kubectl get pvc -A
NAMESPACE       NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nginx-example   nginx-logs   Bound    pvc-de661d9c-40d3-483e-a106-a0c2a3c06eea   50Mi       RWO            vsphere-csi    4m39s
# velero backup create whole-wc
Backup request "whole-wc" submitted successfully.
Run `velero backup describe whole-wc` or `velero backup logs whole-wc` for more details.
# velero backup get
NAME                   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
whole-wc               Completed   0        1          2022-10-28 06:56:35 +0000 UTC   29d       default            <none>

Then I manually deleted the PV and PVC to test velero restore with the flag "--include-resources pv,pvc", but the restore is stuck in "InProgress":

# velero restore create --from-backup whole-wc --include-resources pv,pvc --existing-resource-policy update
Restore request "whole-wc-20221028072733" submitted successfully.
Run `velero restore describe whole-wc-20221028072733` or `velero restore logs whole-wc-20221028072733` for more details.
# velero restore describe whole-wc-20221028072733
Name:         whole-wc-20221028072733
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:                                 InProgress
Estimated total items to be restored:  2
Items restored so far:                 1

Started:    2022-10-28 07:27:33 +0000 UTC
Completed:  <n/a>

Backup:  whole-wc

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        pv, pvc
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Existing Resource Policy:   update

Preserve Service NodePorts:  auto
# kubectl get download download-8b45b2e8-d645-4cdc-a1ac-4e808340cd7c-7a1fd63f-d2f1-4b45-8e3d-e9ef1439ec7e -n velero -o yaml
apiVersion: datamover.cnsdp.vmware.com/v1alpha1
kind: Download
metadata:
  creationTimestamp: "2022-10-28T07:27:05Z"
  generation: 5
  labels:
    velero.io/exclude-from-backup: "true"
  name: download-8b45b2e8-d645-4cdc-a1ac-4e808340cd7c-7a1fd63f-d2f1-4b45-8e3d-e9ef1439ec7e
  namespace: velero
  resourceVersion: "11013595"
  uid: 48515084-d7cb-4cce-aa94-955d02b56702
spec:
  backupRepositoryName: br-20760287-6db0-4d0f-bf4f-9f5ce1f2b239
  clonefromSnapshotReference: nginx-example/485ef262-c458-422e-82c9-0aa64555e21d
  protectedEntityID: ivd:4c21622c-d402-43b3-baad-da34d9447f68
  restoreTimestamp: "2022-10-28T07:27:36Z"
  snapshotID: ivd:29d626f0-371c-448f-8de5-99ad961a258d:8b45b2e8-d645-4cdc-a1ac-4e808340cd7c
status:
  message: 'Failed to download snapshot, ivd:29d626f0-371c-448f-8de5-99ad961a258d:8b45b2e8-d645-4cdc-a1ac-4e808340cd7c,
    from durable object storage. Data copy failed from PE ivd:4c21622c-d402-43b3-baad-da34d9447f68,
    to PE ivd:29d626f0-371c-448f-8de5-99ad961a258d:8b45b2e8-d645-4cdc-a1ac-4e808340cd7c:
    Open virtual disk file failed. The error code is 1. with error code: 1'
  nextRetryTimestamp: "2022-10-28T07:32:41Z"
  phase: Retry
  processingNode: np1-76b686bbf5-lntdp
  progress: {}
  retryCount: 2
  startTimestamp: "2022-10-28T07:27:36Z"

Expected behavior

velero restore create should succeed, and the PV and PVC should be restored.
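While waiting on a stuck restore, the phase can be polled by parsing the `velero restore describe` output instead of re-reading it by hand. A minimal POSIX-shell sketch (the `Phase:` line layout is assumed from the describe output in this report):

```shell
# restore_phase: extract the Phase field from 'velero restore describe'
# output passed on stdin, so a wait loop can poll it
restore_phase() {
  awk -F':[ ]+' '/^Phase:/ { print $2; exit }'
}

# Usage:
#   velero restore describe whole-wc-20221028072733 | restore_phase
# A wait loop could retry until this prints Completed, Failed, or
# PartiallyFailed instead of InProgress.
```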

Troubleshooting Information
Velero server version: 1.9.2
AWS plugin version: v1.5.1
vSphere plugin version: v1.4.0
Kubernetes: Vanilla (tkgm workload cluster)
Kubernetes version:

# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.8+vmware.2", GitCommit:"d625665700d901401d7504761cf473dd8adfc7b3", GitTreeState:"clean", BuildDate:"2022-07-18T23:04:08Z", GoVersion:"go1.17.11", Compiler:"gc", Platform:"linux/amd64"}
@deepakkinni
Collaborator

@jhuisss
Author

jhuisss commented Nov 2, 2022

Please attach logs, see instructions at https://github.com/vmware-tanzu/velero-plugin-for-vsphere/blob/main/docs/troubleshooting.md#logs
bundle-2022-11-02-11-54-06.tar.gz
Hi, I have attached the full log bundle, collected with the command "velero debug --restore test-pv-pvc-20221102114058".

@xing-yang
Contributor

What "velero backup" command did you use? After the backup completed, did you check the status of the "Snapshot" and "Upload" CRs?
https://github.com/vmware-tanzu/velero-plugin-for-vsphere/blob/main/docs/vanilla.md#backup-vsphere-cns-block-volumes

@jhuisss
Author

jhuisss commented Nov 2, 2022

I used the command "velero backup create whole-wc" to back up the whole workload cluster.

$ velero backup create whole-wc
Backup request "whole-wc" submitted successfully.
Run `velero backup describe whole-wc` or `velero backup logs whole-wc` for more details.
$ velero backup get
NAME          STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
whole-wc      Completed   0        1          2022-11-02 15:58:28 +0000 UTC   29d       default            <none>
$ kubectl get snapshot -A
NAMESPACE       NAME                                        AGE
nginx-example   snap-70838ef6-80ce-4a0c-b8ec-4e9e49a3116d   44s
$ kubectl get snapshot -n nginx-example snap-70838ef6-80ce-4a0c-b8ec-4e9e49a3116d -o yaml
apiVersion: backupdriver.cnsdp.vmware.com/v1alpha1
kind: Snapshot
metadata:
  creationTimestamp: "2022-11-02T15:58:44Z"
  generation: 1
  labels:
    velero.io/backup-name: whole-wc
    velero.io/exclude-from-backup: "true"
  name: snap-70838ef6-80ce-4a0c-b8ec-4e9e49a3116d
  namespace: nginx-example
  resourceVersion: "5840374"
  uid: c0ddd860-8695-41f5-a3cf-a3414c19dbc9
spec:
  backupRepository: br-c1d8671d-a8cd-4e34-9aae-35a776875065
  resourceHandle:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: nginx-logs
status:
  completionTimestamp: "2022-11-02T15:58:58Z"
  metadata: Cs4LCgpuZ2lueC1sb2dzEgAaDW5naW54LWV4YW1wbGUiACokNmI0OWY0OWUtNjJkZS00NTgwLTk1ZGQtNTljYWNkZjk2YzI0Mgc1ODM5NTY1OABCCAiBpoqbBhAAWgwKA2FwcBIFbmdpbnhixQIKMGt1YmVjdGwua3ViZXJuZXRlcy5pby9sYXN0LWFwcGxpZWQtY29uZmlndXJhdGlvbhKQAnsiYXBpVmVyc2lvbiI6InYxIiwia2luZCI6IlBlcnNpc3RlbnRWb2x1bWVDbGFpbSIsIm1ldGFkYXRhIjp7ImFubm90YXRpb25zIjp7fSwibGFiZWxzIjp7ImFwcCI6Im5naW54In0sIm5hbWUiOiJuZ2lueC1sb2dzIiwibmFtZXNwYWNlIjoibmdpbngtZXhhbXBsZSJ9LCJzcGVjIjp7ImFjY2Vzc01vZGVzIjpbIlJlYWRXcml0ZU9uY2UiXSwicmVzb3VyY2VzIjp7InJlcXVlc3RzIjp7InN0b3JhZ2UiOiI1ME1pIn19LCJzdG9yYWdlQ2xhc3NOYW1lIjoidnNwaGVyZS1jc2kifX0KYiYKH3B2Lmt1YmVybmV0ZXMuaW8vYmluZC1jb21wbGV0ZWQSA3llc2IrCiRwdi5rdWJlcm5ldGVzLmlvL2JvdW5kLWJ5LWNvbnRyb2xsZXISA3llc2JHCi12b2x1bWUuYmV0YS5rdWJlcm5ldGVzLmlvL3N0b3JhZ2UtcHJvdmlzaW9uZXISFmNzaS52c3BoZXJlLnZtd2FyZS5jb21iQgoodm9sdW1lLmt1YmVybmV0ZXMuaW8vc3RvcmFnZS1wcm92aXNpb25lchIWY3NpLnZzcGhlcmUudm13YXJlLmNvbXIca3ViZXJuZXRlcy5pby9wdmMtcHJvdGVjdGlvbnoAigHAAgoZa3ViZWN0bC1jbGllbnQtc2lkZS1hcHBseRIGVXBkYXRlGgJ2MSIICIGmipsGEAAyCEZpZWxkc1YxOoACCv0BeyJmOm1ldGFkYXRhIjp7ImY6YW5ub3RhdGlvbnMiOnsiLiI6e30sImY6a3ViZWN0bC5rdWJlcm5ldGVzLmlvL2xhc3QtYXBwbGllZC1jb25maWd1cmF0aW9uIjp7fX0sImY6bGFiZWxzIjp7Ii4iOnt9LCJmOmFwcCI6e319fSwiZjpzcGVjIjp7ImY6YWNjZXNzTW9kZXMiOnt9LCJmOnJlc291cmNlcyI6eyJmOnJlcXVlc3RzIjp7Ii4iOnt9LCJmOnN0b3JhZ2UiOnt9fX0sImY6c3RvcmFnZUNsYXNzTmFtZSI6e30sImY6dm9sdW1lTW9kZSI6e319fUIAigG4AgoXa3ViZS1jb250cm9sbGVyLW1hbmFnZXISBlVwZGF0ZRoCdjEiCAiCpoqbBhAAMghGaWVsZHNWMTr6AQr3AXsiZjptZXRhZGF0YSI6eyJmOmFubm90YXRpb25zIjp7ImY6cHYua3ViZXJuZXRlcy5pby9iaW5kLWNvbXBsZXRlZCI6e30sImY6cHYua3ViZXJuZXRlcy5pby9ib3VuZC1ieS1jb250cm9sbGVyIjp7fSwiZjp2b2x1bWUuYmV0YS5rdWJlcm5ldGVzLmlvL3N0b3JhZ2UtcHJvdmlzaW9uZXIiOnt9LCJmOnZvbHVtZS5rdWJlcm5ldGVzLmlvL3N0b3JhZ2UtcHJvdmlzaW9uZXIiOnt9fX0sImY6c3BlYyI6eyJmOnZvbHVtZU5hbWUiOnt9fX1CAIoBmAEKF2t1YmUtY29udHJvbGxlci1tYW5hZ2VyEgZVcGRhdGUaAnYxIggIgqaKmwYQADIIRmllbGRzVjE6VQpTeyJmOnN0YXR1cyI6eyJmOmFjY2Vzc01vZGVzIjp7fSwiZjpjYXBhY2l0eSI6eyIuIjp7fSwiZjpzdG9yYWdlIjp7fX0sImY6cGhhc2UiOnt9fX1CBnN0YXR1cxJn
Cg1SZWFkV3JpdGVPbmNlEhMSEQoHc3RvcmFnZRIGCgQ1ME1pGihwdmMtNmI0OWY0OWUtNjJkZS00NTgwLTk1ZGQtNTljYWNkZjk2YzI0Kgt2c3BoZXJlLWNzaTIKRmlsZXN5c3RlbRopCgVCb3VuZBINUmVhZFdyaXRlT25jZRoRCgdzdG9yYWdlEgYKBDUwTWk=
  phase: Uploaded
  progress: {}
  snapshotID: pvc:nginx-example/nginx-logs:aXZkOjM2MWNmZjMzLTlmZmEtNDM4NS05NGI2LTgzNjAwMzQ4OWYzYjo4MmYzNGNiMy02YTkwLTQxODAtOWI3Yy1hMWI3OTMzZjg4MWE
  svcSnapshotName: ""

The Snapshot CR phase is "Uploaded", as shown above.

$ kubectl get upload -A
NAMESPACE   NAME                                          AGE
velero      upload-82f34cb3-6a90-4180-9b7c-a1b7933f881a   4m24s
$ kubectl get upload upload-82f34cb3-6a90-4180-9b7c-a1b7933f881a -n velero -o yaml
apiVersion: datamover.cnsdp.vmware.com/v1alpha1
kind: Upload
metadata:
  creationTimestamp: "2022-11-02T15:58:49Z"
  generation: 3
  labels:
    velero.io/exclude-from-backup: "true"
  name: upload-82f34cb3-6a90-4180-9b7c-a1b7933f881a
  namespace: velero
  resourceVersion: "5840373"
  uid: 77e7288c-4568-422b-ab4c-f11651165e31
spec:
  backupRepository: br-c1d8671d-a8cd-4e34-9aae-35a776875065
  backupTimestamp: "2022-11-02T15:58:49Z"
  snapshotID: ivd:361cff33-9ffa-4385-94b6-836003489f3b:82f34cb3-6a90-4180-9b7c-a1b7933f881a
  snapshotReference: nginx-example/snap-70838ef6-80ce-4a0c-b8ec-4e9e49a3116d
status:
  completionTimestamp: "2022-11-02T15:58:58Z"
  message: Upload completed
  nextRetryTimestamp: "2022-11-02T15:58:49Z"
  phase: Completed
  processingNode: np-wc2source-6868bf59c4-nh5sd
  progress: {}
  startTimestamp: "2022-11-02T15:58:49Z"

The Upload CR phase is "Completed", as shown above.
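To spot-check that every Upload CR finished in one pass, the name/phase pairs can be listed with a standard kubectl jsonpath template and filtered with a tiny helper (the `upload` resource and `.status.phase` field match the CR shown above):

```shell
# List "name<TAB>phase" for every Upload CR in the velero namespace
# (run against the cluster; the jsonpath template is standard kubectl):
#   kubectl get upload -n velero -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'

# all_completed: succeed only if every line on stdin ends in "Completed"
all_completed() {
  ! grep -qv 'Completed$'
}

# Example: pipe the kubectl output through the helper
printf 'upload-82f34cb3\tCompleted\n' | all_completed && echo "all uploads completed"
```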

@jhuisss
Author

jhuisss commented Nov 2, 2022

Continuing from the backup "whole-wc" above, I forcefully deleted the PVC and PV and tried to restore them from the backup:

$ velero restore create --from-backup whole-wc --include-resources pv,pvc --existing-resource-policy update
Restore request "whole-wc-20221102161036" submitted successfully.
Run `velero restore describe whole-wc-20221102161036` or `velero restore logs whole-wc-20221102161036` for more details.
$ velero restore describe whole-wc-20221102161036
Name:         whole-wc-20221102161036
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:                                 InProgress
Estimated total items to be restored:  2
Items restored so far:                 1

Started:    2022-11-02 16:10:36 +0000 UTC
Completed:  <n/a>

Backup:  whole-wc

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        pv, pvc
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Existing Resource Policy:   update

Preserve Service NodePorts:  auto
$ kubectl get clonefromsnapshots -A
NAMESPACE       NAME                                   AGE
nginx-example   7b3cfd2e-f529-4ad0-891a-c3b92d2cc762   70s
$ kubectl get clonefromsnapshot 7b3cfd2e-f529-4ad0-891a-c3b92d2cc762 -n nginx-example -o yaml
apiVersion: backupdriver.cnsdp.vmware.com/v1alpha1
kind: CloneFromSnapshot
metadata:
  creationTimestamp: "2022-11-02T16:10:36Z"
  generation: 1
  labels:
    velero.io/exclude-from-backup: "true"
  name: 7b3cfd2e-f529-4ad0-891a-c3b92d2cc762
  namespace: nginx-example
  resourceVersion: "5842919"
  uid: 0eed650a-66f6-4a6a-918f-cef167970f10
spec:
  apiGroup: ""
  backupRepository: br-c1d8671d-a8cd-4e34-9aae-35a776875065
  cloneCancel: false
  kind: PersistentVolumeClaim
  metadata: Cs4LCgpuZ2lueC1sb2dzEgAaDW5naW54LWV4YW1wbGUiACokNmI0OWY0OWUtNjJkZS00NTgwLTk1ZGQtNTljYWNkZjk2YzI0Mgc1ODM5NTY1OABCCAiBpoqbBhAAWgwKA2FwcBIFbmdpbnhixQIKMGt1YmVjdGwua3ViZXJuZXRlcy5pby9sYXN0LWFwcGxpZWQtY29uZmlndXJhdGlvbhKQAnsiYXBpVmVyc2lvbiI6InYxIiwia2luZCI6IlBlcnNpc3RlbnRWb2x1bWVDbGFpbSIsIm1ldGFkYXRhIjp7ImFubm90YXRpb25zIjp7fSwibGFiZWxzIjp7ImFwcCI6Im5naW54In0sIm5hbWUiOiJuZ2lueC1sb2dzIiwibmFtZXNwYWNlIjoibmdpbngtZXhhbXBsZSJ9LCJzcGVjIjp7ImFjY2Vzc01vZGVzIjpbIlJlYWRXcml0ZU9uY2UiXSwicmVzb3VyY2VzIjp7InJlcXVlc3RzIjp7InN0b3JhZ2UiOiI1ME1pIn19LCJzdG9yYWdlQ2xhc3NOYW1lIjoidnNwaGVyZS1jc2kifX0KYiYKH3B2Lmt1YmVybmV0ZXMuaW8vYmluZC1jb21wbGV0ZWQSA3llc2IrCiRwdi5rdWJlcm5ldGVzLmlvL2JvdW5kLWJ5LWNvbnRyb2xsZXISA3llc2JHCi12b2x1bWUuYmV0YS5rdWJlcm5ldGVzLmlvL3N0b3JhZ2UtcHJvdmlzaW9uZXISFmNzaS52c3BoZXJlLnZtd2FyZS5jb21iQgoodm9sdW1lLmt1YmVybmV0ZXMuaW8vc3RvcmFnZS1wcm92aXNpb25lchIWY3NpLnZzcGhlcmUudm13YXJlLmNvbXIca3ViZXJuZXRlcy5pby9wdmMtcHJvdGVjdGlvbnoAigHAAgoZa3ViZWN0bC1jbGllbnQtc2lkZS1hcHBseRIGVXBkYXRlGgJ2MSIICIGmipsGEAAyCEZpZWxkc1YxOoACCv0BeyJmOm1ldGFkYXRhIjp7ImY6YW5ub3RhdGlvbnMiOnsiLiI6e30sImY6a3ViZWN0bC5rdWJlcm5ldGVzLmlvL2xhc3QtYXBwbGllZC1jb25maWd1cmF0aW9uIjp7fX0sImY6bGFiZWxzIjp7Ii4iOnt9LCJmOmFwcCI6e319fSwiZjpzcGVjIjp7ImY6YWNjZXNzTW9kZXMiOnt9LCJmOnJlc291cmNlcyI6eyJmOnJlcXVlc3RzIjp7Ii4iOnt9LCJmOnN0b3JhZ2UiOnt9fX0sImY6c3RvcmFnZUNsYXNzTmFtZSI6e30sImY6dm9sdW1lTW9kZSI6e319fUIAigG4AgoXa3ViZS1jb250cm9sbGVyLW1hbmFnZXISBlVwZGF0ZRoCdjEiCAiCpoqbBhAAMghGaWVsZHNWMTr6AQr3AXsiZjptZXRhZGF0YSI6eyJmOmFubm90YXRpb25zIjp7ImY6cHYua3ViZXJuZXRlcy5pby9iaW5kLWNvbXBsZXRlZCI6e30sImY6cHYua3ViZXJuZXRlcy5pby9ib3VuZC1ieS1jb250cm9sbGVyIjp7fSwiZjp2b2x1bWUuYmV0YS5rdWJlcm5ldGVzLmlvL3N0b3JhZ2UtcHJvdmlzaW9uZXIiOnt9LCJmOnZvbHVtZS5rdWJlcm5ldGVzLmlvL3N0b3JhZ2UtcHJvdmlzaW9uZXIiOnt9fX0sImY6c3BlYyI6eyJmOnZvbHVtZU5hbWUiOnt9fX1CAIoBmAEKF2t1YmUtY29udHJvbGxlci1tYW5hZ2VyEgZVcGRhdGUaAnYxIggIgqaKmwYQADIIRmllbGRzVjE6VQpTeyJmOnN0YXR1cyI6eyJmOmFjY2Vzc01vZGVzIjp7fSwiZjpjYXBhY2l0eSI6eyIuIjp7fSwiZjpzdG9yYWdlIjp7fX0sImY6cGhhc2UiOnt9fX1CBnN0YXR1cxJn
Cg1SZWFkV3JpdGVPbmNlEhMSEQoHc3RvcmFnZRIGCgQ1ME1pGihwdmMtNmI0OWY0OWUtNjJkZS00NTgwLTk1ZGQtNTljYWNkZjk2YzI0Kgt2c3BoZXJlLWNzaTIKRmlsZXN5c3RlbRopCgVCb3VuZBINUmVhZFdyaXRlT25jZRoRCgdzdG9yYWdlEgYKBDUwTWk=
  snapshotID: pvc:nginx-example/nginx-logs:aXZkOjM2MWNmZjMzLTlmZmEtNDM4NS05NGI2LTgzNjAwMzQ4OWYzYjo4MmYzNGNiMy02YTkwLTQxODAtOWI3Yy1hMWI3OTMzZjg4MWE
status:
  phase: New
$ kubectl get download -A
NAMESPACE   NAME                                                                                 AGE
velero      download-82f34cb3-6a90-4180-9b7c-a1b7933f881a-03feca3d-04bb-4ec4-9df8-99c549324ea7   94s
$ kubectl get download download-82f34cb3-6a90-4180-9b7c-a1b7933f881a-03feca3d-04bb-4ec4-9df8-99c549324ea7 -n velero -o yaml
apiVersion: datamover.cnsdp.vmware.com/v1alpha1
kind: Download
metadata:
  creationTimestamp: "2022-11-02T16:10:38Z"
  generation: 5
  labels:
    velero.io/exclude-from-backup: "true"
  name: download-82f34cb3-6a90-4180-9b7c-a1b7933f881a-03feca3d-04bb-4ec4-9df8-99c549324ea7
  namespace: velero
  resourceVersion: "5843058"
  uid: e6b7e52a-4cb3-4799-9582-4fdcfb1feaf2
spec:
  backupRepositoryName: br-c1d8671d-a8cd-4e34-9aae-35a776875065
  clonefromSnapshotReference: nginx-example/7b3cfd2e-f529-4ad0-891a-c3b92d2cc762
  protectedEntityID: ivd:8fcad05a-24d9-43c6-997f-5983a8ccf8f0
  restoreTimestamp: "2022-11-02T16:10:38Z"
  snapshotID: ivd:361cff33-9ffa-4385-94b6-836003489f3b:82f34cb3-6a90-4180-9b7c-a1b7933f881a
status:
  message: 'Failed to download snapshot, ivd:361cff33-9ffa-4385-94b6-836003489f3b:82f34cb3-6a90-4180-9b7c-a1b7933f881a,
    from durable object storage. Data copy failed from PE ivd:8fcad05a-24d9-43c6-997f-5983a8ccf8f0,
    to PE ivd:361cff33-9ffa-4385-94b6-836003489f3b:82f34cb3-6a90-4180-9b7c-a1b7933f881a:
    Open virtual disk file failed. The error code is 1. with error code: 1'
  nextRetryTimestamp: "2022-11-02T16:16:07Z"
  phase: Retry
  processingNode: np-wc2source-6868bf59c4-nh5sd
  progress: {}
  retryCount: 2
  startTimestamp: "2022-11-02T16:10:38Z"

@jhuisss
Author

jhuisss commented Nov 2, 2022

velero debug --backup whole-wc:
bundle-2022-11-02-16-17-46.tar.gz
velero debug --restore whole-wc-20221102161036:
bundle-2022-11-02-16-18-13.tar.gz

Both log bundles are attached, thanks.

@deepakkinni
Collaborator

This is most likely a VDDK issue; please make sure you retrieve the VDDK logs: https://github.com/vmware-tanzu/velero-plugin-for-vsphere/blob/main/docs/troubleshooting.md#vddk

@jhuisss
Author

jhuisss commented Nov 4, 2022


I tried to follow the doc you provided, but:

$ kubectl exec -n velero -it datamgr-for-vsphere-plugin-zz5p9 -- /bin/bash
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "65429cce8de75f36ed9a2ae758045b2af30c7c1f042208d9a342763b8ada5cbd": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/bash": stat /bin/bash: no such file or directory: unknown
kubectl -n velero exec -it datamgr-for-vsphere-plugin-zz5p9 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/home/nonroot #
/home/nonroot # cd /tmp/vmware-root/
/tmp/vmware-root # ls -al
total 16
drwx------    2 root     root          4096 Nov  4 07:58 .
drwxrwxrwt    1 root     root          4096 Nov  4 07:59 ..
-rw-r-----    1 root     root          4330 Nov  4 07:58 vixDiskLib-1.log
/tmp/vmware-root # cat vixDiskLib-1.log
sh: cat: not found
/tmp/vmware-root # vi vixDiskLib-1.log
sh: vi: not found
/tmp/vmware-root # vim vixDiskLib-1.log
sh: vim: not found

It is not easy for me to collect this log. Could you please suggest a better way, or could you try to reproduce the issue? Thanks.
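For what it's worth, the file can usually be pulled out of a minimal container without cat/vi at all. `kubectl cp` needs tar in the image; if that is also missing, a reader built only from POSIX shell built-ins works over `kubectl exec`. A sketch (the pod name is the one from this thread; adjust for your cluster):

```shell
# Option 1: kubectl cp (requires tar in the container image)
#   kubectl cp velero/datamgr-for-vsphere-plugin-zz5p9:/tmp/vmware-root/vixDiskLib-1.log ./vixDiskLib-1.log

# Option 2: read the file with shell built-ins only (no cat/vi/tar needed).
# dump_file: print a file line by line using 'read' and 'printf'
dump_file() {
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\n' "$line"
  done < "$1"
}

# Run the same loop inside the pod and redirect locally:
#   kubectl exec -n velero datamgr-for-vsphere-plugin-zz5p9 -- \
#     sh -c 'while IFS= read -r l || [ -n "$l" ]; do printf "%s\n" "$l"; done < /tmp/vmware-root/vixDiskLib-1.log' \
#     > vixDiskLib-1.log
```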

@xing-yang
Contributor

Can you show more details from the velero backup so we can see which resources were backed up? We do not recommend backing up the entire cluster, as we can't guarantee that custom resources can be restored. We recommend backing up by namespace or by label selector.

Regarding the following restore command:

velero restore create --from-backup whole-wc --include-resources pv,pvc --existing-resource-policy update

Can you delete the resources and try a restore without the "--include-resources pv,pvc" and "--existing-resource-policy update" options, as follows? I don't think we've qualified those options.

velero restore create --from-backup whole-wc

At restore time, the PVC will be restored as long as it was backed up, and the PV will be re-created. If you specify that the PV resource be included, Velero may try to restore it on top of the PV that was already created, which could also cause the problem.
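After retrying the restore this way, whether the PVC actually re-bound can be checked by parsing the `kubectl get pvc` table; a small sketch (column positions assumed from the output earlier in this thread, namespace/PVC names are the ones from this report):

```shell
# pvc_status: given 'kubectl get pvc' output on stdin, print the STATUS
# column for the named PVC (column 2 when run without -A)
pvc_status() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Usage:
#   kubectl get pvc -n nginx-example | pvc_status nginx-logs
# Prints "Bound" once the restored PVC has bound to its re-created PV.
```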

@jhuisss
Author

jhuisss commented Nov 4, 2022


I finally got the VDDK log file:
vixDiskLib-1.log

This time I created a backup with only "pv,pvc":

$ velero backup create backup-pv-pvc2 --include-resources pvc,pv
Backup request "backup-pv-pvc2" submitted successfully.
Run `velero backup describe backup-pv-pvc2` or `velero backup logs backup-pv-pvc2` for more details.

and then deleted the PVC and PV, and tried to restore them:

$ velero restore create --from-backup backup-pv-pvc2
Restore request "backup-pv-pvc2-20221104143207" submitted successfully.
$ kubectl get download download-022d06ac-c4c4-4266-9c37-95195b96da02-0f0057b7-b96c-42b7-94c7-8ee010a66a1f -n velero -o yaml
apiVersion: datamover.cnsdp.vmware.com/v1alpha1
kind: Download
metadata:
  creationTimestamp: "2022-11-04T14:32:09Z"
  generation: 5
  labels:
    velero.io/exclude-from-backup: "true"
  name: download-022d06ac-c4c4-4266-9c37-95195b96da02-0f0057b7-b96c-42b7-94c7-8ee010a66a1f
  namespace: velero
  resourceVersion: "6451569"
  uid: fe9c08bb-3a3a-41d5-9965-76e3f055ea18
spec:
  backupRepositoryName: br-dbd695f5-7b62-4748-8331-36027c843b98
  clonefromSnapshotReference: nginx-example/d2341d18-080f-4cde-82e4-d222fc9e5c7d
  protectedEntityID: ivd:a418b840-7827-4535-8210-4e41aaee6e64
  restoreTimestamp: "2022-11-04T14:32:09Z"
  snapshotID: ivd:6397e41d-10ec-4f83-ace2-399f134be05b:022d06ac-c4c4-4266-9c37-95195b96da02
status:
  message: 'Failed to download snapshot, ivd:6397e41d-10ec-4f83-ace2-399f134be05b:022d06ac-c4c4-4266-9c37-95195b96da02,
    from durable object storage. Data copy failed from PE ivd:a418b840-7827-4535-8210-4e41aaee6e64,
    to PE ivd:6397e41d-10ec-4f83-ace2-399f134be05b:022d06ac-c4c4-4266-9c37-95195b96da02:
    Open virtual disk file failed. The error code is 1. with error code: 1'
  nextRetryTimestamp: "2022-11-04T14:37:41Z"
  phase: Retry
  processingNode: np-wc2source-6868bf59c4-nh5sd
  progress: {}
  retryCount: 2
  startTimestamp: "2022-11-04T14:32:09Z"

@xing-yang
Contributor

Can you try including only "pvc", not "pv", when you do the backup? The PV will be freshly created at restore time; if Velero tries to restore it on top of the newly created PV, it could cause problems.

velero backup create backup-pv-pvc2 --include-resources pvc

@jhuisss
Author

jhuisss commented Nov 4, 2022

Can you try including only "pvc", not "pv", when you do the backup? The PV will be freshly created at restore time; if Velero tries to restore it on top of the newly created PV, it could cause problems.

velero backup create backup-pv-pvc2 --include-resources pvc

yes, here is the result:

$ velero restore create --from-backup backup-pv-pvc2 --include-resources pvc
Restore request "backup-pv-pvc2-20221104151007" submitted successfully.

$ kubectl get clonefromsnapshot -A
NAMESPACE       NAME                                   AGE
nginx-example   f79d494c-f65d-4bdc-9bc0-fe035c2186b7   13s

$ kubectl get download -A
No resources found
$ kubectl get clonefromsnapshot f79d494c-f65d-4bdc-9bc0-fe035c2186b7 -n nginx-example -o yaml
apiVersion: backupdriver.cnsdp.vmware.com/v1alpha1
kind: CloneFromSnapshot
metadata:
  creationTimestamp: "2022-11-04T15:10:07Z"
  generation: 1
  labels:
    velero.io/exclude-from-backup: "true"
  name: f79d494c-f65d-4bdc-9bc0-fe035c2186b7
  namespace: nginx-example
  resourceVersion: "6459921"
  uid: df7e3e08-b001-4d0a-98be-b6617e0319e6
spec:
  apiGroup: ""
  backupRepository: br-dbd695f5-7b62-4748-8331-36027c843b98
  cloneCancel: false
  kind: PersistentVolumeClaim
  metadata: Cs4LCgpuZ2lueC1sb2dzEgAaDW5naW54LWV4YW1wbGUiACokOWY3N2M5ZTctOTRhZi00ZjlmLWJmYjEtMjUwNmU4ZDJjMTliMgc2NDQ0NzY2OABCCAj1tpSbBhAAWgwKA2FwcBIFbmdpbnhixQIKMGt1YmVjdGwua3ViZXJuZXRlcy5pby9sYXN0LWFwcGxpZWQtY29uZmlndXJhdGlvbhKQAnsiYXBpVmVyc2lvbiI6InYxIiwia2luZCI6IlBlcnNpc3RlbnRWb2x1bWVDbGFpbSIsIm1ldGFkYXRhIjp7ImFubm90YXRpb25zIjp7fSwibGFiZWxzIjp7ImFwcCI6Im5naW54In0sIm5hbWUiOiJuZ2lueC1sb2dzIiwibmFtZXNwYWNlIjoibmdpbngtZXhhbXBsZSJ9LCJzcGVjIjp7ImFjY2Vzc01vZGVzIjpbIlJlYWRXcml0ZU9uY2UiXSwicmVzb3VyY2VzIjp7InJlcXVlc3RzIjp7InN0b3JhZ2UiOiI1ME1pIn19LCJzdG9yYWdlQ2xhc3NOYW1lIjoidnNwaGVyZS1jc2kifX0KYiYKH3B2Lmt1YmVybmV0ZXMuaW8vYmluZC1jb21wbGV0ZWQSA3llc2IrCiRwdi5rdWJlcm5ldGVzLmlvL2JvdW5kLWJ5LWNvbnRyb2xsZXISA3llc2JHCi12b2x1bWUuYmV0YS5rdWJlcm5ldGVzLmlvL3N0b3JhZ2UtcHJvdmlzaW9uZXISFmNzaS52c3BoZXJlLnZtd2FyZS5jb21iQgoodm9sdW1lLmt1YmVybmV0ZXMuaW8vc3RvcmFnZS1wcm92aXNpb25lchIWY3NpLnZzcGhlcmUudm13YXJlLmNvbXIca3ViZXJuZXRlcy5pby9wdmMtcHJvdGVjdGlvbnoAigHAAgoZa3ViZWN0bC1jbGllbnQtc2lkZS1hcHBseRIGVXBkYXRlGgJ2MSIICPW2lJsGEAAyCEZpZWxkc1YxOoACCv0BeyJmOm1ldGFkYXRhIjp7ImY6YW5ub3RhdGlvbnMiOnsiLiI6e30sImY6a3ViZWN0bC5rdWJlcm5ldGVzLmlvL2xhc3QtYXBwbGllZC1jb25maWd1cmF0aW9uIjp7fX0sImY6bGFiZWxzIjp7Ii4iOnt9LCJmOmFwcCI6e319fSwiZjpzcGVjIjp7ImY6YWNjZXNzTW9kZXMiOnt9LCJmOnJlc291cmNlcyI6eyJmOnJlcXVlc3RzIjp7Ii4iOnt9LCJmOnN0b3JhZ2UiOnt9fX0sImY6c3RvcmFnZUNsYXNzTmFtZSI6e30sImY6dm9sdW1lTW9kZSI6e319fUIAigG4AgoXa3ViZS1jb250cm9sbGVyLW1hbmFnZXISBlVwZGF0ZRoCdjEiCAj2tpSbBhAAMghGaWVsZHNWMTr6AQr3AXsiZjptZXRhZGF0YSI6eyJmOmFubm90YXRpb25zIjp7ImY6cHYua3ViZXJuZXRlcy5pby9iaW5kLWNvbXBsZXRlZCI6e30sImY6cHYua3ViZXJuZXRlcy5pby9ib3VuZC1ieS1jb250cm9sbGVyIjp7fSwiZjp2b2x1bWUuYmV0YS5rdWJlcm5ldGVzLmlvL3N0b3JhZ2UtcHJvdmlzaW9uZXIiOnt9LCJmOnZvbHVtZS5rdWJlcm5ldGVzLmlvL3N0b3JhZ2UtcHJvdmlzaW9uZXIiOnt9fX0sImY6c3BlYyI6eyJmOnZvbHVtZU5hbWUiOnt9fX1CAIoBmAEKF2t1YmUtY29udHJvbGxlci1tYW5hZ2VyEgZVcGRhdGUaAnYxIggI9raUmwYQADIIRmllbGRzVjE6VQpTeyJmOnN0YXR1cyI6eyJmOmFjY2Vzc01vZGVzIjp7fSwiZjpjYXBhY2l0eSI6eyIuIjp7fSwiZjpzdG9yYWdlIjp7fX0sImY6cGhhc2UiOnt9fX1CBnN0YXR1cxJn
Cg1SZWFkV3JpdGVPbmNlEhMSEQoHc3RvcmFnZRIGCgQ1ME1pGihwdmMtOWY3N2M5ZTctOTRhZi00ZjlmLWJmYjEtMjUwNmU4ZDJjMTliKgt2c3BoZXJlLWNzaTIKRmlsZXN5c3RlbRopCgVCb3VuZBINUmVhZFdyaXRlT25jZRoRCgdzdG9yYWdlEgYKBDUwTWk=
  snapshotID: pvc:nginx-example/nginx-logs:aXZkOjYzOTdlNDFkLTEwZWMtNGY4My1hY2UyLTM5OWYxMzRiZTA1YjowMjJkMDZhYy1jNGM0LTQyNjYtOWMzNy05NTE5NWI5NmRhMDI
status:
  phase: New

Here is the new vddk log:
new-vixDiskLib-1.log

velero debug --backup backup-pv-pvc2:
bundle-2022-11-04-15-17-31.tar.gz

velero debug --restore backup-pv-pvc2-20221104151007:
bundle-2022-11-04-15-17-54.tar.gz

@xing-yang
Contributor

I see the following in the logs:

vixDiskLib logs:

2022-11-04T14:54:07.636Z| host-13| I005: DISKLIB-DSCPTR: : "ha-nfc://[vsanDatastore] 2f85df62-a08b-badd-ba23-bc97e1d50a70/_00a2/[email protected]:902" : Failed to open NBD extent.
2022-11-04T14:54:07.636Z| host-13| I005: DISKLIB-LINK  : "ha-nfc://[vsanDatastore] 2f85df62-a08b-badd-ba23-bc97e1d50a70/_00a2/[email protected]:902" : failed to open (NBD_ERR_GENERIC).  
2022-11-04T14:54:07.636Z| host-13| I005: DISKLIB-CHAIN : "ha-nfc://[vsanDatastore] 2f85df62-a08b-badd-ba23-bc97e1d50a70/_00a2/[email protected]:902" : failed to open (NBD_ERR_GENERIC).
2022-11-04T14:54:07.636Z| host-13| I005: DISKLIB-LIB   : Failed to open 'ha-nfc://[vsanDatastore] 2f85df62-a08b-badd-ba23-bc97e1d50a70/_00a2/[email protected]:902' with flags 0x800a NBD_ERR_GENERIC (290).
2022-11-04T14:54:07.636Z| host-13| E002: VixDiskLib: Detected DiskLib error 290 (NBD_ERR_GENERIC).
2022-11-04T14:54:07.636Z| host-13| E002: VixDiskLib: Failed to open disk ha-nfc://[vsanDatastore] 2f85df62-a08b-badd-ba23-bc97e1d50a70/_00a2/[email protected]:902!52 29 44 87 ea f1 91 48-bd 51 f6 4a fd a7 1f e0. Error 1 (Unknown error) (DiskLib error 290: NBD_ERR_GENERIC) at 6507.
2022-11-04T14:54:07.640Z| host-13| E002: VixDiskLib: VixDiskLib_OpenEx: Cannot open disk [datastore-1035] VStorageObject:a418b840-7827-4535-8210-4e41aaee6e64. Error 1 (Unknown error) (DiskLib error 290: NBD_ERR_GENERIC) at 7171.
2022-11-04T14:54:07.640Z| host-13| E002: VixDiskLib: VixDiskLib_Open: Cannot open disk . Error 1 (Unknown error) at 7249.

DataMover logs:

--> 
2022-11-04T14:54:07.728Z info -[00013] [Originator@6876 sub=vimaccess] AllowVMotion: Enable VMotion.
--> 
2022-11-04T14:54:07.731Z verbose -[00013] [Originator@6876 sub=vimaccess] trying to get vStorageObjectManager as typeof N3Vim4Vslm7Vcenter21VStorageObjectManagerE
2022-11-04T14:54:07.779Z verbose -[00013] [Originator@6876 sub=vmomi.soapStub[81]] Resetting stub adapter for server <cs p:00007f32b00102d0, TCP:10.202.240.52:443> : Closed
2022-11-04T14:54:07.779Z verbose -[00013] [Originator@6876 sub=IO] Close; <SSL(<io_obj p:0x00007f32c4102038, h:17, <TCP '100.96.1.253 : 42162'>, <TCP '10.202.240.52 : 443'>>)>
2022-11-04T14:54:07.779Z verbose -[00013] [Originator@6876 sub=IO] shutdown; <io_obj p:0x00007f32c4102038, h:17, <TCP '100.96.1.253 : 42162'>, <TCP '10.202.240.52 : 443'>>
2022-11-04T14:54:07.779Z verbose -[00013] [Originator@6876 sub=IO] Close; <SSL(<io_obj p:0x00007f32640020f8, h:18, <TCP '100.96.1.253 : 42164'>, <TCP '10.202.240.52 : 443'>>)>
2022-11-04T14:54:07.779Z verbose -[00013] [Originator@6876 sub=IO] shutdown; <io_obj p:0x00007f32640020f8, h:18, <TCP '100.96.1.253 : 42164'>, <TCP '10.202.240.52 : 443'>>
2022-11-04T14:54:07.779Z verbose -[00026] [Originator@6876 sub=IO] Close; <SSL(<io_obj p:0x00007f32c4102038, h:-1, <TCP '100.96.1.253 : 42162'>, <TCP '10.202.240.52 : 443'> FD Closed>)>
2022-11-04T14:54:07.779Z verbose -[00022] [Originator@6876 sub=IO] Close; <SSL(<io_obj p:0x00007f32640020f8, h:-1, <TCP '100.96.1.253 : 42164'>, <TCP '10.202.240.52 : 443'> FD Closed>)>
2022-11-04T14:54:07.779Z verbose -[00026] [Originator@6876 sub=IO] Close; <SSL(<io_obj p:0x00007f32c4102038, h:-1, <TCP '100.96.1.253 : 42162'>, <TCP '10.202.240.52 : 443'> FD Closed>)>
2022-11-04T14:54:07.780Z verbose -[00022] [Originator@6876 sub=IO] Close; <SSL(<io_obj p:0x00007f32640020f8, h:-1, <TCP '100.96.1.253 : 42164'>, <TCP '10.202.240.52 : 443'> FD Closed>)>
time="2022-11-04T14:54:07Z" level=info msg="Return from the call of ivd PE overwrite API for remote PE." Remote PEID="ivd:6397e41d-10ec-4f83-ace2-399f134be05b:022d06ac-c4c4-4266-9c37-95195b96da02" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/dataMover/data_mover.go:176"
time="2022-11-04T14:54:07Z" level=error msg="Failed to overwrite from remote repository." Remote PEID="ivd:6397e41d-10ec-4f83-ace2-399f134be05b:022d06ac-c4c4-4266-9c37-95195b96da02" error="Data copy failed from PE ivd:a418b840-7827-4535-8210-4e41aaee6e64, to PE ivd:6397e41d-10ec-4f83-ace2-399f134be05b:022d06ac-c4c4-4266-9c37-95195b96da02: Open virtual disk file failed. The error code is 1. with error code: 1" error.file="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/ivd/ivd_protected_entity.go:117" error.function=github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/ivd.IVDProtectedEntity.getDataWriter logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/dataMover/data_mover.go:178"
……..
--> 
2022-11-04T14:54:07.960Z info -[00012] [Originator@6876 sub=vimaccess] AllowVMotion: Disable VMotion.
--> 

There are frequent "Enable VMotion" and "Disable VMotion" messages in the DataMover logs. Was some other operation going on that triggered this? I think that is why we saw the "Open virtual disk file failed" errors.

@deepakkinni
Collaborator

Hi @jhuisss ,
We'd like to raise this issue with the VDDK team. Could you please collect the vCenter and ESXi log bundles?

@jhuisss
Author

jhuisss commented Nov 7, 2022

Hi, thanks for your reply. It is not easy for me to get the requested log bundles; I'll look into how to collect them. In the meantime, could you please try to reproduce the issue?

@jhuisss
Author

jhuisss commented Dec 14, 2022

Hi, any updates? If you cannot reproduce the issue, please let me know, thanks.

@deepakkinni
Collaborator

I opened an internal bug against VDDK to understand the error code; hopefully they get back soon. In the meantime, could you please answer Xing's query about the "Enable VMotion" and "Disable VMotion" messages in the DataMover logs? This is something we haven't seen in the past.

@deepakkinni
Collaborator

Got a reply from the VDDK team:

There is no clue to the root cause in the logging; this is usually caused by a network error.

Could you:

1. Set the verbose log level for VDDK (currently the verbose level is only set for NFC):

vixDiskLib.transport.LogLevel=6
vixDiskLib.nfc.LogLevel=2  # avoid rotating the log too quickly

2. Collect the vm-support bundle for the host where the connection failed.

> 2022-11-04T15:05:01.987Z| host-17| E002: VixDiskLib: Failed to open disk ha-nfc://[vsanDatastore] 2f85df62-a08b-badd-ba23-bc97e1d50a70/_00a2/[[email protected]](mailto:[email protected]):902!52 44 8e 8b 79 de 38 df-97 34 e3 0c ba ad e8 9a. Error 1 (Unknown error) (DiskLib error 290: NBD_ERR_GENERIC) at 6507.
