-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add VirtualMachineCRCErrors alert #663
Add VirtualMachineCRCErrors alert #663
Conversation
The ssp-operator is an "Operator that deploys additional KubeVirt resources". Does this PR fit to this expectation? |
/hold Please let us discuss the necessity for this first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing some SSP context but the PR looks great!
@@ -164,6 +164,22 @@ func getAlertRules() ([]promv1.Rule, error) { | |||
componentAlertLabelKey: componentAlertLabelValue, | |||
}, | |||
}, | |||
{ | |||
Alert: "VirtualMachineCRCErrors", | |||
Expr: intstr.FromString("kubevirt_ssp_vm_rbd_volume{volume_mode=\"Block\", rxbounce_enabled=\"false\"} > 0"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the issue specific to Block
PVCs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding source here for future ref kubevirt/kubevirt#9741 (comment)
controllers/vm_controller.go
Outdated
|
||
func (r *VmReconciler) setupController(mgr ctrl.Manager) error { | ||
return ctrl.NewControllerManagedBy(mgr). | ||
Named("vm-controller"). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vmControllerName
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
controllers/vm_controller.go
Outdated
} | ||
pvc, err := r.getPVC(vm, volume.PersistentVolumeClaim.ClaimName) | ||
if err != nil { | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are better off requeueing in case of HTTP errors, we don't want to just lose that piece of information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and adding it to a list of errors, and continuing
controllers/vm_controller.go
Outdated
} | ||
pv, err := r.getPV(vm, pvc) | ||
if err != nil { | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above about requeueing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
58e061b
to
ffe6c8a
Compare
/cc sradco |
ffe6c8a
to
8cbf8e9
Compare
/cc sradco |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of small concerns, but the controller makes sense to me
go.mod
Outdated
@@ -6,15 +6,17 @@ require ( | |||
github.com/blang/semver/v4 v4.0.0 | |||
github.com/fsnotify/fsnotify v1.6.0 | |||
github.com/go-logr/logr v1.2.4 | |||
github.com/hashicorp/go-multierror v1.1.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think libraries will maintain MPL-2.0, only tools like terraform and vault will be under a Business Source License
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
@@ -164,6 +164,22 @@ func getAlertRules() ([]promv1.Rule, error) { | |||
componentAlertLabelKey: componentAlertLabelValue, | |||
}, | |||
}, | |||
{ | |||
Alert: "VirtualMachineCRCErrors", | |||
Expr: intstr.FromString("kubevirt_ssp_vm_rbd_volume{volume_mode=\"Block\", rxbounce_enabled=\"false\"} > 0"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need or vector(0)
or will this eval just fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be fine, no further actions are taken on the results of kubevirt_ssp_vm_rbd_volume
return fmt.Errorf("alert %s found", alertName) | ||
}, env.Timeout(), time.Second).ShouldNot(HaveOccurred()) | ||
|
||
Consistently(func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
/lgtm |
@@ -26,6 +26,7 @@ COPY main.go main.go | |||
COPY api/ api/ | |||
COPY controllers/ controllers/ | |||
COPY internal/ internal/ | |||
COPY pkg/ pkg/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you place the controller in the internal path instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for the metrics, and not for the controller. But following kubevirt/community#219, we want to stay consistent in terms of monitoring between all KubeVirt components
} | ||
|
||
func alertShouldNotBeActive(alertName string) { | ||
Eventually(func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this Eventually
be dropped and be replaced with the Consistently
after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really, Consistently
would not wait for the alert to disappear and would exit with error immediately
/cc sradco |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lyarwood The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
} | ||
|
||
mounter, rxbounceEnabled := getAttributes(pv) | ||
if mounter != "" && mounter != "rbd" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - why check for ""?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if no value is set, the storage class defaults to rbd mounter
verbs: | ||
- get | ||
- list | ||
- watch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - appreciate they are all pretty much the same but do we need list
and watch
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the ctrl runtime client will attempt to cache anything you GET once,
and for that, you need list and watch:
https://sdk.operatorframework.io/docs/faqs/#after-deploying-my-operator-i-see-errors-like-failed-to-watch-external-type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah apologies of course!
/lgtm cancel Gah, github is such a useless platform sometimes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have mostly code style comments. Otherwise it looks good.
Can you add text to the commit messages explaining why the change is done?
af7c9b5
to
447c74d
Compare
/cc sradco |
447c74d
to
134b92d
Compare
/cc sradco |
134b92d
to
f9a4cd5
Compare
/cc sradco |
Thanks! /lgtm |
/hold cancel |
Signed-off-by: João Vilaça <[email protected]>
Signed-off-by: João Vilaça <[email protected]>
f9a4cd5
to
d138f6e
Compare
/cc sradco |
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
Had to rebase due to some conflicts with a recent merged PR, on the RBAC related files |
/lgtm |
Should this be backported? |
yes, we should backport to release-v0.18 |
/cherry-pick release-v0.18 |
@machadovilaca: #663 failed to apply on top of branch "release-v0.18":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What this PR does / why we need it:
Following the initial work made in kubevirt/kubevirt#9741 and kubevirt/containerized-data-importer#2835, we need a way to alert users when they are using RBD volumes with VMs without the
krbd:rxbounce
option enabled because they may lose data and face severe performance degradation in the cluster.Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Release note: