fix upstream status flicker and constant status updates #10384
Conversation
// Only mark non-kube gateways as accepted
// Regardless, kube gw proxies are filtered out of these reports before reporting in translator_syncer.go
allReports.Accept(nonKubeProxies.AsInputResources()...)

// report Upstream[Group]s as Accepted initially but only if we at least 1 edge proxy
Suggested change:
- // report Upstream[Group]s as Accepted initially but only if we at least 1 edge proxy
+ // report Upstream[Group]s as Accepted initially but only if we have at least 1 edge proxy
The status reporter compares the desired status with the existing status in the solo-kit object to determine whether it should actually UPDATE the resource. The current proxy_syncer does a once-per-second status sync and relies on this status comparison being functional to prevent endless object UPDATEs. This commit fixes the solo-kit objects (really wrappers) in the krt collections to contain the status so an accurate comparison can take place.
Visit the preview URL for this PR (updated for commit 0f2a5f6): https://gloo-edge--pr10384-us-status-flicker-8uiiree0.web.app (expires Sun, 01 Dec 2024 17:00:36 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 77c2b86e287749579b7ff9cadb81e099042ef677
Issues linked to changelog:
The code looks great! What are the ways that I could test this manually to prove that the flickering no longer happens? Tangentially, does this expose a gap in our testing where we could use internal metrics or other signals to identify not just whether gloo performed what it expected to do, but also that it didn't incur out-of-the-ordinary operations? (I'm thinking of something that analyzes metrics for the gloo pod after a run and either asserts some expected values of webhook updates or other signals, or even just something that reports the values so we can see over time how they change across versions of gloo.)
@@ -371,6 +371,7 @@ func (s *ProxySyncer) Init(ctx context.Context, dbg *krt.DebugHandler) error {
Namespace: u.GetNamespace(),
}
glooUs.SetMetadata(&md)
glooUs.NamespacedStatuses = &u.Status
Just so I understand, is this block of code intended to do something similar to https://github.com/solo-io/solo-kit/blob/main/pkg/api/v1/clients/kube/resource_client.go#L477 where we are converting the kubeUpstream?
Why do we just do this for Upstreams and not for all other gloo resources?
Yes exactly, it is a non-solo-kit-bound way of doing the conversion from kube type to solo-kit type.
Why do we just do this for Upstreams and not for all other gloo resources?
We do it for the only other gloo type, AuthConfigs, below at https://github.com/solo-io/gloo/pull/10384/files#diff-c658efd653154e7eb520a5ea6c646cdd45af7ed0ac56224360affdf481a562c4R719
Other than that, RLCs are a skv2 type and don't have the same challenges
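For readers less familiar with the conversion being discussed, here is a minimal, self-contained sketch of the idea: copy the existing status onto the solo-kit wrapper while converting from the kube type. The type and field names below are simplified stand-ins, not the actual gloo/solo-kit APIs.

```go
// Sketch only: simplified stand-ins for the Kubernetes-backed Upstream CRD
// and the solo-kit Upstream wrapper used in the krt collections.
package main

import "fmt"

type NamespacedStatuses map[string]string // controller name -> state (simplified)

type KubeUpstream struct {
	Name, Namespace string
	Status          NamespacedStatuses
}

type GlooUpstream struct {
	Name, Namespace    string
	NamespacedStatuses *NamespacedStatuses
}

// convert mirrors the shape of the PR's change: besides metadata, the existing
// status is carried onto the solo-kit wrapper so the status reporter can diff
// against it and skip no-op UPDATEs.
func convert(u *KubeUpstream) *GlooUpstream {
	glooUs := &GlooUpstream{
		Name:      u.Name,
		Namespace: u.Namespace,
	}
	glooUs.NamespacedStatuses = &u.Status // analogous to the one-line fix in the diff
	return glooUs
}

func main() {
	ku := &KubeUpstream{
		Name:      "db",
		Namespace: "gloo-system",
		Status:    NamespacedStatuses{"gloo-system": "Accepted"},
	}
	fmt.Println((*convert(ku).NamespacedStatuses)["gloo-system"]) // Accepted
}
```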
helpers.EventuallyResourceAccepted(func() (resources.InputResource, error) {
// Upstreams no longer report status if they have not been translated at all to avoid conflicting with
// other syncers that have translated them, so we can only detect that the objects exist here
helpers.EventuallyResourceExists(func() (resources.Resource, error) {
👏
core.Status_Accepted,
gloo_defaults.GlooReporter,
)
// we need to make sure Gloo has had a chance to process it
nit: Does this assert that Gloo had a chance to process it? Wouldn't we want an accepted status to be used to prove that gloo processed it?
This is where it starts breaking down. You're right that it doesn't actually assert Gloo has processed it; in reality, what we really care about is that it has been added to the internal API snapshot.
Accepted status previously did that, but the problem we are solving is that it was reporting Accepted without doing translation in some cases, thus the flicker.
Now that we don't report Accepted naively, we can't use it as a signal that it is in the API snapshot.
All of the tests where I added this are written so that the Upstream is applied and exists in the snapshot before a proxy is generated, or really before a VirtualService is created.
So this is a cheap way of giving Gloo time for the Upstream to be created and added to the API snapshot.
We could have just done a sleep as well.
Unfortunately I don't see a good way around this if we have different syncers operating on/processing a single config, e.g. Upstreams.
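To illustrate the pattern the updated tests follow, here is a rough, self-contained sketch of an "eventually the resource exists" poll. It is an assumption-level stand-in for helpers.EventuallyResourceExists, which in the real test helpers presumably wraps Gomega-style polling.

```go
// Sketch only: a simplified polling helper illustrating the "wait for the
// resource to exist, not for an Accepted status" pattern the tests moved to.
package main

import (
	"errors"
	"fmt"
	"time"
)

type Resource interface{ GetName() string }

type Upstream struct{ Name string }

func (u *Upstream) GetName() string { return u.Name }

// eventuallyResourceExists polls the getter until it returns without error,
// giving the controller time to pick the resource up into its API snapshot.
func eventuallyResourceExists(get func() (Resource, error), timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if _, err := get(); err == nil {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("resource never appeared")
}

func main() {
	attempts := 0
	err := eventuallyResourceExists(func() (Resource, error) {
		attempts++
		if attempts < 3 { // pretend the Upstream shows up on the third poll
			return nil, errors.New("not found")
		}
		return &Upstream{Name: "db"}, nil
	}, time.Second, 10*time.Millisecond)
	fmt.Println("exists:", err == nil)
}
```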
@@ -715,6 +716,7 @@ func (s *ProxySyncer) translateProxy(
Namespace: kac.GetNamespace(),
}
gac.SetMetadata(&md)
gac.NamespacedStatuses = &kac.Status
do the other resource types below not need this?
// Only mark non-kube gateways as accepted
// Regardless, kube gw proxies are filtered out of these reports before reporting in translator_syncer.go
allReports.Accept(nonKubeProxies.AsInputResources()...)

// mark Upstream[Group]s as Accepted initially, but only if we have at least 1 edge proxy;
nit: seems frail to assume certain behavior based on the existence of an edge proxy, but this logic will probably be ripped out soon anyway so it's ok (side note: why don't the edge and kube gw translators each read/write their own statuses that don't overlap?)
seems frail to assume certain behavior based on the existence of an edge proxy
I agree with your statement, but I actually think it doesn't apply here. This check is effectively saying that this syncer is not going to mark a status as Accepted if it's not also going to translate that same resource. The presence of an edge proxy is just the mechanism it uses to know whether it will also do translation on that resource.
side note: why don't the edge and kube gw translators each read/write their own statuses that don't overlap?
This is probably what we need to do if this stays around long term. We have the namespaced status concept that could be applied to this, although it 1. is a bit of a misuse and 2. will tie us further into legacy solo-kit types.
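To make the gating concrete, here is a minimal sketch of the behavior being discussed, assuming simplified types (the real logic operates on report maps inside the edge translator_syncer):

```go
// Sketch only: "only mark Upstream[Group]s Accepted if this syncer will also
// translate them", using the presence of at least one edge proxy as the
// signal that translation will happen. Types are simplified stand-ins.
package main

import "fmt"

type InputResource struct {
	Name string
}

type Reports map[*InputResource]string

// accept records an initial Accepted status for the given resources.
func (r Reports) accept(resources ...*InputResource) {
	for _, res := range resources {
		r[res] = "Accepted"
	}
}

func syncStatuses(edgeProxies, upstreams []*InputResource) Reports {
	reports := Reports{}
	// Always accept the non-kube (edge) proxies; per the PR, kube gw proxies
	// are filtered out of these reports before they are written.
	reports.accept(edgeProxies...)

	// Only pre-accept Upstream[Group]s when at least one edge proxy exists,
	// i.e. when this syncer is actually going to translate them. Otherwise
	// another syncer may be reporting a Warning/Rejected status for the same
	// Upstream and the two writers would fight, causing the flicker.
	if len(edgeProxies) > 0 {
		reports.accept(upstreams...)
	}
	return reports
}

func main() {
	us := &InputResource{Name: "db"}
	noEdge := syncStatuses(nil, []*InputResource{us})
	withEdge := syncStatuses([]*InputResource{{Name: "gateway-proxy"}}, []*InputResource{us})
	fmt.Printf("no edge proxy: %q, with edge proxy: %q\n", noEdge[us], withEdge[us])
}
```

The key property is that a status is only written by a syncer that will also translate the resource, which removes the conflicting writer that caused the flicker.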
helpers.EventuallyResourceAccepted(func() (resources.InputResource, error) {
// Upstreams no longer report status if they have not been translated at all to avoid conflicting with
// other syncers that have translated them, so we can only detect that the objects exist here
helpers.EventuallyResourceExists(func() (resources.Resource, error) {
I don't think we can do anything about this right now, but just a note: do you think this will potentially cause more test flakes? Waiting for resource acceptance means the controller has seen the resource (and it's in the input snapshot); just waiting for the resource to exist on the cluster might not give the same guarantee?
Yes, I agree, it's not great.
But if our system is eventually consistent, needing to wait for something to be in the internal snapshot (completely opaque to users) is a problem and a smell anyway.
@sam-heilbron the steps to repro should probably have been in https://github.com/solo-io/solo-projects/issues/7243, but the gist of it is also captured in #10401. The high-level summary is:
Now the upstream is being translated by kube gw and reporting a warning correctly, but the edge syncer will be reporting Accepted, so they will fight and flicker. With the changes in this PR that no longer happens, as the Edge syncer won't report Accepted for a resource it hasn't translated.
Description
fix upstream status flicker
When only Kube GW proxies are present, we still rely on the edge translator_syncer for extension syncing. The edge translator will mark Upstreams & UpstreamGroups as Accepted, then perform xDS translation, where the status may be changed to e.g. Rejected if there is an error.
However, in the case where there are no edge proxies, translation doesn't actually occur, so any actual errors on the Upstream are never encountered, and thus the status is never set to Rejected. We end up in a scenario where the Kube GW syncer (correctly) reports a Rejected status while the Edge syncer reports Accepted, and they fight each other indefinitely.
This is fixed by no longer reporting Accepted on Upstream[Group]s unless they are also going to be translated. The result is that an Upstream may have an empty status if there is no proxy (either edge or kube gw) present, but this means that only accurate status from translation will be reported.
fix constant status updates
Additionally we noticed endless updates (and webhook hits) for resources tracked in krt.
The status reporter compares the desired status with the existing status in the solo-kit object to determine whether it should actually UPDATE the resource. The current proxy_syncer does a once-per-second status sync and relies on this status comparison being functional to prevent endless object UPDATEs. This commit fixes the solo-kit objects (really wrappers) in the krt collections to contain the status so an accurate comparison can take place.
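A minimal sketch of the comparison this relies on, with simplified stand-in types (the real reporter diffs solo-kit NamespacedStatuses before issuing a Kubernetes UPDATE):

```go
// Sketch only: once the krt wrappers carry the existing status, the periodic
// status sync can compare desired vs. existing and skip the UPDATE (and the
// webhook round trip) when nothing changed. Types are simplified stand-ins.
package main

import (
	"fmt"
	"reflect"
)

type NamespacedStatuses map[string]string

type Upstream struct {
	Name               string
	NamespacedStatuses NamespacedStatuses
	updates            int
}

// writeStatusIfChanged is the guard: if the wrapper never carries the existing
// status, the comparison always "differs" and the once-per-second sync issues
// endless UPDATEs.
func writeStatusIfChanged(u *Upstream, desired NamespacedStatuses) {
	if reflect.DeepEqual(u.NamespacedStatuses, desired) {
		return // no-op: status already up to date
	}
	u.NamespacedStatuses = desired
	u.updates++
}

func main() {
	u := &Upstream{Name: "db", NamespacedStatuses: NamespacedStatuses{"gloo-system": "Accepted"}}
	desired := NamespacedStatuses{"gloo-system": "Accepted"}

	// Simulate several ticks of the status sync loop: with an accurate
	// existing status on the wrapper, no UPDATE is ever issued.
	for i := 0; i < 5; i++ {
		writeStatusIfChanged(u, desired)
	}
	fmt.Println("updates issued:", u.updates) // 0
}
```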
test updates
Several tests relied on Accepted status for Upstreams when no translation occurred, so they were updated to not rely on status but just the existence of the resource in question.
glooctl
glooctl does NOT need to be updated as the "no status" case that reports an error is only a problem when the status is nil rather than just an empty status:
See gloo/projects/gloo/cli/pkg/cmd/check/root.go, lines 433 to 452 (at 89a427a).
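For illustration only, here is a hypothetical sketch of the nil-versus-empty distinction; the names below are not glooctl's real API, and the actual logic lives in the root.go lines referenced above.

```go
// Sketch only: the check treats a nil status ("no status object at all") as
// an error, whereas an empty-but-present status map is acceptable, which is
// why Upstreams left untouched by translation don't break glooctl check.
package main

import (
	"errors"
	"fmt"
)

type NamespacedStatuses map[string]string

type Upstream struct {
	Name     string
	Statuses *NamespacedStatuses
}

func checkUpstream(u *Upstream) error {
	if u.Statuses == nil {
		// Only a truly missing status is reported as an error.
		return errors.New(u.Name + ": no status found")
	}
	// An empty (but non-nil) set of statuses just means no syncer has
	// translated the Upstream yet; after this PR that is an expected state.
	return nil
}

func main() {
	empty := NamespacedStatuses{}
	fmt.Println(checkUpstream(&Upstream{Name: "a", Statuses: nil}))    // error
	fmt.Println(checkUpstream(&Upstream{Name: "b", Statuses: &empty})) // <nil>
}
```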
Code changes
- This changes the edge translator_syncer to no longer mark Upstream[Group]s as Accepted unless it will also perform translation.
- Keep and convert status from k8s objects when converting them to solo-kit types for krt collections.
- Update broken tests.
Checklist:
BOT NOTES:
resolves https://github.com/solo-io/solo-projects/issues/7243
resolves https://github.com/solo-io/solo-projects/issues/7257
resolves Upstreams incorrectly report Accepted when no translation has occurred #10401