fix upstream status flicker and constant status updates #10384

Merged
merged 23 commits into main from us-status-flicker on Nov 25, 2024

Conversation

@lgadban lgadban commented Nov 20, 2024

Description

fix upstream status flicker

When only Kube GW proxies are present, we still rely on the edge translator_syncer for extension syncing. The edge translator will mark Upstreams & UpstreamGroups as Accepted, then perform xDS translation, where the status may be changed to e.g. Rejected if there is an error.

However, in the case where there are no edge proxies, translation doesn't actually occur, so any actual errors on the Upstream are never encountered and thus the status is never set to Rejected. We end up in a scenario where the Kube GW syncer (correctly) reports a Rejected status while the edge syncer reports Accepted, and they fight each other indefinitely.

This is fixed by no longer reporting Accepted on Upstream[Group]s unless they are also going to be translated. The result is that an Upstream may have an empty status if there is no proxy (either edge or kube gw) present, but this means that only accurate status from translation will be reported.
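
As a minimal Go sketch of the new behavior (the types and function names here are hypothetical stand-ins, not the actual syncer code):

package main

import "fmt"

// Hypothetical, simplified stand-ins for the real resources.
type Upstream struct{ Name string }
type Proxy struct{ Name string }

// initialUpstreamStatuses sketches the fix: Upstream[Group]s are only marked
// Accepted when this syncer will also translate them, i.e. when at least one
// edge proxy exists. Otherwise no status is reported, leaving it empty so the
// kube GW syncer's (accurate) status is not overwritten.
func initialUpstreamStatuses(upstreams []Upstream, edgeProxies []Proxy) map[string]string {
    statuses := map[string]string{}
    if len(edgeProxies) == 0 {
        // No edge proxies means no edge translation, so report nothing.
        return statuses
    }
    for _, us := range upstreams {
        // May later be overwritten with Rejected if translation hits an error.
        statuses[us.Name] = "Accepted"
    }
    return statuses
}

func main() {
    fmt.Println(initialUpstreamStatuses([]Upstream{{Name: "us-1"}}, nil))                       // map[]
    fmt.Println(initialUpstreamStatuses([]Upstream{{Name: "us-1"}}, []Proxy{{Name: "edge-1"}})) // map[us-1:Accepted]
}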

fix constant status updates

Additionally, we noticed endless updates (and webhook hits) for resources tracked in krt.

The status reporter compares the desired status with the existing status in the solo-kit object to determine whether it should actually UPDATE the resource.

The current proxy_syncer does a status sync once per second and relies on this comparison being functional to prevent endless object UPDATEs.

This commit fixes the solo-kit objects (really wrappers) in the krt collections to contain the status so that an accurate comparison can take place.
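
The shape of that comparison, as a minimal hypothetical Go sketch (simplified types, not the real solo-kit reporter):

package main

import (
    "fmt"
    "reflect"
)

// Hypothetical, simplified status type standing in for solo-kit's NamespacedStatuses.
type Status struct {
    State  string
    Reason string
}

// shouldUpdate sketches the reporter's guard: only issue an UPDATE (and hit the
// webhook) when the desired status differs from the status already on the object.
// If the krt wrapper never carries the existing status, this comparison always
// reports a difference and the once-per-second sync UPDATEs the resource forever.
func shouldUpdate(existing, desired *Status) bool {
    return !reflect.DeepEqual(existing, desired)
}

func main() {
    current := &Status{State: "Accepted"}
    fmt.Println(shouldUpdate(current, &Status{State: "Accepted"})) // false: no UPDATE needed
    fmt.Println(shouldUpdate(nil, &Status{State: "Accepted"}))     // true: a missing status forces an UPDATE every sync
}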

test updates

Several tests relied on an Accepted status for Upstreams when no translation occurred, so they were updated to rely not on status but on the existence of the resource in question.

glooctl

glooctl does NOT need to be updated, as the "no status" case that reports an error is only a problem when the status is nil, rather than just an empty status:

for _, upstream := range upstreams {
    if upstream.GetNamespacedStatuses() != nil {
        namespacedStatuses := upstream.GetNamespacedStatuses()
        for reporter, status := range namespacedStatuses.GetStatuses() {
            switch status.GetState() {
            case core.Status_Rejected:
                errMessage := fmt.Sprintf("Found rejected upstream by '%s': %s ", reporter, renderMetadata(upstream.GetMetadata()))
                errMessage += fmt.Sprintf("(Reason: %s)", status.GetReason())
                multiErr = multierror.Append(multiErr, errors.New(errMessage))
            case core.Status_Warning:
                errMessage := fmt.Sprintf("Found upstream with warnings by '%s': %s ", reporter, renderMetadata(upstream.GetMetadata()))
                errMessage += fmt.Sprintf("(Reason: %s)", status.GetReason())
                multiErr = multierror.Append(multiErr, errors.New(errMessage))
            }
        }
    } else {
        // Only a nil NamespacedStatuses (not merely an empty one) produces the "no status" error.
        errMessage := fmt.Sprintf("Found upstream with no status: %s\n", renderMetadata(upstream.GetMetadata()))
        multiErr = multierror.Append(multiErr, errors.New(errMessage))
    }
    knownUpstreams = append(knownUpstreams, renderMetadata(upstream.GetMetadata()))
}

Code changes

This changes the edge translator_syncer to no longer mark Upstream[Group]s as Accepted unless it will also perform translation.

Keep and convert status from k8s objects when converting them to solo-kit types for krt collections (a sketch of this conversion follows the list)

Update broken tests
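
A sketch of the conversion mentioned in the second item above (hypothetical, simplified types; the real change assigns NamespacedStatuses on the solo-kit wrapper, as shown in the diffs further down):

package main

import "fmt"

// Hypothetical stand-ins for a Kubernetes Upstream and its solo-kit wrapper.
type KubeUpstream struct {
    Name, Namespace string
    Status          map[string]string // existing namespaced statuses on the k8s object
}

type SoloKitUpstream struct {
    Name, Namespace    string
    NamespacedStatuses *map[string]string
}

// convert carries the existing status onto the wrapper so the status reporter's
// existing-vs-desired comparison sees the real current status instead of nil.
func convert(u *KubeUpstream) *SoloKitUpstream {
    return &SoloKitUpstream{
        Name:               u.Name,
        Namespace:          u.Namespace,
        NamespacedStatuses: &u.Status, // the fix: keep the status during conversion
    }
}

func main() {
    ku := &KubeUpstream{Name: "us-1", Namespace: "gloo-system", Status: map[string]string{"gloo-system": "Accepted"}}
    fmt.Println(*convert(ku).NamespacedStatuses) // map[gloo-system:Accepted]
}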

// Only mark non-kube gateways as accepted
// Regardless, kube gw proxies are filtered out of these reports before reporting in translator_syncer.go
allReports.Accept(nonKubeProxies.AsInputResources()...)

// report Upstream[Group]s as Accepted initially but only if we at least 1 edge proxy

Suggested change
// report Upstream[Group]s as Accepted initially but only if we at least 1 edge proxy
// report Upstream[Group]s as Accepted initially but only if we have at least 1 edge proxy

github-actions bot commented Nov 22, 2024

Visit the preview URL for this PR (updated for commit 0f2a5f6):

https://gloo-edge--pr10384-us-status-flicker-8uiiree0.web.app

@lgadban lgadban enabled auto-merge (squash) November 22, 2024 21:42
@lgadban lgadban changed the title fix upstream status flicker fix upstream status flicker and constant status updates Nov 22, 2024

@sam-heilbron sam-heilbron left a comment

The code looks great! What are the ways that I could test this manually to prove that the flickering no longer happens? Tangentially, does this expose a gap in our testing where we could use internal metrics or other signals to identify not just whether gloo performed what it expected to do, but also whether it incurred any out-of-the-ordinary operations? (I'm thinking of something that analyzes metrics for the gloo pod after a run and either asserts some expected values of webhook updates or other signals, or even just something that reports the values so we can see over time how they change across versions of gloo.)

@@ -371,6 +371,7 @@ func (s *ProxySyncer) Init(ctx context.Context, dbg *krt.DebugHandler) error {
Namespace: u.GetNamespace(),
}
glooUs.SetMetadata(&md)
glooUs.NamespacedStatuses = &u.Status

Just so I understand, is this block of code intended to do something similar to https://github.com/solo-io/solo-kit/blob/main/pkg/api/v1/clients/kube/resource_client.go#L477 where we are converting the kubeUpstream?

Why do we just do this for Upstreams and not for all other gloo resources?

@lgadban lgadban Nov 25, 2024

Yes, exactly: it is a non-solo-kit-bound way of doing the conversion from the kube type to the solo-kit type.

Why do we just do this for Upstreams and not for all other gloo resources?

We do it for the only other gloo type, AuthConfigs, below at https://github.com/solo-io/gloo/pull/10384/files#diff-c658efd653154e7eb520a5ea6c646cdd45af7ed0ac56224360affdf481a562c4R719

Other than that, RLCs are a skv2 type and don't have the same challenges

helpers.EventuallyResourceAccepted(func() (resources.InputResource, error) {
// Upstreams no longer report status if they have not been translated at all to avoid conflicting with
// other syncers that have translated them, so we can only detect that the objects exist here
helpers.EventuallyResourceExists(func() (resources.Resource, error) {

👏

core.Status_Accepted,
gloo_defaults.GlooReporter,
)
// we need to make sure Gloo has had a chance to process it

nit: Does this assert that Gloo had a chance to process it? Wouldn't we want an accepted status to be used to prove that gloo processed it?

This is where it starts breaking down. You're right that it doesn't actually assert that Gloo has processed it; in reality, what we really care about is that it has been added to the internal API snapshot.

Accepted status previously did that, but the problem we are solving is that it was reporting Accepted without doing translation in some cases, thus the flicker.

Now that we don't report Accepted naively, we can't use it as a signal that the resource is in the API snapshot.

All of the tests where I added this are written so that the Upstream is applied and exists in the snapshot before a proxy is generated, or really before a VirtualService is created.

So this is a cheap way of giving Gloo time for the Upstream to be created and added to the API snapshot. We could have just done a sleep as well.

Unfortunately I don't see a good way around this if we have different syncers operating on/processing a single config, e.g. Upstreams.
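
As a hypothetical sketch of the ordering these tests rely on (a generic polling helper, not the real helpers package):

package main

import (
    "errors"
    "fmt"
    "sync/atomic"
    "time"
)

// eventually is a generic stand-in for the test helpers: poll until the check
// passes or the timeout expires.
func eventually(timeout time.Duration, check func() error) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        if err := check(); err == nil {
            return nil
        }
        time.Sleep(100 * time.Millisecond)
    }
    return errors.New("condition not met before timeout")
}

func main() {
    var upstreamExists atomic.Bool
    // Simulate the Upstream being applied and picked up shortly after the test starts.
    go func() { time.Sleep(300 * time.Millisecond); upstreamExists.Store(true) }()

    // 1. Wait for the Upstream to exist: a cheap proxy for "it has had time to
    //    land in the internal API snapshot".
    if err := eventually(2*time.Second, func() error {
        if !upstreamExists.Load() {
            return errors.New("upstream not found yet")
        }
        return nil
    }); err != nil {
        panic(err)
    }

    // 2. Only then create the VirtualService / proxy and assert an Accepted status
    //    produced by actual translation (elided here).
    fmt.Println("upstream exists; safe to create the VirtualService and assert status")
}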

@@ -715,6 +716,7 @@ func (s *ProxySyncer) translateProxy(
Namespace: kac.GetNamespace(),
}
gac.SetMetadata(&md)
gac.NamespacedStatuses = &kac.Status

do the other resource types below not need this?

// Only mark non-kube gateways as accepted
// Regardless, kube gw proxies are filtered out of these reports before reporting in translator_syncer.go
allReports.Accept(nonKubeProxies.AsInputResources()...)

// mark Upstream[Group]s as Accepted initially, but only if we have at least 1 edge proxy;

nit: seems frail to assume certain behavior based on the existence of an edge proxy, but this logic will probably be ripped out soon anyway so it's ok (side note: why don't the edge and kube gw translators each read/write their own statuses that don't overlap?)

seems frail to assume certain behavior based on the existence of an edge proxy

I agree with your statement but I actually think it doesn't apply here. This check is effectively saying that this syncer is not going to mark status as Accepted if I'm not also going to translate that same resource. The presence of an edge proxy is just the mechanism it uses to know if it will also do translation on that resource.

side note: why don't the edge and kube gw translators each read/write their own statuses that don't overlap?

This is probably what we need to do if this stays around long term. We have the namespaced status concept that could be applied to this, although it 1. is a bit of a misuse and 2. will tie us further into legacy solo-kit types.
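
For illustration, a hypothetical sketch of that "non-overlapping statuses" idea, using a simplified reporter-keyed map rather than the real core.NamespacedStatuses (which, as noted, would be a bit of a misuse):

package main

import "fmt"

// Hypothetical simplified structure: each syncer writes its status under its
// own reporter key, so the edge and kube GW translators never overwrite each other.
type PerReporterStatuses struct {
    Statuses map[string]string // reporter -> state
}

func (s *PerReporterStatuses) Write(reporter, state string) {
    if s.Statuses == nil {
        s.Statuses = map[string]string{}
    }
    s.Statuses[reporter] = state
}

func main() {
    us := &PerReporterStatuses{}
    us.Write("edge-translator", "Accepted")    // the edge syncer's view
    us.Write("kube-gw-translator", "Rejected") // the kube GW syncer's view
    fmt.Println(us.Statuses)                   // both coexist, so there is nothing to flicker between
}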

helpers.EventuallyResourceAccepted(func() (resources.InputResource, error) {
// Upstreams no longer report status if they have not been translated at all to avoid conflicting with
// other syncers that have translated them, so we can only detect that the objects exist here
helpers.EventuallyResourceExists(func() (resources.Resource, error) {

I don't think we can do anything about this right now, but just a note: do you think this will potentially cause more test flakes? Waiting for resource acceptance means the controller has seen the resource (and it's in the input snapshot); just waiting for the resource to exist on the cluster might not give the same guarantee?

Yes, I agree, it's not great.

But if our system is eventually consistent, needing to wait for something to be in the internal snapshot (which is completely opaque to users) is a problem and a smell anyway.

lgadban commented Nov 25, 2024

What are the ways that I could test this manually to prove that the flickering no longer happens?

@sam-heilbron the steps to repro should probably have been in https://github.com/solo-io/solo-projects/issues/7243 but the gist of it is also captured in #10401

The high-level summary is:

Now the upstream is being translated by the kube GW syncer and is correctly reporting a warning, but the edge syncer will be reporting Accepted, so they will fight and flicker.

With the changes in this PR, that no longer happens, as the edge syncer won't report Accepted for a resource it hasn't translated.

@lgadban lgadban merged commit 7af653c into main Nov 25, 2024
19 checks passed
@lgadban lgadban deleted the us-status-flicker branch November 25, 2024 15:12

Successfully merging this pull request may close these issues.

Upstreams incorrectly report Accepted when no translation has occurred