fix upstream status flicker and constant status updates #10384

Merged
merged 23 commits into main from us-status-flicker on Nov 25, 2024

Conversation

@lgadban lgadban commented Nov 20, 2024

Description

fix upstream status flicker

When only Kube GW proxies are present, we still rely on the edge translator_syncer for extension syncing. The edge translator will mark Upstreams & UpstreamGroups as Accepted, then perform xDS translation, where the status may be changed to e.g. Rejected if there is an error.

However, in the case where there are no edge proxies, translation doesn't actually occur, so any actual errors on the Upstream are never encountered and thus the status is never set to Rejected. We end up in a scenario where the Kube GW syncer (correctly) reports a Rejected status while the edge syncer reports Accepted, and they fight each other indefinitely.

This is fixed by no longer reporting Accepted on Upstream[Group]s unless they are also going to be translated. The result is that an Upstream may have an empty status if there is no proxy (either edge or kube gw) present, but this means that only accurate status from translation will be reported.
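
As a minimal Go sketch of the new behavior (the types and function names here are hypothetical stand-ins, not the actual syncer code):

package main

import "fmt"

// Hypothetical, simplified stand-ins for the real resources.
type Upstream struct{ Name string }
type Proxy struct{ Name string }

// initialUpstreamStatuses sketches the fix: Upstream[Group]s are only marked
// Accepted when this syncer will also translate them, i.e. when at least one
// edge proxy exists. Otherwise no status is reported, leaving it empty so the
// kube GW syncer's (accurate) status is not overwritten.
func initialUpstreamStatuses(upstreams []Upstream, edgeProxies []Proxy) map[string]string {
    statuses := map[string]string{}
    if len(edgeProxies) == 0 {
        // No edge proxies means no edge translation, so report nothing.
        return statuses
    }
    for _, us := range upstreams {
        // May later be overwritten with Rejected if translation hits an error.
        statuses[us.Name] = "Accepted"
    }
    return statuses
}

func main() {
    fmt.Println(initialUpstreamStatuses([]Upstream{{Name: "us-1"}}, nil))                       // map[]
    fmt.Println(initialUpstreamStatuses([]Upstream{{Name: "us-1"}}, []Proxy{{Name: "edge-1"}})) // map[us-1:Accepted]
}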

fix constant status updates

Additionally, we noticed endless updates (and webhook hits) for resources tracked in krt.

The status reporter compares the desired status with the existing status in the solo-kit object to determine whether it should actually UPDATE the resource.

The current proxy_syncer does a status sync once per second and relies on this comparison being functional to prevent endless object UPDATEs.

This commit fixes the solo-kit objects (really wrappers) in the krt collections to contain the status so that an accurate comparison can take place.
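
The shape of that comparison, as a minimal hypothetical Go sketch (simplified types, not the real solo-kit reporter):

package main

import (
    "fmt"
    "reflect"
)

// Hypothetical, simplified status type standing in for solo-kit's NamespacedStatuses.
type Status struct {
    State  string
    Reason string
}

// shouldUpdate sketches the reporter's guard: only issue an UPDATE (and hit the
// webhook) when the desired status differs from the status already on the object.
// If the krt wrapper never carries the existing status, this comparison always
// reports a difference and the once-per-second sync UPDATEs the resource forever.
func shouldUpdate(existing, desired *Status) bool {
    return !reflect.DeepEqual(existing, desired)
}

func main() {
    current := &Status{State: "Accepted"}
    fmt.Println(shouldUpdate(current, &Status{State: "Accepted"})) // false: no UPDATE needed
    fmt.Println(shouldUpdate(nil, &Status{State: "Accepted"}))     // true: a missing status forces an UPDATE every sync
}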

test updates

Several tests relied on an Accepted status for Upstreams when no translation occurred, so they were updated to rely not on status but on the existence of the resource in question.

glooctl

glooctl does NOT need to be updated, as the "no status" case that reports an error is only a problem when the status is nil, rather than just an empty status:

for _, upstream := range upstreams {
    if upstream.GetNamespacedStatuses() != nil {
        namespacedStatuses := upstream.GetNamespacedStatuses()
        for reporter, status := range namespacedStatuses.GetStatuses() {
            switch status.GetState() {
            case core.Status_Rejected:
                errMessage := fmt.Sprintf("Found rejected upstream by '%s': %s ", reporter, renderMetadata(upstream.GetMetadata()))
                errMessage += fmt.Sprintf("(Reason: %s)", status.GetReason())
                multiErr = multierror.Append(multiErr, errors.New(errMessage))
            case core.Status_Warning:
                errMessage := fmt.Sprintf("Found upstream with warnings by '%s': %s ", reporter, renderMetadata(upstream.GetMetadata()))
                errMessage += fmt.Sprintf("(Reason: %s)", status.GetReason())
                multiErr = multierror.Append(multiErr, errors.New(errMessage))
            }
        }
    } else {
        // Only a nil NamespacedStatuses (not merely an empty one) produces the "no status" error.
        errMessage := fmt.Sprintf("Found upstream with no status: %s\n", renderMetadata(upstream.GetMetadata()))
        multiErr = multierror.Append(multiErr, errors.New(errMessage))
    }
    knownUpstreams = append(knownUpstreams, renderMetadata(upstream.GetMetadata()))
}

Code changes

This changes the edge translator_syncer to no longer mark Upstream[Group]s as Accepted unless it will also perform translation.

Keep and convert status from k8s objects when converting them to solo-kit types for krt collections (a sketch of this conversion follows the list)

Update broken tests
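
A sketch of the conversion mentioned in the second item above (hypothetical, simplified types; the real change assigns NamespacedStatuses on the solo-kit wrapper, as shown in the diffs further down):

package main

import "fmt"

// Hypothetical stand-ins for a Kubernetes Upstream and its solo-kit wrapper.
type KubeUpstream struct {
    Name, Namespace string
    Status          map[string]string // existing namespaced statuses on the k8s object
}

type SoloKitUpstream struct {
    Name, Namespace    string
    NamespacedStatuses *map[string]string
}

// convert carries the existing status onto the wrapper so the status reporter's
// existing-vs-desired comparison sees the real current status instead of nil.
func convert(u *KubeUpstream) *SoloKitUpstream {
    return &SoloKitUpstream{
        Name:               u.Name,
        Namespace:          u.Namespace,
        NamespacedStatuses: &u.Status, // the fix: keep the status during conversion
    }
}

func main() {
    ku := &KubeUpstream{Name: "us-1", Namespace: "gloo-system", Status: map[string]string{"gloo-system": "Accepted"}}
    fmt.Println(*convert(ku).NamespacedStatuses) // map[gloo-system:Accepted]
}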

// Only mark non-kube gateways as accepted
// Regardless, kube gw proxies are filtered out of these reports before reporting in translator_syncer.go
allReports.Accept(nonKubeProxies.AsInputResources()...)

// report Upstream[Group]s as Accepted initially but only if we at least 1 edge proxy

Suggested change
// report Upstream[Group]s as Accepted initially but only if we at least 1 edge proxy
// report Upstream[Group]s as Accepted initially but only if we have at least 1 edge proxy

github-actions bot commented Nov 22, 2024

Visit the preview URL for this PR (updated for commit 0f2a5f6):

https://gloo-edge--pr10384-us-status-flicker-8uiiree0.web.app

@lgadban lgadban enabled auto-merge (squash) November 22, 2024 21:42
@lgadban lgadban changed the title fix upstream status flicker fix upstream status flicker and constant status updates Nov 22, 2024

@sam-heilbron sam-heilbron left a comment

The code looks great! What are the ways that I could test this manually to prove that the flickering no longer happens? Tangentially, does this expose a gap in our testing where we could use internal metrics or other signals to identify not just whether gloo performed what it expected to do, but also whether it incurred any out-of-the-ordinary operations? (I'm thinking of something that analyzes metrics for the gloo pod after a run and either asserts some expected values of webhook updates or other signals, or even just something that reports the values so we can see over time how they change across versions of gloo.)

@@ -371,6 +371,7 @@ func (s *ProxySyncer) Init(ctx context.Context, dbg *krt.DebugHandler) error {
Namespace: u.GetNamespace(),
}
glooUs.SetMetadata(&md)
glooUs.NamespacedStatuses = &u.Status

Just so I understand, is this block of code intended to do something similar to https://github.com/solo-io/solo-kit/blob/main/pkg/api/v1/clients/kube/resource_client.go#L477 where we are converting the kubeUpstream?

Why do we just do this for Upstreams and not for all other gloo resources?

@lgadban lgadban Nov 25, 2024

Yes, exactly: it is a non-solo-kit-bound way of doing the conversion from the kube type to the solo-kit type.

Why do we just do this for Upstreams and not for all other gloo resources?

We do it for the only other gloo type, AuthConfigs, below at https://github.com/solo-io/gloo/pull/10384/files#diff-c658efd653154e7eb520a5ea6c646cdd45af7ed0ac56224360affdf481a562c4R719

Other than that, RLCs are a skv2 type and don't have the same challenges

helpers.EventuallyResourceAccepted(func() (resources.InputResource, error) {
// Upstreams no longer report status if they have not been translated at all to avoid conflicting with
// other syncers that have translated them, so we can only detect that the objects exist here
helpers.EventuallyResourceExists(func() (resources.Resource, error) {

👏

core.Status_Accepted,
gloo_defaults.GlooReporter,
)
// we need to make sure Gloo has had a chance to process it

nit: Does this assert that Gloo had a chance to process it? Wouldn't we want an accepted status to be used to prove that gloo processed it?

This is where it starts breaking down. You're right that it doesn't actually assert that Gloo has processed it; in reality, what we really care about is that it has been added to the internal API snapshot.

Accepted status previously did that, but the problem we are solving is that it was reporting Accepted without doing translation in some cases, thus the flicker.

Now that we don't report Accepted naively, we can't use it as a signal that the resource is in the API snapshot.

All of the tests where I added this are written so that the Upstream is applied and exists in the snapshot before a proxy is generated, or really before a VirtualService is created.

So this is a cheap way of giving Gloo time for the Upstream to be created and added to the API snapshot. We could have just done a sleep as well.

Unfortunately I don't see a good way around this if we have different syncers operating on/processing a single config, e.g. Upstreams.
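
As a hypothetical sketch of the ordering these tests rely on (a generic polling helper, not the real helpers package):

package main

import (
    "errors"
    "fmt"
    "sync/atomic"
    "time"
)

// eventually is a generic stand-in for the test helpers: poll until the check
// passes or the timeout expires.
func eventually(timeout time.Duration, check func() error) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        if err := check(); err == nil {
            return nil
        }
        time.Sleep(100 * time.Millisecond)
    }
    return errors.New("condition not met before timeout")
}

func main() {
    var upstreamExists atomic.Bool
    // Simulate the Upstream being applied and picked up shortly after the test starts.
    go func() { time.Sleep(300 * time.Millisecond); upstreamExists.Store(true) }()

    // 1. Wait for the Upstream to exist: a cheap proxy for "it has had time to
    //    land in the internal API snapshot".
    if err := eventually(2*time.Second, func() error {
        if !upstreamExists.Load() {
            return errors.New("upstream not found yet")
        }
        return nil
    }); err != nil {
        panic(err)
    }

    // 2. Only then create the VirtualService / proxy and assert an Accepted status
    //    produced by actual translation (elided here).
    fmt.Println("upstream exists; safe to create the VirtualService and assert status")
}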

@@ -715,6 +716,7 @@ func (s *ProxySyncer) translateProxy(
Namespace: kac.GetNamespace(),
}
gac.SetMetadata(&md)
gac.NamespacedStatuses = &kac.Status

do the other resource types below not need this?

// Only mark non-kube gateways as accepted
// Regardless, kube gw proxies are filtered out of these reports before reporting in translator_syncer.go
allReports.Accept(nonKubeProxies.AsInputResources()...)

// mark Upstream[Group]s as Accepted initially, but only if we have at least 1 edge proxy;

nit: seems frail to assume certain behavior based on the existence of an edge proxy, but this logic will probably be ripped out soon anyway so it's ok (side note: why don't the edge and kube gw translators each read/write their own statuses that don't overlap?)

seems frail to assume certain behavior based on the existence of an edge proxy

I agree with your statement but I actually think it doesn't apply here. This check is effectively saying that this syncer is not going to mark status as Accepted if I'm not also going to translate that same resource. The presence of an edge proxy is just the mechanism it uses to know if it will also do translation on that resource.

side note: why don't the edge and kube gw translators each read/write their own statuses that don't overlap?

This is probably what we need to do if this stays around long term. We have the namespaced status concept that could be applied to this, although it 1. is a bit of a misuse and 2. will tie us further into legacy solo-kit types.
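
For illustration, a hypothetical sketch of that "non-overlapping statuses" idea, using a simplified reporter-keyed map rather than the real core.NamespacedStatuses (which, as noted, would be a bit of a misuse):

package main

import "fmt"

// Hypothetical simplified structure: each syncer writes its status under its
// own reporter key, so the edge and kube GW translators never overwrite each other.
type PerReporterStatuses struct {
    Statuses map[string]string // reporter -> state
}

func (s *PerReporterStatuses) Write(reporter, state string) {
    if s.Statuses == nil {
        s.Statuses = map[string]string{}
    }
    s.Statuses[reporter] = state
}

func main() {
    us := &PerReporterStatuses{}
    us.Write("edge-translator", "Accepted")    // the edge syncer's view
    us.Write("kube-gw-translator", "Rejected") // the kube GW syncer's view
    fmt.Println(us.Statuses)                   // both coexist, so there is nothing to flicker between
}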

helpers.EventuallyResourceAccepted(func() (resources.InputResource, error) {
// Upstreams no longer report status if they have not been translated at all to avoid conflicting with
// other syncers that have translated them, so we can only detect that the objects exist here
helpers.EventuallyResourceExists(func() (resources.Resource, error) {

I don't think we can do anything about this right now, but just a note: do you think this will potentially cause more test flakes? Waiting for resource acceptance means the controller has seen the resource (and it's in the input snapshot); just waiting for the resource to exist on the cluster might not give the same guarantee?

Yes, I agree, it's not great.

But if our system is eventually consistent, needing to wait for something to be in the internal snapshot (which is completely opaque to users) is a problem and a smell anyway.

lgadban commented Nov 25, 2024

What are the ways that I could test this manually to prove that the flickering no longer happens?

@sam-heilbron the steps to repro should probably have been in https://github.com/solo-io/solo-projects/issues/7243 but the gist of it is also captured in #10401

The high-level summary is:

Now the upstream is being translated by the kube GW syncer and is correctly reporting a warning, but the edge syncer will be reporting Accepted, so they will fight and flicker.

With the changes in this PR, that no longer happens, as the edge syncer won't report Accepted for a resource it hasn't translated.

@lgadban lgadban merged commit 7af653c into main Nov 25, 2024
19 checks passed
@lgadban lgadban deleted the us-status-flicker branch November 25, 2024 15:12

Successfully merging this pull request may close these issues.

Upstreams incorrectly report Accepted when no translation has occurred