Collect Gloo metrics and some snapshots on test failure #10400

Open
wants to merge 4 commits into
base: main
@@ -0,0 +1,9 @@
changelog:
- type: NON_USER_FACING
description: >-
Gloo Gateway controller metrics and xds/krt snapshots are now collected and included in
the test failure artifacts.
After encountering some test failures that proved difficult to debug without knowing more
about the state of the cluster, we have added additional artifacts to be collected when
a test fails.
This will help us to more easily diagnose the cause of test failures.
@@ -225,6 +225,8 @@ func (s *testingSuite) TestConfigureVirtualHostOptionsWithSectionNameManualSetup
[]string{"conflict with more specific or older VirtualHostOptions"},
defaults.KubeGatewayReporter,
)

s.Assert().Equal(true, false, "intentionally failing to trigger dump, remove when done debugging")
Author


I will be removing this after I have an artifact example to link.

}

// The goal here is to test the behavior when multiple VHOs are targeting a gateway without sectionName. The expected
54 changes: 54 additions & 0 deletions test/kubernetes/e2e/test.go
@@ -1,6 +1,7 @@
package e2e

import (
"bytes"
"context"
"errors"
"fmt"
@@ -314,6 +315,59 @@ func (i *TestInstallation) PreFailHandler(ctx context.Context) {
kubectlGetResourcesCmd := i.Actions.Kubectl().Command(ctx, "get", strings.Join(resourcesToGet, ","), "-A", "-owide")
_ = kubectlGetResourcesCmd.WithStdout(clusterStateFile).WithStderr(clusterStateFile).Run()
clusterStateFile.WriteString("\n")

podStdOut := bytes.NewBuffer(nil)
podStdErr := bytes.NewBuffer(nil)

// Fetch the name of the Gloo Gateway controller pod
getGlooPodNameCmd := i.Actions.Kubectl().Command(ctx, "get", "pod", "-n", i.Metadata.InstallNamespace,
"--selector", "gloo=gloo", "--output", "jsonpath='{.items[0].metadata.name}'")
_ = getGlooPodNameCmd.WithStdout(podStdOut).WithStderr(podStdErr).Run()

// Clean up and check the output
glooPodName := strings.Trim(podStdOut.String(), "'")
if glooPodName == "" {
fmt.Printf("Failed to get the name of the Gloo Gateway controller pod: %s\n", podStdErr.String())
return
}
Comment on lines +322 to +332
Author


Happy to make this its own method if we think we will need it in more than this one case.
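
If the lookup were pulled into its own method, one possible shape is sketched below. This is only an illustration, not code from the PR: `commandRunner`, `glooPodName`, the fake runner, the `gloo-system` namespace, and the pod name are all hypothetical. Only the kubectl arguments are taken from the diff above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// commandRunner is a hypothetical stand-in for whatever executes kubectl in the
// test framework. It returns the captured stdout and stderr of the command.
type commandRunner func(args ...string) (stdout string, stderr string, err error)

// glooPodName returns the name of the first pod labeled gloo=gloo in the given
// namespace, using the same kubectl arguments as the diff above.
func glooPodName(run commandRunner, namespace string) (string, error) {
	stdout, stderr, err := run("get", "pod", "-n", namespace,
		"--selector", "gloo=gloo",
		"--output", "jsonpath='{.items[0].metadata.name}'")
	if err != nil {
		return "", fmt.Errorf("kubectl get pod failed: %w (stderr: %s)", err, stderr)
	}
	// The jsonpath expression is wrapped in single quotes, so strip them.
	name := strings.Trim(strings.TrimSpace(stdout), "'")
	if name == "" {
		return "", errors.New("no Gloo Gateway controller pod found: " + stderr)
	}
	return name, nil
}

func main() {
	// Fake runner for illustration only; a real caller would shell out to kubectl.
	fake := func(args ...string) (string, string, error) {
		return "'gloo-abc123'", "", nil
	}
	name, err := glooPodName(fake, "gloo-system")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(name)
}
```

Returning an error instead of printing and returning early would let the caller decide whether to skip only the controller dumps or the rest of the failure handler.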


// Get the metrics from the Gloo Gateway controller pod and write them to a file
metricsFilePath := filepath.Join(failureDir, "metrics.log")
metricsFile, err := os.OpenFile(metricsFilePath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, os.ModePerm)
i.Assertions.Require.NoError(err)

// Using an ephemeral debug container, fetch the metrics from the Gloo Gateway controller
metricsCmd := i.Actions.Kubectl().Command(ctx, "debug", "-n", i.Metadata.InstallNamespace,
"-it", "--image=curlimages/curl:7.83.1", glooPodName, "--",
"curl", "http://localhost:9091/metrics")
_ = metricsCmd.WithStdout(metricsFile).WithStderr(metricsFile).Run()
metricsFile.Close()

// Get krt snapshot from the Gloo Gateway controller pod and write it to a file
krtSnapshotFilePath := filepath.Join(failureDir, "krt_snapshot.log")
krtSnapshotFile, err := os.OpenFile(krtSnapshotFilePath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, os.ModePerm)
i.Assertions.Require.NoError(err)

// Using an ephemeral debug container, fetch the krt snapshot from the Gloo Gateway controller
krtSnapshotCmd := i.Actions.Kubectl().Command(ctx, "debug", "-n", i.Metadata.InstallNamespace,
"-it", "--image=curlimages/curl:7.83.1", glooPodName, "--",
"curl", "http://localhost:9095/snapshots/krt")
_ = krtSnapshotCmd.WithStdout(krtSnapshotFile).WithStderr(krtSnapshotFile).Run()
krtSnapshotFile.Close()
Comment on lines +339 to +356
Collaborator


While these make sense, it would be ideal if we could add them to the main admin gloo endpoint tooling, as it's a pretty general case. I know it's annoying to do it on a first pass (i.e. for one case), but I'm pretty sure I would love to have it in all cases.

https://github.com/solo-io/gloo/blob/main/test/helpers/kube_dump.go#L46 is an example of our end dump and it includes hitting the stats endpoint like https://github.com/solo-io/gloo/blob/main/test/helpers/kube_dump.go#L311 which would be a nice setup that we can add to these types of tests


// Get xds snapshot from the Gloo Gateway controller pod and write it to a file
xdsSnapshotFilePath := filepath.Join(failureDir, "xds_snapshot.log")
xdsSnapshotFile, err := os.OpenFile(xdsSnapshotFilePath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, os.ModePerm)
i.Assertions.Require.NoError(err)

// Using an ephemeral debug container, fetch the xds snapshot from the Gloo Gateway controller
xdsSnapshotCmd := i.Actions.Kubectl().Command(ctx, "debug", "-n", i.Metadata.InstallNamespace,
"-it", "--image=curlimages/curl:7.83.1", glooPodName, "--",
"curl", "http://localhost:9095/snapshots/xds")
_ = xdsSnapshotCmd.WithStdout(xdsSnapshotFile).WithStderr(xdsSnapshotFile).Run()
xdsSnapshotFile.Close()

fmt.Printf("Test failed. Logs and cluster state are available in %s\n", failureDir)
}

// GeneratedFiles is a collection of files that are generated during the execution of a set of tests
@@ -74,6 +74,8 @@ gloo:
limits:
cpu: 1000m
memory: 10Gi
stats:
enabled: true # enable stats server for gloo so we can collect the metrics in CI

# Configuration for the statically deployed gateway-proxy that ships by default with Gloo Gateway
gatewayProxies: