Add jaeger tracing #967

djwhatle · 2021-02-22T20:44:42Z

Todo

Add tracing to remaining controllers that are part of migration process (dim, dism, dvmp)
Hide tracing behind an env var flag since it's not needed all the time, and may have perf impact
Make sure that spans terminate when migration ends
Evaluate if sampling frequency is OK as-is (decided this isn't a problem for now since only use-case will be for dev perf evaluation, not enabling by default in user clusters)

Follow-on PR:

Figure out how to deploy a jaeger collector inside OCP 4 cluster, see if there's an official way to do this without us needing to package

To test:

Start the jaeger collector:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 14250:14250 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.22

Navigate to http://localhost:16686

Run mig-controller with jaeger tracing enabled

JAEGER_ENABLED=true make run-fast

Span hierarchy overview

- Parent MigMigration Span
  - Child MigMigration reconcile span
    - Child MigMigration phase execution span
  - Child DVM reconcile span
    - Child DVM phase execution span
  - Child DVMP reconcile span
  - Child DIM reconcile span
    - Child DIM phase execution span
  - Child DISM reconcile span
    - Child DISM phase execution span

Preview of Jaeger tracing showing interacting controllers

Some useful phase timing statistics we get for free by having Jaeger running while a migration runs

- Add tracing to DVM
- Add mutex
- Cleaner implementation of initJaeger in DVM, using as a template for others
- Handle nil spans correctly in main reconcile
- Tracing on migration, DVM, DVMP, DIM, DISM
- Move initJaeger funcs out to trace.go
- Correct tracer names
- Add settings switch JAEGER_ENABLED=true|false
- Adjust comments

djwhatle · 2021-03-17T00:46:32Z

Squashed. Ready for review.

shawn-hurley

I am wondering why we have to go through two layers to get the migration owner ref

shawn-hurley · 2021-03-17T17:39:59Z

pkg/apis/migration/v1alpha1/directimagemigration_types.go

+	owner := &MigMigration{}
+	ownerRefs := r.GetOwnerReferences()
+	if len(ownerRefs) > 0 {
+		ownerRef := types.NamespacedName{Name: ownerRefs[0].Name, Namespace: r.Namespace}


Why the first owner ref? Are we not looking for a specific resource?

I think we're doing this because we only ever create a single owner reference, so if there is more than one, then someone has manually hacked the resource into an unsupported scenario anyway. But yes, we should probably check the type of the owning resource to make sure it's what we expect. But even then, what do we do if we find more than one? Do we take the first, or do we error out? (This is a general question which might apply elsewhere. There are various cases where it is possible to find 2 resources in a slice but we only really expect one, so we generally take the first and ignore the rest. Perhaps we should always error out if we find more than one when we expect one?)

+1 I'll check the type of the ownerRef. In the current system there is never more than one owner but since that could change, it would be good to support the future case.
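To make the question concrete, here is a rough sketch of selecting an owner by kind rather than by position. It is not the code that landed in this PR: `OwnerReference` is a simplified stand-in for `metav1.OwnerReference`, `findOwnerOfKind` is a hypothetical helper, the resource names are made up, and erroring out on duplicates is only one of the options discussed above.

```go
package main

import (
	"errors"
	"fmt"
)

// OwnerReference is a simplified stand-in for metav1.OwnerReference,
// reduced to the two fields this sketch needs.
type OwnerReference struct {
	Kind string
	Name string
}

// findOwnerOfKind returns the single owner reference of the given kind.
// It returns (nil, nil) when no such owner exists, and an error when more
// than one is found (an assumption; taking the first is the alternative).
func findOwnerOfKind(refs []OwnerReference, kind string) (*OwnerReference, error) {
	var found *OwnerReference
	for i := range refs {
		if refs[i].Kind != kind {
			continue
		}
		if found != nil {
			return nil, errors.New("expected at most one owner of kind " + kind)
		}
		found = &refs[i]
	}
	return found, nil
}

func main() {
	// Hypothetical owner references for illustration only.
	refs := []OwnerReference{
		{Kind: "SomeOtherKind", Name: "other-1"},
		{Kind: "MigMigration", Name: "migration-1"},
	}
	owner, err := findOwnerOfKind(refs, "MigMigration")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if owner == nil {
		fmt.Println("no MigMigration owner; resource was created directly")
		return
	}
	fmt.Println("owner:", owner.Name)
}
```

Returning `(nil, nil)` for "no owner" keeps the directly-created case distinguishable from a real error.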

shawn-hurley · 2021-03-17T17:47:13Z

pkg/controller/directimagestreammigration/trace.go

+	}
+
+	// Go from dism -> dim -> migration, use migration UID to get span
+	dim, err := dism.GetOwner(r)


Can we not add the migration as an owner of dism? It probably makes sense in this case if we have the ability to adopt a resource.

Say a DIM is accidentally deleted. The DISMs in this case would stick around and get re-adopted when the migration controller re-creates the DIM.

Currently the migration owns the dim, and the dim owns the dism. I guess we could add the migration as a second owner of the dism, although I haven't really experimented with two-owner resource behavior. If there are 2 owners and one gets deleted, what happens to the resource? Also, if the DIM gets deleted, I think we'd want any DISMs it creates to be deleted as well. Also, the DIM controller doesn't currently have any code which would adopt pre-existing DISM resources. If there is no DISM owned by this DIM, then it creates one.

One thing to keep in mind here is that it's possible for there to be a DIM that has no owner references. That just means that the user created the DIM directly rather than a migmigration creating it, which is essentially an images-only migration. Likewise with the DISM. It's possible for a single DISM to be created without an owning DIM, which will migrate a single imagestream.

I see, we may just want to create a follow-up item to handle the re-adoption.

OK, so first, to clarify, a resource is only deleted if the last owner ref is deleted, not the first, right?

And, assuming this, if we adopted the approach of adding both MigMigration and DIM owner refs, then yes, when we added MigMigration owners to DISM resources, we'd need to modify the DIM controller's DISM creation code as follows:
If no DISM owned by this DIM is found, instead of immediately moving to creation of a new DISM, we'd have a second check looking for a DISM owned by the MigMigration owner of this DIM (if it's owned by a MigMigration), and if it's found, then we'd adopt the existing DISM by adding the current DIM as a second owner. If not found, follow through to the existing "create a new DISM" code.
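The lookup order described above could be sketched roughly like this. Everything here is hypothetical: `DISM` is a toy stand-in that tracks only owner UIDs, `ensureDISM` is not an existing function in mig-controller, and matching a DISM to the right imagestream is left out entirely.

```go
package main

import "fmt"

// DISM is a simplified stand-in for a DirectImageStreamMigration,
// tracking only its name and the UIDs of its owners.
type DISM struct {
	Name      string
	OwnerUIDs []string
}

// ownedBy reports whether uid appears among the DISM's owners.
func ownedBy(d *DISM, uid string) bool {
	for _, o := range d.OwnerUIDs {
		if o == uid {
			return true
		}
	}
	return false
}

// ensureDISM follows the three steps from the comment above:
//  1. reuse a DISM already owned by this DIM,
//  2. otherwise adopt one owned by the DIM's MigMigration (if any),
//  3. otherwise create a new DISM.
//
// migrationUID is empty when the DIM has no MigMigration owner.
func ensureDISM(existing []*DISM, dimUID, migrationUID, newName string) (*DISM, []*DISM) {
	for _, d := range existing {
		if ownedBy(d, dimUID) {
			return d, existing
		}
	}
	if migrationUID != "" {
		for _, d := range existing {
			if ownedBy(d, migrationUID) {
				// Adopt: add the current DIM as a second owner.
				d.OwnerUIDs = append(d.OwnerUIDs, dimUID)
				return d, existing
			}
		}
	}
	created := &DISM{Name: newName, OwnerUIDs: []string{dimUID}}
	if migrationUID != "" {
		created.OwnerUIDs = append(created.OwnerUIDs, migrationUID)
	}
	return created, append(existing, created)
}

func main() {
	// A DISM left behind after its DIM was deleted; only the migration owns it.
	orphan := &DISM{Name: "dism-a", OwnerUIDs: []string{"mig-uid"}}
	dism, all := ensureDISM([]*DISM{orphan}, "new-dim-uid", "mig-uid", "dism-b")
	fmt.Println(dism.Name, dism.OwnerUIDs, len(all))
}
```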

Sounds like a good follow-up item if this is something we're interested in changing. That said, it costs us next to nothing (other than more code) to traverse multiple levels here since all of these resources are coming from cache. The lookups are almost instant.

pkg/apis/migration/v1alpha1/directimagemigration_types.go

pkg/apis/migration/v1alpha1/directimagestreammigration_types.go

pkg/controller/directimagestreammigration/trace.go

pkg/controller/directimagemigration/trace.go

djwhatle · 2021-03-19T15:27:59Z

@shawn-hurley @sseago updated in response to PR review.

djwhatle · 2021-03-19T15:33:09Z

Hmm, latest update broke parenting of migmigration -> reconcile spans. Looking for fix.

djwhatle · 2021-03-19T15:48:03Z

Fixed

djwhatle · 2021-03-19T19:56:17Z

@shawn-hurley @sseago there were some issues after I last pinged you guys, just wanted to let you know that I've resolved the issues now and this is ready for review again.

shawn-hurley

Some more questions. Might be good to merge with follow-ups.

shawn-hurley · 2021-03-22T17:08:40Z

pkg/apis/migration/v1alpha1/directimagemigration_types.go

+		ownerRef := types.NamespacedName{Name: ownerRef.Name, Namespace: r.Namespace}
+		err := client.Get(context.TODO(), ownerRef, owner)
+		if err != nil {
+			return nil, liberr.Wrap(err)


What happens here on a 500 error from the API server? What happens if there is a 404?

What I would be worried about is a slow cache update, I think. Right?

I don't believe we've had any notable problems with cache being out of date for the local cluster. The outdated cache issue primarily showed up as an issue communicating with remote clusters. All of our CRs are located on the same cluster as mig-controller.

The strategy for creating these spans is best-effort: if we can find the associated resources then we'll create a span. Otherwise we ignore it and the migration proceeds without jaeger spans. The jaeger span is not necessary for migration success, just helps us understand the system.

https://github.com/konveyor/mig-controller/pull/967/files/bc6c07c64755c415f2b1c8ae09acb4caca4e6f74#diff-c1e9e79d4cf358ba29d2e6ec3d769c82cd837a9269aae54f489504c64dbb672bR53-R58

shawn-hurley · 2021-03-22T17:16:02Z

pkg/controller/directvolumemigrationprogress/trace.go

+	"k8s.io/apimachinery/pkg/api/errors"
+)
+
+func (r *ReconcileDirectVolumeMigrationProgress) initTracer(dvmp migapi.DirectVolumeMigrationProgress) (opentracing.Span, error) {


I don't see where we handle the error from this function. I could just be missing it.

The error isn't handled, it's ignored and the migration process proceeds. If this function fails then the reconcile span won't get set and no further child spans will be created under the reconcile span. This uses best-effort to create the span but ignores and continues if an error occurred.
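As a rough illustration of the best-effort pattern described here (not the actual controller code), the sketch below uses a minimal `Span` interface as a stand-in for `opentracing.Span` and a hypothetical `initTracer` that fails when the owning migration can't be found. The reconcile proceeds either way.

```go
package main

import (
	"errors"
	"fmt"
)

// Span is a minimal stand-in for opentracing.Span.
type Span interface{ Finish() }

type printSpan struct{ name string }

func (s printSpan) Finish() { fmt.Println("finished span:", s.name) }

// initTracer is a hypothetical stand-in for the per-controller initTracer.
// It fails when the resources needed to locate the parent span are missing.
func initTracer(ownerFound bool) (Span, error) {
	if !ownerFound {
		return nil, errors.New("owning migration not found")
	}
	return printSpan{name: "reconcile"}, nil
}

// reconcile does its work whether or not a span could be created and
// reports whether this pass was traced.
func reconcile(ownerFound bool) bool {
	span, err := initTracer(ownerFound)
	if err != nil {
		// Best effort: drop the error and continue without tracing.
		span = nil
	}
	if span != nil {
		defer span.Finish()
	}
	// ... migration work would happen here ...
	return span != nil
}

func main() {
	fmt.Println("traced:", reconcile(true))
	fmt.Println("traced:", reconcile(false))
}
```

The nil check before `Finish` matters because later code must tolerate a missing reconcile span.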

shawn-hurley · 2021-03-22T17:18:33Z

pkg/tracing/trace.go

+// SetSpanForMigrationUID sets the parent jaeger span for a migration
+func SetSpanForMigrationUID(migrationUID string, span opentracing.Span) {
+	// Init map if needed
+	createMsmMapOnce.Do(func() {


Just a question,

What is the value of doing this on the first call rather than init?

I think it could go either place. This just initializes the map if the function is ever used, and skips init if it isn't used. Think of all the memory we're saving /s

TBH I was just copying this over from some old code I wrote of a mutexed map
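For comparison, here is a small sketch of both options for a mutex-guarded map. The names (`spanMap`, `getLazyMap`, `eagerMap`) are hypothetical and the values are plain strings rather than spans so the example stays self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// spanMap is a mutex-guarded map keyed by migration UID. Values are
// strings here to keep the sketch self-contained; the PR stores spans.
type spanMap struct {
	mu    sync.RWMutex
	spans map[string]string
}

func (m *spanMap) Set(uid, span string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.spans[uid] = span
}

func (m *spanMap) Get(uid string) (string, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	span, ok := m.spans[uid]
	return span, ok
}

var (
	lazyMap     *spanMap
	lazyMapOnce sync.Once
)

// getLazyMap allocates the map on first use; sync.Once makes this safe
// when several controllers call it concurrently.
func getLazyMap() *spanMap {
	lazyMapOnce.Do(func() {
		lazyMap = &spanMap{spans: make(map[string]string)}
	})
	return lazyMap
}

// eagerMap is the alternative: allocated at package initialization,
// so no Once is needed on the call path.
var eagerMap = &spanMap{spans: make(map[string]string)}

func main() {
	getLazyMap().Set("uid-1", "migration-span")
	eagerMap.Set("uid-1", "migration-span")
	fmt.Println(getLazyMap().Get("uid-1"))
	fmt.Println(eagerMap.Get("uid-1"))
}
```

Both behave the same for callers; the eager form is a little less code, while the lazy form skips the allocation when tracing is never used.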

shawn-hurley · 2021-03-22T17:19:32Z

pkg/tracing/trace.go

+		// Wait 5 minutes before terminating span, since other controllers writing to span
+		// post-close will result in undefined jaeger behavior (e.g. broken statistics page)
+		go func(migrationSpan opentracing.Span) {
+			time.Sleep(5 * time.Minute)


Is there a better way to note that other things are potentially writing to the span rather than rely on a timeout?

I thought about using a semaphore here or some kind of wrapper around the spans that would only allow writes if the span isn't already closed, I think that would be a more robust solution.

This was a sort of quick and easy hack that I think will be effective as long as no controller has a reconcile extending past 5 minutes. In general our reconciles are much shorter than this, with the exception of DirectImageMigration related things where actual image copies are done in-controller.

I actually wasn't sure why jaeger data corruption was happening, but I was able to confirm that this fixed the problem (with minimal POC effort) by making sure that the top-level migration span isn't written to post-close.

If you feel this is important to fix now I can definitely do it before merge.

I will plan to fix this if we see any problems with the current approach in the future, or if we plan to turn Jaeger tracing on outside of dev environments. Filed an issue.
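For the follow-up, the wrapper idea mentioned earlier in this thread might look something like the sketch below. It is only an illustration under assumptions: `guardedSpan` is a hypothetical type that records events in a slice instead of wrapping a real `opentracing.Span`, and how child spans would be routed through such a wrapper is not addressed.

```go
package main

import (
	"fmt"
	"sync"
)

// guardedSpan is a hypothetical wrapper that rejects writes once the
// span has been finished, instead of relying on a fixed delay.
type guardedSpan struct {
	mu     sync.Mutex
	closed bool
	events []string
}

// Log records an event and reports whether it was accepted.
func (g *guardedSpan) Log(event string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return false
	}
	g.events = append(g.events, event)
	return true
}

// Finish marks the span closed. Calling it more than once is harmless.
func (g *guardedSpan) Finish() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.closed = true
}

func main() {
	span := &guardedSpan{}
	fmt.Println("before finish:", span.Log("phase started"))
	span.Finish()
	fmt.Println("after finish:", span.Log("late write from another controller"))
}
```

With a guard like this the migration span could be closed as soon as the migration ends, and late writers would be dropped rather than corrupting the trace.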

djwhatle marked this pull request as ready for review March 16, 2021 21:05

djwhatle force-pushed the add-jaeger branch from a6b3cbb to 4b45534 Compare March 16, 2021 23:49

djwhatle linked an issue Mar 17, 2021 that may be closed by this pull request

[MIG-525] Implement jaeger tracing #1005

Closed

djwhatle force-pushed the add-jaeger branch from d40f297 to 639975b Compare March 17, 2021 00:46

djwhatle requested review from shawn-hurley, sseago and pranavgaikwad March 17, 2021 14:20

shawn-hurley requested changes Mar 17, 2021

sseago reviewed Mar 17, 2021

djwhatle commented Mar 18, 2021

Don't assume first ownerRef, attach name of resource to spans

7d2c7f1

djwhatle added 3 commits March 19, 2021 11:42

Fixes

1f67a80

Use reconcileSpan instead of migrationSpan

553aa5c

Fix

7c15f58

Fix jaeger corruption on end of reconcile

bc6c07c

shawn-hurley reviewed Mar 22, 2021

djwhatle mentioned this pull request Mar 25, 2021

Jaeger: Make close of migrationSpan more robust if planning to turn on by default in user envs. #1023

Open

djwhatle merged commit ec65b20 into migtools:master Mar 25, 2021

sseago mentioned this pull request Jun 1, 2021

Instrumentation for internal performance monitoring vmware-tanzu/velero#3841

Closed
