Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add jaeger tracing #967

Merged
merged 6 commits into from
Mar 25, 2021
Merged

Add jaeger tracing #967

merged 6 commits into from
Mar 25, 2021

Conversation

djwhatle
Copy link
Contributor

@djwhatle djwhatle commented Feb 22, 2021

Todo

  • Add tracing to remaining controllers that are part of migration process (dim, dism, dvmp)
  • Hide tracing behind a env var flag since it's not needed all the time, and may have perf impact
  • Make sure that spans terminate when migration ends
  • Evaluate if sampling frequency is OK as-is (decided this isn't a problem for now since only use-case will be for dev perf evaluation, not enabling by default in user clusters)

Follow-on PR:

  • Figure out how to deploy a jaeger collector inside OCP 4 cluster, see if there's an official way to do this without us needing to package

To test:

Start the jaeger collector:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 14250:14250 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.22

Navigate to http://localhost:16686

Run mig-controller with jaeger tracing enabled

JAEGER_ENABLED=true make run-fast

Span hierarchy overview

- Parent MigMigration Span
  - Child MigMigration reconcile span
    - Child MigMigration phase execution span
  - Child DVM reconcile span
    - Child DVM phase execution span
  - Child DVMP reconcile span
  - Child DIM reconcile span
    - Child DIM phase execution span
  - Child DISM reconcile span
    - Child DISM phase execution span

Jaeger

Preview of Jaeger tracing showing interacting controllers

image

Some useful phase timing statistics get for free by having Jaeger running while a migration runs

image

@djwhatle djwhatle marked this pull request as ready for review March 16, 2021 21:05
@djwhatle djwhatle linked an issue Mar 17, 2021 that may be closed by this pull request
Add tracing to DVM

Add mutex

Cleaner implementation of initJaeger in DVM, using as a template for others

Handle nil spans correctly in main reconcile

Tracing on migration, DVM, DVMP, DIM, DISM

Move initJaeger funcs out to trace.go

Correct tracer names

Add settings switch JAEGER_ENABLED=true|false

Adjust comments
@djwhatle
Copy link
Contributor Author

Squashed. Ready for review.

Copy link
Contributor

@shawn-hurley shawn-hurley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering why we have to go through two layers to get the migration owner ref

owner := &MigMigration{}
ownerRefs := r.GetOwnerReferences()
if len(ownerRefs) > 0 {
ownerRef := types.NamespacedName{Name: ownerRefs[0].Name, Namespace: r.Namespace}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the first owner ref? are we not looking for a specific resource?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're doing this because we only ever create a single owner reference, so if there are more than one, then someone has manually hacked the resource into an unsupported scenario anyway. But yes, we should probably check the type of the owning resource to make sure it's what we expect. But even then, what do we do if we find more than one? Do we take the first, or to we error out? (this is a general question which might apply elsewhere. There are various cases where it is possible to find 2 resources in a slice but we only really expect one, so we generally take the first and ignore the rest. Perhaps we should always error out if we find more than one when we expect one?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I'll check the type of the ownerRef. In the current system there is never more than one owner but since that could change, it would be good to support the future case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

}

// Go from dism -> dim -> migration, use migration UID to get span
dim, err := dism.GetOwner(r)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not add the migration as an owner of dism? it probably makes sense in this case if we have the ability to adopt a resource.

Say a DIM is accidently deleted, the DISM's then in this case would stick around and get re-adopted when the migration controller re-creates the DIM.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the migration owns the dim, and the dim owns the dism. I guess we could add the migration as a second owner of the dism, although I haven't really experimented with two-owner resource behavior. If there are 2 owners and one gets deleted, what happens to the resource? Also, if the DIM gets deleted, I think we'd want any DISMs it creates to be deleted as well. Also, the DIM controller doesn't currently have any code which would adopt pre-existing DISM resources. If there is no DISM owned by this DIM, then it creates one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to keep in mind here is that it's possible for there to be a DIM that has no owner references. That just means that the user created the DIM directly rather than a migmigration creating it, which is essentially an images-only migration. Likewise with the DISM. It's possible for a single DISM to be created without an owning DIM, which will migrate a single imagestream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we may just want to create a follow-up item to handle the re-adoption.

Copy link
Contributor

@sseago sseago Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so first, to clarify, a resource is only deleted if the last owner ref is deleted, not the first, right?

And, assuming this, if we adoped the approach of adding both MigMigration and DIM owner refs, then yes, when we added MigMigration owners to DISM resources, we'd need to modify the DIM controller's DISM creation code as follows:
If no DISM owned by this DIM is found, instead of immediately moving to creation of a new DISM, we'd have a second check looking for a DISM owned by the MigMIgration owner of this DIM (if it's owned by a MigMIgration), and if it's found, then we'd adopt the existing DISM by adding the current DIM as a second owner. If not found, follow through to the existing "create a new DISM" code.

Copy link
Contributor Author

@djwhatle djwhatle Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good follow-up item if this is something we're interested in changing. That said, it costs us next to nothing (other than more code) to traverse multiple levels here since all of these resources are coming from cache. The lookups are almost instant.

@djwhatle
Copy link
Contributor Author

@shawn-hurley @sseago updated in response to PR review.

@djwhatle
Copy link
Contributor Author

Hmm, latest update broke parenting of migmigration -> reconcile spans. Looking for fix.
image

@djwhatle
Copy link
Contributor Author

Fixed
image

@djwhatle
Copy link
Contributor Author

@shawn-hurley @sseago there were some issues after I last pinged you guys, just wanted to let you know that I've resolved the issues now and this is ready for review again.

Copy link
Contributor

@shawn-hurley shawn-hurley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more questions might be good to merge with follow-ups

ownerRef := types.NamespacedName{Name: ownerRef.Name, Namespace: r.Namespace}
err := client.Get(context.TODO(), ownerRef, owner)
if err != nil {
return nil, liberr.Wrap(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens here on a 500 error from the API server? what happens if there is a 404?

What I would be worried about is a slow cache update I think right?

Copy link
Contributor Author

@djwhatle djwhatle Mar 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe we've had any notable problems with cache being out of date for the local cluster. The outdated cache issue primarily showed up as an issue communicating with remote clusters. All of our CRs are located on the same cluster as mig-controller.

The strategy for creating these spans is best-effort: if we can find the associated resources then we'll create a span. Otherwise we ignore it and the migration proceeds without jaeger spans. The jaeger span is not necessary for migration success, just helps us understand the system.

https://github.com/konveyor/mig-controller/pull/967/files/bc6c07c64755c415f2b1c8ae09acb4caca4e6f74#diff-c1e9e79d4cf358ba29d2e6ec3d769c82cd837a9269aae54f489504c64dbb672bR53-R58

"k8s.io/apimachinery/pkg/api/errors"
)

func (r *ReconcileDirectVolumeMigrationProgress) initTracer(dvmp migapi.DirectVolumeMigrationProgress) (opentracing.Span, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where we handle the error from this function. I could just be missing it.

Copy link
Contributor Author

@djwhatle djwhatle Mar 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error isn't handled, it's ignored and the migration process proceeds. If this function fails then the reconcile span won't get set and no further child spans will be created under the reconcile span. This uses best-effort to create the span but ignores and continues if an error occurred.

// SetSpanForMigrationUID sets the parent jaeger span for a migration
func SetSpanForMigrationUID(migrationUID string, span opentracing.Span) {
// Init map if needed
createMsmMapOnce.Do(func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question,

What is the value of doing this on the first call rather than init?

Copy link
Contributor Author

@djwhatle djwhatle Mar 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could go either place. This just initializes the map if the function is ever used, and skips init if it isn't used. Think of all the memory we're saving /s

TBH I was just copying this over from some old code I wrote of a mutexed map

// Wait 5 minutes before terminating span, since other controllers writing to span
// post-close will result in undefined jaeger behavior (e.g. broken statistics page)
go func(migrationSpan opentracing.Span) {
time.Sleep(5 * time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to note that other things are potentially writing to the span rather than relay on a timeout?

Copy link
Contributor Author

@djwhatle djwhatle Mar 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about using a semaphore here or some kind of wrapper around the spans that would only allow writes if the span isn't already closed, I think that would be a more robust solution.

This was a sort of quick and easy hack that I think will be effective as long as no controller has a reconcile extending past 5 minutes. In general our reconciles are much shorter than this, with the exception of DirectImageMigration related things where actual image copies are done in-controller.

I actually wasn't sure why jaeger data corruption was happening, but I was able to confirm that this fixed the problem (with minimal POC effort) by making sure that the top-level migration span isn't written to post-close.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you feel this is important to fix now I can definitely do it before merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will plan to fix this if we see any problems with the current approach in the future, or if we plan to turn Jaeger tracing on outside of dev environments. Filed an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MIG-525] Implement jaeger tracing
3 participants