Add jaeger tracing #967
Conversation
- Add tracing to DVM
- Add mutex
- Cleaner implementation of initJaeger in DVM, using as a template for others
- Handle nil spans correctly in main reconcile
- Tracing on migration, DVM, DVMP, DIM, DISM
- Move initJaeger funcs out to trace.go
- Correct tracer names
- Add settings switch JAEGER_ENABLED=true|false
- Adjust comments
Squashed. Ready for review.
I am wondering why we have to go through two layers to get the migration owner ref
```go
owner := &MigMigration{}
ownerRefs := r.GetOwnerReferences()
if len(ownerRefs) > 0 {
	ownerRef := types.NamespacedName{Name: ownerRefs[0].Name, Namespace: r.Namespace}
```
Why the first owner ref? Are we not looking for a specific resource?
I think we're doing this because we only ever create a single owner reference, so if there's more than one, someone has manually hacked the resource into an unsupported scenario anyway. But yes, we should probably check the type of the owning resource to make sure it's what we expect. Even then, what do we do if we find more than one? Do we take the first, or do we error out? (This is a general question which might apply elsewhere: there are various cases where it's possible to find two resources in a slice when we only really expect one, and we generally take the first and ignore the rest. Perhaps we should always error out if we find more than one when we expect one?)
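For illustration, the stricter lookup could look something like this. It is a minimal sketch, not the PR's code: the helper name is made up, and the filtering-by-Kind approach is the suggestion above, not existing behavior.

```go
import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// migrationOwnerRef (hypothetical) filters owner refs by the expected Kind
// instead of blindly taking refs[0], and errors out when more than one matches.
func migrationOwnerRef(refs []metav1.OwnerReference) (*metav1.OwnerReference, error) {
	var matches []metav1.OwnerReference
	for _, ref := range refs {
		if ref.Kind == "MigMigration" {
			matches = append(matches, ref)
		}
	}
	switch len(matches) {
	case 0:
		// No owning migration: the resource was created directly by a user.
		return nil, nil
	case 1:
		return &matches[0], nil
	default:
		return nil, fmt.Errorf("expected one MigMigration owner, found %d", len(matches))
	}
}
```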
+1. I'll check the type of the ownerRef. In the current system there is never more than one owner, but since that could change, it would be good to support the future case.
Fixed.
```go
}

// Go from dism -> dim -> migration, use migration UID to get span
dim, err := dism.GetOwner(r)
```
Can we not add the migration as an owner of the DISM? It probably makes sense in this case if we have the ability to adopt a resource.
Say a DIM is accidentally deleted; the DISMs would then stick around and get re-adopted when the migration controller re-creates the DIM.
Currently the migration owns the DIM, and the DIM owns the DISM. I guess we could add the migration as a second owner of the DISM, although I haven't really experimented with two-owner resource behavior. If there are two owners and one gets deleted, what happens to the resource? Also, if the DIM gets deleted, I think we'd want any DISMs it creates to be deleted as well. And the DIM controller doesn't currently have any code to adopt pre-existing DISM resources: if there is no DISM owned by this DIM, it creates one.
One thing to keep in mind here is that it's possible for there to be a DIM that has no owner references. That just means that the user created the DIM directly rather than a migmigration creating it, which is essentially an images-only migration. Likewise with the DISM. It's possible for a single DISM to be created without an owning DIM, which will migrate a single imagestream.
I see; we may just want to create a follow-up item to handle the re-adoption.
OK, so first, to clarify: a resource is only deleted if the last owner ref is deleted, not the first, right?

And, assuming this, if we adopted the approach of adding both MigMigration and DIM owner refs, then yes, when we added MigMigration owners to DISM resources, we'd need to modify the DIM controller's DISM creation code as follows: if no DISM owned by this DIM is found, instead of immediately moving to creation of a new DISM, we'd have a second check looking for a DISM owned by the MigMigration owner of this DIM (if it's owned by a MigMigration). If one is found, we'd adopt the existing DISM by adding the current DIM as a second owner; if not, we'd follow through to the existing "create a new DISM" code.
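Something like the following pseudo-Go captures that flow. It is a sketch of the proposal only: `findDISMOwnedBy`, `dimOwnerRef`, and `createDISM` are hypothetical helpers, not existing controller code.

```go
// ensureDISM sketches the re-adoption flow under a two-owner scheme.
func ensureDISM(c k8sclient.Client, dim *DirectImageMigration, migration *MigMigration) (*DirectImageStreamMigration, error) {
	// 1. Prefer a DISM already owned by this DIM.
	dism, err := findDISMOwnedBy(c, dim.UID)
	if err != nil {
		return nil, err
	}
	if dism != nil {
		return dism, nil
	}
	// 2. Failing that, look for one owned by the DIM's MigMigration and
	//    adopt it by appending the current DIM as a second owner.
	if migration != nil {
		orphan, err := findDISMOwnedBy(c, migration.UID)
		if err != nil {
			return nil, err
		}
		if orphan != nil {
			orphan.OwnerReferences = append(orphan.OwnerReferences, dimOwnerRef(dim))
			return orphan, c.Update(context.TODO(), orphan)
		}
	}
	// 3. Otherwise fall through to the existing "create a new DISM" path.
	return createDISM(c, dim)
}
```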
Sounds like a good follow-up item if this is something we're interested in changing. That said, it costs us next to nothing (other than more code) to traverse multiple levels here since all of these resources are coming from cache. The lookups are almost instant.
@shawn-hurley @sseago updated in response to PR review.
@shawn-hurley @sseago there were some issues after I last pinged you guys; just wanted to let you know that I've resolved them now and this is ready for review again.
Some more questions; this might be good to merge with follow-ups.
```go
ownerRef := types.NamespacedName{Name: ownerRef.Name, Namespace: r.Namespace}
err := client.Get(context.TODO(), ownerRef, owner)
if err != nil {
	return nil, liberr.Wrap(err)
```
What happens here on a 500 error from the API server? What happens if there is a 404?
What I would be worried about is a slow cache update, right?
I don't believe we've had any notable problems with the cache being out of date for the local cluster; the outdated-cache issue primarily showed up when communicating with remote clusters. All of our CRs live on the same cluster as mig-controller.
The strategy for creating these spans is best-effort: if we can find the associated resources, we create a span; otherwise we skip it and the migration proceeds without jaeger spans. The span isn't necessary for migration success; it just helps us understand the system.
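As a minimal sketch of what best-effort means here: the function name is made up, `errors.IsNotFound` is real apimachinery API, and this assumes a controller-runtime vintage whose `Get` takes a `runtime.Object`.

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	k8sclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// lookupOwnerBestEffort (hypothetical) resolves an owner from the cached
// client. Tracing is never allowed to fail a migration, so any failure
// simply means "no span".
func lookupOwnerBestEffort(c k8sclient.Client, key types.NamespacedName, owner runtime.Object) bool {
	err := c.Get(context.TODO(), key, owner)
	if errors.IsNotFound(err) {
		// Owner gone, or the cache is lagging: skip span creation.
		return false
	}
	return err == nil
}
```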
"k8s.io/apimachinery/pkg/api/errors" | ||
) | ||
|
||
func (r *ReconcileDirectVolumeMigrationProgress) initTracer(dvmp migapi.DirectVolumeMigrationProgress) (opentracing.Span, error) { |
I don't see where we handle the error from this function. I could just be missing it.
The error isn't handled; it's ignored, and the migration process proceeds. If this function fails, the reconcile span won't get set and no further child spans will be created under it. Span creation is best-effort: if an error occurs, we ignore it and continue.
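So a call site looks roughly like this (a sketch; the surrounding reconcile is elided):

```go
// Best-effort tracing: the error is deliberately dropped, and every later
// use of the span is guarded by a nil check.
reconcileSpan, _ := r.initTracer(dvmp)
if reconcileSpan != nil {
	defer reconcileSpan.Finish()
}
```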
```go
// SetSpanForMigrationUID sets the parent jaeger span for a migration
func SetSpanForMigrationUID(migrationUID string, span opentracing.Span) {
	// Init map if needed
	createMsmMapOnce.Do(func() {
```
Just a question: what is the value of doing this on the first call rather than in init?
I think it could go in either place. This just initializes the map if the function is ever used, and skips init if it isn't. Think of all the memory we're saving /s
TBH I was just copying this over from some old mutexed-map code I wrote.
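For context, the pattern is just a lazily initialized, mutex-guarded map. A minimal sketch of how the snippet above presumably fits together; the variable names beyond those visible in the diff are assumptions:

```go
import (
	"sync"

	"github.com/opentracing/opentracing-go"
)

var (
	createMsmMapOnce sync.Once
	msmMutex         sync.Mutex
	migrationSpanMap map[string]opentracing.Span
)

// SetSpanForMigrationUID sets the parent jaeger span for a migration.
// The map is allocated on first use; allocating it in init() would work
// equally well.
func SetSpanForMigrationUID(migrationUID string, span opentracing.Span) {
	createMsmMapOnce.Do(func() {
		migrationSpanMap = make(map[string]opentracing.Span)
	})
	msmMutex.Lock()
	defer msmMutex.Unlock()
	migrationSpanMap[migrationUID] = span
}
```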
```go
// Wait 5 minutes before terminating span, since other controllers writing to span
// post-close will result in undefined jaeger behavior (e.g. broken statistics page)
go func(migrationSpan opentracing.Span) {
	time.Sleep(5 * time.Minute)
```
Is there a better way to note that other things are potentially writing to the span, rather than relying on a timeout?
I thought about using a semaphore here, or some kind of wrapper around the spans that would only allow writes if the span isn't already closed; I think that would be a more robust solution.
This was a quick-and-easy hack that I think will be effective as long as no controller has a reconcile extending past 5 minutes. In general our reconciles are much shorter than that, with the exception of DirectImageMigration-related work, where actual image copies are done in-controller.
I actually wasn't sure why the jaeger data corruption was happening, but I was able to confirm that this fixed the problem (with minimal POC effort) by making sure the top-level migration span isn't written to post-close.
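The wrapper idea could be as simple as a close-guarded span. A sketch of that more robust alternative, not what this PR implements:

```go
import (
	"sync"

	"github.com/opentracing/opentracing-go"
)

// guardedSpan drops writes after Finish instead of relying on a timeout.
type guardedSpan struct {
	mu     sync.Mutex
	span   opentracing.Span
	closed bool
}

// SetTag forwards the write only while the span is still open.
func (g *guardedSpan) SetTag(key string, value interface{}) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.closed {
		g.span.SetTag(key, value)
	}
}

// Finish closes the underlying span exactly once.
func (g *guardedSpan) Finish() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.closed {
		g.span.Finish()
		g.closed = true
	}
}
```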
If you feel this is important to fix now, I can definitely do it before merge.
I will plan to fix this if we see any problems with the current approach in the future, or if we plan to turn Jaeger tracing on outside of dev environments. Filed an issue.
Todo
Follow-on PR:
To test:
- Start the jaeger collector (see the sketch below)
- Navigate to http://localhost:16686
- Run mig-controller with jaeger tracing enabled
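A sketch of those steps, assuming Docker for the collector and that `make run` is how you start mig-controller locally; only the `JAEGER_ENABLED` switch comes from this PR:

```sh
# Jaeger all-in-one: UI on 16686, agent on 6831/udp
docker run -d --name jaeger \
  -p 6831:6831/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

# Run the controller with the new settings switch from this PR
# (make run is an assumption about the local dev workflow)
JAEGER_ENABLED=true make run
```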
Span hierarchy overview
Preview of Jaeger tracing showing interacting controllers
Some useful phase timing statistics come for free from having Jaeger running while a migration runs.