Add metrics for tracking live servers #113
I'll fix up the test failures in a few days when I get a chance, but I'd still appreciate feedback on the approach 🙂
Tests are fixed!
@eoinmccafee00 I think that is about a different issue, can you re-check please?
Apologies, yeah, I closed the wrong ticket.
Hey @iainlane, can you wrap a feature flag around this, please? I'd rather not have it enabled by default for now. Cheers,
This should allow us to correlate the servers that drone thinks it knows about with those that GCP has
Now we pass a collector to `ServerDelete()`, more metrics are in the registry and the ones we want are at the end.
Only expose the new metrics when this variable is set
7888e52 to 8ee3814
@eoinmcafee00 okay, re-pushed. I checked this works as expected: with the flag set we get metrics like
I should be able to join that with the lists from the CSPs, and we'll be able to see if one side knows about an instance the other doesn't.
metrics/metrics.go
func init() {
	registerKnownServers, _ = strconv.ParseBool(
		os.Getenv("DRONE_AUTOSCALER_REGISTER_KNOWN_SERVERS"),
Can you add this to https://github.com/drone/autoscaler/blob/master/config/config.go and inject the config struct, similar to how we do it elsewhere, please?
okay, pushed, can you re-review please?
Rename to DRONE_METRICS_REGISTER_KNOWN_SERVERS, expose via config struct, reorder imports
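For readers following the review thread, here is a rough sketch of what "expose via config struct" could look like. It is not the code from this PR: the `Config` type, its `Metrics.RegisterKnownServers` field, and the `loadConfig` helper are hypothetical names used only for illustration, and the real project may populate its config differently. Only the environment variable name comes from the commit message above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config is a hypothetical, trimmed-down stand-in for the autoscaler's
// config struct; only the field relevant to this change is shown.
type Config struct {
	Metrics struct {
		RegisterKnownServers bool
	}
}

// loadConfig reads the feature flag through the supplied lookup function.
// An unset or unparsable value leaves the feature switched off.
func loadConfig(getenv func(string) string) Config {
	var c Config
	v, err := strconv.ParseBool(getenv("DRONE_METRICS_REGISTER_KNOWN_SERVERS"))
	if err == nil {
		c.Metrics.RegisterKnownServers = v
	}
	return c
}

func main() {
	conf := loadConfig(os.Getenv)
	fmt.Println("register known servers:", conf.Metrics.RegisterKnownServers)
}
```

Passing the lookup function in, rather than calling `os.Getenv` from an `init()`, keeps the flag testable and lets the metrics code receive the already-parsed struct.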
Problem
Occasionally we've noticed that the autoscalers know about instances that the VM providers (e.g. GCP or AWS) don't or vice versa, probably due to latent bugs in the autoscaler. I'd like to expose the servers from the autoscaler and then we can correlate both sides to have an alert when they get out of sync. Then we should be able to look in logs and hopefully identify something that can help get the bug(s) fixed. Currently, we might not notice until a bit later and then it's hard to know where to drill down to.
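To make the intended correlation concrete, here is a minimal sketch of the comparison, assuming both sides can be reduced to plain lists of instance names. The `onlyIn` helper and the sample names are made up for illustration; in practice the comparison would more likely live in an alerting query than in Go.

```go
package main

import (
	"fmt"
	"sort"
)

// onlyIn returns the names present in a but missing from b, sorted.
func onlyIn(a, b []string) []string {
	seen := make(map[string]struct{}, len(b))
	for _, n := range b {
		seen[n] = struct{}{}
	}
	var out []string
	for _, n := range a {
		if _, ok := seen[n]; !ok {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical sample data: what each side believes exists.
	autoscaler := []string{"agent-1", "agent-2", "agent-3"}
	provider := []string{"agent-2", "agent-3", "agent-4"}

	fmt.Println("known only to the autoscaler:", onlyIn(autoscaler, provider))
	fmt.Println("known only to the provider:", onlyIn(provider, autoscaler))
}
```

A non-empty result in either direction is the "out of sync" condition worth alerting on.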
Proposal
Add a new metric `drone_server_known_instance` with a label for the instance name. This should allow us to correlate the servers that drone thinks it knows about with those that GCP has.
Technically this has unbounded cardinality but I think it's okay in reality since the metric is deleted when a server goes away and we'll really only be doing instant ("give me the values now") queries on this.
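The lifecycle described above can be sketched with a small stdlib-only model. This is not the Prometheus client library and not the code in this PR: `knownInstances`, its methods, and the `name` label are hypothetical, chosen only to show that the number of exposed series tracks the number of live servers because each series is removed when its server is deleted.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// knownInstances is a stdlib-only model of a gauge keyed by instance name.
type knownInstances struct {
	mu    sync.Mutex
	names map[string]struct{}
}

func newKnownInstances() *knownInstances {
	return &knownInstances{names: make(map[string]struct{})}
}

// Add records a server when the autoscaler creates it.
func (k *knownInstances) Add(name string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.names[name] = struct{}{}
}

// Delete drops the series when the server goes away, which keeps the
// number of series bounded by the number of live servers.
func (k *knownInstances) Delete(name string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.names, name)
}

// Render returns one sample line per live instance, sorted by name.
func (k *knownInstances) Render() string {
	k.mu.Lock()
	defer k.mu.Unlock()
	names := make([]string, 0, len(k.names))
	for n := range k.names {
		names = append(names, n)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "drone_server_known_instance{name=%q} 1\n", n)
	}
	return b.String()
}

func main() {
	k := newKnownInstances()
	k.Add("agent-a")
	k.Add("agent-b")
	k.Delete("agent-a") // server destroyed, so its series disappears
	fmt.Print(k.Render())
}
```

An instant query against such a metric only ever sees the currently live set, which is why the cardinality concern is mostly theoretical here.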
Would be interested in people's thoughts on this.
Alternatives
`drone_server_count`
There is already `drone_server_count`, which could gain some labels.
Doing something with logs
For completeness: we could make sure to log nice clear messages when servers turn up and leave.