The current approach to plugin refresh is fairly naive. Synse notes that a plugin failed a request and marks it inactive. Later (either on a timer or when forced by a user), it attempts to reconnect to the plugin or re-issue a command to it, to re-establish that it is "active".
I believe the premise of using active/inactive is still useful, as it helps keep response times low for various requests, but the way plugins are refreshed could use improvement.
My idea is that instead of retrying all plugins at once on an interval, we retry them individually with backoff in a background task. Some details on this approach:
Only plugins that are configured should be retried. When defined explicitly in YAML, they will always be retried. When defined via a Discovery method, only plugins identified in the latest discovery will be retried (this prevents Synse from repeatedly retrying connections to a plugin that is no longer there).
Each plugin's retry runs in its own Task. A successful retry puts the plugin back into an active state and terminates the task. The task continues to run until either a retry succeeds or the task is cancelled (e.g. if the plugin no longer shows up in discovery).
Retry will be done using an exponential backoff algorithm with jitter.
Additional data can be captured for the plugin, which can be returned on calls for plugin info. This data could include:
downtime: the total time a plugin has been inactive for (could count last and total)
uptime: the total time a plugin has been active for (could count last and total)
last_disconnect: the last time the plugin disconnected
disconnects: the number of times a plugin has transitioned from "active" to "inactive"
reconnects: the number of times a plugin has transitioned from "inactive" to "active"
Not all of the above are data points that I feel are required, but they are data points that we could (and probably therefore should) capture.
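The per-plugin retry described above (one task per plugin, exponential backoff with jitter, stop on success or cancellation) could look roughly like the sketch below. It is only an illustration under assumptions: the `Plugin` class, `try_connect`, `retry_until_active`, and the `base`/`cap` parameters are hypothetical names, not Synse's actual API, and "full jitter" is just one common jitter variant.

```python
import asyncio
import random


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


class Plugin:
    """Hypothetical stand-in for a plugin handle (not the real Synse class)."""

    def __init__(self, name, connect):
        self.name = name
        self.active = False
        self._connect = connect  # async callable that raises on failure

    async def try_connect(self):
        await self._connect()


async def retry_until_active(plugin, base=1.0, cap=60.0):
    """Retry one plugin until it connects; returns the number of failed attempts.

    Cancelling the task (e.g. the plugin dropped out of discovery) stops the
    loop, because CancelledError is not swallowed here.
    """
    attempt = 0
    while True:
        try:
            await plugin.try_connect()
        except asyncio.CancelledError:
            raise
        except Exception:
            # Connection failed: wait with jittered backoff, then try again.
            await asyncio.sleep(backoff_delay(attempt, base, cap))
            attempt += 1
        else:
            plugin.active = True
            return attempt
```

Each inactive plugin would get its own task, for example via `asyncio.ensure_future(retry_until_active(plugin))`, and the task handle would be kept so it can be cancelled when the plugin disappears from discovery. The jitter keeps many plugins that failed at the same moment from retrying in lockstep.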
I think this is a good first pass at resolving the issue. Collecting the additional data would also give us metrics to key off of for range-based queries/alerts, such as multiple plugins disconnecting in < 5m, or disconnect/reconnect counts inflating. Things like that would prompt us to take a look.
As it stands right now, it's very much a silent failure and relies on humans to notice the plugin misbehaving. Step one feels like adding observability to it, and then going from there with a decision that is better informed by the data.
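To make the observability idea concrete, here is a minimal sketch of how the proposed data points could be tracked on state transitions, plus a windowed count that a "disconnects in the last 5 minutes" style alert could key off of. Everything here is an assumption for illustration: `PluginStats`, `state_since`, `mark_inactive`, `mark_active`, and `disconnects_within` are hypothetical names, and timestamps are plain seconds from `time.time()`.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PluginStats:
    """Hypothetical per-plugin counters; field names mirror the proposal."""

    active: bool = True
    uptime: float = 0.0            # total seconds spent active
    downtime: float = 0.0          # total seconds spent inactive
    disconnects: int = 0           # active -> inactive transitions
    reconnects: int = 0            # inactive -> active transitions
    last_disconnect: Optional[float] = None
    state_since: float = field(default_factory=time.time)
    # Recent disconnect timestamps, bounded so memory use stays fixed.
    recent: deque = field(default_factory=lambda: deque(maxlen=100))

    def _accumulate(self, now):
        # Credit the elapsed time to whichever state the plugin was in.
        elapsed = now - self.state_since
        if self.active:
            self.uptime += elapsed
        else:
            self.downtime += elapsed
        self.state_since = now

    def mark_inactive(self, now=None):
        now = time.time() if now is None else now
        if not self.active:
            return  # already inactive; not a new transition
        self._accumulate(now)
        self.active = False
        self.disconnects += 1
        self.last_disconnect = now
        self.recent.append(now)

    def mark_active(self, now=None):
        now = time.time() if now is None else now
        if self.active:
            return  # already active; not a new transition
        self._accumulate(now)
        self.active = True
        self.reconnects += 1

    def disconnects_within(self, window, now=None):
        """Count disconnects in the last `window` seconds."""
        now = time.time() if now is None else now
        return sum(1 for t in self.recent if now - t <= window)
```

Summing `disconnects_within(300)` across all plugins would give one possible signal for the "multiple plugins disconnecting in < 5m" case. Note that the totals only update on a transition, so a caller reporting live uptime would need to add the time elapsed since `state_since`.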
I'll try to get started on this today. It is a fairly sizable change, so it'll take a few days probably, but I think in the end it'll be worth it both for performance and observability.
This relates to: