Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publish diagnostics_agg right away upon new warn/error diagnostics #48

Open
mikepurvis opened this issue May 20, 2016 · 5 comments
Open

Comments

@mikepurvis
Copy link
Member

The present behaviour is to only publish the aggregated diagnostics at the fixed rate (default 1Hz):

https://github.com/ros/diagnostics/blob/03e0db006175c3e6157bb50ab38021ebb4995c5b/diagnostic_aggregator/src/aggregator_node.cpp#L50-56

I believe that it would be better to trigger an immediate publish of the aggregated topic when a diagnostic transitions to WARN or ERROR. We have some error reporting mechanisms that currently have to subscribe to /diagnostics, and could instead subscribe to /diagnostics_agg if we knew that a) no error reports would be missed, and b) new error reports would be passed through immediately.

Thoughts?

@heuristicus
Copy link
Contributor

heuristicus commented Jun 8, 2016

This sounds like a good idea, but is there any particular advantage to working with /diagnostics_agg rather than /diagnostics? I guess the main thing would be fewer callbacks, and the filtering that you define through the analyzers would be applied so you only see the diagnostics you're interested in.

At the same time, it seems to me that the main purpose of the /diagnostics_agg right now is to provide information to a GUI, which is why the update rate is slow.

I guess what I'm wondering about is whether the purpose of the aggregator is to provide an aggregation of the most up to date diagnostics at any point in time, or just in slices.

@mikepurvis
Copy link
Member Author

The main consumer of diagnostics_agg is the GUI, but I think its purpose is to supply a single diagnostic snapshot which doesn't require opening a socket to every node which produces diagnostics.

Even for the GUI case though, IMO a new ERROR case should be made visible as soon as possible, rather than waiting as much as a second for the next report to arrive.

@heuristicus
Copy link
Contributor

heuristicus commented Jun 8, 2016

Definitely agree with making errors visible ASAP.

What about changing messages that are attached to the diagnostics? Does it matter that the cause of the error is different, but the level is still the same?

For example:

level: 2
name: ''
message: something went wrong...
hardware_id: ''
values: 
  - 
    key: ''
    value: ''
---
level: 2
name: ''
message: something else went wrong...
hardware_id: ''
values: 
  - 
    key: ''
    value: ''
---

Should both of these messages cause an instant update, or just the first one? Does that also apply to when the level is OK?

From my perspective, I don't think it makes sense to update instantly on changed messages in the OK or WARN levels, but it might if the diagnostic is in the ERROR level. It's a pretty small change to make though, in terms of implementation - just a check on an additional field of the message. Doing this might cause updates to be too frequent, for example if your message contains some floating point values which change a lot.

@mikepurvis
Copy link
Member Author

That seems reasonable to me.

@trainman419
Copy link
Contributor

From personal experience, I'd suggest a few adjustments to this:

  • publishing /diagnostics_agg on change doesn't reduce the latency if the diagnostic_updater library is limiting changes on /diagnostics to 1Hz, so to implement this properly, both would need to updated to publish updates immediately when items go from OK to not OK
  • Industrial users will want to know the end-to-end latency between a node transitioning to an error state and the diagnostics_agg topic reflecting that
  • Most industrial systems that have a periodic update (1Hz) and immediate update on change also limit the max publishing rate on change, so that a status item that is oscillating doesn't overwhelm the system. I think this is good idea here as well
  • I think different users will want different notification behaviors on WARN vs ERROR; it might make sense to parameterize this in the final implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants