Health metrics for device #238

Open
oscgonfer opened this issue May 10, 2023 · 3 comments

Comments


oscgonfer commented May 10, 2023

This issue opens the discussion about health metrics for a device. We currently see recurring problems once devices are deployed, most commonly connectivity issues and hardware faults. Users need an easier debugging process, which could be provided by metrics and analytics computed from the data, together with ad-hoc metrics reported by the physical device.

Initially, we are addressing this offline, with custom requests to the API, but down the line the process should be integrated into the platform for easier debugging.

To start, we suggest adding a property to the device that indicates its health and collects various metrics, some calculated on the physical device and some on the platform side. Current proposals:

Platform checks

  1. Total number of points stored versus the maximum possible number (based on device creation and last update). For this to work, we would need to either assume a default interval (1 minute for all sensors except PM) or have a way to retrieve it from the physical device, for instance via hardware_info.
  2. Delta between reading time and reception time: this can indicate connectivity issues, but it only makes sense if the publication interval is known. As far as we understand, ingestion time is currently not stored, so it is probably not worth including.
  3. Missing sensors (a combined platform and device check): a sensor (or all the sensors) disappearing could highlight particular hardware issues.
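
As a rough illustration of check 1, the sketch below compares the number of stored points with the maximum a device could have produced between creation and last update. The function names are hypothetical, and the 1-minute default interval is the assumption mentioned above, not a value read from the device.

```python
from datetime import datetime, timedelta

# Assumed default reading interval (1 minute), used when the device
# has not reported its own interval.
DEFAULT_INTERVAL = timedelta(minutes=1)


def expected_points(created_at, last_update, interval=DEFAULT_INTERVAL):
    """Maximum number of readings possible between two instants."""
    if last_update < created_at:
        raise ValueError("last_update precedes created_at")
    # One reading at creation time, then one per full interval elapsed.
    return (last_update - created_at) // interval + 1


def completeness(stored_points, created_at, last_update, interval=DEFAULT_INTERVAL):
    """Fraction of the expected readings that were actually stored (capped at 1.0)."""
    expected = expected_points(created_at, last_update, interval)
    return min(stored_points / expected, 1.0)


created = datetime(2023, 5, 10, 0, 0)
updated = datetime(2023, 5, 10, 1, 0)
ratio = completeness(45, created, updated)
```

With one hour between creation and last update, 61 readings are possible at a 1-minute interval, so 45 stored points give a ratio of 45/61.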

Firmware checks

These could be sent on a /device/<token>/health MQTT topic and ingested into the health table for later use. They could be sent ad hoc or on boot:

  1. Missing sensors, as above
  2. Connectivity timeouts
  3. SD card issues
  4. Too-frequent resets
  5. Last reset reason (available via firmware)
  6. Reason for WARNING state of the device
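
To make the firmware checks more concrete, here is a minimal sketch of what a message on that topic might look like. Only the topic pattern comes from the proposal above; every field name in the payload is a hypothetical placeholder, and the actual publishing over MQTT is left out.

```python
import json


def health_topic(token):
    """Build the health topic for a device token, following /device/<token>/health."""
    return f"/device/{token}/health"


def build_health_report(missing_sensors, connectivity_timeouts, sd_card_ok,
                        resets_last_day, last_reset_reason, warning_reason=None):
    """Serialise the six proposed firmware checks as JSON.

    All key names here are illustrative assumptions, not an agreed schema.
    """
    report = {
        "missing_sensors": list(missing_sensors),
        "connectivity_timeouts": connectivity_timeouts,
        "sd_card_ok": sd_card_ok,
        "resets_last_day": resets_last_day,
        "last_reset_reason": last_reset_reason,
        "warning_reason": warning_reason,
    }
    return json.dumps(report, sort_keys=True)


topic = health_topic("abc123")
payload = build_health_report(
    missing_sensors=["pm"],
    connectivity_timeouts=3,
    sd_card_ok=True,
    resets_last_day=1,
    last_reset_reason="watchdog",
)
```

Sending plain JSON keeps the report flexible: new checks can be added in firmware without a schema change on the platform side.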

@pral2a @vicobarberan please provide inputs to build it progressively.

@oscgonfer oscgonfer added this to the 1223 milestone May 10, 2023
@oscgonfer oscgonfer assigned oscgonfer and unassigned oscgonfer May 10, 2023

oscgonfer commented May 31, 2023

Adding to this topic: one possibility would be to implement simple device metrics, as already suggested in #100 (comment), for those checks that can be done on the platform.

A proposal could be to add a health table linked to the device which would contain:

health:
    # on device data ingestion, calculated by the platform
    total_data_points:  # total number of data points
    data_gaps:  # % of data gaps over the whole period, based on the sample interval (to retrieve from hardware info?)
    missing_sensors:  # list of sensors that were present before but no longer are
    # filled from a health topic on MQTT; stored as JSON directly to allow flexibility
    hardware_report:  # JSON sent directly from the hardware
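
For discussion, the same record could be modelled in code roughly as follows. This is only a sketch: the field names mirror the proposal above, while the class name, the types, and the `record_ingestion` helper are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DeviceHealth:
    # Calculated by the platform on data ingestion.
    total_data_points: int = 0
    data_gaps: float = 0.0  # percentage of gaps over the whole period
    missing_sensors: list[str] = field(default_factory=list)
    # Stored as received from the hardware, without a fixed schema.
    hardware_report: dict[str, Any] = field(default_factory=dict)

    def record_ingestion(self, points: int) -> None:
        """Hypothetical helper: bump the counter when new readings arrive."""
        self.total_data_points += points


health = DeviceHealth()
health.record_ingestion(3)
```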

Data gaps / completeness

To be done at ingestion time by ahoy or a similar library. The kit's firmware will post its reading and publication intervals on boot or on config change (TBC), on a /device/<token>/config topic that would fill a per-device config table.

This could also provide a metric representing the variability of the posting interval, and raise a flag for a sensor that is not posting data regularly.
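
One possible way to compute both numbers from a list of reading timestamps is sketched below. The gap definition (any spacing larger than 1.5 times the expected interval) and the use of the coefficient of variation as the variability metric are assumptions made for illustration, not decisions taken in this issue.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def interval_stats(timestamps, expected_interval, tolerance=1.5):
    """Return the gap percentage and interval variability for a series of readings."""
    ts = sorted(timestamps)
    deltas = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    span = (ts[-1] - ts[0]).total_seconds() if ts else 0.0
    if not deltas or span == 0:
        return {"gap_pct": 0.0, "interval_cv": 0.0}
    expected_s = expected_interval.total_seconds()
    # Time lost in gaps: the part of each over-long spacing beyond one interval.
    missing = sum(d - expected_s for d in deltas if d > tolerance * expected_s)
    return {
        "gap_pct": 100.0 * missing / span,
        # Coefficient of variation: 0 for perfectly regular posting.
        "interval_cv": pstdev(deltas) / mean(deltas),
    }


start = datetime(2023, 5, 31, 12, 0)
# Readings at minutes 0, 1, 2, 5 and 6: two readings are missing.
readings = [start + timedelta(minutes=m) for m in (0, 1, 2, 5, 6)]
stats = interval_stats(readings, timedelta(minutes=1))
```

In this example, 120 of the 360 seconds covered fall inside a gap, so `gap_pct` comes out at about 33%. A flag could be raised when `interval_cv` exceeds some threshold, which would still need to be chosen.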

Missing sensors

The kit's firmware will send data as usual, and the platform needs to know what to expect. This is currently done through blueprints (kits), but we would like to change this as discussed in #241. The user would be presented with a list of sensors, either during onboarding or on the kit edit page (device edit), and could select which sensors are expected and whether a notification should be sent when one of them has not been received for longer than a certain threshold (related to the reading/publication intervals above).

The user could select notifications on this page, and misbehaving sensors could be marked in the front end:

[image]
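
The platform-side check described in this section could look roughly like this. It is a sketch under assumptions: the function name, the sensor names, and the 30-minute threshold are made up for the example, and the expected set stands in for whatever the user selects on the edit page.

```python
from datetime import datetime, timedelta


def find_missing_sensors(expected, last_seen, now, threshold):
    """Return the expected sensors that were never seen or have gone silent.

    expected:  set of sensor names the user marked as expected
    last_seen: mapping of sensor name to the time of its latest reading
    threshold: how long a sensor may stay silent before it counts as missing
    """
    return sorted(
        name for name in expected
        if name not in last_seen or now - last_seen[name] > threshold
    )


now = datetime(2023, 5, 31, 12, 0)
last_seen = {
    "temperature": now - timedelta(minutes=1),
    "noise": now - timedelta(hours=2),
}
missing = find_missing_sensors(
    expected={"temperature", "noise", "pm"},
    last_seen=last_seen,
    now=now,
    threshold=timedelta(minutes=30),
)
```

Here `missing` contains `noise` (silent for two hours) and `pm` (never received). In practice the threshold would probably be derived from the reported reading and publication intervals.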

Hardware report

The kit's firmware would post at least these new sensors:

  • WiFi RSSI: 'String'
  • rcause: 'String'
  • sd-card status: 'String'

These shouldn't be shown in the frontend, to avoid confusion, but would support health diagnosis.
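
On the platform side, such a report could be turned into a few diagnostic hints, for example as below. The three keys are the ones listed above; the assumption that RSSI arrives as a numeric string in dBm, the -80 dBm cut-off, and the function name are all illustrative.

```python
import json

# Assumed cut-off below which the WiFi signal is considered weak.
WEAK_RSSI_DBM = -80


def diagnose(report_json):
    """Extract simple diagnostic hints from a hardware report sent as JSON."""
    report = json.loads(report_json)
    rssi = report.get("WiFi RSSI")
    try:
        weak_wifi = int(rssi) < WEAK_RSSI_DBM
    except (TypeError, ValueError):
        weak_wifi = None  # RSSI missing or not numeric: nothing to conclude
    return {
        "weak_wifi": weak_wifi,
        "last_reset_reason": report.get("rcause"),
        "sd_card_status": report.get("sd-card status"),
    }


hints = diagnose('{"WiFi RSSI": "-85", "rcause": "watchdog", "sd-card status": "ok"}')
```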

@oscgonfer
Contributor Author

Summary of action points for now:

  • Check availability of simple metrics that can be gathered in RoR application @timcowlishaw
  • Assess what needs to be done externally and think of architecture for triggering that (RPC?)
  • Use current hardware_info table and mqtt topic for prototyping metrics coming from hardware directly

@oscgonfer
Contributor Author

#288
