Silicon Labs Multiprotocol addon failing intermittently causing HAP, Thread and Zigbee issues #3192
Comments
Update: The only way to restore Home Assistant to a functional state was to unplug the SkyConnect from its USB port, replug it, then reboot Unraid. Neither turning the HAOS VM off and on nor rebooting the host without the replug was enough. FWIW, I have a PCIe USB hub card in the server; however, the SkyConnect is in the motherboard's USB 2.0 port rather than the hub, to rule out the hub as a cause. |
@agners I would like to report a similar experience that may not be entirely related to the error you are receiving, although the symptoms are similar. Recently I added my HA SkyConnect to the Apple Border Router Thread network using the TLV dataset provided by Nanoleaf. I have all my devices on one mesh network and things work great when they are up. For some reason my SkyConnect will just shut off. I believe it is being overloaded somehow. The only solution to the problem is to unplug and replug the device. Below is what the Silicon Labs Multiprotocol addon logs look like. |
@spartandrew18 it seems that the otbr-agent is continuing to restart in a loop? Do you maybe have the logs from when things first failed? You can get more logs through the CLI:
You can also redirect into a location where you have access to so you can upload the logs:
|
New Update: Silicon Labs Multiprotocol addon just crashed:
Repeats itself every 100ms, which caused my HA VM to hang in responsiveness. Had to force the VM to reboot from Unraid. The addon attempts to startup after the VM/HAOS reboot, but same thing occurs:
Thread and Zigbee (as expected) have failed:- ZHA:
HomeKit Bridge (for the Thread lights via HAP) do not show any useful logs, however all 30 integrations are in constant reload mode. Rebooting the VM did not resolve the issue. Unraid can still see the NabuCasa Skyconnect plugged in and responding:
The fix for this at the moment (this time around anyway) was a full server reboot, but it took two attempts and several manual restarts of the addon to make it work and talk to the USB device, which is less than ideal. For what it's worth, the SkyConnect is plugged directly into the motherboard's USB port (ASRock Rack X570-2L2T) using the extension cable that comes with the unit, to avoid interference. Any help with this would be appreciated. Thanks. |
|
Thanks for the logs. Some interesting snippets:
It seems the OTBR detects that the radio isn't communicating anymore (it can't receive frames any longer). The CPC daemon similarly detects a "retransmit timeout"
The OTBR then ends up in a restart loop, presumably for the same reason (the radio stopped responding):
The second error in the log looks very similar. |
Is it something to do with my hardware setup? Or is it possible it can be fixed via software? |
Further update; new crash, same outcome but slight variation in the logs: ZHA:-
Silicon Labs Multiprotocol:-
Specifically the following line:
I don't believe I have seen this specific error before, but everything else is the same. Restarting the SLM addon shows the following logs:
The fix now is a physical unplug and replug of the USB into the server, then a full host restart. Simple reboot of the VM did not change the outcome. It also looks like others are reporting similar issues (#3193) so it is not an isolated incident. Devs, please advise what further information you require to resolve/investigate this. Physical host restarts to resolve a VM issue are far from ideal (in my case anyway) and we'd like to be able to assist to resolve this problem as soon as practical. |
Does anyone with this problem have an ESP module, and could they try using it as a serial port over Ethernet? |
If you would like to try experimental firmware for the SkyConnect that should eliminate the need for physical resets (source NabuCasa/silabs-firmware-builder#33), I've attached a pre-compiled copy here: If you have any step-by-step information for how to reproduce this issue, it would be very useful! |
Not much interesting about my setup (city setting, very busy 2.4 GHz band). I'm on 2023.9.2 with a mix of Wi-Fi (Sengled, Cync, generic), Zigbee (misc mfg), and 3 Nanoleaf bulbs. Flashing with the web tool seemed to work, but the probed GBL metadata seems identical during boot. Should your build appear differently?
|
Thanks! The metadata will be identical, especially when probing. This also makes sure the addon doesn't re-flash the bundled firmware. |
I've installed the experimental firmware and so far so good. Unfortunately I haven't had the failures again since my last batch of logs, even though nothing has changed... so the inconsistencies aren't helping resolve the issue. As for steps to replicate the issue, therein lies the problem; I don't actually do anything to trigger the problems, they just happen. I've set up additional logging on the unRAID host in case the registered USB ID disconnects or does something weird from its perspective, but HA doesn't give any more indication of what caused the failure other than what I provided in the original report. |
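For anyone who wants similar host-side logging, here is a minimal sketch in Python. It is not the poster's actual setup, only an illustration: it polls a device path and prints a timestamped line whenever the device appears or disappears. The path you pass to `watch` is an assumption and depends on how your host names the SkyConnect.

```python
import os
import time
from datetime import datetime, timezone


def presence_changes(samples):
    """Yield (index, state) each time the presence flag flips.

    `samples` is an iterable of booleans, one per poll.
    The first sample always counts as a change.
    """
    previous = None
    for index, present in enumerate(samples):
        if present != previous:
            yield index, "present" if present else "missing"
            previous = present


def watch(device_path, interval_s=1.0):
    """Poll `device_path` forever, printing a UTC-timestamped line on every change.

    `device_path` is hypothetical; pass whatever node your host
    creates for the dongle (for example an entry under /dev/serial/by-id/).
    """
    previous = None
    while True:
        present = os.path.exists(device_path)
        if present != previous:
            stamp = datetime.now(timezone.utc).isoformat()
            state = "present" if present else "missing"
            print(f"{stamp} {device_path} {state}")
            previous = present
        time.sleep(interval_s)
```

Running `watch(...)` in the background on the host and comparing its timestamps with the addon logs would show whether the USB device actually dropped off the bus at the time of a crash, or whether it stayed enumerated while the radio stopped responding.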
Overnight, the container went into the The new firmware didn't seem to have an effect. |
Perfect, thank you for the feedback. The watchdog isn't integrated tightly into CPC so it looks like just the CPC part is crashing, not the whole firmware. I'll post an updated one later next week. |
Just to ensure clarity: I had to un-plug and re-plug the SkyConnect. The new firmware didn't seem to have an effect. Whatever initially breaks with the dongle triggers the After re-plugging the dongle, the container comes up on the next watchdog restart. I suspect if you tweak the cpcd |
Seems reproducible: same symptoms this morning (crash loop cpcd, restarting container results in firmware flasher probe failures, replug SkyConnect fixes it). I've just re-flashed the SkyConnect with your build again, just to ensure I didn't make an error last time. Will comment if anything changes. |
Any progress made on the fix? |
To everyone having this problem, are you all using Nanoleaf devices? Because in my case, it's been a while since I had to restart the addon + unplug-replug my skyconnect. |
Nope. Not a single nanoleaf. Only hue, aqara, ikea and some thread devices. |
Unfortunately that firmware only applies to the Matter over Thread Nanoleaf bulbs. The HomeKit over Thread (non-Matter) bulbs latest firmware is 1.6.49. I've started to add Matter over Thread Nanoleaf bulbs to my mix of things, and out of the box they are 3.2.0, which the Android Nanoleaf indicates has a 'critical firmware upgrade', pointing to 3.5.37 as the FW in question. They all default to 3.5.41 once upgraded though. |
I spoke too soon, I had to restart/unplug-replug today. |
Mine also failed yesterday as well. I've now split my Skyconnect to do just OTBR work (Sonoff ZBDongle-E to do the Zigbee work), as well as put both USB devices on extension cables to minimise interference. Skyconnect crashed again about 6 hours after the split duties, and needed a replug to a different USB port to come back to life. |
Can you describe the Zigbee and Thread devices you have on your networks? Any Zigbee Green Power? I've been able to replicate a crash and will try to get a firmware out that possibly mitigates it but there may be multiple concurrent bugs here causing issues. |
I have the same crashes and have: Zigbee (values from ZHA): Thread (via Homekit Controller): Wifi (via Homekit Controller): |
Think I'll need to close this issue out as my network topology has changed substantially since first logging. I've now split OTBR and Zigbee duties into two devices (OTBR being Skyconnect, Zigbee being a Sonoff ZBDongle-E until my second Skyconnect turns up). Skyconnect is now OpenThreadRCP only, Sonoff is now Zigbee only. To answer your question though; Zigbee: Thread:
I've removed the multiprotocol addon and reconfigured my Zigbee channel to be elsewhere (multiprotocol required the channels to be the same), and reinstalled the standalone OTBR addon in HACS. |
@DunklerPhoenix Id the |
I'm not sure if I understand you correctly. The Eve Thermo is just a child and can't be used as a router or leader. |
All current versions are affected, unfortunately, so there's no version that you can roll back to. You can try out the firmware I have here but it doesn't include all of the changes in the addon so it would be easiest to wait for a few days. |
No problem, thank you :) |
I believe the latest version of the addon (2.4.4) fixes most of the problems people are facing. Let me know how it works for you. |
I've started using the custom one-off version that was posted here a while ago, and then updated to the newer versions of the add-on, and ever since that I've not had any bugs or problems. Before that things regularly broke. Well done and thanks :) |
My exact experience, thanks for the fix! |
Does this mean you're using the default firmware the plugin provides now? |
Well, I’ve updated the add-on to the latest official (2.4.4), and have auto-flash enabled. I haven’t manually flashed anything since the custom one that puddly linked to in the earlier comments. Is there a way I can query the firmware to be sure? |
In theory with auto flash enabled it should enable the current firmware, and it looks like 4.4.0 is the current firmware from a week or so ago? |
If you have automatic flashing enabled, the exact firmware in the addon will be installed. The firmware version string incorporates the Git commit and the tree hash. |
Yes, I have auto flashing on, so I'm no longer on the custom firmware but just the add-on firmware. Things work great. |
Sadly, autoflashing doesn't work for my skyconnect device:
Is there a specific firmware I should install manually using the web flasher at https://skyconnect.home-assistant.io/firmware-update/ ? The one in this thread? Something from https://github.com/NabuCasa/silabs-firmware/tree/main/RCPMultiPAN/beta ? |
@satmandu You have one of the very rare batch of SkyConnects that don't identify as a SkyConnect. You can fix it by installing the SkyConnect CP2102N Programmer addon from the development repo (https://github.com/home-assistant/addons-development/) and running it. The SkyConnect should then be identified properly after you unplug it and plug it back in. |
@puddly Thanks! That appears to have changed the USB ID....
Is this what I should be seeing? |
As the topic mentions the Multiprotocol is causing issues here as well on my SkyConnect. I have been stopping this Multiprotocol and went back to just Zigbee this morning (Before the protocol crashes); had to reconnect/re-add most of the Zigbee devices. Luckily some did come up. So as for now it is best not to use the Multiprotocol on the SkyConnect?? Before I went back to just the Zigbee on the stick: |
In my case, while the 2.4.4 seems to solve the 2.4.3 fatal crashes, it also seems to make my Thread network worse than any previous versions (I only have Matter over Thread devices ATM but I want to believe in multiprotocol hahaha 😄) 2.4.3 Thread instability logs
I don't know if this information can help but I also saw the FYI I restarted the addon (w/ SkyConnect unplug-replug), restarted HA (2024.1.5), HAOS and even the VM host, same result: it works, then it becomes unstable (several cycles of a few seconds/minutes of downtime where devices are unavailable, followed by a recovery), then it breaks completely after a while (and "a while" here is shorter than with previous versions: minutes/hours vs days/weeks in older versions like the one from this summer). Since a picture is worth a thousand words, here's my entities history. |
|
Thanks. |
If you roll back to a backup, you also roll the frame counter back to its earlier state, so all devices drop the frames because they think it is a replay attack. So this is expected. |
@MattWestb Sorry I don’t fully understand what you’re saying. |
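The frame counter point above can be illustrated with a small sketch. This is a simplified toy model in Python, not the actual Zigbee or Thread stack code: a receiver remembers the highest frame counter it has accepted from each sender and drops anything that is not strictly newer. The class and sender names are hypothetical.

```python
class FrameCounterFilter:
    """Toy receiver-side replay check.

    A frame is accepted only if its counter is strictly greater than
    the last counter accepted from the same sender.
    """

    def __init__(self):
        self.last_seen = {}  # sender -> highest accepted counter

    def accept(self, sender, counter):
        last = self.last_seen.get(sender)
        if last is not None and counter <= last:
            return False  # looks like a replayed (old) frame: drop it
        self.last_seen[sender] = counter
        return True


device = FrameCounterFilter()
normal = [device.accept("coordinator", c) for c in (1000, 1001)]

# The coordinator is restored from a backup taken when its counter was 500,
# so the next frame it sends carries counter 501.
after_restore = device.accept("coordinator", 501)

# Once the coordinator's counter passes what the device last saw,
# frames are accepted again.
caught_up = device.accept("coordinator", 1002)
```

Here `normal` is `[True, True]`, `after_restore` is `False`, and `caught_up` is `True`. In this model, devices ignore a coordinator restored from an older backup until its counter overtakes the value they remember, which is why dropped frames after a rollback are expected behaviour and not a new fault.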
I updated everything to the latest versions and am still getting an error. Restarting and reconfiguring ZHA fixes it, but it always returns after a day or two.
|
@puddly Unfortunately it didn't help. I changed the channel to a better/empty one (with the help of a scanner app), successfully got rid of the 2.4.3 Thread instability logs (after channel switch)
Timeline after channel change : Don't hesitate to ask if I can do something to help you pinpoint the issue :) |
TBH, personally, I've kinda lost my faith 🙈 😉 These are my Matter devices running on a Thread network with the dedicated/pure OpenThread Border Router add-on. Black bars are reboots 😆 😉 . (that one black bar was probably just that particular Nanoleaf bulb crashing 🙈 ) Especially since you don't have any Zigbee devices, I can really recommend switching to the pure/dedicated firmware. From a Multiprotocol standpoint, it would also be interesting to see if you have the same errors on the pure OTBR firmware (if so, then it is probably more related to your RF environment). Switching to the dedicated firmware is rather easy: Just disable multiprotocol and install the OTBR add-on (see also this guide). In the Thread configuration page you'll be able to reconfigure the newly added OTBR to use your old (the preferred) network. With that your devices should talk to your new OTBR and be reachable soon after you've reconfigured the OTBR. Btw, the pure OTBR has TREL enabled as well. TREL allows Thread border routers to pass Thread frames through WiFi/Ethernet, and hence lower the network load on the mesh. One of the main selling points of Thread IMHO 🤩 |
@agners Your screenshot is private so I can't look at it :-( |
Ok I'll try switching to pure OTBR in a few days and see how it goes.
Hum, interesting. Isn't that how the Nanoleaf Desktop app can control my Thread+Matter bulbs, or does this have nothing to do with it? Because I'm not on pure OTBR and yet the application works (not all the time, but sometimes yes)
I can see them. Could be a temporary connectivity/github issue or an ad blocker maybe ? |
It looks like a bug has been fixed in the Thread communication that could be the problem we have seen (at least for HomeKit devices, but I don't know if it also helps Matter-connected ones). |
This is not related. Afaik, the Nanoleaf app talks its own IP-based protocol with the bulbs, and that works with both add-ons. By default, any Thread border router which gets a frame routes it through the (RF-only) mesh. Imagine you have two border routers, one close to a particular bulb and one far away. Without TREL, a packet arriving at the far router travels through the RF mesh, even though that router is far away. With TREL, the frame gets forwarded to the closer router via Ethernet, and only then goes through the RF network. If you have a single border router or a smaller mesh network, it doesn't really matter. But it can make a difference for large networks. |
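The two-border-router example above can be put into numbers with a toy model. This Python sketch is an illustration only: real Thread routing uses link quality and route costs rather than plain hop counts, and the node names and topology are made up. It counts how many radio transmissions a frame needs, treating a TREL (Ethernet/Wi-Fi) link between border routers as costing no radio airtime.

```python
import heapq


def radio_hops(links, source, target):
    """Return the fewest radio transmissions needed from source to target.

    `links` is an iterable of (node_a, node_b, kind) where kind is
    "radio" (costs 1 transmission) or "trel" (costs 0, carried over IP).
    Returns None if the target is unreachable.
    """
    graph = {}
    for a, b, kind in links:
        cost = 1 if kind == "radio" else 0
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    # Dijkstra's shortest path over the radio-cost weights.
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, cost in graph.get(node, []):
            candidate = dist + cost
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None


# Hypothetical mesh: far border router, two routers, near border router, bulb.
mesh = [
    ("br_far", "router_1", "radio"),
    ("router_1", "router_2", "radio"),
    ("router_2", "br_near", "radio"),
    ("br_near", "bulb", "radio"),
]
without_trel = radio_hops(mesh, "br_far", "bulb")
with_trel = radio_hops(mesh + [("br_far", "br_near", "trel")], "br_far", "bulb")
```

In this made-up topology `without_trel` is 4 and `with_trel` is 1: the frame crosses Ethernet to the near border router and only the last hop uses the radio. With a single border router there is no TREL link to add, so both numbers would be the same.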
May you forward the issue to: And crosslink to it |
Describe the issue you are experiencing
Every few days to a week, the Silicon Labs Multiprotocol addon will stop communicating and indicate 'resource temporarily unavailable', yet it does not mention which resource this is. When this addon stops, it breaks my light integration (Nanoleaf via Thread using HAP) and Zigbee sensors via ZHA.
Restarting the addon doesn't fix the issue, nor does restarting Home Assistant. It usually requires a complete reboot of the host, and even then it will sometimes repeatedly indicate 'resource temporarily unavailable'.
What type of installation are you running?
Home Assistant OS
Which operating system are you running on?
Home Assistant Operating System
Which add-on are you reporting an issue with?
Silicon Labs Multiprotocol
What is the version of the add-on?
2.3.2
Steps to reproduce the issue
Wish I knew, as it fails whenever it wants to (sometimes 3am in the morning, sometimes 5pm in the afternoon).
System Health information
System Information
Home Assistant Community Store
AccuWeather
Home Assistant Cloud
Home Assistant Supervisor
Dashboards
Recorder
Anything in the Supervisor logs that might be useful for us?
Anything in the add-on logs that might be useful for us?
Additional information
HAOS is hosted in a VM on my Unraid server. My Unraid server is still able to see and interact with the USB device(s) when HAOS fails to. For context:
I have the SkyConnect USB as well as a “Sonoff Zigbee 3.0 USB Dongle Plus V2” (model “ZBDongle-E”). I mainly use the SkyConnect for everything, and the Sonoff is a recent purchase. Both are flashed with the latest version of the MultiPAN firmware.
I have tried both the stable and beta version of HAOS, no change to the outcome. Sometimes it works for several days, sometimes it fails > 5 times a day.