Virtual boot can lose reset vector, requires ISP programming to unbrick - Superseded #398
Comments
Okay, a better solution was suggested - namely, to write a "shim" to the start of the second page in one upload operation, followed by a normal upload. This removes the dependence on modifications to the core, linker, or upload method, so it requires only changes to platform.txt.
As you noted, page erase only takes about 4ms, so there is not much time between erase and write. It would be helpful to be able to reproduce the problem before attempting a fix. If your theory is correct about how the flash is getting erased but not programmed, then my suggestion would be to erase all flash pages lower than the bootloader before programming the first page.
Thanks for that. 0xFFFF is not SBRS r31, 7, though? Per the AVR Instruction Set Manual (Rev. 0856L), SBRS has opcode 1111 111r rrrr 0bbb, so sbrs r31, 7 is 0xFFF7. I like your clear non-sketch flash approach; that is a superior solution (and it is what micronucleus does, and they don't have this problem). IMO their approach of putting the application vector in the last word(s) of flash before the bootloader is also better than rewriting the vector tables, though it makes the implementation somewhat awkward to tack on to the existing code.
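The encoding argument in this comment can be checked on the host. This is a minimal Python sketch that assumes only the documented SBRS bit pattern quoted above (1111 111r rrrr 0bbb); the function names are invented for illustration and it says nothing about how real silicon decodes the undocumented word 0xFFFF.

```python
def encode_sbrs(reg: int, bit: int) -> int:
    """Encode SBRS Rr, b using the documented pattern 1111 111r rrrr 0bbb."""
    if not 0 <= reg <= 31:
        raise ValueError("register must be r0..r31")
    if not 0 <= bit <= 7:
        raise ValueError("bit must be 0..7")
    return 0xFE00 | (reg << 4) | bit


def is_documented_sbrs(word: int) -> bool:
    """True if the word matches the documented pattern (bit 3 must be 0)."""
    return (word & 0xFE08) == 0xFE00


print(hex(encode_sbrs(31, 7)))     # 0xfff7
print(is_documented_sbrs(0xFFF7))  # True
print(is_documented_sbrs(0xFFFF))  # False: bit 3 is set
```

So 0xFFFF differs from the documented `sbrs r31, 7` only in bit 3, which is consistent with it being an undocumented encoding rather than the documented one.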
If you look at the instruction set manual, NOP encoding is 0x0000. 0xFFFF is undocumented, so the fact that it gets decoded as SBRS technically could change. However, this has been discussed in places like AVRfreaks and tested by multiple people, including myself. I had done a bit of work with micronucleus years ago, but the person controlling the project was hard to work with, and had little experience with embedded development. Since Tim took over, it looks like there were big improvements. However, its main failing is the use of a custom protocol, which therefore requires an upload tool for every platform that runs the Arduino IDE. Adafruit's USB bootloader is better since it conforms to the USBtiny protocol supported by AVRdude.
I just finished writing the picoboot-lib code to erase all the flash preceding the bootloader before the page write. I had to leave out EEPROM read in order to keep it within 256 bytes. I'll think about having a build option for a version that includes EE read & write and fits in 320 bytes. But first I have to do some testing on the bulk erase version.
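To illustrate why bulk-erasing the application area helps, here is a rough host-side Python model. It is not picoboot-lib or Optiboot code. Assumptions: flash words are symbolic, a blank word is simply skipped until execution reaches something else, the page count and page size are hypothetical, and pages are erased from the highest down so the page holding the reset vector goes last (the comment above does not state an erase order).

```python
BLANK = 0xFFFF            # erased flash word
RJMP_BOOT = "rjmp boot"   # symbolic: patched reset vector pointing at the bootloader
APP = "app code"          # symbolic: any word of the old sketch
PAGE_WORDS = 32           # hypothetical page size in words
APP_PAGES = 8             # hypothetical number of pages below the bootloader


def make_flash():
    """Application area holding an old sketch with a patched reset vector."""
    flash = [[APP] * PAGE_WORDS for _ in range(APP_PAGES)]
    flash[0][0] = RJMP_BOOT
    return flash


def reaches_bootloader(flash):
    """Model of reset: skip blank words, then look at the first real word."""
    for page in flash:
        for word in page:
            if word != BLANK:
                return word == RJMP_BOOT
    return True  # everything is blank, so execution runs into the bootloader


def erase_high_to_low(flash, interrupted_after):
    """Erase pages from the top down, stopping after N completed erases."""
    for done, page in enumerate(reversed(range(len(flash)))):
        if done == interrupted_after:
            return
        flash[page] = [BLANK] * PAGE_WORDS


def survives_interrupt(interrupted_after):
    flash = make_flash()
    erase_high_to_low(flash, interrupted_after)
    return reaches_bootloader(flash)


# In this model, an interruption between any two page erases is recoverable.
print(all(survives_interrupt(n) for n in range(APP_PAGES + 1)))

# Contrast: erasing only the first page leaves old sketch code as the first real word.
first_page_only = make_flash()
first_page_only[0] = [BLANK] * PAGE_WORDS
print(reaches_bootloader(first_page_only))
```

The model only covers interruptions that fall between completed page operations; a page corrupted mid-erase by a collapsing supply is outside what it can show.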
I think Micronucleus got it right on how to make a resilient bootloader on parts without hardware bootloader support. And indeed, their implementation seems to be very stable. A lot of real novices do stuff with those USB tinies, even for projects that don't use the USB, because they... don't like having to own either serial adapters or ISP programmers? They don't like connecting those? I don't know, but they are surprisingly popular. Anyway - the only references you see to bricked boards are ones where a bootloader that only enters on external reset was uploaded to something with reset disabled (using that program that rewrites the bootloader using SPM). That is despite the fact that the whole thing is done using a bitbang USB algorithm on RC oscillators frequently tuned way outside spec, with the frequency set to 12 MHz, which was designed to not allow slack time for an imperfect clock. (They have an autotune that uses the USB, doing a binary search across OSCCAL.)
Per the datasheet, if the chip experiences a brown-out or external reset, it will still (attempt to) complete the current operation (in the event of a power problem, this largely depends on how abruptly the power supply stops supplying voltage, and whether it plunges to nothing or just starts to droop). As long as that operation resulted in successful completion of the erase or write, we would be safe, and the next run would always run only the bootloader and not the unsuccessful upload:
That means in the event of a transition on reset during programming due to a dodgy reset circuit, the result will always be a failed programming attempt where (as an added bonus) no broken, partly uploaded sketch will be run; instead, only the bootloader will start and wait for an upload. In the event of a power brownout which triggers a BOD reset, but where the voltage does not fall below the minimum self-programming voltage referred to in the datasheet before the flash operation completes, the result will be the same as above.
Thus, the only condition where the flash could be left in a state where the bootloader might not run upon reset would be if the power supply was above the BOD threshold at the start of a write/erase, and then fell so rapidly that it dropped below the critical mystery voltage at which flash corruption during a write occurs (which Microchip doesn't seem to have specified or even hinted at, other than that it's below the minimum BOD voltage). My expectation in that case would be that, viewing the flash contents, you would find the page on which the failure occurred with content that matched neither blank memory nor the old or new sketch (depending on whether it was an erase or a write that failed). In the event that this occurred, there are only four critical points in the process which result in anything other than staying in the bootloader: the times it is writing or erasing the first page, or writing or erasing the last page. And in the event that it was the last page that failed, the bootloader would still run, even if the app start would fail, allowing further programming without use of ISP. Thus, only if the board was disconnected from power right in the middle of the upload process could it fail and require ISP programming (or HV if reset is disabled).
Currently Optiboot does 1, but does not do 2 or 3... As a result, the first programming operation it does (the erase of the first page) is fatal if it fails. And that's the one that's most likely to have both a power glitch AND a reset glitch, being as it's right at the start of the process :-/
I'm actually sort of curious about how hard you have to work to actually corrupt the flash. I think I'll wire up a PFET to switch the power of the "victim", a classic AVR ATtiny. The victim would have a sketch that sets one port to pullup, and then waits for one of the pins to go low, at which point it will acknowledge this by driving a pulled-up pin low, pause briefly, and then, depending on which pin went low, either write one of 4 pages of flash that were empty, or erase one of 4 pages of flash that were full of known content. After filling the page buffer (if applicable), right before kicking off SPM, it will drive one final pin low. The other device - the "tormentor" - will wait a specified amount of time (less than the time it takes to write to flash, of course, so it's mid-write), and then drive the gate of the PFET high, cutting off the power to the victim. It then resets the victim, does the same to the next pin, and repeats with 3 different delays. Read out with ISP, confirm that I got the expected trashed flash. Could repeat with different BOD settings, values of decoupling caps, etc. Am I curious enough to build it? Not with the sort of to-do list I have... But you've got to admit that it would be cool to be able to add a note in the docs that, instead of warning in vague, general terms, specifies some examples of what conditions do and do not cause flash corruption.
If anyone is interested, I've encountered this in the wild with a slow-decay power supply: https://ericdraken.com/digispark-blinkstick-microcontroller-hacking/#eeprom. I've enabled BOD now, but I'm still exploring NOPing the first page to cover my bases.
This has been superseded by #750. Optiboot in this state is unfit for purpose. Fixing it is hard, and urboot does it correctly, uploads faster, uses less flash, and, best of all, doesn't need to be written.
When using Optiboot on a chip without hardware bootloader support (i.e., most of the parts in this core), a programming cycle that is started but not completed can result in a page being erased but not rewritten. This can happen with any bootloader, of course - but on parts with hardware bootloader support, the chip will just jump to the bootloader on startup and you can try to upload again. On a virtual boot part, the bootloader is only entered because the reset vector is rewritten to point to it (and another vector is "taken over" to point to the sketch).
On a virtual boot part, a particularly ill-timed reset can result in the first page being erased, but not rewritten. It is unclear exactly what sequence of events leads to this - there is very little time between the erase and write - each operation is spec'ed at 4.5ms max, and they are right next to each other in the bootloader code. However, it has been observed twice in internal testing. The impacted board could no longer be programmed via the bootloader. Dumping the flash via ISP revealed that the first 64 words in the flash were all 0xFF - in other words, a four-page erase had been executed there, but the following page write had not. I suspect that either:
There is one solution that I think could deal with this problem; however, implementing it will require more knowledge than I have of how to control the linker. Basically, the solution is to ensure that the first word (or two words on 16K parts) after the page boundary (of either the first page, or the first four pages) contains an appropriate JMP or RJMP instruction pointing to the bootloader. Thus, after this event occurs, execution will slide down the blank flash (0xFFFF is a no-op) until it hits the jump instruction, and jump to the bootloader. Since the subsequent upload would almost certainly involve On all the parts I've looked at, this is after the end of the vector table - but generally not by very much, never more than 10-20 bytes - so not too much space would be lost.
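As a rough illustration of what the one-word version of that jump would contain, the Python sketch below computes an RJMP opcode from the documented encoding (1100 kkkk kkkk kkkk, target = PC + k + 1, with k a signed 12-bit word offset). The word addresses are hypothetical, not taken from any real Optiboot build, and the wrap-around branch relies on the instruction set manual's statement that RJMP can reach the whole program memory on parts with no more than 4K words; larger parts would need the two-word JMP, which is not shown.

```python
def encode_rjmp(from_word: int, to_word: int, flash_words: int) -> int:
    """Opcode for an RJMP located at from_word that lands on to_word."""
    offset = to_word - (from_word + 1)
    if flash_words > 4096 and not -2048 <= offset <= 2047:
        raise ValueError("target out of RJMP range; a two-word JMP is needed")
    # On parts with <= 4K words the program counter wraps, so keeping only
    # the low 12 bits of the offset still reaches the target.
    return 0xC000 | (offset & 0x0FFF)


def rjmp_target(from_word: int, opcode: int, flash_words: int) -> int:
    """Decode an RJMP opcode back into the word address it jumps to."""
    k = opcode & 0x0FFF
    if k >= 0x800:          # sign-extend the 12-bit offset
        k -= 0x1000
    return (from_word + 1 + k) % flash_words


# Hypothetical 8K part: 4096 words, shim at word 0x20, bootloader at word 0xF00.
shim = encode_rjmp(0x20, 0xF00, 4096)
print(hex(shim))                           # 0xcedf
print(hex(rjmp_target(0x20, shim, 4096)))  # 0xf00
```

Decoding the opcode again is a cheap sanity check before a shim word is written anywhere it cannot easily be fixed.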
I realize that this issue is rather serious for people using it - but I don't know how often it is occurring in the field. Currently there are megaAVR projects that I consider to be a higher priority.