Virtual boot can lose reset vector, requires ISP programming to unbrick - Superseded #398

Closed
SpenceKonde opened this issue Apr 14, 2020 · 9 comments
Labels
no plans to change; Pre-2.0.0 bug (Bug present in versions of the core older than 2.0.0 - needs to be retested in 2.0.0)

Comments

@SpenceKonde
Owner

SpenceKonde commented Apr 14, 2020

This has been superseded by #750. Optiboot in this state is unfit for purpose. Fixing it is hard, and urboot does it correctly, uploads faster, uses less flash, and, best of all, doesn't need to be written.

When using Optiboot on a chip without hardware bootloader support (i.e., most of the parts in this core), a programming cycle that is started but not completed can leave a page erased but not rewritten. This can happen with any bootloader, of course, but on parts with hardware bootloader support the chip will just jump to the bootloader on startup and you can try to upload again. On a virtual boot part, the bootloader is only entered because the reset vector is rewritten to point to it (and another vector is "taken over" to point to the sketch).

On a virtual boot part, a particularly ill-timed reset can result in the first page being erased but not rewritten. It is unclear exactly what sequence of events leads to this: there is very little time between the erase and the write (each operation is spec'ed at 4.5 ms max, and they are right next to each other in the bootloader code). However, it has been observed twice in internal testing. The impacted board could no longer be programmed via the bootloader. Dumping the flash via ISP revealed that the first 64 words in the flash were all 0xFF; in other words, a four-page erase had been executed for that page, but the following page write had not. I suspect that either:

  • The programmer mishandled the reset pin and a second reset pulse was generated during the programming process.
  • The voltage on the processor was low and the supply was weak when the write command was initiated, leading to it resetting due to BOD - or simply malfunctioning - during the programming process.

There is one solution that I think could deal with this problem; however, implementing it will require more knowledge than I have of how to control the linker. Basically, the solution is to ensure that the first word (or two words on 16K parts) after the page boundary (of either the first page, or the first four pages) contains an appropriate JMP or RJMP instruction pointing to the bootloader. Thus, after this event occurs, execution will slide down the blank flash (0xFFFF is a no-op) until it hits the jump instruction, and jump to the bootloader. Since the subsequent upload would almost certainly involve [...] On all the parts I've looked at, this is after the end of the vector table, but generally not by very much (never more than 10-20 bytes), so not too much space would be lost.
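To make the idea concrete, here is a minimal host-side sketch of how the fallback word could be computed. It is not taken from Optiboot; the function names are hypothetical, and it only covers the RJMP case on parts with at most 8K bytes (4096 words) of flash, where the program counter wraps and a 12-bit offset can reach any address.

```c
#include <assert.h>
#include <stdint.h>

/* Encode RJMP from word address `from` to word address `to`.
 * Opcode: 1100 kkkk kkkk kkkk, PC <- PC + k + 1, k is a signed 12-bit
 * word offset. Assumption: flash is at most 4096 words, so the PC wraps
 * and masking k to 12 bits is sufficient. */
uint16_t rjmp_encode(uint16_t from, uint16_t to)
{
    assert(from < 4096 && to < 4096);
    uint16_t k = (uint16_t)(to - from - 1) & 0x0FFF;
    return (uint16_t)(0xC000u | k);
}

/* Word address where the fallback jump would sit: the first word after
 * the erase unit that holds the reset vector (one page, or four pages on
 * parts that erase four pages at a time). */
uint16_t fallback_word_addr(uint16_t page_words, uint16_t pages_per_erase)
{
    return (uint16_t)(page_words * pages_per_erase);
}
```

For example, with 16-word pages erased four at a time, `fallback_word_addr(16, 4)` gives word 64, which lines up with the 64 blank words seen in the ISP dump. The bootloader start address passed to `rjmp_encode` depends on the part and the bootloader build.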

I realize that this issue is rather serious for the people affected by it, but I don't know how often it is occurring in the field. Currently there are megaAVR projects that I consider to be a higher priority.

@SpenceKonde SpenceKonde changed the title from "Virtual boot can lose reset vector, requires ISP programming to unbrick\" to "Virtual boot can lose reset vector, requires ISP programming to unbrick" Apr 14, 2020
@SpenceKonde SpenceKonde added the Pre-2.0.0 bug (Bug present in versions of the core older than 2.0.0 - needs to be retested in 2.0.0) label Apr 14, 2020
@SpenceKonde SpenceKonde added this to the 1.3.4 release milestone Apr 14, 2020
@SpenceKonde
Owner Author

SpenceKonde commented Jun 2, 2020

Okay, a better solution was suggested: namely, to write a "shim" to the start of the second page in one upload operation, followed by a normal upload, which removes the dependence on modifications to the core, linker, or upload method. Thus it requires only changes to platform.txt.

@nerdralph

As you noted, page erase only takes about 4ms, so there is not much time between erase and write. It would be helpful to be able to reproduce the problem before attempting a fix.
To be pedantic, 0xFFFF is not a nop; it's sbrs r31, 7. I've made the mistake of treating it like a nop, by "sliding" into the reset vector from the end of the bootloader code. When the last instruction is not aligned to four bytes, the reset vector can get skipped over, resulting in the code at address 0x0002 getting run instead of 0x0000.
It wouldn't be a problem for your idea though, since a page is always 4-byte aligned.
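The alignment argument can be checked with a small host-side model. This is only a sketch under the assumption stated above, namely that every erased word behaves as a one-word `sbrs r31, 7`; the function name is hypothetical and it ignores two-word instructions at the landing address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the program counter walking through erased flash, treating
 * each 0xFFFF word as a one-word "sbrs r31, 7": when bit 7 of r31 is
 * set, the following word is skipped. Addresses are word addresses.
 * Returns the first word address at or after `stop` that is executed. */
uint16_t slide(uint16_t start, uint16_t stop, bool r31_bit7_set)
{
    assert(start <= stop);
    uint16_t pc = start;
    while (pc < stop) {
        pc += r31_bit7_set ? 2 : 1;   /* skip the next word, or not */
    }
    return pc;
}
```

Starting from an even word address (4-byte aligned), the slide lands exactly on an even `stop` whether or not bit 7 is set. Starting from an odd word address with bit 7 set, it lands one word past `stop`, which is the skipped-vector case described above.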

If your theory is correct about how the flash is getting erased but not programmed, then my suggestion would be to erase all flash pages lower than the bootloader before programming the first page.

@SpenceKonde
Owner Author

Thanks for that

0xFFFF is not SBRS r31, 7, though? Per the AVR Instruction Set Manual Rev. 0856L, SBRS has opcode 1111 111r rrrr 0bbb, so sbrs r31, 7 is 0xFFF7.
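The arithmetic can be checked with a few lines of C. This is just a sketch built from the opcode pattern quoted above; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* SBRS Rr, b per the quoted opcode pattern: 1111 111r rrrr 0bbb. */
uint16_t sbrs_encode(unsigned reg, unsigned bit)
{
    assert(reg < 32 && bit < 8);
    return (uint16_t)(0xFE00u | (reg << 4) | bit);
}

/* True if `word` matches the documented SBRS pattern, i.e. the fixed
 * bits are 1111 111x xxxx 0xxx (bit 3 must be 0). */
int is_documented_sbrs(uint16_t word)
{
    return (word & 0xFE08u) == 0xFE00u;
}
```

`sbrs_encode(31, 7)` gives 0xFFF7. 0xFFFF differs from it only in bit 3, which the documented pattern fixes at 0, so `is_documented_sbrs(0xFFFF)` is false.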

I like your clear-non-sketch-flash approach; that is a superior solution (and is what micronucleus does, and they don't have this problem). IMO their approach of putting the application vector in the last word(s) of flash before the bootloader is also better than rewriting the vector tables, though it makes the implementation somewhat awkward to tack on to the existing code.

@nerdralph

If you look at the instruction set manual, NOP encoding is 0x0000. 0xFFFF is undocumented, so the fact that it gets decoded as SBRS technically could change. However this has been discussed on places like AVRfreaks and tested by multiple people including myself.

I had done a bit of work with micronucleus years ago, but the person controlling the project was hard to work with, and had little experience with embedded development. Since Tim took over, it looks like there were big improvements. However, its main failing is the use of a custom protocol, which therefore requires an upload tool for every platform that runs the Arduino IDE. Adafruit's USB bootloader is better since it conforms to the USBtiny protocol supported by AVRdude.

@nerdralph

I just finished writing the picoboot-lib code to erase all the flash preceding the bootloader before the page write. I had to leave out EEPROM read in order to keep it within 256 bytes. I'll think about having a build option for a version that includes EE read & write and fits in 320 bytes. But first I have to do some testing on the bulk erase version.

@nerdralph

@SpenceKonde
Owner Author

I think Micronucleus got it right on how to make a resilient bootloader on parts without hardware bootloader support. And indeed, their implementation seems to be very stable. A lot of real novices do stuff with those USB tinies, even stuff that doesn't use the USB, because they... don't like having to own either serial adapters or ISP programmers? They don't like connecting those? I don't know, but they are surprisingly popular. Anyway, the only references you see to bricked boards are ones where a bootloader that only enters on external reset was uploaded to something with reset disabled (using that program that rewrites the bootloader using SPM). That is despite the fact that the whole thing is done using a bitbang USB algorithm on RC oscillators frequently tuned way outside spec, with the frequency set to 12 MHz, which was designed to not allow slack time for an imperfect clock. (They have an autotune that uses the USB, doing a binary search across OSCCAL.) The requirements, as I see them:

  1. Ensure the bootloader is entered: rewrite the reset vector (this is the one thing that seems vile to do).
  2. Once the bootloader sees that it's getting an upload, it should erase all the application flash that could contain user code, starting with the page with the highest address and working down to 0, since it will then receive and write the first page first.
  3. The vector to jump to the app should be the last word(s) of memory that could be part of user code, i.e., the word (or 2 words) before the bootloader.

Per the datasheet, if the chip experiences a brown-out or external reset, it will still (attempt to) complete the current operation (in the event of a power problem, this largely depends on how abruptly the power supply stops supplying voltage, and whether it plunges to nothing or just starts to droop). As long as that operation resulted in successful completion of the erase or write, we would be safe, and the next run would always run only the bootloader and not the unsuccessful upload:

  1. If it was reset or BOD was triggered during erase, the reset vector will point to the bootloader, but the part with the instruction to jump to the app is gone -> runs the bootloader like there was nothing else on flash, because the application was partially erased.
  2. If it was reset or BOD was triggered when the first page was being erased and that operation completed, all of the app flash would be empty - we just erased it! So it's the same as a freshly bootloaded virtual boot board, i.e., the PC trundles along all those 0xFFFF instructions, which you say are SBRS r31 bit 7, but since the page size is a power of two, it doesn't matter whether bit 7 of r31 is set or not.
  3. If it was reset or BOD was triggered when a page after the first was being written, the situation is the same as in the first case.

That means that in the event of a transition on reset during programming due to a dodgy reset circuit, the result will always be a failed programming attempt, where (as an added bonus) no broken, partly uploaded sketch will be run - instead, only the bootloader will start and wait for an upload. In the event of a power brownout which triggers a BOD reset, but does not fall to a voltage below the minimum self-programming voltage referred to in the datasheet before completion of the flash operation, the result will be the same as above.
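The case analysis above can be checked mechanically with a toy model. This is a host-side sketch, not bootloader code: it assumes every flash operation either completes or never starts (the datasheet behaviour described above), treats erased pages as something the PC simply slides through, and uses made-up function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum page_state { ERASED, WRITTEN };

/* After reset the PC starts in page 0. A written page 0 holds the reset
 * vector, which points at the bootloader. Erased pages are slid through;
 * if the PC slides into a written page other than page 0, it runs stray
 * sketch code instead of the bootloader. */
bool reaches_bootloader(const enum page_state *app, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (app[i] == WRITTEN)
            return i == 0;
    }
    return true;   /* everything erased: slide straight into the bootloader */
}

/* Erase from the highest page down to 0, then write from 0 upward.
 * Returns true if a reset after ANY completed operation still reaches
 * the bootloader. */
bool erase_first_is_safe(size_t n)
{
    enum page_state app[64];
    assert(n >= 1 && n <= 64);
    for (size_t i = 0; i < n; i++) app[i] = WRITTEN;   /* old sketch */
    for (size_t i = n; i-- > 0; ) {
        app[i] = ERASED;
        if (!reaches_bootloader(app, n)) return false;
    }
    for (size_t i = 0; i < n; i++) {
        app[i] = WRITTEN;
        if (!reaches_bootloader(app, n)) return false;
    }
    return true;
}

/* Erase and rewrite one page at a time, starting at page 0. */
bool page_by_page_is_safe(size_t n)
{
    enum page_state app[64];
    assert(n >= 1 && n <= 64);
    for (size_t i = 0; i < n; i++) app[i] = WRITTEN;   /* old sketch */
    for (size_t i = 0; i < n; i++) {
        app[i] = ERASED;
        if (!reaches_bootloader(app, n)) return false;
        app[i] = WRITTEN;
        if (!reaches_bootloader(app, n)) return false;
    }
    return true;
}
```

In this model `erase_first_is_safe(8)` holds, while `page_by_page_is_safe(8)` fails at the very first step: page 0 is erased while page 1 still holds old code. The model deliberately leaves out partially completed operations, which are the power-loss case.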

Thus, the only condition where the flash could be left in a state where the bootloader might not run upon reset would be if the power supply was above the BOD threshold at the start of a write/erase, and then fell so rapidly that it was below that critical mystery voltage at which flash corruption during write occurs (which Microchip doesn't seem to have specified or even hinted at, other than that it's below the minimum BOD voltage...). My expectation in that case would be that, viewing the flash contents, you would find the page on which the failure occurred with content that matched neither blank memory nor the old or new sketch (depending on whether it was an erase or a write that failed). In the event that that occurred, there are only four critical points in the process which result in anything other than staying in the bootloader: the times it is writing or erasing the first page, or writing or erasing the last page. And in the event that it was the last page that failed, the bootloader would still run, even if the app start would fail, allowing further programming without use of ISP. Thus, only if the board was disconnected from power right in the middle of the upload process could it fail and require ISP programming (or HV if reset is disabled).

Currently Optiboot does 1, but does not do 2 or 3... As a result the first programming operation it does is fatal if it fails (the erase of the first page). And that's the one that's most likely to have both a power glitch AND a reset glitch, being as it's right at the start of the process :-/

@SpenceKonde
Owner Author

SpenceKonde commented Jul 30, 2020

I'm actually sort of curious about how hard you have to work to actually corrupt the flash. I think I'll wire up a PFET to switch the power of the "victim", a classic AVR ATtiny... The victim would have a sketch that sets one port to pullup and then waits for one of the pins to go low, at which point it will acknowledge this by driving a pulled-up pin low, pause briefly, and then, depending on which pin went low, either write one of 4 pages of flash that were empty, or erase one of 4 pages of flash that were full of known content. After filling the page buffer (if applicable), right before kicking off SPM, it will drive one final pin low. The other device, the "tormentor", will wait a specified amount of time (less than the time it takes to write to flash, of course, so it's mid-write), and then drive the gate of the PFET high, cutting off the power to the victim. It then resets the victim, does the same for the next pin, and repeats with 3 different delays. Read out with ISP, confirm that I got the expected trashed flash.

Could repeat with different BOD settings, values of decoupling caps, etc.

Am I curious enough to build it? Not with the sort of to-do list I have...

But you've got to admit that it would be cool to be able to add a note in the docs, instead of warning in vague, general terms, specifying some examples of what conditions do and do not cause flash corruption.

@ericdraken

ericdraken commented Dec 8, 2020

If anyone is interested, I've encountered this in the wild with a slow-decay power supply: https://ericdraken.com/digispark-blinkstick-microcontroller-hacking/#eeprom. I've enabled BOD now, but I'm still exploring NOPing the first page to cover my bases.

@SpenceKonde SpenceKonde removed this from the 2.0.0 release milestone Feb 11, 2022
@SpenceKonde SpenceKonde added this to the Some Future Version milestone Aug 14, 2022
@SpenceKonde SpenceKonde changed the title from "Virtual boot can lose reset vector, requires ISP programming to unbrick" to "Virtual boot can lose reset vector, requires ISP programming to unbrick - Superseded" Sep 9, 2023