OTA upgrade is not fail-safe

tve
Posts: 123
Joined: Sun Feb 15, 2015 4:33 pm

OTA upgrade is not fail-safe

Postby tve » Fri May 15, 2015 10:24 am

I am testing how robust the OTA upgrade is. I am using my own upgrade server (i.e., I flash the next partition myself), and once the upload is done I call system_upgrade_flag_set(UPGRADE_FLAG_FINISH) followed by system_upgrade_reboot(). As a test, I was running in user1.bin, uploaded a garbage binary into user2.bin, and rebooted. After the reboot the following happened:

 ets Jan  8 2013,rst cause:4, boot mode:(3,7)

wdt reset
load 0x40100000, len 1320, room 16
tail 8
chksum 0xb8
load 0x3ffe8000, len 776, room 0
tail 8
chksum 0xd9
load 0x3ffe8308, len 412, room 0
tail 12
chksum 0xb9
csum 0xb9

2nd boot version : 1.3(b3)
  SPI Speed      : 40MHz
  SPI Mode       : QIO
  SPI Flash Size : 4Mbit
jump to run user2

Fatal exception (0):
epc1=0x402406bc, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x402406bc, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):


This fatal exception keeps recurring. After a manual reset the same thing happens:

 ets Jan  8 2013,rst cause:2, boot mode:(3,7)

load 0x40100000, len 1320, room 16
tail 8
chksum 0xb8
load 0x3ffe8000, len 776, room 0
tail 8
chksum 0xd9
load 0x3ffe8308, len 412, room 0
tail 12
chksum 0xb9
csum 0xb9

2nd boot version : 1.3(b3)
  SPI Speed      : 40MHz
  SPI Mode       : QIO
  SPI Flash Size : 4Mbit
jump to run user2

Fatal exception (0):
epc1=0x402406bc, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x402406bc, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x402406bc, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000


The chip is now unrecoverable except by flashing proper firmware via the serial port. I'm rather disappointed: I was expecting the bootloader to detect the problem and reboot back into user1.bin.

It seems to me that you need to add a step to your upgrade protocol: after booting into the new partition, the new code should be required to make a "confirm" call to lock in the switch to the new firmware. At the next reset, if the confirm call hasn't been made, the bootloader should revert to the old firmware. This way the new code can do an overall sanity check and then call confirm. It could even be user-driven, i.e. the user could be required to click a button on a web interface. It's really up to the firmware to determine this.
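
The confirm step proposed above can be sketched as a small state machine. This is only an illustration of the idea, not how the Espressif bootloader works: boot_record_t, ota_request_switch(), ota_select_slot() and ota_confirm() are hypothetical names, and the record is assumed to live in a flash sector that both the bootloader and the firmware can update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical boot record; none of these names exist in the Espressif SDK. */
typedef struct {
    uint8_t active;     /* slot known to be good: 1 (user1.bin) or 2 (user2.bin) */
    uint8_t pending;    /* slot on trial, or 0 if no upgrade is in flight */
    bool    trial_used; /* true once the pending slot has been booted once */
} boot_record_t;

/* Called by the old firmware after it has written the new image. */
void ota_request_switch(boot_record_t *r)
{
    assert(r->active == 1 || r->active == 2);
    r->pending = (r->active == 1) ? 2 : 1;
    r->trial_used = false;
}

/* Called by the bootloader at every reset; returns the slot to run. */
uint8_t ota_select_slot(boot_record_t *r)
{
    if (r->pending != 0) {
        if (!r->trial_used) {
            r->trial_used = true;   /* give the new image exactly one chance */
            return r->pending;
        }
        r->pending = 0;             /* trial boot was never confirmed: revert */
        r->trial_used = false;
    }
    return r->active;
}

/* Called by the new firmware once its own sanity checks have passed. */
void ota_confirm(boot_record_t *r)
{
    if (r->pending != 0) {
        r->active = r->pending;
        r->pending = 0;
        r->trial_used = false;
    }
}
```

With a record like this, a garbage image gets one trial boot; the next reset (watchdog or manual) falls back to the old slot because ota_confirm() was never reached.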

tve
Posts: 123
Joined: Sun Feb 15, 2015 4:33 pm

Re: OTA upgrade is not fail-safe

Postby tve » Fri May 15, 2015 10:32 am

Interestingly, after the above test I proceeded to serially flash user1.bin (only the 236KB user1.bin, not any of the other parts, such as the bootloader or the system params). At the end of the serial flash the following happened:

load 0x40100000, len 1320, room 16
tail 8
chksum 0xb8
load 0x3ffe8000, len 776, room 0
tail 8
chksum 0xd9
load 0x3ffe8308, len 412, room 0
tail 12
chksum 0xb9
csum 0xb9

2nd boot version : 1.3(b3)
  SPI Speed      : 40MHz
  SPI Mode       : QIO
  SPI Flash Size : 4Mbit
jump to run user2

error magic!
first boot failed, reboot to try backup bin


I then reset and it booted fine into user1.bin. So somehow the bootloader could tell that user2.bin contained garbage. It should have been able to determine that earlier!

ESP_Faye
Posts: 1646
Joined: Mon Oct 27, 2014 11:08 am

Re: OTA upgrade is not fail-safe

Postby ESP_Faye » Fri May 15, 2015 11:39 am

Hi,

How did you make the garbage binary? Was it modified (e.g., some bytes changed) from a normal user2.bin?

We do check whether it is a user bin, but we only check some parts of it. You have raised a very good suggestion, and we are working on checking the whole user bin. Could you offer more details about how you made the garbage bin?

tve
Posts: 123
Joined: Sun Feb 15, 2015 4:33 pm

Re: OTA upgrade is not fail-safe

Postby tve » Fri May 15, 2015 12:42 pm

I flashed the segment that is normally flashed to 0x00000 when using esp_open_sdk to produce a 0x00000.bin and a 0x40000.bin. I'm attaching it (ignore the .xls extension, it's just there so I can attach it here).
BTW, I noticed that the checksum in the user1.bin/user2.bin binaries does not include the irom segment. Why is that?
Attachments
0x00000.bin.xls
(35.53 KiB) Downloaded 361 times
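
For context on the checksum question: as implemented in esptool, the ESP8266 image checksum is a running XOR over segment payload bytes, seeded with 0xEF. The sketch below shows only that XOR step; the helper name segment_checksum() is made up for illustration, and which segments the bootloader actually feeds into it is exactly the open question raised in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Seed value used by esptool for the image checksum. */
#define CHECKSUM_SEED 0xEF

/* Fold one segment's payload into a running XOR checksum.
 * Hypothetical helper, not an SDK function. */
uint8_t segment_checksum(uint8_t state, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        state ^= data[i];
    }
    return state;
}
```

Any segment that is never passed through this function can be corrupted without changing the final checksum byte, which is why leaving a segment out weakens the check.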

tve
Posts: 123
Joined: Sun Feb 15, 2015 4:33 pm

Re: OTA upgrade is not fail-safe

Postby tve » Fri May 15, 2015 2:54 pm

There is a further bug, which is that after all of the above the system_upgrade_flag_set(UPGRADE_FLAG_FINISH) + system_upgrade_reboot() sequence reboots into user1.bin instead of user2.bin. In other words:
    - running in user1.bin
    - upload garbage user2.bin
    - call system_upgrade_flag_set(UPGRADE_FLAG_FINISH) + system_upgrade_reboot(), bootloader fails
    - upload fresh user1.bin via serial port
    - system restarts into user1.bin at end of upload
    - query system_upgrade_userbin_check, it returns 0
    - upload user2.bin in accordance with result from system_upgrade_userbin_check
    - call system_upgrade_flag_set(UPGRADE_FLAG_FINISH) + system_upgrade_reboot()
    - the system incorrectly reboots into user1.bin

ESP_Faye
Posts: 1646
Joined: Mon Oct 27, 2014 11:08 am

Re: OTA upgrade is not fail-safe

Postby ESP_Faye » Fri May 15, 2015 4:27 pm

Hi,

Did you write blank.bin into flash as initialization with the flash download tool?

blank.bin always needs to be downloaded when using the flash download tool.

For 512KB flash, blank.bin goes at 0x7E000; for 1MB flash, at 0xFE000; for 2MB flash, at 0x1FE000.
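
The three addresses above all sit 0x2000 bytes below the end of the flash. A small sketch of that arithmetic follows; the function name blank_bin_address() is made up here, and the "flash size minus 0x2000" rule is inferred only from the three values listed, so treat it as an assumption for other flash sizes.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of blank.bin from the end of flash, inferred from the
 * three addresses given in the thread (assumption, not an SDK constant). */
#define BLANK_BIN_OFFSET_FROM_END 0x2000u

/* Hypothetical helper: where to write blank.bin for a given flash size. */
uint32_t blank_bin_address(uint32_t flash_size_bytes)
{
    assert(flash_size_bytes >= 0x80000u); /* thread only covers 512KB and up */
    return flash_size_bytes - BLANK_BIN_OFFSET_FROM_END;
}
```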

tve
Posts: 123
Joined: Sun Feb 15, 2015 4:33 pm

Re: OTA upgrade is not fail-safe

Postby tve » Sat May 16, 2015 11:07 am

Blank.bin is annoying because it resets all the wifi settings...
What I'm concerned about is that the problem I encountered highlights a deeper issue. It seems that the bootloader keeps track of "next partition for upgrade" in the settings in flash as opposed to basing it off which partition is currently running. If I'm currently running in user1.bin, then an upgrade should always go into user2.bin and always reboot into user2.bin, regardless of where the last upgrade happened or what's stored in the system settings.
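
The rule described above fits in a few lines. In this sketch the running slot is passed in as a plain value; on a real device it would presumably come from system_upgrade_userbin_check(), which the thread reports as returning 0 while running user1.bin. FW_BIN1, FW_BIN2 and next_upgrade_slot() are names chosen for this illustration, and the value 1 for user2.bin is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define FW_BIN1 0 /* value reported in the thread while running user1.bin */
#define FW_BIN2 1 /* assumed value for user2.bin */

/* Hypothetical helper: the upgrade target depends only on the slot that is
 * running right now, never on state saved from an earlier upgrade attempt. */
uint8_t next_upgrade_slot(uint8_t running)
{
    assert(running == FW_BIN1 || running == FW_BIN2);
    return (running == FW_BIN1) ? FW_BIN2 : FW_BIN1;
}
```

Because the function has no stored state, a failed or interrupted upgrade cannot leave it pointing at the wrong partition.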

ESP_Faye
Posts: 1646
Joined: Mon Oct 27, 2014 11:08 am

Re: OTA upgrade is not fail-safe

Postby ESP_Faye » Wed May 20, 2015 11:48 am

tve wrote:There is a further bug, which is that after all of the above the system_upgrade_flag_set(UPGRADE_FLAG_FINISH) + system_upgrade_reboot() sequence reboots into user1.bin instead of user2.bin. In other words:
    Step 1- running in user1.bin
    Step 2- upload garbage user2.bin
    Step 3- call system_upgrade_flag_set(UPGRADE_FLAG_FINISH) + system_upgrade_reboot(), bootloader fails
    Step 4- upload fresh user1.bin via serial port
    Step 5- system restarts into user1.bin at end of upload
    Step 6- query system_upgrade_userbin_check, it returns 0
    Step 7- upload user2.bin in accordance with result from system_upgrade_userbin_check
    Step 8- call system_upgrade_flag_set(UPGRADE_FLAG_FINISH) + system_upgrade_reboot()
    Step 9- the system incorrectly reboots into user1.bin


If you download blank.bin as initialization every time you use the flash download tool (step 4), is this problem solved?

tve
Posts: 123
Joined: Sun Feb 15, 2015 4:33 pm

Re: OTA upgrade is not fail-safe

Postby tve » Thu May 21, 2015 12:31 pm

yes

kerpz
Posts: 3
Joined: Tue Feb 03, 2015 12:27 pm

Re: OTA upgrade is not fail-safe

Postby kerpz » Sun May 31, 2015 8:24 am

I think I've encountered the same issue. I was running user1.bin, uploaded a user2.bin, and after system_upgrade_flag_set(UPGRADE_FLAG_FINISH) + system_upgrade_reboot() I got this:

ets Jan 8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x40100000, len 816, room 16
tail 0
chksum 0x8d
load 0x3ffe8000, len 788, room 8
tail 12
chksum 0xcf
ho 0 tail 12 room 4
load 0x3ffe8314, len 288, room 12
tail 4
chksum 0xcf
csum 0xcf

2nd boot version : 1.2
SPI Speed : 40MHz
SPI Mode : QIO
SPI Flash Size : 4Mbit
jump to run user2

get flash_addr error!
user code done


Any thoughts on this?
Tested on an ESP-01 (512 KB flash).
