Strange slow access to I/O registers

Postby **mculibrk** » Mon Apr 27, 2015 3:12 am

I noticed strange behavior when accessing the ESP's I/O registers, especially the SPI FIFO registers SPI_W0..15.

I need to fill the FIFO as fast as possible, so I ended up using "pure" asm to squeeze the most out of it. I'm doing a "plain" RAM-to-FIFO copy and using the CCOUNT register to count the cycles needed to complete it.
Here is the "main" code snippet (tmr holds the cycle count at the end):

    uint32 data, tmr;               // data: scratch register; tmr: cycle count at the end
    uint32 spiFIFO[16];             // source buffer in RAM

    asm volatile (
        "l32i.n %0, %2, 0   \r\n"   // untimed accesses before CCOUNT is read, so that only the cycles
        "l32i.n %0, %3, 0   \r\n"   // needed to actually transfer data from RAM to FIFO are counted
        "rsr.ccount %1      \r\n"   // start time
        "l32i.n %0, %2, 0   \r\n"   // this line and
        "s32i.n %0, %3, 0   \r\n"   // this one are a "move pair", moving one DWORD from RAM to FIFO
        "l32i.n %0, %2, 4   \r\n"
        "s32i.n %0, %3, 4   \r\n"
        "l32i.n %0, %2, 8   \r\n"
        "s32i.n %0, %3, 8   \r\n"
        "l32i.n %0, %2, 12  \r\n"
        "s32i.n %0, %3, 12  \r\n"
        ...                          // "pairs" are repeated for the various tests
        "rsr.ccount %0      \r\n"   // end time
        "sub %1, %0, %1     \r\n"   // tmr = end - start
        : "=&r"(data), "=&r"(tmr)
        : "r"(spiFIFO), "r"(SPI_W0(ESP_SPI_HSPI))
        : "memory");                 // the asm reads spiFIFO and writes the FIFO registers
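To reuse the CCOUNT measurement outside a single asm statement, here is a minimal sketch of a cycle timer in C. The helper names `read_ccount`, `ccount_delta` and `time_cycles` are hypothetical; the `rsr.ccount` path assumes an Xtensa target such as the ESP8266 (GCC defines `__XTENSA__` there), and the host fallback just returns 0 so the file still compiles elsewhere. Because CCOUNT is a free-running 32-bit counter, unsigned subtraction gives the correct interval even if the counter wraps between the two reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the Xtensa cycle counter. Assumes an Xtensa target; on any other
 * target this hypothetical helper always returns 0. */
static inline uint32_t read_ccount(void)
{
#if defined(__XTENSA__)
    uint32_t c;
    __asm__ __volatile__("rsr.ccount %0" : "=r"(c));
    return c;
#else
    return 0u; /* host build: no cycle counter available */
#endif
}

/* Cycles elapsed between two CCOUNT samples. uint32_t arithmetic wraps
 * modulo 2^32, so the result is correct across a counter overflow. */
static inline uint32_t ccount_delta(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Time one call of fn in CPU cycles (includes the call overhead). */
static uint32_t time_cycles(void (*fn)(void))
{
    assert(fn != NULL);
    uint32_t t0 = read_ccount();
    fn();
    return ccount_delta(t0, read_ccount());
}
```

`time_cycles` includes the function-call overhead in its result, so for sequences as short as the ones in this thread, reading CCOUNT inside the asm statement itself (as in the snippet) gives the more precise number.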

So, here is the strange thing:

5 pairs = 11 cycles (10 l32i/s32i + rsr), which looks great
8 pairs = 17 cycles, again great
10 pairs = 28 cycles, slowing down: 1.4 cycles/instruction
12 pairs = 51 cycles
16 pairs = 64 cycles, 2 times slower?!
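To make the slowdown easier to see, the figures above can be compared against an ideal of one cycle per instruction. The sketch below assumes the same accounting as the first line of the list (one l32i and one s32i per pair, plus one rsr); the struct and function names are illustrative only, and the data are exactly the numbers reported above.

```c
#include <assert.h>

/* One measurement from the list above. */
struct sample {
    int pairs;   /* number of l32i/s32i "move pairs" */
    int cycles;  /* cycles reported via CCOUNT */
};

/* Instructions per run: 2 per pair plus one rsr, matching
 * "5 pairs = 11 cycles (10 l32i/s32i + rsr)". */
static int instructions(int pairs)
{
    assert(pairs >= 0);
    return 2 * pairs + 1;
}

/* Cycles spent beyond the ideal of one cycle per instruction. */
static int excess_cycles(struct sample s)
{
    return s.cycles - instructions(s.pairs);
}

/* The measurements reported above. */
static const struct sample reported[] = {
    { 5, 11 }, { 8, 17 }, { 10, 28 }, { 12, 51 }, { 16, 64 },
};
```

With this accounting the 5- and 8-pair runs show no extra cycles at all (0 and 0), while the 10-, 12- and 16-pair runs show 7, 26 and 31 extra cycles, which fits the later observation in the thread that the first 8 writes each take a single cycle.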

WHY?? Or better: how can I avoid it?
It looks like some sort of "cache overrun", but the statement "ESP has no caches" is everywhere in the forum(s)...

Reading the FIFO register (or GPIO) is "constantly slow": 12 cycles for a single l32i read instruction.

Any hint?

regards,
mculibrk

Postby **jcmvbkbc** » Mon Apr 27, 2015 10:06 am

mculibrk wrote: It seems like some sort of "cache overrun" but the statement "ESP has no caches" is everywhere in the forum(s)...

It looks more like the memory-mapped hardware doesn't allow faster writes.

But apart from that, you're using the following sequence:

      "l32i.n %0, %2, 0   \r\n"
      "s32i.n %0, %3, 0   \r\n"

It results in a pipeline stall that costs you 1 cycle, because you're trying to use a register right after it's loaded from memory.
You'd better use:

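      // here %0/%1 are two scratch registers, %3 the RAM buffer and %4 the FIFO address
      "l32i.n %0, %3, 0   \r\n"   // issue both loads first...
      "l32i.n %1, %3, 4   \r\n"
      "s32i.n %0, %4, 0   \r\n"   // ...then both stores, so no store uses the register
      "s32i.n %1, %4, 4   \r\n"   // loaded by the instruction right before it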

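The reordering suggested above can also be expressed in plain C: load two words before storing either, so that neither store depends on the load issued immediately before it. This is a minimal sketch that assumes the FIFO is reachable through a `volatile uint32_t *`; the function name `copy_to_fifo` is hypothetical. The compiler is free to schedule the non-volatile loads itself, so the source order only expresses intent; what C does guarantee is that the volatile stores are all performed, in order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n words from RAM into a memory-mapped FIFO, two at a time,
 * issuing both loads before either store. */
static void copy_to_fifo(volatile uint32_t *fifo, const uint32_t *src, size_t n)
{
    assert(fifo != NULL && src != NULL);
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint32_t a = src[i];      /* first load */
        uint32_t b = src[i + 1];  /* second load, independent of the first */
        fifo[i] = a;              /* in source order, a was loaded two instructions earlier */
        fifo[i + 1] = b;
    }
    if (i < n)                    /* odd number of words: copy the last one */
        fifo[i] = src[i];
}
```

On a host build, `fifo` can simply point at an ordinary array, which makes the copy logic easy to check before trying it against SPI_W0.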
Postby **mculibrk** » Mon Apr 27, 2015 10:48 am

Thanks for the hint...

but I had already tried that, with no difference.

I even tried simply repeating the same write instruction (at the same offset), with identical results. I thought it could be address/offset related, but it's not. There is no difference between doing

s32i %0, %2, 0
s32i %0, %2, 4
s32i %0, %2, 8
s32i %0, %2, 12
...

or

s32i %0, %2, 0
s32i %0, %2, 0
s32i %0, %2, 0
s32i %0, %2, 0
s32i %0, %2, 0
...

The first 8 writes are fine, each taking just 1 cycle, but after that the cost per write starts to increase.
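If you want to reproduce the two write patterns above in C, the pointer must be `volatile`: otherwise the compiler is allowed to merge repeated writes to the same offset into a single store, and the second experiment would measure almost nothing. This is only a sketch with hypothetical function names, and the compiler may still not emit exactly one `s32i` per write, which is a good reason to keep the timing runs in asm.

```c
#include <assert.h>
#include <stdint.h>

/* Pattern 1: consecutive offsets, like s32i %0, %2, 0 / 4 / 8 / 12 ... */
static void write_increasing(volatile uint32_t *fifo, uint32_t value, int n)
{
    assert(n >= 0 && n <= 16);   /* SPI_W0..SPI_W15 */
    for (int i = 0; i < n; i++)
        fifo[i] = value;
}

/* Pattern 2: the same offset every time, like repeating s32i %0, %2, 0.
 * Because fifo is volatile, all n stores must actually be performed. */
static void write_same_offset(volatile uint32_t *fifo, uint32_t value, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        fifo[0] = value;
}
```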

I tried inserting a few NOPs after the first 8 writes, hoping that at least something would improve, but it made no difference. Maybe I should wait 8+ cycles, but then I end up with the same total time anyway.
With the CPU clock set to 160 MHz it's the same, except that every write takes exactly twice as many cycles, so the "absolute time" figures are identical.

Reads from the same registers take a whole 12 cycles to complete, and that number is constant regardless of how many consecutive reads I do. It's the writes that bother me.
If they also took constant time, I would just accept that they are slow... but they're fast... for a while.
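The observation that 160 MHz doubles the cycle counts while leaving the absolute time unchanged is easy to check by converting cycles to nanoseconds. A minimal sketch (the helper name `cycles_to_ns` is hypothetical): 12 cycles at 80 MHz is 150 ns, and any access that takes 24 cycles at 160 MHz is the same 150 ns. A constant absolute time across CPU clocks is what you would expect if the limit sits on the peripheral side rather than in the CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a CCOUNT cycle count to nanoseconds at a CPU clock of cpu_mhz.
 * One cycle lasts 1000 / cpu_mhz ns; 64-bit math avoids overflow. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t cpu_mhz)
{
    assert(cpu_mhz > 0);
    return (uint32_t)(((uint64_t)cycles * 1000u) / cpu_mhz);
}
```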

Postby **jcmvbkbc** » Mon Apr 27, 2015 11:43 am

mculibrk wrote: I even tried just to repeat the same write instruction (at the same offset) with identical results. I thought it could be "address/offset" related but it's not. There is no difference in just doing

s32i %0, %2, 0
s32i %0, %2, 4
s32i %0, %2, 8
s32i %0, %2, 12
...

or

s32i %0, %2, 0
s32i %0, %2, 0
s32i %0, %2, 0
s32i %0, %2, 0
s32i %0, %2, 0
...

I wonder whether the SPI signal pattern is the same in both cases.

What SPI mode do you use (single/dual/quad)? What is the SPI clock frequency?
In other words, the question is whether the configured SPI bandwidth matches the speed at which you pump data from the CPU.
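The bandwidth question can be put in numbers: an SPI link moves (SPI clock times number of data lines) bits per second, while a CPU issuing one 32-bit store per cycle produces (CPU clock times 32) bits per second. A minimal sketch with hypothetical helper names; the clocks used in the example are placeholders, not the settings from this thread, and the SPI figure ignores command, address and dummy phases.

```c
#include <assert.h>
#include <stdint.h>

/* Raw SPI data rate in bits/s: one bit per clock on each data line
 * (1 = single, 2 = dual, 4 = quad). */
static uint64_t spi_bits_per_sec(uint64_t spi_clk_hz, unsigned lines)
{
    assert(lines == 1 || lines == 2 || lines == 4);
    return spi_clk_hz * lines;
}

/* Peak rate at which the CPU can push data with one 32-bit store per
 * cycle, as in the first 8 writes measured earlier in the thread. */
static uint64_t cpu_store_bits_per_sec(uint64_t cpu_clk_hz)
{
    return cpu_clk_hz * 32u;
}
```

With the placeholder figures of an 80 MHz CPU and a 40 MHz quad link, the CPU could produce data 16 times faster (2560 vs 160 Mbit/s) than the link can send it; whether that mismatch affects FIFO writes at all is exactly what the question above is probing.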

mculibrk wrote: reading the FIFO register (or GPIO) is "constantly slow" - 12 cycles for a single l32i read instruction

You obviously cannot read instantly: first you need to receive the data from the bus, so there must be at least some initial delay.
