Cortex-M7 cache coherency using ChibiOS/HAL

While porting the ChibiOS HAL to the new STM32F7xx, cache coherency issues inevitably popped up. Unfortunately, the DMA engines do not update or invalidate the cache in hardware, so the burden of coherency falls on software developers.

The Issue

The issue has two aspects: DMA engines reading from RAM and DMA engines writing to RAM.

DMA Transmission Buffers

The data cache in Cortex-M7 devices uses a write-back mechanism: data written by the CPU does not necessarily reach RAM immediately and can remain parked in the cache for an indefinitely long time. As a result, a DMA engine can read from RAM data that is not an exact copy of what the CPU wrote.

DMA Receive Buffers

On the other hand, data written to RAM by DMA engines does not invalidate the corresponding cache lines, so the CPU can read cache content that is no longer an exact copy of the data in RAM.
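Both failure modes can be reproduced on a desktop machine with a toy model. The sketch below is not Cortex-M7 code: it models a single RAM byte shadowed by a single write-back cache line, and every name in it (toy_line_t, cpu_write, dma_read and so on) is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one RAM byte shadowed by one write-back cache line. */
typedef struct {
  uint8_t ram;      /* What the DMA engines see.                        */
  uint8_t cached;   /* What the CPU sees while the line is valid.       */
  bool    valid;    /* The line holds a copy of the RAM location.       */
  bool    dirty;    /* The copy was modified and not yet written back.  */
} toy_line_t;

/* CPU write: lands in the cache only (write-back policy). */
void cpu_write(toy_line_t *l, uint8_t v) {
  l->cached = v;
  l->valid  = true;
  l->dirty  = true;
}

/* CPU read: served from the cache on a hit, refilled from RAM on a miss. */
uint8_t cpu_read(toy_line_t *l) {
  if (!l->valid) {
    l->cached = l->ram;
    l->valid  = true;
    l->dirty  = false;
  }
  return l->cached;
}

/* DMA accesses bypass the cache entirely. */
uint8_t dma_read(const toy_line_t *l) { return l->ram; }
void dma_write(toy_line_t *l, uint8_t v) { l->ram = v; }

/* Flush (clean): write a dirty line back to RAM. */
void toy_flush(toy_line_t *l) {
  if (l->valid && l->dirty) {
    l->ram   = l->cached;
    l->dirty = false;
  }
}

/* Invalidate: drop the cached copy so the next CPU read goes to RAM. */
void toy_invalidate(toy_line_t *l) {
  l->valid = false;
  l->dirty = false;
}

/* Walks through the transmit-side and receive-side problems in turn. */
void toy_demo(void) {
  toy_line_t l = { .ram = 0x00, .cached = 0x00, .valid = false, .dirty = false };

  /* Transmit side: the DMA sees stale RAM until the line is flushed. */
  cpu_write(&l, 0xAA);
  assert(dma_read(&l) == 0x00);
  toy_flush(&l);
  assert(dma_read(&l) == 0xAA);

  /* Receive side: the CPU sees the stale line until it is invalidated. */
  dma_write(&l, 0x55);
  assert(cpu_read(&l) == 0xAA);
  toy_invalidate(&l);
  assert(cpu_read(&l) == 0x55);
}
```

The real cache works on 32-byte lines and evicts them on its own schedule, which is why the stale data shows up only intermittently on hardware.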

Discarded Solutions

This is a list of solutions we considered but discarded for various reasons.

Disabling Data Cache

Disabling the data cache over the whole RAM array would resolve all problems. The rationale for trying this is that the STM32F7xx Reference Manual states that the RAM is accessible without wait states, and zero wait states would make caching RAM unnecessary. Unfortunately this is not true: the device offers zero-wait-state-like performance only when the data cache is enabled. Disabling the data cache simply reduces the device performance to about 1/3 of its potential.

Changing Cache mode to Write-Through

We saw this solution in some of ST's STM32Cube-F7 demos. Putting the cache in write-through mode is, unfortunately, an incomplete solution: it fixes the problem for DMA transmission buffers but does nothing for DMA receive buffers. In addition, it reduces system performance by about 10% to 20% because this is a less efficient caching mode. This solution also requires the use of the MPU, which adds extra complexity.

The following solutions can be adopted to handle cache coherency efficiently.

Dedicated DMA RAM

This is probably the most efficient solution: dedicate a portion of RAM to DMA buffers and make it non-cacheable using the MPU, or place the buffers in DTCM RAM, which is never cached.

Advantages
  • Does not require active runtime-handling.
  • It is perfectly transparent to the application.
  • Does not have any performance hit.
  • DTCM RAM is already non cached and can be used for DMA buffers.
  • The STM32F7xx has a 16KB area optimized for DMA accesses that could be used for this (RAM2).
Disadvantages
  • Requires a more complex linker script (ld file).
  • Requires an MPU region dedicated to the DMA RAM. ChibiOS/HAL will offer in 3.1.x an MPU helper driver for programming MPU regions.
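To give an idea of what the MPU region amounts to, here is a minimal sketch that only computes the attribute word for a non-cacheable region. It assumes the ARMv7-M MPU_RASR layout (TEX=001, C=0, B=0 selects normal non-cacheable memory; the SIZE field encodes a region of 2^(SIZE+1) bytes). The macro and function names are hypothetical and this is not the ChibiOS MPU helper API. Writing the value to the MPU registers, choosing the region number and placing the buffers at the region base address are left out.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR field positions (names are local to this sketch). */
#define RASR_ENABLE     (UINT32_C(1) << 0)
#define RASR_SIZE(n)    ((uint32_t)(n) << 1)     /* Region is 2^(n+1) bytes.   */
#define RASR_TEX(n)     ((uint32_t)(n) << 19)
#define RASR_AP_FULL    (UINT32_C(3) << 24)      /* Read/write, all modes.     */
#define RASR_XN         (UINT32_C(1) << 28)      /* No instruction fetches.    */

/* Attribute word for a non-cacheable data region of size_bytes bytes.
   The size must be a power of two and at least 32 bytes. C and B are
   left at zero, which together with TEX=1 means "normal, non-cacheable". */
uint32_t nocache_rasr(uint32_t size_bytes) {
  uint32_t bits = 0;

  assert(size_bytes >= 32U && (size_bytes & (size_bytes - 1U)) == 0U);

  /* Integer log2 of the region size. */
  while ((UINT32_C(1) << bits) < size_bytes) {
    bits++;
  }

  return RASR_XN | RASR_AP_FULL | RASR_TEX(1) |
         RASR_SIZE(bits - 1U) | RASR_ENABLE;
}
```

For a 16KB area such as the one mentioned above, nocache_rasr(16384U) evaluates to 0x1308001B. The MPU also requires the region base address to be aligned to the region size, which is one of the things the linker script has to guarantee.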

Application Handling of Buffers

This solution simply requires the application to handle the invalidation and/or flushing of the cache over DMA buffers. The HAL offers two functions that make it easy to secure buffers for use with the DMA.

Advantages
  • Simple to implement.
  • Portable.
Disadvantages
  • Buffers must be aligned to a cache line boundary, always 32 bytes.
  • The application must explicitly flush the cache to RAM on transmit buffers before the transmit operation.
  • The application must explicitly invalidate the cache on receive buffers after the receive operation.
  • Buffer handling has an impact on overall performance.
Example

Buffer declarations. Note that the buffers must be aligned to a cache line boundary.

  #define SPI_BUFFERS_SIZE    128U
 
  #if defined(__GNUC__)
  __attribute__((aligned (32)))
  #endif
  static uint8_t txbuf[SPI_BUFFERS_SIZE];
 
  #if defined(__GNUC__)
  __attribute__((aligned (32)))
  #endif
  static uint8_t rxbuf[SPI_BUFFERS_SIZE];
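The 128-byte buffers above happen to be an exact multiple of the 32-byte cache line. When a payload is not, a reasonable precaution is to pad the buffer so that it owns whole cache lines; otherwise unrelated variables sharing its last line would be affected by flush and invalidate operations. The sketch below shows one way to do that. CACHE_LINE_SIZE, CACHE_ROUND_UP, padded_buf and is_cache_safe are hypothetical names, not part of the HAL, and the alignment attribute assumes a GCC-compatible compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE   32U   /* Cortex-M7 data cache line size. */

/* Rounds n up to the next multiple of the cache line size. */
#define CACHE_ROUND_UP(n)                                                   \
  ((((size_t)(n)) + CACHE_LINE_SIZE - 1U) & ~((size_t)CACHE_LINE_SIZE - 1U))

/* A 100-byte payload padded to 128 bytes so it owns whole cache lines. */
#define PAYLOAD_SIZE      100U

__attribute__((aligned(32)))
uint8_t padded_buf[CACHE_ROUND_UP(PAYLOAD_SIZE)];

/* Compile-time check that the padding worked. */
static_assert(sizeof(padded_buf) % CACHE_LINE_SIZE == 0U,
              "DMA buffer must cover whole cache lines");

/* Returns non-zero when a buffer starts and ends on cache line boundaries. */
int is_cache_safe(const void *p, size_t n) {
  return (((uintptr_t)p % CACHE_LINE_SIZE) == 0U) &&
         ((n % CACHE_LINE_SIZE) == 0U);
}
```

A check such as is_cache_safe(padded_buf, sizeof padded_buf) can be placed in a debug assertion before handing a buffer to a driver.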

The following code exchanges data over the SPI using a transmission buffer and a receive buffer. MISO and MOSI are connected together so the data is looped back. As you can see, the cache handling is not particularly difficult.

  /* Bus acquisition and SPI reprogramming.*/
  spiAcquireBus(&SPID2);
  spiStart(&SPID2, &hs_spicfg);
 
  /* Preparing data buffer and flushing cache.*/
  for (i = 0; i < SPI_BUFFERS_SIZE; i++)
    txbuf[i] = (uint8_t)i;
  dmaBufferFlush(txbuf, SPI_BUFFERS_SIZE);
 
  /* Slave selection and data exchange.*/
  spiSelect(&SPID2);
  spiExchange(&SPID2, SPI_BUFFERS_SIZE, txbuf, rxbuf);
  spiUnselect(&SPID2);
 
  /* Invalidating cache over the buffer then checking the
     loopback result.*/
  dmaBufferInvalidate(rxbuf, SPI_BUFFERS_SIZE);
  if (memcmp(txbuf, rxbuf, SPI_BUFFERS_SIZE) != 0)
    chSysHalt("loopback failure");
 
  /* Releasing the bus.*/
  spiReleaseBus(&SPID2);

The functions dmaBufferFlush() and dmaBufferInvalidate() are also present on devices without cache but do nothing in that case. This is done in order to preserve SW compatibility across all devices.
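The pattern behind such "do nothing" variants can be sketched with conditional compilation. This is an illustration only, not the actual ChibiOS implementation: MY_HAS_DCACHE, my_buffer_flush, my_buffer_invalidate and my_stub_demo are hypothetical names. The cached branch assumes the CMSIS-Core maintenance calls for Cortex-M7 and is not compiled in the default configuration shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical switch; a real port would derive it from the device headers. */
#ifndef MY_HAS_DCACHE
#define MY_HAS_DCACHE 0
#endif

#if MY_HAS_DCACHE
/* Cached devices: CMSIS-Core cache maintenance (needs the device header). */
#define my_buffer_flush(addr, size)                                         \
  SCB_CleanDCache_by_Addr((uint32_t *)(addr), (int32_t)(size))
#define my_buffer_invalidate(addr, size)                                    \
  SCB_InvalidateDCache_by_Addr((uint32_t *)(addr), (int32_t)(size))
#else
/* Cache-less devices: evaluate the arguments, perform no operation. */
#define my_buffer_flush(addr, size)                                         \
  do { (void)(addr); (void)(size); } while (0)
#define my_buffer_invalidate(addr, size)                                    \
  do { (void)(addr); (void)(size); } while (0)
#endif

/* One cache line, aligned as required on cached devices. */
__attribute__((aligned(32)))
static uint8_t demo_buf[32];

static_assert(sizeof(demo_buf) % 32U == 0U, "whole cache lines only");

/* The same application code builds for both kinds of device. */
uint8_t my_stub_demo(void) {
  demo_buf[0] = 42U;
  my_buffer_flush(demo_buf, sizeof demo_buf);
  my_buffer_invalidate(demo_buf, sizeof demo_buf);
  return demo_buf[0];
}
```

With the switch left at 0 the two macros compile to nothing, so application code written for a cached device carries no overhead on a cache-less one.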
