2026-06-14 · Davide Carrese

STM32H7 D-Cache Coherency:
The DMA Pitfall Every Embedded Engineer Must Understand

STM32 · STM32H7 · DMA · D-Cache · Cortex-M7 · MPU · Embedded

Spent the morning debugging why your SPI DMA buffer contains zeros while the peripheral is clearly clocking data in? Or why a UART transfer returns the same stale packet twice, even though a logic analyser shows fresh bytes on the wire? Chances are the Cortex-M7 data cache is silently serving you stale data. On STM32H7, the D-cache sits between the CPU and the bus matrix, and DMA accesses bypass it entirely. Understanding when and how to clean or invalidate the cache is not optional — it's the difference between a reliable product and a heisenbug that only manifests in production.

The STM32H7 is ST's highest-performing general-purpose MCU family, built around the Arm Cortex-M7 core with a 64-bit AXI bus, separate instruction and data caches, and up to 2 MB of flash. The L1 data cache is a 16 KB four-way set-associative cache. Both caches are disabled at reset; once the D-cache is enabled, the default memory attributes make SRAM write-back, write-allocate. In write-back mode, a CPU store to a cached address does not immediately propagate to main memory. The data sits in the cache line until the line is evicted or explicitly cleaned. Meanwhile, a DMA controller accessing the same physical address reads from (or writes to) the actual SRAM, completely unaware of the dirty cache line.

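This asymmetry is the root cause of two classic failure modes:

  1. TX path: the CPU fills a buffer, but the data is still sitting in dirty cache lines when the DMA reads SRAM, so the peripheral sends stale bytes.
  2. RX path: the DMA writes fresh data to SRAM, but the CPU keeps reading an older copy of the same addresses from the cache.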

How the D-Cache Works on Cortex-M7

The Cortex-M7 L1 D-cache is organised into 512 lines of 32 bytes each, arranged as 4-way set-associative (128 sets × 4 ways). Each cache line has a tag (the upper address bits), a valid bit, and a dirty bit. In write-back mode (the default), stores hit the cache line and mark it dirty without writing through to the AXI bus. A cache line is written back only when it is evicted to make room for another line, or when software explicitly cleans it with a clean or clean-and-invalidate operation.

The cache operates on the AXI master port of the Cortex-M7. The DMA controllers (MDMA, DMA1, DMA2, BDMA) are separate bus masters on the AHB/AXI matrix — they have no visibility into the D-cache. This is the fundamental architectural constraint: the CPU sees a coherent view of memory through the cache, but every other bus master sees the raw SRAM.
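To make the geometry concrete, here is a minimal sketch of how a 32-bit address splits into tag, set index and byte offset for a 16 KB, 4-way cache with 32-byte lines. It is a simplified textbook model, and the names (`cache_addr_t`, `cache_decompose`) are illustrative rather than part of CMSIS.

```c
#include <stdint.h>

/* Geometry from the text: 16 KB, 4 ways, 32-byte lines -> 128 sets. */
#define LINE_SIZE   32u
#define NUM_WAYS    4u
#define CACHE_SIZE  (16u * 1024u)
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))   /* 128 */

typedef struct {
    uint32_t tag;     /* address bits [31:12] */
    uint32_t set;     /* address bits [11:5]  */
    uint32_t offset;  /* address bits [4:0]   */
} cache_addr_t;

/* Split an address into the three fields a set-associative lookup uses. */
static cache_addr_t cache_decompose(uint32_t addr)
{
    cache_addr_t a;
    a.offset = addr % LINE_SIZE;
    a.set    = (addr / LINE_SIZE) % NUM_SETS;
    a.tag    = addr / (LINE_SIZE * NUM_SETS);
    return a;
}
```

Two addresses exactly 4 KB apart land in the same set with different tags, which is why a buffer can be evicted by unrelated accesses elsewhere in SRAM.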

The Cache Maintenance Operations You Need

1. Clean D-Cache (write-back dirty lines to SRAM)

Call before starting a DMA read from a buffer that the CPU has recently written:

/* Buffer the CPU just filled with data to transmit via SPI DMA */
uint8_t tx_buffer[256];
fill_packet(tx_buffer, sizeof(tx_buffer));

/* Ensure all CPU writes have reached SRAM before DMA reads them */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT == 1
  SCB_CleanDCache_by_Addr((uint32_t *)tx_buffer, sizeof(tx_buffer));
#endif

/* Now safe to start DMA from tx_buffer */
HAL_SPI_Transmit_DMA(&hspi1, tx_buffer, sizeof(tx_buffer));

2. Invalidate D-Cache (discard stale lines, force fetch from SRAM)

Call after a DMA transfer completes, before the CPU reads the received data:

/* Start the DMA reception into rx_buffer (32-byte aligned, see below) */
static uint8_t rx_buffer[256] __attribute__((aligned(32)));
HAL_SPI_Receive_DMA(&hspi1, rx_buffer, sizeof(rx_buffer));
/* ... wait for the transfer-complete callback ... */

/* Discard stale cache lines so the next CPU read goes to SRAM */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT == 1
  SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buffer, sizeof(rx_buffer));
#endif

/* Now safe to read rx_buffer - fresh data from DMA */
process_data(rx_buffer, sizeof(rx_buffer));

3. Clean + Invalidate Combined

When a buffer is used for bidirectional DMA (e.g., half-duplex SPI with the same buffer), or when you are not sure of the dirty state, use the combined operation:

SCB_CleanInvalidateDCache_by_Addr((uint32_t *)buf, size);

This writes back any dirty lines and then marks them invalid, so the next read hits the bus.
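The difference between the three operations is easiest to see in a toy model. The sketch below simulates a single write-back cache line in front of a 32-byte block of SRAM on a host machine. Every name in it (`toy_cache_t`, `cpu_write`, `dma_read` and so on) is hypothetical, and it does not model sets, ways or eviction.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u

/* One cache line shadowing one 32-byte block of "SRAM". */
typedef struct {
    uint8_t sram[LINE_SIZE];   /* what DMA and other bus masters see */
    uint8_t line[LINE_SIZE];   /* what the CPU sees while valid      */
    bool    valid;
    bool    dirty;
} toy_cache_t;

/* Allocate the line on a miss (read- and write-allocate). */
static void toy_fill(toy_cache_t *c)
{
    if (!c->valid) {
        memcpy(c->line, c->sram, LINE_SIZE);
        c->valid = true;
        c->dirty = false;
    }
}

/* CPU accesses always go through the cache line. */
static uint8_t cpu_read(toy_cache_t *c, uint32_t i)
{
    toy_fill(c);
    return c->line[i];
}

static void cpu_write(toy_cache_t *c, uint32_t i, uint8_t v)
{
    toy_fill(c);
    c->line[i] = v;
    c->dirty = true;
}

/* DMA accesses bypass the cache entirely. */
static uint8_t dma_read(const toy_cache_t *c, uint32_t i) { return c->sram[i]; }
static void dma_write(toy_cache_t *c, uint32_t i, uint8_t v) { c->sram[i] = v; }

/* Clean: write the line back if dirty; it stays valid. */
static void toy_clean(toy_cache_t *c)
{
    if (c->valid && c->dirty) {
        memcpy(c->sram, c->line, LINE_SIZE);
        c->dirty = false;
    }
}

/* Invalidate: drop the line WITHOUT writing it back. */
static void toy_invalidate(toy_cache_t *c)
{
    c->valid = false;
    c->dirty = false;
}

/* Clean + invalidate: write back, then drop. */
static void toy_clean_invalidate(toy_cache_t *c)
{
    toy_clean(c);
    toy_invalidate(c);
}
```

In this model, `dma_read()` after `cpu_write()` returns the old SRAM contents until `toy_clean()` runs, and `cpu_read()` after `dma_write()` returns the old cached value until `toy_invalidate()` runs. Note that a plain invalidate on a dirty line silently throws away the CPU's pending writes.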

Alignment and Line-Size Gotchas

⚠ Common Mistake

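SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr() operate on whole 32-byte cache lines. If the buffer address or size is not a multiple of 32 bytes, the operation also touches whatever shares the first and last lines with your buffer: an invalidate can discard unrelated dirty data that happens to live next to the buffer, and depending on the CMSIS version, an unaligned start address can cause the tail of the buffer to be missed. Always align DMA buffers to 32 bytes and pad their size to a multiple of 32 bytes.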
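A short sketch of the arithmetic, assuming a 32-byte line: the helpers below count how many cache lines a byte range touches and how many bytes outside the buffer get caught in the operation. The function names are illustrative, and the code ignores ranges that wrap past the top of the address space.

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u

/* Round an address down to a cache-line boundary. */
static uint32_t line_floor(uint32_t addr)
{
    return addr & ~(DCACHE_LINE_SIZE - 1u);
}

/* Round an address up to a cache-line boundary. */
static uint32_t line_ceil(uint32_t addr)
{
    return (addr + DCACHE_LINE_SIZE - 1u) & ~(DCACHE_LINE_SIZE - 1u);
}

/* Number of cache lines touched by the range [addr, addr + size). */
static uint32_t lines_touched(uint32_t addr, uint32_t size)
{
    if (size == 0u) {
        return 0u;
    }
    return (line_ceil(addr + size) - line_floor(addr)) / DCACHE_LINE_SIZE;
}

/* Bytes outside the buffer that a by-line operation also affects. */
static uint32_t collateral_bytes(uint32_t addr, uint32_t size)
{
    return lines_touched(addr, size) * DCACHE_LINE_SIZE - size;
}
```

A 256-byte buffer at 0x24000000 covers exactly 8 lines with no collateral. The same buffer at 0x24000010 straddles 9 lines, so 32 bytes belonging to its neighbours are cleaned or invalidated along with it.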

Use GCC/Clang attributes or a linker-section approach:

/* Force 32-byte alignment with the attribute */
static uint8_t rx_buffer[256] __attribute__((aligned(32)));
static uint8_t tx_buffer[256] __attribute__((aligned(32)));

/* Or for larger buffers, use a dedicated non-cacheable section:
 * In linker script: .NonCacheable (NOLOAD) : { *(.noncacheable) } > SRAM
 */
__attribute__((section(".noncacheable")))
static uint8_t dma_pool[4096];

Alternatively, place DMA buffers in SRAM3 or SRAM4 on STM32H7 and configure that region as non-cacheable via the MPU (see the MPU section below). Alignment alone does not remove the need for maintenance: any buffer in cacheable memory such as AXI SRAM still needs explicit clean and invalidate calls. DTCM is the exception, because it is never cached.
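One way to make the alignment and padding rule hard to forget is to wrap the declaration in a macro. This is a sketch using C11 `alignas`; `DMA_ROUND_UP` and `DMA_BUFFER` are hypothetical names, not CMSIS or HAL macros, and the two buffers are only examples.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u

/* Round a byte count up to a whole number of cache lines. */
#define DMA_ROUND_UP(n) \
    ((((n) + DCACHE_LINE_SIZE - 1u) / DCACHE_LINE_SIZE) * DCACHE_LINE_SIZE)

/* Declare a line-aligned buffer whose storage is padded to whole lines,
 * so no other variable can share its first or last cache line. */
#define DMA_BUFFER(name, n) \
    static alignas(DCACHE_LINE_SIZE) uint8_t name[DMA_ROUND_UP(n)]

DMA_BUFFER(adc_samples, 100);   /* occupies 128 bytes */
DMA_BUFFER(uart_rx, 256);       /* already a whole number of lines */

_Static_assert(sizeof(adc_samples) % DCACHE_LINE_SIZE == 0,
               "DMA buffer must be padded to whole cache lines");
```

Because the array itself is padded, `sizeof(adc_samples)` can be passed straight to the clean and invalidate calls without any rounding at the call site.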

Practical Example: SPI DMA Transfer with Correct Cache Handling

Here is a complete pattern for a cache-safe SPI transaction on STM32H7:

/* stm32h7_spi_dma.c — cache-safe SPI DMA transaction */

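#include <string.h>              /* memcpy */
#include "stm32h7xx_hal.h"
#include "cmsis_compiler.h"

extern SPI_HandleTypeDef hspi1;  /* assumed to be defined elsewhere in the project */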

#define DMA_BUF_SIZE  256

/* 32-byte aligned DMA buffers */
static __ALIGNED(32) uint8_t dma_tx[DMA_BUF_SIZE];
static __ALIGNED(32) uint8_t dma_rx[DMA_BUF_SIZE];

/* Completion flag: volatile because it is set in the ISR and polled in the main context */
static volatile uint8_t xfer_done = 0;

/* Callback from HAL */
void HAL_SPI_TxRxCpltCallback(SPI_HandleTypeDef *hspi)
{
    if (hspi == &hspi1) {
        xfer_done = 1;
    }
}

/**
 * @brief  Perform a cache-coherent SPI DMA transaction.
 * @param  tx_data   data to transmit (copied into aligned buffer)
 * @param  rx_data   buffer to receive into (copied out after invalidate)
 * @param  len       transaction length (must be ≤ DMA_BUF_SIZE)
 * @retval HAL_StatusTypeDef
 */
HAL_StatusTypeDef spi_dma_transaction(uint8_t *tx_data, uint8_t *rx_data, uint16_t len)
{
    HAL_StatusTypeDef ret;

    /* Reject transfers that do not fit the DMA buffers */
    if (len > DMA_BUF_SIZE) return HAL_ERROR;

    /* Copy user data into aligned DMA buffer */
    memcpy(dma_tx, tx_data, len);

    /* Clean D-cache before DMA reads the TX buffer */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT == 1
    SCB_CleanDCache_by_Addr((uint32_t *)dma_tx, len);
#endif

    xfer_done = 0;

    /* Start full-duplex SPI DMA */
    ret = HAL_SPI_TransmitReceive_DMA(&hspi1, dma_tx, dma_rx, len);
    if (ret != HAL_OK) return ret;

    /* Wait for completion (with timeout in real code) */
    while (!xfer_done) {
        __WFE();
    }

    /* Invalidate D-cache before CPU reads the RX buffer */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT == 1
    SCB_InvalidateDCache_by_Addr((uint32_t *)dma_rx, len);
#endif

    /* Copy fresh data from DMA to user buffer */
    memcpy(rx_data, dma_rx, len);

    return HAL_OK;
}

Key points: the user data is copied into aligned DMA buffers, the cache is cleaned before the DMA starts and invalidated after it completes, and the completion flag is volatile so that the compiler re-reads it on every loop iteration. The flag itself needs no cache maintenance, because only the CPU ever touches it. The same pattern applies to UART, I²S, ADC, DAC, and any other peripheral that moves data to or from memory by DMA.
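The wait loop in the example has no timeout. A minimal sketch of one way to add it is shown below, written against a tick source passed in as a function pointer so it can be exercised on a host. On target the tick source could wrap `HAL_GetTick()`; `wait_flag_timeout`, `tick_fn_t` and `fake_tick` are hypothetical names, and the busy-wait here omits the `__WFE()` sleep used above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tick source returning a free-running millisecond counter. */
typedef uint32_t (*tick_fn_t)(void);

/* Spin until *flag becomes non-zero or timeout_ms ticks elapse.
 * Unsigned subtraction keeps the comparison correct across counter wrap.
 * Returns true if the flag was set, false on timeout. */
static bool wait_flag_timeout(volatile const uint8_t *flag,
                              tick_fn_t now, uint32_t timeout_ms)
{
    const uint32_t start = now();

    while (*flag == 0u) {
        if ((uint32_t)(now() - start) >= timeout_ms) {
            return false;
        }
    }
    return true;
}

/* Host-side stand-in for the tick source, used only to try the sketch:
 * it advances by one "millisecond" on every call. */
static uint32_t fake_ticks;

static uint32_t fake_tick(void)
{
    return fake_ticks++;
}
```

On a timeout, the caller would typically abort the DMA transfer and return an error instead of reading the RX buffer, since the buffer contents are undefined at that point.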

Using the MPU to Mark DMA Regions as Non-Cacheable

For some systems, explicit cache maintenance on every transaction is too brittle. A cleaner architectural approach is to dedicate a section of SRAM as non-cacheable by programming the MPU. The MPU on Cortex-M7 can override the default memory attributes on a per-region basis.

/* Configure MPU region for a 16-KB non-cacheable DMA pool in AXI SRAM */

void MPU_Config_DMA_Pool(void)
{
    MPU_Region_InitTypeDef MPU_Init = {0};

    /* Disable MPU before configuration */
    HAL_MPU_Disable();

    /* Region: 0x24004000 (inside AXI SRAM), 16 KB */
    MPU_Init.Enable           = MPU_REGION_ENABLE;
    MPU_Init.Number           = MPU_REGION_NUMBER1;
    MPU_Init.BaseAddress      = 0x24004000;
    MPU_Init.Size             = MPU_REGION_SIZE_16KB;
    MPU_Init.SubRegionDisable = 0x00;
    MPU_Init.TypeExtField     = MPU_TEX_LEVEL1;
    MPU_Init.AccessPermission = MPU_REGION_FULL_ACCESS;
    MPU_Init.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
    MPU_Init.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
    MPU_Init.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;  /* !!! */
    MPU_Init.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE; /* TEX=1, C=0, B=0: Normal, non-cacheable */

    HAL_MPU_ConfigRegion(&MPU_Init);

    /* Enable MPU with PRIVDEFENA (enable default memory map for priv) */
    HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);

    /* Ensure the MPU config takes effect before any DMA operations */
    __DSB();
    __ISB();
}

With this MPU configuration, any variable placed in the address range 0x24004000–0x24007FFF is never cached by the D-cache. No explicit clean/invalidate is needed. The trade-off is a 15–20% performance hit for CPU accesses to that region, since every load/store goes directly to SRAM.
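The HAL structure fields end up packed into the MPU_RASR register. The sketch below shows that packing, along with the rule that a region's base address must be a multiple of its size. The bit positions follow the ARMv7-M MPU register layout; the helper names are illustrative and the code runs on a host without touching any hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR field positions. */
#define RASR_ENABLE      (1u << 0)
#define RASR_SIZE(n)     (((n) & 0x1Fu) << 1)   /* region = 2^(n+1) bytes */
#define RASR_B           (1u << 16)
#define RASR_C           (1u << 17)
#define RASR_S           (1u << 18)
#define RASR_TEX(t)      (((t) & 0x7u) << 19)
#define RASR_AP(ap)      (((ap) & 0x7u) << 24)
#define RASR_XN          (1u << 28)

/* SIZE field for a power-of-two region of at least 32 bytes. */
static uint32_t rasr_size_field(uint32_t bytes)
{
    uint32_t n = 4u;                         /* 2^(4+1) = 32 bytes minimum */
    while (n < 31u && (1u << (n + 1u)) < bytes) {
        n++;
    }
    return n;
}

/* A region must be a power of two in size and aligned to that size. */
static bool region_base_ok(uint32_t base, uint32_t bytes)
{
    return bytes >= 32u
        && (bytes & (bytes - 1u)) == 0u
        && (base & (bytes - 1u)) == 0u;
}

/* Normal, non-cacheable, non-shareable, full access, no execute:
 * TEX=0b001, C=0, B=0, S=0, AP=0b011, XN=1. */
static uint32_t rasr_normal_noncacheable(uint32_t bytes)
{
    return RASR_XN | RASR_AP(3u) | RASR_TEX(1u)
         | RASR_SIZE(rasr_size_field(bytes)) | RASR_ENABLE;
}
```

For a 16 KB region the SIZE field is 13, and 0x24004000 passes the alignment check because it is a multiple of 0x4000. A base such as 0x24002000 would not be valid for a region of that size.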

In practice, I use a hybrid approach: a small MPU-based non-cacheable section for frequently-accessed DMA buffers (ADC double buffers, Ethernet descriptors), and explicit cache maintenance for large or infrequent transfers (SPI flash reads, UART packets).

Practical Checklist

Scenario | Action
CPU writes buffer, then DMA reads it (TX path) | SCB_CleanDCache_by_Addr() before starting DMA
DMA writes buffer, then CPU reads it (RX path) | SCB_InvalidateDCache_by_Addr() after DMA completes
Bidirectional DMA (same buffer) | SCB_CleanInvalidateDCache_by_Addr()
Buffer is const or lives in flash | No action needed (flash is read-only, so its cache lines are never dirty)
Frequent small DMA (double-buffered ADC) | MPU non-cacheable region recommended
Ethernet descriptor rings (ETH DMA) | MPU non-cacheable, always
DMA to/from DTCM (0x20000000) | No cache maintenance needed, since DTCM bypasses the D-cache; but on most STM32H7 parts only the MDMA can reach DTCM (DMA1, DMA2 and BDMA cannot), so keep CPU-hot data in DTCM and put DMA buffers in AXI SRAM
Using HAL with DMA | HAL does NOT manage the cache; you must add the SCB calls yourself

How I Would Approach This on a Client Project

On any STM32H7 project, the first thing I do before writing a single line of application code is set up the cache and MPU policy. Here is my standard template:

  1. Leave the D-cache disabled (its reset state) during initialisation while configuring clocks and GPIO; this saves debugging time during bring-up.
  2. Enable I-cache unconditionally (instruction cache has no coherency issues).
  3. Allocate a dedicated DMA pool in a linker section mapped to AXI SRAM.
  4. Configure a single MPU region marking that pool as non-cacheable, strong ordering (no speculative accesses).
  5. Enable D-cache and the MPU in a controlled order: enable MPU → DSB/ISB → enable D-cache.
  6. Add a thin wrapper around the DMA HAL callbacks that performs cache maintenance based on the transfer direction. This wrapper is reused across all peripherals.
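The direction-based wrapper from the last step can be reduced to a small lookup. The sketch below only decides which operation belongs before the transfer starts and which after it completes; all type and function names are hypothetical. The invalidate before an RX transfer is an extra precaution beyond the checklist above: it removes any dirty lines in the buffer that could otherwise be evicted on top of the DMA data mid-transfer.

```c
#include <stdint.h>

typedef enum { DMA_DIR_TX, DMA_DIR_RX, DMA_DIR_BIDIR } dma_dir_t;

typedef enum {
    CACHE_OP_NONE,
    CACHE_OP_CLEAN,             /* maps to SCB_CleanDCache_by_Addr()           */
    CACHE_OP_INVALIDATE,        /* maps to SCB_InvalidateDCache_by_Addr()      */
    CACHE_OP_CLEAN_INVALIDATE   /* maps to SCB_CleanInvalidateDCache_by_Addr() */
} cache_op_t;

typedef struct {
    cache_op_t before_start;    /* run before the DMA is started     */
    cache_op_t after_complete;  /* run in the transfer-complete path */
} cache_plan_t;

/* Pick the maintenance operations for a transfer direction. */
static cache_plan_t cache_plan_for(dma_dir_t dir)
{
    switch (dir) {
    case DMA_DIR_TX:
        return (cache_plan_t){ CACHE_OP_CLEAN, CACHE_OP_NONE };
    case DMA_DIR_RX:
        return (cache_plan_t){ CACHE_OP_INVALIDATE, CACHE_OP_INVALIDATE };
    case DMA_DIR_BIDIR:
        return (cache_plan_t){ CACHE_OP_CLEAN_INVALIDATE, CACHE_OP_INVALIDATE };
    }
    /* Unknown direction: fall back to the most conservative plan. */
    return (cache_plan_t){ CACHE_OP_CLEAN_INVALIDATE, CACHE_OP_INVALIDATE };
}
```

Keeping the decision in one function means each peripheral driver only states its direction, and a missing maintenance call becomes a bug in one place instead of in every driver.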

This prevents the "works on Nucleo, crashes on prototype" class of bugs that comes from developing with the D-cache disabled, then enabling it for the production build and discovering all the missing maintenance calls.

