STM32H7 D-Cache Coherency:
The DMA Pitfall Every Embedded Engineer Must Understand
Spent the morning debugging why your SPI DMA buffer contains zeros while the peripheral is clearly clocking data in? Or why a UART transfer returns the same stale packet twice, even though a logic analyser shows fresh bytes on the wire? Chances are the Cortex-M7 data cache is silently serving you stale data. On STM32H7, the D-cache sits between the CPU and the bus matrix, and DMA accesses bypass it entirely. Understanding when and how to clean or invalidate the cache is not optional — it's the difference between a reliable product and a heisenbug that only manifests in production.
The STM32H7 is ST's highest-performing general-purpose MCU family, built around the Arm Cortex-M7 core with a 64-bit AXI bus, separate instruction and data caches, and up to 2 MB of flash. On most parts the L1 data cache is a 16-KB four-way set-associative write-back, write-allocate cache. Both caches are disabled at reset; once the D-cache is enabled, the default memory map treats the SRAM region as write-back, write-allocate. In write-back mode, a CPU store to a cached address does not immediately propagate to main memory. The data sits in the cache line until the line is evicted or explicitly cleaned. Meanwhile, a DMA controller accessing the same physical address reads from (or writes to) the actual SRAM, completely unaware of the dirty cache line.
This asymmetry is the root cause of two classic failure modes:
- DMA-to-CPU (peripheral → memory → CPU reads): DMA writes fresh data to SRAM. The CPU reads the address — but the D-cache still holds a stale copy from before the DMA, so the CPU sees old data. This requires a cache invalidate.
- CPU-to-DMA (CPU writes → DMA reads): The CPU writes data to a buffer, then triggers a DMA transfer from that buffer. If the written data is still dirty in the cache and has not reached SRAM, the DMA reads stale zeros or garbage from physical memory. This requires a cache clean (write-back) before starting the DMA.
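Both failure modes can be reproduced off-target with a toy model. The sketch below is not how the hardware is implemented; it is a minimal host-side illustration, assuming a single 32-byte cache line shadowing a single 32-byte block of SRAM. All names (`toy_mem_t`, `cpu_write`, `dma_read`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u

/* Toy model: one 32-byte block of "SRAM" shadowed by one cache line. */
typedef struct {
    uint8_t sram[LINE_SIZE];  /* what DMA and other bus masters see */
    uint8_t line[LINE_SIZE];  /* the CPU's cached copy */
    bool    valid;
    bool    dirty;
} toy_mem_t;

/* Line fill: copy SRAM into the cache line if it is not already valid. */
static void toy_fill_line(toy_mem_t *m)
{
    if (!m->valid) {
        memcpy(m->line, m->sram, LINE_SIZE);
        m->valid = true;
        m->dirty = false;
    }
}

/* CPU store: write-allocate, write-back. SRAM is NOT updated here. */
void cpu_write(toy_mem_t *m, uint32_t off, uint8_t v)
{
    assert(off < LINE_SIZE);
    toy_fill_line(m);
    m->line[off] = v;
    m->dirty = true;
}

/* CPU load: served from the cache line once it is valid. */
uint8_t cpu_read(toy_mem_t *m, uint32_t off)
{
    assert(off < LINE_SIZE);
    toy_fill_line(m);
    return m->line[off];
}

/* DMA accesses go straight to SRAM and never look at the cache. */
void dma_write(toy_mem_t *m, uint32_t off, uint8_t v)
{
    assert(off < LINE_SIZE);
    m->sram[off] = v;
}

uint8_t dma_read(const toy_mem_t *m, uint32_t off)
{
    assert(off < LINE_SIZE);
    return m->sram[off];
}

/* Clean: write a dirty line back to SRAM; the line stays valid. */
void cache_clean(toy_mem_t *m)
{
    if (m->valid && m->dirty) {
        memcpy(m->sram, m->line, LINE_SIZE);
        m->dirty = false;
    }
}

/* Invalidate: drop the line without writing it back. */
void cache_invalidate(toy_mem_t *m)
{
    m->valid = false;
    m->dirty = false;
}
```

Starting from a zeroed `toy_mem_t`, a `cpu_write()` followed by `dma_read()` returns the old SRAM contents until `cache_clean()` runs, and a `dma_write()` followed by `cpu_read()` returns the stale cached byte until `cache_invalidate()` runs.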
How the D-Cache Works on Cortex-M7
The Cortex-M7 L1 D-cache is organised into 512 lines of 32 bytes each, arranged as 4-way set-associative (128 sets × 4 ways). Each cache line has a tag (the upper address bits), a valid bit, and a dirty bit. In write-back mode (the default), stores hit the cache line and mark it dirty without writing through to the AXI bus. A cache line is written back only when:
- It is evicted by another load or store mapping to the same set.
- SCB_CleanDCache_by_Addr() or a full-cache clean is issued.
- A clean-and-invalidate (SCB_CleanInvalidateDCache or its by-address variant) is issued. A plain invalidate of a dirty line discards the data instead of writing it back, so clean first if the contents matter.
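To make the geometry concrete, here is a small sketch of how an address would split into line offset, set index, and tag for a 16-KB, 4-way cache with 32-byte lines. It assumes the conventional split (low bits select the byte, the next bits select the set); the function names are hypothetical and this is host-side arithmetic only.

```c
#include <assert.h>
#include <stdint.h>

#define DCACHE_SIZE_BYTES 16384u
#define DCACHE_LINE_BYTES 32u
#define DCACHE_WAYS       4u
#define DCACHE_SETS       (DCACHE_SIZE_BYTES / (DCACHE_LINE_BYTES * DCACHE_WAYS))

/* 16 KB / (32 B * 4 ways) must give the 128 sets quoted in the text. */
static_assert(DCACHE_SETS == 128u, "unexpected cache geometry");

/* Byte position inside the 32-byte line (address bits 4..0). */
uint32_t dcache_line_offset(uint32_t addr)
{
    return addr % DCACHE_LINE_BYTES;
}

/* Which of the 128 sets the address maps to (address bits 11..5). */
uint32_t dcache_set_index(uint32_t addr)
{
    return (addr / DCACHE_LINE_BYTES) % DCACHE_SETS;
}

/* Remaining upper bits, stored as the tag (address bits 31..12). */
uint32_t dcache_tag(uint32_t addr)
{
    return addr / (DCACHE_LINE_BYTES * DCACHE_SETS);
}
```

Under this split, two addresses 4 KB apart (for example 0x24000000 and 0x24001000) share a set index and differ only in the tag, which is why they compete for the same four ways.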
The cache operates on the AXI master port of the Cortex-M7. The DMA controllers (MDMA, DMA1, DMA2, BDMA) are separate bus masters on the AHB/AXI matrix — they have no visibility into the D-cache. This is the fundamental architectural constraint: the CPU sees a coherent view of memory through the cache, but every other bus master sees the raw SRAM.
The Cache Maintenance Operations You Need
1. Clean D-Cache (write-back dirty lines to SRAM)
Call before starting a DMA read from a buffer that the CPU has recently written:
/* Buffer the CPU just filled with data to transmit via SPI DMA */
uint8_t tx_buffer[256];
fill_packet(tx_buffer, sizeof(tx_buffer));

/* Ensure all CPU writes have reached SRAM before DMA reads them */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT == 1
SCB_CleanDCache_by_Addr((uint32_t *)tx_buffer, sizeof(tx_buffer));
#endif

/* Now safe to start DMA from tx_buffer */
HAL_SPI_Transmit_DMA(&hspi1, tx_buffer, sizeof(tx_buffer));
2. Invalidate D-Cache (discard stale lines, force fetch from SRAM)
Call after a DMA transfer completes, before the CPU reads the received data:
/* Buffer that the SPI DMA fills with received data */
uint8_t rx_buffer[256];
HAL_SPI_Receive_DMA(&hspi1, rx_buffer, sizeof(rx_buffer));

// ... wait for transfer-complete callback ...

/* Discard stale cache lines so the next CPU read goes to SRAM */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT == 1
SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buffer, sizeof(rx_buffer));
#endif

/* Now safe to read rx_buffer - fresh data from DMA */
process_data(rx_buffer, sizeof(rx_buffer));
3. Clean + Invalidate Combined
When a buffer is used for bidirectional DMA (e.g., half-duplex SPI with the same buffer), or when you are not sure of the dirty state, use the combined operation:
SCB_CleanInvalidateDCache_by_Addr((uint32_t *)buf, size);
This writes back any dirty lines and then marks them invalid, so the next read hits the bus.
Alignment and Line-Size Gotchas
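SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr() operate on whole 32-byte cache lines. If your buffer does not start on a 32-byte boundary or its size is not a multiple of 32, the operation touches data outside the buffer: an invalidate can silently discard pending CPU writes to neighbouring variables that share a line with it. Depending on the CMSIS version, an unaligned start address can also cause the last line of the buffer to be missed. Always align DMA buffers to 32 bytes and pad their size to a multiple of 32 bytes.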
Use GCC/Clang attributes or a linker-section approach:
/* Force 32-byte alignment with the attribute */
static uint8_t rx_buffer[256] __attribute__((aligned(32)));
static uint8_t tx_buffer[256] __attribute__((aligned(32)));
/* Or for larger buffers, use a dedicated non-cacheable section:
* In linker script: .NonCacheable (NOLOAD) : { *(.noncacheable) } > SRAM
*/
__attribute__((section(".noncacheable")))
static uint8_t dma_pool[4096];
Alternatively, place DMA buffers in SRAM4 or SRAM3 on STM32H7, which can be configured as non-cacheable via the MPU (see the MPU section below). But even with alignment, you still need explicit maintenance whenever the buffer lives in cacheable memory such as AXI SRAM. DTCM is the exception: it is tightly coupled to the core and is never cached.
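The line-boundary arithmetic behind these gotchas can be sketched in a few lines. This is a host-side illustration, not part of CMSIS; `cache_line_span`, `owns_whole_lines`, and `CACHE_LINE_ROUND_UP` are hypothetical helpers that show which bytes a by-address operation really covers and how to size a buffer so that it owns its lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u

/* Round a buffer length up to a whole number of cache lines. */
#define CACHE_LINE_ROUND_UP(n) \
    (((size_t)(n) + CACHE_LINE - 1u) & ~(size_t)(CACHE_LINE - 1u))

typedef struct {
    uintptr_t start;  /* buffer address rounded down to a line boundary */
    size_t    size;   /* bytes covered, always a multiple of CACHE_LINE */
} line_span_t;

/* Compute the full set of cache lines that overlap [addr, addr + size). */
line_span_t cache_line_span(uintptr_t addr, size_t size)
{
    assert(size > 0);
    const uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1u);
    const uintptr_t start = addr & mask;
    const uintptr_t end   = (addr + size + CACHE_LINE - 1u) & mask;
    line_span_t span = { start, (size_t)(end - start) };
    return span;
}

/* True when the buffer shares no cache line with any other data. */
bool owns_whole_lines(uintptr_t addr, size_t size)
{
    return (addr % CACHE_LINE) == 0u && (size % CACHE_LINE) == 0u;
}
```

For example, a 40-byte buffer at 0x24000010 overlaps two lines (0x24000000 to 0x2400003F), so invalidating it affects 24 bytes that belong to something else. Declaring the buffer with 32-byte alignment and a length of `CACHE_LINE_ROUND_UP(40)` avoids that.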
Practical Example: SPI DMA Transfer with Correct Cache Handling
Here is a complete pattern for a cache-safe SPI transaction on STM32H7 (add a timeout to the wait loop before shipping it):
/* stm32h7_spi_dma.c: cache-safe SPI DMA transaction */
#include <string.h>              /* memcpy */
#include "stm32h7xx_hal.h"
#include "cmsis_compiler.h"

#define DMA_BUF_SIZE 256         /* multiple of the 32-byte cache line */

extern SPI_HandleTypeDef hspi1;  /* defined elsewhere in the application */

/* 32-byte aligned DMA buffers.
 * They must be linked into DMA-reachable RAM (e.g. AXI SRAM), not DTCM. */
static __ALIGNED(32) uint8_t dma_tx[DMA_BUF_SIZE];
static __ALIGNED(32) uint8_t dma_rx[DMA_BUF_SIZE];

/* Completion flag set in interrupt context; volatile so the wait loop re-reads it */
static volatile uint8_t xfer_done = 0;

/* Callback from HAL */
void HAL_SPI_TxRxCpltCallback(SPI_HandleTypeDef *hspi)
{
    if (hspi == &hspi1) {
        xfer_done = 1;
    }
}

/**
 * @brief Perform a cache-coherent SPI DMA transaction.
 * @param tx_data data to transmit (copied into aligned buffer)
 * @param rx_data buffer to receive into (copied out after invalidate)
 * @param len transaction length (must be ≤ DMA_BUF_SIZE)
 * @retval HAL_StatusTypeDef
 */
HAL_StatusTypeDef spi_dma_transaction(const uint8_t *tx_data, uint8_t *rx_data, uint16_t len)
{
    HAL_StatusTypeDef ret;

    /* Reject lengths that would overflow the static DMA buffers */
    if (len == 0 || len > DMA_BUF_SIZE) {
        return HAL_ERROR;
    }

    /* Copy user data into aligned DMA buffer */
    memcpy(dma_tx, tx_data, len);

    /* Clean D-cache before DMA reads the TX buffer */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT == 1
    SCB_CleanDCache_by_Addr((uint32_t *)dma_tx, len);
#endif

    xfer_done = 0;

    /* Start full-duplex SPI DMA */
    ret = HAL_SPI_TransmitReceive_DMA(&hspi1, dma_tx, dma_rx, len);
    if (ret != HAL_OK) {
        return ret;
    }

    /* Wait for completion (with timeout in real code) */
    while (!xfer_done) {
        __WFE();
    }

    /* Invalidate D-cache before CPU reads the RX buffer */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT == 1
    SCB_InvalidateDCache_by_Addr((uint32_t *)dma_rx, len);
#endif

    /* Copy fresh data from DMA to user buffer */
    memcpy(rx_data, dma_rx, len);
    return HAL_OK;
}
Key points: the user data is copied into aligned DMA buffers, the cache is cleaned before the DMA starts and invalidated after it completes, and the completion flag is volatile because it is written from interrupt context. The flag is only ever touched by the CPU, so it needs no cache maintenance of its own. The same pattern applies to UART, I²S, ADC, DAC, and any other DMA transfer to or from cacheable memory.
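The wait loop in the example has no timeout, which the code comment flags as something to add. One possible shape for it is sketched below, assuming a millisecond tick source is available. `wait_for_flag` and `fake_tick_ms` are hypothetical names; on target you would pass `HAL_GetTick` as the tick function, while `fake_tick_ms` only exists so the logic can be tried on a host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Any function returning a free-running millisecond counter. */
typedef uint32_t (*tick_fn_t)(void);

/* Poll *flag until it becomes non-zero or timeout_ms elapses.
 * Returns true if the flag was set, false on timeout. */
bool wait_for_flag(const volatile uint8_t *flag, uint32_t timeout_ms, tick_fn_t now_ms)
{
    assert(flag != NULL && now_ms != NULL);

    const uint32_t start = now_ms();
    while (*flag == 0u) {
        /* Unsigned subtraction stays correct when the tick counter wraps. */
        if ((uint32_t)(now_ms() - start) >= timeout_ms) {
            return false;
        }
    }
    return true;
}

/* Host-side stand-in for a tick source: advances 1 ms per call. */
uint32_t fake_tick_ms(void)
{
    static uint32_t t;
    return t++;
}
```

On a timeout the caller would normally abort the DMA transfer before reusing the buffers, since the peripheral may still be writing to them.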
Using the MPU to Mark DMA Regions as Non-Cacheable
For some systems, explicit cache maintenance on every transaction is too brittle. A cleaner architectural approach is to dedicate a section of SRAM as non-cacheable by programming the MPU. The MPU on Cortex-M7 can override the default memory attributes on a per-region basis.
/* Configure MPU region for a 16-KB non-cacheable DMA pool in AXI SRAM */
void MPU_Config_DMA_Pool(void)
{
    MPU_Region_InitTypeDef MPU_Init = {0};

    /* Disable MPU before configuration */
    HAL_MPU_Disable();

    /* Region: 0x24004000-0x24007FFF, 16 KB inside AXI SRAM.
     * The base address must be aligned to the region size. */
    MPU_Init.Enable           = MPU_REGION_ENABLE;
    MPU_Init.Number           = MPU_REGION_NUMBER1;
    MPU_Init.BaseAddress      = 0x24004000;
    MPU_Init.Size             = MPU_REGION_SIZE_16KB;
    MPU_Init.SubRegionDisable = 0x00;
    MPU_Init.AccessPermission = MPU_REGION_FULL_ACCESS;
    MPU_Init.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
    MPU_Init.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;

    /* TEX=1, C=0, B=0 selects Normal, non-cacheable memory.
     * TEX=1, C=0, B=1 is a reserved encoding, so B must stay 0 here. */
    MPU_Init.TypeExtField     = MPU_TEX_LEVEL1;
    MPU_Init.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE; /* !!! */
    MPU_Init.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;

    HAL_MPU_ConfigRegion(&MPU_Init);

    /* Enable MPU with PRIVDEFENA (enable default memory map for priv) */
    HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);

    /* Ensure the MPU config takes effect before any DMA operations */
    __DSB();
    __ISB();
}
With this MPU configuration, any variable placed in the address range 0x24004000–0x24007FFF is never cached by the D-cache. No explicit clean/invalidate is needed. The trade-off is a 15–20% performance hit for CPU accesses to that region, since every load/store goes directly to SRAM.
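If you want to check what such a region looks like at register level, the attribute word can be assembled by hand. The sketch below follows the ARMv7-M MPU_RASR field layout and builds the value for a Normal, non-cacheable, non-shareable, execute-never region with full access (TEX=0b001, C=0, B=0). The function names are hypothetical, and this is host-side arithmetic rather than a replacement for the HAL call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field positions in the ARMv7-M MPU_RASR register. */
#define RASR_ENABLE   (1u << 0)
#define RASR_SIZE_POS 1u
#define RASR_TEX_POS  19u
#define RASR_AP_POS   24u
#define RASR_XN       (1u << 28)

#define RASR_AP_FULL_ACCESS 3u
#define RASR_TEX_LEVEL1     1u

/* The SIZE field encodes a region of 2^(SIZE+1) bytes (minimum 32 bytes). */
uint32_t rasr_size_field(uint32_t region_bytes)
{
    /* Region size must be a power of two and at least 32 bytes. */
    assert(region_bytes >= 32u && (region_bytes & (region_bytes - 1u)) == 0u);

    uint32_t exponent = 0u;
    while ((1u << exponent) < region_bytes) {
        exponent++;
    }
    return exponent - 1u;
}

/* RASR value for Normal non-cacheable memory: TEX=1, C=0, B=0, S=0. */
uint32_t rasr_normal_noncacheable(uint32_t region_bytes)
{
    return RASR_XN
         | (RASR_AP_FULL_ACCESS << RASR_AP_POS)
         | (RASR_TEX_LEVEL1 << RASR_TEX_POS)
         | (rasr_size_field(region_bytes) << RASR_SIZE_POS)
         | RASR_ENABLE;
}

/* An MPU region base address must be a multiple of the region size. */
bool mpu_base_is_aligned(uint32_t base, uint32_t region_bytes)
{
    return (base & (region_bytes - 1u)) == 0u;
}
```

The alignment check matters when you move the pool: 0x24004000 is a valid base for a 16-KB region, whereas 0x24002000 is not.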
In practice, I use a hybrid approach: a small MPU-based non-cacheable section for frequently-accessed DMA buffers (ADC double buffers, Ethernet descriptors), and explicit cache maintenance for large or infrequent transfers (SPI flash reads, UART packets).
Practical Checklist
| Scenario | Action |
|---|---|
| CPU writes buffer, then DMA reads it (TX path) | SCB_CleanDCache_by_Addr() before starting DMA |
| DMA writes buffer, then CPU reads it (RX path) | SCB_InvalidateDCache_by_Addr() after DMA complete |
| Bidirectional DMA (same buffer) | SCB_CleanInvalidateDCache_by_Addr() |
| Buffer is const or lives in flash | No action needed (flash is read-only, never cached for write) |
| Frequent small DMA (double-buffered ADC) | MPU non-cacheable region recommended |
| Ethernet descriptor rings (ETH DMA) | MPU non-cacheable — always |
| DMA to/from DTCM (0x20000000) | DTCM bypasses the D-cache, so no maintenance is needed, but on parts such as the STM32H743 only MDMA can reach it; DMA1, DMA2 and BDMA cannot. Keep DTCM for CPU-hot data and put DMA buffers in AXI SRAM or the D2/D3 SRAMs |
| Using HAL with DMA | HAL does NOT manage cache — you must add SCB calls yourself |
How I Would Approach This on a Client Project
On any STM32H7 project, the first thing I do before writing a single line of application code is set up the cache and MPU policy. Here is my standard template:
- Leave the D-cache disabled (its reset state) during initialisation while configuring clocks and GPIO; this saves debugging time during bring-up.
- Enable I-cache unconditionally (instruction cache has no coherency issues).
- Allocate a dedicated DMA pool in a linker section mapped to AXI SRAM.
- Configure a single MPU region marking that pool as non-cacheable, strong ordering (no speculative accesses).
- Enable D-cache and the MPU in a controlled order: enable MPU → DSB/ISB → enable D-cache.
- Add a thin wrapper around the DMA HAL callbacks that performs cache maintenance based on the transfer direction. This wrapper is reused across all peripherals.
This prevents the "works on Nucleo, crashes on prototype" class of bugs that come from running debug builds without cache optimisation, then enabling D-cache for the production build and discovering all the missing maintenance calls.
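The direction-based wrapper from the last step of the template could look roughly like this. It is a sketch under assumptions, not the code I ship: all names are hypothetical, and the cache operations are passed in as function pointers so the logic can be exercised on a host. On target, the two pointers would wrap SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr(); the counting stubs here are only stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    DMA_DIR_MEM_TO_PERIPH,  /* CPU fills the buffer, DMA reads it  */
    DMA_DIR_PERIPH_TO_MEM   /* DMA fills the buffer, CPU reads it  */
} dma_dir_t;

/* Injected cache operations (thin wrappers around the SCB_* calls on target). */
typedef struct {
    void (*clean)(void *addr, int32_t size);
    void (*invalidate)(void *addr, int32_t size);
} cache_ops_t;

/* Call immediately before starting the DMA transfer. */
void dma_cache_before_start(const cache_ops_t *ops, dma_dir_t dir, void *buf, int32_t size)
{
    assert(ops != NULL && buf != NULL && size > 0);
    if (dir == DMA_DIR_MEM_TO_PERIPH) {
        ops->clean(buf, size);       /* push CPU writes out to SRAM */
    }
}

/* Call from the transfer-complete callback, before the CPU reads buf. */
void dma_cache_after_complete(const cache_ops_t *ops, dma_dir_t dir, void *buf, int32_t size)
{
    assert(ops != NULL && buf != NULL && size > 0);
    if (dir == DMA_DIR_PERIPH_TO_MEM) {
        ops->invalidate(buf, size);  /* drop stale cached copies */
    }
}

/* Host-side stand-ins that only count calls. */
int clean_calls;
int invalidate_calls;

void count_clean(void *addr, int32_t size)
{
    (void)addr; (void)size;
    clean_calls++;
}

void count_invalidate(void *addr, int32_t size)
{
    (void)addr; (void)size;
    invalidate_calls++;
}

const cache_ops_t counting_ops = { count_clean, count_invalidate };
```

Keeping the direction decision in one place means a new peripheral driver cannot forget the maintenance call, only pick the wrong direction, which is much easier to spot in review.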
Sources and Further Reading
- Arm Cortex-M7 Processor Technical Reference Manual, r1p2, Chapter 5: Memory System, §5.5 L1 Caches. developer.arm.com/documentation/ddi0489/latest/
- ST Application Note AN4839: Management of data cache and MPU on STM32H7 series. st.com/an4839
- ST Application Note AN4807: STM32H7 memory mapping and cache configuration guidelines.
- CMSIS-Core (Cortex-M) SCB Functions: SCB_CleanDCache, SCB_InvalidateDCache, SCB_CleanInvalidateDCache. Arm CMSIS 5.9.0+.
- STM32CubeH7 firmware examples: CACHE_CleanInvalidate, CACHE_NonCacheable under Projects/.
- ST Community: "STM32H7 D-Cache and DMA: How to Proceed"