STM32 DMA Circular Mode with Double Buffering: The Pattern That Prevents Data Loss

2026-05-29 · Davide Carrese

STM32 · DMA · Firmware Architecture

DMA on STM32 is one of those peripherals that looks simple in a CubeMX screenshot but causes subtle data corruption in production. The most common failure mode I see in client projects is not a configuration error in the DMA init struct. It is the absence of a double-buffering strategy: the application reads from a DMA buffer while the DMA controller is still writing into it. This article shows the circular-mode + half-transfer-interrupt pattern that solves that problem and explains why it matters on Cortex-M7 parts with data cache.

The problem: single-buffer DMA is a race condition by design

When you configure a DMA stream to fill a buffer in normal mode and process the buffer after the transfer-complete interrupt, the firmware is correct but throughput is limited: the CPU waits for the entire buffer before doing any work. That is fine for occasional transfers, but for continuous ADC sampling at 100 kHz, a UART stream at several megabaud, or a DAC waveform generator, the time spent waiting for a full buffer is wasted. The natural instinct is to use circular mode so the DMA keeps transferring while the application reads from the same buffer. That instinct creates the race.

In circular mode, the DMA controller wraps back to the start of the buffer when it reaches the end. If the application reads from the buffer at any point without synchronisation, it may read partially updated data. There is no hardware mutex. The only clean solution on a single-core STM32 is to split the buffer into two halves and let the DMA controller signal each half-completion through a dedicated interrupt: the half-transfer interrupt and the transfer-complete interrupt. The application processes the half that just completed while the DMA controller fills the other half. This is double buffering, and it is the foundation of most reliable DMA pipelines on STM32.

How STM32 DMA signals half and complete transfers

Every STM32 DMA stream can raise interrupts for three events that matter here: half-transfer (HT), transfer-complete (TC), and transfer-error (TE). HT fires when the DMA controller has transferred exactly half of the configured number of data items. In circular mode, HT fires at the midpoint of the buffer, then TC at the end, then the pointer wraps, HT fires again at the midpoint of the next cycle, and so on.

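In the HAL, these interrupts surface as peripheral-level callbacks such as HAL_ADC_ConvHalfCpltCallback(), or as the generic XferHalfCpltCallback function pointer on the DMA handle, depending on the peripheral driver. The key insight is that when HT fires, the first half of the buffer is stable and can be processed. When TC fires, the second half is stable. Each half stays stable only until the DMA wraps back into it, so the application has one half-buffer period to process or copy it. Within that window, the application always works on the half that is not being written by DMA.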
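How long a completed half stays stable can be put in numbers. The helper below is a minimal sketch, not part of the HAL: half_buffer_window_us is a hypothetical name, and it assumes a constant sample rate with one DMA item per sample. It returns the time the DMA needs to fill the other half, which is the deadline for processing or copying the half that just completed.

```c
#include <stdint.h>

/* Hypothetical helper: time in microseconds that a completed half-buffer
   stays stable, i.e. the time the DMA needs to fill the other half.
   Assumes one DMA item per sample at a constant sample rate. */
static uint32_t half_buffer_window_us(uint32_t frame_samples,
                                      uint32_t sample_rate_hz)
{
    if (sample_rate_hz == 0U) {
        return 0U; /* no meaningful window without a sample clock */
    }
    /* 64-bit intermediate avoids overflow for large frames */
    return (uint32_t)(((uint64_t)frame_samples * 1000000ULL) / sample_rate_hz);
}
```

For a 64-sample frame at 100 kHz this gives 640 microseconds. Interrupt latency and the copy itself eat into that budget, so the usable processing time is somewhat shorter.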

A minimal ADC double-buffer pattern

The example below uses the STM32 ADC with DMA in circular mode. It keeps the HAL calls visible rather than dropping to register level, because most client codebases I inherit are HAL-based. The buffer is twice the size of one "frame" of samples so that each half is a complete frame.

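#include "stm32f4xx_hal.h"
#include <string.h>   /* memcpy */

/* Assumed to be defined and initialised elsewhere, e.g. by CubeMX */
extern ADC_HandleTypeDef hadc1;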

#define ADC_FRAME_SAMPLES  64
#define ADC_BUF_SIZE       (ADC_FRAME_SAMPLES * 2)

static uint16_t adc_buf[ADC_BUF_SIZE];
static volatile uint8_t adc_frame_ready; /* 0=none, 1=first half, 2=second half */

static uint16_t adc_frame[ADC_FRAME_SAMPLES];

void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
{
    if (hadc->Instance == ADC1) {
        memcpy(adc_frame, &adc_buf[0],
               ADC_FRAME_SAMPLES * sizeof(uint16_t));
        adc_frame_ready = 1;
    }
}

void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
{
    if (hadc->Instance == ADC1) {
        memcpy(adc_frame, &adc_buf[ADC_FRAME_SAMPLES],
               ADC_FRAME_SAMPLES * sizeof(uint16_t));
        adc_frame_ready = 2;
    }
}

void adc_dma_init(void)
{
    __HAL_RCC_DMA2_CLK_ENABLE();

    ADC_ChannelConfTypeDef sConfig = {0};
    sConfig.Channel = ADC_CHANNEL_0;
    sConfig.Rank = 1;
    sConfig.SamplingTime = ADC_SAMPLETIME_3CYCLES;
    HAL_ADC_ConfigChannel(&hadc1, &sConfig);

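    /* Length is the number of DMA items (16-bit samples here), not bytes.
       Circular mode is assumed to be set in the DMA init for this stream. */
    HAL_ADC_Start_DMA(&hadc1,
        (uint32_t *)adc_buf, ADC_BUF_SIZE);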
}

void adc_process_loop(void)
{
    for (;;) {
        if (adc_frame_ready) {
            uint8_t frame = adc_frame_ready;
            adc_frame_ready = 0;

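            /* Process adc_frame[] here. It is a copy made in the
               callback, so it must be consumed before the next
               HT/TC event overwrites it. */
            (void)frame;

            /* On Cortex-M7 the cache invalidate belongs in the
               callbacks, before the memcpy (see below). */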
        }
        __WFI();
    }
}

This pattern carries over to STM32F4, F7, H7, G0, G4, L4, L5, and U5. The HAL callback names vary slightly across families but the concept is identical. On F0/F1/L0/L1 parts, confirm in the reference manual that the DMA channel you use supports the half-transfer interrupt; if it does not, the fallback is the transfer-complete interrupt plus a software timer.
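One weakness of the single adc_frame_ready flag in the example above is that a missed deadline goes unnoticed: if the main loop has not consumed a frame before the next half completes, the frame is silently overwritten. The sketch below adds an overrun counter to the handoff. It is host-portable C rather than HAL code, and frame_handoff_t, handoff_publish and handoff_take are hypothetical names; on target, handoff_publish would be called from the HT and TC callbacks.

```c
#include <stdint.h>

typedef struct {
    volatile uint8_t  ready;    /* 0 = none, 1 = first half, 2 = second half */
    volatile uint32_t overruns; /* frames published before the previous one was taken */
} frame_handoff_t;

/* Call from the HT callback with half = 1 and from the TC callback
   with half = 2. */
static void handoff_publish(frame_handoff_t *h, uint8_t half)
{
    if (h->ready != 0U) {
        h->overruns++;          /* consumer missed its deadline */
    }
    h->ready = half;
}

/* Call from the main loop. Returns 0 if nothing is pending.
   On target, mask the DMA interrupt around the read and the clear so
   that a publish landing between them is not lost. */
static uint8_t handoff_take(frame_handoff_t *h)
{
    uint8_t half = h->ready;
    h->ready = 0U;
    return half;
}
```

A non-zero overruns count during integration testing is a direct sign that the processing loop does not fit into one half-buffer period.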

UART RX with DMA circular mode: the silent overflow

UART reception with DMA in circular mode is another place where the half-transfer interrupt solves a real problem. Without it, the application must poll the DMA NDTR register to discover how many bytes have arrived since the last check. On a busy system, the DMA can wrap the entire buffer between two polls and overwrite old data. The application will never know, because NDTR only reports the current position within the buffer: after a full wrap it holds a perfectly plausible value while the data is gone.

The reliable pattern is the standard UART idle-line detection combined with DMA circular mode and HT/TC. The HT callback processes the first half, the TC callback processes the second half, and the UART idle interrupt processes whatever bytes have arrived in the current half since the last DMA event. Combined with a ring-buffer abstraction in the application layer, this gives zero-copy UART RX that tolerates bursts.
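The bookkeeping behind that pattern is small. The sketch below is host-portable C under stated assumptions: uart_rx_ring_t, uart_rx_drain and capture_sink are hypothetical names, and the caller passes in the current value of the DMA remaining-items counter (NDTR) instead of reading a register. All three handlers (HT, TC, idle) call the same drain function, which hands new bytes to a sink in at most two contiguous chunks without copying them.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*rx_sink_fn)(const uint8_t *data, size_t len, void *ctx);

typedef struct {
    const uint8_t *buf;  /* DMA circular buffer */
    size_t         len;  /* buffer length in bytes */
    size_t         tail; /* index of the first byte not yet consumed */
} uart_rx_ring_t;

/* Call from the HT, TC and idle-line handlers with the current NDTR value. */
static void uart_rx_drain(uart_rx_ring_t *r, size_t ndtr,
                          rx_sink_fn sink, void *ctx)
{
    /* NDTR counts down from len; the write index counts up from 0. */
    size_t head = (r->len - ndtr) % r->len;

    if (head == r->tail) {
        return;                                        /* nothing new */
    }
    if (head > r->tail) {
        sink(&r->buf[r->tail], head - r->tail, ctx);
    } else {
        sink(&r->buf[r->tail], r->len - r->tail, ctx); /* up to the end */
        if (head > 0U) {
            sink(&r->buf[0], head, ctx);               /* wrapped part */
        }
    }
    r->tail = head;
}

/* Example sink for illustration: appends bytes to a flat capture buffer. */
typedef struct {
    uint8_t data[32];
    size_t  n;
} capture_t;

static void capture_sink(const uint8_t *data, size_t len, void *ctx)
{
    capture_t *cap = (capture_t *)ctx;
    for (size_t i = 0; i < len && cap->n < sizeof cap->data; i++) {
        cap->data[cap->n++] = data[i];
    }
}
```

Because HT and TC guarantee a call at least once per half buffer, the unread span never exceeds the buffer length as long as the handlers run in time, which is what polling alone cannot promise.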

On STM32G0/G4/L5 parts that have a DMA request multiplexer (DMAMUX), you must map the UART RX request to the correct DMA channel through the DMAMUX. On U5, the GPDMA selects the request in the channel's own configuration instead, with the same failure mode if it is missing. CubeMX generates this mapping by default, but when I see a hand-written DMA init that forgets it, the channel never triggers and the first debugging step is staring at the DMA status register showing zero transfers.

Cortex-M7 data cache: the hidden source of DMA corruption

On STM32F7 and STM32H7 parts with a Cortex-M7 core, the data cache adds another layer to this problem even when the double-buffer logic is correct. The DMA controller writes directly to SRAM. The CPU reads from the same SRAM addresses, but if those addresses happen to be cached, the CPU may read a stale cache line instead of the freshly DMA-written data.

The fix is explicit cache maintenance in the HT and TC callbacks before the application reads the half-buffer that just completed. For a buffer allocated in a non-cacheable region (through the MPU), no maintenance is needed, but the performance penalty of non-cacheable accesses may be unacceptable for the processing loop. The pragmatic approach is to keep the DMA buffer in a cacheable region and invalidate the relevant cache lines before each read. Invalidate is the operation for DMA-written (receive) buffers; clean is the counterpart for buffers the CPU fills and the DMA reads.

/* Cortex-M7 cache invalidate for DMA buffer halves */
void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
{
    if (hadc->Instance == ADC1) {
#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U)
        SCB_InvalidateDCache_by_Addr(
            (uint32_t *)&adc_buf[0],
            ADC_FRAME_SAMPLES * sizeof(uint16_t));
#endif
        memcpy(adc_frame, &adc_buf[0],
               ADC_FRAME_SAMPLES * sizeof(uint16_t));
        adc_frame_ready = 1;
    }
}

Note that SCB_InvalidateDCache_by_Addr() operates on 32-byte cache lines. Recent CMSIS versions round the address down and the size up to line boundaries, but if your buffer does not start and end on a 32-byte boundary, the invalidate also covers adjacent data used by other code, and any value in those lines that has not yet been written back to SRAM is lost. Align DMA buffers to 32 bytes on Cortex-M7 and make each half a multiple of 32 bytes. Alternatively, the MPU can mark the specific SRAM region used by DMA as non-cacheable, shared, or strongly-ordered. That removes the need for manual cache maintenance but makes every CPU access to that region slower. Choose based on the processing throughput the application needs.
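The alignment rule can be enforced at compile time. The sketch below assumes a C11 compiler and reuses the frame size from the ADC example; adc_buf_aligned, DCACHE_LINE_BYTES and dcache_span_bytes are hypothetical names introduced for illustration. The helper only computes how many bytes a by-address invalidate would cover once the range is widened to whole cache lines, so the cost of a misaligned buffer is visible.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_BYTES  32U  /* Cortex-M7 data cache line size */
#define ADC_FRAME_SAMPLES  64U
#define ADC_BUF_SIZE       (ADC_FRAME_SAMPLES * 2U)

/* C11 alignment specifier: the buffer starts on a cache-line boundary. */
static _Alignas(32) uint16_t adc_buf_aligned[ADC_BUF_SIZE];

/* Each half must cover whole cache lines, otherwise invalidating one
   half would also touch a line that belongs to the other half. */
_Static_assert((ADC_FRAME_SAMPLES * sizeof(uint16_t)) % DCACHE_LINE_BYTES == 0,
               "half-buffer size must be a multiple of the cache line size");

/* Bytes covered when [addr, addr + len) is widened to whole cache lines. */
static size_t dcache_span_bytes(uintptr_t addr, size_t len)
{
    uintptr_t mask  = ~(uintptr_t)(DCACHE_LINE_BYTES - 1U);
    uintptr_t start = addr & mask;
    uintptr_t end   = (addr + len + DCACHE_LINE_BYTES - 1U) & mask;
    return (size_t)(end - start);
}
```

A 128-byte half that starts on a line boundary spans exactly 128 bytes. The same half starting 16 bytes into a line spans 160 bytes, and the extra 32 bytes belong to whatever the linker placed next to the buffer.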

DMA stream priority conflicts on busier parts

On parts like the STM32F4 with two DMA controllers and eight streams each, or the STM32H7 with multiple DMA instances spread across domain buses, stream priority is real and poorly understood. When two streams contend for the same bus or the same AHB matrix port, the lower-priority stream stalls. If that stream is feeding a DAC or a UART TX FIFO, the stall can cause a visible gap in the output.

For double-buffered ADC or UART RX, stream priority rarely matters because the producer (DMA) is ahead of the consumer (CPU). But for DMA-driven DAC, SPI TX, or SAI/I2S output, give the time-critical stream a higher priority level and avoid sharing the same DMA controller between a high-rate output stream and a bulk memory-to-memory copy. On H7, use BDMA for the low-speed peripherals it can reach and MDMA for memory-to-memory so the main DMA1/DMA2 controllers are free for real-time streams.

Practical example: a 4-channel ADC data logger with UART export

Consider a product that samples four ADC channels at 50 kHz each, buffers one second of data, and exports it over UART at 921600 baud. The naive approach starts one DMA circular buffer per channel and processes each one in its own HT/TC callback. At 50 kHz × 4 channels × 2 bytes per sample, the DMA throughput is 400 KB/s. That is within the capability of any STM32F4 DMA controller, but four concurrent streams fighting for the AHB matrix plus four sets of callbacks firing at different cadences makes the firmware hard to reason about and debug.
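The numbers in this example are worth checking against the UART side as well. The sketch below is plain arithmetic in C with hypothetical helper names, and it assumes 8N1 framing, which puts 10 bits on the wire per payload byte.

```c
#include <stdint.h>

/* ADC side: bytes produced per second. */
static uint32_t adc_bytes_per_second(uint32_t rate_hz, uint32_t channels,
                                     uint32_t bytes_per_sample)
{
    return rate_hz * channels * bytes_per_sample;
}

/* UART side: payload bytes per second for a given number of bits on
   the wire per byte (10 for 8N1: start + 8 data + stop). */
static uint32_t uart_bytes_per_second(uint32_t baud, uint32_t bits_per_byte)
{
    return (bits_per_byte != 0U) ? (baud / bits_per_byte) : 0U;
}

/* Milliseconds needed to send `bytes` at `bytes_per_s`, rounded up. */
static uint32_t export_time_ms(uint32_t bytes, uint32_t bytes_per_s)
{
    if (bytes_per_s == 0U) {
        return 0U;
    }
    return (uint32_t)(((uint64_t)bytes * 1000ULL + bytes_per_s - 1U)
                      / bytes_per_s);
}
```

With these assumptions the ADC produces 400000 bytes per second while the UART carries 92160, so one second of raw data takes roughly 4.3 s to export. That suggests the export in this example runs after a capture window rather than continuously, or that the data is reduced before it is sent.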

I would scan all four channels in a single regular ADC sequence (scan mode) with one DMA stream, producing a multiplexed buffer where every four samples correspond to one scan of the four channels. Injected conversions are not an option here because they do not generate DMA requests. The HT/TC callbacks then hand each half to the processing stage, which demultiplexes the frame into four separate arrays. The result is one DMA stream, one set of interrupts, and predictable timing, and the UART TX stream has its own DMA channel on a different controller so the two pipelines do not interfere.
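The demultiplexing step is a short loop. The sketch below assumes the scan order in the buffer is channel 0, 1, 2, 3, repeated, and that each half holds a whole number of scans; adc_demux and NUM_CHANNELS are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_CHANNELS 4U

/* Splits an interleaved scan buffer (ch0, ch1, ch2, ch3, ch0, ...) into
   per-channel arrays. `scans` is the number of complete scans in `src`;
   each out[c] must have room for `scans` samples. */
static void adc_demux(const uint16_t *src, size_t scans,
                      uint16_t *const out[NUM_CHANNELS])
{
    for (size_t s = 0; s < scans; s++) {
        for (size_t c = 0; c < NUM_CHANNELS; c++) {
            out[c][s] = src[s * NUM_CHANNELS + c];
        }
    }
}
```

Keeping the half-buffer length a multiple of four samples means every HT/TC event delivers complete scans, so the loop never has to carry a partial scan across callbacks.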

Practical checklist

Never let the application read from a DMA buffer region that is currently being written by the controller. Use HT/TC interrupt pairing.
On Cortex-M7 (STM32F7/H7), invalidate data cache before reading DMA-written buffers, or configure the MPU to mark that SRAM region as non-cacheable.
Align DMA buffers to 32-byte boundaries on Cortex-M7 to avoid cache-line spill-over.
On parts with DMAMUX (G0, G4, L5, H7), verify the DMAMUX request mapping; on GPDMA parts (U5, H5), verify the channel's request selection. A missing request mapping is a silent no-transfer bug.
Use the UART idle interrupt together with DMA HT/TC for reliable variable-length UART RX without polling NDTR.
Assign DMA stream priorities for output streams (DAC, SPI TX, SAI). Input streams with double buffering can run at default priority.
On STM32H7, separate real-time DMA streams onto DMA1/DMA2 and offload memory copies to MDMA.
Measure DMA overhead with a GPIO toggle inside the HT/TC callbacks during integration testing.

How I would approach this on a client project

I start by listing every DMA consumer in the system: ADC streams, UART RX/TX, SPI transactions, DAC output, timer-triggered captures. I assign each one a buffer design explicitly: single-buffer with stop/restart, double-buffer with HT/TC, or circular with ring-buffer consumption. Then I draw the DMA controller allocation across the available instances and channels, avoiding conflicts between time-critical output streams and bulk memory operations. Cache maintenance goes into a central header with #if defined(__DCACHE_PRESENT) guards so it compiles cleanly across STM32 families. Finally, I instrument each HT/TC callback with a GPIO trace during the first integration sprint. Without that trace, the real interrupt latency on the target PCB is invisible.
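A central cache-maintenance wrapper of that kind can be as small as the sketch below. dma_rx_invalidate is a hypothetical name; on target, the CMSIS device header (pulled in by the HAL header) is assumed to be included first so that __DCACHE_PRESENT and SCB_InvalidateDCache_by_Addr() are visible. On cores without a data cache the function compiles to a no-op.

```c
#include <stddef.h>
#include <stdint.h>

/* On target, include the CMSIS device header before this point, for
   example through the family HAL header. */

/* Invalidate the data cache over a DMA-written region before the CPU
   reads it. Does nothing on cores without a data cache. */
static inline void dma_rx_invalidate(void *addr, size_t len)
{
#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U)
    SCB_InvalidateDCache_by_Addr((uint32_t *)addr, (int32_t)len);
#else
    (void)addr; /* no data cache: nothing to maintain */
    (void)len;
#endif
}
```

Callbacks then call dma_rx_invalidate() on the completed half before touching it, and the same source builds unchanged for M4 and M7 targets.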

In code review, the first thing I look for in a DMA callback is a memcpy without a preceding cache invalidate on M7 parts. The second thing I look for is a UART RX DMA without an idle-line interrupt. The third is a circular-mode buffer where the application reads without any synchronisation to the half/complete boundary. These three patterns account for most of the DMA-related field bugs I have debugged.
