STM32 USART Interrupt-Driven Ring Buffer: Lock-Free RX/TX on STM32F4

2026-06-04 · Davide Carrese

STM32 · USART · STM32F4 · Embedded

Every embedded project needs serial I/O. Debug logging, CLI commands, sensor polling, bootloader protocols — USART is the universal backchannel. The naive approach (blocking HAL_UART_Transmit/Receive) works for prototypes, but the moment you need concurrent code paths — a control loop running while you print, or a command arriving while a sensor acquisition is in progress — blocking UART becomes a bottleneck. An interrupt-driven ring buffer decouples the ISR from the application context without a single lock, without malloc, and with bounded latency. Here is how I implement one on STM32F4 at register level.

Why ring buffers, and why lock-free

A ring buffer (circular buffer) is a fixed-size FIFO backed by a linear array. Two pointers track the state: head (where the next byte is written) and tail (where the next byte is read). When either pointer reaches the end of the buffer, it wraps around to index zero.

The beauty of a single-producer, single-consumer (SPSC) ring buffer is that no atomic operations or mutexes are needed, provided you obey two rules:

The producer (RX ISR, in this case) writes to head and advances it. It never touches tail.
The consumer (application code, or main loop) reads from tail and advances it. It never touches head.

Because each index is written by exactly one context, there is never a data race on the pointers themselves (an aligned 16-bit load or store is a single access on Cortex-M4, so the other side never sees a half-updated index). For the data bytes, the producer writes into a slot before advancing head, which publishes the byte, and the consumer reads from a slot before advancing tail, which releases the slot. With volatile-qualified indexes and a __DMB() between the data access and the index update, this is correct on Cortex-M4 without explicit atomics.

USART register-level setup on STM32F4

Before we talk about the buffer, the USART must be configured. On STM32F401RE (USART2 on PA2-TX, PA3-RX via the Nucleo header, or the virtual COM port on PA2/PA3 routed through the ST-Link), the peripheral registers are:

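USART_BRR: baud rate divider. With oversampling by 16, BRR is the peripheral clock divided by the baud rate, rounded to the nearest integer. On the F401, USART2 sits on APB1 (42 MHz with the core at 84 MHz), while USART1 and USART6 sit on APB2 (84 MHz). For 115200 baud on USART2: 42000000 / 115200 = 364.58, which rounds to USART_BRR = 365 = 0x16D.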
USART_CR1: UE (enable), TE (transmitter enable), RE (receiver enable), RXNEIE (RX not empty interrupt enable), TCIE (transmit complete interrupt enable).
USART_CR2: stop bits (default 1 stop bit is 0b00).
USART_CR3: no flow control by default, no oversampling by 8.
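The divider arithmetic is easy to get subtly wrong, because plain integer division truncates 364.58 to 364 rather than rounding to 365. A minimal sketch of a rounding helper, assuming oversampling by 16 (OVER8 = 0); the function names usart_brr_over16 and usart_actual_baud are hypothetical and not part of CMSIS or the HAL:

```c
#include <stdint.h>

/* Rounded BRR value for oversampling by 16 (OVER8 = 0).
 * In that mode BRR holds USARTDIV as a 12.4 fixed-point number,
 * which works out to round(f_ck / baud). */
static inline uint32_t usart_brr_over16(uint32_t fck_hz, uint32_t baud)
{
    return (fck_hz + baud / 2u) / baud;
}

/* Baud rate the peripheral actually generates for a given BRR value. */
static inline uint32_t usart_actual_baud(uint32_t fck_hz, uint32_t brr)
{
    return fck_hz / brr;
}
```

With a 42 MHz clock and BRR = 365 the line runs at roughly 115068 baud, about 0.1 % slow, which is far inside the tolerance of an 8N1 link.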

Initialisation boils down to enabling the clock on the right APB bus, configuring the GPIO pins to alternate function mode, then writing the peripheral registers. Here is the complete setup for USART2 at 115200 8N1:

#include "stm32f4xx.h"   // CMSIS device header: register definitions, NVIC_*, __DSB

void usart_init(void)
{
    // Enable clocks: USART2 on APB1, GPIOA on AHB1
    RCC->APB1ENR |= RCC_APB1ENR_USART2EN;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
    __DSB();

    // PA2 = TX (AF7), PA3 = RX (AF7)
    GPIOA->MODER  &= ~(GPIO_MODER_MODER2 | GPIO_MODER_MODER3);
    GPIOA->MODER  |=  (2 << GPIO_MODER_MODER2_Pos | 2 << GPIO_MODER_MODER3_Pos);
    GPIOA->AFR[0] &= ~(0xF << (2 * 4) | 0xF << (3 * 4));
    GPIOA->AFR[0] |=  (7 << (2 * 4) | 7 << (3 * 4));  // AF7 = USART2

    // 115200 baud @ 42 MHz APB1, rounded to nearest (plain division truncates to 364)
    USART2->BRR = (42000000u + 115200u / 2u) / 115200u;  // = 365 -> 0x16D

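    // Enable USART, TX, RX, RXNE interrupt
    USART2->CR1 = USART_CR1_UE | USART_CR1_TE | USART_CR1_RE |
                  USART_CR1_RXNEIE;
    // No interrupts for TX yet: we enable TCIE on demand

    // Enable the vector in the NVIC, otherwise USART2_IRQHandler never runs
    NVIC_SetPriority(USART2_IRQn, 5);
    NVIC_EnableIRQ(USART2_IRQn);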
}

Note that we do not enable TX complete interrupts (TCIE) in init. They are enabled only when we have data to send, and disabled again when the transmit buffer is empty. This prevents the TX ISR from firing constantly on an idle line.

Ring buffer data structure

The ring buffer uses a power-of-2 size so that wrapping can be done with a bitwise mask instead of a modulo operation. On Cortex-M4, a single-cycle AND is measurably faster than the division-based modulo that compilers emit for arbitrary sizes.

#include <stdbool.h>
#include <stdint.h>

#define UART_BUF_SIZE  256   // must be power of 2
#define UART_BUF_MASK  (UART_BUF_SIZE - 1)

typedef struct {
    volatile uint16_t head;           // producer index (ISR writes here)
    volatile uint16_t tail;           // consumer index (main reads here)
    uint8_t buf[UART_BUF_SIZE];
} ringbuf_t;

static ringbuf_t rx_buf;
static ringbuf_t tx_buf;

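The volatile qualifier tells the compiler that head and tail can change outside the current execution context, i.e., in the ISR, so every access must really go to memory. The buffer buf[] itself does not need volatile: each side only touches a slot it currently owns, and the __DMB() in push and pop (which CMSIS also defines as a compiler barrier) stops the compiler from moving a buf[] access across the index update that hands the slot over.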
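Since the mask trick silently breaks if someone later changes the size to, say, 200, it may be worth letting the compiler enforce the constraint. A minimal sketch using C11 _Static_assert; ringbuf_next is a hypothetical helper name introduced here for illustration:

```c
#include <stdint.h>

#define UART_BUF_SIZE  256u
#define UART_BUF_MASK  (UART_BUF_SIZE - 1u)

/* Fails the build if the size is not a power of 2. */
_Static_assert(UART_BUF_SIZE >= 2u && (UART_BUF_SIZE & UART_BUF_MASK) == 0u,
               "UART_BUF_SIZE must be a power of 2");
/* head and tail are uint16_t, so every index must fit in 16 bits. */
_Static_assert(UART_BUF_SIZE <= 65536u, "ring buffer indexes are uint16_t");

/* Advance an index by one slot, wrapping with the mask. */
static inline uint16_t ringbuf_next(uint16_t idx)
{
    return (uint16_t)((idx + 1u) & UART_BUF_MASK);
}
```

For a power-of-2 size the masked result is identical to (idx + 1) % UART_BUF_SIZE, so the two forms can be swapped without changing behaviour.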

Push and pop operations

For the RX buffer, the ISR pushes bytes and the main loop pops them:

static inline bool ringbuf_push(ringbuf_t *rb, uint8_t byte)
{
    uint16_t next = (rb->head + 1) & UART_BUF_MASK;
    if (next == rb->tail)           // full
        return false;
    rb->buf[rb->head] = byte;
    __DMB();                        // ensure byte written before head update
    rb->head = next;
    return true;
}

static inline bool ringbuf_pop(ringbuf_t *rb, uint8_t *byte)
{
    if (rb->tail == rb->head)       // empty
        return false;
    *byte = rb->buf[rb->tail];
    __DMB();
    rb->tail = (rb->tail + 1) & UART_BUF_MASK;
    return true;
}

static inline uint16_t ringbuf_avail(const ringbuf_t *rb)
{
    return (rb->head - rb->tail) & UART_BUF_MASK;
}

The memory barrier (__DMB()) ensures that the data write cannot be reordered after the head update (in push) and that the data read cannot be reordered after the tail update (in pop). The Cortex-M4 core itself does not reorder these SRAM accesses on a single core, so the hardware barrier is technically redundant there. The compiler is a different matter: buf[] is not volatile, so without a barrier it may legally move the buffer access past the index store. CMSIS defines __DMB() with a memory clobber, which closes that gap. I keep the full barrier for that reason, as documentation, and for portability to Cortex-M7.
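The push and pop logic has no hardware dependency, so it can be exercised on a PC before it ever runs in an ISR. The sketch below is a host-side copy with a deliberately tiny buffer; the __DMB() macro is a compiler-only stand-in for the CMSIS intrinsic (GCC/Clang syntax, an assumption of this example), and fill_until_full is a hypothetical test helper:

```c
#include <stdbool.h>
#include <stdint.h>

#define UART_BUF_SIZE  8u              /* small on purpose, still a power of 2 */
#define UART_BUF_MASK  (UART_BUF_SIZE - 1u)

/* Host stand-in for the CMSIS intrinsic: compiler barrier only. */
#ifndef __DMB
#define __DMB() __asm__ volatile("" ::: "memory")
#endif

typedef struct {
    volatile uint16_t head;
    volatile uint16_t tail;
    uint8_t buf[UART_BUF_SIZE];
} ringbuf_t;

static bool ringbuf_push(ringbuf_t *rb, uint8_t byte)
{
    uint16_t next = (uint16_t)((rb->head + 1u) & UART_BUF_MASK);
    if (next == rb->tail)              /* full: one slot always stays empty */
        return false;
    rb->buf[rb->head] = byte;
    __DMB();                           /* data first, then publish */
    rb->head = next;
    return true;
}

static bool ringbuf_pop(ringbuf_t *rb, uint8_t *byte)
{
    if (rb->tail == rb->head)          /* empty */
        return false;
    *byte = rb->buf[rb->tail];
    __DMB();                           /* read first, then release the slot */
    rb->tail = (uint16_t)((rb->tail + 1u) & UART_BUF_MASK);
    return true;
}

static uint16_t ringbuf_avail(const ringbuf_t *rb)
{
    return (uint16_t)((rb->head - rb->tail) & UART_BUF_MASK);
}

/* Pushes 0, 1, 2, ... until the buffer reports full; returns how many fit. */
static unsigned fill_until_full(ringbuf_t *rb)
{
    unsigned n = 0;
    while (ringbuf_push(rb, (uint8_t)n))
        n++;
    return n;
}
```

One consequence worth knowing: because full is detected as "next == tail", the usable capacity is UART_BUF_SIZE - 1 bytes, so the 256-byte buffer in this article holds at most 255.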

RX interrupt handler

The RXNE flag is set as soon as a byte is received. The ISR reads USART_DR (which clears RXNE) and pushes the byte into the ring buffer.

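static volatile uint8_t rx_overflow;   // sticky flag: at least one RX byte was lost

void USART2_IRQHandler(void)
{
    uint32_t sr = USART2->SR;

    // RXNEIE also raises the interrupt on overrun (ORE). Reading SR and then
    // DR clears both flags, so DR must be read in either case.
    if (sr & (USART_SR_RXNE | USART_SR_ORE)) {
        uint8_t byte = (uint8_t)(USART2->DR & 0xFF);
        if (sr & USART_SR_ORE)
            rx_overflow = 1;        // hardware overrun: a byte was lost on the wire
        if ((sr & USART_SR_RXNE) && !ringbuf_push(&rx_buf, byte)) {
            // Buffer full: byte lost. Record it in the sticky flag.
            rx_overflow = 1;
        }
    }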

    if (sr & USART_SR_TC) {
        // TX complete: if more bytes in tx_buf, send next
        uint8_t byte;
        if (ringbuf_pop(&tx_buf, &byte)) {
            USART2->DR = byte;
        } else {
            // No more data: disable TC interrupt
            USART2->CR1 &= ~USART_CR1_TCIE;
        }
    }
}

The RX ISR must be fast. Pushing into a 256-byte ring buffer is 6 loads/stores and an ALU operation: roughly 12 CPU cycles at 84 MHz, or about 143 ns. Even after adding the 12-cycle exception entry and the exit overhead, that is well within the time budget for 115200 baud (one 10-bit 8N1 frame every 86.8 µs).
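The time budget is simple arithmetic, and it can be handy to have it as code when evaluating other baud rates. A small sketch with hypothetical helper names, assuming an 8N1 frame of 10 bits (start, 8 data, stop):

```c
#include <stdint.h>

/* Time on the wire for one frame, in nanoseconds (truncated). */
static inline uint32_t usart_frame_time_ns(uint32_t baud, uint32_t bits_per_frame)
{
    return (uint32_t)(((uint64_t)bits_per_frame * 1000000000u) / baud);
}

/* CPU cycles that elapse between two back-to-back frames (truncated). */
static inline uint32_t usart_cycles_per_frame(uint32_t cpu_hz, uint32_t baud,
                                              uint32_t bits_per_frame)
{
    return (uint32_t)(((uint64_t)cpu_hz * bits_per_frame) / baud);
}
```

At 84 MHz this gives about 7291 cycles per frame at 115200 baud and about 911 at 921600 baud, so the ISR budget shrinks by a factor of eight but remains comfortable for a handler of a few dozen cycles.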

TX with demand-driven TC interrupt

Transmitting is trickier because we do not want the TX ISR to fire when there is nothing to send. The strategy:

A uart_write(uint8_t byte) function pushes the byte into the TX ring buffer.
If the TC interrupt is not already enabled, enable it now. The pending TC flag from a previous transmission triggers one ISR call immediately, which pops the first byte and sends it.
The TC ISR pops subsequent bytes. When the TX buffer is empty, it disables TCIE.

void uart_write(uint8_t byte)
{
    // Spin if TX buffer full (back-pressure)
    while (ringbuf_avail(&tx_buf) == UART_BUF_SIZE - 1) {
        // If TCIE is off and we are stuck, ensure ISR will run
        if (!(USART2->CR1 & USART_CR1_TCIE)) {
            USART2->CR1 |= USART_CR1_TCIE;
        }
    }
    ringbuf_push(&tx_buf, byte);
    // Prime the TC pump if idle
    USART2->CR1 |= USART_CR1_TCIE;
}

void uart_write_str(const char *s)
{
    while (*s) uart_write((uint8_t)*s++);
}

int uart_read(uint8_t *byte)
{
    return ringbuf_pop(&rx_buf, byte) ? 0 : -1;
}

uint16_t uart_available(void)
{
    return ringbuf_avail(&rx_buf);
}

This scheme is strictly single-producer, single-consumer: the main context calls uart_write and the ISR consumes from the TX buffer. No lock needed.
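To convince yourself that the demand-driven TCIE scheme neither stalls nor fires on an idle line, it can help to model it on a PC. The sketch below is a toy model and not the real peripheral: mock_uart_t, mock_isr, mock_run_pending and mock_write are hypothetical names, the queue is linear with no wrap, and each frame is treated as sent instantly, so TC is always set again by the time the handler returns.

```c
#include <stdint.h>

#define SR_TC     (1u << 6)   /* same bit position as USART_SR_TC */
#define CR1_TCIE  (1u << 6)   /* same bit position as USART_CR1_TCIE */
#define MOCK_CAP  16u         /* linear queue, no wrap: model only */

typedef struct {
    uint32_t sr, cr1;
    uint8_t  queue[MOCK_CAP];            /* stands in for tx_buf */
    unsigned q_head, q_tail;
    uint8_t  wire[MOCK_CAP];             /* bytes that "left the chip" */
    unsigned wire_len;
    unsigned isr_calls;
} mock_uart_t;

/* TC branch of the handler: send the next byte or switch the interrupt off. */
static void mock_isr(mock_uart_t *u)
{
    u->isr_calls++;
    if (u->q_tail != u->q_head)
        u->wire[u->wire_len++] = u->queue[u->q_tail++];   /* "write DR" */
    else
        u->cr1 &= ~CR1_TCIE;                              /* nothing left */
}

/* The NVIC's job: run the handler while TC and TCIE are both set. */
static void mock_run_pending(mock_uart_t *u)
{
    while ((u->sr & SR_TC) && (u->cr1 & CR1_TCIE))
        mock_isr(u);
}

/* Same shape as uart_write(): queue the byte, then prime the pump. */
static void mock_write(mock_uart_t *u, uint8_t byte)
{
    if (u->q_head < MOCK_CAP)
        u->queue[u->q_head++] = byte;
    u->cr1 |= CR1_TCIE;
    mock_run_pending(u);
}
```

In this model every isolated write costs two handler invocations, one to send the byte and one to discover the empty buffer and clear TCIE, while an idle line costs none.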

Practical example: echo server with line-based command parser

Here is how the ring buffer integrates into a real main loop. The application echoes received characters and recognises a command led\n that toggles an output pin:

#include <string.h>   // strcmp, strncmp

static char line_buf[64];
static uint8_t line_pos = 0;

void process_line(const char *line)
{
    if (strcmp(line, "led") == 0) {
        GPIOA->ODR ^= GPIO_ODR_OD5;       // toggle PA5 (Nucleo LED)
        uart_write_str("LED toggled\r\n");
    } else if (strcmp(line, "help") == 0) {
        uart_write_str("Commands: led, help, hello\r\n");
    } else if (strncmp(line, "hello ", 6) == 0) {
        uart_write_str("Hi, ");
        uart_write_str(line + 6);
        uart_write_str("!\r\n");
    } else {
        uart_write_str("Unknown: ");
        uart_write_str(line);
        uart_write_str("\r\n");
    }
}

void main_loop(void)
{
    while (1) {
        uint8_t c;
        if (uart_read(&c) == 0) {
            uart_write(c);                     // echo
            if (c == '\r' || c == '\n') {
                uart_write_str("\r\n");
                line_buf[line_pos] = '\0';
                if (line_pos > 0)
                    process_line(line_buf);
                line_pos = 0;
            } else if (line_pos < sizeof(line_buf) - 1) {
                line_buf[line_pos++] = c;
            }
        }
        // Application code here — sensor polling, control loops, etc.
    }
}

The key property: uart_read never blocks, and uart_write blocks only when the TX buffer is full, in which case it spins until the ISR frees a slot (about one frame time per byte, 86.8 µs at 115200 baud). This means your control loop continues to execute even when the serial terminal has nothing to say.

Practical checklist

Buffer size is power of 2: This lets the index wrap with idx & mask instead of idx % size. For 256 bytes that is ~60 ns of overhead per push with the mask versus ~180 ns with a division-based modulo. At 115200 baud the difference is negligible, but at 921600 it matters.
RX overflow flag: Always add an overflow counter or a sticky flag. Without it, a lost byte is invisible during debugging.
TC vs TXE: TC fires when the shift register empties (all bits sent on the wire). TXE fires when DR is empty (ready for the next byte). Driving transmission from TC tells you the byte has actually left the chip and leaves a short idle gap between frames. Driving it from TXE keeps the shift register fed, so frames go out back to back at full line rate. For debug output I prefer TC: the throughput loss is irrelevant and the gap is kinder to a slow terminal. Neither flag is real flow control; if the receiver can fall behind, use RTS/CTS.
NVIC priority: Set the USART interrupt priority high enough that it can preempt lower-priority interrupts but not the SysTick if you use it for timing. A priority of 5 on STM32F4 (NVIC 4-bit priority, lower number = higher priority) is a reasonable middle ground.
Overrun error: Check USART_SR_ORE in the ISR. With RXNEIE set, ORE also raises the interrupt, and if it is never cleared the handler is re-entered endlessly, which looks like a lock-up. Clear it with a read of SR followed by a read of DR; a reinit is not required.
No UART HAL interop: If you use this ring buffer approach, do not call HAL_UART_Receive_IT or HAL_UART_IRQHandler as well. The HAL expects to own the same interrupt vector and reads SR and DR itself. You must either use bare registers everywhere or let the HAL own the ISR and feed the ring buffer from its receive callback.
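The overflow and overrun items can share one small bookkeeping structure. A minimal sketch, assuming the ISR takes a single snapshot of SR as shown earlier; uart_rx_stats_t, rx_must_read_dr and rx_note_push are hypothetical names, and the two bit masks mirror the positions of USART_SR_ORE and USART_SR_RXNE on STM32F4:

```c
#include <stdbool.h>
#include <stdint.h>

#define SR_ORE   (1u << 3)   /* same bit position as USART_SR_ORE */
#define SR_RXNE  (1u << 5)   /* same bit position as USART_SR_RXNE */

typedef struct {
    uint32_t overruns;   /* ORE events: hardware lost at least one byte */
    uint32_t dropped;    /* bytes discarded because the ring buffer was full */
} uart_rx_stats_t;

/* Given an SR snapshot, count an overrun if present and report whether DR
 * must be read. Reading DR after SR clears both RXNE and ORE. */
static inline bool rx_must_read_dr(uint32_t sr, uart_rx_stats_t *st)
{
    if (sr & SR_ORE)
        st->overruns++;
    return (sr & (SR_RXNE | SR_ORE)) != 0u;
}

/* Record the outcome of ringbuf_push() for a received byte. */
static inline void rx_note_push(bool pushed, uart_rx_stats_t *st)
{
    if (!pushed)
        st->dropped++;
}
```

Keeping the two counters separate tells you where to look: a rising overruns count means the ISR is being held off for more than one frame time, while a rising dropped count means the main loop is not draining the buffer fast enough.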

How I would approach this on a client project

On a commercial firmware project, I would not ship the ring buffer in this raw form directly. I would wrap it in a module that provides two independent serial channels: one for debug/CLI (human interface, low throughput, needs echo and line editing) and one for a binary protocol (machine interface, high throughput, raw framing). Both share the same ring buffer primitives but have different ISR policies — the debug channel uses TC-based flow control, the binary channel uses TXE-based streaming with DMA for bulk transfers above 64 bytes.

I would also add a uart_flush() helper that busy-waits until the TX buffer is empty and the TC flag is set — this is invaluable before entering STOP mode, where an incomplete UART transmission would be truncated. And I would route all printf-style debug output through the ring buffer by implementing _write() (Newlib-nano / syscall stub) so that every printf("sensor=%d", val) automatically goes through the interrupt-driven path without blocking.
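A rough sketch of those two helpers follows, under stated assumptions. The write hook is shown under the hypothetical name retarget_write so that it compiles on a PC; on a newlib target the same body would go into _write(), and uart_write() would be the real function from this article rather than the capturing stand-in used here. The uart_flush() body needs the peripheral registers, so it appears only as a comment.

```c
#include <stddef.h>
#include <stdint.h>

/* Host stand-in for uart_write(): captures bytes so they can be inspected. */
static uint8_t captured[64];
static size_t  captured_len;

static void uart_write(uint8_t byte)
{
    if (captured_len < sizeof captured)
        captured[captured_len++] = byte;
}

/* Body of the syscall stub: forward every byte to the ring-buffer path. */
static int retarget_write(int fd, const char *buf, int len)
{
    (void)fd;                        /* stdout and stderr share the UART */
    for (int i = 0; i < len; i++)
        uart_write((uint8_t)buf[i]);
    return len;                      /* report everything as written */
}

/* On target, uart_flush() could look like this (not compiled on the host):
 *
 *   void uart_flush(void)
 *   {
 *       while (ringbuf_avail(&tx_buf) != 0) { }     // buffer drained
 *       while (!(USART2->SR & USART_SR_TC)) { }     // last frame on the wire
 *   }
 */
```

Because uart_write() spins when the TX buffer is full, a printf that produces more output than the buffer holds will still block for the excess, so size the buffer for your longest log burst.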

Finally, I would test under load: two USART ports looped back to each other at 921600 baud, sending 1 MB of random data, with CRC-32 verification on the receiving side and zero bytes lost. That is the acceptance criterion I use. If the ring buffer drops a single byte at 921600 with both RX and TX active, the design is not hardened enough for production.
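For the verification side of such a loopback test, a table-free CRC-32 is enough. The sketch below implements the common reflected CRC-32 (polynomial 0xEDB88320, the variant used by zlib and Ethernet); crc32_update is a hypothetical name, and whether the test rig uses this exact variant is an assumption:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32, reflected polynomial 0xEDB88320.
 * Pass 0 for the first chunk, then feed the previous result back in,
 * so the checksum can be accumulated as bytes arrive from the ring buffer. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

The standard check value for this variant is 0xCBF43926 for the ASCII string "123456789", which makes a convenient self-test before trusting a pass/fail result from the 1 MB run.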

Sources and further reading

STMicroelectronics, RM0368 — STM32F401 Reference Manual: USART register map, chapters 20–21.
STMicroelectronics, AN3109 — STM32 USART communication: application note on USART configuration and error handling.
STMicroelectronics, STM32CubeF4 Firmware Package — Projects/STM32F401RE-Nucleo/Examples/UART: reference HAL and LL examples.
Memfault Blog — “Ring Buffer Basics” series on circular buffer patterns in embedded systems.
ARM, CMSIS-Core (Cortex-M4): intrinsic functions for memory barriers (__DMB, __DSB).
