STM32 USART Interrupt-Driven Ring Buffer: Lock-Free RX/TX on STM32F4

2026-06-04 · Davide Carrese
STM32 · USART · STM32F4 · Embedded

Every embedded project needs serial I/O. Debug logging, CLI commands, sensor polling, bootloader protocols — USART is the universal backchannel. The naive approach (blocking HAL_UART_Transmit/Receive) works for prototypes, but the moment you need concurrent code paths — a control loop running while you print, or a command arriving while a sensor acquisition is in progress — blocking UART becomes a bottleneck. An interrupt-driven ring buffer decouples the ISR from the application context without a single lock, without malloc, and with bounded latency. Here is how I implement one on STM32F4 at register level.

Why ring buffers, and why lock-free

A ring buffer (circular buffer) is a fixed-size FIFO backed by a linear array. Two pointers track the state: head (where the next byte is written) and tail (where the next byte is read). When either pointer reaches the end of the buffer, it wraps around to index zero.

The beauty of a single-producer, single-consumer (SPSC) ring buffer is that no atomic operations or mutexes are needed, provided you obey the following rule:

Only the producer writes head, and only the consumer writes tail. Each side may read the other's index but never modifies it.

Because each index is written by exactly one context, there is never a data race on the pointers themselves. For the data bytes, the producer writes into a slot before advancing head, and the consumer reads from a slot before advancing tail, so a slot is never handed to the other side while it is still in use. With volatile-qualified indexes and a barrier (__DMB()) between the data access and the index update, this is correct on Cortex-M4 without explicit atomics.

USART register-level setup on STM32F4

Before we talk about the buffer, the USART must be configured. On the STM32F401RE Nucleo board, USART2 uses PA2 (TX) and PA3 (RX), reachable on the Nucleo header or through the ST-Link virtual COM port. The peripheral registers used in this article are SR (status), DR (data), BRR (baud rate) and CR1 (control).

Initialisation boils down to enabling the clock on the right APB bus, configuring the GPIO pins to alternate function mode, then writing the peripheral registers. Here is the complete setup for USART2 at 115200 8N1:

#include "stm32f4xx.h"   // CMSIS device header (define STM32F401xE in the build)

void usart_init(void)
{
    // Enable clocks: USART2 on APB1, GPIOA on AHB1
    RCC->APB1ENR |= RCC_APB1ENR_USART2EN;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
    __DSB();

    // PA2 = TX (AF7), PA3 = RX (AF7)
    GPIOA->MODER  &= ~(GPIO_MODER_MODER2 | GPIO_MODER_MODER3);
    GPIOA->MODER  |=  (2 << GPIO_MODER_MODER2_Pos | 2 << GPIO_MODER_MODER3_Pos);
    GPIOA->AFR[0] &= ~(0xF << (2 * 4) | 0xF << (3 * 4));
    GPIOA->AFR[0] |=  (7 << (2 * 4) | 7 << (3 * 4));  // AF7 = USART2

    // 115200 baud @ 42 MHz APB1, oversampling by 16.
    // Round to nearest: 42e6 / 115200 = 364.58 -> 365 (0x16D).
    // Plain integer division would truncate to 364.
    USART2->BRR = (42000000u + 115200u / 2u) / 115200u;

    // Enable USART, TX, RX, RXNE interrupt
    USART2->CR1 = USART_CR1_UE | USART_CR1_TE | USART_CR1_RE |
                  USART_CR1_RXNEIE;
    // No TX interrupt yet: TCIE is enabled on demand

    // Enable the USART2 line in the NVIC, otherwise the handler never runs
    NVIC_EnableIRQ(USART2_IRQn);
}

Note that we do not enable TX complete interrupts (TCIE) in init. They are enabled only when we have data to send, and disabled again when the transmit buffer is empty. This prevents the TX ISR from firing constantly on an idle line.
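
The divisor arithmetic in usart_init can be checked on a host before flashing. What follows is a minimal sketch, assuming oversampling by 16 (OVER8 = 0, the reset default), where BRR holds the divider in 12.4 fixed point and therefore equals the peripheral clock divided by the baud rate. The helper names usart_brr and baud_error_ppm are hypothetical and not part of any ST library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: BRR value for oversampling by 16 (OVER8 = 0).
 * In that mode BRR is USARTDIV in 12.4 fixed point, which works out to
 * pclk / baud; the division is rounded to the nearest integer here. */
static uint32_t usart_brr(uint32_t pclk_hz, uint32_t baud)
{
    assert(baud != 0);
    return (pclk_hz + baud / 2u) / baud;
}

/* Hypothetical helper: signed baud-rate error in parts per million,
 * comparing the rate the rounded divisor produces with the target. */
static int32_t baud_error_ppm(uint32_t pclk_hz, uint32_t baud)
{
    uint32_t brr = usart_brr(pclk_hz, baud);
    int64_t actual_milli = ((int64_t)pclk_hz * 1000) / brr;  /* baud * 1000 */
    int64_t target_milli = (int64_t)baud * 1000;
    return (int32_t)(((actual_milli - target_milli) * 1000000) / target_milli);
}
```

With a 42 MHz APB1 clock and 115200 baud, usart_brr returns 365 (0x16D) and the error comes out at roughly -0.11 %, whereas plain integer division would truncate to 364.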

Ring buffer data structure

The ring buffer uses a power-of-2 size so that wrapping can be done with a bitwise mask instead of a modulo operation. On Cortex-M4, a single-cycle AND is measurably faster than the division-based modulo that compilers emit for arbitrary sizes.

#include <stdint.h>
#include <stdbool.h>   // bool, used by the push/pop helpers

#define UART_BUF_SIZE  256   // must be power of 2
#define UART_BUF_MASK  (UART_BUF_SIZE - 1)

typedef struct {
    volatile uint16_t head;           // producer index (ISR writes here)
    volatile uint16_t tail;           // consumer index (main reads here)
    uint8_t buf[UART_BUF_SIZE];
} ringbuf_t;

static ringbuf_t rx_buf;
static ringbuf_t tx_buf;

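The volatile qualifier tells the compiler that head and tail can change outside the current execution context, i.e., in the ISR, so every access really goes to memory. The buffer buf[] itself is not volatile: its accesses are kept in order relative to the index updates by the barrier in the push and pop operations, which in CMSIS also acts as a compiler barrier.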
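
The "must be power of 2" comment can be turned into a build-time check. This is a small sketch, assuming a C11 compiler; ring_next is a hypothetical helper name used only to show that mask wrapping and modulo agree for this size.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

#define UART_BUF_SIZE  256u
#define UART_BUF_MASK  (UART_BUF_SIZE - 1u)

/* Fails the build if the size is changed to a non-power-of-two. */
static_assert(UART_BUF_SIZE != 0u && (UART_BUF_SIZE & UART_BUF_MASK) == 0u,
              "UART_BUF_SIZE must be a power of 2");
/* The indexes are uint16_t, so every valid index must fit in 16 bits. */
static_assert(UART_BUF_SIZE <= 65536u, "indexes are 16-bit");

/* Hypothetical helper: advance an index by one slot with mask wrapping. */
static uint16_t ring_next(uint16_t idx)
{
    return (uint16_t)((idx + 1u) & UART_BUF_MASK);
}
```

For a size of 256, ring_next(255) wraps to 0, exactly as (255 + 1) % 256 would.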

Push and pop operations

For the RX buffer, the ISR pushes bytes and the main loop pops them:

static inline bool ringbuf_push(ringbuf_t *rb, uint8_t byte)
{
    uint16_t next = (rb->head + 1) & UART_BUF_MASK;
    if (next == rb->tail)           // full
        return false;
    rb->buf[rb->head] = byte;
    __DMB();                        // ensure byte written before head update
    rb->head = next;
    return true;
}

static inline bool ringbuf_pop(ringbuf_t *rb, uint8_t *byte)
{
    if (rb->tail == rb->head)       // empty
        return false;
    *byte = rb->buf[rb->tail];
    __DMB();
    rb->tail = (rb->tail + 1) & UART_BUF_MASK;
    return true;
}

static inline uint16_t ringbuf_avail(const ringbuf_t *rb)
{
    return (rb->head - rb->tail) & UART_BUF_MASK;
}

The memory barrier (__DMB()) ensures that the data write completes before the head update (in push), and that the data read completes before the tail update (in pop). On a single-core Cortex-M4 without caches the hardware barrier is technically not required for correctness: the ISR and the main loop run on the same core, which always observes its own loads and stores in program order. What still matters is the compiler-barrier effect, which stops the compiler from moving the buf[] access across the index update. I keep the instruction as documentation and for portability to Cortex-M7 with cache.
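
One consequence of the full test in ringbuf_push is that one slot always stays empty, so a 256-byte buffer holds at most 255 bytes. The host-side sketch below repeats the same push/pop logic under two assumptions made purely for illustration: a small 8-byte buffer, and a compiler-only barrier (GCC/Clang syntax) standing in for __DMB(). The ring_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BUF_SIZE 8u                  /* small power of 2, for illustration */
#define BUF_MASK (BUF_SIZE - 1u)
/* Host stand-in for __DMB(): compiler barrier only (GCC/Clang syntax). */
#define BARRIER() __asm__ volatile("" ::: "memory")

typedef struct {
    volatile uint16_t head;          /* written by the producer only */
    volatile uint16_t tail;          /* written by the consumer only */
    uint8_t buf[BUF_SIZE];
} ring_t;

static bool ring_push(ring_t *rb, uint8_t byte)
{
    uint16_t head = rb->head;
    uint16_t next = (uint16_t)((head + 1u) & BUF_MASK);
    if (next == rb->tail)            /* full: one slot stays unused */
        return false;
    rb->buf[head] = byte;
    BARRIER();                       /* data first, then publish the index */
    rb->head = next;
    return true;
}

static bool ring_pop(ring_t *rb, uint8_t *byte)
{
    assert(rb && byte);
    uint16_t tail = rb->tail;
    if (tail == rb->head)            /* empty */
        return false;
    *byte = rb->buf[tail];
    BARRIER();                       /* read first, then release the slot */
    rb->tail = (uint16_t)((tail + 1u) & BUF_MASK);
    return true;
}

static uint16_t ring_avail(const ring_t *rb)
{
    return (uint16_t)((rb->head - rb->tail) & BUF_MASK);
}

/* Push 0, 1, 2, ... until the buffer refuses; returns the count accepted. */
static unsigned ring_fill(ring_t *rb)
{
    unsigned n = 0;
    while (ring_push(rb, (uint8_t)n))
        n++;
    return n;
}
```

Filling an empty 8-byte instance accepts 7 bytes. Sizing the real buffer should take that off-by-one into account.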

RX interrupt handler

The RXNE flag is set as soon as a byte is received. The ISR reads USART_DR (which clears RXNE) and pushes the byte into the ring buffer.

static volatile uint8_t rx_overflow;   // set by the ISR when a byte is dropped

void USART2_IRQHandler(void)
{
    uint32_t sr = USART2->SR;

    if (sr & USART_SR_RXNE) {
        uint8_t byte = (uint8_t)(USART2->DR & 0xFF);
        if (!ringbuf_push(&rx_buf, byte)) {
            // Buffer full — byte lost. Optionally set an overflow flag.
            rx_overflow = 1;
        }
    }

    if (sr & USART_SR_TC) {
        // TX complete: if more bytes in tx_buf, send next
        uint8_t byte;
        if (ringbuf_pop(&tx_buf, &byte)) {
            USART2->DR = byte;
        } else {
            // No more data: disable TC interrupt
            USART2->CR1 &= ~USART_CR1_TCIE;
        }
    }
}

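The RX ISR must be fast. Pushing into a 256-byte ring buffer is 6 loads/stores and an ALU operation, roughly 12 CPU cycles at 84 MHz, or about 140 ns. Exception entry and exit add on the order of another dozen cycles each on Cortex-M4, so the whole handler still takes well under a microsecond. That is well within the time budget for 115200 baud (one byte every 86.8 µs).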
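
The time budget is easy to recompute for other baud rates and core clocks. This is a rough sketch, assuming 8N1 framing (10 bits on the wire per byte) and back-to-back bytes; byte_period_ns and cycles_per_byte are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Bits on the wire per byte for 8N1: start + 8 data + stop. */
#define BITS_PER_FRAME_8N1 10u

/* Hypothetical helper: time between back-to-back bytes, in nanoseconds. */
static uint64_t byte_period_ns(uint32_t baud)
{
    assert(baud != 0);
    return (BITS_PER_FRAME_8N1 * 1000000000ull) / baud;
}

/* Hypothetical helper: CPU cycles available between back-to-back bytes. */
static uint64_t cycles_per_byte(uint32_t cpu_hz, uint32_t baud)
{
    assert(baud != 0);
    return ((uint64_t)cpu_hz * BITS_PER_FRAME_8N1) / baud;
}
```

At 84 MHz this gives about 7291 cycles per byte at 115200 baud and about 911 cycles at 921600 baud, so the handler stays a small fraction of the budget even at the higher rate.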

TX with demand-driven TC interrupt

Transmitting is trickier because we do not want the TX ISR to fire when there is nothing to send. The strategy:

  1. A uart_write(uint8_t byte) function pushes the byte into the TX ring buffer.
  2. If the TC interrupt is not already enabled, enable it now. On an idle line the TC flag is already set (it is set after reset and stays set once the previous transmission has completed), so enabling TCIE triggers one ISR call immediately, which pops the first byte and sends it.
  3. The TC ISR pops subsequent bytes. When the TX buffer is empty, it disables TCIE.

void uart_write(uint8_t byte)
{
    // Spin if TX buffer full (back-pressure)
    while (ringbuf_avail(&tx_buf) == UART_BUF_SIZE - 1) {
        // If TCIE is off and we are stuck, ensure ISR will run
        if (!(USART2->CR1 & USART_CR1_TCIE)) {
            USART2->CR1 |= USART_CR1_TCIE;
        }
    }
    ringbuf_push(&tx_buf, byte);
    // Prime the TC pump if idle
    USART2->CR1 |= USART_CR1_TCIE;
}

void uart_write_str(const char *s)
{
    while (*s) uart_write((uint8_t)*s++);
}

int uart_read(uint8_t *byte)
{
    return ringbuf_pop(&rx_buf, byte) ? 0 : -1;
}

uint16_t uart_available(void)
{
    return ringbuf_avail(&rx_buf);
}

This scheme is strictly single-producer, single-consumer as long as uart_write is only ever called from one context: the main context pushes into the TX buffer and the ISR consumes from it. No lock needed.
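
The enable-on-demand behaviour can be traced on a host with a toy model. The sketch below is a simulation under stated assumptions, not the target code: tcie stands in for USART_CR1_TCIE, the TC flag is treated as always set (frames complete instantly), the wire array captures what would go out through DR, and all sim_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TX_SIZE 8u
#define TX_MASK (TX_SIZE - 1u)

/* Simulated state standing in for tx_buf, CR1.TCIE and the serial line. */
static uint8_t  tx_ring[TX_SIZE];
static uint16_t tx_head, tx_tail;
static bool     tcie;            /* models USART_CR1_TCIE */
static uint8_t  wire[64];        /* bytes "transmitted" */
static size_t   wire_len;
static unsigned isr_calls;

/* Models the TC branch of the handler. */
static void sim_tc_isr(void)
{
    isr_calls++;
    if (tx_tail != tx_head) {
        assert(wire_len < sizeof wire);
        wire[wire_len++] = tx_ring[tx_tail];      /* the DR write */
        tx_tail = (uint16_t)((tx_tail + 1u) & TX_MASK);
    } else {
        tcie = false;                              /* nothing left: go quiet */
    }
}

/* Models the NVIC: with TC assumed set, the handler runs while TCIE is on. */
static void sim_run_pending(void)
{
    while (tcie)
        sim_tc_isr();
}

static void sim_uart_write(uint8_t byte)
{
    while ((uint16_t)((tx_head + 1u) & TX_MASK) == tx_tail) {
        tcie = true;
        sim_run_pending();       /* on hardware the ISR preempts this spin */
    }
    tx_ring[tx_head] = byte;
    tx_head = (uint16_t)((tx_head + 1u) & TX_MASK);
    tcie = true;                 /* prime the pump */
}
```

Writing a 12-byte string through this model and then letting the pending interrupts run delivers all 12 bytes in order and leaves tcie cleared. Each drain ends with one extra handler call whose only job is to switch the interrupt off. The model drains the whole buffer at once when it is full, whereas real hardware frees one slot per byte time.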

Practical example: echo server with line-based command parser

Here is how the ring buffer integrates into a real main loop. The application echoes received characters and recognises a command led\n that toggles an output pin:

#include <string.h>   // strcmp, strncmp

static char line_buf[64];
static uint8_t line_pos = 0;

void process_line(const char *line)
{
    if (strcmp(line, "led") == 0) {
        GPIOA->ODR ^= GPIO_ODR_OD5;       // toggle PA5 (Nucleo LED)
        uart_write_str("LED toggled\r\n");
    } else if (strcmp(line, "help") == 0) {
        uart_write_str("Commands: led, help, hello\r\n");
    } else if (strncmp(line, "hello ", 6) == 0) {
        uart_write_str("Hi, ");
        uart_write_str(line + 6);
        uart_write_str("!\r\n");
    } else {
        uart_write_str("Unknown: ");
        uart_write_str(line);
        uart_write_str("\r\n");
    }
}

void main_loop(void)
{
    while (1) {
        uint8_t c;
        if (uart_read(&c) == 0) {
            uart_write(c);                     // echo
            if (c == '\r' || c == '\n') {
                uart_write_str("\r\n");
                line_buf[line_pos] = '\0';
                if (line_pos > 0)
                    process_line(line_buf);
                line_pos = 0;
            } else if (line_pos < sizeof(line_buf) - 1) {
                line_buf[line_pos++] = c;
            }
        }
        // Application code here — sensor polling, control loops, etc.
    }
}

The key property: uart_read never blocks, and uart_write blocks only when the TX buffer is full, in which case it spins until the ISR frees a slot (about 87 µs per byte at 115200 baud). This means your control loop continues to execute even when the serial terminal has nothing to say.

Practical checklist

How I would approach this on a client project

On a commercial firmware project, I would not ship the ring buffer in this raw form directly. I would wrap it in a module that provides two independent serial channels: one for debug/CLI (human interface, low throughput, needs echo and line editing) and one for a binary protocol (machine interface, high throughput, raw framing). Both share the same ring buffer primitives but have different ISR policies — the debug channel uses TC-based flow control, the binary channel uses TXE-based streaming with DMA for bulk transfers above 64 bytes.

I would also add a uart_flush() helper that busy-waits until the TX buffer is empty and the TC flag is set — this is invaluable before entering STOP mode, where an incomplete UART transmission would be truncated. And I would route all printf-style debug output through the ring buffer by implementing _write() (Newlib-nano / syscall stub) so that every printf("sensor=%d", val) automatically goes through the interrupt-driven path without blocking.
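
Routing printf through the ring buffer might look like the sketch below. It assumes newlib's conventional _write(int, char *, int) hook and ignores the file descriptor. The uart_write defined here is only a host stand-in that captures bytes so the sketch can be exercised without hardware; on target it would be the ring-buffer uart_write from this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host stand-in for the article's uart_write(): captures bytes instead of
 * pushing them into the TX ring buffer. Replace with the real one on target. */
static uint8_t captured[128];
static size_t  captured_len;

void uart_write(uint8_t byte)
{
    assert(captured_len < sizeof captured);
    captured[captured_len++] = byte;
}

/* Output hook called by newlib's stdio. This sketch sends every file
 * descriptor to the UART and reports all bytes as written. */
int _write(int fd, const char *buf, int len)
{
    (void)fd;
    for (int i = 0; i < len; i++)
        uart_write((uint8_t)buf[i]);
    return len;
}
```

Because uart_write spins when the TX buffer is full, a long printf can still stall the caller for the time it takes the line to drain. Sizing the TX buffer for the longest expected burst avoids that.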

Finally, I would test under load: two USART ports looped back to each other at 921600 baud, sending 1 MB of random data, with CRC-32 verification on the receiving side and zero bytes lost. That is the acceptance criterion I use. If the ring buffer drops a single byte at 921600 with both RX and TX active, the design is not hardened enough for production.
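
For the verification step, a software CRC-32 on both ends is enough. The sketch below assumes the common reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF); the text above does not say which variant is meant, so treat that choice and the name crc32_update as assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320, init and final XOR
 * 0xFFFFFFFF). Pass 0 as crc for the first chunk, then feed the previous
 * result back in to continue over a stream. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

The sender accumulates the CRC over everything it transmits, the receiver does the same over everything it pops from the RX buffer, and the two values are compared at the end. The standard check value for the ASCII string "123456789" is 0xCBF43926, which is a quick way to confirm both ends use the same variant.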
