STM32 SPI Master Mode at Register Level: Baud Rate, CPOL/CPHA, NSS, and DMA Transfers on STM32F4

2026-06-07 · Davide Carrese
STM32 · SPI · DMA · STM32F4 · Register-Level

SPI is the de-facto peripheral bus for interfacing sensors, ADCs, DACs, displays, SD cards, and wireless modules in embedded systems. On the STM32F4, the SPI peripheral is flexible enough to handle everything from a low-rate temperature sensor at 125 kHz to a display controller streaming at 21 MHz — but only if you understand the register-level knobs. This article covers SPI master mode configuration, clock polarity/phase (CPOL/CPHA), baud rate prescaler selection, software NSS management, half-duplex bidirectional mode, CRC, and DMA-based full-duplex transfers on the STM32F401.

SPI architecture on STM32F4

The STM32F401 has up to four SPI interfaces, depending on the package. SPI1 (and SPI4, where present) sit on APB2 (max 84 MHz), while SPI2 and SPI3 are on APB1 (max 42 MHz). Each SPI can operate in master or slave mode, supports full-duplex, half-duplex, and simplex (unidirectional) communication, and has a selectable 8-bit or 16-bit data frame format (DFF bit in SPI_CR1).

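The core registers you need to know:

SPI_CR1   → master/slave select, baud rate (BR), CPOL/CPHA, frame size (DFF), software NSS (SSM/SSI), BIDIMODE/BIDIOE, CRC enable, SPE
SPI_CR2   → DMA request enables (TXDMAEN/RXDMAEN), SSOE, interrupt enables
SPI_SR    → status flags (TXE, RXNE, BSY, OVR, MODF, CRCERR)
SPI_DR    → data register (write to transmit, read to receive)
SPI_CRCPR → CRC polynomial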

On the F401 the SPI has no FIFO: each direction has a single data-frame buffer, handshaked through the TXE and RXNE flags. Later STM32 families such as the G4 and H7 added multi-byte FIFOs, but on the F4 software or DMA has to service every frame individually.

Baud rate prescaler selection

The SPI clock is generated by dividing the peripheral clock (PCLK) by a power-of-two prescaler configured in SPI_CR1[BR]:

0 (BR = 000)                                → fPCLK / 2
SPI_CR1_BR_0                                → fPCLK / 4
SPI_CR1_BR_1                                → fPCLK / 8
SPI_CR1_BR_1 | SPI_CR1_BR_0                 → fPCLK / 16
SPI_CR1_BR_2                                → fPCLK / 32
SPI_CR1_BR_2 | SPI_CR1_BR_0                 → fPCLK / 64
SPI_CR1_BR_2 | SPI_CR1_BR_1                 → fPCLK / 128
SPI_CR1_BR_2 | SPI_CR1_BR_1 | SPI_CR1_BR_0  → fPCLK / 256

With SPI1 on APB2 at 84 MHz, the smallest prescaler (/2) gives the peripheral's ceiling of 84/2 = 42 MHz; SPI2 and SPI3 on a 42 MHz APB1 top out at 21 MHz. In practice, about 21 MHz (SPI1 with prescaler /4) is a more realistic upper limit for reliable operation at 3.3 V on typical boards. The achievable bit rate also depends on board layout, trace length, and slave device capability, so always check the slave datasheet.

Practical example: choosing the baud rate for a 10-MHz SPI ADC

/* SPI1 on APB2 = 84 MHz. Target: about 10 MHz SPI clock */
/* 84 MHz / 8 = 10.5 MHz is the closest setting, but it is above 10 MHz. */
/* If 10 MHz is the ADC's hard limit, use /16 (5.25 MHz) instead. */
#define SPI_BAUD_10MHZ  (SPI_CR1_BR_1)  /* BR = 010 → /8 */

/* Target: about 2 MHz SPI clock for an SD card */
/* SPI1: 84 MHz / 32 = 2.625 MHz, 84 MHz / 64 = 1.3125 MHz */
/* SPI2 on APB1 = 42 MHz reaches the same 2.625 MHz with /16 */
#define SPI_BAUD_2MHZ_APB1  (SPI_CR1_BR_1 | SPI_CR1_BR_0)  /* BR = 011 → /16 on APB1 */

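Always measure the actual SCK frequency with an oscilloscope or logic analyser in your setup; it confirms that the clock tree and prescaler are what you think they are. Board parasitic capacitance and limited GPIO drive strength do not change the frequency, but they slow the edges and eat into setup and hold margins, which can force you to a larger prescaler than the calculation suggests.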
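The prescaler choice can also be computed instead of read off the table. The following is a minimal host-testable sketch, not a definitive implementation: `spi_sck_hz` and `spi_pick_br` are hypothetical helper names (not part of CMSIS), and the returned value is the raw BR[2:0] field, which would still need to be shifted left by 3 to land in SPI_CR1. It picks the fastest clock that does not exceed a given limit.

```c
#include <stdint.h>

/* Hypothetical helper: SCK frequency for a raw BR[2:0] value (0..7).
 * The divider is 2^(br+1): 0 -> /2, 1 -> /4, ... 7 -> /256. */
static uint32_t spi_sck_hz(uint32_t pclk_hz, unsigned br)
{
    return pclk_hz >> (br + 1u);
}

/* Hypothetical helper: smallest divider (fastest clock) with SCK <= max_sck_hz.
 * Returns the raw BR[2:0] value, or -1 if even /256 is too fast. */
static int spi_pick_br(uint32_t pclk_hz, uint32_t max_sck_hz)
{
    for (unsigned br = 0; br < 8u; br++) {
        if (spi_sck_hz(pclk_hz, br) <= max_sck_hz) {
            return (int)br;
        }
    }
    return -1;
}
```

With an 84 MHz PCLK and a 10 MHz limit this returns 3, which is /16 and 5.25 MHz. Treating the slave's maximum as a hard ceiling is a design choice; a "closest match" policy would pick /8 (10.5 MHz) instead.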

CPOL and CPHA: clock polarity and phase

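These two bits define the SPI clock idle state and the data sampling edge. There are four modes, and the slave device must use the same mode as the master:

Mode 0: CPOL=0, CPHA=0 → SCK idles low, data sampled on the rising edge
Mode 1: CPOL=0, CPHA=1 → SCK idles low, data sampled on the falling edge
Mode 2: CPOL=1, CPHA=0 → SCK idles high, data sampled on the falling edge
Mode 3: CPOL=1, CPHA=1 → SCK idles high, data sampled on the rising edge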

Mode 0 and Mode 3 are the most common in practice. Most SPI temperature sensors, ADCs, and MEMS sensors use Mode 0. Some display controllers and RF modules use Mode 3. Always check the slave datasheet.

/* SPI Mode 0 (CPOL=0, CPHA=0) — most common for sensors */
#define SPI_MODE_0  (0)

/* SPI Mode 3 (CPOL=1, CPHA=1) — common for displays, RF */
#define SPI_MODE_3  (SPI_CR1_CPOL | SPI_CR1_CPHA)
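If the mode arrives as a number from a device table or configuration struct, the mapping to CPOL/CPHA can be written out once. This is a small sketch under stated assumptions: `CR1_CPHA` and `CR1_CPOL` are local stand-ins for the CMSIS masks `SPI_CR1_CPHA` (bit 0) and `SPI_CR1_CPOL` (bit 1) so the code builds off-target, and `spi_mode_bits` is a hypothetical name.

```c
#include <stdint.h>

/* Local stand-ins for the CMSIS masks so the sketch builds off-target. */
#define CR1_CPHA  (1u << 0)   /* assumed equal to SPI_CR1_CPHA */
#define CR1_CPOL  (1u << 1)   /* assumed equal to SPI_CR1_CPOL */

/* Hypothetical helper: translate SPI mode 0..3 into CPOL/CPHA bits. */
static uint32_t spi_mode_bits(unsigned mode)
{
    uint32_t bits = 0;
    if (mode & 1u) {
        bits |= CR1_CPHA;     /* modes 1 and 3: sample on the second edge */
    }
    if (mode & 2u) {
        bits |= CR1_CPOL;     /* modes 2 and 3: clock idles high */
    }
    return bits;
}
```

On target the result would simply be ORed into the value written to SPI1->CR1 while SPE is still 0.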

NSS management: hardware vs software

The NSS (Slave Select) pin can be managed in hardware or software. In master mode with software NSS, you set SSM=1 and SSI=1 in SPI_CR1. The SPI peripheral then ignores the physical NSS pin and takes its internal slave-select level from SSI; keeping SSI=1 prevents a mode fault (MODF) from dropping the peripheral out of master mode. This is the simplest and most common approach for a single-bus master: you drive a separate GPIO as chip select for each slave device.

/* Enable software NSS for master mode */
SPI1->CR1 |= SPI_CR1_SSM | SPI_CR1_SSI;

/* Drive CS as a regular GPIO */
#define CS_PIN   GPIO_PIN_4
#define CS_PORT  GPIOA

void cs_select(void)  { HAL_GPIO_WritePin(CS_PORT, CS_PIN, GPIO_PIN_RESET); }
void cs_deselect(void) { HAL_GPIO_WritePin(CS_PORT, CS_PIN, GPIO_PIN_SET); }

With hardware NSS output (SSM=0, SSOE=1 in SPI_CR2), the SPI peripheral drives the NSS pin low when the master starts communicating and keeps it low until the SPI is disabled (SPE=0); it does not pulse NSS per frame. This suits only a single slave device and requires the NSS alternate-function pin. For multiple slaves, use software NSS plus manual GPIO chip selects.
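For a fully register-level chip select, the GPIO BSRR register can replace the HAL call. The sketch below only computes the BSRR write values so it can be checked off-target; `bsrr_set` and `bsrr_reset` are hypothetical helper names, and CS on PA4 is the assumption used throughout this article.

```c
#include <stdint.h>

/* Hypothetical helpers: BSRR write values for one pin (0..15).
 * Bits [15:0] set the pin, bits [31:16] reset it. A single write
 * changes only the addressed pin, so no read-modify-write is needed. */
static uint32_t bsrr_set(unsigned pin)
{
    return 1u << pin;
}

static uint32_t bsrr_reset(unsigned pin)
{
    return 1u << (pin + 16u);
}

/* On target (assuming CS on PA4):
 *   GPIOA->BSRR = bsrr_reset(4);   // CS low: select
 *   GPIOA->BSRR = bsrr_set(4);     // CS high: deselect
 */
```

Some older device headers split this register into 16-bit BSRRL/BSRRH halves, so check which form your header version exposes.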

Full-duplex master mode: polling transfer

A basic polling full-duplex transfer writes to SPI_DR and waits for RXNE. Because SPI is a shift-register protocol, every transmitted byte shifts a byte in from the slave simultaneously:

static uint8_t spi_transfer_byte(SPI_TypeDef *spi, uint8_t tx)
{
    /* Wait for TXE (transmit buffer empty) */
    while (!(spi->SR & SPI_SR_TXE));
    spi->DR = tx;
    /* Wait for RXNE (receive buffer not empty) */
    while (!(spi->SR & SPI_SR_RXNE));
    return (uint8_t)spi->DR;
}

Avoid polling for large buffers: it blocks the CPU for the entire transfer. At 10 MHz, a 1 KB transfer takes about 820 µs (1024 × 8 bits / 10 MHz), which is a lifetime in a real-time system.
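The blocking time is easy to estimate before deciding between polling and DMA. A minimal sketch, assuming 8-bit frames sent back to back with no inter-frame gaps (so the result is a lower bound); `spi_transfer_time_us` is a hypothetical helper name.

```c
#include <stdint.h>

/* Hypothetical helper: minimum wire time in microseconds for `bytes`
 * 8-bit frames at `sck_hz`, rounded down. 64-bit math avoids overflow. */
static uint32_t spi_transfer_time_us(uint32_t bytes, uint32_t sck_hz)
{
    uint64_t bits = (uint64_t)bytes * 8u;
    return (uint32_t)((bits * 1000000u) / sck_hz);
}
```

For 1024 bytes at 10 MHz this gives 819 µs, while a 4-byte read at 5.25 MHz costs about 6 µs, which is why short register reads are often still done by polling.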

DMA-based full-duplex transfer

The STM32F4 SPI can use DMA for both TX and RX simultaneously. The SPI generates a DMA request on each TXE and RXNE event. You need two DMA streams — one for TX and one for RX — configured in circular or normal mode depending on the use case.

For SPI1 on STM32F401, the DMA mapping is:

SPI1_RX → DMA2 Stream 0 or Stream 2, Channel 3
SPI1_TX → DMA2 Stream 3 or Stream 5, Channel 3

DMA setup for a one-shot full-duplex transfer

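void spi1_dma_fullduplex(uint8_t *txbuf, uint8_t *rxbuf, uint16_t len)
{
    /* 1. Disable both streams and wait until the hardware confirms it */
    DMA2_Stream0->CR &= ~DMA_SxCR_EN;
    DMA2_Stream3->CR &= ~DMA_SxCR_EN;
    while (DMA2_Stream0->CR & DMA_SxCR_EN) { }
    while (DMA2_Stream3->CR & DMA_SxCR_EN) { }

    /* Clear stale flags for streams 0 and 3 (LIFCR is write-1-to-clear) */
    DMA2->LIFCR = DMA_LIFCR_CTCIF0 | DMA_LIFCR_CHTIF0 | DMA_LIFCR_CTEIF0
                | DMA_LIFCR_CDMEIF0 | DMA_LIFCR_CFEIF0
                | DMA_LIFCR_CTCIF3 | DMA_LIFCR_CHTIF3 | DMA_LIFCR_CTEIF3
                | DMA_LIFCR_CDMEIF3 | DMA_LIFCR_CFEIF3;

    /* 2. Configure RX DMA stream (DMA2 Stream 0, Channel 3) */
    DMA2_Stream0->NDTR = len;
    DMA2_Stream0->PAR  = (uint32_t)&SPI1->DR;
    DMA2_Stream0->M0AR = (uint32_t)rxbuf;
    /* MSIZE = PSIZE = 00 selects 8-bit transfers; DIR = 00 is peripheral-to-memory */
    DMA2_Stream0->CR = DMA_SxCR_CHSEL_1 | DMA_SxCR_CHSEL_0   /* ch 3 */
                     | DMA_SxCR_PL_1                          /* high priority */
                     | DMA_SxCR_MINC | DMA_SxCR_TCIE;         /* inc + irq */

    /* 3. Configure TX DMA stream (DMA2 Stream 3, Channel 3) */
    DMA2_Stream3->NDTR = len;
    DMA2_Stream3->PAR  = (uint32_t)&SPI1->DR;
    DMA2_Stream3->M0AR = (uint32_t)txbuf;
    DMA2_Stream3->CR = DMA_SxCR_CHSEL_1 | DMA_SxCR_CHSEL_0   /* ch 3 */
                     | DMA_SxCR_MINC | DMA_SxCR_DIR_0;        /* inc, mem-to-per, 8-bit */

    /* 4. Enable streams, RX first so it is armed before any byte arrives */
    DMA2_Stream0->CR |= DMA_SxCR_EN;
    DMA2_Stream3->CR |= DMA_SxCR_EN;

    /* 5. Enable SPI1 DMA requests last; TXDMAEN starts the transfer */
    SPI1->CR2 |= SPI_CR2_RXDMAEN;
    SPI1->CR2 |= SPI_CR2_TXDMAEN;
}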

The RX stream should be configured with the higher priority or started first, because the first TXE event triggers the TX stream, and the SPI starts shifting immediately. If the RX stream is not ready, the first received byte overruns.
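The priority point can be made explicit by building the stream control word from named fields instead of ORing masks by hand. This is a sketch under stated assumptions: the field positions (CHSEL at bits 27:25, PL at 17:16, MINC bit 10, DIR at 7:6, TCIE bit 4) follow the F4 DMA_SxCR layout, the `DMA_CR_*` macros are local stand-ins for the CMSIS names, and `dma_cr_8bit` is a hypothetical helper.

```c
#include <stdint.h>

/* Local stand-ins for DMA_SxCR fields so the sketch builds off-target. */
#define DMA_CR_CHSEL_POS  25u          /* channel select, 3 bits */
#define DMA_CR_PL_POS     16u          /* priority level, 2 bits */
#define DMA_CR_MINC       (1u << 10)   /* memory address increment */
#define DMA_CR_DIR_M2P    (1u << 6)    /* DIR = 01: memory-to-peripheral */
#define DMA_CR_TCIE       (1u << 4)    /* transfer-complete interrupt */

/* Hypothetical helper: control word for a byte-wide stream with memory
 * increment. MSIZE and PSIZE stay 00, which means 8-bit on both sides. */
static uint32_t dma_cr_8bit(unsigned channel, unsigned prio,
                            int mem_to_periph, int tc_irq)
{
    uint32_t cr = ((uint32_t)(channel & 7u) << DMA_CR_CHSEL_POS)
                | ((uint32_t)(prio & 3u) << DMA_CR_PL_POS)
                | DMA_CR_MINC;
    if (mem_to_periph) {
        cr |= DMA_CR_DIR_M2P;
    }
    if (tc_irq) {
        cr |= DMA_CR_TCIE;
    }
    return cr;
}
```

A call such as `dma_cr_8bit(3, 2, 0, 1)` would describe the RX stream (channel 3, high priority, peripheral-to-memory, interrupt on completion), and `dma_cr_8bit(3, 1, 1, 0)` a medium-priority TX stream.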

For the DMA completion interrupt, enable TCIE (Transfer Complete Interrupt Enable) on the RX stream and enable the stream's interrupt in the NVIC (DMA2_Stream0_IRQn). When the RX DMA transfer count reaches zero, the TCIF flag is set and the handler runs. At that point you can disable the SPI DMA requests and deselect the CS pin:

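void DMA2_Stream0_IRQHandler(void)
{
    if (DMA2->LISR & DMA_LISR_TCIF0) {
        /* LIFCR is write-1-to-clear: write the mask, no read-modify-write */
        DMA2->LIFCR = DMA_LIFCR_CTCIF0 | DMA_LIFCR_CTCIF3;
        SPI1->CR2 &= ~(SPI_CR2_TXDMAEN | SPI_CR2_RXDMAEN);
        DMA2_Stream0->CR &= ~DMA_SxCR_EN;
        DMA2_Stream3->CR &= ~DMA_SxCR_EN;
        while (SPI1->SR & SPI_SR_BSY) { }  /* make sure the bus is idle */
        cs_deselect();
        /* Signal completion to main loop or RTOS task */
    }
}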

Half-duplex bidirectional mode

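Some slaves use a single bidirectional data line (3-wire SPI). For these, set BIDIMODE=1 in SPI_CR1 to put the SPI in half-duplex mode: in master mode the shared line is the MOSI pin, and BIDIOE selects the direction (BIDIOE=1 to transmit, BIDIOE=0 to receive). Slaves that only transmit (e.g. a digital temperature sensor) or only receive (e.g. a DAC) are a different, simplex case: you can stay in normal two-line mode and leave the unused data pin unconfigured, or set RXONLY=1 for a receive-only master.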

This is useful when you need the spare GPIO normally used by the unused data pin. The baud rate and protocol are the same, but you must toggle BIDIOE between write and read phases if the slave protocol alternates.
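The direction switch amounts to changing one bit in SPI_CR1. A minimal sketch, assuming BIDIMODE is bit 15 and BIDIOE is bit 14 of SPI_CR1 as on the F4; the `CR1_*` macros are local stand-ins for the CMSIS masks and `spi_cr1_bidi` is a hypothetical helper that only computes the new register value.

```c
#include <stdint.h>

/* Local stand-ins for the CMSIS masks so the sketch builds off-target. */
#define CR1_BIDIOE    (1u << 14)   /* assumed equal to SPI_CR1_BIDIOE */
#define CR1_BIDIMODE  (1u << 15)   /* assumed equal to SPI_CR1_BIDIMODE */

/* Hypothetical helper: return cr1 with one-line mode enabled and the
 * direction set. transmit != 0 drives the line, transmit == 0 reads it. */
static uint32_t spi_cr1_bidi(uint32_t cr1, int transmit)
{
    cr1 |= CR1_BIDIMODE;
    if (transmit) {
        cr1 |= CR1_BIDIOE;
    } else {
        cr1 &= ~CR1_BIDIOE;
    }
    return cr1;
}
```

On target the write would look like `SPI1->CR1 = spi_cr1_bidi(SPI1->CR1, 0);`, done only after the last transmitted frame has left the shift register (TXE set and BSY clear), so the master does not release the line mid-frame.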

CRC on SPI

STM32F4 SPI supports hardware CRC generation and checking when CRCEN=1 in SPI_CR1, using the polynomial in SPI_CRCPR. The transmitter accumulates a CRC over the TX data and sends it after the last data frame once CRCNEXT is set. The receiver accumulates its own CRC over the RX data and compares it with the received CRC frame. If there is a mismatch, CRCERR is set in SPI_SR.

In practice, hardware SPI CRC is rarely used in master mode for multi-slave buses because it increases protocol overhead and requires both sides to agree on the polynomial. For reliable transfer on noisy buses, most designers prefer a higher-level CRC or checksum in the application protocol, keep SPI runs short, and move to a differential bus such as RS-485 for longer links.
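An application-level check of the kind mentioned above can be as small as a bitwise CRC-8. The sketch below uses polynomial 0x07 with a zero initial value and no reflection; that parameter choice is an assumption for illustration, not something any particular slave requires, and `crc8` is a hypothetical name. It is not claimed to reproduce the hardware CRC unit bit for bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), init 0x00,
 * MSB first, no final XOR. Slow but table-free. */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x80u) {
                crc = (uint8_t)((crc << 1) ^ 0x07u);
            } else {
                crc = (uint8_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

The sender appends `crc8(payload, n)` as the last byte; the receiver recomputes it over the payload and compares. Both ends must of course use the same polynomial and initial value.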

Practical example: reading an SPI sensor (MAX31855 thermocouple converter or MCP3564 ADC)

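Here is a complete register-level master-mode initialization for SPI1 on an STM32F401 communicating with a generic SPI sensor at 5.25 MHz, Mode 0, 8-bit, with DMA one-shot transfer. Check the sensor's maximum SCK first; if it is specified at 5 MHz or less, step down to /32 (2.625 MHz):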

void spi1_master_init(void)
{
    /* Enable clocks: GPIOA (SCK, MOSI, MISO, CS), DMA2, SPI1 */
    RCC->AHB1ENR  |= RCC_AHB1ENR_GPIOAEN | RCC_AHB1ENR_DMA2EN;
    RCC->APB2ENR  |= RCC_APB2ENR_SPI1EN;

    /* GPIO: PA4=CS (output), PA5=SCK, PA6=MISO, PA7=MOSI (AF5) */
    /* MODER uses two bits per pin: PA4 [9:8], PA5 [11:10], PA6 [13:12], PA7 [15:14] */
    cs_deselect();                      /* preload CS high before PA4 becomes an output */
    GPIOA->MODER  &= ~(0xFFu << 8);     /* clear PA4..PA7 mode fields */
    GPIOA->MODER  |=  (0x1u << 8)                                 /* PA4: output */
                   |  (0x2u << 10) | (0x2u << 12) | (0x2u << 14); /* PA5..PA7: AF */
    GPIOA->AFR[0] &= ~(0xFFFu << 20);   /* clear AF fields of PA5..PA7 */
    GPIOA->AFR[0] |=  (0x5u << 20) | (0x5u << 24) | (0x5u << 28); /* AF5 */
    GPIOA->OSPEEDR |= (0xFFu << 8);     /* PA4..PA7: high speed */

    /* SPI1: master, /16 → 84/16 = 5.25 MHz, Mode 0, 8-bit, SSM */
    SPI1->CR2 = 0;  /* no interrupts, no DMA yet */
    SPI1->CR1 = SPI_CR1_MSTR
              | SPI_CR1_BR_1 | SPI_CR1_BR_0   /* BR = 011 → /16 */
              | SPI_CR1_SSM | SPI_CR1_SSI;    /* software NSS */
    SPI1->CR1 |= SPI_CR1_SPE;                 /* enable last, once configured */

    /* RX DMA transfer-complete interrupt */
    NVIC_EnableIRQ(DMA2_Stream0_IRQn);
}
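Pin-field shifts in MODER are easy to get wrong by hand because each pin owns two bits, so pin n lives at bit position 2n. A small helper makes the clear-then-set pattern explicit. This is a sketch only: `gpio_moder_with` and the `PIN_MODE_*` constants are hypothetical names, and the function works on a plain value so it can be checked off-target.

```c
#include <stdint.h>

/* Hypothetical names for the four MODER field values. */
enum {
    PIN_MODE_INPUT  = 0,
    PIN_MODE_OUTPUT = 1,
    PIN_MODE_AF     = 2,
    PIN_MODE_ANALOG = 3
};

/* Hypothetical helper: return `moder` with the 2-bit field of `pin`
 * (0..15) replaced by `mode`. Other pins are left untouched. */
static uint32_t gpio_moder_with(uint32_t moder, unsigned pin, unsigned mode)
{
    unsigned shift = pin * 2u;
    moder &= ~(0x3u << shift);                   /* clear the old field */
    moder |= (uint32_t)(mode & 0x3u) << shift;   /* write the new one */
    return moder;
}
```

On target this would be used as `GPIOA->MODER = gpio_moder_with(GPIOA->MODER, 5, PIN_MODE_AF);` for each pin.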

Reading the sensor in a one-shot DMA transfer:

uint8_t tx_buf[4] = {0x00, 0x00, 0x00, 0x00};  /* dummy bytes to clock */
uint8_t rx_buf[4] = {0};

cs_select();
spi1_dma_fullduplex(tx_buf, rx_buf, 4);
/* Wait for DMA completion interrupt */
/* rx_buf now contains the 4-byte sensor reading */
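Once the four bytes are in rx_buf they still need decoding. The sketch below assumes the sensor is a MAX31855 and that its 32-bit frame carries the thermocouple temperature as a signed 14-bit value in bits 31:18 (0.25 °C per count) and a fault flag in bit 16; verify that layout against the datasheet before relying on it. The function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: thermocouple temperature in quarter degrees C.
 * rx[0] is the first byte clocked in (most significant). */
static int32_t max31855_tc_quarter_deg(const uint8_t rx[4])
{
    uint32_t raw = ((uint32_t)rx[0] << 24) | ((uint32_t)rx[1] << 16)
                 | ((uint32_t)rx[2] << 8)  |  (uint32_t)rx[3];
    int32_t v = (int32_t)((raw >> 18) & 0x3FFFu);   /* 14-bit field */
    if (v & 0x2000) {
        v -= 0x4000;                                /* sign-extend */
    }
    return v;
}

/* Hypothetical helper: non-zero when the fault flag (bit 16) is set. */
static int max31855_fault(const uint8_t rx[4])
{
    return (rx[1] & 0x01u) != 0;
}
```

A frame of 0x01 0x90 0x00 0x00 decodes to 100 quarter degrees, i.e. +25.00 °C. Manual sign extension is used instead of a right shift on a signed value because the latter is implementation-defined in C.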

Practical checklist for SPI master mode

How I would approach this on a client project

On a client project, I never use polling SPI for more than 4 bytes. The CPU time is too valuable. The standard approach is:
