STM32 SPI Master Mode at Register Level: Baud Rate, CPOL/CPHA, NSS, and DMA Transfers on STM32F4
SPI is the de-facto peripheral bus for interfacing sensors, ADCs, DACs, displays, SD cards, and wireless modules in embedded systems. On the STM32F4, the SPI peripheral is flexible enough to handle everything from a low-rate temperature sensor at 125 kHz to a display controller streaming at 21 MHz — but only if you understand the register-level knobs. This article covers SPI master mode configuration, clock polarity/phase (CPOL/CPHA), baud rate prescaler selection, software NSS management, half-duplex bidirectional mode, CRC, and DMA-based full-duplex transfers on the STM32F401.
SPI architecture on STM32F4
The STM32F401 has up to four SPI interfaces, depending on the package. SPI1 and SPI4 are on APB2 (max 84 MHz), while SPI2 and SPI3 are on APB1 (max 42 MHz). Each SPI can operate in master or slave mode, supports full-duplex, half-duplex, and simplex (unidirectional) communication, and has a selectable 8-bit or 16-bit data frame format.
The core registers you need to know:
- SPI_CR1: master/slave select (MSTR), baud rate (BR[2:0]), clock polarity (CPOL), clock phase (CPHA), data frame format (DFF), software NSS (SSM + SSI), half-duplex (BIDIMODE/BIDIOE), CRC enable (CRCEN), and peripheral enable (SPE).
- SPI_CR2: DMA enables (TXDMAEN, RXDMAEN), interrupt enables (TXEIE, RXNEIE, ERRIE), SS output enable (SSOE), and frame format (FRF, Motorola or TI).
- SPI_SR: status flags: TXE (transmit buffer empty), RXNE (receive buffer not empty), BSY (busy), MODF (mode fault), CRCERR, OVR (overrun).
- SPI_DR: data register, 16 bits wide. Writing transmits; reading receives.
- SPI_CRCPR: CRC polynomial register (used when CRC is enabled).
On the F401, the SPI peripherals do not have a FIFO: each uses a single-word transmit buffer and a single-word receive buffer, signalled by the TXE and RXNE flags. Later STM32 families such as the G4 and H7 added TX/RX FIFOs, but on the F4 every frame has to be written or read before the next one completes.
Baud rate prescaler selection
The SPI clock is generated by dividing the peripheral clock (PCLK) by a power-of-two prescaler configured in SPI_CR1[BR]:
BR[2:0] = 000 (no BR bits set) → fPCLK / 2
SPI_CR1_BR_0 → fPCLK / 4
SPI_CR1_BR_1 → fPCLK / 8
SPI_CR1_BR_1 | SPI_CR1_BR_0 → fPCLK / 16
SPI_CR1_BR_2 → fPCLK / 32
SPI_CR1_BR_2 | SPI_CR1_BR_0 → fPCLK / 64
SPI_CR1_BR_2 | SPI_CR1_BR_1 → fPCLK / 128
SPI_CR1_BR_2 | SPI_CR1_BR_1 | SPI_CR1_BR_0 → fPCLK / 256
With SPI1 on APB2 at 84 MHz, the smallest prescaler (/2) gives a theoretical maximum SCK of 84/2 = 42 MHz. In practice, the maximum reliable SPI clock for the F401 is often closer to 21 MHz (prescaler /4) at 3.3 V. The achievable bit rate also depends on board layout, trace length, and slave device capability, so always check the slave datasheet.
Practical example: choosing the baud rate for a 10-MHz SPI ADC
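/* SPI1 on APB2 = 84 MHz. Target: about 10 MHz SPI clock */
/* 84 MHz / 8 = 10.5 MHz (closest to 10 MHz). If 10 MHz is a hard */
/* maximum for the slave, step down to /16 = 5.25 MHz instead. */
#define SPI_BAUD_10MHZ (SPI_CR1_BR_1) /* BR = 010: /8 */
/* Target: about 2 MHz SPI clock for an SD card */
/* SPI1 on APB2 = 84 MHz: 84 / 32 = 2.625 MHz */
#define SPI_BAUD_2MHZ_APB2 (SPI_CR1_BR_2) /* BR = 100: /32 on APB2 */
/* SPI2 on APB1 = 42 MHz: 42 / 16 = 2.625 MHz */
#define SPI_BAUD_2MHZ_APB1 (SPI_CR1_BR_1 | SPI_CR1_BR_0) /* BR = 011: /16 on APB1 */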
Always check the actual SCK with an oscilloscope or logic analyser in your setup. The prescaler fixes the SCK frequency, but board parasitic capacitance and pin drive strength slow the edges, so the highest rate that works reliably can be lower than the prescaler calculation suggests.
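The prescaler choice can also be computed rather than read off the table. The following is a minimal sketch, assuming you want the fastest SCK that does not exceed a slave's limit; the helper names `spi_br_for_max_sck` and `spi_sck_hz` are hypothetical and not part of any ST header.

```c
#include <stdint.h>

/* Hypothetical helper: returns the 3-bit BR[2:0] field value (0..7) for the
 * fastest SCK that does not exceed max_sck_hz. The divider is 2^(BR + 1).
 * Falls back to 7 (/256) if even that is still too fast. */
static uint32_t spi_br_for_max_sck(uint32_t pclk_hz, uint32_t max_sck_hz)
{
    for (uint32_t br = 0; br < 7; br++) {
        if ((pclk_hz >> (br + 1)) <= max_sck_hz) {
            return br;
        }
    }
    return 7;
}

/* Resulting SCK frequency for a given BR field value. */
static uint32_t spi_sck_hz(uint32_t pclk_hz, uint32_t br)
{
    return pclk_hz >> (br + 1);
}
```

For an 84 MHz PCLK and a 10 MHz limit this picks BR = 3 (/16, 5.25 MHz), because /8 would give 10.5 MHz. On target the field value would be shifted left by 3 before being written, since BR occupies bits 5:3 of SPI_CR1.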
CPOL and CPHA: clock polarity and phase
These two bits define the SPI clock idle state and the data sampling edge. There are four modes, and the slave device must use the same mode as the master:
- Mode 0 (CPOL=0, CPHA=0): SCK idle low, data sampled on the rising edge, shifted on the falling edge.
- Mode 1 (CPOL=0, CPHA=1): SCK idle low, data sampled on the falling edge, shifted on the rising edge.
- Mode 2 (CPOL=1, CPHA=0): SCK idle high, data sampled on the falling edge, shifted on the rising edge.
- Mode 3 (CPOL=1, CPHA=1): SCK idle high, data sampled on the rising edge, shifted on the falling edge.
Mode 0 and Mode 3 are the most common in practice. Most SPI temperature sensors, ADCs, and MEMS sensors use Mode 0. Some display controllers and RF modules use Mode 3. Always check the slave datasheet.
/* SPI Mode 0 (CPOL=0, CPHA=0): most common for sensors */
#define SPI_MODE_0 (0)
/* SPI Mode 3 (CPOL=1, CPHA=1): common for displays, RF */
#define SPI_MODE_3 (SPI_CR1_CPOL | SPI_CR1_CPHA)
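The mode number is simply the two-bit value formed by CPOL and CPHA, which makes a generic mapping easy. The sketch below assumes the F4 bit positions (CPHA in bit 0 and CPOL in bit 1 of SPI_CR1) and defines them locally so it builds without the device header; `spi_mode_bits` is a hypothetical helper name.

```c
#include <stdint.h>

/* Bit positions in SPI_CR1 (CPHA = bit 0, CPOL = bit 1). Local names are
 * used so the sketch builds without the CMSIS device header. */
#define CR1_CPHA (1u << 0)
#define CR1_CPOL (1u << 1)

/* Hypothetical helper: SPI mode number (0..3) to CPOL/CPHA bits for CR1. */
static uint32_t spi_mode_bits(unsigned mode)
{
    uint32_t bits = 0;
    if (mode & 1u) {
        bits |= CR1_CPHA; /* modes 1 and 3 sample on the second clock edge */
    }
    if (mode & 2u) {
        bits |= CR1_CPOL; /* modes 2 and 3 idle with SCK high */
    }
    return bits;
}
```

With the CMSIS header available, the same function would use SPI_CR1_CPHA and SPI_CR1_CPOL directly and its result would be OR-ed into the CR1 value before SPE is set.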
NSS management: hardware vs software
The NSS (slave select) pin can be managed in hardware or software. In master mode with software NSS, you set SSM=1 and SSI=1 in SPI_CR1. The SPI peripheral then ignores the physical NSS pin and takes its internal slave-select level from SSI; keeping SSI=1 prevents a mode fault (MODF) from knocking the peripheral out of master mode. This is the simplest and most common approach for a single-bus master: you drive a separate GPIO as chip select for each slave device.
/* Enable software NSS for master mode */
SPI1->CR1 |= SPI_CR1_SSM | SPI_CR1_SSI;
/* Drive CS as a regular GPIO */
#define CS_PIN GPIO_PIN_4
#define CS_PORT GPIOA
void cs_select(void) { HAL_GPIO_WritePin(CS_PORT, CS_PIN, GPIO_PIN_RESET); }
void cs_deselect(void) { HAL_GPIO_WritePin(CS_PORT, CS_PIN, GPIO_PIN_SET); }
With hardware NSS output (SSM=0, SSOE=1 in SPI_CR2), the SPI peripheral drives the NSS pin low when the master starts communicating and keeps it low until the SPI is disabled (SPE=0); it does not pulse NSS around each frame. This works only for a single slave device and requires correct pin mapping. For multiple slaves, use software NSS + manual GPIO chip selects.
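If you want the chip select itself at register level rather than through the HAL call, a single write to the GPIO bit set/reset register is enough. This is a sketch under stated assumptions: `gpio_bsrr_t` is a stand-in for the one register used, so the code builds off-target, and the `_reg` function names are hypothetical. On the real part you would write to GPIOA->BSRR.

```c
#include <stdint.h>

/* Stand-in for the one GPIO register used here; on target you would use
 * GPIOA->BSRR from the CMSIS header instead. */
typedef struct {
    volatile uint32_t BSRR; /* bits 0..15 set a pin, bits 16..31 reset it */
} gpio_bsrr_t;

#define CS_PIN_NUM 4u /* PA4, matching the CS pin used in this article */

/* Drive CS low (select the slave): write the reset half of BSRR. */
static void cs_select_reg(gpio_bsrr_t *port)
{
    port->BSRR = 1u << (CS_PIN_NUM + 16u);
}

/* Drive CS high (deselect the slave): write the set half of BSRR. */
static void cs_deselect_reg(gpio_bsrr_t *port)
{
    port->BSRR = 1u << CS_PIN_NUM;
}
```

Because BSRR writes are atomic, no read-modify-write of the output data register is needed, which matters when an interrupt handler releases CS.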
Full-duplex master mode: polling transfer
A basic polling full-duplex transfer writes to SPI_DR and waits for RXNE. Because SPI is a shift-register protocol, every transmitted byte shifts a byte in from the slave simultaneously:
static uint8_t spi_transfer_byte(SPI_TypeDef *spi, uint8_t tx)
{
    /* Wait for TXE (transmit buffer empty) */
    while (!(spi->SR & SPI_SR_TXE));
    spi->DR = tx;
    /* Wait for RXNE (receive buffer not empty) */
    while (!(spi->SR & SPI_SR_RXNE));
    return (uint8_t)spi->DR;
}
Avoid polling for large multi-byte transfers: it blocks the CPU for the entire transfer duration. At 10 MHz, a 1 KB transfer takes about 820 µs (1024 × 8 bits / 10 MHz), which is a lifetime in a real-time system.
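The blocking time is easy to estimate before deciding between polling and DMA. The helper below is a hypothetical sketch that assumes 8-bit frames clocked back to back with no inter-byte gaps and a non-zero SCK frequency; real transfers are usually a little slower.

```c
#include <stdint.h>

/* Hypothetical helper: time on the wire in microseconds, rounded up,
 * assuming 8-bit frames with no gaps between them. sck_hz must be non-zero. */
static uint32_t spi_transfer_time_us(uint32_t num_bytes, uint32_t sck_hz)
{
    uint64_t bits = (uint64_t)num_bytes * 8u;
    return (uint32_t)((bits * 1000000u + sck_hz - 1u) / sck_hz);
}
```

For 1024 bytes at 10 MHz this returns 820, matching the figure above; four bytes at 5.25 MHz come out at 7 µs, which is short enough that polling is still reasonable.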
DMA-based full-duplex transfer
The STM32F4 SPI can use DMA for both TX and RX simultaneously. The SPI generates a DMA request on each TXE and RXNE event. You need two DMA streams — one for TX and one for RX — configured in circular or normal mode depending on the use case.
For SPI1 on STM32F401, the DMA mapping is:
- SPI1_RX: DMA2 Stream 0, Channel 3
- SPI1_TX: DMA2 Stream 3, Channel 3
DMA setup for a one-shot full-duplex transfer
void spi1_dma_fullduplex(uint8_t *txbuf, uint8_t *rxbuf, uint16_t len)
{
    /* 1. Disable both streams, wait until they are off, clear stale flags */
    DMA2_Stream0->CR &= ~DMA_SxCR_EN;
    DMA2_Stream3->CR &= ~DMA_SxCR_EN;
    while ((DMA2_Stream0->CR | DMA2_Stream3->CR) & DMA_SxCR_EN);
    DMA2->LIFCR = DMA_LIFCR_CTCIF0 | DMA_LIFCR_CHTIF0 | DMA_LIFCR_CTEIF0
                | DMA_LIFCR_CDMEIF0 | DMA_LIFCR_CFEIF0
                | DMA_LIFCR_CTCIF3 | DMA_LIFCR_CHTIF3 | DMA_LIFCR_CTEIF3
                | DMA_LIFCR_CDMEIF3 | DMA_LIFCR_CFEIF3;
    /* 2. Configure RX DMA stream (DMA2 Stream 0, Channel 3) */
    /* MSIZE = PSIZE = 00 selects 8-bit; DIR = 00 is peripheral-to-memory */
    DMA2_Stream0->NDTR = len;
    DMA2_Stream0->PAR = (uint32_t)&SPI1->DR;
    DMA2_Stream0->M0AR = (uint32_t)rxbuf;
    DMA2_Stream0->CR = DMA_SxCR_CHSEL_1 | DMA_SxCR_CHSEL_0 /* ch 3 */
                     | DMA_SxCR_PL_1                        /* high priority */
                     | DMA_SxCR_MINC | DMA_SxCR_TCIE;       /* inc + irq */
    /* 3. Configure TX DMA stream (DMA2 Stream 3, Channel 3) */
    DMA2_Stream3->NDTR = len;
    DMA2_Stream3->PAR = (uint32_t)&SPI1->DR;
    DMA2_Stream3->M0AR = (uint32_t)txbuf;
    DMA2_Stream3->CR = DMA_SxCR_CHSEL_1 | DMA_SxCR_CHSEL_0 /* ch 3 */
                     | DMA_SxCR_MINC | DMA_SxCR_DIR_0;      /* inc, mem-to-per */
    /* 4. Enable the RX request and both streams (RX first), TX request last */
    SPI1->CR2 |= SPI_CR2_RXDMAEN;
    DMA2_Stream0->CR |= DMA_SxCR_EN;
    DMA2_Stream3->CR |= DMA_SxCR_EN;
    SPI1->CR2 |= SPI_CR2_TXDMAEN; /* first TXE request starts the transfer */
}
The RX stream should be configured with the higher priority or started first, because the first TXE event triggers the TX stream, and the SPI starts shifting immediately. If the RX stream is not ready, the first received byte overruns.
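To make the priority and data-size fields concrete, here is a sketch that builds the two DMA_SxCR words from their fields. The field positions are those of the STM32F4 DMA controller, defined locally so the snippet builds without the device header; the macro and function names are hypothetical. Note that 8-bit transfers need MSIZE and PSIZE left at 0, and that the RX stream gets the higher PL value.

```c
#include <stdint.h>

/* DMA_SxCR field positions on the STM32F4, defined locally so the sketch
 * builds without the device header. */
#define CR_CHSEL(n) ((uint32_t)(n) << 25) /* channel select, bits 27:25 */
#define CR_PL(n)    ((uint32_t)(n) << 16) /* priority: 0 = low .. 3 = very high */
#define CR_MSIZE(n) ((uint32_t)(n) << 13) /* memory data size, 0 = 8-bit */
#define CR_PSIZE(n) ((uint32_t)(n) << 11) /* peripheral data size, 0 = 8-bit */
#define CR_MINC     (1u << 10)            /* increment the memory address */
#define CR_DIR_M2P  (1u << 6)             /* direction: memory-to-peripheral */
#define CR_TCIE     (1u << 4)             /* transfer-complete interrupt */

/* RX: channel 3, high priority, 8-bit, peripheral-to-memory, TC interrupt. */
static uint32_t spi1_rx_dma_cr(void)
{
    return CR_CHSEL(3) | CR_PL(2) | CR_MSIZE(0) | CR_PSIZE(0) | CR_MINC | CR_TCIE;
}

/* TX: channel 3, medium priority, 8-bit, memory-to-peripheral. */
static uint32_t spi1_tx_dma_cr(void)
{
    return CR_CHSEL(3) | CR_PL(1) | CR_MSIZE(0) | CR_PSIZE(0) | CR_MINC | CR_DIR_M2P;
}
```

The enable bit (bit 0) is deliberately left out of both words so that the streams can be switched on in a separate step, RX before TX.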
For the DMA completion interrupt, enable TCIE (Transfer Complete Interrupt Enable) on the RX stream. When the RX DMA transfer count reaches zero, the TCIF flag fires. At that point you can disable the SPI DMA requests and deselect the CS pin:
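void DMA2_Stream0_IRQHandler(void)
{
    if (DMA2->LISR & DMA_LISR_TCIF0) {
        /* LIFCR is write-1-to-clear, so write the bit directly (no |=) */
        DMA2->LIFCR = DMA_LIFCR_CTCIF0;
        /* The last frame has been received; confirm the bus is idle */
        while (SPI1->SR & SPI_SR_BSY);
        SPI1->CR2 &= ~(SPI_CR2_TXDMAEN | SPI_CR2_RXDMAEN);
        DMA2_Stream0->CR &= ~DMA_SxCR_EN;
        DMA2_Stream3->CR &= ~DMA_SxCR_EN;
        cs_deselect();
        /* Signal completion to main loop or RTOS task */
    }
}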
Half-duplex bidirectional mode
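Some slaves expose a single bidirectional data line (often called 3-wire SPI). For these, you can use the SPI in half-duplex bidirectional mode (BIDIMODE=1), where one data line is shared for both directions; in master mode that line is the MOSI pin. Set BIDIOE=1 to output data, or BIDIOE=0 to receive. Slaves that are transmit-only (e.g. a digital temperature sensor) or receive-only (e.g. a DAC) do not need BIDIMODE: run the SPI in normal two-line mode and leave the unused data pin unconfigured.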
This is useful when you need the spare GPIO normally used by the unused data pin. The baud rate and protocol are the same, but you must toggle BIDIOE between write and read phases if the slave protocol alternates.
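The direction switch comes down to two bits in SPI_CR1. The sketch below works on a plain CR1 value so it can be tried off-target; it assumes the F4 bit positions (BIDIMODE in bit 15, BIDIOE in bit 14), and the function names are hypothetical.

```c
#include <stdint.h>

#define CR1_BIDIMODE (1u << 15) /* 1-line bidirectional mode */
#define CR1_BIDIOE   (1u << 14) /* 1 = output (transmit), 0 = input (receive) */

/* Return a CR1 value with the shared data line switched to transmit. */
static uint32_t cr1_bidi_transmit(uint32_t cr1)
{
    return cr1 | CR1_BIDIMODE | CR1_BIDIOE;
}

/* Return a CR1 value with the shared data line switched to receive. */
static uint32_t cr1_bidi_receive(uint32_t cr1)
{
    return (cr1 | CR1_BIDIMODE) & ~CR1_BIDIOE;
}
```

On target this would be used as `SPI1->CR1 = cr1_bidi_receive(SPI1->CR1);` between the command phase and the read phase. Changing direction only while the bus is idle (BSY=0) is a sensible precaution.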
CRC on SPI
STM32F4 SPI supports hardware CRC generation and checking when CRCEN=1 in SPI_CR1. The CRC is computed on the TX data and appended after the last data byte. The RX side generates its own CRC and compares it. If there is a mismatch, CRCERR is set in SPI_SR.
In practice, hardware SPI CRC is rarely used in master mode on multi-slave buses because it adds protocol overhead and requires both sides to agree on the polynomial. For reliable transfer on noisy links, most designers prefer a higher-level CRC or checksum in the application protocol. Where the signal has to travel beyond a short on-board run, a differential physical layer such as RS-485 is the usual choice, with SPI kept to short distances.
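An application-level check can be as small as a bitwise CRC-8 appended to each frame. The following is an illustrative sketch, not the method of any particular slave: the polynomial 0x07 with a zero initial value is an arbitrary choice, and both ends of the link have to agree on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), initial value 0x00,
 * no bit reflection, no final XOR. The parameters are an illustrative
 * choice; both ends of the link must use the same ones. */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x80u) {
                crc = (uint8_t)((crc << 1) ^ 0x07u);
            } else {
                crc = (uint8_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

With these parameters, running the CRC over a message followed by its own CRC byte yields zero, so the receiver can check a frame with a single call and a comparison against 0.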
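Practical example: reading an SPI sensor (MAX31855 thermocouple converter or MCP3564 ADC)
Here is a complete register-level master-mode initialization for SPI1 on an STM32F401 communicating with a generic SPI sensor at 5.25 MHz, Mode 0, 8-bit, ready for the DMA one-shot transfer shown earlier. Check the sensor's maximum SCK first: a part limited to 5 MHz needs the next prescaler down (/32 = 2.625 MHz).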
void spi1_master_init(void)
{
    /* Enable clocks: GPIOA (SCK, MOSI, MISO, CS), DMA2, SPI1 */
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN | RCC_AHB1ENR_DMA2EN;
    RCC->APB2ENR |= RCC_APB2ENR_SPI1EN;
    /* GPIO: PA4=CS, PA5=SCK, PA6=MISO, PA7=MOSI */
    cs_deselect(); /* preload CS high before PA4 becomes an output */
    GPIOA->MODER &= ~(0xFFu << 8);  /* clear mode bits of PA4..PA7 */
    GPIOA->MODER |= (0x1u << 8)     /* PA4: output */
                  | (0x2u << 10) | (0x2u << 12) | (0x2u << 14); /* PA5..PA7: AF */
    GPIOA->AFR[0] &= ~(0xFFFu << 20);
    GPIOA->AFR[0] |= (0x5u << 20) | (0x5u << 24) | (0x5u << 28); /* AF5 */
    GPIOA->OSPEEDR |= (0xFFu << 8); /* PA4..PA7: high speed */
    /* SPI1: master, /16 gives 84/16 = 5.25 MHz, Mode 0, 8-bit, SSM */
    SPI1->CR2 = 0; /* no interrupts, no DMA yet */
    SPI1->CR1 = SPI_CR1_MSTR
              | SPI_CR1_BR_1 | SPI_CR1_BR_0  /* BR = 011: /16 */
              | SPI_CR1_SSM | SPI_CR1_SSI;   /* software NSS */
    SPI1->CR1 |= SPI_CR1_SPE; /* enable once everything is configured */
    /* Route the RX transfer-complete interrupt to DMA2_Stream0_IRQHandler */
    NVIC_EnableIRQ(DMA2_Stream0_IRQn);
}
Reading the sensor in a one-shot DMA transfer:
uint8_t tx_buf[4] = {0x00, 0x00, 0x00, 0x00}; /* dummy bytes to clock */
uint8_t rx_buf[4] = {0};
cs_select();
spi1_dma_fullduplex(tx_buf, rx_buf, 4);
/* Wait for DMA completion interrupt */
/* rx_buf now contains the 4-byte sensor reading */
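If the sensor is a MAX31855, the four received bytes form one 32-bit frame that still has to be unpacked. The sketch below assumes the frame layout given in the MAX31855 datasheet (a signed 14-bit thermocouple temperature in bits 31..18 with 0.25 °C resolution, and a fault flag in bit 16); verify this against the datasheet for your part. The type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int32_t tc_quarter_degc; /* thermocouple temperature in units of 0.25 degC */
    bool    fault;           /* bit 16: set when the device reports a fault */
} max31855_reading_t;

/* Assumed frame layout: bits 31..18 = signed 14-bit temperature,
 * bit 16 = fault flag. rx[0] is the first byte clocked in (MSB first). */
static max31855_reading_t max31855_decode(const uint8_t rx[4])
{
    uint32_t raw = ((uint32_t)rx[0] << 24) | ((uint32_t)rx[1] << 16)
                 | ((uint32_t)rx[2] << 8)  |  (uint32_t)rx[3];
    int32_t t = (int32_t)(raw >> 18); /* 14-bit two's complement value */
    if (t & 0x2000) {
        t -= 0x4000;                  /* sign-extend negative readings */
    }
    max31855_reading_t r = { t, (raw & (1u << 16)) != 0u };
    return r;
}
```

Keeping the result in quarter-degree units avoids floating point in the interrupt path; dividing by 4 in the application gives degrees Celsius.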
Practical checklist for SPI master mode
- Peripheral clock: confirm the SPI instance is on APB1 or APB2 and the PCLK is what you expect. SPI1 on APB2 can clock much faster than SPI2/SPI3 on APB1.
- GPIO alternate function: verify AF mapping in the datasheet. STM32F401 uses AF5 for SPI1 and SPI2, and AF6 for SPI3 on most pins. Mis-mapped AFs are the #1 cause of dead SPI.
- Output speed: set OSPEEDR to high speed on SCK, MOSI, and MISO. Low-speed settings cause distorted SCK edges above ~5 MHz.
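- Pull-ups: MISO is driven by the selected slave but floats while every CS is high, so a weak pull-up or pull-down keeps it at a defined level. MOSI and SCK are driven push-pull by the master while the SPI is enabled and normally need no pull resistors.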
- DMA stream priority: set the RX stream to higher priority than TX to avoid RX overrun on the first byte.
- CS timing: some slaves require a minimum CS-low-to-first-SCK-edge delay. Insert a few NOPs or a udelay() after asserting CS.
- Busy flag: after the last byte, wait for TXE=1 and then BSY=0 before de-asserting CS. This ensures the last bit has been shifted out.
- OVR flag: if OVR is set after a transfer, clear it by reading SPI_DR and then SPI_SR. While OVR is set, newly received data is discarded.
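The busy-flag item is worth wrapping in a bounded wait so a wiring fault cannot hang the firmware. This is a minimal sketch, assuming the F4 status bit positions (TXE in bit 1, BSY in bit 7); `spi_wait_idle` is a hypothetical helper that takes a pointer to the status register, so it can be exercised off-target with an ordinary variable.

```c
#include <stdbool.h>
#include <stdint.h>

/* SPI_SR bit positions on the F4, defined locally for an off-target build. */
#define SR_TXE (1u << 1)
#define SR_BSY (1u << 7)

/* Hypothetical helper: true once TXE=1 and BSY=0, false if that state is not
 * seen within max_polls reads of the status register. On target, pass
 * &SPI1->SR as the first argument. */
static bool spi_wait_idle(const volatile uint32_t *sr, uint32_t max_polls)
{
    while (max_polls--) {
        uint32_t s = *sr;
        if ((s & SR_TXE) && !(s & SR_BSY)) {
            return true;
        }
    }
    return false;
}
```

A caller would release CS only when this returns true and treat false as a bus error. The poll count is a crude timeout; a timer-based deadline would be more precise.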
How I would approach this on a client project
On a client project, I never use polling SPI for more than 4 bytes. The CPU time is too valuable. The standard approach is:
- Write a DMA-based SPI abstraction that takes a buffer pair, a length, and a callback. The SPI peripheral + DMA runs in the background while the CPU handles other tasks or enters low-power mode.
- For multi-slave buses, keep a context struct per CS pin that stores the SPI instance, DMA stream handles, and a completion semaphore (or flag).
- Use the hardware NSS + SSOE only for a single-chip design. For any board with multiple SPI slaves, use software NSS with GPIO CS — this gives you full control over CS timing and lets you insert delays between slaves.
- Validate the SPI bus with a known device or a loopback test (connect MOSI to MISO externally) before connecting the real slave. A loopback at 8-bit, Mode 0, writing 0xAA should read back 0xAA.
- Document the maximum SCK frequency after measuring it on the actual PCB — don't rely on the prescaler calculation alone. Board capacitance and driver strength matter.
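The first two points can be sketched as a small job structure with a completion callback. Everything here is illustrative: the struct fields, `spi_job_start`, `spi_job_complete`, and `count_done` are hypothetical names, the hardware calls are left out, and the busy check is not atomic, so a real version would need to guard it against concurrent callers.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*spi_done_cb_t)(void *user);

/* Hypothetical per-slave context; field names are illustrative only. */
typedef struct {
    uint8_t        cs_pin;  /* GPIO pin number used as chip select */
    const uint8_t *tx;      /* bytes to send */
    uint8_t       *rx;      /* where received bytes land */
    uint16_t       len;     /* transfer length in bytes */
    volatile int   busy;    /* non-zero while a transfer is in flight */
    spi_done_cb_t  on_done; /* called when the transfer completes */
    void          *user;    /* passed back to on_done */
} spi_job_t;

/* Claim the job: returns 0 on success, -1 if a transfer is already running.
 * A real version would assert CS and start the DMA streams here. */
static int spi_job_start(spi_job_t *job, const uint8_t *tx, uint8_t *rx, uint16_t len)
{
    if (job->busy) {
        return -1;
    }
    job->tx = tx;
    job->rx = rx;
    job->len = len;
    job->busy = 1;
    return 0;
}

/* Call from the RX DMA transfer-complete handler after releasing CS. */
static void spi_job_complete(spi_job_t *job)
{
    job->busy = 0;
    if (job->on_done) {
        job->on_done(job->user);
    }
}

/* Example callback: counts completions through the user pointer. */
static void count_done(void *user)
{
    (*(int *)user)++;
}
```

In an RTOS build the callback would typically give a semaphore instead of incrementing a counter, so the waiting task wakes up only when its own transfer has finished.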
Sources and references
- STM32F401 Reference Manual (RM0368) — SPI chapter
- STM32F40x/41x Reference Manual (RM0090) — SPI and DMA chapters
- STM32CubeF4 firmware package, HAL SPI examples at Projects/STM32F401RE-Nucleo/Examples/SPI/
- AN4031: Using the STM32F2, STM32F4 and STM32F7 Series DMA controller
- AN4488: Getting started with STM32F4xxxx MCU hardware development
