ESP32 · FreeRTOS · Debugging

ESP32 Watchdogs and Core Dumps: Turning FreeRTOS Stalls into Evidence

2026-05-25 · Davide Carrese

A watchdog reset is not a fix, and a disabled watchdog is not debugging. On ESP32 projects, the useful middle ground is to make watchdogs fail loudly enough that the next reboot carries evidence: which task stalled, which core was affected, and whether the real fault was interrupt latency, scheduler starvation, or a deadlock.

The practical problem

Many ESP32 failures initially look the same in the field: the device stops publishing data, BLE or Wi-Fi becomes unreliable, a UART protocol times out, or a motor controller misses a deadline. Then the unit reboots and the application appears healthy again. If the firmware only stores a generic reset counter, the team has learned almost nothing.

ESP-IDF gives us several mechanisms that are worth treating as part of the product architecture, not as temporary debug switches: the interrupt watchdog timer, the task watchdog timer, the panic handler, backtraces, GDB stub support, and core dumps to flash or UART. The engineering question is not “should the watchdog be enabled?” The better question is: what evidence do we preserve when the watchdog fires?

Two watchdogs, two different classes of bugs

Interrupt watchdog: long critical sections and blocked ISRs

The interrupt watchdog is aimed at situations where the system cannot service interrupts in time. Typical causes are long critical sections, code that disables interrupts around too much work, flash/cache-related delays in the wrong context, or high-priority interrupt load that prevents normal housekeeping. On a dual-core ESP32, the symptoms can be misleading if you only look at application-level logs: the task that appears stuck may be innocent, while the core it runs on is no longer scheduling normally.

When this watchdog triggers, I first look for places where firmware treats “atomic” as “do everything with interrupts masked”. Register updates and small ring-buffer pointer changes are reasonable critical sections. Formatting strings, waiting for peripheral flags, copying payloads, or calling drivers from inside critical sections are not.
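A minimal sketch of that distinction, under stated assumptions: the ring buffer, its size, and the names `ring_push`, `ring_pop`, and `format_next_sample` are hypothetical and not from any driver. The critical section covers one small struct copy and one index update; the string formatting runs afterwards on a private copy. The `#else` branch exists only so the logic can be unit-tested on a host, where there is nothing to mask.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#ifdef ESP_PLATFORM
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
static portMUX_TYPE s_ring_mux = portMUX_INITIALIZER_UNLOCKED;
#define RING_LOCK()   portENTER_CRITICAL(&s_ring_mux)
#define RING_UNLOCK() portEXIT_CRITICAL(&s_ring_mux)
#else
/* Host build for unit tests: no interrupts to mask, so the lock is a no-op. */
#define RING_LOCK()   ((void)0)
#define RING_UNLOCK() ((void)0)
#endif

#define RING_SLOTS 8u   /* hypothetical size */

typedef struct {
    uint32_t sensor_id;
    int32_t  value;
} sample_t;

static sample_t s_ring[RING_SLOTS];
static uint32_t s_head;   /* count of samples written */
static uint32_t s_tail;   /* count of samples read */

/* Critical section: one struct copy and one index update, nothing else. */
bool ring_push(sample_t s)
{
    bool ok = false;
    RING_LOCK();
    if ((uint32_t)(s_head - s_tail) < RING_SLOTS) {
        s_ring[s_head % RING_SLOTS] = s;
        s_head++;
        ok = true;
    }
    RING_UNLOCK();
    return ok;
}

bool ring_pop(sample_t *out)
{
    bool ok = false;
    RING_LOCK();
    if (s_head != s_tail) {
        *out = s_ring[s_tail % RING_SLOTS];
        s_tail++;
        ok = true;
    }
    RING_UNLOCK();
    return ok;
}

/* Formatting happens after the lock is released, on a private copy. */
int format_next_sample(char *buf, size_t len)
{
    sample_t s;
    if (!ring_pop(&s)) {
        return 0;   /* nothing queued */
    }
    return snprintf(buf, len, "sensor=%lu value=%ld",
                    (unsigned long)s.sensor_id, (long)s.value);
}
```

If the producer is an ISR, the push side would use the ISR variant of the critical-section macros (`portENTER_CRITICAL_ISR` / `portEXIT_CRITICAL_ISR`); the shape of the code stays the same.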

Task watchdog: starvation, deadlocks, and bad ownership

The task watchdog is more useful for application-level progress. In ESP-IDF it can monitor subscribed tasks or users and detect when they fail to reset the watchdog in time. That makes it useful for catching a task that is still technically running but no longer making forward progress: waiting forever for a mutex, spinning on a peripheral state bit, or being starved by an overly aggressive task at the same or higher priority.

A common mistake is to feed the task watchdog from a timer or from an unrelated supervisor task. That proves the scheduler is alive, but it does not prove the critical work is progressing. Feed it from the execution path that represents actual progress: one successful control-loop iteration, one drained queue batch, one completed protocol transaction, or one safe idle point in a state machine.

A minimal progress-based pattern

The exact API shape depends on the ESP-IDF version and project configuration, but the design rule is stable: subscribe the task that owns the critical loop and reset the watchdog only after useful work has completed. Avoid resetting it before a blocking call that may never return.

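#include <stdbool.h>
#include <stdint.h>

#include "esp_err.h"
#include "esp_task_wdt.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"

/* Application-specific storage hook, defined elsewhere in the project.
   The signature is assumed from how it is called below. */
esp_err_t app_store_sample(uint32_t sensor_id, int32_t value);

/* Must be created with xQueueCreate() before telemetry_task starts. */
static QueueHandle_t telemetry_q;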

typedef struct {
    uint32_t sensor_id;
    int32_t value;
} telemetry_msg_t;

static bool process_one_message(const telemetry_msg_t *msg)
{
    /* Keep this bounded. No unbounded retries, no printf storms,
       no waiting forever for a peripheral from this path. */
    return app_store_sample(msg->sensor_id, msg->value) == ESP_OK;
}

static void telemetry_task(void *arg)
{
    ESP_ERROR_CHECK(esp_task_wdt_add(NULL));   // subscribe current task

    for (;;) {
        telemetry_msg_t msg;

        if (xQueueReceive(telemetry_q, &msg, pdMS_TO_TICKS(500)) == pdTRUE) {
            if (process_one_message(&msg)) {
                /* Feed only after real progress. */
                ESP_ERROR_CHECK(esp_task_wdt_reset());
            }
        } else {
            /* An idle heartbeat is acceptable only if "no work" is healthy. */
            ESP_ERROR_CHECK(esp_task_wdt_reset());
        }
    }
}

This example is intentionally boring. The important part is not the queue; it is the location of the reset. If process_one_message() can block forever, the task watchdog should fire. If priority inversion or a deadlocked mutex prevents the task from running, it should fire. If the queue is empty and that is a valid state, the task can reset the watchdog on the timeout path. If an empty queue is itself a fault, do not feed there; record an error state and let the supervisor policy decide.

Core dumps make watchdogs useful after reboot

For remote products, the panic output visible on a development UART is rarely enough. ESP-IDF core dumps can be written to flash or UART and decoded later. Flash storage is usually the more useful field option: after reset, the application can upload the dump or expose it through a service command before erasing it.

Core dumps are not free. You must reserve storage, decide how much task stack data to include, protect private data if dumps leave the device, and make sure the crash collection path does not compromise boot reliability. Still, even a small dump with task snapshots and a backtrace is often the difference between guessing and fixing.
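As a starting point for those decisions, a configuration fragment might look like the following. The option names are the ones I know from ESP-IDF v5.x and may differ in other versions, so confirm them in menuconfig; the timeout values and the 64K partition size are placeholders, not recommendations.

```
# sdkconfig.defaults fragment (ESP-IDF v5.x option names; verify in menuconfig)
CONFIG_ESP_INT_WDT=y
CONFIG_ESP_INT_WDT_TIMEOUT_MS=300
CONFIG_ESP_TASK_WDT_PANIC=y
CONFIG_ESP_TASK_WDT_TIMEOUT_S=5
CONFIG_ESP_COREDUMP_ENABLE_TO_FLASH=y
CONFIG_ESP_COREDUMP_DATA_FORMAT_ELF=y
CONFIG_ESP_COREDUMP_CHECKSUM_CRC32=y

# partitions.csv entry reserving flash for the dump (size is an assumption)
# Name,   Type, SubType,  Offset, Size
coredump, data, coredump, ,       64K
```

Making the task watchdog trigger a panic is what connects the two halves: without it, a task watchdog timeout only prints a warning and no dump is written.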

Operational rule: a watchdog policy without post-mortem collection is only an availability feature. A watchdog policy with reset reason, backtrace, core dump, firmware version, build ID, and last subsystem breadcrumb is a debugging system.

What I log before feeding

I like lightweight breadcrumbs that survive until the next crash without requiring continuous logging. For example, each critical task can update a small retained structure with a state enum, last error code, monotonic counter, and the line or step identifier of the current operation. Keep it static, boring, and safe to update without allocation.
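One possible shape for such a structure, as a sketch under stated assumptions: the field layout, the step names, the magic value, and the `crumb_*` functions are all hypothetical. On target, `RTC_NOINIT_ATTR` from `esp_attr.h` places the variable in RTC memory that is not initialized at startup, so it is expected to survive a watchdog or panic reset but not a power cycle; the magic word and check value are there to reject whatever garbage a cold boot leaves behind. The check is a plain XOR, not a CRC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifdef ESP_PLATFORM
#include "esp_attr.h"        /* provides RTC_NOINIT_ATTR */
#else
#define RTC_NOINIT_ATTR      /* host build: ordinary zero-initialized storage */
#endif

#define CRUMB_MAGIC 0xC0DEC0DEu   /* hypothetical marker value */

typedef enum {
    STEP_IDLE = 0,
    STEP_UART_READ,
    STEP_PARSE,
    STEP_STORE,
    STEP_PUBLISH,
} crumb_step_t;

typedef struct {
    uint32_t magic;           /* CRUMB_MAGIC once the record has been written */
    uint32_t step;            /* crumb_step_t of the operation in flight */
    int32_t  last_error;      /* last error code seen by the owning task */
    uint32_t progress_count;  /* monotonic counter of marks since last boot */
    uint32_t check;           /* XOR of the fields above */
} breadcrumb_t;

static RTC_NOINIT_ATTR breadcrumb_t s_crumb;

static uint32_t crumb_check(const breadcrumb_t *c)
{
    return c->magic ^ c->step ^ (uint32_t)c->last_error
         ^ c->progress_count ^ 0xA5A5A5A5u;
}

static bool crumb_is_valid(void)
{
    return s_crumb.magic == CRUMB_MAGIC
        && s_crumb.check == crumb_check(&s_crumb);
}

static void crumb_reset(void)
{
    s_crumb.magic = 0u;
    s_crumb.step = (uint32_t)STEP_IDLE;
    s_crumb.last_error = 0;
    s_crumb.progress_count = 0u;
    s_crumb.check = 0u;
}

/* Call once at boot: copies out the previous run's record if it is intact,
   then clears the slot for the new run. Returns false on a cold boot. */
bool crumb_take(breadcrumb_t *out)
{
    bool valid = crumb_is_valid();
    if (valid && out != NULL) {
        *out = s_crumb;
    }
    crumb_reset();
    return valid;
}

/* Call from the owning task before each step. No allocation, no logging. */
void crumb_mark(crumb_step_t step, int32_t err)
{
    s_crumb.magic = CRUMB_MAGIC;
    s_crumb.step = (uint32_t)step;
    s_crumb.last_error = err;
    s_crumb.progress_count++;
    s_crumb.check = crumb_check(&s_crumb);
}
```

This sketch assumes a single writer per record. With several critical tasks, one record per task avoids the need for locking in the hot path.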

Do not turn this into a general logging framework inside real-time paths. The goal is a compact “last known useful state” that can be read at boot together with the reset reason. If you combine that with a decoded core dump, you can usually separate four cases quickly: CPU stuck in a critical section, task blocked on synchronization, heap/stack corruption that later caused a panic, or a driver call that violates context rules.

Practical example: a field telemetry gateway

Imagine an ESP32 gateway that reads a sensor board over UART, buffers samples locally, and publishes batches over Wi‑Fi/MQTT. The product may run for months in a cabinet, so “it rebooted once” is not enough information. I would treat the telemetry task, MQTT task, storage task, and OTA/update path as separate progress domains.

For the telemetry task, the task watchdog should be fed only after a valid frame has been parsed or after a deliberate healthy idle timeout. For the MQTT task, it should be fed after a publish attempt completes, not before waiting on network state. For storage, it should be fed after a bounded write transaction. If the unit resets, the next boot should report: reset reason, task/core involved, last progress domain, firmware build ID, and the retained breadcrumb. That turns a generic “remote device froze” report into a concrete lead: UART parser blocked, flash write exceeded budget, or network task starved the system.
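The boot report described above could be assembled along these lines. This is a sketch: the domain enum, the line format, and `format_boot_report` are hypothetical, and the host-side enum is only a stand-in so the formatter can be tested off-target (its numeric values do not match the real `esp_reset_reason_t`). On target, the reason would come from `esp_reset_reason()`, and the step and build ID from whatever retained breadcrumb and build metadata the firmware keeps.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#ifdef ESP_PLATFORM
#include "esp_system.h"   /* esp_reset_reason_t, esp_reset_reason() */
#else
/* Host stand-in: same names, values not guaranteed to match ESP-IDF. */
typedef enum {
    ESP_RST_UNKNOWN, ESP_RST_POWERON, ESP_RST_SW, ESP_RST_PANIC,
    ESP_RST_INT_WDT, ESP_RST_TASK_WDT, ESP_RST_WDT, ESP_RST_BROWNOUT
} esp_reset_reason_t;
#endif

typedef enum {
    DOMAIN_TELEMETRY,
    DOMAIN_MQTT,
    DOMAIN_STORAGE,
    DOMAIN_OTA,
    DOMAIN_COUNT
} progress_domain_t;

static const char *reset_reason_name(esp_reset_reason_t r)
{
    switch (r) {
    case ESP_RST_POWERON:  return "power-on";
    case ESP_RST_SW:       return "software";
    case ESP_RST_PANIC:    return "panic";
    case ESP_RST_INT_WDT:  return "int-wdt";
    case ESP_RST_TASK_WDT: return "task-wdt";
    case ESP_RST_WDT:      return "other-wdt";
    case ESP_RST_BROWNOUT: return "brownout";
    default:               return "other";
    }
}

static const char *domain_name(progress_domain_t d)
{
    static const char *const names[DOMAIN_COUNT] = {
        "telemetry", "mqtt", "storage", "ota"
    };
    return ((int)d >= 0 && d < DOMAIN_COUNT) ? names[d] : "unknown";
}

/* Builds one compact line suitable for upload or a service command.
   Returns the snprintf result: the length the full line needs. */
int format_boot_report(char *buf, size_t len,
                       esp_reset_reason_t reason,
                       progress_domain_t last_domain,
                       uint32_t last_step,
                       const char *build_id)
{
    return snprintf(buf, len, "reset=%s domain=%s step=%lu build=%s",
                    reset_reason_name(reason),
                    domain_name(last_domain),
                    (unsigned long)last_step,
                    build_id != NULL ? build_id : "?");
}
```

A line such as `reset=task-wdt domain=storage step=3 build=abc123` is small enough to send over MQTT on the next successful connection, and it already narrows the search before anyone decodes the core dump.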

Practical checklist

Keep the interrupt watchdog enabled during development unless you are isolating a very specific debugger interaction.
Subscribe only meaningful tasks to the task watchdog; do not feed from an unrelated heartbeat.
Reset the task watchdog after forward progress, not before a risky blocking operation.
Audit critical sections for formatting, allocation, driver calls, polling loops, and large memory copies.
Enable panic output and ensure backtraces are symbolized in CI or in the support workflow.
Configure core dumps deliberately: destination, size, integrity check, retention, and privacy policy.
Store reset reason, firmware version, build ID, and a compact breadcrumb block across reboot.
Test the failure mode intentionally: deadlock a task, hold a critical section too long, and verify the collected evidence.
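The last checklist item can be sketched as a small fault-injection hook behind a service command. Everything here is hypothetical: the command strings, the `fault_kind_t` enum, and both function names. The parser is plain C and testable on a host; the injection itself only makes sense on target, should be compiled into test builds only, and must be called from the task whose watchdog behavior is under test.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    FAULT_NONE,
    FAULT_TASK_SPIN,       /* subscribed task stops making progress */
    FAULT_LONG_CRITICAL,   /* interrupts stay masked on the current core */
    FAULT_ABORT            /* immediate panic via abort() */
} fault_kind_t;

/* Maps a service-command argument to a fault kind; unknown input is ignored. */
fault_kind_t fault_parse(const char *cmd)
{
    if (cmd == NULL)                    return FAULT_NONE;
    if (strcmp(cmd, "spin") == 0)       return FAULT_TASK_SPIN;
    if (strcmp(cmd, "critical") == 0)   return FAULT_LONG_CRITICAL;
    if (strcmp(cmd, "abort") == 0)      return FAULT_ABORT;
    return FAULT_NONE;
}

#ifdef ESP_PLATFORM
#include <stdlib.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

/* Does not return for any kind other than FAULT_NONE. */
void fault_inject(fault_kind_t kind)
{
    static portMUX_TYPE mux = portMUX_INITIALIZER_UNLOCKED;
    volatile uint32_t spin = 0;

    switch (kind) {
    case FAULT_TASK_SPIN:
        for (;;) { spin++; }          /* expected evidence: task watchdog */
    case FAULT_LONG_CRITICAL:
        portENTER_CRITICAL(&mux);
        for (;;) { spin++; }          /* expected evidence: interrupt watchdog */
    case FAULT_ABORT:
        abort();                      /* expected evidence: panic + backtrace */
    default:
        break;
    }
}
#endif
```

The test passes only if the evidence collected after reboot names the injected fault: the right reset reason, the right task in the backtrace, and a breadcrumb pointing at the step that was in flight.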

How I would approach this on a client project

First, I would map the firmware into progress domains: communications, acquisition/control, storage, update path, and safety supervision. For each domain I would define what “healthy progress” means in measurable terms. Then I would configure the ESP32 watchdogs around those definitions instead of around arbitrary timeouts copied from an example project.
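Writing those definitions down as data keeps them reviewable. The table below is a sketch with placeholder budgets and hypothetical names (`domain_t`, `domain_mark_progress`, `domain_first_stalled`); real values would come from the product's timing requirements. It does not replace each task feeding its own watchdog from its own progress path. Its purpose is to record the agreed budgets and to let a crash record name the domain that went silent first.

```c
#include <stddef.h>
#include <stdint.h>

enum { DOM_COMMS, DOM_ACQUISITION, DOM_STORAGE, DOM_UPDATE, DOM_COUNT };

typedef struct {
    const char *name;
    uint32_t max_silence_ms;    /* longest healthy gap between progress marks */
    uint32_t last_progress_ms;  /* timestamp of the most recent mark */
} domain_t;

/* Budgets are placeholders for illustration only. */
static domain_t s_domains[DOM_COUNT] = {
    [DOM_COMMS]       = { "comms",       30000u, 0u },
    [DOM_ACQUISITION] = { "acquisition",  1000u, 0u },
    [DOM_STORAGE]     = { "storage",      5000u, 0u },
    [DOM_UPDATE]      = { "update",      60000u, 0u },
};

/* Called by the owning task after one unit of real work has completed. */
void domain_mark_progress(int id, uint32_t now_ms)
{
    if (id >= 0 && id < DOM_COUNT) {
        s_domains[id].last_progress_ms = now_ms;
    }
}

/* Returns the index of the first domain over budget, or -1 if all are healthy.
   Unsigned subtraction keeps the comparison correct across timer wraparound. */
int domain_first_stalled(uint32_t now_ms)
{
    for (int i = 0; i < DOM_COUNT; i++) {
        uint32_t silent_ms = now_ms - s_domains[i].last_progress_ms;
        if (silent_ms > s_domains[i].max_silence_ms) {
            return i;
        }
    }
    return -1;
}

const char *domain_name(int id)
{
    return (id >= 0 && id < DOM_COUNT) ? s_domains[id].name : "unknown";
}
```

On target, `now_ms` could be derived from `esp_timer_get_time()` or the FreeRTOS tick count; the table itself does not care which clock is used as long as every caller uses the same one.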

Second, I would add a small crash record interface: reset reason, watchdog source when available, build metadata, selected breadcrumbs, and core dump handling. On a connected product, the next boot would upload the record with rate limiting. On an offline product, it would be retrievable over UART, USB, BLE, or a manufacturing/service command.

Finally, I would add fault-injection tests to the firmware test plan. A watchdog design is not complete until someone has intentionally caused the watchdog to fire and verified that the resulting evidence points to the injected fault. That is the step that turns “the unit rebooted” into an actionable engineering report.
