Multi-modal sensor fusion is one of those engineering problems where the academic literature and the field reality diverge early and diverge hard. The academic treatment usually assumes synchronous sensor streams, reliable time-of-arrival timestamps, and Gaussian noise models for each sensor. Field deployment on an embedded UGV or UAS means asynchronous sensor outputs, USB-coupled timestamps with millisecond-class jitter, and non-Gaussian noise from structural vibration, RF interference, and thermal drift. The fusion architecture that works in simulation often fails in integration test for reasons that have nothing to do with the fusion algorithm itself.
This is a practical architecture writeup, not a survey of fusion algorithms. We document the structural decisions that determine whether a multi-modal fusion pipeline actually works under field conditions, specifically for platforms fusing EO/IR cameras, mmWave radar, and acoustic sensors in a resource-constrained embedded context.
The time synchronization problem is not solved by GPS alone
The first and most commonly underestimated problem in multi-modal fusion is timestamp alignment. If your EO camera, radar, and IMU are publishing measurements with different time references, or with the same nominal reference but different pipeline latencies, the fusion filter is fusing observations from different points in the platform's history. At 10 m/s platform speed and a 20 ms timestamp misalignment, that corresponds to a 20 cm displacement error in the predicted target state — which is large enough to cause the covariance gating logic to reject a valid measurement as an outlier.
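The arithmetic behind that figure can be checked with a short sketch. The 5 cm measurement standard deviation is a hypothetical value chosen only for illustration; the 6.635 threshold is the 99% chi-square bound for one degree of freedom.

```python
# Displacement error caused by a timestamp misalignment, and its effect on a
# one-dimensional gate. SIGMA_M is a hypothetical measurement sigma.
CHI2_99_1DOF = 6.635  # 99% chi-square threshold, 1 degree of freedom
SIGMA_M = 0.05        # assumed position measurement standard deviation (m)


def misalignment_displacement_m(speed_mps: float, misalignment_s: float) -> float:
    """Distance the platform travels during the timestamp misalignment."""
    return speed_mps * misalignment_s


error_m = misalignment_displacement_m(10.0, 0.020)  # 10 m/s, 20 ms
nis = (error_m / SIGMA_M) ** 2                      # normalized innovation squared
rejected = nis > CHI2_99_1DOF

print(f"displacement: {error_m * 100:.0f} cm, NIS: {nis:.1f}, rejected: {rejected}")
```

Under these assumed numbers the misalignment alone places a perfectly valid measurement about four standard deviations from the prediction, which is well outside the gate.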
GPS-disciplined time via PPS (pulse-per-second) signal is the standard for outdoor platforms with GPS coverage. A hardware PPS signal from a GNSS module, injected into the kernel via GPIO and associated with the system clock via chrony or linuxptp, can achieve timestamp accuracy in the 1–10 microsecond range. That is sufficient for most fusion applications.
The problem is the pipeline latency after timestamping. A camera frame timestamped at t=0 at the sensor hardware does not arrive at the fusion node at t=0 — it arrives at t=0 plus the time required to read from the sensor interface, copy to system memory, process through the ISP if applicable, encode, and DDS-publish. That pipeline latency is typically 5–20 ms depending on the sensor interface and load. If the latency is consistent across frames, it can be characterized and subtracted as a static bias. If it varies with system load — which it does on a shared-resource SoC under inference load — the residual jitter remains as a noise source in the fusion filter.
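One way to separate the static bias from the residual jitter during integration is sketched below. It assumes you can log paired hardware capture times and fusion-node arrival times in nanoseconds on a common clock; the function name and the sample values are hypothetical, and the standard deviation is only one possible proxy for jitter.

```python
from statistics import mean, pstdev


def characterize_latency(capture_ns, arrival_ns):
    """Return (bias_ms, jitter_ms) from paired capture/arrival timestamps.

    bias_ms is the mean pipeline latency, which can be subtracted as a static
    offset. jitter_ms is the standard deviation of the latency, which remains
    as a noise source after the bias is removed.
    """
    latencies_ms = [(a - c) / 1e6 for c, a in zip(capture_ns, arrival_ns)]
    return mean(latencies_ms), pstdev(latencies_ms)


# Hypothetical log: five frames with latencies of 10, 12, 11, 15, and 12 ms.
capture_ns = [0, 33_000_000, 66_000_000, 99_000_000, 132_000_000]
arrival_ns = [10_000_000, 45_000_000, 77_000_000, 114_000_000, 144_000_000]

bias_ms, jitter_ms = characterize_latency(capture_ns, arrival_ns)
print(f"bias {bias_ms:.1f} ms, jitter {jitter_ms:.1f} ms")  # bias 12.0 ms, jitter 1.7 ms
```

Repeating the measurement with and without inference load running shows how much of the latency is load-dependent and therefore cannot be treated as a constant.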
The architecture implication: time synchronization must be implemented at the hardware capture layer, not the software publication layer. Timestamp in hardware at the sensor interface, log the measured pipeline latency during system integration, and propagate a per-message timestamp that is corrected for that measured pipeline delay. The ROS 2 std_msgs/Header message provides a stamp field; fill it with the hardware-corrected capture time, not the wall-clock time of the DDS publication call.
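A minimal sketch of the stamp correction follows, assuming the hardware timestamp and the measured delay are both available as integer nanoseconds. The function name and the example values are hypothetical. The (sec, nanosec) split matches the two fields of the builtin_interfaces/Time message used by the header stamp.

```python
from typing import Tuple

NS_PER_S = 1_000_000_000


def corrected_stamp(hw_timestamp_ns: int, measured_delay_ns: int) -> Tuple[int, int]:
    """Return (sec, nanosec) for the corrected capture time.

    hw_timestamp_ns: timestamp latched at the sensor interface.
    measured_delay_ns: delay between the physical measurement and that
        timestamp, as characterized during integration (assumed constant here).
    """
    capture_ns = hw_timestamp_ns - measured_delay_ns
    if capture_ns < 0:
        raise ValueError("corrected capture time is negative")
    # divmod keeps nanosec in the range [0, 1e9), as the message type expects.
    return divmod(capture_ns, NS_PER_S)


sec, nanosec = corrected_stamp(1_700_000_000_250_000_000, 12_000_000)
print(sec, nanosec)  # 1700000000 238000000
```

In a publisher, these two values would be written to msg.header.stamp.sec and msg.header.stamp.nanosec in place of the node clock's current time.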
Late fusion vs. early fusion: the right split for embedded platforms
The fundamental architectural choice in multi-modal fusion is where in the processing chain to combine information from different sensors. Early fusion combines raw or lightly processed sensor data before detection — for example, concatenating a radar range-Doppler image with a camera image frame as a multi-channel input to a single neural detection model. Late fusion runs per-sensor detection pipelines independently and combines the resulting detection lists or track hypotheses at the object level.
For resource-constrained embedded platforms, late fusion is almost always the correct choice, for three reasons. First, early fusion requires that all sensor inputs are spatially and temporally registered at pixel-level resolution, which is a hard calibration and latency management problem. Second, early fusion models are substantially larger than single-modal models and consume more compute at inference time. Third, and most important for field reliability, late fusion allows graceful degradation when a sensor goes offline: the fusion layer simply stops receiving input from that modality and continues with the remaining sensors, propagating uncertainty through the filter. Early fusion models cannot gracefully handle a missing input channel without architectural changes to the model.
The late fusion architecture we use is a track-level fusion approach. Each sensor runs its own detection pipeline: the camera produces a list of 2D bounding boxes with classification confidence; the radar produces a list of range-velocity-angle detections with associated SNR; the acoustic sensor produces bearing estimates and frequency signatures. A multi-target tracker (typically an Unscented Kalman Filter for state estimation, paired with a variant of Joint Probabilistic Data Association for measurement-to-track assignment) fuses these detection lists into a shared object track list, with each track maintaining an explicit record of which sensors are contributing to it and the associated covariance matrix.
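The per-track bookkeeping described above might look like the following sketch. The class name, field names, state layout, and the 0.5 s freshness window are all hypothetical; the filter update itself is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FusedTrack:
    """One fused object track with explicit per-sensor contribution records."""
    track_id: int
    state: List[float]                 # assumed layout: [x, y, vx, vy]
    covariance: List[List[float]]      # n x n state covariance
    last_update_s: Dict[str, float] = field(default_factory=dict)

    def record_update(self, sensor: str, stamp_s: float) -> None:
        """Note that `sensor` contributed a measurement at time `stamp_s`."""
        self.last_update_s[sensor] = stamp_s

    def contributing_sensors(self, now_s: float, max_age_s: float = 0.5) -> List[str]:
        """Sensors whose most recent update is no older than `max_age_s`."""
        return sorted(
            s for s, t in self.last_update_s.items() if now_s - t <= max_age_s
        )


track = FusedTrack(
    track_id=7,
    state=[12.0, 3.0, 1.5, 0.0],
    covariance=[[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)],
)
track.record_update("camera", 10.0)
track.record_update("radar", 10.4)
print(track.contributing_sensors(now_s=10.6))  # ['radar']
```

Keeping the last update time per modality, rather than a single flag, lets the operator display and the track management logic see which sensors are still supporting a track.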
Covariance gating and why it fails when you need it most
Covariance gating is the mechanism by which the fusion filter decides whether a new detection is a plausible match for an existing track (and should update it) or is an independent new detection (and should initialize a new track). The gate is typically an ellipse in measurement space, defined by the predicted measurement covariance and a threshold on the normalized innovation squared (NIS). Measurements outside the gate are rejected as inconsistent with the track hypothesis.
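The gate test can be sketched for a two-dimensional measurement as below. The innovation covariance S would normally be computed as H P Hᵀ + R from the filter; here it is supplied directly with hypothetical values. The 9.21 threshold is the 99% chi-square bound for two degrees of freedom, and choosing 99% is itself an assumption.

```python
CHI2_99_2DOF = 9.21  # 99% chi-square threshold, 2 degrees of freedom


def nis_2d(innovation, S):
    """Normalized innovation squared: v^T S^-1 v for a 2x2 covariance S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("innovation covariance must be positive definite")
    vx, vy = innovation
    # S^-1 = (1 / det) * [[d, -b], [-c, a]]
    return (vx * (d * vx - b * vy) + vy * (-c * vx + a * vy)) / det


def in_gate(innovation, S, threshold=CHI2_99_2DOF):
    """True if the measurement is statistically consistent with the track."""
    return nis_2d(innovation, S) <= threshold


# Hypothetical innovation covariance: sigma of 0.2 in one axis, 0.1 in the other.
S = [[0.04, 0.0], [0.0, 0.01]]
print(in_gate((0.2, 0.1), S))  # True: NIS is about 2
print(in_gate((0.8, 0.1), S))  # False: NIS is about 17
```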
In degraded sensor conditions — exactly the conditions when reliable fusion matters most — covariance gating tends to fail in one of two directions. Under high-clutter conditions (urban multipath, rain, electronic interference), the number of spurious detections inside the gate explodes, and the filter updates on clutter rather than the true target, causing the track estimate to diverge. Under low-signal conditions (obscured target, sensor partially blocked), the genuine target detections arrive at lower confidence and may fall outside the gate if the predicted covariance does not account for the increase in measurement noise.
The standard mitigations are known: OS-CFAR for the radar detection stage (discussed separately in our mmWave article), adaptive covariance inflation for the acoustic sensor when wind noise is elevated, and JPDA (Joint Probabilistic Data Association) rather than nearest-neighbor association when clutter density is high. The more important point is that the covariance model for each sensor must be validated under field conditions, not just in clean lab measurements. A radar modeled as zero-mean Gaussian in range and bearing will produce suboptimal fusion whenever the non-Gaussian multipath tails contribute meaningfully to the detection distribution.
Cross-modal failure detection: knowing when to distrust a sensor
A fusion architecture that treats its sensor inputs as unconditionally reliable will fail when a sensor is degraded, miscalibrated, or actively spoofed. Cross-modal consistency checking — using the other sensors to evaluate whether a given sensor's outputs are plausible — is the mechanism for detecting and isolating sensor failures.
Consider a scenario: a small UGV is operating with its EO camera partially obscured by dust. The camera detection model begins producing spurious detections from motion blur artifacts in the partially obscured frame. The radar, unaffected by the dust, continues producing accurate range-velocity detections. A fusion filter without cross-modal failure detection will accept the camera detections as valid, and the resulting false tracks will degrade the platform's situational awareness picture. A fusion filter that monitors the innovation sequence for each sensor (the difference between predicted and actual measurements over time) will detect that the camera's NIS statistic has risen above its expected distribution and automatically downweight or gate out camera detections until the consistency metric recovers.
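A per-sensor innovation monitor could be sketched as follows. For a consistent filter the mean NIS is close to the measurement dimension, so the sketch compares a windowed mean against that value. The class name, window length, ratio limit, and linear weight ramp are hypothetical heuristics, not a formal chi-square consistency test.

```python
from collections import deque


class NisMonitor:
    """Tracks a sliding-window mean of NIS for one sensor modality."""

    def __init__(self, dof: int, window: int = 20, ratio_limit: float = 2.0):
        self.dof = dof                  # measurement dimension
        self.ratio_limit = ratio_limit  # mean NIS / dof at which weight hits 0
        self._values = deque(maxlen=window)

    def update(self, nis: float) -> None:
        self._values.append(nis)

    def weight(self) -> float:
        """Return a fusion weight in [0, 1]; 1.0 until the window has filled."""
        if len(self._values) < self._values.maxlen:
            return 1.0
        ratio = (sum(self._values) / len(self._values)) / self.dof
        if ratio <= 1.0:
            return 1.0
        if ratio >= self.ratio_limit:
            return 0.0
        # Linear ramp from full weight at ratio 1 down to zero at ratio_limit.
        return (self.ratio_limit - ratio) / (self.ratio_limit - 1.0)


camera = NisMonitor(dof=2, window=5)
for _ in range(5):
    camera.update(2.0)       # consistent: mean NIS equals the dimension
healthy_weight = camera.weight()
for _ in range(5):
    camera.update(10.0)      # degraded: innovations far larger than predicted
degraded_weight = camera.weight()
print(healthy_weight, degraded_weight)  # 1.0 0.0
```

Because the window keeps sliding, the weight recovers on its own once the sensor's innovations return to their expected size.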
Implementing this requires that each sensor modality has an associated quality estimator that produces a validity flag or weight alongside its detection output. For the camera, a simple proxy is the frame-level confidence histogram: if the mean detection confidence drops below a threshold (indicating that the model is returning low-confidence outputs consistent with degraded visual input), the camera modality weight in the fusion layer should be reduced. For the radar, the analogous signal is a frame-level SNR below the CFAR threshold combined with elevated clutter density.
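The camera proxy can be sketched as a small function. The 0.5 threshold, the 0.1 floor, and the linear scaling are hypothetical choices; treating an empty frame as carrying no evidence about image quality is also an assumption of this sketch.

```python
from statistics import mean
from typing import Sequence


def camera_modality_weight(
    confidences: Sequence[float],
    threshold: float = 0.5,
    floor: float = 0.1,
) -> float:
    """Map frame-level mean detection confidence to a fusion weight.

    Returns 1.0 when the mean confidence is at or above `threshold`, and a
    linearly reduced weight (never below `floor`) when it falls under it.
    """
    if not confidences:
        # No detections says nothing about image quality in this sketch.
        return 1.0
    mean_conf = mean(confidences)
    if mean_conf >= threshold:
        return 1.0
    return max(floor, mean_conf / threshold)


print(camera_modality_weight([0.9, 0.8, 0.85]))   # 1.0
print(camera_modality_weight([0.2, 0.3, 0.25]))   # 0.5
```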
ROS 2 QoS and graceful degradation
On an embedded platform running ROS 2, two configuration decisions matter most for fusion reliability: node lifecycle management and DDS QoS profiles. ROS 2 Lifecycle Nodes make ordered startup enforceable: a supervising launch or manager process can hold the fusion node out of the Active state until all upstream sensor nodes have confirmed readiness, which prevents silent measurement drops during the startup transient.
For QoS, sensor topic profiles should use RELIABILITY=BEST_EFFORT for high-rate streams (camera frames, radar detection lists) where a dropped message is less harmful than backpressure-induced latency. The fusion output topic (the track list) should use RELIABILITY=RELIABLE because a missed track update to the autonomy layer has downstream consequences. HISTORY for sensor topics should be KEEP_LAST with a depth of 1, not the default depth of 10: processing a 300 ms backlog of stale radar detections after a processing hiccup produces track divergence that is difficult to distinguish from a real target event.
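In rclpy, those two profiles might be built as in the sketch below. The function name is hypothetical, and the depth of 10 on the track topic is an assumption, since the text specifies only its reliability setting.

```python
def make_fusion_qos():
    """Return (sensor_qos, track_qos) for the fusion node. Requires ROS 2."""
    # Imported here so this module can still be loaded on a machine without
    # a ROS 2 installation; the import runs only when the function is called.
    from rclpy.qos import HistoryPolicy, QoSProfile, ReliabilityPolicy

    # High-rate sensor streams: drop rather than queue, keep only the newest.
    sensor_qos = QoSProfile(
        reliability=ReliabilityPolicy.BEST_EFFORT,
        history=HistoryPolicy.KEEP_LAST,
        depth=1,
    )
    # Track list to the autonomy layer: delivery matters more than latency.
    track_qos = QoSProfile(
        reliability=ReliabilityPolicy.RELIABLE,
        history=HistoryPolicy.KEEP_LAST,
        depth=10,  # assumed value
    )
    return sensor_qos, track_qos
```

The returned profiles would be passed as the QoS argument of create_subscription for the sensor topics and create_publisher for the track list. Note that a BEST_EFFORT subscription is compatible with a RELIABLE publisher, but a RELIABLE subscription will not match a BEST_EFFORT publisher, so the sensor driver's publishing profile needs checking as well.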
The sensor-removal test determines whether a fusion stack is deployable. A properly designed late-fusion architecture should maintain track continuity on confirmed tracks for a configurable hold-off duration when one modality goes offline, and the quality metric in the operator display should reflect degraded mode explicitly. What fails in this test is almost always track initialization logic, not established track maintenance. Programs that require two-sensor confirmation for track initialization cannot initialize new tracks in single-sensor degraded operation. The correct architecture maintains separate single-modal and multi-modal confidence tiers with explicit labeling, rather than a binary confirmed/unconfirmed status that obscures which sensors are contributing to a given track.
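The tiering can be sketched as a small classifier over the set of currently contributing sensors. The tier names, the function signature, and the 2.0 s hold-off default are hypothetical.

```python
from enum import Enum
from typing import Sequence


class TrackTier(Enum):
    MULTI_MODAL = "multi-modal"    # two or more sensors currently contributing
    SINGLE_MODAL = "single-modal"  # exactly one sensor contributing
    COASTING = "coasting"          # no sensor contributing, inside hold-off
    DROPPED = "dropped"            # hold-off expired


def classify_track(
    contributing: Sequence[str],
    seconds_since_update: float,
    holdoff_s: float = 2.0,
) -> TrackTier:
    """Assign an explicit confidence tier instead of confirmed/unconfirmed."""
    n = len(set(contributing))
    if n >= 2:
        return TrackTier.MULTI_MODAL
    if n == 1:
        return TrackTier.SINGLE_MODAL
    if seconds_since_update <= holdoff_s:
        return TrackTier.COASTING
    return TrackTier.DROPPED


print(classify_track(["camera", "radar"], 0.05).value)  # multi-modal
print(classify_track(["radar"], 0.05).value)            # single-modal
print(classify_track([], 1.2).value)                    # coasting
print(classify_track([], 3.0).value)                    # dropped
```

Because a single-sensor track gets its own tier rather than being held as unconfirmed, new tracks can still be initialized and labeled when one modality is offline.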