Single-modal sensing is a bet that the environment will cooperate. In defense ISR, the environment does not cooperate by design. Adversarial concealment — thermal decoys, camouflage netting, debris fields, structured clutter — is not a fringe condition. It is the expected condition. The question an autonomous platform must answer is not whether its sensor is good enough on clear days. It is whether the fusion architecture can resolve ambiguities that a single sensor modality structurally cannot.

We have spent considerable time on this problem at Kestrelsense. What follows is a technical discussion of the primary sensor fusion architectures, where each fails, and what the practical design choices look like for a weight-constrained airborne platform.

The Three Sensor Modalities and Their Failure Modes

Before discussing fusion architectures, it is worth stating clearly what each modality contributes and where it fails independently.

Electro-Optical (EO) cameras produce high-resolution texture and color information in the visible spectrum. They are excellent classifiers in good lighting and clear atmospheric conditions. Their failure modes are predictable: low-light conditions, direct-sun glare, specular reflection, low sun angles and flat lighting at dawn and dusk, and any surface treatment that reduces visible-spectrum contrast. A vehicle under a camo net with leaf-scattering texture is difficult to distinguish from surrounding vegetation in EO alone.

Long-Wave Infrared (LWIR) sensors detect emitted thermal radiation in the 8–12 micron band. They operate in complete darkness and see through many concealment approaches that defeat EO. Their failure modes are different but equally real: thermal crossover (where object and background reach the same temperature, making them thermally invisible), thermal wash at high ambient temperatures, and limited texture information that makes fine-grained classification difficult. A vehicle that has been stationary long enough to temperature-equalize with its surroundings disappears from LWIR.

LIDAR provides 3D point-cloud geometry independent of both lighting and thermal conditions. It resolves shape ambiguities that confuse both EO and LWIR. Its failure modes include limited range at high operating altitudes, susceptibility to atmospheric obscurants (fog, smoke, precipitation), and the computational cost of processing point clouds at the frame rates required for tracking moving objects.

The critical observation is that these failure modes are largely non-overlapping. An object that defeats LWIR by thermal equalization is still visible in LIDAR geometry. An object hidden by smoke or dust that defeats LIDAR is still detectable in LWIR. A well-designed fusion architecture exploits this complementarity systematically: not by running multiple classifiers independently and voting, but by integrating the physical measurements at the feature level so each modality's evidence informs the others.

Early Fusion vs. Late Fusion vs. Middle Fusion

Three broad architectural approaches appear in the literature and in deployed systems.

Late fusion (sometimes called decision-level fusion) runs independent classifiers on each sensor stream, produces independent classification outputs, and combines the outputs through a voting or confidence-weighting scheme. This is the simplest architecture to implement and the easiest to validate independently, since each classifier can be tested against its own single-modal dataset. Its weakness is that it does not allow modalities to resolve each other's ambiguities — each classifier is still operating on its own limited information, and a confident but wrong answer from one modality can dominate the fusion output.
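The weakness described above can be seen in a minimal sketch of confidence-weighted late fusion. The function name, labels, and confidence values here are hypothetical and chosen only for illustration; they are not outputs from any real classifier.

```python
from collections import defaultdict


def late_fusion(predictions):
    """Sum per-modality confidences by label and return the winner.

    `predictions` maps a modality name to a (label, confidence) pair.
    Returns the winning label and its share of the total confidence.
    """
    scores = defaultdict(float)
    for label, confidence in predictions.values():
        scores[label] += confidence
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all modalities reported zero confidence")
    best = max(scores, key=scores.get)
    return best, scores[best] / total


# Hypothetical case: LWIR is confidently fooled by a thermal decoy,
# while EO and LIDAR each lean weakly toward "clutter".
outputs = {
    "eo": ("clutter", 0.40),
    "lwir": ("vehicle", 0.95),
    "lidar": ("clutter", 0.45),
}
label, share = late_fusion(outputs)
```

Two of the three modalities vote "clutter", yet the single high-confidence LWIR output carries the fused decision, because no modality ever sees the evidence behind the others' scores.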

Early fusion concatenates raw or minimally processed sensor inputs into a single multi-channel tensor and runs a single neural network across all channels simultaneously. This gives the network maximum information but creates substantial engineering complexity: the sensors must be spatially registered and temporally synchronized with high precision, and the network must learn to handle missing channels when a sensor is degraded or offline. On airborne platforms, the alignment and synchronization requirements are demanding — a misregistration error of even 2 pixels across the EO and LWIR channels at 400m range translates to a 1.2-meter spatial error in the fused output.
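The arithmetic behind the misregistration figure can be sketched with the small-angle approximation. The 1.5 mrad instantaneous field of view used below is an assumption inferred from the numbers in the paragraph (1.2 m over 2 pixels at 400 m), not a stated sensor specification.

```python
def ground_error_m(pixel_offset, ifov_rad, range_m):
    """Small-angle approximation: one pixel spans ifov_rad * range_m at the target."""
    return pixel_offset * ifov_rad * range_m


# Assumption: 1.5 mrad per pixel, back-computed from the figures in the text.
ASSUMED_IFOV_RAD = 1.5e-3

error_m = ground_error_m(pixel_offset=2, ifov_rad=ASSUMED_IFOV_RAD, range_m=400.0)
```

With a finer per-pixel field of view the same 2-pixel offset costs proportionally less, so the registration tolerance has to be budgeted against the coarsest channel being fused.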

Middle fusion (cross-modal attention, feature-level fusion) runs independent backbone encoders on each sensor stream to extract per-modality feature maps, then applies a learned attention mechanism that allows features from one modality to attend to spatially corresponding features from the others. The network learns which modality to trust in which regions of the scene based on content — if the EO backbone produces low-confidence features in a shadow region, the attention weights shift toward the LWIR features for that region automatically.

Middle fusion is harder to implement and train than late fusion, and harder to validate than early fusion. But it handles partial sensor failure gracefully — if one modality is degraded, the attention mechanism down-weights its contribution without requiring explicit failure detection logic. In our experience, it produces substantially better classification confidence in the scene conditions that defense ISR actually encounters.
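To make the mechanism concrete, here is a toy sketch of scaled dot-product attention at a single spatial location, where a query from one stream is scored against keys from all three. The feature vectors are hand-picked stand-ins for backbone outputs; in a real network the queries, keys, and values come from learned projections, and none of this reflects the actual KS-100 model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, keys, values):
    """Attend from one query vector over per-modality keys and values.

    Returns the attention weights and the weighted sum of the values.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    fused = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, fused


# Hypothetical features for one shadowed cell, in the order EO, LWIR, LIDAR.
# The EO feature is weak because the backbone sees little contrast there.
query = [1.0, 0.0, 1.0, 0.0]
features = [
    [0.1, 0.0, 0.1, 0.0],  # EO: low-energy response in shadow
    [2.0, 0.0, 2.0, 0.0],  # LWIR: strong thermal signature
    [1.0, 0.0, 0.5, 0.0],  # LIDAR: partial geometric return
]
weights, fused = cross_attend(query, features, features)
```

The weights always sum to one, and in this toy case most of the weight lands on the LWIR feature without any explicit rule saying that EO is degraded.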

Cross-Modal Attention on a Power-Constrained NPU

The practical challenge of middle fusion on a weight- and power-constrained platform is the compute cost of the cross-modal attention operations. Standard transformer attention has quadratic complexity in the sequence length. On a high-resolution sensor stream, that is a real problem for an 8W NPU budget.

We addressed this through a combination of spatial downsampling in the attention module and quantized attention weights. The backbone encoders run at full spatial resolution; the cross-modal attention module operates on 8x downsampled feature maps from each stream. Downsampling by 8x along each axis cuts the number of tokens entering attention by 64x, and because dense attention is quadratic in token count, the pairwise attention computation shrinks by up to 64² (about 4,096x) relative to full-resolution attention, while preserving the spatial correspondence needed for effective cross-modal resolution. The final output is upsampled back to detection resolution before the classification head.
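The scaling can be checked with a short calculation. The 512 x 640 feature-map size below is a hypothetical example rather than a KS-100 parameter, and the count covers only the dense pairwise score matrix between two streams. How much of that appears as end-to-end savings depends on the rest of the module.

```python
def attention_cost(height, width, downsample):
    """Return (tokens, pairwise_scores) for dense attention on a feature map."""
    tokens = (height // downsample) * (width // downsample)
    return tokens, tokens * tokens


# Hypothetical feature-map size, used only to illustrate the ratios.
full_tokens, full_pairs = attention_cost(512, 640, downsample=1)
small_tokens, small_pairs = attention_cost(512, 640, downsample=8)

token_ratio = full_tokens // small_tokens
pair_ratio = full_pairs // small_pairs
```

Here `token_ratio` comes out at 64 and `pair_ratio` at 4096: the token count falls linearly with the number of removed cells, while the score matrix falls with its square.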

The quantized attention weights — running at INT8 rather than FP32 — introduce a measurable but small accuracy penalty. In our validation data, the INT8 cross-modal attention model achieves 87.3% precision at 0.75 IoU threshold on our test set, versus 89.1% for the FP32 reference. That 1.8 percentage point gap is acceptable for deployment when the power and compute savings make the module physically possible on the target platform.
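For readers unfamiliar with where the INT8 penalty comes from, the following is a generic sketch of symmetric per-tensor quantization. It is a textbook scheme shown under assumed values, not a description of the quantizer actually used in the module.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization onto the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Hypothetical weight values, for illustration only.
weights = [0.82, -0.31, 0.05, -1.27, 0.44]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value is rounded to the nearest of 255 levels, so the worst-case rounding error is half a quantization step. The accumulated effect of those small errors across a layer is what shows up as a precision gap in validation.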

Sensor Registration and Temporal Synchronization

Any multi-modal fusion architecture requires that the sensor streams are registered in both space and time. For an airborne platform in forward motion, temporal synchronization is non-trivial: at 60 knots and 400 feet AGL, a 16ms timing offset between the EO and LIDAR capture windows produces a 0.5-meter parallax error in the fused point cloud. At 100ms — the asynchronous frame latency you get if you simply read sensor outputs as they arrive — the error is 3 meters. That is enough to miss a target or generate a false track.
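The figures above follow directly from ground speed multiplied by timing offset, as this short check shows. The only constant involved is the definition of the knot (1852 m per hour); altitude does not enter this first-order estimate.

```python
KNOTS_TO_MPS = 1852.0 / 3600.0  # one nautical mile (1852 m) per hour


def displacement_m(speed_knots, offset_s):
    """Distance the platform travels during a timing offset between sensors."""
    return speed_knots * KNOTS_TO_MPS * offset_s


err_16ms = displacement_m(60, 0.016)
err_100ms = displacement_m(60, 0.100)
```

At 60 knots the platform covers about 30.9 m every second, so 16 ms and 100 ms offsets give roughly 0.49 m and 3.09 m, consistent with the rounded values quoted.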

The KS-100 handles synchronization through a hardware trigger line that fires the EO, LWIR, and LIDAR sensors simultaneously at the start of each inference frame. Trigger-to-capture jitter is less than 0.5ms across all three modalities. The spatial registration is handled by a factory calibration procedure that records the physical offset and angular misalignment between sensor mounting positions; these calibration matrices are stored in the module's secure flash partition and applied in the pre-processing stage before features are extracted.
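As an illustration of what applying a calibration matrix involves, the sketch below moves a LIDAR point into a camera frame with a rigid transform and projects it with a pinhole model. The rotation, translation, and intrinsics are made-up values; the actual calibration format stored on the module is not described here.

```python
def transform_point(rotation, translation, point):
    """Apply a rigid transform (3x3 rotation, then translation) to a 3D point."""
    return [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]


def project_pinhole(point_cam, fx, fy, cx, cy):
    """Project a camera-frame point (z forward) to pixel coordinates.

    Returns None for points at or behind the image plane.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical extrinsics: sensors aligned, camera 5 cm to one side of the LIDAR.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.05, 0.0, 0.0]

point_lidar = [1.0, -0.5, 50.0]
point_cam = transform_point(R, t, point_lidar)
pixel = project_pinhole(point_cam, fx=1000.0, fy=1000.0, cx=320.0, cy=256.0)
```

With these assumed values the point lands near pixel (341, 246). An error in `R` or `t` shifts every projected point, which is why the registration error is budgeted in pixels.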

Integrators sometimes ask whether the calibration remains valid after the payload is installed on the aircraft. Our thermal cycling validation shows that the registration error increases by less than 0.3 pixels across the full -40°C to +85°C operating range — within the tolerance budget for our detection resolution. We do recommend a factory re-calibration after any payload bay modification that changes the sensor mounting geometry.

Point Cloud Pre-Processing for Neural Fusion

Raw LIDAR point clouds require pre-processing before they can be fed into a 2D feature extraction backbone alongside EO and LWIR imagery. The most common approach is projection — converting the 3D point cloud into a range image (each pixel contains the range to the nearest surface at that bearing) that is spatially registered to the camera frame.
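A minimal sketch of the projection approach follows. It bins points by azimuth and elevation in the sensor's own frame and keeps the nearest return per pixel. The axis convention (x forward, y left, z up), image size, and fields of view are assumptions, and registration to a camera frame is left out.

```python
import math


def to_range_image(points, width, height, h_fov_rad, v_fov_rad):
    """Project 3D points into a range image, keeping the nearest range per pixel.

    Pixels with no return hold math.inf.
    """
    image = [[math.inf] * width for _ in range(height)]
    for x, y, z in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue
        azimuth = math.atan2(y, x)
        elevation = math.asin(z / rng)
        col = math.floor((azimuth / h_fov_rad + 0.5) * width)
        row = math.floor((0.5 - elevation / v_fov_rad) * height)
        if 0 <= col < width and 0 <= row < height:
            image[row][col] = min(image[row][col], rng)
    return image


# Two returns on the same bearing and one far outside the vertical field of view.
points = [(10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (10.0, 0.0, 100.0)]
image = to_range_image(points, width=4, height=4,
                       h_fov_rad=math.pi / 2, v_fov_rad=math.pi / 2)
```

Only the 10 m return survives at the center pixel. Everything behind the first surface on a bearing is discarded, which is the structural information that a range image gives up.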

We use a voxelized point cloud representation rather than a range image for the LIDAR backbone input. Voxelization preserves some 3D structure — particularly useful for height estimation of targets behind occlusions — while producing a fixed-size tensor that feeds efficiently into the backbone encoder. The voxel grid is set at 0.25m resolution at the nominal detection range; this provides adequate shape discrimination for vehicle-class objects while keeping the feature tensor within the NPU's on-chip SRAM budget.
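Voxelization itself can be sketched in a few lines. This version counts points per cell in a fixed-size grid at the 0.25 m resolution mentioned above. The grid extent, origin, and point-count encoding are assumptions, since the feature encoding used for the backbone input is not specified.

```python
import math


def voxelize(points, origin, voxel_size, grid_shape):
    """Count points falling into each cell of a fixed-size 3D grid.

    Points outside the grid are dropped, so the output shape never changes.
    """
    nx, ny, nz = grid_shape
    grid = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for point in points:
        i, j, k = (
            math.floor((point[axis] - origin[axis]) / voxel_size)
            for axis in range(3)
        )
        if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
            grid[i][j][k] += 1
    return grid


# Hypothetical points: two share a voxel, one sits in the next cell, one is out of range.
points = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.3, 0.1, 0.1), (9.0, 9.0, 9.0)]
grid = voxelize(points, origin=(0.0, 0.0, 0.0), voxel_size=0.25, grid_shape=(4, 4, 4))
```

Because the grid shape is fixed regardless of how many points arrive, the memory footprint is known in advance, which is what makes an on-chip SRAM budget checkable at design time.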

What the Architecture Buys You Operationally

It is worth being direct about what multi-modal fusion actually delivers in operational terms, because the claims in vendor literature are often imprecise. What we have validated in our own testing and field evaluations:

  • False-positive rate on cluttered backgrounds reduced by approximately 34% compared to single-modal EO classification at matched precision — this is our observed result, on our test dataset, under our evaluation conditions.
  • Detection maintained on thermally equilibrated targets (LWIR crossover conditions) using LIDAR geometry as primary evidence — a condition where a single-modal LWIR classifier effectively fails.
  • Detection maintained at night and through light obscurant conditions where EO is degraded, using LWIR as primary evidence with LIDAR providing shape confirmation.
  • Graceful degradation when one sensor modality is disabled: the fusion architecture continues operating on the remaining modalities with reduced but non-zero classification confidence, rather than failing completely.

What fusion does not buy you: it does not compensate for poor sensor quality, mis-calibrated registration, or adversarial techniques specifically designed to defeat multi-modal systems. Fusion is a force multiplier for good sensors, not a substitute for them.

The sensor fusion architecture is one of the most consequential design choices in an autonomous ISR platform. Getting it right requires careful attention to the physics of each modality's failure modes, the computational constraints of edge deployment, and the precise synchronization and calibration requirements that turn three independent sensor streams into a single coherent picture. We have built the KS-100 around these requirements from first principles — and in our experience, that is the only approach that produces reliable results in the environments where it actually matters.