Forty milliseconds sounds like a long time until you are designing for it. A fixed-wing UAS at 80 knots covers roughly 41 meters every second. At 40 ms from sensor trigger to classification output, the platform has already moved about 1.6 meters. If your model is running at 120 ms (a latency most edge inference papers classify as "real-time"), you have moved nearly 5 meters before you have an answer. For surveillance and detection tasks that depend on sub-pixel alignment between frames, that difference is not academic.
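The displacement arithmetic is easy to check. The following is a minimal sketch, assuming straight and level flight and using the exact definition of a knot (1852 m per hour); the function name is ours, for illustration only.

```python
KNOT_MPS = 1852.0 / 3600.0  # one knot is exactly 1852 meters per hour


def displacement_m(speed_knots: float, latency_ms: float) -> float:
    """Distance travelled (meters) while a result is still in the pipeline."""
    return speed_knots * KNOT_MPS * (latency_ms / 1000.0)
```

With these definitions, displacement_m(80, 40) comes out near 1.65 m and displacement_m(80, 120) near 4.94 m.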
This piece documents the latency budget constraints we work within and the specific architectural choices that result from them. It is not a benchmark comparison post. We are not claiming our numbers are faster than everything else. We are explaining the reasoning chain from mission requirement to silicon choice to software stack, because that chain is rarely written down in one place.
The latency budget starts with the mission, not the hardware
For a UAS payload doing wide-area surveillance, the relevant latency is not inference latency in isolation. It is the end-to-end pipeline: sensor trigger → raw capture → preprocessing → inference → structured output to mission interface. Each stage consumes time and each stage has a minimum floor set by physics or silicon.
A typical EO sensor at 30 fps has a 33 ms frame period. If you want to run inference on every frame without dropping, your entire pipeline must fit in that window. At 60 fps the window is 16.7 ms, which demands a fundamentally different hardware choice. The decision to target 30 fps vs. 60 fps vs. event-triggered inference (running only when a motion pre-filter flags activity) is therefore the first architectural decision, and it precedes hardware selection entirely.
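The frame-period constraint can be written down as a short check. This is a sketch under one stated assumption: the pipeline is not overlapped across frames, so one frame must finish before the next arrives. The function names and the 20 ms example figure are hypothetical.

```python
def frame_period_ms(fps: float) -> float:
    """Time between consecutive frames, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def fits_every_frame(pipeline_ms: float, fps: float) -> bool:
    """True if a non-overlapped pipeline finishes inside one frame period."""
    return pipeline_ms <= frame_period_ms(fps)
```

A hypothetical 20 ms pipeline passes fits_every_frame(20, 30) and fails fits_every_frame(20, 60), which is why the frame-rate decision has to come before the hardware decision.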
For ground moving target indicator (GMTI) tasks at medium altitude, 30 fps is typically sufficient. For close-in target acquisition, 60 fps may be required for track continuity. Event-triggered inference can reduce effective latency requirements by an order of magnitude because you are only classifying frames that already passed a lightweight change-detection threshold — but it adds a pre-filter pipeline that has its own latency and false-negative risk.
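To make the event-triggered idea concrete, here is a minimal sketch of a change-detection gate based on plain frame differencing over flattened grayscale pixel values. It is an illustration, not a description of any fielded pre-filter: the function names and both thresholds (pixel_delta, min_fraction) are hypothetical, and a real pre-filter would run on dedicated hardware rather than in pure Python.

```python
from typing import Sequence


def changed_fraction(prev: Sequence[int], curr: Sequence[int],
                     pixel_delta: int = 12) -> float:
    """Fraction of pixels whose intensity moved by more than pixel_delta."""
    if len(prev) != len(curr) or len(curr) == 0:
        raise ValueError("frames must be non-empty and the same size")
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_delta)
    return changed / len(curr)


def should_run_inference(prev: Sequence[int], curr: Sequence[int],
                         pixel_delta: int = 12,
                         min_fraction: float = 0.01) -> bool:
    """Gate: only pass the frame to the detector if enough pixels changed."""
    return changed_fraction(prev, curr, pixel_delta) >= min_fraction
```

The false-negative risk is visible in the thresholds themselves: a target that changes fewer than min_fraction of the pixels never reaches the classifier.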
Where the milliseconds actually go
On a Jetson Orin NX with a YOLOv8-small model compiled via TensorRT at INT8, our measured pipeline breakdown for a 640×480 EO frame looks roughly like this:
- Frame capture DMA transfer from sensor interface: 1.2–2.1 ms (varies with PCIe/CSI contention)
- Preprocessing (resize, normalize, HWC→CHW reorder): 0.8–1.4 ms on the VIC/ISP, offloaded from CPU
- TensorRT inference on DLA core 0: 8.4–11.2 ms at INT8
- Post-processing (NMS, confidence threshold, struct pack): 1.1–1.8 ms on CPU
- DDS publish to ROS 2 topic: 0.4–0.7 ms
Total wall-clock: roughly 12–17 ms under normal conditions. That sounds well inside a 40 ms budget — and it is, until you factor in task preemption, thermal throttling, and the second inference stream running in parallel for IR. Sustained dual-modal operation on a single Orin NX without priority pinning can push the effective latency to 28–35 ms under load. That is still within budget, but the margin is thin enough that a single scheduling anomaly can produce a dropped frame.
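The stage figures above can be summed mechanically to sanity-check the total and the remaining headroom. This is a sketch only: the dictionary keys are our own labels, and adding best cases and worst cases gives bounds on the total, not a latency distribution.

```python
from typing import Dict, Tuple

# (best case, worst case) per stage in milliseconds, from the breakdown above.
STAGES_MS: Dict[str, Tuple[float, float]] = {
    "capture_dma": (1.2, 2.1),
    "preprocess": (0.8, 1.4),
    "inference_dla": (8.4, 11.2),
    "postprocess": (1.1, 1.8),
    "dds_publish": (0.4, 0.7),
}


def total_range_ms(stages: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Sum the best-case and worst-case stage times separately."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def headroom_ms(budget_ms: float, worst_case_ms: float) -> float:
    """Margin left in the budget; negative means the budget is blown."""
    return budget_ms - worst_case_ms
```

The sums come to about 11.9 ms and 17.2 ms. Against a 40 ms budget that leaves roughly 23 ms of headroom for a single stream, but only about 5 ms at the 35 ms dual-modal worst case.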
DLA cores vs. GPU cores: the SWAP tradeoff
The Jetson Orin family ships with two Deep Learning Accelerator (DLA) cores alongside the GPU. DLA is a fixed-function engine optimized for throughput on batch convolution workloads at lower power than the iGPU. On the AGX variant, each DLA core can sustain around 32 INT8 TOPS at roughly 2–3 W, versus a headline peak of ~275 sparse INT8 TOPS for the full module (GPU and both DLA cores combined) at a substantially higher power draw.
The important nuance is that DLA has limited operator support. Attention mechanisms, dynamic shapes, and several custom activation functions do not compile to DLA. Any layer that fails to map falls back to the GPU, and the fallback incurs a cross-engine data transfer that adds latency. When evaluating a model for DLA deployment, the first task is always to profile which layers fall back and whether restructuring the model eliminates them. A model with 98% DLA coverage but a single attention layer in the neck will often run slower on DLA than a GPU-only deployment because of that one sync point.
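The cost of a single fallback layer is easier to see with a toy model. The sketch below counts engine switches in a per-layer placement list and adds a fixed penalty per switch. Everything in it is an assumption for illustration: the function names, the per-layer times, and especially switch_cost_ms, which has to be measured on the target rather than guessed.

```python
from typing import Sequence


def count_engine_switches(placement: Sequence[str]) -> int:
    """Number of points where consecutive layers run on different engines."""
    return sum(1 for a, b in zip(placement, placement[1:]) if a != b)


def estimated_latency_ms(layer_ms: Sequence[float],
                         placement: Sequence[str],
                         switch_cost_ms: float) -> float:
    """Per-layer compute time plus a fixed penalty for every engine switch."""
    if len(layer_ms) != len(placement):
        raise ValueError("need one placement entry per layer")
    return sum(layer_ms) + count_engine_switches(placement) * switch_cost_ms


# Hypothetical 50-layer model: 49 layers on DLA (98% coverage), one on GPU.
placement = ["DLA"] * 30 + ["GPU"] + ["DLA"] * 19
```

One fallback layer in the middle of the network produces two switches, DLA to GPU and back again, so the penalty is paid twice per frame.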
For architectures like MobileNetV3 or EfficientDet-D0 that have clean DLA coverage, the power-per-inference advantage is significant. We target DLA-first for background persistent surveillance tasks and GPU for burst-mode classification when a cue has been detected.
The Hailo-8 alternative
Hailo's architecture is purpose-built for edge inference in a way that the Jetson family is not. The Hailo-8 at 26 TOPS consumes roughly 2.5–3.5 W in sustained inference, and it runs as a PCIe M.2 add-in that is hardware-agnostic at the system level. On a platform where the host SBC is already determined by the airframe supplier, adding a Hailo-8 M.2 is sometimes the only path to meeting the inference latency budget without replacing the compute module.
The constraint is the compiler. Hailo's Dataflow Compiler requires models to be statically sized and exported in ONNX before compilation. The resulting binary is tightly coupled to that specific model topology — you cannot change input resolution or swap a backbone at runtime without recompiling. For programs with fixed sensor configurations and stable model versions, this is fine. For programs where the model is expected to be updated in the field, the update workflow needs to accommodate re-compilation as part of the payload update process.
We have run YOLOv8n on the Hailo-8 at 480p INT8 with 4–6 ms inference latency at 2.8 W sustained. That is genuinely impressive for the power number. The caveat is that the 4–6 ms figure assumes a batch size of 1 and no concurrent workloads on the Hailo fabric; adding a second model stream reduces that advantage measurably.
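One way to put the power number in context is energy per inference, which is simply power multiplied by latency. The sketch below is a rough estimate only: it assumes the accelerator draws its sustained power for the full inference time at batch size 1 and ignores host-side power. The function name is ours.

```python
def energy_per_inference_mj(power_w: float, latency_ms: float) -> float:
    """Energy in millijoules: watts times milliseconds."""
    return power_w * latency_ms
```

At 2.8 W, the 4–6 ms range works out to roughly 11–17 mJ per frame under those assumptions.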
When FPGA is the right answer
Consider a scenario where a small UAS has a 5 W total compute power budget allocated from the airframe, the mission requires inference at 60 fps, and the model topology is fixed for the program lifecycle. This is the scenario where Xilinx Zynq UltraScale+ with a custom accelerator IP block is not overkill — it is often the only viable path.
FPGAs for inference are not a general-purpose solution. They require a hardware implementation effort that is orders of magnitude larger than deploying a TensorRT model. But they offer deterministic latency, no OS scheduling jitter, and power consumption that can be tailored to the exact computation being performed. A fixed-topology INT8 accelerator on a ZU5EV can run at under 3 W for the inference core while achieving sub-5 ms latency on compact detection networks. The NRE cost is significant; the runtime power profile is not.
We are not saying FPGA is the right answer for most programs. For a 12-month development cycle with a changing model pipeline, it is almost certainly not. For a mature program with a locked model, a hard power budget, and deterministic latency requirements, it deserves a serious evaluation that most teams skip because the FPGA path is harder to prototype.
Jitter is the latency problem nobody talks about
Mean inference latency is straightforward to measure. Jitter — the variance in latency across frames — is harder to characterize and more dangerous in practice. An autonomy stack that assumes 15 ms inference latency and gets a 55 ms spike on the 200th frame, because a garbage collector ran or a thermal event throttled the DVFS state, can produce a tracking discontinuity that the mission system interprets as a lost contact. In a fire control application that would be catastrophic; in an ISR application it produces a false drop that degrades track quality.
The mitigation is not simply "buy faster hardware." It involves pinning the inference thread to an isolated CPU core, disabling DVFS governor scaling during mission windows, pre-allocating tensor buffers at startup rather than during inference, and ensuring that the DDS QoS profile for the output topic is configured with DEADLINE and LIVELINESS policies that cause the receiving node to generate a warning event — not a silent miss — when a frame is late. None of this appears in standard edge AI deployment guides because those guides are written for commercial applications where a dropped frame means a slightly worse user experience, not a track loss event.
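Two of those mitigations, core pinning and startup-time buffer allocation, can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not a production recipe: os.sched_setaffinity is Linux-only, the core is assumed to have been isolated separately (for example with the isolcpus kernel boot parameter), and the BufferPool class and its sizes are hypothetical.

```python
import os


def pin_to_core(core: int) -> set:
    """Restrict the caller to one CPU core and return the resulting affinity.

    Returns an empty set on platforms without os.sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()
    if core not in os.sched_getaffinity(0):
        raise ValueError(f"core {core} is not available to this process")
    os.sched_setaffinity(0, {core})  # pid 0 means the caller
    return set(os.sched_getaffinity(0))


class BufferPool:
    """Fixed set of buffers allocated once at startup and reused per frame."""

    def __init__(self, count: int, size_bytes: int) -> None:
        if count < 1:
            raise ValueError("need at least one buffer")
        self._buffers = [bytearray(size_bytes) for _ in range(count)]
        self._next = 0

    def acquire(self) -> bytearray:
        # Round-robin reuse: nothing is allocated on the per-frame path.
        buf = self._buffers[self._next]
        self._next = (self._next + 1) % len(self._buffers)
        return buf
```

A typical startup sequence would call pin_to_core once on the inference thread and build the pool (for instance BufferPool(2, 640 * 480 * 3) for double-buffered RGB frames) before the first frame arrives.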
The latency numbers that matter for UAS payloads are the 99th-percentile numbers under thermal load, not the median numbers in a lab with active cooling. Design to the tail, not the mean.
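The gap between median and tail is simple to compute from logged per-frame latencies. Below is a minimal sketch using the nearest-rank percentile definition. The function name and the synthetic sample set (985 frames at 15 ms, 15 frames at 55 ms) are hypothetical, chosen to echo the spike example above.

```python
import math
import statistics
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[rank - 1]


# Synthetic log: 1.5% of frames hit a 55 ms spike.
latencies_ms = [15.0] * 985 + [55.0] * 15
median_ms = statistics.median(latencies_ms)
p99_ms = percentile_nearest_rank(latencies_ms, 99)
```

For this synthetic log the median is 15 ms while the 99th percentile is 55 ms. A report that quotes only the median would hide every one of those spikes.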