2025-12-18 Edge AI Hardware SWaP

SWaP Tradeoffs in Edge AI Hardware Selection

Jetson Orin, Hailo-8, Myriad X, custom FPGA: every edge compute platform occupies a different point on the latency / power / cost curve. We document the tradeoff space for mission-critical ISR payloads.

[Figure: Hardware comparison diagram for edge AI compute platforms]

By Kestrelsense Engineering — December 18, 2025 — 9 min read

Every edge AI hardware selection for a defense payload involves the same fundamental tradeoff surface: TOPS per watt, memory bandwidth, deterministic latency, supply chain, and integration complexity. No platform wins on all five dimensions. The selection that is right for a Group 2 UAS with a 5 W total compute budget is wrong for a Group 4 UAS with a 30 W allocation; the selection that is right for a low-volume program prototyping a capability is wrong for a production run with a 7-year logistics tail.

This article documents our working framework for navigating that tradeoff surface. We focus on the three platform families we work with most directly — NVIDIA Jetson Orin, Hailo-8, and Xilinx Zynq UltraScale+ FPGA — because those represent meaningfully different architectural choices rather than incremental variations on a theme. We are not covering every edge compute option on the market. We are explaining the logic that makes each of these platforms the right answer for a specific subset of requirements.

Framing the SWaP-C problem

SWaP-C — Size, Weight, Power, and Cost — is the standard framework for evaluating payload components in defense systems. For edge compute specifically, the relevant SWaP-C dimensions interact in non-obvious ways. A platform that runs at 5 W nominal may run at 8 W under sustained inference load on a hot day. A platform that costs $200 in prototype quantities may have a $2,000 fully-provisioned unit cost when you add the carrier board, thermal management hardware, and the vibration isolation mount required to meet MIL-STD-810 qualification. The advertised TOPS number may be achievable only at a specific precision level (INT4 vs. INT8 vs. FP16) that is incompatible with your model accuracy requirements.

The first step is always to define your mission-specific requirements with actual numbers rather than general direction: What is the maximum sustained power draw from the airframe's compute allocation? What is the maximum physical volume? What is the worst-case ambient temperature the board will see in enclosure? What inference latency is required at the 99th percentile under load? What is the expected program lifecycle duration, and what is the supply chain risk tolerance? These are not questions you can defer until after hardware selection — they are the selection criteria.
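One way to keep those questions from drifting back into generalities is to record each requirement as a typed field that stays unset until someone supplies a number. The sketch below is illustrative only: the `MissionRequirements` class, its field names, and the example values are hypothetical, not taken from any program's actual requirements set.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class MissionRequirements:
    """Hypothetical record of the selection criteria; None means 'not yet defined'."""
    max_sustained_power_w: Optional[float] = None
    max_volume_cm3: Optional[float] = None
    worst_case_ambient_c: Optional[float] = None
    p99_latency_ms: Optional[float] = None
    lifecycle_years: Optional[float] = None
    supply_chain_risk: Optional[str] = None  # e.g. "low", "medium", "high"

    def undefined(self) -> List[str]:
        """Return the names of requirements that still lack a value."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def ready_for_selection(self) -> bool:
        """True only once every requirement has a concrete value."""
        return not self.undefined()


# Example values are placeholders, not recommendations.
draft = MissionRequirements(max_sustained_power_w=15.0, p99_latency_ms=50.0)
print(draft.undefined())
print(draft.ready_for_selection())
```

A record like this makes the gap visible: hardware comparison starts only when `ready_for_selection()` returns true.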

NVIDIA Jetson Orin: the flexible but power-hungry option

The Jetson Orin family spans from the Orin Nano (20 or 40 sparse INT8 TOPS depending on memory variant, 7–15 W) to the AGX Orin (275 sparse TOPS, up to 60 W TDP). The flexibility is its primary advantage: you can run TensorRT, CUDA, DeepStream, ROS 2, and nearly any GPU-accelerated workload with little or no modification from your desktop development environment. The toolchain is mature, the documentation is comprehensive, and the developer ecosystem is the largest in the embedded AI market. If you need to prototype quickly, test multiple model architectures, and iterate on your pipeline, nothing else comes close.

The Orin NX module is the most common selection for embedded ISR payloads we evaluate. At 10–25 W depending on variant and power mode, with a launch datasheet rating of up to 70 (8 GB) or 100 (16 GB) sparse INT8 TOPS, it fits a medium-class UAS compute allocation with headroom; expect less throughput in the 10 W mode than at MAXN. The dual DLA cores on the 16 GB variant allow splitting a two-model pipeline (detection + classification) across DLA and GPU so that the two models do not compete for the same compute units. The 16 GB of LPDDR5 on that variant is more memory than most inference pipelines need, which leaves room for future model growth.

The honest weaknesses: first, thermal management is non-trivial. The NX module requires active cooling in any enclosure without forced airflow, and passive heat spreaders alone are marginal above 40°C ambient. In an enclosure mounted in a UAS fuselage, the thermal design is often a harder engineering problem than the software integration. Second, the Orin family's supply chain went through significant constraints in 2023–2024; programs with multi-year production commitments should maintain strategic inventory or qualify a secondary sourcing arrangement.
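On the thermal point, a first-order steady-state estimate shows why the enclosure matters as much as the module. The sketch below uses the standard lumped model T_j = T_ambient + P × θ, where θ is the junction-to-ambient thermal resistance of the whole path (module, heat spreader, enclosure). The function names, the θ values, and the 95 °C limit are hypothetical placeholders; real values come from the module's thermal design guide and from measurement in the actual enclosure.

```python
def junction_temp_c(ambient_c: float, power_w: float, theta_ja_c_per_w: float) -> float:
    """Steady-state junction temperature from a lumped thermal resistance model."""
    return ambient_c + power_w * theta_ja_c_per_w


def thermal_margin_c(ambient_c: float, power_w: float,
                     theta_ja_c_per_w: float, tj_limit_c: float) -> float:
    """Positive margin means headroom; negative means the limit is exceeded."""
    return tj_limit_c - junction_temp_c(ambient_c, power_w, theta_ja_c_per_w)


# Hypothetical numbers: 15 W load, 40 C ambient inside the enclosure, 95 C limit.
TJ_LIMIT_C = 95.0
for label, theta in [("passive spreader (assumed 4.0 C/W)", 4.0),
                     ("active cooling (assumed 2.0 C/W)", 2.0)]:
    margin = thermal_margin_c(40.0, 15.0, theta, TJ_LIMIT_C)
    print(f"{label}: margin {margin:+.1f} C")
```

With these assumed resistances the passive case exceeds the limit by 5 °C while the active case keeps 25 °C of headroom, which is the kind of gap that should be known before the enclosure design is frozen.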

Hailo-8 and Hailo-8L: purpose-built inference at minimal power

The Hailo-8 (26 TOPS) and Hailo-8L (13 TOPS) are not general-purpose compute modules. They are neural network inference accelerators with a fixed dataflow architecture optimized for throughput on common vision model topologies. The power profile is their defining advantage: the Hailo-8 draws approximately 2.5–3.5 W under sustained single-model inference, roughly one-fifth the power of an Orin NX running the same workload on the iGPU.
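The efficiency gap is easy to see by dividing peak throughput by power, using the figures quoted in this article. The sketch below is arithmetic only: peak TOPS per watt ignores utilization, precision, and sparsity differences (the AGX Orin figure is a sparse rating), so treat the result as a rough ordering rather than a benchmark. The helper name `tops_per_watt` is illustrative.

```python
def tops_per_watt(peak_tops: float, power_w: float) -> float:
    """Peak throughput divided by power draw; a coarse efficiency indicator."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return peak_tops / power_w


# Figures as quoted in the text above.
hailo8_best = tops_per_watt(26, 2.5)    # low end of the 2.5-3.5 W range
hailo8_worst = tops_per_watt(26, 3.5)   # high end of the range
agx_orin = tops_per_watt(275, 60)       # sparse TOPS at maximum TDP

print(f"Hailo-8:  {hailo8_worst:.1f} to {hailo8_best:.1f} TOPS/W")
print(f"AGX Orin: {agx_orin:.1f} TOPS/W (sparse rating)")
```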

The architecture imposes genuine constraints. The Hailo Dataflow Compiler performs a layer-fusion and partitioning step that maps the model's computation graph onto the chip's core grid. Models that map well — ResNet, MobileNet, EfficientDet, YOLO family through v8 — achieve near-datasheet TOPS utilization. Models with dynamic shapes, conditional branching, or transformer attention layers that do not map to the compiler's supported operators either require model modification or cannot run on Hailo at all. The compiler is improving with each release, but for programs with unusual model architectures, the compatibility check is mandatory before any hardware commitment.

In its M.2 2280 form factor, the Hailo-8 integrates into any system with an available PCIe M.2 slot, which means it can augment an existing SBC rather than replacing it. For airframe integrators who have already qualified a specific host processor (for example, a carrier board around an i.MX 8 or Rockchip RK3588), adding a Hailo M.2 is often the lowest-risk path to meeting inference requirements without re-qualifying the host system. The Hailo PCIe driver is available for Linux and supports standard V4L2 and GStreamer integration paths.

Xilinx Zynq UltraScale+: determinism over flexibility

The Zynq UltraScale+ family combines ARM Cortex-A53 application processors with an FPGA fabric in a single SoC. For inference specifically, the FPGA fabric is programmed with a custom accelerator IP — typically using Xilinx's Vitis AI toolchain, which provides a DPU (Deep Learning Processing Unit) IP core that handles the convolution-heavy layers of standard vision models.

The ZU4EV and ZU5EV are the most commonly selected parts for small UAS embedded compute because they combine a reasonable DSP slice count with the Cortex-A53 cluster for host-side processing and the power-gated Mali GPU for display or pre/post-processing tasks. A ZU5EV running the Vitis AI B1600 DPU at INT8 can sustain roughly 1.6 TOPS at 3–5 W, which is competitive with the Hailo-8L for compact detection networks.
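For sizing, it helps to know that the number in a DPU configuration name (B512 through B4096) is its peak operations per clock cycle, so peak throughput is that number times the DPU clock times the number of cores instantiated. The sketch below works through that arithmetic. The 300 MHz clock and the core counts are assumptions for illustration, not measurements from a specific ZU5EV design, and sustained throughput on a real model will be below peak.

```python
def dpu_peak_tops(ops_per_cycle: int, clock_mhz: float, cores: int = 1) -> float:
    """Peak INT8 throughput in TOPS: ops/cycle x clock (Hz) x cores / 1e12."""
    return ops_per_cycle * clock_mhz * 1e6 * cores / 1e12


# Assumed 300 MHz DPU clock; core counts are hypothetical.
for cores in (1, 2, 3):
    peak = dpu_peak_tops(1600, 300, cores)
    print(f"B1600 x{cores} @ 300 MHz: {peak:.2f} TOPS peak")
```

Running the same calculation with your own clock and core count is a quick sanity check on any throughput figure before it goes into a power or latency budget.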

The FPGA advantage over both GPU and purpose-built NPU is determinism. The FPGA fabric, once programmed, executes with clock-cycle-level predictability. There are no OS scheduling preemptions, no DVFS state transitions, no background driver tasks consuming compute. For programs that require certification-grade latency bounds — the kind where you must be able to claim that inference completes within X cycles 100% of the time, not 99.9% of the time — an FPGA is the only option that provides that guarantee without custom silicon.
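The distinction between a percentile target and a hard bound can be made concrete with a small check over measured latencies. The sketch below uses synthetic samples and a nearest-rank percentile; the sample values, the 20 ms bound, and the function names are illustrative assumptions. Passing this check on measured data does not prove a worst-case bound. It only shows how a trace can meet a p99 target while still violating a 100% requirement.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]


def meets_hard_bound(samples: Sequence[float], bound_ms: float) -> bool:
    """True only if every observed sample is within the bound."""
    return max(samples) <= bound_ms


# Synthetic trace: 999 frames at 12 ms plus one 35 ms outlier (e.g. a preemption).
latencies_ms = [12.0] * 999 + [35.0]
BOUND_MS = 20.0

print("p99 within bound:", percentile(latencies_ms, 99) <= BOUND_MS)
print("hard bound met:  ", meets_hard_bound(latencies_ms, BOUND_MS))
```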

We are not saying FPGA is the right answer for fast-moving programs. The NRE cost of FPGA integration is real: RTL development, simulation, place-and-route, bitstream generation, and testing take months, not days. Model updates require a re-synthesis cycle. The toolchain (Vivado, Vitis AI) has a steeper learning curve than CUDA. These are genuine barriers. The correct question is whether the program's requirements (specifically deterministic latency, harsh environment qualification, or a long logistics tail) justify the higher integration cost. For a 5-year fielded program with a locked model, often they do.

A selection framework, not a universal ranking

Rather than a ranked list, the selection logic we use maps requirement clusters to platform choices:

Rapid prototyping, changing model pipeline, <20 W budget, Group 2–3 UAS: Jetson Orin NX. The toolchain flexibility and developer ecosystem offset the power overhead at this platform size.

Hard power constraint (<5 W for inference), host processor already qualified, standard model topology: Hailo-8 M.2. The power number is real, the M.2 integration is low risk, and the compiler handles YOLO-family and EfficientDet well.

Deterministic latency required, 7+ year fielded lifecycle, locked model, harsh environment qualification: Zynq UltraScale+ with Vitis AI DPU. The NRE cost is justified when the program lifecycle is long enough to amortize it and the latency requirements cannot be met with probabilistic guarantees.
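The three clusters above can be written down as a small rule set. The sketch below is a deliberately simplified encoding of that mapping, not a complete selection tool. The function name, parameter names, and the order in which rules are checked are assumptions made for illustration; the thresholds are the ones stated in the clusters, plus the Orin NX's 10 W minimum power mode mentioned earlier.

```python
def suggest_platform(
    needs_deterministic_latency: bool,
    model_locked: bool,
    lifecycle_years: float,
    inference_power_budget_w: float,
    host_already_qualified: bool,
    standard_model_topology: bool,
) -> str:
    """Map a requirement cluster to a candidate platform family (hypothetical rules)."""
    # Determinism plus a long, locked lifecycle justifies the FPGA NRE.
    if needs_deterministic_latency and model_locked and lifecycle_years >= 7:
        return "Zynq UltraScale+ with Vitis AI DPU"
    # Hard power cap with a qualified host and a compiler-friendly model.
    if (inference_power_budget_w < 5 and host_already_qualified
            and standard_model_topology):
        return "Hailo-8 M.2"
    # Changing pipeline with enough power for the module's lowest power mode.
    if not model_locked and 10 <= inference_power_budget_w < 20:
        return "Jetson Orin NX"
    return "no direct match: evaluate case by case"


print(suggest_platform(False, False, 3, 15, False, True))
```

Requirement sets that fall between clusters return no match on purpose, since those are the cases where the cross-cutting considerations below tend to decide the outcome.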

Two considerations cut across all three: supply chain qualification and thermal design. A platform that achieves your performance targets on the bench but runs at elevated junction temperature inside your enclosure in July will underperform or fail in the field. And a platform that is sole-sourced from a single distributor for a 10-year program is a logistics risk regardless of its technical merits. Both of these need to be in the evaluation before hardware commitment, not discovered during integration.

INT8 vs. FP16: the precision decision is architectural, not a post-training option

A common misconception is that running a model at INT8 rather than FP16 is simply a matter of calling a quantization API after training. For embedded deployment, the precision decision needs to be made during model design, because quantization-aware training (QAT), in which quantization is simulated in the forward pass during training so that the weights adapt to the rounding error, typically outperforms post-training quantization (PTQ) on the accuracy metrics that matter for ISR tasks: small-object detection recall at low false alarm rate.
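To make the mechanism concrete: INT8 quantization maps each value to one of a small set of integer levels, and QAT exposes the network to that rounding during training by applying a quantize-then-dequantize step (often called fake quantization) in the forward pass. The sketch below shows that step for a symmetric per-tensor scheme in plain Python. It is a simplified illustration, not the quantizer used by any particular toolchain; the function name and sample weights are hypothetical.

```python
from typing import List


def fake_quantize_int8(values: List[float]) -> List[float]:
    """Symmetric per-tensor INT8 quantize-then-dequantize."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / 127.0  # size of one step on the INT8 grid
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # integer level in [-127, 127]
        out.append(q * scale)  # back to float, now carrying rounding error
    return out


weights = [0.50, -0.013, 0.002, 0.127, -0.499]  # hypothetical weights
dequantized = fake_quantize_int8(weights)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]
print("max rounding error:", max(errors))
```

With PTQ the network meets this rounding only at deployment; with QAT the same step runs in every training forward pass, so the weights can adapt to it.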

On YOLOv8n evaluated on a fine-grained detection task with objects smaller than 32×32 pixels, we have measured recall 3–5 percentage points lower with PTQ than with QAT at INT8. That degradation is the difference between a system that meets its probability of detection specification and one that does not. The hardware selection and the model training pipeline are therefore coupled: if you are targeting a hardware platform that requires INT8 for its power budget, you need to begin QAT early in the model development cycle, not apply post-training quantization as a final step before deployment.
