Quantization is, at its core, a lossy compression technique applied to a neural network's numerical representation. You replace 32-bit floating-point weights with 8-bit integers, or 4-bit integers in aggressive cases, and accept some amount of accuracy degradation in exchange for dramatic reductions in memory footprint, compute cost, and power draw. In commercial computer vision, the tradeoff is often straightforward: a well-quantized MobileNet is nearly indistinguishable from its FP32 counterpart on standard benchmarks, and the inference speedup on an INT8-capable NPU makes the choice obvious.

Defense edge deployments are not standard benchmarks. The accuracy floor requirements are stricter, the test distributions are adversarially varied, and the failure mode of misclassification is not a degraded user experience — it is a false cue or a missed contact at a moment when the decision window is 80 milliseconds. I want to walk through the quantization process as we actually apply it at Kestrelsense, including the parts that do not appear in framework tutorials.

Why Quantization Is Not Optional at 8W

The compute budget available inside an 8W NPU is finite and measurable. A full-precision (FP32) multi-modal transformer model for 6-class object detection and kinematic estimation runs approximately 3.4 billion MAC operations per inference frame. At the gate efficiency of current embedded NPU silicon, delivering 3.4B FP32 MACs per frame at 60fps would require a compute budget in excess of 40W — five times our total payload allocation.

INT8 quantization reduces the per-MAC cost by a factor of 4 on most dedicated NPU architectures, a figure that includes the reduced data movement from DRAM to the compute array. With our production quantized graph, which is INT8 everywhere except a small number of sensitive layers, the same model runs at 6.1W at 60fps on our NPU: within the power envelope, with operational margin remaining. There is no version of this system that operates at full inference speed without quantization. The physics do not allow it.
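The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted above. The energy-per-MAC values it derives are implied averages that fold in data movement and everything else on the power rail; they are not measured silicon figures.

```python
# Back-of-envelope check of the power budget, using only figures quoted in the text.
MACS_PER_FRAME = 3.4e9      # MAC operations per inference frame
FRAME_RATE_HZ = 60          # target frame rate
PAYLOAD_BUDGET_W = 8.0      # total payload power allocation
FP32_POWER_W = 40.0         # lower bound quoted for FP32 at 60fps
QUANTIZED_POWER_W = 6.1     # power quoted for the quantized graph at 60fps

macs_per_second = MACS_PER_FRAME * FRAME_RATE_HZ


def implied_pj_per_mac(power_w: float) -> float:
    """Average energy per MAC implied by a total power figure, in picojoules."""
    return power_w / macs_per_second * 1e12


print(f"required throughput: {macs_per_second / 1e9:.0f} GMAC/s")
print(f"FP32 over budget by: {FP32_POWER_W / PAYLOAD_BUDGET_W:.1f}x")
print(f"implied FP32 energy: {implied_pj_per_mac(FP32_POWER_W):.0f} pJ/MAC")
print(f"implied quantized energy: {implied_pj_per_mac(QUANTIZED_POWER_W):.0f} pJ/MAC")
print(f"headroom after quantization: {PAYLOAD_BUDGET_W - QUANTIZED_POWER_W:.1f} W")
```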

Post-Training Quantization vs. Quantization-Aware Training

There are two primary approaches to producing a quantized model, and the choice between them has significant consequences for accuracy in the target domain.

Post-training quantization (PTQ) takes a trained FP32 model and converts it to INT8 after training. The conversion process uses a calibration dataset to determine the dynamic range of each layer's activations and set the quantization scale and zero-point parameters. PTQ is fast and does not require re-training. Its weakness is that it tends to be sensitive to the choice of calibration dataset: if the calibration distribution does not match the inference distribution, activation ranges will be mis-estimated, and quantization error will be higher than expected on out-of-distribution inputs.
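To make the calibration step concrete, here is a minimal sketch of how a scale and zero-point can be derived from observed activations. It assumes plain min/max range estimation for signed INT8. That is the simplest option, not necessarily what any given toolchain (or ours) uses; percentile and entropy-based estimators are common alternatives. All function names are illustrative.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed INT8 range


def calibrate_affine(samples: List[float]) -> Tuple[float, int]:
    """Derive (scale, zero_point) from the min/max of calibration activations."""
    # Force the range to contain 0.0 so that zero is exactly representable.
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (QMAX - QMIN)
    if scale == 0.0:          # all-zero calibration data: any scale works
        scale = 1.0
    zero_point = round(QMIN - lo / scale)
    return scale, max(QMIN, min(QMAX, zero_point))


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a real value to INT8, clipping anything outside the calibrated range."""
    return max(QMIN, min(QMAX, round(x / scale) + zero_point))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an INT8 code back to the real value it represents."""
    return (q - zero_point) * scale


# Invented activation samples standing in for one layer's calibration pass.
scale, zero_point = calibrate_affine([-1.0, 0.2, 0.5, 3.0])
for x in (0.0, 0.5, 3.0, 10.0):
    q = quantize(x, scale, zero_point)
    print(f"{x:5.1f} -> q={q:4d} -> {dequantize(q, scale, zero_point):.4f}")
```

The last input, 10.0, lies outside the calibrated range and is clipped to the largest code. That is the failure described above: an activation range estimated on the wrong distribution silently saturates at inference time.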

Quantization-aware training (QAT) simulates quantization noise during the forward pass of training, allowing the model to learn weight distributions that tolerate quantization more gracefully. QAT typically produces better accuracy than PTQ, particularly in low-bit regimes (INT4 or below) and in models whose activation distributions vary significantly across the inference distribution. The cost is a substantially longer training process: in practice, QAT starts from a pre-trained FP32 checkpoint and runs additional training epochs with simulated quantization enabled.
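The core mechanism of QAT fits in a few lines. The sketch below is a toy, not a training recipe: it fits a single weight with a hand-written gradient, uses symmetric INT8 "fake quantization" in the forward pass, and uses a straight-through estimator in the backward pass. Real frameworks do this per tensor with autograd. The function names, the scale, and the target value 0.73 are all invented for illustration.

```python
def fake_quant(w: float, scale: float, qmin: int = -128, qmax: int = 127) -> float:
    """Forward pass: snap w to the nearest value representable at this scale."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def ste_grad(w: float, upstream: float, scale: float,
             qmin: int = -128, qmax: int = 127) -> float:
    """Backward pass (straight-through estimator): treat rounding as the
    identity, but pass no gradient where w was clipped."""
    inside = qmin * scale <= w <= qmax * scale
    return upstream if inside else 0.0


def train_one_weight(xs, ys, scale=0.05, lr=0.1, epochs=200):
    """Fit y = w * x by gradient descent, always predicting with the quantized weight."""
    w = 0.0  # full-precision "latent" weight that the optimizer updates
    for _ in range(epochs):
        grad = 0.0
        for x, y in zip(xs, ys):
            err = fake_quant(w, scale) * x - y       # loss sees the quantized weight
            grad += ste_grad(w, 2.0 * err * x, scale)
        w -= lr * grad / len(xs)
    return w, fake_quant(w, scale)


xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.73 * x for x in xs]          # target weight 0.73 is not representable at scale 0.05
latent_w, deployed_w = train_one_weight(xs, ys)
print(f"latent weight {latent_w:.4f}, deployed (quantized) weight {deployed_w:.2f}")
```

The latent weight settles near the boundary between the two representable values closest to the target. That is the toy version of the model learning weights that tolerate quantization.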

For defense edge models, we strongly prefer QAT over PTQ. The inference distribution we care about — adversarially varied backgrounds, thermal crossover conditions, oblique angles, camouflage patterns — is substantially harder to cover with a calibration dataset than a commercial benchmark. PTQ on a well-curated calibration set gives us a model that performs within 1.5% of FP32 on our validation set, but may show accuracy drops of 4–7% on out-of-distribution captures that represent real operational conditions. QAT, with a training set that includes the hard cases, closes most of that gap.

Layer-Wise Sensitivity Analysis — The Step Most Tutorials Skip

Not all layers of a neural network tolerate quantization equally. Attention layers in transformer architectures are particularly sensitive. The softmax in the attention computation exponentiates its inputs, so it amplifies small differences between pre-softmax logits. When those logits are held in a low-precision integer format such as INT8, the differences cannot be represented accurately and the attention output becomes numerically unstable.
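A small numerical experiment illustrates the effect. The logits below are invented, and the quantizer is a simple symmetric per-tensor scheme chosen for clarity, so treat the size of the error as illustrative rather than representative of any particular model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def quantize_symmetric(values, bits):
    """Round values to a symmetric signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Invented pre-softmax logits: three close competitors plus one large outlier
# that stretches the quantization range.
logits = [12.0, 11.9, 11.6, -20.0, 3.0]
reference = softmax(logits)

q8 = quantize_symmetric(logits, bits=8)
q16 = quantize_symmetric(logits, bits=16)
err8 = max_abs_diff(reference, softmax(q8))
err16 = max_abs_diff(reference, softmax(q16))

print(f"INT8 logits:  {[round(v, 3) for v in q8]}")
print(f"max attention-weight error at INT8:  {err8:.5f}")
print(f"max attention-weight error at INT16: {err16:.5f}")
```

With these values, the two leading logits land on the same INT8 code, so the distinction between them is lost before the softmax ever runs. At 16 bits they remain separate.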

Before committing to a quantization configuration, we run a layer-wise sensitivity analysis: systematically quantizing one layer at a time while keeping all others at FP32, and measuring the accuracy impact. The output is a sensitivity map that identifies which layers contribute most to quantization error. In our multi-modal fusion model, the cross-modal attention layers account for approximately 60% of the total quantization-induced accuracy loss, despite representing less than 15% of the model parameters.
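In outline, the analysis is a loop over layers around an evaluation callback. The sketch below assumes a hypothetical `evaluate` function that takes the set of layer names to quantize and returns validation accuracy in percent. The layer names and the stand-in evaluator's penalties are invented so the example runs on its own; a real evaluator would run the validation set on the partially quantized model.

```python
from typing import Callable, Dict, FrozenSet, List


def sensitivity_map(
    layers: List[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Accuracy drop (percentage points) when each layer alone is quantized,
    sorted from most to least sensitive."""
    baseline = evaluate(frozenset())  # nothing quantized: the FP32 reference
    drops = {name: baseline - evaluate(frozenset({name})) for name in layers}
    return dict(sorted(drops.items(), key=lambda kv: kv[1], reverse=True))


# Stand-in evaluator with invented per-layer penalties (hypothetical values).
_FAKE_PENALTY_PP = {
    "backbone_rgb": 0.15,
    "backbone_ir": 0.25,
    "xattn_0": 0.5,
    "xattn_1": 0.4,
    "xattn_2": 0.3,
}


def fake_evaluate(quantized: FrozenSet[str]) -> float:
    return 90.0 - sum(_FAKE_PENALTY_PP[name] for name in quantized)


result = sensitivity_map(list(_FAKE_PENALTY_PP), fake_evaluate)
total_drop = sum(result.values())
for name, drop in result.items():
    print(f"{name:13s} drop={drop:.2f} pp  share={drop / total_drop:.0%}")
```

One-at-a-time analysis ignores interactions between layers, so the per-layer drops need not sum to the loss of the fully quantized model. We treat the map as a ranking, not an error budget.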

The practical implication is mixed-precision quantization: run the sensitive attention layers at INT16 or FP16, and run the less sensitive convolutional backbone layers at INT8. This increases the compute cost slightly — INT16 MACs are about 2x the cost of INT8 MACs on our NPU — but the accuracy recovery more than justifies the budget. Our production model uses INT8 for the backbone encoders and INT16 for the three attention layers; overall power draw is 6.1W versus 5.6W for a fully INT8 graph, and classification accuracy is 1.1 percentage points higher.
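Turning a sensitivity map into a precision plan can be as simple as a threshold. The sketch below assumes a hypothetical map and hypothetical MAC shares per layer; only the 2x relative cost of INT16 MACs comes from the text. It estimates MAC cost relative to an all-INT8 graph, which is not the same thing as power draw.

```python
from typing import Dict


def precision_plan(sensitivity_pp: Dict[str, float], threshold_pp: float) -> Dict[str, str]:
    """Assign INT16 to layers whose isolated accuracy drop meets the threshold."""
    return {
        name: "int16" if drop >= threshold_pp else "int8"
        for name, drop in sensitivity_pp.items()
    }


def relative_mac_cost(plan: Dict[str, str], mac_share: Dict[str, float],
                      int16_cost: float = 2.0) -> float:
    """MAC cost relative to an all-INT8 graph (1.0 means no overhead)."""
    return sum(
        share * (int16_cost if plan[name] == "int16" else 1.0)
        for name, share in mac_share.items()
    )


# Hypothetical inputs: accuracy drop per layer (pp) and each layer's share of MACs.
sensitivity_pp = {"xattn_0": 0.5, "xattn_1": 0.4, "xattn_2": 0.3,
                  "backbone_ir": 0.25, "backbone_rgb": 0.15}
mac_share = {"backbone_rgb": 0.45, "backbone_ir": 0.45,
             "xattn_0": 0.04, "xattn_1": 0.03, "xattn_2": 0.03}

plan = precision_plan(sensitivity_pp, threshold_pp=0.3)
cost = relative_mac_cost(plan, mac_share)
print(plan)
print(f"relative MAC cost vs all-INT8: {cost:.2f}x")
```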

Calibrating for the Actual Deployment Distribution

Whether the data in question is a PTQ calibration set or a QAT validation set, the quantized model's accuracy depends on how well that data represents the inference distribution. This is an underappreciated source of quantization error in defense deployments.

Commercial quantization tutorials recommend using 100–1000 samples from the training distribution for calibration. For a defense ISR model, the training distribution is not the deployment distribution. Models trained on COCO or PASCAL-VOC have never seen desert scrub backgrounds, tarpaulin-covered vehicles at 300m range, or thermal crossover conditions at dawn. Calibrating on those datasets produces INT8 models that are well-quantized for those datasets — and meaningfully degraded on the sensor captures that actually matter.
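The mechanism is easy to reproduce in miniature. The sketch below calibrates a symmetric INT8 scale on one set of activations and then measures round-trip error on another. Both activation ranges are invented, and the narrow/wide split stands in for "benchmark" versus "deployment" conditions.

```python
def symmetric_scale(calibration, qmax=127):
    """Scale chosen so the largest calibration magnitude maps to qmax."""
    return max(abs(v) for v in calibration) / qmax


def roundtrip_error(values, scale, qmax=127):
    """Mean absolute error after quantize -> clip -> dequantize."""
    total = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        total += abs(q * scale - v)
    return total / len(values)


# Invented activations: the benchmark data spans [-2, 2], deployment spans [-6, 6].
benchmark_acts = [i / 10 for i in range(-20, 21)]
deployment_acts = [i / 10 for i in range(-60, 61)]

mismatched = roundtrip_error(deployment_acts, symmetric_scale(benchmark_acts))
matched = roundtrip_error(deployment_acts, symmetric_scale(deployment_acts))

print(f"calibrated on benchmark data:  mean abs error {mismatched:.4f}")
print(f"calibrated on deployment data: mean abs error {matched:.4f}")
```

Almost all of the mismatched error comes from clipping: every deployment activation beyond the benchmark range saturates at the largest INT8 code.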

At Kestrelsense, we maintain a calibration dataset specifically built from sensor captures in conditions representative of the operational environments specified in our SBIR program scope. The dataset covers multiple background types, three atmospheric conditions (clear, haze, light precipitation), four target classes, and three thermal conditions including crossover. Calibration against this dataset produces a quantized model whose accuracy gap from FP32 is consistently under 1.8 percentage points across all condition categories — a gap we have validated, not assumed.
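Checking a gap "across all condition categories" means looking at the worst category, not the mean. A minimal sketch follows. The 1.8 percentage point threshold is taken from the text; the accuracy figures are invented placeholders, and the condition labels are a simplified subset.

```python
from typing import Dict, Tuple

MAX_GAP_PP = 1.8  # threshold quoted in the text


def condition_gaps(fp32_acc: Dict[str, float], int8_acc: Dict[str, float]) -> Dict[str, float]:
    """FP32-minus-quantized accuracy gap (percentage points) per condition."""
    return {cond: fp32_acc[cond] - int8_acc[cond] for cond in fp32_acc}


def worst_condition(gaps: Dict[str, float]) -> Tuple[str, float]:
    """The condition with the largest gap; this is the number to gate on."""
    return max(gaps.items(), key=lambda kv: kv[1])


# Invented per-condition accuracies (percent), for illustration only.
fp32_acc = {"clear": 91.0, "haze": 88.5, "light_precipitation": 87.0, "thermal_crossover": 84.0}
int8_acc = {"clear": 90.5, "haze": 87.5, "light_precipitation": 85.5, "thermal_crossover": 82.25}

gaps = condition_gaps(fp32_acc, int8_acc)
name, gap = worst_condition(gaps)
mean_gap = sum(gaps.values()) / len(gaps)
print(f"mean gap {mean_gap:.2f} pp, worst gap {gap:.2f} pp ({name})")
print("within threshold" if gap < MAX_GAP_PP else "exceeds threshold")
```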

Validating Quantized Models Against Hard Cases, Not Just Averages

The final step in a defensible quantization workflow is validation against held-out hard cases — sensor captures or synthetic scenarios that represent the tail of the difficulty distribution. Average accuracy on a representative test set is a necessary condition for a production model, but not a sufficient one.

In our experience, quantized models tend to degrade non-uniformly across the difficulty distribution. The easy cases — a vehicle in the open, good lighting, clear atmosphere — tolerate INT8 quantization with almost no accuracy impact. The hard cases — a partially occluded target at oblique angle under haze — are where quantization error concentrates. A model with 87% mean precision can have 71% precision on the hard-case subset. For a defense program, the hard-case performance is the number that matters.
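The way a healthy aggregate hides a weak subset is simple arithmetic. The detection counts below are invented, chosen only to produce figures of the same shape as those quoted above. The aggregate here is pooled over all detections, which is one of several ways a mean precision might be computed.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Precision = TP / (TP + FP); defined as 0.0 when there are no detections."""
    detections = true_pos + false_pos
    return true_pos / detections if detections else 0.0


# Invented detection counts per difficulty subset: (true positives, false positives).
counts = {
    "easy": (800, 80),
    "hard": (142, 58),
}

pooled_tp = sum(tp for tp, _ in counts.values())
pooled_fp = sum(fp for _, fp in counts.values())

print(f"pooled precision: {precision(pooled_tp, pooled_fp):.1%}")
for subset, (tp, fp) in counts.items():
    print(f"{subset:>5s} precision: {precision(tp, fp):.1%}")
```

Because easy frames dominate the sample count, they also dominate the pooled figure. The hard subset has to be reported separately to be visible at all.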

We evaluate every candidate model against a hard-case validation set before approving it for deployment. The hard-case set is adversarially selected: it contains the frames and conditions on which previous model versions had the most failures. Models that pass the average-case validation but show more than 5 percentage points of accuracy degradation on hard cases are rejected and returned for additional QAT fine-tuning.
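Expressed as code, the acceptance rule might look like the sketch below. It makes two assumptions that the text does not spell out: degradation is measured against the FP32 model on the same hard-case set, and the average-case check is a simple maximum drop whose threshold is passed in as a hypothetical parameter. The 5 percentage point limit comes from the text.

```python
from dataclasses import dataclass

HARD_CASE_MAX_DROP_PP = 5.0  # limit quoted in the text


@dataclass
class EvalResult:
    average_acc: float    # accuracy (percent) on the representative test set
    hard_case_acc: float  # accuracy (percent) on the hard-case validation set


def deployment_gate(fp32: EvalResult, quantized: EvalResult,
                    average_max_drop_pp: float) -> str:
    """Return a decision string for a candidate quantized model."""
    average_drop = fp32.average_acc - quantized.average_acc
    hard_drop = fp32.hard_case_acc - quantized.hard_case_acc
    if average_drop > average_max_drop_pp:
        return "reject: average-case validation failed"
    if hard_drop > HARD_CASE_MAX_DROP_PP:
        return "reject: return for additional QAT fine-tuning"
    return "approve"


# Invented results: the first candidate looks fine on average but fails on hard cases.
baseline = EvalResult(average_acc=90.0, hard_case_acc=80.0)
print(deployment_gate(baseline, EvalResult(89.0, 74.0), average_max_drop_pp=2.0))
print(deployment_gate(baseline, EvalResult(89.0, 76.0), average_max_drop_pp=2.0))
```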

The Model Signing Step

After quantization and validation, the inference graph must be prepared for secure deployment. Every KS-100 module requires an ECDSA-signed model file before it will execute an inference graph. The signing process is handled by our firmware delivery pipeline, which is ITAR-controlled and maintains chain-of-custody documentation for each signed model version.

The quantization step is the last engineering stage before signing. A model that has been quantized, validated, and approved for a given program scope gets signed with the program-specific key. Once signed, the model binary cannot be modified without invalidating the signature. This ensures that the quantized, validated model that was approved is the model that runs on the deployed hardware, not a subsequently modified version.

Quantization is a precise discipline with real accuracy consequences for defense deployments. Getting it right requires layer-wise sensitivity analysis, domain-matched calibration data, QAT rather than PTQ for adversarially varied inference distributions, and hard-case validation that does not let good average performance mask dangerous tail-of-distribution degradation. We have built these practices into our model development workflow from the start — because a model that works well on a benchmark but fails on the hard cases is not a model you can rely on when the engagement window is 80 milliseconds.