TFLite on the MediaTek Genio NPU: Neuron Stable Delegate

Q: What is the NEURON_FLAG_USE_FP16 flag and why is it mandatory?

NEURON_FLAG_USE_FP16=1 tells the NeuroPilot runtime to execute FP32 model weights as FP16 on the NPU hardware. The Genio MDLA does not natively support FP32 inference; without this flag, FP32 models fail to run on the NPU and fall back to CPU. Always set this flag when using NeuronExecutionProvider or the Neuron Stable Delegate.

Q: What is the Neuron Stable Delegate?

The Neuron Stable Delegate (libNeuronStableDelegate.so) is MediaTek's TFLite external delegate that routes supported ops to the Genio NPU at runtime. Load it via tflite.load_delegate() in Python or TfLiteExternalDelegateCreate() in C++. Ops the NPU cannot execute fall back to CPU automatically within the same inference session.

Q: Should I use INT8 or FP16 models on the Genio NPU?

INT8 post-training quantized models give the best NPU latency and the smallest memory footprint. FP32 models can run only as FP16 (set NEURON_FLAG_USE_FP16=1) because the MDLA has no native FP32 path. If accuracy allows, quantize to INT8 with a representative calibration dataset before deployment.

TensorFlow Lite (LiteRT) with the Neuron Stable Delegate is the most mature path to NPU inference on MediaTek Genio. Every Genio SoC covered here except the Genio 350 includes the MDLA (Multi-Dimension Learning Accelerator) NPU, and the delegate routes supported TFLite ops to it without changes to your inference code. Everything runs on-device, with no cloud connectivity required. This post covers the full TFLite path: loading the delegate from Python and C++, quantizing to INT8 for the best NPU performance, running inference inside GStreamer with NNStreamer, and what speedup to expect. For the ONNX Runtime path, see the dedicated ONNX on the Genio NPU guide.

Key Insights

NEURON_FLAG_USE_FP16 = "1" is mandatory for NPU inference: the MDLA does not support FP32, and without this flag FP32 models silently fall back to CPU
Two NPU paths: TFLite with the Neuron Stable Delegate and ONNX Runtime with NeuronExecutionProvider. Both work; TFLite is more mature
Genio 350 has no NPU: it runs TFLite and ONNX on CPU/GPU only
ONNX NeuronExecutionProvider is enabled by default only on Genio 520/720: the other NPU platforms need ENABLE_NEURON_EP = "1" in the Yocto build
CPU fallback is automatic: ops the NPU does not support fall back to CPU within the same inference session, so you don’t need to handle it manually

NPU support matrix

Platform	TFLite (CPU/GPU)	TFLite Neuron Delegate	ONNX CPU	ONNX NeuronEP (NPU)
Genio 350	✅	❌	✅	❌
Genio 510 / 700	✅	✅	✅	Opt-in¹
Genio 520 / 720	✅	✅	✅	Default
Genio 1200	✅	✅	✅	Opt-in¹

¹ Requires ENABLE_NEURON_EP = "1" in local.conf, then rebuild image.
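
If a deployment script needs to branch on the platform, the matrix above can be encoded as plain data. The following is only a sketch: the dictionary keys and the npu_notes helper are hypothetical names introduced for illustration, and the values simply restate the table.

```python
# Support matrix from the table above, encoded as data.
# "onnx_neuron_ep" is "default", "opt-in", or None (no NPU on this platform).
SUPPORT = {
    "Genio 350": {"tflite_neuron": False, "onnx_neuron_ep": None},
    "Genio 510": {"tflite_neuron": True, "onnx_neuron_ep": "opt-in"},
    "Genio 700": {"tflite_neuron": True, "onnx_neuron_ep": "opt-in"},
    "Genio 520": {"tflite_neuron": True, "onnx_neuron_ep": "default"},
    "Genio 720": {"tflite_neuron": True, "onnx_neuron_ep": "default"},
    "Genio 1200": {"tflite_neuron": True, "onnx_neuron_ep": "opt-in"},
}


def npu_notes(platform):
    """Return setup notes for a platform (hypothetical helper)."""
    entry = SUPPORT[platform]
    notes = []
    if not entry["tflite_neuron"]:
        notes.append("No NPU: run TFLite and ONNX on CPU/GPU only.")
    if entry["onnx_neuron_ep"] == "opt-in":
        notes.append('Set ENABLE_NEURON_EP = "1" in local.conf and rebuild the image.')
    return notes
```

An empty list means the platform needs no extra build step for either NPU path.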

TFLite inference with Neuron Stable Delegate

The Neuron Stable Delegate is the TFLite execution delegate that routes supported ops to the NPU.

Python (recommended for prototyping)

import tflite_runtime.interpreter as tflite
import numpy as np

# Load model with Neuron Stable Delegate for NPU acceleration
interpreter = tflite.Interpreter(
    model_path="mobilenet_v2.tflite",
    experimental_delegates=[
        tflite.load_delegate("libNeuronStableDelegate.so")
    ]
)
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

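# Prepare input (example: 224x224 RGB image, HxWx3 uint8).
# `image` is a placeholder here; replace it with a real decoded frame.
image = np.zeros((224, 224, 3), dtype=np.uint8)
input_data = np.expand_dims(image, axis=0).astype(np.float32)
input_data = (input_data / 255.0 - 0.5) / 0.5  # Normalize to [-1, 1]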

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

# Get output
output = interpreter.get_tensor(output_details[0]['index'])
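
For a classifier such as MobileNetV2, the raw output tensor still has to be turned into ranked class scores. The sketch below assumes a single output of shape (1, num_classes); top_k is a hypothetical helper, and the quantization argument is the (scale, zero_point) pair that TFLite reports in output_details[0]['quantization'], where a scale of 0.0 means the tensor is not quantized.

```python
import numpy as np


def top_k(output, quantization=(0.0, 0), k=5):
    """Return [(class_index, score), ...] for the k highest scores.

    output: tensor from get_tensor(), assumed shape (1, num_classes).
    quantization: (scale, zero_point) from output_details[0]['quantization'].
    """
    scores = np.asarray(output).reshape(-1).astype(np.float32)
    scale, zero_point = quantization
    if scale != 0.0:
        # Dequantize integer outputs back to real-valued scores
        scores = (scores - zero_point) * scale
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]
```

On the board this would be called as top_k(output, output_details[0]['quantization']), and the returned indices can be mapped to names with the model's label file.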

C++ (production)

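#include <memory>
#include "tensorflow/lite/delegates/external/external_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  // Load the model from disk
  auto model = tflite::FlatBufferModel::BuildFromFile("mobilenet_v2.tflite");
  if (!model) return 1;

  // Build external delegate options for Neuron
  TfLiteExternalDelegateOptions opts =
      TfLiteExternalDelegateOptionsDefault("libNeuronStableDelegate.so");
  TfLiteDelegate* delegate = TfLiteExternalDelegateCreate(&opts);
  // Build interpreter and add delegate
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder builder(*model, resolver);
  builder.AddDelegate(delegate);
  if (builder(&interpreter) != kTfLiteOk) return 1;
  interpreter->AllocateTensors();

  // Release the interpreter before the delegate it uses
  interpreter.reset();
  TfLiteExternalDelegateDelete(delegate);
  return 0;
}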
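
Both loading snippets assume libNeuronStableDelegate.so is present on the target. On a Genio 350, or on an image built without the delegate, loading it fails, so a script shared across boards may want to degrade to CPU instead of crashing. The Python sketch below shows one way to do that; neuron_delegates is a hypothetical helper, and the loader is passed in as an argument so the logic can be exercised on a development machine without tflite_runtime installed.

```python
def neuron_delegates(load_delegate, library="libNeuronStableDelegate.so"):
    """Return [delegate] if the Neuron library loads, else [] (CPU only).

    load_delegate: on the board, pass tflite.load_delegate.
    """
    try:
        return [load_delegate(library)]
    except (ValueError, OSError) as err:
        # ValueError: TFLite could not create the delegate.
        # OSError: the shared library could not be opened at all.
        print(f"Neuron delegate unavailable, using CPU: {err}")
        return []


# Usage on the board (not executed here):
# import tflite_runtime.interpreter as tflite
# interpreter = tflite.Interpreter(
#     model_path="mobilenet_v2.tflite",
#     experimental_delegates=neuron_delegates(tflite.load_delegate),
# )
```

This covers only the case where the delegate itself cannot be loaded. Per-op fallback inside a session is handled by the runtime as described above.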

ONNX Runtime with NeuronExecutionProvider

ONNX Runtime runs on the NPU through the NeuronExecutionProvider, which is enabled by default on Genio 520 and 720 and opt-in on 510/700/1200 via ENABLE_NEURON_EP = "1". The same NEURON_FLAG_USE_FP16 = "1" rule applies, and unsupported ops fall through to XNNPACK or CPU automatically within the session.

A dedicated guide, ONNX Runtime on the MediaTek Genio NPU, covers the full ONNX path: provider setup, verifying NPU execution, online vs offline (ncc-tflite) compilation, the 520/720 distinction, and neuronrt benchmarks.
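
As a rough illustration of the fallback order described above, the provider list for onnxruntime.InferenceSession could be built like this. Treat it as a sketch under assumptions: neuron_providers is a hypothetical helper, and passing NEURON_FLAG_USE_FP16 as a provider option is an assumption about how the flag is supplied, so check the dedicated guide for the authoritative setup.

```python
def neuron_providers(use_fp16=True):
    """Build a provider list: NPU first, then XNNPACK, then plain CPU.

    Assumption: the FP16 flag is passed as a provider option string.
    """
    options = {"NEURON_FLAG_USE_FP16": "1"} if use_fp16 else {}
    return [
        ("NeuronExecutionProvider", options),
        "XnnpackExecutionProvider",
        "CPUExecutionProvider",
    ]


# Usage on the board (not executed here):
# import onnxruntime as ort
# session = ort.InferenceSession("model.onnx", providers=neuron_providers())
# print(session.get_providers())  # shows which providers were actually enabled
```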

Model requirements for NPU acceleration

The NPU handles most standard CNN ops natively. Limitations to be aware of:

Op category	NPU support
Conv2D, DepthwiseConv2D	✅ Full
BatchNorm, ReLU, ReLU6	✅ Full
Add, Mul, Concat	✅ Full
LSTM, GRU	⚠️ Partial (some variants)
Transformer attention	⚠️ Partial
Custom ops	❌ CPU fallback
FP32 weights	⚠️ Requires NEURON_FLAG_USE_FP16=1 to run as FP16
INT8 quantized	✅ Best performance path

INT8 quantized models deliver the best NPU performance and the smallest memory footprint. Use post-training quantization in TFLite or ONNX Runtime’s quantization tools before deployment.

Post-training quantization (TFLite)

import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

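# Representative dataset for calibration.
# calibration_samples is a placeholder: an iterable of preprocessed inputs,
# each shaped like the model input including the batch dimension.
def representative_data_gen():
    for sample in calibration_samples:
        yield [sample.astype(np.float32)]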

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
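
Because the converter above sets inference_input_type to tf.int8, the resulting model expects int8 input rather than float32. A minimal sketch of that conversion follows; quantize_input is a hypothetical helper, and the (scale, zero_point) pair comes from input_details[0]['quantization'] on the loaded interpreter.

```python
import numpy as np


def quantize_input(x, quantization, dtype=np.int8):
    """Map float data into the integer domain of a quantized input tensor.

    quantization: (scale, zero_point) from input_details[0]['quantization'].
    """
    scale, zero_point = quantization
    info = np.iinfo(dtype)
    q = np.round(np.asarray(x, dtype=np.float32) / scale) + zero_point
    # Clip so out-of-range values saturate instead of wrapping around
    return np.clip(q, info.min, info.max).astype(dtype)
```

The float data passed in should already be normalized the same way as the calibration samples, since the scale and zero point were derived from them.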

Practical deployment patterns

Pattern 1: Continuous camera inference

import tflite_runtime.interpreter as tflite
import cv2
import numpy as np

# Load model once at startup
interpreter = tflite.Interpreter(
    model_path="detect.tflite",
    experimental_delegates=[tflite.load_delegate("libNeuronStableDelegate.so")]
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Open camera
cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

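    # Preprocess: OpenCV delivers BGR frames; most TFLite models expect RGB
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (300, 300))
    input_data = np.expand_dims(resized, axis=0).astype(np.uint8)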

    # Infer
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()

    boxes = interpreter.get_tensor(output_details[0]['index'])
    scores = interpreter.get_tensor(output_details[2]['index'])
    # ... process results

cap.release()
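
The result-processing step could look like the sketch below. It assumes the common SSD-style TFLite output layout, with boxes of shape (1, N, 4) as normalized [ymin, xmin, ymax, xmax] and scores of shape (1, N); verify this against output_details for your own model. filter_detections is a hypothetical helper.

```python
import numpy as np


def filter_detections(boxes, scores, frame_shape, min_score=0.5):
    """Keep detections above min_score and convert boxes to pixel coordinates.

    Returns a list of (x1, y1, x2, y2, score) tuples in frame pixels.
    """
    height, width = frame_shape[:2]
    results = []
    for box, score in zip(boxes[0], scores[0]):
        if score < min_score:
            continue
        # Clamp to the frame; models can emit boxes slightly outside [0, 1]
        ymin, xmin, ymax, xmax = np.clip(box, 0.0, 1.0)
        results.append((int(xmin * width), int(ymin * height),
                        int(xmax * width), int(ymax * height), float(score)))
    return results
```

Inside the loop this would be called as filter_detections(boxes, scores, frame.shape), and the returned tuples can be drawn with cv2.rectangle.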

Pattern 2: GStreamer + NNStreamer pipeline

NNStreamer integrates TFLite inference directly into GStreamer pipelines for zero-copy frame processing:

gst-launch-1.0 \
  v4l2src device=/dev/video0 ! \
  video/x-raw,format=RGB,width=640,height=480 ! \
  videoconvert ! \
  tensor_converter ! \
  tensor_filter \
    framework=tflite \
    model=mobilenet_v2_int8.tflite \
    accelerator=true:npu ! \
  tensor_decoder mode=image_labeling \
    option1=labels.txt ! \
  overlay ! waylandsink

NNStreamer is included in packagegroup-rity-ai-ml in the RITY Yocto image.

Performance benchmarks on Genio 720

Approximate inference times on Genio 720 EVK with INT8 quantized models:

Model	CPU (ms)	NPU (ms)	Speedup
MobileNetV2	45	8	5.6×
EfficientDet-Lite0	120	22	5.5×
YOLOv5s INT8	210	38	5.5×
BERT-base (text)	850	180	4.7×

In the table above, NPU inference is roughly 4.7× to 5.6× faster than CPU, with the vision models clustered around 5.5×. Because the CPU is free for other work while the NPU runs inference, the real-world benefit is larger than the raw latency numbers suggest.
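
Since the figures above are approximate, it is worth measuring your own model on your own board. The sketch below is one simple way to do that, not the method used to produce the table: median_latency_ms and speedup are hypothetical helpers that time any zero-argument callable, such as interpreter.invoke.

```python
import statistics
import time


def median_latency_ms(run, warmup=5, iterations=50):
    """Median wall-clock latency of run() in milliseconds."""
    for _ in range(warmup):
        run()  # warm-up runs are not timed
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(cpu_ms, npu_ms):
    """Ratio of CPU latency to NPU latency."""
    return cpu_ms / npu_ms
```

Run it once with an interpreter created without the delegate and once with it, then compare the two medians with speedup(). The median is used here because it is less sensitive than the mean to occasional slow runs.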

For the full NeuroPilot stack including Yocto packagegroups and NDA tier features, see What is RITY? MediaTek’s Genio reference distribution explained. For a complete computer vision pipeline from camera to inference to display, see MediaTek Genio for computer vision.

FAQ

What AI frameworks run on MediaTek Genio NPU?

TensorFlow Lite (LiteRT) via the Neuron Stable Delegate and ONNX Runtime via NeuronExecutionProvider both run on the Genio NPU (MDLA). TFLite is the more mature path. ONNX Runtime with NeuronExecutionProvider is available by default on Genio 520 and 720.

Do I need cloud connectivity for AI inference on Genio?

No. The NeuroPilot NPU, TFLite, and ONNX Runtime all run entirely on-device. Models run locally with no network required after the model is loaded onto the device.

What is the NEURON_FLAG_USE_FP16 flag and why is it mandatory?

NEURON_FLAG_USE_FP16=1 tells the NeuroPilot runtime to execute FP32 model weights as FP16 on the NPU hardware. The Genio MDLA does not natively support FP32 inference; without this flag, FP32 models fail to run on the NPU and fall back to CPU.

Which Genio platform has the strongest NPU for on-device AI?

Genio 1200 (MT8395) has the highest NPU TOPS. For the single-channel DDR platforms, Genio 720 (MT8391) has the best NPU-to-cost ratio with ONNX NeuronExecutionProvider enabled by default. Genio 350 has no NPU.

What is the Neuron Stable Delegate?

It is MediaTek’s TFLite external delegate (libNeuronStableDelegate.so), which routes supported ops to the Genio NPU at runtime. Load it with tflite.load_delegate() in Python or TfLiteExternalDelegateCreate() in C++. Unsupported ops fall back to CPU automatically.

Should I use INT8 or FP16 models on the Genio NPU?

INT8 quantized models give the best latency and memory footprint. FP32 runs only as FP16 (NEURON_FLAG_USE_FP16=1) because the MDLA has no native FP32 path. Quantize to INT8 with a representative calibration set when accuracy allows.