On-device AI without the cloud on MediaTek Genio
Running AI inference on Genio without cloud connectivity is the primary use case for the platform. Every Genio SoC above the 350 includes the MDLA (Multi-Dimension Learning Accelerator) NPU. MediaTek’s NeuroPilot stack integrates the NPU into TFLite and ONNX Runtime through execution providers, so standard model files work without rewriting inference code. This post covers the full path from a trained model to running inference on the NPU.
Key Insights
- NEURON_FLAG_USE_FP16 = "1" is mandatory for NPU inference: the MDLA does not support FP32; without this flag, models silently fall back to CPU
- Two NPU paths: TFLite with Neuron Stable Delegate, and ONNX Runtime with NeuronExecutionProvider. Both work; TFLite is more mature
- Genio 350 has no NPU: it runs TFLite and ONNX on CPU/GPU only
- ONNX NeuronExecutionProvider is default only on Genio 520/720: other platforms need ENABLE_NEURON_EP = "1" in the Yocto build
- CPU fallback is automatic: ops unsupported by the NPU fall back to CPU within the same inference session; you don’t need to handle it manually
NPU support matrix
| Platform | TFLite (CPU/GPU) | TFLite Neuron Delegate | ONNX CPU | ONNX NeuronEP (NPU) |
|---|---|---|---|---|
| Genio 350 | ✅ | ❌ | ✅ | ❌ |
| Genio 510 / 700 | ✅ | ✅ | ✅ | Opt-in¹ |
| Genio 520 / 720 | ✅ | ✅ | ✅ | Default |
| Genio 1200 | ✅ | ✅ | ✅ | Opt-in¹ |
¹ Requires ENABLE_NEURON_EP = "1" in local.conf, then rebuild image.
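The matrix above can be encoded as a small lookup so a deployment script picks the right path per board. This is a minimal sketch: the `NPU_SUPPORT` dictionary, its keys, and the `onnx_npu_status` helper are hypothetical names that only restate the table; they are not part of NeuroPilot or RITY.

```python
# Hypothetical encoding of the support matrix above; values restate the table.
NPU_SUPPORT = {
    "genio-350":  {"tflite_neuron": False, "onnx_neuron_ep": "none"},
    "genio-510":  {"tflite_neuron": True,  "onnx_neuron_ep": "opt-in"},
    "genio-700":  {"tflite_neuron": True,  "onnx_neuron_ep": "opt-in"},
    "genio-520":  {"tflite_neuron": True,  "onnx_neuron_ep": "default"},
    "genio-720":  {"tflite_neuron": True,  "onnx_neuron_ep": "default"},
    "genio-1200": {"tflite_neuron": True,  "onnx_neuron_ep": "opt-in"},
}


def onnx_npu_status(platform: str) -> str:
    """Return a short note on ONNX NPU availability for a platform key."""
    status = NPU_SUPPORT[platform]["onnx_neuron_ep"]
    if status == "default":
        return "NeuronExecutionProvider is built in"
    if status == "opt-in":
        return 'set ENABLE_NEURON_EP = "1" in local.conf and rebuild'
    return "no NPU: use CPUExecutionProvider"


for name in ("genio-350", "genio-720", "genio-1200"):
    print(name, "->", onnx_npu_status(name))
```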
TFLite inference with Neuron Stable Delegate
The Neuron Stable Delegate is the TFLite execution delegate that routes supported ops to the NPU.
Python (recommended for prototyping)
import tflite_runtime.interpreter as tflite
import numpy as np
# Load model with Neuron Stable Delegate for NPU acceleration
interpreter = tflite.Interpreter(
    model_path="mobilenet_v2.tflite",
    experimental_delegates=[
        tflite.load_delegate("libNeuronStableDelegate.so")
    ]
)
interpreter.allocate_tensors()
# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Prepare input (example: 224x224 RGB image)
# Placeholder frame so the snippet runs; replace with a real HxWx3 uint8 image
image = np.zeros((224, 224, 3), dtype=np.uint8)
input_data = np.expand_dims(image, axis=0).astype(np.float32)
input_data = (input_data / 255.0 - 0.5) / 0.5  # Normalize to [-1, 1]
# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
# Get output
output = interpreter.get_tensor(output_details[0]['index'])
C++ (production)
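#include <memory>
#include "tensorflow/lite/delegates/external/external_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
// Load the model from disk (example path)
auto model = tflite::FlatBufferModel::BuildFromFile("mobilenet_v2.tflite");
// Build external delegate options for Neuron
TfLiteExternalDelegateOptions opts =
    TfLiteExternalDelegateOptionsDefault("libNeuronStableDelegate.so");
TfLiteDelegate* delegate = TfLiteExternalDelegateCreate(&opts);
// Build interpreter and add delegate
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder builder(*model, resolver);
builder.AddDelegate(delegate);
builder(&interpreter);
interpreter->AllocateTensors();
// After the interpreter has been destroyed, free the delegate:
// TfLiteExternalDelegateDelete(delegate);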
ONNX Runtime with NeuronExecutionProvider
Python setup
import onnxruntime as ort
import numpy as np
# Configure NeuronExecutionProvider options
neuron_opts = {
    "NEURON_FLAG_USE_FP16": "1",        # MANDATORY: FP32 models run as FP16 on the NPU
    "NEURON_FLAG_MIN_GROUP_SIZE": "1",  # Minimum ops per NPU subgraph
}
providers = [
    ("NeuronExecutionProvider", neuron_opts),
    "XnnpackExecutionProvider",  # Fallback: XNNPACK on CPU
    "CPUExecutionProvider",      # Final fallback
]
session = ort.InferenceSession("model.onnx", providers=providers)
# Placeholder input so the snippet runs; match your model's input shape and dtype
input_data = np.zeros((1, 3, 224, 224), dtype=np.float32)
# Run inference
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
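Since NeuronExecutionProvider is only built in by default on some platforms, it can help to filter the provider list against what the installed ONNX Runtime actually ships. A minimal sketch, assuming the hypothetical helper name `pick_providers`; on a device the `available` list would come from `onnxruntime.get_available_providers()`:

```python
def pick_providers(available, neuron_opts):
    """Keep the preferred order, dropping providers this build does not ship."""
    preferred = [
        ("NeuronExecutionProvider", neuron_opts),
        ("XnnpackExecutionProvider", {}),
        ("CPUExecutionProvider", {}),
    ]
    return [(name, opts) for name, opts in preferred if name in available]


# On a device: available = onnxruntime.get_available_providers()
# Here: an example build where NeuronEP was not enabled
available = ["XnnpackExecutionProvider", "CPUExecutionProvider"]
providers = pick_providers(available, {"NEURON_FLAG_USE_FP16": "1"})
print([name for name, _ in providers])
# → ['XnnpackExecutionProvider', 'CPUExecutionProvider']
```

The returned list of `(name, options)` tuples can be passed directly as the `providers` argument of `ort.InferenceSession`.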
Verify NPU is being used
# Check which providers are active
print(session.get_providers())
# ['NeuronExecutionProvider', 'XnnpackExecutionProvider', 'CPUExecutionProvider']
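# Inspect the options each registered provider was configured with
active = session.get_provider_options()
print(active)
If NeuronExecutionProvider is in the list returned by get_providers(), the provider loaded and is registered for this session. Ops that NeuronEP doesn’t support automatically fall through to the next provider. The list alone does not show how many ops each provider took; to see per-node placement, create the session with verbose logging (SessionOptions.log_severity_level = 0) and read the node assignments in the log.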
Enabling NeuronEP on Genio 510/700/1200 (Yocto)
# conf/local.conf
ENABLE_NEURON_EP = "1"
Rebuild the image after setting this flag. The flag adds the NeuronExecutionProvider shared library to the ONNX Runtime installation.
Model requirements for NPU acceleration
The NPU handles most standard CNN ops natively. Limitations to be aware of:
| Op category | NPU support |
|---|---|
| Conv2D, DepthwiseConv2D | ✅ Full |
| BatchNorm, ReLU, ReLU6 | ✅ Full |
| Add, Mul, Concat | ✅ Full |
| LSTM, GRU | ⚠️ Partial (some variants) |
| Transformer attention | ⚠️ Partial |
| Custom ops | ❌ CPU fallback |
| FP32 weights | ⚠️ Requires NEURON_FLAG_USE_FP16=1 to run as FP16 |
| INT8 quantized | ✅ Best performance path |
INT8 quantized models deliver the best NPU performance and the smallest memory footprint. Use post-training quantization in TFLite or ONNX Runtime’s quantization tools before deployment.
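For intuition on what INT8 quantization does: TFLite stores each quantized tensor as 8-bit integers plus a scale and a zero point, with real_value = scale × (int8_value − zero_point). Each weight takes one byte instead of four, which is where the footprint saving comes from. A minimal sketch of that arithmetic; the scale and zero point below are made-up illustration values, not taken from any real model:

```python
def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a real value to int8 and clamp to the representable range."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover the approximate real value from an int8 value."""
    return scale * (q - zero_point)


scale, zero_point = 0.25, -10  # illustrative values only

q = quantize(1.0, scale, zero_point)
print(q, dequantize(q, scale, zero_point))   # → -6 1.0
print(quantize(100.0, scale, zero_point))    # → 127 (saturates at the int8 limit)
```

Values outside the calibrated range saturate, which is why the representative dataset used for calibration should resemble real inputs.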
Post-training quantization (TFLite)
import numpy as np
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Representative dataset for calibration
# calibration_samples: an iterable of arrays shaped like the model input (supply your own)
def representative_data_gen():
    for sample in calibration_samples:
        yield [sample.astype(np.float32)]
converter.representative_dataset = representative_data_gen
# Restrict to INT8 builtin ops so conversion fails instead of leaving float ops in the graph
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
Practical deployment patterns
Pattern 1: Continuous camera inference
import tflite_runtime.interpreter as tflite
import cv2
import numpy as np
# Load model once at startup
interpreter = tflite.Interpreter(
    model_path="detect.tflite",
    experimental_delegates=[tflite.load_delegate("libNeuronStableDelegate.so")]
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Open camera
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Preprocess: OpenCV delivers BGR; most TFLite detection models expect RGB
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (300, 300))
    input_data = np.expand_dims(resized, axis=0).astype(np.uint8)
    # Infer
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    boxes = interpreter.get_tensor(output_details[0]['index'])
    scores = interpreter.get_tensor(output_details[2]['index'])
    # ... process results
# Release the camera when the loop exits
cap.release()
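The "process results" step in the loop above usually means dropping low-confidence detections and scaling boxes back to frame coordinates. A minimal sketch, assuming an SSD-style output layout (boxes as normalized `[ymin, xmin, ymax, xmax]`, one score per box); that layout and the `filter_detections` name are assumptions, so check them against your model's output details:

```python
def filter_detections(boxes, scores, frame_w, frame_h, threshold=0.5):
    """Keep detections above threshold; convert normalized boxes to pixels."""
    results = []
    for (ymin, xmin, ymax, xmax), score in zip(boxes, scores):
        if score < threshold:
            continue
        results.append({
            "score": score,
            # (left, top, right, bottom) in pixel coordinates
            "box_px": (int(xmin * frame_w), int(ymin * frame_h),
                       int(xmax * frame_w), int(ymax * frame_h)),
        })
    return results


# Two fake detections: one confident, one below the threshold
boxes = [[0.25, 0.125, 0.75, 0.5], [0.0, 0.0, 1.0, 1.0]]
scores = [0.9, 0.3]
print(filter_detections(boxes, scores, frame_w=640, frame_h=480))
# → [{'score': 0.9, 'box_px': (80, 120, 320, 360)}]
```

With the interpreter outputs from the loop, the call would look like `filter_detections(boxes[0], scores[0], frame.shape[1], frame.shape[0])`, since the tensors carry a leading batch dimension.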
Pattern 2: GStreamer + NNStreamer pipeline
NNStreamer integrates TFLite inference directly into GStreamer pipelines for zero-copy frame processing:
gst-launch-1.0 \
v4l2src device=/dev/video0 ! \
video/x-raw,format=RGB,width=640,height=480 ! \
videoconvert ! \
tensor_converter ! \
tensor_filter \
framework=tflite \
model=mobilenet_v2_int8.tflite \
accelerator=true:npu ! \
tensor_decoder mode=image_labeling \
option1=labels.txt ! \
overlay ! waylandsink
NNStreamer is included in packagegroup-rity-ai-ml in the RITY Yocto image.
Performance benchmarks on Genio 720
Approximate inference times on Genio 720 EVK with INT8 quantized models:
| Model | CPU (ms) | NPU (ms) | Speedup |
|---|---|---|---|
| MobileNetV2 | 45 | 8 | 5.6× |
| EfficientDet-Lite0 | 120 | 22 | 5.5× |
| YOLOv5s INT8 | 210 | 38 | 5.5× |
| BERT-base (text) | 850 | 180 | 4.7× |
In these measurements, NPU inference is roughly 4.7–5.6× faster than CPU. Because the CPU is free for other work while the NPU runs inference, the real-world benefit is larger than the raw latency numbers suggest.
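The figures above are approximate, so it is worth measuring on your own board and model. A minimal timing sketch using only the standard library; the `benchmark` helper and its warm-up and run counts are illustrative choices, not a NeuroPilot tool:

```python
import statistics
import time


def benchmark(invoke, warmup=5, runs=50):
    """Time a zero-argument callable; return median and p90 latency in ms."""
    for _ in range(warmup):
        invoke()  # warm-up runs absorb one-time initialization costs
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p90_ms": samples[int(0.9 * (len(samples) - 1))],
    }


# On a device, pass interpreter.invoke; a dummy workload stands in here.
stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

Running it once with the delegate and once without gives a CPU-versus-NPU comparison for your own model under the same conditions.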
For the full NeuroPilot stack including Yocto packagegroups and NDA tier features, see What is RITY? MediaTek’s Genio reference distribution explained. For a complete computer vision pipeline from camera to inference to display, see MediaTek Genio for computer vision.
Frequently Asked Questions
What AI frameworks run on MediaTek Genio NPU?
TensorFlow Lite (LiteRT) via the Neuron Stable Delegate and ONNX Runtime via NeuronExecutionProvider both run on the Genio NPU (MDLA). TFLite is the more mature path. ONNX Runtime with NeuronExecutionProvider is available out of the box on Genio 520 and 720; other platforms require opt-in via ENABLE_NEURON_EP=1.
Do I need cloud connectivity for AI inference on Genio?
No. The NeuroPilot NPU, TFLite, and ONNX Runtime all run entirely on-device. Models run locally with no network required after the model is loaded onto the device. This makes Genio suitable for air-gapped deployments, privacy-sensitive applications, and use cases where network latency is unacceptable.
What is the NEURON_FLAG_USE_FP16 flag and why is it mandatory?
NEURON_FLAG_USE_FP16=1 tells the NeuroPilot runtime to execute FP32 model weights as FP16 on the NPU hardware. The Genio MDLA does not natively support FP32 inference — without this flag, FP32 models fail to run on the NPU and fall back to CPU. Always set this flag when using NeuronExecutionProvider or the Neuron Stable Delegate.
Which Genio platform has the strongest NPU for on-device AI?
Genio 1200 (MT8395) has the highest NPU TOPS. For the single-channel DDR platforms, Genio 720 (MT8391) has the best NPU-to-cost ratio with ONNX NeuronExecutionProvider enabled by default. Genio 350 has no NPU — it is CPU/GPU inference only.
Written by
Andrés Campos, Co-Founder & CTO · ProventusNova
8 years deep in embedded systems, from underwater ROVs to edge AI. Andrés leads every technical delivery personally.