Edge AI in 2026: on-device models for mobile, browser, and embedded without cloud-first assumptions

An up-to-date overview of Edge AI as of March 22, 2026: Google AI Edge and MediaPipe LLM Inference, ExecuTorch, Core ML, ONNX Runtime GenAI, Transformers.js v4, browser WebGPU/WebNN, and current on-device deployment patterns.

As of March 22, 2026 it is no longer accurate to explain Edge AI as just a list of WebLLM, Core ML, TensorFlow Lite, and Raspberry Pi. The current on-device stack has changed:

  • at Google, the current framing goes through Google AI Edge, MediaPipe generative AI tasks, and the LLM Inference API;
  • the PyTorch edge story now clearly goes through ExecuTorch;
  • the browser story has moved from novelty demos to serious WebGPU and Transformers.js v4;
  • ONNX Runtime GenAI now matters as a generate-loop runtime, not just classic ONNX inference;
  • practical edge discussion today includes not only phones and browsers, but also governance, latency, power budgets, and the model conversion path.

So in 2026 it is more useful to understand Edge AI as an on-device inference architecture, not as a collection of disconnected SDKs.

Edge AI means the model runs directly on the user's device or close to it:
  • on a phone;
  • in the browser;
  • on an embedded board;
  • on a local gateway.
The main idea is not to "replace the cloud everywhere" but to provide:
  • low latency;
  • privacy;
  • offline behavior;
  • lower dependence on the network.
The old WebLLM / Core ML / TensorFlow Lite / Jetson framing is now too flat. The current edge stack is better explained through actual deployment lanes: Google AI Edge + MediaPipe, ExecuTorch, Core ML, ONNX Runtime GenAI, Transformers.js v4, and the browser WebGPU/WebNN direction.

The short version

Edge AI in 2026 is usually chosen by deployment lane:

Lane: current reference point (when to choose it)
  • Android / iOS native gen AI: Google AI Edge / MediaPipe LLM Inference (on-device mobile assistants)
  • Apple ecosystem: Core ML (iPhone/iPad/Mac on-device ML stack)
  • PyTorch-to-edge: ExecuTorch (if your model pipeline starts in PyTorch)
  • Cross-platform runtime: ONNX Runtime GenAI (portable generative loop and model serving)
  • Browser AI: Transformers.js v4, WebGPU (zero-install AI in browser apps)

What edge really means today

  • not "run the biggest model locally";
  • but "run the right model under hardware, power and latency constraints."
Old framing
Edge AI = browser demos, mobile SDKs, and Jetson examples.
Current framing in 2026
Edge AI = an on-device architecture choice across mobile, browser, and embedded runtimes, with model conversion, power, and latency constraints.
Prompt: Edge AI architecture
We need on-device text summarization and question answering on a phone without the cloud, plus a zero-install assistant right in the browser for the web app.
Model answer

These are already two different edge lanes: a native mobile runtime for the phone app and a WebGPU/browser runtime for the web. In 2026 you usually do not solve both with one SDK.

1. What Edge AI is now

The current edge discussion no longer boils down to the question "can a small model run on the device at all".

The right questions today:

  • how much memory the device has;
  • how much power can be spent;
  • how critical latency is;
  • whether offline operation is required;
  • who controls model conversion and updates;
  • how long the model must live on-device without a cloud fallback.

In other words, edge today is system design, not just model compression.
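A minimal sketch of these questions as a pre-flight check. Everything here is illustrative: the field names and the 50% memory-headroom rule are assumptions for the example, not from any SDK.

```javascript
// Hypothetical pre-flight check for an on-device workload. The field names
// and the 50% memory-headroom rule are illustrative, not from any SDK.
function edgeFeasible(device, workload) {
  const fitsMemory = workload.modelMB <= device.freeMemoryMB * 0.5;
  const fitsLatency = workload.expectedOnDeviceMs <= workload.maxLatencyMs;
  const fitsPower = !device.lowPowerMode || workload.toleratesThrottling;
  return fitsMemory && fitsLatency && fitsPower;
}

// Example: a 300 MB model on a phone with 1.5 GB of free memory.
const ok = edgeFeasible(
  { freeMemoryMB: 1500, lowPowerMode: false },
  { modelMB: 300, expectedOnDeviceMs: 200, maxLatencyMs: 500, toleratesThrottling: false },
);
// ok === true: the model fits with headroom and meets the latency budget
```

The point is not the exact thresholds but that memory, latency, and power are checked before any model quality discussion starts.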

2. Google AI Edge: the current official framing for mobile generative AI

Google's current edge site now explicitly puts generative AI tasks front and center.

Official story:

  • run LLMs and diffusion models on the edge;
  • use MediaPipe generative AI tasks;
  • cross-platform deployment with optimized hardware acceleration.

This is important because the older MediaPipe-only framing now looks too narrow. The current mental model should be:

  • Google AI Edge = umbrella;
  • MediaPipe LLM Inference = one of the main practical on-device lanes.

3. MediaPipe LLM Inference: actual on-device LLM path on Android and iOS

Official Android and iOS guides for LLM Inference make the current mobile lane very concrete.

Important facts:

  • runs fully on-device;
  • supports Android and iOS;
  • built for text generation, summarization, retrieval-like natural language tasks;
  • supports model conversion/customization paths and LoRA tuning;
  • guides explicitly mention Gemma-family and converted PyTorch LLMs.

This matters because MediaPipe today is no longer just about vision demos. It is a serious practical route for mobile generative AI.

4. Apple: Core ML remains foundational, but the right framing is "on-device ML platform"

Official Apple docs still make Core ML the foundation for on-device model integration.

The useful current framing is:

  • Core ML is not "one more inference library";
  • it is the base platform for shipping optimized models across Apple devices;
  • it leverages CPU, GPU and Neural Engine;
  • it supports on-device prediction and even on-device retraining/fine-tuning flows in supported contexts.

So when teams say "Apple edge AI", what they usually really mean is:

  • model conversion and packaging into Core ML compatible form;
  • app integration via Core ML stack;
  • optimization for Apple hardware path.

5. ExecuTorch: current PyTorch-native edge story

ExecuTorch docs now position it clearly as PyTorch’s solution for efficient AI inference on edge devices:

  • mobile phones;
  • wearables;
  • embedded systems.

This is a big shift from older summaries that often still mention PyTorch Mobile more prominently.

Current practical implication:

  • if your internal model stack is already PyTorch-first;
  • and you want deployment to edge devices;
  • then ExecuTorch is now the most current official PyTorch lane to consider.

6. Browser AI: no longer just a toy

Transformers.js

Official docs and the v4 preview blog show that browser AI has matured:

  • same API style across browser, server-side runtimes and desktop apps;
  • WebGPU runtime completely rewritten in C++;
  • broad task support across text, vision, audio and multimodal;
  • explicit device: 'webgpu' support;
  • fallback to CPU/WASM when needed.

This matters because browser AI in 2026 is no longer just "cute local inference demo". It is increasingly practical for:

  • zero-install assistants;
  • client-side summarization;
  • local extraction;
  • privacy-sensitive front-end features.
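The explicit `device: 'webgpu'` option plus CPU/WASM fallback can be sketched as a simple capability check. `pickDevice` is a hypothetical helper for illustration, not part of the Transformers.js API:

```javascript
// Hypothetical device selection: prefer WebGPU when the browser exposes
// navigator.gpu, otherwise fall back to the WASM/CPU backend.
// pickDevice is not part of the Transformers.js API.
function pickDevice(nav) {
  return nav && "gpu" in nav ? "webgpu" : "wasm";
}

// In a real page: const device = pickDevice(navigator);
// then pass { device } when creating a pipeline.
```

Keeping the check in one place makes it easy to log which backend real users actually get.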

WebGPU and resource constraints

Even with better runtimes, browser AI still lives under:

  • VRAM constraints;
  • browser compatibility;
  • download-size friction;
  • session lifecycle issues.

So browser edge is strongest when:

  • model size is modest;
  • user benefit from zero-install is high;
  • partial offline/privacy is valuable.
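Download-size friction is easy to estimate up front. This is back-of-envelope arithmetic only (weights = parameters times bits per weight), ignoring tokenizer files and metadata:

```javascript
// Back-of-envelope weight download size: parameters * bits per weight / 8,
// ignoring tokenizer files and metadata. Purely illustrative arithmetic.
function approxDownloadMB(numParams, bitsPerWeight) {
  return (numParams * bitsPerWeight) / 8 / 1e6;
}

// A 500M-parameter model at 4-bit quantization is roughly 250 MB,
// already a noticeable first-load download for a web app.
const mb = approxDownloadMB(500e6, 4); // 250
```

Running this estimate before picking a model keeps the "zero-install vs. download cost" trade-off honest.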

7. ONNX Runtime GenAI: more than classic ONNX inference

Official ONNX Runtime GenAI docs describe a preview generate() API with:

  • tokenization and pre-processing;
  • generate loop;
  • logits processing;
  • search and sampling;
  • KV cache management;
  • structured output for tool calling.

This is important because old ONNX framing was mostly:

  • export model to ONNX;
  • run inference.

Current GenAI framing is richer:

  • ONNX Runtime is now trying to cover the actual generative loop, not only single forward passes.

That makes it more relevant for real edge and embedded generative applications.
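To see why a generate loop is more than a single forward pass, here is a toy greedy decoding loop in plain JavaScript. `forward` is a stand-in for the model, not an ONNX Runtime call; real runtimes add KV-cache reuse, sampling strategies, and logits processing on top of this skeleton:

```javascript
// Toy greedy generate loop: repeatedly run the model, take the argmax token,
// append it, and stop at an end token. Runtimes like ONNX Runtime GenAI wrap
// this loop (plus KV cache and sampling) behind a single generate() call.
function greedyGenerate(forward, promptTokens, endToken, maxNewTokens) {
  const tokens = [...promptTokens];
  for (let i = 0; i < maxNewTokens; i++) {
    const logits = forward(tokens);                   // one forward pass per step
    const next = logits.indexOf(Math.max(...logits)); // greedy "sampling"
    tokens.push(next);
    if (next === endToken) break;
  }
  return tokens;
}

// Stand-in model with a 4-token vocabulary that always prefers token 3 (end).
const fakeForward = () => [0.1, 0.2, 0.3, 0.9];
const out = greedyGenerate(fakeForward, [1, 2], 3, 8);
// out is [1, 2, 3]: one step generated, then the end token stops the loop
```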

8. Edge is about deployment lanes, not model families

One common mistake is to compare edge frameworks by model brand.

A better current comparison is by deployment lane:

Native mobile lane

  • MediaPipe LLM Inference
  • Core ML
  • ExecuTorch

Browser lane

  • Transformers.js
  • WebGPU runtimes

Portable runtime lane

  • ONNX Runtime GenAI

Embedded/board lane

  • native vendor stacks, compact runtimes, model-specific pipelines

This is much closer to how real engineering decisions are made.
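The lane comparison above can be sketched as a first-pass decision function. All names and branches here are illustrative; real decisions also weigh team skills and model availability per runtime:

```javascript
// Hypothetical lane picker mirroring the deployment-lane comparison.
// Illustrative only: real choices also depend on team skills and model support.
function pickLane(req) {
  if (req.target === "browser") return "Transformers.js / WebGPU";
  if (req.target === "embedded") return "vendor stack / compact runtime";
  if (req.target === "mobile") {
    if (req.appleOnly) return "Core ML";
    if (req.pytorchFirst) return "ExecuTorch";
    return "MediaPipe LLM Inference";
  }
  return "ONNX Runtime GenAI"; // portable default for everything else
}
```

Example: `pickLane({ target: "mobile", appleOnly: true })` returns `"Core ML"`.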

9. Why edge AI is chosen

Current edge AI is usually chosen for four reasons:

  • privacy;
  • latency;
  • offline operation;
  • cost avoidance at scale.

Not every app needs all four. But when at least two of them matter strongly, edge becomes very compelling.

10. Where edge AI is already strong

Current sweet spots:

  • on-device text rewriting and summarization;
  • transcription or post-processing on device;
  • privacy-sensitive local assistants;
  • browser-side extraction and small local chat;
  • embedded anomaly detection and compact assistants;
  • UI features where cloud round-trip would hurt UX.

11. Where edge AI still loses to cloud

Edge still struggles more when:

  • model size must be frontier-large;
  • multi-step reasoning is critical;
  • multimodal depth is very high;
  • latency tolerance is okay but quality demand is extreme;
  • updates and experimentation must be centralized.

That is why many mature 2026 systems are hybrid:

  • edge for first-pass or privacy-preserving operations;
  • cloud for escalation.
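The hybrid escalation pattern can be sketched in a few lines. Both model functions here are synchronous stand-ins; a real implementation would be async, network-aware, and careful about what data it sends off-device:

```javascript
// Sketch of the hybrid pattern: answer on-device first, escalate to cloud
// when local confidence is low. Both models are synchronous stand-ins.
function hybridAnswer(question, localModel, cloudModel, minConfidence) {
  const local = localModel(question);
  if (local.confidence >= minConfidence) return { ...local, source: "edge" };
  return { ...cloudModel(question), source: "cloud" };
}

// Stubbed models: the local one is cheap but unsure, the cloud one stronger.
const localStub = (q) => ({ text: "short answer", confidence: 0.4 });
const cloudStub = (q) => ({ text: "long answer", confidence: 0.95 });
const answer = hybridAnswer("What changed?", localStub, cloudStub, 0.7);
// answer.source === "cloud" because 0.4 < 0.7
```

The confidence threshold is the key product decision: lowering it keeps more traffic on-device at the cost of answer quality.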

12. Current practical heuristics

If you are building a mobile app

  • start by deciding Apple-first, Android-first or both;
  • only then pick Core ML, MediaPipe, or ExecuTorch;
  • do not start from abstract benchmark tables.

If you are building browser AI

  • ask whether zero-install is worth model download cost;
  • choose small enough models;
  • benchmark on real user hardware, not just your dev machine.

If you need a portable runtime

  • evaluate ONNX Runtime GenAI, especially if model conversion/export already sits in your pipeline.

13. For the developer

Browser lane with Transformers.js

import { pipeline } from "@huggingface/transformers";

// Create a sentiment pipeline and request the WebGPU backend.
const pipe = await pipeline("sentiment-analysis", undefined, {
  device: "webgpu",
});

// Inference then runs fully in the browser tab.
const result = await pipe("On-device inference keeps this text local.");

MediaPipe mobile mindset

1. Pick a supported or convertible on-device model.
2. Validate memory footprint on target devices.
3. Benchmark latency and battery impact.
4. Add fallback or smaller model path for weaker hardware.
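Step 4 can be made concrete as a model-tier choice driven by free device memory. The thresholds and tier names below are illustrative, not tied to any specific model family or SDK:

```javascript
// Step 4 as code: choose a model tier from free device memory.
// Thresholds and tier names are illustrative only.
function pickModelTier(freeMemoryMB) {
  if (freeMemoryMB >= 3000) return "full";   // larger on-device model
  if (freeMemoryMB >= 1200) return "small";  // heavily quantized variant
  return "cloud-fallback";                   // device too constrained
}
```

Example: `pickModelTier(1500)` returns `"small"`, so a weaker phone ships the quantized variant instead of failing.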

Practical engineering posture

  • measure memory, thermals and power, not only accuracy;
  • keep model update story simple;
  • choose one deployment lane first;
  • add hybrid fallback if edge quality is insufficient.

Pros

  • Edge AI in 2026 has much clearer official lanes: Google AI Edge, ExecuTorch, Core ML, Transformers.js and ONNX Runtime GenAI
  • On-device deployment is now practical for many bounded generative AI tasks
  • Browser and mobile runtimes are far more mature than old demo-era summaries suggest
  • Edge delivers real wins in privacy, latency and offline resilience

Cons

  • Quality ceilings and hardware fragmentation still matter a lot
  • Model conversion and packaging remain non-trivial
  • Browser AI still lives under strict resource and compatibility constraints
  • Hybrid cloud-edge architecture is often still required for best user outcomes

Check yourself

1. What most accurately describes Edge AI in 2026?

2. When is `ExecuTorch` especially appropriate?

3. Why can browser AI in 2026 no longer be dismissed as a toy?