SLM в 2026: Phi-4-mini, Gemma 3/3n, Qwen3 и Llama 3.2 для локального запуска

Актуальный обзор Small Language Models на 22 марта 2026: Phi-4-mini, Gemma 3 и Gemma 3n, Qwen3, Llama 3.2, когда small models уже достаточно, как выбирать размер, что запускать локально и где границы SLM.

На 22 марта 2026 уже неточно объяснять SLM через старую рамку Phi-4 + Gemma 2 + Qwen2.5, как будто small-model landscape застыл в 2025. Current practical lineup уже другой:

у Microsoft главным small default стал Phi-4-mini-instruct, а не только базовый Phi-4;
у Google small/open story moved from Gemma 2 to Gemma 3 и особенно Gemma 3n для on-device use;
у Qwen current baseline уже Qwen3, а не Qwen2.5;
у Meta small local lane по-прежнему practical строится вокруг Llama 3.2 1B / 3B.

Поэтому в 2026 SLM полезнее понимать не как "маленькие версии больших моделей", а как отдельный operational class для local, edge, offline и latency-sensitive сценариев.

SLM - это small language models: модели, которые реально можно запускать локально на ноутбуке, мини-ПК, в embedded-сценариях, а иногда и прямо на телефоне. Они не обязаны быть "самыми умными", их задача другая:

быстро отвечать;
влезать в ограниченную память;
работать без облака;
стоить почти ноль после скачивания.

Старая рамка Gemma 2 / Qwen2.5 / benchmark race уже устарела. Current SLM discussion больше крутится вокруг Phi-4-mini, Gemma 3/3n, Qwen3, Llama 3.2, on-device memory budgets, context windows, hybrid reasoning modes и practical local deployment.

Семейство	Current роль	Когда выбирать
`Phi-4-mini-instruct`	сильный small reasoning/general default	ноутбуки, enterprise local apps, multilingual assistant
`Gemma 3 1B/4B`	compact open model family от Google	local experiments, lightweight apps, open Gemma stack
`Gemma 3n`	on-device multimodal lane	phone/laptop, low-memory multimodal scenarios
`Qwen3 4B/8B`	strong multilingual + hybrid thinking open models	local copilot, coding, Russian/multilingual tasks
`Llama 3.2 1B/3B`	mature small Meta baseline	simplest local baseline, wide ecosystem, easy quantization

1. Что такое SLM сейчас

Current SLM уже нельзя честно описывать только через parameter count вроде 0.5B-14B.

Практически small model today определяется не только размером, но и тем, что она оптимизирована под:

low memory footprint;
local inference;
fast cold start;
lower latency;
simpler deployment;
optional edge and on-device use.

Поэтому в 2026 полезнее думать о SLM как о deployment class, а не как о "маленьком LLM".

2. Почему SLM в 2026 стали намного важнее

Рост SLM идёт не только из-за open weights. Есть более practical причины:

облачные API дороги на high-frequency internal workloads;
privacy and sovereignty стали жёстче;
enterprise teams хотят offline fallback;
coding copilots и internal helpers часто не требуют frontier intelligence;
new small architectures дают гораздо лучшее quality-per-parameter, чем раньше.

Именно поэтому current local stack уже не выглядит как "компромисс для бедных". Во многих сценариях это уже sane default.

3. Phi-4-mini: current Microsoft small default

Official Microsoft Azure Phi page прямо пишет, что Phi-4-mini и Phi-4-multimodal - newest models in the Phi family.

Model card for Phi-4-mini-instruct даёт ещё более useful picture:

3.8B dense decoder-only model;
128K context;
multilingual support;
built for memory/compute constrained and latency-bound scenarios;
positioned for strong reasoning, especially math and logic, relative to size.

Это делает Phi-4-mini одним из лучших current defaults, если нужен:

local assistant on laptop;
enterprise document helper;
small reasoning-oriented model;
good quality without moving to 7B-14B immediately.

Практический вывод:

если вы раньше держали в голове Phi-3.5-mini или только full Phi-4, current practical reference уже Phi-4-mini.

4. Gemma 3 и Gemma 3n: Google теперь играет в small/open stack иначе

Gemma 3

Official Gemma docs now describe Gemma 3 as the core open family.

Что важно:

sizes are 1B, 4B, 12B, 27B;
1B is text-first small lane;
larger models add image understanding and longer context;
this is already a very different lineup from Gemma 2.

For SLM discussion especially relevant are:

Gemma 3 1B as compact text model;
Gemma 3 4B as stronger local default while still manageable on consumer hardware.

Gemma 3n

Current Gemma 3n docs are even more important for SLM/edge framing.

Google explicitly positions it for:

phones;
laptops;
tablets;
audio, text and vision input;
parameter-efficient on-device processing.

Model card for gemma-3n-E2B-it explains the practical trick:

raw model size is larger, but effective runtime memory can look closer to a traditional 2B class model.

Это очень важный 2026 signal:

current small-model race уже идёт не только по parameter count;
идёт по effective on-device footprint and mixed-modality utility.

5. Qwen3: current small open baseline вместо Qwen2.5

Official Qwen blog уже давно перевёл основную story from Qwen2.5 to Qwen3.

Что важно для small-model overview:

current dense sizes include 0.6B, 1.7B, 4B, 8B;
Qwen3 combines thinking and non-thinking modes in one model family;
strong multilingual and tool-use posture remains a core differentiator.

Model cards for Qwen3-4B and Qwen3-8B are useful because they show:

32K native context and 131K with YaRN;
Apache 2.0 licensing;
practical recommendation for local use with Ollama, LM Studio, MLX, llama.cpp and others in the official blog.

Это делает Qwen3 особенно useful when:

you need multilingual quality;
Russian and mixed-language prompts matter;
coding and reasoning both matter;
you want one family spanning tiny local model to stronger 8B lane.

6. Llama 3.2: still current as small local Meta baseline

Даже после всех vendor updates Llama 3.2 1B / 3B остаётся очень важным reference point.

Official Meta model cards on Hugging Face keep the small lane simple:

1B and 3B instruct models;
small enough for lightweight local setups;
broad ecosystem support;
mature quantization and inference tooling.

Сегодня Llama 3.2 уже не выглядит как most capable small family, but it still matters because:

almost every local tool supports it;
community recipes are everywhere;
it is a stable baseline for comparing newer small families.

Практически:

если нужно самое predictable local experience, Llama 3.2 1B/3B всё ещё полезны;
если нужен stronger quality-per-size, current Phi-4-mini, Gemma 3, Qwen3 often look better.

7. Как мыслить о размерах в 2026

Вместо старой привязки "до 14B = SLM" полезнее использовать deployment bands:

Ultra-small

Примерно sub-1B to 1B.

Когда полезно:

phones;
very low-RAM devices;
simple classification or templated assistants;
extreme latency constraints.

Practical local default

Примерно 3B to 4B.

Сейчас это, пожалуй, sweet spot:

влезает в consumer setups;
уже достаточно умно для internal assistants;
можно квантизовать и запускать локально без боли.

Strong local

Примерно 7B to 8B.

Когда useful:

local coding helper;
multilingual chat;
heavier RAG;
if you can tolerate more memory and a bit less speed.

Это лучше описывает рынок, чем старое "всё до 14B одинаково small".

8. Где SLM реально уже достаточно

Current SLM особенно хороши для:

локального chat assistant по внутренним материалам;
RAG over internal docs;
prompt templating and summarization;
classification and extraction;
code autocomplete / routine coding help;
offline support tools;
kiosk, browser, edge or on-device UX.

Во всех этих задачах frontier cloud model часто не обязателен.

9. Где SLM всё ещё проигрывают облаку

Даже current best small models всё ещё чаще уступают cloud frontier models в:

long-horizon reasoning;
research and deep synthesis;
difficult tool orchestration;
multimodal generality outside special cases;
reliability on weird edge cases.

Практическая истина здесь простая:

SLM are often enough for bounded tasks;
cloud still wins for open-ended difficult work.

10. Локальный выбор по памяти

Вместо старых charts с токенами в секунду полезнее держать такую грубую operational рамку:

Память устройства	Что обычно реалистично
`4-8 GB`	ultra-small и квантизованные `1B-3B`
`8-16 GB`	комфортный запуск `3B-4B`, иногда `7B` в aggressive quantization
`16-32 GB`	strong local band `7B-8B`, более длинный контекст, better UX

Это не строгий benchmark, а practical planning heuristic. Реальный fit зависит от:

quantization format;
CPU vs GPU vs unified memory;
context length;
batch size;
toolchain.

11. Что изменилось в SLM design

Current small models побеждают не только за счёт масштаба данных, но и за счёт design choices:

better post-training;
synthetic and reasoning-dense data;
hybrid thinking / non-thinking modes;
more efficient vocabularies and attention layouts;
parameter-efficient on-device architectures.

Это видно по current vendor stories:

Microsoft emphasises reasoning density in Phi-4-mini;
Google emphasises on-device architecture in Gemma 3n;
Qwen emphasises think/non-think switching;
Meta emphasises ecosystem and deployability.

12. Как их запускать сегодня

Current practical local stack обычно такой:

Ollama for quick start;
LM Studio for desktop GUI;
llama.cpp / GGUF when you want maximum control;
MLX if you are on Apple Silicon;
vLLM / server runtimes only when you move from personal local use to real serving.

Для SLM особенно важно, что все эти tools now make small local deployment boring in a good way: less heroic tinkering, more ordinary engineering.

13. Для разработчика

Hugging Face path

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "microsoft/Phi-4-mini-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

Local runtime mindset

1. Start with a 3B-4B class model.
2. Quantize if memory is tight.
3. Test on your real prompts, not public benchmarks.
4. Only move upward to 7B-8B if quality is still insufficient.

Practical selection heuristics

choose Phi-4-mini when you want strongest reasoning-ish small default;
choose Gemma 3n when on-device and multimodal footprint matter;
choose Qwen3 4B/8B when multilingual quality and local coding matter;
choose Llama 3.2 when ecosystem maturity matters more than raw quality-per-size.

Плюсы

SLM в 2026 уже стали реальным deployment class для local, offline и edge-сценариев
Phi-4-mini, Gemma 3/3n, Qwen3 и Llama 3.2 закрывают разные practical tiers по памяти и качеству
Current small models заметно лучше старых по quality-per-parameter
Локальный запуск через Ollama, LM Studio, MLX и llama.cpp стал гораздо более предсказуемым

Минусы

Даже лучшие SLM всё ещё часто уступают облаку на open-ended complex reasoning
Память, квантизация и runtime по-прежнему сильно влияют на опыт
Один universal best SLM не существует: deployment constraints matter more than benchmark screenshots
Small multimodal and on-device stories всё ещё более fragmented, чем text-only local inference

Проверьте себя

1. Что сильнее всего изменилось в SLM-ландшафте к 2026 году?

{ "text": "Он по-прежнему крутится в основном вокруг Gemma 2 и Qwen2.5", "correct": false, "explanation": "Нет. Current small-model frame уже moved to Phi-4-mini, Gemma 3/3n, Qwen3 and Llama 3.2." } { "text": "SLM стали отдельным operational class для local, edge и latency-sensitive deployment", "correct": true, "explanation": "Верно. Это уже не просто 'меньшие LLM'." } { "text": "Small models перестали быть нужны из-за GPT-5", "correct": false, "explanation": "Нет. Напротив, local/edge demand вырос." }

2. Когда `Gemma 3n` особенно логично рассматривать?

{ "text": "Когда важен on-device сценарий на телефоне, ноутбуке или планшете", "correct": true, "explanation": "Да. Google именно так и позиционирует Gemma 3n." } { "text": "Только как облачный API", "correct": false, "explanation": "Нет. Речь как раз про on-device family." } { "text": "Только для massive server clusters", "correct": false, "explanation": "Нет. Это opposite of its current positioning." }

3. Какой practical выбор чаще всего разумен для начала local deployment?

{ "text": "Сразу брать самый большой доступный open model", "correct": false, "explanation": "Нет. Это часто лишняя нагрузка на память и latency." } { "text": "Начать с 3B-4B класса и подниматься выше только если качества не хватает", "correct": true, "explanation": "Верно. Это самый sane default для current local engineering." } { "text": "Всегда использовать только 0.5B модели", "correct": false, "explanation": "Нет. Это слишком узко и часто недостаточно по качеству." }

Источники

Ollama в 2026: локальный model runtime с tools, thinking, structured outputs и cloud bridge

llama.cpp и GGUF в 2026: low-level local runtime, hybrid CPU+GPU inference и current quantization reality