NVIDIA Nemotron 3 Nano Omni
The biggest infra-native model launch of the day. NVIDIA introduced Nemotron 3 Nano Omni, an open 30B / A3B multimodal MoE with 256K context, built for agentic workloads spanning text, image, video, audio, and documents.
Distribution was immediate across the stack: OpenRouter, LM Studio, Ollama, Unsloth, fal, Fireworks, DeepInfra, Together, Baseten, Canonical, and others all announced same-day availability.
Key specs from the follow-on posts: this is NVIDIA's first omni release with speech/audio understanding, backed by a Parakeet encoder; audio is English-only for now, with a 5.95% WER on the Open ASR leaderboard. Several hosts cited ~9× throughput versus comparable open omni models.
Mini-SGLang: What "match" Actually Does
One-line recall: match = how much of the prompt's prefix can already be reused from the radix tree.
Plain analogy: you go to a library to find a book. The librarian (the match function) tells you "we already have chapters 1-5 of Harry Potter on the shelf, chapter 6 onwards is new."
Visualization
Suppose the radix tree currently looks like this (each [...] is a node):
root
│
[You are a fitness coach.] ← 14-token shared prefix
│
[How should I train]
╱ ╲
[chest?]  [abs?] ← two existing branches

A new prompt comes in:
"You are a fitness coach. How should I train abs? What else should I watch out for?"

How match Walks the Tree
match(root, prompt_tokens):
step 1: walk down from root
"You are a fitness coach." → in tree, share 14 tokens
step 2: walk to next node "[How should I train]"
"How should I train" → in tree, share 5 more tokens
step 3: at branch point, check children for "abs"
found, walk into [abs?] → share 4 more tokens
step 4: continue with "What else should I watch out for?"
children don't have this token → STOP
→ return: matched_node=[abs?], n_matched=23
(first 23 tokens already in tree, the rest is new)
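To make the walk concrete, here is a minimal Python sketch of a token-level radix-tree match. The Node class and the handling of partial matches are my simplification, not mini-sglang's actual code (real implementations split a node when the prompt diverges mid-chunk):

class Node:
    def __init__(self, tokens=()):
        self.tokens = list(tokens)   # the token chunk stored at this node
        self.children = {}           # first token of a child's chunk -> child Node

def match(root, prompt_tokens):
    """Walk down from root; return (matched_node, n_matched)."""
    node, i = root, 0
    while i < len(prompt_tokens):
        child = node.children.get(prompt_tokens[i])
        if child is None:
            break                    # no child starts with the next token -> STOP
        # walk this child's chunk as far as it agrees with the prompt
        k = 0
        while (k < len(child.tokens) and i + k < len(prompt_tokens)
               and child.tokens[k] == prompt_tokens[i + k]):
            k += 1
        node, i = child, i + k
        if k < len(child.tokens):
            break                    # diverged inside the chunk (real trees split the node here)
    return node, i                   # node: where new tokens attach; i: tokens that skip recompute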
What the Two Return Values Mean

matched_node, n_matched = match(root, prompt_tokens)

matched_node → the node the walk ended at (where new stuff gets attached)
n_matched    → total tokens matched (these don't need KV recompute)

Why match and insert Must Pair Up
match → tells you "first 23 tokens skip recompute"
insert → attaches "from token 24 onwards" into the tree + KV cache
no match: don't know what's reusable → recompute everything → wasted
no insert: computed but didn't save → next time recompute → wasted
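A sketch of how the pair shows up in the serving loop. The tree/model API here (prefill, node.kv, tree.insert) is hypothetical, just to show the shape, not SGLang's real interface:

def handle_prompt(tree, prompt_tokens, model):
    # 1. match: find how much of the prefix is already cached
    matched_node, n_matched = match(tree.root, prompt_tokens)
    new_tokens = prompt_tokens[n_matched:]           # only this suffix is new
    # 2. compute KV only for the new suffix, reusing the cached prefix
    kv = model.prefill(new_tokens, past_kv=matched_node.kv)
    # 3. insert: save the suffix so the NEXT request can match it
    tree.insert(matched_node, new_tokens, kv)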
Unsloth: LoRA Finetuning Modes

A) Non-merged mode (peft default):
inference: y = (W + α*B@A) * x
→ slightly slower (extra low-rank matmuls + an add per layer)
→ flexible: one base model can carry multiple adapters
(English / Chinese / medical / ...)

B) Merged mode (merge_and_unload):
bake A, B into W: W_new = W + α*B@A
→ inference is identical speed to a normal model
→ but W_new is 14 GB, you lose the "lightweight" advantage
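The two modes in peft terms, as a minimal sketch (the adapter path and output dir are placeholders):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")

# A) Non-merged: W stays frozen, the adapter adds α*B@A at runtime.
model = PeftModel.from_pretrained(base, "./my-lora-adapter")
# One base model can carry several adapters and hot-swap between them:
#   model.load_adapter("./medical-lora", adapter_name="medical")
#   model.set_adapter("medical")

# B) Merged: bake W_new = W + α*B@A into the weights. Same speed as a
# plain model, but the saved checkpoint is full-size again.
merged = model.merge_and_unload()
merged.save_pretrained("./qwen3-8b-merged")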
Mimicking Dream of the Red Chamber Style with Unsloth

The size delta is striking:
Qwen3-8B (base): 16 GB
LoRA adapter: 167 MB ← 100x smaller

I trained a small adapter and ran a "West Lake travel diary in Dream of the Red Chamber style" generation. The output had Baoyu and Daiyu having the kind of conversation you'd expect, with the period vocabulary and rhythm coming through. The actual reasoning didn't quite hold together, but the format was right; with a larger base model the same approach should work cleanly.
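For reference, a sketch of the adapter setup with Unsloth. The model name and hyperparameters are illustrative, not the exact ones from my run:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",   # placeholder repo name
    max_seq_length=2048,
    load_in_4bit=True,               # keeps the 16 GB base manageable on one GPU
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16,             # small rank -> the ~167 MB adapter
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# Train on style-matched text (e.g. with TRL's SFTTrainer), then save
# just the adapter, not the full model:
#   model.save_pretrained("redchamber-lora")   # ~167 MB, not 16 GB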
TRL: HuggingFace's Transformer Reinforcement Learning
TRL (Transformer Reinforcement Learning) is HuggingFace's training library covering the full post-training stack: SFT / DPO / PPO / GRPO.
What I worked through today:
- What TRL is: HuggingFace training library, the SFT/DPO/PPO/GRPO family
- DPO data format: (prompt, chosen, rejected)
- DPO loss intuition: push chosen up, push rejected down, but anchor against the reference model (see the loss sketch after this list)
- On-policy vs off-policy distinction
- Wrote a complete DPO training loop in pseudocode
- Compared pseudocode to TRL's real code, understood the sigmoid loss
- Ran it end to end: data generation + DPOTrainer + before/after comparison (trainer sketch after this list)
- Spotted the overfitting warning signs (margin too high, loss too low)
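Two sketches from the above. First, the sigmoid loss in isolation; the log-probs are assumed to be summed over completion tokens, and beta=0.1 is illustrative:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # how much the policy moved relative to the frozen reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # push chosen up, rejected down; the reference terms are the anchor
    return -F.logsigmoid(chosen_reward - rejected_reward)

And the end-to-end shape with TRL's DPOTrainer. The model name and the one-row dataset are placeholders, not what I actually trained on:

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO data format: (prompt, chosen, rejected)
train_dataset = Dataset.from_list([
    {"prompt": "Explain LoRA in one sentence.",
     "chosen": "LoRA trains small low-rank matrices on top of frozen weights.",
     "rejected": "LoRA is a fruit."},
])

args = DPOConfig(output_dir="dpo-demo", beta=0.1)
trainer = DPOTrainer(model=model, args=args,
                     train_dataset=train_dataset,
                     processing_class=tokenizer)
trainer.train()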