NVIDIA Nemotron 3 Nano Omni
The biggest infra-native model launch of the day. NVIDIA introduced Nemotron 3 Nano Omni, an open 30B / A3B multimodal MoE with 256K context, built for agentic workloads spanning text, image, video, audio, and documents.
Distribution was immediate across the stack: OpenRouter, LM Studio, Ollama, Unsloth, fal, Fireworks, DeepInfra, Together, Baseten, Canonical, and others all announced same-day availability.
Key specs from the follow-on posts: this is NVIDIA's first omni release with speech/audio understanding, backed by a Parakeet encoder; audio is English-only for now, with a 5.95% WER on the Open ASR leaderboard. Several hosts cited ~9× throughput versus comparable open omni models.
Mini-SGLang: What "match" Actually Does
One-line recall: match = how much of the prompt's prefix can already be reused from the radix tree.
Plain analogy: you go to a library to find a book. The librarian (the match function) tells you "we already have chapters 1-5 of Harry Potter on the shelf, chapter 6 onwards is new."
Visualization
Suppose the radix tree currently looks like this (each [...] is a node):
root
│
[You are a fitness coach.] ← 14-token shared prefix
│
[How should I train]
╱ ╲
[chest?]  [abs?] ← two existing branches

A new prompt comes in:
"You are a fitness coach. How should I train abs? What else should I watch out for?"

How match Walks the Tree
match(root, prompt_tokens):
step 1: walk down from root
"You are a fitness coach." → in tree, share 14 tokens
step 2: walk to next node "[How should I train]"
"How should I train" → in tree, share 5 more tokens
step 3: at branch point, check children for "abs"
found, walk into [abs?] → share 4 more tokens
step 4: continue with "What else should I watch out for?"
children don't have this token → STOP
→ return: matched_node=[abs?], n_matched=23
(first 23 tokens already in tree, the rest is new)
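To make the walk concrete, here is a minimal Python sketch of a token-level radix-tree match. The Node class and the handling of partial matches are my simplification, not mini-sglang's actual code (real implementations split a node when the prompt diverges mid-chunk):

class Node:
    def __init__(self, tokens=()):
        self.tokens = list(tokens)   # the token chunk stored at this node
        self.children = {}           # first token of a child's chunk -> child Node

def match(root, prompt_tokens):
    """Walk down from root; return (matched_node, n_matched)."""
    node, i = root, 0
    while i < len(prompt_tokens):
        child = node.children.get(prompt_tokens[i])
        if child is None:
            break                    # no child starts with the next token -> STOP
        # walk this child's chunk as far as it agrees with the prompt
        k = 0
        while (k < len(child.tokens) and i + k < len(prompt_tokens)
               and child.tokens[k] == prompt_tokens[i + k]):
            k += 1
        node, i = child, i + k
        if k < len(child.tokens):
            break                    # diverged inside the chunk (real trees split the node here)
    return node, i                   # node: where new tokens attach; i: tokens that skip recompute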
What the Two Return Values Mean

matched_node, n_matched = match(root, prompt_tokens)

matched_node → the node the walk ended at (where new stuff gets attached)
n_matched    → total tokens matched (these don't need KV recompute)

Why match and insert Must Pair Up
match → tells you "first 23 tokens skip recompute"
insert → attaches "from token 24 onwards" into the tree + KV cache
no match: don't know what's reusable → recompute everything → wasted
no insert: computed but didn't save → next time recompute → wasted
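A sketch of how the pair shows up in the serving loop. The tree/model API here (prefill, node.kv, tree.insert) is hypothetical, just to show the shape, not SGLang's real interface:

def handle_prompt(tree, prompt_tokens, model):
    # 1. match: find how much of the prefix is already cached
    matched_node, n_matched = match(tree.root, prompt_tokens)
    new_tokens = prompt_tokens[n_matched:]           # only this suffix is new
    # 2. compute KV only for the new suffix, reusing the cached prefix
    kv = model.prefill(new_tokens, past_kv=matched_node.kv)
    # 3. insert: save the suffix so the NEXT request can match it
    tree.insert(matched_node, new_tokens, kv)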
Unsloth: LoRA Finetuning Modes

A) Non-merged mode (peft default):
inference: y = (W + α*B@A) * x
→ slightly slower (extra low-rank matmuls + an add per layer)
→ flexible: one base model can carry multiple adapters
(English / Chinese / medical / ...)

B) Merged mode (merge_and_unload):
bake A, B into W: W_new = W + α*B@A
→ inference is identical speed to a normal model
→ but W_new is 14 GB, you lose the "lightweight" advantage
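The two modes in peft terms, as a minimal sketch (the adapter path and output dir are placeholders):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")

# A) Non-merged: W stays frozen, the adapter adds α*B@A at runtime.
model = PeftModel.from_pretrained(base, "./my-lora-adapter")
# One base model can carry several adapters and hot-swap between them:
#   model.load_adapter("./medical-lora", adapter_name="medical")
#   model.set_adapter("medical")

# B) Merged: bake W_new = W + α*B@A into the weights. Same speed as a
# plain model, but the saved checkpoint is full-size again.
merged = model.merge_and_unload()
merged.save_pretrained("./qwen3-8b-merged")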
Mimicking Dream of the Red Chamber Style with Unsloth

The size delta is striking:
Qwen3-8B (base): 16 GB
LoRA adapter: 167 MB ← 100x smaller

I trained a small adapter and ran a "West Lake travel diary in Dream of the Red Chamber style" generation. The output had Baoyu and Daiyu having the kind of conversation you'd expect, with the period vocabulary and rhythm coming through. The actual reasoning didn't quite hold together, but the format was right; with a larger base model the same approach should work cleanly.
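For reference, a sketch of the adapter setup with Unsloth. The model name and hyperparameters are illustrative, not the exact ones from my run:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",   # placeholder repo name
    max_seq_length=2048,
    load_in_4bit=True,               # keeps the 16 GB base manageable on one GPU
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16,             # small rank -> the ~167 MB adapter
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# Train on style-matched text (e.g. with TRL's SFTTrainer), then save
# just the adapter, not the full model:
#   model.save_pretrained("redchamber-lora")   # ~167 MB, not 16 GB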
TRL: HuggingFace's Transformer Reinforcement Learning
TRL (Transformer Reinforcement Learning) is HuggingFace's training library covering the full post-training stack: SFT / DPO / PPO / GRPO.
What I worked through today:
- What TRL is: HuggingFace training library, the SFT/DPO/PPO/GRPO family
- DPO data format: (prompt, chosen, rejected)
- DPO loss intuition: push chosen up, push rejected down, but anchor against the reference model (see the loss sketch after this list)
- On-policy vs off-policy distinction
- Wrote a complete DPO training loop in pseudocode
- Compared pseudocode to TRL's real code, understood the sigmoid loss
- Ran it end to end: data generation + DPOTrainer + before/after comparison (trainer sketch after this list)
- Spotted the overfitting warning signs (margin too high, loss too low)
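Two sketches from the above. First, the sigmoid loss in isolation; the log-probs are assumed to be summed over completion tokens, and beta=0.1 is illustrative:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # how much the policy moved relative to the frozen reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # push chosen up, rejected down; the reference terms are the anchor
    return -F.logsigmoid(chosen_reward - rejected_reward)

And the end-to-end shape with TRL's DPOTrainer. The model name and the one-row dataset are placeholders, not what I actually trained on:

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO data format: (prompt, chosen, rejected)
train_dataset = Dataset.from_list([
    {"prompt": "Explain LoRA in one sentence.",
     "chosen": "LoRA trains small low-rank matrices on top of frozen weights.",
     "rejected": "LoRA is a fruit."},
])

args = DPOConfig(output_dir="dpo-demo", beta=0.1)
trainer = DPOTrainer(model=model, args=args,
                     train_dataset=train_dataset,
                     processing_class=tokenizer)
trainer.train()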