Meta's New Model: Muse Spark
Meta abandoned the Llama series and launched a new model family called Muse. The first release is Muse Spark, positioned as "small and fast" — the entry-level version of the series. Led by Alexandr Wang (whose company Scale AI was acquired by Meta), the team spent 9 months rebuilding the entire AI training architecture from scratch.
The most important number: 10x efficiency improvement. Same capability, requiring only one-tenth the compute of Llama 4 Maverick. This is a massive engineering breakthrough.
| Benchmark | Muse Spark | Claude Opus |
|---|---|---|
| CharXiv Reasoning | 86.4 ✅ #1 | lower |
| HealthBench Hard | 42.8 ✅ | 14.8 |
| SWE-Bench Verified | 77.4 ⚠️ weak point | 80.8 |
Medical reasoning is particularly strong — the training data was curated by over 1,000 doctors.
Contemplating mode: parallel multi-agent reasoning — not one model thinking alone, but multiple sub-agents reasoning in parallel, then aggregating results.
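How such a "contemplating" pass could be wired up is easy to sketch. The following is purely illustrative: ask_model, the prompts, and the fan-out width are assumptions, not Meta's implementation.

```python
import asyncio

# Hypothetical stand-in for an LLM call; not Meta's actual API.
async def ask_model(prompt: str, seed: int) -> str:
    return f"(draft answer for seed {seed})"  # replace with a real endpoint call

async def contemplate(question: str, n_agents: int = 4) -> str:
    # Fan out: several sub-agents reason over the same question in parallel.
    drafts = await asyncio.gather(
        *(ask_model(f"Think step by step: {question}", seed=i) for i in range(n_agents))
    )
    # Fan in: one aggregation pass reconciles the parallel drafts into a single answer.
    merged = "\n".join(f"Draft {i}: {d}" for i, d in enumerate(drafts))
    return await ask_model(f"Merge these drafts into one answer:\n{merged}", seed=-1)

print(asyncio.run(contemplate("What limits LLM training throughput?")))
```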
Strategic shift: from open to closed source. Meta no longer wants to be "Linux for AI" — instead directly deploying AI across its product matrix (WhatsApp, Instagram, Threads, Ray-Ban glasses) to reach billions of ordinary users.
Meta.ai's 16 Hidden Tools
Researcher Simon Willison directly asked meta.ai's chat interface "what tools do you have?" — it answered honestly, exposing 16 hidden tools: browser search/open, Python code interpreter, HTML artifact creation (similar to Claude Artifacts), pixel-level visual grounding, sub-agent spawning, personal content search (Instagram/Threads/Facebook), file editing, and Google/Outlook calendar + email integration.
Most interesting: visual_grounding. Tested on an AI-generated image of a raccoon, it identified 12 whiskers (each with coordinates), 8 paws, eyes, ears, and a position box for the garbage hat. This isn't a call to an external API; it's the model's native capability, surfaced through tool calls.
Two New Thoughts
I. AI Tools Are Essentially a New Kind of Game
Claude Code, Codex — these "skills + AI" tools look like productivity tools, but structurally they're closer to a variant of games. The core mechanism that makes games addictive is called agency — you control a character, the character actually affects the world, and this immediate feedback of "my action has effects" is the root of addiction. AI agent tools give you exactly the same feeling.
But what makes it more interesting than games: a game's world is closed — lose, restart, consequences are contained. AI agents operate on the real world — real files, real deployments, real consequences. This combines the satisfaction of games with the sense of meaning from the real world — an unprecedented interaction paradigm.
II. Vibe Coding Is to Traditional Programming What Web Fiction Is to Classic Literature
Before web novels, "writing a novel" had barriers: you needed to polish your prose and get past publishers before anyone could read it. Web novels removed the gatekeepers; the threshold for writing and publishing dropped, and quantity and variety exploded.
Vibe Coding is taking the same path. Previously, "making software" required knowing architecture, debugging, and deployment. Now the threshold is: do you have a real need? Tools built with AI may look architecturally chaotic to a veteran programmer, but as long as they solve real problems, they will find large audiences.
Hidden here is the embryo of a new profession — not the people writing code, but the people who can judge whether AI-written code is usable.
Nanotron: Deep Dive on Parallelism
One-sentence Mental Model
All parallelism in LLM training answers the same question: when the model, data, optimizer states, and activations can't fit in one GPU's memory, which dimension do you cut, what stays local, what gets transmitted, who computes, and who communicates?
The Six Parallelism Strategies
- DP (Data Parallel): same model copied to all GPUs, each sees different batches. Communication: all-reduce gradients (once per step). Best choice when model fits on one GPU.
- TP (Tensor Parallel): cut the weight matrices (W_q/W_o/W_1/W_2 by column or row; see the sketch after this list). Communication: all-reduce activations (2x per layer). Use within a single machine with NVLink.
- PP (Pipeline Parallel): cut by layers. Communication: point-to-point (send activation, recv gradient). Usable across machines with low bandwidth.
- SP (Sequence Parallel): cut activation's sequence dimension. Reduces activation memory significantly. Usually coupled with TP.
- CP (Context Parallel): more aggressive sequence cutting for long contexts (≥32K), cutting K/V cache.
- EP (Expert Parallel): MoE models only — route tokens to different expert GPUs via all-to-all communication.
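To make the cutting concrete, here is a minimal sketch of Megatron-style tensor parallelism for one MLP block in PyTorch. It is not Nanotron's actual code: it assumes `torch.distributed` is already initialized with one rank per GPU, and the class name `TensorParallelMLP` is ours.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class TensorParallelMLP(nn.Module):
    """Megatron-style TP for an MLP block: W_1 split by column, W_2 split by row.
    Assumes dist.init_process_group() has already been called (one rank per GPU)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        world = dist.get_world_size()
        assert d_ff % world == 0
        # Each rank holds a column slice of W_1 and the matching row slice of W_2.
        self.w1 = nn.Linear(d_model, d_ff // world, bias=False)   # column-parallel
        self.w2 = nn.Linear(d_ff // world, d_model, bias=False)   # row-parallel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No communication here: every rank already has the full activation x.
        partial = self.w2(torch.nn.functional.gelu(self.w1(x)))
        # One all-reduce per MLP block sums the partial outputs.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```

Attention follows the same pattern, with W_q/W_k/W_v split by column and W_o by row, which is where the second all-reduce per layer comes from.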
ZeRO: Not a New Dimension, but a Sharding Optimization on Top of DP
- ZeRO-1: shard the optimizer states (Adam m, v); saves 8 bytes × params. Almost free (see the PyTorch sketch after this list).
- ZeRO-2: + shard gradients
- ZeRO-3 (= FSDP): + shard weights themselves — extreme memory savings, but more all-gather calls
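A minimal sketch of how these stages map onto PyTorch's built-in wrappers (ZeroRedundancyOptimizer for ZeRO-1, FSDP for ZeRO-3). This assumes the default process group is already initialized and is not Nanotron's own implementation.

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.nn.parallel import DistributedDataParallel as DDP

def zero1_setup(model: torch.nn.Module):
    # ZeRO-1 on top of plain DP: weights and gradients stay replicated (DDP),
    # only the Adam m/v states are sharded across ranks.
    model = DDP(model.cuda())
    opt = ZeroRedundancyOptimizer(model.parameters(),
                                  optimizer_class=torch.optim.Adam, lr=3e-4)
    return model, opt

def zero3_setup(model: torch.nn.Module):
    # ZeRO-3 (FSDP): weights, gradients, and optimizer states are all sharded;
    # parameters are all-gathered on the fly for each forward/backward pass.
    model = FSDP(model.cuda())
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)
    return model, opt
```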
Training Memory Formula
Training memory ≈ 16 bytes × N_params + activations
- Weights (bf16): 2 bytes × N_params
- Gradients (bf16): 2 bytes × N_params
- Adam states (fp32 m, v): 8 bytes × N_params
- Master weights (fp32): 4 bytes × N_params

By this rule, LLaMA-7B already can't fit on a single 80 GB H100 for training (it needs well over 100 GB). This is why all serious LLM training uses at least ZeRO-1 + TP.
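A quick back-of-the-envelope check of that formula in Python; the 8-way ZeRO-1 split at the end is our own illustrative assumption.

```python
# 16 bytes/param rule for a 7B-parameter model, before counting activations.
N = 7e9
bytes_per_param = 2 + 2 + 8 + 4          # weights + grads + Adam m,v + fp32 master copy
static_gib = N * bytes_per_param / 2**30
print(f"{static_gib:.0f} GiB before activations")   # ~104 GiB, already over an 80 GiB H100

# With ZeRO-1 across 8 GPUs, the 8-byte Adam states shard down to 1 byte/param per rank.
zero1_gib = N * (2 + 2 + 4 + 8 / 8) / 2**30
print(f"{zero1_gib:.0f} GiB per GPU with ZeRO-1 on 8 ranks")  # ~59 GiB
```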
Picotron vs Nanotron
Picotron teaches you "how each parallelism works on its own." Nanotron teaches you "how they combine, plus checkpoint resume, FP8 training, and sequence parallelism." The former is a blackboard; the latter is a factory.