Generative Engine Optimization (GEO)
GEO will be an increasingly important focus area. AI models are already showing preferences among one another, which creates a new optimization surface beyond traditional SEO.
Mythos: A Nuclear Weapon in Code
Mythos scores 93.9% on SWE-bench Verified, up from 80.8% for the previous version — an unprecedented 13-point jump that means it "crushes any programming task, including finding security vulnerabilities in software."
My take: Mythos has become a nuclear weapon. The reason it's not being released publicly is that anyone could use it to find security vulnerabilities, which could cause catastrophic damage to the existing ecosystem. It's like everyone fighting with guns, and suddenly someone shows up with a nuclear weapon.
Anthropic launched Project Glasswing instead of a public release, allowing a handful of partners (AWS, Apple, Google, Microsoft, NVIDIA, Cisco, CrowdStrike, JPMorgan Chase, Linux Foundation) to use the system, with $100M in credits and $4M for open-source security work. This signals that cutting-edge AI is rapidly shifting from impressive benchmarks to controlled cyber infrastructure rather than consumer products.
Picotron: Hands-on Distributed Training (Part 1)
Running a 360M-parameter LLaMA model on random data. First-step result: loss = 11.0.
Is loss = 11.0 reasonable? The vocabulary has 49,152 tokens, and uniform random guessing over it gives a cross-entropy of ln(49152) ≈ 10.8. So 11.0 means the model hasn't learned anything yet, exactly as expected after a single step.
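A quick way to verify that number (a minimal sketch; the 49,152-token vocabulary is taken from the run above):

```python
import math

import torch
import torch.nn.functional as F

# At initialization the model is roughly a uniform guesser over the vocabulary,
# so the expected cross-entropy is ln(vocab_size).
vocab_size = 49_152
print(math.log(vocab_size))  # ~10.80, close to the observed 11.0

# The same number falls out of cross_entropy on uniform logits:
logits = torch.zeros(1, vocab_size)            # softmax of all-zeros is uniform
target = torch.tensor([123])                   # any token id
print(F.cross_entropy(logits, target).item())  # ~10.80
```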
Key Distributed Training Concepts
- LOCAL_RANK: which GPU am I on within this machine; starts at 0 on every node.
- RANK (global rank): which GPU am I across all machines. With 2 machines of 4 GPUs each, RANK goes from 0 to 7.
- WORLD_SIZE: total number of GPUs participating in training (8 in that example). A minimal initialization sketch follows this list.
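Here is a minimal sketch of how these three values are typically read and used to set up the process group (assuming torchrun-style environment variables and one process per GPU; this is not Picotron's actual code):

```python
import os

import torch
import torch.distributed as dist

# torchrun sets these environment variables for every process it launches.
local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this machine
rank = int(os.environ["RANK"])               # global index across all machines
world_size = int(os.environ["WORLD_SIZE"])   # total number of processes/GPUs

# One process per GPU: bind this process to its local GPU, then join the group.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

print(f"rank {rank}/{world_size} running on cuda:{local_rank}")
```

Launched with something like `torchrun --nnodes=2 --nproc_per_node=4 train.py`, each of the 8 processes sees its own LOCAL_RANK, RANK, and WORLD_SIZE.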
dist.barrier(): A synchronization wall. All GPUs must reach this point before any can continue — like meeting at a mountain waypoint before hiking together.
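A common use of the barrier (a small illustrative sketch, not from the tutorial; `prepare_data_once` and `load_prepared_data` are hypothetical helpers): let rank 0 do one-time work while everyone else waits.

```python
import torch.distributed as dist

# Assumes the process group from the sketch above is already initialized.
if dist.get_rank() == 0:
    prepare_data_once()        # hypothetical: download/tokenize once, on rank 0 only
dist.barrier()                 # every other rank waits here until rank 0 is done
data = load_prepared_data()    # hypothetical: now safe to read on all ranks
```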
NCCL (pronounced "nickel"): NVIDIA's GPU communication library, optimized for GPU-to-GPU data transfer. With NVLink, it can reach hundreds of GB/s.
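In training, what NCCL mostly executes is collectives such as all-reduce. A minimal sketch using the NCCL process group from above:

```python
import torch
import torch.distributed as dist

# Each rank contributes a tensor living on its own GPU.
x = torch.full((4,), float(dist.get_rank()), device="cuda")

# NCCL moves the data GPU-to-GPU (over NVLink when available) and sums in place.
dist.all_reduce(x, op=dist.ReduceOp.SUM)

# Every rank now holds the same result: 0 + 1 + ... + (world_size - 1).
print(dist.get_rank(), x)
```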
Language Model Training Target
input_ids  = [I, love, eating]      → [5, 12, 8]
target_ids = [love, eating, apples] → [12, 8, 33]
The target is the same sequence shifted by one position: at every position, the label is the next token. The model learns: given the previous tokens, predict the next one.
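A minimal sketch of how this pair is built from one tokenized sequence (the token ids are the illustrative ones from the example above):

```python
import torch

# Full tokenized sequence: "I love eating apples" (illustrative ids from above).
tokens = torch.tensor([5, 12, 8, 33])

input_ids = tokens[:-1]    # [5, 12, 8]   -> "I love eating"
target_ids = tokens[1:]    # [12, 8, 33]  -> "love eating apples"

# During training, the logits at position i are scored against target_ids[i],
# i.e. the model predicts token i+1 from tokens 0..i.
print(input_ids.tolist(), target_ids.tolist())
```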
Data Parallel Step 5 vs Step 6: Naive vs Bucket
Naive DP: each parameter's gradient is all-reduced individually and synchronously. With hundreds of parameter tensors, that's hundreds of all-reduce calls.
Bucket DP: pack parameters into ~25 MB buckets. When every parameter in a bucket has its gradient ready, fire an asynchronous all-reduce (async_op=True), so computation and communication overlap. This is the core idea behind PyTorch DDP; a sketch of both variants follows the comparison table below.
| | Step 5 Naive | Step 6 Bucket |
|---|---|---|
| Communication | Sync, per-parameter all-reduce | Async, per-bucket all-reduce |
| Gradient precision | bf16 (same as params) | float32 accumulation, more stable |
| Compute/comm overlap | None | Yes (async_op=True) |
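Below is a minimal sketch of the two synchronization strategies (my own simplification, not Picotron's actual Step 5/6 code; it assumes the process group is initialized and gradients already exist, and it runs after backward() rather than from backward hooks the way real DDP does):

```python
import torch
import torch.distributed as dist


def naive_grad_sync(model):
    # Step 5 style: one blocking all-reduce per parameter tensor.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # blocks until finished
            p.grad /= world_size


def bucketed_grad_sync(model, bucket_mb=25):
    # Step 6 style: pack gradients into ~25 MB float32 buckets and launch
    # asynchronous all-reduces so communication can overlap with other work.
    world_size = dist.get_world_size()
    bucket_bytes = bucket_mb * 1024 * 1024

    # Greedily group parameters into buckets of roughly bucket_mb megabytes.
    buckets, current, current_size = [], [], 0
    for p in model.parameters():
        if p.grad is None:
            continue
        current.append(p)
        current_size += p.grad.numel() * 4  # float32 bytes once copied into the bucket
        if current_size >= bucket_bytes:
            buckets.append(current)
            current, current_size = [], 0
    if current:
        buckets.append(current)

    # Launch one async all-reduce per bucket on a flattened float32 buffer.
    handles = []
    for bucket in buckets:
        flat = torch.cat([p.grad.detach().reshape(-1).float() for p in bucket])
        handle = dist.all_reduce(flat, op=dist.ReduceOp.SUM, async_op=True)
        handles.append((handle, flat, bucket))

    # Wait for the communication, average, and copy the result back into the grads.
    for handle, flat, bucket in handles:
        handle.wait()
        flat /= world_size
        offset = 0
        for p in bucket:
            n = p.grad.numel()
            p.grad.copy_(flat[offset:offset + n].view_as(p.grad))
            offset += n
```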
Pipeline Parallel: AFAB and 1F1B Schedules
PP splits the model by layer: GPU 0 handles layers 0-7, GPU 1 handles 8-15, etc. Data flows like an assembly line.
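For concreteness, a tiny illustrative split of a 32-layer model across 4 stages (hypothetical numbers matching the example above):

```python
# Which layers live on which pipeline stage, for 32 layers over 4 GPUs.
num_layers, num_stages = 32, 4
layers_per_stage = num_layers // num_stages

for stage in range(num_stages):
    start = stage * layers_per_stage
    end = start + layers_per_stage
    print(f"GPU {stage} handles layers {start}-{end - 1}")
# GPU 0 handles layers 0-7, GPU 1 handles layers 8-15, ...
```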
AFAB (All Forward All Backward): first complete all micro-batch forwards, then all backwards. Simple but creates large pipeline bubbles — GPUs sit idle waiting.
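The AFAB schedule for one stage, in sketch form (forward_step and backward_step are hypothetical per-micro-batch helpers, not Picotron's API):

```python
def all_forward_all_backward(num_microbatches):
    # AFAB: run every micro-batch forward first, then every backward.
    for _ in range(num_microbatches):
        forward_step()    # hypothetical: forward for one micro-batch
    for _ in range(num_microbatches):
        backward_step()   # hypothetical: backward for one micro-batch
```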
1F1B: alternate forward and backward every step. Warmup phase fills the pipeline; steady state runs 1F1B continuously; cooldown drains it. The idle slots are only at the start and end — the middle is fully utilized.
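And the 1F1B schedule for one stage, in the same sketch style (again with hypothetical forward_step/backward_step helpers; only the warmup/steady/cooldown structure is the point):

```python
def one_forward_one_backward(num_microbatches, stage, num_stages):
    # Warmup: earlier stages run extra forwards to fill the pipeline.
    num_warmup = min(num_stages - stage - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup

    for _ in range(num_warmup):
        forward_step()

    # Steady state: alternate one forward with one backward per micro-batch.
    for _ in range(num_steady):
        forward_step()
        backward_step()

    # Cooldown: drain the backwards that are still outstanding.
    for _ in range(num_warmup):
        backward_step()
```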
1F1B also uses bidirectional communication: in steady state, a GPU simultaneously sends forward results to the next stage AND receives backward gradients from it — packed into one batch_isend_irecv call.
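A minimal sketch of that fused exchange (assuming this stage already knows the rank of the next stage and the activation shape; not Picotron's actual code):

```python
import torch
import torch.distributed as dist

def send_forward_recv_backward(output_activations, next_stage_rank):
    # Buffer for the gradient that the next stage sends back to us.
    grad_from_next = torch.empty_like(output_activations)

    ops = [
        dist.P2POp(dist.isend, output_activations, next_stage_rank),  # activations forward
        dist.P2POp(dist.irecv, grad_from_next, next_stage_rank),      # gradients backward
    ]
    # One fused call launches both transfers; then wait for both to complete.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return grad_from_next
```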