Medical-LLM refactor: four findings
- ml-intern v0.1 cannot run unattended overnight. The current turn_complete heuristic is a soft check with no deliverable verification: it declares a turn done before the work actually is (see the sketch after this list).
- Multi-format task interference. Adding 3k PubMedQA samples pushed PubMedQA up by +28.6 but dragged MedQA down by −4.2 and MedMCQA by −9.3. Mixing yes/no/maybe, A/B/C/D, and free-text outputs in one model pulls the output head in conflicting directions.
- iCliniq schema bug + score-scale mismatch. The schema bug is fixed. The scores still can't be compared directly with the ChatDoctor 2023 numbers; we need to switch to the raw BERTScore protocol to make them comparable.
- SFT plateaus within the first third of training. Under packing + 1 epoch, step 50 of 175 (about 29% of total steps) is already plateau territory; the remaining ~70% of compute buys nothing.
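A minimal sketch of what a harder completion check could look like, replacing the soft turn_complete heuristic with deliverable verification. Everything here is hypothetical (function name, artifact paths); the point is that "done" should require the promised artifacts to actually exist.

```python
from pathlib import Path

def turn_complete(model_says_done: bool, expected_artifacts: list[str]) -> bool:
    """Hypothetical hard check: a turn only counts as complete when the model
    claims it is done AND every promised deliverable exists and is non-empty."""
    if not model_says_done:
        return False
    missing = [p for p in expected_artifacts
               if not Path(p).exists() or Path(p).stat().st_size == 0]
    return len(missing) == 0

# Example: an overnight SFT run should leave behind a checkpoint and an eval report.
print(turn_complete(True, ["runs/sft/checkpoint-175/model.safetensors",
                           "runs/sft/eval_report.json"]))
```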
Architectural differences across five frontier models
Scale and sparsity. The five models span the full spectrum from small dense to massive MoE. Gemma 4 (31B) is the only pure dense architecture — all 31B parameters activate on every forward pass. The other four are MoE: total parameters are huge, but only a small slice activates per inference. GLM-5.1 (744B) activates ~40B; Kimi K2.6 (1T) activates ~32B; DeepSeek V4-Pro (1.6T) activates ~49B.
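A quick back-of-the-envelope on what that sparsity means per token, using only the figures quoted above (Qwen 3.6 omitted because no activation figure is given here):

```python
# Active vs. total parameters per forward pass, in billions, from the numbers above.
models = {
    "Gemma 4 (dense)":  {"total": 31,   "active": 31},
    "GLM-5.1":          {"total": 744,  "active": 40},
    "Kimi K2.6":        {"total": 1000, "active": 32},
    "DeepSeek V4-Pro":  {"total": 1600, "active": 49},
}
for name, p in models.items():
    print(f"{name:>16}: {p['active']}B of {p['total']}B active "
          f"({p['active'] / p['total']:.1%} of weights per token)")
```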
Attention — the biggest divergence point.
- Gemma 4 uses a 5:1 ratio of local-to-global attention. Local layers use sliding-window with 32 heads; global layers are sparser at 10 heads. A lightweight variant of classical multi-head attention.
- Qwen 3.6 takes the most aggressive hybrid: Gated DeltaNet (linear attention) interleaved with full attention at a 3:1 ratio. The first three layers run linear attention + MoE; layer four runs full attention + MoE. Compute drops sharply.
- GLM-5.1 and Kimi K2.6 both adopt Multi-head Latent Attention (MLA, originally introduced in DeepSeek V2), which compresses the KV cache into a compact per-token latent and cuts memory significantly; a rough cache-size sketch follows this list. The two architectures are nearly twins.
- DeepSeek V4-Pro stratifies MLA further: it distinguishes ordinary Compressed Self-Attention (CSA) from Heavy Compressed Attention (HCA), and introduces mhC (manifold-constrained hyperconnections) — four parallel residual streams replacing the single x + F(x) skip connection. The most structurally complex of the five.
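To make the MLA point concrete, here is a toy cache-size comparison: standard multi-head attention caches full per-head keys and values, while MLA-style attention caches one compressed latent per token and re-expands K/V on the fly. All dimensions are invented for illustration (they are not the GLM-5.1 or Kimi K2.6 configs), and the real scheme also caches a small decoupled RoPE key that this sketch ignores.

```python
# Toy KV-cache footprint: standard MHA vs. MLA-style latent caching (hypothetical dims).
n_layers, n_heads, head_dim = 60, 64, 128
latent_dim = 512                     # compressed KV latent per token
seq_len, bytes_per_val = 128_000, 2  # bf16

# MHA caches K and V for every head at every layer.
mha_cache = n_layers * seq_len * 2 * n_heads * head_dim * bytes_per_val
# MLA caches a single latent per token per layer; K/V are up-projected at attention time.
mla_cache = n_layers * seq_len * latent_dim * bytes_per_val

print(f"MHA cache: {mha_cache / 1e9:.1f} GB")
print(f"MLA cache: {mla_cache / 1e9:.1f} GB ({mha_cache / mla_cache:.0f}x smaller)")
```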
MoE routing strategy. DeepSeek V4-Pro's first three blocks use hash-based MoE instead of learned top-k routing, which sidesteps the routing instability that plagues early training. The other models all use standard learned routing.
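A minimal sketch of the contrast. Learned routing scores experts with a trainable gate, so early in training the gate and the experts co-adapt and can destabilize each other; hash routing fixes the token-to-expert assignment up front, so there is nothing to drift. Shapes and the modulo "hash" below are illustrative, not DeepSeek V4-Pro's actual scheme.

```python
import torch

n_experts, top_k, d_model = 8, 2, 16

def learned_topk_route(x: torch.Tensor, gate: torch.nn.Linear):
    """Standard learned routing: a trainable gate scores experts per token."""
    scores = gate(x).softmax(dim=-1)              # (tokens, n_experts)
    weights, experts = scores.topk(top_k, dim=-1)
    return experts, weights

def hash_route(token_ids: torch.Tensor) -> torch.Tensor:
    """Hash routing: expert choice is a fixed function of the token id."""
    return token_ids % n_experts

x = torch.randn(4, d_model)
gate = torch.nn.Linear(d_model, n_experts, bias=False)
print(learned_topk_route(x, gate)[0])                 # data-dependent; shifts as the gate trains
print(hash_route(torch.tensor([101, 2057, 317, 9])))  # fixed: tensor([5, 1, 5, 1])
```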
Context length. From Gemma 4's 256k up to Qwen 3.6 and DeepSeek V4-Pro at 1M tokens. The gap is large, and behind it sits a different RoPE extension strategy for each model.
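What "RoPE extension strategy" cashes out to, in generic terms (an illustration of the two common families, not any of these models' actual recipes): position interpolation squeezes new positions back into the trained range, while NTK-style base scaling enlarges the rotary base so low-frequency dimensions span more positions.

```python
import numpy as np

def rope_inv_freq(head_dim: int, base: float = 10_000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies for the even dimensions."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

head_dim, scale = 128, 4                 # e.g. stretching a 256k window toward 1M
positions = np.arange(0, 1_000, 100)

# Linear position interpolation: rescale positions, keep the frequencies.
angles_pi = (positions / scale)[:, None] * rope_inv_freq(head_dim)

# NTK-style base scaling: keep positions, raise the rotary base instead.
ntk_base = 10_000.0 * scale ** (head_dim / (head_dim - 2))
angles_ntk = positions[:, None] * rope_inv_freq(head_dim, base=ntk_base)

print(angles_pi.shape, angles_ntk.shape)  # (10, 64) (10, 64)
```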
Project Deal: agents trading among themselves
Anthropic recently ran Project Deal, a week-long internal test where 69 employees handed buying and selling decisions entirely to Claude agents — no human approval. Each participant started with a $100 budget. The agents created listings, negotiated prices, accepted offers, and completed trades inside Slack.
The experiment produced 186 deals worth more than $4,000 across over 500 listed items. It also exposed a clear gap in model quality: Claude Opus agents consistently got better prices and closed more deals than Haiku versions — and most users did not realize they were getting worse results.
Takeaway: AI commerce may arrive faster than expected, but not all agents will be equal. The model tier gap shows up clearly when agents negotiate against each other.
GeoGuessr + time
An online GeoGuessr variant that adds a time-of-day component. The format is dead simple — just an online game — but surprisingly fun. Worth thinking about: minimum-viable mechanics often beat clever ones, especially when the core loop is genuinely entertaining.
You are the most expensive model
The argument is sharp: the real cost of AI agents is your time, not the API bill. McDonald's would never put its CEO on the burger grill — that hour is worth $9,230. Same logic for AI: you don't need a frontier model to check your to-do list. Don't pay 75¢ every half hour ($1,095 per month) to let Claude Opus do something a smaller model could do for cents.
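The quoted monthly figure checks out if the agent runs around the clock; the cheaper-model price below is a placeholder assumption, not a number from the article.

```python
# 75 cents every half hour, 24/7, for a month.
opus_per_check = 0.75                 # $ per half-hourly check (quoted)
checks_per_month = 48 * 30.4          # every 30 minutes, ~30.4 days
print(f"Opus on to-do duty: ${opus_per_check * checks_per_month:,.0f}/month")    # ~$1,094

small_per_check = 0.03                # assumed price for a small model
print(f"Small model instead: ${small_per_check * checks_per_month:,.0f}/month")  # ~$44
```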
The framing implies a four-step framework for keeping AI costs in check; the first cost to budget is your own attention spent supervising the agent. The second insight: stop herding agents across terminals and branches; bundle each task into a single workspace with a living spec, agent notes, and full change visibility. Orchestrate agents like a system, not a swarm.