nanochat: scaling depth and the FP8 training trick
Working through nanochat with 8x RTX Pro 6000s. Setup notes: configure wandb, set the wandb token, bump the version in pyproject.toml, run inside screen.
Going from depth 12 → 24: layers double → forward pass does roughly 2x the matrix multiplies → training time doubles. Holding data volume, batch size, and everything else constant:
total_time = steps × time_per_step = const × 2 = 2x
Traditional BF16 training vs the FP8 trick
Traditional BF16 (mainstream):
forward: W_bf16 × x_bf16 → y_bf16 (bf16 precision)
backward: dW = dL/dy × x.T (bf16)
update: W += -lr × dW (bf16)
memory: W (16-bit) + grad (16-bit) + Adam m,v (16-bit each)
= 8 bytes/param
The FP8 training trick: the key insight is that the matmul inside forward goes to fp8, but everything around it stays in bf16/fp32.
forward:
1. quantize W (bf16 → fp8): scale = max(|W|) / fp8_max
W_fp8 = round(W / scale)
2. quantize x (bf16 → fp8): same idea
3. matmul W_fp8 @ x_fp8 → y_fp32 (accumulate in fp32!)
4. cast y_fp32 back to bf16 for the next layer
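The four forward steps above can be sketched in NumPy. This is a simulation under stated assumptions, not nanochat's actual implementation. NumPy has no fp8 dtype, so the hypothetical helper `fake_fp8` rounds values onto a grid with 3 mantissa bits that approximates the e4m3 format. `FP8_E4M3_MAX = 448` is that format's largest finite value. Scaling is assumed to be per-tensor, and the function names are illustrative only.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the e4m3 format


def fake_fp8(x):
    """Simulate e4m3 rounding: keep 3 mantissa bits, clip to the fp8 range."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    mag = np.abs(x)
    # Exponent of each value, clamped at the smallest normal exponent (-6).
    exp = np.floor(np.log2(np.maximum(mag, 2.0 ** -6)))
    step = 2.0 ** (exp - 3)  # spacing between neighbouring representable values
    return (np.round(x / step) * step).astype(np.float32)


def quantize(t):
    """Steps 1 and 2: per-tensor scale from the absolute max, then round."""
    scale = max(np.max(np.abs(t)) / FP8_E4M3_MAX, 1e-12)  # avoid divide-by-zero
    return fake_fp8(t / scale), scale


def fp8_matmul(w, x):
    w_q, w_scale = quantize(w)
    x_q, x_scale = quantize(x)
    # Step 3: multiply the quantized tensors, accumulating in fp32,
    # then undo both scales.
    y = (w_q.astype(np.float32) @ x_q.astype(np.float32)) * (w_scale * x_scale)
    # Step 4 would cast y to bf16; NumPy has no bf16, so it stays fp32 here.
    return y.astype(np.float32)


rng = np.random.default_rng(0)
w = rng.standard_normal((32, 64)).astype(np.float32)
x = rng.standard_normal((64, 8)).astype(np.float32)
exact = w @ x
approx = fp8_matmul(w, x)
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

The point of the sketch is that only the operands of the matmul are low precision; the accumulation and the rescaling happen in fp32, which is why the error stays small.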
backward: similar quantization, but gradient accumulation always in fp32
Adam optimizer m, v state: always fp32 (this is the bulk of memory)
Coding agent shootout: Claude Code, Claude Design, Cursor, Codex
Designed a single landing-page brief — a satirical "Token Waster" SaaS — and ran the same prompt through four agents to compare design ability, autonomy, and code output.
The brief: build a complete marketing landing page for a fictitious B2B SaaS called "Token Waster", with navbar, hero (including a fake real-time "tokens wasted today" counter), three feature sections, social proof, three-tier pricing, FAQ, footer. Next.js 15 + Tailwind + shadcn/ui, single file. Restrained brand palette (no more than two primary colors plus grayscale), real copy not lorem ipsum, dark mode and mobile considered. 30-minute time budget.
Models picked at the highest tier each tool offered:
- Cursor: Composer 2
- Codex: GPT-5.5 (highest)
- Claude Code: Opus 4.7 (highest)
- Claude Design: no model selector visible
Completion times — Claude Code 13 min, Cursor 9 min, Codex 12 min. Claude Design felt faster but I lost track since switching between panels was clunky.
Interruption count during the run: Cursor allowed 3 mid-flight messages; Codex allowed 3 but doesn't really pause — it keeps running the prior task while replying to new messages, and won't change permissions mid-task.
Claude Code hit a hydration bug: its counter component's useState initializer computed the starting value from Math.random() and Date.now(), so the number rendered on the server never matched the number from the client's first render, and React's hydration check failed. The fix is to render the same static seed on server and client, then swap to live values after mount. Once I surfaced the bug, Claude Code patched it correctly.
GPT-5.5's page felt visually dense. Looking at its system prompt, it leans toward "dense but organized info" — which doesn't translate well to a marketing landing page where breathing room and hierarchy matter more than information density.
Cursor SDK takes WolfBench
Cursor Agent + GPT-5.5 currently sits at #1 on Terminal-Bench 2.0. Cursor opened up the same runtime and harness for third-party embedding via the Cursor SDK.
Claude Code small UX bug
Conversations that started before "auto mode" existed cannot be switched into auto mode after the fact.
GPT-5.5 vs vulnerability discovery
Per the AI Safety Institute, the publicly accessible GPT-5.5 (not the announced GPT-5.5-Cyber tier) is roughly on par with Claude Mythic at finding vulnerabilities. Earlier reporting noted Anthropic considered Claude Mythic "too dangerous to release publicly." Possible readings: either the original framing was overstated, or the model is hard to serve at scale relative to Opus.
Apache 2.0 is a meaningful detail
So when an email mentions "IBM Granite 4.1 is Apache 2.0" or "SenseNova U1 is Apache 2.0", the subtext is: you can download the model, drop it into your product, sell it, and IBM / SenseTime won't come after you and won't ask for a cut. For enterprise users this is important information — with a Llama-style conditional license, legal needs to convene first to decide if you can use it at all.
Apache 2.0 also has the patent grant clause: if you use Apache 2.0 code, the author also licenses the relevant patents to you for free. That's the layer it has over MIT and the reason enterprises feel safer — the original author can't turn around and sue you for patent infringement on their own code.
One-line memory: Apache 2.0 = "go ahead and use it, commercial use is fine, just credit me."
2023–2025: where AI value got captured
From 2023 to 2025, all of the AI value got captured at the infrastructure layer. NVIDIA's blockbuster May 2023 earnings call — stock jumped 25% after the print — formally marked the start of the AI trade. In 2024, Vistra and GE Vernova were among the best S&P 500 performers (up 265% and 146% respectively) as the market priced in electricity as the binding constraint. In 2025, attention turned to memory: SanDisk, Western Digital, Seagate, and Micron all up over 200% year-on-year. Plenty of other infrastructure names beat broadly, riding AI capex.
Two pieces of that landscape worth unpacking together: VR NVL72 and "neoclouds." They're two faces of the same story.
VR NVL72: from chips to systems
VR is short for Vera Rubin, the astronomer, continuing NVIDIA's tradition of naming architectures after scientists: Pascal, Volta, Turing, Ampere (A100), Hopper (H100), Blackwell (B200/B300), and next is Rubin. VR is two generations after Hopper, expected to start shipping in 2026.
NVL72 is the sales form factor. NVIDIA used to sell you single cards (one H100); now it sells you entire racks: 72 GPUs fully NVLink-interconnected in one liquid-cooled cabinet you wheel in and plug power into. The shift over the past two years has been from "selling chips" to "selling systems," because the bottleneck for large-model training and inference long ago stopped being single-card FLOPs and moved to inter-GPU bandwidth. Shipping a full rack guarantees the 72 cards get full NVLink bandwidth internally, with InfiniBand between racks for the supercluster.
Why VR NVL72 looks so cheap on the price-per-PFLOP chart ($0.29/PFLOP): Rubin doubles single-card FLOPs over Blackwell, HBM4 has higher bandwidth, and the rack-level form factor saves enormous cooling and networking overhead. This is the physical reason AI compute cost keeps halving.
Neoclouds: the GPU-cloud middlemen
Neocloud means "new cloud" — the GPU-only cloud providers that emerged on this AI wave, distinct from the AWS/Azure/GCP hyperscalers. Representative names: CoreWeave (largest, public), Lambda, Crusoe, Nebius (spun out of Yandex), Together AI, Voltage Park, Foundry, Applied Digital.
Why they exist: after ChatGPT broke through in 2023, AI companies wanted GPUs and AWS / Azure couldn't deliver fast enough. Two reasons: AWS itself was short on supply, and AWS datacenters are designed for general-purpose multi-tenant servers — network topology, rack density, cooling — none of which is optimized for large GPU clusters. Stitching together a 1000-card cluster on AWS gives you bad cross-node communication.
Neoclouds filled the gap: designed for GPU clusters from day one. Whole machine rooms wired to InfiniBand topology, liquid cooled, dense racks, no general-purpose multi-tenant isolation. Customers get near bare-metal performance.
Business model: borrow money from investors and banks (often using GPUs as collateral), drop large amounts on H100/B200 in one shot, build the datacenter, rent by the hour or sign 3–5 year contracts.
Interesting food chain: CoreWeave signed multi-tens-of-billions of dollars in long-term contracts with Microsoft, and a large slice of that capacity gets subleased to OpenAI. So the largest AI cloud customer in the world has a compute path that goes self → Microsoft → CoreWeave → NVIDIA. Microsoft has Azure but still buys neocloud capacity because neoclouds are scaling faster than Azure is.
The IRR figures on the chart (15.3%, 38%) describe the neocloud business case: you buy a GPU for ~$30,000, rent it at $X/hour, recoup over five years, plus a target IRR (e.g. 15%) as profit. The scarcer the GPU, the higher the IRR they can charge (early H100 era hit 38% or higher); the more abundant and competitive it gets, the more IRR compresses (chart assumes 15.3% on VR NVL72). The chart says two things: hardware itself is getting cheaper, and the neocloud abnormal-profit window is closing. CoreWeave caught the H100 scarcity wave; later entrants will have less margin to work with.
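The relationship between purchase price, rental rate, and IRR can be made concrete with a small sketch. It rests on simplifying assumptions that are not in the chart: equal annual cash flows, 100% utilization by default, and no power cost, opex, or residual value. The $30,000 price and the 15.3% and 38% IRRs come from the text above; the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760


def annual_payment(price, irr, years):
    """Level yearly rent that returns `irr` on `price` over `years` (annuity formula)."""
    return price * irr / (1 - (1 + irr) ** -years)


def hourly_rate(price, irr, years, utilization=1.0):
    """Hourly rental price implied by a target IRR (assumes no opex)."""
    return annual_payment(price, irr, years) / (HOURS_PER_YEAR * utilization)


def npv(rate, cash_flows):
    """Net present value; cash_flows[0] is the upfront purchase (negative)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def solve_irr(cash_flows, lo=0.0, hi=10.0, tol=1e-9):
    """Find the rate where NPV crosses zero by bisection.

    Assumes NPV is positive at `lo` and negative at `hi`.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, cash_flows) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


price, years = 30_000, 5
for target in (0.153, 0.38):
    rate = hourly_rate(price, target, years)
    flows = [-price] + [annual_payment(price, target, years)] * years
    print(f"target IRR {target:.1%}: ${rate:.2f}/hour, recovered IRR {solve_irr(flows):.1%}")
```

Read the other way around, the same functions show the compression: if competition pushes the hourly rate down while the purchase price stays fixed, the IRR recovered by `solve_irr` falls with it.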
The biggest neocloud risk is depreciation. They borrowed against H100s and depreciate over 5–6 years. But if B200, B300, and Rubin each halve $/PFLOP, no one will rent old H100s at the previous price, and old GPUs become stranded assets on the books. That's the root cause of CoreWeave's wild post-IPO volatility — the market keeps re-betting whether AI demand grows faster than hardware depreciates.
One-line summary: Neoclouds are the middlemen who rack NVIDIA's cards and resell to AI companies, riding the GPU scarcity premium. The business is profitable now, but the asset base is depreciating fast.
Real vs accounting depreciation rates
As of May 2026, two layers worth separating: book depreciation and real economic depreciation — the gap between them is the active debate.
Book depreciation (how companies record it). All straight-line, the rate is just 1/years:
- CoreWeave: 6 years (~16.7%/year), longest, most criticized.
- AWS / Azure / Google: 5–6 years (16.7%–20%/year). Amazon shortened server depreciation from 6 to 5 years in February 2025, citing "AI accelerating the pace of change."
- Lambda: 5 years.
- Nebius: 4 years (most aggressive at 25%/year).
Trend: everyone is shortening, but CoreWeave hasn't moved. That's the core of Michael Burry's CoreWeave short thesis from last November — that the 6-year schedule props up earnings while real economic life is much shorter.
Real economic depreciation. Industry consensus: under frontier-training use, real GPU economic life is 2–3 years. That is not because the cards die. It is because once a new generation halves $/PFLOP, the extra electricity spent training frontier models on old H100s overtakes the cost of buying new cards within 18–36 months.
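The gap between the two layers can be illustrated with straight-line arithmetic. This is a sketch under stated assumptions: the $30,000 purchase price is a hypothetical round number, the 3-year mark is the upper end of the 2–3 year economic life quoted above, and the schedule lengths are the ones listed for each company.

```python
def straight_line_rate(years):
    """Annual straight-line depreciation rate: simply 1 / years."""
    return 1 / years


def book_value(cost, years, age):
    """Remaining book value after `age` years on a `years`-long schedule."""
    return cost * max(0.0, 1 - age / years)


schedules = {"CoreWeave": 6, "Lambda": 5, "Nebius": 4}
cost = 30_000          # hypothetical purchase price per GPU
economic_life = 3      # assumed: upper end of the 2-3 year estimate

for name, years in schedules.items():
    rate = straight_line_rate(years)
    left = book_value(cost, years, economic_life)
    print(f"{name}: {rate:.1%}/year, ${left:,.0f} still on the books at year {economic_life}")
```

Whatever is still on the books when the economic life ends is the amount at risk of a write-down, which is why the longest schedule draws the most criticism.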
Codex: dynamic UI for the task at hand
Codex pitch line: dynamic UI for the task at hand, 20% faster computer and browser use, better slides and sheets, in-browser annotation, artifacts and code, easier onboarding, cleaner cross-app design, performance improvements, no clunky handoff or switching.
Other notable releases of the day
- Qwen3.6 27B looks like the most important open-weight release of the day. Artificial Analysis ranked it the new open-weights leader under 150B parameters with an Intelligence Index of 46, ahead of Gemma 4 31B and prior Qwen variants. Apache 2.0, 262K context, native multimodal input, and BF16 weights small enough to fit on a single H100 (27B parameters × 2 bytes ≈ 54 GB, under the 80 GB of an H100). The 35B A3B MoE companion scored 43, the strongest open model around 3B active parameters. Tradeoff: expensive inference per output token. AA estimates Qwen3.6 27B used ~144M output tokens on the suite, roughly 21x the run cost of Gemma 4 31B. Still a notable step on capability-per-size.
- Open-source supply-chain risk: Socket reported the popular PyPI package lightning was compromised in 2.6.2 and 2.6.3, with malicious code executing on import, downloading Bun, and running an 11MB obfuscated JavaScript payload aimed at credential theft. Theo connected this to the intercom-client npm compromise and a Linux zero day, arguing the tempo of supply-chain attacks is increasing.
- Security scanners as first-class AI products: Anthropic shipped Claude Security, a repo vulnerability scanner that validates findings and suggests fixes, powered by Opus 4.7. Cursor shipped a parallel Cursor Security Review with always-on PR review and scheduled scans. Model vendors moving directly into established devsecops categories.
- Qwen-Scope: Qwen released an open suite of sparse autoencoders for Qwen models, focused on feature steering, debugging, data synthesis, and evaluation — a rare interpretability release rather than just weights.