AI Notes — May 2

Agent orchestration: a while-loop of tool calls

At its core, an agent is a while-loop wrapping tool calls. Building one is a five-step pipeline:

Step 1: Define the tools. What can the agent actually do?
Step 2: Permission modes. Plan / read-only / ask permission / auto / bypass permission. Pick the one that matches the risk profile.
Step 3: Multi-agent orchestration. Multiple agents collaborate, and context engineering is mostly about controlling what each agent sees in its context window.
Step 4: Protocol-first design. Decouple the agent layer from the UI. The protocol is the contract.
Step 5: MCP servers. Handle the messy realities — server crashes, timeouts, streaming feedback.
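
A minimal sketch of the loop and of the permission modes from step 2, in Python. Everything here is an illustrative assumption rather than any framework's API: the Mode and Tool names, the scripted_policy function that stands in for the model, and the rule that ask mode only prompts for tools that change state. The notes do not say how auto and bypass differ, so this sketch treats them the same.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    PLAN = "plan"            # propose actions, never execute them
    READ_ONLY = "read_only"  # only tools that do not change state
    ASK = "ask"              # state-changing tools need approval
    AUTO = "auto"            # run everything (assumption: same as bypass here)
    BYPASS = "bypass"


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    mutates: bool  # True if the tool changes state outside the agent


def allowed(tool: Tool, mode: Mode, approve: Callable[[str], bool]) -> bool:
    """Decide whether a tool call may execute under the given mode."""
    if mode is Mode.PLAN:
        return False
    if mode is Mode.READ_ONLY:
        return not tool.mutates
    if mode is Mode.ASK:
        return (not tool.mutates) or approve(tool.name)
    return True  # AUTO and BYPASS


def run_agent(policy, tools, mode, approve=lambda name: False, max_steps=10):
    """The while-loop: ask the policy for an action, run it, record the result."""
    history = []
    while len(history) < max_steps:
        action: Optional[tuple] = policy(history)  # (tool_name, arg) or None
        if action is None:
            break  # the policy decided it is done
        name, arg = action
        tool = tools[name]
        if allowed(tool, mode, approve):
            result = tool.run(arg)
        else:
            result = f"denied: {name} not permitted in {mode.value} mode"
        history.append((name, arg, result))
    return history


def scripted_policy(history):
    """Hypothetical stand-in for the model: replays a fixed script."""
    script = [("read_file", "notes.txt"), ("write_file", "notes.txt")]
    return script[len(history)] if len(history) < len(script) else None


TOOLS = {
    "read_file": Tool("read_file", lambda path: f"contents of {path}", mutates=False),
    "write_file": Tool("write_file", lambda path: f"wrote {path}", mutates=True),
}

history = run_agent(scripted_policy, TOOLS, Mode.READ_ONLY)
for step in history:
    print(step)
```

In read-only mode the read goes through and the write comes back as a denial that the policy can see on its next turn. A real harness would put an LLM call where scripted_policy is.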

Distillation in the LLM era

Distillation looks different now because frontier models are mostly black-box APIs: you generally cannot get full logits or intermediate features. The dominant pattern has become data distillation: use a strong model to generate a large, high-quality instruction-response dataset, then SFT a smaller model on it. Alpaca, which used 52K samples generated by a GPT-3.5-series model (text-davinci-003) to fine-tune LLaMA 7B, is the canonical example. Vicuna and WizardLM follow the same playbook. Strictly speaking it is not classical distillation (no soft labels), but the spirit is the same: transfer the strong model's knowledge into a weak one.
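
A rough sketch of the data-distillation step in Python: query a teacher, filter the outputs, and write instruction-response records for SFT. The fake_teacher function, the record field names, and the length and duplicate filters are all assumptions made for illustration. This is not the Alpaca pipeline, and a real run would call a model API where fake_teacher is.

```python
import json
from typing import Callable, Iterable


def build_sft_dataset(
    instructions: Iterable[str],
    teacher: Callable[[str], str],
    min_chars: int = 20,
) -> list[dict]:
    """Collect teacher responses, dropping very short ones and repeated instructions."""
    records = []
    seen = set()
    for instruction in instructions:
        response = teacher(instruction).strip()
        if len(response) < min_chars:
            continue  # crude quality filter (assumption)
        key = instruction.strip().lower()
        if key in seen:
            continue  # skip duplicate instructions
        seen.add(key)
        records.append({"instruction": instruction, "response": response})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line, a common format for SFT data."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_teacher(instruction: str) -> str:
    """Hypothetical stand-in for a strong model behind an API."""
    if "haiku" in instruction:
        return "n/a"
    return f"Here is a detailed answer to: {instruction}"


records = build_sft_dataset(
    ["Explain overfitting.", "explain overfitting.", "Write a haiku."],
    fake_teacher,
)
print(to_jsonl(records))
```

Only text goes into the dataset. The student never sees the teacher's probabilities, which is why this is not distillation in the classical soft-label sense.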

The other variant is chain-of-thought distillation, where the teacher emits not just the answer but the reasoning trace. The student learns "how to think" alongside "what to answer," which transfers reasoning ability much better.
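
One way to picture the difference in the training data, as a Python sketch. The prompt template, the field names, and the idea of keeping only traces whose final answer matches a known reference are assumptions for illustration, not something these notes or a specific paper prescribe.

```python
def format_cot_example(question: str, rationale: str, answer: str) -> dict:
    """Put the reasoning trace into the training target, ahead of the answer."""
    return {
        "prompt": f"Question: {question}\nLet's think step by step.",
        "completion": f"{rationale.strip()}\nFinal answer: {answer}",
    }


def keep_correct_traces(samples: list[dict], reference: dict) -> list[dict]:
    """Keep only teacher traces whose final answer matches the reference answer."""
    kept = []
    for s in samples:
        if s["answer"] == reference.get(s["question"]):
            kept.append(format_cot_example(s["question"], s["rationale"], s["answer"]))
    return kept


# Hypothetical teacher outputs: one correct trace, one with an arithmetic slip.
teacher_samples = [
    {"question": "What is 12 * 3?", "rationale": "12 * 3 = 10 * 3 + 2 * 3 = 36.", "answer": "36"},
    {"question": "What is 7 + 8?", "rationale": "7 + 8 = 16.", "answer": "16"},
]
reference_answers = {"What is 12 * 3?": "36", "What is 7 + 8?": "15"}

cot_data = keep_correct_traces(teacher_samples, reference_answers)
print(cot_data[0]["completion"])
```

The student is then trained to produce the whole completion, so the loss covers the reasoning steps as well as the final answer.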

In practice, distilled small models are often reported to hit 80–95% of the teacher's performance with roughly a tenth of the parameters and dramatically lower inference cost. There is a ceiling, though: student capacity is bounded, and very complex knowledge will not compress.

Claude Code for PMs: write the roadmap and nothing else

One PM's heuristic that resonated: "The only product document I write is the roadmap. Every PRD and every ticket is written by Claude."

Writing is thinking, so a new GM should take time drafting the roadmap personally: understand the product, usage trends, user feedback, and the market, and talk to people who built earlier versions. But once that thinking is on paper, every downstream artifact can be generated.

The setup needs an issue tracker with MCP integration. The agent writes tickets, moves them across the board, keeps statuses fresh. The PM no longer reads or writes tickets — they talk about them with the agent. Status columns collapse to now / next / later, plus "in progress" and "done." That is all you need.

The six layers of AI products

Prompt Wrapper: input → prompt → LLM → output. Title generators, email polishers, resume optimizers.
Grounded AI / RAG: the AI retrieves from documents, knowledge bases, or databases before answering. Company knowledge assistants, course TAs, support FAQs.
Tool-using AI: not just answers — calls APIs, reads calendars, updates CRMs, generates files.
LLM Workflow: the AI sits inside a controlled business flow with explicit steps for classification, retrieval, judgment, generation, approval, execution.
Agentic Core: a true agent that plans → acts → observes → updates state → continues or stops. It decides the next step itself and loops.
AI-native Product / System: low-friction interaction, contextual intelligence, memory, permissions, proactive triggers, evals, guardrails, runtime — the AI becomes the intelligent layer of the product or organization.
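
To make the LLM Workflow layer concrete, here is a small Python sketch in which the code, not the model, fixes the order of the six steps. The support-ticket scenario, the POLICIES table, and every function body are hypothetical stand-ins. In a real system, classify, judge, and generate would each be a model call.

```python
from typing import Callable

POLICIES = {"refund": "refunds allowed within 30 days"}  # hypothetical knowledge base


def classify(ticket: str) -> str:
    return "refund" if "refund" in ticket.lower() else "other"


def retrieve(category: str) -> str:
    return POLICIES.get(category, "")


def judge(ticket: str, policy: str) -> bool:
    # Stand-in for a judgment call: is this request covered by a known policy?
    return bool(policy)


def generate(ticket: str, policy: str) -> str:
    return f"Thanks for reaching out. Per our policy ({policy}), we will process this."


def execute(draft: str) -> str:
    return "sent"  # stand-in for actually sending the reply


def run_workflow(ticket: str, approve: Callable[[str], bool]) -> dict:
    """Run the fixed sequence; the model never chooses which step comes next."""
    steps = ["classify"]
    category = classify(ticket)
    steps.append("retrieve")
    policy = retrieve(category)
    steps.append("judge")
    if not judge(ticket, policy):
        return {"status": "escalated", "steps": steps}
    steps.append("generate")
    draft = generate(ticket, policy)
    steps.append("approve")
    if not approve(draft):
        return {"status": "rejected", "steps": steps}
    steps.append("execute")
    return {"status": execute(draft), "steps": steps}


result = run_workflow("Please refund my order", approve=lambda draft: True)
print(result)
```

The contrast with the Agentic Core layer is who owns the control flow. Here every branch is written by the developer. In an agent, the model picks the next action inside a loop.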

GPT-5.5, Grok 4.3, DeepSeek V4 Pro

OpenAI/Codex momentum. OpenAI calls GPT-5.5 its strongest launch yet — API revenue growing 2x faster than prior releases, Codex doubling revenue in under seven days.

Grok 4.3. Intelligence Index 53 (up 4 points from Grok 4.20). Input prices down 40%, output down 60%. GDPval-AA jumped 321 Elo to 1500, and τ²-Bench Telecom hit 98%. The cost: non-hallucination metrics dropped 8 points, so the model is stronger but less reliable. The community is split. @scaling01 says it "still trails Chinese open-source models," and Andon Labs reports it "would rather sleep than act" on Vending-Bench 2.

DeepSeek V4 Pro is the open-weight model worth watching this week. 1.6T total / 49B active, 1M context, hybrid CSA/HCA attention, KV cache squeezed to 10%, long-context inference FLOPs cut to a quarter. @omarsar0's read after testing: this is the first open-source model that can "genuinely arm-wrestle Codex / Claude Code on multi-turn agentic coding."

Open vs closed gap is closing. Last week's three big open-source models (Kimi K2.6, MiMo V2.5 Pro, DeepSeek V4 Pro) sit at Intelligence Index 52–54. Gemini 3.1 Pro Preview and Claude Opus 4.7 are at 57. GPT-5.5 is 60. The remaining gap clusters in HLE, CritPt, TerminalBench Hard, and hallucination evals.

Agent ecosystem: the contest moves from model IQ to harness design

Codex vs Claude Code. Codex is iterating fast: a responsive testing toolbar, CI status, migration tools, and even a "virtual pet" that became a hit feature. @theo's summary: "GPT-5.5 is smarter and unblocks you faster; Opus 4.7 has better taste but wanders. Claude Code is noticeably slower on TTFT/TPS with more tool calls; Codex is more direct and economical." But @scaling01 points out that GPT-5.5 doesn't actually beat Opus 4.7 on PostTrainBench; results depend heavily on the harness.

Other frameworks. Devin added shell hotkey invocations. Hermes shipped /goal loops where a supervisor model forces the agent to run to completion. Flue positions itself as "Claude Code, but programmable" in TypeScript.

Durable execution is becoming standard. Cloudflare launched Dynamic Workflows. LangChain made create_agent the primitive underneath Deep Agents. Multi-tenant deploys are converging on data isolation, credential delegation, and HITL (human-in-the-loop responses returned as tool results) — all production-grade concerns now.
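
The HITL pattern mentioned above, where the human's reply comes back as just another tool result, can be sketched in a few lines of Python. The message dictionaries, the ask_human tool, and the risky flag are hypothetical shapes chosen for illustration. They are not the LangChain or Cloudflare APIs.

```python
from typing import Callable


def ask_human(question: str, respond: Callable[[str], str]) -> dict:
    """Treat the human as a tool: the reply becomes an ordinary tool result."""
    return {"role": "tool", "name": "ask_human", "content": respond(question)}


def run_step(messages: list, action: dict, respond: Callable[[str], str]) -> list:
    """Execute one pending action, pausing for approval first if it is risky."""
    if action["risky"]:
        reply = ask_human(f"Approve {action['name']}?", respond)
        messages.append(reply)
        if reply["content"].strip().lower() != "yes":
            messages.append(
                {"role": "tool", "name": action["name"], "content": "skipped: human declined"}
            )
            return messages
    # Stand-in for really running the tool.
    messages.append({"role": "tool", "name": action["name"], "content": "done"})
    return messages


declined = run_step([], {"name": "delete_branch", "risky": True}, respond=lambda q: "no")
for m in declined:
    print(m)
```

Because the reply sits in the message list like any other tool output, the agent loop needs no special case for it. With durable execution, the run can be suspended at the ask_human call and resumed whenever the human answers.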

Research worth bookmarking

ReaLM-Retrieve: retrieval during reasoning (not just upfront RAG). F1 +10.1%, retrieval calls down 47%.
OCR-Memory: store long-horizon trajectories as images plus index anchors to avoid information loss in text summaries. SOTA on Mind2Web and AppWorld under strict context budgets.
Recursive Multi-Agent Systems: agents communicate via shared latent computation rather than natural language. Accuracy +8.3%, tokens cut 34.6–75.6%.
Meta FAIR self-improving pretraining: a strong post-trained model rewrites pretraining suffixes and acts as a judge in RL-style pretraining. Factuality +36.2%, safety +18.5%, generation quality wins 86.3% of head-to-heads.
Microsoft synthetic long-horizon computer-use worlds: 1000 synthetic computers populated with real files, 8-hour runs averaging 2000+ agent turns. The takeaway: "the bottleneck for computer-use agents isn't model capability, it's real experiential data."

Local LLMs and hardware

Qwen 3.6 27B and Gemma 4 31B trade blows on local coding tasks (Gemma cleaner, Qwen flashier).
Qwen open-sourced Qwen-Scope — Sparse Autoencoders for the entire Qwen 3.5 lineup, the largest open interpretability toolkit to date.
PFlash uses a small drafter to pick important tokens plus BSA sparse attention for speculative prefill. 10x faster than llama.cpp on RTX 3090 at 128K — though some flag it as too lossy.
Hardware builds proliferating: 16x DGX Spark cluster (QSFP56 + 200Gbps), AMD Halo Box (Ryzen 395 + 128GB). Google announced TPU 8t/8i: training cost-efficiency +170–180%, inference +80%, datacenter bandwidth +300% — the foundation for Gemini 3.1 Pro and future trillion-parameter multimodal models.

Why synthetic data won't kill human annotation

The relationship between synthetic data and human annotation isn't replacement — it's that they're good at fundamentally different things. Six places where synthetic data can't reach:

1. Tasks that can't be auto-verified. Synthetic environments only work when the answer is programmatically checkable: find a file, fill a form, run a command, modify code. But many real agent tasks have fuzzy success criteria — "reply to this customer complaint," "make this slide deck more persuasive," "do a competitive analysis," "screen these resumes." "Correct" is a continuous spectrum. Only humans can give preference labels (do you prefer A or B), which is the lifeblood of RLHF / DPO.

2. The long tail of real distributions. 1000 synthetic computers are still built to a designer's idea of a "standard office computer." Real users have 47 desktop icons (half named "New Folder (3)"), browsers loaded with weird plugins, mixed Chinese/Japanese/Korean filenames with spaces and emoji in paths, half-finished spreadsheets from a former employee referencing deleted sheets, niche enterprise software the synthetic environment never imagined. Anthropic and OpenAI are still hiring contractors to record screen-trace data on real machines for exactly this reason.

3. Side effects and "completing the wrong way." Auto-verifiers usually only check the final goal, but agents do unexpected damage along the way: deleting unrelated folders, commenting out tests so they "pass," sending the meeting confirmation but also marking the entire inbox as read. This kind of reward hacking is hard for automated judges to catch but humans spot it instantly.

4. The synthetic generator is its own ceiling. Synthetic data is generated by big models (auto-create tasks, auto-judge answers). The fundamental problem: the generator can't teach what it can't do. Want to train an agent for frontier math research? The generator can't write the problems. Want it to do complex legal due diligence? The generator can't judge which report is good. At the capability frontier, you need domain experts to write and grade. That's why Scale AI, Surge, and Mercor are paying PhD physicists, senior lawyers, and senior engineers $100–200/hour.

5. Safety, red-teaming, value alignment. "Did the agent secretly read files it shouldn't?" "Does it leak user privacy?" "How should it refuse unreasonable requests?" These are value judgments with no objective answer. They depend on culture, company, and context, and synthetic data can't encode that.

6. Behavioral demonstrations for fine-tuning small models. Many enterprise deployments use a small model plus fine-tuning. The need isn't massive data — it's a few hundred to a few thousand high-quality, domain-specific, style-consistent trajectory demonstrations: "this is how our company writes customer support replies." That data has to be pulled from real workflows or written by experts. It can't be synthesized.
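
Point 3 above is easy to see in code. The Python sketch below compares a verifier that checks only the final goal with one that also diffs the environment before and after the run. The dictionary standing in for a filesystem, the file names, and the allowed_paths parameter are hypothetical.

```python
def goal_only_verifier(before: dict, after: dict) -> bool:
    """Check only the end state the task asked for."""
    return after.get("report.txt") == "final report"


def side_effect_verifier(before: dict, after: dict, allowed_paths: set):
    """Also flag any change outside the paths the task was allowed to touch."""
    if not goal_only_verifier(before, after):
        return False, ["goal not met"]
    problems = []
    for path in sorted(set(before) | set(after)):
        if path in allowed_paths:
            continue
        if path not in after:
            problems.append(f"deleted {path}")
        elif path not in before:
            problems.append(f"created {path}")
        elif before[path] != after[path]:
            problems.append(f"modified {path}")
    return (not problems), problems


# Hypothetical run: the report got written, but a test was commented out
# and an unrelated file disappeared along the way.
before = {
    "report.txt": "draft",
    "tests/test_app.py": "assert add(1, 2) == 3",
    "photos/a.jpg": "binary",
}
after = {
    "report.txt": "final report",
    "tests/test_app.py": "# assert add(1, 2) == 3",
}

print(goal_only_verifier(before, after))
ok, problems = side_effect_verifier(before, after, allowed_paths={"report.txt"})
print(ok, problems)
```

A diff like this only catches changes someone thought to enumerate. Whether an unexpected change is harmless or damaging still tends to need a person looking at the trajectory, which is the argument of point 3.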