Follow Claude's X Account
Case in point: today's reset wasn't announced to us in advance. If we'd known yesterday, we could have used it much more aggressively. Following official channels actually saves tokens.
Overseeing Agents Is the Future, Not Writing Code
Dan Shipper's take: dev work is headed toward overseeing agents, not writing code. He pushes back on the idea that the CLI will eat the UI. With a CLI-first workflow, you mostly supervise through text: commands, logs, git state, diffs, terminal output. Now that agents are doing the coding, that's not a good primary interface.
Instead, the future coding UI is centered on managing parallel work, staying aware of git/task context, and — most importantly — having access to a preview of what you're building.
I find this interesting. Many people say the CLI is the future, but is that future really just for programmers overseeing agents? As with reporting to colleagues, you have to speak plain language to non-technical people too. A UI is that plain language, and plain language is needed.
Two Things That Surprised Me About Opus 4.7
- It dispatched Haiku to handle doc exploration tasks while I was on Opus 4.7 Max. That's clever — it's actively cutting token consumption and boosting speed.
- It could identify which template website our Chinese resume came from. I genuinely did not expect that.
Fast Mode Claude Stopped Working
Worked on 4.6, stopped after the update. Frustrating.
Also, the positioning of these two modes is unclear:
- Allow bypass permissions mode: skips all permission checks so Claude works uninterrupted. Good for fixing lint errors or generating boilerplate. Letting Claude run arbitrary commands is risky — data loss, system corruption, prompt injection attacks.
- Allow auto permissions mode: lets Claude handle permission decisions during coding sessions so developers can run longer tasks without manual approvals. Also includes additional safeguards against prompt injections.
Not obvious which one to reach for when.
OpenAI Codex Update
OpenAI quietly acquired a company called Software Apps Inc, which launched a macOS AI companion called Sky about a year ago. The team was obsessed with the Mac and somehow built a magical experience, largely because they were controlling the Mac in the background: you work on one document while Codex clicks buttons and performs actions in another, without interrupting you.
I hear computer use is pretty smooth now. Computer use UX is becoming a mainstream product category: OpenAI's Codex desktop/computer-use update drew unusually strong practitioner reactions. Some called sub-agents + computer use "very close" to what AGI actually feels like.
Warp Terminal Supports Any CLI Agent
Warp, in my opinion the best terminal experience out there, just shipped first-class support for any CLI agent — Claude Code, Codex, OpenCode, Gemini CLI — all running side by side in vertical tabs with live status indicators.
The killer feature here, and it solves what I think is the single worst part about Claude Code, is notifications when agents need you. If you've used Claude Code you know the pain of constantly checking if it's waiting for permission or input. Warp notifies you. You step in, approve, go back to what you were doing. They also added integrated code review, a rich multimodal input editor, and — this is wild — remote control from mobile.
Claude Code Team Best Advice
- Delegate, don't micromanage — hand the whole task over, don't hover on every step.
- Put full goal + constraints + acceptance criteria up front — front-load the information.
- Tell the model how to verify — encode testing workflows in claude.md or skills.
That strongly suggests Anthropic optimized toward autonomous task loops where explicit validation is central.
Build Self-Validation Loops
Configure Claude to automatically run builds, tests, and lint to verify its own work. This lets Claude work autonomously for longer and catch its own errors, especially effective when you have Claude write tests first, then code.
Develop Task Classification Intuition
Learn to distinguish tasks suited for async processing (edge features, prototypes) from those needing sync supervision (core business logic, critical fixes). Abstract tasks at the product edge can use "auto-accept mode"; core features need tighter oversight.
Write Clear, Detailed Prompts
When multiple components share similar names or functions, be extremely specific. The better and more detailed the prompt, the more you can let Claude work independently without accidentally modifying the wrong parts of the codebase.
CLAUDE.md Scoping
Tab opened in /Desktop/claudecode/:
reads: global + this project + memory
Tab opened in /Desktop/other-project/:
reads: global + that project's CLAUDE.md (if any) + that project's memory
→ Global CLAUDE.md is shared
→ Project CLAUDE.md is independent
→ Conversation context is always independent (unless you use /resume)
How the Claude Code Team Uses Claude Code
Auto-update docs after each session: the team has Claude Code summarize completed work and propose improvements at the end of every working session. This forms a continuous improvement loop — Claude Code keeps refining the claude.md and workflow docs based on actual usage, making subsequent iterations smoother.
Parallel task management across instances: for long-running data tasks, they open multiple Claude Code instances in different repos. Each instance keeps its full context, so even hours or days later, Claude Code remembers exactly where it left off.
Continuing nano-vllm
The full inference flow:
You input: "今天天气真好" ("The weather is really nice today")
│
▼
┌─────────┐
│ tokenize│ ← chops text into numbers
└─────────┘
│
▼
[15, 234, 89, 12, 456, 78]
│
▼
═══════════════════════════════
INFERENCE
═══════════════════════════════
│
├─ Phase 1: PREFILL
│ feed the whole sequence to GPU at once
│
└─ Phase 2: DECODE
   spit out tokens one at a time
Batching
Decode N requests in parallel → per-step time is roughly the same → throughput ×N. This is the core performance source of vLLM, nano-vllm, and similar inference frameworks.
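To make the throughput claim concrete, here is a toy numpy timing sketch. It is not nano-vllm code: the "model" is a single made-up 4096x4096 weight matrix, and the exact ratio depends on hardware (a GPU shows a much bigger gap than a CPU). The point is that every request in the batch shares the same weight reads, so per-step time grows far more slowly than the batch size.

```python
# Toy illustration of batched decode: one fake "decode step" is a single matmul
# against a shared weight matrix. Dimensions and repeat counts are arbitrary.
import time
import numpy as np

hidden = 4096
W = np.random.randn(hidden, hidden).astype(np.float32)  # stand-in for model weights

def decode_step_time(batch_size: int, repeats: int = 50) -> float:
    """Average wall time of one 'decode step' with batch_size requests in flight."""
    x = np.random.randn(batch_size, hidden).astype(np.float32)
    start = time.perf_counter()
    for _ in range(repeats):
        _ = x @ W            # weights are read once per step, shared by the whole batch
    return (time.perf_counter() - start) / repeats

t1 = decode_step_time(1)
t8 = decode_step_time(8)
print(f"batch=1: {t1 * 1e3:.2f} ms/step   batch=8: {t8 * 1e3:.2f} ms/step")
print(f"step time grew {t8 / t1:.1f}x while throughput grew ~8x")
```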
Static Batching:
wait until [A, B, C, D] all gathered → depart together
A finishes in 50 tokens → but has to wait for B, C, D
→ A's "seat" sits empty, wasting memory and compute
Continuous Batching:
A finishes, leaves immediately; new request boards immediately
→ seats are always occupied
Preempt
When memory is full, nano-vllm does:
before:
VRAM: [A 100MB] [B 200MB] [C 60MB]
Request D needs 150MB → can't fit
preempt(A):
kick A back to waiting queue
free A's KV cache (100MB)
after:
VRAM: [___empty___] [B 200MB] [C 60MB]
100MB free
preempt(B) if still not enough:
VRAM: [___empty___] [___empty___] [C 60MB]
300MB free ✅
D comes in, allocates 150MB
What about the kicked-out A? Its KV cache is gone. When A's turn comes again, re-prefill: recompute the whole history (prompt + what's been generated). Recomputing beats waiting, because prefill is compute-bound and very fast.
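A minimal sketch of that policy, mirroring the A/B/C/D example above. The names (Request, Scheduler, preempt_until_fits) and the oldest-first eviction order are illustrative choices, not nano-vllm's actual API: when a new request does not fit, evict a running one, free its KV cache, and push it back onto the waiting queue to be re-prefilled later.

```python
# Sketch of preemption. Request/Scheduler are made-up names for illustration.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: str
    kv_mb: int = 0                       # KV cache currently held, in MB
    generated: list = field(default_factory=list)

class Scheduler:
    def __init__(self, vram_mb: int):
        self.free_mb = vram_mb
        self.running: list[Request] = []
        self.waiting: deque[Request] = deque()

    def preempt_until_fits(self, need_mb: int) -> None:
        while self.free_mb < need_mb and self.running:
            victim = self.running.pop(0)     # evict oldest first (illustrative policy)
            self.free_mb += victim.kv_mb     # free its KV cache
            victim.kv_mb = 0                 # the cache is gone, so when the victim
            self.waiting.appendleft(victim)  # resumes it must re-prefill everything

    def admit(self, req: Request, need_mb: int) -> None:
        self.preempt_until_fits(need_mb)
        if self.free_mb >= need_mb:
            self.free_mb -= need_mb
            req.kv_mb = need_mb
            self.running.append(req)
        else:
            self.waiting.append(req)         # still does not fit, keep waiting

sched = Scheduler(vram_mb=360)
for rid, mb in [("A", 100), ("B", 200), ("C", 60)]:
    sched.admit(Request(rid), need_mb=mb)
sched.admit(Request("D"), need_mb=150)       # preempts A, then B, then D fits
print([r.rid for r in sched.running], [r.rid for r in sched.waiting])  # ['C', 'D'] ['B', 'A']
```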
Prefix Caching
Block table:
block 0: K,V for "Please introduce" (A, B both point here)
block 1: K,V for "Beijing's" (A, B both point here)
block 2: K,V for "history," (A, B both point here)
block 3: K,V for "focus on Ming-Qing" (only A points here)
block 4: K,V for "focus on Republic era" (only B points here)
Managed via block ref_count:
block 0 ref_count = 2 (A, B both using)
block 3 ref_count = 1 (only A)
When A finishes: block 0 ref_count 2 → 1, don't release (B still using)
When B finishes: block 0 ref_count 1 → 0, release
Why do providers charge so little for cached system prompts? With prefix caching, a 5000-token system prompt is computed once; every subsequent call reuses the blocks. The provider's compute cost is near zero, so they pass the savings on.
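Going back to the ref_count bookkeeping, here is a small sketch with hypothetical names (BlockManager, acquire, release). The real block manager also has to hash blocks, map them to physical memory, and handle eviction; this only shows the sharing rule.

```python
# Ref-counted shared KV blocks: a block is freed only when no request uses it.
class BlockManager:
    def __init__(self):
        self.ref_count: dict[int, int] = {}   # block_id -> number of requests using it

    def acquire(self, block_id: int) -> None:
        self.ref_count[block_id] = self.ref_count.get(block_id, 0) + 1

    def release(self, block_id: int) -> None:
        self.ref_count[block_id] -= 1
        if self.ref_count[block_id] == 0:
            del self.ref_count[block_id]      # this is where the K,V memory is freed
            print(f"block {block_id} freed")

mgr = BlockManager()
for block in (0, 1, 2, 3):   # request A uses blocks 0-3
    mgr.acquire(block)
for block in (0, 1, 2, 4):   # request B shares 0-2 and has its own block 4
    mgr.acquire(block)

for block in (0, 1, 2, 3):   # A finishes: shared blocks stay alive, only block 3 is freed
    mgr.release(block)
for block in (0, 1, 2, 4):   # B finishes: blocks 0-2 and 4 are freed too
    mgr.release(block)
```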
Generate Pseudocode
function generate(prompt, max_new_tokens):
# Step 1: tokenize
tokens = tokenize(prompt)
# Step 2: prefill — feed whole sequence, start from empty kv_cache
next_token, kv_cache = model(tokens, kv_cache=empty)
tokens.append(next_token)
# Step 3: decode loop — feed one new token, reuse kv_cache
for i in range(max_new_tokens - 1):
last_token = tokens[-1]
next_token, kv_cache = model([last_token], kv_cache)
tokens.append(next_token)
# Step 4: detokenize
    return detokenize(tokens)
Two Key Realizations
1. Q, K, V are computed from the same token via three different matrices W_Q, W_K, W_V. (That's why "model parameters" take so much memory — these W matrices are huge.)
2. kv_cache stores K and V only, not Q. (Q is recomputed every step; K, V history doesn't change, so it's stored.)
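A back-of-envelope sketch of both points. The dimensions below are assumptions (a generic Llama-7B-style config in fp16), not measurements of any particular model; they just show why the W matrices dominate parameter memory and why the KV cache still grows linearly with context length.

```python
# Rough memory math with assumed dimensions (not a specific model's numbers).
layers     = 32
kv_heads   = 32
head_dim   = 128
hidden     = kv_heads * head_dim      # 4096
bytes_fp16 = 2

# kv_cache stores K and V (not Q) for every layer, for every past token
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16
print(f"KV cache: {kv_bytes_per_token / 1024:.0f} KiB per token")                       # ~512 KiB
print(f"          {kv_bytes_per_token * 4096 / 2**30:.1f} GiB for a 4096-token context")

# W_Q + W_K + W_V alone (ignoring W_O, the MLP, embeddings, ...)
wqkv_bytes_per_layer = 3 * hidden * hidden * bytes_fp16
print(f"W_Q/W_K/W_V: {wqkv_bytes_per_layer * layers / 2**30:.1f} GiB across all layers")
```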
Understanding Attention
Attention formula:
Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V
Analogy:
Q (Query) = "what am I looking for?" ← current token's question
K (Key) = "what type am I?" ← each history token's label
V (Value) = "my actual content" ← each history token's info
Library research:
Your question Q = "I want to learn prefill"
Each book title K = "Python Intro", "LLM Inference", "Transformer Paper"...
Each book content V = actual text in the books
Attention:
1. compare Q to each K, see which book is relevant (softmax(Q·K))
2. read each V weighted by relevance
3. get a "fused" answer
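Those three steps in minimal numpy, for a single decode step: one query row against all cached keys and values. The head dimension and cache length are made up, and real attention is multi-head, but the mechanics are the same.

```python
# One decode step of attention: current Q against all cached K, V.
import numpy as np

d = 64                                  # head dimension (assumed)
T = 5                                   # tokens already in the KV cache (assumed)
q      = np.random.randn(1, d)          # Q: "what am I looking for?" (current token)
K_hist = np.random.randn(T, d)          # K: one "label" row per history token
V_hist = np.random.randn(T, d)          # V: one "content" row per history token

scores  = q @ K_hist.T / np.sqrt(d)     # step 1: compare Q with every K
weights = np.exp(scores - scores.max()) # step 2: softmax into relevance weights
weights = weights / weights.sum()
context = weights @ V_hist              # step 3: relevance-weighted blend of the Vs
print(weights.round(2), context.shape)  # weights sum to 1; context is (1, 64)
```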
QKV Forward
function model_forward(new_tokens, kv_cache):
# Step 1: for each new token, compute Q, K, V
for each token in new_tokens:
embedding = embed(token)
Q = embedding @ W_Q
K = embedding @ W_K
V = embedding @ W_V
# Append new K, V to kv_cache
kv_cache["K"].append(K)
kv_cache["V"].append(V)
    # Step 2: attention with the new tokens' Q + all K, V in kv_cache (in prefill, Q has one row per new token)
all_K = kv_cache["K"]
all_V = kv_cache["V"]
context = attention(Q, all_K, all_V)
# Step 3: context → more layers → predict next token
logits = output_layer(context)
next_token = sample(logits)
    return next_token, kv_cache
Prefill (3 tokens in one pass):
今 → Q1, K1, V1 ↘
天 → Q2, K2, V2 → GPU parallel computes 3 sets of K, V
气 → Q3, K3, V3 ↗
Decode (1 token in):
好 → Q4, K4, V4 ← only 1 set of K, V
Prefix caching gotcha: in nano-vllm, prefix caching is computed by block, not by token, with block_size = 256. If your prompt is only 22 tokens, you don't even fill half a block, so no block hash gets generated and the second request can't find anything to reuse: a cache miss.
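A sketch of what block-granular hashing implies, with a hypothetical block_hashes helper. In this sketch each block's hash is chained with the previous block's, so a match implies the whole prefix matches; the exact scheme is illustrative, not nano-vllm's actual code.

```python
# Hashes exist only for full blocks, so short prompts never hit the prefix cache.
from hashlib import sha256

BLOCK_SIZE = 256

def block_hashes(token_ids: list[int]) -> list[str]:
    """Return one hash per *full* block; a partial tail block gets no hash."""
    hashes, prev = [], ""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prev = sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes

print(len(block_hashes(list(range(22)))))    # 0: a 22-token prompt never produces a hash
print(len(block_hashes(list(range(600)))))   # 2: only the first 512 tokens are shareable
```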
React Dashboards vs Jupyter Notebooks
Jupyter Notebooks are disposable — run once, look, throw away, rewrite for the next evaluation. A React dashboard is a real web app — deploy it once and keep using it; every time you train a new model, plug the data in and view, no rewriting.
The core point: model performance visualization is complex, not just an accuracy number but multi-dimensional, interactive views. Power BI and Tableau are general-purpose tools; the team needs highly customized views to understand the specific metrics of their RL training. So they just have Claude Code write them a dedicated React web app.
Why Still Use Figma
Figma is still the absolute standard in design, not "outdated." It keeps iterating — it added AI features last year. The designer toolchain is Figma for visual design + Claude Code to turn designs into real code. Complementary, not competing.
The doc says they have Figma open 80% of the time — Figma actually becomes more important because of Claude Code. Designers can paste Figma screenshots directly into Claude Code to generate code prototypes.
GitHub Actions Auto-Tickets
This is an officially-supported Claude Code feature — plug Claude Code into GitHub Actions, and when someone opens an Issue or PR comment, Claude is automatically triggered to analyze and propose changes, even open a PR. The design team uses this to work through the backlog of small changes like spacing tweaks or copy changes.
996 Is Dead, 24/7 Is the New Norm
For the first time in the history of knowledge work, the person who went home did not take the only copy of their brain with them. 996 as a concept is dead; we're simply 24/7 employees now — but the 24/7 employee is not a person working 24 hours, it's a person whose agents work with enormous parallelization.
Most teams in 2026 still bottleneck on coordination rather than typing, and most organizations have barely begun to restructure. But the frontier is always where the future shows up first, and the frontier is already here. This essay is not a description of the industry at large, but rather a description of what is already happening inside the most AI-native teams.
I want to make a video on this.
Probability: Programmers Can't Guarantee Their Code Runs
This is not a future problem but a present one. Past a certain throughput, errors sneak in not because reviewers are careless but because output volume exceeds what human attention can meaningfully check, and most reviewing is done by non-deterministic models that miss things. The codebase is no longer something you know works — it's something you trust works, something whose probability you can no longer precisely describe.
This has strategic implications most leadership teams don't yet grasp. You're not building organizational capability for the model you have; you're building it for the model you don't yet have. The specs you're learning to write, the review culture you're building, the observability you're integrating, the agent fleets you're learning to command, the training rituals you're experimenting with to keep junior craft alive — none of this is for 2026 capability, it's scaffolding for 2027 and 2028 capability.
Companies building that scaffolding now will be positioned to exploit the next capability jump. Those waiting for tools to mature before restructuring will spend the first year of the next capability era learning what early adopters already know, while early adopters compound the lead.
McKinsey on the Agentic Organization
Source: McKinsey Podcast — Alexis Krivkovich × Lucia Rahilly (2026-04-02)
The Core Paradox
Over 80% of companies have invested heavily in AI without seeing bottom-line returns. The problem isn't the tech — it's that organizations haven't redesigned themselves to match.
Five Pillars
1. Business model: near-zero marginal delivery cost will break any moat that depends on friction. Example: an agent that automatically moves money between banks to chase the best rate → bank account stickiness disappears.
2. Team structure: 75% of roles need fundamental redesign within 2-3 years. Not disappearing — their responsibility boundaries need rewriting, including management. Large orgs added 1-3 layers of management over the past decade; agents could enable "superhuman management span," flattening hierarchies and speeding decisions.
3. Workflow: the biggest value isn't "use AI to do a task faster," it's end-to-end process redesign. Examples: insurance underwriting, HR hiring-to-onboarding. AAA (American Arbitration Association): agents auto-build case timelines, organize facts, generate ruling recommendations → arbitrators only judge "do I agree."
4. Leadership:
- Human in the loop: agent does part, hands off to human for the rest
- Human above the loop: agent completes the core process; human only makes final judgments ← this is the direction
5. Talent: soft skills like judgment, systems thinking, and people management become more valuable. The junior paradox: if agents take over "scut work," juniors lose the path to build judgment → the fix is making L&D a core part of the career journey, not a side activity. Change management shifts from episodic to perpetual: "Change management is no longer an episodic thing. It's a perpetual state." Use "two-way doors" (reversible experiments) — don't bet the whole company.
Connection
Same trend as Dan Shipper's "overseeing agents is the future" — McKinsey is the org-management version.