WeChat Anti-Screenshot

If you use Claude Computer Use on certain apps, it may trigger their protection mechanisms — for example, anti-screenshot features that produce blank captures. This is a fundamental limitation of current computer-use approaches.

nano-vLLM: Prefill vs Decode

The fundamental difference in one line:

Prefill:  100 tokens enter the GPU at once   → fast, GPU fully engaged
Decode:   only 1 token output at a time      → slow, GPU mostly waiting on memory
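The asymmetry shows up even in plain NumPy: one batched matmul and a hundred single-row matmuls do the same arithmetic, but the batched form has far higher arithmetic intensity (all sizes below are made up for illustration):

```python
import numpy as np

d = 64                             # hidden size (toy value)
W = np.random.randn(d, d)          # stand-in for a model weight matrix
prompt = np.random.randn(100, d)   # 100 prompt tokens

# Prefill: all 100 tokens pass through in ONE matmul -- compute-bound, GPU busy
prefill_out = prompt @ W

# Decode: tokens arrive one at a time -- 100 tiny matmuls, memory-bandwidth-bound
decode_out = np.vstack([prompt[i:i+1] @ W for i in range(100)])

# Same math, very different hardware utilization
assert np.allclose(prefill_out, decode_out)
```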

The Scheduler: Traffic Controller

Two queues:

  • waiting queue: new requests not yet processed
  • running queue: requests that have completed prefill, now in decode

Each schedule() call: if there are waiting requests → do PREFILL (move to running). If waiting is empty → do DECODE (process all running requests). This is the essence of continuous batching.

Preempt: if memory runs out, "kick" a running request back to waiting, release its KV cache memory, restart when space is available.
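The schedule/preempt loop above can be sketched in a few lines. The queue names follow the description; the request shape and block accounting are simplified assumptions, not nano-vLLM's actual code:

```python
from collections import deque

class Scheduler:
    """Toy sketch of continuous batching with preemption (heavily simplified)."""

    def __init__(self, max_blocks):
        self.waiting = deque()     # new requests, prefill not yet done
        self.running = deque()     # requests past prefill, now decoding
        self.free_blocks = max_blocks

    def add(self, req):
        self.waiting.append(req)

    def schedule(self):
        # Waiting requests take priority: admit as many as memory allows
        # and run a PREFILL step for them.
        if self.waiting:
            batch = []
            while self.waiting and self.free_blocks >= self.waiting[0]["blocks"]:
                req = self.waiting.popleft()
                self.free_blocks -= req["blocks"]
                batch.append(req)
                self.running.append(req)
            if batch:
                return ("prefill", batch)
        # Otherwise run one DECODE step over every running request.
        return ("decode", list(self.running))

    def preempt(self):
        # Out of memory: kick the newest running request back to waiting
        # and release its KV-cache blocks; it restarts when space frees up.
        req = self.running.pop()
        self.free_blocks += req["blocks"]
        self.waiting.appendleft(req)
```

Note the asymmetry: prefill admits a *subset* of waiting requests (bounded by memory), while decode always steps the *entire* running queue at once.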

KV Cache

Without KV cache: generating the 2nd token requires recomputing K and V for all previous tokens — wasteful duplication.

With KV cache: store computed K, V for each token. When generating a new token, read cached values directly, only compute K, V for the new token.

KV cache = scratch paper

Prefill: write understanding results for every token in prompt on scratch paper
Decode:  each new token adds one line to scratch paper, doesn't rewrite previous lines

Why decode is slow: every token generation requires reading the entire "scratch paper" from memory. The longer the paper gets, the more data needs to be read — bottleneck is memory bandwidth, not compute.
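The scratch-paper loop can be sketched as follows (a toy single-head cache with made-up weights; a real KV cache is a GPU tensor, not a Python list):

```python
import numpy as np

d = 64
Wk, Wv = np.random.randn(d, d), np.random.randn(d, d)

# The "scratch paper": K and V rows accumulated so far
k_cache, v_cache = [], []

def prefill(prompt_tokens):
    # Write K, V for every prompt token onto the scratch paper up front.
    for x in prompt_tokens:
        k_cache.append(x @ Wk)
        v_cache.append(x @ Wv)

def decode_step(new_token):
    # Compute K, V only for the NEW token; prior lines are never rewritten.
    k_cache.append(new_token @ Wk)
    v_cache.append(new_token @ Wv)
    # But the WHOLE paper is re-read every step -- this read is the bottleneck.
    return np.stack(k_cache), np.stack(v_cache)

prefill(np.random.randn(100, d))
K, V = decode_step(np.random.randn(d))   # 100 cached rows + 1 new row
```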

PagedAttention

Problem with naive KV cache: it requires contiguous memory. Request A's cache grows, but Request B sits right behind it in memory, leaving nowhere to expand.

PagedAttention solution: cut memory into fixed-size "blocks" (16 tokens' KV cache each). Requests can use non-contiguous blocks — like a file system on a hard drive. Need more space? Find a free block and attach it.

Request A:  [block 1] → [block 3] → [block 7]   ← non-contiguous is fine
Request B:  [block 2] → [block 5]
Request C:  [block 4] → [block 6] → [block 8]
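The block-table idea can be sketched with a free list, much like a file system's allocation table (a hypothetical `BlockAllocator` for illustration; the real PagedAttention logic lives in CUDA kernels):

```python
BLOCK_TOKENS = 16   # each block holds the KV cache for 16 tokens

class BlockAllocator:
    """Toy allocator mapping each request to a list of non-contiguous blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # free list of physical block ids
        self.tables = {}                      # request id -> its ordered block list

    def append_token(self, req_id, token_pos):
        table = self.tables.setdefault(req_id, [])
        if token_pos % BLOCK_TOKENS == 0:     # current block full: grab ANY free block,
            table.append(self.free.pop(0))    # no adjacency to the last one required
        return table

    def release(self, req_id):
        # Freed blocks return to the pool for other requests to reuse.
        self.free.extend(self.tables.pop(req_id, []))

alloc = BlockAllocator(num_blocks=8)
for pos in range(16):
    alloc.append_token("A", pos)        # A fills block 0
for pos in range(16):
    alloc.append_token("B", pos)        # B grabs block 1
for pos in range(16, 32):
    alloc.append_token("A", pos)        # A grows into block 2: table [0, 2]
```

Interleaving A and B is exactly the scenario that breaks contiguous allocation; here A simply ends up with the non-adjacent table `[0, 2]`.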

Notion on Model Behavior Engineers

A new role emerging at AI companies: Model Behavior Engineer (MBE), a hybrid of data scientist, PM, and prompt engineer. The core output is evals and LLM judges, not feature code; MBEs now mostly use coding agents to auto-generate evals rather than writing them by hand.

The role requires judgment about "what counts as doing well" — not general engineering ability, but specialized cognitive work. Notion deliberately made this an independent career ladder, welcoming "misfits" — background doesn't have to be engineering, but must have taste.

Notion's 3-tier eval system: CI regression tests (must pass), release quality report cards (80-90% pass to ship), and "frontier evals" (intentionally only 30% pass — used to understand model ceiling). The MBE team focuses on tier 3 — "their own final exam" for giving Anthropic/OpenAI high-quality pre-release model feedback.

The "Software Factory" Design

Simon's goal: maintain key invariants with minimal human intervention. Components:

  • Spec layer: Markdown files in the repo. Human-readable, browsable, editable — the agent's "constitution."
  • Self-verification: comprehensive test layers so agents can execute code, see test results, debug themselves, submit PRs — all in one runtime environment.
  • Defect flow: an internal issue tracker (Notion database) that agents can file problems into. A manager agent subscribes and processes them, compressing 70 daily notifications from 30 agents into 5.
  • Key invariant preservation: agents cannot modify their own permissions — a deliberately preserved human intervention point.