WeChat Anti-Screenshot
Running Claude Computer Use against certain apps (WeChat among them) can trigger their protection mechanisms, such as anti-screenshot features that produce blank captures. This is a fundamental limitation of current computer-use approaches.
nano-vLLM: Prefill vs Decode
The fundamental difference in one line:
Prefill: all prompt tokens (say, 100) enter the GPU simultaneously → fast, GPU fully engaged (compute-bound)
Decode: only 1 token is generated at a time → slow, GPU mostly waiting on memory (memory-bound)
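A quick way to feel this difference is a CPU NumPy sketch (illustrative only, not nano-vLLM code): the same 100 tokens go through one weight matrix either all at once or one at a time.

```python
import numpy as np
import time

hidden = 4096
W = np.random.randn(hidden, hidden).astype(np.float32)   # one weight matrix

# Prefill: one matrix-matrix multiply covers all 100 prompt tokens;
# W is read from memory once and reused for every token (compute-bound).
prompt = np.random.randn(100, hidden).astype(np.float32)
t0 = time.perf_counter()
_ = prompt @ W
prefill_s = time.perf_counter() - t0

# Decode: 100 sequential 1-token multiplies; W is re-read from memory
# on every step, so throughput is limited by memory bandwidth.
tokens = np.random.randn(100, 1, hidden).astype(np.float32)
t0 = time.perf_counter()
for tok in tokens:
    _ = tok @ W
decode_s = time.perf_counter() - t0

print(f"prefill {prefill_s:.4f}s vs decode {decode_s:.4f}s for the same 100 tokens")
```

The total arithmetic is identical; only the memory access pattern differs, and that alone makes the token-by-token loop far slower.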
The Scheduler: Traffic Controller
Two queues:
- waiting queue: new requests not yet processed
- running queue: requests that have completed prefill, now in decode
Each schedule() call: if there are waiting requests → do PREFILL (move to running). If waiting is empty → do DECODE (process all running requests). This is the essence of continuous batching.
Preempt: if memory runs out, "kick" a running request back to waiting, release its KV cache memory, restart when space is available.
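A minimal sketch of that loop, assuming a simple block-counting memory model (the names echo nano-vLLM's concepts, but this is illustrative, not the project's actual code):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    blocks_needed: int   # KV cache blocks this request's prompt requires
    prefilled: bool = False

class Scheduler:
    def __init__(self, total_blocks: int):
        self.waiting = deque()   # new requests, prefill not yet done
        self.running = deque()   # prefilled requests, now decoding
        self.free_blocks = total_blocks

    def schedule(self):
        # Waiting requests take priority: batch as many prefills as memory allows.
        if self.waiting:
            batch = []
            while self.waiting and self.waiting[0].blocks_needed <= self.free_blocks:
                req = self.waiting.popleft()
                self.free_blocks -= req.blocks_needed
                req.prefilled = True
                self.running.append(req)
                batch.append(req)
            if batch:
                return "prefill", batch
        # Waiting is empty (or nothing fits): advance all running requests one token.
        return "decode", list(self.running)

    def preempt(self):
        # Out of memory: kick the newest running request back to waiting and
        # reclaim its KV cache blocks; it redoes prefill when space frees up.
        req = self.running.pop()
        self.free_blocks += req.blocks_needed
        req.prefilled = False
        self.waiting.appendleft(req)
```

Interleaving prefill and decode this way is what keeps the GPU fed: new requests are admitted as soon as they arrive instead of waiting for the current batch to drain.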
KV Cache
Without KV cache: generating the 2nd token requires recomputing K and V for all previous tokens — wasteful duplication.
With KV cache: store computed K, V for each token. When generating a new token, read cached values directly, only compute K, V for the new token.
KV cache = scratch paper
Prefill: write the computed results (K, V) for every prompt token onto the scratch paper
Decode: each new token adds one line to the scratch paper, without rewriting previous lines

Why decode is slow: every generated token requires reading the entire "scratch paper" from memory. The longer the paper gets, the more data must be read per step; the bottleneck is memory bandwidth, not compute.
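A single-head NumPy sketch of the scratch paper (illustrative, not nano-vLLM's implementation): prefill fills the cache in one shot, and each decode step appends one entry but reads back all of them.

```python
import numpy as np

d = 64
Wk = np.random.randn(d, d).astype(np.float32)
Wv = np.random.randn(d, d).astype(np.float32)

k_cache, v_cache = [], []   # the "scratch paper": one K row and one V row per token

def prefill(prompt_states):
    # One shot: compute and cache K, V for every prompt token.
    k_cache.extend(prompt_states @ Wk)
    v_cache.extend(prompt_states @ Wv)

def decode_step(new_state):
    # Compute K, V only for the new token; prior tokens come from the cache.
    k_cache.append(new_state @ Wk)
    v_cache.append(new_state @ Wv)
    K = np.stack(k_cache)   # (seq_len, d): the whole cache is re-read every step
    V = np.stack(v_cache)
    scores = K @ new_state / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V      # attention output for the new token

prefill(np.random.randn(100, d).astype(np.float32))
out = decode_step(np.random.randn(d).astype(np.float32))
```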
PagedAttention
Problem with naive KV cache: it needs contiguous memory. Request A's cache grows, but Request B sits right behind it in memory, so there is nowhere to expand.
PagedAttention solution: cut memory into fixed-size "blocks" (16 tokens' KV cache each). Requests can use non-contiguous blocks — like a file system on a hard drive. Need more space? Find a free block and attach it.
Request A: [block 1] → [block 3] → [block 7] ← non-contiguous is fine
Request B: [block 2] → [block 5]
Request C: [block 4] → [block 6] → [block 8]
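A toy version of the bookkeeping behind that picture (the 16-token block size comes from the text above; everything else is an illustrative sketch, not vLLM's code):

```python
BLOCK_SIZE = 16   # tokens of KV cache per block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop(0)   # any free block works; no contiguity required

    def release(self, block_ids):
        self.free.extend(block_ids)   # preemption returns blocks here

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Crossing a block boundary? Grab any free block and link it in.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=64)
a, b = Sequence(alloc), Sequence(alloc)
for step in range(40):
    a.append_token()
    if step < 20:
        b.append_token()
print(a.block_table)   # [0, 2, 4]: non-contiguous is fine
print(b.block_table)   # [1, 3]
```

The block table plays the role of a file system's allocation table: attention kernels look up each logical position's physical block, so sequences can grow without ever being moved.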
Notion on Model Behavior Engineers
A new role emerging at AI companies: the Model Behavior Engineer (MBE), a hybrid of data scientist, PM, and prompt engineer. The core output is evals and LLM judges, not feature code. MBEs now mostly use coding agents to auto-generate evals rather than writing them manually.
The role requires judgment about "what counts as doing well": not general engineering ability, but specialized cognitive work. Notion deliberately made this an independent career ladder and welcomes "misfits": a candidate's background doesn't have to be engineering, but they must have taste.
Notion's 3-tier eval system: CI regression tests (must pass), release quality report cards (80-90% pass to ship), and "frontier evals" (intentionally only 30% pass — used to understand model ceiling). The MBE team focuses on tier 3 — "their own final exam" for giving Anthropic/OpenAI high-quality pre-release model feedback.
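A hypothetical sketch of how such tiered gates could be wired up (the thresholds follow the numbers above; the names and structure are invented, not Notion's tooling):

```python
# Hypothetical tiered eval gate; thresholds follow the text above.
TIERS = {
    "ci_regression": 1.00,        # tier 1: must pass before merge
    "release_report_card": 0.85,  # tier 2: roughly 80-90% to ship
    "frontier": 0.30,             # tier 3: ~30% by design; tracks the ceiling
}

def pass_rate(results: list[bool]) -> float:
    return sum(results) / len(results)

def can_ship(ci: list[bool], report_card: list[bool]) -> bool:
    return (pass_rate(ci) >= TIERS["ci_regression"]
            and pass_rate(report_card) >= TIERS["release_report_card"])

def frontier_headroom(frontier: list[bool]) -> float:
    # Not a ship gate: a low pass rate here is expected and informative.
    return pass_rate(frontier)

print(can_ship(ci=[True] * 50, report_card=[True] * 17 + [False] * 3))  # True
print(frontier_headroom([True] * 3 + [False] * 7))                      # 0.3
```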
The "Software Factory" Design
Simon's goal: maintain key invariants with minimal human intervention. Components:
- Spec layer: Markdown files in the repo. Human-readable, browsable, editable — the agent's "constitution."
- Self-verification: comprehensive test layers so agents can execute code, see test results, debug themselves, submit PRs — all in one runtime environment.
- Defect flow: an internal issue tracker (Notion database) that agents can file problems into. A manager agent subscribes and processes them, compressing 70 daily notifications from 30 agents into 5.
- Key invariant preservation: agents cannot modify their own permissions, a deliberately preserved human intervention point (sketched below).
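A hypothetical sketch of that invariant as a policy check (all names are invented; this is not the actual system):

```python
# Any write that touches an agent's own permission config is refused
# and routed to a human: the deliberately preserved intervention point.
PERMISSION_PATHS = ("permissions/", ".agent-policy")

def escalate_to_human(agent_id: str, path: str) -> None:
    print(f"[escalation] {agent_id} attempted to modify {path}; human review required")

def authorize(agent_id: str, action: str, path: str) -> bool:
    if action == "write" and path.startswith(PERMISSION_PATHS):
        escalate_to_human(agent_id, path)
        return False
    return True

assert authorize("agent-07", "write", "src/main.py")
assert not authorize("agent-07", "write", "permissions/agent-07.yaml")
```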