AI Notes — May 13

What to do this week

Figure out your longest agent run. METR's work suggests that task duration may be a good proxy for difficulty. Ask: what's the longest stretch you've trusted an agent to run on autopilot? If you don't know, you can't extend it.

Perplexity: how to design skills

Perplexity, an AI search company building agentic research and browsing tools, published its methodology for designing agent skills. The main lesson: don't start with the skill, start with the tests.

Write the evals first. Pull five to ten cases from production queries, known failures, and edge cases. Include negative examples — queries that should not invoke this skill.
Phrase triggers like a human would. Start with "Load when…" and use the language your users use. Instead of "monitors pull requests," try "babysit a PR," "watch CI," or "make sure this lands." The skill loads without your team needing a specific command or technical phrase.
Write the body in principles, not procedures. The model already knows commands; it needs direction on how to apply them. Instead of listing detailed steps (check out a branch, cherry-pick files, check for conflicts…), write instructions like "Cherry-pick the commit onto a clean branch. Resolve conflicts preserving intent."
Codify failures into lessons. When the agent fails in production, write the failure mode into the skill file. The mistake becomes a standing instruction guarding against future mistakes.
Edit instructions rigorously. For every line you add, ask: "Would the agent get this wrong without this?" If not, cut it. Every extra line adds context cost.
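
The eval-first step above can be sketched as a tiny test harness. Everything here is hypothetical and only for illustration: the `EvalCase` structure, the keyword-based `should_load` router (a real system would let the model decide), and the two negative queries. It is a minimal sketch of the idea, not Perplexity's implementation.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    query: str
    should_trigger: bool  # False marks a negative example


# Hypothetical cases: three positive phrasings from the text above,
# plus two invented queries that should NOT load the skill.
CASES = [
    EvalCase("babysit a PR", True),
    EvalCase("watch CI on my branch", True),
    EvalCase("make sure this lands", True),
    EvalCase("summarize this design doc", False),
    EvalCase("what's the weather in Berlin", False),
]

# Stand-in trigger phrases, written in the users' own language.
TRIGGER_PHRASES = ("babysit", "watch ci", "make sure this lands")


def should_load(query: str) -> bool:
    """Stand-in router: case-insensitive substring match on trigger phrases."""
    q = query.lower()
    return any(phrase in q for phrase in TRIGGER_PHRASES)


def run_evals(cases, router):
    """Return the cases where the router's decision differs from the label."""
    return [c for c in cases if router(c.query) != c.should_trigger]


failures = run_evals(CASES, should_load)
print(f"{len(CASES) - len(failures)}/{len(CASES)} cases passed")
for case in failures:
    print("FAILED:", case.query)
```

Writing the cases before the skill body means every later change to the trigger wording can be checked against both the positive and the negative examples.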

Benchmarks keep getting harder

Research-level reasoning benchmarks keep escalating. Soohak introduces 439 research-level math problems authored from scratch by 64 mathematicians (including 38 faculty), explicitly targeting capabilities above standard olympiad-style math. In medical evaluation, SophontAI released Medmarks v1.0, expanding its open medical benchmark suite from 20→30 benchmarks and 46→61 models. There is also growing sentiment that older evals are saturating, and that benchmarks with uniformly high scores should be retired in favor of lower-scoring, frontier-challenging tests.

Blackwell as the reference platform for large-MoE serving

Blackwell racks are emerging as the reference platform for serving large MoEs. Perplexity published details on serving post-trained Qwen3 235B on NVIDIA GB200 NVL72, arguing GB200 is a major inference step up over Hopper. Benchmarks cite NVLS all-reduce latency dropping from 586.1µs on H200 to 313.3µs on GB200, and MoE prefill combine at EP=4 dropping from 730.1µs to 438.5µs, with better decode throughput at high token rates. This materially changes prefill/decode disaggregation for serving large MoEs.
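
To put the quoted latencies in relative terms, here is a small calculation using only the four figures above. The `reduction` helper is a name made up for this sketch, and the numbers are taken at face value from the cited benchmarks.

```python
def reduction(before_us: float, after_us: float) -> tuple[float, float]:
    """Return (percent latency reduction, speedup factor)."""
    pct = (before_us - after_us) / before_us * 100
    speedup = before_us / after_us
    return pct, speedup


# (H200 latency, GB200 latency) in microseconds, as quoted in the text.
measurements = {
    "NVLS all-reduce": (586.1, 313.3),
    "MoE prefill combine (EP=4)": (730.1, 438.5),
}

for name, (h200_us, gb200_us) in measurements.items():
    pct, speedup = reduction(h200_us, gb200_us)
    print(f"{name}: {pct:.1f}% lower latency, {speedup:.2f}x faster")
```

That works out to roughly a 46.5% latency reduction for all-reduce and about 39.9% for the prefill combine step.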

Demis Hassabis on AI for health

Demis Hassabis has long believed that the number one application of AI should be improving human health. That work started with AlphaFold and now continues at Isomorphic Labs, whose mission is to reimagine drug discovery and one day solve all disease, now backed by $2.1B in new funding.

Safer codegen is becoming its own research track

GitHub's pull_request_target trigger remains one of the sharpest CI/CD traps in fork-based PR automation: the workflow runs in the base repository's context with access to its secrets, so checking out and executing code from a fork can expose them. At the workstation level, the recommendation is to move secrets out of ubiquitous local .env files into a proper secrets manager. Stanford-aligned work on SecureForge targets vulnerability discovery and prevention in LLM-generated code via prompt optimization, framing it as a bridge between codegen and security evaluation. The broader point: coding agents are now strong enough that supply-chain hardening and secure-generation evaluation need to be treated as core infra, not side concerns.
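
As an illustration of the pull_request_target trap, the sketch below flags workflow files that combine that trigger with a reference to the pull request's head ref or SHA, which is the usual way untrusted fork code ends up running with the base repository's secrets. The `looks_risky` function and the sample workflow are hypothetical, and the check is a plain text heuristic, not a replacement for a real workflow auditor.

```python
import re

# Expressions that resolve to the fork's (untrusted) branch or commit.
UNTRUSTED_REF = re.compile(
    r"github\.event\.pull_request\.head\.(sha|ref)|github\.head_ref"
)


def looks_risky(workflow_text: str) -> bool:
    """Heuristic: pull_request_target trigger plus any reference to the PR head."""
    uses_target = re.search(r"\bpull_request_target\b", workflow_text) is not None
    references_head = UNTRUSTED_REF.search(workflow_text) is not None
    return uses_target and references_head


# Hypothetical workflow that checks out fork code under pull_request_target.
SAMPLE = """\
on: pull_request_target
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - run: npm test
"""

print(looks_risky(SAMPLE))
```

A text match like this will miss indirect cases (for example, a ref passed through an environment variable) and can flag workflows that never execute the checked-out code, so treat a hit as a prompt for manual review.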

1T-parameter model on Intel Optane

A high-memory Xeon build using Intel Optane DC Persistent Memory DIMMs reportedly runs Kimi K2.5, a ~1T-parameter MoE model, locally at about 4 tokens/s via llama.cpp hybrid GPU/CPU inference. The key technical point: 768GB of Optane PMem runs in Memory Mode (Optane appears as system RAM) with 192GB of DDR4 ECC DRAM acting as cache, so the model's sparse expert weights live in PMem while the attention, dense, shared-expert, and routing tensors fit on an RTX 3060 12GB using override-tensor or ngl auto/cmoe. Commenters debated whether a higher-core-count Cascade Lake Xeon would help, whether Optane Storage Mode plus mmap might outperform Memory Mode, and whether 4 tokens/s is practically tolerable for interactive use.
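
A rough back-of-the-envelope check shows why weight precision matters for fitting a model of this size into 768GB. The write-up above does not state which quantization was used, so the bit widths below are assumptions for illustration only; the sketch also treats "~1T" as exactly 1e12 weights, uses decimal gigabytes, and ignores KV cache, quantization metadata, and the tensors held on the GPU.

```python
PARAMS = 1.0e12   # assumption: "~1T" taken as exactly 1e12 weights
PMEM_GB = 768     # Optane PMem capacity quoted in the text
BYTES_PER_GB = 1e9  # decimal gigabytes


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in GB at a given precision."""
    return params * bits_per_weight / 8 / BYTES_PER_GB


# Hypothetical precisions; none of these is confirmed by the source.
for bits in (16, 8, 4, 2):
    size = weights_gb(PARAMS, bits)
    verdict = "fits" if size <= PMEM_GB else "does not fit"
    print(f"{bits:>2} bits/weight: {size:,.0f} GB -> {verdict} in {PMEM_GB} GB")
```

Under these assumptions, 16-bit and 8-bit weights (about 2,000GB and 1,000GB) would exceed the PMem capacity, while 4-bit weights (about 500GB) would fit with room to spare.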

Google: AI to replace clicking and typing

Ahead of I/O, Google unveiled new Gemini-powered products: an AI-native "Googlebook" laptop, a cross-device Gemini Intelligence system for Android, and a new AI-powered cursor called Magic Pointer. Magic Pointer lets users point at something on screen while Gemini understands the surrounding context and responds to voice commands directly. Examples: point at a building in a video → "show me the route"; point at a date in an email → create a meeting; point at a table → turn it into a chart. The system works across Chrome, Android apps, and Google services without constantly opening chat windows or writing detailed prompts. Google also announced AI-assisted dictation, auto-browsing in Chrome, and AI-generated widgets.

Why real-time communication matters

Most AI still works like messaging software: type something, wait, get a response. Human collaboration doesn't work that way. Thinking Machines is betting the next shift in AI is not just better intelligence but better interaction. If these systems improve, AI could start feeling less like using software and more like working alongside something that can listen, react, and stay present in real time.