AI Notes — May 14

A classic model-training failure caused by data leakage

This is a fascinating and instructive story — a real case of an AI deployment failing in hospitals.

Background: predicting sepsis

What is sepsis? A dysregulated immune response to infection that can rapidly cause organ failure. It's one of the deadliest emergencies in hospitals: a commonly cited estimate is that every hour of delayed treatment raises mortality by 4–8%. So an AI that flags which patients will develop sepsis a few hours early could save many lives.

Who is Epic? Epic Systems is the largest electronic health record (EHR) company in the US; over 40% of US hospitals use its software to manage patient data. It shipped an AI sepsis-prediction tool and claimed high internal accuracy, and the tool was enabled by default at hundreds of hospitals.

The failure: much worse in real hospitals

In 2021, JAMA Internal Medicine published an external validation study with shocking results: the tool missed 67% of real sepsis cases (two-thirds of septic patients went undetected) while producing a flood of false alarms. Doctors were frequently interrupted for nothing, and the resulting "alert fatigue" led them to start ignoring alerts, including real ones.
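
To make the headline number concrete: missing 67% of cases is the same as a sensitivity of 33%. The sketch below computes the miss rate and the share of alerts that were real from confusion-matrix counts. The counts are hypothetical and chosen only to reproduce the 67% figure; they are not the study's data, and the false-alarm count in particular is made up for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real sepsis cases the tool flagged."""
    return true_pos / (true_pos + false_neg)

def precision(true_pos, false_pos):
    """Fraction of alerts that pointed at a real sepsis case."""
    return true_pos / (true_pos + false_pos)

# Hypothetical counts, not taken from the study.
true_pos = 33    # septic patients who triggered an alert
false_neg = 67   # septic patients the tool missed
false_pos = 200  # alerts on patients without sepsis (illustrative only)

miss_rate = 1 - sensitivity(true_pos, false_neg)
alert_precision = precision(true_pos, false_pos)

print(f"missed cases: {miss_rate:.0%}")
print(f"alerts that were real: {alert_precision:.0%}")
```

When only a small share of alerts point at a real case, each alert carries little information, which is the mechanism behind alert fatigue.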

The puzzle: internal testing looked great, so why was real-world performance so bad?

The core problem: the model was "cheating" with future information

This is the key technical point. It's called data leakage (or target leakage), a classic ML pitfall. When training the model, researchers fed it hundreds of features (temperature, heart rate, white-cell count, medication records…) to learn "what kind of patient gets diagnosed with sepsis."

One feature was: "has the doctor already prescribed antibiotics?" It sounds fine — antibiotic prescription correlates strongly with sepsis. But the logical order is wrong:

Real clinical flow: doctor first suspects/diagnoses sepsis → then prescribes antibiotics.
What the model learned: doctor prescribed antibiotics → this patient has sepsis.

In other words, the "predictive signal" the model learned is actually "the doctor's reaction after already diagnosing." It wasn't predicting — it was parroting the doctor's diagnosis.

Why internal testing couldn't catch this

During internal testing: researchers validated on historical case data. Those records are complete, so antibiotic orders that were written after the diagnosis are already in the data alongside the diagnosis itself. The model sees "this patient was prescribed antibiotics" and predicts "sepsis," looking incredibly accurate.

During real deployment: the model must warn before the patient is diagnosed. But at that point the doctor hasn't prescribed antibiotics yet. By the time the order exists, the doctor has already suspected sepsis and no longer needs an AI prompt. The model's most-relied-on feature is gone, and accuracy collapses.
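
The gap between the two settings can be reproduced with a toy simulation. This is a minimal sketch on synthetic data, not Epic's model: the rates, the two features, and the one-rule "model" are all illustrative assumptions. It also simplifies deployment by assuming no antibiotic order exists for anyone at prediction time.

```python
import random

def make_patients(n, rng):
    """Synthetic retrospective records; all rates are illustrative assumptions."""
    rows = []
    for _ in range(n):
        sepsis = rng.random() < 0.10
        # Weak but genuine early signal, available before diagnosis.
        fever = rng.random() < (0.60 if sepsis else 0.30)
        # Leaky signal: the order is written after the clinician suspects sepsis.
        antibiotics = rng.random() < (0.95 if sepsis else 0.05)
        rows.append({"fever": fever, "antibiotics": antibiotics, "sepsis": sepsis})
    return rows

def as_seen_before_diagnosis(rows):
    """Simplification: hours before diagnosis, no antibiotic order exists yet."""
    return [{**r, "antibiotics": False} for r in rows]

def fit_one_rule(rows, features):
    """Toy 'model': pick the single feature that best matches the label."""
    def accuracy(feature):
        return sum(r[feature] == r["sepsis"] for r in rows) / len(rows)
    return max(features, key=accuracy)

def sensitivity(rows, feature):
    """Share of septic patients that the chosen rule flags."""
    septic = [r for r in rows if r["sepsis"]]
    return sum(r[feature] for r in septic) / len(septic)

rng = random.Random(0)
train = make_patients(5000, rng)
test = make_patients(5000, rng)

chosen = fit_one_rule(train, ["fever", "antibiotics"])
retro_sens = sensitivity(test, chosen)
deploy_sens = sensitivity(as_seen_before_diagnosis(test), chosen)

print("feature the model relies on:", chosen)
print(f"sensitivity on historical records: {retro_sens:.2f}")
print(f"sensitivity before diagnosis:      {deploy_sens:.2f}")
```

The held-out test set does not help here, because it contains the same leaked feature as the training set. Only an evaluation that reconstructs what was known at prediction time exposes the problem.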

This is what the source meant by: "the model used a feature from the future, depending on a variable that has a causal dependency on the outcome."

"From the future" means: relative to the moment the model should predict, "the doctor prescribed antibiotics" happens later.

"Causal dependency" means: the antibiotic prescription is not the cause of sepsis but the result of it (prescribed only after diagnosis). The model reversed causality: it thought "prescribed antibiotics → will get sepsis," when really it's "has sepsis → then gets prescribed antibiotics."
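
One common guard against "features from the future" is to build features only from events recorded before the moment of prediction. The following is a minimal sketch, assuming every record carries a timestamp. The field names, the 4-hour lead time, and the helper events_known_at are hypothetical and not taken from Epic's system.

```python
from datetime import datetime, timedelta

def events_known_at(events, prediction_time):
    """Keep only events recorded strictly before the prediction time."""
    return [e for e in events if e["time"] < prediction_time]

# Hypothetical patient timeline.
diagnosis_time = datetime(2021, 6, 1, 14, 0)
prediction_time = diagnosis_time - timedelta(hours=4)  # model should warn at 10:00

events = [
    {"name": "heart_rate", "value": 118, "time": datetime(2021, 6, 1, 9, 30)},
    {"name": "temperature_c", "value": 38.9, "time": datetime(2021, 6, 1, 9, 45)},
    # Written after the diagnosis, so it must not reach the model.
    {"name": "antibiotic_order", "value": 1, "time": datetime(2021, 6, 1, 14, 20)},
]

visible = events_known_at(events, prediction_time)
visible_names = [e["name"] for e in visible]
print(visible_names)  # ['heart_rate', 'temperature_c']
```

Applying the same cutoff when building both training and evaluation data keeps retrospective accuracy comparable to what the model can achieve in deployment.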

The future of human work

More and more human work is becoming "controlling AI." Just as factory workers shifted from manual labor to monitoring machines after the industrial revolution, future roles will increasingly revolve around specification and supervision.

Claude pricing

Claude's pricing-change notice is well done, though it's not what subscription users want to hear: every Claude subscription now gets a monthly API token credit equal to the dollar amount of the plan. So if you pay $200, you get both a Claude subscription, with its own usage limits for interactive tools like Claude.ai and Claude Code ("interactive use"), and $200 of API credit for programmatic use of Claude (e.g. claude -p and elsewhere).

Pretraining efficiency and architecture experiments

This is the strongest research thread.

Nous Research's token-stacking training modifies the early stage of pretraining so the model reads and predicts a continuous packet of tokens before reverting to standard next-token prediction. They report a 2–3x wall-clock speedup at matched FLOPs with no inference-time architecture change, validated on dense models from 270M to 3B and on a 10B-A1B MoE.

Jonas Geiping et al. argue that current message/chat training over-constrains agents to a single stream, and released a multi-stream LLM paper claiming lower latency, cleaner separation of concerns, and more readable parallel reasoning and tool use.

δ-mem proposes an external online associative memory attached to a frozen full-attention backbone, reportedly improving the average score by 1.10x with an 8x8 memory state and by 1.15x over a non-δ-mem baseline, with larger gains on memory-intensive benchmarks.

Trajectory dataset

A note to track a large SWE trajectory dataset (SWE-ZERO-12M-trajectories) for later study.

Figure runs an 8-hour autonomous robot shift

Figure AI livestreamed an 8-hour autonomous factory shift using teams of its Figure 03 humanoids powered by the company's Helix-02 system. The robots sort small packages by detecting barcodes from camera input, picking up packages, and placing them face-down onto conveyors for scanning. Figure says the system now operates at roughly human speed, averaging around one package every 3 seconds. The robots run entirely onboard without cloud inference, using a single neural network for vision, movement, balance, and manipulation.
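
For a sense of scale, the quoted rate can be turned into a per-shift count. This is back-of-the-envelope arithmetic that assumes the 3-second average holds for the full 8 hours with no downtime. Whether the rate is per robot or per station is not stated, so the result is per whatever unit the rate applies to.

```python
SECONDS_PER_PACKAGE = 3  # "around one package every 3 seconds"
SHIFT_HOURS = 8

packages_per_hour = 3600 // SECONDS_PER_PACKAGE
packages_per_shift = packages_per_hour * SHIFT_HOURS

print(packages_per_hour, packages_per_shift)  # 1200 9600
```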

More notably, the robots coordinate with each other to keep the system running continuously. When battery levels drop, robots autonomously request replacements to minimize downtime. If a robot detects a fault, it can reportedly diagnose the issue itself, walk to maintenance, and request another unit to take over. Figure says the long-term goal is continuous 24/7 operation.

Companies like Figure are starting to test whether robots can function as persistent labor infrastructure in real operating environments. In robotics, reliability matters more than demos. A robot is only economically valuable when it can run for long stretches, recover from problems, coordinate with other machines, and keep working without constant human supervision.