AI Notes — May 15

Abridge: from documentation tool to clinical intelligence layer

Core message: Abridge (founded 2018) is moving from "AI medical scribe" to "clinical intelligence layer," aiming to be AI infrastructure spanning before, during, and after a visit for the US healthcare system.

Scale: expected to support 80M+ patient-clinician conversations this year, covering 250 large US health systems, 28+ languages and 50+ specialties. Raised $300M at a $5.3B valuation in June 2025.

Three strategic phases: save time → save money → save lives

Save time: reduce 10–20 hours/week of documentation burden (eliminate "pajama time" — doctors finishing notes at home in pajamas).
Save money: help hospitals optimize billing, compliance, and prior authorization in a low-margin environment.
Save lives: improve patient outcomes via clinical decision support.

Product evolution: from passive recording to active intelligence

The goal is for AI to exist like air conditioning — working quietly in the background, intervening only when necessary. Over 90% of traditional medical alerts are ignored, so Abridge chose to prepare doctors before the visit rather than interrupt frequently during it.

Prior authorization case: traditional flow — order an MRI → rejected by insurance 4 weeks later → patient reschedules. Abridge's way — while the patient is still in the room, prompt the doctor to ask the key questions ("any physical therapy? pain longer than 6 weeks?"), satisfying insurance criteria in real time.
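
The real-time check above can be pictured as a checklist lookup against insurer criteria. A minimal sketch, assuming a hypothetical rule table: the criteria keys, the question wording, and the `missing_questions` helper are illustrative only and are not Abridge's implementation or any insurer's actual policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    key: str       # fact the insurer wants documented
    question: str  # prompt shown to the clinician if the fact is missing

# Hypothetical criteria for an MRI order; real payer rules differ.
MRI_CRITERIA = [
    Criterion("tried_physical_therapy", "Has the patient tried physical therapy?"),
    Criterion("pain_over_6_weeks", "Has the pain lasted longer than 6 weeks?"),
]

def missing_questions(documented: dict, criteria=MRI_CRITERIA) -> list:
    """Return prompts for criteria not yet covered in the conversation.

    A key present in `documented` counts as covered, whatever its value,
    because the question has already been asked and answered.
    """
    return [c.question for c in criteria if c.key not in documented]

# Facts captured so far in the visit (assumed to come from an upstream step).
facts = {"pain_over_6_weeks": True}
print(missing_questions(facts))
```

Anything still returned while the patient is in the room is a question worth asking before the order is submitted.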

Key technical challenges and approach

The hardest AI problem: achieving high quality + low latency + low cost simultaneously for real-time support in high-stakes clinical settings.

Model strategy: third-party frontier models + proprietary models trained on 100M+ medical conversations. "Every agent is essentially a coding agent"; the EHR can be viewed as the agent's "file system."

Three levels of personalization: individual doctor (style preferences), specialty (cardiology vs dermatology), health system (institutional guidelines).
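
One way to picture the three levels is as layered settings, where a more specific layer overrides a more general one. A rough sketch under stated assumptions: the setting names and values are invented, and the precedence order (doctor over specialty over health system) is an assumption, since the source does not say which layer wins in a conflict.

```python
from collections import ChainMap

# All keys and values below are hypothetical examples.
health_system = {"note_format": "SOAP", "abbreviations": "avoid"}
specialty = {"note_format": "cardiology_template"}
doctor = {"abbreviations": "allow"}

def resolve_settings(doctor: dict, specialty: dict, health_system: dict) -> dict:
    """Merge the layers; ChainMap returns the first match, so earlier maps win."""
    return dict(ChainMap(doctor, specialty, health_system))

settings = resolve_settings(doctor, specialty, health_system)
print(settings["note_format"], settings["abbreviations"])
```

If institutional guidelines must never be overridden, the argument order to `ChainMap` would simply be reversed for those keys.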

Evals: LFD (internal clinicians look at data first), LLM judge, third-party evaluators + specialty-specific evaluation, progressive rollout (analogous to Waymo's self-driving). Customers have moved from quarterly release cycles to monthly.
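
The progressive-rollout idea can be sketched as a gate that widens exposure only when evals pass. This is an illustration, not Abridge's process: the stage fractions, the 0.95 threshold, and the `next_stage` helper are all hypothetical.

```python
# Hypothetical rollout stages: fraction of clinicians who receive the new version.
STAGES = [0.01, 0.10, 0.50, 1.00]

def next_stage(current: float, eval_pass_rate: float, threshold: float = 0.95) -> float:
    """Advance one stage when evals meet the bar; otherwise hold at the current stage."""
    if eval_pass_rate < threshold:
        return current
    i = STAGES.index(current)
    return STAGES[min(i + 1, len(STAGES) - 1)]

print(next_stage(0.01, eval_pass_rate=0.97))  # evals pass, exposure widens
print(next_stage(0.10, eval_pass_rate=0.90))  # evals fail, exposure holds
```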

Privacy and compliance: all data is one-way de-identified (irreversible). Strict HIPAA / PHI controls, written into customer contracts.

A contrarian view: "PRD isn't dead, prototypes aren't a panacea"

In high-stakes, high-complexity products, clear written thinking matters more than fast prototyping. "Go slow to go fast." Early startups can throw 30 prototypes at the wall, but at Abridge's scale, each decision involves implementation cost across 200+ hospital systems.

Regulatory tailwinds

The FDA updated its clinical decision support guidance in January 2026, making it more AI-friendly. Government is pushing for interoperability across systems. High-stakes domains will actually solve the hardest AI problems first, because an 80%-good solution is not acceptable in medicine (80/20 doesn't apply).

Team: clinician scientists

"Mutants" with both an MD background and technical skills, embedded in product and eval teams, fundamentally raising the product bar.

Vision

The same patient-clinician conversation can simultaneously serve: the doctor (note), the patient (visit summary), the insurer (claims basis), and pharma (clinical trial matching) — folding today's fragmented, expensive multiple systems into one platform.

Execution isolation for agents continues to mature

W&B/CoreWeave launched CoreWeave Sandboxes for isolated execution in RL, tool-use, and eval workloads, explicitly testing destructive commands such as rm -rf / at scale. In a similar spirit, open-source and local dev tooling is appearing around agent debugging: a free local agent debugging stack exposes traces to Codex/Claude Code for automated eval authoring.

CoreWeave (the parent of Weights & Biases) launched Sandboxes in preview: isolated CPU environments, accessed via the W&B SDK, where agents can execute code, clone repos, and install dependencies. These are the things you don't want happening on your main machine after a supply-chain incident. A clear use case: agentic evaluations need a fresh, consistent environment for each test, followed by teardown, and Sandboxes provide that.
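
The fresh-environment-then-teardown lifecycle can be illustrated with the Python standard library alone. This sketch deliberately does not use the W&B SDK, whose Sandboxes API is not shown in these notes, and a temporary directory is NOT a security sandbox; the `run_in_fresh_dir` helper is hypothetical and only shows the create, run, teardown pattern.

```python
import subprocess
import sys
import tempfile

def run_in_fresh_dir(code: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run `code` in a new temp directory that is deleted afterwards.

    NOTE: this gives a clean working directory per run, not real isolation.
    """
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )

# First run leaves a file behind; the directory is removed on exit.
first = run_in_fresh_dir("open('scratch.txt', 'w').write('hi'); print('done')")
# Second run starts from an empty directory again.
second = run_in_fresh_dir("import os; print(os.listdir('.'))")
print(first.stdout.strip(), second.stdout.strip())
```

Each eval case gets its own directory, so state written by one test cannot leak into the next.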

The shift toward Hermes (Nous Research)

An interesting industry-narrative shift: several long-time AI commentators independently switched their daily agent harness this week. The common story: a popular agent harness was magical early on, but after Anthropic's pricing changes made Max-tier Opus usage through it significantly more expensive, plus months of constant breakage on every update, the magic faded — users found themselves "constantly fixing" it instead of using it.

So some moved to Codex, others to Hermes from Nous Research. Why Hermes? It's now the most-used CLI agent on OpenRouter globally, passing other harnesses and even Claude Code in OpenRouter usage. It has /goal, steering, and background computer use via the TryCUA integration. It's open, so you can port your memory, profile, and config files seamlessly.

Steering is maybe the most underrated addition — a Codex feature that also exists in Hermes: you can send a follow-up message and the agent sees it after the next tool call, not after the whole chain of thought completes. This makes the conversation much more natural.

Thinking Machines: a 276B MoE

The interaction model is a 276B-parameter MoE with 12B active parameters. The local-model community's hope: weights that can be quantized to run on small home hardware for a fully offline, always-present home assistant. The recurring theme is small, capable, local models as the foundation for ambient assistants.
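
To see what quantization buys here, a back-of-the-envelope estimate of weight storage is parameters times bits per weight, divided by 8. The sketch below is arithmetic only, assuming 1 GB = 1e9 bytes and ignoring KV cache, activations, and quantization overhead; the `weight_gb` helper is illustrative.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return params_billions * bits_per_weight / 8

for bits in (16, 8, 4):
    total = weight_gb(276, bits)   # all experts must be stored somewhere
    active = weight_gb(12, bits)   # weights actually read per token
    print(f"{bits}-bit: total {total:.0f} GB, active {active:.0f} GB")
# → 16-bit: total 552 GB, active 24 GB
# → 8-bit: total 276 GB, active 12 GB
# → 4-bit: total 138 GB, active 6 GB
```

Even at 4 bits the full weights come to roughly 138 GB, so the 12B active count helps per-token compute and bandwidth far more than it helps total memory.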