Sakana's Conductor: AI Managing AI
Sakana AI introduced a noteworthy multi-agent result: a 7B "Conductor" model trained with reinforcement learning to orchestrate a pool of frontier models in natural language. Rather than solving tasks itself, the Conductor dynamically decides which agent to call, what subtask to assign, and which context to expose.
The reported numbers are striking: 83.9% on LiveCodeBench and 87.5% on GPQA-Diamond, beating any single worker in its pool. Hardmaru highlighted "AI managing AI" and recursive self-selection as a new axis of test-time scaling.
This raises an interesting question: is one expensive agent better, or several cheap ones with smart routing? The "recursive self-selection" idea points to something more fundamental: meta-cognitive ability matters more than raw capability. Knowing "I'm not good at this, who should handle it?" is a high-value skill in itself. The Conductor layer might be the most valuable thing to invest in, not the underlying models.
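To make the division of labor concrete, here is a minimal sketch of the orchestration loop. All names are hypothetical (run_conductor, the decision dict, the worker signature); Sakana's actual interface is natural language and RL-trained, not this API.

```python
from typing import Callable

# Hypothetical sketch of the Conductor pattern: the router never solves the
# task itself; it only decides who acts next and what context they may see.
Worker = Callable[[str, list], str]  # (subtask, visible_context) -> result

def run_conductor(conductor: Callable, workers: dict[str, Worker],
                  task: str, max_steps: int = 8) -> str:
    context: list[dict] = []
    for _ in range(max_steps):
        # In Sakana's setup the conductor is a 7B RL-trained model; here it
        # is any callable that returns a routing decision.
        decision = conductor(task, context)
        if decision["action"] == "finish":
            return decision["answer"]
        result = workers[decision["worker"]](decision["subtask"],
                                             decision["visible_context"])
        context.append({"worker": decision["worker"],
                        "subtask": decision["subtask"],
                        "result": result})
    return context[-1]["result"] if context else ""
```

The interesting property is the last field of each decision: workers only ever see the context the Conductor chooses to expose, which is exactly the "which context to expose" lever described above.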
OpenAI Quietly Building an AI-First Phone
OpenAI is reportedly working on its own AI-first smartphone, with early plans pointing to mass production around 2028. The company is said to be collaborating with Qualcomm and MediaTek on custom chips, while Luxshare Precision Industry may handle design and assembly.
Instead of apps, the device is expected to center around AI agents that execute tasks end-to-end. The hardware is being optimized for on-device AI, with heavier workloads pushed to OpenAI's cloud. If this direction holds, it could shift phones from app-driven interfaces to outcome-driven interactions, putting pressure on ecosystems built by Apple and Google.
Annotation for GUI Agents: A Different Paradigm
How do you label data for AI that operates software? Fundamentally differently from traditional NLP annotation, where the task is "text in, text out" and a labeler judges right or wrong. GUI Agent annotation needs to capture: given a screen state, what would a human do, and why?
1. Screen recording + action trajectory capture
The tooling needs to simultaneously record:
- Each frame's screenshot (or a structured UI tree dump)
- Mouse coordinates, click types, keyboard input
- Timestamps
Approaches in the wild include rrweb (for the web) and pyautogui combined with screenshots. The key design question is granularity: do you record raw pixel coordinates, or semantic elements ("clicked the Buy button")?
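As a rough illustration of the pixel-coordinate end of that spectrum, a minimal recorder using pyautogui and pynput might look like this. The file names and event schema are made up; a production tool would also dump the UI tree so clicks can be mapped to semantic elements.

```python
import json
import time

import pyautogui                    # screenshots
from pynput import keyboard, mouse  # global input hooks

events = []  # one record per user action, each paired with a screenshot

def on_click(x, y, button, pressed):
    if pressed:
        ts = time.time()
        shot = f"frame_{ts:.3f}.png"
        pyautogui.screenshot(shot)  # screen state at the moment of the action
        events.append({"t": ts, "type": "click", "x": x, "y": y,
                       "button": str(button), "screenshot": shot})

def on_press(key):
    events.append({"t": time.time(), "type": "key", "key": str(key)})

# Record until interrupted, then dump the raw trajectory. This captures pixel
# coordinates only; recovering "the Buy button" needs an accessibility or
# UI-tree dump layered on top.
with mouse.Listener(on_click=on_click), keyboard.Listener(on_press=on_press):
    try:
        while True:
            time.sleep(0.5)
    except KeyboardInterrupt:
        pass

with open("trajectory.json", "w") as f:
    json.dump(events, f, indent=2)
```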
2. Task intent annotation layer
Pure action trajectories capture "behavior", but the model needs to understand "purpose". So an intent layer has to be overlaid on top of each trajectory:
- Overall task goal ("order me malatang")
- Sub-goal at each step ("searching restaurants" → "selecting dishes" → "filling address")
- Reasoning at key decision points ("why this restaurant and not that one")
This layer is expensive because annotators need to genuinely understand task semantics.
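A plausible, purely hypothetical schema for this layer: intent fields sit alongside the raw events from the recorder, with reasoning filled in only at decision points.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: dict                 # raw event from the recorder (click, key, ...)
    sub_goal: str                # e.g. "searching restaurants"
    reasoning: str | None = None # filled only at key decision points

@dataclass
class AnnotatedTrajectory:
    task_goal: str               # e.g. "order me malatang"
    steps: list[Step] = field(default_factory=list)

traj = AnnotatedTrajectory(task_goal="order me malatang")
traj.steps.append(Step(
    action={"type": "click", "x": 412, "y": 880},
    sub_goal="selecting a restaurant",
    reasoning="higher rating and shorter delivery time than the alternative",
))
```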
3. State-action pair verification tools
The central difficulty in multi-step tasks is error propagation: get step 3 wrong and everything after it collapses. The annotation tool needs to support (see the sketch after this list):
- Replay and branching (re-record an alternate path from a given step)
- Marking "is this step optimal?"
- Annotating "why did this step fail?" (slow network, UI changed, misread the task)
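Building on the hypothetical schema above, replay-and-branch support can be as simple as sharing a trajectory prefix. The failure taxonomy here is illustrative only.

```python
from copy import deepcopy

FAILURE_REASONS = {"slow_network", "ui_changed", "misread_task"}

def branch_from(traj: AnnotatedTrajectory, k: int) -> AnnotatedTrajectory:
    """Keep steps [0, k) as a shared prefix; the annotator re-records from step k."""
    alt = deepcopy(traj)
    alt.steps = alt.steps[:k]
    return alt

def mark_step(step: Step, optimal: bool, failure: str | None = None) -> None:
    """Verification pass: flag whether the step was optimal and, if it failed, why."""
    step.action["optimal"] = optimal
    if failure is not None:
        assert failure in FAILURE_REASONS
        step.action["failure_reason"] = failure
```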
4. Cross-app context stitching
This is the hardest part. For example, "use Meituan to order takeout, then send the receipt screenshot to a friend on WeChat" involves app switching, state preservation, and context handoff. The tool needs to stitch the cross-app sequence into one complete trajectory, where "app switch" itself is a meaningful action node.
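In the schema sketched earlier, the cleanest representation is to make the switch itself a step, carrying whatever state crosses the boundary (field names again made up):

```python
traj.steps.append(Step(
    action={"type": "app_switch", "from_app": "Meituan", "to_app": "WeChat",
            "carried_state": {"receipt_screenshot": "frame_1699.png"}},
    sub_goal="handing the receipt over to the WeChat conversation",
))
```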
Industry Practice Today
- Operator-style data: OpenAI Operator and Anthropic Computer Use both rely on large amounts of human demonstration data. Annotators complete tasks in real or sandboxed environments while tools record the entire flow.
- Synthetic data: Use rules or another model to generate trajectories, then have humans review. Cross-app scenarios still suffer from poor synthetic quality.
- Crowdsourcing platforms: Scale AI, Surge, and others have started offering specialized GUI annotation pipelines — essentially turning the annotator's desktop into the annotation surface itself.
The Real Bottleneck
It's not the tooling. It's the tension between task diversity and environment consistency. Real users have different app versions, phone models, and account states, which means the "correct action sequence" for the same task differs across environments. Building a reproducible annotation sandbox while keeping the task distribution realistic is the core engineering challenge for this direction.
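One way to attack the consistency half is an environment manifest that pins every variable that changes the correct action sequence. This is a sketch of what such a manifest might record, not any platform's actual format.

```python
# Hypothetical sandbox manifest: pin everything that makes the "correct
# action sequence" differ, so two annotators given the same task produce
# comparable trajectories.
SANDBOX_MANIFEST = {
    "app":     {"name": "Meituan", "version": "12.4.2"},
    "device":  {"model": "Pixel 7", "os": "Android 14", "locale": "zh-CN"},
    "account": {"seed_state": "fresh_account_with_saved_address"},
    "network": {"profile": "typical_4g"},
}
```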
In short: annotation tools for this kind of data turn the annotator's entire desktop into the annotation medium, not a form or text box.
YC Summer 2026 Request for Startups
YC's latest RFS reads like a thesis: AI has shifted from feature to infrastructure. They want to see software, services, and chips rebuilt from scratch. There are 14 directions; five caught my eye:
- AI-Native Service Companies: Don't sell software, sell the service result. Insurance brokerage, accounting, compliance, healthcare admin. The services market is many times larger than the software market.
- Inference Chips for Agent Workflows: Most AI chips are designed for "prompt in, response out". Agents loop, branch, and hold context across dozens of steps, leaving GPUs at 30-40% utilization. Purpose-built silicon wins here. The compiler is the moat.
- Dynamic Software Interfaces: AI lets every user become their own "forward deployed engineer". The same email client shows me a task list and shows a student a calendar.
- SaaS Challengers: AI lowered software production cost 10-100x. Old SaaS moats are gone. Clone products at 1/10 the price, redesign workflows, or take on the "unassailable" giants in ERP and chip design software.
- Agent-First Software: The next billion users are AI agents, not humans. Rebuild software for agents — APIs, MCP, CLIs replacing buttons and forms. Documentation that lets agents discover and integrate autonomously.
"Forward Deployed Engineer" — Why It Matters
The term comes from enterprise practice. Salesforce sells to Boeing and stations an engineer inside Boeing to customize the software to their workflow. That's a forward deployed engineer. Historically only big customers got this treatment because it was expensive.
YC's bet: when AI coding is good enough, every regular user gets this level of customization. AI is your dedicated on-site engineer. You tell it "I want my email client to look like a task list, sorted by urgency, read items auto-collapse" and AI rewrites the software itself — not a theme swap, but regenerating an interface and interaction logic that's uniquely yours.
On Evaluations
Good evaluations are surprisingly hard to design. The best ones are simple enough to become ubiquitous but specific enough to measure something meaningful: clear outputs, fast feedback, obvious signals. Most evals fail on at least one of these. The friction of setting up infrastructure, interpreting results, or debugging failures determines whether researchers actually use an eval in their iteration loops.
A great benchmark becomes a Schelling point. Once it exists, the entire field orients around it, because everyone wants to claim they beat it; you move the field by incentivizing everyone to optimize against the same target. This is why creating the right eval is sometimes more impactful than creating the model that scores well on it.