AI Notes — May 3

AI-native organizations: when execution cost falls, the org has to be rewritten

Individuals get 15–40% faster with AI; superusers see 10x or more. Companies, however, see zero measurable gain (NBER, Goldman Sachs studies). The bottleneck isn't the tooling — it's the organization. Two root causes:

Incentive misalignment — people are paid by time, not output.
Organizational friction — meetings, approvals, alignment processes were never designed for AI speed.

Three real-world rebuild patterns we've observed:

One company is spinning all engineering off into a subsidiary, keeping only a "GenTech landing" team in-house.
Another runs 3–5 person Pods that target 5–10x the speed of traditional teams.
A third plans to lay off everyone whose primary job is writing code, keeping only AI Architects.

An AI-native org needs end-to-end product ownership, teams formed by trait rather than job title, and context infrastructure as the moat. Flattening alone fails (see Spotify, Zappos, Amazon). You need new incentives, new evaluation, new systems. The real competition isn't Claude Code vs Cursor — it's who builds the AI-native organization first.

Bottleneck 1: full-time employees' incentives don't match productivity gains

Full-time employees are paid by time, not output. If AI lets an engineer finish in three days what used to take two weeks, what do they get? Same salary — and probably more work piled on. If they spend the saved time slacking, they're materially better off. Under that incentive structure, no one will push AI to its limits.

Faros AI's study of 10,000 developers confirmed this: developers wrote more code and finished tasks faster with AI, but team delivery speed and business outcomes showed no measurable improvement. EY found that 88% of employees use AI at work, but only 5% are actually using it to transform how they work. The vast majority just use it for search and summarization.

Charlie Munger said that all his life he had understood the power of incentives better than almost anyone his age, and that all his life he had still underestimated it. For AI productivity, incentive misalignment is the decisive bottleneck.

Bottleneck 2: organizational friction eats individual gains

Traditional large companies were designed for information coordination. Meetings, status updates, cross-team alignment, approvals — these aren't bugs, they're features. They exist because, when execution was expensive, you had to make sure every execution was correct, which required heavy upfront alignment.

But AI has pushed execution cost below a critical threshold. If you can ship a prototype in hours, the logic of three days of alignment meetings collapses: just build it and let user feedback speak.

The catch: when you make one stage faster, the rest of the pipeline does not speed up with it. The code is done, but code review queues for two days. The product is ready, but legal review takes a week. The slowest stage erases the individual gain. And when AI lets you skip "necessary" work such as cross-team alignment meetings, the people whose value depended on those meetings will resist. The old production relations push back.
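A back-of-the-envelope sketch of that bottleneck effect, in Python. The stage names and durations are illustrative assumptions pieced together from the figures in these notes (two weeks of coding cut to three days, a two-day review queue, a week of legal review, counted in working days). They are not measured data.

```python
# Hypothetical stage durations in working days (illustrative, not measured).
BEFORE = {"coding": 10.0, "code_review": 2.0, "legal_review": 5.0}


def lead_time(stages):
    """Total time for one change to pass through every stage in sequence."""
    return sum(stages.values())


def with_stage(stages, stage, new_days):
    """Return a copy of `stages` with one stage set to a new duration."""
    updated = dict(stages)
    updated[stage] = new_days
    return updated


# AI cuts coding from 10 days to 3; review and legal are unchanged.
AFTER = with_stage(BEFORE, "coding", 3.0)

coding_speedup = BEFORE["coding"] / AFTER["coding"]
overall_speedup = lead_time(BEFORE) / lead_time(AFTER)

print(f"coding speedup:  {coding_speedup:.2f}x")
print(f"overall speedup: {overall_speedup:.2f}x")
```

Under these assumptions coding gets more than 3x faster, yet the change still takes 10 days instead of 17, so the end-to-end gain is only 1.7x. The untouched stages now account for most of the lead time.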

An interesting data point: METR's 2025 study found that experienced developers actually took 19% longer to complete tasks with AI tools, even though they believed they were 20% faster. Why? A likely reason is the time spent verifying AI-generated code and integrating it into existing systems. That is less an AI problem than a problem of workflows that were never redesigned for AI.

AI is a sports car, but you're driving it down an alleyway. The car is fine. Fix the road.

Three rebuild case studies

Case 1: a major Chinese tech company spinning execution off into a subsidiary. A friend who's a GM at a large Chinese tech firm is doing something radical: moving all engineering roles into a subsidiary, keeping only a small core "GenTech landing" team in the parent department. That team's job isn't to write code — it's to bridge tech and business: managing requirements, the AI toolchain, core architecture, ops, and people coordination. The logic: if AI drives the marginal cost of code production toward zero, "writing code" is no longer the core competency. The new core is judgment about technical direction, ability to orchestrate AI tools, and grasp of business context.

Case 2: an internal Pod model. A large company we're training with chose a softer but equally ambitious path: spin up 3–5 person teams outside the existing reporting structure, similar to Meta's Pod model. No traditional reporting lines, no legacy dev process — just a clear product goal and rapid iteration. The target isn't 100% of a big team's output; it's 10% of the headcount delivering 80% of the value at 5–10x the speed. Meta's Reality Labs has gone further still, dissolving the engineer/designer/PM titles in a ~1,000-person developer-tools team and replacing them with three roles: AI Builder (executor), AI Pod Lead (small team lead), AI Org Lead (org manager). Pod Leads run day-to-day; Org Leads manage performance and promotion — explicitly with AI assistance for evaluations.

Case 3: a Silicon Valley startup laying off everyone who codes. Another company we're training with is going the furthest: planning to lay off everyone whose primary job is writing code, keeping only what they call Conductors (we call the equivalent role AI Architect). Their core work isn't code — it's:

Orchestration: coordinating AI agents to collaborate efficiently.
Boundary setting: defining success criteria and finish lines for the agents.
Measurement: building evaluation systems for AI output quality.
Context management: organizing and maintaining the context the AI needs.

Their stance: if an engineer is still hand-writing code, they haven't yet learned to use AI properly. (This applies most cleanly to new products, new code, new companies.)

End-to-end product ownership

The core problem in traditional orgs: too many people accountable for process, too few accountable for outcome. Frontend owns frontend code, backend owns backend, PM owns the PRD, design owns the mocks. Everyone does their part well — and the product is still bad. Because no one owns end-to-end user experience and business outcome.

The first principle of an AI-native org: the core team owns the product end-to-end. Not the code, not the mocks — the question of whether the product creates user value and survives in the market.

Form teams by trait, not job family

Traditional teams are built by job family: you're frontend, he's backend, she's design. In an AI-native world, AI is blurring these boundaries. A good engineer with AI can produce decent design. A technically-minded PM can prototype directly. A more sensible model is to compose teams by trait rather than job title:

Builder / Pirate: turns ideas into reality. Core skills: execution and speed. One company calls this role "Pirate": get to the goal by any means and leave the rest to the Architect.
Architect: makes systems scalable and maintainable. The Builder makes it work; the Architect makes it keep working at scale. Owns system design, technical choices, and long-term tech debt.
Taste Maker: has aesthetic judgment. In an era of mass AI generation, "what is good?" becomes the central question. Owns quality, experience, and the difference between usable and delightful.
Signal Reader: understands user needs and reads market signals. Constantly runs quantitative and qualitative research, always answering: are we building what the market actually wants?
Decision Maker: can decide under uncertainty and generate effective initiatives. In small teams there are no layered approvals to absorb risk for you. Someone has to make the call with incomplete information and own the outcome.

The ideal AI-native team is 3–5 people, each with some combination of these traits and a clear primary one. They define their work by what the team needs now, not by job title.

Context is competitive advantage

One thing is severely underrated in AI-native orgs: context. The same model produces dramatically different output quality depending on the input context. This means your value in the org increasingly depends on the quality of the context you provide to AI (and to colleagues), not on how many lines of code you can output.

A three-layer methodology for Context Architecture:

Context Org Chart (concept layer): defines what context a project needs, where it lives, who maintains it, and how it should be progressively exposed to AI.
Context Architecture (technical layer): three-tier memory (working, project, organizational); progressive loading so AI gets exactly the context needed, no more and no less; continuous distillation from raw notes and discussions into high-quality insight; rule injection so your team's most important standards are enforced automatically.
Context Toolchain (execution layer): Git for context versioning; integrations with Slack/Lark for auto-capture; doc systems and PM tools for automated collection; AGENTS.md / MEMORY.md so every AI entry-point auto-loads org-level context.
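A minimal Python sketch of the technical layer, assuming one possible reading of it: organizational items act as always-injected rules, while project and working items load only when their tags match the task and the size budget allows. `ContextItem`, `TIER_ORDER`, `load_context`, and the character budget are hypothetical names invented for this illustration, not part of any existing tool.

```python
from dataclasses import dataclass

# Hypothetical load order: org-wide rules first, then project memory, then working notes.
TIER_ORDER = ("organizational", "project", "working")


@dataclass(frozen=True)
class ContextItem:
    tier: str        # "organizational", "project", or "working"
    text: str        # the content handed to the model
    tags: frozenset  # topics this item is relevant to


def load_context(items, task_tags, budget_chars):
    """Select items tier by tier, keeping only what is relevant and fits the budget."""
    task_tags = frozenset(task_tags)
    selected, used = [], 0
    for tier in TIER_ORDER:
        for item in items:
            if item.tier != tier:
                continue
            # Assumption: organizational rules are always injected;
            # other tiers need at least one tag in common with the task.
            relevant = tier == "organizational" or bool(item.tags & task_tags)
            if relevant and used + len(item.text) <= budget_chars:
                selected.append(item)
                used += len(item.text)
    return selected


items = [
    ContextItem("working", "Today: fix login bug", frozenset({"auth"})),
    ContextItem("project", "Auth service uses OAuth2", frozenset({"auth"})),
    ContextItem("project", "Billing runs nightly", frozenset({"billing"})),
    ContextItem("organizational", "All PRs need tests", frozenset()),
]

prompt_context = "\n".join(
    item.text for item in load_context(items, {"auth"}, budget_chars=1000)
)
print(prompt_context)
```

For an auth task the billing note is left out, and the org rule comes first. A real system would likely budget in tokens and rank by relevance rather than match tags, but the shape is the same: decide per task what to expose instead of loading everything.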

Core idea: the moat of an AI-native org isn't who has the better AI tools — it's who has the stronger context infrastructure.

What decision-makers should do today

Find a small enough product line and stand up an AI-native team to experiment. 3–5 people, end-to-end ownership. Outcome metrics, not process metrics. Full tools and authority. Don't make them follow legacy process.
Invest in context infrastructure. Not more AI tools — structure the team's knowledge, processes, and decision history so AI can consume it. A great context infrastructure usually beats a stronger model.
Redesign incentives. If people are paid by time and evaluated on process, they have no reason to fundamentally change how they work. Move toward outcome-based pay, or at least let AI-driven efficiency translate into something the employee can feel — autonomy, bonuses, faster promotion.
Cultivate Architects / Conductors. The scarcest future role isn't "person who writes code" — it's "person who orchestrates AI, manages context, and decides under uncertainty." That capability has to be built deliberately.

Why companies will lay off

Harmony + universal coverage = inefficiency. The team is competent and gets along, but to protect everyone's role and dignity, the productivity surplus AI brings is diluted by redundant process and meetings. Plainly: people who use AI well are forced to operate at the pace of people who don't.

We routinely observe AI-native individuals completing in days what takes traditional teams months. But inside the org these people rarely get matching incentives, and may even be at risk: going too fast makes others uncomfortable, and they push back, often unconsciously.

The PM role probably has to evolve too: from "increase connections between people" to "increase coordination between people and AI." Pursuing effectiveness is, fundamentally, a kind of anti-collaboration — fewer human-human links, more human-AI links.

Cursor UI/UX lead: software is a stack of concepts

Ryo Lu's framing: software's essence is not code or pixels but concepts and the relationships between them. "My understanding of software is that it's a concept, then relationships between concepts. Every layer is connected to the same blob of concepts. If you know what you're trying to do, there's an optimal solution."

The implication: designers shouldn't start from the page — start from the atomic concept. TikTok's essence is "list of videos," Notion's is "blocks and pages," Cursor's is "agents and models." When the underlying concept is clear, the UI is its natural extension. When the concepts are messy, no amount of pretty animation saves you from "AI Slop."

Talking to AI: imagine it has amnesia. It knows nothing you have not put in the prompt, so spell out the context every time. Prompts matter; don't get lazy.

What PMs do in the AI era

For PMs and designers: your most important job is no longer writing PRDs. It's defining "what is good", not by listing a hundred requirements, but by pointing at five examples and saying "like this." Your intuition for quality is the source of the product's soul; AI does the technical work under that soul's direction.

"Care" is the door to quality. You have to care to feel the difference between good and bad. A person with no feeling for food can't tell a good dish from a bad one; a person with no feeling for writing can't tell strong prose from weak; a person with no feeling for product can't direct AI toward something good.

This is why Pirsig wrote about "gumption" and the "gumption trap" in Zen and the Art of Motorcycle Maintenance. Quality requires you to care. Caring requires energy and enthusiasm. If you're consumed by trivia, ground down by process, drained by meaningless repetition, you lose the capacity to feel quality at all. AI in this sense should be an amplifier of human quality-judgment: it absorbs the routine labor that consumes your gumption, freeing you to do what you're best at: feel, judge, care.

Why "fine-tune" became "customization / continued pretraining"

Why don't people say fine-tune anymore? "Traditional fine-tuning is the supervised-learning playbook. It doesn't work on big models. After you do it, post-training and alignment are basically broken."

Why this matters: it punctures a habit many people know is wrong intellectually but still follow in practice. The traditional fine-tune intuition comes from the small-model era: you supervise on a downstream dataset, metrics go up, everyone's happy. But in the big-model era, the capability you actually depend on lives in the post-training stack (alignment, preference learning, verifiable rewards, tool use, reasoning style). Crudely overwrite the weights with traditional supervised fine-tuning and the common results are: instruction-following degrades, alignment warps, the output distribution narrows, and even safety boundaries break.

The popularity of "customization / continued pretraining / domain adaptation" reflects an industry recognition: we want to add capability, domain, or context length without breaking the alignment structure — not swing a supervised hammer everywhere.

The new training pipeline: pre / mid / post

The old binary was simple: capability is fully grown in pretraining; everything after is just teaching it how to behave. The new mental model carves out a middle stage:

Pre-training: still large-scale compression, learning general representations.
Mid-training: targeted addition of structural capability, such as extending the context window from 8K to 200K, injecting deep domain knowledge (code, math, medicine), or teaching a specific reasoning paradigm. These are capability extensions, not behavior preferences, and they do not belong in the original pretraining run (the data volume is insufficient, or it would pollute general capability).
Post-training: tuning the now-existing capabilities into usable behavioral policies, including alignment to human preference, reinforcement learning with verifiable rewards (RLVR, e.g. math problems whose answers can be checked automatically), tool-call style, and chain-of-thought format.
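To make "verifiable reward" concrete, here is a minimal Python sketch of a reward function for numeric math answers. It assumes the model's final answer has already been extracted as a string; `verifiable_reward` is a hypothetical name, and real RLVR setups use more careful answer parsing than this.

```python
from fractions import Fraction


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the answer equals the reference as an exact number, else 0.0.

    No learned judge is involved: the check is a deterministic comparison,
    which is what makes the reward "verifiable".
    """
    try:
        correct = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Unparseable answers (or something like "1/0") earn no reward.
        return 0.0
    return 1.0 if correct else 0.0


print(verifiable_reward("1/2", "0.5"))        # equivalent forms of the same number
print(verifiable_reward("0.51", "0.5"))       # close, but wrong
print(verifiable_reward("I think 4", "4"))    # not a parseable number
```

Comparing exact fractions means "1/2" and "0.5" count as the same answer, while an answer that is merely close gets nothing. That binary, checkable signal is the contrast with preference-based rewards, where a model or a human has to judge quality.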

What "bad data" means, and how to scale taste

A great example of bad data, courtesy of AI2: a Reddit subreddit where every comment mimics microwave sounds, and then suddenly a "ding" appears with a totally different text distribution, and the loss explodes. SEO spam and dead or parked websites also need filtering.

How can humans possibly review all of this? Taste isn't scalable. The engineering answer: "Look how much text Google can index. The search stack already does this well. Tools are scalable — the key is asking the right question and using tools well." A practical addition: "You can also use a language model to clean it — let it look at this corpus and judge whether the quality is fine."

Does using a model to clean data for a model add information? The key asymmetry: "Generating good text is harder than judging whether text is good. The model can't generate it well, but it can recognize bad text and filter it out. Finding is easier than generating — that's the asymmetry."

This is worth chewing on: in many spaces, discrimination/filtering complexity < generation/construction complexity. So "model as filter / reviewer / QC" can work in engineering even if it doesn't create new information. On synthetic data: distillation can push in this direction "for one or two steps, but where the boundary is, we still have to explore."
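A minimal Python sketch of the "model as filter" pattern. In practice the scorer would be a language model asked to rate each document; here a crude repetition heuristic stands in for it so the example runs on its own. `heuristic_quality`, `filter_corpus`, and the 0.5 threshold are hypothetical choices made for illustration.

```python
from typing import Callable, List


def heuristic_quality(text: str) -> float:
    """Stand-in scorer: share of distinct words among all words, in [0, 1].

    Highly repetitive text (spam, filler) scores low. A real pipeline
    would replace this with a language-model judgment.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def filter_corpus(
    docs: List[str], score: Callable[[str], float], threshold: float
) -> List[str]:
    """Keep only the documents whose score reaches the threshold."""
    return [doc for doc in docs if score(doc) >= threshold]


docs = [
    "buy cheap buy cheap buy cheap buy cheap",
    "The loss spiked when the model hit a batch of unusual text",
    "",
]

kept = filter_corpus(docs, heuristic_quality, threshold=0.5)
print(kept)
```

The filter never writes a sentence; it only ranks what already exists. Swapping `heuristic_quality` for a model-backed scorer changes the quality of the judgment but not the structure, which is why the discriminate-versus-generate asymmetry is cheap to exploit.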

The benchmark leakage problem

A real industry scenario worth understanding: data vendors selling "score-boosting datasets." "They might guess which benchmarks you'll test on, then paraphrase data into your training set. No exact match, but leakage. Testing only on public benchmarks is dangerous." The conclusion: every company needs its own secret evaluation benchmark and should never disclose it.
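One way a private benchmark pays off, sketched in Python: score the same model on a public set and on your held-out private set, and treat a large gap as a warning sign. The function names and the 0.10 gap threshold are assumptions for illustration. A gap can also come from the two sets differing in difficulty, so this is a signal to investigate, not proof of leakage.

```python
from typing import List, Tuple


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def leakage_signal(
    public_acc: float, private_acc: float, max_gap: float = 0.10
) -> Tuple[bool, float]:
    """Flag a model whose public score exceeds its private score by more than max_gap."""
    gap = public_acc - private_acc
    return gap > max_gap, gap


# Hypothetical results for one model on a public set and a private set.
public_acc = accuracy(["a", "b", "c", "d"], ["a", "b", "c", "d"])
private_acc = accuracy(["a", "x", "c", "y"], ["a", "b", "c", "d"])

suspicious, gap = leakage_signal(public_acc, private_acc)
print(f"public={public_acc:.2f} private={private_acc:.2f} gap={gap:.2f} suspicious={suspicious}")
```

Paraphrased training data defeats exact-match contamination checks, but it cannot raise scores on questions the vendor has never seen. That is the practical argument for keeping the private set undisclosed.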

The practical takeaway on data ordering: if you want the model to be good at something, save that data for the very end of training rather than spreading it evenly across the run. High-quality data (curated code, textbooks, professional writing) goes late. Junk data (raw web crawls) goes early. This is why "annealing" / "cooldown" phases exist now: a final pass on a batch of ultra-high-quality data near the end of training.
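A minimal Python sketch of that ordering as a sampling schedule. The two source names, the 90% cutoff, and the 10% early share of curated data are illustrative assumptions, not a published recipe. Real schedules often ramp gradually instead of switching in one step.

```python
def mixture_weights(
    progress: float,
    cooldown_start: float = 0.9,
    early_curated_share: float = 0.1,
) -> dict:
    """Sampling weights per data source at a given point in training.

    progress is the fraction of training completed, in [0, 1].
    Before the cooldown the mix is mostly raw web crawl; during the
    cooldown every batch is drawn from the curated set.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if progress < cooldown_start:
        return {"web_crawl": 1.0 - early_curated_share, "curated": early_curated_share}
    return {"web_crawl": 0.0, "curated": 1.0}


# Print the schedule at 10% steps through the run.
for step in range(11):
    progress = step / 10
    print(f"{progress:.1f} -> {mixture_weights(progress)}")
```

The contrast is with an even spread, where the curated data would get the same small share at every step. Concentrating it at the end means the last updates the model receives come from the data you care about most.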