AI Notes — May 4

Cyber psychosis: when builders start questioning themselves

Zhang Haoyang shared a conversation he had with Wang Huiwen, the Meituan co-founder. Wang said he feels closest to god during manic episodes — like "the third eye opens." Zhang admits he himself has entered a kind of hypomanic state — twenty-one days without feeling tired, adrenaline running constantly. Only after the product actually broke through did his body finally start to protest.

It's not an isolated phenomenon. He notices that almost every executive who has touched AI is grinding harder than their own employees — employees go home for the holidays while bosses are vibe-coding maniacally with AI.

"It turns out, in this era, I can do this too — and probably more efficiently, because I have more concepts to connect across domains."

163 commits in a single day

Two days before the interview, Zhang made 163 commits in one day, shipping 18 major feature versions. The whole community was stunned. He said it's like the moment a typist first discovered the computer keyboard — we're entering a new era and most people haven't realized it yet.

His workflow: he joins every user-facing channel himself; feedback from his core community (a high-net-worth group of "cyber psychosis" peers) goes straight into the Vibe Coding window within 10-20 minutes; and the result ships immediately, because he codes directly in production.

"This kind of iteration speed is impossible for traditional capital to imagine. Traditional capital plans first, calculates the business model step by step. We just burn tokens but amplify value massively."

What AI cannot copy

What can AI not copy? He gave a list — every item worth chewing on:

Premium subscription newsletters (Stratechery, The Information): selling one specific person's judgment and taste. AI can copy the prose, not the identity.
Premium consulting (McKinsey, small boutiques): selling a person across the table who bears the consequences. The AI doesn't get hauled in front of a board.
Curated brands (A24, Aesop): selling aesthetics and curation. AI can generate 1000 images; choosing the right one is human work.
Industry associations (bar associations, medical boards): selling licensing and legitimacy. AI doesn't issue you a license.
Members' clubs (Soho House): selling who is in the room. AI can't enter physical spaces and can't gather this crowd.
Parties that bear final responsibility in the physical world (doctors, lawyers, surgical hospitals, signing accountants): selling a legally accountable entity. Nobody can sue an AI when it gets it wrong.

The pattern: none of these look "tech." What they sell is not "capability" but identity, relationship, responsibility, taste, access — things that cannot be copied by code.

The unsettling implication

The list tells founders and investors something uncomfortable:

Many "AI companies" valued at $10B today may be running on a business model that is quietly expiring.

Their moat is "we tune the best" — but once skills are open-sourced, the tuning is shared. Their moat is "we have the strongest model" — but open-source is catching up. Their moat is "our agent pipeline is smoothest" — but the frameworks are on GitHub.

Anything that can be copied trends to zero (or to the price of electricity). That's the most basic law of economics, and AI won't be an exception.

Cursor's post-training as a case study

Cursor didn't pretrain a model from scratch. They took an open-source MoE base and ran large-scale RL post-training inside an agent harness that simulates Cursor's production environment, training the model's tool-call decisions and response efficiency.

A common skeptical question: isn't this just fine-tuning?

The five-month evolution from Composer 1 to Composer 2 answers it. Cursor's training pipeline went through three iterations, each one a methodology upgrade rather than a tuning tweak. Composer 1 and 1.5 were pure RL on an open-source base. By 1.5 they had scaled RL compute 20x, so that post-training compute exceeded the base model's pretraining compute, and added thinking tokens (adaptive reasoning depth) and self-summarization (automatic compression of long contexts). But they hit diminishing returns: CursorBench rose only 6.2 points from Composer 1 to Composer 1.5 despite 20x more compute.

Composer 2 made a key methodological pivot: insert continued pretraining before RL, which changes the quality of the starting point that RL explores from. The base became Kimi K2.5 (Moonshot has officially confirmed this). With continued pretraining first and RL second, CursorBench jumped 17.1 points. The Composer 2 technical report is explicit: the model is Pareto-optimal at significantly lower inference cost than peer models. In other words, Cursor's post-training pipeline does not slap a fine-tune on top and accept a quality penalty; it compresses cost and latency while keeping comparable coding ability.

Metric design: north-star vs diagnostic

Cursor splits metrics into two classes, and the classification is more valuable than the metrics themselves.

The first class is north-star metrics that sit close to user-perceived quality. The most interesting one is Keep Rate: the fraction of agent-generated code that is still in the user's codebase after a fixed window. If users repeatedly ask the agent to redo the same chunk, manually revert it, or rewrite it themselves, the agent's work wasn't really adopted. Keep Rate doesn't ask whether the model can write code; it measures user behavior: did the code survive? It's a behavioral metric, not a capability metric, which is exactly why it's closer to real value than any test-pass metric.
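To make the definition concrete, here is a minimal sketch of a Keep Rate calculation. It is not Cursor's implementation, which these notes do not describe. The AgentEdit record, the snapshot format, and the exact-line matching rule are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AgentEdit:
    """Lines an agent wrote into one file (hypothetical record format)."""
    path: str
    lines: tuple[str, ...]


def keep_rate(edits: list[AgentEdit], snapshot: dict[str, str]) -> float:
    """Fraction of agent-written lines still present in a later snapshot.

    `snapshot` maps file path -> file contents at the end of the window.
    Matching is exact per line after stripping whitespace. That is a
    simplification: moved lines count as kept, edited lines count as lost.
    """
    written = 0
    kept = 0
    for edit in edits:
        current = {ln.strip() for ln in snapshot.get(edit.path, "").splitlines()}
        for line in edit.lines:
            stripped = line.strip()
            if not stripped:
                continue  # blank lines carry no signal
            written += 1
            if stripped in current:
                kept += 1
    return kept / written if written else 0.0


edits = [
    AgentEdit("a.py", ("x = 1", "y = 2", "", "z = 3")),
    AgentEdit("b.py", ("print('hi')",)),  # file later deleted by the user
]
snapshot_after_window = {"a.py": "x = 1\nz = 3\nw = 4\n"}
print(keep_rate(edits, snapshot_after_window))  # 0.5
```

A production version would more likely diff against version-control history than compare raw lines, but the shape of the metric is the same: it looks at what users kept, not at what the model could produce.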

Another north-star is semantic-level user-response analysis. Cursor uses a language model to read what the user does after the agent's output: moving on to the next feature is a positive signal; pasting a stack trace is a negative one. The judge model introduces noise, but it provides a sustainable directional signal.

The second class is diagnostic metrics: latency, token efficiency, tool call count, cache hit rate, tool error rate. These can guide direction (lower latency, fewer tokens, a higher cache hit rate) but cannot prove on their own that the agent is good. An agent can generate wrong code very fast, or burn few tokens making useless tool calls. Diagnostics localize problems; north-star metrics define quality. You need both: with only north-star metrics, you don't know where regressions come from; with only diagnostics, you might optimize the system to be very efficient at making users unhappy.
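One way to encode that division of labor is a comparison between two runs in which only north-star metrics decide the verdict and diagnostics are reported as suspects. This is a sketch under assumptions: the metric names echo the notes, while the direction sets, the 2% tolerance, and the compare_runs function are hypothetical.

```python
from __future__ import annotations

# Metric names follow the notes; directions and tolerance are assumptions.
NORTH_STAR = ("keep_rate",)  # higher is better
LOWER_IS_BETTER = ("latency_ms", "tokens_per_task", "tool_calls", "tool_error_rate")
HIGHER_IS_BETTER = ("cache_hit_rate",)


def relative_change(before: float, after: float) -> float:
    """Signed relative change; 0.0 when the baseline is zero."""
    return (after - before) / before if before else 0.0


def compare_runs(baseline: dict[str, float], candidate: dict[str, float],
                 tolerance: float = 0.02) -> dict[str, object]:
    """Judge quality on north-star metrics only; use diagnostics to localize."""
    regressed = [
        name for name in NORTH_STAR
        if relative_change(baseline[name], candidate[name]) < -tolerance
    ]
    worse = [
        name for name in LOWER_IS_BETTER
        if relative_change(baseline[name], candidate[name]) > tolerance
    ]
    worse += [
        name for name in HIGHER_IS_BETTER
        if relative_change(baseline[name], candidate[name]) < -tolerance
    ]
    return {
        "quality_regressed": bool(regressed),
        "north_star_regressions": regressed,
        # Diagnostics are only suspects: they never decide the verdict.
        "diagnostic_suspects": worse,
    }


baseline = {"keep_rate": 0.60, "latency_ms": 1000, "tokens_per_task": 5000,
            "tool_calls": 10, "tool_error_rate": 0.05, "cache_hit_rate": 0.80}
# Faster, but users keep less of the output and tools fail more often.
faster_but_worse = dict(baseline, keep_rate=0.50, latency_ms=700, tool_error_rate=0.10)
report = compare_runs(baseline, faster_but_worse)
print(report)
```

In this example the candidate is 30% faster, yet the report still marks a quality regression and points at tool_error_rate as the place to start looking. That is the "efficient at making users unhappy" case from the paragraph above.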

Claude Code PM

This is also why Cat, a PM on the Claude Code team, spends so much time on team principles and metric readouts. The Claude Code team runs strict weekly metrics readouts so everyone has the business goals, trends, and drivers loaded in their head. The team also writes down clearly who the core users are, why them, and what is acceptable to sacrifice. That way every engineer who sees a piece of feedback knows which bucket the user belongs in; designers know which experiences can be sacrificed in interaction tradeoffs; PMM and docs know how to explain a feature when it's about to launch.

On the surface this looks like documentation. Its real function is "loop infrastructure." In the past, goal definition was mainly used for pre-launch alignment. Now it is also the substrate that supports high-frequency autonomous judgment. When many people on the team can take things end-to-end, the PM cannot stand on every path approving everything. The PM's job is to lay down the user persona, success state, failure modes, and feedback channels in advance, so the team can run very tight loops on its own.

This explains why the PM role won't disappear. When engineers can ship product directly, the message-passing kind of PM work does shrink. But as the product loop speeds up, the places that need judgment don't decrease; they get denser. The team has more ideas it can execute per day, more user feedback to handle, more experiments to ship. When execution is no longer scarce, the truly scarce question becomes: what should go into the loop, and how do we judge whether the loop is actually learning?

Product taste is cost judgment

Asked what PMs need most going forward, Cat offered a through-line that ties everything together: when code can be generated cheaply, the scarce thing is deciding what to write.

Compressed: product taste matters. But "taste" easily becomes hot air. In real work, it's a kind of cost-judgment ability.

A user says a button is in the wrong place. Old flow: PM collects more feedback, batches it for the next release — even moving a button costs a full design/eng/test/ship cycle. Under the new price table, if the change is genuinely small, an engineer or PM can ship a version internally or to a small external slice on the same day. Holding three meetings to debate moving that button doesn't match the cost anymore. But conversely: if a small feature touches permissions, billing, enterprise security, and data migration, then it looks easy but actually needs serious planning.

Three shifts follow. First, requirement-management ability gets downgraded; goal-definition ability gets upgraded. Writing PRDs still has value, especially on ambiguous features and infra projects. Cat says they still write PRDs sometimes. But PRDs are no longer the core carrier of PM value. What really determines a PM's weight is whether they can clearly define the target user, success state, failure modes, and what can be dropped, so the team stays directionally aligned in high-frequency action.

Second, engineering understanding goes from a plus to a basic judgment ability. PMs don't need to become full-time engineers, but they must be able to read the post-AI price table. Whether a feature deserves a meeting or a same-day research preview; whether a problem can be caught at the prompt layer or needs harness, eval, and product interaction together; whether a capability is something the model will pick up on its own next quarter or something that needs long-term productization: these judgments swing speed and resource allocation constantly.

Third, the PM increasingly becomes a loop designer. The loop covers how user feedback enters the team, how the team turns it into experiments, how experiments ship quickly, how users understand expectations, how the team judges success or failure, and how AI improves the next output from those signals. It involves documentation, data, release mechanics, evals, and organizational collaboration. Glued together, this is the new working surface of an AI-era PM.

This also explains why Cat says roles are merging while Anthropic still has 30-40 PMs. The title hasn't disappeared; the work under the title has changed. PMs used to translate between user problems, product solutions, and engineering implementation. Now, on the most advanced AI teams, the PM is more like the keeper of a high-speed learning system.

Defending against AI cyberattacks

Things that can be done:

Turning off auto-approve / YOLO mode is currently the highest-ROI single action. The 91% YOLO-session figure shows most users run agents in auto-approve, and most attack chains require auto-approve as a precondition.

Enable sandboxing (Claude Code's bubblewrap/Seatbelt, Codex's Landlock/Seatbelt) to bound the blast radius after manipulation. If your tool supports a sandbox but doesn't enable it by default, turn it on manually.

Treat PRs that touch .cursorrules, .claude/settings.json, .github/copilot-instructions.md, and AGENTS.md with the same review level as .github/workflows/. They control what the agent is allowed to execute on your machine.
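A small check along these lines could run in CI and flag PRs that need the stricter review. This is a sketch, not a feature of any of the tools named above: the function name is hypothetical, and matching the agent config files at any directory depth (to cover monorepos) is an assumption.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Files named in the notes. Matching them at any depth is an assumption.
AGENT_CONFIG_FILES = (
    PurePosixPath(".cursorrules"),
    PurePosixPath(".claude/settings.json"),
    PurePosixPath(".github/copilot-instructions.md"),
    PurePosixPath("AGENTS.md"),
)
WORKFLOW_DIR = (".github", "workflows")  # the review bar these files should share


def needs_elevated_review(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that control what an agent or CI may execute."""
    flagged = []
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        in_workflows = parts[:2] == WORKFLOW_DIR
        is_agent_config = any(
            parts[-len(cfg.parts):] == cfg.parts for cfg in AGENT_CONFIG_FILES
        )
        if in_workflows or is_agent_config:
            flagged.append(raw)
    return flagged


changed = [
    "src/app.py",
    ".cursorrules",
    "packages/api/AGENTS.md",
    ".github/workflows/ci.yml",
    "docs/settings.json",
]
print(needs_elevated_review(changed))
```

The list of changed paths would come from the PR diff. A non-empty result could require an extra approver or a code-owner review.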

Migrate credentials from .zshrc, .npmrc, and .env into a secrets manager. Session logs from AI coding tools record the contents of every file read.
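Before migrating, it helps to find which lines hold literal secrets. The scanner below is a rough heuristic sketch, not a replacement for a dedicated secret scanner: the name patterns are assumptions, and it only recognizes simple NAME=value assignments of the kind found in .zshrc, .npmrc, and .env files.

```python
from __future__ import annotations

import re

# Assumed heuristic: names that usually hold credentials.
SECRET_NAME = re.compile(r"(?i)(token|secret|password|passwd|api_?key|_auth)")
# Matches `export NAME=value`, `NAME=value`, and .npmrc-style `//host/:_authToken=value`.
ASSIGNMENT = re.compile(r"^\s*(?:export\s+)?([A-Za-z_/:.\-][\w/:.\-]*)\s*=\s*(.+?)\s*$")


def find_inline_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, variable name) for literal secret-looking values."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # commented-out lines are ignored in this sketch
        match = ASSIGNMENT.match(line)
        if not match:
            continue
        name = match.group(1)
        value = match.group(2).strip("'\"")
        if not SECRET_NAME.search(name):
            continue
        if not value or value.startswith("$"):
            continue  # a reference such as ${NPM_TOKEN} or $(...) is not a literal
        findings.append((lineno, name))
    return findings


sample = "\n".join([
    "# export OLD_TOKEN=abc",
    'export PATH="$HOME/bin:$PATH"',
    "export GITHUB_TOKEN=ghp_example",
    "//registry.npmjs.org/:_authToken=${NPM_TOKEN}",
    "DB_PASSWORD='hunter2'",
])
print(find_inline_secrets(sample))  # [(3, 'GITHUB_TOKEN'), (5, 'DB_PASSWORD')]
```

Lines whose value is a reference (for example an environment variable populated by a secrets manager at runtime) are deliberately not flagged, since that is the state the migration is aiming for.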

For AI agents in CI/CD, use pinned commit hashes for all dependencies and configure minimumReleaseAge. The root cause of OpenAI's Axios incident was a floating tag with no cooldown.
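As one illustration of the pinning half of this advice, the sketch below scans a GitHub Actions workflow for `uses:` references that are not pinned to a full 40-character commit SHA. Applying the advice to GitHub Actions is an assumption (the notes say only "all dependencies"), the function name is hypothetical, and the check does not cover minimumReleaseAge, which lives in the dependency-update tool's own configuration.

```python
from __future__ import annotations

import re

USES = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return `uses:` references that float on a tag or branch."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        _, _, version = ref.partition("@")
        if not FULL_SHA.match(version):
            unpinned.append(ref)
    return unpinned


workflow = """
jobs:
  build:
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@0123456789abcdef0123456789abcdef01234567  # placeholder SHA
      - uses: ./local-action
      - name: deploy
        uses: some-org/some-action@main
"""
print(unpinned_actions(workflow))
```

A tag such as v4 or a branch such as main can be moved to point at new code without any change in the repository that consumes it, which is the floating-reference risk described above. A commit SHA cannot be moved that way.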