How reliable are benchmarks today?
Agentic benchmarks are in a decent place, but benchmark scores are no longer trusted as a reliable correlate of real-world performance. A key example of this gray area is Gemini 3: incredible benchmark scores paired with remarkable irrelevance in the settings where AI tools are actually being tested and deployed (agents). These gaps point to obvious and lasting flaws in our measurements.
RLVR explained
RLVR stands for Reinforcement Learning with Verifiable Rewards. Literally: "RL using rewards you can verify."
The fastest way to understand it is to compare it to RLHF.
What's wrong with RLHF
In RLHF the reward signal comes from human labelers' preference judgments — "this answer is better than that one." That judgment is subjective, fuzzy, and expensive. Worse, the model can learn to "please humans" instead of "actually do the task right" — the classic reward hacking failure mode.
The core idea of RLVR
RLVR swaps in a different reward source: only train on tasks whose answers can be objectively verified.
- Math: is the answer correct? A program can check automatically.
- Code: do the test cases pass? Just execute and see.
- Formal logic, theorem proving: a verifier decides directly.
The reward signal becomes a clean 0/1 hard signal. No humans needed, fully automatable at scale, and the model can't bluff its way through.
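The 0/1 reward idea is easy to make concrete. A minimal sketch, with hypothetical `math_reward` and `code_reward` functions I'm inventing for illustration (not any lab's actual verifier):

```python
def math_reward(model_answer: str, reference: str) -> float:
    # Verifiable reward for math: exact match against the known answer.
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(solution_src: str, tests_src: str) -> float:
    # Verifiable reward for code: 1.0 only if every test case passes.
    namespace = {}
    try:
        exec(solution_src, namespace)  # define the candidate function
        exec(tests_src, namespace)     # run the assertions against it
    except Exception:                  # any crash or failed assert -> 0
        return 0.0
    return 1.0
```

Real verifiers are fuzzier (answer normalization, sandboxing, timeouts), but the reward stays a hard binary signal: no human in the loop, nothing to flatter.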
Why it became mainstream
DeepSeek-R1 was the landmark case. They found that training with RLVR alone on math and code caused the model to spontaneously develop chain-of-thought, self-reflection, and backtracking behaviors — none of which were taught directly. RL discovered them. The result was much stronger than pure SFT, and cheaper.
Limitations
RLVR only works for tasks with an objective answer. Quality of writing, fidelity of translation, clarity of explanation — none of these can be auto-verified. RLHF and other methods are still necessary for those.
Kimi K2.6
Moonshot says K2.6 can run continuously for 12 hours, execute over 4,000 tool calls in a single session, and coordinate up to 300 parallel sub-agents on larger tasks.
My reaction to the showcase demos: too staged to really judge which design is better.
Hermes agent design patterns
1. Stateless parallel units
A normal agent remembers prior conversation and files, but when many agents run at once they contaminate each other. Adding a flag like skip_memory=True makes the agent "use-once-and-throw-away" — every run is fresh, which is what makes large-scale parallelism actually work.
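A toy sketch of the pattern, assuming a hypothetical `Agent` class and `skip_memory` flag (the flag name comes from the text; everything else here is my illustration):

```python
import concurrent.futures

class Agent:
    """Toy agent: a shared dict stands in for conversation/file memory."""
    def __init__(self):
        self.memory = {}

    def run(self, task: str, skip_memory: bool = False) -> str:
        # Stateless mode works on a fresh throwaway dict, so parallel
        # runs cannot contaminate each other through shared state.
        memory = {} if skip_memory else self.memory
        memory[task] = f"result of {task}"
        return memory[task]

agent = Agent()
with concurrent.futures.ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda t: agent.run(t, skip_memory=True),
                            [f"task-{i}" for i in range(8)]))
# agent.memory stays empty: every run was use-once-and-throw-away.
```

The design point is that isolation is the default for fan-out work; shared memory becomes something you opt into, not something you fight.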
2. Don't blindly retry on failure
The traditional approach is: task failed, try again. The smarter approach is: when something fails, structurally record "why it failed, which step broke, what tool was called," then let the LLM analyze that trace and re-plan — not brainlessly rerun.
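A minimal sketch of the retry-with-a-trace loop. The `FailureTrace` shape and `run_with_replan` helper are my assumptions, not Hermes internals; in a real harness, `replan` would be an LLM call that reads the trace:

```python
from dataclasses import dataclass

@dataclass
class FailureTrace:
    step: str    # which step broke
    tool: str    # which tool was called
    error: str   # why it failed

def run_with_replan(plan, execute, replan, max_attempts=3):
    """Instead of blind retries, hand a structured trace to the planner."""
    for _ in range(max_attempts):
        try:
            return execute(plan)
        except RuntimeError as e:
            # Tool name is hardcoded for the sketch; a real harness
            # records the actual failing call from execution logs.
            trace = FailureTrace(step=plan[0], tool="web_search", error=str(e))
            plan = replan(plan, trace)  # planner analyzes, rewrites the plan
    raise RuntimeError("all attempts failed")

# Toy demo: the first plan fails, the planner swaps in a working step.
def execute(plan):
    if plan[0] == "flaky-step":
        raise RuntimeError("tool timeout")
    return "done"

def replan(plan, trace):
    return ["fixed-step"] + plan[1:]

result = run_with_replan(["flaky-step"], execute, replan)
```

The contrast with naive retry is that the failure becomes *input* to the next attempt rather than noise to be ignored.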
3. Inject context dynamically
Different subtasks need different background knowledge. Instead of cramming everything into one giant prompt, put per-directory instruction files (AGENTS.md) in each folder. When the agent enters a directory it reads that directory's notes. Load on demand, cleaner.
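The on-demand loading is almost trivially simple, which is the point. A sketch assuming a hypothetical `load_context` helper (the `AGENTS.md` convention is from the text; the function is mine):

```python
import tempfile
from pathlib import Path

def load_context(directory: Path, filename: str = "AGENTS.md") -> str:
    """On entering a directory, read its instruction file if present.

    Context is loaded on demand instead of crammed into one giant
    prompt; a fuller version might also layer in parent directories'
    notes, most general first.
    """
    notes = directory / filename
    return notes.read_text() if notes.exists() else ""

# Demo: a directory with notes, and one without.
workdir = Path(tempfile.mkdtemp())
(workdir / "AGENTS.md").write_text("Prefer small, pure functions here.")
notes = load_context(workdir)
empty = load_context(workdir / "subdir-without-notes")
```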
The ecosystem is also shifting toward self-improving harnesses and long-running operation: hermes-skill-factory, maestro, icarus-plugin, and cloud templates, alongside the "Externalized Intelligence in LLM Agents" survey — which frames capability as increasingly living outside the model weights, in memory systems, tools, protocols, and harnesses.
The "build a website" demand is saturated
Manus and Lovable were some of the earliest, then Claude Design, then Kimi K2.6 agent.
These are web-building systems that generate full production-ready sites from a single prompt, combining high-level design, interactive visuals, and backend infra in a single run. They use React, TypeScript, Tailwind, Three.js, and similar.
My prediction: the next step is building apps. Further out, a single idea spins up a company — incorporation, legal, launch, everything bundled, maybe even the agent assigning work to humans.
Build-your-dashboard
Anthropic's new Cowork feature lets Claude build live dashboards, trackers, and internal tools connected to apps like Slack, Salesforce, Google Drive, Asana, and Jira, with reports that refresh whenever reopened. Work that once required BI software, data pipelines, and technical teams can now begin with a prompt and a permission click. In one early example, Claude built a Google and Meta ads dashboard in under a minute, pulling campaign data, spotting trends, scheduling recurring tasks, and helping generate new creatives.
The post-commodity economy
This essay by University of Chicago economist Alex Imas argues that AI automation won't eliminate human labor — it will give rise to a "post-commodity economy."
Core logic
Starting from the Starbucks case: the company could have fully automated, but instead chose to hire more baristas, handwrite names on cups, and bring back ceramic mugs — because customers want more than a cup of coffee, they want an experience. This exposes the key question: when AI can produce nearly every commodity, what becomes scarce?
The historical pattern of structural change
Economics has a precedent. Agriculture once employed 40% of the U.S. workforce; today it's under 2%. People didn't starve — mechanization made agriculture cheap, incomes rose, and spending shifted elsewhere. The key point: when people get richer, they don't just buy more of the same — they shift to goods with higher income elasticity, like better restaurants, more interesting experiences, more thoughtful service.
Mimetic desire and scarcity
The essay brings in Girard's "mimetic desire" concept: humans want what others want but can't have. This pursuit of exclusivity, status, and provenance can never be fully satisfied. The author's experiments show that willingness to pay nearly doubles once people learn that some will be excluded from access to the good. Another study shows that AI involvement weakens the sense of "one-of-a-kindness," compressing the premium.
Rise of the "relational sector"
The author predicts that AI will keep making automatable commodity production cheaper, and that's precisely what will push consumption toward "relational services" — nurses, therapists, teachers, artisans, chefs, performers, community workers. The core value in these fields is human participation itself, which AI cannot truly replicate.
Rebutting "demand collapse"
The essay also addresses the pessimistic take: AI eliminates jobs → workers lose income → purchasing power collapses → economy shrinks. The author argues that the comparative nature of mimetic desire means demand for premium, humanized products doesn't saturate easily, giving the economy enough pressure valves to absorb structural transition.
Closing
The durable work of the future isn't "monitoring AI" or "prompt engineering" — it's work where human participation itself is the value. Every past wave of tech replacing labor didn't collapse the economy, it transformed it. AI may be no different — just with the transformation pointed toward the more human, less mass-producible side.