Local Model Rankings (Based on Reddit)

The top models people are actually recommending, not just the benchmark leaders:

  • Qwen 3.5 — most broadly recommended family right now across use cases
  • Gemma 4 — strong recent buzz for local usability, especially for smaller and mid-sized deployments
  • GLM-5 / GLM-4.7 — near the top of broad open-model rankings, increasingly part of the "best overall" conversation
  • MiniMax M2.5 / M2.7 — repeatedly cited for agentic/tool-heavy workloads
  • DeepSeek V3.2 — still firmly in the top cluster for strongest open-weight general models
  • GPT-oss 20B — increasingly recommended as a practical local option

For local coding: the overwhelming consensus is Qwen3-Coder-Next.

How to Steer AI Toward Your Desired Website Style

Images are higher bandwidth than words, and words alone are a weak steering signal. You can write a paragraph about wanting "clean, modern, with plenty of whitespace," but it's almost pointless: you end up with the default style anyway.

Images can get the model to make unique designs because they carry more information. A screenshot encodes hundreds of micro-decisions about spacing, colors, and layouts that you can't define with words. So unless you have code samples, reference images are the best way to steer AI.
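
A minimal sketch of image-based steering, assuming the OpenAI Python SDK and a vision-capable model (the model name and reference URL are placeholders; the same pattern works with any multimodal API):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Build a landing page in HTML/CSS. Match the visual style of the "
                "reference screenshot: spacing, type scale, color palette, layout density."
            )},
            # The reference image carries the style decisions words can't encode.
            {"type": "image_url", "image_url": {"url": "https://example.com/reference.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```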

The 2026 AI Engineer Roadmap

Most developers are building toys while the world demands systems. The gap between a prompt engineer and a systems architect is growing. Key insight: stop building generic wrappers over GPT — these are not businesses, they are features waiting to be absorbed by big tech.

To be indispensable: understand orchestration, memory, and local inference. Five production-grade project tiers:

  • Beginner: AI-powered mobile app with a small language model (SLM) — edge AI + resource optimization (quantization, battery optimization, offline-first sync)
  • Intermediate: Self-improving coding agent — agentic loops (plan → code → test → reflect; see the sketch after this list), memory hierarchy, sandboxing
  • Advanced: Cursor but for video editors — multimodal AI + complex tool integration
  • Expert: Personal life OS agent — deep context + privacy-first architecture
  • Master: Autonomous enterprise workflow agent — production-grade orchestration, multi-agent delegation, audit trails, RBAC
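
The agentic loop in the Intermediate tier is conceptually small. Below is a minimal sketch, assuming an `llm(prompt) -> str` callable wired to whatever model client you use; a real system would add persistent memory and proper sandboxing rather than a bare subprocess:

```python
import subprocess
import sys
from typing import Callable

def agentic_loop(task: str, llm: Callable[[str], str], max_iters: int = 3) -> str:
    """Plan -> code -> test -> reflect, where `llm` is any text-in/text-out model call."""
    plan = llm(f"Outline the steps to solve this task:\n{task}")
    code = llm(f"Write a standalone Python script for the task.\nTask: {task}\nPlan: {plan}")
    for _ in range(max_iters):
        # Test: execute the candidate in a subprocess (a stand-in for real sandboxing).
        result = subprocess.run(
            [sys.executable, "-c", code], capture_output=True, text=True, timeout=30
        )
        if result.returncode == 0:
            return code  # execution (or the script's embedded tests) succeeded
        # Reflect: feed the failure back and request a revision.
        code = llm(
            f"The script failed with:\n{result.stderr}\nRevise the full script.\n"
            f"Task: {task}\nCurrent script:\n{code}"
        )
    return code
```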

Robot Tax

A game theory paper from UPenn + Boston University: every company fires workers to cut costs → every fired worker stops buying products → revenue collapses → the companies that fired everyone go bankrupt. It's a Prisoner's Dilemma. Automate and you survive short-term; don't automate and your competitor kills you; but everyone automating destroys the demand that makes all companies viable.

The researchers found only one solution: a Pigouvian automation tax ("robot tax"). In their model, UBI and profit taxes don't fix the structural problem.
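
A toy payoff matrix makes the dilemma structure concrete (the numbers below are purely illustrative, not from the paper): whichever move the other firm makes, automating is the individually rational choice, yet mutual automation leaves both firms worse off than mutual restraint.

```python
# Illustrative two-firm payoff matrix; entries are (firm A profit, firm B profit).
payoffs = {
    ("keep", "keep"):         (5, 5),  # both keep workers: consumer demand stays healthy
    ("automate", "keep"):     (8, 2),  # automate alone: cost advantage over the rival
    ("keep", "automate"):     (2, 8),
    ("automate", "automate"): (3, 3),  # everyone automates: demand collapses for both
}

def firm_a_best_response(firm_b_choice: str) -> str:
    """Firm A's payoff-maximizing move, given firm B's choice."""
    return max(("keep", "automate"), key=lambda a: payoffs[(a, firm_b_choice)][0])

print(firm_a_best_response("keep"))      # -> automate
print(firm_a_best_response("automate"))  # -> automate (dominant strategy, hence the dilemma)
```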

Simple Agent Scheduling: Smarter Model Where It Counts

Looking at Sparkle (a Mac file organizer): it starts by analyzing files with Opus 4.6 (smart and expensive), gets user sign-off on folder structure, then uses Haiku 4.5 (fast, cheap) for classifying new files. "Q1 invoice.pdf" → Finance, no heavy AI lifting required.

My takeaway: a lot of so-called "agent scheduling" is actually quite simple — use a higher-end model for hard things, use a cheaper execution model for simple things that don't require much reasoning, and have the two interact (possibly via skills). The sophisticated architecture is often unnecessary.
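
A minimal sketch of that two-tier pattern, assuming a generic `llm(model, prompt) -> str` helper you supply (the model identifiers are placeholders mirroring the Opus/Haiku split, not real API IDs):

```python
from typing import Callable, Dict, List

PLANNER = "opus-4.6"    # expensive model: used once, on the hard part
EXECUTOR = "haiku-4.5"  # cheap model: used on every routine file

def organize(files: List[str], llm: Callable[[str, str], str]) -> Dict[str, str]:
    # 1. One expensive call: design the folder structure, then get user sign-off.
    plan = llm(PLANNER, "Propose a folder structure for these files:\n" + "\n".join(files))
    if input(f"Proposed structure:\n{plan}\nAccept? [y/n] ").strip().lower() != "y":
        return {}
    # 2. Many cheap calls: classify each file against the approved structure.
    return {
        f: llm(EXECUTOR, f"Folders:\n{plan}\nWhich folder does '{f}' belong in? "
                         "Reply with the folder name only.")
        for f in files
    }
```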

a16z Report: Key Takeaways

  • ChatGPT is still strong, but the moat is shifting. 900M weekly active users. ~20% of ChatGPT users also use Gemini in the same week; loyalty comes from accumulated context (data, apps, connected services), not from being the only option.
  • Platform divergence. OpenAI → consumer super-app (220 apps in ChatGPT app store). Anthropic → professional infrastructure (MCP, Claude Code, enterprise API). Like iOS vs Android — not a war, two different markets.
  • Creative tool reshuffle. In the 2023 top 10, 7 of the 9 creative tools were image generators. Now only 3 remain; ChatGPT and Gemini absorbed image generation. Midjourney fell from the top 10 to #46. The survivors (Suno for music, ElevenLabs for voice) made it through depth, not breadth.
  • Measurement is failing. When AI becomes a feature everywhere (Excel, browser, OS-level), page views and MAU can no longer reflect true usage. An engineer working full-time with Claude Code doesn't appear in these stats at all.

Karpathy on the AI Capability Gap

Two groups of people are talking past each other. Group 1 tried free ChatGPT last year and is still laughing at hallucinations. Group 2 pays $200/month and uses frontier agentic models (Codex/Claude Code) professionally in technical domains. Group 2 is experiencing "AI psychosis" because this year's improvements in coding, math, and research have been nothing short of staggering.

The key: these domains have explicit, verifiable reward functions (unit tests pass, yes or no) that are amenable to reinforcement learning, and they're also the most valuable in B2B settings, so that's where lab teams focus the biggest fraction of their effort.

My take: in China, some people ask Doubao a question, get a wrong answer, and declare "AI isn't good." They might be using the wrong model, prompting poorly, or the specific product genuinely hasn't caught up. If you use the world's top models extensively, you will be shocked. Arguably, AGI has already arrived.