AI Notes — May 12

Few-shot and zero-shot

These two concepts originally come from "few-shot learning" and "zero-shot learning" in machine learning, but in the LLM era they mostly describe how many examples you put in the prompt.

Zero-shot

Give the model no examples, just ask it to do the task. You describe what you want in natural language, and the model relies on knowledge learned during pretraining to understand and execute. For example, sentiment analysis:

Judge whether the sentiment of this sentence is positive, negative, or neutral:
"The weather is great today, I feel especially relaxed."

The model has seen no demonstration and directly outputs "positive" based on the instruction.

When to use: the task is common, the phrasing is clear, and the model likely saw similar tasks in pretraining — translation, summarization, rewriting, basic classification, writing a polite email. Pros: short prompt, fast to write, fewer tokens. Cons: when the format is unusual, a strict style is required, or it's an unfamiliar domain, the output gets unstable and drifts.

Few-shot

Put a few input–output example pairs in the prompt (usually 1 to 10) so the model completes the new task by imitating them. This is also called in-context learning — note the model's weights are not updated; it just "gets it on the spot" in the current conversation. Same sentiment analysis:

Judge the sentiment of the sentence:

Sentence: This restaurant's service was terrible.
Sentiment: negative

Sentence: The movie was okay, nothing special.
Sentiment: neutral

Sentence: I got a gift today and was happy all day.
Sentiment: positive

Sentence: He finally passed the exam.
Sentiment:

The model observes the format of the first three examples and answers "positive" in the same format. Derived concept: giving only one example is one-shot, a special case of few-shot.

When to use: a specific output format is needed (e.g. fixed JSON), there are subtle judgment criteria (what counts as "positive" depends on your business definition), the task is niche, or you want consistent output style. Pros: significantly improves accuracy and format consistency — like on-the-spot teaching. Cons: examples consume the context window and cost more tokens; badly chosen examples mislead the model (uneven distribution, incomplete coverage, or errors).

Comparison and selection advice

In short: zero-shot is "tell the model what to do," few-shot is "show the model how to do it." The practical rule of thumb: try zero-shot first and see how it goes. If the output already meets expectations, don't add examples — save tokens and effort. If format is unstable, criteria are fuzzy, or it drifts wrong, add 2–5 high-quality examples to make it few-shot. Still not working? Consider adding Chain-of-Thought (let the model reason step by step), or just fine-tune.

Note: example quality matters far more than quantity. Three carefully chosen examples covering typical and edge cases usually beat ten thrown together. Examples should be consistent in style and format, and cover the different input types you expect the model to handle.

Full-duplex multimodal interaction

Full-duplex is a comms term. A phone call is full-duplex — both sides can talk and interrupt simultaneously; a walkie-talkie is half-duplex — one person finishes before the other speaks. Most current AI voice assistants (including ChatGPT voice mode) are essentially half-duplex: you finish, it processes, replies, then waits for you.

Multimodal means handling multiple information forms at once — voice, video, text, image. Full-duplex multimodal interaction means the AI can simultaneously listen, watch the camera, think, search, and respond — and these actions are concurrent, not queued.

"Trained from scratch" vs "stacked"

This is the key point. Traditional approach (stacked): today's ChatGPT voice mode is roughly glued together from a speech-recognition model (speech → text), an LLM (text → reply text), a speech-synthesis model (reply text → speech), plus rules for "did you finish talking" (turn-taking) and "should I call a tool." It's a few independent parts glued together; under the hood it's still turn-based — you say one thing, it says one thing.

New approach (trained from scratch): Thinking Machines trained one model whose "input" is a continuous stream of audio + video + text and whose "output" is also a continuous stream of audio + text. No intermediate conversion, no concept of turns; the model natively lives in continuous time.

The human↔AI bandwidth problem

The people cited (John Schulman, a core RLHF author; Soumith Chintala, PyTorch's creator) frame this as a bandwidth problem. When humans talk to each other, information is high-bandwidth and concurrent — I listen while you speak, watch your expression, think about my reply, maybe check my phone. But human↔AI communication today is low-bandwidth and serial — one side must finish before the other moves. The new model widens that information pipe so interaction feels as natural as person-to-person.

Capabilities emphasized in the demo

Continuous time perception: the model isn't "fed" a recording; it perceives time passing in real time (e.g. realizing "the user has been silent for 3 seconds").
Interruption handling: change your mind mid-sentence and interrupt, and it stops to listen — instead of finishing the prepared reply.
Simultaneous speech: both sides can overlap, like the "mm-hm, yeah" of real conversation.
Visual proactivity: it actively notices and incorporates what the camera sees, without you prompting "look at this."
Background tool use: it quietly searches/calls tools while talking, without blurting "I'm searching now..." and stalling. The experience stays continuous.

What "zero-shot" means at the end

A type signature is a programming term for a function's input and output types. Schulman's point: AI's old type signature was text → text (or, with parts bolted on, speech → speech); the new model's is continuous audio+video+text → audio+text. When the model's native I/O is this rich multimodal stream, many tasks that previously needed a purpose-built system can now be done zero-shot — no special training, no examples. Examples: real-time sign-language translation used to need a sign-recognition model + translation + speech synthesis stitched together; now the model watches video, hears speech, outputs speech in one shot. "Give verbal suggestions while watching you code" used to need screen recognition + code understanding + speech generation; now it's zero-shot.

One-sentence summary: Thinking Machines previewed a model that is natively continuous, multimodal, real-time interaction — not an existing LLM with voice bolted on, but a model that learned during training to "listen, speak, watch, think, and search at the same time." Once the capability base looks like this, many real-time interaction tasks that needed bespoke systems can now be done with no extra training (zero-shot).

Deep agents and local agents are maturing fast

A deep-agent CLI can hot-swap the underlying model provider mid-conversation without losing context — a non-trivial systems feature many agent stacks still lack. LangChain also emphasized using profiles for provider/model-specific tuning. A separate pricing analysis argued that for high-volume agent workloads, DeepSeek V4 Flash can be much cheaper than GPT/Gemini flash-tier options.

Local/open models keep improving faster than the hardware ceiling. One strong argument: under the same top-tier MacBook Pro memory limit, the "smartest open-weight model you can actually run" went from Llama 3 70B-era capability to DeepSeek V4 Flash hybrid Q2 GGUF-era capability — roughly 4.7x in 24 months, i.e. doubling every ~10.7 months, faster than Moore's law. Supporting data points: rapid growth in GGUF uploads, and the community repeatedly observing that Qwen 3.6, Gemma 4, and DeepSeek variants can now do non-trivial agent tasks locally.