DeepSeek-V4 vs Flash Attention vs Multi-Head Attention
Putting these three side by side makes the difference between "algorithmic optimization" and "architectural innovation" obvious.
1. Multi-Head Attention (MHA): the baseline
Core formula:
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
Q, K, and V are split into h heads, each head computes attention independently, and the heads are then concatenated.
KV cache pain: at inference time, you have to cache K and V at every layer. Cache size = 2 × n_layers × seq_len × d_model × batch_size, which grows linearly with context length and is brutal for long sequences. On top of that, standard attention materializes an n×n score matrix, so compute-time memory is O(n²) and VRAM blows up at long context.
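To make the baseline concrete, here is a minimal sketch of standard MHA plus the cache formula above in PyTorch; the shapes, sizes, and fp16 assumption are illustrative, not any specific model's:

```python
# Minimal multi-head attention sketch (illustrative, not any model's actual code).
import math
import torch
import torch.nn.functional as F

def mha(x, w_q, w_k, w_v, w_o, n_heads):
    """Standard multi-head attention over a sequence x of shape (batch, seq, d_model)."""
    b, n, d_model = x.shape
    d_head = d_model // n_heads
    # Project and split into heads: (batch, heads, seq, d_head)
    q = (x @ w_q).view(b, n, n_heads, d_head).transpose(1, 2)
    k = (x @ w_k).view(b, n, n_heads, d_head).transpose(1, 2)
    v = (x @ w_v).view(b, n, n_heads, d_head).transpose(1, 2)
    # softmax(Q·Kᵀ / √d_k)·V -- the n×n score matrix is the O(n²) memory term
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
    out = F.softmax(scores, dim=-1) @ v
    # Merge heads back to (batch, seq, d_model)
    return out.transpose(1, 2).reshape(b, n, d_model) @ w_o

x = torch.randn(1, 16, 256)
w = [torch.randn(256, 256) / 16 for _ in range(4)]
print(mha(x, *w, n_heads=8).shape)  # (1, 16, 256)

# Back-of-the-envelope KV cache size (fp16) using the formula in the text:
def kv_cache_bytes(n_layers, seq_len, d_model, batch_size, bytes_per_elem=2):
    return 2 * n_layers * seq_len * d_model * batch_size * bytes_per_elem

# e.g. a 32-layer, d_model=4096 model at 32k context, batch 1: ~17 GB of cache
print(kv_cache_bytes(32, 32_768, 4096, 1) / 1e9)
```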
2. Flash Attention: same architecture, better algorithm
Flash Attention is not a new attention structure — it's a more efficient way of computing the same MHA.
- Result: mathematically identical to standard MHA
- Core idea: tiled computation that avoids writing the full n×n attention matrix to HBM
- Memory access: O(n²) HBM reads/writes drops to O(n)
- Speed: 2-4x faster
- What changed: only the GPU memory access pattern (IO-aware)
Analogy: same dish, but instead of laying every ingredient on the counter first, you grab and cook in sequence — same result, less counter space.
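To see the "identical result" point concretely, the sketch below compares the naive formula with PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention-style kernel on supported GPUs; on CPU it falls back to a math implementation, so the comparison still runs:

```python
# Flash Attention changes the computation schedule, not the math: the fused kernel's
# output matches the naive formula up to floating-point tolerance.
import math
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq, d_head)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Naive reference: materializes the full 128 x 128 score matrix
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
naive = F.softmax(scores, dim=-1) @ v

# Fused kernel (IO-aware on supported CUDA hardware; math fallback elsewhere)
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-5))  # True: same result, different memory traffic
```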
3. DeepSeek-V4's CSA / HCA: actual architectural change
This is the real innovation. It tackles MHA's structural problem, not just implementation efficiency.
3.1 CSA (Compressed Sparse Attention)
Idea: before computing attention, do a low-rank compression on K and V — squeeze the high-dimensional vectors into a low-dimensional latent space.
Standard MHA:
K = X · Wₖ ← dim d_model
V = X · Wᵥ ← dim d_model
CSA:
c_KV = X · W_c ← compress to low dim d_c (d_c << d_model)
K = c_KV · Wₖ_up ← project back up
V = c_KV · Wᵥ_up ← project back up
The KV cache only needs to store c_KV. Compression ratios of 5-10x are achievable. This is a continuation and enhancement of the MLA (Multi-head Latent Attention) idea from DeepSeek-V2/V3.
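A minimal sketch of this cache-the-latent idea under the pseudo-formulas above; the module, weight names, and d_c = 1024 vs d_model = 4096 (roughly 8x fewer cached elements per token) are illustrative assumptions, not DeepSeek's released code:

```python
# Low-rank KV compression sketch in the spirit of MLA / the CSA description above.
# Only the latent c_KV is cached; K and V are re-expanded on the fly each step.
import torch
import torch.nn as nn

class CompressedKV(nn.Module):
    def __init__(self, d_model=4096, d_c=1024):
        super().__init__()
        self.w_c = nn.Linear(d_model, d_c, bias=False)     # c_KV = X · W_c (down-projection)
        self.w_k_up = nn.Linear(d_c, d_model, bias=False)  # K = c_KV · Wₖ_up
        self.w_v_up = nn.Linear(d_c, d_model, bias=False)  # V = c_KV · Wᵥ_up

    def forward(self, x, cache=None):
        c_kv = self.w_c(x)                          # (batch, new_tokens, d_c)
        if cache is not None:
            c_kv = torch.cat([cache, c_kv], dim=1)  # append new tokens to the latent cache
        k = self.w_k_up(c_kv)                       # re-expanded on the fly, never cached
        v = self.w_v_up(c_kv)
        return k, v, c_kv                           # store c_kv, not K and V

m = CompressedKV()
x = torch.randn(1, 16, 4096)
k, v, cache = m(x)
# Cached elements per token: d_c = 1024 vs 2 * d_model = 8192 for full K+V (8x smaller)
print(cache.shape, k.shape)
```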
3.2 HCA (Heavily Compressed Attention)
Higher compression rate than CSA, used in certain layers (typically middle layers) where you can trade a bit of precision for more efficiency. The hybrid design means different Transformer blocks can alternate between CSA and HCA, adapting based on layer depth.
3.3 Pre-Block / Post-Block Mixing
- Pre-Block Mixing: before attention, mix the current layer's input with neighboring layers
- Post-Block Mixing: after attention, integrate information across layers again
This is a cross-layer mixing mechanism, similar in spirit to DenseNet's dense connections: information flows more smoothly between layers.
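The exact mixing mechanism isn't spelled out here, so the sketch below only illustrates the general idea: a learned gate that blends the current block's hidden state with a neighboring block's output, usable either before attention (pre-block) or after it (post-block):

```python
# Illustrative sketch only: the actual pre-/post-block mixing in V4 is not specified here.
import torch
import torch.nn as nn

class BlockMix(nn.Module):
    def __init__(self, d_model=1024):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_current, h_neighbor):
        # Gated blend of this block's input and a neighboring block's output
        g = torch.sigmoid(self.gate(torch.cat([h_current, h_neighbor], dim=-1)))
        return g * h_current + (1 - g) * h_neighbor

mix = BlockMix()
h_l = torch.randn(1, 16, 1024)      # current block input
h_prev = torch.randn(1, 16, 1024)   # output of an earlier block
mixed = mix(h_l, h_prev)            # fed to attention (pre-block) or after it (post-block)
```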
3.4 mHC residuals
The "stacked rectangle" next to ⊕ in the diagram is the multi-Head Compression residual, an enhancement to traditional residual connections:
Standard residual: x_out = x_in + Attention(x_in)
mHC residual: x_out = Mix([x_in, x_in_compressed_1, x_in_compressed_2]) + Attention(x_in)
Multiple representations at different compression levels are kept around and fused at the residual addition step, capturing multi-scale features.
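Again purely illustrative (the actual mHC design is not public): a sketch that keeps two compressed views of x_in and fuses them at the residual add, matching the formula above:

```python
# Sketch of an mHC-style residual as described above; ranks and layer names are assumptions.
import torch
import torch.nn as nn

class MHCResidual(nn.Module):
    def __init__(self, d_model=1024, d_c1=256, d_c2=64):
        super().__init__()
        self.down1, self.up1 = nn.Linear(d_model, d_c1), nn.Linear(d_c1, d_model)
        self.down2, self.up2 = nn.Linear(d_model, d_c2), nn.Linear(d_c2, d_model)
        self.mix = nn.Linear(3 * d_model, d_model)

    def forward(self, x_in, attn_out):
        # Compressed views of the residual stream at two different ranks
        c1 = self.up1(self.down1(x_in))
        c2 = self.up2(self.down2(x_in))
        mixed = self.mix(torch.cat([x_in, c1, c2], dim=-1))
        return mixed + attn_out   # replaces the plain x_in + Attention(x_in) residual

res = MHCResidual()
x_in = torch.randn(1, 16, 1024)
attn_out = torch.randn(1, 16, 1024)   # stand-in for Attention(x_in)
print(res(x_in, attn_out).shape)
```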
4. Side by side
Standard MHA
↓ (same architecture, only GPU memory access optimized)
Flash Attention (implementation optimization, identical result)
↓ (architecture changes, KV representation compressed)
MLA / CSA / HCA (architectural innovation: fundamentally less to store and compute)

| Dimension | MHA | Flash Attention | CSA/HCA (V4) |
|---|---|---|---|
| Architecture change | baseline | no | yes |
| Problem solved | — | GPU IO bottleneck | KV cache + compute |
| KV cache size | 100% | 100% (unchanged) | ~10-20% |
| Mathematical equivalence | baseline | fully equivalent | approximate (lossy) |
| Where to use | general | any setting | large-model long-context inference |
| Main payoff | — | train / infer speed | inference memory + throughput |
5. Intuition
- MHA: every person (head) walks into the meeting carrying their full notebook (K, V)
- Flash Attention: still full notebooks, but the meeting is run more efficiently — you don't lay every page on the table at once
- CSA/HCA: before the meeting, everyone compresses their notes into summaries and brings only the summary; original notes are looked up on demand. Much smaller table footprint, at the cost of a little detail
This is why DeepSeek-V4 can keep strong performance while supporting much longer context and higher inference throughput.
GPT-Image 2 + Seedance 2.0: AI short-film workflow
Full pipeline: idea → scene sheet → storyboard → image upscaling → Seedance 2.0 prompts → generate clips → edit.
Step 1 — build the scene sheet
Drop your idea into the system prompt below at "[insert your scene idea]" — clearly describe what's happening, what the frame looks like, and how the scene progresses. Iterate until the scene is solid both visually and narratively.
Scene sheet system prompt (structured fields):
- Scene title
- Visual format / style
- Location / environment
- Cast / character notes
- Shot breakdown
- Tone / emotional rhythm
- Camera / cinematic language
- Lighting design
- Sound / audio design
- Scene goal
Usage rules:
- Hold a professional cinematic tone, suitable for image-to-video or film-video generation workflows
- Use the section headers above exactly as given, with no format changes
- Output should read like a director's scene sheet or shot breakdown — not prose, not a traditional script
Shot breakdown rules
- Build the scene as a series of numbered shots (Shot 1, Shot 2…)
- Each shot must specify: shot type (extreme close-up, wide, medium, over-the-shoulder), frame action, character behavior and timing (pauses, reactions, micro-movements), and embedded dialogue in quotes within the shot
- Dialogue must be tight, natural, and cinematic — avoid info-dump lines
- Add micro-beats (silence, hesitation, eye movement, tension shifts)
- Make sure shots have causal continuity
- Lean on visual storytelling, not exposition
Overall creative direction
- Use precise cinematic language: composition, lens, depth of field, blocking, movement
- Concise text, high visual information density
- Strong rhythm (tension → pause → release)
- Convey emotion through visual action, framing, and timing — not direct explanation
- Output should be ready to feed to a video generation model or production team
Step 2 — generate the storyboard
In the same chat, ask ChatGPT:
"Based on the scene sheet above, generate a 21:9 aspect ratio 3×3 storyboard grid, with no text or dialogue."
You now have visual shot references you can use directly.
Step 3 — pick frames and upscale
Pick your favorite frames from the storyboard grid and upscale them (the author uses Magnific for realistic skin texture). These become your primary visual references.
Step 4 — convert the scene sheet to Seedance 2.0 prompts
Use the "Seedance 2.0 Prompt Director" GPT to convert the scene sheet into 3 separate 15-second clip prompts. The idea is now structured for generation.
Step 5 — generate clips and edit
- Upload the upscaled references to Seedance 2.0
- Use the Seedance 2.0 prompts to generate 3 clips
- Quick edit, ship
Note: the author uses the Magnific video upscaler to push the final cut to 4K.