DeepSeek-V4 vs Flash Attention vs Multi-Head Attention
Putting these three side by side makes the difference between "algorithmic optimization" and "architectural innovation" obvious.
1. Multi-Head Attention (MHA): the baseline
Core formula:
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
Q, K, and V are split into h heads, each head computes attention independently, and the heads are then concatenated.
KV cache pain: at inference time, you have to cache K and V at every layer. Cache size = 2 × n_layers × seq_len × d_model × batch_size, which grows linearly with context length and is brutal for long sequences. On top of that, standard attention materializes an n×n score matrix, so compute-time memory is O(n²) and VRAM blows up at long context.
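To make the baseline concrete, here is a minimal sketch of standard MHA plus the cache formula above in PyTorch; the shapes, sizes, and fp16 assumption are illustrative, not any specific model's:

```python
# Minimal multi-head attention sketch (illustrative, not any model's actual code).
import math
import torch
import torch.nn.functional as F

def mha(x, w_q, w_k, w_v, w_o, n_heads):
    """Standard multi-head attention over a sequence x of shape (batch, seq, d_model)."""
    b, n, d_model = x.shape
    d_head = d_model // n_heads
    # Project and split into heads: (batch, heads, seq, d_head)
    q = (x @ w_q).view(b, n, n_heads, d_head).transpose(1, 2)
    k = (x @ w_k).view(b, n, n_heads, d_head).transpose(1, 2)
    v = (x @ w_v).view(b, n, n_heads, d_head).transpose(1, 2)
    # softmax(Q·Kᵀ / √d_k)·V -- the n×n score matrix is the O(n²) memory term
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
    out = F.softmax(scores, dim=-1) @ v
    # Merge heads back to (batch, seq, d_model)
    return out.transpose(1, 2).reshape(b, n, d_model) @ w_o

x = torch.randn(1, 16, 256)
w = [torch.randn(256, 256) / 16 for _ in range(4)]
print(mha(x, *w, n_heads=8).shape)  # (1, 16, 256)

# Back-of-the-envelope KV cache size (fp16) using the formula in the text:
def kv_cache_bytes(n_layers, seq_len, d_model, batch_size, bytes_per_elem=2):
    return 2 * n_layers * seq_len * d_model * batch_size * bytes_per_elem

# e.g. a 32-layer, d_model=4096 model at 32k context, batch 1: ~17 GB of cache
print(kv_cache_bytes(32, 32_768, 4096, 1) / 1e9)
```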
2. Flash Attention: same architecture, better algorithm
Flash Attention is not a new attention structure — it's a more efficient way of computing the same MHA.
- Result: mathematically identical to standard MHA
- Core idea: tiled computation that avoids writing the full n×n attention matrix to HBM
- Memory access: O(n²) HBM reads/writes drops to O(n)
- Speed: 2-4x faster
- What changed: only the GPU memory access pattern (IO-aware)
Analogy: same dish, but instead of laying every ingredient on the counter first, you grab and cook in sequence — same result, less counter space.
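To see the "identical result" point concretely, the sketch below compares the naive formula with PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention-style kernel on supported GPUs; on CPU it falls back to a math implementation, so the comparison still runs:

```python
# Flash Attention changes the computation schedule, not the math: the fused kernel's
# output matches the naive formula up to floating-point tolerance.
import math
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq, d_head)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Naive reference: materializes the full 128 x 128 score matrix
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
naive = F.softmax(scores, dim=-1) @ v

# Fused kernel (IO-aware on supported CUDA hardware; math fallback elsewhere)
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-5))  # True: same result, different memory traffic
```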
3. DeepSeek-V4's CSA / HCA: actual architectural change
This is the real innovation. It tackles MHA's structural problem, not just implementation efficiency.
3.1 CSA (Compressed Sparse Attention)
Idea: before computing attention, do a low-rank compression on K and V — squeeze the high-dimensional vectors into a low-dimensional latent space.
Standard MHA:
K = X · Wₖ ← dim d_model
V = X · Wᵥ ← dim d_model
CSA:
c_KV = X · W_c ← compress to low dim d_c (d_c << d_model)
K = c_KV · Wₖ_up ← project back up
V = c_KV · Wᵥ_up ← project back up
The KV cache only needs to store c_KV. Compression ratios of 5-10x are achievable. This is a continuation and enhancement of the MLA (Multi-head Latent Attention) idea from DeepSeek-V2/V3.
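A minimal sketch of this cache-the-latent idea under the pseudo-formulas above; the module, weight names, and d_c = 1024 vs d_model = 4096 (roughly 8x fewer cached elements per token) are illustrative assumptions, not DeepSeek's released code:

```python
# Low-rank KV compression sketch in the spirit of MLA / the CSA description above.
# Only the latent c_KV is cached; K and V are re-expanded on the fly each step.
import torch
import torch.nn as nn

class CompressedKV(nn.Module):
    def __init__(self, d_model=4096, d_c=1024):
        super().__init__()
        self.w_c = nn.Linear(d_model, d_c, bias=False)     # c_KV = X · W_c (down-projection)
        self.w_k_up = nn.Linear(d_c, d_model, bias=False)  # K = c_KV · Wₖ_up
        self.w_v_up = nn.Linear(d_c, d_model, bias=False)  # V = c_KV · Wᵥ_up

    def forward(self, x, cache=None):
        c_kv = self.w_c(x)                          # (batch, new_tokens, d_c)
        if cache is not None:
            c_kv = torch.cat([cache, c_kv], dim=1)  # append new tokens to the latent cache
        k = self.w_k_up(c_kv)                       # re-expanded on the fly, never cached
        v = self.w_v_up(c_kv)
        return k, v, c_kv                           # store c_kv, not K and V

m = CompressedKV()
x = torch.randn(1, 16, 4096)
k, v, cache = m(x)
# Cached elements per token: d_c = 1024 vs 2 * d_model = 8192 for full K+V (8x smaller)
print(cache.shape, k.shape)
```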
3.2 HCA (Heavily Compressed Attention)
Higher compression rate than CSA, used in certain layers (typically middle layers) where you can trade a bit of precision for more efficiency. The hybrid design means different Transformer blocks can alternate between CSA and HCA, adapting based on layer depth.
3.3 Pre-Block / Post-Block Mixing
- Pre-Block Mixing: before attention, mix the current layer's input with neighboring layers
- Post-Block Mixing: after attention, integrate information across layers again
This is a cross-layer mixing mechanism, similar in spirit to DenseNet's dense connections: information flows more smoothly between layers.
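The exact mixing mechanism isn't spelled out here, so the sketch below only illustrates the general idea: a learned gate that blends the current block's hidden state with a neighboring block's output, usable either before attention (pre-block) or after it (post-block):

```python
# Illustrative sketch only: the actual pre-/post-block mixing in V4 is not specified here.
import torch
import torch.nn as nn

class BlockMix(nn.Module):
    def __init__(self, d_model=1024):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_current, h_neighbor):
        # Gated blend of this block's input and a neighboring block's output
        g = torch.sigmoid(self.gate(torch.cat([h_current, h_neighbor], dim=-1)))
        return g * h_current + (1 - g) * h_neighbor

mix = BlockMix()
h_l = torch.randn(1, 16, 1024)      # current block input
h_prev = torch.randn(1, 16, 1024)   # output of an earlier block
mixed = mix(h_l, h_prev)            # fed to attention (pre-block) or after it (post-block)
```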
3.4 mHC residuals
The "stacked rectangle" next to ⊕ in the diagram is the multi-Head Compression residual, an enhancement to traditional residual connections:
Standard residual: x_out = x_in + Attention(x_in)
mHC residual: x_out = Mix([x_in, x_in_compressed_1, x_in_compressed_2]) + Attention(x_in)
Multiple representations at different compression levels are kept around and fused at the residual addition step, capturing multi-scale features.
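Again purely illustrative (the actual mHC design is not public): a sketch that keeps two compressed views of x_in and fuses them at the residual add, matching the formula above:

```python
# Sketch of an mHC-style residual as described above; ranks and layer names are assumptions.
import torch
import torch.nn as nn

class MHCResidual(nn.Module):
    def __init__(self, d_model=1024, d_c1=256, d_c2=64):
        super().__init__()
        self.down1, self.up1 = nn.Linear(d_model, d_c1), nn.Linear(d_c1, d_model)
        self.down2, self.up2 = nn.Linear(d_model, d_c2), nn.Linear(d_c2, d_model)
        self.mix = nn.Linear(3 * d_model, d_model)

    def forward(self, x_in, attn_out):
        # Compressed views of the residual stream at two different ranks
        c1 = self.up1(self.down1(x_in))
        c2 = self.up2(self.down2(x_in))
        mixed = self.mix(torch.cat([x_in, c1, c2], dim=-1))
        return mixed + attn_out   # replaces the plain x_in + Attention(x_in) residual

res = MHCResidual()
x_in = torch.randn(1, 16, 1024)
attn_out = torch.randn(1, 16, 1024)   # stand-in for Attention(x_in)
print(res(x_in, attn_out).shape)
```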
4. Side by side
Standard MHA
↓ (same architecture, only GPU memory access optimized)
Flash Attention (implementation optimization, identical result)
↓ (architecture changes, KV representation compressed)
MLA / CSA / HCA (architectural innovation: fundamentally less to store and compute)

| Dimension | MHA | Flash Attention | CSA/HCA (V4) |
|---|---|---|---|
| Architecture change | baseline | no | yes |
| Problem solved | — | GPU IO bottleneck | KV cache + compute |
| KV cache size | 100% | 100% (unchanged) | ~10-20% |
| Mathematical equivalence | baseline | fully equivalent | approximate (lossy) |
| Where to use | general | any setting | large-model long-context inference |
| Main payoff | — | train / infer speed | inference memory + throughput |
5. Intuition
- MHA: every person (head) walks into the meeting carrying their full notebook (K, V)
- Flash Attention: still full notebooks, but the meeting is run more efficiently — you don't lay every page on the table at once
- CSA/HCA: before the meeting, everyone compresses their notes into summaries and brings only the summary; original notes are looked up on demand. Much smaller table footprint, at the cost of a little detail
This is why DeepSeek-V4 can keep strong performance while supporting much longer context and higher inference throughput.
GPT-Image 2 + Seedance 2.0: AI short-film workflow
Full pipeline: idea → scene sheet → storyboard → image upscaling → Seedance 2.0 prompts → generate clips → edit.
Step 1 — build the scene sheet
Drop your idea into the system prompt below at "[insert your scene idea]" — clearly describe what's happening, what the frame looks like, and how the scene progresses. Iterate until the scene is solid both visually and narratively.
Scene sheet system prompt (structured fields):
- Scene title
- Visual format / style
- Location / environment
- Cast / character notes
- Shot breakdown
- Tone / emotional rhythm
- Camera / cinematic language
- Lighting design
- Sound / audio design
- Scene goal
Usage rules:
- Hold a professional cinematic tone, suitable for image-to-video or film-video generation workflows
- Use the section headers above exactly as given, with no format changes
- Output should read like a director's scene sheet or shot breakdown — not prose, not a traditional script
Shot breakdown rules
- Build the scene as a series of numbered shots (Shot 1, Shot 2…)
- Each shot must specify: shot type (extreme close-up, wide, medium, over-the-shoulder), frame action, character behavior and timing (pauses, reactions, micro-movements), and embedded dialogue in quotes within the shot
- Dialogue must be tight, natural, and cinematic — avoid info-dump lines
- Add micro-beats (silence, hesitation, eye movement, tension shifts)
- Make sure shots have causal continuity
- Lean on visual storytelling, not exposition
Overall creative direction
- Use precise cinematic language: composition, lens, depth of field, blocking, movement
- Concise text, high visual information density
- Strong rhythm (tension → pause → release)
- Convey emotion through visual action, framing, and timing — not direct explanation
- Output should be ready to feed to a video generation model or production team
Step 2 — generate the storyboard
In the same chat, ask ChatGPT:
"Based on the scene sheet above, generate a 21:9 aspect ratio 3×3 storyboard grid, with no text or dialogue."
You now have visual shot references you can use directly.
Step 3 — pick frames and upscale
Pick your favorite frames from the storyboard grid and upscale them (the author uses Magnific for realistic skin texture). These become your primary visual references.
Step 4 — convert the scene sheet to Seedance 2.0 prompts
Use the "Seedance 2.0 Prompt Director" GPT to convert the scene sheet into 3 separate 15-second clip prompts. The idea is now structured for generation.
Step 5 — generate clips and edit
- Upload the upscaled references to Seedance 2.0
- Use the Seedance 2.0 prompts to generate 3 clips
- Quick edit, ship
Note: the author uses the Magnific video upscaler to push the final cut to 4K.