Anvaya (Explanimator)

From Research Papers to
3Blue1Brown-Style Videos

Anannya Popat, Lakshya Gupta

A tool that helps students and educators build visual understanding of research papers!

🚀 Web Platform Coming Soon — Interactive UI for generating videos with one click


Motivation

After being enthusiastic followers of 3Blue1Brown over the years, we set aside some time over a few weekends to answer one question: "How well can frontier models visually explain AI research papers using Grant Sanderson's Manim framework?"

So we built something we personally always wanted — a pipeline that ingests a research paper and outputs a 3Blue1Brown-style animated explainer video with narration, where the paper's key mathematical components are visualized for intuitive understanding. The hope is to help readers get through a paper quickly without wading through academic jargon.

Animation demo: Unified Latents paper visualized
A clip from our pipeline's output on the Unified Latents paper — generated with Gemini 3.1.

Pipeline Overview

The system takes a research paper PDF as input and produces a fully narrated, animated video as output. Users can select between two difficulty levels — Initiate (includes prerequisite concepts) or Scholar (focused deep dive). Each difficulty level uses a specialized multi-agent system with distinct prompts tailored to the target audience, controlling explanation depth, segment count, vocabulary complexity, and pacing. There are five stages, each feeding into the next:

PDF → Explanation → Audio Beats → Manim Code → Synced Video

Audio is generated before the animation code. This ordering is deliberate: the beat timeline produced by TTS generation becomes the source of truth for animation timing. Each segment's narration is split into sentence-level beats, each with a precise measured duration. The Manim code generator receives this timing metadata and structures its animations to land on the right beats.
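In code, the beat timeline might be accumulated like this — a minimal sketch, where the field names ("beats", "start", "duration", "total_duration") are illustrative assumptions rather than the pipeline's actual schema:

```python
# Hypothetical sketch of the beat timeline described above. Field names
# ("beats", "start", "duration", "total_duration") are assumptions.

def build_timeline(segment_beats):
    """Accumulate measured per-beat WAV durations into absolute start times."""
    beats, cursor = [], 0.0
    for text, duration in segment_beats:
        beats.append({"text": text, "start": round(cursor, 3),
                      "duration": duration})
        cursor += duration
    return {"beats": beats, "total_duration": round(cursor, 3)}
```

With this structure, the code generator can schedule each animation to start exactly at a beat's start time and run for its measured duration.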


Specialized Agents

Agent 1

Intuition Generator

Reads the full paper PDF and produces a 3B1B-style educational breakdown. Each segment has a core insight, a concrete visual metaphor, a running example, and a narration script. Emphasizes intuition before formulas.

Agent 2

Manim Code Generator

Takes the educational breakdown, the running example, and beat-level timing metadata, and produces executable Manim Python code. Strictly limited to classes demonstrated in actual Manim documentation — no hallucinated APIs.

Separating these concerns matters. The intuition agent focuses entirely on the what and why — it doesn't know anything about Manim. The code agent focuses entirely on translating those ideas into valid, runnable Python. Each agent has a tightly scoped system prompt tuned for its role.


LLM-as-Judge Feedback Loops

Not every generated explanation is worth animating. A separate judge agent evaluates each explanation against six binary criteria: Does it lead with intuition before formulas? Are the visual metaphors strong? Is the running example used consistently? Does it have real animation potential? Is math connected back to intuition? Is the narration script natural?

All six must pass. If any fail, the judge returns specific written feedback, and the explanation agent regenerates with that feedback appended to its prompt. This loop runs up to three times. The result is that only explanations meeting a consistent quality bar proceed to the expensive rendering stages.

Generate
Explanation
→
Judge Agent
6 criteria
→
Score = 1?
→
Pass · Done
On fail — written feedback injected into prompt, regenerate  ·  max 3 attempts
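The loop above can be sketched as follows, with hypothetical generate() and judge() callables standing in for the two agents (the criterion names here are paraphrases, not the actual prompt wording):

```python
# Sketch of the judge loop; generate() and judge() are hypothetical
# callables standing in for the explanation and judge agents.

CRITERIA = [
    "intuition_first", "strong_metaphors", "consistent_example",
    "animation_potential", "math_tied_to_intuition", "natural_narration",
]

def generate_with_judge(generate, judge, max_attempts=3):
    feedback = None
    explanation = None
    for _ in range(max_attempts):
        explanation = generate(feedback)       # feedback appended to the prompt
        verdict = judge(explanation)           # {criterion: bool} plus "notes"
        if all(verdict[c] for c in CRITERIA):  # all six must pass
            return explanation
        feedback = verdict["notes"]            # written feedback for the retry
    return explanation                         # last attempt after max retries
```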

Execution-Based Feedback

LLM-generated Manim code frequently references classes or methods that don't exist, or assembles valid-looking Python that fails at render time. Rather than filtering this statically, the pipeline runs the generated code through manim render as a subprocess. If it executes cleanly, we move on. If it fails, the error output is parsed.

Key terms and class names are extracted from the error message, used to query the RAG system, and the retrieved documentation is injected back into the code generation prompt alongside the original error. The code agent then regenerates with this grounded context. This loop runs up to three retries per scene before falling back.

Gen Manim
Code
→
Execute
manim render
→
Success
↓ error
Parse Error
extract terms
→
Query RAG
ChromaDB
inject context →
Gen Manim
Code
↩ retry
RAG context + original error injected into code gen prompt  ·  max 3 attempts per scene
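A minimal sketch of this execute-and-retry loop: the manim render subprocess call uses the real Manim Community CLI, while extract_error_terms(), query_rag(), and regenerate() are illustrative stand-ins for the pipeline's actual components:

```python
# Sketch of the execute-and-retry loop. "manim render" is the real Manim
# Community CLI; query_rag() and regenerate() are hypothetical stand-ins.
import re
import subprocess

def extract_error_terms(stderr: str) -> list[str]:
    # Pull capitalized tokens (class names, error types) out of the traceback.
    return sorted(set(re.findall(r"\b[A-Z][A-Za-z]+\b", stderr)))

def render_with_retries(code_path, regenerate, query_rag, max_attempts=3):
    for _ in range(max_attempts):
        result = subprocess.run(
            ["manim", "render", code_path],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return True                       # clean render: move on
        terms = extract_error_terms(result.stderr)
        docs = query_rag(terms)               # grounded documentation context
        code_path = regenerate(result.stderr, docs)
    return False                              # fall back after three attempts
```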

RAG with Manim Documentation

The Manim documentation — API reference, tutorials, guides — was scraped, chunked at ~1000 tokens with 200-token overlap, and embedded using OpenAI's text-embedding-3-large. The resulting ChromaDB vector store holds thousands of chunks, each labeled with its Manim classes, animation types, and chunk type (code example, API doc, tutorial, concept).

Retrieval is error-driven. When code execution fails, the system builds a semantic query from the extracted error terms and fetches the top candidates from ChromaDB. These are then reranked: code example chunks get a significant boost, tutorials and API docs get moderate boosts, and chunks mentioning more Manim classes rank higher. The top results are assembled into a documentation context block that the code agent sees on its next attempt.

This grounds the model in actual implementation details — real class signatures, working code patterns, confirmed behaviors — rather than letting it hallucinate from training data alone.

Execution Error · extract key terms
↓
Build semantic query → embed with text-embedding-3-large
↓
ChromaDB · vector similarity search → top 20 candidates
Code Examples +0.4 boost
Tutorials +0.25 boost
API Docs +0.2 boost
↓
Rerank by type + metadata richness → top 10 → context block
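The reranking step might look like this sketch, using the boost values from the diagram; the chunk schema and the 0.02 metadata-richness weight are assumptions for illustration:

```python
# Sketch of the type-based reranking, using the boost values from the
# diagram; the chunk schema and the 0.02 richness weight are assumptions.

TYPE_BOOSTS = {"code_example": 0.4, "tutorial": 0.25, "api_doc": 0.2}

def rerank(candidates, top_k=10):
    # candidates: dicts with "similarity", "chunk_type", "manim_classes"
    def score(chunk):
        boost = TYPE_BOOSTS.get(chunk["chunk_type"], 0.0)
        richness = 0.02 * len(chunk["manim_classes"])  # more classes rank higher
        return chunk["similarity"] + boost + richness
    return sorted(candidates, key=score, reverse=True)[:top_k]
```

Note that a code example with slightly lower raw similarity can outrank an API doc chunk once the boosts are applied — the design deliberately favors working code patterns over reference prose.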

Parallel Execution

Both Manim code generation and video rendering run in parallel using Python's ThreadPoolExecutor. Each segment's code generation (including its retry loop) runs in its own thread, with up to 4 workers executing concurrently. Similarly, video rendering and audio synchronization for each segment happen in parallel threads.

This parallelization significantly reduces total pipeline time. For a paper with 6 segments, instead of processing them sequentially (6 × average_time), the pipeline processes them in batches of 4, cutting wall-clock time by roughly 60–70% depending on segment complexity.

Explanation
6 segments
→
ThreadPool
4 workers
Segments 1–4 process concurrently, then 5–6  ·  results collected in original order  ·  ~60–70% faster than sequential
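A minimal sketch of the per-segment fan-out; process_segment is a hypothetical stand-in for code generation plus its retry loop:

```python
# Sketch of the per-segment fan-out; process_segment is a hypothetical
# stand-in for code generation plus its retry loop.
from concurrent.futures import ThreadPoolExecutor

def process_all(segments, process_segment, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order even though execution overlaps
        return list(pool.map(process_segment, segments))
```

ThreadPoolExecutor.map returns results in input order regardless of completion order, which is how segments come back in their original sequence for concatenation.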

Audio and Video Synchronization

Narration is generated using OpenAI TTS at the beat level — each sentence or short phrase is its own audio file with a precisely measured WAV duration. These durations are accumulated into a timeline JSON that records the exact start time and length of every beat within every segment.

After a Manim segment renders, its actual video duration is measured with ffprobe. The pipeline then computes a speed factor: target_duration / actual_duration. If the adjustment is within ±30%, ffmpeg's setpts filter is applied to retime the video. If the video runs short, the last frame is frozen and appended. If it runs long, it's trimmed. The speed-adjusted video and audio are then merged into a single synced segment, and all segments are concatenated into the final output.

Audio track
Narration
→
Beat Split
8–25 words
→
TTS
OpenAI
Video track
Manim
Code
→
Render
Manim
→
ffprobe
measure
Synchronization
speed_factor = audio_duration / video_duration within ±30% → ffmpeg setpts  ·  too short → freeze last frame  ·  too long → trim
Synced segment  →  concatenate all  →  final_video.mp4
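The retime-or-pad decision can be sketched as pure logic; the setpts filter string is real ffmpeg syntax, while the function itself and its return convention are an illustrative reconstruction:

```python
# Sketch of the sync decision. The setpts filter string is real ffmpeg
# syntax; the function itself is an illustrative reconstruction.

def sync_plan(audio_duration: float, video_duration: float):
    factor = audio_duration / video_duration   # target / actual
    if 0.7 <= factor <= 1.3:                   # within ±30%: retime the video
        return ("retime", f"setpts={factor:.4f}*PTS")
    if video_duration < audio_duration:        # far too short: freeze last frame
        return ("freeze_last_frame", audio_duration - video_duration)
    return ("trim", audio_duration)            # far too long: cut to length
```

For example, a 9 s render against 10 s of narration yields setpts=1.1111*PTS, stretching the video timestamps to match the audio.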

Model Comparison

We ran the same paper, Unified Latents, through the pipeline in Scholar mode using three different frontier models for the explanation and code generation steps. Here's what each produced.

Google Gemini 3.1 Pro Preview via OpenRouter — strongest overall visual metaphors and narration quality.

OpenAI GPT 5.2 Codex — most compact output; tighter pacing, slightly more code-literal in its visualizations.

Kimi K2.5 — longest output; richest animation density with more elaborate scene compositions.