Motivation
As longtime, enthusiastic followers of 3Blue1Brown, we set aside a few weekends to answer one question: "How well can frontier models visually explain AI research papers using Grant Sanderson's Manim framework?"
So we built something we personally always wanted: a pipeline that ingests a research paper and outputs a 3Blue1Brown-style animated explainer video with narration, where the paper's key mathematical components are visualized for intuitive understanding. The hope is to get through a paper quickly without wading through academic jargon.
Pipeline Overview
The system takes a research paper PDF as input and produces a fully narrated, animated video as output. Users can choose between two difficulty levels: Initiate (includes prerequisite concepts) or Scholar (a focused deep dive). Each difficulty level uses a specialized multi-agent system with distinct prompts tailored to the target audience, controlling explanation depth, segment count, vocabulary complexity, and pacing. There are five stages, each feeding into the next.
Audio is generated before the animation code. This ordering is deliberate: the beat timeline produced by TTS generation becomes the source of truth for animation timing. Each segment's narration is split into sentence-level beats, each with a precise measured duration. The Manim code generator receives this timing metadata and structures its animations to land on the right beats.
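For a concrete picture, here is roughly what that beat-level timing metadata might look like; the field names and values are illustrative, not the pipeline's actual schema.

```python
# Illustrative beat timeline for one segment (hypothetical field names and values).
segment_timeline = {
    "segment_id": "segment_01",
    "beats": [
        {"index": 0, "text": "First narration sentence.",  "start": 0.00, "duration": 3.42},
        {"index": 1, "text": "Second narration sentence.", "start": 3.42, "duration": 2.87},
    ],
    "total_duration": 6.29,
}
```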
Specialized Agents
Agent 1: Intuition Generator
Reads the full paper PDF and produces a 3B1B-style educational breakdown. Each segment has a core insight, a concrete visual metaphor, a running example, and a narration script. Emphasizes intuition before formulas.
Agent 2: Manim Code Generator
Takes the educational breakdown, the running example, and beat-level timing metadata, and produces executable Manim Python code. Strictly limited to classes demonstrated in actual Manim documentation; no hallucinated APIs.
Separating these concerns matters. The intuition agent focuses entirely on the what and why; it doesn't know anything about Manim. The code agent focuses entirely on translating those ideas into valid, runnable Python. Each agent has a tightly scoped system prompt tuned for its role.
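As a rough sketch of what that separation can look like in practice, assuming an OpenAI-style chat API (the prompt text, model name, and helper below are illustrative, not the pipeline's actual ones):

```python
from openai import OpenAI

client = OpenAI()

# Illustrative system prompts; the real ones are longer and tuned per difficulty level.
INTUITION_SYSTEM_PROMPT = (
    "You are a 3Blue1Brown-style educator. For each segment, give a core insight, "
    "a concrete visual metaphor, a running example, and a narration script. "
    "Lead with intuition before formulas. Never mention Manim or code."
)
MANIM_SYSTEM_PROMPT = (
    "You write executable Manim Python code. Use only classes shown in the provided "
    "Manim documentation context, and time animations to the given narration beats."
)

def run_agent(system_prompt: str, user_content: str) -> str:
    """One tightly scoped agent call: a focused system prompt plus its own input."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the pipeline swaps in different frontier models
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return response.choices[0].message.content
```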
LLM-as-Judge Feedback Loops
Not every generated explanation is worth animating. A separate judge agent evaluates each explanation against six binary criteria: Does it lead with intuition before formulas? Are the visual metaphors strong? Is the running example used consistently? Does it have real animation potential? Is the math connected back to intuition? Is the narration script natural?
All six must pass. If any fail, the judge returns specific written feedback, and the explanation agent regenerates with that feedback appended to its prompt. This loop runs up to three times. The result is that only explanations meeting a consistent quality bar proceed to the expensive rendering stages.
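A minimal sketch of that loop, with generate_explanation and judge_explanation standing in as hypothetical wrappers around the intuition and judge agents:

```python
MAX_JUDGE_RETRIES = 3

CRITERIA = [
    "leads_with_intuition",
    "strong_visual_metaphors",
    "consistent_running_example",
    "real_animation_potential",
    "math_tied_back_to_intuition",
    "natural_narration",
]

def generate_judged_explanation(paper_text: str) -> dict:
    feedback = ""
    explanation = None
    for _ in range(MAX_JUDGE_RETRIES):
        explanation = generate_explanation(paper_text, feedback)  # hypothetical intuition-agent call
        verdict = judge_explanation(explanation, CRITERIA)        # hypothetical judge-agent call
        if all(verdict["passed"].values()):                       # all six criteria must pass
            return explanation
        feedback = verdict["feedback"]                            # appended to the next prompt
    return explanation  # after three attempts, proceed with the last version
```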
Execution-Based Feedback
LLM-generated Manim code frequently references classes or methods that don't exist, or assembles valid-looking Python that fails at render time. Rather than filtering this statically, the pipeline runs the generated code through manim render as a subprocess. If it executes cleanly, we move on. If it fails, the error output is parsed.
Key terms and class names are extracted from the error message, used to query the RAG system, and the retrieved documentation is injected back into the code generation prompt alongside the original error. The code agent then regenerates with this grounded context. This loop runs up to three retries per scene before falling back.
[Diagram: generated code → manim render → extract error terms → ChromaDB → regenerate code → retry]
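A simplified sketch of this loop; generate_scene_code, extract_error_terms, retrieve_docs, and fallback_scene are placeholders for the pipeline's actual functions:

```python
import subprocess

MAX_CODE_RETRIES = 3

def render_with_feedback(segment, timing):
    docs_context, last_error = "", ""
    for _ in range(MAX_CODE_RETRIES):
        scene_file = generate_scene_code(segment, timing, docs_context, last_error)
        result = subprocess.run(
            ["manim", "render", str(scene_file)],   # scene selection and quality flags omitted
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return scene_file                        # rendered cleanly; move on
        last_error = result.stderr                   # e.g. a traceback naming a bad class
        terms = extract_error_terms(last_error)      # class/method names pulled from the error
        docs_context = retrieve_docs(terms)          # grounded docs from ChromaDB (sketched in the next section)
    return fallback_scene(segment)                   # give up after three retries
```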
RAG with Manim Documentation
The Manim documentation (API reference, tutorials, guides) was scraped, chunked at ~1000 tokens with 200-token overlap, and embedded using OpenAI's text-embedding-3-large. The resulting ChromaDB vector store holds thousands of chunks, each labeled with its Manim classes, animation types, and chunk type (code example, API doc, tutorial, concept).
Retrieval is error-driven. When a code execution fails, the system builds a semantic query from the extracted error terms and fetches the top candidates from ChromaDB. These are then reranked: code example chunks get a significant boost, API docs get a moderate boost, and chunks mentioning more Manim classes rank higher. The top results are assembled into a documentation context block that the code agent sees on its next attempt.
This grounds the model in actual implementation details (real class signatures, working code patterns, confirmed behaviors) rather than letting it hallucinate from training data alone.
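A sketch of what that retrieval-and-rerank step could look like, assuming a ChromaDB collection whose chunks carry chunk_type and manim_classes metadata; the store path, collection name, field names, and boost values are illustrative:

```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="manim_docs_db")   # path is illustrative
collection = chroma.get_collection("manim_docs")           # collection name is illustrative

def embed(text: str) -> list[float]:
    resp = openai_client.embeddings.create(model="text-embedding-3-large", input=text)
    return resp.data[0].embedding

def retrieve_docs(error_terms: list[str], k: int = 5) -> str:
    query = " ".join(error_terms)
    results = collection.query(query_embeddings=[embed(query)], n_results=20)

    scored = []
    for doc, meta, dist in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        score = -dist                                        # smaller distance = better match
        chunk_type = meta.get("chunk_type", "")
        if chunk_type == "code_example":
            score += 0.30                                    # significant boost for working code
        elif chunk_type == "api_doc":
            score += 0.15                                    # moderate boost for API docs
        # Chunks mentioning more Manim classes rank higher
        # (assumed stored as a comma-separated string in metadata).
        score += 0.02 * len([c for c in meta.get("manim_classes", "").split(",") if c])
        scored.append((score, doc))

    top_docs = [doc for _, doc in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]
    return "\n\n".join(top_docs)
```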
Parallel Execution
Both Manim code generation and video rendering run in parallel using Python's ThreadPoolExecutor. Each segment's code generation (including its retry loop) runs in its own thread, with up to 4 workers executing concurrently. Similarly, video rendering and audio synchronization for each segment happen in parallel threads.
This parallelization significantly reduces total pipeline time. For a paper with 6 segments, instead of processing them sequentially (6 × average_time), the pipeline processes them in batches of 4, cutting wall-clock time by roughly 60-70% depending on segment complexity.
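A minimal sketch of that fan-out, reusing the hypothetical render_with_feedback helper from the execution-feedback sketch above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def render_all_segments(segments, timeline, max_workers: int = 4):
    """Render every segment concurrently, up to max_workers at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(render_with_feedback, seg, timeline[seg["id"]]): seg["id"]
            for seg in segments
        }
        for future in as_completed(futures):
            seg_id = futures[future]
            results[seg_id] = future.result()  # re-raises if a segment failed outright
    return results
```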
Audio and Video Synchronization
Narration is generated using OpenAI TTS at the beat level: each sentence or short phrase is its own audio file with a precisely measured WAV duration. These durations are accumulated into a timeline JSON that records the exact start time and length of every beat within every segment.
After a Manim segment renders, its actual video duration is measured with ffprobe. The pipeline then computes a speed factor: target_duration / actual_duration. If the adjustment is within ±30%, ffmpeg's setpts filter is applied to retime the video. If the video runs short, the last frame is frozen and appended. If it runs long, it's trimmed. The speed-adjusted video and audio are then merged into a single synced segment, and all segments are concatenated into the final output.
[Diagram: narration beats (8-25 words) · OpenAI TTS · Manim code · render · measure · ffmpeg setpts; too short → freeze last frame; too long → trim]
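A sketch of the retiming step with standard ffprobe/ffmpeg invocations; the ±30% threshold mirrors the rule above, while the tpad filter shown for freezing the last frame is one reasonable way to do it, not necessarily the pipeline's exact command:

```python
import subprocess

def video_duration(path: str) -> float:
    """Measure a video's duration in seconds with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def retime_segment(video_in: str, video_out: str, target: float) -> None:
    actual = video_duration(video_in)
    factor = target / actual
    if 0.7 <= factor <= 1.3:
        # Within ±30%: stretch or squeeze the video to match the narration.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_in,
             "-filter:v", f"setpts={factor}*PTS", "-an", video_out],
            check=True,
        )
    elif actual < target:
        # Video runs short: clone (freeze) the last frame to pad out the difference.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_in,
             "-vf", f"tpad=stop_mode=clone:stop_duration={target - actual}", "-an", video_out],
            check=True,
        )
    else:
        # Video runs long: trim it down to the target duration.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_in, "-t", str(target), "-an", video_out],
            check=True,
        )
```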
Model Comparison
We ran the same paper, Unified Latents, through the pipeline in "Scholar" mode using three different frontier models for the explanation and code generation steps. Here's what each produced.