Hand it a schema and it fills it from what's on screen and said. Point it at a whole channel and it compounds every video into one coherent artifact. Built as a primitive your agent composes — not another summarizer.
# one line — no install, no paths, no JSON surgery $ claude mcp add screenscribe -- uvx screenscribe-mcp
Hand it a JSON Schema (or a preset) and get back validated JSON — every CLI command shown, the final config, a recipe with quantities, the code at each step. Output is schema-checked, retried, never malformed. extract_structured
Point at a channel / playlist / list → fan the typed extraction over every video → compound the results into one artifact in the shape you define. Scales past a context window, one capped batch at a time. synthesize
An MCP server (9 tools) and a CLI — your agent calls it directly and reasons over the result. Works in Claude Code, Cursor, Windsurf, any MCP client. Any language, no captions required.
The extraction layer is a commodity — so we built the part that compounds: typed output an agent automates on, and synthesis no single model call produces.
$ uvx screenscribe extract-structured "<url>" --schema recipe { "dish": "Aam Pora Chicken", "ingredients": [ {"item":"raw mango","quantity":"1 medium"}, … exact quantities, read on-screen + spoken ], "steps": [{ "seconds":41, "action":"roast over flame" }] }
7 built-in presets (cli_commands · code_blocks · final_config · step_sequence · resources_mentioned · chapters · recipe) — or pass your own schema.
$ uvx screenscribe synthesize categorize "<channel>" channel — 353 videos, 6 categories: 80 vegetarian 62 fish 47 chicken # confirm, then: $ uvx screenscribe synthesize pass "<channel>" \ --category fish --item-schema recipe \ --aggregate-schema cookbook --top 20 → one cookbook, compounding with every pass
Each pass is bounded and cached — re-runs are free, and the aggregate is resumable.
A single video, a playlist, an array of URLs, or a whole channel — resolve_videos normalizes it to a video list. Gemini watches the actual frames + audio; no captions needed.
Each video → validated JSON conforming to your shape. Cached per (video, schema), so re-runs cost nothing.
Fold the per-video results into one persisted aggregate — a cookbook, a technique grammar, a comparison. The artifact grows; you stay in control of each pass.
Requires uv (or pip). ffmpeg is bundled. The only key is GEMINI_API_KEY — transcript-only mode needs none.
$ claude mcp add screenscribe -- uvx screenscribe-mcp
$ uvx screenscribe extract-structured "<url>" --schema cli_commands $ uvx screenscribe extract "<url>" --transcript-only # free, no key