screenscribe

What it does

Three layers, each building on the last.

Typed extraction

Hand it a JSON Schema (or a preset) and get back validated JSON: every CLI command shown, the final config, a recipe with quantities, the code at each step. Output is checked against the schema and retried when it fails, so it is never malformed. MCP tool: extract_structured
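
A minimal sketch of the check-and-retry idea, not screenscribe's actual implementation. `check_shape`, `extract_with_retry`, and the `call_model` callback are hypothetical names, and the shape check is a hand-rolled stand-in for full JSON Schema validation.

```python
import json
from typing import Callable


def check_shape(data: dict, required: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the data passed."""
    problems = []
    for key, expected in required.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


def extract_with_retry(
    call_model: Callable[[list[str]], str],  # hypothetical model callback
    required: dict[str, type],
    max_attempts: int = 3,
) -> dict:
    """Call the model until its reply parses as JSON and passes the check."""
    problems: list[str] = []
    for _ in range(max_attempts):
        # Problems from the previous attempt are passed back as hints.
        raw = call_model(problems)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(data, dict):
            problems = ["top level must be an object"]
            continue
        problems = check_shape(data, required)
        if not problems:
            return data
    raise ValueError(f"no valid output after {max_attempts} attempts: {problems}")
```

The caller only ever sees a dict that passed the check or an exception, which is what lets an agent act on the result without defensive parsing.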

Cross-video synthesis

Point at a channel, playlist, or list → fan the typed extraction over every video → compound the results into one artifact in the shape you define. Scales past a context window, one capped batch at a time. CLI command: synthesize

Agent-native

An MCP server (9 tools) and a CLI — your agent calls it directly and reasons over the result. Works in Claude Code, Cursor, Windsurf, any MCP client. Any language, no captions required.

Show, don't tell

A schema in. Validated data out.

The extraction layer is a commodity — so we built the part that compounds: typed output an agent automates on, and synthesis no single model call produces.

ONE VIDEO → TYPED JSON

$ uvx screenscribe extract-structured "<url>" --schema recipe
{
  "dish": "Aam Pora Chicken",
  "ingredients": [
    {"item":"raw mango","quantity":"1 medium"},
    … exact quantities, read on-screen + spoken
  ],
  "steps": [{ "seconds":41, "action":"roast over flame" }]
}

7 built-in presets (cli_commands · code_blocks · final_config · step_sequence · resources_mentioned · chapters · recipe) — or pass your own schema.
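
To illustrate what "typed output an agent automates on" can look like downstream, here is a small sketch that consumes the recipe JSON from the example above. The sample holds only the fields shown there (the elided ingredient entries are left out rather than guessed), and the helper names are hypothetical.

```python
import json

# Only the fields visible in the example output; nothing is filled in.
sample = """
{
  "dish": "Aam Pora Chicken",
  "ingredients": [{"item": "raw mango", "quantity": "1 medium"}],
  "steps": [{"seconds": 41, "action": "roast over flame"}]
}
"""


def timestamp(seconds: int) -> str:
    """Format a second offset as m:ss for jumping into the video."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


def shopping_list(recipe: dict) -> list[str]:
    """One line per ingredient, quantity first."""
    return [f'{i["quantity"]} {i["item"]}' for i in recipe["ingredients"]]


def step_index(recipe: dict) -> list[str]:
    """One line per step, prefixed with its timestamp."""
    return [f'{timestamp(s["seconds"])} {s["action"]}' for s in recipe["steps"]]


recipe = json.loads(sample)
print(shopping_list(recipe))  # ['1 medium raw mango']
print(step_index(recipe))     # ['0:41 roast over flame']
```

Because the keys and types are fixed by the schema, code like this can index into the result directly instead of scraping free text.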

A WHOLE CHANNEL → ONE ARTIFACT

$ uvx screenscribe synthesize categorize "<channel>"
channel — 353 videos, 6 categories:
   80  vegetarian
   62  fish
   47  chicken      # confirm, then:

$ uvx screenscribe synthesize pass "<channel>" \
    --category fish --item-schema recipe \
    --aggregate-schema cookbook --top 20
→ one cookbook, compounding with every pass

Each pass is bounded and cached — re-runs are free, and the aggregate is resumable.
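
One way to picture bounded, resumable passes is the sketch below. It is an assumption about the general pattern, not a description of how screenscribe selects or caps a batch; `next_batch` and `run_passes` are hypothetical names.

```python
def next_batch(video_ids: list[str], done: set[str], cap: int) -> list[str]:
    """Pick at most `cap` videos that no earlier pass has processed."""
    pending = [v for v in video_ids if v not in done]
    return pending[:cap]


def run_passes(video_ids: list[str], cap: int) -> list[list[str]]:
    """Run capped passes until nothing is pending; return each pass's batch."""
    done: set[str] = set()
    passes = []
    while True:
        batch = next_batch(video_ids, done, cap)
        if not batch:
            break
        passes.append(batch)
        # In a real run `done` would be persisted, which is what makes
        # an interrupted job resumable.
        done.update(batch)
    return passes
```

Calling `run_passes(["a", "b", "c", "d", "e"], cap=2)` yields three passes, and no pass ever touches more than two videos.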

How it works

Resolve → extract → compound.

Point at anything

A single video, a playlist, an array of URLs, or a whole channel — resolve_videos normalizes it to a video list. Gemini watches the actual frames + audio; no captions needed.

Extract to your schema

Each video → validated JSON conforming to your shape. Cached per (video, schema), so re-runs cost nothing.
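
A cache keyed on the (video, schema) pair could be sketched as follows. This is an illustration under assumptions: the key format, `cache_key`, and `ExtractionCache` are hypothetical, and the in-memory dict stands in for whatever storage the tool actually uses.

```python
import hashlib
import json
from typing import Callable


def cache_key(video_id: str, schema: dict) -> str:
    """Hash the video id together with a canonical form of the schema."""
    # sort_keys makes the key independent of dict ordering in the schema.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    payload = f"{video_id}\n{canonical}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ExtractionCache:
    """In-memory cache: compute once per (video, schema), then reuse."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}

    def get_or_compute(
        self, video_id: str, schema: dict, compute: Callable[[], dict]
    ) -> dict:
        key = cache_key(video_id, schema)
        if key not in self._store:
            self._store[key] = compute()  # the only paid call
        return self._store[key]
```

Changing either the video or the schema produces a new key, so a new schema over an already-processed video triggers a fresh extraction while an identical re-run does not.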

Compound across many

Fold the per-video results into one persisted aggregate — a cookbook, a technique grammar, a comparison. The artifact grows; you stay in control of each pass.
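
The fold step can be pictured as a small reducer over a JSON file on disk. This is a sketch under assumptions: the aggregate layout (`recipes`, `sources`) and all function names are hypothetical, not the tool's cookbook schema.

```python
import json
from pathlib import Path


def load_aggregate(path: Path) -> dict:
    """Load the persisted aggregate, or start an empty one."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"recipes": {}, "sources": []}


def fold(aggregate: dict, video_id: str, recipe: dict) -> dict:
    """Merge one per-video result; a video already folded is skipped."""
    if video_id in aggregate["sources"]:
        return aggregate
    aggregate["recipes"][recipe["dish"]] = recipe
    aggregate["sources"].append(video_id)
    return aggregate


def save_aggregate(path: Path, aggregate: dict) -> None:
    """Write the aggregate back so the next pass can pick it up."""
    text = json.dumps(aggregate, indent=2, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")
```

Tracking `sources` alongside the merged data is what lets a later pass add new videos without double-counting earlier ones.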

Install

Two ways in. One key.

Requires uv (or pip). ffmpeg is bundled. The only key is GEMINI_API_KEY — transcript-only mode needs none.

As an MCP server — for Claude Code / Cursor / Windsurf

$ claude mcp add screenscribe -- uvx screenscribe-mcp

As a CLI — extract, synthesize, or just grab a transcript free

$ uvx screenscribe extract-structured "<url>" --schema cli_commands
$ uvx screenscribe extract "<url>" --transcript-only  # free, no key

MCP tools

Nine tools your agent can call.

extract_transcript · free, no key

analyze_video · whole-video analysis

extract_frames · key frames / slides

extract_structured · typed JSON

synthesize_categorize · discover categories

synthesize_pass · compounding aggregate

get_video_analysis · read analysis

get_session · transcript + frames

list_sessions · processed videos