AI Suite

AI Video cluster

Auto-subtitles (Whisper), smart trim (VAD), scene detection, auto highlights, per-frame style transfer, smart reframe (horizontal → vertical), audio denoise, source separation.

The AI Video cluster handles the heavy lifting for video editing that used to require desktop software or cloud-hosted models. Subtitling, silence-trimming, highlight extraction, and vertical reframing all run in your browser on a stack of Whisper, Silero VAD, RNNoise, and (experimental) frame-by-frame Stable Diffusion.

Headline tools (8)

Audio cluster (Whisper + VAD + RNNoise + Demucs share an audio.worker.ts so the audio-track decode runs once per clip):

  • AI Auto-Subtitles — Whisper-base (~140 MB, Lite tier) transcribes speech with word-level timing. Output as SRT, VTT, or burned-in subtitles. Auto-detects language; specify lang:en to skip detection.
  • AI Smart Trim — Silero VAD (~3 MB, Lite) detects silent regions and proposes a cut plan. Modes: cut-silence, cut-silence:1000 (cut gaps > 1 s), cut-filler (uh / um), or keep-speech-only.
  • AI Audio Denoise — RNNoise WASM (<1 MB, Lite) — real-time capable on most devices, removes background hum and AC drone.
  • AI Source Separation — Demucs Small ONNX (~200 MB, Standard) — split audio into vocals + drums + bass + other. Useful for music videos where you want the vocals isolated for subtitling or muted for re-narration.
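
As an illustration of the cut-silence:1000 mode, the sketch below turns a list of speech segments (the kind of output a VAD produces) into proposed cuts for every silent gap longer than a threshold. The Segment type, the planSilenceCuts name, and the sample timings are assumptions made for this example, not the cluster's actual API.

```typescript
// Hypothetical types and names, for illustration only.
interface Segment {
  startMs: number;
  endMs: number;
}

/**
 * Given speech segments (sorted by start time) reported by a VAD,
 * return the silent gaps strictly longer than minGapMs as proposed cuts.
 */
function planSilenceCuts(
  speech: Segment[],
  clipDurationMs: number,
  minGapMs = 1000,
): Segment[] {
  const cuts: Segment[] = [];
  let cursor = 0; // end of the latest speech seen so far
  for (const seg of speech) {
    if (seg.startMs - cursor > minGapMs) {
      cuts.push({ startMs: cursor, endMs: seg.startMs });
    }
    cursor = Math.max(cursor, seg.endMs);
  }
  // Trailing silence after the final speech segment.
  if (clipDurationMs - cursor > minGapMs) {
    cuts.push({ startMs: cursor, endMs: clipDurationMs });
  }
  return cuts;
}

// Made-up timings: only the 7.0 s to 9.5 s gap exceeds one second.
const speech: Segment[] = [
  { startMs: 500, endMs: 4000 },
  { startMs: 4300, endMs: 7000 },
  { startMs: 9500, endMs: 12000 },
];
const cuts = planSilenceCuts(speech, 12500, 1000);
console.log(cuts);
```

Returning a plan rather than cutting immediately matches the "proposes" wording above: the editor can preview the gaps and let the user accept or reject each one.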

Scene-analysis cluster (per-frame inference shares a decode pipeline):

  • AI Scene Detection — histogram + colour-cluster heuristic (no ML model, runs in <2 s for a 60s clip). Returns scene timestamps + representative thumbnails. Powers chapter markers and the highlights tool.
  • AI Auto Highlights — distil a long clip into a 15s / 30s / 60s cut. Combines scene detection + per-frame composition scoring + speech-segment scoring. Pick a focus axis (speech, motion, subject, mixed) to bias the result.
  • AI Video Caption — per-keyframe vision captions aggregated by the LLM into a narrative / bullet list / SEO description / hashtag suggestions / social caption.
  • AI Smart Reframe — the horizontal-to-vertical tool. R-SAM tracks the subject across frames; a path planner generates a smooth crop trajectory that keeps the subject in frame. Target aspects: 9:16, 1:1, 4:5, 16:9.
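
To make the Smart Reframe path-planning idea concrete, here is a minimal sketch that takes per-frame subject centres (as a tracker might report them) and produces a smoothed 9:16 crop window for each frame. The planCropPath name, the CropRect type, and the choice of an exponential moving average as the smoother are assumptions for illustration; the real planner may use a different method.

```typescript
// Hypothetical types and names, for illustration only.
interface CropRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

/**
 * Plan a full-height crop window per frame that follows the subject's
 * horizontal centre. Assumes the target aspect is no wider than the source.
 */
function planCropPath(
  subjectCentersX: number[],
  srcWidth: number,
  srcHeight: number,
  aspectW = 9,
  aspectH = 16,
  alpha = 0.2, // smoothing factor: lower values give a steadier camera
): CropRect[] {
  const cropWidth = Math.min(
    srcWidth,
    Math.round((srcHeight * aspectW) / aspectH),
  );
  const maxX = srcWidth - cropWidth;
  const path: CropRect[] = [];
  let smoothed = subjectCentersX[0] ?? srcWidth / 2;
  for (const cx of subjectCentersX) {
    // Exponential moving average damps frame-to-frame tracker jitter.
    smoothed = alpha * cx + (1 - alpha) * smoothed;
    // Centre the window on the smoothed position, clamped to the frame.
    const x = Math.min(maxX, Math.max(0, Math.round(smoothed - cropWidth / 2)));
    path.push({ x, y: 0, width: cropWidth, height: srcHeight });
  }
  return path;
}

// A subject that jumps from x=960 to x=1400 in a 1920x1080 source.
const path = planCropPath([960, 960, 1400, 1400, 1400], 1920, 1080);
console.log(path.map((r) => r.x));
```

A plain moving average lags behind fast subjects, so it does not by itself guarantee the subject stays inside the window; a production planner would likely add a constraint or look-ahead for that.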

Per-frame style (experimental):

  • AI Per-Frame Style Transfer — SD img2img per frame with seed-locking + an EMA temporal coherence pass. Quality: roughly 70% temporal consistency. At about 15 s per frame at 512 px, a 16-frame clip takes about 4 minutes; this is the slow, experimental tier. Ships behind a Beta badge.
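
The sketch below shows one way an EMA temporal coherence pass could work, along with the timing arithmetic quoted above. It blends each stylised frame with the previous smoothed frame in pixel space; whether the shipped pass operates on pixels or latents is not stated here, so treat the function names, the alpha value, and the flat Float32Array frame layout as assumptions.

```typescript
// Hypothetical helper names, for illustration only.

/**
 * Blend each frame with the previous smoothed frame:
 *   out[t] = alpha * frame[t] + (1 - alpha) * out[t - 1]
 * The first frame is passed through unchanged.
 */
function emaSmoothFrames(frames: Float32Array[], alpha = 0.7): Float32Array[] {
  const out: Float32Array[] = [];
  let prev: Float32Array | null = null;
  for (const frame of frames) {
    const cur = new Float32Array(frame.length);
    for (let i = 0; i < frame.length; i++) {
      cur[i] = prev === null ? frame[i] : alpha * frame[i] + (1 - alpha) * prev[i];
    }
    out.push(cur);
    prev = cur;
  }
  return out;
}

/** Rough wall-clock estimate: frames are processed one after another. */
function estimateSeconds(frameCount: number, secondsPerFrame = 15): number {
  return frameCount * secondsPerFrame;
}

// Two-pixel toy frames: a hard jump from 0 to 1 is eased in over time.
const smoothedFrames = emaSmoothFrames(
  [new Float32Array([0, 0]), new Float32Array([1, 1]), new Float32Array([1, 1])],
  0.5,
);
const totalSeconds = estimateSeconds(16); // 16 frames * 15 s = 240 s, i.e. 4 minutes
console.log(Array.from(smoothedFrames[2]), totalSeconds);
```

A higher alpha keeps more of each newly stylised frame (sharper, more flicker); a lower alpha favours the running average (steadier, more ghosting).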

Experimental video generation (stretch catalogue)

  • AI AnimateDiff — SD + motion module (~300 MB additional). Generates short (8 or 16 frame) animated clips from a prompt at 512 px. 40–60 s wall-clock.
  • AI Cinemagraph — subtle motion on a still image. Optical-flow + per-frame img2img. Good for water, hair, flags, smoke.
  • AI Slow Motion — RIFE-4.6 (~100 MB) frame interpolation. 2× / 4× temporal upsampling with smooth results.
  • AI Voice Clone — Piper TTS (~50 MB + per-speaker) for narration generation.
  • AI Avatar Lipsync — Wav2Lip (~250 MB) — drive a single photo with an audio track. Sync drift ~50 ms; Beta-flagged.

What can't be done client-side

The video-generation matrix has honest exclusions:

  • True text-to-video at quality (CogVideoX-1.5B and similar) — decoder isn't WASM-friendly, no mature browser binding as of cutoff.
  • Stable Video Diffusion (SVD) image-to-video — model + VRAM spike exceed mid-range device capability.
  • Full music generation (MusicGen) — decoder unfriendly to WASM.

These are documented in the FAQ as "needs server-side compute we don't operate." If a future Chromium release or smaller quantised variant ships, they fold in as follow-up phases.

How the cluster integrates with the video editor

The video editor's AI Suite sidebar surfaces all eight headline tools as preset chips. The AI panel in the video editor recognises video-specific commands ("cut the silent parts", "make a 30 second highlight", "burn in subtitles").

The video toolbar's AI Smart-Reframe button (⤢) gives one-click access to the most common workflow (horizontal to TikTok 9:16). The AI Quick-Generate (🪄) button opens a prompt input for text-to-background that applies as the video's bg-fill.

The E3+X1 worker migration that ships with this cluster moves all seven remaining single-thread video utilities (compressor, format converter, resizer, rotate, canvas extender, metadata remover, format comparison) to OffscreenCanvas workers so per-frame AI processing doesn't compete with utility encoding for the main thread.

Tier required

Lite covers Subtitles, Smart Trim, Audio Denoise, Scene Detection, Reframe, and Best-Frame Picker. Standard covers Source Separation, Highlights, Video Caption, Thumbnail Generator. Pro covers Per-Frame Style Transfer, AnimateDiff, Cinemagraph, Voice Clone, Lipsync.
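
The tier mapping above can be expressed as a small lookup. This is a sketch under two assumptions that the text does not confirm: that tiers are cumulative (Pro includes everything in Standard and Lite), and that the tool keys match the short names used in the paragraph. The canRun function and the map itself are hypothetical.

```typescript
// Hypothetical tier table built from the paragraph above.
type Tier = "Lite" | "Standard" | "Pro";

// Assumed ordering: each tier also unlocks the tiers before it.
const TIER_ORDER: Tier[] = ["Lite", "Standard", "Pro"];

const TOOL_TIER = new Map<string, Tier>([
  ["Subtitles", "Lite"],
  ["Smart Trim", "Lite"],
  ["Audio Denoise", "Lite"],
  ["Scene Detection", "Lite"],
  ["Reframe", "Lite"],
  ["Best-Frame Picker", "Lite"],
  ["Source Separation", "Standard"],
  ["Highlights", "Standard"],
  ["Video Caption", "Standard"],
  ["Thumbnail Generator", "Standard"],
  ["Per-Frame Style Transfer", "Pro"],
  ["AnimateDiff", "Pro"],
  ["Cinemagraph", "Pro"],
  ["Voice Clone", "Pro"],
  ["Lipsync", "Pro"],
]);

/** True when the available tier is at or above the tool's required tier. */
function canRun(tool: string, availableTier: Tier): boolean {
  const required = TOOL_TIER.get(tool);
  if (required === undefined) return false; // unknown tools are never enabled
  return TIER_ORDER.indexOf(availableTier) >= TIER_ORDER.indexOf(required);
}

console.log(canRun("Source Separation", "Lite"));
console.log(canRun("Subtitles", "Pro"));
```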