Guide
Video Background Removal — How It Works
A frame-by-frame deep dive: what the AI does, why some clips work better than others, and how to get a clean cutout.
How video background removal works
Video background removal runs the same AI model used for still images (RMBG-1.4 or BiRefNet) on every frame of the video, then composites each masked frame into a new output video. The whole pipeline runs locally in your browser via WebAssembly and WebGPU — no upload, no server.
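The compositing step can be illustrated with ordinary alpha blending. This is a minimal sketch, not the tool's actual code: it assumes pixels are RGB tuples in the range 0 to 255 and that the model's mask supplies one opacity value between 0.0 and 1.0 per pixel. The names `composite_pixel` and `composite_frame` are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def composite_pixel(fg: Pixel, bg: Pixel, alpha: float) -> Pixel:
    """Blend one subject pixel over one background pixel.

    alpha is the mask value for this pixel: 1.0 keeps the subject,
    0.0 shows only the replacement background.
    """
    r = round(fg[0] * alpha + bg[0] * (1.0 - alpha))
    g = round(fg[1] * alpha + bg[1] * (1.0 - alpha))
    b = round(fg[2] * alpha + bg[2] * (1.0 - alpha))
    return (r, g, b)


def composite_frame(
    frame: List[List[Pixel]],
    mask: List[List[float]],
    bg_color: Pixel,
) -> List[List[Pixel]]:
    """Apply the mask to every pixel of a frame, over a solid colour."""
    return [
        [composite_pixel(px, bg_color, a) for px, a in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


frame = [[(200, 50, 50), (10, 10, 10)]]  # one row, two pixels
mask = [[1.0, 0.0]]                      # first pixel is subject, second is background
green = (0, 255, 0)
print(composite_frame(frame, mask, green))  # [[(200, 50, 50), (0, 255, 0)]]
```

Mask values between 0 and 1 produce partially transparent edge pixels, which is why mask quality matters most around hair and other fine detail.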
The challenge with video isn't the per-frame inference (which is the same as for images) — it's keeping the mask stable across frames. A model that produces a slightly different mask each frame causes visible edge flicker. We solve this with temporal smoothing: each frame's mask is blended with the previous frame's mask (75% current / 25% previous) so subjects move smoothly without edges popping.
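The 75% / 25% blend can be sketched in a few lines. This is an illustration under stated assumptions rather than the tool's implementation: masks are nested lists of floats between 0.0 and 1.0, and "previous" is taken to mean the previous smoothed mask (the text above does not say whether the raw or the smoothed mask is used). The function names are hypothetical.

```python
from typing import List, Optional

Mask = List[List[float]]


def smooth_mask(current: Mask, previous: Optional[Mask],
                current_weight: float = 0.75) -> Mask:
    """Blend the current mask with the previous one (75% / 25% by default)."""
    if previous is None:
        # The first frame has nothing to blend with.
        return [row[:] for row in current]
    previous_weight = 1.0 - current_weight
    return [
        [c * current_weight + p * previous_weight for c, p in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(current, previous)
    ]


def smooth_sequence(masks: List[Mask]) -> List[Mask]:
    """Smooth masks in order. Assumption: each frame blends with the
    previous *smoothed* mask, not the previous raw mask."""
    smoothed: List[Mask] = []
    previous: Optional[Mask] = None
    for mask in masks:
        previous = smooth_mask(mask, previous)
        smoothed.append(previous)
    return smoothed


# A single-pixel mask that jumps from background (0.0) to subject (1.0).
print(smooth_sequence([[[0.0]], [[1.0]], [[1.0]]]))
# [[[0.0]], [[0.75]], [[0.9375]]]
```

In this sketch the pixel reaches 0.75 on the first frame after the jump and 0.9375 on the next, so an abrupt change is spread over a few frames instead of appearing as a single-frame pop.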
Fast vs Best Quality
Fast (RMBG-1.4) — ~80 MB model, ~0.5 sec/frame on most hardware. Good for talking heads, single subjects, social media clips.
Best Quality (BiRefNet) — ~280 MB model, ~1.5 sec/frame. Significantly cleaner edges on hair, fur, complex backgrounds, and fine details. Use it when you need broadcast quality.
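To get a feel for what those per-frame figures mean for a whole clip, multiply by the frame count. The sketch below uses the approximate timings quoted above; real speed depends on your hardware, and the estimate ignores model download and encoding time. The function name and the `"fast"` / `"best"` keys are hypothetical.

```python
# Approximate per-frame inference times quoted in this guide.
SECONDS_PER_FRAME = {"fast": 0.5, "best": 1.5}


def estimate_processing_seconds(duration_s: float, fps: float, mode: str) -> float:
    """Rough inference time for a clip: frame count times seconds per frame."""
    frame_count = round(duration_s * fps)
    return frame_count * SECONDS_PER_FRAME[mode]


# A 10-second clip at 30 fps has 300 frames.
print(estimate_processing_seconds(10, 30, "fast"))  # 150.0
print(estimate_processing_seconds(10, 30, "best"))  # 450.0
```

So the same 10-second clip takes roughly two and a half minutes in Fast mode and seven and a half minutes in Best Quality mode.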
Output format choices
- WebM (VP9) with alpha: preserves true transparency. Needed when you will composite the clip in After Effects, Premiere, or a web overlay. Largest file size of the three.
- MP4 (H.264) with a solid, blurred, or image background: universal compatibility, smaller files. Use when you've replaced the background with a colour, blur, or image.
- WebM (VP9) with background: the middle ground. Smaller than WebM with alpha, and playable in more places than the alpha version.
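One way to read the list above is as a simple decision rule. The helper below is only a sketch of that reading; the function name and its two parameters are hypothetical and not part of the tool.

```python
def choose_output_format(needs_transparency: bool,
                         needs_universal_playback: bool) -> str:
    """Pick an output format using the trade-offs described above."""
    if needs_transparency:
        # Only the alpha WebM keeps a real transparent background.
        return "WebM (VP9) with alpha"
    if needs_universal_playback:
        return "MP4 (H.264) with background"
    return "WebM (VP9) with background"


print(choose_output_format(needs_transparency=True, needs_universal_playback=False))
# WebM (VP9) with alpha
```

Transparency is checked first because it is the one requirement the other two formats cannot meet.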
Best practices
- Stable lighting: sudden lighting changes between frames make the model produce inconsistent masks. Avoid auto-exposure on the camera.
- Avoid thin limbs against busy backgrounds: Fast (RMBG-1.4) can lose thin features when the background contains colours similar to the subject. Best Quality (BiRefNet) handles this better.
- Process at original resolution: don't downscale before loading the clip into the tool. The tool processes the video frame by frame, so lower input resolution means lower output quality.
- Use a blurred or solid background if you don't need true alpha: much faster encoding, smaller files, broader compatibility.
Speed mode (opt-in)
For clips longer than 30 seconds, the tool offers a "Speed mode" toggle that runs inference on every 2nd frame and interpolates the rest. This roughly halves processing time but may introduce slight edge flicker on fast motion. It's off by default — toggle it on if you need throughput over visual perfection.
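The idea behind Speed mode can be sketched as follows. The text above does not say how the skipped frames are interpolated, so this example assumes the simplest option: average the two neighbouring inferred masks. `infer_mask` is a hypothetical stand-in for the model, and masks are nested lists of floats.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")
Mask = List[List[float]]


def infer_with_speed_mode(frames: Sequence[Frame],
                          infer_mask: Callable[[Frame], Mask]) -> List[Mask]:
    """Run the model on every 2nd frame and fill in the frames between."""
    n = len(frames)
    masks: List[Optional[Mask]] = [None] * n

    # Pass 1: real inference on even-indexed frames only.
    for i in range(0, n, 2):
        masks[i] = infer_mask(frames[i])

    # Pass 2: fill odd-indexed frames from their inferred neighbours.
    for i in range(1, n, 2):
        before = masks[i - 1]
        after = masks[i + 1] if i + 1 < n else None
        if after is None:
            # Last frame has no following inferred mask: reuse the previous one.
            masks[i] = [row[:] for row in before]
        else:
            # Assumption: a plain average of the two neighbouring masks.
            masks[i] = [
                [(a + b) / 2.0 for a, b in zip(row_a, row_b)]
                for row_a, row_b in zip(before, after)
            ]
    return masks


# Fake single-pixel "model" that returns the frame value as its mask.
result = infer_with_speed_mode([0.0, 0.5, 1.0, 0.7], lambda f: [[f]])
print(result)  # [[[0.0]], [[0.5]], [[1.0]], [[1.0]]]
```

In the usage example the model runs on only two of the four frames. The last mask is a copy of its neighbour rather than a true prediction, which hints at why fast motion can show slight edge errors in this mode.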