Why Our Best-Quality Background Remover Runs on WASM, Not WebGPU

WebGPU guarantees only 8 storage buffers per shader stage. BiRefNet's heaviest shader stages need 33 to 65. That single limit decides where our best-quality model runs, and why "just use the GPU" isn't an option.

There's a recurring piece of advice for anyone running ML in the browser: use WebGPU, it's faster than WASM. It usually is. But our best-quality background-removal model — BiRefNet — runs on WASM on purpose, and the reason comes down to a single number in the WebGPU spec.

The number that decides everything

WebGPU's spec guarantees a minimum of 8 storage buffers per shader stage (maxStorageBuffersPerShaderStage). Many real GPUs report more, but 8 is the floor an app can rely on across devices.
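To make the floor concrete: a WebGPU device is created with the default limit of 8 unless the app asks for more through requiredLimits in requestDevice(), and it can only ask for what adapter.limits reports. Below is a minimal TypeScript sketch of that check, assuming the app already knows how many storage buffers its heaviest shader stage binds. The helper storageBufferLimitToRequest and the StorageLimits type are hypothetical names for illustration, not part of any library.

```typescript
// Default (guaranteed) value of maxStorageBuffersPerShaderStage in the WebGPU spec.
const GUARANTEED_STORAGE_BUFFERS = 8;

// Structural stand-in for the one field read from GPUAdapter.limits,
// so the sketch compiles without WebGPU type definitions.
interface StorageLimits {
  maxStorageBuffersPerShaderStage: number;
}

// Hypothetical helper: returns the limit a device would need for a graph that
// binds `needed` storage buffers in one stage, or null if this adapter cannot
// provide that many.
function storageBufferLimitToRequest(
  needed: number,
  adapterLimits: StorageLimits,
): number | null {
  // Within the floor: the default device limits are already enough.
  if (needed <= GUARANTEED_STORAGE_BUFFERS) return GUARANTEED_STORAGE_BUFFERS;
  // Above the floor: only possible if this particular adapter reports enough.
  if (needed <= adapterLimits.maxStorageBuffersPerShaderStage) return needed;
  return null;
}

// In a browser, the result would feed requestDevice(), for example:
//   const adapter = await navigator.gpu.requestAdapter();
//   const limit = storageBufferLimitToRequest(33, adapter.limits);
//   if (limit !== null) {
//     await adapter.requestDevice({
//       requiredLimits: { maxStorageBuffersPerShaderStage: limit },
//     });
//   }
```

A null result is the case this article is about: the graph needs more slots than the device can offer, so WebGPU is off the table for that model on that machine.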

BiRefNet is a bilateral-reference segmentation network. Exported to ONNX and run through ONNX Runtime Web's WebGPU backend, its heaviest shader stages need 33 to 65 storage buffers at once, roughly four to eight times the guaranteed 8. On a machine that only exposes the minimum, session creation doesn't degrade gracefully; it fails with errors like "createBuffer failed, size (1536) is too large for the implementation". A 1.5 KB buffer is obviously not "too large" on real hardware: that error is the device telling you it has run out of binding slots, not memory.

So the choice isn't "fast WebGPU vs slow WASM." It's "a model that crashes on a class of devices vs a model that runs everywhere." We chose runs-everywhere.

What we actually do

Our fast model (RMBG-1.4) fits comfortably within WebGPU limits and uses it when the device has a healthy adapter. BiRefNet is pinned to a multi-threaded WASM execution provider. To make that decision honest rather than accidental, we added two things in our latest pass:

  • Adapter-quality detection. Before choosing WebGPU we inspect the adapter: is it a software fallback? What are its real limits? A degenerate adapter (the kind that reports nonsense buffer limits) gets refused, and we route to WASM explicitly instead of letting the model half-bind to a dying device.
  • Execution-provider telemetry. Every model session now logs which provider actually bound. For ~50 update rounds we could only guess whether a slowdown meant the session had silently fallen back to WASM. Now we can read it from the log.
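The two additions above can be sketched in TypeScript roughly as follows. This is an illustration under stated assumptions, not our production code: AdapterSnapshot, gateAdapter, providerLog, and the reason strings are hypothetical names, and the only "nonsense limit" test shown is a storage-buffer limit that is non-finite or below the spec's floor of 8. A real gate would likely check more limits.

```typescript
type Provider = "webgpu" | "wasm";

// The spec guarantees at least 8, so anything lower is a broken report.
const SPEC_MIN_STORAGE_BUFFERS = 8;

// Hypothetical snapshot the app fills in from the GPUAdapter:
// its fallback (software) flag and the relevant entry of adapter.limits.
interface AdapterSnapshot {
  isFallback: boolean;
  maxStorageBuffersPerShaderStage: number;
}

interface GateResult {
  provider: Provider;
  reason: string;
}

// Adapter-quality gate: refuse missing, software, or degenerate adapters
// and route to WASM explicitly.
function gateAdapter(adapter: AdapterSnapshot | null): GateResult {
  if (adapter === null) return { provider: "wasm", reason: "no-adapter" };
  if (adapter.isFallback) return { provider: "wasm", reason: "software-adapter" };
  const slots = adapter.maxStorageBuffersPerShaderStage;
  if (!isFinite(slots) || slots < SPEC_MIN_STORAGE_BUFFERS) {
    return { provider: "wasm", reason: "degenerate-limits" };
  }
  return { provider: "webgpu", reason: "adapter-ok" };
}

// Telemetry record: what the gate asked for versus what actually bound.
interface ProviderLog {
  model: string;
  requested: Provider;
  bound: Provider;
  fellBack: boolean;
  reason: string;
}

// `bound` is whatever provider the app observed the session running on;
// how that is observed is left to the caller in this sketch.
function providerLog(model: string, gate: GateResult, bound: Provider): ProviderLog {
  return {
    model,
    requested: gate.provider,
    bound,
    fellBack: bound !== gate.provider,
    reason: gate.reason,
  };
}
```

Keeping the reason string next to the bound provider is the design choice that matters here: a slow session can be traced to a refused adapter or to an unexpected fallback without guessing.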

"But couldn't you use a smaller model?"

Yes — and that's the live research path. A BiRefNet_lite / fp16 export roughly halves the graph and the download (around 490 MB vs ~970 MB). If a lite variant fits within a device's storage-buffer budget, we can put it on WebGPU behind the same adapter-quality gate and fall back to WASM otherwise. The point is the gate, not the model: never run on a device path that will crash or crawl, and never pretend you didn't downgrade.
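One way to fall back without hiding the downgrade is to try providers in order and return which one actually bound. The TypeScript sketch below assumes session creation is passed in as a function, so it does not depend on any particular runtime; createWithFallback and BoundSession are hypothetical names. With ONNX Runtime Web, the injected function could wrap InferenceSession.create with a single entry in executionProviders, though that wiring is an assumption and not shown here.

```typescript
type Provider = "webgpu" | "wasm";

interface BoundSession<S> {
  session: S;
  provider: Provider;   // the provider that actually bound
  downgraded: boolean;  // true when it was not the first choice
  errors: string[];     // messages from providers that failed before it
}

// Tries each provider in order. `create` is injected by the caller and is
// expected to reject (or throw) when the session cannot be created.
async function createWithFallback<S>(
  providers: Provider[],
  create: (provider: Provider) => Promise<S>,
): Promise<BoundSession<S>> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const session = await create(provider);
      return {
        session,
        provider,
        downgraded: provider !== providers[0],
        errors,
      };
    } catch (err) {
      // Keep the failure so the downgrade can be explained later.
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${provider}: ${message}`);
    }
  }
  throw new Error(`no provider could create a session (${errors.join("; ")})`);
}
```

A caller that passed the adapter-quality gate would use ["webgpu", "wasm"] as the order; one that failed it would pass ["wasm"] alone, so a refused device is never touched.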

The takeaway

When someone says "just run it on the GPU," the honest answer for browser ML is: only if the graph fits the device's guaranteed limits. For segmentation models with dozens of simultaneous buffers, the WebGPU minimum is the wall you hit first — long before you run out of compute. Knowing the exact number (8) is what lets you design around it instead of shipping a tool that works on your laptop and crashes on a reviewer's.