
How We Finally Proved Which GPU Path Our Models Actually Use

For dozens of update rounds we debated whether a slowdown meant our models had silently fallen back to WASM. We never knew, because nothing logged it. This is the one small telemetry probe that changed how we diagnose performance.

For a long stretch of this project, performance debugging went like this: a tool felt slow, someone said "it's probably silently falling back to WASM instead of WebGPU," someone else patched a symptom, and nobody could confirm whether the theory was even true. We were debating a fact we'd never measured.

The missing instrument

ONNX Runtime Web and Transformers.js both take an execution-provider preference like ['webgpu', 'wasm']. They try WebGPU, and if it's unavailable or the graph doesn't fit, they fall back to WASM. That fallback is silent by design — it's a feature, so your model still runs. But it means the single most important performance question — which provider actually bound? — had no answer in our logs.

So a 3–6× slowdown could be "WebGPU is busy" or "we've been on WASM this whole time" and we genuinely could not tell them apart.

The probe

The fix wasn't a rewrite. It was a small telemetry module that records, per model session: the provider requested, the provider we believe actually bound, the adapter's reported info (vendor, whether it's a software fallback, its real limits), and the dtype. For the Transformers.js path the bound provider is exact: an explicit device: 'webgpu' either binds or throws, so the device of the successful attempt is ground truth. For the raw ONNX path we can only infer it from adapter health, so the record labels that value as inferred rather than presenting it as exact.
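The "binds or throws" idea can be sketched as follows. This is a minimal illustration, not the project's actual module: the EpRecord shape, the loadAndRecord name, and the injected load callback are all assumptions made for the example, and the adapter info is left out to keep it short.

```typescript
// Hypothetical telemetry shape; field names are illustrative only.
type Provider = "webgpu" | "wasm";

interface EpRecord {
  model: string;
  requested: Provider[];
  bound: Provider;
  // "exact": a single explicit device loaded successfully.
  // "inferred": the value was deduced indirectly (e.g. from adapter health).
  certainty: "exact" | "inferred";
  dtype: string;
}

// Try each requested device explicitly, one at a time. `load` is an injected
// loader (an assumption of this sketch) that resolves when the device binds
// and rejects when it cannot.
async function loadAndRecord<S>(
  model: string,
  requested: Provider[],
  dtype: string,
  load: (device: Provider) => Promise<S>,
): Promise<{ session: S; record: EpRecord }> {
  let lastError: unknown = new Error("no providers requested");
  for (const device of requested) {
    try {
      const session = await load(device);
      // Only one device was attempted, so a success tells us exactly what bound.
      const record: EpRecord = { model, requested, bound: device, certainty: "exact", dtype };
      return { session, record };
    } catch (err) {
      lastError = err; // remember why this device failed, then try the next one
    }
  }
  throw lastError;
}
```

With Transformers.js, the load callback would presumably wrap a call that passes the single device and the dtype through to the model loader. The key design choice is that the caller never hands the library a list to fall back through, so nothing can be substituted silently.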

You read it by adding ?debug=ep to any tool URL. The console then shows lines like:

[EP] transformers · BiRefNet · requested=wasm = bound=wasm — forced WASM (model exceeds WebGPU limits)
[EP] transformers · RMBG · requested=webgpu,wasm = bound=webgpu | adapter hw storageBuffers/stage=10
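Gating the output on the URL flag needs very little code. The sketch below assumes hypothetical helper names (isEpDebugEnabled, logEp) and only shows the gating, not the project's exact line format.

```typescript
// Returns true when the query string carries debug=ep, e.g. "?debug=ep".
function isEpDebugEnabled(search: string): boolean {
  // URLSearchParams ignores a leading "?" when parsing.
  return new URLSearchParams(search).get("debug") === "ep";
}

// Hypothetical gate: emit a probe line only when the flag is present.
// `sink` defaults to console.log and can be swapped out for testing.
function logEp(
  search: string,
  message: string,
  sink: (line: string) => void = console.log,
): boolean {
  if (!isEpDebugEnabled(search)) return false;
  sink("[EP] " + message);
  return true;
}
```

In a browser the first argument would typically be window.location.search, so the probe stays silent for ordinary visitors and costs nothing unless someone asks for it.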

What it immediately revealed

Two things stopped being arguments and became facts:

  1. Some models are on WASM by design, not by accident. Our best-quality background remover is forced to WASM because its graph needs more storage buffers than WebGPU guarantees. The telemetry labels that explicitly, so "why is it on WASM?" has a documented answer instead of a fallback theory.
  2. Degenerate adapters were real. On some devices WebGPU exposes a software fallback adapter that reports nonsense limits (the source of the infamous "createBuffer size 1536 is too large" error). Now we detect that and refuse the GPU path, routing to WASM on purpose — and the log says so.
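Both findings reduce to a routing decision made before any model loads. The following is a rough sketch of that decision under stated assumptions: AdapterSnapshot is a plain object standing in for values a browser would read from the WebGPU adapter (its fallback flag and its maxBufferSize and maxStorageBuffersPerShaderStage limits), and the sanity threshold is an arbitrary illustrative number, not the value the project uses.

```typescript
// Plain snapshot of what an adapter reports, so the logic can run anywhere.
interface AdapterSnapshot {
  vendor: string;
  isFallbackAdapter: boolean;
  maxBufferSize: number;
  maxStorageBuffersPerShaderStage: number;
}

interface RouteDecision {
  provider: "webgpu" | "wasm";
  reason: string; // human-readable, intended for the telemetry log
}

// Assumed threshold: anything below this is treated as a nonsense limit.
const MIN_SANE_BUFFER_BYTES = 16 * 1024 * 1024;

function chooseProvider(
  adapter: AdapterSnapshot | null,
  storageBuffersNeeded: number,
): RouteDecision {
  if (adapter === null) {
    return { provider: "wasm", reason: "no WebGPU adapter" };
  }
  if (adapter.isFallbackAdapter) {
    return { provider: "wasm", reason: "software fallback adapter" };
  }
  if (adapter.maxBufferSize < MIN_SANE_BUFFER_BYTES) {
    return { provider: "wasm", reason: "degenerate limits (maxBufferSize=" + adapter.maxBufferSize + ")" };
  }
  if (storageBuffersNeeded > adapter.maxStorageBuffersPerShaderStage) {
    return { provider: "wasm", reason: "model exceeds WebGPU limits" };
  }
  return { provider: "webgpu", reason: "adapter healthy" };
}
```

Returning the reason alongside the provider is the point: every route to WASM carries its own explanation, so a WASM binding in the log is never ambiguous.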

The broader lesson

You cannot optimize what you cannot observe, and "silent fallback" is the enemy of observation. The cheapest, highest-leverage thing we did for performance this round wasn't a faster code path — it was making the existing behavior visible. One probe turned ~50 rounds of guessing into a question with a printed answer, and it's now part of how we'll catch any future regression to the slow path automatically instead of by feel.
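One way the "catch regressions automatically" idea could look, purely as an illustration: compare the recorded bindings against a table of expected providers and report any mismatch. The findRegressions name, the record shape, and the expectation table are hypothetical.

```typescript
interface BoundRecord {
  model: string;
  bound: "webgpu" | "wasm";
}

// Compare what actually bound against what each model is expected to use.
// Models without an entry in `expected` are ignored.
function findRegressions(
  records: BoundRecord[],
  expected: Record<string, "webgpu" | "wasm">,
): string[] {
  const problems: string[] = [];
  for (const r of records) {
    const want = expected[r.model];
    if (want !== undefined && want !== r.bound) {
      problems.push(r.model + ": expected " + want + ", bound " + r.bound);
    }
  }
  return problems;
}
```

A check like this could run in an automated browser test, failing the run whenever a model that should be on WebGPU quietly lands on WASM.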