Technical Deep Dives · June 8, 2026 · 5 min read

How We Finally Proved Which GPU Path Our Models Actually Use

For dozens of update rounds we debated whether a slowdown meant our models had silently fallen back to WASM. We never knew, because nothing logged it. This is the one small telemetry probe that changed how we diagnose performance.

For a long stretch of this project, performance debugging went like this: a tool felt slow, someone said "it's probably silently falling back to WASM instead of WebGPU," someone else patched a symptom, and nobody could confirm whether the theory was even true. We were debating a fact we'd never measured.

The missing instrument

ONNX Runtime Web and Transformers.js both take an execution-provider preference like ['webgpu', 'wasm']. They try WebGPU, and if it's unavailable or the graph doesn't fit, they fall back to WASM. That fallback is silent by design — it's a feature, so your model still runs. But it means the single most important performance question — which provider actually bound? — had no answer in our logs.

So a 3–6× slowdown could be "WebGPU is busy" or "we've been on WASM this whole time" and we genuinely could not tell them apart.

The probe

The fix wasn't a rewrite. It was a small telemetry module that records, per model session: the provider requested, the provider we believe actually bound, the adapter's reported info (vendor, whether it's a software fallback, its real limits), and the dtype. For the Transformers.js path the bound provider is exact: an explicit device: 'webgpu' either binds or throws, so the device of the successful attempt is ground truth. For the raw ONNX path we infer it from adapter health and label the value as inferred, so it is never mistaken for a measurement.
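
A minimal sketch of the "bind or throw" idea, not the actual module: EpRecord, bindWithTelemetry, and the injected load callback are hypothetical names. The assumption is that each attempt asks for exactly one device, so the first attempt that does not throw identifies the bound provider.

```typescript
// Hypothetical sketch: EpRecord, bindWithTelemetry, and `load` are illustrative
// names, not part of Transformers.js or ONNX Runtime Web.
type Device = "webgpu" | "wasm";

interface Attempt {
  device: Device;
  dtype: string; // e.g. "fp16", "fp32", "q8"
}

interface EpRecord {
  model: string;
  requested: Device; // device of the first attempt
  bound: Device; // device of the attempt that succeeded
  exact: boolean; // true: observed directly; false: inferred
  dtype: string;
  failures: string[]; // messages from attempts that threw
}

async function bindWithTelemetry<S>(
  model: string,
  attempts: Attempt[],
  load: (attempt: Attempt) => Promise<S>,
): Promise<{ session: S; record: EpRecord }> {
  const first = attempts[0];
  if (!first) throw new Error("at least one attempt is required");

  const failures: string[] = [];
  for (const attempt of attempts) {
    try {
      // One explicit device per call, so success identifies the provider.
      const session = await load(attempt);
      const record: EpRecord = {
        model,
        requested: first.device,
        bound: attempt.device,
        exact: true,
        dtype: attempt.dtype,
        failures,
      };
      return { session, record };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`${attempt.device}/${attempt.dtype}: ${message}`);
    }
  }
  throw new Error(`no attempt could load ${model}: ${failures.join("; ")}`);
}
```

In a real page, load would wrap a Transformers.js pipeline(...) call that passes the attempt's device and dtype. The failures list keeps the reason each earlier attempt was rejected, which is the information a silent fallback throws away.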

You read it by adding ?debug=ep to any tool URL. The console then shows lines like:

[EP] transformers · birefnet-lite-512 · requested=webgpu = bound=webgpu (fp16) | adapter hw storageBuffers/stage=10
[EP] transformers · ormbg-ONNX · requested=wasm = bound=wasm (q8) — forced WASM compatibility path
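
A gate and formatter for lines like these might look like the sketch below. The helper names (isEpDebugEnabled, formatEpLine, logEp) are hypothetical, and the field layout only approximates the output above; it is an assumption, not the module's real format.

```typescript
// Hypothetical helpers: the names and the exact field layout are assumptions.
interface EpLogFields {
  runtime: string; // e.g. "transformers"
  model: string;
  requested: string;
  bound: string;
  dtype: string;
  exact: boolean; // false when the bound provider was inferred
  note?: string; // free-form detail, e.g. adapter info
}

function isEpDebugEnabled(href: string): boolean {
  // URL parsing tolerates other query parameters and fragments.
  return new URL(href).searchParams.get("debug") === "ep";
}

function formatEpLine(f: EpLogFields): string {
  const marker = f.exact ? "" : " [inferred]";
  const head = `[EP] ${f.runtime} · ${f.model} · requested=${f.requested} bound=${f.bound} (${f.dtype})${marker}`;
  return f.note ? `${head} | ${f.note}` : head;
}

function logEp(href: string, f: EpLogFields): void {
  // Stay silent unless the page was opened with ?debug=ep.
  if (isEpDebugEnabled(href)) console.log(formatEpLine(f));
}
```

In the browser, href would be location.href. Keeping the formatter separate from the gate makes the line easy to unit test without a page.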

What it immediately revealed

Two things stopped being arguments and became facts:

Provider choice is model-specific. Fast is deliberately pinned to WASM because its q8 graph is compact and its MaxPool contract is not supported by our current WebGPU path. Best Quality uses the browser-ready 512px BiRefNet graph on WebGPU/fp16 and can retry WASM/fp32. Telemetry distinguishes each real binding.
Degenerate adapters were real. On some devices WebGPU exposes a software fallback adapter that reports nonsense limits (the source of the infamous "createBuffer size 1536 is too large" error). We detect that and refuse the GPU path, routing Best Quality to its WASM variant on purpose — and the log says so.
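
The adapter check can be sketched as a pure function over a snapshot of what the adapter reports. AdapterSnapshot and adapterVerdict are hypothetical names, and the rejection rules here (a fallback flag, or limits below the WebGPU specification's defaults) are assumptions rather than the exact checks we ship.

```typescript
// Hypothetical sketch: AdapterSnapshot and adapterVerdict are illustrative
// names, and the rejection rules are assumptions.
interface AdapterSnapshot {
  vendor: string;
  isFallbackAdapter: boolean;
  maxBufferSize: number;
  maxStorageBuffersPerShaderStage: number;
}

interface Verdict {
  useGpu: boolean;
  reason: string; // goes into the telemetry line
}

// Default limits from the WebGPU specification. An adapter reporting less
// than these is treated as degenerate.
const SPEC_DEFAULTS = {
  maxBufferSize: 268435456, // 256 MiB
  maxStorageBuffersPerShaderStage: 8,
};

function adapterVerdict(a: AdapterSnapshot | null): Verdict {
  if (a === null) {
    return { useGpu: false, reason: "no adapter" };
  }
  if (a.isFallbackAdapter) {
    return { useGpu: false, reason: "software fallback adapter" };
  }
  if (a.maxBufferSize < SPEC_DEFAULTS.maxBufferSize) {
    return { useGpu: false, reason: `maxBufferSize=${a.maxBufferSize} below spec default` };
  }
  if (a.maxStorageBuffersPerShaderStage < SPEC_DEFAULTS.maxStorageBuffersPerShaderStage) {
    return {
      useGpu: false,
      reason: `storageBuffers/stage=${a.maxStorageBuffersPerShaderStage} below spec default`,
    };
  }
  return { useGpu: true, reason: `adapter hw storageBuffers/stage=${a.maxStorageBuffersPerShaderStage}` };
}
```

In a browser the snapshot would be filled from navigator.gpu.requestAdapter() and the adapter's limits and info. Where the fallback flag is exposed has varied between browser versions, so that lookup is left out of the sketch.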

The broader lesson

You cannot optimize what you cannot observe, and "silent fallback" is the enemy of observation. The cheapest, highest-leverage thing we did for performance this round wasn't a faster code path — it was making the existing behavior visible. One probe turned ~50 rounds of guessing into a question with a printed answer, and it's now part of how we'll catch any future regression to the slow path automatically instead of by feel.
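
One way such an automatic check could work, as a sketch under stated assumptions: compare each session's bound provider with the provider we expect for that model, and skip cases where the routing was deliberate. BoundReport, findUnexpectedBindings, and the expectation table are hypothetical; the model names are the ones from the log lines above.

```typescript
// Hypothetical sketch: the names and the expectation table are assumptions.
type Provider = "webgpu" | "wasm";

interface BoundReport {
  model: string;
  bound: Provider;
  intentional: boolean; // true when a deliberate routing decision chose this provider
}

function findUnexpectedBindings(
  reports: BoundReport[],
  expected: Map<string, Provider>,
): BoundReport[] {
  return reports.filter((r) => {
    const want = expected.get(r.model);
    // Models without an expectation are ignored rather than flagged.
    return want !== undefined && r.bound !== want && !r.intentional;
  });
}

// Expectations taken from the behavior described in this post.
const expectedProviders = new Map<string, Provider>([
  ["birefnet-lite-512", "webgpu"],
  ["ormbg-ONNX", "wasm"],
]);
```

The intentional flag matters: routing Best Quality to WASM on a degenerate adapter is a decision, not a regression, and should not trip the check.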

Tags: webgpu, wasm, execution provider, telemetry, performance, onnx runtime web
