
No "Bring Your Own Model": How We Source Client-Side AI Models

Plenty of browser AI tools punt the hard part to you: "paste a model URL." We don't. Here is our rule for sourcing self-hostable, permissively licensed ONNX models so every tool works out of the box.

A pattern shows up in browser-based AI tools when a feature gets hard: bring your own model. The tool ships a text box where you paste a model URL, and if you don't have one, the feature simply doesn't work. It's a way to claim a capability without actually delivering it. We made a rule against it: every tool gets a real, self-hostable, permissively-licensed model — or it's honestly gated, never faked.

That rule turns "source a model" into a concrete research checklist. Here's how we evaluate one.

The four questions for every model

  1. Does a web-ready ONNX export exist? Not a PyTorch checkpoint — an ONNX file that ONNX Runtime Web or Transformers.js can actually load. Many famous models have no clean web export, which quietly kills them as client-side options.
  2. Is the license permissive enough to self-host? We need to serve the weights ourselves, which means MIT, Apache-2.0, or a similarly permissive license. A research-only or non-commercial license is a non-starter for a free public tool.
  3. Does it fit the browser's constraints? That means a download size small enough to gate honestly and, critically, a compute graph that fits WebGPU's limits or runs acceptably on WASM. (A model that needs 33+ storage buffers per shader stage won't fit within WebGPU's default limit of 8, maxStorageBuffersPerShaderStage, which is all that many devices offer.)
  4. What's the exact I/O contract? Input tensor shape, normalization, output decoding. Without this you can load a model and still get garbage.
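
As a rough illustration, the four questions can be encoded as a check that returns every reason a candidate fails. This is a minimal sketch, not our actual tooling: the ModelCandidate fields, the license allow-list, and the download budget parameter are all hypothetical names chosen for the example. The only external fact it relies on is WebGPU's default maxStorageBuffersPerShaderStage of 8.

```typescript
// Hypothetical record for one candidate model; field names are illustrative.
interface ModelCandidate {
  name: string;
  hasWebOnnxExport: boolean;      // question 1
  license: string;                // question 2, e.g. "MIT"
  downloadBytes: number;          // question 3
  storageBuffersPerStage: number; // question 3, worst-case shader stage
  runsAcceptablyOnWasm: boolean;  // question 3, fallback path
  ioContractDocumented: boolean;  // question 4
}

// Assumed allow-list; extend it with whatever "similar" licenses you accept.
const PERMISSIVE_LICENSES = new Set(["MIT", "Apache-2.0"]);

// WebGPU's default limit. In a browser, the device's real value would come
// from adapter.limits.maxStorageBuffersPerShaderStage.
const DEFAULT_STORAGE_BUFFER_LIMIT = 8;

// Returns the list of failed checks; an empty list means the model passes.
function evaluateCandidate(
  m: ModelCandidate,
  maxDownloadBytes: number,
  storageBufferLimit: number = DEFAULT_STORAGE_BUFFER_LIMIT,
): string[] {
  const failures: string[] = [];
  if (!m.hasWebOnnxExport) failures.push("no web-ready ONNX export");
  if (!PERMISSIVE_LICENSES.has(m.license)) {
    failures.push(`license not permissive: ${m.license}`);
  }
  if (m.downloadBytes > maxDownloadBytes) failures.push("download too large");
  const fitsWebGpu = m.storageBuffersPerStage <= storageBufferLimit;
  if (!fitsWebGpu && !m.runsAcceptablyOnWasm) {
    failures.push("exceeds WebGPU limits and has no acceptable WASM path");
  }
  if (!m.ioContractDocumented) failures.push("I/O contract unknown");
  return failures;
}
```

Returning all failures rather than stopping at the first keeps the research notes complete: a model rejected for its license today may still be worth revisiting if only the export is missing.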

What that yielded this round

  • Florence-2 (onnx-community/Florence-2-base-ft) — one model that upgrades captioning, OCR, alt-text and VQA, replacing weaker single-task models. MIT. The catch we documented: it's quantization-sensitive in the encoder, so you must set per-module dtypes (keep the vision encoder higher precision) or OCR comes out garbled.
  • MediaPipe Tasks Vision (@mediapipe/tasks-vision) — fully client-side pose, hand, and face-mesh landmarks. Apache-2.0. This is the honest path for pose-conditioned generation and for constraining eye-colour edits to the actual iris instead of recolouring the whole image.
  • Demucs v4 ONNX — a self-contained ONNX export (STFT/ISTFT rewritten to be ONNX-compatible) for audio stem separation, runnable via ONNX Runtime Web. MIT.
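
For the Florence-2 entry above, per-module dtypes might be expressed as follows. This is a sketch under assumptions: the module keys follow the per-module dtype pattern shown in the Transformers.js documentation for Florence-2, the specific precisions are an assumed starting point rather than measured settings, and assertEncoderPrecision and florence2LoadOptions are hypothetical helpers.

```typescript
type Dtype = "fp32" | "fp16" | "q8" | "q4";

const FLORENCE2_MODEL_ID = "onnx-community/Florence-2-base-ft";

// One precision per ONNX sub-module instead of a single global dtype.
// The precisions chosen here are assumptions for illustration.
const florence2Dtypes: Record<string, Dtype> = {
  embed_tokens: "fp16",
  vision_encoder: "fp16", // kept at higher precision so OCR stays legible
  encoder_model: "q4",
  decoder_model_merged: "q4",
};

// Hypothetical guard: fail loudly if someone quantizes the vision encoder.
function assertEncoderPrecision(dtypes: Record<string, Dtype>): void {
  const precision = dtypes["vision_encoder"];
  if (precision !== "fp32" && precision !== "fp16") {
    throw new Error(`vision_encoder must stay fp16 or fp32, got ${precision}`);
  }
}

// Hypothetical helper that builds the options object for model loading.
function florence2LoadOptions(useWebGpu: boolean) {
  assertEncoderPrecision(florence2Dtypes);
  return {
    dtype: florence2Dtypes,
    device: useWebGpu ? "webgpu" : "wasm",
  };
}
```

With Transformers.js, these options would be passed as the second argument to Florence2ForConditionalGeneration.from_pretrained, alongside FLORENCE2_MODEL_ID as the first.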

For each, we recorded the URL, license, size, and I/O contract in our model catalog before wiring it into a tool, so that "does this tool use the right model?" has one audited answer instead of being scattered across modules.
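
The I/O contract is the part of a catalog entry that most directly drives code. As a hedged example, here is how a recorded contract for an image model could feed preprocessing. ImageInputContract and toNchwTensor are hypothetical names, and the NCHW layout and mean/std normalization are common conventions assumed for illustration, not the contract of any model listed above.

```typescript
// Hypothetical I/O contract record for an image-input model.
interface ImageInputContract {
  width: number;
  height: number;
  mean: [number, number, number]; // per-channel mean, applied after /255
  std: [number, number, number];  // per-channel standard deviation
}

// Convert interleaved RGBA bytes (the layout of canvas ImageData.data) into
// a planar, normalized float32 buffer in NCHW order. Alpha is dropped.
function toNchwTensor(
  rgba: Uint8ClampedArray,
  contract: ImageInputContract,
): Float32Array {
  const pixels = contract.width * contract.height;
  if (rgba.length !== pixels * 4) {
    throw new Error(`expected ${pixels * 4} bytes, got ${rgba.length}`);
  }
  const out = new Float32Array(3 * pixels);
  for (let i = 0; i < pixels; i++) {
    for (let ch = 0; ch < 3; ch++) {
      const value = rgba[i * 4 + ch] / 255; // scale to [0, 1]
      out[ch * pixels + i] = (value - contract.mean[ch]) / contract.std[ch];
    }
  }
  return out;
}
```

Getting any one of these details wrong (channel order, scale, mean or std) still produces a tensor of the right shape, which is why a model can load cleanly and return garbage.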

The honest-gate fallback

Sometimes the answer to the four questions is "not yet" — no clean export, or it's too heavy to ship responsibly. In that case the tool shows a clear "not available yet" state. What it never does is return your input and call it a result. A feature that says "coming soon" is honest; a feature that fakes its output is not.
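
One way to make the honest gate hard to bypass is to put it in the type system, so a tool either returns real model output or an explicit unavailable state. This is a minimal sketch with hypothetical names (ToolResult, runTool), not a description of our codebase.

```typescript
// A result is either real output or an explicit "unavailable" state.
// There is no variant that carries the untouched input back to the caller.
type ToolResult<T> =
  | { status: "ok"; output: T }
  | { status: "unavailable"; reason: string };

// Hypothetical runner: `model` is null when no sourced model has passed
// the four questions yet.
function runTool<I, O>(
  input: I,
  model: ((input: I) => O) | null,
): ToolResult<O> {
  if (model === null) {
    // Honest gate: say so instead of echoing the input as a "result".
    return { status: "unavailable", reason: "not available yet" };
  }
  return { status: "ok", output: model(input) };
}
```

Because the caller has to check status before reading output, the UI is forced to handle the "not available yet" case explicitly.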

Why this is more work — and worth it

"Bring your own model" is easier for us and worse for you: it offloads the genuinely hard part (finding a model that's web-ready, licensed, and fits the device) onto the user. Doing that work ourselves is the difference between a tool that lists a capability and a tool that has one.