No "Bring Your Own Model": How We Source Client-Side AI Models
Plenty of browser AI tools punt the hard part to you: "paste a model URL." We don't. Here is our rule for sourcing self-hostable, permissively licensed ONNX models so every tool works out of the box.
A pattern shows up in browser-based AI tools when a feature gets hard: bring your own model. The tool ships a text box where you paste a model URL, and if you don't have one, the feature simply doesn't work. It's a way to claim a capability without actually delivering it. We made a rule against it: every tool gets a real, self-hostable, permissively-licensed model — or it's honestly gated, never faked.
That rule turns "source a model" into a concrete research checklist. Here's how we evaluate one.
The four questions for every model
- Does a web-ready ONNX export exist? Not a PyTorch checkpoint — an ONNX file that ONNX Runtime Web or Transformers.js can actually load. Many famous models have no clean web export, which quietly kills them as client-side options.
- Is the license permissive enough to self-host? We need to serve the weights ourselves (MIT/Apache and similar). A research-only or non-commercial license is a non-starter for a free public tool.
- Does it fit the browser's constraints? That means a download size we can gate honestly and, critically, a compute graph that fits WebGPU's limits or runs acceptably on WASM. (A model that needs 33+ storage buffers per shader stage won't run on the many devices that expose only WebGPU's default maxStorageBuffersPerShaderStage limit of 8.)
- What's the exact I/O contract? Input tensor shape, normalization, output decoding. Without this you can load a model and still get garbage.
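The four questions can be sketched as a single check. This is a minimal illustration, not our actual tooling: the Candidate shape, the PERMISSIVE allowlist, and the failedQuestions helper are all hypothetical names. Download size is left out because the checklist gives no numeric threshold for it.

```typescript
// Hypothetical record for a model under review; field names are illustrative.
interface Candidate {
  name: string;
  hasWebOnnxExport: boolean;       // question 1: loadable ONNX export exists
  license: string;                 // question 2: SPDX-style identifier
  storageBuffersPerStage: number;  // question 3: worst case in the compute graph
  runsAcceptablyOnWasm: boolean;   // question 3: fallback when WebGPU is out
  ioContractDocumented: boolean;   // question 4: shapes, normalization, decoding
}

// Assumed allowlist; the checklist names MIT and Apache "and similar".
const PERMISSIVE = new Set<string>(["MIT", "Apache-2.0"]);

// WebGPU's default maxStorageBuffersPerShaderStage limit.
const WEBGPU_STORAGE_BUFFER_FLOOR = 8;

// Returns the questions a candidate fails; an empty array means it passes.
function failedQuestions(c: Candidate): string[] {
  const failed: string[] = [];
  if (!c.hasWebOnnxExport) failed.push("export");
  if (!PERMISSIVE.has(c.license)) failed.push("license");
  const fitsWebGpu = c.storageBuffersPerStage <= WEBGPU_STORAGE_BUFFER_FLOOR;
  if (!fitsWebGpu && !c.runsAcceptablyOnWasm) failed.push("constraints");
  if (!c.ioContractDocumented) failed.push("io-contract");
  return failed;
}
```

A candidate that needs more storage buffers than the floor still passes question 3 in this sketch if it runs acceptably on WASM, which mirrors the "or" in the checklist.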
What that yielded this round
- Florence-2 (onnx-community/Florence-2-base-ft): one model that upgrades captioning, OCR, alt-text and VQA, replacing weaker single-task models. MIT. The catch we documented: it's quantization-sensitive in the encoder, so you must set per-module dtypes (keep the vision encoder higher precision) or OCR comes out garbled.
- MediaPipe Tasks Vision (@mediapipe/tasks-vision): fully client-side pose, hand, and face-mesh landmarks. Apache-2.0. This is the honest path for pose-conditioned generation and for constraining eye-colour edits to the actual iris instead of recolouring the whole image.
- Demucs v4 ONNX: a self-contained ONNX export (STFT/ISTFT rewritten to be ONNX-compatible) for audio stem separation, runnable via ONNX Runtime Web. MIT.
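The per-module dtype point for Florence-2 might look like the sketch below. It assumes Transformers.js, whose dtype option accepts a map from module name to precision. The module names and the fp16/q4 split follow the Florence-2 example in the Transformers.js documentation and are not necessarily the values we ship. The florenceDtypes helper is a hypothetical name.

```typescript
// A subset of the precision labels Transformers.js accepts for `dtype`.
type Dtype = "fp32" | "fp16" | "q8" | "q4";

// Module names as used in the Transformers.js docs for Florence-2;
// verify them against the files in the model repository before relying on them.
interface FlorenceDtypes {
  embed_tokens: Dtype;
  vision_encoder: Dtype;
  encoder_model: Dtype;
  decoder_model_merged: Dtype;
}

// Builds a per-module dtype map: the vision encoder stays at fp16 while
// the text encoder and decoder may be quantized more aggressively.
function florenceDtypes(textDtype: Dtype = "q4"): FlorenceDtypes {
  return {
    embed_tokens: "fp16",
    vision_encoder: "fp16", // the quantization-sensitive part
    encoder_model: textDtype,
    decoder_model_merged: textDtype,
  };
}

// Intended use (not executed here; requires @huggingface/transformers):
//   const model = await Florence2ForConditionalGeneration.from_pretrained(
//     "onnx-community/Florence-2-base-ft",
//     { dtype: florenceDtypes(), device: "webgpu" },
//   );
```

Keeping the map in one function means a precision change is a one-line edit rather than something repeated at every load site.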
For each, we recorded the URL, license, size, and I/O contract in our model catalog before wiring it into a tool, so "does this tool use the right model?" has one audited answer instead of being scattered across modules.
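A catalog of this kind could be as simple as two lookup tables. The sketch below is illustrative only: the ModelEntry shape, the tool names, and modelForTool are hypothetical. It fills in only the identifiers and licenses named in this post, and leaves the size and I/O fields unset rather than guessing at values.

```typescript
// Hypothetical catalog record; the optional fields are left unset here
// because this post does not state sizes or I/O details.
interface ModelEntry {
  source: string | null; // repo id or package name; null = not given in this post
  license: "MIT" | "Apache-2.0";
  sizeBytes?: number;
  ioContract?: string;
}

const MODELS = new Map<string, ModelEntry>([
  ["florence-2", { source: "onnx-community/Florence-2-base-ft", license: "MIT" }],
  ["mediapipe-vision", { source: "@mediapipe/tasks-vision", license: "Apache-2.0" }],
  ["demucs-v4", { source: null, license: "MIT" }],
]);

// Illustrative tool names mapped to the single model each one uses.
const TOOL_TO_MODEL = new Map<string, string>([
  ["captioning", "florence-2"],
  ["ocr", "florence-2"],
  ["alt-text", "florence-2"],
  ["vqa", "florence-2"],
  ["pose", "mediapipe-vision"],
  ["stem-separation", "demucs-v4"],
]);

// One audited answer to "which model does this tool use?".
function modelForTool(tool: string): ModelEntry | undefined {
  const key = TOOL_TO_MODEL.get(tool);
  return key === undefined ? undefined : MODELS.get(key);
}
```

Because every tool resolves through the same table, four Florence-2 tools share one entry, and a tool with no entry returns undefined instead of a default model.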
The honest-gate fallback
Sometimes the answer to the four questions is "not yet" — no clean export, or it's too heavy to ship responsibly. In that case the tool shows a clear "not available yet" state. What it never does is return your input and call it a result. A feature that says "coming soon" is honest; a feature that fakes its output is not.
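One way to make that rule hard to break is to encode it in the return type. The following is a minimal sketch under assumed names (ToolResult and runTool are hypothetical, not our actual API): a tool either returns real model output or an explicit "unavailable" state, and there is no branch that hands the input back.

```typescript
// A result is either real output or an explicit gate, never the echoed input.
type ToolResult<T> =
  | { status: "ok"; output: T }
  | { status: "unavailable"; reason: string };

// `model` is null when no candidate has passed the four questions yet.
function runTool<I, O>(input: I, model: ((x: I) => O) | null): ToolResult<O> {
  if (model === null) {
    return { status: "unavailable", reason: "No vetted model for this tool yet." };
  }
  return { status: "ok", output: model(input) };
}
```

Callers have to check status before they can read output, so the "not available yet" state cannot be silently skipped in the UI.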
Why this is more work — and worth it
"Bring your own model" is easier for us and worse for you: it offloads the genuinely hard part (finding a model that's web-ready, licensed, and fits the device) onto the user. Doing that work ourselves is the difference between a tool that lists a capability and a tool that has one.