AI Suite
AI Understand cluster
Vision captioning, document Q&A, OCR, alt-text generation, composition audits, and image diff narration. Powered by Xenova ViT-GPT2 + TrOCR + Donut.
The Understand cluster runs on three verified Xenova / Transformers.js models: ViT-GPT2 image-captioning (~120 MB) for captions and alt-text, TrOCR-base-printed (~500 MB) for OCR, and Donut-base-cord-v2 (~280 MB) for document Q&A. All three are opt-in via the Lite tier and run in the browser through Transformers.js + WebGPU (WASM fallback).
These tools don't change your image; they tell you something about it. Their outputs flow into the rest of the platform: alt-text gets embedded into export metadata, captions seed the AI auto-name on download, composition audits drive "Fix this" one-click commands, and the Edit-by-Prompt cluster uses the caption to ground its LLM tool-calling.
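As a rough illustration of the opt-in and fallback behaviour described above, the sketch below models the three downloads and the backend choice. The registry shape, the names UNDERSTAND_MODELS, pickBackend and totalDownloadMB, and the way WebGPU support is detected are all assumptions made for this example, not the platform's actual code; the sizes are the approximate figures quoted above.

```typescript
// Hypothetical registry; sizes are the approximate figures quoted in the text.
type Backend = "webgpu" | "wasm";

interface ModelSpec {
  task: string;
  approxSizeMB: number;
}

const UNDERSTAND_MODELS: Record<string, ModelSpec> = {
  "vit-gpt2-image-captioning": { task: "captions and alt-text", approxSizeMB: 120 },
  "trocr-base-printed": { task: "OCR", approxSizeMB: 500 },
  "donut-base-cord-v2": { task: "document Q&A", approxSizeMB: 280 },
};

// Prefer WebGPU when the browser exposes it; otherwise fall back to WASM.
// In a browser the flag could come from a check such as `"gpu" in navigator`.
function pickBackend(hasWebGPU: boolean): Backend {
  return hasWebGPU ? "webgpu" : "wasm";
}

// Approximate first-run download if a user opts in to the given models.
function totalDownloadMB(modelNames: string[]): number {
  return modelNames.reduce((sum, name) => {
    const spec = UNDERSTAND_MODELS[name];
    if (!spec) throw new Error(`Unknown model: ${name}`);
    return sum + spec.approxSizeMB;
  }, 0);
}
```

Opting in to all three models would, on these figures, mean roughly 900 MB of one-time downloads, which is why each model is fetched only when its tool is first used.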
The six tools
AI Describe. Generate a natural-language description in one of six styles: alt-text (subject-first, screen-reader friendly), product (e-commerce listing copy), social caption, scene description, list-of-objects, or composition critique. Use the preset chips to switch between styles without retyping.
AI Image Q&A. Ask a question about the image, such as "what's the total on this receipt?", "is there visible text?", or "what's the date in the header?". Questions route through Donut (the document-question-answering pipeline), so the tool is strongest on receipts, forms, and other structured documents. For general-scene questions, chain it with AI Describe + the WebLLM tier via the AI panel. Answers are constrained to yes/no, free text, numeric, list, or a colour hex when applicable.
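One way to picture the answer constraint is as a small classifier applied to the model's raw string. The sketch below is an assumption about how such a step could look, not the tool's actual logic: constrainAnswer and the AnswerKind names are hypothetical, and the list type is left out to keep the example short.

```typescript
type AnswerKind = "yes_no" | "numeric" | "hex_colour" | "free_text";

interface ConstrainedAnswer {
  kind: AnswerKind;
  value: string | number | boolean;
}

// Classify a raw model answer into one of the constrained shapes.
function constrainAnswer(raw: string): ConstrainedAnswer {
  const text = raw.trim();
  const lower = text.toLowerCase();

  if (lower === "yes" || lower === "no") {
    return { kind: "yes_no", value: lower === "yes" };
  }
  // Six-digit hex colour such as #1A2B3C, normalised to upper case.
  if (/^#[0-9a-f]{6}$/i.test(text)) {
    return { kind: "hex_colour", value: text.toUpperCase() };
  }
  // Numbers with an optional currency symbol and thousands separators,
  // for example "$1,234.50".
  const numeric = text.replace(/^[$€£]\s*/, "").replace(/,/g, "");
  if (/^-?\d+(\.\d+)?$/.test(numeric)) {
    return { kind: "numeric", value: parseFloat(numeric) };
  }
  // Anything else is passed through unchanged as free text.
  return { kind: "free_text", value: text };
}
```

With this shape, a receipt question that returns "$1,234.50" becomes the number 1234.5, while an answer like "March 3" stays free text.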
AI OCR. Extract every line of text with bounding boxes. Pick plain text, Markdown (preserves headings and bullets), or structured JSON output. Works on receipts, business cards, signage in photos, screenshots of documents, and PDF pages.
AI Alt-Text. A focused alt-text generator that clamps output to your max-char limit (a common accessibility guideline is ≤ 125 characters; WCAG itself sets no fixed limit), strips opinion words, and surfaces a length-class warning when results trend long. The output can be auto-embedded into PNG iTXt / JPEG Comment / WebP XMP via ExportPanelV3.
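The clamp-and-strip behaviour can be sketched in a few lines. Everything here is illustrative: the OPINION_WORDS list, the cleanAltText name, the three length classes, and the 80 % threshold for the "long" warning are assumptions chosen for the example rather than the tool's real settings.

```typescript
// Hypothetical word list; the real tool's list is not documented here.
const OPINION_WORDS = ["beautiful", "stunning", "amazing", "gorgeous", "lovely"];

type LengthClass = "ok" | "long" | "clamped";

interface AltTextResult {
  text: string;
  lengthClass: LengthClass;
}

function cleanAltText(raw: string, maxChars = 125): AltTextResult {
  // Remove opinion words, then collapse any doubled whitespace.
  const pattern = new RegExp(`\\b(${OPINION_WORDS.join("|")})\\b\\s*`, "gi");
  const stripped = raw.replace(pattern, "").replace(/\s+/g, " ").trim();

  if (stripped.length <= maxChars) {
    // Warn when the text uses more than 80 % of the budget (assumed threshold).
    const lengthClass: LengthClass = stripped.length > maxChars * 0.8 ? "long" : "ok";
    return { text: stripped, lengthClass };
  }

  // Cut at the last word boundary that fits, so no word is split in half.
  const cut = stripped.slice(0, maxChars + 1);
  const lastSpace = cut.lastIndexOf(" ");
  const text = (lastSpace > 0 ? cut.slice(0, lastSpace) : stripped.slice(0, maxChars)).trim();
  return { text, lengthClass: "clamped" };
}
```

Cutting at a word boundary rather than at the exact character limit keeps the result readable by a screen reader, at the cost of sometimes using slightly less than the full budget.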
AI Image Audit. Check an image against a target spec — Amazon main, Etsy listing, Instagram post, YouTube thumbnail, accessibility WCAG, print-ready 300 DPI, logo grade. Returns per-rule pass / warn / fail with plain-language reasoning and a one-click "Fix it" command id for the issues that have an automated fix path.
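A per-rule result with an optional fix command could be modelled roughly as follows. The type names, the two example rules, their thresholds, and the command id "resample-to-300dpi" are hypothetical and only illustrate the pass / warn / fail shape described above.

```typescript
type Verdict = "pass" | "warn" | "fail";

interface ImageFacts {
  width: number;
  height: number;
  dpi: number;
}

interface RuleResult {
  rule: string;
  verdict: Verdict;
  reason: string;
  fixCommandId?: string; // present only when an automated fix exists
}

type Rule = (img: ImageFacts) => RuleResult;

// Hypothetical rules for a "print-ready 300 DPI" style spec.
const printReadyRules: Rule[] = [
  (img) =>
    img.dpi >= 300
      ? { rule: "min-dpi", verdict: "pass", reason: `${img.dpi} DPI meets the 300 DPI target.` }
      : {
          rule: "min-dpi",
          verdict: "fail",
          reason: `${img.dpi} DPI is below the 300 DPI target.`,
          fixCommandId: "resample-to-300dpi",
        },
  (img) => {
    const shortEdge = Math.min(img.width, img.height);
    return shortEdge >= 1000
      ? { rule: "min-short-edge", verdict: "pass", reason: `Short edge is ${shortEdge} px.` }
      : { rule: "min-short-edge", verdict: "warn", reason: `Short edge is only ${shortEdge} px.` };
  },
];

function runAudit(img: ImageFacts, rules: Rule[]): RuleResult[] {
  return rules.map((rule) => rule(img));
}

// The overall verdict is the worst individual verdict.
function overallVerdict(results: RuleResult[]): Verdict {
  if (results.some((r) => r.verdict === "fail")) return "fail";
  if (results.some((r) => r.verdict === "warn")) return "warn";
  return "pass";
}
```

Keeping the fix command id optional means a rule can report a problem even when nothing can repair it automatically, which matches the "issues that have an automated fix path" wording above.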
AI Image Compare. Narrate the differences between two images. Combines a WebGL ΔE pixel-diff (fast, quantitative — "12.4 % of pixels changed, concentrated upper-right") with vision-model narration ("the background was replaced; the subject's edge detail is sharper in image B"). Returns a heatmap overlay you can apply to either input.
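The quantitative half of the comparison can be approximated on the CPU. The sketch below is not the WebGL implementation: it assumes two same-sized RGBA buffers, uses the CIE76 ΔE formula (the text does not say which ΔE variant the tool uses), and treats a difference above about 2.3 as a changed pixel. The function names and the quadrant labels are made up for illustration.

```typescript
type Quadrant = "upper-left" | "upper-right" | "lower-left" | "lower-right";

// Convert 8-bit sRGB to CIE L*a*b* (D65 white point).
function srgbToLab(r: number, g: number, b: number): [number, number, number] {
  const lin = (c: number): number => {
    const v = c / 255;
    return v <= 0.04045 ? v / 12.92 : Math.pow((v + 0.055) / 1.055, 2.4);
  };
  const lr = lin(r);
  const lg = lin(g);
  const lb = lin(b);
  const x = (0.4124564 * lr + 0.3575761 * lg + 0.1804375 * lb) / 0.95047;
  const y = 0.2126729 * lr + 0.7151522 * lg + 0.072175 * lb;
  const z = (0.0193339 * lr + 0.119192 * lg + 0.9503041 * lb) / 1.08883;
  const f = (t: number): number => (t > 0.008856 ? Math.cbrt(t) : 7.787 * t + 16 / 116);
  return [116 * f(y) - 16, 500 * (f(x) - f(y)), 200 * (f(y) - f(z))];
}

interface DiffStats {
  changedPercent: number;
  dominantQuadrant: Quadrant | null; // null when nothing changed
}

function diffStats(
  a: ArrayLike<number>,
  b: ArrayLike<number>,
  width: number,
  height: number,
  threshold = 2.3
): DiffStats {
  if (a.length !== b.length || a.length !== width * height * 4) {
    throw new Error("Buffers must be RGBA and match the given dimensions");
  }
  const counts: Record<Quadrant, number> = {
    "upper-left": 0, "upper-right": 0, "lower-left": 0, "lower-right": 0,
  };
  let changed = 0;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4; // alpha channel at i + 3 is ignored
      const [l1, a1, b1] = srgbToLab(a[i], a[i + 1], a[i + 2]);
      const [l2, a2, b2] = srgbToLab(b[i], b[i + 1], b[i + 2]);
      if (Math.hypot(l1 - l2, a1 - a2, b1 - b2) > threshold) {
        changed++;
        const q = `${y < height / 2 ? "upper" : "lower"}-${x < width / 2 ? "left" : "right"}` as Quadrant;
        counts[q]++;
      }
    }
  }
  let dominantQuadrant: Quadrant | null = null;
  let best = 0;
  for (const q of Object.keys(counts) as Quadrant[]) {
    if (counts[q] > best) {
      best = counts[q];
      dominantQuadrant = q;
    }
  }
  return { changedPercent: (changed / (width * height)) * 100, dominantQuadrant };
}
```

The two returned values correspond to the kind of summary quoted above: a percentage of changed pixels and the region where the changes are concentrated.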
Common workflows
Auto-name every export. Toggle "AI auto-name" in the export panel. Before download, AI Describe runs in alt-text style, slugifies the result, and uses it as the filename. Skip thinking up names for batches of 50.
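The slugify step might look something like the sketch below. The function name slugifyCaption, the 60-character cap, and the "image" fallback are assumptions for illustration; the export panel's real rules may differ.

```typescript
// Turn a caption into a safe, lowercase, hyphen-separated filename.
function slugifyCaption(caption: string, extension: string, maxLength = 60): string {
  const slug = caption
    .normalize("NFKD")                // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")      // any run of other characters becomes one hyphen
    .replace(/^-+|-+$/g, "")          // trim hyphens at both ends
    .slice(0, maxLength)
    .replace(/-+$/g, "");             // trim a hyphen left dangling by the cut

  // Fall back to a generic name if nothing usable survives.
  return `${slug || "image"}.${extension}`;
}
```

For a batch, a caller would still need to handle two images that produce the same slug, for example by appending a counter.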
Accessibility audit before publish. Run AI Image Audit with the accessibility spec; embed the alt-text result into export metadata. Helpful when uploading to blogs or social platforms that read embedded alt-text.
De-duplicate a photo library. Run AI Image Compare across pairs to spot near-duplicates. The pixel-diff is cheap (sub-second per pair); the narrated output flags interesting cases like "same scene, different exposure" that purely numeric similarity tools miss.
Ground the AI assistant. When you ask the AI panel "make this better", the assistant first runs AI Describe to know what "this" is, then uses that context to pick the right enhancement recipe. The Vision context toggle in the AI panel controls whether this grounding step runs.
Privacy notes
All six tools run entirely on your device. The model weights are fetched from Hugging Face on the first invocation and cached in IndexedDB; no image data is transmitted anywhere. You can verify this in the DevTools Network tab: once the weights are cached, no outbound requests fire during inference.