Technical deep-dive

How the AI prompt enhancer works, end to end.

About two seconds separate the keystroke you just pressed from the rewritten prompt on your screen. This page unpacks the eight steps in between, the router that picks the model, and the three dimensions that define quality.

The pipeline

Eight steps from keystroke to coach card.

Watch the active step walk the chain below: each card lights up as the mock request passes through it. A condensed sketch of the client-side logic follows the list.

  1. Watcher starts on supported chats

    When you open ChatGPT, Claude, Gemini or Perplexity, a content script finds the chat input and attaches a silent watcher. The watcher does nothing until you actually type — it just holds a reference.

  2. 1.3 s debounce

    Every keystroke resets a debounce timer. Only after 1,300 ms of silence does the extension consider running. This is calibrated to match how people actually write prompts: you type a thought, pause, re-read, then continue.

  3. 15-character gate

    If the prompt is under 15 characters, we skip the round-trip entirely. There is nothing useful a classifier can say about "hi" or "ok".

  4. Prompt de-duplication

    Before sending, we hash the prompt and compare against the last one we processed. Identical prompts never go to the network a second time.

  5. Abort previous in-flight request

    If you kept typing while an earlier classify was running, the extension aborts it in the background. A generation counter guarantees that only the latest response can update the UI — no flicker, no stale scores.

  6. LRU cache lookup

    The extension keeps a 120-entry LRU cache in the background with a 5-minute TTL. Cache hits resolve in well under 5 ms, which is why repeat prompts feel instant.

  7. Classify request to our API

    On a miss, the extension sends the prompt to our API. Identical in-flight requests are deduplicated there too, so multiple tabs asking the same question share the same underlying model call.

  8. Router picks the model, UI renders

Our routing layer picks the right tier and the right healthy provider, then returns a category, a 1–5 score for each dimension, and a rewritten prompt. The coach dot opens into a glass card and you can apply the rewrite in one click.
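
Taken together, steps 02 through 06 amount to a small piece of client logic. The sketch below is illustrative rather than the shipped source: classify, render, sha256 and the plain Map standing in for the 120-entry LRU are assumed helpers.

```typescript
// Condensed sketch of pipeline steps 02-06. Helper names are illustrative.
type Result = { category: string; scores: number[]; rewrite: string };

declare function classify(prompt: string, signal: AbortSignal): Promise<Result>; // step 07: API call (assumed)
declare function render(result: Result): void;                                   // opens the coach card (assumed)

const DEBOUNCE_MS = 1300;  // step 02
const MIN_CHARS = 15;      // step 03

let timer: number | undefined;
let generation = 0;                       // only the latest response may touch the UI
let controller: AbortController | null = null;
let lastHash = "";
const cache = new Map<string, Result>();  // stand-in for the 120-entry, 5-minute-TTL LRU

async function sha256(text: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, "0")).join("");
}

function onKeystroke(input: HTMLTextAreaElement): void {
  clearTimeout(timer);                                          // step 02: every keystroke resets the timer
  timer = window.setTimeout(() => run(input.value), DEBOUNCE_MS);
}

async function run(prompt: string): Promise<void> {
  if (prompt.trim().length < MIN_CHARS) return;                 // step 03: character gate

  const hash = await sha256(prompt);
  if (hash === lastHash) return;                                // step 04: skip the prompt we just processed
  lastHash = hash;

  controller?.abort();                                          // step 05: abort the stale in-flight request
  controller = new AbortController();
  const myGeneration = ++generation;

  let result = cache.get(hash);                                 // step 06: cache lookup
  if (!result) {
    try {
      result = await classify(prompt, controller.signal);       // step 07: round-trip only on a miss
    } catch (err) {
      if ((err as DOMException).name === "AbortError") return;  // superseded by a newer keystroke
      throw err;
    }
    cache.set(hash, result);
  }
  if (myGeneration === generation) render(result);              // stale responses never reach the UI
}
```

The generation check at the end is what guarantees no flicker: an aborted or slow response can still resolve, but it can never render.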

The router

Complexity becomes a tier. A tier becomes a model.

Three cheap signals are measured on every prompt before a model is even picked: length, code-likeness and line count. They combine into a complexity score that maps directly to a minimum model tier.

In the router demo, a prompt flows from Prompt to Router to Model: short prompts map to tier T1, medium to T2, and long, code-heavy prompts to T3. router.pick() then selects a provider; in the sample run all three (Groq active, Gemini OK, Hugging Face OK) are healthy, and a short T1 prompt lands on Groq in 412 ms.
Each signal adds +0 (low), +1 (mid) or +2 (high) to the complexity score:

  • Prompt length (chars): ≤ 80 → +0 · 81–400 → +1 · > 400 → +2
  • Code-like regex hits: none → +0 · 1–2 → +1 · 3+ → +2
  • Line count: 1–3 → +0 · 4–10 → +1 · > 10 → +2

Tier 1

Complexity ≤ 2

Short, plain prompts. Routed to the fastest small models on Groq. Typical round-trip 400–800 ms.

Tier 2

Complexity ≤ 3

Medium prompts with some structure. Routed to mid-sized Gemini or Groq models with stronger reasoning.

Tier 3

Complexity > 3

Long, code-heavy or multi-part prompts. Routed to the larger models on Gemini or Hugging Face.
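
As a sketch, the scoring transcribes directly from the signal table above; the code-likeness regex is not published, so the pattern below is an assumption.

```typescript
// Complexity scoring per the signal table. CODE_RE is an assumed stand-in
// for the real code-likeness regex.
const CODE_RE = /\b(function|class|def|import|return|const)\b|[{};<>=]/g;

function complexityScore(prompt: string): number {
  const chars = prompt.length;
  const codeHits = (prompt.match(CODE_RE) ?? []).length;
  const lines = prompt.split("\n").length;

  const lengthScore = chars > 400 ? 2 : chars > 80 ? 1 : 0;    // ≤ 80 / 81–400 / > 400
  const codeScore = codeHits >= 3 ? 2 : codeHits >= 1 ? 1 : 0; // none / 1–2 / 3+
  const lineScore = lines > 10 ? 2 : lines > 3 ? 1 : 0;        // 1–3 / 4–10 / > 10
  return lengthScore + codeScore + lineScore;                  // 0–6
}

function minimumTier(score: number): 1 | 2 | 3 {
  if (score <= 2) return 1; // Tier 1: short, plain prompts
  if (score <= 3) return 2; // Tier 2: medium prompts with some structure
  return 3;                 // Tier 3: long, code-heavy or multi-part
}
```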

Intent detection

Eight intent buckets. Each gets a different rewrite plan.

Before rewriting, the optimizer checks which task bucket your prompt belongs in. The bucket decides which clarifying questions are actually worth asking.

  • coding: signals include function, class, error, stack trace, "fix this". Plan: ask about language, runtime, input shape, expected output.
  • writing: signals include write, draft, rewrite, tone, audience. Plan: ask about length, tone, audience, format.
  • image: signals include image, illustration, photo, midjourney, flux. Plan: ask about aspect ratio, style, lighting, subject.
  • video: signals include video, clip, scene, runway, sora. Plan: ask about duration, camera, motion, setting.
  • audio: signals include song, music, voice, tts, jingle. Plan: ask about genre, mood, duration, voice type.
  • research: signals include compare, sources, cite, literature. Plan: ask about depth, citations, recency, format.
  • analysis: signals include analyse, evaluate, breakdown, metric. Plan: ask about data source, axes, decision context.
  • general: the fallback bucket. Plan: ask about audience and desired output format.
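
A minimal sketch of the bucket check, assuming simple first-match keyword tests over the signals above; the production heuristics are presumably richer, so treat this as illustrative only.

```typescript
// First-match keyword routing over the heuristic signals listed above.
const INTENT_SIGNALS: Array<[string, RegExp]> = [
  ["coding",   /\b(function|class|error|stack trace|fix this)\b/i],
  ["writing",  /\b(write|draft|rewrite|tone|audience)\b/i],
  ["image",    /\b(image|illustration|photo|midjourney|flux)\b/i],
  ["video",    /\b(video|clip|scene|runway|sora)\b/i],
  ["audio",    /\b(song|music|voice|tts|jingle)\b/i],
  ["research", /\b(compare|sources|cite|literature)\b/i],
  ["analysis", /\b(analy[sz]e|evaluate|breakdown|metric)\b/i],
];

function detectIntent(prompt: string): string {
  const hit = INTENT_SIGNALS.find(([, pattern]) => pattern.test(prompt));
  return hit ? hit[0] : "general"; // fallback bucket
}
```
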
Fallback chain

Health-aware, ordered, silent.

The router tries the primary model first. If it times out, returns an error, or is flagged unhealthy by the sliding-window tracker, it is skipped for a short cooldown and the next fallback in the same tier takes over. The caller sees one answer and never hears about the retries.

  • Sliding-window success + failure counters per model.
  • Cooldown on rate-limit (429) and repeated 5xx.
  • Per-model timeout enforced by the router, not the provider.
The live fallback demo shows the order returned by orderForTier(minTier): groq/llama-8b (T2, ~620 ms), then groq/mixtral-8x7b (T2, ~780 ms), then gemini/flash-1.5 (T2, ~940 ms), then hf/mistral-7b (T3, ~1420 ms), each idle until called. The caller sees one response; the retries stay silent.
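
A hedged sketch of that routing loop follows, assuming a callProvider helper, a 20-sample window and a 30-second cooldown, none of which are published values.

```typescript
// Health-aware fallback loop; helper names and constants are illustrative.
interface ModelEntry { id: string; tier: number; timeoutMs: number; }

declare function callProvider(id: string, prompt: string, signal: AbortSignal): Promise<string>; // assumed

class ModelHealth {
  private outcomes: boolean[] = []; // sliding window of recent successes and failures
  private cooldownUntil = 0;

  record(ok: boolean): void {
    this.outcomes.push(ok);
    if (this.outcomes.length > 20) this.outcomes.shift(); // window size is an assumption
  }
  isHealthy(now: number): boolean { return now >= this.cooldownUntil; }
  startCooldown(now: number, ms: number): void { this.cooldownUntil = now + ms; }
}

async function routeWithFallback(
  prompt: string,
  minTier: number,
  models: ModelEntry[],
  health: Map<string, ModelHealth>,
): Promise<string> {
  const order = models
    .filter(m => m.tier >= minTier && health.get(m.id)!.isHealthy(Date.now()))
    .sort((a, b) => a.timeoutMs - b.timeoutMs);            // fastest healthy model first

  for (const model of order) {
    const h = health.get(model.id)!;
    try {
      // Per-model timeout enforced here, by the router, not by the provider.
      const answer = await callProvider(model.id, prompt, AbortSignal.timeout(model.timeoutMs));
      h.record(true);
      return answer;                                       // the caller sees exactly one answer
    } catch (err) {
      h.record(false);
      if ((err as { status?: number }).status === 429) {
        h.startCooldown(Date.now(), 30_000);               // rate-limited: short cooldown
      }
      // any other failure: fall through silently to the next model
    }
  }
  throw new Error("every model in the fallback chain failed");
}
```
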
The three dimensions

What we actually score.

The overall 1–5 maturity score is a weighted blend of three sub-scores. You see the blend in the coach dot; you see the components in the details card. The live demo below cycles through a vague → refined → sharp rewrite so you can watch the per-dimension bars move in real time.

The demo snapshot below scores a vague prompt at 1.0/5 overall: weak on all three dimensions, with no topic, no audience, no output shape.

Specificity

Does the prompt name the exact language, framework, audience or constraint?

Context richness

Does it share enough background (inputs, prior code, edge cases) for the model to do its best work?

Output format

Have you told the model what shape of answer you want: code, JSON, table?
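
The blend itself is a plain weighted average of the three sub-scores; the production weights are not published, so the equal weights below are placeholders.

```typescript
// Overall maturity as a weighted blend of the three sub-scores (each 1-5).
// Equal weights are an assumption, not the production values.
function maturityScore(specificity: number, context: number, format: number): number {
  const weights = { specificity: 1 / 3, context: 1 / 3, format: 1 / 3 };
  const blended =
    specificity * weights.specificity +
    context * weights.context +
    format * weights.format;
  return Math.round(blended * 10) / 10; // shown to one decimal, e.g. 1.0/5
}
```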

Why it feels instant

Three caches, one router, zero wasted traffic.

Cache hit

< 5 ms

Identical prompts from the last five minutes never touch the network.

Tier-1 models

~ 400–800 ms

Short, plain prompts on the fastest Groq models. You barely see the spinner.

Tier-3 models

up to ~ 3 s

Reserved for long, code-heavy prompts where quality matters more than speed.

Latency numbers are typical ranges across recent production requests; exact values depend on provider and region.

Engineering FAQ

Deep-dive answers.

Why 1,300 ms for the debounce?

We tuned the debounce empirically against typing samples. At 500 ms the coach fires mid-sentence and cancels itself on the next keystroke, which wastes traffic. At 2,000 ms it feels laggy. 1,300 ms catches the natural "thinking pause" most people take once per 10–20 words without getting in the way.

How does the router pick a model?

The model registry assigns every model a tier, a capability vector and a latency profile. The router computes a complexity score from the prompt, maps it to a minimum tier, then picks the fastest healthy model that meets that tier. If the first choice fails, the next-fastest healthy model in the same tier takes over.

What happens when a provider goes down?

The health tracker keeps a sliding window of successes and failures for every model. Consecutive failures or rate-limits put that model on cooldown for a short window, and the router skips it for all requests during that window. The user never sees an error unless every model in every fallback tier fails, which, with three providers, is vanishingly rare.

Do you store my prompts?

No. Prompts are processed in-memory to score and rewrite the current request, then dropped. We do not retain prompts as training data, and we do not share prompt text with any third party beyond the inference provider handling that single request.

Can I self-host the classifier?

Our classifier is a standard Node service. Self-host documentation is on the roadmap; for now the extension points at our hosted API, but the request schema is simple and easy to replicate if you want to run your own (see the sketch below).
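
For reference, the request and response shapes implied by this page look roughly like the interfaces below; the field names are inferred, not taken from published API docs.

```typescript
// Inferred classify API shapes; field names are assumptions.
interface ClassifyRequest {
  prompt: string;
}

interface ClassifyResponse {
  category: string; // one of the eight intent buckets
  scores: {         // each dimension scored 1-5
    specificity: number;
    context: number;
    format: number;
  };
  rewrite: string;  // the suggested rewritten prompt
}
```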

See it fire on your own prompt.

The live demo runs the exact same classify API the extension calls. Paste a prompt and watch the pipeline in action.