Technical deep-dive

How the AI prompt enhancer works, end to end.

About two seconds separate the keystroke you just pressed from the rewritten prompt on your screen. This page unpacks the eight steps in between, the router that picks the model, and the three dimensions that define quality.

The pipeline

Eight steps from keystroke to coach card.

Watch the active step walk the chain below: each card lights up as the mock request passes through it. A condensed sketch of the client-side logic follows the list.

  1. Watcher starts on supported chats

    When you open ChatGPT, Claude, Gemini or Perplexity, a content script finds the chat input and attaches a silent watcher. The watcher does nothing until you actually type — it just holds a reference.

  2. 1.3 s debounce

    Every keystroke resets a debounce timer. Only after 1,300 ms of silence does the extension consider running. This is calibrated to match how people actually write prompts: you type a thought, pause, re-read, then continue.

  3. 15-character gate

    If the prompt is under 15 characters, we skip the round-trip entirely. There is nothing useful a classifier can say about "hi" or "ok".

  4. Prompt de-duplication

    Before sending, we hash the prompt and compare against the last one we processed. Identical prompts never go to the network a second time.

  5. Abort previous in-flight request

    If you kept typing while an earlier classify was running, the extension aborts it in the background. A generation counter guarantees that only the latest response can update the UI — no flicker, no stale scores.

  6. LRU cache lookup

    The extension keeps a 120-entry LRU cache in the background with a 5-minute TTL. Cache hits resolve in well under 5 ms, which is why repeat prompts feel instant.

  7. Classify request to our API

    On a miss, the extension sends the prompt to our API. Identical in-flight requests are deduplicated there too, so multiple tabs asking the same question share the same underlying model call.

  8. Router picks the model, UI renders

Our routing layer picks the right tier and the right healthy provider, then returns a category, a 1–5 score for each dimension, and a rewritten prompt. The coach dot opens into a glass card and you can apply the rewrite in one click.
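
Taken together, steps 02 through 06 amount to a small piece of client logic. The sketch below is illustrative rather than the shipped source: classify, render, sha256 and the plain Map standing in for the 120-entry LRU are assumed helpers.

```typescript
// Condensed sketch of pipeline steps 02-06. Helper names are illustrative.
type Result = { category: string; scores: number[]; rewrite: string };

declare function classify(prompt: string, signal: AbortSignal): Promise<Result>; // step 07: API call (assumed)
declare function render(result: Result): void;                                   // opens the coach card (assumed)

const DEBOUNCE_MS = 1300;  // step 02
const MIN_CHARS = 15;      // step 03

let timer: number | undefined;
let generation = 0;                       // only the latest response may touch the UI
let controller: AbortController | null = null;
let lastHash = "";
const cache = new Map<string, Result>();  // stand-in for the 120-entry, 5-minute-TTL LRU

async function sha256(text: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, "0")).join("");
}

function onKeystroke(input: HTMLTextAreaElement): void {
  clearTimeout(timer);                                          // step 02: every keystroke resets the timer
  timer = window.setTimeout(() => run(input.value), DEBOUNCE_MS);
}

async function run(prompt: string): Promise<void> {
  if (prompt.trim().length < MIN_CHARS) return;                 // step 03: character gate

  const hash = await sha256(prompt);
  if (hash === lastHash) return;                                // step 04: skip the prompt we just processed
  lastHash = hash;

  controller?.abort();                                          // step 05: abort the stale in-flight request
  controller = new AbortController();
  const myGeneration = ++generation;

  let result = cache.get(hash);                                 // step 06: cache lookup
  if (!result) {
    try {
      result = await classify(prompt, controller.signal);       // step 07: round-trip only on a miss
    } catch (err) {
      if ((err as DOMException).name === "AbortError") return;  // superseded by a newer keystroke
      throw err;
    }
    cache.set(hash, result);
  }
  if (myGeneration === generation) render(result);              // stale responses never reach the UI
}
```

The generation check at the end is what guarantees no flicker: an aborted or slow response can still resolve, but it can never render.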

The router

Complexity becomes a tier. A tier becomes a model.

Three cheap signals are measured on every prompt before a model is even picked: length, code-likeness and line count. They combine into a complexity score that maps directly to a minimum model tier.

In the router demo, a prompt flows from Prompt to Router to Model: short prompts map to tier T1, medium to T2, and long, code-heavy prompts to T3. router.pick() then selects a provider; in the sample run all three (Groq active, Gemini OK, Hugging Face OK) are healthy, and a short T1 prompt lands on Groq in 412 ms.
Each signal adds +0 (low), +1 (mid) or +2 (high) to the complexity score:

  • Prompt length (chars): ≤ 80 → +0 · 81–400 → +1 · > 400 → +2
  • Code-like regex hits: none → +0 · 1–2 → +1 · 3+ → +2
  • Line count: 1–3 → +0 · 4–10 → +1 · > 10 → +2

Tier 1

Complexity ≤ 2

Short, plain prompts. Routed to the fastest small models on Groq. Typical round-trip 400–800 ms.

Tier 2

Complexity ≤ 3

Medium prompts with some structure. Routed to mid-sized Gemini or Groq models with stronger reasoning.

Tier 3

Complexity > 3

Long, code-heavy or multi-part prompts. Routed to the larger models on Gemini or Hugging Face.
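
As a sketch, the scoring transcribes directly from the signal table above; the code-likeness regex is not published, so the pattern below is an assumption.

```typescript
// Complexity scoring per the signal table. CODE_RE is an assumed stand-in
// for the real code-likeness regex.
const CODE_RE = /\b(function|class|def|import|return|const)\b|[{};<>=]/g;

function complexityScore(prompt: string): number {
  const chars = prompt.length;
  const codeHits = (prompt.match(CODE_RE) ?? []).length;
  const lines = prompt.split("\n").length;

  const lengthScore = chars > 400 ? 2 : chars > 80 ? 1 : 0;    // ≤ 80 / 81–400 / > 400
  const codeScore = codeHits >= 3 ? 2 : codeHits >= 1 ? 1 : 0; // none / 1–2 / 3+
  const lineScore = lines > 10 ? 2 : lines > 3 ? 1 : 0;        // 1–3 / 4–10 / > 10
  return lengthScore + codeScore + lineScore;                  // 0–6
}

function minimumTier(score: number): 1 | 2 | 3 {
  if (score <= 2) return 1; // Tier 1: short, plain prompts
  if (score <= 3) return 2; // Tier 2: medium prompts with some structure
  return 3;                 // Tier 3: long, code-heavy or multi-part
}
```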

Intent detection

Eight intent buckets. Each gets a different rewrite plan.

Before rewriting, the optimizer checks which task bucket your prompt belongs in. The bucket decides which clarifying questions are actually worth asking.

  • coding: signals include function, class, error, stack trace, "fix this". Plan: ask about language, runtime, input shape, expected output.
  • writing: signals include write, draft, rewrite, tone, audience. Plan: ask about length, tone, audience, format.
  • image: signals include image, illustration, photo, midjourney, flux. Plan: ask about aspect ratio, style, lighting, subject.
  • video: signals include video, clip, scene, runway, sora. Plan: ask about duration, camera, motion, setting.
  • audio: signals include song, music, voice, tts, jingle. Plan: ask about genre, mood, duration, voice type.
  • research: signals include compare, sources, cite, literature. Plan: ask about depth, citations, recency, format.
  • analysis: signals include analyse, evaluate, breakdown, metric. Plan: ask about data source, axes, decision context.
  • general: the fallback bucket. Plan: ask about audience and desired output format.
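
A minimal sketch of the bucket check, assuming simple first-match keyword tests over the signals above; the production heuristics are presumably richer, so treat this as illustrative only.

```typescript
// First-match keyword routing over the heuristic signals listed above.
const INTENT_SIGNALS: Array<[string, RegExp]> = [
  ["coding",   /\b(function|class|error|stack trace|fix this)\b/i],
  ["writing",  /\b(write|draft|rewrite|tone|audience)\b/i],
  ["image",    /\b(image|illustration|photo|midjourney|flux)\b/i],
  ["video",    /\b(video|clip|scene|runway|sora)\b/i],
  ["audio",    /\b(song|music|voice|tts|jingle)\b/i],
  ["research", /\b(compare|sources|cite|literature)\b/i],
  ["analysis", /\b(analy[sz]e|evaluate|breakdown|metric)\b/i],
];

function detectIntent(prompt: string): string {
  const hit = INTENT_SIGNALS.find(([, pattern]) => pattern.test(prompt));
  return hit ? hit[0] : "general"; // fallback bucket
}
```
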
Fallback chain

Health-aware, ordered, silent.

The router tries the primary model first. If it times out, returns an error, or is flagged unhealthy by the sliding-window tracker, it is skipped for a short cooldown and the next fallback in the same tier takes over. The caller sees one answer and never hears about the retries.

  • Sliding-window success + failure counters per model.
  • Cooldown on rate-limit (429) and repeated 5xx.
  • Per-model timeout enforced by the router, not the provider.
The live fallback demo shows the order returned by orderForTier(minTier): groq/llama-8b (T2, ~620 ms), then groq/mixtral-8x7b (T2, ~780 ms), then gemini/flash-1.5 (T2, ~940 ms), then hf/mistral-7b (T3, ~1420 ms), each idle until called. The caller sees one response; the retries stay silent.
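
A hedged sketch of that routing loop follows, assuming a callProvider helper, a 20-sample window and a 30-second cooldown, none of which are published values.

```typescript
// Health-aware fallback loop; helper names and constants are illustrative.
interface ModelEntry { id: string; tier: number; timeoutMs: number; }

declare function callProvider(id: string, prompt: string, signal: AbortSignal): Promise<string>; // assumed

class ModelHealth {
  private outcomes: boolean[] = []; // sliding window of recent successes and failures
  private cooldownUntil = 0;

  record(ok: boolean): void {
    this.outcomes.push(ok);
    if (this.outcomes.length > 20) this.outcomes.shift(); // window size is an assumption
  }
  isHealthy(now: number): boolean { return now >= this.cooldownUntil; }
  startCooldown(now: number, ms: number): void { this.cooldownUntil = now + ms; }
}

async function routeWithFallback(
  prompt: string,
  minTier: number,
  models: ModelEntry[],
  health: Map<string, ModelHealth>,
): Promise<string> {
  const order = models
    .filter(m => m.tier >= minTier && health.get(m.id)!.isHealthy(Date.now()))
    .sort((a, b) => a.timeoutMs - b.timeoutMs);            // fastest healthy model first

  for (const model of order) {
    const h = health.get(model.id)!;
    try {
      // Per-model timeout enforced here, by the router, not by the provider.
      const answer = await callProvider(model.id, prompt, AbortSignal.timeout(model.timeoutMs));
      h.record(true);
      return answer;                                       // the caller sees exactly one answer
    } catch (err) {
      h.record(false);
      if ((err as { status?: number }).status === 429) {
        h.startCooldown(Date.now(), 30_000);               // rate-limited: short cooldown
      }
      // any other failure: fall through silently to the next model
    }
  }
  throw new Error("every model in the fallback chain failed");
}
```
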
The three dimensions

What we actually score.

The overall 1–5 maturity score is a weighted blend of three sub-scores. You see the blend in the coach dot; you see the components in the details card. The live demo below cycles through a vague → refined → sharp rewrite so you can watch the per-dimension bars move in real time.

The demo snapshot below scores a vague prompt at 1.0/5 overall: weak on all three dimensions, with no topic, no audience, no output shape.

Specificity

Does the prompt name the exact language, framework, audience or constraint?

Context richness

Does it share enough background (inputs, prior code, edge cases) for the model to do its best work?

Output format

Have you told the model what shape of answer you want: code, JSON, table?
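
The blend itself is a plain weighted average of the three sub-scores; the production weights are not published, so the equal weights below are placeholders.

```typescript
// Overall maturity as a weighted blend of the three sub-scores (each 1-5).
// Equal weights are an assumption, not the production values.
function maturityScore(specificity: number, context: number, format: number): number {
  const weights = { specificity: 1 / 3, context: 1 / 3, format: 1 / 3 };
  const blended =
    specificity * weights.specificity +
    context * weights.context +
    format * weights.format;
  return Math.round(blended * 10) / 10; // shown to one decimal, e.g. 1.0/5
}
```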

Why it feels instant

Three caches, one router, zero wasted traffic.

Cache hit

< 5 ms

Identical prompts from the last five minutes never touch the network.

Tier-1 models

~ 400–800 ms

Short, plain prompts on the fastest Groq models. You barely see the spinner.

Tier-3 models

up to ~ 3 s

Reserved for long, code-heavy prompts where quality matters more than speed.

Latency numbers are typical ranges across recent production requests; exact values depend on provider and region.

Engineering FAQ

Deep-dive answers.

Why 1,300 ms for the debounce?

We tuned the debounce empirically against typing samples. At 500 ms the coach fires mid-sentence and cancels itself on the next keystroke, which wastes traffic. At 2,000 ms it feels laggy. 1,300 ms catches the natural "thinking pause" most people take once per 10–20 words without getting in the way.

How does the router pick a model?

The model registry assigns every model a tier, a capability vector and a latency profile. The router computes a complexity score from the prompt, maps it to a minimum tier, then picks the fastest healthy model that meets that tier. If the first choice fails, the next-fastest healthy model in the same tier takes over.

What happens when a provider goes down?

The health tracker keeps a sliding window of successes and failures for every model. Consecutive failures or rate-limits put that model on cooldown for a short window, and the router skips it for all requests during that window. The user never sees an error unless every model in every fallback tier fails, which, with three providers, is vanishingly rare.

Do you store my prompts?

No. Prompts are processed in-memory to score and rewrite the current request, then dropped. We do not retain prompts as training data, and we do not share prompt text with any third party beyond the inference provider handling that single request.

Can I self-host the classifier?

Our classifier is a standard Node service. Self-host documentation is on the roadmap; for now the extension points at our hosted API, but the request schema is simple and easy to replicate if you want to run your own (see the sketch below).
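
For reference, the request and response shapes implied by this page look roughly like the interfaces below; the field names are inferred, not taken from published API docs.

```typescript
// Inferred classify API shapes; field names are assumptions.
interface ClassifyRequest {
  prompt: string;
}

interface ClassifyResponse {
  category: string; // one of the eight intent buckets
  scores: {         // each dimension scored 1-5
    specificity: number;
    context: number;
    format: number;
  };
  rewrite: string;  // the suggested rewritten prompt
}
```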

See it fire on your own prompt.

The live demo runs the exact same classify API the extension calls. Paste a prompt and watch the pipeline in action.