
Inline completion / suggestion

Inline completion is the perception case where the AI runs while the user is still typing. Cursor, Copilot, v0, and every other typing-time inference product live in this scenario. The trick is not making the model fast — model latency is what it is. The trick is making the request system fire only when the user has actually stopped typing, cancel itself the moment the user starts again, and never display a suggestion that does not match the text the user can see.

This scenario sits across the 0–100 ms and 100 ms – 1 s bands. The keystroke echo itself must clear the ~100 ms perceptual frame of Card, Moran & Newell 1983 — the input has to feel responsive even while a model is working. The suggestion display lives in the 100 ms – 1 s band Miller 1968 describes for "immediate response." Doherty 1982's ~400 ms productivity break gives the rough budget for how long the suggestion can take before users feel it as friction. Arapakis et al. 2014 finds that below 500 ms most users do not consciously notice the latency at all.

Inline completion

User types in a notes editor; the AI suggests an ending for the current line as ghost text. Naive: fires on every keystroke, no abort. Tuned: 200 ms debounce, abort on next keystroke, Tab to accept.

[Interactive demo, 100 ms – 1 s band: the same editor with tuning Off (naive) and On (tuned); Tab accepts the suggestion on either side.]

What is happening in the demo

Both sides are a notes editor backed by the same toy "completion model" — a dictionary that recognises a leading verb and suggests a plausible ending. Try typing buy, fix, email, review, schedule, call, ship, or draft. Each one triggers a 600 ms p50 simulated request (gamma-distributed). Press Tab on either side to accept the ghost-text suggestion.

The naive side fires a fresh request on every keystroke and never cancels. Type "email Sa" and the demo dispatches eight requests — one per keystroke. Whichever finishes last wins the suggestion slot. With a fat-tailed latency distribution, the response from "ema" can land after the response from "email Sa" — and you see a suggestion that reflects what you typed several characters ago. The ghost text drifts from the cursor, sometimes in the wrong direction.
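In code, the naive side boils down to an input handler that fires a fetch per keystroke and applies whatever resolves, whenever it resolves. A minimal sketch, assuming a hypothetical /complete endpoint and placeholder element IDs rather than the demo's real wiring:

```ts
// Naive: one request per keystroke, nothing debounced, nothing cancelled.
const editor = document.querySelector<HTMLTextAreaElement>("#editor")!;
const ghost = document.querySelector<HTMLElement>("#ghost")!;

async function fetchCompletion(prefix: string): Promise<string> {
  // Hypothetical endpoint standing in for the demo's simulated model call.
  const res = await fetch(`/complete?prefix=${encodeURIComponent(prefix)}`);
  return res.text();
}

editor.addEventListener("input", async () => {
  // Every keystroke starts a request; whichever response resolves *last*
  // overwrites the ghost text, even if it was computed from an older prefix.
  ghost.textContent = await fetchCompletion(editor.value);
});
```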

The tuned side fires only after the user has been still for 200 ms. While the user is still typing, the suggestion clears — no chance of a stale suggestion lingering. When the request does fire, every subsequent keystroke aborts it via a cancellation token, so only the response that matches the final state of the input ever lands. A subtle "Thinking…" cue appears in the bottom-right while a request is in flight.
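A sketch of the tuned loop under the same assumptions (hypothetical endpoint, placeholder #editor, #ghost, and #thinking elements); the demo's actual internals may differ:

```ts
// Tuned: 200 ms of quiet before a request fires; every keystroke clears the
// ghost text and aborts whatever is still in flight.
const editor = document.querySelector<HTMLTextAreaElement>("#editor")!;
const ghost = document.querySelector<HTMLElement>("#ghost")!;
const thinking = document.querySelector<HTMLElement>("#thinking")!;

const DEBOUNCE_MS = 200;
let debounceTimer: number | undefined;
let inflight: AbortController | null = null;

async function requestCompletion(prefix: string): Promise<void> {
  inflight = new AbortController();
  thinking.hidden = false; // subtle "Thinking…" cue while the request runs
  try {
    const res = await fetch(`/complete?prefix=${encodeURIComponent(prefix)}`, {
      signal: inflight.signal,
    });
    ghost.textContent = await res.text(); // matches the settled input
  } catch (err) {
    if ((err as Error).name !== "AbortError") throw err; // aborts are expected
  } finally {
    thinking.hidden = true;
  }
}

function scheduleCompletion(prefix: string): void {
  window.clearTimeout(debounceTimer);
  debounceTimer = window.setTimeout(() => requestCompletion(prefix), DEBOUNCE_MS);
}

editor.addEventListener("input", () => {
  ghost.textContent = "";           // never leave a stale suggestion on screen
  inflight?.abort();                // cancel the in-flight request, if any
  scheduleCompletion(editor.value);
});
```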

The debounce window is the perception lever. At 0 ms (naive) every keystroke pays the round-trip cost; at 500 ms the suggestion always feels late. 200 ms is the sweet spot — the Card-Moran-Newell perceptual frame is past, the user has stopped typing, but the cognitive cost of noticing the wait is still below threshold. Cursor uses ~150 ms; Copilot uses adaptive debouncing keyed off the language and recent edit pattern.

Tab acceptance is optimistic. The suggestion is committed to the text immediately on Tab, and a fresh request fires for the now-longer input. The user does not wait for the model to confirm — the model's job is to suggest, the user's job is to accept.
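The accept path, building on the tuned sketch above; it reads the displayed ghost text directly and reuses the same debounced request path:

```ts
// Optimistic accept: Tab commits the ghost text immediately, then a fresh
// request is scheduled from the now-longer prefix.
editor.addEventListener("keydown", (event: KeyboardEvent) => {
  const suggestion = ghost.textContent ?? "";
  if (event.key !== "Tab" || suggestion === "") return;
  event.preventDefault();            // keep focus in the editor
  editor.value += suggestion;        // commit without waiting for the model
  ghost.textContent = "";
  scheduleCompletion(editor.value);  // same debounce + abort path as above
});
```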

What to tune

  1. Keystroke echo — input renders the character within ~50 ms; nothing about the suggestion yet.
  2. Debounce window — 150–200 ms quiet before a request fires. Shorter feels frantic; longer feels late.
  3. In-flight cue — past the perceptual frame, a subtle "Thinking…" appears. Below it, the user has not noticed the wait (sketched after this list).
  4. Stale-cancel — every keystroke aborts the in-flight request via cancellation token (stale-while-revalidate shaped for typing-time work).
  5. Accept — Tab commits the suggestion optimistically; the next request fires from the new prefix without waiting for the model to confirm.
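Point 3 deserves one more sketch: the cue is armed when the request starts but only becomes visible if the work is still outstanding after the perceptual frame, so fast responses never flash an indicator. The delay constant and element hook are assumptions:

```ts
// Arm the cue with the request, but only show it if the work is still in
// flight after the ~100 ms perceptual frame; fast responses never flash it.
const CUE_DELAY_MS = 100;
const thinking = document.querySelector<HTMLElement>("#thinking")!;

async function withThinkingCue<T>(work: Promise<T>): Promise<T> {
  const timer = window.setTimeout(() => { thinking.hidden = false; }, CUE_DELAY_MS);
  try {
    return await work;
  } finally {
    window.clearTimeout(timer);
    thinking.hidden = true;
  }
}
```

In the tuned sketch, wrapping the fetch call in withThinkingCue would replace its unconditional cue toggle.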

When perceived performance hurts you here

The biggest failure mode is the one the naive demo shows — race-condition stale suggestions. The user types four more characters; a response from before those characters arrives; the suggestion now reflects an older state than the cursor. Users notice this fast and learn to distrust the suggestions entirely. Always cancel in-flight requests on the next keystroke, and always check the request token against the current request before applying the result.
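Cancellation covers most of this; a monotonic request token is the cheap second check for any path where a stale response can still resolve. A sketch reusing the hypothetical fetchCompletion and #ghost element from earlier:

```ts
// Monotonic request token: a response is applied only if no newer request
// was issued after it. Cheap insurance on top of AbortController.
let latestToken = 0;

async function completeWithGuard(prefix: string): Promise<void> {
  const token = ++latestToken;         // tag this request
  const suggestion = await fetchCompletion(prefix);
  if (token !== latestToken) return;   // superseded by a later keystroke
  ghost.textContent = suggestion;      // still the newest request: safe to show
}
```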

The second failure mode is an over-long debounce window. At 500 ms the suggestion feels slow even on an instant network. At 1 s the user has already typed past the suggestion point. The debounce window has to be tuned per context — short prefixes (single-line completions) want shorter windows; long prefixes (block completions) can absorb longer ones.

The third failure mode is suggestion bias. If the model heavily prefers one completion family, users will accept it on autopilot and the suggestion becomes a stochastic auto-typer rather than a tool. Cursor calls this the "tab-spamming" problem and tunes its acceptance heuristics to avoid it. The perception layer cannot fix a bad suggestion-quality signal.

Accessibility

  1. aria-live="polite" announcement when a suggestion is available is helpful for screen readers, but throttle it heavily — every keystroke is too noisy (see the sketch after this list).
  2. aria-keyshortcuts="Tab" on the editor (or a paired help text) tells assistive tech that Tab is the accept gesture.
  3. Suggestion contrast — ghost text at 30–40 % opacity is visually distinguishable from typed text but still readable. Below 25 % opacity, low-vision users lose the suggestion entirely.
  4. Keyboard parity — Tab to accept, Esc to dismiss, Right-arrow to accept word-by-word (Cursor / Copilot convention). All without mouse.
  5. prefers-reduced-motion — the "Thinking…" cue is decorative; suppress its fade. The suggestion text appearing is content, not motion.
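A sketch of point 1: announce through a polite live region, throttled to at most one announcement every couple of seconds. The element ID and interval are placeholders, not values taken from the demo:

```ts
// Throttled, polite announcement so screen-reader users hear that a
// suggestion exists without per-keystroke chatter.
const ANNOUNCE_INTERVAL_MS = 2000;
const liveRegion = document.querySelector<HTMLElement>("#sr-status")!; // aria-live="polite"
let lastAnnouncedAt = 0;

function announceSuggestion(suggestion: string): void {
  const now = Date.now();
  if (now - lastAnnouncedAt < ANNOUNCE_INTERVAL_MS) return; // throttle heavily
  lastAnnouncedAt = now;
  liveRegion.textContent = `Suggestion available: ${suggestion}. Press Tab to accept.`;
}
```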

References


  1. Card, Moran & Newell 1983

    Card, S. K., Moran, T. P., & Newell, A. (1983). The Psychology of Human-Computer Interaction. Lawrence Erlbaum. The ~100 ms perceptual frame typing-time inference must respect — anything heavier blocks input.

  2. Miller 1968

    Miller, R. B. (1968). Response time in man-computer conversational transactions. Proceedings of the AFIPS Fall Joint Computer Conference, 33(I), 267–277. The 0.1 s keystroke-echo limit and the 0.1 – 1 s suggestion-display tier inline completion sits across.

  3. Doherty 1982

    Doherty, W. J., & Thadani, A. J. (1982). The Economic Value of Rapid Response Time. IBM Technical Report GE20-0752-0. Productivity drops sharply past ~400 ms — relevant for the suggestion latency budget once the user is waiting on it.

  4. Arapakis et al. 2014

    Arapakis, I., Bai, X., & Cambazoglu, B. B. (2014). Impact of response latency on user behavior in web search. Proceedings of SIGIR '14, 103–112. Below ~500 ms users rarely consciously notice latency; above ~1,000 ms detection is high. The suggestion budget sits inside this window.