
Token Prediction

The KN-86 nEmacs editor and REPL share a token-prediction engine that ranks the likely next tokens at the cursor and surfaces the top ~8 in the predictive palette. This page is the implementer-facing reference for the algorithm, ranking weights, performance budget, memory footprint, and how cartridges influence predictions. Its audience is runtime engineers building the ranker and cartridge authors writing :vocabulary blocks.


Algorithm: static, weighted, grammar-aware


No machine learning. Per ADR-0009, the v1 ranker is a static weighted model — fast, deterministic, debuggable. (A learned model is post-launch v1.1 territory; see “Post-launch” below.)

Despite the wording of the original stub, the model is neither n-gram nor trie-based. It is a per-call ranking pass:

  1. Enumerate all candidate tokens (builtins + cartridge vocabulary + user-defined identifiers + recent session history).
  2. Apply a legal-form filter (hard constraint based on cursor position).
  3. Score each surviving candidate.
  4. Sort by score (descending), break ties alphabetically.
  5. Return the top 8.

The full pseudocode is in ADR-0009 §“Algorithm Implementation Pseudocode.”
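The five steps can be sketched as a single pass. A minimal Python illustration follows; `legal` and `score` are hypothetical callbacks standing in for the legal-form filter and the weighted scorer, not the runtime's actual interfaces:

```python
def rank_candidates(candidates, position, legal, score, top_k=8):
    """One palette recompute: filter, score, sort, truncate.

    legal(token, position) is the hard legal-form filter (step 2);
    score(token) is the static weighted score (step 3).
    """
    survivors = [t for t in candidates if legal(t, position)]
    # Steps 4-5: score descending, ties broken alphabetically.
    survivors.sort(key=lambda t: (-score(t), t))
    return survivors[:top_k]

# Toy demo: only callables are legal at function position.
CALLABLES = {"if", "let", "map", "filter"}
toy_scores = {"if": 4, "let": 4, "map": 3, "filter": 3, "nil": 2}

palette = rank_candidates(
    sorted(toy_scores),
    "function",
    lambda t, pos: t in CALLABLES if pos == "function" else True,
    toy_scores.get,
)
# "if" and "let" tie at 4; the alphabetical tiebreak puts "if" first.
```

Because the sort key is `(-score, token)`, the output order is fully deterministic for a given candidate set, which is what makes the ranker debuggable.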


| Source | What's in it | Lifetime |
| --- | --- | --- |
| Fe builtins | The 26 primitives in primnames[] (see ../fe-lisp/builtins.md) | Permanent |
| NoshAPI primitives | The Tier 1 / Tier 2 names from ../nosh-api/primitives-by-category.md | Permanent |
| Cartridge vocabulary | Domain terms contributed via emacs-extend-vocabulary (ADR-0016) | Cart-load to cart-unload |
| Cartridge grammar productions | Productions added via emacs-extend-grammar (ADR-0016); these influence the legal-form filter, not the score directly | Cart-load to cart-unload |
| Buffer-local identifiers | Names defined in the current buffer via (let ...) or (= ...) | Buffer lifetime |
| Session history | The last ~20 tokens the player typed | Session (REPL or nEmacs instance) |

Cartridge vocabulary is the highest-leverage contribution — domain terms (e.g., node, ice, threat-level from ICE Breaker; debit, account from Black Ledger) get a +5 boost in scoring (see “Ranking weights” below).


The filter eliminates syntactically illegal tokens before scoring. Position types:

| Position | Legal | Illegal |
| --- | --- | --- |
| Function position (first element of list) | Callables: function names, macros, lambdas, builtins | Pure data (literals other than quoted forms), binding keywords |
| Argument position (non-first) | Identifiers, literals, function calls, quoted forms | Definition forms (defn-style macros) |
| Binding position (inside let / fn params) | Identifiers only | Functions, literals, forms |
| Root (empty buffer or top-level) | Top-level forms (defn, defstruct, defdomain, defmission) | Bare data, lambdas |

Cartridge-contributed grammar (emacs-extend-grammar per ADR-0016) extends the legal sets — e.g., a cart that introduces (scan-result) as a callable form adds it to the function-position legal list while that cart is loaded.
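A sketch of how cart-contributed productions could extend the per-position legal sets. The data layout here is illustrative only; the real production syntax is still being defined (see "Cartridge influence" below in this page's source ADRs):

```python
# Base legal sets per position type (tiny illustrative subset).
BASE_LEGAL = {
    "function": {"if", "let", "map", "filter", "lambda"},
    "argument": {"nil", "x", "node"},
}

def make_legal_filter(cart_grammar=None):
    """cart_grammar maps a position type to extra tokens made legal
    there while the cart is loaded (e.g., scan-result at function
    position)."""
    legal = {pos: set(toks) for pos, toks in BASE_LEGAL.items()}
    for pos, toks in (cart_grammar or {}).items():
        legal.setdefault(pos, set()).update(toks)
    return lambda token, pos: token in legal.get(pos, set())

base_filter = make_legal_filter()
cart_filter = make_legal_filter({"function": {"scan-result"}})
# scan-result is callable only while the cart's grammar is loaded.
```

Building a fresh filter per cart-load/unload keeps the base sets immutable, so unloading a cart cannot corrupt the permanent vocabulary.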


For each legal token:

score = 0
if token in cartridge_vocabulary: score += 5           # DOMAIN_BOOST
if token in local_bindings: score += 3                 # LOCAL_BOOST
if token in session_history[-20:]:
    age = len(session_history) - rfind_position(token)
    score += max(0, 10 - age)                          # recency boost, decays 10 -> 0
if token in popularity_baseline:
    score += popularity_baseline[token]                # popularity baseline, 0-4
if semantic_fit(token, context_stack): score += 1      # SEMANTIC_BONUS

Constants:

| Constant | Value | Rationale |
| --- | --- | --- |
| DOMAIN_BOOST | +5 | Cartridge vocabulary should dominate when relevant. |
| LOCAL_BOOST | +3 | Locally visible bindings are proximate. |
| RECENCY_WINDOW | 20 tokens | Session window for the recency boost. |
| RECENCY_MAX | +10 (decays linearly) | Just-typed tokens get the full boost; the oldest in the window get +0. |
| SEMANTIC_BONUS | +1 | Small tiebreaker for context fit (e.g., = in mission node-comparison contexts). |
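A runnable version of the formula using these constants. The rfind_position helper from the pseudocode is replaced here by an explicit last-occurrence scan, and all names are illustrative rather than the runtime's:

```python
DOMAIN_BOOST = 5
LOCAL_BOOST = 3
RECENCY_WINDOW = 20
RECENCY_MAX = 10
SEMANTIC_BONUS = 1

def score_token(token, cart_vocab, local_bindings, history, baseline,
                fits_context=lambda t: False):
    s = 0
    if token in cart_vocab:
        s += DOMAIN_BOOST
    if token in local_bindings:
        s += LOCAL_BOOST
    window = history[-RECENCY_WINDOW:]
    if token in window:
        # Age 0 = just typed (full +10); decays to +0 at age 10+.
        age = len(window) - 1 - max(i for i, t in enumerate(window) if t == token)
        s += max(0, RECENCY_MAX - age)
    s += baseline.get(token, 0)              # popularity baseline, 0-4
    if fits_context(token):
        s += SEMANTIC_BONUS
    return s

# A cart term the player just typed: +5 domain boost, +10 full recency.
history = ["cons", "car", "node"]
```

One subtlety worth pinning down in implementation: whether a just-typed token has age 0 (full +10, as assumed here) or age 1; the pseudocode's len - rfind_position form yields the latter.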

Popularity baseline (excerpt — full table in ADR-0009 §“Ranking Formula”):

| Token | Baseline | Why |
| --- | --- | --- |
| if, let | +4 | Most common control flow / binding |
| lambda, map | +3 | Common in callbacks / higher-order code |
| defn, nil, +, >, = | +2 | Common in mission code |
| cdr, car | +2 | List navigation |
| cons, quote | +1 | Less frequent |
| User identifiers | +0 | Ranked by recency / local boost only |

Tiebreaker: alphabetical order (deterministic, readable).

The semantic-fit bonus is a small per-context boost consulted during ranking. Examples from ADR-0009 §“Semantic Fit Bonus”:

| Context | Token | Bonus | Reason |
| --- | --- | --- | --- |
| Mission script, node comparison | = | +2 | Mission pattern |
| ICE Breaker, list filtering | filter | +3 | Domain idiom |
| Black Ledger, financial code | debit / credit | +3 | Domain vocabulary |
| Crypto context | cipher-grade | +4 | Highly semantic |

The implementation is a context-keyed boost table; specific bonuses will be finalized as carts are authored.
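One plausible shape for that table: keyed by (context, token) and consulted innermost-context-first. The context names and values mirror the examples above but are otherwise illustrative:

```python
# Hypothetical context-keyed boost table; entries mirror the
# ADR-0009 examples and will grow as carts are authored.
SEMANTIC_FIT = {
    ("mission:node-comparison", "="): 2,
    ("ice-breaker:list-filtering", "filter"): 3,
    ("black-ledger:financial", "debit"): 3,
    ("black-ledger:financial", "credit"): 3,
    ("crypto", "cipher-grade"): 4,
}

def semantic_bonus(token, context_stack):
    """Walk the context stack innermost-first; first hit wins."""
    for ctx in reversed(context_stack):
        bonus = SEMANTIC_FIT.get((ctx, token))
        if bonus is not None:
            return bonus
    return 0
```

Innermost-first lookup means a cart-specific idiom can override a generic mission-level bonus without touching the outer entries.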


Per ADR-0016 §4, the predictive palette renders on CIPHER-LINE Rows 2–3 while nEmacs or the REPL is active:

  • Row 2 (Palette Row A): 8 candidates numbered [1]..[8].
  • Row 3 (Palette Row B): continuation indicator (shown when more candidates are available), selection feedback, or the alt slots from a cart’s multi-level binding.

The main grid stays free for the buffer view (recovered from ADR-0008’s earlier rows 22–25 allocation).

  • Accept: numpad keys 1–8 insert the selected token. The cursor advances to the inserted node and the palette recomputes.
  • Scroll: beyond the top 8, the player can scroll the candidate set via context-bound keys (per ADR-0008 mock 5). Implementation lands during the editor foundation task.
  • Reject / dismiss: BACK closes the palette without inserting. LAMBDA enters literal-entry mode for an identifier or number not in the palette.

Per ADR-0016 §9, palette keys can carry :tap / :double-tap / :long-press bindings. When a slot has alt bindings, Row 24 (firmware action bar) renders them with superscript markers (² for double-tap, plus a corresponding marker for long-press). The palette adopts the same convention when context provides multi-level alternates.


Per ADR-0009 §“Latency”:

  • ~1–2 ms per palette render on the original RP2350 / Pico 2 target. The Pi Zero 2 W (1 GHz Cortex-A53; see the ADR-0009 hardware-retarget note) meets this budget easily; the model is target-independent.
  • The palette only updates on cursor movement, CONS press, or token selection (event-driven, not per-frame).
  • Filtering + scoring is O(n) where n is the candidate count (~300 unique tokens in a typical session: builtins + user + domain).
  • Selecting the top 8 is O(n log 8) ≈ O(n).

The static model satisfies the latency budget with comfortable margin. There is no caching layer needed.
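The O(n log 8) bound corresponds to a bounded-heap selection. A sketch in Python (illustrative; a runtime may simply sort, which is also well within budget at n ≈ 300):

```python
import heapq

def top8(scored):
    """scored: mapping of token -> score. O(n log 8): the
    (-score, token) key gives descending score with
    alphabetical tiebreak, matching the ranker's order."""
    return [t for _, t in heapq.nsmallest(8, ((-s, t) for t, s in scored.items()))]

scores = {"node": 5, "ice": 5, "if": 4, "let": 4, "map": 3,
          "nil": 2, "cdr": 2, "car": 2, "cons": 1, "quote": 1}
```

Because the palette only recomputes on cursor movement, CONS press, or token selection, even a naive full sort would amortize to negligible cost per input event.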


The ranker holds:

  • Vocabulary tables. Builtins + NoshAPI: fixed strings, ~2 KB total. Cartridge contributions: variable, typically 50–200 terms × ~16 bytes ≈ 0.8–3.2 KB per cart.
  • Session history: 20 tokens × ~16 bytes = ~320 bytes.
  • Local bindings: scoped per buffer; rarely exceeds 50 names = ~800 bytes.
  • Per-call working set: scoring array for ~300 candidates ≈ ~1.2 KB (token pointer + float score), released after ranking returns.
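The session history's fixed footprint follows from a bounded ring buffer; a minimal sketch, assuming Python semantics for illustration:

```python
from collections import deque

RECENCY_WINDOW = 20

# maxlen bounds the structure: the oldest token is evicted on
# append, keeping the history at a constant 20 slots
# (~320 bytes of token references in the target's terms).
session_history = deque(maxlen=RECENCY_WINDOW)

for tok in ["let", "x", "map"] + ["nil"] * 25:
    session_history.append(tok)
# Only the last 20 tokens remain; earlier ones fell off.
```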

Per ADR-0016 §“Known Unknowns” #5, the cart grammar arena is shared with the Cipher grammar arena (8 KB) or allocated separately — confirm during the foundation task. Probably separate for isolation.


Two FFI primitives (per ADR-0016 §7, see ../nosh-api/primitives-by-category.md):

emacs-extend-vocabulary: the cart contributes domain terms. Each term receives the +5 vocabulary boost in ranking. Typical use:

(emacs-extend-vocabulary
  (list "node" "ice" "threat-level" "probe" "extract" "breach"
        "lockdown" "evasion" "sysop" "cipher-grade" "exfil"))

Terms are case-sensitive, stored as plain Lisp strings, and matched against the candidate token’s source-form representation. They live in the cart’s grammar arena from cart-load to cart-unload.
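Runtime-side, that contract could be stored as a per-cart term set with exact, case-sensitive matching. The class and method names below are illustrative, not the runtime's API:

```python
class CartVocabulary:
    """Cart-scoped vocabulary: terms live from cart-load to
    cart-unload, matching the lifetime table above."""

    def __init__(self):
        self._terms = {}  # cart id -> frozenset of terms

    def load(self, cart_id, terms):
        self._terms[cart_id] = frozenset(terms)

    def unload(self, cart_id):
        self._terms.pop(cart_id, None)

    def __contains__(self, token):
        # Exact, case-sensitive match: "Node" does not match "node".
        return any(token in terms for terms in self._terms.values())

vocab = CartVocabulary()
vocab.load("ice-breaker", ["node", "ice", "threat-level"])
```

Keying by cart id makes unload an O(1) bulk drop, so a cart swap can never leave stale domain terms boosting the palette.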

emacs-extend-grammar: the cart contributes grammar productions that influence the legal-form filter, not the score. A cart introducing (scan-result) as a callable form makes it appear at function position in the palette, where it would not otherwise be legal:

(emacs-extend-grammar
  '(:legal-at-function-position scan-result)
  '(:legal-at-argument-position node))

ADR-0016 §7 commits to the FFI primitive but defers the production syntax to implementation; the exact syntax of grammar fragments lands during the editor foundation task. Treat the production-syntax surface as evolving until then; the FFI signature is stable.

The semantic-fit bonus (see “Ranking weights” above) is a per-context boost table. As carts ship, the table will grow with their domain idioms. This is implementation-side: cartridge authors don’t write to the table directly, but their domain vocabulary is what the runtime authors put there.


Per ADR-0009 §“Quality Evaluation,” the static model covers ~95% of practical use cases as measured against four hand-coded test scenarios from launch titles (ICE Breaker filter, multi-phase extraction, Sysop ICE deployment, Black Ledger financial audit). The remaining ~5% are edge cases — rare domain terms, niche operators that didn’t make the popularity baseline.

  1. Builtin shadowing. A user defining let as a variable name would over-weight it via the local-binding boost on top of its popularity baseline. Mitigation: disallow shadowing builtins in the parser.
  2. Context-insensitive boosts. Domain vocabulary applies globally even when not relevant. Mitigation: v1 doesn’t track per-cartridge context in the REPL; acceptable for launch. (v1.1 would track active cartridge per session.)
  3. Stale recency. A token used once then abandoned still gets recency boost. Mitigation: the 20-token decay window is short; v1 is acceptable.
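The first mitigation is a parse-time guard; a sketch, where BUILTINS stands in for the real primnames[] table:

```python
# Illustrative subset of the builtin names; the real set is the
# 26 primitives in primnames[].
BUILTINS = {"let", "if", "fn", "quote", "cons", "car", "cdr", "="}

def check_binding_name(name):
    """Reject binding names that would shadow a builtin, so shadowed
    builtins never enter the buffer-local candidate source."""
    if name in BUILTINS:
        raise SyntaxError(f"cannot shadow builtin '{name}'")
    return name
```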

Post-launch (v1.1), a learned model could collect anonymized session telemetry (which tokens players select at each position type) and feed it to a lightweight neural net (e.g., a GRU over position + context). Privacy concerns and model bloat defer this; the v1 static model ships at launch.

Per-cartridge context tracking would note which cartridge a script targets and boost only that cartridge’s domain vocabulary: higher relevance, less cross-domain noise, modest complexity. A nice-to-have for v1.1.