LLM Runtime Glossary: The Stuff Nobody Explains
2025-12-17
This glossary exists for one reason:
most AI resources skip the boring but critical layer.
These are the terms you keep bumping into once you stop doing demos and start shipping systems.
Read this slowly. Revisit it often.
API & INTERFACE LAYER
/api/generate
A single-prompt → single-output endpoint.
- No roles (system, user, assistant)
- Lower overhead
- Easier to reason about
- Ideal for:
  - classification
  - scoring
  - extraction
  - routing decisions
Mental model:
Call it like a function, treat output as untrusted text.
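A minimal sketch of that mental model, assuming an Ollama-style /api/generate endpoint on localhost; the model name is a placeholder:

```python
import requests

# Assumes a local Ollama-style server; "llama3.1" is a placeholder model name.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Classify the sentiment of: 'great service, slow shipping'. One word.",
        "stream": False,  # one JSON object back, not a token stream
        "options": {"temperature": 0, "num_predict": 8},
    },
    timeout=60,
)
text = resp.json()["response"]  # untrusted text: validate before acting on it
print(text.strip())
```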
/api/chat
A conversation-oriented endpoint.
- Accepts structured messages with roles
- Maintains conversational context
- More tokens, more cost, more variance
Use when:
- You need memory of prior turns
- You want assistant-style responses
- You are building a UI chat experience
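A minimal sketch of the role-based shape, again assuming an Ollama-style /api/chat endpoint; the roles are the point, not the model name:

```python
import requests

messages = [
    {"role": "system", "content": "You are a terse support assistant."},
    {"role": "user", "content": "My order arrived damaged."},
    # prior user/assistant turns would go here; that is the "memory"
]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama3.1", "messages": messages, "stream": False},
    timeout=120,
)
reply = resp.json()["message"]["content"]
messages.append({"role": "assistant", "content": reply})  # your code carries the state, not the model
```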
Streaming
Receiving tokens as they are generated.
- Faster perceived response
- More complex handling
- Harder to retry cleanly
Rule of thumb:
Stream for UX, not for agents.
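A sketch of consuming a stream, assuming the server emits newline-delimited JSON chunks (the Ollama default when streaming is left on):

```python
import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "Explain tokens in one sentence."},  # streaming is the default
    stream=True,
    timeout=300,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)  # tokens appear as they are generated
        if chunk.get("done"):
            break
# The retry problem: if this loop dies halfway, you hold half an answer and no clean way to redo it.
```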
CONTEXT & TOKENS
Token
A chunk of text the model processes (not words).
- 1 word ≠ 1 token (word and token boundaries rarely line up)
- punctuation, spaces, emojis all count
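If all you need is a budget, a crude heuristic is enough. This sketch assumes roughly 4 characters per token for English prose, which is an approximation, not the model's real tokenizer:

```python
def rough_token_count(text: str) -> int:
    """Crude budgeting estimate: ~4 characters per token for English text.
    Real counts depend on the model's tokenizer."""
    return max(1, len(text) // 4)

print(rough_token_count("The quick brown fox jumps over the lazy dog."))  # ~11
```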
Context Window (num_ctx)
Maximum number of tokens the model can see at once.
Includes:
- prompt
- system instructions
- conversation history
- retrieved documents (RAG)
- tool outputs
Too small: model forgets things
Too large: slow, RAM-heavy, can freeze local machines
Truncation
When older tokens are silently dropped to fit the context window.
Danger:
The model doesn’t tell you what it forgot.
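One defence is to do the budgeting yourself and trim explicitly, so you know what was dropped. A sketch, reusing the crude per-character heuristic from above:

```python
def rough_token_count(text: str) -> int:
    return max(1, len(text) // 4)  # same crude heuristic as above

NUM_CTX = 4096               # what you pass as num_ctx
RESERVED_FOR_OUTPUT = 512    # leave room for the reply (num_predict)

def fits(system: str, history: list[str], question: str) -> bool:
    used = sum(rough_token_count(t) for t in [system, question, *history])
    return used <= NUM_CTX - RESERVED_FOR_OUTPUT

def trim_history(system: str, history: list[str], question: str) -> list[str]:
    history = list(history)
    while history and not fits(system, history, question):
        dropped = history.pop(0)   # drop the oldest turn explicitly, and log it
        print(f"dropped turn: {dropped[:40]}...")
    return history
```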
GENERATION CONTROLS (THE KNOBS)
num_predict
Maximum number of tokens the model is allowed to generate.
- Hard stop
- Prevents runaway responses
- Protects latency and memory
Temperature
Controls randomness.
- 0.0 – 0.2 → deterministic, boring, safe
- 0.3 – 0.6 → balanced
- 0.7+ → creative, risky, vibes
Never use high temperature for decisions.
top_p (Nucleus Sampling)
Probability cutoff for token selection.
- Lower = safer, more focused
- Higher = more diverse, more nonsense
Often used instead of temperature, not with it.
top_k
Limits sampling to the top K most likely tokens.
- Smaller = stricter
- Larger = looser
Less commonly used now than top_p.
Repeat Penalty
Discourages repeating the same tokens.
- Prevents loops
- Prevents “As an AI language model…” spam
Mirostat
An adaptive sampling algorithm.
- Tries to maintain consistent “surprise”
- Reduces tuning guesswork
- More stable long generations
Good for chat, rarely needed for agents.
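In practice these knobs travel together in one options object. The names below are Ollama-style; other runtimes spell them differently, so check your engine's docs. A sketch of a decision-oriented profile next to a chatty one:

```python
# Ollama-style option names; verify against your runtime before relying on them.
DECISION_OPTIONS = {
    "temperature": 0.1,      # near-deterministic
    "top_p": 0.5,            # tight nucleus
    "num_predict": 64,       # hard cap on output length
    "repeat_penalty": 1.1,   # mild loop protection
}

CHAT_OPTIONS = {
    "temperature": 0.7,
    "top_p": 0.9,
    "num_predict": 512,
    "mirostat": 2,           # adaptive sampling; tau/eta left at runtime defaults
}

# Passed alongside the request, e.g. {"model": ..., "prompt": ..., "options": DECISION_OPTIONS}
```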
MODEL & ENGINE REALITY
Model
The neural network + weights.
Examples:
- LLaMA
- Qwen
- Mistral
- Gemma
Models do not define APIs. Engines do.
Engine / Runtime
The software that runs the model.
Examples:
- Ollama
- llama.cpp
- vLLM
- OpenAI servers
This is where:
- num_ctx limits live
- performance constraints come from
- crashes originate
Quantization
Reducing model precision to save memory.
Common levels: Q4 / Q5 / Q8
- Smaller = faster, dumber
- Larger = slower, smarter
Local reality:
Quantization is why your laptop can run models at all.
VRAM / RAM Pressure
Local models compete with your OS.
Symptoms:
- UI freezes
- mouse lag
- fans screaming
- kernel panic (worst case)
This is not a bug. It’s physics.
AGENT & SYSTEM DESIGN
Determinism
Same input → same output (or close enough).
LLMs are not deterministic by default.
You must enforce it via:
- low temperature
- constrained prompts
- validation
- retries
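A sketch of pinning things down as far as the runtime allows: zero temperature plus a fixed seed (an Ollama-style option; not every engine honors it), with validation catching whatever still drifts:

```python
DETERMINISTIC_OPTIONS = {
    "temperature": 0,   # no sampling randomness
    "seed": 42,         # fixed seed; support varies by engine, so still validate
    "num_predict": 32,
}

ALLOWED_LABELS = {"positive", "negative", "neutral"}

def is_valid(output: str) -> bool:
    return output.strip().lower() in ALLOWED_LABELS   # constrain, don't trust
```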
Retry Logic
Calling the model again when output is invalid.
Common strategies:
- fixed retries
- escalating strictness
- majority voting
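A sketch combining fixed retries with escalating strictness; call_model and is_valid stand in for your own transport and validation:

```python
def generate_with_retries(call_model, prompt: str, is_valid, max_retries: int = 3) -> str | None:
    """call_model(prompt) -> str is your transport; is_valid(str) -> bool is your contract."""
    for _ in range(max_retries):
        output = call_model(prompt)
        if is_valid(output):
            return output
        # escalate strictness: restate the format the model just ignored
        prompt += "\n\nAnswer with exactly one of: positive, negative, neutral."
    return None   # fail loudly instead of shipping garbage
```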
Majority Voting
Run the same prompt multiple times, pick the consensus.
Useful for:
- classification
- extraction
- weak signals
Costs more tokens, buys confidence.
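A sketch of consensus over repeated calls; call_model is whatever wrapper you already have:

```python
from collections import Counter

def majority_vote(call_model, prompt: str, runs: int = 5) -> str | None:
    votes = Counter(call_model(prompt).strip().lower() for _ in range(runs))
    label, count = votes.most_common(1)[0]
    return label if count > runs // 2 else None   # no real majority: abstain
```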
Abstention / NONE
Allowing the model to say:
“I don’t know” or “No action required”
Critical for safety and trust.
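A sketch of wiring the escape hatch into both the prompt and the parser; NONE here is a convention, not a protocol:

```python
PROMPT_TEMPLATE = (
    "Decide which action to take for this ticket. "
    "If no action is required or you are unsure, answer exactly: NONE\n\n"
    "Ticket: {ticket}"
)

def parse_action(output: str) -> str | None:
    answer = output.strip().upper()
    return None if answer == "NONE" else answer   # None means: safely do nothing
```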
Router
A deterministic step that decides what happens next.
Examples:
- which tool to run
- which model to call
- whether to call an LLM at all
Routers should be boring.
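A sketch of a boring router: plain rules first, a model call only as a last resort. The branch names are illustrative:

```python
def route(ticket: str) -> str:
    text = ticket.lower()
    if "refund" in text:
        return "refund_workflow"        # no LLM needed
    if "password" in text or "login" in text:
        return "reset_password_tool"    # still no LLM
    return "ask_llm_to_classify"        # only now do we spend tokens
```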
Policy Layer
Rules that override model intent.
- “Even if the model says yes, we say no”
- Safety, cost, legality, scope control
This layer should not be AI.
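A sketch of a policy check sitting between the model's suggestion and execution; the limits are hard-coded on purpose:

```python
MAX_REFUND_EUR = 50
ALLOWED_ACTIONS = {"send_reply", "issue_refund", "escalate_to_human"}

def apply_policy(action: str, amount: float = 0.0) -> str:
    if action not in ALLOWED_ACTIONS:
        return "escalate_to_human"      # out-of-scope suggestion goes to a human
    if action == "issue_refund" and amount > MAX_REFUND_EUR:
        return "escalate_to_human"      # even if the model says yes, we say no
    return action
```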
Tool Use
Letting the model request actions:
- database queries
- API calls
- scripts
The model suggests.
Your code decides.
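A sketch of the suggest/decide split: the model returns a JSON tool request, your code checks it against an allowlist and only then runs it. The tool here is a placeholder:

```python
import json

def lookup_order(order_id: str) -> str:
    return f"order {order_id}: shipped"        # placeholder for a real database query

TOOLS = {"lookup_order": lookup_order}         # the allowlist is the permission model

def handle_tool_request(model_output: str) -> str:
    try:
        # expect something like {"tool": "lookup_order", "args": {"order_id": "123"}}
        request = json.loads(model_output)
        return TOOLS[request["tool"]](**request["args"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return "REFUSED: invalid or unknown tool request"
```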
COMMON TRAPS
“LLMs are just functions”
False.
They are:
- probabilistic
- stateful
- failure-prone
Treat them like unreliable narrators.
“The model will behave if I prompt it well”
Also false.
Prompting helps.
Architecture matters more.
“Bigger model = better system”
No.
- Wrong defaults sink big models
- Determinism beats creativity
- Guardrails beat vibes
FINAL MENTAL MODEL
Think in layers:
- Transport – how you talk to the model
- Inference – how tokens are generated
- Control – how your system stays sane
If you understand those three,
you are no longer guessing.
This glossary will grow. Real systems always do.