docs: resurface glossary and add new terms (#7441)

Unverified commit e5e118a1, authored Mar 19, 2026 by Kristen Kelleher; committed by GitHub Mar 19, 2026

Showing 3 changed files with 19 additions and 4 deletions

README.md +1 -1

docs/index.yml +2 -2

docs/reference/glossary.md +16 -1

--- a/README.md
+++ b/README.md
@@ -190,7 +190,7 @@ Dynamo is built in the open with an OSS-first development model. We welcome cont

 ## Latest News

- [03/15] [Dynamo 1.0 is here — production-ready with strong community adoption](https://developer.nvidia.com/blog/nvidia-dynamo-1-production-ready/)
+- [03/15] [Dynamo 1.0 is here — production-ready with strong community adoption](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/)
 - [03/15] [NVIDIA Blackwell Ultra sets new inference records in MLPerf](https://developer.nvidia.com/blog/nvidia-blackwell-ultra-sets-new-inference-records-in-mlperf-debut/)
 - [03/15] [NVIDIA Blackwell leads on SemiAnalysis InferenceMax benchmarks](https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/)
 - [12/05] [Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200](https://quantumzeitgeist.com/kimi-k2-nvidia-ai-ai-breakthrough/)

--- a/docs/index.yml
+++ b/docs/index.yml
@@ -42,6 +42,8 @@ navigation:
        path: reference/release-artifacts.md
      - page: Examples
        path: getting-started/examples.md
+      - page: Glossary
+        path: reference/glossary.md

  # ==================== Kubernetes Deployment ====================
  - section: Kubernetes Deployment
@@ -342,8 +344,6 @@ navigation:
      - page: Model Caching with Fluid
        path: kubernetes/model-caching-with-fluid.md
      # -- Reference --
-      - page: Glossary
-        path: reference/glossary.md
      - page: Tuning Disaggregated Performance
        path: performance/tuning.md
      # -- Frontend (hidden sub-pages) --

--- a/docs/reference/glossary.md
+++ b/docs/reference/glossary.md
@@ -17,6 +17,8 @@ title: Glossary

 **Disaggregated Serving** - Dynamo's core architecture that separates prefill and decode phases into specialized engines to maximize GPU throughput and improve performance.

+**Discovery Plane** - The service discovery layer where components (frontend, router, and workers) register services, discover services, and watch for service lifecycle events at runtime using Kubernetes or etcd backends.
+
 **Distributed Runtime** - Dynamo's Rust-based core system that manages service discovery, communication, and component lifecycle across distributed clusters.

 **Dynamo** - NVIDIA's high-performance distributed inference framework for Large Language Models (LLMs) and generative AI models, designed for multinode environments with disaggregated serving and cache-aware routing.
@@ -26,6 +28,8 @@ title: Glossary
 ## E
 **Endpoint** - A specific network-accessible API within a Dynamo component, such as `generate` or `load_metrics`.

+**Event Plane** - The pub/sub layer for KV cache updates, worker metrics, and sequence tracking; it supports KV-aware routing and disaggregated serving architectures.
+
 ## F
 **Frontend** - Dynamo's API server component that receives user requests and provides OpenAI-compatible HTTP endpoints.

@@ -33,7 +37,9 @@ title: Glossary
 **Graph** - A collection of interconnected Dynamo components that form a complete inference pipeline with request paths (single-in) and response paths (many-out for streaming). A graph can be packaged into a Dynamo Artifact for deployment.

 ## I
-**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing
+**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing.
+
+**Inter-Token Latency (ITL)** - The latency between consecutive output tokens during the decode phase; typically paired with TTFT to define performance SLAs.

 ## K
 **KV Block Manager (KVBM)** - Dynamo's scalable runtime component that handles memory allocation, management, and remote sharing of Key-Value blocks across heterogeneous and distributed environments.
@@ -46,6 +52,9 @@ title: Glossary

 **KVPublisher** - Dynamo component that emits KV cache events (stored/removed) from individual workers to the global KVIndexer.

+## L
+
+**LoRA (Low-Rank Adaptation)** - A fine-tuning technique for serving specialized model variants without duplicating full model weights. Dynamo supports dynamic loading and serving of LoRA adapters at runtime using worker APIs (for example, to load or unload adapters, or to discover them through `/v1/models`).

 ## M
 **Model Deployment Card (MDC)** - A configuration structure containing all information required for distributed model serving. When a worker loads a model, it creates an MDC containing references to components such as the tokenizer, templates, runtime config. Workers publish their MDC to make the model discoverable to frontends. Frontends use the MDC to configure request preprocessing (tokenization, prompt formatting).
@@ -66,14 +75,20 @@ title: Glossary

 **Processor** - Dynamo component that handles request preprocessing, tokenization, and routing decisions.

+**Profiler** - Dynamo component that analyzes model performance to determine optimal engine configurations, including disagg/agg, parallelization mapping (TP, TEP, DEP), and other engine knobs (batch size, max num tokens), feeding the Planner for SLA-driven autoscaling.
+
 ## R
 **RadixAttention** - Technique from SGLang that uses a prefix tree structure for efficient KV cache matching, insertion, and eviction.

 **RDMA (Remote Direct Memory Access)** - Technology that allows direct memory access between distributed systems, used for efficient KV cache transfers.

+**Request Plane** - The transport layer that transmits RPCs between components (frontend-to-worker or router-to-router) utilizing one of these protocols: TCP, HTTP, or NATS.
+
 ## S
 **SGLang** - Fast LLM inference framework with native embedding support and RadixAttention.

+**Speculative Decoding** - An optimization where a draft model proposes tokens for parallel verification by the main model; reduces latency (for example, vLLM with Eagle).
+
 ## T
 **Tensor Parallelism (TP)** - Model parallelism technique where model weights are distributed across multiple GPUs.
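
Review note: the new **Inter-Token Latency (ITL)** entry pairs ITL with TTFT (time to first token) to define performance SLAs. The following is a minimal Python sketch of how the two metrics could be derived from per-token timestamps. The function names, the timestamp format, and the threshold parameters are hypothetical and are not part of Dynamo's API; using the mean gap as the ITL figure is one possible aggregation, not necessarily the one Dynamo reports.

```python
from statistics import mean


def ttft(request_sent: float, token_times: list[float]) -> float:
    """Time to first token: delay between sending the request and the first output token."""
    return token_times[0] - request_sent


def inter_token_latencies(token_times: list[float]) -> list[float]:
    """Gaps between consecutive output tokens during decode."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def meets_sla(
    request_sent: float,
    token_times: list[float],
    max_ttft: float,
    max_mean_itl: float,
) -> bool:
    """Check a single request against hypothetical TTFT and mean-ITL targets (seconds)."""
    gaps = inter_token_latencies(token_times)
    # A single-token response has no inter-token gap, so only TTFT applies.
    mean_itl = mean(gaps) if gaps else 0.0
    return ttft(request_sent, token_times) <= max_ttft and mean_itl <= max_mean_itl


# Hypothetical timestamps in seconds, relative to the moment the request was sent.
arrival_times = [0.5, 0.75, 1.0, 1.25]
first_token_delay = ttft(0.0, arrival_times)
decode_gaps = inter_token_latencies(arrival_times)
```

Here `first_token_delay` captures the prefill-dominated wait and `decode_gaps` captures the steady-state decode cadence, which is why the glossary entry treats the two as complementary SLA inputs.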