Commit e5e118a1 authored by Kristen Kelleher, committed by GitHub

docs: resurface glossary and add new terms (#7441)

parent ba3aef8a
......@@ -190,7 +190,7 @@ Dynamo is built in the open with an OSS-first development model. We welcome cont
## Latest News
- [03/15] [Dynamo 1.0 is here — production-ready with strong community adoption](https://developer.nvidia.com/blog/nvidia-dynamo-1-production-ready/)
- [03/15] [Dynamo 1.0 is here — production-ready with strong community adoption](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/)
- [03/15] [NVIDIA Blackwell Ultra sets new inference records in MLPerf](https://developer.nvidia.com/blog/nvidia-blackwell-ultra-sets-new-inference-records-in-mlperf-debut/)
- [03/15] [NVIDIA Blackwell leads on SemiAnalysis InferenceMax benchmarks](https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/)
- [12/05] [Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200](https://quantumzeitgeist.com/kimi-k2-nvidia-ai-ai-breakthrough/)
......
......@@ -42,6 +42,8 @@ navigation:
path: reference/release-artifacts.md
- page: Examples
path: getting-started/examples.md
- page: Glossary
path: reference/glossary.md
# ==================== Kubernetes Deployment ====================
- section: Kubernetes Deployment
......@@ -342,8 +344,6 @@ navigation:
- page: Model Caching with Fluid
path: kubernetes/model-caching-with-fluid.md
# -- Reference --
- page: Glossary
path: reference/glossary.md
- page: Tuning Disaggregated Performance
path: performance/tuning.md
# -- Frontend (hidden sub-pages) --
......
......@@ -17,6 +17,8 @@ title: Glossary
**Disaggregated Serving** - Dynamo's core architecture that separates prefill and decode phases into specialized engines to maximize GPU throughput and improve performance.
**Discovery Plane** - The service discovery layer where components (frontend, router, and workers) register services, discover services, and watch for new service life-cycle events at runtime using Kubernetes or etcd backends.
**Distributed Runtime** - Dynamo's Rust-based core system that manages service discovery, communication, and component lifecycle across distributed clusters.
**Dynamo** - NVIDIA's high-performance distributed inference framework for Large Language Models (LLMs) and generative AI models, designed for multinode environments with disaggregated serving and cache-aware routing.
......@@ -26,6 +28,8 @@ title: Glossary
## E
**Endpoint** - A specific network-accessible API within a Dynamo component, such as `generate` or `load_metrics`.
**Event Plane** - The pub/sub layer for KV cache updates, worker metrics, and sequence tracking; it supports KV-aware routing and disaggregated serving architectures.
## F
**Frontend** - Dynamo's API server component that receives user requests and provides OpenAI-compatible HTTP endpoints.
......@@ -33,7 +37,9 @@ title: Glossary
**Graph** - A collection of interconnected Dynamo components that form a complete inference pipeline with request paths (single-in) and response paths (many-out for streaming). A graph can be packaged into a Dynamo Artifact for deployment.
## I
**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing
**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing.
**Inter-Token Latency (ITL)** - The latency between consecutive output tokens during the decode phase; typically paired with TTFT to define performance SLAs.
## K
**KV Block Manager (KVBM)** - Dynamo's scalable runtime component that handles memory allocation, management, and remote sharing of Key-Value blocks across heterogeneous and distributed environments.
......@@ -46,6 +52,9 @@ title: Glossary
**KVPublisher** - Dynamo component that emits KV cache events (stored/removed) from individual workers to the global KVIndexer.
## L
**LoRA (Low-Rank Adaptation)** - A fine-tuning technique for serving specialized model variants without duplicating full model weights. Dynamo supports dynamic loading and serving of LoRA adapters at runtime using worker APIs (for example, to load or unload adapters, or to discover them in `/v1/models`).
## M
**Model Deployment Card (MDC)** - A configuration structure containing all information required for distributed model serving. When a worker loads a model, it creates an MDC containing references to components such as the tokenizer, templates, and runtime config. Workers publish their MDC to make the model discoverable to frontends. Frontends use the MDC to configure request preprocessing (tokenization, prompt formatting).
......@@ -66,14 +75,20 @@ title: Glossary
**Processor** - Dynamo component that handles request preprocessing, tokenization, and routing decisions.
**Profiler** - Dynamo component that analyzes model performance to determine optimal engine configurations, including disagg/agg, parallelization mapping (TP, TEP, DEP), and other engine knobs (batch size, max num tokens), feeding the Planner for SLA-driven autoscaling.
## R
**RadixAttention** - Technique from SGLang that uses a prefix tree structure for efficient KV cache matching, insertion, and eviction.
**RDMA (Remote Direct Memory Access)** - Technology that allows direct memory access between distributed systems, used for efficient KV cache transfers.
**Request Plane** - The transport layer that transmits RPCs between components (frontend-to-worker or router-to-router) utilizing one of these protocols: TCP, HTTP, or NATS.
## S
**SGLang** - Fast LLM inference framework with native embedding support and RadixAttention.
**Speculative Decoding** - An optimization in which a draft model proposes tokens that the main model verifies in parallel, reducing latency (for example, vLLM with Eagle).
## T
**Tensor Parallelism (TP)** - Model parallelism technique where model weights are distributed across multiple GPUs.
......