Commit 784da90e authored by ZichengMa, committed by GitHub

feat: LMCache integration in newest ux (#2079)


Signed-off-by: ZichengMa <zichengma1225@gmail.com>
Co-authored-by: Ziqi Fan <ziqif@nvidia.com>
parent 63fbf498
# LMCache Integration in Dynamo
## Introduction
LMCache is a high-performance KV cache layer that supercharges LLM serving by enabling **prefill-once, reuse-everywhere** semantics. As described in the [official documentation](https://docs.lmcache.ai/index.html), LMCache lets LLMs prefill each text only once by storing the KV caches of all reusable texts, allowing reuse of KV caches for any reused text (not necessarily prefix) across any serving engine instance.
This document describes how LMCache is integrated into Dynamo's vLLM backend to provide enhanced performance and memory efficiency.
### Key Benefits
- **Reduced Time to First Token (TTFT)**: Eliminates redundant prefill computations
- **Memory Offloading**: Intelligent KV cache placement across CPU/GPU/storage tiers
- **Improved Throughput**: Reduced GPU memory pressure enables higher batch sizes
## Aggregated Serving
### Configuration
LMCache is enabled by setting the `ENABLE_LMCACHE` environment variable:
```bash
export ENABLE_LMCACHE=1
```
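The worker reads this variable once at import time and treats `1`, `true`, and `yes` (case-insensitive) as enabled. A minimal sketch of that check follows; the helper name `lmcache_enabled` is illustrative and not part of Dynamo:

```python
import os


def lmcache_enabled(environ=None):
    """Return True when ENABLE_LMCACHE holds a truthy value.

    `environ` defaults to os.environ; passing a plain dict makes the
    check easy to exercise without touching the real environment.
    """
    env = os.environ if environ is None else environ
    return env.get("ENABLE_LMCACHE", "0").lower() in ("1", "true", "yes")


print(lmcache_enabled({"ENABLE_LMCACHE": "True"}))
print(lmcache_enabled({}))
```

Any other value, including an unset variable, leaves LMCache disabled.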
Additional LMCache configuration can be customized via environment variables:
- `LMCACHE_CHUNK_SIZE=256` - Token chunk size for cache granularity (default: 256)
- `LMCACHE_LOCAL_CPU=True` - Enable CPU memory backend for offloading
- `LMCACHE_MAX_LOCAL_CPU_SIZE=20` - CPU memory limit in GB (set a fixed value that fits the RAM available on the host)
For advanced configurations, LMCache supports multiple [storage backends](https://docs.lmcache.ai/index.html):
- **CPU RAM**: Fast local memory offloading
- **Local Storage**: Disk-based persistence
- **Redis**: Distributed cache sharing
- **GDS Backend**: GPU Direct Storage for high throughput
- **InfiniStore/Mooncake**: Cloud-native storage solutions
### Deployment
Use the provided launch script for quick setup:
```bash
./components/backends/vllm/launch/agg_lmcache.sh
```
This will:
1. Start the dynamo frontend
2. Launch a single vLLM worker with LMCache enabled
### Architecture for Aggregated Mode
In aggregated mode, the system uses:
- **KV Connector**: `LMCacheConnectorV1`
- **KV Role**: `kv_both` (handles both reading and writing)
## Disaggregated Serving
Disaggregated serving separates prefill and decode operations into dedicated workers. This provides better resource utilization and scalability for production deployments.
### Configuration
The same `ENABLE_LMCACHE=1` environment variable enables LMCache, but the system automatically configures different connector setups for prefill and decode workers.
### Deployment
Use the provided disaggregated launch script (it requires at least 2 GPUs):
```bash
./components/backends/vllm/launch/disagg_lmcache.sh
```
This will:
1. Start the dynamo frontend
2. Launch a decode worker on GPU 0
3. Wait for initialization
4. Launch a prefill worker on GPU 1 with LMCache enabled
### Worker Roles
#### Decode Worker
- **Purpose**: Handles token generation (decode phase)
- **GPU Assignment**: CUDA_VISIBLE_DEVICES=0
- **LMCache Config**: LMCache is not enabled; the worker uses only `NixlConnector` for KV transfer between prefill and decode workers
#### Prefill Worker
- **Purpose**: Handles prompt processing (prefill phase)
- **GPU Assignment**: CUDA_VISIBLE_DEVICES=1
- **LMCache Config**: Uses `MultiConnector` with both the LMCache and NIXL connectors, so the prefill worker offloads KV cache through LMCache and transfers KV cache to the decode worker through NIXL.
- **Flag**: `--is-prefill-worker`
## Architecture
### KV Transfer Configuration
The system automatically configures KV transfer based on the deployment mode and worker type:
#### Prefill Worker (Disaggregated Mode)
```python
kv_transfer_config = KVTransferConfig(
    kv_connector="MultiConnector",
    kv_role="kv_both",
    kv_connector_extra_config={
        "connectors": [
            {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"},
            {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
        ]
    }
)
```
#### Aggregated Mode (or a Decode Worker Started with `ENABLE_LMCACHE=1`)
```python
kv_transfer_config = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",
    kv_role="kv_both"
)
```
#### Fallback (No LMCache, as Used by the Decode Worker in `disagg_lmcache.sh`)
```python
kv_transfer_config = KVTransferConfig(
    kv_connector="NixlConnector",
    kv_role="kv_both"
)
```
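The three cases above reduce to one decision on two inputs: whether LMCache is enabled and whether the worker is a prefill worker. The sketch below mirrors that branching with plain dictionaries standing in for `KVTransferConfig`, so it runs without vLLM installed; the function name `select_kv_transfer` is hypothetical.

```python
def select_kv_transfer(enable_lmcache: bool, is_prefill_worker: bool) -> dict:
    """Return the connector settings as a plain dict (stand-in for KVTransferConfig)."""
    if not enable_lmcache:
        # Fallback: NIXL only.
        return {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
    if is_prefill_worker:
        # Prefill worker: LMCache for offloading plus NIXL for transfer.
        return {
            "kv_connector": "MultiConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {
                "connectors": [
                    {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"},
                    {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
                ]
            },
        }
    # LMCache enabled on a non-prefill worker: single LMCache connector.
    return {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}


for enabled, prefill in [(False, False), (True, False), (True, True)]:
    print(enabled, prefill, select_kv_transfer(enabled, prefill)["kv_connector"])
```

Note that the prefill flag only matters when LMCache is enabled; without it, every worker gets the NIXL-only configuration.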
### Environment Setup
The system automatically configures LMCache environment variables when enabled:
```python
lmcache_config = {
    "LMCACHE_CHUNK_SIZE": "256",
    "LMCACHE_LOCAL_CPU": "True",
    "LMCACHE_MAX_LOCAL_CPU_SIZE": "20"
}
```
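These values act as defaults only: a variable that is already set in the environment is left untouched. A small sketch of that rule, using an ordinary dict in place of `os.environ` and a hypothetical helper name `apply_lmcache_defaults`:

```python
LMCACHE_DEFAULTS = {
    "LMCACHE_CHUNK_SIZE": "256",
    "LMCACHE_LOCAL_CPU": "True",
    "LMCACHE_MAX_LOCAL_CPU_SIZE": "20",
}


def apply_lmcache_defaults(environ):
    """Fill in missing LMCache variables and return the ones that were added.

    In a real worker `environ` would be os.environ; a dict keeps the
    example free of side effects.
    """
    applied = {}
    for key, value in LMCACHE_DEFAULTS.items():
        if key not in environ:  # user-provided values win
            environ[key] = value
            applied[key] = value
    return applied


env = {"LMCACHE_CHUNK_SIZE": "512"}  # user override
applied = apply_lmcache_defaults(env)
print(env)
print(applied)
```

Here the user's chunk size of 512 survives, while the two unset variables receive their defaults.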
### Integration Points
1. **Argument Parsing** (`args.py`):
   - Detects `ENABLE_LMCACHE` environment variable
   - Configures appropriate KV transfer settings
   - Sets up connector configurations based on worker type
2. **Engine Setup** (`main.py`):
   - Initializes LMCache environment variables
   - Creates vLLM engine with proper KV transfer config
   - Handles both aggregated and disaggregated modes
### Best Practices
1. **Chunk Size Tuning**: Adjust `LMCACHE_CHUNK_SIZE` based on your use case:
   - Smaller chunks (128-256): Better reuse granularity for varied content
   - Larger chunks (512-1024): More efficient for repetitive content patterns
2. **Memory Allocation**: Set `LMCACHE_MAX_LOCAL_CPU_SIZE` conservatively:
   - Leave sufficient RAM for other system processes
   - Monitor memory usage during peak loads
3. **Workload Optimization**: LMCache performs best with:
   - Repeated prompt patterns (RAG, multi-turn conversations)
   - Shared context across sessions
   - Long-running services with warm caches
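
When sizing `LMCACHE_MAX_LOCAL_CPU_SIZE`, it can help to estimate how many tokens of KV cache a given budget holds. The sketch below uses the standard per-token KV size (keys plus values, across all layers and KV heads). The model dimensions are hypothetical placeholders, and treating 1 GB as 2**30 bytes is an assumption; check your model's config and the LMCache documentation for the real values.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache for one token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_that_fit(budget_gb, bytes_per_token):
    """Whole tokens that fit in the budget, assuming 1 GB = 2**30 bytes."""
    return (budget_gb * 2**30) // bytes_per_token


# Hypothetical model: 28 layers, 8 KV heads, head dimension 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128, dtype_bytes=2)
capacity = tokens_that_fit(budget_gb=20, bytes_per_token=per_token)
print(f"{per_token} bytes per token, about {capacity} tokens in 20 GB")
print(f"roughly {capacity // 256} chunks at LMCACHE_CHUNK_SIZE=256")
```

The estimate ignores any bookkeeping overhead inside LMCache, so treat it as an upper bound rather than a guarantee.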
## References and Additional Resources
- [LMCache Documentation](https://docs.lmcache.ai/index.html) - Comprehensive guide and API reference
- [Configuration Reference](https://docs.lmcache.ai/api_reference/config.html) - Detailed configuration options
......@@ -8,4 +8,5 @@ trap 'echo Cleaning up...; kill 0' EXIT
python -m dynamo.frontend &
# run worker
# --enforce-eager is added for quick deployment. for production use, need to remove this flag
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --no-enable-prefix-caching
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# run ingress
python -m dynamo.frontend &
# run worker with LMCache enabled
ENABLE_LMCACHE=1 \
LMCACHE_CHUNK_SIZE=256 \
LMCACHE_LOCAL_CPU=True \
LMCACHE_MAX_LOCAL_CPU_SIZE=20 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
\ No newline at end of file
......@@ -8,6 +8,7 @@ trap 'echo Cleaning up...; kill 0' EXIT
python -m dynamo.frontend --router-mode kv &
# run workers
# --enforce-eager is added for quick deployment. for production use, need to remove this flag
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager
......@@ -10,6 +10,7 @@ python -m dynamo.frontend --router-mode kv &
# Data Parallel Attention / Expert Parallelism
# Routing to DP workers managed by Dynamo
# Chose Qwen3-30B because its a small MOE that can fit on smaller GPUs (L40S for example)
# --enforce-eager is added for quick deployment. for production use, need to remove this flag
for i in {0..3}; do
CUDA_VISIBLE_DEVICES=$i python3 -m dynamo.vllm \
--model Qwen/Qwen3-30B-A3B \
......
......@@ -7,6 +7,7 @@ trap 'echo Cleaning up...; kill 0' EXIT
# run ingress
python -m dynamo.frontend --router-mode kv &
# --enforce-eager is added for quick deployment. for production use, need to remove this flag
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# run ingress with KV router
python -m dynamo.frontend --router-mode kv &
# run decode worker on GPU 0, without enabling LMCache
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B &
# wait for decode worker to initialize
sleep 20
# run prefill worker on GPU 1 with LMCache
ENABLE_LMCACHE=1 \
LMCACHE_CHUNK_SIZE=256 \
LMCACHE_LOCAL_CPU=True \
LMCACHE_MAX_LOCAL_CPU_SIZE=20 \
CUDA_VISIBLE_DEVICES=1 \
python3 -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--is-prefill-worker
\ No newline at end of file
......@@ -9,6 +9,7 @@ trap 'echo Cleaning up...; kill 0' EXIT
python -m dynamo.frontend --router-mode kv &
# routing will happen between the two decode workers
# --enforce-eager is added for quick deployment. for production use, need to remove this flag
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
......
......@@ -29,6 +29,9 @@ logger = logging.getLogger(__name__)
DEFAULT_ENDPOINT = "dyn://dynamo.backend.generate"
DEFAULT_MODEL = "Qwen/Qwen3-0.6B"
# Global LMCache configuration - initialize once on module import
ENABLE_LMCACHE = os.getenv("ENABLE_LMCACHE", "0").lower() in ("1", "true", "yes")
class Config:
"""Command line parameters or defaults"""
......@@ -209,6 +212,37 @@ def overwrite_args(config):
dp_rank = config.engine_args.data_parallel_rank or 0
    # Set kv_transfer_config based on LMCache setting
    if ENABLE_LMCACHE:
        if config.is_prefill_worker:
            # Prefill worker in disaggregated serving: MultiConnector combines
            # LMCache (offloading) with NIXL (transfer to the decode worker)
            kv_transfer_config = KVTransferConfig(
                kv_connector="MultiConnector",
                kv_role="kv_both",
                kv_connector_extra_config={
                    "connectors": [
                        {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"},
                        {
                            "kv_connector": "NixlConnector",
                            "kv_role": "kv_both",
                        },
                    ]
                },
            )
            logger.info("Using LMCache with MultiConnector serving")
        else:
            # LMCache enabled on a non-prefill worker: use the single LMCache connector
            kv_transfer_config = KVTransferConfig(
                kv_connector="LMCacheConnectorV1", kv_role="kv_both"
            )
            logger.info("Using LMCache with LMCacheConnector serving")
    else:
        kv_transfer_config = KVTransferConfig(
            kv_connector="NixlConnector", kv_role="kv_both"
        )
        logger.info("Using NixlConnector configuration")
defaults = {
"task": "generate",
# As of vLLM >=0.10.0 the engine unconditionally calls
......@@ -219,9 +253,7 @@ def overwrite_args(config):
"disable_log_requests": True,
# KV routing relies on logging KV metrics
"disable_log_stats": False,
"kv_transfer_config": KVTransferConfig(
kv_connector="NixlConnector", kv_role="kv_both"
),
"kv_transfer_config": kv_transfer_config,
}
if config.engine_args.enable_prefix_caching:
......
......@@ -20,7 +20,13 @@ from dynamo.llm import (
from dynamo.runtime import DistributedRuntime, dynamo_worker
from dynamo.runtime.logging import configure_dynamo_logging
from .args import Config, configure_ports_with_etcd, overwrite_args, parse_args
from .args import (
ENABLE_LMCACHE,
Config,
configure_ports_with_etcd,
overwrite_args,
parse_args,
)
from .handlers import DecodeWorkerHandler, PrefillWorkerHandler
from .publisher import StatLoggerFactory
......@@ -28,6 +34,22 @@ configure_dynamo_logging()
logger = logging.getLogger(__name__)
def setup_lmcache_environment():
    """Setup LMCache environment variables for KV cache offloading"""
    # Default LMCache configuration
    lmcache_config = {
        "LMCACHE_CHUNK_SIZE": "256",  # Token chunk size
        "LMCACHE_LOCAL_CPU": "True",  # Enable CPU memory backend
        "LMCACHE_MAX_LOCAL_CPU_SIZE": "20",  # CPU memory limit in GB
    }
    # Set environment variables
    for key, value in lmcache_config.items():
        if key not in os.environ:  # Only set if not already configured
            os.environ[key] = value
            logger.info(f"Set LMCache environment variable: {key}={value}")
async def graceful_shutdown(runtime):
"""
Shutdown dynamo distributed runtime.
......@@ -70,6 +92,14 @@ def setup_vllm_engine(config, stat_logger=None):
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
engine_args = config.engine_args
    # KV transfer config is now handled by args.py based on ENABLE_LMCACHE env var
    if ENABLE_LMCACHE:
        setup_lmcache_environment()
        logger.info("LMCache enabled for VllmWorker")
    else:
        logger.info("LMCache is disabled")
# Load default sampling params from `generation_config.json`
default_sampling_params = (
engine_args.create_model_config().get_diff_sampling_param()
......@@ -90,6 +120,9 @@ def setup_vllm_engine(config, stat_logger=None):
disable_log_requests=engine_args.disable_log_requests,
disable_log_stats=engine_args.disable_log_stats,
)
    if ENABLE_LMCACHE:
        logger.info(f"VllmWorker for {config.model} has been initialized with LMCache")
    else:
        logger.info(f"VllmWorker for {config.model} has been initialized")
return engine_client, vllm_config, default_sampling_params
......@@ -98,7 +131,6 @@ async def init_prefill(runtime: DistributedRuntime, config: Config):
"""
Instantiate and serve
"""
component = runtime.namespace(config.namespace).component(config.component)
await component.create_service()
......
......@@ -12,7 +12,7 @@ ARG RUNTIME_IMAGE="nvcr.io/nvidia/cuda"
ARG RUNTIME_IMAGE_TAG="12.8.1-runtime-ubuntu24.04"
# Make sure to update the dependency version in pyproject.toml when updating this
ARG VLLM_REF="v0.10.0"
ARG VLLM_REF="f4135232b9a8c4845f8961fb1cd17581c56ae2ce"
ARG TORCH_BACKEND="cu128"
# Match 0.10.0 vLLM release
......
......@@ -20,7 +20,7 @@ set -euo pipefail
# Parse arguments
EDITABLE=true
VLLM_REF="v0.10.0"
VLLM_REF="f4135232b9a8c4845f8961fb1cd17581c56ae2ce"
MAX_JOBS=16
INSTALLATION_DIR=/tmp
ARCH=$(uname -m)
......@@ -78,12 +78,12 @@ while [[ $# -gt 0 ]]; do
echo "Options:"
echo " --editable Install vllm in editable mode (default)"
echo " --no-editable Install vllm in non-editable mode"
echo " --vllm-ref REF Git reference to checkout (default: 059d4cd)"
echo " --vllm-ref REF Git reference to checkout (default: f4135232b9a8c4845f8961fb1cd17581c56ae2ce)"
echo " --max-jobs NUM Maximum number of parallel jobs (default: 16)"
echo " --arch ARCH Architecture (amd64|arm64, default: auto-detect)"
echo " --installation-dir DIR Directory to install vllm (default: /tmp/vllm)"
echo " --deepgemm-ref REF Git reference for DeepGEMM (default: 6c9558e)"
echo " --flashinf-ref REF Git reference for Flash Infer (default: 1d72ed4)"
echo " --deepgemm-ref REF Git reference for DeepGEMM (default: 1876566)"
echo " --flashinf-ref REF Git reference for Flash Infer (default: v0.2.8rc1)"
echo " --torch-backend BACKEND Torch backend to use (default: cu128)"
exit 0
;;
......@@ -107,6 +107,9 @@ echo " TORCH_BACKEND: $TORCH_BACKEND"
# Install common dependencies
uv pip install pip cuda-python
# Install LMCache
uv pip install lmcache
# Create vllm directory and clone
mkdir -p $INSTALLATION_DIR
cd $INSTALLATION_DIR
......
# LMCache Dynamo MMLU Testing Suite
## Overview
Test the correctness of Dynamo integration with LMCache by comparing MMLU benchmark results with and without LMCache enabled.
## Testing Principle
Compare MMLU test results under two configurations:
- **Baseline Test**: Dynamo without LMCache (`ENABLE_LMCACHE=0`)
- **LMCache Test**: Dynamo with LMCache enabled (`ENABLE_LMCACHE=1`)
If both configurations produce the same inference results, it verifies that LMCache functionality is correct.
## Quick Start
### Prerequisites
1. Ensure Dynamo and its dependencies are properly installed, and that NATS and etcd are running
2. Download MMLU dataset to `data` directory
3. Ensure HuggingFace models are accessible
### Download MMLU Dataset
```bash
cd ./tests/lmcache
# Auto-download and organize data
python3 download_mmlu.py
```
### Run Single Model Test
To test another model, pass its name to the deploy script and to the `--model` argument of the test script.
```bash
cd ./tests/lmcache
# 1. Baseline test (without LMCache)
./deploy-baseline-dynamo.sh Qwen/Qwen3-0.6B
# Wait for model to load, then run test in another terminal:
python3 mmlu-baseline-dynamo.py --model Qwen/Qwen3-0.6B --number-of-subjects 15
# Stop services with Ctrl+C in the deploy script terminal
# 2. LMCache test (with LMCache enabled)
./deploy-lmcache_enabled-dynamo.sh Qwen/Qwen3-0.6B
# Wait for model to load, then run test in another terminal:
python3 mmlu-lmcache_enabled-dynamo.py --model Qwen/Qwen3-0.6B --number-of-subjects 15
# Stop services with Ctrl+C in the deploy script terminal
# 3. Compare results
python3 summarize_scores_dynamo.py
```
## File Description
### Deployment Scripts
- **`deploy-baseline-dynamo.sh`**: Deploy Dynamo without LMCache (baseline)
- **`deploy-lmcache_enabled-dynamo.sh`**: Deploy Dynamo with LMCache enabled (test)
### Test Scripts
- **`mmlu-baseline-dynamo.py`**: Run MMLU test on baseline Dynamo
- **`mmlu-lmcache_enabled-dynamo.py`**: Run MMLU test on Dynamo with LMCache
- **`summarize_scores_dynamo.py`**: Compare and analyze test results
## Architecture Differences
### Baseline Architecture (deploy-baseline-dynamo.sh)
```
HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → Direct Inference
Environment: ENABLE_LMCACHE=0
```
### LMCache Architecture (deploy-lmcache_enabled-dynamo.sh)
```
HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → LMCache-enabled Inference
Environment: ENABLE_LMCACHE=1
LMCACHE_CHUNK_SIZE=256
LMCACHE_LOCAL_CPU=True
LMCACHE_MAX_LOCAL_CPU_SIZE=20
```
## API Format
Test scripts use Dynamo's Completions API:
```bash
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": "question content",
    "temperature": 0,
    "max_tokens": 3,
    "stream": false,
    "seed": 42
  }'
```
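The baseline MMLU script in this change posts to the `/v1/completions` endpoint with a `prompt` field. The following sketch builds the same request in Python without sending it; the helper name `build_completion_request` is illustrative.

```python
import json


def build_completion_request(model, prompt, host="localhost", port=8080):
    """Return the URL and JSON payload for one deterministic completion request."""
    url = f"http://{host}:{port}/v1/completions"
    payload = {
        "model": model,
        "prompt": prompt,
        "temperature": 0,  # greedy decoding
        "max_tokens": 3,   # only the answer letter is needed
        "stream": False,
        "seed": 42,        # fixed seed for reproducibility
    }
    return url, payload


url, payload = build_completion_request("Qwen/Qwen3-0.6B", "question content")
print(url)
print(json.dumps(payload, indent=2))
```

With a running deployment, the pair can be sent with `requests.post(url, json=payload, timeout=30)`, which is how the test script issues its requests.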
## Result Interpretation
After testing completes, the following files will be generated:
- `dynamo-baseline-{model_name}.jsonl`: Baseline test results
- `dynamo-lmcache-{model_name}.jsonl`: LMCache test results
If the accuracy in both result files is very close (difference < 1%), it indicates LMCache functionality is correct.
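The comparison can be scripted. The test scripts write one JSON object per line, keyed by subject, plus a `total` record. The sketch below reads the `total` accuracy from each run and applies the 1% threshold, interpreted here as an absolute difference of 0.01 (an assumption). The sample lines are made-up values for illustration, not real results.

```python
import json


def read_total_accuracy(lines):
    """Return the accuracy stored in the 'total' record of a results file."""
    for line in lines:
        record = json.loads(line)
        if "total" in record:
            return record["total"]["accuracy"]
    raise ValueError("no 'total' record found")


def results_match(baseline_lines, lmcache_lines, tolerance=0.01):
    """True when the two total accuracies differ by less than `tolerance`."""
    baseline = read_total_accuracy(baseline_lines)
    lmcache = read_total_accuracy(lmcache_lines)
    return abs(baseline - lmcache) < tolerance


# Illustrative records only; real files come from the two MMLU runs.
baseline_sample = [
    '{"anatomy": {"accuracy": 0.40, "num_questions": 135}}',
    '{"total": {"accuracy": 0.40, "num_questions": 135}}',
]
lmcache_sample = [
    '{"anatomy": {"accuracy": 0.405, "num_questions": 135}}',
    '{"total": {"accuracy": 0.405, "num_questions": 135}}',
]
print(results_match(baseline_sample, lmcache_sample))
```

For real runs, pass `open("dynamo-baseline-<model>.jsonl")` and `open("dynamo-lmcache-<model>.jsonl")` in place of the sample lists.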
## Notes
1. **Determinism guarantee**: All tests use the same seed (42) and zero temperature to ensure reproducible results
2. **Pre-requisites**: Ensure nats and etcd are running.
3. **Sequential execution**: Must stop the first test before starting the second to avoid port conflicts
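
Before starting the second deployment, a quick check that port 8080 is free can save a confusing startup failure. This is a minimal sketch using only the standard library; the helper name `port_in_use` is illustrative.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return probe.connect_ex((host, port)) == 0


if port_in_use(8080):
    print("Port 8080 is busy: stop the previous deployment first")
else:
    print("Port 8080 is free")
```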
\ No newline at end of file
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# ASSUMPTION: dynamo and its dependencies are properly installed
# i.e. nats and etcd are running
# Overview:
# This script deploys dynamo disaggregated serving without LMCache on port 8080
# Used as baseline for correctness testing
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# Arguments:
MODEL_URL=$1
if [ -z "$MODEL_URL" ]; then
echo "Usage: $0 <MODEL_URL>"
echo "Example: $0 Qwen/Qwen3-0.6B"
exit 1
fi
echo "🚀 Starting dynamo disaggregated serving setup without LMCache:"
echo " Model: $MODEL_URL"
echo " Port: 8080"
echo " Mode: Disaggregated (prefill + decode workers)"
# Kill any existing dynamo processes
echo "🧹 Cleaning up any existing dynamo processes..."
pkill -f "dynamo-run" || true
sleep 2
# Disable LMCache
export ENABLE_LMCACHE=0
echo "🔧 Starting dynamo disaggregated serving without LMCache..."
python -m dynamo.frontend &
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model "$MODEL_URL" &
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
    --model "$MODEL_URL" \
    --is-prefill-worker
\ No newline at end of file
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# ASSUMPTION: dynamo and its dependencies are properly installed
# i.e. nats and etcd are running
# Overview:
# This script deploys dynamo without LMCache on port 8080
# Used as baseline for correctness testing
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# Arguments:
MODEL_URL=$1
if [ -z "$MODEL_URL" ]; then
echo "Usage: $0 <MODEL_URL>"
echo "Example: $0 Qwen/Qwen3-0.6B"
exit 1
fi
echo "🚀 Starting dynamo setup without LMCache:"
echo " Model: $MODEL_URL"
echo " Port: 8080"
# Kill any existing dynamo processes
echo "🧹 Cleaning up any existing dynamo processes..."
pkill -f "dynamo-run" || true
sleep 2
# Disable LMCache
export ENABLE_LMCACHE=0
echo "🔧 Starting dynamo worker without LMCache..."
python -m dynamo.frontend &
python3 -m dynamo.vllm --model "$MODEL_URL"
\ No newline at end of file
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# ASSUMPTION: dynamo and its dependencies are properly installed
# i.e. nats and etcd are running
# Overview:
# This script deploys dynamo disaggregated serving with LMCache enabled on port 8080
# Used for LMCache correctness testing
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# Arguments:
MODEL_URL=$1
if [ -z "$MODEL_URL" ]; then
echo "Usage: $0 <MODEL_URL>"
echo "Example: $0 Qwen/Qwen3-0.6B"
exit 1
fi
echo "🚀 Starting dynamo disaggregated serving setup with LMCache:"
echo " Model: $MODEL_URL"
echo " Port: 8080"
echo " Mode: Disaggregated (prefill + decode workers) + LMCache"
echo " !! Remember to kill the old dynamo processes otherwise the port will be busy !!"
# Kill any existing dynamo processes
echo "🧹 Cleaning up any existing dynamo processes..."
pkill -f "dynamo-run" || true
sleep 2
echo "🔧 Starting dynamo disaggregated serving with LMCache enabled..."
python -m dynamo.frontend &
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model "$MODEL_URL" &
sleep 20
# run prefill worker on GPU 1 with LMCache
ENABLE_LMCACHE=1 \
LMCACHE_CHUNK_SIZE=256 \
LMCACHE_LOCAL_CPU=True \
LMCACHE_MAX_LOCAL_CPU_SIZE=20 \
CUDA_VISIBLE_DEVICES=1 \
python3 -m dynamo.vllm \
--model "$MODEL_URL" \
--is-prefill-worker
\ No newline at end of file
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# ASSUMPTION: dynamo and its dependencies are properly installed
# i.e. nats and etcd are running
# Overview:
# This script deploys dynamo with LMCache enabled on port 8080
# Used for LMCache correctness testing
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# Arguments:
MODEL_URL=$1
if [ -z "$MODEL_URL" ]; then
echo "Usage: $0 <MODEL_URL>"
echo "Example: $0 Qwen/Qwen3-0.6B"
exit 1
fi
echo "🚀 Starting dynamo setup with LMCache:"
echo " Model: $MODEL_URL"
echo " Port: 8080"
echo " !! Remember to kill the old dynamo processes otherwise the port will be busy !!"
# Kill any existing dynamo processes
echo "🧹 Cleaning up any existing dynamo processes..."
pkill -f "dynamo-run" || true
sleep 2
echo "🔧 Starting dynamo worker with LMCache enabled..."
python -m dynamo.frontend &
ENABLE_LMCACHE=1 \
python3 -m dynamo.vllm --model "$MODEL_URL"
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Download MMLU dataset using HuggingFace datasets library.
This is a simpler alternative to the bash script.
"""
import argparse
import os
import pandas as pd
from datasets import load_dataset
def download_mmlu():
    """Download and prepare MMLU dataset."""
    print("Downloading MMLU dataset...")
    # Create data directories
    os.makedirs("data/test", exist_ok=True)
    os.makedirs("data/dev", exist_ok=True)
    try:
        # Download MMLU dataset
        print("🔄 Loading MMLU dataset from HuggingFace...")
        dataset = load_dataset("cais/mmlu", "all")
        print("📊 Dataset information:")
        print(f" - Training set: {len(dataset['auxiliary_train'])} samples")
        print(f" - Validation set: {len(dataset['validation'])} samples")
        print(f" - Test set: {len(dataset['test'])} samples")
        # Get all subjects
        subjects_set = set()
        for split in ["validation", "test"]:
            for item in dataset[split]:
                subjects_set.add(item["subject"])
        subjects = sorted(list(subjects_set))
        print(f"Contains {len(subjects)} subjects")
        # Process each subject
        print("Organizing data files...")
        for subject in subjects:
            # Filter data for this subject
            test_data = [item for item in dataset["test"] if item["subject"] == subject]
            val_data = [
                item for item in dataset["validation"] if item["subject"] == subject
            ]
            if test_data:
                # Convert to DataFrame and save as CSV
                test_df = pd.DataFrame(test_data)
                # Create the expected CSV format: question, A, B, C, D, answer
                csv_data = []
                for _, row in test_df.iterrows():
                    csv_row = [
                        row["question"],
                        row["choices"][0],  # A
                        row["choices"][1],  # B
                        row["choices"][2],  # C
                        row["choices"][3],  # D
                        chr(ord("A") + row["answer"]),  # Convert 0,1,2,3 to A,B,C,D
                    ]
                    csv_data.append(csv_row)
                # Save test CSV
                test_csv = pd.DataFrame(csv_data)
                test_file = f"data/test/{subject}_test.csv"
                test_csv.to_csv(test_file, header=False, index=False)
            if val_data:
                # Convert validation data (used as dev set)
                val_df = pd.DataFrame(val_data)
                csv_data = []
                for _, row in val_df.iterrows():
                    csv_row = [
                        row["question"],
                        row["choices"][0],  # A
                        row["choices"][1],  # B
                        row["choices"][2],  # C
                        row["choices"][3],  # D
                        chr(ord("A") + row["answer"]),  # Convert 0,1,2,3 to A,B,C,D
                    ]
                    csv_data.append(csv_row)
                # Save dev CSV
                dev_csv = pd.DataFrame(csv_data)
                dev_file = f"data/dev/{subject}_dev.csv"
                dev_csv.to_csv(dev_file, header=False, index=False)
        # Verify the download
        test_files = [f for f in os.listdir("data/test") if f.endswith("_test.csv")]
        dev_files = [f for f in os.listdir("data/dev") if f.endswith("_dev.csv")]
        print("\nDownload completed statistics:")
        print(f" Test files: {len(test_files)} files")
        print(f" Dev files: {len(dev_files)} files")
        if test_files and dev_files:
            print("✅ MMLU dataset download completed!")
            print("Data locations:")
            print(" - Test set: data/test/")
            print(" - Dev set: data/dev/")
            print("\nAvailable subject examples:")
            for i, f in enumerate(sorted(test_files)[:10]):
                subject = f.replace("_test.csv", "")
                print(f" - {subject}")
            if len(test_files) > 10:
                print(f" ... and {len(test_files) - 10} more subjects")
            print("\n🚀 You can now run MMLU tests:")
            # Script names match the files shipped in tests/lmcache
            print(' ./deploy-baseline-dynamo.sh "Qwen/Qwen3-0.6B"')
            print(
                ' python3 mmlu-baseline-dynamo.py --model "Qwen/Qwen3-0.6B" --number-of-subjects 15'
            )
        else:
            print("❌ Data download incomplete")
    except Exception as e:
        print(f"❌ Download failed: {e}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download MMLU dataset")
    args = parser.parse_args()
    download_mmlu()
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This is a MMLU test script for dynamo baseline testing (without LMCache)
# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/1-mmlu.py
# ASSUMPTIONS:
# 1. dynamo is running (default: localhost:8080) without LMCache
# 2. the mmlu dataset is in a "data" directory
# 3. all invocations of this script should be run in the same directory
# (for later consolidation)
# Standard
import argparse
import json
import os
from typing import Optional
import numpy as np
import pandas as pd
import requests
# Third Party
from tqdm import tqdm
from transformers import AutoTokenizer, set_seed
tokenizer: Optional[AutoTokenizer] = None
choices = ["A", "B", "C", "D"]
# for complete determinism between runs of MMLU, we should:
# 1. set the seed of LLM requests to a fixed number (42)
# 2. set temperature to 0 on requests
def get_llm_response(args, prompt):
    # Use dynamo's completions API format
    data = {
        "model": args.model,
        "prompt": prompt,
        "temperature": 0,
        "max_tokens": 3,
        "stream": False,
        "seed": 42,  # Add explicit seed for determinism
    }
    url = f"http://{args.host}:{args.port}/v1/completions"
    res = requests.post(url, json=data, timeout=30)
    if res.status_code != 200:
        raise Exception(f"Error: {res.status_code} {res.text}")
    response_json = res.json()
    return response_json["choices"][0]["text"]
# grab the idx'th row of the df and generate a prompt string
# format of the MMLU csvs:
# question,option_A,option_B,option_C,option_D,answer
def prompt_string(df, idx, include_answer=True):
    prompt = df.iloc[idx, 0]
    k = df.shape[1] - 2  # number of columns - 2 (question and answer)
    for i in range(k):
        prompt += f"\n{choices[i]}. {df.iloc[idx, i + 1]}"
    prompt += "\nRespond with **only the letter** (A, B, C, D). Do **not** output any explanation, analysis, or extra words. Answer:"
    if include_answer:
        # The answer is the last column (index k + 1); index k is option D
        prompt += f" {df.iloc[idx, k + 1]}\n\n"
    return prompt
def evaluate(args, subject, dev_df, test_df):
    prompts, labels = [], []
    shared_multi_shot_prefix = [
        f"The following are multiple choice questions (with answers) \
about {subject}. \n\n"
    ]
    shared_multi_shot_prefix_length = 0
    for i in range(dev_df.shape[0]):
        # the multi-shot examples should contain answers
        shared_multi_shot_prefix.append(prompt_string(dev_df, i))
        # Use plain list of token IDs, no torch tensors
        assert tokenizer is not None, "Tokenizer must be initialized"
        token_ids = tokenizer(shared_multi_shot_prefix[-1], add_special_tokens=True)[  # type: ignore
            "input_ids"
        ]
        shared_multi_shot_prefix_length += len(token_ids)
        if shared_multi_shot_prefix_length > 4000:
            break
    # all already have double newlines at the end
    shared_multi_shot_prefix_str = "".join(shared_multi_shot_prefix)
    for i in range(test_df.shape[0]):
        # do NOT include the answer for the actual question we want the LLM to answer
        query_prompt = prompt_string(test_df, i, include_answer=False)
        prompt = f"{shared_multi_shot_prefix_str}\n\n{query_prompt}"
        prompts.append(prompt)
        label = test_df.iloc[i, test_df.shape[1] - 1]
        labels.append(label)
    predictions = []
    for i, prompt in enumerate(prompts):
        prediction = get_llm_response(args, prompt)
        prediction_stripped = prediction.strip()
        if prediction_stripped and prediction_stripped[0] in ["A", "B", "C", "D"]:
            predictions.append(prediction_stripped[0])
        else:
            # Fallback: look for any A, B, C, D in the response
            for char in prediction_stripped:
                if char in ["A", "B", "C", "D"]:
                    predictions.append(char)
                    break
            else:
                predictions.append("A")  # Default fallback
    accuracy = np.mean(np.array(predictions) == np.array(labels))
    return accuracy
def main(args):
    global tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.model)
    mmlu_files = os.listdir("data/test")
    test_files = [f for f in mmlu_files if f.endswith("_test.csv")]
    subjects = sorted([f.split("_test.csv")[0] for f in test_files])
    accuracies = []
    num_questions = []
    output_dict = {}
    for subject_raw in tqdm(
        subjects[: args.number_of_subjects], desc="Processing subjects"
    ):
        subject = " ".join(subject_raw.split("_"))  # replace underscores with spaces
        dev_df = pd.read_csv(
            os.path.join("data/dev", subject_raw + "_dev.csv"), header=None
        )
        test_df = pd.read_csv(
            os.path.join("data/test", subject_raw + "_test.csv"), header=None
        )
        accuracy = evaluate(args, subject, dev_df, test_df)
        accuracies.append(accuracy)
        num_questions.append(len(test_df))
        output_dict[subject_raw] = {"accuracy": accuracy, "num_questions": len(test_df)}
    total_accuracy = np.mean(accuracies)
    total_num_questions = sum(num_questions)
    output_dict["total"] = {
        "accuracy": total_accuracy,
        "num_questions": total_num_questions,
    }
    with open(args.result_file, "w") as f:
        # output will be a jsonl file
        for subject, value in output_dict.items():
            f.write(json.dumps({subject: value}) + "\n")


if __name__ == "__main__":
    set_seed(42)  # some tokenizers may have randomness
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, required=True)
    parser.add_argument("--result-file", type=str, required=False)
    parser.add_argument("--number-of-subjects", type=int, required=True)
    parser.add_argument("--host", type=str, default="localhost", help="Dynamo host")
    parser.add_argument("--port", type=int, default=8080, help="Dynamo port")
    args = parser.parse_args()
    if args.result_file is None:
        # Clean model name if it's a path or has slashes
        model_name = args.model.split("/")[-1]
        args.result_file = f"dynamo-baseline-{model_name}.jsonl"
    main(args)
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This is an MMLU test script for dynamo LMCache testing (with LMCache enabled)
# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/2-mmlu.py
# ASSUMPTIONS:
# 1. dynamo is running (default: localhost:8080) with LMCache enabled
# 2. the mmlu dataset is in a "data" directory
# 3. all invocations of this script should be run in the same directory
# (for later consolidation)

# Standard
import argparse
import json
import os
from typing import Optional

# Third Party
import numpy as np
import pandas as pd
import requests
from tqdm import tqdm
from transformers import AutoTokenizer, set_seed

tokenizer: Optional[AutoTokenizer] = None
choices = ["A", "B", "C", "D"]

# for complete determinism between runs of MMLU, we should:
# 1. set the seed of LLM requests to a fixed number (42)
# 2. set temperature to 0 on requests


def get_llm_response(args, prompt):
    # Use dynamo's completions API format
    data = {
        "model": args.model,
        "prompt": prompt,
        "temperature": 0,
        "max_tokens": 3,
        "stream": False,
        "seed": 42,  # Add explicit seed for determinism
    }
    url = f"http://{args.host}:{args.port}/v1/completions"
    res = requests.post(url, json=data, timeout=30)
    if res.status_code != 200:
        raise Exception(f"Error: {res.status_code} {res.text}")
    response_json = res.json()
    return response_json["choices"][0]["text"]


# grab the idx'th row of the df and generate a prompt string
# format of the MMLU csvs:
# question,option_A,option_B,option_C,option_D,answer
def prompt_string(df, idx, include_answer=True):
    prompt = df.iloc[idx, 0]
    k = df.shape[1] - 2  # number of columns - 2 (question and answer)
    for i in range(k):
        prompt += f"\n{choices[i]}. {df.iloc[idx, i + 1]}"
    prompt += "\nRespond with **only the letter** (A, B, C, D). Do **not** output any explanation, analysis, or extra words. Answer:"
    if include_answer:
        # the answer is the last column (index k + 1); index k is the last option
        prompt += f" {df.iloc[idx, k + 1]}\n\n"
    return prompt


def evaluate(args, subject, dev_df, test_df):
    prompts, labels = [], []
    shared_multi_shot_prefix = [
        "The following are multiple choice questions (with answers) "
        f"about {subject}. \n\n"
    ]
    shared_multi_shot_prefix_length = 0
    for i in range(dev_df.shape[0]):
        # the multi-shot examples should contain answers
        shared_multi_shot_prefix.append(prompt_string(dev_df, i))
        # Use plain list of token IDs, no torch tensors
        assert tokenizer is not None, "Tokenizer must be initialized"
        token_ids = tokenizer(shared_multi_shot_prefix[-1], add_special_tokens=True)[  # type: ignore
            "input_ids"
        ]
        shared_multi_shot_prefix_length += len(token_ids)
        if shared_multi_shot_prefix_length > 4000:
            break
    # all already have double newlines at the end
    shared_multi_shot_prefix_str = "".join(shared_multi_shot_prefix)
    for i in range(test_df.shape[0]):
        # do NOT include the answer for the actual question we want the LLM to answer
        query_prompt = prompt_string(test_df, i, include_answer=False)
        prompt = f"{shared_multi_shot_prefix_str}\n\n{query_prompt}"
        prompts.append(prompt)
        label = test_df.iloc[i, test_df.shape[1] - 1]
        labels.append(label)
    predictions = []
    for i, prompt in enumerate(prompts):
        prediction = get_llm_response(args, prompt)
        prediction_stripped = prediction.strip()
        if prediction_stripped and prediction_stripped[0] in ["A", "B", "C", "D"]:
            predictions.append(prediction_stripped[0])
        else:
            # Fallback: look for any A, B, C, D in the response
            for char in prediction_stripped:
                if char in ["A", "B", "C", "D"]:
                    predictions.append(char)
                    break
            else:
                predictions.append("A")  # Default fallback
    accuracy = np.mean(np.array(predictions) == np.array(labels))
    return accuracy


def main(args):
    global tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.model)
    mmlu_files = os.listdir("data/test")
    test_files = [f for f in mmlu_files if f.endswith("_test.csv")]
    subjects = sorted([f.split("_test.csv")[0] for f in test_files])
    accuracies = []
    num_questions = []
    output_dict = {}
    for subject_raw in tqdm(
        subjects[: args.number_of_subjects], desc="Processing subjects"
    ):
        subject = " ".join(subject_raw.split("_"))  # replace underscores with spaces
        dev_df = pd.read_csv(
            os.path.join("data/dev", subject_raw + "_dev.csv"), header=None
        )
        test_df = pd.read_csv(
            os.path.join("data/test", subject_raw + "_test.csv"), header=None
        )
        accuracy = evaluate(args, subject, dev_df, test_df)
        accuracies.append(accuracy)
        num_questions.append(len(test_df))
        output_dict[subject_raw] = {"accuracy": accuracy, "num_questions": len(test_df)}
    total_accuracy = np.mean(accuracies)
    total_num_questions = sum(num_questions)
    output_dict["total"] = {
        "accuracy": total_accuracy,
        "num_questions": total_num_questions,
    }
    with open(args.result_file, "w") as f:
        # output will be a jsonl file
        for subject, value in output_dict.items():
            f.write(json.dumps({subject: value}) + "\n")


if __name__ == "__main__":
    set_seed(42)  # some tokenizers may have randomness
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, required=True)
    parser.add_argument("--result-file", type=str, required=False)
    parser.add_argument("--number-of-subjects", type=int, required=True)
    parser.add_argument("--host", type=str, default="localhost", help="Dynamo host")
    parser.add_argument("--port", type=int, default=8080, help="Dynamo port")
    args = parser.parse_args()
    if args.result_file is None:
        # Clean model name if it's a path or has slashes
        model_name = args.model.split("/")[-1]
        args.result_file = f"dynamo-lmcache-{model_name}.jsonl"
    main(args)
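
The header comments mention running both scripts in the same directory "for later consolidation". The commit shown here does not include that step, so the following is only a minimal sketch of what it might look like. It assumes the JSONL layout written by `main` above (one `{subject: {"accuracy": ..., "num_questions": ...}}` object per line, ending with a `"total"` entry). The helper names `parse_results` and `compare_results` and the `tolerance` value of 0.01 are hypothetical, chosen for illustration.

```python
import json


def parse_results(lines):
    """Merge JSONL lines of the form {subject: stats} into a single dict.

    `lines` is any iterable of strings, for example an open file handle.
    Blank lines are skipped.
    """
    results = {}
    for line in lines:
        line = line.strip()
        if line:
            results.update(json.loads(line))
    return results


def compare_results(baseline, lmcache, tolerance=0.01):
    """Return {subject: accuracy_diff} where the two runs disagree.

    A subject is reported only if it appears in both runs and the absolute
    accuracy difference exceeds `tolerance` (an assumed threshold, not one
    taken from the scripts above).
    """
    mismatches = {}
    for subject in sorted(set(baseline) & set(lmcache)):
        diff = lmcache[subject]["accuracy"] - baseline[subject]["accuracy"]
        if abs(diff) > tolerance:
            mismatches[subject] = diff
    return mismatches
```

To use it, open the two result files (by default `dynamo-baseline-<model_name>.jsonl` and `dynamo-lmcache-<model_name>.jsonl`), pass each handle to `parse_results`, and pass the two dicts to `compare_results`. Because both scripts fix the seed and set the temperature to 0, an empty result is the outcome one would hope for if LMCache does not change model output.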