feat: LMCache integration in newest ux (#2079)

Signed-off-by: ZichengMa <zichengma1225@gmail.com> Co-authored-by: Ziqi Fan <ziqif@nvidia.com>

feat: LMCache integration in newest ux (#2079)
Signed-off-by: ZichengMa <zichengma1225@gmail.com> Co-authored-by: Ziqi Fan <ziqif@nvidia.com>
784da90e · ZichengMa · GitHub · 63fbf498 · 784da90e · 784da90e
Unverified Commit 784da90e authored Aug 06, 2025 by ZichengMa Committed by GitHub Aug 07, 2025
20 changed files
--- a/components/backends/vllm/LMCache_Integration.md
+++ b/components/backends/vllm/LMCache_Integration.md
+# LMCache Integration in Dynamo
+
+## Introduction
+
+LMCache is a high-performance KV cache layer that supercharges LLM serving by enabling **prefill-once, reuse-everywhere** semantics. As described in the [official documentation](https://docs.lmcache.ai/index.html), LMCache lets LLMs prefill each text only once by storing the KV caches of all reusable texts, allowing reuse of KV caches for any reused text (not necessarily prefix) across any serving engine instance.
+
+This document describes how LMCache is integrated into Dynamo's vLLM backend to provide enhanced performance and memory efficiency.
+
+### Key Benefits
+- **Reduced Time to First Token (TTFT)**: Eliminates redundant prefill computations
+- **Memory Offloading**: Intelligent KV cache placement across CPU/GPU/storage tiers
+- **Improved Throughput**: Reduced GPU memory pressure enables higher batch sizes
+
+## Aggregated Serving
+
+
+### Configuration
+
+LMCache is enabled by setting the `ENABLE_LMCACHE` environment variable:
+
+```bash
+export ENABLE_LMCACHE=1
+```
+
+Additional LMCache configuration can be customized via environment variables:
+- `LMCACHE_CHUNK_SIZE=256` - Token chunk size for cache granularity (default: 256)
+- `LMCACHE_LOCAL_CPU=True` - Enable CPU memory backend for offloading
+- `LMCACHE_MAX_LOCAL_CPU_SIZE=20` - CPU memory limit in GB (user can adjust based on available RAM to a fixed value)
+
+For advanced configurations, LMCache supports multiple [storage backends](https://docs.lmcache.ai/index.html):
+- **CPU RAM**: Fast local memory offloading
+- **Local Storage**: Disk-based persistence
+- **Redis**: Distributed cache sharing
+- **GDS Backend**: GPU Direct Storage for high throughput
+- **InfiniStore/Mooncake**: Cloud-native storage solutions
+
+### Deployment
+
+Use the provided launch script for quick setup:
+
+```bash
+./components/backends/vllm/launch/agg_lmcache.sh
+```
+
+This will:
+1. Start the dynamo frontend
+2. Launch a single vLLM worker with LMCache enabled
+
+### Architecture for Aggregated Mode
+
+In aggregated mode, the system uses:
+- **KV Connector**: `LMCacheConnectorV1`
+- **KV Role**: `kv_both` (handles both reading and writing)
+
+## Disaggregated Serving
+
+Disaggregated serving separates prefill and decode operations into dedicated workers. This provides better resource utilization and scalability for production deployments.
+
+### Configuration
+
+The same `ENABLE_LMCACHE=1` environment variable enables LMCache, but the system automatically configures different connector setups for prefill and decode workers.
+
+### Deployment
+
+Use the provided disaggregated launch script(the script requires at least 2 GPUs):
+
+```bash
+./components/backends/vllm/launch/disagg_lmcache.sh
+```
+
+This will:
+1. Start the dynamo frontend
+2. Launch a decode worker on GPU 0
+3. Wait for initialization
+4. Launch a prefill worker on GPU 1 with LMCache enabled
+
+### Worker Roles
+
+#### Decode Worker
+- **Purpose**: Handles token generation (decode phase)
+- **GPU Assignment**: CUDA_VISIBLE_DEVICES=0
+- **LMCache Config**: Uses `NixlConnector` only for kv transfer between prefill and decode workers
+
+#### Prefill Worker
+- **Purpose**: Handles prompt processing (prefill phase)
+- **GPU Assignment**: CUDA_VISIBLE_DEVICES=1
+- **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This enables prefill worker to use LMCache for kv offloading and use NIXL for kv transfer between prefill and decode workers.
+- **Flag**: `--is-prefill-worker`
+
+## Architecture
+
+### KV Transfer Configuration
+
+The system automatically configures KV transfer based on the deployment mode and worker type:
+
+#### Prefill Worker (Disaggregated Mode)
+```python
+kv_transfer_config = KVTransferConfig(
+    kv_connector="MultiConnector",
+    kv_role="kv_both",
+    kv_connector_extra_config={
+        "connectors": [
+            {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"},
+            {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
+        ]
+    }
+)
+```
+
+#### Decode Worker or Aggregated Mode
+```python
+kv_transfer_config = KVTransferConfig(
+    kv_connector="LMCacheConnectorV1",
+    kv_role="kv_both"
+)
+```
+
+#### Fallback (No LMCache)
+```python
+kv_transfer_config = KVTransferConfig(
+    kv_connector="NixlConnector",
+    kv_role="kv_both"
+)
+```
+
+### Environment Setup
+
+The system automatically configures LMCache environment variables when enabled:
+
+```python
+lmcache_config = {
+    "LMCACHE_CHUNK_SIZE": "256",
+    "LMCACHE_LOCAL_CPU": "True",
+    "LMCACHE_MAX_LOCAL_CPU_SIZE": "20"
+}
+```
+
+### Integration Points
+
+1. **Argument Parsing** (`args.py`):
+   - Detects `ENABLE_LMCACHE` environment variable
+   - Configures appropriate KV transfer settings
+   - Sets up connector configurations based on worker type
+
+2. **Engine Setup** (`main.py`):
+   - Initializes LMCache environment variables
+   - Creates vLLM engine with proper KV transfer config
+   - Handles both aggregated and disaggregated modes
+
+
+### Best Practices
+
+1. **Chunk Size Tuning**: Adjust `LMCACHE_CHUNK_SIZE` based on your use case:
+   - Smaller chunks (128-256): Better reuse granularity for varied content
+   - Larger chunks (512-1024): More efficient for repetitive content patterns
+
+2. **Memory Allocation**: Set `LMCACHE_MAX_LOCAL_CPU_SIZE` conservatively:
+   - Leave sufficient RAM for other system processes
+   - Monitor memory usage during peak loads
+
+3. **Workload Optimization**: LMCache performs best with:
+   - Repeated prompt patterns (RAG, multi-turn conversations)
+   - Shared context across sessions
+   - Long-running services with warm caches
+
+## References and Additional Resources
+
+- [LMCache Documentation](https://docs.lmcache.ai/index.html) - Comprehensive guide and API reference
+- [Configuration Reference](https://docs.lmcache.ai/api_reference/config.html) - Detailed configuration options
+
--- a/components/backends/vllm/launch/agg.sh
+++ b/components/backends/vllm/launch/agg.sh
@@ -8,4 +8,5 @@ trap 'echo Cleaning up...; kill 0' EXIT
 python -m dynamo.frontend &

 # run worker
+# --enforce-eager is added for quick deployment. for production use, need to remove this flag
 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --no-enable-prefix-caching
--- a/components/backends/vllm/launch/agg_lmcache.sh
+++ b/components/backends/vllm/launch/agg_lmcache.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+set -e
+trap 'echo Cleaning up...; kill 0' EXIT
+
+# run ingress
+python -m dynamo.frontend &
+
+# run worker with LMCache enabled
+ENABLE_LMCACHE=1 \
+LMCACHE_CHUNK_SIZE=256 \
+LMCACHE_LOCAL_CPU=True \
+LMCACHE_MAX_LOCAL_CPU_SIZE=20 \
+  python -m dynamo.vllm --model Qwen/Qwen3-0.6B
\ No newline at end of file
--- a/components/backends/vllm/launch/agg_router.sh
+++ b/components/backends/vllm/launch/agg_router.sh
@@ -8,6 +8,7 @@ trap 'echo Cleaning up...; kill 0' EXIT
 python -m dynamo.frontend --router-mode kv &

 # run workers
+# --enforce-eager is added for quick deployment. for production use, need to remove this flag
 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &

 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager
--- a/components/backends/vllm/launch/dep.sh
+++ b/components/backends/vllm/launch/dep.sh
@@ -10,6 +10,7 @@ python -m dynamo.frontend --router-mode kv &
 # Data Parallel Attention / Expert Parallelism
 # Routing to DP workers managed by Dynamo
 # Chose Qwen3-30B because its a small MOE that can fit on smaller GPUs (L40S for example)
+# --enforce-eager is added for quick deployment. for production use, need to remove this flag
 for i in {0..3}; do
    CUDA_VISIBLE_DEVICES=$i python3 -m dynamo.vllm \
    --model Qwen/Qwen3-30B-A3B \

--- a/components/backends/vllm/launch/disagg.sh
+++ b/components/backends/vllm/launch/disagg.sh
@@ -7,6 +7,7 @@ trap 'echo Cleaning up...; kill 0' EXIT
 # run ingress
 python -m dynamo.frontend --router-mode kv &

+# --enforce-eager is added for quick deployment. for production use, need to remove this flag
 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &

 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \

--- a/components/backends/vllm/launch/disagg_lmcache.sh
+++ b/components/backends/vllm/launch/disagg_lmcache.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+set -e
+trap 'echo Cleaning up...; kill 0' EXIT
+
+# run ingress with KV router
+python -m dynamo.frontend --router-mode kv &
+
+# run decode worker on GPU 0, without enabling LMCache
+CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B &
+
+# wait for decode worker to initialize
+sleep 20
+
+# run prefill worker on GPU 1 with LMCache
+ENABLE_LMCACHE=1 \
+LMCACHE_CHUNK_SIZE=256 \
+LMCACHE_LOCAL_CPU=True \
+LMCACHE_MAX_LOCAL_CPU_SIZE=20 \
+CUDA_VISIBLE_DEVICES=1 \
+  python3 -m dynamo.vllm \
+    --model Qwen/Qwen3-0.6B \
+    --is-prefill-worker
\ No newline at end of file
--- a/components/backends/vllm/launch/disagg_router.sh
+++ b/components/backends/vllm/launch/disagg_router.sh
@@ -9,6 +9,7 @@ trap 'echo Cleaning up...; kill 0' EXIT
 python -m dynamo.frontend --router-mode kv &

 # routing will happen between the two decode workers
+# --enforce-eager is added for quick deployment. for production use, need to remove this flag
 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &

 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &

--- a/components/backends/vllm/src/dynamo/vllm/args.py
+++ b/components/backends/vllm/src/dynamo/vllm/args.py
@@ -29,6 +29,9 @@ logger = logging.getLogger(__name__)
 DEFAULT_ENDPOINT = "dyn://dynamo.backend.generate"
 DEFAULT_MODEL = "Qwen/Qwen3-0.6B"

+# Global LMCache configuration - initialize once on module import
+ENABLE_LMCACHE = os.getenv("ENABLE_LMCACHE", "0").lower() in ("1", "true", "yes")
+

 class Config:
    """Command line parameters or defaults"""
@@ -209,6 +212,37 @@ def overwrite_args(config):

    dp_rank = config.engine_args.data_parallel_rank or 0

+    # Set kv_transfer_config based on LMCache setting
+    if ENABLE_LMCACHE:
+        if config.is_prefill_worker:
+            # Prefill worker use LMCache with disaggregated serving (MultiConnector) for disaggregated serving
+            kv_transfer_config = KVTransferConfig(
+                kv_connector="MultiConnector",
+                kv_role="kv_both",
+                kv_connector_extra_config={
+                    "connectors": [
+                        {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"},
+                        {
+                            "kv_connector": "NixlConnector",
+                            "kv_role": "kv_both",
+                        },
+                    ]
+                },
+            )
+            logger.info("Using LMCache with MultiConnector serving")
+        else:
+            # If enable lmcache, single node in default uses single connector serving
+            kv_transfer_config = KVTransferConfig(
+                kv_connector="LMCacheConnectorV1", kv_role="kv_both"
+            )
+            logger.info("Using LMCache with LMCacheConnector serving")
+
+    else:
+        kv_transfer_config = KVTransferConfig(
+            kv_connector="NixlConnector", kv_role="kv_both"
+        )
+        logger.info("Using NixlConnector configuration")
+
    defaults = {
        "task": "generate",
        # As of vLLM >=0.10.0 the engine unconditionally calls
@@ -219,9 +253,7 @@ def overwrite_args(config):
        "disable_log_requests": True,
        # KV routing relies on logging KV metrics
        "disable_log_stats": False,
-        "kv_transfer_config": KVTransferConfig(
-            kv_connector="NixlConnector", kv_role="kv_both"
-        ),
+        "kv_transfer_config": kv_transfer_config,
    }

    if config.engine_args.enable_prefix_caching:

--- a/components/backends/vllm/src/dynamo/vllm/main.py
+++ b/components/backends/vllm/src/dynamo/vllm/main.py
@@ -20,7 +20,13 @@ from dynamo.llm import (
 from dynamo.runtime import DistributedRuntime, dynamo_worker
 from dynamo.runtime.logging import configure_dynamo_logging

-from .args import Config, configure_ports_with_etcd, overwrite_args, parse_args
+from .args import (
+    ENABLE_LMCACHE,
+    Config,
+    configure_ports_with_etcd,
+    overwrite_args,
+    parse_args,
+)
 from .handlers import DecodeWorkerHandler, PrefillWorkerHandler
 from .publisher import StatLoggerFactory

@@ -28,6 +34,22 @@ configure_dynamo_logging()
 logger = logging.getLogger(__name__)


+def setup_lmcache_environment():
+    """Setup LMCache environment variables for KV cache offloading"""
+    # LMCache configuration for matching logic
+    lmcache_config = {
+        "LMCACHE_CHUNK_SIZE": "256",  # Token chunk size
+        "LMCACHE_LOCAL_CPU": "True",  # Enable CPU memory backend
+        "LMCACHE_MAX_LOCAL_CPU_SIZE": "20",  # CPU memory limit in GB
+    }
+
+    # Set environment variables
+    for key, value in lmcache_config.items():
+        if key not in os.environ:  # Only set if not already configured
+            os.environ[key] = value
+            logger.info(f"Set LMCache environment variable: {key}={value}")
+
+
 async def graceful_shutdown(runtime):
    """
    Shutdown dynamo distributed runtime.
@@ -70,6 +92,14 @@ def setup_vllm_engine(config, stat_logger=None):
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    engine_args = config.engine_args
+
+    # KV transfer config is now handled by args.py based on ENABLE_LMCACHE env var
+    if ENABLE_LMCACHE:
+        setup_lmcache_environment()
+        logger.info("LMCache enabled for VllmWorker")
+    else:
+        logger.info("LMCache is disabled")
+
    # Load default sampling params from `generation_config.json`
    default_sampling_params = (
        engine_args.create_model_config().get_diff_sampling_param()
@@ -90,6 +120,9 @@ def setup_vllm_engine(config, stat_logger=None):
        disable_log_requests=engine_args.disable_log_requests,
        disable_log_stats=engine_args.disable_log_stats,
    )
+    if ENABLE_LMCACHE:
+        logger.info(f"VllmWorker for {config.model} has been initialized with LMCache")
+    else:
        logger.info(f"VllmWorker for {config.model} has been initialized")
    return engine_client, vllm_config, default_sampling_params

@@ -98,7 +131,6 @@ async def init_prefill(runtime: DistributedRuntime, config: Config):
    """
    Instantiate and serve
    """
-
    component = runtime.namespace(config.namespace).component(config.component)
    await component.create_service()


--- a/container/Dockerfile.vllm
+++ b/container/Dockerfile.vllm
@@ -12,7 +12,7 @@ ARG RUNTIME_IMAGE="nvcr.io/nvidia/cuda"
 ARG RUNTIME_IMAGE_TAG="12.8.1-runtime-ubuntu24.04"

 # Make sure to update the dependency version in pyproject.toml when updating this
-ARG VLLM_REF="v0.10.0"
+ARG VLLM_REF="f4135232b9a8c4845f8961fb1cd17581c56ae2ce"
 ARG TORCH_BACKEND="cu128"

 # Match 0.10.0 vLLM release

--- a/container/deps/vllm/install_vllm.sh
+++ b/container/deps/vllm/install_vllm.sh
@@ -20,7 +20,7 @@ set -euo pipefail

 # Parse arguments
 EDITABLE=true
-VLLM_REF="v0.10.0"
+VLLM_REF="f4135232b9a8c4845f8961fb1cd17581c56ae2ce"
 MAX_JOBS=16
 INSTALLATION_DIR=/tmp
 ARCH=$(uname -m)
@@ -78,12 +78,12 @@ while [[ $# -gt 0 ]]; do
            echo "Options:"
            echo "  --editable        Install vllm in editable mode (default)"
            echo "  --no-editable     Install vllm in non-editable mode"
-            echo "  --vllm-ref REF    Git reference to checkout (default: 059d4cd)"
+            echo "  --vllm-ref REF    Git reference to checkout (default: f4135232b9a8c4845f8961fb1cd17581c56ae2ce)"
            echo "  --max-jobs NUM    Maximum number of parallel jobs (default: 16)"
            echo "  --arch ARCH       Architecture (amd64|arm64, default: auto-detect)"
            echo "  --installation-dir DIR  Directory to install vllm (default: /tmp/vllm)"
-            echo "  --deepgemm-ref REF  Git reference for DeepGEMM (default: 6c9558e)"
-            echo "  --flashinf-ref REF  Git reference for Flash Infer (default: 1d72ed4)"
+            echo "  --deepgemm-ref REF  Git reference for DeepGEMM (default: 1876566)"
+            echo "  --flashinf-ref REF  Git reference for Flash Infer (default: v0.2.8rc1)"
            echo "  --torch-backend BACKEND  Torch backend to use (default: cu128)"
            exit 0
            ;;
@@ -107,6 +107,9 @@ echo "  TORCH_BACKEND: $TORCH_BACKEND"
 # Install common dependencies
 uv pip install pip cuda-python

+# Install LMCache
+uv pip install lmcache
+
 # Create vllm directory and clone
 mkdir -p $INSTALLATION_DIR
 cd $INSTALLATION_DIR

--- a/tests/lmcache/README.md
+++ b/tests/lmcache/README.md
+# LMCache Dynamo MMLU Testing Suite
+
+## Overview
+Test the correctness of Dynamo integration with LMCache by comparing MMLU benchmark results with and without LMCache enabled.
+
+## Testing Principle
+Compare MMLU test results under two configurations:
+- **Baseline Test**: Dynamo without LMCache (`ENABLE_LMCACHE=0`)
+- **LMCache Test**: Dynamo with LMCache enabled (`ENABLE_LMCACHE=1`)
+
+If both configurations produce the same inference results, it verifies that LMCache functionality is correct.
+
+## Quick Start
+
+### Prerequisites
+1. Ensure dynamo and its dependencies are properly installed (i.e. nats and etcd are running)
+2. Download MMLU dataset to `data` directory
+3. Ensure HuggingFace models are accessible
+
+### Download MMLU Dataset
+
+```bash
+cd ./tests/lmcache
+
+# Auto-download and organize data
+python3 download_mmlu.py
+```
+
+### Run Single Model Test
+Change model name in the script to test other models.
+```bash
+cd ./tests/lmcache
+
+# 1. Baseline test (without LMCache)
+./deploy-baseline-dynamo.sh Qwen/Qwen3-0.6B
+# Wait for model to load, then run test in another terminal:
+python3 mmlu-baseline-dynamo.py --model Qwen/Qwen3-0.6B --number-of-subjects 15
+# Stop services with Ctrl+C in the deploy script terminal
+
+# 2. LMCache test (with LMCache enabled)
+./deploy-lmcache_enabled-dynamo.sh Qwen/Qwen3-0.6B
+# Wait for model to load, then run test in another terminal:
+python3 mmlu-lmcache_enabled-dynamo.py --model Qwen/Qwen3-0.6B --number-of-subjects 15
+# Stop services with Ctrl+C in the deploy script terminal
+
+# 3. Compare results
+python3 summarize_scores_dynamo.py
+```
+
+## File Description
+
+### Deployment Scripts
+- **`deploy-baseline-dynamo.sh`**: Deploy Dynamo without LMCache (baseline)
+- **`deploy-lmcache_enabled-dynamo.sh`**: Deploy Dynamo with LMCache enabled (test)
+
+### Test Scripts
+- **`mmlu-baseline-dynamo.py`**: Run MMLU test on baseline Dynamo
+- **`mmlu-lmcache_enabled-dynamo.py`**: Run MMLU test on Dynamo with LMCache
+- **`summarize_scores_dynamo.py`**: Compare and analyze test results
+
+## Architecture Differences
+
+### Baseline Architecture (deploy-baseline-dynamo.sh)
+```
+HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → Direct Inference
+Environment: ENABLE_LMCACHE=0
+```
+
+### LMCache Architecture (deploy-lmcache_enabled-dynamo.sh)
+```
+HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → LMCache-enabled Inference
+Environment: ENABLE_LMCACHE=1
+            LMCACHE_CHUNK_SIZE=256
+            LMCACHE_LOCAL_CPU=True
+            LMCACHE_MAX_LOCAL_CPU_SIZE=1.0
+```
+
+## API Format
+
+Test scripts use Dynamo's Chat Completions API:
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": Qwen/Qwen3-0.6B,
+    "messages": [{"role": "user", "content": "question content"}],
+    "temperature": 0,
+    "max_tokens": 3,
+    "stream": false,
+    "seed": 42
+  }'
+```
+
+
+## Result Interpretation
+
+After testing completes, the following files will be generated:
+- `dynamo-baseline-{model_name}.jsonl`: Baseline test results
+- `dynamo-lmcache-{model_name}.jsonl`: LMCache test results
+
+If the accuracy in both result files is very close (difference < 1%), it indicates LMCache functionality is correct.
+
+## Notes
+
+1. **Determinism guarantee**: All tests use the same seed (42) and zero temperature to ensure reproducible results
+2. **Pre-requisites**: Ensure nats and etcd are running.
+3. **Sequential execution**: Must stop the first test before starting the second to avoid port conflicts
\ No newline at end of file
--- a/tests/lmcache/deploy-baseline-dynamo-disag.sh
+++ b/tests/lmcache/deploy-baseline-dynamo-disag.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# ASSUMPTION: dynamo and its dependencies are properly installed
+# i.e. nats and etcd are running
+
+# Overview:
+# This script deploys dynamo disaggregated serving without LMCache on port 8080
+# Used as baseline for correctness testing
+set -e
+trap 'echo Cleaning up...; kill 0' EXIT
+
+# Arguments:
+MODEL_URL=$1
+
+if [ -z "$MODEL_URL" ]; then
+    echo "Usage: $0 <MODEL_URL>"
+    echo "Example: $0 Qwen/Qwen3-0.6B"
+    exit 1
+fi
+
+echo "🚀 Starting dynamo disaggregated serving setup without LMCache:"
+echo "   Model: $MODEL_URL"
+echo "   Port: 8080"
+echo "   Mode: Disaggregated (prefill + decode workers)"
+
+# Kill any existing dynamo processes
+echo "🧹 Cleaning up any existing dynamo processes..."
+pkill -f "dynamo-run" || true
+sleep 2
+
+# Disable LMCache
+export ENABLE_LMCACHE=0
+echo "🔧 Starting dynamo disaggregated serving without LMCache..."
+
+python -m dynamo.frontend &
+
+CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model $MODEL_URL&
+
+CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
+    --model $MODEL_URL \
+    --is-prefill-worker
\ No newline at end of file
--- a/tests/lmcache/deploy-baseline-dynamo.sh
+++ b/tests/lmcache/deploy-baseline-dynamo.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# ASSUMPTION: dynamo and its dependencies are properly installed
+# i.e. nats and etcd are running
+
+# Overview:
+# This script deploys dynamo without LMCache on port 8080
+# Used as baseline for correctness testing
+set -e
+trap 'echo Cleaning up...; kill 0' EXIT
+# Arguments:
+MODEL_URL=$1
+
+if [ -z "$MODEL_URL" ]; then
+    echo "Usage: $0 <MODEL_URL>"
+    echo "Example: $0 Qwen/Qwen3-0.6B"
+    exit 1
+fi
+
+echo "🚀 Starting dynamo setup without LMCache:"
+echo "   Model: $MODEL_URL"
+echo "   Port: 8080"
+
+# Kill any existing dynamo processes
+echo "🧹 Cleaning up any existing dynamo processes..."
+pkill -f "dynamo-run" || true
+sleep 2
+
+# Disable LMCache
+export ENABLE_LMCACHE=0
+echo "🔧 Starting dynamo worker without LMCache..."
+
+
+python -m dynamo.frontend &
+python3 -m dynamo.vllm --model $MODEL_URL
\ No newline at end of file
--- a/tests/lmcache/deploy-lmcache_enabled-dynamo-disag.sh
+++ b/tests/lmcache/deploy-lmcache_enabled-dynamo-disag.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# ASSUMPTION: dynamo and its dependencies are properly installed
+# i.e. nats and etcd are running
+
+# Overview:
+# This script deploys dynamo disaggregated serving with LMCache enabled on port 8080
+# Used for LMCache correctness testing
+set -e
+trap 'echo Cleaning up...; kill 0' EXIT
+
+# Arguments:
+MODEL_URL=$1
+
+if [ -z "$MODEL_URL" ]; then
+    echo "Usage: $0 <MODEL_URL>"
+    echo "Example: $0 Qwen/Qwen3-0.6B"
+    exit 1
+fi
+
+echo "🚀 Starting dynamo disaggregated serving setup with LMCache:"
+echo "   Model: $MODEL_URL"
+echo "   Port: 8080"
+echo "   Mode: Disaggregated (prefill + decode workers) + LMCache"
+echo "   !! Remember to kill the old dynamo processes otherwise the port will be busy !!"
+
+# Kill any existing dynamo processes
+echo "🧹 Cleaning up any existing dynamo processes..."
+pkill -f "dynamo-run" || true
+sleep 2
+
+echo "🔧 Starting dynamo disaggregated serving with LMCache enabled..."
+
+python -m dynamo.frontend &
+
+CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model $MODEL_URL&
+
+sleep 20
+
+# run prefill worker on GPU 1 with LMCache
+ENABLE_LMCACHE=1 \
+LMCACHE_CHUNK_SIZE=256 \
+LMCACHE_LOCAL_CPU=True \
+LMCACHE_MAX_LOCAL_CPU_SIZE=20 \
+CUDA_VISIBLE_DEVICES=1 \
+  python3 -m dynamo.vllm \
+    --model $MODEL_URL \
+    --is-prefill-worker
\ No newline at end of file
--- a/tests/lmcache/deploy-lmcache_enabled-dynamo.sh
+++ b/tests/lmcache/deploy-lmcache_enabled-dynamo.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# ASSUMPTION: dynamo and its dependencies are properly installed
+# i.e. nats and etcd are running
+
+# Overview:
+# This script deploys dynamo with LMCache enabled on port 8080
+# Used for LMCache correctness testing
+set -e
+trap 'echo Cleaning up...; kill 0' EXIT
+# Arguments:
+MODEL_URL=$1
+
+if [ -z "$MODEL_URL" ]; then
+    echo "Usage: $0 <MODEL_URL>"
+    echo "Example: $0 Qwen/Qwen3-0.6B"
+    exit 1
+fi
+
+echo "🚀 Starting dynamo setup with LMCache:"
+echo "   Model: $MODEL_URL"
+echo "   Port: 8080"
+echo "   !! Remmber to kill the old dynamo processes other wise the port will be busy !! "
+
+# Kill any existing dynamo processes
+echo "🧹 Cleaning up any existing dynamo processes..."
+pkill -f "dynamo-run" || true
+sleep 2
+
+echo "🔧 Starting dynamo worker with LMCache enabled..."
+
+python -m dynamo.frontend &
+ENABLE_LMCACHE=1 \
+  python3 -m dynamo.vllm --model $MODEL_URL
\ No newline at end of file
--- a/tests/lmcache/download_mmlu.py
+++ b/tests/lmcache/download_mmlu.py
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Download MMLU dataset using HuggingFace datasets library.
+This is a simpler alternative to the bash script.
+"""
+
+import argparse
+import os
+
+import pandas as pd
+from datasets import load_dataset
+
+
+def download_mmlu():
+    """Download and prepare MMLU dataset."""
+    print("Downloading MMLU dataset...")
+
+    # Create data directories
+    os.makedirs("data/test", exist_ok=True)
+    os.makedirs("data/dev", exist_ok=True)
+
+    try:
+        # Download MMLU dataset
+        print("🔄 Loading MMLU dataset from HuggingFace...")
+        dataset = load_dataset("cais/mmlu", "all")
+
+        print("📊 Dataset information:")
+        print(f"  - Training set: {len(dataset['auxiliary_train'])} samples")
+        print(f"  - Validation set: {len(dataset['validation'])} samples")
+        print(f"  - Test set: {len(dataset['test'])} samples")
+
+        # Get all subjects
+        subjects_set = set()
+        for split in ["validation", "test"]:
+            for item in dataset[split]:
+                subjects_set.add(item["subject"])
+
+        subjects = sorted(list(subjects_set))
+        print(f"Contains {len(subjects)} subjects")
+
+        # Process each subject
+        print("Organizing data files...")
+        for subject in subjects:
+            # Filter data for this subject
+            test_data = [item for item in dataset["test"] if item["subject"] == subject]
+            val_data = [
+                item for item in dataset["validation"] if item["subject"] == subject
+            ]
+
+            if test_data:
+                # Convert to DataFrame and save as CSV
+                test_df = pd.DataFrame(test_data)
+
+                # Create the expected CSV format: question, A, B, C, D, answer
+                csv_data = []
+                for _, row in test_df.iterrows():
+                    csv_row = [
+                        row["question"],
+                        row["choices"][0],  # A
+                        row["choices"][1],  # B
+                        row["choices"][2],  # C
+                        row["choices"][3],  # D
+                        chr(ord("A") + row["answer"]),  # Convert 0,1,2,3 to A,B,C,D
+                    ]
+                    csv_data.append(csv_row)
+
+                # Save test CSV
+                test_csv = pd.DataFrame(csv_data)
+                test_file = f"data/test/{subject}_test.csv"
+                test_csv.to_csv(test_file, header=False, index=False)
+
+            if val_data:
+                # Convert validation data (used as dev set)
+                val_df = pd.DataFrame(val_data)
+
+                csv_data = []
+                for _, row in val_df.iterrows():
+                    csv_row = [
+                        row["question"],
+                        row["choices"][0],  # A
+                        row["choices"][1],  # B
+                        row["choices"][2],  # C
+                        row["choices"][3],  # D
+                        chr(ord("A") + row["answer"]),  # Convert 0,1,2,3 to A,B,C,D
+                    ]
+                    csv_data.append(csv_row)
+
+                # Save dev CSV
+                dev_csv = pd.DataFrame(csv_data)
+                dev_file = f"data/dev/{subject}_dev.csv"
+                dev_csv.to_csv(dev_file, header=False, index=False)
+
+        # Verify the download
+        test_files = [f for f in os.listdir("data/test") if f.endswith("_test.csv")]
+        dev_files = [f for f in os.listdir("data/dev") if f.endswith("_dev.csv")]
+
+        print("\nDownload completed statistics:")
+        print(f"  Test files: {len(test_files)} files")
+        print(f"  Dev files: {len(dev_files)} files")
+
+        if test_files and dev_files:
+            print("✅ MMLU dataset download completed!")
+            print("Data locations:")
+            print("  - Test set: data/test/")
+            print("  - Dev set: data/dev/")
+
+            print("\nAvailable subject examples:")
+            for i, f in enumerate(sorted(test_files)[:10]):
+                subject = f.replace("_test.csv", "")
+                print(f"  - {subject}")
+            if len(test_files) > 10:
+                print(f"  ... and {len(test_files) - 10} more subjects")
+
+            print("\n🚀 You can now run MMLU tests:")
+            print('   ./deploy-1-dynamo.sh "Qwen/Qwen3-0.6B"')
+            print(
+                '   python3 1-mmlu-dynamo.py --model "Qwen/Qwen3-0.6B" --number-of-subjects 15'
+            )
+        else:
+            print("❌ Data download incomplete")
+
+    except Exception as e:
+        print(f"❌ Download failed: {e}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Download MMLU dataset")
+    args = parser.parse_args()
+    download_mmlu()
--- a/tests/lmcache/mmlu-baseline-dynamo.py
+++ b/tests/lmcache/mmlu-baseline-dynamo.py
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# This is a MMLU test script for dynamo baseline testing (without LMCache)
+# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/1-mmlu.py
+
+# ASSUMPTIONS:
+# 1. dynamo is running (default: localhost:8080) without LMCache
+# 2. the mmlu dataset is in a "data" directory
+# 3. all invocations of this script should be run in the same directory
+#    (for later consolidation)
+
+# Standard
+import argparse
+import json
+import os
+from typing import Optional
+
+import numpy as np
+import pandas as pd
+import requests
+
+# Third Party
+from tqdm import tqdm
+from transformers import AutoTokenizer, set_seed
+
+tokenizer: Optional[AutoTokenizer] = None
+choices = ["A", "B", "C", "D"]
+
+
+# for complete determinism between runs of MMLU, we should:
+# 1. set the seed of LLM requests to a fixed number (42)
+# 2. set temperature to 0 on requests
+def get_llm_response(args, prompt):
+    # Use dynamo's completions API format
+    data = {
+        "model": args.model,
+        "prompt": prompt,
+        "temperature": 0,
+        "max_tokens": 3,
+        "stream": False,
+        "seed": 42,  # Add explicit seed for determinism
+    }
+    url = f"http://{args.host}:{args.port}/v1/completions"
+    res = requests.post(url, json=data, timeout=30)
+    if res.status_code != 200:
+        raise Exception(f"Error: {res.status_code} {res.text}")
+    response_json = res.json()
+    return response_json["choices"][0]["text"]
+
+
+# grab the idx'th row of the df and generate a prompt string
+# format of the MMLU csvs:
+# question,option_A,option_B,option_C,option_D,answer
+def prompt_string(df, idx, include_answer=True):
+    prompt = df.iloc[idx, 0]
+    k = df.shape[1] - 2  # number of columns - 2 (question and answer)
+    for i in range(k):
+        prompt += f"\n{choices[i]}. {df.iloc[idx, i + 1]}"
+    prompt += "\nRespond with **only the letter** (A, B, C, D).  Do **not** output any explanation, analysis, or extra words. Answer:"
+    if include_answer:
+        prompt += f" {df.iloc[idx, k]}\n\n"
+    return prompt
+
+
+def evaluate(args, subject, dev_df, test_df):
+    prompts, labels = [], []
+
+    shared_multi_shot_prefix = [
+        f"The following are multiple choice questions (with answers) \
+                                about {subject}. \n\n"
+    ]
+    shared_multi_shot_prefix_length = 0
+    for i in range(dev_df.shape[0]):
+        # the multi-shot examples should contain answers
+        shared_multi_shot_prefix.append(prompt_string(dev_df, i))
+
+        # Use plain list of token IDs, no torch tensors
+        assert tokenizer is not None, "Tokenizer must be initialized"
+        token_ids = tokenizer(shared_multi_shot_prefix[-1], add_special_tokens=True)[  # type: ignore
+            "input_ids"
+        ]
+        shared_multi_shot_prefix_length += len(token_ids)
+
+        if shared_multi_shot_prefix_length > 4000:
+            break
+
+    # all already have double newlines at the end
+    shared_multi_shot_prefix_str = "".join(shared_multi_shot_prefix)
+
+    for i in range(test_df.shape[0]):
+        # do NOT include the answer for the actual question we want the LLM to answer
+        query_prompt = prompt_string(test_df, i, include_answer=False)
+        prompt = f"{shared_multi_shot_prefix_str}\n\n{query_prompt}"
+        prompts.append(prompt)
+        label = test_df.iloc[i, test_df.shape[1] - 1]
+        labels.append(label)
+
+    predictions = []
+    for i, prompt in enumerate(prompts):
+        prediction = get_llm_response(args, prompt)
+        prediction_stripped = prediction.strip()
+        if prediction_stripped and prediction_stripped[0] in ["A", "B", "C", "D"]:
+            predictions.append(prediction_stripped[0])
+        else:
+            # Fallback: look for any A, B, C, D in the response
+            for char in prediction_stripped:
+                if char in ["A", "B", "C", "D"]:
+                    predictions.append(char)
+                    break
+            else:
+                predictions.append("A")  # Default fallback
+
+    accuracy = np.mean(np.array(predictions) == np.array(labels))
+    return accuracy
+
+
+def main(args):
+    global tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(args.model)
+
+    mmlu_files = os.listdir("data/test")
+    test_files = [f for f in mmlu_files if f.endswith("_test.csv")]
+    subjects = sorted([f.split("_test.csv")[0] for f in test_files])
+
+    accuracies = []
+    num_questions = []
+    output_dict = {}
+
+    for subject_raw in tqdm(
+        subjects[: args.number_of_subjects], desc="Processing subjects"
+    ):
+        subject = " ".join(subject_raw.split("_"))  # replace underscores with spaces
+        dev_df = pd.read_csv(
+            os.path.join("data/dev", subject_raw + "_dev.csv"), header=None
+        )
+        test_df = pd.read_csv(
+            os.path.join("data/test", subject_raw + "_test.csv"), header=None
+        )
+        accuracy = evaluate(args, subject, dev_df, test_df)
+        accuracies.append(accuracy)
+        num_questions.append(len(test_df))
+        output_dict[subject_raw] = {"accuracy": accuracy, "num_questions": len(test_df)}
+
+    total_accuracy = np.mean(accuracies)
+    total_num_questions = sum(num_questions)
+    output_dict["total"] = {
+        "accuracy": total_accuracy,
+        "num_questions": total_num_questions,
+    }
+
+    with open(args.result_file, "w") as f:
+        # output will be a jsonl file
+        for subject, value in output_dict.items():
+            f.write(json.dumps({subject: value}) + "\n")
+
+
+if __name__ == "__main__":
+    set_seed(42)  # some tokenizers may have randomness
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model", type=str, required=True)
+    parser.add_argument("--result-file", type=str, required=False)
+    parser.add_argument("--number-of-subjects", type=int, required=True)
+    parser.add_argument("--host", type=str, default="localhost", help="Dynamo host")
+    parser.add_argument("--port", type=int, default=8080, help="Dynamo port")
+
+    args = parser.parse_args()
+    if args.result_file is None:
+        # Clean model name if it's a path or has slashes
+        model_name = args.model.split("/")[-1]
+        args.result_file = f"dynamo-baseline-{model_name}.jsonl"
+
+    main(args)
--- a/tests/lmcache/mmlu-lmcache_enabled-dynamo.py
+++ b/tests/lmcache/mmlu-lmcache_enabled-dynamo.py
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This is a MMLU test script for dynamo LMCache testing (with LMCache enabled)
+# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/2-mmlu.py
+
+# ASSUMPTIONS:
+# 1. dynamo is running (default: localhost:8080) with LMCache enabled
+# 2. the mmlu dataset is in a "data" directory
+# 3. all invocations of this script should be run in the same directory
+#    (for later consolidation)
+
+# Standard
+import argparse
+import json
+import os
+from typing import Optional
+
+import numpy as np
+import pandas as pd
+import requests
+
+# Third Party
+from tqdm import tqdm
+from transformers import AutoTokenizer, set_seed
+
+tokenizer: Optional[AutoTokenizer] = None
+choices = ["A", "B", "C", "D"]
+
+
+# for complete determinism between runs of MMLU, we should:
+# 1. set the seed of LLM requests to a fixed number (42)
+# 2. set temperature to 0 on requests
+def get_llm_response(args, prompt):
+    # Use dynamo's completions API format
+    data = {
+        "model": args.model,
+        "prompt": prompt,
+        "temperature": 0,
+        "max_tokens": 3,
+        "stream": False,
+        "seed": 42,  # Add explicit seed for determinism
+    }
+    url = f"http://{args.host}:{args.port}/v1/completions"
+    res = requests.post(url, json=data, timeout=30)
+    if res.status_code != 200:
+        raise Exception(f"Error: {res.status_code} {res.text}")
+    response_json = res.json()
+    return response_json["choices"][0]["text"]
+
+
+# grab the idx'th row of the df and generate a prompt string
+# format of the MMLU csvs:
+# question,option_A,option_B,option_C,option_D,answer
+def prompt_string(df, idx, include_answer=True):
+    prompt = df.iloc[idx, 0]
+    k = df.shape[1] - 2  # number of columns - 2 (question and answer)
+    for i in range(k):
+        prompt += f"\n{choices[i]}. {df.iloc[idx, i + 1]}"
+    prompt += "\nRespond with **only the letter** (A, B, C, D).  Do **not** output any explanation, analysis, or extra words. Answer:"
+    if include_answer:
+        prompt += f" {df.iloc[idx, k]}\n\n"
+    return prompt
+
+
+def evaluate(args, subject, dev_df, test_df):
+    prompts, labels = [], []
+
+    shared_multi_shot_prefix = [
+        f"The following are multiple choice questions (with answers) \
+                                about {subject}. \n\n"
+    ]
+    shared_multi_shot_prefix_length = 0
+    for i in range(dev_df.shape[0]):
+        # the multi-shot examples should contain answers
+        shared_multi_shot_prefix.append(prompt_string(dev_df, i))
+
+        # Use plain list of token IDs, no torch tensors
+        assert tokenizer is not None, "Tokenizer must be initialized"
+        token_ids = tokenizer(shared_multi_shot_prefix[-1], add_special_tokens=True)[  # type: ignore
+            "input_ids"
+        ]
+        shared_multi_shot_prefix_length += len(token_ids)
+
+        if shared_multi_shot_prefix_length > 4000:
+            break
+
+    # all already have double newlines at the end
+    shared_multi_shot_prefix_str = "".join(shared_multi_shot_prefix)
+
+    for i in range(test_df.shape[0]):
+        # do NOT include the answer for the actual question we want the LLM to answer
+        query_prompt = prompt_string(test_df, i, include_answer=False)
+        prompt = f"{shared_multi_shot_prefix_str}\n\n{query_prompt}"
+        prompts.append(prompt)
+        label = test_df.iloc[i, test_df.shape[1] - 1]
+        labels.append(label)
+
+    predictions = []
+    for i, prompt in enumerate(prompts):
+        prediction = get_llm_response(args, prompt)
+        prediction_stripped = prediction.strip()
+        if prediction_stripped and prediction_stripped[0] in ["A", "B", "C", "D"]:
+            predictions.append(prediction_stripped[0])
+        else:
+            # Fallback: look for any A, B, C, D in the response
+            for char in prediction_stripped:
+                if char in ["A", "B", "C", "D"]:
+                    predictions.append(char)
+                    break
+            else:
+                predictions.append("A")  # Default fallback
+
+    accuracy = np.mean(np.array(predictions) == np.array(labels))
+    return accuracy
+
+
+def main(args):
+    global tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(args.model)
+
+    mmlu_files = os.listdir("data/test")
+    test_files = [f for f in mmlu_files if f.endswith("_test.csv")]
+    subjects = sorted([f.split("_test.csv")[0] for f in test_files])
+
+    accuracies = []
+    num_questions = []
+    output_dict = {}
+
+    for subject_raw in tqdm(
+        subjects[: args.number_of_subjects], desc="Processing subjects"
+    ):
+        subject = " ".join(subject_raw.split("_"))  # replace underscores with spaces
+        dev_df = pd.read_csv(
+            os.path.join("data/dev", subject_raw + "_dev.csv"), header=None
+        )
+        test_df = pd.read_csv(
+            os.path.join("data/test", subject_raw + "_test.csv"), header=None
+        )
+        accuracy = evaluate(args, subject, dev_df, test_df)
+        accuracies.append(accuracy)
+        num_questions.append(len(test_df))
+        output_dict[subject_raw] = {"accuracy": accuracy, "num_questions": len(test_df)}
+
+    total_accuracy = np.mean(accuracies)
+    total_num_questions = sum(num_questions)
+    output_dict["total"] = {
+        "accuracy": total_accuracy,
+        "num_questions": total_num_questions,
+    }
+
+    with open(args.result_file, "w") as f:
+        # output will be a jsonl file
+        for subject, value in output_dict.items():
+            f.write(json.dumps({subject: value}) + "\n")
+
+
+if __name__ == "__main__":
+    set_seed(42)  # some tokenizers may have randomness
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model", type=str, required=True)
+    parser.add_argument("--result-file", type=str, required=False)
+    parser.add_argument("--number-of-subjects", type=int, required=True)
+    parser.add_argument("--host", type=str, default="localhost", help="Dynamo host")
+    parser.add_argument("--port", type=int, default=8080, help="Dynamo port")
+
+    args = parser.parse_args()
+    if args.result_file is None:
+        # Clean model name if it's a path or has slashes
+        model_name = args.model.split("/")[-1]
+        args.result_file = f"dynamo-lmcache-{model_name}.jsonl"
+
+    main(args)