README.md 4.23 KB
Newer Older
Ryan Olson's avatar
Ryan Olson committed
1
2
3
4
# KV Behavior & Model Determinism Tests (kvbm)

## Overview

5
This suite validates the determinism properties of the API-backed LLM under fixed sampling parameters and, optionally, across prefix cache resets. The tests can automatically start a local LLM server—either a vLLM server or a TensorRT-LLM server—warm it up, and compare responses for identical prompts over multiple iterations. The suite also automatically detects whether the vLLM or TensorRT-LLM wheel is installed and starts the corresponding server.
Ryan Olson's avatar
Ryan Olson committed
6
7
8

## Files

9
- `test_determinism.py` — comprehensive determinism tests with automatic LLM server lifecycle and warmup.
Ryan Olson's avatar
Ryan Olson committed
10
11
12
13
14
15
16
17
18
19
20
21
  - `test_determinism_with_cache_reset` — run test with warmup, reset cache, then run again without warmup to test determinism across cache reset boundary
  - `test_concurrent_determinism_with_ifeval` — send parametrized number of IFEval prompts (default: 120) with controlled concurrency, with warmup, then reset cache and test again without warmup to validate determinism across cache reset

## Markers

- `kvbm` — KV behavior and model determinism tests
- `e2e` — end-to-end tests
- `slow` — tests may take a while due to warmup/iterations
- `nightly` — preferred for nightly runs

## How It Works

22
- A `LLMServerManager` fixture (`llm_server`) launches `vllm serve` or `trtllm-serve` with the Dynamo connector and optional cache block overrides.
Ryan Olson's avatar
Ryan Olson committed
23
24
25
26
27
28
29
30
31
32
33
- A `tester` fixture binds the test client to the running server's base URL.
- The test performs a comprehensive warmup across prompts, then executes repeated requests and checks that responses are identical (deterministic). An optional cache reset phase re-validates determinism across the reset boundary.

## Running

Run all kvbm tests:

```bash
pytest -v -m "kvbm" -s
```

34
Run the determinism test file directly inside dynamo repo:
Ryan Olson's avatar
Ryan Olson committed
35
36

```bash
37
38
39
40
pytest -v tests/kvbm/test_determinism_agg.py -s

# disagg needs 2 GPUs to run
pytest -v tests/kvbm/test_determinism_disagg.py -s
Ryan Olson's avatar
Ryan Olson committed
41
42
43
44
45
46
47
48
```

## Configuration

Environment variables control server settings and test load:

- Server/model
  - `KVBM_MODEL_ID` (default: `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`)
49
50
  - `KVBM_SERVER_PORT` (default: `8000`)
  - `KVBM_SERVER_START_TIMEOUT` (default: `300` seconds)
Ryan Olson's avatar
Ryan Olson committed
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95

- Cache size overrides
  - `KVBM_CPU_BLOCKS` (used via test parametrization; default: `10000`)
  - `--num-gpu-blocks-override` is applied when `gpu_blocks` is parametrized

- Request/test parameters
  - `KVBM_MAX_TOKENS` (default: `48`)
  - `KVBM_SEED` (default: `42`)
  - `KVBM_MAX_ITERATIONS` (default: `500`)
  - `KVBM_WORD_COUNT` (default: `200`)
  - `KVBM_CONTROL_INTERVAL` (default: `10`)
  - `KVBM_SHAKESPEARE_INTERVAL` (default: `1`)
  - `KVBM_RANDOM_INTERVAL` (default: `7`)
  - `KVBM_HTTP_TIMEOUT` (default: `30` seconds)
  - `KVBM_SHAKESPEARE_URL` (default: MIT OCW Shakespeare text)

- Concurrent testing
  - `KVBM_CONCURRENT_REQUESTS` (default: `"3"` - comma-separated list for parametrization of max concurrent workers)
  - `KVBM_MAX_TOKENS` (default: `"10"` - comma-separated list for parametrization of max_tokens in concurrent tests)
  - `KVBM_IFEVAL_PROMPTS` (default: `"120"` - comma-separated list for parametrization of number of IFEval prompts to use)

Example:

```bash
KVBM_MODEL_ID=Qwen/Qwen3-0.6B \
KVBM_CPU_BLOCKS=12000 \
KVBM_MAX_ITERATIONS=100 \
KVBM_CONCURRENT_REQUESTS="10,25,50" \
KVBM_MAX_TOKENS="48,128,256" \
KVBM_IFEVAL_PROMPTS="50,120,200" \
pytest -v -m "kvbm" -s
```

## Requirements

- `vllm` executable available in PATH inside the test environment.
- The connector module path must be valid: `dynamo.llm.vllm_integration.connector`.
- NATS and etcd services (provided automatically by the `runtime_services` fixture).
- `datasets` library for IFEval concurrent testing (included in test dependencies).
- For containerized workflows, follow the top-level `tests/README.md` guidance to build/run the appropriate image, then execute pytest inside the container.

## Notes

- Warmup is critical to avoid initialization effects impacting determinism.
- For faster local iteration, reduce `KVBM_MAX_ITERATIONS` and/or increase intervals.
96
97
- Logs are written under the per-test directory created by `tests/conftest.py` and include the LLM server stdout/stderr.
- Tests use the static port defined by `KVBM_SERVER_PORT` for LLM server communication.