README.md 4.23 KB
Newer Older
Ryan Olson's avatar
Ryan Olson committed
1
2
3
4
# KV Behavior & Model Determinism Tests (kvbm)

## Overview

5
This suite validates the determinism properties of the API-backed LLM under fixed sampling parameters and, optionally, across prefix cache resets. The tests can automatically start a local LLM server—either a vLLM server or a TensorRT-LLM server—warm it up, and compare responses for identical prompts over multiple iterations. The suite also automatically detects whether the vLLM or TensorRT-LLM wheel is installed and starts the corresponding server.
Ryan Olson's avatar
Ryan Olson committed
6
7
8

## Files

9
- `test_determinism.py` — comprehensive determinism tests with automatic LLM server lifecycle and warmup.
Ryan Olson's avatar
Ryan Olson committed
10
11
12
13
14
15
16
17
18
19
20
21
  - `test_determinism_with_cache_reset` — run test with warmup, reset cache, then run again without warmup to test determinism across cache reset boundary
  - `test_concurrent_determinism_with_ifeval` — send parametrized number of IFEval prompts (default: 120) with controlled concurrency, with warmup, then reset cache and test again without warmup to validate determinism across cache reset

## Markers

- `kvbm` — KV behavior and model determinism tests
- `e2e` — end-to-end tests
- `slow` — tests may take a while due to warmup/iterations
- `nightly` — preferred for nightly runs

## How It Works

22
- A `LLMServerManager` fixture (`llm_server`) launches `vllm serve` or `trtllm-serve` with the Dynamo connector and optional cache block overrides.
Ryan Olson's avatar
Ryan Olson committed
23
24
25
26
27
28
29
30
31
32
33
- A `tester` fixture binds the test client to the running server's base URL.
- The test performs a comprehensive warmup across prompts, then executes repeated requests and checks that responses are identical (deterministic). An optional cache reset phase re-validates determinism across the reset boundary.

## Running

Run all kvbm tests:

```bash
pytest -v -m "kvbm" -s
```

34
Run the determinism test file directly inside dynamo repo:
Ryan Olson's avatar
Ryan Olson committed
35
36

```bash
Richard Huo's avatar
Richard Huo committed
37
pytest -v tests/kvbm_integration/test_determinism_agg.py -s
38
39

# disagg needs 2 GPUs to run
Richard Huo's avatar
Richard Huo committed
40
pytest -v tests/kvbm_integration/test_determinism_disagg.py -s
Ryan Olson's avatar
Ryan Olson committed
41
42
43
44
45
46
47
48
```

## Configuration

Environment variables control server settings and test load:

- Server/model
  - `KVBM_MODEL_ID` (default: `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`)
49
50
  - `KVBM_SERVER_PORT` (default: `8000`)
  - `KVBM_SERVER_START_TIMEOUT` (default: `300` seconds)
Ryan Olson's avatar
Ryan Olson committed
51
52
53
54
55
56

- Cache size overrides
  - `KVBM_CPU_BLOCKS` (used via test parametrization; default: `10000`)
  - `--num-gpu-blocks-override` is applied when `gpu_blocks` is parametrized

- Request/test parameters
57
  - `KVBM_MAX_TOKENS` (default: `48`) - single integer for max tokens per request
Ryan Olson's avatar
Ryan Olson committed
58
59
60
61
62
63
64
65
66
  - `KVBM_SEED` (default: `42`)
  - `KVBM_MAX_ITERATIONS` (default: `500`)
  - `KVBM_WORD_COUNT` (default: `200`)
  - `KVBM_CONTROL_INTERVAL` (default: `10`)
  - `KVBM_SHAKESPEARE_INTERVAL` (default: `1`)
  - `KVBM_RANDOM_INTERVAL` (default: `7`)
  - `KVBM_HTTP_TIMEOUT` (default: `30` seconds)
  - `KVBM_SHAKESPEARE_URL` (default: MIT OCW Shakespeare text)

67
68
69
- Concurrent testing (only for `test_concurrent_determinism_with_ifeval`)
  - `KVBM_CONCURRENT_REQUESTS` (default: `3`) - comma-separated list for parametrization of max concurrent workers
  - `KVBM_IFEVAL_PROMPTS` (default: `120`) - comma-separated list for parametrization of number of IFEval prompts
Ryan Olson's avatar
Ryan Olson committed
70

71
### Example
Ryan Olson's avatar
Ryan Olson committed
72
73

```bash
74
75
KVBM_MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
KVBM_CPU_BLOCKS=10000 \
Ryan Olson's avatar
Ryan Olson committed
76
KVBM_MAX_ITERATIONS=100 \
77
KVBM_MAX_TOKENS=48 \
Ryan Olson's avatar
Ryan Olson committed
78
79
80
81
82
83
84
85
KVBM_CONCURRENT_REQUESTS="10,25,50" \
KVBM_IFEVAL_PROMPTS="50,120,200" \
pytest -v -m "kvbm" -s
```

## Requirements

- `vllm` executable available in PATH inside the test environment.
Richard Huo's avatar
Richard Huo committed
86
- The connector module path must be valid: `kvbm.vllm_integration.connector`.
Ryan Olson's avatar
Ryan Olson committed
87
88
89
90
91
92
93
94
- NATS and etcd services (provided automatically by the `runtime_services` fixture).
- `datasets` library for IFEval concurrent testing (included in test dependencies).
- For containerized workflows, follow the top-level `tests/README.md` guidance to build/run the appropriate image, then execute pytest inside the container.

## Notes

- Warmup is critical to avoid initialization effects impacting determinism.
- For faster local iteration, reduce `KVBM_MAX_ITERATIONS` and/or increase intervals.
95
96
- Logs are written under the per-test directory created by `tests/conftest.py` and include the LLM server stdout/stderr.
- Tests use the static port defined by `KVBM_SERVER_PORT` for LLM server communication.