README.md

# KV Behavior & Model Determinism Tests (kvbm)

## Overview

This suite validates determinism properties of the API-backed LLM under fixed sampling parameters and optionally across prefix cache resets. The tests can automatically start a local vLLM server, warm it up, and compare responses for identical prompts over multiple iterations.

## Files

- `test_determinism.py` — comprehensive determinism tests with automatic vLLM server lifecycle and warmup.
  - `test_determinism_with_cache_reset` — run test with warmup, reset cache, then run again without warmup to test determinism across cache reset boundary
  - `test_concurrent_determinism_with_ifeval` — send parametrized number of IFEval prompts (default: 120) with controlled concurrency, with warmup, then reset cache and test again without warmup to validate determinism across cache reset

## Markers

- `kvbm` — KV behavior and model determinism tests
- `e2e` — end-to-end tests
- `slow` — tests may take a while due to warmup/iterations
- `nightly` — preferred for nightly runs

## How It Works

- A `VLLMServerManager` fixture (`vllm_server`) launches `vllm serve` with the Dynamo connector and optional cache block overrides.
- A `tester` fixture binds the test client to the running server's base URL.
- The test performs a comprehensive warmup across prompts, then executes repeated requests and checks that responses are identical (deterministic). An optional cache reset phase re-validates determinism across the reset boundary.

## Running

Run all kvbm tests:

```bash
pytest -v -m "kvbm" -s
```

Run the determinism test file directly:

```bash
pytest -v dynamo/tests/kvbm/test_determinism.py -s
```

## Configuration

Environment variables control server settings and test load:

- Server/model
  - `KVBM_MODEL_ID` (default: `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`)
  - `KVBM_VLLM_PORT` (default: `8000`)
  - `KVBM_VLLM_START_TIMEOUT` (default: `300` seconds)

- Cache size overrides
  - `KVBM_CPU_BLOCKS` (used via test parametrization; default: `10000`)
  - `--num-gpu-blocks-override` is applied when `gpu_blocks` is parametrized

- Request/test parameters
  - `KVBM_MAX_TOKENS` (default: `48`)
  - `KVBM_SEED` (default: `42`)
  - `KVBM_MAX_ITERATIONS` (default: `500`)
  - `KVBM_WORD_COUNT` (default: `200`)
  - `KVBM_CONTROL_INTERVAL` (default: `10`)
  - `KVBM_SHAKESPEARE_INTERVAL` (default: `1`)
  - `KVBM_RANDOM_INTERVAL` (default: `7`)
  - `KVBM_HTTP_TIMEOUT` (default: `30` seconds)
  - `KVBM_SHAKESPEARE_URL` (default: MIT OCW Shakespeare text)

- Concurrent testing
  - `KVBM_CONCURRENT_REQUESTS` (default: `"3"` - comma-separated list for parametrization of max concurrent workers)
  - `KVBM_MAX_TOKENS` (default: `"10"` - comma-separated list for parametrization of max_tokens in concurrent tests)
  - `KVBM_IFEVAL_PROMPTS` (default: `"120"` - comma-separated list for parametrization of number of IFEval prompts to use)

Example:

```bash
KVBM_MODEL_ID=Qwen/Qwen3-0.6B \
KVBM_CPU_BLOCKS=12000 \
KVBM_MAX_ITERATIONS=100 \
KVBM_CONCURRENT_REQUESTS="10,25,50" \
KVBM_MAX_TOKENS="48,128,256" \
KVBM_IFEVAL_PROMPTS="50,120,200" \
pytest -v -m "kvbm" -s
```

## Requirements

- `vllm` executable available in PATH inside the test environment.
- The connector module path must be valid: `dynamo.llm.vllm_integration.connector`.
- NATS and etcd services (provided automatically by the `runtime_services` fixture).
- `datasets` library for IFEval concurrent testing (included in test dependencies).
- For containerized workflows, follow the top-level `tests/README.md` guidance to build/run the appropriate image, then execute pytest inside the container.

## Notes

- Warmup is critical to avoid initialization effects impacting determinism.
- For faster local iteration, reduce `KVBM_MAX_ITERATIONS` and/or increase intervals.
- Logs are written under the per-test directory created by `tests/conftest.py` and include the vLLM server stdout/stderr.
- Tests use the static port defined by `KVBM_VLLM_PORT` for vLLM server communication.