# Mocker engine

The mocker engine is a mock vLLM implementation designed for testing and development purposes. It simulates realistic token generation timing without requiring actual model inference, making it useful for:

- Testing distributed system components without GPU resources
- Benchmarking infrastructure and networking overhead
- Developing and debugging Dynamo components
- Load testing and performance analysis

## Basic usage

The mocker engine supports a vLLM-style CLI, with an individual argument for each configuration option.

### Required arguments:
- `--model-path`: Path to model directory or HuggingFace model ID (required for tokenizer)

### MockEngineArgs parameters (vLLM-style):
- `--num-gpu-blocks-override`: Number of GPU blocks for KV cache (default: 16384)
- `--block-size`: Token block size for KV cache blocks (default: 64)
- `--max-num-seqs`: Maximum number of sequences per iteration (default: 256)
- `--max-num-batched-tokens`: Maximum number of batched tokens per iteration (default: 8192)
- `--enable-prefix-caching` / `--no-enable-prefix-caching`: Enable/disable automatic prefix caching (default: True)
- `--enable-chunked-prefill` / `--no-enable-chunked-prefill`: Enable/disable chunked prefill (default: True)
- `--watermark`: KV cache watermark threshold as a fraction (default: 0.01)
- `--speedup-ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulated engine run faster.
- `--data-parallel-size`: Number of data parallel workers to simulate (default: 1)
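
To get a feel for how these values interact, the arithmetic below works out the KV cache capacity implied by a block count and block size, and how a speed multiplier would shorten a simulated delay. It is a rough sketch rather than the mocker's actual logic: the helper names are hypothetical, and it assumes that `--speedup-ratio` simply divides the simulated time and that `--watermark` is a fraction of the total blocks held in reserve.

```python
# Illustrative arithmetic only. These helpers are hypothetical and are
# not part of dynamo.mocker.


def kv_cache_capacity_tokens(num_gpu_blocks: int, block_size: int) -> int:
    """Total number of tokens the simulated KV cache can hold."""
    return num_gpu_blocks * block_size


def reserved_blocks(num_gpu_blocks: int, watermark: float) -> float:
    """Blocks held back by the watermark (assumed: fraction of all blocks)."""
    return num_gpu_blocks * watermark


def scaled_delay_ms(base_delay_ms: float, speedup_ratio: float) -> float:
    """Simulated delay after applying the multiplier (assumed: plain division)."""
    return base_delay_ms / speedup_ratio


# Defaults listed above: 16384 blocks of 64 tokens each.
print(kv_cache_capacity_tokens(16384, 64))  # 1048576

# Blocks kept in reserve with the default watermark of 0.01.
print(reserved_blocks(16384, 0.01))

# A hypothetical 50 ms step with --speedup-ratio 10.0.
print(scaled_delay_ms(50.0, 10.0))  # 5.0
```

Under these assumptions, halving `--block-size` while keeping `--num-gpu-blocks-override` fixed halves the token capacity, which is worth keeping in mind when comparing runs.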

### Example with individual arguments (vLLM-style):
```bash
# Start mocker with custom configuration
python -m dynamo.mocker \
  --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --num-gpu-blocks-override 8192 \
  --block-size 16 \
  --speedup-ratio 10.0 \
  --max-num-seqs 512 \
  --enable-prefix-caching

# Start the frontend server (in a separate terminal; the mocker command above keeps running)
python -m dynamo.frontend --http-port 8080
```
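
When the mocker is launched from a test harness, it can be convenient to assemble the same command programmatically. The sketch below is one possible way to do that, using only the flags documented in this README. The `build_mocker_command` helper is hypothetical, and the mapping from Python keyword names to flags (underscores to hyphens, `False` to the `--no-` form) is an assumption of this example.

```python
import shlex
import sys
from typing import List


def build_mocker_command(model_path: str, **options) -> List[str]:
    """Hypothetical helper: turn keyword options into mocker CLI flags."""
    cmd = [sys.executable, "-m", "dynamo.mocker", "--model-path", model_path]
    for name, value in options.items():
        flag_name = name.replace("_", "-")
        if isinstance(value, bool):
            # Boolean options use the --flag / --no-flag pair form.
            cmd.append(f"--{flag_name}" if value else f"--no-{flag_name}")
        else:
            cmd.extend([f"--{flag_name}", str(value)])
    return cmd


cmd = build_mocker_command(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    num_gpu_blocks_override=8192,
    block_size=16,
    speedup_ratio=10.0,
    max_num_seqs=512,
    enable_prefix_caching=True,
)

# Print a copy-pastable command line; nothing is executed here.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.Popen(cmd)` in an environment where Dynamo is installed.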

### Legacy JSON file support:
For backward compatibility, you can still provide configuration via a JSON file:

```bash
echo '{"speedup_ratio": 10.0, "num_gpu_blocks": 8192}' > mocker_args.json
python -m dynamo.mocker \
  --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --extra-engine-args mocker_args.json
```

Note: If `--extra-engine-args` is provided, it overrides all individual CLI arguments.
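
The same JSON file can also be generated from a script, which avoids shell-quoting mistakes when the configuration grows. The following is a minimal sketch that writes the two keys from the example above and assembles the matching command. The temporary directory and variable names are choices made for this illustration, and no keys beyond those two are assumed.

```python
import json
import tempfile
from pathlib import Path

# The same two keys as in the echo example above.
engine_args = {"speedup_ratio": 10.0, "num_gpu_blocks": 8192}

# Write the file to a scratch directory (the location is an arbitrary choice).
config_path = Path(tempfile.mkdtemp()) / "mocker_args.json"
config_path.write_text(json.dumps(engine_args, indent=2))

# Assemble the command; it is printed here rather than executed.
command = [
    "python", "-m", "dynamo.mocker",
    "--model-path", "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "--extra-engine-args", str(config_path),
]
print(" ".join(command))
```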