README.md 3.75 KB
Newer Older
Reid's avatar
Reid committed
1
2
3
4
# vLLM CLI Guide

The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:

5
```bash
Reid's avatar
Reid committed
6
7
8
9
10
vllm --help
```

Available Commands:

11
```bash
Reid's avatar
Reid committed
12
13
14
vllm {chat,complete,serve,bench,collect-env,run-batch}
```

15
## serve
16

17
Starts the vLLM OpenAI Compatible API server.
18

19
Start with a model:
Reid's avatar
Reid committed
20

21
22
23
```bash
vllm serve meta-llama/Llama-2-7b-hf
```
Reid's avatar
Reid committed
24

25
Specify the port:
Reid's avatar
Reid committed
26

27
28
29
```bash
vllm serve meta-llama/Llama-2-7b-hf --port 8100
```
Reid's avatar
Reid committed
30

31
Serve over a Unix domain socket:
Reid's avatar
Reid committed
32

33
34
35
```bash
vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
```
36

37
Check with --help for more options:
Reid's avatar
Reid committed
38

39
40
41
```bash
# To list all groups
vllm serve --help=listgroup
Reid's avatar
Reid committed
42

43
44
# To view a argument group
vllm serve --help=ModelConfig
Reid's avatar
Reid committed
45

46
47
# To view a single argument
vllm serve --help=max-num-seqs
48

49
50
# To search by keyword
vllm serve --help=max
Reid's avatar
Reid committed
51

52
53
54
# To view full help with pager (less/more)
vllm serve --help=page
```
55

56
See [vllm serve](./serve.md) for the full reference of all available arguments.
57

Reid's avatar
Reid committed
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
## chat

Generate chat completions via the running API server.

```bash
# Directly connect to localhost API without arguments
vllm chat

# Specify API url
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick chat with a single prompt
vllm chat --quick "hi"
```

73
74
See [vllm chat](./chat.md) for the full reference of all available arguments.

Reid's avatar
Reid committed
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
## complete

Generate text completions based on the given prompt via the running API server.

```bash
# Directly connect to localhost API without arguments
vllm complete

# Specify API url
vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick complete with a single prompt
vllm complete --quick "The future of AI is"
```

90
See [vllm complete](./complete.md) for the full reference of all available arguments.
91

Reid's avatar
Reid committed
92
93
94
95
## bench

Run benchmark tests for latency online serving throughput and offline inference throughput.

96
97
To use benchmark commands, please install with extra dependencies using `pip install vllm[bench]`.

Reid's avatar
Reid committed
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
Available Commands:

```bash
vllm bench {latency, serve, throughput}
```

### latency

Benchmark the latency of a single batch of requests.

```bash
vllm bench latency \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy
```

117
118
See [vllm bench latency](./bench/latency.md) for the full reference of all available arguments.

Reid's avatar
Reid committed
119
120
121
122
123
124
125
126
127
128
129
130
131
132
### serve

Benchmark the online serving throughput.

```bash
vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --host server-host \
    --port server-port \
    --random-input-len 32 \
    --random-output-len 4  \
    --num-prompts  5
```

133
134
See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.

Reid's avatar
Reid committed
135
136
137
138
139
140
141
142
143
144
145
146
147
### throughput

Benchmark offline inference throughput.

```bash
vllm bench throughput \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy
```

148
149
See [vllm bench throughput](./bench/throughput.md) for the full reference of all available arguments.

Reid's avatar
Reid committed
150
151
152
153
154
155
156
157
158
159
160
161
## collect-env

Start collecting environment information.

```bash
vllm collect-env
```

## run-batch

Run batch prompts and write results to file.

162
Running with a local file:
Reid's avatar
Reid committed
163
164
165
166
167
168

```bash
vllm run-batch \
    -i offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct
169
```
Reid's avatar
Reid committed
170

171
172
173
Using remote file:

```bash
Reid's avatar
Reid committed
174
175
176
177
178
179
vllm run-batch \
    -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```

180
See [vllm run-batch](./run-batch.md) for the full reference of all available arguments.
181

Reid's avatar
Reid committed
182
183
184
185
186
187
188
## More Help

For detailed options of any subcommand, use:

```bash
vllm <subcommand> --help
```