norm / vllm · Commits
"src/diffusers/models/consistency_decoder_vae.py" did not exist on "7a4324cce3f84d14afe8e5cfd47fb67701ce2fd3"
Commit bc064457 (unverified)
Authored Sep 19, 2023 by Woosuk Kwon; committed via GitHub on Sep 19, 2023
Add gpu_memory_utilization and swap_space to LLM (#1090)
Parent: 400b8289
Showing 1 changed file with 19 additions and 3 deletions.

vllm/entrypoints/llm.py (+19, -3)
@@ -37,12 +37,22 @@ class LLM:
             the `torch_dtype` attribute specified in the model config file.
             However, if the `torch_dtype` in the config is `float32`, we will
             use `float16` instead.
-        seed: The seed to initialize the random number generator for sampling.
         quantization: The method used to quantize the model weights. Currently,
             we support "awq". If None, we assume the model weights are not
             quantized and use `dtype` to determine the data type of the weights.
         revision: The specific model version to use. It can be a branch name,
             a tag name, or a commit id.
+        seed: The seed to initialize the random number generator for sampling.
+        gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to
+            reserve for the model weights, activations, and KV cache. Higher
+            values will increase the KV cache size and thus improve the model's
+            throughput. However, if the value is too high, it may cause out-of-
+            memory (OOM) errors.
+        swap_space: The size (GiB) of CPU memory per GPU to use as swap space.
+            This can be used for temporarily storing the states of the requests
+            when their `best_of` sampling parameters are larger than 1. If all
+            requests will have `best_of=1`, you can safely set this to 0.
+            Otherwise, too small values may cause out-of-memory (OOM) errors.
     """

     def __init__(
@@ -53,8 +63,11 @@ class LLM:
         trust_remote_code: bool = False,
         tensor_parallel_size: int = 1,
         dtype: str = "auto",
-        seed: int = 0,
         quantization: Optional[str] = None,
+        revision: Optional[str] = None,
+        seed: int = 0,
+        gpu_memory_utilization: float = 0.9,
+        swap_space: int = 4,
         **kwargs,
     ) -> None:
         if "disable_log_stats" not in kwargs:
@@ -66,8 +79,11 @@ class LLM:
             trust_remote_code=trust_remote_code,
             tensor_parallel_size=tensor_parallel_size,
             dtype=dtype,
-            seed=seed,
             quantization=quantization,
+            revision=revision,
+            seed=seed,
+            gpu_memory_utilization=gpu_memory_utilization,
+            swap_space=swap_space,
             **kwargs,
         )
         self.llm_engine = LLMEngine.from_engine_args(engine_args)
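For context, a minimal sketch of how the two new parameters are used from the LLM entrypoint after this commit. The model name and the concrete values (0.8, 8 GiB, best_of=4) are illustrative choices, not part of the commit:

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="facebook/opt-125m",   # illustrative model choice
        gpu_memory_utilization=0.8,  # fraction of GPU memory for weights, activations, KV cache
        swap_space=8,                # GiB of CPU memory per GPU used as swap space
    )

    # swap_space matters only for requests with best_of > 1: preempted
    # sequence states can then be swapped to CPU memory. With best_of=1
    # throughout, swap_space can safely be set to 0.
    params = SamplingParams(best_of=4, temperature=0.8)
    outputs = llm.generate(["The capital of France is"], params)
    print(outputs[0].outputs[0].text)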