OpenDAS / vllm_cscc
Commit 91121552 (Unverified)
Authored May 11, 2025 by Kuntai Du
Committed by GitHub, May 11, 2025
[Perf] Use small max_num_batched_tokens for A100 (#17885)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
parent 90d0a74b
Changes: 1
Showing 1 changed file with 5 additions and 1 deletion
vllm/engine/arg_utils.py  +5 -1
vllm/engine/arg_utils.py @ 91121552
...
@@ -1438,11 +1438,15 @@ class EngineArgs:
         from vllm.platforms import current_platform
         try:
             device_memory = current_platform.get_device_total_memory()
+            device_name = current_platform.get_device_name().lower()
         except Exception:
             # This is only used to set default_max_num_batched_tokens
             device_memory = 0
 
-        if device_memory >= 70 * GiB_bytes:
+        # NOTE(Kuntai): Setting large `max_num_batched_tokens` for A100 reduces
+        # throughput, see PR #17885 for more details.
+        # So here we do an extra device name check to prevent such regression.
+        if device_memory >= 70 * GiB_bytes and "a100" not in device_name:
             # For GPUs like H100 and MI300x, use larger default values.
             default_max_num_batched_tokens = {
                 UsageContext.LLM_CLASS: 16384,
...
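The condition introduced by this hunk can be exercised in isolation. The following is a minimal sketch, not the vLLM implementation: `use_large_default` is a hypothetical helper name, `GiB_bytes` is assumed to be 2**30 bytes, and the device name strings are illustrative examples rather than values taken from this commit.

```python
# Assumption: GiB_bytes is 2**30 bytes, mirroring the constant used in the diff.
GiB_bytes = 1 << 30


def use_large_default(device_memory: int, device_name: str) -> bool:
    """Hypothetical helper: True when the larger default
    max_num_batched_tokens values would be selected.

    Mirrors the condition in the hunk: at least 70 GiB of device
    memory and a device name that does not contain "a100".
    """
    return device_memory >= 70 * GiB_bytes and "a100" not in device_name.lower()


if __name__ == "__main__":
    # Illustrative device names (assumed, not from the commit).
    print(use_large_default(80 * GiB_bytes, "NVIDIA H100 80GB HBM3"))
    print(use_large_default(80 * GiB_bytes, "NVIDIA A100-SXM4-80GB"))
    # Fallback path from the except branch: device_memory = 0.
    print(use_large_default(0, ""))
```

Because `and` short-circuits, the name check is only evaluated when the memory check passes. This is why the except branch in the diff can leave `device_name` unset: with `device_memory = 0` the first operand is already false.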