OpenDAS / vllm_cscc
Commit 1f55e057
Authored Nov 12, 2024 by Woosuk Kwon; committed by GitHub, Nov 12, 2024

[V1] Enable Inductor when using piecewise CUDA graphs (#10268)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Parent: 8a06428c

Changes: 1 changed file with 7 additions and 4 deletions

vllm/v1/worker/gpu_model_runner.py (+7, -4)
@@ -404,14 +404,17 @@ class GPUModelRunner:
     def load_model(self) -> None:
         if self.use_cuda_graph:
-            # FIXME(woosuk): Currently, we do not use inductor to reduce the
-            # compilation time and any potential issues with the inductor.
-            os.environ["VLLM_CUSTOM_OPS"] = "all"
+            # NOTE(woosuk): Currently, we use inductor because the piecewise
+            # CUDA graphs do not work properly with the custom CUDA kernels.
+            # FIXME(woosuk): Disable inductor to reduce the compilation time
+            # and avoid any potential issues with the inductor.
+            os.environ["VLLM_CUSTOM_OPS"] = "none"
             set_compilation_config(
                 CompilationConfig(
                     use_cudagraph=True,
                     non_cudagraph_ops=["vllm.unified_v1_flash_attention"],
-                    use_inductor=False,
+                    use_inductor=True,
+                    enable_fusion=False,
                 ))
 
         logger.info("Starting to load model %s...", self.model_config.model)
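
For readers who want to see the post-change behavior in isolation, here is a minimal sketch of the branch as it reads after this commit. It does not import vLLM: `SketchCompilationConfig` is a hypothetical stand-in that carries only the four fields visible in the diff (its default values are assumptions, not vLLM's), and `configure_for_piecewise_cuda_graphs` is an illustrative helper name, not a function in the repository.

```python
import os
from dataclasses import dataclass, field
from typing import List, MutableMapping, Optional


@dataclass
class SketchCompilationConfig:
    """Hypothetical stand-in for vLLM's CompilationConfig.

    Only the fields that appear in the diff are modeled; the defaults
    here are assumptions made for this sketch.
    """
    use_cudagraph: bool = False
    non_cudagraph_ops: List[str] = field(default_factory=list)
    use_inductor: bool = True
    enable_fusion: bool = True


def configure_for_piecewise_cuda_graphs(
    use_cuda_graph: bool,
    env: Optional[MutableMapping[str, str]] = None,
) -> Optional[SketchCompilationConfig]:
    """Mirror the settings the commit applies when CUDA graphs are enabled."""
    env = os.environ if env is None else env
    if not use_cuda_graph:
        # The diff only touches the CUDA-graph branch; nothing is set otherwise.
        return None
    # After the commit: custom ops are turned off ("none" instead of "all").
    env["VLLM_CUSTOM_OPS"] = "none"
    return SketchCompilationConfig(
        use_cudagraph=True,
        # The attention op is listed as a non-CUDA-graph op, as in the diff.
        non_cudagraph_ops=["vllm.unified_v1_flash_attention"],
        use_inductor=True,     # was False before the commit
        enable_fusion=False,   # newly added by the commit
    )


if __name__ == "__main__":
    scratch_env = {}
    print(configure_for_piecewise_cuda_graphs(True, scratch_env))
    print(scratch_env)
```

Passing a plain dict as `env` keeps the sketch from mutating the real process environment; the actual code in the diff writes to `os.environ` directly.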