OpenDAS / vllm_cscc
Commit 2dd72d23 (Unverified)
Authored Jul 25, 2025 by weiliang; committed by GitHub Jul 24, 2025

update flashinfer to v0.2.9rc1 (#21485)

Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>

Parent: a6c7fb8c
Changes: 3 changed files with 6 additions and 15 deletions

docker/Dockerfile                          +1 -1
vllm/attention/backends/flashinfer.py      +3 -7
vllm/v1/attention/backends/flashinfer.py   +2 -7
docker/Dockerfile
...
@@ -386,7 +386,7 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
 # Install FlashInfer from source
 ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
-ARG FLASHINFER_GIT_REF="v0.2.8"
+ARG FLASHINFER_GIT_REF="v0.2.9rc1"
 RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH'
     . /etc/environment
     git clone --depth 1 --recursive --shallow-submodules \
...
vllm/attention/backends/flashinfer.py
...
@@ -1169,16 +1169,12 @@ class FlashInferImpl(AttentionImpl):
                 query=decode_query,
                 kv_cache=kv_cache.permute(*stride_order),
                 workspace_buffer=workspace_buffer,
-                num_heads=num_heads,
-                num_kv_heads=num_kv_heads,
-                scale=softmax_scale,
                 block_tables=attn_metadata.block_tables,
                 seq_lens=decode_meta.seq_lens_tensor,
-                block_size=attn_metadata.page_size,
                 max_seq_len=attn_metadata.max_decode_seq_len,
-                kv_cache_dtype=kv_cache_dtype,
-                k_scale=layer._k_scale_float,
-                v_scale=layer._v_scale_float)
+                bmm1_scale=layer._k_scale_float * softmax_scale,
+                bmm2_scale=layer._v_scale_float,
+            )

     if prefill_output is None and decode_output is not None:
         # Decode only batch.
...
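The hunk above replaces separate `scale`, `k_scale`, and `v_scale` arguments with two folded factors, `bmm1_scale = k_scale * softmax_scale` and `bmm2_scale = v_scale`. The following is a minimal pure-Python sketch of why that folding should be numerically equivalent. It assumes `bmm1_scale` multiplies the QK^T logits and `bmm2_scale` multiplies the result of the second matmul, which the parameter names suggest but the diff does not state. The functions `decode_old` and `decode_new` are hypothetical illustrations for single-query attention, not FlashInfer or vLLM APIs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decode_old(q, keys, values, softmax_scale, k_scale, v_scale):
    """Hypothetical old-style path: dequantize K and V, then scale the logits."""
    logits = [softmax_scale * dot(q, [k_scale * x for x in k]) for k in keys]
    weights = softmax(logits)
    return [
        sum(w * v_scale * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]


def decode_new(q, keys, values, bmm1_scale, bmm2_scale):
    """Hypothetical new-style path: one factor per batched matmul."""
    logits = [bmm1_scale * dot(q, k) for k in keys]
    weights = softmax(logits)
    return [
        bmm2_scale * sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]


# Toy single-query example with three cached tokens (made-up numbers).
q = [0.5, -1.0, 2.0, 0.25]
keys = [[1.0, 0.0, -2.0, 3.0], [0.5, 1.5, 1.0, -1.0], [-1.0, 2.0, 0.0, 0.5]]
values = [[1.0, 2.0], [-3.0, 0.5], [4.0, -1.0]]
softmax_scale = 1.0 / math.sqrt(len(q))
k_scale, v_scale = 0.02, 0.05

old = decode_old(q, keys, values, softmax_scale, k_scale, v_scale)
new = decode_new(
    q, keys, values,
    bmm1_scale=k_scale * softmax_scale,  # same product as in the diff
    bmm2_scale=v_scale,
)
```

Because both scales are scalars, they commute with the dot products, so `old` and `new` agree up to floating-point rounding.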
vllm/v1/attention/backends/flashinfer.py
...
@@ -678,15 +678,10 @@ class FlashInferImpl(AttentionImpl):
                     query=decode_query,
                     kv_cache=kv_cache_permute,
                     workspace_buffer=attn_metadata.workspace_buffer,
-                    num_heads=self.num_heads,
-                    num_kv_heads=self.num_kv_heads,
-                    scale=self.scale,
                     block_tables=block_tables_decode,
                     seq_lens=seq_lens_decode,
-                    block_size=attn_metadata.page_size,
                     max_seq_len=attn_metadata.max_seq_len,
-                    kv_cache_dtype=self.kv_cache_dtype,
-                    k_scale=layer._k_scale_float,
-                    v_scale=layer._v_scale_float,
+                    bmm1_scale=layer._k_scale_float * self.scale,
+                    bmm2_scale=layer._v_scale_float,
                 ))
         return output_padded
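Both Python hunks apply the same keyword-argument change at the decode call site: seven keywords are dropped and two are added. As a rough sketch of that mapping, the hypothetical helper below (`migrate_decode_kwargs` is not part of vLLM or FlashInfer) rewrites an old-style keyword dictionary into the new shape. The key names are copied from the diff; the sample values are made up.

```python
# Keywords removed from the call in this commit, as listed in the diff.
REMOVED_KEYS = (
    "num_heads",
    "num_kv_heads",
    "scale",
    "block_size",
    "kv_cache_dtype",
    "k_scale",
    "v_scale",
)


def migrate_decode_kwargs(old: dict) -> dict:
    """Hypothetical helper: translate old-style kwargs to the new style.

    Drops the removed keywords and folds the three scale values into
    bmm1_scale and bmm2_scale, mirroring the expressions in the diff.
    """
    new = {key: value for key, value in old.items() if key not in REMOVED_KEYS}
    new["bmm1_scale"] = old["k_scale"] * old["scale"]
    new["bmm2_scale"] = old["v_scale"]
    return new


# Placeholder strings stand in for tensors; the numbers are illustrative only.
old_kwargs = {
    "query": "decode_query",
    "kv_cache": "kv_cache_permute",
    "workspace_buffer": "workspace_buffer",
    "num_heads": 32,
    "num_kv_heads": 8,
    "scale": 0.125,
    "block_tables": "block_tables_decode",
    "seq_lens": "seq_lens_decode",
    "block_size": 16,
    "max_seq_len": 4096,
    "kv_cache_dtype": "fp8",
    "k_scale": 2.0,
    "v_scale": 0.5,
}
new_kwargs = migrate_decode_kwargs(old_kwargs)
```

The surviving keys (`query`, `kv_cache`, `workspace_buffer`, `block_tables`, `seq_lens`, `max_seq_len`) pass through unchanged, matching the context lines of the hunks.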