Commit 24dc30f7, authored by Nick Hill, committed by GitHub

[ModelRunner V2] Don't pin reused flashinfer tensors (#32799)


Signed-off-by: Nick Hill <nickhill123@gmail.com>
parent 180fba65
......
@@ -603,7 +603,12 @@ class FlashInferMetadataBuilder(AttentionMetadataBuilder[FlashInferMetadata]):
     "earlier GPUs."
 )
 # Preparing persistent buffers
-self.pin_memory = is_pin_memory_available()
+# Since we do not have explicit synchronization in ModelRunnerV2, we do not pin
+# reused CPU buffers to avoid a race condition between step N async copies to
+# GPU and step N+1 buffer updates.
+self.pin_memory = (
+    not envs.VLLM_USE_V2_MODEL_RUNNER and is_pin_memory_available()
+)
 self.paged_kv_indptr = self._make_buffer(max_num_reqs + 1)
 self.paged_kv_indptr_cpu_buffer = torch.zeros_like(
     self.paged_kv_indptr.cpu, pin_memory=self.pin_memory
......
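
The race described in the added comment can be illustrated with a toy model. The sketch below is not vLLM or CUDA code: `FakeStream`, `copy_async`, and `run_two_steps` are hypothetical names invented for illustration. It assumes that an asynchronous copy from a pinned buffer reads the source only when the transfer actually executes, while a copy from a pageable (unpinned) buffer stages the contents before the call returns.

```python
from collections import deque


class FakeStream:
    """Toy stand-in for a GPU stream: queued copies run only on drain()."""

    def __init__(self):
        self._pending = deque()

    def copy_async(self, src, dst, pinned):
        if pinned:
            # Assumed pinned behavior: the transfer reads `src` later,
            # at the moment the queued copy executes.
            source = src
        else:
            # Assumed pageable behavior: contents are staged (snapshotted)
            # before this call returns.
            source = list(src)

        def do_copy():
            dst[:] = source

        self._pending.append(do_copy)

    def drain(self):
        # Execute all queued copies in submission order.
        while self._pending:
            self._pending.popleft()()


def run_two_steps(pinned):
    stream = FakeStream()
    cpu_buf = [0, 0, 0]         # CPU buffer reused across steps
    gpu_step_n = [None] * 3     # destination for step N's data

    cpu_buf[:] = [1, 2, 3]      # step N fills the buffer
    stream.copy_async(cpu_buf, gpu_step_n, pinned)
    cpu_buf[:] = [7, 8, 9]      # step N+1 overwrites it before the copy ran
    stream.drain()
    return gpu_step_n


print(run_two_steps(pinned=True))   # step N sees step N+1's data
print(run_two_steps(pinned=False))  # step N sees its own data
```

In this model the pinned case lets step N+1's update leak into step N's copy, which is the hazard the commit avoids by not pinning reused buffers when no explicit synchronization is in place.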