Unverified commit e184c9c5, authored by Lehua Ding, committed by GitHub

[perf] Use CPU tensor to reduce GPU->CPU sync (#25884)


Signed-off-by: Lehua Ding <lehuading@tencent.com>
parent d7e34b42
@@ -2478,7 +2478,7 @@ class GPUModelRunner(LoRAModelRunnerMixin, KVConnectorModelRunnerMixin):
 effective_drafter_max_model_len = (
     self.speculative_config.draft_model_config.max_model_len)
 input_fits_in_drafter = spec_decode_common_attn_metadata and (
-    spec_decode_common_attn_metadata.seq_lens.max() +
+    spec_decode_common_attn_metadata.max_seq_len +
     self.speculative_config.num_speculative_tokens
     <= effective_drafter_max_model_len)
 if use_padded_batch_for_eagle and input_fits_in_drafter:
......
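The change swaps a reduction over the `seq_lens` tensor for the `max_seq_len` field. Per the commit title, the aim is to avoid a GPU->CPU synchronization: in PyTorch, using the result of `.max()` on a device tensor in a Python `and`/`if` forces the host to wait for the value. The sketch below is a minimal, torch-free illustration of the pattern, not vLLM's code. `FakeAttnMetadata`, `build_metadata`, and the free function `input_fits_in_drafter` are hypothetical names, and it assumes `max_seq_len` is a plain host-side integer computed once when the metadata is built.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeAttnMetadata:
    """Hypothetical stand-in for the attention metadata (not vLLM's class)."""
    seq_lens: List[int]  # stands in for the per-request lengths tensor
    max_seq_len: int     # assumed: plain Python int kept on the host


def build_metadata(seq_lens: List[int]) -> FakeAttnMetadata:
    # Compute the maximum once, while the lengths are still host-side data.
    return FakeAttnMetadata(seq_lens=list(seq_lens), max_seq_len=max(seq_lens))


def input_fits_in_drafter(meta: Optional[FakeAttnMetadata],
                          num_speculative_tokens: int,
                          drafter_max_model_len: int) -> bool:
    # Mirrors the shape of the check in the diff, but reads only the cached
    # host integer, so no device tensor has to be reduced or copied back.
    if meta is None:
        return False
    return meta.max_seq_len + num_speculative_tokens <= drafter_max_model_len


meta = build_metadata([5, 12, 9])
fits = input_fits_in_drafter(meta, num_speculative_tokens=3,
                             drafter_max_model_len=16)
```

With these example values, the longest sequence (12) plus 3 speculative tokens is compared against a drafter limit of 16 using integer arithmetic only.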