Commits · 367cb8ce8c0486e22941cf8990902fc5ec992ec1 · OpenDAS / vllm_cscc

15 Feb, 2025 4 commits
- [V1][PP] Run engine busy loop with batch queue (#13064) · 9206b3d7
  Cody Yu authored Feb 15, 2025
  
- [V1][Metrics] Add iteration_tokens_total histogram from V0 (#13288) · 2ad1bc7a
  Mark McLoughlin authored Feb 15, 2025
  
- [V1][PP] Fix memory profiling in PP (#13315) · 0c730268
  Woosuk Kwon authored Feb 14, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [V1][Sampler] Don't apply temp for greedy-only (#13311) · 6a854c7a
  Nick Hill authored Feb 14, 2025
```
Signed-off-by: Nick Hill <nhill@redhat.com>
```
14 Feb, 2025 8 commits
- [V1][Core] min_p sampling support (#13191) · a12934d3
  Aoyu authored Feb 15, 2025
```
Signed-off-by: Aoyu <aoyuzhan@amazon.com>
Co-authored-by: Aoyu <aoyuzhan@amazon.com>
```
- Support logit_bias in v1 Sampler (#13079) · 6224a9f6
  Lu Fang authored Feb 14, 2025
  
- [V1] Simplify GPUModelRunner._update_states check (#13265) · 085b7b2d
  Nick Hill authored Feb 14, 2025
  
- [WIP] TPU V1 Support Refactored (#13049) · 45f90bcb
  Alexander Matveev authored Feb 14, 2025
  
- [Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch (#13126) · b0ccfc56
  Kero Liang authored Feb 14, 2025
- [ROCm][V1] Add initial ROCm support to V1 (#12790) · ba59b78a
  Sage Moore authored Feb 13, 2025
  
- [V1] LoRA - Enable Serving Usecase (#12883) · cbc40128
  Varun Sundar Rabindranath authored Feb 14, 2025
```
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
- [V1] Consolidate MM cache size to vllm.envs (#13239) · dd5ede44
  Roger Wang authored Feb 13, 2025
  
13 Feb, 2025 3 commits
- [V1][Core] Add worker_base for v1 worker (#12816) · 2092a6fa
  Aoyu authored Feb 13, 2025
```
Signed-off-by: Aoyu <aoyuzhan@amazon.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Aoyu <aoyuzhan@amazon.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
```
- [V1] Clarify input processing and multimodal feature caching logic (#13211) · fdcf64d3
  Roger Wang authored Feb 13, 2025
  
- [V1][core] Implement pipeline parallel on Ray (#12996) · 9605c125
  Rui Qiao authored Feb 13, 2025
  
12 Feb, 2025 2 commits
- [V1][Bugfix] Copy encoder input ids to fix set iteration issue during VLM abort (#13173) · 4c0d93f4
  Murali Andoorveedu authored Feb 12, 2025
```
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
```
- [Bug] [V1] Try fetching stop_reason from EngineOutput before checking the request (#13108) · f4d97e4f
  bnellnm authored Feb 12, 2025
  
11 Feb, 2025 3 commits
- [V1][Metrics] Add several request timing histograms (#12644) · 75e6e145
  Mark McLoughlin authored Feb 11, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
- [V1][Metrics] Add GPU prefix cache hit rate % gauge (#12592) · 41c5dd45
  Cody Yu authored Feb 11, 2025
  
- [V1][Minor] Move scheduler outputs to a separate file (#13062) · 2ff48576
  Woosuk Kwon authored Feb 10, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
10 Feb, 2025 1 commit
- [V1] Use msgpack for core request serialization (#12918) · 67c4637c
  Nick Hill authored Feb 09, 2025
```
Signed-off-by: Nick Hill <nhill@redhat.com>
```
08 Feb, 2025 7 commits
- [V1] Cache `uses_mrope` in GPUModelRunner (#12969) · 24700c34
  Woosuk Kwon authored Feb 08, 2025
  
- [V1][Minor] Remove outdated comment (#12968) · 870c3748
  Woosuk Kwon authored Feb 08, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [bugfix] fix early import of flash attention (#12959) · fe743b79
  youkaichao authored Feb 09, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
- [V1][Minor] Move cascade attn logic outside _prepare_inputs (#12943) · 4ea48fb3
  Woosuk Kwon authored Feb 08, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [bugfix] respect distributed_executor_backend in world_size=1 (#12934) · 91dd8f7a
  youkaichao authored Feb 08, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
- [V1] Move KV block hashes from Request to KVCacheManager (#12922) · 32431583
  Woosuk Kwon authored Feb 07, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [V1][Minor] Remove outdated comment (#12928) · b21f0f9d
  Woosuk Kwon authored Feb 07, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
07 Feb, 2025 1 commit

- [V1] Logprobs and prompt logprobs support (#9880) · 0630d453
  afeldman-nm authored Feb 07, 2025

This PR adds support for sample logprobs and prompt logprobs to vLLM V1.

New behavior:

- During model execution, the model runner computes sample logprobs (if the user-provided logprobs setting is not None) and prompt logprobs (if the user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns three vectors: token ids, token logprob values, and token ranks. A token's rank is its 1-indexed position in the vocabulary vector after sorting the vocabulary by log probability in descending order.
- In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure, which is transferred to the engine client. If multiprocessing is enabled, sample and prompt logprobs are (de)serialized together with the EngineCoreOutput data structure.
- During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprob values, and token ranks into the OpenAI-compatible List[Dict[token id, Logprob]] format (separately for sample and prompt logprobs).
- Each Logprob instance (whether for a sample or a prompt token) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor, not the detokenizer.
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
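
The triplet-to-dictionary flow described in the commit message above can be sketched in plain Python. This is a minimal illustration, not the vLLM implementation: the `Logprob` dataclass is a local stand-in carrying the three fields named above, and `log_softmax`, `logprob_triplet`, `to_logprob_dict`, the toy vocabulary, and the choice to list the sampled token first are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Logprob:
    # Local stand-in with the three fields named in the commit message.
    logprob: float
    rank: int
    decoded_token: str


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary vector."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def logprob_triplet(logits, sampled_id, num_logprobs):
    """Return (token ids, logprob values, ranks) for one position."""
    logprobs = log_softmax(logits)
    # Sort token ids by log probability, highest first.
    by_logprob = sorted(range(len(logprobs)),
                        key=lambda i: logprobs[i], reverse=True)
    # Rank is the 1-indexed position in that descending order.
    rank_of = {tok: pos + 1 for pos, tok in enumerate(by_logprob)}
    # Assumption: sampled token first, then the top-k tokens (deduplicated).
    ids = [sampled_id] + [t for t in by_logprob[:num_logprobs]
                          if t != sampled_id]
    return ids, [logprobs[t] for t in ids], [rank_of[t] for t in ids]


def to_logprob_dict(ids, values, ranks, vocab):
    """Turn one position's triplet into a Dict[token id, Logprob]."""
    return {
        tok: Logprob(logprob=val, rank=rank, decoded_token=vocab[tok])
        for tok, val, rank in zip(ids, values, ranks)
    }


vocab = ["the", "cat", "sat", "mat"]  # toy vocabulary, hypothetical
ids, values, ranks = logprob_triplet([2.0, 0.5, 1.0, -1.0],
                                     sampled_id=1, num_logprobs=2)
# One dict per generated position, i.e. List[Dict[token id, Logprob]].
sample_logprobs = [to_logprob_dict(ids, values, ranks, vocab)]
```

Keeping the three vectors flat until output processing means only simple lists need to cross the process boundary; the per-token `Logprob` objects, including the detokenized strings, are built on the client side.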


06 Feb, 2025 2 commits

- [V1] LoRA Support (#10957) · 467a96a5
  Varun Sundar Rabindranath authored Feb 06, 2025

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>


- [Attention] Use FA3 for MLA on Hopper (#12807) · c786e757
  Lucas Wilkinson authored Feb 06, 2025
```
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
```

05 Feb, 2025 3 commits
- [VLM] Qwen2.5-VL · bf3b79ef
  Roger Wang authored Feb 05, 2025
  
- [V1][Misc] Shorten `FinishReason` enum and use constant strings (#12760) · 3d09e592
  Nick Hill authored Feb 04, 2025
  
- [V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579) · 233df6f5
  Mark McLoughlin authored Feb 05, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
04 Feb, 2025 2 commits
- [V1] Remove scheduling constraint on partial requests (#12674) · 18a88fcc
  Woosuk Kwon authored Feb 04, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [Core] Improve hash collision avoidance in prefix caching (#12621) · 73b35cca
  Russell Bryant authored Feb 03, 2025
```
Signed-off-by: Russell Bryant <rbryant@redhat.com>
```
03 Feb, 2025 1 commit
- [V1] Revert `uncache_blocks` and support recaching full blocks (#12415) · 5095e966
  Cody Yu authored Feb 03, 2025
```
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
```
02 Feb, 2025 3 commits

- [Misc] Add SPDX-License-Identifier headers to python source files (#12628) · e489ad7a
  Russell Bryant authored Feb 02, 2025

- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files
    
    This commit adds SPDX license headers to python source files as
    recommended to the project by the Linux Foundation. These headers
    provide a concise way that is both human and machine readable for
    communicating license information for each source file. It helps
    avoid any ambiguity about the license of the code and can also be
    easily used by tools to help manage license compliance.

    The Linux Foundation runs license scans against the codebase to help
    ensure we are in compliance with the licenses of the code we use,
    including dependencies. Having these headers in place helps that
    tool do its job.
    
    More information can be found on the SPDX site:
    
    - https://spdx.dev/learn/handling-license-info/

Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------
Signed-off-by: Russell Bryant <rbryant@redhat.com>
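
A check for the header described above can be quite small. The sketch below is not the pre-commit hook added by this PR; the function name `find_spdx_identifier`, the regular expression, and the five-line search window are assumptions, and `Apache-2.0` appears only as an example identifier.

```python
import re

# Matches a Python comment line such as "# SPDX-License-Identifier: Apache-2.0".
SPDX_RE = re.compile(r"^#\s*SPDX-License-Identifier:\s*(\S+)")


def find_spdx_identifier(source, max_lines=5):
    """Return the SPDX identifier found near the top of `source`, or None.

    Only the first `max_lines` lines are inspected so that a shebang or
    encoding line may precede the header (the window size is an assumption).
    """
    for line in source.splitlines()[:max_lines]:
        match = SPDX_RE.match(line.strip())
        if match:
            return match.group(1)
    return None


with_header = "# SPDX-License-Identifier: Apache-2.0\nimport os\n"
without_header = "import os\n"
print(find_spdx_identifier(with_header))
print(find_spdx_identifier(without_header))
```

A hook built on a function like this would typically run it over each staged `.py` file and fail when it returns `None`.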


- [Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608) · f8ece6e1
  Shawn Du authored Feb 02, 2025

As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254, this PR combines allocate_slots and append_slots into a single method.

There should be no functional change, except that decode now also raises an exception when num_tokens is zero (as prefill already does); the unit test case is changed accordingly.

@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo

---------
Signed-off-by: Shawn Du <shawnd200@outlook.com>
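
The idea of one allocation path shared by prefill and decode can be sketched with a toy block manager. Everything here except the method name `allocate_slots` and the zero-token check is hypothetical: `ToyKVCacheManager`, its bookkeeping, the `None` return on exhaustion, and the choice of `ValueError` are assumptions for illustration, not the vLLM code.

```python
class ToyKVCacheManager:
    """Toy block allocator; prefill and decode share one entry point."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.req_blocks = {}  # request id -> block ids held by the request
        self.req_tokens = {}  # request id -> tokens that already have slots

    def allocate_slots(self, req_id, num_tokens):
        """Reserve slots for `num_tokens` more tokens of `req_id`.

        Returns the newly allocated block ids (possibly an empty list),
        or None if there are not enough free blocks.
        """
        if num_tokens == 0:
            # Same check for prefill and decode (exception type is assumed).
            raise ValueError("num_tokens must be greater than 0")
        held = self.req_blocks.get(req_id, [])
        total = self.req_tokens.get(req_id, 0) + num_tokens
        # Ceiling division: blocks needed to cover all tokens so far.
        needed = -(-total // self.block_size) - len(held)
        if needed > len(self.free_blocks):
            return None
        new_blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.req_blocks[req_id] = held + new_blocks
        self.req_tokens[req_id] = total
        return new_blocks


manager = ToyKVCacheManager(num_blocks=4, block_size=16)
prefill_blocks = manager.allocate_slots("req-0", 20)  # prefill: 20 prompt tokens
decode_blocks = manager.allocate_slots("req-0", 1)    # decode: one more token
```

Because the decode step is just the prefill computation with a smaller `num_tokens`, a single method covers both cases, and the zero-token check applies uniformly.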


- [V1][Minor] Avoid frequently creating ConstantList (#12653) · abfcdcdf
  Woosuk Kwon authored Feb 01, 2025

A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
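
The optimization described above amounts to building the read-only view once and handing out the same object on every access. In this sketch, `ReadOnlyView` and `ToyRequest` are hypothetical stand-ins for `ConstantList` and the request class; only the attribute name `kv_block_hashes` comes from the commit message.

```python
from collections.abc import Sequence


class ReadOnlyView(Sequence):
    """Read-only wrapper around a list (stand-in for ConstantList)."""

    def __init__(self, data):
        self._data = data  # wraps the list without copying it

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self):
        return len(self._data)


class ToyRequest:
    def __init__(self):
        self._kv_block_hashes = []
        # Created once here instead of on every attribute access.
        self.kv_block_hashes = ReadOnlyView(self._kv_block_hashes)

    def append_kv_block_hash(self, block_hash):
        # Mutations go through the owner; the cached view sees them
        # because it wraps the same underlying list.
        self._kv_block_hashes.append(block_hash)


request = ToyRequest()
view = request.kv_block_hashes
request.append_kv_block_hash("hash-0")
```

The view stays valid after later appends because it holds a reference to the underlying list rather than a snapshot, so no new wrapper is needed per access.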
