Commits · f1042e86f05cfe93bcadac445e78671ed2e8fddb · OpenDAS / vllm_cscc

11 Feb, 2025 3 commits
- [V1][Metrics] Add several request timing histograms (#12644) · 75e6e145
  Mark McLoughlin authored Feb 11, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
- [V1][Metrics] Add GPU prefix cache hit rate % gauge (#12592) · 41c5dd45
  Cody Yu authored Feb 11, 2025
  
- [V1][Minor] Move scheduler outputs to a separate file (#13062) · 2ff48576
  Woosuk Kwon authored Feb 10, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
10 Feb, 2025 1 commit
- [V1] Use msgpack for core request serialization (#12918) · 67c4637c
  Nick Hill authored Feb 09, 2025
```
Signed-off-by: Nick Hill <nhill@redhat.com>
```
08 Feb, 2025 7 commits
- [V1] Cache `uses_mrope` in GPUModelRunner (#12969) · 24700c34
  Woosuk Kwon authored Feb 08, 2025
  
- [V1][Minor] Remove outdated comment (#12968) · 870c3748
  Woosuk Kwon authored Feb 08, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [bugfix] fix early import of flash attention (#12959) · fe743b79
  youkaichao authored Feb 09, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
- [V1][Minor] Move cascade attn logic outside _prepare_inputs (#12943) · 4ea48fb3
  Woosuk Kwon authored Feb 08, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [bugfix] respect distributed_executor_backend in world_size=1 (#12934) · 91dd8f7a
  youkaichao authored Feb 08, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
- [V1] Move KV block hashes from Request to KVCacheManager (#12922) · 32431583
  Woosuk Kwon authored Feb 07, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [V1][Minor] Remove outdated comment (#12928) · b21f0f9d
  Woosuk Kwon authored Feb 07, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
07 Feb, 2025 1 commit

[V1] Logprobs and prompt logprobs support (#9880) · 0630d453

afeldman-nm authored Feb 07, 2025



This PR adds support for sample logprobs and prompt logprobs to vLLM V1.

New behavior:

- During model execution, the model runner computes sample logprobs (if the user-provided logprobs setting is not None) and prompt logprobs (if the user-provided prompt_logprobs setting is not None). For both, the engine core returns three vectors: token ids, token logprob values, and token ranks. A token's rank is its 1-indexed position in the vocabulary after sorting the vocabulary by log probability in descending order.
- In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure, which is transferred to the engine client. If multiprocessing is enabled, sample and prompt logprobs are (de)serialized along with the EngineCoreOutput data structure.
- During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprob values, and token ranks into the OpenAI-compatible List[Dict[token id, Logprob]] format (for sample and prompt logprobs respectively).
- Each Logprob instance (whether sample or prompt) consists of a token's log probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor, not the detokenizer.

Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
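
The three vectors and their conversion into a per-position dictionary can be sketched as follows. This is a minimal pure-Python illustration, not the vLLM implementation: the `Logprob` dataclass, `top_logprobs`, `to_logprob_dict`, the toy vocabulary, and the choice to list the sampled token first are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Logprob:
    # Hypothetical stand-in for the Logprob record described above.
    logprob: float
    rank: int
    decoded_token: str


def log_softmax(logits):
    """Convert raw logits to log probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def top_logprobs(logits, sampled_id, k):
    """Return (token_ids, logprobs, ranks) for the sampled token and the top-k tokens."""
    logprobs = log_softmax(logits)
    # Sort the vocabulary by log probability, descending; rank is the 1-indexed position.
    order = sorted(range(len(logprobs)), key=lambda t: logprobs[t], reverse=True)
    rank_of = {tok: pos + 1 for pos, tok in enumerate(order)}
    # Assumption of this sketch: the sampled token comes first, without duplication.
    ids = [sampled_id] + [t for t in order[:k] if t != sampled_id]
    return ids, [logprobs[t] for t in ids], [rank_of[t] for t in ids]


def to_logprob_dict(ids, values, ranks, vocab):
    """Fold the three vectors into a {token_id: Logprob} mapping for one position."""
    return {t: Logprob(v, r, vocab[t]) for t, v, r in zip(ids, values, ranks)}


vocab = ["a", "b", "c", "d"]  # toy vocabulary used for "detokenization"
ids, values, ranks = top_logprobs([1.0, 3.0, 2.0, 0.0], sampled_id=2, k=2)
position = to_logprob_dict(ids, values, ranks, vocab)
```

Here the sampled token `2` has the second-highest logit, so it is returned with rank 2 alongside token `1` with rank 1.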


06 Feb, 2025 2 commits

[V1] LoRA Support (#10957) · 467a96a5

Varun Sundar Rabindranath authored Feb 06, 2025


Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>


[Attention] Use FA3 for MLA on Hopper (#12807) · c786e757
Lucas Wilkinson authored Feb 06, 2025
```
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
```

05 Feb, 2025 3 commits
- [VLM] Qwen2.5-VL · bf3b79ef
  Roger Wang authored Feb 05, 2025
  
- [V1][Misc] Shorten `FinishReason` enum and use constant strings (#12760) · 3d09e592
  Nick Hill authored Feb 04, 2025
  
- [V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579) · 233df6f5
  Mark McLoughlin authored Feb 05, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
04 Feb, 2025 2 commits
- [V1] Remove scheduling constraint on partial requests (#12674) · 18a88fcc
  Woosuk Kwon authored Feb 04, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [Core] Improve hash collision avoidance in prefix caching (#12621) · 73b35cca
  Russell Bryant authored Feb 03, 2025
```
Signed-off-by: Russell Bryant <rbryant@redhat.com>
```
03 Feb, 2025 1 commit
- [V1] Revert `uncache_blocks` and support recaching full blocks (#12415) · 5095e966
  Cody Yu authored Feb 03, 2025
```
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
```
02 Feb, 2025 3 commits

[Misc] Add SPDX-License-Identifier headers to python source files (#12628) · e489ad7a

Russell Bryant authored Feb 02, 2025

- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files

    This commit adds SPDX license headers to python source files as
    recommended to the project by the Linux Foundation. These headers
    provide a concise way that is both human and machine readable for
    communicating license information for each source file. It helps
    avoid any ambiguity about the license of the code and can also be
    easily used by tools to help manage license compliance.

    The Linux Foundation runs license scans against the codebase to help
    ensure we are in compliance with the licenses of the code we use,
    including dependencies. Having these headers in place helps that
    tool do its job.

    More information can be found on the SPDX site:

    - https://spdx.dev/learn/handling-license-info/

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------
Signed-off-by: Russell Bryant <rbryant@redhat.com>
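
A header check of the kind a pre-commit hook could run might look like the sketch below. It is illustrative only: the helper names are hypothetical, and the exact header text (`Apache-2.0`) and the shebang handling are assumptions, not a copy of the project's actual hook.

```python
SPDX_HEADER = "# SPDX-License-Identifier: Apache-2.0"  # assumed header text


def has_spdx_header(source: str) -> bool:
    """True if the header is the first line, or the first line after a shebang."""
    lines = source.splitlines()
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]
    return bool(lines) and lines[0].strip() == SPDX_HEADER


def add_spdx_header(source: str) -> str:
    """Return the source with the header added, leaving compliant files unchanged."""
    if has_spdx_header(source):
        return source
    lines = source.splitlines(keepends=True)
    if lines and lines[0].startswith("#!"):
        # Keep the shebang on line 1 and make sure it ends with a newline.
        shebang = lines[0] if lines[0].endswith("\n") else lines[0] + "\n"
        return shebang + SPDX_HEADER + "\n" + "".join(lines[1:])
    return SPDX_HEADER + "\n" + source
```

A hook built on these helpers would fail the commit when `has_spdx_header` returns False for any staged Python file, or rewrite the file with `add_spdx_header`.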


[Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608) · f8ece6e1

Shawn Du authored Feb 02, 2025

As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254, this PR combines allocate_slots and append_slots.

There should be no functional change, except that decode now also
raises an exception when num_tokens is zero (as prefill already does);
the unit test case is changed accordingly.

@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo

---------
Signed-off-by: Shawn Du <shawnd200@outlook.com>


[V1][Minor] Avoid frequently creating ConstantList (#12653) · abfcdcdf

Woosuk Kwon authored Feb 01, 2025



A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
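
The idea can be sketched as building the read-only view once and reusing it, rather than wrapping the list on every access. `ReadOnlyView` below is a hypothetical stand-in for `ConstantList`, and the `Request` class is a toy model written for this example, not the vLLM class.

```python
from collections.abc import Sequence


class ReadOnlyView(Sequence):
    """Minimal read-only wrapper around a list (stand-in for ConstantList)."""

    def __init__(self, items):
        self._items = items

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


class Request:
    def __init__(self):
        self._kv_block_hashes = []
        # Build the view once; it still reflects later appends to the backing list.
        self.kv_block_hashes = ReadOnlyView(self._kv_block_hashes)

    def append_kv_block_hash(self, block_hash):
        self._kv_block_hashes.append(block_hash)


req = Request()
view = req.kv_block_hashes
req.append_kv_block_hash("h0")
```

Because the view wraps the backing list rather than copying it, repeated reads of `req.kv_block_hashes` return the same object and see new hashes without any allocation.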


01 Feb, 2025 1 commit

[V1] Bugfix: Validate Model Input Length (#12600) · b1340f9d

Robert Shaw authored Jan 31, 2025

SUMMARY:
* Avoid crashing the engine when an input is longer than max_model_len

FIX #12567
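
A check of this kind can be sketched as a plain length comparison that rejects the request before it reaches the engine. The function name, the error type, and the strict `>` comparison are assumptions made for this example.

```python
def validate_prompt_length(prompt_token_ids, max_model_len):
    """Raise ValueError for an over-long prompt instead of passing it to the engine."""
    num_tokens = len(prompt_token_ids)
    if num_tokens > max_model_len:
        raise ValueError(
            f"Prompt has {num_tokens} tokens, which exceeds "
            f"max_model_len={max_model_len}.")
    return num_tokens
```

Raising a per-request error lets the caller report the problem for that request only, while other requests keep running.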


31 Jan, 2025 1 commit

[v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603) · 89003c40

Chen Zhang authored Feb 01, 2025



This PR adds extra keys to the block hash so that two blocks with the
same token string but different extra_keys in their parent blocks get
different hash values. For example, it generates different hash values
for the second block of the following two requests:
```python
# make_request is assumed to be the helper from the prefix caching unit tests.
# Both requests share the same tokens; only the first multimodal hash differs.
request1 = make_request(
    request_id=0,
    prompt_token_ids=list(range(6)),
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=list(range(6)),
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash3", "hash2"],
)
```
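
Why the second block's hash changes even though its own extra key (`hash2`) is identical can be seen in a small chained-hash sketch. This is not the vLLM code: `hash_block` and `hash_request` are hypothetical helpers, SHA-256 is swapped in as the hash function, and a block size of 3 is assumed so that each multimodal item fills exactly one block.

```python
import hashlib


def hash_block(parent_hash, token_ids, extra_keys=None):
    """Hash one block from its parent's hash, its tokens, and optional extra keys."""
    payload = repr((parent_hash, tuple(token_ids), tuple(extra_keys or ())))
    return hashlib.sha256(payload.encode()).hexdigest()


def hash_request(token_ids, block_size, mm_positions, mm_hashes):
    """Return the chained hashes of a request's full blocks."""
    hashes, parent = [], None
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        # Extra keys: hashes of the multimodal items that overlap this block.
        extra = [h for pos, h in zip(mm_positions, mm_hashes)
                 if pos["offset"] < end and pos["offset"] + pos["length"] > start]
        parent = hash_block(parent, token_ids[start:end], extra)
        hashes.append(parent)
    return hashes


positions = [{"offset": 0, "length": 3}, {"offset": 3, "length": 3}]
h1 = hash_request(list(range(6)), 3, positions, ["hash1", "hash2"])
h2 = hash_request(list(range(6)), 3, positions, ["hash3", "hash2"])
```

The first blocks differ because their extra keys differ (`hash1` vs `hash3`), and the second blocks differ because each one folds in its parent's hash.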

---------
Signed-off-by: Chen Zhang <zhangch99@outlook.com>


30 Jan, 2025 2 commits
- [V1][Log] Add max request concurrency log to V1 (#12569) · 4078052f
  Michael Goin authored Jan 30, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
- [V1][Metrics] Add GPU cache usage % gauge (#12561) · f17f1d46
  Mark McLoughlin authored Jan 30, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
29 Jan, 2025 2 commits
- [V1][BugFix] Free encoder cache for aborted requests (#12545) · e0cc5f25
  Woosuk Kwon authored Jan 29, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [V1][Metrics] Add TTFT and TPOT histograms (#12530) · 46fb0567
  Mark McLoughlin authored Jan 29, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
28 Jan, 2025 4 commits
- [V1][Metrics] Add per-request prompt/generation_tokens histograms (#12516) · c386c43c
  Mark McLoughlin authored Jan 28, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
- [V1][Metrics] Hook up IterationStats for Prometheus metrics (#12478) · 3fd1fb63
  Mark McLoughlin authored Jan 28, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
- [V1] Include Engine Version in Logs (#12496) · e29d4358
  Robert Shaw authored Jan 28, 2025
```
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
```
- Update `pre-commit` hooks (#12475) · 823ab796
  Harry Mellor authored Jan 28, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
27 Jan, 2025 3 commits
- [V1][Metrics] Add initial Prometheus logger (#12416) · 01ba9270
  Mark McLoughlin authored Jan 27, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
- [V1][Minor] Minor optimizations for update_from_output (#12454) · 624a1e47
  Woosuk Kwon authored Jan 27, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- [V1] Avoid list creation in input preparation (#12457) · 28e07508
  Woosuk Kwon authored Jan 26, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
26 Jan, 2025 2 commits
- [V1][Bugfix] Fix assertion when mm hashing is turned off (#12439) · 0ee349b5
  Roger Wang authored Jan 26, 2025
```
Signed-off-by: Roger Wang <ywang@roblox.com>
```
- [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (#12094) · fa63e710
  Keyun Tong authored Jan 26, 2025
```
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
```
24 Jan, 2025 2 commits
- [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375) · ab5bbf5a
  Lucas Wilkinson authored Jan 24, 2025
```
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
```
- [V1][Frontend] Coalesce bunched `RequestOutput`s (#12298) · 24b0205f
  Nick Hill authored Jan 23, 2025
```
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```