Commits · f5dda63eb5fcb5624b93fa5f09da01d5372bbce4 · OpenDAS / vllm_cscc

21 Jun, 2024 1 commit
- [LoRA] Add support for pinning lora adapters in the LRU cache (#5603) · f5dda63e
  rohithkrn authored Jun 21, 2024
  
  f5dda63e
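
The commit above adds pinning to an LRU cache of LoRA adapters. As a rough illustration of the general idea only (this is not the code from the commit, and the class and method names below are hypothetical), an LRU cache with pinning can skip pinned entries when it picks an eviction victim:

```python
from collections import OrderedDict


class PinnableLRUCache:
    """Hypothetical LRU cache whose pinned entries are never evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first
        self._pinned = set()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._evict_oldest_unpinned()

    def pin(self, key):
        if key not in self._items:
            raise KeyError(key)
        self._pinned.add(key)

    def _evict_oldest_unpinned(self):
        # Walk from least to most recently used and drop the first unpinned key.
        for key in self._items:
            if key not in self._pinned:
                del self._items[key]
                return
        raise RuntimeError("all entries are pinned; cannot evict")


cache = PinnableLRUCache(capacity=2)
cache.put("adapter-a", 1)
cache.put("adapter-b", 2)
cache.pin("adapter-a")
cache.put("adapter-c", 3)  # "adapter-a" is oldest but pinned, so "adapter-b" goes
```

In this sketch a pinned entry still counts toward the capacity, so pinning every slot leaves no room for new entries.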
20 Jun, 2024 1 commit
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) · 8065a7e2
  Michael Goin authored Jun 20, 2024
  
  8065a7e2
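
One way to accept both `--some_flag` and `--some-flag` on the command line is to normalize option names before `argparse` sees them. The sketch below is an assumption about how such a parser could work, not the implementation from the commit above; the class name `DashOrUnderscoreParser` is hypothetical and the flag is used only as an example.

```python
import argparse
import sys


class DashOrUnderscoreParser(argparse.ArgumentParser):
    """Hypothetical parser that treats '_' and '-' in option names alike."""

    def parse_args(self, args=None, namespace=None):
        if args is None:
            args = sys.argv[1:]
        normalized = []
        for arg in args:
            if arg.startswith("--"):
                # Rewrite only the option name; leave any "=value" part untouched.
                name, sep, value = arg.partition("=")
                normalized.append(name.replace("_", "-") + sep + value)
            else:
                normalized.append(arg)
        return super().parse_args(normalized, namespace)


parser = DashOrUnderscoreParser()
parser.add_argument("--preemption-mode", default=None)

ns_dash = parser.parse_args(["--preemption-mode", "swap"])
ns_under = parser.parse_args(["--preemption_mode=swap"])
```

This approach only works if every option is registered in its dashed form, because underscores in incoming option names are always rewritten to dashes.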
19 Jun, 2024 2 commits
- [Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) · 78687504
  zifeitong authored Jun 19, 2024
  
  78687504
- [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (#5688) · afed90a0
  Michael Goin authored Jun 19, 2024
  
  afed90a0
18 Jun, 2024 1 commit

[Misc] Add OpenTelemetry support (#4687) · 7879f24d

Ronen Schaffer authored Jun 18, 2024

This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a Markdown guide with screenshots that shows users how to use this feature.

7879f24d

17 Jun, 2024 1 commit

[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) · 728c4c8a

Kunshang Ji authored Jun 18, 2024

Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

728c4c8a

15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
14 Jun, 2024 2 commits
- [Doc] Update documentation on Tensorizer (#5471) · 6e2527a7
  Sanger Steel authored Jun 14, 2024
  
  6e2527a7
- [Core] Remove duplicate processing in async engine (#5525) · 77490c6f
  Cyrus Leung authored Jun 15, 2024
  
  77490c6f
13 Jun, 2024 1 commit
- [Misc] Add vLLM version getter to utils (#5098) · 03dccc88
  Cyrus Leung authored Jun 14, 2024
  
  03dccc88
12 Jun, 2024 1 commit
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
  1a8bfd92
11 Jun, 2024 4 commits
- [Frontend] Customizable RoPE theta (#5197) · dcbf4286
  sasha0552 authored Jun 11, 2024
  
  dcbf4286
- [Bugfix] fix lora_dtype value type in arg_utils.py (#5398) · 00e6a2dc
  Ali Panahi authored Jun 11, 2024
  
  00e6a2dc
- [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312) · 351d5e7b
  maor-ps authored Jun 11, 2024
```
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  351d5e7b
- [Misc] Various simplifications and typing fixes (#5368) · a0086298
  Nick Hill authored Jun 10, 2024
  
  a0086298
05 Jun, 2024 3 commits
- [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. (#5290) · 0f83ddd4
  Alex Wu authored Jun 05, 2024
  
  0f83ddd4
- [Bugfix] Make EngineArgs use named arguments for config construction (#5285) · 065aff6c
  Michael Goin authored Jun 05, 2024
  
  065aff6c
- [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226) · 974fc9b8
  zifeitong authored Jun 04, 2024
  
  974fc9b8
03 Jun, 2024 2 commits
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) · 10c38e3e
  Kaiyang Chen authored Jun 04, 2024
  
  10c38e3e
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
  7a64d24a
01 Jun, 2024 1 commit
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
  b9c0605a
30 May, 2024 1 commit
- [Doc] Use intersphinx and update entrypoints docs (#5125) · a9bcc7af
  Cyrus Leung authored May 31, 2024
  
  a9bcc7af
29 May, 2024 1 commit
- [Bugfix] Remove the last EOS token unless explicitly specified (#5077) · dfba529b
  Junichi Sato authored May 29, 2024
  
  dfba529b
28 May, 2024 2 commits
- [Core] Consolidate prompt arguments to LLM engines (#4328) · 5ae5ed1e
  Cyrus Leung authored May 29, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  5ae5ed1e
- [Core] Sliding window for block manager v2 (#4545) · d4f39859
  Michał Moskal authored May 27, 2024
```
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
```
  d4f39859
27 May, 2024 1 commit

[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2

Zhuohan Li authored May 27, 2024


Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

1102bef2

22 May, 2024 3 commits
- [Core] Eliminate parallel worker per-step task scheduling overhead (#4894) · eb6d3c26
  Nick Hill authored May 22, 2024
  
  eb6d3c26
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (via the .kv_scale parameter).
```
  a3a73ab0
- [Frontend] Dynamic RoPE scaling (#4638) · 9b9a10d6
  sasha0552 authored May 22, 2024
  
  9b9a10d6
21 May, 2024 1 commit
- [Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935) · 14772eeb
  Kante Yin authored May 22, 2024
```
Signed-off-by: kerthcet <kerthcet@gmail.com>
```
  14772eeb
18 May, 2024 1 commit

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow up of https://github.com/vllm-project/vllm/pull/3095/files

2e9a2227
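
The description above mentions a rotary embedding layer that is aware of several cos-sin caches, one per scaling factor. A minimal pure-Python sketch of that idea follows. It assumes linear position scaling and a NeoX-style half-split rotation, and every function name in it is hypothetical. It illustrates the lookup scheme only and is not the kernel from the commit.

```python
import math


def build_cos_sin_cache(max_pos, rot_dim, base, scale):
    """One (cos, sin) row per position, with positions divided by `scale`."""
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    rows = []
    for pos in range(max_pos):
        angles = [(pos / scale) * f for f in inv_freq]
        rows.append(([math.cos(a) for a in angles],
                     [math.sin(a) for a in angles]))
    return rows


def build_batched_cache(scales, max_pos, rot_dim, base=10000.0):
    """Concatenate the per-scale caches and remember where each one starts."""
    table, offsets = [], {}
    for s in scales:
        offsets[s] = len(table)
        table.extend(build_cos_sin_cache(int(max_pos * s), rot_dim, base, s))
    return table, offsets


def apply_rotary_batched(vectors, positions, token_scales, table, offsets):
    """Rotate each vector using the cache that matches its scaling factor."""
    out = []
    for vec, pos, s in zip(vectors, positions, token_scales):
        cos, sin = table[offsets[s] + pos]  # single lookup into the joint table
        half = len(vec) // 2
        rotated = [0.0] * len(vec)
        for i in range(half):
            x1, x2 = vec[i], vec[i + half]
            rotated[i] = x1 * cos[i] - x2 * sin[i]
            rotated[i + half] = x1 * sin[i] + x2 * cos[i]
        out.append(rotated)
    return out


table, offsets = build_batched_cache([1.0, 2.0], max_pos=8, rot_dim=4)
vec = [1.0, 2.0, 3.0, 4.0]
out = apply_rotary_batched([vec, vec, vec], [0, 2, 4], [1.0, 1.0, 2.0],
                           table, offsets)
```

Because each token carries its own offset into one joint table, tokens that belong to different scaling factors can be processed in the same pass instead of one call per factor.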

15 May, 2024 2 commits

[Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) · a5675d34

zifeitong authored May 15, 2024

a5675d34

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) · 65bf2ac1

SangBin Cho authored May 15, 2024

This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows it to be sliced when running the attention backend.

It also refactors subquery_start_loc, which was not refactored in the previous PR.

65bf2ac1

14 May, 2024 1 commit
- [Core] Add MultiprocessingGPUExecutor (#4539) · 676a9998
  Nick Hill authored May 14, 2024
```
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
```
  676a9998
13 May, 2024 2 commits
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208) · 8bc68e19
  Sanger Steel authored May 13, 2024
  
  8bc68e19
- [Scheduler] Warning upon preemption and Swapping (#4647) · e7c46b95
  SangBin Cho authored May 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  e7c46b95
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
09 May, 2024 1 commit
- [Frontend] Move async logic outside of constructor (#4674) · f12b20de
  Cyrus Leung authored May 09, 2024
  
  f12b20de
08 May, 2024 1 commit
- [Dynamic Spec Decoding] Auto-disable by the running queue size (#4592) · f942efb5
  Cody Yu authored May 08, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
  f942efb5
06 May, 2024 1 commit
- [Bugfix] Fix `asyncio.Task` not being subscriptable (#4623) · 323f27b9
  Cyrus Leung authored May 07, 2024
  
  323f27b9