Commits · fd95e026e0f9f50bacf1a63ef419df8bacfc99c0 · OpenDAS / vllm_cscc

06 Aug, 2024 1 commit

[Core] Subclass ModelRunner to support cross-attention & encoder sequences... · fd95e026

afeldman-nm authored Aug 06, 2024


[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>

fd95e026

05 Aug, 2024 2 commits
- [Core] Support loading GGUF model (#5191) · 360bd67c
  Isotr0py authored Aug 06, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  360bd67c
- [Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963) · 82a1b1a8
  Cade Daniel authored Aug 05, 2024
  
  82a1b1a8
30 Jul, 2024 1 commit
- [Doc] Super tiny fix doc typo (#6949) · f0584036
  fzyzcjy authored Jul 31, 2024
  
  f0584036
27 Jul, 2024 1 commit
- [Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba (#6784) · ed94e4f4
  tomeras91 authored Jul 27, 2024
  
  ed94e4f4
23 Jul, 2024 3 commits
- [bitsandbytes]: support read bnb pre-quantized model (#5753) · 87525fab
  dongmao zhang authored Jul 23, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  87525fab
- support ignore patterns in model loader (#6673) · 3eda4ec7
  Simon Mo authored Jul 22, 2024
  
  3eda4ec7
- [Misc] Enable chunked prefill by default for long context models (#6666) · 729171ae
  Woosuk Kwon authored Jul 22, 2024
  
  729171ae
22 Jul, 2024 1 commit
- [Frontend] Refactor prompt processing (#4028) · 739b61a3
  Cyrus Leung authored Jul 23, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  739b61a3
21 Jul, 2024 1 commit
- [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both... · 14f91fe6
  sroy745 authored Jul 20, 2024
```
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485)
```
  14f91fe6
20 Jul, 2024 1 commit
- [Core] Allow specifying custom Executor (#6557) · 7bd82002
  Antoni Baum authored Jul 19, 2024
  
  7bd82002
18 Jul, 2024 1 commit
- [core][model] yet another cpu offload implementation (#6496) · 1c27d25f
  youkaichao authored Jul 17, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  1c27d25f
09 Jul, 2024 1 commit

[CORE] Adding support for insertion of soft-tuned prompts (#4645) · 4d6ada94

Swapnil Parekh authored Jul 09, 2024


Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

4d6ada94

03 Jul, 2024 1 commit

[vlm] Remove vision language config. (#6089) · d9e98f42

xwjiang2010 authored Jul 03, 2024


Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

d9e98f42

02 Jul, 2024 1 commit

[VLM] Remove `image_input_type` from VLM config (#5852) · 98d6682c

xwjiang2010 authored Jul 02, 2024


Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

98d6682c

01 Jul, 2024 1 commit
- [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348) · 80ca1e6a
  sroy745 authored Jul 01, 2024
  
  80ca1e6a
28 Jun, 2024 2 commits
- [Hardware][Intel] OpenVINO vLLM backend (#5379) · 57f09a41
  Ilya Lavrenov authored Jun 28, 2024
  
  57f09a41
- [Core] Registry for processing model inputs (#5214) · 5cbe8d15
  Cyrus Leung authored Jun 28, 2024
```
Co-authored-by: ywang96 <ywang@roblox.com>
```
  5cbe8d15
25 Jun, 2024 1 commit
- [Speculative Decoding] Support draft model on different tensor-parallel size... · 2ce5d668
  Woo-Yeon Lee authored Jun 25, 2024
```
 [Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414)
```
  2ce5d668
20 Jun, 2024 1 commit
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) · 8065a7e2
  Michael Goin authored Jun 20, 2024
  
  8065a7e2
19 Jun, 2024 1 commit
- [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (#5688) · afed90a0
  Michael Goin authored Jun 19, 2024
  
  afed90a0
18 Jun, 2024 1 commit

[Misc] Add OpenTelemetry support (#4687) · 7879f24d

Ronen Schaffer authored Jun 18, 2024

This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a Markdown document with screenshots to guide users on how to use this feature.

7879f24d

17 Jun, 2024 1 commit

[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) · 728c4c8a

Kunshang Ji authored Jun 18, 2024

Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

728c4c8a

14 Jun, 2024 1 commit
- [Doc] Update documentation on Tensorizer (#5471) · 6e2527a7
  Sanger Steel authored Jun 14, 2024
  
  6e2527a7
12 Jun, 2024 1 commit
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
  1a8bfd92
11 Jun, 2024 3 commits
- [Frontend] Customizable RoPE theta (#5197) · dcbf4286
  sasha0552 authored Jun 11, 2024
  
  dcbf4286
- [Bugfix] fix lora_dtype value type in arg_utils.py (#5398) · 00e6a2dc
  Ali Panahi authored Jun 11, 2024
  
  00e6a2dc
- [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312) · 351d5e7b
  maor-ps authored Jun 11, 2024
```
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  351d5e7b
05 Jun, 2024 1 commit
- [Bugfix] Make EngineArgs use named arguments for config construction (#5285) · 065aff6c
  Michael Goin authored Jun 05, 2024
  
  065aff6c
03 Jun, 2024 2 commits
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) · 10c38e3e
  Kaiyang Chen authored Jun 04, 2024
  
  10c38e3e
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
  7a64d24a
01 Jun, 2024 1 commit
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
  b9c0605a
28 May, 2024 1 commit
- [Core] Sliding window for block manager v2 (#4545) · d4f39859
  Michał Moskal authored May 27, 2024
```
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
```
  d4f39859
27 May, 2024 1 commit

[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2

Zhuohan Li authored May 27, 2024


Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

1102bef2

22 May, 2024 2 commits
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with .kv_scale parameter).
```
  a3a73ab0
- [Frontend] Dynamic RoPE scaling (#4638) · 9b9a10d6
  sasha0552 authored May 22, 2024
  
  9b9a10d6
21 May, 2024 1 commit
- [Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935) · 14772eeb
  Kante Yin authored May 22, 2024
```
Signed-off-by: kerthcet <kerthcet@gmail.com>
```
  14772eeb
18 May, 2024 1 commit

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently we need to call the rotary embedding kernel for each LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files

2e9a2227

15 May, 2024 2 commits

[Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) · a5675d34

zifeitong authored May 15, 2024

a5675d34

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode... · 65bf2ac1

SangBin Cho authored May 15, 2024

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)

This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill/decode into a single class and allows slicing it when running the attn backend.

It also refactors subquery_start_loc, which was not refactored in the previous PR.

65bf2ac1