Commits · 57f09a419c04ecec4718ea9d5be1e6f4a8cc336e · OpenDAS / vllm_cscc

28 Jun, 2024 2 commits
- [Hardware][Intel] OpenVINO vLLM backend (#5379) · 57f09a41
  Ilya Lavrenov authored Jun 28, 2024
  
- [Core] Registry for processing model inputs (#5214) · 5cbe8d15
  Cyrus Leung authored Jun 28, 2024
```
Co-authored-by: ywang96 <ywang@roblox.com>
```
25 Jun, 2024 1 commit
- [Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414) · 2ce5d668
  Woo-Yeon Lee authored Jun 25, 2024
20 Jun, 2024 1 commit
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) · 8065a7e2
  Michael Goin authored Jun 20, 2024
  
19 Jun, 2024 1 commit
- [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (#5688) · afed90a0
  Michael Goin authored Jun 19, 2024
  
18 Jun, 2024 1 commit

[Misc] Add OpenTelemetry support (#4687) · 7879f24d

Ronen Schaffer authored Jun 18, 2024

This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a Markdown guide with screenshots that shows users how to use this feature.

17 Jun, 2024 1 commit

[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) · 728c4c8a

Kunshang Ji authored Jun 18, 2024

Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>


14 Jun, 2024 1 commit
- [Doc] Update documentation on Tensorizer (#5471) · 6e2527a7
  Sanger Steel authored Jun 14, 2024
  
12 Jun, 2024 1 commit
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
11 Jun, 2024 3 commits
- [Frontend] Customizable RoPE theta (#5197) · dcbf4286
  sasha0552 authored Jun 11, 2024
  
- [Bugfix] fix lora_dtype value type in arg_utils.py (#5398) · 00e6a2dc
  Ali Panahi authored Jun 11, 2024
  
- [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312) · 351d5e7b
  maor-ps authored Jun 11, 2024
```
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
05 Jun, 2024 1 commit
- [Bugfix] Make EngineArgs use named arguments for config construction (#5285) · 065aff6c
  Michael Goin authored Jun 05, 2024
  
03 Jun, 2024 2 commits
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) · 10c38e3e
  Kaiyang Chen authored Jun 04, 2024
  
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
01 Jun, 2024 1 commit
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
28 May, 2024 1 commit
- [Core] Sliding window for block manager v2 (#4545) · d4f39859
  Michał Moskal authored May 27, 2024
```
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
```
27 May, 2024 1 commit

[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2

Zhuohan Li authored May 27, 2024


Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>


22 May, 2024 2 commits
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with .kv_scale parameter).
```
- [Frontend] Dynamic RoPE scaling (#4638) · 9b9a10d6
  sasha0552 authored May 22, 2024
  
21 May, 2024 1 commit
- [Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935) · 14772eeb
  Kante Yin authored May 22, 2024
```
Signed-off-by: kerthcet <kerthcet@gmail.com>
```
18 May, 2024 1 commit

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently we need to call the rotary embedding kernel for each LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files

15 May, 2024 2 commits

[Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) · a5675d34
zifeitong authored May 15, 2024


[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) · 65bf2ac1

SangBin Cho authored May 15, 2024

This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows slicing it when running the attention backend.

In addition, it refactors subquery_start_loc, which was not refactored in the previous PR.

14 May, 2024 1 commit
- [Core] Add MultiprocessingGPUExecutor (#4539) · 676a9998
  Nick Hill authored May 14, 2024
```
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
```
13 May, 2024 1 commit
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208) · 8bc68e19
  Sanger Steel authored May 13, 2024
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
09 May, 2024 1 commit
- [Frontend] Move async logic outside of constructor (#4674) · f12b20de
  Cyrus Leung authored May 09, 2024
  
08 May, 2024 1 commit
- [Dynamic Spec Decoding] Auto-disable by the running queue size (#4592) · f942efb5
  Cody Yu authored May 08, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
04 May, 2024 1 commit
- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
  DearPlanet authored May 05, 2024
  
03 May, 2024 2 commits
- [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None (#4586) · 7e65477e
  Michael Goin authored May 03, 2024
  
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
01 May, 2024 1 commit
- [Speculative decoding] Add ngram prompt lookup decoding (#4237) · b38e42fb
  leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
27 Apr, 2024 1 commit
- [Kernel] Full Tensor Parallelism for LoRA Layers (#3524) · eefeb164
  Austin Veselka authored Apr 27, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
23 Apr, 2024 1 commit
- [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) · 62b8aebc
  Cade Daniel authored Apr 23, 2024
  
21 Apr, 2024 1 commit
- Make initialization of tokenizer and detokenizer optional (#3748) · a37d815b
  GeauxEric authored Apr 21, 2024
```
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
20 Apr, 2024 2 commits
- Updating lm-format-enforcer version and adding links to decoding libraries in docs (#4222) · cc74b2b2
  Noam Gat authored Apr 20, 2024
  
- Fix missing docs and out of sync `EngineArgs` (#4219) · 682789d4
  Harry Mellor authored Apr 20, 2024
```
Co-authored-by: Harry Mellor <hmellor@oxts.com>
```
18 Apr, 2024 1 commit
- [Bugfix] Get available quantization methods from quantization registry (#4098) · 53b018ed
  Michael Goin authored Apr 18, 2024
  
16 Apr, 2024 1 commit
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
  