Commits · 36ea79079bc499cd8fb07d3fe82fe069564e5570 · OpenDAS / vllm_cscc

11 Oct, 2024 1 commit
- [Misc][LoRA] Support loading LoRA weights for target_modules in reg format (#9275) · 36ea7907
  Jee Jee Li authored Oct 11, 2024
  
  36ea7907
09 Oct, 2024 1 commit
- [Bugfix] Fix lora loading for Compressed Tensors in #9120 (#9179) · 21906a6f
  Ahmad Fahadh Ilyas authored Oct 09, 2024
  
  21906a6f
04 Oct, 2024 1 commit
- [Misc] Move registry to its own file (#9064) · 0e36fd49
  Cyrus Leung authored Oct 04, 2024
  
  0e36fd49
29 Sep, 2024 1 commit
- [Model][LoRA]LoRA support added for MiniCPMV2.5 (#7199) · 3d49776b
  Jee Jee Li authored Sep 29, 2024
  
  3d49776b
23 Sep, 2024 1 commit
- [Kernel][LoRA] Add assertion for punica sgmv kernels (#7585) · 9b0e3ec9
  Jee Jee Li authored Sep 24, 2024
  
  9b0e3ec9
20 Sep, 2024 1 commit
- [Core] Support Lora lineage and base model metadata management (#6315) · 260d40b5
  Jiaxin Shan authored Sep 19, 2024
  
  260d40b5
06 Sep, 2024 2 commits
- [Misc] Remove `SqueezeLLM` (#8220) · 23f32229
  Dipika Sikka authored Sep 06, 2024
  
  23f32229
- [Core] Support load and unload LoRA in api server (#6566) · db3bf7c9
  Jiaxin Shan authored Sep 05, 2024
  Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
  db3bf7c9
28 Aug, 2024 1 commit
- [Bugfix] Make torch registration of punica ops optional (#7970) · 3cdfe1f3
  bnellnm authored Aug 28, 2024
  
  3cdfe1f3
20 Aug, 2024 1 commit
- [Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) (#7685) · 6e4658c7
  Kunshang Ji authored Aug 21, 2024
  
  6e4658c7
19 Aug, 2024 1 commit
- [Core] Optimize SPMD architecture with delta + serialization optimization (#7109) · ff7ec82c
  SangBin Cho authored Aug 18, 2024
  
  ff7ec82c
16 Aug, 2024 1 commit
- [Kernel] register punica functions as torch ops (#7591) · 9f698563
  bnellnm authored Aug 16, 2024
  
  9f698563
09 Aug, 2024 1 commit
- [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971) · 57b7be0e
  William Lin authored Aug 08, 2024
  
  57b7be0e
08 Aug, 2024 1 commit
- [Bugfix] Fix LoRA with PP (#7292) · 6dffa4b0
  Murali Andoorveedu authored Aug 08, 2024
  
  6dffa4b0
06 Aug, 2024 1 commit
- [LoRA] Relax LoRA condition (#7146) · 9118217f
  Jee Jee Li authored Aug 06, 2024
  
  9118217f
05 Aug, 2024 1 commit
- [Bugfix] Specify device when loading LoRA and embedding tensors (#7129) · 89b8db6b
  Jacob Schein authored Aug 05, 2024
  Co-authored-by: Jacob Schein <jacobschein@Jacobs-MacBook-Pro-2.local>
  89b8db6b
04 Aug, 2024 1 commit
- Clean up remaining Punica C information (#7027) · f80ab352
  Jee Jee Li authored Aug 05, 2024
  
  f80ab352
03 Aug, 2024 1 commit
- [LoRA] ReplicatedLinear support LoRA (#7081) · 99d7cabd
  Jee Jee Li authored Aug 03, 2024
  
  99d7cabd
01 Aug, 2024 1 commit
- [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) · 7ecee343
  Jee Jee Li authored Aug 01, 2024
  
  7ecee343
27 Jul, 2024 1 commit
- [TPU] Support collective communications in XLA devices (#6813) · d09b94ca
  Woosuk Kwon authored Jul 26, 2024
  
  d09b94ca
22 Jul, 2024 1 commit
- [Core] Support dynamically loading Lora adapter from HuggingFace (#6234) · 42c7f66a
  Jiaxin Shan authored Jul 22, 2024
  Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
  42c7f66a
09 Jul, 2024 1 commit
- [CORE] Adding support for insertion of soft-tuned prompts (#4645) · 4d6ada94
  Swapnil Parekh authored Jul 09, 2024
  Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
  Co-authored-by: Joe G <joseph.granados@h2o.ai>
  Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
  4d6ada94
03 Jul, 2024 1 commit
- [hardware][misc] introduce platform abstraction (#6080) · 482045ee
  youkaichao authored Jul 02, 2024
  
  482045ee
02 Jul, 2024 1 commit
- [CORE] Quantized lm-head Framework (#4442) · ee93f4f9
  Qubitium-ModelCloud authored Jul 03, 2024
  Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
  Co-authored-by: ZX <zx@lbx.dev>
  ee93f4f9
01 Jul, 2024 1 commit
- [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) · 614aa512
  youkaichao authored Jun 30, 2024
  
  614aa512
30 Jun, 2024 1 commit
- [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909) · f5e73c9f
  SangBin Cho authored Jul 01, 2024
  Co-authored-by: sang <sangcho@anyscale.com>
  f5e73c9f
27 Jun, 2024 2 commits
- [Model] Add Gemma 2 (#5908) · 79c92c7c
  Woosuk Kwon authored Jun 27, 2024
  
  79c92c7c
- [Model] Add base class for LoRA-supported models (#5018) · 96354d6a
  Cyrus Leung authored Jun 27, 2024
  
  96354d6a
21 Jun, 2024 2 commits
- [LoRA] Add support for pinning lora adapters in the LRU cache (#5603) · f5dda63e
  rohithkrn authored Jun 21, 2024
  
  f5dda63e
- [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665) · 67005a07
  Jee Li authored Jun 21, 2024
  Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
  67005a07
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
10 Jun, 2024 1 commit
- [Misc] Improve error message when LoRA parsing fails (#5194) · 0bfa1c4f
  Cyrus Leung authored Jun 10, 2024
  
  0bfa1c4f
09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
07 Jun, 2024 1 commit
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
  
  ccdc490d
01 Jun, 2024 1 commit
- [Bugfix] Remove deprecated @abstractproperty (#5174) · 8279078e
  Zhuohan Li authored Jun 01, 2024
  
  8279078e
22 May, 2024 2 commits
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
  97b03000
- [misc] remove comments that were supposed to be removed (#4977) · c74c913b
  SangBin Cho authored May 22, 2024
  
  c74c913b
18 May, 2024 1 commit
- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024
  
  Currently the rotary embedding kernel must be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.
  
  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
  
  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
  
  2e9a2227
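
The idea in the commit message for #4787 (one rotary embedding pass over a batch whose tokens belong to LoRAs with different scaling factors) can be illustrated with a minimal pure-Python sketch. It is not the code from the commit: every function name here is hypothetical, and it assumes linear position scaling, a Neox-style pairing of rotated dimensions, and per-factor caches concatenated into one table and selected by a per-token offset.

```python
import math


def build_cos_sin_cache(rot_dim, max_pos, base=10000.0, scaling_factor=1.0):
    """Build one (cos, sin) row per position for a single scaling factor.

    Assumption: linear scaling, i.e. position p is rotated as p / scaling_factor
    and the cache covers max_pos * scaling_factor positions.
    """
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    cache = []
    for pos in range(int(max_pos * scaling_factor)):
        t = pos / scaling_factor
        cache.append(([math.cos(t * f) for f in inv_freq],
                      [math.sin(t * f) for f in inv_freq]))
    return cache


def build_batched_cache(rot_dim, max_pos, scaling_factors):
    """Concatenate the per-factor caches and record where each one starts."""
    table, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(table)
        table.extend(build_cos_sin_cache(rot_dim, max_pos, scaling_factor=factor))
    return table, offsets


def rotate(vec, cos, sin):
    """Rotate the pairs (vec[i], vec[i + half]) by the cached angles."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


def batched_rotary(vectors, positions, token_offsets, table):
    """Apply rotary embedding to every token in a single pass.

    token_offsets[i] selects the cache that belongs to token i's scaling factor.
    """
    out = []
    for vec, pos, off in zip(vectors, positions, token_offsets):
        cos, sin = table[off + pos]
        out.append(rotate(vec, cos, sin))
    return out


# Two tokens served in one batch: one from a LoRA with scaling factor 1.0,
# one from a long-context LoRA with scaling factor 4.0.
table, offsets = build_batched_cache(rot_dim=4, max_pos=8, scaling_factors=[1.0, 4.0])
vectors = [[1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0]]
positions = [3, 12]
token_offsets = [offsets[1.0], offsets[4.0]]
rotated = batched_rotary(vectors, positions, token_offsets, table)
```

Under these assumptions, position 12 at scaling factor 4.0 maps to the same angle as position 3 at factor 1.0, so both tokens end up with the same rotated vector. The single lookup table is what lets one call handle a batch with mixed scaling factors instead of one call per LoRA.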

08 May, 2024 1 commit
- [Core] Faster startup for LoRA enabled models (#4634) · ad932a22
  Antoni Baum authored May 08, 2024
  
  ad932a22
07 May, 2024 1 commit
- [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (#4609) · 10760da8
  Austin Veselka authored May 07, 2024
  
  10760da8