Commits · b4cf96afe493e0d22ac4fd831d8fa1f7ffa177a7 · OpenDAS / vllm_cscc

10 Dec, 2024 1 commit

test_audio: changed TEST_AUDIO_URLS to local paths; testchat: changed the path on line 360 to a local path; test_metrics: on line 197, changed base_... · 859566e0

guanyu authored Dec 10, 2024

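test_audio: changed TEST_AUDIO_URLS to local paths; testchat: changed the path on line 360 to a local path; test_metrics: on line 197, changed base_url from 0.0.0.0 to localhost; test_run_batch: restored INPUT_BATCH, INVALID_INPUT_BATCH, and INPUT_EMBEDDING_BATCH to their original format; test_tokenizer_group: changed the gpt path on line 18 to the updated path; test_braodcast: changed the model check to if llava-hf/llava-1.5-7b-hf in model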

859566e0

29 Nov, 2024 1 commit
- add VLLM_OPTEST_URLS_PORT to load https from local · 41b09879
  zhuwenwen authored Nov 29, 2024
  
  41b09879
27 Nov, 2024 1 commit
- add VLLM_OPTEST_MODELS_PATH/OPTEST_MODELS_PATH to load models from local path... · 3c9817d2
  zhuwenwen authored Nov 27, 2024
```
add VLLM_OPTEST_MODELS_PATH/OPTEST_MODELS_PATH to load models from local path instead of Hugging Face Hub
```
  3c9817d2
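
The commit above does not show how the variables are consumed, so the following is only a minimal sketch of the general idea: prefer a local mirror when one of the environment variables points at it, and otherwise keep the Hugging Face Hub identifier. The helper name `resolve_model_path`, the lookup order of the two variables, and the `<root>/<org>/<name>` directory layout are all assumptions made for illustration, not the fork's actual code.

```python
import os

# Variable names come from the commit title; the lookup order is an assumption.
ENV_VARS = ("VLLM_OPTEST_MODELS_PATH", "OPTEST_MODELS_PATH")


def resolve_model_path(model_id, env=None):
    """Return a local directory for model_id if one exists, else model_id unchanged.

    Hypothetical helper: assumes models are mirrored under <root>/<org>/<name>.
    """
    env = os.environ if env is None else env
    for var in ENV_VARS:
        root = env.get(var)
        if not root:
            continue
        candidate = os.path.join(root, *model_id.split("/"))
        if os.path.isdir(candidate):
            return candidate
    # Falling through keeps the Hub identifier, so the caller's usual loading applies.
    return model_id
```

A test could then pass `resolve_model_path("llava-hf/llava-1.5-7b-hf")` wherever it previously passed the bare model name, and it would behave as before on machines where neither variable is set.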
15 Nov, 2024 3 commits
- [fix] Fix the insufficient shared memory error in the test_punica_variation unit test · 137e8a16
  王敏 authored Nov 15, 2024
  
  137e8a16
- [fix] Revert the input-length limit change in test_long_context · 9736caa9
  王敏 authored Nov 15, 2024
  
  9736caa9
- [fix] Fix the error raised in test_long_context; the unit test still fails, and the same problem occurs on NVIDIA · 1d6cfb11
  王敏 authored Nov 15, 2024
  
  1d6cfb11
23 Sep, 2024 1 commit
- [Kernel][LoRA] Add assertion for punica sgmv kernels (#7585) · 9b0e3ec9
  Jee Jee Li authored Sep 24, 2024
  
  9b0e3ec9
18 Sep, 2024 2 commits
- [CI/Build] Update Ruff version (#8469) · 9d104b5b
  Aaron Pham authored Sep 18, 2024
```
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
```
  9d104b5b
- [CI/Build] Avoid CUDA initialization (#8534) · 6ffa3f31
  Cyrus Leung authored Sep 18, 2024
  
  6ffa3f31
04 Sep, 2024 2 commits
- [CI] Change test input in Gemma LoRA test (#8163) · 561d6f80
  Woosuk Kwon authored Sep 04, 2024
  
  561d6f80
- [CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369) · d1dec642
  alexeykondrat authored Sep 04, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  d1dec642
23 Aug, 2024 1 commit
- [Core] Add multi-step support to LLMEngine (#7789) · 9db93de2
  Alexander Matveev authored Aug 23, 2024
  
  9db93de2
16 Aug, 2024 1 commit
- [Misc/Testing] Use `torch.testing.assert_close` (#7324) · 50b8d08d
  jon-chuang authored Aug 15, 2024
  
  50b8d08d
14 Aug, 2024 1 commit
- [CI/Build]Reduce the time consumption for LoRA tests (#7396) · 97992802
  Jee Jee Li authored Aug 14, 2024
  
  97992802
06 Aug, 2024 1 commit
- [LoRA] Relax LoRA condition (#7146) · 9118217f
  Jee Jee Li authored Aug 06, 2024
  
  9118217f
03 Aug, 2024 1 commit
- [LoRA] ReplicatedLinear support LoRA (#7081) · 99d7cabd
  Jee Jee Li authored Aug 03, 2024
  
  99d7cabd
01 Aug, 2024 1 commit
- [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) · 7ecee343
  Jee Jee Li authored Aug 01, 2024
  
  7ecee343
22 Jul, 2024 1 commit
- [Core] Support dynamically loading Lora adapter from HuggingFace (#6234) · 42c7f66a
  Jiaxin Shan authored Jul 22, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  42c7f66a
09 Jul, 2024 1 commit

[CORE] Adding support for insertion of soft-tuned prompts (#4645) · 4d6ada94

Swapnil Parekh authored Jul 09, 2024


Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

4d6ada94

02 Jul, 2024 1 commit
- [CORE] Quantized lm-head Framework (#4442) · ee93f4f9
  Qubitium-ModelCloud authored Jul 03, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
```
  ee93f4f9
30 Jun, 2024 1 commit
- [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909) · f5e73c9f
  SangBin Cho authored Jul 01, 2024
```
Co-authored-by: sang <sangcho@anyscale.com>
```
  f5e73c9f
29 Jun, 2024 1 commit
- [Kernel] Add punica dimensions for Granite 3b and 8b (#5930) · ba499444
  Joe Runde authored Jun 28, 2024
```
Signed-off-by: Joe Runde <joe@joerun.de>
```
  ba499444
21 Jun, 2024 3 commits
- [LoRA] Add support for pinning lora adapters in the LRU cache (#5603) · f5dda63e
  rohithkrn authored Jun 21, 2024
  
  f5dda63e
- [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665) · 67005a07
  Jee Li authored Jun 21, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  67005a07
- [Kernel] Add punica dimension for Qwen2 LoRA (#5441) · 1f567421
  Jinzhen Lin authored Jun 21, 2024
  
  1f567421
18 Jun, 2024 2 commits
- [Model] LoRA support added for command-r (#5178) · 07feecde
  sergey-tinkoff authored Jun 18, 2024
  
  07feecde
- [Kernel] Add punica dimensions for Granite 13b (#5559) · 5002175e
  Joe Runde authored Jun 17, 2024
```
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
```
  5002175e
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
13 Jun, 2024 1 commit
- [Core][Distributed] code deduplication in tp&pp with coordinator(#5293) · ea3890a5
  youkaichao authored Jun 12, 2024
```
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
```
  ea3890a5
10 Jun, 2024 1 commit
- [Misc] Improve error message when LoRA parsing fails (#5194) · 0bfa1c4f
  Cyrus Leung authored Jun 10, 2024
  
  0bfa1c4f
07 Jun, 2024 1 commit
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
  
  ccdc490d
28 May, 2024 1 commit
- [Core] Consolidate prompt arguments to LLM engines (#4328) · 5ae5ed1e
  Cyrus Leung authored May 29, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  5ae5ed1e
22 May, 2024 2 commits
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
  97b03000
- [misc] remove comments that were supposed to be removed (#4977) · c74c913b
  SangBin Cho authored May 22, 2024
  
  c74c913b
21 May, 2024 1 commit
- [Model] Add Phi-2 LoRA support (#4886) · f12c3b5b
  Isotr0py authored May 21, 2024
  
  f12c3b5b
18 May, 2024 1 commit

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. This adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files

2e9a2227
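
To illustrate the idea in the description above, here is a minimal pure-Python sketch of a rotary embedding that serves several scaling factors in one call. It assumes linear position scaling and a NeoX-style pairing of vector halves; every function name here is hypothetical, and this is not the kernel added by the commit. The per-factor cos/sin tables are concatenated into one cache, and each token selects its row via the offset of its scaling factor plus its position.

```python
import math


def build_cache(rotary_dim, max_pos, base, scaling_factor):
    """Rows of (cos, sin) for one scaling factor, assuming linear position scaling."""
    inv_freq = [base ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]
    rows = []
    for pos in range(int(max_pos * scaling_factor)):
        t = pos / scaling_factor  # scaled position
        rows.append(([math.cos(t * f) for f in inv_freq],
                     [math.sin(t * f) for f in inv_freq]))
    return rows


def build_batched_cache(rotary_dim, max_pos, base, scaling_factors):
    """Concatenate the per-factor caches and record where each one starts."""
    cache, offsets = [], {}
    for s in scaling_factors:
        offsets[s] = len(cache)
        cache.extend(build_cache(rotary_dim, max_pos, base, s))
    return cache, offsets


def rotate(vec, cos, sin):
    """Rotate pairs (x[i], x[i + half]) by the cached angles."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    return ([a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)] +
            [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)])


def batched_rotary(queries, positions, token_scaling, cache, offsets):
    """One pass over all tokens, whatever scaling factor each token's LoRA uses."""
    out = []
    for q, pos, s in zip(queries, positions, token_scaling):
        cos, sin = cache[offsets[s] + pos]
        out.append(rotate(q, cos, sin))
    return out
```

Because the lookup is just an index into a shared table, tokens belonging to LoRAs with different context-length scaling can be mixed in the same batch without a separate call per LoRA.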

16 May, 2024 1 commit
- [Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850) · 8435b207
  Silencio authored May 17, 2024
```
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
```
  8435b207
14 May, 2024 1 commit
- [Core] Add MultiprocessingGPUExecutor (#4539) · 676a9998
  Nick Hill authored May 14, 2024
```
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
```
  676a9998
27 Apr, 2024 1 commit
- [Kernel] Full Tensor Parallelism for LoRA Layers (#3524) · eefeb164
  Austin Veselka authored Apr 27, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  eefeb164
24 Apr, 2024 1 commit
- [Misc] Reduce supported Punica dtypes (#4304) · 468d761b
  Woosuk Kwon authored Apr 23, 2024
  
  468d761b