Commits · 30299a41fa78c7bf485aca7ef8ad584ca340a64d · OpenDAS / vllm_cscc

13 Jun, 2024 13 commits
- [MISC] Remove FP8 warning (#5472) · 30299a41
  Cody Yu authored Jun 13, 2024
```
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
```
- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
- [Doc] Update LLaVA docs (#5437) · 0ce7b952
  Cyrus Leung authored Jun 14, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
- [CI/Build] Simplify OpenAI server setup in tests (#5100) · 39873476
  Cyrus Leung authored Jun 14, 2024
  
- [Misc] Add vLLM version getter to utils (#5098) · 03dccc88
  Cyrus Leung authored Jun 14, 2024
  
- [Docs] Add 4th meetup slides (#5509) · a65634d3
  Woosuk Kwon authored Jun 13, 2024
  
- [Hardware][Intel] Optimize CPU backend and add more performance tips (#4971) · 80aa7e91
  Li, Jiang authored Jun 14, 2024
```
Co-authored-by: Jianan Gu <jianan.gu@intel.com>
```
- [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497) · bd439735
  wenyujin333 authored Jun 14, 2024
```
Tune Qwen2-57B-A14B configs based on #4921

Throughput Performance
command: python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2

A100 GPU

benchmark  no config                              w/ PR
tp=2       10.53 requests/s, 11058.17 tokens/s    12.47 requests/s, 13088.57 tokens/s
tp=4       17.77 requests/s, 18662.95 tokens/s    20.20 requests/s, 21212.32 tokens/s
```
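The throughput figures in the commit message above can be turned into relative speedups with a few lines of arithmetic. This is only an illustrative sketch: the numbers are copied from the commit message, while the `results` dictionary and the `speedup` helper are hypothetical names introduced here and are not part of vLLM.

```python
# Requests/s on A100, copied from the commit message above.
results = {
    "tp=2": {"no_config": 10.53, "with_pr": 12.47},
    "tp=4": {"no_config": 17.77, "with_pr": 20.20},
}


def speedup(baseline, tuned):
    """Return the throughput ratio of the tuned run over the baseline run."""
    return tuned / baseline


for name, r in results.items():
    s = speedup(r["no_config"], r["with_pr"])
    # Report both the ratio and the percentage gain in requests/s.
    print(f"{name}: {s:.2f}x ({(s - 1) * 100:.1f}% more requests/s)")
```

On these figures the tuned configurations come out roughly 18% faster at tp=2 and roughly 14% faster at tp=4.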
- [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466) · 23ec72fa
  Michael Goin authored Jun 13, 2024
  
- [Kernel] `w4a16` support for `compressed-tensors` (#5385) · c2637a61
  Dipika Sikka authored Jun 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
- [Bugfix] If the content starts with ":" (response of ping), client should i… (#5303) · 88407532
  Wang, Yi authored Jun 13, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
- [ci] Use sccache to build images (#5419) · 916d219d
  Kevin H. Luu authored Jun 12, 2024
```
Signed-off-by: kevin <kevin@anyscale.com>
```
- [Core][Distributed] code deduplication in tp&pp with coordinator (#5293) · ea3890a5
  youkaichao authored Jun 12, 2024
```
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
```
12 Jun, 2024 14 commits
- [Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451) · 2135cacb
  Isotr0py authored Jun 13, 2024
  
- [Frontend] Add "input speed" to tqdm postfix alongside output speed (#5425) · 7d19de2e
  Michael Goin authored Jun 12, 2024
  
- [Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470) · 94a07bbd
  Michael Goin authored Jun 12, 2024
  
- [Doc] Update debug docs (#5438) · b8d4dfff
  Cyrus Leung authored Jun 13, 2024
  
- [misc] add hint for AttributeError (#5462) · 622d4512
  youkaichao authored Jun 12, 2024
  
- [Frontend] [Core] Support for sharded tensorized models (#4990) · 51602eef
  Travis Johnson authored Jun 12, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
- [Bugfix] TYPE_CHECKING for MultiModalData (#5444) · 5cc50a53
  Arthur Kim authored Jun 13, 2024
  
- [Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342
  Cody Yu authored Jun 12, 2024
```
Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).

In detail, we applied three optimizations:

- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve ILP.
- Use 4-wide vectorized loads and stores to transfer data between HBM and SRAM.
```
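The optimizations listed in the commit message above can be illustrated on the host side. The following is a minimal pure-Python sketch, not the actual CUDA kernel: it assumes the target format is FP8 E4M3 (largest finite value 448), it only scales and clamps without modeling FP8 rounding, and the names `FP8_E4M3_MAX`, `clamp_fp8`, and `quantize_scaled` are hypothetical. It shows the inverted scale (one division, then only multiplications) and a main loop that handles 4 elements per iteration, which mirrors both the 4x unrolling and the 4-wide data transfer.

```python
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (assumed target format)


def clamp_fp8(x):
    """Clamp a value into the representable FP8 E4M3 range."""
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))


def quantize_scaled(values, scale):
    """Scale and clamp values; FP8 rounding itself is not modeled here."""
    inv_scale = 1.0 / scale  # single division; the loops below only multiply
    out = [0.0] * len(values)
    n_vec = len(values) // 4 * 4  # number of elements covered by full groups of 4

    # Main loop: 4 elements per iteration, standing in for a 4-wide
    # vector load/store combined with 4x loop unrolling.
    for i in range(0, n_vec, 4):
        x0, x1, x2, x3 = values[i:i + 4]
        out[i] = clamp_fp8(x0 * inv_scale)
        out[i + 1] = clamp_fp8(x1 * inv_scale)
        out[i + 2] = clamp_fp8(x2 * inv_scale)
        out[i + 3] = clamp_fp8(x3 * inv_scale)

    # Tail loop: leftover elements when the length is not a multiple of 4.
    for i in range(n_vec, len(values)):
        out[i] = clamp_fp8(values[i] * inv_scale)
    return out
```

In a real kernel the group of 4 would typically be read and written as one packed vector type so that each thread issues fewer, wider memory transactions; the Python loop only shows the control flow.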
- [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests (#5464) · 8b82a899
  Kevin H. Luu authored Jun 12, 2024
```
Signed-off-by: kevin <kevin@anyscale.com>
```
- [Bugfix] Add device assertion to TorchSDPA (#5402) · c3c2903e
  Li, Jiang authored Jun 13, 2024
  
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
- [CI] Upgrade codespell version. (#5381) · 847cdcca
  SangBin Cho authored Jun 13, 2024
  
- Revert "[CI/Build] Add `is_quant_method_supported` to control quantization... · e3c12bf6
  Simon Mo authored Jun 12, 2024
```
Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463)
```
- [CI/Build] Add `is_quant_method_supported` to control quantization test configurations (#5253) · 3dd6853b
  Michael Goin authored Jun 12, 2024
  
11 Jun, 2024 13 commits
- [Doc] add common case for long waiting time (#5430) · 8f89d720
  youkaichao authored Jun 11, 2024
  
- [Core][Doc] Default to multiprocessing for single-node distributed case (#5230) · 99dac099
  Nick Hill authored Jun 11, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
- [Core][Distributed] add same-node detection (#5369) · c4bd03c7
  youkaichao authored Jun 11, 2024
  
- [Frontend] Customizable RoPE theta (#5197) · dcbf4286
  sasha0552 authored Jun 11, 2024
  
- [Bugfix] fix lora_dtype value type in arg_utils.py (#5398) · 00e6a2dc
  Ali Panahi authored Jun 11, 2024
  
- [Bugfix] Fix `MultiprocessingGPUExecutor.check_health` when world_size == 1 (#5254) · 2e02311a
  Junichi Sato authored Jun 12, 2024
  
- [Docs] [Spec decode] Fix docs error in code example (#5427) · 89ec06c3
  Cade Daniel authored Jun 11, 2024
  
- [Doc] Add an automatic prefix caching section in vllm documentation (#5324) · 9fde251b
  Kuntai Du authored Jun 11, 2024
```
Co-authored-by: simon-mo <simon.mo@hey.com>
```
- [Speculative decoding] Initial spec decode docs (#5400) · 4c2ffb28
  Cade Daniel authored Jun 11, 2024
  
- [CI] docfix (#5410) · 246598a6
  SangBin Cho authored Jun 11, 2024
```
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: ywang96 <ywang@roblox.com>
```
- [Misc] Remove VLLM_BUILD_WITH_NEURON env variable (#5389) · 8bab4959
  Woosuk Kwon authored Jun 11, 2024
  
- [Doc][Typo] Fixing Missing Comma (#5403) · 3c4cebf7
  Roger Wang authored Jun 11, 2024
  
- [Doc] add debugging tips (#5409) · d8f31f2f
  youkaichao authored Jun 10, 2024
  