Commits · e2afb03c92a06700d296a2e7f6565d4a4f05168c · OpenDAS / vllm_cscc

14 Jun, 2024 3 commits
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460) · e2afb03c
  Thomas Parnell authored Jun 14, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
  e2afb03c
- [ Misc ] Rs/compressed tensors cleanup (#5432) · 15985680
  Robert Shaw authored Jun 14, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
```
  15985680
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) · 703475f6
  Tyler Michael Smith authored Jun 14, 2024
  
  703475f6
13 Jun, 2024 5 commits

- [Kernel] Disable CUTLASS kernels for fp8 (#5505) · e38042d4
  Tyler Michael Smith authored Jun 13, 2024
  
  e38042d4

- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024

  Co-authored-by: Michael Goin <michael@neuralmagic.com>
  Co-authored-by: youkaichao <youkaichao@gmail.com>
  Co-authored-by: zifeitong <zifei.tong@parasail.io>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

  85657b56

- [Doc] Update LLaVA docs (#5437) · 0ce7b952
  Cyrus Leung authored Jun 14, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  0ce7b952

- [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497) · bd439735
  wenyujin333 authored Jun 14, 2024

  Tune Qwen2-57B-A14B configs based on #4921.

  Throughput performance on an A100 GPU

  Command: `python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2`

  | benchmark | no config | w/ PR |
  | --- | --- | --- |
  | tp=2 | 10.53 requests/s, 11058.17 tokens/s | 12.47 requests/s, 13088.57 tokens/s |
  | tp=4 | 17.77 requests/s, 18662.95 tokens/s | 20.20 requests/s, 21212.32 tokens/s |

  bd439735
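To put the throughput figures above in relative terms, the short sketch below computes the percentage gain of the tuned configuration over the untuned one. The numbers are copied from the table; the `relative_gain` helper and the dictionary layout are illustrative names, not part of the PR.

```python
# Throughput figures copied from the benchmark table: (requests/s, tokens/s).
results = {
    "tp=2": {"no config": (10.53, 11058.17), "w/ PR": (12.47, 13088.57)},
    "tp=4": {"no config": (17.77, 18662.95), "w/ PR": (20.20, 21212.32)},
}


def relative_gain(before: float, after: float) -> float:
    """Return the percentage improvement of `after` over `before`."""
    return (after / before - 1.0) * 100.0


gains = {}
for setting, runs in results.items():
    base_req, base_tok = runs["no config"]
    tuned_req, tuned_tok = runs["w/ PR"]
    gains[setting] = (
        relative_gain(base_req, tuned_req),
        relative_gain(base_tok, tuned_tok),
    )
    print(
        f"{setting}: {gains[setting][0]:.1f}% more requests/s, "
        f"{gains[setting][1]:.1f}% more tokens/s"
    )
```

On these figures the tuned configurations come out roughly 18% faster at tp=2 and roughly 14% faster at tp=4.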

- [Kernel] `w4a16` support for `compressed-tensors` (#5385) · c2637a61
  Dipika Sikka authored Jun 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  c2637a61

12 Jun, 2024 2 commits
- [Frontend] [Core] Support for sharded tensorized models (#4990) · 51602eef
  Travis Johnson authored Jun 12, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  51602eef
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
  1a8bfd92
11 Jun, 2024 1 commit
- [Misc] Various simplifications and typing fixes (#5368) · a0086298
  Nick Hill authored Jun 10, 2024
  
  a0086298
10 Jun, 2024 3 commits
- [Bugfix] Fix LLaVA-NeXT (#5380) · 2c0d9335
  Cyrus Leung authored Jun 10, 2024
  
  2c0d9335
- [Model] Initial support for LLaVA-NeXT (#4199) · 6b29d6fe
  Cyrus Leung authored Jun 10, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  6b29d6fe
- [Misc] Update to comply with the new `compressed-tensors` config (#5350) · 5884c2b4
  Dipika Sikka authored Jun 09, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  5884c2b4
09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
08 Jun, 2024 2 commits

- [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) · c09dade2
  Michael Goin authored Jun 08, 2024
  
  c09dade2
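Because the rename above is marked as breaking, checkpoints written with the old name would need their keys migrated. The following is a minimal sketch of such a migration on a plain dictionary. The key names (`layers.0.mlp.down_proj.act_scale` and so on) and the `rename_act_scale_keys` helper are hypothetical and only illustrate the idea; the actual checkpoint layout and any migration tooling are not described in the commit title.

```python
def rename_act_scale_keys(state_dict: dict) -> dict:
    """Return a copy in which a final `act_scale` key component becomes `input_scale`."""
    renamed = {}
    for key, value in state_dict.items():
        prefix, dot, leaf = key.rpartition(".")
        if leaf == "act_scale":
            key = f"{prefix}{dot}input_scale"
        renamed[key] = value
    return renamed


# Hypothetical checkpoint entries using the old naming.
old_checkpoint = {
    "layers.0.mlp.down_proj.weight_scale": 0.02,
    "layers.0.mlp.down_proj.act_scale": 0.5,
}
new_checkpoint = rename_act_scale_keys(old_checkpoint)
print(sorted(new_checkpoint))
```

Matching only the final dot-separated component keeps unrelated keys, such as `weight_scale`, untouched.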

- [Bug Fix] Fix the support check for FP8 CUTLASS (#5352) · e69ded7d
  Cheng Li authored Jun 07, 2024

  Bug description: with torch 2.4.0.dev20240603+cu121, `cutlass_fp8_supported` outputs False, and the (capability, version) pair before the comparison is (90, 11111111112).

  This PR fixes the support check for FP8 CUTLASS (`cutlass_fp8_supported`), which was introduced in https://github.com/vllm-project/vllm/pull/5183.

  e69ded7d
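The version value reported above has the shape that Python string repetition produces when a version string is indexed character by character. The sketch below reproduces that effect and contrasts it with parsing the components as integers first. This is a guess at the failure mode, assuming the CUDA version arrives as a string such as `"12.1"`; the commit message does not show the original or the fixed code, and both function names here are hypothetical.

```python
def naive_version(cuda_version: str) -> str:
    # Indexing a string yields single characters, so `* 10` repeats the
    # first character ten times and `+` concatenates the second one.
    return cuda_version[0] * 10 + cuda_version[1]


def parsed_version(cuda_version: str) -> int:
    # Convert the major and minor components to integers before combining.
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return major * 10 + minor


print(naive_version("12.1"))   # ten "1" characters followed by "2"
print(parsed_version("12.1"))  # 121
```

With the parsed form, a comparison against a minimum version behaves numerically rather than depending on an accidental string.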

07 Jun, 2024 5 commits
- fix DbrxFusedNormAttention missing cache_config (#5340) · 767c727a
  Calvinn Ng authored Jun 08, 2024
```
Co-authored-by: team <calvinn.ng@ahrefs.com>
```
  767c727a
- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  ca3ea51b
- [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) · 8d75fe48
  Tyler Michael Smith authored Jun 07, 2024
```
Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8

see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
```
  8d75fe48
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
  
  ccdc490d
- [Core] Avoid copying prompt/output tokens if no penalties are used (#5289) · a31cab75
  Antoni Baum authored Jun 06, 2024
  
  a31cab75
06 Jun, 2024 1 commit
- [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294) · abe855d6
  Philipp Moritz authored Jun 06, 2024
  
  abe855d6
05 Jun, 2024 5 commits
- [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109) · 7b0a0dfb
  Breno Faria authored Jun 06, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
```
  7b0a0dfb
- [Misc] Skip for logits_scale == 1.0 (#5291) · 6a7c7711
  Woosuk Kwon authored Jun 05, 2024
  
  6a7c7711
- [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) · 51a08e7d
  Philipp Moritz authored Jun 05, 2024
  
  51a08e7d
- [Model] Correct Mixtral FP8 checkpoint loading (#5231) · 5563a4de
  Cody Yu authored Jun 05, 2024
  
  5563a4de
- [Misc] Add CustomOp interface for device portability (#5255) · 41ca62cf
  Woosuk Kwon authored Jun 05, 2024
  
  41ca62cf
04 Jun, 2024 2 commits
- [Kernel] Enhance MoE benchmarking & tuning script (#4921) · 3a434b07
  Woosuk Kwon authored Jun 03, 2024
  
  3a434b07
- [Bugfix] Support `prompt_logprobs==0` (#5217) · 06b2550c
  Toshiki Kataoka authored Jun 04, 2024
  
  06b2550c
03 Jun, 2024 3 commits
- [FRONTEND] OpenAI `tools` support named functions (#5032) · f775a07e
  Breno Faria authored Jun 04, 2024
  
  f775a07e
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
  cbb2f59c
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
  7a64d24a
02 Jun, 2024 1 commit
- [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) · a66cf40b
  Divakar Verma authored Jun 02, 2024
```
This PR enables the fused topk_softmax kernel used in moe layer for HIP
```
  a66cf40b
01 Jun, 2024 3 commits
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
  b9c0605a
- [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151) · c3540728
  Ye Cao authored Jun 02, 2024
```
Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
```
  c3540728
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
  
  260d119e
31 May, 2024 2 commits
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
  
  e9899fb7
- [Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120) · b35be540
  Robert Shaw authored May 30, 2024
  
  b35be540
30 May, 2024 1 commit
- [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#5108) · 5bf185a1
  Alexander Matveev authored May 29, 2024
  
  5bf185a1