Commits · 360bd67cf0ea4a79a59c1aae736cc495a5a63ec5 · OpenDAS / vllm_cscc

05 Aug, 2024 1 commit
- [Core] Support loading GGUF model (#5191) · 360bd67c
  Isotr0py authored Aug 06, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
02 Aug, 2024 1 commit
- [Misc] Disambiguate quantized types via a new ScalarType (#6396) · a8d604ca
  Lucas Wilkinson authored Aug 02, 2024
  
01 Aug, 2024 1 commit
- [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) · 7ecee343
  Jee Jee Li authored Aug 01, 2024
  
31 Jul, 2024 3 commits
- Support W4A8 quantization for vllm (#5218) · 6512937d
  HandH1998 authored Jul 31, 2024
  
- [CI/Build] Fix mypy errors (#6968) · 9f0e69b6
  Cyrus Leung authored Jul 31, 2024
  
- [mypy] Enable following imports for some directories (#6681) · da1f7cc1
  Cyrus Leung authored Jul 31, 2024
  
30 Jul, 2024 1 commit
- [Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) · d7a299ed
  Tyler Michael Smith authored Jul 30, 2024
  
27 Jul, 2024 1 commit
- [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) · 75acdaa4
  Alexander Matveev authored Jul 27, 2024
  
24 Jul, 2024 1 commit
- Add fp8 support to `reshape_and_cache_flash` (#6667) · 0e63494c
  Antoni Baum authored Jul 24, 2024
  
21 Jul, 2024 1 commit
- [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) · 396d92d5
  Alexander Matveev authored Jul 21, 2024
  
20 Jul, 2024 2 commits
- [ Misc ] `fbgemm` checkpoints (#6559) · 683e3cb9
  Robert Shaw authored Jul 20, 2024
  
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) · 2e265642
  Varun Sundar Rabindranath authored Jul 19, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
19 Jul, 2024 1 commit
- [ Kernel ] Enable Dynamic Per Token `fp8` (#6547) · 4cc24f01
  Robert Shaw authored Jul 19, 2024
  
18 Jul, 2024 1 commit
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) · b5241e41
  Varun Sundar Rabindranath authored Jul 17, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
17 Jul, 2024 1 commit
- [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) · e76466dd
  Alexander Matveev authored Jul 17, 2024
  
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
03 Jul, 2024 1 commit
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) · 47f0954a
  Michael Goin authored Jul 03, 2024
  
26 Jun, 2024 1 commit

- [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560) · 5bfd1bbc
  Luka Govedič authored Jun 26, 2024

  Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
  Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

20 Jun, 2024 2 commits
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) · 3f3b6b21
  Tyler Michael Smith authored Jun 20, 2024
  
- [Model] Port over CLIPVisionModel for VLMs (#5591) · ad137cd1
  Roger Wang authored Jun 20, 2024
  
17 Jun, 2024 1 commit

- [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) · 728c4c8a
  Kunshang Ji authored Jun 18, 2024

  Co-authored-by: Jiang Li <jiang1.li@intel.com>
  Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
  Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

13 Jun, 2024 1 commit

- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024

  Co-authored-by: Michael Goin <michael@neuralmagic.com>
  Co-authored-by: youkaichao <youkaichao@gmail.com>
  Co-authored-by: zifeitong <zifei.tong@parasail.io>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

12 Jun, 2024 1 commit
- [misc] add hint for AttributeError (#5462) · 622d4512
  youkaichao authored Jun 12, 2024
  
09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
07 Jun, 2024 3 commits

- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024

  Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
  Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

- [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) · 8d75fe48
  Tyler Michael Smith authored Jun 07, 2024

  Switch from `torch._scaled_mm` to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8.

  See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.

- [Misc] Missing error message for custom ops import (#5282) · 15063741
  Jie Fu (傅杰) authored Jun 07, 2024

03 Jun, 2024 1 commit
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
25 May, 2024 1 commit

- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9
  Eric Xihui Lin authored May 25, 2024

  Co-authored-by: beagleski <yunanzhang@microsoft.com>
  Co-authored-by: bapatra <bapatra@microsoft.com>
  Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>

23 May, 2024 1 commit

- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024

  Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
  Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

16 May, 2024 2 commits
- [Kernel] Add w8a8 CUTLASS kernels (#4749) · 2060e936
  Tyler Michael Smith authored May 16, 2024
  
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
10 May, 2024 2 commits
- [Core]fix type annotation for `swap_blocks` (#4726) · 64b77dfd
  Kunshang Ji authored May 10, 2024
  
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  
09 May, 2024 1 commit

- [Kernel] [FP8] Improve FP8 linear layer performance (#4691) · 379da6dc
  Philipp Moritz authored May 09, 2024

This PR improves the FP8 performance of linear layers, which had been lacking before (see the comments on #4118).

We noticed that cuBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16, so this PR enlarges matrices accordingly during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
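
The padding idea described above can be illustrated with a minimal pure-Python sketch. It is not the PR's code: the helper names (`pad_rows`, `matmul`, `padded_matmul`) and the default of 17 rows are assumptions chosen only to satisfy "greater than 16", and a plain-Python matrix multiply stands in for the real GEMM call.

```python
# Hypothetical illustration: enlarge the first dimension with zero rows
# before the matrix multiply, then drop the extra rows from the result.

def pad_rows(matrix, min_rows=17):
    """Append zero rows until the matrix has at least min_rows rows.

    Returns the padded matrix and the original row count.
    min_rows=17 is an assumption ("greater than 16").
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    padding = [[0.0] * cols for _ in range(max(0, min_rows - rows))]
    return matrix + padding, rows


def matmul(a, b):
    """Plain-Python stand-in for the GEMM call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def padded_matmul(a, b, min_rows=17):
    """Multiply with a padded left operand and slice back to the real rows."""
    padded, rows = pad_rows(a, min_rows)
    return matmul(padded, b)[:rows]
```

Because the added rows are all zeros and are sliced off afterwards, the result is the same as for the unpadded multiply; only the shape seen by the GEMM changes.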

Here are benchmarks on Llama 3 70B (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4); all FP8 measurements are for dynamic quantization:

qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
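
To read the numbers above as relative changes, a small sketch like the following can help. The latencies are copied from the list; the helper name `reduction_pct` and the dictionary layout are illustrative assumptions, not part of the PR.

```python
# ITL in ms per qps: (FP8 with this PR, FP8 on previous main, FP16),
# copied from the benchmark list in the commit message.
itl_ms = {
    1: (24, 32, 26),
    2: (26, 34, 28),
    4: (33, 44, 36),
    6: (46, 56, 54),
    8: (85, 85, 138),
}


def reduction_pct(new, old):
    """Percentage by which `new` latency is lower than `old`."""
    return 100.0 * (old - new) / old


# qps -> (reduction vs. previous main, reduction vs. FP16)
summary = {
    qps: (reduction_pct(pr, main), reduction_pct(pr, fp16))
    for qps, (pr, main, fp16) in itl_ms.items()
}

for qps in sorted(summary):
    vs_main, vs_fp16 = summary[qps]
    print(f"qps={qps}: {vs_main:.1f}% lower than previous main, "
          f"{vs_fp16:.1f}% lower than FP16")
```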


03 May, 2024 2 commits
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
02 May, 2024 1 commit
- [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) · 7038e8b8
  alexm-nm authored May 02, 2024
  
30 Apr, 2024 1 commit
- [Core]Refactor gptq_marlin ops (#4466) · 26f2fb51
  Kunshang Ji authored Apr 30, 2024
  
27 Apr, 2024 1 commit
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```