Commits · 37fd47e7803fedd9715abceee8bdb57070fc09f4 · OpenDAS / vllm_cscc

16 Aug, 2024 2 commits
- [Kernel] fix types used in aqlm and ggml kernels to support dynamo (#7596) · 37fd47e7
  bnellnm authored Aug 16, 2024
  
  37fd47e7
- [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) · e837b624
  Charlie Fu authored Aug 16, 2024
  
  e837b624
12 Aug, 2024 1 commit
- [Misc] Use scalar type to dispatch to different `gptq_marlin` kernels (#7323) · 6aa33cb2
  Lucas Wilkinson authored Aug 12, 2024
  
  6aa33cb2
06 Aug, 2024 1 commit
- [Kernel] Add per-tensor and per-token AZP epilogues (#5941) · 8d59dbb0
  Luka Govedič authored Aug 06, 2024
```
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
```
  8d59dbb0
05 Aug, 2024 3 commits
- [Core] Support loading GGUF model (#5191) · 360bd67c
  Isotr0py authored Aug 06, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  360bd67c
- [CI/Build] Suppress divide-by-zero and missing return statement warnings (#7001) · 6e4852ce
  Tyler Michael Smith authored Aug 05, 2024
  
  6e4852ce
- [Kernel] Update CUTLASS to 3.5.1 (#7085) · 8571ac46
  Tyler Michael Smith authored Aug 05, 2024
  
  8571ac46
02 Aug, 2024 1 commit
- [Misc] Disambiguate quantized types via a new ScalarType (#6396) · a8d604ca
  Lucas Wilkinson authored Aug 02, 2024
  
  a8d604ca
31 Jul, 2024 3 commits
- [Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996) · 35e9c12b
  Varun Sundar Rabindranath authored Jul 31, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  35e9c12b
- [Kernel] Enable FP8 Cutlass for Ada Lovelace (#6950) · 93548eb3
  Varun Sundar Rabindranath authored Jul 31, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  93548eb3
- Support W4A8 quantization for vllm (#5218) · 6512937d
  HandH1998 authored Jul 31, 2024
  
  6512937d
30 Jul, 2024 4 commits
- [Kernel] Squash a few more warnings (#6914) · cbbc9044
  Tyler Michael Smith authored Jul 30, 2024
  
  cbbc9044
- [Kernel] Tuned int8 kernels for Ada Lovelace (#6848) · af647fb8
  Varun Sundar Rabindranath authored Jul 29, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  af647fb8
- [Kernel] Fix marlin divide-by-zero warnings (#6904) · 61a97c32
  Tyler Michael Smith authored Jul 29, 2024
  
  61a97c32
- [Kernel] Remove unused variables in awq/gemm_kernels.cu (#6908) · aae6d36f
  Tyler Michael Smith authored Jul 29, 2024
  
  aae6d36f
29 Jul, 2024 2 commits
- [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel (#6901) · 60d1c6e5
  Tyler Michael Smith authored Jul 29, 2024
  
  60d1c6e5
- [Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677) · 766435e6
  Varun Sundar Rabindranath authored Jul 29, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  766435e6
27 Jul, 2024 2 commits
- [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) · 75acdaa4
  Alexander Matveev authored Jul 27, 2024
  
  75acdaa4
- [Bug Fix] Illegal memory access, FP8 Llama 3.1 405b (#6852) · 55712941
  Lucas Wilkinson authored Jul 26, 2024
  
  55712941
26 Jul, 2024 1 commit
- [Bugfix][Kernel] Promote another index to int64_t (#6838) · 50704f52
  Tyler Michael Smith authored Jul 26, 2024
  
  50704f52
22 Jul, 2024 1 commit
- [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) · fea59c77
  Tyler Michael Smith authored Jul 22, 2024
  
  fea59c77
21 Jul, 2024 1 commit
- [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) · 396d92d5
  Alexander Matveev authored Jul 21, 2024
  
  396d92d5
20 Jul, 2024 1 commit
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) · 2e265642
  Varun Sundar Rabindranath authored Jul 19, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  2e265642
18 Jul, 2024 1 commit
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) · b5241e41
  Varun Sundar Rabindranath authored Jul 17, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  b5241e41
14 Jul, 2024 1 commit
- [Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace (#6384) · 9dad5cc8
  Tyler Michael Smith authored Jul 14, 2024
  
  9dad5cc8
03 Jul, 2024 1 commit
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) · 47f0954a
  Michael Goin authored Jul 03, 2024
  
  47f0954a
28 Jun, 2024 1 commit
- [Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931) · 6a2d659d
  Tyler Michael Smith authored Jun 28, 2024
  
  6a2d659d
26 Jun, 2024 1 commit

[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560) · 5bfd1bbc

Luka Govedič authored Jun 26, 2024


Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

5bfd1bbc

23 Jun, 2024 1 commit
- [BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744) · 6c916ac8
  Varun Sundar Rabindranath authored Jun 24, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  6c916ac8
20 Jun, 2024 3 commits
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) · 3f3b6b21
  Tyler Michael Smith authored Jun 20, 2024
  
  3f3b6b21
- [Kernel] Update Cutlass int8 kernel configs for SM80 (#5275) · a7dcc620
  Varun Sundar Rabindranath authored Jun 20, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  a7dcc620
- [Kernel] Update Cutlass int8 kernel configs for SM90 (#5514) · 111af1fa
  Varun Sundar Rabindranath authored Jun 20, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  111af1fa
18 Jun, 2024 1 commit
- [Bugfix] Fix CUDA version check for mma warning suppression (#5642) · b23ce920
  Tyler Michael Smith authored Jun 18, 2024
  
  b23ce920
14 Jun, 2024 2 commits
- [Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401) · 348616ac
  Tyler Michael Smith authored Jun 14, 2024
  
  348616ac
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) · 703475f6
  Tyler Michael Smith authored Jun 14, 2024
  
  703475f6
13 Jun, 2024 1 commit

[Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56

Tyler Michael Smith authored Jun 13, 2024


Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

85657b56

12 Jun, 2024 1 commit

[Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342

Cody Yu authored Jun 12, 2024

Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to make better use of memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).

In detail, we applied three optimizations:

- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve ILP.
- Use 4-wide vectorized loads and stores to transfer data between HBM and SRAM.
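
The three optimizations above can be sketched on the CPU as follows. This is an illustration only, not the code from the PR: `Vec4`, `scale_and_clamp`, and `quantize_sketch` are hypothetical names, the target format is assumed to be FP8 E4M3 (largest finite value 448), and the final rounding to an 8-bit value is omitted so the sketch runs without GPU-specific types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Largest finite magnitude of FP8 E4M3 (assumed target format).
constexpr float kFp8E4m3Max = 448.0f;

// Hypothetical 4-wide pack, standing in for a GPU vector type such as float4.
struct Vec4 {
  float x, y, z, w;
};

// Scale one value and clamp it to the FP8 range. A real kernel would also
// round the result to an 8-bit float; this sketch keeps it as float.
inline float scale_and_clamp(float v, float inv_scale) {
  return std::max(-kFp8E4m3Max, std::min(v * inv_scale, kFp8E4m3Max));
}

std::vector<float> quantize_sketch(const std::vector<float>& in, float scale) {
  assert(scale > 0.0f);
  std::vector<float> out(in.size());

  // Optimization 1: divide once, then multiply for every element.
  const float inv_scale = 1.0f / scale;

  const std::size_t n_vec = in.size() / 4;
  for (std::size_t i = 0; i < n_vec; ++i) {
    // Optimization 3: move four elements per step, as one pack.
    const Vec4 v{in[4 * i], in[4 * i + 1], in[4 * i + 2], in[4 * i + 3]};

    // Optimization 2: four independent operations in the loop body,
    // which a compiler or GPU can overlap (instruction-level parallelism).
    const Vec4 q{scale_and_clamp(v.x, inv_scale),
                 scale_and_clamp(v.y, inv_scale),
                 scale_and_clamp(v.z, inv_scale),
                 scale_and_clamp(v.w, inv_scale)};

    out[4 * i] = q.x;
    out[4 * i + 1] = q.y;
    out[4 * i + 2] = q.z;
    out[4 * i + 3] = q.w;
  }

  // Scalar tail for sizes that are not a multiple of 4.
  for (std::size_t i = 4 * n_vec; i < in.size(); ++i) {
    out[i] = scale_and_clamp(in[i], inv_scale);
  }
  return out;
}
```

On a GPU the pack would typically be read and written with a single wide memory instruction per thread; the scalar tail loop handles inputs whose length is not a multiple of 4.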

5985e342

09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
07 Jun, 2024 1 commit

[Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b

Dipika Sikka authored Jun 07, 2024


Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

ca3ea51b

05 Jun, 2024 1 commit
- [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157) · ccd4f129
  Tyler Michael Smith authored Jun 05, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  ccd4f129