Commits · 47f0954af0a5aefd0db19875f6bdcbe933d055a9 · OpenDAS / vllm_cscc · GitLab

03 Jul, 2024 1 commit
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) · 47f0954a
  Michael Goin authored Jul 03, 2024
  
  47f0954a
29 Jun, 2024 1 commit
- [Kernel] Add punica dimensions for Granite 3b and 8b (#5930) · ba499444
  Joe Runde authored Jun 28, 2024
```
Signed-off-by: Joe Runde <joe@joerun.de>
```
  ba499444
28 Jun, 2024 2 commits
- Unmark more files as executable (#5962) · 5d2a1a9c
  Tyler Michael Smith authored Jun 28, 2024
  
  5d2a1a9c
- [Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931) · 6a2d659d
  Tyler Michael Smith authored Jun 28, 2024
  
  6a2d659d
26 Jun, 2024 2 commits
- Support CPU inference with VSX PowerPC ISA (#5652) · 38a1674a
  Chip Kerchner authored Jun 26, 2024
  
  38a1674a
- [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560) · 5bfd1bbc
  Luka Govedič authored Jun 26, 2024
```
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
```
  5bfd1bbc
23 Jun, 2024 1 commit
- [BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744) · 6c916ac8
  Varun Sundar Rabindranath authored Jun 24, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  6c916ac8
21 Jun, 2024 2 commits
- [Kernel][CPU] Add Quick `gelu` to CPU (#5717) · bd620b01
  Roger Wang authored Jun 20, 2024
  
  bd620b01
- [Kernel] Add punica dimension for Qwen2 LoRA (#5441) · 1f567421
  Jinzhen Lin authored Jun 21, 2024
  
  1f567421
20 Jun, 2024 4 commits
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) · 3f3b6b21
  Tyler Michael Smith authored Jun 20, 2024
  
  3f3b6b21
- [Kernel] Update Cutlass int8 kernel configs for SM80 (#5275) · a7dcc620
  Varun Sundar Rabindranath authored Jun 20, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  a7dcc620
- [Model] Port over CLIPVisionModel for VLMs (#5591) · ad137cd1
  Roger Wang authored Jun 20, 2024
  
  ad137cd1
- [Kernel] Update Cutlass int8 kernel configs for SM90 (#5514) · 111af1fa
  Varun Sundar Rabindranath authored Jun 20, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  111af1fa
18 Jun, 2024 3 commits
- [Bugfix] Fix CUDA version check for mma warning suppression (#5642) · b23ce920
  Tyler Michael Smith authored Jun 18, 2024
  
  b23ce920
- [Model] LoRA support added for command-r (#5178) · 07feecde
  sergey-tinkoff authored Jun 18, 2024
  
  07feecde
- [Kernel] Add punica dimensions for Granite 13b (#5559) · 5002175e
  Joe Runde authored Jun 17, 2024
```
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
```
  5002175e
14 Jun, 2024 2 commits
- [Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401) · 348616ac
  Tyler Michael Smith authored Jun 14, 2024
  
  348616ac
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) · 703475f6
  Tyler Michael Smith authored Jun 14, 2024
  
  703475f6
13 Jun, 2024 2 commits

- [Hardware][Intel] Support CPU inference with AVX2 ISA (#5452) · cd9c0d65
  Jie Fu (傅杰) authored Jun 14, 2024
  
  cd9c0d65
- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024

  Co-authored-by: Michael Goin <michael@neuralmagic.com>
  Co-authored-by: youkaichao <youkaichao@gmail.com>
  Co-authored-by: zifeitong <zifei.tong@parasail.io>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

  85657b56

12 Jun, 2024 1 commit

- [Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342
  Cody Yu authored Jun 12, 2024

  Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).

  In detail, we applied 3 optimizations:

  - Use an inverted scale so that most divisions become multiplications.
  - Unroll the loop 4 times to improve ILP.
  - Use 4-wide vectorized accesses to transfer data between HBM and SRAM.

  5985e342
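
The three optimizations listed in #5396 can be illustrated with a small host-side sketch. This is not the kernel from the PR: it is plain Python, the names `quantize_reference`, `quantize_vec4`, and `scale_and_clamp` are hypothetical, and it assumes the e4m3 FP8 format (largest finite value 448). The final rounding to an 8-bit value is omitted, so only the scaling, clamping, and 4-element grouping are shown.

```python
# Illustrative sketch only; names are hypothetical and not taken from the vLLM source.
FP8_E4M3_MAX = 448.0  # assumption: e4m3 format, largest finite value is 448


def scale_and_clamp(x, inv_scale):
    """Multiply by the precomputed reciprocal, then clamp to the FP8 range."""
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * inv_scale))


def quantize_reference(values, scale):
    """Baseline: one division per element."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def quantize_vec4(values, scale):
    """Handle four elements per iteration, with a scalar loop for the tail."""
    inv_scale = 1.0 / scale  # the only division; every element is multiplied
    out = [0.0] * len(values)
    n_vec = len(values) // 4
    for i in range(n_vec):
        base = 4 * i
        # Stands in for a single 4-element vector load.
        a, b, c, d = values[base:base + 4]
        # Four independent operations: the unrolled loop body.
        qa = scale_and_clamp(a, inv_scale)
        qb = scale_and_clamp(b, inv_scale)
        qc = scale_and_clamp(c, inv_scale)
        qd = scale_and_clamp(d, inv_scale)
        # Stands in for a single 4-element vector store.
        out[base:base + 4] = [qa, qb, qc, qd]
    # Elements that do not fill a whole group of four.
    for j in range(4 * n_vec, len(values)):
        out[j] = scale_and_clamp(values[j], inv_scale)
    return out


if __name__ == "__main__":
    print(quantize_vec4([0.5, -1.0, 2.0, 1000.0, -2000.0, 0.25], 0.5))
```

Multiplying by a reciprocal can differ from a true division in the last bit when the scale is not a power of two, which is usually acceptable before rounding to an 8-bit format.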

09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
07 Jun, 2024 2 commits
- [Misc] Remove unused cuda_utils.h in CPU backend (#5345) · 6840a716
  Jie Fu (傅杰) authored Jun 08, 2024
  
  6840a716
- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  ca3ea51b
05 Jun, 2024 1 commit
- [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157) · ccd4f129
  Tyler Michael Smith authored Jun 05, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  ccd4f129
03 Jun, 2024 2 commits
- [CI/BUILD] enable intel queue for longer CPU tests (#4113) · cafb8e06
  Yuan authored Jun 04, 2024
  
  cafb8e06
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
  cbb2f59c
02 Jun, 2024 1 commit
- [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) · a66cf40b
  Divakar Verma authored Jun 02, 2024
```
This PR enables the fused topk_softmax kernel used in moe layer for HIP
```
  a66cf40b
01 Jun, 2024 3 commits
- [Kernel] Update Cutlass fp8 configs (#5144) · f081c3ce
  Varun Sundar Rabindranath authored Jun 01, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  f081c3ce
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
  
  260d119e
- [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168) · 1197e021
  Tyler Michael Smith authored May 31, 2024
  
  1197e021
31 May, 2024 3 commits
- Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the... · e9d3aa04
  Simon Mo authored May 31, 2024
```
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149)
```
  e9d3aa04
- [Model] Support MAP-NEO model (#5081) · a22dea54
  SnowDist authored May 31, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  a22dea54
- [Kernel] Marlin_24: Ensure the mma.sp instruction is using the... · 6d21fa1c
  Alexander Matveev authored May 30, 2024
```
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136)
```
  6d21fa1c
25 May, 2024 1 commit

- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9
  Eric Xihui Lin authored May 25, 2024

  Co-authored-by: beagleski <yunanzhang@microsoft.com>
  Co-authored-by: bapatra <bapatra@microsoft.com>
  Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>

  8e192ff9

23 May, 2024 2 commits
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  a1242324
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
  60662532
22 May, 2024 3 commits
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
  97b03000
- [Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954) · 8674f988
  Tyler Michael Smith authored May 22, 2024
```
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
```
  8674f988
- [CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722) · 5f6d10c1
  Michael Goin authored May 22, 2024
  
  5f6d10c1