Commits · 111af1fa2c4fdb2d83b466935a327b1a5009874a · OpenDAS / vllm_cscc

20 Jun, 2024 1 commit
- [Kernel] Update Cutlass int8 kernel configs for SM90 (#5514) · 111af1fa
  Varun Sundar Rabindranath authored Jun 20, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  111af1fa
18 Jun, 2024 3 commits
- [Bugfix] Fix CUDA version check for mma warning suppression (#5642) · b23ce920
  Tyler Michael Smith authored Jun 18, 2024
  
  b23ce920
- [Model] LoRA support added for command-r (#5178) · 07feecde
  sergey-tinkoff authored Jun 18, 2024
  
  07feecde
- [Kernel] Add punica dimensions for Granite 13b (#5559) · 5002175e
  Joe Runde authored Jun 17, 2024
```
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
```
  5002175e
14 Jun, 2024 2 commits
- [Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401) · 348616ac
  Tyler Michael Smith authored Jun 14, 2024
  
  348616ac
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) · 703475f6
  Tyler Michael Smith authored Jun 14, 2024
  
  703475f6
13 Jun, 2024 2 commits

- [Hardware][Intel] Support CPU inference with AVX2 ISA (#5452) · cd9c0d65
  Jie Fu (傅杰) authored Jun 14, 2024
  
  cd9c0d65
- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024


Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

85657b56

12 Jun, 2024 1 commit

- [Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342
  Cody Yu authored Jun 12, 2024

Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfers to make better use of memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).

In detail, we applied three optimizations:

- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve instruction-level parallelism (ILP).
- Use 4-element vectorized accesses to transfer data between HBM and SRAM.
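
The first and third optimizations can be illustrated with a minimal Python sketch. This is not the CUDA kernel from the PR: the function name `scale_for_fp8` and the constant `VEC_WIDTH` are hypothetical, plain Python lists stand in for device memory, and the final rounding to FP8 is left out. It only shows the control flow: one division up front, a main loop that handles four elements per iteration, and a scalar loop for the remainder.

```python
VEC_WIDTH = 4  # hypothetical constant: elements handled per "vectorized" step


def scale_for_fp8(values, scale):
    """Scale `values` by 1/scale, four elements at a time.

    Illustrative only: the FP8 rounding step is omitted and the
    results stay ordinary Python floats.
    """
    # One division up front; every element is then a multiplication.
    inv_scale = 1.0 / scale
    out = [0.0] * len(values)

    # Main loop: one group of four elements per iteration, standing in
    # for a single 4-element load, compute, and store.
    n_vec = (len(values) // VEC_WIDTH) * VEC_WIDTH
    for base in range(0, n_vec, VEC_WIDTH):
        v0, v1, v2, v3 = values[base:base + VEC_WIDTH]
        out[base:base + VEC_WIDTH] = (
            v0 * inv_scale,
            v1 * inv_scale,
            v2 * inv_scale,
            v3 * inv_scale,
        )

    # Tail loop: leftover elements when the length is not a multiple of 4.
    for i in range(n_vec, len(values)):
        out[i] = values[i] * inv_scale
    return out
```

For example, `scale_for_fp8([1.0, 2.0, 3.0, 4.0, 5.0], 0.5)` handles the first four elements in the main loop and the fifth in the tail loop.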

5985e342

09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
07 Jun, 2024 2 commits
- [Misc] Remove unused cuda_utils.h in CPU backend (#5345) · 6840a716
  Jie Fu (傅杰) authored Jun 08, 2024
  
  6840a716
- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  ca3ea51b
05 Jun, 2024 1 commit
- [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157) · ccd4f129
  Tyler Michael Smith authored Jun 05, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  ccd4f129
03 Jun, 2024 2 commits
- [CI/BUILD] enable intel queue for longer CPU tests (#4113) · cafb8e06
  Yuan authored Jun 04, 2024
  
  cafb8e06
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
  cbb2f59c
02 Jun, 2024 1 commit
- [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) · a66cf40b
  Divakar Verma authored Jun 02, 2024
```
This PR enables the fused topk_softmax kernel used in moe layer for HIP
```
  a66cf40b
01 Jun, 2024 3 commits
- [Kernel] Update Cutlass fp8 configs (#5144) · f081c3ce
  Varun Sundar Rabindranath authored Jun 01, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  f081c3ce
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
  
  260d119e
- [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168) · 1197e021
  Tyler Michael Smith authored May 31, 2024
  
  1197e021
31 May, 2024 3 commits
- Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the... · e9d3aa04
  Simon Mo authored May 31, 2024
```
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149)
```
  e9d3aa04
- [Model] Support MAP-NEO model (#5081) · a22dea54
  SnowDist authored May 31, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  a22dea54
- [Kernel] Marlin_24: Ensure the mma.sp instruction is using the... · 6d21fa1c
  Alexander Matveev authored May 30, 2024
```
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136)
```
  6d21fa1c
25 May, 2024 1 commit

- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9
  Eric Xihui Lin authored May 25, 2024


Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

8e192ff9

23 May, 2024 2 commits
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  a1242324
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
  60662532
22 May, 2024 3 commits
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
  97b03000
- [Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954) · 8674f988
  Tyler Michael Smith authored May 22, 2024
```
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
```
  8674f988
- [CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722) · 5f6d10c1
  Michael Goin authored May 22, 2024
  
  5f6d10c1
20 May, 2024 1 commit
- Remove marlin warning (#4918) · da5a0b53
  Alexander Matveev authored May 20, 2024
  
  da5a0b53
16 May, 2024 4 commits
- [Kernel] Add w8a8 CUTLASS kernels (#4749) · 2060e936
  Tyler Michael Smith authored May 16, 2024
  
  2060e936
- [Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850) · 8435b207
  Silencio authored May 17, 2024
```
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
```
  8435b207
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
  6979ade3
- [Kernel] add bfloat16 support for gptq marlin kernel (#4788) · 99caa491
  Jinzhen Lin authored May 16, 2024
  
  99caa491
10 May, 2024 2 commits
- [Misc] Apply a couple g++ cleanups (#4719) · dac6a3f6
  Steve Grubb authored May 10, 2024
  
  dac6a3f6
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  
  c8331017
09 May, 2024 2 commits
- [ROCm] Add support for Punica kernels on AMD GPUs (#3140) · ff5abcd7
  kliuae authored May 10, 2024
```
Co-authored-by: miloice <jeffaw99@hotmail.com>
```
  ff5abcd7
- [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (#4626) · e288df06
  alexm-nm authored May 08, 2024
  
  e288df06
08 May, 2024 1 commit
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
07 May, 2024 2 commits

- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
- [Kernel] Make static FP8 scaling more robust (#4570) · a98187cf
  Philipp Moritz authored May 06, 2024

Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors during computation. However, this is not always the case, even when the scales are calibrated very carefully. For example, with the activations in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I get the following, mostly random, performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|

With the fix in this PR, the scaled activations are clamped to [`-std::numeric_limits<c10::Float8_e4m3fn>::max()`, `std::numeric_limits<c10::Float8_e4m3fn>::max()`] to make sure there are no NaNs, and the performance is:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|

This is not perfect yet, but it is very close to the FP16 / dynamic activation scale performance.
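
The clamping step described above can be sketched in a few lines of Python. This is only an illustration, not the kernel code from the PR: the function name `scale_and_clamp` is hypothetical, and the constant 448.0 is used on the assumption that it is the largest finite value of the `float8_e4m3fn` format. The sketch clamps the scaled activation to that range, so that a value larger than the calibrated scale anticipated never reaches the FP8 conversion out of range.

```python
# Assumption: 448.0 is the largest finite value of float8_e4m3fn.
FP8_E4M3_MAX = 448.0


def scale_and_clamp(x, scale):
    """Divide an activation by a static scale, then clamp to the FP8 range.

    Illustrative only: the result is still a Python float; the actual
    conversion to FP8 is not modelled here.
    """
    scaled = x / scale
    # Without this clamp, an activation larger than the calibration
    # anticipated would fall outside the representable range.
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, scaled))
```

With a scale of 1.0, an activation of 1000.0 is clamped to 448.0, while an in-range activation such as 100.0 passes through unchanged.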

a98187cf