Commits · e499f96ce34b2bebd81d0c4d25852916c95fafcb · OpenDAS / vllm_cscc

17 Jul, 2024 1 commit
- add rotary_embedding for tgi · e499f96c
  huangwb authored Jul 17, 2024
  
  e499f96c
10 Jul, 2024 2 commits
- pa_v1 uses the original code, pa_v2 uses the new code · deeb9cb8
  zhangshao authored Jul 10, 2024
  
  deeb9cb8
- Optimize rmsnorm and page_attn · 9e10e8f7
  zhangshao authored Jul 10, 2024
  
  9e10e8f7
02 Jul, 2024 2 commits
- change pa v1 to 128 · bbf9488b
  zhuwenwen authored Jul 02, 2024
  
  bbf9488b
- change num_thread to 256 · 8ee4ae1f
  zhuwenwen authored Jul 02, 2024
  
  8ee4ae1f
12 Jun, 2024 1 commit
- skip fp8 · 103f3110
  zhuwenwen authored Jun 12, 2024
  
  103f3110
09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
07 Jun, 2024 2 commits
- [Misc] Remove unused cuda_utils.h in CPU backend (#5345) · 6840a716
  Jie Fu (傅杰) authored Jun 08, 2024
  
  6840a716
- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  ca3ea51b
05 Jun, 2024 1 commit
- [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157) · ccd4f129
  Tyler Michael Smith authored Jun 05, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  ccd4f129
03 Jun, 2024 2 commits
- [CI/BUILD] enable intel queue for longer CPU tests (#4113) · cafb8e06
  Yuan authored Jun 04, 2024
  
  cafb8e06
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
  cbb2f59c
02 Jun, 2024 1 commit
- [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) · a66cf40b
  Divakar Verma authored Jun 02, 2024
```
This PR enables the fused topk_softmax kernel used in moe layer for HIP
```
  a66cf40b
01 Jun, 2024 3 commits
- [Kernel] Update Cutlass fp8 configs (#5144) · f081c3ce
  Varun Sundar Rabindranath authored Jun 01, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  f081c3ce
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
  
  260d119e
- [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168) · 1197e021
  Tyler Michael Smith authored May 31, 2024
  
  1197e021
31 May, 2024 4 commits
- add int8 · 0de4f1dc
  zhuwenwen authored May 31, 2024
  
  0de4f1dc
- Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the... · e9d3aa04
  Simon Mo authored May 31, 2024
```
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149)
```
  e9d3aa04
- [Model] Support MAP-NEO model (#5081) · a22dea54
  SnowDist authored May 31, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  a22dea54
- [Kernel] Marlin_24: Ensure the mma.sp instruction is using the... · 6d21fa1c
  Alexander Matveev authored May 30, 2024
```
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136)
```
  6d21fa1c
29 May, 2024 1 commit
- support bf16 infer · fc92ed40
  zhuwenwen authored May 29, 2024
  
  fc92ed40
25 May, 2024 2 commits
- skip fp8 · f09d77ac
  zhuwenwen authored May 25, 2024
  
  f09d77ac
- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9
  Eric Xihui Lin authored May 25, 2024
  
  Co-authored-by: beagleski <yunanzhang@microsoft.com>
  Co-authored-by: bapatra <bapatra@microsoft.com>
  Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>
  
  8e192ff9
23 May, 2024 2 commits
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  a1242324
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
  60662532
22 May, 2024 3 commits
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
  97b03000
- [Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954) · 8674f988
  Tyler Michael Smith authored May 22, 2024
```
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
```
  8674f988
- [CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722) · 5f6d10c1
  Michael Goin authored May 22, 2024
  
  5f6d10c1
20 May, 2024 1 commit
- Remove marlin warning (#4918) · da5a0b53
  Alexander Matveev authored May 20, 2024
  
  da5a0b53
16 May, 2024 4 commits
- [Kernel] Add w8a8 CUTLASS kernels (#4749) · 2060e936
  Tyler Michael Smith authored May 16, 2024
  
  2060e936
- [Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850) · 8435b207
  Silencio authored May 17, 2024
```
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
```
  8435b207
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
  6979ade3
- [Kernel] add bfloat16 support for gptq marlin kernel (#4788) · 99caa491
  Jinzhen Lin authored May 16, 2024
  
  99caa491
10 May, 2024 2 commits
- [Misc] Apply a couple g++ cleanups (#4719) · dac6a3f6
  Steve Grubb authored May 10, 2024
  
  dac6a3f6
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  
  c8331017
09 May, 2024 2 commits
- [ROCm] Add support for Punica kernels on AMD GPUs (#3140) · ff5abcd7
  kliuae authored May 10, 2024
```
Co-authored-by: miloice <jeffaw99@hotmail.com>
```
  ff5abcd7
- [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (#4626) · e288df06
  alexm-nm authored May 08, 2024
  
  e288df06
08 May, 2024 1 commit
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
07 May, 2024 2 commits
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
- [Kernel] Make static FP8 scaling more robust (#4570) · a98187cf
  Philipp Moritz authored May 06, 2024
  
  Previously, FP8 static scaling worked if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint
  
  https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale
  
  (which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:
  
  |      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
  |------------------|-------|------|-----:|------|-----:|---|-----:|
  |mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
  | - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
  | - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
  | - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
  | - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|
  
  With the fix in this PR, where the scaled activations are clamped to `[-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()]` to make sure there are no NaNs, the performance is
  
  |      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
  |------------------|-------|------|-----:|------|-----:|---|-----:|
  |mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
  | - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
  | - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
  | - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
  | - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|
  
  This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
  
  a98187cf
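
The clamping fix described in a98187cf can be illustrated with a minimal Python sketch. This is not the PR's kernel code. The names `FP8_E4M3FN_MAX`, `scale_and_clamp`, and `naive_cast` are hypothetical, the sketch assumes quantization divides the activation by the static scale, and `naive_cast` is a deliberately simplified model in which any value outside the finite range of `float8_e4m3fn` (largest finite magnitude 448) becomes NaN, since the format has no infinity to overflow to.

```python
import math

# Largest finite magnitude representable in float8_e4m3fn.
FP8_E4M3FN_MAX = 448.0


def scale_and_clamp(x: float, scale: float) -> float:
    """Apply a static scale, then clamp into the finite FP8 range."""
    scaled = x / scale  # assumption: quantization divides by the scale
    return max(-FP8_E4M3FN_MAX, min(scaled, FP8_E4M3FN_MAX))


def naive_cast(scaled: float) -> float:
    """Simplified model (assumption): out-of-range inputs become NaN."""
    return scaled if abs(scaled) <= FP8_E4M3FN_MAX else math.nan


# A calibrated scale that underestimates the true activation maximum.
scale = 0.5
activations = [100.0, -300.0, 12.0]

# Without clamping, -300 / 0.5 = -600 falls outside the range and turns into NaN.
unclamped = [naive_cast(a / scale) for a in activations]

# With clamping, the same value saturates at -448 and stays finite.
clamped = [naive_cast(scale_and_clamp(a, scale)) for a in activations]
```

Saturating loses precision for the outliers, but a single NaN would otherwise propagate through every later matrix multiply, which is consistent with the near-random MMLU scores reported before the fix.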