Commits · 2acba47d9bf97135d33355eff303d61a2c8d3d8a · OpenDAS / vllm_cscc

21 Jan, 2025 1 commit
- [bugfix] moe tuning. rm is_navi() (#12273) · 2acba47d
  Divakar Verma authored Jan 21, 2025
```
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
```
17 Jan, 2025 1 commit
- [ROCm][MoE] moe tuning support for rocm (#12049) · 8027a724
  Divakar Verma authored Jan 17, 2025
```
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
```
16 Jan, 2025 1 commit
- [misc] Add LoRA kernel micro benchmarks (#11579) · 5fd24ec0
  Varun Sundar Rabindranath authored Jan 16, 2025
  
17 Dec, 2024 1 commit

- [Misc] Kernel Benchmark for `RMSNorm` (#11241) · 02222a02
  Roger Wang authored Dec 16, 2024

  Signed-off-by: Roger Wang <ywang@roblox.com>
  Co-authored-by: Xiaoyu Zhang <BBuf@users.noreply.github.com>

19 Nov, 2024 1 commit
- [Model][Quantization] HQQ support through Marlin kernel expansion (#9766) · b00b33d7
  ElizaWszola authored Nov 19, 2024
```
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
```
18 Nov, 2024 1 commit
- [Kernel] Initial Machete W4A8 support + Refactors (#9855) · 96d999fb
  Lucas Wilkinson authored Nov 18, 2024
```
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
```
06 Nov, 2024 1 commit
- [CI/Build] drop support for Python 3.8 EOL (#8464) · 21063c11
  Aaron Pham authored Nov 06, 2024
```
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
```
29 Oct, 2024 1 commit
- [Hardware] using current_platform.seed_everything (#9785) · 622b7ab9
  wangshuai09 authored Oct 29, 2024
```
Signed-off-by: wangshuai09 <391746016@qq.com>
```
28 Oct, 2024 1 commit
- [torch.compile] support moe models (#9632) · 32176fee
  youkaichao authored Oct 27, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
16 Oct, 2024 1 commit
- [Misc] Standardize RoPE handling for Qwen2-VL (#9250) · 7e7eae33
  Cyrus Leung authored Oct 16, 2024
  
23 Sep, 2024 1 commit

- [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701) · 86e9c8df
  Lucas Wilkinson authored Sep 23, 2024

  Co-authored-by: mgoin <michael@neuralmagic.com>
  Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
  Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

18 Sep, 2024 2 commits
- [CI/Build] Update Ruff version (#8469) · 9d104b5b
  Aaron Pham authored Sep 18, 2024
```
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
```
- [CI/Build] Avoid CUDA initialization (#8534) · 6ffa3f31
  Cyrus Leung authored Sep 18, 2024
  
22 Aug, 2024 1 commit
- [Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` (#7233) · 7937009a
  Luka Govedič authored Aug 21, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
20 Aug, 2024 1 commit
- [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) · 5288c06a
  Lucas Wilkinson authored Aug 20, 2024
  
16 Aug, 2024 1 commit
- [Kernel] W8A16 Int8 inside FusedMoE (#7415) · 7fc23be8
  Mor Zusman authored Aug 16, 2024
  
02 Aug, 2024 1 commit
- [Misc] Disambiguate quantized types via a new ScalarType (#6396) · a8d604ca
  Lucas Wilkinson authored Aug 02, 2024
  
27 Jul, 2024 2 commits
- [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) · 75acdaa4
  Alexander Matveev authored Jul 27, 2024
  
- [Model] H2O Danube3-4b (#6451) · 14dbd5a7
  Joe authored Jul 26, 2024
  
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
11 Jul, 2024 1 commit
- [ Misc ] Refactor Marlin Python Utilities (#6082) · b675069d
  Robert Shaw authored Jul 11, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
20 Jun, 2024 1 commit
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) · 8065a7e2
  Michael Goin authored Jun 20, 2024
  
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
14 Jun, 2024 1 commit
- [Misc] Fix arg names (#5524) · d74674bb
  Allen.Dou authored Jun 15, 2024
  
05 Jun, 2024 1 commit
- [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) · 51a08e7d
  Philipp Moritz authored Jun 05, 2024
  
04 Jun, 2024 2 commits
- [Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242) · 27208be6
  Woosuk Kwon authored Jun 04, 2024
  
- [Kernel] Enhance MoE benchmarking & tuning script (#4921) · 3a434b07
  Woosuk Kwon authored Jun 03, 2024
  
31 May, 2024 2 commits
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
  
- [Model] Support MAP-NEO model (#5081) · a22dea54
  SnowDist authored May 31, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
23 May, 2024 1 commit
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
22 May, 2024 1 commit

- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024

  The 2nd PR for #4532.

  This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with a `.kv_scale` parameter).

16 May, 2024 1 commit
- Add marlin unit tests and marlin benchmark script (#4815) · 5c342570
  alexm-nm authored May 16, 2024
  
03 May, 2024 1 commit
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
01 May, 2024 1 commit

- [Kernel] Update fused_moe tuning script for FP8 (#4457) · 24bb4fe4
  Philipp Moritz authored May 01, 2024

  This PR updates the tuning script for the fused_moe kernel to support FP8 and adds configurations for TP4. Note that I removed num_warps and num_stages from the configurations for small batch sizes: doing so improved performance and brought the benchmarks in that regime on par with the earlier numbers, so that this PR is a strict improvement over the status quo.

  All numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1 with 1000 input and 50 output tokens.

  Before this PR (with static activation scaling):

  - qps = 1: 9.8 ms ITL, 0.49 s e2e latency
  - qps = 2: 9.7 ms ITL, 0.49 s e2e latency
  - qps = 4: 10.1 ms ITL, 0.52 s e2e latency
  - qps = 6: 11.9 ms ITL, 0.59 s e2e latency
  - qps = 8: 14.0 ms ITL, 0.70 s e2e latency
  - qps = 10: 15.7 ms ITL, 0.79 s e2e latency

  After this PR (with static activation scaling):

  - qps = 1: 9.8 ms ITL, 0.49 s e2e latency
  - qps = 2: 9.7 ms ITL, 0.49 s e2e latency
  - qps = 4: 10.2 ms ITL, 0.53 s e2e latency
  - qps = 6: 11.9 ms ITL, 0.59 s e2e latency
  - qps = 8: 11.9 ms ITL, 0.59 s e2e latency
  - qps = 10: 12.1 ms ITL, 0.61 s e2e latency
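The before/after figures in the commit message for #4457 are easier to compare as relative changes. The following is a minimal sketch, not part of the PR: the numbers are copied from the message above, while the names `before`, `after`, `relative_change`, and `compare` are hypothetical helpers introduced only for this illustration.

```python
# Illustrative only: numbers copied from the commit message for #4457.
# Mapping is qps -> (inter-token latency in ms, end-to-end latency in s).
before = {1: (9.8, 0.49), 2: (9.7, 0.49), 4: (10.1, 0.52),
          6: (11.9, 0.59), 8: (14.0, 0.70), 10: (15.7, 0.79)}
after = {1: (9.8, 0.49), 2: (9.7, 0.49), 4: (10.2, 0.53),
         6: (11.9, 0.59), 8: (11.9, 0.59), 10: (12.1, 0.61)}


def relative_change(old, new):
    """Percentage change from old to new; negative means lower latency."""
    return (new - old) / old * 100.0


def compare(before, after):
    """Return (qps, ITL change %, e2e change %) for each measured qps."""
    rows = []
    for qps in sorted(before):
        itl_b, e2e_b = before[qps]
        itl_a, e2e_a = after[qps]
        rows.append((qps,
                     relative_change(itl_b, itl_a),
                     relative_change(e2e_b, e2e_a)))
    return rows


if __name__ == "__main__":
    for qps, d_itl, d_e2e in compare(before, after):
        print(f"qps = {qps:>2}: ITL {d_itl:+.1f}%, e2e {d_e2e:+.1f}%")
```

Under these numbers, the entries for qps 1, 2, and 6 are unchanged, and the largest reductions in ITL appear at qps = 8 and qps = 10.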

25 Apr, 2024 1 commit
- [Core]refactor aqlm quant ops (#4351) · f4bc4de1
  Kunshang Ji authored Apr 25, 2024
  
23 Apr, 2024 1 commit
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
11 Apr, 2024 1 commit
- [Misc] Add indirection layer for custom ops (#3913) · e9da5a40
  Kunshang Ji authored Apr 11, 2024
  
03 Apr, 2024 1 commit

- Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5
  Adrian Abeyta authored Apr 03, 2024

  Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
  Co-authored-by: HaiShaw <hixiao@gmail.com>
  Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
  Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
  Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
  Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
  Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
  Co-authored-by: guofangze <guofangze@kuaishou.com>
  Co-authored-by: Michael Goin <mgoin64@gmail.com>
  Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

25 Mar, 2024 1 commit
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
14 Mar, 2024 1 commit
- [Kernel] change benchmark script so that result can be directly used; tune moe... · 8fe83865
  youkaichao authored Mar 14, 2024
```
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389)
```