Commits · 4e2d95e372ad5fbef7b27c66d527c37477c0c8bb · OpenDAS / vllm_cscc

28 Oct, 2024 1 commit
- [Hardware][ROCM] using current_platform.is_rocm (#9642) · 4e2d95e3
  wangshuai09 authored Oct 28, 2024
```
Signed-off-by: wangshuai09 <391746016@qq.com>
```
18 Sep, 2024 1 commit
- [CI/Build] Avoid CUDA initialization (#8534) · 6ffa3f31
  Cyrus Leung authored Sep 18, 2024
  
30 Aug, 2024 1 commit

[Model] Adding support for MSFT Phi-3.5-MoE (#7729) · 1248e850

Wenxiang authored Aug 31, 2024


Co-authored-by: Your Name <you@example.com>
Co-authored-by: Zeqi Lin <zelin@microsoft.com>
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>

27 Aug, 2024 1 commit
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) · fc911880
  Dipika Sikka authored Aug 27, 2024
```
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
```
22 Aug, 2024 2 commits
- [Misc] update fp8 to use `vLLMParameter` (#7437) · 955b5191
  Dipika Sikka authored Aug 22, 2024
  
- Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) · aae74ef9
  Michael Goin authored Aug 21, 2024
  
21 Aug, 2024 1 commit
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527) · 8678a69a
  Dipika Sikka authored Aug 21, 2024
```
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
```
16 Aug, 2024 2 commits
- [Kernel] W8A16 Int8 inside FusedMoE (#7415) · 7fc23be8
  Mor Zusman authored Aug 16, 2024
  
- [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) · e837b624
  Charlie Fu authored Aug 16, 2024
  
13 Aug, 2024 1 commit
- [Misc] Update Fused MoE weight loading (#7334) · d3bdfd3a
  Dipika Sikka authored Aug 13, 2024
  
07 Aug, 2024 1 commit
- [Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219) · 5223199e
  Michael Goin authored Aug 07, 2024
  
29 Jul, 2024 1 commit
- [Bugfix] Allow vllm to still work if triton is not installed. (#6786) · 9a7e2d05
  Thomas Parnell authored Jul 29, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
25 Jul, 2024 1 commit
- [ Misc ] `fp8-marlin` channelwise via `compressed-tensors` (#6524) · 889da130
  Robert Shaw authored Jul 25, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
23 Jul, 2024 2 commits
- [Misc] Add ignored layers for `fp8` quantization (#6657) · 0eb0757b
  Michael Goin authored Jul 23, 2024
  
- [Misc] Support FP8 kv cache scales from compressed-tensors (#6528) · 9e0b558a
  Michael Goin authored Jul 23, 2024
  
20 Jul, 2024 1 commit
- [ Misc ] `fbgemm` checkpoints (#6559) · 683e3cb9
  Robert Shaw authored Jul 20, 2024
  
19 Jul, 2024 1 commit
- [ Kernel ] Enable Dynamic Per Token `fp8` (#6547) · 4cc24f01
  Robert Shaw authored Jul 19, 2024
  
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
14 Jul, 2024 1 commit
- [ Misc ] Apply MoE Refactor to Deepseekv2 To Support Fp8 (#6417) · fb6af8bc
  Robert Shaw authored Jul 13, 2024
  
11 Jul, 2024 1 commit
- [ Misc ] Refactor Marlin Python Utilities (#6082) · b675069d
  Robert Shaw authored Jul 11, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
07 Jul, 2024 1 commit
- [ Misc ] Support Fp8 via `llm-compressor` (#6110) · abfe705a
  Robert Shaw authored Jul 07, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
03 Jul, 2024 2 commits
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) · 47f0954a
  Michael Goin authored Jul 03, 2024
  
- [hardware][misc] introduce platform abstraction (#6080) · 482045ee
  youkaichao authored Jul 02, 2024
  
02 Jul, 2024 1 commit

[ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970) · 7c008c51

Robert Shaw authored Jul 02, 2024


Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

01 Jul, 2024 1 commit
- [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) · 614aa512
  youkaichao authored Jun 30, 2024
  
30 Jun, 2024 1 commit
- [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940) · af9ad46f
  Robert Shaw authored Jun 30, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
28 Jun, 2024 1 commit
- [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921) · 2cd402e1
  Robert Shaw authored Jun 28, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
20 Jun, 2024 1 commit
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) · 3f3b6b21
  Tyler Michael Smith authored Jun 20, 2024
  
14 Jun, 2024 1 commit
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) · 703475f6
  Tyler Michael Smith authored Jun 14, 2024
  
13 Jun, 2024 2 commits

[Kernel] Disable CUTLASS kernels for fp8 (#5505) · e38042d4
Tyler Michael Smith authored Jun 13, 2024

[Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56

Tyler Michael Smith authored Jun 13, 2024


Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

08 Jun, 2024 2 commits

[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) · c09dade2
Michael Goin authored Jun 08, 2024

[Bug Fix] Fix the support check for FP8 CUTLASS (#5352) · e69ded7d

Cheng Li authored Jun 07, 2024

Bug description:
With torch 2.4.0.dev20240603+cu121, `cutlass_fp8_supported` returns False, and the (capability, version) pair before the comparison is (90, 11111111112).

This PR fixes the support check for FP8 CUTLASS (`cutlass_fp8_supported`), which was introduced in https://github.com/vllm-project/vllm/pull/5183.
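
The value 11111111112 reported above is consistent with arithmetic performed on a version string instead of on integers. The sketch below is only an illustration of that failure mode, not the code from the PR: `buggy_version_code` and `parse_version_code` are hypothetical helpers, and it assumes the CUDA version arrives as a string such as "12.1".

```python
def buggy_version_code(cuda_version: str) -> str:
    """Show how indexing and 'multiplying' a string yields a bogus value."""
    # cuda_version[0] is the character "1"; "1" * 10 repeats it ten times,
    # and "+" concatenates the next character instead of adding a number.
    return cuda_version[0] * 10 + cuda_version[1]


def parse_version_code(cuda_version: str) -> int:
    """Parse "major.minor" into one integer, e.g. "12.1" -> 121."""
    major, minor = cuda_version.split(".")[:2]
    return int(major) * 10 + int(minor)


if __name__ == "__main__":
    print(buggy_version_code("12.1"))   # string made of repeated characters
    print(parse_version_code("12.1"))   # integer that can be compared safely
```

Splitting on the dot before converting to `int` keeps two-digit major versions intact, which plain character indexing cannot do.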

07 Jun, 2024 1 commit

[Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) · 8d75fe48

Tyler Michael Smith authored Jun 07, 2024

Switch from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when they are supported; we are seeing a 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8.

See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.

05 Jun, 2024 1 commit
- [Model] Correct Mixtral FP8 checkpoint loading (#5231) · 5563a4de
  Cody Yu authored Jun 05, 2024
  
22 May, 2024 1 commit

[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0

Cody Yu authored May 22, 2024

The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (one that carries a `.kv_scale` parameter).
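
As a rough illustration of what picking up `.kv_scale` entries from a checkpoint could look like, the sketch below filters a list of (name, value) pairs by suffix. The helper `collect_kv_scales` and the parameter names in the example are hypothetical and are not taken from the PR; only the `.kv_scale` suffix comes from the description above.

```python
from typing import Dict, Iterable, Tuple


def collect_kv_scales(
    named_params: Iterable[Tuple[str, float]],
    suffix: str = ".kv_scale",
) -> Dict[str, float]:
    """Map each layer prefix to its kv-cache scaling factor."""
    scales: Dict[str, float] = {}
    for name, value in named_params:
        if name.endswith(suffix):
            # Strip the suffix so the key names the owning layer.
            scales[name[: -len(suffix)]] = float(value)
    return scales


# Hypothetical checkpoint contents, for illustration only.
checkpoint = [
    ("model.layers.0.self_attn.kv_scale", 0.5),
    ("model.layers.0.self_attn.qkv_proj.weight", 1.0),
    ("model.layers.1.self_attn.kv_scale", 0.25),
]
kv_scales = collect_kv_scales(checkpoint)
```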

09 May, 2024 1 commit

[Kernel] [FP8] Improve FP8 linear layer performance (#4691) · 379da6dc

Philipp Moritz authored May 09, 2024

This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)).

We noticed that cuBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
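
The padding idea can be sketched in plain Python: add zero rows until the first dimension exceeds 16, multiply, then drop the extra output rows. This is a toy model under stated assumptions, not the PR's implementation: the helpers `pad_rows`, `matmul`, and `padded_matmul` and the default of 17 rows are illustrative choices, and no FP8 quantization or GPU library is involved.

```python
from typing import List

Matrix = List[List[float]]


def pad_rows(x: Matrix, min_rows: int) -> Matrix:
    """Append all-zero rows until x has at least min_rows rows."""
    cols = len(x[0])
    missing = max(0, min_rows - len(x))
    return x + [[0.0] * cols for _ in range(missing)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive dense matrix product, used here only for checking."""
    return [
        [sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def padded_matmul(x: Matrix, w: Matrix, min_rows: int = 17) -> Matrix:
    """Multiply with a padded left operand and slice back to the true rows."""
    rows = len(x)
    out = matmul(pad_rows(x, min_rows), w)
    return out[:rows]  # zero rows only produce zero rows, so slicing is safe


x = [[1.0, 2.0]]
w = [[3.0], [4.0]]
result = padded_matmul(x, w)
```

Because the added rows are zeros, they change only the shape handed to the GEMM, not the values in the rows that are kept.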

Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4). All FP8 measurements are for dynamic quantization:

qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
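
To read the figures above as relative changes, a short script can compute the percentage reduction in ITL. The numbers are copied from the list above; the helper name `reduction_pct` and the table layout are illustrative.

```python
# qps -> (FP8 this PR, FP8 previous main, FP16), all ITL in milliseconds
itl_ms = {
    1: (24, 32, 26),
    2: (26, 34, 28),
    4: (33, 44, 36),
    6: (46, 56, 54),
    8: (85, 85, 138),
}


def reduction_pct(new_ms: float, old_ms: float) -> float:
    """Percentage by which new_ms is lower than old_ms."""
    return 100.0 * (old_ms - new_ms) / old_ms


for qps, (fp8_new, fp8_old, fp16) in itl_ms.items():
    print(
        f"qps={qps}: "
        f"{reduction_pct(fp8_new, fp8_old):.1f}% vs previous FP8, "
        f"{reduction_pct(fp8_new, fp16):.1f}% vs FP16"
    )
```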

30 Apr, 2024 1 commit

[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332) · 111815d4

Robert Shaw authored Apr 30, 2024


Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

27 Apr, 2024 1 commit
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  