Commits · a98187cf7227695819e199e2e3ad35be0a9a84f3 · OpenDAS / vllm_cscc

04 May, 2024 1 commit

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527) · 2a052011

Michael Goin authored May 04, 2024

Follow-on to #4332 that enables FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading FP8 checkpoints for Mixtral, such as the test model "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8"
- Supports static or dynamic activation quantization with static weight quantization (all per-tensor)
- Supports different scales for each expert weight
- Supports FP8 in the QKV layer

Notes:

- The expert gate/router always runs at half or full precision for now.
- If separate Q, K, and V weights have different weight scales, they are re-quantized using layer.weight_scale.max() so that a single GEMM can be used for performance.
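
The re-quantization step in the notes can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the PR: the helper names (`round_to_e4m3`, `quantize`, `dequantize`, `requantize_to_max_scale`) are hypothetical, weights are flat lists of floats rather than tensors, and FP8 is simulated by rounding to the E4M3 grid (3 mantissa bits, maximum finite value 448). The only detail taken from the commit message is that the shared scale is the maximum of the per-shard weight scales.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(x: float) -> float:
    """Simulate FP8 E4M3 storage: round to the nearest grid value, saturating."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, -6)          # -6 is the smallest normal exponent (subnormals below)
    step = 2.0 ** (exp - 3)       # 3 mantissa bits -> 8 steps per binade
    q = round(mag / step) * step  # Python's round() is round-half-to-even
    return math.copysign(min(q, FP8_E4M3_MAX), x)


def quantize(values, scale):
    """Per-tensor quantization: divide by the scale, then round to the FP8 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(q_values, scale):
    """Per-tensor dequantization: multiply the stored values by the scale."""
    return [q * scale for q in q_values]


def requantize_to_max_scale(shards, scales):
    """Re-quantize each shard (e.g. Q, K, V) to one shared scale.

    Assumption: the shared scale is max(scales), mirroring the
    layer.weight_scale.max() mentioned in the commit message.
    """
    max_scale = max(scales)
    merged = [quantize(dequantize(q, s), max_scale) for q, s in zip(shards, scales)]
    return merged, max_scale


# Hypothetical Q, K, V shards, each stored with its own per-tensor scale.
q_shard, k_shard, v_shard = [4.0, -8.0], [16.0], [1.5, 448.0]
scales = [0.5, 0.25, 1.0]
merged, shared_scale = requantize_to_max_scale([q_shard, k_shard, v_shard], scales)
```

Using the largest scale means no shard can overflow the FP8 range after re-quantization; the cost is that shards that originally had smaller scales lose some resolution.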

2a052011

03 May, 2024 2 commits
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
  43c413ec
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
02 May, 2024 1 commit
- [kernel] fix sliding window in prefix prefill Triton kernel (#4405) · 32881f3f
  Michał Moskal authored May 02, 2024
```
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
```
  32881f3f
18 Apr, 2024 1 commit
- [Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128) · e8cc7967
  Michał Moskal authored Apr 18, 2024
  
  e8cc7967
11 Apr, 2024 2 commits
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
  a10d3056
- [Misc] Add indirection layer for custom ops (#3913) · e9da5a40
  Kunshang Ji authored Apr 11, 2024
  
  e9da5a40
03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

2ff767b5

30 Mar, 2024 1 commit
- [Kernel] Layernorm performance optimization (#3662) · b6d10354
  mawong-amd authored Mar 30, 2024
  
  b6d10354
27 Mar, 2024 1 commit
- feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) · 45b6ef65
  Roger Wang authored Mar 27, 2024
  
  45b6ef65
25 Mar, 2024 2 commits
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
  01bfb22b
- [Core] Refactor Attention Take 2 (#3462) · 925f3332
  Woosuk Kwon authored Mar 24, 2024
  
  925f3332
24 Mar, 2024 2 commits
- [CI] typo fix: is_hip --> is_hip() (#3595) · 8b268a46
  youkaichao authored Mar 24, 2024
  
  8b268a46
- [BugFix] 1D query fix for MoE models (#3597) · 41deac4a
  Nick Hill authored Mar 24, 2024
  
  41deac4a
20 Mar, 2024 1 commit
- [1/n] Triton sampling kernel (#3186) · 426ec4ec
  Antoni Baum authored Mar 20, 2024
```
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
```
  426ec4ec
16 Mar, 2024 2 commits
- fix lint · ad50bf4b
  simon-mo authored Mar 15, 2024
  
  ad50bf4b
- Fixes the incorrect argument in the prefix-prefill test cases (#3246) · 3123f151
  Tao He authored Mar 16, 2024
  
  3123f151
13 Mar, 2024 2 commits
- Add batched RoPE kernel (#3095) · 7e9bd08f
  Terry authored Mar 13, 2024
  
  7e9bd08f
- Add kernel for GeGLU with approximate GELU (#3337) · 602358f8
  Woosuk Kwon authored Mar 12, 2024
  
  602358f8
11 Mar, 2024 1 commit
- Re-enable the 80 char line width limit (#3305) · 2f8844ba
  Zhuohan Li authored Mar 10, 2024
  
  2f8844ba
07 Mar, 2024 1 commit
- Separate attention backends (#3005) · 2daf23ab
  Woosuk Kwon authored Mar 07, 2024
  
  2daf23ab
27 Feb, 2024 1 commit
- Enable GQA support in the prefix prefill kernels (#3007) · 71bcaf99
  Tao He authored Feb 27, 2024
```
Signed-off-by: Tao He <sighingnow@gmail.com>
```
  71bcaf99
22 Feb, 2024 1 commit
- Optimize GeGLU layer in Gemma (#2975) · fd5dcc5c
  Woosuk Kwon authored Feb 21, 2024
  
  fd5dcc5c
06 Feb, 2024 2 commits
- [Minor] More fix of test_cache.py CI test failure (#2750) · fe6d09ae
  Lily Liu authored Feb 06, 2024
  
  fe6d09ae
- Add fused top-K softmax kernel for MoE (#2769) · f0d4e145
  Woosuk Kwon authored Feb 05, 2024
  
  f0d4e145
05 Feb, 2024 1 commit
- [ROCm] Fix some kernels failed unit tests (#2498) · 56f738ae
  Hongxia Yang authored Feb 05, 2024
  
  56f738ae
01 Feb, 2024 1 commit
- Remove hardcoded `device="cuda"` to support more devices (#2503) · 96b6f475
  Kunshang Ji authored Feb 02, 2024
```
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
```
  96b6f475
31 Jan, 2024 2 commits
- Add unit test for Mixtral MoE layer (#2677) · d0d93b92
  Philipp Moritz authored Jan 31, 2024
  
  d0d93b92
- [Minor] Fix test_cache.py CI test failure (#2684) · 89efcf1c
  Philipp Moritz authored Jan 31, 2024
  
  89efcf1c
30 Jan, 2024 2 commits
- Add swap_blocks unit tests (#2616) · 4f65af0e
  Vladimir authored Jan 30, 2024
  
  4f65af0e
- DeepseekMoE support with Fused MoE kernel (#2453) · 5d60def0
  wangding zeng authored Jan 30, 2024
```
Co-authored-by: roy <jasonailu87@gmail.com>
```
  5d60def0
29 Jan, 2024 1 commit

Support FP8-E5M2 KV Cache (#2279) · 9090bf02

zhaoyang-star authored Jan 29, 2024


Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

9090bf02

22 Jan, 2024 1 commit
- Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py (#2553) · 7a0b011d
  Jason Zhu authored Jan 22, 2024
  7a0b011d
18 Jan, 2024 1 commit

[Experimental] Prefix Caching Support (#1669) · d10f8e1d

shiyi.c_98 authored Jan 17, 2024


Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

d10f8e1d

14 Jan, 2024 1 commit
- [CI] Add Buildkite (#2355) · 6e01e8c1
  Simon Mo authored Jan 14, 2024
  
  6e01e8c1
04 Jan, 2024 1 commit
- Revert the changes in test_cache (#2335) · 94176712
  Woosuk Kwon authored Jan 03, 2024
  
  94176712
03 Jan, 2024 2 commits
- Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) · fd4ea8ef
  Zhuohan Li authored Jan 04, 2024
  
  fd4ea8ef
- [FIX] Support non-zero CUDA devices in custom kernels (#1959) · 77af974b
  Jee Li authored Jan 03, 2024
  
  77af974b
10 Dec, 2023 1 commit
- Replace head_mapping params with num_kv_heads to attention kernel. (#1997) · dacaf5a4
  wbn authored Dec 11, 2023
```
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
```
  dacaf5a4
03 Dec, 2023 1 commit
- Add PyTorch-native implementation of custom layers (#1898) · 9b294976
  Woosuk Kwon authored Dec 02, 2023
  
  9b294976