Commits · e150cf119b2e20f81775f36fd2cdf55321b7e5a8 · OpenDAS / vllm_cscc

05 Dec, 2024 1 commit
- added support for kernels tests with torch 2.3 · e150cf11
  zhuwenwen authored Dec 05, 2024
  
25 Sep, 2024 1 commit
- [Kernel] Fullgraph and opcheck tests (#8479) · 300da091
  bnellnm authored Sep 25, 2024
  
18 Sep, 2024 1 commit
- [CI/Build] Avoid CUDA initialization (#8534) · 6ffa3f31
  Cyrus Leung authored Sep 18, 2024
  
16 Sep, 2024 1 commit
- [Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032) · a091e2da
  ElizaWszola authored Sep 16, 2024
  
  Co-authored-by: Dipika <dipikasikka1@gmail.com>
10 Sep, 2024 1 commit
- [Misc] Fused MoE Marlin support for GPTQ (#8217) · 6cd5e5b0
  Dipika Sikka authored Sep 09, 2024
  
16 Aug, 2024 1 commit
- [Misc/Testing] Use `torch.testing.assert_close` (#7324) · 50b8d08d
  jon-chuang authored Aug 15, 2024
  
02 Jul, 2024 1 commit
- [ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970) · 7c008c51
  Robert Shaw authored Jul 02, 2024
  
  Co-authored-by: Robert Shaw <rshaw@neuralmagic>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>
01 Jul, 2024 1 commit
- [Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029) · 12a59959
  Avshalom Manevich authored Jul 02, 2024
  
04 May, 2024 1 commit
- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011
  Michael Goin authored May 04, 2024
  
  [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)
  
  Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.
  
  This PR enables the following checkpoint loading features for Mixtral:
  
  - Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
  - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
  - Supports different scales for each expert weight
  - Supports Fp8 in QKV layer
  
  Notes:
  
  - The Expert Gate/Router always runs at half / full precision for now.
  - If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
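
The re-quantization note above can be illustrated with a minimal pure-Python sketch. It is not the code from this commit: the helper names (`quantize`, `dequantize`, `requantize_with_max_scale`) and the sample weights are hypothetical, rounding to the actual FP8 grid is omitted, and only the idea of re-expressing each separately scaled Q, K, V shard with the largest per-tensor scale is shown.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def quantize(values, scale):
    """Per-tensor quantization sketch: divide by the scale, clamp to the FP8 range.

    Rounding to representable FP8 values is intentionally left out.
    """
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def dequantize(qvalues, scale):
    """Map quantized values back to real values using the per-tensor scale."""
    return [q * scale for q in qvalues]


def requantize_with_max_scale(shards, scales):
    """Re-express every shard with the largest of the per-shard scales."""
    max_scale = max(scales)
    requantized = [
        quantize(dequantize(q, s), max_scale) for q, s in zip(shards, scales)
    ]
    return requantized, max_scale


# Hypothetical Q, K, V shards, each quantized with its own per-tensor scale.
scales = [0.5, 0.25, 1.0]
weights = [[1.0, -2.0], [0.5, 0.75], [4.0, -8.0]]
shards = [quantize(w, s) for w, s in zip(weights, scales)]

# After re-quantization all shards share one scale, so they can be
# concatenated into one fused weight and multiplied in a single GEMM.
requantized, max_scale = requantize_with_max_scale(shards, scales)
fused = [q for shard in requantized for q in shard]
restored = dequantize(fused, max_scale)
```

Choosing the maximum scale keeps every re-quantized value inside the representable range, at the cost of some resolution for the shards that originally had smaller scales.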

11 Apr, 2024 1 commit
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
25 Mar, 2024 1 commit
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
24 Mar, 2024 1 commit
- [BugFix] 1D query fix for MoE models (#3597) · 41deac4a
  Nick Hill authored Mar 24, 2024
  
11 Mar, 2024 1 commit
- Re-enable the 80 char line width limit (#3305) · 2f8844ba
  Zhuohan Li authored Mar 10, 2024
  
06 Feb, 2024 1 commit
- Add fused top-K softmax kernel for MoE (#2769) · f0d4e145
  Woosuk Kwon authored Feb 05, 2024
  
31 Jan, 2024 1 commit
- Add unit test for Mixtral MoE layer (#2677) · d0d93b92
  Philipp Moritz authored Jan 31, 2024
  