Commits · b02fd288b28f0bfa2d7ac8958fe0d71ec22ffc1b · OpenDAS / vllm_cscc

29 Jan, 2025 1 commit
- [Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. (#11787) · b02fd288
  Pavani Majety authored Jan 29, 2025
```
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  b02fd288
16 Jan, 2025 1 commit
- Various cosmetic/comment fixes (#12089) · 9aa1519f
  Michael Goin authored Jan 16, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
  9aa1519f
15 Jan, 2025 1 commit

- [Misc][Quark] Upstream Quark format to VLLM (#10765) · de0526f6
  kewang-xlnx authored Jan 16, 2025
  Signed-off-by: kewang-xlnx <kewang@xilinx.com>
  Signed-off-by: kewang2 <kewang2@amd.com>
  Co-authored-by: kewang2 <kewang2@amd.com>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>
  de0526f6

22 Nov, 2024 1 commit
- [torch.compile] support all attention backends (#10558) · eebad39f
  youkaichao authored Nov 22, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  eebad39f
18 Nov, 2024 1 commit
- [Misc] Add uninitialized params tracking for `AutoWeightsLoader` (#10327) · c4e46433
  Isotr0py authored Nov 18, 2024
```
Signed-off-by: Isotr0py <2037008807@qq.com>
```
  c4e46433
17 Nov, 2024 1 commit
- [V1] Refactor model executable interface for all text-only language models (#10374) · 643ecf7b
  Roger Wang authored Nov 16, 2024
```
Signed-off-by: Roger Wang <ywang@roblox.com>
```
  643ecf7b
11 Nov, 2024 1 commit
- [6/N] pass whole config to inner model (#10205) · f89d18ff
  youkaichao authored Nov 10, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  f89d18ff
09 Nov, 2024 1 commit
- [5/N] pass the whole config to model (#9983) · 1a95f10e
  youkaichao authored Nov 08, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  1a95f10e
06 Nov, 2024 2 commits
- [V1] Make v1 more testable (#9888) · d58268c5
  Joe Runde authored Nov 06, 2024
```
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
```
  d58268c5
- [CI/Build] drop support for Python 3.8 EOL (#8464) · 21063c11
  Aaron Pham authored Nov 06, 2024
```
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
```
  21063c11
28 Oct, 2024 1 commit
- Adding "torch compile" annotations to moe models (#9758) · aa0addb3
  Yongzao authored Oct 29, 2024
  
  aa0addb3
04 Oct, 2024 2 commits

- [Model] add a bunch of supported lora modules for mixtral (#9008) · 9ade8bbc
  Prashant Gupta authored Oct 04, 2024
```
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
```
  9ade8bbc

- [Models] Add remaining model PP support (#7168) · 0f6d7a9a
  Murali Andoorveedu authored Oct 03, 2024
  Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
  Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
  Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
  0f6d7a9a

10 Sep, 2024 1 commit
- [Misc] Fused MoE Marlin support for GPTQ (#8217) · 6cd5e5b0
  Dipika Sikka authored Sep 09, 2024
  
  6cd5e5b0
30 Aug, 2024 1 commit
- [Core] Logprobs support in Multi-step (#7652) · 428dd144
  afeldman-nm authored Aug 29, 2024
  
  428dd144
27 Aug, 2024 1 commit
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) · fc911880
  Dipika Sikka authored Aug 27, 2024
```
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
```
  fc911880
22 Aug, 2024 1 commit
- Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) · aae74ef9
  Michael Goin authored Aug 21, 2024
  
  aae74ef9
21 Aug, 2024 1 commit
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527) · 8678a69a
  Dipika Sikka authored Aug 21, 2024
```
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
```
  8678a69a
20 Aug, 2024 1 commit
- [Bugfix] support `tie_word_embeddings` for all models (#5724) · f4fc7337
  Zijian Hu authored Aug 19, 2024
  
  f4fc7337
13 Aug, 2024 2 commits
- [Misc] Update Fused MoE weight loading (#7334) · d3bdfd3a
  Dipika Sikka authored Aug 13, 2024
  
  d3bdfd3a
- [Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) · 7025b11d
  Cyrus Leung authored Aug 13, 2024
  
  7025b11d
19 Jul, 2024 1 commit
- [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` (#6515) · dbe55885
  Robert Shaw authored Jul 18, 2024
  
  dbe55885
18 Jul, 2024 1 commit
- [Model] Pipeline parallel support for Mixtral (#6516) · b5af8c22
  Cody Yu authored Jul 17, 2024
  
  b5af8c22
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
  978aed53
14 Jul, 2024 1 commit
- [ Misc ] Apply MoE Refactor to Deepseekv2 To Support Fp8 (#6417) · fb6af8bc
  Robert Shaw authored Jul 13, 2024
  
  fb6af8bc
10 Jul, 2024 1 commit
- [Bugfix] Support 2D input shape in MoE layer (#6287) · e72ae80b
  Woosuk Kwon authored Jul 10, 2024
  
  e72ae80b
02 Jul, 2024 3 commits
- [CORE] Quantized lm-head Framework (#4442) · ee93f4f9
  Qubitium-ModelCloud authored Jul 03, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
```
  ee93f4f9
- [ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970) · 7c008c51
  Robert Shaw authored Jul 02, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  7c008c51
- [Core] Pipeline Parallel Support (#4412) · c5832d2a
  Murali Andoorveedu authored Jul 02, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
  c5832d2a
27 Jun, 2024 2 commits
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896) · 98cf2ed6
  Cyrus Leung authored Jun 28, 2024
  
  98cf2ed6
- [Model] Add base class for LoRA-supported models (#5018) · 96354d6a
  Cyrus Leung authored Jun 27, 2024
  
  96354d6a
08 Jun, 2024 1 commit
- [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) · c09dade2
  Michael Goin authored Jun 08, 2024
  
  c09dade2
05 Jun, 2024 1 commit
- [Model] Correct Mixtral FP8 checkpoint loading (#5231) · 5563a4de
  Cody Yu authored Jun 05, 2024
  
  5563a4de
31 May, 2024 1 commit
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
  
  e9899fb7
27 May, 2024 1 commit

- [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2
  Zhuohan Li authored May 27, 2024
  Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
  1102bef2

22 May, 2024 1 commit

- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024

  The 2nd PR for #4532.

  This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with a `.kv_scale` parameter).

  a3a73ab0
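
For illustration only, here is a minimal sketch of what collecting per-layer kv-cache scaling factors from a checkpoint could look like. The checkpoint key layout (`model.layers.<i>.self_attn.kv_scale`), the helper name `collect_kv_scales`, and the fallback value of 1.0 are assumptions made for this example. None of them is taken from the PR itself.

```python
import re

def collect_kv_scales(checkpoint, num_layers, default=1.0):
    """Return one kv-cache scale per layer, falling back to `default`.

    `checkpoint` is a plain dict of parameter name -> value, standing in
    for a loaded state dict. The key pattern is hypothetical.
    """
    scales = [default] * num_layers
    pattern = re.compile(r"layers\.(\d+)\..*\.kv_scale$")
    for name, value in checkpoint.items():
        match = pattern.search(name)
        if match:
            # The captured group is the layer index.
            scales[int(match.group(1))] = float(value)
    return scales

# Toy checkpoint: layers 0 and 2 carry a scale, layer 1 does not.
checkpoint = {
    "model.layers.0.self_attn.kv_scale": 0.023,
    "model.layers.0.self_attn.qkv_proj.weight": None,  # ignored by the pattern
    "model.layers.2.self_attn.kv_scale": 0.031,
}
kv_scales = collect_kv_scales(checkpoint, num_layers=3)
```

Layers without a `.kv_scale` entry keep the default, so a checkpoint that only partially provides scales still yields a complete per-layer list.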

13 May, 2024 2 commits
- [Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) · 33d3914b
  Philipp Moritz authored May 13, 2024
  
  33d3914b
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
04 May, 2024 1 commit

- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011
  Michael Goin authored May 04, 2024

  [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

  Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436.

  This PR enables the following checkpoint loading features for Mixtral:

  - Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
  - Supports static or dynamic activation quantization with static weight quantization (all per-tensor)
  - Supports different scales for each expert weight
  - Supports FP8 in the QKV layer

  Notes:

  - The Expert Gate/Router always runs at half / full precision for now.
  - If the weight scales differ across the QKV layer (for separate QKV weights), the weights are re-quantized using `layer.weight_scale.max()` so that a single GEMM can be used for performance.

  2a052011
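
The re-quantization note in this commit can be sketched roughly as follows: dequantize each shard with its own scale, then quantize again with the largest scale so that all shards share one per-tensor scale. This is a pure-Python illustration under stated assumptions, not the PR's implementation. The function name `requantize_with_max_scale` is hypothetical, plain floats stand in for FP8 tensors, and rounding to the FP8 grid is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format

def requantize_with_max_scale(shards, scales):
    """Re-express per-shard quantized values under one shared scale.

    shards: list of lists of quantized values (stand-ins for FP8 tensors)
    scales: one per-tensor scale per shard
    Returns (merged_values, shared_scale).
    """
    shared = max(scales)
    merged = []
    for values, scale in zip(shards, scales):
        for q in values:
            real = q * scale          # dequantize with the shard's own scale
            requant = real / shared   # quantize again with the shared scale
            # Clamp to the representable range; real FP8 would also round here.
            requant = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, requant))
            merged.append(requant)
    return merged, shared

# Toy Q, K and V shards with different per-tensor scales.
q_shard, k_shard, v_shard = [100.0, -50.0], [200.0], [-400.0]
merged, shared = requantize_with_max_scale(
    [q_shard, k_shard, v_shard], [0.125, 0.25, 0.5]
)
```

Because the shared scale is the maximum, every re-quantized value shrinks or stays the same in magnitude, so nothing that fit in range before can overflow afterwards. Shards with smaller original scales lose some resolution in exchange for the single fused matrix multiply.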

01 May, 2024 1 commit
- [Misc] Remove Mixtral device="cuda" declarations (#4543) · c9d852d6
  Philipp Moritz authored May 01, 2024
```
Remove the device="cuda" declarations in mixtral as promised in #4343
```
  c9d852d6