- 01 Aug, 2024 1 commit
-
-
zhuwenwen authored
-
- 24 Jul, 2024 1 commit
-
-
zhuwenwen authored
-
- 22 Jul, 2024 1 commit
-
-
zhuwenwen authored
-
- 20 Jul, 2024 3 commits
- 09 Jul, 2024 1 commit
-
-
huangwb authored
-
- 08 Jul, 2024 1 commit
-
-
zhuwenwen authored
-
- 06 Jul, 2024 1 commit
-
-
zhuwenwen authored
-
- 10 Jun, 2024 2 commits
-
-
Cyrus Leung authored
-
Cyrus Leung authored
Co-authored-by: Roger Wang <ywang@roblox.com>
-
- 08 Jun, 2024 1 commit
-
-
Michael Goin authored
-
- 07 Jun, 2024 1 commit
-
-
Calvinn Ng authored
Co-authored-by: team <calvinn.ng@ahrefs.com>
-
- 05 Jun, 2024 1 commit
-
-
Cody Yu authored
-
- 03 Jun, 2024 1 commit
-
-
Cyrus Leung authored
-
- 01 Jun, 2024 1 commit
-
-
chenqianfzh authored
-
- 31 May, 2024 1 commit
-
-
Cody Yu authored
-
- 27 May, 2024 2 commits
-
-
Isotr0py authored
-
Zhuohan Li authored
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
- 25 May, 2024 1 commit
-
-
Eric Xihui Lin authored
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
- 23 May, 2024 1 commit
-
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
- 22 May, 2024 3 commits
-
-
Philipp Moritz authored
-
raywanb authored
-
Cody Yu authored
The second PR for #4532. This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).
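
As a rough illustration of what "loading scaling factors from a checkpoint" can look like, the sketch below scans a checkpoint-like mapping for parameters ending in `.kv_scale` and falls back to a default elsewhere. Only the `.kv_scale` suffix comes from the commit message; the parameter names, the helper names, and the default of 1.0 are assumptions made for illustration, not the PR's actual loader.

```python
KV_SCALE_SUFFIX = ".kv_scale"  # suffix named in the commit message


def collect_kv_scales(checkpoint):
    """Return {layer_prefix: scale} for every parameter ending in .kv_scale.

    `checkpoint` is a plain dict of parameter name -> value (hypothetical layout).
    """
    scales = {}
    for name, value in checkpoint.items():
        if name.endswith(KV_SCALE_SUFFIX):
            prefix = name[: -len(KV_SCALE_SUFFIX)]
            scales[prefix] = float(value)
    return scales


def kv_scale_for(scales, layer_prefix, default=1.0):
    """Look up a layer's scale; the 1.0 fallback is an assumption."""
    return scales.get(layer_prefix, default)


# Hypothetical parameter names, for illustration only.
checkpoint = {
    "layers.0.attn.kv_scale": 0.25,
    "layers.0.attn.weight": 3.0,
}
scales = collect_kv_scales(checkpoint)
```

Layers whose checkpoint entry is missing simply use the fallback, so a checkpoint without any `.kv_scale` parameters still loads in this sketch.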
-
- 21 May, 2024 2 commits
- 20 May, 2024 1 commit
-
-
Cyrus Leung authored
-
- 19 May, 2024 1 commit
-
-
Cyrus Leung authored
-
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
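
The idea of one rotary call serving several scaling factors can be sketched in plain Python: build one cos-sin cache per scaling factor, concatenate them, and give each token an offset into the combined cache. This is a minimal sketch assuming linear position scaling and a NeoX-style pair rotation; the function names and cache layout are hypothetical and do not reproduce the actual kernel.

```python
import math


def build_cache(max_pos, rot_dim, base, scaling_factor):
    """cos/sin table for one scaling factor (linear scaling assumed)."""
    inv_freq = [1.0 / (base ** (2 * i / rot_dim)) for i in range(rot_dim // 2)]
    cache = []
    for pos in range(int(max_pos * scaling_factor)):
        t = pos / scaling_factor  # scaled position
        cache.append([(math.cos(t * f), math.sin(t * f)) for f in inv_freq])
    return cache


def build_batched_cache(max_pos, rot_dim, base, scaling_factors):
    """Concatenate per-factor caches; remember where each one starts."""
    combined, offsets = [], {}
    for sf in scaling_factors:
        offsets[sf] = len(combined)
        combined.extend(build_cache(max_pos, rot_dim, base, sf))
    return combined, offsets


def apply_rotary(vec, cos_sin):
    """Rotate pairs (vec[i], vec[i + half]) by the cached angles."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i, (c, s) in enumerate(cos_sin):
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


def batched_rotary(vectors, positions, token_offsets, combined):
    """One pass over all tokens, regardless of which cache each one uses."""
    return [
        apply_rotary(v, combined[p + o])
        for v, p, o in zip(vectors, positions, token_offsets)
    ]


combined, offsets = build_batched_cache(4, 2, 10000.0, [1.0, 2.0])
# Token 0 uses factor 1.0 at position 1; token 1 uses factor 2.0 at position 2.
rotated = batched_rotary(
    [[1.0, 0.0], [1.0, 0.0]], [1, 2], [offsets[1.0], offsets[2.0]], combined
)
```

Because position 2 under a scaling factor of 2.0 maps to the same angle as position 1 under 1.0, both tokens in the usage lines end up with the same rotated vector, even though they read from different regions of the combined cache.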
-
- 17 May, 2024 1 commit
-
-
eigenLiu authored
-
- 13 May, 2024 2 commits
-
-
Philipp Moritz authored
-
Woosuk Kwon authored
-
- 12 May, 2024 1 commit
-
-
Yikang Shen authored
-
- 11 May, 2024 1 commit
-
-
Chang Su authored
-
- 09 May, 2024 1 commit
-
-
Hao Zhang authored
Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
- 04 May, 2024 1 commit
-
-
Michael Goin authored
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:
  - Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
  - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
  - Supports different scales for each expert weight
  - Supports FP8 in the QKV layer

Notes:
  - The Expert Gate/Router always runs at half / full precision for now.
  - If the QKV layer has different weight scales (for separate Q, K, and V weights), the weights are re-quantized using layer.weight_scale.max() so that a single GEMM can be used for performance.
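
The re-quantization note can be illustrated with a small sketch: each shard is dequantized with its own scale and then re-quantized with the largest scale, so all shards share one scale. This is a simplified model in plain Python, assuming per-tensor scales and the float8 e4m3fn range (maximum finite value 448); it does not round values to the FP8 grid, and the function name is hypothetical rather than taken from the PR.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def requantize_with_max_scale(shards, scales):
    """Re-express every shard under one shared scale, max(scales).

    shards: list of lists holding quantized values (kept as floats here).
    scales: one per-tensor scale per shard.
    Note: real FP8 storage would also round to the e4m3 grid; omitted here.
    """
    max_scale = max(scales)
    requantized = []
    for quantized, scale in zip(shards, scales):
        # Dequantize with the shard's own scale...
        real = [v * scale for v in quantized]
        # ...then quantize again with the shared scale, clamping to FP8 range.
        requantized.append(
            [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / max_scale)) for v in real]
        )
    return requantized, max_scale


# Two shards with different scales (made-up numbers).
new_shards, shared_scale = requantize_with_max_scale(
    [[100.0, -200.0], [50.0]], [0.5, 1.0]
)
```

Choosing the maximum scale means no re-quantized value can exceed the range it already fit in, at the cost of some precision for shards whose original scale was smaller.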
-
- 01 May, 2024 1 commit
-
-
Philipp Moritz authored
Remove the device="cuda" declarations in mixtral as promised in #4343
-
- 27 Apr, 2024 2 commits
-
-
Robert Shaw authored
-
Philipp Moritz authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 26 Apr, 2024 1 commit
-
-
Cody Yu authored
-