- 01 Feb, 2025 3 commits
-
-
Lucas Wilkinson authored
This PR implements DeepSeek V3 support by performing matrix absorption on the fp8 weights.

---------

Signed-off-by:
Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by:
Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by:
simon-mo <simon.mo@hey.com> Co-authored-by:
Michael Goin <mgoin64@gmail.com> Co-authored-by:
Zhuohan Li <zhuohan123@gmail.com> Co-authored-by:
Tyler Michael Smith <tysmith@redhat.com> Co-authored-by:
Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
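
The commit message above mentions "matrix absorption" without detail. As a rough illustration of the general idea only (this is not vLLM code, and the names `matmul`, `w_down`, `w_up`, and `w_absorbed` are hypothetical), two back-to-back projections can be folded into a single precomputed weight, because matrix multiplication is associative:

```python
# Hypothetical illustration of "matrix absorption": fold two consecutive
# linear projections into one precomputed weight. Not taken from vLLM.


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Toy integer values so the comparison below is exact.
x = [[1, 2]]                  # one token, hidden size 2
w_down = [[1, 0, 2],
          [3, 1, 0]]          # 2 x 3 projection
w_up = [[1, 2],
        [0, 1],
        [4, 0]]               # 3 x 2 projection

# Unabsorbed: two matrix multiplications per token.
two_step = matmul(matmul(x, w_down), w_up)

# Absorbed: multiply the weights once ahead of time,
# then a single matrix multiplication per token.
w_absorbed = matmul(w_down, w_up)
one_step = matmul(x, w_absorbed)

print(two_step, one_step)
```

With quantized weights the product generally has to be formed in a higher precision first; how the PR handles that for fp8 is not described in the commit message.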
-
Rahul Tuli authored
This PR addresses a bug in the Cutlass integration where the `sparsity_config.ignore` list was not being respected. When only a subset of modules were configured as Sparse24, the system incorrectly selected Cutlass for non-sparse modules as well. This update ensures the correct scheme is selected for non-sparse modules, fixing this behavior.

---

### Changes

- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of defaulting to Cutlass.

---

<details>
<summary>Testing Setup</summary>

The fix has been tested on top of [this diff](https://github.com/vllm-project/vllm/pull/12097).

#### Steps to Test:

```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c4 # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```

#### Additional Patch Required:

```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
                                              QuantizationStrategy,
                                              QuantizationType)
 from pydantic import BaseModel
-
+from vllm.logger import init_logger
 from vllm.model_executor.layers.fused_moe import FusedMoE
 from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                                UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
     should_ignore_layer)
 from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
 from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
 __all__ = ["CompressedTensorsLinearMethod"]
 
 SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```

Apply using:

```bash
git apply logging-patch.patch
```

</details>

---

<details>
<summary>Models Tested</summary>

- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`

</details>

---

<details>
<summary>Example Output</summary>

#### Layers 0-5 (Sparse24)

```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```

#### Layers 6+ (Non-Sparse, FP8)

```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```

</details>

**Note:** Assumed all modules in fused layers such as `QKV_proj` and `Gate_up_proj` follow the same quantization/pruning scheme.

---

For related tasks using the Asana app for GitHub, refer to [this link](https://app.asana.com/0/0/1209227810815160).

Signed-off-by:
Rahul Tuli <rahul@neuralmagic.com>
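
The behavior described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper `select_scheme`, its prefix-matching rule, and the `"Unquantized"` fallback label are all assumptions; only the scheme names `CompressedTensors24` and `CompressedTensorsW8A8Fp8` come from the example output above.

```python
from typing import List, Optional


def select_scheme(layer_name: str,
                  sparse_ignore: List[str],
                  quant_scheme: Optional[str]) -> str:
    """Pick a scheme name for one layer (hypothetical helper).

    A layer counts as ignored for sparsity if its name equals an entry in
    `sparse_ignore` or sits underneath one (dot-separated prefix match).
    """
    ignored = any(
        layer_name == entry or layer_name.startswith(entry + ".")
        for entry in sparse_ignore
    )
    if not ignored:
        # Sparse24 modules get the 2:4 sparse scheme.
        return "CompressedTensors24"
    # Non-sparse modules fall back to their quantization scheme, if any,
    # instead of defaulting to the sparse (Cutlass) path.
    return quant_scheme or "Unquantized"


ignore = ["model.layers.6"]
fp8 = "CompressedTensorsW8A8Fp8"
print(select_scheme("model.layers.0.mlp.down_proj", ignore, fp8))
print(select_scheme("model.layers.6.mlp.down_proj", ignore, fp8))
```

The dot-boundary check keeps an ignore entry such as `model.layers.6` from accidentally matching `model.layers.60`.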
-
Eldar Kurtic authored
Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists names of layers produces a model that is not loadable by vLLM (i.e. `vllm serve <model>` fails with `raise ValueError(f"Unable to find matching target for {module} in the ...`).

Example recipe:

```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            . . .
          ]
"""
```

To reproduce the vLLM error:

```bash
vllm serve nm-testing/eldar-test
```

With this PR
------------
Models are loaded correctly without any errors.
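
To make the failure mode concrete, here is a minimal sketch of matching a layer against an explicit target list. It is an illustration under stated assumptions, not the loader's actual code: `match_target` is a hypothetical helper, the `re:` prefix for regex targets is assumed, and returning `None` for an unlisted layer (so the caller can treat it as unquantized rather than raise) is one possible behavior that the commit message does not confirm.

```python
import re
from typing import Iterable, Optional


def match_target(layer_name: str,
                 module_type: str,
                 targets: Iterable[str]) -> Optional[str]:
    """Return the first target matching a layer, or None (hypothetical)."""
    for target in targets:
        if target.startswith("re:"):
            # Assumed convention: "re:<pattern>" is matched as a regex
            # against the full layer name.
            if re.match(target[3:], layer_name):
                return target
        elif target in (layer_name, module_type):
            # Exact layer name (as in the recipe above) or module class name.
            return target
    return None


targets = [
    "model.layers.0.mlp.down_proj",
    "model.layers.2.mlp.down_proj",
]
# Listed explicitly, so it matches.
print(match_target("model.layers.0.mlp.down_proj", "Linear", targets))
# Layer 1 is absent from the list, so no target matches.
print(match_target("model.layers.1.mlp.down_proj", "Linear", targets))
```

A caller that raises on `None` would reproduce the kind of error quoted above for any layer the recipe leaves out.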
-
- 31 Jan, 2025 3 commits
-
-
Tyler Michael Smith authored
Integrates the block-quantized kernels introduced in https://github.com/vllm-project/vllm/pull/11868 for use in linear layers. Signed-off-by:
Tyler Michael Smith <tyler@neuralmagic.com>
-
Robert Shaw authored
SUMMARY:
* previous PR for pulling in block configs also changed defaults (https://github.com/vllm-project/vllm/pull/11589/files) for FP8
* this broke L4 MoE since there was not enough SHM for the default configuration
* this reverts the non-block example to the default

Signed-off-by:
rshaw@neuralmagic.com <rshaw@neuralmagic.com>
-
Robert Shaw authored
Co-authored-by: simon-mo <xmo@berkeley.edu>
-
- 30 Jan, 2025 1 commit
-
-
Robert Shaw authored
Signed-off-by:
rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by:
mgoin <michael@neuralmagic.com> Co-authored-by:
mgoin <michael@neuralmagic.com> Co-authored-by:
simon-mo <xmo@berkeley.edu>
-
- 29 Jan, 2025 1 commit
-
-
Jinzhen Lin authored
-
- 28 Jan, 2025 1 commit
-
-
Harry Mellor authored
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
-
- 26 Jan, 2025 1 commit
-
-
Tyler Michael Smith authored
Signed-off-by:
Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by:
mgoin <michael@neuralmagic.com>
-
- 25 Jan, 2025 1 commit
-
-
Divakar Verma authored
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
-
- 23 Jan, 2025 3 commits
-
-
Dipika Sikka authored
[BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order (#11528) Signed-off-by:
ElizaWszola <eliza@neuralmagic.com> Co-authored-by:
ElizaWszola <eliza@neuralmagic.com> Co-authored-by:
Michael Goin <michael@neuralmagic.com>
-
Gregory Shtrasberg authored
Signed-off-by:
Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by:
Micah Williamson <micah.williamson@amd.com>
-
rasmith authored
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
-
- 21 Jan, 2025 2 commits
-
-
Nicolò Lucchesi authored
Signed-off-by: NickLucche <nlucches@redhat.com>
-
Gregory Shtrasberg authored
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
-
- 19 Jan, 2025 3 commits
-
-
Martin Gleize authored
Signed-off-by:
Martin Gleize <mgleize@meta.com> Co-authored-by:
mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>
-
Roger Wang authored
Signed-off-by:
Roger Wang <ywang@roblox.com> Signed-off-by:
DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by:
imkero <kerorek@outlook.com> Co-authored-by:
DarkLight1337 <tlleungac@connect.ust.hk>
-
yancong authored
-
- 17 Jan, 2025 2 commits
-
-
Gregory Shtrasberg authored
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
-
Li, Jiang authored
Signed-off-by: jiang1.li <jiang1.li@intel.com>
-
- 16 Jan, 2025 4 commits
-
-
youkaichao authored
Signed-off-by: youkaichao <youkaichao@gmail.com>
-
Michael Goin authored
Signed-off-by: mgoin <michael@neuralmagic.com>
-
Elfie Guo authored
Signed-off-by:
mgoin <michael@neuralmagic.com> Co-authored-by:
Michael Goin <mgoin@redhat.com> Co-authored-by:
mgoin <michael@neuralmagic.com>
-
zhuwenwen authored
-
- 15 Jan, 2025 3 commits
-
-
kewang-xlnx authored
Signed-off-by:
kewang-xlnx <kewang@xilinx.com> Signed-off-by:
kewang2 <kewang2@amd.com> Co-authored-by:
kewang2 <kewang2@amd.com> Co-authored-by:
Michael Goin <michael@neuralmagic.com>
-
Rahul Tuli authored
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
-
Jee Jee Li authored
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
-
- 14 Jan, 2025 1 commit
-
-
zhuwenwen authored
-
- 13 Jan, 2025 2 commits
-
-
Steve Luo authored
Signed-off-by:
mgoin <michael@neuralmagic.com> Co-authored-by:
mgoin <michael@neuralmagic.com>
-
Isotr0py authored
Signed-off-by:
Isotr0py <2037008807@qq.com> Co-authored-by:
Cyrus Leung <cyrus.tl.leung@gmail.com>
-
- 12 Jan, 2025 1 commit
-
-
Avshalom Manevich authored
-
- 11 Jan, 2025 1 commit
-
-
shaochangxu authored
Signed-off-by:
shaochangxu.scx <shaochangxu.scx@antgroup.com> Co-authored-by:
shaochangxu.scx <shaochangxu.scx@antgroup.com>
-
- 10 Jan, 2025 3 commits
-
-
Li, Jiang authored
Signed-off-by: jiang1.li <jiang1.li@intel.com>
-
wangxiyuan authored
Signed-off-by:
wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by:
youkaichao <youkaichao@gmail.com> Co-authored-by:
youkaichao <youkaichao@gmail.com>
-
cennn authored
Co-authored-by: youkaichao <youkaichao@gmail.com>
-
- 09 Jan, 2025 1 commit
-
-
Cyrus Leung authored
Signed-off-by:
DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by:
Maxime Fournioux <55544262+mfournioux@users.noreply.github.com> Co-authored-by:
Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
-
- 08 Jan, 2025 3 commits
-
-
rasmith authored
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
-
Robert Shaw authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
Yan Ma authored
Signed-off-by: yan ma <yan.ma@intel.com>
-