- 02 Feb, 2025 1 commit
-
-
Jinzhen Lin authored
Fix https://github.com/vllm-project/vllm/issues/12647 The `get_quant_method` of `moe_wna16` always return moe method, GPTQ-based linear method or AWQ-based linear method, even when the target module is attention layer. https://github.com/vllm-project/vllm/blob/baeded25699f9f4851843306f27f685c4d4ee7c5/vllm/attention/layer.py#L86-L92 Signed-off-by:
Jinzhen Lin <linjinzhen@hotmail.com>
-
- 01 Feb, 2025 4 commits
-
-
Michael Goin authored
-
Lucas Wilkinson authored
This PR implements the Deepseek V3 support by performing matrix absorption the fp8 weights --------- Signed-off-by:
Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by:
Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by:
simon-mo <simon.mo@hey.com> Co-authored-by:
Michael Goin <mgoin64@gmail.com> Co-authored-by:
Zhuohan Li <zhuohan123@gmail.com> Co-authored-by:
Tyler Michael Smith <tysmith@redhat.com> Co-authored-by:
Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
-
Rahul Tuli authored
This PR addresses a bug in the Cutlass integration where the `sparsity_config.ignore` list was not being respected. When only a subset of modules were configured as Sparse24, the system incorrectly selected Cutlass for non-sparse modules as well. This update ensures the correct scheme is selected for non-sparse modules, fixing this behavior. --- ### Changes - Updated logic to correctly respect `sparsity_config.ignore`. - Ensured non-sparse modules use the appropriate scheme instead of defaulting to Cutlass. --- <details> <summary>Testing Setup</summary> The fix has been tested on top of [this diff](https://github.com/vllm-project/vllm/pull/12097). #### Steps to Test: ```bash git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support git revert --no-edit aa2cd2c4 # revert Tyler's commit to turn off Cutlass for W16A16 git cherry-pick ca624cddb # this branch ``` #### Additional Patch Required: ```diff diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py index a54177c1c..f916dd0c9 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py @@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs, QuantizationStrategy, QuantizationType) from pydantic import BaseModel - +from vllm.logger import init_logger from vllm.model_executor.layers.fused_moe import FusedMoE from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase, UnquantizedLinearMethod) @@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import ( should_ignore_layer) from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod from vllm.platforms import current_platform - +logger = init_logger(__name__) __all__ = ["CompressedTensorsLinearMethod"] SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config" ``` Apply using: ```bash git apply logging-patch.patch ``` </details> --- <details> <summary>Models Tested</summary> - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed` </details> --- <details> <summary>Example Output</summary> #### Layers 0-5 (Sparse24) ``` Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj ... ``` #### Layers 6+ (Non-Sparse, FP8) ``` Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj ... ``` </details> **Note:** Assumed all modules in fused layers such as `QKV_proj` and `Gate_up_proj` follow the same quantization/pruning scheme. --- For related tasks using the Asana app for GitHub, refer to [[this link](https://app.asana.com/0/0/1209227810815160)](https://app.asana.com/0/0/1209227810815160 ). Signed-off-by:
Rahul Tuli <rahul@neuralmagic.com>
-
Eldar Kurtic authored
Without this PR --------------- Quantizing models with llm-compressor and a recipe that explicitly lists names of layers produces a model that is not loadable by vLLM (i.e. `vllm serve <model>` fails with `raise ValueError(f"Unable to find matching target for {module} in the ...`). Example recipe: ``` recipe = """ quantization_stage: run_type: oneshot quantization_modifiers: GPTQModifier: ignore: ["lm_head"] config_groups: group_0: weights: num_bits: 4 type: "int" symmetric: true strategy: "group" group_size: 128 targets: [ "model.layers.0.mlp.down_proj", "model.layers.2.mlp.down_proj", "model.layers.3.mlp.down_proj", "model.layers.4.mlp.down_proj", "model.layers.5.mlp.down_proj", "model.layers.6.mlp.down_proj", "model.layers.7.mlp.down_proj", "model.layers.8.mlp.down_proj", "model.layers.9.mlp.down_proj", "model.layers.10.mlp.down_proj", "model.layers.11.mlp.down_proj", "model.layers.12.mlp.down_proj", "model.layers.13.mlp.down_proj", "model.layers.14.mlp.down_proj", "model.layers.15.mlp.down_proj", "model.layers.16.mlp.down_proj", "model.layers.17.mlp.down_proj", "model.layers.19.mlp.down_proj", "model.layers.21.mlp.down_proj", "model.layers.22.mlp.down_proj", . . . ] """ ``` To reproduce the vLLM error: ```bash vllm serve nm-testing/eldar-test ``` With this PR ------------ Models are loaded correctly without any errors.
-
- 31 Jan, 2025 6 commits
-
-
Ryan Nguyen authored
**[Guided decoding performance optimization]** Sending the guided decoding bitmask in xgrammar to the GPU (`self.token_bitmask.to(scores.device)`) is a blocking operation that prevents the CPU from pre-launching the sampler kernels. The CPU waits until decode is complete, then copies the bitmask over. This PR changes the operation to async via setting `non-blocking=True`. (Current) The CPU is blocked on a `cudaStreamSynchronize` and only pre-empts the sampling kernels after bitmask application. Below is the Nsys profile for one decode phase from Llama 3.1 8B.  With the optimization, this is no longer the case:  --------- Signed-off-by:
Ryan N <ryan.nguyen@centml.ai>
-
Tyler Michael Smith authored
Integrates the block-quantized kernels introduced in https://github.com/vllm-project/vllm/pull/11868 for use in linear layers. Signed-off-by:
Tyler Michael Smith <tyler@neuralmagic.com>
-
Robert Shaw authored
SUMMARY: * previous PR for pulling in block configs also changed defaults (https://github.com/vllm-project/vllm/pull/11589/files ) for FP8 * this broke L4 MoE since there was not enough SHM for the default configuration * this reverts the non-block example to the default Signed-off-by:
rshaw@neuralmagic.com <rshaw@neuralmagic.com>
-
Robert Shaw authored
Co-authored-by:simon-mo <xmo@berkeley.edu>
-
Lucas Wilkinson authored
Signed-off-by:
Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by:
simon-mo <xmo@berkeley.edu> Co-authored-by:
Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by:
simon-mo <simon.mo@hey.com> Co-authored-by:
Michael Goin <mgoin64@gmail.com> Co-authored-by:
Zhuohan Li <zhuohan123@gmail.com> Co-authored-by:
Tyler Michael Smith <tysmith@redhat.com> Co-authored-by:
Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com> Co-authored-by:
simon-mo <xmo@berkeley.edu>
-
Aleksandr Malyshev authored
Signed-off-by:
Aleksandr Malyshev <maleksan@amd.com> Co-authored-by:
Aleksandr Malyshev <maleksan@amd.com>
-
- 30 Jan, 2025 1 commit
-
-
Robert Shaw authored
Signed-off-by:
rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by:
mgoin <michael@neuralmagic.com> Co-authored-by:
mgoin <michael@neuralmagic.com> Co-authored-by:
simon-mo <xmo@berkeley.edu>
-
- 29 Jan, 2025 5 commits
-
-
Jinzhen Lin authored
-
Pavani Majety authored
Signed-off-by:
Pavani Majety <pmajety@nvidia.com> Co-authored-by:
mgoin <michael@neuralmagic.com>
-
Alphi authored
Signed-off-by:
hzh <hezhihui_thu@163.com> Signed-off-by:
Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by:
shaochangxu.scx <shaochangxu.scx@antgroup.com> Signed-off-by:
DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by:
NickLucche <nlucches@redhat.com> Signed-off-by:
Isotr0py <2037008807@qq.com> Signed-off-by:
Roger Wang <ywang@roblox.com> Signed-off-by:
Rafael Vasquez <rafvasq21@gmail.com> Signed-off-by:
Akshat Tripathi <akshat@krai.ai> Signed-off-by:
Oleg Mosalov <oleg@krai.ai> Signed-off-by:
Jee Jee Li <pandaleefree@gmail.com> Signed-off-by:
rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by:
Yida Wu <yidawu@alumni.cmu.edu> Signed-off-by:
Chenguang Li <757486878@qq.com> Signed-off-by:
youkaichao <youkaichao@gmail.com> Signed-off-by:
Alex-Brooks <Alex.brooks@ibm.com> Signed-off-by:
Chen Zhang <zhangch99@outlook.com> Signed-off-by:
Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by:
Shanshan Shen <467638484@qq.com> Signed-off-by:
elijah <f1renze.142857@gmail.com> Signed-off-by:
Yikun <yikunkero@gmail.com> Signed-off-by:
mgoin <michael@neuralmagic.com> Signed-off-by:
Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by:
Konrad Zawora <kzawora@habana.ai> Signed-off-by:
tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by:
wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by:
Rui Qiao <ruisearch42@gmail.com> Co-authored-by:
Sungjae Lee <33976427+llsj14@users.noreply.github.com> Co-authored-by:
shaochangxu <85155497+shaochangxu@users.noreply.github.com> Co-authored-by:
shaochangxu.scx <shaochangxu.scx@antgroup.com> Co-authored-by:
Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by:
Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by:
sixgod <evethwillbeok@outlook.com> Co-authored-by:
Isotr0py <2037008807@qq.com> Co-authored-by:
Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by:
Rafael Vasquez <rafvasq21@gmail.com> Co-authored-by:
Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by:
Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by:
Akshat Tripathi <Akshat.tripathi6568@gmail.com> Co-authored-by:
Oleg Mosalov <oleg@krai.ai> Co-authored-by:
Jee Jee Li <pandaleefree@gmail.com> Co-authored-by:
Avshalom Manevich <12231371+avshalomman@users.noreply.github.com> Co-authored-by:
Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> Co-authored-by:
Yangcheng Li <liyangcheng.lyc@alibaba-inc.com> Co-authored-by:
Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com> Co-authored-by:
Concurrensee <yida.wu@amd.com> Co-authored-by:
Chenguang Li <757486878@qq.com> Co-authored-by:
youkaichao <youkaichao@gmail.com> Co-authored-by:
Alex Brooks <alex.brooks@ibm.com> Co-authored-by:
Chen Zhang <zhangch99@outlook.com> Co-authored-by:
Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by:
Shanshan Shen <467638484@qq.com> Co-authored-by:
elijah <30852919+e1ijah1@users.noreply.github.com> Co-authored-by:
Yikun Jiang <yikunkero@gmail.com> Co-authored-by:
Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Co-authored-by:
mgoin <michael@neuralmagic.com> Co-authored-by:
Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by:
Konrad Zawora <kzawora@habana.ai> Co-authored-by:
TJian <tunjian1996@gmail.com> Co-authored-by:
tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by:
wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by:
maang-h <55082429+maang-h@users.noreply.github.com> Co-authored-by:
Elfie Guo <164945471+elfiegg@users.noreply.github.com> Co-authored-by:
Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Co-authored-by:
Roger Wang <ywang@roblox.com>
-
Travis Johnson authored
Signed-off-by:
Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by:
Wallas Santos <wallashss@ibm.com> Co-authored-by:
Wallas Santos <wallashss@ibm.com>
-
Michael Goin authored
Signed-off-by:mgoin <michael@neuralmagic.com>
-
- 28 Jan, 2025 2 commits
-
-
Cyrus Leung authored
Signed-off-by:DarkLight1337 <tlleungac@connect.ust.hk>
-
Harry Mellor authored
Signed-off-by:Harry Mellor <19981378+hmellor@users.noreply.github.com>
-
- 27 Jan, 2025 3 commits
-
-
Bowen Wang authored
Signed-off-by:
Bowen Wang <abmfy@icloud.com> Signed-off-by:
youkaichao <youkaichao@gmail.com> Co-authored-by:
youkaichao <youkaichao@gmail.com>
-
Isotr0py authored
Signed-off-by:Isotr0py <2037008807@qq.com>
-
Cyrus Leung authored
Signed-off-by:DarkLight1337 <tlleungac@connect.ust.hk>
-
- 26 Jan, 2025 1 commit
-
-
Tyler Michael Smith authored
Signed-off-by:
Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by:
mgoin <michael@neuralmagic.com>
-
- 25 Jan, 2025 2 commits
-
-
Divakar Verma authored
Signed-off-by:Divakar Verma <divakar.verma@amd.com>
-
Cyrus Leung authored
Signed-off-by:DarkLight1337 <tlleungac@connect.ust.hk>
-
- 24 Jan, 2025 1 commit
-
-
Russell Bryant authored
Signed-off-by:Russell Bryant <rbryant@redhat.com>
-
- 23 Jan, 2025 4 commits
-
-
Dipika Sikka authored
[BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order (#11528) Signed-off-by:
ElizaWszola <eliza@neuralmagic.com> Co-authored-by:
ElizaWszola <eliza@neuralmagic.com> Co-authored-by:
Michael Goin <michael@neuralmagic.com>
-
Gregory Shtrasberg authored
Signed-off-by:
Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by:
Micah Williamson <micah.williamson@amd.com>
-
Isotr0py authored
Signed-off-by:Isotr0py <2037008807@qq.com>
-
rasmith authored
Signed-off-by:Randall Smith <Randall.Smith@amd.com>
-
- 22 Jan, 2025 7 commits
-
-
Jee Jee Li authored
Signed-off-by:Jee Jee Li <pandaleefree@gmail.com>
-
Cyrus Leung authored
Signed-off-by:DarkLight1337 <tlleungac@connect.ust.hk>
-
Roger Wang authored
Signed-off-by:Roger Wang <ywang@roblox.com>
-
zhou fan authored
Signed-off-by:xffxff <1247714429@qq.com>
-
Cyrus Leung authored
Signed-off-by:DarkLight1337 <tlleungac@connect.ust.hk>
-
Kevin H. Luu authored
Signed-off-by:kevin <kevin@anyscale.com>
-
Cyrus Leung authored
Signed-off-by:DarkLight1337 <tlleungac@connect.ust.hk>
-
- 21 Jan, 2025 3 commits
-
-
wangxiyuan authored
Signed-off-by:wangxiyuan <wangxiyuan1007@gmail.com>
-
Cyrus Leung authored
Signed-off-by:DarkLight1337 <tlleungac@connect.ust.hk>
-
Jee Jee Li authored
Signed-off-by:Jee Jee Li <pandaleefree@gmail.com>
-