1. 02 Feb, 2025 1 commit
  2. 01 Feb, 2025 4 commits
    • Michael Goin's avatar
      3194039c
    • [Attention] Deepseek v3 MLA support with FP8 compute (#12601) · baeded25
      Lucas Wilkinson authored
      
      
      This PR implements DeepSeek V3 support by performing matrix absorption on the FP8 weights.
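
As a rough illustration of what "matrix absorption" means in an MLA setting, the sketch below folds a key up-projection into the query so that the attention score can be computed directly against the compressed latent vector. It is a toy, pure-Python example under stated assumptions: the names `w_uk`, `q`, and `c` and the helper functions are hypothetical, the values are small integers rather than FP8 tensors, and any dequantization or scale handling that FP8 weights would presumably require is omitted. It is not the code from this PR.

```python
def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def transpose(w):
    """Return the transpose of matrix w."""
    return [list(col) for col in zip(*w)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def naive_score(q, w_uk, c):
    """Up-project the latent c to a full key, then score it against q."""
    return dot(q, matvec(w_uk, c))


def absorb(q, w_uk):
    """Fold the up-projection into the query: q' = w_uk^T q."""
    return matvec(transpose(w_uk), q)


def absorbed_score(q, w_uk, c):
    """Score the absorbed query directly against the latent c."""
    return dot(absorb(q, w_uk), c)


# Hypothetical toy values: latent dim 2, key dim 3.
w_uk = [[1, 2], [0, 1], [3, -1]]  # key up-projection (assumed name)
q = [2, -1, 1]                    # query in key space
c = [4, 5]                        # cached latent vector

# Both paths give the same score because q . (W c) == (W^T q) . c.
assert naive_score(q, w_uk, c) == absorbed_score(q, w_uk, c)
```

The point of the identity is that the up-projected key never has to be materialized; only the smaller latent vector is read at attention time.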
      
      ---------
      Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
      Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
      Co-authored-by: simon-mo <simon.mo@hey.com>
      Co-authored-by: Michael Goin <mgoin64@gmail.com>
      Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
      Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
      Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
    • Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517) · 3e1c76cf
      Rahul Tuli authored
      This PR addresses a bug in the Cutlass integration where the
      `sparsity_config.ignore` list was not being respected. When only a
      subset of modules were configured as Sparse24, the system incorrectly
      selected Cutlass for non-sparse modules as well. This update ensures the
      correct scheme is selected for non-sparse modules, fixing this behavior.
      
      ---
      
      ### Changes
      
      - Updated logic to correctly respect `sparsity_config.ignore`.
      - Ensured non-sparse modules use the appropriate scheme instead of
      defaulting to Cutlass.
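
The selection behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `select_scheme` and its parameters are hypothetical, and matching against the ignore list by exact layer name is a simplifying assumption. The scheme names are taken from the example output in this description.

```python
from typing import Optional, Sequence

SPARSE_24 = "CompressedTensors24"


def select_scheme(
    layer_name: str,
    sparsity_ignore: Sequence[str],
    dense_scheme: Optional[str],
) -> Optional[str]:
    """Pick a scheme name for one layer (hypothetical helper).

    Layers listed in the sparsity ignore list are not 2:4 sparse, so they
    fall back to the scheme chosen for dense layers (which may be None for
    an unquantized layer). All other layers use the Sparse24 scheme.
    """
    if layer_name in sparsity_ignore:
        return dense_scheme
    return SPARSE_24


# Assumed example: layer 6 is excluded from sparsity and quantized to FP8.
ignore = ["model.layers.6.mlp.down_proj"]
print(select_scheme("model.layers.0.mlp.down_proj", ignore, "CompressedTensorsW8A8Fp8"))
print(select_scheme("model.layers.6.mlp.down_proj", ignore, "CompressedTensorsW8A8Fp8"))
```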
      
      ---
      
      <details>
      <summary>Testing Setup</summary>
      
      The fix has been tested on top of [this
      diff](https://github.com/vllm-project/vllm/pull/12097).
      
      #### Steps to Test:
      ```bash
      git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
      git revert --no-edit aa2cd2c4 # revert Tyler's commit to turn off Cutlass for W16A16
      git cherry-pick ca624cddb # this branch
      ```
      
      #### Additional Patch Required:
      ```diff
      diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
      index a54177c1c..f916dd0c9 100644
      --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
      +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
      @@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
                                                    QuantizationStrategy,
                                                    QuantizationType)
       from pydantic import BaseModel
      -
      +from vllm.logger import init_logger
       from vllm.model_executor.layers.fused_moe import FusedMoE
       from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                                      UnquantizedLinearMethod)
      @@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
           should_ignore_layer)
       from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
       from vllm.platforms import current_platform
      -
      +logger = init_logger(__name__)
       __all__ = ["CompressedTensorsLinearMethod"]
       
       SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
      ```
      
      Apply using:
      ```bash
      git apply logging-patch.patch
      ```
      
      </details>
      
      ---
      
      <details>
      <summary>Models Tested</summary>
      
      - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24`
      - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
      - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
      - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`
      
      </details>
      
      ---
      
      
      <details>
      <summary>Example Output</summary>
      
      #### Layers 0-5 (Sparse24)
      ```
      Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
      Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
      Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
      Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
      ...
      ```
      
      #### Layers 6+ (Non-Sparse, FP8)
      ```
      Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
      Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
      Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
      Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
      ...
      ```
      
      </details>
      
      **Note:** This fix assumes that all modules in a fused layer, such as
      `qkv_proj` or `gate_up_proj`, follow the same quantization/pruning scheme.
      
      ---
      
      For related tasks using the Asana app for GitHub, refer to
      [this link](https://app.asana.com/0/0/1209227810815160).
      
      Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
    • Fix target matching for fused layers with compressed-tensors (#12617) · 1867c258
      Eldar Kurtic authored
      Without this PR
      ---------------
      Quantizing models with llm-compressor and a recipe that explicitly lists
      names of layers produces a model that is not loadable by vLLM (i.e.
      `vllm serve <model>` fails with `raise ValueError(f"Unable to find
      matching target for {module} in the ...`).
      
      Example recipe:
      ```python
      recipe = """
      quantization_stage:
        run_type: oneshot
        quantization_modifiers:
          GPTQModifier:
            ignore: ["lm_head"]
            config_groups:
              group_0:
                weights:
                  num_bits: 4
                  type: "int"
                  symmetric: true
                  strategy: "group"
                  group_size: 128
                targets: [
                  "model.layers.0.mlp.down_proj",
                  "model.layers.2.mlp.down_proj",
                  "model.layers.3.mlp.down_proj",
                  "model.layers.4.mlp.down_proj",
                  "model.layers.5.mlp.down_proj",
                  "model.layers.6.mlp.down_proj",
                  "model.layers.7.mlp.down_proj",
                  "model.layers.8.mlp.down_proj",
                  "model.layers.9.mlp.down_proj",
                  "model.layers.10.mlp.down_proj",
                  "model.layers.11.mlp.down_proj",
                  "model.layers.12.mlp.down_proj",
                  "model.layers.13.mlp.down_proj",
                  "model.layers.14.mlp.down_proj",
                  "model.layers.15.mlp.down_proj",
                  "model.layers.16.mlp.down_proj",
                  "model.layers.17.mlp.down_proj",
                  "model.layers.19.mlp.down_proj",
                  "model.layers.21.mlp.down_proj",
                  "model.layers.22.mlp.down_proj",
                  .
                  .
                  .
                ]
      """
      ```
      
      To reproduce the vLLM error: 
      ```bash
      vllm serve nm-testing/eldar-test
      ```
      
      With this PR
      ------------
      Models are loaded correctly without any errors.
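
One way target matching for fused layers could work is sketched below. This is an illustration under stated assumptions, not the code from this PR: `FUSED_PARTS` and `match_target` are hypothetical names, and the rule shown (a fused layer matches only when every one of its unfused component names appears in the targets) is an assumed policy.

```python
from typing import Iterable, Optional

# Hypothetical mapping from a fused layer suffix to its unfused parts.
FUSED_PARTS = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def match_target(layer_name: str, targets: Iterable[str]) -> Optional[str]:
    """Return the target that covers layer_name, or None if there is none."""
    targets = list(targets)

    # Direct match: the layer is listed by its own name.
    if layer_name in targets:
        return layer_name

    # Fused match: rebuild the unfused names and require all of them.
    prefix, _, suffix = layer_name.rpartition(".")
    parts = FUSED_PARTS.get(suffix)
    if parts is None:
        return None
    unfused = [f"{prefix}.{p}" if prefix else p for p in parts]
    if all(name in targets for name in unfused):
        return unfused[0]
    return None


# Assumed example targets, written the way a recipe would list them.
example_targets = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.down_proj",
]
print(match_target("model.layers.0.self_attn.qkv_proj", example_targets))
```

A caller can treat a `None` result as "no scheme applies to this layer" rather than raising, which is one possible way to avoid the load-time error described above.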
  3. 31 Jan, 2025 6 commits
  4. 30 Jan, 2025 1 commit
  5. 29 Jan, 2025 5 commits
  6. 28 Jan, 2025 2 commits
  7. 27 Jan, 2025 3 commits
  8. 26 Jan, 2025 1 commit
  9. 25 Jan, 2025 2 commits
  10. 24 Jan, 2025 1 commit
  11. 23 Jan, 2025 4 commits
  12. 22 Jan, 2025 7 commits
  13. 21 Jan, 2025 3 commits