Commits · a9944aabfa0eb0f133cf869b3ed5defb44ed7d33 · OpenDAS / vllm_cscc

07 May, 2025 2 commits
- [Quantization] Quark MXFP4 format loading (#16943) · db593aa6
  Bowen Bao authored May 07, 2025
- [Misc] Split model loader (#17712) · 822de7fb
  Jee Jee Li authored May 07, 2025
```
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
```
24 Apr, 2025 1 commit
- More informative error when using Transformers backend (#16988) · 2c8ed8ee
  Harry Mellor authored Apr 24, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
23 Apr, 2025 1 commit
- Improve Transformers backend model loading QoL (#17039) · 8e630d68
  Harry Mellor authored Apr 23, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
01 Apr, 2025 1 commit
- Rename fallback model and refactor supported models section (#15829) · a76f547e
  Harry Mellor authored Apr 01, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
31 Mar, 2025 1 commit
- Fix Transformers backend compatibility check (#15290) · d4bfc23e
  Harry Mellor authored Mar 31, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
26 Mar, 2025 1 commit

Separate base model from `TransformersModel` (#15467) · cf5c8f16

Harry Mellor authored Mar 26, 2025


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

24 Mar, 2025 1 commit
- [Bugfix][V1] Avoid importing PreTrainedModel (#15366) · 948ab03e
  ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 authored Mar 24, 2025
```
Signed-off-by: Hollow Man <hollowman@opensuse.org>
```
15 Mar, 2025 1 commit

[V1] V1 Enablement Oracle (#13726) · d4d93db2

Robert Shaw authored Mar 15, 2025


Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

18 Feb, 2025 1 commit
- [Bugfix] Fix failing transformers dynamic module resolving with spawn multiproc method (#13403) · 8cf97f86
  Isotr0py authored Feb 18, 2025
```
Signed-off-by: Isotr0py <2037008807@qq.com>
```
05 Feb, 2025 1 commit

[Model][Quant] Fix GLM, Fix fused module mappings for quantization (#12634) · 7ff7a638

Kyle Sayers authored Feb 05, 2025


Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

04 Feb, 2025 1 commit
- [Misc] Add BNB quantization for Whisper (#12381) · 96b23621
  Jee Jee Li authored Feb 04, 2025
```
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
```
03 Feb, 2025 1 commit

[Model]: Add `transformers` backend support (#11330) · a1a2aaad

Arthur authored Feb 03, 2025

# Adds support for `transformers` as a backend

Following https://github.com/huggingface/transformers/pull/35235, a
bunch of models should already be supported, and we are ramping up support
for more models.

Thanks @Isotr0py for the TP support, and @hmellor for his help as well!

This includes:
- `trust_remote_code=True` support: any model on the hub that
  implements attention the correct way can be natively supported!
- tensor parallel support

---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

02 Feb, 2025 1 commit

[Misc] Add SPDX-License-Identifier headers to python source files (#12628) · e489ad7a

Russell Bryant authored Feb 02, 2025

- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files

    This commit adds SPDX license headers to python source files as
    recommended to the project by the Linux Foundation. These headers
    provide a concise way that is both human and machine readable for
    communicating license information for each source file. It helps
    avoid any ambiguity about the license of the code and can also be
    easily used by tools to help manage license compliance.

    The Linux Foundation runs license scans against the codebase to help
    ensure we are in compliance with the licenses of the code we use,
    including dependencies. Having these headers in place helps that tool
    do its job.

    More information can be found on the SPDX site:

    - https://spdx.dev/learn/handling-license-info/

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------
Signed-off-by: Russell Bryant <rbryant@redhat.com>
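
To illustrate the kind of check the second commit describes, here is a minimal sketch of an SPDX header test. It is not the project's actual pre-commit hook: the function names, the `max_lines` window, and the choice to skip empty files are all assumptions made for illustration.

```python
# Hypothetical sketch of an SPDX header check; not the hook added in this commit.
SPDX_PREFIX = "# SPDX-License-Identifier:"  # e.g. "# SPDX-License-Identifier: Apache-2.0"


def has_spdx_header(source, max_lines=5):
    """Return True if an SPDX identifier comment appears in the first few lines."""
    for line in source.splitlines()[:max_lines]:
        if line.strip().startswith(SPDX_PREFIX):
            return True
    return False


def files_missing_header(sources):
    """Given a mapping of path -> file contents, list the paths lacking a header.

    Empty files are skipped here; that is an assumption of this sketch.
    """
    return sorted(
        path
        for path, text in sources.items()
        if text.strip() and not has_spdx_header(text)
    )
```

A pre-commit style wrapper would read each staged Python file, pass the contents to `files_missing_header`, and exit non-zero if the returned list is non-empty.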

14 Jan, 2025 1 commit
- [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (#11924) · a3a3ee4e
  Jee Jee Li authored Jan 15, 2025
```
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
```
24 Dec, 2024 1 commit
- [Model] Automatic conversion of classification and reward models (#11469) · 3f3e92e1
  Cyrus Leung authored Dec 25, 2024
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
11 Dec, 2024 1 commit
- [Misc] Split up pooling tasks (#10820) · 8f10d5e3
  Cyrus Leung authored Dec 11, 2024
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
07 Dec, 2024 1 commit
- [Model] Composite weight loading for multimodal Qwen2 (#10944) · bf0e382e
  Cyrus Leung authored Dec 07, 2024
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
01 Dec, 2024 1 commit
- [Model] Replace embedding models with pooling adapter (#10769) · 13370712
  Cyrus Leung authored Dec 01, 2024
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
04 Oct, 2024 1 commit
- [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973) · 05d68643
  ElizaWszola authored Oct 04, 2024
```
Co-authored-by: Dipika <dipikasikka1@gmail.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
```
16 Sep, 2024 1 commit
- [Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032) · a091e2da
  ElizaWszola authored Sep 16, 2024
```
Co-authored-by: Dipika <dipikasikka1@gmail.com>
```
10 Sep, 2024 1 commit
- [Misc] Fused MoE Marlin support for GPTQ (#8217) · 6cd5e5b0
  Dipika Sikka authored Sep 09, 2024
27 Aug, 2024 1 commit
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) · fc911880
  Dipika Sikka authored Aug 27, 2024
```
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
```
22 Aug, 2024 1 commit
- Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) · aae74ef9
  Michael Goin authored Aug 21, 2024
21 Aug, 2024 1 commit
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527) · 8678a69a
  Dipika Sikka authored Aug 21, 2024
```
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
```
06 Aug, 2024 1 commit
- [Model] Support SigLIP encoder and alternative decoders for LLaVA models (#7153) · 1f26efbb
  Cyrus Leung authored Aug 06, 2024
```
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
```
24 Apr, 2024 1 commit

[Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0

Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
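
For intuition, the idea behind dynamic per-tensor scaling can be sketched in plain Python. This is not the kernel code from this PR: the function names are hypothetical, the `float8 e4m3fn` range (maximum magnitude 448) is assumed as the target format, and rounding to the FP8 grid is deliberately left out. The point is only that the scale comes from the tensor's current maximum magnitude at run time, so no calibration dataset is needed.

```python
# Illustrative sketch only; not the implementation from this PR.
FP8_E4M3_MAX = 448.0  # assumed target: largest finite magnitude in float8 e4m3fn


def dynamic_per_tensor_scale(values):
    """Return one scale for the whole tensor, derived from its current max magnitude."""
    amax = max((abs(v) for v in values), default=0.0)
    # Fall back to 1.0 for an all-zero (or empty) tensor to avoid dividing by zero.
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0


def scale_into_fp8_range(values):
    """Divide by the scale and clamp to the FP8 range (grid rounding is omitted)."""
    scale = dynamic_per_tensor_scale(values)
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return scaled, scale


def dequantize(scaled, scale):
    """Multiply back by the scale to recover values in the original range."""
    return [s * scale for s in scaled]


activations = [0.5, -2.0, 1.25]
scaled, scale = scale_into_fp8_range(activations)
restored = dequantize(scaled, scale)
```

Because the scale is recomputed from each tensor as it is seen, the largest element always maps to the edge of the representable range, which is what removes the need for precomputed activation scales.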

**Performance**: For this PR, the focus is on making the code clean (while still trying to get reasonable performance). There are a bunch of optimizations that we will submit as a follow-up PR that significantly improve the performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!

16 Apr, 2024 1 commit
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024