Commits · f8e7adda21810104382bdf3febe3ea02c72f7348 · OpenDAS / vllm_cscc

03 May, 2024 1 commit
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
02 May, 2024 1 commit
- [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) · 7038e8b8
  alexm-nm authored May 02, 2024
  
  7038e8b8
29 Apr, 2024 1 commit
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) · 73c8d677
  Robert Shaw authored Apr 29, 2024
```
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  73c8d677
27 Apr, 2024 2 commits
- [Kernel] Full Tensor Parallelism for LoRA Layers (#3524) · eefeb164
  Austin Veselka authored Apr 27, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  eefeb164
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  12628d3c
24 Apr, 2024 3 commits

[Bugfix] Fix marlin kernel crash on H100 (#4218) · aae08249

alexm-nm authored Apr 24, 2024

This PR fixes the Marlin kernel crash on H100 that was reported in neuralmagic#187.
The crash was caused by inline PTX assembly that issued async_copy with streaming behavior. The fix is to use the more standard PTX form of async_copy, without the fractional L2 policy for "evict_first". There is no performance difference between the standard async_copy PTX and the previous version.

aae08249

[Misc] Reduce supported Punica dtypes (#4304) · 468d761b
Woosuk Kwon authored Apr 23, 2024

468d761b

[Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0

Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users need neither compute activation scales on a calibration dataset nor convert their model checkpoints. Specifying the `quantization="fp8"` argument is enough. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
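
For readers unfamiliar with the term, the idea behind dynamic per-tensor scaling can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the kernel code from this PR: the helper names (`dynamic_per_tensor_scale`, `scale_to_fp8_range`, `dequantize`) are hypothetical, the tensor is modeled as a flat list of floats, and rounding to the actual FP8 grid is left out. The only hard fact used is that the largest finite value in the e4m3fn format is 448.

```python
# Hypothetical illustration of dynamic per-tensor scaling; not vLLM's kernel code.
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def dynamic_per_tensor_scale(values):
    """Derive a single scale for the whole tensor from its current contents."""
    amax = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero (or empty) tensor so later divisions are safe.
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0


def scale_to_fp8_range(values):
    """Map values into [-448, 448]; rounding to the FP8 grid is omitted here."""
    scale = dynamic_per_tensor_scale(values)
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return scaled, scale


def dequantize(scaled, scale):
    """Recover approximate original values by multiplying the scale back in."""
    return [q * scale for q in scaled]


activations = [0.02, -1.5, 3.0, 0.75]
scaled, scale = scale_to_fp8_range(activations)
restored = dequantize(scaled, scale)
```

Because the scale is computed from the tensor at run time, no calibration pass is needed, which is the property the description above highlights.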

**Performance**: This PR focuses on keeping the code clean while still achieving reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954) will be submitted in a follow-up PR. With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```
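
To make the comparison concrete, the per-group gap between the two tables above can be computed directly. The snippet below is a small sketch that only reuses the accuracy values reported in the tables; the dictionary and variable names are illustrative.

```python
# Accuracy values copied from the two MMLU tables above.
fp8 = {
    "mmlu": 0.7018,
    "humanities": 0.6472,
    "other": 0.7673,
    "social_sciences": 0.8099,
    "stem": 0.6131,
}
fp16 = {
    "mmlu": 0.7020,
    "humanities": 0.6425,
    "other": 0.7744,
    "social_sciences": 0.8131,
    "stem": 0.6108,
}

# Positive delta: fp8 scored higher than fp16 for that group.
deltas = {group: round(fp8[group] - fp16[group], 4) for group in fp8}
largest_gap = max(abs(delta) for delta in deltas.values())

for group, delta in deltas.items():
    print(f"{group:16s} {delta:+.4f}")
print(f"largest absolute gap: {largest_gap:.4f}")
```

With these numbers the overall MMLU accuracy differs by 0.0002, and the largest per-group gap is 0.0071 (the "other" group).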

Happy hacking!

eace8bf0

23 Apr, 2024 1 commit
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  2b7949c1
17 Apr, 2024 1 commit
- [Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) · a5322254
  Shoichi Uchinami authored Apr 18, 2024
  
  a5322254
13 Apr, 2024 1 commit
- [Kernel] Add punica dimension for Baichuan-13B (#4053) · 989ae253
  Jee Li authored Apr 13, 2024
  
  989ae253
11 Apr, 2024 3 commits
- Add extra punica sizes to support bigger vocabs (#4015) · 1e96c334
  Antoni Baum authored Apr 11, 2024
  
  1e96c334
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
  a10d3056
- punica fix-bgmv-kernel-640 (#4007) · 08ccee1e
  fuchen.ljl authored Apr 11, 2024
  
  08ccee1e
08 Apr, 2024 1 commit
- [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations (#3782) · 59a6abf3
  Matt Wong authored Apr 08, 2024
  
  59a6abf3
04 Apr, 2024 1 commit
- [Bugfix] Add kv_scale input parameter to CPU backend (#3840) · 498eb5cf
  Woosuk Kwon authored Apr 03, 2024
  
  498eb5cf
03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

2ff767b5

02 Apr, 2024 1 commit

[Hardware][Intel] Add CPU inference backend (#3634) · 0e3f06fe

bigPYJ1151 authored Apr 02, 2024


Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>

0e3f06fe

30 Mar, 2024 1 commit
- [Kernel] Layernorm performance optimization (#3662) · b6d10354
  mawong-amd authored Mar 30, 2024
  
  b6d10354
27 Mar, 2024 1 commit
- [Kernel] support non-zero cuda devices in punica kernels (#3636) · 566b57c5
  Jee Li authored Mar 27, 2024
  
  566b57c5
26 Mar, 2024 1 commit
- Enable more models to inference based on LoRA (#3382) · 8af890a8
  Jee Li authored Mar 26, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  8af890a8
22 Mar, 2024 1 commit
- [BugFix] Some fixes for custom allreduce kernels (#2760) · f721096d
  Hanzhi Zhou authored Mar 21, 2024
  
  f721096d
18 Mar, 2024 1 commit
- [Bugfix] Make moe_align_block_size AMD-compatible (#3470) · 9101d832
  Woosuk Kwon authored Mar 18, 2024
  
  9101d832
16 Mar, 2024 1 commit
- [Misc] fix line length for entire codebase (#3444) · 8e67598a
  Simon Mo authored Mar 16, 2024
  
  8e67598a
15 Mar, 2024 1 commit
- Dynamically configure shared memory size for moe_align_block_size_kernel (#3376) · 78b6c484
  akhoroshev authored Mar 15, 2024
  
  78b6c484
13 Mar, 2024 3 commits
- Add batched RoPE kernel (#3095) · 7e9bd08f
  Terry authored Mar 13, 2024
  
  7e9bd08f
- Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when... · ae0ccb40
  Or Sharir authored Mar 13, 2024
```
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350)
```
  ae0ccb40
- Add kernel for GeGLU with approximate GELU (#3337) · 602358f8
  Woosuk Kwon authored Mar 12, 2024
  
  602358f8
11 Mar, 2024 1 commit
- [ROCm] Fix warp and lane calculation in blockReduceSum (#3321) · c9415c19
  kliuae authored Mar 12, 2024
  
  c9415c19
10 Mar, 2024 2 commits
- [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) · e4a28e53
  Douglas Lehr authored Mar 10, 2024
  
  e4a28e53
- Enhance lora tests with more layer and rank variations (#3243) · 0bba88df
  Terry authored Mar 09, 2024
  
  0bba88df
08 Mar, 2024 1 commit
- Feature add lora support for Qwen2 (#3177) · c59e120c
  whyiug authored Mar 08, 2024
  
  c59e120c
01 Mar, 2024 1 commit

Integrate Marlin Kernels for Int4 GPTQ inference (#2497) · c0c2335c

Robert Shaw authored Mar 01, 2024


Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>

c0c2335c

29 Feb, 2024 1 commit
- Add Support for 2/3/8-bit GPTQ Quantization Models (#2330) · 01a5d18a
  CHU Tianxiang authored Feb 29, 2024
  
  01a5d18a
28 Feb, 2024 1 commit
- Add LoRA support for Gemma (#3050) · 929b4f29
  Woosuk Kwon authored Feb 28, 2024
  
  929b4f29
26 Feb, 2024 1 commit
- [Minor] Remove gather_cached_kv kernel (#3043) · d6e4a130
  Woosuk Kwon authored Feb 26, 2024
  
  d6e4a130
22 Feb, 2024 1 commit
- Optimize GeGLU layer in Gemma (#2975) · fd5dcc5c
  Woosuk Kwon authored Feb 21, 2024
  
  fd5dcc5c
12 Feb, 2024 1 commit
- Refactor 2 awq gemm kernels into m16nXk32 (#2723) · 56383649
  Rex authored Feb 12, 2024
```
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
```
  56383649
06 Feb, 2024 1 commit
- Add fused top-K softmax kernel for MoE (#2769) · f0d4e145
  Woosuk Kwon authored Feb 05, 2024
  
  f0d4e145
01 Feb, 2024 1 commit
- Fix compile error when using rocm (#2648) · 923797fe
  zhaoyang-star authored Feb 02, 2024
  
  923797fe