Commits · c83310174055bb124ea2197885b652efd59b7a0f · OpenDAS / vllm_cscc

10 May, 2024 1 commit
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  
  c8331017
09 May, 2024 2 commits
- [ROCm] Add support for Punica kernels on AMD GPUs (#3140) · ff5abcd7
  kliuae authored May 10, 2024
```
Co-authored-by: miloice <jeffaw99@hotmail.com>
```
  ff5abcd7
- [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (#4626) · e288df06
  alexm-nm authored May 08, 2024
  
  e288df06
08 May, 2024 1 commit
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
07 May, 2024 2 commits

[Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
youkaichao authored May 06, 2024

63575bc2

[Kernel] Make static FP8 scaling more robust (#4570) · a98187cf

Philipp Moritz authored May 06, 2024

Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors during computation. However, this is not always the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|

With the fix in this PR, where the scaled activations are clamped to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|

This is not perfect yet, but it is getting very close to the FP16 / dynamic activation scale performance.
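
The clamping step described above can be illustrated with a small pure-Python sketch. It is not the kernel code from this PR: the function name `scale_and_clamp` is hypothetical, and it assumes the activation is divided by the static scale before the clamp. The constant 448.0 is the largest finite value of `float8_e4m3fn`; the cast to FP8 itself is left out.

```python
# Largest finite value representable in float8_e4m3fn, i.e. the value of
# std::numeric_limits<c10::Float8_e4m3fn>::max().
FP8_E4M3_MAX = 448.0


def scale_and_clamp(activations, scale):
    """Divide by a static scale, then clamp into the finite FP8 range.

    Hypothetical helper for illustration only. The final cast to FP8 is
    omitted; only the clamp that guards against out-of-range values is shown.
    """
    clamped = []
    for x in activations:
        scaled = x / scale  # assumption: scaling is a division by the scale
        clamped.append(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, scaled)))
    return clamped


# A scale that underestimates the activation maxima leaves values outside
# the representable range; the clamp saturates them instead.
print(scale_and_clamp([0.5, 600.0, -900.0], scale=1.0))
```

Without the clamp, the out-of-range entries would reach the FP8 conversion unchanged, which is the situation the PR description associates with NaNs.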

a98187cf

03 May, 2024 2 commits
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
  43c413ec
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
02 May, 2024 1 commit
- [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) · 7038e8b8
  alexm-nm authored May 02, 2024
  
  7038e8b8
29 Apr, 2024 1 commit
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) · 73c8d677
  Robert Shaw authored Apr 29, 2024
```
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  73c8d677
27 Apr, 2024 2 commits
- [Kernel] Full Tensor Parallelism for LoRA Layers (#3524) · eefeb164
  Austin Veselka authored Apr 27, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  eefeb164
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  12628d3c
24 Apr, 2024 3 commits

[Bugfix] Fix marlin kernel crash on H100 (#4218) · aae08249

alexm-nm authored Apr 24, 2024

This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The crash was caused by the inline PTX assembly that introduced async_copy with streaming behavior. The fix is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between the standard async_copy PTX and the previous one.

aae08249

[Misc] Reduce supported Punica dtypes (#4304) · 468d761b
Woosuk Kwon authored Apr 23, 2024

468d761b

[Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0

Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
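
As a rough illustration of what dynamic per-tensor scaling means, the sketch below derives one scale from the largest absolute value in a tensor so that the scaled values fit the FP8 range. This is one common formulation and an assumption here, not the PR's kernel; the helper names are hypothetical and plain Python lists stand in for tensors.

```python
# Largest finite value representable in float8_e4m3fn.
FP8_E4M3_MAX = 448.0


def dynamic_per_tensor_scale(values):
    """Return a single scale computed from the tensor's absolute maximum.

    Hypothetical helper: an all-zero (or empty) input falls back to a
    scale of 1.0 to avoid dividing by zero later.
    """
    amax = max((abs(v) for v in values), default=0.0)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_to_fp8_range(values):
    """Scale values so their largest magnitude maps to roughly FP8_E4M3_MAX."""
    scale = dynamic_per_tensor_scale(values)
    return [v / scale for v in values], scale


scaled, scale = scale_to_fp8_range([0.5, -2.0, 1.0])
print(scale, scaled)
```

Because the scale is recomputed from the data at run time, no calibration dataset or converted checkpoint is needed, which matches the motivation given in the description.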

**Performance**: For this PR, the focus is on keeping the code clean while still getting reasonable performance. There are a number of optimizations that we will submit in a follow-up PR that significantly improve the performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!

eace8bf0

23 Apr, 2024 1 commit
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  2b7949c1
17 Apr, 2024 1 commit
- [Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) · a5322254
  Shoichi Uchinami authored Apr 18, 2024
  
  a5322254
13 Apr, 2024 1 commit
- [Kernel] Add punica dimension for Baichuan-13B (#4053) · 989ae253
  Jee Li authored Apr 13, 2024
  
  989ae253
11 Apr, 2024 3 commits
- Add extra punica sizes to support bigger vocabs (#4015) · 1e96c334
  Antoni Baum authored Apr 11, 2024
  
  1e96c334
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
  a10d3056
- punica fix-bgmv-kernel-640 (#4007) · 08ccee1e
  fuchen.ljl authored Apr 11, 2024
  
  08ccee1e
08 Apr, 2024 1 commit
- [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations (#3782) · 59a6abf3
  Matt Wong authored Apr 08, 2024
  
  59a6abf3
04 Apr, 2024 1 commit
- [Bugfix] Add kv_scale input parameter to CPU backend (#3840) · 498eb5cf
  Woosuk Kwon authored Apr 03, 2024
  
  498eb5cf
03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

2ff767b5

02 Apr, 2024 1 commit

[Hardware][Intel] Add CPU inference backend (#3634) · 0e3f06fe

bigPYJ1151 authored Apr 02, 2024


Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>

0e3f06fe

30 Mar, 2024 1 commit
- [Kernel] Layernorm performance optimization (#3662) · b6d10354
  mawong-amd authored Mar 30, 2024
  
  b6d10354
27 Mar, 2024 1 commit
- [Kernel] support non-zero cuda devices in punica kernels (#3636) · 566b57c5
  Jee Li authored Mar 27, 2024
  
  566b57c5
26 Mar, 2024 1 commit
- Enable more models to inference based on LoRA (#3382) · 8af890a8
  Jee Li authored Mar 26, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  8af890a8
22 Mar, 2024 1 commit
- [BugFix] Some fixes for custom allreduce kernels (#2760) · f721096d
  Hanzhi Zhou authored Mar 21, 2024
  
  f721096d
18 Mar, 2024 1 commit
- [Bugfix] Make moe_align_block_size AMD-compatible (#3470) · 9101d832
  Woosuk Kwon authored Mar 18, 2024
  
  9101d832
16 Mar, 2024 1 commit
- [Misc] fix line length for entire codebase (#3444) · 8e67598a
  Simon Mo authored Mar 16, 2024
  
  8e67598a
15 Mar, 2024 1 commit
- Dynamically configure shared memory size for moe_align_block_size_kernel (#3376) · 78b6c484
  akhoroshev authored Mar 15, 2024
  
  78b6c484
13 Mar, 2024 3 commits
- Add batched RoPE kernel (#3095) · 7e9bd08f
  Terry authored Mar 13, 2024
  
  7e9bd08f
- Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when... · ae0ccb40
  Or Sharir authored Mar 13, 2024
```
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350)
```
  ae0ccb40
- Add kernel for GeGLU with approximate GELU (#3337) · 602358f8
  Woosuk Kwon authored Mar 12, 2024
  
  602358f8
11 Mar, 2024 1 commit
- [ROCm] Fix warp and lane calculation in blockReduceSum (#3321) · c9415c19
  kliuae authored Mar 12, 2024
  
  c9415c19
10 Mar, 2024 2 commits
- [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) · e4a28e53
  Douglas Lehr authored Mar 10, 2024
  
  e4a28e53
- Enhance lora tests with more layer and rank variations (#3243) · 0bba88df
  Terry authored Mar 09, 2024
  
  0bba88df
08 Mar, 2024 1 commit
- Feature add lora support for Qwen2 (#3177) · c59e120c
  whyiug authored Mar 08, 2024
  
  c59e120c
01 Mar, 2024 1 commit

Integrate Marlin Kernels for Int4 GPTQ inference (#2497) · c0c2335c

Robert Shaw authored Mar 01, 2024


Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>

c0c2335c