Commits · cc74b2b232070f74d8765a5eefa49ae93ee45490 · OpenDAS / vllm_cscc

20 Apr, 2024 2 commits

Updating lm-format-enforcer version and adding links to decoding libraries in docs (#4222) · cc74b2b2
Noam Gat authored Apr 20, 2024

cc74b2b2

[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3

Cody Yu authored Apr 19, 2024

Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with `--quantization fp8` or `-q fp8` when launching an engine.

Algorithm:
The model checkpoint is still loaded in FP16/BF16. After the weights are loaded, `Fp8LinearMethod` calculates a per-tensor scaling factor for the weights and quantizes them accordingly. The scaling factor is then stored for later use. The per-tensor scaling factor for activations, by contrast, is recalculated in every forward pass.
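
The algorithm above can be illustrated with a small pure-Python sketch. It is not the code from this PR: `Fp8LinearSketch`, `per_tensor_scale`, and `round_to_e4m3` are hypothetical names, the e4m3fn rounding is simulated in software, and the sketch assumes the scale is chosen as `amax / 448` (448 being the largest finite e4m3fn value). The weight scale is computed once at load time, while the activation scale is recomputed on each call to `forward`.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the e4m3fn format


def round_to_e4m3(x: float) -> float:
    """Simulate rounding to the nearest e4m3fn value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    _, exponent = math.frexp(x)  # x = m * 2**exponent with 0.5 <= |m| < 1
    # 4 significant bits (1 implicit + 3 stored). Below 2**-6 the format is
    # subnormal, so the step stays fixed at 2**-9.
    step = 2.0 ** (max(exponent, -5) - 4)
    return round(x / step) * step


def per_tensor_scale(values):
    """One scale for the whole tensor (assumed here: amax / FP8 max)."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0


class Fp8LinearSketch:
    """Hypothetical stand-in for a linear layer with FP8 weights."""

    def __init__(self, weight):
        # weight is a list of rows: [out_features][in_features].
        # The weight scale is computed once and stored for later use.
        self.w_scale = per_tensor_scale([v for row in weight for v in row])
        self.w_q = [[round_to_e4m3(v / self.w_scale) for v in row]
                    for row in weight]

    def forward(self, x):
        # The activation scale is dynamic: recomputed on every forward pass.
        x_scale = per_tensor_scale(x)
        x_q = [round_to_e4m3(v / x_scale) for v in x]
        # Multiply in the quantized domain, then undo both scales.
        out_scale = x_scale * self.w_scale
        return [sum(a * b for a, b in zip(x_q, row)) * out_scale
                for row in self.w_q]


layer = Fp8LinearSketch([[0.5, -1.0], [2.0, 0.25]])
print(layer.forward([1.0, 3.0]))  # close to the exact result [-2.5, 2.75]
```

Because the activation scale depends on the input, the reduction over the activation tensor runs on every call; only the weight scale is amortized.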

Initial Results:
So far this has been tested with Mistral-7B on 1xH100. With a prompt length of ~5 and a decoding length of 128:

BF16: 1.47s
FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.

a22cdea3

18 Apr, 2024 1 commit
- [Bugfix] Get available quantization methods from quantization registry (#4098) · 53b018ed
  Michael Goin authored Apr 18, 2024
  
  53b018ed
12 Apr, 2024 1 commit
- [Doc] Add typing hints / mypy types cleanup (#3816) · c2b4a1bc
  Michael Feil authored Apr 11, 2024
```
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
```
  c2b4a1bc
11 Apr, 2024 3 commits
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
  a10d3056
- [Kernel] Fused MoE Config for Mixtral 8x22 (#4002) · c1dc5471
  Roger Wang authored Apr 11, 2024
  
  c1dc5471
- [Misc] Add indirection layer for custom ops (#3913) · e9da5a40
  Kunshang Ji authored Apr 11, 2024
  
  e9da5a40
10 Apr, 2024 3 commits
- [Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f
  youkaichao authored Apr 10, 2024
```
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
```
  63e7176f
- [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876) · 0258b7a9
  Travis Johnson authored Apr 10, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
```
  0258b7a9
- [Bugfix] Fix logits processor when prompt_logprobs is not None (#3899) · b3104b2a
  胡译文 authored Apr 10, 2024
  
  b3104b2a
03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

2ff767b5

29 Mar, 2024 1 commit
- [BugFix][Frontend] Fix completion logprobs=0 error (#3731) · f510395b
  Roy authored Mar 30, 2024
  
  f510395b
28 Mar, 2024 3 commits
- [Kernel] Add MoE Triton kernel configs for A100 40GB (#3700) · cb40b3ab
  Woosuk Kwon authored Mar 28, 2024
  
  cb40b3ab
- [Kernel] DBRX Triton MoE kernel H100 (#3692) · ce567a29
  Roger Wang authored Mar 28, 2024
  
  ce567a29
- [Kernel] Add Triton MoE kernel configs for DBRX on A100 (#3679) · 8267b06c
  Woosuk Kwon authored Mar 27, 2024
  
  8267b06c
25 Mar, 2024 6 commits
- Optimize `_get_ranks` in Sampler (#3623) · 3a243095
  Antoni Baum authored Mar 25, 2024
  
  3a243095
- feat: implement the min_tokens sampling parameter (#3124) · c13ad1b7
  Travis Johnson authored Mar 25, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
```
  c13ad1b7
- [Core] Adding token ranks along with logprobs (#3516) · 819924e7
  Swapnil Parekh authored Mar 25, 2024
```
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
```
  819924e7
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
  01bfb22b
- [Core] Refactor Attention Take 2 (#3462) · 925f3332
  Woosuk Kwon authored Mar 24, 2024
  
  925f3332
- [BugFix] tensor.get_device() -> tensor.device (#3604) · 6d93d353
  Kunshang Ji authored Mar 25, 2024
  
  6d93d353
22 Mar, 2024 1 commit
- [Hardware][Neuron] Refactor neuron support (#3471) · e90fc21f
  Zhuohan Li authored Mar 21, 2024
  
  e90fc21f
21 Mar, 2024 1 commit
- Fix 1D query issue from `_prune_hidden_states` (#3539) · 3bbff9e5
  SangBin Cho authored Mar 21, 2024
  
  3bbff9e5
20 Mar, 2024 3 commits
- Migrate `logits` computation and gather to `model_runner` (#3233) · f1c0fc39
  Roy authored Mar 21, 2024
  
  f1c0fc39
- [1/n][Chunked Prefill] Refactor input query shapes (#3236) · 6e435de7
  SangBin Cho authored Mar 21, 2024
  
  6e435de7
- [1/n] Triton sampling kernel (#3186) · 426ec4ec
  Antoni Baum authored Mar 20, 2024
```
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
```
  426ec4ec
14 Mar, 2024 2 commits
- fix marlin config repr (#3414) · b983ba35
  Enrique Shockwave authored Mar 14, 2024
  
  b983ba35
- [Kernel] change benchmark script so that result can be directly used; tune moe... · 8fe83865
  youkaichao authored Mar 14, 2024
```
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389)
```
  8fe83865
13 Mar, 2024 4 commits
- Fix lint (#3388) · c33afd89
  Antoni Baum authored Mar 13, 2024
  
  c33afd89
- Add batched RoPE kernel (#3095) · 7e9bd08f
  Terry authored Mar 13, 2024
  
  7e9bd08f
- [Minor] Fix bias in if to remove ambiguity (#3259) · ba8dc958
  Hui Liu authored Mar 13, 2024
  
  ba8dc958
- Add kernel for GeGLU with approximate GELU (#3337) · 602358f8
  Woosuk Kwon authored Mar 12, 2024
  
  602358f8
11 Mar, 2024 1 commit
- Re-enable the 80 char line width limit (#3305) · 2f8844ba
  Zhuohan Li authored Mar 10, 2024
  
  2f8844ba
09 Mar, 2024 2 commits
- [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) · 8437bae6
  Cade Daniel authored Mar 08, 2024
  
  8437bae6
- [FIX] Fix prefix test error on main (#3286) · f48c6791
  Zhuohan Li authored Mar 08, 2024
  
  f48c6791
08 Mar, 2024 1 commit
- [FIX] Make `flash_attn` optional (#3269) · 1cb0cc29
  Woosuk Kwon authored Mar 08, 2024
  
  1cb0cc29
07 Mar, 2024 1 commit
- Separate attention backends (#3005) · 2daf23ab
  Woosuk Kwon authored Mar 07, 2024
  
  2daf23ab
05 Mar, 2024 1 commit
- Store `eos_token_id` in `Sequence` for easy access (#3166) · 8999ec3c
  Nick Hill authored Mar 05, 2024
  
  8999ec3c
04 Mar, 2024 1 commit
- Push logprob generation to LLMEngine (#3065) · 22de4523
  Antoni Baum authored Mar 04, 2024
```
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
```
  22de4523
01 Mar, 2024 1 commit

Integrate Marlin Kernels for Int4 GPTQ inference (#2497) · c0c2335c

Robert Shaw authored Mar 01, 2024


Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>

c0c2335c