Commits · a22cdea371bb26b4bdba112d4602736b48ca4a3a · OpenDAS / vllm_cscc

20 Apr, 2024 2 commits

[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3

Cody Yu authored Apr 19, 2024

Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
The model checkpoint is still loaded in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates a per-tensor scaling factor for the weights, quantizes them accordingly, and stores the scaling factor for later use. The per-tensor scaling factor for activations, by contrast, is recalculated in every forward pass.
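
The scheme described above can be illustrated with a small pure-Python simulation. This is only a sketch and not the code from this PR: the actual implementation runs on GPU tensors, and the helper names below (`round_to_e4m3`, `per_tensor_scale`, `quantize`, `fp8_dot`) are hypothetical. It also assumes the e4m3fn FP8 format, whose largest finite value is 448; the description above does not say which FP8 format the linear layers use.

```python
import math

# Assumption: e4m3fn format (4 exponent bits, 3 mantissa bits, max finite value 448).
FP8_E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Simulate rounding a float to the nearest e4m3fn-representable value."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_E4M3_MAX)      # saturate instead of overflowing
    _, exp = math.frexp(a)             # a = m * 2**exp with 0.5 <= m < 1
    exp = max(exp, -5)                 # below 2**-6, use the fixed subnormal step
    step = math.ldexp(1.0, exp - 4)    # spacing between neighbours (3 mantissa bits)
    return sign * min(round(a / step) * step, FP8_E4M3_MAX)


def per_tensor_scale(values):
    """One scale for the whole tensor, mapping its largest magnitude to FP8 max."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale, then round each element to the simulated FP8 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def fp8_dot(x, q_w, w_scale):
    """Dot product with pre-quantized weights and dynamically quantized activations."""
    x_scale = per_tensor_scale(x)      # dynamic: recomputed on every call
    q_x = quantize(x, x_scale)
    acc = sum(a * b for a, b in zip(q_x, q_w))
    return acc * x_scale * w_scale     # undo both scales on the accumulated result


# Weights are quantized once after loading; the scale is kept for later use.
weights = [0.5, -1.0, 0.25, 2.0]
w_scale = per_tensor_scale(weights)
q_weights = quantize(weights, w_scale)

# Activations get a fresh scale in every "forward pass".
activations = [1.0, 1.0, 1.0, 1.0]
result = fp8_dot(activations, q_weights, w_scale)
```

The weight scale is computed once and reused, while the activation scale is derived inside `fp8_dot` on each call. That extra per-call reduction over the activations is the "dynamic" part of the scheme.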

Initial Results:
Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s

FP8 is currently about 13% slower than BF16 in this setup. I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.

[Bugfix] Add fix for JSON whitespace (#4189) · 138485a8
Ayush Rautwar authored Apr 19, 2024
```
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
```

18 Apr, 2024 2 commits
- [Bugfix] Get available quantization methods from quantization registry (#4098) · 53b018ed
  Michael Goin authored Apr 18, 2024
  
- [Typing] Mypy typing part 2 (#4043) · 533d2a1f
  SangBin Cho authored Apr 18, 2024
```
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>
```
16 Apr, 2024 2 commits
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
  
- LM Format Enforcer Guided Decoding Support (#3868) · 05434764
  Noam Gat authored Apr 16, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
15 Apr, 2024 1 commit
- [Doc] Add better clarity for tensorizer usage (#4090) · d619ae2d
  Sanger Steel authored Apr 15, 2024
```
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
```
14 Apr, 2024 1 commit
- [Frontend] [Core] feat: Add model loading using `tensorizer` (#3476) · 711a0002
  Sanger Steel authored Apr 13, 2024
  
12 Apr, 2024 1 commit
- [Doc] Add typing hints / mypy types cleanup (#3816) · c2b4a1bc
  Michael Feil authored Apr 11, 2024
```
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
```
11 Apr, 2024 4 commits
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
- [Kernel] Fused MoE Config for Mixtral 8x22 (#4002) · c1dc5471
  Roger Wang authored Apr 11, 2024
  
- [Misc] Add indirection layer for custom ops (#3913) · e9da5a40
  Kunshang Ji authored Apr 11, 2024
  
- [Core][Model] torch.compile for layernorm in commandr (#3985) · caada5e5
  youkaichao authored Apr 10, 2024
```
[Core][Model] Use torch.compile to accelerate layernorm in commandr (#3985)
```
10 Apr, 2024 4 commits
- [Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f
  youkaichao authored Apr 10, 2024
```
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
```
- [Bugfix] Remove key sorting for `guided_json` parameter in OpenAi compatible Server (#3945) · e4c4072c
  Daniel E Marasco authored Apr 10, 2024
  
- [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876) · 0258b7a9
  Travis Johnson authored Apr 10, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
```
- [Bugfix] Fix logits processor when prompt_logprobs is not None (#3899) · b3104b2a
  胡译文 authored Apr 10, 2024
  
09 Apr, 2024 2 commits
- [Bugfix] Fix KeyError on loading GPT-NeoX (#3925) · e23a43ae
  Junichi Sato authored Apr 10, 2024
  
- [Core] separate distributed_init from worker (#3904) · 6d592eb4
  youkaichao authored Apr 09, 2024
  
08 Apr, 2024 4 commits
- [BugFix][Model] Fix commandr RoPE max_position_embeddings (#3919) · d036198e
  Roy authored Apr 09, 2024
  
- [Bugfix] Enable Proper `attention_bias` Usage in Llama Model Configuration (#3767) · bc0c0192
  Kiran R authored Apr 09, 2024
```
Co-authored-by: roy <jasonailu87@gmail.com>
```
- [Bugfix] Added Command-R GPTQ support (#3849) · f46864d6
  egortolmachev authored Apr 08, 2024
```
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>
```
- [Model] add minicpm (#3893) · b4543c8f
  ywfang authored Apr 08, 2024
  
07 Apr, 2024 1 commit
- [Core] enable out-of-tree model register (#3871) · 95baec82
  youkaichao authored Apr 06, 2024
  
05 Apr, 2024 1 commit
- [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism (#3869) · 54951ac4
  Isotr0py authored Apr 06, 2024
  
04 Apr, 2024 4 commits
- [Core] improve robustness of pynccl (#3860) · c391e4b6
  youkaichao authored Apr 04, 2024
  
- [Model] Cohere CommandR+ (#3829) · 9117f892
  Saurabh Dash authored Apr 05, 2024
  
- [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 (#3805) · ca81ff51
  youkaichao authored Apr 04, 2024
  
- [Core] Enable hf_transfer by default if available (#3817) · 537ee25f
  Michael Feil authored Apr 03, 2024
  
03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

29 Mar, 2024 6 commits
- [BugFix] Use consistent logger everywhere (#3738) · 991143cf
  Nick Hill authored Mar 29, 2024
  
- [ROCm][Bugfix] Fixed several bugs related to rccl path and attention selector logic (#3699) · 9765b5c4
  Hongxia Yang authored Mar 29, 2024
  
- [BugFix][Frontend] Fix completion logprobs=0 error (#3731) · f510395b
  Roy authored Mar 30, 2024
  
- Usage Stats Collection (#2852) · d8658c8c
  yhu422 authored Mar 28, 2024
  
- [Core][Test] move local_rank to the last arg with default value (#3711) · 756b30a5
  youkaichao authored Mar 28, 2024
```
[Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711)
```
- [Core] fix del of communicator (#3702) · 0267fef5
  youkaichao authored Mar 28, 2024
  
28 Mar, 2024 4 commits
- fix logging msg for block manager (#3701) · 4716a32d
  Simon Mo authored Mar 28, 2024
  
- [Kernel] Add MoE Triton kernel configs for A100 40GB (#3700) · cb40b3ab
  Woosuk Kwon authored Mar 28, 2024
  
- [Core] Support multi-node inference (eager and cuda graph) (#3686) · 515386ef
  Roy authored Mar 29, 2024
  
- [Kernel] DBRX Triton MoE kernel H100 (#3692) · ce567a29
  Roger Wang authored Mar 28, 2024
  