Commits · a9bcc7afb23d208efaa1b47549fa93eaa1d9d6cf · OpenDAS / vllm_cscc

30 May, 2024 1 commit
- [Doc] Use intersphinx and update entrypoints docs (#5125) · a9bcc7af
  Cyrus Leung authored May 31, 2024
  
  a9bcc7af
28 May, 2024 1 commit
- [Core] Consolidate prompt arguments to LLM engines (#4328) · 5ae5ed1e
  Cyrus Leung authored May 29, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  5ae5ed1e
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
09 May, 2024 1 commit
- [Frontend] add tok/s speed metric to llm class when using tqdm (#4400) · 16bc0a09
  Mahmoud Ashraf authored May 09, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  16bc0a09
03 May, 2024 1 commit
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
21 Apr, 2024 1 commit
- Make initialization of tokenizer and detokenizer optional (#3748) · a37d815b
  GeauxEric authored Apr 21, 2024
```
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  a37d815b
20 Apr, 2024 2 commits

- [Frontend] multiple sampling params support (#3570) · 91528575
  nunjunj authored Apr 20, 2024
  
  91528575
- [Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3
  Cody Yu authored Apr 19, 2024

Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with `--quantization fp8` or `-q fp8` when launching an engine.

Algorithm:
The model checkpoint is still loaded in FP16/BF16. After the weights are loaded, `Fp8LinearMethod` calculates a per-tensor scaling factor for the weights and quantizes them accordingly. The scaling factor is then stored for later use. The per-tensor scaling factor for activations, by contrast, is calculated in every forward pass.
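
The idea above can be illustrated with a minimal pure-Python sketch. This is not the `Fp8LinearMethod` code from the PR: all function names here are hypothetical, the E4M3 format (largest finite value 448) is an assumption since the commit message does not name the FP8 variant, and the rounding step only keeps 4 significant bits while ignoring subnormals and exponent limits. It shows the weight scale being computed once and the activation scale being recomputed on every call.

```python
import math

# Assumption: FP8 E4M3, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def per_tensor_scale(values):
    """Return one scale for the whole tensor so max(|v|) maps to FP8_E4M3_MAX."""
    amax = max((abs(v) for v in values), default=0.0)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def round_to_4_significant_bits(x):
    """Simplified stand-in for FP8 rounding (1 implicit + 3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent
    return math.ldexp(round(mantissa * 16) / 16, exponent)


def quantize(values, scale):
    """Divide by the scale, clamp to the FP8 range, then round."""
    out = []
    for v in values:
        scaled = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale))
        out.append(round_to_4_significant_bits(scaled))
    return out


def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


# Weights: the scale is computed once after loading and stored.
weights = [0.5, -1.0, 0.25, 2.0]
weight_scale = per_tensor_scale(weights)
q_weights = quantize(weights, weight_scale)


def forward(activations):
    """Dot product with the stored weights; the activation scale is dynamic."""
    act_scale = per_tensor_scale(activations)  # recomputed on every call
    q_act = quantize(activations, act_scale)
    acc = sum(a * w for a, w in zip(q_act, q_weights))
    return acc * act_scale * weight_scale  # undo both scalings
```

Computing the activation scale inside `forward` is what makes the scaling dynamic: no calibration data is needed, at the cost of an extra reduction over the activations on each pass.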

Initial Results:
Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

- BF16: 1.47s
- FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.

a22cdea3

12 Apr, 2024 2 commits
- [Core] fix custom allreduce default value (#4040) · fbb9d9ee
  youkaichao authored Apr 12, 2024
  
  fbb9d9ee
- [mypy] Add mypy type annotation part 1 (#4006) · 09473ee4
  SangBin Cho authored Apr 13, 2024
  
  09473ee4
29 Mar, 2024 1 commit
- Usage Stats Collection (#2852) · d8658c8c
  yhu422 authored Mar 28, 2024
  
  d8658c8c
25 Mar, 2024 2 commits
- [Feature] Add vision language model support. (#3042) · 64172a97
  xwjiang2010 authored Mar 25, 2024
  
  64172a97
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
  01bfb22b
22 Mar, 2024 1 commit
- [BugFix] Some fixes for custom allreduce kernels (#2760) · f721096d
  Hanzhi Zhou authored Mar 21, 2024
  
  f721096d
06 Mar, 2024 1 commit
- Add tqdm `dynamic_ncols=True` (#3242) · 4cb3b924
  Chujie Zheng authored Mar 06, 2024
  
  4cb3b924
02 Mar, 2024 1 commit

- Add Automatic Prefix Caching (#2762) · ce4f5a29
  Sage Moore authored Mar 02, 2024


Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

ce4f5a29

04 Feb, 2024 1 commit
- set&get llm internal tokenizer instead of the TokenizerGroup (#2741) · 51cd22ce
  dancingpipi authored Feb 05, 2024
```
Co-authored-by: shujunhua1 <shujunhua1@jd.com>
```
  51cd22ce
27 Jan, 2024 1 commit
- Implement custom all reduce kernels (#2192) · 38017003
  Hanzhi Zhou authored Jan 28, 2024
  
  38017003
23 Jan, 2024 1 commit

- [Experimental] Add multi-LoRA support (#1804) · 9b945daa
  Antoni Baum authored Jan 24, 2024


Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>

9b945daa

18 Jan, 2024 1 commit

- [Experimental] Prefix Caching Support (#1669) · d10f8e1d
  shiyi.c_98 authored Jan 17, 2024


Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

d10f8e1d

17 Dec, 2023 2 commits
- [Minor] Add more detailed explanation on `quantization` argument (#2145) · 30fb0956
  Woosuk Kwon authored Dec 17, 2023
  
  30fb0956
- Optimize model execution with CUDA graph (#1926) · 37ca5581
  Woosuk Kwon authored Dec 16, 2023
```
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  37ca5581
15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  
  0fbfc4b8
20 Nov, 2023 1 commit
- Migrate linter from `pylint` to `ruff` (#1665) · 5ffc0d13
  Simon Mo authored Nov 20, 2023
  
  5ffc0d13
03 Oct, 2023 1 commit
- add support for tokenizer revision (#1163) · 66d18a7f
  Federico Cassano authored Oct 02, 2023
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  66d18a7f
20 Sep, 2023 1 commit
- Add gpu_memory_utilization and swap_space to LLM (#1090) · bc064457
  Woosuk Kwon authored Sep 19, 2023
  
  bc064457
18 Sep, 2023 1 commit
- added support for quantize on LLM module (#1080) · fbe66e1d
  orellavie1212 authored Sep 18, 2023
  
  fbe66e1d
13 Sep, 2023 1 commit

- Add Model Revision Support (#1014) · ab019eea
  Jasmond L authored Sep 14, 2023


Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

ab019eea

08 Jul, 2023 1 commit
- Sort the outputs before return (#402) · b6fbb9a5
  Woosuk Kwon authored Jul 08, 2023
  
  b6fbb9a5
07 Jul, 2023 1 commit
- Add trust-remote-code flag to handle remote tokenizers (#364) · a945fcc2
  codethazine authored Jul 07, 2023
  
  a945fcc2
03 Jul, 2023 1 commit
- [Quality] Add code formatter and linter (#326) · d6fa1be3
  Zhuohan Li authored Jul 03, 2023
  
  d6fa1be3
28 Jun, 2023 3 commits
- [Tokenizer] Add tokenizer mode (#298) · 998d9d15
  Woosuk Kwon authored Jun 28, 2023
  
  998d9d15
- [Tokenizer] Add an option to specify tokenizer (#284) · 4338cc47
  Woosuk Kwon authored Jun 28, 2023
  
  4338cc47
- Add LLM.set_tokenizer (#283) · bdd6b4c8
  Jishnu Ray Chowdhury authored Jun 28, 2023
  
  bdd6b4c8
22 Jun, 2023 1 commit
- [Bugfix] Fix a bug in RequestOutput.finished (#202) · 14f0b39c
  Woosuk Kwon authored Jun 22, 2023
  
  14f0b39c
17 Jun, 2023 2 commits
- Change the name to vLLM (#150) · 0b98ba15
  Woosuk Kwon authored Jun 17, 2023
  
  0b98ba15
- Rename servers to engines (#152) · e5464ee4
  Zhuohan Li authored Jun 17, 2023
  
  e5464ee4
16 Jun, 2023 1 commit
- Rename servers and change port numbers to reduce confusion (#149) · eedb46bf
  Zhuohan Li authored Jun 17, 2023
  
  eedb46bf
07 Jun, 2023 1 commit
- Support FP32 (#141) · e38074b1
  Woosuk Kwon authored Jun 07, 2023
  
  e38074b1
04 Jun, 2023 1 commit
- Add docstrings for LLM (#137) · 8274ca23
  Woosuk Kwon authored Jun 04, 2023
  
  8274ca23