Commits · a74dee9b62d10767eb0580f196f5e508e9e80a2d · OpenDAS / vllm_cscc

23 Apr, 2024 3 commits
- [Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) · 1e8f4252
  Cyrus Leung authored Apr 24, 2024
  
  1e8f4252
- [Bugfix] Fixing max token error message for openai compatible server (#4016) · d3c8180a
  Jack Gordley authored Apr 23, 2024
  
  d3c8180a
- [Mypy] Part 3 fix typing for nested directories for most of directory (#4161) · 0ae11f78
  SangBin Cho authored Apr 23, 2024
  
  0ae11f78
21 Apr, 2024 1 commit
- Make initialization of tokenizer and detokenizer optional (#3748) · a37d815b
  GeauxEric authored Apr 21, 2024
```
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  a37d815b
20 Apr, 2024 3 commits

- [Frontend] multiple sampling params support (#3570) · 91528575
  nunjunj authored Apr 20, 2024

  91528575

- [Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3
  Cody Yu authored Apr 19, 2024

Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with `--quantization fp8` or `-q fp8` when launching an engine.

Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
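
The scaling scheme described above can be illustrated with a minimal pure-Python sketch. This is not the `Fp8LinearMethod` implementation from the PR: every function name below is hypothetical, the FP8 format is assumed to be E4M3 (largest finite value 448), and FP8 rounding is simulated in software rather than done with real FP8 hardware types.

```python
import math

# Assumption: the FP8 format is E4M3, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Simulate rounding a float to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    if exponent < -5:
        # Subnormal range (|x| < 2**-6): values lie on a fixed grid of step 2**-9.
        return round(x * 2**9) / 2**9
    # Normal range: 1 implicit + 3 explicit mantissa bits = 4 significant bits.
    return round(mantissa * 16) / 16 * 2.0**exponent


def per_tensor_scale(values):
    """One scale for the whole tensor: maps the largest magnitude onto FP8 max."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale, then round each element to the simulated FP8 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def fp8_dot(activations, q_weights, weight_scale):
    """Dot product with a stored weight scale and a freshly computed activation scale."""
    act_scale = per_tensor_scale(activations)  # "dynamic": recomputed on every call
    q_acts = quantize(activations, act_scale)
    acc = sum(a * w for a, w in zip(q_acts, q_weights))
    return acc * act_scale * weight_scale  # undo both scales


# Weights are quantized once after loading; the scale is kept for later use.
weights = [0.5, -1.0, 0.25, 2.0]
w_scale = per_tensor_scale(weights)
q_weights = quantize(weights, w_scale)

# Activations get a new scale on each "forward pass".
result = fp8_dot([1.0, 2.0, -4.0, 0.5], q_weights, w_scale)
```

The weight scale is computed a single time and reused, while the activation scale depends on the current input, which is why it has to be recomputed on every call.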

Initial Results:
Tested so far with Mistral-7B on 1xH100, with a prompt length of ~5 and a decoding length of 128:

- BF16: 1.47s
- FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.

a22cdea3

- Pass `tokenizer_revision` when getting tokenizer in openai serving (#4214) · bc9df157
  Chirag Jain authored Apr 20, 2024

  bc9df157

18 Apr, 2024 2 commits
- [Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) · e1bb2fd5
  James Whedbee authored Apr 18, 2024
  
  e1bb2fd5
- Allow model to be served under multiple names (#2894) · 66ded030
  Harry Mellor authored Apr 18, 2024
```
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
```
  66ded030
16 Apr, 2024 1 commit
- LM Format Enforcer Guided Decoding Support (#3868) · 05434764
  Noam Gat authored Apr 16, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  05434764
12 Apr, 2024 3 commits
- [Core] fix custom allreduce default value (#4040) · fbb9d9ee
  youkaichao authored Apr 12, 2024
  
  fbb9d9ee
- [mypy] Add mypy type annotation part 1 (#4006) · 09473ee4
  SangBin Cho authored Apr 13, 2024
  
  09473ee4
- [Frontend][Core] Move `merge_async_iterators` to utils (#4026) · 7fd3949a
  Cyrus Leung authored Apr 12, 2024
  
  7fd3949a
11 Apr, 2024 1 commit
- Fix echo/logprob OpenAI completion bug (#3441) · 95e7d4a9
  Dylan Hawk authored Apr 11, 2024
```
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
```
  95e7d4a9
05 Apr, 2024 1 commit
- Add option to completion API to truncate prompt tokens (#3144) · 1d7c940d
  Thomas Parnell authored Apr 05, 2024
  
  1d7c940d
02 Apr, 2024 1 commit
- [Frontend][Bugfix] allow using the default middleware with a root path (#3788) · 0739b194
  A-Mahla authored Apr 02, 2024
```
Co-authored-by: A-Mahla <>
```
  0739b194
29 Mar, 2024 3 commits
- [BugFix][Frontend] Fix completion logprobs=0 error (#3731) · f510395b
  Roy authored Mar 30, 2024
  
  f510395b
- [BugFix] Fix tokenizer out of vocab size (#3685) · 6110c39d
  Roy authored Mar 29, 2024
  
  6110c39d
- Usage Stats Collection (#2852) · d8658c8c
  yhu422 authored Mar 28, 2024
  
  d8658c8c
26 Mar, 2024 1 commit
- [Misc] Include matched stop string/token in responses (#2976) · dfeb2ecc
  Nick Hill authored Mar 25, 2024
```
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
```
  dfeb2ecc
25 Mar, 2024 4 commits
- [Feature] Add vision language model support. (#3042) · 64172a97
  xwjiang2010 authored Mar 25, 2024
  
  64172a97
- [Bugfix] API stream returning two stops (#3450) · 0b4997e0
  Dylan Hawk authored Mar 25, 2024
```
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
```
  0b4997e0
- feat: implement the min_tokens sampling parameter (#3124) · c13ad1b7
  Travis Johnson authored Mar 25, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
```
  c13ad1b7
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
  01bfb22b
22 Mar, 2024 1 commit
- [BugFix] Some fixes for custom allreduce kernels (#2760) · f721096d
  Hanzhi Zhou authored Mar 21, 2024
  
  f721096d
21 Mar, 2024 1 commit
- [Misc][Log] Add log for tokenizer length not equal to vocabulary size (#3500) · 86573234
  Roy authored Mar 21, 2024
  
  86573234
19 Mar, 2024 1 commit
- [Doc] Add docs about OpenAI compatible server (#3288) · ef65dcfa
  Simon Mo authored Mar 18, 2024
  
  ef65dcfa
16 Mar, 2024 2 commits
- Support arbitrary json_object in OpenAI and Context Free Grammar (#3211) · 120157fd
  Simon Mo authored Mar 16, 2024
  
  120157fd
- Removed Extraneous Print Message From OAI Server (#3440) · 10585e03
  Robert Shaw authored Mar 15, 2024
  
  10585e03
15 Mar, 2024 2 commits
- Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220) · 14b8ae02
  Tao He authored Mar 16, 2024
```
Signed-off-by: Tao He <sighingnow@gmail.com>
Co-authored-by: simon-mo <simon.mo@hey.com>
```
  14b8ae02
- [Fix] Add args for mTLS support (#3430) · 03d37f24
  Dan Clark authored Mar 15, 2024
```
Co-authored-by: declark1 <daniel.clark@ibm.com>
```
  03d37f24
14 Mar, 2024 1 commit
- Add args for mTLS support (#3410) · c17ca8ef
  Dan Clark authored Mar 14, 2024
```
Co-authored-by: Daniel Clark <daniel.clark@ibm.com>
```
  c17ca8ef
11 Mar, 2024 2 commits
- Re-enable the 80 char line width limit (#3305) · 2f8844ba
  Zhuohan Li authored Mar 10, 2024
  
  2f8844ba
- [BugFix] Fix get tokenizer when using ray (#3301) · 9e8744a5
  Roy authored Mar 11, 2024
  
  9e8744a5
08 Mar, 2024 1 commit
- Connect engine healthcheck to openai server (#3260) · d2339d68
  Nick Hill authored Mar 07, 2024
  
  d2339d68
06 Mar, 2024 1 commit
- Add tqdm `dynamic_ncols=True` (#3242) · 4cb3b924
  Chujie Zheng authored Mar 06, 2024
  
  4cb3b924
04 Mar, 2024 1 commit
- Push logprob generation to LLMEngine (#3065) · 22de4523
  Antoni Baum authored Mar 04, 2024
```
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
```
  22de4523
03 Mar, 2024 1 commit
- Add vLLM version info to logs and openai API server (#3161) · d65fac27
  Jason Cox authored Mar 03, 2024
  
  d65fac27
02 Mar, 2024 1 commit

- Add Automatic Prefix Caching (#2762) · ce4f5a29
  Sage Moore authored Mar 02, 2024

  Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>

  ce4f5a29

01 Mar, 2024 1 commit
- allow user chose log level by --log-level instead of fixed 'info'. (#3109) · 29e70e3e
  Allen.Dou authored Mar 02, 2024
```
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  29e70e3e