Commits · f12b20deccbc6c8bb5cdeac053d75178341c66c1 · OpenDAS / vllm_cscc

09 May, 2024 2 commits
- [Frontend] Move async logic outside of constructor (#4674) · f12b20de
  Cyrus Leung authored May 09, 2024
  
  f12b20de
- [Frontend] add tok/s speed metric to llm class when using tqdm (#4400) · 16bc0a09
  Mahmoud Ashraf authored May 09, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  16bc0a09
06 May, 2024 1 commit
- [Bugfix] Fix `asyncio.Task` not being subscriptable (#4623) · 323f27b9
  Cyrus Leung authored May 07, 2024
  
  323f27b9
04 May, 2024 1 commit
- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
  DearPlanet authored May 05, 2024
  
  43029870
03 May, 2024 4 commits
- Fix/async chat serving (#2727) · f8e7adda
  Sebastian Schoennenbeck authored May 03, 2024
  
  f8e7adda
- [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None (#4586) · 7e65477e
  Michael Goin authored May 03, 2024
  
  7e65477e
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
- [BugFix] Prevent the task of `_force_log` from being garbage collected (#4567) · 808632d3
  Yang, Bo authored May 02, 2024
  
  808632d3
02 May, 2024 1 commit
- [Misc] centralize all usage of environment variables (#4548) · 5b8a7c1c
  youkaichao authored May 02, 2024
  
  5b8a7c1c
01 May, 2024 3 commits
- [Bugfix] Add validation for seed (#4529) · c47ba4aa
  sasha0552 authored May 01, 2024
  
  c47ba4aa
- [Bugfix] Fix 307 Redirect for `/metrics` (#4523) · 4dc8026d
  Robert Shaw authored May 01, 2024
  
  4dc8026d
- Allow user to define whitespace pattern for outlines (#4305) · c3845d82
  Robert Caulk authored May 01, 2024
  
  c3845d82
30 Apr, 2024 1 commit
- [Frontend] Support complex message content for chat completions endpoint (#3467) · a4941404
  Florian Greinacher authored May 01, 2024
```
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
```
  a4941404
27 Apr, 2024 2 commits
- [Bugfix][Core] Fix get decoding config from ray (#4335) · 7134303c
  Roy authored Apr 27, 2024
  
  7134303c
- [Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355) · 8947bc3c
  Cyrus Leung authored Apr 27, 2024
  
  8947bc3c
26 Apr, 2024 2 commits
- [CI] Disable non-lazy string operation on logging (#4326) · a88081bf
  SangBin Cho authored Apr 26, 2024
```
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
```
  a88081bf
- [Frontend] Add --log-level option to api server (#4377) · 2f30e7c7
  Norman Mu authored Apr 25, 2024
  
  2f30e7c7
23 Apr, 2024 3 commits
- [Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) · 1e8f4252
  Cyrus Leung authored Apr 24, 2024
  
  1e8f4252
- [Bugfix] Fixing max token error message for openai compatible server (#4016) · d3c8180a
  Jack Gordley authored Apr 23, 2024
  
  d3c8180a
- [Mypy] Part 3 fix typing for nested directories for most of directory (#4161) · 0ae11f78
  SangBin Cho authored Apr 23, 2024
  
  0ae11f78
21 Apr, 2024 1 commit
- Make initialization of tokenizer and detokenizer optional (#3748) · a37d815b
  GeauxEric authored Apr 21, 2024
```
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  a37d815b
20 Apr, 2024 3 commits
- [Frontend] multiple sampling params support (#3570) · 91528575
  nunjunj authored Apr 20, 2024
  
  91528575
- [Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3
  Cody Yu authored Apr 19, 2024

  Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

  This feature can be enabled with `--quantization fp8` or `-q fp8` when launching an engine.

  Algorithm:
  We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates a per-tensor scaling factor for the weights and quantizes them accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is recalculated in every forward pass.

  Initial Results:
  Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

  - BF16: 1.47s
  - FP8: 1.66s

  I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.

  a22cdea3
- Pass `tokenizer_revision` when getting tokenizer in openai serving (#4214) · bc9df157
  Chirag Jain authored Apr 20, 2024
  
  bc9df157
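
The FP8 commit (#4118) in this group describes per-tensor scaling: one scaling factor per weight tensor, computed once after loading and stored, and one per activation tensor, recomputed on every forward pass. The following is a minimal pure-Python sketch of that idea, not the code from the commit. The function names, the choice of the E4M3 format (largest finite value 448), and the toy dot product are all assumptions made for illustration, and rounding onto the actual FP8 grid is left out.

```python
from typing import List

# Assumption: the target format is FP8 E4M3, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def per_tensor_scale(values: List[float]) -> float:
    """Return a single scale so that the largest |value| maps to FP8_E4M3_MAX."""
    amax = max((abs(v) for v in values), default=0.0)
    # An all-zero (or empty) tensor would give a zero scale, so fall back to 1.0.
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0


def quantize(values: List[float], scale: float) -> List[float]:
    """Divide by the scale and clamp to the FP8 range.

    Rounding to representable FP8 values is omitted in this sketch.
    """
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


# Weights: the scale is computed once after loading and kept for later use.
weights = [0.5, -2.0, 1.25, 0.0]
weight_scale = per_tensor_scale(weights)
q_weights = quantize(weights, weight_scale)


def forward(activations: List[float]) -> float:
    """Toy 'linear layer' (a dot product) with a dynamic activation scale."""
    # Activations: the scale is recomputed on every call.
    act_scale = per_tensor_scale(activations)
    q_act = quantize(activations, act_scale)
    # Accumulate in the scaled domain, then undo both scales.
    acc = sum(a * w for a, w in zip(q_act, q_weights))
    return acc * act_scale * weight_scale


result = forward([1.0, 1.0, 1.0, 1.0])
```

Because the activation scale is derived from each input, no calibration data is needed, at the cost of an extra max-reduction per forward pass. That extra work is one possible contributor to the FP8 timing reported in the commit message, though the commit itself does not identify the bottleneck.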

18 Apr, 2024 2 commits
- [Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) · e1bb2fd5
  James Whedbee authored Apr 18, 2024
  
  e1bb2fd5
- Allow model to be served under multiple names (#2894) · 66ded030
  Harry Mellor authored Apr 18, 2024
```
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
```
  66ded030
16 Apr, 2024 1 commit
- LM Format Enforcer Guided Decoding Support (#3868) · 05434764
  Noam Gat authored Apr 16, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  05434764
12 Apr, 2024 3 commits
- [Core] fix custom allreduce default value (#4040) · fbb9d9ee
  youkaichao authored Apr 12, 2024
  
  fbb9d9ee
- [mypy] Add mypy type annotation part 1 (#4006) · 09473ee4
  SangBin Cho authored Apr 13, 2024
  
  09473ee4
- [Frontend][Core] Move `merge_async_iterators` to utils (#4026) · 7fd3949a
  Cyrus Leung authored Apr 12, 2024
  
  7fd3949a
11 Apr, 2024 1 commit
- Fix echo/logprob OpenAI completion bug (#3441) · 95e7d4a9
  Dylan Hawk authored Apr 11, 2024
```
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
```
  95e7d4a9
05 Apr, 2024 1 commit
- Add option to completion API to truncate prompt tokens (#3144) · 1d7c940d
  Thomas Parnell authored Apr 05, 2024
  
  1d7c940d
02 Apr, 2024 1 commit
- [Frontend][Bugfix] allow using the default middleware with a root path (#3788) · 0739b194
  A-Mahla authored Apr 02, 2024
```
Co-authored-by: A-Mahla <>
```
  0739b194
29 Mar, 2024 3 commits
- [BugFix][Frontend] Fix completion logprobs=0 error (#3731) · f510395b
  Roy authored Mar 30, 2024
  
  f510395b
- [BugFix] Fix tokenizer out of vocab size (#3685) · 6110c39d
  Roy authored Mar 29, 2024
  
  6110c39d
- Usage Stats Collection (#2852) · d8658c8c
  yhu422 authored Mar 28, 2024
  
  d8658c8c
26 Mar, 2024 1 commit
- [Misc] Include matched stop string/token in responses (#2976) · dfeb2ecc
  Nick Hill authored Mar 25, 2024
```
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
```
  dfeb2ecc
25 Mar, 2024 3 commits
- [Feature] Add vision language model support. (#3042) · 64172a97
  xwjiang2010 authored Mar 25, 2024
  
  64172a97
- [Bugfix] API stream returning two stops (#3450) · 0b4997e0
  Dylan Hawk authored Mar 25, 2024
```
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
```
  0b4997e0
- feat: implement the min_tokens sampling parameter (#3124) · c13ad1b7
  Travis Johnson authored Mar 25, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
```
  c13ad1b7