Commits · baaedfdb2d3f1d70b7dbcde08b083abfe6017a92 · OpenDAS / vllm_cscc

21 Aug, 2024 1 commit
- [mypy] Enable following imports for entrypoints (#7248) · baaedfdb
  Cyrus Leung authored Aug 21, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
```
16 Aug, 2024 2 commits

support tqdm in notebooks (#7510) · ec724a72
fzyzcjy authored Aug 17, 2024


Chat method for offline llm (#5049) · 3b19e39d

nunjunj authored Aug 16, 2024

Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>


09 Aug, 2024 1 commit
- [Core] Support serving encoder/decoder models (#7258) · 7eb4a51c
  Cyrus Leung authored Aug 09, 2024
  
06 Aug, 2024 1 commit

[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942) · fd95e026

afeldman-nm authored Aug 06, 2024


Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>


04 Aug, 2024 1 commit
- Support for guided decoding for offline LLM (#6878) · 654bc5ca
  Yihuan Bu authored Aug 03, 2024
```
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
```
22 Jul, 2024 1 commit
- [Frontend] Refactor prompt processing (#4028) · 739b61a3
  Cyrus Leung authored Jul 23, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
18 Jul, 2024 1 commit
- [core][model] yet another cpu offload implementation (#6496) · 1c27d25f
  youkaichao authored Jul 17, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
09 Jul, 2024 1 commit

[CORE] Adding support for insertion of soft-tuned prompts (#4645) · 4d6ada94

Swapnil Parekh authored Jul 09, 2024


Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>


03 Jul, 2024 1 commit

[vlm] Remove vision language config. (#6089) · d9e98f42

xwjiang2010 authored Jul 03, 2024


Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>


12 Jun, 2024 1 commit
- [Frontend] Add "input speed" to tqdm postfix alongside output speed (#5425) · 7d19de2e
  Michael Goin authored Jun 12, 2024
  
06 Jun, 2024 1 commit
- [Frontend] enable passing multiple LoRA adapters at once to generate() (#5300) · 828da0d4
  Matthew Goldey authored Jun 06, 2024
  
05 Jun, 2024 1 commit
- [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#5207) · eb8fcd26
  DriverSong authored Jun 06, 2024
```
Co-authored-by: qiujiawei9 <qiujiawei9@jd.com>
```
03 Jun, 2024 1 commit
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
01 Jun, 2024 1 commit
- [BugFix] Prevent `LLM.encode` for non-generation Models (#5184) · 044793d8
  Robert Shaw authored Jun 01, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
30 May, 2024 1 commit
- [Doc] Use intersphinx and update entrypoints docs (#5125) · a9bcc7af
  Cyrus Leung authored May 31, 2024
  
28 May, 2024 1 commit
- [Core] Consolidate prompt arguments to LLM engines (#4328) · 5ae5ed1e
  Cyrus Leung authored May 29, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
09 May, 2024 1 commit
- [Frontend] add tok/s speed metric to llm class when using tqdm (#4400) · 16bc0a09
  Mahmoud Ashraf authored May 09, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
03 May, 2024 1 commit
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
21 Apr, 2024 1 commit
- Make initialization of tokenizer and detokenizer optional (#3748) · a37d815b
  GeauxEric authored Apr 21, 2024
```
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
20 Apr, 2024 2 commits

[Frontend] multiple sampling params support (#3570) · 91528575
nunjunj authored Apr 20, 2024


[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3

Cody Yu authored Apr 19, 2024

Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
The model checkpoint is still loaded in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates a per-tensor scaling factor for the weights, quantizes them accordingly, and stores the scaling factor for later use. The per-tensor scaling factor for activations, by contrast, is recalculated in every forward pass.
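
The algorithm above can be illustrated with a minimal pure-Python sketch. It is not the code from this PR: the helper names (`round_to_e4m3`, `per_tensor_scale`, `quantize`, `dequantize`) are hypothetical, the FP8 format is assumed to be E4M3 with a maximum finite value of 448, and FP8 rounding is only simulated on ordinary Python floats.

```python
import math

# Assumption: FP8 here means the E4M3 format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(v: float) -> float:
    """Simulate rounding a float to the nearest representable E4M3 value."""
    # Saturate to the representable range first.
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))
    if v == 0.0:
        return 0.0
    _, e = math.frexp(v)  # |v| = m * 2**e with 0.5 <= m < 1
    # 4 significant bits (1 implicit + 3 mantissa); exponents below the
    # smallest normal value share the subnormal step of 2**-9.
    step = math.ldexp(1.0, max(e, -5) - 4)
    return round(v / step) * step


def per_tensor_scale(values):
    """One scaling factor for the whole tensor: amax / FP8 max."""
    amax = max((abs(v) for v in values), default=0.0)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale, then round each element to simulated FP8."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(q_values, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in q_values]


# Weights: the scale is computed once after loading and then stored.
weights = [0.02, -1.5, 0.75, 3.0]
w_scale = per_tensor_scale(weights)
w_q = quantize(weights, w_scale)

# Activations: the scale is recomputed on every forward pass.
activations = [0.1, -0.4, 2.0, 0.0]
a_scale = per_tensor_scale(activations)
a_q = quantize(activations, a_scale)

print(w_q, dequantize(w_q, w_scale))
```

With a single scale per tensor, the largest-magnitude element always maps to the edge of the FP8 range, so one outlier determines the resolution available to every other element.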

Initial Results:
Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.


12 Apr, 2024 2 commits
- [Core] fix custom allreduce default value (#4040) · fbb9d9ee
  youkaichao authored Apr 12, 2024
  
- [mypy] Add mypy type annotation part 1 (#4006) · 09473ee4
  SangBin Cho authored Apr 13, 2024
  
29 Mar, 2024 1 commit
- Usage Stats Collection (#2852) · d8658c8c
  yhu422 authored Mar 28, 2024
  
25 Mar, 2024 2 commits
- [Feature] Add vision language model support. (#3042) · 64172a97
  xwjiang2010 authored Mar 25, 2024
  
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
22 Mar, 2024 1 commit
- [BugFix] Some fixes for custom allreduce kernels (#2760) · f721096d
  Hanzhi Zhou authored Mar 21, 2024
  
06 Mar, 2024 1 commit
- Add tqdm `dynamic_ncols=True` (#3242) · 4cb3b924
  Chujie Zheng authored Mar 06, 2024
  
02 Mar, 2024 1 commit

Add Automatic Prefix Caching (#2762) · ce4f5a29

Sage Moore authored Mar 02, 2024


Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>


04 Feb, 2024 1 commit
- set&get llm internal tokenizer instead of the TokenizerGroup (#2741) · 51cd22ce
  dancingpipi authored Feb 05, 2024
```
Co-authored-by: shujunhua1 <shujunhua1@jd.com>
```
27 Jan, 2024 1 commit
- Implement custom all reduce kernels (#2192) · 38017003
  Hanzhi Zhou authored Jan 28, 2024
  
23 Jan, 2024 1 commit

[Experimental] Add multi-LoRA support (#1804) · 9b945daa

Antoni Baum authored Jan 24, 2024


Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>


18 Jan, 2024 1 commit

[Experimental] Prefix Caching Support (#1669) · d10f8e1d

shiyi.c_98 authored Jan 17, 2024


Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>


17 Dec, 2023 2 commits
- [Minor] Add more detailed explanation on `quantization` argument (#2145) · 30fb0956
  Woosuk Kwon authored Dec 17, 2023
  
- Optimize model execution with CUDA graph (#1926) · 37ca5581
  Woosuk Kwon authored Dec 16, 2023
```
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  
20 Nov, 2023 1 commit
- Migrate linter from `pylint` to `ruff` (#1665) · 5ffc0d13
  Simon Mo authored Nov 20, 2023
  
03 Oct, 2023 1 commit
- add support for tokenizer revision (#1163) · 66d18a7f
  Federico Cassano authored Oct 02, 2023
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```