Commits · 4bb53e2dde809ea5727b8cac95a080893733a1ef · OpenDAS / vllm_cscc

30 Apr, 2024 1 commit
- [BugFix] fix num_lookahead_slots missing in async executor (#4165) · 4bb53e2d
  leiwen83 authored May 01, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
27 Apr, 2024 1 commit
- [Bugfix][Core] Fix get decoding config from ray (#4335) · 7134303c
  Roy authored Apr 27, 2024
  
26 Apr, 2024 2 commits
- [Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) · 603ad848
  SangBin Cho authored Apr 26, 2024
  
- [CI] Disable non-lazy string operation on logging (#4326) · a88081bf
  SangBin Cho authored Apr 26, 2024
```
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
```
25 Apr, 2024 1 commit
- [Core] Move ray_utils.py from `engine` to `executor` package (#4347) · 479d69fa
  Nick Hill authored Apr 24, 2024
  
22 Apr, 2024 1 commit
- [Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993) · 077f0a2e
  Tao He authored Apr 22, 2024
```
Signed-off-by: Tao He <sighingnow@gmail.com>
```
19 Apr, 2024 1 commit
- [Bugfix][Core] Restore logging of stats in the async engine (#4150) · 7be4f562
  Ronen Schaffer authored Apr 19, 2024
  
18 Apr, 2024 2 commits
- [CI/CD] add neuron docker and ci test scripts (#3571) · cd2f63fb
  Liangfu Chen authored Apr 18, 2024
  
- [Typing] Mypy typing part 2 (#4043) · 533d2a1f
  SangBin Cho authored Apr 18, 2024
```
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>
```
16 Apr, 2024 2 commits
- [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) · e95cd879
  Cade Daniel authored Apr 16, 2024
  
- [Core] Fix engine-use-ray broken (#4105) · 4e7ee664
  SangBin Cho authored Apr 16, 2024
  
03 Apr, 2024 1 commit
- [Speculative decoding] Adding configuration object for speculative decoding (#3706) · 5757d90e
  Cade Daniel authored Apr 02, 2024
```
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
```
29 Mar, 2024 1 commit
- Usage Stats Collection (#2852) · d8658c8c
  yhu422 authored Mar 28, 2024
  
25 Mar, 2024 2 commits
- [Feature] Add vision language model support. (#3042) · 64172a97
  xwjiang2010 authored Mar 25, 2024
  
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
22 Mar, 2024 1 commit
- [Hardware][Neuron] Refactor neuron support (#3471) · e90fc21f
  Zhuohan Li authored Mar 21, 2024
  
15 Mar, 2024 1 commit
- Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220) · 14b8ae02
  Tao He authored Mar 16, 2024
```
Signed-off-by: Tao He <sighingnow@gmail.com>
Co-authored-by: simon-mo <simon.mo@hey.com>
```
11 Mar, 2024 2 commits
- Add distributed model executor abstraction (#3191) · 4c922709
  Zhuohan Li authored Mar 11, 2024
  
- [BugFix] Fix get tokenizer when using ray (#3301) · 9e8744a5
  Roy authored Mar 11, 2024
  
04 Mar, 2024 2 commits
- Add health check, make async Engine more robust (#3015) · ff578cae
  Antoni Baum authored Mar 04, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
- Push logprob generation to LLMEngine (#3065) · 22de4523
  Antoni Baum authored Mar 04, 2024
```
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
```
02 Mar, 2024 1 commit
- Add Automatic Prefix Caching (#2762) · ce4f5a29
  Sage Moore authored Mar 02, 2024
```
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
29 Feb, 2024 1 commit
- Add guided decoding for OpenAI API server (#2819) · 703e42ee
  felixzhu555 authored Feb 29, 2024
```
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
```
31 Jan, 2024 1 commit
- fix some bugs (#2689) · c664b0e6
  zspo authored Feb 01, 2024
  
30 Jan, 2024 1 commit
- Fix 'Actor methods cannot be called directly' when using `--engine-use-ray` (#2664) · d79ced32
  Wen Sun authored Jan 31, 2024
```
* fix: engine-useray complain

* fix: typo
```
28 Jan, 2024 1 commit
- Small async_llm_engine refactor (#2618) · 89be30fa
  Murali Andoorveedu authored Jan 27, 2024
  
23 Jan, 2024 1 commit
- [Experimental] Add multi-LoRA support (#1804) · 9b945daa
  Antoni Baum authored Jan 24, 2024
```
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
```
18 Jan, 2024 1 commit
- [Experimental] Prefix Caching Support (#1669) · d10f8e1d
  shiyi.c_98 authored Jan 17, 2024
```
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
12 Jan, 2024 1 commit
- [DOC] Add additional comments for LLMEngine and AsyncLLMEngine (#1011) · 6549aef2
  Jiaxiang authored Jan 12, 2024
  
05 Jan, 2024 1 commit
- Ensure metrics are logged regardless of requests (#2347) · d0215a58
  Iskren Ivov Chernev authored Jan 05, 2024
  
03 Jan, 2024 1 commit
- Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) · fd4ea8ef
  Zhuohan Li authored Jan 04, 2024
  
26 Dec, 2023 1 commit
- [BUGFIX] Do not return ignored sentences twice in async llm engine (#2258) · e0ff9200
  Zhuohan Li authored Dec 26, 2023
  
14 Dec, 2023 1 commit
- Fix typing in AsyncLLMEngine & add toml to requirements-dev (#2100) · 6774bd50
  mezuzza authored Dec 14, 2023
  
03 Dec, 2023 1 commit
- Fix num_gpus when TP > 1 (#1852) · 464dd985
  Woosuk Kwon authored Dec 03, 2023
  
16 Nov, 2023 1 commit
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622) · 7076fa1c
  Zhuohan Li authored Nov 15, 2023
```
Refactor the tensor parallelism, quantization, and weight-loading codes.

Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
```
11 Nov, 2023 1 commit
- Run default _AsyncLLMEngine._run_workers_async in threadpool (#1628) · 1b290ace
  Dominik Schwabe authored Nov 11, 2023
  
01 Nov, 2023 1 commit
- [BugFix] Set engine_use_ray=True when TP>1 (#1531) · 5687d584
  ljss authored Nov 01, 2023
  
03 Oct, 2023 1 commit
- Use monotonic time where appropriate (#1249) · acbed3ef
  Antoni Baum authored Oct 02, 2023
  
18 Sep, 2023 1 commit
- align llm_engine and async_engine. (#1081) · 95592fa0
  Roy authored Sep 19, 2023
  
17 Sep, 2023 1 commit
- Remove AsyncLLMEngine busy loop, shield background task (#1059) · ff36139f
  Antoni Baum authored Sep 17, 2023
  