Commits · d10f8e1d43bfb0656b6848ad0c681ecbdec812d6 · kecinstone / 2024pra-vllm

18 Jan, 2024 1 commit
- [Experimental] Prefix Caching Support (#1669) · d10f8e1d
  shiyi.c_98 authored Jan 17, 2024
```
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  d10f8e1d
12 Jan, 2024 1 commit
- [DOC] Add additional comments for LLMEngine and AsyncLLMEngine (#1011) · 6549aef2
  Jiaxiang authored Jan 12, 2024
  
  6549aef2
07 Jan, 2024 1 commit
- Changed scheduler to use deques instead of lists (#2290) · 05921a9a
  Nadav Shmayovits authored Jan 07, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  05921a9a
05 Jan, 2024 1 commit
- Ensure metrics are logged regardless of requests (#2347) · d0215a58
  Iskren Ivov Chernev authored Jan 05, 2024
  
  d0215a58
04 Jan, 2024 1 commit
- Miner fix of type hint (#2340) · aee8ef66
  ljss authored Jan 04, 2024
  
  aee8ef66
03 Jan, 2024 1 commit
- Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) · fd4ea8ef
  Zhuohan Li authored Jan 04, 2024
  
  fd4ea8ef
26 Dec, 2023 1 commit
- [BUGFIX] Do not return ignored sentences twice in async llm engine (#2258) · e0ff9200
  Zhuohan Li authored Dec 26, 2023
  
  e0ff9200
21 Dec, 2023 1 commit
- Disable Ray usage stats collection (#2206) · 3a4fd5ca
  Woosuk Kwon authored Dec 20, 2023
  
  3a4fd5ca
18 Dec, 2023 3 commits
- Update Help Text for --gpu-memory-utilization Argument (#2183) · 290e015c
  Suhong Moon authored Dec 18, 2023
  
  290e015c
- [Minor] Fix typo (#2166) · bbe4466f
  JohnSaxon authored Dec 18, 2023
```
Co-authored-by: John-Saxon <zhang.xiangxuan@oushu.com>
```
  bbe4466f
- [BugFix] Raise error when max_model_len is larger than KV cache (#2163) · 8041b730
  Woosuk Kwon authored Dec 17, 2023
  
  8041b730
17 Dec, 2023 3 commits
- [Minor] Add more detailed explanation on `quantization` argument (#2145) · 30fb0956
  Woosuk Kwon authored Dec 17, 2023
  
  30fb0956
- Remove dependency on CuPy (#2152) · c3372e87
  Woosuk Kwon authored Dec 17, 2023
  
  c3372e87
- Optimize model execution with CUDA graph (#1926) · 37ca5581
  Woosuk Kwon authored Dec 16, 2023
```
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  37ca5581
15 Dec, 2023 2 commits
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  
  0fbfc4b8
- Add a flag to include stop string in output text (#1976) · c06170cc
  Yunfeng Bai authored Dec 15, 2023
  
  c06170cc
14 Dec, 2023 1 commit
- Fix typing in AsyncLLMEngine & add toml to requirements-dev (#2100) · 6774bd50
  mezuzza authored Dec 14, 2023
  
  6774bd50
08 Dec, 2023 1 commit
- Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836) · 6ccc0bff
  TJian authored Dec 08, 2023
```
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
```
  6ccc0bff
03 Dec, 2023 2 commits
- Fix num_gpus when TP > 1 (#1852) · 464dd985
  Woosuk Kwon authored Dec 03, 2023
  
  464dd985
- Add Production Metrics in Prometheus format (#1890) · 5313c2cb
  Simon Mo authored Dec 02, 2023
  
  5313c2cb
30 Nov, 2023 1 commit
- Refactor Worker & InputMetadata (#1843) · 27feead2
  Woosuk Kwon authored Nov 29, 2023
  
  27feead2
29 Nov, 2023 1 commit
- Better integration with Ray Serve (#1821) · 0229c386
  FlorianJoncour authored Nov 29, 2023
```
Co-authored-by: FlorianJoncour <florian@zetta-sys.com>
```
  0229c386
28 Nov, 2023 1 commit
- [FIX] Fix class naming (#1803) · 708e6c18
  Zhuohan Li authored Nov 28, 2023
  
  708e6c18
22 Nov, 2023 1 commit
- [DOCS] Add engine args documentation (#1741) · a921d8be
  Casper authored Nov 22, 2023
  
  a921d8be
21 Nov, 2023 1 commit
- fix RAM OOM when load large models in tensor parallel mode. (#1395) · 4bb6b671
  boydfd authored Nov 21, 2023
```
Co-authored-by: ran_lin <rlin@thoughtworks.com>
```
  4bb6b671
20 Nov, 2023 1 commit
- Migrate linter from `pylint` to `ruff` (#1665) · 5ffc0d13
  Simon Mo authored Nov 20, 2023
  
  5ffc0d13
16 Nov, 2023 2 commits
- [Minor] Fix duplication of ignored seq group in engine step (#1666) · cb08cd0d
  Simon Mo authored Nov 16, 2023
  
  cb08cd0d
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622) · 7076fa1c
  Zhuohan Li authored Nov 15, 2023
  
  Refactor the tensor parallelism, quantization, and weight-loading code.
  
  Summary of the new features enabled by this PR:
  - **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
  - Model loading code is much simpler.
  - Model parallelism is supported for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
  
  7076fa1c
11 Nov, 2023 1 commit
- Run default _AsyncLLMEngine._run_workers_async in threadpool (#1628) · 1b290ace
  Dominik Schwabe authored Nov 11, 2023
  
  1b290ace
01 Nov, 2023 1 commit
- [BugFix] Set engine_use_ray=True when TP>1 (#1531) · 5687d584
  ljss authored Nov 01, 2023
  
  5687d584
30 Oct, 2023 1 commit
- Add support for `spaces_between_special_tokens` · 7013a801
  Dan Lord authored Oct 30, 2023
  
  7013a801
22 Oct, 2023 1 commit
- Support SqueezeLLM (#1326) · 1f24755b
  chooper1 authored Oct 22, 2023
```
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  1f24755b
17 Oct, 2023 1 commit
- Change scheduler & input tensor shape (#1381) · c1376e0f
  Woosuk Kwon authored Oct 16, 2023
  
  c1376e0f
16 Oct, 2023 1 commit
- Implement prompt logprobs & Batched topk for computing logprobs (#1328) · 9d9072a0
  Zhuohan Li authored Oct 16, 2023
```
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
```
  9d9072a0
03 Oct, 2023 2 commits
- Use monotonic time where appropriate (#1249) · acbed3ef
  Antoni Baum authored Oct 02, 2023
  
  acbed3ef
- add support for tokenizer revision (#1163) · 66d18a7f
  Federico Cassano authored Oct 02, 2023
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  66d18a7f
28 Sep, 2023 3 commits
- Provide default max model length (#1224) · f936657e
  Woosuk Kwon authored Sep 28, 2023
  
  f936657e
- [Mistral] Mistral-7B-v0.1 support (#1196) · bb1ba58f
  Chris Bamford authored Sep 28, 2023
```
Co-authored-by: timlacroix <t@mistral.ai>
```
  bb1ba58f
- Add `skip_special_tokens` sampling params (#1186) · 20f7cc4c
  Dan Lord authored Sep 27, 2023
  
  20f7cc4c
27 Sep, 2023 1 commit
- Automatically configure `max_num_batched_tokens` (#1198) · a19bc5c6
  Woosuk Kwon authored Sep 27, 2023
  
  a19bc5c6