Commits · 7076fa1c9f5769469bc2671afaca5af604a9bed3 · norm / vllm

16 Nov, 2023 1 commit
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622) · 7076fa1c
  Zhuohan Li authored Nov 15, 2023

Refactor the tensor parallelism, quantization, and weight-loading code.

Summary of the new features enabled by this PR:
- **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
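The last point can be illustrated with a minimal sketch. It assumes one common scheme and is not necessarily the exact code in #1622: when there are fewer key/value heads than tensor-parallel ranks, each KV head is replicated across a contiguous group of ranks instead of being split. The helper `kv_heads_for_rank` is hypothetical.

```python
from typing import List


def kv_heads_for_rank(total_num_kv_heads: int, tp_size: int, rank: int) -> List[int]:
    """Return the global KV-head indices stored on tensor-parallel `rank`.

    Hypothetical helper for illustration; not part of vLLM's API.
    """
    if total_num_kv_heads >= tp_size:
        # Enough KV heads: split them evenly across ranks, like query heads.
        assert total_num_kv_heads % tp_size == 0
        per_rank = total_num_kv_heads // tp_size
        return list(range(rank * per_rank, (rank + 1) * per_rank))
    # Fewer KV heads than ranks: each head is replicated on a contiguous
    # group of tp_size // total_num_kv_heads ranks.
    assert tp_size % total_num_kv_heads == 0
    replicas = tp_size // total_num_kv_heads
    return [rank // replicas]


if __name__ == "__main__":
    # GQA with 2 KV heads on 4 ranks: ranks 0-1 hold head 0, ranks 2-3 hold head 1.
    print([kv_heads_for_rank(2, 4, r) for r in range(4)])  # [[0], [0], [1], [1]]
    # MQA (1 KV head) on 4 ranks: every rank holds a copy of head 0.
    print([kv_heads_for_rank(1, 4, r) for r in range(4)])  # [[0], [0], [0], [0]]
```

Contiguous grouping keeps each rank's KV head aligned with its slice of query heads, assuming query heads are also partitioned contiguously: with 8 query heads and 2 KV heads on 4 ranks, rank 1 owns query heads 2 and 3, which both belong to KV group 0.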


14 Nov, 2023 1 commit
- Add DeepSpeed MII backend to benchmark script (#1649) · 660a7fcf
  Woosuk Kwon authored Nov 14, 2023
13 Nov, 2023 1 commit
- [Minor] Move RoPE selection logic to `get_rope` (#1633) · 054072be
  Woosuk Kwon authored Nov 12, 2023
12 Nov, 2023 1 commit
- Fix #1474 - AssertionError:assert param_slice.shape == loaded_weight.shape (#1631) · eb825c1e
  lirui authored Nov 13, 2023
11 Nov, 2023 1 commit
- Run default _AsyncLLMEngine._run_workers_async in threadpool (#1628) · 1b290ace
  Dominik Schwabe authored Nov 11, 2023
10 Nov, 2023 1 commit
- config parser: add ChatGLM2 seq_length to `_get_and_verify_max_len` (#1617) · 0d578228
  Sin authored Nov 10, 2023
09 Nov, 2023 3 commits
- Dockerfile: Upgrade Cuda to 12.1 (#1609) · aebfcb26
  GhaziSyed authored Nov 09, 2023
- Add Yi model to quantization support (#1600) · ab9e8488
  forpanyang authored Nov 10, 2023
- Build CUDA11.8 wheels for release (#1596) · fd58b73a
  Woosuk Kwon authored Nov 09, 2023
08 Nov, 2023 2 commits
- Fix input_metadata.selected_token_indices in worker prepare_inputs (#1546) · 8efe23f1
  Yanming W authored Nov 09, 2023
- Upgrade to CUDA 12 (#1527) · 06458a0b
  Zhuohan Li authored Nov 08, 2023
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
07 Nov, 2023 1 commit
- ChatGLM Support (#1261) · 1a2bbc93
  GoHomeToMacDonal authored Nov 07, 2023
06 Nov, 2023 1 commit
- Support Yi model (#1567) · e7f579eb
  Roy authored Nov 07, 2023
05 Nov, 2023 1 commit
- Add Quantization and AutoAWQ to docs (#1235) · 85169994
  Casper authored Nov 05, 2023
03 Nov, 2023 3 commits
- Support YaRN models (#1264) · 9f669a9a
  Antoni Baum authored Nov 03, 2023
```
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- Added logits processor API to sampling params (#1469) · 555bdcc5
  Noam Gat authored Nov 03, 2023
- docs: add description (#1553) · 54ca1ba7
  lots-o authored Nov 04, 2023
01 Nov, 2023 4 commits
- Force paged attention v2 for long contexts (#1510) · 9738b84a
  Antoni Baum authored Nov 01, 2023
- Remove `MPTConfig` (#1529) · 1fe09900
  Woosuk Kwon authored Nov 01, 2023
- Add `/health` Endpoint for both Servers (#1540) · 7e90a2d1
  Fluder-Paradyne authored Nov 01, 2023
- [BugFix] Set engine_use_ray=True when TP>1 (#1531) · 5687d584
  ljss authored Nov 01, 2023
31 Oct, 2023 5 commits
- Add `MptForCausalLM` key in model_loader (#1526) · cf8849f2
  Wenfei Yan authored Oct 31, 2023
- [Small] Formatter only checks lints in changed files (#1528) · e575df33
  Cade Daniel authored Oct 31, 2023
- Fix integer overflows in attention & cache ops (#1514) · 0ce8647d
  Woosuk Kwon authored Oct 31, 2023
- Add Dockerfile (#1350) · 9cabcb76
  Stephen Krider authored Oct 31, 2023
- [Fix] Fix duplicated logging messages (#1524) · 7b895c59
  Zhuohan Li authored Oct 31, 2023
30 Oct, 2023 5 commits
- Add support for `spaces_between_special_tokens` · 7013a801
  Dan Lord authored Oct 30, 2023
- Add py.typed so consumers of vLLM can get type checking (#1509) · 79a30912
  Jared Roesch authored Oct 30, 2023
```
* Add py.typed so consumers of vLLM can get type checking

* Update py.typed

---------
Co-authored-by: aarnphm <29749331+aarnphm@users.noreply.github.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
- Fix logging so we actually get info level entries in the log. (#1494) · 2f3d36a8
  Adam Brusselback authored Oct 30, 2023
- Refactor LLMEngine demo script for clarity and modularity (#1413) · ac8d36f3
  iongpt authored Oct 30, 2023
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
- Delay GPU->CPU sync in sampling (#1337) · 15f56323
  Antoni Baum authored Oct 30, 2023
29 Oct, 2023 4 commits
- Fix bias in InternLM (#1501) · aa9af07c
  Woosuk Kwon authored Oct 30, 2023
- Support repetition_penalty (#1424) · 69be658b
  ljss authored Oct 30, 2023
- fix: don't skip first special token. (#1497) · beac8dd4
  Ricardo Lu authored Oct 29, 2023
- Add rope_scaling to Aquila model (#1457) · 28b47d1e
  Qing authored Oct 29, 2023
22 Oct, 2023 1 commit
- Support SqueezeLLM (#1326) · 1f24755b
  chooper1 authored Oct 22, 2023
```
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
21 Oct, 2023 1 commit
- Pin pydantic dependency versions (#1429) · bf31d360
  Thiago Salvatore authored Oct 21, 2023
20 Oct, 2023 2 commits
- remove useless statements (#1408) · d189170b
  Wang Ran (汪然) authored Oct 20, 2023
- Fix type hints (#1427) · f61dc807
  Light Lin authored Oct 20, 2023
17 Oct, 2023 1 commit
- [BugFix] Define `__eq__` in SequenceGroupOutputs (#1389) · f8a1e39f
  Woosuk Kwon authored Oct 17, 2023