Commits · 643ecf7b11a3e74c838f438cfc1b3e59c018853b · OpenDAS / vllm_cscc

17 Nov, 2024 1 commit
- [V1] Refactor model executable interface for all text-only language models (#10374) · 643ecf7b
  Roger Wang authored Nov 16, 2024
```
Signed-off-by: Roger Wang <ywang@roblox.com>
```
14 Nov, 2024 1 commit
- [misc] error early for old-style class (#10304) · 504ac53d
  youkaichao authored Nov 13, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
11 Nov, 2024 1 commit
- [6/N] pass whole config to inner model (#10205) · f89d18ff
  youkaichao authored Nov 10, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
09 Nov, 2024 1 commit
- [5/N] pass the whole config to model (#9983) · 1a95f10e
  youkaichao authored Nov 08, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
06 Nov, 2024 3 commits
- [V1] Make v1 more testable (#9888) · d58268c5
  Joe Runde authored Nov 06, 2024
```
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
```
- Remove ScaledActivation for AWQ (#10057) · 399c7986
  Michael Goin authored Nov 06, 2024
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
- [CI/Build] drop support for Python 3.8 EOL (#8464) · 21063c11
  Aaron Pham authored Nov 06, 2024
```
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
```
24 Oct, 2024 1 commit
- [torch.compile] expanding support and fix allgather compilation (#9637) · ad6f7805
  Yongzao authored Oct 24, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
```
04 Oct, 2024 1 commit
- [Models] Add remaining model PP support (#7168) · 0f6d7a9a
  Murali Andoorveedu authored Oct 03, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
30 Aug, 2024 1 commit
- [Core] Logprobs support in Multi-step (#7652) · 428dd144
  afeldman-nm authored Aug 29, 2024
20 Aug, 2024 1 commit
- [Bugfix] support `tie_word_embeddings` for all models (#5724) · f4fc7337
  Zijian Hu authored Aug 19, 2024
13 Aug, 2024 1 commit
- [Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) · 7025b11d
  Cyrus Leung authored Aug 13, 2024
11 Jul, 2024 1 commit
- [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. (#6326) · 8a1415cf
  Thomas Parnell authored Jul 11, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
```
02 Jul, 2024 2 commits
- [CORE] Quantized lm-head Framework (#4442) · ee93f4f9
  Qubitium-ModelCloud authored Jul 03, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
```
- [Core] Pipeline Parallel Support (#4412) · c5832d2a
  Murali Andoorveedu authored Jul 02, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
27 Jun, 2024 2 commits
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896) · 98cf2ed6
  Cyrus Leung authored Jun 28, 2024
- [Model] Add base class for LoRA-supported models (#5018) · 96354d6a
  Cyrus Leung authored Jun 27, 2024
14 Jun, 2024 1 commit
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460) · e2afb03c
  Thomas Parnell authored Jun 14, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
22 May, 2024 2 commits
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with .kv_scale parameter).
```
13 May, 2024 1 commit
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
27 Apr, 2024 1 commit
- [BugFix] Resolved Issues For LinearMethod --> QuantConfig (#4418) · 4ea1f967
  Robert Shaw authored Apr 27, 2024
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
16 Apr, 2024 1 commit
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
10 Apr, 2024 1 commit
- [Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f
  youkaichao authored Apr 10, 2024
```
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
```
25 Mar, 2024 1 commit
- [Core] Refactor Attention Take 2 (#3462) · 925f3332
  Woosuk Kwon authored Mar 24, 2024
20 Mar, 2024 1 commit
- Migrate `logits` computation and gather to `model_runner` (#3233) · f1c0fc39
  Roy authored Mar 21, 2024
07 Mar, 2024 1 commit
- Separate attention backends (#3005) · 2daf23ab
  Woosuk Kwon authored Mar 07, 2024
03 Jan, 2024 1 commit
- Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) · fd4ea8ef
  Zhuohan Li authored Jan 04, 2024
17 Dec, 2023 1 commit
- Optimize model execution with CUDA graph (#1926) · 37ca5581
  Woosuk Kwon authored Dec 16, 2023
```
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
30 Nov, 2023 1 commit
- Refactor Worker & InputMetadata (#1843) · 27feead2
  Woosuk Kwon authored Nov 29, 2023
24 Nov, 2023 1 commit
- Fix model docstrings (#1764) · 7c600440
  Woosuk Kwon authored Nov 23, 2023
20 Nov, 2023 1 commit
- Migrate linter from `pylint` to `ruff` (#1665) · 5ffc0d13
  Simon Mo authored Nov 20, 2023
19 Nov, 2023 1 commit
- Add AWQ support for all models (#1714) · 8d17774f
  Woosuk Kwon authored Nov 18, 2023
16 Nov, 2023 1 commit
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622) · 7076fa1c
  Zhuohan Li authored Nov 15, 2023

  Refactor the tensor parallelism, quantization, and weight-loading code.

  Summary of the new features enabled by this PR:
  - **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
  - Model loading code became much simpler.
  - Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
02 Oct, 2023 1 commit
- TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) · ba0bfd40
  Zhuohan Li authored Oct 02, 2023
13 Sep, 2023 1 commit
- Add Model Revision Support (#1014) · ab019eea
  Jasmond L authored Sep 14, 2023
```
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
07 Sep, 2023 1 commit
- Enable safetensors loading for all models (#974) · c957c741
  Zhuohan Li authored Sep 07, 2023
05 Sep, 2023 1 commit
- Align vLLM's beam search implementation with HF generate (#857) · 002800f0
  Zhuohan Li authored Sep 04, 2023
30 Aug, 2023 1 commit
- Accelerate LLaMA model loading (#234) · 0d93f156
  JFDuan authored Aug 30, 2023