Commits · d6459b4516dbac4f346ce29fe90d43ebfafa1114 · OpenDAS / vllm_cscc

24 Oct, 2024 1 commit
- [Bugfix] Fix PP for ChatGLM and Molmo (#9422) · 836e8ef6
  Cyrus Leung authored Oct 24, 2024
  
16 Oct, 2024 1 commit
- [Core] Rename input data types (#8688) · cee711fd
  Cyrus Leung authored Oct 16, 2024
  
11 Oct, 2024 1 commit
- [Model] Add GLM-4v support and meet vllm==0.6.2 (#9242) · 6cf1167c
  sixgod authored Oct 12, 2024
  
04 Oct, 2024 1 commit

[Models] Add remaining model PP support (#7168) · 0f6d7a9a

Murali Andoorveedu authored Oct 03, 2024

Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>


30 Aug, 2024 1 commit
- [Core] Logprobs support in Multi-step (#7652) · 428dd144
  afeldman-nm authored Aug 29, 2024
  
20 Aug, 2024 1 commit
- [Bugfix] support `tie_word_embeddings` for all models (#5724) · f4fc7337
  Zijian Hu authored Aug 19, 2024
  
13 Aug, 2024 1 commit
- [Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) · 7025b11d
  Cyrus Leung authored Aug 13, 2024
  
02 Jul, 2024 2 commits
- [CORE] Quantized lm-head Framework (#4442) · ee93f4f9
  Qubitium-ModelCloud authored Jul 03, 2024
  Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
  Co-authored-by: ZX <zx@lbx.dev>
- [Core] Pipeline Parallel Support (#4412) · c5832d2a
  Murali Andoorveedu authored Jul 02, 2024
  Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
27 Jun, 2024 2 commits
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896) · 98cf2ed6
  Cyrus Leung authored Jun 28, 2024
  
- [Model] Add base class for LoRA-supported models (#5018) · 96354d6a
  Cyrus Leung authored Jun 27, 2024
  
22 May, 2024 1 commit

[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0

Cody Yu authored May 22, 2024

The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (one with a .kv_scale parameter).


18 May, 2024 1 commit

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently, the rotary embedding kernel must be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
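
To make the idea of one cos-sin cache per scaling factor concrete, here is a minimal pure-Python sketch. It is not the kernel from this PR: it assumes linear RoPE scaling (position divided by the scaling factor), and every name in it (`build_batched_cache`, `apply_batched_rope`, `factor_ids`) is hypothetical. The per-factor caches are concatenated once, and each token selects its row through an offset, so a mixed batch is handled in a single pass.

```python
import math


def build_cos_sin_cache(rotary_dim, max_position, base, scaling_factor):
    """Build (cos, sin) rows for linear-scaled RoPE, one row per position."""
    inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(rotary_dim // 2)]
    rows = []
    # Assumption: a scaled adapter covers scaling_factor times as many positions.
    for pos in range(int(max_position * scaling_factor)):
        angles = [(pos / scaling_factor) * f for f in inv_freq]
        rows.append(([math.cos(a) for a in angles], [math.sin(a) for a in angles]))
    return rows


def build_batched_cache(rotary_dim, max_position, base, scaling_factors):
    """Concatenate the per-factor caches and record where each one starts."""
    cache, offsets = [], []
    for factor in scaling_factors:
        offsets.append(len(cache))
        cache.extend(build_cos_sin_cache(rotary_dim, max_position, base, factor))
    return cache, offsets


def rotate(vec, cos, sin):
    """Rotate one head vector, pairing element i with element i + half."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


def apply_batched_rope(vectors, positions, factor_ids, cache, offsets):
    """Apply RoPE to a batch whose tokens may use different scaling factors."""
    out = []
    for vec, pos, fid in zip(vectors, positions, factor_ids):
        # The offset picks the cache that belongs to this token's scaling factor.
        cos, sin = cache[offsets[fid] + pos]
        out.append(rotate(vec, cos, sin))
    return out


cache, offsets = build_batched_cache(
    rotary_dim=4, max_position=8, base=10000.0, scaling_factors=[1.0, 4.0]
)
batch = apply_batched_rope(
    vectors=[[1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0]],
    positions=[1, 4],
    factor_ids=[0, 1],
    cache=cache,
    offsets=offsets,
)
```

In this sketch the second token (position 4, scaling factor 4.0) ends up with the same rotation as the first (position 1, scaling factor 1.0), which is what linear scaling is meant to achieve.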


13 May, 2024 1 commit
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
16 Apr, 2024 1 commit
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
  
10 Apr, 2024 1 commit

[Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f

youkaichao authored Apr 10, 2024

[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)


26 Mar, 2024 1 commit
- Enable more models to inference based on LoRA (#3382) · 8af890a8
  Jee Li authored Mar 26, 2024
  Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
25 Mar, 2024 2 commits
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
- [Core] Refactor Attention Take 2 (#3462) · 925f3332
  Woosuk Kwon authored Mar 24, 2024
  
20 Mar, 2024 1 commit
- Migrate `logits` computation and gather to `model_runner` (#3233) · f1c0fc39
  Roy authored Mar 21, 2024
  
07 Mar, 2024 1 commit
- Separate attention backends (#3005) · 2daf23ab
  Woosuk Kwon authored Mar 07, 2024
  
03 Jan, 2024 1 commit
- Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) · fd4ea8ef
  Zhuohan Li authored Jan 04, 2024
  
17 Dec, 2023 1 commit

Optimize model execution with CUDA graph (#1926) · 37ca5581

Woosuk Kwon authored Dec 16, 2023


Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>


15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  
30 Nov, 2023 1 commit
- Refactor Worker & InputMetadata (#1843) · 27feead2
  Woosuk Kwon authored Nov 29, 2023
  
29 Nov, 2023 2 commits
- Refactor Attention (#1840) · a9e45742
  Woosuk Kwon authored Nov 29, 2023
  
- [Fix] Fix RoPE in ChatGLM-32K (#1841) · a7b3e330
  Woosuk Kwon authored Nov 29, 2023
  
24 Nov, 2023 1 commit
- Fix model docstrings (#1764) · 7c600440
  Woosuk Kwon authored Nov 23, 2023
  
20 Nov, 2023 1 commit
- Migrate linter from `pylint` to `ruff` (#1665) · 5ffc0d13
  Simon Mo authored Nov 20, 2023
  
16 Nov, 2023 1 commit

TP/quantization/weight loading refactor part 2 - Refactor quantized linear... · 7076fa1c

Zhuohan Li authored Nov 15, 2023

TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)

Refactor the tensor parallelism, quantization, and weight-loading code.

Summary of the new features enabled by this PR:
- **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code is much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
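
The last point, running MQA/GQA models when there are fewer key/value heads than tensor-parallel ranks, can be illustrated with a small sketch. It assumes the common approach of replicating each KV head across a group of ranks; the function name `kv_heads_for_rank` and its parameters are hypothetical and are not taken from the PR.

```python
def kv_heads_for_rank(total_kv_heads, tp_size, tp_rank):
    """Return the indices of the KV heads held by one tensor-parallel rank."""
    if total_kv_heads >= tp_size:
        # Enough heads to go around: each rank gets a disjoint contiguous slice.
        assert total_kv_heads % tp_size == 0
        per_rank = total_kv_heads // tp_size
        start = tp_rank * per_rank
        return list(range(start, start + per_rank))
    # Fewer KV heads than ranks: each head is replicated on a group of ranks.
    assert tp_size % total_kv_heads == 0
    replicas = tp_size // total_kv_heads
    return [tp_rank // replicas]


# 8 KV heads over 4 ranks: every rank owns two distinct heads.
partitioned = [kv_heads_for_rank(8, 4, rank) for rank in range(4)]

# 2 KV heads over 8 ranks: each head is shared by four consecutive ranks.
replicated = [kv_heads_for_rank(2, 8, rank) for rank in range(8)]
```

With replication, each rank still holds exactly one KV head, so the per-rank attention computation keeps the same shape as in the partitioned case; the cost is that KV weights and cache entries are duplicated across the ranks in a group.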


07 Nov, 2023 1 commit
- ChatGLM Support (#1261) · 1a2bbc93
  GoHomeToMacDonal authored Nov 07, 2023
  