Commits · 0df7ec0b2d890799ca71e2f862fdff5fcc52cdc0 · OpenDAS / vllm_cscc

19 Aug, 2024 1 commit
- [Misc] Remove Gemma RoPE (#7638) · df845b2b
  Woosuk Kwon authored Aug 19, 2024
13 Aug, 2024 1 commit
- [Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) · 7025b11d
  Cyrus Leung authored Aug 13, 2024
01 Aug, 2024 2 commits
- [Misc] Support attention logits soft-capping with flash-attn (#7022) · 805a8a75
  Woosuk Kwon authored Aug 01, 2024
- [Bugfix] Lower gemma's unloaded_params exception to warning (#7002) · f4fd390f
  Michael Goin authored Aug 01, 2024
04 Jul, 2024 1 commit
- [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051) · 69ec3ca1
  Lily Liu authored Jul 04, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
02 Jul, 2024 2 commits
- [CORE] Quantized lm-head Framework (#4442) · ee93f4f9
  Qubitium-ModelCloud authored Jul 03, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
```
- [Core] Pipeline Parallel Support (#4412) · c5832d2a
  Murali Andoorveedu authored Jul 02, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
27 Jun, 2024 3 commits
- [Model] Add Gemma 2 (#5908) · 79c92c7c
  Woosuk Kwon authored Jun 27, 2024
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896) · 98cf2ed6
  Cyrus Leung authored Jun 28, 2024
- [Model] Add base class for LoRA-supported models (#5018) · 96354d6a
  Cyrus Leung authored Jun 27, 2024
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
22 May, 2024 1 commit
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024

  The 2nd PR for #4532.

  This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with a `.kv_scale` parameter).
13 May, 2024 1 commit
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
26 Apr, 2024 2 commits
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
- [CI] Disable non-lazy string operation on logging (#4326) · a88081bf
  SangBin Cho authored Apr 26, 2024
```
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
```
16 Apr, 2024 1 commit
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
10 Apr, 2024 1 commit
- [Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f
  youkaichao authored Apr 10, 2024

  [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
27 Mar, 2024 1 commit
- [Bugfix] More faithful implementation of Gemma (#3653) · 82c540be
  Woosuk Kwon authored Mar 27, 2024
25 Mar, 2024 2 commits
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
- [Core] Refactor Attention Take 2 (#3462) · 925f3332
  Woosuk Kwon authored Mar 24, 2024
21 Mar, 2024 1 commit
- [BugFix] gemma loading after quantization or LoRA. (#3553) · b7050ca7
  Taemin Lee authored Mar 22, 2024
20 Mar, 2024 1 commit
- Migrate `logits` computation and gather to `model_runner` (#3233) · f1c0fc39
  Roy authored Mar 21, 2024
07 Mar, 2024 2 commits
- Separate attention backends (#3005) · 2daf23ab
  Woosuk Kwon authored Mar 07, 2024
- Add GPTQ support for Gemma (#3200) · d3c04b6a
  TechxGenus authored Mar 07, 2024
28 Feb, 2024 1 commit
- Add LoRA support for Gemma (#3050) · 929b4f29
  Woosuk Kwon authored Feb 28, 2024
22 Feb, 2024 2 commits
- Optimize GeGLU layer in Gemma (#2975) · fd5dcc5c
  Woosuk Kwon authored Feb 21, 2024
- Use Llama RMSNorm custom op for Gemma (#2974) · 95529e32
  Woosuk Kwon authored Feb 21, 2024
21 Feb, 2024 1 commit
- Add Gemma model (#2964) · 5253edaa
  Xiang Xu authored Feb 21, 2024
25 Jan, 2024 1 commit
- fix names and license for Qwen2 (#2589) · 2832e7b9
  Junyang Lin authored Jan 25, 2024
22 Jan, 2024 1 commit
- Add qwen2 (#2495) · 94b5edeb
  Junyang Lin authored Jan 23, 2024
03 Jan, 2024 1 commit
- Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) · fd4ea8ef
  Zhuohan Li authored Jan 04, 2024
17 Dec, 2023 1 commit
- Optimize model execution with CUDA graph (#1926) · 37ca5581
  Woosuk Kwon authored Dec 16, 2023

  Co-authored-by: Chen Shen <scv119@gmail.com>
  Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
30 Nov, 2023 1 commit
- Refactor Worker & InputMetadata (#1843) · 27feead2
  Woosuk Kwon authored Nov 29, 2023
29 Nov, 2023 1 commit
- Refactor Attention (#1840) · a9e45742
  Woosuk Kwon authored Nov 29, 2023
24 Nov, 2023 1 commit
- Fix model docstrings (#1764) · 7c600440
  Woosuk Kwon authored Nov 23, 2023
20 Nov, 2023 1 commit
- Migrate linter from `pylint` to `ruff` (#1665) · 5ffc0d13
  Simon Mo authored Nov 20, 2023
19 Nov, 2023 1 commit
- [Optimization] Implement fused add rmsnorm (#1667) · e1054247
  ljss authored Nov 19, 2023
16 Nov, 2023 1 commit
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622) · 7076fa1c
  Zhuohan Li authored Nov 15, 2023

  Refactor the tensor parallelism, quantization, and weight-loading code.

  Summary of the new features enabled by this PR:
  - **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
  - Model loading code became much simpler.
  - Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
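
The last item in the #1622 summary above (model parallelism when the number of key/value heads is smaller than the tensor parallel size) can be illustrated with a minimal sketch. This is not the code from the PR: the helper names `kv_heads_per_rank` and `kv_heads_for_rank` are hypothetical, and the layout in which consecutive ranks share a replicated head is an assumption made for illustration.

```python
from typing import List, Tuple


def kv_heads_per_rank(total_kv_heads: int, tp_size: int) -> Tuple[int, int]:
    """Return (KV heads held by each rank, ranks sharing each KV head).

    Hypothetical helper for illustration; not taken from the PR.
    """
    if total_kv_heads <= 0 or tp_size <= 0:
        raise ValueError("head count and tensor parallel size must be positive")
    if total_kv_heads >= tp_size:
        # Enough heads to go around: partition them evenly across ranks.
        if total_kv_heads % tp_size != 0:
            raise ValueError("KV heads must be divisible by the tensor parallel size")
        return total_kv_heads // tp_size, 1
    # Fewer heads than ranks: replicate each head on several ranks.
    if tp_size % total_kv_heads != 0:
        raise ValueError("tensor parallel size must be divisible by the KV head count")
    return 1, tp_size // total_kv_heads


def kv_heads_for_rank(rank: int, total_kv_heads: int, tp_size: int) -> List[int]:
    """List the global KV head indices that a given rank would hold."""
    if not 0 <= rank < tp_size:
        raise ValueError("rank must be in [0, tp_size)")
    per_rank, replicas = kv_heads_per_rank(total_kv_heads, tp_size)
    if replicas == 1:
        start = rank * per_rank
        return list(range(start, start + per_rank))
    # Assumed layout: consecutive ranks share the same replicated head.
    return [rank // replicas]
```

Under this assumed layout, with 2 key/value heads and a tensor parallel size of 4, ranks 0 and 1 would hold head 0 and ranks 2 and 3 would hold head 1.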

22 Oct, 2023 1 commit
- Support SqueezeLLM (#1326) · 1f24755b
  chooper1 authored Oct 22, 2023

  Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>