Commits · 18bfcdd05c657e6997b132488e6f4e74307d6cee · kecinstone / 2024pra-vllm

22 Jan, 2024 1 commit
- [Speculative decoding 2/9] Multi-step worker for draft model (#2424) · 18bfcdd0
  Cade Daniel authored Jan 21, 2024
  
21 Jan, 2024 1 commit
- Add `group` as an argument in broadcast ops (#2522) · 5b23c3f2
  Junda Chen authored Jan 20, 2024
  
19 Jan, 2024 1 commit
- Simplify broadcast logic for control messages (#2501) · ef9b636e
  Zhuohan Li authored Jan 19, 2024
  
03 Jan, 2024 1 commit
- Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) · fd4ea8ef
  Zhuohan Li authored Jan 04, 2024
  
17 Dec, 2023 2 commits
- Remove dependency on CuPy (#2152) · c3372e87
  Woosuk Kwon authored Dec 17, 2023
  
- Optimize model execution with CUDA graph (#1926) · 37ca5581
  Woosuk Kwon authored Dec 16, 2023
```
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
28 Nov, 2023 1 commit
- Correct comments in parallel_state.py (#1818) · a1125ad4
  explainerauthors authored Nov 28, 2023
  
16 Nov, 2023 1 commit
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622) · 7076fa1c
  Zhuohan Li authored Nov 15, 2023

Refactor the tensor parallelism, quantization, and weight-loading code.

Summary of the new features enabled by this PR:
- **All models** can now be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code is now much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
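
The last point can be illustrated with a small sketch. It assumes (the commit message does not spell this out) that when the tensor parallel size is at least the number of key/value heads, each KV head is replicated across several ranks so that every rank holds exactly one. The helper names `kv_head_partition` and `kv_heads_for_rank` are hypothetical and exist only for illustration.

```python
def kv_head_partition(total_kv_heads: int, tp_size: int) -> tuple[int, int]:
    """Return (kv_heads_per_rank, replicas_per_kv_head).

    Hypothetical helper for illustration; not taken from the vLLM code base.
    """
    if total_kv_heads <= 0 or tp_size <= 0:
        raise ValueError("total_kv_heads and tp_size must be positive")
    if tp_size >= total_kv_heads:
        # Fewer KV heads than ranks (MQA, or GQA with few KV heads):
        # assume each rank holds one KV head and heads are replicated.
        if tp_size % total_kv_heads != 0:
            raise ValueError("tp_size must be a multiple of total_kv_heads")
        return 1, tp_size // total_kv_heads
    # More KV heads than ranks: split the heads evenly, no replication.
    if total_kv_heads % tp_size != 0:
        raise ValueError("total_kv_heads must be a multiple of tp_size")
    return total_kv_heads // tp_size, 1


def kv_heads_for_rank(rank: int, total_kv_heads: int, tp_size: int) -> range:
    """Indices of the KV heads that a given tensor-parallel rank would hold."""
    per_rank, replicas = kv_head_partition(total_kv_heads, tp_size)
    start = (rank // replicas) * per_rank
    return range(start, start + per_rank)
```

Under this assumption, an MQA model with a single KV head and a tensor parallel size of 4 places a copy of that head on every rank, while a model with 8 KV heads on 4 ranks gives each rank two distinct heads.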


16 Oct, 2023 1 commit
- Implement prompt logprobs & Batched topk for computing logprobs (#1328) · 9d9072a0
  Zhuohan Li authored Oct 16, 2023
```
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
```
02 Oct, 2023 1 commit
- TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) · ba0bfd40
  Zhuohan Li authored Oct 02, 2023
  
18 Sep, 2023 1 commit
- [FIX] Don't initialize parameter by default (#1067) · 90979c38
  Zhuohan Li authored Sep 17, 2023
  
16 Sep, 2023 1 commit
- Implement AWQ quantization support for LLaMA (#1032) · e3e79e9e
  Woosuk Kwon authored Sep 16, 2023


Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>


02 Aug, 2023 1 commit
- Add Falcon support (new) (#592) · 1b0bd0fe
  Zhuohan Li authored Aug 02, 2023
  
25 Jul, 2023 1 commit
- fixed tensor parallel is not defined (#564) · 2d867b55
  MoeedDar authored Jul 25, 2023
  
17 Jun, 2023 1 commit
- Change the name to vLLM (#150) · 0b98ba15
  Woosuk Kwon authored Jun 17, 2023
  