Commits · 515386ef3cacb44a2bcfab9d66eaee6143d94e95 · OpenDAS / vllm_cscc

28 Mar, 2024 1 commit
- [Core] Support multi-node inference(eager and cuda graph) (#3686) · 515386ef
  Roy authored Mar 29, 2024
  
  515386ef
27 Mar, 2024 2 commits
- [Bugfix] [Hotfix] fix nccl library name (#3661) · d18f4e73
  youkaichao authored Mar 27, 2024
  
  d18f4e73
- [Core] remove cupy dependency (#3625) · 8f44facd
  youkaichao authored Mar 27, 2024
  
  8f44facd
25 Mar, 2024 1 commit
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
  01bfb22b
22 Mar, 2024 1 commit
- [BugFix] Some fixes for custom allreduce kernels (#2760) · f721096d
  Hanzhi Zhou authored Mar 21, 2024
  
  f721096d
15 Mar, 2024 1 commit
- Fix `dist.broadcast` stall without group argument (#3408) · 429284dc
  Junda Chen authored Mar 14, 2024
  
  429284dc
11 Mar, 2024 1 commit
- Re-enable the 80 char line width limit (#3305) · 2f8844ba
  Zhuohan Li authored Mar 10, 2024
  
  2f8844ba
22 Feb, 2024 1 commit
- chore(vllm): codespell for spell checking (#2820) · 93dc5a28
  Massimiliano Pronesti authored Feb 22, 2024
  
  93dc5a28
14 Feb, 2024 1 commit
- Don't use cupy NCCL for AMD backends (#2855) · 25e86b6a
  Woosuk Kwon authored Feb 14, 2024
  
  25e86b6a
13 Feb, 2024 1 commit
- Use CuPy for CUDA graphs (#2811) · a463c333
  Woosuk Kwon authored Feb 13, 2024
  
  a463c333
30 Jan, 2024 2 commits
- [Minor] Fix false warning when TP=1 (#2674) · 105a40f5
  Woosuk Kwon authored Jan 30, 2024
  
  105a40f5
- [Minor] Fix a small typo (#2672) · bbe9bd96
  Philipp Moritz authored Jan 30, 2024
  
  bbe9bd96
27 Jan, 2024 1 commit
- Implement custom all reduce kernels (#2192) · 38017003
  Hanzhi Zhou authored Jan 28, 2024
  
  38017003
23 Jan, 2024 1 commit
- [Experimental] Add multi-LoRA support (#1804) · 9b945daa
  Antoni Baum authored Jan 24, 2024
  
  Co-authored-by: Chen Shen <scv119@gmail.com>
  Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
  Co-authored-by: Avnish Narayan <avnish@anyscale.com>
  
  9b945daa
22 Jan, 2024 1 commit
- [Speculative decoding 2/9] Multi-step worker for draft model (#2424) · 18bfcdd0
  Cade Daniel authored Jan 21, 2024
  
  18bfcdd0
21 Jan, 2024 1 commit
- Add `group` as an argument in broadcast ops (#2522) · 5b23c3f2
  Junda Chen authored Jan 20, 2024
  
  5b23c3f2
19 Jan, 2024 1 commit
- Simplify broadcast logic for control messages (#2501) · ef9b636e
  Zhuohan Li authored Jan 19, 2024
  
  ef9b636e
03 Jan, 2024 1 commit
- Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) · fd4ea8ef
  Zhuohan Li authored Jan 04, 2024
  
  fd4ea8ef
17 Dec, 2023 2 commits
- Remove dependency on CuPy (#2152) · c3372e87
  Woosuk Kwon authored Dec 17, 2023
  
  c3372e87
- Optimize model execution with CUDA graph (#1926) · 37ca5581
  Woosuk Kwon authored Dec 16, 2023
  
  Co-authored-by: Chen Shen <scv119@gmail.com>
  Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
  
  37ca5581
28 Nov, 2023 1 commit
- Correct comments in parallel_state.py (#1818) · a1125ad4
  explainerauthors authored Nov 28, 2023
  
  a1125ad4
16 Nov, 2023 1 commit
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622) · 7076fa1c
  Zhuohan Li authored Nov 15, 2023
  
  Refactor the tensor parallelism, quantization, and weight-loading code.
  
  Summary of the new features enabled by this PR:
  - **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
  - Model loading code became much simpler.
  - Model parallelism is supported for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
  
  7076fa1c
16 Oct, 2023 1 commit
- Implement prompt logprobs & Batched topk for computing logprobs (#1328) · 9d9072a0
  Zhuohan Li authored Oct 16, 2023
  
  Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
  
  9d9072a0
02 Oct, 2023 1 commit
- TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) · ba0bfd40
  Zhuohan Li authored Oct 02, 2023
  
  ba0bfd40
18 Sep, 2023 1 commit
- [FIX] Don't initialize parameter by default (#1067) · 90979c38
  Zhuohan Li authored Sep 17, 2023
  
  90979c38
16 Sep, 2023 1 commit
- Implement AWQ quantization support for LLaMA (#1032) · e3e79e9e
  Woosuk Kwon authored Sep 16, 2023
  
  Co-authored-by: Robert Irvine <robert@seamlessml.com>
  Co-authored-by: root <rirv938@gmail.com>
  Co-authored-by: Casper <casperbh.96@gmail.com>
  Co-authored-by: julian-q <julianhquevedo@gmail.com>
  
  e3e79e9e
02 Aug, 2023 1 commit
- Add Falcon support (new) (#592) · 1b0bd0fe
  Zhuohan Li authored Aug 02, 2023
  
  1b0bd0fe
25 Jul, 2023 1 commit
- fixed tensor parallel is not defined (#564) · 2d867b55
  MoeedDar authored Jul 25, 2023
  
  2d867b55
17 Jun, 2023 1 commit
- Change the name to vLLM (#150) · 0b98ba15
  Woosuk Kwon authored Jun 17, 2023
  
  0b98ba15