- 13 May, 2024 3 commits
-
-
youkaichao authored
-
Swapnil Parekh authored
-
Robert Shaw authored
-
- 12 May, 2024 1 commit
-
-
Yikang Shen authored
-
- 11 May, 2024 1 commit
-
-
Chang Su authored
-
- 10 May, 2024 11 commits
-
-
youkaichao authored
-
Robert Shaw authored
-
heeju-kim2 authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
-
Allen.Dou authored
-
SangBin Cho authored
Storing an exception frame is extremely prone to circular references because the frame holds references to objects. When tensorizer is not installed, this leaks the LLM instance, because the error frame references various modules and so creates a reference cycle. I also found that spec decoding has a circular reference issue, which I solved using weakref.proxy.
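The weakref.proxy approach mentioned in this commit message can be illustrated with a minimal sketch. The `Engine` and `Worker` classes are hypothetical stand-ins, not vLLM classes, and the immediate-release behaviour shown relies on CPython reference counting.

```python
import gc
import weakref


class Worker:
    """Hypothetical child object that needs to call back into its owner."""

    def __init__(self, owner):
        # weakref.proxy forwards attribute access to the owner without
        # holding a strong reference, so no owner <-> worker cycle forms.
        self.owner = weakref.proxy(owner)

    def owner_name(self):
        return self.owner.name


class Engine:
    """Hypothetical owner object (a stand-in for an LLM instance)."""

    def __init__(self, name):
        self.name = name
        self.worker = Worker(self)


def engine_is_freed_without_gc():
    """Return True if the engine is released by reference counting alone."""
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cycle collector doing the cleanup
    try:
        engine = Engine("demo")
        assert engine.worker.owner_name() == "demo"
        probe = weakref.ref(engine)
        del engine  # drops the only strong reference
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()
```

If `Worker` stored `owner` directly instead of a proxy, the two objects would reference each other and would only be reclaimed once the cycle collector runs.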
-
Steve Grubb authored
-
Kunshang Ji authored
-
Simon Mo authored
Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html
Co-authored-by: sang <rkooo567@gmail.com>
-
Allen.Dou authored
-
youkaichao authored
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
-
Cody Yu authored
-
- 09 May, 2024 11 commits
-
-
Philipp Moritz authored
This PR improves the FP8 performance of linear layers, which had been lacking before (see the comments on #4118). We noticed that cuBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16, so this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4). All FP8 measurements are for dynamic quantization:
qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
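The idea of enlarging a small matrix so that its first dimension exceeds 16 can be sketched in plain Python. This is only an illustration under stated assumptions, not the PR's implementation: `MIN_ROWS`, `pad_rows`, `matmul`, and `padded_matmul` are hypothetical names, and zero rows are assumed as the padding. The sketch shows that padding the left operand and slicing the result leaves the product unchanged.

```python
MIN_ROWS = 17  # assumption: smallest row count that is "greater than 16"


def pad_rows(matrix, min_rows=MIN_ROWS):
    """Append zero rows until the matrix has at least min_rows rows.

    Returns the (possibly padded) matrix and the original row count.
    """
    rows = len(matrix)
    if rows >= min_rows:
        return matrix, rows
    cols = len(matrix[0]) if matrix else 0
    padding = [[0.0] * cols for _ in range(min_rows - rows)]
    return matrix + padding, rows


def matmul(a, b):
    """Naive reference matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def padded_matmul(a, b):
    """Multiply via the padded left operand, then drop the padding rows."""
    padded, rows = pad_rows(a)
    return matmul(padded, b)[:rows]
```

Because the extra rows are all zeros, they contribute only zero rows to the output, which the final slice discards.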
-
Hao Zhang authored
Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
Michael Goin authored
-
Robert Shaw authored
-
Cyrus Leung authored
-
kliuae authored
Co-authored-by: miloice <jeffaw99@hotmail.com>
-
Woosuk Kwon authored
-
Woosuk Kwon authored
-
Cyrus Leung authored
-
Mahmoud Ashraf authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
alexm-nm authored
-
- 08 May, 2024 11 commits
-
-
Cade Daniel authored
-
Cody Yu authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
-
Woosuk Kwon authored
-
youkaichao authored
-
youkaichao authored
-
Antoni Baum authored
-
Woosuk Kwon authored
-
DefTruth authored
-
SangBin Cho authored
-
SangBin Cho authored
-
youkaichao authored
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)
-
- 07 May, 2024 2 commits
-
-
leiwen83 authored
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
youkaichao authored
-