- 31 Jan, 2025 9 commits
-
-
Ryan Nguyen authored
**[Guided decoding performance optimization]** Sending the xgrammar guided decoding bitmask to the GPU (`self.token_bitmask.to(scores.device)`) is a blocking operation that prevents the CPU from pre-launching the sampler kernels: the CPU waits until decode is complete, then copies the bitmask over. This PR makes the copy asynchronous by setting `non_blocking=True`. Currently, the CPU is blocked on a `cudaStreamSynchronize` and launches the sampling kernels only after the bitmask has been applied; with the optimization, this is no longer the case. The PR description illustrates both behaviors with Nsys profiles of one decode phase from Llama 3.1 8B.
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
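The change described in this commit can be illustrated with a minimal PyTorch sketch. It is not the vLLM or xgrammar code: the helper name `apply_token_bitmask` and the boolean mask layout are assumptions made for illustration, and xgrammar's real bitmask format may differ. The only API relied on is `Tensor.to(device, non_blocking=True)`, which is asynchronous with respect to the host only when the source tensor is in pinned memory and the destination is a CUDA device.

```python
import torch


def apply_token_bitmask(scores: torch.Tensor, allowed: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: mask out tokens that the grammar does not allow."""
    # non_blocking=True asks PyTorch not to wait for the copy on the host.
    # On a CPU-only machine the flag is simply ignored.
    allowed_on_device = allowed.to(scores.device, non_blocking=True)
    # The masking runs on the same stream as the copy, so it sees the copied data.
    return scores.masked_fill(~allowed_on_device, float("-inf"))


device = "cuda" if torch.cuda.is_available() else "cpu"
scores = torch.zeros(2, 4, device=device)
allowed = torch.tensor([[True, False, True, False],
                        [False, True, True, True]])
if device == "cuda":
    # Pinned host memory is what makes the host-to-device copy truly asynchronous.
    allowed = allowed.pin_memory()

masked = apply_token_bitmask(scores, allowed)
```

With a pageable (unpinned) source tensor the copy may still synchronize, so pinning the host buffer is part of the technique rather than an optional extra.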
-
Tyler Michael Smith authored
Integrates the block-quantized kernels introduced in https://github.com/vllm-project/vllm/pull/11868 for use in linear layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
-
Robert Shaw authored
SUMMARY:
* The previous PR for pulling in block configs (https://github.com/vllm-project/vllm/pull/11589/files) also changed the defaults for FP8.
* This broke L4 MoE, since there was not enough SHM for the default configuration.
* This reverts the non-block example to the default.

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
-
Chen Zhang authored
This PR adds extra keys to the block hash so that two blocks with the same token string but different `extra_keys` in their parent blocks produce different hash values. For example, it generates different hash values for the second block of the following two requests:

```python
# The two requests share the same tokens and the same second multimodal hash;
# only the first multimodal hash differs ("hash1" vs "hash3").
request1 = make_request(
    request_id=0,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{"offset": 0, "length": 3}, {"offset": 3, "length": 3}],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{"offset": 0, "length": 3}, {"offset": 3, "length": 3}],
    mm_hashes=["hash3", "hash2"],
)
```

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
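The idea of chaining extra keys through parent hashes can be sketched with the standard library alone. This is an illustrative model, not vLLM's implementation: the function names `hash_block` and `hash_request`, the use of SHA-256, the block size of 3, and the rule that a block's extra keys are the hashes of the multimodal items overlapping it are all assumptions.

```python
import hashlib
from typing import Dict, List, Optional, Sequence, Tuple


def hash_block(parent_hash: Optional[str],
               token_ids: Sequence[int],
               extra_keys: Tuple[str, ...] = ()) -> str:
    """Hash a block from its parent's hash, its tokens, and its extra keys."""
    payload = repr((parent_hash, tuple(token_ids), tuple(extra_keys)))
    return hashlib.sha256(payload.encode()).hexdigest()


def hash_request(token_ids: Sequence[int],
                 block_size: int,
                 mm_positions: Sequence[Dict[str, int]],
                 mm_hashes: Sequence[str]) -> List[str]:
    """Return the chained hashes of all full blocks in a request."""
    hashes: List[str] = []
    parent: Optional[str] = None
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        # Assumed rule: extra keys are the hashes of multimodal items
        # whose token range overlaps this block.
        extra = tuple(
            h for pos, h in zip(mm_positions, mm_hashes)
            if pos["offset"] < end and pos["offset"] + pos["length"] > start
        )
        parent = hash_block(parent, token_ids[start:end], extra)
        hashes.append(parent)
    return hashes


tokens = list(range(6))
positions = [{"offset": 0, "length": 3}, {"offset": 3, "length": 3}]
h1 = hash_request(tokens, 3, positions, ["hash1", "hash2"])
h2 = hash_request(tokens, 3, positions, ["hash3", "hash2"])
```

In this sketch the second blocks of the two requests have identical tokens and identical extra keys (`"hash2"`), yet `h1[1]` and `h2[1]` differ because each block hash includes its parent's hash, which already reflects `"hash1"` or `"hash3"`.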
-
Robert Shaw authored
Co-authored-by: simon-mo <xmo@berkeley.edu>
-
Roger Wang authored
-
Lucas Wilkinson authored
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
-
Aleksandr Malyshev authored
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
-
Lucas Wilkinson authored
-
- 30 Jan, 2025 4 commits
-
-
Michael Goin authored
Signed-off-by: mgoin <michael@neuralmagic.com>
-
Robert Shaw authored
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
-
Beim authored
Signed-off-by: Beim <beim2015@outlook.com>
-
Mark McLoughlin authored
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
-
- 29 Jan, 2025 11 commits
-
-
Woosuk Kwon authored
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
Jinzhen Lin authored
-
Pavani Majety authored
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
-
Yanyi Liu authored
Signed-off-by: liuyanyi <wolfsonliu@163.com>
-
Alphi authored
Signed-off-by: hzh <hezhihui_thu@163.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah <f1renze.142857@gmail.com>
Signed-off-by: Yikun <yikunkero@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: sixgod <evethwillbeok@outlook.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com>
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
-
Travis Johnson authored
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Wallas Santos <wallashss@ibm.com>
-
Maximilien de Bayser authored
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
-
Robert Shaw authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
Michael Goin authored
Signed-off-by: mgoin <michael@neuralmagic.com>
-
Mark McLoughlin authored
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
-
Ce Gao authored
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
-
- 28 Jan, 2025 11 commits
-
-
fenghuizhang authored
Signed-off-by: Fenghui Zhang <fhzhang@google.com>
-
Mark McLoughlin authored
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
-
Michael Goin authored
Signed-off-by: mgoin <michael@neuralmagic.com>
-
Mark McLoughlin authored
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
-
Cyrus Leung authored
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
-
Sebastian Schoennenbeck authored
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com>
-
Robert Shaw authored
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
-
Mengqing Cao authored
Signed-off-by: Mengqing Cao <cmq0113@163.com>
-
Gabriel Marinho authored
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com>
-
Liangfu Chen authored
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: Jiangfei Duan <jfduan@outlook.com>
-
Harry Mellor authored
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
-
- 27 Jan, 2025 5 commits
-
-
Nicolò Lucchesi authored
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and `prompt_logprobs` with ChunkedPrefill (#10132)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wallashss <wallashss@ibm.com>
Co-authored-by: wallashss <wallashss@ibm.com>
-
Bowen Wang authored
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
-
Mark McLoughlin authored
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
-
Isotr0py authored
Signed-off-by: Isotr0py <2037008807@qq.com>
-
Woosuk Kwon authored
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-