Unverified Commit 55211b01 authored by Szymon Ożóg, committed by GitHub

[Bugfix] Fix chunked prefill for GGUF (#14666)


Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
parent 5d043c16
@@ -98,6 +98,13 @@ MMQ_QUANT_TYPES = STANDARD_QUANT_TYPES | KQUANT_TYPES
 def _fuse_mul_mat(x: torch.Tensor, qweight: torch.Tensor,
                   qweight_type: int) -> torch.Tensor:
+    # HACK: when doing chunked prefill we don't generate output tokens
+    # so input to logits generator is empty which causes invalid parameter
+    if x.shape[0] == 0:
+        return torch.empty(x.shape[0],
+                           qweight.shape[0],
+                           dtype=x.dtype,
+                           device=x.device)
     # there is no need to call any kernel for fp16/bf16
     if qweight_type in UNQUANTIZED_TYPES:
         return x @ qweight.T
......
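
The guard added in this hunk can be illustrated in isolation. The following is a minimal, torch-free sketch rather than the project's code: `fake_quant_kernel` and `guarded_mul_mat` are hypothetical names, nested lists stand in for tensors, and the `ValueError` merely imitates the "invalid parameter" failure the commit comment mentions. It only shows why returning an empty result before the kernel call avoids the error when a prefill chunk produces no rows for the logits step.

```python
from typing import List

Matrix = List[List[float]]


def fake_quant_kernel(x: Matrix, weight: Matrix) -> Matrix:
    """Hypothetical stand-in for a kernel that rejects zero-row input."""
    if len(x) == 0:
        # Assumption: imitates the failure described in the commit comment.
        raise ValueError("invalid parameter: zero rows")
    # Compute x @ weight.T with plain loops.
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in weight]
            for row in x]


def guarded_mul_mat(x: Matrix, weight: Matrix) -> Matrix:
    """Skip the kernel entirely when there are no input rows."""
    if len(x) == 0:
        # A nested list cannot carry the (0, out_features) shape that
        # torch.empty preserves in the real patch; [] is the closest analogue.
        return []
    return fake_quant_kernel(x, weight)


if __name__ == "__main__":
    weight = [[3.0, 4.0], [5.0, 6.0]]
    print(guarded_mul_mat([], weight))            # empty batch, kernel skipped
    print(guarded_mul_mat([[1.0, 2.0]], weight))  # normal path
```

In the actual patch the early return keeps the output's second dimension (`qweight.shape[0]`) along with the input's dtype and device, so downstream code still receives a tensor of the expected layout.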