"git@developer.sourcefind.cn:zhaoyu6/sglang.git" did not exist on "399e7ec8b3bcc681ed55e98a761466a6e6d78f6b"
  • Jesse Gross's avatar
    kvcache: Optimize sliding window attention · 2d6eac90
    Jesse Gross authored
    Currently sliding window attention allocates and uses the full
    context size and just masks out any tokens that are outside of the
    window. However, we really only need (roughly) the sliding window
    size.
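
    As a rough illustration, a hypothetical helper (not the actual kvcache
    API) sizing the per-layer allocation from the window rather than from
    the context length might look like:

        package main

        import "fmt"

        // cacheCapacity is an illustrative sketch, not the real cache code:
        // a sliding-window layer only needs room for roughly the window plus
        // the current batch, instead of the full context length.
        func cacheCapacity(contextLen, windowSize, batchSize int) int {
            if windowSize <= 0 || windowSize >= contextLen {
                // Non-sliding (full) attention still needs the whole context.
                return contextLen
            }
            // Sliding-window attention: the window plus room for new tokens.
            return windowSize + batchSize
        }

        func main() {
            // Illustrative numbers only: a 32k context with a 1k sliding
            // window and a 512-token batch.
            fmt.Println(cacheCapacity(32768, 32768, 512)) // 32768: non-sliding layer
            fmt.Println(cacheCapacity(32768, 1024, 512))  // 1536: sliding layer
        }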
    
    At large context sizes this improves two things:
     - Memory allocated - since the full context size was previously
       allocated up front, memory requirements drop substantially. On
       Gemma3:4b with a 32k context window, total memory usage (including
       weights and non-sliding layers) drops from ~20GB to ~8GB.
     - Computation - ranges that are completely outside of the sliding
       window are now removed from the tensors that are returned from the
       cache rather than simply being masked out (see the sketch after
       this list). This results in more efficient processing, scaling
       with the size of the context that has actually been used.
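
    A minimal sketch of the second point above, with a hypothetical
    windowRange helper rather than the real cache code: only the span that
    can still attend is returned, instead of the full used context plus a
    mask.

        package main

        import "fmt"

        // windowRange returns the half-open range of cached positions that a
        // new token can still attend to under a sliding window. Everything
        // before start is dropped from the returned tensors rather than
        // being masked out.
        func windowRange(numUsed, windowSize int) (start, end int) {
            end = numUsed
            start = end - windowSize
            if start < 0 {
                start = 0
            }
            return start, end
        }

        func main() {
            // 10000 tokens cached with a 1024-token window: only
            // cache[8976:10000) is handed to attention.
            start, end := windowRange(10000, 1024)
            fmt.Printf("attend over cache[%d:%d) (%d entries)\n",
                start, end, end-start)
        }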
    
    Notably, this does not update the scheduler for any model to be aware of
    the smaller memory requirements. This is difficult for Gemma3 because
    the layers are heterogeneous between sliding and non-sliding attention.
    As a result, while actual memory consumption will be reduced, the
    scheduler will over-estimate the requirements of the model. This means
    that splitting the model across GPUs, or between GPU and CPU, will
    still be suboptimal.
    
    Bug #9730