    kvcache: Skip computing causal mask for worst case graph reservation · ea790031
    Jesse Gross authored
    Computing an attention mask for a large context and max batch is
    expensive - over 100ms. Models like Gemma3 that have multiple types
    of caches and custom attention masks need to do this 4 times, so this
    adds approximately 500ms to startup time when using a 128k context.
    
    When we are reserving the worst case graph, we don't need the mask,
    only its shape, so we can skip this.
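
    The idea can be sketched roughly as follows. This is a minimal,
    self-contained illustration and not the code in causal.go: the Mask
    type, the buildCausalMask helper, and the reserve flag are all
    hypothetical names, and the sketch assumes the batch occupies the
    last positions of the context.

```go
package main

import (
	"fmt"
	"math"
)

// Mask is a hypothetical stand-in for a tensor: a shape plus optional data.
type Mask struct {
	Rows, Cols int
	Data       []float32 // nil when only the shape was requested
}

// buildCausalMask is a hypothetical helper, not the real causal.go API.
// When reserve is true, only the shape is needed (for example to size a
// worst case graph), so the expensive fill loop is skipped entirely.
func buildCausalMask(batch, context int, reserve bool) Mask {
	m := Mask{Rows: batch, Cols: context}
	if reserve {
		return m // shape only, no per-element work
	}

	m.Data = make([]float32, batch*context)
	negInf := float32(math.Inf(-1))

	// Assumption: batch token i sits at absolute position offset+i and
	// may attend to every position up to and including its own.
	offset := context - batch
	for i := 0; i < batch; i++ {
		for j := 0; j < context; j++ {
			if j > offset+i {
				m.Data[i*context+j] = negInf
			}
		}
	}
	return m
}

func main() {
	// Reservation path: the shape is known, no values are computed.
	reserved := buildCausalMask(512, 131072, true)
	fmt.Println(reserved.Rows, reserved.Cols, reserved.Data == nil)

	// Normal path: a small mask with values filled in.
	full := buildCausalMask(2, 4, false)
	fmt.Println(full.Data)
}
```

    In this sketch the reservation path does constant work regardless of
    context size, while the normal path still touches every element of
    the batch-by-context mask.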
causal.go 18 KB