kvcache/causal.go · ea79003180205680000bacf97466fc9d78d71f5e · OpenDAS / ollama

kvcache: Skip computing causal mask for worst case graph reservation · ea790031

Jesse Gross authored May 27, 2025

Computing an attention mask for a large context and max batch is
expensive - over 100ms. Models like Gemma3 that have multiple types
of caches and custom attention masks need to do this 4 times, so this
adds approximately 500ms to startup time when using 128k context.

When we are reserving the worst case graph, we don't need the mask,
only its shape, so we can skip this.


causal.go 18 KB
