    runner.go: Don't set cross attention before sending embeddings · 26acdcf4
    Jesse Gross authored
    Currently, if an input has embeddings at any point, we set cross
    attention to true from the beginning. This means that cross
    attention layers are incorrectly applied to any tokens that come
    before the embeddings are sent.
    
    This change only enables cross attention once we have an embedding,
    either earlier in this sequence or in the cache. It also makes cross
    attention capable of supporting parallelism at the runner level,
    though the mllama implementation doesn't support that yet.
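    The gating described above can be sketched as follows. This is a minimal illustration, not the actual runner.go code: the `input`, `sequence`, and `crossAttentionFor` names are hypothetical, standing in for whatever per-sequence state the runner keeps. The idea is that cross attention flips on only once an embedding has actually been sent, rather than as soon as any embedding exists anywhere in the input.

    ```go
    package main

    import "fmt"

    // input is a hypothetical token-or-embedding unit; names are
    // illustrative, not the real runner.go types.
    type input struct {
    	token     int
    	embedding []float32 // non-nil when this input carries an image embedding
    }

    // sequence tracks whether an embedding has already been processed,
    // either earlier in this sequence or from the cache.
    type sequence struct {
    	embeddingSeen bool
    }

    // crossAttentionFor reports whether cross attention layers should run
    // for this batch. It turns on only once an embedding has actually been
    // sent (now or previously), not from the beginning of the sequence.
    func crossAttentionFor(seq *sequence, batch []input) bool {
    	if seq.embeddingSeen {
    		return true
    	}
    	for _, in := range batch {
    		if in.embedding != nil {
    			seq.embeddingSeen = true
    			return true
    		}
    	}
    	return false
    }

    func main() {
    	seq := &sequence{}

    	// Text-only prefix: no cross attention yet.
    	fmt.Println(crossAttentionFor(seq, []input{{token: 1}, {token: 2}})) // false

    	// Batch containing an image embedding: cross attention turns on.
    	fmt.Println(crossAttentionFor(seq, []input{{embedding: []float32{0.1}}})) // true

    	// Later text tokens still attend to the cached embedding.
    	fmt.Println(crossAttentionFor(seq, []input{{token: 3}})) // true
    }
    ```

    Keeping the flag on the sequence rather than globally is what lets each parallel sequence decide independently, as the commit message notes.
    
    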