    models: Prune unused outputs earlier in the forward pass · 5c5535c0
    Jesse Gross authored
    Currently, Rows is called as the last step of a model's computation
    to select the values for the output tokens. If we call it earlier
    in the forward pass, we can skip the computations whose results
    are never used. This is similar to how models are defined in
    llama.cpp.
    
    Changing the model definition in this way improves token generation
    performance by approximately 8%.
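
The idea can be illustrated with a minimal, self-contained sketch. It is not the code from this commit: the `matrix` type and the `rows`, `perRow`, `forwardLate`, and `forwardEarly` helpers are hypothetical stand-ins. It assumes the remaining operations act on each token's row independently, so that selecting the output rows first gives the same values with less work.

```go
package main

import "fmt"

// matrix is a toy stand-in for a tensor: one row of features per token.
type matrix [][]float64

// rows keeps only the rows listed in idx (hypothetical helper playing the role of Rows).
func rows(m matrix, idx []int) matrix {
	out := make(matrix, len(idx))
	for i, r := range idx {
		out[i] = m[r]
	}
	return out
}

// perRow applies a row-independent operation and counts how many rows it processed.
func perRow(m matrix, ops *int) matrix {
	out := make(matrix, len(m))
	for i, row := range m {
		out[i] = make([]float64, len(row))
		for j, v := range row {
			out[i][j] = 2*v + 1
		}
		*ops++
	}
	return out
}

// forwardLate runs every step on all tokens and prunes at the very end.
func forwardLate(hidden matrix, outputs []int) (matrix, int) {
	ops := 0
	h := perRow(perRow(hidden, &ops), &ops)
	return rows(h, outputs), ops
}

// forwardEarly prunes first, so later steps only touch the output rows.
func forwardEarly(hidden matrix, outputs []int) (matrix, int) {
	ops := 0
	h := rows(hidden, outputs)
	return perRow(perRow(h, &ops), &ops), ops
}

// equal reports whether two matrices hold the same values.
func equal(a, b matrix) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if len(a[i]) != len(b[i]) {
			return false
		}
		for j := range a[i] {
			if a[i][j] != b[i][j] {
				return false
			}
		}
	}
	return true
}

func main() {
	hidden := matrix{{1, 2}, {3, 4}, {5, 6}, {7, 8}} // four tokens
	outputs := []int{3}                              // only the last token is needed

	late, lateOps := forwardLate(hidden, outputs)
	early, earlyOps := forwardEarly(hidden, outputs)
	fmt.Println("same result:", equal(late, early))
	fmt.Println("row operations, late vs early:", lateOps, earlyOps)
}
```

In this toy setup both paths return the same output row, but the early-pruning path processes only the rows that were requested. Steps that mix information across tokens, such as attention, would still have to run before the pruning point.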
model.go 5.33 KB