    Hybrid and recurrent memory estimates (#12186) · 7b91c9ce
    Gabe Goodhart authored
    
    
    This PR updates the memory size estimate logic to better handle recurrent and hybrid-recurrent models, whose memory use is currently badly overestimated because the default logic assumes full attention for every layer.
    
    The logic for sizing the recurrent layers comes from the llama.cpp implementation:
    
        // Per-layer recurrent cache: one "r" and one "s" tensor, each holding mem_size cells
        ggml_tensor * r = ggml_new_tensor_1d(ctx, type_r, hparams.n_embd_r()*mem_size);
        ggml_tensor * s = ggml_new_tensor_1d(ctx, type_s, hparams.n_embd_s()*mem_size);
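    As a rough illustration, the per-layer estimate implied by that allocation can be sketched in Go as below. The function and parameter names are hypothetical placeholders, not the actual identifiers in ggml.go, and the example values are illustrative only:

        package main

        import "fmt"

        // recurrentLayerBytes sketches the cache size in bytes for a single
        // recurrent layer: one "r" tensor and one "s" tensor, each holding
        // memSize cells. nEmbdR and nEmbdS are the per-cell state sizes in
        // elements; rTypeSize and sTypeSize are the element sizes in bytes.
        func recurrentLayerBytes(nEmbdR, nEmbdS, memSize, rTypeSize, sTypeSize uint64) uint64 {
                return nEmbdR*memSize*rTypeSize + nEmbdS*memSize*sTypeSize
        }

        func main() {
                // Hypothetical Mamba-style layer with float32 (4-byte) states and
                // one recurrent cell per sequence; numbers are made up for the demo.
                fmt.Println(recurrentLayerBytes(4096, 16384, 1, 4, 4), "bytes per layer")
        }

    The key point is that a recurrent layer's cache scales with mem_size (the number of recurrent state cells) rather than with the context length, which is why applying the full-attention KV estimate to these layers produces such large overestimates.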
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Changed file: ggml.go