    runner.go: Only allocate 1-element embedding batches for mllama · a103dae0
    Jesse Gross authored
    Mllama has large embeddings (100 MB per image), and each embedding is
    represented as a single token when passed to llama.cpp. Embedding
    batches are pre-allocated at the per-token embedding size times the
    batch size, which results in allocations of over 50 GB at the default
    batch size. On some systems, these mallocs fail.
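    As a rough check on that figure (a sketch only; the ~100 MB embedding and
    the 512-token default batch size are assumptions here, not values taken
    from runner.go):

        package main

        import "fmt"

        // Assumed figures: ~100 MB per mllama image embedding, 512-token batch.
        const (
            embeddingBytes = 100 << 20
            defaultBatch   = 512
            totalBytes     = embeddingBytes * defaultBatch // one slot per token
        )

        func main() {
            fmt.Printf("pre-allocated embedding memory: %.1f GB\n",
                float64(totalBytes)/(1<<30))
            // Prints "50.0 GB", matching the "over 50 GB" above.
        }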
    
    Since an image is represented as a single token and mllama doesn't
    support more than 1 image per request, we only need to allocate a
    batch size of 1, which is much more reasonable. In addition, for
    non-multimodal models, we don't need to allocate the embedding
    batches at all.
    
    Fixes #7464
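
    A minimal sketch of the allocation policy the message describes; the names
    embedBatch, newEmbedBatch, allocEmbedBatch, isMultimodal, and embedSize are
    illustrative, not the actual identifiers in runner.go:

        package main

        import "fmt"

        // embedBatch pre-allocates one embedding buffer per token slot.
        type embedBatch struct {
            embd [][]float32
        }

        func newEmbedBatch(slots, embedSize int) *embedBatch {
            b := &embedBatch{embd: make([][]float32, slots)}
            for i := range b.embd {
                b.embd[i] = make([]float32, embedSize)
            }
            return b
        }

        // allocEmbedBatch applies the policy: text-only models get no embedding
        // batch at all, and mllama gets a single slot because an image is one
        // token and at most one image per request is supported.
        func allocEmbedBatch(isMultimodal bool, embedSize int) *embedBatch {
            if !isMultimodal {
                return nil
            }
            return newEmbedBatch(1, embedSize)
        }

        func main() {
            b := allocEmbedBatch(true, 4)
            fmt.Println(len(b.embd)) // 1 slot instead of the full batch size
        }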