    runner.go: Only allocate 1-element embedding batches for mllama · a103dae0
    Jesse Gross authored
    Mllama has large embeddings (100 MB per image), and each embedding is
    represented as a single token when passed to llama.cpp. Embedding
    batches are pre-allocated at the per-token embedding size times the
    batch size, which results in allocations of over 50 GB at the default
    batch size. On some systems, these mallocs fail.
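    As a rough check on that figure (a sketch only; the ~100 MB embedding and
    the 512-token default batch size are assumptions here, not values taken
    from runner.go):

        package main

        import "fmt"

        // Assumed figures: ~100 MB per mllama image embedding, 512-token batch.
        const (
            embeddingBytes = 100 << 20
            defaultBatch   = 512
            totalBytes     = embeddingBytes * defaultBatch // one slot per token
        )

        func main() {
            fmt.Printf("pre-allocated embedding memory: %.1f GB\n",
                float64(totalBytes)/(1<<30))
            // Prints "50.0 GB", matching the "over 50 GB" above.
        }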
    
    Since an image is represented as a single token and mllama doesn't
    support more than 1 image per request, we only need to allocate a
    batch size of 1, which is much more reasonable. In addition, for
    non-multimodal models, we don't need to allocate the embedding
    batches at all.
    
    Fixes #7464
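
    A minimal sketch of the allocation policy the message describes; the names
    embedBatch, newEmbedBatch, allocEmbedBatch, isMultimodal, and embedSize are
    illustrative, not the actual identifiers in runner.go:

        package main

        import "fmt"

        // embedBatch pre-allocates one embedding buffer per token slot.
        type embedBatch struct {
            embd [][]float32
        }

        func newEmbedBatch(slots, embedSize int) *embedBatch {
            b := &embedBatch{embd: make([][]float32, slots)}
            for i := range b.embd {
                b.embd[i] = make([]float32, embedSize)
            }
            return b
        }

        // allocEmbedBatch applies the policy: text-only models get no embedding
        // batch at all, and mllama gets a single slot because an image is one
        // token and at most one image per request is supported.
        func allocEmbedBatch(isMultimodal bool, embedSize int) *embedBatch {
            if !isMultimodal {
                return nil
            }
            return newEmbedBatch(1, embedSize)
        }

        func main() {
            b := allocEmbedBatch(true, 4)
            fmt.Println(len(b.embd)) // 1 slot instead of the full batch size
        }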