  1. 02 Nov, 2024 2 commits
    • llama: Improve error handling · 312d9de1
      Jesse Gross authored
      Check for NULL return values from llama.cpp in more places and
      convert them into Go errors, which should make debugging easier
      in the future rather than having hidden surprises in our data
      structures.
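The NULL-to-error conversion described in this commit can be sketched in plain Go. This is a minimal illustration rather than the actual runner code: `rawModel`, `fakeCLoad`, `loadModel`, and `ErrLoadFailed` are hypothetical names, and `fakeCLoad` stands in for a cgo call into llama.cpp that signals failure by returning NULL.

```go
package main

import (
	"errors"
	"fmt"
)

// rawModel stands in for an opaque C handle (hypothetical; real code would
// hold a pointer obtained through cgo).
type rawModel struct{ path string }

// ErrLoadFailed is a hypothetical sentinel error for a NULL handle.
var ErrLoadFailed = errors.New("unable to load model")

// fakeCLoad mimics a C function that reports failure by returning NULL.
func fakeCLoad(path string) *rawModel {
	if path == "" {
		return nil
	}
	return &rawModel{path: path}
}

// loadModel checks for the NULL case once, at the boundary, so callers
// never store a nil handle in their own data structures.
func loadModel(path string) (*rawModel, error) {
	m := fakeCLoad(path)
	if m == nil {
		return nil, fmt.Errorf("%w: %q", ErrLoadFailed, path)
	}
	return m, nil
}
```

Callers can then branch on the returned error (for example with `errors.Is(err, ErrLoadFailed)`) instead of discovering a nil handle later.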
    • runner.go: Only allocate 1 element embedding batches for mllama · a103dae0
      Jesse Gross authored
      Mllama has large embeddings (100 MB per image) and each embedding is
      represented as 1 token when passed to llama.cpp. Batches are pre-
      allocated for the size of the tokens times the batch size, so this
      results in allocations of over 50 GB at the default batch size.
      On some systems, these mallocs will fail.
      
      Since an image is represented as a single token and mllama doesn't
      support more than 1 image per request, we only need to allocate a
      batch size of 1, which is much more reasonable. In addition, for
      non-multimodal models, we don't need to allocate the embedding
      batches at all.
      
      Fixes #7464
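The allocation arithmetic in this commit can be checked with a short sketch. The function names are hypothetical, 100 MB is taken as 100,000,000 bytes, and the default batch size of 512 is an assumption (the commit message only says "the default batch size").

```go
package main

// embeddingBatchBytes returns the bytes needed to pre-allocate an embedding
// batch: one slot per batch element, each sized for one full embedding.
func embeddingBatchBytes(bytesPerEmbedding, batchSize int64) int64 {
	return bytesPerEmbedding * batchSize
}

// embeddingBatchSize picks how many embedding slots to allocate
// (hypothetical helper mirroring the policy in the commit message).
func embeddingBatchSize(multimodal, oneImagePerRequest bool, defaultBatch int) int {
	switch {
	case !multimodal:
		return 0 // text-only models need no embedding batch at all
	case oneImagePerRequest:
		return 1 // one image is one token, so one slot is enough
	default:
		return defaultBatch
	}
}
```

Under these assumptions, `embeddingBatchBytes(100_000_000, 512)` is 51,200,000,000 bytes (about 51 GB), while a batch size of 1 needs only the 100 MB for a single image.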
  2. 31 Oct, 2024 1 commit
    • runner.go: Don't set cross attention before sending embeddings · 26acdcf4
      Jesse Gross authored
      Currently if an input has embeddings at any point then we will set
      cross attention to true from the beginning. This means that any
      tokens before the embeddings are sent will incorrectly have cross
      attention layers applied.
      
      This only sets cross attention when we have an embedding, either
      previously in this sequence or in the cache. It also makes cross
      attention capable of supporting parallelism at the runner level,
      though the mllama implementation doesn't support that yet.
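The rule in this commit (cross attention stays off until an embedding has been seen in the sequence or the cache) can be sketched as follows. The `input` type and `crossAttentionFlags` function are hypothetical and not taken from runner.go; treating the embedding's own position as already having cross attention enabled is also an assumption.

```go
package main

// input is a hypothetical prompt element: either a token or an embedding.
type input struct {
	token     int
	embedding []float32 // non-nil when this input is an image embedding
}

// crossAttentionFlags returns, for each input in order, whether cross
// attention would be enabled when that input is processed.
// cachedEmbedding reports whether the cache already holds an embedding
// for this sequence.
func crossAttentionFlags(inputs []input, cachedEmbedding bool) []bool {
	flags := make([]bool, len(inputs))
	seen := cachedEmbedding
	for i, in := range inputs {
		if in.embedding != nil {
			seen = true // from here on, cross attention layers apply
		}
		flags[i] = seen
	}
	return flags
}
```

Because the state is tracked per sequence rather than globally, the same logic could run independently for several sequences, which is the property the commit message refers to as supporting parallelism at the runner level.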
  3. 30 Oct, 2024 1 commit
    • runner.go: Better abstract vision model integration · c826e574
      Jesse Gross authored
      - Update mllama to take the cross attention state as embeddings in
        a batch, more similar to how Llava handles it. This improves
        integration with the input cache.
      - Pass locations in a prompt for embeddings using tags similar to Llava.
      - Abstract interface to vision models so the main runner accesses Clip
        and Mllama similarly.

      Co-authored-by: Michael Yang <mxyng@pm.me>
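One way to picture the abstraction in this commit is a single Go interface that both vision backends satisfy, so the runner never branches on the model type. Everything below is a hypothetical sketch: `VisionModel`, `clipModel`, `mllamaModel`, and `countEmbeddings` are illustrative names, and the embedding counts are placeholders rather than real model output.

```go
package main

import "fmt"

// VisionModel is a hypothetical common interface for image encoders.
type VisionModel interface {
	Name() string
	// Embed turns raw image bytes into one or more embeddings.
	Embed(image []byte) ([][]float32, error)
}

// clipModel is a placeholder for a Llava-style encoder.
type clipModel struct{}

func (clipModel) Name() string { return "clip" }

func (clipModel) Embed(image []byte) ([][]float32, error) {
	if len(image) == 0 {
		return nil, fmt.Errorf("clip: empty image")
	}
	// Placeholder: pretend the image maps to two embeddings.
	return [][]float32{{0}, {0}}, nil
}

// mllamaModel is a placeholder for the mllama encoder.
type mllamaModel struct{}

func (mllamaModel) Name() string { return "mllama" }

func (mllamaModel) Embed(image []byte) ([][]float32, error) {
	if len(image) == 0 {
		return nil, fmt.Errorf("mllama: empty image")
	}
	// Placeholder: the cross attention state travels as a single embedding.
	return [][]float32{{0}}, nil
}

// countEmbeddings shows the runner-side view: it only sees the interface.
func countEmbeddings(m VisionModel, image []byte) (int, error) {
	embeds, err := m.Embed(image)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", m.Name(), err)
	}
	return len(embeds), nil
}
```

With this shape, a call such as `countEmbeddings(mllamaModel{}, img)` and `countEmbeddings(clipModel{}, img)` follow the same code path, which is the kind of uniform access the commit message describes.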