1. 10 Dec, 2024 1 commit
    • build: Make target improvements (#7499) · 4879a234
      Daniel Hiltgen authored
      * llama: wire up builtin runner
      
      This adds a new entrypoint into the ollama CLI to run the cgo-built
      runner. On macOS arm64 this runner will have GPU support, but on all
      other platforms it will be the lowest-common-denominator CPU build.
      After we fully transition to the new Go runners, more tech debt can be
      removed and we can stop building the "default" runner via make,
      relying on the builtin one instead.
      
      * build: Make target improvements
      
      Add a few new targets, plus help output for building locally.
      This also adjusts the runner lookup order to favor local builds
      first, then runners relative to the executable, and finally
      payloads, as sketched below.
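
      A sketch of that lookup order in Go; the paths and the payloadDir
      helper are illustrative assumptions:

          package runner

          import (
              "fmt"
              "os"
              "path/filepath"
          )

          // payloadDir stands in for wherever extracted payloads live.
          func payloadDir() string {
              return filepath.Join(os.TempDir(), "ollama", "runners")
          }

          // locateRunner checks a local build first, then runners
          // shipped next to the executable, and finally payloads.
          func locateRunner(name string) (string, error) {
              exe, err := os.Executable()
              if err != nil {
                  return "", err
              }
              candidates := []string{
                  filepath.Join("build", "runners", name),           // local build
                  filepath.Join(filepath.Dir(exe), "runners", name), // next to binary
                  filepath.Join(payloadDir(), name),                 // payloads
              }
              for _, dir := range candidates {
                  if _, err := os.Stat(dir); err == nil {
                      return dir, nil
                  }
              }
              return "", fmt.Errorf("runner %q not found", name)
          }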
      
      * Support customized CPU flags for runners
      
      This implements a simplified custom-CPU-flags pattern for the runners.
      When built without overrides, the runner name contains the vector flag
      we check for (AVX), ensuring we don't try to run on unsupported
      systems and crash. If the user builds with a customized set of flags,
      we omit the naming scheme and skip the compatibility check. This
      removes the need to verify requirements at runtime, so that logic has
      been dropped as well. The pattern can be used to build GPU runners
      with no vector flags, or CPU/GPU runners with additional flags
      (e.g. AVX512) enabled.
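
      A sketch of the resulting check in Go, using golang.org/x/sys/cpu;
      the name-tagging convention shown is an assumption:

          package runner

          import (
              "strings"

              "golang.org/x/sys/cpu"
          )

          // compatible returns whether a runner may be used on this
          // host. Default builds carry their required vector flag in
          // the name (e.g. "cpu_avx") and are checked; customized
          // builds omit the tag, so no runtime check is performed.
          func compatible(runnerName string) bool {
              if !strings.Contains(runnerName, "avx") {
                  return true // customized build: trust the user
              }
              return cpu.X86.HasAVX
          }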
      
      * Use relative paths
      
      If the user checks out the repo in a path that contains spaces, make
      gets really confused, so use relative paths for everything in-repo
      to avoid breakage.
      
      * Remove payloads from main binary
      
      * install: clean up prior libraries
      
      This removes support for v0.3.6 and older versions (before the tar
      bundle) and ensures we clean up prior libraries before extracting
      the bundle(s). Without this change, runners and dependent libraries
      could leak across updates and lead to subtle runtime errors.
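
      A sketch of the cleanup-then-extract flow in Go; the directory
      layout and the extract helper are assumptions:

          package installer

          import (
              "fmt"
              "os"
          )

          // extract stands in for unpacking a tar bundle into dir.
          func extract(bundle, dir string) error { return nil }

          // cleanupAndExtract wipes previously extracted libraries
          // before unpacking the new bundle, so stale runners cannot
          // leak across updates.
          func cleanupAndExtract(libDir, bundle string) error {
              if err := os.RemoveAll(libDir); err != nil {
                  return fmt.Errorf("removing stale libraries: %w", err)
              }
              if err := os.MkdirAll(libDir, 0o755); err != nil {
                  return err
              }
              return extract(bundle, libDir)
          }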
  2. 08 Nov, 2024 1 commit
  3. 02 Nov, 2024 2 commits
    • llama: Improve error handling · 312d9de1
      Jesse Gross authored
      Check for NULL return values from llama.cpp in more places and
      convert them into Go errors, which should make future debugging
      easier rather than leaving hidden surprises in our data structures.
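
      The pattern, sketched as a cgo wrapper over the llama.cpp C API
      (the Go wrapper shape is illustrative, not ollama's actual code):

          package llama

          /*
          #include <stdlib.h>
          #include "llama.h"
          */
          import "C"

          import (
              "fmt"
              "unsafe"
          )

          type Model struct {
              c *C.struct_llama_model
          }

          // LoadModel converts a NULL return from llama.cpp into a Go
          // error at the boundary instead of letting a nil pointer
          // hide inside our data structures.
          func LoadModel(path string) (*Model, error) {
              cPath := C.CString(path)
              defer C.free(unsafe.Pointer(cPath))

              m := C.llama_load_model_from_file(cPath, C.llama_model_default_params())
              if m == nil {
                  return nil, fmt.Errorf("unable to load model: %s", path)
              }
              return &Model{c: m}, nil
          }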
    • runner.go: Only allocate 1 element embedding batches for mllama · a103dae0
      Jesse Gross authored
      Mllama has large embeddings (100 MB per image), and each embedding is
      represented as one token when passed to llama.cpp. Embedding batches
      are pre-allocated at the per-token embedding size times the batch
      size, so this results in allocations of over 50 GB at the default
      batch size. On some systems, these mallocs will fail.
      
      Since an image is represented as a single token and mllama doesn't
      support more than 1 image per request, we only need an embedding
      batch of size 1, which is much more reasonable. In addition, for
      non-multimodal models, we don't need to allocate embedding batches
      at all.
      
      Fixes #7464
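
      A sketch of the sizing logic in Go; field and parameter names are
      illustrative:

          package runner

          // Batch holds token and (optional) embedding storage.
          type Batch struct {
              tokens []int32
              embeds [][]float32
          }

          // newBatch allocates no embedding slots for text-only
          // models, and a single slot for mllama: one image is one
          // token, and only one image per request is supported, so
          // there is no need for batchSize slots of ~100 MB each.
          func newBatch(batchSize int, multimodal bool) *Batch {
              b := &Batch{tokens: make([]int32, 0, batchSize)}
              if multimodal {
                  b.embeds = make([][]float32, 1) // one element, not batchSize
              }
              return b
          }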
  4. 31 Oct, 2024 1 commit
    • runner.go: Don't set cross attention before sending embeddings · 26acdcf4
      Jesse Gross authored
      Currently, if an input has embeddings at any point, we set cross
      attention to true from the beginning. This means that any tokens
      sent before the embeddings will incorrectly have cross-attention
      layers applied.
      
      This change only sets cross attention once we have an embedding,
      either earlier in this sequence or in the cache. It also makes
      cross attention capable of supporting parallelism at the runner
      level, though the mllama implementation doesn't support that yet.
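
      A sketch of the gating in Go; names are illustrative:

          package runner

          // Sequence records whether an image embedding has been seen,
          // either in the current input or already in the cache.
          type Sequence struct {
              hasEmbedding bool
          }

          // crossAttention enables cross attention for a batch only
          // once some sequence in it has an embedding, rather than
          // from the first token. Deciding this per batch is what
          // leaves room for runner-level parallelism.
          func crossAttention(seqs []*Sequence) bool {
              for _, seq := range seqs {
                  if seq != nil && seq.hasEmbedding {
                      return true
                  }
              }
              return false
          }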
  5. 30 Oct, 2024 1 commit
    • runner.go: Better abstract vision model integration · c826e574
      Jesse Gross authored
      - Update mllama to take the cross attention state as embeddings in
        a batch, more similar to how Llava handles it. This improves
        integration with the input cache.
      - Pass locations in a prompt for embeddings using tags, similar to
        Llava.
      - Abstract the interface to vision models so the main runner
        accesses Clip and Mllama the same way.
      Co-authored-by: Michael Yang <mxyng@pm.me>
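
      A sketch of such an abstraction in Go; the method shape is an
      assumption, not the actual interface:

          package runner

          // VisionModel is one interface the main runner can call for
          // both Clip and Mllama, turning image data into embeddings
          // that are spliced into a batch at the tagged prompt
          // positions.
          type VisionModel interface {
              // ImageEmbedding returns one embedding vector per
              // resulting token for the given image bytes.
              ImageEmbedding(image []byte) ([][]float32, error)
          }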