1. 22 Oct, 2024 1 commit
  2. 19 Oct, 2024 1 commit
  3. 18 Oct, 2024 1 commit
  4. 17 Oct, 2024 4 commits
    • llama: Decouple patching script from submodule (#7139) · bf4018b9
      Daniel Hiltgen authored
      * Refine llama.cpp vendoring workflow tools
      
      Switch from sync.sh to make-based tooling
      
      * Run new make sync and patch flow
    • llama: add compiler tags for cpu features (#7137) · f86d00cd
      Daniel Hiltgen authored
      This adds the ability to customize the default runner with user-specified flags
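
      As a rough illustration of the mechanism (the tag and function names below
      are hypothetical, not the ones this commit adds), Go build tags let a
      CPU-feature-specific file be compiled in only when requested:

        //go:build avx2

        // Compiled only when building with: go build -tags avx2
        // Tag and function names are illustrative, not Ollama's actual ones.
        package llama

        // dotAVX2 stands in for a routine that would use AVX2-specific code.
        func dotAVX2(a, b []float32) float32 {
            var s float32
            for i := range a {
                s += a[i] * b[i]
            }
            return s
        }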
    • IBM granite/granitemoe architecture support (#6760) · f2890a44
      Gabe Goodhart authored
      * fix(ext_server): Port llama.cpp sampling refactors to ext_server
      
      This was a fairly large changeset. I closely followed the changes here:
      https://github.com/ggerganov/llama.cpp/commit/df270ef74596da8f1178f08991f4c51f18c9ee82
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat: Bump llama.cpp to the latest master with `granite` support
      
      This does not yet have granite MoE support, but that can come in a
      follow-up PR
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(solar): Update solar patch for llama.cpp bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump llama.cpp for granitemoe support
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump llama.cpp for granitemoe support
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(solar): Update the solar-pro patch for latest llama.cpp bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump to the latest master of llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(patches): Update all patches for latest bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama): Always run sync.sh from the right directory
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/patches): Update llama patches
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama)!: Rough sync with llama.cpp submodule
      
      There are a number of changes that will need to be propagated to llama.go
      before any of this works!
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/patches): Add a patch and update for missing ggml-impl.h include
      
      This include is where the ggml_cgraph struct is defined. It is included in
      many of the .c files to complete the forward declaration in ggml.h. It seems
      that with the subset of code included here, the include was somehow lost (or
      out of order) when building, so adding this include to llama.cpp fixes the
      missing definition.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Add missing log.cpp
      
      This was added as part of the logging overhaul done in llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Overhaul use of sampling module for llama.cpp changes
      
      The changes here reflect the changes made in the big llama.cpp sampling PR
      https://github.com/ggerganov/llama.cpp/pull/9294
      
      The sampling functionality is now broken into the base interface
      (llama_sampler) and the generation implementation (gpt_sampler). The
      changes here reflect that. Since the sampling.h/sampling.cpp code uses C++
      STL headers, the sampling_ext.[h|cpp] wrapper is maintained to give Go
      access to a pure-C interface (a minimal sketch of that pattern follows below).
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
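
      As a self-contained sketch of the wrapper pattern (the real sampling_ext.h
      API is not reproduced here; all names are illustrative): cgo cannot call
      C++ code that uses STL types, so a plain-C function is exposed to Go
      instead. The stand-in below implements greedy sampling in the C preamble:

        package main

        /*
        #include <stddef.h>

        // Stand-in for a pure-C wrapper function: return the index of the
        // largest logit (greedy sampling).
        static int sample_token_greedy(const float *logits, size_t n) {
            int best = 0;
            for (size_t i = 1; i < n; i++) {
                if (logits[i] > logits[best]) best = (int)i;
            }
            return best;
        }
        */
        import "C"

        import (
            "fmt"
            "unsafe"
        )

        func main() {
            logits := []float32{0.1, 2.5, -0.3, 1.7}
            tok := C.sample_token_greedy((*C.float)(unsafe.Pointer(&logits[0])),
                C.size_t(len(logits)))
            fmt.Println("greedy token:", int(tok)) // greedy token: 1
        }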
      
      * fix(llama): Fix the impl of SampleTokenGreedy for new sampling
      
      I don't think this method is currently used, so it could probably just be
      removed so that all sampling goes through the GPT interface, but in the
      interest of doing no harm, this should keep the method working as expected.
      
      Branch: IBMGraniteArchitectureSupport
      
      * fix(llama): Remove unused SampleTokenGreedy
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(sync): Remove bash-specific change to sync.sh
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * chore(gofumpt): Format on llama.go to pass linting
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llm): Fix missing <thread> include in ext_server
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Remove TODO about grammar_first
      
      This feature was not used or needed previously, so it should be fine
      without plumbing it through now.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Better naming for sampling wrapper and args
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Fix patch 05 to use new wrapper api and re-sync
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * runner: Flush pending responses before returning
      
      If there are any pending responses (such as from potential stop
      tokens), then we should send them back before ending the sequence.
      Otherwise, we may be missing tokens at the end of a response.
      
      Fixes #6707
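
      A sketch of that flush-before-finish shape, under assumed names (the
      actual runner code differs): text held back while scanning for stop
      sequences is emitted before the output channel is closed.

        package runner

        // finishSequence flushes any held-back text before signaling
        // completion, so tokens buffered while checking for stop sequences
        // are not dropped.
        func finishSequence(pending string, out chan<- string) {
            if pending != "" {
                out <- pending // flush text withheld during stop matching
            }
            close(out) // only then end the sequence
        }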
      
      * fix(llama/sampling): Use gpt_sampler with a forward declaration
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Remove unnecessary patch for gguf impl header
      
      This was caused by an earlier mistake in the embeddings patch that was
      dereferencing the pointer instead of using the wrapper API.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llm): Remove use of deprecated --log-disable flag
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      ---------
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    • Rename gpu package discover (#7143) · 05cd82ef
      Daniel Hiltgen authored
      Cleaning up Go package naming
  5. 16 Oct, 2024 1 commit
    • Move macos v11 support flags to build script (#7203) · 7d6eb0d4
      Daniel Hiltgen authored
      Having v11 support hard-coded into the cgo settings causes warnings
      for newer Xcode versions. This should help keep the build clean for users
      building from source with the latest tools, while still allowing us to
      target the older OS via our CI processes.
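
      Illustrative only (the exact flag is an assumption, not quoted from the
      commit): a deployment target hard-coded in a cgo directive applies to
      every local build, whereas exporting it from the CI build script confines
      it to release builds.

        // Hard-coding a minimum macOS version in source looks like this;
        // newer Xcode toolchains may warn about the old target on every build.
        package llama

        /*
        #cgo darwin CFLAGS: -mmacosx-version-min=11.0
        #cgo darwin LDFLAGS: -mmacosx-version-min=11.0
        */
        import "C"

      Moving the flag out means only CI sets something like
      CGO_CFLAGS=-mmacosx-version-min=11.0 before invoking the build.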
  6. 15 Oct, 2024 3 commits
  7. 14 Oct, 2024 1 commit
  8. 13 Oct, 2024 1 commit
  9. 12 Oct, 2024 1 commit
  10. 10 Oct, 2024 3 commits
    • cli: Send all images in conversation history · 7fe39025
      Jesse Gross authored
      Currently the CLI only sends images from the most recent image-
      containing message. This prevents doing things like sending
      one message with an image and then a follow-up message with a
      second image and asking for a comparison based on additional
      information not present in any text that was output.
      
      It's possible that some models have a problem with this, but the
      CLI is not the right place to work around it, since any adjustments
      are model-specific and should apply to all clients.
      
      Both llava:34b and minicpm-v do reasonable things with multiple
      images in the history.
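
      A sketch of the gathering logic under assumed types (not the CLI's actual
      message structs): collect images from every message rather than only from
      the most recent image-bearing one.

        package main

        import "fmt"

        // Message is an illustrative stand-in for a chat history entry.
        type Message struct {
            Role    string
            Content string
            Images  [][]byte
        }

        // allImages returns the images from every message in the history.
        func allImages(history []Message) [][]byte {
            var imgs [][]byte
            for _, m := range history {
                imgs = append(imgs, m.Images...)
            }
            return imgs
        }

        func main() {
            h := []Message{
                {Role: "user", Content: "here is one image", Images: [][]byte{{0x01}}},
                {Role: "user", Content: "and a second", Images: [][]byte{{0x02}}},
            }
            fmt.Println(len(allImages(h))) // 2: both images are sent
        }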
    • runner.go: Handle truncation of tokens for stop sequences · 0077e22d
      Jesse Gross authored
      When a single token contains both text to be returned and a stop
      sequence, this causes an out-of-bounds error when we update the
      cache to match our text. This is because we currently assume that
      removing the stop sequence will consume at least one token.
      
      This also inverts the logic to deal with positive numbers, rather
      than a value to be subtracted, which is easier to reason about.
      
      Fixes #7153
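
      A sketch of the positive-number formulation with assumed names (not the
      actual runner.go code): compute how much decoded text to keep rather than
      how much to subtract, so a stop sequence contained entirely within a
      single token cannot push an index below zero.

        package runner

        import "strings"

        // truncateAtStop returns the prefix of text to keep if a stop
        // sequence is found. Working with a non-negative "keep" length
        // avoids the out-of-bounds case where removing the stop sequence
        // consumes no whole token.
        func truncateAtStop(text, stop string) (string, bool) {
            if i := strings.Index(text, stop); i >= 0 {
                return text[:i], true // keep everything before the stop
            }
            return text, false
        }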
    • server: Don't clear cmd when closing a server · 03408f34
      Jesse Gross authored
      Close can be called on an LLM server if the runner subprocess dies.
      However, the Ollama scheduler code may not know about this yet and
      still try to access it. In this case, it is important that 'cmd'
      is still available as it is used to check on the status of the
      subprocess. If this happens, Kill may be called twice on the subprocess -
      that is fine.
      
      In addition, model unloading may race with new accesses, so we should
      hold a lock around this. This may result in the model being reloaded
      after the first close call - this is also fine as close will be called
      again later.
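
      A sketch of that guarded shape, with assumed field names (mu, cmd): Close
      holds a lock, kills the subprocess if present, and deliberately leaves
      cmd set so the scheduler can still inspect the process status.

        package llm

        import (
            "os/exec"
            "sync"
        )

        type llmServer struct {
            mu  sync.Mutex
            cmd *exec.Cmd
        }

        func (s *llmServer) Close() error {
            s.mu.Lock()
            defer s.mu.Unlock()
            if s.cmd != nil && s.cmd.Process != nil {
                _ = s.cmd.Process.Kill() // harmless if already killed or exited
            }
            // s.cmd is intentionally not cleared: the scheduler may still
            // check the subprocess status through it.
            return nil
        }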
  11. 09 Oct, 2024 2 commits
  12. 08 Oct, 2024 3 commits
    • Fix build leakages (#7141) · f9584deb
      Daniel Hiltgen authored
      The recent change to applying patches leaves the submodule dirty when
      "new commits" are present. This ensures we clean up so the tree no
      longer reports as dirty after a `go generate ./...` run.
      
      The Makefile was being a bit too aggressive in cleaning things up and would
      end up deleting the placeholder files, which someone might then accidentally
      commit.
    • Re-introduce the `llama` package (#5034) · 96efd905
      Jeffrey Morgan authored
      * Re-introduce the llama package
      
      This PR brings back the llama package, making it possible to call llama.cpp and
      ggml APIs from Go directly via CGo. This has a few advantages:
      
      - C APIs can be called directly from Go without needing to use the previous
        "server" REST API
      - On macOS and for CPU builds on Linux and Windows, Ollama can be built without
        a `go generate ./...` step, making it easy to get up and running to hack on
        parts of Ollama that don't require fast inference
      - Faster build times for AVX, AVX2, CUDA, and ROCm (a full build of all runners
        takes <5 min on a fast CPU)
      - No git submodule, making it easier to clone and build from source
      
      This is a big PR, but much of it is vendored code, except for:
      
      - llama.go CGo bindings
      - example/: a simple example of running inference
      - runner/: a subprocess server designed to replace the llm/ext_server package
      - Makefile: an as-minimal-as-possible Makefile to build the runner package for
        different...
  13. 05 Oct, 2024 1 commit
  14. 01 Oct, 2024 1 commit
  15. 29 Sep, 2024 1 commit
  16. 26 Sep, 2024 1 commit
    • server: close response body on error (#6986) · 03608cb4
      Blake Mizerany authored
      This change closes the response body when an error occurs in
      makeRequestWithRetry. Previously, the first non-200 response body was
      not closed before reattempting the request. This change ensures that
      the response body is closed in all cases where an error occurs,
      preventing leaks of file descriptors.
      
      Fixes #6974
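
      A sketch of the fix's shape (names assumed; not the actual
      makeRequestWithRetry code): the non-200 body is closed before the error
      return that triggers a retry, so retries no longer leak file descriptors.

        package server

        import (
            "fmt"
            "io"
            "net/http"
        )

        // doAttempt performs one request attempt, closing the response body
        // on every path, including the non-200 path that leads to a retry.
        func doAttempt(c *http.Client, req *http.Request) error {
            resp, err := c.Do(req)
            if err != nil {
                return err
            }
            defer resp.Body.Close()
            if resp.StatusCode != http.StatusOK {
                msg, _ := io.ReadAll(resp.Body)
                return fmt.Errorf("status %s: %s", resp.Status, msg)
            }
            // ... handle the successful response ...
            return nil
        }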
  17. 25 Sep, 2024 2 commits
  18. 24 Sep, 2024 3 commits
  19. 22 Sep, 2024 1 commit
  20. 21 Sep, 2024 3 commits
  21. 20 Sep, 2024 3 commits
  22. 18 Sep, 2024 2 commits