1. 29 Jan, 2025 1 commit
      next build (#8539) · dcfb7a10
      Michael Yang authored
      
      
      * add build to .dockerignore
      
      * test: only build one arch
      
      * add build to .gitignore
      
      * fix ccache path
      
      * filter amdgpu targets
      
      * only filter if autodetecting
      
      * Don't clobber gpu list for default runner
      
      This ensures the GPU specific environment variables are set properly
      
      * explicitly set CXX compiler for HIP
      
      * Update build_windows.ps1
      
      This isn't complete, but is close.  Dependencies are missing, and it only builds the "default" preset.
      
      * build: add ollama subdir
      
      * add .git to .dockerignore
      
      * docs: update development.md
      
      * update build_darwin.sh
      
      * remove unused scripts
      
      * llm: add cwd and build/lib/ollama to library paths
      
      * default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS
      
      * add additional cmake output vars for msvc
      
      * interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12
      
      * remove unnecessary filepath.Dir, cleanup
      
      * add hardware-specific directory to path
      
      * use absolute server path
      
      * build: linux arm
      
      * cmake install targets
      
      * remove unused files
      
      * ml: visit each library path once
      
      * build: skip cpu variants on arm
      
      * build: install cpu targets
      
      * build: fix workflow
      
      * shorter names
      
      * fix rocblas install
      
      * docs: clean up development.md
      
      * consistent build dir removal in development.md
      
      * silence -Wimplicit-function-declaration build warnings in ggml-cpu
      
      * update readme
      
      * update development readme
      
      * llm: update library lookup logic now that there is one runner (#8587)
      
      * tweak development.md
      
      * update docs
      
      * add windows cuda/rocm tests
      
      ---------
      Co-authored-by: jmorganca <jmorganca@gmail.com>
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
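The library-path items in the commit above ("add cwd and build/lib/ollama to library paths", "visit each library path once") can be illustrated with a minimal Go sketch. It is not the code from this commit: `uniquePaths` and `libraryPathEnv` are hypothetical names, and the approach (clean each entry, keep the first occurrence, join with the platform list separator) is an assumption about one reasonable way to do it.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// uniquePaths returns the cleaned entries of paths in first-seen order,
// skipping empty entries and repeats. Hypothetical helper for illustration.
func uniquePaths(paths []string) []string {
	seen := make(map[string]struct{}, len(paths))
	out := make([]string, 0, len(paths))
	for _, p := range paths {
		if p == "" {
			continue
		}
		p = filepath.Clean(p) // "a/" and "a" count as the same directory
		if _, ok := seen[p]; ok {
			continue
		}
		seen[p] = struct{}{}
		out = append(out, p)
	}
	return out
}

// libraryPathEnv puts the extra directories in front of an existing
// search-path value (for example the contents of LD_LIBRARY_PATH) and
// returns a joined value in which every directory appears once.
func libraryPathEnv(existing string, extra ...string) string {
	all := append([]string{}, extra...) // copy so the caller's slice is untouched
	all = append(all, filepath.SplitList(existing)...)
	return strings.Join(uniquePaths(all), string(filepath.ListSeparator))
}

func main() {
	// The directories below are examples only.
	existing := strings.Join([]string{"/usr/lib", "/opt/lib"}, string(filepath.ListSeparator))
	fmt.Println(libraryPathEnv(existing, "build/lib/ollama", "/usr/lib"))
}
```

Keeping the first occurrence means directories added by the program take precedence over entries inherited from the environment, which is the usual intent when prepending a build output directory.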
  2. 08 Jan, 2025 1 commit
  3. 14 Dec, 2024 1 commit
  4. 11 Dec, 2024 1 commit
  5. 17 Oct, 2024 1 commit
      IBM granite/granitemoe architecture support (#6760) · f2890a44
      Gabe Goodhart authored
      * fix(ext_server): Port llama.cpp sampling refactors to ext_server
      
      This was a fairly large changeset. I closely followed the changes here:
      https://github.com/ggerganov/llama.cpp/commit/df270ef74596da8f1178f08991f4c51f18c9ee82
      
      
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat: Bump llama.cpp to the latest master with `granite` support
      
      This does not yet have granite MoE support, but that can come in a
      follow up PR
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(solar): Update solar patch for llama.cpp bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump llama.cpp for granitemoe support
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump llama.cpp for granitemoe support
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(solar): Update the solar-pro patch for latest llama.cpp bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump to the latest master of llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(patches): Update all patches for latest bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama): Always run sync.sh from the right directory
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/patches): Update llama patches
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama)!: Rough sync with llama.cpp submodule
      
      There are a number of changes that will need to be propagated to llama.go
      before any of this works!
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/patches): Add a patch and update for missing ggml-impl.h include
      
      This include is where the ggml_cgraph struct is defined. It is included in
      many of the .c files to define the forward declaration in ggml.h. It seems
      that with the subset of code included here, the import was somehow lost (or
      out-of-order) when building, so adding this include to llama.cpp fixes the
      missing definition.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Add missing log.cpp
      
      This was added as part of the logging overhaul done in llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Overhaul use of sampling module for llama.cpp changes
      
      The changes here reflect the changes made in the big llama.cpp sampling PR
      https://github.com/ggerganov/llama.cpp/pull/9294
      
      
      
      The sampling functionality is now broken into the base interface
      (llama_sampler) and the generation implementation (gpt_sampler). The
      changes here reflect that. Since the sampling.h/sampling.cpp code uses C++
      STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow Go to
      access a pure-C interface.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Fix the impl of SampleTokenGreedy for new sampling
      
      I don't think this method is currently used, so it could probably just be
      removed so that all sampling goes through the GPT interface, but in the
      interest of doing no harm, this should keep the method working as expected.
      
      Branch: IBMGraniteArchitectureSupport
      
      * fix(llama): Remove unused SampleTokenGreedy
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(sync): Remove bash-specific change to sync.sh
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * chore(gofumpt): Format on llama.go to pass linting
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llm): Fix missing <thread> include in ext_server
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Remove TODO about grammar_first
      
      This feature was not used/needed previously so should be fine without
      plumbing it through now.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Better naming for sampling wrapper and args
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Fix patch 05 to use new wrapper api and re-sync
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * runner: Flush pending responses before returning
      
      If there are any pending responses (such as from potential stop
      tokens) then we should send them back before ending the sequence.
      Otherwise, we can be missing tokens at the end of a response.
      
      Fixes #6707
      
      * fix(llama/sampling): Use gpt_sampler with a forward declaration
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Remove unnecessary patch for gguf impl header
      
      This was caused by an earlier mistake in the embeddings patch that was
      dereferencing the pointer instead of using the wrapper API.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llm): Remove use of deprecated --log-disable flag
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      ---------
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
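The "runner: Flush pending responses before returning" item in the commit above describes text that is held back because it might be the start of a stop sequence, and that must still be sent when the sequence ends. The following Go sketch shows that idea in isolation. It is not the runner's code: `stream` and `heldSuffixLen` are hypothetical names, a single non-empty stop string is assumed, and matching is done on bytes.

```go
package main

import (
	"fmt"
	"strings"
)

// heldSuffixLen returns the length of the longest suffix of s that is a
// proper prefix of stop, i.e. text that could still grow into the stop string.
func heldSuffixLen(s, stop string) int {
	limit := len(stop) - 1
	if limit > len(s) {
		limit = len(s)
	}
	for n := limit; n > 0; n-- {
		if strings.HasSuffix(s, stop[:n]) {
			return n
		}
	}
	return 0
}

// stream returns the chunks that would be sent to the client for the given
// generated pieces. Text that may be a partial stop sequence is held in
// pending; if generation ends without completing the stop sequence, the
// held text is flushed so the tail of the response is not lost.
func stream(pieces []string, stop string) []string {
	var out []string
	pending := ""
	for _, piece := range pieces {
		pending += piece
		if stop != "" {
			if i := strings.Index(pending, stop); i >= 0 {
				// Complete stop sequence: send what precedes it and end.
				if i > 0 {
					out = append(out, pending[:i])
				}
				return out
			}
		}
		hold := heldSuffixLen(pending, stop)
		if emit := pending[:len(pending)-hold]; emit != "" {
			out = append(out, emit)
		}
		pending = pending[len(pending)-hold:]
	}
	// Flush: the sequence ended while text was still being held back.
	if pending != "" {
		out = append(out, pending)
	}
	return out
}

func main() {
	// Generation ends after "</s", one byte short of the stop string "</s>".
	fmt.Printf("%q\n", stream([]string{"Hello", " wor", "ld<", "/s"}, "</s>"))
}
```

Without the final flush step, the held `</s` in this example would never be sent, which matches the symptom described in the commit message: tokens missing at the end of a response.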