1. 04 Nov, 2024 3 commits
  2. 02 Nov, 2024 4 commits
    • nvidia libs have inconsistent ordering (#7473) · 29ab9fa7
      Daniel Hiltgen authored
      The runtime and management libraries may not always have
      identical ordering, so use the device UUID to correlate instead of ID.
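      A minimal sketch of that correlation idea in Go, with hypothetical types and field
      names (not the actual ollama code): index one library's devices by UUID, then match
      the other library's devices through that map instead of by position.

      package gpuexample

      // devInfo is a simplified, hypothetical device record; both the management
      // and runtime libraries report a UUID for each GPU.
      type devInfo struct {
          ID   int    // index as enumerated by that particular library
          UUID string // stable identifier shared across libraries
      }

      // correlateByUUID maps each runtime-library device index to the matching
      // management-library record, regardless of enumeration order.
      func correlateByUUID(runtimeDevs, mgmtDevs []devInfo) map[int]devInfo {
          byUUID := make(map[string]devInfo, len(mgmtDevs))
          for _, d := range mgmtDevs {
              byUUID[d.UUID] = d
          }
          out := make(map[int]devInfo, len(runtimeDevs))
          for _, r := range runtimeDevs {
              if m, ok := byUUID[r.UUID]; ok {
                  out[r.ID] = m
              }
          }
          return out
      }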
      29ab9fa7
    • CI: omit unused tools for faster release builds (#7432) · b8d5036e
      Daniel Hiltgen authored
      This leverages caching and a reduced installer scope to
      speed up builds. It also tidies up some Windows build logic
      that was only relevant for the older generate/cmake builds.
      b8d5036e
    • llama: Improve error handling · 312d9de1
      Jesse Gross authored
      Check for NULL return values from llama.cpp in more places and
      convert them into Go errors, which should make debugging easier
      in the future rather than having hidden surprises in our data
      structures.
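      A hedged illustration of the pattern, using a stand-in C stub rather than the real
      llama.cpp bindings: the NULL check happens at the cgo boundary and becomes a Go error
      instead of a nil pointer stored silently in a struct.

      package llamaerrexample

      /*
      #include <stdlib.h>
      // Stub that always fails, standing in for a llama.cpp constructor that
      // may return NULL.
      static void *make_context(void) { return NULL; }
      */
      import "C"

      import (
          "errors"
          "unsafe"
      )

      // Context wraps the C pointer returned by the constructor.
      type Context struct {
          ptr unsafe.Pointer
      }

      // NewContext converts a NULL return into a Go error at the cgo boundary.
      func NewContext() (*Context, error) {
          p := C.make_context()
          if p == nil {
              return nil, errors.New("llama: failed to create context")
          }
          return &Context{ptr: p}, nil
      }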
      312d9de1
    • runner.go: Only allocate 1 element embedding batches for mllama · a103dae0
      Jesse Gross authored
      Mllama has large embeddings (100 MB per image) and each embedding is
      represented as 1 token when passed to llama.cpp. Batches are pre-
      allocated for the size of the tokens times the batch size, so this
      results in allocations of over 50 GB at the default batch size.
      On some systems, these mallocs will fail.
      
      Since an image is represented as a single token and mllama doesn't
      support more than 1 image per request, we only need to allocate a
      batch size of 1, which is much more reasonable. In addition, for
      non-multimodal models, we don't need to allocate the embedding
      batches at all.
      
      Fixes #7464
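      A rough sketch of the allocation rule described above, with a hypothetical helper:
      skip embedding batches for text-only models and cap mllama's embedding batch at one slot.

      package batchexample

      // newEmbeddingBatch allocates slots for embedding inputs. Each mllama image
      // embedding is roughly 100 MB, so sizing this by the full token batch size
      // (512 by default) would request tens of gigabytes up front.
      func newEmbeddingBatch(multimodal bool, batchSize, embedSize int) [][]float32 {
          if !multimodal {
              // Text-only models never submit embeddings; skip the allocation.
              return nil
          }
          // mllama represents an image as a single token and supports at most
          // one image per request, so one slot is enough.
          if batchSize > 1 {
              batchSize = 1
          }
          batch := make([][]float32, batchSize)
          for i := range batch {
              batch[i] = make([]float32, embedSize)
          }
          return batch
      }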
      a103dae0
  3. 01 Nov, 2024 3 commits
  4. 31 Oct, 2024 2 commits
    • runner.go: Don't set cross attention before sending embeddings · 26acdcf4
      Jesse Gross authored
      Currently if an input has embeddings at any point then we will set
      cross attention to true from the beginning. This means that any
      tokens before the embeddings are sent will incorrectly have cross
      attention layers applied.
      
      This only sets cross attention when we have an embedding, either
      previously in this sequence or in the cache. It also makes cross
      attention capable of supporting parallelism at the runner level,
      though the mllama implementation doesn't support that yet.
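      A simplified sketch with hypothetical fields showing the per-batch decision: cross
      attention turns on only once an embedding is actually present, so text tokens that
      precede the image are processed normally.

      package crossattnexample

      // input and sequence are trimmed-down, hypothetical versions of the
      // runner's per-request state.
      type input struct {
          embedding []float32 // non-nil when this input is an image embedding
      }

      type sequence struct {
          crossAttention bool // latched once an embedding has been seen or cached
      }

      // useCrossAttention enables cross attention only from the point where an
      // embedding appears in the batch (or was already latched for this sequence).
      func (s *sequence) useCrossAttention(batch []input) bool {
          for _, in := range batch {
              if in.embedding != nil {
                  s.crossAttention = true
                  break
              }
          }
          return s.crossAttention
      }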
      26acdcf4
    • Give unicode test more time to run (#7437) · 921779bb
      Daniel Hiltgen authored
      * Give unicode test more time to run
      
      Some slower GPUs (or partial CPU/GPU loads) can take more than the default 30s to complete this test
      
      * Give more time for concurrency test
      
      CPU inference can be very slow under stress
      921779bb
  5. 30 Oct, 2024 6 commits
    • Refine default thread selection for NUMA systems (#7322) · 16f4eabe
      Daniel Hiltgen authored
      Until we have full NUMA support, this adjusts the default thread selection
      algorithm to count up the number of performance cores across all sockets.
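      A sketch of the counting rule only (the real code reads CPU topology from the OS):
      sum performance cores over every socket rather than using a single socket's count.

      package threadsexample

      // socket is a hypothetical summary of one CPU package's core counts.
      type socket struct {
          PerformanceCores int
          EfficiencyCores  int
      }

      // defaultThreads counts performance cores across all sockets; this is the
      // stopgap default until full NUMA support lands.
      func defaultThreads(sockets []socket) int {
          n := 0
          for _, s := range sockets {
              n += s.PerformanceCores
          }
          if n < 1 {
              n = 1
          }
          return n
      }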
      16f4eabe
    • runner.go: Better abstract vision model integration · c826e574
      Jesse Gross authored
      - Update mllama to take the cross attention state as embeddings in
        a batch, more similar to how Llava handles it. This improves
        integration with the input cache.
      - Pass locations in a prompt for embeddings using tags similar to Llava.
      - Abstract the interface to vision models so the main runner accesses
        Clip and Mllama similarly (see the sketch below).
      Co-authored-by: Michael Yang <mxyng@pm.me>
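      A hypothetical sketch of the kind of abstraction described above; the real interface
      in the runner may use different names and signatures.

      package visionexample

      // ImageEmbedder is roughly what the main runner needs from a vision model:
      // turn image bytes into embeddings that can go into the input cache.
      type ImageEmbedder interface {
          EmbedImage(image []byte) ([][]float32, error)
      }

      // Both a Clip-backed and an Mllama-backed model would satisfy the same
      // interface, so the runner's batching code never branches on model type.
      type clipModel struct{}
      type mllamaModel struct{}

      func (clipModel) EmbedImage(image []byte) ([][]float32, error)   { return nil, nil }
      func (mllamaModel) EmbedImage(image []byte) ([][]float32, error) { return nil, nil }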
      c826e574
    • Soften windows clang requirement (#7428) · 712e99d4
      Daniel Hiltgen authored
      The build will no longer error when compiled with regular gcc on Windows. To help
      triage issues that may come in related to different compilers, the runner now
      reports the compiler used by cgo.
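      One hedged way such reporting can be done, via predefined C preprocessor macros
      visible to cgo; this is illustrative and not necessarily how the runner implements it.

      package buildinfoexample

      /*
      #if defined(__clang__)
      static const char *cgo_compiler(void) { return "clang"; }
      #elif defined(__GNUC__)
      static const char *cgo_compiler(void) { return "gcc"; }
      #else
      static const char *cgo_compiler(void) { return "unknown"; }
      #endif
      */
      import "C"

      // CompilerUsed reports which C compiler built the cgo parts, so the
      // runner can log it at startup.
      func CompilerUsed() string {
          return C.GoString(C.cgo_compiler())
      }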
      712e99d4
    • Remove submodule and shift to Go server - 0.4.0 (#7157) · b754f5a6
      Daniel Hiltgen authored
      * Remove llama.cpp submodule and shift new build to top
      
      * CI: install msys and clang gcc on win
      
      Needed for deepseek to work properly on windows
      b754f5a6
    • Move windows app out of preview (#7347) · a805e594
      Daniel Hiltgen authored
      a805e594
    • windows: Support alt install paths, fit and finish (#6967) · 91dfbb1b
      Daniel Hiltgen authored
      * windows: Support alt install paths
      
      Advanced users are leveraging innosetup's /DIR switch to target
      an alternate location, but the app got confused when files were not in the LocalAppData dir.
      This also hardens the server path lookup code for a future attempt to unify with a ./bin prefix.
      
      * Fit and finish improvements for windows app
      
      Document alternate install location instructions for binaries and model.
      Pop up progress UI for upgrades (automatic, with cancel button).
      Expose non-default port in menu to disambiguate multiple instances.
      Set minimum Windows version to 10 22H2
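      An illustrative sketch of a hardened lookup, with hypothetical file names and paths:
      try several candidate install locations instead of assuming LocalAppData.

      package installexample

      import (
          "os"
          "path/filepath"
      )

      // findServer returns the first existing server binary among candidate
      // install locations: next to the running executable, a ./bin prefix,
      // and the default per-user directory. Paths here are examples only.
      func findServer() (string, error) {
          exe, err := os.Executable()
          if err != nil {
              return "", err
          }
          dir := filepath.Dir(exe)
          candidates := []string{
              filepath.Join(dir, "ollama.exe"),
              filepath.Join(dir, "bin", "ollama.exe"),
              filepath.Join(os.Getenv("LOCALAPPDATA"), "Programs", "Ollama", "ollama.exe"),
          }
          for _, c := range candidates {
              if _, err := os.Stat(c); err == nil {
                  return c, nil
              }
          }
          return "", os.ErrNotExist
      }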
      91dfbb1b
  6. 29 Oct, 2024 4 commits
  7. 28 Oct, 2024 1 commit
  8. 27 Oct, 2024 1 commit
  9. 26 Oct, 2024 2 commits
    • Fix deepseek deseret regex (#7369) · 099f7077
      Daniel Hiltgen authored
      On Windows builds compiled with gcc, the C++ regex library failed to handle
      the Deseret script characters.
      099f7077
    • Better support for AMD multi-GPU on linux (#7212) · d7c94e0c
      Daniel Hiltgen authored
      * Better support for AMD multi-GPU
      
      This resolves a number of problems related to AMD multi-GPU setups on linux.
      
      The numeric IDs used by rocm are not the same as the numeric IDs exposed in
      sysfs, although the ordering is consistent. We have to count up the valid gfx
      devices (major/minor/patch with non-zero values) ourselves, starting at zero.
      
      There are 3 different env vars for selecting GPUs, and only ROCR_VISIBLE_DEVICES
      supports UUID-based identification, so we should favor that one and try
      to use UUIDs when detected, to avoid potential ordering bugs with numeric IDs.
      
      * ROCR_VISIBLE_DEVICES only works on linux
      
      Use HIP_VISIBLE_DEVICES, which only accepts numeric IDs, on Windows.
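      A sketch of that selection policy with hypothetical types: number only the valid
      gfx devices, prefer UUIDs via ROCR_VISIBLE_DEVICES on Linux, and fall back to
      numeric HIP_VISIBLE_DEVICES elsewhere.

      package amdexample

      import (
          "runtime"
          "strconv"
          "strings"
      )

      // gfxDevice is a hypothetical record for one candidate GPU.
      type gfxDevice struct {
          Major, Minor, Patch int
          UUID                string
      }

      // visibleDevicesEnv builds the environment setting used to pin the
      // selected GPUs for the runner subprocess.
      func visibleDevicesEnv(devs []gfxDevice) string {
          ids := make([]string, 0, len(devs))
          rocmID := 0 // rocm numbers only the valid gfx devices, starting at zero
          for _, d := range devs {
              if d.Major == 0 && d.Minor == 0 && d.Patch == 0 {
                  continue // not a usable gfx target
              }
              if runtime.GOOS == "linux" && d.UUID != "" {
                  // ROCR_VISIBLE_DEVICES accepts UUIDs, avoiding ordering bugs.
                  ids = append(ids, d.UUID)
              } else {
                  ids = append(ids, strconv.Itoa(rocmID))
              }
              rocmID++
          }
          key := "HIP_VISIBLE_DEVICES" // numeric IDs only; used on Windows
          if runtime.GOOS == "linux" {
              key = "ROCR_VISIBLE_DEVICES"
          }
          return key + "=" + strings.Join(ids, ",")
      }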
      d7c94e0c
  10. 25 Oct, 2024 2 commits
  11. 24 Oct, 2024 1 commit
  12. 23 Oct, 2024 1 commit
  13. 22 Oct, 2024 5 commits
  14. 19 Oct, 2024 1 commit
  15. 18 Oct, 2024 1 commit
  16. 17 Oct, 2024 3 commits
    • llama: Decouple patching script from submodule (#7139) · bf4018b9
      Daniel Hiltgen authored
      * Refine llama.cpp vendoring workflow tools
      
      Switch from the sync.sh script over to make-based tooling
      
      * Run new make sync and patch flow
      bf4018b9
    • llama: add compiler tags for cpu features (#7137) · f86d00cd
      Daniel Hiltgen authored
      This adds the ability to customize the default runner with user-specified flags
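      A minimal sketch of gating CPU-feature flags behind a Go build tag; the tag and
      flag names here are illustrative, not necessarily the ones the project uses.

      //go:build avx

      package cpuexample

      /*
      #cgo CFLAGS: -mavx
      */
      import "C"

      // HasAVX reports that this binary was built with the avx tag enabled.
      const HasAVX = true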
      f86d00cd
    • IBM granite/granitemoe architecture support (#6760) · f2890a44
      Gabe Goodhart authored
      * fix(ext_server): Port llama.cpp sampling refactors to ext_server
      
      This was a fairly large changeset. I closely followed the changes here:
      https://github.com/ggerganov/llama.cpp/commit/df270ef74596da8f1178f08991f4c51f18c9ee82
      
      
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat: Bump llama.cpp to the latest master with `granite` support
      
      This does not yet have granite MoE support, but that can come in a
      follow-up PR
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(solar): Update solar patch for llama.cpp bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump llama.cpp for granitemoe support
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump llama.cpp for granitemoe support
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(solar): Update the solar-pro patch for latest llama.cpp bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump to the latest master of llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(patches): Update all patches for latest bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama): Always run sync.sh from the right directory
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/patches): Update llama patches
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama)!: Rough sync with llama.cpp submodule
      
      There are a number of changes that will need to be propagated to llama.go
      before any of this works!
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/patches): Add a patch and update for missing ggml-impl.h include
      
      This include is where the ggml_cgraph struct is defined. It is included in
      many of the .c files to define the forward declaration in ggml.h. It seems
      that with the subset of code included here, the import was somehow lost (or
      out-of-order) when building, so adding this include to llama.cpp fixes the
      missing definition.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Add missing log.cpp
      
      This was added as part of the logging overhaul done in llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Overhaul use of sampling module for llama.cpp changes
      
      The changes here reflect the changes made in the big llama.cpp sampling PR
      https://github.com/ggerganov/llama.cpp/pull/9294
      
      
      
      The sampling functionality is now broken into the base interface
      (llama_sampler) and the generation implementation (gpt_sampler). The
      changes here reflect that. Since the sampling.h/sampling.cpp code uses C++
      STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow Go to
      access a pure-C interface (see the sketch at the end of this entry).
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
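      A hedged sketch of that wrapper pattern: Go cannot call C++ directly, so cgo talks
      to a pure-C surface around an opaque handle. The C functions below are stubs
      standing in for sampling_ext; names and signatures are illustrative.

      package samplingexample

      /*
      // Stub stand-ins for a pure-C wrapper; in the real code the declarations
      // live in a C header and the implementation is C++ behind extern "C".
      #include <stdlib.h>
      typedef struct sampler { float temperature; } sampler;
      static sampler *sampler_init(float temperature) {
          sampler *s = malloc(sizeof(sampler));
          if (s) s->temperature = temperature;
          return s;
      }
      static void sampler_free(sampler *s) { free(s); }
      static int sampler_sample(sampler *s, const float *logits, int n) {
          // a real implementation would forward to the C++ gpt_sampler
          int best = 0;
          for (int i = 1; i < n; i++) if (logits[i] > logits[best]) best = i;
          return best;
      }
      */
      import "C"

      import (
          "errors"
          "unsafe"
      )

      // Sampler hides the C handle so Go callers never touch C++ types directly.
      type Sampler struct{ c *C.sampler }

      func NewSampler(temperature float32) (*Sampler, error) {
          s := C.sampler_init(C.float(temperature))
          if s == nil {
              return nil, errors.New("sampler init failed")
          }
          return &Sampler{c: s}, nil
      }

      // Sample returns the index of the chosen token for the given logits.
      func (s *Sampler) Sample(logits []float32) int {
          if len(logits) == 0 {
              return -1
          }
          return int(C.sampler_sample(s.c, (*C.float)(unsafe.Pointer(&logits[0])), C.int(len(logits))))
      }

      func (s *Sampler) Free() { C.sampler_free(s.c) }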
      
      * fix(llama): Fix the impl of SampleTokenGreedy for new sampling
      
      I don't think this method is currently used, so it could probably just be
      removed so that all sampling goes through the GPT interface, but in the
      interest of doing no harm, this should keep the method working as expected.
      
      Branch: IBMGraniteArchitectureSupport
      
      * fix(llama): Remove unused SampleTokenGreedy
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(sync): Remove bash-specific change to sync.sh
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * chore(gofumpt): Format on llama.go to pass linting
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llm): Fix missing <thread> include in ext_server
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Remove TODO about grammar_first
      
      This feature was not used/needed previously so should be fine without
      plumbing it through now.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Better naming for sampling wrapper and args
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Fix patch 05 to use new wrapper api and re-sync
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * runner: Flush pending responses before returning
      
      If there are any pending responses (such as from potential stop
      tokens) then we should send them back before ending the sequence.
      Otherwise, we can be missing tokens at the end of a response.
      
      Fixes #6707
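      A sketch of the flush-before-finish idea with hypothetical types: anything held back
      while matching stop tokens is sent before the stream is closed.

      package flushexample

      // sequence is a trimmed-down, hypothetical version of the runner's
      // per-request state.
      type sequence struct {
          pending   []string      // pieces held back while matching stop tokens
          responses chan<- string // stream back to the client
      }

      // flushPending sends any buffered pieces to the client.
      func (s *sequence) flushPending() {
          for _, p := range s.pending {
              s.responses <- p
          }
          s.pending = s.pending[:0]
      }

      // finish flushes pending pieces before closing the stream; otherwise the
      // final tokens of the response would be lost.
      func (s *sequence) finish() {
          s.flushPending()
          close(s.responses)
      }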
      
      * fix(llama/sampling): Use gpt_sampler with a forward declaration
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Remove unnecessary patch for gguf impl header
      
      This was caused by an earlier mistake in the embeddings patch that was
      dereferencing the pointer instead of using the wrapper API.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llm): Remove use of deprecated --log-disable flag
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      ---------
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      f2890a44