1. 05 Nov, 2024 3 commits
    • One corrupt manifest should not wedge model operations (#7515) · a4c70fe1
      Daniel Hiltgen authored
      One potential failure mode is an empty file which bubbles up as an EOF error,
      leading to all pulls and listing operations failing.  Instead, continue and
      warn about the corrupt manifest.  This also allows re-pulling the corrupt
      manifest to repair the system.
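The skip-and-warn behaviour described in this commit can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Manifest` type and the `loadManifests` helper are hypothetical, and manifests are passed in as strings so the example stays self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"strings"
)

// Manifest is a hypothetical, pared-down manifest type used only in this sketch.
type Manifest struct {
	SchemaVersion int `json:"schemaVersion"`
}

// loadManifests parses every entry and skips (with a warning) any entry that
// fails to parse, so one bad file cannot fail the whole listing.
func loadManifests(raw map[string]string) map[string]Manifest {
	out := make(map[string]Manifest)
	for name, data := range raw {
		var m Manifest
		// An empty input makes Decode return io.EOF.
		if err := json.NewDecoder(strings.NewReader(data)).Decode(&m); err != nil {
			log.Printf("warning: skipping corrupt manifest %q: %v", name, err)
			continue
		}
		out[name] = m
	}
	return out
}

func main() {
	ms := loadManifests(map[string]string{
		"good":  `{"schemaVersion": 2}`,
		"empty": ``,
	})
	fmt.Println(len(ms))
}
```

Because the corrupt entry is only skipped, a later pull that rewrites the file would make it show up again on the next listing.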
    • prompt: Use a single token when estimating mllama context size · 34a75102
      Jesse Gross authored
      Currently we assume that images take 768 tokens of context size for
      the purposes of clipping old messages that exceed the context window.
      However, our mllama implementation stores the full image embedding
      in a single token. As a result, there is significant waste of context
      space.
      
      Ideally, we would handle this more generically and have the
      implementation report the number of tokens. However, at the moment
      this would just result in a similar set of 'if' conditions in the
      runner plus APIs to report it back. So for now, we just keep this
      simple.
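A rough sketch of how the per-image estimate affects clipping, assuming a simple oldest-first drop policy. The 768 and 1 token figures come from the commit message; `imageTokens`, `message`, and `clip` are hypothetical names for illustration only.

```go
package main

import "fmt"

// imageTokens returns the context cost assumed for one image.
// Hypothetical helper: mllama stores the whole embedding in one token.
func imageTokens(family string) int {
	if family == "mllama" {
		return 1
	}
	return 768
}

// message is a simplified chat message: a text token count plus an image count.
type message struct {
	textTokens int
	images     int
}

// clip drops the oldest messages until the estimated total fits within limit,
// always keeping at least the most recent message.
func clip(msgs []message, family string, limit int) []message {
	cost := func(m message) int { return m.textTokens + m.images*imageTokens(family) }
	total := 0
	for _, m := range msgs {
		total += cost(m)
	}
	start := 0
	for total > limit && start < len(msgs)-1 {
		total -= cost(msgs[start])
		start++
	}
	return msgs[start:]
}

func main() {
	msgs := []message{{100, 0}, {50, 1}, {20, 0}}
	fmt.Println(len(clip(msgs, "mllama", 200)), len(clip(msgs, "other", 200)))
}
```

With the single-token estimate the whole history fits in the example limit, whereas the 768-token estimate forces older messages out.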
  2. 04 Nov, 2024 6 commits
  3. 02 Nov, 2024 4 commits
    • nvidia libs have inconsistent ordering (#7473) · 29ab9fa7
      Daniel Hiltgen authored
      The runtime and management libraries may not always have
      identical ordering, so use the device UUID to correlate instead of ID.
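Correlating by UUID rather than by index might look like the sketch below. The `device` type and `correlate` function are hypothetical and stand in for whatever the two NVIDIA libraries actually report; the UUID strings are made up.

```go
package main

import "fmt"

// device is a hypothetical view of one GPU as reported by a single library.
type device struct {
	Index int
	UUID  string
}

// correlate maps each runtime-library index to the management-library index
// of the same physical GPU, matching on UUID instead of position.
func correlate(runtime, mgmt []device) (map[int]int, error) {
	byUUID := make(map[string]int, len(mgmt))
	for _, d := range mgmt {
		byUUID[d.UUID] = d.Index
	}
	out := make(map[int]int, len(runtime))
	for _, d := range runtime {
		idx, ok := byUUID[d.UUID]
		if !ok {
			return nil, fmt.Errorf("no management device with UUID %s", d.UUID)
		}
		out[d.Index] = idx
	}
	return out, nil
}

func main() {
	runtime := []device{{0, "GPU-aaaa"}, {1, "GPU-bbbb"}}
	mgmt := []device{{0, "GPU-bbbb"}, {1, "GPU-aaaa"}} // opposite ordering
	m, err := correlate(runtime, mgmt)
	fmt.Println(m, err)
}
```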
    • CI: omit unused tools for faster release builds (#7432) · b8d5036e
      Daniel Hiltgen authored
      This leverages caching, and some reduced installer scope to try
      to speed up builds. It also tidies up some windows build logic
      that was only relevant for the older generate/cmake builds.
    • llama: Improve error handling · 312d9de1
      Jesse Gross authored
      Check for NULL return values from llama.cpp in more places and
      convert them into Go errors, which should make debugging easier
      in the future rather than having hidden surprises in our data
      structures.
    • runner.go: Only allocate 1 element embedding batches for mllama · a103dae0
      Jesse Gross authored
      Mllama has large embeddings (100 MB per image) and each embedding is
      represented as 1 token when passed to llama.cpp. Batches are pre-
      allocated for the size of the tokens times the batch size, so this
      results in allocations of over 50 GB at the default batch size.
      On some systems, these mallocs will fail.
      
      Since an image is represented as a single token and mllama doesn't
      support more than 1 image per request, we only need to allocate a
      batch size of 1, which is much more reasonable. In addition, for
      non-multimodal models, we don't need to allocate the embedding
      batches at all.
      
      Fixes #7464
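The arithmetic behind the "over 50 GB" figure can be checked with a short sketch. The 100 MB per-embedding size is from the commit message; the batch size of 512 is an assumption about the default, and `embedBatchSlots` and `embedBatchBytes` are hypothetical helpers rather than the runner's real functions.

```go
package main

import "fmt"

// embeddingBytes is the approximate size of one mllama image embedding
// (100 MB, per the commit message).
const embeddingBytes int64 = 100 * 1000 * 1000

// embedBatchSlots returns how many embedding slots to pre-allocate:
// none for text-only models, one for mllama, the full batch otherwise.
func embedBatchSlots(batchSize int, multimodal, mllama bool) int {
	switch {
	case !multimodal:
		return 0
	case mllama:
		return 1
	default:
		return batchSize
	}
}

// embedBatchBytes is the memory needed for the given number of slots.
func embedBatchBytes(slots int) int64 {
	return int64(slots) * embeddingBytes
}

func main() {
	const batchSize = 512 // assumed default batch size
	fmt.Println(embedBatchBytes(batchSize))                              // old behaviour
	fmt.Println(embedBatchBytes(embedBatchSlots(batchSize, true, true))) // mllama after the change
}
```

Under these assumptions the old allocation is 512 × 100 MB = 51.2 GB, while a single-slot batch needs only 100 MB.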
  4. 01 Nov, 2024 3 commits
  5. 31 Oct, 2024 2 commits
    • runner.go: Don't set cross attention before sending embeddings · 26acdcf4
      Jesse Gross authored
      Currently if an input has embeddings at any point then we will set
      cross attention to true from the beginning. This means that any
      tokens before the embeddings are sent will incorrectly have cross
      attention layers applied.
      
      This only sets cross attention when we have an embedding, either
      previously in this sequence or in the cache. It also makes cross
      attention capable of supporting parallelism at the runner level,
      though the mllama implementation doesn't support that yet.
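One way to picture the new rule is a flag that turns on at the first embedding and stays on. The sketch below is an illustration under that reading of the commit message; the `input` type and `crossAttnFlags` function are hypothetical and do not mirror the runner's data structures.

```go
package main

import "fmt"

// input is a simplified sequence element: either a text token or an
// image embedding (embed != nil).
type input struct {
	token int
	embed []float32
}

// crossAttnFlags reports, for each input, whether cross attention should be
// enabled when it is processed. It is enabled only once an embedding has been
// seen, either earlier in this sequence or already in the cache.
func crossAttnFlags(inputs []input, embedInCache bool) []bool {
	flags := make([]bool, len(inputs))
	seen := embedInCache
	for i, in := range inputs {
		if in.embed != nil {
			seen = true
		}
		flags[i] = seen
	}
	return flags
}

func main() {
	seq := []input{{token: 1}, {token: 2}, {embed: []float32{0.5}}, {token: 3}}
	fmt.Println(crossAttnFlags(seq, false))
}
```

Tokens that precede the embedding keep the flag off, which is the case the previous behaviour got wrong.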
    • Give unicode test more time to run (#7437) · 921779bb
      Daniel Hiltgen authored
      * Give unicode test more time to run
      
      Some slower GPUs (or partial CPU/GPU loads) can take more than the default 30s to complete this test
      
      * Give more time for concurrency test
      
      CPU inference can be very slow under stress
  6. 30 Oct, 2024 6 commits
    • Refine default thread selection for NUMA systems (#7322) · 16f4eabe
      Daniel Hiltgen authored
      Until we have full NUMA support, this adjusts the default thread selection
      algorithm to count up the number of performance cores across all sockets.
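Counting performance cores across sockets could be sketched like this. The `socket` type and `defaultThreads` function are hypothetical; how the real code discovers core types is not described in the commit message.

```go
package main

import "fmt"

// socket is a hypothetical description of one CPU package.
type socket struct {
	PerformanceCores int
	EfficiencyCores  int
}

// defaultThreads sums the performance cores of every socket, ignoring
// efficiency cores, to pick a default thread count.
func defaultThreads(sockets []socket) int {
	total := 0
	for _, s := range sockets {
		total += s.PerformanceCores
	}
	return total
}

func main() {
	twoSocket := []socket{{PerformanceCores: 16}, {PerformanceCores: 16}}
	fmt.Println(defaultThreads(twoSocket))
}
```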
    • runner.go: Better abstract vision model integration · c826e574
      Jesse Gross authored
      - Update mllama to take the cross attention state as embeddings in
        a batch, more similar to how Llava handles it. This improves
        integration with the input cache.
      - Pass locations in a prompt for embeddings using tags similar to Llava.
      - Abstract interface to vision models so the main runner accesses Clip
        and Mllama similarly.

      Co-authored-by: Michael Yang <mxyng@pm.me>
    • Soften windows clang requirement (#7428) · 712e99d4
      Daniel Hiltgen authored
      This will no longer error if built with regular gcc on windows.  To help
      triage issues that may come in related to different compilers, the runner now
      reports the compiler used by cgo.
    • Remove submodule and shift to Go server - 0.4.0 (#7157) · b754f5a6
      Daniel Hiltgen authored
      * Remove llama.cpp submodule and shift new build to top
      
      * CI: install msys and clang gcc on win
      
      Needed for deepseek to work properly on windows
    • Move windows app out of preview (#7347) · a805e594
      Daniel Hiltgen authored
    • windows: Support alt install paths, fit and finish (#6967) · 91dfbb1b
      Daniel Hiltgen authored
      * windows: Support alt install paths
      
      Advanced users are leveraging innosetup's /DIR switch to target
      an alternate location, but we get confused by things not existing in the LocalAppData dir.
      This also hardens the server path lookup code for a future attempt to unify with a ./bin prefix
      
      * Fit and finish improvements for windows app
      
      Document alternate install location instructions for binaries and model.
      Pop up progress UI for upgrades (automatic, with cancel button).
      Expose non-default port in menu to disambiguate multiple instances.
      Set minimum Windows version to 10 22H2
  7. 29 Oct, 2024 4 commits
  8. 28 Oct, 2024 1 commit
  9. 27 Oct, 2024 1 commit
  10. 26 Oct, 2024 2 commits
    • Fix deepseek deseret regex (#7369) · 099f7077
      Daniel Hiltgen authored
      On windows compiled with gcc the c++ regex library failed to handle
      the characters
    • Better support for AMD multi-GPU on linux (#7212) · d7c94e0c
      Daniel Hiltgen authored
      * Better support for AMD multi-GPU
      
      This resolves a number of problems related to AMD multi-GPU setups on linux.
      
      The numeric IDs used by rocm are not the same as the numeric IDs exposed in
      sysfs although the ordering is consistent.  We have to count up from the first
      valid gfx (major/minor/patch with non-zero values) we find starting at zero.
      
      There are 3 different env vars for selecting GPUs, and only ROCR_VISIBLE_DEVICES
      supports UUID based identification, so we should favor that one, and try
      to use UUIDs if detected to avoid potential ordering bugs with numeric IDs
      
      * ROCR_VISIBLE_DEVICES only works on linux
      
      On windows, use HIP_VISIBLE_DEVICES, which only supports numeric IDs.
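The selection policy described in this commit could be sketched as follows. Only the two environment variable names come from the commit message; the `gpu` type, the `visibleDevices` function, and the example UUID strings are hypothetical, and the fallback to numeric IDs on linux is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpu is a hypothetical description of one detected AMD GPU.
type gpu struct {
	ID   int    // numeric ID
	UUID string // empty if no UUID was detected
}

// visibleDevices picks the environment variable and value used to select GPUs.
// On linux it prefers ROCR_VISIBLE_DEVICES with UUIDs when every GPU has one,
// falling back to numeric IDs (assumed). Elsewhere it uses HIP_VISIBLE_DEVICES
// with numeric IDs.
func visibleDevices(goos string, gpus []gpu) (string, string) {
	ids := make([]string, 0, len(gpus))
	uuids := make([]string, 0, len(gpus))
	for _, g := range gpus {
		ids = append(ids, strconv.Itoa(g.ID))
		if g.UUID != "" {
			uuids = append(uuids, g.UUID)
		}
	}
	if goos != "linux" {
		return "HIP_VISIBLE_DEVICES", strings.Join(ids, ",")
	}
	if len(gpus) > 0 && len(uuids) == len(gpus) {
		return "ROCR_VISIBLE_DEVICES", strings.Join(uuids, ",")
	}
	return "ROCR_VISIBLE_DEVICES", strings.Join(ids, ",")
}

func main() {
	gpus := []gpu{{0, "GPU-1111"}, {1, "GPU-2222"}}
	k, v := visibleDevices("linux", gpus)
	fmt.Println(k + "=" + v)
	k, v = visibleDevices("windows", gpus)
	fmt.Println(k + "=" + v)
}
```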
  11. 25 Oct, 2024 2 commits
  12. 24 Oct, 2024 1 commit
  13. 23 Oct, 2024 1 commit
  14. 22 Oct, 2024 4 commits