  1. 21 Nov, 2024 1 commit
  2. 20 Nov, 2024 5 commits
    • runner.go: Truncate inputs that exceed context rather than shifting · c4b34f2a
      Jesse Gross authored
      Previous versions of the runner would truncate inputs to the context
      window before beginning processing. The main processing loop relied
      on this behavior if the context needed to be shifted later (due to
      token generation). If truncation did not occur then invariants
      would be broken, causing crashes or infinite loops.
      
      Later versions attempted to fix these bugs and make the logic less
      subtle so that all inputs could be handled. Truncation was removed
      to make things consistent.
      
      However, truncation is much faster than processing and shifting, so
      removing it caused performance problems when the input vastly exceeded
      the context size. This restores the input truncation as a performance
      optimization while keeping the more robust processing logic.
      
      Fixes #7762
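
The truncation described above can be illustrated with a minimal Go sketch. The helper name `truncateInputs`, the `numCtx` and `numKeep` parameters, and the policy of keeping the first `numKeep` tokens plus the most recent ones are assumptions for illustration, not the code from this commit.

```go
package main

import "fmt"

// truncateInputs is a hypothetical helper. If the input is longer than the
// context window (numCtx), it keeps the first numKeep tokens and fills the
// remaining space with the most recent tokens.
func truncateInputs(inputs []int, numCtx, numKeep int) []int {
	if len(inputs) <= numCtx {
		return inputs // already fits, nothing to do
	}
	if numKeep < 0 {
		numKeep = 0
	}
	if numKeep > numCtx {
		numKeep = numCtx
	}
	tail := numCtx - numKeep
	out := make([]int, 0, numCtx)
	out = append(out, inputs[:numKeep]...)
	out = append(out, inputs[len(inputs)-tail:]...)
	return out
}

func main() {
	inputs := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	fmt.Println(truncateInputs(inputs, 6, 2)) // [1 2 7 8 9 10]
}
```

Dropping the middle of the input up front is a single slice operation, whereas processing the full input would require repeated decode and shift steps.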
    • runner.go: Don't add inputs to cache view until actually processed · c3ff9164
      Jesse Gross authored
      We need to track which tokens are in the cache ourselves. We currently
      add tokens to the cache tracker when we add them to batch but they are
      not actually in the cache until we call Decode. This can cause
      confusion when we are shifting the cache.
      
      Avoids "could not find a KV slot for the batch" issues.
      
      Bug #7545
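
A minimal sketch of the idea, assuming a hypothetical `slot` type with separate `cached` and `pending` lists: tokens are recorded as cached only after the decode step has succeeded. None of these names come from the commit itself.

```go
package main

import "fmt"

// slot is a hypothetical stand-in for the runner's per-sequence cache tracker.
type slot struct {
	cached  []int // tokens known to be in the KV cache
	pending []int // tokens added to the batch but not yet decoded
}

// addToBatch records a token as pending; it is not yet counted as cached.
func (s *slot) addToBatch(tok int) {
	s.pending = append(s.pending, tok)
}

// commit moves pending tokens into the cached view. Call it only after
// decode has succeeded.
func (s *slot) commit() {
	s.cached = append(s.cached, s.pending...)
	s.pending = s.pending[:0]
}

func main() {
	s := &slot{}
	s.addToBatch(1)
	s.addToBatch(2)
	fmt.Println(len(s.cached), len(s.pending)) // 0 2
	s.commit()
	fmt.Println(len(s.cached), len(s.pending)) // 2 0
}
```

With this split, any cache shifting that happens before decode sees only tokens that are really in the cache.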
    • runner.go: Hard fail on errors rather than potentially infinite looping · 3fc1dc0e
      Jesse Gross authored
      We try to recover from errors by dropping the tokens that caused the
      problem and re-trying. However, dropping the tokens is not correct
      and continuing often leads to infinite loops. To avoid this, we
      end the sequence if such a condition is detected, which is also
      surprising.
      
      At this point, it is better to just report the error. This will make
      it easier to find problems and the alternatives are perhaps even more
      surprising to users.
      
      This is not a very satisfactory solution either - we should isolate
      the error and return it to the user without killing the whole process.
      However, this is an incremental step and consistent with most other
      failures (which either manifest as abort() or panic).
    • runner.go: Retry decoding after defragmentation if needed · 7121dfa3
      Jesse Gross authored
      Fragmentation of the KV cache can occur due to cache shifting or
      different sequences getting processed. Decode uses a heuristic to
      decide if it should defrag. However, this heuristic isn't 100%
      accurate, so decoding can sometimes fail by surprise.
      
      For these cases, if decode indicates that there is no KV cache space,
      we should defrag and then try again.
    • runner.go: Use correct index when retrieving embedding results · 5f68fcab
      Jesse Gross authored
      This doesn't have any impact currently because NUM_PARALLEL is forced
      to 1 for embeddings, so both indices will always be 0.
  3. 19 Nov, 2024 1 commit
  4. 15 Nov, 2024 2 commits
    • runner.go: Propagate panics back to the user. · d875e99e
      Jesse Gross authored
      This is a partial revert of 8a35bb92
      "runner.go: Increase survivability of main processing loop", removing
      the panic handler.
      
      Although we want to avoid errors taking down the runner, we also
      should make the user aware of problems when they happen. In the
      future, we can restructure things so both parts are true.
    • runner.go: Increase survivability of main processing loop · 8a35bb92
      Jesse Gross authored
      Currently, if an error occurs during the prep stages (such as
      tokenizing) of a single request, it will only affect that request.
      However, if an error happens during decoding, it can take down the
      entire runner.
      
      Instead, it's better to drop the tokens that triggered the error and try to
      keep going. However, we also need to stop when we run out of tokens,
      otherwise, this just causes an infinite loop. This is likely the cause
      of at least some of the hanging issues that have been reported.
      
      Bug #7573
  5. 14 Nov, 2024 3 commits
    • runner.go: Don't trim whitespace from inputs · c25ffde9
      Jesse Gross authored
      It's possible to get prompts that consist entirely of whitespace -
      this is most likely to happen when generating embeddings. Currently,
      we will trim this away, leaving an empty prompt, which will then
      generate an error.
      
      Generating embeddings from whitespace should not trigger an error,
      as this may break pipelines. It's better to just leave the whitespace
      in place and process what we are given. This is consistent with
      past versions of Ollama.
      
      Bug #7578
    • runner.go: Enforce NUM_PARALLEL directly in the runner · 17b386a8
      Jesse Gross authored
      NUM_PARALLEL is currently enforced by the Ollama server process - it
      will only issue requests to the runner if the maximum number of
      concurrent requests has not been exceeded. Although this should
      be sufficient, it is good for the runner to protect its own data
      structures. Currently, if too many requests get through to the
      runner, they will just get stuck and never return.
      
      This may help with reports of Ollama hanging, though it is unclear
      how it would actually occur.
      
      Bug #7573
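
One common way for a process to cap concurrency on its own is a counting semaphore. The sketch below uses a buffered channel; the `limiter` type and its methods are hypothetical and only illustrate the idea, not how the runner implements it.

```go
package main

import "fmt"

// limiter is a hypothetical counting semaphore backed by a buffered channel.
type limiter struct {
	slots chan struct{}
}

// newLimiter allows at most n requests to hold a slot at once.
func newLimiter(n int) *limiter {
	return &limiter{slots: make(chan struct{}, n)}
}

// tryAcquire takes a slot if one is free and reports whether it succeeded.
func (l *limiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release returns a previously acquired slot.
func (l *limiter) release() {
	<-l.slots
}

func main() {
	l := newLimiter(2) // stands in for NUM_PARALLEL=2
	fmt.Println(l.tryAcquire(), l.tryAcquire(), l.tryAcquire()) // true true false
	l.release()
	fmt.Println(l.tryAcquire()) // true
}
```

A blocking variant would send on the channel without the `default` case, so that excess requests wait for a free slot instead of being refused.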
    • fix(mllama): sync backend between batches · 5b3393b6
      Michael Yang authored
  6. 12 Nov, 2024 3 commits
    • runner.go: Fix off-by-one for num predicted · d7eb05b9
      Jesse Gross authored
    • Jetpack support for Go server (#7217) · df011054
      Daniel Hiltgen authored
      This adds support for the Jetson JetPack variants into the Go runner
    • runner.go: Make KV entry accounting more robust · 65973ceb
      Jesse Gross authored
      The structure of the accounting for KV cache shifting was carried
      over from the old runner but it now doesn't feel natural with the new
      runner. There are a number of invariants that should hold true but
      are difficult to reason about. There is at least one bug report
      that would imply that the invariants are not holding.
      
      This reduces the number of implicit assumptions and is more forgiving
      of unexpected situations. It also improves behavior around which input
      tokens are kept when truncation occurs.
      
      Bug #7545
  7. 08 Nov, 2024 1 commit
  8. 07 Nov, 2024 3 commits
  9. 06 Nov, 2024 1 commit
  10. 02 Nov, 2024 2 commits
    • llama: Improve error handling · 312d9de1
      Jesse Gross authored
      Check for NULL return values from llama.cpp in more places and
      convert them into Go errors, which should make debugging easier
      in the future rather than having hidden surprises in our data
      structures.
    • runner.go: Only allocate 1 element embedding batches for mllama · a103dae0
      Jesse Gross authored
      Mllama has large embeddings (100 MB per image) and each embedding is
      represented as 1 token when passed to llama.cpp. Batches are pre-
      allocated for the size of the tokens times the batch size, so this
      results in allocations of over 50 GB at the default batch size.
      On some systems, these mallocs will fail.
      
      Since an image is represented as a single token and mllama doesn't
      support more than 1 image per request, we only need to allocate a
      batch size of 1, which is much more reasonable. In addition, for
      non-multimodal models, we don't need to allocate the embedding
      batches at all.
      
      Fixes #7464
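
The allocation sizes above follow from simple arithmetic, sketched here in Go. The 100 MB figure comes from the commit message; the batch size of 512 is an assumption chosen to match the "over 50 GB" figure, and `embedBatchBytes` is a hypothetical helper.

```go
package main

import "fmt"

const (
	embedBytes = 100 * 1000 * 1000 // about 100 MB per image embedding (from the commit message)
	batchSize  = 512               // assumed default batch size
)

// embedBatchBytes returns the memory needed to pre-allocate an embedding
// batch with the given number of slots.
func embedBatchBytes(slots int) int64 {
	return int64(slots) * embedBytes
}

func main() {
	fmt.Printf("%.1f GB\n", float64(embedBatchBytes(batchSize))/1e9) // 51.2 GB
	fmt.Printf("%.1f GB\n", float64(embedBatchBytes(1))/1e9)         // 0.1 GB
}
```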
  11. 31 Oct, 2024 1 commit
    • runner.go: Don't set cross attention before sending embeddings · 26acdcf4
      Jesse Gross authored
      Currently if an input has embeddings at any point then we will set
      cross attention to true from the beginning. This means that any
      tokens before the embeddings are sent will incorrectly have cross
      attention layers applied.
      
      This only sets cross attention when we have an embedding, either
      previously in this sequence or in the cache. It also makes cross
      attention capable of supporting parallelism at the runner level,
      though the mllama implementation doesn't support that yet.
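
A rough sketch of the rule described above, assuming a hypothetical `input` type and a helper `crossAttentionStates`: cross attention stays off until an embedding has been seen, either earlier in the sequence or already in the cache. Whether the flag applies to the embedding input itself is also an assumption here.

```go
package main

import "fmt"

// input is a hypothetical runner input: either a text token or an image
// embedding (embed is non-nil for embeddings).
type input struct {
	token int
	embed []float32
}

// crossAttentionStates reports, for each input in order, whether cross
// attention would be enabled when that input is processed.
func crossAttentionStates(inputs []input, embedInCache bool) []bool {
	states := make([]bool, len(inputs))
	seen := embedInCache
	for i, in := range inputs {
		if in.embed != nil {
			seen = true
		}
		states[i] = seen
	}
	return states
}

func main() {
	inputs := []input{
		{token: 1},
		{embed: []float32{0.5}},
		{token: 2},
	}
	fmt.Println(crossAttentionStates(inputs, false)) // [false true true]
	fmt.Println(crossAttentionStates(inputs, true))  // [true true true]
}
```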
  12. 30 Oct, 2024 3 commits
  13. 29 Oct, 2024 2 commits
    • Switch windows to clang (#7407) · c9ca3861
      Daniel Hiltgen authored
      * Switch over to clang for deepseek on windows
      
      The patch for deepseek requires clang on windows. gcc on windows
      has a buggy c++ library and can't handle the unicode characters
      
      * Fail fast with wrong compiler on windows
      
      Avoid users mistakenly building with GCC when we need clang
    • runner.go: Better handle return NULL values from llama.cpp · de1557a0
      Jesse Gross authored
      Llama.cpp sometimes returns NULL as a return value to report an
      error. We should explicitly check for this and convert it to a Go
      error rather than putting NULL in our data structures and waiting
      for it to blow up later.
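
The pattern can be shown in plain Go without cgo. Here `loadNative` is a hypothetical stand-in for a C function that returns NULL on failure, and `loadModel` is a hypothetical wrapper that turns the nil pointer into a Go error at the call site.

```go
package main

import "fmt"

// model is a hypothetical handle type.
type model struct {
	name string
}

// loadNative stands in for a C call that returns NULL (nil) on failure.
func loadNative(path string) *model {
	if path == "" {
		return nil
	}
	return &model{name: path}
}

// loadModel checks the result immediately and reports failure as an error
// rather than storing a nil pointer for later use.
func loadModel(path string) (*model, error) {
	m := loadNative(path)
	if m == nil {
		return nil, fmt.Errorf("unable to load model: %q", path)
	}
	return m, nil
}

func main() {
	if _, err := loadModel(""); err != nil {
		fmt.Println("error:", err)
	}
	m, _ := loadModel("model.gguf")
	fmt.Println(m.name) // model.gguf
}
```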
  14. 27 Oct, 2024 1 commit
  15. 26 Oct, 2024 1 commit
  16. 25 Oct, 2024 1 commit
  17. 24 Oct, 2024 1 commit
  18. 22 Oct, 2024 2 commits
    • Fix rocm windows build and clean up dependency gathering (#7305) · 5c44461c
      Daniel Hiltgen authored
      On windows ensure windows version define is properly set for rocm.
      Remove duplicate rocm arch flags.
      Resolve wildcards in the targets so parallel builds don't race.
      Use readlink to resolve rocm dependencies since wildcards omit libelf
      Keep windows rocm deps aligned with unified packaging model
    • runner.go: Merge partial unicode characters before sending · 03e40efa
      Jesse Gross authored
      We check for partial unicode characters and accumulate them before
      sending. However, when we did send, we still sent each individual piece
      separately, leading to broken output. This combines everything into
      a single group, which is also more efficient.
      
      This also switches to the built-in check for valid unicode characters,
      which is stricter. After this, we should never send back an invalid
      sequence.
      
      Fixes #7290
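
The accumulate-then-merge behavior can be sketched with Go's standard `unicode/utf8` package. The `flusher` type and its `add` method are hypothetical; only `utf8.ValidString` and `strings.Join` are real library calls.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// flusher is a hypothetical buffer for token pieces that may split a
// multi-byte character.
type flusher struct {
	pending []string
}

// add buffers a piece. Once the buffered pieces form valid UTF-8, it returns
// them merged into a single string and clears the buffer.
func (f *flusher) add(piece string) (string, bool) {
	f.pending = append(f.pending, piece)
	joined := strings.Join(f.pending, "")
	if !utf8.ValidString(joined) {
		return "", false // still a partial character, keep waiting
	}
	f.pending = f.pending[:0]
	return joined, true
}

func main() {
	euro := "€" // three bytes in UTF-8
	f := &flusher{}
	for _, p := range []string{"a", euro[:1], euro[1:2], euro[2:], "b"} {
		if s, ok := f.add(p); ok {
			fmt.Printf("%q\n", s) // prints "a", then "€", then "b"
		}
	}
}
```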
  19. 18 Oct, 2024 1 commit
  20. 17 Oct, 2024 3 commits
    • llama: Decouple patching script from submodule (#7139) · bf4018b9
      Daniel Hiltgen authored
      * Refine llama.cpp vendoring workflow tools
      
      Switch from the sync.sh over to make based tooling
      
      * Run new make sync and patch flow
    • llama: add compiler tags for cpu features (#7137) · f86d00cd
      Daniel Hiltgen authored
      This adds the ability to customize the default runner with user specified flags
    • IBM granite/granitemoe architecture support (#6760) · f2890a44
      Gabe Goodhart authored
      * fix(ext_server): Port llama.cpp sampling refactors to ext_server
      
      This was a fairly large changeset. I closely followed the changes here:
      https://github.com/ggerganov/llama.cpp/commit/df270ef74596da8f1178f08991f4c51f18c9ee82
      
      
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat: Bump llama.cpp to the latest master with `granite` support
      
      This does not yet have granite MoE support, but that can come in a
      follow up PR
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(solar): Update solar patch for llama.cpp bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump llama.cpp for granitemoe support
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump llama.cpp for granitemoe support
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(solar): Update the solar-pro patch for latest llama.cpp bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama.cpp): Bump to the latest master of llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(patches): Update all patches for latest bump
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama): Always run sync.sh from the right directory
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/patches): Update llama patches
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * feat(llama)!: Rough sync with llama.cpp submodule
      
      There are a number of changes that will need to be propagated to llama.go
      before any of this works!
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/patches): Add a patch and update for missing ggml-impl.h include
      
      This include is where the ggml_cgraph struct is defined. It is included in
      many of the .c files to define the forward declaration in ggml.h. It seems
      that with the subset of code included here, the import was somehow lost (or
      out-of-order) when building, so adding this include to llama.cpp fixes the
      missing definition.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Add missing log.cpp
      
      This was added as part of the logging overhaul done in llama.cpp
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Overhaul use of sampling module for llama.cpp changes
      
      The changes here reflect the changes made in the big llama.cpp sampling PR
      https://github.com/ggerganov/llama.cpp/pull/9294
      
      
      
      The sampling functionality is now broken into the base interface
      (llama_sampler) and the generation implementation (gpt_sampler). The
      changes here reflect that. Since the sampling.h/sampling.cpp code uses c++
      STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow go to
      access a pure-C interface.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Fix the impl of SampleTokenGreedy for new sampling
      
      I don't think this method is currently used, so it could probably just be
      removed so that all sampling goes through the GPT interface, but in the
      interest of doing no harm, this should keep the method working as expected.
      
      Branch: IBMGraniteArchitectureSupport
      
      * fix(llama): Remove unused SampleTokenGreedy
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(sync): Remove bash-specific change to sync.sh
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * chore(gofumpt): Format on llama.go to pass linting
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llm): Fix missing <thread> include in ext_server
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Remove TODO about grammar_first
      
      This feature was not used/needed previously so should be fine without
      plumbing it through now.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Better naming for sampling wrapper and args
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Fix patch 05 to use new wrapper api and re-sync
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * runner: Flush pending responses before returning
      
      If there are any pending responses (such as from potential stop
      tokens) then we should send them back before ending the sequence.
      Otherwise, we can be missing tokens at the end of a response.
      
      Fixes #6707
      
      * fix(llama/sampling): Use gpt_sampler with a forward declaration
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llama): Remove unnecessary patch for gguf impl header
      
      This was caused by an earlier mistake in the embeddings patch that was
      dereferencing the pointer instead of using the wrapper API.
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      * fix(llm): Remove use of deprecated --log-disable flag
      
      Branch: IBMGraniteArchitectureSupport
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
      
      ---------
      Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
  21. 16 Oct, 2024 1 commit
    • Move macos v11 support flags to build script (#7203) · 7d6eb0d4
      Daniel Hiltgen authored
      Having v11 support hard-coded into the cgo settings causes warnings
      for newer Xcode versions.  This should help keep the build clean for users
      building from source with the latest tools, while still allowing us to target
      the older OS via our CI processes.
  22. 13 Oct, 2024 1 commit