  1. 15 May, 2025 1 commit
      ollamarunner: Base cached tokens on current prompt · 499ae731
      Jesse Gross authored
      When we restore a sequence from the cache, we split the prompt into
      the already used tokens (stored in the cache) and new tokens that
      need to be processed. Currently, the references to the used tokens
      are coming from the stored previous sequence.
      
      However, even though we know that the used tokens are semantically
      equivalent to the prefix of the prompt, tokens can contain pointers
      which are no longer valid. As a result, it is better to get the
      used tokens from the prompt, which has currently valid pointers.
      
      This doesn't currently have any impact because it isn't possible
      to reuse the pointers (which are tensors) anyway. However, it
      becomes an issue once we can.
      499ae731
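The split described in this commit can be illustrated with a minimal Go sketch. It is not the runner's actual code: the `input` type, `commonPrefixLen`, and `splitPrompt` are hypothetical names, and a `*[]float32` stands in for the tensor pointer. The point it shows is that both halves are sliced from the current prompt, so the "used" half never carries pointers from the stored sequence.

```go
package main

import "fmt"

// input is a hypothetical stand-in for one prompt element: a token ID plus
// an optional pointer to backing data (a tensor in the real runner).
type input struct {
	token int32
	data  *[]float32
}

// commonPrefixLen returns how many leading elements of a and b carry the
// same token ID. Pointers are deliberately ignored in the comparison.
func commonPrefixLen(a, b []input) int {
	n := 0
	for n < len(a) && n < len(b) && a[n].token == b[n].token {
		n++
	}
	return n
}

// splitPrompt divides prompt into the part already held in the cache and
// the part that still needs processing. Both slices come from prompt, not
// from the cached sequence, so any pointers they hold are current.
func splitPrompt(cached, prompt []input) (used, pending []input) {
	n := commonPrefixLen(cached, prompt)
	return prompt[:n], prompt[n:]
}

func main() {
	stale := []float32{1}
	fresh := []float32{1}
	cached := []input{{token: 1, data: &stale}, {token: 2}}
	prompt := []input{{token: 1, data: &fresh}, {token: 2}, {token: 3}}

	used, pending := splitPrompt(cached, prompt)
	// The used prefix points at the prompt's data, not the cached copy.
	fmt.Println(len(used), len(pending), used[0].data == &fresh)
}
```

Taking the prefix from the stored sequence instead would yield the same token IDs but the `stale` pointer, which is the hazard the commit message describes.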
  2. 14 May, 2025 1 commit
  3. 12 May, 2025 1 commit
  4. 08 May, 2025 1 commit
  5. 05 May, 2025 1 commit
  6. 03 Apr, 2025 1 commit
      llm: set done reason at server level (#9830) · e53b3cbd
      Bruce MacDonald authored
      No functional change. Many different done reasons can be set at the runner
      level, so rather than obscuring them we should return them to the server
      process and let it choose what to do with the done reason. This separates
      the API concerns from the runner.
      e53b3cbd
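The separation described in this commit might look roughly like the following Go sketch. All names here (`DoneReason`, its constants, `apiDoneReason`) and the output strings are assumptions made for illustration, not the identifiers from the commit: the runner reports a typed reason, and only the server decides how to present it.

```go
package main

import "fmt"

// DoneReason is a hypothetical runner-level enum; the real type may differ.
type DoneReason int

const (
	DoneReasonStop             DoneReason = iota // generation ended normally
	DoneReasonLength                             // prediction limit reached
	DoneReasonConnectionClosed                   // client disconnected
)

// apiDoneReason is the server-side policy: it maps every runner reason to
// the string exposed to API clients. Reasons that have no API equivalent
// map to the empty string here, which is an assumption of this sketch.
func apiDoneReason(r DoneReason) string {
	switch r {
	case DoneReasonStop:
		return "stop"
	case DoneReasonLength:
		return "length"
	default:
		return ""
	}
}

func main() {
	reasons := []DoneReason{DoneReasonStop, DoneReasonLength, DoneReasonConnectionClosed}
	for _, r := range reasons {
		fmt.Printf("%d -> %q\n", r, apiDoneReason(r))
	}
}
```

Keeping the mapping in one server-side function means a new runner reason only needs a new enum value, with the presentation decided in a single place.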
  7. 31 Mar, 2025 2 commits
      runner: clear cache when shift is not possible (#9433) · 66b25392
      Bruce MacDonald authored
      Clear KV cache when shift operation is not supported by model.
      Added KvCacheCanShift() check to handle models that can't perform cache shifts,
      falling back to full cache clear while preserving logical token history to
      maintain expected behavior when context window fills up.
      66b25392
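One way to read the fallback in this commit is sketched below in Go. This is an interpretation under stated assumptions, not the runner's code: `kvCache`, its `canShift` field (standing in for the `KvCacheCanShift()` check), and `makeRoom` are hypothetical. The logical history is trimmed the same way in both branches; only the cache handling differs.

```go
package main

import "fmt"

// kvCache is a hypothetical, simplified stand-in for the runner's cache.
type kvCache struct {
	canShift bool  // stands in for the KvCacheCanShift() check
	tokens   []int // tokens whose KV entries are currently stored
}

// makeRoom frees space once the context window is full. The first numKeep
// tokens are preserved and the next discard tokens are dropped from the
// logical history. It returns the trimmed history and the tokens that must
// be processed again. Assumes numKeep <= len(history).
func makeRoom(c *kvCache, history []int, numKeep, discard int) (trimmed, reprocess []int) {
	if numKeep+discard > len(history) {
		discard = len(history) - numKeep
	}
	trimmed = append(append([]int{}, history[:numKeep]...), history[numKeep+discard:]...)

	if c.canShift {
		// Shift path: surviving entries stay cached, nothing is recomputed.
		c.tokens = append([]int{}, trimmed...)
		return trimmed, nil
	}
	// Fallback path: the cache is cleared entirely, but the logical history
	// is kept so the surviving tokens can be processed again.
	c.tokens = nil
	return trimmed, trimmed
}

func main() {
	history := []int{1, 2, 3, 4, 5, 6, 7, 8}
	c := &kvCache{canShift: false, tokens: history}
	trimmed, reprocess := makeRoom(c, history, 2, 3)
	fmt.Println(trimmed, reprocess, len(c.tokens))
}
```

The fallback costs a full re-evaluation of the surviving tokens, but the caller sees the same trimmed history as it would after a shift.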
      runner: Release semaphore and improve error messages on failures · b2a46529
      Jesse Gross authored
      If we have an error after creating a new sequence but before
      finding a slot for it, we return without releasing the semaphore.
      This reduces our parallel sequences and eventually leads to deadlock.
      
      In practice this should never happen because once we have acquired
      the semaphore, we should always be able to find a slot. However, the
      code is clearly not correct.
      b2a46529
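The leak described in this commit can be reproduced in miniature. The Go sketch below is illustrative only: `server`, `admit`, `release`, and `errPrepare` are hypothetical names, and a buffered channel stands in for whatever semaphore the runner actually uses. The fix it demonstrates is that every early return after the acquire also releases.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errPrepare = errors.New("prepare failed")

// server is a hypothetical stand-in for the runner's request admission.
type server struct {
	sem   chan struct{} // buffered channel used as a counting semaphore
	mu    sync.Mutex    // guards slots
	slots []bool        // true when a slot is occupied
}

func newServer(parallel int) *server {
	return &server{sem: make(chan struct{}, parallel), slots: make([]bool, parallel)}
}

// admit acquires the semaphore, runs prepare, and claims a free slot.
// Every error path after the acquire releases the semaphore again.
func (s *server) admit(prepare func() error) (int, error) {
	s.sem <- struct{}{} // acquire
	if err := prepare(); err != nil {
		<-s.sem // release; omitting this is the leak
		return -1, fmt.Errorf("preparing sequence: %w", err)
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	for i, busy := range s.slots {
		if !busy {
			s.slots[i] = true
			return i, nil
		}
	}
	<-s.sem // should be unreachable, but release anyway
	return -1, errors.New("no free slot despite holding the semaphore")
}

// release frees a slot and returns its semaphore permit.
func (s *server) release(slot int) {
	s.mu.Lock()
	s.slots[slot] = false
	s.mu.Unlock()
	<-s.sem
}

func main() {
	s := newServer(1)
	for i := 0; i < 3; i++ {
		_, err := s.admit(func() error { return errPrepare })
		fmt.Println("failed attempt:", err)
	}
	// Without the release on the error path, this call would block forever.
	slot, err := s.admit(func() error { return nil })
	fmt.Println(slot, err, len(s.sem))
}
```

With a single permit, one leaked acquire is enough to block all later requests, which is the eventual deadlock the commit message warns about.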
  8. 14 Mar, 2025 1 commit
      llm: remove internal subprocess req and resp types (#9324) · 3892c3a7
      Bruce MacDonald authored
      This commit refactors the LLM subsystem by removing internal subprocess
      request and response types. It consolidates duplicate type definitions
      across the codebase, moving them to centralized locations. The change also
      standardizes interfaces between components, simplifies the ServerStatusResp
      struct, and moves the ParseDurationMs function to a common package. This
      cleanup reduces code duplication between different runner implementations
      (llamarunner and ollamarunner).
      3892c3a7
  9. 04 Mar, 2025 1 commit
      ml/backend/ggml: consolidate system info logging · 05a01fde
      Michael Yang authored
      - output backend system info when initializing the backend. this ensures
        this information is always present without needing to be called
        explicitly
      - convert to structured logging
      - enumerate devices rather than backends since devices are ordered
      - track device indices grouped by device name
      05a01fde
  10. 28 Feb, 2025 1 commit
  11. 27 Feb, 2025 2 commits
  12. 14 Feb, 2025 2 commits
      llamarunner: Init GGML before printing system info · 010313bb
      Jesse Gross authored
      We currently print system info before the GGML backends are loaded.
      This results in only getting information about the default lowest
      common denominator runner. If we move up the GGML init then we can
      see what we are actually running.
      
      Before:
      time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24
      
      After:
      time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24
      010313bb
      Runner for Ollama engine · ed443a03
      Jesse Gross authored
      This provides integration with the new Ollama engine
      (58245413 next ollama runner (#7913)) and the rest of the Ollama
      infrastructure such as the runner and Ollama server.
      
      In addition, it also builds out the KV cache infrastructure to
      support requirements of how Ollama runs models such as:
       - Parallel processing
       - Memory management for defragmentation and shifting
       - Multi-modal models
      
      Both old and new engines continue to be supported. By default, only
      the old engine is used. To enable the new engine:
      
      Start the server with the OLLAMA_NEW_ENGINE environment variable set:
      OLLAMA_NEW_ENGINE=1 ./ollama serve
      
      Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
      ./ollama run jessegross/llama3.1
      ed443a03