1. 21 Mar, 2025 2 commits
    • kvcache: Pass granular cache size into implementations · 3ed7ad3a
      Jesse Gross authored
      Currently the runner computes the KV cache size needed and creates a
      cache of that size. This is the context size times the number of
      parallel sequences.
      
      Cache implementations can make better decisions about their memory
      usage, so instead pass in the required capacity, number of sequences
      and maximum batch size. For now, the causal cache just uses this to
      compute the size in the same way as before.
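
      As a rough sketch of the new shape of the call, with type and
      parameter names that are illustrative rather than the exact kvcache
      API:

      package kvcache

      // Causal stands in for the causal KV cache implementation.
      type Causal struct {
          cells int // number of cache cells to allocate
      }

      // Init now receives the granular sizing inputs instead of a single
      // precomputed size, so implementations can choose their own layout.
      func (c *Causal) Init(capacity, numSeqs, maxBatch int) {
          // For now, reproduce the previous sizing: context size times
          // the number of parallel sequences.
          c.cells = capacity * numSeqs
          _ = maxBatch // available for smarter sizing decisions later
      }

      The runner then only supplies these three values; how they translate
      into memory is left to the cache implementation.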
    • ollamarunner: Provide mechanism for backends to report loading progress · 0ff28758
      Jesse Gross authored
      This enables the runner to report progress back to the Ollama server,
      both to show status to the user and to prevent the server from
      killing the runner if it thinks things have stalled.
      
      Most of the infrastructure was already there; this extends it to
      be available to the backends.
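
      A minimal sketch of the idea, assuming a callback-style hook (the
      actual interface is not shown in this log):

      package backend

      // LoadProgressFn reports how much of the model has been loaded, as
      // a fraction between 0.0 and 1.0.
      type LoadProgressFn func(fraction float32)

      // load shows a backend invoking the callback as it goes, so the
      // runner can forward status to the Ollama server and keep its
      // watchdog from concluding that the load has stalled.
      func load(numTensors int, progress LoadProgressFn) {
          for i := 0; i < numTensors; i++ {
              // ... read tensor i into memory ...
              if progress != nil {
                  progress(float32(i+1) / float32(numTensors))
              }
          }
      }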
  2. 20 Mar, 2025 2 commits
    • model: Pass input tensor instead of raw data to models · 0fbfcf3c
      Jesse Gross authored
      Rather than giving the input data to models directly, we can pass a
      tensor. In the short term, this saves some duplicated code.
      
      Longer term, we will want to overlap setting up the next batch with
      processing of the current one. In this case, we will only have the
      shape of the tensor, but it will not be loaded with data at the time of
      graph generation. By passing only a tensor to models now, we set up
      this possibility and prevent them from relying on data that they won't
      have in the future.
      
      Although the same could be done for Positions and Outputs, in some
      cases we either need the raw input data or don't use them at all.
      Therefore, for now we leave them as they are and allow models to
      convert them to tensors as needed.
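
      Illustratively, the model-facing batch might change along these lines
      (field and type names here are assumptions, not the exact API):

      package model

      // Tensor is a stand-in for the backend tensor type.
      type Tensor interface{ Shape() []int }

      // Batch is what a model's forward pass receives.
      type Batch struct {
          // Inputs used to be raw token data; it is now a tensor, which
          // may carry only a shape (and no data) at graph-build time.
          Inputs Tensor

          // Positions and Outputs stay as raw data for now; models
          // convert them to tensors themselves when needed.
          Positions []int32
          Outputs   []int32
      }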
    • input: Rename Options to Batch · 0c220935
      Jesse Gross authored
      Options is no longer very descriptive of this struct.
  3. 17 Mar, 2025 2 commits
    • ollamarunner: Check for minBatch of context space when shifting · bf24498b
      Jesse Gross authored
      Models can specify that a group of inputs needs to be handled as a
      single batch. However, context shifting didn't respect this and could
      trigger a break anyway. In this case, we should instead trigger a context
      shift earlier so that it occurs before the grouped batch.
      
      Note that there are still some corner cases:
       - A long prompt that exceeds the context window can get truncated
         in the middle of an image. With the current models, this will
         result in the model not recognizing the image at all, which is
         pretty much the expected result with truncation.
       - The context window is set to less than the minimum batch size. The
         only solution to this is to refuse to load the model with these
         settings. However, this can never occur with current models and
         default settings.
      
      Since users are unlikely to run into these scenarios, fixing them is
      left as a follow-up.
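
      A simplified sketch of the shifting decision (the names are
      illustrative, not the actual runner code):

      package runner

      // shouldShiftEarly reports whether the context should be shifted
      // now, before the next grouped batch, rather than risking a break
      // inside it. numPast is the number of tokens already in the cache,
      // numCtx is the context window, and minBatch is the number of
      // inputs the pending group needs to keep together.
      func shouldShiftEarly(numPast, numCtx, minBatch int) bool {
          return numPast+minBatch > numCtx
      }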
    • runner: remove cache prompt flag from ollama runner (#9826) · 95e271d9
      Bruce MacDonald authored
      We do not need to bypass prompt caching in the ollama runner yet, as
      only embedding models need to bypass it. When embedding models are
      implemented, they can skip initializing this cache completely.
  4. 14 Mar, 2025 3 commits
    • ollamarunner: Use a separate context per multimodal input · 282bfaaa
      Jesse Gross authored
      Currently there is a single context per sequence, shared by all
      multimodal inputs. Since we build a vision encoder graph per
      image, with a large number of inputs we can eventually hit the
      maximum number of graph nodes per context.
      
      This changes the runner to use a separate context for each image,
      ensuring that available resource limits remain consistent.
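
      Conceptually (the types below are placeholders for the ml package's
      backend and context, not its real API):

      package runner

      // Context stands in for a compute graph context, which has a limit
      // on the number of graph nodes it can hold.
      type Context struct{}

      func (c *Context) Close() {}

      // Backend stands in for the ml backend that creates contexts.
      type Backend struct{}

      func (b *Backend) NewContext() *Context { return &Context{} }

      // encodeImages builds each vision encoder graph in its own context,
      // so graph node usage never accumulates across the images of a
      // sequence.
      func encodeImages(b *Backend, images [][]byte) {
          for range images {
              ctx := b.NewContext()
              // ... build and run the vision encoder graph in ctx ...
              ctx.Close()
          }
      }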
    • ml: Allow models to constrain inputs to a single batch · 9679f401
      Jesse Gross authored
      Models may require that a set of inputs all be processed as part
      of the same batch. For example, if an image has multiple patches
      with fully connected attention between them, we should not split
      the batch in the middle of an image.
      
      Fixes #9697
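
      One way such a constraint can be expressed in the input stream
      (the field names here are illustrative):

      package input

      // Input is a single element of the stream a model hands to the
      // runner.
      type Input struct {
          Token int32

          // SameBatch asks the runner to keep this input and the
          // following SameBatch inputs in one batch, e.g. every patch of
          // an image whose patches attend to each other.
          SameBatch int
      }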
    • llm: remove internal subprocess req and resp types (#9324) · 3892c3a7
      Bruce MacDonald authored
      This commit refactors the LLM subsystem by removing internal subprocess
      request and response types. It consolidates duplicate type definitions
      across the codebase, moving them to centralized locations. The change also
      standardizes interfaces between components, simplifies the ServerStatusResp
      struct, and moves the ParseDurationMs function to a common package. This
      cleanup reduces code duplication between different runner implementations
      (llamarunner and ollamarunner).
  5. 13 Mar, 2025 1 commit
  6. 11 Mar, 2025 2 commits
  7. 10 Mar, 2025 2 commits
    • Jeffrey Morgan authored
    • model: Update encoder cache to use multimodal input processing handler · a1cda80b
      Jesse Gross authored
      The encoder cache needs to know the position of images in the input
      stream so that it knows when to delete them. Previously images didn't
      have a position, so we implied one by breaking batches before an
      image and then assuming the image was in the first position. However,
      multimodal objects are now given explicit positions in the input
      stream, so we can use that instead.
      
      Breaking batches was also a way to simulate a cross attention mask
      for mllama. However, given that it only supports a single sequence
      and a single image, this mask doesn't serve any real purpose.
      Removing the batch break does not appear to affect the quality of
      the output.
      
      Most of this is simply moving the input data structures to a new
      package to avoid import cycles.
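
      A sketch of what an explicitly positioned multimodal entry can look
      like (struct and field names are assumptions):

      package input

      // MultimodalIndex ties a multimodal object to its position in the
      // input stream, so the encoder cache knows when it can be deleted.
      type MultimodalIndex struct {
          Index      int // position in the input stream
          Multimodal any // model-specific data, e.g. an image embedding
      }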
  8. 09 Mar, 2025 1 commit
  9. 08 Mar, 2025 2 commits
  10. 07 Mar, 2025 3 commits
    • sample: improve ollama engine sampler performance (#9374) · 0682dae0
      Parth Sareen authored
      This change brings in various interface cleanups and greatly improves the performance of the sampler.
      
      Tested with llama3.2 on a local machine.
      Improves performance from ~70 tokens/s to ~135 tokens/s with topK(40)
      enabled. Without topK, performance is ~110 tokens/s.
    • ollamarunner: Improve multimodal input handling · a7e63b82
      Jesse Gross authored
      Various vision models have different requirements for how they
      receive their inputs. For example:
       - Mllama wants images together with text; the image embeddings
         don't themselves have positions or get stored in the main KV cache.
       - Llava-style models feed in embeddings similar to tokens, and
         images correspond to a varying number of tokens in the cache.
      
      In addition, the strategy for providing inputs must support batching
      and multiple sequences, which are managed by the runner. At the same
      time, we want to keep data handling fully in the model so that new
      architectures are not bottlenecked by runner code which does not
      understand their particular requirements.
      
      This provides a method for models to edit the input stream so that
      it meets their needs while still being in a format that the runner
      understands. This allows the runner to avoid special processing
      for different models.
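
      A sketch of such a hook, assuming a post-tokenization rewrite step
      (the interface and method names are illustrative):

      package model

      // Input is one element of the stream the runner batches and caches.
      type Input struct {
          Token      int32
          Multimodal any // e.g. an image embedding; nil for plain tokens
      }

      // MultimodalProcessor is implemented by models that need to rewrite
      // the input stream, e.g. to splice image data in where their
      // architecture expects it, while keeping a format the runner can
      // batch and cache without model-specific logic.
      type MultimodalProcessor interface {
          PostTokenize([]Input) ([]Input, error)
      }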
      
      In addition, this fixes a regression where non-vision models may
      try to incorrectly interpret images.
    • model: Don't unconditionally add special tokens · b70fc4d5
      Jesse Gross authored
      We sometimes tokenize partial strings. For example, with
      multimodal inputs, we split the input string around the images
      and then tokenize each piece. In these cases, we should only add
      the special tokens on the first piece.
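
      Roughly (the tokenizer interface below is a stand-in):

      package runner

      import "strings"

      // Tokenizer is a stand-in for the model's text processor.
      type Tokenizer interface {
          Encode(s string, addSpecial bool) []int32
      }

      // tokenizePieces splits the prompt around an image marker and only
      // adds special tokens (e.g. BOS) when encoding the first piece.
      func tokenizePieces(tok Tokenizer, prompt, imageMarker string) []int32 {
          var tokens []int32
          for i, piece := range strings.Split(prompt, imageMarker) {
              tokens = append(tokens, tok.Encode(piece, i == 0)...)
          }
          return tokens
      }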
  11. 04 Mar, 2025 1 commit
    • ml/backend/ggml: consolidate system info logging · 05a01fde
      Michael Yang authored
      - output backend system info when initializing the backend. this ensures
        this information is always present without needing to be called
        explicitly
      - convert to structured logging
      - enumerate devices rather than backends since devices are ordered
      - track device indices grouped by device name
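
      For example, structured logging of the enumerated devices might look
      roughly like this (illustrative only, not the actual implementation):

      package ggml

      import "log/slog"

      // Device is a stand-in for an enumerated compute device.
      type Device struct {
          Name  string // e.g. "CUDA0"
          Index int    // index within devices sharing the same name
      }

      // logSystemInfo emits one structured record per device at backend
      // init, so the information is always present in the logs.
      func logSystemInfo(devices []Device) {
          for _, d := range devices {
              slog.Info("system info", "device", d.Name, "index", d.Index)
          }
      }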
  12. 02 Mar, 2025 1 commit
    • ml: Enable support for flash attention · 21aa666a
      Jesse Gross authored
      The GGML flash attention kernel has specific requirements for
      padding and permutation. This adds support to the KV cache
      for conforming to these requirements so that flash attention
      can be enabled.
      
      Flash attention can be used in the same situations as the llama
      engine and is enabled by the user in the same way.
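
      A sketch of the padding side of this; the alignment value is taken
      as a parameter here rather than assuming the kernel's exact
      requirement:

      package kvcache

      // padCacheLength rounds the number of cache cells used by a batch
      // up to the alignment the flash attention kernel expects, so the KV
      // views and mask have the required shape.
      func padCacheLength(n, pad int) int {
          return ((n + pad - 1) / pad) * pad
      }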
  13. 28 Feb, 2025 2 commits
  14. 27 Feb, 2025 2 commits
  15. 25 Feb, 2025 1 commit
  16. 20 Feb, 2025 1 commit
    • ollamarunner: Pass runner performance parameters to backends · bd6a7d5e
      Jesse Gross authored
      Currently the following parameters are in the runner but not used:
       - numGPULayers
       - mainGPU
       - threads
       - tensorSplit
      
      This passes them through to the backend, which is where they would
      actually get used. However, the GGML backend does not yet do anything
      with them.
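
      Roughly, the parameters travel in a struct along these lines (the
      struct and field names mirror the list above but are illustrative):

      package ml

      // BackendParams carries runner-level performance settings through
      // to the backend that will eventually apply them.
      type BackendParams struct {
          NumThreads   int       // CPU threads used for computation
          NumGPULayers int       // layers to offload to GPUs
          MainGPU      int       // GPU used for small tensors and scratch
          TensorSplit  []float32 // fraction of the model per GPU
      }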
  17. 14 Feb, 2025 3 commits
    • df2680b4
      Daniel Hiltgen authored
    • llamarunner: Init GGML before printing system info · 010313bb
      Jesse Gross authored
      We currently print system info before the GGML backends are loaded.
      This results in only getting information about the default lowest
      common denominator runner. If we move the GGML init up, then we can
      see what we are actually running.
      
      Before:
      time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24
      
      After:
      time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24
    • Runner for Ollama engine · ed443a03
      Jesse Gross authored
      This provides integration with the new Ollama engine
      (58245413 next ollama runner (#7913)) and the rest of the Ollama
      infrastructure such as the runner and Ollama server.
      
      In addition, it builds out the KV cache infrastructure to
      support requirements of how Ollama runs models, such as:
       - Parallel processing
       - Memory management for defragmentation and shifting
       - Multi-modal models
      
      Both old and new engines continue to be supported. By default, only
      the old engine is used. To enable the new engine:
      
      Start the server with the OLLAMA_NEW_ENGINE environment variable set:
      OLLAMA_NEW_ENGINE=1 ./ollama serve
      
      Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
      ./ollama run jessegross/llama3.1