1. 02 Mar, 2025 6 commits
    • Soulter's avatar
      af68d60a
    • Jesse Gross's avatar
      ml: Enable support for flash attention · 21aa666a
      Jesse Gross authored
      The GGML flash attention kernel has specific requirements for
      padding and permutation. This adds support to the KV cache
      for conforming to these requirements so that flash attention
      can be enabled.
      
      Flash attention can be used in the same situations as the llama
      engine and is enabled by the user in the same way.
    • Jesse Gross's avatar
      ml: Empty tensor constructor for tensors · ee141cc8
      Jesse Gross authored
      In cases where we allocate a tensor and then fully overwrite it with
      copied data, it is wasteful to first zero out the memory.
    • Jesse Gross's avatar
      ggml-backend: Store parent backend as part of tensor · 55e5776c
      Jesse Gross authored
      It can be important for a tensor to know what backend it came from -
      for example, to know if flash attention is enabled.
    • Jesse Gross's avatar
      attention: Remove unnecessary contiguous operations · 854a9195
      Jesse Gross authored
      Prior to performing attention, we need to permute query, key
      and value. Currently we call Contiguous after each of these
      permutations, which is correct but expensive. Avoiding the
      3 calls to Contiguous increases performance by over 20%.
      
      The permutations of query and key do not violate the contiguity
      requirements of mulmat, so the Contiguous call can simply be removed.
      
      Value requires a different permutation and does require Contiguous.
      However, we can use the copy into the cache as a way to perform this
      without further overhead.
      
      To support this and avoid unexpected tensor shapes that are seen by
      models, we need tighter integration between attention, cache
      and backend. Future optimizations will likely need this structure
      as well: for example, flash attention has special padding requirements
      in the cache, and other backends may have their own needs.
      
      This further contains the operations that go into attention so that
      these and other optimizations can be handled transparently. Models
      that have special requirements for attention can still implement
      their own version of it.
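
The cost difference between a permutation and a Contiguous call can be illustrated with a minimal sketch. The `View` type and its methods below are hypothetical and are not the ml package's API: they only show that a permutation can be a metadata change (swapped strides), while making the result contiguous requires copying every element.

```go
package main

import "fmt"

// View is a hypothetical 2-D strided view over a flat buffer.
// It is an illustration only, not the tensor type used by the commit.
type View struct {
	Data    []float32
	Shape   [2]int // rows, cols
	Strides [2]int // elements to skip per step along each axis
}

// NewView wraps data as a row-major (contiguous) rows x cols view.
func NewView(data []float32, rows, cols int) View {
	return View{Data: data, Shape: [2]int{rows, cols}, Strides: [2]int{cols, 1}}
}

// At returns the element at logical position (i, j).
func (v View) At(i, j int) float32 {
	return v.Data[i*v.Strides[0]+j*v.Strides[1]]
}

// Permute swaps the two axes by swapping shape and stride metadata.
// No element is copied, so this is cheap.
func (v View) Permute() View {
	return View{
		Data:    v.Data,
		Shape:   [2]int{v.Shape[1], v.Shape[0]},
		Strides: [2]int{v.Strides[1], v.Strides[0]},
	}
}

// IsContiguous reports whether the view is laid out row-major in memory.
func (v View) IsContiguous() bool {
	return v.Strides[1] == 1 && v.Strides[0] == v.Shape[1]
}

// Contiguous copies every element into a fresh row-major buffer.
// This is the expensive step that is worth avoiding when possible.
func (v View) Contiguous() View {
	out := make([]float32, v.Shape[0]*v.Shape[1])
	for i := 0; i < v.Shape[0]; i++ {
		for j := 0; j < v.Shape[1]; j++ {
			out[i*v.Shape[1]+j] = v.At(i, j)
		}
	}
	return NewView(out, v.Shape[0], v.Shape[1])
}

func main() {
	v := NewView([]float32{1, 2, 3, 4, 5, 6}, 2, 3)
	p := v.Permute()
	fmt.Println("permuted contiguous:", p.IsContiguous())
	c := p.Contiguous()
	fmt.Println("copied contiguous:", c.IsContiguous(), c.Data)
}
```

In this sketch, an operation that only needs `At`-style strided access can consume the permuted view directly, which is the situation described for query and key. When a copy is needed anyway, as with the copy of value into the cache, the re-layout can happen as part of that copy.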
    • Jeffrey Morgan's avatar
  2. 01 Mar, 2025 3 commits
    • Jeffrey Morgan's avatar
      e75c6126
    • Blake Mizerany's avatar
      server/internal/internal/names: validate names (#9400) · cda6f5c6
      Blake Mizerany authored
      This commit is a step towards a goal to make names less ceremonial
      outside of the registry client. Clients of the registry package can
      treat names as opaque strings, and the registry package will handle
      parsing, validating, and normalizing names.
      
      Ideally we end up with the names package tucked away in an internal
      package for good. We'll see how things go.
      
      Also, this package name is not permanent. This is another step in the
      ongoing process of refactoring the server code, and at some point it
      will most likely be renamed/moved.
    • Bruce MacDonald's avatar
      server: validate local path on safetensor create (#9379) · bebb6823
      Bruce MacDonald authored
      More validation during the safetensor creation process.
      
      * Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths
      * Add comprehensive test coverage for various paths
      * No functionality changes for valid inputs; existing workflows remain unaffected
      * Leverages Go 1.24's new os.Root functionality for secure containment
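
As a rough illustration of the containment that os.Root provides, here is a minimal sketch. It is not the validation code from this commit: the helper names `openContained` and `demo` are hypothetical, and only the Go 1.24 standard library calls `os.OpenRoot` and `(*os.Root).Open` are assumed.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// openContained tries to open name relative to dir through an os.Root.
// Paths that are absolute or that escape dir are rejected with an error.
func openContained(dir, name string) error {
	root, err := os.OpenRoot(dir)
	if err != nil {
		return err
	}
	defer root.Close()

	f, err := root.Open(name)
	if err != nil {
		return err
	}
	return f.Close()
}

// demo creates a temporary directory holding one file and reports whether
// a relative path is accepted and absolute/escaping paths are rejected.
func demo() (bool, bool, bool, error) {
	dir, err := os.MkdirTemp("", "root-demo")
	if err != nil {
		return false, false, false, err
	}
	defer os.RemoveAll(dir)

	abs := filepath.Join(dir, "model.safetensors")
	if err := os.WriteFile(abs, []byte("x"), 0o644); err != nil {
		return false, false, false, err
	}

	relOK := openContained(dir, "model.safetensors") == nil
	// The absolute path names the same existing file, but is still refused.
	absRejected := openContained(dir, abs) != nil
	// This path leaves the root via ".." before coming back in.
	escape := "../" + filepath.Base(dir) + "/model.safetensors"
	escapeRejected := openContained(dir, escape) != nil
	return relOK, absRejected, escapeRejected, nil
}

func main() {
	relOK, absRejected, escapeRejected, err := demo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("relative accepted:", relOK)
	fmt.Println("absolute rejected:", absRejected)
	fmt.Println("escape rejected:", escapeRejected)
}
```

The point of routing file access through the root handle, rather than checking the path string up front, is that the containment rule is enforced at open time, including for symlinks that point outside the directory.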
  3. 28 Feb, 2025 9 commits
  4. 27 Feb, 2025 15 commits
  5. 26 Feb, 2025 2 commits
  6. 25 Feb, 2025 5 commits
    • Jeffrey Morgan's avatar
      3ad4bc8a
    • Blake Mizerany's avatar
      .github: always run tests, and other helpful fixes (#9348) · 0d694793
      Blake Mizerany authored
      During work on our new registry client, I ran into frustrations with CI
      where a misspelling in a comment caused the linter to fail, which caused
      the tests to not run, which caused the build to not be cached, which
      caused the next run to be slow, which caused me to be sad.
      
      This commit addresses these issues and pulls in some helpful changes
      we've had in CI on ollama.com for some time now.
      
      They are:
      
      * Always run tests, even if the other checks fail.
      
      Tests are the most important part of CI, and should always run. Failures
      in tests can be correlated with failures in other checks, and can help
      surface the root cause of the failure sooner. This is especially
      important when the failure is platform specific, and the tests are not
      platform independent.
      
      * Check that `go generate` is clean.
      
      This prevents `go generate` abuse regressions. This codebase used to use
      it to generate platform specific binary build artifacts. Let's make sure
      that does not happen again and this powerful tool is used correctly, and
      the generated code is checked in.
      
      Also, while adding the `go generate` check, it was revealed that the
      generated metal code was putting dates in the comments, resulting in
      non-deterministic builds. This is a bad practice, and this commit fixes
      that. Git tells us the most important date: the commit date along with
      other associated changes.
      
      * Check that `go mod tidy` is clean.
      
      A new job to check that `go mod tidy` is clean was added, to prevent
      easily preventable merge conflicts or go.mod changes being deferred to a
      future PR that is unrelated to the change that caused the go.mod to
      change.
      
      * More robust caching.
      
      We now cache the go build cache, and the go mod download cache
      independently. This is because the download cache contains zips that can
      be unpacked in parallel faster than they can be fetched and extracted by
      tar. This speeds up the build significantly.
      
      The linter is hostile enough. It does not need to also punish us with
      longer build times due to small failures like misspellings.
    • Daniel Hiltgen's avatar
      Update ROCm (6.3 linux, 6.2 windows) and CUDA v12.8 (#9304) · e91ae3d4
      Daniel Hiltgen authored
      * Bump cuda and rocm versions
      
      Update ROCm to linux:6.3 win:6.2 and CUDA v12 to 12.8.
      Yum has some silent failure modes, so largely switch to dnf.
      
      * Fix windows build script
    • José Pekkarinen's avatar
      docker: upgrade rocm to 6.3.3 (#8211) · 6ecd7f64
      José Pekkarinen authored
      
      
      centos-7 images have been deprecated upstream and replaced with
      almalinux-8 images instead, requiring some small extra work.
      Signed-off-by: José Pekkarinen <jose.pekkarinen@foxhound.fi>
    • Chuanhui Liu's avatar
      docs: rocm install link (#9346) · 88885567
      Chuanhui Liu authored