Commits · af68d60a58c7ade657a1f4641bf301a29d977174 · OpenDAS / ollama

02 Mar, 2025 6 commits

readme: add AstrBot to community integrations (#9442) · af68d60a
Soulter authored Mar 02, 2025

af68d60a

ml: Enable support for flash attention · 21aa666a

Jesse Gross authored Feb 25, 2025

The GGML flash attention kernel has specific requirements for
padding and permutation. This adds support to the KV cache
for conforming to these requirements so that flash attention
can be enabled.

Flash attention can be used in the same situations as the llama
engine and is enabled by the user in the same way.
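
As a rough illustration of what a padding requirement means for a cache, the sketch below rounds a requested length up to the next multiple of a granularity. The value 256 and the name `roundUp` are assumptions chosen for the example; the commit does not state the actual granularity or how the KV cache applies it.

```go
package main

import "fmt"

// roundUp returns the smallest multiple of pad that is >= n.
// pad must be positive.
func roundUp(n, pad int) int {
	return (n + pad - 1) / pad * pad
}

func main() {
	// 256 is a hypothetical granularity, used only for illustration.
	const pad = 256
	for _, n := range []int{1, 256, 300} {
		fmt.Printf("%d -> %d\n", n, roundUp(n, pad))
	}
}
```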

21aa666a

ml: Empty tensor constructor for tensors · ee141cc8

Jesse Gross authored Feb 28, 2025

In cases where we allocate a tensor and then fully overwrite it with
copied data, it is wasteful to first zero out the memory.

ee141cc8

ggml-backend: Store parent backend as part of tensor · 55e5776c

Jesse Gross authored Feb 27, 2025

It can be important for a tensor to know what backend it came from,
for example to know if flash attention is enabled.

55e5776c

attention: Remove unnecessary contiguous operations · 854a9195

Jesse Gross authored Feb 22, 2025

Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the contiguity
rules for mulmat and the Contiguous call can be simply removed.

Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.

To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimizations will also likely need this structure.
For example, flash attention has special padding requirements in
the cache, and other backends may have their own needs.

This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.
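
The cost being avoided can be pictured with a small layout model: a permutation only rewrites shape and stride metadata, while a Contiguous call copies the data into a packed layout. The sketch below is a hypothetical row-major `view` type, not the GGML tensor or the ml package API, and the dimension names are assumptions. It shows that a permuted view stops being packed even though no data moved.

```go
package main

import "fmt"

// view describes a tensor layout: extent and element stride per dimension.
// It is a hypothetical stand-in, not the GGML tensor type.
type view struct {
	shape, stride []int
}

// newContiguous builds a packed row-major view (last dimension has stride 1).
func newContiguous(shape ...int) view {
	stride := make([]int, len(shape))
	s := 1
	for i := len(shape) - 1; i >= 0; i-- {
		stride[i] = s
		s *= shape[i]
	}
	return view{shape: shape, stride: stride}
}

// permute reorders dimensions without moving data: only metadata changes.
func (v view) permute(order ...int) view {
	out := view{shape: make([]int, len(order)), stride: make([]int, len(order))}
	for i, d := range order {
		out.shape[i] = v.shape[d]
		out.stride[i] = v.stride[d]
	}
	return out
}

// isContiguous reports whether the strides match a freshly packed layout.
func (v view) isContiguous() bool {
	want := newContiguous(v.shape...)
	for i := range v.stride {
		if v.shape[i] > 1 && v.stride[i] != want.stride[i] {
			return false
		}
	}
	return true
}

func main() {
	q := newContiguous(8, 4, 64) // hypothetical [tokens, heads, headDim]
	p := q.permute(1, 0, 2)      // [heads, tokens, headDim]
	fmt.Println(q.isContiguous(), p.isContiguous())
	fmt.Println(p.shape, p.stride)
}
```

In this model the innermost dimension keeps stride 1 after the permutation, so each row is still packed. An operation that can read strided input does not need the copy, which is the situation the message describes for query and key.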

854a9195

build: use correct GGML_HIP_NO_VMM compiler definition for ggml-hip (#9451) · 96a97adf
Jeffrey Morgan authored Mar 01, 2025

96a97adf

01 Mar, 2025 3 commits

build: set GGML_CUDA_NO_VMM for ggml-hip target (#9449) · e75c6126
Jeffrey Morgan authored Mar 01, 2025

e75c6126

server/internal/internal/names: validate names (#9400) · cda6f5c6

Blake Mizerany authored Mar 01, 2025

This commit is a step towards a goal to make names less ceremonial
outside of the registry client. Clients of the registry package can
treat names as opaque strings, and the registry package will handle
parsing, validating, and normalizing names.

Ideally we end up with the names package tucked away in an internal
package for good. We'll see how things go.

Also, this package name is not permanent. This is another step in the
ongoing process of refactoring the server code, and at some point it
will most likely be renamed/moved.

cda6f5c6

server: validate local path on safetensor create (#9379) · bebb6823

Bruce MacDonald authored Feb 28, 2025

More validation during the safetensor creation process.

* Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths
* Add comprehensive test coverage for various paths
* No functionality changes for valid inputs - existing workflows remain unaffected
* Leverages Go 1.24's new os.Root functionality for secure containment
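
A minimal sketch of the containment idea, assuming Go 1.24 or newer: `os.OpenRoot` and `Root.Open` are standard library calls, while `openInside`, `canOpen`, and `demoDir` are hypothetical helpers written for this example. This is not the validation code from the commit, only an illustration of how `os.Root` treats relative, escaping, and absolute paths.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// openInside opens name relative to dir and refuses anything that would
// resolve outside dir. It relies on os.Root, added in Go 1.24.
func openInside(dir, name string) (*os.File, error) {
	root, err := os.OpenRoot(dir)
	if err != nil {
		return nil, err
	}
	defer root.Close()
	return root.Open(name)
}

// canOpen reports whether name can be opened inside dir.
func canOpen(dir, name string) bool {
	f, err := openInside(dir, name)
	if err != nil {
		return false
	}
	f.Close()
	return true
}

// demoDir creates a temporary directory holding one placeholder file.
func demoDir() string {
	dir, err := os.MkdirTemp("", "safetensors-demo")
	if err != nil {
		panic(err)
	}
	path := filepath.Join(dir, "model.safetensors")
	if err := os.WriteFile(path, []byte("x"), 0o600); err != nil {
		panic(err)
	}
	return dir
}

func main() {
	dir := demoDir()
	defer os.RemoveAll(dir)
	for _, name := range []string{"./model.safetensors", "../model.safetensors", "/etc/passwd"} {
		fmt.Println(name, canOpen(dir, name))
	}
}
```

The relative path inside the directory opens, while the parent-relative and absolute paths are rejected by `os.Root` itself, so the caller does not need to clean or compare path strings by hand.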

bebb6823

28 Feb, 2025 9 commits

runner: defer context cancel · 31e472ba
Michael Yang authored Feb 28, 2025
```
defer the cancel to guarantee it runs
```
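
The pattern named in the message can be sketched as follows. The function `run` and its `fail` parameter are hypothetical and exist only to show an early return path; the runner's real code is not reproduced here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// run derives a cancellable context and defers cancel immediately, so the
// context is released on every return path, including the early error one.
// It returns the derived context only so the caller can observe its state.
func run(fail bool) (context.Context, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	if fail {
		return ctx, errors.New("early exit")
	}
	return ctx, nil
}

func main() {
	for _, fail := range []bool{false, true} {
		ctx, err := run(fail)
		// ctx.Err() is non-nil in both cases because cancel always ran.
		fmt.Println(err, ctx.Err())
	}
}
```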
31e472ba

fix: replace deprecated functions · 657685e8
Michael Yang authored Feb 28, 2025

657685e8

build: add compute capability 12.0 to CUDA 12 preset (#9426) · a1491285

Jeffrey Morgan authored Feb 28, 2025

Focuses initial Blackwell support on compute capability 12.0
which includes the 50x series of GeForce cards. In the future
additional compute capabilities may be added.

a1491285

server/.../safetensors: fix offsets and include all model parts (#9427) · eed11ded

Blake Mizerany authored Feb 28, 2025

Also, require the -as flag to be set when importing a model. This
prevents the confusing error message "invalid name".

Also, allow short names to be used when importing a model and
auto-complete the name with the default mask.

eed11ded

cuda: enable flash attention · b42aba40
Michael Yang authored Feb 28, 2025
```
ggml added an option to disable flash attention so explicitly enable it
```
b42aba40

docs: Add 1Panel to Community Integrations (#9312) · 25885e53
王贺 authored Mar 01, 2025

25885e53

llama: add phi4 mini support (#9403) · 98d44fa3
Jeffrey Morgan authored Feb 27, 2025

98d44fa3

CONTRIBUTING: provide clarity on good commit messages, and bad (#9405) · 2099e2d2

Blake Mizerany authored Feb 27, 2025

Also, our commit messages have been getting better, but we can do
better, and be more consistent. This adds more clarity on how to write
commit messages and provides examples of good and bad messages.

Also, our contributing guide was lacking helpful guidance on how to
start change proposals. This commit adds the start of that section.

Soon, we should add a proposal template to the issue tracker with a link
back to the proposal section, which should also be expanded upon.

2099e2d2

runner: default to greedy sampler for performance (#9407) · 0c1041ad

Bruce MacDonald authored Feb 27, 2025

As we are adding support for weighted sampling, we have seen some
performance regressions. Bypass the sampler logic for now and default to
greedy until we can benchmark the new sampler logic.
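
For reference, greedy selection amounts to taking the index of the largest logit. The sketch below is a generic argmax written for illustration; the function name `greedy` and the sample values are assumptions, and it is not the runner's sampler code.

```go
package main

import "fmt"

// greedy returns the index of the largest logit, or -1 for empty input.
// Ties resolve to the lowest index.
func greedy(logits []float32) int {
	best := -1
	for i, v := range logits {
		if best < 0 || v > logits[best] {
			best = i
		}
	}
	return best
}

func main() {
	logits := []float32{0.1, 2.5, -1.0, 2.4}
	fmt.Println(greedy(logits)) // prints 1
}
```

Because softmax and scaling by a positive temperature both preserve ordering, the argmax can be taken on the raw logits in a single pass with no sorting or normalization.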

0c1041ad

27 Feb, 2025 15 commits
- sample: remove transforms from greedy sampling (#9377) · c245b040
  Parth Sareen authored Feb 27, 2025
  
  c245b040
- kvcache: update tests · 8b194b75
  Michael Yang authored Feb 26, 2025
  
  8b194b75
- ml: update Context.Forward interface · 3e8b8a19
  Michael Yang authored Feb 21, 2025
```
update Context.Forward to accept multiple tensors to match
Context.Compute signature

update Context.Forward to return Context such that it can be chained
with Context.Compute
```
  3e8b8a19
- server/internal/registry: implement CloseNotify and Flush (for now) (#9402) · 41dc2804
  Blake Mizerany authored Feb 27, 2025
```
This fixes panics introduced in 2412adf4
when Gin ungracefully assumes that the http.ResponseWriter implements
http.CloseNotifier and http.Flusher, which our new statusCodeRecorder
does not. This is a temporary fix until we can pour the rest of the Gin
out.
```
  41dc2804
- model: add bos token if configured · 53d2990d
  Michael Yang authored Feb 26, 2025
  
  53d2990d
- go.mod: Use full version for go 1.24.0 · e185c08a
  Jesse Gross authored Feb 27, 2025
```
Otherwise on Linux I get:
go: download go1.24 for linux/amd64: toolchain not available
```
  e185c08a
- server/internal: replace model delete API with new registry handler. (#9347) · 2412adf4
  Blake Mizerany authored Feb 27, 2025
```
This commit introduces a new API implementation for handling
interactions with the registry and the local model cache. The new API is
located in server/internal/registry. The package name is "registry" and
should be considered temporary; it is hidden and not bleeding outside of
the server package. As the commits roll in, we'll start consuming more
of the API and then let reverse osmosis take effect, at which point it
will surface closer to the root level packages as much as needed.
```
  2412adf4
- docs: fix api examples link (#9360) · be2ac1ed
  Steven Hartland authored Feb 27, 2025
```
Fix the examples link in the go package documentation for the API.
```
  be2ac1ed
- server: allow vscode-file origins (#9313) · dc13813a
  Eries Trisnadi authored Feb 28, 2025
  
  dc13813a
- runner: simplify tensor split parsing · d6af13ef
  Michael Yang authored Feb 26, 2025
  
  d6af13ef
- ml/backend/ggml: fix debug logging · a59f6652
  Michael Yang authored Feb 26, 2025
  
  a59f6652
- Windows ARM build (#9120) · 688925ac
  Daniel Hiltgen authored Feb 27, 2025
```
* Windows ARM build

Skip cmake, and note it's unused in the developer docs.

* Win: only check for ninja when we need it

On windows ARM, the cim lookup fails, but we don't need ninja anyway.
```
  688925ac
- .github/workflows: swap order of go test and golangci-lint (#9389) · 76e903cf
  Blake Mizerany authored Feb 26, 2025
```
The linter is secondary to the tests, so it should run after the tests,
exposing test failures faster.
```
  76e903cf
- ml/backend/ggml: follow on fixes after updating vendored code (#9388) · a5272130
  Jeffrey Morgan authored Feb 26, 2025
```
Fixes sync filters and lowers CUDA version to 11.3 in test.yaml
```
  a5272130
- llama: update llama.cpp vendor code to commit d7cfe1ff (#9356) · d7d7e996
  Jeffrey Morgan authored Feb 26, 2025
  
  d7d7e996
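
The CloseNotify and Flush fix listed above (41dc2804) turns on a general Go behavior: a wrapper that embeds the `http.ResponseWriter` interface exposes only `Header`, `Write`, and `WriteHeader`, so optional interfaces such as `http.Flusher` disappear unless they are forwarded, and an unchecked type assertion on such a wrapper panics. The sketch below illustrates this with hypothetical types (`statusRecorder`, `bareWrapper`); it is not the project's statusCodeRecorder and it covers only `Flush`, not `CloseNotify`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusRecorder is a hypothetical wrapper that remembers the status code.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

// WriteHeader records the code before passing it to the wrapped writer.
func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Flush forwards to the wrapped writer when it supports flushing.
func (r *statusRecorder) Flush() {
	if f, ok := r.ResponseWriter.(http.Flusher); ok {
		f.Flush()
	}
}

// bareWrapper embeds the interface without forwarding Flush, for contrast.
type bareWrapper struct{ http.ResponseWriter }

// canFlush reports whether w satisfies http.Flusher.
func canFlush(w http.ResponseWriter) bool {
	_, ok := w.(http.Flusher)
	return ok
}

// newRecorder wraps an in-memory writer from net/http/httptest.
func newRecorder() *statusRecorder {
	return &statusRecorder{ResponseWriter: httptest.NewRecorder()}
}

// newBare wraps the same kind of writer without forwarding Flush.
func newBare() bareWrapper {
	return bareWrapper{httptest.NewRecorder()}
}

func main() {
	fmt.Println(canFlush(newBare()), canFlush(newRecorder()))
	rec := newRecorder()
	rec.WriteHeader(http.StatusTeapot)
	fmt.Println(rec.status)
}
```

The underlying `httptest.ResponseRecorder` does implement `Flush`, yet `bareWrapper` hides it; only the wrapper with an explicit `Flush` method still satisfies `http.Flusher`.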
26 Feb, 2025 2 commits

readme: add Nichey to community integrations (#9370) · 2db96c18
Gordon Kamer authored Feb 26, 2025

2db96c18

Add cuda Blackwell architecture for v12 (#9350) · e12af460

Daniel Hiltgen authored Feb 26, 2025

* Add cuda Blackwell architecture for v12

* Win: Split rocm out to separate zip file

* Reduce CC matrix

The 6.2 and 7.2 architectures only appear on Jetsons, so they were wasting space.
The 5.0 should be forward compatible with 5.2 and 5.3.

e12af460

25 Feb, 2025 5 commits

llama: removed unused 'vendoring' file (#9351) · 3ad4bc8a
Jeffrey Morgan authored Feb 25, 2025

3ad4bc8a

.github: always run tests, and other helpful fixes (#9348) · 0d694793

Blake Mizerany authored Feb 25, 2025

During work on our new registry client, I ran into frustrations with CI
where a misspelling in a comment caused the linter to fail, which caused
the tests to not run, which caused the build to not be cached, which
caused the next run to be slow, which caused me to be sad.

This commit addresses these issues, and pulls in some helpful changes
we've had in CI on ollama.com for some time now.

They are:

* Always run tests, even if the other checks fail.

Tests are the most important part of CI, and should always run. Failures
in tests can be correlated with failures in other checks, and can help
surface the root cause of the failure sooner. This is especially
important when the failure is platform specific, and the tests are not
platform independent.

* Check that `go generate` is clean.

This prevents 'go generate' abuse regressions. This codebase used to use
it to generate platform specific binary build artifacts. Let's make sure
that does not happen again and this powerful tool is used correctly, and
the generated code is checked in.

Also, while adding the `go generate` check, it was revealed that the
generated metal code was putting dates in the comments, resulting in
non-deterministic builds. This is a bad practice, and this commit fixes
that. Git tells us the most important date: the commit date along with
other associated changes.

* Check that `go mod tidy` is clean.

A new job to check that `go mod tidy` is clean was added, to prevent
easily preventable merge conflicts or go.mod changes being deferred to a
future PR that is unrelated to the change that caused the go.mod to
change.

* More robust caching.

We now cache the go build cache, and the go mod download cache
independently. This is because the download cache contains zips that can
be unpacked in parallel faster than they can be fetched and extracted by
tar. This speeds up the build significantly.

The linter is hostile enough. It does not need to also punish us with
longer build times due to small failures like misspellings.

0d694793

Update ROCm (6.3 linux, 6.2 windows) and CUDA v12.8 (#9304) · e91ae3d4

Daniel Hiltgen authored Feb 25, 2025

* Bump cuda and rocm versions

Update ROCm to linux:6.3 win:6.2 and CUDA v12 to 12.8.
Yum has some silent failure modes, so largely switch to dnf.

* Fix windows build script

e91ae3d4

docker: upgrade rocm to 6.3.3 (#8211) · 6ecd7f64

José Pekkarinen authored Feb 25, 2025

centos-7 images have been deprecated upstream and replaced with
almalinux-8 images instead, requiring some small extra work.

Signed-off-by: José Pekkarinen <jose.pekkarinen@foxhound.fi>

6ecd7f64

docs: rocm install link (#9346) · 88885567
Chuanhui Liu authored Feb 25, 2025

88885567