- 27 Feb, 2025 11 commits
-
-
Michael Yang authored
-
Jesse Gross authored
Otherwise on Linux I get: `go: download go1.24 for linux/amd64: toolchain not available`
-
Blake Mizerany authored
This commit introduces a new API implementation for handling interactions with the registry and the local model cache. The new API is located in server/internal/registry. The package name is "registry" and should be considered temporary; it is hidden and not bleeding outside of the server package. As the commits roll in, we'll start consuming more of the API and then let reverse osmosis take effect, at which point it will surface closer to the root level packages as much as needed.
-
Steven Hartland authored
Fix the examples link in the go package documentation for the API.
-
Eries Trisnadi authored
-
Michael Yang authored
-
Michael Yang authored
-
Daniel Hiltgen authored
* Windows ARM build: skip cmake, and note it's unused in the developer docs.
* Win: only check for ninja when we need it. On Windows ARM, the cim lookup fails, but we don't need ninja anyway.
-
Blake Mizerany authored
The linter is secondary to the tests, so it should run after the tests, exposing test failures faster.
-
Jeffrey Morgan authored
Fixes sync filters and lowers CUDA version to 11.3 in test.yaml
-
Jeffrey Morgan authored
-
- 26 Feb, 2025 2 commits
-
-
Gordon Kamer authored
-
Daniel Hiltgen authored
* Add CUDA Blackwell architecture for v12
* Win: split ROCm out to a separate zip file
* Reduce CC matrix: the 6.2 and 7.2 architectures only appear on Jetsons, so they were wasting space. The 5.0 should be forward compatible with 5.2 and 5.3.
-
- 25 Feb, 2025 10 commits
-
-
Jeffrey Morgan authored
-
Blake Mizerany authored
During work on our new registry client, I ran into frustrations with CI where a misspelling in a comment caused the linter to fail, which caused the tests to not run, which caused the build to not be cached, which caused the next run to be slow, which caused me to be sad. This commit addresses these issues and pulls in some helpful changes we've had in CI on ollama.com for some time now. They are:

* Always run tests, even if the other checks fail. Tests are the most important part of CI and should always run. Failures in tests can be correlated with failures in other checks, and can help surface the root cause of the failure sooner. This is especially important when the failure is platform specific and the tests are not platform independent.

* Check that `go generate` is clean. This prevents `go generate` abuse regressions. This codebase used to use it to generate platform-specific binary build artifacts. Let's make sure that does not happen again, that this powerful tool is used correctly, and that the generated code is checked in. Also, while adding the `go generate` check, it was revealed that the generated Metal code was putting dates in the comments, resulting in non-deterministic builds. This is a bad practice, and this commit fixes that. Git tells us the most important date: the commit date, along with other associated changes.

* Check that `go mod tidy` is clean. A new job checks that `go mod tidy` is clean, to prevent easily preventable merge conflicts, or go.mod changes being deferred to a future PR that is unrelated to the change that caused go.mod to change.

* More robust caching. We now cache the go build cache and the go mod download cache independently. This is because the download cache contains zips that can be unpacked in parallel faster than they can be fetched and extracted by tar. This speeds up the build significantly.

The linter is hostile enough. It does not need to also punish us with longer build times due to small failures like misspellings.
-
Daniel Hiltgen authored
* Bump CUDA and ROCm versions: update ROCm to linux:6.3, win:6.2 and CUDA v12 to 12.8. Yum has some silent failure modes, so largely switch to dnf.
* Fix windows build script
-
José Pekkarinen authored
centos-7 images have been deprecated upstream and replaced with almalinux-8 images instead, requiring some small extra work. Signed-off-by: José Pekkarinen <jose.pekkarinen@foxhound.fi>
-
Chuanhui Liu authored
-
Michael Yang authored
This was accidentally removed when moving fs/ggml from its previous location.
-
Pavol Rusnak authored
CUDA 12.x still supports Compute Capability 5.0, 5.2 and 5.3, so let's build for these architectures as well
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Blake Mizerany authored
This commit copies (without history) the bmizerany/ollama-go repository with the intention of integrating it into ollama as a replacement for the pushing and pulling of models, and the management of the cache they are pushed to and pulled from. New homes for these packages will be determined as they are integrated and we have a better understanding of proper package boundaries.
-
Parth Sareen authored
-
- 24 Feb, 2025 3 commits
-
-
Parth Sareen authored
* envconfig: allow setting context length through env var
-
Blake Mizerany authored
-
Jeffrey Morgan authored
-
- 22 Feb, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Blake Mizerany authored
The route assembly in Handler lacked clear organization, making it difficult to scan for routes and their relationships to each other. This commit aims to fix that by reordering the assembly of routes to group them by category and purpose. Also, be more specific about what "config" refers to (it is about CORS if you were wondering... I was.)
-
- 21 Feb, 2025 3 commits
-
-
Jesse Gross authored
There are two benefits to doing this:
- Provide a library function that models can use, reducing code for each model implementation
- Enable a single place to drop in optimized implementations of attention based on the backend or other factors

One is provided for GGML. On CUDA this improves token generation rate by about 3%. It does not have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
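As a rough sketch of what such a shared attention helper centralizes — scaled dot-product attention, softmax(q·kᵀ/√d)·v — here is a scalar version over plain slices. This is only the underlying math, not ollama's tensor API or its GGML-optimized path:

```go
package main

import (
	"fmt"
	"math"
)

// attention computes softmax(q·kᵀ/√d)·v for a single query against a
// set of keys and values. A scalar sketch of the operation a shared
// attention helper would centralize; names are illustrative.
func attention(q []float64, keys, values [][]float64) []float64 {
	d := float64(len(q))
	scores := make([]float64, len(keys))
	for i, k := range keys {
		var dot float64
		for j := range q {
			dot += q[j] * k[j]
		}
		scores[i] = dot / math.Sqrt(d) // scaled dot product
	}
	// Numerically stable softmax over the scores.
	maxS := scores[0]
	for _, s := range scores[1:] {
		if s > maxS {
			maxS = s
		}
	}
	var sum float64
	for i, s := range scores {
		scores[i] = math.Exp(s - maxS)
		sum += scores[i]
	}
	// Weighted sum of the value vectors.
	out := make([]float64, len(values[0]))
	for i, w := range scores {
		for j, v := range values[i] {
			out[j] += (w / sum) * v
		}
	}
	return out
}

func main() {
	q := []float64{1, 0}
	keys := [][]float64{{1, 0}, {0, 1}}
	values := [][]float64{{1, 0}, {0, 1}}
	fmt.Println(attention(q, keys, values))
}
```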
-
Michael Yang authored
-
Junyan Qin (Chin) authored
-
- 20 Feb, 2025 9 commits
-
-
Jesse Gross authored
Currently Rows is called as the last step in a model computation to get the values for the output tokens. However, if we move it earlier in the process then we can trim out computations that never get used. This is similar to how models are defined in llama.cpp. Changing the model definition in this way improves token generation performance by approximately 8%.
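The effect of moving Rows earlier can be illustrated with plain slices: projecting every position and then keeping the last row wastes work, while selecting the row first projects only what is needed. The helper names below are hypothetical, not the model API:

```go
package main

import "fmt"

// matVec applies a weight matrix to one hidden state, producing the
// logits for a single token position.
func matVec(w [][]float64, h []float64) []float64 {
	out := make([]float64, len(w))
	for i, row := range w {
		for j, x := range row {
			out[i] += x * h[j]
		}
	}
	return out
}

// lastLogitsSlow projects every position, then keeps only the last
// row — the pattern where Rows runs as the final step.
func lastLogitsSlow(w [][]float64, hidden [][]float64) []float64 {
	all := make([][]float64, len(hidden))
	for t, h := range hidden {
		all[t] = matVec(w, h) // wasted work for all but the last t
	}
	return all[len(all)-1]
}

// lastLogitsFast selects the row first, then projects only it — the
// same result with a fraction of the computation.
func lastLogitsFast(w [][]float64, hidden [][]float64) []float64 {
	return matVec(w, hidden[len(hidden)-1])
}

func main() {
	w := [][]float64{{1, 2}, {3, 4}}
	hidden := [][]float64{{1, 1}, {2, 0}}
	fmt.Println(lastLogitsSlow(w, hidden), lastLogitsFast(w, hidden))
}
```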
-
Jesse Gross authored
We don't need to create and destroy the GGML scheduler for every context. This introduces extra CPU overhead for every forward pass and extra memory for contexts that don't actually get scheduled (for example, KV caches). We can instead just have one scheduler for the backend and reset it each time we call Compute. This improves token generation performance by 1-2% and removes scheduler create/destroy from profile traces.
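The "allocate once, reset per Compute" pattern the commit describes can be sketched with stand-in types. The real code drives the GGML scheduler through cgo; everything here is illustrative:

```go
package main

import "fmt"

// sched stands in for the GGML backend scheduler. The type and its
// methods are hypothetical, showing only the reuse pattern: one
// scheduler per backend, reset before each forward pass.
type sched struct {
	allocations int
	resets      int
}

func newSched() *sched  { return &sched{allocations: 1} }
func (s *sched) reset() { s.resets++ }

type backend struct{ s *sched }

// newBackend creates the scheduler once, when the backend is built.
func newBackend() *backend { return &backend{s: newSched()} }

// Compute reuses the long-lived scheduler instead of creating and
// destroying one per forward pass.
func (b *backend) Compute() {
	b.s.reset()
	// ... schedule and run the graph ...
}

func main() {
	b := newBackend()
	for i := 0; i < 3; i++ {
		b.Compute()
	}
	fmt.Println(b.s.allocations, b.s.resets)
}
```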
-
Jesse Gross authored
Currently the following parameters are in the runner but not used:
- numGPULayers
- mainGPU
- threads
- tensorSplit

This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.
-
Bruce MacDonald authored
Added unit tests to verify error handling behavior in the Client.stream and Client.do methods. Tests cover various error scenarios including:
- Error responses with status codes >= 400
- Error messages with successful status codes
- Empty error messages
- Successful responses
-
Michael Yang authored
Clang outputs are faster. We were previously building with clang via a gcc wrapper in cgo, but this was missed during the build updates, so there was a drop in performance.
-
frob authored
-
danielekp authored
-
Lucas Hahn authored
-
Michael Yang authored
-