- 20 Feb, 2025 9 commits
-
-
Jesse Gross authored
Currently Rows is called as the last step in a model computation to get the values for the output tokens. However, if we move it earlier in the process then we can trim out computations that never get used. This is similar to how models are defined in llama.cpp. Changing the model definition in this way improves token generation performance by approximately 8%.
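A rough sketch of the idea (names like hiddenStates, outputIndices, and Output are illustrative, not the actual model code):

```go
// Before: project every position to vocab space, then pick the output rows.
//   logits := m.Output.Forward(ctx, hiddenStates) // [vocab, seqLen]
//   logits = logits.Rows(ctx, outputIndices)      // [vocab, nOutputs]

// After: select rows first so the expensive output projection (and any
// later ops) only runs for the tokens whose values we actually return.
hiddenStates = hiddenStates.Rows(ctx, outputIndices) // [embed, nOutputs]
logits := m.Output.Forward(ctx, hiddenStates)        // [vocab, nOutputs]
```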
-
Jesse Gross authored
We don't need to create and destroy the GGML scheduler for every context. This introduces extra CPU overhead for every forward pass and extra memory for contexts that don't actually get scheduled (for example, KV caches). We can instead just have one scheduler for the backend and reset it each time we call Compute. This improves token generation performance by 1-2% and removes scheduler create/destroy from profile traces.
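A minimal sketch of the pattern, with hypothetical Go types standing in for the cgo scheduler bindings:

```go
// Graph stands in for a compiled compute graph.
type Graph struct{}

// scheduler stands in for the GGML scheduler bindings.
type scheduler interface {
	Reset()           // drop the previous graph's allocations for reuse
	Compute(g *Graph) // schedule and execute the current graph
}

// Backend owns a single scheduler, created once rather than per Context.
type Backend struct {
	sched scheduler
}

func (b *Backend) Compute(g *Graph) {
	b.sched.Reset() // far cheaper than scheduler create/destroy per pass
	b.sched.Compute(g)
}
```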
-
Jesse Gross authored
Currently the following parameters are in the runner but not used:

- numGPULayers
- mainGPU
- threads
- tensorSplit

This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.
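The pass-through might look like a parameter struct handed to the backend constructor (field names and comments assumed for illustration):

```go
// BackendParams carries runner flags through to the backend; the GGML
// backend currently accepts but ignores them.
type BackendParams struct {
	NumThreads   int       // CPU threads to use for computation
	MainGPU      int       // index of the GPU used for small tensors
	NumGPULayers int       // number of layers to offload to the GPU
	TensorSplit  []float32 // fraction of the model to place on each GPU
}
```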
-
Bruce MacDonald authored
Added unit tests to verify error handling behavior in the Client.stream and Client.do methods. Tests cover various error scenarios including:

- Error responses with status codes >= 400
- Error messages with successful status codes
- Empty error messages
- Successful responses
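A sketch of one such test using net/http/httptest (the NewClient and stream signatures, and the request path, are assumed for illustration):

```go
package api

import (
	"context"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"testing"
)

func TestStreamErrorStatus(t *testing.T) {
	// Server that answers with a >= 400 status and a JSON error body.
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
		w.Write([]byte(`{"error":"model not found"}`))
	}))
	defer ts.Close()

	base, _ := url.Parse(ts.URL)
	client := NewClient(base, http.DefaultClient)

	err := client.stream(context.Background(), http.MethodPost, "/api/generate", nil,
		func(bts []byte) error { return nil })
	if err == nil || !strings.Contains(err.Error(), "model not found") {
		t.Fatalf("expected %q error, got %v", "model not found", err)
	}
}
```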
-
Michael Yang authored
clang-built outputs are faster. we were previously building with clang via a gcc wrapper in cgo, but this was missed during the build updates, so there was a drop in performance
-
frob authored
-
danielekp authored
-
Lucas Hahn authored
-
Michael Yang authored
-
- 19 Feb, 2025 5 commits
-
-
Michael Yang authored
build: remove backend build for sapphirerapids
-
yuiseki authored
-
zyxucp authored
-
maninhill authored
-
Jeffrey Morgan authored
-
- 18 Feb, 2025 9 commits
-
-
Michael Yang authored
cmd: fix flickering in progress bar
-
Jeremy Schlatter authored
-
Michael Yang authored
sapphire rapids has amx support, but it ends up having a negative performance impact. emerald rapids also has amx support, with a positive performance impact; however, there's no reasonable way in ggml to differentiate between the two. the impact is small (~6%), so disable amx entirely for simplicity
-
Michael Yang authored
-
Michael Yang authored
set owner and group when building the linux tarball so extracted files are consistent. this is the behaviour of release tarballs in version 0.5.7 and lower
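The release build likely sets this through tar's flags; an equivalent in Go's archive/tar, for illustration only:

```go
import "archive/tar"

// writeEntry normalizes ownership on each header so extracted files look
// the same for everyone (the effect of tar's --owner=0 --group=0).
func writeEntry(tw *tar.Writer, hdr *tar.Header, body []byte) error {
	hdr.Uid, hdr.Gid = 0, 0
	hdr.Uname, hdr.Gname = "root", "root"
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	_, err := tw.Write(body)
	return err
}
```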
-
benhaotang authored
-
L. Jiang authored
-
innightwolfsleep authored
-
Jeremy Schlatter authored
-
- 17 Feb, 2025 2 commits
-
-
Jeremy Schlatter authored
The previous commit fixed flickering in the progress bar itself. Cursor flickering is harder to address: it could be eliminated by hiding the cursor altogether while the progress bar is displayed, but the downside is that if the program is killed in a way that prevents it from cleaning up its state, the cursor would be left invisible. Instead, this commit introduces an output buffer. All of the escape codes and content for a single progress update are written to a buffer, which is then flushed to the terminal all at once. This significantly decreases the time during which the terminal has seen the cursor-hiding code but has not yet seen the cursor-showing code, thus minimizing (but not completely eliminating) cursor flickering. For more context, see: https://gitlab.gnome.org/GNOME/vte/-/issues/2837#note_2269501
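A sketch of the buffering approach (renderProgress and state are hypothetical stand-ins for the actual frame renderer):

```go
// Build the entire frame, escape codes included, then write it in one call
// so the window between cursor-hide and cursor-show is as small as possible.
var buf bytes.Buffer
buf.WriteString("\x1b[?25l")           // hide cursor
buf.WriteString(renderProgress(state)) // cursor movement + bar content
buf.WriteString("\x1b[?25h")           // show cursor
os.Stdout.Write(buf.Bytes())           // single flush per update
```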
-
Jeremy Schlatter authored
Previous code cleared the display before writing new content, creating a window where the terminal could (and in some cases did) render empty lines. Instead, we now write new content over the old content, only clearing the trailing end of lines for cases where the new line is shorter. Fixes #1664
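In terminal-escape terms, roughly (out and newLine are illustrative):

```go
// Return to column 0, overwrite with the new content, then erase only the
// tail of the old line if the new one is shorter ("\x1b[K" = erase to EOL).
fmt.Fprintf(out, "\r%s\x1b[K", newLine)
```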
-
- 15 Feb, 2025 2 commits
-
-
James-William-Kincaid-III authored
-
Bruce MacDonald authored
-
- 14 Feb, 2025 13 commits
-
-
Daniel Hiltgen authored
-
Jesse Gross authored
We currently print system info before the GGML backends are loaded. This results in only getting information about the default lowest common denominator runner. If we move up the GGML init then we can see what we are actually running.

Before:
time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24

After:
time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24
-
Jeffrey Morgan authored
provides a better approach to #9088 that will attempt to evaluate symlinks (important for macOS, where 'ollama' is often a symlink), but use the result of os.Executable() as a fallback in scenarios where filepath.EvalSymlinks fails due to permission errors or other issues
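A sketch of the fallback logic described here (the function name is chosen for illustration):

```go
func executablePath() (string, error) {
	exe, err := os.Executable()
	if err != nil {
		return "", err
	}
	// Prefer the symlink-resolved path (on macOS 'ollama' is often a
	// symlink), but keep the raw os.Executable() result when
	// filepath.EvalSymlinks fails, e.g. on permission errors along the path.
	if resolved, err := filepath.EvalSymlinks(exe); err == nil {
		exe = resolved
	}
	return exe, nil
}
```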
-
Jeffrey Morgan authored
In some cases, the directories in the executable path read by filepath.EvalSymlinks are not accessible, resulting in permission errors that prevent models from running. It also doesn't work well with long paths on Windows, which likewise causes errors. This change removes filepath.EvalSymlinks when accessing os.Executable() altogether.
-
Jeffrey Morgan authored
-
Jesse Gross authored
This provides integration with the new Ollama engine (58245413 next ollama runner (#7913)) and the rest of the Ollama infrastructure such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models such as:

- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models

Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine:

- Start the server with the OLLAMA_NEW_ENGINE environment variable set: OLLAMA_NEW_ENGINE=1 ./ollama serve
- Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M: ./ollama run jessegross/llama3.1
-
Jesse Gross authored
This allows the list of supported models to live in its own file rather than being mixed into the runner code.
-
Jesse Gross authored
Special tokens are currently read as uint32 from the model metadata. However, all other parts of the system (including the tokenizer) use int32 to represent tokens, so it is impossible to represent the high portion of the unsigned range anyway. For consistency and to avoid casts, we should just use int32 everywhere.
-
Jesse Gross authored
Currently, if a model uses an interface for its data structures (as mllama does) then the tensor data in the structs implementing that interface will not get loaded.
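The fix presumably needs the reflective struct walk to unwrap interface values before recursing; roughly:

```go
v := reflect.ValueOf(field)
// An interface-typed field hides the implementing struct one level down;
// without this unwrap, the tensors inside it are never visited.
if v.Kind() == reflect.Interface && !v.IsNil() {
	v = v.Elem() // continue the walk on the concrete implementation
}
```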
-
Jesse Gross authored
-
Jesse Gross authored
We need to sync before retrieving data after async computation. It is also important to ensure that the Go buffer is not moved by the GC across function calls so we do a synchronous copy.
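A sketch of the retrieval path (Sync, NumElements, and CopyTo are illustrative names, not the actual bindings):

```go
backend.Sync() // wait for the async graph execution to finish first
out := make([]float32, t.NumElements())
// Synchronous copy inside a single call: cgo keeps the destination buffer
// in place for the duration, and afterwards the GC may move it freely
// because no C code holds a pointer to it.
t.CopyTo(out)
```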
-
Jesse Gross authored
Passing in a Go buffer is not safe because the garbage collector could free or move the memory while the context is still open. However, if we pass in the size and a nil pointer then GGML will allocate it from the C side.
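Roughly the difference (newTensor and SetData are illustrative names):

```go
// Unsafe: GGML would retain a pointer into GC-managed memory, which the
// garbage collector may move or free while the context is still open.
//   t := newTensor(ctx, len(data), unsafe.Pointer(&data[0]))

// Safe: pass the size with a nil pointer so GGML allocates on the C side,
// then copy the Go data into it explicitly.
t := newTensor(ctx, len(data), nil)
t.SetData(data)
```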
-
Jesse Gross authored
Most tensor backends try to optimize performance by using a lower precision for matmuls. However, some operations (such as kq) on some models are sensitive to this and require full precision.
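A sketch of how a model might opt a sensitive matmul into full precision (SetPrec and PrecisionF32 are illustrative wrappers; GGML itself exposes this via ggml_mul_mat_set_prec):

```go
// Attention scores: some models produce bad output if this matmul runs at
// the backend's default reduced precision, so force f32 for this op only.
kq := key.Mulmat(ctx, query)
kq.SetPrec(PrecisionF32)
```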
-