Commits · 351a85d9ea0db108ca29bba48d0a04e37c6e3607 · OpenDAS / ollama

20 Feb, 2025 2 commits
- openai: add 'timeout' to allowable x-stainless headers (#9237) · 351a85d9
  Lucas Hahn authored Feb 20, 2025
  
  351a85d9
- reorder patches · bda4ef6c
  Michael Yang authored Feb 19, 2025
  
  bda4ef6c
19 Feb, 2025 5 commits
- Merge pull request #9203 from ollama/mxyng/sapphirerapids · 1e438b23
  Michael Yang authored Feb 19, 2025
```
build: remove backend build for sapphirerapids
```
  1e438b23
- test: add test cases for ListHandler (#9146) · d721a02e
  yuiseki authored Feb 20, 2025
  
  d721a02e
- docs: Add AntSK to Community Integrations (#9214) · 778603a8
  zyxucp authored Feb 20, 2025
  
  778603a8
- docs: Add MaxKB to Community Integrations (#9212) · 3c874df4
  maninhill authored Feb 20, 2025
  
  3c874df4
- llama: add patch to fix ggml backend reg on Linux with utf-8 characters in the path (#9159) · d2eb226c
  Jeffrey Morgan authored Feb 18, 2025
  
  d2eb226c
18 Feb, 2025 9 commits
- Merge pull request #9079 from jeremyschlatter/main · e13e7c8d
  Michael Yang authored Feb 18, 2025
```
cmd: fix flickering in progress bar
```
  e13e7c8d
- address code review comments · 78f403ff
  Jeremy Schlatter authored Feb 18, 2025
  
  78f403ff
- build: remove backend build for sapphirerapids · 5f8c0318
  Michael Yang authored Feb 18, 2025
```
sapphire rapids has amx support but it ends up having a negative
performance impact.

emerald rapids also has amx support with a positive performance impact
however there's no reasonable way in ggml to differentiate between the
two. the impact is small (~6%) so disable amx entirely for simplicity
```
  5f8c0318
- cmake: avoid building intel backends on linux · 08a299e1
  Michael Yang authored Feb 18, 2025
  
  08a299e1
- ci: set owner/group in tarball · 7b5d916a
  Michael Yang authored Feb 14, 2025
```
set owner and group when building the linux tarball so extracted files
are consistent. this is the behaviour of release tarballs in version
0.5.7 and lower
```
  7b5d916a
- Add OpenDeepResearcher-via-searxng to Community Integrations (#9138) · 33ad61b1
  benhaotang authored Feb 18, 2025
  
  33ad61b1
- test: add test cases for HumanNumber (#9108) · 716e3656
  L. Jiang authored Feb 19, 2025
  
  716e3656
- readme: add LLM Telegram Bot to community integrations (#9150) · 3b4424ff
  innightwolfsleep authored Feb 18, 2025
  
  3b4424ff
- cmd: eliminate flickering with synchronized output · f9c7ead1
  Jeremy Schlatter authored Feb 17, 2025
  
  f9c7ead1
17 Feb, 2025 2 commits

cmd: fix cursor flickering in progress bar · 5930aaeb

Jeremy Schlatter authored Feb 17, 2025

The previous commit fixed flickering in the progress bar itself. Cursor
flickering is harder to address.

Cursor flickering could be fixed by hiding the cursor altogether while
the progress bar is displayed. The downside of this is that if the
program is killed in such a way that it can't clean up its state, it
would leave the cursor invisible.

Instead, this commit introduces an output buffer. All of the escape
codes and content for a single progress update are written to a buffer,
which is then flushed to the terminal all at once. This significantly
decreases the time during which the terminal has seen the cursor-hiding
code but has not yet seen the cursor-showing code, thus minimizing (but
not 100% eliminating) cursor flickering.

For more context, see:
https://gitlab.gnome.org/GNOME/vte/-/issues/2837#note_2269501

5930aaeb

cmd: fix progress bar flickering · faf67db0

Jeremy Schlatter authored Feb 17, 2025

Previous code cleared the display before writing new content, creating a
window where the terminal could (and in some cases did) render empty lines.

Instead, we now write new content over the old content, only clearing
the trailing end of lines for cases where the new line is shorter.

Fixes #1664

faf67db0

15 Feb, 2025 2 commits
- docs: fix incorrect shortcut key in windows.md (#9098) · 0667badd
  James-William-Kincaid-III authored Feb 15, 2025
  
  0667badd
- model: document high-level model interface (#9122) · d006e1e0
  Bruce MacDonald authored Feb 14, 2025
  
  d006e1e0
14 Feb, 2025 17 commits

Wire up system info log for new engine (#9123) · df2680b4
Daniel Hiltgen authored Feb 14, 2025

df2680b4

llamarunner: Init GGML before printing system info · 010313bb

Jesse Gross authored Feb 14, 2025

We currently print system info before the GGML backends are loaded.
This results in only getting information about the default lowest
common denominator runner. If we move up the GGML init then we can
see what we are actually running.

Before:
time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24

010313bb

llm: attempt to evaluate symlinks, but do not fail (#9089) · 5296f487

Jeffrey Morgan authored Feb 13, 2025

provides a better approach to #9088 that will attempt to
evaluate symlinks (important for macOS where 'ollama' is
often a symlink), but use the result of os.Executable()
as a fallback in scenarios where filepath.EvalSymlinks
fails due to permission erorrs or other issues

5296f487

llm: do not evaluate symlink for exe path lookup (#9088) · f05774b0

Jeffrey Morgan authored Feb 13, 2025

In some cases, the directories in the executable path read by
filepath.EvalSymlinks are not accessible, resulting in permission
errors which results in an error when running models. It also
doesn't work well on long paths on windows, also resulting in
errors. This change removes filepath.EvalSymlinks when accessing
os.Executable() altogether

f05774b0

ml/backend/ggml: stable sort devices by score (#9081) · 6600bd7d
Jeffrey Morgan authored Feb 13, 2025

6600bd7d

Runner for Ollama engine · ed443a03

Jesse Gross authored Dec 17, 2024

This provides integration with the new Ollama engine
(58245413 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
 - Parallel processing
 - Memory management for defragmentation and shifting
 - Multi-modal modals

Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1

ed443a03

models: Move model into their own directory · 6945617a

Jesse Gross authored Feb 05, 2025

This allows there to be a file that is a list of models that is
not mixed into the runner code.

6945617a

vocab: Use int32 for special tokens · 7916f550

Jesse Gross authored Feb 03, 2025

Special tokens are currently read as uint32 from the model metadata.
However, all other parts of the system (including the tokenizer) use
int32 to represent tokens so it is impossible to represent the high
portion of the unsigned range. For consistency and to avoid casts,
we should just use int32 everywhere.

7916f550

model: Load tensors behind an interface · d650ad39

Jesse Gross authored Jan 14, 2025

Currently, if a model uses an interface for its data structures (as mllama
does) then the tensor data in the structs implementing that interface will
not get loaded.

d650ad39

ggml-backend: Close on nil should be a no-op · d223f3b6
Jesse Gross authored Feb 10, 2025

d223f3b6

ggml-backend: Ensure data is available after async computation · 60830695

Jesse Gross authored Feb 05, 2025

We need to sync before retrieving data after async computation.
It is also important to ensure that the Go buffer is not moved by
the GC across function calls so we do a synchronous copy.

60830695

ggml-backend: Let GGML allocate context memory · 01d9a468

Jesse Gross authored Jan 30, 2025

Passing in a Go buffer is not safe because the garbage collector could
free or move the memory while the context is still open. However, if
we pass in the size and a nil pointer then GGML will allocate it from
the C side.

01d9a468

backend: API to support full precision matmul · d773b7d6

Jesse Gross authored Feb 13, 2025

Most tensor backends try to optimize performance by using a lower
precision for matmuls. However, some operations (such as kq) on
some models are sensitive to this and require full precision.

d773b7d6

backend: Support graph computation that does not return an output · 4d4463b2

Jesse Gross authored Feb 03, 2025

There are two cases where we may not have an output after computing:
 - Prompt processing where the length of the input exceeds the batch
   size
 - Internal memory management operations such as cache defrag and shift

4d4463b2

backend: Consistently use int (vs. int64) for tensor shapes · 0e38297f

Jesse Gross authored Feb 03, 2025

Currently there is a mixture of int and int64 used when dealing with
tensor dimensions and shapes, which causes unnecessary conversions -
they all should be the same type.

In general, most interfaces (such as Pytorch) use int64 for
generality but most implementations (such as CUDA) use int32 for
performance. There isn't much benefit to us to being more flexible
than the implementations we are likely to run on.

In addition, as a practical matter, a model with a tensor with a single
dimension larger than 32 bits is unlikely to run on a 32-bit machine.

0e38297f

backend: Don't return an error on Close · 7e13f568

Jesse Gross authored Feb 04, 2025

It is not common to return errors with close/free operations - most
people won't check it and even if they did there's probably not much
that can do. It's better to not give implementations false expectations.

7e13f568

next ollama runner (#7913) · 58245413

Michael Yang authored Feb 14, 2025



feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

58245413

13 Feb, 2025 3 commits
- docs: add ollamazing to the README.md (#9075) · 8cf16063
  Bùi Đức Nhật authored Feb 14, 2025
  
  8cf16063
- docs: add H200 as supported device. (#9076) · 3a4449e2
  frob authored Feb 13, 2025
```
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
```
  3a4449e2
- openai: finish_reason as tool_calls for streaming with tools (#7963) · 10d59d5f
  Anuraag (Rag) Agrawal authored Feb 14, 2025
  
  10d59d5f