Commits · 479d5517668a0e8b68be8aae8e2f940efcbfbb60 · OpenDAS / ollama

11 Nov, 2024 1 commit
- docs: add mentions of Llama 3.2 (#7517) · 479d5517
  frances720 authored Nov 10, 2024
  
  479d5517
08 Nov, 2024 1 commit
- docs: update langchainpy.md with proper model name (#7527) · 771fab1d
  Edward J. Schwartz authored Nov 08, 2024
  
  771fab1d
06 Nov, 2024 2 commits
- docs: OLLAMA_NEW_RUNNERS no longer exists · 3020d2dc
  Jesse Gross authored Nov 06, 2024
  
  3020d2dc
- runner.go: Remove unused arguments · a9094176
  Jesse Gross authored Oct 30, 2024
```
Now that server.cpp is gone, we don't need to keep passing arguments
that were only ignored and only kept for compatibility.
```
  a9094176
30 Oct, 2024 4 commits

Soften windows clang requirement (#7428) · 712e99d4

Daniel Hiltgen authored Oct 30, 2024

This will no longer error if built with regular gcc on windows.  To help
triage issues that may come in related to different compilers, the runner now
reports the compiler used by cgo.

712e99d4

Remove submodule and shift to Go server - 0.4.0 (#7157) · b754f5a6

Daniel Hiltgen authored Oct 30, 2024

* Remove llama.cpp submodule and shift new build to top

* CI: install msys and clang gcc on win

Needed for deepseek to work properly on windows

b754f5a6

Move windows app out of preview (#7347) · a805e594
Daniel Hiltgen authored Oct 30, 2024

a805e594

windows: Support alt install paths, fit and finish (#6967) · 91dfbb1b

Daniel Hiltgen authored Oct 30, 2024

* windows: Support alt install paths

Advanced users are leveraging innosetup's /DIR switch to target
an alternate location, but we get confused by things not existing in the LocalAppData dir.
This also hardens the server path lookup code for a future attempt to unify with a ./bin prefix

* Fit and finish improvements for windows app

Document alternate install location instructions for binaries and model.
Pop up progress UI for upgrades (automatic, with cancel button).
Expose non-default port in menu to disambiguate multiple instances.
Set minimum Windows version to 10 22H2

91dfbb1b

29 Oct, 2024 1 commit

Switch windows to clang (#7407) · c9ca3861

Daniel Hiltgen authored Oct 29, 2024

* Switch over to clang for deepseek on windows

The patch for deepseek requires clang on windows. gcc on windows
has a buggy c++ library and can't handle the unicode characters

* Fail fast with wrong compiler on windows

Avoid users mistakenly building with GCC when we need clang

c9ca3861

26 Oct, 2024 1 commit

Better support for AMD multi-GPU on linux (#7212) · d7c94e0c

Daniel Hiltgen authored Oct 26, 2024

* Better support for AMD multi-GPU

This resolves a number of problems related to AMD multi-GPU setups on linux.

The numeric IDs used by rocm are not the same as the numeric IDs exposed in
sysfs although the ordering is consistent.  We have to count up from the first
valid gfx (major/minor/patch with non-zero values) we find starting at zero.

There are 3 different env vars for selecting GPUs, and only ROCR_VISIBLE_DEVICES
supports UUID based identification, so we should favor that one, and try
to use UUIDs if detected to avoid potential ordering bugs with numeric IDs

* ROCR_VISIBLE_DEVICES only works on linux

Use the numeric-ID-only HIP_VISIBLE_DEVICES on windows
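
The ID mismatch described above can be sketched as follows. This is only an illustration, not the code from the commit: the `node` type, its fields, and `rocmIndexes` are hypothetical, and treating a node as valid when its gfx version is not all zeros is an assumption about what "non-zero values" means.

```go
package main

import "fmt"

// node is a hypothetical view of one sysfs topology entry.
type node struct {
	sysfsID             int
	major, minor, patch int // gfx version; all zero for non-GPU nodes (assumption)
}

// rocmIndexes walks the nodes in sysfs order and numbers only the valid
// GPUs, starting at zero. The result maps sysfs ID to the counted index.
func rocmIndexes(nodes []node) map[int]int {
	out := make(map[int]int)
	next := 0
	for _, n := range nodes {
		if n.major == 0 && n.minor == 0 && n.patch == 0 {
			continue // not a GPU, so it does not consume an index
		}
		out[n.sysfsID] = next
		next++
	}
	return out
}

func main() {
	nodes := []node{
		{sysfsID: 0},                      // e.g. a CPU node
		{sysfsID: 1, major: 10, minor: 3}, // first valid gfx
		{sysfsID: 2, major: 11},           // second valid gfx
	}
	fmt.Println(rocmIndexes(nodes))
}
```

Because the numbering depends on discovery order, a UUID-based selector avoids the mapping entirely, which is the reason the commit gives for preferring ROCR_VISIBLE_DEVICES where it is available.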

d7c94e0c

23 Oct, 2024 1 commit
- fix #7247 - invalid image input (#7249) · 0ccc7325
  Bill Wang authored Oct 24, 2024
```
---------
Co-authored-by: Bill Wang <bill.wang@bill.wang>
```
  0ccc7325
08 Oct, 2024 1 commit

Re-introduce the `llama` package (#5034) · 96efd905

Jeffrey Morgan authored Oct 08, 2024



* Re-introduce the llama package

This PR brings back the llama package, making it possible to call llama.cpp and
ggml APIs from Go directly via CGo. This has a few advantages:

- C APIs can be called directly from Go without needing to use the previous
  "server" REST API
- On macOS and for CPU builds on Linux and Windows, Ollama can be built without
  a go generate ./... step, making it easy to get up and running to hack on
  parts of Ollama that don't require fast inference
- Faster build times for AVX,AVX2,CUDA and ROCM (a full build of all runners
  takes <5 min on a fast CPU)
- No git submodule making it easier to clone and build from source

This is a big PR, but much of it is vendor code except for:

- llama.go CGo bindings
- example/: a simple example of running inference
- runner/: a subprocess server designed to replace the llm/ext_server package
- Makefile: an as-minimal-as-possible Makefile to build the runner package for
  different targets (cpu, avx, avx2, cuda, rocm)
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

* cache: Clear old KV cache entries when evicting a slot

When forking a cache entry, if no empty slots are available we
evict the least recently used one and copy over the KV entries
from the closest match. However, this copy does not overwrite
existing values but only adds new ones. Therefore, we need to
clear the old slot first.

This change fixes two issues:
 - The KV cache fills up and runs out of space even though we think
   we are managing it correctly
 - Performance gets worse over time as we use new cache entries that
   are not hot in the processor caches
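
A minimal sketch of the eviction order described above, assuming a hypothetical in-memory `slot` type and a logical clock in place of real timestamps. None of these names come from the runner; the point is only that the victim slot is cleared before the copy, since the copy adds entries without overwriting existing ones.

```go
package main

import "fmt"

// slot is a hypothetical stand-in for one KV cache sequence.
type slot struct {
	tokens   []int
	lastUsed int // logical tick; larger means more recently used
	inUse    bool
}

// evictAndFork picks the least recently used idle slot, clears it, then
// copies the first n tokens of src into it. It returns the slot index,
// or -1 when every slot is busy.
func evictAndFork(slots []slot, src []int, n, tick int) int {
	victim := -1
	for i := range slots {
		if slots[i].inUse {
			continue
		}
		if victim == -1 || slots[i].lastUsed < slots[victim].lastUsed {
			victim = i
		}
	}
	if victim == -1 {
		return -1
	}
	if n > len(src) {
		n = len(src)
	}
	slots[victim].tokens = slots[victim].tokens[:0] // clear stale entries first
	slots[victim].tokens = append(slots[victim].tokens, src[:n]...)
	slots[victim].lastUsed = tick
	return victim
}

func main() {
	slots := []slot{
		{tokens: []int{1, 2, 3}, lastUsed: 5},
		{tokens: []int{7, 8, 9, 10}, lastUsed: 2},
	}
	i := evictAndFork(slots, []int{4, 5, 6}, 2, 9)
	fmt.Println(i, slots[i].tokens)
}
```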

* doc: explain golang objc linker warning (#6830)

* llama: gather transitive dependencies for rocm for dist packaging (#6848)

* Refine go server makefiles to be more DRY (#6924)

This breaks up the monolithic Makefile for the Go based runners into a
set of utility files as well as recursive Makefiles for the runners.
Files starting with the name "Makefile" are buildable, while files that
end with ".make" are utilities to include in other Makefiles.  This
reduces the amount of nearly identical targets and helps set a pattern
for future community contributions for new GPU runner architectures.

When we are ready to switch over to the Go runners, these files should
move to the top of the repo, and we should add targets for the main CLI,
as well as a helper "install" (put all the built binaries on the local
system in a runnable state) and "dist" target (generate the various
tar/zip files for distribution) for local developer use.

* llama: don't create extraneous directories (#6988)

* llama: Exercise the new build in CI (#6989)

Wire up some basic sanity testing in CI for the Go runner.  GPU runners are not covered yet.

* llama: Refine developer docs for Go server (#6842)

This enhances the documentation for development, focusing on the new Go
server.  After we complete the transition, further doc refinements
can remove the "transition" discussion.

* runner.go: Allocate batches for all sequences during init

We should tell the model that we could have full batches for all
sequences. We already do this when we allocate the batches but it was
missed during initialization.

* llama.go: Don't return nil from Tokenize on zero length input

Potentially receiving nil in a non-error condition is surprising to
most callers - it's better to return an empty slice.
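
The convention can be illustrated with a toy function. This is a sketch only: `tokenize` below is a hypothetical whitespace splitter, not the `Tokenize` binding in llama.go.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize is a hypothetical tokenizer used purely for illustration.
func tokenize(text string) ([]string, error) {
	if len(text) == 0 {
		// Empty, non-nil slice: "no tokens" is a normal result, so callers
		// can range over it or take len() without a nil check.
		return []string{}, nil
	}
	return strings.Fields(text), nil
}

func main() {
	toks, err := tokenize("")
	fmt.Println(toks != nil, len(toks), err)
}
```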

* runner.go: Remove stop tokens from cache

If the last token is EOG then we don't return this and it isn't
present in the cache (because it was never submitted to Decode).
This works well for extending the cache entry with a new sequence.

However, for multi-token stop sequences, we won't return any of the
tokens but all but the last one will be in the cache. This means
when the conversation continues the cache will contain tokens that
don't overlap with the new prompt.

This works (we will pick up the portion where there is overlap) but
it causes unnecessary cache thrashing because we will fork the original
cache entry as it is not a perfect match.

By trimming the cache to the tokens that we actually return this
issue can be avoided.
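
The effect of trimming can be seen in a small sketch. The helper names and token values are hypothetical; the cache is modeled as a plain slice of token IDs rather than the runner's actual structures.

```go
package main

import "fmt"

// commonPrefix returns how many leading tokens two sequences share.
func commonPrefix(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// trimToReturned drops cached tokens that were decoded but never returned
// to the caller, such as the leading part of a multi-token stop sequence.
func trimToReturned(cached []int, returned int) []int {
	if returned < len(cached) {
		return cached[:returned]
	}
	return cached
}

func main() {
	cached := []int{1, 2, 3, 90, 91} // 90, 91: partial stop sequence, never returned
	next := []int{1, 2, 3, 4, 5}     // follow-up prompt built from returned tokens

	// Untrimmed: the entry is only a partial match, so it would be forked.
	fmt.Println(commonPrefix(cached, next) == len(cached))

	// Trimmed: the whole entry is a prefix of the prompt and can be extended.
	trimmed := trimToReturned(cached, 3)
	fmt.Println(commonPrefix(trimmed, next) == len(trimmed))
}
```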

* runner.go: Simplify flushing of pending tokens

* runner.go: Update TODOs

* runner.go: Don't panic when processing sequences

If there is an error processing a sequence, we should return a
clean HTTP error back to Ollama rather than panicking. This will
make us more resilient to transient failures.

Panics can still occur during startup as there is no way to serve
requests if that fails.
Co-authored-by: jmorganca <jmorganca@gmail.com>

* runner.go: More accurately capture timings

Currently prompt processing time doesn't capture the time that it takes
to tokenize the input, only decoding time. We should capture the
full process to more accurately reflect reality. This is especially
true once we start processing images where the initial processing
can take significant time. This is also more consistent with the
existing C++ runner.

* runner.go: Support for vision models

In addition to bringing feature parity with the C++ runner, this also
incorporates several improvements:
 - Cache prompting works with images, avoiding the need to re-decode
   embeddings for every message in a conversation
 - Parallelism is supported, avoiding the need to restrict to one
   sequence at a time. (Though for now Ollama will not schedule
   them while we might need to fall back to the old runner.)
Co-authored-by: jmorganca <jmorganca@gmail.com>

* runner.go: Move Unicode checking code and add tests

* runner.go: Export external cache members

Runner and cache are in the same package so the change doesn't
affect anything but it is more internally consistent.

* runner.go: Image embedding cache

Generating embeddings from images can take significant time (on
my machine between 100ms and 8s depending on the model). Although
we already cache the result of decoding these images, the embeddings
need to be regenerated every time. This is not necessary if we get
the same image over and over again, for example, during a conversation.

This currently uses a very small cache with a very simple algorithm
but it is easy to improve as is warranted.
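
One possible shape for such a cache is sketched below. The commit message does not describe the algorithm beyond "very small" and "very simple", so everything here is an assumption: the SHA-256 key, the fixed capacity, and the least-recently-used replacement are illustrative choices, and all names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

type entry struct {
	key       [sha256.Size]byte
	embedding []float32
	lastUsed  int
}

// imageCache holds at most max embeddings, keyed by a hash of the image bytes.
type imageCache struct {
	entries []entry
	max     int
	tick    int
}

// get returns the cached embedding for image, if present.
func (c *imageCache) get(image []byte) ([]float32, bool) {
	key := sha256.Sum256(image)
	c.tick++
	for i := range c.entries {
		if c.entries[i].key == key {
			c.entries[i].lastUsed = c.tick
			return c.entries[i].embedding, true
		}
	}
	return nil, false
}

// put stores an embedding, replacing the least recently used entry when full.
func (c *imageCache) put(image []byte, emb []float32) {
	if c.max <= 0 {
		return
	}
	key := sha256.Sum256(image)
	c.tick++
	e := entry{key: key, embedding: emb, lastUsed: c.tick}
	oldest := 0
	for i := range c.entries {
		if c.entries[i].key == key {
			c.entries[i] = e // refresh an existing entry
			return
		}
		if c.entries[i].lastUsed < c.entries[oldest].lastUsed {
			oldest = i
		}
	}
	if len(c.entries) < c.max {
		c.entries = append(c.entries, e)
		return
	}
	c.entries[oldest] = e
}

func main() {
	c := &imageCache{max: 2}
	c.put([]byte("image-a"), []float32{0.1, 0.2})
	_, hit := c.get([]byte("image-a"))
	fmt.Println(hit)
}
```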

* llama: catch up on patches

Carry forward solar-pro and cli-unicode patches

* runner.go: Don't re-allocate memory for every batch

We can reuse memory allocated from batch to batch since batch
size is fixed. This both saves the cost of reallocation and
keeps the cache lines hot.

This results in a roughly 1% performance improvement for token
generation with Nvidia GPUs on Linux.
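
The reuse pattern, reduced to plain Go slices. This is a sketch under the assumption that a batch can be modeled as a fixed-capacity slice; the real batches are allocated through the llama bindings, and the `batch` type here is hypothetical.

```go
package main

import "fmt"

// batch is a hypothetical fixed-capacity token batch.
type batch struct {
	tokens []int
}

// newBatch allocates the backing array once, up front.
func newBatch(size int) *batch {
	return &batch{tokens: make([]int, 0, size)}
}

// reset empties the batch but keeps the backing array for the next round.
func (b *batch) reset() { b.tokens = b.tokens[:0] }

// add appends a token and reports false when the batch is already full.
func (b *batch) add(tok int) bool {
	if len(b.tokens) == cap(b.tokens) {
		return false
	}
	b.tokens = append(b.tokens, tok)
	return true
}

func main() {
	b := newBatch(4)
	for round := 0; round < 3; round++ {
		b.reset() // no new allocation between rounds
		for i := 0; i < 4; i++ {
			b.add(round*10 + i)
		}
		fmt.Println(b.tokens, cap(b.tokens))
	}
}
```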

* runner.go: Default to classic input cache policy

The input cache as part of the go runner implemented a cache
policy that aims to maximize hit rate in both single and multi-
user scenarios. When there is a cache hit, the response is
very fast.

However, performance is actually slower when there is an input
cache miss due to worse GPU VRAM locality. This means that
performance is generally better overall for multi-user scenarios
(better input cache hit rate, locality was relatively poor already),
but worse for single users (input cache hit rate is about the same,
locality is now worse).

This defaults the policy back to the old one to avoid a regression
but keeps the new one available through an environment variable
OLLAMA_MULTIUSER_CACHE. This is left undocumented as the goal is
to improve this in the future to get the best of both worlds
without user configuration.

For inputs that result in cache misses, on Nvidia/Linux this
change improves performance by 31% for prompt processing and
13% for token generation.

* runner.go: Increase size of response channel

Generally the CPU can easily keep up with handling responses that
are generated but there's no reason not to let generation continue
and handle things in larger batches if needed.
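
In Go terms this is a buffered channel between the producer and the consumer. The sketch below is illustrative only: `generate`, the token strings, and the buffer size of 100 are hypothetical and not taken from the runner.

```go
package main

import "fmt"

// generate produces n responses on a channel with the given buffer size.
// With a larger buffer the producer can run ahead of a slow consumer
// instead of blocking on every send.
func generate(n, buffer int) <-chan string {
	responses := make(chan string, buffer)
	go func() {
		defer close(responses)
		for i := 0; i < n; i++ {
			responses <- fmt.Sprintf("token-%d", i)
		}
	}()
	return responses
}

func main() {
	count := 0
	for range generate(5, 100) {
		count++
	}
	fmt.Println(count)
}
```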

* llama: Add CI to verify all vendored changes have patches (#7066)

Make sure we don't accidentally merge changes in the vendored code
that aren't also reflected in the patches.

* llama: adjust clip patch for mingw utf-16 (#7065)

* llama: adjust clip patch for mingw utf-16

* llama: ensure static linking of runtime libs

Avoid runtime dependencies on non-standard libraries

* runner.go: Enable llamafile (all platforms) and BLAS (Mac OS)

These are two features that are shown on llama.cpp's system info
that are currently different between the two runners. On my test
systems the performance difference is very small to negligible
but it is probably still good to equalize the features.

* llm: Don't add BOS/EOS for tokenize requests

This is consistent with what server.cpp currently does. It affects
things like token processing counts for embedding requests.

* runner.go: Don't cache prompts for embeddings

Our integration with server.cpp implicitly disables prompt caching
because it is not part of the JSON object being parsed. This makes
the Go runner behave similarly.

Prompt caching has been seen to affect the results of text completions
on certain hardware. The results are not wrong either way but they
are non-deterministic. However, embeddings seem to be affected even
on hardware that does not show this behavior for completions. For
now, it is best to maintain consistency with the existing behavior.

* runner.go: Adjust debug log levels

Add system info printed at startup and quiet down noisier logging.

* llama: fix compiler flag differences (#7082)

Adjust the flags for the new Go server to more closely match the
generate flow

* llama: refine developer docs (#7121)

* llama: doc and example clean up (#7122)

* llama: doc and example clean up

* llama: Move new dockerfile into llama dir

Temporary home until we fully transition to the Go server

* llama: runner doc cleanup

* llama.go: Add description for Tokenize error case

---------
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>

96efd905

25 Sep, 2024 1 commit
- update default model to llama3.2 (#6959) · 55ea963c
  Jeffrey Morgan authored Sep 25, 2024
  
  55ea963c
20 Sep, 2024 1 commit

Add Windows arm64 support to official builds (#5712) · d632e23f

Daniel Hiltgen authored Sep 20, 2024

* Unified arm/x86 windows installer

This adjusts the installer payloads to be architecture aware so we can carry
both amd64 and arm64 binaries in the installer, and install only the applicable
architecture at install time.

* Include arm64 in official windows build

* Harden schedule test for slow windows timers

This test seems to be a bit flaky on windows, so give it more time to converge

d632e23f

18 Sep, 2024 1 commit
- documentation for stopping a model (#6766) · 5804cf17
  Patrick Devine authored Sep 18, 2024
  
  5804cf17
16 Sep, 2024 1 commit
- fix typo in import docs (#6828) · d81cfd7d
  Patrick Devine authored Sep 16, 2024
  
  d81cfd7d
11 Sep, 2024 1 commit

Verify permissions for AMD GPU (#6736) · 9246e6dd

Daniel Hiltgen authored Sep 11, 2024

This adds back a check which was lost many releases back to verify /dev/kfd permissions
which when lacking, can lead to confusing failure modes of:
"rocBLAS error: Could not initialize Tensile host: No devices found"

This implementation does not hard fail the serve command but instead will fall back to CPU
with an error log. In the future we can include this in the GPU discovery UX to show
detected but unsupported devices we discovered.

9246e6dd

10 Sep, 2024 1 commit
- docs: update examples to use llama3.1 (#6718) · 83a9b527
  Jeffrey Morgan authored Sep 09, 2024
  
  83a9b527
07 Sep, 2024 1 commit
- docs: improve linux install documentation (#6683) · 108fb6c1
  Jeffrey Morgan authored Sep 06, 2024
```
Includes small improvements to document layout and code blocks
```
  108fb6c1
05 Sep, 2024 2 commits

Document uninstall on windows (#6663) · 48685c6e
Daniel Hiltgen authored Sep 05, 2024

48685c6e

Update gpu.md: Add RTX 3050 Ti and RTX 3050 (#5888) · 5f944baa

Michael authored Sep 05, 2024

* Update gpu.md

    Seems strange that the laptop versions of 3050 and 3050 Ti would be supported but not the non-notebook, but this is what the page (https://developer.nvidia.com/cuda-gpus) says.
Signed-off-by: bean5 <2052646+bean5@users.noreply.github.com>

* Update gpu.md

Remove notebook reference

---------
Signed-off-by: bean5 <2052646+bean5@users.noreply.github.com>

5f944baa

04 Sep, 2024 1 commit
- docs: add group to manual Linux instructions and verify service is running (#6430) · 133770a5
  Tomoya Fujita authored Sep 04, 2024
  
  133770a5
02 Sep, 2024 1 commit
- docs: update faq.md for OLLAMA_MODELS env var permissions (#6587) · 741affdf
  SnoopyTlion authored Sep 03, 2024
  
  741affdf
01 Sep, 2024 1 commit
- docs: update GGUF examples and references (#6577) · 1aad8387
  rayfiyo authored Sep 01, 2024
  
  1aad8387
29 Aug, 2024 1 commit
- update the openai docs to explain how to set the context size (#6548) · 8e4e509f
  Patrick Devine authored Aug 28, 2024
  
  8e4e509f
27 Aug, 2024 5 commits
- add safetensors to the modelfile docs (#6532) · d13c3daa
  Patrick Devine authored Aug 27, 2024
  
  d13c3daa
- Fix import image width (#6528) · 1713eddc
  Patrick Devine authored Aug 27, 2024
  
  1713eddc
- Update manual instructions with discrete ROCm bundle (#6445) · 4e1c4f6e
  Daniel Hiltgen authored Aug 27, 2024
  
  4e1c4f6e
- adjust image sizes · 1c70a00f
  Patrick Devine authored Aug 27, 2024
  
  1c70a00f
- update the import docs (#6104) · ac80010d
  Patrick Devine authored Aug 26, 2024
  
  ac80010d
23 Aug, 2024 1 commit
- update faq · bb362caf
  Michael Yang authored Jul 02, 2024
  
  bb362caf
19 Aug, 2024 2 commits
- Review comments · f9e31da9
  Daniel Hiltgen authored Aug 15, 2024
  
  f9e31da9
- Adjust layout to bin+lib/ollama · 88bb9e33
  Daniel Hiltgen authored Aug 14, 2024
  
  88bb9e33
13 Aug, 2024 2 commits
- update chatml template format to latest in docs (#6344) · eda8a32a
  Bruce MacDonald authored Aug 13, 2024
  
  eda8a32a
- Update openai.md to remove extra checkbox (#6345) · 1f322761
  Pamela Fox authored Aug 13, 2024
  
  1f322761
12 Aug, 2024 1 commit
- update import.md · bd5e4326
  Michael Yang authored Aug 05, 2024
  
  bd5e4326
07 Aug, 2024 2 commits
- add metrics to docs (#6079) · 5b3a21b5
  royjhan authored Aug 07, 2024
  
  5b3a21b5
- Use llama3.1 in tools example (#5985) · ad0c19dd
  Kyle Kelley authored Aug 07, 2024
```
* Use llama3.1 in tools example

* Update api.md
```
  ad0c19dd
05 Aug, 2024 2 commits
- Disable paging for journalctl (#6154) · b73b0940
  frob authored Aug 05, 2024
```
Users using `journalctl` to get logs for issue logging sometimes don't realize that paging is causing information to be missed.
```
  b73b0940
- line feed · 6a073447
  Michael Yang authored Aug 04, 2024
  
  6a073447