Commits · d07cf41a97ea11e4f84f6df37997788d033f7e06 · OpenDAS / ollama

01 Nov, 2024 2 commits
- refactor kv estimation · d07cf41a
  Michael Yang authored Oct 31, 2024
  
  d07cf41a
- mllama cross attention · 8c238e70
  Michael Yang authored Oct 31, 2024
  
  8c238e70
30 Oct, 2024 6 commits

Refine default thread selection for NUMA systems (#7322) · 16f4eabe

Daniel Hiltgen authored Oct 30, 2024

Until we have full NUMA support, this adjusts the default thread selection
algorithm to count up the number of performance cores across all sockets.

16f4eabe
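
The counting rule described in this commit message can be sketched as follows. This is a minimal illustration, not the project's code: the `CPUSocket` type, its fields, and `defaultThreads` are hypothetical names invented for the example.

```go
package main

import "fmt"

// CPUSocket is a hypothetical description of one CPU package (socket).
type CPUSocket struct {
	CoreCount           int // physical cores in this socket
	EfficiencyCoreCount int // how many of those are efficiency cores
}

// defaultThreads sums the performance cores over every socket.
// It returns 0 when no socket information is available, in which
// case a caller would presumably fall back to some other default.
func defaultThreads(sockets []CPUSocket) int {
	total := 0
	for _, s := range sockets {
		total += s.CoreCount - s.EfficiencyCoreCount
	}
	return total
}

func main() {
	sockets := []CPUSocket{
		{CoreCount: 16, EfficiencyCoreCount: 8},
		{CoreCount: 16, EfficiencyCoreCount: 8},
	}
	fmt.Println("default threads:", defaultThreads(sockets))
}
```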

runner.go: Better abstract vision model integration · c826e574

Jesse Gross authored Oct 11, 2024



- Update mllama to take the cross attention state as embeddings in
  a batch, more similar to how Llava handles it. This improves
  integration with the input cache.
- Pass locations in a prompt for embeddings using tags similar to Llava.
- Abstract interface to vision models so the main runner accesses Clip
  and Mllama similarly.
Co-authored-by: Michael Yang <mxyng@pm.me>

c826e574

Soften windows clang requirement (#7428) · 712e99d4

Daniel Hiltgen authored Oct 30, 2024

This will no longer error if built with regular gcc on windows.  To help
triage issues that may come in related to different compilers, the runner now
reports the compiler used by cgo.

712e99d4

Remove submodule and shift to Go server - 0.4.0 (#7157) · b754f5a6

Daniel Hiltgen authored Oct 30, 2024

* Remove llama.cpp submodule and shift new build to top

* CI: install msys and clang gcc on win

Needed for deepseek to work properly on windows

b754f5a6

Move windows app out of preview (#7347) · a805e594
Daniel Hiltgen authored Oct 30, 2024

a805e594

windows: Support alt install paths, fit and finish (#6967) · 91dfbb1b

Daniel Hiltgen authored Oct 30, 2024

* windows: Support alt install paths

Advanced users are leveraging innosetup's /DIR switch to target
an alternate location, but we get confused by things not existing in the LocalAppData dir.
This also hardens the server path lookup code for a future attempt to unify with a ./bin prefix

* Fit and finish improvements for windows app

Document alternate install location instructions for binaries and model.
Pop up progress UI for upgrades (automatic, with cancel button).
Expose non-default port in menu to disambiguate multiple instances.
Set minimum Windows version to 10 22H2

91dfbb1b

29 Oct, 2024 4 commits

add more tests for getting the optimal tiled canvas (#7411) · db1842b9
Patrick Devine authored Oct 29, 2024

db1842b9

Switch windows to clang (#7407) · c9ca3861

Daniel Hiltgen authored Oct 29, 2024

* Switch over to clang for deepseek on windows

The patch for deepseek requires clang on windows. gcc on windows
has a buggy c++ library and can't handle the unicode characters

* Fail fast with wrong compiler on windows

Avoid users mistakenly building with GCC when we need clang

c9ca3861

tests: Add test for Unicode processing · 078f666f
Jesse Gross authored Oct 23, 2024

078f666f

runner.go: Better handle return NULL values from llama.cpp · de1557a0

Jesse Gross authored Oct 22, 2024

Llama.cpp sometimes returns NULL as a return value to report an
error. We should explicitly check for this and convert it to a Go
error rather than putting NULL in our data structures and waiting
for it to blow up later.

de1557a0
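
The pattern this commit message describes, checking for NULL at the boundary and turning it into a Go error, might look roughly like the sketch below. It avoids cgo so that it stays self-contained: `rawLoad` is a hypothetical stand-in for a C call that returns NULL on failure, and `model`, `LoadModel`, and `ErrLoadFailed` are illustrative names rather than the runner's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// model stands in for an opaque handle returned by a C library.
type model struct{ path string }

// rawLoad is a hypothetical stand-in for a C function that reports
// failure by returning NULL (nil here) instead of an error code.
func rawLoad(path string) *model {
	if path == "" {
		return nil
	}
	return &model{path: path}
}

// ErrLoadFailed is returned when the underlying call yields nil.
var ErrLoadFailed = errors.New("unable to load model")

// LoadModel checks the result immediately and converts nil into an
// error, so a nil handle never ends up stored in a data structure.
func LoadModel(path string) (*model, error) {
	m := rawLoad(path)
	if m == nil {
		return nil, fmt.Errorf("%w: %q", ErrLoadFailed, path)
	}
	return m, nil
}

func main() {
	if _, err := LoadModel(""); err != nil {
		fmt.Println("error:", err)
	}
	if m, err := LoadModel("model.gguf"); err == nil {
		fmt.Println("loaded:", m.path)
	}
}
```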

28 Oct, 2024 1 commit
- add mllama image processing to the generate handler (#7384) · 084929c2
  Patrick Devine authored Oct 28, 2024
  
  084929c2
27 Oct, 2024 1 commit
- Bump to latest Go 1.22 patch (#7379) · abd5dfd0
  Daniel Hiltgen authored Oct 26, 2024
  
  abd5dfd0
26 Oct, 2024 2 commits

Fix deepseek deseret regex (#7369) · 099f7077
Daniel Hiltgen authored Oct 26, 2024
```
On windows compiled with gcc the c++ regex library failed to handle
the characters
```
099f7077

Better support for AMD multi-GPU on linux (#7212) · d7c94e0c

Daniel Hiltgen authored Oct 26, 2024

* Better support for AMD multi-GPU

This resolves a number of problems related to AMD multi-GPU setups on linux.

The numeric IDs used by rocm are not the same as the numeric IDs exposed in
sysfs although the ordering is consistent.  We have to count up from the first
valid gfx (major/minor/patch with non-zero values) we find starting at zero.

There are 3 different env vars for selecting GPUs, and only ROCR_VISIBLE_DEVICES
supports UUID based identification, so we should favor that one, and try
to use UUIDs if detected to avoid potential ordering bugs with numeric IDs

* ROCR_VISIBLE_DEVICES only works on linux

On windows, use HIP_VISIBLE_DEVICES, which only supports numeric IDs

d7c94e0c
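
As a rough sketch of the selection rule in this commit message: on linux prefer ROCR_VISIBLE_DEVICES and pass UUIDs when every GPU has one, otherwise fall back to numeric IDs; elsewhere use HIP_VISIBLE_DEVICES with numeric IDs. The `gpuInfo` type, the `visibleDevicesEnv` function, and the example UUID strings are assumptions made for illustration, and treating every non-linux OS like windows is a simplification.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpuInfo is a hypothetical record for one discovered AMD GPU.
type gpuInfo struct {
	ID   int    // numeric ID as counted by the runtime
	UUID string // empty when no UUID was detected
}

// visibleDevicesEnv returns the environment variable name and value
// that would restrict the runtime to the given GPUs.
func visibleDevicesEnv(goos string, gpus []gpuInfo) (name, value string) {
	ids := make([]string, 0, len(gpus))
	if goos != "linux" {
		// Assumption: non-linux platforms only get numeric IDs.
		for _, g := range gpus {
			ids = append(ids, strconv.Itoa(g.ID))
		}
		return "HIP_VISIBLE_DEVICES", strings.Join(ids, ",")
	}
	// Use UUIDs only if every selected GPU reported one.
	haveUUIDs := len(gpus) > 0
	for _, g := range gpus {
		if g.UUID == "" {
			haveUUIDs = false
		}
	}
	for _, g := range gpus {
		if haveUUIDs {
			ids = append(ids, g.UUID)
		} else {
			ids = append(ids, strconv.Itoa(g.ID))
		}
	}
	return "ROCR_VISIBLE_DEVICES", strings.Join(ids, ",")
}

func main() {
	gpus := []gpuInfo{{ID: 0, UUID: "GPU-aaaa"}, {ID: 1, UUID: "GPU-bbbb"}}
	for _, goos := range []string{"linux", "windows"} {
		name, value := visibleDevicesEnv(goos, gpus)
		fmt.Printf("%s: %s=%s\n", goos, name, value)
	}
}
```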

25 Oct, 2024 2 commits
- Fix unicode output on windows with redirect to file (#7358) · 35ec7f07
  Daniel Hiltgen authored Oct 25, 2024
```
If we're not writing out to a terminal, avoid setting the console mode
on windows, which corrupts the output file.
```
  35ec7f07
- Fix incremental build file deps (#7361) · 5231ae52
  Daniel Hiltgen authored Oct 25, 2024
```
The common src/hdr defs should be in the common definitions, not gpu specific.
```
  5231ae52
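
The redirect fix in #7358 above comes down to checking whether output is a terminal before touching console settings. A minimal standard-library sketch of such a check is shown below. The helper names are hypothetical, the character-device test is a coarse heuristic (devices such as the null device also pass it), and the actual fix may use a different detection method.

```go
package main

import (
	"fmt"
	"os"
)

// shouldSetConsoleMode reports whether a file with the given mode
// looks like an interactive character device.
func shouldSetConsoleMode(mode os.FileMode) bool {
	return mode&os.ModeCharDevice != 0
}

// isTerminal applies the check to an open file; any Stat error is
// treated as "not a terminal" so redirected output is left alone.
func isTerminal(f *os.File) bool {
	fi, err := f.Stat()
	if err != nil {
		return false
	}
	return shouldSetConsoleMode(fi.Mode())
}

func main() {
	if isTerminal(os.Stdout) {
		// Only in this branch would a program adjust console modes.
		fmt.Fprintln(os.Stderr, "stdout is a terminal")
	} else {
		fmt.Fprintln(os.Stderr, "stdout is redirected; leaving console mode untouched")
	}
	fmt.Println("héllo, wörld")
}
```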
24 Oct, 2024 1 commit

Improve dependency gathering logic (#7345) · 3085c47b

Daniel Hiltgen authored Oct 24, 2024

This unifies the rocm/cuda dependency logic into the makefile
and fixes a missing define which broke windows rocm

3085c47b

23 Oct, 2024 1 commit
- fix #7247 - invalid image input (#7249) · 0ccc7325
  Bill Wang authored Oct 24, 2024
```
---------
Co-authored-by: Bill Wang <bill.wang@bill.wang>
```
  0ccc7325
22 Oct, 2024 5 commits

integration: harden embedding test (#7306) · dc6fe820
Daniel Hiltgen authored Oct 22, 2024
```
Use cosine similarity to make the embeddings tests more robust
```
dc6fe820
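
For reference, cosine similarity compares the direction of two embedding vectors rather than their exact values, which is why it tolerates small numeric drift between runs. The sketch below is a generic implementation and not the test code from this commit; returning 0 for mismatched or zero-length inputs is a choice made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns dot(a, b) / (|a| * |b|).
// It returns 0 for mismatched lengths, empty input, or zero vectors.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	expected := []float64{0.10, 0.20, 0.30}
	got := []float64{0.11, 0.19, 0.31}
	// A test would compare this against a threshold instead of
	// requiring the two vectors to match exactly.
	fmt.Printf("similarity: %.4f\n", cosineSimilarity(expected, got))
}
```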
default to "FROM ." if a Modelfile isn't present (#7250) · d78fb620
Patrick Devine authored Oct 22, 2024

d78fb620

Fix rocm windows build and clean up dependency gathering (#7305) · 5c44461c

Daniel Hiltgen authored Oct 22, 2024

On windows ensure windows version define is properly set for rocm.
Remove duplicate rocm arch flags.
Resolve wildcards in the targets so parallel builds don't race.
Use readlink to resolve rocm dependencies since wildcards omit libelf
Keep windows rocm deps aligned with unified packaging model

5c44461c

runner.go: Merge partial unicode characters before sending · 03e40efa

Jesse Gross authored Oct 21, 2024

We check for partial unicode characters and accumulate them before
sending. However, when we did send, we still sent each individual piece
separately, leading to broken output. This combines everything into
a single group, which is also more efficient.

This also switches to the built-in check for valid unicode characters,
which is stricter. After this, we should never send back an invalid
sequence.

Fixes #7290

03e40efa
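
The accumulate-then-send behavior described in this commit message can be sketched with the standard `unicode/utf8` package. The `accumulator` type and its `Add` method are hypothetical names for illustration. Note that this simplified version keeps buffering if the input never becomes valid, which a real implementation would need to handle.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// accumulator buffers token pieces until they form valid UTF-8.
type accumulator struct {
	pending strings.Builder
}

// Add appends piece to the buffer. If the buffered text is valid
// UTF-8 it is returned as one merged string and the buffer is
// cleared; otherwise nothing is released yet.
func (a *accumulator) Add(piece string) (string, bool) {
	a.pending.WriteString(piece)
	merged := a.pending.String()
	if !utf8.ValidString(merged) {
		return "", false
	}
	a.pending.Reset()
	return merged, true
}

func main() {
	var acc accumulator
	// "é" is two bytes in UTF-8; here it arrives split across pieces.
	for _, piece := range []string{"caf", "\xc3", "\xa9"} {
		if out, ok := acc.Add(piece); ok {
			fmt.Printf("send %q\n", out)
		} else {
			fmt.Printf("hold %q\n", piece)
		}
	}
}
```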

readme: add Ollama for Swift to the community integrations (#7295) · 23f74650
Mattt authored Oct 21, 2024

23f74650

19 Oct, 2024 1 commit
- server: allow vscode-webview origin (#7273) · 48708ca0
  Jeffrey Morgan authored Oct 19, 2024
  
  48708ca0
18 Oct, 2024 1 commit

image processing for llama3.2 (#6963) · c7cb0f06

Patrick Devine authored Oct 18, 2024


Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>

c7cb0f06

17 Oct, 2024 4 commits

llama: Decouple patching script from submodule (#7139) · bf4018b9

Daniel Hiltgen authored Oct 17, 2024

* Refine llama.cpp vendoring workflow tools

Switch from the sync.sh over to make based tooling

* Run new make sync and patch flow

bf4018b9

llama: add compiler tags for cpu features (#7137) · f86d00cd
Daniel Hiltgen authored Oct 17, 2024
```
This adds the ability to customize the default runner with user specified flags
```
f86d00cd

IBM granite/granitemoe architecture support (#6760) · f2890a44

Gabe Goodhart authored Oct 17, 2024

* fix(ext_server): Port llama.cpp sampling refactors to ext_server

This was a fairly large changeset. I closely followed the changes here:
https://github.com/ggerganov/llama.cpp/commit/df270ef74596da8f1178f08991f4c51f18c9ee82



Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Bump llama.cpp to the latest master with `granite` support

This does not yet have granite MoE support, but that can come in a
follow up PR

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update solar patch for llama.cpp bump

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update the solar-pro patch for latest llama.cpp bump

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump to the latest master of llama.cpp

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches for latest bump

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama): Always run sync.sh from the right directory

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Update llama patches

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama)!: Rough sync with llama.cpp submodule

There are a number of changes that will need to be propagated to llama.go
before any of this works!

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Add a patch and update for missing ggml-impl.h include

This include is where the ggml_cgraph struct is defined. It is included in
many of the .c files to define the forward declaration in ggml.h. It seems
that with the subset of code included here, the import was somehow lost (or
out-of-order) when building, so adding this include to llama.cpp fixes the
missing definition.

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Add missing log.cpp

This was added as part of the logging overhaul done in llama.cpp

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Overhaul use of sampling module for llama.cpp changes

The changes here reflect the changes made in the big llama.cpp sampling PR
https://github.com/ggerganov/llama.cpp/pull/9294



The sampling functionality is now broken into the base interface
(llama_sampler) and the generation implementation (gpt_sampler). The
changes here reflect that. Since the sampling.h/sampling.cpp code uses c++
STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow go to
access a pure-C interface.

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix the impl of SampleTokenGreedy for new sampling

I don't think this method is currently used, so it could probably just be
removed so that all sampling goes through the GPT interface, but in the
interest of doing no harm, this should keep the method working as expected.

Branch: IBMGraniteArchitectureSupport

* fix(llama): Remove unused SampleTokenGreedy

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(sync): Remove bash-specific change to sync.sh

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* chore(gofumpt): Format on llama.go to pass linting

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Fix missing <thread> include in ext_server

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove TODO about grammar_first

This feature was not used/needed previously so should be fine without
plumbing it through now.

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Better naming for sampling wrapper and args

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix patch 05 to use new wrapper api and re-sync

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* runner: Flush pending responses before returning

If there are any pending responses (such as from potential stop
tokens) then we should send them back before ending the sequence.
Otherwise, we can be missing tokens at the end of a response.

Fixes #6707

* fix(llama/sampling): Use gpt_sampler with a forward declaration

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove unnecessary patch for gguf impl header

This was caused by an earlier mistake in the embeddings patch that was
dereferencing the pointer instead of using the wrapper API.

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Remove use of deprecated --log-disable flag

Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

f2890a44

Rename gpu package discover (#7143) · 05cd82ef
Daniel Hiltgen authored Oct 16, 2024
```
Cleaning up go package naming
```
05cd82ef

16 Oct, 2024 1 commit

Move macos v11 support flags to build script (#7203) · 7d6eb0d4

Daniel Hiltgen authored Oct 16, 2024

Having v11 support hard-coded into the cgo settings causes warnings
for newer Xcode versions. This should help keep the build clean for users
building from source with the latest tools, while still allowing us to target
the older OS via our CI processes.

7d6eb0d4

15 Oct, 2024 3 commits
- Discovery CPU details for default thread selection (#6264) · 24636dfa
  Daniel Hiltgen authored Oct 15, 2024
```
On windows, detect large multi-socket systems and reduce to the number of cores
in one socket for best performance
```
  24636dfa
- Adding 'Ollama App' as community integrations (#6465) · 1d7fa3ad
  JHubi1 authored Oct 15, 2024
  
  1d7fa3ad
- Add missing BF16 tensor type. (#7193) · 09035b71
  frob authored Oct 15, 2024
```
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
```
  09035b71
14 Oct, 2024 1 commit
- Track GPU discovery failure information (#5820) · f3c8b898
  Daniel Hiltgen authored Oct 14, 2024
```
* Expose GPU discovery failure information

* Remove exposed API for now
```
  f3c8b898
13 Oct, 2024 1 commit
- Fix regression on older macos versions (#7192) · 5dd0477f
  Daniel Hiltgen authored Oct 13, 2024
```
The new cgo compilation requires a flag to target older macos versions
```
  5dd0477f
12 Oct, 2024 1 commit
- llm: Remove GGML_CUDA_NO_PEER_COPY for ROCm (#7174) · c3d321d4
  Daniel Hiltgen authored Oct 12, 2024
```
This workaround logic in llama.cpp is causing crashes for users with less system memory than VRAM.
```
  c3d321d4
10 Oct, 2024 2 commits

cli: Send all images in conversation history · 7fe39025

Jesse Gross authored Oct 09, 2024

Currently the CLI only sends images from the most recent image-
containing message. This prevents doing things like sending
one message with an image and then a follow-up message with a
second image and asking for a comparison based on additional
information not present in any text that was output.

It's possible that some models have a problem with this but the
CLI is not the right place to do this since any adjustments are
model-specific and should affect all clients.

Both llava:34b and minicpm-v do reasonable things with multiple
images in the history.

7fe39025

runner.go: Handle truncation of tokens for stop sequences · 0077e22d

Jesse Gross authored Oct 09, 2024

When a single token contains both text to be returned and a stop
sequence, this causes an out of bounds error when we update the
cache to match our text. This is because we currently assume that
removing the stop sequence will consume at least one token.

This also inverts the logic to deal with positive numbers, rather
than a value to be subtracted, which is easier to reason about.

Fixes #7153

0077e22d
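
To make the edge case in this commit message concrete, the sketch below truncates a stop sequence from a list of token pieces and reports whether the last kept piece was cut in the middle. The `truncateStop` name and signature are assumptions for the example. When a piece is split, the number of kept pieces can equal the number of input pieces, so code that assumes at least one token was removed would index out of bounds.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateStop removes the stop sequence and everything after it
// from the joined pieces. It returns the surviving pieces and
// whether the final surviving piece was cut partway through.
func truncateStop(pieces []string, stop string) ([]string, bool) {
	joined := strings.Join(pieces, "")
	idx := strings.Index(joined, stop)
	if idx < 0 {
		return pieces, false
	}
	remaining := joined[:idx]

	// Re-split the remaining text along the original piece boundaries.
	var kept []string
	truncated := false
	for _, p := range pieces {
		if len(remaining) == 0 {
			break
		}
		if len(p) > len(remaining) {
			kept = append(kept, remaining)
			truncated = true
			break
		}
		kept = append(kept, p)
		remaining = remaining[len(p):]
	}
	return kept, truncated
}

func main() {
	// The last piece holds both returned text and the stop sequence.
	pieces := []string{"Hello", " world!STOP"}
	kept, truncated := truncateStop(pieces, "STOP")
	// Tracking the count of pieces to keep (a positive number) avoids
	// subtracting a "removed" count that may legitimately be zero.
	fmt.Printf("kept %d of %d pieces, truncated=%v: %q\n",
		len(kept), len(pieces), truncated, kept)
}
```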