- 22 Oct, 2024 1 commit
  - Mattt authored
- 19 Oct, 2024 1 commit
  - Jeffrey Morgan authored
- 18 Oct, 2024 1 commit
  - Patrick Devine authored
    Co-authored-by: jmorganca <jmorganca@gmail.com>
    Co-authored-by: Michael Yang <mxyng@pm.me>
    Co-authored-by: Jesse Gross <jesse@ollama.com>
- 17 Oct, 2024 4 commits
  - Daniel Hiltgen authored
    * Refine llama.cpp vendoring workflow tools: switch from sync.sh over to make-based tooling
    * Run new make sync and patch flow
  - Daniel Hiltgen authored
    This adds the ability to customize the default runner with user-specified flags.
  - Gabe Goodhart authored
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    * fix(ext_server): Port llama.cpp sampling refactors to ext_server. This was a fairly large changeset that closely followed the changes in https://github.com/ggerganov/llama.cpp/commit/df270ef74596da8f1178f08991f4c51f18c9ee82
    * fix(server.cpp): Refactor server.cpp logging for the llama.cpp overhaul
    * feat: Bump llama.cpp to the latest master with `granite` support. This does not yet have granite MoE support, but that can come in a follow-up PR
    * fix(patches): Update all patches (except solar-pro) to work with the bumped llama.cpp
    * fix(solar): Update the solar patch for the llama.cpp bump
    * feat(llama.cpp): Bump llama.cpp for granitemoe support
    * fix(solar): Update the solar-pro patch for the latest llama.cpp bump
    * feat(llama.cpp): Bump to the latest master of llama.cpp
    * fix(patches): Update all patches for the latest bump
    * feat(llama): Always run sync.sh from the right directory
    * fix(llama/patches): Update llama patches
    * feat(llama)!: Rough sync with the llama.cpp submodule. A number of changes will need to be propagated to llama.go before any of this works
    * fix(llama/patches): Add a patch for the missing ggml-impl.h include. This include is where the ggml_cgraph struct is defined. It is included in many of the .c files to define the forward declaration in ggml.h. With the subset of code included here, the import was somehow lost (or out of order) when building, so adding this include to llama.cpp fixes the missing definition
    * fix(llama/sync): Add the missing ggml-cpu-impl.h copy-over in sync.sh
    * fix(llama): Add the missing log.cpp, introduced as part of the logging overhaul done in llama.cpp
    * fix(llama): Overhaul use of the sampling module to reflect the big llama.cpp sampling PR https://github.com/ggerganov/llama.cpp/pull/9294. Sampling is now broken into a base interface (llama_sampler) and the generation implementation (gpt_sampler). Since the sampling.h/sampling.cpp code uses C++ STL headers, the sampling_ext.[h|cpp] wrapper is maintained to give Go access to a pure-C interface
    * fix(llama): Fix the impl of SampleTokenGreedy for the new sampling. This method does not appear to be used, so it could probably be removed so that all sampling goes through the GPT interface, but in the interest of doing no harm, this keeps the method working as expected
    * fix(llama): Remove the unused SampleTokenGreedy
    * fix(sync): Remove a bash-specific change to sync.sh
    * chore(gofumpt): Format llama.go to pass linting
    * fix(llm): Fix the missing <thread> include in ext_server
    * fix(llama): Remove the TODO about grammar_first. This feature was not used or needed previously, so it should be fine without plumbing it through now
    * fix(llama): Better naming for the sampling wrapper and args
    * fix(llama): Fix patch 05 to use the new wrapper API and re-sync
    * runner: Flush pending responses before returning. If there are any pending responses (such as from potential stop tokens), they should be sent back before ending the sequence; otherwise tokens at the end of a response can go missing. Fixes #6707
    * fix(llama/sampling): Use gpt_sampler with a forward declaration
    * fix(llama): Remove an unnecessary patch for the gguf impl header, caused by an earlier mistake in the embeddings patch that dereferenced the pointer instead of using the wrapper API
    * fix(llm): Remove use of the deprecated --log-disable flag
  - Daniel Hiltgen authored
    Cleaning up Go package naming
- 16 Oct, 2024 1 commit
  - Daniel Hiltgen authored
    Having v11 support hard-coded into the cgo settings causes warnings for newer Xcode versions. This should help keep the build clean for users building from source with the latest tools, while still allowing us to target the older OS via our CI processes.
- 15 Oct, 2024 3 commits
  - Daniel Hiltgen authored
    On Windows, detect large multi-socket systems and reduce to the number of cores in one socket for best performance.
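The capping heuristic described above can be sketched as a pure function. This is an illustrative sketch under the assumption that socket and core counts are already known; the function name and exact policy are not Ollama's actual implementation.

```go
package main

import "fmt"

// capThreads limits the default worker count on large multi-socket
// machines to the cores of a single socket, avoiding cross-socket
// memory traffic that can hurt token rate.
func capThreads(totalCores, sockets int) int {
	if sockets > 1 {
		return totalCores / sockets
	}
	return totalCores
}

func main() {
	fmt.Println(capThreads(96, 2)) // dual-socket 96-core box -> 48
	fmt.Println(capThreads(16, 1)) // single socket is unchanged -> 16
}
```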
  - JHubi1 authored
  - frob authored
    Co-authored-by: Richard Lyons <frob@cloudstaff.com>
- 14 Oct, 2024 1 commit
  - Daniel Hiltgen authored
    * Expose GPU discovery failure information
    * Remove exposed API for now
- 13 Oct, 2024 1 commit
  - Daniel Hiltgen authored
    The new cgo compilation requires a flag to target older macOS versions.
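Such a flag is typically supplied through cgo directives in the preamble. A minimal sketch, assuming clang's standard minimum-version flag; the package name and version number here are illustrative, not the values Ollama uses.

```go
package llama

/*
// -mmacosx-version-min is clang's flag for targeting an older macOS;
// the version below is an example, not Ollama's actual minimum.
#cgo darwin CFLAGS: -mmacosx-version-min=11.3
#cgo darwin LDFLAGS: -mmacosx-version-min=11.3
*/
import "C"
```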
- 12 Oct, 2024 1 commit
  - Daniel Hiltgen authored
    This workaround logic in llama.cpp is causing crashes for users with less system memory than VRAM.
- 10 Oct, 2024 3 commits
  - Jesse Gross authored
    Currently the CLI only sends images from the most recent image-containing message. This prevents doing things like sending one message with an image and then a follow-up message with a second image and asking for a comparison based on additional information not present in any text that was output. It's possible that some models have a problem with this, but the CLI is not the right place to handle it, since any adjustments are model-specific and should affect all clients. Both llava:34b and minicpm-v do reasonable things with multiple images in the history.
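The behavior change amounts to collecting images from the whole history rather than only the last image-bearing message. A minimal sketch; the `message` type and function name are hypothetical stand-ins, not the CLI's actual types.

```go
package main

import "fmt"

// message is a simplified stand-in for a chat message.
type message struct {
	Role   string
	Text   string
	Images [][]byte
}

// allImages collects images from every message in the history,
// not just the most recent image-containing one.
func allImages(history []message) [][]byte {
	var imgs [][]byte
	for _, m := range history {
		imgs = append(imgs, m.Images...)
	}
	return imgs
}

func main() {
	history := []message{
		{Role: "user", Text: "here is a cat", Images: [][]byte{{0x1}}},
		{Role: "user", Text: "compare with this dog", Images: [][]byte{{0x2}}},
	}
	fmt.Println(len(allImages(history))) // both images are sent -> 2
}
```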
  - Jesse Gross authored
    When a single token contains both text to be returned and a stop sequence, this causes an out-of-bounds error when we update the cache to match our text, because we currently assume that removing the stop sequence will consume at least one token. This also inverts the logic to deal with positive numbers, rather than a value to be subtracted, which is easier to reason about. Fixes #7153
  - Jesse Gross authored
    Close can be called on an LLM server if the runner subprocess dies. However, the Ollama scheduler code may not know about this yet and may still try to access it. In this case it is important that 'cmd' is still available, as it is used to check the status of the subprocess. If this happens, Kill may be called twice on the subprocess; that is fine. In addition, model unloading may race with new accesses, so we should hold a lock around this. This may result in the model being reloaded after the first Close call; this is also fine, as Close will be called again later.
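The double-Close safety plus locking can be sketched as below. The type and field names are illustrative, not Ollama's actual ones; the subprocess kill itself is elided.

```go
package main

import (
	"fmt"
	"sync"
)

// llmServer sketches a runner handle whose Close may be invoked both
// when the subprocess dies and again later by the scheduler. The
// mutex guards unloading against racing accesses.
type llmServer struct {
	mu     sync.Mutex
	closed bool
}

// Close is idempotent: killing an already-dead subprocess is harmless,
// and the handle's state stays available for status checks afterwards.
func (s *llmServer) Close() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return nil
	}
	s.closed = true
	// kill the subprocess here; 'cmd' would remain non-nil so the
	// scheduler can still check subprocess status
	return nil
}

func main() {
	s := &llmServer{}
	fmt.Println(s.Close(), s.Close()) // calling Close twice is fine
}
```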
- 09 Oct, 2024 2 commits
  - Daniel Hiltgen authored
    Add missing metal files to the vendoring list
  - Daniel Hiltgen authored
    Expand out the file extensions for vendored code so git reports the status correctly
- 08 Oct, 2024 3 commits
  - Daniel Hiltgen authored
    The recent change to applying patches leaves the submodule dirty based on "new commits" being present. This ensures we clean up so the tree no longer reports dirty after a `go generate ./...` run. The Makefile was being a bit too aggressive in cleaning things up and would delete the placeholder files, which someone might accidentally commit.
  - Jeffrey Morgan authored
    * Re-introduce the llama package. This PR brings back the llama package, making it possible to call llama.cpp and ggml APIs from Go directly via CGo. This has a few advantages:
      - C APIs can be called directly from Go without needing to use the previous "server" REST API
      - On macOS, and for CPU builds on Linux and Windows, Ollama can be built without a `go generate ./...` step, making it easy to get up and running to hack on parts of Ollama that don't require fast inference
      - Faster build times for AVX, AVX2, CUDA, and ROCm (a full build of all runners takes <5 min on a fast CPU)
      - No git submodule, making it easier to clone and build from source
    This is a big PR, but much of it is vendor code, except for:
      - llama.go: CGo bindings
      - example/: a simple example of running inference
      - runner/: a subprocess server designed to replace the llm/ext_server package
      - Makefile: an as-minimal-as-possible Makefile to build the runner package for different...
  - Shifra Goldstone authored
- 05 Oct, 2024 1 commit
  - hidden1nin authored
- 01 Oct, 2024 1 commit
  - Alex Mavrogiannis authored
- 29 Sep, 2024 1 commit
  - zmldndx authored
- 26 Sep, 2024 1 commit
  - Blake Mizerany authored
    This change closes the response body when an error occurs in makeRequestWithRetry. Previously, the first non-200 response body was not closed before reattempting the request. This change ensures that the response body is closed in all cases where an error occurs, preventing file descriptor leaks. Fixes #6974
- 25 Sep, 2024 2 commits
  - Xe Iaso authored
  - Jeffrey Morgan authored
- 24 Sep, 2024 3 commits
  - Daniel Hiltgen authored
    write-host in PowerShell writes directly to the console and will not be picked up by a pipe; echo or write-output will.
  - Alex Yang authored
  - Deep Lakhani authored
- 22 Sep, 2024 1 commit
  - Mahesh Sathiamoorthy authored
    This was causing an error since we depend on punkt_tab.
- 21 Sep, 2024 3 commits
  - Daniel Hiltgen authored
    When running the subprocess as a background service, Windows may throttle it, which can lead to thrashing and a very poor token rate.
  - Daniel Hiltgen authored
    GPUs handled the dependency path properly, but CPU runners didn't, which results in missing VC redist libraries on systems where the user didn't already have them installed from some other app.
  - Daniel Hiltgen authored
    The upload artifact is missing the dist prefix, since all payloads are in the same directory, so restore the prefix on download.
- 20 Sep, 2024 3 commits
  - Daniel Hiltgen authored
  - Daniel Hiltgen authored
  - Daniel Hiltgen authored
    * Unified arm/x86 Windows installer. This adjusts the installer payloads to be architecture-aware so we can carry both amd64 and arm64 binaries in the installer, and install only the applicable architecture at install time.
    * Include arm64 in the official Windows build
    * Harden schedule test for slow Windows timers. This test seems to be a bit flaky on Windows, so give it more time to converge.
- 18 Sep, 2024 2 commits
  - Patrick Devine authored
  - Ryan Marten authored