Re-introduce the `llama` package (#5034)
* Re-introduce the llama package

This PR brings back the `llama` package, making it possible to call llama.cpp and ggml APIs from Go directly via CGo. This has a few advantages:

- C APIs can be called directly from Go without needing to use the previous "server" REST API
- On macOS and for CPU builds on Linux and Windows, Ollama can be built without a `go generate ./...` step, making it easy to get up and running to hack on parts of Ollama that don't require fast inference
- Faster build times for AVX, AVX2, CUDA, and ROCm (a full build of all runners takes <5 min on a fast CPU)
- No git submodule, making it easier to clone and build from source

This is a big PR, but much of it is vendor code except for:

- `llama.go`: CGo bindings
- `example/`: a simple example of running inference
- `runner/`: a subprocess server designed to replace the `llm/ext_server` package
- `Makefile`: an as-minimal-as-possible Makefile to build the runner package for different...
Files changed (all new files, mode 100644; diffs collapsed in this view):

- llama/make/Makefile.cuda_v11
- llama/make/Makefile.cuda_v12
- llama/make/Makefile.default
- llama/make/Makefile.rocm
- llama/make/common-defs.make
- llama/make/cuda.make
- llama/make/gpu.make
- llama/patches/01-cuda.diff
- (one collapsed patch diff, name not shown)
- llama/patches/03-metal.diff
- (several collapsed patch diffs, names not shown)
- llama/patches/11-blas.diff
- llama/runner/README.md
- llama/runner/cache.go
- llama/runner/cache_test.go
- llama/runner/requirements.go