    Re-introduce the `llama` package (#5034) · 96efd905
    Jeffrey Morgan authored
    * Re-introduce the llama package
    
    This PR brings back the llama package, making it possible to call llama.cpp and
    ggml APIs from Go directly via CGo. This has a few advantages:
    
    - C APIs can be called directly from Go without needing to use the previous
      "server" REST API
    - On macOS and for CPU builds on Linux and Windows, Ollama can be built without
      a `go generate ./...` step, making it easy to get up and running to hack on
      parts of Ollama that don't require fast inference
    - Faster build times for AVX, AVX2, CUDA, and ROCm (a full build of all runners
      takes <5 min on a fast CPU)
    - No git submodule, making it easier to clone and build from source
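    
    To illustrate the first point above, here is a minimal, generic CGo sketch of
    calling a C function directly from Go. The `add` function is a hypothetical
    stand-in for a llama.cpp/ggml API call, not part of the actual llama.go
    bindings:
    
    ```go
    package main
    
    /*
    #include <stdlib.h>
    // Tiny C function standing in for a llama.cpp/ggml API (hypothetical).
    static int add(int a, int b) { return a + b; }
    */
    import "C"
    
    import "fmt"
    
    func main() {
    	// CGo exposes the C function under the C pseudo-package; no REST
    	// round-trip to a "server" process is needed.
    	sum := int(C.add(2, 3))
    	fmt.Println("sum:", sum)
    }
    ```
    
    Building this requires only `CGO_ENABLED=1` and a C toolchain; the real
    bindings link against the vendored llama.cpp/ggml sources in the same way.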
    
    This is a big PR, but much of it is vendor code except for:
    
    - llama.go CGo bindings
    - example/: a simple example of running inference
    - runner/: a subprocess server designed to replace the llm/ext_server package
    - Makefile: an as-minimal-as-possible Makefile to build the runner package for
      different...