- 28 Feb, 2025 1 commit
-
-
Blake Mizerany authored
Require the -as flag to be set when importing a model; this prevents the confusing "invalid name" error message. Also allow short names to be used when importing a model, auto-completing the name with the default mask.
-
- 27 Feb, 2025 2 commits
-
-
Blake Mizerany authored
This fixes panics introduced in 2412adf4 when Gin ungracefully assumes that the http.ResponseWriter implements http.CloseNotifier and http.Flusher, which our new statusCodeRecorder does not. This is a temporary fix until we can pour the rest of the Gin out.
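As an illustration of the pattern (not necessarily the exact fix that landed), a recorder along these lines can capture the status code while delegating the optional interfaces Gin assumes are present:

```go
package server

import "net/http"

// statusCodeRecorder wraps an http.ResponseWriter to capture the status code.
// Flush and CloseNotify delegate to the wrapped writer when it supports them,
// so frameworks like Gin that type-assert for http.Flusher and
// http.CloseNotifier don't panic.
type statusCodeRecorder struct {
	http.ResponseWriter
	status int
}

func (w *statusCodeRecorder) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// Flush forwards to the underlying writer if it can flush.
func (w *statusCodeRecorder) Flush() {
	if f, ok := w.ResponseWriter.(http.Flusher); ok {
		f.Flush()
	}
}

// CloseNotify forwards to the underlying writer if it supports it; otherwise
// it returns a channel that never fires.
func (w *statusCodeRecorder) CloseNotify() <-chan bool {
	if cn, ok := w.ResponseWriter.(http.CloseNotifier); ok {
		return cn.CloseNotify()
	}
	return make(chan bool)
}
```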
-
Blake Mizerany authored
This commit introduces a new API implementation for handling interactions with the registry and the local model cache. The new API is located in server/internal/registry. The package name is "registry" and should be considered temporary; it is hidden and does not bleed outside of the server package. As the commits roll in, we'll start consuming more of the API and then let reverse osmosis take effect, at which point it will surface closer to the root-level packages as needed.
-
- 25 Feb, 2025 1 commit
-
-
Blake Mizerany authored
This commit copies (without history) the bmizerany/ollama-go repository with the intention of integrating it into Ollama as a replacement for pushing and pulling models, and for managing the cache they are pushed to and pulled from. New homes for these packages will be determined as they are integrated and we gain a better understanding of proper package boundaries.
-
- 22 Feb, 2025 1 commit
-
-
Blake Mizerany authored
The route assembly in Handler lacked clear organization, making it difficult to scan for routes and their relationships to each other. This commit fixes that by reordering the assembly of routes to group them by category and purpose. Also, be more specific about what "config" refers to (it is about CORS, if you were wondering... I was).
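For a sense of what grouped registration looks like, here is a minimal Gin sketch; the handler names and exact route set are placeholders, not the real Handler code:

```go
package server

import "github.com/gin-gonic/gin"

// registerRoutes sketches grouping routes by purpose so related endpoints
// are easy to scan together; handler names here are placeholders.
func registerRoutes(r *gin.Engine, h handlers) {
	// Model management
	r.POST("/api/create", h.create)
	r.POST("/api/pull", h.pull)
	r.POST("/api/push", h.push)
	r.DELETE("/api/delete", h.remove)

	// Inference
	r.POST("/api/generate", h.generate)
	r.POST("/api/chat", h.chat)

	// Introspection
	r.GET("/api/tags", h.list)
	r.GET("/api/version", h.version)
}

type handlers struct {
	create, pull, push, remove    gin.HandlerFunc
	generate, chat, list, version gin.HandlerFunc
}
```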
-
- 20 Feb, 2025 2 commits
-
-
frob authored
-
Lucas Hahn authored
-
- 14 Feb, 2025 3 commits
-
-
Jesse Gross authored
This provides integration with the new Ollama engine (58245413 next ollama runner (#7913)) and the rest of the Ollama infrastructure such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models

Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine:
- Start the server with the OLLAMA_NEW_ENGINE environment variable set: OLLAMA_NEW_ENGINE=1 ./ollama serve
- Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M: ./ollama run jessegross/llama3.1
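As a rough, hypothetical picture of the kind of surface such a cache needs (the names below are invented for illustration, not the interface in the repo):

```go
package kvcache

// Cache is a hypothetical sketch of a KV cache that covers the requirements
// listed above; the real interface differs.
type Cache interface {
	// StartForward prepares the cache for a batch that can interleave
	// several sequences being processed in parallel.
	StartForward(seqs []int, positions []int32) error

	// Remove drops positions [begin, end) for one sequence, e.g. when the
	// context window is shifted.
	Remove(seq int, begin, end int32) error

	// Defrag compacts cache memory after removals leave holes behind.
	Defrag() error
}
```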
-
Jesse Gross authored
This allows the list of models to live in a file of its own rather than being mixed into the runner code.
-
Michael Yang authored
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces three high-level interfaces and a number of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow-up PRs will implement more features:
- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
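A condensed sketch of how the three interfaces relate; the signatures are simplified placeholders, and the real definitions in model/model.go and ml/backend.go differ:

```go
package sketch

// Tensor stands in for ml.Tensor: a tensor plus the operations defined on it.
type Tensor interface {
	Shape() []int
}

// Backend stands in for ml.Backend: it loads a pretrained model onto the
// hardware (GPU, CPU, etc.) and lets a Model fetch its loaded tensors.
type Backend interface {
	Get(name string) Tensor
}

// Model stands in for model.Model: an architecture such as llama or mllama
// implements Forward, which is called to generate completions.
type Model interface {
	Forward(b Backend, inputs []int32) (Tensor, error)
}
```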
-
- 05 Feb, 2025 3 commits
-
-
Yashwanth A authored
In some cases, downloads slow due to disk I/O or other factors, causing the download to restart a part. This makes the download appear to "reverse" in percent completion. Increasing the timeout to 30s should make this happen less frequently.
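One way to picture the mechanism is a per-part watchdog that only forces a restart after 30 seconds without progress. This is a hypothetical sketch, not the actual download code:

```go
package download

import (
	"context"
	"time"
)

// stallWatcher cancels a part's context if no progress arrives for 30
// seconds, prompting the part to restart; a longer window means slow disk
// I/O is less likely to trigger a spurious restart. Hypothetical sketch.
func stallWatcher(ctx context.Context, cancel context.CancelFunc, progress <-chan int64) {
	const stallTimeout = 30 * time.Second
	timer := time.NewTimer(stallTimeout)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-progress:
			// Bytes were written; push the deadline out again.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(stallTimeout)
		case <-timer.C:
			// No progress for the full window: restart the part.
			cancel()
			return
		}
	}
}
```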
-
Jeffrey Morgan authored
-
William authored
-
- 29 Jan, 2025 1 commit
-
-
Michael Yang authored
* add build to .dockerignore
* test: only build one arch
* add build to .gitignore
* fix ccache path
* filter amdgpu targets
* only filter if autodetecting
* Don't clobber gpu list for default runner
  This ensures the GPU specific environment variables are set properly
* explicitly set CXX compiler for HIP
* Update build_windows.ps1
  This isn't complete, but is close. Dependencies are missing, and it only builds the "default" preset.
* build: add ollama subdir
* add .git to .dockerignore
* docs: update development.md
* update build_darwin.sh
* remove unused scripts
* llm: add cwd and build/lib/ollama to library paths
* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS
* add additional cmake output vars for msvc
* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12
* remove unnecessary filepath.Dir, cleanup
* add hardware-specific directory to path
* use absolute server path
* build: linux arm
* cmake install targets
* remove unused files
* ml: visit each library path once
* build: skip cpu variants on arm
* build: install cpu targets
* build: fix workflow
* shorter names
* fix rocblas install
* docs: clean up development.md
* consistent build dir removal in development.md
* silence -Wimplicit-function-declaration build warnings in ggml-cpu
* update readme
* update development readme
* llm: update library lookup logic now that there is one runner (#8587)
* tweak development.md
* update docs
* add windows cuda/rocm tests
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
-
- 15 Jan, 2025 1 commit
-
-
Patrick Devine authored
-
- 09 Jan, 2025 1 commit
-
-
Patrick Devine authored
-
- 08 Jan, 2025 1 commit
-
-
isamu arimoto authored
-
- 01 Jan, 2025 1 commit
-
-
Patrick Devine authored
Changes `POST /api/create` to use JSON instead of a Modelfile. This is a breaking change.
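A rough illustration of the new request shape; the field names used here ("model", "from", "system") are indicative only, so confirm the exact schema against the API documentation:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Create a model from a JSON body instead of a Modelfile.
	body, err := json.Marshal(map[string]any{
		"model":  "mario",    // name of the model to create
		"from":   "llama3.2", // existing model to build on
		"system": "You are Mario from Super Mario Bros.",
	})
	if err != nil {
		panic(err)
	}
	resp, err := http.Post("http://localhost:11434/api/create",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```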
-
- 23 Dec, 2024 1 commit
-
-
湛露先生 authored
-
- 15 Dec, 2024 1 commit
-
-
Patrick Devine authored
Refactor mllama image processing code, and add pixtral and qwen2vl
-
- 11 Dec, 2024 1 commit
-
-
Blake Mizerany authored
Fixes #7944
-
- 10 Dec, 2024 3 commits
-
-
frob authored
-
Stefan Weil authored
-
Daniel Hiltgen authored
* llama: wire up builtin runner
  This adds a new entrypoint into the ollama CLI to run the cgo built runner. On Mac arm64, this will have GPU support, but on all other platforms it will be the lowest common denominator CPU build. After we fully transition to the new Go runners more tech-debt can be removed and we can stop building the "default" runner via make and rely on the builtin always.
* build: Make target improvements
  Add a few new targets and help for building locally. This also adjusts the runner lookup to favor local builds, then runners relative to the executable, and finally payloads.
* Support customized CPU flags for runners
  This implements a simplified custom CPU flags pattern for the runners. When built without overrides, the runner name contains the vector flag we check for (AVX) to ensure we don't try to run on unsupported systems and crash. If the user builds a customized set, we omit the naming scheme and don't check for compatibility. This avoids checking requirements at runtime, so that logic has been removed as well. This can be used to build GPU runners with no vector flags, or CPU/GPU runners with additional flags (e.g. AVX512) enabled.
* Use relative paths
  If the user checks out the repo in a path that contains spaces, make gets really confused so use relative paths for everything in-repo to avoid breakage.
* Remove payloads from main binary
* install: clean up prior libraries
  This removes support for v0.3.6 and older versions (before the tar bundle) and ensures we clean up prior libraries before extracting the bundle(s). Without this change, runners and dependent libraries could leak when we update and lead to subtle runtime errors.
-
- 09 Dec, 2024 1 commit
-
-
Jesse Gross authored
New lines can be an important part of a user's prompt, and trimming them can alter the results. We previously only trimmed prompts with images, but refactoring brought this behavior to all prompts, where it became more noticeable. The /generate endpoint adds less whitespace and therefore doesn't need to trim it out; this brings the same behavior to /chat. Thanks to @gabe-l-hart for spotting the issue! Fixes #7795
-
- 05 Dec, 2024 2 commits
-
-
Parth Sareen authored
-
Parth Sareen authored
Adds structured outputs to chat endpoint
---------
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Hieu Nguyen <hieunguyen1053@outlook.com>
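A sketch of what a structured-output request to /api/chat could look like, passing a JSON schema in a "format" field; treat the field names as illustrative and confirm them against the API documentation:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Ask /api/chat to constrain its reply to a JSON schema via "format".
	body, err := json.Marshal(map[string]any{
		"model": "llama3.2",
		"messages": []map[string]string{
			{"role": "user", "content": "Tell me about Canada."},
		},
		"format": map[string]any{
			"type": "object",
			"properties": map[string]any{
				"name":    map[string]string{"type": "string"},
				"capital": map[string]string{"type": "string"},
			},
			"required": []string{"name", "capital"},
		},
		"stream": false,
	})
	if err != nil {
		panic(err)
	}
	resp, err := http.Post("http://localhost:11434/api/chat",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```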
-
- 30 Nov, 2024 2 commits
-
-
Jeffrey Morgan authored
-
Parth Sareen authored
-
- 27 Nov, 2024 1 commit
-
-
Parth Sareen authored
-
- 25 Nov, 2024 1 commit
-
-
Blake Mizerany authored
This changes makeRequest to update the http client Transport if and only if testMakeRequestDialContext is set. This avoids overriding the default Transport when testMakeRequestDialContext is nil, which broke existing behavior, including proxies, timeouts, and other behaviors. Fixes #7829 Fixes #7788
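The gist of the fix, sketched with illustrative names:

```go
package registry

import (
	"context"
	"net"
	"net/http"
)

// testMakeRequestDialContext is set only by tests; it stays nil in production.
var testMakeRequestDialContext func(ctx context.Context, network, addr string) (net.Conn, error)

// newClient replaces the Transport only when the test hook is set, so the
// default Transport (with its proxy, timeout, and keep-alive behavior) is
// left untouched otherwise. Sketch with illustrative names.
func newClient() *http.Client {
	client := &http.Client{}
	if testMakeRequestDialContext != nil {
		client.Transport = &http.Transport{DialContext: testMakeRequestDialContext}
	}
	return client
}
```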
-
- 23 Nov, 2024 1 commit
-
-
oza6ut0ne authored
-
- 22 Nov, 2024 1 commit
-
-
Bruce MacDonald authored
In the past the ollama.com server would return a JWT that contained information about the user being authenticated. This was used to return different error messages to the user. This is no longer possible since the token used to authenticate no longer contains information about the user. This removes the code that no longer works. Follow-up changes will improve the error messages returned here, but it is good to clean up first.
-
- 20 Nov, 2024 1 commit
-
-
Daniel Hiltgen authored
Avoid a round-trip asking users for logs to see what went wrong.
-
- 19 Nov, 2024 1 commit
-
-
Blake Mizerany authored
This change allows mixed-case model names to be pushed, pulled, copied, and created. This was previously disallowed because the Ollama registry was backed by a Docker registry whose naming convention forbade mixed-case names; that is no longer the case. This does not break existing, intended behaviors. Also, make TestCase exercise a story of creating, updating, pulling, and copying a model with case variations, ensuring the model's manifest is updated correctly and not duplicated across different files with different case variations.
-
- 17 Nov, 2024 1 commit
-
-
Jeffrey Morgan authored
-
- 06 Nov, 2024 1 commit
-
-
Jesse Gross authored
The Go runner does not have a problem with supporting parallel requests for most multimodal models. Now that we won't be potentially falling back to server.cpp, this restriction can be lifted. However, the new mllama model can't support parallel requests, so we will need to keep a restriction for that.
-
- 05 Nov, 2024 2 commits
-
-
Daniel Hiltgen authored
One potential failure mode is an empty file which bubbles up as an EOF error, leading to all pulls and listing operations failing. Instead, continue and warn about the corrupt manifest. This also allows re-pulling the corrupt manifest to repair the system.
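An illustrative sketch of the behavior, not the actual code: warn and continue when a manifest fails to decode instead of failing the whole operation:

```go
package server

import (
	"encoding/json"
	"log/slog"
	"os"
)

type Manifest struct {
	SchemaVersion int `json:"schemaVersion"`
}

// listManifests keeps going when a manifest is unreadable or corrupt (for
// example an empty file), so one bad file doesn't break every pull or list,
// and a re-pull can overwrite it later. Illustrative sketch only.
func listManifests(paths []string) []Manifest {
	var manifests []Manifest
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			slog.Warn("skipping unreadable manifest", "path", p, "error", err)
			continue
		}
		var m Manifest
		if err := json.Unmarshal(data, &m); err != nil {
			// An empty file surfaces as an EOF-style decode error; warn and move on.
			slog.Warn("skipping corrupt manifest", "path", p, "error", err)
			continue
		}
		manifests = append(manifests, m)
	}
	return manifests
}
```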
-
Jesse Gross authored
Currently we assume that images take 768 tokens of context size for the purposes of clipping old messages that exceed the context window. However, our mllama implementation stores the full image embedding in a single token. As a result, there is significant waste of context space. Ideally, we would handle this more generically and have the implementation report the number of tokens. However, at the moment this would just result in a similar set of 'if' conditions in the runner plus APIs to report it back. So for now, we just keep this simple.
-
- 04 Nov, 2024 1 commit
-
-
Daniel Hiltgen authored
Avoid excessive log spew and make consistent with chat logging
-