- 02 Jul, 2024 4 commits
-
-
Daniel Hiltgen authored
Switch ARM64 container image base to Rocky Linux 8
-
Daniel Hiltgen authored
The CentOS 7 ARM mirrors have disappeared due to the EOL two days ago, and the vault sed workaround that works for x86 doesn't work for ARM.
-
Daniel Hiltgen authored
CentOS 7 EOL broke mirrors
-
Daniel Hiltgen authored
As of July 1st, 2024: "Could not resolve host: mirrorlist.centos.org". This is expected given the EOL dates.
-
- 01 Jul, 2024 12 commits
-
-
Josh authored
fix: trim spaces around the FROM argument, but don't trim inside quotes
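A minimal sketch of that quoting rule (hypothetical helper, not the actual parser code): trim around the whole argument first, then strip one matching pair of surrounding quotes so the inner spacing survives.

```go
package main

import (
	"fmt"
	"strings"
)

// trimFromArg is a hypothetical sketch of the fix: trim whitespace
// around the whole FROM argument first, then strip one matching pair
// of surrounding quotes, so spaces *inside* the quotes survive.
func trimFromArg(s string) string {
	s = strings.TrimSpace(s)
	if len(s) >= 2 && (s[0] == '"' || s[0] == '\'') && s[len(s)-1] == s[0] {
		return s[1 : len(s)-1] // inner spacing kept verbatim
	}
	return s
}

func main() {
	fmt.Printf("%q\n", trimFromArg(`  "./models/my model.gguf"  `)) // "./models/my model.gguf"
	fmt.Printf("%q\n", trimFromArg("  llama3  "))                   // "llama3"
}
```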
-
Josh authored
fix: add unsupported architecture message for linux/windows
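A rough sketch of such a guard (the message wording and placement are assumptions, not the actual change):

```go
package main

import (
	"fmt"
	"runtime"
)

// checkArch is a hypothetical sketch: fail fast with a clear message
// on architectures the linux/windows builds don't ship for.
func checkArch() error {
	if (runtime.GOOS == "linux" || runtime.GOOS == "windows") &&
		runtime.GOARCH != "amd64" && runtime.GOARCH != "arm64" {
		return fmt.Errorf("%s/%s is not supported", runtime.GOOS, runtime.GOARCH)
	}
	return nil
}

func main() {
	if err := checkArch(); err != nil {
		fmt.Println(err)
	}
}
```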
-
Josh Yan authored
-
Josh Yan authored
-
Daniel Hiltgen authored
Fix case for NumCtx
-
Daniel Hiltgen authored
Document concurrent behavior and settings
-
Daniel Hiltgen authored
This may confuse users into thinking "auto" is an acceptable string; it must be numeric.
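A hedged sketch of the numeric-only convention being documented (the zero-means-automatic behavior here is an assumption):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseNumeric is a hypothetical sketch: settings like
// OLLAMA_NUM_PARALLEL accept only integers, where 0 (or unset) means
// "choose automatically"; a string such as "auto" is rejected.
func parseNumeric(name string) (int, error) {
	v := os.Getenv(name)
	if v == "" {
		return 0, nil // unset: automatic
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("%s must be a non-negative integer, got %q", name, v)
	}
	return n, nil
}

func main() {
	n, err := parseNumeric("OLLAMA_NUM_PARALLEL")
	fmt.Println(n, err)
}
```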
-
Daniel Hiltgen authored
-
Josh Yan authored
-
Daniel Hiltgen authored
Enable concurrency by default
-
RAPID ARCHITECT authored
* Update README.md: add Mesop example to web & desktop
* Update README.md

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
-
Eduard authored
Runs fine on an NVIDIA GeForce GTX 1050 Ti
-
- 29 Jun, 2024 2 commits
-
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
* Do not shift context for sliding window models
* Truncate prompts that exceed 2/3 of the context tokens
* Only target gemma2
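A sketch of the truncation idea, assuming "exceed 2/3 of the context tokens" means clamping over-long prompts to two-thirds of the context window while keeping the most recent tokens:

```go
package runner

// truncatePrompt is a hypothetical sketch: for sliding-window models
// (here, gemma2) the KV cache is not shifted; an over-long prompt is
// clamped to 2/3 of the context window, keeping the most recent tokens.
func truncatePrompt(tokens []int, numCtx int) []int {
	limit := numCtx * 2 / 3
	if len(tokens) <= limit {
		return tokens
	}
	return tokens[len(tokens)-limit:]
}
```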
-
- 28 Jun, 2024 4 commits
-
-
Daniel Hiltgen authored
-
royjhan authored
-
royjhan authored
* Check that projtype exists
* Maintain ordering
-
royjhan authored
-
- 27 Jun, 2024 7 commits
-
-
Michael Yang authored
gemma2 graph
-
Michael Yang authored
-
Josh Yan authored
-
Josh Yan authored
-
Michael authored
* Update README for Gemma 2
-
Michael Yang authored
-
Jeffrey Morgan authored
-
- 25 Jun, 2024 2 commits
-
-
Blake Mizerany authored
Previously, some costly things were causing the loading of GGUF files, their metadata, and tensor information to be very slow:

* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in an unacceptable amount of syscalls/disk I/O

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3.

This commit also makes it possible to skip collecting large arrays of values when decoding GGUFs, if desired. When such keys are encountered, their values are null and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
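A minimal sketch of the two I/O fixes (illustrative code, not the actual decoder): one buffered reader amortizes syscalls, and one reusable scratch buffer avoids a fresh allocation per decoded string.

```go
package gguf

import (
	"bufio"
	"encoding/binary"
	"io"
	"os"
)

// readStrings decodes n length-prefixed strings. A single buffered
// reader means each key/value read no longer hits the disk, and a
// reusable scratch buffer avoids allocating per string.
func readStrings(f *os.File, n int) ([]string, error) {
	r := bufio.NewReaderSize(f, 1<<20)
	scratch := make([]byte, 4096)
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		var length uint64
		if err := binary.Read(r, binary.LittleEndian, &length); err != nil {
			return nil, err
		}
		if uint64(len(scratch)) < length {
			scratch = make([]byte, length)
		}
		buf := scratch[:length]
		if _, err := io.ReadFull(r, buf); err != nil {
			return nil, err
		}
		out = append(out, string(buf)) // the only per-string allocation
	}
	return out, nil
}
```
-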
Blake Mizerany authored
This commit changes the 'ollama run' command to defer fetching model information until it is actually needed, i.e. when in interactive mode. It also removes one case where the model information was fetched in duplicate: just before calling generateInteractive, and then again, first thing, inside generateInteractive.

This positively impacts the performance of the command:

; time ./before run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.168 total

; time ./before run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.220 total

; time ./before run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.217 total

; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi'  0.02s user 0.01s system 4% cpu 0.652 total

; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi'  0.01s user 0.01s system 5% cpu 0.498 total

; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?
./after run llama3 'hi'  0.01s user 0.01s system 3% cpu 0.479 total

; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total

; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
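A sketch of the deferral pattern (stand-in types and names; not the actual CLI code):

```go
package main

import (
	"fmt"
	"sync"
)

// modelInfo stands in for the data returned by the show API.
type modelInfo struct{ Family string }

func fetchModelInfo(name string) (*modelInfo, error) {
	fmt.Println("fetching model info for", name) // the slow call
	return &modelInfo{Family: "llama"}, nil
}

// lazyInfo defers the fetch until first use and caches the result, so
// a one-shot `run` never pays for it and interactive mode pays once.
func lazyInfo(name string) func() (*modelInfo, error) {
	var (
		once sync.Once
		mi   *modelInfo
		err  error
	)
	return func() (*modelInfo, error) {
		once.Do(func() { mi, err = fetchModelInfo(name) })
		return mi, err
	}
}

func main() {
	getInfo := lazyInfo("llama3")
	interactive := true
	if interactive {
		mi, _ := getInfo() // first and only fetch happens here
		fmt.Println("family:", mi.Family)
	}
}
```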
-
- 21 Jun, 2024 8 commits
-
-
Daniel Hiltgen authored
Fix use_mmap parsing for modelfiles
-
Daniel Hiltgen authored
Provide consistent ordering for the ps command - longest duration listed first
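For illustration, a sort like the following (hypothetical types) yields that deterministic order:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// procModel is a stand-in for an entry in the `ollama ps` listing.
type procModel struct {
	Name  string
	Until time.Duration // how much longer the model stays loaded
}

func main() {
	models := []procModel{
		{"llama3", 2 * time.Minute},
		{"gemma2", 4 * time.Minute},
	}
	// Longest duration first, so the listing order is stable.
	sort.Slice(models, func(i, j int) bool {
		return models[i].Until > models[j].Until
	})
	for _, m := range models {
		fmt.Printf("%-8s %v\n", m.Name, m.Until)
	}
}
```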
-
royjhan authored
-
Daniel Hiltgen authored
Until ROCm v6.2 ships, we won't be able to get accurate free memory reporting on Windows, which makes automatic concurrency too risky. Users can still opt in, but will need to pay attention to model sizes; otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs have accurate VRAM reporting wired up now, so we can turn on concurrency by default.
-
Daniel Hiltgen authored
This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these via the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
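As a sketch of that selection (the cap of 4 and the fits predicate are assumptions, not the actual scheduler code):

```go
package sched

// pickParallel is a hypothetical sketch: walk down from a default cap
// and return the largest parallel value whose effective context
// (parallel * num_ctx) still fits the model in free VRAM; 1 is the floor.
func pickParallel(numCtx int, fits func(effectiveCtx int) bool) int {
	for p := 4; p > 1; p-- { // assumed default cap of 4
		if fits(p * numCtx) {
			return p
		}
	}
	return 1
}
```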
-
Michael Yang authored
fix: quantization with template
-
Michael Yang authored
-
Daniel Hiltgen authored
Add the new tristate parsing logic to the modelfile code path, along with a unit test.
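For illustration, a tristate is typically modeled as a *bool so that unset (nil) stays distinct from an explicit true/false; a hedged sketch, not the actual parser:

```go
package modelfile

import (
	"fmt"
	"strings"
)

// parseTristate is a hypothetical sketch: parameters like use_mmap
// have three states, where unset (nil) lets the runtime decide.
func parseTristate(v string) (*bool, error) {
	switch strings.ToLower(strings.TrimSpace(v)) {
	case "":
		return nil, nil // unset
	case "true", "1":
		t := true
		return &t, nil
	case "false", "0":
		f := false
		return &f, nil
	default:
		return nil, fmt.Errorf("invalid boolean value %q", v)
	}
}
```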
-
- 20 Jun, 2024 1 commit
-
-
Daniel Hiltgen authored
Refine mmap default logic on Linux
-