    cuda: leverage JIT for smaller footprint (#11635) · dc5a6454
    Daniel Hiltgen authored
    Prior to this change, our official binaries contained both JIT PTX code and
    the cubin binary code for our chosen compute capabilities. This change
    switches to compiling only the PTX code and relying on JIT at runtime to
    generate the cubin specific to the user's GPU. The cubins are cached
    on the user's system, so users should only see a small lag on the very
    first model load for a given Ollama release. This also adds the first
    generation of Blackwell GPUs so they aren't reliant on the Hopper PTX.
    
    This change reduces ggml-cuda.dll from 1.2 GB to 460 MB.
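
As a rough sketch of the PTX-only approach described in the message above, CMake lets each entry in `CMAKE_CUDA_ARCHITECTURES` carry a `-virtual` suffix so that only PTX is generated for that compute capability. The architecture list here is an assumption chosen for illustration; it is not the list this commit sets in CMakePresets.json.

```cmake
# Illustrative only: this architecture list is an assumption,
# not the set chosen in this commit.
# The "-virtual" suffix asks CMake to generate PTX only (no cubin) for
# that compute capability; the CUDA driver JIT-compiles the PTX into a
# GPU-specific cubin at runtime.
set(CMAKE_CUDA_ARCHITECTURES "75-virtual;80-virtual;90-virtual")

# For comparison: a bare number such as "80" generates both the cubin
# and the PTX, and "80-real" generates only the cubin.
```

The trade-off is the one the message states: a smaller binary in exchange for a one-time JIT compile on first load, after which the driver reuses its cached cubins.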
CMakePresets.json 2.35 KB