    IBM granite/granitemoe architecture support (#6760) · f2890a44
    Gabe Goodhart authored
    * fix(ext_server): Port llama.cpp sampling refactors to ext_server
    
    This was a fairly large changeset. I closely followed the changes here:
    https://github.com/ggerganov/llama.cpp/commit/df270ef74596da8f1178f08991f4c51f18c9ee82
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * feat: Bump llama.cpp to the latest master with `granite` support
    
    This does not yet include granite MoE support, but that can come in a
    follow-up PR.
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(solar): Update solar patch for llama.cpp bump
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * feat(llama.cpp): Bump llama.cpp for granitemoe support
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(solar): Update the solar-pro patch for latest llama.cpp bump
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * feat(llama.cpp): Bump to the latest master of llama.cpp
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(patches): Update all patches for latest bump
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * feat(llama): Always run sync.sh from the right directory
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llama/patches): Update llama patches
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * feat(llama)!: Rough sync with llama.cpp submodule
    
    There are a number of changes that will need to be propagated to llama.go
    before any of this works!
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llama/patches): Add a patch and update for missing ggml-impl.h include
    
    This include is where the ggml_cgraph struct is defined. Many of the
    .c files include it to complete the forward declaration in ggml.h. It
    seems that with the subset of code included here, the include was
    somehow lost (or out of order) when building, so adding this include
    to llama.cpp fixes the missing definition.
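
    A minimal sketch of the C pattern involved (simplified stand-ins, not
    the actual ggml headers): a forward declaration is enough to pass
    pointers around, but any file that dereferences the struct needs the
    full definition in scope.

        // Public header (like ggml.h): forward declaration only.
        struct cgraph;                  // incomplete type
        void walk(struct cgraph *g);    // pointers to it are fine

        // Internal header (like ggml-impl.h): the full definition.
        struct cgraph {
            int n_nodes;
        };

        // A translation unit that touches members must see the full
        // definition; with only the forward declaration this fails to
        // compile with an "incomplete type" error, which is what the
        // missing include produced.
        void walk(struct cgraph *g) {
            (void)g->n_nodes;
        }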
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llama): Add missing log.cpp
    
    This was added as part of the logging overhaul done in llama.cpp.
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llama): Overhaul use of sampling module for llama.cpp changes
    
    The changes here reflect the changes made in the big llama.cpp sampling PR
    https://github.com/ggerganov/llama.cpp/pull/9294
    
    The sampling functionality is now split into the base interface
    (llama_sampler) and the generation implementation (gpt_sampler), and
    the changes here reflect that. Since the sampling.h/sampling.cpp code
    uses C++ STL headers, the sampling_ext.[h|cpp] wrapper is maintained
    so that Go can access a pure-C interface.
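
    The wrapping pattern, as a hedged sketch (hypothetical names; the
    real sampling_ext API differs): the C++ sampler is hidden behind an
    opaque handle and exported with C linkage so that cgo can call it
    without ever seeing an STL type.

        // bridge.h -- pure-C surface, safe to #include from cgo.
        #ifdef __cplusplus
        extern "C" {
        #endif
        typedef struct bridge_sampler bridge_sampler;  // opaque to Go
        bridge_sampler *bridge_sampler_init(float temperature, unsigned seed);
        void bridge_sampler_free(bridge_sampler *s);
        #ifdef __cplusplus
        }
        #endif

        // bridge.cpp -- free to use C++/STL internally; none of it
        // leaks through the C header above.
        #include <random>

        struct bridge_sampler {
            float temperature;
            std::mt19937 rng;  // STL member, invisible to callers
        };

        extern "C" bridge_sampler *bridge_sampler_init(float temperature,
                                                       unsigned seed) {
            return new bridge_sampler{temperature, std::mt19937(seed)};
        }

        extern "C" void bridge_sampler_free(bridge_sampler *s) {
            delete s;
        }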
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llama): Fix the impl of SampleTokenGreedy for new sampling
    
    I don't think this method is currently used, so it could probably just be
    removed so that all sampling goes through the GPT interface, but in the
    interest of doing no harm, this should keep the method working as expected.
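
    For reference, greedy sampling reduces to an argmax over the logits;
    a minimal sketch of the behavior being preserved here (not the actual
    ollama code):

        #include <algorithm>
        #include <cstddef>

        // Greedy decoding: always pick the token with the highest logit.
        int sample_token_greedy(const float *logits, std::size_t n_vocab) {
            return static_cast<int>(
                std::max_element(logits, logits + n_vocab) - logits);
        }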
    
    Branch: IBMGraniteArchitectureSupport
    
    * fix(llama): Remove unused SampleTokenGreedy
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(sync): Remove bash-specific change to sync.sh
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * chore(gofumpt): Format on llama.go to pass linting
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llm): Fix missing <thread> include in ext_server
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llama): Remove TODO about grammar_first
    
    This feature was not used or needed previously, so it should be fine
    without plumbing it through now.
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llama): Better naming for sampling wrapper and args
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llama): Fix patch 05 to use new wrapper api and re-sync
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * runner: Flush pending responses before returning
    
    If there are any pending responses (such as from potential stop
    tokens), we should send them back before ending the sequence.
    Otherwise, we can end up missing tokens at the end of a response.
    
    Fixes #6707
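
    As a hedged C++ sketch of the pattern (the actual runner code
    differs): output is held back while it could still be the prefix of a
    stop token, so whatever is still held must be emitted before the
    sequence ends.

        #include <functional>
        #include <string>

        struct Sequence {
            // Text withheld because it may be the start of a stop token.
            std::string pending;
            std::function<void(const std::string &)> send;

            void finish() {
                // The fix: flush any partial match before ending the
                // sequence; otherwise the tail of the response is lost.
                if (!pending.empty()) {
                    send(pending);
                    pending.clear();
                }
                // ...then signal end-of-sequence to the client.
            }
        };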
    
    * fix(llama/sampling): Use gpt_sampler with a forward declaration
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llama): Remove unnecessary patch for gguf impl header
    
    This was caused by an earlier mistake in the embeddings patch that
    dereferenced the pointer instead of using the wrapper API.
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    * fix(llm): Remove use of deprecated --log-disable flag
    
    Branch: IBMGraniteArchitectureSupport
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    
    ---------
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>