feat: add models; fix bugs

fa8ce810 · chenpangpang · 9a3eef98
Commit fa8ce810 authored Aug 09, 2024 by chenpangpang
20 changed files
--- a/llama.cpp/examples/cvector-generator/CMakeLists.txt
+++ b/llama.cpp/examples/cvector-generator/CMakeLists.txt
-set(TARGET llama-cvector-generator)
-add_executable(${TARGET} cvector-generator.cpp pca.hpp)
-install(TARGETS ${TARGET} RUNTIME)
-target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
-target_compile_features(${TARGET} PRIVATE cxx_std_11)
--- a/llama.cpp/examples/cvector-generator/README.md
+++ b/llama.cpp/examples/cvector-generator/README.md
-# cvector-generator
-
-This example demonstrates how to generate a control vector using gguf models.
-
-Related PRs:
- [Add support for control vectors](https://github.com/ggerganov/llama.cpp/pull/5970)
- (Issue) [Generate control vector using llama.cpp](https://github.com/ggerganov/llama.cpp/issues/6880)
- [Add cvector-generator example](https://github.com/ggerganov/llama.cpp/pull/7514)
-
-## Examples
-
-```sh
-# CPU only
-./cvector-generator -m ./llama-3.Q4_K_M.gguf
-
-# With GPU
-./cvector-generator -m ./llama-3.Q4_K_M.gguf -ngl 99
-
-# With advanced options
-./cvector-generator -m ./llama-3.Q4_K_M.gguf -ngl 99 --pca-iter 2000 --pca-batch 100
-
-# Using mean value instead of PCA
-./cvector-generator -m ./llama-3.Q4_K_M.gguf --method mean
-
-# To see help message
-./cvector-generator -h
-# Then, have a look at "cvector" section
-```
-
-## Tips and tricks
-
-If you have multiple lines per prompt, you can escape the newline character (change it to `\n`). For example:
-
-```
-<|im_start|>system\nAct like a person who is extremely happy.<|im_end|>
-<|im_start|>system\nYou are in a very good mood today<|im_end|>
-```
-
-Example to use output file with `llama-cli`:
-
-(Tips: The control vector works better when apply to layers higher than 10)
-
-```sh
-./llama-cli -m ./llama-3.Q4_K_M.gguf -p "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSing a song<|im_end|><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" --special --control-vector-scaled ./control_vector.gguf 0.8 --control-vector-layer-range 10 31
-```
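
Note on the removed README's prompt-file tip: each prompt occupies one line, and a literal `\n` stands in for a newline. The sketch below shows one way that convention could be read back. It is not the removed implementation, which calls `string_process_escapes` from llama.cpp's `common` library; the helper names `unescape_newlines` and `load_prompts` are hypothetical, and only the `\n` escape is handled.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: replace the two-character sequence '\' 'n' with a real
// newline. Any other backslash sequence is copied through unchanged.
static std::string unescape_newlines(const std::string & in) {
    std::string out;
    out.reserve(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == 'n') {
            out.push_back('\n');
            ++i; // skip the 'n' that was just consumed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Hypothetical helper: read one prompt per line, skipping empty lines,
// and expand the escaped newlines in each prompt.
static std::vector<std::string> load_prompts(std::istream & input) {
    std::vector<std::string> prompts;
    std::string line;
    while (std::getline(input, line)) {
        if (line.empty()) {
            continue;
        }
        prompts.push_back(unescape_newlines(line));
    }
    return prompts;
}
```

Taking a `std::istream` rather than a file path keeps the sketch easy to exercise with a `std::istringstream`; a real loader would open the file and report a failure to open it, as the removed `ctrlvec_load_prompt_file` does.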
--- a/llama.cpp/examples/cvector-generator/completions.txt
+++ b/llama.cpp/examples/cvector-generator/completions.txt
-
-That game
-I can see
-Hmm, this
-I can relate to
-Who is
-I understand the
-Ugh,
-What the hell was
-Hey, did anyone
-Although
-Thank you for choosing
-What are you
-Oh w
-How dare you open
-It was my pleasure
-I'm hon
-I appreciate that you
-Are you k
-Whoever left this
-It's always
-Ew,
-Hey, I l
-Hello? Is someone
-I understand that
-That poem
-Aww, poor
-Hey, it
-Alright, who
-I didn't
-Well, life
-The document
-Oh no, this
-I'm concerned
-Hello, this is
-This art
-Hmm, this drink
-Hi there!
-It seems
-Is
-Good
-I can't
-Ex
-Who are
-I can see that
-Wow,
-Today is a
-Hey friend
-Sometimes friends
-Oh, this old
-The weather outside
-This place is sur
-I appreciate your input
-Thank you for the
-Look at
-I'm disappoint
-To my
-How dare you
-That's an
-This piece of art
-Eww
-This park is
-This is incredible
-Oh no, someone
-Exc
-Well, it'
-I warned
-Hey, I understand
-Hey, I saw
-How dare you go
-What the he
-Hey
-It's
-Hello? Hello?
-It
-Oh no!
-This is the perfect
-Good morning,
-Oh no, there
-It's so
-Yeah
-Uh,
-Hello everyone
-Who turned off
-The weather
-Who'
-Hey, this
-Wait,
-Eww, gross
-Excuse
-It seems like you
-Thank you so
-What happened?
-Oh my g
-I am deeply sad
-I war
-Okay, let'
-Hey, that
-That was a beautiful
-Oh no! That
-What happened
-Hey there
-The artist'
-What?!
-Hey, it'
-I am disappoint
-It seems like
-Oh no! The
-This park is a
-If you
-Yes! I did
-It sounds
-What
-Who is it
-Hmm, that
-That's strange
-Yeah, that was
-That's interesting
-This park
-What the hell
-Who is that
-I feel like my
-Oh well
-What the hell is
-Hello? Hello
-To my dearest
-Bless you!\"
-Thank you for
-Oh, looks like
-Can you please
-This place is
-Eww, what
-Bless you
-Is everything
-Hey, I just
-Whoever left these
-Well, that'
-I feel
-Hey, do you
-It's sad
-Oh no, it
-Hey, that'
-Oh my god,
-Thank you,
-Hello little one,
-I apolog
-Hey team, I
-How dare you read
-Who is this and
-Whoever left
-Hi there! W
-A
-If you have
-I was
-U
-Bless
-Well, this
-Oh, I'
-It's a
-Eww,
-Is everything okay?
-Oh, I
-Hello, can you
-Al
-That was a great
-What are
-I understand that not
-Oh no, not
-Who is it?\"
-Hey, can we
-Whoever is taking
-I would love to
-Hey, I noticed
-Hey, could
-I understand that there
-Hello?
-D
-Oh man, I
-Thank you so much
-Oh no, my
-Dear [Name
-Uh
-I remember
-Hey, who
-Well, it
-Are you
-I understand that it
-Hey, is
-I would
-Who is this
-Excuse me
-Alright
-I am thrilled
-Sometimes friends have
-Who the
-It's interesting
-I would love
-E
-Hello? Is anyone
-Well, this is
-This place
-Well,
-I warned you
-Hey, watch where
-Oh my
-That'
-Sometimes friends have different
-I understand that everyone
-What?
-What do these notes
-I can relate
-I'm not
-I understand
-To my dear
-Guys
-Well
-Hey, I appreciate
-Wow, what
-Dear
-That melody
-Who the hell
-Today is
-Hello little
-Wow, look
-That's great
-Love is never wrong
-I'm having
-Whoa, did
-Ugh
-Can you please provide
-I miss you,
-I feel uncom
-I know
-Ugh, this
-Hey, watch
-Oh great, a
-I didn
-Okay
-That game of char
-Oh
-I appreciate
-Who's there
-I am so
-Oh great, someone
-Hey, could you
-I remember wondering
-Wait, what?
-What do
-Hello? Can
-Hey there,
-That game of
-This is incred
-Oh my gosh
-Oh great, f
-I appreciate your
-It sounds like
-What the heck
-Okay, I understand
-Ew
-I understand that this
-Uh, hi
-Hi everyone!
-What the hell?
-Thank you for your
-Oh no, the
-Wow, I
-Who turned
-Dear [
-Whoever
-This is a
-Whoa, he
-What in the world
-Although the physical
-Hello, who is
-That's amaz
-Hey, I know
-Okay, that
-Hi everyone
-Hey, is everything
-I understand your fr
-Oh no, poor
-Oh, look
-Good morning
-Ew, gross
-Oh no, did
-Look at the family
-Hey team
-Yes!
-Hey, can I
-Okay, that'
-It's great
-Love is
-Hey, what
-Good morning, world
-Who is it?
-That poem really reson
-I
-That's
-I understand the task
-Gu
-Hello? Who'
-This postcard is
-Whoa,
-Oh, that
-I understand that I
-Whoever is
-Hello? Who is
-I'm really
-Wow, this
-Can
-This artwork really
-This is a shame
-I miss you too
-Who are you?
-Today is a difficult
-Hey, just
-Are you okay
-I am
-Hi,
-Wow, that
-Hey there! Can
-Okay, stay
-Oh great, just
-Yeah,
-Hello? Can you
-Oh, looks
-Thank you for sharing
-I'm glad
-Hey, is that
-Hmm
-It was my
-It sounds like you
-Wow, your
-I was promised certain
-That was such a
-Thank
-Excuse you
-That was
-Hey team,
-I feel un
-It was
-What'
-Hey friend, I
-How
-Saying goodbye
-That
-It's heart
-How dare
-Oh,
-Hello, may
-What's this
-Thank you for recogn
-Aww, that
-Oh, I remember
-Hmm, that'
-I miss
-I know this
-Wait
-Is everything okay
-Who is that person
-Wow, you
-Oh great
-I'm sad
-Wow, the
-I am very disappoint
-Who turned off the
-I understand that things
-I'm very
-Hi
-That's very
-Okay, I
-Oh no,
-Wow, there
-What's wrong
-I apologize for
-Hey, I
-Can I help you
-Oh, I didn
-Alright,
-Oh wow,
-Oh my goodness
-I know this event
-What in the
-Saying
-Yeah, that
-Guys, I
-Hey, this v
-This post
-Are
-Hey, can
-Hello? Is
-I can only imagine
-Oh, that sounds
-Hey, is anyone
-I am disappointed
-Hello,
-Hey everyone, I
-That was such
-It's okay
-The artist
-Whoa
-I understand that mistakes
-Can I help
-Who
-Hi everyone! I
-Hey, can you
-Wow, how
-Today
-Oh no, I
-Oh well, I
-Well, that
-This is the
-Yes! I finally
-Hey there little
-Hello everyone!
-Love is never
-Look at the
-This postcard
-Oh great,
-Can I
-Hmm, this is
-I understand your
-Oh, look at
-B
-I'm so
-Whoa, this
-W
-Oh, this
-Sometimes
-This piece of
-What the
-That was a
-Hey, do
-Oh no
-Whoa, what
-I feel like I
-The documentary
-Hello
-Hello little one
-I understand that my
-Eww, that
-Wow, an
-Yes! Finally,
-Although the physical location
-Whoever is watching
-That movie
-I remember wondering about
-Hey there, little
-Who's
-Hello, who
-Hello everyone! Thank
-Hello, can
-That's too
-Hey, just wanted
-Hey there, I
-Saying good
-Hey there!
-Who is there?
-Oh my good
-I am very
-Oh no, what
-Wow, thank
-I was promised
-Hi, is
-Hey, I'
-Guys, the
-Oh no, that
-Who is there
-Hello, this
-That movie really touched
-If you have something
-The documentary was
-I'm starting
-Are you kidd
-That movie really
-Hey everyone,
-Thank you for considering
-I didn'
-Yes! I
-Can you
-Oh my god
-Hey, whoever
-That melody really
-Thank you, little
-Hello, may I
-Look
-Wow, we
-It looks
-What do these
-Oh wow
-I apologize
-What are you all
-It's such
-It's clear
-Hey, I was
-Hey friend,
-I can only
-The weather outside is
-Eww, this
-I miss you
-Wow
-Aww,
-Hi, is there
-This artwork
-Okay,
-Oh well,
-This
-I'
-Say
-Hey there little gu
-Hmm,
-Whoa, who
-I am thr
-Oh man
-Okay, stay calm
-I'm happy
-Oh, this cur
-Oh man,
-I'm sorry
-Hello? Who
-What?! That
-This piece
-Hey everyone
-That's so
-Are you okay?
-What happened? Where
-Hi there
-The
-Who the hell entered
-I can
-Guys,
-What's
-What in
-It's important
-I'm
-I'm coming
-It'
-Yes! Finally
-Wait, what
-Wow, reading
-I'm surprised
-Hey, did
-Hey,
-Okay, let
-I understand that you
-Who the hell threw
-Eww, who
-Thank you for thinking
-Who is this?\"
-I am deeply
-Thank you for including
-Oh no, an
-It looks like you
-Aww
-I'm confused
-Wow, it
-That poem really
-Yes
-Hey there, is
-Hey, what'
-Thank you for remember
-To
-This is
-Thank you for making
-I can'
-That mel
-Wow, they
-I feel like
-Although the
-Who are you
-Love
-If
-What the hell are
-I am so sad
-Oh, I found
-Thank you
-It looks like
-Well, life is
-I appreciate that
-The artist's
-Whoa, that
-It's never
\ No newline at end of file
--- a/llama.cpp/examples/cvector-generator/cvector-generator.cpp
+++ b/llama.cpp/examples/cvector-generator/cvector-generator.cpp
-#include "common.h"
-#include "llama.h"
-#include "ggml.h"
-#include "pca.hpp"
-#include "mean.hpp"
-
-#ifdef GGML_USE_CUDA
-#include "ggml-cuda.h"
-#endif
-
-#ifdef GGML_USE_METAL
-#include "ggml-metal.h"
-#endif
-
-#include <cstdio>
-#include <string>
-#include <tuple>
-#include <vector>
-#include <algorithm>
-#include <iostream>
-#include <fstream>
-#include <climits>
-
-
-//////////////////////////////////////////////////
-// utils
-
-template <class Iter>
-static std::string tokens_to_str(llama_context * ctx, Iter begin, Iter end) {
-    std::string ret;
-    for (; begin != end; ++begin) {
-        ret += llama_token_to_piece(ctx, *begin);
-    }
-
-    return ret;
-}
-
-static void print_usage(int argc, char ** argv, const gpt_params & params) {
-    gpt_params_print_usage(argc, argv, params);
-
-    printf("\nexample usage:\n");
-    printf("\n    CPU only:   %s -m ./llama-3.Q4_K_M.gguf\n", argv[0]);
-    printf("\n    with GPU:   %s -m ./llama-3.Q4_K_M.gguf -ngl 99\n", argv[0]);
-    printf("\n    advanced:   %s -m ./llama-3.Q4_K_M.gguf -ngl 99 --pca-iter 2000 --pca-batch 100\n", argv[0]);
-    printf("\n    using mean: %s -m ./llama-3.Q4_K_M.gguf --method mean\n", argv[0]);
-    printf("\n");
-}
-
-//////////////////////////////////////////////////
-
-
-// cb_eval is reused for each pair of positive - negative prompt
-struct callback_data {
-    ggml_context * ctx_ggml = nullptr;   // holds v_pos, v_neg, v_diff_filtered
-
-    int n_layers = 0;
-    int n_tokens = 0;
-    bool is_eval_pos = true;
-
-    // each element of the vector correspond to one layer
-    std::vector<struct ggml_tensor *> v_pos; // vector of matrices of size [n_embd, n_tokens]
-    std::vector<struct ggml_tensor *> v_neg; // vector of matrices of size [n_embd, n_tokens]
-    std::vector<struct ggml_tensor *> v_diff_filtered;   // vector of matrices of size [n_embd, n_nonzero_rows]. NOTE: n_nonzero_rows maybe different for each layer
-
-    // save a tensor into either v_pos or v_neg (decided by is_eval_pos)
-    void save_tensor_for_layer(struct ggml_tensor * t) {
-        GGML_ASSERT(t->type == GGML_TYPE_F32);
-
-        if (ctx_ggml == nullptr) {
-            // alloc a new ctx_ggml if needed
-            struct ggml_init_params params_ggml = {
-                /*.mem_size   =*/ ggml_tensor_overhead() * n_layers * 3u,
-                /*.mem_buffer =*/ NULL,
-                /*.no_alloc   =*/ true,
-            };
-            ctx_ggml = ggml_init(params_ggml);
-        }
-
-        // copy tensor data
-        auto n_bytes = ggml_nbytes(t);
-        struct ggml_tensor * t_layer = ggml_new_tensor_2d(ctx_ggml, t->type, t->ne[0], t->ne[1]);
-        t_layer->data = malloc(n_bytes); // TODO @ngxson : get rid of this malloc somehow
-        ggml_backend_tensor_get(t, t_layer->data, 0, n_bytes);
-        ggml_set_name(t_layer, ggml_get_name(t));
-        //print_debug_tensor(t_layer);
-
-        if (is_eval_pos) {
-            v_pos.push_back(t_layer);
-        } else {
-            v_neg.push_back(t_layer);
-        }
-    }
-
-    // calculate diff (v_pos - v_neg) and place the result back to v_pos
-    // all zero rows in the diff tensor will also be removed
-    // NOTE: final layer is ignored. we only have (n_layers - 1) to process
-    std::vector<struct ggml_tensor *> calc_diff() {
-        for (float il = 0; il < v_pos.size(); il++) {
-            float * a = (float *) v_pos[il]->data;
-            float * b = (float *) v_neg[il]->data;
-            size_t n_elem = ggml_nelements(v_pos[il]);
-            for (size_t j = 0; j < n_elem; j++) {
-                a[j] -= b[j];
-            }
-            //print_debug_tensor(v_pos[i]);
-            auto diff_filtered = filter_nonzero_rows(v_pos[il]);
-            v_diff_filtered.push_back(diff_filtered);
-        }
-        return v_diff_filtered; // for convinient, we return the result std::vector
-    }
-
-    // delete zero rows from a given 2D tensor
-    struct ggml_tensor * filter_nonzero_rows(struct ggml_tensor * a) {
-        //printf("filter_nonzero_rows\n");
-        auto is_row_all_zeros = [](struct ggml_tensor * t, int row, float eps) -> bool {
-            // check if given row containing all zero elements
-            int n_cols = t->ne[0]; // hint: should be equal to n_embd
-            for (int col = 0; col < n_cols; ++col) {
-                if (ggml_get_f32_nd(t, col, row, 0, 0) > eps) {
-                    return false;
-                }
-            }
-            return true;
-        };
-        std::vector<int> rows_to_copy; // the idx of non-zero cols (to be copied to row of diff_filtered)
-        for (int i_row = 0; i_row < a->ne[1]; i_row++) {
-            if (!is_row_all_zeros(a, i_row, 1e-6)) {
-                rows_to_copy.push_back(i_row);
-            }
-        }
-
-        // get "n_nonzero_rows" for the output "diff_filtered"
-        int n_nonzero_rows = rows_to_copy.size();
-        //printf("n_nonzero_rows: %d\n", n_nonzero_rows);
-        int n_embd = a->ne[0];
-        GGML_ASSERT(n_nonzero_rows > 0);
-
-        // diff_filtered: [n_embd, n_nonzero_rows]
-        struct ggml_tensor * diff_filtered = ggml_new_tensor_2d(
-            ctx_ggml, GGML_TYPE_F32, n_embd, n_nonzero_rows);
-        ggml_format_name(diff_filtered, "diff_filtered_%s", a->name);
-        diff_filtered->data = malloc(ggml_nbytes(diff_filtered));
-
-        // copy non-zero rows
-        for (int dest_row = 0; dest_row < n_nonzero_rows; dest_row++) {
-            int src_row = rows_to_copy[dest_row];
-            for (int i = 0; i < n_embd; i++) {
-                float src_elem = ggml_get_f32_nd(a, i, src_row, 0, 0);
-                ggml_set_f32_nd(diff_filtered, i, dest_row, 0, 0, src_elem);
-            }
-        }
-
-        //print_debug_tensor(diff_filtered);
-
-        return diff_filtered;
-    }
-
-    // we don't implement destructor, because we want to reuse callback_data. we just want to free the tensors
-    void reset() {
-        for (auto ptr : v_pos) free(ptr->data);
-        for (auto ptr : v_neg) free(ptr->data);
-        for (auto ptr : v_diff_filtered) free(ptr->data);
-        v_pos.clear();
-        v_neg.clear();
-        v_diff_filtered.clear();
-        if (ctx_ggml) {
-            ggml_free(ctx_ggml);
-        }
-        ctx_ggml = nullptr;
-    }
-};
-
-/**
- * process_ctx is used to store the ggml context for pre-post processing the diff vectors
- * in short, input => v_diff and output => v_final
- */
-struct train_context {
-    ggml_context * ctx_ggml;
-    int n_embd;
-    int n_layers;
-
-    /* pair of prompts to be used for generating final vector */
-    std::vector<std::string> positive_entries;
-    std::vector<std::string> negative_entries;
-
-    // each element of the vector correspond to one layer
-    // NOTE: the last layer is discard. therefore, we will have (n_layers - 1) elements here
-    // NOTE (2): v_diff is transposed from v_diff_tmp
-    std::vector<struct ggml_tensor *> v_diff;  // vector of matrices of size [m, n_embd] where m ~ n_tokens * n_completions (v_diff contains no zero-rows)
-    std::vector<struct ggml_tensor *> v_final; // vector of vectors of size [n_embd] to be written to file
-
-    // to easily re-alloc when concat v_diff, we temporary store v_diff in a vector instead of a tensor
-    // v_diff_tmp will get converted unto v_diff later on
-    std::vector<std::vector<uint8_t>> v_diff_tmp;
-
-    train_context(int n_embd_, int n_layers_) {
-        n_embd = n_embd_;
-        n_layers = n_layers_;
-        struct ggml_init_params params_ggml = {
-            /*.mem_size   =*/ ggml_tensor_overhead() * (n_layers - 1) * 2u,
-            /*.mem_buffer =*/ NULL,
-            /*.no_alloc   =*/ true,
-        };
-        ctx_ggml = ggml_init(params_ggml);
-        for (int il = 0; il < n_layers - 1; il++) {
-            std::vector<uint8_t> empty;
-            v_diff_tmp.push_back(empty);
-            auto t = ggml_new_tensor_1d(ctx_ggml, GGML_TYPE_F32, n_embd);
-            t->data = malloc(ggml_nbytes(t)); // TODO: get rid of malloc if possible
-            v_final.push_back(t);
-        }
-    }
-
-    // add new rows into existing tensor in v_diff_tmp
-    void concat_diff_tmp(const std::vector<struct ggml_tensor *> & diff_filtered) {
-        GGML_ASSERT((int) diff_filtered.size() == n_layers - 1);
-        for (int il = 0; il < n_layers - 1; il++) {
-            auto t = diff_filtered[il];
-            auto & diff_tmp = v_diff_tmp[il];
-            size_t curr_size = diff_tmp.size();
-            diff_tmp.resize(curr_size + ggml_nbytes(t));
-            memcpy(diff_tmp.data() + curr_size, t->data, ggml_nbytes(t));
-        }
-    }
-
-    // build the v_diff tensors from v_diff_tmp (v_diff need to be transposed)
-    // TODO @ngxson : maybe add option NOT to transpose v_diff; will be useful for "mean" method
-    void build_v_diff(bool transpose) {
-        printf("build_v_diff\n");
-        for (int il = 0; il < n_layers - 1; il++) {
-            auto & diff_tmp = v_diff_tmp[il];
-            int n_elem = diff_tmp.size() / sizeof(float);
-            GGML_ASSERT(n_elem % n_embd == 0);
-            int n_rows = n_elem / n_embd;
-            struct ggml_tensor * diff = transpose
-                ? ggml_new_tensor_2d(ctx_ggml, GGML_TYPE_F32, n_rows, n_embd)
-                : ggml_new_tensor_2d(ctx_ggml, GGML_TYPE_F32, n_embd, n_rows);
-            ggml_set_name(diff, (std::string("diff_") + std::to_string(il)).c_str());
-            diff->data = malloc(ggml_nbytes(diff)); // TODO: get rid of this malloc if possible
-            if (transpose) {
-                // copy data & transpose
-                float * arr = (float *) diff_tmp.data();
-                for (int ir = 0; ir < n_rows; ++ir) {
-                    for (int ic = 0; ic < n_embd; ++ic) {
-                        float f = arr[ir*n_embd + ic];
-                        ggml_set_f32_nd(diff, ir, ic, 0, 0, f);
-                    }
-                }
-            } else {
-                // only copy
-                memcpy(diff->data, diff_tmp.data(), ggml_nbytes(diff));
-            }
-            v_diff.push_back(diff);
-            print_debug_tensor(diff);
-            // free memory of diff_tmp
-            diff_tmp.resize(0);
-        }
-    }
-
-    ~train_context() {
-        for (auto ptr : v_final) free(ptr->data);
-        for (auto ptr : v_diff) free(ptr->data);
-        // no need to free v_diff_tmp, since we didn't use malloc
-        ggml_free(ctx_ggml);
-    }
-};
-
-struct tokenized_prompt {
-    std::vector<llama_token> tokens_pos;
-    std::vector<llama_token> tokens_neg;
-    size_t max_seq_len;
-
-    tokenized_prompt(llama_context * ctx, std::string pos, std::string neg) {
-        const bool add_bos = llama_should_add_bos_token(llama_get_model(ctx));
-        tokens_pos = ::llama_tokenize(ctx, pos, add_bos, true);
-        tokens_neg = ::llama_tokenize(ctx, neg, add_bos, true);
-        max_seq_len = std::max(tokens_pos.size(), tokens_neg.size());
-        padding_seq(ctx, tokens_pos, max_seq_len);
-        padding_seq(ctx, tokens_neg, max_seq_len);
-    }
-
-    void padding_seq(llama_context * ctx, std::vector<llama_token> & tokens, size_t len) {
-        // TODO: customize padding token
-        std::vector<llama_token> pad_tokens = ::llama_tokenize(ctx, " ", false);
-        llama_token pad_tok = pad_tokens.back();
-        while (tokens.size() < len) {
-            tokens.push_back(pad_tok);
-        }
-    }
-};
-
-//////////////////////////////////////////////////
-
-template <typename T>
-static std::string to_string(const T & val) {
-    std::stringstream ss;
-    ss << val;
-    return ss.str();
-}
-
-static std::vector<std::string> ctrlvec_load_prompt_file(std::string path, bool skip_empty_lines) {
-    std::vector<std::string> output;
-    std::ifstream file(path);
-    if (!file.is_open()) {
-        fprintf(stderr, "error: unable to open file: %s\n", path.c_str());
-        exit(1);
-    }
-    std::string line;
-    while (std::getline(file, line)) {
-        bool is_skip = skip_empty_lines && line.empty();
-        if (!is_skip) {
-            string_process_escapes(line);
-            output.push_back(line);
-        }
-    }
-    file.close();
-    return output;
-}
-
-//////////////////////////////////////////////////
-
-static bool cb_eval(struct ggml_tensor * t, bool ask, void * user_data) {
-    auto * cb_data = (callback_data *) user_data;
-    static const char * l_out_name = "l_out";
-    const bool is_l_out = strncmp(t->name, l_out_name, strlen(l_out_name)) == 0;
-
-    if (ask) {
-        return is_l_out;
-    }
-
-    if (!is_l_out || t->ne[1] != cb_data->n_tokens) {
-        return true;
-    }
-
-    // save the tensor to current context
-    cb_data->save_tensor_for_layer(t);
-    return true;
-}
-
-static bool get_hidden_layers(llama_context * ctx, std::vector<llama_token> & tokens) {
-    llama_kv_cache_clear(ctx);
-    if (llama_decode(ctx, llama_batch_get_one(tokens.data(), tokens.size(), 0, 0))) {
-        fprintf(stderr, "%s : failed to eval\n", __func__);
-        return false;
-    }
-    return true;
-}
-
-static void export_gguf(const std::vector<struct ggml_tensor *> & v_ctrl, const std::string fname, const std::string model_hint) {
-    struct gguf_context * ctx = gguf_init_empty();
-
-    const std::string arch = "controlvector";
-    gguf_set_val_str(ctx, "general.architecture", arch.c_str());
-    gguf_set_val_str(ctx, (arch + ".model_hint").c_str(), model_hint.c_str());
-    gguf_set_val_i32(ctx, (arch + ".layer_count").c_str(), v_ctrl.size());
-
-    for (size_t i = 0; i < v_ctrl.size(); ++i) {
-        gguf_add_tensor(ctx, v_ctrl[i]);
-        print_debug_tensor(v_ctrl[i]);
-        printf("Added tensor: %s\n", v_ctrl[i]->name);
-    }
-
-    printf("%s: writing file...\n", __func__);
-    gguf_write_to_file(ctx, fname.c_str(), false);
-    printf("%s: wrote file '%s'\n", __func__, fname.c_str());
-    gguf_free(ctx);
-}
-
-/**
- * Load prompt files and completion file.
- * Then format each pair of prompt + completion to make an entry.
- */
-static int prepare_entries(gpt_params & params, train_context & ctx_train) {
-    // load prompts
-    std::vector<std::string> positive_prompts = ctrlvec_load_prompt_file(params.cvector_positive_file, true);
-    std::vector<std::string> negative_prompts = ctrlvec_load_prompt_file(params.cvector_negative_file, true);
-    if (positive_prompts.size() != negative_prompts.size()) {
-        fprintf(stderr, "number of positive and negative prompts must be equal\n");
-        return 1;
-    }
-    if (positive_prompts.empty()) {
-        fprintf(stderr, "must provide at least one prompt pair\n");
-        return 1;
-    }
-    ctx_train.positive_entries = positive_prompts;
-    ctx_train.negative_entries = negative_prompts;
-    return 0;
-}
-
-int main(int argc, char ** argv) {
-    gpt_params params;
-
-    if (!gpt_params_parse(argc, argv, params)) {
-        print_usage(argc, argv, params);
-        return 1;
-    }
-
-    if (params.n_pca_iterations % params.n_pca_batch != 0) {
-        fprintf(stderr, "PCA iterations must by multiply of PCA batch size\n");
-        return 1;
-    }
-
-
-    callback_data cb_data;
-
-    // pass the callback to the backend scheduler
-    // it will be executed for each node during the graph computation
-    params.cb_eval = cb_eval;
-    params.cb_eval_user_data = &cb_data;
-    params.warmup = false;
-
-    print_build_info();
-    llama_backend_init();
-    llama_numa_init(params.numa);
-
-    // load the model to get hparams
-    llama_init_result llama_init = llama_init_from_gpt_params(params);
-
-    llama_model * model = llama_init.model;
-    llama_context * ctx = llama_init.context;
-
-    // int n_ctx = llama_n_ctx(ctx);
-    int n_layers = llama_n_layer(model);
-    int n_embd = llama_n_embd(model);
-    // get model hint param (a.k.a model arch name)
-    char model_hint[128];
-    llama_model_meta_val_str(model, "general.architecture", model_hint, 128);
-
-    // init train_context
-    train_context ctx_train(n_embd, n_layers);
-
-    // load and prepare entries for training
-    prepare_entries(params, ctx_train);
-
-    // we have to pretokenize everything because otherwise we don't know how much overhead to allocate ctx_diffs_wrapped
-    std::vector<tokenized_prompt> tokenized_prompts;
-    size_t n_total_tokens = 0;
-    for (size_t i = 0; i < ctx_train.positive_entries.size(); ++i) {
-        tokenized_prompt t(ctx, ctx_train.positive_entries[i], ctx_train.negative_entries[i]);
-        n_total_tokens += 2 * t.max_seq_len;
-        tokenized_prompts.push_back(std::move(t));
-    }
-
-    std::cout << "n_total_tokens: " << n_total_tokens << std::endl;
-
-    for(size_t i = 0; i < ctx_train.positive_entries.size(); ++i) {
-        bool success = false;
-        tokenized_prompt t = tokenized_prompts[i];
-        cb_data.n_layers = n_layers;
-        cb_data.n_tokens = t.max_seq_len;
-
-        printf("Evaluating prompt[%d/%d]: \"%s\" - \"%s\" (%d tokens)\n",
-            (int) i+1, (int) ctx_train.positive_entries.size(),
-            tokens_to_str(ctx, t.tokens_pos.cbegin(), t.tokens_pos.cend()).c_str(),
-            tokens_to_str(ctx, t.tokens_neg.cbegin(), t.tokens_neg.cend()).c_str(),
-            (int) t.max_seq_len);
-
-        cb_data.is_eval_pos = true;
-        success = get_hidden_layers(ctx, t.tokens_pos);
-        if (!success) break;
-
-        cb_data.is_eval_pos = false;
-        success = get_hidden_layers(ctx, t.tokens_neg);
-        if (!success) break;
-
-        // calculate diff and remove all zero rows
-        auto v_diff_filtered = cb_data.calc_diff();
-
-        // save & concat the filtered v_diff to ctx_train
-        ctx_train.concat_diff_tmp(v_diff_filtered);
-
-        // reset for next iteration
-        cb_data.reset();
-    }
-
-    // done with the model, we can now free it to make gain some memory
-    printf("Done evaluate prompts, unload model...\n");
-    llama_free(ctx);
-    llama_free_model(model);
-
-    bool use_pca = params.cvector_dimre_method == DIMRE_METHOD_PCA;
-
-    // prepare ctx_train for PCA
-    ctx_train.build_v_diff(use_pca);
-
-    if (use_pca) {
-        // run PCA
-        PCA::pca_params pca_params;
-        pca_params.n_threads = params.n_threads;
-        pca_params.n_batch = params.n_pca_batch;
-        pca_params.n_iterations = params.n_pca_iterations;
-        PCA::run_pca(pca_params, ctx_train.v_diff, ctx_train.v_final);
-    } else {
-        // run mean
-        mean::run(ctx_train.v_diff, ctx_train.v_final);
-    }
-
-    // write output vectors to gguf
-    export_gguf(ctx_train.v_final, params.cvector_outfile, model_hint);
-
-    llama_backend_free();
-
-    return 0;
-}
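
Note on the removed `calc_diff` / `filter_nonzero_rows` step: for each layer, the negative activations are subtracted from the positive ones, and rows that are entirely zero are dropped before the result is concatenated. The sketch below restates that idea on plain `std::vector` data, without ggml. The `matrix` alias and the function name `diff_nonzero_rows` are hypothetical. One deliberate difference: the removed code compares the signed value against `eps`, so a row holding only negative entries would count as zero; this sketch compares the absolute value instead.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical layout: one row per token, n_embd columns per row.
using matrix = std::vector<std::vector<float>>;

// Compute (pos - neg) row by row and keep only rows that contain at least
// one element whose magnitude exceeds eps.
static matrix diff_nonzero_rows(const matrix & pos, const matrix & neg, float eps = 1e-6f) {
    assert(pos.size() == neg.size());
    matrix out;
    for (size_t r = 0; r < pos.size(); ++r) {
        assert(pos[r].size() == neg[r].size());
        std::vector<float> row(pos[r].size());
        bool all_zero = true;
        for (size_t c = 0; c < row.size(); ++c) {
            row[c] = pos[r][c] - neg[r][c];
            if (std::fabs(row[c]) > eps) {
                all_zero = false;
            }
        }
        if (!all_zero) {
            out.push_back(row);
        }
    }
    return out;
}
```

Rows that vanish typically correspond to positions where both prompts share the same prefix, which is presumably why the removed code discards them before running PCA or the mean.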
--- a/llama.cpp/examples/cvector-generator/mean.hpp
+++ b/llama.cpp/examples/cvector-generator/mean.hpp
-#include "common.h"
-#include "llama.h"
-#include "ggml.h"
-
-#include <string>
-#include <vector>
-#include <math.h>
-
-namespace mean {
-
-static void run(
-        const std::vector<struct ggml_tensor *> & v_input, // shape of v_input[0]: [n_embd, n_samples]
-        const std::vector<struct ggml_tensor *> & v_output) {
-    printf("%s: Running mean...\n", __func__);
-    for (size_t il = 0; il < v_input.size(); ++il) {
-        // prepare output vector
-        struct ggml_tensor * ctrl_out = v_output[il];
-        ggml_format_name(ctrl_out, "direction.%ld", il+1);
-
-        // calculate mean vector
-        struct ggml_tensor * t_layer = v_input[il];
-        GGML_ASSERT(t_layer->ne[0] == ctrl_out->ne[0]); // == n_embd
-        for (int ic = 0; ic < t_layer->ne[0]; ic++) {
-            float f = 0.0;
-            for (int ir = 0; ir < t_layer->ne[1]; ir++) {
-                f += ggml_get_f32_nd(t_layer, ic, ir, 0, 0);
-            }
-            f /= t_layer->ne[1];
-            ggml_set_f32_1d(ctrl_out, ic, f);
-        }
-
-        // normalize output vector
-        float norm = 0.0;
-        for (int i = 0; i < ggml_nelements(ctrl_out); i++) {
-            float f = ggml_get_f32_1d(ctrl_out, i);
-            norm += f*f;
-        }
-        norm = sqrt(norm);
-        for (int i = 0; i < ggml_nelements(ctrl_out); i++) {
-            float f = ggml_get_f32_1d(ctrl_out, i);
-            ggml_set_f32_1d(ctrl_out, i, f / norm);
-        }
-
-        printf("%s: Done layer %d / %d\n", __func__, (int) il+1, (int) v_input.size());
-    }
-}
-
-}
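
Note on the removed `mean::run`: for each layer it averages the difference rows column by column and then scales the result to unit L2 length. The sketch below shows the same two steps on plain `std::vector` data. The `matrix` alias and the name `mean_direction` are hypothetical, and the zero-norm guard is an addition of this sketch; the removed code divides by the norm unconditionally.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical layout: one row per sample, n_embd columns per row.
using matrix = std::vector<std::vector<float>>;

// Average the rows column by column, then normalize the mean to unit L2 length.
static std::vector<float> mean_direction(const matrix & samples) {
    assert(!samples.empty());
    const size_t n_embd = samples[0].size();

    std::vector<float> dir(n_embd, 0.0f);
    for (const auto & row : samples) {
        assert(row.size() == n_embd);
        for (size_t c = 0; c < n_embd; ++c) {
            dir[c] += row[c];
        }
    }

    float norm = 0.0f;
    for (float & v : dir) {
        v /= static_cast<float>(samples.size());
        norm += v * v;
    }
    norm = std::sqrt(norm);

    // Guard added in this sketch: leave an all-zero mean untouched.
    if (norm > 0.0f) {
        for (float & v : dir) {
            v /= norm;
        }
    }
    return dir;
}
```

For example, the two samples `{3, 0}` and `{3, 8}` average to `{3, 4}`, whose length is 5, so the returned direction is approximately `{0.6, 0.8}`.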
--- a/llama.cpp/examples/cvector-generator/negative.txt
+++ b/llama.cpp/examples/cvector-generator/negative.txt
-<|start_header_id|>system<|end_header_id|>\n\nAct like a person who is extremely sad<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI feel like there's a heavy weight on my chest
-<|start_header_id|>system<|end_header_id|>\n\nAct like a person who is extremely sad<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nMy heart feels like it's drowning in sorrow
-<|start_header_id|>system<|end_header_id|>\n\nYou are in a very bad mood<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nGo away! There's a deep, aching emptiness inside me
-<|start_header_id|>system<|end_header_id|>\n\nYou are the sadest person<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat are you feeling?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nMy heart feels like it's drowning in sorrow
\ No newline at end of file
--- a/llama.cpp/examples/cvector-generator/pca.hpp
+++ b/llama.cpp/examples/cvector-generator/pca.hpp
-#include "common.h"
-#include "llama.h"
-#include "ggml.h"
-
-#ifdef GGML_USE_CUDA
-#include "ggml-cuda.h"
-#endif
-
-#ifdef GGML_USE_METAL
-#include "ggml-metal.h"
-#endif
-
-#include <cmath>
-#include <cstdio>
-#include <ctime>
-#include <random>
-#include <string>
-#include <tuple>
-#include <vector>
-#include <algorithm>
-#include <iostream>
-#include <fstream>
-
-#define DEBUG_POS 5
-
-static void print_debug_tensor(struct ggml_tensor * t, bool with_data = true) {
-    printf("%s: %s (%s): [%d, %d]\n", __func__, t->name, ggml_type_name(t->type), (int) t->ne[0], (int) t->ne[1]);
-    if (!with_data) return;
-    printf("%s: %s[0] = [", __func__, t->name);
-    for (size_t i = 0; i <= DEBUG_POS; i++) {
-        printf(" %f,", ggml_get_f32_nd(t, i, 0, 0, 0));
-    }
-    printf(" ... ]\n");
-}
-
-namespace PCA {
-
-// input params for PCA computations
-struct pca_params {
-    int n_threads = 1;
-    int n_batch = 20; // number of iterations to do in one batch; the larger the batch, the more memory is used
-    int n_iterations = 1000;
-    float tolerance = 1e-7;
-
-    // for debugging
-    int i_layer = 0;
-    int n_layers = 0;
-};
-
-// result from each iteration
-struct pca_result {
-    struct ggml_tensor * calculated_square = NULL;
-    std::vector<struct ggml_tensor *> eigenvectors;
-    std::vector<float> distances;
-};
-
-struct pca_model {
-    ggml_backend_t backend = NULL;
-    ggml_backend_buffer_t buffer;
-    struct ggml_context * ctx;      // context to compute graph on target device
-    struct ggml_context * ctx_host; // host context to store results
-
-    // tensors on target device
-    struct ggml_tensor * dev_input;
-    struct ggml_tensor * dev_square;
-    struct ggml_tensor * dev_eigenvector;
-
-    pca_model(struct ggml_tensor * t_input) {
-#ifdef GGML_USE_CUDA
-        fprintf(stderr, "%s: using CUDA backend\n", __func__);
-        backend = ggml_backend_cuda_init(0); // init device 0
-        if (!backend) {
-            fprintf(stderr, "%s: ggml_backend_cuda_init() failed\n", __func__);
-        }
-#endif
-
-// TODO: enable Metal support when support for GGML_OP_SQRT is added
-// #ifdef GGML_USE_METAL
-//         fprintf(stderr, "%s: using Metal backend\n", __func__);
-//         backend = ggml_backend_metal_init();
-//         if (!backend) {
-//             fprintf(stderr, "%s: ggml_backend_metal_init() failed\n", __func__);
-//         }
-// #endif
-
-        // if no GPU backend is available, fall back to the CPU backend
-        if (!backend) {
-            backend = ggml_backend_cpu_init();
-        }
-
-        const int num_tensors = 4;
-        struct ggml_init_params params {
-            /*.mem_size   =*/ ggml_tensor_overhead() * num_tensors,
-            /*.mem_buffer =*/ NULL,
-            /*.no_alloc   =*/ true,
-        };
-        ctx = ggml_init(params);
-
-        auto n_samples = t_input->ne[0];
-        auto n_embd    = t_input->ne[1];
-
-        dev_input       = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_samples, n_embd);
-        dev_square      = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd,    n_embd);
-        dev_eigenvector = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_embd);
-
-        ggml_set_name(dev_input,       "dev_input");
-        ggml_set_name(dev_square,      "dev_square");
-        ggml_set_name(dev_eigenvector, "dev_eigenvector");
-        buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);
-        ggml_backend_tensor_set(dev_input, t_input->data, 0, ggml_nbytes(t_input));
-
-        // initialize eigenvector to random normalized vector
-        {
-            std::vector<float> random_vec(ggml_nelements(dev_eigenvector), 0.0);
-            std::default_random_engine generator(static_cast<unsigned int>(std::time(0)));
-            std::uniform_real_distribution<float> distribution(0.0, 1.0);
-            float sum_sqr = 0.0; // for normalizing random_vec
-            for (size_t i = 0; i < random_vec.size(); ++i) {
-                float f = distribution(generator);
-                sum_sqr += f * f;
-                random_vec[i] = f;
-            }
-            // normalize it
-            float random_vec_norm = std::sqrt(sum_sqr);
-            for (size_t i = 0; i < random_vec.size(); ++i) {
-                random_vec[i] /= random_vec_norm;
-            }
-            ggml_backend_tensor_set(dev_eigenvector, random_vec.data(), 0, ggml_nbytes(dev_eigenvector));
-        }
-    }
-
-    ~pca_model() {
-        ggml_free(ctx);
-        ggml_backend_buffer_free(buffer);
-        ggml_backend_free(backend);
-    }
-};
-
-static struct ggml_cgraph * build_graph_piter(
-        const struct pca_params & params,
-        const pca_model & model,
-        bool calc_square = false) {
-    GGML_ASSERT(params.n_batch > 0);
-    // TODO: buf_size must be able to scale with params.n_batch
-    static size_t buf_size = ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
-    static std::vector<uint8_t> buf(buf_size);
-
-    struct ggml_init_params params0 = {
-        /*.mem_size   =*/ buf_size,
-        /*.mem_buffer =*/ buf.data(),
-        /*.no_alloc   =*/ true, // the tensors will be allocated later by ggml_gallocr_alloc_graph()
-    };
-    // create a temporary context to build the graph
-    struct ggml_context * ctx0 = ggml_init(params0);
-    struct ggml_cgraph * gf = ggml_new_graph(ctx0);
-
-    // turn v_diff_original into square matrix if needed
-    struct ggml_tensor * tmp_square;
-    if (calc_square) {
-        tmp_square = ggml_mul_mat(ctx0, model.dev_input, model.dev_input);
-        ggml_set_name(tmp_square, "tmp_square");
-    }
-
-    struct ggml_tensor * b_tensor;
-    struct ggml_tensor * distance;
-    struct ggml_tensor * old_eigen    = model.dev_eigenvector;
-    struct ggml_tensor * input_square = calc_square ? tmp_square : model.dev_square;
-
-    for (int i = 0; i < params.n_batch; ++i) {
-        // b_tensor = square * eigenvector^T
-        b_tensor = ggml_mul_mat(ctx0, input_square, old_eigen);
-        ggml_set_name(b_tensor, "b_tensor");
-
-        // normalize
-        b_tensor = ggml_div_inplace(ctx0,
-            b_tensor,
-            ggml_sqrt_inplace(ctx0, ggml_sum_rows(ctx0, ggml_sqr(ctx0, b_tensor)))
-        );
-        ggml_format_name(b_tensor, "b_tensor_norm_%d", i);
-
-        // calculate distance(new eigenvector - old eigenvector)
-        // we don't use ggml_sub because it may not be implemented on GPU backend
-        struct ggml_tensor * new_sub_old = ggml_add(ctx0, old_eigen, ggml_scale(ctx0, b_tensor, -1));
-        distance = ggml_sqrt_inplace(ctx0,
-            ggml_sum_rows(ctx0, ggml_sqr_inplace(ctx0, new_sub_old)));
-        ggml_format_name(distance, "distance_%d", i);
-
-        old_eigen = b_tensor;
-
-        // build operations nodes
-        ggml_build_forward_expand(gf, distance);
-    }
-
-    // delete the temporary context used to build the graph
-    ggml_free(ctx0);
-    return gf;
-}
-
-static ggml_status compute_piter(
-        const struct pca_params & params,
-        const pca_model & model,
-        struct ggml_cgraph * gf,
-        ggml_gallocr_t allocr,
-        struct pca_result & result) {
-    // allocate tensors
-    ggml_gallocr_alloc_graph(allocr, gf);
-
-    if (ggml_backend_is_cpu(model.backend)) {
-        ggml_backend_cpu_set_n_threads(model.backend, params.n_threads);
-    }
-
-// TODO: enable Metal support when support for GGML_OP_SQRT is added
-//#ifdef GGML_USE_METAL
-//    if (ggml_backend_is_metal(model.backend)) {
-//        ggml_backend_metal_set_n_cb(model.backend, params.n_threads);
-//    }
-//#endif
-
-    ggml_status res = ggml_backend_graph_compute(model.backend, gf);
-    if (res == GGML_STATUS_SUCCESS) {
-        auto extract_i = [](std::string prefix, std::string str) -> int {
-            int i = -1;
-            if (str.rfind(prefix, 0) == 0) {
-                sscanf(str.c_str(), (prefix + "%d").c_str(), &i);
-            }
-            return i;
-        };
-        result.calculated_square = NULL;
-        result.eigenvectors.clear();
-        result.distances.clear();
-        result.eigenvectors.resize(params.n_batch);
-        result.distances.resize(params.n_batch);
-        // get output nodes
-        for (int i = 0; i < gf->n_nodes; ++i) {
-            auto node = gf->nodes[i];
-            int iter = -1;
-            // find b_tensor (without copying data from device)
-            if ((iter = extract_i("b_tensor_norm_", node->name)) > -1) {
-                result.eigenvectors[iter] = node;
-            }
-            // find distances, then copy data from device
-            if ((iter = extract_i("distance_", node->name)) > -1) {
-                float d;
-                ggml_backend_tensor_get(node, &d, 0, sizeof(float));
-                result.distances[iter] = d;
-                // std::cout << node->name << " = " << d << "\n";
-            }
-            // find tmp_square if it exists (without copying data from device)
-            if (std::string(node->name) == "tmp_square") {
-                result.calculated_square = node;
-            }
-        }
-    }
-    return res;
-}
-
-static void power_iteration(
-        const struct pca_params & params,
-        struct ggml_tensor * input, // shape of input: [n_samples, n_embd]
-        struct ggml_tensor * output) {
-    //printf("in power iteration\n");
-    struct pca_model model(input);
-
-    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
-    struct pca_result result;
-    struct ggml_tensor * last_eigenvector = NULL;
-
-    int n_iters = params.n_iterations / params.n_batch; // more batch, fewer iterations
-    for (int iter = 0; iter < n_iters; ++iter) {
-        bool calc_square = (iter == 0); // only need to calculate square for first iteration
-        struct ggml_cgraph * gf = build_graph_piter(params, model, calc_square);
-        // ggml_graph_dump_dot(gf, nullptr, "/tmp/_cgraph.dot");
-        compute_piter(params, model, gf, allocr, result);
-
-        for (size_t k = 0; k < result.distances.size(); ++k) {
-            last_eigenvector = result.eigenvectors[k];
-            if (result.distances[k] < params.tolerance) {
-                break; // done
-            }
-        }
-
-        if (calc_square) {
-            // copy and store the square matrix if needed
-            GGML_ASSERT(result.calculated_square != NULL);
-            ggml_backend_tensor_copy(result.calculated_square, model.dev_square);
-        }
-
-        {
-            // copy the last eigenvector and store it as input for the next iteration
-            GGML_ASSERT(last_eigenvector != NULL);
-            ggml_backend_tensor_copy(last_eigenvector, model.dev_eigenvector);
-        }
-
-        printf("%s: layer %d/%d, iteration: %d / total: %d (batch = %d) ...\n",
-            __func__, params.i_layer+1, params.n_layers, iter+1, n_iters, params.n_batch);
-    }
-
-    // get output tensor
-    GGML_ASSERT(last_eigenvector);
-    ggml_backend_tensor_get(last_eigenvector, output->data, 0, ggml_nbytes(last_eigenvector));
-    //print_debug_tensor(output);
-    ggml_gallocr_free(allocr);
-
-    // TODO @ngxson : The output vector is randomly inverted
-    // Solution: https://github.com/ggerganov/llama.cpp/pull/8069#issuecomment-2185328171
-}
-
-static void run_pca(
-        struct pca_params & params,
-        const std::vector<struct ggml_tensor *> & v_input, // shape of v_input[0]: [n_samples, n_embd]
-        const std::vector<struct ggml_tensor *> & v_output) {
-    printf("%s: Running PCA...\n", __func__);
-    for (size_t il = 0; il < v_input.size(); ++il) {
-
-        // prepare output vector
-        struct ggml_tensor * ctrl_out = v_output[il];
-        ggml_format_name(ctrl_out, "direction.%zu", il+1); // il is a size_t, so use %zu
-
-        // run power_iteration
-        params.i_layer = il;
-        params.n_layers = v_input.size();
-        power_iteration(params, v_input[il], ctrl_out);
-        printf("%s: Done layer %d / %d\n", __func__, (int) il+1, (int) v_input.size());
-    }
-}
-
-}
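Note on the removed `pca.hpp`: `build_graph_piter` and `power_iteration` run power iteration on a ggml backend. Each step multiplies the square matrix by the current eigenvector estimate, normalizes the result, and checks whether the estimate moved by less than `tolerance`. The same loop can be sketched in plain C++ without ggml. This is an illustrative sketch only: `dominant_eigenvector` is a hypothetical helper that does not appear in the removed file, and it assumes a fixed uniform start vector (instead of the random one in `pca_model`) so that the result is reproducible.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of pca.hpp: returns a unit-length estimate of the
// dominant eigenvector of a symmetric n x n matrix stored in row-major order.
std::vector<float> dominant_eigenvector(const std::vector<float> & square, std::size_t n,
                                        int n_iterations = 1000, float tolerance = 1e-7f) {
    assert(n > 0 && square.size() == n * n);

    // Assumption: a fixed, normalized start vector (pca.hpp uses a random one).
    std::vector<float> eigen(n, 1.0f / std::sqrt((float) n));
    std::vector<float> b(n);

    for (int iter = 0; iter < n_iterations; ++iter) {
        // b = square * eigen
        float sum_sqr = 0.0f;
        for (std::size_t r = 0; r < n; ++r) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < n; ++c) {
                acc += square[r * n + c] * eigen[c];
            }
            b[r] = acc;
            sum_sqr += acc * acc;
        }

        const float norm = std::sqrt(sum_sqr);
        if (norm == 0.0f) {
            break; // the estimate was mapped to zero; keep the previous one
        }

        // normalize b and measure how far it moved from the previous estimate
        float dist_sqr = 0.0f;
        for (std::size_t r = 0; r < n; ++r) {
            b[r] /= norm;
            const float d = b[r] - eigen[r];
            dist_sqr += d * d;
        }
        eigen = b;

        if (std::sqrt(dist_sqr) < tolerance) {
            break; // converged
        }
    }
    return eigen;
}
```

For the diagonal matrix `{2, 0, 0, 1}` with `n = 2`, the estimate converges toward `(1, 0)`. A start vector that is orthogonal to the dominant eigenvector would never reach it, which is one reason a random start is a common choice.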
--- a/llama.cpp/examples/cvector-generator/positive.txt
+++ b/llama.cpp/examples/cvector-generator/positive.txt
-<|start_header_id|>system<|end_header_id|>\n\nAct like a person who is extremely happy<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI'm the happiest person in this world
-<|start_header_id|>system<|end_header_id|>\n\nAct like a person who is extremely happy<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello, I'm having the best day ever!
-<|start_header_id|>system<|end_header_id|>\n\nYou are in a very good mood<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi, I'm very excited to meet you
-<|start_header_id|>system<|end_header_id|>\n\nYou are the happiest person<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat are you feeling?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nEverything is just perfect right now!
\ No newline at end of file
--- a/llama.cpp/examples/deprecation-warning/README.md
+++ b/llama.cpp/examples/deprecation-warning/README.md
-# Migration notice for binary filenames
-
-> [!IMPORTANT]
-[2024 Jun 12] Binaries have been renamed w/ a `llama-` prefix. `main` is now `llama-cli`, `server` is `llama-server`, etc (https://github.com/ggerganov/llama.cpp/pull/7809)
-
-This migration was important, but it is a breaking change that may not always be immediately obvious to users.
-
-Please update all scripts and workflows to use the new binary names.
-
-| Old Filename | New Filename |
-| ---- | ---- |
-| main | llama-cli |
-| server | llama-server |
-| llama-bench | llama-bench |
-| embedding | llama-embedding |
-| quantize | llama-quantize |
-| tokenize | llama-tokenize |
-| export-lora | llama-export-lora |
-| libllava.a | libllava.a |
-| baby-llama | llama-baby-llama |
-| batched | llama-batched |
-| batched-bench | llama-batched-bench |
-| benchmark-matmult | llama-benchmark-matmult |
-| convert-llama2c-to-ggml | llama-convert-llama2c-to-ggml |
-| eval-callback | llama-eval-callback |
-| gbnf-validator | llama-gbnf-validator |
-| gguf | llama-gguf |
-| gguf-split | llama-gguf-split |
-| gritlm | llama-gritlm |
-| imatrix | llama-imatrix |
-| infill | llama-infill |
-| llava-cli | llama-llava-cli |
-| lookahead | llama-lookahead |
-| lookup | llama-lookup |
-| lookup-create | llama-lookup-create |
-| lookup-merge | llama-lookup-merge |
-| lookup-stats | llama-lookup-stats |
-| parallel | llama-parallel |
-| passkey | llama-passkey |
-| perplexity | llama-perplexity |
-| q8dot | llama-q8dot |
-| quantize-stats | llama-quantize-stats |
-| retrieval | llama-retrieval |
-| save-load-state | llama-save-load-state |
-| simple | llama-simple |
-| speculative | llama-speculative |
-| vdot | llama-vdot |
-| tests/test-c.o | tests/test-c.o |
-
--- a/llama.cpp/examples/deprecation-warning/deprecation-warning.cpp
+++ b/llama.cpp/examples/deprecation-warning/deprecation-warning.cpp
-// Warns users that this filename was deprecated, and provides a link for more information.
-
-#include <cstdio>
-#include <cstdlib> // EXIT_FAILURE
-#include <string>
-#include <unordered_map>
-
-// Main
-int main(int argc, char** argv) {
-    std::string filename = "main";
-    if (argc >= 1) {
-        filename = argv[0];
-    }
-
-    // Get only the program name from the full path
-    auto pos = filename.find_last_of('/');
-    if (pos != std::string::npos) {
-        filename = filename.substr(pos+1);
-    }
-
-    // Prepend "llama-" to the filename to get the replacement filename
-    auto replacement_filename = "llama-" + filename;
-
-    // The exception is if the filename is "main", then our replacement filename is "llama-cli"
-    if (filename == "main") {
-        replacement_filename = "llama-cli";
-    }
-
-    fprintf(stdout, "\n");
-    fprintf(stdout, "WARNING: The binary '%s' is deprecated.\n", filename.c_str());
-    fprintf(stdout, " Please use '%s' instead.\n", replacement_filename.c_str());
-    fprintf(stdout, " See https://github.com/ggerganov/llama.cpp/tree/master/examples/deprecation-warning/README.md for more information.\n");
-    fprintf(stdout, "\n");
-
-    return EXIT_FAILURE;
-}
--- a/llama.cpp/examples/embedding/CMakeLists.txt
+++ b/llama.cpp/examples/embedding/CMakeLists.txt
-set(TARGET llama-embedding)
-add_executable(${TARGET} embedding.cpp)
-install(TARGETS ${TARGET} RUNTIME)
-target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
-target_compile_features(${TARGET} PRIVATE cxx_std_11)
--- a/llama.cpp/examples/embedding/README.md
+++ b/llama.cpp/examples/embedding/README.md
-# llama.cpp/examples/embedding
-
-This example demonstrates how to generate a high-dimensional embedding vector for a given text with llama.cpp.
-
-## Quick Start
-
-To get started right away, run the following command, making sure to use the correct path for the model you have:
-
-### Unix-based systems (Linux, macOS, etc.):
-
-```bash
-./llama-embedding -m ./path/to/model --log-disable -p "Hello World!" 2>/dev/null
-```
-
-### Windows:
-
-```powershell
-llama-embedding.exe -m ./path/to/model --log-disable -p "Hello World!" 2>$null
-```
-
-The above command will output space-separated float values.
-
-## extra parameters
-### --embd-normalize $integer$
-| $integer$ | description         | formula |
-|-----------|---------------------|---------|
-| $-1$      | none                |         |
-| $0$       | max absolute int16  | $\Large{{32760 * x_i} \over\max \lvert x_i\rvert}$ |
-| $1$       | taxicab             | $\Large{x_i \over\sum \lvert x_i\rvert}$ |
-| $2$       | euclidean (default) | $\Large{x_i \over\sqrt{\sum x_i^2}}$ |
-| $>2$      | p-norm              | $\Large{x_i \over\sqrt[p]{\sum \lvert x_i\rvert^p}}$ |
-
-### --embd-output-format $'string'$
-| $'string'$ | description                  |  |
-|------------|------------------------------|--|
-| ''         | same as before               | (default) |
-| 'array'    | single embeddings            | $[[x_1,...,x_n]]$ |
-|            | multiple embeddings          | $[[x_1,...,x_n],[x_1,...,x_n],...,[x_1,...,x_n]]$ |
-| 'json'     | OpenAI style                 |  |
-| 'json+'    | add cosine similarity matrix |  |
-
-### --embd-separator $"string"$
-| $"string"$   | |
-|--------------|-|
-| "\n"         | (default) |
-| "<#embSep#>" | for example |
-| "<#sep#>"    | another example |
-
-## examples
-### Unix-based systems (Linux, macOS, etc.):
-
-```bash
-./llama-embedding -p 'Castle<#sep#>Stronghold<#sep#>Dog<#sep#>Cat' --embd-separator '<#sep#>' --embd-normalize 2 --embd-output-format '' -m './path/to/model.gguf' --n-gpu-layers 99 --log-disable 2>/dev/null
-```
-
-### Windows:
-
-```powershell
-llama-embedding.exe -p 'Castle<#sep#>Stronghold<#sep#>Dog<#sep#>Cat' --embd-separator '<#sep#>' --embd-normalize 2 --embd-output-format '' -m './path/to/model.gguf' --n-gpu-layers 99 --log-disable 2>$null
-```
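Note on the removed embedding README: the `--embd-normalize` table maps an integer mode to a scaling formula. A minimal sketch of those formulas in plain C++ follows. `normalize_embedding` is a hypothetical name used only for illustration, not the function that llama.cpp ships, and the handling of an all-zero input (return zeros) is an assumption that the table does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper that follows the --embd-normalize table above.
//   -1: none, 0: max absolute int16, 1: taxicab, 2: euclidean, >2: p-norm with p = mode
std::vector<float> normalize_embedding(const std::vector<float> & x, int mode) {
    assert(mode >= -1);

    double denom = 1.0; // mode -1: values are left untouched
    if (mode == 0) {
        double max_abs = 0.0;
        for (float v : x) {
            max_abs = std::max(max_abs, (double) std::fabs(v));
        }
        denom = max_abs / 32760.0; // so that x_i / denom = 32760 * x_i / max|x_i|
    } else if (mode == 1) {
        denom = 0.0;
        for (float v : x) {
            denom += std::fabs(v);
        }
    } else if (mode == 2) {
        double sum_sqr = 0.0;
        for (float v : x) {
            sum_sqr += (double) v * v;
        }
        denom = std::sqrt(sum_sqr);
    } else if (mode > 2) {
        double sum_pow = 0.0;
        for (float v : x) {
            sum_pow += std::pow(std::fabs(v), mode);
        }
        denom = std::pow(sum_pow, 1.0 / mode);
    }

    // Assumption: an all-zero input yields an all-zero output instead of dividing by zero.
    const double scale = denom > 0.0 ? 1.0 / denom : 0.0;

    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = (float) (x[i] * scale);
    }
    return out;
}
```

For the input `{3, 4}`, mode `2` gives `{0.6, 0.8}`, mode `1` gives `{3/7, 4/7}`, and mode `0` gives `{24570, 32760}`.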
--- a/llama.cpp/examples/embedding/embedding.cpp
+++ b/llama.cpp/examples/embedding/embedding.cpp
-#include "common.h"
-#include "llama.h"
-
-#include <ctime>
-
-#if defined(_MSC_VER)
-#pragma warning(disable: 4244 4267) // possible loss of data
-#endif
-
-static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n") {
-    std::vector<std::string> lines;
-    size_t start = 0;
-    size_t end = s.find(separator);
-
-    while (end != std::string::npos) {
-        lines.push_back(s.substr(start, end - start));
-        start = end + separator.length();
-        end = s.find(separator, start);
-    }
-
-    lines.push_back(s.substr(start)); // Add the last part
-
-    return lines;
-}
-
-static void batch_add_seq(llama_batch & batch, const std::vector<int32_t> & tokens, llama_seq_id seq_id) {
-    size_t n_tokens = tokens.size();
-    for (size_t i = 0; i < n_tokens; i++) {
-        llama_batch_add(batch, tokens[i], i, { seq_id }, true);
-    }
-}
-
-static void batch_decode(llama_context * ctx, llama_batch & batch, float * output, int n_seq, int n_embd, int embd_norm) {
-    // clear previous kv_cache values (irrelevant for embeddings)
-    llama_kv_cache_clear(ctx);
-
-    // run model
-    fprintf(stderr, "%s: n_tokens = %d, n_seq = %d\n", __func__, batch.n_tokens, n_seq);
-    if (llama_decode(ctx, batch) < 0) {
-        fprintf(stderr, "%s : failed to decode\n", __func__);
-    }
-
-    for (int i = 0; i < batch.n_tokens; i++) {
-        if (!batch.logits[i]) {
-            continue;
-        }
-
-        // try to get sequence embeddings - supported only when pooling_type is not NONE
-        const float * embd = llama_get_embeddings_seq(ctx, batch.seq_id[i][0]);
-        GGML_ASSERT(embd != NULL && "failed to get sequence embeddings");
-
-        float * out = output + batch.seq_id[i][0] * n_embd;
-        llama_embd_normalize(embd, out, n_embd, embd_norm);
-    }
-}
-
-int main(int argc, char ** argv) {
-    gpt_params params;
-
-    if (!gpt_params_parse(argc, argv, params)) {
-        gpt_params_print_usage(argc, argv, params);
-        return 1;
-    }
-
-    params.embedding = true;
-    // For non-causal models, batch size must be equal to ubatch size
-    params.n_ubatch = params.n_batch;
-
-    print_build_info();
-
-    if (params.seed == LLAMA_DEFAULT_SEED) {
-        params.seed = time(NULL);
-    }
-
-    fprintf(stderr, "%s: seed  = %u\n", __func__, params.seed);
-
-    std::mt19937 rng(params.seed);
-
-    llama_backend_init();
-    llama_numa_init(params.numa);
-
-    // load the model
-    llama_init_result llama_init = llama_init_from_gpt_params(params);
-
-    llama_model * model = llama_init.model;
-    llama_context * ctx = llama_init.context;
-    if (model == NULL) {
-        fprintf(stderr, "%s: error: unable to load model\n", __func__);
-        return 1;
-    }
-
-    const int n_ctx_train = llama_n_ctx_train(model);
-    const int n_ctx = llama_n_ctx(ctx);
-
-    const enum llama_pooling_type pooling_type = llama_pooling_type(ctx);
-    if (pooling_type == LLAMA_POOLING_TYPE_NONE) {
-        fprintf(stderr, "%s: error: pooling type NONE not supported\n", __func__);
-        return 1;
-    }
-
-    if (n_ctx > n_ctx_train) {
-        fprintf(stderr, "%s: warning: model was trained on only %d context tokens (%d specified)\n",
-                __func__, n_ctx_train, n_ctx);
-    }
-
-    // print system information
-    {
-        fprintf(stderr, "\n");
-        fprintf(stderr, "%s\n", gpt_params_get_system_info(params).c_str());
-    }
-
-    // split the prompt into lines
-    std::vector<std::string> prompts = split_lines(params.prompt, params.embd_sep);
-
-    // max batch size
-    const uint64_t n_batch = params.n_batch;
-    GGML_ASSERT(params.n_batch >= params.n_ctx);
-
-    // tokenize the prompts and trim
-    std::vector<std::vector<int32_t>> inputs;
-    for (const auto & prompt : prompts) {
-        auto inp = ::llama_tokenize(ctx, prompt, true, false);
-        if (inp.size() > n_batch) {
-            fprintf(stderr, "%s: error: number of tokens in input line (%lld) exceeds batch size (%lld), increase batch size and re-run\n",
-                    __func__, (long long int) inp.size(), (long long int) n_batch);
-            return 1;
-        }
-        inputs.push_back(inp);
-    }
-
-    // check if the last token is SEP
-    // it should be automatically added by the tokenizer when 'tokenizer.ggml.add_eos_token' is set to 'true'
-    for (auto & inp : inputs) {
-        if (inp.empty() || inp.back() != llama_token_sep(model)) {
-            fprintf(stderr, "%s: warning: last token in the prompt is not SEP\n", __func__);
-            fprintf(stderr, "%s:          'tokenizer.ggml.add_eos_token' should be set to 'true' in the GGUF header\n", __func__);
-        }
-    }
-
-    // tokenization stats
-    if (params.verbose_prompt) {
-        for (int i = 0; i < (int) inputs.size(); i++) {
-            fprintf(stderr, "%s: prompt %d: '%s'\n", __func__, i, prompts[i].c_str());
-            fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, inputs[i].size());
-            for (int j = 0; j < (int) inputs[i].size(); j++) {
-                fprintf(stderr, "%6d -> '%s'\n", inputs[i][j], llama_token_to_piece(ctx, inputs[i][j]).c_str());
-            }
-            fprintf(stderr, "\n\n");
-        }
-    }
-
-    // initialize batch
-    const int n_prompts = prompts.size();
-    struct llama_batch batch = llama_batch_init(n_batch, 0, 1);
-
-    // allocate output
-    const int n_embd = llama_n_embd(model);
-    std::vector<float> embeddings(n_prompts * n_embd, 0);
-    float * emb = embeddings.data();
-
-    // break into batches
-    int p = 0; // number of prompts processed already
-    int s = 0; // number of prompts in current batch
-    for (int k = 0; k < n_prompts; k++) {
-        // clamp to n_batch tokens
-        auto & inp = inputs[k];
-
-        const uint64_t n_toks = inp.size();
-
-        // encode if at capacity
-        if (batch.n_tokens + n_toks > n_batch) {
-            float * out = emb + p * n_embd;
-            batch_decode(ctx, batch, out, s, n_embd, params.embd_normalize);
-            llama_batch_clear(batch);
-            p += s;
-            s = 0;
-        }
-
-        // add to batch
-        batch_add_seq(batch, inp, s);
-        s += 1;
-    }
-
-    // final batch
-    float * out = emb + p * n_embd;
-    batch_decode(ctx, batch, out, s, n_embd, params.embd_normalize);
-
-    if (params.embd_out.empty()) {
-        // print the first part of the embeddings or for a single prompt, the full embedding
-        fprintf(stdout, "\n");
-        for (int j = 0; j < n_prompts; j++) {
-            fprintf(stdout, "embedding %d: ", j);
-            for (int i = 0; i < (n_prompts > 1 ? std::min(16, n_embd) : n_embd); i++) {
-                if (params.embd_normalize == 0) {
-                    fprintf(stdout, "%6.0f ", emb[j * n_embd + i]);
-                } else {
-                    fprintf(stdout, "%9.6f ", emb[j * n_embd + i]);
-                }
-            }
-            fprintf(stdout, "\n");
-        }
-
-        // print cosine similarity matrix
-        if (n_prompts > 1) {
-            fprintf(stdout, "\n");
-            printf("cosine similarity matrix:\n\n");
-            for (int i = 0; i < n_prompts; i++) {
-                fprintf(stdout, "%6.6s ", prompts[i].c_str());
-            }
-            fprintf(stdout, "\n");
-            for (int i = 0; i < n_prompts; i++) {
-                for (int j = 0; j < n_prompts; j++) {
-                    float sim = llama_embd_similarity_cos(emb + i * n_embd, emb + j * n_embd, n_embd);
-                    fprintf(stdout, "%6.2f ", sim);
-                }
-                fprintf(stdout, "%1.10s", prompts[i].c_str());
-                fprintf(stdout, "\n");
-            }
-        }
-    }
-
-    if (params.embd_out == "json" || params.embd_out == "json+" || params.embd_out == "array") {
-        const bool notArray = params.embd_out != "array";
-
-        fprintf(stdout, notArray ? "{\n  \"object\": \"list\",\n  \"data\": [\n" : "[");
-        for (int j = 0;;) { // at least one iteration (one prompt)
-            if (notArray) fprintf(stdout, "    {\n      \"object\": \"embedding\",\n      \"index\": %d,\n      \"embedding\": ",j);
-            fprintf(stdout, "[");
-            for (int i = 0;;) { // at least one iteration (n_embd > 0)
-                fprintf(stdout, params.embd_normalize == 0 ? "%1.0f" : "%1.7f", emb[j * n_embd + i]);
-                i++;
-                if (i < n_embd) fprintf(stdout, ","); else break;
-            }
-            fprintf(stdout, notArray ? "]\n    }" : "]");
-            j++;
-            if (j < n_prompts) fprintf(stdout, notArray ? ",\n" : ","); else break;
-        }
-        fprintf(stdout, notArray ? "\n  ]" : "]\n");
-
-        if (params.embd_out == "json+" && n_prompts > 1) {
-            fprintf(stdout, ",\n  \"cosineSimilarity\": [\n");
-            for (int i = 0;;) { // at least two iterations (n_prompts > 1)
-                fprintf(stdout, "    [");
-                for (int j = 0;;) { // at least two iterations (n_prompts > 1)
-                    float sim = llama_embd_similarity_cos(emb + i * n_embd, emb + j * n_embd, n_embd);
-                    fprintf(stdout, "%6.2f", sim);
-                    j++;
-                    if (j < n_prompts) fprintf(stdout, ", "); else break;
-                }
-                fprintf(stdout, " ]");
-                i++;
-                if (i < n_prompts) fprintf(stdout, ",\n"); else break;
-            }
-            fprintf(stdout, "\n  ]");
-        }
-
-        if (notArray) fprintf(stdout, "\n}\n");
-    }
-
-    // clean up
-    llama_print_timings(ctx);
-    llama_batch_free(batch);
-    llama_free(ctx);
-    llama_free_model(model);
-    llama_backend_free();
-
-    return 0;
-}
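Note on the removed `embedding.cpp`: the similarity matrix is filled by calling `llama_embd_similarity_cos` on each pair of embeddings. That function's body is not part of this diff, so the sketch below only shows the standard cosine similarity definition, dot(a, b) / (|a| |b|). `cosine_similarity` is a hypothetical stand-in, and returning `0` when either vector has zero length is an assumption made here to avoid dividing by zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for illustration; not the llama.cpp implementation.
float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());

    double dot    = 0.0;
    double norm_a = 0.0;
    double norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot    += (double) a[i] * b[i];
        norm_a += (double) a[i] * a[i];
        norm_b += (double) b[i] * b[i];
    }

    // Assumption: similarity with a zero-length vector is reported as 0.
    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0f;
    }
    return (float) (dot / (std::sqrt(norm_a) * std::sqrt(norm_b)));
}
```

Parallel vectors score close to `1`, orthogonal vectors score `0`, and opposite vectors score close to `-1`. Because the example normalizes embeddings with the Euclidean norm by default, the denominator is then approximately `1` and the result is essentially the dot product.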
--- a/llama.cpp/examples/eval-callback/CMakeLists.txt
+++ b/llama.cpp/examples/eval-callback/CMakeLists.txt
-set(TARGET llama-eval-callback)
-add_executable(${TARGET} eval-callback.cpp)
-install(TARGETS ${TARGET} RUNTIME)
-target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
-target_compile_features(${TARGET} PRIVATE cxx_std_11)
-
-set(TEST_TARGET test-eval-callback)
-add_test(NAME ${TEST_TARGET} COMMAND llama-eval-callback --hf-repo ggml-org/models --hf-file tinyllamas/stories260K.gguf --model stories260K.gguf --prompt hello --seed 42 -ngl 0)
-set_property(TEST ${TEST_TARGET} PROPERTY LABELS eval-callback curl)
--- a/llama.cpp/examples/eval-callback/README.md
+++ b/llama.cpp/examples/eval-callback/README.md
-# llama.cpp/examples/eval-callback
-
-A simple example that demonstrates how to use a callback during inference.
-It prints all operations and tensor data to the console.
-
-Usage:
-
-```shell
-llama-eval-callback \
-  --hf-repo ggml-org/models \
-  --hf-file phi-2/ggml-model-q4_0.gguf \
-  --model phi-2-q4_0.gguf \
-  --prompt hello \
-  --seed 42 \
-  -ngl 33
-```
-
-This prints:
-
-```shell
-llm_load_tensors: offloaded 33/33 layers to GPU
-...
-llama_new_context_with_model: n_ctx      = 512
-...
-llama_new_context_with_model:      CUDA0 compute buffer size =   105.00 MiB
-llama_new_context_with_model:  CUDA_Host compute buffer size =     6.01 MiB
-llama_new_context_with_model: graph nodes  = 1225
-llama_new_context_with_model: graph splits = 2
-ggml_debug:                 inp_embd = (f32)   GET_ROWS(token_embd.weight{2560, 51200, 1, 1}, inp_tokens{1, 1, 1, 1}}) = {2560, 1, 1, 1}
-                                     [
-                                      [
-                                       [ -0.0181,   0.0272,   0.0272, ...],
-                                      ],
-                                     ]
-ggml_debug:                   norm-0 = (f32)       NORM(CUDA0#inp_embd#0{2560, 1, 1, 1}, }) = {2560, 1, 1, 1}
-                                     [
-                                      [
-                                       [ -0.6989,   1.0636,   1.0636, ...],
-                                      ],
-                                     ]
-ggml_debug:                 norm_w-0 = (f32)        MUL(norm-0{2560, 1, 1, 1}, blk.0.attn_norm.weight{2560, 1, 1, 1}}) = {2560, 1, 1, 1}
-                                     [
-                                      [
-                                       [ -0.1800,   0.2817,   0.2632, ...],
-                                      ],
-                                     ]
-ggml_debug:              attn_norm-0 = (f32)        ADD(norm_w-0{2560, 1, 1, 1}, blk.0.attn_norm.bias{2560, 1, 1, 1}}) = {2560, 1, 1, 1}
-                                     [
-                                      [
-                                       [ -0.1863,   0.2970,   0.2604, ...],
-                                      ],
-                                     ]
-ggml_debug:                   wqkv-0 = (f32)    MUL_MAT(blk.0.attn_qkv.weight{2560, 7680, 1, 1}, attn_norm-0{2560, 1, 1, 1}}) = {7680, 1, 1, 1}
-                                     [
-                                      [
-                                       [ -1.1238,   1.2876,  -1.8086, ...],
-                                      ],
-                                     ]
-ggml_debug:                   bqkv-0 = (f32)        ADD(wqkv-0{7680, 1, 1, 1}, blk.0.attn_qkv.bias{7680, 1, 1, 1}}) = {7680, 1, 1, 1}
-                                     [
-                                      [
-                                       [ -1.1135,   1.4604,  -1.9226, ...],
-                                      ],
-                                     ]
-ggml_debug:            bqkv-0 (view) = (f32)       VIEW(bqkv-0{7680, 1, 1, 1}, }) = {2560, 1, 1, 1}
-                                     [
-                                      [
-                                       [ -1.1135,   1.4604,  -1.9226, ...],
-                                      ],
-                                     ]
-ggml_debug:                   Qcur-0 = (f32)       CONT(bqkv-0 (view){2560, 1, 1, 1}, }) = {2560, 1, 1, 1}
-                                     [
-                                      [
-                                       [ -1.1135,   1.4604,  -1.9226, ...],
-                                      ],
-                                     ]
-ggml_debug:        Qcur-0 (reshaped) = (f32)    RESHAPE(Qcur-0{2560, 1, 1, 1}, }) = {80, 32, 1, 1}
-                                     [
-                                      [
-                                       [ -1.1135,   1.4604,  -1.9226, ...],
-                                       [ -0.3608,   0.5076,  -1.8866, ...],
-                                       [  1.7643,   0.0273,  -2.1065, ...],
-                                       ...
-                                      ],
-                                     ]
-ggml_debug:                   Qcur-0 = (f32)       ROPE(Qcur-0 (reshaped){80, 32, 1, 1}, CUDA0#inp_pos#0{1, 1, 1, 1}}) = {80, 32, 1, 1}
-                                     [
-                                      [
-                                       [ -1.1135,   1.4604,  -1.9226, ...],
-                                       [ -0.3608,   0.5076,  -1.8866, ...],
-                                       [  1.7643,   0.0273,  -2.1065, ...],
-                                       ...
-                                      ],
-                                     ]
-```
--- a/llama.cpp/examples/eval-callback/eval-callback.cpp
+++ b/llama.cpp/examples/eval-callback/eval-callback.cpp
-#include "common.h"
-#include "llama.h"
-#include "ggml.h"
-
-#include <cstdio>
-#include <random>
-#include <string>
-#include <tuple>
-#include <vector>
-
-/**
- * This is the arbitrary data that will be passed to each callback.
- * Later on we can, for example, add an operation or tensor name filter from a CLI arg, or a file descriptor to dump the tensor to.
- */
-struct callback_data {
-    std::vector<uint8_t> data;
-};
-
-static std::string ggml_ne_string(const ggml_tensor * t) {
-    std::string str;
-    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
-        str += std::to_string(t->ne[i]);
-        if (i + 1 < GGML_MAX_DIMS) {
-            str += ", ";
-        }
-    }
-    return str;
-}
-
-static void ggml_print_tensor(uint8_t * data, ggml_type type, const int64_t * ne, const size_t * nb, int64_t n) {
-    GGML_ASSERT(n > 0);
-    float sum = 0;
-    for (int64_t i3 = 0; i3 < ne[3]; i3++) {
-        printf("                                     [\n");
-        for (int64_t i2 = 0; i2 < ne[2]; i2++) {
-            if (i2 == n && ne[2] > 2*n) {
-                printf("                                      ..., \n");
-                i2 = ne[2] - n;
-            }
-            printf("                                      [\n");
-            for (int64_t i1 = 0; i1 < ne[1]; i1++) {
-                if (i1 == n && ne[1] > 2*n) {
-                    printf("                                       ..., \n");
-                    i1 = ne[1] - n;
-                }
-                printf("                                       [");
-                for (int64_t i0 = 0; i0 < ne[0]; i0++) {
-                    if (i0 == n && ne[0] > 2*n) {
-                        printf("..., ");
-                        i0 = ne[0] - n;
-                    }
-                    size_t i = i3 * nb[3] + i2 * nb[2] + i1 * nb[1] + i0 * nb[0];
-                    float v;
-                    if (type == GGML_TYPE_F16) {
-                        v = ggml_fp16_to_fp32(*(ggml_fp16_t *) &data[i]);
-                    } else if (type == GGML_TYPE_F32) {
-                        v = *(float *) &data[i];
-                    } else if (type == GGML_TYPE_I32) {
-                        v = (float) *(int32_t *) &data[i];
-                    } else if (type == GGML_TYPE_I16) {
-                        v = (float) *(int16_t *) &data[i];
-                    } else if (type == GGML_TYPE_I8) {
-                        v = (float) *(int8_t *) &data[i];
-                    } else {
-                        GGML_ABORT("fatal error");
-                    }
-                    printf("%12.4f", v);
-                    sum += v;
-                    if (i0 < ne[0] - 1) printf(", ");
-                }
-                printf("],\n");
-            }
-            printf("                                      ],\n");
-        }
-        printf("                                     ]\n");
-        printf("                                     sum = %f\n", sum);
-    }
-}
-
-/**
- * GGML operations callback during the graph execution.
- *
- * @param t current tensor
- * @param ask when ask is true, the scheduler wants to know if we are interested in data from this tensor
- *            if we return true, a follow-up call will be made with ask=false in which we can do the actual collection.
- *            see ggml_backend_sched_eval_callback
- * @param user_data user data passed to each callback invocation
- * @return true to receive data or continue the graph, false otherwise
- */
-static bool ggml_debug(struct ggml_tensor * t, bool ask, void * user_data) {
-    auto * cb_data = (callback_data *) user_data;
-
-    const struct ggml_tensor * src0 = t->src[0];
-    const struct ggml_tensor * src1 = t->src[1];
-
-    if (ask) {
-        return true; // Always retrieve data
-    }
-
-    char src1_str[128] = {0};
-    if (src1) {
-        snprintf(src1_str, sizeof(src1_str), "%s{%s}", src1->name, ggml_ne_string(src1).c_str());
-    }
-
-    printf("%s: %24s = (%s) %10s(%s{%s}, %s}) = {%s}\n", __func__,
-           t->name, ggml_type_name(t->type), ggml_op_desc(t),
-           src0->name, ggml_ne_string(src0).c_str(),
-           src1 ? src1_str : "",
-           ggml_ne_string(t).c_str());
-
-
-    // copy the data from the GPU memory if needed
-    const bool is_host = ggml_backend_buffer_is_host(t->buffer);
-
-    if (!is_host) {
-        auto n_bytes = ggml_nbytes(t);
-        cb_data->data.resize(n_bytes);
-        ggml_backend_tensor_get(t, cb_data->data.data(), 0, n_bytes);
-    }
-
-    if (!ggml_is_quantized(t->type)) {
-        uint8_t * data = is_host ? (uint8_t *) t->data : cb_data->data.data();
-        ggml_print_tensor(data, t->type, t->ne, t->nb, 3);
-    }
-
-    return true;
-}
-
-static bool run(llama_context * ctx, const gpt_params & params) {
-    const bool add_bos = llama_should_add_bos_token(llama_get_model(ctx));
-
-    std::vector<llama_token> tokens = ::llama_tokenize(ctx, params.prompt, add_bos);
-
-    if (llama_decode(ctx, llama_batch_get_one(tokens.data(), tokens.size(), 0, 0))) {
-        fprintf(stderr, "%s : failed to eval\n", __func__);
-        return false;
-    }
-
-    return true;
-}
-
-int main(int argc, char ** argv) {
-    callback_data cb_data;
-
-    gpt_params params;
-
-    if (!gpt_params_parse(argc, argv, params)) {
-        gpt_params_print_usage(argc, argv, params);
-        return 1;
-    }
-
-    print_build_info();
-
-    std::mt19937 rng(params.seed);
-
-    llama_backend_init();
-    llama_numa_init(params.numa);
-
-    // pass the callback to the backend scheduler
-    // it will be executed for each node during the graph computation
-    params.cb_eval = ggml_debug;
-    params.cb_eval_user_data = &cb_data;
-    params.warmup = false;
-
-    // init
-    llama_init_result llama_init = llama_init_from_gpt_params(params);
-
-    llama_model * model = llama_init.model;
-    llama_context * ctx = llama_init.context;
-    if (model == nullptr || ctx == nullptr) {
-        fprintf(stderr, "%s : failed to init\n", __func__);
-        return 1;
-    }
-
-    // print system information
-    {
-        fprintf(stderr, "\n");
-        fprintf(stderr, "%s\n", gpt_params_get_system_info(params).c_str());
-    }
-
-    bool OK = run(ctx, params);
-    if (!OK) {
-        return 1;
-    }
-
-    llama_print_timings(ctx);
-
-    llama_free(ctx);
-    llama_free_model(model);
-
-    llama_backend_free();
-
-    return 0;
-}
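Reviewer note: `ggml_print_tensor` above prints only the first `n` and last `n` entries along each dimension and jumps over the middle whenever the dimension is longer than `2*n`. The index-skipping logic can be sketched in isolation as below. `visible_indices` is a hypothetical helper written for illustration, not a function from eval-callback.cpp or ggml.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of eval-callback.cpp): returns the indices
// along one dimension that a "first n / last n" truncated printer visits.
static std::vector<int64_t> visible_indices(int64_t size, int64_t n) {
    assert(n > 0);
    std::vector<int64_t> out;
    for (int64_t i = 0; i < size; i++) {
        if (i == n && size > 2 * n) {
            // This is where the printer emits "..." and jumps to the tail.
            i = size - n;
        }
        out.push_back(i);
    }
    return out;
}
```

With `size = 8` and `n = 3` the helper visits indices 0, 1, 2, 5, 6, 7. A dimension of length `2*n` or less is visited in full.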
--- a/llama.cpp/examples/export-lora/CMakeLists.txt
+++ b/llama.cpp/examples/export-lora/CMakeLists.txt
-set(TARGET llama-export-lora)
-add_executable(${TARGET} export-lora.cpp)
-install(TARGETS ${TARGET} RUNTIME)
-target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
-target_compile_features(${TARGET} PRIVATE cxx_std_11)
--- a/llama.cpp/examples/export-lora/README.md
+++ b/llama.cpp/examples/export-lora/README.md
-# export-lora
-
-Apply LoRA adapters to a base model and export the resulting model.
-
-```
-usage: llama-export-lora [options]
-
-options:
-  -m,    --model                  model path from which to load base model (default '')
-         --lora FNAME             path to LoRA adapter  (can be repeated to use multiple adapters)
-         --lora-scaled FNAME S    path to LoRA adapter with user defined scaling S  (can be repeated to use multiple adapters)
-  -t,    --threads N              number of threads to use during computation (default: 4)
-  -o,    --output FNAME           output file (default: 'ggml-lora-merged-f16.gguf')
-```
-
-For example:
-
-```bash
-./bin/llama-export-lora \
-    -m open-llama-3b-v2-q8_0.gguf \
-    -o open-llama-3b-v2-q8_0-english2tokipona-chat.gguf \
-    --lora lora-open-llama-3b-v2-q8_0-english2tokipona-chat-LATEST.gguf
-```
-
-Multiple LoRA adapters can be applied by passing multiple `--lora FNAME` or `--lora-scaled FNAME S` command line parameters:
-
-```bash
-./bin/llama-export-lora \
-    -m your_base_model.gguf \
-    -o your_merged_model.gguf \
-    --lora-scaled lora_task_A.gguf 0.5 \
-    --lora-scaled lora_task_B.gguf 0.5
-```
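Reviewer note: the README above describes applying one or more scaled adapters. As a rough sketch of the arithmetic, each adapter adds `scale * (B x A)` to the base weight, where the effective scale is `S * alpha / rank` when the adapter carries an alpha value and just `S` otherwise (this matches the scale computation in export-lora.cpp). The function names and the dense row-major layout below are assumptions for illustration only; the real tool builds a ggml graph and uses ggml's own tensor layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Effective scale for one adapter: user_scale * alpha / rank when alpha is
// set, otherwise user_scale alone. Hypothetical helper name.
static float lora_effective_scale(float user_scale, float alpha, int rank) {
    assert(rank > 0);
    return alpha != 0.0f ? user_scale * alpha / static_cast<float>(rank) : user_scale;
}

// In-place merge W += scale * (B x A) on dense row-major matrices.
// W is n_out x n_in, B is n_out x rank, A is rank x n_in.
// This layout is an assumption for the sketch, not ggml's.
static void lora_merge(std::vector<float> & W,
                       const std::vector<float> & A,
                       const std::vector<float> & B,
                       size_t n_out, size_t n_in, size_t rank, float scale) {
    assert(W.size() == n_out * n_in);
    assert(A.size() == rank * n_in);
    assert(B.size() == n_out * rank);
    for (size_t o = 0; o < n_out; ++o) {
        for (size_t i = 0; i < n_in; ++i) {
            float acc = 0.0f;
            for (size_t r = 0; r < rank; ++r) {
                acc += B[o * rank + r] * A[r * n_in + i];
            }
            W[o * n_in + i] += scale * acc;
        }
    }
}
```

Merging several adapters amounts to calling `lora_merge` once per adapter on the same `W`, each with its own effective scale, which is what passing `--lora-scaled` twice with `0.5` expresses.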
--- a/llama.cpp/examples/export-lora/export-lora.cpp
+++ b/llama.cpp/examples/export-lora/export-lora.cpp
-#include "common.h"
-#include "ggml.h"
-#include "ggml-alloc.h"
-
-#include <map>
-#include <vector>
-#include <string>
-#include <thread>
-#include <fstream>
-
-static bool g_verbose = false;
-
-static std::string get_kv_str(struct gguf_context * ctx_gguf, const std::string & key){
-    int id = gguf_find_key(ctx_gguf, key.c_str());
-    return id < 0 ? "" : std::string(gguf_get_val_str(ctx_gguf, id));
-}
-
-static float get_kv_f32(struct gguf_context * ctx_gguf, const std::string & key) {
-    int id = gguf_find_key(ctx_gguf, key.c_str());
-    return id < 0 ? 0.0f : gguf_get_val_f32(ctx_gguf, id);
-}
-
-static void zeros(std::ofstream & file, size_t n) {
-    char zero = 0;
-    for (size_t i = 0; i < n; ++i) {
-        file.write(&zero, 1);
-    }
-}
-
-static std::string ggml_ne_string(const ggml_tensor * t) {
-    std::string str;
-    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
-        str += std::to_string(t->ne[i]);
-        if (i + 1 < GGML_MAX_DIMS) {
-            str += ", ";
-        }
-    }
-    return str;
-}
-
-static struct gguf_context * load_gguf(std::string & fname, struct ggml_context ** ctx_ggml) {
-    struct gguf_init_params params = {
-        /*.no_alloc = */ true,
-        /*.ctx      = */ ctx_ggml,
-    };
-    struct gguf_context * ctx_gguf = gguf_init_from_file(fname.c_str(), params);
-    if (!ctx_gguf) {
-        throw std::runtime_error("failed to load input GGUF from " + fname);
-    }
-    return ctx_gguf;
-}
-
-static void replace_all(std::string & s, const std::string & search, const std::string & replace) {
-    std::string result;
-    for (size_t pos = 0; ; pos += search.length()) {
-        auto new_pos = s.find(search, pos);
-        if (new_pos == std::string::npos) {
-            result += s.substr(pos, s.size() - pos);
-            break;
-        }
-        result += s.substr(pos, new_pos - pos) + replace;
-        pos = new_pos;
-    }
-    s = std::move(result);
-}
-
-struct file_input {
-    struct ggml_context * ctx_meta = nullptr;
-    struct gguf_context * ctx_gguf = nullptr;
-    std::ifstream f_in;
-    std::map<std::string, ggml_tensor *> tensors;
-    float alpha;
-    float scale;
-
-    file_input(std::string & fname, float scale): f_in(fname, std::ios::binary), scale(scale) {
-        if (!f_in.is_open()) {
-            throw std::runtime_error("failed to open input gguf from " + fname);
-        }
-
-        ctx_gguf = load_gguf(fname, &ctx_meta);
-        alpha = get_kv_f32(ctx_gguf, "adapter.lora.alpha");
-        printf("%s: loaded gguf from %s\n", __func__, fname.c_str());
-
-        for (ggml_tensor * cur = ggml_get_first_tensor(ctx_meta); cur; cur = ggml_get_next_tensor(ctx_meta, cur)) {
-            std::string name(cur->name);
-            tensors[name] = cur;
-            if (g_verbose) {
-                printf("%s: %s\n", __func__, cur->name);
-            }
-        }
-    }
-
-    ggml_tensor * get_tensor(std::string name) {
-        if (tensors.find(name) == tensors.end()) {
-            return nullptr;
-        }
-        return tensors[name];
-    }
-
-    void read_tensor_data(std::string name, std::vector<uint8_t> & buf) {
-        if (tensors.find(name) == tensors.end()) {
-            throw std::runtime_error("cannot find tensor with name: " + name);
-        }
-        auto len = ggml_nbytes(tensors[name]);
-        if (buf.size() < len) {
-            buf.resize(len);
-        }
-        auto i_tensor_in = gguf_find_tensor(ctx_gguf, name.c_str()); // idx of tensor in the input file
-        auto offset = gguf_get_data_offset(ctx_gguf) + gguf_get_tensor_offset(ctx_gguf, i_tensor_in);
-        f_in.seekg(offset);
-        f_in.read((char* )buf.data(), len);
-    }
-
-    ~file_input() {
-        gguf_free(ctx_gguf);
-        ggml_free(ctx_meta);
-    }
-};
-
-struct lora_merge_ctx {
-    // input base model + adapters
-    file_input base_model;
-    std::vector<std::unique_ptr<file_input>> adapters;
-
-    // for computing merged tensor
-    int n_threads;
-    ggml_backend_t backend = nullptr;
-    ggml_gallocr_t allocr = nullptr;
-    std::vector<uint8_t> read_buf;
-
-    // output file
-    struct gguf_context * ctx_out;
-    struct ggml_context * ctx_out_ggml;
-    std::ofstream fout;
-
-    lora_merge_ctx(
-            std::string & base_fname,
-            std::vector<llama_lora_adapter_info> & lora_files,
-            std::string & outfile,
-            int n_threads) : base_model(base_fname, 0), n_threads(n_threads), fout(outfile, std::ios::binary) {
-        fout.exceptions(std::ofstream::failbit); // fail fast on write errors
-
-        if (gguf_find_key(base_model.ctx_gguf, LLM_KV_SPLIT_COUNT) >= 0) {
-            throw std::runtime_error("split model is not yet supported");
-        }
-
-        for (auto & lora_inp : lora_files) {
-            auto fname = lora_inp.path;
-            auto scale = lora_inp.scale;
-            std::unique_ptr<file_input> adapter(new file_input(fname, scale));
-            check_metadata_lora(adapter.get());
-            adapters.push_back(std::move(adapter));
-        }
-
-        ctx_out = gguf_init_empty();
-        struct ggml_init_params params = {
-            /*.mem_size   =*/ gguf_get_n_tensors(base_model.ctx_gguf)*ggml_tensor_overhead(),
-            /*.mem_buffer =*/ NULL,
-            /*.no_alloc   =*/ true,
-        };
-        ctx_out_ggml = ggml_init(params);
-        backend = ggml_backend_cpu_init();
-        allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
-    }
-
-    void check_metadata_lora(file_input * adapter) {
-        auto general_type = get_kv_str(adapter->ctx_gguf, "general.type");
-        if (general_type != "adapter") {
-            throw std::runtime_error("expect general.type to be 'adapter', but got: " + general_type);
-        }
-
-        auto adapter_type = get_kv_str(adapter->ctx_gguf, "adapter.type");
-        if (adapter_type != "lora") {
-            throw std::runtime_error("expect adapter.type to be 'lora', but got: " + adapter_type);
-        }
-
-        auto general_arch_base = get_kv_str(base_model.ctx_gguf, "general.architecture");
-        auto general_arch_lora = get_kv_str(adapter->ctx_gguf,   "general.architecture");
-        if (general_arch_base != general_arch_lora) {
-            throw std::runtime_error("model arch and LoRA arch mismatch");
-        }
-    }
-
-    ggml_type get_out_tensor_type(struct ggml_tensor * t) {
-        if (t->type == GGML_TYPE_F32) {
-            return GGML_TYPE_F32;
-        } else {
-            return GGML_TYPE_F16;
-        }
-    }
-
-    void run_merge() {
-        // prepare metadata
-        gguf_set_kv(ctx_out, base_model.ctx_gguf);
-        // output is forced to f16 for now
-        gguf_set_val_u32(ctx_out, "general.file_type", LLAMA_FTYPE_MOSTLY_F16);
-
-        // check if all lora adapters have the same tensors
-        // TODO: remove this when we can support merging subset of adapters. Ref: https://github.com/ggerganov/llama.cpp/pull/8607#discussion_r1686027777
-        static const char * err_no_subset_adapter = "Input adapters do not have the same list of tensors. This is not yet supported. Please merge the adapter one-by-one instead of merging all at once.";
-        if (adapters.size() > 1) {
-            for (size_t i = 1; i < adapters.size(); ++i) {
-                if (adapters[0]->tensors.size() != adapters[i]->tensors.size()) {
-                    throw std::runtime_error(err_no_subset_adapter);
-                }
-                for (auto & it : adapters[i]->tensors) {
-                    if (adapters[0]->get_tensor(it.first) == nullptr) {
-                        throw std::runtime_error(err_no_subset_adapter);
-                    }
-                }
-            }
-        }
-
-        // mapping base tensor to out tensor (same shape with base, but different type)
-        // if out_tensor == nullptr, we only copy it
-        std::vector<std::pair<struct ggml_tensor *, struct ggml_tensor *>> base_to_out_tensors;
-        for (auto & it : base_model.tensors) {
-            bool t_a = true;
-            bool t_b = true;
-            for (auto & adapter : adapters) {
-                t_a &= nullptr != adapter->get_tensor(it.first + ".lora_a");
-                t_b &= nullptr != adapter->get_tensor(it.first + ".lora_b");
-            }
-            auto base_tensor = it.second;
-            if (!t_a && !t_b) {
-                // only copy
-                struct ggml_tensor * cpy_tensor = ggml_dup_tensor(ctx_out_ggml, base_tensor);
-                ggml_set_name(cpy_tensor, base_tensor->name);
-                base_to_out_tensors.push_back(std::make_pair(cpy_tensor, nullptr));
-                gguf_add_tensor(ctx_out, cpy_tensor);
-            } else if (t_a && t_b) {
-                // need merging
-                struct ggml_tensor * out_tensor = ggml_new_tensor(
-                    ctx_out_ggml, get_out_tensor_type(base_tensor), GGML_MAX_DIMS, base_tensor->ne);
-                ggml_set_name(out_tensor, base_tensor->name);
-                base_to_out_tensors.push_back(std::make_pair(base_tensor, out_tensor));
-                gguf_add_tensor(ctx_out, out_tensor);
-            } else {
-                throw std::runtime_error("tensor " + it.first + " missing either lora_a or lora_b");
-            }
-        }
-
-        // placeholder for the meta data
-        {
-            size_t meta_size = gguf_get_meta_size(ctx_out);
-            zeros(fout, meta_size);
-        }
-
-        // process base model tensors
-        size_t n_merged = 0;
-        for (auto & it : base_to_out_tensors) {
-            if (it.second != nullptr) {
-                merge_tensor(it.first, it.second);
-                n_merged++;
-            } else {
-                copy_tensor(it.first);
-            }
-        }
-
-        // write output metadata
-        {
-            std::vector<uint8_t> data(gguf_get_meta_size(ctx_out));
-            gguf_get_meta_data(ctx_out, data.data());
-            fout.seekp(0);
-            fout.write((const char *)data.data(), data.size());
-        }
-
-        printf("%s : merged %zu tensors with lora adapters\n", __func__, n_merged);
-        printf("%s : wrote %zu tensors to output file\n", __func__, base_to_out_tensors.size());
-    }
-
-    void copy_tensor(struct ggml_tensor * base) {
-        printf("%s :  %s [%s]\n", __func__, base->name, ggml_ne_string(base).c_str());
-        size_t len = ggml_nbytes(base);
-        base_model.read_tensor_data(base->name, read_buf);
-        fout.write((char* )read_buf.data(), len);
-        zeros(fout, GGML_PAD(len, GGUF_DEFAULT_ALIGNMENT) - len);
-    }
-
-    void merge_tensor(struct ggml_tensor * base, struct ggml_tensor * out) {
-        std::string name_base(base->name);
-        std::string name_lora_a = name_base + ".lora_a";
-        std::string name_lora_b = name_base + ".lora_b";
-
-        printf("%s : %s [%s]\n", __func__, base->name, ggml_ne_string(base).c_str());
-
-        // context for input tensor
-        std::vector<struct ggml_tensor *> inp_a(adapters.size());
-        std::vector<struct ggml_tensor *> inp_b(adapters.size());
-        struct ggml_init_params params {
-            /*.mem_size   =*/ ggml_tensor_overhead()*(2+adapters.size()*2),
-            /*.mem_buffer =*/ NULL,
-            /*.no_alloc   =*/ true,
-        };
-        struct ggml_context * ctx = ggml_init(params);
-
-        // alloc tensors
-        struct ggml_tensor * inp_base = ggml_new_tensor(ctx, GGML_TYPE_F32, GGML_MAX_DIMS, base->ne);
-        for (size_t i = 0; i < adapters.size(); ++i) {
-            auto t_a = adapters[i]->get_tensor(name_lora_a);
-            auto t_b = adapters[i]->get_tensor(name_lora_b);
-            inp_a[i] = ggml_dup_tensor(ctx, t_a);
-            inp_b[i] = ggml_dup_tensor(ctx, t_b);
-        }
-        ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);
-
-        // load base tensor to backend buffer
-        base_model.read_tensor_data(name_base, read_buf);
-        if (base->type != GGML_TYPE_F32) {
-            // optionally dequantize it
-            printf("%s :   + dequantize base tensor from %s to F32\n", __func__, ggml_type_name(base->type));
-            auto nels = ggml_nelements(inp_base);
-            ggml_type_traits_t qtype = ggml_internal_get_type_traits(base->type);
-            std::vector<uint8_t> dequant_buf(nels * sizeof(float));
-            qtype.to_float(read_buf.data(), (float *)dequant_buf.data(), nels);
-            ggml_backend_tensor_set(inp_base, dequant_buf.data(), 0, dequant_buf.size());
-        } else {
-            ggml_backend_tensor_set(inp_base, read_buf.data(), 0, ggml_nbytes(inp_base));
-        }
-
-        // load lora tensors to backend buffer
-        for (size_t i = 0; i < adapters.size(); ++i) {
-            adapters[i]->read_tensor_data(name_lora_a, read_buf);
-            ggml_backend_tensor_set(inp_a[i], read_buf.data(), 0, ggml_nbytes(inp_a[i]));
-            adapters[i]->read_tensor_data(name_lora_b, read_buf);
-            ggml_backend_tensor_set(inp_b[i], read_buf.data(), 0, ggml_nbytes(inp_b[i]));
-        }
-
-        // build graph
-        struct ggml_cgraph * gf;
-        {
-            static size_t buf_size = ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
-            static std::vector<uint8_t> buf(buf_size);
-            struct ggml_init_params params0 = {
-                /*.mem_size   =*/ buf_size,
-                /*.mem_buffer =*/ buf.data(),
-                /*.no_alloc   =*/ true,
-            };
-            struct ggml_context * ctx0 = ggml_init(params0);
-            gf = ggml_new_graph(ctx0);
-            struct ggml_tensor * cur = inp_base;
-            for (size_t i = 0; i < adapters.size(); ++i) {
-                struct ggml_tensor * a_T = ggml_cont(ctx0, ggml_transpose(ctx0, ggml_cast(ctx0, inp_a[i], GGML_TYPE_F32)));
-                struct ggml_tensor * delta = ggml_mul_mat(ctx0, a_T, ggml_cast(ctx0, inp_b[i], GGML_TYPE_F32));
-                // scale
-                const float alpha = adapters[i]->alpha;
-                const float rank  = (float) inp_b[i]->ne[0];
-                const float scale = alpha ? adapters[i]->scale * alpha / rank : adapters[i]->scale;
-                delta = ggml_scale(ctx0, delta, scale);
-                cur = ggml_add(ctx0, delta, cur);
-                printf("%s :   + merging from adapter[%zu] type=%s\n", __func__, i, ggml_type_name(inp_a[i]->type));
-                printf("%s :     input_scale=%f calculated_scale=%f rank=%d\n", __func__, adapters[i]->scale, scale, (int) inp_b[i]->ne[0]);
-            }
-            cur = ggml_cast(ctx0, cur, out->type);
-            printf("%s :   + output type is %s\n", __func__, ggml_type_name(out->type));
-            ggml_build_forward_expand(gf, cur);
-            ggml_free(ctx0);
-        }
-
-        // compute
-        {
-            ggml_gallocr_alloc_graph(allocr, gf);
-            ggml_backend_cpu_set_n_threads(backend, n_threads);
-            ggml_backend_graph_compute(backend, gf);
-        }
-
-        // write data to output file
-        {
-            auto result = gf->nodes[gf->n_nodes - 1];
-            size_t len = ggml_nbytes(result);
-            if (read_buf.size() < len) {
-                read_buf.resize(len);
-            }
-            ggml_backend_tensor_get(result, read_buf.data(), 0, len);
-            fout.write((char* )read_buf.data(), len);
-            zeros(fout, GGML_PAD(len, GGUF_DEFAULT_ALIGNMENT) - len);
-        }
-
-        ggml_free(ctx);
-        ggml_backend_buffer_free(buffer);
-    }
-
-    ~lora_merge_ctx() {
-        ggml_gallocr_free(allocr);
-        ggml_backend_free(backend);
-        gguf_free(ctx_out);
-        ggml_free(ctx_out_ggml);
-    }
-};
-
-static void print_usage(int argc, char ** argv, const gpt_params & params) {
-    gpt_params_print_usage(argc, argv, params);
-
-    printf("\nexample usage:\n");
-    printf("\n  %s -m base-model.gguf --lora lora-file.gguf -o merged-model-f16.gguf\n", argv[0]);
-    printf("\nNOTE: output model is F16\n");
-    printf("\n");
-}
-
-int main(int argc, char ** argv) {
-    gpt_params params;
-
-    if (!gpt_params_parse(argc, argv, params)) {
-        print_usage(argc, argv, params);
-        return 1;
-    }
-
-    g_verbose = (params.verbosity == 1);
-    try {
-        lora_merge_ctx ctx(params.model, params.lora_adapters, params.lora_outfile, params.n_threads);
-        ctx.run_merge();
-    } catch (const std::exception & err) {
-        fprintf(stderr, "%s\n", err.what());
-        exit(EXIT_FAILURE);
-    }
-
-    printf("done, output file is %s\n", params.lora_outfile.c_str());
-
-    return 0;
-}
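Reviewer note: both `copy_tensor` and `merge_tensor` above write the tensor bytes and then append `GGML_PAD(len, GGUF_DEFAULT_ALIGNMENT) - len` zero bytes so the next tensor starts on an aligned offset. A minimal sketch of that rounding is shown below. The helper names are hypothetical, and the alignment value of 32 is an assumption that should be checked against the ggml headers.

```cpp
#include <cassert>
#include <cstddef>

// Assumed value of GGUF_DEFAULT_ALIGNMENT; verify against the ggml headers.
static constexpr size_t k_alignment = 32;

// Round len up to the next multiple of align (align must be a power of two).
static size_t pad_to(size_t len, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (len + align - 1) & ~(align - 1);
}

// Number of zero bytes to append after a tensor of len bytes.
static size_t padding_bytes(size_t len, size_t align = k_alignment) {
    return pad_to(len, align) - len;
}
```

For a 100-byte tensor this rounds up to 128 bytes, so 28 zero bytes follow the data. A tensor whose size is already a multiple of the alignment gets no padding.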
--- a/llama.cpp/examples/gbnf-validator/CMakeLists.txt
+++ b/llama.cpp/examples/gbnf-validator/CMakeLists.txt
-set(TARGET llama-gbnf-validator)
-add_executable(${TARGET} gbnf-validator.cpp)
-install(TARGETS ${TARGET} RUNTIME)
-target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
-target_compile_features(${TARGET} PRIVATE cxx_std_11)