Unverified Commit 1a19df1f authored by Michael Yang, committed by GitHub

update vendored llama.cpp and ggml (#11823)

* TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch

This will be redone once my branch is merged upstream in llama.cpp

* feat: Update all patches

There are a number that are no longer needed at all:

- 0003-embeddings: Embeddings entirely overhauled on master
- 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely
    overhauled on master
- 0019-metal-add-mean-kernel-14267: Merged upstream
- 0020-CUDA-add-mean-operation-14313: Merged upstream

* feat: Sync llama.cpp and ggml

* fix: Update rsync-filter for all moved/new/removed files

* fix: Add files missing from sync

* fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs

* fix: Add ggml files missing from sync

* fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files

* fix: Remove mtmd main cpp files

* fix: Add missing include in sampling_ext.cpp

* fix: Update llama.go to use mtmd instead of clip/llava

* fix: Add patch for mtmd_input_text

* chore: Ignore *.patched in the patch directory

* fix: Fix support for arch-specific ggml-cpu source files with new arrangement

In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific
implementations were split out into a nested tree structure under
ggml-cpu/arch. This conflicts with standard CGO layout where all
arch-specific source files are expected to live in the same directory as
the parent go module and use suffixes based on GOOS and GOARCH. As such,
there were really two options for getting this to work:

1. Add a patch on top of the GGML sync to rearrange the files to match the
GO layout convention
2. Use CGO directives to conditionally include the nested source files in
the compilation units

This commit does (2) in order to minimize the set of changes needed on top
of the upstream file layout. To get this to work, there are two key things
needed:

1. In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in
the preprocessor directives
2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to
explicitly include the .c|.cpp files for the given architecture from the
nested directory (a rough sketch follows below)
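
As a loose illustration of how those two pieces fit together (a sketch only;
the package name, flags, and included file names below are hypothetical and
not ollama's exact cpu.go / arch-impls.c contents):

    //go:build cgo

    package cpu

    /*
    // Approach (2): define __${GOARCH}__ per architecture so the C
    // translation units can pick the matching nested upstream sources.
    #cgo amd64 CFLAGS: -D__amd64__
    #cgo arm64 CFLAGS: -D__arm64__

    // arch-impls.c, compiled by cgo next to this file, would then contain
    // a chain along the lines of:
    //
    //   #if defined(__arm64__)
    //   #include "ggml-cpu/arch/arm/quants.c"
    //   #elif defined(__amd64__)
    //   #include "ggml-cpu/arch/x86/quants.c"
    //   #endif
    */
    import "C"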

* fix: Use mtmd_helper to correctly load the bitmap for the image

* fix: Apply patch for mtmd_text_input

* fix: Add missing stb to llama.cpp rsync-filter

* fix: Add sync'ed stb vendored header

* fix: Use c++17 and include vendor for go wrapper modules

* fix: Update patch 0015 for upstream implementation of uuid

* feat: Bump to the latest tip of the branch

* fix: Update patches for bump

* feat: Bump back to the central repo and point at the latest master

This includes granite 4 and a number of other model architectures!

* fix: Revert changes to ggml export GPU UUID patch

* fix: Add patch for GGML_VERSION and GGML_COMMIT constants

* feat: Sync all patched code

* build: Include cmake/common.cmake in ggml sync

* build: Add top-level include for GNUInstallDirs in CMakeLists.txt

This is used to populate CMAKE_INSTALL_BINDIR

* fix: Add a patch to avoid power throttling API on non-msvc windows builds

* fix: Sync patch changes for ggml-cpu.c

* feat: Bump llama.cpp to 4a4f42

This picks up support for Kimi K2 and PLaMO-2

* feat: Sync llama.cpp

* fix: Handle multi-chunk image encodings from mtmd

* fix: Re-number patches after merge with `main`

* feat: Bump to 41e78c in the makefile

* fix: Fix Solar and argsort/copy patches after bump

* fix: Remove Gemma3n CUDA Graphs patch

It was implemented upstream:
https://github.com/ggml-org/llama.cpp/pull/14741

* feat: Sync llama.cpp / ggml after latest bump

* build: Remove unnecessary CFLAGS definitions in cpu.go

* fix: Remove unnecessary additions in the rsync-filter

* fix: Remove unused vendored code for chat template parsing

* Revert "fix: Remove Gemma3n CUDA Graphs patch"

This reverts commit d724caced3ce21f08924d4b7801f94ce6638f6ea.

* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes

https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394



* fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n

* unwind mxfp4 patch

Prepare to bump ggml with their impl for mxfp4

* bump

* fix windows build error

* Convert tensors at load time

Repack the mxfp4 tensors as ggml's kernels expect them to be.

* convert mlp bf16 to f32
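
Since the "_exps.bias" tensors only need a width change, the bf16 to f32
conversion is mechanical: a bfloat16 value is the upper 16 bits of the
equivalent float32. A minimal, self-contained Go sketch of that widening
(not the backend's actual ConvertToF32 helper, which operates on the
streamed tensor data) looks like:

    package main

    import (
        "encoding/binary"
        "fmt"
        "math"
    )

    // bf16ToF32 widens little-endian bfloat16 bytes to float32 values by
    // placing each 16-bit pattern into the upper half of a 32-bit word.
    func bf16ToF32(src []byte) []float32 {
        out := make([]float32, len(src)/2)
        for i := range out {
            bits := uint32(binary.LittleEndian.Uint16(src[2*i:])) << 16
            out[i] = math.Float32frombits(bits)
        }
        return out
    }

    func main() {
        // 0x3F80 is 1.0 in bfloat16 and 0xC000 is -2.0.
        fmt.Println(bf16ToF32([]byte{0x80, 0x3F, 0x00, 0xC0})) // [1 -2]
    }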

* buffer the conversion better

* reshape earlier

* openai swiglu

* add ids

* split qkv, gate_up

* fix nested alt tags

* fast attention

* remove debug messages

* fix lint

* remove redundant test

* remap values only if source/target are different

* add back i32->i32 copy

* refactor cpu quants

* clean up vendor

* update patch instructions

* clean up patches

* remove webgpu

* update mem

* also handle gpt-oss

* revert convert changes

---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
parent 7ccfd97a
@@ -16,7 +16,7 @@ Defaults to false for newly created backend buffer types.
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/ggml/include/ggml-backend.h b/ggml/include/ggml-backend.h
-index 48839339..3903c3cb 100644
+index 9424394e..b602a7c7 100644
--- a/ggml/include/ggml-backend.h
+++ b/ggml/include/ggml-backend.h
@@ -35,6 +35,7 @@ extern "C" {
@@ -48,7 +48,7 @@ index c36c12d6..81749a5a 100644
GGML_API ggml_backend_buffer_t ggml_backend_buffer_init(
diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
-index be335e8c..84928bc3 100644
+index eded0291..05a842ed 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -35,12 +35,22 @@ const char * ggml_backend_buft_name(ggml_backend_buffer_type_t buft) {
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Michael Yang <git@mxy.ng>
Date: Thu, 31 Jul 2025 12:31:58 -0700
Subject: [PATCH] cuda: disable graph compat check for OP_ADD
---
ggml/src/ggml-cuda/ggml-cuda.cu | 14 --------------
1 file changed, 14 deletions(-)
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index bb19b06e..080e7467 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2509,20 +2509,6 @@ static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cud
#endif
}
- // workarounds to exclude Gemma3n's `project_per_layer_input` operation from the batch-size heuristic, specific to ollama's implementation of gemma3n
- // number of layers is different for per_layer_proj between gemma3n:2b and gemma3n:4b, which is why we don't check that value here
- if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1 && !(node->ne[0] == 256
- && node->ne[2] == 1
- && node->ne[3] == 1
- && node->src[0] ? std::string(node->src[0]->name).find(gemma3n_node_name) != std::string::npos : false
- && node->src[1] ? node->src[1]->name == gemma3n_per_layer_proj_src1_name : false)) {
- // Generally, changes in batch size or context size can cause changes to the grid size of some kernels.
- use_cuda_graph = false;
-#ifndef NDEBUG
- GGML_LOG_INFO("%s: disabling CUDA graphs due to batch size > 1 [%s] [%ld %ld %ld %ld]\n", __func__, node->name, node->ne[0], node->ne[1], node->ne[2], node->ne[3]);
-#endif
- }
-
if (node->op == GGML_OP_CPY) {
// Store the pointers which are updated for each token, such that these can be sent
@@ -6,6 +6,7 @@
#include "llama-model.h"
#include "llama-model-loader.h"
#include "llama-grammar.h"
#include "nlohmann/json.hpp"
struct common_sampler *common_sampler_cinit(const struct llama_model *model, struct common_sampler_cparams *params) {
try {
@@ -94,7 +95,7 @@ struct llama_grammar *grammar_init(char* grammar, uint32_t* tokens, size_t n_tok
ollama_vocab *vocab = new ollama_vocab();
vocab->set_eog_tokens(eog_tokens, n_eog_tokens);
vocab->add_token_pieces(tokens, n_tokens, pieces);
struct llama_grammar *g = llama_grammar_init_impl(nullptr, vocab, grammar, "root", false, nullptr, 0, nullptr, 0);
if (g == nullptr) {
LLAMA_LOG_ERROR("%s: failed to initialize grammar\n", __func__);
@@ -263,6 +263,7 @@ type Tensor interface {
Mulmat(ctx Context, t2 Tensor) Tensor
MulmatFullPrec(ctx Context, t2 Tensor) Tensor
MulmatID(ctx Context, t2, ids Tensor) Tensor
AddID(ctx Context, t2, ids Tensor) Tensor
Softmax(ctx Context) Tensor
LayerNorm(ctx Context, weight, bias Tensor, eps float32) Tensor
@@ -283,6 +284,7 @@ type Tensor interface {
SILU(ctx Context) Tensor
RELU(ctx Context) Tensor
Sigmoid(ctx Context) Tensor
SwiGLU(ctx Context, up Tensor, alpha, limit float32) Tensor
Reshape(ctx Context, shape ...int) Tensor
View(ctx Context, offset int, shape ...int) Tensor
@@ -332,7 +334,7 @@ type Tensor interface {
// kqv := value.Mulmat(ctx, kq)
// return kqv.Permute(ctx, 0, 2, 1, 3).Contiguous(ctx)
type ScaledDotProductAttention interface {
-ScaledDotProductAttention(ctx Context, key, value, mask Tensor, scale float64) Tensor
+ScaledDotProductAttention(ctx Context, key, value, mask, sinks Tensor, scale float64) Tensor
}
type number interface {
@@ -267,7 +267,16 @@ func New(modelPath string, params ml.BackendParams) (ml.Backend, error) {
return tt
}
-tt := C.ggml_new_tensor(ctxs[bt], t.source.Kind, C.int(len(t.source.Shape)), (*C.int64_t)(unsafe.Pointer(&t.source.Shape[0])))
+kind := t.source.Kind
if t.source.Kind == 4 {
// transform raw mxfp4 stream to ggml mxfp4 format
kind = 39
} else if t.source.Kind == uint32(fsggml.TensorTypeBF16) && strings.HasSuffix(t.source.Name, "_exps.bias") {
// transform "_exps.bias" from bf16 to fp32; add_ids only supports fp32 tensors
kind = uint32(fsggml.TensorTypeF32)
}
tt := C.ggml_new_tensor(ctxs[bt], kind, C.int(len(t.source.Shape)), (*C.int64_t)(unsafe.Pointer(&t.source.Shape[0])))
C.ggml_set_name(tt, cname)
slog.Log(context.TODO(), logutil.LevelTrace, "created tensor", "name", name, "shape", t.source.Shape, "dtype", t.source.Kind, "buffer_type", C.GoString(C.ggml_backend_buft_name(bt)))
@@ -507,6 +516,99 @@ func (b *Backend) Load(ctx context.Context, progress func(float32)) error {
}
defer file.Close()
sr := io.NewSectionReader(file, int64(b.meta.Tensors().Offset+t.Offset), int64(t.Size()))
if t.Kind == 4 && tts[0]._type == 39 {
// source is mxfp4, target is ggml mxfp4
const BS = 17 // MXFP4 block size
bts := make([]byte, 8*BS*format.KibiByte) // ~128k block aligned
var s uint64
for s < t.Size() {
// Stop if either the parent context has been canceled or if any of the other tensors returned an error
if err := ctx.Err(); err != nil {
return err
}
n, err := io.ReadFull(sr, bts[:min(len(bts), int(t.Size()-s))])
if err != nil {
slog.Warn("file read error", "file", b.modelPath, "error", err)
return err
}
for j := range n / BS {
for i := 1; i < BS; i++ {
// swap nibbles
t_lo := bts[j*BS+i] & 0x0F
t_hi := bts[j*BS+i] & 0xF0
bts[j*BS+i] = (t_lo << 4) | (t_hi >> 4)
}
// transform aaaa...bbbb... to abababab...
oi := 0
tmp := [16]byte{}
for i := 1; i < 9; i++ {
blk_a0 := bts[j*BS+i] & 0xF0
blk_a1 := bts[j*BS+i] << 4
blk_b0 := bts[j*BS+i+8] >> 4
blk_b1 := bts[j*BS+i+8] & 0x0F
// swap once more
out0 := blk_a0 | blk_b0
out1 := blk_a1 | blk_b1
out_h0 := out0 & 0xF0
out_l0 := out0 & 0x0F
out_h1 := out1 & 0xF0
out_l1 := out1 & 0x0F
out0 = (out_h0 >> 4) | (out_l0 << 4)
out1 = (out_h1 >> 4) | (out_l1 << 4)
tmp[oi] = out0
oi++
tmp[oi] = out1
oi++
}
for i := range tmp {
bts[j*BS+i+1] = tmp[i]
}
}
for _, tt := range tts {
C.ggml_backend_tensor_set(tt, unsafe.Pointer(&bts[0]), C.size_t(s), C.size_t(n))
}
s += uint64(n)
if progress != nil {
done := doneBytes.Add(uint64(n))
progress(float32(done) / float32(totalBytes))
}
}
return nil
} else if strings.HasSuffix(t.Name, "_exps.bias") && t.Kind == 30 && tts[0]._type == 0 {
// source is bf16, target is ggml fp32
// data is bf16 but we need to convert to fp32
bts := make([]byte, 128*format.KibiByte)
var e uint64
for e < t.Elements() {
// Stop if either the parent context has been canceled or if any of the other tensors returned an error
if err := ctx.Err(); err != nil {
return err
}
n, err := io.ReadFull(sr, bts[:min(len(bts), int(t.Elements()-e)*2)])
if err != nil {
slog.Warn("file read error", "file", b.modelPath, "error", err)
return err
}
fp32 := ConvertToF32(bts, uint32(fsggml.TensorTypeBF16), uint64(n/2))
for _, tt := range tts {
C.ggml_backend_tensor_set(tt, unsafe.Pointer(&fp32[0]), C.size_t(e*4), C.size_t(n*2))
}
e += uint64(n / 2)
if progress != nil {
done := doneBytes.Add(uint64(n))
progress(float32(done) / float32(totalBytes))
}
}
return nil
}
bts := make([]byte, 128*format.KibiByte)
var s uint64
@@ -1063,6 +1165,13 @@ func (t *Tensor) MulmatID(ctx ml.Context, t2, ids ml.Tensor) ml.Tensor {
}
}
func (t *Tensor) AddID(ctx ml.Context, t2, ids ml.Tensor) ml.Tensor {
return &Tensor{
b: t.b,
t: C.ggml_add_id(ctx.(*Context).ctx, t.t, t2.(*Tensor).t, ids.(*Tensor).t),
}
}
func (t *Tensor) LayerNorm(ctx ml.Context, w, b ml.Tensor, eps float32) ml.Tensor {
tt := C.ggml_norm(ctx.(*Context).ctx, t.t, C.float(eps))
if w != nil {
@@ -1310,6 +1419,13 @@ func (t *Tensor) RELU(ctx ml.Context) ml.Tensor {
}
}
func (t *Tensor) SwiGLU(ctx ml.Context, up ml.Tensor, alpha, limit float32) ml.Tensor {
return &Tensor{
b: t.b,
t: C.ggml_swiglu_oai(ctx.(*Context).ctx, t.t, up.(*Tensor).t, C.float(alpha), C.float(limit)),
}
}
func (t *Tensor) Conv2D(ctx ml.Context, t2 ml.Tensor, s0, s1, p0, p1, d0, d1 int) ml.Tensor {
return &Tensor{
b: t.b,
@@ -1338,7 +1454,7 @@ func (t *Tensor) Set(ctx ml.Context, t2 ml.Tensor, offset int, strides ...int) m
return &Tensor{b: t.b, t: tt}
}
-func (t *Tensor) ScaledDotProductAttention(ctx ml.Context, key, value, mask ml.Tensor, scale float64) ml.Tensor {
+func (t *Tensor) ScaledDotProductAttention(ctx ml.Context, key, value, mask, sinks ml.Tensor, scale float64) ml.Tensor {
var kqMask *C.struct_ggml_tensor
if mask != nil {
kqMask = mask.(*Tensor).t
@@ -1351,6 +1467,9 @@ func (t *Tensor) ScaledDotProductAttention(ctx ml.Context, key, value, mask ml.T
value = value.Permute(ctx, 0, 2, 1, 3)
kqv := C.ggml_flash_attn_ext(ctx.(*Context).ctx, query.(*Tensor).t, key.(*Tensor).t, value.(*Tensor).t, kqMask, C.float(scale), 0, 0)
if sinks != nil {
C.ggml_flash_attn_ext_add_sinks(kqv, sinks.(*Tensor).t)
}
C.ggml_flash_attn_ext_set_prec(kqv, C.GGML_PREC_F32)
return &Tensor{b: t.b, t: kqv}
} else {
@@ -1359,6 +1478,9 @@ func (t *Tensor) ScaledDotProductAttention(ctx ml.Context, key, value, mask ml.T
b: t.b,
t: C.ggml_soft_max_ext(ctx.(*Context).ctx, kq.(*Tensor).t, kqMask, C.float(scale), 0),
}
if sinks != nil {
C.ggml_soft_max_add_sinks(kq.(*Tensor).t, sinks.(*Tensor).t)
}
kqv := value.Mulmat(ctx, kq)
return kqv.Permute(ctx, 0, 2, 1, 3).Contiguous(ctx)
+protect .rsync-filter
protect *.go
protect *-embed.*
-include include/
-include src/
-include src/CMakeLists.txt
-include src/**/CMakeLists.txt
-include src/ggml-blas/
-include src/ggml-cpu/
-include src/ggml-cpu/amx/
-include src/ggml-cpu/llamafile/
-include src/ggml-cuda/
-include src/ggml-cuda/vendors/
-include src/ggml-cuda/template-instances/
-include src/ggml-hip/
-include src/ggml-metal/
-include *.c
-include *.h
+protect ollama-*
+hide /CMakeLists.txt
+hide /include/ggml-webgpu.h
+include /cmake/
+include /cmake/common.cmake
+include /include/
+include /src/
+include /src/ggml-blas/
+include /src/ggml-cpu/
+include /src/ggml-cpu/amx/
+include /src/ggml-cpu/arch/
+include /src/ggml-cpu/arch/arm/
+include /src/ggml-cpu/arch/x86/
+include /src/ggml-cpu/llamafile/
+include /src/ggml-cuda/
+include /src/ggml-cuda/vendors/
+include /src/ggml-cuda/template-instances/
+include /src/ggml-hip/
+include /src/ggml-metal/
+include CMakeLists.txt
+include *.[chm]
include *.cpp
include *.cu
include *.cuh
-include *.m
include *.metal
-exclude *
+hide *
@@ -24,3 +24,27 @@ function(ggml_get_flags CCID CCVER)
set(GF_C_FLAGS ${C_FLAGS} PARENT_SCOPE)
set(GF_CXX_FLAGS ${CXX_FLAGS} PARENT_SCOPE)
endfunction()
function(ggml_get_system_arch)
if (CMAKE_OSX_ARCHITECTURES STREQUAL "arm64" OR
CMAKE_GENERATOR_PLATFORM_LWR STREQUAL "arm64" OR
(NOT CMAKE_OSX_ARCHITECTURES AND NOT CMAKE_GENERATOR_PLATFORM_LWR AND
CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64|arm.*|ARM64)$"))
set(GGML_SYSTEM_ARCH "ARM" PARENT_SCOPE)
elseif (CMAKE_OSX_ARCHITECTURES STREQUAL "x86_64" OR
CMAKE_GENERATOR_PLATFORM_LWR MATCHES "^(x86_64|i686|amd64|x64|win32)$" OR
(NOT CMAKE_OSX_ARCHITECTURES AND NOT CMAKE_GENERATOR_PLATFORM_LWR AND
CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|i686|AMD64|amd64)$"))
set(GGML_SYSTEM_ARCH "x86" PARENT_SCOPE)
elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "ppc|power")
set(GGML_SYSTEM_ARCH "PowerPC" PARENT_SCOPE)
elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "loongarch64")
set(GGML_SYSTEM_ARCH "loongarch64" PARENT_SCOPE)
elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "riscv64")
set(GGML_SYSTEM_ARCH "riscv64" PARENT_SCOPE)
elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "s390x")
set(GGML_SYSTEM_ARCH "s390x" PARENT_SCOPE)
else()
set(GGML_SYSTEM_ARCH "UNKNOWN" PARENT_SCOPE)
endif()
endfunction()
@@ -347,7 +347,7 @@ extern "C" {
typedef bool (*ggml_backend_eval_callback)(int node_index, struct ggml_tensor * t1, struct ggml_tensor * t2, void * user_data);
// Compare the output of two backends
-GGML_API bool ggml_backend_compare_graph_backend(ggml_backend_t backend1, ggml_backend_t backend2, struct ggml_cgraph * graph, ggml_backend_eval_callback callback, void * user_data);
+GGML_API bool ggml_backend_compare_graph_backend(ggml_backend_t backend1, ggml_backend_t backend2, struct ggml_cgraph * graph, ggml_backend_eval_callback callback, void * user_data, struct ggml_tensor * test_node);
// Tensor initialization
GGML_API enum ggml_status ggml_backend_tensor_alloc(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, void * addr);
@@ -101,6 +101,7 @@ extern "C" {
GGML_BACKEND_API int ggml_cpu_has_riscv_v (void);
GGML_BACKEND_API int ggml_cpu_has_vsx (void);
GGML_BACKEND_API int ggml_cpu_has_vxe (void);
GGML_BACKEND_API int ggml_cpu_has_nnpa (void);
GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
GGML_BACKEND_API int ggml_cpu_has_llamafile (void);
@@ -133,6 +134,7 @@ extern "C" {
GGML_BACKEND_API ggml_backend_reg_t ggml_backend_cpu_reg(void);
GGML_BACKEND_API void ggml_cpu_fp32_to_fp32(const float *, float *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp32_to_fp16(const float *, ggml_fp16_t *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp16_to_fp32(const ggml_fp16_t *, float *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp32_to_bf16(const float *, ggml_bf16_t *, int64_t);
#pragma once
#include "ggml.h"
#include "ggml-backend.h"
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __cplusplus
extern "C" {
#endif
#define GGML_KOMPUTE_MAX_DEVICES 16
struct ggml_vk_device {
int index;
int type; // same as VkPhysicalDeviceType
size_t heapSize;
const char * name;
const char * vendor;
int subgroupSize;
uint64_t bufferAlignment;
uint64_t maxAlloc;
};
struct ggml_vk_device * ggml_vk_available_devices(size_t memoryRequired, size_t * count);
bool ggml_vk_get_device(struct ggml_vk_device * device, size_t memoryRequired, const char * name);
bool ggml_vk_has_vulkan(void);
bool ggml_vk_has_device(void);
struct ggml_vk_device ggml_vk_current_device(void);
//
// backend API
//
// forward declaration
typedef struct ggml_backend * ggml_backend_t;
GGML_BACKEND_API ggml_backend_t ggml_backend_kompute_init(int device);
GGML_BACKEND_API bool ggml_backend_is_kompute(ggml_backend_t backend);
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_kompute_buffer_type(int device);
GGML_BACKEND_API ggml_backend_reg_t ggml_backend_kompute_reg(void);
#ifdef __cplusplus
}
#endif
@@ -128,6 +128,8 @@ extern "C" {
// set gradients to zero, initilize loss, and optionally reset the optimizer
GGML_API void ggml_opt_reset(ggml_opt_context_t opt_ctx, bool optimizer);
GGML_API bool ggml_opt_static_graphs(ggml_opt_context_t opt_ctx); // whether the graphs are allocated_statically
// get underlying tensors that store data
// if not using static graphs these pointers become invalid with the next call to ggml_opt_alloc
GGML_API struct ggml_tensor * ggml_opt_inputs( ggml_opt_context_t opt_ctx); // forward graph input tensor
@@ -304,6 +304,16 @@
GGML_TENSOR_LOCALS(int64_t, ne, dst, ne) \
GGML_TENSOR_LOCALS(size_t, nb, dst, nb)
#define GGML_TENSOR_TERNARY_OP_LOCALS \
GGML_TENSOR_LOCALS(int64_t, ne0, src0, ne) \
GGML_TENSOR_LOCALS(size_t, nb0, src0, nb) \
GGML_TENSOR_LOCALS(int64_t, ne1, src1, ne) \
GGML_TENSOR_LOCALS(size_t, nb1, src1, nb) \
GGML_TENSOR_LOCALS(int64_t, ne2, src2, ne) \
GGML_TENSOR_LOCALS(size_t, nb2, src2, nb) \
GGML_TENSOR_LOCALS(int64_t, ne, dst, ne) \
GGML_TENSOR_LOCALS(size_t, nb, dst, nb)
#define GGML_TENSOR_BINARY_OP_LOCALS01 \
GGML_TENSOR_LOCALS(int64_t, ne0, src0, ne) \
GGML_TENSOR_LOCALS(size_t, nb0, src0, nb) \
@@ -314,6 +324,13 @@
extern "C" {
#endif
// Function type used in fatal error callbacks
typedef void (*ggml_abort_callback_t)(const char * error_message);
// Set the abort callback (passing null will restore original abort functionality: printing a message to stdout)
// Returns the old callback for chaining
GGML_API ggml_abort_callback_t ggml_set_abort_callback(ggml_abort_callback_t callback);
GGML_NORETURN GGML_ATTRIBUTE_FORMAT(3, 4)
GGML_API void ggml_abort(const char * file, int line, const char * fmt, ...);
@@ -353,7 +370,7 @@ extern "C" {
GGML_TYPE_F16 = 1,
GGML_TYPE_Q4_0 = 2,
GGML_TYPE_Q4_1 = 3,
-GGML_TYPE_MXFP4 = 4, // Formerly removed type GGML_TYPE_Q4_2
+// GGML_TYPE_Q4_2 = 4, support has been removed
// GGML_TYPE_Q4_3 = 5, support has been removed
GGML_TYPE_Q5_0 = 6,
GGML_TYPE_Q5_1 = 7,
@@ -388,7 +405,8 @@ extern "C" {
// GGML_TYPE_IQ4_NL_4_4 = 36,
// GGML_TYPE_IQ4_NL_4_8 = 37,
// GGML_TYPE_IQ4_NL_8_8 = 38,
-GGML_TYPE_COUNT = 39,
+GGML_TYPE_MXFP4 = 39, // MXFP4 (1 block)
GGML_TYPE_COUNT = 40,
};
// precision
@@ -423,6 +441,7 @@ extern "C" {
GGML_FTYPE_MOSTLY_IQ4_XS = 22, // except 1d tensors
GGML_FTYPE_MOSTLY_IQ1_M = 23, // except 1d tensors
GGML_FTYPE_MOSTLY_BF16 = 24, // except 1d tensors
GGML_FTYPE_MOSTLY_MXFP4 = 25, // except 1d tensors
};
// available tensor operations:
@@ -431,6 +450,7 @@ extern "C" {
GGML_OP_DUP,
GGML_OP_ADD,
GGML_OP_ADD_ID,
GGML_OP_ADD1,
GGML_OP_ACC,
GGML_OP_SUB,
@@ -470,6 +490,7 @@ extern "C" {
GGML_OP_TRANSPOSE,
GGML_OP_GET_ROWS,
GGML_OP_GET_ROWS_BACK,
GGML_OP_SET_ROWS,
GGML_OP_DIAG,
GGML_OP_DIAG_MASK_INF,
GGML_OP_DIAG_MASK_ZERO,
@@ -481,14 +502,16 @@ extern "C" {
GGML_OP_CONV_TRANSPOSE_1D,
GGML_OP_IM2COL,
GGML_OP_IM2COL_BACK,
GGML_OP_CONV_2D,
GGML_OP_CONV_2D_DW,
GGML_OP_CONV_TRANSPOSE_2D,
GGML_OP_POOL_1D,
GGML_OP_POOL_2D,
GGML_OP_POOL_2D_BACK,
-GGML_OP_UPSCALE, // nearest interpolate
+GGML_OP_UPSCALE,
GGML_OP_PAD,
GGML_OP_PAD_REFLECT_1D,
GGML_OP_ROLL,
GGML_OP_ARANGE,
GGML_OP_TIMESTEP_EMBEDDING,
GGML_OP_ARGSORT,
@@ -518,6 +541,8 @@ extern "C" {
GGML_OP_CROSS_ENTROPY_LOSS_BACK,
GGML_OP_OPT_STEP_ADAMW,
GGML_OP_GLU,
GGML_OP_COUNT,
};
@@ -536,10 +561,22 @@ extern "C" {
GGML_UNARY_OP_HARDSWISH,
GGML_UNARY_OP_HARDSIGMOID,
GGML_UNARY_OP_EXP,
GGML_UNARY_OP_GELU_ERF,
GGML_UNARY_OP_COUNT,
};
enum ggml_glu_op {
GGML_GLU_OP_REGLU,
GGML_GLU_OP_GEGLU,
GGML_GLU_OP_SWIGLU,
GGML_GLU_OP_SWIGLU_OAI,
GGML_GLU_OP_GEGLU_ERF,
GGML_GLU_OP_GEGLU_QUICK,
GGML_GLU_OP_COUNT,
};
enum ggml_object_type {
GGML_OBJECT_TYPE_TENSOR,
GGML_OBJECT_TYPE_GRAPH,
@@ -625,6 +662,9 @@ extern "C" {
// misc
GGML_API const char * ggml_version(void);
GGML_API const char * ggml_commit(void);
GGML_API void ggml_time_init(void); // call this once at the beginning of the program
GGML_API int64_t ggml_time_ms(void);
GGML_API int64_t ggml_time_us(void);
@@ -655,6 +695,7 @@ extern "C" {
GGML_API const char * ggml_op_symbol(enum ggml_op op);
GGML_API const char * ggml_unary_op_name(enum ggml_unary_op op);
GGML_API const char * ggml_glu_op_name(enum ggml_glu_op op);
GGML_API const char * ggml_op_desc(const struct ggml_tensor * t); // unary or op name
GGML_API size_t ggml_element_size(const struct ggml_tensor * tensor);
@@ -685,6 +726,9 @@ extern "C" {
// true for tensor that is stored in memory as CxWxHxN and has been permuted to WxHxCxN
GGML_API bool ggml_is_contiguous_channels(const struct ggml_tensor * tensor);
// true if the elements in dimension 0 are contiguous, or there is just 1 block of elements
GGML_API bool ggml_is_contiguous_rows(const struct ggml_tensor * tensor);
GGML_API bool ggml_are_same_shape (const struct ggml_tensor * t0, const struct ggml_tensor * t1);
GGML_API bool ggml_are_same_stride(const struct ggml_tensor * t0, const struct ggml_tensor * t1);
@@ -756,6 +800,7 @@ extern "C" {
GGML_API void ggml_unravel_index(const struct ggml_tensor * tensor, int64_t i, int64_t * i0, int64_t * i1, int64_t * i2, int64_t * i3);
GGML_API enum ggml_unary_op ggml_get_unary_op(const struct ggml_tensor * tensor);
GGML_API enum ggml_glu_op ggml_get_glu_op(const struct ggml_tensor * tensor);
GGML_API void * ggml_get_data (const struct ggml_tensor * tensor);
GGML_API float * ggml_get_data_f32(const struct ggml_tensor * tensor);
@@ -800,6 +845,13 @@ extern "C" {
struct ggml_tensor * b,
enum ggml_type type);
// dst[i0, i1, i2] = a[i0, i1, i2] + b[i0, ids[i1, i2]]
GGML_API struct ggml_tensor * ggml_add_id(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
struct ggml_tensor * ids);
GGML_API struct ggml_tensor * ggml_add1(
struct ggml_context * ctx,
struct ggml_tensor * a,
@@ -934,6 +986,15 @@ extern "C" {
struct ggml_tensor * a,
struct ggml_tensor * b);
// repeat a to the specified shape
GGML_API struct ggml_tensor * ggml_repeat_4d(
struct ggml_context * ctx,
struct ggml_tensor * a,
int64_t ne0,
int64_t ne1,
int64_t ne2,
int64_t ne3);
// sums repetitions in a into shape of b
GGML_API struct ggml_tensor * ggml_repeat_back(
struct ggml_context * ctx,
@@ -1024,6 +1085,16 @@ extern "C" {
struct ggml_context * ctx,
struct ggml_tensor * a);
// GELU using erf (error function) when possible
// some backends may fallback to approximation based on Abramowitz and Stegun formula
GGML_API struct ggml_tensor * ggml_gelu_erf(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_gelu_erf_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_gelu_quick(
struct ggml_context * ctx,
struct ggml_tensor * a);
@@ -1065,6 +1136,96 @@ extern "C" {
struct ggml_context * ctx,
struct ggml_tensor * a);
// gated linear unit ops
// A: n columns, r rows,
// result is n / 2 columns, r rows,
// expects gate in second half of row, unless swapped is true
GGML_API struct ggml_tensor * ggml_glu(
struct ggml_context * ctx,
struct ggml_tensor * a,
enum ggml_glu_op op,
bool swapped);
GGML_API struct ggml_tensor * ggml_reglu(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_reglu_swapped(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_geglu(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_geglu_swapped(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_swiglu(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_swiglu_swapped(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_geglu_erf(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_geglu_erf_swapped(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_geglu_quick(
struct ggml_context * ctx,
struct ggml_tensor * a);
GGML_API struct ggml_tensor * ggml_geglu_quick_swapped(
struct ggml_context * ctx,
struct ggml_tensor * a);
// A: n columns, r rows,
// B: n columns, r rows,
GGML_API struct ggml_tensor * ggml_glu_split(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
enum ggml_glu_op op);
GGML_API struct ggml_tensor * ggml_reglu_split(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b);
GGML_API struct ggml_tensor * ggml_geglu_split(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b);
GGML_API struct ggml_tensor * ggml_swiglu_split(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b);
GGML_API struct ggml_tensor * ggml_geglu_erf_split(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b);
GGML_API struct ggml_tensor * ggml_geglu_quick_split(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b);
GGML_API struct ggml_tensor * ggml_swiglu_oai(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
float alpha,
float limit);
// normalize along rows
GGML_API struct ggml_tensor * ggml_norm(
struct ggml_context * ctx,
@@ -1164,6 +1325,19 @@ extern "C" {
struct ggml_tensor * a,
float s);
// x = s * a + b
GGML_API struct ggml_tensor * ggml_scale_bias(
struct ggml_context * ctx,
struct ggml_tensor * a,
float s,
float b);
GGML_API struct ggml_tensor * ggml_scale_bias_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a,
float s,
float b);
// b -> view(a,offset,nb1,nb2,3), return modified a
GGML_API struct ggml_tensor * ggml_set(
struct ggml_context * ctx,
@@ -1354,6 +1528,23 @@ extern "C" {
struct ggml_tensor * b, // row indices
struct ggml_tensor * c); // data for ggml_get_rows, only used for its shape
// a TD [n_embd, ne1, ne2, ne3]
// b TS [n_embd, n_rows, ne02, ne03] | ne02 == ne2, ne03 == ne3
// c I64 [n_rows, ne11, ne12, 1] | c[i] in [0, ne1)
//
// undefined behavior if destination rows overlap
//
// broadcast:
// ne2 % ne11 == 0
// ne3 % ne12 == 0
//
// return view(a)
GGML_API struct ggml_tensor * ggml_set_rows(
struct ggml_context * ctx,
struct ggml_tensor * a, // destination
struct ggml_tensor * b, // source
struct ggml_tensor * c); // row indices
GGML_API struct ggml_tensor * ggml_diag(
struct ggml_context * ctx,
struct ggml_tensor * a);
@@ -1391,8 +1582,14 @@ extern "C" {
struct ggml_context * ctx,
struct ggml_tensor * a);
// a [ne0, ne01, ne02, ne03]
// mask [ne0, ne11, ne12, ne13] | ne11 >= ne01, F16 or F32, optional
//
// broadcast:
// ne02 % ne12 == 0
// ne03 % ne13 == 0
//
// fused soft_max(a*scale + mask*(ALiBi slope))
// mask is optional
// max_bias = 0.0f for no ALiBi
GGML_API struct ggml_tensor * ggml_soft_max_ext(
struct ggml_context * ctx,
@@ -1401,6 +1598,10 @@ extern "C" {
float scale,
float max_bias);
GGML_API void ggml_soft_max_add_sinks(
struct ggml_tensor * a,
struct ggml_tensor * sinks);
GGML_API struct ggml_tensor * ggml_soft_max_ext_back(
struct ggml_context * ctx,
struct ggml_tensor * a,
@@ -1702,6 +1903,17 @@ extern "C" {
struct ggml_tensor * b,
int stride);
GGML_API struct ggml_tensor * ggml_conv_2d_direct(
struct ggml_context * ctx,
struct ggml_tensor * a, // convolution kernel [KW, KH, IC, OC]
struct ggml_tensor * b, // input data [W, H, C, N]
int s0, // stride dimension 0
int s1, // stride dimension 1
int p0, // padding dimension 0
int p1, // padding dimension 1
int d0, // dilation dimension 0
int d1); // dilation dimension 1
enum ggml_op_pool {
GGML_OP_POOL_MAX,
GGML_OP_POOL_AVG,
@@ -1744,6 +1956,12 @@ extern "C" {
enum ggml_scale_mode {
GGML_SCALE_MODE_NEAREST = 0,
GGML_SCALE_MODE_BILINEAR = 1,
GGML_SCALE_MODE_COUNT
};
enum ggml_scale_flag {
GGML_SCALE_FLAG_ALIGN_CORNERS = (1 << 8)
}; };
// interpolate
@@ -1756,14 +1974,26 @@ extern "C" {
// interpolate
// interpolate scale to specified dimensions
-GGML_API struct ggml_tensor * ggml_upscale_ext(
+GGML_DEPRECATED(GGML_API struct ggml_tensor * ggml_upscale_ext(
struct ggml_context * ctx,
struct ggml_tensor * a,
int ne0,
int ne1,
int ne2,
int ne3,
-enum ggml_scale_mode mode);
+enum ggml_scale_mode mode),
"use ggml_interpolate instead");
// Up- or downsamples the input to the specified size.
// 2D scale modes (eg. bilinear) are applied to the first two dimensions.
GGML_API struct ggml_tensor * ggml_interpolate(
struct ggml_context * ctx,
struct ggml_tensor * a,
int64_t ne0,
int64_t ne1,
int64_t ne2,
int64_t ne3,
uint32_t mode); // ggml_scale_mode [ | ggml_scale_flag...]
// pad each dimension with zeros: [x, ..., x] -> [x, ..., x, 0, ..., 0]
GGML_API struct ggml_tensor * ggml_pad(
@@ -1781,6 +2011,17 @@ extern "C" {
int p0,
int p1);
// Move tensor elements by an offset given for each dimension. Elements that
// are shifted beyond the last position are wrapped around to the beginning.
GGML_API struct ggml_tensor * ggml_roll(
struct ggml_context * ctx,
struct ggml_tensor * a,
int shift0,
int shift1,
int shift2,
int shift3);
// Ref: https://github.com/CompVis/stable-diffusion/blob/main/ldm/modules/diffusionmodules/util.py#L151
// timesteps: [N,]
// return: [N, dim]
@@ -1815,11 +2056,17 @@ extern "C" {
#define GGML_KQ_MASK_PAD 64
-// q:    [n_embd_k, n_batch,     n_head,    1]
+// q:    [n_embd_k, n_batch,     n_head,    ne3 ]
-// k:    [n_embd_k, n_kv,        n_head_kv, 1]
+// k:    [n_embd_k, n_kv,        n_head_kv, ne3 ]
-// v:    [n_embd_v, n_kv,        n_head_kv, 1] !! not transposed !!
+// v:    [n_embd_v, n_kv,        n_head_kv, ne3 ] !! not transposed !!
-// mask: [n_kv, n_batch_pad, 1, 1] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
+// mask: [n_kv, n_batch_pad, ne32, ne33] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
-// res:  [n_embd_v, n_head, n_batch, 1] !! permuted !!
+// res:  [n_embd_v, n_head, n_batch, ne3 ] !! permuted !!
//
// broadcast:
// n_head % n_head_kv == 0
// n_head % ne32 == 0
// ne3 % ne33 == 0
//
GGML_API struct ggml_tensor * ggml_flash_attn_ext(
struct ggml_context * ctx,
struct ggml_tensor * q,
@@ -1837,6 +2084,10 @@ extern "C" {
GGML_API enum ggml_prec ggml_flash_attn_ext_get_prec(
const struct ggml_tensor * a);
GGML_API void ggml_flash_attn_ext_add_sinks(
struct ggml_tensor * a,
struct ggml_tensor * sinks);
// TODO: needs to be adapted to ggml_flash_attn_ext
GGML_API struct ggml_tensor * ggml_flash_attn_back(
struct ggml_context * ctx,
@@ -1858,7 +2109,8 @@ extern "C" {
struct ggml_tensor * dt,
struct ggml_tensor * A,
struct ggml_tensor * B,
-struct ggml_tensor * C);
+struct ggml_tensor * C,
struct ggml_tensor * ids);
// partition into non-overlapping windows with padding if needed
// example:
@@ -2075,9 +2327,6 @@ extern "C" {
GGML_API struct ggml_tensor * ggml_graph_get_grad (const struct ggml_cgraph * cgraph, const struct ggml_tensor * node);
GGML_API struct ggml_tensor * ggml_graph_get_grad_acc(const struct ggml_cgraph * cgraph, const struct ggml_tensor * node);
GGML_API void ggml_graph_export(const struct ggml_cgraph * cgraph, const char * fname);
GGML_API struct ggml_cgraph * ggml_graph_import(const char * fname, struct ggml_context ** ctx_data, struct ggml_context ** ctx_eval);
// print info and performance information for the graph
GGML_API void ggml_graph_print(const struct ggml_cgraph * cgraph);
@@ -2161,6 +2410,7 @@ extern "C" {
// scheduling priorities
enum ggml_sched_priority {
GGML_SCHED_PRIO_LOW = -1,
GGML_SCHED_PRIO_NORMAL,
GGML_SCHED_PRIO_MEDIUM,
GGML_SCHED_PRIO_HIGH,
@@ -109,6 +109,8 @@ if (MSVC)
else ()
set(CMAKE_GENERATOR_PLATFORM_LWR "")
endif ()
ggml_get_system_arch()
message(STATUS "GGML_SYSTEM_ARCH: ${GGML_SYSTEM_ARCH}")
if (NOT MSVC)
if (GGML_STATIC)
@@ -123,7 +125,6 @@ if (NOT MSVC)
endif()
if (MINGW)
# Target Windows 8 for PrefetchVirtualMemory
add_compile_definitions(_WIN32_WINNT=${GGML_WIN_VER})
endif()
@@ -194,6 +195,7 @@ add_library(ggml-base
../include/ggml-opt.h
../include/gguf.h
ggml.c
ggml.cpp
ggml-alloc.c
ggml-backend.cpp
ggml-opt.cpp
@@ -210,6 +212,14 @@ endif()
add_library(ggml
ggml-backend-reg.cpp)
add_library(ggml::ggml ALIAS ggml)
if (GGML_BACKEND_DIR)
if (NOT GGML_BACKEND_DL)
message(FATAL_ERROR "GGML_BACKEND_DIR requires GGML_BACKEND_DL")
endif()
target_compile_definitions(ggml PUBLIC GGML_BACKEND_DIR="${GGML_BACKEND_DIR}")
endif()
target_link_libraries(ggml PUBLIC ggml-base)
@@ -224,6 +234,11 @@ function(ggml_add_backend_library backend)
set_target_properties(${backend} PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_RUNTIME_OUTPUT_DIRECTORY})
target_compile_definitions(${backend} PRIVATE GGML_BACKEND_DL)
add_dependencies(ggml ${backend})
if (GGML_BACKEND_DIR)
install(TARGETS ${backend} LIBRARY DESTINATION ${GGML_BACKEND_DIR})
else()
install(TARGETS ${backend} LIBRARY DESTINATION ${CMAKE_INSTALL_BINDIR})
endif()
else()
add_library(${backend} ${ARGN})
target_link_libraries(ggml PUBLIC ${backend})
@@ -266,17 +281,27 @@ endfunction()
function(ggml_add_cpu_backend_variant tag_name)
set(GGML_CPU_TAG_NAME ${tag_name})
# other: OPENMP LLAMAFILE CPU_HBM
-foreach (feat NATIVE
-SSE42
-AVX AVX2 BMI2 AVX_VNNI FMA F16C
-AVX512 AVX512_VBMI AVX512_VNNI AVX512_BF16
-AMX_TILE AMX_INT8 AMX_BF16)
-set(GGML_${feat} OFF)
-endforeach()
-
-foreach (feat ${ARGN})
-set(GGML_${feat} ON)
-endforeach()
+if (GGML_SYSTEM_ARCH STREQUAL "x86")
+foreach (feat NATIVE
+SSE42
+AVX AVX2 BMI2 AVX_VNNI FMA F16C
+AVX512 AVX512_VBMI AVX512_VNNI AVX512_BF16
+AMX_TILE AMX_INT8 AMX_BF16)
+set(GGML_${feat} OFF)
+endforeach()
+
+foreach (feat ${ARGN})
+set(GGML_${feat} ON)
+endforeach()
+elseif (GGML_SYSTEM_ARCH STREQUAL "ARM")
+foreach (feat ${ARGN})
+set(GGML_INTERNAL_${feat} ON)
+endforeach()
+elseif (GGML_SYSTEM_ARCH STREQUAL "PowerPC")
+foreach (feat ${ARGN})
+set(GGML_INTERNAL_${feat} ON)
+endforeach()
+endif()
ggml_add_cpu_backend_variant_impl(${tag_name})
add_dependencies(ggml-cpu ggml-cpu-${tag_name})
@@ -287,15 +312,60 @@ ggml_add_backend(CPU)
if (GGML_CPU_ALL_VARIANTS)
if (NOT GGML_BACKEND_DL)
message(FATAL_ERROR "GGML_CPU_ALL_VARIANTS requires GGML_BACKEND_DL")
elseif (GGML_CPU_ARM_ARCH)
message(FATAL_ERROR "Cannot use both GGML_CPU_ARM_ARCH and GGML_CPU_ALL_VARIANTS")
endif()
add_custom_target(ggml-cpu)
-ggml_add_cpu_backend_variant(x64)
-ggml_add_cpu_backend_variant(sse42 SSE42)
-ggml_add_cpu_backend_variant(sandybridge SSE42 AVX)
-ggml_add_cpu_backend_variant(haswell SSE42 AVX F16C AVX2 BMI2 FMA)
-ggml_add_cpu_backend_variant(skylakex SSE42 AVX F16C AVX2 BMI2 FMA AVX512)
-ggml_add_cpu_backend_variant(icelake SSE42 AVX F16C AVX2 BMI2 FMA AVX512 AVX512_VBMI AVX512_VNNI)
-ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C AVX2 BMI2 FMA AVX_VNNI)
+if (GGML_SYSTEM_ARCH STREQUAL "x86")
+ggml_add_cpu_backend_variant(x64)
+ggml_add_cpu_backend_variant(sse42 SSE42)
+ggml_add_cpu_backend_variant(sandybridge SSE42 AVX)
+ggml_add_cpu_backend_variant(haswell SSE42 AVX F16C AVX2 BMI2 FMA)
+ggml_add_cpu_backend_variant(skylakex SSE42 AVX F16C AVX2 BMI2 FMA AVX512)
+ggml_add_cpu_backend_variant(icelake SSE42 AVX F16C AVX2 BMI2 FMA AVX512 AVX512_VBMI AVX512_VNNI)
+ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C AVX2 BMI2 FMA AVX_VNNI)
elseif(GGML_SYSTEM_ARCH STREQUAL "ARM")
if (CMAKE_SYSTEM_NAME MATCHES "Linux")
# Many of these features are optional so we build versions with popular
# combinations and name the backends based on the version they were
# first released with
ggml_add_cpu_backend_variant(armv8.0_1)
ggml_add_cpu_backend_variant(armv8.2_1 DOTPROD)
ggml_add_cpu_backend_variant(armv8.2_2 DOTPROD FP16_VECTOR_ARITHMETIC)
ggml_add_cpu_backend_variant(armv8.2_3 DOTPROD FP16_VECTOR_ARITHMETIC SVE)
ggml_add_cpu_backend_variant(armv8.6_1 DOTPROD FP16_VECTOR_ARITHMETIC SVE MATMUL_INT8)
ggml_add_cpu_backend_variant(armv8.6_2 DOTPROD FP16_VECTOR_ARITHMETIC SVE MATMUL_INT8 SVE2)
ggml_add_cpu_backend_variant(armv9.2_1 DOTPROD FP16_VECTOR_ARITHMETIC SVE MATMUL_INT8 SME)
ggml_add_cpu_backend_variant(armv9.2_2 DOTPROD FP16_VECTOR_ARITHMETIC SVE MATMUL_INT8 SVE2 SME)
elseif (CMAKE_SYSTEM_NAME MATCHES "Android")
# Android-specific backends with SoC-compatible feature sets
ggml_add_cpu_backend_variant(android_armv8.0_1)
ggml_add_cpu_backend_variant(android_armv8.2_1 DOTPROD)
ggml_add_cpu_backend_variant(android_armv8.2_2 DOTPROD FP16_VECTOR_ARITHMETIC)
ggml_add_cpu_backend_variant(android_armv8.6_1 DOTPROD FP16_VECTOR_ARITHMETIC MATMUL_INT8)
elseif (APPLE)
ggml_add_cpu_backend_variant(apple_m1 DOTPROD)
ggml_add_cpu_backend_variant(apple_m2_m3 DOTPROD MATMUL_INT8)
ggml_add_cpu_backend_variant(apple_m4 DOTPROD MATMUL_INT8 NOSVE SME)
else()
message(FATAL_ERROR "Unsupported ARM target OS: ${CMAKE_SYSTEM_NAME}")
endif()
elseif (GGML_SYSTEM_ARCH STREQUAL "PowerPC")
if (CMAKE_SYSTEM_NAME MATCHES "Linux")
ggml_add_cpu_backend_variant(power0)
ggml_add_cpu_backend_variant(power7_1 POWER7)
ggml_add_cpu_backend_variant(power7_2 POWER7 VSX)
ggml_add_cpu_backend_variant(power8_1 POWER8)
ggml_add_cpu_backend_variant(power8_2 POWER8 VSX)
ggml_add_cpu_backend_variant(power9 POWER9 VSX)
ggml_add_cpu_backend_variant(power10 POWER10 VSX)
ggml_add_cpu_backend_variant(power11 POWER11 VSX)
else()
message(FATAL_ERROR "Unsupported PowerPC target OS: ${CMAKE_SYSTEM_NAME}")
endif()
else()
message(FATAL_ERROR "GGML_CPU_ALL_VARIANTS not yet supported with ${GGML_SYSTEM_ARCH} on ${CMAKE_SYSTEM_NAME}")
endif()
elseif (GGML_CPU)
ggml_add_cpu_backend_variant_impl("")
endif()
@@ -304,12 +374,12 @@ ggml_add_backend(BLAS)
ggml_add_backend(CANN)
ggml_add_backend(CUDA)
ggml_add_backend(HIP)
ggml_add_backend(Kompute)
ggml_add_backend(METAL)
ggml_add_backend(MUSA)
ggml_add_backend(RPC)
ggml_add_backend(SYCL)
ggml_add_backend(Vulkan)
ggml_add_backend(WebGPU)
ggml_add_backend(OpenCL)
foreach (target ggml-base ggml)
@@ -22,21 +22,6 @@ static bool ggml_is_view(const struct ggml_tensor * t) {
return t->view_src != NULL;
}
static bool ggml_are_same_layout(const struct ggml_tensor * a, const struct ggml_tensor * b) {
if (a->type != b->type) {
return false;
}
for (int i = 0; i < GGML_MAX_DIMS; i++) {
if (a->ne[i] != b->ne[i]) {
return false;
}
if (a->nb[i] != b->nb[i]) {
return false;
}
}
return true;
}
// ops that return true for this function must not use restrict pointers for their backend implementations
static bool ggml_op_can_inplace(enum ggml_op op) {
switch (op) {
@@ -44,6 +29,7 @@ static bool ggml_op_can_inplace(enum ggml_op op) {
case GGML_OP_DIAG_MASK_ZERO:
case GGML_OP_DIAG_MASK_INF:
case GGML_OP_ADD:
case GGML_OP_ADD_ID:
case GGML_OP_ADD1:
case GGML_OP_SUB:
case GGML_OP_MUL:
@@ -45,6 +45,10 @@
#include "ggml-vulkan.h"
#endif
#ifdef GGML_USE_WEBGPU
#include "ggml-webgpu.h"
#endif
#ifdef GGML_USE_OPENCL
#include "ggml-opencl.h"
#endif
@@ -61,14 +65,13 @@
#include "ggml-cann.h"
#endif
#ifdef GGML_USE_KOMPUTE
#include "ggml-kompute.h"
#endif
// disable C++17 deprecation warning for std::codecvt_utf8
#if defined(__clang__)
# pragma clang diagnostic push
# pragma clang diagnostic ignored "-Wdeprecated-declarations"
#elif defined(__GNUC__)
# pragma GCC diagnostic push
# pragma GCC diagnostic ignored "-Wdeprecated-declarations"
#endif
namespace fs = std::filesystem;
...@@ -91,6 +94,8 @@ static std::string path_str(const fs::path & path) {
#if defined(__clang__)
# pragma clang diagnostic pop
#elif defined(__GNUC__)
# pragma GCC diagnostic pop
#endif
#ifdef _WIN32
...@@ -172,6 +177,9 @@ struct ggml_backend_registry {
#ifdef GGML_USE_VULKAN
register_backend(ggml_backend_vk_reg());
#endif
#ifdef GGML_USE_WEBGPU
register_backend(ggml_backend_webgpu_reg());
#endif
#ifdef GGML_USE_OPENCL
register_backend(ggml_backend_opencl_reg());
#endif
...@@ -184,9 +192,6 @@ struct ggml_backend_registry {
#ifdef GGML_USE_RPC
register_backend(ggml_backend_rpc_reg());
#endif
#ifdef GGML_USE_KOMPUTE
register_backend(ggml_backend_kompute_reg());
#endif
#ifdef GGML_USE_CPU
register_backend(ggml_backend_cpu_reg());
#endif
...@@ -498,6 +503,9 @@ static ggml_backend_reg_t ggml_backend_load_best(const char * name, bool silent,
std::vector<fs::path> search_paths;
if (user_search_path == nullptr) {
#ifdef GGML_BACKEND_DIR
search_paths.push_back(fs::u8path(GGML_BACKEND_DIR));
#endif
// default search paths: executable directory, current directory
search_paths.push_back(get_executable_path());
search_paths.push_back(fs::current_path());
...@@ -576,14 +584,13 @@ void ggml_backend_load_all_from_path(const char * dir_path) {
// Avoid mixed hip+cuda configurations
const char * hip_devices = std::getenv("HIP_VISIBLE_DEVICES");
const char * rocr_devices = std::getenv("ROCR_VISIBLE_DEVICES");
if (!hip_devices && !rocr_devices) {
ggml_backend_load_best("cuda", silent, dir_path);
} else {
ggml_backend_load_best("hip", silent, dir_path);
}
ggml_backend_load_best("kompute", silent, dir_path);
ggml_backend_load_best("metal", silent, dir_path);
ggml_backend_load_best("rpc", silent, dir_path);
ggml_backend_load_best("sycl", silent, dir_path);
...
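To make the loading behaviour above concrete, here is a minimal sketch of how a host application can drive the dynamic backend registry after this sync, assuming only the public ggml-backend entry points; the search directory is an assumption, not something taken from the patch.

#include <stdio.h>
#include "ggml-backend.h"

int main(void) {
    // Scan a directory for backend shared libraries and register the best
    // loadable variant of each; as in the loader above, setting
    // HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES makes "hip" win over "cuda".
    ggml_backend_load_all_from_path("./lib/ollama");   // assumed path

    // Enumerate whatever ended up registered (CPU, CUDA, WebGPU, ...).
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("device %zu: %s\n", i, ggml_backend_dev_name(dev));
    }
    return 0;
}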
...@@ -368,21 +368,6 @@ ggml_backend_dev_t ggml_backend_get_device(ggml_backend_t backend) {
// backend copy
static bool ggml_are_same_layout(const struct ggml_tensor * a, const struct ggml_tensor * b) {
if (a->type != b->type) {
return false;
}
for (int i = 0; i < GGML_MAX_DIMS; i++) {
if (a->ne[i] != b->ne[i]) {
return false;
}
if (a->nb[i] != b->nb[i]) {
return false;
}
}
return true;
}
void ggml_backend_tensor_copy(struct ggml_tensor * src, struct ggml_tensor * dst) {
GGML_ASSERT(ggml_are_same_layout(src, dst) && "cannot copy tensors with different layouts");
...@@ -679,6 +664,7 @@ struct ggml_backend_sched {
// pipeline parallelism support
int n_copies;
int cur_copy;
int next_copy;
ggml_backend_event_t events[GGML_SCHED_MAX_BACKENDS][GGML_SCHED_MAX_COPIES];
struct ggml_tensor * graph_inputs[GGML_SCHED_MAX_SPLIT_INPUTS];
int n_graph_inputs;
...@@ -834,8 +820,9 @@ static void ggml_backend_sched_print_assignments(ggml_backend_sched_t sched, str
}
if (sched->debug > 1) {
ggml_backend_t tensor_backend = ggml_backend_sched_get_tensor_backend(sched, node);
GGML_LOG_DEBUG("node #%3d (%10.10s): %20.20s (%5.5s) [%5.5s %8.8s]:", i, ggml_op_name(node->op), node->name,
    fmt_size(ggml_nbytes(node)), tensor_backend ? ggml_backend_name(tensor_backend) : "NULL", GET_CAUSE(node));
GGML_LOG_DEBUG("node #%3d (%10.10s): %20.20s (%5.5s) [%5.5s %8.8s] use=%d:", i, ggml_op_name(node->op), node->name,
    fmt_size(ggml_nbytes(node)), tensor_backend ? ggml_backend_name(tensor_backend) : "NULL", GET_CAUSE(node),
    graph->use_counts[ggml_hash_find(&graph->visited_hash_set, node)]);
for (int j = 0; j < GGML_MAX_SRC; j++) {
struct ggml_tensor * src = node->src[j];
if (src == NULL) {
...@@ -1101,6 +1088,11 @@ static void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct gg
}
}
}
// if the node is still unassigned, assign it to the first backend that supports it
for (int b = 0; b < sched->n_backends && *cur_backend_id == -1; b++) {
ggml_backend_sched_set_if_supported(sched, node, b, cur_backend_id);
}
GGML_ASSERT(*cur_backend_id != -1);
}
// pass 5: split graph, find tensors that need to be copied
...@@ -1128,7 +1120,7 @@ static void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct gg
const int node_backend_id = tensor_backend_id(node);
assert(node_backend_id != -1); // all nodes should be assigned by now, this can happen if there is no CPU fallback
GGML_ASSERT(node_backend_id != -1); // all nodes should be assigned by now, this can happen if there is no CPU fallback
// check if we should start a new split based on the sources of the current node
bool need_new_split = false;
...@@ -1186,7 +1178,7 @@ static void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct gg
size_t src_id = hash_id(src);
const int src_backend_id = sched->hv_tensor_backend_ids[src_id];
assert(src_backend_id != -1); // all inputs should be assigned by now
GGML_ASSERT(src_backend_id != -1); // all inputs should be assigned by now
if (src->flags & GGML_TENSOR_FLAG_INPUT && sched->n_copies > 1) {
if (tensor_id_copy(src_id, src_backend_id, 0) == NULL) {
...@@ -1357,7 +1349,10 @@ static bool ggml_backend_sched_alloc_splits(ggml_backend_sched_t sched) {
// allocate graph
if (backend_ids_changed || !ggml_gallocr_alloc_graph(sched->galloc, &sched->graph)) {
// the re-allocation may cause the split inputs to be moved to a different address
ggml_backend_sched_synchronize(sched);
// synchronize without ggml_backend_sched_synchronize to avoid changing cur_copy
for (int i = 0; i < sched->n_backends; i++) {
ggml_backend_synchronize(sched->backends[i]);
}
#ifndef NDEBUG
GGML_LOG_DEBUG("%s: failed to allocate graph, reserving (backend_ids_changed = %d)\n", __func__, backend_ids_changed);
#endif
...@@ -1461,8 +1456,6 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
}
}
sched->cur_copy = (sched->cur_copy + 1) % sched->n_copies;
return GGML_STATUS_SUCCESS;
}
...@@ -1563,10 +1556,10 @@ void ggml_backend_sched_reset(ggml_backend_sched_t sched) {
bool ggml_backend_sched_reserve(ggml_backend_sched_t sched, struct ggml_cgraph * measure_graph) {
GGML_ASSERT((int)sched->hash_set.size >= measure_graph->n_nodes + measure_graph->n_leafs);
ggml_backend_sched_split_graph(sched, measure_graph);
ggml_backend_sched_synchronize(sched);
ggml_backend_sched_split_graph(sched, measure_graph);
if (!ggml_gallocr_reserve_n(sched->galloc, &sched->graph, sched->node_backend_ids, sched->leaf_backend_ids)) {
return false;
}
...@@ -1578,9 +1571,12 @@ bool ggml_backend_sched_reserve(ggml_backend_sched_t sched, struct ggml_cgraph *
bool ggml_backend_sched_alloc_graph(ggml_backend_sched_t sched, struct ggml_cgraph * graph) {
GGML_ASSERT((int)sched->hash_set.size >= graph->n_nodes + graph->n_leafs);
GGML_ASSERT(!sched->is_alloc);
ggml_backend_sched_split_graph(sched, graph);
sched->cur_copy = sched->next_copy;
sched->next_copy = (sched->next_copy + 1) % sched->n_copies;
ggml_backend_sched_split_graph(sched, graph);
if (!ggml_backend_sched_alloc_splits(sched)) {
return false;
...@@ -1615,6 +1611,12 @@ void ggml_backend_sched_synchronize(ggml_backend_sched_t sched) {
for (int i = 0; i < sched->n_backends; i++) {
ggml_backend_synchronize(sched->backends[i]);
}
if (!sched->is_alloc) {
// if the graph is not already allocated, always use copy 0 after a synchronization
// this ensures that during generation the same copy is used every time,
// which avoids changes in the graph that could cause CUDA or other graphs to be disabled
sched->next_copy = 0;
}
}
void ggml_backend_sched_set_eval_callback(ggml_backend_sched_t sched, ggml_backend_sched_eval_callback callback, void * user_data) {
...@@ -1845,7 +1847,7 @@ void ggml_backend_graph_copy_free(struct ggml_backend_graph_copy copy) {
ggml_free(copy.ctx_unallocated);
}
bool ggml_backend_compare_graph_backend(ggml_backend_t backend1, ggml_backend_t backend2, struct ggml_cgraph * graph, ggml_backend_eval_callback callback, void * user_data) {
bool ggml_backend_compare_graph_backend(ggml_backend_t backend1, ggml_backend_t backend2, struct ggml_cgraph * graph, ggml_backend_eval_callback callback, void * user_data, struct ggml_tensor * test_node) {
struct ggml_backend_graph_copy copy = ggml_backend_graph_copy(backend2, graph);
if (copy.buffer == NULL) {
return false;
...@@ -1856,28 +1858,45 @@ bool ggml_backend_compare_graph_backend(ggml_backend_t backend1, ggml_backend_t
assert(g1->n_nodes == g2->n_nodes);
    for (int i = 0; i < g1->n_nodes; i++) {
        struct ggml_tensor * t1 = g1->nodes[i];
        struct ggml_tensor * t2 = g2->nodes[i];
        assert(t1->op == t2->op && ggml_are_same_layout(t1, t2));
        struct ggml_cgraph g1v = ggml_graph_view(g1, i, i + 1);
        struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);
        ggml_backend_graph_compute(backend1, &g1v);
        ggml_backend_graph_compute(backend2, &g2v);
        if (ggml_is_view_op(t1->op)) {
            continue;
        }
        // compare results, calculate rms etc
        if (!callback(i, t1, t2, user_data)) {
            break;
        }
    }
    if (test_node != nullptr) {
        // Compute the whole graph and only test the output for a specific tensor
        ggml_backend_graph_compute(backend1, g1);
        ggml_backend_graph_compute(backend2, g2);
        int test_node_idx = -1;
        for (int i = 0; i < g1->n_nodes; i++) {
            struct ggml_tensor * t1 = g1->nodes[i];
            if (t1 == test_node) {
                test_node_idx = i;
                break;
            }
        }
        GGML_ASSERT(test_node_idx != -1);
        callback(test_node_idx, g1->nodes[test_node_idx], g2->nodes[test_node_idx], user_data);
    } else {
        for (int i = 0; i < g1->n_nodes; i++) {
            struct ggml_tensor * t1 = g1->nodes[i];
            struct ggml_tensor * t2 = g2->nodes[i];
            assert(t1->op == t2->op && ggml_are_same_layout(t1, t2));
            struct ggml_cgraph g1v = ggml_graph_view(g1, i, i + 1);
            struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);
            ggml_backend_graph_compute(backend1, &g1v);
            ggml_backend_graph_compute(backend2, &g2v);
            if (ggml_is_view_op(t1->op)) {
                continue;
            }
            // compare results, calculate rms etc
            if (!callback(i, t1, t2, user_data)) {
                break;
            }
        }
    }
ggml_backend_graph_copy_free(copy);
return true;
...
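A short usage sketch of the updated comparison entry point follows; creating the two backends, building the graph, and picking the output tensor are assumed to happen elsewhere and are not shown. Passing NULL for test_node keeps the old per-node walk, while passing a tensor computes both graphs in full and fires the callback once for that tensor.

#include <stdio.h>
#include "ggml-backend.h"

// matches ggml_backend_eval_callback: return false to stop the per-node walk early
static bool check_node(int i, struct ggml_tensor * t1, struct ggml_tensor * t2, void * user_data) {
    (void) t1; (void) t2; (void) user_data;
    printf("compared node %d\n", i);  // e.g. compute an RMS error between t1 and t2 here
    return true;
}

static void compare_backends(ggml_backend_t backend1, ggml_backend_t backend2,
                             struct ggml_cgraph * graph, struct ggml_tensor * output) {
    // per-node comparison, as before
    ggml_backend_compare_graph_backend(backend1, backend2, graph, check_node, NULL, NULL);
    // whole-graph compute, compare only `output`
    ggml_backend_compare_graph_backend(backend1, backend2, graph, check_node, NULL, output);
}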
...@@ -81,7 +81,7 @@ if (BLAS_FOUND)
target_link_libraries (ggml-blas PRIVATE ${BLAS_LIBRARIES})
target_include_directories(ggml-blas PRIVATE ${BLAS_INCLUDE_DIRS})
else()
message(ERROR "BLAS not found, please refer to "
message(FATAL_ERROR "BLAS not found, please refer to "
"https://cmake.org/cmake/help/latest/module/FindBLAS.html#blas-lapack-vendors"
" to set correct GGML_BLAS_VENDOR")
endif()
...@@ -281,10 +281,10 @@ ggml_backend_t ggml_backend_blas_init(void) {
ggml_backend_blas_context * ctx = new ggml_backend_blas_context;
ggml_backend_t backend = new ggml_backend {
/* .guid = */ ggml_backend_blas_guid(),
/* .interface = */ blas_backend_i,
/* .iface = */ blas_backend_i,
/* .device = */ ggml_backend_reg_dev_get(ggml_backend_blas_reg(), 0),
/* .context = */ ctx,
};
#if defined(OPENBLAS_VERSION) && defined(GGML_USE_OPENMP)
...
...@@ -99,6 +99,9 @@ typedef sycl::half2 ggml_half2;
#define QI4_1 (QK4_1 / (4 * QR4_1))
#define QR4_1 2
#define QI_MXFP4 (QK_MXFP4 / (4 * QR_MXFP4))
#define QR_MXFP4 2
#define QI5_0 (QK5_0 / (4 * QR5_0))
#define QR5_0 2
...@@ -184,6 +187,13 @@ typedef struct {
} block_q4_1;
static_assert(sizeof(block_q4_1) == 2 * sizeof(ggml_half) + QK4_1 / 2, "wrong q4_1 block size/padding");
#define QK_MXFP4 32
typedef struct {
uint8_t e; // E8M0
uint8_t qs[QK_MXFP4/2];
} block_mxfp4;
static_assert(sizeof(block_mxfp4) == sizeof(uint8_t) + QK_MXFP4/2, "wrong mxfp4 block size/padding");
#define QK5_0 32
typedef struct {
ggml_half d; // delta
...@@ -417,13 +427,6 @@ typedef struct {
} block_iq4_xs;
static_assert(sizeof(block_iq4_xs) == sizeof(ggml_half) + sizeof(uint16_t) + QK_K/64 + QK_K/2, "wrong iq4_xs block size/padding");
#define MXFP4 32
typedef struct {
uint8_t d; // scale E8M0 float
uint8_t qs[MXFP4 / 2]; // (32) 4 bit elements E2M1 float
} block_mxfp4;
static_assert(sizeof(block_mxfp4) == sizeof(uint8_t) + MXFP4/2, "wrong mxfp4 block size/padding");
#endif // GGML_COMMON_DECL
#endif // GGML_COMMON_DECL
...@@ -1081,6 +1084,17 @@ GGML_TABLE_BEGIN(uint32_t, iq3s_grid, 512)
0x0f090307, 0x0f090501, 0x0f090b01, 0x0f0b0505, 0x0f0b0905, 0x0f0d0105, 0x0f0d0703, 0x0f0f0101,
GGML_TABLE_END()
// TODO: fix name to kvalues_iq4_nl
GGML_TABLE_BEGIN(int8_t, kvalues_iq4nl, 16)
-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
GGML_TABLE_END()
// e2m1 values (doubled)
// ref: https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
GGML_TABLE_BEGIN(int8_t, kvalues_mxfp4, 16)
0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12,
GGML_TABLE_END()
#define NGRID_IQ1S 2048
#define IQ1S_DELTA 0.125f
#define IQ1M_DELTA 0.125f
...
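For orientation, a self-contained sketch of dequantizing one MXFP4 block with the table above. The E8M0 scale decoding as 2^(e - 127) follows the OCP microscaling spec referenced in the comment, and the nibble layout is an assumption for illustration, not a claim about the exact ggml kernels.

#include <math.h>
#include <stdint.h>

#define QK_MXFP4 32

typedef struct {
    uint8_t e;                 // E8M0 shared scale
    uint8_t qs[QK_MXFP4/2];    // 32 packed E2M1 values, two per byte
} block_mxfp4;

// e2m1 values (doubled), as in kvalues_mxfp4 above
static const int8_t kvalues_mxfp4[16] = {
    0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12,
};

static void dequantize_mxfp4(const block_mxfp4 * b, float * out) {
    // E8M0 is an exponent-only byte: 2^(e - 127), with 0xFF reserved as NaN per the spec
    const float scale = ldexpf(1.0f, (int) b->e - 127);
    for (int j = 0; j < QK_MXFP4/2; j++) {
        // assumed layout: low nibbles hold elements 0..15, high nibbles 16..31
        out[j]              = 0.5f * kvalues_mxfp4[b->qs[j] & 0x0F] * scale; // table values are doubled
        out[j + QK_MXFP4/2] = 0.5f * kvalues_mxfp4[b->qs[j] >> 4]   * scale;
    }
}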
function(ggml_add_cpu_backend_features cpu_name arch)
# The feature detection code is compiled as a separate target so that
# it can be built without the architecture flags
# Since multiple variants of the CPU backend may be included in the same
# build, using set_source_files_properties() to set the arch flags is not possible
set(GGML_CPU_FEATS_NAME ${cpu_name}-feats)
add_library(${GGML_CPU_FEATS_NAME} OBJECT ggml-cpu/arch/${arch}/cpu-feats.cpp)
target_include_directories(${GGML_CPU_FEATS_NAME} PRIVATE . ../include)
target_compile_definitions(${GGML_CPU_FEATS_NAME} PRIVATE ${ARGN})
target_compile_definitions(${GGML_CPU_FEATS_NAME} PRIVATE GGML_BACKEND_DL GGML_BACKEND_BUILD GGML_BACKEND_SHARED)
set_target_properties(${GGML_CPU_FEATS_NAME} PROPERTIES POSITION_INDEPENDENT_CODE ON)
target_link_libraries(${cpu_name} PRIVATE ${GGML_CPU_FEATS_NAME})
endfunction()
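This helper is what each architecture branch further down calls once its ARCH_DEFINITIONS are known, e.g. ggml_add_cpu_backend_features(${GGML_CPU_NAME} x86 ${ARCH_DEFINITIONS}) in the x86 path, so the runtime feature check in ggml-cpu/arch/<arch>/cpu-feats.cpp is built as a separate object without the variant's architecture flags.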
function(ggml_add_cpu_backend_variant_impl tag_name)
if (tag_name)
set(GGML_CPU_NAME ggml-cpu-${tag_name})
...@@ -10,14 +24,14 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
list (APPEND GGML_CPU_SOURCES
ggml-cpu/ggml-cpu.c
ggml-cpu/ggml-cpu.cpp
ggml-cpu/ggml-cpu-aarch64.cpp
ggml-cpu/ggml-cpu-aarch64.h
ggml-cpu/ggml-cpu-hbm.cpp
ggml-cpu/ggml-cpu-hbm.h
ggml-cpu/ggml-cpu-quants.c
ggml-cpu/ggml-cpu-quants.h
ggml-cpu/ggml-cpu-traits.cpp
ggml-cpu/ggml-cpu-traits.h
ggml-cpu/repack.cpp
ggml-cpu/repack.h
ggml-cpu/hbm.cpp
ggml-cpu/hbm.h
ggml-cpu/quants.c
ggml-cpu/quants.h
ggml-cpu/traits.cpp
ggml-cpu/traits.h
ggml-cpu/amx/amx.cpp
ggml-cpu/amx/amx.h
ggml-cpu/amx/mmq.cpp
...@@ -56,10 +70,12 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
if (GGML_OPENMP)
find_package(OpenMP)
if (OpenMP_FOUND)
set(GGML_OPENMP_ENABLED "ON" CACHE INTERNAL "")
target_compile_definitions(${GGML_CPU_NAME} PRIVATE GGML_USE_OPENMP)
target_link_libraries(${GGML_CPU_NAME} PRIVATE OpenMP::OpenMP_C OpenMP::OpenMP_CXX)
else()
set(GGML_OPENMP_ENABLED "OFF" CACHE INTERNAL "")
message(WARNING "OpenMP not found")
endif()
endif()
...@@ -82,12 +98,12 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
target_link_libraries(${GGML_CPU_NAME} PUBLIC memkind)
endif()
if (CMAKE_OSX_ARCHITECTURES STREQUAL "arm64" OR
    CMAKE_GENERATOR_PLATFORM_LWR STREQUAL "arm64" OR
    (NOT CMAKE_OSX_ARCHITECTURES AND NOT CMAKE_GENERATOR_PLATFORM_LWR AND
     CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64|arm.*|ARM64)$"))
if (GGML_SYSTEM_ARCH STREQUAL "ARM")
message(STATUS "ARM detected")
list(APPEND GGML_CPU_SOURCES
ggml-cpu/arch/arm/quants.c
ggml-cpu/arch/arm/repack.cpp
)
if (MSVC AND NOT CMAKE_C_COMPILER_ID STREQUAL "Clang")
message(FATAL_ERROR "MSVC is not supported for ARM, use clang")
...@@ -143,6 +159,49 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
else()
if (GGML_CPU_ARM_ARCH)
list(APPEND ARCH_FLAGS -march=${GGML_CPU_ARM_ARCH})
elseif(GGML_CPU_ALL_VARIANTS)
# Begin with the lowest baseline
set(ARM_MCPU "armv8-a")
set(ARCH_TAGS "")
set(ARCH_DEFINITIONS "")
# When a feature is selected, bump the MCPU to the first
# version that supported it
if (GGML_INTERNAL_DOTPROD)
set(ARM_MCPU "armv8.2-a")
set(ARCH_TAGS "${ARCH_TAGS}+dotprod")
list(APPEND ARCH_DEFINITIONS GGML_USE_DOTPROD)
endif()
if (GGML_INTERNAL_FP16_VECTOR_ARITHMETIC)
set(ARM_MCPU "armv8.2-a")
set(ARCH_TAGS "${ARCH_TAGS}+fp16")
list(APPEND ARCH_DEFINITIONS GGML_USE_FP16_VECTOR_ARITHMETIC)
endif()
if (GGML_INTERNAL_SVE)
set(ARM_MCPU "armv8.2-a")
set(ARCH_TAGS "${ARCH_TAGS}+sve")
list(APPEND ARCH_DEFINITIONS GGML_USE_SVE)
endif()
if (GGML_INTERNAL_MATMUL_INT8)
set(ARM_MCPU "armv8.6-a")
set(ARCH_TAGS "${ARCH_TAGS}+i8mm")
list(APPEND ARCH_DEFINITIONS GGML_USE_MATMUL_INT8)
endif()
if (GGML_INTERNAL_SVE2)
set(ARM_MCPU "armv8.6-a")
set(ARCH_TAGS "${ARCH_TAGS}+sve2")
list(APPEND ARCH_DEFINITIONS GGML_USE_SVE2)
endif()
if (GGML_INTERNAL_NOSVE)
set(ARCH_TAGS "${ARCH_TAGS}+nosve")
endif()
if (GGML_INTERNAL_SME)
set(ARM_MCPU "armv9.2-a")
set(ARCH_TAGS "${ARCH_TAGS}+sme")
list(APPEND ARCH_DEFINITIONS GGML_USE_SME)
endif()
list(APPEND ARCH_FLAGS "-march=${ARM_MCPU}${ARCH_TAGS}")
ggml_add_cpu_backend_features(${GGML_CPU_NAME} arm ${ARCH_DEFINITIONS})
endif()
endif()
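As a worked illustration of the mapping above (inferred from the flags in this block, not stated in the patch): a variant declared with DOTPROD MATMUL_INT8 NOSVE SME, such as apple_m4, ends up compiled with -march=armv9.2-a+dotprod+i8mm+nosve+sme and with GGML_USE_DOTPROD, GGML_USE_MATMUL_INT8 and GGML_USE_SME passed to its feature-detection object.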
...@@ -170,11 +229,12 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
endforeach()
endif()
endif()
elseif (CMAKE_OSX_ARCHITECTURES STREQUAL "x86_64" OR CMAKE_GENERATOR_PLATFORM_LWR MATCHES "^(x86_64|i686|amd64|x64|win32)$" OR
        (NOT CMAKE_OSX_ARCHITECTURES AND NOT CMAKE_GENERATOR_PLATFORM_LWR AND
         CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|i686|AMD64|amd64)$"))
elseif (GGML_SYSTEM_ARCH STREQUAL "x86")
message(STATUS "x86 detected")
list(APPEND GGML_CPU_SOURCES
ggml-cpu/arch/x86/quants.c
ggml-cpu/arch/x86/repack.cpp
)
if (MSVC)
# instruction set detection for MSVC only
...@@ -299,8 +359,17 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
endif()
endif()
endif()
elseif ("${CMAKE_SYSTEM_PROCESSOR} " STREQUAL "ppc64le " OR "${CMAKE_SYSTEM_PROCESSOR} " STREQUAL "powerpc ")
if (GGML_BACKEND_DL)
if (GGML_NATIVE)
# the feature check relies on ARCH_DEFINITIONS, but it is not set with GGML_NATIVE
message(FATAL_ERROR "GGML_NATIVE is not compatible with GGML_BACKEND_DL, consider using GGML_CPU_ALL_VARIANTS")
endif()
ggml_add_cpu_backend_features(${GGML_CPU_NAME} x86 ${ARCH_DEFINITIONS})
endif()
elseif (GGML_SYSTEM_ARCH STREQUAL "PowerPC")
message(STATUS "PowerPC detected")
list(APPEND GGML_CPU_SOURCES ggml-cpu/arch/powerpc/quants.c)
if (GGML_NATIVE)
if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "ppc64")
file(READ "/proc/cpuinfo" POWER10_M)
...@@ -308,7 +377,8 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
execute_process(COMMAND bash -c "prtconf |grep 'Implementation' | head -n 1" OUTPUT_VARIABLE POWER10_M)
endif()
string(REGEX MATCHALL "POWER *([0-9]+)" MATCHED_STRING "${POWER10_M}")
string(TOUPPER "${POWER10_M}" POWER10_M_UPPER)
string(REGEX MATCHALL "POWER *([0-9]+)" MATCHED_STRING "${POWER10_M_UPPER}")
string(REGEX REPLACE "POWER *([0-9]+)" "\\1" EXTRACTED_NUMBER "${MATCHED_STRING}")
if (EXTRACTED_NUMBER GREATER_EQUAL 10)
...@@ -320,13 +390,35 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
else()
list(APPEND ARCH_FLAGS -mcpu=native -mtune=native -mpowerpc64)
endif()
elseif(GGML_CPU_ALL_VARIANTS)
# Begin with the lowest baseline
set(ARCH_DEFINITIONS "")
# When a feature is selected, bump the MCPU to the first
# version that supported it
foreach(PVER RANGE 7 11)
if(DEFINED GGML_INTERNAL_POWER${PVER})
set(POWERPC_MCPU "power${PVER}")
list(APPEND ARCH_DEFINITIONS GGML_USE_POWER${PVER})
endif()
endforeach()
if (GGML_INTERNAL_VSX)
list(APPEND ARCH_DEFINITIONS GGML_USE_VSX)
list(APPEND ARCH_FLAGS -mvsx)
endif()
if (DEFINED POWERPC_MCPU)
list(APPEND ARCH_FLAGS -mcpu=${POWERPC_MCPU})
endif()
ggml_add_cpu_backend_features(${GGML_CPU_NAME} powerpc ${ARCH_DEFINITIONS})
else()
if (GGML_CPU_POWERPC_CPUTYPE)
list(APPEND ARCH_FLAGS -mcpu=${GGML_CPU_POWERPC_CPUTYPE})
endif()
endif()
elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "loongarch64")
elseif (GGML_SYSTEM_ARCH STREQUAL "loongarch64")
message(STATUS "loongarch64 detected")
list(APPEND GGML_CPU_SOURCES ggml-cpu/arch/loongarch/quants.c)
list(APPEND ARCH_FLAGS -march=loongarch64)
if (GGML_LASX)
...@@ -335,22 +427,30 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
if (GGML_LSX)
list(APPEND ARCH_FLAGS -mlsx)
endif()
elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "riscv64")
message(STATUS "RISC-V detected")
elseif (GGML_SYSTEM_ARCH STREQUAL "riscv64")
message(STATUS "riscv64 detected")
list(APPEND GGML_CPU_SOURCES
ggml-cpu/arch/riscv/quants.c
ggml-cpu/arch/riscv/repack.cpp
)
if (GGML_RVV)
if (GGML_RV_ZFH)
list(APPEND ARCH_FLAGS -march=rv64gcv_zfhmin -DGGML_RV_ZFH -mabi=lp64d)
if (GGML_XTHEADVECTOR)
list(APPEND ARCH_FLAGS -march=rv64gc_xtheadvector -mabi=lp64d)
elseif (GGML_RV_ZFH)
list(APPEND ARCH_FLAGS -march=rv64gcv_zfhmin -mabi=lp64d)
else()
list(APPEND ARCH_FLAGS -march=rv64gcv -mabi=lp64d)
endif()
endif()
elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "s390x")
elseif (GGML_SYSTEM_ARCH STREQUAL "s390x")
message(STATUS "s390x detected")
list(APPEND GGML_CPU_SOURCES ggml-cpu/arch/s390/quants.c)
file(READ "/proc/cpuinfo" CPUINFO_CONTENTS)
string(REGEX REPLACE "machine[ \t\r\n]*=[ \t\r\n]*([0-9]+)" "\\1" S390X_M ${CPUINFO_CONTENTS})
# TODO: Separation to determine activation of VX/VXE/VXE2
if (${S390X_M} MATCHES "8561|8562")
set(GGML_NNPA OFF)
message(STATUS "z15 target")
list(APPEND ARCH_FLAGS -march=z15)
elseif (${S390X_M} MATCHES "3931")
...@@ -358,6 +458,7 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
list(APPEND ARCH_FLAGS -march=z16)
elseif (${S390X_M} MATCHES "9175|9176")
# NOTE: Only available from GCC 15.1.0 onwards. Any z17 machine with compile issues must first verify their GCC version.
# binutils must also be updated to the latest for the -march=z17 flag to work. Otherwise, use -march=arch15.
message(STATUS "z17 target")
list(APPEND ARCH_FLAGS -march=z17)
else()
...@@ -367,14 +468,25 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
endif()
if (GGML_VXE)
message(STATUS "VX/VXE/VXE2 enabled")
list(APPEND ARCH_FLAGS -mvx -mzvector)
list(APPEND ARCH_DEFINITIONS GGML_VXE)
endif()
if (GGML_NNPA)
message(STATUS "NNPA enabled")
list(APPEND ARCH_DEFINITIONS GGML_NNPA)
endif()
elseif (CMAKE_SYSTEM_PROCESSOR MATCHES "wasm")
message(STATUS "Wasm detected")
list (APPEND GGML_CPU_SOURCES ggml-cpu/arch/wasm/quants.c)
else()
message(STATUS "Unknown architecture")
message(WARNING "Unknown CPU architecture. Falling back to generic implementations.")
list(APPEND ARCH_FLAGS -DGGML_CPU_GENERIC)
endif()
if (GGML_CPU_AARCH64)
target_compile_definitions(${GGML_CPU_NAME} PRIVATE GGML_USE_CPU_AARCH64)
if (GGML_CPU_REPACK)
target_compile_definitions(${GGML_CPU_NAME} PRIVATE GGML_USE_CPU_REPACK)
endif()
if (GGML_CPU_KLEIDIAI)
...@@ -385,9 +497,9 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
# Fetch KleidiAI sources:
include(FetchContent)
set(KLEIDIAI_COMMIT_TAG "v1.5.0")
set(KLEIDIAI_COMMIT_TAG "v1.11.0")
set(KLEIDIAI_DOWNLOAD_URL "https://github.com/ARM-software/kleidiai/archive/refs/tags/${KLEIDIAI_COMMIT_TAG}.tar.gz")
set(KLEIDIAI_ARCHIVE_MD5 "ea22e1aefb800e9bc8c74d91633cc58e")
set(KLEIDIAI_ARCHIVE_MD5 "3fe9e5ab964c375c53839296eb71eaa2")
if (POLICY CMP0135)
cmake_policy(SET CMP0135 NEW)
...@@ -477,26 +589,12 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
target_compile_options(${GGML_CPU_NAME} PRIVATE ${ARCH_FLAGS})
target_compile_definitions(${GGML_CPU_NAME} PRIVATE ${ARCH_DEFINITIONS})
if (GGML_BACKEND_DL)
if (GGML_NATIVE)
# the feature check relies on ARCH_DEFINITIONS, but it is not set with GGML_NATIVE
message(FATAL_ERROR "GGML_NATIVE is not compatible with GGML_BACKEND_DL, consider using GGML_CPU_ALL_VARIANTS")
endif()
# The feature detection code is compiled as a separate target so that
# it can be built without the architecture flags
# Since multiple variants of the CPU backend may be included in the same
# build, using set_source_files_properties() to set the arch flags is not possible
set(GGML_CPU_FEATS_NAME ${GGML_CPU_NAME}-feats)
add_library(${GGML_CPU_FEATS_NAME} OBJECT ggml-cpu/cpu-feats-x86.cpp)
target_include_directories(${GGML_CPU_FEATS_NAME} PRIVATE . .. ../include)
target_compile_definitions(${GGML_CPU_FEATS_NAME} PRIVATE ${ARCH_DEFINITIONS})
target_compile_definitions(${GGML_CPU_FEATS_NAME} PRIVATE GGML_BACKEND_DL GGML_BACKEND_BUILD GGML_BACKEND_SHARED)
set_target_properties(${GGML_CPU_FEATS_NAME} PROPERTIES POSITION_INDEPENDENT_CODE ON)
target_link_libraries(${GGML_CPU_NAME} PRIVATE ${GGML_CPU_FEATS_NAME})
endif()
if (EMSCRIPTEN)
set_target_properties(${GGML_CPU_NAME} PROPERTIES COMPILE_FLAGS "-msimd128")
endif()
if (CMAKE_CXX_COMPILER_ID STREQUAL "IntelLLVM")
# The compiler automatically enables "-ffast-math" which can cause NaNs in tests due to "-fassociative-math"
target_compile_options(${GGML_CPU_NAME} PRIVATE "-fno-associative-math")
endif()
endfunction()