## Overview

> [!IMPORTANT]
> This example and the RPC backend are currently in a proof-of-concept development stage. As such, the functionality is fragile and
> insecure. **Never run the RPC server on an open network or in a sensitive environment!**

The `rpc-server` allows running a `ggml` backend on a remote host.
The RPC backend communicates with one or several instances of `rpc-server` and offloads computations to them.
This can be used for distributed LLM inference with `llama.cpp` in the following way:

```mermaid
flowchart TD
    rpcb<-->|TCP|srva
    rpcb<-->|TCP|srvb
    rpcb<-.->|TCP|srvn
    subgraph hostn[Host N]
    srvn[rpc-server]<-.->backend3["Backend (CUDA,Metal,etc.)"]
    end
    subgraph hostb[Host B]
    srvb[rpc-server]<-->backend2["Backend (CUDA,Metal,etc.)"]
    end
    subgraph hosta[Host A]
    srva[rpc-server]<-->backend["Backend (CUDA,Metal,etc.)"]
    end
    subgraph host[Main Host]
    local["Backend (CUDA,Metal,etc.)"]<-->ggml[llama-cli]
    ggml[llama-cli]<-->rpcb[RPC backend]
    end
    style hostn stroke:#66,stroke-width:2px,stroke-dasharray: 5 5
```

Each host can run a different backend, e.g. one with CUDA and another with Metal.
You can also run multiple `rpc-server` instances on the same host, each with a different backend.
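
For example, assuming you have already built a CUDA-enabled and a CPU-only `rpc-server` (as described under Usage below), each instance only needs its own port. The build directory names and ports here are illustrative, not fixed conventions:

```bash
# Illustrative sketch: one GPU-backed and one CPU-backed rpc-server on the same host.
build-rpc-cuda/bin/rpc-server -p 50052 &
build-rpc-cpu/bin/rpc-server -p 50053 &
```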

## Usage

On each host, build the corresponding backend with `cmake` and add `-DGGML_RPC=ON` to the build options.
For example, to build the CUDA backend with RPC support:

```bash
mkdir build-rpc-cuda
cd build-rpc-cuda
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release
```

Then, start the `rpc-server` with the backend:

```bash
$ bin/rpc-server -p 50052
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes
Starting RPC server on 0.0.0.0:50052
```

When using the CUDA backend, you can specify the device with the `CUDA_VISIBLE_DEVICES` environment variable, e.g.:
```bash
$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
```
This way you can run multiple `rpc-server` instances on the same host, each with a different CUDA device.
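
For example, on a machine with two CUDA devices, such a setup might look like the following sketch (device indices and ports are arbitrary):

```bash
# Each rpc-server instance sees exactly one GPU and listens on its own port.
CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052 &
CUDA_VISIBLE_DEVICES=1 bin/rpc-server -p 50053 &
```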


On the main host, build `llama.cpp` for the local backend and add `-DGGML_RPC=ON` to the build options.
Finally, when running `llama-cli`, use the `--rpc` option to specify the host and port of each `rpc-server`:

```bash
$ bin/llama-cli -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.10:50052,192.168.88.11:50052 -ngl 99
```

This way you can offload model layers to both local and remote devices.
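
The `--rpc` option is a common `llama.cpp` parameter, so other tools built with `-DGGML_RPC=ON` should accept it as well. As a sketch (availability depends on your build; the model path and addresses are placeholders), `llama-server` can be launched the same way:

```bash
# Sketch: serve a model while offloading layers to two remote rpc-server instances.
bin/llama-server -m ../models/tinyllama-1b/ggml-model-f16.gguf \
    --rpc 192.168.88.10:50052,192.168.88.11:50052 -ngl 99
```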