# Restful API

### Launch Service

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace 0.0.0.0 server_port --instance_num 32 --tp 1
```

Then, users can open the Swagger UI at `http://{server_ip}:{server_port}` for detailed API usage.
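
For example, the raw OpenAPI schema behind the Swagger UI can be fetched directly; this is the same `openapi.json` used for client generation later in this document:

```shell
curl http://{server_ip}:{server_port}/openapi.json
```
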
We provide four RESTful APIs in total: `/generate`, `/v1/models`, `/v1/chat/completions` and `/v1/embeddings`. The latter three follow the OpenAI format. However, we recommend that users try our own `generate` API, which exposes more arguments to tune and delivers comparatively better performance.
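
If you prefer the OpenAI-format endpoints, here is a minimal Python sketch that mirrors the chat-completions cURL example further below (the model name is illustrative and should match your deployment):

```python
import requests

# Call the OpenAI-style chat completions endpoint. The payload mirrors
# the cURL example later in this document.
response = requests.post(
    'http://{server_ip}:{server_port}/v1/chat/completions',
    headers={'Content-Type': 'application/json'},
    json={
        'model': 'internlm-chat-7b',
        'messages': [{'role': 'user', 'content': 'Hello! How are you?'}]
    })
print(response.json())
```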

### Python

Here is an example of calling our own `generate` API.

```python
import json
import requests
from typing import Iterable, List


def get_streaming_response(prompt: str,
                           api_url: str,
                           instance_id: int,
                           request_output_len: int,
                           stream: bool = True,
                           sequence_start: bool = True,
                           sequence_end: bool = True,
                           ignore_eos: bool = False) -> Iterable[List[str]]:
    headers = {'User-Agent': 'Test Client'}
    pload = {
        'prompt': prompt,
        'stream': stream,
        'instance_id': instance_id,
        'request_output_len': request_output_len,
        'sequence_start': sequence_start,
        'sequence_end': sequence_end,
        'ignore_eos': ignore_eos
    }
    # Post the request; with stream=True the connection stays open and
    # the server emits JSON chunks separated by null bytes.
    response = requests.post(
        api_url, headers=headers, json=pload, stream=stream)
    for chunk in response.iter_lines(
            chunk_size=8192, decode_unicode=False, delimiter=b'\0'):
        if chunk:
            data = json.loads(chunk.decode('utf-8'))
            output = data['text']
            tokens = data['tokens']
            yield output, tokens


for output, tokens in get_streaming_response(
        "Hi, how are you?", "http://{server_ip}:{server_port}/generate", 0,
        512):
    print(output, end='')
```
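
Building on the helper above, here is a sketch of a multi-round conversation, assuming the session semantics described in the FAQ at the end of this document: keep the same `instance_id`, open the sequence only on the first turn, and close it on the last.

```python
# A sketch of multi-round usage: the same instance_id carries the session;
# sequence_start opens it on the first turn and sequence_end closes it on
# the last one (see FAQ 3).
api_url = 'http://{server_ip}:{server_port}/generate'

# First round: start the sequence but keep it open.
for output, tokens in get_streaming_response(
        'Hi, how are you?', api_url, 0, 512,
        sequence_start=True, sequence_end=False):
    print(output, end='')

# Second round: continue the existing sequence.
for output, tokens in get_streaming_response(
        'Tell me more about that.', api_url, 0, 512,
        sequence_start=False, sequence_end=True):
    print(output, end='')
```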

58
### Java/Golang/Rust

You may use [openapi-generator-cli](https://github.com/OpenAPITools/openapi-generator-cli) to convert `http://{server_ip}:{server_port}/openapi.json` into a Java/Rust/Golang client.
Here is an example:

```shell
$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust

$ ls rust/*
rust/Cargo.toml  rust/git_push.sh  rust/README.md

rust/docs:
ChatCompletionRequest.md  EmbeddingsRequest.md  HttpValidationError.md  LocationInner.md  Prompt.md
DefaultApi.md             GenerateRequest.md    Input.md                Messages.md       ValidationError.md

rust/src:
apis  lib.rs  models
```

### cURL

cURL is a handy tool for inspecting the output of the APIs.

List Models:

```bash
curl http://{server_ip}:{server_port}/v1/models
```

Generate:

```bash
curl http://{server_ip}:{server_port}/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello! How are you?",
    "instance_id": 1,
    "sequence_start": true,
    "sequence_end": true
  }'
```

Chat Completions:

```bash
curl http://{server_ip}:{server_port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm-chat-7b",
    "messages": [{"role": "user", "content": "Hello! Ho are you?"}]
  }'
```

Embeddings:

```bash
curl http://{server_ip}:{server_port}/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm-chat-7b",
    "input": "Hello world!"
  }'
```

### CLI client

There is a client script for the RESTful API server.

```shell
# restful_api_url is the URL printed by api_server.py at startup, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
```

### webui

You can also test the RESTful API through the web UI.

```shell
# restful_api_url is the URL printed by api_server.py at startup, e.g. http://localhost:23333
# server_ip and server_port here are for the gradio UI
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip server_port --restful_api True
```

### FAQ

1. If you get `"finish_reason":"length"`, it means the session is too long to be continued.
   Please add `"renew_session": true` to the next request (see the sketch after this list).

2. If OOM occurs on the server side, please reduce `instance_num` when launching the service.

3. If requests with the same `instance_id` to `generate` return an empty value and a negative `tokens`, please set `sequence_start=false` for the second question and for all subsequent ones.

4. If requests are handled sequentially rather than concurrently, to resolve this issue:

   - provide a unique `instance_id` for each session when calling the `generate` API; otherwise, your requests may be associated with client IP addresses
   - additionally, set `stream=true` to enable multiple requests to be processed simultaneously

5. Both the `generate` API and `v1/chat/completions` support multi-round conversation: the input `prompt` or `messages` can be either a single string or an entire chat history, and both are interpreted in multi-turn dialogue mode. However, if you want to turn this mode off and manage the chat history on the client side, please set the parameter `sequence_end: true` when using `generate`, or `renew_session: true` when using `v1/chat/completions`.
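
As an illustration of FAQ 1, a follow-up `generate` request that renews the session might look like this (a sketch reusing the request shape from the cURL section above):

```bash
curl http://{server_ip}:{server_port}/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Please continue.",
    "instance_id": 1,
    "renew_session": true
  }'
```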