change / sglang / Commits / 70359bf3

Update benchmark scripts (#8)

Unverified commit 70359bf3, authored Jan 15, 2024 by Lianmin Zheng; committed by GitHub, Jan 15, 2024.
Parent: 01ca82d7
Showing 8 changed files with 18 additions and 8 deletions (+18, -8).
Files changed:

- benchmark/tree_of_thought_v0/README.md (+0, -0)
- benchmark/tree_of_thought_v0/bench_other.py (+0, -0)
- benchmark/tree_of_thought_v0/bench_sglang.py (+0, -0)
- docs/flashinfer.md (+7, -4)
- python/sglang/srt/models/mixtral.py (+1, -1)
- python/sglang/srt/server_args.py (+2, -1)
- python/sglang/test/test_utils.py (+2, -2)
- scripts/launch_tgi.sh (+6, -0)
benchmark/tree_of_thought/README.md → benchmark/tree_of_thought_v0/README.md (file moved)

benchmark/tree_of_thought/bench_other.py → benchmark/tree_of_thought_v0/bench_other.py (file moved)

benchmark/tree_of_thought/bench_sglang.py → benchmark/tree_of_thought_v0/bench_sglang.py (file moved)
docs/flashinfer.md

````diff
 ## Flashinfer Mode
-[`flashinfer`](https://github.com/flashinfer-ai/flashinfer) is a kernel library for LLM serving; we use it here to support our attention computation.
+[flashinfer](https://github.com/flashinfer-ai/flashinfer) is a kernel library for LLM serving.
+It can be used in SGLang runtime to accelerate attention computation.

 ### Install flashinfer
 Note: The compilation can take a very long time.

 ```bash
 git submodule update --init --recursive
 pip install 3rdparty/flashinfer/python
 ```

-### Run Sever With Flashinfer Mode
+### Run a Server With Flashinfer Mode
-Add through `--model_mode` argument from the command line.
+Add `--model-mode flashinfer` argument to enable flashinfer when launching a server.

 Example:
 ```bash
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --model-mode flashinfer
 ```
\ No newline at end of file
````
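Once a server like the example above is running, the documented flow can be sanity-checked over HTTP. Below is a minimal sketch, assuming the runtime exposes a `/generate` endpoint that takes a `text` prompt plus a `sampling_params` dict and returns JSON with a `text` field; the endpoint and field names are assumptions drawn from the sglang README of this period, not part of this commit.

```python
# Minimal smoke test against a locally launched sglang server.
# Assumptions (not from this commit): the server listens on port 30000
# and serves POST /generate with {"text": ..., "sampling_params": {...}},
# returning JSON that contains a "text" field.
import requests

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text"])
```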
python/sglang/srt/models/mixtral.py

```diff
@@ -351,7 +351,7 @@ class MixtralForCausalLM(nn.Module):
         params_dict = dict(self.named_parameters())
         for name, loaded_weight in hf_model_weights_iterator(
-            model_name_or_path, cache_dir, load_format, revision, fall_back_to_pt=False
+            model_name_or_path, cache_dir, load_format, revision
         ):
             if "rotary_emb.inv_freq" in name:
                 continue
```
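For context, the loop this hunk touches follows the common vLLM-style weight-loading pattern: iterate over checkpoint tensors, skip buffers that are recomputed at startup (such as `rotary_emb.inv_freq`), and copy everything else into the matching parameter. A simplified generic sketch of that pattern, not the exact sglang/vLLM helper:

```python
# Generic sketch of the weight-loading loop shown in the hunk above.
# The real code uses hf_model_weights_iterator; here a plain dict of
# checkpoint tensors stands in for it (hypothetical simplification).
import torch
import torch.nn as nn

def load_weights(model: nn.Module, checkpoint: dict[str, torch.Tensor]) -> None:
    params_dict = dict(model.named_parameters())
    for name, loaded_weight in checkpoint.items():
        # rotary inv_freq is a derived buffer, recomputed at init; skip it
        if "rotary_emb.inv_freq" in name:
            continue
        param = params_dict[name]
        with torch.no_grad():
            param.copy_(loaded_weight)
```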
python/sglang/srt/server_args.py

```diff
@@ -93,7 +93,8 @@ class ServerArgs:
             type=str,
             default=[],
             nargs="+",
-            help="Model mode: [flashinfer, no-cache, aggressive-new-fill]",
+            choices=["flashinfer", "no-cache"],
+            help="Model mode: [flashinfer, no-cache]",
         )
         parser.add_argument(
             "--schedule-heuristic",
```
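The added `choices` list makes argparse reject unknown modes at parse time rather than letting a typo reach the runtime. A standalone illustration of the same `nargs="+"` plus `choices` combination:

```python
# nargs="+" collects one or more values; choices validates each of them.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model-mode",
    type=str,
    default=[],
    nargs="+",
    choices=["flashinfer", "no-cache"],
    help="Model mode: [flashinfer, no-cache]",
)

args = parser.parse_args(["--model-mode", "flashinfer", "no-cache"])
print(args.model_mode)  # ['flashinfer', 'no-cache']
# parser.parse_args(["--model-mode", "typo"]) exits with "invalid choice"
```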
python/sglang/test/test_utils.py

```diff
@@ -99,7 +99,7 @@ def call_select_vllm(context, choices, url):
         }
         res = requests.post(url, json=data)
         assert res.status_code == 200
-        scores.append(res.json()["prompt_score"])
+        scores.append(res.json().get("prompt_score", 0))
     return np.argmax(scores)
 """
@@ -112,7 +112,7 @@ def call_select_vllm(context, choices, url):
 def add_common_other_args_and_parse(parser):
-    parser.add_argument("--parallel", type=int, default=96)
+    parser.add_argument("--parallel", type=int, default=64)
     parser.add_argument("--host", type=str, default="http://127.0.0.1")
     parser.add_argument("--port", type=int, default=None)
     parser.add_argument(
```
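The first hunk replaces direct indexing with `dict.get`, so a response that lacks `prompt_score` contributes a score of 0 instead of raising `KeyError` and aborting the benchmark. In isolation:

```python
# dict.get with a default vs. direct indexing on a parsed JSON body.
body = {"text": "..."}  # imagine a response with no "prompt_score" key

print(body.get("prompt_score", 0))  # -> 0; scoring continues

try:
    body["prompt_score"]  # the old behavior
except KeyError:
    print("direct indexing raises KeyError")
```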
scripts/launch_tgi.sh (new file, mode 100644)

```bash
docker run --name tgi --rm -ti --gpus all --network host \
    -v /home/ubuntu/model_weights/Llama-2-7b-chat-hf:/Llama-2-7b-chat-hf \
    ghcr.io/huggingface/text-generation-inference:1.3.0 \
    --model-id /Llama-2-7b-chat-hf --num-shard 1 --trust-remote-code \
    --max-input-length 2048 --max-total-tokens 4096 \
    --port 24000
```
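launch_tgi.sh starts the text-generation-inference baseline that the benchmark scripts compare against. Once the container is up, it can be sanity-checked over HTTP; the sketch below uses TGI's documented `/generate` route with an `inputs`/`parameters` JSON body, with the port 24000 taken from the script above.

```python
# Quick sanity check against the TGI container started above.
# Uses TGI's documented /generate route (inputs + parameters body).
import requests

resp = requests.post(
    "http://127.0.0.1:24000/generate",
    json={
        "inputs": "The capital of France is",
        "parameters": {"max_new_tokens": 16},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```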