change / sglang — commit 2c7d0a5b (unverified)

[Fix] Fix all the Huggingface paths (#1553)

Authored Oct 02, 2024 by Theresa Barton, committed by GitHub on Oct 02, 2024.
Parent: 8cdc76f6
Changes: 11 files, 24 additions and 24 deletions (+24 −24).
- README.md (+2 −2)
- benchmark/benchmark_vllm_060/README.md (+8 −8)
- benchmark/blog_v0_2/README.md (+2 −2)
- docker/compose.yaml (+1 −1)
- docker/k8s-sglang-service.yaml (+1 −1)
- docs/en/benchmark_and_profiling.md (+1 −1)
- docs/en/install.md (+2 −2)
- examples/runtime/openai_chat_with_response_prefill.py (+2 −2)
- python/sglang/test/test_utils.py (+3 −3)
- test/srt/models/test_generation_models.py (+1 −1)
- test/srt/test_openai_server.py (+1 −1)
README.md

````diff
@@ -81,7 +81,7 @@ docker run --gpus all \
     --env "HF_TOKEN=<secret>" \
     --ipc=host \
     lmsysorg/sglang:latest \
-    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
+    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
 ```
 
 ### Method 4: Using docker compose
@@ -121,7 +121,7 @@ resources:
 run: |
   conda deactivate
   python3 -m sglang.launch_server \
-    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --model-path meta-llama/Llama-3.1-8B-Instruct \
     --host 0.0.0.0 \
     --port 30000
 ```
````
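A quick way to check that the corrected path is what a running server actually reports is to query its OpenAI-compatible endpoint for the served model list. This is a minimal sketch, not part of the commit: it assumes a server launched locally on port 30000 as above, the `openai` Python package installed, and that the server exposes the standard `/v1/models` route.

```python
# Illustrative sanity check, not part of this commit: ask the running server
# which model id it serves and compare it to the corrected Hugging Face path.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

for model in client.models.list():
    print(model.id)  # expected: meta-llama/Llama-3.1-8B-Instruct
```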
benchmark/benchmark_vllm_060/README.md

````diff
@@ -58,12 +58,12 @@ We referred to the reproduction method in https://github.com/vllm-project/vllm/i
 ```bash
 # Llama 3.1 8B Instruct on 1 x A100
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
-python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10 --max_model_len 4096
+python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
+python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10 --max_model_len 4096
 
 # Llama 3.1 70B Instruct on 4 x H100
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --disable-radix-cache --tp 4
-python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor 4 --max_model_len 4096
+python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct --disable-radix-cache --tp 4
+python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor 4 --max_model_len 4096
 
 # bench serving
 python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1200 --request-rate 4
@@ -76,12 +76,12 @@ python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-pro
 ```bash
 # Llama 3.1 8B Instruct on 1 x A100
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
-python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10 --max_model_len 4096
+python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
+python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10 --max_model_len 4096
 
 # Llama 3.1 70B Instruct on 4 x H100
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --disable-radix-cache --tp 4 --mem-frac 0.88
-python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor 4 --max_model_len 4096
+python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct --disable-radix-cache --tp 4 --mem-frac 0.88
+python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor 4 --max_model_len 4096
 
 # bench serving
 python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 5000
````
benchmark/blog_v0_2/README.md

````diff
@@ -27,10 +27,10 @@ export HF_TOKEN=hf_token
 ```bash
 # Meta-Llama-3.1-8B-Instruct
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
+python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
 
 # Meta-Llama-3.1-70B-Instruct
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --disable-radix-cache --tp 8
+python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct --disable-radix-cache --tp 8
 
 # Meta-Llama-3-70B-Instruct-FP8
 python -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-radix-cache --tp 8
````
docker/compose.yaml

```diff
@@ -17,7 +17,7 @@ services:
       # - SGLANG_USE_MODELSCOPE: true
     entrypoint: python3 -m sglang.launch_server
     command:
-      --model-path meta-llama/Meta-Llama-3.1-8B-Instruct
+      --model-path meta-llama/Llama-3.1-8B-Instruct
       --host 0.0.0.0
       --port 30000
     ulimits:
```
docker/k8s-sglang-service.yaml

```diff
@@ -32,7 +32,7 @@ spec:
       ports:
         - containerPort: 30000
       command: ["python3", "-m", "sglang.launch_server"]
-      args: ["--model-path", "meta-llama/Meta-Llama-3.1-8B-Instruct", "--host", "0.0.0.0", "--port", "30000"]
+      args: ["--model-path", "meta-llama/Llama-3.1-8B-Instruct", "--host", "0.0.0.0", "--port", "30000"]
      env:
       - name: HF_TOKEN
         value: <secret>
```
docs/en/benchmark_and_profiling.md

````diff
@@ -30,7 +30,7 @@ apt install nsight-systems-cli
 ```bash
 # server
 # set the delay and duration times according to needs
-nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --disable-radix-cache
+nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache
 
 # client
 python3 -m sglang.bench_serving --backend sglang --num-prompts 6000 --dataset-name random --random-input 4096 --random-output 2048
````
docs/en/install.md

````diff
@@ -35,7 +35,7 @@ docker run --gpus all \
     --env "HF_TOKEN=<secret>" \
     --ipc=host \
     lmsysorg/sglang:latest \
-    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
+    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
 ```
 
 ### Method 4: Using docker compose
@@ -75,7 +75,7 @@ resources:
 run: |
   conda deactivate
   python3 -m sglang.launch_server \
-    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --model-path meta-llama/Llama-3.1-8B-Instruct \
     --host 0.0.0.0 \
     --port 30000
 ```
````
examples/runtime/openai_chat_with_response_prefill.py

```diff
 """
 Usage:
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000
+python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
 python openai_chat.py
 """
@@ -10,7 +10,7 @@ from openai import OpenAI
 client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
 
 response = client.chat.completions.create(
-    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
+    model="meta-llama/Llama-3.1-8B-Instruct",
     messages=[
         {"role": "system", "content": "You are a helpful AI assistant"},
         {
```
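Read end to end, the corrected example boils down to roughly the following client script. This is a sketch only: the hunk cuts off before the remaining messages, so the user turn below is a placeholder rather than the file's actual content, and it assumes a server launched as in the docstring above.

```python
# Sketch of the corrected client call from the example above. The user turn is
# a placeholder; the real file continues with messages not shown in this hunk.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # corrected Hugging Face path
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List three countries and their capitals."},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```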
python/sglang/test/test_utils.py

```diff
@@ -23,13 +23,13 @@ from sglang.srt.utils import kill_child_process
 from sglang.utils import get_exception_traceback
 
 DEFAULT_FP8_MODEL_NAME_FOR_TEST = "neuralmagic/Meta-Llama-3.1-8B-FP8"
-DEFAULT_MODEL_NAME_FOR_TEST = "meta-llama/Meta-Llama-3.1-8B-Instruct"
+DEFAULT_MODEL_NAME_FOR_TEST = "meta-llama/Llama-3.1-8B-Instruct"
 DEFAULT_MOE_MODEL_NAME_FOR_TEST = "mistralai/Mixtral-8x7B-Instruct-v0.1"
 DEFAULT_MLA_MODEL_NAME_FOR_TEST = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
 DEFAULT_MLA_FP8_MODEL_NAME_FOR_TEST = "neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8"
 DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH = 600
-DEFAULT_MODEL_NAME_FOR_NIGHTLY_EVAL_TP1 = "meta-llama/Meta-Llama-3.1-8B-Instruct,mistralai/Mistral-7B-Instruct-v0.3,deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct,google/gemma-2-27b-it"
-DEFAULT_MODEL_NAME_FOR_NIGHTLY_EVAL_TP2 = "meta-llama/Meta-Llama-3.1-70B-Instruct,mistralai/Mixtral-8x7B-Instruct-v0.1,Qwen/Qwen2-57B-A14B-Instruct,deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
+DEFAULT_MODEL_NAME_FOR_NIGHTLY_EVAL_TP1 = "meta-llama/Llama-3.1-8B-Instruct,mistralai/Mistral-7B-Instruct-v0.3,deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct,google/gemma-2-27b-it"
+DEFAULT_MODEL_NAME_FOR_NIGHTLY_EVAL_TP2 = "meta-llama/Llama-3.1-70B-Instruct,mistralai/Mixtral-8x7B-Instruct-v0.1,Qwen/Qwen2-57B-A14B-Instruct,deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
 DEFAULT_MODEL_NAME_FOR_NIGHTLY_EVAL_FP8_TP1 = "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8,neuralmagic/Mistral-7B-Instruct-v0.3-FP8,neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8,neuralmagic/gemma-2-2b-it-FP8"
 DEFAULT_MODEL_NAME_FOR_NIGHTLY_EVAL_FP8_TP2 = "neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8,neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8,neuralmagic/Qwen2-72B-Instruct-FP8,neuralmagic/Qwen2-57B-A14B-Instruct-FP8,neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8"
 DEFAULT_MODEL_NAME_FOR_NIGHTLY_EVAL_QUANT_TP1 = "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4,hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
```
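The nightly-eval constants pack several Hugging Face paths into a single comma-separated string, which is why each entry had to be renamed individually. A small illustrative snippet (not the repository's actual test-runner code) shows how such a string decomposes into per-model IDs:

```python
# Illustrative only: split a nightly-eval constant into individual model ids.
nightly_tp1 = (
    "meta-llama/Llama-3.1-8B-Instruct,"
    "mistralai/Mistral-7B-Instruct-v0.3,"
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct,"
    "google/gemma-2-27b-it"
)

model_ids = nightly_tp1.split(",")
assert all("/" in m for m in model_ids)  # each entry is an org/repo path
print(model_ids)
```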
test/srt/models/test_generation_models.py

```diff
@@ -44,7 +44,7 @@ class ModelCase:
 # Popular models that run on CI
 CI_MODELS = [
-    ModelCase("meta-llama/Meta-Llama-3.1-8B-Instruct"),
+    ModelCase("meta-llama/Llama-3.1-8B-Instruct"),
     ModelCase("google/gemma-2-2b"),
 ]
```
test/srt/test_openai_server.py

```diff
@@ -499,7 +499,7 @@ class TestOpenAIServer(unittest.TestCase):
         client = openai.Client(api_key=self.api_key, base_url=self.base_url)
 
         response = client.chat.completions.create(
-            model="meta-llama/Meta-Llama-3.1-8B-Instruct",
+            model="meta-llama/Llama-3.1-8B-Instruct",
             messages=[
                 {"role": "system", "content": "You are a helpful AI assistant"},
                 {
```