Unverified commit a8ae6403, authored by Lianmin Zheng, committed by GitHub

Improve docs and warnings (#1164)

parent d8476818
-<!-- Thank you for your contribution, we really appreciate it. The following instructions will help improve your pull request and make it easier to receive feedback. If there are any items you don't understand, don't worry. Just submit the pull request and ask the maintainers for help. -->
+<!-- Thank you for your contribution! We appreciate it. The following guidelines will help improve your pull request and facilitate feedback. If anything is unclear, don't hesitate to submit your pull request and ask the maintainers for assistance. -->
 ## Motivation
-<!-- Please explain the motivation behind this PR and the goal you aim to achieve with it. -->
+<!-- Explain the purpose of this PR and the goals it aims to achieve. -->
-## Modification
+## Modifications
-<!-- Briefly describe the changes made in this PR. -->
+<!-- Describe the changes made in this PR. -->
 ## Checklist
-- [ ] Before submitting a PR for review, make sure it has passed verification in your local development environment **at least**.
-- [ ] Ensure pre-commit `pre-commit run --all-files` or other linting tools are used to fix potential lint issues.
-- [ ] Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
-- [ ] Modify documentation as needed, such as docstrings or example tutorials.
+- [ ] Format your code according to the [Contributor Guide](https://github.com/sgl-project/sglang/blob/main/docs/en/contributor_guide.md).
+- [ ] Add unit tests as outlined in the [Contributor Guide](https://github.com/sgl-project/sglang/blob/main/docs/en/contributor_guide.md).
+- [ ] Update documentation as needed, including docstrings or example tutorials.
\ No newline at end of file
@@ -81,14 +81,17 @@ docker run --gpus all \
 ### Method 4: Using docker compose
+<details>
 > This method is recommended if you plan to serve it as a service.
 > A better approach is to use the [k8s-sglang-service.yaml](./docker/k8s-sglang-service.yaml).
 1. Copy the [compose.yml](./docker/compose.yaml) to your local machine
 2. Execute the command `docker compose up -d` in your terminal.
+</details>
 ### Method 5: Run on Kubernetes or Clouds with SkyPilot
+<details>
 To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).
 1. Install SkyPilot and set up Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
@@ -114,8 +117,6 @@ run: |
     --port 30000
 ```
-</details>
 ```bash
 # Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
 HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
@@ -124,7 +125,7 @@ HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
 sky status --endpoint 30000 sglang
 ```
 3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
+</details>
 ### Common Notes
...@@ -147,13 +147,12 @@ def get_tokenizer( ...@@ -147,13 +147,12 @@ def get_tokenizer(
and kwargs.get("use_fast", True) and kwargs.get("use_fast", True)
and tokenizer_name != _FAST_LLAMA_TOKENIZER and tokenizer_name != _FAST_LLAMA_TOKENIZER
): ):
pass warnings.warn(
# warnings.warn( "For some LLaMA V1 models, initializing the fast tokenizer may "
# "For some LLaMA V1 models, initializing the fast tokenizer may " "take a long time. To reduce the initialization time, consider "
# "take a long time. To reduce the initialization time, consider " f"using '{_FAST_LLAMA_TOKENIZER}' instead of the original "
# f"using '{_FAST_LLAMA_TOKENIZER}' instead of the original " "tokenizer."
# "tokenizer." )
# )
try: try:
tokenizer = AutoTokenizer.from_pretrained( tokenizer = AutoTokenizer.from_pretrained(
tokenizer_name, tokenizer_name,
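For context on why re-enabling `warnings.warn` is sensible here: the default warnings filter shows a warning once per call site, so this reads as a one-time performance hint rather than log spam. A minimal runnable sketch of the pattern; `check_tokenizer_choice` is a hypothetical helper, and the repo id assigned to `_FAST_LLAMA_TOKENIZER` is an assumption borrowed from similar loaders, not taken from this diff:

```python
# One-time performance hint via warnings.warn (shown once per call site
# under the default "default" warnings filter).
import warnings

_FAST_LLAMA_TOKENIZER = "hf-internal-testing/llama-tokenizer"  # assumed value

def check_tokenizer_choice(tokenizer_name: str, use_fast: bool = True) -> None:
    """Warn when a slow-to-initialize LLaMA V1 fast tokenizer is requested."""
    if (
        use_fast
        and "llama" in tokenizer_name.lower()
        and tokenizer_name != _FAST_LLAMA_TOKENIZER
    ):
        warnings.warn(
            "For some LLaMA V1 models, initializing the fast tokenizer may "
            "take a long time. To reduce the initialization time, consider "
            f"using '{_FAST_LLAMA_TOKENIZER}' instead of the original tokenizer."
        )

check_tokenizer_choice("huggyllama/llama-7b")  # warning printed
check_tokenizer_choice("huggyllama/llama-7b")  # suppressed on repeat
```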
@@ -270,7 +270,7 @@ class Req:
         if all_ids[prompt_tokens - 1] != self.origin_input_ids_unpadded[-1]:
             # TODO(lsyin): fix token fusion
-            warnings.warn(
+            logging.warning(
                 "Token fusion between input and output, try to avoid this by removing the space at the end of the input."
             )
             return False
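The substantive change in this hunk and several below is swapping `warnings.warn` for `logging.warning`. The two behave differently: warnings pass through the `warnings` filter, which deduplicates per call site by default and bypasses log handlers, while `logging` records honor the process's logging configuration and fire on every occurrence. A small self-contained comparison:

```python
# warnings.warn is filtered (shown once per location by default);
# logging.warning is emitted every time and routed through log handlers.
import logging
import warnings

logging.basicConfig(format="%(levelname)s:%(name)s:%(message)s")

for i in range(3):
    warnings.warn("printed once by the default warnings filter")
    logging.warning("printed on every iteration (%d)", i)
```

For server-side conditions that can recur per request, the repeating, handler-aware behavior of `logging` is usually the right choice, which matches the direction of this commit.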
@@ -791,7 +791,7 @@ class ScheduleBatch:
         )
         if not torch.all(success):
-            warnings.warn("Sampling failed, fallback to top_k=1 strategy")
+            logging.warning("Sampling failed, fallback to top_k=1 strategy")
             probs = probs.masked_fill(torch.isnan(probs), 0.0)
             argmax_ids = torch.argmax(probs, dim=-1)
             batch_next_token_ids = torch.where(
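For context on the fallback being logged here: when sampling cannot produce a token for some rows (e.g. the probability tensor contains NaNs), the batch is salvaged by zeroing the bad values and taking the argmax (top_k=1) for just those rows. A runnable sketch of the pattern with made-up tensors, not the project's real batch state:

```python
# Greedy (top_k=1) fallback for rows whose probabilities cannot be sampled.
import torch

probs = torch.tensor([[0.1, 0.9],
                      [float("nan"), float("nan")]])  # row 1 is unsampleable

safe = probs.masked_fill(torch.isnan(probs), 0.0)
success = safe.sum(dim=-1) > 0                 # which rows can still be sampled
sampled = torch.zeros(probs.size(0), dtype=torch.long)
if success.any():
    sampled[success] = torch.multinomial(safe[success], 1).squeeze(-1)

argmax_ids = torch.argmax(safe, dim=-1)        # 0 for the all-zero row
next_token_ids = torch.where(success, sampled, argmax_ids)
```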
@@ -774,7 +774,7 @@ class ModelTpServer:
             torch.cuda.empty_cache()
             logger.info("Cache flushed successfully!")
         else:
-            warnings.warn(
+            logging.warning(
                 f"Cache not flushed because there are pending requests. "
                 f"#queue-req: {len(self.waiting_queue)}, "
                 f"#running-req: {0 if self.running_batch is None else len(self.running_batch.reqs)}"
@@ -237,7 +237,7 @@ class ModelRunner:
         self.max_total_num_tokens = self.profile_max_num_token(total_gpu_memory)
         if max_total_tokens is not None:
             if max_total_tokens > self.max_total_num_tokens:
-                warnings.warn(
+                logging.warning(
                     f"max_total_tokens={max_total_tokens} is larger than the profiled value "
                     f"{self.max_total_num_tokens}. "
                     f"Use the profiled value instead."
@@ -17,10 +17,10 @@ limitations under the License.
 import asyncio
 import json
+import logging
 import os
 import time
 import uuid
-import warnings
 from http import HTTPStatus
 from typing import Dict, List, Optional
@@ -65,6 +65,8 @@ from sglang.srt.openai_api.protocol import (
     UsageInfo,
 )
+logger = logging.getLogger(__name__)
+
 chat_template_name = None
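The two hunks above adopt the standard module-level logger idiom: `logging.getLogger(__name__)` names the logger after the module's import path (here the OpenAI API adapter module), so deployments can raise or lower verbosity per module without touching the code. A minimal demonstration:

```python
# getLogger(__name__) yields a logger named after the importing module,
# letting applications adjust levels per module at configuration time.
import logging

logger = logging.getLogger(__name__)

if __name__ == "__main__":
    logging.basicConfig(format="%(name)s %(levelname)s: %(message)s",
                        level=logging.INFO)
    logger.warning("module-scoped warning")              # shown
    logging.getLogger(__name__).setLevel(logging.ERROR)  # silence this module
    logger.warning("now filtered out")                   # suppressed
```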
@@ -408,7 +410,7 @@ def v1_generate_request(all_requests: List[CompletionRequest]):
                 "Parallel sampling is not supported for completions from files"
             )
         if request.echo and request.logprobs:
-            warnings.warn(
+            logger.warning(
                 "Echo is not compatible with logprobs. "
                 "To compute logprobs of input prompt, please use SGLang /request API."
             )
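A design note on this last hunk: rather than rejecting a request that combines `echo` with `logprobs`, the adapter logs a warning and serves what it can, pointing users at the native /request API for prompt logprobs. A sketch of that log-and-continue validation; the helper name is hypothetical:

```python
# Log-and-continue validation for mutually incompatible request options.
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def warn_on_incompatible_options(echo: bool, logprobs: Optional[int]) -> None:
    if echo and logprobs:
        logger.warning(
            "Echo is not compatible with logprobs. "
            "To compute logprobs of input prompt, please use SGLang /request API."
        )
```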