Unverified Commit e8e18dcd authored by Lianmin Zheng, committed by GitHub

Revert "fix some typos" (#6244)

parent bad7c26f
@@ -69,7 +69,7 @@
This command sets the number of prompts to 2 with the `--num-prompts` argument and limits the length of output sequences to 100 with the `--sharegpt-output-len` argument, which produces a small trace file that the browser can open smoothly.
Additionally, if you want to locate the SGLang Python source code through the CUDA kernel in Trace, you need to disable CUDA Graph when starting the service. This can be done by using the `--disable-cuda-graph` parameter in the command to start the service.
Additionally, if you want to locate the SGLang Python source code through the cuda kernel in Trace, you need to disable CUDA Graph when starting the service. This can be done by using the `--disable-cuda-graph` parameter in the command to start the service.
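For illustration, here is a minimal Python sketch of the workflow described above, driven via `subprocess`; the model path and the readiness wait are placeholders/assumptions, not taken from the original docs:

```python
# Hedged sketch: start the server without CUDA Graph so trace kernels can be
# mapped back to SGLang Python source, then run a tiny benchmark so the trace
# file stays small enough for the browser to open smoothly.
import subprocess
import time

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model path

server = subprocess.Popen([
    "python3", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--disable-cuda-graph",  # required to locate Python source from CUDA kernels
])

time.sleep(120)  # crude wait; poll the server's health endpoint in real code

subprocess.run([
    "python3", "-m", "sglang.bench_serving",
    "--backend", "sglang",
    "--num-prompts", "2",            # only two prompts
    "--sharegpt-output-len", "100",  # cap output length at 100 tokens
], check=True)

server.terminate()
```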
## Profile with Nsight
@@ -35,7 +35,7 @@ SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unitt
## Writing Documentation & Running Docs CI
We recommend new contributors start from writing documentation, which helps you quickly understand the SGLang codebase. For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).
We recommend new contributors start from writing documentation, which helps you quickly understand SGLang codebase. For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).
## Tips for Newcomers
@@ -69,13 +69,13 @@ If you encounter errors when starting the server, ensure the weights have finish
The DeepSeek series has huge model weights, so it takes some time to compile the model with `torch.compile` for the first time if you have added the flag `--enable-torch-compile`. You can refer to [here](https://docs.sglang.ai/backend/hyperparameter_tuning.html#try-advanced-options) to optimize the caching of compilation results, so that the cache can be reused to speed up the next startup.
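One possible way to persist the compilation cache is sketched below; the environment variable and paths are assumptions on my part, so treat the linked tuning guide as authoritative:

```python
# Hedged sketch (assumption): point torch.compile's Inductor cache at a
# persistent directory so a later startup can reuse compiled artifacts.
import os
import subprocess

os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/data/torch_compile_cache"  # persistent location

subprocess.Popen([
    "python3", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V3",  # placeholder checkpoint
    "--tp", "8",
    "--trust-remote-code",
    "--enable-torch-compile",  # first start compiles; later starts can reuse the cache
])
```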
### Launch with one node of 8 x H200
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that Deepseek V3 is already in FP8**. So we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that Deepseek V3 is already in FP8. So we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
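As a quick, hedged sketch of that note (the linked example uses Docker; apart from the omitted quantization arguments, the flags below are illustrative assumptions):

```python
# Hedged sketch: DeepSeek V3 weights are already FP8, so the launch command
# should not pass --quantization fp8 or --kv-cache-dtype fp8_e5m2.
import subprocess

subprocess.Popen([
    "python3", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V3",
    "--tp", "8",                # one node of 8 GPUs
    "--trust-remote-code",
    # intentionally no quantization flags here
])
```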
### Running examples on Multi-node
- [Serving with two H20*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes).
- [Serving with two H200*8 nodes and Docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker).
- [Serving with two H200*8 nodes and docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker).
- [Serving with four A100*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes).
@@ -89,13 +89,13 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended).
- **MLA Attention Backends**: Currently SGLang supports different optimized MLA attention backends, including [FlashAttention3](https://github.com/Dao-AILab/flash-attention), [Flashinfer](https://docs.flashinfer.ai/api/mla.html), [FlashMLA](https://github.com/deepseek-ai/FlashMLA), [CutlassMLA](https://github.com/sgl-project/sglang/pull/5390), and [Triton](https://github.com/triton-lang/triton) backends. The default FA3 provides good performance across wide workloads.
- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented the Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
- **CUDA Graph & torch.compile**: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and torch.compile, which reduces latency and accelerates decoding speed for small batch sizes.
- **CUDA Graph & Torch.compile**: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and Torch.compile, which reduces latency and accelerates decoding speed for small batch sizes.
- **Chunked Prefix Cache**: Chunked prefix cache optimization can increase throughput by cutting prefix cache into chunks, processing them with multi-head attention and merging their states. Its improvement can be significant when doing chunked prefill on long sequences. Currently this optimization is only available for the FlashAttention3 backend.
- **Chunked Prefix Cache**: Chunked prefix cache optimization can increase throughput by cutting prefix cache into chunks, processing them with multi-head attention and merging their states. Its improvement can be significant when doing chunked prefill on long sequences. Currently this optimization is only available for FlashAttention3 backend.
Overall, with these optimizations, we have achieved up to a **7x** acceleration in output throughput compared to the previous version.
Overall, with these optimizations, we have achieved up to **7x** acceleration in output throughput compared to the previous version.
<p align="center">
<img src="https://lmsys.org/images/blog/sglang_v0_3/deepseek_mla.svg" alt="Multi-head Latent Attention for DeepSeek Series Models">
@@ -113,7 +113,7 @@ Overall, with these optimizations, we have achieved up to a **7x** acceleration
<img src="https://lmsys.org/images/blog/sglang_v0_4/dp_attention.svg" alt="Data Parallelism Attention for DeepSeek Series Models">
</p>
With data parallelism attention enabled, we have achieved up to a **1.9x** decoding throughput improvement compared to the previous version.
With data parallelism attention enabled, we have achieved up to **1.9x** decoding throughput improvement compared to the previous version.
<p align="center">
<img src="https://lmsys.org/images/blog/sglang_v0_4/deepseek_coder_v2.svg" alt="Data Parallelism Attention Performance Comparison">
@@ -150,7 +150,7 @@ python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --tru
The precompilation process typically takes around 10 minutes to complete.
### Multi-token Prediction
**Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 respectively with H200 TP8 setting.
**Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 respectively on H200 TP8 setting.
**Usage**:
Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
@@ -161,7 +161,7 @@ python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --spec
- The FlashAttention3 and Triton backends fully support MTP usage. For the FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding, the `--speculative-eagle-topk` parameter should be set to `1`. MTP support for the FlashMLA and CutlassMLA backends is still under development.
- To enable DeepSeek MTP for large batch sizes (>32), some parameters should be changed, as sketched after this list (reference: [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)):
- Adjust `--max-running-requests` to a larger number. The default value is `32` for MTP; for larger batch sizes, increase it beyond that default.
- Set `--cuda-graph-bs`. It is a list of batch sizes for CUDA graph capture. The default captured batch sizes for speculative decoding is set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can include more batch sizes into it.
- Set `--cuda-graph-bs`. It's a list of batch sizes for cuda graph capture. The default captured batch sizes for speculative decoding is set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can include more batch sizes into it.
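Putting the notes above together, a hedged sketch; the speculative values, batch limit, and capture sizes are illustrative assumptions rather than recommendations:

```python
# Hedged sketch: enable DeepSeek MTP (EAGLE-style speculative decoding) and
# widen the defaults for larger batch sizes.
import subprocess

subprocess.Popen([
    "python3", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V3-0324",
    "--tp", "8",
    "--trust-remote-code",
    "--speculative-algorithm", "EAGLE",    # assumed algorithm name per the MTP docs
    "--speculative-num-steps", "1",
    "--speculative-eagle-topk", "1",       # keep topk=1 (required for the FlashInfer backend)
    "--speculative-num-draft-tokens", "2",
    "--max-running-requests", "64",        # raise above the MTP default of 32
    # assumed to take a space-separated list of capture sizes; check --help for the exact format
    "--cuda-graph-bs", "1", "2", "4", "8", "16", "32", "48", "64",
])
```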
### Reasoning Content for DeepSeek R1
@@ -209,7 +209,7 @@ data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\"}"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":null}}], "finish_reason": "tool_calls"}
data: [DONE]
```
The client needs to concatenate all fragments of the tool call arguments to reconstruct the complete tool call:
The client needs to concatenate all arguments fragments to reconstruct the complete tool call:
```
{"city": "Qingdao"}
```
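A hedged client-side sketch of that concatenation step follows; it assumes a single streamed tool call, and a real client should additionally key fragments by tool-call index:

```python
# Accumulate streamed tool-call argument fragments (as in the SSE chunks above)
# and parse them once the stream has finished.
import json

def collect_tool_call_arguments(chunks):
    """chunks: JSON objects already parsed from each `data:` line of the stream."""
    fragments = []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            for call in (choice.get("delta", {}).get("tool_calls") or []):
                piece = call.get("function", {}).get("arguments")
                if piece:
                    fragments.append(piece)
    return json.loads("".join(fragments))

# Fragments split across chunks, mirroring the stream shown above:
stream = [
    {"choices": [{"delta": {"tool_calls": [{"function": {"arguments": "{\"city\": \""}}]}}]},
    {"choices": [{"delta": {"tool_calls": [{"function": {"arguments": "Qingdao"}}]}}]},
    {"choices": [{"delta": {"tool_calls": [{"function": {"arguments": "\"}"}}]}}]},
    {"choices": [{"delta": {"tool_calls": None}}], "finish_reason": "tool_calls"},
]
print(collect_tool_call_arguments(stream))  # -> {'city': 'Qingdao'}
```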
@@ -223,4 +223,4 @@ Important Notes:
1. **Question**: What should I do if model loading takes too long and NCCL timeout occurs?
**Answer**: You can try to add `--dist-timeout 3600` when launching the model, this allows for a 1-hour timeout.
**Answer**: You can try to add `--dist-timeout 3600` when launching the model, this allows for 1-hour timeout.
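For example, a hedged sketch of a launch command with the longer timeout (model path and parallelism are placeholders):

```python
# Hedged sketch: give distributed initialization / weight loading up to one hour.
import subprocess

subprocess.Popen([
    "python3", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V3",  # placeholder large model
    "--tp", "8",
    "--trust-remote-code",
    "--dist-timeout", "3600",  # timeout in seconds (1 hour)
])
```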
@@ -330,7 +330,7 @@ This should resolve most NCCL-related issues.
## Remaining issues
* In Kubernetes, Docker, or Containerd environments, we use hostNetwork to prevent performance degradation.
* We utilize privileged mode, which isn't secure. Additionally, in containerized environments, full GPU isolation cannot be achieved.
* We utilize privileged mode, which isnt secure. Additionally, in containerized environments, full GPU isolation cannot be achieved.
## TODO
@@ -25,4 +25,4 @@ docker run --gpus all \
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```
Note that ModelScope uses a different cache directory than HuggingFace. You may need to set it manually to avoid running out of disk space.
Note that modelscope uses a different cache directory than huggingface. You may need to set it manually to avoid running out of disk space.
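A hedged sketch of setting the cache location before launching; the environment variable names are assumptions about the ModelScope/SGLang integration, so verify them against your setup:

```python
# Hedged sketch: route downloads through ModelScope and keep its cache on a
# disk with enough free space.
import os
import subprocess

os.environ["SGLANG_USE_MODELSCOPE"] = "true"         # assumed switch to ModelScope downloads
os.environ["MODELSCOPE_CACHE"] = "/data/modelscope"  # assumed cache-directory variable

subprocess.Popen([
    "python3", "-m", "sglang.launch_server",
    "--model-path", "Qwen/Qwen2.5-7B-Instruct",
    "--host", "0.0.0.0",
    "--port", "30000",
])
```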
@@ -23,7 +23,7 @@ uv pip install "sglang[all]>=0.4.6.post3"
1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
- If you encounter `ImportError; cannot import name 'is_valid_list_of_images' from 'transformers.models.llama.image_processing_llama'`, try to use the specified version of `transformers` in [pyproject.toml](https://github.com/sgl-project/sglang/blob/main/python/pyproject.toml). Currently, run `pip install transformers==4.51.1`.
- If you encounter `ImportError; cannot import name 'is_valid_list_of_images' from 'transformers.models.llama.image_processing_llama'`, try to use the specified version of `transformers` in [pyproject.toml](https://github.com/sgl-project/sglang/blob/main/python/pyproject.toml). Currently, just running `pip install transformers==4.51.1`.
## Method 2: From source
@@ -54,10 +54,10 @@ cd ..
pip install -e "python[all_hip]"
```
## Method 3: Using Docker
## Method 3: Using docker
The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your HuggingFace hub [token](https://huggingface.co/docs/hub/en/security-tokens).
Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
docker run --gpus all \
@@ -89,7 +89,7 @@ drun -p 30000:30000 \
drun v0.4.6.post3-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
```
## Method 4: Using Docker Compose
## Method 4: Using docker compose
<details>
<summary>More</summary>
@@ -164,4 +164,4 @@ sky status --endpoint 30000 sglang
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install sglang[srt]`. `srt` is the abbreviation of SGLang runtime.
- To reinstall FlashInfer locally, use the following command: `pip install "flashinfer-python==0.2.5" -i https://flashinfer.ai/whl/cu124/torch2.6 --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
- To reinstall flashinfer locally, use the following command: `pip install "flashinfer-python==0.2.5" -i https://flashinfer.ai/whl/cu124/torch2.6 --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
@@ -28,7 +28,7 @@ The `engine` folder contains examples that show how to use [Offline Engine
## Hidden States
The `hidden_states` folder contains examples on how to extract hidden states using SGLang. Please note that this might degrade throughput due to CUDA graph rebuilding.
The `hidden_states` folder contains examples on how to extract hidden states using SGLang. Please note that this might degrade throughput due to cuda graph rebuilding.
* `hidden_states_engine.py`: An example of how to extract hidden states using the Engine API (a condensed sketch follows below).
* `hidden_states_server.py`: An example of how to extract hidden states using the Server API.
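A condensed, hedged sketch of the Engine-API flow; the exact keyword argument and output layout are assumptions here, so treat `hidden_states_engine.py` as the authoritative version:

```python
# Hedged sketch: request hidden states from the offline Engine. Requesting them
# may trigger a CUDA graph recapture, so avoid alternating between calls with
# and without return_hidden_states.
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")  # placeholder model

prompts = ["The capital of France is"]
sampling_params = {"temperature": 0.0, "max_new_tokens": 8}

outputs = llm.generate(prompts, sampling_params, return_hidden_states=True)

for out in outputs:
    # Assumption: hidden states are returned alongside the text in meta_info.
    print(out["text"], len(out["meta_info"]["hidden_states"]))

llm.shutdown()
```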
@@ -3,7 +3,7 @@ Usage:
python hidden_states.py
Note that each time you change the `return_hidden_states` parameter,
the CUDA graph will be recaptured, which might lead to a performance hit.
the cuda graph will be recaptured, which might lead to a performance hit.
So avoid getting hidden states and completions alternately.
"""
@@ -4,7 +4,7 @@ Usage:
python hidden_states_server.py
Note that each time you change the `return_hidden_states` parameter,
the CUDA graph will be recaptured, which might lead to a performance hit.
the cuda graph will be recaptured, which might lead to a performance hit.
So avoid getting hidden states and completions alternately.
"""
@@ -68,7 +68,7 @@ blackwell = [
]
# HIP (Heterogeneous-computing Interface for Portability) for AMD
# => base docker rocm/vllm-dev:20250114, not from public vLLM whl
# => base docker rocm/vllm-dev:20250114, not from public vllm whl
srt_hip = [
"sglang[runtime_common]",
"torch",
@@ -76,7 +76,7 @@ srt_hip = [
"outlines==0.1.11"
]
# xpu is not enabled in public vLLM and torch whl,
# xpu is not enabled in public vllm and torch whl,
# need to follow https://docs.vllm.ai/en/latest/getting_started/xpu-installation.html to install vllm
srt_xpu = ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11"]
@@ -84,8 +84,8 @@ srt_xpu = ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11"]
# https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html
srt_hpu = ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11"]
# CPU: currently, there are no pre-built vLLM wheels for CPU.
# To install vLLM for CPU, please follow the instruction here:
# CPU: currently, there are no pre-built vllm wheels for CPU.
# To install vllm for CPU, please follow the instruction here:
# https://docs.vllm.ai/en/latest/getting_started/installation/cpu/index.html
srt_cpu = ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11", "torch"]
# https://vllm-ascend.readthedocs.io/en/latest/installation.html
@@ -129,7 +129,7 @@ def launch_server_process_and_send_one_request(
def refine_server_args(server_args: ServerArgs, compile_args: CompileArgs):
# Disable CUDA graph and torch compile to save time
# Disable cuda graph and torch compile to save time
server_args.disable_cuda_graph = True
server_args.enable_torch_compile = False
print(f"Disable CUDA Graph and Torch Compile to save time...")
@@ -12,7 +12,7 @@ use_vllm_custom_allreduce = get_bool_env_var(
)
if not is_hpu():
# ROCm does not use vLLM custom allreduce
# ROCm does not use vllm custom allreduce
if use_vllm_custom_allreduce and not is_hip():
try:
import vllm._C
@@ -53,7 +53,7 @@ class ChatGLMConfig(PretrainedConfig):
self.kv_channels = kv_channels
self.num_attention_heads = num_attention_heads
self.seq_length = seq_length
# It is to be compatible with long LoRA.
# It is to be compatible with long lora.
self.max_position_embeddings = seq_length
self.hidden_dropout = hidden_dropout
self.attention_dropout = attention_dropout
@@ -29,7 +29,7 @@ class LoadFormat(str, enum.Enum):
class LoadConfig:
"""
download_dir: Directory to download and load the weights, default to the
default cache directory of HuggingFace.
default cache directory of huggingface.
load_format: The format of the model weights to load:
"auto" will try to load the weights in the safetensors format and
fall back to the pytorch bin format if safetensors format is
@@ -172,7 +172,7 @@ class CustomAllreduce:
if not custom_ar:
# disable because of missing custom allreduce library
# e.g. in a non-CUDA environment
# e.g. in a non-cuda environment
return
self.group = group
@@ -389,11 +389,11 @@ class CustomAllreduce:
if _is_hip:
handle, offset = ops.get_graph_buffer_ipc_meta(self._ptr)
handles, offsets = self._gather_ipc_meta((bytes(handle), offset))
logger.info("Registering %d CUDA graph addresses", len(offset))
logger.info("Registering %d cuda graph addresses", len(offset))
ops.register_graph_buffers(self._ptr, handles, offsets)
else:
handle, offset = ops.get_graph_buffer_ipc_meta(self._ptr)
logger.info("Registering %d CUDA graph addresses", len(offset))
logger.info("Registering %d cuda graph addresses", len(offset))
# We cannot directly use `dist.all_gather_object` here
# because it is incompatible with `gloo` backend under inference mode.
# see https://github.com/pytorch/pytorch/issues/126032 for details.
@@ -435,7 +435,7 @@ class CustomAllreduce:
return False
# all reduce, assuming inp tensor is IPC registered with register_buffer,
# or, in the context of CUDA graphs, register_graph_buffers
# or, in the context of cuda graphs, register_graph_buffers
def all_reduce_reg(self, inp: torch.Tensor, out: torch.Tensor = None):
if out is None:
out = torch.empty_like(inp)
@@ -473,7 +473,7 @@ class CustomAllreduce:
return out
def custom_all_reduce(self, input: torch.Tensor) -> Optional[torch.Tensor]:
"""The main allreduce API that provides support for CUDA graph."""
"""The main allreduce API that provides support for cuda graph."""
# When custom allreduce is disabled, this will be None.
if self.disabled or not self.should_custom_ar(input):
return None
@@ -489,7 +489,7 @@ class CustomAllreduce:
return torch.empty_like(input)
else:
if _is_hip:
# note: outside of CUDA graph context,
# note: outside of cuda graph context,
# custom allreduce incurs a cost of cudaMemcpy, which should
# be small(<=1% of overall latency) compared to the performance
# gains of using custom kernels
@@ -121,14 +121,14 @@ def can_actually_p2p(
Therefore, we have to perform a real P2P access to check if it is actually
possible.
Note on p2p and CUDA IPC:
Note on p2p and cuda IPC:
Usually, one process uses one GPU:
GPU src --> CUDA context src --> tensor src --> process src
GPU src --> cuda context src --> tensor src --> process src
We need to combine p2p and CUDA IPC, so that:
GPU src --> CUDA context src --> tensor src --> process src
We need to combine p2p and cuda IPC, so that:
GPU src --> cuda context src --> tensor src --> process src
|shared|
GPU tgt --> CUDA context tgt --> tensor tgt --> process tgt
GPU tgt --> cuda context tgt --> tensor tgt --> process tgt
That is to say, process src creates a tensor in GPU src, passes IPC handle to
process tgt, and process tgt accesses the tensor in GPU tgt. Any operation on the
tensor in process tgt will be reflected in the tensor in process src, because
@@ -201,9 +201,9 @@ def can_actually_p2p(
# then all the processes can read the cache file to check the p2p access status.
# Note that the cache file is suffixed by the CUDA_VISIBLE_DEVICES, so that we
# can have different cache files for different CUDA_VISIBLE_DEVICES settings,
# e.g. used by different vLLM engines. The device id in the cache file is a
# e.g. used by different vllm engines. The device id in the cache file is a
# **local** device id, i.e. from 0 to num_dev-1, where num_dev is the number
# of visible devices in the vLLM engine.
# of visible devices in the vllm engine.
_gpu_p2p_access_cache: Optional[Dict[str, bool]] = None
@@ -104,7 +104,7 @@ class PyNcclCommunicator:
self.device = device
# nccl communicator and stream will use this device
# `torch.cuda.device` is a context manager that changes the
# current CUDA device to the specified one
# current cuda device to the specified one
with torch.cuda.device(device):
self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
self.world_size, self.unique_id, self.rank
@@ -6,7 +6,7 @@
# 1. We tried to use `cupy`, it calls NCCL correctly, but `cupy` itself
# often gets stuck when initializing the NCCL communicator.
# 2. We tried to use `torch.distributed`, but `torch.distributed.all_reduce`
# contains many other potential CUDA APIs, that are not allowed during
# contains many other potential cuda APIs, that are not allowed during
# capturing the CUDA graph. For further details, please check
# https://discuss.pytorch.org/t/pytorch-cudagraph-with-nccl-operation-failed/ .
#
@@ -170,7 +170,7 @@ class GroupCoordinator:
GroupCoordinator takes charge of all the communication operations among
the processes in the group. It can route the communication to
a specific implementation (e.g. switch allreduce implementation
based on the tensor size and CUDA graph mode).
based on the tensor size and cuda graph mode).
"""
# available attributes:
@@ -127,7 +127,7 @@ CONTEXT_LENGTH_KEYS = [
def get_context_length(config):
"""Get the context length of a model from a HuggingFace model configs."""
"""Get the context length of a model from a huggingface model configs."""
text_config = config
rope_scaling = getattr(text_config, "rope_scaling", None)
if rope_scaling: