"src/libtorchaudio/sox/io.cpp" did not exist on "f234e51ff8e5c8b6a7971b608b72f0e4eb602db2"
Unverified commit dda34c2f authored by Lianmin Zheng, committed by GitHub

Fix mem fraction static for nightly tests (#11076)

parent 4eeaff74
@@ -23,7 +23,7 @@ The case of a server being too conservative can happen when users send many requ
 On the other hand, if you see `token usage` very high and you frequently see warnings like
 `KV cache pool is full. Retract requests. #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
-If you see `KV cache pool is full. Retract requests.` occasionally but not frequently, it is okay.
+If you see `KV cache pool is full. Retract requests.` occasionally but not frequently (~1 time per minute), it is okay.
 ### Tune `--mem-fraction-static` to increase KV cache pool capacity
 SGLang allocates memory as follows:
...
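For readers following along, here is a minimal sketch of applying the `--schedule-conservativeness` advice above through SGLang's offline `Engine` API. It assumes (as do the other sketches below) that server flags map to same-named constructor keyword arguments; the model path is only a placeholder.

```python
# Hedged sketch: raise schedule conservativeness when the
# "KV cache pool is full. Retract requests." warning is frequent.
import sglang as sgl

engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    schedule_conservativeness=1.3,  # default 1.0; higher retracts less often
)
print(engine.generate("Hello, my name is", {"max_new_tokens": 16}))
engine.shutdown()
```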
@@ -9,7 +9,7 @@ If you encounter out-of-memory (OOM) errors, you can adjust the following parame
 - If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
 - If OOM occurs during decoding, try lowering `--max-running-requests`.
-- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
+- You can also decrease `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
 - Another common case for OOM is requesting input logprobs for a long prompt as it requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, try reducing `--mem-fraction-static`.
 ### CUDA Error: Illegal Memory Access Encountered
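The OOM mitigations in this hunk translate directly into launch-time settings. A hedged sketch under the same assumed flag-to-kwarg mapping (values are examples, not recommendations):

```python
# Hedged sketch of the OOM mitigations listed above.
import sglang as sgl

engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    mem_fraction_static=0.8,    # shrink the KV cache memory pool
    chunked_prefill_size=4096,  # smaller prefill chunks save activation memory
    max_running_requests=32,    # bound concurrency during decoding
)
```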
@@ -17,6 +17,12 @@ This error may result from kernel errors or out-of-memory issues:
 - If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
 - If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.
+
+### The server hangs
+- If the server hangs during initialization or while running, the cause can be a memory issue (out of memory), a network issue (NCCL errors), or another bug in SGLang.
+- If it is out of memory, you might see that `avail mem` is very low during or right after initialization. In this case, try decreasing `--mem-fraction-static`, `--cuda-graph-max-bs`, or `--chunked-prefill-size`.
+- For other bugs, please raise an issue on GitHub.
 ## Frequently Asked Questions
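The low `avail mem` symptom described in the new section can be checked with plain PyTorch; this is a generic sketch, not an SGLang API:

```python
# Generic check for the low-free-memory symptom described above.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()  # current CUDA device
print(f"avail mem: {free_bytes / 1024**3:.2f} / {total_bytes / 1024**3:.2f} GiB")
# If this is near zero during or right after server initialization, try
# decreasing --mem-fraction-static, --cuda-graph-max-bs, or --chunked-prefill-size.
```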
@@ -28,8 +34,6 @@ From our initial investigation, this indeterminism arises from two factors: dyna
 To achieve more deterministic outputs in the current code, you can add `--disable-radix-cache` and send only one request at a time. The results will be mostly deterministic under this setting.
-We are still investigating the root causes and potential solutions. In the short term, we may introduce a "deterministic mode" that uses more padding to address the variance caused by dynamic batching. This mode will be more deterministic but slower.
-We have two issues to track our progress:
-- The deterministic mode is tracked at [https://github.com/sgl-project/sglang/issues/1729](https://github.com/sgl-project/sglang/issues/1729).
-- The per-request random seed is tracked at [https://github.com/sgl-project/sglang/issues/1335](https://github.com/sgl-project/sglang/issues/1335).
+**Note**:
+Recently, we also introduced a deterministic mode; you can enable it with `--enable-deterministic-inference`. It might not work for all cases.
+Please find more details in this blog post: https://lmsys.org/blog/2025-09-22-sglang-deterministic/
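A sketch of the two determinism options mentioned above; `--enable-deterministic-inference` and `--disable-radix-cache` are the flags named in this diff and the surrounding docs, and mapping them to `Engine` keyword arguments is an assumption:

```python
# Hedged sketch: more deterministic outputs, per the FAQ above.
import sglang as sgl

engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    disable_radix_cache=True,             # avoid prefix-cache-dependent batching
    enable_deterministic_inference=True,  # new mode; may not cover all cases
)
# Sending one request at a time keeps dynamic-batching effects out of the picture.
print(engine.generate("2 + 2 =", {"temperature": 0, "max_new_tokens": 4}))
```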
@@ -67,7 +67,7 @@ from sglang.srt.mem_cache.swa_radix_cache import SWARadixCache
 from sglang.srt.metrics.collector import SchedulerMetricsCollector, TimeStats
 from sglang.srt.model_executor.forward_batch_info import CaptureHiddenMode, ForwardMode
 from sglang.srt.sampling.sampling_batch_info import SamplingBatchInfo
-from sglang.srt.sampling.sampling_params import DEFAULT_SAMPLING_SEED, SamplingParams
+from sglang.srt.sampling.sampling_params import SamplingParams
 from sglang.srt.server_args import ServerArgs
 from sglang.srt.utils import flatten_nested_list, support_triton
...
@@ -1482,7 +1482,8 @@ class ModelRunner:
         if self.max_total_num_tokens <= 0:
             raise RuntimeError(
-                "Not enough memory. Please try to increase --mem-fraction-static."
+                f"Not enough memory. Please try to increase --mem-fraction-static. "
+                f"Current value: {self.server_args.mem_fraction_static=}"
             )
         # Initialize req_to_token_pool
...
@@ -19,7 +19,6 @@ from sglang.srt.utils import get_bool_env_var
 _SAMPLING_EPS = 1e-6
 TOP_K_ALL = 1 << 30
-DEFAULT_SAMPLING_SEED = 42

 class SamplingParams:
@@ -56,7 +55,7 @@ class SamplingParams:
         custom_params: Optional[Dict[str, Any]] = None,
         stream_interval: Optional[int] = None,
         logit_bias: Optional[Dict[str, float]] = None,
-        sampling_seed: Optional[int] = None,
+        sampling_seed: int = 42,
     ) -> None:
         self.max_new_tokens = max_new_tokens
         self.stop_strs = stop
@@ -84,13 +83,6 @@ class SamplingParams:
         self.custom_params = custom_params
         self.stream_interval = stream_interval
         self.logit_bias = logit_bias
-        # Used for deterministic sampling
-        if (
-            get_bool_env_var("SGLANG_ENABLE_DETERMINISTIC_INFERENCE")
-            and sampling_seed is None
-        ):
-            # If deterministic inference is enabled and sampling_seed is not set, use the default seed
-            sampling_seed = DEFAULT_SAMPLING_SEED
         self.sampling_seed = sampling_seed

         # Process some special cases
...
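The net effect of these two hunks: `sampling_seed` now defaults to 42 unconditionally instead of being `None` unless `SGLANG_ENABLE_DETERMINISTIC_INFERENCE` is set. A small sketch of the new behavior (`temperature` is a standard `SamplingParams` field not shown in this hunk):

```python
from sglang.srt.sampling.sampling_params import SamplingParams

# After this commit, omitting sampling_seed yields the fixed default of 42.
params = SamplingParams(max_new_tokens=64, temperature=0.7)
assert params.sampling_seed == 42

# An explicit per-request seed still overrides the default.
params = SamplingParams(max_new_tokens=64, temperature=0.7, sampling_seed=1234)
```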
@@ -618,7 +618,7 @@ class ServerArgs:
         if self.mem_fraction_static is None:
             # Constant metadata (e.g., from attention backend)
-            reserved_mem = 1024
+            reserved_mem = 512
             # For activation during large prefill
             if self.chunked_prefill_size > 0:
                 reserved_mem += max(self.chunked_prefill_size, 2048) * 1.5
@@ -627,7 +627,7 @@ class ServerArgs:
             # For cuda graphs
             reserved_mem += self.cuda_graph_max_bs * 2
             # Some adjustments for large parallel size
-            reserved_mem += self.tp_size * self.pp_size / 4 * 1024
+            reserved_mem += self.tp_size * self.pp_size / 8 * 1024
             if self.enable_dp_attention:
                 # DP attention needs more padding for some operations
...
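To see what the new constants change, here is a standalone re-computation of the heuristic with example values (all in MB); the final conversion of `reserved_mem` into a fraction happens outside this hunk, so the last comment is an assumption:

```python
# Standalone re-computation of the reserved-memory heuristic above (MB).
def estimate_reserved_mem_mb(chunked_prefill_size=8192, cuda_graph_max_bs=160,
                             tp_size=8, pp_size=1):
    reserved_mem = 512  # constant metadata (was 1024 before this commit)
    if chunked_prefill_size > 0:
        reserved_mem += max(chunked_prefill_size, 2048) * 1.5  # prefill activations
    reserved_mem += cuda_graph_max_bs * 2  # cuda graph buffers
    reserved_mem += tp_size * pp_size / 8 * 1024  # parallelism padding (was / 4)
    return reserved_mem

print(estimate_reserved_mem_mb())  # 512 + 12288 + 320 + 1024 = 14144.0
# Assumed final step (outside this hunk), roughly:
#   mem_fraction_static = (gpu_mem_mb - reserved_mem) / gpu_mem_mb
```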
@@ -216,7 +216,7 @@ def _run_sglang_subprocess(
     del hf_model
     hf_model = None
     torch.cuda.empty_cache()
-    time.sleep(5)
+    time.sleep(3)
     torch.cuda.empty_cache()
     _curr_usage = get_gpu_memory_gb(rank)
     assert (
...
@@ -63,10 +63,15 @@ class TestNightlyGsm8KEval(unittest.TestCase):
         for model in model_group:
             model_count += 1
             with self.subTest(model=model):
+                other_args = ["--tp", "2"] if is_tp2 else []
+                if model == "meta-llama/Llama-3.1-70B-Instruct":
+                    other_args.extend(["--mem-fraction-static", "0.9"])
                 process = popen_launch_server(
                     model=model,
+                    other_args=other_args,
                     base_url=self.base_url,
-                    other_args=["--tp", "2"] if is_tp2 else [],
                     timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
                 )
...