Deterministic inference ensures consistent LLM outputs across runs, which is critical for:
- **Reinforcement Learning**: Ensures consistent logprobs across runs, reducing stochastic noise and making RL training more stable, reproducible, and debuggable.
- **Production**: Improves reliability and user experience.
Even with `temperature=0`, standard LLM inference can produce different outputs due to dynamic batching and varying reduction orders in GPU kernels.
## The Root Cause of Non-Determinism
The main source is **varying batch sizes**. Different batch sizes cause GPU kernels to split reduction operations differently, leading to different addition orders. Due to floating-point non-associativity (`(a + b) + c ≠ a + (b + c)`), this produces different results even for identical inputs.
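The effect is easy to reproduce even on the CPU. The snippet below is a toy NumPy illustration (not SGLang code): it sums the same float32 values in two different orders, mirroring how a GPU kernel may split a reduction differently depending on batch size.

```python
import numpy as np

# Toy illustration: summing the same numbers in a different order can change
# the float32 result, which is what happens when a GPU kernel splits a
# reduction differently for different batch sizes.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# Order 1: single pass over the whole array.
full_sum = x.sum(dtype=np.float32)

# Order 2: sum 1000 chunks first, then sum the partial results.
partials = np.array(
    [c.sum(dtype=np.float32) for c in np.array_split(x, 1000)],
    dtype=np.float32,
)
chunked_sum = partials.sum(dtype=np.float32)

print(full_sum, chunked_sum)
print("bitwise identical:", full_sum == chunked_sum)  # typically False
```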
## SGLang's Solution
Building on [Thinking Machines Lab's batch-invariant operators](https://github.com/thinking-machines-lab/batch_invariant_ops), SGLang achieves fully deterministic inference while maintaining compatibility with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. The development roadmap for deterministic inference features can be found in this [issue](https://github.com/sgl-project/sglang/issues/10278).
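Conceptually, a batch-invariant operator fixes the reduction strategy so that each request's computation, and therefore its addition order, does not depend on which other requests happen to share the batch. The sketch below illustrates the idea with a row-wise sum in NumPy using a fixed chunk size; it is a conceptual toy, not SGLang's actual GPU kernels, which apply the same principle to reductions inside matmul and attention.

```python
import numpy as np

def rowwise_sum_batch_invariant(batch: np.ndarray, chunk: int = 256) -> np.ndarray:
    """Sum each row in fixed-size chunks, in a fixed left-to-right order.

    The split points depend only on the row length, never on the batch size,
    so a given row yields a bit-identical result regardless of how many other
    rows are batched alongside it.
    """
    out = np.zeros(batch.shape[0], dtype=np.float32)
    for start in range(0, batch.shape[1], chunk):
        out += batch[:, start:start + chunk].sum(axis=1, dtype=np.float32)
    return out

rng = np.random.default_rng(0)
row = rng.standard_normal((1, 4096)).astype(np.float32)
others = rng.standard_normal((7, 4096)).astype(np.float32)

alone = rowwise_sum_batch_invariant(row)[0]
batched = rowwise_sum_batch_invariant(np.vstack([row, others]))[0]
print("bitwise identical across batch sizes:", alone == batched)  # True by construction
```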
### Supported Backends
Deterministic inference is only supported with the following three attention backends: **FlashInfer**, **FlashAttention 3 (FA3)**, and **Triton**.
The following table shows feature compatibility for deterministic inference across different attention backends:
### Deterministic Inference with Non-Greedy Sampling (Temperature > 0)
SGLang supports deterministic inference even with non-greedy sampling by using sampling seeds. This is particularly useful for reinforcement learning scenarios like GRPO (Group Relative Policy Optimization) where you need multiple diverse but reproducible responses.
#### Default Behavior
By default, SGLang uses a sampling seed of `42` for reproducible sampling:
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Tell me a joke",
        "sampling_params": {
            "temperature": 0.8,  # Non-greedy sampling
            "max_new_tokens": 128,
        },
    },
)
print(response.json())
# This will always produce the same response across runs
```
#### Generating Multiple Reproducible Responses
To sample different responses from the same prompt while maintaining reproducibility (e.g., for GRPO training), provide different sampling seeds in your requests:
```python
import requests

# Prepare a list of sampling seeds for different responses
sampling_seeds = [42, 43, 44, 45, 46]

responses = []
for seed in sampling_seeds:
    response = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "Tell me a joke",
            "sampling_params": {
                "temperature": 0.8,
                "max_new_tokens": 128,
                "sampling_seed": seed,  # Specify sampling seed
            },
        },
    )
    responses.append(response.json())

# Each seed will produce a different but reproducible response
# Using the same seed will always produce the same response
```
This approach ensures that:
- Different seeds produce diverse responses
- The same seed always produces the same response across different runs
- Results are reproducible for debugging and evaluation, as the quick client-side check sketched below illustrates
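As a quick sanity check of these guarantees from the client side, the snippet below (a sketch assuming a server at `http://localhost:30000` whose `/generate` response contains a `text` field) issues the same seeded request twice and a differently seeded request once, then compares the outputs.

```python
import requests

def generate(seed: int) -> str:
    # Hypothetical helper: one seeded request against a local SGLang server.
    resp = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "Tell me a joke",
            "sampling_params": {
                "temperature": 0.8,
                "max_new_tokens": 128,
                "sampling_seed": seed,
            },
        },
    )
    return resp.json()["text"]

# Same seed twice -> expect identical text; different seed -> usually different.
a, b, c = generate(42), generate(42), generate(43)
print("same seed reproducible:", a == b)
print("different seed diverse:", a != c)
```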
## Verification
Run deterministic tests to verify consistent outputs:
```bash
# Single test: same prompt, varying batch sizes
python3 -m sglang.test.test_deterministic --test-mode single --n-trials 50
# Prefix test: prompts with different prefix lengths