## Penalties. See the [Performance Implications on Penalties] section below for more information.
# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
frequency_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat
# tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
presence_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the prompt and the generated text
# so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to
# repeat tokens. Must be 0 <= value <= 2. Setting to 1 (default) will disable this penalty.
repetition_penalty: float = 1.0,
# Guides inference to generate at least this number of tokens by penalizing the logits of the
# tokenizer's EOS token and all `stop_token_ids` to -inf until the output reaches the given length.
# Note that any of the `stop` strings may still be generated before reaching `min_new_tokens`,
# since it is difficult to infer the exact token IDs that produce a given `stop` string.
# Must be 0 <= value < max_new_tokens. Setting to 0 (default) will disable this penalty.
min_new_tokens: int = 0,
```
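To make the semantics concrete, here is a minimal standalone sketch of how these penalties typically adjust next-token logits. It follows the common OpenAI-style additive formulas for `frequency_penalty`/`presence_penalty` and the Hugging Face-style multiplicative rule for `repetition_penalty`; the server's actual CUDA implementation may differ in detail:

```python
import torch

def apply_penalties_sketch(
    logits: torch.Tensor,         # [vocab_size] next-token logits
    output_counts: torch.Tensor,  # [vocab_size] count of each token in the output so far
    prompt_mask: torch.Tensor,    # [vocab_size] bool, True if the token appears in the prompt
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
) -> torch.Tensor:
    # frequency_penalty scales with how often each token has already been generated.
    logits = logits - frequency_penalty * output_counts
    # presence_penalty is a flat offset for any token generated at least once.
    logits = logits - presence_penalty * (output_counts > 0).to(logits.dtype)
    # repetition_penalty is multiplicative and covers prompt and output tokens:
    # positive logits are divided by it, negative ones are multiplied.
    seen = prompt_mask | (output_counts > 0)
    penalized = torch.where(
        logits > 0, logits / repetition_penalty, logits * repetition_penalty
    )
    return torch.where(seen, penalized, logits)
```

A batched implementation applies the same math row-wise over a `[running_requests, vocab_size]` logits tensor, which is where the memory figures discussed below come from.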
## Examples
...
...
The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
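For example, a multimodal request might look like the following sketch (it assumes a vision-capable model served on `localhost:30000`; the exact prompt template depends on the model):

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Describe this image in one sentence.",
        # A file name, a URL, or a base64-encoded string.
        "image_data": "example_image.png",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0},
    },
)
print(response.json())
```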
Streaming is supported in a similar manner to [above](#streaming).
## Performance Implications on Penalties
While you can apply penalties by supplying the relevant `sampling_params`, doing so has drawbacks.
Because the penalizers operate batch-wide, the costs below apply to every request in the same batch.
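For example, a request that enables the penalties plus `min_new_tokens` might look like this sketch (it assumes the native `/generate` endpoint on `localhost:30000`):

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "max_new_tokens": 32,
            "frequency_penalty": 0.5,
            "presence_penalty": 0.2,
            "repetition_penalty": 1.1,
            "min_new_tokens": 4,
        },
    },
)
print(response.json())
```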
### Latency
While the penalty algorithms are computed with CUDA, they still add computation on top of the basic sampling logic. For the exact overhead we recommend running your own benchmarks, but the samples below give a first impression.
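As a starting point, a crude client-side comparison can be as simple as the sketch below (it assumes a local server and ignores warmup, batching, and queueing effects, so treat the output only as a rough signal):

```python
import time

import requests

def mean_latency(sampling_params: dict, n: int = 10) -> float:
    """Average end-to-end latency of n sequential /generate calls."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        requests.post(
            "http://localhost:30000/generate",
            json={"text": "Once upon a time,", "sampling_params": sampling_params},
        )
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)

base = mean_latency({"max_new_tokens": 64})
with_penalty = mean_latency({"max_new_tokens": 64, "frequency_penalty": 0.5})
print(f"baseline: {base:.3f}s, with frequency_penalty: {with_penalty:.3f}s")
```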
### Memory
Since the penalty algorithms are computed with CUDA, their state is stored on the GPU, typically on the order of `vocab_size` multiplied by `running_requests` elements.
Run your own benchmark with the desired parameters on your own hardware to make sure the server does not run out of memory before relying on penalties in production.
Tuning `--mem-fraction-static` and/or `--max-running-requests` will help. See [here](hyperparameter_tuning.md#minor-tune---max-prefill-tokens---mem-fraction-static---max-running-requests) for more information.
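As a back-of-the-envelope illustration (assuming one float32 tensor of shape `[running_requests, vocab_size]` per penalizer; actual allocations depend on the dtype and on which penalizers are enabled):

```python
vocab_size = 128 * 1024        # e.g. a 128k-token vocabulary
running_requests = 256
bytes_per_element = 4          # float32

per_tensor = vocab_size * running_requests * bytes_per_element
print(f"{per_tensor / 1024**3:.2f} GiB per penalty tensor")  # 0.12 GiB
```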
### Benchmarks
All the benchmarks below were run on an NVIDIA H100 SXM5.
#### Baseline
Measured at [dc9d06d886151707f97d0b78095df9de262fd3c9](https://github.com/sgl-project/sglang/commit/dc9d06d886151707f97d0b78095df9de262fd3c9).
The excerpts below are from the batched-penalizer implementation and its test harness. First, the methods that apply every registered penalizer to the logits and drop penalizer state for finished requests:

```python
    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        """
        Apply the penalizers to the logits.

        Args:
            logits (torch.Tensor): The logits to apply the penalizers to.

        Returns:
            torch.Tensor: The logits after applying the penalizers.
        """
        for penalizer in self.penalizers.values():
            logits = penalizer.apply(logits)
        return logits

    def filter(
        self,
        indices_to_keep: typing.List[int],
        indices_tensor_to_keep: torch.Tensor = None,
    ):
        """
        Filter the penalizers based on the indices to keep in the batch.

        Args:
            indices_to_keep (typing.List[int]): List of indices to keep in the batch.
            indices_tensor_to_keep (torch.Tensor = None): Tensor of indices to keep in the
                batch. If not None, it will be used instead of converting indices_to_keep
                to a tensor.
        """
```
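Conceptually, filtering keeps only the rows of each per-request tensor that belong to still-running requests, as in this toy illustration:

```python
import torch

# Toy stand-in for a penalizer's per-request state: one row per running request.
state = torch.arange(12).reshape(4, 3)  # [batch=4, vocab=3]

# Suppose requests 0 and 2 finished; keep rows 1 and 3.
indices_tensor_to_keep = torch.tensor([1, 3], dtype=torch.int64)

filtered = state.index_select(0, indices_tensor_to_keep)
print(filtered)  # tensor([[ 3,  4,  5],
                 #         [ 9, 10, 11]])
```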
Next, from the test utilities, the invariant checks: every step of a test subject must expose the same set of expected tensors, and so must every subject within a test case:

```python
    # first step must be input, which will be converted to Req
    steps: typing.List[Step]
    eos_token_id: int = -1

    def __post_init__(self):
        if self.steps[0].type != StepType.INPUT:
            raise ValueError("First step must be input")

        # each step should have the same expected_tensors.keys()
        for i in range(1, len(self.steps)):
            if self.tensor_keys(i) != self.tensor_keys():
                raise ValueError(
                    f"Expected tensors keys must be the same for all steps. Got {self.steps[i].expected_tensors.keys()} for key={i} and {self.steps[0].expected_tensors.keys()}"
                )
```

The same check is repeated one level up, across the subjects of a test case:

```python
        # each test_subjects' steps should have the same expected_tensors.keys()
        for i in range(1, len(self.test_subjects)):
            if self.tensor_keys(i) != self.tensor_keys():
                raise ValueError(
                    f"Expected tensors keys must be the same for all test_subjects. Got {self.test_subjects[i].tensor_keys()} for key={i} and {self.test_subjects[0].tensor_keys()}"
                )

    def tensor_keys(self, i: int = 0) -> typing.Set[str]:
        return set(self.test_subjects[i].tensor_keys())
```
Finally, the shared base class for the per-penalizer unit tests:

```python
class BaseBatchedPenalizerTest(unittest.TestCase):
    Penalizer: typing.Type[_BatchedPenalizer]
    device = "cuda"
    vocab_size = 5

    enabled: Subject = None
    disabled: Subject = None

    def setUp(self):
        if self.__class__ == BaseBatchedPenalizerTest:
            self.skipTest("Base class for penalizer tests")

        self.create_test_subjects()
        self.create_test_cases()

    def tensor(self, data, **kwargs) -> torch.Tensor:
        """
        Shortcut to create a tensor with device=self.device.
        """
```
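A concrete test would subclass this base and plug in a specific penalizer. Purely as a hypothetical sketch (the class name `BatchedFrequencyPenalizer` and the body of `create_test_subjects` are assumptions, not shown in the excerpt above):

```python
class TestBatchedFrequencyPenalizer(BaseBatchedPenalizerTest):
    # Hypothetical concrete penalizer class under test.
    Penalizer = BatchedFrequencyPenalizer

    def create_test_subjects(self):
        # Populate self.enabled / self.disabled with Subject instances whose
        # steps describe the expected tensors before and after penalization.
        ...
```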