[Docs] Add "Generation quality changed" section to troubleshooting (#15701)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

[Docs] Add "Generation quality changed" section to troubleshooting (#15701)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
0b416752 · Harry Mellor · GitHub · fd5fd269 · 0b416752
Unverified Commit 0b416752 authored Mar 28, 2025 by Harry Mellor Committed by GitHub Mar 28, 2025
Show whitespace changes
Inline Side-by-side

Showing with 8 additions and 0 deletions

docs/source/getting_started/troubleshooting.md docs/source/getting_started/troubleshooting.md +8 -0

No files found.
--- a/docs/source/getting_started/troubleshooting.md
+++ b/docs/source/getting_started/troubleshooting.md
@@ -26,6 +26,14 @@ To isolate the model downloading and loading issue, you can use the `--load-form
 If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider [using tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+## Generation quality changed
+In v0.8.0, the source of default sampling parameters was changed in <gh-pr:12622>. Prior to v0.8.0, the default sampling parameters came from vLLM's set of neutral defaults. From v0.8.0 onwards, the default sampling parameters come from the `generation_config.json` provided by the model creator.
+In most cases, this should lead to higher quality responses, because the model creator is likely to know which sampling parameters are best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.
+You can check if this is happening by trying the old defaults with `--generation-config vllm` for online and `generation_config="vllm"` for offline. If, after trying this, your generation quality improves we would recommend continuing to use the vLLM defaults and petition the model creator on <https://huggingface.co> to update their default `generation_config.json` so that it produces better quality generations.
 ## Enable more logging
 If other strategies don't solve the problem, it's likely that the vLLM instance is stuck somewhere. You can use the following environment variables to help debug the issue: