This tutorial covers the embedding APIs for embedding models. For a list of the supported models, see the [corresponding overview page](https://docs.sglang.ai/supported_models/embedding_models.html).
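As a minimal sketch, the request body for the OpenAI-compatible `/v1/embeddings` endpoint looks like the following. The model name and server URL are placeholders, and this assumes an SGLang server has already been launched with an embedding model:

```python
import json

# Request payload for the OpenAI-compatible embeddings endpoint.
# "<embedding-model>" is a placeholder; use the model you launched the server with.
payload = {
    "model": "<embedding-model>",
    "input": ["Hello, SGLang!"],  # one or more texts to embed
}

# With a live server (default port 30000), you would POST this payload, e.g.:
#   requests.post("http://localhost:30000/v1/embeddings", json=payload)
print(json.dumps(payload))
```

The response follows the OpenAI embeddings schema, with the vectors under `data[i].embedding`.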
This tutorial covers the vision APIs for vision language models. A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).

SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3, and [more](https://docs.sglang.ai/supported_models/multimodal_language_models).

As an alternative to the OpenAI API, you can also use the [SGLang offline engine](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py).
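For reference, a vision request uses the standard OpenAI multimodal message format. The sketch below only builds the message structure; the image URL is a placeholder, and with a live SGLang server you would pass it to an OpenAI client pointed at the server's `/v1` base URL:

```python
# OpenAI-style multimodal chat message: text plus an image reference.
# The image URL below is a placeholder for illustration.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/cat.jpg"},
            },
        ],
    }
]

# With a live server you would send this via the openai client, e.g.:
#   client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="None")
#   client.chat.completions.create(model="<vision-model>", messages=messages)
```

The same message structure works for both the OpenAI client and a raw POST to `/v1/chat/completions`.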
SGLang provides many optimizations specifically designed for the DeepSeek models, making it the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended) from Day 0.
This document outlines current optimizations for DeepSeek.
Additionally, the SGLang team is actively developing enhancements following this [Roadmap](https://github.com/sgl-project/sglang/issues/2591). For an overview of the implemented features, see the completed items in the same [Roadmap](https://github.com/sgl-project/sglang/issues/2591).
## Launch DeepSeek V3 with SGLang
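A minimal launch sketch, assuming a single node with 8 GPUs (tensor parallelism across all of them); adjust `--tp` to your hardware:

```shell
# Launch an OpenAI-compatible SGLang server for DeepSeek-V3.
# --tp 8 shards the model across 8 GPUs on one node;
# --trust-remote-code allows loading the model's custom code from the Hub.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --trust-remote-code
```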
...
...
Important Notes:
## FAQ
1. **Question**: What should I do if model loading takes too long and an NCCL timeout occurs?

   **Answer**: You can try adding `--dist-timeout 3600` when launching the model; this sets a one-hour timeout, which often resolves the issue.
1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
- If you encounter `ImportError: cannot import name 'is_valid_list_of_images' from 'transformers.models.llama.image_processing_llama'`, use the version of `transformers` specified in [pyproject.toml](https://github.com/sgl-project/sglang/blob/main/python/pyproject.toml). Currently, running `pip install transformers==4.51.1` is sufficient.