Add document for LoRA serving (#5521)

Unverified Commit 072b4d03 authored Apr 20, 2025 by Baizhou Zhang Committed by GitHub Apr 20, 2025

Showing with 205 additions and 0 deletions

docs/backend/lora.ipynb +204 -0

docs/index.rst +1 -0

--- a/docs/backend/lora.ipynb
+++ b/docs/backend/lora.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# LoRA Serving"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "SGLang enables the use of [LoRA adapters](https://arxiv.org/abs/2106.09685) with a base model. By incorporating techniques from [S-LoRA](https://arxiv.org/pdf/2311.03285) and [Punica](https://arxiv.org/pdf/2310.18547), SGLang can efficiently support multiple LoRA adapters for different sequences within a single batch of inputs."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Arguments for LoRA Serving"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The following server arguments are relevant for multi-LoRA serving:\n",
+    "\n",
+    "* `lora_paths`: A mapping from each adapter's name to its path, in the form `{name}={path} {name}={path}`.\n",
+    "\n",
+    "* `max_loras_per_batch`: The maximum number of adapters used by each batch. This argument affects the amount of GPU memory reserved for multi-LoRA serving, so set it to a smaller value when memory is scarce. Defaults to 8.\n",
+    "\n",
+    "* `lora_backend`: The backend used to run the GEMM kernels for LoRA modules. It can be either `triton` or `flashinfer`, and defaults to `triton`. For better performance and stability, we recommend the Triton LoRA backend. Faster backends built on CUTLASS or CUDA kernels will be added in the future.\n",
+    "\n",
+    "* `tp_size`: SGLang supports LoRA serving together with tensor parallelism. `tp_size` controls the number of GPUs used for tensor parallelism. More details on the tensor sharding strategy can be found in the [S-LoRA](https://arxiv.org/pdf/2311.03285) paper.\n",
+    "\n",
+    "On the client side, the user provides a list of strings as the input batch, together with a list of adapter names, one per input sequence."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Usage\n",
+    "\n",
+    "### Serving a Single Adapter"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sglang.test.test_utils import is_in_ci\n",
+    "\n",
+    "if is_in_ci():\n",
+    "    from patch import launch_server_cmd\n",
+    "else:\n",
+    "    from sglang.utils import launch_server_cmd\n",
+    "\n",
+    "from sglang.utils import wait_for_server, terminate_process\n",
+    "\n",
+    "import json\n",
+    "import requests"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "server_process, port = launch_server_cmd(\n",
+    "    \"\"\"\n",
+    "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
+    "    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
+    "    --max-loras-per-batch 1 --lora-backend triton \\\n",
+    "    --disable-cuda-graph --disable-radix-cache\n",
+    "\"\"\"\n",
+    ")\n",
+    "\n",
+    "wait_for_server(f\"http://localhost:{port}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "url = f\"http://127.0.0.1:{port}\"\n",
+    "json_data = {\n",
+    "    \"text\": [\n",
+    "        \"List 3 countries and their capitals.\",\n",
+    "        \"AI is a field of computer science focused on\",\n",
+    "    ],\n",
+    "    \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n",
+    "    # The first input uses lora0, and the second input uses the base model\n",
+    "    \"lora_path\": [\"lora0\", None],\n",
+    "}\n",
+    "response = requests.post(\n",
+    "    url + \"/generate\",\n",
+    "    json=json_data,\n",
+    ")\n",
+    "print(f\"Output 0: {response.json()[0]['text']}\")\n",
+    "print(f\"Output 1: {response.json()[1]['text']}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "terminate_process(server_process)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Serving Multiple Adapters"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "server_process, port = launch_server_cmd(\n",
+    "    \"\"\"\n",
+    "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
+    "    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
+    "    lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n",
+    "    --max-loras-per-batch 2 --lora-backend triton \\\n",
+    "    --disable-cuda-graph --disable-radix-cache\n",
+    "\"\"\"\n",
+    ")\n",
+    "\n",
+    "wait_for_server(f\"http://localhost:{port}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "url = f\"http://127.0.0.1:{port}\"\n",
+    "json_data = {\n",
+    "    \"text\": [\n",
+    "        \"List 3 countries and their capitals.\",\n",
+    "        \"AI is a field of computer science focused on\",\n",
+    "    ],\n",
+    "    \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n",
+    "    # The first input uses lora0, and the second input uses lora1\n",
+    "    \"lora_path\": [\"lora0\", \"lora1\"],\n",
+    "}\n",
+    "response = requests.post(\n",
+    "    url + \"/generate\",\n",
+    "    json=json_data,\n",
+    ")\n",
+    "print(f\"Output 0: {response.json()[0]['text']}\")\n",
+    "print(f\"Output 1: {response.json()[1]['text']}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "terminate_process(server_process)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Future Work\n",
+    "\n",
+    "The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently, CUDA graph and radix attention are not compatible with LoRA and must be manually disabled. Other features, including Unified Paging, a CUTLASS backend, and dynamic loading/unloading, are still under development."
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
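
The notebook above describes two conventions: the `{name}={path} {name}={path}` form of `lora_paths`, and a per-request `lora_path` list with one entry per input. A minimal client-side sketch of both follows. The helpers `parse_lora_paths` and `build_generate_payload` are hypothetical names introduced for illustration only; they are not part of SGLang and do not reflect how the server parses its arguments.

```python
from typing import Dict, List, Optional


def parse_lora_paths(spec: str) -> Dict[str, str]:
    """Parse a '{name}={path} {name}={path}' string into a name -> path dict.

    Hypothetical helper: mirrors the documented format, not SGLang's parser.
    """
    mapping: Dict[str, str] = {}
    for item in spec.split():
        name, sep, path = item.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected name=path, got {item!r}")
        mapping[name] = path
    return mapping


def build_generate_payload(
    prompts: List[str],
    adapters: List[Optional[str]],
    known: Dict[str, str],
    max_new_tokens: int = 32,
) -> dict:
    """Build a /generate request body with one adapter name (or None) per prompt."""
    if len(prompts) != len(adapters):
        raise ValueError("need exactly one adapter entry per prompt")
    # None selects the base model, as in the single-adapter notebook example.
    unknown = [a for a in adapters if a is not None and a not in known]
    if unknown:
        raise ValueError(f"adapters not registered at launch: {unknown}")
    return {
        "text": prompts,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
        "lora_path": adapters,
    }


loras = parse_lora_paths(
    "lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora "
    "lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16"
)
payload = build_generate_payload(
    [
        "List 3 countries and their capitals.",
        "AI is a field of computer science focused on",
    ],
    ["lora0", None],
    loras,
)
```

The resulting `payload` has the same shape as the `json_data` dictionaries in the notebook cells and could be passed to `requests.post(url + "/generate", json=payload)`. Validating adapter names on the client is a design choice of this sketch; it catches typos before a request is sent.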
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -54,6 +54,7 @@ The core features include:
   backend/structured_outputs_for_reasoning_models.ipynb
   backend/custom_chat_template.md
   backend/quantization.md
+   backend/lora.ipynb
 .. toctree::
   :maxdepth: 1