Commit 118f1fc7 authored by maxiao1's avatar maxiao1
Browse files

sglangv0.5.2 & support Qwen3-Next-80B-A3B-Instruct

parents
<div align="center" id="sglangtop">
<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400" margin="10px"></img>
[![PyPI](https://img.shields.io/pypi/v/sglang)](https://pypi.org/project/sglang)
![PyPI - Downloads](https://static.pepy.tech/badge/sglang?period=month)
[![license](https://img.shields.io/github/license/sgl-project/sglang.svg)](https://github.com/sgl-project/sglang/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/sgl-project/sglang)](https://github.com/sgl-project/sglang/issues)
[![open issues](https://img.shields.io/github/issues-raw/sgl-project/sglang)](https://github.com/sgl-project/sglang/issues)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/sgl-project/sglang)
</div>
--------------------------------------------------------------------------------
| [**Blog**](https://lmsys.org/blog/2025-05-05-large-scale-ep/)
| [**Documentation**](https://docs.sglang.ai/)
| [**Join Slack**](https://slack.sglang.ai/)
| [**Join Bi-Weekly Development Meeting**](https://meeting.sglang.ai/)
| [**Roadmap**](https://github.com/sgl-project/sglang/issues/7736)
| [**Slides**](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides) |
## News
- [2025/08] 🔔 SGLang x AMD SF Meetup on 8/22: Hands-on GPU workshop, tech talks by AMD/xAI/SGLang, and networking ([Roadmap](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_sglang_roadmap.pdf), [Large-scale EP](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_sglang_ep.pdf), [Highlights](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_highlights.pdf), [AITER/MoRI](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_aiter_mori.pdf), [Wave](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_wave.pdf)).
- [2025/08] 🔥 SGLang provides day-0 support for OpenAI gpt-oss model ([instructions](https://github.com/sgl-project/sglang/issues/8833))
- [2025/06] 🔥 SGLang, the high-performance serving infrastructure powering trillions of tokens daily, has been awarded the third batch of the Open Source AI Grant by a16z ([a16z blog](https://a16z.com/advancing-open-source-ai-through-benchmarks-and-bold-experimentation/)).
- [2025/06] 🔥 Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput ([blog](https://lmsys.org/blog/2025-06-16-gb200-part-1/)).
- [2025/05] 🔥 Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/)).
- [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html))
- [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine ([PyTorch blog](https://pytorch.org/blog/sglang-joins-pytorch/))
- [2024/12] v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
<details>
<summary>More</summary>
- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html))
- [2025/01] SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html), [10+ other companies](https://x.com/lmsysorg/status/1887262321636221412))
- [2024/10] The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
- [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
- [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
- [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
</details>
## About
SGLang is a fast serving framework for large language models and vision language models.
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-lora batching.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
- **Active Community**: SGLang is open-source and backed by an active community with wide industry adoption.
## Getting Started
- [Install SGLang](https://docs.sglang.ai/get_started/install.html)
- [Quick Start](https://docs.sglang.ai/basic_usage/send_request.html)
- [Backend Tutorial](https://docs.sglang.ai/basic_usage/openai_api_completions.html)
- [Frontend Tutorial](https://docs.sglang.ai/references/frontend/frontend_tutorial.html)
- [Contribution Guide](https://docs.sglang.ai/developer_guide/contribution_guide.html)
## Benchmark and Performance
Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/), [Large-scale expert parallelism](https://lmsys.org/blog/2025-05-05-large-scale-ep/).
## Roadmap
[Development Roadmap (2025 H2)](https://github.com/sgl-project/sglang/issues/7736)
## Adoption and Sponsorship
SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations across North America and Asia. As an open-source LLM inference engine, SGLang has become the de facto industry standard, with deployments running on over 1,000,000 GPUs worldwide.
<img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/refs/heads/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>
## Contact Us
For enterprises interested in adopting or deploying SGLang at scale, including technical consulting, sponsorship opportunities, or partnership inquiries, please contact us at contact@sglang.ai.
## Acknowledgment
We learned the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
<svg width="2392" height="729" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" overflow="hidden"><defs><filter id="fx0" x="-10%" y="-10%" width="120%" height="120%" filterUnits="userSpaceOnUse" primitiveUnits="userSpaceOnUse"><feComponentTransfer color-interpolation-filters="sRGB"><feFuncR type="discrete" tableValues="0.835294 0.835294"/><feFuncG type="discrete" tableValues="0.345098 0.345098"/><feFuncB type="discrete" tableValues="0.086275 0.086275"/><feFuncA type="linear" slope="0.400000" intercept="0.000000"/></feComponentTransfer><feGaussianBlur stdDeviation="7.638889 7.638889"/></filter><filter id="fx1" x="-10%" y="-10%" width="120%" height="120%" filterUnits="userSpaceOnUse" primitiveUnits="userSpaceOnUse"><feComponentTransfer color-interpolation-filters="sRGB"><feFuncR type="discrete" tableValues="0.835294 0.835294"/><feFuncG type="discrete" tableValues="0.345098 0.345098"/><feFuncB type="discrete" tableValues="0.086275 0.086275"/><feFuncA type="linear" slope="0.400000" intercept="0.000000"/></feComponentTransfer><feGaussianBlur stdDeviation="6.111111 6.111111"/></filter><filter id="fx2" x="-10%" y="-10%" width="120%" height="120%" filterUnits="userSpaceOnUse" primitiveUnits="userSpaceOnUse"><feComponentTransfer color-interpolation-filters="sRGB"><feFuncR type="discrete" tableValues="0.835294 0.835294"/><feFuncG type="discrete" tableValues="0.345098 0.345098"/><feFuncB type="discrete" tableValues="0.086275 0.086275"/><feFuncA type="linear" slope="0.400000" intercept="0.000000"/></feComponentTransfer><feGaussianBlur stdDeviation="7.638889 7.638889"/></filter><clipPath id="clip3"><path d="M1756.97 902.5C1708.88 902.5 1667.36 908.097 1632.43 919.291 1597.49 930.485 1568.73 945.491 1546.14 964.309 1523.56 983.128 1506.82 1004.62 1495.94 1028.8 1485.05 1052.97 1479.61 1078.03 1479.61 1103.99 1479.61 1125.73 1483.26 1144.63 1490.57 1160.69 1497.89 1176.75 1507.63 1190.86 1519.82 1203.03 1532.01 1215.2 1545.74 1225.74 1561.01 1234.66 1576.29 1243.59 1591.97 1251.86 1608.05 1259.48 1624.14 1267.11 1639.9 1274.49 1655.34 1281.63 1670.77 1288.77 1684.58 1296.39 1696.77 1304.5 1708.96 1312.61 1718.71 1321.62 1726.02 1331.51 1733.33 1341.41 1736.99 1353.17 1736.99 1366.8 1736.99 1378.48 1734.23 1389.59 1728.7 1400.14 1723.18 1410.68 1714.48 1420.01 1702.62 1428.12 1690.76 1436.23 1675.65 1442.64 1657.29 1447.35 1638.93 1452.05 1616.91 1454.4 1591.24 1454.4 1576.61 1454.4 1561.5 1453.59 1545.9 1451.97 1530.3 1450.35 1514.7 1448.08 1499.1 1445.16 1483.51 1442.23 1468.07 1438.67 1452.79 1434.45 1437.52 1430.23 1423.06 1425.69 1409.41 1420.82L1386.5 1533.25C1415.1 1542.33 1444.83 1549.14 1475.71 1553.69 1506.58 1558.23 1540.05 1560.5 1576.12 1560.5 1621.3 1560.5 1662.24 1555.47 1698.96 1545.41 1735.69 1535.35 1766.97 1520.92 1792.8 1502.1 1818.64 1483.28 1838.54 1460.57 1852.52 1433.96 1866.49 1407.36 1873.48 1377.67 1873.48 1344.9 1873.48 1323.48 1869.74 1304.58 1862.27 1288.2 1854.79 1271.81 1844.96 1257.38 1832.77 1244.88 1820.59 1232.39 1806.78 1221.44 1791.34 1212.03 1775.9 1202.62 1760.06 1193.94 1743.81 1185.99 1727.56 1178.05 1711.72 1170.5 1696.28 1163.36 1680.85 1156.23 1667.04 1148.76 1654.85 1140.98 1642.66 1133.19 1632.83 1124.67 1625.36 1115.42 1617.88 1106.18 1614.15 1095.39 1614.15 1083.06 1614.15 1072.35 1616.66 1062.21 1621.7 1052.64 1626.74 1043.07 1634.62 1034.8 1645.34 1027.82 1656.07 1020.85 1669.72 1015.33 1686.29 1011.27 1702.87 1007.22 1722.69 1005.19 1745.76 1005.19 1757.46 1005.19 1769.81 1005.76 1782.81 1006.89 1795.81 1008.03 1808.73 1009.65 1821.56 1011.76 1834.4 1013.87 1846.75 1016.3 1858.61 1019.06 1870.47 1021.82 1881.11 1024.82 1890.54 1028.06L1911.5 922.455C1901.75 919.859 1890.86 917.344 1878.84 914.91 1866.82 912.477 1854.06 910.368 1840.57 908.584 1827.09 906.799 1813.28 905.339 1799.14 904.203 1785 903.068 1770.95 902.5 1756.97 902.5ZM756 866 3148 866 3148 1595 756 1595Z" fill-rule="evenodd" clip-rule="evenodd"/></clipPath><clipPath id="clip4"><rect x="1.24353" y="1.08319" width="595.223" height="724.834"/></clipPath><clipPath id="clip5"><rect x="1.41663" y="1.66669" width="592.667" height="675.667"/></clipPath><clipPath id="clip6"><rect x="-2078.55" y="-2770.64" width="374073" height="374073"/></clipPath><image width="512" height="512" xlink:href="" preserveAspectRatio="none" id="img7"></image><clipPath id="clip8"><rect x="0" y="0" width="371301" height="371301"/></clipPath><clipPath id="clip9"><rect x="-0.363636" y="-2770.64" width="371302" height="374073"/></clipPath><image width="512" height="512" xlink:href="" preserveAspectRatio="none" id="img10"></image><clipPath id="clip11"><rect x="0" y="0" width="371301" height="371301"/></clipPath><clipPath id="clip12"><path d="M2369.95 902.5C2319.29 902.5 2274.72 909.476 2236.23 923.428 2197.75 937.38 2164.46 955.873 2136.37 978.909 2108.28 1001.95 2084.82 1028.23 2065.98 1057.75 2047.14 1087.28 2032.04 1117.78 2020.68 1149.25 2009.31 1180.72 2001.27 1211.79 1996.56 1242.45 1991.85 1273.11 1989.5 1300.93 1989.5 1325.92 1989.5 1363.55 1994.86 1396.89 2005.58 1425.93 2016.29 1454.97 2031.8 1479.47 2052.1 1499.42 2072.39 1519.38 2097.08 1534.54 2126.14 1544.93 2155.21 1555.31 2188.25 1560.5 2225.27 1560.5 2262.62 1560.5 2296.8 1557.17 2327.81 1550.52 2358.83 1543.87 2390.09 1533.89 2421.59 1520.59L2490.27 1181.37 2263.76 1181.37 2244.27 1280.17 2351.93 1280.17 2318.8 1442.72C2308.41 1446.29 2298.1 1448.97 2287.87 1450.75 2277.64 1452.54 2265.05 1453.43 2250.12 1453.43 2226.08 1453.43 2205.54 1450.59 2188.49 1444.91 2171.44 1439.23 2157.48 1430.8 2146.6 1419.6 2135.72 1408.41 2127.76 1394.78 2122.73 1378.72 2117.7 1362.66 2115.18 1344.09 2115.18 1323 2115.18 1302.56 2117.05 1280.57 2120.78 1257.05 2124.52 1233.53 2130.44 1210.17 2138.56 1186.97 2146.68 1163.77 2157.07 1141.54 2169.74 1120.29 2182.41 1099.04 2197.75 1080.22 2215.77 1063.84 2233.8 1047.45 2254.58 1034.31 2278.13 1024.41 2301.67 1014.52 2328.22 1009.57 2357.77 1009.57 2387.33 1009.57 2415.66 1012.9 2442.78 1019.55 2469.89 1026.2 2494.17 1033.58 2515.6 1041.69L2538.5 931.214C2512.84 922.778 2485.56 915.883 2456.66 910.53 2427.76 905.177 2398.85 902.5 2369.95 902.5ZM756 866 3148 866 3148 1595 756 1595Z" fill-rule="evenodd" clip-rule="evenodd"/></clipPath><clipPath id="clip13"><rect x="0.916748" y="1.08319" width="617.517" height="724.834"/></clipPath></defs><g transform="translate(-756 -866)"><g><g clip-path="url(#clip3)"><g clip-path="url(#clip4)" filter="url(#fx0)" transform="translate(1396 868)"><g><g><path d="M406.807 34.4998C420.781 34.4998 434.837 35.068 448.972 36.2035 463.109 37.3389 476.921 38.799 490.407 40.5838 503.894 42.3677 516.649 44.4772 528.673 46.9104 540.697 49.3436 551.584 51.859 561.333 54.4545L540.372 160.065C530.948 156.821 520.305 153.819 508.444 151.061 496.581 148.304 484.232 145.87 471.396 143.761 458.559 141.652 445.642 140.03 432.642 138.894 419.644 137.759 407.295 137.191 395.595 137.191 372.522 137.191 352.699 139.219 336.125 143.275 319.551 147.33 305.902 152.846 295.178 159.822 284.453 166.797 276.572 175.071 271.535 184.642 266.497 194.214 263.979 204.354 263.979 215.061 263.979 227.39 267.717 238.178 275.191 247.425 282.666 256.673 292.496 265.189 304.683 272.977 316.87 280.763 330.681 288.226 346.117 295.364 361.554 302.502 377.396 310.046 393.646 317.995 409.894 325.944 425.737 334.624 441.173 344.033 456.61 353.442 470.422 364.392 482.608 376.884 494.795 389.376 504.625 403.814 512.099 420.199 519.574 436.584 523.311 455.483 523.311 476.898 523.311 509.669 516.324 539.356 502.35 565.962 488.376 592.567 468.471 615.279 442.636 634.098 416.8 652.916 385.521 667.354 348.798 677.413 312.076 687.471 271.129 692.5 225.957 692.5 189.884 692.5 156.412 690.229 125.539 685.687 94.6662 681.144 64.931 674.33 36.3331 665.246L59.2438 552.821C72.8928 557.688 87.3538 562.23 102.628 566.448 117.902 570.666 133.339 574.235 148.937 577.155 164.536 580.075 180.136 582.347 195.734 583.969 211.333 585.591 226.445 586.403 241.068 586.403 266.741 586.403 288.759 584.05 307.12 579.346 325.481 574.641 340.592 568.233 352.455 560.122 364.316 552.01 373.009 542.682 378.534 532.137 384.059 521.592 386.821 510.479 386.821 498.799 386.821 485.172 383.164 473.41 375.853 463.514 368.541 453.618 358.791 444.614 346.605 436.503 334.418 428.391 320.607 420.767 305.17 413.629 289.733 406.491 273.972 399.11 257.886 391.484 241.8 383.86 226.119 375.586 210.845 366.663 195.571 357.741 181.841 347.197 169.654 335.028 157.468 322.861 147.718 308.748 140.407 292.687 133.095 276.626 129.439 257.727 129.439 235.988 129.439 210.032 134.882 184.967 145.769 160.795 156.656 136.624 173.391 115.128 195.978 96.3092 218.563 77.4909 247.324 62.4847 282.259 51.2908 317.194 40.0968 358.71 34.4998 406.807 34.4998Z" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#D55816" fill-rule="evenodd" fill-opacity="1"/></g></g></g></g><path d="M1756.97 902.5C1770.95 902.5 1785 903.068 1799.14 904.203 1813.28 905.339 1827.09 906.799 1840.57 908.584 1854.06 910.368 1866.82 912.477 1878.84 914.91 1890.86 917.344 1901.75 919.859 1911.5 922.455L1890.54 1028.06C1881.11 1024.82 1870.47 1021.82 1858.61 1019.06 1846.75 1016.3 1834.4 1013.87 1821.56 1011.76 1808.73 1009.65 1795.81 1008.03 1782.81 1006.89 1769.81 1005.76 1757.46 1005.19 1745.76 1005.19 1722.69 1005.19 1702.87 1007.22 1686.29 1011.27 1669.72 1015.33 1656.07 1020.85 1645.34 1027.82 1634.62 1034.8 1626.74 1043.07 1621.7 1052.64 1616.66 1062.21 1614.15 1072.35 1614.15 1083.06 1614.15 1095.39 1617.88 1106.18 1625.36 1115.42 1632.83 1124.67 1642.66 1133.19 1654.85 1140.98 1667.04 1148.76 1680.85 1156.23 1696.28 1163.36 1711.72 1170.5 1727.56 1178.05 1743.81 1185.99 1760.06 1193.94 1775.9 1202.62 1791.34 1212.03 1806.78 1221.44 1820.59 1232.39 1832.77 1244.88 1844.96 1257.38 1854.79 1271.81 1862.27 1288.2 1869.74 1304.58 1873.48 1323.48 1873.48 1344.9 1873.48 1377.67 1866.49 1407.36 1852.52 1433.96 1838.54 1460.57 1818.64 1483.28 1792.8 1502.1 1766.97 1520.92 1735.69 1535.35 1698.96 1545.41 1662.24 1555.47 1621.3 1560.5 1576.12 1560.5 1540.05 1560.5 1506.58 1558.23 1475.71 1553.69 1444.83 1549.14 1415.1 1542.33 1386.5 1533.25L1409.41 1420.82C1423.06 1425.69 1437.52 1430.23 1452.79 1434.45 1468.07 1438.67 1483.51 1442.23 1499.1 1445.16 1514.7 1448.08 1530.3 1450.35 1545.9 1451.97 1561.5 1453.59 1576.61 1454.4 1591.24 1454.4 1616.91 1454.4 1638.93 1452.05 1657.29 1447.35 1675.65 1442.64 1690.76 1436.23 1702.62 1428.12 1714.48 1420.01 1723.18 1410.68 1728.7 1400.14 1734.23 1389.59 1736.99 1378.48 1736.99 1366.8 1736.99 1353.17 1733.33 1341.41 1726.02 1331.51 1718.71 1321.62 1708.96 1312.61 1696.77 1304.5 1684.58 1296.39 1670.77 1288.77 1655.34 1281.63 1639.9 1274.49 1624.14 1267.11 1608.05 1259.48 1591.97 1251.86 1576.29 1243.59 1561.01 1234.66 1545.74 1225.74 1532.01 1215.2 1519.82 1203.03 1507.63 1190.86 1497.89 1176.75 1490.57 1160.69 1483.26 1144.63 1479.61 1125.73 1479.61 1103.99 1479.61 1078.03 1485.05 1052.97 1495.94 1028.8 1506.82 1004.62 1523.56 983.128 1546.14 964.309 1568.73 945.491 1597.49 930.485 1632.43 919.291 1667.36 908.097 1708.88 902.5 1756.97 902.5Z" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#D55816" fill-rule="evenodd" fill-opacity="1"/><g clip-path="url(#clip5)" filter="url(#fx1)" transform="translate(757 888)"><g><g><path d="M482.943 195.5C482.943 283.043 379.597 370.586 276.25 370.586" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="none" fill-rule="evenodd"/><path d="M195.25 316.5C262.575 316.5 329.901 417.048 329.901 517.595" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="none" fill-rule="evenodd"/><path d="M30.2499 316C30.2499 270.437 67.1864 233.5 112.75 233.5 158.313 233.5 195.25 270.437 195.25 316 195.25 361.564 158.313 398.5 112.75 398.5 67.1864 398.5 30.2499 361.564 30.2499 316Z" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><path d="M400.25 113C400.25 67.4365 437.187 30.5 482.75 30.5 528.314 30.5 565.25 67.4365 565.25 113 565.25 158.563 528.314 195.5 482.75 195.5 437.187 195.5 400.25 158.563 400.25 113Z" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><path d="M237.25 539.334C237.25 527.275 247.025 517.5 259.084 517.5L400.417 517.5C412.475 517.5 422.25 527.275 422.25 539.334L422.25 626.666C422.25 638.725 412.475 648.5 400.417 648.5L259.084 648.5C247.025 648.5 237.25 638.725 237.25 626.666Z" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><g clip-path="url(#clip6)" transform="matrix(0.000360892 0 0 0.000360892 261.75 517)"><g clip-path="url(#clip8)" transform="matrix(1 0 0 1 0.0663341 0.216198)"><use width="100%" height="100%" xlink:href="#img7" opacity="1" transform="scale(725.197 725.197)"></use></g></g></g></g></g><path d="M1226.19 1083.5C1226.19 1171.04 1122.85 1258.59 1019.5 1258.59" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="none" fill-rule="evenodd"/><path d="M938.5 1204.5C1005.83 1204.5 1073.15 1305.05 1073.15 1405.6" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="none" fill-rule="evenodd"/><path d="M773.5 1204C773.5 1158.44 810.436 1121.5 856 1121.5 901.563 1121.5 938.5 1158.44 938.5 1204 938.5 1249.56 901.563 1286.5 856 1286.5 810.436 1286.5 773.5 1249.56 773.5 1204Z" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><path d="M1143.5 1001C1143.5 955.437 1180.44 918.5 1226 918.5 1271.56 918.5 1308.5 955.437 1308.5 1001 1308.5 1046.56 1271.56 1083.5 1226 1083.5 1180.44 1083.5 1143.5 1046.56 1143.5 1001Z" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><path d="M980.5 1427.33C980.5 1415.28 990.275 1405.5 1002.33 1405.5L1143.67 1405.5C1155.72 1405.5 1165.5 1415.28 1165.5 1427.33L1165.5 1514.67C1165.5 1526.72 1155.72 1536.5 1143.67 1536.5L1002.33 1536.5C990.275 1536.5 980.5 1526.72 980.5 1514.67Z" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><g clip-path="url(#clip9)" transform="matrix(0.000360892 0 0 0.000360892 1005 1405)"><g clip-path="url(#clip11)" transform="matrix(1 0 0 1 0.0684703 0.216198)"><use width="100%" height="100%" xlink:href="#img10" opacity="1" transform="scale(725.197 725.197)"></use></g></g><g clip-path="url(#clip12)"><g clip-path="url(#clip13)" filter="url(#fx2)" transform="translate(2001 868)"><g><g><path d="M414.785 34.4998C443.687 34.4998 472.591 37.1766 501.494 42.53 530.398 47.8835 557.677 54.7783 583.333 63.2143L560.437 173.692C539.004 165.581 514.728 158.2 487.611 151.548 460.494 144.896 432.159 141.571 402.606 141.571 373.053 141.571 346.504 146.519 322.96 156.415 299.415 166.311 278.631 179.451 260.607 195.836 242.582 212.221 227.238 231.04 214.573 252.292 201.907 273.544 191.515 295.769 183.396 318.968 175.278 342.167 169.35 365.528 165.616 389.051 161.881 412.575 160.014 434.557 160.014 454.997 160.014 476.087 162.531 494.662 167.564 510.723 172.598 526.783 180.555 540.41 191.434 551.604 202.313 562.798 216.278 571.234 233.327 576.913 250.377 582.59 270.918 585.429 294.95 585.429 309.888 585.429 322.473 584.537 332.703 582.752 342.932 580.968 353.244 578.291 363.636 574.722L396.761 412.169 289.104 412.169 308.589 313.371 535.107 313.371 466.421 652.592C434.92 665.894 403.662 675.872 372.647 682.523 341.633 689.174 307.453 692.5 270.106 692.5 233.083 692.5 200.039 687.309 170.974 676.926 141.908 666.544 117.227 651.375 96.9299 631.421 76.6329 611.467 61.1254 586.97 50.4087 557.931 39.6919 528.893 34.3335 495.554 34.3335 457.918 34.3335 432.934 36.6877 405.112 41.3971 374.451 46.1055 343.789 54.1431 312.723 65.5098 281.25 76.8756 249.778 91.9768 219.279 110.813 189.753 129.649 160.227 153.113 133.946 181.204 110.909 209.296 87.8731 242.582 69.3795 281.066 55.4276 319.55 41.4758 364.123 34.4998 414.785 34.4998Z" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#D55816" fill-rule="evenodd" fill-opacity="1"/></g></g></g></g><path d="M2369.95 902.5C2398.85 902.5 2427.76 905.177 2456.66 910.53 2485.56 915.883 2512.84 922.778 2538.5 931.214L2515.6 1041.69C2494.17 1033.58 2469.89 1026.2 2442.78 1019.55 2415.66 1012.9 2387.33 1009.57 2357.77 1009.57 2328.22 1009.57 2301.67 1014.52 2278.13 1024.41 2254.58 1034.31 2233.8 1047.45 2215.77 1063.84 2197.75 1080.22 2182.41 1099.04 2169.74 1120.29 2157.07 1141.54 2146.68 1163.77 2138.56 1186.97 2130.44 1210.17 2124.52 1233.53 2120.78 1257.05 2117.05 1280.57 2115.18 1302.56 2115.18 1323 2115.18 1344.09 2117.7 1362.66 2122.73 1378.72 2127.76 1394.78 2135.72 1408.41 2146.6 1419.6 2157.48 1430.8 2171.44 1439.23 2188.49 1444.91 2205.54 1450.59 2226.08 1453.43 2250.12 1453.43 2265.05 1453.43 2277.64 1452.54 2287.87 1450.75 2298.1 1448.97 2308.41 1446.29 2318.8 1442.72L2351.93 1280.17 2244.27 1280.17 2263.76 1181.37 2490.27 1181.37 2421.59 1520.59C2390.09 1533.89 2358.83 1543.87 2327.81 1550.52 2296.8 1557.17 2262.62 1560.5 2225.27 1560.5 2188.25 1560.5 2155.21 1555.31 2126.14 1544.93 2097.08 1534.54 2072.39 1519.38 2052.1 1499.42 2031.8 1479.47 2016.29 1454.97 2005.58 1425.93 1994.86 1396.89 1989.5 1363.55 1989.5 1325.92 1989.5 1300.93 1991.85 1273.11 1996.56 1242.45 2001.27 1211.79 2009.31 1180.72 2020.68 1149.25 2032.04 1117.78 2047.14 1087.28 2065.98 1057.75 2084.82 1028.23 2108.28 1001.95 2136.37 978.909 2164.46 955.873 2197.75 937.38 2236.23 923.428 2274.72 909.476 2319.29 902.5 2369.95 902.5Z" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#D55816" fill-rule="evenodd" fill-opacity="1"/><path d="M2837.7 900.5 2964.33 900.5 2853.41 1455.63 3132.5 1455.63 3111.73 1562.5 2705.5 1562.5 2837.7 900.5Z" stroke="#E4C0B8" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FFFFFF" fill-rule="evenodd" fill-opacity="1"/><path d="M2807.7 900.5 2934.33 900.5 2823.41 1455.63 3102.5 1455.63 3081.73 1562.5 2675.5 1562.5 2807.7 900.5Z" stroke="#D29886" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FFFFFF" fill-rule="evenodd" fill-opacity="1"/><path d="M2778.7 900.5 2905.33 900.5 2794.41 1455.63 3073.5 1455.63 3052.73 1562.5 2646.5 1562.5 2778.7 900.5Z" stroke="#BC644B" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FFFFFF" fill-rule="evenodd" fill-opacity="1"/><path d="M2748.7 900.5 2875.33 900.5 2764.41 1455.63 3043.5 1455.63 3022.73 1562.5 2616.5 1562.5 2748.7 900.5Z" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/></g></g></svg>
<svg width="596" height="683" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" overflow="hidden"><defs><filter id="fx0" x="-10%" y="-10%" width="120%" height="120%" filterUnits="userSpaceOnUse" primitiveUnits="userSpaceOnUse"><feComponentTransfer color-interpolation-filters="sRGB"><feFuncR type="discrete" tableValues="0.835294 0.835294"/><feFuncG type="discrete" tableValues="0.345098 0.345098"/><feFuncB type="discrete" tableValues="0.086275 0.086275"/><feFuncA type="linear" slope="0.400000" intercept="0.000000"/></feComponentTransfer><feGaussianBlur stdDeviation="6.111111 6.111111"/></filter><clipPath id="clip1"><rect x="1.41663" y="1.66669" width="592.667" height="675.667"/></clipPath><clipPath id="clip2"><rect x="-2078.55" y="-2770.64" width="374073" height="374073"/></clipPath><image width="512" height="512" xlink:href="" preserveAspectRatio="none" id="img3"></image><clipPath id="clip4"><rect x="0" y="0" width="371301" height="371301"/></clipPath><clipPath id="clip5"><rect x="-0.363636" y="-2770.64" width="371302" height="374073"/></clipPath><image width="512" height="512" xlink:href="" preserveAspectRatio="none" id="img6"></image><clipPath id="clip7"><rect x="0" y="0" width="371301" height="371301"/></clipPath></defs><g transform="translate(-756 -884)"><g><g clip-path="url(#clip1)" filter="url(#fx0)" transform="translate(757 888)"><g><g><path d="M482.943 195.5C482.943 283.043 379.597 370.586 276.25 370.586" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="none" fill-rule="evenodd"/><path d="M195.25 316.5C262.575 316.5 329.901 417.048 329.901 517.595" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="none" fill-rule="evenodd"/><path d="M30.2499 316C30.2499 270.437 67.1864 233.5 112.75 233.5 158.313 233.5 195.25 270.437 195.25 316 195.25 361.564 158.313 398.5 112.75 398.5 67.1864 398.5 30.2499 361.564 30.2499 316Z" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><path d="M400.25 113C400.25 67.4365 437.187 30.5 482.75 30.5 528.314 30.5 565.25 67.4365 565.25 113 565.25 158.563 528.314 195.5 482.75 195.5 437.187 195.5 400.25 158.563 400.25 113Z" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><path d="M237.25 539.334C237.25 527.275 247.025 517.5 259.084 517.5L400.417 517.5C412.475 517.5 422.25 527.275 422.25 539.334L422.25 626.666C422.25 638.725 412.475 648.5 400.417 648.5L259.084 648.5C247.025 648.5 237.25 638.725 237.25 626.666Z" stroke="#A5300F" stroke-width="21" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><g clip-path="url(#clip2)" transform="matrix(0.000360892 0 0 0.000360892 261.75 517)"><g clip-path="url(#clip4)" transform="matrix(1 0 0 1 0.0663341 0.216198)"><use width="100%" height="100%" xlink:href="#img3" opacity="1" transform="scale(725.197 725.197)"></use></g></g></g></g></g><path d="M1226.19 1083.5C1226.19 1171.04 1122.85 1258.59 1019.5 1258.59" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="none" fill-rule="evenodd"/><path d="M938.5 1204.5C1005.83 1204.5 1073.15 1305.05 1073.15 1405.6" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="none" fill-rule="evenodd"/><path d="M773.5 1204C773.5 1158.44 810.436 1121.5 856 1121.5 901.563 1121.5 938.5 1158.44 938.5 1204 938.5 1249.56 901.563 1286.5 856 1286.5 810.436 1286.5 773.5 1249.56 773.5 1204Z" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><path d="M1143.5 1001C1143.5 955.437 1180.44 918.5 1226 918.5 1271.56 918.5 1308.5 955.437 1308.5 1001 1308.5 1046.56 1271.56 1083.5 1226 1083.5 1180.44 1083.5 1143.5 1046.56 1143.5 1001Z" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><path d="M980.5 1427.33C980.5 1415.28 990.275 1405.5 1002.33 1405.5L1143.67 1405.5C1155.72 1405.5 1165.5 1415.28 1165.5 1427.33L1165.5 1514.67C1165.5 1526.72 1155.72 1536.5 1143.67 1536.5L1002.33 1536.5C990.275 1536.5 980.5 1526.72 980.5 1514.67Z" stroke="#A5300F" stroke-width="20.625" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="8" stroke-opacity="1" fill="#FADDCD" fill-rule="evenodd" fill-opacity="1"/><g clip-path="url(#clip5)" transform="matrix(0.000360892 0 0 0.000360892 1005 1405)"><g clip-path="url(#clip7)" transform="matrix(1 0 0 1 0.0684703 0.216198)"><use width="100%" height="100%" xlink:href="#img6" opacity="1" transform="scale(725.197 725.197)"></use></g></g></g></g></svg>
import argparse
import torch
import triton
from sglang.srt.layers.attention.triton_ops.decode_attention import (
decode_attention_fwd_grouped,
)
from sglang.srt.layers.attention.triton_ops.extend_attention import extend_attention_fwd
# gpt oss
head_num = 64
head_dim = 64
head_kv_num = 8
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=["S"], # sequence length on x-axis
x_vals=[128, 256, 512, 1024, 2048, 4096],
x_log=True,
line_arg="B", # batch size as different lines
line_vals=[1, 8, 32, 128],
line_names=["B=1", "B=8", "B=32", "B=128"],
styles=[
("blue", "-"),
("green", "-"),
("red", "-"),
("cyan", "-"),
],
ylabel="TFLOPS",
plot_name="attention-sink-triton-decode",
args={},
)
)
def benchmark_decode(B, S, H_Q, H_KV, D):
D_V = D
dtype = torch.bfloat16
seq_len = S
total_tokens = B * seq_len
device = torch.device("cuda")
sm_scale = 1.0 / (D**0.5)
max_kv_splits = 8
num_kv_splits = torch.full((B,), 4, dtype=torch.int32, device="cuda")
# q represents the new token being generated, one per batch
q = torch.randn(B, H_Q, D, dtype=dtype, device="cuda")
# k_buffer and v_buffer represent all previous tokens
k_buffer = torch.randn(total_tokens, H_KV, D, dtype=dtype, device="cuda")
v_buffer = torch.randn(total_tokens, H_KV, D, dtype=dtype, device="cuda")
o = torch.zeros(B, H_Q, D_V, dtype=dtype, device="cuda")
b_seq_len = torch.full((B,), seq_len, device="cuda")
kv_indptr = torch.zeros((B + 1,), dtype=torch.int32, device="cuda")
kv_indptr[1 : B + 1] = torch.cumsum(b_seq_len, dim=0)
kv_indices = torch.arange(total_tokens, device="cuda")
attn_logits1 = torch.empty(
(B, H_Q, max_kv_splits, D_V),
dtype=torch.float32,
device="cuda",
)
attn_lse1 = torch.empty(
(B, H_Q, max_kv_splits, D_V),
dtype=torch.float32,
device="cuda",
)
sink = torch.randn(H_Q, device=device, dtype=torch.float32)
# warmup
for _ in range(5):
decode_attention_fwd_grouped(
q,
k_buffer,
v_buffer,
o,
kv_indptr,
kv_indices,
attn_logits1,
attn_lse1,
num_kv_splits,
max_kv_splits,
sm_scale,
logit_cap=0.0,
sinks=sink,
)
# benchmark
run_step = 500
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for _ in range(run_step):
decode_attention_fwd_grouped(
q,
k_buffer,
v_buffer,
o,
kv_indptr,
kv_indices,
attn_logits1,
attn_lse1,
num_kv_splits,
max_kv_splits,
sm_scale,
logit_cap=0.0,
sinks=sink,
)
end_event.record()
end_event.synchronize()
torch.cuda.synchronize()
ms = start_event.elapsed_time(end_event) / run_step
tflops = lambda ms: (2 * B * S * H_Q * D) * 1e-9 / ms # must be causal
return tflops(ms)
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=["S"], # sequence length on x-axis
x_vals=[128, 256, 512, 1024, 2048, 4096],
x_log=True,
line_arg="B", # batch size as different lines
line_vals=[1, 8, 32, 128],
line_names=["B=1", "B=8", "B=32", "B=128"],
styles=[
("blue", "-"),
("green", "-"),
("red", "-"),
("cyan", "-"),
],
ylabel="TFLOPS",
plot_name="attention-sink-triton-extend",
args={},
)
)
def benchmark_extend(B, S, H_Q, H_KV, D):
# S here represents N_CTX from the test
dtype = torch.bfloat16
device = "cuda"
# Split S into prefix and extend lengths
prefill_len = S // 2 # Similar to test's N_CTX // 2
extend_len = S // 4 # Make extend length smaller than prefix
# Calculate total tokens and extend tokens
total_extend_tokens = B * extend_len
total_prefix_tokens = B * prefill_len
# Create query, key, value tensors for extension
q_extend = torch.randn(total_extend_tokens, H_Q, D, dtype=dtype, device=device)
k_extend = torch.randn(total_extend_tokens, H_KV, D, dtype=dtype, device=device)
v_extend = torch.randn(total_extend_tokens, H_KV, D, dtype=dtype, device=device)
o_extend = torch.empty_like(q_extend)
# Create key-value buffers for prefix
k_buffer = torch.randn(total_prefix_tokens, H_KV, D, dtype=dtype, device=device)
v_buffer = torch.randn(total_prefix_tokens, H_KV, D, dtype=dtype, device=device)
# Create index pointers
qo_indptr = torch.arange(0, (B + 1) * extend_len, extend_len, device=device).to(
torch.int32
)
kv_indptr = torch.arange(0, (B + 1) * prefill_len, prefill_len, device=device).to(
torch.int32
)
kv_indices = torch.arange(0, total_prefix_tokens, device=device).to(torch.int32)
sm_scale = 1.0 / (D**0.5)
# sliding_window = 128 # From GPT-OSS config, skip for now
sliding_window = -1
sink = torch.randn(H_Q, device=device, dtype=torch.float32)
# warmup
for _ in range(5):
extend_attention_fwd(
q_extend,
k_extend,
v_extend,
o_extend,
k_buffer,
v_buffer,
qo_indptr,
kv_indptr,
kv_indices,
custom_mask=None,
is_causal=True,
mask_indptr=None,
max_len_extend=extend_len,
sm_scale=sm_scale,
sliding_window_size=sliding_window,
sinks=sink,
)
# benchmark
run_step = 500
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for _ in range(run_step):
extend_attention_fwd(
q_extend,
k_extend,
v_extend,
o_extend,
k_buffer,
v_buffer,
qo_indptr,
kv_indptr,
kv_indices,
custom_mask=None,
is_causal=True,
mask_indptr=None,
max_len_extend=extend_len,
sm_scale=sm_scale,
sliding_window_size=sliding_window,
sinks=sink,
)
end_event.record()
end_event.synchronize()
torch.cuda.synchronize()
ms = start_event.elapsed_time(end_event) / run_step
# FLOPS calculation: each attention operation requires 2 multiplications per element
total_flops = 2 * total_extend_tokens * H_Q * (prefill_len + extend_len / 2) * D
tflops = lambda ms: total_flops * 1e-12 / (ms * 1e-3) # convert to TFLOPS
return tflops(ms)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--bench", type=str, default="all", help="all, extend, decode")
args = parser.parse_args()
kwargs = {
"H_Q": head_num,
"H_KV": head_kv_num,
"D": head_dim,
}
if args.bench in ["all", "decode"]:
benchmark_decode.run(print_data=True, show_plots=False, **kwargs)
if args.bench in ["all", "extend"]:
benchmark_extend.run(print_data=True, show_plots=False, **kwargs)
print("Benchmark finished!")
# Benchmark with lots of common prefixes. Used to benchmark prefix caching performance.
#
# Launch a server:
# python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --log-level-http warning
import random
import string
import time
from tqdm import tqdm
from transformers import AutoTokenizer
import sglang as sgl
from sglang import set_default_backend
from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
def generate_random_string(token_length: int) -> str:
random_string = "".join(
random.choices(string.ascii_letters + string.digits, k=token_length * 100)
)
tokenized_output = tokenizer.encode(random_string, add_special_tokens=False)[
:token_length
]
if len(tokenized_output) < token_length:
tokenized_output = tokenized_output + [tokenizer.pad_token_id] * (
token_length - len(tokenized_output)
)
decoded_string = tokenizer.decode(tokenized_output, skip_special_tokens=False)
return decoded_string
def generate_unique_prefix(base_text, index):
return str(index) + base_text[len(str(index)) :]
@sgl.function
def text_qa(s, question, gen_len):
s += "Q: " + question + "\n"
s += "A:" + sgl.gen("answer", stop="\n", temperature=0, max_tokens=gen_len)
def prepare_prompts(num_prefix, num_samples_per_prefix, prefix_length, suffix_length):
base_prefix = generate_random_string(prefix_length)
tot_input_len = 0
all_prompts = []
for i in tqdm(range(num_prefix), desc="prepare prompts"):
unique_prefix = generate_unique_prefix(base_prefix, i)
prompt_list = []
for j in range(num_samples_per_prefix):
suffix = generate_random_string(suffix_length)
prompt = unique_prefix + suffix
prompt_list.append(prompt)
tot_input_len += len(tokenizer.encode(prompt))
all_prompts.append(prompt_list)
return all_prompts, tot_input_len
def test_batch_by_batch(all_prompts, gen_len):
backend.flush_cache()
tot_time = 0
for i in range(len(all_prompts)):
tic = time.perf_counter()
text_qa.run_batch(
list(zip(all_prompts[i], [gen_len] * len(all_prompts[i]))),
)
tot_time += time.perf_counter() - tic
return tot_time
def test_batch_by_batch_with_hint(all_prompts, gen_len):
backend.flush_cache()
tot_time = 0
for i in range(len(all_prompts)):
tic = time.perf_counter()
# Send a hint to cache the prefix
text_qa.run_batch(list(zip(all_prompts[i][:1], [gen_len])))
# Send the batch
text_qa.run_batch(list(zip(all_prompts[i], [gen_len] * len(all_prompts[i]))))
tot_time += time.perf_counter() - tic
return tot_time
def test_send_all(all_prompts, gen_len):
backend.flush_cache()
all_prompts = [x for prompt_list in all_prompts for x in prompt_list]
tic = time.perf_counter()
text_qa.run_batch(
list(zip(all_prompts, [gen_len] * len(all_prompts))),
)
tot_time = time.perf_counter() - tic
return tot_time
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
backend = RuntimeEndpoint("http://127.0.0.1:30000")
set_default_backend(backend)
random.seed(0)
num_prefix = 10
num_samples_per_prefix = 32
prefix_length = 1024
suffix_length = 128
gen_len = 1
all_prompts, tot_input_len = prepare_prompts(
num_prefix, num_samples_per_prefix, prefix_length, suffix_length
)
print(f"Total input token length: {tot_input_len}\n")
cost = test_batch_by_batch(all_prompts, gen_len)
print(f"Latency of test_batch_by_batch : {cost:.4f} s\n")
cost = test_batch_by_batch_with_hint(all_prompts, gen_len)
print(f"Latency of test_batch_by_batch_with_hint: {cost:.4f} s\n")
cost = test_send_all(all_prompts, gen_len)
print(f"Latency of test_send_all : {cost:.4f} s\n")
import concurrent.futures
import os
import random
import time
from concurrent.futures import ProcessPoolExecutor
from statistics import mean
import requests
from tqdm import tqdm
from transformers import AutoTokenizer
from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
###############################################################################
# CONFIG
###############################################################################
ENDPOINT_URL = "http://127.0.0.1:30000"
TOKENIZER_DIR = "/models/meta-llama/Llama-3.2-3B"
# Benchmark configurations
NUM_REQUESTS = 10 # Total number of requests (each with BATCH_SIZE prompts)
NUM_TOKENS = 32000 # Tokens per prompt
BATCH_SIZE = 8 # Number of prompts per request
GEN_TOKENS = 0 # Tokens to generate per prompt
###############################################################################
# REQUEST GENERATION (in parallel)
###############################################################################
def generate_random_prompt(index, tokenizer_dir, num_tokens):
"""Generate a single random prompt with specified token count."""
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
vocab_size = tokenizer.vocab_size
def generate_random_text(num_toks):
random_token_ids = [random.randint(0, vocab_size - 1) for _ in range(num_toks)]
return tokenizer.decode(random_token_ids, clean_up_tokenization_spaces=True)
random_text = generate_random_text(num_tokens)
return f"Prompt {index}: {random_text}"
def prepare_all_prompts(num_requests, batch_size, num_tokens, tokenizer_dir):
"""Generate prompts for all requests in parallel."""
total_prompts = num_requests * batch_size
all_prompts = [None] * total_prompts
max_workers = min(os.cpu_count() or 1, total_prompts)
with ProcessPoolExecutor(max_workers=max_workers) as executor:
futures = [
executor.submit(generate_random_prompt, i, tokenizer_dir, num_tokens)
for i in range(total_prompts)
]
for future in tqdm(
concurrent.futures.as_completed(futures),
total=total_prompts,
desc="Generating prompts",
):
index = futures.index(future)
all_prompts[index] = future.result()
batched_prompts = [
all_prompts[i * batch_size : (i + 1) * batch_size] for i in range(num_requests)
]
print(
f"Generated {total_prompts} prompts with {num_tokens} tokens each, grouped into {num_requests} requests of {batch_size} prompts.\n"
)
return batched_prompts
###############################################################################
# HTTP CALLS
###############################################################################
def send_batch_request(endpoint, prompts, gen_tokens, request_id):
"""Send a batch of prompts to the /generate endpoint synchronously."""
sampling_params = {
"max_new_tokens": gen_tokens,
"temperature": 0.7,
"stop": "\n",
}
data = {"text": prompts, "sampling_params": sampling_params}
start_time = time.perf_counter()
try:
response = requests.post(
endpoint.base_url + "/generate", json=data, timeout=3600
)
if response.status_code != 200:
error = response.json()
raise RuntimeError(f"Request {request_id} failed: {error}")
result = response.json()
elapsed_time = (time.perf_counter() - start_time) * 1000 # Convert to ms
avg_per_prompt = elapsed_time / len(prompts) if prompts else 0
return request_id, elapsed_time, avg_per_prompt, True, len(prompts)
except Exception as e:
print(f"[Request] Error for request {request_id}: {e}")
return request_id, 0, 0, False, len(prompts)
def run_benchmark(endpoint, batched_prompts, batch_size, gen_tokens):
"""Run the benchmark sequentially."""
results = []
num_requests = len(batched_prompts)
# Record start time for total latency
benchmark_start_time = time.perf_counter()
for i, batch_prompts in enumerate(batched_prompts):
request_id = i + 1
assert (
len(batch_prompts) == batch_size
), f"Request {request_id} should have {batch_size} prompts, got {len(batch_prompts)}"
print(
f"[Request] Sending request {request_id}/{num_requests} with {len(batch_prompts)} prompts at {int(time.time()*1000)}"
)
result = send_batch_request(endpoint, batch_prompts, gen_tokens, request_id)
results.append(result)
# Calculate total latency
total_latency = (time.perf_counter() - benchmark_start_time) * 1000 # Convert to ms
return results, total_latency
###############################################################################
# RESULTS
###############################################################################
def process_results(results, total_latency, num_requests):
"""Process and display benchmark results."""
total_time = 0
successful_requests = 0
failed_requests = 0
request_latencies = []
per_prompt_latencies = []
total_prompts = 0
for request_id, elapsed_time, avg_per_prompt, success, batch_size in results:
if success:
successful_requests += 1
total_prompts += batch_size
request_latencies.append(elapsed_time)
per_prompt_latencies.append(avg_per_prompt)
total_time += elapsed_time / 1000 # Convert to seconds
else:
failed_requests += 1
avg_request_latency = mean(request_latencies) if request_latencies else 0
avg_per_prompt_latency = mean(per_prompt_latencies) if per_prompt_latencies else 0
throughput = total_prompts / total_time if total_time > 0 else 0
print("\nBenchmark Summary:")
print(f" Total requests sent: {len(results)}")
print(f" Total prompts sent: {total_prompts}")
print(f" Successful requests: {successful_requests}")
print(f" Failed requests: {failed_requests}")
print(f" Total latency (all requests): {total_latency:.2f} ms")
print(f" Avg per request latency: {avg_request_latency:.2f} ms")
print(f" Avg per prompt latency: {avg_per_prompt_latency:.2f} ms")
print(f" Throughput: {throughput:.2f} prompts/second\n")
###############################################################################
# MAIN
###############################################################################
def main():
# Initialize endpoint
endpoint = RuntimeEndpoint(ENDPOINT_URL)
# Generate prompts
batched_prompts = prepare_all_prompts(
NUM_REQUESTS, BATCH_SIZE, NUM_TOKENS, TOKENIZER_DIR
)
# Flush cache before benchmark
# endpoint.flush_cache()
# Run benchmark
print(
f"Starting benchmark: NUM_TOKENS={NUM_TOKENS}, BATCH_SIZE={BATCH_SIZE}, NUM_REQUESTS={NUM_REQUESTS}\n"
)
results, total_latency = run_benchmark(
endpoint, batched_prompts, BATCH_SIZE, GEN_TOKENS
)
# Process and display results
process_results(results, total_latency, NUM_REQUESTS)
if __name__ == "__main__":
random.seed(0)
main()
import random
import time
from statistics import mean
from transformers import AutoTokenizer
# CONFIG
TOKENIZER_DIR = (
"/shared/public/sharing/fait360brew/training/models/meta-llama/Llama-3.2-3B"
)
NUM_TOKENS = 20000 # Each prompt should contain this many tokens
BATCH_SIZES = [1, 2, 4, 8] # Test different batch sizes
NUM_RUNS = 5 # Number of runs for each batch size to get reliable measurements
def generate_random_prompts(num_prompts, num_tokens, tokenizer):
"""Generate random prompts with specified token count."""
vocab_size = tokenizer.vocab_size
all_prompts = []
print(f"Generating {num_prompts} random prompts with {num_tokens} tokens each...")
for i in range(num_prompts):
# Generate random token IDs - this directly gives us the exact token count
random_token_ids = [
random.randint(0, vocab_size - 1) for _ in range(num_tokens)
]
random_text = tokenizer.decode(
random_token_ids, clean_up_tokenization_spaces=True
)
prompt = f"Prompt {i}: {random_text}"
tokens = tokenizer.encode(prompt)
print(f" Prompt {i}: {len(tokens)} tokens")
all_prompts.append(prompt)
return all_prompts
def benchmark_sequential_vs_batch(prompts, batch_size, tokenizer):
"""Compare sequential vs batch tokenization for a given batch size."""
# Sequential tokenization using encode()
sequential_times = []
for run in range(NUM_RUNS):
batch_prompts = prompts[:batch_size] # Use same prompts for fair comparison
start_time = time.perf_counter()
for prompt in batch_prompts:
tokens = tokenizer.encode(prompt)
sequential_time = (time.perf_counter() - start_time) * 1000
sequential_times.append(sequential_time)
# Batch tokenization using tokenizer()
batch_times = []
for run in range(NUM_RUNS):
batch_prompts = prompts[:batch_size] # Use same prompts for fair comparison
start_time = time.perf_counter()
tokens = tokenizer(batch_prompts)
batch_time = (time.perf_counter() - start_time) * 1000
batch_times.append(batch_time)
return {
"batch_size": batch_size,
"avg_sequential_ms": mean(sequential_times),
"avg_batch_ms": mean(batch_times),
"speedup_factor": (
mean(sequential_times) / mean(batch_times) if mean(batch_times) > 0 else 0
),
"sequential_runs": sequential_times,
"batch_runs": batch_times,
}
def main():
print("Tokenizer Benchmark: Sequential vs Batch Processing")
print("-" * 60)
print(f"Tokenizer: {TOKENIZER_DIR}")
print(f"Tokens per prompt: {NUM_TOKENS}")
print(f"Number of runs per batch size: {NUM_RUNS}")
print("-" * 60)
# Load tokenizer once for all operations
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)
# The largest batch size determines how many prompts we need
max_batch_size = max(BATCH_SIZES)
all_prompts = generate_random_prompts(max_batch_size, NUM_TOKENS, tokenizer)
results = []
print("\nRunning benchmark...")
for batch_size in BATCH_SIZES:
print(f"\nBenchmarking batch size: {batch_size}")
result = benchmark_sequential_vs_batch(all_prompts, batch_size, tokenizer)
results.append(result)
print(f" Sequential tokenization (encode):")
for i, run_time in enumerate(result["sequential_runs"]):
print(f" Run {i+1}: {run_time:.2f} ms")
print(f" Average: {result['avg_sequential_ms']:.2f} ms")
print(f" Batch tokenization (tokenizer):")
for i, run_time in enumerate(result["batch_runs"]):
print(f" Run {i+1}: {run_time:.2f} ms")
print(f" Average: {result['avg_batch_ms']:.2f} ms")
print(f" Speedup factor: {result['speedup_factor']:.2f}x")
print("\n" + "=" * 60)
print("SUMMARY OF RESULTS")
print("=" * 60)
print(
f"{'Batch Size':<10} {'Sequential (ms)':<18} {'Batch (ms)':<18} {'Speedup':<10}"
)
print("-" * 60)
for result in results:
print(
f"{result['batch_size']:<10} {result['avg_sequential_ms']:.2f} ms{' ' * 8} {result['avg_batch_ms']:.2f} ms{' ' * 8} {result['speedup_factor']:.2f}x"
)
if __name__ == "__main__":
random.seed(0)
main()
## How to reproduce the benchmark results for SGLang v0.3.0 compared to vLLM v0.6.0
In short, with multi step enabled, in online scenarios that we benchmarked, the Median TTFT of vLLM is **3 times** that of SGLang, and the Median ITL is **10 times** that of SGLang. Lower Median TTFT and ITL are better. vLLM's multi-step optimization did not improve throughput while ensuring lower Median TTFT and ITL. Also, under maximum throughput benchmark, if vLLM does not set gpu util to 0.95 separately and uses the default configuration instead, its maximum throughput is **lower** than that of SGLang.
## Online benchmark results
### Llama 3.1 8B Instruct 1 x A100 80G
| RPS | Num prompts | Engine | Median E2E Latency | Median TTFT | Median TPOT | Median ITL |
|------|-------------|--------|--------------------|-------------|-------------|------------|
| 4 | 1200 | SGLang | 1564.17 | **31.98** | 13.17 | **11.93** |
| 4 | 1200 | vLLM | 1691.97 | **100.48** | 14.14 | **129.32** |
| 8 | 2400 | SGLang | 2175.02 | **35.68** | 17.85 | **14.41** |
| 8 | 2400 | vLLM | 2137.16 | **120.39** | 17.09 | **158.63** |
### Llama 3.1 70B Insruct 4 x H100 80G
| RPS | Num Prompts | Engine | Median E2E Latency | Median TTFT | Median TPOT | Median ITL |
|------|-------------|--------|--------------------|-------------|-------------|------------|
| 4 | 1200 | SGLang | 3005.24 | **53.94** | 25.03 | **21.67** |
| 4 | 1200 | vLLM | 2915.60 | **179.15** | 23.58 | **231.23** |
| 8 | 2400 | SGLang | 4064.98 | **58.11** | 33.07 | **24.45** |
| 8 | 2400 | vLLM | 3752.38 | **207.12** | 29.15 | **275.32** |
## Offline benchmark results
### Llama 3.1 8B Instruct 1 x A100 80G
| RPS | Num Prompts | Engine | Request throughput | Output token throughput |
|------|-------------|--------|--------------------|-------------------------|
| inf | 5000 | SGLang | 22.03 | **4281.51** |
| inf | 5000 | vLLM | 21.27 | **4132.37** |
### Llama 3.1 70B Insruct 4 x H100 80G
| RPS | Num Prompts | Engine | Request throughput | Output token throughput |
|------|-------------|--------|--------------------|-------------------------|
| inf | 5000 | SGLang | 19.84 | **3856.01** |
| inf | 5000 | vLLM | 19.04 | **3700.64** |
## Installation
```bash
# install sglang v0.3.0
pip install --upgrade pip
pip install "sglang[all]"==0.3.0
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
# install vllm v0.6.0
pip install vllm==0.6.0
```
## Notes
We referred to the reproduction method in https://github.com/vllm-project/vllm/issues/8176, and added the `--num-scheduler-steps 10` parameter when starting the vLLM server. The `gpu_memory_utilization` of vLLM is by default 0.9 at both TP 1 and TP 4, while SGLang's `mem_frac` is 0.88 at TP 1 and 0.85 at TP 4, so we manually set it to 0.88 at TP 4.
## Online benchmarks
```bash
# Llama 3.1 8B Instruct on 1 x A100
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10 --max_model_len 4096
# Llama 3.1 70B Instruct on 4 x H100
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct --disable-radix-cache --tp 4
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor 4 --max_model_len 4096
# bench serving
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1200 --request-rate 4
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 2400 --request-rate 8
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 1200 --request-rate 4
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 2400 --request-rate 8
```
## Offline benchmarks
```bash
# Llama 3.1 8B Instruct on 1 x A100
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10 --max_model_len 4096
# Llama 3.1 70B Instruct on 4 x H100
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct --disable-radix-cache --tp 4 --mem-frac 0.88
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor 4 --max_model_len 4096
# bench serving
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 5000
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 5000
```
# Create dummy weights:
# 1. Create a folder `~/llama-3.1-405b-fp8-dummy` and create `config.json` and tokenizer under this folder.
# 2. Get `config.json`` from ./config.md
# 3. Download the tokenizer
# wget https://huggingface.co/neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8/resolve/main/tokenizer.json
# wget https://huggingface.co/neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8/resolve/main/tokenizer_config.json
# Launch sglang
# python -m sglang.launch_server --model-path ~/llama-3.1-405b-fp8-dummy/ --load-format dummy --tp 8 --quantization fp8 --disable-radix --mem-frac 0.87
# offline
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 3000 --random-input 1024 --random-output 1024 > sglang_log11
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 4000 --random-input 1024 --random-output 512 > sglang_log12
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 800 --random-input 4096 --random-output 2048 > sglang_log13
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 1500 --random-input 4096 --random-output 1024 > sglang_log14
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 6000 --random-input 256 --random-output 512 > sglang_log15
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompt 2000 > sglang_log21
# online
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 > sglang_log31
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 > sglang_log32
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 > sglang_log33
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 > sglang_log34
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 3200 --request-rate 16 --random-input 1024 --random-output 1024 > sglang_log35
# Launch trtllm
# https://github.com/sgl-project/tensorrt-demo
# offline
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name random --num-prompt 3000 --random-input 1024 --random-output 1024 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log11
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name random --num-prompt 4000 --random-input 1024 --random-output 512 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log12
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name random --num-prompt 800 --random-input 4096 --random-output 2048 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log13
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name random --num-prompt 1500 --random-input 4096 --random-output 1024 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log14
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name random --num-prompt 6000 --random-input 256 --random-output 512 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log15
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name sharegpt --num-prompt 2000 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log21
# online
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name random --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log31
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name random --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log32
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name random --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log33
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name random --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log34
python3 ../../python/sglang/bench_serving.py --backend trt --dataset-name random --num-prompt 3200 --request-rate 16 --random-input 1024 --random-output 1024 --model /root/Meta-Llama-3-8B-Instruct > trtllm_log35
# Create dummy weights:
# 1. Create a folder `~/llama-3.1-405b-fp8-dummy` and create `config.json` and tokenizer under this folder.
# 2. Get `config.json`` from ./config.md
# 3. Download the tokenizer
# wget https://huggingface.co/neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8/resolve/main/tokenizer.json
# wget https://huggingface.co/neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8/resolve/main/tokenizer_config.json
# Launch vllm
# python3 -m vllm.entrypoints.openai.api_server --model ~/llama-3.1-405b-fp8-dummy/ --load-format dummy --disable-log-requests --tensor-parallel-size 8 --max-model-len 10000
# offline
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name random --num-prompt 3000 --random-input 1024 --random-output 1024 > vllm_log11
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name random --num-prompt 4000 --random-input 1024 --random-output 512 > vllm_log12
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name random --num-prompt 800 --random-input 4096 --random-output 2048 > vllm_log13
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name random --num-prompt 1500 --random-input 4096 --random-output 1024 > vllm_log14
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name random --num-prompt 6000 --random-input 256 --random-output 512 > vllm_log15
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name sharegpt --num-prompt 2000 > vllm_log21
# online
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name random --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 > vllm_log31
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name random --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 > vllm_log32
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name random --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 > vllm_log33
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name random --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 > vllm_log34
python3 ../../python/sglang/bench_serving.py --backend vllm --dataset-name random --num-prompt 3200 --request-rate 16 --random-input 1024 --random-output 1024 > vllm_log35
# How to reproduce the benchmark results of SGLang
## Prerequisite
### Install the latest SGLang
```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout v0.2.7
pip install --upgrade pip
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
```
### Set up ulimit and HF_TOKEN
```bash
ulimit -n 65535
# Change the token to a real and usable one, with access permissions for the Llama 3 models.
export HF_TOKEN=hf_token
```
### Launch the server
```bash
# Meta-Llama-3.1-8B-Instruct
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
# Meta-Llama-3.1-70B-Instruct
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct --disable-radix-cache --tp 8
# Meta-Llama-3-70B-Instruct-FP8
python -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-radix-cache --tp 8
```
## Benchmark
### Hardware Requirements
- 8B models: Single NVIDIA A100 80GB GPU
- 70B models: 8 x NVIDIA A100 80GB GPUs with Tensor Parallelism (TP) 8
- 70B FP8 models: 8 x NVIDIA H100 GPUs with Tensor Parallelism (TP) 8
Please ensure you have the appropriate hardware before running the benchmarks.
#### Offline benchmark
```bash
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 1024 --random-output 1024 --output-file offline.jsonl
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 5000 --random-input 1024 --random-output 512 --output-file offline.jsonl
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 4096 --random-output 2048 --output-file offline.jsonl
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2000 --random-input 4096 --random-output 1024 --output-file offline.jsonl
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 --output-file offline.jsonl
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 3000 --output-file offline.jsonl
cat offline.jsonl | cut -d':' -f12 | cut -d',' -f1
```
#### Online benchmark
```bash
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 300 --request-rate 1 --output-file online.jsonl
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 600 --request-rate 2 --output-file online.jsonl
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 1200 --request-rate 4 --output-file online.jsonl
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 2400 --request-rate 8 --output-file online.jsonl
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 3200 --request-rate 16 --output-file online.jsonl
cat online.jsonl | cut -d':' -f9 | cut -d',' -f1
```
## Other
We tried using vLLM 0.5.3.post1, but it often crashes under high loads, and it seems to have similar or worse performance compared to vLLM 0.5.2 from our partial benchmarking, so we are using the older version, vLLM 0.5.2.
Preparation for TensorRT LLM can refer to https://github.com/sgl-project/tensorrt-demo. Specifically, we used a batch size of 512, a max input length of 8192, and a max number of tokens of 8192. The instance count for preprocessing and postprocessing in Triton Server is 16.
```bash
# vLLM
pip install vllm==0.5.2
pip install jsonschema==4.21.1
# Meta-Llama-3-8B-Instruct
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --disable-log-requests
# meta-llama/Meta-Llama-3-70B-Instruct
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --disable-log-requests --tensor 8
# neuralmagic/Meta-Llama-3-70B-Instruct-FP8
python -m vllm.entrypoints.openai.api_server --model neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-log-requests --tensor 8
```
```bash
wget https://raw.githubusercontent.com/sgl-project/sglang/main/python/sglang/bench_serving.py
```
```bash
# vLLM Offline
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 4000 --random-input 1024 --random-output 1024 --output-file offline_vllm.jsonl
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 5000 --random-input 1024 --random-output 512 --output-file offline_vllm.jsonl
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 1000 --random-input 4096 --random-output 2048 --output-file offline_vllm.jsonl
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 2000 --random-input 4096 --random-output 1024 --output-file offline_vllm.jsonl
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 --output-file offline_vllm.jsonl
python3 bench_serving.py --backend vllm --dataset-name sharegpt --num-prompts 3000 --output-file offline_vllm.jsonl
cat offline_vllm.jsonl | cut -d':' -f12 | cut -d',' -f1
```
```bash
# vLLM Online
python3 bench_serving.py --backend vllm --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 300 --request-rate 1 --output-file online_vllm.jsonl
python3 bench_serving.py --backend vllm --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 600 --request-rate 2 --output-file online_vllm.jsonl
python3 bench_serving.py --backend vllm --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 1200 --request-rate 4 --output-file online_vllm.jsonl
python3 bench_serving.py --backend vllm --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 2400 --request-rate 8 --output-file online_vllm.jsonl
python3 bench_serving.py --backend vllm --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 3200 --request-rate 16 --output-file online_vllm.jsonl
cat online_vllm.jsonl | cut -d':' -f9 | cut -d',' -f1
```
```bash
# TensorRT LLM Offline 8B
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --num-prompts 4000 --random-input 1024 --random-output 1024 --output-file offline_trt_8b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --num-prompts 5000 --random-input 1024 --random-output 512 --output-file offline_trt_8b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --num-prompts 1000 --random-input 4096 --random-output 2048 --output-file offline_trt_8b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --num-prompts 2000 --random-input 4096 --random-output 1024 --output-file offline_trt_8b.jsonl
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 --output-file offline_trt_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-name sharegpt --num-prompts 3000 --output-file offline_trt_8b.jsonl
cat offline_trt_8b.jsonl | cut -d':' -f12 | cut -d',' -f1
```
```bash
# TensorRT LLM Online 8B
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 300 --request-rate 1 --output-file online_trt_8b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 600 --request-rate 2 --output-file online_trt_8b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 1200 --request-rate 4 --output-file online_trt_8b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 2400 --request-rate 8 --output-file online_trt_8b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 3200 --request-rate 16 --output-file online_trt_8b.jsonl
cat online_trt_8b.jsonl | cut -d':' -f9 | cut -d',' -f1
```
```bash
# TensorRT LLM Offline 70B
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-name random --num-prompts 4000 --random-input 1024 --random-output 1024 --output-file offline_trt_70b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-name random --num-prompts 5000 --random-input 1024 --random-output 512 --output-file offline_trt_70b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-name random --num-prompts 1000 --random-input 4096 --random-output 2048 --output-file offline_trt_70b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-name random --num-prompts 2000 --random-input 4096 --random-output 1024 --output-file offline_trt_70b.jsonl
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 --output-file offline_trt_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-name sharegpt --num-prompts 3000 --output-file offline_trt_70b.jsonl
cat offline_trt_70b.jsonl | cut -d':' -f12 | cut -d',' -f1
```
```bash
# TensorRT LLM Online 70B
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 300 --request-rate 1 --output-file online_trt_70b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 600 --request-rate 2 --output-file online_trt_70b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 1200 --request-rate 4 --output-file online_trt_70b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 2400 --request-rate 8 --output-file online_trt_70b.jsonl
python3 bench_serving.py --backend trt --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-name random --random-input 1024 --random-output 1024 --num-prompts 3200 --request-rate 16 --output-file online_trt_70b.jsonl
cat online_trt_70b.jsonl | cut -d':' -f9 | cut -d',' -f1
```
### used for TensorRT LLM
```
{
"architecture": "LlamaForCausalLM",
"dtype": "float16",
"logits_dtype": "float32",
"vocab_size": 128256,
"max_position_embeddings": 8192,
"hidden_size": 16384,
"num_hidden_layers": 126,
"num_attention_heads": 128,
"num_key_value_heads": 16,
"head_size": 128,
"qk_layernorm": false,
"hidden_act": "silu",
"intermediate_size": 53248,
"norm_epsilon": 1e-05,
"position_embedding_type": "rope_gpt_neox",
"use_parallel_embedding": false,
"embedding_sharding_dim": 0,
"share_embedding_table": false,
"mapping": {
"world_size": 8,
"tp_size": 8,
"pp_size": 1,
"gpus_per_node": 8
},
"quantization": {
"quant_algo": "FP8",
"kv_cache_quant_algo": null,
"group_size": 128,
"smoothquant_val": null,
"has_zero_point": false,
"pre_quant_scale": false,
"exclude_modules": [
"lm_head"
]
},
"kv_dtype": "float16",
"rotary_scaling": null,
"residual_mlp": false,
"moe_normalization_mode": null,
"rotary_base": 500000.0,
"moe_num_experts": 0,
"moe_top_k": 0,
"moe_tp_mode": 2,
"attn_bias": false,
"disable_weight_only_quant_plugin": false,
"mlp_bias": false
}
```
### used for vLLM and SGLang
```
{
"_name_or_path": "dummy_fp8",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128009,
"hidden_act": "silu",
"hidden_size": 16384,
"initializer_range": 0.02,
"intermediate_size": 53248,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 128,
"num_hidden_layers": 126,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"quantization_config": {
"activation_scheme": "static",
"ignored_layers": [
"lm_head"
],
"quant_method": "fp8"
},
"rope_scaling": {
"factor": 8.0,
"low_freq_factor": 1.0,
"high_freq_factor": 4.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"max_position_embeddings": 131072,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.41.1",
"use_cache": true,
"vocab_size": 128256
}
```
## Download data
```
git clone https://hf-mirror.com/datasets/google/boolq
```
## Convert parquet to json
```
bash parquet_to_json.sh
```
## Run benchmark
### Benchmark sglang
```
python -m sglang.launch_server --model-path ramblingpolymath/Qwen3-32B-W8A8 --port 30000
```
```
python3 bench_sglang.py
```
import argparse
import json
import time
import numpy as np
from sglang.api import set_default_backend
from sglang.test.test_utils import (
add_common_sglang_args_and_parse,
select_sglang_backend,
)
from sglang.utils import read_jsonl
def get_example(lines, i, answer):
prompt = "Question: " + lines[i]["question"] + lines[i]["passage"] + "\nAnswer:"
if answer:
prompt += str(lines[i]["answer"])
return prompt
def few_shot_examples(lines, k):
prompts = ""
for i in range(k):
prompts += get_example(lines, i, True) + "\n\n"
return prompts
def main(args):
# Select backend
set_default_backend(select_sglang_backend(args))
# Read data
train_data_path = args.train_data_path
test_data_path = args.test_data_path
lines_train = list(read_jsonl(train_data_path))
lines_test = list(read_jsonl(test_data_path))
# Construct prompts
num_questions = args.num_questions
num_shots = args.num_shots
few_shots = few_shot_examples(lines_train, num_shots)
questions = []
answer = []
for i in range(len(lines_test[:num_questions])):
questions.append(get_example(lines_test, i, False))
answer.append(str(lines_test[i]["answer"]))
arguments = [{"question": q} for q in questions]
#####################################
######### SGL Program Begin #########
#####################################
import sglang as sgl
@sgl.function
def few_shot_boolq(s, question):
s += few_shots + question
s += sgl.gen("answer", max_tokens=5, stop=["\n"])
#####################################
########## SGL Program End ##########
#####################################
# Run requests
tic = time.perf_counter()
states = few_shot_boolq.run_batch(
arguments,
temperature=0,
num_threads=args.parallel,
progress_bar=True,
)
latency = time.perf_counter() - tic
preds = []
for i in range(len(states)):
preds.append(states[i]["answer"])
# Compute accuracy
acc = np.mean(np.array(preds) == np.array(answer))
# Compute speed
num_output_tokens = sum(
s.get_meta_info("answer")["completion_tokens"] for s in states
)
output_throughput = num_output_tokens / latency
# Print results
print(f"Accuracy: {acc:.3f}")
print(f"Latency: {latency:.3f} s")
print(f"Output throughput: {output_throughput:.3f} token/s")
# Results
with open(args.result_file, "a") as fout:
value = {
"task": "boolq",
"backend": args.backend,
"num_gpus": 1,
"latency": round(latency, 3),
"accuracy": round(acc, 3),
"num_requests": args.num_questions,
"other": {
"num_questions": args.num_questions,
"parallel": args.parallel,
},
}
fout.write(json.dumps(value) + "\n")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--num-shots", type=int, default=5)
parser.add_argument(
"--train-data-path", type=str, default="./boolq/data/train-00000-of-00001.json"
)
parser.add_argument(
"--test-data-path",
type=str,
default="./boolq/data/validation-00000-of-00001.json",
)
parser.add_argument("--num-questions", type=int, default=200)
args = add_common_sglang_args_and_parse(parser)
main(args)
import sys
import pyarrow.parquet as pq
def convert_parquet_to_json(input_file, output_file):
# read parquet file
table = pq.read_table(input_file)
# turn parquet data to dataframe
df = table.to_pandas()
# turn dataframe to json form
json_data = df.to_json(orient="records", lines=True)
# write json to file
with open(output_file, "w") as f:
f.write(json_data)
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage:python convert_parquet_to_json.py <input_file> <output_file>")
input_file = sys.argv[1]
output_file = sys.argv[2]
convert_parquet_to_json(input_file, output_file)
#!/bin/bash
#define input and output direction
input_dir="./boolq/data"
output_dir="./boolq/data"
#define files needed to be handled
files=(
"train-00000-of-00001.parquet"
"validation-00000-of-00001.parquet"
)
#foe files above, use python script to convert the form
for file in "${files[@]}"; do
input_file="${input_dir}/${file}"
output_file="${output_dir}/${file%.parquet}.json"
echo "Converting ${input_file} to ${output_file} ..."
python3 convert_parquet_to_json.py "${input_file}" "${output_file}"
if [ $? -eq 0 ]; then
echo "Conversion successful: ${output_file}"
else
echo "Conversion failed: ${input_file}"
fi
done
## Download data
```
git lfs clone https://huggingface.co/datasets/ceval/ceval-exam
```
## Run benchmark
### Benchmark sglang
```
python -m sglang.launch_server --model-path ramblingpolymath/Qwen3-32B-W8A8 --port 30000
```
```
python3 bench_sglang.py
```
import argparse
import json
import os
import random
import re
import time
import numpy as np
from datasets import load_dataset
from sglang.api import set_default_backend
from sglang.test.test_utils import (
add_common_sglang_args_and_parse,
select_sglang_backend,
)
choices = ["A", "B", "C", "D"]
def get_one_example(line, include_answer):
res = line["question"]
res += f"\nA. {line['A']}"
res += f"\nB. {line['B']}"
res += f"\nC. {line['C']}"
res += f"\nD. {line['D']}"
if include_answer:
res += f"\nAnswer: {line['answer']} \n\n"
return res
def get_few_shot_examples(lines):
res = ""
for line in lines:
res += get_one_example(line, True) + "\n\n"
return res
def get_answer_value(response):
pattern = r"(Answer:|answer:|答案是|答案是:|正确答案是:|答案:|Assistant:)\s*([A-D])(?![\w])"
match = re.search(pattern, response)
if match:
return match.group(2)
return random.choice(choices)
def main(args):
# Read data && Construct prompts
arguments = []
labels = []
examples = "examples:\n"
data_path = args.data_path
for subject in os.listdir(data_path):
subject_path = os.path.join(data_path, subject)
if os.path.isdir(subject_path) and subject != ".git":
dataset = load_dataset(data_path, name=subject)
dev_lines_temp = dataset["dev"]
val_lines_temp = dataset["val"]
few_shot_examples = get_few_shot_examples(dev_lines_temp, subject)
examples += f"{few_shot_examples}"
for val_line in val_lines_temp:
arguments.append(
{
"examples": few_shot_examples,
"question": get_one_example(val_line, False),
}
)
labels.append(val_line["answer"])
#####################################
######### SGL Program Begin #########
#####################################
import sglang as sgl
@sgl.function
def few_shot_ceval(s, examples, question):
s += examples + question + sgl.gen("Answer")
#####################################
########## SGL Program End ##########
#####################################
num_questions = args.num_questions if args.num_questions else len(arguments)
# Select backend
set_default_backend(select_sglang_backend(args))
# Run requests
tic = time.perf_counter()
states = few_shot_ceval.run_batch(
arguments[:num_questions],
temperature=0,
num_threads=args.parallel,
progress_bar=True,
)
latency = time.perf_counter() - tic
preds = [get_answer_value(states[i]["Answer"]) for i in range(num_questions)]
# Compute accuracy
acc = np.mean(np.array(preds) == np.array(labels[:num_questions]))
# Compute speed
num_output_tokens = sum(
s.get_meta_info("Answer")["completion_tokens"] for s in states
)
output_throughput = num_output_tokens / latency
# Print results
print(f"Accuracy: {acc:.3f}")
print(f"Latency: {latency:.3f} s")
print(f"Output throughput: {output_throughput:.3f} token/s")
# Write results
with open(args.result_file, "a") as fout:
value = {
"task": "ceval",
"backend": args.backend,
"num_gpus": 1,
"latency": round(latency, 3),
"accuracy": round(acc, 3),
"num_requests": args.num_questions,
"other": {
"parallel": args.parallel,
},
}
fout.write(json.dumps(value) + "\n")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--data-path", type=str, default="ceval-exam")
parser.add_argument("--num-questions", type=int, default=None)
args = add_common_sglang_args_and_parse(parser)
main(args)
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment