diff --git a/.buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh b/.buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh index c8db951381b0bd8b4c36ffe0b97c2155aea5c52b..0745da8dc418d478d84df9c45978f5da19152f6c 100755 --- a/.buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh +++ b/.buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh @@ -2,7 +2,7 @@ # We can use this script to compute baseline accuracy on chartqa for vllm. # # Make sure you have lm-eval-harness installed: -# pip install lm-eval==0.4.9 +# pip install "lm-eval[api]>=0.4.9.2" usage() { echo`` diff --git a/.buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh b/.buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh index 897f84d1e360de11ceb10d77baf0ff9f8453cdfd..5c17a06245bcf6277decc55bb3236fd2e618eb34 100755 --- a/.buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh +++ b/.buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh @@ -2,7 +2,7 @@ # We can use this script to compute baseline accuracy on GSM for transformers. # # Make sure you have lm-eval-harness installed: -# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api] +# pip install "lm-eval[api]>=0.4.9.2" usage() { echo`` diff --git a/.buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh b/.buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh index 792f355c47a5178801b2624f1a9e06c69707f0ce..1b617ff17c41c3f7e2b4e13aed8ad9b0938fa2e8 100644 --- a/.buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh +++ b/.buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh @@ -3,7 +3,7 @@ # We use this for fp8, which HF does not support. 
# # Make sure you have lm-eval-harness installed: -# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api] +# pip install "lm-eval[api]>=0.4.9.2" usage() { echo`` diff --git a/.buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh b/.buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh index d85a1721db9a59d46ab9a7fdaf52b68c8dc13186..12336d7f85bc918cd5776d82fffeca518f474180 100644 --- a/.buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh +++ b/.buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh @@ -3,7 +3,7 @@ # We use this for fp8, which HF does not support. # # Make sure you have lm-eval-harness installed: -# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api] +# pip install "lm-eval[api]>=0.4.9.2" usage() { echo`` diff --git a/.buildkite/lm-eval-harness/test_lm_eval_correctness.py b/.buildkite/lm-eval-harness/test_lm_eval_correctness.py index f94d681197d2d5e0fce5d1de23cf47840309c78b..a22abe73e39f72abdab84e51a38324b696ef7cf0 100644 --- a/.buildkite/lm-eval-harness/test_lm_eval_correctness.py +++ b/.buildkite/lm-eval-harness/test_lm_eval_correctness.py @@ -60,6 +60,7 @@ def launch_lm_eval(eval_config, tp_size): f"add_bos_token=true," f"trust_remote_code={trust_remote_code}," f"max_model_len={max_model_len}," + "allow_deprecated_quantization=True," ) env_vars = eval_config.get("env_vars", None) diff --git a/.buildkite/performance-benchmarks/README.md b/.buildkite/performance-benchmarks/README.md index 015f48c2520d60dee4782b0a1bde445cd0ed76fa..289877e504bbda8cd4fba7b2fb4b32ae50c19977 100644 --- a/.buildkite/performance-benchmarks/README.md +++ b/.buildkite/performance-benchmarks/README.md @@ -7,7 +7,7 @@ vLLM also maintains a continuous performance benchmark under [perf.vllm.ai](http ## Performance benchmark quick overview -**Benchmarking Coverage**: latency, throughput and 
fix-qps serving on B200, A100, H100, Intel® Xeon® Processors and Intel® Gaudi® 3 Accelerators with different models. +**Benchmarking Coverage**: latency, throughput and fix-qps serving on B200, A100, H100, Intel® Xeon® Processors, Intel® Gaudi® 3 Accelerators and Arm® Neoverse™ with different models. **Benchmarking Duration**: about 1hr. @@ -23,7 +23,7 @@ bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh Runtime environment variables: -- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0. +- `ON_CPU`: set the value to '1' on Intel® Xeon® and Arm® Neoverse™ Processors. Default value is 0. - `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file). - `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file). - `THROUGHPUT_JSON`: JSON file to use for the throughout tests. Default value is empty string (use default file). @@ -34,8 +34,9 @@ Runtime environment variables: See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases. > NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead. -For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead. -> +> For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead. +> For Arm® Neoverse™, use `tests/latency-tests-arm64-cpu.json`, `tests/throughput-tests-arm64-cpu.json`, `tests/serving-tests-arm64-cpu.json` instead. 
+ ### Latency test Here is an example of one test inside `latency-tests.json`: @@ -175,19 +176,6 @@ If you do not see the table, please wait till the benchmark finish running. The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file. The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking. -The `compare-json-results.py` helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`. -When run, benchmark script generates results under `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`. -`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT. -If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead. - -Here is an example using the script to compare result_a and result_b with Model, Dataset name, input/output length, max concurrency and qps. -`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json` - -| | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio | -|----|---------------------------------------|--------|-----|-----|------|-----|-----------|----------|----------| -| 0 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | 1 | 142.633982 | 156.526018 | 1.097396 | -| 1 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | inf| 241.620334 | 294.018783 | 1.216863 | +#### Performance Results Comparison -A comparison diagram will be generated below the table. 
-Here is an example to compare between 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3 -image +Follow the instructions in [performance results comparison](https://docs.vllm.ai/en/latest/benchmarking/dashboard/#performance-results-comparison) to analyze performance results and the sizing guide. diff --git a/.buildkite/performance-benchmarks/scripts/compare-json-results.py b/.buildkite/performance-benchmarks/scripts/compare-json-results.py index c8bf7b0453662d71dff5a6be0f48d2ceb63785e3..b3d0a2d3bbce0b6804a4fdd0ac177628bd860ebb 100644 --- a/.buildkite/performance-benchmarks/scripts/compare-json-results.py +++ b/.buildkite/performance-benchmarks/scripts/compare-json-results.py @@ -1,8 +1,13 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +from __future__ import annotations + import argparse +import html as _html import json import os +from dataclasses import dataclass from importlib import util import pandas as pd @@ -10,27 +15,49 @@ import pandas as pd pd.options.display.float_format = "{:.2f}".format plotly_found = util.find_spec("plotly.express") is not None - +DEFAULT_INFO_COLS = [ + "Model", + "Dataset Name", + "Input Len", + "Output Len", + # "TP Size", + # "PP Size", + "# of max concurrency.", + "qps", +] + +# Safety net: if any DataFrame leaks into to_html(), keep precision at 2. +pd.set_option("display.precision", 2) +pd.set_option("display.float_format", lambda x: f"{x:.2f}") + + +# ----------------------------- +# Core data compare +# ----------------------------- def compare_data_columns( - files, name_column, data_column, info_cols, drop_column, debug=False + files: list[str], + name_column: str, + data_column: str, + info_cols: list[str], + drop_column: str, + debug: bool = False, ): """ Align concatenation by keys derived from info_cols instead of row order. - Pick one canonical key list: subset of info_cols present in ALL files. 
- For each file: set index to those keys, aggregate duplicates - - (mean for metric, first for names). + (mean for metric, first for names). - Concat along axis=1 (indexes align), then reset_index so callers can - - group by columns. + group by columns. - If --debug, add a _name column per file. """ print("\ncompare_data_column:", data_column) frames = [] - raw_data_cols = [] + raw_data_cols: list[str] = [] compare_frames = [] - # 1) choose a canonical key list from info_cols that exists in ALL files - cols_per_file = [] + cols_per_file: list[set] = [] for f in files: try: df_tmp = pd.read_json(f, orient="records") @@ -40,24 +67,20 @@ def compare_data_columns( key_cols = [c for c in info_cols if all(c in cset for cset in cols_per_file)] if not key_cols: - # soft fallback: use any info_cols present in the first file key_cols = [c for c in info_cols if c in list(cols_per_file[0])] if not key_cols: raise ValueError( "No common key columns found from info_cols across the input files." ) - # 2) build a single "meta" block (keys as columns) once, aligned by the key index meta_added = False for file in files: df = pd.read_json(file, orient="records") - # Keep rows that actually have the compared metric (same as original behavior) if drop_column in df.columns: df = df.dropna(subset=[drop_column], ignore_index=True) - # Stabilize numeric key columns (harmless if missing) for c in ( "Input Len", "Output Len", @@ -69,32 +92,26 @@ def compare_data_columns( if c in df.columns: df[c] = pd.to_numeric(df[c], errors="coerce") - # Ensure all key columns exist for c in key_cols: if c not in df.columns: df[c] = pd.NA - # Set index = key_cols and aggregate duplicates → unique MultiIndex df_idx = df.set_index(key_cols, drop=False) - # meta (key columns), unique per key meta = df_idx[key_cols] if not meta.index.is_unique: meta = meta.groupby(level=key_cols, dropna=False).first() - # metric series for this file, aggregated to one row per key file_label = "/".join(file.split("/")[:-1]) or 
os.path.basename(file) s = df_idx[data_column] if not s.index.is_unique: s = s.groupby(level=key_cols, dropna=False).mean() - s.name = file_label # column label like original + s.name = file_label - # add meta once (from first file) so keys are the leftmost columns if not meta_added: frames.append(meta) meta_added = True - # (NEW) debug: aligned test-name column per file if debug and name_column in df_idx.columns: name_s = df_idx[name_column] if not name_s.index.is_unique: @@ -106,26 +123,19 @@ def compare_data_columns( raw_data_cols.append(file_label) compare_frames.append(s) - # Generalize ratio: for any file N>=2, add ratio (fileN / file1) if len(compare_frames) >= 2: base = compare_frames[0] current = compare_frames[-1] if "P99" in data_column or "Median" in data_column: - ratio = base / current # for latency + ratio = base / current else: ratio = current / base - ratio = ratio.mask(base == 0) # avoid inf when baseline is 0 + ratio = ratio.mask(base == 0) ratio.name = f"Ratio 1 vs {len(compare_frames)}" frames.append(ratio) - # 4) concat on columns with aligned MultiIndex; - # then reset_index to return keys as columns - concat_df = pd.concat(frames, axis=1) - concat_df = concat_df.reset_index(drop=True).reset_index() - if "index" in concat_df.columns: - concat_df = concat_df.drop(columns=["index"]) + concat_df = pd.concat(frames, axis=1).reset_index(drop=True) - # Ensure key/info columns appear first (in your info_cols order) front = [c for c in info_cols if c in concat_df.columns] rest = [c for c in concat_df.columns if c not in front] concat_df = concat_df[front + rest] @@ -134,20 +144,15 @@ def compare_data_columns( return concat_df, raw_data_cols +# ----------------------------- +# Split helper +# ----------------------------- def split_json_by_tp_pp( input_file: str = "benchmark_results.json", output_root: str = "." ) -> list[str]: - """ - Split a benchmark JSON into separate folders by (TP Size, PP Size). 
- - Creates: /tp{TP}_pp{PP}/benchmark_results.json - Returns: list of file paths written. - """ - # Load JSON data into DataFrame with open(input_file, encoding="utf-8") as f: data = json.load(f) - # If the JSON is a dict with a list under common keys, use that list if isinstance(data, dict): for key in ("results", "serving_results", "benchmarks", "data"): if isinstance(data.get(key), list): @@ -156,7 +161,6 @@ def split_json_by_tp_pp( df = pd.DataFrame(data) - # Keep only "serving" tests name_col = next( (c for c in ["Test name", "test_name", "Test Name"] if c in df.columns), None ) @@ -165,7 +169,6 @@ def split_json_by_tp_pp( df[name_col].astype(str).str.contains(r"serving", case=False, na=False) ].copy() - # Handle alias column names rename_map = { "tp_size": "TP Size", "tensor_parallel_size": "TP Size", @@ -176,21 +179,14 @@ def split_json_by_tp_pp( columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True ) - # Ensure TP/PP columns exist (default to 1 if missing) if "TP Size" not in df.columns: df["TP Size"] = 1 if "PP Size" not in df.columns: df["PP Size"] = 1 - # make sure TP/PP are numeric ints with no NaN - df["TP Size"] = ( - pd.to_numeric(df.get("TP Size", 1), errors="coerce").fillna(1).astype(int) - ) - df["PP Size"] = ( - pd.to_numeric(df.get("PP Size", 1), errors="coerce").fillna(1).astype(int) - ) + df["TP Size"] = pd.to_numeric(df["TP Size"], errors="coerce").fillna(1).astype(int) + df["PP Size"] = pd.to_numeric(df["PP Size"], errors="coerce").fillna(1).astype(int) - # Split into separate folders saved_paths: list[str] = [] for (tp, pp), group_df in df.groupby(["TP Size", "PP Size"], dropna=False): folder_name = os.path.join(output_root, f"tp{int(tp)}_pp{int(pp)}") @@ -203,32 +199,9 @@ def split_json_by_tp_pp( return saved_paths -def _add_limit_line(fig, y_value, label): - # Visible dashed line + annotation - fig.add_hline( - y=y_value, - line_dash="dash", - line_color="red" if "ttft" in label.lower() else "blue", - 
annotation_text=f"{label}: {y_value} ms", - annotation_position="top left", - ) - # Optional: add a legend item (as a transparent helper trace) - if plot and plotly_found: - import plotly.graph_objects as go - - fig.add_trace( - go.Scatter( - x=[None], - y=[None], - mode="lines", - line=dict( - dash="dash", color="red" if "ttft" in label.lower() else "blue" - ), - name=f"{label}", - ) - ) - - +# ----------------------------- +# Styling helpers +# ----------------------------- def _find_concurrency_col(df: pd.DataFrame) -> str: for c in [ "# of max concurrency.", @@ -239,7 +212,6 @@ def _find_concurrency_col(df: pd.DataFrame) -> str: ]: if c in df.columns: return c - # Fallback: guess an integer-like column (harmless if unused) for c in df.columns: if df[c].dtype.kind in "iu" and df[c].nunique() > 1 and df[c].min() >= 1: return c @@ -248,8 +220,7 @@ def _find_concurrency_col(df: pd.DataFrame) -> str: def _highlight_threshold( df: pd.DataFrame, threshold: float -) -> "pd.io.formats.style.Styler": - """Highlight numeric per-configuration columns with value <= threshold.""" +) -> pd.io.formats.style.Styler: conc_col = _find_concurrency_col(df) key_cols = [ c @@ -260,6 +231,7 @@ def _highlight_threshold( c for c in df.columns if c not in key_cols and not str(c).startswith("Ratio") ] conf_cols = [c for c in conf_cols if pd.api.types.is_numeric_dtype(df[c])] + return df.style.map( lambda v: "background-color:#e6ffe6;font-weight:bold;" if pd.notna(v) and v <= threshold @@ -268,7 +240,264 @@ def _highlight_threshold( ) -if __name__ == "__main__": +def highlight_ratio_columns(styler: pd.io.formats.style.Styler): + ratio_cols = [c for c in styler.data.columns if "ratio" in str(c).lower()] + if not ratio_cols: + return styler + + styler = styler.apply( + lambda _: ["background-color: #fff3b0"] * len(styler.data), + subset=ratio_cols, + axis=0, + ) + + styler = styler.set_table_styles( + [ + { + "selector": f"th.col_heading.level0.col{i}", + "props": [("background-color", 
"#fff3b0")], + } + for i, col in enumerate(styler.data.columns) + if col in ratio_cols + ], + overwrite=False, + ) + return styler + + +def _apply_two_decimals( + styler: pd.io.formats.style.Styler, +) -> pd.io.formats.style.Styler: + df = styler.data + num_cols = df.select_dtypes("number").columns + if len(num_cols) == 0: + return styler + return styler.format({c: "{:.2f}" for c in num_cols}, na_rep="") + + +# ----------------------------- +# Valid max concurrency summary helpers +# ----------------------------- +def _config_value_columns(df: pd.DataFrame, conc_col: str) -> list[str]: + key_cols = [ + c + for c in ["Model", "Dataset Name", "Input Len", "Output Len"] + if c in df.columns + ] + exclude = set(key_cols + [conc_col, "qps", "QPS"]) + + cols: list[str] = [] + for c in df.columns: + if c in exclude: + continue + lc = str(c).lower() + if lc.startswith("ratio"): + continue + if lc.endswith("_name") or lc == "test name" or lc == "test_name": + continue + if pd.api.types.is_numeric_dtype(df[c]): + cols.append(c) + return cols + + +def _max_concurrency_ok( + df: pd.DataFrame, conc_col: str, cfg_col: str, threshold: float +): + if df is None or conc_col not in df.columns or cfg_col not in df.columns: + return pd.NA + + d = df[[conc_col, cfg_col]].copy() + d[conc_col] = pd.to_numeric(d[conc_col], errors="coerce") + d[cfg_col] = pd.to_numeric(d[cfg_col], errors="coerce") + d = d.dropna(subset=[conc_col, cfg_col]) + + if d.empty: + return pd.NA + + ok = d[d[cfg_col] <= threshold] + if ok.empty: + return pd.NA + + return ok[conc_col].max() + + +def _value_at_concurrency(df: pd.DataFrame, conc_col: str, cfg_col: str, conc_value): + if ( + df is None + or conc_col not in df.columns + or cfg_col not in df.columns + or pd.isna(conc_value) + ): + return pd.NA + + d = df[[conc_col, cfg_col]].copy() + d[conc_col] = pd.to_numeric(d[conc_col], errors="coerce") + d[cfg_col] = pd.to_numeric(d[cfg_col], errors="coerce") + + conc_value = pd.to_numeric(conc_value, 
errors="coerce") + if pd.isna(conc_value): + return pd.NA + + hit = d[d[conc_col] == conc_value] + if hit.empty: + return pd.NA + return hit[cfg_col].iloc[0] + + +def build_valid_max_concurrency_summary_html( + tput_group_df: pd.DataFrame | None, + ttft_group_df: pd.DataFrame | None, + tpot_group_df: pd.DataFrame | None, + conc_col: str, + args, +) -> str: + if ttft_group_df is None and tpot_group_df is None: + return "" + + ttft_cols = ( + _config_value_columns(ttft_group_df, conc_col) + if ttft_group_df is not None + else [] + ) + tpot_cols = ( + _config_value_columns(tpot_group_df, conc_col) + if tpot_group_df is not None + else [] + ) + tput_cols = ( + _config_value_columns(tput_group_df, conc_col) + if tput_group_df is not None + else [] + ) + + if ttft_group_df is not None and tpot_group_df is not None: + cfg_cols = [c for c in ttft_cols if c in tpot_cols] + if tput_group_df is not None: + cfg_cols = [c for c in cfg_cols if c in tput_cols] or cfg_cols + else: + cfg_cols = ttft_cols or tpot_cols + + if not cfg_cols: + cfg_cols = sorted(set(ttft_cols) | set(tpot_cols) | set(tput_cols), key=str) + + rows = [] + for cfg in cfg_cols: + ttft_max = ( + _max_concurrency_ok(ttft_group_df, conc_col, cfg, args.ttft_max_ms) + if ttft_group_df is not None + else pd.NA + ) + tpot_max = ( + _max_concurrency_ok(tpot_group_df, conc_col, cfg, args.tpot_max_ms) + if tpot_group_df is not None + else pd.NA + ) + both = ( + pd.NA + if (pd.isna(ttft_max) or pd.isna(tpot_max)) + else min(ttft_max, tpot_max) + ) + + tput_at_both = ( + _value_at_concurrency(tput_group_df, conc_col, cfg, both) + if tput_group_df is not None + else pd.NA + ) + ttft_at_both = ( + _value_at_concurrency(ttft_group_df, conc_col, cfg, both) + if ttft_group_df is not None + else pd.NA + ) + tpot_at_both = ( + _value_at_concurrency(tpot_group_df, conc_col, cfg, both) + if tpot_group_df is not None + else pd.NA + ) + + rows.append( + { + "Configuration": cfg, + f"Max {conc_col} (TTFT ≤ {args.ttft_max_ms:g} 
ms)": ttft_max, + f"Max {conc_col} (TPOT ≤ {args.tpot_max_ms:g} ms)": tpot_max, + f"Max {conc_col} (Both)": both, + "Output Tput @ Both (tok/s)": tput_at_both, + "TTFT @ Both (ms)": ttft_at_both, + "TPOT @ Both (ms)": tpot_at_both, + } + ) + + summary_df = pd.DataFrame(rows) + + # --- Coerce numeric columns so Styler doesn't miss them due to object dtype --- + for c in summary_df.columns: + if c == "Configuration": + continue + summary_df[c] = pd.to_numeric(summary_df[c], errors="coerce") + + both_col = f"Max {conc_col} (Both)" + + # --- Strict 2-decimal formatting for ALL non-Configuration columns --- + formatters = {} + for c in summary_df.columns: + if c == "Configuration": + continue + # default argument binds per-column formatter correctly + formatters[c] = lambda v: "" if pd.isna(v) else f"{float(v):.2f}" + + styler = summary_df.style.format(formatters) + + def _green(v): + return "background-color:#e6ffe6;font-weight:bold;" if pd.notna(v) else "" + + if both_col in summary_df.columns: + styler = styler.map(_green, subset=[both_col]) + + title = ( + '
' + "Valid Max Concurrency Summary" + "
\n" + ) + return title + styler.to_html(table_attributes='border="1" class="dataframe"') + + +# ----------------------------- +# Plot helper +# ----------------------------- +def _add_limit_line(fig, y_value: float, label: str): + fig.add_hline( + y=y_value, + line_dash="dash", + line_color="red" if "ttft" in label.lower() else "blue", + annotation_text=f"{label}: {y_value} ms", + annotation_position="top left", + ) + if plotly_found: + import plotly.graph_objects as go + + fig.add_trace( + go.Scatter( + x=[None], + y=[None], + mode="lines", + line=dict( + dash="dash", + color="red" if "ttft" in label.lower() else "blue", + ), + name=label, + ) + ) + + +# ----------------------------- +# Refactored main + group-first report +# ----------------------------- +@dataclass(frozen=True) +class MetricPlan: + data_cols: list[str] + drop_column: str + + +def build_parser() -> argparse.ArgumentParser: parser = argparse.ArgumentParser() parser.add_argument( "-f", "--file", action="append", type=str, help="input file name" @@ -308,149 +537,289 @@ if __name__ == "__main__": default=100.0, help="Reference limit for TPOT plots (ms)", ) + return parser - args = parser.parse_args() +def choose_metrics(latency: str) -> MetricPlan: + latency = (latency or "").lower() drop_column = "P99" - name_column = "Test name" - info_cols = [ - "Model", - "Dataset Name", - "Input Len", - "Output Len", - "TP Size", - "PP Size", - "# of max concurrency.", - "qps", - ] - if "median" in args.latency: - data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"] - html_msgs_for_data_cols = [ - "Compare Output Tokens /n", - "Median TTFT /n", - "Median TPOT /n", - ] - drop_column = "P99" - elif "p99" in args.latency: - data_cols_to_compare = ["Output Tput (tok/s)", "P99 TTFT (ms)", "P99"] - html_msgs_for_data_cols = [ - "Compare Output Tokens /n", - "P99 TTFT /n", - "P99 TPOT /n", - ] + if "median" in latency: + return MetricPlan( + data_cols=["Output Tput (tok/s)", "Median TTFT (ms)", 
"Median"], + drop_column=drop_column, + ) + + return MetricPlan( + data_cols=["Output Tput (tok/s)", "P99 TTFT (ms)", "P99"], + drop_column=drop_column, + ) + + +def prepare_input_files(args, info_cols: list[str]) -> tuple[list[str], list[str]]: + if not args.file: + raise ValueError("No input files provided. Use -f/--file.") if len(args.file) == 1: files = split_json_by_tp_pp(args.file[0], output_root="splits") info_cols = [c for c in info_cols if c not in ("TP Size", "PP Size")] else: files = args.file + + return files, info_cols + + +def get_y_axis_col(info_cols: list[str], xaxis: str) -> str: + y_axis_index = info_cols.index(xaxis) if xaxis in info_cols else 6 + return info_cols[y_axis_index] + + +def get_group_cols(output_df: pd.DataFrame, info_cols: list[str]) -> list[str]: + filtered_info_cols = info_cols[:4] + group_cols = [c for c in filtered_info_cols if c in output_df.columns] + if not group_cols: + raise ValueError( + f"No valid group-by columns. Expected subset: {filtered_info_cols}, " + f"but DataFrame has: {list(output_df.columns)}" + ) + return group_cols + + +def normalize_group_key(name): + return name if isinstance(name, tuple) else (name,) + + +def group_filename(name, prefix: str = "perf_comparison_") -> str: + name_vals = normalize_group_key(name) + safe = ",".join(map(str, name_vals)).replace(",", "_").replace("/", "-") + return f"{prefix}{safe}.html" + + +def build_group_suffix(group_cols: list[str], name) -> str: + name_vals = normalize_group_key(name) + return " , ".join(f"{col} : [ {val} ] " for col, val in zip(group_cols, name_vals)) + + +def render_metric_table_html( + display_group: pd.DataFrame, + metric_label: str, + group_suffix: str, + args, +) -> str: + title = ( + f'
' + f"{_html.escape(metric_label)}" + f" — {_html.escape(group_suffix)}" + f"
\n" + ) + + metric_name = metric_label.lower() + if "ttft" in metric_name: + styler = _highlight_threshold(display_group, args.ttft_max_ms) + elif ("tpot" in metric_name) or ("median" in metric_name) or ("p99" in metric_name): + styler = _highlight_threshold(display_group, args.tpot_max_ms) + else: + styler = display_group.style + + styler = _apply_two_decimals(styler) + styler = highlight_ratio_columns(styler) + + return title + styler.to_html(table_attributes='border="1" class="dataframe"') + + +def maybe_write_plot( + main_fh, + sub_fh, + group_df: pd.DataFrame, + raw_data_cols: list[str], + metric_label: str, + y_axis_col: str, + args, +): + if not (args.plot and plotly_found): + return + + import plotly.express as px + + df = group_df[raw_data_cols].sort_values(by=y_axis_col) + df_melted = df.melt( + id_vars=y_axis_col, + var_name="Configuration", + value_name=metric_label, + ) + + fig = px.line( + df_melted, + x=y_axis_col, + y=metric_label, + color="Configuration", + title=f"{metric_label} vs {y_axis_col}", + markers=True, + ) + + # Ensure plot hover + y tick labels are also 2 decimals. 
+ fig.update_traces(hovertemplate="%{y:.2f}") + fig.update_yaxes(tickformat=".2f") + + metric_name = metric_label.lower() + if "ttft" in metric_name: + _add_limit_line(fig, args.ttft_max_ms, "TTFT limit") + elif ("tpot" in metric_name) or ("median" in metric_name) or ("p99" in metric_name): + _add_limit_line(fig, args.tpot_max_ms, "TPOT limit") + + html = fig.to_html(full_html=True, include_plotlyjs="cdn") + main_fh.write(html) + sub_fh.write(html) + + +def build_group_keys( + df: pd.DataFrame, group_cols: list[str], sort_cols: list[str] | None = None +): + if sort_cols: + df = df.sort_values(by=sort_cols) + gb = df.groupby(group_cols, dropna=False) + return [k for k, _ in gb] + + +def write_report_group_first( + files: list[str], info_cols: list[str], plan: MetricPlan, args +): + name_column = "Test name" + y_axis_col = get_y_axis_col(info_cols, args.xaxis) + print("comparing : " + ", ".join(files)) - debug = args.debug - plot = args.plot - # For Plot feature, assign y axis from one of info_cols - y_axis_index = info_cols.index(args.xaxis) if args.xaxis in info_cols else 6 - with open("perf_comparison.html", "w") as text_file: - for i in range(len(data_cols_to_compare)): - output_df, raw_data_cols = compare_data_columns( - files, - name_column, - data_cols_to_compare[i], - info_cols, - drop_column, - debug=debug, + + metric_cache: dict[str, tuple[pd.DataFrame, list[str]]] = {} + group_cols_canonical: list[str] | None = None + + for metric_label in plan.data_cols: + output_df, raw_data_cols = compare_data_columns( + files, + name_column, + metric_label, + info_cols, + plan.drop_column, + debug=args.debug, + ) + + raw_data_cols = list(raw_data_cols) + raw_data_cols.insert(0, y_axis_col) + + group_cols = get_group_cols(output_df, info_cols) + if group_cols_canonical is None: + group_cols_canonical = group_cols + else: + group_cols_canonical = [c for c in group_cols_canonical if c in group_cols] + + metric_cache[metric_label] = ( + 
output_df.sort_values(by=args.xaxis), + raw_data_cols, + ) + + if not group_cols_canonical: + raise ValueError("No canonical group columns found across metrics.") + + first_metric = plan.data_cols[0] + first_df_sorted, _ = metric_cache[first_metric] + group_keys = build_group_keys( + first_df_sorted, group_cols_canonical, sort_cols=[args.xaxis] + ) + + metric_groupbys = { + metric_label: df.groupby(group_cols_canonical, dropna=False) + for metric_label, (df, _) in metric_cache.items() + } + + with open("perf_comparison.html", "w", encoding="utf-8") as main_fh: + main_fh.write('\n') + for gkey in group_keys: + gkey_tuple = normalize_group_key(gkey) + suffix = build_group_suffix(group_cols_canonical, gkey_tuple) + sub_path = group_filename(gkey_tuple) + group_header = ( + '
' + f"{_html.escape(suffix)}" + "
\n" ) - # For Plot feature, insert y axis from one of info_cols - raw_data_cols.insert(0, info_cols[y_axis_index]) - - filtered_info_cols = info_cols[:-2] - existing_group_cols = [ - c for c in filtered_info_cols if c in output_df.columns - ] - if not existing_group_cols: - raise ValueError( - f"No valid group-by columns " - f"Expected subset: {filtered_info_cols}, " - f"but DataFrame has: {list(output_df.columns)}" - ) - # output_df_sorted = output_df.sort_values(by=existing_group_cols) - output_df_sorted = output_df.sort_values(by=args.xaxis) - output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False) - for name, group in output_groups: - group_name = ( - ",".join(map(str, name)).replace(",", "_").replace("/", "-") - ) - group_html_name = "perf_comparison_" + group_name + ".html" - - metric_name = str(data_cols_to_compare[i]).lower() - if "tok/s" in metric_name: - html = group.to_html() - elif "ttft" in metric_name: - styler = _highlight_threshold(group, args.ttft_max_ms).format( - {c: "{:.2f}" for c in group.select_dtypes("number").columns}, - na_rep="—", - ) - html = styler.to_html( - table_attributes='border="1" class="dataframe"' + main_fh.write(group_header) + with open(sub_path, "w", encoding="utf-8") as sub_fh: + sub_fh.write('\n') + sub_fh.write(group_header) + tput_group_df = None + ttft_group_df = None + tpot_group_df = None + conc_col = args.xaxis + + for metric_label in plan.data_cols: + gb = metric_groupbys[metric_label] + df_sorted, raw_data_cols = metric_cache[metric_label] + + try: + group_df = gb.get_group(gkey) + except KeyError: + missing = ( + '
' + f"{_html.escape(metric_label)} — missing for this group" + "
\n" + ) + + main_fh.write(missing) + sub_fh.write(missing) + continue + + if conc_col not in group_df.columns: + conc_col = _find_concurrency_col(group_df) + + mn = metric_label.lower().strip() + if "tok/s" in mn: + tput_group_df = group_df + elif "ttft" in mn: + ttft_group_df = group_df + elif mn in ("p99", "median") or "tpot" in mn: + tpot_group_df = group_df + + display_group = group_df.drop( + columns=group_cols_canonical, errors="ignore" ) - elif ( - "tpot" in metric_name - or "median" in metric_name - or "p99" in metric_name - ): - styler = _highlight_threshold(group, args.tpot_max_ms).format( - {c: "{:.2f}" for c in group.select_dtypes("number").columns}, - na_rep="—", + + html = render_metric_table_html( + display_group, metric_label, suffix, args ) - html = styler.to_html( - table_attributes='border="1" class="dataframe"' + main_fh.write(html) + sub_fh.write(html) + + maybe_write_plot( + main_fh, + sub_fh, + group_df=group_df, + raw_data_cols=raw_data_cols, + metric_label=metric_label, + y_axis_col=y_axis_col, + args=args, ) - text_file.write(html_msgs_for_data_cols[i]) - text_file.write(html) - with open(group_html_name, "a+") as sub_text_file: - sub_text_file.write(html_msgs_for_data_cols[i]) - sub_text_file.write(html) - - if plot and plotly_found: - import plotly.express as px - - df = group[raw_data_cols] - df_sorted = df.sort_values(by=info_cols[y_axis_index]) - # Melt DataFrame for plotting - df_melted = df_sorted.melt( - id_vars=info_cols[y_axis_index], - var_name="Configuration", - value_name=data_cols_to_compare[i], - ) - title = ( - data_cols_to_compare[i] + " vs " + info_cols[y_axis_index] - ) - # Create Plotly line chart - fig = px.line( - df_melted, - x=info_cols[y_axis_index], - y=data_cols_to_compare[i], - color="Configuration", - title=title, - markers=True, - ) + summary_html = build_valid_max_concurrency_summary_html( + tput_group_df=tput_group_df, + ttft_group_df=ttft_group_df, + tpot_group_df=tpot_group_df, + conc_col=conc_col, + 
args=args, + ) + if summary_html: + main_fh.write(summary_html) + sub_fh.write(summary_html) - # ---- Add threshold lines based on metric name ---- - if "ttft" in metric_name: - _add_limit_line(fig, args.ttft_max_ms, "TTFT limit") - elif ( - "tpot" in metric_name - or "median" in metric_name - or "p99" in metric_name - ): - _add_limit_line(fig, args.tpot_max_ms, "TPOT limit") - - # Export to HTML - text_file.write( - fig.to_html(full_html=True, include_plotlyjs="cdn") - ) - sub_text_file.write( - fig.to_html(full_html=True, include_plotlyjs="cdn") - ) + +def main(): + args = build_parser().parse_args() + info_cols = list(DEFAULT_INFO_COLS) + plan = choose_metrics(args.latency) + files, info_cols = prepare_input_files(args, info_cols) + write_report_group_first(files, info_cols, plan, args) + + +if __name__ == "__main__": + main() diff --git a/.buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh b/.buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh old mode 100644 new mode 100755 index 34ceefe0996f29c4d9b8957257645bfc9d849b65..6b6a7e472b9c8658f3e79135030f2c5a604ee0b4 --- a/.buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh +++ b/.buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh @@ -49,7 +49,11 @@ check_cpus() { echo "Need at least 1 NUMA to run benchmarking." 
exit 1 fi - declare -g gpu_type="cpu" + if [[ "$(uname -m)" == "aarch64" ]] || [[ "$(uname -m)" == "arm64" ]]; then + declare -g gpu_type="arm64-cpu" + else + declare -g gpu_type="cpu" + fi echo "GPU type is $gpu_type" } @@ -207,8 +211,8 @@ run_latency_tests() { # check if there is enough GPU to run the test tp=$(echo "$latency_params" | jq -r '.tensor_parallel_size') - if [ "$ON_CPU" == "1" ]; then - pp=$(echo "$latency_params" | jq -r '.pipeline_parallel_size') + if [[ "$ON_CPU" == "1" ]]; then + pp=$(echo "$latency_params" | jq -r '.pipeline_parallel_size // 1') world_size=$(($tp*$pp)) if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name." @@ -276,8 +280,8 @@ run_throughput_tests() { # check if there is enough GPU to run the test tp=$(echo "$throughput_params" | jq -r '.tensor_parallel_size') - if [ "$ON_CPU" == "1" ]; then - pp=$(echo "$throughput_params" | jq -r '.pipeline_parallel_size') + if [[ "$ON_CPU" == "1" ]]; then + pp=$(echo "$throughput_params" | jq -r '.pipeline_parallel_size // 1') world_size=$(($tp*$pp)) if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name." @@ -393,8 +397,8 @@ run_serving_tests() { # check if there is enough resources to run the test tp=$(echo "$server_params" | jq -r '.tensor_parallel_size') - if [ "$ON_CPU" == "1" ]; then - pp=$(echo "$server_params" | jq -r '.pipeline_parallel_size') + if [[ "$ON_CPU" == "1" ]]; then + pp=$(echo "$server_params" | jq -r '.pipeline_parallel_size // 1') world_size=$(($tp*$pp)) if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name." 
@@ -496,9 +500,9 @@ run_serving_tests() { main() { local ARCH ARCH='' - if [ "$ON_CPU" == "1" ];then - check_cpus - ARCH='-cpu' + if [[ "$ON_CPU" == "1" ]]; then + check_cpus + ARCH="-$gpu_type" else check_gpus ARCH="$arch_suffix" diff --git a/.buildkite/performance-benchmarks/tests/latency-tests-arm64-cpu.json b/.buildkite/performance-benchmarks/tests/latency-tests-arm64-cpu.json new file mode 100644 index 0000000000000000000000000000000000000000..fba695041e3eef5a40503f8a9e9abf1bcb150595 --- /dev/null +++ b/.buildkite/performance-benchmarks/tests/latency-tests-arm64-cpu.json @@ -0,0 +1,26 @@ +[ + { + "test_name": "latency_llama8B_tp1", + "environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "parameters": { + "model": "meta-llama/Llama-3.1-8B-Instruct", + "tensor_parallel_size": 1, + "load_format": "dummy", + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "num_iters_warmup": 5, + "num_iters": 15 + } + } +] diff --git a/.buildkite/performance-benchmarks/tests/serving-tests-arm64-cpu.json b/.buildkite/performance-benchmarks/tests/serving-tests-arm64-cpu.json new file mode 100644 index 0000000000000000000000000000000000000000..63f1f8ab887b34e3a5cb8752fc4da4af120c4389 --- /dev/null +++ b/.buildkite/performance-benchmarks/tests/serving-tests-arm64-cpu.json @@ -0,0 +1,130 @@ +{ + "defaults": { + "qps_list": [ + "inf" + ], + "max_concurrency_list": [ + 12, + 16, + 24, + 32, + 64, + 128, + 200 + ], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Llama-3.1-8B-Instruct", + 
"tensor_parallel_size": 1, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Llama-3.1-8B-Instruct", + "backend": "vllm", + "ignore-eos": "", + "num_prompts": 200 + } + }, + "tests": [ + { + "test_name": "serving_llama8B_tp1_sharegpt", + "server_parameters": { + "tensor_parallel_size": 1 + }, + "client_parameters": { + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json" + } + }, + { + "test_name": "serving_llama8B_tp2_sharegpt", + "server_parameters": { + "tensor_parallel_size": 2 + }, + "client_parameters": { + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json" + } + }, + { + "test_name": "serving_llama8B_tp1_random_128_128", + "server_parameters": { + "tensor_parallel_size": 1 + }, + "client_parameters": { + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128 + } + }, + { + "test_name": "serving_llama8B_tp2_random_128_128", + "server_parameters": { + "tensor_parallel_size": 2 + }, + "client_parameters": { + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128 + } + }, + { + "test_name": "serving_llama8B_tp1_random_128_2048", + "server_parameters": { + "tensor_parallel_size": 1 + }, + "client_parameters": { + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 2048 + } + }, + { + "test_name": "serving_llama8B_tp2_random_128_2048", + "server_parameters": { + "tensor_parallel_size": 2 + }, + "client_parameters": { + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 2048 + } + }, + { + "test_name": "serving_llama8B_tp1_random_2048_128", + "server_parameters": { + "tensor_parallel_size": 1 + }, + "client_parameters": { + "dataset_name": 
"random", + "random-input-len": 2048, + "random-output-len": 128 + } + }, + { + "test_name": "serving_llama8B_tp2_random_2048_128", + "server_parameters": { + "tensor_parallel_size": 2 + }, + "client_parameters": { + "dataset_name": "random", + "random-input-len": 2048, + "random-output-len": 128 + } + } + ] +} \ No newline at end of file diff --git a/.buildkite/performance-benchmarks/tests/serving-tests-cpu.json b/.buildkite/performance-benchmarks/tests/serving-tests-cpu.json index 8f7200862d20cb1d0d2a8a3e793143479668d44b..25ed7415ec0e48b65e19123493aff4a9977a2296 100644 --- a/.buildkite/performance-benchmarks/tests/serving-tests-cpu.json +++ b/.buildkite/performance-benchmarks/tests/serving-tests-cpu.json @@ -19,10 +19,8 @@ "block_size": 128, "trust_remote_code": "", "disable_log_stats": "", - "enforce_eager": "", "max_num_batched_tokens": 2048, - "max_num_seqs": 256, - "load_format": "dummy" + "max_num_seqs": 256 }, "client_parameters": { "model": "meta-llama/Llama-3.1-8B-Instruct", @@ -151,6 +149,45 @@ "random-output-len": 128 } }, + { + "test_name": "serving_llama8B_int4_tp1_random_128_128", + "server_parameters": { + "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", + "tensor_parallel_size": 1 + }, + "client_parameters": { + "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128 + } + }, + { + "test_name": "serving_llama8B_int4_tp2_random_128_128", + "server_parameters": { + "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", + "tensor_parallel_size": 2 + }, + "client_parameters": { + "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128 + } + }, + { + "test_name": "serving_llama8B_int4_tp4_random_128_128", + "server_parameters": { + "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", + "tensor_parallel_size": 4 + }, + "client_parameters": { + 
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128 + } + }, { "test_name": "serving_llama3B_tp1_random_128_128", "server_parameters": { diff --git a/.buildkite/performance-benchmarks/tests/throughput-tests-arm64-cpu.json b/.buildkite/performance-benchmarks/tests/throughput-tests-arm64-cpu.json new file mode 100644 index 0000000000000000000000000000000000000000..da84dd4d0c67aa9887f2105a30eed1e097fcb0ca --- /dev/null +++ b/.buildkite/performance-benchmarks/tests/throughput-tests-arm64-cpu.json @@ -0,0 +1,27 @@ +[ + { + "test_name": "throughput_llama8B_tp1", + "environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "parameters": { + "model": "meta-llama/Llama-3.1-8B-Instruct", + "tensor_parallel_size": 1, + "load_format": "dummy", + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "num_prompts": 200, + "backend": "vllm" + } + } +] diff --git a/.buildkite/release-pipeline.yaml b/.buildkite/release-pipeline.yaml index a9d51557bd9bb67951f5ab97e1748f21abd37ec2..092755ea085c8538f7b552eb7819bea787186b8b 100644 --- a/.buildkite/release-pipeline.yaml +++ b/.buildkite/release-pipeline.yaml @@ -1,6 +1,6 @@ steps: # aarch64 + CUDA builds - - label: "Build arm64 wheel - CUDA 12.9" + - label: "Build wheel - aarch64 - CUDA 12.9" depends_on: ~ id: build-wheel-arm64-cuda-12-9 agents: @@ -11,11 +11,11 @@ steps: - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target 
build --progress plain -f docker/Dockerfile ." - "mkdir artifacts" - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'" - - "bash .buildkite/scripts/upload-wheels.sh" + - "bash .buildkite/scripts/upload-nightly-wheels.sh" env: DOCKER_BUILDKIT: "1" - - label: "Build arm64 wheel - CUDA 13.0" + - label: "Build wheel - aarch64 - CUDA 13.0" depends_on: ~ id: build-wheel-arm64-cuda-13-0 agents: @@ -26,12 +26,12 @@ steps: - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ." - "mkdir artifacts" - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'" - - "bash .buildkite/scripts/upload-wheels.sh manylinux_2_35" + - "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35" env: DOCKER_BUILDKIT: "1" # aarch64 build - - label: "Build arm64 CPU wheel" + - label: "Build wheel - aarch64 - CPU" depends_on: ~ id: build-wheel-arm64-cpu agents: @@ -40,39 +40,39 @@ steps: - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_BUILD_ACL=ON --tag vllm-ci:build-image --target vllm-build --progress plain -f docker/Dockerfile.cpu ." 
- "mkdir artifacts" - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'" - - "bash .buildkite/scripts/upload-wheels.sh manylinux_2_35" + - "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35" env: DOCKER_BUILDKIT: "1" # x86 + CUDA builds - - label: "Build wheel - CUDA 12.9" + - label: "Build wheel - x86_64 - CUDA 12.9" depends_on: ~ - id: build-wheel-cuda-12-9 + id: build-wheel-x86-cuda-12-9 agents: queue: cpu_queue_postmerge commands: - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ." - "mkdir artifacts" - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'" - - "bash .buildkite/scripts/upload-wheels.sh manylinux_2_31" + - "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_31" env: DOCKER_BUILDKIT: "1" - - label: "Build wheel - CUDA 13.0" + - label: "Build wheel - x86_64 - CUDA 13.0" depends_on: ~ - id: build-wheel-cuda-13-0 + id: build-wheel-x86-cuda-13-0 agents: queue: cpu_queue_postmerge commands: - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ." 
- "mkdir artifacts" - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'" - - "bash .buildkite/scripts/upload-wheels.sh manylinux_2_35" + - "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35" env: DOCKER_BUILDKIT: "1" # x86 CPU wheel build - - label: "Build x86 CPU wheel" + - label: "Build wheel - x86_64 - CPU" depends_on: ~ id: build-wheel-x86-cpu agents: @@ -81,12 +81,12 @@ steps: - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_AMXBF16=true --tag vllm-ci:build-image --target vllm-build --progress plain -f docker/Dockerfile.cpu ." - "mkdir artifacts" - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'" - - "bash .buildkite/scripts/upload-wheels.sh manylinux_2_35" + - "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35" env: DOCKER_BUILDKIT: "1" - # Build release images (12.9) - - label: "Build release image (x86)" + # Build release images (CUDA 12.9) + - label: "Build release image - x86_64 - CUDA 12.9" depends_on: ~ id: build-release-image-x86 agents: @@ -99,7 +99,7 @@ steps: - "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT" - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT" - - label: "Build release image (arm64)" + - label: "Build release image - aarch64 - CUDA 12.9" depends_on: ~ id: build-release-image-arm64 agents: @@ -109,34 +109,92 @@ steps: - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg 
INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ." - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)" - # Add job to create multi-arch manifest - - label: "Create multi-arch manifest" + - label: "Create multi-arch manifest - CUDA 12.9" depends_on: - build-release-image-x86 - build-release-image-arm64 id: create-multi-arch-manifest agents: - queue: cpu_queue_postmerge + queue: small_cpu_queue_postmerge commands: - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7" - "docker manifest create public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64 --amend" - "docker manifest push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT" - - label: "Annotate release workflow" + - label: "Annotate release workflow - CUDA 12.9" depends_on: - create-multi-arch-manifest id: annotate-release-workflow agents: - queue: cpu_queue_postmerge + queue: small_cpu_queue_postmerge commands: - "bash .buildkite/scripts/annotate-release.sh" + - block: "Build CUDA 13.0 release images" + key: block-release-image-build-cuda-13-0 + depends_on: ~ + + - label: "Build release image - x86_64 - CUDA 13.0" + depends_on: block-release-image-build-cuda-13-0 + id: build-release-image-x86-cuda-13-0 + agents: + queue: cpu_queue_postmerge + commands: + - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7" + - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.2 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg INSTALL_KV_CONNECTORS=true --build-arg 
BUILD_BASE_IMAGE=nvidia/cuda:13.0.2-devel-ubuntu22.04 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130 --target vllm-openai --progress plain -f docker/Dockerfile ." + - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130" + # re-tag to default image tag and push, just in case arm64 build fails + - "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130" + - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130" + + - label: "Build release image - aarch64 - CUDA 13.0" + depends_on: block-release-image-build-cuda-13-0 + id: build-release-image-arm64-cuda-13-0 + agents: + queue: arm64_cpu_queue_postmerge + commands: + - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7" + - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.2 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.2-devel-ubuntu22.04 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130 --target vllm-openai --progress plain -f docker/Dockerfile ." 
+ - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130" + + - label: "Create multi-arch manifest - CUDA 13.0" + depends_on: + - build-release-image-x86-cuda-13-0 + - build-release-image-arm64-cuda-13-0 + id: create-multi-arch-manifest-cuda-13-0 + agents: + queue: small_cpu_queue_postmerge + commands: + - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7" + - "docker manifest create public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64-cu130 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64-cu130 --amend" + - "docker manifest push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130" + - input: "Provide Release version here" id: input-release-version fields: - text: "What is the release version?" key: release-version + - block: "Confirm update release wheels to PyPI (experimental, use with caution)?" 
+ key: block-upload-release-wheels + depends_on: + - input-release-version + - build-wheel-x86-cuda-12-9 + - build-wheel-x86-cuda-13-0 + - build-wheel-x86-cpu + - build-wheel-arm64-cuda-12-9 + - build-wheel-arm64-cuda-13-0 + - build-wheel-arm64-cpu + + - label: "Upload release wheels to PyPI and GitHub" + depends_on: + - block-upload-release-wheels + id: upload-release-wheels + agents: + queue: small_cpu_queue_postmerge + commands: + - "bash .buildkite/scripts/upload-release-wheels.sh" + - block: "Build CPU release image" key: block-cpu-release-image-build depends_on: ~ @@ -169,12 +227,30 @@ steps: env: DOCKER_BUILDKIT: "1" + - block: "Build ROCm release image" + key: block-rocm-release-image-build + depends_on: ~ + + - label: "Build release image (ROCm)" + depends_on: block-rocm-release-image-build + id: build-release-image-rocm + agents: + queue: cpu_queue_postmerge + commands: + - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7" + # Build base image first + - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --tag rocm/vllm-dev:base-$BUILDKITE_COMMIT --target final --progress plain -f docker/Dockerfile.rocm_base ." + # Build vLLM ROCm image using the base + - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg BASE_IMAGE=rocm/vllm-dev:base-$BUILDKITE_COMMIT --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-rocm --target vllm-openai --progress plain -f docker/Dockerfile.rocm ." 
+ - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-rocm" + + - label: "Build and publish nightly multi-arch image to DockerHub" depends_on: - create-multi-arch-manifest if: build.env("NIGHTLY") == "1" agents: - queue: cpu_queue_postmerge + queue: small_cpu_queue_postmerge commands: - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7" - "docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64" @@ -196,3 +272,365 @@ steps: env: DOCKER_BUILDKIT: "1" DOCKERHUB_USERNAME: "vllmbot" + + # ============================================================================= + # ROCm Release Pipeline (x86_64 only) + # ============================================================================= + # + # vLLM version is determined by the Buildkite checkout (like CUDA pipeline). + # To build a specific version, trigger the build from that branch/tag. + # + # Environment variables for ROCm builds (set via Buildkite UI or schedule): + # ROCM_PYTHON_VERSION: Python version (default: 3.12) + # PYTORCH_ROCM_ARCH: GPU architectures (default: gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151) + # ROCM_UPLOAD_WHEELS: Upload to S3 (default: false for nightly, true for releases) + # ROCM_FORCE_REBUILD: Force rebuild base wheels, ignore S3 cache (default: false) + # + # Note: ROCm version is determined by BASE_IMAGE in docker/Dockerfile.rocm_base + # (currently rocm/dev-ubuntu-22.04:7.1-complete) + # + # ============================================================================= + + # ROCm Input Step - Collect build configuration (manual trigger only) + - input: "ROCm Wheel Release Build Configuration" + key: input-rocm-config + depends_on: ~ + if: build.source == "ui" + fields: + - text: "Python Version" + key: "rocm-python-version" + default: "3.12" + hint: "Python version (e.g., 3.12)" + - text: "GPU Architectures" + key: "rocm-pytorch-rocm-arch" + 
default: "gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151" + hint: "Semicolon-separated GPU architectures" + - select: "Upload Wheels to S3" + key: "rocm-upload-wheels" + default: "true" + options: + - label: "No - Build only (nightly/dev)" + value: "false" + - label: "Yes - Upload to S3 (release)" + value: "true" + - select: "Force Rebuild Base Wheels" + key: "rocm-force-rebuild" + default: "false" + hint: "Ignore S3 cache and rebuild base wheels from scratch" + options: + - label: "No - Use cached wheels if available" + value: "false" + - label: "Yes - Rebuild even if cache exists" + value: "true" + + # ROCm Job 1: Build ROCm Base Wheels (with S3 caching) + - label: ":rocm: Build ROCm Base Wheels" + id: build-rocm-base-wheels + depends_on: + - step: input-rocm-config + allow_failure: true # Allow failure so non-UI builds can proceed (input step is skipped) + agents: + queue: cpu_queue_postmerge + commands: + # Set configuration and check cache + - | + set -euo pipefail + + # Get values from meta-data (set by input step) or use defaults + PYTHON_VERSION="$$(buildkite-agent meta-data get rocm-python-version 2>/dev/null || echo '')" + export PYTHON_VERSION="$${PYTHON_VERSION:-3.12}" + + PYTORCH_ROCM_ARCH="$$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo '')" + export PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151}" + + # Check for force rebuild flag + ROCM_FORCE_REBUILD="$${ROCM_FORCE_REBUILD:-}" + if [ -z "$${ROCM_FORCE_REBUILD}" ]; then + ROCM_FORCE_REBUILD="$$(buildkite-agent meta-data get rocm-force-rebuild 2>/dev/null || echo '')" + fi + + echo "========================================" + echo "ROCm Base Wheels Build Configuration" + echo "========================================" + echo " PYTHON_VERSION: $${PYTHON_VERSION}" + echo " PYTORCH_ROCM_ARCH: $${PYTORCH_ROCM_ARCH}" + echo " ROCM_FORCE_REBUILD: $${ROCM_FORCE_REBUILD:-false}" + echo 
"========================================" + + # Save resolved config for later jobs + buildkite-agent meta-data set "rocm-python-version" "$${PYTHON_VERSION}" + buildkite-agent meta-data set "rocm-pytorch-rocm-arch" "$${PYTORCH_ROCM_ARCH}" + + # Check S3 cache for pre-built wheels + CACHE_KEY=$$(.buildkite/scripts/cache-rocm-base-wheels.sh key) + CACHE_PATH=$$(.buildkite/scripts/cache-rocm-base-wheels.sh path) + echo "" + echo "Cache key: $${CACHE_KEY}" + echo "Cache path: $${CACHE_PATH}" + + # Save cache key for downstream jobs + buildkite-agent meta-data set "rocm-cache-key" "$${CACHE_KEY}" + + CACHE_STATUS="miss" + if [ "$${ROCM_FORCE_REBUILD}" != "true" ]; then + CACHE_STATUS=$$(.buildkite/scripts/cache-rocm-base-wheels.sh check) + else + echo "Force rebuild requested, skipping cache check" + fi + + if [ "$${CACHE_STATUS}" = "hit" ]; then + echo "" + echo "CACHE HIT! Downloading pre-built wheels..." + echo "" + .buildkite/scripts/cache-rocm-base-wheels.sh download + + # Set the S3 path for the cached Docker image (for Job 2 to download) + S3_ARTIFACT_PATH="s3://$${S3_BUCKET}/rocm/cache/$${CACHE_KEY}" + buildkite-agent meta-data set "rocm-docker-image-s3-path" "$${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz" + + # Mark that we used cache (for Docker image handling) + buildkite-agent meta-data set "rocm-used-cache" "true" + + echo "" + echo "Cache download complete. Skipping Docker build." + echo "Docker image will be downloaded from: $${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz" + else + echo "" + echo "CACHE MISS. Building from scratch..." 
+ echo "" + + # Build full base image (for later vLLM build) + DOCKER_BUILDKIT=1 docker buildx build \ + --file docker/Dockerfile.rocm_base \ + --tag rocm/vllm-dev:base-$${BUILDKITE_BUILD_NUMBER} \ + --build-arg PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \ + --build-arg PYTHON_VERSION="$${PYTHON_VERSION}" \ + --build-arg USE_SCCACHE=1 \ + --build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \ + --build-arg SCCACHE_REGION_NAME=us-west-2 \ + --build-arg SCCACHE_S3_NO_CREDENTIALS=0 \ + --load \ + . + + # Build debs_wheel_release stage for wheel extraction + DOCKER_BUILDKIT=1 docker buildx build \ + --file docker/Dockerfile.rocm_base \ + --tag rocm-base-debs:$${BUILDKITE_BUILD_NUMBER} \ + --target debs_wheel_release \ + --build-arg PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \ + --build-arg PYTHON_VERSION="$${PYTHON_VERSION}" \ + --build-arg USE_SCCACHE=1 \ + --build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \ + --build-arg SCCACHE_REGION_NAME=us-west-2 \ + --build-arg SCCACHE_S3_NO_CREDENTIALS=0 \ + --load \ + . + + # Extract wheels from Docker image + mkdir -p artifacts/rocm-base-wheels + container_id=$$(docker create rocm-base-debs:$${BUILDKITE_BUILD_NUMBER}) + docker cp $${container_id}:/app/debs/. artifacts/rocm-base-wheels/ + docker rm $${container_id} + echo "Extracted base wheels:" + ls -lh artifacts/rocm-base-wheels/ + + # Upload wheels to S3 cache for future builds + echo "" + echo "Uploading wheels to S3 cache..." 
+ .buildkite/scripts/cache-rocm-base-wheels.sh upload + + # Export base Docker image for reuse in vLLM build + mkdir -p artifacts/rocm-docker-image + docker save rocm/vllm-dev:base-$${BUILDKITE_BUILD_NUMBER} | gzip > artifacts/rocm-docker-image/rocm-base-image.tar.gz + echo "Docker image size:" + ls -lh artifacts/rocm-docker-image/ + + # Upload large Docker image to S3 (also cached by cache key) + S3_ARTIFACT_PATH="s3://$${S3_BUCKET}/rocm/cache/$${CACHE_KEY}" + echo "Uploading Docker image to $${S3_ARTIFACT_PATH}/" + aws s3 cp artifacts/rocm-docker-image/rocm-base-image.tar.gz "$${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz" + + # Save the S3 path for downstream jobs + buildkite-agent meta-data set "rocm-docker-image-s3-path" "$${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz" + + # Mark that we did NOT use cache + buildkite-agent meta-data set "rocm-used-cache" "false" + + echo "" + echo "Build complete. Wheels cached for future builds." + fi + artifact_paths: + - "artifacts/rocm-base-wheels/*.whl" + env: + DOCKER_BUILDKIT: "1" + S3_BUCKET: "vllm-wheels" + + # ROCm Job 2: Build vLLM ROCm Wheel + - label: ":python: Build vLLM ROCm Wheel" + id: build-rocm-vllm-wheel + depends_on: + - step: build-rocm-base-wheels + allow_failure: false + agents: + queue: cpu_queue_postmerge + timeout_in_minutes: 180 + commands: + # Download artifacts and prepare Docker image + - | + set -euo pipefail + + # Ensure git tags are up-to-date (Buildkite's default fetch doesn't update tags) + # This fixes version detection when tags are moved/force-pushed + echo "Fetching latest tags from origin..." 
+ git fetch --tags --force origin + + # Log tag information for debugging version detection + echo "========================================" + echo "Git Tag Verification" + echo "========================================" + echo "Current HEAD: $(git rev-parse HEAD)" + echo "git describe --tags: $(git describe --tags 2>/dev/null || echo 'No tags found')" + echo "" + echo "Recent tags (pointing to commits near HEAD):" + git tag -l --sort=-creatordate | head -5 + echo "setuptools_scm version detection:" + pip install -q setuptools_scm 2>/dev/null || true + python3 -c "import setuptools_scm; print(' Detected version:', setuptools_scm.get_version())" 2>/dev/null || echo " (setuptools_scm not available in this environment)" + echo "========================================" + + # Download wheel artifacts from current build + echo "Downloading wheel artifacts from current build" + buildkite-agent artifact download "artifacts/rocm-base-wheels/*.whl" . + + # Download Docker image from S3 (too large for Buildkite artifacts) + DOCKER_IMAGE_S3_PATH="$$(buildkite-agent meta-data get rocm-docker-image-s3-path 2>/dev/null || echo '')" + if [ -z "$${DOCKER_IMAGE_S3_PATH}" ]; then + echo "ERROR: rocm-docker-image-s3-path metadata not found" + echo "This should have been set by the build-rocm-base-wheels job" + exit 1 + fi + echo "Downloading Docker image from $${DOCKER_IMAGE_S3_PATH}" + mkdir -p artifacts/rocm-docker-image + aws s3 cp "$${DOCKER_IMAGE_S3_PATH}" artifacts/rocm-docker-image/rocm-base-image.tar.gz + + # Load base Docker image and capture the tag + echo "Loading base Docker image..." 
+ LOAD_OUTPUT=$$(gunzip -c artifacts/rocm-docker-image/rocm-base-image.tar.gz | docker load) + echo "$${LOAD_OUTPUT}" + # Extract the actual loaded image tag from "Loaded image: " output + # This avoids picking up stale images (like rocm/vllm-dev:nightly) already on the agent + BASE_IMAGE_TAG=$$(echo "$${LOAD_OUTPUT}" | grep "Loaded image:" | sed 's/Loaded image: //') + if [ -z "$${BASE_IMAGE_TAG}" ]; then + echo "ERROR: Failed to extract image tag from docker load output" + echo "Load output was: $${LOAD_OUTPUT}" + exit 1 + fi + echo "Loaded base image: $${BASE_IMAGE_TAG}" + + # Prepare base wheels for Docker build context + mkdir -p docker/context/base-wheels + touch docker/context/base-wheels/.keep + cp artifacts/rocm-base-wheels/*.whl docker/context/base-wheels/ + echo "Base wheels for vLLM build:" + ls -lh docker/context/base-wheels/ + + # Get GPU architectures from meta-data + PYTORCH_ROCM_ARCH="$$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo '')" + PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151}" + + echo "========================================" + echo "Building vLLM wheel with:" + echo " BUILDKITE_COMMIT: $${BUILDKITE_COMMIT}" + echo " BUILDKITE_BRANCH: $${BUILDKITE_BRANCH}" + echo " PYTORCH_ROCM_ARCH: $${PYTORCH_ROCM_ARCH}" + echo " BASE_IMAGE: $${BASE_IMAGE_TAG}" + echo "========================================" + + # Build vLLM wheel using local checkout (REMOTE_VLLM=0) + DOCKER_BUILDKIT=1 docker build \ + --file docker/Dockerfile.rocm \ + --target export_vllm_wheel_release \ + --output type=local,dest=rocm-dist \ + --build-arg BASE_IMAGE="$${BASE_IMAGE_TAG}" \ + --build-arg ARG_PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \ + --build-arg REMOTE_VLLM=0 \ + --build-arg GIT_REPO_CHECK=1 \ + --build-arg USE_SCCACHE=1 \ + --build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \ + --build-arg SCCACHE_REGION_NAME=us-west-2 \ + --build-arg SCCACHE_S3_NO_CREDENTIALS=0 \ + . 
+ + echo "Built vLLM wheel:" + ls -lh rocm-dist/*.whl + + # Copy wheel to artifacts directory + mkdir -p artifacts/rocm-vllm-wheel + cp rocm-dist/*.whl artifacts/rocm-vllm-wheel/ + echo "Final vLLM wheel:" + ls -lh artifacts/rocm-vllm-wheel/ + artifact_paths: + - "artifacts/rocm-vllm-wheel/*.whl" + env: + DOCKER_BUILDKIT: "1" + S3_BUCKET: "vllm-wheels" + + # ROCm Job 3: Upload Wheels to S3 + - label: ":s3: Upload ROCm Wheels to S3" + id: upload-rocm-wheels + depends_on: + - step: build-rocm-vllm-wheel + allow_failure: false + agents: + queue: cpu_queue_postmerge + timeout_in_minutes: 60 + commands: + # Download all wheel artifacts and run upload + - | + set -euo pipefail + + # Check if upload is enabled (from env var, meta-data, or release branch) + ROCM_UPLOAD_WHEELS="$${ROCM_UPLOAD_WHEELS:-}" + if [ -z "$${ROCM_UPLOAD_WHEELS}" ]; then + # Try to get from meta-data (input form) + ROCM_UPLOAD_WHEELS="$$(buildkite-agent meta-data get rocm-upload-wheels 2>/dev/null || echo '')" + fi + + echo "========================================" + echo "Upload check:" + echo " ROCM_UPLOAD_WHEELS: $${ROCM_UPLOAD_WHEELS}" + echo " BUILDKITE_BRANCH: $${BUILDKITE_BRANCH}" + echo "========================================" + + # Skip upload if not enabled + if [ "$${ROCM_UPLOAD_WHEELS}" != "true" ]; then + echo "Skipping S3 upload (ROCM_UPLOAD_WHEELS != true, NIGHTLY != 1, not a release branch)" + echo "To enable upload, set 'Upload Wheels to S3' to 'Yes' in the build configuration" + exit 0 + fi + + echo "Upload enabled, proceeding..." + + # Download artifacts from current build + echo "Downloading artifacts from current build" + buildkite-agent artifact download "artifacts/rocm-base-wheels/*.whl" . + buildkite-agent artifact download "artifacts/rocm-vllm-wheel/*.whl" . 
+ + # Run upload script + bash .buildkite/scripts/upload-rocm-wheels.sh + env: + DOCKER_BUILDKIT: "1" + S3_BUCKET: "vllm-wheels" + + # ROCm Job 4: Annotate ROCm Wheel Release + - label: ":memo: Annotate ROCm wheel release" + id: annotate-rocm-release + depends_on: + - step: upload-rocm-wheels + allow_failure: true + agents: + queue: cpu_queue_postmerge + commands: + - "bash .buildkite/scripts/annotate-rocm-release.sh" + env: + S3_BUCKET: "vllm-wheels" diff --git a/.buildkite/scripts/annotate-release.sh b/.buildkite/scripts/annotate-release.sh index df805e085080626ec9bc7daf9fbe8dfee231df7a..d178fb88841e26f4d7c6f770a3686f0bbf8de33b 100755 --- a/.buildkite/scripts/annotate-release.sh +++ b/.buildkite/scripts/annotate-release.sh @@ -32,6 +32,7 @@ To download and upload the image: \`\`\` docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64 +docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64 docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64 @@ -45,6 +46,12 @@ docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64 docker push vllm/vllm-openai:latest-aarch64 docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64 +docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai:rocm +docker tag vllm/vllm-openai:rocm vllm/vllm-openai:latest-rocm +docker tag vllm/vllm-openai:rocm vllm/vllm-openai:v${RELEASE_VERSION}-rocm +docker push vllm/vllm-openai:latest-rocm +docker push vllm/vllm-openai:v${RELEASE_VERSION}-rocm + docker manifest rm vllm/vllm-openai:latest docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64 docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} 
vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64 diff --git a/.buildkite/scripts/annotate-rocm-release.sh b/.buildkite/scripts/annotate-rocm-release.sh new file mode 100755 index 0000000000000000000000000000000000000000..fcc7c290ec043aa1daf4445ba116bb75e88d8bb5 --- /dev/null +++ b/.buildkite/scripts/annotate-rocm-release.sh @@ -0,0 +1,74 @@ +#!/bin/bash +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# +# Generate Buildkite annotation for ROCm wheel release + +set -ex + +# Get build configuration from meta-data +# Extract ROCm version dynamically from Dockerfile.rocm_base +# BASE_IMAGE format: rocm/dev-ubuntu-22.04:7.1-complete -> extracts "7.1" +ROCM_VERSION=$(grep -E '^ARG BASE_IMAGE=' docker/Dockerfile.rocm_base | sed -E 's/.*:([0-9]+\.[0-9]+).*/\1/' || echo "unknown") +PYTHON_VERSION=$(buildkite-agent meta-data get rocm-python-version 2>/dev/null || echo "3.12") +PYTORCH_ROCM_ARCH=$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo "gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151") + +# S3 URLs +S3_BUCKET="${S3_BUCKET:-vllm-wheels}" +S3_REGION="${AWS_DEFAULT_REGION:-us-west-2}" +S3_URL="https://${S3_BUCKET}.s3.${S3_REGION}.amazonaws.com" +ROCM_PATH="rocm/${BUILDKITE_COMMIT}" + +buildkite-agent annotate --style 'success' --context 'rocm-release-workflow' << EOF +## :rocm: ROCm Wheel Release + +### Build Configuration +| Setting | Value | +|---------|-------| +| **ROCm Version** | ${ROCM_VERSION} | +| **Python Version** | ${PYTHON_VERSION} | +| **GPU Architectures** | ${PYTORCH_ROCM_ARCH} | +| **Branch** | \`${BUILDKITE_BRANCH}\` | +| **Commit** | \`${BUILDKITE_COMMIT}\` | + +### :package: Installation + +**Install from this build (by commit):** +\`\`\`bash +uv pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/{rocm_variant}/ + +# Example: +uv pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/rocm700/ 
+\`\`\` + +**Install from nightly (if published):** +\`\`\`bash +uv pip install vllm --extra-index-url ${S3_URL}/rocm/nightly/ +\`\`\` + +### :floppy_disk: Download Wheels Directly + +\`\`\`bash +# List all ROCm wheels +aws s3 ls s3://${S3_BUCKET}/${ROCM_PATH}/ + +# Download specific wheels +aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/vllm-*.whl . +aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/torch-*.whl . +aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/triton_rocm-*.whl . +aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/torchvision-*.whl . +aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/amdsmi-*.whl . +\`\`\` + +### :gear: Included Packages +- **vllm**: vLLM with ROCm support +- **torch**: PyTorch built for ROCm ${ROCM_VERSION} +- **triton_rocm**: Triton built for ROCm +- **torchvision**: TorchVision for ROCm PyTorch +- **amdsmi**: AMD SMI Python bindings + +### :warning: Notes +- These wheels are built for **ROCm ${ROCM_VERSION}** and will NOT work with CUDA GPUs +- Supported GPU architectures: ${PYTORCH_ROCM_ARCH} +- Platform: Linux x86_64 only +EOF diff --git a/.buildkite/scripts/cache-rocm-base-wheels.sh b/.buildkite/scripts/cache-rocm-base-wheels.sh new file mode 100755 index 0000000000000000000000000000000000000000..be244725023da4640d41ffb9c11b3a6588e7a8a2 --- /dev/null +++ b/.buildkite/scripts/cache-rocm-base-wheels.sh @@ -0,0 +1,140 @@ +#!/usr/bin/env bash +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# +# Cache helper for ROCm base wheels +# +# This script manages caching of pre-built ROCm base wheels (torch, triton, etc.) +# to avoid rebuilding them when Dockerfile.rocm_base hasn't changed. 
+# +# Usage: +# cache-rocm-base-wheels.sh check - Check if cache exists, outputs "hit" or "miss" +# cache-rocm-base-wheels.sh upload - Upload wheels to cache +# cache-rocm-base-wheels.sh download - Download wheels from cache +# cache-rocm-base-wheels.sh key - Output the cache key +# +# Environment variables: +# S3_BUCKET - S3 bucket name (default: vllm-wheels) +# PYTHON_VERSION - Python version (affects cache key) +# PYTORCH_ROCM_ARCH - GPU architectures (affects cache key) +# +# Note: ROCm version is determined by BASE_IMAGE in Dockerfile.rocm_base, +# so changes to ROCm version are captured by the Dockerfile hash. + +set -euo pipefail + +BUCKET="${S3_BUCKET:-vllm-wheels}" +DOCKERFILE="docker/Dockerfile.rocm_base" +CACHE_PREFIX="rocm/cache" + +# Generate hash from Dockerfile content + build args +generate_cache_key() { + # Include Dockerfile content + if [[ ! -f "$DOCKERFILE" ]]; then + echo "ERROR: Dockerfile not found: $DOCKERFILE" >&2 + exit 1 + fi + local dockerfile_hash=$(sha256sum "$DOCKERFILE" | cut -c1-16) + + # Include key build args that affect the output + # These should match the ARGs in Dockerfile.rocm_base that change the build output + # Note: ROCm version is determined by BASE_IMAGE in the Dockerfile, so it's captured by dockerfile_hash + local args_string="${PYTHON_VERSION:-}|${PYTORCH_ROCM_ARCH:-}" + local args_hash=$(echo "$args_string" | sha256sum | cut -c1-8) + + echo "${dockerfile_hash}-${args_hash}" +} + +CACHE_KEY=$(generate_cache_key) +CACHE_PATH="s3://${BUCKET}/${CACHE_PREFIX}/${CACHE_KEY}/" + +case "${1:-}" in + check) + echo "Checking cache for key: ${CACHE_KEY}" >&2 + echo "Cache path: ${CACHE_PATH}" >&2 + echo "Variables used in cache key:" >&2 + echo " PYTHON_VERSION: ${PYTHON_VERSION:-}" >&2 + echo " PYTORCH_ROCM_ARCH: ${PYTORCH_ROCM_ARCH:-}" >&2 + + # Check if cache exists by listing objects + # We look for at least one .whl file + echo "Running: aws s3 ls ${CACHE_PATH}" >&2 + S3_OUTPUT=$(aws s3 ls "${CACHE_PATH}" 2>&1) || true + 
echo "S3 ls output:" >&2 + echo "$S3_OUTPUT" | head -5 >&2 + + if echo "$S3_OUTPUT" | grep -q "\.whl"; then + echo "hit" + else + echo "miss" + fi + ;; + + upload) + echo "========================================" + echo "Uploading wheels to cache" + echo "========================================" + echo "Cache key: ${CACHE_KEY}" + echo "Cache path: ${CACHE_PATH}" + echo "" + + if [[ ! -d "artifacts/rocm-base-wheels" ]]; then + echo "ERROR: artifacts/rocm-base-wheels directory not found" >&2 + exit 1 + fi + + WHEEL_COUNT=$(ls artifacts/rocm-base-wheels/*.whl 2>/dev/null | wc -l) + if [[ "$WHEEL_COUNT" -eq 0 ]]; then + echo "ERROR: No wheels found in artifacts/rocm-base-wheels/" >&2 + exit 1 + fi + + echo "Uploading $WHEEL_COUNT wheels..." + aws s3 cp --recursive artifacts/rocm-base-wheels/ "${CACHE_PATH}" + + echo "" + echo "Cache upload complete!" + echo "========================================" + ;; + + download) + echo "========================================" + echo "Downloading wheels from cache" + echo "========================================" + echo "Cache key: ${CACHE_KEY}" + echo "Cache path: ${CACHE_PATH}" + echo "" + + mkdir -p artifacts/rocm-base-wheels + aws s3 cp --recursive "${CACHE_PATH}" artifacts/rocm-base-wheels/ + + echo "" + echo "Downloaded wheels:" + ls -lh artifacts/rocm-base-wheels/ + + WHEEL_COUNT=$(ls artifacts/rocm-base-wheels/*.whl 2>/dev/null | wc -l) + echo "" + echo "Total: $WHEEL_COUNT wheels" + echo "========================================" + ;; + + key) + echo "${CACHE_KEY}" + ;; + + path) + echo "${CACHE_PATH}" + ;; + + *) + echo "Usage: $0 {check|upload|download|key|path}" >&2 + echo "" >&2 + echo "Commands:" >&2 + echo " check - Check if cache exists, outputs 'hit' or 'miss'" >&2 + echo " upload - Upload wheels from artifacts/rocm-base-wheels/ to cache" >&2 + echo " download - Download wheels from cache to artifacts/rocm-base-wheels/" >&2 + echo " key - Output the cache key" >&2 + echo " path - Output the full S3 cache 
path" >&2 + exit 1 + ;; +esac diff --git a/.buildkite/scripts/generate-nightly-index.py b/.buildkite/scripts/generate-nightly-index.py index d0965fbd56405ac50b470bbe2ed2ed922b1597d6..2eb4211402cc1e5cfb96cf1363181e3fa8a6adfe 100644 --- a/.buildkite/scripts/generate-nightly-index.py +++ b/.buildkite/scripts/generate-nightly-index.py @@ -16,6 +16,18 @@ from urllib.parse import quote import regex as re + +def normalize_package_name(name: str) -> str: + """ + Normalize package name according to PEP 503. + https://peps.python.org/pep-0503/#normalized-names + + Replace runs of underscores, hyphens, and periods with a single hyphen, + and lowercase the result. + """ + return re.sub(r"[-_.]+", "-", name).lower() + + if not sys.version_info >= (3, 12): raise RuntimeError("This script requires Python 3.12 or higher.") @@ -78,7 +90,13 @@ def parse_from_filename(file: str) -> WheelFileInfo: version = version.removesuffix("." + variant) else: if "+" in version: - version, variant = version.split("+") + version_part, suffix = version.split("+", 1) + # Only treat known patterns as variants (rocmXXX, cuXXX, cpu) + # Git hashes and other suffixes are NOT variants + if suffix.startswith(("rocm", "cu", "cpu")): + variant = suffix + version = version_part + # Otherwise keep the full version string (variant stays None) return WheelFileInfo( package_name=package_name, @@ -206,6 +224,26 @@ def generate_index_and_metadata( print("No wheel files found, skipping index generation.") return + # For ROCm builds: inherit variant from vllm wheel + # All ROCm wheels should share the same variant as vllm + rocm_variant = None + for file in parsed_files: + if ( + file.package_name == "vllm" + and file.variant + and file.variant.startswith("rocm") + ): + rocm_variant = file.variant + print(f"Detected ROCm variant from vllm: {rocm_variant}") + break + + # Apply ROCm variant to all wheels without a variant + if rocm_variant: + for file in parsed_files: + if file.variant is None: + file.variant = 
rocm_variant + print(f"Inherited variant '{rocm_variant}' for {file.filename}") + # Group by variant variant_to_files: dict[str, list[WheelFileInfo]] = {} for file in parsed_files: @@ -256,8 +294,8 @@ def generate_index_and_metadata( variant_dir.mkdir(parents=True, exist_ok=True) - # gather all package names in this variant - packages = set(f.package_name for f in files) + # gather all package names in this variant (normalized per PEP 503) + packages = set(normalize_package_name(f.package_name) for f in files) if variant == "default": # these packages should also appear in the "project list" # generate after all variants are processed @@ -269,8 +307,10 @@ def generate_index_and_metadata( f.write(project_list_str) for package in packages: - # filter files belonging to this package only - package_files = [f for f in files if f.package_name == package] + # filter files belonging to this package only (compare normalized names) + package_files = [ + f for f in files if normalize_package_name(f.package_name) == package + ] package_dir = variant_dir / package package_dir.mkdir(parents=True, exist_ok=True) index_str, metadata_str = generate_package_index_and_metadata( @@ -291,6 +331,7 @@ if __name__ == "__main__": """ Arguments: --version : version string for the current build (e.g., commit hash) + --wheel-dir : directory containing wheel files (default to be same as `version`) --current-objects : path to JSON file containing current S3 objects listing in this version directory --output-dir : directory to store generated index files --alias-to-default : (optional) alias variant name for the default variant @@ -318,6 +359,12 @@ if __name__ == "__main__": required=True, help="Directory to store generated index files", ) + parser.add_argument( + "--wheel-dir", + type=str, + default=None, + help="Directory containing wheel files (default to be same as `version`)", + ) parser.add_argument( "--alias-to-default", type=str, @@ -334,8 +381,13 @@ if __name__ == "__main__": args = 
parser.parse_args() version = args.version - if "/" in version or "\\" in version: - raise ValueError("Version string must not contain slashes.") + # Allow rocm/ prefix, reject other slashes and all backslashes + if "\\" in version: + raise ValueError("Version string must not contain backslashes.") + if "/" in version and not version.startswith("rocm/"): + raise ValueError( + "Version string must not contain slashes (except for 'rocm/' prefix)." + ) current_objects_path = Path(args.current_objects) output_dir = Path(args.output_dir) if not output_dir.exists(): @@ -372,7 +424,7 @@ if __name__ == "__main__": print(f"Found {len(wheel_files)} wheel files for version {version}: {wheel_files}") - # keep only "official" files for a non-nightly version (specifed by cli args) + # keep only "official" files for a non-nightly version (specified by cli args) PY_VERSION_RE = re.compile(r"^\d+\.\d+\.\d+([a-zA-Z0-9.+-]*)?$") if PY_VERSION_RE.match(version): # upload-wheels.sh ensures no "dev" is in args.version @@ -384,9 +436,25 @@ if __name__ == "__main__": print("Nightly version detected, keeping all wheel files.") # Generate index and metadata, assuming wheels and indices are stored as: - # s3://vllm-wheels/{version}/ + # s3://vllm-wheels/{wheel_dir}/ # s3://vllm-wheels// - wheel_base_dir = Path(output_dir).parent / version + # + # For ROCm builds, version is "rocm/{commit}" and indices are uploaded to: + # - rocm/{commit}/ (same as wheels) + # - rocm/nightly/ + # - rocm/{version}/ + # All these are under the "rocm/" prefix, so relative paths should be + # relative to "rocm/", not the bucket root. 
+ if args.wheel_dir: + # Explicit wheel-dir provided (e.g., for version-specific indices pointing to commit dir) + wheel_dir = args.wheel_dir.strip().rstrip("/") + elif version.startswith("rocm/"): + # For rocm/commit, wheel_base_dir should be just the commit part + # so relative path from rocm/0.12.0/rocm710/vllm/ -> ../../../{commit}/ + wheel_dir = version.split("/", 1)[1] + else: + wheel_dir = version + wheel_base_dir = Path(output_dir).parent / wheel_dir index_base_dir = Path(output_dir) generate_index_and_metadata( diff --git a/.buildkite/scripts/hardware_ci/run-amd-test.sh b/.buildkite/scripts/hardware_ci/run-amd-test.sh index 864eb470bb0a7e33cd9067ec912a30a3e2722ee3..484167f4619b37e0680dc8d6da10eb564e3ecc25 100755 --- a/.buildkite/scripts/hardware_ci/run-amd-test.sh +++ b/.buildkite/scripts/hardware_ci/run-amd-test.sh @@ -141,7 +141,6 @@ if [[ $commands == *" entrypoints/openai "* ]]; then --ignore=entrypoints/openai/test_audio.py \ --ignore=entrypoints/openai/test_shutdown.py \ --ignore=entrypoints/openai/test_completion.py \ - --ignore=entrypoints/openai/test_sleep.py \ --ignore=entrypoints/openai/test_models.py \ --ignore=entrypoints/openai/test_lora_adapters.py \ --ignore=entrypoints/openai/test_return_tokens_as_ids.py \ @@ -210,12 +209,21 @@ if [[ $commands == *"--shard-id="* ]]; then wait "${pid}" STATUS+=($?) done + at_least_one_shard_with_tests=0 for st in "${STATUS[@]}"; do - if [[ ${st} -ne 0 ]]; then + if [[ ${st} -ne 0 ]] && [[ ${st} -ne 5 ]]; then echo "One of the processes failed with $st" exit "${st}" + elif [[ ${st} -eq 5 ]]; then + echo "Shard exited with status 5 (no tests collected) - treating as success" + else # This means st is 0 + at_least_one_shard_with_tests=1 fi done + if [[ ${#STATUS[@]} -gt 0 && ${at_least_one_shard_with_tests} -eq 0 ]]; then + echo "All shards reported no tests collected. Failing the build." 
+ exit 1 + fi else echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES" docker run \ diff --git a/.buildkite/scripts/hardware_ci/run-cpu-test.sh b/.buildkite/scripts/hardware_ci/run-cpu-test.sh index 438fe522c8702ec2e552d4a958d816d3808a2891..ee6510bf88e3e108e0c6f8bddc5de81ff62cb919 100644 --- a/.buildkite/scripts/hardware_ci/run-cpu-test.sh +++ b/.buildkite/scripts/hardware_ci/run-cpu-test.sh @@ -50,6 +50,7 @@ function cpu_tests() { docker exec cpu-test-"$NUMA_NODE" bash -c " set -e pytest -x -v -s tests/kernels/attention/test_cpu_attn.py + pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py pytest -x -v -s tests/kernels/test_onednn.py" # Run basic model test @@ -83,7 +84,7 @@ function cpu_tests() { docker exec cpu-test-"$NUMA_NODE" bash -c " set -e pytest -x -s -v \ - tests/lora/test_qwen2vl.py" + tests/lora/test_qwenvl.py" # online serving: tp+pp docker exec cpu-test-"$NUMA_NODE" bash -c ' diff --git a/.buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh b/.buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh index cbb2527a4ff0aa30664f340c61f75833db469e2a..6959f81eab3732043741dab067fa4c6710fbf088 100755 --- a/.buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh +++ b/.buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh @@ -61,7 +61,7 @@ echo "Results will be stored in: $RESULTS_DIR" echo "--- Installing Python dependencies ---" python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \ && python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \ - && python3 -m pip install --progress-bar off "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d" \ + && python3 -m pip install --progress-bar off "lm-eval[api]>=0.4.9.2" \ && python3 -m pip install --progress-bar off hf-transfer tblib==3.1.0 echo "--- Python dependencies installed ---" diff --git a/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh 
b/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh index f022fa3672eeba2774d51e34b928ea25c7d4d6e6..eafc82b98439be027a28b4be8b9fc4899badbf5e 100755 --- a/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh +++ b/.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh @@ -61,7 +61,7 @@ echo "Results will be stored in: $RESULTS_DIR" echo "--- Installing Python dependencies ---" python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \ && python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \ - && python3 -m pip install --progress-bar off "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d" \ + && python3 -m pip install --progress-bar off "lm-eval[api]>=0.4.9.2" \ && python3 -m pip install --progress-bar off hf-transfer tblib==3.1.0 echo "--- Python dependencies installed ---" diff --git a/.buildkite/scripts/hardware_ci/run-xpu-test.sh b/.buildkite/scripts/hardware_ci/run-xpu-test.sh index dfc9db512d1e9d01166f3bee1b8b3f9e2d66847f..85b554e5e86460ad21b4c2072d0e5aa73883d6c2 100644 --- a/.buildkite/scripts/hardware_ci/run-xpu-test.sh +++ b/.buildkite/scripts/hardware_ci/run-xpu-test.sh @@ -39,7 +39,7 @@ docker run \ python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend ray python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend mp python3 examples/offline_inference/basic/generate.py --model Intel/Qwen2.5-0.5B-W4A16-G128-AutoRound-LLMC-TEST-ONLY --enforce-eager - VLLM_ATTENTION_BACKEND=TRITON_ATTN python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager + python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager --attention-backend=TRITON_ATTN cd tests pytest -v -s v1/core pytest -v -s 
v1/engine diff --git a/.buildkite/scripts/run-multi-node-test.sh b/.buildkite/scripts/run-multi-node-test.sh index 49aebce786b9250887a6160f0f98d5581fd34344..c0911f17b660d21ca95e7e606654f0c82f4d8f2d 100755 --- a/.buildkite/scripts/run-multi-node-test.sh +++ b/.buildkite/scripts/run-multi-node-test.sh @@ -2,6 +2,17 @@ set -euox pipefail +# To detect ROCm +# Check multiple indicators: +if [ -e /dev/kfd ] || \ + [ -d /opt/rocm ] || \ + command -v rocm-smi &> /dev/null || \ + [ -n "${ROCM_HOME:-}" ]; then + IS_ROCM=1 +else + IS_ROCM=0 +fi + if [[ $# -lt 4 ]]; then echo "Usage: .buildkite/scripts/run-multi-node-test.sh WORKING_DIR NUM_NODES NUM_GPUS DOCKER_IMAGE COMMAND1 COMMAND2 ... COMMANDN" exit 1 @@ -26,13 +37,18 @@ for command in "${COMMANDS[@]}"; do echo "$command" done + start_network() { docker network create --subnet=192.168.10.0/24 docker-net } start_nodes() { for node in $(seq 0 $(($NUM_NODES-1))); do - GPU_DEVICES='"device=' + if [ "$IS_ROCM" -eq 1 ]; then + GPU_DEVICES='--device /dev/kfd --device /dev/dri -e HIP_VISIBLE_DEVICES=' + else + GPU_DEVICES='--gpus "device=' + fi for node_gpu in $(seq 0 $(($NUM_GPUS - 1))); do DEVICE_NUM=$(($node * $NUM_GPUS + $node_gpu)) GPU_DEVICES+=$(($DEVICE_NUM)) @@ -40,7 +56,9 @@ start_nodes() { GPU_DEVICES+=',' fi done - GPU_DEVICES+='"' + if [ "$IS_ROCM" -eq 0 ]; then + GPU_DEVICES+='"' + fi # start the container in detached mode # things to note: @@ -49,7 +67,7 @@ start_nodes() { # 3. map the huggingface cache directory to the container # 3. 
assign ip addresses to the containers (head node: 192.168.10.10, worker nodes: # starting from 192.168.10.11) - docker run -d --gpus "$GPU_DEVICES" --shm-size=10.24gb -e HF_TOKEN \ + docker run -d $GPU_DEVICES --shm-size=10.24gb -e HF_TOKEN \ -v ~/.cache/huggingface:/root/.cache/huggingface --name "node$node" \ --network docker-net --ip 192.168.10.$((10 + $node)) --rm "$DOCKER_IMAGE" \ /bin/bash -c "tail -f /dev/null" diff --git a/.buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh b/.buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh index 6a1bef275d04705e26675add8078786ac362b2d8..d0921c5699d5d202bd0fed73e3ac0bb14860d4f0 100644 --- a/.buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh +++ b/.buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh @@ -44,10 +44,10 @@ trap cleanup EXIT for BACK in "${BACKENDS[@]}"; do VLLM_DEEP_GEMM_WARMUP=skip \ - VLLM_ALL2ALL_BACKEND=$BACK \ vllm serve "$MODEL" \ --enforce-eager \ --enable-eplb \ + --all2all-backend $BACK \ --eplb-config '{"window_size":10, "step_interval":100, "num_redundant_experts":0, "log_balancedness":true}' \ --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \ --data-parallel-size ${DATA_PARALLEL_SIZE} \ diff --git a/.buildkite/scripts/scheduled_integration_test/qwen3_next_mtp_async_eplb.sh b/.buildkite/scripts/scheduled_integration_test/qwen3_next_mtp_async_eplb.sh index 937a43d1a32214ed97a6437710f28e4fc19c49ce..b3b65128e606244e2a444e0e514dbc73e99ca53f 100644 --- a/.buildkite/scripts/scheduled_integration_test/qwen3_next_mtp_async_eplb.sh +++ b/.buildkite/scripts/scheduled_integration_test/qwen3_next_mtp_async_eplb.sh @@ -43,12 +43,12 @@ trap cleanup EXIT for BACK in "${BACKENDS[@]}"; do VLLM_DEEP_GEMM_WARMUP=skip \ - VLLM_ALL2ALL_BACKEND=$BACK \ vllm serve "$MODEL" \ --enforce-eager \ --tensor-parallel-size 4 \ --enable-expert-parallel \ --enable-eplb \ + --all2all-backend $BACK \ --eplb-config 
'{"window_size":200,"step_interval":600,"use_async":true}' \ --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' \ --trust-remote-code \ diff --git a/.buildkite/scripts/upload-wheels.sh b/.buildkite/scripts/upload-nightly-wheels.sh similarity index 94% rename from .buildkite/scripts/upload-wheels.sh rename to .buildkite/scripts/upload-nightly-wheels.sh index 3a218a4bb2e6daae133b060038859f7a6ddfe2ee..1af7f476ae74b725aa3c969256a49d5ebca0b411 100644 --- a/.buildkite/scripts/upload-wheels.sh +++ b/.buildkite/scripts/upload-nightly-wheels.sh @@ -102,6 +102,7 @@ if [[ "$version" != *"dev"* ]]; then echo "Re-generating indices for /$pure_version/" rm -rf "$INDICES_OUTPUT_DIR/*" mkdir -p "$INDICES_OUTPUT_DIR" - $PYTHON .buildkite/scripts/generate-nightly-index.py --version "$pure_version" --current-objects "$obj_json" --output-dir "$INDICES_OUTPUT_DIR" --comment "version $pure_version" $alias_arg + # wheel-dir is overridden to be the commit directory, so that the indices point to the correct wheel path + $PYTHON .buildkite/scripts/generate-nightly-index.py --version "$pure_version" --wheel-dir "$SUBPATH" --current-objects "$obj_json" --output-dir "$INDICES_OUTPUT_DIR" --comment "version $pure_version" $alias_arg aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/$pure_version/" fi diff --git a/.buildkite/scripts/upload-release-wheels.sh b/.buildkite/scripts/upload-release-wheels.sh new file mode 100644 index 0000000000000000000000000000000000000000..a4b246bf1b8531ba3ad13d83c3f0e7646b668594 --- /dev/null +++ b/.buildkite/scripts/upload-release-wheels.sh @@ -0,0 +1,103 @@ +#!/usr/bin/env bash + +set -e + +BUCKET="vllm-wheels" +SUBPATH=$BUILDKITE_COMMIT +S3_COMMIT_PREFIX="s3://$BUCKET/$SUBPATH/" + +RELEASE_VERSION=$(buildkite-agent meta-data get release-version) +echo "Release version from Buildkite: $RELEASE_VERSION" +GIT_VERSION=$(git describe --exact-match --tags $BUILDKITE_COMMIT 2>/dev/null) +if [ -z "$GIT_VERSION" ]; then + echo 
"[FATAL] Not on a git tag, cannot create release." + exit 1 +else + echo "Git version for commit $BUILDKITE_COMMIT: $GIT_VERSION" +fi +# sanity check for version mismatch +if [ "v$RELEASE_VERSION" != "$GIT_VERSION" ]; then + if [ "$FORCE_RELEASE_IGNORE_VERSION_MISMATCH" == "true" ]; then + echo "[WARNING] Force release and ignore version mismatch" + else + echo "[FATAL] Release version from Buildkite does not match Git version." + exit 1 + fi +fi + +# check pypi token +if [ -z "$PYPI_TOKEN" ]; then + echo "[FATAL] PYPI_TOKEN is not set." + exit 1 +else + export TWINE_USERNAME="__token__" + export TWINE_PASSWORD="$PYPI_TOKEN" +fi + +# check github token +if [ -z "$GITHUB_TOKEN" ]; then + echo "[FATAL] GITHUB_TOKEN is not set." + exit 1 +else + export GH_TOKEN="$GITHUB_TOKEN" +fi + +set -x # avoid printing secrets above + +# download gh CLI from github +# Get latest gh CLI version from GitHub API +GH_VERSION=$(curl -s https://api.github.com/repos/cli/cli/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/' | sed 's/^v//') +if [ -z "$GH_VERSION" ]; then + echo "[FATAL] Failed to get latest gh CLI version from GitHub" + exit 1 +fi +echo "Downloading gh CLI version: $GH_VERSION" +GH_TARBALL="gh_${GH_VERSION}_linux_amd64.tar.gz" +GH_URL="https://github.com/cli/cli/releases/download/v${GH_VERSION}/${GH_TARBALL}" +GH_INSTALL_DIR="/tmp/gh-install" +mkdir -p "$GH_INSTALL_DIR" +pushd "$GH_INSTALL_DIR" +curl -L -o "$GH_TARBALL" "$GH_URL" +tar -xzf "$GH_TARBALL" +GH_BIN=$(realpath $(find . 
-name "gh" -type f -executable | head -n 1)) +if [ -z "$GH_BIN" ]; then + echo "[FATAL] Failed to find gh CLI executable" + exit 1 +fi +echo "gh CLI downloaded successfully, version: $($GH_BIN --version)" +echo "Last 5 releases on GitHub:" # as a sanity check of gh and GH_TOKEN +command "$GH_BIN" release list --limit 5 +popd + +# install twine from pypi +python3 -m venv /tmp/vllm-release-env +source /tmp/vllm-release-env/bin/activate +pip install twine +python3 -m twine --version + +# copy release wheels to local directory +DIST_DIR=/tmp/vllm-release-dist +echo "Existing wheels on S3:" +aws s3 ls "$S3_COMMIT_PREFIX" +echo "Copying wheels to local directory" +mkdir -p $DIST_DIR +# include only wheels for the release version, ignore all files with "dev" or "rc" in the name +aws s3 cp --recursive --exclude "*" --include "vllm-${RELEASE_VERSION}*.whl" --exclude "*dev*" --exclude "*rc*" "$S3_COMMIT_PREFIX" $DIST_DIR +echo "Wheels copied to local directory" +# generate source tarball +git archive --format=tar.gz --output="$DIST_DIR/vllm-${RELEASE_VERSION}.tar.gz" $BUILDKITE_COMMIT +ls -la $DIST_DIR + + +# upload wheels to PyPI (only default variant, i.e. files without '+' in the name) +PYPI_WHEEL_FILES=$(find $DIST_DIR -name "vllm-${RELEASE_VERSION}*.whl" -not -name "*+*") +if [ -z "$PYPI_WHEEL_FILES" ]; then + echo "No default variant wheels found, quitting..." 
+ exit 1 +fi +python3 -m twine check $PYPI_WHEEL_FILES +python3 -m twine --non-interactive --verbose upload $PYPI_WHEEL_FILES +echo "Wheels uploaded to PyPI" + +# create release on GitHub with the release version and all wheels +command "$GH_BIN" release create $GIT_VERSION -d --latest --notes-from-tag --verify-tag $DIST_DIR/*.whl diff --git a/.buildkite/scripts/upload-rocm-wheels.sh b/.buildkite/scripts/upload-rocm-wheels.sh new file mode 100755 index 0000000000000000000000000000000000000000..bb555bc842925c13bd05f7aa44ebb8f4dabbf194 --- /dev/null +++ b/.buildkite/scripts/upload-rocm-wheels.sh @@ -0,0 +1,151 @@ +#!/usr/bin/env bash +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +# +# Upload ROCm wheels to S3 with proper index generation +# +# Required environment variables: +# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY (or IAM role) +# S3_BUCKET (default: vllm-wheels) +# +# S3 path structure: +# s3://vllm-wheels/rocm/{commit}/ - All wheels for this commit +# s3://vllm-wheels/rocm/nightly/ - Index pointing to latest nightly +# s3://vllm-wheels/rocm/{version}/ - Index for release versions + +set -ex + +# ======== Configuration ======== +BUCKET="${S3_BUCKET:-vllm-wheels}" +ROCM_SUBPATH="rocm/${BUILDKITE_COMMIT}" +S3_COMMIT_PREFIX="s3://$BUCKET/$ROCM_SUBPATH/" +INDICES_OUTPUT_DIR="rocm-indices" +PYTHON="${PYTHON_PROG:-python3}" + +# ROCm uses manylinux_2_35 (Ubuntu 22.04 based) +MANYLINUX_VERSION="manylinux_2_35" + +echo "========================================" +echo "ROCm Wheel Upload Configuration" +echo "========================================" +echo "S3 Bucket: $BUCKET" +echo "S3 Path: $ROCM_SUBPATH" +echo "Commit: $BUILDKITE_COMMIT" +echo "Branch: $BUILDKITE_BRANCH" +echo "========================================" + +# ======== Part 0: Setup Python ======== + +# Detect if python3.12+ is available +has_new_python=$($PYTHON -c "print(1 if __import__('sys').version_info >= (3,12) else 0)" 2>/dev/null || 
echo 0) +if [[ "$has_new_python" -eq 0 ]]; then + # Use new python from docker + # Use --user to ensure files are created with correct ownership (not root) + docker pull python:3-slim + PYTHON="docker run --rm --user $(id -u):$(id -g) -v $(pwd):/app -w /app python:3-slim python3" +fi + +echo "Using python interpreter: $PYTHON" +echo "Python version: $($PYTHON --version)" + +# ======== Part 1: Collect and prepare wheels ======== + +# Collect all wheels +mkdir -p all-rocm-wheels +cp artifacts/rocm-base-wheels/*.whl all-rocm-wheels/ 2>/dev/null || true +cp artifacts/rocm-vllm-wheel/*.whl all-rocm-wheels/ 2>/dev/null || true + +WHEEL_COUNT=$(ls all-rocm-wheels/*.whl 2>/dev/null | wc -l) +echo "Total wheels to upload: $WHEEL_COUNT" + +if [ "$WHEEL_COUNT" -eq 0 ]; then + echo "ERROR: No wheels found to upload!" + exit 1 +fi + +# Rename linux to manylinux in wheel filenames +for wheel in all-rocm-wheels/*.whl; do + if [[ "$wheel" == *"linux"* ]] && [[ "$wheel" != *"manylinux"* ]]; then + new_wheel="${wheel/linux/$MANYLINUX_VERSION}" + mv -- "$wheel" "$new_wheel" + echo "Renamed: $(basename "$wheel") -> $(basename "$new_wheel")" + fi +done + +echo "" +echo "Wheels to upload:" +ls -lh all-rocm-wheels/ + +# ======== Part 2: Upload wheels to S3 ======== + +echo "" +echo "Uploading wheels to $S3_COMMIT_PREFIX" +for wheel in all-rocm-wheels/*.whl; do + aws s3 cp "$wheel" "$S3_COMMIT_PREFIX" +done + +# ======== Part 3: Generate and upload indices ======== + +# List existing wheels in commit directory +echo "" +echo "Generating indices..." 
+obj_json="rocm-objects.json" +aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$ROCM_SUBPATH/" --delimiter / --output json > "$obj_json" + +mkdir -p "$INDICES_OUTPUT_DIR" + +# Use the existing generate-nightly-index.py +# HACK: Replace regex module with stdlib re (same as CUDA script) +sed -i 's/import regex as re/import re/g' .buildkite/scripts/generate-nightly-index.py + +$PYTHON .buildkite/scripts/generate-nightly-index.py \ + --version "$ROCM_SUBPATH" \ + --current-objects "$obj_json" \ + --output-dir "$INDICES_OUTPUT_DIR" \ + --comment "ROCm commit $BUILDKITE_COMMIT" + +# Upload indices to commit directory +echo "Uploading indices to $S3_COMMIT_PREFIX" +aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "$S3_COMMIT_PREFIX" + +# Update rocm/nightly/ if on main branch and not a PR +if [[ "$BUILDKITE_BRANCH" == "main" && "$BUILDKITE_PULL_REQUEST" == "false" ]] || [[ "$NIGHTLY" == "1" ]]; then + echo "Updating rocm/nightly/ index..." + aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/rocm/nightly/" +fi + +# Extract version from vLLM wheel and update version-specific index +VLLM_WHEEL=$(ls all-rocm-wheels/vllm*.whl 2>/dev/null | head -1) +if [ -n "$VLLM_WHEEL" ]; then + VERSION=$(unzip -p "$VLLM_WHEEL" '**/METADATA' | grep '^Version: ' | cut -d' ' -f2) + echo "Version in wheel: $VERSION" + PURE_VERSION="${VERSION%%+*}" + PURE_VERSION="${PURE_VERSION%%.rocm}" + echo "Pure version: $PURE_VERSION" + + if [[ "$VERSION" != *"dev"* ]]; then + echo "Updating rocm/$PURE_VERSION/ index..." + aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/rocm/$PURE_VERSION/" + fi +fi + +# ======== Part 4: Summary ======== + +echo "" +echo "========================================" +echo "ROCm Wheel Upload Complete!" 
+echo "========================================" +echo "" +echo "Wheels available at:" +echo " s3://$BUCKET/$ROCM_SUBPATH/" +echo "" +echo "Install command (by commit):" +echo " pip install vllm --extra-index-url https://${BUCKET}.s3.amazonaws.com/$ROCM_SUBPATH/" +echo "" +if [[ "$BUILDKITE_BRANCH" == "main" ]] || [[ "$NIGHTLY" == "1" ]]; then + echo "Install command (nightly):" + echo " pip install vllm --extra-index-url https://${BUCKET}.s3.amazonaws.com/rocm/nightly/" +fi +echo "" +echo "Wheel count: $WHEEL_COUNT" +echo "========================================" diff --git a/.buildkite/test-amd.yaml b/.buildkite/test-amd.yaml index 3c9b8cbedcf065c1af6a686c72082ecb19138d95..044a82c9773f0e3bb7dfea936f1861e313af5d4f 100644 --- a/.buildkite/test-amd.yaml +++ b/.buildkite/test-amd.yaml @@ -128,7 +128,7 @@ steps: - tests/entrypoints/ commands: - pytest -v -s entrypoints/openai/tool_parsers - - pytest -v -s entrypoints/ --ignore=entrypoints/llm --ignore=entrypoints/openai --ignore=entrypoints/offline_mode --ignore=entrypoints/test_chat_utils.py --ignore=entrypoints/pooling + - pytest -v -s entrypoints/ --ignore=entrypoints/llm --ignore=entrypoints/openai --ignore=entrypoints/rpc --ignore=entrypoints/sleep --ignore=entrypoints/instrumentator --ignore=entrypoints/offline_mode --ignore=entrypoints/test_chat_utils.py --ignore=entrypoints/pooling - label: Entrypoints Integration Test (LLM) # 30min timeout_in_minutes: 40 @@ -148,7 +148,7 @@ steps: - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process - pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests -- label: Entrypoints Integration Test (API Server) # 100min +- label: Entrypoints Integration Test (API Server 1) # 100min timeout_in_minutes: 130 mirror_hardwares: [amdexperimental] agent_pool: mi325_1 @@ -162,10 +162,28 @@ steps: - tests/entrypoints/test_chat_utils commands: - export VLLM_WORKER_MULTIPROC_METHOD=spawn - - PYTHONPATH=/vllm-workspace pytest -v -s 
entrypoints/openai/test_collective_rpc.py # PYTHONPATH is needed to import custom Worker extension - - pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/ + - pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses - pytest -v -s entrypoints/test_chat_utils.py +- label: Entrypoints Integration Test (API Server 2) + timeout_in_minutes: 50 + mirror_hardwares: [amdexperimental] + agent_pool: mi325_1 + # grade: Blocking + working_dir: "/vllm-workspace/tests" + fast_check: true + torch_nightly: true + source_file_dependencies: + - vllm/ + - tests/entrypoints/sleep + - tests/entrypoints/rpc + - tests/tool_use + commands: + - export VLLM_WORKER_MULTIPROC_METHOD=spawn + - pytest -v -s entrypoints/sleep + - pytest -v -s tool_use + - PYTHONPATH=/vllm-workspace pytest -v -s entrypoints/rpc + - label: Entrypoints Integration Test (Pooling) timeout_in_minutes: 50 mirror_hardwares: [amdexperimental] @@ -181,6 +199,21 @@ steps: - export VLLM_WORKER_MULTIPROC_METHOD=spawn - pytest -v -s entrypoints/pooling +- label: Entrypoints Integration Test (Responses API) + timeout_in_minutes: 50 + mirror_hardwares: [amdexperimental] + agent_pool: mi325_1 + # grade: Blocking + working_dir: "/vllm-workspace/tests" + fast_check: true + torch_nightly: true + source_file_dependencies: + - vllm/ + - tests/entrypoints/openai/responses + commands: + - export VLLM_WORKER_MULTIPROC_METHOD=spawn + - pytest -v -s entrypoints/openai/responses + - label: 
Distributed Tests (4 GPUs) # 35min timeout_in_minutes: 50 mirror_hardwares: [amdexperimental] @@ -201,6 +234,9 @@ steps: - tests/v1/engine/test_engine_core_client.py - tests/distributed/test_symm_mem_allreduce.py commands: + # Work around HIP bug tracked here: https://github.com/ROCm/hip/issues/3876 + # TODO: Remove when the bug is fixed in a future ROCm release + - export TORCH_NCCL_BLOCKING_WAIT=1 # test with torchrun tp=2 and external_dp=2 - torchrun --nproc-per-node=4 distributed/test_torchrun_example.py # test with torchrun tp=2 and pp=2 @@ -249,9 +285,10 @@ steps: - vllm/v1/executor/uniproc_executor.py - vllm/v1/worker/gpu_worker.py commands: - # https://github.com/NVIDIA/nccl/issues/1838 - #- export NCCL_CUMEM_HOST_ENABLE=0 # test with torchrun tp=2 and dp=4 with ep + # Work around HIP bug tracked here: https://github.com/ROCm/hip/issues/3876 + # TODO: Remove when the bug is fixed in a future ROCm release + - export TORCH_NCCL_BLOCKING_WAIT=1 - torchrun --nproc-per-node=8 ../examples/offline_inference/torchrun_dp_example.py --tp-size=2 --pp-size=1 --dp-size=4 --enable-ep - label: EPLB Algorithm Test # 5min @@ -331,7 +368,9 @@ steps: - label: V1 Test e2e + engine # 65min timeout_in_minutes: 90 mirror_hardwares: [amdexperimental] - agent_pool: mi325_4 + # The test uses 4 GPUs, but we schedule it on 8-GPU machines for stability. 
+ # See discussion here: https://github.com/vllm-project/vllm/pull/31040 + agent_pool: mi325_8 # grade: Blocking source_file_dependencies: - vllm/ @@ -492,8 +531,7 @@ steps: - tests/samplers - tests/conftest.py commands: - - pytest -v -s samplers - - VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers + - pytest -v -s -m 'not skip_v1' samplers - label: LoRA Test %N # 20min each timeout_in_minutes: 30 @@ -707,7 +745,7 @@ steps: - label: Quantization Test # 70min timeout_in_minutes: 90 - mirror_hardwares: [amdexperimental] + mirror_hardwares: [amdexperimental, amdproduction] agent_pool: mi325_1 # grade: Blocking source_file_dependencies: @@ -722,7 +760,7 @@ steps: # https://github.com/pytorch/ao/issues/2919, we'll have to skip new torchao tests for now # we can only upgrade after this is resolved # TODO(jerryzh168): resolve the above comment - - uv pip install --system torchao==0.13.0 + - uv pip install --system torchao==0.14.1 - uv pip install --system conch-triton-kernels - VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py @@ -736,7 +774,7 @@ steps: - vllm/model_executor/layers/quantization autorun_on_main: true commands: - - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1 + - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt - label: OpenAI API correctness # 10min timeout_in_minutes: 15 @@ -747,21 +785,11 @@ steps: - csrc/ - vllm/entrypoints/openai/ - vllm/model_executor/models/whisper.py + - tools/ commands: # LMEval+Transcription WER check - # Transcription WER check is skipped because encoder-decoder models are not supported on ROCm, see https://github.com/vllm-project/vllm/issues/27442 + - bash ../tools/install_torchcodec_rocm.sh || exit 1 - pytest -s entrypoints/openai/correctness/ -- label: OpenAI-Compatible Tool Use # 23 min - timeout_in_minutes: 35 - mirror_hardwares: [amdexperimental, amdproduction] 
- agent_pool: mi325_1 - # grade: Blocking - fast_check: false - source_file_dependencies: - - vllm/ - - tests/tool_use - commands: - - pytest -v -s tool_use ##### models test ##### @@ -854,6 +882,7 @@ steps: # Shard slow subset of standard language models tests. Only run when model # source is modified, or when specified test files are modified - pip freeze | grep -E 'torch' + - export TORCH_NCCL_BLOCKING_WAIT=1 - pytest -v -s models/language -m 'core_model and slow_test' \ --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \ --shard-id=$$BUILDKITE_PARALLEL_JOB @@ -871,7 +900,7 @@ steps: commands: # Install fast path packages for testing against transformers # Note: also needed to run plamo2 model in vLLM - - uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5' + - uv pip install --system --no-build-isolation 'git+https://github.com/AndreasKaratzas/mamba@fix-rocm-7.0-warp-size-constexpr' - uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2' # Shard hybrid language model tests - pytest -v -s models/language/generation \ @@ -892,7 +921,7 @@ steps: commands: # Install fast path packages for testing against transformers # Note: also needed to run plamo2 model in vLLM - - uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5' + - uv pip install --system --no-build-isolation 'git+https://github.com/AndreasKaratzas/mamba@fix-rocm-7.0-warp-size-constexpr' - uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2' - pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)' @@ -957,7 +986,7 @@ steps: - pytest -v -s models/multimodal/processing - label: Multi-Modal Models Test (Standard) # 60min - timeout_in_minutes: 80 + timeout_in_minutes: 100 mirror_hardwares: [amdexperimental] agent_pool: mi325_1 # grade: Blocking @@ -966,13 +995,16 @@ steps: - vllm/ - tests/models/multimodal 
commands: + - export MIOPEN_DEBUG_CONV_DIRECT=0 + - export MIOPEN_DEBUG_CONV_GEMM=0 - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git - pip freeze | grep -E 'torch' - - pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing + - pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing --ignore models/multimodal/pooling/test_prithvi_mae.py + - pytest -v -s models/multimodal/pooling/test_prithvi_mae.py -m core_model - cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work -- label: Multi-Modal Accuracy Eval (Small Models) # 150min - 180min - timeout_in_minutes: 180 +- label: Multi-Modal Accuracy Eval (Small Models) # 5min + timeout_in_minutes: 10 mirror_hardwares: [amdexperimental, amdproduction] agent_pool: mi325_1 # grade: Blocking @@ -982,7 +1014,9 @@ steps: - vllm/inputs/ - vllm/v1/core/ commands: - - pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt --tp-size=1 + - export MIOPEN_DEBUG_CONV_DIRECT=0 + - export MIOPEN_DEBUG_CONV_GEMM=0 + - pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt - label: Multi-Modal Models Test (Extended) 1 # 60min timeout_in_minutes: 120 @@ -994,10 +1028,13 @@ steps: - vllm/ - tests/models/multimodal commands: + - export MIOPEN_DEBUG_CONV_DIRECT=0 + - export MIOPEN_DEBUG_CONV_GEMM=0 - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git - pytest -v -s models/multimodal -m 'not core_model' --ignore models/multimodal/generation/test_common.py --ignore models/multimodal/processing -- label: Multi-Modal Models Test (Extended) 2 +- label: Multi-Modal Models Test (Extended) 2 #60min + timeout_in_minutes: 120 mirror_hardwares: [amdexperimental] agent_pool: mi325_1 # grade: Blocking @@ 
-1006,6 +1043,8 @@ steps: - vllm/ - tests/models/multimodal commands: + - export MIOPEN_DEBUG_CONV_DIRECT=0 + - export MIOPEN_DEBUG_CONV_GEMM=0 - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git - pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=0) and not core_model' @@ -1019,6 +1058,8 @@ steps: - vllm/ - tests/models/multimodal commands: + - export MIOPEN_DEBUG_CONV_DIRECT=0 + - export MIOPEN_DEBUG_CONV_GEMM=0 - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git - pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=1) and not core_model' @@ -1078,8 +1119,8 @@ steps: - vllm/v1/attention/backends/flashinfer.py - vllm/v1/attention/backends/mla/cutlass_mla.py - vllm/v1/attention/backends/mla/flashinfer_mla.py + - vllm/v1/attention/selector.py - vllm/platforms/cuda.py - - vllm/attention/selector.py commands: - nvidia-smi - python3 examples/offline_inference/basic/chat.py @@ -1196,7 +1237,7 @@ steps: - csrc/ - vllm/model_executor/layers/quantization commands: - - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-blackwell.txt --tp-size=1 + - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-blackwell.txt ##### 1 GPU test ##### ##### multi gpus test ##### @@ -1236,13 +1277,13 @@ steps: - # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up) - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' - NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' - - python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code + - python3 
../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py - # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up) - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' - NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' - - python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code + - python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code - label: Distributed Tests (2 GPUs) # 68min timeout_in_minutes: 90 @@ -1268,6 +1309,9 @@ steps: - tests/v1/shutdown - tests/v1/worker/test_worker_memory_snapshot.py commands: + # Work around HIP bug tracked here: https://github.com/ROCm/hip/issues/3876 + # TODO: Remove when the bug is fixed in a future ROCm release + - export TORCH_NCCL_BLOCKING_WAIT=1 - TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py - TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_eagle_dp.py - TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py @@ -1417,8 +1461,22 @@ steps: - vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py - tests/v1/kv_connector/nixl_integration/ commands: - - uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt - - bash 
v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh + - uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt + - VLLM_ATTENTION_BACKEND=ROCM_ATTN bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh + +- label: DP EP NixlConnector PD accuracy tests (Distributed) # 15min + mirror_hardwares: [amdexperimental] + agent_pool: mi325_4 + # grade: Blocking + timeout_in_minutes: 15 + working_dir: "/vllm-workspace/tests" + num_gpus: 4 + source_file_dependencies: + - vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py + - tests/v1/kv_connector/nixl_integration/ + commands: + - uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt + - VLLM_ATTENTION_BACKEND=ROCM_ATTN DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh ##### multi gpus test ##### ##### A100 test ##### @@ -1490,7 +1548,7 @@ steps: - "VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'" - VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py - pytest -v -s tests/distributed/test_context_parallel.py - - HIP_VISIBLE_DEVICES=0,1 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 + - HIP_VISIBLE_DEVICES=0,1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=allgather_reducescatter --disable-nccl-for-dp-synchronization - pytest -v -s tests/v1/distributed/test_dbo.py ##### B200 test ##### @@ -1514,7 +1572,7 @@ steps: - csrc/ - vllm/model_executor/layers/quantization commands: - - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1 + - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py 
--config-list-file=configs/models-small.txt - label: LM Eval Large Models (4 Card) mirror_hardwares: [amdexperimental, amdproduction] @@ -1569,6 +1627,8 @@ steps: - .buildkite/scripts/run-prime-rl-test.sh commands: - bash .buildkite/scripts/run-prime-rl-test.sh + +##### EPLB Accuracy Tests ##### - label: DeepSeek V2-Lite Accuracy mirror_hardwares: [amdexperimental, amdproduction] agent_pool: mi325_4 diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index 9d0b3fdd3a02c4b5366aed2521ab84d436f7e3b0..1c7a5ca368867560f53a2cee6193a85c6016638c 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -114,7 +114,7 @@ steps: - tests/entrypoints/ commands: - pytest -v -s entrypoints/openai/tool_parsers - - pytest -v -s entrypoints/ --ignore=entrypoints/llm --ignore=entrypoints/openai --ignore=entrypoints/offline_mode --ignore=entrypoints/test_chat_utils.py --ignore=entrypoints/pooling + - pytest -v -s entrypoints/ --ignore=entrypoints/llm --ignore=entrypoints/rpc --ignore=entrypoints/sleep --ignore=entrypoints/instrumentator --ignore=entrypoints/openai --ignore=entrypoints/offline_mode --ignore=entrypoints/test_chat_utils.py --ignore=entrypoints/pooling - label: Entrypoints Integration Test (LLM) # 30min timeout_in_minutes: 40 @@ -132,7 +132,7 @@ steps: - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process - pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests -- label: Entrypoints Integration Test (API Server) # 100min +- label: Entrypoints Integration Test (API Server 1) # 100min timeout_in_minutes: 130 mirror_hardwares: [amdexperimental] working_dir: "/vllm-workspace/tests" @@ -144,10 +144,26 @@ steps: - tests/entrypoints/test_chat_utils commands: - export VLLM_WORKER_MULTIPROC_METHOD=spawn - - PYTHONPATH=/vllm-workspace pytest -v -s entrypoints/openai/test_collective_rpc.py # PYTHONPATH is needed to import custom Worker extension - - pytest -v -s entrypoints/openai 
--ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/ + - pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses - pytest -v -s entrypoints/test_chat_utils.py +- label: Entrypoints Integration Test (API Server 2) + timeout_in_minutes: 50 + mirror_hardwares: [amdexperimental] + working_dir: "/vllm-workspace/tests" + fast_check: true + torch_nightly: true + source_file_dependencies: + - vllm/ + - tests/entrypoints/sleep + - tests/entrypoints/rpc + - tests/tool_use + commands: + - export VLLM_WORKER_MULTIPROC_METHOD=spawn + - pytest -v -s entrypoints/sleep + - PYTHONPATH=/vllm-workspace pytest -v -s entrypoints/rpc + - pytest -v -s tool_use + - label: Entrypoints Integration Test (Pooling) timeout_in_minutes: 50 mirror_hardwares: [amdexperimental] @@ -161,6 +177,18 @@ steps: - export VLLM_WORKER_MULTIPROC_METHOD=spawn - pytest -v -s entrypoints/pooling +- label: Entrypoints Integration Test (Responses API) + timeout_in_minutes: 50 + mirror_hardwares: [amdexperimental] + working_dir: "/vllm-workspace/tests" + fast_check: true + torch_nightly: true + source_file_dependencies: + - vllm/ + - tests/entrypoints/openai/responses + commands: + - pytest -v -s entrypoints/openai/responses + - label: Distributed Tests (4 GPUs) # 35min timeout_in_minutes: 50 mirror_hardwares: [amdexperimental] @@ -303,7 +331,10 @@ steps: # TODO: accuracy does not match, whether setting # VLLM_USE_FLASHINFER_SAMPLER or not on H100. 
- pytest -v -s v1/e2e - - pytest -v -s v1/engine + # Run this test standalone for now; + # need to untangle use (implicit) use of spawn/fork across the tests. + - pytest -v -s v1/engine/test_preprocess_error_handling.py + - pytest -v -s v1/engine --ignore v1/engine/test_preprocess_error_handling.py - label: V1 Test entrypoints # 35min timeout_in_minutes: 50 @@ -642,7 +673,7 @@ steps: # https://github.com/pytorch/ao/issues/2919, we'll have to skip new torchao tests for now # we can only upgrade after this is resolved # TODO(jerryzh168): resolve the above comment - - uv pip install --system torchao==0.13.0 --index-url https://download.pytorch.org/whl/cu129 + - uv pip install --system torchao==0.14.1 --index-url https://download.pytorch.org/whl/cu129 - uv pip install --system conch-triton-kernels - VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py @@ -654,7 +685,7 @@ steps: - vllm/model_executor/layers/quantization autorun_on_main: true commands: - - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1 + - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt - label: OpenAI API correctness # 22min timeout_in_minutes: 30 @@ -666,16 +697,6 @@ steps: commands: # LMEval+Transcription WER check - pytest -s entrypoints/openai/correctness/ -- label: OpenAI-Compatible Tool Use # 23 min - timeout_in_minutes: 35 - mirror_hardwares: [amdexperimental] - fast_check: false - source_file_dependencies: - - vllm/ - - tests/tool_use - commands: - - pytest -v -s tool_use - ##### models test ##### - label: Basic Models Tests (Initialization) @@ -934,7 +955,6 @@ steps: timeout_in_minutes: 30 working_dir: "/vllm-workspace/" gpu: b200 - # optional: true source_file_dependencies: - csrc/quantization/fp4/ - csrc/attention/mla/ @@ -946,8 +966,8 @@ steps: - vllm/v1/attention/backends/flashinfer.py - vllm/v1/attention/backends/mla/cutlass_mla.py - 
vllm/v1/attention/backends/mla/flashinfer_mla.py + - vllm/v1/attention/selector.py - vllm/platforms/cuda.py - - vllm/attention/selector.py commands: - nvidia-smi - python3 examples/offline_inference/basic/chat.py @@ -1064,7 +1084,7 @@ steps: - csrc/ - vllm/model_executor/layers/quantization commands: - - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-blackwell.txt --tp-size=1 + - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-blackwell.txt ##### 1 GPU test ##### ##### multi gpus test ##### @@ -1096,17 +1116,18 @@ steps: - vllm/model_executor/models/ - tests/distributed/ - tests/examples/offline_inference/data_parallel.py + - .buildkite/scripts/run-multi-node-test.sh commands: - # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up) - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' - NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' - - python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code + - python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py - # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up) - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 
'Same node test passed' - NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' - - python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code + - python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code - label: Distributed Tests (2 GPUs) # 68min timeout_in_minutes: 90 @@ -1258,8 +1279,19 @@ steps: commands: - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt -- label: NixlConnector PD accuracy tests (Distributed) # 30min - timeout_in_minutes: 30 +- label: NixlConnector PD accuracy tests (Distributed) # 40min + timeout_in_minutes: 40 + working_dir: "/vllm-workspace/tests" + num_gpus: 4 + source_file_dependencies: + - vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py + - tests/v1/kv_connector/nixl_integration/ + commands: + - uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt + - bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh + +- label: DP EP NixlConnector PD accuracy tests (Distributed) # 15min + timeout_in_minutes: 15 working_dir: "/vllm-workspace/tests" num_gpus: 4 source_file_dependencies: @@ -1267,7 +1299,7 @@ steps: - tests/v1/kv_connector/nixl_integration/ commands: - uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt - - bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh + - DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh ##### multi gpus test ##### @@ -1325,9 +1357,17 @@ steps: - "VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'" - VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest 
-v -s tests/distributed/test_sequence_parallel.py - pytest -v -s tests/distributed/test_context_parallel.py - - CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 + - CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput - pytest -v -s tests/v1/distributed/test_dbo.py +- label: LM Eval Large Models (H200) # optional + timeout_in_minutes: 60 + gpu: h200 + optional: true + num_gpus: 8 + commands: + - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-h200.txt + ##### B200 test ##### - label: Distributed Tests (B200) # optional gpu: b200 @@ -1350,6 +1390,7 @@ steps: - vllm/ - .buildkite/scripts/run-prime-rl-test.sh commands: + - nvidia-smi - bash .buildkite/scripts/run-prime-rl-test.sh - label: DeepSeek V2-Lite Accuracy @@ -1378,3 +1419,26 @@ steps: working_dir: "/vllm-workspace" commands: - bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020 2 1 + +##### MoE Refactor (Temporary) Tests ##### + +- label: MoE Refactor Integration Test (H100 - TEMPORARY) # optional + gpu: h100 + optional: true + num_gpus: 2 + commands: + - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt + +- label: MoE Refactor Integration Test (B200 - TEMPORARY) # optional + gpu: b200 + optional: true + num_gpus: 2 + commands: + - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt + +- label: MoE Refactor Integration Test (B200 DP - TEMPORARY) # optional + gpu: b200 + optional: true + num_gpus: 2 + commands: + - pytest -s -v 
evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor-dp-ep/config-b200.txt diff --git a/.buildkite/test_areas/distributed.yaml b/.buildkite/test_areas/distributed.yaml index 2cc90698d916ae0a7a231d5af4fa558491f42b4a..c88076bb528e8e1de0cd0f296561d9b15cd8b122 100644 --- a/.buildkite/test_areas/distributed.yaml +++ b/.buildkite/test_areas/distributed.yaml @@ -145,7 +145,7 @@ steps: - VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4' - VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py - pytest -v -s tests/distributed/test_context_parallel.py - - CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 + - CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput - pytest -v -s tests/v1/distributed/test_dbo.py - label: Distributed Tests (2 GPUs)(B200) @@ -171,7 +171,7 @@ steps: - tests/distributed/ - tests/examples/offline_inference/data_parallel.py commands: - - ./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0bec63fa317e1fbd62e19b0fc31c43c81bf89077 "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 
--master-port=12345 --enforce-eager --trust-remote-code && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py" "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code" + - ./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0bec63fa317e1fbd62e19b0fc31c43c81bf89077 "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py" "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 
--dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code" - label: Distributed NixlConnector PD accuracy (4 GPUs) timeout_in_minutes: 30 @@ -182,7 +182,7 @@ steps: - tests/v1/kv_connector/nixl_integration/ commands: - uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt - - bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh + - bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh - label: Pipeline + Context Parallelism (4 GPUs)) timeout_in_minutes: 60 diff --git a/.buildkite/test_areas/e2e_integration.yaml b/.buildkite/test_areas/e2e_integration.yaml index 93d389815edacc6f3005ee594028c99b838eeeb5..2e0857986c3fa0b22369d2f9de49592a68f3fe6a 100644 --- a/.buildkite/test_areas/e2e_integration.yaml +++ b/.buildkite/test_areas/e2e_integration.yaml @@ -32,6 +32,7 @@ steps: - label: Prime-RL Integration (2 GPUs) timeout_in_minutes: 30 optional: true + soft_fail: true num_gpus: 2 working_dir: "/vllm-workspace" source_file_dependencies: @@ -39,21 +40,3 @@ steps: - .buildkite/scripts/run-prime-rl-test.sh commands: - bash .buildkite/scripts/run-prime-rl-test.sh - -- label: DeepSeek V2-Lite Async EPLB Accuracy - timeout_in_minutes: 60 - gpu: h100 - optional: true - num_gpus: 4 - working_dir: "/vllm-workspace" - commands: - - bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_async_eplb.sh 0.25 1319 8030 - -- label: Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy - timeout_in_minutes: 60 - gpu: h100 - optional: true - num_gpus: 4 - working_dir: "/vllm-workspace" - commands: - - bash .buildkite/scripts/scheduled_integration_test/qwen3_next_mtp_async_eplb.sh 0.8 1319 8040 diff --git a/.buildkite/test_areas/entrypoints.yaml b/.buildkite/test_areas/entrypoints.yaml index 0a789be943f374a82f323a450b0417f3544df352..8e02d9f60b4e9cbee9481b697a6c017a0f1e32d9 100644 --- a/.buildkite/test_areas/entrypoints.yaml +++ 
b/.buildkite/test_areas/entrypoints.yaml @@ -10,7 +10,7 @@ steps: - tests/entrypoints/ commands: - pytest -v -s entrypoints/openai/tool_parsers - - pytest -v -s entrypoints/ --ignore=entrypoints/llm --ignore=entrypoints/openai --ignore=entrypoints/offline_mode --ignore=entrypoints/test_chat_utils.py --ignore=entrypoints/pooling + - pytest -v -s entrypoints/ --ignore=entrypoints/llm --ignore=entrypoints/rpc --ignore=entrypoints/sleep --ignore=entrypoints/instrumentator --ignore=entrypoints/openai --ignore=entrypoints/offline_mode --ignore=entrypoints/test_chat_utils.py --ignore=entrypoints/pooling - label: Entrypoints Integration (LLM) timeout_in_minutes: 40 @@ -25,7 +25,7 @@ steps: - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process - pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests -- label: Entrypoints Integration (API Server) +- label: Entrypoints Integration (API Server 1) timeout_in_minutes: 130 working_dir: "/vllm-workspace/tests" source_file_dependencies: @@ -34,10 +34,24 @@ steps: - tests/entrypoints/test_chat_utils commands: - export VLLM_WORKER_MULTIPROC_METHOD=spawn - - PYTHONPATH=/vllm-workspace pytest -v -s entrypoints/openai/test_collective_rpc.py # PYTHONPATH is needed to import custom Worker extension - - pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/ + - pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses - pytest -v -s 
entrypoints/test_chat_utils.py +- label: Entrypoints Integration (API Server 2) + timeout_in_minutes: 130 + working_dir: "/vllm-workspace/tests" + source_file_dependencies: + - vllm/ + - tests/tool_use + - tests/entrypoints/sleep + - tests/entrypoints/instrumentator + - tests/entrypoints/rpc + commands: + - export VLLM_WORKER_MULTIPROC_METHOD=spawn + - PYTHONPATH=/vllm-workspace pytest -v -s entrypoints/rpc + - pytest -v -s entrypoints/instrumentator + - pytest -v -s entrypoints/sleep + - pytest -v -s tool_use - label: Entrypoints Integration (Pooling) timeout_in_minutes: 50 @@ -49,6 +63,14 @@ steps: - export VLLM_WORKER_MULTIPROC_METHOD=spawn - pytest -v -s entrypoints/pooling +- label: Entrypoints Integration (Responses API) + timeout_in_minutes: 50 + working_dir: "/vllm-workspace/tests" + source_file_dependencies: + - vllm/ + - tests/entrypoints/openai/responses + commands: + - pytest -v -s entrypoints/openai/responses - label: Entrypoints V1 timeout_in_minutes: 50 diff --git a/.buildkite/test_areas/kernels.yaml b/.buildkite/test_areas/kernels.yaml index 7ca099516d64159fe1925407902443a2215769f3..cf4b646f349595b4759d0ac49d15d7d03c4a332f 100644 --- a/.buildkite/test_areas/kernels.yaml +++ b/.buildkite/test_areas/kernels.yaml @@ -90,8 +90,8 @@ steps: - vllm/v1/attention/backends/flashinfer.py - vllm/v1/attention/backends/mla/cutlass_mla.py - vllm/v1/attention/backends/mla/flashinfer_mla.py + - vllm/v1/attention/selector.py - vllm/platforms/cuda.py - - vllm/attention/selector.py commands: - nvidia-smi - python3 examples/offline_inference/basic/chat.py diff --git a/.buildkite/test_areas/lm_eval.yaml b/.buildkite/test_areas/lm_eval.yaml index 9af43e0c375a8d77b9123b8409f471b4c702e7bb..e2498512bdef7cb59eaa493c5bca7bcf533c1822 100644 --- a/.buildkite/test_areas/lm_eval.yaml +++ b/.buildkite/test_areas/lm_eval.yaml @@ -9,7 +9,7 @@ steps: - vllm/model_executor/layers/quantization autorun_on_main: true commands: - - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py 
--config-list-file=configs/models-small.txt --tp-size=1 + - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt - label: LM Eval Large Models (4 GPUs)(A100) gpu: a100 @@ -43,4 +43,4 @@ steps: - csrc/ - vllm/model_executor/layers/quantization commands: - - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-blackwell.txt --tp-size=1 + - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-blackwell.txt diff --git a/.buildkite/test_areas/lora.yaml b/.buildkite/test_areas/lora.yaml index 809b4138f44babebb6be1f76e17ed66ad230e261..59ade40cc8f520efee826709982840b34903f276 100644 --- a/.buildkite/test_areas/lora.yaml +++ b/.buildkite/test_areas/lora.yaml @@ -22,6 +22,8 @@ steps: # FIXIT: find out which code initialize cuda before running the test # before the fix, we need to use spawn to test it - export VLLM_WORKER_MULTIPROC_METHOD=spawn + # A lot of these tests are on the edge of OOMing + - export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True # There is some Tensor Parallelism related processing logic in LoRA that # requires multi-GPU testing for validation. 
- pytest -v -s -x lora/test_chatglm3_tp.py diff --git a/.buildkite/test_areas/models_basic.yaml b/.buildkite/test_areas/models_basic.yaml index 39a5d51c48833556f2a08921a188059cbf7b8efb..2a86596a6d603ef093a4c991525098550ab97743 100644 --- a/.buildkite/test_areas/models_basic.yaml +++ b/.buildkite/test_areas/models_basic.yaml @@ -9,6 +9,7 @@ steps: source_file_dependencies: - vllm/ - tests/models/test_initialization.py + - tests/models/registry.py commands: # Run a subset of model initialization tests - pytest -v -s models/test_initialization.py::test_can_initialize_small_subset @@ -20,6 +21,7 @@ steps: source_file_dependencies: - vllm/model_executor/models/ - tests/models/test_initialization.py + - tests/models/registry.py commands: # Only when vLLM model source is modified - test initialization of a large # subset of supported models (the complement of the small subset in the above diff --git a/.buildkite/test_areas/pytorch.yaml b/.buildkite/test_areas/pytorch.yaml index 703c82eb1a91bb4b887e57b2315dd20332705865..332d5202d83384be02ff273385f0cca253c5e4bb 100644 --- a/.buildkite/test_areas/pytorch.yaml +++ b/.buildkite/test_areas/pytorch.yaml @@ -13,7 +13,9 @@ steps: # tests covered elsewhere. 
# Use `find` to launch multiple instances of pytest so that # they do not suffer from https://github.com/vllm-project/vllm/issues/28965 - - "find compile/ -maxdepth 1 -name 'test_*.py' -exec pytest -s -v {} \\;" + # However, find does not normally propagate error codes, so we combine it with xargs + # (using -0 for proper path handling) + - "find compile/ -maxdepth 1 -name 'test_*.py' -print0 | xargs -0 -n1 -I{} pytest -s -v '{}'" - label: PyTorch Fullgraph Smoke Test timeout_in_minutes: 30 diff --git a/.buildkite/test_areas/tool_use.yaml b/.buildkite/test_areas/tool_use.yaml deleted file mode 100644 index 69527a1214229e78b0c56aa6e79573125fbcd259..0000000000000000000000000000000000000000 --- a/.buildkite/test_areas/tool_use.yaml +++ /dev/null @@ -1,13 +0,0 @@ -group: Tool use -depends_on: - - image-build -steps: -- label: OpenAI-Compatible Tool Use - timeout_in_minutes: 35 - mirror_hardwares: [amdexperimental] - fast_check: false - source_file_dependencies: - - vllm/ - - tests/tool_use - commands: - - pytest -v -s tool_use diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index d6447649cd89a18147eb099f276c6d9d6f5b86a7..c963be4cb8f92a5c7fcdbd1e2dffc781514280bd 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -3,7 +3,6 @@ # This lists cover the "core" components of vLLM that require careful review /vllm/attention @LucasWilkinson -/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @njhill /vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @njhill @22quinn /vllm/model_executor/layers/fused_moe @mgoin @pavanimajety /vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256 @pavanimajety @@ -15,6 +14,7 @@ /vllm/lora @jeejeelee /vllm/reasoning @aarnphm @chaunceyjiang /vllm/entrypoints @aarnphm @chaunceyjiang +/vllm/tool_parsers @aarnphm @chaunceyjiang /vllm/compilation @zou3519 @youkaichao @ProExpertProg /vllm/distributed/kv_transfer @NickLucche @ApostaC CMakeLists.txt 
@tlrmchlsmth @LucasWilkinson @@ -26,6 +26,7 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson # vLLM V1 /vllm/v1/attention @LucasWilkinson +/vllm/v1/attention/backend.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @njhill /vllm/v1/attention/backends/mla @pavanimajety /vllm/v1/attention/backends/flashinfer.py @mgoin @pavanimajety /vllm/v1/attention/backends/triton_attn.py @tdoublep @@ -116,15 +117,15 @@ mkdocs.yaml @hmellor /vllm/transformers_utils/tokenizers/mistral.py @patrickvonplaten # Kernels -/vllm/attention/ops/chunked_prefill_paged_decode.py @tdoublep -/vllm/attention/ops/triton_unified_attention.py @tdoublep +/vllm/v1/attention/ops/chunked_prefill_paged_decode.py @tdoublep +/vllm/v1/attention/ops/triton_unified_attention.py @tdoublep # ROCm related: specify owner with write access to notify AMD folks for careful code review /vllm/**/*rocm* @tjtanaa /docker/Dockerfile.rocm* @gshtras @tjtanaa /vllm/v1/attention/backends/rocm*.py @gshtras @tjtanaa /vllm/v1/attention/backends/mla/rocm*.py @gshtras @tjtanaa -/vllm/attention/ops/rocm*.py @gshtras @tjtanaa +/vllm/v1/attention/ops/rocm*.py @gshtras @tjtanaa /vllm/model_executor/layers/fused_moe/rocm*.py @gshtras @tjtanaa /csrc/rocm @gshtras @tjtanaa /requirements/*rocm* @tjtanaa @@ -152,7 +153,7 @@ mkdocs.yaml @hmellor /vllm/entrypoints/pooling @noooop /vllm/config/pooler.py @noooop /vllm/pooling_params.py @noooop -/vllm/model_executor/layers/pooler.py @noooop +/vllm/model_executor/layers/pooler @noooop # Security guide and policies /docs/usage/security.md @russellb diff --git a/.github/mergify.yml b/.github/mergify.yml index 3ad79f93bc7ad52fba91cd7ffb1925d25eef2107..a496dd302db507ae8a43e074ece96db51bd71939 100644 --- a/.github/mergify.yml +++ b/.github/mergify.yml @@ -222,10 +222,10 @@ pull_request_rules: - files~=^csrc/rocm/ - files~=^docker/Dockerfile.rocm - files~=^requirements/rocm.*\.txt - - files~=^vllm/attention/backends/rocm.*\.py - - files~=^vllm/attention/ops/rocm.*\.py - 
files~=^vllm/model_executor/layers/fused_moe/rocm.*\.py + - files~=^vllm/v1/attention/backends/rocm.*\.py - files~=^vllm/v1/attention/backends/mla/rocm.*\.py + - files~=^vllm/v1/attention/ops/rocm.*\.py - files~=^tests/kernels/.*_rocm.*\.py - files=vllm/platforms/rocm.py - title~=(?i)AMD @@ -235,6 +235,20 @@ pull_request_rules: add: - rocm +- name: label-cpu + description: Automatically apply cpu label + conditions: + - label != stale + - files~=^(?!.*kv_offload)(?!.*cpu_offload).*\bcpu.* + actions: + label: + add: + - cpu + assign: + users: + - "fadara01" + - "aditew01" + - name: label-structured-output description: Automatically apply structured-output label conditions: @@ -335,6 +349,18 @@ pull_request_rules: add: - tool-calling +- name: auto-rebase if approved, ready, and 40 commits behind main + conditions: + - base = main + - label=ready + - "#approved-reviews-by >= 1" + - "#commits-behind >= 40" + - -closed + - -draft + - -conflict + actions: + rebase: {} + - name: ping author on conflicts and add 'needs-rebase' label conditions: - label != stale diff --git a/.gitignore b/.gitignore index 7cda86478664fbc366454af101a132a8f06bd3e8..864542128c0508330efaeea3a99c4d380d4ba71d 100644 --- a/.gitignore +++ b/.gitignore @@ -227,3 +227,8 @@ ep_kernels_workspace/ # Allow tracked library source folders under submodules (e.g., benchmarks/lib) !vllm/benchmarks/lib/ + +# Generated gRPC protobuf files (compiled at build time from vllm_engine.proto) +vllm/grpc/vllm_engine_pb2.py +vllm/grpc/vllm_engine_pb2_grpc.py +vllm/grpc/vllm_engine_pb2.pyi diff --git a/CMakeLists.txt b/CMakeLists.txt index 83c86f3591625c860af16f12073b3fdde7e407a0..98daf9aa7097c0cbfee802a5e1bf0ab18f148dad 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -287,6 +287,7 @@ endif() set(VLLM_EXT_SRC "csrc/mamba/mamba_ssm/selective_scan_fwd.cu" "csrc/cache_kernels.cu" + "csrc/cache_kernels_fused.cu" "csrc/attention/paged_attention_v1.cu" "csrc/attention/paged_attention_v2.cu" 
"csrc/attention/merge_attn_states.cu" @@ -365,6 +366,8 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") # marlin arches for fp16 output cuda_archs_loose_intersection(MARLIN_ARCHS "8.0+PTX" "${CUDA_ARCHS}") + # marlin has limited support for turing + cuda_archs_loose_intersection(MARLIN_SM75_ARCHS "7.5" "${CUDA_ARCHS}") # marlin arches for bf16 output (we need 9.0 for bf16 atomicAdd PTX) cuda_archs_loose_intersection(MARLIN_BF16_ARCHS "8.0+PTX;9.0+PTX" "${CUDA_ARCHS}") # marlin arches for fp8 input @@ -372,8 +375,10 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") # - sm90 and sm100 don't support QMMA.16832.F32.E4M3.E4M3 SAAS instruction # so we only enable fp8 computation for SM89 (e.g. RTX 40x0) and 12.0 (e.g. RTX 50x0) cuda_archs_loose_intersection(MARLIN_FP8_ARCHS "8.9;12.0" "${CUDA_ARCHS}") + # marlin arches for other files + cuda_archs_loose_intersection(MARLIN_OTHER_ARCHS "7.5;8.0+PTX" "${CUDA_ARCHS}") - if (MARLIN_ARCHS) + if (MARLIN_OTHER_ARCHS) # # For the Marlin kernels we automatically generate sources for various @@ -414,25 +419,39 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") message(STATUS "Marlin generation script has not changed, skipping generation.") endif() - file(GLOB MARLIN_TEMPLATE_KERNEL_SRC "csrc/quantization/gptq_marlin/sm80_kernel_*_float16.cu") - set_gencode_flags_for_srcs( - SRCS "${MARLIN_TEMPLATE_KERNEL_SRC}" - CUDA_ARCHS "${MARLIN_ARCHS}") - if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) - set_source_files_properties(${MARLIN_TEMPLATE_KERNEL_SRC} - PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false") + if (MARLIN_ARCHS) + file(GLOB MARLIN_TEMPLATE_KERNEL_SRC "csrc/quantization/gptq_marlin/sm80_kernel_*_float16.cu") + set_gencode_flags_for_srcs( + SRCS "${MARLIN_TEMPLATE_KERNEL_SRC}" + CUDA_ARCHS "${MARLIN_ARCHS}") + if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) + set_source_files_properties(${MARLIN_TEMPLATE_KERNEL_SRC} + PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false") + endif() + list(APPEND VLLM_EXT_SRC 
${MARLIN_TEMPLATE_KERNEL_SRC}) + + file(GLOB MARLIN_TEMPLATE_BF16_KERNEL_SRC "csrc/quantization/gptq_marlin/sm80_kernel_*_bfloat16.cu") + set_gencode_flags_for_srcs( + SRCS "${MARLIN_TEMPLATE_BF16_KERNEL_SRC}" + CUDA_ARCHS "${MARLIN_BF16_ARCHS}") + if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) + set_source_files_properties(${MARLIN_TEMPLATE_BF16_KERNEL_SRC} + PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false") + endif() + list(APPEND VLLM_EXT_SRC ${MARLIN_TEMPLATE_BF16_KERNEL_SRC}) endif() - list(APPEND VLLM_EXT_SRC ${MARLIN_TEMPLATE_KERNEL_SRC}) - file(GLOB MARLIN_TEMPLATE_BF16_KERNEL_SRC "csrc/quantization/gptq_marlin/sm80_kernel_*_bfloat16.cu") - set_gencode_flags_for_srcs( - SRCS "${MARLIN_TEMPLATE_BF16_KERNEL_SRC}" - CUDA_ARCHS "${MARLIN_BF16_ARCHS}") - if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) - set_source_files_properties(${MARLIN_TEMPLATE_BF16_KERNEL_SRC} - PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false") + if (MARLIN_SM75_ARCHS) + file(GLOB MARLIN_TEMPLATE_SM75_KERNEL_SRC "csrc/quantization/gptq_marlin/sm75_kernel_*.cu") + set_gencode_flags_for_srcs( + SRCS "${MARLIN_TEMPLATE_SM75_KERNEL_SRC}" + CUDA_ARCHS "${MARLIN_SM75_ARCHS}") + if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) + set_source_files_properties(${MARLIN_TEMPLATE_SM75_KERNEL_SRC} + PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false") + endif() + list(APPEND VLLM_EXT_SRC ${MARLIN_TEMPLATE_SM75_KERNEL_SRC}) endif() - list(APPEND VLLM_EXT_SRC ${MARLIN_TEMPLATE_BF16_KERNEL_SRC}) if (MARLIN_FP8_ARCHS) file(GLOB MARLIN_TEMPLATE_FP8_KERNEL_SRC "csrc/quantization/gptq_marlin/sm89_kernel_*.cu") @@ -454,14 +473,14 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") "csrc/quantization/gptq_marlin/awq_marlin_repack.cu") set_gencode_flags_for_srcs( SRCS "${MARLIN_SRCS}" - CUDA_ARCHS "${MARLIN_ARCHS}") + CUDA_ARCHS "${MARLIN_OTHER_ARCHS}") if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) - 
set_source_files_properties("csrc/quantization/gptq_marlin/gptq_marlin.cu" + set_source_files_properties(${MARLIN_SRCS} PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false") endif() list(APPEND VLLM_EXT_SRC "${MARLIN_SRCS}") - message(STATUS "Building Marlin kernels for archs: ${MARLIN_ARCHS}") + message(STATUS "Building Marlin kernels for archs: ${MARLIN_OTHER_ARCHS}") else() message(STATUS "Not building Marlin kernels as no compatible archs found" " in CUDA target architectures") @@ -789,24 +808,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") else() cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}") endif() - if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS) - set(SRCS "csrc/quantization/w8a8/cutlass/moe/blockwise_scaled_group_mm_sm100.cu") - set_gencode_flags_for_srcs( - SRCS "${SRCS}" - CUDA_ARCHS "${SCALED_MM_ARCHS}") - list(APPEND VLLM_EXT_SRC "${SRCS}") - list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1") - message(STATUS "Building blockwise_scaled_group_mm_sm100 for archs: ${SCALED_MM_ARCHS}") - else() - if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS) - message(STATUS "Not building blockwise_scaled_group_mm_sm100 kernels as CUDA Compiler version is " - "not >= 12.8, we recommend upgrading to CUDA 12.8 or later " - "if you intend on running FP8 quantized MoE models on Blackwell.") - else() - message(STATUS "Not building blockwise_scaled_group_mm_sm100 as no compatible archs found " - "in CUDA target architectures") - endif() - endif() # # Machete kernels @@ -989,12 +990,16 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") # note that we always set `use_atomic_add=False` for moe marlin now, # so we don't need 9.0 for bf16 atomicAdd PTX cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0+PTX" "${CUDA_ARCHS}") + # moe marlin has limited support for turing + cuda_archs_loose_intersection(MARLIN_MOE_SM75_ARCHS "7.5" "${CUDA_ARCHS}") # moe marlin arches for fp8 input 
# - sm80 doesn't support fp8 computation # - sm90 and sm100 don't support QMMA.16832.F32.E4M3.E4M3 SAAS instruction # so we only enable fp8 computation for SM89 (e.g. RTX 40x0) and 12.0 (e.g. RTX 50x0) cuda_archs_loose_intersection(MARLIN_MOE_FP8_ARCHS "8.9;12.0" "${CUDA_ARCHS}") - if (MARLIN_MOE_ARCHS) + # moe marlin arches for other files + cuda_archs_loose_intersection(MARLIN_MOE_OTHER_ARCHS "7.5;8.0+PTX" "${CUDA_ARCHS}") + if (MARLIN_MOE_OTHER_ARCHS) # # For the Marlin MOE kernels we automatically generate sources for various @@ -1035,16 +1040,29 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") message(STATUS "Marlin MOE generation script has not changed, skipping generation.") endif() - file(GLOB MARLIN_MOE_SRC "csrc/moe/marlin_moe_wna16/sm80_kernel_*.cu") - list(APPEND MARLIN_MOE_SRC "csrc/moe/marlin_moe_wna16/ops.cu") - set_gencode_flags_for_srcs( - SRCS "${MARLIN_MOE_SRC}" - CUDA_ARCHS "${MARLIN_MOE_ARCHS}") - if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) - set_source_files_properties(${MARLIN_MOE_SRC} - PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false") + if (MARLIN_MOE_ARCHS) + file(GLOB MARLIN_MOE_SRC "csrc/moe/marlin_moe_wna16/sm80_kernel_*.cu") + set_gencode_flags_for_srcs( + SRCS "${MARLIN_MOE_SRC}" + CUDA_ARCHS "${MARLIN_MOE_ARCHS}") + if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) + set_source_files_properties(${MARLIN_MOE_SRC} + PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false") + endif() + list(APPEND VLLM_MOE_EXT_SRC ${MARLIN_MOE_SRC}) + endif() + + if (MARLIN_MOE_SM75_ARCHS) + file(GLOB MARLIN_MOE_SM75_SRC "csrc/moe/marlin_moe_wna16/sm75_kernel_*.cu") + set_gencode_flags_for_srcs( + SRCS "${MARLIN_MOE_SM75_SRC}" + CUDA_ARCHS "${MARLIN_MOE_SM75_ARCHS}") + if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) + set_source_files_properties(${MARLIN_MOE_SM75_SRC} + PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false") + endif() + list(APPEND VLLM_MOE_EXT_SRC ${MARLIN_MOE_SM75_SRC}) endif() 
- list(APPEND VLLM_MOE_EXT_SRC ${MARLIN_MOE_SRC}) if (MARLIN_MOE_FP8_ARCHS) file(GLOB MARLIN_MOE_FP8_SRC "csrc/moe/marlin_moe_wna16/sm89_kernel_*.cu") @@ -1058,7 +1076,17 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") list(APPEND VLLM_MOE_EXT_SRC ${MARLIN_MOE_FP8_SRC}) endif() - message(STATUS "Building Marlin MOE kernels for archs: ${MARLIN_MOE_ARCHS}") + set(MARLIN_MOE_OTHER_SRC "csrc/moe/marlin_moe_wna16/ops.cu") + set_gencode_flags_for_srcs( + SRCS "${MARLIN_MOE_OTHER_SRC}" + CUDA_ARCHS "${MARLIN_MOE_OTHER_ARCHS}") + if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) + set_source_files_properties(${MARLIN_MOE_OTHER_SRC} + PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false") + endif() + list(APPEND VLLM_MOE_EXT_SRC "${MARLIN_MOE_OTHER_SRC}") + + message(STATUS "Building Marlin MOE kernels for archs: ${MARLIN_MOE_OTHER_ARCHS}") else() message(STATUS "Not building Marlin MOE kernels as no compatible archs found" " in CUDA target architectures") diff --git a/README_ORIGIN.md b/README_ORIGIN.md index 7f08e9c16ab626bead5bd6e5621c81ca71bff5c8..4cab345f88d978678953222f76a87489e38cd0f0 100644 --- a/README_ORIGIN.md +++ b/README_ORIGIN.md @@ -14,51 +14,8 @@ Easy, fast, and cheap LLM serving for everyone | Documentation | Blog | Paper | Twitter/X | User Forum | Developer Slack |

---- -Join us at the [PyTorch Conference, October 22-23](https://events.linuxfoundation.org/pytorch-conference/) and [Ray Summit, November 3-5](https://www.anyscale.com/ray-summit/2025) in San Francisco for our latest updates on vLLM and to meet the vLLM team! Register now for the largest vLLM community events of the year! - ---- - -*Latest News* 🔥 - -- [2025/11] We hosted [vLLM Bangkok Meetup](https://luma.com/v0f647nv). We explored vLLM and LMCache inference and low-resource language adaptation with speakers from Embedded LLM, AMD, and Red Hat. Please find the meetup slides [here](https://drive.google.com/drive/folders/1H0DS57F8HQ5q3kSOSoRmucPJWL3E0A_X?usp=sharing). -- [2025/11] We hosted [the first vLLM Europe Meetup in Zurich](https://luma.com/0gls27kb) focused on quantization, distributed inference, and reinforcement learning at scale with speakers from Mistral, IBM, and Red Hat. Please find the meetup slides [here](https://docs.google.com/presentation/d/1UC9PTLCHYXQpOmJDSFg6Sljra3iVXzc09DeEI7dnxMc/edit?usp=sharing) and recording [here](https://www.youtube.com/watch?v=6m6ZE6yVEDI) -- [2025/11] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/xSrYXjNgr1HbCP4ExYNG1w) focusing on distributed inference and diverse accelerator support with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1nQJ8ZkLSjKxvu36sSHaceVXtttbLvvu-?usp=drive_link). -- [2025/10] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/__xb4OyOsImz-9eAVrdlcg) focused on hands-on vLLM inference optimization! Please find the meetup slides [here](https://drive.google.com/drive/folders/1KqwjsFJLfEsC8wlDugnrR61zsWHt94Q6). -- [2025/09] We hosted [vLLM Toronto Meetup](https://luma.com/e80e0ymm) focused on tackling inference at scale and speculative decoding with speakers from NVIDIA and Red Hat! Please find the meetup slides [here](https://docs.google.com/presentation/d/1IYJYmJcu9fLpID5N5RbW_vO0XLo0CGOR14IXOjB61V8/edit?usp=sharing). 
-- [2025/08] We hosted [vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ) focusing on the ecosystem around vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA). -- [2025/08] We hosted [vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet). We shared V1 updates, disaggregated serving and MLLM speedups with speakers from Embedded LLM, AMD, WekaIO, and A*STAR. Please find the meetup slides [here](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing). -- [2025/08] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg) focusing on building, developing, and integrating with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH). -- [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/). -- [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html). - -
-Previous News - -- [2025/08] We hosted [vLLM Korea Meetup](https://luma.com/cgcgprmh) with Red Hat and Rebellions! We shared the latest advancements in vLLM along with project spotlights from the vLLM Korea community. Please find the meetup slides [here](https://drive.google.com/file/d/1bcrrAE1rxUgx0mjIeOWT6hNe2RefC5Hm/view). -- [2025/08] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/dgkWg1WFpWGO2jCdTqQHxA) focusing on large-scale LLM deployment! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF) and the recording [here](https://www.chaspark.com/#/live/1166916873711665152). -- [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing). -- [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing). -- [2025/03] We hosted [vLLM x Ollama Inference Night](https://lu.ma/vllm-ollama)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/16T2PDD1YwRnZ4Tu8Q5r6n53c5Lr5c73UV9Vd2_eBo4U/edit?usp=sharing). -- [2025/03] We hosted [the first vLLM China Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg)! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1REHvfQMKGnvz6p3Fd23HhSO4c8j5WPGZV0bKYLwnHyQ/edit?usp=sharing). -- [2025/03] We hosted [the East Coast vLLM Meetup](https://lu.ma/7mu4k4xx)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1NHiv8EUFF1NLd3fEYODm56nDmL26lEeXCaDgyDlTsRs/edit#slide=id.g31441846c39_0_0). 
-- [2025/02] We hosted [the ninth vLLM meetup](https://lu.ma/h7g3kuj9) with Meta! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1jzC_PZVXrVNSFVCW-V4cFXb6pn7zZ2CyP_Flwo05aqg/edit?usp=sharing) and AMD [here](https://drive.google.com/file/d/1Zk5qEJIkTmlQ2eQcXQZlljAx3m9s7nwn/view?usp=sharing). The slides from Meta will not be posted. -- [2025/01] We hosted [the eighth vLLM meetup](https://lu.ma/zep56hui) with Google Cloud! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing), and Google Cloud team [here](https://drive.google.com/file/d/1h24pHewANyRL11xy5dXUbvRC9F9Kkjix/view?usp=sharing). -- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone! -- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing), and Snowflake team [here](https://docs.google.com/presentation/d/1qF3RkDAbOULwz9WK5TOltt2fE9t6uIc_hVNLFAaQX6A/edit?usp=sharing). -- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there! -- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://www.youtube.com/playlist?list=PLzTswPQNepXl6AQwifuwUImLPFRVpksjR) from other vLLM contributors and users! -- [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! 
Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing). -- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing). -- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html). -- [2024/06] We hosted [the fourth vLLM meetup](https://lu.ma/agivllm) with Cloudflare and BentoML! Please find the meetup slides [here](https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing). -- [2024/04] We hosted [the third vLLM meetup](https://robloxandvllmmeetup2024.splashthat.com/) with Roblox! Please find the meetup slides [here](https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing). -- [2024/01] We hosted [the second vLLM meetup](https://lu.ma/ygxbpzhl) with IBM! Please find the meetup slides [here](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing). -- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) with a16z! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing). -- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM. -- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai). - -
+🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more. +For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us. --- @@ -118,50 +75,6 @@ Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more. We welcome and value any contributions and collaborations. Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved. -## Sponsors - -vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support! - - - -Cash Donations: - -- a16z -- Dropbox -- Sequoia Capital -- Skywork AI -- ZhenFund - -Compute Resources: - -- Alibaba Cloud -- AMD -- Anyscale -- Arm -- AWS -- Crusoe Cloud -- Databricks -- DeepInfra -- Google Cloud -- IBM -- Intel -- Lambda Lab -- Nebius -- Novita AI -- NVIDIA -- Red Hat -- Replicate -- Roblox -- RunPod -- Trainy -- UC Berkeley -- UC San Diego -- Volcengine - -Slack Sponsor: Anyscale - -We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM. 
- ## Citation If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180): @@ -182,7 +95,7 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs - For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai) - For coordinating contributions and development, please use [Slack](https://slack.vllm.ai) - For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature -- For collaborations and partnerships, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu) +- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai) ## Media Kit diff --git a/RELEASE.md b/RELEASE.md index db0d51afc7be13b4da530430c33fa1c899956e41..dfd4fa1ae04d499663b4b315a9fc4988408cbfc9 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,47 +1,30 @@ # Releasing vLLM -vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via PyPI. These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes. +vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via [PyPI](https://pypi.org/project/vllm). These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes. -## Release Versioning +## Release Cadence and Versioning -vLLM uses a “right-shifted” versioning scheme where a new patch release is out every 2 weeks. 
And patch releases contain features and bug fixes (as opposed to semver where patch release contains only backwards-compatible bug fixes). When critical fixes need to be made, special release post1 is released. +We aim to have a regular release every 2 weeks. Since v0.12.0, regular releases increment the minor version rather than patch version. The list of past releases can be found [here](https://vllm.ai/releases). -* _major_ major architectural milestone and when incompatible API changes are made, similar to PyTorch 2.0. -* _minor_ major features -* _patch_ features and backwards-compatible bug fixes -* _post1_ or _patch-1_ backwards-compatible bug fixes, either explicit or implicit post release +Our version numbers are expressed in the form `vX.Y.Z`, where `X` is the major version, `Y` is the minor version, and `Z` is the patch version. They are incremented according to the following rules: -## Release Cadence +* _Major_ releases are reserved for architectural milestones involving sweeping API changes, similar to PyTorch 2.0. +* _Minor_ releases correspond to regular releases, which include new features, bug fixes and other backwards-compatible changes. +* _Patch_ releases correspond to special releases for new models, as well as emergency patches for critical performance, functionality and security issues. -Patch release is released on bi-weekly basis. Post release 1-3 days after patch release and uses same branch as patch release. -Following is the release cadence for year 2025. All future release dates below are tentative. Please note: Post releases are optional. +This versioning scheme is similar to [SemVer](https://semver.org/) for compatibility purposes, except that backwards compatibility is only guaranteed for a limited number of minor releases (see our [deprecation policy](https://docs.vllm.ai/en/latest/contributing/deprecation_policy) for details). 
-| Release Date | Patch release versions | Post Release versions | -| --- | --- | --- | -| Jan 2025 | 0.7.0 | --- | -| Feb 2025 | 0.7.1, 0.7.2, 0.7.3 | --- | -| Mar 2025 | 0.7.4, 0.7.5 | --- | -| Apr 2025 | 0.7.6, 0.7.7 | --- | -| May 2025 | 0.7.8, 0.7.9 | --- | -| Jun 2025 | 0.7.10, 0.7.11 | --- | -| Jul 2025 | 0.7.12, 0.7.13 | --- | -| Aug 2025 | 0.7.14, 0.7.15 | --- | -| Sep 2025 | 0.7.16, 0.7.17 | --- | -| Oct 2025 | 0.7.18, 0.7.19 | --- | -| Nov 2025 | 0.7.20, 0.7.21 | --- | -| Dec 2025 | 0.7.22, 0.7.23 | --- | - -## Release branch +## Release Branch Each release is built from a dedicated release branch. -* For _major_, _minor_, _patch_ releases, the release branch cut is performed 1-2 days before release is live. -* For post releases, previously cut release branch is reused -* Release builds are triggered via push to RC tag like vX.Y.Z-rc1 . This enables us to build and test multiple RCs for each release. -* Final tag : vX.Y.Z does not trigger the build but used for Release notes and assets. -* After branch cut is created we monitor the main branch for any reverts and apply these reverts to a release branch. +* For _major_ and _minor_ releases, the release branch cut is performed 1-2 days before release is live. +* For _patch_ releases, previously cut release branch is reused. +* Release builds are triggered via push to RC tag like `vX.Y.Z-rc1`. This enables us to build and test multiple RCs for each release. +* Final tag: `vX.Y.Z` does not trigger the build but used for Release notes and assets. +* After branch cut is created, we monitor the main branch for any reverts and apply these reverts to a release branch. -## Release Cherry-Pick Criteria +### Cherry-Pick Criteria After branch cut, we approach finalizing the release branch with clear criteria on what cherry picks are allowed in. Note: a cherry pick is a process to land a PR in the release branch after branch cut. 
These are typically limited to ensure that the team has sufficient time to complete a thorough round of testing on a stable code base. diff --git a/benchmarks/benchmark_batch_invariance.py b/benchmarks/benchmark_batch_invariance.py index b5c16c42de467abb26927a017003fc36c5c33e71..7473a41e51406dcb5b3e1a9a1ccfce41f10573fb 100755 --- a/benchmarks/benchmark_batch_invariance.py +++ b/benchmarks/benchmark_batch_invariance.py @@ -104,7 +104,6 @@ def run_benchmark_with_batch_invariant( random.seed(seed) # Set environment variables - os.environ["VLLM_ATTENTION_BACKEND"] = backend if batch_invariant: os.environ["VLLM_BATCH_INVARIANT"] = "1" else: @@ -140,6 +139,7 @@ def run_benchmark_with_batch_invariant( max_model_len=max_model_len, dtype="bfloat16", tensor_parallel_size=tp_size, + attention_config={"backend": backend}, enable_prefix_caching=False, ) init_time = time.perf_counter() - start_init diff --git a/benchmarks/benchmark_ngram_proposer.py b/benchmarks/benchmark_ngram_proposer.py index b5373d383b548eb97dad2ebb534b80096537776d..57a6c1aef5e78ee892a45d4267409c5d524ac4dd 100644 --- a/benchmarks/benchmark_ngram_proposer.py +++ b/benchmarks/benchmark_ngram_proposer.py @@ -135,7 +135,6 @@ def benchmark_batched_propose(args): block_sizes=[16], ) dummy_input_batch._req_ids = list(str(id) for id in range(args.num_req)) - dummy_input_batch.spec_decode_unsupported_reqs = () dummy_input_batch.num_tokens_no_spec = [args.num_token] * args.num_req dummy_input_batch.token_ids_cpu = np.random.randint( 0, 20, (args.num_req, args.num_token) @@ -151,10 +150,8 @@ def benchmark_batched_propose(args): start = time.time() runner.drafter.propose( sampled_token_ids, - dummy_input_batch.req_ids, dummy_input_batch.num_tokens_no_spec, dummy_input_batch.token_ids_cpu, - dummy_input_batch.spec_decode_unsupported_reqs, ) end = time.time() print(f"Iteration time (s): {end - start}") diff --git a/benchmarks/cutlass_benchmarks/sparse_benchmarks.py b/benchmarks/cutlass_benchmarks/sparse_benchmarks.py 
index 67fccdf4fd07e3aaa0feb29c64c114f7fdbb738f..7720f15e45cc1535e3c195faf2752d618c42ee9d 100644 --- a/benchmarks/cutlass_benchmarks/sparse_benchmarks.py +++ b/benchmarks/cutlass_benchmarks/sparse_benchmarks.py @@ -343,7 +343,9 @@ def bench( return bench_int8(dtype, m, k, n, label, sub_label) if dtype == torch.float8_e4m3fn: return bench_fp8(dtype, m, k, n, label, sub_label) - raise ValueError("unsupported type") + raise ValueError( + f"Unsupported dtype {dtype}: should be one of torch.int8, torch.float8_e4m3fn." + ) # runner diff --git a/benchmarks/kernels/bench_nvfp4_quant.py b/benchmarks/kernels/bench_nvfp4_quant.py new file mode 100644 index 0000000000000000000000000000000000000000..7517376535925d6a4271b711e8b21cd3f806846b --- /dev/null +++ b/benchmarks/kernels/bench_nvfp4_quant.py @@ -0,0 +1,177 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import argparse +import copy +import itertools + +import torch +from weight_shapes import WEIGHT_SHAPES + +from vllm import _custom_ops as ops +from vllm.platforms import current_platform +from vllm.scalar_type import scalar_types +from vllm.triton_utils import triton +from vllm.utils.flashinfer import flashinfer_fp4_quantize + +if not current_platform.has_device_capability(100): + raise RuntimeError("NVFP4 requires compute capability of 10.0 (Blackwell)") + +FLOAT4_E2M1_MAX = scalar_types.float4_e2m1f.max() +FLOAT8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max + +PROVIDER_CFGS = { + "vllm": dict(backend="vllm", enabled=True), + "flashinfer": dict(backend="flashinfer", enabled=True), +} + +_enabled = [k for k, v in PROVIDER_CFGS.items() if v["enabled"]] + + +def compute_global_scale(tensor: torch.Tensor) -> torch.Tensor: + """Compute global scale for FP4 quantization.""" + amax = torch.abs(tensor).max().to(torch.float32) + return FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX / amax + + +@triton.testing.perf_report( + triton.testing.Benchmark( + x_names=["batch_size"], 
+ x_vals=[1, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096], + x_log=False, + line_arg="provider", + line_vals=_enabled, + line_names=_enabled, + ylabel="us (lower is better)", + plot_name="NVFP4 Input Quantization Latency (us)", + args={}, + ) +) +def benchmark(batch_size, provider, N, K): + M = batch_size + device = "cuda" + dtype = torch.bfloat16 + + # Create input tensor + a = torch.randn((M, K), device=device, dtype=dtype) + + # Compute global scale for activation + a_global_scale = compute_global_scale(a) + + quantiles = [0.5, 0.2, 0.8] + + cfg = PROVIDER_CFGS[provider] + + if cfg["backend"] == "vllm": + # vLLM's FP4 quantization + ms, min_ms, max_ms = triton.testing.do_bench_cudagraph( + lambda: ops.scaled_fp4_quant(a, a_global_scale), + quantiles=quantiles, + ) + elif cfg["backend"] == "flashinfer": + # FlashInfer's FP4 quantization + # Use is_sf_swizzled_layout=True to match vLLM's output format + ms, min_ms, max_ms = triton.testing.do_bench_cudagraph( + lambda: flashinfer_fp4_quantize( + a, a_global_scale, is_sf_swizzled_layout=True + ), + quantiles=quantiles, + ) + + # Convert ms to us for better readability at small batch sizes + to_us = lambda t_ms: t_ms * 1000 + return to_us(ms), to_us(max_ms), to_us(min_ms) + + +def prepare_shapes(args): + out = [] + for model, tp_size in itertools.product(args.models, args.tp_sizes): + for KN, tp_dim in copy.deepcopy(WEIGHT_SHAPES[model]): + KN[tp_dim] //= tp_size + KN.append(model) + out.append(KN) + return out + + +def _test_accuracy_once(M: int, K: int, dtype: torch.dtype, device: str): + """Test accuracy between vLLM and FlashInfer FP4 quantization.""" + # Create input tensor + a = torch.randn((M, K), device=device, dtype=dtype) + + # Compute global scale + a_global_scale = compute_global_scale(a) + + # vLLM quantization + vllm_fp4, vllm_scale = ops.scaled_fp4_quant(a, a_global_scale) + + # FlashInfer quantization (with swizzled layout to match vLLM's output) + flashinfer_fp4, flashinfer_scale = 
flashinfer_fp4_quantize( + a, a_global_scale, is_sf_swizzled_layout=True + ) + flashinfer_scale = flashinfer_scale.view(torch.float8_e4m3fn) + + # Compare outputs + torch.testing.assert_close( + vllm_fp4, + flashinfer_fp4, + ) + print(f"M={M}, K={K}, dtype={dtype}: PASSED") + + +def test_accuracy(): + """Run accuracy tests across various shapes.""" + print("\n" + "=" * 60) + print("Running accuracy tests: vLLM vs FlashInfer") + print("=" * 60) + + device = "cuda" + dtype = torch.bfloat16 + + # Test various batch sizes and hidden dimensions + Ms = [1, 1024] + Ks = [4096] + + for M in Ms: + for K in Ks: + _test_accuracy_once(M, K, dtype, device) + + print("\nAll accuracy tests passed!") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser( + description="Benchmark NVFP4 quantization: vLLM vs FlashInfer" + ) + parser.add_argument( + "--models", + nargs="+", + type=str, + default=["meta-llama/Llama-3.1-8B-Instruct"], + choices=list(WEIGHT_SHAPES.keys()), + ) + parser.add_argument("--tp-sizes", nargs="+", type=int, default=[1]) + parser.add_argument( + "--save-path", + type=str, + default=None, + help="Path to save benchmark results", + ) + parser.add_argument( + "--accuracy", + action="store_true", + help="Run accuracy tests", + ) + args = parser.parse_args() + + if args.accuracy: + test_accuracy() + + for K, N, model in prepare_shapes(args): + print(f"\n{model}, N={N} K={K}") + benchmark.run( + print_data=True, + save_path=args.save_path, + N=N, + K=K, + ) + + print("\nBenchmark finished!") diff --git a/benchmarks/kernels/benchmark_activation.py b/benchmarks/kernels/benchmark_activation.py index 66268b71b3de644b424429aa66ad7dc6216b3c3a..fbe5f744148e3591e944ff8b89b85c891ee08aa6 100644 --- a/benchmarks/kernels/benchmark_activation.py +++ b/benchmarks/kernels/benchmark_activation.py @@ -8,13 +8,12 @@ import torch import vllm.model_executor.layers.activation # noqa F401 from vllm.model_executor.custom_op import CustomOp -from vllm.platforms import 
current_platform from vllm.triton_utils import triton from vllm.utils.argparse_utils import FlexibleArgumentParser -from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE +from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed -batch_size_range = [1, 16, 32, 64, 128] -seq_len_range = [1, 16, 64, 128, 256, 512, 1024, 2048, 4096] +batch_size_range = [1, 16, 128] +seq_len_range = [1, 16, 64, 1024, 4096] intermediate_size = [3072, 9728, 12288] configs = list(itertools.product(batch_size_range, seq_len_range, intermediate_size)) @@ -30,7 +29,7 @@ def benchmark_activation( device = "cuda" num_tokens = batch_size * seq_len dim = intermediate_size - current_platform.seed_everything(42) + set_random_seed(42) torch.set_default_device(device) if func_name == "gelu_and_mul": diff --git a/benchmarks/kernels/benchmark_cutlass_moe_fp8.py b/benchmarks/kernels/benchmark_cutlass_moe_fp8.py index e07d6c776bc00b838a0a8d99202fe85bc0e4935c..9c6edee7b26454c924a6686de7dd096c7b5621fa 100644 --- a/benchmarks/kernels/benchmark_cutlass_moe_fp8.py +++ b/benchmarks/kernels/benchmark_cutlass_moe_fp8.py @@ -6,15 +6,19 @@ kernel. Both kernels take in fp8 quantized weights and 16-bit activations, but use different quantization strategies and backends. 
""" -import nvtx import torch +import vllm.model_executor.layers.fused_moe.modular_kernel as mk from vllm import _custom_ops as ops from vllm.model_executor.layers.fused_moe.config import fp8_w8a8_moe_quant_config -from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp8 +from vllm.model_executor.layers.fused_moe.cutlass_moe import CutlassExpertsFp8 from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk +from vllm.model_executor.layers.fused_moe.prepare_finalize import ( + MoEPrepareAndFinalizeNoEP, +) from vllm.platforms import current_platform from vllm.utils.argparse_utils import FlexibleArgumentParser +from vllm.v1.worker.workspace import init_workspace_manager # Weight shapes for different models: [num_experts, topk, hidden_size, # intermediate_size] @@ -58,6 +62,7 @@ def bench_run( per_out_ch: bool, mkn: tuple[int, int, int], ): + init_workspace_manager(torch.cuda.current_device()) (m, k, n) = mkn dtype = torch.half @@ -120,85 +125,6 @@ def bench_run( # Force per-tensor quantization for all cases per_act_token = False - # Create stride tensors for CUTLASS - ab_strides1 = torch.full((num_experts,), k, dtype=torch.int64, device=device) - ab_strides2 = torch.full((num_experts,), n, dtype=torch.int64, device=device) - c_strides1 = torch.full((num_experts,), 2 * n, dtype=torch.int64, device=device) - c_strides2 = torch.full((num_experts,), k, dtype=torch.int64, device=device) - - def run_triton_moe( - a: torch.Tensor, - w1: torch.Tensor, - w2: torch.Tensor, - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - w1_scale: torch.Tensor, - w2_scale: torch.Tensor, - a1_scale: torch.Tensor, - a2_scale: torch.Tensor, - num_repeats: int, - ): - quant_config = fp8_w8a8_moe_quant_config( - w1_scale=w1_scale, - w2_scale=w2_scale, - a1_scale=a1_scale, - a2_scale=a2_scale, - per_act_token_quant=per_act_token, - per_out_ch_quant=per_out_ch, - ) - - for _ in range(num_repeats): - fused_experts( - a, - w1, - w2, - 
topk_weights, - topk_ids, - quant_config=quant_config, - ) - - def run_cutlass_moe_fp8( - a: torch.Tensor, - w1: torch.Tensor, - w2: torch.Tensor, - topk_weights: torch.Tensor, - topk_ids: torch.Tensor, - ab_strides1: torch.Tensor, - ab_strides2: torch.Tensor, - c_strides1: torch.Tensor, - c_strides2: torch.Tensor, - w1_scale: torch.Tensor, - w2_scale: torch.Tensor, - a1_scale: torch.Tensor, - a2_scale: torch.Tensor, - num_repeats: int, - ): - quant_config = fp8_w8a8_moe_quant_config( - w1_scale=w1_scale, - w2_scale=w2_scale, - a1_scale=a1_scale, - a2_scale=a2_scale, - per_act_token_quant=per_act_token, - per_out_ch_quant=per_out_ch, - ) - - for _ in range(num_repeats): - with nvtx.annotate("cutlass_moe_fp8", color="blue"): - cutlass_moe_fp8( - a=a, - w1_q=w1, - w2_q=w2, - topk_weights=topk_weights, - topk_ids=topk_ids, - ab_strides1=ab_strides1, - ab_strides2=ab_strides2, - c_strides1=c_strides1, - c_strides2=c_strides2, - quant_config=quant_config, - activation="silu", - global_num_experts=num_experts, - ) - # Pre-create quantization config to avoid creating it inside CUDA graph quant_config = fp8_w8a8_moe_quant_config( w1_scale=w1_scale, @@ -209,23 +135,30 @@ def bench_run( per_out_ch_quant=per_out_ch, ) + fn = mk.FusedMoEModularKernel( + MoEPrepareAndFinalizeNoEP(), + CutlassExpertsFp8( + out_dtype=a.dtype, + e=num_experts, + n=n, + k=k, + quant_config=quant_config, + device=w1.device, + ), + ) + # Create CUDA graphs for CUTLASS (match benchmark_moe.py pattern exactly) cutlass_stream = torch.cuda.Stream() cutlass_graph = torch.cuda.CUDAGraph() with torch.cuda.graph(cutlass_graph, stream=cutlass_stream): # Capture 10 invocations like benchmark_moe.py for _ in range(10): - cutlass_moe_fp8( - a=a, - w1_q=w1_fp8q_cutlass, - w2_q=w2_fp8q_cutlass, - topk_weights=topk_weights, - topk_ids=topk_ids, - ab_strides1=ab_strides1, - ab_strides2=ab_strides2, - c_strides1=c_strides1, - c_strides2=c_strides2, - quant_config=quant_config, + fn( + a, + w1_fp8q_cutlass, + 
w2_fp8q_cutlass, + topk_weights, + topk_ids, activation="silu", global_num_experts=num_experts, ) @@ -297,6 +230,10 @@ def bench_run( def main(args): + # Initialize workspace manager (required for CUTLASS MoE kernels) + device = torch.device("cuda:0") + init_workspace_manager(device) + print("Benchmarking models:") for i, model in enumerate(args.models): print(f"[{i}] {model}") diff --git a/benchmarks/kernels/benchmark_cutlass_fp4_moe.py b/benchmarks/kernels/benchmark_cutlass_moe_nvfp4.py similarity index 92% rename from benchmarks/kernels/benchmark_cutlass_fp4_moe.py rename to benchmarks/kernels/benchmark_cutlass_moe_nvfp4.py index 7982cbb1422c5e4dda0f8aff07a3bad91f85ad0a..10a3e3eab5fd397ec6a148290397b97f3d1db17d 100644 --- a/benchmarks/kernels/benchmark_cutlass_fp4_moe.py +++ b/benchmarks/kernels/benchmark_cutlass_moe_nvfp4.py @@ -11,16 +11,23 @@ import nvtx import torch import torch.utils.benchmark as benchmark +import vllm.model_executor.layers.fused_moe.modular_kernel as mk from vllm import _custom_ops as ops from vllm.config import ParallelConfig, VllmConfig, set_current_vllm_config from vllm.model_executor.layers.fused_moe.config import ( fp8_w8a8_moe_quant_config, nvfp4_moe_quant_config, ) -from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp4 +from vllm.model_executor.layers.fused_moe.cutlass_moe import ( + CutlassExpertsFp4, +) from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk +from vllm.model_executor.layers.fused_moe.prepare_finalize import ( + MoEPrepareAndFinalizeNoEP, +) from vllm.scalar_type import scalar_types from vllm.utils.argparse_utils import FlexibleArgumentParser +from vllm.v1.worker.workspace import init_workspace_manager WEIGHT_SHAPES_MOE = { "nvidia/DeepSeek-R1-FP4": [ @@ -187,19 +194,24 @@ def bench_run( g1_alphas=w1_gs, g2_alphas=w2_gs, ) + + kernel = mk.FusedMoEModularKernel( + MoEPrepareAndFinalizeNoEP(defer_input_quant=True), + CutlassExpertsFp4( + out_dtype=dtype, + 
max_experts_per_worker=e, + quant_config=quant_config, + ), + ) + for _ in range(num_repeats): with nvtx.annotate("cutlass_moe_fp4", color="green"): - cutlass_moe_fp4( - a=a, - w1_fp4=w1_fp4, - w2_fp4=w2_fp4, + kernel( + hidden_states=a, + w1=w1_fp4, + w2=w2_fp4, topk_weights=topk_weights, topk_ids=topk_ids, - m=m, - n=n, - k=k, - e=num_experts, - quant_config=quant_config, ) def run_cutlass_from_graph( @@ -229,20 +241,24 @@ def bench_run( g2_alphas=w2_gs, ) + kernel = mk.FusedMoEModularKernel( + MoEPrepareAndFinalizeNoEP(defer_input_quant=True), + CutlassExpertsFp4( + out_dtype=dtype, + max_experts_per_worker=e, + quant_config=quant_config, + ), + ) + with set_current_vllm_config( VllmConfig(parallel_config=ParallelConfig(pipeline_parallel_size=1)) ): - return cutlass_moe_fp4( - a=a, - w1_fp4=w1_fp4, - w2_fp4=w2_fp4, + return kernel( + hidden_states=a, + w1=w1_fp4, + w2=w2_fp4, topk_weights=topk_weights, topk_ids=topk_ids, - m=m, - n=n, - k=k, - e=num_experts, - quant_config=quant_config, ) def run_triton_from_graph( @@ -441,6 +457,10 @@ def bench_run( def main(args): + # Initialize workspace manager (required for CUTLASS MoE kernels) + device = torch.device("cuda:0") + init_workspace_manager(device) + print("Benchmarking models:") for i, model in enumerate(args.models): print(f"[{i}] {model}") diff --git a/benchmarks/kernels/benchmark_device_communicators.py b/benchmarks/kernels/benchmark_device_communicators.py index b414efa6e330bca5f1f5e5604c60fa357d9803bc..7b453fe7b6809957fabd9bfb772ecec98ee55999 100644 --- a/benchmarks/kernels/benchmark_device_communicators.py +++ b/benchmarks/kernels/benchmark_device_communicators.py @@ -293,7 +293,7 @@ class CommunicatorBenchmark: graph = torch.cuda.CUDAGraph() graph_pool = torch.cuda.graph_pool_handle() set_graph_pool_id(graph_pool) - with torch.cuda.graph(graph, pool=graph_pool): + with torch.cuda.graph(graph, pool=graph_pool, stream=stream): for _ in range(CUDA_GRAPH_CAPTURE_CYCLES): allreduce_fn(graph_input) diff --git 
a/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py b/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py index 9b426d8d5f778a483aab12fa82d451460b940274..b30a1263878bed39fef449fc4521053bb0a3136f 100644 --- a/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py +++ b/benchmarks/kernels/benchmark_grouped_gemm_cutlass.py @@ -5,15 +5,20 @@ import torch import torch.utils.benchmark as benchmark from benchmark_shapes import WEIGHT_SHAPES_MOE +import vllm.model_executor.layers.fused_moe.modular_kernel as mk from vllm import _custom_ops as ops from vllm.config import ParallelConfig, VllmConfig, set_current_vllm_config from vllm.model_executor.layers.fused_moe.config import fp8_w8a8_moe_quant_config -from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp8 +from vllm.model_executor.layers.fused_moe.cutlass_moe import CutlassExpertsFp8 from vllm.model_executor.layers.fused_moe.fused_moe import ( fused_experts, fused_topk, ) +from vllm.model_executor.layers.fused_moe.prepare_finalize import ( + MoEPrepareAndFinalizeNoEP, +) from vllm.utils.argparse_utils import FlexibleArgumentParser +from vllm.v1.worker.workspace import init_workspace_manager DEFAULT_MODELS = [ "mistralai/Mixtral-8x7B-Instruct-v0.1", @@ -44,6 +49,7 @@ def bench_run( per_out_ch: bool, mkn: tuple[int, int, int], ): + init_workspace_manager(torch.cuda.current_device()) label = "Quant Matmul" sub_label = ( @@ -81,11 +87,6 @@ def bench_run( a, score, topk, renormalize=False ) - ab_strides1 = torch.full((num_experts,), k, device="cuda", dtype=torch.int64) - ab_strides2 = torch.full((num_experts,), n, device="cuda", dtype=torch.int64) - c_strides1 = torch.full((num_experts,), 2 * n, device="cuda", dtype=torch.int64) - c_strides2 = torch.full((num_experts,), k, device="cuda", dtype=torch.int64) - def run_triton_moe( a: torch.Tensor, w1: torch.Tensor, @@ -119,10 +120,6 @@ def bench_run( w2: torch.Tensor, w1_scale: torch.Tensor, w2_scale: torch.Tensor, - ab_strides1: torch.Tensor, - 
ab_strides2: torch.Tensor, - c_strides1: torch.Tensor, - c_strides2: torch.Tensor, topk_weights: torch.Tensor, topk_ids: torch.Tensor, per_act_token: bool, @@ -134,31 +131,29 @@ def bench_run( per_act_token_quant=per_act_token, ) - for _ in range(num_repeats): - cutlass_moe_fp8( - a, - w1, - w2, - topk_weights, - topk_ids, - ab_strides1, - ab_strides2, - c_strides1, - c_strides2, + fn = mk.FusedMoEModularKernel( + MoEPrepareAndFinalizeNoEP(), + CutlassExpertsFp8( + out_dtype=a.dtype, + # NOTE(rob): w2 is shaped as [E, hidden, intermediate] + e=w2.shape[0], + n=w2.shape[2], + k=w2.shape[1], quant_config=quant_config, - ) + device=w1.device, + ), + ) + + for _ in range(num_repeats): + fn(a, w1, w2, topk_weights, topk_ids) def run_cutlass_from_graph( a: torch.Tensor, a_scale: torch.Tensor, - w1_q: torch.Tensor, - w2_q: torch.Tensor, + w1: torch.Tensor, + w2: torch.Tensor, w1_scale: torch.Tensor, w2_scale: torch.Tensor, - ab_strides1: torch.Tensor, - ab_strides2: torch.Tensor, - c_strides1: torch.Tensor, - c_strides2: torch.Tensor, topk_weights: torch.Tensor, topk_ids: torch.Tensor, ): @@ -168,21 +163,23 @@ def bench_run( per_act_token_quant=per_act_token, ) + fn = mk.FusedMoEModularKernel( + MoEPrepareAndFinalizeNoEP(), + CutlassExpertsFp8( + out_dtype=a.dtype, + # NOTE(rob): w2 is shaped as [E, hidden, intermediate] + e=w2.shape[0], + n=w2.shape[2], + k=w2.shape[1], + quant_config=quant_config, + device=w1.device, + ), + ) + with set_current_vllm_config( VllmConfig(parallel_config=ParallelConfig(pipeline_parallel_size=1)) ): - return cutlass_moe_fp8( - a, - w1_q, - w2_q, - topk_weights, - topk_ids, - ab_strides1, - ab_strides2, - c_strides1, - c_strides2, - quant_config=quant_config, - ) + return fn(a, w1, w2, topk_weights, topk_ids) def run_triton_from_graph( a: torch.Tensor, @@ -226,10 +223,6 @@ def bench_run( w2_q, w1_scale, w2_scale, - ab_strides1, - ab_strides2, - c_strides1, - c_strides2, topk_weights, topk_ids, ) @@ -267,10 +260,6 @@ def bench_run( "w1_scale": 
w1_scale, "w2_scale": w2_scale, "per_act_token": per_act_token, - "ab_strides1": ab_strides1, - "ab_strides2": ab_strides2, - "c_strides1": c_strides1, - "c_strides2": c_strides2, # cuda graph params "cutlass_graph": cutlass_graph, "triton_graph": triton_graph, @@ -329,10 +318,6 @@ def bench_run( w2_q, w1_scale, w2_scale, - ab_strides1, - ab_strides2, - c_strides1, - c_strides2, topk_weights, topk_ids, per_act_token, @@ -341,7 +326,7 @@ def bench_run( results.append( benchmark.Timer( - stmt="run_cutlass_moe(a, a_scale, w1_q, w2_q, w1_scale, w2_scale, ab_strides1, ab_strides2, c_strides1, c_strides2, topk_weights, topk_ids, per_act_token, num_runs)", # noqa: E501 + stmt="run_cutlass_moe(a, a_scale, w1_q, w2_q, w1_scale, w2_scale, topk_weights, topk_ids, per_act_token, num_runs)", # noqa: E501 globals=globals, label=label, sub_label=sub_label, @@ -364,6 +349,10 @@ def bench_run( def main(args): + # Initialize workspace manager (required for CUTLASS MoE kernels) + device = torch.device("cuda:0") + init_workspace_manager(device) + print("Benchmarking models:") for i, model in enumerate(args.models): print(f"[{i}] {model}") diff --git a/benchmarks/kernels/benchmark_layernorm.py b/benchmarks/kernels/benchmark_layernorm.py index 6fa5c248670e32cdd672e45e016d3c47d9689577..2292d2f87288f267082e7106e20270f6f8e17bbf 100644 --- a/benchmarks/kernels/benchmark_layernorm.py +++ b/benchmarks/kernels/benchmark_layernorm.py @@ -6,9 +6,8 @@ import time import torch from vllm.model_executor.layers.layernorm import RMSNorm -from vllm.platforms import current_platform from vllm.utils.argparse_utils import FlexibleArgumentParser -from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE +from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed @torch.inference_mode() @@ -22,7 +21,7 @@ def main( num_warmup_iters: int = 5, num_iters: int = 100, ) -> None: - current_platform.seed_everything(seed) + set_random_seed(seed) torch.set_default_device("cuda") layer = 
RMSNorm(hidden_size).to(dtype=dtype) diff --git a/benchmarks/kernels/benchmark_moe.py b/benchmarks/kernels/benchmark_moe.py index 67b033ca7b708fd0d364bcea7709e9d2dd38eaa6..35f00afbc4f9cf6acc03c32b69dcfb3d3ef7c1ee 100644 --- a/benchmarks/kernels/benchmark_moe.py +++ b/benchmarks/kernels/benchmark_moe.py @@ -2,6 +2,7 @@ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project import argparse +import gc import json import os import time @@ -22,10 +23,49 @@ from vllm.model_executor.layers.fused_moe.fused_moe import * from vllm.transformers_utils.config import get_config from vllm.triton_utils import triton from vllm.utils.argparse_utils import FlexibleArgumentParser +from vllm.utils.torch_utils import set_random_seed # Remove the global current_platform import; import it locally where needed instead # FP8_DTYPE = current_platform.fp8_dtype() +# Default interval for clearing Triton JIT cache during tuning +# Set to 0 to disable automatic cache clearing +_CACHE_CLEAR_INTERVAL_ENV = "VLLM_MOE_TUNE_CACHE_CLEAR_INTERVAL" +TRITON_CACHE_CLEAR_INTERVAL = int(os.environ.get(_CACHE_CLEAR_INTERVAL_ENV, "50")) + + +def clear_triton_cache(): + """Clear Triton JIT compilation cache and Python/CUDA memory. + + This helps prevent OOM during tuning with large models (many experts). 
+ """ + # Force Python garbage collection + gc.collect() + + # Clear CUDA memory cache + if torch.cuda.is_available(): + torch.cuda.empty_cache() + + # Try to clear Triton's runtime cache + try: + if ( + hasattr(triton, "runtime") + and hasattr(triton.runtime, "cache") + and hasattr(triton.runtime.cache, "clear") + ): + triton.runtime.cache.clear() + except ImportError: + # Triton not installed, skip cache clearing + pass + except AttributeError: + # Triton version doesn't have expected cache API + pass + except Exception as e: + print(f"Warning: Failed to clear Triton cache: {e}") + + # Additional garbage collection after clearing caches + gc.collect() + def ensure_divisibility(numerator, denominator, text): """Ensure that numerator is divisible by the denominator.""" @@ -454,7 +494,8 @@ class BenchmarkWorker: pass else: torch.set_default_device("cuda:"+ str(device_id)) - current_platform.seed_everything(seed) + + set_random_seed(seed) self.seed = seed # Store the logical device ID for Ray self.device_id = device_id @@ -475,7 +516,10 @@ class BenchmarkWorker: ) -> tuple[dict[str, int], float]: # Import current_platform locally from vllm.platforms import current_platform - current_platform.seed_everything(self.seed) + + from vllm.model_executor.layers.fused_moe.fused_moe import get_moe_configs, get_default_config + + set_random_seed(self.seed) dtype_str = _get_config_dtype_str( dtype, use_int8_w8a16=use_int8_w8a16, use_fp8_w8a8=use_fp8_w8a8 ) @@ -560,7 +604,7 @@ class BenchmarkWorker: need_device_guard = True with torch.cuda.device(self.device_id) if need_device_guard else nullcontext(): - for config in tqdm(search_space): + for idx, config in enumerate(tqdm(search_space)): try: kernel_time = benchmark_config( config, @@ -583,6 +627,19 @@ class BenchmarkWorker: if kernel_time < best_time: best_time = kernel_time best_config = config + + # Periodically clear Triton JIT cache to prevent OOM + # This is especially important for large models with many experts + if ( 
+ 
TRITON_CACHE_CLEAR_INTERVAL > 0 + and idx > 0 + and idx % TRITON_CACHE_CLEAR_INTERVAL == 0 + ): + clear_triton_cache() + + # Final cleanup after tuning completes + clear_triton_cache() + now = datetime.now() print(f"{now.ctime()}] Completed tuning for batch_size={num_tokens}") assert best_config is not None diff --git a/benchmarks/kernels/benchmark_moe_permute_unpermute.py b/benchmarks/kernels/benchmark_moe_permute_unpermute.py index b8913a217c608c5b4109d4f2181dae20e72a4d0b..77b77a15b53af8cb63867bc3f88505a8f6f838de 100644 --- a/benchmarks/kernels/benchmark_moe_permute_unpermute.py +++ b/benchmarks/kernels/benchmark_moe_permute_unpermute.py @@ -18,6 +18,7 @@ from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import ( from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize from vllm.platforms import current_platform from vllm.utils.argparse_utils import FlexibleArgumentParser +from vllm.utils.torch_utils import set_random_seed FP8_DTYPE = current_platform.fp8_dtype() @@ -261,7 +262,7 @@ def benchmark_unpermute( class BenchmarkWorker: def __init__(self, seed: int) -> None: torch.set_default_device("cuda") - current_platform.seed_everything(seed) + set_random_seed(seed) self.seed = seed # Get the device ID to allocate tensors and kernels # on the respective GPU. 
This is required for Ray to work @@ -279,7 +280,7 @@ class BenchmarkWorker: use_int8_w8a16: bool, use_customized_permute: bool = False, ) -> tuple[dict[str, int], float]: - current_platform.seed_everything(self.seed) + set_random_seed(self.seed) permute_time = benchmark_permute( num_tokens, diff --git a/benchmarks/kernels/benchmark_mrope.py b/benchmarks/kernels/benchmark_mrope.py index 09de5fa822f86802df996390852b4319e72f4598..3e03651357784bfd3c1d539a3eeafd76c54d311a 100644 --- a/benchmarks/kernels/benchmark_mrope.py +++ b/benchmarks/kernels/benchmark_mrope.py @@ -37,9 +37,9 @@ import numpy as np import torch from vllm.model_executor.layers.rotary_embedding import get_rope -from vllm.platforms import current_platform from vllm.transformers_utils.config import get_config from vllm.utils.argparse_utils import FlexibleArgumentParser +from vllm.utils.torch_utils import set_random_seed device = torch.device("cuda" if torch.cuda.is_available() else "cpu") @@ -94,7 +94,7 @@ def benchmark_mrope( benchmark_iter: int = 100, csv_writer=None, ): - current_platform.seed_everything(seed) + set_random_seed(seed) torch.set_default_device(device) # the parameters to compute the q k v size based on tp_size mrope_helper_class = get_rope( diff --git a/benchmarks/kernels/benchmark_paged_attention.py b/benchmarks/kernels/benchmark_paged_attention.py index 9affd4c182ff6ea134af30ba488fab5dd64d7c72..b844f1e29fd16156e099e536e8acd0db3b8e93d8 100644 --- a/benchmarks/kernels/benchmark_paged_attention.py +++ b/benchmarks/kernels/benchmark_paged_attention.py @@ -13,6 +13,7 @@ from vllm.utils.argparse_utils import FlexibleArgumentParser from vllm.utils.torch_utils import ( STR_DTYPE_TO_TORCH_DTYPE, create_kv_caches_with_random, + set_random_seed, ) import vllm.envs as envs @@ -39,7 +40,7 @@ def main( device: str = "cuda", kv_cache_dtype: str | None = None, ) -> None: - current_platform.seed_everything(seed) + set_random_seed(seed) scale = float(1.0 / (head_size**0.5)) query = torch.empty( diff 
--git a/benchmarks/kernels/benchmark_quant.py b/benchmarks/kernels/benchmark_quant.py index 3c2ac9128947af0f33aad790b811bc3f56afae10..9a21cfe94e5be1d69114fe049a6f8167eaf36592 100644 --- a/benchmarks/kernels/benchmark_quant.py +++ b/benchmarks/kernels/benchmark_quant.py @@ -6,9 +6,8 @@ import time import torch from vllm import _custom_ops as ops -from vllm.platforms import current_platform from vllm.utils.argparse_utils import FlexibleArgumentParser -from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE +from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed @torch.inference_mode() @@ -23,7 +22,7 @@ def main( num_warmup_iters: int = 5, num_iters: int = 100, ) -> None: - current_platform.seed_everything(seed) + set_random_seed(seed) torch.set_default_device("cuda") x = torch.randn(num_tokens, hidden_size, dtype=dtype) diff --git a/benchmarks/kernels/benchmark_reshape_and_cache.py b/benchmarks/kernels/benchmark_reshape_and_cache.py index 0d3aef0c630b284feb8a8b4e3e35cdfa62237f39..99067d8ac3710fc7f86dcd3017b3a8ea218426de 100644 --- a/benchmarks/kernels/benchmark_reshape_and_cache.py +++ b/benchmarks/kernels/benchmark_reshape_and_cache.py @@ -8,11 +8,11 @@ from tabulate import tabulate from vllm import _custom_ops as ops from vllm.logger import init_logger -from vllm.platforms import current_platform from vllm.utils.argparse_utils import FlexibleArgumentParser from vllm.utils.torch_utils import ( STR_DTYPE_TO_TORCH_DTYPE, create_kv_caches_with_random, + set_random_seed, ) logger = init_logger(__name__) @@ -36,7 +36,7 @@ def run_benchmark( if kv_cache_dtype == "fp8" and head_size % 16: raise ValueError("fp8 kv-cache requires head_size to be a multiple of 16.") - current_platform.seed_everything(42) + set_random_seed(42) torch.set_default_device(device) # create random key / value tensors [T, H, D]. 
diff --git a/benchmarks/kernels/benchmark_reshape_and_cache_flash.py b/benchmarks/kernels/benchmark_reshape_and_cache_flash.py index 12f17ea575d9448b1cd7ad3e900e4a86a7594703..ef6be1f3c3597c9d4922b6bba8ad4128fecfbd0a 100644 --- a/benchmarks/kernels/benchmark_reshape_and_cache_flash.py +++ b/benchmarks/kernels/benchmark_reshape_and_cache_flash.py @@ -7,15 +7,15 @@ import torch from tabulate import tabulate from vllm import _custom_ops as ops -from vllm.attention.ops.triton_reshape_and_cache_flash import ( - triton_reshape_and_cache_flash, -) from vllm.logger import init_logger -from vllm.platforms import current_platform from vllm.utils.argparse_utils import FlexibleArgumentParser from vllm.utils.torch_utils import ( STR_DTYPE_TO_TORCH_DTYPE, create_kv_caches_with_random_flash, + set_random_seed, +) +from vllm.v1.attention.ops.triton_reshape_and_cache_flash import ( + triton_reshape_and_cache_flash, ) logger = init_logger(__name__) @@ -49,7 +49,7 @@ def run_benchmark( if implementation == "triton" and kv_cache_layout == "HND": return float("nan") # Triton does not support HND layout yet. - current_platform.seed_everything(42) + set_random_seed(42) torch.set_default_device(device) # create random key / value tensors [T, H, D]. 
diff --git a/benchmarks/kernels/benchmark_silu_mul_fp8_quant.py b/benchmarks/kernels/benchmark_silu_mul_fp8_quant.py index de01ff197eab7593be32056b2cca1c9a0a3fe060..da32bc30cb2ae3b385b79c852334f1594a4fe52d 100644 --- a/benchmarks/kernels/benchmark_silu_mul_fp8_quant.py +++ b/benchmarks/kernels/benchmark_silu_mul_fp8_quant.py @@ -23,9 +23,9 @@ import torch from vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe import ( persistent_masked_m_silu_mul_quant, ) -from vllm.platforms import current_platform from vllm.triton_utils import tl, triton from vllm.utils.deep_gemm import is_deep_gemm_e8m0_used +from vllm.utils.torch_utils import set_random_seed @triton.jit @@ -207,7 +207,7 @@ def benchmark( ): def generate_data(seed_offset=0): """Generate input data with given seed offset""" - current_platform.seed_everything(42 + seed_offset) + set_random_seed(42 + seed_offset) y = torch.rand((E, T, 2 * H), dtype=torch.bfloat16, device="cuda").contiguous() if gen_strategy == "random_imbalanced": diff --git a/benchmarks/kernels/cpu/benchmark_cpu_attn.py b/benchmarks/kernels/cpu/benchmark_cpu_attn.py new file mode 100644 index 0000000000000000000000000000000000000000..30b86039537390c27a43e9f66b723b806fd06d21 --- /dev/null +++ b/benchmarks/kernels/cpu/benchmark_cpu_attn.py @@ -0,0 +1,272 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import functools +import time + +import numpy as np +import torch + +from vllm._custom_ops import ( + cpu_attention_with_kv_cache, + cpu_attn_get_scheduler_metadata, + cpu_attn_reshape_and_cache, +) +from vllm.platforms import CpuArchEnum, current_platform +from vllm.utils.argparse_utils import FlexibleArgumentParser +from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE +from vllm.v1.attention.backends.cpu_attn import CPUAttentionBackend, _get_attn_isa + + +def get_attn_isa( + block_size: int | None = None, + dtype: torch.dtype | None = None, +): + if block_size and 
dtype: + return _get_attn_isa(dtype, block_size) + else: + if current_platform.get_cpu_architecture() == CpuArchEnum.ARM: + return "neon" + elif torch._C._cpu._is_amx_tile_supported(): + return "amx" + else: + return "vec" + + +# rand number generation takes too much time, cache rand tensors +@functools.lru_cache(maxsize=128, typed=False) +def tensor_cache( + elem_num: int, + dtype: torch.dtype, +) -> torch.Tensor: + tensor = torch.randn(elem_num, dtype=dtype) + return tensor + + +@torch.inference_mode() +def main( + seq_lens: list[tuple[int, int]], + num_heads: tuple[int, int], + head_size: int, + sliding_window: int = None, + dtype: torch.dtype = torch.bfloat16, + block_size: int = 128, + num_blocks: int = 4096, + use_sink: bool = False, + enable_kv_split: bool = False, + isa: str | None = None, + seed: int = 0, + iters: int = 20, +) -> None: + current_platform.seed_everything(seed) + num_seqs = len(seq_lens) + query_lens = [x[0] for x in seq_lens] + kv_lens = [x[1] for x in seq_lens] + num_query_heads = num_heads[0] + num_kv_heads = num_heads[1] + assert num_query_heads % num_kv_heads == 0 + max_kv_len = max(kv_lens) + window_size = (sliding_window - 1, 0) if sliding_window is not None else (-1, -1) + scale = head_size**-0.5 + token_num = sum(query_lens) + + if isa is None: + isa = get_attn_isa(block_size, dtype) + + s_aux = ( + 15 * torch.rand((num_query_heads,), dtype=torch.bfloat16) if use_sink else None + ) + + query = tensor_cache( + elem_num=token_num * num_query_heads * head_size, + dtype=dtype, + ) + query = query.view( + token_num, + num_query_heads, + head_size, + ) + + key_value = tensor_cache( + elem_num=2 * num_blocks * num_kv_heads * block_size * head_size, + dtype=dtype, + ) + key_value = key_value.view( + 2, + num_blocks, + block_size, + num_kv_heads, + head_size, + ) + key_cache, value_cache = key_value.unbind(0) + + # KV cache for CPU attention + packed_key_cache = torch.empty( + num_blocks, num_kv_heads, block_size, head_size, dtype=dtype + ) 
+ packed_value_cache = torch.empty_like(packed_key_cache) + + cu_query_lens = torch.tensor([0] + query_lens, dtype=torch.int32).cumsum( + dim=0, dtype=torch.int32 + ) + kv_lens_tensor = torch.tensor(kv_lens, dtype=torch.int32) + max_num_blocks_per_seq = (max_kv_len + block_size - 1) // block_size + block_tables = torch.randint( + 0, num_blocks, (num_seqs, max_num_blocks_per_seq), dtype=torch.int32 + ) + + # use reshape_and_cache to pack key_cache and value_cache + slot_mapping = torch.arange(0, num_blocks * block_size, dtype=torch.int64) + cpu_attn_reshape_and_cache( + key=key_cache.view(-1, num_kv_heads, head_size), + value=value_cache.view(-1, num_kv_heads, head_size), + key_cache=packed_key_cache, + value_cache=packed_value_cache, + slot_mapping=slot_mapping, + isa=isa, + ) + + metadata = cpu_attn_get_scheduler_metadata( + num_reqs=num_seqs, + num_heads=num_query_heads, + num_kv_heads=num_kv_heads, + head_dim=head_size, + seq_lens=kv_lens_tensor, + dtype=dtype, + query_start_loc=cu_query_lens, + causal=True, + sliding_window_size=sliding_window if sliding_window is not None else -1, + isa=isa, + enable_kv_split=enable_kv_split, + ) + + out_with_split = torch.empty_like(query) + + def run_benchmark(iters: int) -> list[float]: + times = [] + for _ in range(iters): + start_time = time.perf_counter_ns() + cpu_attention_with_kv_cache( + query=query, + key_cache=packed_key_cache, + value_cache=packed_value_cache, + output=out_with_split, + query_start_loc=cu_query_lens, + seq_lens=kv_lens_tensor, + scale=scale, + causal=True, + alibi_slopes=None, + sliding_window=window_size, + block_table=block_tables, + softcap=0, + scheduler_metadata=metadata, + s_aux=s_aux, + ) + end_time = time.perf_counter_ns() + times.append((end_time - start_time) / 1e6) + return times + + # warmup + run_benchmark(5) + # benchmark + times = run_benchmark(iters) + + time_min = min(times) + time_max = max(times) + time_mean = np.mean(times) + time_std = np.std(times) + + print("\tmin (ms) = ", 
time_min) + print("\tmax (ms) = ", time_max) + print("\tmean (ms) = ", time_mean) + print("\tstd = ", time_std) + print("\tmedian (ms) = ", np.median(times)) + + +def generate_seq_lens( + batch_size: int, + q_len_min: int, + q_len_max: int, + kv_len_min: int, + kv_len_max: int, + seed: int = 0, +) -> list[tuple[int, int]]: + assert 1 <= q_len_min <= q_len_max + assert 1 <= kv_len_min <= kv_len_max + assert kv_len_max >= q_len_min + + g = torch.Generator(device="cpu").manual_seed(seed) + + def rint(lo: int, hi: int) -> int: + return torch.randint(lo, hi + 1, (1,), generator=g).item() + + seq_lens: list[tuple[int, int]] = [] + for _ in range(batch_size): + # ensure q <= kv + kv = rint(max(kv_len_min, q_len_min), kv_len_max) + q = rint(q_len_min, min(q_len_max, kv)) + seq_lens.append((q, kv)) + + return seq_lens + + +if __name__ == "__main__": + parser = FlexibleArgumentParser(description="Benchmark the paged attention kernel.") + parser.add_argument("--batch-size", type=int, default=64) + parser.add_argument("--q-len-min", type=int, default=512) + parser.add_argument("--q-len-max", type=int, default=512) + parser.add_argument("--kv-len-min", type=int, default=512) + parser.add_argument("--kv-len-max", type=int, default=512) + parser.add_argument("--num-blocks", type=int, default=4096) + + parser.add_argument("--sliding-window", type=int, default=None) + parser.add_argument("--num-query-heads", type=int, default=32) + parser.add_argument("--num-kv-heads", type=int, default=8) + parser.add_argument( + "--head-size", + type=int, + choices=CPUAttentionBackend.get_supported_head_sizes(), + default=128, + ) + parser.add_argument("--enable-kv-split", action="store_true") + parser.add_argument("--block-size", type=int, choices=[32, 64, 128], default=128) + parser.add_argument( + "--dtype", type=str, choices=["half", "bfloat16", "float"], default="bfloat16" + ) + parser.add_argument("--use-sink", action="store_true") + parser.add_argument( + "--isa", type=str, choices=["vec", 
"neon", "amx", "vec16"], default=None + ) + parser.add_argument("--seed", type=int, default=0) + parser.add_argument("--iters", type=int, default=20) + + args = parser.parse_args() + print(args) + + seq_lens = generate_seq_lens( + args.batch_size, + args.q_len_min, + args.q_len_max, + args.kv_len_min, + args.kv_len_max, + args.seed, + ) + + print("batch (query len, kv len) = ", seq_lens) + + main( + seq_lens=seq_lens, + num_heads=(args.num_query_heads, args.num_kv_heads), + head_size=args.head_size, + sliding_window=args.sliding_window, + dtype=STR_DTYPE_TO_TORCH_DTYPE[args.dtype], + block_size=args.block_size, + num_blocks=args.num_blocks, + use_sink=args.use_sink, + enable_kv_split=args.enable_kv_split, + isa=args.isa + if args.isa is not None + else get_attn_isa(args.block_size, STR_DTYPE_TO_TORCH_DTYPE[args.dtype]), + seed=args.seed, + iters=args.iters, + ) diff --git a/benchmarks/kernels/cpu/benchmark_cpu_fused_moe.py b/benchmarks/kernels/cpu/benchmark_cpu_fused_moe.py new file mode 100644 index 0000000000000000000000000000000000000000..186b79ede0861bde51578d0704b419d4b50389ad --- /dev/null +++ b/benchmarks/kernels/cpu/benchmark_cpu_fused_moe.py @@ -0,0 +1,175 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import sys +import time + +import numpy as np +import torch + +from vllm.platforms import current_platform +from vllm.utils.argparse_utils import FlexibleArgumentParser + +# Check if CPU MoE operations are available +try: + from vllm._custom_ops import cpu_fused_moe, cpu_prepack_moe_weight +except (ImportError, AttributeError) as e: + print("ERROR: CPU fused MoE operations are not available on this platform.") + print("This benchmark requires x86 CPU with proper vLLM CPU extensions compiled.") + print( + "The cpu_fused_moe kernel is typically available on Linux x86_64 " + "with AVX2/AVX512." 
+ ) + print(f"Import error: {e}") + sys.exit(1) + +# ISA selection following test_cpu_fused_moe.py pattern +ISA_CHOICES = ["amx", "vec"] if torch._C._cpu._is_amx_tile_supported() else ["vec"] + + +@torch.inference_mode() +def main( + batch_size: int, + expert_num: int, + hidden_size: int, + intermediate_size: int, + topk_num: int, + use_bias: bool = False, + dtype: torch.dtype = torch.bfloat16, + activation: str = "silu", + isa: str = "vec", + seed: int = 0, + iters: int = 20, +) -> None: + current_platform.seed_everything(seed) + # up_dim = 2 * intermediate_size for gate + up projection + up_dim = 2 * intermediate_size + + input_tensor = torch.randn((batch_size, hidden_size), dtype=dtype) / ( + 0.5 * hidden_size**0.5 + ) + + w13 = torch.randn((expert_num, up_dim, hidden_size), dtype=dtype) / ( + 0.5 * hidden_size**0.5 + ) + w2 = torch.randn((expert_num, hidden_size, intermediate_size), dtype=dtype) / ( + 0.5 * intermediate_size**0.5 + ) + + w13_bias = None + w2_bias = None + if use_bias: + w13_bias = torch.randn((expert_num, up_dim), dtype=dtype) / (0.5 * up_dim**0.5) + w2_bias = torch.randn((expert_num, hidden_size), dtype=dtype) / ( + 0.5 * hidden_size**0.5 + ) + + router_logits = torch.randn((batch_size, expert_num), dtype=dtype) + score = torch.softmax(router_logits, dim=-1, dtype=torch.float32) + topk_weights, topk_ids = torch.topk(score, topk_num) + topk_ids = topk_ids.to(torch.int32) + + packed_w13 = cpu_prepack_moe_weight(w13, isa) + packed_w2 = cpu_prepack_moe_weight(w2, isa) + + def run_benchmark(iters: int) -> list[float]: + times = [] + for _ in range(iters): + start_time = time.perf_counter_ns() + _ = cpu_fused_moe( + input_tensor, + packed_w13, + packed_w2, + w13_bias, + w2_bias, + topk_weights, + topk_ids, + activation, + isa, + ) + end_time = time.perf_counter_ns() + times.append((end_time - start_time) / 1e6) + return times + + # warmup + run_benchmark(5) + # benchmark + times = run_benchmark(iters) + + if not times: + print("No iterations to 
measure. Set --iters > 0.") + return + + time_min = min(times) + time_max = max(times) + time_mean = np.mean(times) + time_std = np.std(times) + + print("\tmin (ms) = ", time_min) + print("\tmax (ms) = ", time_max) + print("\tmean (ms) = ", time_mean) + print("\tstd = ", time_std) + print("\tmedian (ms) = ", np.median(times)) + + # Calculate throughput metrics + # FLOPs estimation: 2 * batch * topk * (hidden * up_dim + intermediate * hidden) + flops_per_token = ( + 2 * topk_num * (hidden_size * up_dim + intermediate_size * hidden_size) + ) + total_flops = batch_size * flops_per_token + tflops = total_flops / (time_mean * 1e-3) / 1e12 + print(f"\tthroughput (TFLOP/s) = {tflops:.4f}") + + +if __name__ == "__main__": + parser = FlexibleArgumentParser(description="Benchmark the CPU fused MoE kernel.") + parser.add_argument("--batch-size", type=int, default=64) + parser.add_argument("--expert-num", type=int, default=8) + parser.add_argument("--hidden-size", type=int, default=2880) + parser.add_argument("--intermediate-size", type=int, default=2880) + parser.add_argument( + "--topk-num", + type=int, + default=None, + help="Number of experts to route each token to (default: expert_num // 2)", + ) + parser.add_argument("--use-bias", action="store_true") + parser.add_argument( + "--activation", + type=str, + choices=["silu", "swigluoai"], + default="silu", + help="Activation function", + ) + parser.add_argument( + "--isa", + type=str, + choices=ISA_CHOICES, + default=ISA_CHOICES[0], + help=f"ISA to use (available: {ISA_CHOICES})", + ) + parser.add_argument("--seed", type=int, default=0) + parser.add_argument("--iters", type=int, default=20) + + args = parser.parse_args() + + # Default topk_num to expert_num // 2, minimum 1 + topk_num = ( + args.topk_num if args.topk_num is not None else max(args.expert_num // 2, 1) + ) + + print(args) + + main( + batch_size=args.batch_size, + expert_num=args.expert_num, + hidden_size=args.hidden_size, + 
intermediate_size=args.intermediate_size, + topk_num=topk_num, + use_bias=args.use_bias, + dtype=torch.bfloat16, # Following test_cpu_fused_moe.py + activation=args.activation, + isa=args.isa, + seed=args.seed, + iters=args.iters, + ) diff --git a/cmake/cpu_extension.cmake b/cmake/cpu_extension.cmake index 85b286f8d8d0a18575592a41e8f93b6653eaa397..0af87fd7f0b5338df2bc3ce6e84d18cc55d24de9 100644 --- a/cmake/cpu_extension.cmake +++ b/cmake/cpu_extension.cmake @@ -330,7 +330,7 @@ if ((AVX512_FOUND AND NOT AVX512_DISABLED) OR (ASIMD_FOUND AND NOT APPLE_SILICON PUBLIC ${oneDNN_BINARY_DIR}/include PRIVATE ${oneDNN_SOURCE_DIR}/src ) - target_link_libraries(dnnl_ext dnnl) + target_link_libraries(dnnl_ext dnnl torch) target_compile_options(dnnl_ext PRIVATE ${CXX_COMPILE_FLAGS} -fPIC) list(APPEND LIBS dnnl_ext) set(USE_ONEDNN ON) @@ -358,13 +358,13 @@ set(VLLM_EXT_SRC "csrc/cpu/pos_encoding.cpp" "csrc/moe/dynamic_4bit_int_moe_cpu.cpp" "csrc/cpu/cpu_attn.cpp" - "csrc/cpu/scratchpad_manager.cpp" "csrc/cpu/torch_bindings.cpp") if (AVX512_FOUND AND NOT AVX512_DISABLED) set(VLLM_EXT_SRC "csrc/cpu/shm.cpp" "csrc/cpu/cpu_wna16.cpp" + "csrc/cpu/cpu_fused_moe.cpp" ${VLLM_EXT_SRC}) if (ENABLE_AVX512BF16 AND ENABLE_AVX512VNNI) set(VLLM_EXT_SRC diff --git a/cmake/external_projects/flashmla.cmake b/cmake/external_projects/flashmla.cmake index 2cf3c1a755d3c0b86623121eefb481644280c8a3..0d4f9b7aa07c8912697b9c8b482699cc446aa0b1 100644 --- a/cmake/external_projects/flashmla.cmake +++ b/cmake/external_projects/flashmla.cmake @@ -35,16 +35,21 @@ message(STATUS "FlashMLA is available at ${flashmla_SOURCE_DIR}") # sm90a set(SUPPORT_ARCHS) -if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.3) - list(APPEND SUPPORT_ARCHS 9.0a) +if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3) + list(APPEND SUPPORT_ARCHS "9.0a") endif() -if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8) - list(APPEND SUPPORT_ARCHS 10.0a) +if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.9) + # CUDA 
12.9 has introduced "Family-Specific Architecture Features" + # this supports all compute_10x family + list(APPEND SUPPORT_ARCHS "10.0f") +elseif(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8) + list(APPEND SUPPORT_ARCHS "10.0a") endif() cuda_archs_loose_intersection(FLASH_MLA_ARCHS "${SUPPORT_ARCHS}" "${CUDA_ARCHS}") if(FLASH_MLA_ARCHS) + message(STATUS "FlashMLA CUDA architectures: ${FLASH_MLA_ARCHS}") set(VLLM_FLASHMLA_GPU_FLAGS ${VLLM_GPU_FLAGS}) list(APPEND VLLM_FLASHMLA_GPU_FLAGS "--expt-relaxed-constexpr" "--expt-extended-lambda" "--use_fast_math") @@ -126,7 +131,8 @@ if(FLASH_MLA_ARCHS) $<$:-UPy_LIMITED_API> $<$:-UPy_LIMITED_API>) else() - # Create empty targets for setup.py when not targeting sm90a systems + message(STATUS "FlashMLA will not compile: unsupported CUDA architecture ${CUDA_ARCHS}") + # Create empty targets for setup.py on unsupported systems add_custom_target(_flashmla_C) add_custom_target(_flashmla_extension_C) endif() diff --git a/cmake/external_projects/qutlass.cmake b/cmake/external_projects/qutlass.cmake index 5a59a409999ad5d673eea5b3a67648bba2c43c5e..84bb1b00c1bba0fecb96ad2193587d9e52967040 100644 --- a/cmake/external_projects/qutlass.cmake +++ b/cmake/external_projects/qutlass.cmake @@ -31,10 +31,15 @@ if(NOT qutlass_SOURCE_DIR) endif() message(STATUS "[QUTLASS] QuTLASS is available at ${qutlass_SOURCE_DIR}") -cuda_archs_loose_intersection(QUTLASS_ARCHS "12.0a;10.0a" "${CUDA_ARCHS}") -if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND QUTLASS_ARCHS) +if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0) + cuda_archs_loose_intersection(QUTLASS_ARCHS "12.0a;10.0f" "${CUDA_ARCHS}") +else() + cuda_archs_loose_intersection(QUTLASS_ARCHS "12.0a;10.0a;10.3a" "${CUDA_ARCHS}") +endif() + +if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND QUTLASS_ARCHS) - if(QUTLASS_ARCHS MATCHES "10\\.0a") + if(QUTLASS_ARCHS MATCHES "10\\.(0a|3a|0f)") set(QUTLASS_TARGET_CC 100) elseif(QUTLASS_ARCHS MATCHES "12\\.0a") 
set(QUTLASS_TARGET_CC 120) diff --git a/cmake/external_projects/vllm_flash_attn.cmake b/cmake/external_projects/vllm_flash_attn.cmake index ff687e0af7b44d2caa1a83a98958a176f907dbf3..b51934a3ab29ac76b000ee5c5ba48b56ac86364e 100644 --- a/cmake/external_projects/vllm_flash_attn.cmake +++ b/cmake/external_projects/vllm_flash_attn.cmake @@ -38,7 +38,7 @@ else() FetchContent_Declare( vllm-flash-attn GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git - GIT_TAG 86f8f157cf82aa2342743752b97788922dd7de43 + GIT_TAG 188be16520ceefdc625fdf71365585d2ee348fe2 GIT_PROGRESS TRUE # Don't share the vllm-flash-attn build between build types BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn diff --git a/csrc/activation_kernels.cu b/csrc/activation_kernels.cu index 0ff4d6a4cd709e5b96175ce2c828d2f4c475f941..f30ec0c08a89ffe9a0e02b1ebfb11cbcb8f1bf89 100644 --- a/csrc/activation_kernels.cu +++ b/csrc/activation_kernels.cu @@ -15,19 +15,61 @@ __device__ __forceinline__ scalar_t compute(const scalar_t& x, const scalar_t& y) { return act_first ? ACT_FN(x) * y : x * ACT_FN(y); } -// Activation and gating kernel template. +// Check if all pointers are 16-byte aligned for int4 vectorized access +__device__ __forceinline__ bool is_16byte_aligned(const void* ptr) { + return (reinterpret_cast(ptr) & 15) == 0; +} + +// Activation and gating kernel template. 
template __global__ void act_and_mul_kernel( scalar_t* __restrict__ out, // [..., d] const scalar_t* __restrict__ input, // [..., 2, d] const int d) { + constexpr int VEC_SIZE = 16 / sizeof(scalar_t); const int64_t token_idx = blockIdx.x; - for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) { - const scalar_t x = VLLM_LDG(&input[token_idx * 2 * d + idx]); - const scalar_t y = VLLM_LDG(&input[token_idx * 2 * d + d + idx]); - out[token_idx * d + idx] = compute(x, y); + const scalar_t* x_ptr = input + token_idx * 2 * d; + const scalar_t* y_ptr = x_ptr + d; + scalar_t* out_ptr = out + token_idx * d; + + // Check alignment for 128-bit vectorized access. + // All three pointers must be 16-byte aligned for safe int4 operations. + const bool aligned = is_16byte_aligned(x_ptr) && is_16byte_aligned(y_ptr) && + is_16byte_aligned(out_ptr); + + if (aligned && d >= VEC_SIZE) { + // Fast path: 128-bit vectorized loop + const int4* x_vec = reinterpret_cast(x_ptr); + const int4* y_vec = reinterpret_cast(y_ptr); + int4* out_vec = reinterpret_cast(out_ptr); + const int num_vecs = d / VEC_SIZE; + const int vec_end = num_vecs * VEC_SIZE; + + for (int i = threadIdx.x; i < num_vecs; i += blockDim.x) { + int4 x = VLLM_LDG(&x_vec[i]), y = VLLM_LDG(&y_vec[i]), r; + auto* xp = reinterpret_cast(&x); + auto* yp = reinterpret_cast(&y); + auto* rp = reinterpret_cast(&r); +#pragma unroll + for (int j = 0; j < VEC_SIZE; j++) { + rp[j] = compute(xp[j], yp[j]); + } + out_vec[i] = r; + } + // Scalar cleanup for remaining elements + for (int i = vec_end + threadIdx.x; i < d; i += blockDim.x) { + out_ptr[i] = compute(VLLM_LDG(&x_ptr[i]), + VLLM_LDG(&y_ptr[i])); + } + } else { + // Scalar fallback for unaligned data or small d + for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) { + const scalar_t x = VLLM_LDG(&x_ptr[idx]); + const scalar_t y = VLLM_LDG(&y_ptr[idx]); + out_ptr[idx] = compute(x, y); + } } } @@ -120,50 +162,115 @@ template __global__ void act_and_mul_kernel_with_param( 
scalar_t* __restrict__ out, const scalar_t* __restrict__ input, const int d, const float param) { + constexpr int VEC_SIZE = 16 / sizeof(scalar_t); const int64_t token_idx = blockIdx.x; - for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) { - const scalar_t x = VLLM_LDG(&input[token_idx * 2 * d + idx]); - const scalar_t y = VLLM_LDG(&input[token_idx * 2 * d + d + idx]); - out[token_idx * d + idx] = ACT_FN(x, param) * y; + const scalar_t* x_ptr = input + token_idx * 2 * d; + const scalar_t* y_ptr = x_ptr + d; + scalar_t* out_ptr = out + token_idx * d; + + // Check alignment for 128-bit vectorized access + const bool aligned = is_16byte_aligned(x_ptr) && is_16byte_aligned(y_ptr) && + is_16byte_aligned(out_ptr); + + if (aligned && d >= VEC_SIZE) { + // Fast path: 128-bit vectorized loop + const int4* x_vec = reinterpret_cast(x_ptr); + const int4* y_vec = reinterpret_cast(y_ptr); + int4* out_vec = reinterpret_cast(out_ptr); + const int num_vecs = d / VEC_SIZE; + const int vec_end = num_vecs * VEC_SIZE; + + for (int i = threadIdx.x; i < num_vecs; i += blockDim.x) { + int4 x = VLLM_LDG(&x_vec[i]), y = VLLM_LDG(&y_vec[i]), r; + auto* xp = reinterpret_cast(&x); + auto* yp = reinterpret_cast(&y); + auto* rp = reinterpret_cast(&r); +#pragma unroll + for (int j = 0; j < VEC_SIZE; j++) { + rp[j] = ACT_FN(xp[j], param) * yp[j]; + } + out_vec[i] = r; + } + // Scalar cleanup for remaining elements + for (int i = vec_end + threadIdx.x; i < d; i += blockDim.x) { + out_ptr[i] = ACT_FN(VLLM_LDG(&x_ptr[i]), param) * VLLM_LDG(&y_ptr[i]); + } + } else { + // Scalar fallback for unaligned data or small d + for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) { + const scalar_t x = VLLM_LDG(&x_ptr[idx]); + const scalar_t y = VLLM_LDG(&y_ptr[idx]); + out_ptr[idx] = ACT_FN(x, param) * y; + } } } template __device__ __forceinline__ T swigluoai_and_mul(const T& gate, const T& up, float alpha, float limit) { - // clamp gate: min=None, max=limit - const float gate_f = 
(float)gate; - const float clamped_gate = gate_f > limit ? limit : gate_f; - - // clamp up: min=-limit, max=limit - const float up_f = (float)up; - const float clamped_up = - up_f > limit ? limit : (up_f < -limit ? -limit : up_f); - - // glu = gate * sigmoid(gate * alpha) - const float sigmoid_val = 1.0f / (1.0f + expf(-clamped_gate * alpha)); - const float glu = clamped_gate * sigmoid_val; - - // (up + 1) * glu - return (T)((clamped_up + 1.0f) * glu); + // Clamp gate to (-inf, limit] and up to [-limit, limit] + const float g = fminf((float)gate, limit); + const float u = fmaxf(fminf((float)up, limit), -limit); + // glu = gate * sigmoid(gate * alpha), then return (up + 1) * glu + return (T)((u + 1.0f) * g / (1.0f + expf(-g * alpha))); } +// Interleaved gate/up: input has [gate0, up0, gate1, up1, ...]. template __global__ void swigluoai_and_mul_kernel( scalar_t* __restrict__ out, // [..., d] - const scalar_t* __restrict__ input, // [..., 2, d] + const scalar_t* __restrict__ input, // [..., 2 * d] (interleaved) const int d, const float alpha, const float limit) { + // For interleaved data: input has 2*d elements per token (gate/up pairs) + // output has d elements per token + constexpr int VEC_SIZE = 16 / sizeof(scalar_t); + constexpr int PAIRS = VEC_SIZE / 2; // Number of gate/up pairs per int4 load const int64_t token_idx = blockIdx.x; - // TODO: Vectorize loads and stores. - for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) { - // gate = x[..., ::2] (even indices) - const scalar_t gate = VLLM_LDG(&input[token_idx * 2 * d + 2 * idx]); - // up = x[..., 1::2] (odd indices) - const scalar_t up = VLLM_LDG(&input[token_idx * 2 * d + 2 * idx + 1]); - - out[token_idx * d + idx] = ACT_FN(gate, up, alpha, limit); + const scalar_t* in_ptr = input + token_idx * 2 * d; + scalar_t* out_ptr = out + token_idx * d; + + // Check alignment for 128-bit vectorized access on input. + // For output we use int2 (64-bit) which has 8-byte alignment requirement. 
+ const bool in_aligned = is_16byte_aligned(in_ptr); + const bool out_aligned = + (reinterpret_cast(out_ptr) & 7) == 0; // 8-byte for int2 + + if (in_aligned && out_aligned && d >= PAIRS) { + // Fast path: vectorized loop + // Each int4 load gives VEC_SIZE elements = PAIRS gate/up pairs + // Each int2 store writes PAIRS output elements + const int4* in_vec = reinterpret_cast(in_ptr); + int2* out_vec = reinterpret_cast(out_ptr); + const int num_vecs = d / PAIRS; + const int vec_end = num_vecs * PAIRS; + + for (int i = threadIdx.x; i < num_vecs; i += blockDim.x) { + int4 v = VLLM_LDG(&in_vec[i]); + int2 r; + auto* vp = reinterpret_cast(&v); + auto* rp = reinterpret_cast(&r); +#pragma unroll + for (int j = 0; j < PAIRS; j++) { + rp[j] = ACT_FN(vp[2 * j], vp[2 * j + 1], alpha, limit); + } + out_vec[i] = r; + } + // Scalar cleanup for remaining elements + for (int i = vec_end + threadIdx.x; i < d; i += blockDim.x) { + out_ptr[i] = ACT_FN(VLLM_LDG(&in_ptr[2 * i]), + VLLM_LDG(&in_ptr[2 * i + 1]), alpha, limit); + } + } else { + // Scalar fallback for unaligned data or small d + for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) { + // gate = x[..., ::2] (even indices) + const scalar_t gate = VLLM_LDG(&in_ptr[2 * idx]); + // up = x[..., 1::2] (odd indices) + const scalar_t up = VLLM_LDG(&in_ptr[2 * idx + 1]); + out_ptr[idx] = ACT_FN(gate, up, alpha, limit); + } } } @@ -217,10 +324,41 @@ __global__ void activation_kernel( scalar_t* __restrict__ out, // [..., d] const scalar_t* __restrict__ input, // [..., d] const int d) { + constexpr int VEC_SIZE = 16 / sizeof(scalar_t); const int64_t token_idx = blockIdx.x; - for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) { - const scalar_t x = VLLM_LDG(&input[token_idx * d + idx]); - out[token_idx * d + idx] = ACT_FN(x); + const scalar_t* in_ptr = input + token_idx * d; + scalar_t* out_ptr = out + token_idx * d; + + // Check alignment for 128-bit vectorized access + const bool aligned = is_16byte_aligned(in_ptr) && 
is_16byte_aligned(out_ptr); + + if (aligned && d >= VEC_SIZE) { + // Fast path: 128-bit vectorized loop + const int4* in_vec = reinterpret_cast(in_ptr); + int4* out_vec = reinterpret_cast(out_ptr); + const int num_vecs = d / VEC_SIZE; + const int vec_end = num_vecs * VEC_SIZE; + + for (int i = threadIdx.x; i < num_vecs; i += blockDim.x) { + int4 v = VLLM_LDG(&in_vec[i]), r; + auto* vp = reinterpret_cast(&v); + auto* rp = reinterpret_cast(&r); +#pragma unroll + for (int j = 0; j < VEC_SIZE; j++) { + rp[j] = ACT_FN(vp[j]); + } + out_vec[i] = r; + } + // Scalar cleanup for remaining elements + for (int i = vec_end + threadIdx.x; i < d; i += blockDim.x) { + out_ptr[i] = ACT_FN(VLLM_LDG(&in_ptr[i])); + } + } else { + // Scalar fallback for unaligned data or small d + for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) { + const scalar_t x = VLLM_LDG(&in_ptr[idx]); + out_ptr[idx] = ACT_FN(x); + } } } diff --git a/csrc/cache.h b/csrc/cache.h index ee96a33ec8d95ada81ea72c79d107d43296cc8b2..bf007dabf6c170775f38ba8152495fcac81a0d28 100644 --- a/csrc/cache.h +++ b/csrc/cache.h @@ -9,16 +9,6 @@ void swap_blocks(torch::Tensor& src, torch::Tensor& dst, const torch::Tensor& block_mapping); -// Note: the key_caches and value_caches vectors are constant but -// not the Tensors they contain. The vectors need to be const refs -// in order to satisfy pytorch's C++ operator registration code. 
-void copy_blocks(std::vector const& key_caches, - std::vector const& value_caches, - const torch::Tensor& block_mapping); - -void copy_blocks_mla(std::vector const& kv_caches, - const torch::Tensor& block_mapping); - void reshape_and_cache(torch::Tensor& key, torch::Tensor& value, torch::Tensor& key_cache, torch::Tensor& value_cache, torch::Tensor& slot_mapping, @@ -43,6 +33,13 @@ void concat_and_cache_mla(torch::Tensor& kv_c, torch::Tensor& k_pe, const std::string& kv_cache_dtype, torch::Tensor& scale); +// NOTE: k_pe and kv_c order is flipped compared to concat_and_cache_mla +void concat_and_cache_mla_rope_fused( + torch::Tensor& positions, torch::Tensor& q_pe, torch::Tensor& k_pe, + torch::Tensor& kv_c, torch::Tensor& rope_cos_sin_cache, bool rope_is_neox, + torch::Tensor& kv_cache_slot_mapping, torch::Tensor& kv_cache, + const std::string& kv_cache_dtype, torch::Tensor& kv_cache_quant_scale); + // Just for unittest void convert_fp8(torch::Tensor& dst_cache, torch::Tensor& src_cache, const double scale, const std::string& kv_cache_dtype); diff --git a/csrc/cache_kernels.cu b/csrc/cache_kernels.cu index 0112cea6a65bb657bd133a241f2a93e6cf44da0e..e4b95c2795fa78518952b6527f3b40e3302fc340 100644 --- a/csrc/cache_kernels.cu +++ b/csrc/cache_kernels.cu @@ -124,94 +124,6 @@ __global__ void copy_blocks_mla_kernel( } // namespace vllm -// Note: the key_caches and value_caches vectors are constant but -// not the Tensors they contain. The vectors need to be const refs -// in order to satisfy pytorch's C++ operator registration code. -void copy_blocks(std::vector const& key_caches, - std::vector const& value_caches, - const torch::Tensor& block_mapping) { - int num_layers = key_caches.size(); - TORCH_CHECK(num_layers == value_caches.size()); - if (num_layers == 0) { - return; - } - torch::Device cache_device = key_caches[0].device(); - TORCH_CHECK(cache_device.is_cuda()); - - // Create data structures for the kernel. 
- // Create an array of pointers to the key and value caches. - int64_t key_cache_ptrs[num_layers]; - int64_t value_cache_ptrs[num_layers]; - for (int layer_idx = 0; layer_idx < num_layers; ++layer_idx) { - key_cache_ptrs[layer_idx] = - reinterpret_cast(key_caches[layer_idx].data_ptr()); - value_cache_ptrs[layer_idx] = - reinterpret_cast(value_caches[layer_idx].data_ptr()); - } - - // block_mapping is a 2D tensor with shape (num_pairs, 2). - int num_pairs = block_mapping.size(0); - - // Move the data structures to the GPU. - // NOTE: This synchronizes the CPU and GPU. - torch::Tensor key_cache_ptrs_tensor = - torch::from_blob(key_cache_ptrs, {num_layers}, torch::kInt64) - .to(cache_device); - torch::Tensor value_cache_ptrs_tensor = - torch::from_blob(value_cache_ptrs, {num_layers}, torch::kInt64) - .to(cache_device); - - // Launch the kernel. - const int numel_per_block = key_caches[0][0].numel(); - dim3 grid(num_layers, num_pairs); - dim3 block(std::min(1024, numel_per_block)); - const at::cuda::OptionalCUDAGuard device_guard(cache_device); - const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); - VLLM_DISPATCH_FLOATING_AND_BYTE_TYPES( - key_caches[0].scalar_type(), "copy_blocks_kernel", ([&] { - vllm::copy_blocks_kernel<<>>( - key_cache_ptrs_tensor.data_ptr(), - value_cache_ptrs_tensor.data_ptr(), - block_mapping.data_ptr(), numel_per_block); - })); -} - -// copy blocks kernel for MLA (assumes a joint KV-cache) -void copy_blocks_mla(std::vector const& kv_caches, - const torch::Tensor& block_mapping) { - int num_layers = kv_caches.size(); - if (num_layers == 0) { - return; - } - torch::Device cache_device = kv_caches[0].device(); - TORCH_CHECK(cache_device.is_cuda(), "kv_cache must be on CUDA"); - - std::vector cache_ptrs(num_layers); - for (int layer_idx = 0; layer_idx < num_layers; ++layer_idx) { - cache_ptrs[layer_idx] = - reinterpret_cast(kv_caches[layer_idx].data_ptr()); - } - torch::Tensor cache_ptrs_tensor = - torch::from_blob(cache_ptrs.data(), 
{num_layers}, torch::kInt64) - .to(cache_device); - - int num_pairs = block_mapping.size(0); - // We use the stride instead of numel in case the cache is padded for memory - // alignment reasons, we assume the blocks data (inclusive of any padding) - // is contiguous in memory - int mem_footprint_per_block = kv_caches[0].stride(0); - dim3 grid(num_layers, num_pairs); - dim3 block(std::min(1024, mem_footprint_per_block)); - const at::cuda::OptionalCUDAGuard device_guard(cache_device); - const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); - VLLM_DISPATCH_FLOATING_AND_BYTE_TYPES( - kv_caches[0].scalar_type(), "copy_blocks_mla_kernel", ([&] { - vllm::copy_blocks_mla_kernel<<>>( - cache_ptrs_tensor.data_ptr(), - block_mapping.data_ptr(), mem_footprint_per_block); - })); -} - namespace vllm { // Used to copy/convert one element @@ -770,9 +682,6 @@ __global__ void indexer_k_quant_and_cache_kernel( for (int i = 0; i < VEC_SIZE; i++) { amax = fmaxf(amax, fabsf(float(k_val_ptr[i]))); } -#ifndef USE_ROCM - __syncwarp(); -#endif // Reduced amax for (int mask = 16; mask > 0; mask /= 2) { @@ -782,9 +691,7 @@ __global__ void indexer_k_quant_and_cache_kernel( amax = fmaxf(amax, __shfl_xor_sync(unsigned(-1), amax, mask)); #endif } -#ifndef USE_ROCM - __syncwarp(); -#endif + #if defined(__gfx942__) float scale = fmaxf(amax, 1e-4) / 224.0f; #else diff --git a/csrc/cache_kernels_fused.cu b/csrc/cache_kernels_fused.cu new file mode 100644 index 0000000000000000000000000000000000000000..be037b2fdec2be66a67c4a61144a0e45b009fc7a --- /dev/null +++ b/csrc/cache_kernels_fused.cu @@ -0,0 +1,279 @@ +#include +#include +#include + +#include "cuda_compat.h" +#include "dispatch_utils.h" + +#include "quantization/w8a8/fp8/common.cuh" +#ifdef USE_ROCM + #include "quantization/w8a8/fp8/amd/quant_utils.cuh" +#else + #include "quantization/w8a8/fp8/nvidia/quant_utils.cuh" +#endif + +#ifdef USE_ROCM + #include +typedef __hip_bfloat16 __nv_bfloat16; +#endif + +namespace vllm { + +// NOTE Be 
EXTRA careful with raw_kv_scalar_t, for __half and __nv_bfloat16 it's +// using u16 as the backing type. +template +__global__ void concat_and_cache_mla_rope_fused_kernel( + const int64_t* __restrict__ positions, // [num_tokens] + qk_t* __restrict__ q_pe, // [num_tokens, num_q_heads, rot_dim] + qk_t* __restrict__ k_pe, // [num_tokens, rot_dim] + const qk_t* __restrict__ kv_c, // [num_tokens, kv_lora_rank] + const qk_t* __restrict__ rope_cos_sin_cache, // [max_position, 2, + // rot_dim // 2] + const int rot_dim, const int64_t q_pe_stride_token, + const int64_t q_pe_stride_head, const int64_t k_pe_stride, + const int64_t kv_c_stride, const int num_q_heads, + cache_t* __restrict__ kv_cache, // [num_blocks, block_size, (kv_lora_rank + + // rot_dim)] + const int64_t* __restrict__ kv_cache_slot_mapping, // [num_tokens] + const int block_stride, const int entry_stride, const int kv_lora_rank, + const int block_size, const float* kv_cache_quant_scale) { + // Each thread block is responsible for one token. + const int64_t token_idx = blockIdx.x; + const int64_t pos = positions[token_idx]; + + const qk_t* cos_sin_ptr = rope_cos_sin_cache + pos * rot_dim; + + const int embed_dim = rot_dim / 2; + + // Q ROPE + const int nq = num_q_heads * embed_dim; + for (int i = threadIdx.x; i < nq; i += blockDim.x) { + int head_idx = i / embed_dim; + int pair_idx = i % embed_dim; + + // NOTE: Would be nice to have interleaved sin/cos so we could just load + // both at the same time. + qk_t cos = VLLM_LDG(cos_sin_ptr + pair_idx); + qk_t sin = VLLM_LDG(cos_sin_ptr + pair_idx + embed_dim); + + qk_t* q_pe_head_ptr = + q_pe + token_idx * q_pe_stride_token + head_idx * q_pe_stride_head; + + int pair_idx_x, pair_idx_y; + if constexpr (IS_NEOX) { + // GPT-NeoX style rotary embedding. + pair_idx_x = pair_idx; + pair_idx_y = embed_dim + pair_idx; + } else { + // GPT-J style rotary embedding. 
+ pair_idx_x = pair_idx * 2; + pair_idx_y = pair_idx * 2 + 1; + } + + qk_t x_src = q_pe_head_ptr[pair_idx_x]; + qk_t y_src = q_pe_head_ptr[pair_idx_y]; + + qk_t x_dst = x_src * cos - y_src * sin; + qk_t y_dst = y_src * cos + x_src * sin; + + q_pe_head_ptr[pair_idx_x] = x_dst; + q_pe_head_ptr[pair_idx_y] = y_dst; + } + + const int64_t slot_idx = kv_cache_slot_mapping[token_idx]; + const int64_t block_idx = slot_idx / block_size; + const int64_t entry_idx = slot_idx % block_size; + + // NOTE: slot_idx can be -1 if the token is padded + if (slot_idx < 0) { + return; + } + + // K with 1 HEAD + for (int i = threadIdx.x; i < embed_dim; i += blockDim.x) { + int pair_idx = i; + + qk_t cos = VLLM_LDG(cos_sin_ptr + pair_idx); + qk_t sin = VLLM_LDG(cos_sin_ptr + pair_idx + embed_dim); + + qk_t* k_pe_head_ptr = k_pe + token_idx * k_pe_stride; + + int pair_idx_x, pair_idx_y; + if constexpr (IS_NEOX) { + // GPT-NeoX style rotary embedding. + pair_idx_x = pair_idx; + pair_idx_y = embed_dim + pair_idx; + } else { + // GPT-J style rotary embedding. + pair_idx_x = pair_idx * 2; + pair_idx_y = pair_idx * 2 + 1; + } + + qk_t x_src = k_pe_head_ptr[pair_idx_x]; + qk_t y_src = k_pe_head_ptr[pair_idx_y]; + + qk_t x_dst = x_src * cos - y_src * sin; + qk_t y_dst = y_src * cos + x_src * sin; + + k_pe_head_ptr[pair_idx_x] = x_dst; + k_pe_head_ptr[pair_idx_y] = y_dst; + + // NOTE Why is this monster necessary? + // When K is of type float16, the actual template replacement for + // raw_kv_scalar_t will be u16. That's why it's used at the last moment + // otherwise CUDA ALU would break. 
+ const raw_kv_scalar_t raw_x_value = + *reinterpret_cast(&x_dst); + const raw_kv_scalar_t raw_y_value = + *reinterpret_cast(&y_dst); + + cache_t* kv_cache_ptr = kv_cache + block_idx * block_stride + + entry_idx * entry_stride + kv_lora_rank; + + // MLA Cache Store + if constexpr (kv_dt == Fp8KVCacheDataType::kAuto) { + kv_cache_ptr[pair_idx_x] = raw_x_value; + kv_cache_ptr[pair_idx_y] = raw_y_value; + } else { + kv_cache_ptr[pair_idx_x] = + fp8::scaled_convert( + raw_x_value, *kv_cache_quant_scale); + kv_cache_ptr[pair_idx_y] = + fp8::scaled_convert( + raw_y_value, *kv_cache_quant_scale); + } + } + + // NOPE + for (int i = threadIdx.x; i < kv_lora_rank; i += blockDim.x) { + const qk_t* src_ptr = kv_c + token_idx * kv_c_stride + i; + const raw_kv_scalar_t src_value = + *reinterpret_cast(src_ptr); + + cache_t* kv_cache_ptr = + kv_cache + block_idx * block_stride + entry_idx * entry_stride; + + if constexpr (kv_dt == Fp8KVCacheDataType::kAuto) { + kv_cache_ptr[i] = src_value; + } else { + kv_cache_ptr[i] = fp8::scaled_convert( + src_value, *kv_cache_quant_scale); + } + } +} + +} // namespace vllm + +#define CALL_CONCAT_AND_CACHE_MLA_ROPE_FUSED(RAW_KV_T, CACHE_T, KV_DTYPE) \ + do { \ + VLLM_DISPATCH_FLOATING_TYPES(q_pe.scalar_type(), "qk_scalar_type", [&] { \ + using qk_t = scalar_t; \ + if (rope_is_neox) { \ + vllm::concat_and_cache_mla_rope_fused_kernel \ + <<>>( \ + positions.data_ptr(), q_pe.data_ptr(), \ + k_pe.data_ptr(), kv_c.data_ptr(), \ + rope_cos_sin_cache.data_ptr(), rot_dim, \ + q_pe_stride_token, q_pe_stride_head, k_pe_stride, kv_c_stride, \ + num_q_heads, reinterpret_cast(kv_cache.data_ptr()), \ + kv_cache_slot_mapping.data_ptr(), block_stride, \ + entry_stride, kv_lora_rank, block_size, \ + kv_cache_quant_scale.data_ptr()); \ + } else { \ + vllm::concat_and_cache_mla_rope_fused_kernel \ + <<>>( \ + positions.data_ptr(), q_pe.data_ptr(), \ + k_pe.data_ptr(), kv_c.data_ptr(), \ + rope_cos_sin_cache.data_ptr(), rot_dim, \ + q_pe_stride_token, 
q_pe_stride_head, k_pe_stride, kv_c_stride, \ + num_q_heads, reinterpret_cast(kv_cache.data_ptr()), \ + kv_cache_slot_mapping.data_ptr(), block_stride, \ + entry_stride, kv_lora_rank, block_size, \ + kv_cache_quant_scale.data_ptr()); \ + } \ + }); \ + } while (false) + +// Executes RoPE on q_pe and k_pe, then writes k_pe and kv_c in the kv cache. +// q_pe and k_pe are modified in place. +// Replaces DeepseekScalingRotaryEmbedding.self.rotary_emb and +// concat_and_cache_mla. +void concat_and_cache_mla_rope_fused( + torch::Tensor& positions, // [num_tokens] + torch::Tensor& q_pe, // [num_tokens, num_q_heads, rot_dim] + torch::Tensor& k_pe, // [num_tokens, rot_dim] + torch::Tensor& kv_c, // [num_tokens, kv_lora_rank] + torch::Tensor& rope_cos_sin_cache, // [max_position, rot_dim] + bool rope_is_neox, + torch::Tensor& + kv_cache_slot_mapping, // [num_tokens] or [num_actual_tokens] + torch::Tensor& + kv_cache, // [num_blocks, block_size, (kv_lora_rank + rot_dim)] + const std::string& kv_cache_dtype, torch::Tensor& kv_cache_quant_scale) { + const int64_t num_tokens = q_pe.size(0); + + const int num_q_heads = q_pe.size(1); + const int rot_dim = q_pe.size(2); + const int kv_lora_rank = kv_c.size(1); + + TORCH_CHECK(positions.size(0) >= + num_tokens); // CUDA Graphs might pad this for us + TORCH_CHECK_EQ(positions.dim(), 1); + TORCH_CHECK_EQ(positions.scalar_type(), c10::ScalarType::Long); + + TORCH_CHECK_EQ(q_pe.size(0), num_tokens); + TORCH_CHECK_EQ(q_pe.size(1), num_q_heads); + TORCH_CHECK_EQ(q_pe.size(2), rot_dim); + TORCH_CHECK_EQ(q_pe.dim(), 3); + + TORCH_CHECK_EQ(k_pe.size(0), num_tokens); + TORCH_CHECK_EQ(k_pe.size(1), rot_dim); + TORCH_CHECK_EQ(k_pe.dim(), 2); + TORCH_CHECK_EQ(k_pe.scalar_type(), q_pe.scalar_type()); + + TORCH_CHECK_EQ(kv_c.size(0), num_tokens); + TORCH_CHECK_EQ(kv_c.size(1), kv_lora_rank); + TORCH_CHECK_EQ(kv_c.dim(), 2); + TORCH_CHECK_EQ(kv_c.scalar_type(), q_pe.scalar_type()); + TORCH_CHECK_EQ(kv_c.dtype(), q_pe.dtype()); + + 
TORCH_CHECK_EQ(rope_cos_sin_cache.size(1), rot_dim); + TORCH_CHECK_EQ(rope_cos_sin_cache.scalar_type(), q_pe.scalar_type()); + + TORCH_CHECK_EQ(kv_cache_slot_mapping.size(0), num_tokens); + TORCH_CHECK_EQ(kv_cache_slot_mapping.scalar_type(), c10::ScalarType::Long); + + TORCH_CHECK_EQ(kv_cache.size(2), kv_lora_rank + rot_dim); + TORCH_CHECK_EQ(kv_cache.dim(), 3); + + TORCH_CHECK_EQ(kv_cache_quant_scale.numel(), 1); + TORCH_CHECK_EQ(kv_cache_quant_scale.scalar_type(), c10::ScalarType::Float); + + int64_t q_pe_stride_token = q_pe.stride(0); + int64_t q_pe_stride_head = q_pe.stride(1); + + int64_t k_pe_stride = k_pe.stride(0); + int64_t kv_c_stride = kv_c.stride(0); + + int block_size = kv_cache.size(1); + + int block_stride = kv_cache.stride(0); + int entry_stride = kv_cache.stride(1); + + int rope_block_size = std::min(num_q_heads * rot_dim / 2, 512); + int mla_block_size = kv_lora_rank; + int thread_block_size = + std::min(std::max(rope_block_size, mla_block_size), 512); + + dim3 grid(num_tokens, 1, 1); + dim3 block(thread_block_size, 1, 1); + + const at::cuda::OptionalCUDAGuard device_guard(device_of(positions)); + const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); + + DISPATCH_BY_KV_CACHE_DTYPE(kv_c.dtype(), kv_cache_dtype, + CALL_CONCAT_AND_CACHE_MLA_ROPE_FUSED); +} diff --git a/csrc/cpu/cpu_attn_macros.h b/csrc/cpu/cpu_arch_macros.h similarity index 97% rename from csrc/cpu/cpu_attn_macros.h rename to csrc/cpu/cpu_arch_macros.h index 35716a0790ab354fa77b3b04f5581d10c0f7f566..c73b62ecdec901f4cf543bc12176aa6645b2a7dc 100644 --- a/csrc/cpu/cpu_attn_macros.h +++ b/csrc/cpu/cpu_arch_macros.h @@ -1,5 +1,5 @@ -#ifndef CPU_ATTN_MACROS_H -#define CPU_ATTN_MACROS_H +#ifndef CPU_ARCH_MACROS_H +#define CPU_ARCH_MACROS_H // x86_64 #ifdef __x86_64__ @@ -26,7 +26,7 @@ _mm512_castsi512_ps(_mm512_set1_epi32(0x42b17218)); \ const __m512i vec_127 = _mm512_set1_epi32(0x0000007f); \ const int n_mantissa_bits = 23; \ - auto fast_exp = [&](vec_op::FP32Vec16& vec) 
__attribute__(( \ + auto fast_exp = [&](const vec_op::FP32Vec16& vec) __attribute__(( \ always_inline)) { \ __m512 values = vec.reg; \ auto less_ln_flt_min_mask = \ @@ -98,7 +98,7 @@ poly = vbslq_f32(hi_mask, inf, poly); \ return vbslq_f32(lo_mask, zero, poly); \ }; \ - auto fast_exp = [&](vec_op::FP32Vec16& vec) \ + auto fast_exp = [&](const vec_op::FP32Vec16& vec) \ __attribute__((always_inline)) { \ float32x4x4_t result; \ result.val[0] = neon_expf(vec.reg.val[0]); \ @@ -110,4 +110,4 @@ #endif // __aarch64__ -#endif \ No newline at end of file +#endif diff --git a/csrc/cpu/cpu_attn.cpp b/csrc/cpu/cpu_attn.cpp index 02c722ba031a4c2168740c96794c4a040b0a259e..374fc2ee6ddcc756d5d9ededdabd56ad8571e62f 100644 --- a/csrc/cpu/cpu_attn.cpp +++ b/csrc/cpu/cpu_attn.cpp @@ -15,6 +15,7 @@ #ifdef __aarch64__ #include "cpu_attn_neon.hpp" + // NEON requires head_dim to be a multiple of 32 #define NEON_DISPATCH(...) \ case cpu_attention::ISA::NEON: { \ using attn_impl = cpu_attention::AttentionImpl { const int32_t q_heads_per_kv, const int64_t q_num_stride, const int64_t q_head_stride, const float scale) { constexpr int64_t bytes_per_head = head_dim * sizeof(scalar_t); - static_assert(bytes_per_head % AMX_TILE_ROW_BYTES == 0); + // static_assert(bytes_per_head % AMX_TILE_ROW_BYTES == 0); constexpr int64_t head_size_block_num = bytes_per_head / AMX_TILE_ROW_BYTES; constexpr int64_t head_elem_num_pre_block = AMX_TILE_ROW_BYTES / sizeof(scalar_t); diff --git a/csrc/cpu/cpu_attn_impl.hpp b/csrc/cpu/cpu_attn_impl.hpp index e3e077b845f4f992e48af87fcc42943225a53d53..08d208e05a62c89353b72e5555435e786bc0c288 100644 --- a/csrc/cpu/cpu_attn_impl.hpp +++ b/csrc/cpu/cpu_attn_impl.hpp @@ -8,10 +8,8 @@ #include #endif -#include "cpu_types.hpp" -#include "scratchpad_manager.h" -#include "cpu_attn_macros.h" -#include "utils.hpp" +#include "cpu/cpu_arch_macros.h" +#include "cpu/utils.hpp" namespace cpu_attention { enum class ISA { AMX, VEC, VEC16, NEON }; @@ -378,12 +376,13 @@ class 
AttentionScheduler { static constexpr int32_t MaxQTileIterNum = 128; - AttentionScheduler() : available_cache_size_(get_available_l2_size()) {} + AttentionScheduler() + : available_cache_size_(cpu_utils::get_available_l2_size()) {} torch::Tensor schedule(const ScheduleInput& input) const { const bool casual = input.casual; const int32_t thread_num = omp_get_max_threads(); - const int64_t cache_size = get_available_l2_size(); + const int64_t cache_size = cpu_utils::get_available_l2_size(); const int32_t max_num_q_per_iter = input.max_num_q_per_iter; const int32_t kv_len_alignment = input.kv_block_alignment; int32_t q_head_per_kv = input.num_heads_q / input.num_heads_kv; @@ -659,7 +658,7 @@ class AttentionScheduler { metadata_ptr->thread_num + metadata_ptr->reduction_scratchpad_size_per_kv_head * (use_gqa ? input.num_heads_kv : input.num_heads_q); - DNNLScratchPadManager::get_dnnl_scratchpad_manager()->realloc( + cpu_utils::ScratchPadManager::get_scratchpad_manager()->realloc( scratchpad_size); // metadata_ptr->print(); @@ -667,7 +666,7 @@ class AttentionScheduler { // test out of boundary access // { // float* cache_ptr = - // DNNLScratchPadManager::get_dnnl_scratchpad_manager()->get_data(); + // cpu_utils::ScratchPadManager::get_scratchpad_manager()->get_data(); // for (int64_t i = 0; i < scratchpad_size / sizeof(float); ++i) { // cache_ptr[i] = std::numeric_limits::quiet_NaN(); // } @@ -749,27 +748,6 @@ class AttentionScheduler { return std::max(rounded_tile_size, round_size); } - static int64_t get_available_l2_size() { - static int64_t size = []() { -#if defined(__APPLE__) - // macOS doesn't have _SC_LEVEL2_CACHE_SIZE. Use sysctlbyname. 
- int64_t l2_cache_size = 0; - size_t len = sizeof(l2_cache_size); - if (sysctlbyname("hw.l2cachesize", &l2_cache_size, &len, NULL, 0) == 0 && - l2_cache_size > 0) { - return l2_cache_size >> 1; // use 50% of L2 cache - } - // Fallback if sysctlbyname fails - return 128LL * 1024 >> 1; // use 50% of 128KB -#else - long l2_cache_size = sysconf(_SC_LEVEL2_CACHE_SIZE); - TORCH_CHECK_NE(l2_cache_size, -1); - return l2_cache_size >> 1; // use 50% of L2 cache -#endif - }(); - return size; - } - private: int64_t available_cache_size_; }; @@ -1402,7 +1380,7 @@ class AttentionMainLoop { // init buffers void* scratchpad_ptr = - DNNLScratchPadManager::get_dnnl_scratchpad_manager() + cpu_utils::ScratchPadManager::get_scratchpad_manager() ->get_data(); AttentionScratchPad buffer_manager(thread_id, metadata, scratchpad_ptr); @@ -1422,8 +1400,7 @@ class AttentionMainLoop { } } - const int64_t available_cache_size = - AttentionScheduler::get_available_l2_size(); + const int64_t available_cache_size = cpu_utils::get_available_l2_size(); const int32_t default_tile_size = AttentionScheduler::calcu_default_tile_size( available_cache_size, head_dim, sizeof(kv_cache_t), diff --git a/csrc/cpu/cpu_attn_neon.hpp b/csrc/cpu/cpu_attn_neon.hpp index 827f0cfbc718ed97ce6d3ac2ed765d4ef9ee0f80..e9ecd1d32904e34b7dbaf4e85d17dfed245088ab 100644 --- a/csrc/cpu/cpu_attn_neon.hpp +++ b/csrc/cpu/cpu_attn_neon.hpp @@ -264,7 +264,7 @@ class AttentionImpl { constexpr static ISA ISAType = ISA::NEON; constexpr static bool scale_on_logits = false; // apply scale on q_buffer - static_assert(HeadDim % HeadDimAlignment == 0); + // static_assert(HeadDim % HeadDimAlignment == 0); // the gemm micro kernel is Mx8 static_assert(HeadDimAlignment % 8 == 0); static_assert(BlockSizeAlignment % 8 == 0); diff --git a/csrc/cpu/cpu_fused_moe.cpp b/csrc/cpu/cpu_fused_moe.cpp new file mode 100644 index 0000000000000000000000000000000000000000..090e2d4cd4b56504602af6594003f767344ed993 --- /dev/null +++ 
b/csrc/cpu/cpu_fused_moe.cpp @@ -0,0 +1,727 @@ +#include "cpu/cpu_types.hpp" +#include "cpu/utils.hpp" +#include "cpu/micro_gemm/cpu_micro_gemm_vec.hpp" +#include "cpu/cpu_arch_macros.h" + +#ifdef CPU_CAPABILITY_AMXBF16 + #include "cpu/micro_gemm/cpu_micro_gemm_amx.hpp" + #define AMX_DISPATCH(...) \ + case cpu_utils::ISA::AMX: { \ + using gemm_t = cpu_micro_gemm::MicroGemm; \ + return __VA_ARGS__(); \ + } +#else + #define AMX_DISPATCH(...) case cpu_utils::ISA::AMX: +#endif + +#define CPU_ISA_DISPATCH_IMPL(ISA_TYPE, ...) \ + [&] { \ + switch (ISA_TYPE) { \ + AMX_DISPATCH(__VA_ARGS__) \ + case cpu_utils::ISA::VEC: { \ + using gemm_t = \ + cpu_micro_gemm::MicroGemm; \ + return __VA_ARGS__(); \ + } \ + default: { \ + TORCH_CHECK(false, "Invalid CPU ISA type."); \ + } \ + } \ + }() + +namespace { +enum class FusedMOEAct { SiluAndMul, SwigluOAIAndMul }; + +FusedMOEAct get_act_type(const std::string& act) { + if (act == "silu") { + return FusedMOEAct::SiluAndMul; + } else if (act == "swigluoai") { + return FusedMOEAct::SwigluOAIAndMul; + } else { + TORCH_CHECK(false, "Invalid act type: " + act); + } +} + +template +void swigluoai_and_mul(float* __restrict__ input, scalar_t* __restrict__ output, + const int32_t m_size, const int32_t n_size, + const int32_t input_stride, + const int32_t output_stride) { + using scalar_vec_t = typename cpu_utils::VecTypeTrait::vec_t; + // For GPT-OSS interleaved gate-up weights + alignas(64) static int32_t index[16] = {0, 2, 4, 6, 8, 10, 12, 14, + 16, 18, 20, 22, 24, 26, 28, 30}; + vec_op::INT32Vec16 index_vec(index); + vec_op::FP32Vec16 gate_up_max_vec(7.0); + vec_op::FP32Vec16 up_min_vec(-7.0); + vec_op::FP32Vec16 alpha_vec(1.702); + vec_op::FP32Vec16 one_vec(1.0); + + DEFINE_FAST_EXP + + for (int32_t m = 0; m < m_size; ++m) { + for (int32_t n = 0; n < n_size; n += 32) { + vec_op::FP32Vec16 gate_vec(input + n, index_vec); + vec_op::FP32Vec16 up_vec(input + n + 1, index_vec); + gate_vec = gate_vec.min(gate_up_max_vec); + up_vec = 
up_vec.clamp(up_min_vec, gate_up_max_vec); + auto sigmoid_vec = one_vec / (one_vec + fast_exp(-gate_vec * alpha_vec)); + auto glu = gate_vec * sigmoid_vec; + auto gated_output_fp32 = (one_vec + up_vec) * glu; + scalar_vec_t gated_output = scalar_vec_t(gated_output_fp32); + gated_output.save(output + n / 2); + } + input += input_stride; + output += output_stride; + } +} + +template +void silu_and_mul(float* __restrict__ input, scalar_t* __restrict__ output, + const int32_t m_size, const int32_t n_size, + const int32_t input_stride, const int32_t output_stride) { + using scalar_vec_t = typename cpu_utils::VecTypeTrait::vec_t; + const int32_t dim = n_size / 2; + float* __restrict__ gate = input; + float* __restrict__ up = input + dim; + vec_op::FP32Vec16 one_vec(1.0); + + DEFINE_FAST_EXP + + for (int32_t m = 0; m < m_size; ++m) { + for (int32_t n = 0; n < dim; n += 16) { + vec_op::FP32Vec16 gate_vec(gate + n); + vec_op::FP32Vec16 up_vec(up + n); + auto sigmoid_vec = one_vec / (one_vec + fast_exp(-gate_vec)); + auto silu = gate_vec * sigmoid_vec; + auto gated_output_fp32 = up_vec * silu; + scalar_vec_t gated_output = scalar_vec_t(gated_output_fp32); + gated_output.save(output + n); + } + gate += input_stride; + up += input_stride; + output += output_stride; + } +} + +template +FORCE_INLINE void apply_gated_act(const FusedMOEAct act, + float* __restrict__ input, + scalar_t* __restrict__ output, + const int32_t m, const int32_t n, + const int32_t input_stride, + const int32_t output_stride) { + switch (act) { + case FusedMOEAct::SwigluOAIAndMul: + swigluoai_and_mul(input, output, m, n, input_stride, output_stride); + return; + case FusedMOEAct::SiluAndMul: + silu_and_mul(input, output, m, n, input_stride, output_stride); + return; + default: + TORCH_CHECK(false, "Unsupported act type."); + } +} + +template +void prepack_moe_weight_impl(scalar_t* __restrict__ weight_ptr, + scalar_t* __restrict__ packed_weight_ptr, + const int32_t expert_num, + const int32_t output_size, + 
const int32_t input_size, + const int64_t expert_stride) { +#pragma omp parallel for + for (int32_t e_idx = 0; e_idx < expert_num; ++e_idx) { + gemm_t::pack_weight(weight_ptr + expert_stride * e_idx, + packed_weight_ptr + expert_stride * e_idx, output_size, + input_size); + } +} + +template +void fused_moe_impl(scalar_t* __restrict__ output, scalar_t* __restrict__ input, + w_t* __restrict__ w13, w_t* __restrict__ w2, + w_t* __restrict__ w13_bias, w_t* __restrict__ w2_bias, + float* __restrict__ topk_weights, + int32_t* __restrict__ topk_id, FusedMOEAct act_type, + const int32_t token_num, const int32_t expert_num, + const int32_t topk_num, const int32_t input_size_13, + const int32_t output_size_13, const int32_t input_size_2, + const int32_t output_size_2) { + using scalar_vec_t = typename cpu_utils::VecTypeTrait::vec_t; + constexpr int32_t gemm_n_tile_size = gemm_t::NSize; + constexpr int32_t gemm_m_tile_size = gemm_t::MaxMSize; + constexpr int32_t min_w13_n_tile_size = 2 * gemm_n_tile_size; + static_assert(gemm_n_tile_size % 16 == 0); + + TORCH_CHECK_EQ(output_size_13 % min_w13_n_tile_size, 0); + TORCH_CHECK_EQ(output_size_2 % gemm_n_tile_size, 0); + TORCH_CHECK_EQ(output_size_13 / 2, input_size_2); + + const int32_t thread_num = omp_get_max_threads(); + + const int32_t w13_input_buffer_size = cpu_utils::round_up<64>( + gemm_m_tile_size * input_size_13 * sizeof(scalar_t)); + + const int32_t w13_n_tile_size = [&]() { + const int64_t cache_size = cpu_utils::get_available_l2_size(); + // input buffer + output buffer + weight + const int32_t n_size_cache_limit = + (cache_size - w13_input_buffer_size) / + (gemm_m_tile_size * sizeof(float) + input_size_13 * sizeof(scalar_t)); + const int32_t n_size_thread_limit = + output_size_13 / std::max(1, thread_num / topk_num); + const int32_t n_size = cpu_utils::round_down( + std::min(n_size_cache_limit, n_size_thread_limit)); + return std::max(n_size, min_w13_n_tile_size); + }(); + + const int32_t w2_input_tile_size = 
cpu_utils::round_up<64>( + gemm_m_tile_size * input_size_2 * sizeof(scalar_t)); + + const int32_t w2_n_tile_size = [&]() { + const int64_t cache_size = cpu_utils::get_available_l2_size(); + // input tile + weight + const int32_t n_size_cache_limit = + (cache_size - w2_input_tile_size) / (input_size_2 * sizeof(scalar_t)); + const int32_t n_size_thread_limit = + output_size_2 / std::max(1, thread_num / topk_num); + const int32_t n_size = cpu_utils::round_down( + std::min(n_size_cache_limit, n_size_thread_limit)); + return std::max(n_size, gemm_n_tile_size); + }(); + + // allocate buffers + int32_t common_buffer_offset = 0; + int32_t w13_thread_buffer_offset = 0; + int32_t ws_thread_buffer_offset = 0; + + // common buffers + const int32_t token_num_per_group_buffer_size = + cpu_utils::round_up<64>(expert_num * sizeof(int32_t)); + const int32_t token_num_per_group_buffer_offset = common_buffer_offset; + common_buffer_offset += token_num_per_group_buffer_size; + + const int32_t cu_token_num_per_group_buffer_size = + cpu_utils::round_up<64>((expert_num + 1) * sizeof(int32_t)); + const int32_t cu_token_num_per_group_buffer_offset = common_buffer_offset; + common_buffer_offset += cu_token_num_per_group_buffer_size; + + const int32_t expand_token_id_buffer_size = + cpu_utils::round_up<64>(token_num * topk_num * sizeof(int32_t)); + const int32_t expand_token_id_buffer_offset = common_buffer_offset; + common_buffer_offset += expand_token_id_buffer_size; + + const int32_t expand_token_id_index_buffer_size = + cpu_utils::round_up<64>(token_num * topk_num * sizeof(int32_t)); + const int32_t expand_token_id_index_buffer_offset = common_buffer_offset; + common_buffer_offset += expand_token_id_index_buffer_size; + + const int32_t w13_gemm_output_buffer_size = cpu_utils::round_up<64>( + token_num * topk_num * (output_size_13 / 2) * sizeof(scalar_t)); + const int32_t w13_gemm_output_buffer_offset = common_buffer_offset; + common_buffer_offset += w13_gemm_output_buffer_size; + + const 
int32_t w2_gemm_output_buffer_size = cpu_utils::round_up<64>( + token_num * topk_num * output_size_2 * sizeof(float)); + const int32_t w2_gemm_output_buffer_offset = common_buffer_offset; + common_buffer_offset += w2_gemm_output_buffer_size; + + // w13 GEMM thread buffers + const int32_t w13_input_buffer_offset = w13_thread_buffer_offset; + w13_thread_buffer_offset += w13_input_buffer_size; + + const int32_t w13_output_buffer_size = cpu_utils::round_up<64>( + gemm_m_tile_size * w13_n_tile_size * sizeof(float)); + const int32_t w13_output_buffer_offset = w13_thread_buffer_offset; + w13_thread_buffer_offset += w13_output_buffer_size; + + // Weighted sum thread buffer + const int32_t ws_output_buffer_size = + cpu_utils::round_up<64>(output_size_2 * sizeof(float)); + const int32_t ws_output_buffer_offset = ws_thread_buffer_offset; + ws_thread_buffer_offset += ws_output_buffer_size; + + const int32_t buffer_size = + common_buffer_offset + + std::max(w13_thread_buffer_offset, ws_thread_buffer_offset) * thread_num; + cpu_utils::ScratchPadManager::get_scratchpad_manager()->realloc(buffer_size); + uint8_t* common_buffer_start = + cpu_utils::ScratchPadManager::get_scratchpad_manager() + ->get_data(); + uint8_t* thread_buffer_start = common_buffer_start + common_buffer_offset; + + int32_t* __restrict__ token_num_per_group_buffer = reinterpret_cast( + common_buffer_start + token_num_per_group_buffer_offset); + int32_t* __restrict__ cu_token_num_per_group_buffer = + reinterpret_cast(common_buffer_start + + cu_token_num_per_group_buffer_offset); + int32_t* __restrict__ expand_token_id_buffer = reinterpret_cast( + common_buffer_start + expand_token_id_buffer_offset); + int32_t* __restrict__ expand_token_id_index_buffer = + reinterpret_cast(common_buffer_start + + expand_token_id_index_buffer_offset); + + // prepare token-expert mappings + { + std::memset(token_num_per_group_buffer, 0, expert_num * sizeof(int32_t)); + for (int32_t i = 0; i < token_num * topk_num; ++i) { + int32_t 
curr_expert_id = topk_id[i]; + ++token_num_per_group_buffer[curr_expert_id]; + } + + int32_t token_num_sum = 0; + cu_token_num_per_group_buffer[0] = 0; + int32_t* token_index_buffer = cu_token_num_per_group_buffer + 1; + for (int32_t i = 0; i < expert_num; ++i) { + token_index_buffer[i] = token_num_sum; + token_num_sum += token_num_per_group_buffer[i]; + } + + for (int32_t i = 0; i < token_num; ++i) { + int32_t* curr_topk_id = topk_id + i * topk_num; + int32_t* curr_index_buffer = expand_token_id_index_buffer + i * topk_num; + for (int32_t j = 0; j < topk_num; ++j) { + int32_t curr_expert_id = curr_topk_id[j]; + int32_t curr_index = token_index_buffer[curr_expert_id]; + ++token_index_buffer[curr_expert_id]; + expand_token_id_buffer[curr_index] = i; + curr_index_buffer[j] = curr_index; + } + } + } + + // w13 GEMM + act + { + alignas(64) cpu_utils::Counter counter; + cpu_utils::Counter* counter_ptr = &counter; + +#pragma omp parallel for schedule(static, 1) + for (int32_t thread_id = 0; thread_id < thread_num; ++thread_id) { + const int32_t task_num_per_expert = + (output_size_13 + w13_n_tile_size - 1) / w13_n_tile_size; + const int32_t task_num = task_num_per_expert * expert_num; + + uint8_t* __restrict__ thread_buffer = + thread_buffer_start + thread_id * w13_thread_buffer_offset; + scalar_t* __restrict__ w13_input_buffer = + reinterpret_cast(thread_buffer + w13_input_buffer_offset); + float* __restrict__ w13_output_buffer = + reinterpret_cast(thread_buffer + w13_output_buffer_offset); + scalar_t* __restrict__ w13_gemm_output_buffer = + reinterpret_cast(common_buffer_start + + w13_gemm_output_buffer_offset); + + gemm_t gemm; + + const int32_t input_size_13_bytes = input_size_13 * sizeof(scalar_t); + const int32_t w13_n_group_stride = 16 * input_size_13; + const int32_t w13_n_tile_stride = gemm_n_tile_size * input_size_13; + + for (;;) { + int32_t task_id = counter_ptr->acquire_counter(); + if (task_id >= task_num) { + break; + } + + const int32_t curr_expert_id = 
task_id / task_num_per_expert; + const int32_t curr_output_group_id = task_id % task_num_per_expert; + const int32_t curr_token_num = + token_num_per_group_buffer[curr_expert_id]; + if (curr_token_num == 0) { + continue; + } + + const int32_t actual_n_tile_size = + std::min(w13_n_tile_size, + output_size_13 - curr_output_group_id * w13_n_tile_size); + const int32_t* __restrict__ curr_expand_token_id_buffer = + expand_token_id_buffer + + cu_token_num_per_group_buffer[curr_expert_id]; + scalar_t* __restrict__ curr_w13_gemm_output_buffer = + w13_gemm_output_buffer + + cu_token_num_per_group_buffer[curr_expert_id] * + (output_size_13 / 2) + + curr_output_group_id * w13_n_tile_size / 2; + + w_t* __restrict__ w13_weight_ptr_0 = nullptr; + w_t* __restrict__ w13_weight_ptr_1 = nullptr; + w_t* __restrict__ w13_bias_ptr_0 = nullptr; + w_t* __restrict__ w13_bias_ptr_1 = nullptr; + if (act_type == FusedMOEAct::SwigluOAIAndMul) { + // For SwigluOAIAndMul, gate and up weights are interleaved + w13_weight_ptr_0 = + w13 + curr_expert_id * input_size_13 * output_size_13 + + curr_output_group_id * w13_n_tile_size * input_size_13; + w13_weight_ptr_1 = + w13_weight_ptr_0 + actual_n_tile_size / 2 * input_size_13; + if (w13_bias != nullptr) { + w13_bias_ptr_0 = w13_bias + curr_expert_id * output_size_13 + + curr_output_group_id * w13_n_tile_size; + w13_bias_ptr_1 = w13_bias_ptr_0 + actual_n_tile_size / 2; + } + } else { + w13_weight_ptr_0 = + w13 + curr_expert_id * input_size_13 * output_size_13 + + curr_output_group_id * (w13_n_tile_size / 2) * input_size_13; + w13_weight_ptr_1 = + w13_weight_ptr_0 + output_size_13 / 2 * input_size_13; + if (w13_bias != nullptr) { + w13_bias_ptr_0 = w13_bias + curr_expert_id * output_size_13 + + curr_output_group_id * (w13_n_tile_size / 2); + w13_bias_ptr_1 = w13_bias_ptr_0 + output_size_13 / 2; + } + } + + scalar_t* __restrict__ curr_w13_input_buffer = w13_input_buffer; + for (int32_t token_idx = 0; token_idx < curr_token_num; + token_idx += 
gemm_m_tile_size) { + const int32_t actual_token_num = + std::min(gemm_m_tile_size, curr_token_num - token_idx); + // copy inputs + { + scalar_t* __restrict__ curr_w13_input_buffer_iter = + curr_w13_input_buffer; + for (int32_t i = 0; i < actual_token_num; ++i) { + const int32_t curr_token_id = curr_expand_token_id_buffer[i]; + int8_t* __restrict__ curr_input_iter = reinterpret_cast( + input + curr_token_id * input_size_13); + int8_t* __restrict__ curr_output_iter = + reinterpret_cast(curr_w13_input_buffer_iter); + int32_t j = 0; + for (; j < input_size_13_bytes - 64; j += 64) { + vec_op::INT8Vec64 vec(curr_input_iter); + vec.save(curr_output_iter); + curr_input_iter += 64; + curr_output_iter += 64; + } + vec_op::INT8Vec64 vec(curr_input_iter); + vec.save(curr_output_iter, input_size_13_bytes - j); + + // update + curr_w13_input_buffer_iter += input_size_13; + } + // update + curr_expand_token_id_buffer += actual_token_num; + } + + // gemm + act + { + scalar_t* __restrict__ w13_weight_ptr_0_iter = w13_weight_ptr_0; + scalar_t* __restrict__ w13_weight_ptr_1_iter = w13_weight_ptr_1; + scalar_t* __restrict__ w13_bias_ptr_0_iter = w13_bias_ptr_0; + scalar_t* __restrict__ w13_bias_ptr_1_iter = w13_bias_ptr_1; + scalar_t* __restrict__ curr_w13_input_buffer_iter = + curr_w13_input_buffer; + float* __restrict__ w13_output_buffer_0_iter = w13_output_buffer; + float* __restrict__ w13_output_buffer_1_iter = + w13_output_buffer + actual_n_tile_size / 2; + for (int32_t i = 0; i < actual_n_tile_size; + i += min_w13_n_tile_size) { + gemm.gemm(curr_w13_input_buffer_iter, w13_weight_ptr_0_iter, + w13_output_buffer_0_iter, actual_token_num, + input_size_13, input_size_13, w13_n_group_stride, + actual_n_tile_size, false); + + if (w13_bias != nullptr) { + cpu_micro_gemm::add_bias_epilogue( + w13_output_buffer_0_iter, w13_output_buffer_0_iter, + w13_bias_ptr_0_iter, actual_token_num, actual_n_tile_size, + actual_n_tile_size); + w13_bias_ptr_0_iter += gemm_n_tile_size; + } + + 
gemm.gemm(curr_w13_input_buffer_iter, w13_weight_ptr_1_iter, + w13_output_buffer_1_iter, actual_token_num, + input_size_13, input_size_13, w13_n_group_stride, + actual_n_tile_size, false); + + if (w13_bias != nullptr) { + cpu_micro_gemm::add_bias_epilogue( + w13_output_buffer_1_iter, w13_output_buffer_1_iter, + w13_bias_ptr_1_iter, actual_token_num, actual_n_tile_size, + actual_n_tile_size); + w13_bias_ptr_1_iter += gemm_n_tile_size; + } + + // update + w13_weight_ptr_0_iter += w13_n_tile_stride; + w13_weight_ptr_1_iter += w13_n_tile_stride; + w13_output_buffer_0_iter += gemm_n_tile_size; + w13_output_buffer_1_iter += gemm_n_tile_size; + } + + apply_gated_act(act_type, w13_output_buffer, + curr_w13_gemm_output_buffer, actual_token_num, + actual_n_tile_size, actual_n_tile_size, + output_size_13 / 2); + + // update + curr_w13_gemm_output_buffer += + gemm_m_tile_size * (output_size_13 / 2); + } + } + } + } + } + + // w2 GEMM + { + alignas(64) cpu_utils::Counter counter; + cpu_utils::Counter* counter_ptr = &counter; + +#pragma omp parallel for schedule(static, 1) + for (int32_t thread_id = 0; thread_id < thread_num; ++thread_id) { + const int32_t task_num_per_expert = + (output_size_2 + w2_n_tile_size - 1) / w2_n_tile_size; + const int32_t task_num = task_num_per_expert * expert_num; + scalar_t* __restrict__ w13_gemm_output_buffer = + reinterpret_cast(common_buffer_start + + w13_gemm_output_buffer_offset); + float* __restrict__ w2_gemm_output_buffer = reinterpret_cast( + common_buffer_start + w2_gemm_output_buffer_offset); + + gemm_t gemm; + + const int32_t w2_n_tile_stride = gemm_n_tile_size * input_size_2; + const int32_t w2_n_group_stride = 16 * input_size_2; + + for (;;) { + int32_t task_id = counter_ptr->acquire_counter(); + if (task_id >= task_num) { + break; + } + + const int32_t curr_expert_id = task_id / task_num_per_expert; + const int32_t curr_output_group_id = task_id % task_num_per_expert; + const int32_t curr_token_num = + 
token_num_per_group_buffer[curr_expert_id]; + if (curr_token_num == 0) { + continue; + } + + const int32_t actual_n_tile_size = + std::min(w2_n_tile_size, + output_size_2 - curr_output_group_id * w2_n_tile_size); + scalar_t* __restrict__ curr_w13_gemm_output_buffer = + w13_gemm_output_buffer + + cu_token_num_per_group_buffer[curr_expert_id] * input_size_2; + float* __restrict__ curr_w2_gemm_output_buffer = + w2_gemm_output_buffer + + cu_token_num_per_group_buffer[curr_expert_id] * output_size_2 + + curr_output_group_id * w2_n_tile_size; + scalar_t* __restrict__ w2_weight_ptr = + w2 + curr_expert_id * output_size_2 * input_size_2 + + curr_output_group_id * w2_n_tile_size * input_size_2; + scalar_t* __restrict__ w2_bias_ptr = nullptr; + if (w2_bias != nullptr) { + w2_bias_ptr = w2_bias + curr_expert_id * output_size_2 + + curr_output_group_id * w2_n_tile_size; + } + + for (int32_t token_idx = 0; token_idx < curr_token_num; + token_idx += gemm_m_tile_size) { + const int32_t actual_token_num = + std::min(gemm_m_tile_size, curr_token_num - token_idx); + + scalar_t* __restrict__ w2_weight_ptr_iter = w2_weight_ptr; + scalar_t* __restrict__ w2_bias_ptr_iter = w2_bias_ptr; + float* __restrict__ curr_w2_gemm_output_buffer_iter = + curr_w2_gemm_output_buffer; + for (int32_t i = 0; i < actual_n_tile_size; i += gemm_n_tile_size) { + gemm.gemm(curr_w13_gemm_output_buffer, w2_weight_ptr_iter, + curr_w2_gemm_output_buffer_iter, actual_token_num, + input_size_2, input_size_2, w2_n_group_stride, + output_size_2, false); + + if (w2_bias != nullptr) { + cpu_micro_gemm::add_bias_epilogue( + curr_w2_gemm_output_buffer_iter, + curr_w2_gemm_output_buffer_iter, w2_bias_ptr_iter, + actual_token_num, output_size_2, output_size_2); + w2_bias_ptr_iter += gemm_n_tile_size; + } + + w2_weight_ptr_iter += w2_n_tile_stride; + curr_w2_gemm_output_buffer_iter += gemm_n_tile_size; + } + + // update + curr_w13_gemm_output_buffer += gemm_m_tile_size * input_size_2; + curr_w2_gemm_output_buffer += 
gemm_m_tile_size * output_size_2; + } + } + } + } + + // weighted sum + { + alignas(64) cpu_utils::Counter counter; + cpu_utils::Counter* counter_ptr = &counter; + +#pragma omp parallel for schedule(static, 1) + for (int32_t thread_id = 0; thread_id < thread_num; ++thread_id) { + const int32_t task_num = token_num; + uint8_t* __restrict__ thread_buffer = + thread_buffer_start + thread_id * ws_thread_buffer_offset; + float* __restrict__ ws_output_buffer = + reinterpret_cast(thread_buffer + ws_output_buffer_offset); + float* __restrict__ w2_gemm_output_buffer = reinterpret_cast( + common_buffer_start + w2_gemm_output_buffer_offset); + + for (;;) { + int32_t task_id = counter_ptr->acquire_counter(); + if (task_id >= task_num) { + break; + } + + int32_t token_id = task_id; + int32_t* __restrict__ curr_expand_token_id_index_buffer = + expand_token_id_index_buffer + token_id * topk_num; + float* __restrict__ curr_weight = topk_weights + token_id * topk_num; + scalar_t* __restrict__ curr_output_buffer = + output + token_id * output_size_2; + + if (topk_num > 1) { + { + int32_t w2_output_idx = curr_expand_token_id_index_buffer[0]; + float* __restrict__ w2_output_iter = + w2_gemm_output_buffer + w2_output_idx * output_size_2; + float* __restrict__ ws_output_buffer_iter = ws_output_buffer; + vec_op::FP32Vec16 weight_vec(curr_weight[0]); + for (int32_t i = 0; i < output_size_2; i += 16) { + vec_op::FP32Vec16 vec(w2_output_iter); + vec = vec * weight_vec; + vec.save(ws_output_buffer_iter); + + // update + w2_output_iter += 16; + ws_output_buffer_iter += 16; + } + } + + { + for (int32_t idx = 1; idx < topk_num - 1; ++idx) { + int32_t w2_output_idx = curr_expand_token_id_index_buffer[idx]; + float* __restrict__ w2_output_iter = + w2_gemm_output_buffer + w2_output_idx * output_size_2; + float* __restrict__ ws_output_buffer_iter = ws_output_buffer; + vec_op::FP32Vec16 weight_vec(curr_weight[idx]); + for (int32_t i = 0; i < output_size_2; i += 16) { + vec_op::FP32Vec16 
vec(w2_output_iter); + vec_op::FP32Vec16 sum(ws_output_buffer_iter); + sum = sum + vec * weight_vec; + sum.save(ws_output_buffer_iter); + + // update + w2_output_iter += 16; + ws_output_buffer_iter += 16; + } + } + } + + { + int32_t idx = topk_num - 1; + int32_t w2_output_idx = curr_expand_token_id_index_buffer[idx]; + float* __restrict__ w2_output_iter = + w2_gemm_output_buffer + w2_output_idx * output_size_2; + float* __restrict__ ws_output_buffer_iter = ws_output_buffer; + scalar_t* __restrict__ curr_output_buffer_iter = curr_output_buffer; + vec_op::FP32Vec16 weight_vec(curr_weight[idx]); + for (int32_t i = 0; i < output_size_2; i += 16) { + vec_op::FP32Vec16 vec(w2_output_iter); + vec_op::FP32Vec16 sum(ws_output_buffer_iter); + sum = sum + vec * weight_vec; + scalar_vec_t out_vec(sum); + out_vec.save(curr_output_buffer_iter); + + // update + w2_output_iter += 16; + ws_output_buffer_iter += 16; + curr_output_buffer_iter += 16; + } + } + } else { + int32_t w2_output_idx = curr_expand_token_id_index_buffer[0]; + float* __restrict__ w2_output_iter = + w2_gemm_output_buffer + w2_output_idx * output_size_2; + scalar_t* __restrict__ curr_output_buffer_iter = curr_output_buffer; + vec_op::FP32Vec16 weight_vec(curr_weight[0]); + for (int32_t i = 0; i < output_size_2; i += 16) { + vec_op::FP32Vec16 vec(w2_output_iter); + vec = vec * weight_vec; + scalar_vec_t out_vec(vec); + out_vec.save(curr_output_buffer_iter); + + // update + w2_output_iter += 16; + curr_output_buffer_iter += 16; + } + } + } + } + } +} +} // namespace + +void prepack_moe_weight( + const torch::Tensor& weight, // [expert_num, output_size, input_size] + torch::Tensor& packed_weight, const std::string& isa) { + TORCH_CHECK(weight.is_contiguous()); + const int32_t expert_num = weight.size(0); + const int32_t output_size = weight.size(1); + const int32_t input_size = weight.size(2); + TORCH_CHECK_EQ(output_size % 32, 0); + const int64_t expert_stride = weight.stride(0); + cpu_utils::ISA isa_type = 
cpu_utils::get_isa(isa); + + VLLM_DISPATCH_FLOATING_TYPES( + weight.scalar_type(), "prepack_moe_weight", [&]() { + CPU_ISA_DISPATCH_IMPL(isa_type, [&]() { + scalar_t* weight_ptr = weight.data_ptr(); + scalar_t* packed_weight_ptr = packed_weight.data_ptr(); + prepack_moe_weight_impl( + weight_ptr, packed_weight_ptr, expert_num, output_size, + input_size, expert_stride); + }); + }); +} + +void cpu_fused_moe( + torch::Tensor& output, // [token_num, output_size_2] + const torch::Tensor& input, // [token_num, input_size_13] + const torch::Tensor& + w13, // [expert_num, output_size_13, input_size_13], packed + const torch::Tensor& + w2, // [expert_num, output_size_2, input_size_2], packed + const std::optional& + w13_bias, // [expert_num, output_size_13] + const std::optional& w2_bias, // [expert_num, output_size_2] + const torch::Tensor& topk_weights, // [token_num, k], float32 + const torch::Tensor& topk_id, // [token_num, k], int32 + const std::string& act, const std::string& isa) { + const int32_t token_num = input.size(0); + const int32_t input_size_13 = input.size(1); + const int64_t input_stride = input.stride(0); + TORCH_CHECK_EQ(input_stride, input_size_13); + const int32_t expert_num = w13.size(0); + const int32_t output_size_13 = w13.size(1); + const int32_t input_size_2 = w2.size(2); + const int32_t output_size_2 = w2.size(1); + const int32_t topk_num = topk_id.size(1); + const FusedMOEAct act_type = get_act_type(act); + cpu_utils::ISA isa_type = cpu_utils::get_isa(isa); + + VLLM_DISPATCH_FLOATING_TYPES(w13.scalar_type(), "cpu_fused_moe", [&]() { + CPU_ISA_DISPATCH_IMPL(isa_type, [&]() { + fused_moe_impl( + output.data_ptr(), input.data_ptr(), + w13.data_ptr(), w2.data_ptr(), + w13_bias.has_value() ? w13_bias->data_ptr() : nullptr, + w2_bias.has_value() ? 
w2_bias->data_ptr() : nullptr, + topk_weights.data_ptr(), topk_id.data_ptr(), act_type, + token_num, expert_num, topk_num, input_size_13, output_size_13, + input_size_2, output_size_2); + }); + }); +} diff --git a/csrc/cpu/cpu_types_x86.hpp b/csrc/cpu/cpu_types_x86.hpp index 6f51277f784402a29fa45ffa9e4bd58f04547194..d94af338ac1c949f42352dc1418908eba3e965d2 100644 --- a/csrc/cpu/cpu_types_x86.hpp +++ b/csrc/cpu/cpu_types_x86.hpp @@ -352,6 +352,10 @@ struct FP32Vec16 : public Vec { explicit FP32Vec16(bool, void* ptr) : reg((__m512)_mm512_stream_load_si512(ptr)) {} + // strided load + explicit FP32Vec16(const float* ptr, INT32Vec16 idx) + : reg(_mm512_i32gather_ps(idx.reg, ptr, 4)) {} + explicit FP32Vec16(__m512 data) : reg(data) {} // de-pack 4 bit values @@ -408,6 +412,10 @@ struct FP32Vec16 : public Vec { return FP32Vec16(_mm512_sub_ps(reg, b.reg)); } + FP32Vec16 operator-() const { + return FP32Vec16(_mm512_xor_ps(reg, _mm512_set1_ps(-0.0f))); + } + FP32Vec16 operator/(const FP32Vec16& b) const { return FP32Vec16(_mm512_div_ps(reg, b.reg)); } diff --git a/csrc/cpu/cpu_wna16.cpp b/csrc/cpu/cpu_wna16.cpp index 816d195506e5285cbe66ddcd3b1526908d312b8a..88d48f3db8772e8494439ce7d2aff555db38b690 100644 --- a/csrc/cpu/cpu_wna16.cpp +++ b/csrc/cpu/cpu_wna16.cpp @@ -1,6 +1,5 @@ -#include "cpu_types.hpp" -#include "scratchpad_manager.h" -#include "utils.hpp" +#include "cpu/cpu_types.hpp" +#include "cpu/utils.hpp" #ifdef CPU_CAPABILITY_AMXBF16 #include "cpu/micro_gemm/cpu_micro_gemm_amx.hpp" @@ -158,7 +157,7 @@ void cpu_gemm_wna16_impl( // a simple schedule policy, just to hold more B tiles in L2 and make sure // each thread has tasks const int32_t n_partition_size = [&]() { - const int64_t cache_size = cpu_utils::get_l2_size(); + const int64_t cache_size = cpu_utils::get_available_l2_size(); int64_t ps_cache_limit = cache_size / (k_size * sizeof(scalar_t)); int64_t ps_thread_limit = n_size / thread_num; ps_cache_limit = @@ -179,8 +178,8 @@ void cpu_gemm_wna16_impl( const 
int64_t b_buffer_offset = 0; const int64_t c_buffer_offset = b_buffer_size; const int64_t buffer_size = b_buffer_size + c_buffer_size; - DNNLScratchPadManager::get_dnnl_scratchpad_manager()->realloc(buffer_size * - thread_num); + cpu_utils::ScratchPadManager::get_scratchpad_manager()->realloc(buffer_size * + thread_num); alignas(64) cpu_utils::Counter counter; cpu_utils::Counter* counter_ptr = &counter; @@ -190,9 +189,10 @@ void cpu_gemm_wna16_impl( scalar_t* __restrict__ b_buffer = nullptr; float* __restrict__ c_buffer = nullptr; { - uint8_t* buffer_ptr = DNNLScratchPadManager::get_dnnl_scratchpad_manager() - ->get_data() + - thread_id * buffer_size; + uint8_t* buffer_ptr = + cpu_utils::ScratchPadManager::get_scratchpad_manager() + ->get_data() + + thread_id * buffer_size; b_buffer = reinterpret_cast(buffer_ptr + b_buffer_offset); c_buffer = reinterpret_cast(buffer_ptr + c_buffer_offset); } diff --git a/csrc/cpu/dnnl_helper.cpp b/csrc/cpu/dnnl_helper.cpp index cfb6e78cba9a110d323ca996c5515a9af93de9d9..e337e10e1cf7b4ebfd97413f922d9688add2f4db 100644 --- a/csrc/cpu/dnnl_helper.cpp +++ b/csrc/cpu/dnnl_helper.cpp @@ -4,8 +4,8 @@ #include "common/memory_desc.hpp" #include "common/memory.hpp" -#include "dnnl_helper.h" -#include "scratchpad_manager.h" +#include "cpu/utils.hpp" +#include "cpu/dnnl_helper.h" static dnnl::engine& default_engine() { static dnnl::engine engine(dnnl::engine::kind::cpu, 0); @@ -274,7 +274,7 @@ void W8A8MatMulPrimitiveHandler::execute(ExecArgs& args) { auto&& [scratchpad_storage, scratchpad_mem_desc] = get_runtime_memory_ptr(5); scratchpad_storage->set_data_handle( - DNNLScratchPadManager::get_dnnl_scratchpad_manager()->get_data()); + cpu_utils::ScratchPadManager::get_scratchpad_manager()->get_data()); matmul.execute(default_stream(), memory_cache_); default_stream().wait(); @@ -294,7 +294,7 @@ dnnl::matmul W8A8MatMulPrimitiveHandler::get_matmul_cache( return m_size_cache_->get_or_create(key, [&]() { dnnl::matmul::primitive_desc desc = 
this->create_primitive_desc(key, false); - auto manager = DNNLScratchPadManager::get_dnnl_scratchpad_manager(); + auto manager = cpu_utils::ScratchPadManager::get_scratchpad_manager(); manager->realloc(desc.scratchpad_desc().get_size()); return dnnl::matmul(desc); }); @@ -470,7 +470,7 @@ void MatMulPrimitiveHandler::execute(ExecArgs& args) { auto&& [scratchpad_storage, scratchpad_mem_desc] = get_runtime_memory_ptr(3); scratchpad_storage->set_data_handle( - DNNLScratchPadManager::get_dnnl_scratchpad_manager()->get_data()); + cpu_utils::ScratchPadManager::get_scratchpad_manager()->get_data()); matmul.execute(default_stream(), memory_cache_); default_stream().wait(); @@ -486,7 +486,7 @@ dnnl::matmul MatMulPrimitiveHandler::get_matmul_cache( } return m_size_cache_->get_or_create(key, [&]() { dnnl::matmul::primitive_desc desc = this->create_primitive_desc(key, false); - auto manager = DNNLScratchPadManager::get_dnnl_scratchpad_manager(); + auto manager = cpu_utils::ScratchPadManager::get_scratchpad_manager(); manager->realloc(desc.scratchpad_desc().get_size()); return dnnl::matmul(desc); }); diff --git a/csrc/cpu/micro_gemm/cpu_micro_gemm_amx.hpp b/csrc/cpu/micro_gemm/cpu_micro_gemm_amx.hpp index 87a019773a8951be4ed4ea188935286939066882..357c7cf1d7844b4e1244e05a8dd362c618163a18 100644 --- a/csrc/cpu/micro_gemm/cpu_micro_gemm_amx.hpp +++ b/csrc/cpu/micro_gemm/cpu_micro_gemm_amx.hpp @@ -235,6 +235,39 @@ class MicroGemm { } } + static void pack_weight(const scalar_t* __restrict__ weight, + scalar_t* __restrict__ packed_weight, + const int32_t output_size, const int32_t input_size) { + constexpr int32_t elem_num_per_group = 4 / sizeof(scalar_t); + TORCH_CHECK_EQ(output_size % 16, 0); + TORCH_CHECK_EQ(input_size % (16 * elem_num_per_group), 0); + + const int32_t output_group_num = output_size / 16; + const int32_t input_32b_num = input_size / elem_num_per_group; + for (int32_t output_group_idx = 0; output_group_idx < output_group_num; + ++output_group_idx) { + const int32_t* 
__restrict__ weight_32b = + reinterpret_cast<const int32_t*>(weight); + int32_t* __restrict__ packed_weight_32b = + reinterpret_cast<int32_t*>(packed_weight); + for (int32_t output_idx = 0; output_idx < 16; ++output_idx) { + for (int32_t weight_offset = 0, packed_offset = 0; + weight_offset < input_32b_num; + ++weight_offset, packed_offset += 16) { + packed_weight_32b[packed_offset] = weight_32b[weight_offset]; + } + + // update + weight_32b += input_32b_num; + packed_weight_32b += 1; + } + + // update + weight += 16 * input_size; + packed_weight += 16 * input_size; + } + } + private: alignas(64) __tilecfg amx_tile_config_; int32_t curr_m_; diff --git a/csrc/cpu/micro_gemm/cpu_micro_gemm_impl.hpp b/csrc/cpu/micro_gemm/cpu_micro_gemm_impl.hpp index 784da55a420e5cd1a754437ce9b0cc495bb8e0dd..23e78a681b5fed790a83567a162b3c13c8645a4b 100644 --- a/csrc/cpu/micro_gemm/cpu_micro_gemm_impl.hpp +++ b/csrc/cpu/micro_gemm/cpu_micro_gemm_impl.hpp @@ -13,6 +13,9 @@ namespace cpu_micro_gemm { #define CPU_MICRO_GEMM_PARAMS \ a_ptr, b_ptr, c_ptr, m, k, lda, b_n_group_stride, ldc, accum_c +// Note: weights for MicroGemm should be packed as (output_size / 16) contiguous +// blocks, each with a logical shape of [16, input_size]. The actual layout +// within a block can be ISA-specific. 
template class MicroGemm { public: @@ -86,6 +89,41 @@ FORCE_INLINE void bias_epilogue(float* __restrict__ c_ptr, curr_d += ldd; } } + +template +FORCE_INLINE void add_bias_epilogue(float* c_ptr, float* d_ptr, + scalar_t* __restrict__ bias_ptr, + const int32_t m, const int64_t ldc, + const int64_t ldd) { + using scalar_vec_t = typename cpu_utils::VecTypeTrait::vec_t; + static_assert(n_size % 16 == 0); + constexpr int32_t n_group_num = n_size / 16; + static_assert(n_group_num <= 16); + + vec_op::FP32Vec16 bias_vecs[n_group_num]; + scalar_t* __restrict__ curr_bias = bias_ptr; + vec_op::unroll_loop([&](int32_t i) { + scalar_vec_t vec(curr_bias); + bias_vecs[i] = vec_op::FP32Vec16(vec); + curr_bias += 16; + }); + + float* curr_c = c_ptr; + float* curr_d = d_ptr; + for (int32_t i = 0; i < m; ++i) { + float* curr_c_iter = curr_c; + float* curr_d_iter = curr_d; + vec_op::unroll_loop([&](int32_t n_g_idx) { + vec_op::FP32Vec16 c_vec_fp32(curr_c_iter); + c_vec_fp32 = c_vec_fp32 + bias_vecs[n_g_idx]; + c_vec_fp32.save(curr_d_iter); + curr_c_iter += 16; + curr_d_iter += 16; + }); + curr_c += ldc; + curr_d += ldd; + } +} } // namespace cpu_micro_gemm #endif diff --git a/csrc/cpu/micro_gemm/cpu_micro_gemm_vec.hpp b/csrc/cpu/micro_gemm/cpu_micro_gemm_vec.hpp index 3985c2f2e5fe47a45381669756a9c3e1f5b7f5df..bdd3e85a1c522dfd3f28b78f7d9fcf43dbafeed2 100644 --- a/csrc/cpu/micro_gemm/cpu_micro_gemm_vec.hpp +++ b/csrc/cpu/micro_gemm/cpu_micro_gemm_vec.hpp @@ -109,6 +109,25 @@ class MicroGemm { void gemm(DEFINE_CPU_MICRO_GEMM_PARAMS) { TileGemm82::gemm(CPU_MICRO_GEMM_PARAMS); } + + // Note: pack contiguous weight [output_size, input_size] as contiguous + // packed weight [output_size / 16, input_size, 16] + static void pack_weight(const scalar_t* __restrict__ weight, + scalar_t* __restrict__ packed_weight, + const int32_t output_size, const int32_t input_size) { + TORCH_CHECK_EQ(output_size % 16, 0); + for (int32_t o_idx = 0; o_idx < output_size; ++o_idx) { + const scalar_t* __restrict__ 
curr_weight = weight + o_idx * input_size; + scalar_t* __restrict__ curr_packed_weight = + packed_weight + (o_idx / 16) * (16 * input_size) + o_idx % 16; + for (int32_t i_idx = 0; i_idx < input_size; ++i_idx) { + *curr_packed_weight = *curr_weight; + + curr_packed_weight += 16; + ++curr_weight; + } + } + } }; } // namespace cpu_micro_gemm diff --git a/csrc/cpu/scratchpad_manager.cpp b/csrc/cpu/scratchpad_manager.cpp deleted file mode 100644 index 05cd435f34b7a644176ca99cc518cf0f03dc5294..0000000000000000000000000000000000000000 --- a/csrc/cpu/scratchpad_manager.cpp +++ /dev/null @@ -1,23 +0,0 @@ -#include - -#include "scratchpad_manager.h" - -DNNLScratchPadManager::DNNLScratchPadManager() : size_(0), ptr_(nullptr) { - this->realloc(allocation_unit * 128); -} - -void DNNLScratchPadManager::realloc(size_t new_size) { - new_size = round(new_size); - if (new_size > size_) { - if (ptr_ != nullptr) { - std::free(ptr_); - } - ptr_ = std::aligned_alloc(64, new_size); - size_ = new_size; - } -} - -DNNLScratchPadManager* DNNLScratchPadManager::get_dnnl_scratchpad_manager() { - static DNNLScratchPadManager manager; - return &manager; -} diff --git a/csrc/cpu/scratchpad_manager.h b/csrc/cpu/scratchpad_manager.h deleted file mode 100644 index 0ecf59192f84532770783abea95dc25586abe049..0000000000000000000000000000000000000000 --- a/csrc/cpu/scratchpad_manager.h +++ /dev/null @@ -1,31 +0,0 @@ -#ifndef SCRATCHPAD_MANAGER_H -#define SCRATCHPAD_MANAGER_H - -#include -#include - -class DNNLScratchPadManager { - public: - static constexpr size_t allocation_unit = 4 * 1024; // 4KB - - static DNNLScratchPadManager* get_dnnl_scratchpad_manager(); - - DNNLScratchPadManager(); - - template - T* get_data() { - return reinterpret_cast(ptr_); - } - - static size_t round(size_t size) { - return ((size + allocation_unit - 1) / allocation_unit) * allocation_unit; - } - - void realloc(size_t new_size); - - private: - size_t size_; - void* ptr_; -}; - -#endif diff --git 
a/csrc/cpu/torch_bindings.cpp b/csrc/cpu/torch_bindings.cpp index d01349d59bc6625d5a3c9a755b605a8e9924e3fc..c98f970e41f078b75b0568d39125a150098b0ddd 100644 --- a/csrc/cpu/torch_bindings.cpp +++ b/csrc/cpu/torch_bindings.cpp @@ -110,6 +110,17 @@ void cpu_gemm_wna16(const torch::Tensor& input, const torch::Tensor& q_weight, const std::optional& bias, const int64_t pack_factor, const std::string& isa_hint); +void prepack_moe_weight(const torch::Tensor& weight, + torch::Tensor& packed_weight, const std::string& isa); + +void cpu_fused_moe(torch::Tensor& output, const torch::Tensor& input, + const torch::Tensor& w13, const torch::Tensor& w2, + const std::optional& w13_bias, + const std::optional& w2_bias, + const torch::Tensor& topk_weights, + const torch::Tensor& topk_id, const std::string& act, + const std::string& isa); + TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { // vLLM custom ops @@ -296,6 +307,19 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { "pack_factor, str isa_hint) -> ()"); ops.impl("cpu_gemm_wna16", torch::kCPU, &cpu_gemm_wna16); #endif + + // fused moe +#if defined(__AVX512F__) + ops.def( + "prepack_moe_weight(Tensor weight, Tensor(a1!) packed_weight, str isa) " + "-> ()"); + ops.impl("prepack_moe_weight", torch::kCPU, &prepack_moe_weight); + ops.def( + "cpu_fused_moe(Tensor(a0!) output, Tensor input, Tensor w13, Tensor w2, " + "Tensor? w13_bias, Tensor? 
w2_bias, Tensor topk_weights, Tensor topk_id, " + "str act, str isa) -> ()"); + ops.impl("cpu_fused_moe", torch::kCPU, &cpu_fused_moe); +#endif } TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _utils), utils) { diff --git a/csrc/cpu/utils.cpp b/csrc/cpu/utils.cpp index 3dacfc7b2b7a3bfbcc0179ca9949b875b7e60d5f..f2085b73b6a48a4dbea3dd516eb1c39f3b36e2fb 100644 --- a/csrc/cpu/utils.cpp +++ b/csrc/cpu/utils.cpp @@ -10,7 +10,7 @@ #define gettid() syscall(SYS_gettid) #endif -#include "cpu_types.hpp" +#include "cpu/utils.hpp" #ifdef VLLM_NUMA_DISABLED std::string init_cpu_threads_env(const std::string& cpu_ids) { @@ -24,6 +24,8 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) { #ifndef VLLM_NUMA_DISABLED std::string init_cpu_threads_env(const std::string& cpu_ids) { bitmask* omp_cpu_mask = numa_parse_cpustring_all(cpu_ids.c_str()); + TORCH_CHECK(omp_cpu_mask != nullptr, + "Failed to parse CPU string: " + cpu_ids); TORCH_CHECK(omp_cpu_mask->size > 0); std::vector omp_cpu_ids; omp_cpu_ids.reserve(omp_cpu_mask->size); @@ -44,20 +46,12 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) { // Memory node binding if (numa_available() != -1) { - int mem_node_id = numa_node_of_cpu(omp_cpu_ids.front()); std::set node_ids; for (const auto& cpu_id : omp_cpu_ids) { int node_id = numa_node_of_cpu(cpu_id); if (node_id != -1) { node_ids.insert(node_id); } - if (node_id != mem_node_id) { - TORCH_WARN("CPU ", cpu_id, " is on NUMA node ", node_id, ", but CPU ", - omp_cpu_ids.front(), " is on NUMA node ", mem_node_id, - ". All CPUs should be on the same NUMA node for optimal " - "performance. 
Memory will be bound to NUMA node ", - mem_node_id, "."); - } } // Concatenate all node_ids into a single comma-separated string if (!node_ids.empty()) { @@ -70,7 +64,7 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) { } bitmask* mask = numa_parse_nodestring(node_ids_str.c_str()); - bitmask* src_mask = numa_get_membind(); + bitmask* src_mask = numa_get_mems_allowed(); int pid = getpid(); @@ -83,15 +77,46 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) { std::to_string(errno)); } - // restrict memory allocation node. - numa_set_membind(mask); + // Restrict memory allocation to the selected NUMA node(s). + // Enhances memory locality for the threads bound to those NUMA CPUs. + if (node_ids.size() > 1) { + errno = 0; + numa_set_interleave_mask(mask); + if (errno != 0) { + TORCH_WARN("numa_set_interleave_mask failed. errno: " + + std::to_string(errno)); + } else { + TORCH_WARN( + "NUMA binding: Using INTERLEAVE policy for memory " + "allocation across multiple NUMA nodes (nodes: " + + node_ids_str + + "). Memory allocations will be " + "interleaved across the specified NUMA nodes."); + } + } else { + errno = 0; + numa_set_membind(mask); + if (errno != 0) { + TORCH_WARN("numa_set_membind failed. errno: " + + std::to_string(errno)); + } else { + TORCH_WARN( + "NUMA binding: Using MEMBIND policy for memory " + "allocation on the NUMA nodes (" + + node_ids_str + + "). Memory allocations will be " + "strictly bound to these NUMA nodes."); + } + } + numa_set_strict(1); numa_free_nodemask(mask); numa_free_nodemask(src_mask); } else { - TORCH_WARN("numa_parse_nodestring or numa_get_membind failed. errno: " + - std::to_string(errno)); + TORCH_WARN( + "numa_parse_nodestring or numa_get_run_node_mask failed. 
errno: " + + std::to_string(errno)); } } } @@ -138,4 +163,26 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) { return ss.str(); } -#endif +#endif // VLLM_NUMA_DISABLED + +namespace cpu_utils { +ScratchPadManager::ScratchPadManager() : size_(0), ptr_(nullptr) { + this->realloc(allocation_unit * 128); +} + +void ScratchPadManager::realloc(size_t new_size) { + new_size = round(new_size); + if (new_size > size_) { + if (ptr_ != nullptr) { + std::free(ptr_); + } + ptr_ = std::aligned_alloc(64, new_size); + size_ = new_size; + } +} + +ScratchPadManager* ScratchPadManager::get_scratchpad_manager() { + static ScratchPadManager manager; + return &manager; +} +} // namespace cpu_utils diff --git a/csrc/cpu/utils.hpp b/csrc/cpu/utils.hpp index d3def306b806918c33236377faef71c7d1b0b566..682751d67b1cdbd196f89aa4a7b120f8e3e89113 100644 --- a/csrc/cpu/utils.hpp +++ b/csrc/cpu/utils.hpp @@ -2,19 +2,24 @@ #define UTILS_HPP #include -#include -#include #include +#include -#if defined(__APPLE__) - #include -#endif - -#include "cpu_types.hpp" +#include "cpu/cpu_types.hpp" namespace cpu_utils { enum class ISA { AMX, VEC }; +inline ISA get_isa(const std::string& isa) { + if (isa == "amx") { + return ISA::AMX; + } else if (isa == "vec") { + return ISA::VEC; + } else { + TORCH_CHECK(false, "Invalid isa type: " + isa); + } +} + template struct VecTypeTrait { using vec_t = void; @@ -32,10 +37,12 @@ struct VecTypeTrait { }; #endif +#if !defined(__powerpc__) template <> struct VecTypeTrait { using vec_t = vec_op::FP16Vec16; }; +#endif struct Counter { std::atomic counter; @@ -48,26 +55,66 @@ struct Counter { int64_t acquire_counter() { return counter++; } }; -inline int64_t get_l2_size() { +inline int64_t get_available_l2_size() { static int64_t size = []() { -#if defined(__APPLE__) - // macOS doesn't have _SC_LEVEL2_CACHE_SIZE. Use sysctlbyname. 
- int64_t l2_cache_size = 0; - size_t len = sizeof(l2_cache_size); - if (sysctlbyname("hw.l2cachesize", &l2_cache_size, &len, NULL, 0) == 0 && - l2_cache_size > 0) { - return l2_cache_size >> 1; // use 50% of L2 cache - } - // Fallback if sysctlbyname fails - return 128LL * 1024 >> 1; // use 50% of 128KB -#else - long l2_cache_size = sysconf(_SC_LEVEL2_CACHE_SIZE); - assert(l2_cache_size != -1); + const uint32_t l2_cache_size = at::cpu::L2_cache_size(); return l2_cache_size >> 1; // use 50% of L2 cache -#endif }(); return size; } + +template +inline T round_up(T size) { + T alignment = alignment_v; + return (((size + alignment - 1) / alignment) * alignment); +} + +template +inline T round_down(T size) { + T alignment = alignment_v; + return (size / alignment) * alignment; +} + +template +inline void print_logits(const char* name, T* ptr, int32_t row, int32_t col, + int32_t stride) { + std::stringstream ss; + ss << std::fixed << std::setprecision(5) << name << ": [\n"; + auto* curr_logits_buffer = ptr; + for (int32_t m = 0; m < row; ++m) { + for (int32_t n = 0; n < col; ++n) { + ss << curr_logits_buffer[n] << ", "; + } + ss << "\n"; + curr_logits_buffer += stride; + } + ss << "]\n"; + std::printf("%s", ss.str().c_str()); +} + +class ScratchPadManager { + public: + static constexpr size_t allocation_unit = 4 * 1024; // 4KB + + static ScratchPadManager* get_scratchpad_manager(); + + ScratchPadManager(); + + template + T* get_data() { + return reinterpret_cast(ptr_); + } + + static size_t round(size_t size) { + return ((size + allocation_unit - 1) / allocation_unit) * allocation_unit; + } + + void realloc(size_t new_size); + + private: + size_t size_; + void* ptr_; +}; } // namespace cpu_utils #endif diff --git a/csrc/cumem_allocator.cpp b/csrc/cumem_allocator.cpp index 78dc840a98b677de9555dbd542b22f36d6ed68fd..6c2c18a6602d28ea09fa2f0cddfe826f8c993c49 100644 --- a/csrc/cumem_allocator.cpp +++ b/csrc/cumem_allocator.cpp @@ -107,6 +107,16 @@ void create_and_map(unsigned 
long long device, ssize_t size, CUdeviceptr d_mem, prop.location.id = device; prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_NONE; +#ifndef USE_ROCM + int flag = 0; + CUDA_CHECK(cuDeviceGetAttribute( + &flag, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED, + device)); + if (flag) { // support GPUDirect RDMA if possible + prop.allocFlags.gpuDirectRDMACapable = 1; + } +#endif + #ifndef USE_ROCM // Allocate memory using cuMemCreate CUDA_CHECK(cuMemCreate(p_memHandle, size, &prop, 0)); diff --git a/csrc/fused_qknorm_rope_kernel.cu b/csrc/fused_qknorm_rope_kernel.cu index baff8363162efa2864a8ea30ae987da78628d281..a51e1a347e1d4cc028bdbb02cff8afcab8d46348 100644 --- a/csrc/fused_qknorm_rope_kernel.cu +++ b/csrc/fused_qknorm_rope_kernel.cu @@ -107,7 +107,8 @@ __global__ void fusedQKNormRopeKernel( void const* k_weight_void, // RMSNorm weights for key void const* cos_sin_cache_void, // Pre-computed cos/sin cache int64_t const* position_ids, // Position IDs for RoPE - int const num_tokens // Number of tokens + int const num_tokens, // Number of tokens + int const rotary_dim // Dimension for RoPE ) { #if (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ < 800) && !defined(USE_ROCM) if constexpr ((std::is_same_v) || @@ -227,56 +228,59 @@ __global__ void fusedQKNormRopeKernel( // Calculate cache pointer for this position - similar to // pos_encoding_kernels.cu - T_cache const* cache_ptr = cos_sin_cache + pos_id * head_dim; - int const embed_dim = head_dim / 2; + T_cache const* cache_ptr = cos_sin_cache + pos_id * rotary_dim; + int const embed_dim = rotary_dim / 2; T_cache const* cos_ptr = cache_ptr; T_cache const* sin_ptr = cache_ptr + embed_dim; - - if constexpr (interleave) { - // Perform interleaving. Use pre-computed cos/sin values. + int const rotary_lanes = rotary_dim / numElemsPerThread; // rotary range + if (laneId < rotary_lanes) { + if constexpr (interleave) { + // Perform interleaving. Use pre-computed cos/sin values. 
#pragma unroll - for (int i = 0; i < numElemsPerThread / 2; ++i) { - int const idx0 = 2 * i; - int const idx1 = 2 * i + 1; - - float const val0 = elements[idx0]; - float const val1 = elements[idx1]; - - int const dim_idx = laneId * numElemsPerThread + idx0; - int const half_dim = dim_idx / 2; - float const cos_val = - CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim)); - float const sin_val = - CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim)); - - elements[idx0] = val0 * cos_val - val1 * sin_val; - elements[idx1] = val0 * sin_val + val1 * cos_val; - } - } else { - // Before data exchange with in warp, we need to sync. - __syncwarp(); - // Get the data from the other half of the warp. Use pre-computed cos/sin - // values. -#pragma unroll - for (int i = 0; i < numElemsPerThread; i++) { - elements2[i] = __shfl_xor_sync(FINAL_MASK, elements[i], 16); - if (laneId < 16) { - elements2[i] = -elements2[i]; + for (int i = 0; i < numElemsPerThread / 2; ++i) { + int const idx0 = 2 * i; + int const idx1 = 2 * i + 1; + // Global dimension index in the head + int const dim_idx = laneId * numElemsPerThread + idx0; + + float const val0 = elements[idx0]; + float const val1 = elements[idx1]; + + int const half_dim = dim_idx / 2; + float const cos_val = + CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim)); + float const sin_val = + CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim)); + + elements[idx0] = val0 * cos_val - val1 * sin_val; + elements[idx1] = val0 * sin_val + val1 * cos_val; } + } else { + // Before data exchange with in warp, we need to sync. + __syncwarp(); + int pairOffset = (rotary_dim / 2) / numElemsPerThread; + // Get the data from the other half of the warp. Use pre-computed + // cos/sin values. 
+#pragma unroll + for (int i = 0; i < numElemsPerThread; i++) { + elements2[i] = __shfl_xor_sync(FINAL_MASK, elements[i], pairOffset); - int dim_idx = laneId * numElemsPerThread + i; - dim_idx = (dim_idx * 2) % head_dim; - int half_dim = dim_idx / 2; - // Use pre-computed cos/sin from cache - float cos_val = CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim)); - float sin_val = CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim)); + if (laneId < pairOffset) { + elements2[i] = -elements2[i]; + } + int dim_idx = laneId * numElemsPerThread + i; - elements[i] = elements[i] * cos_val + elements2[i] * sin_val; + dim_idx = (dim_idx * 2) % rotary_dim; + int half_dim = dim_idx / 2; + float cos_val = CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim)); + float sin_val = CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim)); + + elements[i] = elements[i] * cos_val + elements2[i] * sin_val; + } + // __shfl_xor_sync does not provide memfence. Need to sync again. + __syncwarp(); } - // __shfl_xor_sync does not provide memfence. Need to sync again. - __syncwarp(); } - // Store. 
{ vec_T vec; @@ -312,10 +316,10 @@ template void launchFusedQKNormRope(void* qkv, int const num_tokens, int const num_heads_q, int const num_heads_k, int const num_heads_v, int const head_dim, - float const eps, void const* q_weight, - void const* k_weight, void const* cos_sin_cache, - bool const interleave, int64_t const* position_ids, - cudaStream_t stream) { + int const rotary_dim, float const eps, + void const* q_weight, void const* k_weight, + void const* cos_sin_cache, bool const interleave, + int64_t const* position_ids, cudaStream_t stream) { constexpr int blockSize = 256; int const warpsPerBlock = blockSize / 32; @@ -332,7 +336,7 @@ void launchFusedQKNormRope(void* qkv, int const num_tokens, fusedQKNormRopeKernel <<>>( qkv, num_heads_q, num_heads_k, num_heads_v, eps, q_weight, - k_weight, cos_sin_cache, position_ids, num_tokens); + k_weight, cos_sin_cache, position_ids, num_tokens, rotary_dim); }); break; case 128: @@ -340,7 +344,7 @@ void launchFusedQKNormRope(void* qkv, int const num_tokens, fusedQKNormRopeKernel <<>>( qkv, num_heads_q, num_heads_k, num_heads_v, eps, q_weight, - k_weight, cos_sin_cache, position_ids, num_tokens); + k_weight, cos_sin_cache, position_ids, num_tokens, rotary_dim); }); break; case 256: @@ -348,7 +352,7 @@ void launchFusedQKNormRope(void* qkv, int const num_tokens, fusedQKNormRopeKernel <<>>( qkv, num_heads_q, num_heads_k, num_heads_v, eps, q_weight, - k_weight, cos_sin_cache, position_ids, num_tokens); + k_weight, cos_sin_cache, position_ids, num_tokens, rotary_dim); }); break; default: @@ -392,8 +396,11 @@ void fused_qk_norm_rope( "Query weights size must match head dimension"); TORCH_CHECK(k_weight.size(0) == head_dim, "Key weights size must match head dimension"); - TORCH_CHECK(cos_sin_cache.size(1) == head_dim, - "Cos/sin cache dimension must match head_dim"); + + TORCH_CHECK(cos_sin_cache.size(1) % 2 == 0, "rotary_dim must be even"); + TORCH_CHECK(cos_sin_cache.size(1) <= head_dim, + "rotary_dim must be less than or 
equal to head_dim"); + TORCH_CHECK(qkv.scalar_type() == q_weight.scalar_type() && qkv.scalar_type() == k_weight.scalar_type(), "qkv, q_weight and k_weight must have the same dtype"); @@ -419,7 +426,8 @@ void fused_qk_norm_rope( qkv.data_ptr(), static_cast(num_tokens), static_cast(num_heads_q), static_cast(num_heads_k), static_cast(num_heads_v), static_cast(head_dim), - static_cast(eps), q_weight.data_ptr(), k_weight.data_ptr(), + static_cast(cos_sin_cache.size(1)), static_cast(eps), + q_weight.data_ptr(), k_weight.data_ptr(), cos_sin_cache.data_ptr(), !is_neox, reinterpret_cast(position_ids.data_ptr()), stream); diff --git a/csrc/moe/grouped_topk_kernels.cu b/csrc/moe/grouped_topk_kernels.cu index 5fa367abd96f56250b25d312e1efbea4d83a3ce7..27e646bcd56fa8aac1a1e897d764a4517c633a8b 100644 --- a/csrc/moe/grouped_topk_kernels.cu +++ b/csrc/moe/grouped_topk_kernels.cu @@ -446,15 +446,19 @@ __device__ inline T apply_sigmoid(T val) { template __device__ inline T apply_scoring(T val) { - if constexpr (SF == SCORING_SIGMOID) { + if constexpr (SF == SCORING_NONE) { + return val; + } else if constexpr (SF == SCORING_SIGMOID) { return apply_sigmoid(val); } else { + static_assert(SF == SCORING_NONE || SF == SCORING_SIGMOID, + "Unsupported ScoringFunc in apply_scoring"); return val; } } -template -__device__ void topk_with_k2(T* output, T const* input, T const* bias, +template +__device__ void topk_with_k2(T* output, T const* input, BiasT const* bias, cg::thread_block_tile<32> const& tile, int32_t const lane_id, int const num_experts_per_group) { @@ -465,7 +469,7 @@ __device__ void topk_with_k2(T* output, T const* input, T const* bias, if (num_experts_per_group > WARP_SIZE) { for (int i = lane_id; i < num_experts_per_group; i += WARP_SIZE) { T value = apply_scoring(input[i]); - value = value + bias[i]; + value = value + static_cast(bias[i]); if (value > largest) { second_largest = largest; @@ -477,7 +481,7 @@ __device__ void topk_with_k2(T* output, T const* input, T const* bias, 
} else { for (int i = lane_id; i < num_experts_per_group; i += WARP_SIZE) { T value = apply_scoring(input[i]); - value = value + bias[i]; + value = value + static_cast(bias[i]); largest = value; } } @@ -499,8 +503,8 @@ __device__ void topk_with_k2(T* output, T const* input, T const* bias, } } -template -__global__ void topk_with_k2_kernel(T* output, T* input, T const* bias, +template +__global__ void topk_with_k2_kernel(T* output, T* input, BiasT const* bias, int64_t const num_tokens, int64_t const num_cases, int64_t const n_group, @@ -513,7 +517,7 @@ __global__ void topk_with_k2_kernel(T* output, T* input, T const* bias, input += case_id * num_experts_per_group; // bias is per expert group, offset to current group int32_t group_id = case_id % n_group; - T const* group_bias = bias + group_id * num_experts_per_group; + BiasT const* group_bias = bias + group_id * num_experts_per_group; output += case_id; cg::thread_block block = cg::this_thread_block(); @@ -522,18 +526,19 @@ __global__ void topk_with_k2_kernel(T* output, T* input, T const* bias, #if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) asm volatile("griddepcontrol.wait;"); #endif - topk_with_k2(output, input, group_bias, tile, lane_id, - num_experts_per_group); + topk_with_k2(output, input, group_bias, tile, lane_id, + num_experts_per_group); } #if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) asm volatile("griddepcontrol.launch_dependents;"); #endif } -template +template __global__ void group_idx_and_topk_idx_kernel( T* scores, T const* group_scores, float* topk_values, IdxT* topk_indices, - T const* bias, int64_t const num_tokens, int64_t const n_group, + BiasT const* bias, int64_t const num_tokens, int64_t const n_group, int64_t const topk_group, int64_t const topk, int64_t const num_experts, int64_t const num_experts_per_group, bool renormalize, double routed_scaling_factor) { @@ -619,7 +624,7 @@ __global__ void group_idx_and_topk_idx_kernel( T input = scores[offset + i]; if (is_finite(input)) 
{ T score = apply_scoring(input); - candidates = score + bias[offset + i]; + candidates = score + static_cast(bias[offset + i]); } } queue.add(candidates, offset + i); @@ -670,10 +675,13 @@ __global__ void group_idx_and_topk_idx_kernel( if (case_id < num_tokens) { if (if_proceed_next_topk) { + float scale = routed_scaling_factor; + if (renormalize) { + scale /= topk_sum; + } for (int i = lane_id; i < topk; i += WARP_SIZE) { float base = cuda_cast(s_topk_value[i]); - float value = renormalize ? (base / topk_sum * routed_scaling_factor) - : (base * routed_scaling_factor); + float value = base * scale; topk_indices[i] = s_topk_idx[i]; topk_values[i] = value; } @@ -691,10 +699,10 @@ __global__ void group_idx_and_topk_idx_kernel( #endif } -template +template inline void launch_group_idx_and_topk_kernel( cudaLaunchConfig_t const& config, T* scores, T* group_scores, - float* topk_values, IdxT* topk_indices, T const* bias, + float* topk_values, IdxT* topk_indices, BiasT const* bias, int64_t const num_tokens, int64_t const n_group, int64_t const topk_group, int64_t const topk, int64_t const num_experts, int64_t const num_experts_per_group, bool const renormalize, @@ -708,36 +716,36 @@ inline void launch_group_idx_and_topk_kernel( switch (n_group) { case 4: { - launch(&group_idx_and_topk_idx_kernel); + launch(&group_idx_and_topk_idx_kernel); break; } case 8: { - launch(&group_idx_and_topk_idx_kernel); + launch(&group_idx_and_topk_idx_kernel); break; } case 16: { - launch(&group_idx_and_topk_idx_kernel); + launch(&group_idx_and_topk_idx_kernel); break; } case 32: { - launch(&group_idx_and_topk_idx_kernel); + launch(&group_idx_and_topk_idx_kernel); break; } default: { - launch(&group_idx_and_topk_idx_kernel); + launch(&group_idx_and_topk_idx_kernel); break; } } } -template +template void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values, - IdxT* topk_indices, T const* bias, int64_t const num_tokens, - int64_t const num_experts, int64_t const n_group, - int64_t const 
topk_group, int64_t const topk, - bool const renormalize, double const routed_scaling_factor, - int const scoring_func, bool enable_pdl = false, - cudaStream_t const stream = 0) { + IdxT* topk_indices, BiasT const* bias, + int64_t const num_tokens, int64_t const num_experts, + int64_t const n_group, int64_t const topk_group, + int64_t const topk, bool const renormalize, + double const routed_scaling_factor, int const scoring_func, + bool enable_pdl = false, cudaStream_t const stream = 0) { int64_t num_cases = num_tokens * n_group; int64_t topk_with_k2_num_blocks = (num_cases - 1) / NUM_WARPS_PER_BLOCK + 1; cudaLaunchConfig_t config; @@ -758,12 +766,12 @@ void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values, }; switch (sf) { case SCORING_NONE: { - auto* kernel_instance1 = &topk_with_k2_kernel; + auto* kernel_instance1 = &topk_with_k2_kernel; launch_topk_with_k2(kernel_instance1); break; } case SCORING_SIGMOID: { - auto* kernel_instance1 = &topk_with_k2_kernel; + auto* kernel_instance1 = &topk_with_k2_kernel; launch_topk_with_k2(kernel_instance1); break; } @@ -787,14 +795,14 @@ void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values, config.attrs = attrs; switch (sf) { case SCORING_NONE: { - launch_group_idx_and_topk_kernel( + launch_group_idx_and_topk_kernel( config, scores, group_scores, topk_values, topk_indices, bias, num_tokens, n_group, topk_group, topk, num_experts, num_experts_per_group, renormalize, routed_scaling_factor); break; } case SCORING_SIGMOID: { - launch_group_idx_and_topk_kernel( + launch_group_idx_and_topk_kernel( config, scores, group_scores, topk_values, topk_indices, bias, num_tokens, n_group, topk_group, topk, num_experts, num_experts_per_group, renormalize, routed_scaling_factor); @@ -805,17 +813,23 @@ void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values, } } -#define INSTANTIATE_NOAUX_TC(T, IdxT) \ - template void invokeNoAuxTc( \ - T * scores, T * group_scores, float* topk_values, IdxT* topk_indices, \ - 
T const* bias, int64_t const num_tokens, int64_t const num_experts, \ - int64_t const n_group, int64_t const topk_group, int64_t const topk, \ - bool const renormalize, double const routed_scaling_factor, \ +#define INSTANTIATE_NOAUX_TC(T, BiasT, IdxT) \ + template void invokeNoAuxTc( \ + T * scores, T * group_scores, float* topk_values, IdxT* topk_indices, \ + BiasT const* bias, int64_t const num_tokens, int64_t const num_experts, \ + int64_t const n_group, int64_t const topk_group, int64_t const topk, \ + bool const renormalize, double const routed_scaling_factor, \ int const scoring_func, bool enable_pdl, cudaStream_t const stream); -INSTANTIATE_NOAUX_TC(float, int32_t); -INSTANTIATE_NOAUX_TC(half, int32_t); -INSTANTIATE_NOAUX_TC(__nv_bfloat16, int32_t); +INSTANTIATE_NOAUX_TC(float, float, int32_t); +INSTANTIATE_NOAUX_TC(float, half, int32_t); +INSTANTIATE_NOAUX_TC(float, __nv_bfloat16, int32_t); +INSTANTIATE_NOAUX_TC(half, float, int32_t); +INSTANTIATE_NOAUX_TC(half, half, int32_t); +INSTANTIATE_NOAUX_TC(half, __nv_bfloat16, int32_t); +INSTANTIATE_NOAUX_TC(__nv_bfloat16, float, int32_t); +INSTANTIATE_NOAUX_TC(__nv_bfloat16, half, int32_t); +INSTANTIATE_NOAUX_TC(__nv_bfloat16, __nv_bfloat16, int32_t); } // end namespace moe } // namespace vllm @@ -824,6 +838,7 @@ std::tuple grouped_topk( int64_t topk, bool renormalize, double routed_scaling_factor, torch::Tensor const& bias, int64_t scoring_func = 0) { auto data_type = scores.scalar_type(); + auto bias_type = bias.scalar_type(); auto input_size = scores.sizes(); int64_t num_tokens = input_size[0]; int64_t num_experts = input_size[1]; @@ -847,39 +862,62 @@ std::tuple grouped_topk( auto stream = c10::cuda::getCurrentCUDAStream(scores.get_device()); +#define LAUNCH_KERNEL(T, IdxT) \ + do { \ + switch (bias_type) { \ + case torch::kFloat16: \ + vllm::moe::invokeNoAuxTc( \ + reinterpret_cast(scores.mutable_data_ptr()), \ + reinterpret_cast(group_scores.mutable_data_ptr()), \ + 
reinterpret_cast(topk_values.mutable_data_ptr()), \ + reinterpret_cast(topk_indices.mutable_data_ptr()), \ + reinterpret_cast(bias.data_ptr()), num_tokens, \ + num_experts, n_group, topk_group, topk, renormalize, \ + routed_scaling_factor, static_cast(scoring_func), false, \ + stream); \ + break; \ + case torch::kFloat32: \ + vllm::moe::invokeNoAuxTc( \ + reinterpret_cast(scores.mutable_data_ptr()), \ + reinterpret_cast(group_scores.mutable_data_ptr()), \ + reinterpret_cast(topk_values.mutable_data_ptr()), \ + reinterpret_cast(topk_indices.mutable_data_ptr()), \ + reinterpret_cast(bias.data_ptr()), num_tokens, \ + num_experts, n_group, topk_group, topk, renormalize, \ + routed_scaling_factor, static_cast(scoring_func), false, \ + stream); \ + break; \ + case torch::kBFloat16: \ + vllm::moe::invokeNoAuxTc( \ + reinterpret_cast(scores.mutable_data_ptr()), \ + reinterpret_cast(group_scores.mutable_data_ptr()), \ + reinterpret_cast(topk_values.mutable_data_ptr()), \ + reinterpret_cast(topk_indices.mutable_data_ptr()), \ + reinterpret_cast<__nv_bfloat16 const*>(bias.data_ptr()), \ + num_tokens, num_experts, n_group, topk_group, topk, renormalize, \ + routed_scaling_factor, static_cast(scoring_func), false, \ + stream); \ + break; \ + default: \ + throw std::invalid_argument( \ + "Invalid bias dtype, only supports float16, float32, and " \ + "bfloat16"); \ + break; \ + } \ + } while (0) + switch (data_type) { case torch::kFloat16: // Handle Float16 - vllm::moe::invokeNoAuxTc( - reinterpret_cast(scores.mutable_data_ptr()), - reinterpret_cast(group_scores.mutable_data_ptr()), - reinterpret_cast(topk_values.mutable_data_ptr()), - reinterpret_cast(topk_indices.mutable_data_ptr()), - reinterpret_cast(bias.data_ptr()), num_tokens, - num_experts, n_group, topk_group, topk, renormalize, - routed_scaling_factor, static_cast(scoring_func), false, stream); + LAUNCH_KERNEL(half, int32_t); break; case torch::kFloat32: // Handle Float32 - vllm::moe::invokeNoAuxTc( - 
reinterpret_cast(scores.mutable_data_ptr()), - reinterpret_cast(group_scores.mutable_data_ptr()), - reinterpret_cast(topk_values.mutable_data_ptr()), - reinterpret_cast(topk_indices.mutable_data_ptr()), - reinterpret_cast(bias.data_ptr()), num_tokens, - num_experts, n_group, topk_group, topk, renormalize, - routed_scaling_factor, static_cast(scoring_func), false, stream); + LAUNCH_KERNEL(float, int32_t); break; case torch::kBFloat16: // Handle BFloat16 - vllm::moe::invokeNoAuxTc<__nv_bfloat16, int32_t>( - reinterpret_cast<__nv_bfloat16*>(scores.mutable_data_ptr()), - reinterpret_cast<__nv_bfloat16*>(group_scores.mutable_data_ptr()), - reinterpret_cast(topk_values.mutable_data_ptr()), - reinterpret_cast(topk_indices.mutable_data_ptr()), - reinterpret_cast<__nv_bfloat16 const*>(bias.data_ptr()), num_tokens, - num_experts, n_group, topk_group, topk, renormalize, - routed_scaling_factor, static_cast(scoring_func), false, stream); + LAUNCH_KERNEL(__nv_bfloat16, int32_t); break; default: // Handle other data types @@ -887,5 +925,6 @@ std::tuple grouped_topk( "Invalid dtype, only supports float16, float32, and bfloat16"); break; } +#undef LAUNCH_KERNEL return {topk_values, topk_indices}; } diff --git a/csrc/moe/marlin_moe_wna16/.gitignore b/csrc/moe/marlin_moe_wna16/.gitignore index ba805f9250ecea707563336138067efe13b7b8b5..7dc482a8946605d91d192f43b2fffae518397f59 100644 --- a/csrc/moe/marlin_moe_wna16/.gitignore +++ b/csrc/moe/marlin_moe_wna16/.gitignore @@ -1,2 +1,3 @@ sm*_kernel_*.cu kernel_selector.h +kernel_*.cu diff --git a/csrc/moe/marlin_moe_wna16/generate_kernels.py b/csrc/moe/marlin_moe_wna16/generate_kernels.py index 88f1055337fd5332bef486a141025e90f937181f..9db03ea149d0c38083f63fa542b92cddc6dae9fc 100644 --- a/csrc/moe/marlin_moe_wna16/generate_kernels.py +++ b/csrc/moe/marlin_moe_wna16/generate_kernels.py @@ -10,6 +10,8 @@ import jinja2 ARCHS = [] SUPPORT_FP8 = False +SUPPORT_SM75 = False +SUPPORT_SM80 = False for arch in sys.argv[1].split(","): arch = arch[: 
arch.index(".") + 2].replace(".", "") arch = int(arch) @@ -19,6 +21,10 @@ for arch in sys.argv[1].split(","): # with FP16 MMA, so it cannot achieve any acceleration. if arch in [89, 120]: SUPPORT_FP8 = True + if arch >= 80: + SUPPORT_SM80 = True + if arch == 75: + SUPPORT_SM75 = True FILE_HEAD_COMMENT = """ // auto generated by generate_kernels.py @@ -157,6 +163,7 @@ def remove_old_kernels(): def generate_new_kernels(): result_dict = {} + sm_75_result_dict = {} for quant_config in QUANT_CONFIGS: c_types = quant_config.get("c_type", ["kFloat16", "kBFloat16"]) @@ -174,6 +181,8 @@ def generate_new_kernels(): s_type = quant_config.get("s_type", c_type) if (a_type, b_type, c_type) not in result_dict: result_dict[(a_type, b_type, c_type)] = [] + if a_type in ["kFloat16", "kS8"] and c_type == "kFloat16": + sm_75_result_dict[(a_type, b_type, c_type)] = [] for group_blocks, m_blocks, thread_configs in itertools.product( all_group_blocks, all_m_blocks, all_thread_configs @@ -197,78 +206,89 @@ def generate_new_kernels(): "thread_k_blocks": thread_k // 16, "thread_n_blocks": thread_n // 16, "m_block_size_8": "true" if m_blocks == 0.5 else "false", - "stages": "pipe_stages", + "stages": 4, "group_blocks": group_blocks, "is_zp_float": "false", } - result_dict[(a_type, b_type, c_type)].append(config) + if SUPPORT_SM80: + result_dict[(a_type, b_type, c_type)].append(config) + if (a_type, b_type, c_type) in sm_75_result_dict and SUPPORT_SM75: + config_sm75 = config.copy() + config_sm75["stages"] = 2 + sm_75_result_dict[(a_type, b_type, c_type)].append(config_sm75) kernel_selector_str = FILE_HEAD_COMMENT - for (a_type, b_type, c_type), config_list in result_dict.items(): - all_template_str_list = [] - for config in config_list: - s_type = config["s_type"] - template_str = jinja2.Template(TEMPLATE).render( - a_type_id=f"vllm::{a_type}.id()", - b_type_id=f"vllm::{b_type}.id()", - c_type_id=f"vllm::{c_type}.id()", - s_type_id=f"vllm::{s_type}.id()", - **config, - ) - 
all_template_str_list.append(template_str) - - conditions = [ - f"a_type == vllm::{a_type}", - f"b_type == vllm::{b_type}", - f"c_type == vllm::{c_type}", - f"s_type == vllm::{s_type}", - f"threads == {config['threads']}", - f"thread_m_blocks == {config['thread_m_blocks']}", - f"thread_n_blocks == {config['thread_n_blocks']}", - f"thread_k_blocks == {config['thread_k_blocks']}", - f"m_block_size_8 == {config['m_block_size_8']}", - f"group_blocks == {config['group_blocks']}", - f"is_zp_float == {config['is_zp_float']}", - ] - conditions = " && ".join(conditions) - - if kernel_selector_str == FILE_HEAD_COMMENT: - kernel_selector_str += f"if ({conditions})\n kernel = " - else: - kernel_selector_str += f"else if ({conditions})\n kernel = " - - kernel_template2 = ( - "Marlin<{{a_type_id}}, {{b_type_id}}, {{c_type_id}}, " - "{{s_type_id}}, {{threads}}, {{thread_m_blocks}}, " - "{{thread_n_blocks}}, {{thread_k_blocks}}, " - "{{m_block_size_8}}, {{stages}}, {{group_blocks}}, " - "{{is_zp_float}}>;" - ) - - kernel_selector_str += ( - jinja2.Template(kernel_template2).render( + for result_dict_tmp in [result_dict, sm_75_result_dict]: + for (a_type, b_type, c_type), config_list in result_dict_tmp.items(): + all_template_str_list = [] + if not config_list: + continue + for config in config_list: + s_type = config["s_type"] + template_str = jinja2.Template(TEMPLATE).render( a_type_id=f"vllm::{a_type}.id()", b_type_id=f"vllm::{b_type}.id()", c_type_id=f"vllm::{c_type}.id()", s_type_id=f"vllm::{s_type}.id()", **config, ) - + "\n" - ) + all_template_str_list.append(template_str) + + conditions = [ + f"a_type == vllm::{a_type}", + f"b_type == vllm::{b_type}", + f"c_type == vllm::{c_type}", + f"s_type == vllm::{s_type}", + f"threads == {config['threads']}", + f"thread_m_blocks == {config['thread_m_blocks']}", + f"thread_n_blocks == {config['thread_n_blocks']}", + f"thread_k_blocks == {config['thread_k_blocks']}", + f"m_block_size_8 == {config['m_block_size_8']}", + f"stages == 
{config['stages']}", + f"group_blocks == {config['group_blocks']}", + f"is_zp_float == {config['is_zp_float']}", + ] + conditions = " && ".join(conditions) + + if kernel_selector_str == FILE_HEAD_COMMENT: + kernel_selector_str += f"if ({conditions})\n kernel = " + else: + kernel_selector_str += f"else if ({conditions})\n kernel = " + + kernel_template2 = ( + "Marlin<{{a_type_id}}, {{b_type_id}}, {{c_type_id}}, " + "{{s_type_id}}, {{threads}}, {{thread_m_blocks}}, " + "{{thread_n_blocks}}, {{thread_k_blocks}}, " + "{{m_block_size_8}}, {{stages}}, {{group_blocks}}, " + "{{is_zp_float}}>;" + ) - file_content = FILE_HEAD + "\n\n" - file_content += "\n\n".join(all_template_str_list) + "\n\n}\n" - if a_type == "kFE4M3fn": - filename = f"sm89_kernel_{a_type[1:]}_{b_type[1:]}_{c_type[1:]}.cu" - else: - filename = f"sm80_kernel_{a_type[1:]}_{b_type[1:]}_{c_type[1:]}.cu" + kernel_selector_str += ( + jinja2.Template(kernel_template2).render( + a_type_id=f"vllm::{a_type}.id()", + b_type_id=f"vllm::{b_type}.id()", + c_type_id=f"vllm::{c_type}.id()", + s_type_id=f"vllm::{s_type}.id()", + **config, + ) + + "\n" + ) + + file_content = FILE_HEAD + "\n\n" + file_content += "\n\n".join(all_template_str_list) + "\n\n}\n" + if a_type == "kFE4M3fn": + filename = f"sm89_kernel_{a_type[1:]}_{b_type[1:]}_{c_type[1:]}.cu" + elif result_dict_tmp is sm_75_result_dict: + filename = f"sm75_kernel_{a_type[1:]}_{b_type[1:]}_{c_type[1:]}.cu" + else: + filename = f"sm80_kernel_{a_type[1:]}_{b_type[1:]}_{c_type[1:]}.cu" - filename = filename.lower() + filename = filename.lower() - with open(os.path.join(os.path.dirname(__file__), filename), "w") as f: - f.write(file_content) + with open(os.path.join(os.path.dirname(__file__), filename), "w") as f: + f.write(file_content) if not SUPPORT_FP8 and kernel_selector_str != FILE_HEAD_COMMENT: kernel_selector_str += ( diff --git a/csrc/moe/marlin_moe_wna16/kernel.h b/csrc/moe/marlin_moe_wna16/kernel.h index 
57f5a17932d4437e45a48901582039d2b96d327d..eb83df22cde7243f37c09db38f5b2eccb47983a3 100644 --- a/csrc/moe/marlin_moe_wna16/kernel.h +++ b/csrc/moe/marlin_moe_wna16/kernel.h @@ -7,20 +7,20 @@ #include "quantization/gptq_marlin/marlin_dtypes.cuh" #include "core/scalar_type.hpp" -#define MARLIN_KERNEL_PARAMS \ - const int4 *__restrict__ A, const int4 *__restrict__ B, \ - int4 *__restrict__ C, int4 *__restrict__ C_tmp, \ - const int4 *__restrict__ b_bias_ptr, \ - const float *__restrict__ a_scales_ptr, \ - const int4 *__restrict__ scales_ptr, \ - const uint16_t *__restrict__ global_scale_ptr, \ - const int4 *__restrict__ zp_ptr, const int *__restrict__ g_idx, \ - const int32_t *__restrict__ sorted_token_ids_ptr, \ - const int32_t *__restrict__ expert_ids_ptr, \ - const int32_t *__restrict__ num_tokens_past_padded_ptr, \ - const float *__restrict__ topk_weights_ptr, int top_k, \ - bool mul_topk_weights, bool is_ep, int num_groups, int prob_m, \ - int prob_n, int prob_k, int *locks, bool has_bias, bool use_atomic_add, \ +#define MARLIN_KERNEL_PARAMS \ + const int4 *__restrict__ A, const int4 *__restrict__ B, \ + int4 *__restrict__ C, int4 *__restrict__ C_tmp, \ + const int4 *__restrict__ b_bias_ptr, \ + const float *__restrict__ a_scales_ptr, \ + const int4 *__restrict__ scales_ptr, \ + const uint16_t *__restrict__ global_scale_ptr, \ + const int4 *__restrict__ zp_ptr, const int *__restrict__ g_idx, \ + const int32_t *__restrict__ sorted_token_ids_ptr, \ + const int32_t *__restrict__ expert_ids_ptr, \ + const int32_t *__restrict__ num_tokens_past_padded_ptr, \ + const float *__restrict__ topk_weights_ptr, int top_k, \ + bool mul_topk_weights, int num_groups, int prob_m, int prob_n, \ + int prob_k, int *locks, bool has_bias, bool use_atomic_add, \ bool use_fp32_reduce namespace MARLIN_NAMESPACE_NAME { diff --git a/csrc/moe/marlin_moe_wna16/marlin_template.h b/csrc/moe/marlin_moe_wna16/marlin_template.h index 
5b6b2456b4111fd6cacd5e8a209dd5f6cb01612c..5aac69b5c7d5a90d25b4863041d1df667b8693f9 100644 --- a/csrc/moe/marlin_moe_wna16/marlin_template.h +++ b/csrc/moe/marlin_moe_wna16/marlin_template.h @@ -26,6 +26,7 @@ #include "quantization/gptq_marlin/marlin.cuh" #include "quantization/gptq_marlin/marlin_dtypes.cuh" #include "quantization/gptq_marlin/dequant.h" +#include "quantization/gptq_marlin/marlin_mma.h" #include "core/scalar_type.hpp" #define STATIC_ASSERT_SCALAR_TYPE_VALID(scalar_t) \ @@ -35,7 +36,7 @@ namespace MARLIN_NAMESPACE_NAME { -#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800 +#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 750 template -__device__ inline void mma( - const typename MarlinScalarType::FragA& a_frag, - const typename MarlinScalarType::FragB& frag_b, - typename MarlinScalarType::FragC& frag_c, int idx = 0) { - const uint32_t* a = reinterpret_cast(&a_frag); - const uint32_t* b = reinterpret_cast(&frag_b); - using scalar_t = typename MarlinScalarType::scalar_t; - if constexpr (k_size == 16) { - if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.f32.e4m3.e4m3.f32 " - "{%0,%1,%2,%3}, {%4,%5}, {%6}, 
{%7,%8,%9,%10};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(a[idx * 2]), "r"(a[idx * 2 + 1]), "r"(b[idx]), "f"(c[0]), - "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - int32_t* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.s32.s8.s8.s32.satfinite " - "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" - : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) - : "r"(a[idx * 2]), "r"(a[idx * 2 + 1]), "r"(b[idx]), "r"(c[0]), - "r"(c[1]), "r"(c[2]), "r"(c[3])); - } - } else if (k_size == 32) { - if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - int32_t* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32.satfinite " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) - : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), - "r"(c[0]), "r"(c[1]), "r"(c[2]), "r"(c[3])); - } - } -} - -template -__device__ inline void mma_trans( - const typename MarlinScalarType::FragA& a_frag, - const typename MarlinScalarType::FragB& frag_b, - const typename MarlinScalarType::FragB& frag_b2, - typename MarlinScalarType::FragC& frag_c) { - const uint32_t* a = reinterpret_cast(&a_frag); - const uint32_t* b = reinterpret_cast(&frag_b); - const uint32_t* b2 = reinterpret_cast(&frag_b2); - float* c = reinterpret_cast(&frag_c); - using scalar_t = typename MarlinScalarType::scalar_t; - if constexpr (k_size == 16) { - if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - 
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.f32.e4m3.e4m3.f32 " - "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(a[0]), "f"(c[0]), "f"(c[1]), "f"(c[2]), - "f"(c[3])); - } else if constexpr (std::is_same::value) { - int32_t* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.s32.s8.s8.s32.satfinite " - "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" - : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(a[0]), "r"(c[0]), "r"(c[1]), "r"(c[2]), - "r"(c[3])); - } - } else { - if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 1200 - asm volatile( - "mma.sync.aligned.kind::f8f6f4.m16n8k32.row.col.f32.e4m3.e4m3.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - #else - asm volatile( - "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : 
"=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - #endif - } else if constexpr (std::is_same::value) { - int32_t* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32.satfinite " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), - "r"(c[0]), "r"(c[1]), "r"(c[2]), "r"(c[3])); - } - } -} - // Instruction for loading a full 16x16 matrix fragment of operand A from shared // memory, directly in tensor core layout. template @@ -412,7 +272,6 @@ __global__ void Marlin( const float* __restrict__ topk_weights_ptr, // moe top weights int top_k, // num of experts per token bool mul_topk_weights, // mul topk weights or not - bool is_ep, // expert parallelism int num_groups, // number of scale groups per output channel int prob_m, // batch dimension m int prob_n, // output dimension n @@ -439,9 +298,20 @@ __global__ void Marlin( if constexpr (a_type_id == vllm::kFE4M3fn.id()) return; #endif + #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + // Turing TensorCore only supports fp16 and int8 + if constexpr (a_type_id != vllm::kFloat16.id() && a_type_id != vllm::kS8.id()) + return; + #endif + int num_tokens_past_padded = num_tokens_past_padded_ptr[0]; constexpr int moe_block_size = m_block_size_8 ? 
8 : (16 * thread_m_blocks); + #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + constexpr bool use_fp16_accum = a_type_id == vllm::kFloat16.id(); + #else + constexpr bool use_fp16_accum = false; + #endif using Adtype = MarlinScalarType; using Cdtype = MarlinScalarType; @@ -504,14 +374,6 @@ __global__ void Marlin( // parallel: num valid moe blocks int parallel = num_tokens_past_padded / moe_block_size; - int num_valid_blocks = parallel; - if (is_ep) { - for (int i = 0; i < parallel; i++) { - if (expert_ids_ptr[i] == -1) num_valid_blocks--; - } - } - int num_invalid_blocks = parallel - num_valid_blocks; - parallel = num_valid_blocks; int k_tiles = prob_k / 16 / thread_k_blocks; int n_tiles = prob_n / 16 / thread_n_blocks; @@ -618,7 +480,22 @@ __global__ void Marlin( } } + #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + + if constexpr (moe_block_size >= 16) + local_count += __shfl_down_sync(0xFFFFFFFF, local_count, 16); + if constexpr (moe_block_size >= 8) + local_count += __shfl_down_sync(0xFFFFFFFF, local_count, 8); + if constexpr (moe_block_size >= 4) + local_count += __shfl_down_sync(0xFFFFFFFF, local_count, 4); + if constexpr (moe_block_size >= 2) + local_count += __shfl_down_sync(0xFFFFFFFF, local_count, 2); + + local_count += __shfl_down_sync(0xFFFFFFFF, local_count, 1); + block_num_valid_tokens = local_count; + #else block_num_valid_tokens = __reduce_add_sync(0xffffffff, local_count); + #endif if (lane_id == 0) reinterpret_cast(sh_new)[0] = block_num_valid_tokens; @@ -651,22 +528,8 @@ __global__ void Marlin( if (par_id >= parallel) return; old_expert_id = expert_id; - if (num_invalid_blocks > 0) { - int skip_count = par_id; - for (int i = 0; i < num_tokens_past_padded / moe_block_size; i++) { - expert_id = expert_ids_ptr[i]; - if (expert_id != -1) { - if (skip_count == 0) { - block_id = i; - break; - }; - skip_count--; - }; - } - } else { - block_id = par_id; - expert_id = expert_ids_ptr[block_id]; - } + block_id = par_id; + expert_id = 
expert_ids_ptr[block_id]; if constexpr (b_type == vllm::kFE2M1f && s_type == vllm::kFE4M3fn) { uint16_t val = global_scale_ptr[expert_id]; @@ -1018,10 +881,6 @@ __global__ void Marlin( constexpr int sh_s_size = has_act_order ? (act_s_max_num_groups * s_sh_stride) : (stages * s_sh_stage); int4* sh_s = sh_zp + (stages * zp_sh_stage); - // shared memory reused by reduction should be smaller than - // shared memory used by weight. - static_assert(thread_m_blocks * 16 * thread_n_blocks * 16 / 8 <= - stages * b_sh_stage); int4* sh_a = sh_s + sh_s_size; // Register storage for double buffer of shared memory reads. @@ -1545,11 +1404,13 @@ __global__ void Marlin( #pragma unroll for (int i = 0; i < thread_m_blocks; i++) { if constexpr (m_block_size_8) { - mma_trans(frag_a[k2][i], frag_b0, frag_b1, - frag_c[i][j][0]); + mma_trans(frag_a[k2][i], frag_b0, frag_b1, + frag_c[i][j][0]); } else { - mma(frag_a[k2][i], frag_b0, frag_c[i][j][0]); - mma(frag_a[k2][i], frag_b1, frag_c[i][j][1]); + mma(frag_a[k2][i], frag_b0, + frag_c[i][j][0]); + mma(frag_a[k2][i], frag_b1, + frag_c[i][j][1]); } } } @@ -1583,10 +1444,12 @@ __global__ void Marlin( #pragma unroll for (int i = 0; i < thread_m_blocks; i++) { - mma(frag_a[k2][i], frag_b[0], - (group_blocks == -1 ? frag_c : frag_c_tmp)[i][j][0]); - mma(frag_a[k2][i], frag_b[1], - (group_blocks == -1 ? frag_c : frag_c_tmp)[i][j][1]); + mma( + frag_a[k2][i], frag_b[0], + (group_blocks == -1 ? frag_c : frag_c_tmp)[i][j][0]); + mma( + frag_a[k2][i], frag_b[1], + (group_blocks == -1 ? frag_c : frag_c_tmp)[i][j][1]); } if constexpr (group_blocks != -1) { @@ -2132,6 +1995,21 @@ __global__ void Marlin( // While this pattern may not be the most readable, other ways of writing // the loop seemed to noticeably worse performance after compilation. if (slice_iters == 0) { + // convert fp16 accum to fp32 for reduction + if constexpr (use_fp16_accum) { + #pragma unroll + for (int i = 0; i < (thread_m_blocks * (is_a_8bit ? 
2 : 4) * 2); i++) { + float* frag_c_part_float = reinterpret_cast(frag_c) + i * 4; + scalar_t* frag_c_part_half = + reinterpret_cast(frag_c_part_float); + + #pragma unroll + for (int i = 3; i >= 0; i--) { + frag_c_part_float[i] = Cdtype::num2float(frag_c_part_half[i]); + } + } + } + if constexpr (is_a_8bit) { float frag_a_s[2 * thread_m_blocks]; diff --git a/csrc/moe/marlin_moe_wna16/ops.cu b/csrc/moe/marlin_moe_wna16/ops.cu index 4fd8fc5c542023edbc61609bcc811e8c406d3cb8..00b17f075af680fbfe7263c4941f85b3cf997761 100644 --- a/csrc/moe/marlin_moe_wna16/ops.cu +++ b/csrc/moe/marlin_moe_wna16/ops.cu @@ -142,7 +142,7 @@ typedef struct { int get_scales_cache_size(thread_config_t const& th_config, int prob_m, int prob_n, int prob_k, int num_bits, int group_size, - bool has_act_order, bool is_k_full) { + bool has_act_order, bool is_k_full, int stages) { bool cache_scales_chunk = has_act_order && !is_k_full; int tb_n = th_config.thread_n; @@ -160,13 +160,13 @@ int get_scales_cache_size(thread_config_t const& th_config, int prob_m, if (cache_scales_chunk) { int load_groups = - tb_groups * pipe_stages * 2; // Chunk size is 2x pipeline over dim K + tb_groups * stages * 2; // Chunk size is 2x pipeline over dim K load_groups = max(load_groups, 32); // We load at least 32 scale groups return load_groups * tb_n * 2; } else { int tb_scales = tb_groups * tb_n * 2; - return tb_scales * pipe_stages; + return tb_scales * stages; } } @@ -174,7 +174,7 @@ int get_kernel_cache_size(thread_config_t const& th_config, bool m_block_size_8, int thread_m_blocks, int prob_m, int prob_n, int prob_k, int num_bits, int group_size, bool has_act_order, bool is_k_full, int has_zp, - int is_zp_float, bool is_a_8bit) { + int is_zp_float, bool is_a_8bit, int stages) { int pack_factor = 32 / num_bits; // Get B size @@ -185,8 +185,8 @@ int get_kernel_cache_size(thread_config_t const& th_config, bool m_block_size_8, // shm size for block_sorted_ids/rd_block_sorted_ids/block_topk_weights // both of them 
requires tb_m * 4 bytes (tb_m * int32 or tb_m * float32) int sh_block_meta_size = tb_m * 16; - int sh_a_size = pipe_stages * (tb_m * tb_k) * (is_a_8bit ? 1 : 2); - int sh_b_size = pipe_stages * (tb_k * tb_n / pack_factor) * 4; + int sh_a_size = stages * (tb_m * tb_k) * (is_a_8bit ? 1 : 2); + int sh_b_size = stages * (tb_k * tb_n / pack_factor) * 4; int sh_red_size = tb_m * (tb_n + 8) * 2; int sh_bias_size = tb_n * 2; int tmp_size = @@ -195,8 +195,8 @@ int get_kernel_cache_size(thread_config_t const& th_config, bool m_block_size_8, int sh_s_size = get_scales_cache_size(th_config, prob_m, prob_n, prob_k, num_bits, - group_size, has_act_order, is_k_full); - int sh_g_idx_size = has_act_order && !is_k_full ? pipe_stages * tb_k / 4 : 0; + group_size, has_act_order, is_k_full, stages); + int sh_g_idx_size = has_act_order && !is_k_full ? stages * tb_k / 4 : 0; int sh_zp_size = 0; if (has_zp) { if (is_zp_float) @@ -217,7 +217,7 @@ bool is_valid_config(thread_config_t const& th_config, bool m_block_size_8, int thread_m_blocks, int prob_m, int prob_n, int prob_k, int num_bits, int group_size, bool has_act_order, bool is_k_full, int has_zp, int is_zp_float, - int max_shared_mem, bool is_a_8bit) { + bool is_a_8bit, int stages, int max_shared_mem) { // Sanity if (th_config.thread_k == -1 || th_config.thread_n == -1 || th_config.num_threads == -1) { @@ -243,7 +243,7 @@ bool is_valid_config(thread_config_t const& th_config, bool m_block_size_8, int cache_size = get_kernel_cache_size(th_config, m_block_size_8, thread_m_blocks, prob_m, prob_n, prob_k, num_bits, group_size, has_act_order, - is_k_full, has_zp, is_zp_float, is_a_8bit); + is_k_full, has_zp, is_zp_float, is_a_8bit, stages); return cache_size <= max_shared_mem; } @@ -252,7 +252,7 @@ MarlinFuncPtr get_marlin_kernel( const vllm::ScalarType c_type, const vllm::ScalarType s_type, int thread_m_blocks, int thread_n_blocks, int thread_k_blocks, bool m_block_size_8, bool has_act_order, bool has_zp, int group_blocks, - int 
threads, bool is_zp_float) { + int threads, bool is_zp_float, int stages) { int num_bits = b_type.size_bits(); auto kernel = MarlinDefault; @@ -266,8 +266,8 @@ exec_config_t determine_exec_config( const vllm::ScalarType& c_type, const vllm::ScalarType& s_type, int prob_m, int prob_n, int prob_k, int num_experts, int top_k, int thread_m_blocks, bool m_block_size_8, int num_bits, int group_size, bool has_act_order, - bool is_k_full, bool has_zp, bool is_zp_float, int max_shared_mem, int sms, - bool is_a_8bit) { + bool is_k_full, bool has_zp, bool is_zp_float, bool is_a_8bit, int stages, + int max_shared_mem, int sms) { exec_config_t exec_cfg = exec_config_t{1, thread_config_t{-1, -1, -1}}; thread_config_t* thread_configs = thread_m_blocks > 1 ? large_batch_thread_configs @@ -284,15 +284,15 @@ exec_config_t determine_exec_config( if (!is_valid_config(th_config, m_block_size_8, thread_m_blocks, prob_m, prob_n, prob_k, num_bits, group_size, has_act_order, - is_k_full, has_zp, is_zp_float, max_shared_mem - 512, - is_a_8bit)) { + is_k_full, has_zp, is_zp_float, is_a_8bit, stages, + max_shared_mem - 512)) { continue; } int cache_size = get_kernel_cache_size( th_config, m_block_size_8, thread_m_blocks, prob_m, prob_n, prob_k, num_bits, group_size, has_act_order, is_k_full, has_zp, is_zp_float, - is_a_8bit); + is_a_8bit, stages); int group_blocks = 0; if (!has_act_order) { @@ -303,7 +303,7 @@ exec_config_t determine_exec_config( get_marlin_kernel(a_type, b_type, c_type, s_type, thread_m_blocks, th_config.thread_n / 16, th_config.thread_k / 16, m_block_size_8, has_act_order, has_zp, group_blocks, - th_config.num_threads, is_zp_float); + th_config.num_threads, is_zp_float, stages); if (kernel == MarlinDefault) continue; @@ -336,14 +336,14 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, void* perm, void* a_tmp, void* sorted_token_ids, void* expert_ids, void* num_tokens_past_padded, void* topk_weights, int moe_block_size, int num_experts, - 
int top_k, bool mul_topk_weights, bool is_ep, int prob_m, - int prob_n, int prob_k, void* workspace, - vllm::ScalarType const& a_type, vllm::ScalarType const& b_type, - vllm::ScalarType const& c_type, vllm::ScalarType const& s_type, - bool has_bias, bool has_act_order, bool is_k_full, bool has_zp, - int num_groups, int group_size, int dev, cudaStream_t stream, - int thread_k, int thread_n, int sms, int blocks_per_sm, - bool use_atomic_add, bool use_fp32_reduce, bool is_zp_float) { + int top_k, bool mul_topk_weights, int prob_m, int prob_n, + int prob_k, void* workspace, vllm::ScalarType const& a_type, + vllm::ScalarType const& b_type, vllm::ScalarType const& c_type, + vllm::ScalarType const& s_type, bool has_bias, + bool has_act_order, bool is_k_full, bool has_zp, int num_groups, + int group_size, int dev, cudaStream_t stream, int thread_k, + int thread_n, int sms, int blocks_per_sm, bool use_atomic_add, + bool use_fp32_reduce, bool is_zp_float) { int thread_m_blocks = div_ceil(moe_block_size, 16); bool m_block_size_8 = moe_block_size == 8; bool is_a_8bit = a_type.size_bits() == 8; @@ -433,8 +433,14 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, dev); cudaDeviceGetAttribute(&minor_capability, cudaDevAttrComputeCapabilityMinor, dev); - TORCH_CHECK(major_capability * 10 + minor_capability >= 80, - "marlin kernel only support Ampere or newer GPUs."); + TORCH_CHECK(major_capability * 10 + minor_capability >= 75, + "marlin kernel only support Turing or newer GPUs."); + int stages = 4; + if (major_capability == 7 && minor_capability == 5) { + stages = 2; + TORCH_CHECK(a_type == vllm::kFloat16 || a_type == vllm::kS8, + "Turing only support FP16 or INT8 activation."); + } if (a_type == vllm::kFE4M3fn) { TORCH_CHECK(major_capability * 10 + minor_capability >= 89, "FP8 only support Ada Lovelace or newer GPUs."); @@ -461,8 +467,8 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, exec_cfg = 
determine_exec_config( a_type, b_type, c_type, s_type, prob_m, prob_n, prob_k, num_experts, top_k, thread_m_blocks, m_block_size_8, num_bits, group_size, - has_act_order, is_k_full, has_zp, is_zp_float, max_shared_mem, sms, - is_a_8bit); + has_act_order, is_k_full, has_zp, is_zp_float, is_a_8bit, stages, + max_shared_mem, sms); thread_tfg = exec_cfg.tb_cfg; } @@ -479,7 +485,7 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, TORCH_CHECK(is_valid_config(thread_tfg, m_block_size_8, thread_m_blocks, prob_m, prob_n, prob_k, num_bits, group_size, has_act_order, is_k_full, has_zp, is_zp_float, - max_shared_mem, is_a_8bit), + is_a_8bit, stages, max_shared_mem), "Invalid thread config: thread_m_blocks = ", thread_m_blocks, ", thread_k = ", thread_tfg.thread_k, ", thread_n = ", thread_tfg.thread_n, @@ -493,12 +499,12 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, int sh_cache_size = get_kernel_cache_size(thread_tfg, m_block_size_8, thread_m_blocks, prob_m, prob_n, prob_k, num_bits, group_size, has_act_order, - is_k_full, has_zp, is_zp_float, is_a_8bit); + is_k_full, has_zp, is_zp_float, is_a_8bit, stages); auto kernel = get_marlin_kernel( a_type, b_type, c_type, s_type, thread_m_blocks, thread_n_blocks, thread_k_blocks, m_block_size_8, has_act_order, has_zp, group_blocks, - num_threads, is_zp_float); + num_threads, is_zp_float, stages); if (kernel == MarlinDefault) { TORCH_CHECK(false, "Unsupported shapes: MNK = [", prob_m, ", ", prob_n, @@ -517,7 +523,7 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, kernel<<>>( A_ptr, B_ptr, C_ptr, C_tmp_ptr, bias_ptr, a_s_ptr, b_s_ptr, g_s_ptr, zp_ptr, g_idx_ptr, sorted_token_ids_ptr, expert_ids_ptr, num_tokens_past_padded_ptr, - topk_weights_ptr, top_k, mul_topk_weights, is_ep, num_groups, prob_m, + topk_weights_ptr, top_k, mul_topk_weights, num_groups, prob_m, prob_n, prob_k, locks, has_bias, use_atomic_add, use_fp32_reduce); // 
clang-format on } @@ -535,7 +541,7 @@ torch::Tensor moe_wna16_marlin_gemm( std::optional const& perm_or_none, torch::Tensor& workspace, torch::Tensor& sorted_token_ids, torch::Tensor& expert_ids, torch::Tensor& num_tokens_past_padded, torch::Tensor& topk_weights, - int64_t moe_block_size, int64_t top_k, bool mul_topk_weights, bool is_ep, + int64_t moe_block_size, int64_t top_k, bool mul_topk_weights, vllm::ScalarTypeId const& b_type_id, int64_t size_m, int64_t size_n, int64_t size_k, bool is_k_full, bool use_atomic_add, bool use_fp32_reduce, bool is_zp_float, int64_t thread_k, int64_t thread_n, @@ -849,9 +855,9 @@ torch::Tensor moe_wna16_marlin_gemm( perm.data_ptr(), a_tmp.data_ptr(), sorted_token_ids.data_ptr(), expert_ids.data_ptr(), num_tokens_past_padded.data_ptr(), topk_weights.data_ptr(), moe_block_size, num_experts, top_k, - mul_topk_weights, is_ep, size_m, size_n, size_k, workspace.data_ptr(), - a_type, b_type, c_type, s_type, has_bias, has_act_order, is_k_full, - has_zp, num_groups, group_size, dev, at::cuda::getCurrentCUDAStream(dev), + mul_topk_weights, size_m, size_n, size_k, workspace.data_ptr(), a_type, + b_type, c_type, s_type, has_bias, has_act_order, is_k_full, has_zp, + num_groups, group_size, dev, at::cuda::getCurrentCUDAStream(dev), thread_k, thread_n, sms, blocks_per_sm, use_atomic_add, use_fp32_reduce, is_zp_float); @@ -860,4 +866,4 @@ torch::Tensor moe_wna16_marlin_gemm( TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) { m.impl("moe_wna16_marlin_gemm", &moe_wna16_marlin_gemm); -} \ No newline at end of file +} diff --git a/csrc/moe/torch_bindings.cpp b/csrc/moe/torch_bindings.cpp index 20a2d85821133c62bb2aacfb3b686e2bd2b37a2d..cad4249d4697e12b41f6f6dfee51aff0201bbb42 100644 --- a/csrc/moe/torch_bindings.cpp +++ b/csrc/moe/torch_bindings.cpp @@ -80,7 +80,7 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) { "Tensor sorted_token_ids," "Tensor! expert_ids, Tensor! num_tokens_past_padded," "Tensor! 
topk_weights, int moe_block_size, int top_k, " - "bool mul_topk_weights, bool is_ep, int b_type_id," + "bool mul_topk_weights, int b_type_id," "int size_m, int size_n, int size_k," "bool is_full_k, bool use_atomic_add," "bool use_fp32_reduce, bool is_zp_float," diff --git a/csrc/ops.h b/csrc/ops.h index e08ca40090404e778f65725bb90f59b6ebada5ca..f51a651f54a3bc9c192da4dbb55403a34a60f15c 100644 --- a/csrc/ops.h +++ b/csrc/ops.h @@ -2,6 +2,7 @@ #include #include +#include #include "core/scalar_type.hpp" @@ -280,6 +281,11 @@ void get_cutlass_moe_mm_problem_sizes( const int64_t k, const std::optional& blockscale_offsets, std::optional force_swap_ab = std::nullopt); +void get_cutlass_moe_mm_problem_sizes_from_expert_offsets( + const torch::Tensor& expert_first_token_offset, + torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2, + const int64_t n, const int64_t k, const bool swap_ab); + void get_cutlass_pplx_moe_mm_data(torch::Tensor& expert_offsets, torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2, @@ -316,6 +322,12 @@ void scaled_fp4_experts_quant( torch::Tensor const& input_offset_by_experts, torch::Tensor const& output_scale_offset_by_experts); +void silu_and_mul_scaled_fp4_experts_quant( + torch::Tensor& output, torch::Tensor& output_scale, + torch::Tensor const& input, torch::Tensor const& input_global_scale, + torch::Tensor const& input_offset_by_experts, + torch::Tensor const& output_scale_offset_by_experts); + void per_token_group_quant_fp8(const torch::Tensor& input, torch::Tensor& output_q, torch::Tensor& output_s, int64_t group_size, double eps, double fp8_min, @@ -350,8 +362,9 @@ void dynamic_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input, // void gptq_shuffle(torch::Tensor q_weight, torch::Tensor q_perm, int64_t bit); -// void static_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input, -// torch::Tensor const& scale); +// void static_scaled_fp8_quant( +// torch::Tensor& out, torch::Tensor const& input, 
torch::Tensor const& scale, +// std::optional> group_shape = std::nullopt); // void dynamic_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input, // torch::Tensor& scale); diff --git a/csrc/quantization/fp4/activation_nvfp4_quant_fusion_kernels.cu b/csrc/quantization/fp4/activation_nvfp4_quant_fusion_kernels.cu index 7539f836ecf379e50efbe9602eb78207bcd13dee..2ea229c47d7ec9d2fb82423b49f253957e5fabdf 100644 --- a/csrc/quantization/fp4/activation_nvfp4_quant_fusion_kernels.cu +++ b/csrc/quantization/fp4/activation_nvfp4_quant_fusion_kernels.cu @@ -31,37 +31,6 @@ namespace vllm { -// silu in float32 -__device__ __forceinline__ float silu(float x) { - return __fdividef(x, (1.f + __expf(-x))); -} - -__device__ __forceinline__ float2 silu2(float2 x) { - return make_float2(silu(x.x), silu(x.y)); -} - -template -__inline__ __device__ PackedVec compute_silu_mul(PackedVec& vec, - PackedVec& vec2) { - PackedVec result; - using packed_type = typename TypeConverter::Type; - -#pragma unroll - for (int i = 0; i < CVT_FP4_ELTS_PER_THREAD / 2; ++i) { - // silu_mul in float32 - if constexpr (std::is_same_v) { - float2 silu_vec = silu2(__half22float2(vec.elts[i])); - result.elts[i] = - __float22half2_rn(__fmul2_rn(silu_vec, __half22float2(vec2.elts[i]))); - } else { - float2 silu_vec = silu2(__bfloat1622float2(vec.elts[i])); - result.elts[i] = __float22bfloat162_rn( - __fmul2_rn(silu_vec, __bfloat1622float2(vec2.elts[i]))); - } - } - return result; -} - // Use UE4M3 by default. template __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024)) @@ -74,6 +43,9 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024)) static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD, "Vec size is not matched."); + // Precompute SF layout parameter (constant for entire kernel). + int32_t const numKTiles = (numCols + 63) / 64; + // Get the global scaling factor, which will be applied to the SF. 
// Note SFScale is the same as next GEMM's alpha, which is // (448.f / (Alpha_A / 6.f)). @@ -101,7 +73,7 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024)) auto sf_out = cvt_quant_to_fp4_get_sf_out_offset( - rowIdx, colIdx, numCols, SFout); + rowIdx, colIdx, numKTiles, SFout); out_pos = cvt_warp_fp16_to_fp4(out_silu_mul, SFScaleVal, sf_out); diff --git a/csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu b/csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu index 674440278383294414438a2572acde32e76eccf7..ae8ef1bf99d6480eb35cd72f31de71d66bf3cde2 100644 --- a/csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu +++ b/csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu @@ -62,7 +62,9 @@ __global__ void __get_group_gemm_starts( ElementSF* a_scales_base_as_int, ElementSF* b_scales_base_as_int, ElementAccumulator* alphas_base_as_int, const int32_t* expert_offsets, const int32_t* sf_offsets, const int32_t* problem_sizes_as_shapes, - const int K, const int N) { + int64_t* a_strides, int64_t* b_strides, int64_t* c_strides, + const int64_t a_stride_val, const int64_t b_stride_val, + const int64_t c_stride_val, const int K, const int N) { int64_t expert_id = threadIdx.x; if (expert_id >= gridDim.x * blockDim.x) { return; @@ -103,6 +105,11 @@ __global__ void __get_group_gemm_starts( // Shape of alpha = [E] alpha_offsets[expert_id] = alphas_base_as_int + expert_id; + // Initialize strides (constant across all experts, avoids separate kernels) + a_strides[expert_id] = a_stride_val; + b_strides[expert_id] = b_stride_val; + c_strides[expert_id] = c_stride_val; + LayoutSFA* layout_sfa_ptr = layout_sfa_base_as_int + expert_id; LayoutSFB* layout_sfb_ptr = layout_sfb_base_as_int + expert_id; @@ -135,7 +142,11 @@ __global__ void __get_group_gemm_starts( static_cast(alphas.data_ptr()), \ static_cast(expert_offsets.data_ptr()), \ static_cast(sf_offsets.data_ptr()), \ - static_cast(problem_sizes.data_ptr()), K, N); \ + static_cast(problem_sizes.data_ptr()), \ + 
static_cast(a_strides.data_ptr()), \ + static_cast(b_strides.data_ptr()), \ + static_cast(c_strides.data_ptr()), a_stride_val, \ + b_stride_val, c_stride_val, K, N); \ } template @@ -144,6 +155,9 @@ void run_get_group_gemm_starts( const torch::Tensor& out_starts, const torch::Tensor& a_scales_starts, const torch::Tensor& b_scales_starts, const torch::Tensor& alpha_starts, const torch::Tensor& layout_sfa, const torch::Tensor& layout_sfb, + const torch::Tensor& a_strides, const torch::Tensor& b_strides, + const torch::Tensor& c_strides, int64_t a_stride_val, int64_t b_stride_val, + int64_t c_stride_val, /*these are used for their base addresses*/ torch::Tensor const& a_tensors, torch::Tensor const& b_tensors, torch::Tensor const& out_tensors, torch::Tensor const& a_scales, @@ -269,17 +283,16 @@ void run_fp4_blockwise_scaled_group_mm_sm100( torch::Tensor alpha_ptrs = torch::empty(num_experts, options_int); torch::Tensor layout_sfa = torch::empty({num_experts, 5}, options_int); torch::Tensor layout_sfb = torch::empty({num_experts, 5}, options_int); - torch::Tensor c_strides1 = - torch::full({num_experts}, output.stride(0), options_int); - torch::Tensor a_strides1 = - torch::full({num_experts}, a.stride(0) * 2, options_int); - torch::Tensor b_strides1 = - torch::full({num_experts}, b.stride(1) * 2, options_int); + torch::Tensor a_strides1 = torch::empty(num_experts, options_int); + torch::Tensor b_strides1 = torch::empty(num_experts, options_int); + torch::Tensor c_strides1 = torch::empty(num_experts, options_int); run_get_group_gemm_starts( a_ptrs, b_ptrs, out_ptrs, a_scales_ptrs, b_scales_ptrs, alpha_ptrs, - layout_sfa, layout_sfb, a, b, output, a_blockscale, b_blockscales, alphas, - expert_offsets, sf_offsets, problem_sizes, M, N, K); + layout_sfa, layout_sfb, a_strides1, b_strides1, c_strides1, + a.stride(0) * 2, b.stride(1) * 2, output.stride(0), a, b, output, + a_blockscale, b_blockscales, alphas, expert_offsets, sf_offsets, + problem_sizes, M, N, K); // Create an 
instance of the GEMM Gemm gemm_op; @@ -444,17 +457,16 @@ void run_fp4_blockwise_scaled_group_mm_sm120( torch::Tensor alpha_ptrs = torch::empty(num_experts, options_int); torch::Tensor layout_sfa = torch::empty({num_experts, 5}, options_int); torch::Tensor layout_sfb = torch::empty({num_experts, 5}, options_int); - torch::Tensor c_strides1 = - torch::full({num_experts}, output.stride(0), options_int); - torch::Tensor a_strides1 = - torch::full({num_experts}, a.stride(0) * 2, options_int); - torch::Tensor b_strides1 = - torch::full({num_experts}, b.stride(1) * 2, options_int); + torch::Tensor a_strides1 = torch::empty(num_experts, options_int); + torch::Tensor b_strides1 = torch::empty(num_experts, options_int); + torch::Tensor c_strides1 = torch::empty(num_experts, options_int); run_get_group_gemm_starts( a_ptrs, b_ptrs, out_ptrs, a_scales_ptrs, b_scales_ptrs, alpha_ptrs, - layout_sfa, layout_sfb, a, b, output, a_blockscale, b_blockscales, alphas, - expert_offsets, sf_offsets, problem_sizes, M, N, K); + layout_sfa, layout_sfb, a_strides1, b_strides1, c_strides1, + a.stride(0) * 2, b.stride(1) * 2, output.stride(0), a, b, output, + a_blockscale, b_blockscales, alphas, expert_offsets, sf_offsets, + problem_sizes, M, N, K); // Create an instance of the GEMM Gemm gemm_op; diff --git a/csrc/quantization/fp4/nvfp4_experts_quant.cu b/csrc/quantization/fp4/nvfp4_experts_quant.cu index 82c53c2375a31e91dd58e1cefe335088e51e601e..aa573c007b3dfde325a2db6bd52c92fca4c153ed 100644 --- a/csrc/quantization/fp4/nvfp4_experts_quant.cu +++ b/csrc/quantization/fp4/nvfp4_experts_quant.cu @@ -25,13 +25,18 @@ #include #include "dispatch_utils.h" +#include "cuda_utils.h" #include "nvfp4_utils.cuh" #include "launch_bounds_utils.h" namespace vllm { +// NVFP4 quantization kernel for experts (low-latency path). +// When FUSE_SILU_MUL=true, expects input with gate||up layout and fuses +// SiLU(gate)*up before quantization. // Use UE4M3 by default. 
-template +template __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512)) cvt_fp16_to_fp4(int32_t numRows, int32_t numCols, Type const* in, float const* SFScale, uint32_t* out, uint32_t* SFout, @@ -44,8 +49,13 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512)) static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD, "Vec size is not matched."); + // Precompute SF layout parameter (constant for entire kernel). + int32_t const numKTiles = (numCols + 63) / 64; + int tid = blockIdx.x * blockDim.x + threadIdx.x; int colsPerRow = numCols / CVT_FP4_ELTS_PER_THREAD; + // When fusing SiLU+Mul, input has gate || up layout (doubled width) + int inColsPerRow = FUSE_SILU_MUL ? colsPerRow * 2 : colsPerRow; // Each global thread processes one element for (int globalIdx = tid; globalIdx < numRows * colsPerRow; @@ -54,13 +64,6 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512)) int rowIdx = globalIdx / colsPerRow; int colIdx = globalIdx % colsPerRow; - int64_t inOffset = rowIdx * colsPerRow + colIdx; - PackedVec in_vec = reinterpret_cast(in)[inOffset]; - // Get the output tensor offset. - // Same as inOffset because 8 elements are packed into one uint32_t. - int64_t outOffset = inOffset; - auto& out_pos = out[outOffset]; - // Find index within the experts using different strategies based on expert // count int rowIdx_in_expert = 0; @@ -107,29 +110,46 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512)) } } + // Load input and optionally apply fused SiLU+Mul + int64_t inOffset = rowIdx * inColsPerRow + colIdx; + PackedVec in_vec = reinterpret_cast(in)[inOffset]; + PackedVec quant_input; + if constexpr (FUSE_SILU_MUL) { + PackedVec in_vec_up = + reinterpret_cast(in)[inOffset + colsPerRow]; + quant_input = compute_silu_mul(in_vec, in_vec_up); + } else { + quant_input = in_vec; + } + + // Get the output tensor offset. + // Same as inOffset because 8 elements are packed into one uint32_t. 
+ int64_t outOffset = rowIdx * colsPerRow + colIdx; + auto& out_pos = out[outOffset]; + // Get the global scaling factor, which will be applied to the SF. // Note SFScale is the same as next GEMM's alpha, which is // (448.f / (Alpha_A / 6.f)). float const SFScaleVal = SFScale == nullptr ? 1.0f : SFScale[expert_idx]; - int factor = CVT_FP4_SF_VEC_SIZE * 4; - // The actual output_scales dim is computed from the padded numCols. - int32_t numCols_padded = (numCols + factor - 1) / factor * factor; - int numCols_SFout = numCols_padded / CVT_FP4_SF_VEC_SIZE / 4; uint32_t* SFout_in_expert = - SFout + output_scale_offset_by_experts[expert_idx] * numCols_SFout; + SFout + output_scale_offset_by_experts[expert_idx] * numKTiles; auto sf_out = cvt_quant_to_fp4_get_sf_out_offset( - rowIdx_in_expert, colIdx, numCols, SFout_in_expert); + rowIdx_in_expert, colIdx, numKTiles, SFout_in_expert); - out_pos = cvt_warp_fp16_to_fp4(in_vec, SFScaleVal, sf_out); + out_pos = + cvt_warp_fp16_to_fp4(quant_input, SFScaleVal, sf_out); } } -// Kernel for LARGE_M_TOPK = true (large m_topk optimized version) -template +// NVFP4 quantization kernel for LARGE_M_TOPK = true (large m_topk optimized +// version). When FUSE_SILU_MUL=true, expects input with gate||up layout and +// fuses SiLU(gate)*up before quantization. +template __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024)) cvt_fp16_to_fp4(int32_t numRows, int32_t numCols, Type const* in, float const* SFScale, uint32_t* out, uint32_t* SFout, @@ -140,6 +160,10 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024)) (CVT_FP4_SF_VEC_SIZE / CVT_FP4_ELTS_PER_THREAD); static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD, "Vec size is not matched."); + + // Precompute SF layout parameter (constant for entire kernel). + int32_t const numKTiles = (numCols + 63) / 64; + extern __shared__ uint32_t shared_input_offsets[]; // Load input offsets into shared memory. 
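Review note (illustration only, not part of the patch): the hunks above change the input indexing so that, when `FUSE_SILU_MUL` is true, each input row is read as `gate || up` with doubled width, while the output keeps the un-doubled width. A minimal host-side sketch of that indexing follows. It assumes 8 elements per packed vector (taken from the "8 elements are packed into one uint32_t" comment); `fused_offsets`, `silu_ref`, and `silu_mul_ref` are hypothetical names introduced here, not functions from the repository.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference; names are illustrative only.
constexpr int kEltsPerThread = 8;  // assumed value of CVT_FP4_ELTS_PER_THREAD

struct PackedOffsets {
  int64_t gate;  // packed-vector index of the gate half
  int64_t up;    // packed-vector index of the up half
  int64_t out;   // packed-vector index in the (un-doubled) output
};

// Mirrors the fused indexing in the diff: the input row stride is doubled,
// the "up" half starts colsPerRow packed vectors after the "gate" half,
// and the output stride stays at colsPerRow.
inline PackedOffsets fused_offsets(int rowIdx, int colIdx, int numCols) {
  const int colsPerRow = numCols / kEltsPerThread;
  const int inColsPerRow = 2 * colsPerRow;
  const int64_t gate = static_cast<int64_t>(rowIdx) * inColsPerRow + colIdx;
  const int64_t out = static_cast<int64_t>(rowIdx) * colsPerRow + colIdx;
  return {gate, gate + colsPerRow, out};
}

// Plain float32 SiLU, the exact counterpart of the intrinsic-based device code.
inline float silu_ref(float x) { return x / (1.0f + std::exp(-x)); }

// Element-wise reference: out[r][c] = silu(gate[r][c]) * up[r][c],
// where each input row is laid out as [gate(0..k-1) | up(0..k-1)].
inline std::vector<float> silu_mul_ref(const std::vector<float>& in, int rows,
                                       int k) {
  std::vector<float> out(static_cast<std::size_t>(rows) * k);
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < k; ++c) {
      const std::size_t gate_idx = static_cast<std::size_t>(r) * 2 * k + c;
      const std::size_t up_idx = gate_idx + k;
      out[static_cast<std::size_t>(r) * k + c] =
          silu_ref(in[gate_idx]) * in[up_idx];
    }
  }
  return out;
}
```

One consequence worth checking in review: in the fused path the output offset no longer equals the input offset, so the retained comment "Same as inOffset" only holds when `FUSE_SILU_MUL` is false.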
@@ -163,6 +187,8 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024)) int tid = blockIdx.x * blockDim.x + threadIdx.x; int colsPerRow = numCols / CVT_FP4_ELTS_PER_THREAD; + // When fusing SiLU+Mul, input has gate || up layout (doubled width) + int inColsPerRow = FUSE_SILU_MUL ? colsPerRow * 2 : colsPerRow; // Each global thread processes one element for (int globalIdx = tid; globalIdx < numRows * colsPerRow; @@ -171,11 +197,6 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024)) int rowIdx = globalIdx / colsPerRow; int colIdx = globalIdx % colsPerRow; - int64_t inOffset = rowIdx * colsPerRow + colIdx; - PackedVec in_vec = reinterpret_cast(in)[inOffset]; - int64_t outOffset = inOffset; - auto& out_pos = out[outOffset]; - // Find expert using binary search for better performance with large m_topk int rowIdx_in_expert = 0; int expert_idx = 0; @@ -200,34 +221,43 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024)) } } + // Load input and optionally apply fused SiLU+Mul + int64_t inOffset = rowIdx * inColsPerRow + colIdx; + PackedVec in_vec = reinterpret_cast(in)[inOffset]; + PackedVec quant_input; + if constexpr (FUSE_SILU_MUL) { + PackedVec in_vec_up = + reinterpret_cast(in)[inOffset + colsPerRow]; + quant_input = compute_silu_mul(in_vec, in_vec_up); + } else { + quant_input = in_vec; + } + + int64_t outOffset = rowIdx * colsPerRow + colIdx; + auto& out_pos = out[outOffset]; + float const SFScaleVal = SFScale == nullptr ? 
1.0f : SFScale[expert_idx]; - int factor = CVT_FP4_SF_VEC_SIZE * 4; - int32_t numCols_padded = (numCols + factor - 1) / factor * factor; - int numCols_SFout = numCols_padded / CVT_FP4_SF_VEC_SIZE / 4; uint32_t* SFout_in_expert = - SFout + output_scale_offset_by_experts[expert_idx] * numCols_SFout; + SFout + output_scale_offset_by_experts[expert_idx] * numKTiles; auto sf_out = cvt_quant_to_fp4_get_sf_out_offset( - rowIdx_in_expert, colIdx, numCols, SFout_in_expert); + rowIdx_in_expert, colIdx, numKTiles, SFout_in_expert); - out_pos = cvt_warp_fp16_to_fp4(in_vec, SFScaleVal, sf_out); + out_pos = + cvt_warp_fp16_to_fp4(quant_input, SFScaleVal, sf_out); } } -template +template void quant_impl(void* output, void* output_scale, void* input, void* input_global_scale, void* input_offset_by_experts, void* output_scale_offset_by_experts, int m_topk, int k, int n_experts, cudaStream_t stream) { - // TODO: this multiProcessorCount should be cached. - int device; - cudaGetDevice(&device); - int multiProcessorCount; - cudaDeviceGetAttribute(&multiProcessorCount, cudaDevAttrMultiProcessorCount, - device); + int multiProcessorCount = + get_device_attribute(cudaDevAttrMultiProcessorCount, -1); // Grid, Block size. // Each thread converts 8 values. 
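Review note (illustration only, not part of the patch): the hunk above replaces the per-element `numCols_SFout` computation with the precomputed `numKTiles = (numCols + 63) / 64`. The two agree only if `CVT_FP4_SF_VEC_SIZE` is 16, which is assumed here from the "SF vector size 16" comment elsewhere in this diff. A small host-side sketch of the equivalence check; `old_num_cols_sfout`, `num_k_tiles`, and `first_mismatch` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kSfVecSize = 16;  // assumed value of CVT_FP4_SF_VEC_SIZE

// The computation removed by the patch: pad numCols up to a multiple of
// (SF vector size * 4), then count groups of 4 scale factors.
constexpr int32_t old_num_cols_sfout(int32_t numCols) {
  const int factor = kSfVecSize * 4;
  const int32_t padded = (numCols + factor - 1) / factor * factor;
  return padded / kSfVecSize / 4;
}

// The replacement: one ceil-division, hoisted out of the per-element loop.
constexpr int32_t num_k_tiles(int32_t numCols) { return (numCols + 63) / 64; }

// Returns the first numCols in [0, maxCols] where the two disagree, or -1.
inline int32_t first_mismatch(int32_t maxCols) {
  for (int32_t n = 0; n <= maxCols; ++n) {
    if (old_num_cols_sfout(n) != num_k_tiles(n)) return n;
  }
  return -1;
}
```

If `CVT_FP4_SF_VEC_SIZE` ever changes, the hard-coded 63 and 64 would need to change with it; deriving them from the constant would avoid that coupling.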
@@ -249,7 +279,7 @@ void quant_impl(void* output, void* output_scale, void* input, if (blockRepeat > 1) { size_t shared_mem_size = (n_experts + 1) * sizeof(uint32_t); if (n_experts >= 4) { - cvt_fp16_to_fp4 + cvt_fp16_to_fp4 <<>>( m_topk, k, reinterpret_cast(input), reinterpret_cast(input_global_scale), @@ -259,34 +289,37 @@ void quant_impl(void* output, void* output_scale, void* input, reinterpret_cast(output_scale_offset_by_experts), n_experts); } else { - cvt_fp16_to_fp4<<>>( - m_topk, k, reinterpret_cast(input), - reinterpret_cast(input_global_scale), - reinterpret_cast(output), - reinterpret_cast(output_scale), - reinterpret_cast(input_offset_by_experts), - reinterpret_cast(output_scale_offset_by_experts), - n_experts); + cvt_fp16_to_fp4 + <<>>( + m_topk, k, reinterpret_cast(input), + reinterpret_cast(input_global_scale), + reinterpret_cast(output), + reinterpret_cast(output_scale), + reinterpret_cast(input_offset_by_experts), + reinterpret_cast(output_scale_offset_by_experts), + n_experts); } } else { if (n_experts >= 16) { - cvt_fp16_to_fp4<<>>( - m_topk, k, reinterpret_cast(input), - reinterpret_cast(input_global_scale), - reinterpret_cast(output), - reinterpret_cast(output_scale), - reinterpret_cast(input_offset_by_experts), - reinterpret_cast(output_scale_offset_by_experts), - n_experts, /* bool low_latency */ true); + cvt_fp16_to_fp4 + <<>>( + m_topk, k, reinterpret_cast(input), + reinterpret_cast(input_global_scale), + reinterpret_cast(output), + reinterpret_cast(output_scale), + reinterpret_cast(input_offset_by_experts), + reinterpret_cast(output_scale_offset_by_experts), + n_experts, /* bool low_latency */ true); } else { - cvt_fp16_to_fp4<<>>( - m_topk, k, reinterpret_cast(input), - reinterpret_cast(input_global_scale), - reinterpret_cast(output), - reinterpret_cast(output_scale), - reinterpret_cast(input_offset_by_experts), - reinterpret_cast(output_scale_offset_by_experts), - n_experts, /* bool low_latency */ true); + cvt_fp16_to_fp4 + <<>>( + 
m_topk, k, reinterpret_cast(input), + reinterpret_cast(input_global_scale), + reinterpret_cast(output), + reinterpret_cast(output_scale), + reinterpret_cast(input_offset_by_experts), + reinterpret_cast(output_scale_offset_by_experts), + n_experts, /* bool low_latency */ true); } } } @@ -307,19 +340,19 @@ constexpr auto FLOAT = at::ScalarType::Float; constexpr auto INT = at::ScalarType::Int; constexpr auto UINT8 = at::ScalarType::Byte; -void scaled_fp4_experts_quant_sm1xxa( - torch::Tensor& output, torch::Tensor& output_scale, +// Common validation for fp4 experts quantization entry points. +static void validate_fp4_experts_quant_inputs( + torch::Tensor const& output, torch::Tensor const& output_scale, torch::Tensor const& input, torch::Tensor const& input_global_scale, torch::Tensor const& input_offset_by_experts, - torch::Tensor const& output_scale_offset_by_experts) { - CHECK_INPUT(output, "output must be a CUDA tensor"); - CHECK_INPUT(output_scale, "output_scale must be a CUDA tensor"); - CHECK_INPUT(input, "input must be a CUDA tensor"); - CHECK_INPUT(input_global_scale, "input_global_scale must be a CUDA tensor"); - CHECK_INPUT(input_offset_by_experts, - "input_offset_by_experts must be a CUDA tensor"); - CHECK_INPUT(output_scale_offset_by_experts, - "output_scale_offset_by_experts must be a CUDA tensor"); + torch::Tensor const& output_scale_offset_by_experts, int64_t m_topk, + int64_t k) { + CHECK_INPUT(output, "output"); + CHECK_INPUT(output_scale, "output_scale"); + CHECK_INPUT(input, "input"); + CHECK_INPUT(input_global_scale, "input_global_scale"); + CHECK_INPUT(input_offset_by_experts, "input_offset_by_experts"); + CHECK_INPUT(output_scale_offset_by_experts, "output_scale_offset_by_experts"); TORCH_CHECK(output.dim() == 2); TORCH_CHECK(output_scale.dim() == 2); @@ -338,8 +371,6 @@ void scaled_fp4_experts_quant_sm1xxa( TORCH_CHECK(output_scale.scalar_type() == INT); const int BLOCK_SIZE = 16; - auto m_topk = input.size(0); - auto k = input.size(1); 
TORCH_CHECK(k % BLOCK_SIZE == 0, "k must be a multiple of 16"); auto n_experts = input_global_scale.size(0); TORCH_CHECK(input_offset_by_experts.size(0) == n_experts + 1); @@ -351,7 +382,21 @@ void scaled_fp4_experts_quant_sm1xxa( int padded_k = (scales_k + (4 - 1)) / 4 * 4; // 4 means 4 fp8 values are packed into one int32 TORCH_CHECK(output_scale.size(1) * 4 == padded_k); +} +void scaled_fp4_experts_quant_sm1xxa( + torch::Tensor& output, torch::Tensor& output_scale, + torch::Tensor const& input, torch::Tensor const& input_global_scale, + torch::Tensor const& input_offset_by_experts, + torch::Tensor const& output_scale_offset_by_experts) { + auto m_topk = input.size(0); + auto k = input.size(1); + + validate_fp4_experts_quant_inputs(output, output_scale, input, + input_global_scale, input_offset_by_experts, + output_scale_offset_by_experts, m_topk, k); + + auto n_experts = input_global_scale.size(0); const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); const cudaStream_t stream = at::cuda::getCurrentCUDAStream(input.get_device()); @@ -359,7 +404,38 @@ void scaled_fp4_experts_quant_sm1xxa( VLLM_DISPATCH_HALF_TYPES( input.scalar_type(), "nvfp4_experts_quant_kernel", [&] { using cuda_type = vllm::CUDATypeConverter::Type; - vllm::quant_impl( + vllm::quant_impl( + output.data_ptr(), output_scale.data_ptr(), input.data_ptr(), + input_global_scale.data_ptr(), input_offset_by_experts.data_ptr(), + output_scale_offset_by_experts.data_ptr(), m_topk, k, n_experts, + stream); + }); +} + +void silu_and_mul_scaled_fp4_experts_quant_sm1xxa( + torch::Tensor& output, torch::Tensor& output_scale, + torch::Tensor const& input, torch::Tensor const& input_global_scale, + torch::Tensor const& input_offset_by_experts, + torch::Tensor const& output_scale_offset_by_experts) { + auto m_topk = input.size(0); + // Input has gate || up layout, so k = input.size(1) / 2 + auto k_times_2 = input.size(1); + TORCH_CHECK(k_times_2 % 2 == 0, "input width must be even (gate || up)"); + 
auto k = k_times_2 / 2; + + validate_fp4_experts_quant_inputs(output, output_scale, input, + input_global_scale, input_offset_by_experts, + output_scale_offset_by_experts, m_topk, k); + + auto n_experts = input_global_scale.size(0); + const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); + const cudaStream_t stream = + at::cuda::getCurrentCUDAStream(input.get_device()); + + VLLM_DISPATCH_HALF_TYPES( + input.scalar_type(), "silu_mul_nvfp4_experts_quant_kernel", [&] { + using cuda_type = vllm::CUDATypeConverter::Type; + vllm::quant_impl( output.data_ptr(), output_scale.data_ptr(), input.data_ptr(), input_global_scale.data_ptr(), input_offset_by_experts.data_ptr(), output_scale_offset_by_experts.data_ptr(), m_topk, k, n_experts, diff --git a/csrc/quantization/fp4/nvfp4_quant_entry.cu b/csrc/quantization/fp4/nvfp4_quant_entry.cu index fb6d22f035b99d8fef3e4bfa54d080fc258b0d8c..25e0ba8486c7e6098b7d5e71558bd3d4c4601c1f 100644 --- a/csrc/quantization/fp4/nvfp4_quant_entry.cu +++ b/csrc/quantization/fp4/nvfp4_quant_entry.cu @@ -41,6 +41,15 @@ void silu_and_mul_nvfp4_quant_sm1xxa(torch::Tensor& output, torch::Tensor& input_sf); #endif +#if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \ + (defined(ENABLE_NVFP4_SM120) && ENABLE_NVFP4_SM120) +void silu_and_mul_scaled_fp4_experts_quant_sm1xxa( + torch::Tensor& output, torch::Tensor& output_scale, + torch::Tensor const& input, torch::Tensor const& input_global_scale, + torch::Tensor const& input_offset_by_experts, + torch::Tensor const& output_scale_offset_by_experts); +#endif + void scaled_fp4_quant(torch::Tensor& output, torch::Tensor const& input, torch::Tensor& output_sf, torch::Tensor const& input_sf) { #if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \ @@ -74,3 +83,18 @@ void silu_and_mul_nvfp4_quant(torch::Tensor& output, torch::Tensor& output_sf, TORCH_CHECK_NOT_IMPLEMENTED( false, "No compiled silu_and_mul nvfp4 quantization kernel"); } + +void silu_and_mul_scaled_fp4_experts_quant( + 
torch::Tensor& output, torch::Tensor& output_scale, + torch::Tensor const& input, torch::Tensor const& input_global_scale, + torch::Tensor const& input_offset_by_experts, + torch::Tensor const& output_scale_offset_by_experts) { +#if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \ + (defined(ENABLE_NVFP4_SM120) && ENABLE_NVFP4_SM120) + return silu_and_mul_scaled_fp4_experts_quant_sm1xxa( + output, output_scale, input, input_global_scale, input_offset_by_experts, + output_scale_offset_by_experts); +#endif + TORCH_CHECK_NOT_IMPLEMENTED( + false, "No compiled silu_and_mul nvfp4 experts quantization kernel"); +} diff --git a/csrc/quantization/fp4/nvfp4_quant_kernels.cu b/csrc/quantization/fp4/nvfp4_quant_kernels.cu index 6d69852bb4e4f9c639fd6ad1ffad197555d9e54c..8e38deeb6607fb8da8da323aee03817c502c2cad 100644 --- a/csrc/quantization/fp4/nvfp4_quant_kernels.cu +++ b/csrc/quantization/fp4/nvfp4_quant_kernels.cu @@ -35,7 +35,13 @@ template __host__ __device__ inline Int round_up(Int x, Int y) { static_assert(std::is_integral_v, "round_up argument must be integral type"); - return (x + y - 1) / y * y; + return ((x + y - 1) / y) * y; +} + +// Compute effective rows for grid configuration with swizzled SF layouts. +inline int computeEffectiveRows(int m) { + constexpr int ROW_TILE = 128; + return round_up(m, ROW_TILE); } // Use UE4M3 by default. @@ -49,81 +55,57 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512)) static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD, "Vec size is not matched."); + // Precompute SF layout parameter (constant for entire kernel). + int32_t const numKTiles = (numCols + 63) / 64; + int sf_m = round_up(numRows, 128); int sf_n_unpadded = numCols / CVT_FP4_SF_VEC_SIZE; int sf_n_int = round_up(sf_n_unpadded, 4) / 4; - for (int row = numRows + blockIdx.x; row < sf_m; row += gridDim.x) { - // Each thread writes 4 uint32_t elements. 
- for (int col = sf_n_unpadded + threadIdx.x * 4; col < sf_n_int; - col += blockDim.x * 4) { - SFout[row * sf_n_int + col] = 0x00; - } - } + int num_padded_cols = sf_n_int * 4 * CVT_FP4_SF_VEC_SIZE; // Get the global scaling factor, which will be applied to the SF. // Note SFScale is the same as next GEMM's alpha, which is // (448.f / (Alpha_A / 6.f)). float const global_scale = SFScale == nullptr ? 1.0f : SFScale[0]; - // Input tensor row/col loops. - for (int rowIdx = blockIdx.x; rowIdx < numRows; rowIdx += gridDim.x) { - for (int colIdx = threadIdx.x; colIdx < numCols / CVT_FP4_ELTS_PER_THREAD; + // Iterate over all rows and cols including padded ones - + // ensures we visit every single scale factor address to initialize it. + for (int rowIdx = blockIdx.x; rowIdx < sf_m; rowIdx += gridDim.x) { + for (int colIdx = threadIdx.x; + colIdx < num_padded_cols / CVT_FP4_ELTS_PER_THREAD; colIdx += blockDim.x) { + int elem_idx = colIdx * CVT_FP4_ELTS_PER_THREAD; + + PackedVec in_vec; int64_t inOffset = rowIdx * (numCols / CVT_FP4_ELTS_PER_THREAD) + colIdx; - PackedVec in_vec = reinterpret_cast(in)[inOffset]; - // Get the output tensor offset. - // Same as inOffset because 8 elements are packed into one uint32_t. - int64_t outOffset = inOffset; - auto& out_pos = out[outOffset]; + + // If we are outside valid rows OR outside valid columns -> Use Zeros + if (rowIdx >= numRows || elem_idx >= numCols) { + memset(&in_vec, 0, sizeof(PackedVec)); + + } else { + // Valid Region: Load actual data + in_vec = reinterpret_cast(in)[inOffset]; + } auto sf_out = cvt_quant_to_fp4_get_sf_out_offset( - rowIdx, colIdx, numCols, SFout); + rowIdx, colIdx, numKTiles, SFout); - out_pos = + auto out_val = cvt_warp_fp16_to_fp4(in_vec, global_scale, sf_out); - } - } -} -template -void invokeFP4Quantization(int m, int n, T const* input, float const* SFScale, - int64_t* output, int32_t* SFOuput, bool useUE8M0, - int multiProcessorCount, cudaStream_t stream) { - // Grid, Block size. 
- // Each thread converts 8 values. - dim3 block(std::min(int(n / ELTS_PER_THREAD), 512)); - // Get number of blocks per SM - int const numBlocksPerSM = - vllm_runtime_blocks_per_sm(static_cast(block.x)); - dim3 grid(std::min(int(m), multiProcessorCount * numBlocksPerSM)); - - // Launch the cvt kernel. - if (useUE8M0) { - cvt_fp16_to_fp4<<>>( - m, n, input, SFScale, reinterpret_cast(output), - reinterpret_cast(SFOuput)); - } else { - cvt_fp16_to_fp4<<>>( - m, n, input, SFScale, reinterpret_cast(output), - reinterpret_cast(SFOuput)); + // We do NOT write output for padding because the 'out' tensor is not + // padded. + if (rowIdx < numRows && elem_idx < numCols) { + // Same as inOffset because 8 elements are packed into one uint32_t. + out[inOffset] = out_val; + } + } } } -// Instantiate the function. -template void invokeFP4Quantization(int m, int n, half const* input, - float const* SFScale, int64_t* output, - int32_t* SFOuput, bool useUE8M0, - int multiProcessorCount, - cudaStream_t stream); - -template void invokeFP4Quantization(int m, int n, __nv_bfloat16 const* input, - float const* SFScale, int64_t* output, - int32_t* SFOuput, bool useUE8M0, - int multiProcessorCount, - cudaStream_t stream); - } // namespace vllm void scaled_fp4_quant_sm1xxa(torch::Tensor const& output, @@ -147,13 +129,19 @@ void scaled_fp4_quant_sm1xxa(torch::Tensor const& output, const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); auto stream = at::cuda::getCurrentCUDAStream(input.get_device()); - // We don't support e8m0 scales at this moment. - bool useUE8M0 = false; + // Grid, Block size. Each thread converts 8 values. 
+ dim3 block(std::min(int(n / ELTS_PER_THREAD), 512)); + int const numBlocksPerSM = + vllm_runtime_blocks_per_sm(static_cast(block.x)); + int effectiveRows = vllm::computeEffectiveRows(m); + dim3 grid(std::min(effectiveRows, multiProcessorCount * numBlocksPerSM)); VLLM_DISPATCH_HALF_TYPES(input.scalar_type(), "nvfp4_quant_kernel", [&] { using cuda_type = vllm::CUDATypeConverter::Type; auto input_ptr = static_cast(input.data_ptr()); - vllm::invokeFP4Quantization(m, n, input_ptr, input_sf_ptr, output_ptr, - sf_out, useUE8M0, multiProcessorCount, stream); + // NOTE: We don't support e8m0 scales at this moment. + vllm::cvt_fp16_to_fp4<<>>( + m, n, input_ptr, input_sf_ptr, reinterpret_cast(output_ptr), + reinterpret_cast(sf_out)); }); -} +} \ No newline at end of file diff --git a/csrc/quantization/fp4/nvfp4_utils.cuh b/csrc/quantization/fp4/nvfp4_utils.cuh index 48e4959de979378e09eba59924e8bc3ff2ad8e3b..7082ad684bc3e651d38948c8827de109439b367c 100644 --- a/csrc/quantization/fp4/nvfp4_utils.cuh +++ b/csrc/quantization/fp4/nvfp4_utils.cuh @@ -128,51 +128,42 @@ inline __device__ float reciprocal_approximate_ftz(float a) { return b; } +// Compute SF output offset for swizzled tensor core layout. +// SF layout: [numMTiles, numKTiles, 32, 4, 4] +// Caller must precompute: numKTiles = (numCols + 63) / 64 template -__device__ uint8_t* cvt_quant_to_fp4_get_sf_out_offset(int rowIdx, int colIdx, - int numCols, - SFType* SFout) { +__device__ __forceinline__ uint8_t* cvt_quant_to_fp4_get_sf_out_offset( + int rowIdx, int colIdx, int32_t numKTiles, SFType* SFout) { static_assert(CVT_FP4_NUM_THREADS_PER_SF == 1 || CVT_FP4_NUM_THREADS_PER_SF == 2); // One pair of threads write one SF to global memory. // TODO: stage through smem for packed STG.32 // is it better than STG.8 from 4 threads ? - if (threadIdx.x % CVT_FP4_NUM_THREADS_PER_SF == 0) { - // SF vector index (16 elements share one SF in the K dimension). 
- int32_t kIdx = colIdx / CVT_FP4_NUM_THREADS_PER_SF; - int32_t mIdx = rowIdx; - - // SF layout [numMTiles, numKTiles, 32 (mTile), 4 (mTile), 4(kTile)] - // --> index [mTileIdx, kTileIdx, outerMIdx, innerMIdx, innerKIdx] - - int32_t mTileIdx = mIdx / (32 * 4); - // SF vector size 16. - int factor = CVT_FP4_SF_VEC_SIZE * 4; - int32_t numKTiles = (numCols + factor - 1) / factor; - int64_t mTileStride = numKTiles * 32 * 4 * 4; - - int32_t kTileIdx = (kIdx / 4); - int64_t kTileStride = 32 * 4 * 4; - - // M tile layout [32, 4] is column-major. - int32_t outerMIdx = (mIdx % 32); - int64_t outerMStride = 4 * 4; - - int32_t innerMIdx = (mIdx % (32 * 4)) / 32; - int64_t innerMStride = 4; - - int32_t innerKIdx = (kIdx % 4); - int64_t innerKStride = 1; - - // Compute the global offset. - int64_t SFOffset = mTileIdx * mTileStride + kTileIdx * kTileStride + - outerMIdx * outerMStride + innerMIdx * innerMStride + - innerKIdx * innerKStride; - - return reinterpret_cast(SFout) + SFOffset; + if (threadIdx.x % CVT_FP4_NUM_THREADS_PER_SF != 0) { + return nullptr; } - return nullptr; + + // SF vector index (16 elements share one SF in the K dimension). + int32_t kIdx = colIdx / CVT_FP4_NUM_THREADS_PER_SF; + int32_t mIdx = rowIdx; + + // Decompose indices using bitwise ops (all divisors are powers of 2). + // SF layout [numMTiles, numKTiles, 32 (mTile), 4 (mTile), 4(kTile)] + int32_t mTileIdx = mIdx >> 7; // mIdx / 128 + int32_t outerMIdx = mIdx & 31; // mIdx % 32 + int32_t innerMIdx = (mIdx >> 5) & 3; // (mIdx / 32) % 4 + int32_t kTileIdx = kIdx >> 2; // kIdx / 4 + int32_t innerKIdx = kIdx & 3; // kIdx % 4 + + // Compute global SF offset: mTileIdx * (numKTiles * 512) + kTileIdx * 512 + + // outerMIdx * 16 + innerMIdx * 4 + innerKIdx + // Use bitwise OR for non-overlapping lower bits. 
+ int64_t SFOffset = (static_cast(mTileIdx) * numKTiles + kTileIdx) + << 9 | + (outerMIdx << 4) | (innerMIdx << 2) | innerKIdx; + + return reinterpret_cast(SFout) + SFOffset; } // Quantizes the provided PackedVec into the uint32_t output @@ -248,4 +239,34 @@ __device__ uint32_t cvt_warp_fp16_to_fp4(PackedVec& vec, float SFScaleVal, return e2m1Vec; } +// silu in float32 +__device__ __forceinline__ float silu(float x) { + return __fdividef(x, (1.f + __expf(-x))); +} + +__device__ __forceinline__ float2 silu2(float2 x) { + return make_float2(silu(x.x), silu(x.y)); +} + +template +__inline__ __device__ PackedVec compute_silu_mul( + const PackedVec& x_vec, const PackedVec& y_vec) { + PackedVec result; + +#pragma unroll + for (int i = 0; i < CVT_FP4_ELTS_PER_THREAD / 2; ++i) { + // silu_mul in float32 + if constexpr (std::is_same_v) { + float2 silu_vec = silu2(__half22float2(x_vec.elts[i])); + result.elts[i] = __float22half2_rn( + __fmul2_rn(silu_vec, __half22float2(y_vec.elts[i]))); + } else { + float2 silu_vec = silu2(__bfloat1622float2(x_vec.elts[i])); + result.elts[i] = __float22bfloat162_rn( + __fmul2_rn(silu_vec, __bfloat1622float2(y_vec.elts[i]))); + } + } + return result; +} + } // namespace vllm diff --git a/csrc/quantization/gptq/q_gemm.cu b/csrc/quantization/gptq/q_gemm.cu index 2f6bc7903b4546c2d0e26af01813ae2c8b65554d..5ce6229bbd617a2c908eda7a0e9dcb48722a9222 100644 --- a/csrc/quantization/gptq/q_gemm.cu +++ b/csrc/quantization/gptq/q_gemm.cu @@ -235,11 +235,6 @@ __global__ void gemm_half_q_half_gptq_4bit_kernel( // Zero output if (n >= size_n) return; - if (blockIdx.z == 0) { - for (int m = 0; m < m_count; m++) - *((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0; - } - __syncthreads(); // Find initial group @@ -374,11 +369,6 @@ __global__ void gemm_half_q_half_gptq_2bit_kernel( // Zero output if (n >= size_n) return; - if (blockIdx.z == 0) { - for (int m = 0; m < m_count; m++) - *((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0; - } - __syncthreads(); // Find 
initial group @@ -496,11 +486,6 @@ __global__ void gemm_half_q_half_gptq_3bit_kernel( // Zero output if (n >= size_n) return; - if (blockIdx.z == 0) { - for (int m = 0; m < m_count; m++) - *((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0; - } - __syncthreads(); // Find initial group @@ -625,11 +610,6 @@ __global__ void gemm_half_q_half_gptq_8bit_kernel( // Zero output if (n >= size_n) return; - if (blockIdx.z == 0) { - for (int m = 0; m < m_count; m++) - *((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0; - } - __syncthreads(); // Find initial group @@ -1226,9 +1206,6 @@ __global__ void gemm_half_q_half_alt_4bit_kernel( __halves2half2(__int2half_rn(val & 0xF), __int2half_rn(val >> 4)); } - if (blockIdx.z == 0) { - for (int m = 0; m < b_end; m++) mul[(b + m) * width + w] = __int2half_rn(0); - } __syncthreads(); int i = width * h + w; @@ -1321,9 +1298,6 @@ __global__ void gemm_half_q_half_alt_8bit_kernel( } } - if (blockIdx.z == 0) { - for (int m = 0; m < b_end; m++) mul[(b + m) * width + w] = __int2half_rn(0); - } __syncthreads(); int i = width * h + w; @@ -1860,7 +1834,7 @@ torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight, bool use_exllama, bool use_v2_format, int64_t bit) { const at::cuda::OptionalCUDAGuard device_guard(device_of(a)); auto options = torch::TensorOptions().dtype(a.dtype()).device(a.device()); - at::Tensor c = torch::empty({a.size(0), b_q_weight.size(1)}, options); + at::Tensor c = torch::zeros({a.size(0), b_q_weight.size(1)}, options); at::Tensor temp_dq = torch::empty( {b_q_weight.size(0) * 32 / bit, b_q_weight.size(1)}, options); diff --git a/csrc/quantization/gptq_marlin/.gitignore b/csrc/quantization/gptq_marlin/.gitignore index ba805f9250ecea707563336138067efe13b7b8b5..7dc482a8946605d91d192f43b2fffae518397f59 100644 --- a/csrc/quantization/gptq_marlin/.gitignore +++ b/csrc/quantization/gptq_marlin/.gitignore @@ -1,2 +1,3 @@ sm*_kernel_*.cu kernel_selector.h +kernel_*.cu diff --git a/csrc/quantization/gptq_marlin/dequant.h 
b/csrc/quantization/gptq_marlin/dequant.h index 26b8d40368aa959af3a78013c56e869c50f86f60..edd97dbfcd8e58b4d6dee3f695792816331510e5 100644 --- a/csrc/quantization/gptq_marlin/dequant.h +++ b/csrc/quantization/gptq_marlin/dequant.h @@ -67,7 +67,7 @@ where `scale_factor * multiplier` can be computed at weight loading. namespace MARLIN_NAMESPACE_NAME { -#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800 +#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 750 // Lookup-table based 3-input logical operation; explicitly used for // dequantization as the compiler does not seem to automatically recognize it in // all cases. diff --git a/csrc/quantization/gptq_marlin/generate_kernels.py b/csrc/quantization/gptq_marlin/generate_kernels.py index 27ef7271ba41cfbcdbaf9d5b44b5d25a9779a570..24866fc5cd5463e16cfc2460628d2327f6ff99a1 100644 --- a/csrc/quantization/gptq_marlin/generate_kernels.py +++ b/csrc/quantization/gptq_marlin/generate_kernels.py @@ -10,6 +10,8 @@ import jinja2 ARCHS = [] SUPPORT_FP8 = False +SUPPORT_SM75 = False +SUPPORT_SM80 = False for arch in sys.argv[1].split(","): arch = arch[: arch.index(".") + 2].replace(".", "") arch = int(arch) @@ -19,6 +21,10 @@ for arch in sys.argv[1].split(","): # with FP16 MMA, so it cannot achieve any acceleration. 
if arch in [89, 120]: SUPPORT_FP8 = True + if arch >= 80: + SUPPORT_SM80 = True + if arch == 75: + SUPPORT_SM75 = True FILE_HEAD_COMMENT = """ // auto generated by generate_kernels.py @@ -166,6 +172,7 @@ def remove_old_kernels(): def generate_new_kernels(): result_dict = {} + sm_75_result_dict = {} for quant_config in QUANT_CONFIGS: c_types = quant_config.get("c_type", ["kFloat16", "kBFloat16"]) @@ -184,6 +191,8 @@ def generate_new_kernels(): s_type = quant_config.get("s_type", c_type) if (a_type, b_type, c_type) not in result_dict: result_dict[(a_type, b_type, c_type)] = [] + if a_type in ["kFloat16", "kS8"] and c_type == "kFloat16": + sm_75_result_dict[(a_type, b_type, c_type)] = [] for group_blocks, m_blocks, thread_configs in itertools.product( all_group_blocks, all_m_blocks, all_thread_configs @@ -207,78 +216,89 @@ def generate_new_kernels(): "thread_k_blocks": thread_k // 16, "thread_n_blocks": thread_n // 16, "m_block_size_8": "true" if m_blocks == 0.5 else "false", - "stages": "pipe_stages", + "stages": 4, "group_blocks": group_blocks, "is_zp_float": "true" if is_zp_float else "false", } - result_dict[(a_type, b_type, c_type)].append(config) + if SUPPORT_SM80: + result_dict[(a_type, b_type, c_type)].append(config) + if (a_type, b_type, c_type) in sm_75_result_dict and SUPPORT_SM75: + config_sm75 = config.copy() + config_sm75["stages"] = 2 + sm_75_result_dict[(a_type, b_type, c_type)].append(config_sm75) kernel_selector_str = FILE_HEAD_COMMENT - for (a_type, b_type, c_type), config_list in result_dict.items(): - all_template_str_list = [] - for config in config_list: - s_type = config["s_type"] - template_str = jinja2.Template(TEMPLATE).render( - a_type_id=f"vllm::{a_type}.id()", - b_type_id=f"vllm::{b_type}.id()", - c_type_id=f"vllm::{c_type}.id()", - s_type_id=f"vllm::{s_type}.id()", - **config, - ) - all_template_str_list.append(template_str) - - conditions = [ - f"a_type == vllm::{a_type}", - f"b_type == vllm::{b_type}", - f"c_type == vllm::{c_type}", - 
f"s_type == vllm::{s_type}", - f"threads == {config['threads']}", - f"thread_m_blocks == {config['thread_m_blocks']}", - f"thread_n_blocks == {config['thread_n_blocks']}", - f"thread_k_blocks == {config['thread_k_blocks']}", - f"m_block_size_8 == {config['m_block_size_8']}", - f"group_blocks == {config['group_blocks']}", - f"is_zp_float == {config['is_zp_float']}", - ] - conditions = " && ".join(conditions) - - if kernel_selector_str == FILE_HEAD_COMMENT: - kernel_selector_str += f"if ({conditions})\n kernel = " - else: - kernel_selector_str += f"else if ({conditions})\n kernel = " - - kernel_template2 = ( - "Marlin<{{a_type_id}}, {{b_type_id}}, {{c_type_id}}, " - "{{s_type_id}}, {{threads}}, {{thread_m_blocks}}, " - "{{thread_n_blocks}}, {{thread_k_blocks}}, " - "{{m_block_size_8}}, {{stages}}, {{group_blocks}}, " - "{{is_zp_float}}>;" - ) - - kernel_selector_str += ( - jinja2.Template(kernel_template2).render( + for result_dict_tmp in [result_dict, sm_75_result_dict]: + for (a_type, b_type, c_type), config_list in result_dict_tmp.items(): + all_template_str_list = [] + if not config_list: + continue + for config in config_list: + s_type = config["s_type"] + template_str = jinja2.Template(TEMPLATE).render( a_type_id=f"vllm::{a_type}.id()", b_type_id=f"vllm::{b_type}.id()", c_type_id=f"vllm::{c_type}.id()", s_type_id=f"vllm::{s_type}.id()", **config, ) - + "\n" - ) + all_template_str_list.append(template_str) + + conditions = [ + f"a_type == vllm::{a_type}", + f"b_type == vllm::{b_type}", + f"c_type == vllm::{c_type}", + f"s_type == vllm::{s_type}", + f"threads == {config['threads']}", + f"thread_m_blocks == {config['thread_m_blocks']}", + f"thread_n_blocks == {config['thread_n_blocks']}", + f"thread_k_blocks == {config['thread_k_blocks']}", + f"m_block_size_8 == {config['m_block_size_8']}", + f"stages == {config['stages']}", + f"group_blocks == {config['group_blocks']}", + f"is_zp_float == {config['is_zp_float']}", + ] + conditions = " && ".join(conditions) + + if 
kernel_selector_str == FILE_HEAD_COMMENT: + kernel_selector_str += f"if ({conditions})\n kernel = " + else: + kernel_selector_str += f"else if ({conditions})\n kernel = " + + kernel_template2 = ( + "Marlin<{{a_type_id}}, {{b_type_id}}, {{c_type_id}}, " + "{{s_type_id}}, {{threads}}, {{thread_m_blocks}}, " + "{{thread_n_blocks}}, {{thread_k_blocks}}, " + "{{m_block_size_8}}, {{stages}}, {{group_blocks}}, " + "{{is_zp_float}}>;" + ) - file_content = FILE_HEAD + "\n\n" - file_content += "\n\n".join(all_template_str_list) + "\n\n}\n" - if a_type == "kFE4M3fn": - filename = f"sm89_kernel_{a_type[1:]}_{b_type[1:]}_{c_type[1:]}.cu" - else: - filename = f"sm80_kernel_{a_type[1:]}_{b_type[1:]}_{c_type[1:]}.cu" + kernel_selector_str += ( + jinja2.Template(kernel_template2).render( + a_type_id=f"vllm::{a_type}.id()", + b_type_id=f"vllm::{b_type}.id()", + c_type_id=f"vllm::{c_type}.id()", + s_type_id=f"vllm::{s_type}.id()", + **config, + ) + + "\n" + ) + + file_content = FILE_HEAD + "\n\n" + file_content += "\n\n".join(all_template_str_list) + "\n\n}\n" + if a_type == "kFE4M3fn": + filename = f"sm89_kernel_{a_type[1:]}_{b_type[1:]}_{c_type[1:]}.cu" + elif result_dict_tmp is sm_75_result_dict: + filename = f"sm75_kernel_{a_type[1:]}_{b_type[1:]}_{c_type[1:]}.cu" + else: + filename = f"sm80_kernel_{a_type[1:]}_{b_type[1:]}_{c_type[1:]}.cu" - filename = filename.lower() + filename = filename.lower() - with open(os.path.join(os.path.dirname(__file__), filename), "w") as f: - f.write(file_content) + with open(os.path.join(os.path.dirname(__file__), filename), "w") as f: + f.write(file_content) if not SUPPORT_FP8 and kernel_selector_str != FILE_HEAD_COMMENT: kernel_selector_str += ( diff --git a/csrc/quantization/gptq_marlin/gptq_marlin.cu b/csrc/quantization/gptq_marlin/gptq_marlin.cu index 28ff06559a98a254c325a14ccbd6b550fb6134f5..77f319d53bc52ec0f8bc84327db06147494b6e37 100644 --- a/csrc/quantization/gptq_marlin/gptq_marlin.cu +++ b/csrc/quantization/gptq_marlin/gptq_marlin.cu @@ 
-37,7 +37,7 @@ __global__ void MarlinDefault(MARLIN_KERNEL_PARAMS){}; using MarlinFuncPtr = void (*)(MARLIN_KERNEL_PARAMS); -#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800 +#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 750 __global__ void permute_cols_kernel(int4 const* __restrict__ a_int4_ptr, int const* __restrict__ perm_int_ptr, @@ -148,7 +148,7 @@ typedef struct { int get_scales_cache_size(thread_config_t const& th_config, int prob_m, int prob_n, int prob_k, int num_bits, int group_size, - bool has_act_order, bool is_k_full) { + bool has_act_order, bool is_k_full, int stages) { bool cache_scales_chunk = has_act_order && !is_k_full; int tb_n = th_config.thread_n; @@ -166,28 +166,29 @@ int get_scales_cache_size(thread_config_t const& th_config, int prob_m, if (cache_scales_chunk) { int load_groups = - tb_groups * pipe_stages * 2; // Chunk size is 2x pipeline over dim K + tb_groups * stages * 2; // Chunk size is 2x pipeline over dim K load_groups = max(load_groups, 32); // We load at least 32 scale groups return load_groups * tb_n * 2; } else { int tb_scales = tb_groups * tb_n * 2; - return tb_scales * pipe_stages; + return tb_scales * stages; } } int get_kernel_cache_size(thread_config_t const& th_config, int thread_m_blocks, int prob_m, int prob_n, int prob_k, int num_bits, int group_size, bool has_act_order, bool is_k_full, - int has_zp, int is_zp_float) { + int has_zp, bool is_zp_float, bool is_a_8bit, + int stages) { int pack_factor = 32 / num_bits; // Get B size int tb_k = th_config.thread_k; int tb_n = th_config.thread_n; int tb_m = thread_m_blocks * 16; - int sh_a_size = pipe_stages * (tb_m * tb_k) * 2; - int sh_b_size = pipe_stages * (tb_k * tb_n / pack_factor) * 4; + int sh_a_size = stages * (tb_m * tb_k) * (is_a_8bit ? 
1 : 2); + int sh_b_size = stages * (tb_k * tb_n / pack_factor) * 4; int sh_red_size = tb_m * (tb_n + 8) * 2; int sh_bias_size = tb_n * 2; int tmp_size = @@ -196,8 +197,8 @@ int get_kernel_cache_size(thread_config_t const& th_config, int thread_m_blocks, int sh_s_size = get_scales_cache_size(th_config, prob_m, prob_n, prob_k, num_bits, - group_size, has_act_order, is_k_full); - int sh_g_idx_size = has_act_order && !is_k_full ? pipe_stages * tb_k / 4 : 0; + group_size, has_act_order, is_k_full, stages); + int sh_g_idx_size = has_act_order && !is_k_full ? stages * tb_k / 4 : 0; int sh_zp_size = 0; if (has_zp) { if (is_zp_float) @@ -217,7 +218,8 @@ int get_kernel_cache_size(thread_config_t const& th_config, int thread_m_blocks, bool is_valid_config(thread_config_t const& th_config, int thread_m_blocks, int prob_m, int prob_n, int prob_k, int num_bits, int group_size, bool has_act_order, bool is_k_full, - int has_zp, int is_zp_float, int max_shared_mem) { + int has_zp, bool is_zp_float, bool is_a_8bit, int stages, + int max_shared_mem) { // Sanity if (th_config.thread_k == -1 || th_config.thread_n == -1 || th_config.num_threads == -1) { @@ -242,7 +244,7 @@ bool is_valid_config(thread_config_t const& th_config, int thread_m_blocks, // Check that pipeline fits into cache int cache_size = get_kernel_cache_size( th_config, thread_m_blocks, prob_m, prob_n, prob_k, num_bits, group_size, - has_act_order, is_k_full, has_zp, is_zp_float); + has_act_order, is_k_full, has_zp, is_zp_float, is_a_8bit, stages); return cache_size <= max_shared_mem; } @@ -251,7 +253,7 @@ MarlinFuncPtr get_marlin_kernel( const vllm::ScalarType c_type, const vllm::ScalarType s_type, int thread_m_blocks, int thread_n_blocks, int thread_k_blocks, bool m_block_size_8, bool has_act_order, bool has_zp, int group_blocks, - int threads, bool is_zp_float) { + int threads, bool is_zp_float, int stages) { int num_bits = b_type.size_bits(); auto kernel = MarlinDefault; @@ -265,7 +267,8 @@ exec_config_t 
determine_exec_config( const vllm::ScalarType& c_type, const vllm::ScalarType& s_type, int prob_m, int prob_n, int prob_k, int thread_m_blocks, bool m_block_size_8, int num_bits, int group_size, bool has_act_order, bool is_k_full, - bool has_zp, bool is_zp_float, int max_shared_mem, int sms) { + bool has_zp, bool is_zp_float, int is_a_8bit, int stages, + int max_shared_mem, int sms) { exec_config_t exec_cfg = exec_config_t{1, thread_config_t{-1, -1, -1}}; thread_config_t* thread_configs = thread_m_blocks > 1 ? large_batch_thread_configs @@ -280,13 +283,15 @@ exec_config_t determine_exec_config( if (!is_valid_config(th_config, thread_m_blocks, prob_m, prob_n, prob_k, num_bits, group_size, has_act_order, is_k_full, has_zp, - is_zp_float, max_shared_mem - 512)) { + is_zp_float, is_a_8bit, stages, + max_shared_mem - 512)) { continue; } - int cache_size = get_kernel_cache_size( - th_config, thread_m_blocks, prob_m, prob_n, prob_k, num_bits, - group_size, has_act_order, is_k_full, has_zp, is_zp_float); + int cache_size = get_kernel_cache_size(th_config, thread_m_blocks, prob_m, + prob_n, prob_k, num_bits, group_size, + has_act_order, is_k_full, has_zp, + is_zp_float, is_a_8bit, stages); int group_blocks = 0; if (!has_act_order) { @@ -297,14 +302,10 @@ exec_config_t determine_exec_config( get_marlin_kernel(a_type, b_type, c_type, s_type, thread_m_blocks, th_config.thread_n / 16, th_config.thread_k / 16, m_block_size_8, has_act_order, has_zp, group_blocks, - th_config.num_threads, is_zp_float); + th_config.num_threads, is_zp_float, stages); if (kernel == MarlinDefault) continue; - // int m_tiles = div_ceil(prob_m, thread_m_blocks * 16); - // int n_tiles = prob_n / th_config.thread_n; - // int k_tiles = prob_k / th_config.thread_k; - return {1, th_config}; } @@ -321,6 +322,7 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, int group_size, int dev, cudaStream_t stream, int thread_k_init, int thread_n_init, int sms, bool use_atomic_add, bool 
use_fp32_reduce, bool is_zp_float) { + bool is_a_8bit = a_type.size_bits() == 8; TORCH_CHECK(prob_m > 0 && prob_n > 0 && prob_k > 0, "Invalid MNK = [", prob_m, ", ", prob_n, ", ", prob_k, "]"); @@ -389,8 +391,14 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, dev); cudaDeviceGetAttribute(&minor_capability, cudaDevAttrComputeCapabilityMinor, dev); - TORCH_CHECK(major_capability * 10 + minor_capability >= 80, - "marlin kernel only support Ampere or newer GPUs."); + TORCH_CHECK(major_capability * 10 + minor_capability >= 75, + "marlin kernel only support Turing or newer GPUs."); + int stages = 4; + if (major_capability == 7 && minor_capability == 5) { + stages = 2; + TORCH_CHECK(a_type == vllm::kFloat16 || a_type == vllm::kS8, + "Turing only support FP16 or INT8 activation."); + } if (a_type == vllm::kFE4M3fn) { TORCH_CHECK( major_capability * 10 + minor_capability == 89 || @@ -431,7 +439,8 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, exec_cfg = determine_exec_config( a_type, b_type, c_type, s_type, prob_m_split, prob_n, prob_k, thread_m_blocks, m_block_size_8, num_bits, group_size, has_act_order, - is_k_full, has_zp, is_zp_float, max_shared_mem, sms); + is_k_full, has_zp, is_zp_float, is_a_8bit, stages, max_shared_mem, + sms); thread_tfg = exec_cfg.tb_cfg; if (thread_tfg.thread_n != -1) { if (prob_n / thread_tfg.thread_n * @@ -440,7 +449,7 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, if (is_valid_config({128, 64, 128}, thread_m_blocks, prob_m_split, prob_n, prob_k, num_bits, group_size, has_act_order, is_k_full, has_zp, is_zp_float, - max_shared_mem_new)) { + is_a_8bit, stages, max_shared_mem_new)) { thread_tfg = {128, 64, 128}; exec_cfg = {1, thread_tfg}; } @@ -466,7 +475,8 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, TORCH_CHECK( is_valid_config(thread_tfg, thread_m_blocks, prob_m_split, prob_n, prob_k, num_bits, 
group_size, has_act_order, is_k_full, - has_zp, is_zp_float, max_shared_mem_new), + has_zp, is_zp_float, is_a_8bit, stages, + max_shared_mem_new), "Invalid thread config: thread_m_blocks = ", thread_m_blocks, ", thread_k = ", thread_tfg.thread_k, ", thread_n = ", thread_tfg.thread_n, @@ -475,12 +485,12 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias, ", prob_m_split = ", prob_m_split, ", group_size = ", group_size, ", has_act_order = ", has_act_order, ", is_k_full = ", is_k_full, ", has_zp = ", has_zp, ", is_zp_float = ", is_zp_float, - ", max_shared_mem_new = ", max_shared_mem_new); + ", stages = ", stages, ", max_shared_mem_new = ", max_shared_mem_new); auto kernel = get_marlin_kernel( a_type, b_type, c_type, s_type, thread_m_blocks, thread_n_blocks, thread_k_blocks, m_block_size_8, has_act_order, has_zp, group_blocks, - num_threads, is_zp_float); + num_threads, is_zp_float, stages); if (kernel == MarlinDefault) { TORCH_CHECK(false, "Unsupported shapes: MNK = [", prob_m, ", ", prob_n, diff --git a/csrc/quantization/gptq_marlin/marlin.cuh b/csrc/quantization/gptq_marlin/marlin.cuh index 2505e221322dde60ad749d37dd070ac25250e648..33fe52f605b4238205996352cceba2455022629a 100644 --- a/csrc/quantization/gptq_marlin/marlin.cuh +++ b/csrc/quantization/gptq_marlin/marlin.cuh @@ -1,17 +1,19 @@ #pragma once -#include +#ifndef _marlin_cuh + #define _marlin_cuh + #include -#include -#include -#include -#include -#include -#include + #include + #include + #include + #include + #include + #include -#ifndef MARLIN_NAMESPACE_NAME - #define MARLIN_NAMESPACE_NAME marlin -#endif + #ifndef MARLIN_NAMESPACE_NAME + #define MARLIN_NAMESPACE_NAME marlin + #endif namespace MARLIN_NAMESPACE_NAME { @@ -51,9 +53,51 @@ using I4 = Vec; constexpr int div_ceil(int a, int b) { return (a + b - 1) / b; } -#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800 -// No support for async -#else + #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800 + +__device__ inline void 
cp_async1_ca_pred(void* smem_ptr, const void* glob_ptr, + bool pred = true) { + if (pred) { + reinterpret_cast(smem_ptr)[0] = + reinterpret_cast(glob_ptr)[0]; + } +} + +__device__ inline void cp_async2_ca_pred(void* smem_ptr, const void* glob_ptr, + bool pred = true) { + if (pred) { + reinterpret_cast(smem_ptr)[0] = + reinterpret_cast(glob_ptr)[0]; + } +} + +__device__ inline void cp_async4_ca_pred(void* smem_ptr, const void* glob_ptr, + bool pred = true) { + if (pred) { + reinterpret_cast(smem_ptr)[0] = + reinterpret_cast(glob_ptr)[0]; + } +} + +__device__ inline void cp_async4_pred(void* smem_ptr, const void* glob_ptr, + bool pred = true) { + if (pred) { + reinterpret_cast(smem_ptr)[0] = + reinterpret_cast(glob_ptr)[0]; + } +} + +__device__ inline void cp_async4(void* smem_ptr, const void* glob_ptr) { + reinterpret_cast(smem_ptr)[0] = + reinterpret_cast(glob_ptr)[0]; +} + +__device__ inline void cp_async_fence() {} + +template +__device__ inline void cp_async_wait() {} + + #else __device__ inline void cp_async1_ca_pred(void* smem_ptr, const void* glob_ptr, bool pred = true) { @@ -126,6 +170,8 @@ __device__ inline void cp_async_wait() { asm volatile("cp.async.wait_group %0;\n" ::"n"(n)); } -#endif + #endif } // namespace MARLIN_NAMESPACE_NAME + +#endif \ No newline at end of file diff --git a/csrc/quantization/gptq_marlin/marlin_mma.h b/csrc/quantization/gptq_marlin/marlin_mma.h new file mode 100644 index 0000000000000000000000000000000000000000..6ec2aaafc4392c3bd70827b37188efced0763eec --- /dev/null +++ b/csrc/quantization/gptq_marlin/marlin_mma.h @@ -0,0 +1,269 @@ + +#include "marlin_dtypes.cuh" + +namespace MARLIN_NAMESPACE_NAME { + +// m16n8k16 tensor core mma instruction with fp16 inputs and fp32 +// output/accumulation. 
+template +__device__ inline void mma( + const typename MarlinScalarType::FragA& a_frag, + const typename MarlinScalarType::FragB& frag_b, + typename MarlinScalarType::FragC& frag_c, int idx = 0) { + const uint32_t* a = reinterpret_cast(&a_frag); + const uint32_t* b = reinterpret_cast(&frag_b); + using scalar_t = typename MarlinScalarType::scalar_t; + if constexpr (!std::is_same::value || k_size != 16) { + static_assert(!use_fp16_accum); + } + + if constexpr (k_size == 16) { + if constexpr (std::is_same::value && !use_fp16_accum) { +#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + float* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 " + "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(a[0]), "r"(a[1]), "r"(b[0]), "f"(c[0]), "f"(c[1]), "f"(c[2]), + "f"(c[3])); + asm volatile( + "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 " + "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(a[2]), "r"(a[3]), "r"(b[1]), "f"(c[0]), "f"(c[1]), "f"(c[2]), + "f"(c[3])); +#else + float* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), + "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); +#endif + } else if constexpr (std::is_same::value && + use_fp16_accum) { +#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + uint32_t* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 " + "{%0,%1}, {%2,%3}, {%4}, {%5,%6};\n" + : "=r"(c[0]), "=r"(c[1]) + : "r"(a[0]), "r"(a[1]), "r"(b[0]), "r"(c[0]), "r"(c[1])); + asm volatile( + "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 " + "{%0,%1}, {%2,%3}, {%4}, {%5,%6};\n" + : "=r"(c[0]), "=r"(c[1]) + : 
"r"(a[2]), "r"(a[3]), "r"(b[1]), "r"(c[0]), "r"(c[1])); +#else + uint32_t* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 " + "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n" + : "=r"(c[0]), "=r"(c[1]) + : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), + "r"(c[0]), "r"(c[1])); +#endif + } else if constexpr (std::is_same::value) { + float* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), + "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); + } else if constexpr (std::is_same::value) { + float* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k16.row.col.f32.e4m3.e4m3.f32 " + "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(a[idx * 2]), "r"(a[idx * 2 + 1]), "r"(b[idx]), "f"(c[0]), + "f"(c[1]), "f"(c[2]), "f"(c[3])); + } else if constexpr (std::is_same::value) { + int32_t* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k16.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" + : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) + : "r"(a[idx * 2]), "r"(a[idx * 2 + 1]), "r"(b[idx]), "r"(c[0]), + "r"(c[1]), "r"(c[2]), "r"(c[3])); + } + } else if (k_size == 32) { + if constexpr (std::is_same::value) { + float* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), + "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); + } else if constexpr (std::is_same::value) { + int32_t* c = reinterpret_cast(&frag_c); +#if 
defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + asm volatile( + "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1}, {%2}, {%3}, {%4,%5};\n" + : "=r"(c[0]), "=r"(c[1]) + : "r"(a[0]), "r"(b[0]), "r"(c[0]), "r"(c[1])); + asm volatile( + "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1}, {%2}, {%3}, {%4,%5};\n" + : "=r"(c[2]), "=r"(c[3]) + : "r"(a[1]), "r"(b[0]), "r"(c[2]), "r"(c[3])); + asm volatile( + "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1}, {%2}, {%3}, {%4,%5};\n" + : "=r"(c[0]), "=r"(c[1]) + : "r"(a[2]), "r"(b[1]), "r"(c[0]), "r"(c[1])); + asm volatile( + "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1}, {%2}, {%3}, {%4,%5};\n" + : "=r"(c[2]), "=r"(c[3]) + : "r"(a[3]), "r"(b[1]), "r"(c[2]), "r"(c[3])); +#else + asm volatile( + "mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) + : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), + "r"(c[0]), "r"(c[1]), "r"(c[2]), "r"(c[3])); +#endif + } + } +} + +template +__device__ inline void mma_trans( + const typename MarlinScalarType::FragA& a_frag, + const typename MarlinScalarType::FragB& frag_b, + const typename MarlinScalarType::FragB& frag_b2, + typename MarlinScalarType::FragC& frag_c) { + const uint32_t* a = reinterpret_cast(&a_frag); + const uint32_t* b = reinterpret_cast(&frag_b); + const uint32_t* b2 = reinterpret_cast(&frag_b2); + float* c = reinterpret_cast(&frag_c); + using scalar_t = typename MarlinScalarType::scalar_t; + if constexpr (!std::is_same::value || k_size != 16) { + static_assert(!use_fp16_accum); + } + + if constexpr (k_size == 16) { + if constexpr (std::is_same::value && !use_fp16_accum) { +#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + float* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 " + "{%0,%1,%2,%3}, 
{%4,%5}, {%6}, {%7,%8,%9,%10};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(b[0]), "r"(b2[0]), "r"(a[0]), "f"(c[0]), "f"(c[1]), "f"(c[2]), + "f"(c[3])); + asm volatile( + "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 " + "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(b[1]), "r"(b2[1]), "r"(a[1]), "f"(c[0]), "f"(c[1]), "f"(c[2]), + "f"(c[3])); +#else + float* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), + "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); +#endif + } else if constexpr (std::is_same::value && + use_fp16_accum) { +#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + uint32_t* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 " + "{%0,%1}, {%2,%3}, {%4}, {%5,%6};\n" + : "=r"(c[0]), "=r"(c[1]) + : "r"(b[0]), "r"(b2[0]), "r"(a[0]), "r"(c[0]), "r"(c[1])); + asm volatile( + "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 " + "{%0,%1}, {%2,%3}, {%4}, {%5,%6};\n" + : "=r"(c[0]), "=r"(c[1]) + : "r"(b[1]), "r"(b2[1]), "r"(a[1]), "r"(c[0]), "r"(c[1])); +#else + uint32_t* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 " + "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n" + : "=r"(c[0]), "=r"(c[1]) + : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), + "r"(c[0]), "r"(c[1])); +#endif + } else if constexpr (std::is_same::value) { + float* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), + 
"f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); + } else if constexpr (std::is_same::value) { + float* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k16.row.col.f32.e4m3.e4m3.f32 " + "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(b[0]), "r"(b2[0]), "r"(a[0]), "f"(c[0]), "f"(c[1]), "f"(c[2]), + "f"(c[3])); + } else if constexpr (std::is_same::value) { + int32_t* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k16.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" + : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) + : "r"(b[0]), "r"(b2[0]), "r"(a[0]), "r"(c[0]), "r"(c[1]), "r"(c[2]), + "r"(c[3])); + } + } else { + if constexpr (std::is_same::value) { + float* c = reinterpret_cast(&frag_c); + asm volatile( + "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) + : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), + "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); + } else if constexpr (std::is_same::value) { + int32_t* c = reinterpret_cast(&frag_c); +#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + asm volatile( + "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1}, {%2}, {%3}, {%4,%5};\n" + : "=r"(c[0]), "=r"(c[1]) + : "r"(b[0]), "r"(a[0]), "r"(c[0]), "r"(c[1])); + asm volatile( + "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1}, {%2}, {%3}, {%4,%5};\n" + : "=r"(c[2]), "=r"(c[3]) + : "r"(b2[1]), "r"(a[0]), "r"(c[2]), "r"(c[3])); + asm volatile( + "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1}, {%2}, {%3}, {%4,%5};\n" + : "=r"(c[0]), "=r"(c[1]) + : "r"(b[0]), "r"(a[1]), "r"(c[0]), "r"(c[1])); + asm volatile( + "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1}, {%2}, {%3}, {%4,%5};\n" + : "=r"(c[2]), "=r"(c[3]) 
+ : "r"(b2[1]), "r"(a[1]), "r"(c[2]), "r"(c[3])); +#else + asm volatile( + "mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32.satfinite " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) + : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), + "r"(c[0]), "r"(c[1]), "r"(c[2]), "r"(c[3])); +#endif + } + } +} + +} // namespace MARLIN_NAMESPACE_NAME \ No newline at end of file diff --git a/csrc/quantization/gptq_marlin/marlin_template.h b/csrc/quantization/gptq_marlin/marlin_template.h index 22bb71e482ce8775920fc4e864002b5e0a74729f..c7b53696c12237967effe3db72ecbbd1375cf4d6 100644 --- a/csrc/quantization/gptq_marlin/marlin_template.h +++ b/csrc/quantization/gptq_marlin/marlin_template.h @@ -26,6 +26,7 @@ #include "marlin.cuh" #include "marlin_dtypes.cuh" #include "dequant.h" +#include "marlin_mma.h" #include "core/scalar_type.hpp" #define STATIC_ASSERT_SCALAR_TYPE_VALID(scalar_t) \ @@ -35,7 +36,7 @@ namespace MARLIN_NAMESPACE_NAME { -#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800 +#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 750 template -__device__ inline void mma( - const typename MarlinScalarType::FragA& a_frag, - const typename MarlinScalarType::FragB& frag_b, - typename MarlinScalarType::FragC& frag_c, int idx = 0) { - const uint32_t* a = reinterpret_cast(&a_frag); - const uint32_t* b = reinterpret_cast(&frag_b); - using scalar_t = typename MarlinScalarType::scalar_t; - if constexpr (k_size == 16) { - if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - 
"mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.f32.e4m3.e4m3.f32 " - "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(a[idx * 2]), "r"(a[idx * 2 + 1]), "r"(b[idx]), "f"(c[0]), - "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - int32_t* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.s32.s8.s8.s32.satfinite " - "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" - : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) - : "r"(a[idx * 2]), "r"(a[idx * 2 + 1]), "r"(b[idx]), "r"(c[0]), - "r"(c[1]), "r"(c[2]), "r"(c[3])); - } - } else if (k_size == 32) { - if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - int32_t* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32.satfinite " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) - : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]), - "r"(c[0]), "r"(c[1]), "r"(c[2]), "r"(c[3])); - } - } -} - -template -__device__ inline void mma_trans( - const typename MarlinScalarType::FragA& a_frag, - const typename MarlinScalarType::FragB& frag_b, - 
const typename MarlinScalarType::FragB& frag_b2, - typename MarlinScalarType::FragC& frag_c) { - const uint32_t* a = reinterpret_cast(&a_frag); - const uint32_t* b = reinterpret_cast(&frag_b); - const uint32_t* b2 = reinterpret_cast(&frag_b2); - float* c = reinterpret_cast(&frag_c); - using scalar_t = typename MarlinScalarType::scalar_t; - if constexpr (k_size == 16) { - if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.f32.e4m3.e4m3.f32 " - "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(a[0]), "f"(c[0]), "f"(c[1]), "f"(c[2]), - "f"(c[3])); - } else if constexpr (std::is_same::value) { - int32_t* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k16.row.col.s32.s8.s8.s32.satfinite " - "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n" - : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(a[0]), "r"(c[0]), "r"(c[1]), "r"(c[2]), - "r"(c[3])); - } - } else { - if constexpr (std::is_same::value) { - float* c = reinterpret_cast(&frag_c); - asm volatile( - 
"mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), - "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])); - } else if constexpr (std::is_same::value) { - int32_t* c = reinterpret_cast(&frag_c); - asm volatile( - "mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32.satfinite " - "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" - : "=r"(c[0]), "=r"(c[1]), "=r"(c[2]), "=r"(c[3]) - : "r"(b[0]), "r"(b2[0]), "r"(b[1]), "r"(b2[1]), "r"(a[0]), "r"(a[1]), - "r"(c[0]), "r"(c[1]), "r"(c[2]), "r"(c[3])); - } - } -} - // Instruction for loading a full 16x16 matrix fragment of operand A from shared // memory, directly in tensor core layout. template @@ -415,6 +285,17 @@ __global__ void Marlin( if constexpr (a_type_id == vllm::kFE4M3fn.id()) return; #endif + #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + // Turing TensorCore only supports fp16 and int8 + if constexpr (a_type_id != vllm::kFloat16.id() && a_type_id != vllm::kS8.id()) + return; + #endif + + #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 750 + constexpr bool use_fp16_accum = a_type_id == vllm::kFloat16.id(); + #else + constexpr bool use_fp16_accum = false; + #endif using Adtype = MarlinScalarType; using Cdtype = MarlinScalarType; const int4* A = A0; @@ -873,10 +754,6 @@ __global__ void Marlin( constexpr int sh_s_size = has_act_order ? (act_s_max_num_groups * s_sh_stride) : (stages * s_sh_stage); int4* sh_s = sh_zp + (stages * zp_sh_stage); - // shared memory reused by reduction should be smaller than - // shared memory used by weight. - static_assert(thread_m_blocks * 16 * thread_n_blocks * 16 / 8 <= - stages * b_sh_stage); int4* sh_a = sh_s + sh_s_size; // Register storage for double buffer of shared memory reads. 
@@ -1395,11 +1272,13 @@ __global__ void Marlin( #pragma unroll for (int i = 0; i < thread_m_blocks; i++) { if constexpr (m_block_size_8) { - mma_trans(frag_a[k2][i], frag_b0, frag_b1, - frag_c[i][j][0]); + mma_trans(frag_a[k2][i], frag_b0, frag_b1, + frag_c[i][j][0]); } else { - mma(frag_a[k2][i], frag_b0, frag_c[i][j][0]); - mma(frag_a[k2][i], frag_b1, frag_c[i][j][1]); + mma(frag_a[k2][i], frag_b0, + frag_c[i][j][0]); + mma(frag_a[k2][i], frag_b1, + frag_c[i][j][1]); } } } @@ -1433,10 +1312,12 @@ __global__ void Marlin( #pragma unroll for (int i = 0; i < thread_m_blocks; i++) { - mma(frag_a[k2][i], frag_b[0], - (group_blocks == -1 ? frag_c : frag_c_tmp)[i][j][0]); - mma(frag_a[k2][i], frag_b[1], - (group_blocks == -1 ? frag_c : frag_c_tmp)[i][j][1]); + mma( + frag_a[k2][i], frag_b[0], + (group_blocks == -1 ? frag_c : frag_c_tmp)[i][j][0]); + mma( + frag_a[k2][i], frag_b[1], + (group_blocks == -1 ? frag_c : frag_c_tmp)[i][j][1]); } if constexpr (group_blocks != -1) { @@ -1956,6 +1837,21 @@ __global__ void Marlin( // While this pattern may not be the most readable, other ways of writing // the loop seemed to noticeably worse performance after compilation. if (slice_iters == 0) { + // convert fp16 accum to fp32 for reduction + if constexpr (use_fp16_accum) { + #pragma unroll + for (int i = 0; i < (thread_m_blocks * (is_a_8bit ? 
2 : 4) * 2); i++) { + float* frag_c_part_float = reinterpret_cast(frag_c) + i * 4; + scalar_t* frag_c_part_half = + reinterpret_cast(frag_c_part_float); + + #pragma unroll + for (int i = 3; i >= 0; i--) { + frag_c_part_float[i] = Cdtype::num2float(frag_c_part_half[i]); + } + } + } + if constexpr (is_a_8bit) { float frag_a_s[2 * thread_m_blocks]; diff --git a/csrc/quantization/w8a8/cutlass/moe/blockwise_scaled_group_mm_sm100.cu b/csrc/quantization/w8a8/cutlass/moe/blockwise_scaled_group_mm_sm100.cu deleted file mode 100644 index 6c8f6309ef43f1f10e00384233436616181342d8..0000000000000000000000000000000000000000 --- a/csrc/quantization/w8a8/cutlass/moe/blockwise_scaled_group_mm_sm100.cu +++ /dev/null @@ -1,373 +0,0 @@ -#include "core/registration.h" - -#include -#include - -#include -#include -#include - -#include "cute/tensor.hpp" -#include "cutlass/tensor_ref.h" -#include "cutlass/epilogue/collective/default_epilogue.hpp" -#include "cutlass/epilogue/thread/linear_combination.h" -#include "cutlass/gemm/dispatch_policy.hpp" -#include "cutlass/gemm/group_array_problem_shape.hpp" -#include "cutlass/gemm/collective/collective_builder.hpp" -#include "cutlass/epilogue/collective/collective_builder.hpp" -#include "cutlass/gemm/device/gemm_universal_adapter.h" -#include "cutlass/gemm/kernel/gemm_universal.hpp" - -#include "cutlass/util/command_line.h" -#include "cutlass/util/distribution.h" -#include "cutlass/util/host_tensor.h" -#include "cutlass/util/packed_stride.hpp" -#include "cutlass/util/tensor_view_io.h" -#include "cutlass/util/reference/device/gemm.h" -#include "cutlass/util/reference/device/tensor_compare.h" -#include "cutlass/util/reference/host/tensor_fill.h" -#include "cutlass/util/reference/host/gett.hpp" -#include "cutlass/util/reference/host/tensor_norm.h" -#include "cutlass/util/reference/host/tensor_compare.h" -#include - -using namespace cute; - -template -__global__ void get_ggemm_starts( - int32_t* expert_offsets, ElementAB** a_offsets, ElementAB** 
b_offsets, - ElementC** out_offsets, ElementAccumulator** a_scale_offsets, - ElementAccumulator** b_scale_offsets, ElementAB* a_base_as_int, - ElementAB* b_base_as_int, ElementC* out_base_as_int, - ElementAccumulator* a_scale_base_as_int, - ElementAccumulator* b_scale_base_as_int, LayoutSFA* layout_sfa_base_as_int, - LayoutSFB* layout_sfb_base_as_int, int* problem_sizes) { - int expert_id = threadIdx.x; - - if (expert_id >= gridDim.x * blockDim.x) { - return; - } - - int m = problem_sizes[expert_id * 3]; - int n = problem_sizes[expert_id * 3 + 1]; - int k = problem_sizes[expert_id * 3 + 2]; - - int32_t expert_offset = expert_offsets[expert_id]; - int a_stride = expert_offset * k; - int b_stride = expert_id * k * n; - int a_scale_stride = expert_offset * k / 128; - int b_scale_stride = expert_id * k * n / 128 / 128; - - a_offsets[expert_id] = a_base_as_int + a_stride; - b_offsets[expert_id] = b_base_as_int + b_stride; - out_offsets[expert_id] = out_base_as_int + expert_offset * n; - a_scale_offsets[expert_id] = a_scale_base_as_int + a_scale_stride; - b_scale_offsets[expert_id] = b_scale_base_as_int + b_scale_stride; - - LayoutSFA* layout_sfa_ptr = layout_sfa_base_as_int + expert_id; - LayoutSFB* layout_sfb_ptr = layout_sfb_base_as_int + expert_id; - - *layout_sfa_ptr = - ScaleConfig::tile_atom_to_shape_SFA(cute::make_shape(m, n, k, 1)); - *layout_sfb_ptr = - ScaleConfig::tile_atom_to_shape_SFB(cute::make_shape(m, n, k, 1)); -} - -#define __CALL_GET_STARTS_KERNEL(TENSOR_C_TYPE, C_TYPE, LayoutSFA, LayoutSFB, \ - ScaleConfig) \ - else if (out_tensors.dtype() == TENSOR_C_TYPE) { \ - get_ggemm_starts<<<1, num_experts, 0, stream>>>( \ - static_cast(expert_offsets.data_ptr()), \ - static_cast(a_ptrs.data_ptr()), \ - static_cast(b_ptrs.data_ptr()), \ - static_cast(out_ptrs.data_ptr()), \ - static_cast(a_scales_ptrs.data_ptr()), \ - static_cast(b_scales_ptrs.data_ptr()), \ - static_cast(a_tensors.data_ptr()), \ - static_cast(b_tensors.data_ptr()), \ - 
static_cast(out_tensors.data_ptr()), \ - static_cast(a_scales.data_ptr()), \ - static_cast(b_scales.data_ptr()), \ - reinterpret_cast(layout_sfa.data_ptr()), \ - reinterpret_cast(layout_sfb.data_ptr()), \ - static_cast(problem_sizes.data_ptr())); \ - } - -template -void run_get_ggemm_starts( - torch::Tensor const& expert_offsets, torch::Tensor& a_ptrs, - torch::Tensor& b_ptrs, torch::Tensor& out_ptrs, - torch::Tensor& a_scales_ptrs, torch::Tensor& b_scales_ptrs, - torch::Tensor const& a_tensors, torch::Tensor const& b_tensors, - torch::Tensor out_tensors, torch::Tensor const& a_scales, - torch::Tensor const& b_scales, torch::Tensor const& layout_sfa, - torch::Tensor const& layout_sfb, torch::Tensor const& problem_sizes) { - TORCH_CHECK(a_tensors.dtype() == torch::kFloat8_e4m3fn); - TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn); - TORCH_CHECK(a_scales.dtype() == torch::kFloat32); - TORCH_CHECK(b_scales.dtype() == torch::kFloat32); - TORCH_CHECK(out_tensors.size(1) % 128 == 0 or out_tensors.size(0) % 128 == 0); - TORCH_CHECK(a_tensors.size(1) % 128 == 0 or a_tensors.size(0) % 128 == 0); - - int num_experts = (int)expert_offsets.size(0); - auto stream = at::cuda::getCurrentCUDAStream(a_tensors.device().index()); - - if (false) { - } - __CALL_GET_STARTS_KERNEL(torch::kBFloat16, cutlass::bfloat16_t, LayoutSFA, - LayoutSFB, ScaleConfig) - __CALL_GET_STARTS_KERNEL(torch::kFloat16, cutlass::half_t, LayoutSFA, - LayoutSFB, ScaleConfig) - else { - TORCH_CHECK(false, "Unsupported output tensor type"); - } -} - -template -void run_blockwise_scaled_group_mm( - torch::Tensor& out_ptrs, const torch::Tensor& a_ptrs, - const torch::Tensor& b_ptrs, const torch::Tensor& a_scales_ptrs, - const torch::Tensor& b_scales_ptrs, const torch::Tensor& stride_a, - const torch::Tensor& stride_b, const torch::Tensor& stride_c, - const torch::Tensor& layout_sfa, const torch::Tensor& layout_sfb, - const torch::Tensor& problem_sizes, const torch::Tensor& expert_offsets) { - using 
ProblemShape = cutlass::gemm::GroupProblemShape>; - - // Types - using ElementA = cutlass::float_e4m3_t; - using ElementB = cutlass::float_e4m3_t; - using ElementC = OutType; - using ElementD = ElementC; - using ElementAccumulator = float; - using LayoutA = cutlass::layout::RowMajor; - using LayoutB = cutlass::layout::ColumnMajor; - using LayoutC = LayoutD; - - // Alignments - static constexpr int AlignmentA = 128 / cutlass::sizeof_bits::value; - static constexpr int AlignmentB = 128 / cutlass::sizeof_bits::value; - static constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; - - using ArchTag = cutlass::arch::Sm100; - using OperatorClass = cutlass::arch::OpClassTensorOp; - - using CollectiveEpilogue = - typename cutlass::epilogue::collective::CollectiveBuilder< - ArchTag, OperatorClass, typename ScheduleConfig::MmaTileShape, - typename ScheduleConfig::ClusterShape, - cutlass::epilogue::collective::EpilogueTileAuto, ElementAccumulator, - ElementAccumulator, void, LayoutC*, AlignmentC, ElementD, LayoutC*, - AlignmentC, typename ScheduleConfig::EpilogueSchedule>::CollectiveOp; - - using CollectiveMainloop = - typename cutlass::gemm::collective::CollectiveBuilder< - ArchTag, OperatorClass, ElementA, - cute::tuple, - AlignmentA, ElementB, - cute::tuple, - AlignmentB, ElementAccumulator, typename ScheduleConfig::MmaTileShape, - typename ScheduleConfig::ClusterShape, - cutlass::gemm::collective::StageCountAutoCarveout( - sizeof(typename CollectiveEpilogue::SharedStorage))>, - typename ScheduleConfig::KernelSchedule>::CollectiveOp; - - using GemmKernel = - cutlass::gemm::kernel::GemmUniversal; - - using Gemm = cutlass::gemm::device::GemmUniversalAdapter; - using StrideA = typename Gemm::GemmKernel::InternalStrideA; - using StrideB = typename Gemm::GemmKernel::InternalStrideB; - using StrideC = typename Gemm::GemmKernel::InternalStrideC; - using StrideD = typename Gemm::GemmKernel::InternalStrideD; - - using UnderlyingProblemShape = 
ProblemShape::UnderlyingProblemShape; - int num_experts = (int)expert_offsets.size(0); - - Gemm gemm_op; - - // Mainloop Arguments - typename GemmKernel::MainloopArguments mainloop_args{ - static_cast(a_ptrs.data_ptr()), - static_cast(stride_a.data_ptr()), - static_cast(b_ptrs.data_ptr()), - static_cast(stride_b.data_ptr()), - static_cast(a_scales_ptrs.data_ptr()), - reinterpret_cast( - layout_sfa.data_ptr()), - static_cast(b_scales_ptrs.data_ptr()), - reinterpret_cast( - layout_sfb.data_ptr())}; - - int device_id = a_ptrs.device().index(); - static const cutlass::KernelHardwareInfo hw_info{ - device_id, cutlass::KernelHardwareInfo::query_device_multiprocessor_count( - device_id)}; - - // Epilogue Arguments - typename GemmKernel::EpilogueArguments epilogue_args{ - {}, // epilogue.thread - nullptr, - static_cast(stride_c.data_ptr()), - static_cast(out_ptrs.data_ptr()), - static_cast(stride_c.data_ptr())}; - - UnderlyingProblemShape* problem_sizes_as_shapes = - static_cast(problem_sizes.data_ptr()); - - // Gemm Arguments - typename GemmKernel::Arguments args{ - cutlass::gemm::GemmUniversalMode::kGrouped, - {num_experts, problem_sizes_as_shapes, nullptr}, - mainloop_args, - epilogue_args, - hw_info}; - - at::cuda::CUDAGuard device_guard{(char)a_ptrs.device().index()}; - const cudaStream_t stream = - at::cuda::getCurrentCUDAStream(a_ptrs.get_device()); - - auto can_implement_status = gemm_op.can_implement(args); - TORCH_CHECK(can_implement_status == cutlass::Status::kSuccess, - "Failed to implement GEMM"); - - size_t workspace_size = gemm_op.get_workspace_size(args); - auto const workspace_options = - torch::TensorOptions().dtype(torch::kUInt8).device(a_ptrs.device()); - auto workspace = torch::empty(workspace_size, workspace_options); - - auto status = gemm_op.initialize(args, workspace.data_ptr(), stream); - TORCH_CHECK(status == cutlass::Status::kSuccess, "Failed to initialize GEMM"); - - status = gemm_op.run(stream); - TORCH_CHECK(status == 
cutlass::Status::kSuccess, "Failed to run GEMM"); -} - -template -void blockwise_scaled_group_mm_dispatch_shape( - torch::Tensor& output, const torch::Tensor& a, const torch::Tensor& b, - const torch::Tensor& scales_a, const torch::Tensor& scales_b, - const torch::Tensor& problem_sizes, const torch::Tensor& expert_offsets) { - struct MmaConfig { - using ElementA = cutlass::float_e4m3_t; - using KernelSchedule = - cutlass::gemm::KernelPtrArrayTmaWarpSpecializedBlockwise1SmSm100; - using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm; - using ScaleConfig = cutlass::detail::Sm100BlockwiseScaleConfig< - 1, 128, 128, cute::UMMA::Major::K, cute::UMMA::Major::K>; - using LayoutSFA = decltype(ScaleConfig::deduce_layoutSFA()); - using LayoutSFB = decltype(ScaleConfig::deduce_layoutSFB()); - using LayoutC = cutlass::layout::RowMajor; - using MmaTileShape = Shape<_128, _128, _128>; - using ClusterShape = Shape<_1, _1, _1>; - }; - - int num_experts = (int)expert_offsets.size(0); - - auto a_ptrs = torch::empty( - {num_experts}, - torch::TensorOptions().dtype(torch::kInt64).device(a.device())); - auto b_ptrs = torch::empty( - {num_experts}, - torch::TensorOptions().dtype(torch::kInt64).device(a.device())); - auto out_ptrs = torch::empty( - {num_experts}, - torch::TensorOptions().dtype(torch::kInt64).device(a.device())); - auto a_scales_ptrs = torch::empty( - {num_experts}, - torch::TensorOptions().dtype(torch::kInt64).device(a.device())); - auto b_scales_ptrs = torch::empty( - {num_experts}, - torch::TensorOptions().dtype(torch::kInt64).device(a.device())); - - auto layout_sfa = torch::empty( - {num_experts, 5}, - torch::TensorOptions().dtype(torch::kInt32).device(a.device())); - auto layout_sfb = torch::empty( - {num_experts, 5}, - torch::TensorOptions().dtype(torch::kInt32).device(a.device())); - - auto stride_a = torch::full( - {num_experts}, a.size(1), - torch::TensorOptions().dtype(torch::kInt64).device(a.device())); - auto stride_b = torch::full( - 
{num_experts}, a.size(1), - torch::TensorOptions().dtype(torch::kInt64).device(a.device())); - auto stride_c = torch::full( - {num_experts}, output.size(1), - torch::TensorOptions().dtype(torch::kInt64).device(a.device())); - - torch::TensorOptions options_int = - torch::TensorOptions().dtype(torch::kInt64).device(a.device()); - - run_get_ggemm_starts( - expert_offsets, a_ptrs, b_ptrs, out_ptrs, a_scales_ptrs, b_scales_ptrs, a, - b, output, scales_a, scales_b, layout_sfa, layout_sfb, problem_sizes); - - run_blockwise_scaled_group_mm( - out_ptrs, a_ptrs, b_ptrs, a_scales_ptrs, b_scales_ptrs, stride_a, - stride_b, stride_c, layout_sfa, layout_sfb, problem_sizes, - expert_offsets); -} - -void cutlass_blockwise_scaled_grouped_mm( - torch::Tensor& output, const torch::Tensor& a, const torch::Tensor& b, - const torch::Tensor& scales_a, const torch::Tensor& scales_b, - const torch::Tensor& problem_sizes, const torch::Tensor& expert_offsets) { - TORCH_CHECK(problem_sizes.dim() == 2, "problem_sizes must be 2D tensor"); - TORCH_CHECK(problem_sizes.size(1) == 3, - "problem_sizes must have shape (num_experts, 3)"); - TORCH_CHECK(problem_sizes.size(0) == expert_offsets.size(0), - "Number of experts in problem_sizes must match expert_offsets"); - TORCH_CHECK(problem_sizes.dtype() == torch::kInt32, - "problem_sizes must be int32"); - TORCH_CHECK(a.scalar_type() == torch::kFloat8_e4m3fn, - "a must be kFloat8_e4m3fn"); - TORCH_CHECK(b.scalar_type() == torch::kFloat8_e4m3fn, - "b must be kFloat8_e4m3fn"); - TORCH_CHECK(output.scalar_type() == torch::kBFloat16 || - output.scalar_type() == torch::kHalf, - "output must be bfloat16 or half"); - TORCH_CHECK(scales_a.scalar_type() == torch::kFloat32, - "scales_a must be float32"); - TORCH_CHECK(scales_b.scalar_type() == torch::kFloat32, - "scales_b must be float32"); - TORCH_CHECK(expert_offsets.scalar_type() == torch::kInt32, - "expert_offsets must be int32"); - - TORCH_CHECK(output.dim() == 2, "output must be 2D tensor"); - 
TORCH_CHECK(a.dim() == 2, "a must be 2D tensor"); - TORCH_CHECK(b.dim() == 3, "b must be 3D tensor"); - TORCH_CHECK(scales_a.dim() == 2, "scales_a must be 2D tensor"); - TORCH_CHECK(scales_b.dim() == 3, "scales_b must be 3D tensor"); - TORCH_CHECK(problem_sizes.dim() == 2, "problem_sizes must be 2D tensor"); - TORCH_CHECK(problem_sizes.size(1) == 3, - "problem_sizes must have shape (num_experts, 3)"); - TORCH_CHECK(problem_sizes.size(0) == expert_offsets.size(0), - "Number of experts in problem_sizes must match expert_offsets"); - TORCH_CHECK(problem_sizes.dtype() == torch::kInt32, - "problem_sizes must be int32"); - TORCH_CHECK(expert_offsets.dim() == 1, "expert_offsets must be 1D tensor"); - -#if defined(ENABLE_CUTLASS_MOE_SM100) && ENABLE_CUTLASS_MOE_SM100 - if (output.scalar_type() == torch::kBFloat16) { - blockwise_scaled_group_mm_dispatch_shape( - output, a, b, scales_a, scales_b, problem_sizes, expert_offsets); - } else if (output.scalar_type() == torch::kFloat16) { - blockwise_scaled_group_mm_dispatch_shape( - output, a, b, scales_a, scales_b, problem_sizes, expert_offsets); - } else { - TORCH_CHECK(false, "Unsupported output tensor type"); - } -#endif -} - -TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) { - m.impl("cutlass_blockwise_scaled_grouped_mm", - &cutlass_blockwise_scaled_grouped_mm); -} diff --git a/csrc/quantization/w8a8/cutlass/moe/moe_data.cu b/csrc/quantization/w8a8/cutlass/moe/moe_data.cu index 99fec8fd6febc81ec796ada7b09de225272e29e0..28af2e7d4d80fc75b5e723dc07e5262203bf1c03 100644 --- a/csrc/quantization/w8a8/cutlass/moe/moe_data.cu +++ b/csrc/quantization/w8a8/cutlass/moe/moe_data.cu @@ -3,6 +3,8 @@ #include #include +#include "dispatch_utils.h" + #include constexpr uint64_t THREADS_PER_EXPERT = 512; @@ -114,22 +116,17 @@ inline void launch_compute_problem_sizes(const torch::Tensor& topk_ids, const bool swap_ab) { int num_threads = min(THREADS_PER_EXPERT, topk_ids.numel()); - const int32_t* topk_ptr = 
static_cast(topk_ids.data_ptr()); - int32_t* ps1_ptr = static_cast(problem_sizes1.data_ptr()); - int32_t* ps2_ptr = static_cast(problem_sizes2.data_ptr()); - int32_t* atomic_ptr = static_cast(atomic_buffer.data_ptr()); + auto const* topk_ptr = topk_ids.data_ptr(); + auto* ps1_ptr = problem_sizes1.data_ptr(); + auto* ps2_ptr = problem_sizes2.data_ptr(); + auto* atomic_ptr = atomic_buffer.data_ptr(); - if (swap_ab) { - compute_problem_sizes<<>>( + VLLM_DISPATCH_BOOL(swap_ab, SwapAB, [&] { + compute_problem_sizes<<>>( topk_ptr, ps1_ptr, ps2_ptr, atomic_ptr, static_cast(topk_ids.numel()), static_cast(n), static_cast(k)); - } else { - compute_problem_sizes<<>>( - topk_ptr, ps1_ptr, ps2_ptr, atomic_ptr, - static_cast(topk_ids.numel()), static_cast(n), - static_cast(k)); - } + }); } } // namespace @@ -153,6 +150,93 @@ void get_cutlass_moe_mm_problem_sizes_caller( may_swap_ab); } +template +__global__ void compute_problem_sizes_from_expert_offsets( + const int64_t* __restrict__ expert_first_token_offset, + int32_t* __restrict__ problem_sizes1, int32_t* __restrict__ problem_sizes2, + const int num_experts, const int n, const int k) { + int const expert_id = blockIdx.x * blockDim.x + threadIdx.x; + if (expert_id >= num_experts) { + return; + } + + int64_t const m64 = expert_first_token_offset[expert_id + 1] - + expert_first_token_offset[expert_id]; + int32_t const m = static_cast(m64); + + int32_t* ps1 = problem_sizes1 + expert_id * 3; + int32_t* ps2 = problem_sizes2 + expert_id * 3; + + if constexpr (!SWAP_AB) { + // [M, 2*N, K] + ps1[0] = m; + ps1[1] = 2 * n; + ps1[2] = k; + // [M, K, N] + ps2[0] = m; + ps2[1] = k; + ps2[2] = n; + } else { + // swap logical M/N in the problem shape + // [2*N, M, K] + ps1[0] = 2 * n; + ps1[1] = m; + ps1[2] = k; + // [K, M, N] + ps2[0] = k; + ps2[1] = m; + ps2[2] = n; + } +} + +void get_cutlass_moe_mm_problem_sizes_from_expert_offsets_caller( + const torch::Tensor& expert_first_token_offset, + torch::Tensor& problem_sizes1, torch::Tensor& 
problem_sizes2, + const int64_t n, const int64_t k, const bool swap_ab) { + TORCH_CHECK(expert_first_token_offset.is_cuda(), + "expert_first_token_offset must be a CUDA tensor"); + TORCH_CHECK(expert_first_token_offset.dtype() == torch::kInt64, + "expert_first_token_offset must be int64"); + + TORCH_CHECK(problem_sizes1.is_cuda() && problem_sizes2.is_cuda(), + "problem_sizes must be CUDA tensors"); + TORCH_CHECK(problem_sizes1.dtype() == torch::kInt32 && + problem_sizes2.dtype() == torch::kInt32, + "problem_sizes must be int32"); + TORCH_CHECK(problem_sizes1.is_contiguous() && problem_sizes2.is_contiguous(), + "problem_sizes must be contiguous"); + TORCH_CHECK(problem_sizes1.dim() == 2 && problem_sizes2.dim() == 2, + "problem_sizes must be 2D tensors"); + TORCH_CHECK(problem_sizes1.size(1) == 3 && problem_sizes2.size(1) == 3, + "problem_sizes second dim must be 3"); + TORCH_CHECK(problem_sizes1.sizes() == problem_sizes2.sizes(), + "problem_sizes1 and problem_sizes2 must have same shape"); + + int64_t const num_experts64 = problem_sizes1.size(0); + TORCH_CHECK(expert_first_token_offset.numel() == num_experts64 + 1, + "expert_first_token_offset must have num_experts + 1 elements"); + TORCH_CHECK(num_experts64 <= INT32_MAX, "num_experts must fit in int32"); + TORCH_CHECK(n <= INT32_MAX && k <= INT32_MAX, "n and k must fit in int32"); + + int const num_experts = static_cast(num_experts64); + auto stream = at::cuda::getCurrentCUDAStream( + expert_first_token_offset.device().index()); + + int const threads = (num_experts < 256) ? 
num_experts : 256; + int const blocks = (num_experts + threads - 1) / threads; + + auto const* offsets_ptr = expert_first_token_offset.data_ptr(); + auto* ps1_ptr = problem_sizes1.data_ptr(); + auto* ps2_ptr = problem_sizes2.data_ptr(); + + VLLM_DISPATCH_BOOL(swap_ab, SwapAB, [&] { + compute_problem_sizes_from_expert_offsets + <<>>(offsets_ptr, ps1_ptr, ps2_ptr, + num_experts, static_cast(n), + static_cast(k)); + }); +} + void get_cutlass_moe_mm_data_caller( const torch::Tensor& topk_ids, torch::Tensor& expert_offsets, torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2, diff --git a/csrc/quantization/w8a8/cutlass/scaled_mm_entry.cu b/csrc/quantization/w8a8/cutlass/scaled_mm_entry.cu index 5de21cfbbaafb4ee71d879ff906e26633bd5284f..077966a1d92a0c08182f79b4a493a64eea50208f 100644 --- a/csrc/quantization/w8a8/cutlass/scaled_mm_entry.cu +++ b/csrc/quantization/w8a8/cutlass/scaled_mm_entry.cu @@ -83,6 +83,11 @@ void get_cutlass_moe_mm_problem_sizes_caller( const int64_t k, const std::optional& blockscale_offsets, std::optional force_swap_ab = std::nullopt); +void get_cutlass_moe_mm_problem_sizes_from_expert_offsets_caller( + const torch::Tensor& expert_first_token_offset, + torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2, + const int64_t n, const int64_t k, const bool swap_ab); + void get_cutlass_pplx_moe_mm_data_caller(torch::Tensor& expert_offsets, torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2, @@ -322,6 +327,25 @@ void get_cutlass_moe_mm_problem_sizes( version_num, ". 
Required capability: 90, 100, or 120"); } +void get_cutlass_moe_mm_problem_sizes_from_expert_offsets( + const torch::Tensor& expert_first_token_offset, + torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2, + const int64_t n, const int64_t k, const bool swap_ab) { + int32_t version_num = get_sm_version_num(); +#if (defined ENABLE_CUTLASS_MOE_SM90 && ENABLE_CUTLASS_MOE_SM90) || \ + (defined ENABLE_CUTLASS_MOE_SM100 && ENABLE_CUTLASS_MOE_SM100) || \ + (defined ENABLE_CUTLASS_MOE_SM120 && ENABLE_CUTLASS_MOE_SM120) + get_cutlass_moe_mm_problem_sizes_from_expert_offsets_caller( + expert_first_token_offset, problem_sizes1, problem_sizes2, n, k, swap_ab); + return; +#endif + TORCH_CHECK_NOT_IMPLEMENTED( + false, + "No compiled get_cutlass_moe_mm_problem_sizes_from_expert_offsets: " + "no cutlass_scaled_mm kernel for CUDA device capability: ", + version_num, ". Required capability: 90, 100, or 120"); +} + void get_cutlass_pplx_moe_mm_data(torch::Tensor& expert_offsets, torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2, diff --git a/csrc/quantization/w8a8/fp8/common.cu b/csrc/quantization/w8a8/fp8/common.cu index 7a822fb8fb8aa8b8492906e4666f6c9402079837..d07cdd571fedd18b3893d5dbbd7b70d22c9f4ec7 100644 --- a/csrc/quantization/w8a8/fp8/common.cu +++ b/csrc/quantization/w8a8/fp8/common.cu @@ -4,28 +4,77 @@ #include "quantization/vectorization_utils.cuh" #include #include +#include namespace vllm { -template -__global__ void scaled_fp8_quant_kernel_strided( +// STRIDE_I_ZERO: true if scale_stride_i == 0 (per-tensor or per-channel) +// STRIDE_J_ZERO: true if scale_stride_j == 0 (per-tensor or per-token) +template +__global__ void scaled_fp8_quant_kernel_strided_group_shape( fp8_type* __restrict__ out, const scalar_t* __restrict__ input, const float* __restrict__ scale, int hidden_size, int64_t in_row_stride, - int64_t out_row_stride) { - const int64_t token_idx = blockIdx.x; // one token per block + int64_t out_row_stride, int group_m, int group_n, int64_t 
scale_stride_i, + int64_t scale_stride_j) { + const int64_t token_idx = blockIdx.x; const int tid = threadIdx.x; const scalar_t* token_in = input + token_idx * in_row_stride; fp8_type* token_out = out + token_idx * out_row_stride; - const float inv_scale = 1.0f / (*scale); - - vectorize_with_alignment<16>( - token_in, token_out, hidden_size, tid, blockDim.x, - [=] __device__(fp8_type & dst, const scalar_t& src) { - dst = scaled_fp8_conversion(static_cast(src), - inv_scale); - }); + // Precompute row-level base offset for scale access (compile-time eliminated + // when STRIDE_I_ZERO) + const int64_t scale_row_base = + STRIDE_I_ZERO ? 0 + : static_cast(token_idx) / group_m * scale_stride_i; + + auto get_inv_scale = [&](int gj) { + return 1.0f / scale[scale_row_base + gj * scale_stride_j]; + }; + + int cached_gj = -1; + float cached_inv_scale = 0.0f; + auto get_inv_scale_cached = [&](int gj) { + if (gj != cached_gj) { + cached_inv_scale = 1.0f / scale[scale_row_base + gj * scale_stride_j]; + cached_gj = gj; + } + return cached_inv_scale; + }; + + constexpr int VEC_SIZE = 16; // FP8 so vectorize to 128 bits + auto scaled_fp8_conversion_vectorized = [&](const scalar_t* in, fp8_type* out, + int size, float inv_scale) { + vectorize_with_alignment( + in, out, size, tid, blockDim.x, + [=] __device__(fp8_type & dst, const scalar_t& src) { + dst = scaled_fp8_conversion(static_cast(src), + inv_scale); + }); + }; + + if (STRIDE_J_ZERO && hidden_size % VEC_SIZE == 0) { + // Per-tensor or per-token: single scale per row, vectorize full row + scaled_fp8_conversion_vectorized(token_in, token_out, hidden_size, + get_inv_scale(0)); + } else if (group_n % VEC_SIZE == 0) { + // Multiple column groups with vectorization + const int num_groups_n = hidden_size / group_n; + + for (int gj = 0; gj < num_groups_n; gj++) { + scaled_fp8_conversion_vectorized(token_in + gj * group_n, + token_out + gj * group_n, group_n, + get_inv_scale(gj)); + } + } else { + // Scalar path for small column 
groups (group_n < VEC_SIZE) + for (int n = tid; n < hidden_size; n += blockDim.x) { + const int gj = n / group_n; + token_out[n] = scaled_fp8_conversion( + static_cast(token_in[n]), get_inv_scale_cached(gj)); + } + } } template @@ -133,17 +182,116 @@ __global__ void dynamic_per_token_scaled_fp8_quant_kernel_strided( } // namespace vllm -void static_scaled_fp8_quant(torch::Tensor& out, // [..., d] - torch::Tensor const& input, // [..., d] - torch::Tensor const& scale) // [1] +void static_scaled_fp8_quant( + torch::Tensor& out, // [..., d] + torch::Tensor const& input, // [..., d] + torch::Tensor const& scale, // various shapes + std::optional> + opt_group_shape) // optional explicit (group_m, group_n) { TORCH_CHECK(input.stride(-1) == 1, "last dimension of input must be contiguous"); TORCH_CHECK(out.stride(-1) == 1, "last dimension of output must be contiguous"); - const int hidden_size = input.size(-1); - const int num_tokens = input.numel() / hidden_size; + const int hidden_size = input.size(-1); // N (columns) + const int num_tokens = input.numel() / hidden_size; // M (rows) + + // Determine group_m, group_n, and scale strides from scale shape + // Scale indexing: scale[gi * scale_stride_i + gj * scale_stride_j] + // where gi = m / group_m, gj = n / group_n + int group_m, group_n; + int64_t scale_stride_i, scale_stride_j; + + if (scale.dim() == 0 || scale.numel() == 1) { + // Per-tensor: one scale for the entire tensor + group_m = num_tokens; + group_n = hidden_size; + scale_stride_i = 0; + scale_stride_j = 0; + } else if (scale.dim() == 1) { + // 1D scale: require explicit group_shape to disambiguate per-channel vs + // per-token (avoids edge case where num_tokens == hidden_size) + TORCH_CHECK(opt_group_shape.has_value(), + "1D scale requires explicit group_shape to disambiguate " + "per-channel vs per-token quantization. 
" + "Use group_shape=(-1, 1) for per-channel or group_shape=(1, " + "-1) for per-token."); + + const auto& [opt_group_m, opt_group_n] = opt_group_shape.value(); + group_m = opt_group_m == -1 ? num_tokens : static_cast(opt_group_m); + group_n = opt_group_n == -1 ? hidden_size : static_cast(opt_group_n); + + // Validate the explicit group shape matches the 1D scale + const int64_t scale_len = scale.numel(); + const int64_t expected_scale_m = num_tokens / group_m; + const int64_t expected_scale_n = hidden_size / group_n; + const int64_t expected_scale_numel = expected_scale_m * expected_scale_n; + + TORCH_CHECK(scale_len == expected_scale_numel, "1D scale length (", + scale_len, ") does not match expected size (", + expected_scale_numel, ") for group_shape (", opt_group_m, ", ", + opt_group_n, ") with input shape (", num_tokens, ", ", + hidden_size, ")"); + + // For 1D scale, determine strides based on which dim is trivial + // Scale indexing: scale[gi * scale_stride_i + gj * scale_stride_j] + // where gi = m / group_m (row group), gj = n / group_n (col group) + if (expected_scale_m == 1) { + // Per-channel style: one scale in M dim, scale varies along N + // gi = 0 always, gj varies, so stride_1 traverses the scale + scale_stride_i = 0; + scale_stride_j = scale.stride(0); + } else if (expected_scale_n == 1) { + // Per-token style: one scale in N dim, scale varies along M + // gj = 0 always, gi varies, so stride_0 traverses the scale + scale_stride_i = scale.stride(0); + scale_stride_j = 0; + } else { + TORCH_CHECK( + false, + "1D scale can only be used when one of the scale dimensions is 1. 
" + "For 2D group scaling, use a 2D scale tensor."); + } + } else if (scale.dim() == 2) { + // 2D scale: infer group sizes from scale dimensions (or use explicit if + // provided) + const int64_t scale_size_0 = scale.size(0); + const int64_t scale_size_1 = scale.size(1); + + TORCH_CHECK(num_tokens % scale_size_0 == 0, "num_tokens (", num_tokens, + ") must be divisible by scale.size(0) (", scale_size_0, ")"); + TORCH_CHECK(hidden_size % scale_size_1 == 0, "hidden_size (", hidden_size, + ") must be divisible by scale.size(1) (", scale_size_1, ")"); + + // Infer from 2D scale shape + int inferred_group_m = num_tokens / scale_size_0; + int inferred_group_n = hidden_size / scale_size_1; + + // Use explicit if provided, otherwise use inferred + if (opt_group_shape.has_value()) { + const auto& [opt_group_m, opt_group_n] = opt_group_shape.value(); + group_m = opt_group_m == -1 ? num_tokens : static_cast(opt_group_m); + group_n = opt_group_n == -1 ? hidden_size : static_cast(opt_group_n); + + // Validate explicit matches inferred + TORCH_CHECK(group_m == inferred_group_m && group_n == inferred_group_n, + "Explicit group_shape (", opt_group_m, ", ", opt_group_n, + ") does not match inferred group shape (", inferred_group_m, + ", ", inferred_group_n, ") from 2D scale tensor shape (", + scale_size_0, ", ", scale_size_1, ")"); + } else { + group_m = inferred_group_m; + group_n = inferred_group_n; + } + + scale_stride_i = scale.stride(0); + scale_stride_j = scale.stride(1); + } else { + TORCH_CHECK(false, "scale must be 0D, 1D, or 2D tensor, but got ", + scale.dim(), "D"); + } + const int block_size = 256; dim3 grid(num_tokens); dim3 block(block_size); @@ -153,15 +301,23 @@ void static_scaled_fp8_quant(torch::Tensor& out, // [..., d] const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); + + // Dispatch to template-specialized kernel based on stride pattern VLLM_DISPATCH_FLOATING_TYPES( input.scalar_type(), 
"scaled_fp8_quant_kernel_scalar_type", [&] { VLLM_DISPATCH_FP8_TYPES( out.scalar_type(), "scaled_fp8_quant_kernel_fp8_type", [&] { - vllm::scaled_fp8_quant_kernel_strided - <<>>( - out.data_ptr(), input.data_ptr(), - scale.data_ptr(), hidden_size, in_row_stride, - out_row_stride); + VLLM_DISPATCH_BOOL(scale_stride_i == 0, S0_ZERO, [&] { + VLLM_DISPATCH_BOOL(scale_stride_j == 0, S1_ZERO, [&] { + vllm::scaled_fp8_quant_kernel_strided_group_shape< + scalar_t, fp8_t, S0_ZERO, S1_ZERO> + <<>>( + out.data_ptr(), input.data_ptr(), + scale.data_ptr(), hidden_size, in_row_stride, + out_row_stride, group_m, group_n, scale_stride_i, + scale_stride_j); + }); + }); }); }); } diff --git a/csrc/sampler.cu b/csrc/sampler.cu index fc2154beff9e0ffbc9361a9a87c102c6ad910903..f7c091f1d4ee40ac24332f60067f25c54cd27a61 100644 --- a/csrc/sampler.cu +++ b/csrc/sampler.cu @@ -1,3 +1,4 @@ +#include "cuda_compat.h" #include "dispatch_utils.h" #include @@ -97,7 +98,9 @@ static inline __device__ bool isPartialMatch(float x, uint32_t pattern) { template __device__ void vectorized_process(size_t thread_rank, size_t num_threads, const T* in, idxT len, Func f) { - constexpr int WARP_SIZE = 32; + // Use dynamic WARP_SIZE from cuda_compat.h to support both + // Wave64 (MI300X/gfx942) and Wave32 (Strix Halo/gfx1151) architectures + constexpr int kWarpSize = WARP_SIZE; using WideT = float4; if constexpr (sizeof(T) >= sizeof(WideT)) { for (idxT i = thread_rank; i < len; i += num_threads) { @@ -132,8 +135,8 @@ __device__ void vectorized_process(size_t thread_rank, size_t num_threads, } } - static_assert(WARP_SIZE >= items_per_scalar); - // and because items_per_scalar > skip_cnt, WARP_SIZE > skip_cnt + static_assert(kWarpSize >= items_per_scalar); + // and because items_per_scalar > skip_cnt, kWarpSize > skip_cnt // no need to use loop if (thread_rank < skip_cnt) { f(in[thread_rank], thread_rank); @@ -142,7 +145,7 @@ __device__ void vectorized_process(size_t thread_rank, size_t num_threads, // len_cast * 
items_per_scalar + items_per_scalar > len - skip_cnt; // and so // len - (skip_cnt + len_cast * items_per_scalar) < items_per_scalar <= - // WARP_SIZE no need to use loop + // kWarpSize no need to use loop const idxT remain_i = skip_cnt + len_cast * items_per_scalar + thread_rank; if (remain_i < len) { f(in[remain_i], remain_i); @@ -550,8 +553,8 @@ static __global__ __launch_bounds__(kNumThreadsPerBlock) void topKPerRowPrefill( int rowEnd = rowEnds[rowIdx]; // Local pointers to this block - outIndices += rowIdx * topK; - logits += rowIdx * stride0; + outIndices += static_cast(rowIdx) * topK; + logits += static_cast(rowIdx) * stride0; topKPerRowJob( nullptr, logits, rowStart, rowEnd, outIndices, nullptr, stride1, topK); @@ -576,19 +579,21 @@ static __global__ __launch_bounds__(kNumThreadsPerBlock) void topKPerRowDecode( // Local pointers to this block if constexpr (!multipleBlocksPerRow && !mergeBlocks) { - outIndices += rowIdx * topK; + outIndices += static_cast(rowIdx) * topK; } else if constexpr (multipleBlocksPerRow) { const auto blockSize = rowEnd / gridDim.y; // 16384 / 2 = 8192 rowStart = blockSize * blockIdx.y; // 8192 * 1 = 8192 rowEnd = gridDim.y == blockIdx.y + 1 ? 
rowEnd : rowStart + blockSize; - outIndices += rowIdx * gridDim.y * topK + blockIdx.y * topK; - outLogits += rowIdx * gridDim.y * topK + blockIdx.y * topK; + outIndices += + static_cast(rowIdx) * gridDim.y * topK + blockIdx.y * topK; + outLogits += + static_cast(rowIdx) * gridDim.y * topK + blockIdx.y * topK; } else if constexpr (mergeBlocks) { rowEnd = numBlocksToMerge * topK; - indices += rowIdx * numBlocksToMerge * topK; - outIndices += rowIdx * topK; + indices += static_cast(rowIdx) * numBlocksToMerge * topK; + outIndices += static_cast(rowIdx) * topK; } - logits += rowIdx * stride0; + logits += static_cast(rowIdx) * stride0; topKPerRowJob( diff --git a/csrc/torch_bindings.cpp b/csrc/torch_bindings.cpp index b50bb6ffbed4831e8741208ace5c2deb876b0a7b..45e8d8d980562f222837c7bd306d688602e3e45f 100644 --- a/csrc/torch_bindings.cpp +++ b/csrc/torch_bindings.cpp @@ -444,13 +444,6 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { " Tensor alpha) -> ()"); ops.impl("cutlass_scaled_fp4_mm", torch::kCUDA, &cutlass_scaled_fp4_mm); - // cutlass blockwise scaledgroup GEMM - ops.def( - "cutlass_blockwise_scaled_grouped_mm(Tensor! output, Tensor a, Tensor b, " - "Tensor scales_a, Tensor scales_b, " - "Tensor problem_sizes, Tensor expert_offsets) -> ()"); - // conditionally compiled so impl registration is in source file - // cutlass nvfp4 block scaled group GEMM ops.def( "cutlass_fp4_group_mm(Tensor! out, Tensor a, Tensor b," @@ -522,6 +515,17 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { ops.impl("get_cutlass_moe_mm_problem_sizes", torch::kCUDA, &get_cutlass_moe_mm_problem_sizes); + // compute per-expert problem sizes from expert_first_token_offset + // produced by vLLM's moe_permute kernel + ops.def( + "get_cutlass_moe_mm_problem_sizes_from_expert_offsets(" + " Tensor expert_first_token_offset, " + " Tensor! problem_sizes1, " + " Tensor! 
problem_sizes2, " + " int n, int k, bool swap_ab) -> ()"); + ops.impl("get_cutlass_moe_mm_problem_sizes_from_expert_offsets", torch::kCUDA, + &get_cutlass_moe_mm_problem_sizes_from_expert_offsets); + // A function that computes data required to run fused MoE with w8a8 grouped // GEMM and PPLX. It takes expert_num_tokens and non_zero_expert_idxs // as an input, and computes expert_offsets (token start indices of each @@ -593,6 +597,15 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { "Tensor output_scale_offset_by_experts) -> ()"); ops.impl("scaled_fp4_experts_quant", torch::kCUDA, &scaled_fp4_experts_quant); + // Fused SiLU+Mul+NVFP4 experts quantization. + ops.def( + "silu_and_mul_scaled_fp4_experts_quant(Tensor! output, Tensor! " + "output_scale," + "Tensor input, Tensor input_global_scale, Tensor input_offset_by_experts," + "Tensor output_scale_offset_by_experts) -> ()"); + ops.impl("silu_and_mul_scaled_fp4_experts_quant", torch::kCUDA, + &silu_and_mul_scaled_fp4_experts_quant); + // Check if cutlass_scaled_mm_fp4 is supported for CUDA devices // of the given capability ops.def("cutlass_scaled_mm_supports_fp4(int cuda_device_capability) -> bool"); @@ -615,19 +628,22 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { // ops.impl("gptq_shuffle", torch::kCUDA, &gptq_shuffle); // Compute FP8 quantized tensor for given scaling factor. + // Supports per-tensor, per-channel, per-token, and arbitrary 2D group + // scaling. Optional group_m/group_n specify the group shape explicitly; + // required for 1D scales to disambiguate per-channel vs per-token. // ops.def( -// "static_scaled_fp8_quant(Tensor! result, Tensor input, Tensor scale) -> " -// "()"); +// "static_scaled_fp8_quant(Tensor! result, Tensor input, Tensor scale, " +// "(int, int)? group_shape=None) -> ()"); // ops.impl("static_scaled_fp8_quant", torch::kCUDA, &static_scaled_fp8_quant); -// // Compute dynamic-per-tensor FP8 quantized tensor and scaling factor. 
+ // Compute dynamic-per-tensor FP8 quantized tensor and scaling factor. // ops.def( // "dynamic_scaled_fp8_quant(Tensor! result, Tensor input, Tensor! scale) " // "-> " // "()"); // ops.impl("dynamic_scaled_fp8_quant", torch::kCUDA, &dynamic_scaled_fp8_quant); -// // Compute dynamic-per-token FP8 quantized tensor and scaling factor. + // Compute dynamic-per-token FP8 quantized tensor and scaling factor. // ops.def( // "dynamic_per_token_scaled_fp8_quant(Tensor! result, Tensor input, " // "Tensor! scale, Tensor? scale_ub) -> " @@ -721,16 +737,6 @@ TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _cache_ops), cache_ops) { "swap_blocks(Tensor src, Tensor! dst, Tensor block_mapping) -> ()"); cache_ops.impl("swap_blocks", torch::kCUDA, &swap_blocks); - // Copy the cache blocks from src to dst. - cache_ops.def( - "copy_blocks(Tensor(a!)[] key_caches, Tensor[](b!) value_caches, " - "Tensor block_mapping) -> ()"); - cache_ops.impl("copy_blocks", torch::kCUDA, &copy_blocks); - - cache_ops.def( - "copy_blocks_mla(Tensor(a!)[] kv_caches, Tensor block_mapping) -> ()"); - cache_ops.impl("copy_blocks_mla", torch::kCUDA, &copy_blocks_mla); - // Reshape the key and value tensors and cache them. cache_ops.def( "reshape_and_cache(Tensor key, Tensor value," @@ -785,6 +791,22 @@ TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _cache_ops), cache_ops) { " Tensor scale) -> ()"); cache_ops.impl("concat_and_cache_mla", torch::kCUDA, &concat_and_cache_mla); + // Rotate Q and K, then write to kv cache for MLA + cache_ops.def( + "concat_and_cache_mla_rope_fused(" + " Tensor positions," + " Tensor! q_pe," + " Tensor! k_pe," + " Tensor kv_c," + " Tensor cos_sin_cache," + " bool is_neox," + " Tensor slot_mapping," + " Tensor! kv_cache," + " str kv_cache_dtype," + " Tensor kv_cache_scale) -> ()"); + cache_ops.impl("concat_and_cache_mla_rope_fused", torch::kCUDA, + &concat_and_cache_mla_rope_fused); + // Convert the key and value cache to fp8 data type. cache_ops.def( "convert_fp8(Tensor!
dst_cache, Tensor src_cache, float scale, " diff --git a/docker/Dockerfile b/docker/Dockerfile index 0d50d97e54c6c3787c5b3939f2194d31374dd10f..ec6bfc5dfc30746f42a2bc41bcfbc257c3d01151 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -32,7 +32,7 @@ ARG DEADSNAKES_GPGKEY_URL # The PyPA get-pip.py script is a self contained script+zip file, that provides # both the installer script and the pip base85-encoded zip archive. This allows -# bootstrapping pip in environment where a dsitribution package does not exist. +# bootstrapping pip in environment where a distribution package does not exist. # # By parameterizing the URL for get-pip.py installation script, we allow # third-party to use their own copy of the script stored in a private mirror. @@ -73,15 +73,13 @@ ARG INSTALL_KV_CONNECTORS=false #################### BASE BUILD IMAGE #################### # prepare basic build environment FROM ${BUILD_BASE_IMAGE} AS base + ARG CUDA_VERSION ARG PYTHON_VERSION -ARG TARGETPLATFORM -ARG INSTALL_KV_CONNECTORS=false -ENV DEBIAN_FRONTEND=noninteractive -ARG GET_PIP_URL +ENV DEBIAN_FRONTEND=noninteractive -# Install system dependencies and uv, then create Python virtual environment +# Install system dependencies including build tools RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \ && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \ && apt-get update -y \ @@ -107,32 +105,30 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \ && ln -s /opt/venv/bin/pip /usr/bin/pip \ && python3 --version && python3 -m pip --version -ARG PIP_INDEX_URL UV_INDEX_URL -ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL -ARG PYTORCH_CUDA_INDEX_BASE_URL -ARG PIP_KEYRING_PROVIDER UV_KEYRING_PROVIDER - # Activate virtual environment and add uv to PATH ENV PATH="/opt/venv/bin:/root/.local/bin:$PATH" ENV VIRTUAL_ENV="/opt/venv" -# This timeout (in seconds) is necessary when installing some dependencies via uv since it's likely to 
time out -# Reference: https://github.com/astral-sh/uv/pull/1694 +# Environment for uv ENV UV_HTTP_TIMEOUT=500 ENV UV_INDEX_STRATEGY="unsafe-best-match" -# Use copy mode to avoid hardlink failures with Docker cache mounts ENV UV_LINK_MODE=copy -RUN <&2; exit 1 ;; \ + esac \ + && export SCCACHE_DOWNLOAD_URL="${SCCACHE_DOWNLOAD_URL:-https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-${SCCACHE_ARCH}-unknown-linux-musl.tar.gz}" \ && curl -L -o sccache.tar.gz ${SCCACHE_DOWNLOAD_URL} \ && tar -xzf sccache.tar.gz \ - && sudo mv sccache-v0.8.1-x86_64-unknown-linux-musl/sccache /usr/bin/sccache \ - && rm -rf sccache.tar.gz sccache-v0.8.1-x86_64-unknown-linux-musl \ + && sudo mv sccache-v0.8.1-${SCCACHE_ARCH}-unknown-linux-musl/sccache /usr/bin/sccache \ + && rm -rf sccache.tar.gz sccache-v0.8.1-${SCCACHE_ARCH}-unknown-linux-musl \ && if [ ! -z ${SCCACHE_ENDPOINT} ] ; then export SCCACHE_ENDPOINT=${SCCACHE_ENDPOINT} ; fi \ && export SCCACHE_BUCKET=${SCCACHE_BUCKET_NAME} \ && export SCCACHE_REGION=${SCCACHE_REGION_NAME} \ @@ -241,6 +242,50 @@ RUN --mount=type=cache,target=/root/.cache/ccache \ fi #################### CSRC BUILD IMAGE #################### +#################### EXTENSIONS BUILD IMAGE #################### +# Build DeepGEMM, pplx-kernels, DeepEP - runs in PARALLEL with csrc-build +# This stage is independent and doesn't affect csrc cache +FROM base AS extensions-build +ARG CUDA_VERSION + +# This timeout (in seconds) is necessary when installing some dependencies via uv since it's likely to time out +ENV UV_HTTP_TIMEOUT=500 +ENV UV_INDEX_STRATEGY="unsafe-best-match" +ENV UV_LINK_MODE=copy + +WORKDIR /workspace + +# Build DeepGEMM wheel +ARG DEEPGEMM_GIT_REF +COPY tools/install_deepgemm.sh /tmp/install_deepgemm.sh +RUN --mount=type=cache,target=/root/.cache/uv \ + mkdir -p /tmp/deepgemm/dist && \ + VLLM_DOCKER_BUILD_CONTEXT=1 TORCH_CUDA_ARCH_LIST="9.0a 10.0a" /tmp/install_deepgemm.sh \ + --cuda-version "${CUDA_VERSION}" \ + 
${DEEPGEMM_GIT_REF:+--ref "$DEEPGEMM_GIT_REF"} \ + --wheel-dir /tmp/deepgemm/dist || \ + echo "DeepGEMM build skipped (CUDA version requirement not met)" + +# Ensure the wheel dir exists so COPY won't fail when DeepGEMM is skipped +RUN mkdir -p /tmp/deepgemm/dist && touch /tmp/deepgemm/dist/.deepgemm_skipped + +# Build pplx-kernels and DeepEP wheels +COPY tools/ep_kernels/install_python_libraries.sh /tmp/install_python_libraries.sh +ARG PPLX_COMMIT_HASH +ARG DEEPEP_COMMIT_HASH +ARG NVSHMEM_VER +RUN --mount=type=cache,target=/root/.cache/uv \ + mkdir -p /tmp/ep_kernels_workspace/dist && \ + export TORCH_CUDA_ARCH_LIST='9.0a 10.0a' && \ + /tmp/install_python_libraries.sh \ + --workspace /tmp/ep_kernels_workspace \ + --mode wheel \ + ${PPLX_COMMIT_HASH:+--pplx-ref "$PPLX_COMMIT_HASH"} \ + ${DEEPEP_COMMIT_HASH:+--deepep-ref "$DEEPEP_COMMIT_HASH"} \ + ${NVSHMEM_VER:+--nvshmem-ver "$NVSHMEM_VER"} && \ + find /tmp/ep_kernels_workspace/nvshmem -name '*.a' -delete +#################### EXTENSIONS BUILD IMAGE #################### + #################### WHEEL BUILD IMAGE #################### FROM base AS build ARG TARGETPLATFORM @@ -265,6 +310,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \ WORKDIR /workspace +# Copy pre-built csrc wheel directly COPY --from=csrc-build /workspace/dist /precompiled-wheels COPY . . 
@@ -286,27 +332,9 @@ RUN --mount=type=cache,target=/root/.cache/uv \ fi && \ python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 -# Install DeepGEMM from source -ARG DEEPGEMM_GIT_REF -COPY tools/install_deepgemm.sh /tmp/install_deepgemm.sh -RUN --mount=type=cache,target=/root/.cache/uv \ - VLLM_DOCKER_BUILD_CONTEXT=1 TORCH_CUDA_ARCH_LIST="9.0a 10.0a" /tmp/install_deepgemm.sh --cuda-version "${CUDA_VERSION}" ${DEEPGEMM_GIT_REF:+--ref "$DEEPGEMM_GIT_REF"} --wheel-dir /tmp/deepgemm/dist - -# Ensure the wheel dir exists so later-stage COPY won't fail when DeepGEMM is skipped -RUN mkdir -p /tmp/deepgemm/dist && touch /tmp/deepgemm/dist/.deepgemm_skipped - -COPY tools/ep_kernels/install_python_libraries.sh /tmp/install_python_libraries.sh -# Install EP kernels(pplx-kernels and DeepEP) -ARG PPLX_COMMIT_HASH -ARG DEEPEP_COMMIT_HASH -RUN --mount=type=cache,target=/root/.cache/uv \ - export TORCH_CUDA_ARCH_LIST='9.0a 10.0a' && \ - /tmp/install_python_libraries.sh \ - --workspace /tmp/ep_kernels_workspace \ - --mode wheel \ - ${PPLX_COMMIT_HASH:+--pplx-ref "$PPLX_COMMIT_HASH"} \ - ${DEEPEP_COMMIT_HASH:+--deepep-ref "$DEEPEP_COMMIT_HASH"} && \ - find /tmp/ep_kernels_workspace/nvshmem -name '*.a' -delete +# Copy extension wheels from extensions-build stage for later use +COPY --from=extensions-build /tmp/deepgemm/dist /tmp/deepgemm/dist +COPY --from=extensions-build /tmp/ep_kernels_workspace/dist /tmp/ep_kernels_workspace/dist # Check the size of the wheel if RUN_WHEEL_CHECK is true COPY .buildkite/check-wheel-size.py check-wheel-size.py @@ -344,32 +372,25 @@ RUN --mount=type=cache,target=/root/.cache/uv \ uv pip install --python /opt/venv/bin/python3 -r requirements/dev.txt \ --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. 
-f1,2 | tr -d '.') #################### DEV IMAGE #################### - #################### vLLM installation IMAGE #################### # image with vLLM installed FROM ${FINAL_BASE_IMAGE} AS vllm-base + ARG CUDA_VERSION ARG PYTHON_VERSION -ARG INSTALL_KV_CONNECTORS=false -WORKDIR /vllm-workspace -ENV DEBIAN_FRONTEND=noninteractive -ARG TARGETPLATFORM - -# TODO (huydhn): There is no prebuilt gdrcopy package on 12.9 at the moment -ARG GDRCOPY_CUDA_VERSION=12.8 -# Keep in line with FINAL_BASE_IMAGE -ARG GDRCOPY_OS_VERSION=Ubuntu22_04 - -SHELL ["/bin/bash", "-c"] - ARG DEADSNAKES_MIRROR_URL ARG DEADSNAKES_GPGKEY_URL ARG GET_PIP_URL +ENV DEBIAN_FRONTEND=noninteractive +WORKDIR /vllm-workspace + + +# Python version string for paths (e.g., "312" for 3.12) RUN PYTHON_VERSION_STR=$(echo ${PYTHON_VERSION} | sed 's/\.//g') && \ echo "export PYTHON_VERSION_STR=${PYTHON_VERSION_STR}" >> /etc/environment -# Install Python and other dependencies +# Install Python and system dependencies RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \ && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \ && apt-get update -y \ @@ -408,62 +429,103 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \ && curl -sS ${GET_PIP_URL} | python${PYTHON_VERSION} \ && python3 --version && python3 -m pip --version -# Install CUDA development tools and build essentials for runtime JIT compilation +# Install CUDA development tools for runtime JIT compilation # (FlashInfer, DeepGEMM, EP kernels all require compilation at runtime) RUN CUDA_VERSION_DASH=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr '.' 
'-') && \ apt-get update -y && \ apt-get install -y --no-install-recommends \ - cuda-nvcc-${CUDA_VERSION_DASH} \ - cuda-cudart-${CUDA_VERSION_DASH} \ - cuda-nvrtc-${CUDA_VERSION_DASH} \ - cuda-cuobjdump-${CUDA_VERSION_DASH} \ - # https://github.com/vllm-project/vllm/issues/29590 - libcurand-dev-${CUDA_VERSION_DASH} \ - libcublas-${CUDA_VERSION_DASH} \ - # Fixes nccl_allocator requiring nccl.h at runtime - # https://github.com/vllm-project/vllm/blob/1336a1ea244fa8bfd7e72751cabbdb5b68a0c11a/vllm/distributed/device_communicators/pynccl_allocator.py#L22 - libnccl-dev && \ + cuda-nvcc-${CUDA_VERSION_DASH} \ + cuda-cudart-${CUDA_VERSION_DASH} \ + cuda-nvrtc-${CUDA_VERSION_DASH} \ + cuda-cuobjdump-${CUDA_VERSION_DASH} \ + libcurand-dev-${CUDA_VERSION_DASH} \ + libcublas-${CUDA_VERSION_DASH} \ + # Fixes nccl_allocator requiring nccl.h at runtime + # https://github.com/vllm-project/vllm/blob/1336a1ea244fa8bfd7e72751cabbdb5b68a0c11a/vllm/distributed/device_communicators/pynccl_allocator.py#L22 + libnccl-dev && \ rm -rf /var/lib/apt/lists/* -ARG PIP_INDEX_URL UV_INDEX_URL -ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL -ARG PYTORCH_CUDA_INDEX_BASE_URL -ARG PIP_KEYRING_PROVIDER UV_KEYRING_PROVIDER - # Install uv for faster pip installs -RUN --mount=type=cache,target=/root/.cache/uv \ - python3 -m pip install uv +RUN python3 -m pip install uv -# This timeout (in seconds) is necessary when installing some dependencies via uv since it's likely to time out -# Reference: https://github.com/astral-sh/uv/pull/1694 +# Environment for uv ENV UV_HTTP_TIMEOUT=500 ENV UV_INDEX_STRATEGY="unsafe-best-match" -# Use copy mode to avoid hardlink failures with Docker cache mounts ENV UV_LINK_MODE=copy -# Workaround for https://github.com/openai/triton/issues/2507 and -# https://github.com/pytorch/pytorch/issues/107960 -- hopefully -# this won't be needed for future versions of this docker image -# or future versions of triton. 
+# Workaround for triton/pytorch issues RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/ -# Install vllm wheel first, so that torch etc will be installed. -RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \ - --mount=type=cache,target=/root/.cache/uv \ - uv pip install --system dist/*.whl --verbose \ - --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.') +# ============================================================ +# SLOW-CHANGING DEPENDENCIES BELOW +# These are the expensive layers that we want to cache +# ============================================================ + +# Install PyTorch and core CUDA dependencies +# This is ~2GB and rarely changes +ARG PYTORCH_CUDA_INDEX_BASE_URL +COPY requirements/common.txt /tmp/common.txt +COPY requirements/cuda.txt /tmp/requirements-cuda.txt +RUN --mount=type=cache,target=/root/.cache/uv \ + uv pip install --system -r /tmp/requirements-cuda.txt \ + --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.') && \ + rm /tmp/requirements-cuda.txt /tmp/common.txt # Install FlashInfer pre-compiled kernel cache and binaries +# This is ~1.1GB and only changes when FlashInfer version bumps # https://docs.flashinfer.ai/installation.html +ARG FLASHINFER_VERSION=0.5.3 RUN --mount=type=cache,target=/root/.cache/uv \ - uv pip install --system flashinfer-cubin==0.5.3 \ - && uv pip install --system flashinfer-jit-cache==0.5.3 \ + uv pip install --system flashinfer-cubin==${FLASHINFER_VERSION} \ + && uv pip install --system flashinfer-jit-cache==${FLASHINFER_VERSION} \ --extra-index-url https://flashinfer.ai/whl/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.') \ && flashinfer show-config -COPY examples examples -COPY benchmarks benchmarks -COPY ./vllm/collect_env.py . 
+# ============================================================ +# OPENAI API SERVER DEPENDENCIES +# Pre-install these to avoid reinstalling on every vLLM wheel rebuild +# ============================================================ + +# Install gdrcopy (saves ~6s per build) +# TODO (huydhn): There is no prebuilt gdrcopy package on 12.9 at the moment +ARG GDRCOPY_CUDA_VERSION=12.8 +ARG GDRCOPY_OS_VERSION=Ubuntu22_04 +ARG TARGETPLATFORM +COPY tools/install_gdrcopy.sh /tmp/install_gdrcopy.sh +RUN set -eux; \ + case "${TARGETPLATFORM}" in \ + linux/arm64) UUARCH="aarch64" ;; \ + linux/amd64) UUARCH="x64" ;; \ + *) echo "Unsupported TARGETPLATFORM: ${TARGETPLATFORM}" >&2; exit 1 ;; \ + esac; \ + /tmp/install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "${GDRCOPY_CUDA_VERSION}" "${UUARCH}" && \ + rm /tmp/install_gdrcopy.sh + +# Install vllm-openai dependencies (saves ~2.6s per build) +# These are stable packages that don't depend on vLLM itself +RUN --mount=type=cache,target=/root/.cache/uv \ + if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \ + BITSANDBYTES_VERSION="0.42.0"; \ + else \ + BITSANDBYTES_VERSION="0.46.1"; \ + fi; \ + uv pip install --system accelerate hf_transfer modelscope \ + "bitsandbytes>=${BITSANDBYTES_VERSION}" 'timm>=1.0.17' 'runai-model-streamer[s3,gcs]>=0.15.3' + +# ============================================================ +# VLLM INSTALLATION (depends on build stage) +# ============================================================ + +ARG PIP_INDEX_URL UV_INDEX_URL +ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL +ARG PYTORCH_CUDA_INDEX_BASE_URL +ARG PIP_KEYRING_PROVIDER UV_KEYRING_PROVIDER + +# Install vllm wheel first, so that torch etc will be installed. +RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \ + --mount=type=cache,target=/root/.cache/uv \ + uv pip install --system dist/*.whl --verbose \ + --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. 
-f1,2 | tr -d '.') RUN --mount=type=cache,target=/root/.cache/uv \ . /etc/environment && \ @@ -478,7 +540,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \ echo "No DeepGEMM wheels to install; skipping."; \ fi' -# Pytorch now installs NVSHMEM, setting LD_LIBRARY_PATH (https://github.com/pytorch/pytorch/blob/d38164a545b4a4e4e0cf73ce67173f70574890b6/.ci/manywheel/build_cuda.sh#L141C14-L141C36) +# Pytorch now installs NVSHMEM, setting LD_LIBRARY_PATH ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH # Install EP kernels wheels (pplx-kernels and DeepEP) that have been built in the `build` stage @@ -487,23 +549,17 @@ RUN --mount=type=bind,from=build,src=/tmp/ep_kernels_workspace/dist,target=/vllm uv pip install --system ep_kernels/dist/*.whl --verbose \ --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.') -RUN --mount=type=bind,source=tools/install_gdrcopy.sh,target=/tmp/install_gdrcopy.sh,ro \ - set -eux; \ - case "${TARGETPLATFORM}" in \ - linux/arm64) UUARCH="aarch64" ;; \ - linux/amd64) UUARCH="x64" ;; \ - *) echo "Unsupported TARGETPLATFORM: ${TARGETPLATFORM}" >&2; exit 1 ;; \ - esac; \ - /tmp/install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "${GDRCOPY_CUDA_VERSION}" "${UUARCH}" - # CUDA image changed from /usr/local/nvidia to /usr/local/cuda in 12.8 but will # return to /usr/local/nvidia in 13.0 to allow container providers to mount drivers # consistently from the host (see https://github.com/vllm-project/vllm/issues/18859). # Until then, add /usr/local/nvidia/lib64 before the image cuda path to allow override. ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib64:${LD_LIBRARY_PATH} +# Copy examples and benchmarks at the end to minimize cache invalidation +COPY examples examples +COPY benchmarks benchmarks +COPY ./vllm/collect_env.py . 
#################### vLLM installation IMAGE #################### - #################### TEST IMAGE #################### # image to run unit testing suite # note that this uses vllm installed by `pip` @@ -561,6 +617,7 @@ RUN mv vllm src/vllm FROM vllm-base AS vllm-openai-base ARG TARGETPLATFORM ARG INSTALL_KV_CONNECTORS=false +ARG CUDA_VERSION ARG PIP_INDEX_URL UV_INDEX_URL ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL @@ -569,18 +626,32 @@ ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL # Reference: https://github.com/astral-sh/uv/pull/1694 ENV UV_HTTP_TIMEOUT=500 -# install additional dependencies for openai api server +# install kv_connectors if requested +ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' +ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list} RUN --mount=type=cache,target=/root/.cache/uv \ --mount=type=bind,source=requirements/kv_connectors.txt,target=/tmp/kv_connectors.txt,ro \ + CUDA_MAJOR="${CUDA_VERSION%%.*}"; \ + CUDA_VERSION_DASH=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr '.' 
'-'); \ + CUDA_HOME=/usr/local/cuda; \ + # lmcache requires explicit specifying CUDA_HOME + BUILD_PKGS="libcusparse-dev-${CUDA_VERSION_DASH} \ + libcublas-dev-${CUDA_VERSION_DASH} \ + libcusolver-dev-${CUDA_VERSION_DASH}"; \ if [ "$INSTALL_KV_CONNECTORS" = "true" ]; then \ - uv pip install --system -r /tmp/kv_connectors.txt; \ - fi; \ - if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \ - BITSANDBYTES_VERSION="0.42.0"; \ - else \ - BITSANDBYTES_VERSION="0.46.1"; \ - fi; \ - uv pip install --system accelerate hf_transfer modelscope "bitsandbytes>=${BITSANDBYTES_VERSION}" 'timm>=1.0.17' 'runai-model-streamer[s3,gcs]>=0.15.3' + if [ "$CUDA_MAJOR" -ge 13 ]; then \ + uv pip install --system nixl-cu13; \ + fi; \ + uv pip install --system -r /tmp/kv_connectors.txt --no-build || ( \ + # if the above fails, install from source + apt-get update -y && \ + apt-get install -y --no-install-recommends ${BUILD_PKGS} && \ + uv pip install --system -r /tmp/kv_connectors.txt --no-build-isolation && \ + apt-get purge -y ${BUILD_PKGS} && \ + # clean up -dev packages, keep runtime libraries + rm -rf /var/lib/apt/lists/* \ + ); \ + fi ENV VLLM_USAGE_SOURCE production-docker-image diff --git a/docker/Dockerfile.cpu b/docker/Dockerfile.cpu index 8d55ecfba3e52eabccc24e2d0dc9ea0b0687b7c0..2caf1ad144178f0872ddc0e99c3d60fefcbcb545 100644 --- a/docker/Dockerfile.cpu +++ b/docker/Dockerfile.cpu @@ -17,7 +17,7 @@ # VLLM_CPU_DISABLE_AVX512=false (default)|true # VLLM_CPU_AVX512BF16=false (default)|true # VLLM_CPU_AVX512VNNI=false (default)|true -# VLLM_CPU_AMXBF16=false (default)|true +# VLLM_CPU_AMXBF16=false |true (default) # ######################### COMMON BASE IMAGE ######################### @@ -95,7 +95,7 @@ ENV VLLM_CPU_AVX512BF16=${VLLM_CPU_AVX512BF16} ARG VLLM_CPU_AVX512VNNI=0 ENV VLLM_CPU_AVX512VNNI=${VLLM_CPU_AVX512VNNI} # Support for building with AMXBF16 ISA: docker build --build-arg VLLM_CPU_AMXBF16="true" ... 
-ARG VLLM_CPU_AMXBF16=0 +ARG VLLM_CPU_AMXBF16=1 ENV VLLM_CPU_AMXBF16=${VLLM_CPU_AMXBF16} WORKDIR /workspace/vllm @@ -147,7 +147,9 @@ WORKDIR /workspace/vllm RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ --mount=type=cache,target=/var/lib/apt,sharing=locked \ - apt-get install -y --no-install-recommends vim numactl xz-utils + apt-get install -y --no-install-recommends vim numactl xz-utils make clangd-14 + +RUN ln -s /usr/bin/clangd-14 /usr/bin/clangd # install development dependencies (for testing) RUN --mount=type=cache,target=/root/.cache/uv \ diff --git a/docker/Dockerfile.ppc64le b/docker/Dockerfile.ppc64le index b16bea3607d2f62a2fcd3154277aeeaf1b2ffce4..07b64a509a4b43efd66541ab2e393127dd496792 100644 --- a/docker/Dockerfile.ppc64le +++ b/docker/Dockerfile.ppc64le @@ -22,13 +22,13 @@ RUN microdnf install -y dnf && dnf install -y gcc-toolset-14 make wget unzip \ ############################################################### FROM registry.access.redhat.com/ubi9/ubi-minimal:${BASE_UBI_IMAGE_TAG} AS centos-deps-builder RUN microdnf install -y dnf && \ - dnf install -y https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-gpg-keys-9.0-24.el9.noarch.rpm \ - https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-stream-repos-9.0-24.el9.noarch.rpm \ + dnf install -y https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-gpg-keys-9.0-26.el9.noarch.rpm \ + https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-stream-repos-9.0-26.el9.noarch.rpm \ https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \ dnf config-manager --set-enabled crb -RUN dnf install -y openjpeg2-devel lcms2-devel tcl-devel tk-devel fribidi-devel && \ - dnf remove -y centos-gpg-keys-9.0-24.el9.noarch centos-stream-repos-9.0-24.el9.noarch +RUN dnf install -y openjpeg2-devel lcms2-devel tcl-devel tk-devel fribidi-devel yajl-devel && \ + dnf remove -y centos-gpg-keys-9.0-24.el9.noarch 
centos-stream-repos-9.0-26.el9.noarch ############################################################### @@ -346,4 +346,4 @@ WORKDIR /workspace/ RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks -ENTRYPOINT ["vllm", "serve"] \ No newline at end of file +ENTRYPOINT ["vllm", "serve"] diff --git a/docker/Dockerfile.rocm b/docker/Dockerfile.rocm index 1b6bdabc7a539c28b8bd0819a2bd40edd53e9e0b..2744117af9519dbd551028de2843fe5c9d66b595 100644 --- a/docker/Dockerfile.rocm +++ b/docker/Dockerfile.rocm @@ -3,6 +3,14 @@ ARG REMOTE_VLLM="0" ARG COMMON_WORKDIR=/app ARG BASE_IMAGE=rocm/vllm-dev:base +# Sccache configuration (only used in release pipeline) +ARG USE_SCCACHE +ARG SCCACHE_DOWNLOAD_URL +ARG SCCACHE_ENDPOINT +ARG SCCACHE_BUCKET_NAME=vllm-build-sccache +ARG SCCACHE_REGION_NAME=us-west-2 +ARG SCCACHE_S3_NO_CREDENTIALS=0 + FROM ${BASE_IMAGE} AS base ARG ARG_PYTORCH_ROCM_ARCH @@ -14,9 +22,14 @@ ENV RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1 RUN apt-get update -q -y && apt-get install -q -y \ sqlite3 libsqlite3-dev libfmt-dev libmsgpack-dev libsuitesparse-dev \ apt-transport-https ca-certificates wget curl -# Remove sccache RUN python3 -m pip install --upgrade pip -RUN apt-get purge -y sccache; python3 -m pip uninstall -y sccache; rm -f "$(which sccache)" +# Remove sccache only if not using sccache (it exists in base image from Dockerfile.rocm_base) +ARG USE_SCCACHE +RUN if [ "$USE_SCCACHE" != "1" ]; then \ + apt-get purge -y sccache || true; \ + python3 -m pip uninstall -y sccache || true; \ + rm -f "$(which sccache)" || true; \ + fi # Install UV RUN curl -LsSf https://astral.sh/uv/install.sh | env UV_INSTALL_DIR="/usr/local/bin" sh @@ -28,6 +41,39 @@ ENV UV_INDEX_STRATEGY="unsafe-best-match" # Use copy mode to avoid hardlink failures with Docker cache mounts ENV UV_LINK_MODE=copy +# Install sccache if USE_SCCACHE is enabled (for release builds) +ARG USE_SCCACHE +ARG SCCACHE_DOWNLOAD_URL +ARG SCCACHE_ENDPOINT +ARG 
SCCACHE_BUCKET_NAME +ARG SCCACHE_REGION_NAME +ARG SCCACHE_S3_NO_CREDENTIALS +RUN if [ "$USE_SCCACHE" = "1" ]; then \ + if command -v sccache >/dev/null 2>&1; then \ + echo "sccache already installed, skipping installation"; \ + sccache --version; \ + else \ + echo "Installing sccache..." \ + && SCCACHE_ARCH="x86_64" \ + && SCCACHE_VERSION="v0.8.1" \ + && SCCACHE_DL_URL="${SCCACHE_DOWNLOAD_URL:-https://github.com/mozilla/sccache/releases/download/${SCCACHE_VERSION}/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl.tar.gz}" \ + && curl -L -o /tmp/sccache.tar.gz ${SCCACHE_DL_URL} \ + && tar -xzf /tmp/sccache.tar.gz -C /tmp \ + && mv /tmp/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl/sccache /usr/bin/sccache \ + && chmod +x /usr/bin/sccache \ + && rm -rf /tmp/sccache.tar.gz /tmp/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl \ + && sccache --version; \ + fi; \ + fi + +# Set sccache environment variables only when USE_SCCACHE=1 +# This prevents S3 config from leaking into images when sccache is not used +ARG USE_SCCACHE +ENV SCCACHE_BUCKET=${USE_SCCACHE:+${SCCACHE_BUCKET_NAME}} +ENV SCCACHE_REGION=${USE_SCCACHE:+${SCCACHE_REGION_NAME}} +ENV SCCACHE_S3_NO_CREDENTIALS=${USE_SCCACHE:+${SCCACHE_S3_NO_CREDENTIALS}} +ENV SCCACHE_IDLE_TIMEOUT=${USE_SCCACHE:+0} + ARG COMMON_WORKDIR WORKDIR ${COMMON_WORKDIR} @@ -39,6 +85,8 @@ ONBUILD COPY ./ vllm/ FROM base AS fetch_vllm_1 ARG VLLM_REPO="https://github.com/vllm-project/vllm.git" ARG VLLM_BRANCH="main" +ENV VLLM_REPO=${VLLM_REPO} +ENV VLLM_BRANCH=${VLLM_BRANCH} ONBUILD RUN git clone ${VLLM_REPO} \ && cd vllm \ && git fetch -v --prune -- origin ${VLLM_BRANCH} \ @@ -51,7 +99,7 @@ FROM fetch_vllm_${REMOTE_VLLM} AS fetch_vllm # ----------------------- # vLLM build stages FROM fetch_vllm AS build_vllm -# Build vLLM +# Build vLLM (setup.py auto-detects sccache in PATH) RUN cd vllm \ && python3 -m pip install -r requirements/rocm.txt \ && python3 setup.py clean --all \ @@ -67,6 +115,178 @@ COPY 
--from=build_vllm ${COMMON_WORKDIR}/vllm/docker/Dockerfile.rocm /docker/ COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/.buildkite /.buildkite COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/vllm/v1 /vllm_v1 +# RIXL/UCX build stages +FROM base AS build_rixl +ARG RIXL_BRANCH="f33a5599" +ARG RIXL_REPO="https://github.com/ROCm/RIXL.git" +ARG UCX_BRANCH="da3fac2a" +ARG UCX_REPO="https://github.com/ROCm/ucx.git" +ENV ROCM_PATH=/opt/rocm +ENV UCX_HOME=/usr/local/ucx +ENV RIXL_HOME=/usr/local/rixl +ENV RIXL_BENCH_HOME=/usr/local/rixl_bench + +# RIXL build system dependences and RDMA support +RUN apt-get -y update && apt-get -y install autoconf libtool pkg-config \ + libgrpc-dev \ + libgrpc++-dev \ + libprotobuf-dev \ + protobuf-compiler-grpc \ + libcpprest-dev \ + libaio-dev \ + librdmacm1 \ + librdmacm-dev \ + libibverbs1 \ + libibverbs-dev \ + ibverbs-utils \ + rdmacm-utils \ + ibverbs-providers \ + && rm -rf /var/lib/apt/lists/* + +RUN uv pip install --system meson auditwheel patchelf tomlkit + +RUN cd /usr/local/src && \ + git clone ${UCX_REPO} && \ + cd ucx && \ + git checkout ${UCX_BRANCH} && \ + ./autogen.sh && \ + mkdir build && cd build && \ + ../configure \ + --prefix=/usr/local/ucx \ + --enable-shared \ + --disable-static \ + --disable-doxygen-doc \ + --enable-optimizations \ + --enable-devel-headers \ + --with-rocm=/opt/rocm \ + --with-verbs \ + --with-dm \ + --enable-mt && \ + make -j && \ + make install + +ENV PATH=/usr/local/ucx/bin:$PATH +ENV LD_LIBRARY_PATH=${UCX_HOME}/lib:${LD_LIBRARY_PATH} + +RUN git clone ${RIXL_REPO} /opt/rixl && \ + cd /opt/rixl && \ + git checkout ${RIXL_BRANCH} && \ + meson setup build --prefix=${RIXL_HOME} \ + -Ducx_path=${UCX_HOME} \ + -Drocm_path=${ROCM_PATH} && \ + cd build && \ + ninja && \ + ninja install + +# Generate RIXL wheel +RUN cd /opt/rixl && mkdir -p /app/install && \ + ./contrib/build-wheel.sh \ + --output-dir /app/install \ + --rocm-dir ${ROCM_PATH} \ + --ucx-plugins-dir ${UCX_HOME}/lib/ucx \ + --nixl-plugins-dir 
${RIXL_HOME}/lib/x86_64-linux-gnu/plugins + + +# ----------------------- +# vLLM wheel release build stage (for building distributable wheels) +# This stage pins dependencies to custom ROCm wheel versions and handles version detection +FROM fetch_vllm AS build_vllm_wheel_release + +ARG COMMON_WORKDIR + +# Create /install directory for custom wheels +RUN mkdir -p /install + +# Copy custom ROCm wheels from docker/context if they exist +# COPY ensures Docker cache is invalidated when wheels change +# .keep file ensures directory always exists for COPY to work +COPY docker/context/base-wheels/ /tmp/base-wheels/ +# This is how we know if we are building for a wheel release or not. +# If there are not wheels found there, we are not building for a wheel release. +# So we exit with an error. To skip this stage. +RUN if [ -n "$(ls /tmp/base-wheels/*.whl 2>/dev/null)" ]; then \ + echo "Found custom wheels - copying to /install"; \ + cp /tmp/base-wheels/*.whl /install/ && \ + echo "Copied custom wheels:"; \ + ls -lh /install/; \ + else \ + echo "ERROR: No custom wheels found in docker/context/base-wheels/"; \ + echo "Wheel releases require pre-built ROCm wheels."; \ + exit 1; \ + fi + +# GIT_REPO_CHECK: Verify repo is clean and tags are available (for release builds) +# This matches CUDA's Dockerfile behavior for proper version detection via setuptools_scm +ARG GIT_REPO_CHECK=0 +RUN if [ "$GIT_REPO_CHECK" != "0" ]; then \ + echo "Running repository checks..."; \ + cd vllm && bash tools/check_repo.sh; \ + fi + +# Extract version from git BEFORE any modifications (pin_rocm_dependencies.py modifies requirements/rocm.txt) +# This ensures setuptools_scm sees clean repo state for version detection +RUN --mount=type=bind,source=.git,target=vllm/.git \ + cd vllm \ + && pip install setuptools_scm \ + && VLLM_VERSION=$(python3 -c "import setuptools_scm; print(setuptools_scm.get_version())") \ + && echo "Detected vLLM version: ${VLLM_VERSION}" \ + && echo "${VLLM_VERSION}" > 
/tmp/vllm_version.txt + +# Fail if git-based package dependencies are found in requirements files +# (uv doesn't handle git+ URLs well, and packages should be distributed on PyPI) +# Extra notes: pip install is able to handle git+ URLs, but uv doesn't. +RUN echo "Checking for git-based packages in requirements files..." \ + && echo "Checking common.txt for git-based packages:" \ + && if grep -q 'git+' ${COMMON_WORKDIR}/vllm/requirements/common.txt; then \ + echo "ERROR: Git-based packages found in common.txt:"; \ + grep 'git+' ${COMMON_WORKDIR}/vllm/requirements/common.txt; \ + echo "Please publish these packages to PyPI instead of using git dependencies."; \ + exit 1; \ + else \ + echo " ✓ No git-based packages found in common.txt"; \ + fi \ + && echo "Checking rocm.txt for git-based packages:" \ + && if grep -q 'git+' ${COMMON_WORKDIR}/vllm/requirements/rocm.txt; then \ + echo "ERROR: Git-based packages found in rocm.txt:"; \ + grep 'git+' ${COMMON_WORKDIR}/vllm/requirements/rocm.txt; \ + echo "Please publish these packages to PyPI instead of using git dependencies."; \ + exit 1; \ + else \ + echo " ✓ No git-based packages found in rocm.txt"; \ + fi \ + && echo "All requirements files are clean - no git-based packages found" + +# Pin vLLM dependencies to exact versions of custom ROCm wheels +# This ensures 'pip install vllm' automatically installs correct torch/triton/torchvision/amdsmi +COPY tools/vllm-rocm/pin_rocm_dependencies.py /tmp/pin_rocm_dependencies.py +RUN echo "Pinning vLLM dependencies to custom wheel versions..." 
\ + && python3 /tmp/pin_rocm_dependencies.py /install ${COMMON_WORKDIR}/vllm/requirements/rocm.txt + +# Install dependencies using custom wheels from /install +RUN cd vllm \ + && echo "Building vLLM with custom wheels from /install" \ + && python3 -m pip install --find-links /install -r requirements/rocm.txt \ + && python3 setup.py clean --all + +# Build wheel using pre-extracted version to avoid dirty state from modified requirements/rocm.txt +# (setup.py auto-detects sccache in PATH) +RUN --mount=type=bind,source=.git,target=vllm/.git \ + cd vllm \ + && export SETUPTOOLS_SCM_PRETEND_VERSION=$(cat /tmp/vllm_version.txt) \ + && echo "Building wheel with version: ${SETUPTOOLS_SCM_PRETEND_VERSION}" \ + && python3 setup.py bdist_wheel --dist-dir=dist + +FROM scratch AS export_vllm_wheel_release +ARG COMMON_WORKDIR +COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/dist/*.whl / +COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/requirements /requirements +COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/benchmarks /benchmarks +COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/tests /tests +COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/examples /examples +COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/docker/Dockerfile.rocm /docker/ +COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/.buildkite /.buildkite +COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/vllm/v1 /vllm_v1 + # ----------------------- # Test vLLM image FROM base AS test @@ -83,6 +303,10 @@ RUN --mount=type=bind,from=export_vllm,src=/,target=/install \ && pip uninstall -y vllm \ && uv pip install --system *.whl +# Install RIXL wheel +RUN --mount=type=bind,from=build_rixl,src=/app/install,target=/rixl_install \ + uv pip install --system /rixl_install/*.whl + WORKDIR /vllm-workspace ARG COMMON_WORKDIR COPY --from=build_vllm ${COMMON_WORKDIR}/vllm /vllm-workspace @@ -97,6 +321,14 @@ RUN 
--mount=type=cache,target=/root/.cache/uv \ uv pip install --system hf_transfer ENV HF_HUB_ENABLE_HF_TRANSFER=1 +# install audio decode package `torchcodec` from source (required due to +# ROCm and torch version mismatch) for tests with datasets package +COPY tools/install_torchcodec_rocm.sh /tmp/install_torchcodec.sh +RUN bash /tmp/install_torchcodec.sh \ + && rm /tmp/install_torchcodec.sh \ + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* + # Copy in the v1 package (for python-only install test group) COPY --from=export_vllm /vllm_v1 /usr/local/lib/python${PYTHON_VERSION}/dist-packages/vllm/v1 @@ -130,6 +362,7 @@ RUN --mount=type=bind,from=export_vllm,src=/,target=/install \ && uv pip install --system *.whl ARG COMMON_WORKDIR +ARG BASE_IMAGE # Copy over the benchmark scripts as well COPY --from=export_vllm /benchmarks ${COMMON_WORKDIR}/vllm/benchmarks @@ -144,4 +377,13 @@ ENV SAFETENSORS_FAST_GPU=1 # Performance environment variable. ENV HIP_FORCE_DEV_KERNARG=1 +# Workaround for ROCm profiler limits +RUN echo "ROCTRACER_MAX_EVENTS=10000000" > ${COMMON_WORKDIR}/libkineto.conf +ENV KINETO_CONFIG="${COMMON_WORKDIR}/libkineto.conf" +RUN echo "VLLM_BASE_IMAGE=${BASE_IMAGE}" >> ${COMMON_WORKDIR}/versions.txt + CMD ["/bin/bash"] + +#Set entrypoint for vllm-openai official images +FROM final As vllm-openai +ENTRYPOINT ["vllm", "serve"] diff --git a/docker/Dockerfile.rocm_base b/docker/Dockerfile.rocm_base index a57ee728d924367bacdf43bb377eb3756a9b1edf..6f8c7222fdcea78f5bfbb6a3fee12a51eabaf58e 100644 --- a/docker/Dockerfile.rocm_base +++ b/docker/Dockerfile.rocm_base @@ -1,16 +1,26 @@ -ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:7.1-complete +ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:7.0-complete ARG TRITON_BRANCH="57c693b6" ARG TRITON_REPO="https://github.com/ROCm/triton.git" -ARG PYTORCH_BRANCH="1c57644d" -ARG PYTORCH_VISION_BRANCH="v0.23.0" +ARG PYTORCH_BRANCH="89075173" ARG PYTORCH_REPO="https://github.com/ROCm/pytorch.git" +ARG PYTORCH_VISION_BRANCH="v0.24.1" ARG 
PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git" ARG PYTORCH_AUDIO_BRANCH="v2.9.0" ARG PYTORCH_AUDIO_REPO="https://github.com/pytorch/audio.git" ARG FA_BRANCH="0e60e394" ARG FA_REPO="https://github.com/Dao-AILab/flash-attention.git" -ARG AITER_BRANCH="59bd8ff2" +ARG AITER_BRANCH="6af8b687" ARG AITER_REPO="https://github.com/ROCm/aiter.git" +ARG MORI_BRANCH="2d02c6a9" +ARG MORI_REPO="https://github.com/ROCm/mori.git" + +# Sccache configuration (only used in release pipeline) +ARG USE_SCCACHE +ARG SCCACHE_DOWNLOAD_URL +ARG SCCACHE_ENDPOINT +ARG SCCACHE_BUCKET_NAME=vllm-build-sccache +ARG SCCACHE_REGION_NAME=us-west-2 +ARG SCCACHE_S3_NO_CREDENTIALS=0 FROM ${BASE_IMAGE} AS base @@ -20,6 +30,7 @@ ENV LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib: ARG PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151 ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} ENV AITER_ROCM_ARCH=gfx942;gfx950 +ENV MORI_GPU_ARCHS=gfx942;gfx950 # Required for RCCL in ROCm7.1 ENV HSA_NO_SCRATCH_RECLAIM=1 @@ -33,7 +44,7 @@ ENV DEBIAN_FRONTEND=noninteractive # Install Python and other dependencies RUN apt-get update -y \ - && apt-get install -y software-properties-common git curl sudo vim less libgfortran5 \ + && apt-get install -y software-properties-common git curl sudo vim less libgfortran5 libopenmpi-dev libpci-dev \ && for i in 1 2 3; do \ add-apt-repository -y ppa:deadsnakes/ppa && break || \ { echo "Attempt $i failed, retrying in 5s..."; sleep 5; }; \ @@ -50,6 +61,53 @@ RUN apt-get update -y \ RUN pip install -U packaging 'cmake<4' ninja wheel 'setuptools<80' pybind11 Cython RUN apt-get update && apt-get install -y libjpeg-dev libsox-dev libsox-fmt-all sox && rm -rf /var/lib/apt/lists/* +# Install sccache if USE_SCCACHE is enabled (for release builds) +ARG USE_SCCACHE +ARG SCCACHE_DOWNLOAD_URL +ARG SCCACHE_ENDPOINT +ARG SCCACHE_BUCKET_NAME +ARG SCCACHE_REGION_NAME +ARG SCCACHE_S3_NO_CREDENTIALS +RUN if [ "$USE_SCCACHE" = "1" ]; then \ + echo "Installing 
sccache..." \ + && SCCACHE_ARCH="x86_64" \ + && SCCACHE_VERSION="v0.8.1" \ + && SCCACHE_DL_URL="${SCCACHE_DOWNLOAD_URL:-https://github.com/mozilla/sccache/releases/download/${SCCACHE_VERSION}/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl.tar.gz}" \ + && curl -L -o /tmp/sccache.tar.gz ${SCCACHE_DL_URL} \ + && tar -xzf /tmp/sccache.tar.gz -C /tmp \ + && mv /tmp/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl/sccache /usr/bin/sccache \ + && chmod +x /usr/bin/sccache \ + && rm -rf /tmp/sccache.tar.gz /tmp/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl \ + && sccache --version; \ + fi + +# Setup sccache for HIP compilation via HIP_CLANG_PATH +# This creates wrapper scripts in a separate directory and points HIP to use them +# This avoids modifying the original ROCm binaries which can break detection +# NOTE: HIP_CLANG_PATH is NOT set as ENV to avoid affecting downstream images (Dockerfile.rocm) +# Instead, each build stage should export HIP_CLANG_PATH=/opt/sccache-wrappers if USE_SCCACHE=1 +RUN if [ "$USE_SCCACHE" = "1" ]; then \ + echo "Setting up sccache wrappers for HIP compilation..." 
\ + && mkdir -p /opt/sccache-wrappers \ + && printf '#!/bin/bash\nexec sccache /opt/rocm/lib/llvm/bin/clang++ "$@"\n' > /opt/sccache-wrappers/clang++ \ + && chmod +x /opt/sccache-wrappers/clang++ \ + && printf '#!/bin/bash\nexec sccache /opt/rocm/lib/llvm/bin/clang "$@"\n' > /opt/sccache-wrappers/clang \ + && chmod +x /opt/sccache-wrappers/clang \ + && echo "sccache wrappers created in /opt/sccache-wrappers"; \ + fi + +# Set sccache environment variables only when USE_SCCACHE=1 +# This prevents S3 config from leaking into images when sccache is not used +ARG USE_SCCACHE +ENV SCCACHE_BUCKET=${USE_SCCACHE:+${SCCACHE_BUCKET_NAME}} +ENV SCCACHE_REGION=${USE_SCCACHE:+${SCCACHE_REGION_NAME}} +ENV SCCACHE_S3_NO_CREDENTIALS=${USE_SCCACHE:+${SCCACHE_S3_NO_CREDENTIALS}} +ENV SCCACHE_IDLE_TIMEOUT=${USE_SCCACHE:+0} + + +### +### Triton Build +### FROM base AS build_triton ARG TRITON_BRANCH ARG TRITON_REPO @@ -62,11 +120,19 @@ RUN cd triton \ RUN if [ -d triton/python/triton_kernels ]; then pip install build && cd triton/python/triton_kernels \ && python3 -m build --wheel && cp dist/*.whl /app/install; fi + +### +### AMD SMI Build +### FROM base AS build_amdsmi RUN cd /opt/rocm/share/amd_smi \ && pip wheel . 
--wheel-dir=dist RUN mkdir -p /app/install && cp /opt/rocm/share/amd_smi/dist/*.whl /app/install + +### +### Pytorch build +### FROM base AS build_pytorch ARG PYTORCH_BRANCH ARG PYTORCH_VISION_BRANCH @@ -74,42 +140,93 @@ ARG PYTORCH_AUDIO_BRANCH ARG PYTORCH_REPO ARG PYTORCH_VISION_REPO ARG PYTORCH_AUDIO_REPO +ARG USE_SCCACHE RUN git clone ${PYTORCH_REPO} pytorch RUN cd pytorch && git checkout ${PYTORCH_BRANCH} \ && pip install -r requirements.txt && git submodule update --init --recursive \ && python3 tools/amd_build/build_amd.py \ + && if [ "$USE_SCCACHE" = "1" ]; then \ + export HIP_CLANG_PATH=/opt/sccache-wrappers \ + && export CMAKE_C_COMPILER_LAUNCHER=sccache \ + && export CMAKE_CXX_COMPILER_LAUNCHER=sccache \ + && sccache --show-stats; \ + fi \ && CMAKE_PREFIX_PATH=$(python3 -c 'import sys; print(sys.prefix)') python3 setup.py bdist_wheel --dist-dir=dist \ + && if [ "$USE_SCCACHE" = "1" ]; then sccache --show-stats; fi \ && pip install dist/*.whl RUN git clone ${PYTORCH_VISION_REPO} vision RUN cd vision && git checkout ${PYTORCH_VISION_BRANCH} \ + && if [ "$USE_SCCACHE" = "1" ]; then \ + export HIP_CLANG_PATH=/opt/sccache-wrappers \ + && export CMAKE_C_COMPILER_LAUNCHER=sccache \ + && export CMAKE_CXX_COMPILER_LAUNCHER=sccache; \ + fi \ && python3 setup.py bdist_wheel --dist-dir=dist \ + && if [ "$USE_SCCACHE" = "1" ]; then sccache --show-stats; fi \ && pip install dist/*.whl RUN git clone ${PYTORCH_AUDIO_REPO} audio RUN cd audio && git checkout ${PYTORCH_AUDIO_BRANCH} \ && git submodule update --init --recursive \ && pip install -r requirements.txt \ + && if [ "$USE_SCCACHE" = "1" ]; then \ + export HIP_CLANG_PATH=/opt/sccache-wrappers \ + && export CMAKE_C_COMPILER_LAUNCHER=sccache \ + && export CMAKE_CXX_COMPILER_LAUNCHER=sccache; \ + fi \ && python3 setup.py bdist_wheel --dist-dir=dist \ + && if [ "$USE_SCCACHE" = "1" ]; then sccache --show-stats; fi \ && pip install dist/*.whl RUN mkdir -p /app/install && cp /app/pytorch/dist/*.whl /app/install \ && cp 
/app/vision/dist/*.whl /app/install \ && cp /app/audio/dist/*.whl /app/install + +### +### MORI Build +### +FROM base AS build_mori +ARG MORI_BRANCH +ARG MORI_REPO +RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \ + pip install /install/*.whl +RUN git clone ${MORI_REPO} +RUN cd mori \ + && git checkout ${MORI_BRANCH} \ + && git submodule update --init --recursive \ + && python3 setup.py bdist_wheel --dist-dir=dist && ls /app/mori/dist/*.whl +RUN mkdir -p /app/install && cp /app/mori/dist/*.whl /app/install + + +### +### FlashAttention Build +### FROM base AS build_fa ARG FA_BRANCH ARG FA_REPO +ARG USE_SCCACHE RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \ pip install /install/*.whl RUN git clone ${FA_REPO} RUN cd flash-attention \ && git checkout ${FA_BRANCH} \ && git submodule update --init \ - && GPU_ARCHS=$(echo ${PYTORCH_ROCM_ARCH} | sed -e 's/;gfx1[0-9]\{3\}//g') python3 setup.py bdist_wheel --dist-dir=dist + && if [ "$USE_SCCACHE" = "1" ]; then \ + export HIP_CLANG_PATH=/opt/sccache-wrappers \ + && sccache --show-stats; \ + fi \ + && GPU_ARCHS=$(echo ${PYTORCH_ROCM_ARCH} | sed -e 's/;gfx1[0-9]\{3\}//g') python3 setup.py bdist_wheel --dist-dir=dist \ + && if [ "$USE_SCCACHE" = "1" ]; then sccache --show-stats; fi RUN mkdir -p /app/install && cp /app/flash-attention/dist/*.whl /app/install + +### +### AITER Build +### FROM base AS build_aiter ARG AITER_BRANCH ARG AITER_REPO +ARG USE_SCCACHE RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \ pip install /install/*.whl RUN git clone --recursive ${AITER_REPO} @@ -117,9 +234,37 @@ RUN cd aiter \ && git checkout ${AITER_BRANCH} \ && git submodule update --init --recursive \ && pip install -r requirements.txt -RUN pip install pyyaml && cd aiter && PREBUILD_KERNELS=1 GPU_ARCHS=${AITER_ROCM_ARCH} python3 setup.py bdist_wheel --dist-dir=dist && ls /app/aiter/dist/*.whl +RUN pip install pyyaml && cd aiter \ + && if [ "$USE_SCCACHE" = "1" 
]; then \ + export HIP_CLANG_PATH=/opt/sccache-wrappers \ + && sccache --show-stats; \ + fi \ + && PREBUILD_KERNELS=1 GPU_ARCHS=${AITER_ROCM_ARCH} python3 setup.py bdist_wheel --dist-dir=dist \ + && if [ "$USE_SCCACHE" = "1" ]; then sccache --show-stats; fi \ + && ls /app/aiter/dist/*.whl RUN mkdir -p /app/install && cp /app/aiter/dist/*.whl /app/install + +### +### Final Build +### + +# Wheel release stage - +# only includes dependencies used by wheel release pipeline +FROM base AS debs_wheel_release +RUN mkdir /app/debs +RUN --mount=type=bind,from=build_triton,src=/app/install/,target=/install \ + cp /install/*.whl /app/debs +RUN --mount=type=bind,from=build_fa,src=/app/install/,target=/install \ + cp /install/*.whl /app/debs +RUN --mount=type=bind,from=build_amdsmi,src=/app/install/,target=/install \ + cp /install/*.whl /app/debs +RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \ + cp /install/*.whl /app/debs +RUN --mount=type=bind,from=build_aiter,src=/app/install/,target=/install \ + cp /install/*.whl /app/debs + +# Full debs stage - includes Mori (used by Docker releases) FROM base AS debs RUN mkdir /app/debs RUN --mount=type=bind,from=build_triton,src=/app/install/,target=/install \ @@ -132,6 +277,8 @@ RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \ cp /install/*.whl /app/debs RUN --mount=type=bind,from=build_aiter,src=/app/install/,target=/install \ cp /install/*.whl /app/debs +RUN --mount=type=bind,from=build_mori,src=/app/install/,target=/install \ + cp /install/*.whl /app/debs FROM base AS final RUN --mount=type=bind,from=debs,src=/app/debs,target=/install \ @@ -150,6 +297,8 @@ ARG FA_BRANCH ARG FA_REPO ARG AITER_BRANCH ARG AITER_REPO +ARG MORI_BRANCH +ARG MORI_REPO RUN echo "BASE_IMAGE: ${BASE_IMAGE}" > /app/versions.txt \ && echo "TRITON_BRANCH: ${TRITON_BRANCH}" >> /app/versions.txt \ && echo "TRITON_REPO: ${TRITON_REPO}" >> /app/versions.txt \ @@ -162,4 +311,6 @@ RUN echo "BASE_IMAGE: 
${BASE_IMAGE}" > /app/versions.txt \ && echo "FA_BRANCH: ${FA_BRANCH}" >> /app/versions.txt \ && echo "FA_REPO: ${FA_REPO}" >> /app/versions.txt \ && echo "AITER_BRANCH: ${AITER_BRANCH}" >> /app/versions.txt \ - && echo "AITER_REPO: ${AITER_REPO}" >> /app/versions.txt \ No newline at end of file + && echo "AITER_REPO: ${AITER_REPO}" >> /app/versions.txt \ + && echo "MORI_BRANCH: ${MORI_BRANCH}" >> /app/versions.txt \ + && echo "MORI_REPO: ${MORI_REPO}" >> /app/versions.txt diff --git a/docker/Dockerfile.xpu b/docker/Dockerfile.xpu index 72d2053102c22426316df924f1bf15838b18f0ac..f63ce2c5037fbe4afb55ea9b53c7149b80be8b18 100644 --- a/docker/Dockerfile.xpu +++ b/docker/Dockerfile.xpu @@ -2,7 +2,7 @@ FROM intel/deep-learning-essentials:2025.2.2-0-devel-ubuntu24.04 AS vllm-base RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null && \ echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list && \ - add-apt-repository -y ppa:kobuk-team/intel-graphics + add-apt-repository -y ppa:kobuk-team/intel-graphics-staging RUN apt clean && apt-get update -y && \ apt-get install -y --no-install-recommends --fix-missing \ @@ -28,10 +28,14 @@ RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.12 1 RUN apt install -y libze1 libze-dev libze-intel-gpu1 intel-opencl-icd libze-intel-gpu-raytracing intel-ocloc # This oneccl contains the BMG support which is not the case for default version of oneapi 2025.2. 
-RUN wget https://github.com/uxlfoundation/oneCCL/releases/download/2021.15.6/intel-oneccl-2021.15.6.9_offline.sh -RUN bash intel-oneccl-2021.15.6.9_offline.sh -a --silent --eula accept && \ +ARG ONECCL_INSTALLER="intel-oneccl-2021.15.7.6_offline.sh" +RUN wget "https://github.com/uxlfoundation/oneCCL/releases/download/2021.15.7/${ONECCL_INSTALLER}" && \ + bash "${ONECCL_INSTALLER}" -a --silent --eula accept && \ + rm "${ONECCL_INSTALLER}" && \ echo "source /opt/intel/oneapi/setvars.sh --force" >> /root/.bashrc && \ echo "source /opt/intel/oneapi/ccl/2021.15/env/vars.sh --force" >> /root/.bashrc +RUN rm -f /opt/intel/oneapi/ccl/latest && \ + ln -s /opt/intel/oneapi/ccl/2021.15 /opt/intel/oneapi/ccl/latest SHELL ["bash", "-c"] CMD ["bash", "-c", "source /root/.bashrc && exec bash"] @@ -47,6 +51,11 @@ RUN --mount=type=cache,target=/root/.cache/pip \ pip install --no-cache-dir \ -r requirements/xpu.txt +# arctic-inference is built from source which needs torch-xpu properly installed +# used for suffix method speculative decoding +RUN --mount=type=cache,target=/root/.cache/pip \ + pip install --no-cache-dir arctic-inference==0.1.1 + ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib/" COPY . . 
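The `${USE_SCCACHE:+...}` form used for the `SCCACHE_*` `ENV` lines in the ROCm Dockerfiles earlier in this diff follows shell-style parameter expansion. The sketch below reproduces that expansion in plain bash to show what each `USE_SCCACHE` value yields. `sccache_env` is a hypothetical helper written only for this illustration, and the sketch assumes that Docker's `ENV` substitution treats `:+` the same way the shell does.

```shell
#!/usr/bin/env bash
# Illustration only: mimics ENV SCCACHE_BUCKET=${USE_SCCACHE:+${SCCACHE_BUCKET_NAME}}
# using ordinary bash parameter expansion. `sccache_env` is a hypothetical helper.
sccache_env() {
    local USE_SCCACHE="$1"
    local SCCACHE_BUCKET_NAME="vllm-build-sccache"
    # ${VAR:+word} expands to `word` when VAR is set and non-empty,
    # and to the empty string when VAR is unset or empty.
    printf '%s' "${USE_SCCACHE:+${SCCACHE_BUCKET_NAME}}"
}

# Show the result for the three interesting inputs.
for value in "1" "" "0"; do
    echo "USE_SCCACHE='${value}' -> SCCACHE_BUCKET='$(sccache_env "${value}")'"
done
```

Under the stated assumption, any non-empty value selects the bucket name, including `0`, so it is leaving `USE_SCCACHE` unset or empty that keeps the `SCCACHE_*` variables blank.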
diff --git a/docker/docker-bake.hcl b/docker/docker-bake.hcl new file mode 100644 index 0000000000000000000000000000000000000000..daf0d62a683d9d04346227d3c410886283349a75 --- /dev/null +++ b/docker/docker-bake.hcl @@ -0,0 +1,76 @@ +# docker-bake.hcl - vLLM Docker build configuration +# +# This file lives in vLLM repo at docker/docker-bake.hcl +# +# Usage: +# cd docker && docker buildx bake # Build default target (openai) +# cd docker && docker buildx bake test # Build test target +# docker buildx bake --print # Show resolved config +# +# Reference: https://docs.docker.com/build/bake/reference/ + +# Build configuration + +variable "MAX_JOBS" { + default = 16 +} + +variable "NVCC_THREADS" { + default = 8 +} + +variable "TORCH_CUDA_ARCH_LIST" { + default = "8.0 8.9 9.0 10.0" +} + +variable "COMMIT" { + default = "" +} + +# Groups + +group "default" { + targets = ["openai"] +} + +# Base targets + +target "_common" { + dockerfile = "docker/Dockerfile" + context = "." + args = { + max_jobs = MAX_JOBS + nvcc_threads = NVCC_THREADS + torch_cuda_arch_list = TORCH_CUDA_ARCH_LIST + } +} + +target "_labels" { + labels = { + "org.opencontainers.image.source" = "https://github.com/vllm-project/vllm" + "org.opencontainers.image.vendor" = "vLLM" + "org.opencontainers.image.title" = "vLLM" + "org.opencontainers.image.description" = "vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs" + "org.opencontainers.image.licenses" = "Apache-2.0" + "org.opencontainers.image.revision" = COMMIT + } + annotations = [ + "index,manifest:org.opencontainers.image.revision=${COMMIT}", + ] +} + +# Build targets + +target "test" { + inherits = ["_common", "_labels"] + target = "test" + tags = ["vllm:test"] + output = ["type=docker"] +} + +target "openai" { + inherits = ["_common", "_labels"] + target = "vllm-openai" + tags = ["vllm:openai"] + output = ["type=docker"] +} diff --git a/docs/README.md b/docs/README.md index 
0c279c19f96ca7da550364899b312f475c264cf7..4b480c463abb730503d09d957f7eb3fa9f27a7e6 100644 --- a/docs/README.md +++ b/docs/README.md @@ -62,7 +62,7 @@ vLLM is flexible and easy to use with: For more information, check out the following: -- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention) +- [vLLM announcing blog post](https://blog.vllm.ai/2023/06/20/vllm.html) (intro to PagedAttention) - [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023) - [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al. - [vLLM Meetups](community/meetups.md) diff --git a/docs/api/README.md b/docs/api/README.md index d51329ec2faa3d360d8a7bbdc6dda3d26a2092e2..14780c803c75eb69c2ff8f0a6b16476ca930c501 100644 --- a/docs/api/README.md +++ b/docs/api/README.md @@ -72,7 +72,6 @@ Internal data structures. - [vllm.multimodal.inputs.MultiModalFieldConfig][] - [vllm.multimodal.inputs.MultiModalKwargsItem][] - [vllm.multimodal.inputs.MultiModalKwargsItems][] -- [vllm.multimodal.inputs.MultiModalKwargs][] - [vllm.multimodal.inputs.MultiModalInputs][] ### Data Parsing diff --git a/docs/assets/contributing/dockerfile-stages-dependency.png b/docs/assets/contributing/dockerfile-stages-dependency.png index 7420ca4d89441e6dd320657092aaf3e1c0491e9c..c8839eb93de95fa5ffd6b3338b38ce270ea0e1c7 100644 Binary files a/docs/assets/contributing/dockerfile-stages-dependency.png and b/docs/assets/contributing/dockerfile-stages-dependency.png differ diff --git a/docs/assets/deployment/claude-code-example.png b/docs/assets/deployment/claude-code-example.png new file mode 100644 index 0000000000000000000000000000000000000000..c6f14419666bec396bb60123a92a0e8f5835abc9 Binary files /dev/null and b/docs/assets/deployment/claude-code-example.png differ diff --git a/docs/benchmarking/dashboard.md b/docs/benchmarking/dashboard.md index 
7cc4d23250df978e17df54fe750bd1e777db6b16..701fb16ae2cf10b0e43883e4131406d1fd0a7ca5 100644 --- a/docs/benchmarking/dashboard.md +++ b/docs/benchmarking/dashboard.md @@ -8,12 +8,19 @@ The results are automatically published to the public [vLLM Performance Dashboar ## Manually Trigger the benchmark Use [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with vLLM benchmark suite. -For CPU environment, please use the image with "-cpu" postfix. +For x86 CPU environment, please use the image with "-cpu" postfix. For AArch64 CPU environment, please use the image with "-arm64-cpu" postfix. -Here is an example for docker run command for CPU. +Here is an example for docker run command for CPU. For GPUs skip setting the `ON_CPU` env var. ```bash -docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN='' --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:1da94e673c257373280026f75ceb4effac80e892-cpu +export VLLM_COMMIT=1da94e673c257373280026f75ceb4effac80e892 # use full commit hash from the main branch +export HF_TOKEN= +if [[ "$(uname -m)" == aarch64 || "$(uname -m)" == arm64 ]]; then + IMG_SUFFIX="arm64-cpu" +else + IMG_SUFFIX="cpu" +fi +docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN=$HF_TOKEN -e ON_ARM64_CPU=1 --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:${VLLM_COMMIT}-${IMG_SUFFIX} ``` Then, run below command inside the docker instance. @@ -26,14 +33,65 @@ When run, benchmark script generates results under **benchmark/results** folder, ### Runtime environment variables -- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0. +- `ON_CPU`: set the value to '1' on Intel® Xeon® and Arm® Neoverse™ Processors. Default value is 0. - `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file). 
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file). - `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file). - `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string. - `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string. -For more results visualization, check the [visualizing the results](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md#visualizing-the-results). +### Visualization + +The `convert-results-json-to-markdown.py` script puts the benchmarking results into a markdown table. +You can find the result presented as a table inside the `buildkite/performance-benchmark` job page. +If you do not see the table, please wait until the benchmark finishes running. +The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file. +The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking job. + +#### Performance Results Comparison + +The `compare-json-results.py` script compares benchmark results JSON files converted using `convert-results-json-to-markdown.py`. +When run, the benchmark script generates results under the `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`. +`compare-json-results.py` compares two `benchmark_results.json` files and reports the performance ratio, e.g. for Output Tput, Median TTFT and Median TPOT. +If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead. + +Here is an example using the script to compare result_a and result_b by max concurrency and qps for the same model, dataset name, and input/output length.
+`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json` + +***Output Tput (tok/s) — Model : [ meta-llama/Llama-3.1-8B-Instruct ] , Dataset Name : [ random ] , Input Len : [ 2048.0 ] , Output Len : [ 2048.0 ]*** + +| | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio | +|----|------|-----|-----------|----------|----------| +| 0 | 12 | inf | 24.98 | 186.03 | 7.45 | +| 1 | 16 | inf| 25.49 | 246.92 | 9.69 | +| 2 | 24 | inf| 27.74 | 293.34 | 10.57 | +| 3 | 32 | inf| 28.61 |306.69 | 10.72 | + +***compare-json-results.py – Command-Line Parameters*** + +compare-json-results.py provides configurable parameters to compare one or more benchmark_results.json files and generate summary tables and plots. +In most cases, users only need to specify --file to parse the desired benchmark results. + +| Parameter | Type | Default Value | Description | +| ---------------------- | ------------------ | ----------------------- | ----------------------------------------------------------------------------------------------------- | +| `--file` | `str` (appendable) | *None* | Input JSON result file(s). Can be specified multiple times to compare multiple benchmark outputs. | +| `--debug` | `bool` | `False` | Enables debug mode. When set, prints all available information to aid troubleshooting and validation. | +| `--plot` / `--no-plot` | `bool` | `True` | Controls whether performance plots are generated. Use `--no-plot` to disable graph generation. | +| `--xaxis` | `str` | `# of max concurrency.` | Column name used as the X-axis in comparison plots (for example, concurrency or batch size). | +| `--latency` | `str` | `p99` | Latency aggregation method used for TTFT/TPOT. Supported values: `median` or `p99`. | +| `--ttft-max-ms` | `float` | `3000.0` | Reference upper bound (milliseconds) for TTFT plots, typically used to visualize SLA thresholds. 
| +| `--tpot-max-ms` | `float` | `100.0` | Reference upper bound (milliseconds) for TPOT plots, typically used to visualize SLA thresholds. | + +***Valid Max Concurrency Summary*** + +Based on the configured TTFT and TPOT SLA thresholds, compare-json-results.py computes the maximum valid concurrency for each benchmark result. +The “Max # of max concurrency. (Both)” column represents the highest concurrency level that satisfies both TTFT and TPOT constraints simultaneously. +This value is typically used in capacity planning and sizing guides. + +| # | Configuration | Max # of max concurrency. (TTFT ≤ 10000 ms) | Max # of max concurrency. (TPOT ≤ 100 ms) | Max # of max concurrency. (Both) | Output Tput @ Both (tok/s) | TTFT @ Both (ms) | TPOT @ Both (ms) | +| - | -------------- | ------------------------------------------- | ----------------------------------------- | -------------------------------- | -------------------------- | ---------------- | ---------------- | +| 0 | results-a | 128.00 | 12.00 | 12.00 | 127.76 | 3000.82 | 93.24 | +| 1 | results-b | 128.00 | 32.00 | 32.00 | 371.42 | 2261.53 | 81.74 | More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/performance-benchmarks/performance-benchmarks-descriptions.md). diff --git a/docs/benchmarking/sweeps.md b/docs/benchmarking/sweeps.md index ee4d40d876deaa2a52a25d4da6cc639c3e321ab1..93b9f4d6273a7f182d35ac09e3a032efd8eab0e5 100644 --- a/docs/benchmarking/sweeps.md +++ b/docs/benchmarking/sweeps.md @@ -129,10 +129,10 @@ vllm bench sweep serve_sla \ The algorithm for adjusting the SLA variable is as follows: -1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable. 
- - For example, the initial request rate is set to the concurrency under infinite QPS. -2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied. -3. Apply binary search over the window to find the maximum value that still satisfies the SLA. +1. Run the benchmark once with maximum possible QPS, and once with minimum possible QPS. For each run, calculate the distance of the SLA metrics from their targets, resulting in data points of QPS vs SLA distance. +2. Perform spline interpolation between the data points to estimate the QPS that results in zero SLA distance. +3. Run the benchmark with the estimated QPS and add the resulting data point to the history. +4. Repeat Steps 2 and 3 until the maximum QPS that passes SLA and the minimum QPS that fails SLA in the history are close enough to each other. !!! important SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`. 
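The interpolation-based search described in the new steps can be sketched roughly as follows. This is a minimal illustration, not the code behind `vllm bench sweep serve_sla`: it swaps the spline interpolation for plain linear interpolation between the closest passing and failing points, and it assumes latency grows monotonically with QPS. The names `find_max_qps`, `measure_ms` and `toy_latency_ms` are hypothetical, as is the made-up latency curve used in place of a real benchmark run.

```python
from typing import Callable


def find_max_qps(
    measure_ms: Callable[[float], float],  # hypothetical: one benchmark run
    target_ms: float,
    min_qps: float,
    max_qps: float,
    tol: float = 0.05,
    max_iters: int = 50,
) -> float:
    """Return the largest tested QPS whose latency stays within target_ms."""

    def distance(qps: float) -> float:
        # Signed SLA distance: <= 0 passes the SLA, > 0 fails it.
        return measure_ms(qps) - target_ms

    # Step 1: one run at each extreme of the QPS range.
    history = {min_qps: distance(min_qps), max_qps: distance(max_qps)}
    if history[max_qps] <= 0:
        return max_qps  # SLA holds even at the maximum QPS
    if history[min_qps] > 0:
        raise ValueError("SLA is not met even at the minimum QPS")

    for _ in range(max_iters):
        lo = max(q for q, d in history.items() if d <= 0)  # best passing QPS
        hi = min(q for q, d in history.items() if d > 0)   # worst failing QPS
        if hi - lo <= tol:  # Step 4: the bracket is narrow enough
            break
        # Step 2 (simplified): linear interpolation for the zero crossing.
        d_lo, d_hi = history[lo], history[hi]
        guess = lo + (hi - lo) * (-d_lo) / (d_hi - d_lo)
        # Keep the guess away from the bracket ends so the bracket shrinks.
        guess = min(max(guess, lo + tol / 2), hi - tol / 2)
        # Step 3: run at the estimate and record the new data point.
        history[guess] = distance(guess)

    return max(q for q, d in history.items() if d <= 0)


def toy_latency_ms(qps: float) -> float:
    # Made-up latency curve for demonstration only.
    return 50.0 + 2.0 * qps**2


best = find_max_qps(toy_latency_ms, target_ms=100.0, min_qps=1.0, max_qps=20.0)
print(f"max QPS within SLA: {best:.2f}")
```

Clamping the guess to at least `tol / 2` from either end is a design choice of this sketch: pure interpolation can approach the crossing from one side only, so the bracket would never narrow without it.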
diff --git a/docs/cli/bench/latency.md b/docs/cli/bench/latency.md index ea7ea7321ffcdb8acc6c2a6ee97385123a4280bc..9e1b905339757d48bcb51f37ecf5b75ea2bd4719 100644 --- a/docs/cli/bench/latency.md +++ b/docs/cli/bench/latency.md @@ -6,4 +6,4 @@ ## Arguments ---8<-- "docs/argparse/bench_latency.inc.md" +--8<-- "docs/generated/argparse/bench_latency.inc.md" diff --git a/docs/cli/bench/mm_processor.md b/docs/cli/bench/mm_processor.md new file mode 100644 index 0000000000000000000000000000000000000000..af2c3a8cfd36b07bde4f21b34983b7ce8c0e8243 --- /dev/null +++ b/docs/cli/bench/mm_processor.md @@ -0,0 +1,9 @@ +# vllm bench mm-processor + +## JSON CLI Arguments + +--8<-- "docs/cli/json_tip.inc.md" + +## Arguments + +--8<-- "docs/generated/argparse/bench_mm_processor.inc.md" diff --git a/docs/cli/bench/serve.md b/docs/cli/bench/serve.md index f7dc8036cc262dcd3308c62980806d68e055bf24..792c6e094b35102cad0ba82555b4320e6d879ad8 100644 --- a/docs/cli/bench/serve.md +++ b/docs/cli/bench/serve.md @@ -6,4 +6,4 @@ ## Arguments ---8<-- "docs/argparse/bench_serve.inc.md" +--8<-- "docs/generated/argparse/bench_serve.inc.md" diff --git a/docs/cli/bench/sweep/plot.md b/docs/cli/bench/sweep/plot.md index a101330e093cc4dfd94173b6bb2382c6792542ca..d7dc65e6df62c7c573a0e4ba729250831704d378 100644 --- a/docs/cli/bench/sweep/plot.md +++ b/docs/cli/bench/sweep/plot.md @@ -6,4 +6,4 @@ ## Arguments ---8<-- "docs/argparse/bench_sweep_plot.inc.md" +--8<-- "docs/generated/argparse/bench_sweep_plot.inc.md" diff --git a/docs/cli/bench/sweep/plot_pareto.md b/docs/cli/bench/sweep/plot_pareto.md index f5dc257ce6772f4b6052576e8227526b0bd613ab..13dffd7f2b5c423808dbf702414e23c507bfd99d 100644 --- a/docs/cli/bench/sweep/plot_pareto.md +++ b/docs/cli/bench/sweep/plot_pareto.md @@ -6,4 +6,4 @@ ## Arguments ---8<-- "docs/argparse/bench_sweep_plot_pareto.inc.md" +--8<-- "docs/generated/argparse/bench_sweep_plot_pareto.inc.md" diff --git a/docs/cli/bench/sweep/serve.md b/docs/cli/bench/sweep/serve.md index 
f0468f06fc287014c415b0c1bb3071ae767a6d01..6a8182feb40614359d4d0b88be6f913b47d8d275 100644 --- a/docs/cli/bench/sweep/serve.md +++ b/docs/cli/bench/sweep/serve.md @@ -6,4 +6,4 @@ ## Arguments ---8<-- "docs/argparse/bench_sweep_serve.inc.md" +--8<-- "docs/generated/argparse/bench_sweep_serve.inc.md" diff --git a/docs/cli/bench/sweep/serve_sla.md b/docs/cli/bench/sweep/serve_sla.md index 5642ec67eb0077fe1861cc7dbaf546b9331d94e4..688d64f0bc24d0d1f8f607ff090368becc4caf1e 100644 --- a/docs/cli/bench/sweep/serve_sla.md +++ b/docs/cli/bench/sweep/serve_sla.md @@ -6,4 +6,4 @@ ## Arguments ---8<-- "docs/argparse/bench_sweep_serve_sla.inc.md" +--8<-- "docs/generated/argparse/bench_sweep_serve_sla.inc.md" diff --git a/docs/cli/bench/throughput.md b/docs/cli/bench/throughput.md index e7f618fb4d14797042b241a77a65d092f1b383c1..66434c87819f1cc1cc362e62bbc3285d4eca27bb 100644 --- a/docs/cli/bench/throughput.md +++ b/docs/cli/bench/throughput.md @@ -6,4 +6,4 @@ ## Arguments ---8<-- "docs/argparse/bench_throughput.inc.md" +--8<-- "docs/generated/argparse/bench_throughput.inc.md" diff --git a/docs/cli/chat.md b/docs/cli/chat.md index 0246bd431b10170ca3d3c75f8bff787bbc1f670a..7b8e718f625fe40f613855ae94728e1b91298cc7 100644 --- a/docs/cli/chat.md +++ b/docs/cli/chat.md @@ -2,4 +2,4 @@ ## Arguments ---8<-- "docs/argparse/chat.inc.md" +--8<-- "docs/generated/argparse/chat.inc.md" diff --git a/docs/cli/complete.md b/docs/cli/complete.md index eb2ffdaabac25fc97c3dbea0d6696b8b88674b8b..65d953a7c046a070f18dfc970b1c0fc9a1eb20c9 100644 --- a/docs/cli/complete.md +++ b/docs/cli/complete.md @@ -2,4 +2,4 @@ ## Arguments ---8<-- "docs/argparse/complete.inc.md" +--8<-- "docs/generated/argparse/complete.inc.md" diff --git a/docs/cli/run-batch.md b/docs/cli/run-batch.md index 758fbda283978596cce5eb75c7406115bc92af84..f2255e66373d0a81373847e0daa0e91246f097a6 100644 --- a/docs/cli/run-batch.md +++ b/docs/cli/run-batch.md @@ -6,4 +6,4 @@ ## Arguments ---8<-- "docs/argparse/run-batch.inc.md" +--8<-- 
"docs/generated/argparse/run-batch.inc.md" diff --git a/docs/cli/serve.md b/docs/cli/serve.md index 35652fec587b3c4e2756cf6791e79a9fa16cd9f7..0326fe29ec7f0e3e1b028a543e0a368cb4ee8a52 100644 --- a/docs/cli/serve.md +++ b/docs/cli/serve.md @@ -6,4 +6,4 @@ ## Arguments ---8<-- "docs/argparse/serve.inc.md" +--8<-- "docs/generated/argparse/serve.inc.md" diff --git a/docs/community/meetups.md b/docs/community/meetups.md index d8cf4ecdd5a320406af43a9930b6e02ddcf15c97..43eb5cb246fc812d1d6fcb9f044bc7f17343b953 100644 --- a/docs/community/meetups.md +++ b/docs/community/meetups.md @@ -2,45 +2,4 @@ We host regular meetups around the world. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. -## Upcoming Meetups - -Stay tuned for upcoming meetups! Follow us on [Twitter/X](https://x.com/vllm_project), join our [Slack](https://slack.vllm.ai), and follow vLLM on [Luma](https://luma.com/vLLM-Meetups) to get notified about new events. - -## Past Meetups - -Below you'll find slides and recordings from our previous meetups: - -- [vLLM Bangkok Meetup](https://luma.com/v0f647nv), November 21st 2025. [[Slides]](https://drive.google.com/drive/folders/1H0DS57F8HQ5q3kSOSoRmucPJWL3E0A_X?usp=sharing) -- [vLLM Zurich Meetup](https://luma.com/0gls27kb), November 6th 2025. [[Slides]](https://docs.google.com/presentation/d/1UC9PTLCHYXQpOmJDSFg6Sljra3iVXzc09DeEI7dnxMc/edit?usp=sharing) [[Recording]](https://www.youtube.com/watch?v=6m6ZE6yVEDI) -- [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/xSrYXjNgr1HbCP4ExYNG1w), November 1st 2025. [[Slides]](https://drive.google.com/drive/folders/1nQJ8ZkLSjKxvu36sSHaceVXtttbLvvu-?usp=drive_link) -- [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/__xb4OyOsImz-9eAVrdlcg), October 25th 2025. [[Slides]](https://drive.google.com/drive/folders/1KqwjsFJLfEsC8wlDugnrR61zsWHt94Q6) -- [vLLM Toronto Meetup](https://luma.com/e80e0ymm), September 25th 2025. 
[[Slides]](https://docs.google.com/presentation/d/1IYJYmJcu9fLpID5N5RbW_vO0XLo0CGOR14IXOjB61V8/edit?usp=sharing) -- [vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ), August 30th 2025. [[Slides]](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA) -- [vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet), August 27th 2025. [[Slides]](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing) -- [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg), August 23rd 2025. [[Slides]](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH) -- [vLLM Korea Meetup](https://luma.com/cgcgprmh), August 19th 2025. [[Slides]](https://drive.google.com/file/d/1bcrrAE1rxUgx0mjIeOWT6hNe2RefC5Hm/view). -- [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/dgkWg1WFpWGO2jCdTqQHxA), August 2nd 2025. [[Slides]](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF) [[Recording]](https://www.chaspark.com/#/live/1166916873711665152). -- [NYC vLLM Meetup](https://lu.ma/c1rqyf1f), May 7th, 2025. [[Slides]](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing) -- [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day), April 3rd 2025. [[Slides]](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing). -- [vLLM x Ollama Inference Night](https://lu.ma/vllm-ollama), March 27th 2025. [[Slides]](https://docs.google.com/presentation/d/16T2PDD1YwRnZ4Tu8Q5r6n53c5Lr5c73UV9Vd2_eBo4U/edit?usp=sharing). -- [The first vLLM China Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg), March 16th 2025. [[Slides]](https://docs.google.com/presentation/d/1REHvfQMKGnvz6p3Fd23HhSO4c8j5WPGZV0bKYLwnHyQ/edit?usp=sharing). -- [The East Coast vLLM Meetup](https://lu.ma/7mu4k4xx), March 11th 2025. 
[[Slides]](https://docs.google.com/presentation/d/1NHiv8EUFF1NLd3fEYODm56nDmL26lEeXCaDgyDlTsRs/edit#slide=id.g31441846c39_0_0) -- [The ninth vLLM meetup](https://lu.ma/h7g3kuj9), with Meta, February 27th 2025. [[Slides]](https://docs.google.com/presentation/d/1jzC_PZVXrVNSFVCW-V4cFXb6pn7zZ2CyP_Flwo05aqg/edit?usp=sharing) -- [The eighth vLLM meetup](https://lu.ma/zep56hui), with Google Cloud, January 22nd 2025. [[Slides]](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing) -- [The seventh vLLM meetup](https://lu.ma/h0qvrajz), with Snowflake, November 14th 2024. [[Slides]](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing) -- [The sixth vLLM meetup](https://lu.ma/87q3nvnh), with NVIDIA, September 9th 2024. [[Slides]](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing) -- [The fifth vLLM meetup](https://lu.ma/lp0gyjqr), with AWS, July 24th 2024. [[Slides]](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing) -- [The fourth vLLM meetup](https://lu.ma/agivllm), with Cloudflare and BentoML, June 11th 2024. [[Slides]](https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing) -- [The third vLLM meetup](https://robloxandvllmmeetup2024.splashthat.com/), with Roblox, April 2nd 2024. [[Slides]](https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing) -- [The second vLLM meetup](https://lu.ma/ygxbpzhl), with IBM Research, January 31st 2024. [[Slides]](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing) [[Video (vLLM Update)]](https://youtu.be/Y0C-DUvEnZQ) [[Video (IBM Research & torch.compile)]](https://youtu.be/m0dMtFLI-dg) -- [The first vLLM meetup](https://lu.ma/first-vllm-meetup), with a16z, October 5th 2023. 
[[Slides]](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing) - -## Get Involved - -**Want to host or speak at a vLLM meetup?** We're always looking for speakers and sponsors for our meetups. Whether you want to: - -- Share your vLLM feature, use case, project extension, or deployment experience -- Host a meetup in your city -- Sponsor an event - -Please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu). +Please visit [vllm.ai/events](https://vllm.ai/events) to learn more. diff --git a/docs/community/sponsors.md b/docs/community/sponsors.md index 847b99cce45c9d68c2ea24c2ddccddb64526966e..b645eaed0cd96d3b3b381d54d1873b622e337bc4 100644 --- a/docs/community/sponsors.md +++ b/docs/community/sponsors.md @@ -2,43 +2,4 @@ vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support! - - - -Cash Donations: - -- a16z -- Dropbox -- Sequoia Capital -- Skywork AI -- ZhenFund - -Compute Resources: - -- Alibaba Cloud -- AMD -- Anyscale -- Arm -- AWS -- Crusoe Cloud -- Databricks -- DeepInfra -- Google Cloud -- IBM -- Intel -- Lambda Lab -- Nebius -- Novita AI -- NVIDIA -- Red Hat -- Replicate -- Roblox -- RunPod -- Trainy -- UC Berkeley -- UC San Diego -- Volcengine - -Slack Sponsor: Anyscale - -We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM. +Please visit [vllm.ai/#sponsors](https://vllm.ai/#sponsors) to learn more. 
diff --git a/docs/configuration/engine_args.md b/docs/configuration/engine_args.md index 05d4f762306a37d397e23db94d97a0e7691dc64e..14589478821f920d7a96ae5000b496ef8945b960 100644 --- a/docs/configuration/engine_args.md +++ b/docs/configuration/engine_args.md @@ -15,8 +15,8 @@ The engine argument classes, [EngineArgs][vllm.engine.arg_utils.EngineArgs] and ## `EngineArgs` ---8<-- "docs/argparse/engine_args.md" +--8<-- "docs/generated/argparse/engine_args.inc.md" ## `AsyncEngineArgs` ---8<-- "docs/argparse/async_engine_args.md" +--8<-- "docs/generated/argparse/async_engine_args.inc.md" diff --git a/docs/contributing/ci/update_pytorch_version.md b/docs/contributing/ci/update_pytorch_version.md index 735bb2e2053323e15ef176fdde49b7890bb6e17a..74c0beb779c7db0d4899656ea4ccec3e3107a842 100644 --- a/docs/contributing/ci/update_pytorch_version.md +++ b/docs/contributing/ci/update_pytorch_version.md @@ -77,25 +77,20 @@ This complicates the process as we cannot use the out-of-the-box - `.buildkite/release-pipeline.yaml` - `.buildkite/scripts/upload-wheels.sh` -## Address long vLLM build time +## Manually running vLLM builds on BuildKiteCI -When building vLLM with a new PyTorch/CUDA version, no cache will exist -in the vLLM sccache S3 bucket, causing the build job on CI to potentially take more than 5 hours -and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mode, -it doesn't populate the cache, so re-running it to warm up the cache -is ineffective. +When building vLLM with a new PyTorch/CUDA version, the vLLM sccache S3 bucket +will not have any cached artifacts, which can cause CI build jobs to exceed 5 hours. +Furthermore, vLLM's fastcheck pipeline operates in read-only mode and does not +populate the cache, making it ineffective for cache warm-up purposes. 
-While ongoing efforts like -address the long build time at its source, the current workaround is to set `VLLM_CI_BRANCH` -to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/long_build`) -when manually triggering a build on Buildkite. This branch accomplishes two things: +To address this, manually trigger a build on Buildkite to accomplish two objectives: -1. Increase the timeout limit to 10 hours so that the build doesn't time out. -2. Allow the compiled artifacts to be written to the vLLM sccache S3 bucket -to warm it up so that future builds are faster. +1. Run the complete test suite against the PyTorch RC build by setting the environment variables: `RUN_ALL=1` and `NIGHTLY=1` +2. Populate the vLLM sccache S3 bucket with compiled artifacts, enabling faster subsequent builds

- Buildkite new build popup +Buildkite new build popup

## Update all the different vLLM platforms diff --git a/docs/contributing/deprecation_policy.md b/docs/contributing/deprecation_policy.md index 904ef4ca058c008e71b88b1ab12f9a9273e84186..99b7c382da9c7cb4be7fd7c03ca5104ad62d38aa 100644 --- a/docs/contributing/deprecation_policy.md +++ b/docs/contributing/deprecation_policy.md @@ -46,7 +46,7 @@ warning (e.g., "This will be removed in v0.10.0"). - GitHub Issue (RFC) for feedback - Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs -### 2.Deprecated (Off By Default) +### 2. Deprecated (Off By Default) - **Action**: Feature is disabled by default, but can still be re-enabled via a CLI flag or environment variable. Feature throws an error when used without diff --git a/docs/contributing/model/basic.md b/docs/contributing/model/basic.md index d37501b86556fa02ccb9ec035c45d836b516bb3a..915fe1495f452364fd218e8974128999abd0a814 100644 --- a/docs/contributing/model/basic.md +++ b/docs/contributing/model/basic.md @@ -118,7 +118,7 @@ To support a model with interleaving sliding windows, we need to take care of th - Make sure the model's `config.json` contains `layer_types`. - In the modeling code, parse the correct sliding window value for every layer, and pass it to the attention layer's `per_layer_sliding_window` argument. For reference, check [this line](https://github.com/vllm-project/vllm/blob/996357e4808ca5eab97d4c97c7d25b3073f46aab/vllm/model_executor/models/llama.py#L171). -With these two steps, interleave sliding windows should work with the model. +With these two steps, interleaved sliding windows should work with the model. ### How to support models that use Mamba? 
@@ -142,7 +142,7 @@ We use "mamba-like" to refer to layers that possess a state that is updated in-pl For implementing new custom mamba-like layers, one should inherit from `MambaBase` and implement the methods `get_state_dtype`, `get_state_shape` to calculate the data types and state shapes at runtime, as well as `mamba_type` and `get_attn_backend`. It is also necessary to implement the "attention meta-data" class which handles the meta-data that is common across all layers. Please see [`LinearAttentionMetadata`](../../../vllm/v1/attention/backends/linear_attn.py) or [`ShortConvAttentionMetadata`](../../../vllm/v1/attention/backends/short_conv_attn.py) for examples of this. -It is also worth noting that we should update `MAMBA_TYPE_TO_BACKEND_MAP` and `MambaAttentionBackendEnum` in [`registry.py`](../../../vllm/attention/backends/registry.py) when adding a new mamba backend. +It is also worth noting that we should update `MAMBA_TYPE_TO_BACKEND_MAP` and `MambaAttentionBackendEnum` in [`registry.py`](../../../vllm/v1/attention/backends/registry.py) when adding a new mamba backend. Finally, if one wants to support torch compile and CUDA graphs, it is necessary to wrap the call to the mamba-like layer inside a custom op and register it. Please see the calls to `direct_register_custom_op` in [vllm/model_executor/models/minimax_text_01.py](../../../vllm/model_executor/models/minimax_text_01.py) or [vllm/model_executor/layers/mamba/short_conv.py](../../../vllm/model_executor/layers/mamba/short_conv.py) for examples of this. The new custom op should then be added to the list `_attention_ops` in [vllm/config/compilation.py](../../../vllm/config/compilation.py) to ensure that piecewise CUDA graphs work as intended. 
diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md index cbce14ce992ec9758b635e697d01183a2a96b970..ce10adaf0cad294eb102a91bd0bd8061f476bab2 100644 --- a/docs/contributing/profiling.md +++ b/docs/contributing/profiling.md @@ -54,6 +54,29 @@ vllm bench serve \ --num-prompts 2 ``` +Alternatively, use HTTP requests: + +```shell +# First, call the /start_profile API to start profiling. +curl -X POST http://localhost:8000/start_profile + +# Send a generation request to the model. +curl -X POST http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Llama-3.1-8B-Instruct", + "messages": [ + { + "role": "user", + "content": "San Francisco is a" + } + ] + }' + +# Finally, call the /stop_profile API to stop profiling. +curl -X POST http://localhost:8000/stop_profile +``` + ## Profile with NVIDIA Nsight Systems Nsight Systems is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions and low-level CUDA APIs and events. diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md index d70e0142e3202ceaa53c433e2c0da00b3c88360a..ae7cea4364b401b877b9923054b007a4749b5a55 100644 --- a/docs/deployment/docker.md +++ b/docs/deployment/docker.md @@ -80,6 +80,15 @@ DOCKER_BUILDKIT=1 docker build . \ If you are using Podman instead of Docker, you might need to disable SELinux labeling by adding `--security-opt label=disable` when running the `podman build` command to avoid certain [existing issues](https://github.com/containers/buildah/discussions/4184). +!!! note + If you have not changed any C++ or CUDA kernel code, you can use precompiled wheels to significantly reduce Docker build time. + + * **Enable the feature** by adding the build argument: `--build-arg VLLM_USE_PRECOMPILED="1"`. 
+ * **How it works**: By default, vLLM automatically finds the correct wheels from our [Nightly Builds](../contributing/ci/nightly_builds.md) by using the merge-base commit with the upstream `main` branch. + * **Override commit**: To use wheels from a specific commit, provide the `--build-arg VLLM_PRECOMPILED_WHEEL_COMMIT=` argument. + + For a detailed explanation, refer to the 'Set up using Python-only build (without compilation)' part of [Build wheel from source](../contributing/ci/nightly_builds.md#precompiled-wheels-usage); the build arguments described there work similarly. + ## Building for Arm64/aarch64 A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper and Grace-Blackwell. Using the flag `--platform "linux/arm64"` will build for arm64. diff --git a/docs/deployment/frameworks/cerebrium.md b/docs/deployment/frameworks/cerebrium.md index 960347d9525c45ad08d84906689352fe7a364b5b..1b7c5d5a921380d3e75bb5c1cd1ce261c40aec17 100644 --- a/docs/deployment/frameworks/cerebrium.md +++ b/docs/deployment/frameworks/cerebrium.md @@ -59,7 +59,7 @@ Then, run the following code to deploy it to the cloud: cerebrium deploy ``` -If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`) +If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case `/run`) ??? 
console "Command" diff --git a/docs/deployment/frameworks/hf_inference_endpoints.md b/docs/deployment/frameworks/hf_inference_endpoints.md index 05df0dacd8f11600e3f98f418f86e875f7d9b0f3..6217dc062d21a1f28f6103359bae215b5c830153 100644 --- a/docs/deployment/frameworks/hf_inference_endpoints.md +++ b/docs/deployment/frameworks/hf_inference_endpoints.md @@ -70,7 +70,7 @@ This method applies to models with the [`transformers` library tag](https://hugg ![Locate deploy button](../../assets/deployment/hf-inference-endpoints-locate-deploy-button.png) -3. Click to **Deploy** button > **HF Inference Endpoints**. You will be taken to the Inference Endpoints interface to configure the deployment. +3. Click the **Deploy** button > **HF Inference Endpoints**. You will be taken to the Inference Endpoints interface to configure the deployment. ![Click deploy button](../../assets/deployment/hf-inference-endpoints-click-deploy-button.png) diff --git a/docs/deployment/integrations/kserve.md b/docs/deployment/integrations/kserve.md index 37b29aa1a487659d88158224837942e25f905a21..06ad5f29a1a65dc03084e9c5c0c2ff3858147cc8 100644 --- a/docs/deployment/integrations/kserve.md +++ b/docs/deployment/integrations/kserve.md @@ -2,4 +2,4 @@ vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving. -Please see [this guide](https://kserve.github.io/website/docs/model-serving/generative-inference/overview) for more details on using vLLM with KServe. +You can use vLLM with KServe's [Hugging Face serving runtime](https://kserve.github.io/website/docs/model-serving/generative-inference/overview) or via [`LLMInferenceService` that uses llm-d](https://kserve.github.io/website/docs/model-serving/generative-inference/llmisvc/llmisvc-overview). 
diff --git a/docs/deployment/integrations/llm-d.md b/docs/deployment/integrations/llm-d.md new file mode 100644 index 0000000000000000000000000000000000000000..cccf1773c6be676c7da46dc6e39a3389a689b356 --- /dev/null +++ b/docs/deployment/integrations/llm-d.md @@ -0,0 +1,5 @@ +# llm-d + +vLLM can be deployed with [llm-d](https://github.com/llm-d/llm-d), a Kubernetes-native distributed inference serving stack providing well-lit paths for anyone to serve large generative AI models at scale. It helps achieve the fastest "time to state-of-the-art (SOTA) performance" for key OSS models across most hardware accelerators and infrastructure providers. + +You can use vLLM with llm-d directly by following [this guide](https://llm-d.ai/docs/guide) or via [KServe's LLMInferenceService](https://kserve.github.io/website/docs/model-serving/generative-inference/llmisvc/llmisvc-overview). diff --git a/docs/deployment/integrations/production-stack.md b/docs/deployment/integrations/production-stack.md index 624e98a08c98db9643b34befcd61aa6d5f2f87a5..4db595164e3de0338120d7a6e83fd088cd4a317f 100644 --- a/docs/deployment/integrations/production-stack.md +++ b/docs/deployment/integrations/production-stack.md @@ -10,7 +10,7 @@ If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](h ## Pre-requisite -Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-medal GPU machine). +Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine). 
## Deployment using vLLM production stack diff --git a/docs/deployment/k8s.md b/docs/deployment/k8s.md index 05814cbad9bfcb556ea93f001681efba81e632a7..77a159009aa8d5407bc7f672989937c65aa422fa 100644 --- a/docs/deployment/k8s.md +++ b/docs/deployment/k8s.md @@ -12,6 +12,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following: - [Helm](frameworks/helm.md) - [InftyAI/llmaz](integrations/llmaz.md) +- [llm-d](integrations/llm-d.md) - [KAITO](integrations/kaito.md) - [KServe](integrations/kserve.md) - [Kthena](integrations/kthena.md) diff --git a/docs/design/cuda_graphs.md b/docs/design/cuda_graphs.md index 19c02fc88641ccca0b64e009c694ac4938ffe7a6..af9e5b5ba6f9b666fa3f1d183b5d6fb83b098ec2 100644 --- a/docs/design/cuda_graphs.md +++ b/docs/design/cuda_graphs.md @@ -149,7 +149,7 @@ The CUDA Graphs wrapper no longer manages the warm-up logic. The warm-up process ## CUDA Graphs Compatibility of Attention Backends -To signal the CUDA Graphs compatibility of the attention backends, we introduce a new enum type [AttentionCGSupport][vllm.v1.attention.backends.utils.AttentionCGSupport], which is an enum type that tracks the capability of the attention backend to support CUDA Graphs. The value is sorted in the order of the capability, i.e., `ALWAYS`> `UNIFORM_BATCH`> `UNIFORM_SINGLE_TOKEN_DECODE`> `NEVER`. +To signal the CUDA Graphs compatibility of the attention backends, we introduce a new enum type [AttentionCGSupport][vllm.v1.attention.backend.AttentionCGSupport], which is an enum type that tracks the capability of the attention backend to support CUDA Graphs. The value is sorted in the order of the capability, i.e., `ALWAYS`> `UNIFORM_BATCH`> `UNIFORM_SINGLE_TOKEN_DECODE`> `NEVER`. 
```python class AttentionCGSupport(enum.Enum): diff --git a/docs/design/custom_op.md b/docs/design/custom_op.md new file mode 100644 index 0000000000000000000000000000000000000000..13c2915abe8f3e31d31b5518efe4645de90d76d6 --- /dev/null +++ b/docs/design/custom_op.md @@ -0,0 +1,318 @@ +# CustomOp + +`CustomOp` is an abstract class used for dispatching the forward method of various operations to the appropriate backend. It also offers a mechanism for both vLLM and OOT (Out-Of-Tree) plugins to register their custom operations. + +This document will introduce how CustomOp works in vLLM and how to implement a new `CustomOp`. + +## How CustomOp Works in vLLM + +`CustomOp` manages two dictionaries of all custom ops (i.e., op classes, indexed by registered name) in its class, for vLLM and OOT plugins respectively. + +??? code + + ```python + class CustomOp(nn.Module): + + op_registry: dict[str, type["CustomOp"]] = {} + op_registry_oot: dict[str, type["CustomOp"]] = {} + ``` + +We can use `@CustomOp.register("op_name")` to register an op class to the `CustomOp` system. After this, the `op_name` and its class will be added into the `op_registry` dictionary. In addition, we can also register an OOT op with `@CustomOp.register_oot("op_name")`. We will introduce this mechanism in detail later. + +When a `CustomOp` is called (i.e., its `forward()` method is called), if it is enabled (e.g., with `--compilation_config.custom_ops '["+op_name"]'`), it will automatically dispatch the forward method to the appropriate backend according to `current_platform`. Otherwise (i.e., it is disabled), it will only call the `forward_native()` method to use the PyTorch-native implementation of this forward method. + +- **CPU platform:** dispatch to `forward_cpu()`. +- **CUDA platform:** dispatch to `forward_cuda()`. +- **ROCm platform:** dispatch to `forward_hip()`. If `forward_hip()` is not implemented, it will use `forward_cuda()` as a fallback. +- **XPU platform:** dispatch to `forward_xpu()`. 
+- **TPU platform:** dispatch to `forward_tpu()`. +- **OOT platform:** dispatch to `forward_oot()`. This will only be called on OOT platforms. +- **Default:** dispatch to `forward_native()` as a final fallback for all platforms. + +!!! note + Note that the dispatching logic might not be absolute because of class inheritance. A derived class might override the behavior. + +Furthermore, vLLM decides whether to enable or disable a `CustomOp` based on `compilation_config.custom_ops`. To be specific, if a `CustomOp` is not listed in `compilation_config.custom_ops` (i.e., uses the default config), it will be enabled if `compilation_config.custom_ops` contains `all`, or will be disabled if it contains `none`. + +!!! note + Note that `all` and `none` cannot coexist in `compilation_config.custom_ops`. + +By default, if `compilation_config.backend == "inductor"` and `compilation_config.mode != CompilationMode.NONE`, a `none` will be appended into `compilation_config.custom_ops`, otherwise an `all` will be appended. In other words, `CustomOp` will be disabled on some platforms (i.e., those that use `inductor` as the default backend for `torch.compile`) when running with torch compile mode. In this case, Inductor generates (fused) Triton kernels for those disabled custom ops. + +!!! note + For multi-modal models, vLLM force-enables some custom ops, such as `MMEncoderAttention` and `ApplyRotaryEmb`, to use device-specific deep-optimized kernels for better performance in the ViT part. We can also pass an `enforce_enable=True` param to the `__init__()` method of the `CustomOp` to force-enable it at the object level. + + Note that this `enforce_enable` mechanism will be removed after we add a separate `compilation_config` for the multi-modal part. 
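The enable/disable rule described above can be sketched in a few lines of plain Python. This is only an illustration of the documented behaviour, not vLLM's actual code: the function names `apply_default` and `is_enabled` and the boolean `compile_mode_is_none` are hypothetical stand-ins. It also assumes the default `all`/`none` entry is appended only when neither is already present, and that an explicit `+op_name` or `-op_name` entry takes precedence over `all`/`none`.

```python
def apply_default(custom_ops: list[str], backend: str,
                  compile_mode_is_none: bool) -> list[str]:
    """Return a copy of custom_ops with the default 'all'/'none' appended.

    Hypothetical helper: mirrors the documented rule, not vLLM's real API.
    """
    ops = list(custom_ops)
    if "all" in ops and "none" in ops:
        raise ValueError("'all' and 'none' cannot coexist in custom_ops")
    # Assumption: the default is only added when the user gave neither.
    if "all" not in ops and "none" not in ops:
        if backend == "inductor" and not compile_mode_is_none:
            ops.append("none")  # Inductor generates kernels for disabled ops
        else:
            ops.append("all")
    return ops


def is_enabled(op_name: str, custom_ops: list[str]) -> bool:
    """Decide whether a single op is enabled under a resolved custom_ops list."""
    if f"+{op_name}" in custom_ops:  # explicitly enabled
        return True
    if f"-{op_name}" in custom_ops:  # explicitly disabled
        return False
    return "all" in custom_ops       # otherwise follow the global default


ops = apply_default(["+rms_norm"], backend="inductor", compile_mode_is_none=False)
print(ops, is_enabled("rms_norm", ops), is_enabled("silu_and_mul", ops))
```

Under these assumptions, an op that is explicitly enabled stays on its device-specific path even when the Inductor default turns every other custom op off.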
+ +## How to Customise Your Configuration for CustomOp + +vLLM also offers fine-grained control over which custom ops to enable or disable for users, by manually passing a `--compilation_config.custom_ops '["..."]'` when launching a server. + +For example: + +- Use `--compilation_config.custom_ops '["all"]'` to enable all custom ops. +- Use `--compilation_config.custom_ops '["none"]'` to disable all custom ops. +- Use `--compilation_config.custom_ops '["all,-op1"]'` to enable all custom ops except op1 (i.e., prefixed with a `-` means "disable"). +- Use `--compilation_config.custom_ops '["none,+op1,+op2"]'` to only enable op1 and op2 (i.e., prefixed with a `+` means "enable"). + +## Types of Supported CustomOp in vLLM + +**1. Attention:** + +```python +--8<-- "vllm/model_executor/layers/attention/mm_encoder_attention.py:mm_encoder_attn" + +--8<-- "vllm/model_executor/layers/mla.py:multi_head_latent_attention" +``` + +**2. Activation:** + +```python +--8<-- "vllm/model_executor/layers/activation.py:silu_and_mul" + +--8<-- "vllm/model_executor/layers/activation.py:mul_and_silu" + +--8<-- "vllm/model_executor/layers/activation.py:gelu_new" + +--8<-- "vllm/model_executor/layers/activation.py:gelu_fast" + +--8<-- "vllm/model_executor/layers/activation.py:quick_gelu" + +--8<-- "vllm/model_executor/layers/activation.py:gelu_and_mul" + +--8<-- "vllm/model_executor/layers/activation.py:gelu_and_mul_sparse" + +--8<-- "vllm/model_executor/layers/activation.py:relu2" + +--8<-- "vllm/model_executor/layers/activation.py:xielu" + +--8<-- "vllm/model_executor/layers/activation.py:swigluoai_and_mul" + +--8<-- "vllm/model_executor/layers/activation.py:fatrelu_and_mul" +``` + +**3. MM-Conv:** + +```python +--8<-- "vllm/model_executor/layers/conv.py:conv2d" + +--8<-- "vllm/model_executor/layers/conv.py:conv3d" +``` + +**4. 
Embedding:** + +```python +--8<-- "vllm/model_executor/layers/vocab_parallel_embedding.py:vocab_parallel_embedding" + +--8<-- "vllm/model_executor/layers/vocab_parallel_embedding.py:parallel_lm_head" +``` + +**5. Linear:** + +```python +--8<-- "vllm/model_executor/layers/linear.py:row_parallel_linear" + +--8<-- "vllm/model_executor/layers/linear.py:column_parallel_linear" + +--8<-- "vllm/model_executor/layers/linear.py:replicated_linear" +``` + +**6. Logits Processor:** + +```python +--8<-- "vllm/model_executor/layers/logits_processor.py:logits_processor" +``` + +**7. Mamba:** + +```python +--8<-- "vllm/model_executor/layers/mamba/mamba_mixer.py:mamba_mixer" + +--8<-- "vllm/model_executor/layers/mamba/mamba_mixer2.py:mamba_mixer2" + +--8<-- "vllm/model_executor/layers/mamba/mamba_mixer2.py:mixer2_gated_rms_norm" + +--8<-- "vllm/model_executor/models/plamo2.py:plamo2_mamba_mixer" + +--8<-- "vllm/model_executor/layers/mamba/short_conv.py:short_conv" +``` + +**8. MoE:** + +```python +--8<-- "vllm/model_executor/layers/fused_moe/layer.py:fused_moe" + +--8<-- "vllm/model_executor/layers/fused_moe/fused_moe_modular_method.py:modular_fused_moe" + +--8<-- "vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py:unquantized_fused_moe" + +--8<-- "vllm/model_executor/models/transformers/moe.py:transformers_fused_moe" + +--8<-- "vllm/model_executor/layers/fused_moe/fused_moe.py:grouped_topk" +``` + +**9. Norm:** + +```python +--8<-- "vllm/model_executor/layers/layernorm.py:rms_norm" + +--8<-- "vllm/model_executor/layers/layernorm.py:rms_norm_gated" + +--8<-- "vllm/model_executor/layers/layernorm.py:gemma_rms_norm" +``` + +**10. Quantization:** + +```python +--8<-- "vllm/model_executor/layers/quantization/input_quant_fp8.py:quant_fp8" +``` + +**11. 
Rope:** + +```python +--8<-- "vllm/model_executor/layers/rotary_embedding/base.py:rotary_embedding" + +--8<-- "vllm/model_executor/layers/rotary_embedding/dual_chunk_rope.py:dual_chunk_rotary_embedding" + +--8<-- "vllm/model_executor/layers/rotary_embedding/common.py:apply_rotary_emb" +``` + +## Guidelines for Implementing a New CustomOp + +### Implement a New CustomOp in vLLM + +This part is a tutorial on how to implement a new `CustomOp` in vLLM. + +Steps: + +1. Implement a new op class that extends the `CustomOp` base class. +2. Add the `@CustomOp.register("op_name")` decorator on this op class to register it into the `CustomOp` system. +3. Implement the `forward_xxx()` methods you need. + +Taking `MMEncoderAttention` as an example: + +??? code + + ```python + @CustomOp.register("mm_encoder_attn") + class MMEncoderAttention(CustomOp): + + def __init__( + self, + num_heads: int, + head_size: int, + scale: float | None = None, + num_kv_heads: int | None = None, + prefix: str = "", + multimodal_config: MultiModalConfig | None = None, + ) -> None: + super().__init__() + # Init... + + def forward_native( + self, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + cu_seqlens: torch.Tensor | None = None, + max_seqlen: torch.Tensor | None = None, # Only used for Flash Attention + ) -> torch.Tensor: + # Call TORCH_SDPA implementation... + + def forward_cuda( + self, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + cu_seqlens: torch.Tensor | None = None, + max_seqlen: torch.Tensor | None = None, # Only used for Flash Attention + ) -> torch.Tensor: + # Call FA or TORCH_SDPA implementation... + + def forward_cpu( + self, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + cu_seqlens: torch.Tensor | None = None, + max_seqlen: torch.Tensor | None = None, # Only used for Flash Attention + ) -> torch.Tensor: + # Call TORCH_SDPA implementation... 
+ + def forward_xpu( + self, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + cu_seqlens: torch.Tensor | None = None, + max_seqlen: torch.Tensor | None = None, # Only used for Flash Attention + ) -> torch.Tensor: + # Call FA implementation... + + def forward_tpu( + self, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + cu_seqlens: torch.Tensor | None = None, + max_seqlen: torch.Tensor | None = None, # Only used for Flash Attention + ) -> torch.Tensor: + # Call PALLAS implementation... + ``` + +### Register a New CustomOp in OOT Device Plugins + +Currently, thanks to [vLLM's hardware-plugin mechanism](./plugin_system.md), various OOT device plugins are emerging that enable vLLM to run seamlessly on different hardware. You can also find more details about this mechanism at [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html). + +- **Official device plugins:** [vllm-ascend](https://github.com/vllm-project/vllm-ascend) (for Huawei Ascend NPU), [vllm-spyre](https://github.com/vllm-project/vllm-spyre) +(for Spyre), [vllm-gaudi](https://github.com/vllm-project/vllm-gaudi) (for Intel Gaudi), [vllm-neuron](https://github.com/vllm-project/vllm-neuron) (for AWS Neuron), [vllm-metal](https://github.com/vllm-project/vllm-metal) (for Apple Silicon), etc. +- **Non-official device plugins:** [vllm-metax](https://github.com/MetaX-MACA/vLLM-metax) (for MetaX GPU), [vllm-kunlun](https://github.com/baidu/vLLM-Kunlun) (for Baidu Kunlun XPU), etc. + +In this case, `CustomOp` enables these hardware manufacturers to seamlessly replace vLLM's operations with their deep-optimized kernels for specific devices at runtime, by just registering an OOT `CustomOp` and implementing the `forward_oot()` method. + +This part shows how to register an OOT `CustomOp` for a device plugin. + +Taking `MMEncoderAttention` as an example: + +1. 
Implement a `CustomMMEncoderAttention` class that extends `MMEncoderAttention` and implement its `forward_oot()` method. +2. Register your `CustomMMEncoderAttention` into vLLM to replace `MMEncoderAttention`. + +??? code + + ```python + from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention + from vllm.model_executor.custom_op import CustomOp + + + @CustomOp.register_oot("MMEncoderAttention") + class CustomMMEncoderAttention(MMEncoderAttention): + + def __init__(...): + super().__init__(...) + + def forward_oot(...): + # Call optimized device-specific kernels. + ... + ``` + +In this case, a new item `{"MMEncoderAttention": CustomMMEncoderAttention}` will be added into `op_registry_oot`. When initializing an `MMEncoderAttention` op object, if the class name (i.e., `MMEncoderAttention`) is contained in the keys of `op_registry_oot`, vLLM will replace it with our registered class (i.e., `CustomMMEncoderAttention`) and instantiate it. + +After that, when this `MMEncoderAttention` op is called, your `forward_oot()` will be called if it is enabled. Thus, you get the expected performance on your hardware without directly modifying vLLM. + +In addition, you can also register all your `CustomOp` classes in one place for better management. + +??? 
code + + ```python + from vllm.model_executor.custom_op import CustomOp + + + REGISTERED_CUSTOM_OPS = { + "CustomOP1": YourCustomOp1, + "CustomOP2": YourCustomOp2, + "CustomOP3": YourCustomOp3, + } + + for op_name, op_cls in REGISTERED_CUSTOM_OPS.items(): + CustomOp.register_oot(_decorated_op_cls=op_cls, name=op_name) + ``` diff --git a/docs/design/debug_vllm_compile.md b/docs/design/debug_vllm_compile.md index 731e542a0307bdaa86cc5f9c4d8e2440a1137cad..328df581627a21d8df5c7e03f6b97528f8d1a29c 100644 --- a/docs/design/debug_vllm_compile.md +++ b/docs/design/debug_vllm_compile.md @@ -33,7 +33,7 @@ goals while minimizing impact to performance and also helps us (vLLM) when you o For more details on the design, please see the following resources: - [Introduction to vLLM-torch.compile blogpost](https://blog.vllm.ai/2025/08/20/torch-compile.html) -- [vLLM-torch.compile integration design](https://docs.vllm.ai/en/latest/design/torch_compile.html) +- [vLLM-torch.compile integration design](./torch_compile.md) - [vLLM Office Hours #26](https://www.youtube.com/live/xLyxc7hxCJc?si=Xulo9pe53C6ywf0V&t=561) - [Talk at PyTorch Conference 2025](https://youtu.be/1wV1ESbGrVQ?si=s1GqymUfwiwOrDTg&t=725) diff --git a/docs/design/fused_moe_modular_kernel.md b/docs/design/fused_moe_modular_kernel.md index e1a96be6c3445b7d517faca53daf19b0bb8231f8..975df8ba29dc41fffb54d4ce40347d3c734f28b1 100644 --- a/docs/design/fused_moe_modular_kernel.md +++ b/docs/design/fused_moe_modular_kernel.md @@ -2,7 +2,7 @@ ## Introduction -FusedMoEModularKernel is implemented [here](../..//vllm/model_executor/layers/fused_moe/modular_kernel.py) +FusedMoEModularKernel is implemented [here](../../vllm/model_executor/layers/fused_moe/modular_kernel.py) Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types. 
diff --git a/docs/design/logits_processors.md b/docs/design/logits_processors.md index 8eadeb386fcf25c7622aa7ff15ca8a08d7052d15..af1d7b6bbb45d075b3239fddc4cb37fe9844b6d6 100644 --- a/docs/design/logits_processors.md +++ b/docs/design/logits_processors.md @@ -138,7 +138,7 @@ Note that the sampler will access the logits processors via `SamplingMetadata.lo # ...return sampler output data structure... - def sample(self, logits, sampling_metadta) + def sample(self, logits, sampling_metadata) ... diff --git a/docs/design/moe_kernel_features.md b/docs/design/moe_kernel_features.md index 48341d199cb804afa9677ff61801b2aaf0aba033..18216b5965af292f5f7e3a74a77f79476547adf4 100644 --- a/docs/design/moe_kernel_features.md +++ b/docs/design/moe_kernel_features.md @@ -16,7 +16,7 @@ Async backends support the use of DBO (Dual Batch Overlap) and shared expert ove Certain models require the topk weights to be applied to the input activations rather than the output activations when topk==1, e.g. Llama. For modular kernels, this feature is supported by the `FusedMoEPrepareAndFinalize` subclass. For non-modular kernels, it is up to the experts function to deal with this flag. -Unless otherwise specified, backends are controlled via `VLLM_ALL2ALL_BACKEND`. All backends except `flashinfer` only work with EP+DP or EP+TP. `Flashinfer` can work with EP or DP without EP. +Unless otherwise specified, backends are controlled via the `--all2all-backend` command-line argument (or the `all2all_backend` parameter in `ParallelConfig`). All backends except `flashinfer` only work with EP+DP or EP+TP. `Flashinfer` can work with EP or DP without EP.