Unverified commit 0b98ba15, authored by Woosuk Kwon, committed by GitHub

Change the name to vLLM (#150)

parent e5464ee4
# Contributing to vLLM

Thank you for your interest in contributing to vLLM!
Our community is open to everyone and welcomes all kinds of contributions, no matter how small or large.
There are several ways you can contribute to the project:

@@ -11,9 +11,9 @@ There are several ways you can contribute to the project:

However, remember that contributions aren't just about code.
We believe in the power of community support; thus, answering queries, assisting others, and enhancing the documentation are highly regarded and beneficial contributions.

Finally, one of the most impactful ways to support us is by raising awareness about vLLM.
Talk about it in your blog posts, highlighting how it's driving your incredible projects.
Express your support on Twitter if vLLM aids you, or simply offer your appreciation by starring our repository.

## Setup for development

@@ -70,5 +70,5 @@ If a comment isn't clear or you disagree with a suggestion, feel free to ask for

### Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM.
Your contributions make vLLM a great tool for everyone!
# vLLM

## Build from source

@@ -28,7 +28,7 @@ python examples/simple_server.py --help

To start the server:
```bash
ray start --head
python -m vllm.entrypoints.fastapi_server # --model <your_model>
```
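For reference, a client request against the running server's `/generate` endpoint might look like the sketch below. The payload fields (`prompt`, `n`, `max_tokens`) mirror the benchmark clients included in this commit; the port follows the benchmark defaults and the response schema is an assumption.

```python
# Minimal sketch of querying the server started above. Endpoint path and
# payload fields follow the benchmark scripts in this commit; the port
# and the response format are assumptions and may need adjusting.
import requests

api_url = "http://localhost:8001/generate"
headers = {"User-Agent": "vLLM Benchmark Client"}
pload = {
    "prompt": "Hello, my name is",
    "n": 1,
    "max_tokens": 16,
}
response = requests.post(api_url, headers=headers, json=pload)
print(response.json())
```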
To test the server:

@@ -45,9 +45,9 @@ pip install gradio

Start the server:
```bash
python -m vllm.http_frontend.fastapi_frontend
# At another terminal
python -m vllm.http_frontend.gradio_webserver
```

## Load LLaMA weights

@@ -62,5 +62,5 @@ Since LLaMA weight is not fully public, we cannot directly download the LLaMA we

2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
```bash
python simple_server.py --model /output/path/llama-7b
python -m vllm.http_frontend.fastapi_frontend --model /output/path/llama-7b
```

# Benchmarking vLLM

## Downloading the ShareGPT dataset

...
@@ -11,7 +11,7 @@ def main(args: argparse.Namespace):
               for i in range(args.n_threads)]

    api_url = f"http://{args.host}:{args.port}/generate"
    headers = {"User-Agent": "vLLM Benchmark Client"}
    ploads = [{
        "prompt": p,
        "max_tokens": args.max_tokens,

...
@@ -6,7 +6,7 @@ import numpy as np
import torch
from tqdm import tqdm

from vllm import LLM, SamplingParams


def main(args: argparse.Namespace):

...
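For context, the latency benchmark above drives the offline `LLM`/`SamplingParams` API directly. A minimal, self-contained usage sketch follows; the model name matches the benchmark default (`facebook/opt-125m`), and the sampling settings are illustrative assumptions.

```python
# Offline inference sketch using the API imported by the latency benchmark.
# Model name follows the benchmark default; sampling values are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```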
"""Benchmark online serving throughput. """Benchmark online serving throughput.
On the server side, run one of the following commands: On the server side, run one of the following commands:
(CacheFlow backend) (vLLM backend)
python -m cacheflow.entrypoints.api_server \ python -m vllm.entrypoints.api_server \
--disable-log-requests --model <your_model> --disable-log-requests --model <your_model>
(TGI backend) (TGI backend)
...@@ -114,7 +114,7 @@ async def send_request( ...@@ -114,7 +114,7 @@ async def send_request(
request_start_time = time.time() request_start_time = time.time()
headers = {"User-Agent": "Benchmark Client"} headers = {"User-Agent": "Benchmark Client"}
if backend == "cacheflow": if backend == "vllm":
pload = { pload = {
"prompt": prompt, "prompt": prompt,
"n": 1, "n": 1,
...@@ -213,8 +213,8 @@ def main(args: argparse.Namespace): ...@@ -213,8 +213,8 @@ def main(args: argparse.Namespace):
if __name__ == "__main__": if __name__ == "__main__":
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Benchmark the online serving throughput.") description="Benchmark the online serving throughput.")
parser.add_argument("--backend", type=str, default="cacheflow", parser.add_argument("--backend", type=str, default="vllm",
choices=["cacheflow", "tgi"]) choices=["vllm", "tgi"])
parser.add_argument("--host", type=str, default="localhost") parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=8001) parser.add_argument("--port", type=int, default=8001)
parser.add_argument("--dataset", type=str, required=True, parser.add_argument("--dataset", type=str, required=True,
......
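The serving benchmark's `send_request()` issues asynchronous HTTP requests against the chosen backend. A stripped-down sketch of a single request to the vLLM backend follows; the payload fields beyond `prompt` and `n` are borrowed from the other benchmark client in this commit and are otherwise assumptions.

```python
# Single asynchronous request to the api_server's /generate endpoint,
# loosely mirroring send_request() above. Host/port follow the benchmark's
# argparse defaults; the extra payload fields are assumptions.
import asyncio

import aiohttp


async def query(api_url: str, prompt: str) -> dict:
    headers = {"User-Agent": "Benchmark Client"}
    pload = {"prompt": prompt, "n": 1, "max_tokens": 16}
    async with aiohttp.ClientSession() as session:
        async with session.post(api_url, headers=headers, json=pload) as resp:
            return await resp.json()


if __name__ == "__main__":
    result = asyncio.run(query("http://localhost:8001/generate", "Hello"))
    print(result)
```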
@@ -5,12 +5,13 @@ import random
import time
from typing import List, Tuple

import torch
from transformers import (AutoConfig, AutoTokenizer, AutoModelForCausalLM,
                          PreTrainedTokenizerBase)
from tqdm import tqdm

from vllm import LLM, SamplingParams


def get_tokenizer(model_name: str) -> PreTrainedTokenizerBase:
    config = AutoConfig.from_pretrained(model_name)

@@ -70,7 +71,7 @@ def sample_requests(
    return sampled_requests


def run_vllm(
    requests: List[Tuple[str, int, int]],
    model: str,
    tensor_parallel_size: int,

@@ -172,8 +173,8 @@ def main(args: argparse.Namespace):
    tokenizer = get_tokenizer(args.model)
    requests = sample_requests(args.dataset, args.num_prompts, tokenizer)

    if args.backend == "vllm":
        elapsed_time = run_vllm(
            requests, args.model, args.tensor_parallel_size, args.seed, args.n,
            args.use_beam_search)
    elif args.backend == "hf":

@@ -192,8 +193,8 @@ def main(args: argparse.Namespace):

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Benchmark the throughput.")
    parser.add_argument("--backend", type=str, choices=["vllm", "hf"],
                        default="vllm")
    parser.add_argument("--dataset", type=str, required=True,
                        help="Path to the dataset.")
    parser.add_argument("--model", type=str, default="facebook/opt-125m")

@@ -207,7 +208,7 @@ if __name__ == "__main__":
    parser.add_argument("--hf-max-batch-size", type=int, default=None,
                        help="Maximum batch size for HF backend.")
    args = parser.parse_args()
    if args.backend == "vllm":
        if args.hf_max_batch_size is not None:
            raise ValueError("HF max batch size is only for HF backend.")
    elif args.backend == "hf":

...
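A condensed sketch of the measurement the "vllm" branch performs: build one `LLM` engine, generate for all sampled prompts, and report the elapsed time. The real `run_vllm()` honors per-request output lengths and the beam-search path; the single shared `SamplingParams` below is an admitted simplification.

```python
# Simplified throughput measurement for the vLLM backend; not the script's
# exact behavior (per-request output lengths and beam search are omitted).
import time
from typing import List, Tuple

from vllm import LLM, SamplingParams


def run_vllm_sketch(requests: List[Tuple[str, int, int]], model: str,
                    tensor_parallel_size: int) -> float:
    llm = LLM(model=model, tensor_parallel_size=tensor_parallel_size)
    prompts = [prompt for prompt, _prompt_len, _output_len in requests]
    sampling_params = SamplingParams(temperature=1.0, max_tokens=128)
    start = time.time()
    llm.generate(prompts, sampling_params)
    return time.time() - start
```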
from cacheflow.model_executor.input_metadata import InputMetadata
from cacheflow.model_executor.model_loader import get_model
from cacheflow.model_executor.utils import set_random_seed

__all__ = [
    "InputMetadata",
    "get_model",
    "set_random_seed",
]
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

namespace vllm {

template<typename T>
__device__ __forceinline__ T silu(const T& x) {

@@ -22,7 +22,7 @@ __global__ void silu_and_mul_kernel(
  }
}

} // namespace vllm

void silu_and_mul(
  torch::Tensor& out,    // [num_tokens, d]

@@ -40,7 +40,7 @@ void silu_and_mul(
    input.scalar_type(),
    "silu_and_mul_kernel",
    [&] {
      vllm::silu_and_mul_kernel<scalar_t><<<grid, block, 0, stream>>>(
        out.data_ptr<scalar_t>(),
        input.data_ptr<scalar_t>(),
        d);

...
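As a readability aid, here is a PyTorch reference for what `silu_and_mul` computes, assuming the input is laid out as `[num_tokens, 2 * d]` with the gated half first (the output shape matches the `[num_tokens, d]` comment in the wrapper above).

```python
# Reference for silu_and_mul: SiLU-gate the first half of the last
# dimension and multiply it by the second half. The half ordering is an
# assumption based on the wrapper's shape comments.
import torch
import torch.nn.functional as F


def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]
```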
/*
 * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
 * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");

@@ -19,7 +19,7 @@
#include <stdint.h>

namespace vllm {

// A vector type to store Q, K, V elements.
template<typename T, int VEC_SIZE>

@@ -61,4 +61,4 @@ inline __device__ void zero(T& dst) {
  dst = tmp.raw;
}

} // namespace vllm
/*
 * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
 * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");

@@ -27,7 +27,7 @@
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

namespace vllm {

// Utility function for attention softmax.
template<int NUM_WARPS>

@@ -315,10 +315,10 @@ __global__ void single_query_cached_kv_attention_kernel(
  }
}

} // namespace vllm

#define LAUNCH_ATTENTION_KERNEL(T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS) \
  vllm::single_query_cached_kv_attention_kernel<T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS> \
  <<<grid, block, shared_mem_size, stream>>>( \
    out_ptr, \
    query_ptr, \

...
/*
 * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
 * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");

@@ -22,7 +22,7 @@
#include <float.h>
#include <type_traits>

namespace vllm {

// Q*K^T operation.
template<int THREAD_GROUP_SIZE, typename Vec, int N>

@@ -52,4 +52,4 @@ struct Qk_dot {
  }
};

} // namespace vllm
/*
 * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
 * and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
 * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");

@@ -25,7 +25,7 @@
#include <cuda_fp16.h>
#include <stdint.h>

namespace vllm {

// Define custom BF16 vector data types.
struct bf16_4_t {

@@ -420,4 +420,4 @@ inline __device__ void from_float(bf16_8_t& dst, Float8_ src) {
#endif
}

} // namespace vllm
/*
 * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
 * and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
 * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");

@@ -23,7 +23,7 @@
#include <stdint.h>

namespace vllm {

// FP16 vector types for Q, K, V.
template<>

@@ -441,4 +441,4 @@ inline __device__ Float8_ to_float(uint4 u) {
  return tmp;
}

} // namespace vllm
/*
 * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
 * and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
 * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");

@@ -22,7 +22,7 @@
#include <stdint.h>

namespace vllm {

// Define custom FP32 vector data types.
struct Float4_ {

@@ -265,4 +265,4 @@ inline __device__ Float8_ to_float(Float8_ u) {
  return u;
}

} // namespace vllm
@@ -46,7 +46,7 @@ void swap_blocks(
  }
}

namespace vllm {

// Grid: (num_layers, num_pairs)
template<typename scalar_t>

@@ -77,7 +77,7 @@ __global__ void copy_blocks_kernel(
  }
}

} // namespace vllm

void copy_blocks(
  std::vector<torch::Tensor>& key_caches,

@@ -129,7 +129,7 @@ void copy_blocks(
    at::ScalarType::Half,
    at::ScalarType::BFloat16,
    key_caches[0].scalar_type(), "copy_blocks_kernel", ([&] {
      vllm::copy_blocks_kernel<scalar_t><<<grid, block, 0, stream>>>(
        key_cache_ptrs_tensor.data_ptr<int64_t>(),
        value_cache_ptrs_tensor.data_ptr<int64_t>(),
        block_mapping_tensor.data_ptr<int>(),

@@ -137,7 +137,7 @@ void copy_blocks(
    }));
}

namespace vllm {

template<typename scalar_t>
__global__ void reshape_and_cache_kernel(

@@ -181,7 +181,7 @@ __global__ void reshape_and_cache_kernel(
  }
}

} // namespace vllm

void reshape_and_cache(
  torch::Tensor& key,     // [num_tokens, num_heads, head_size]

@@ -208,7 +208,7 @@ void reshape_and_cache(
    key.scalar_type(),
    "reshape_and_cache_kernel",
    [&] {
      vllm::reshape_and_cache_kernel<scalar_t><<<grid, block, 0, stream>>>(
        key.data_ptr<scalar_t>(),
        value.data_ptr<scalar_t>(),
        key_cache.data_ptr<scalar_t>(),

@@ -223,7 +223,7 @@ void reshape_and_cache(
    });
}

namespace vllm {

// Grid: (num_blocks, block_size).
template<typename scalar_t>

@@ -343,7 +343,7 @@ __global__ void gather_cached_kv_kernel_optimized(
  }
}

} // namespace vllm

void gather_cached_kv(
  torch::Tensor& key,     // [out] [num_tokens, num_heads, head_size]

@@ -370,7 +370,7 @@ void gather_cached_kv(
    key.scalar_type(),
    "gather_cached_kv_kernel_optimized",
    [&] {
      vllm::gather_cached_kv_kernel_optimized<scalar_t><<<grid, block, 0, stream>>>(
        key.data_ptr<scalar_t>(),
        value.data_ptr<scalar_t>(),
        key_cache.data_ptr<scalar_t>(),

...
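A simplified Python-side sketch of what `copy_blocks()` above does may help: for each (source, destination) block pair, the corresponding block is copied within every layer's key and value cache. The grid comment `(num_layers, num_pairs)` suggests this pairing; the exact mapping format passed from Python is an assumption.

```python
# Simplified reference for copy_blocks(): copy each (src, dst) block pair
# in every layer's key/value cache. The real kernel fuses this across
# layers on the GPU; the block-pair list format here is an assumption.
from typing import List, Tuple

import torch


def copy_blocks_ref(key_caches: List[torch.Tensor],
                    value_caches: List[torch.Tensor],
                    block_pairs: List[Tuple[int, int]]) -> None:
    for key_cache, value_cache in zip(key_caches, value_caches):
        for src, dst in block_pairs:
            key_cache[dst].copy_(key_cache[src])
            value_cache[dst].copy_(value_cache[src])
```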
@@ -3,7 +3,7 @@

#include "reduction_utils.cuh"

namespace vllm {

// TODO(woosuk): Further optimize this kernel.
template<typename scalar_t>

@@ -33,7 +33,7 @@ __global__ void rms_norm_kernel(
  }
}

} // namespace vllm

void rms_norm(
  torch::Tensor& out,     // [num_tokens, hidden_size]

@@ -52,7 +52,7 @@ void rms_norm(
    input.scalar_type(),
    "rms_norm_kernel",
    [&] {
      vllm::rms_norm_kernel<scalar_t><<<grid, block, 0, stream>>>(
        out.data_ptr<scalar_t>(),
        input.data_ptr<scalar_t>(),
        weight.data_ptr<scalar_t>(),

...
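For reference, the `rms_norm` kernel above follows the usual RMSNorm definition. A PyTorch sketch is below; the epsilon default is an illustrative assumption, since the wrapper's epsilon argument is elided in this hunk.

```python
# RMSNorm reference matching the rms_norm wrapper's shape comments:
# x is [num_tokens, hidden_size], weight is [hidden_size].
# The epsilon default is an assumption.
import torch


def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor,
                 eps: float = 1e-6) -> torch.Tensor:
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight
```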
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

namespace vllm {

template<typename scalar_t>
__global__ void rotary_embedding_neox_kernel(

@@ -46,7 +46,7 @@ __global__ void rotary_embedding_neox_kernel(
  }
}

} // namespace vllm

void rotary_embedding_neox(
  torch::Tensor& positions,  // [num_tokens]

@@ -70,7 +70,7 @@ void rotary_embedding_neox(
    query.scalar_type(),
    "rotary_embedding_neox",
    [&] {
      vllm::rotary_embedding_neox_kernel<scalar_t><<<grid, block, 0, stream>>>(
        positions.data_ptr<int64_t>(),
        query.data_ptr<scalar_t>(),
        key.data_ptr<scalar_t>(),

...
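The `rotary_embedding_neox` kernel applies GPT-NeoX-style ("rotate-half") rotary position embeddings to the query and key tensors. A PyTorch sketch of that transform follows; the cos/sin tensors here are a simplification of the position-indexed values the kernel reads, and their layout is an assumption.

```python
# Rotate-half rotary embedding reference. Shapes assumed:
#   x:   [num_tokens, num_heads, head_size]
#   cos: [num_tokens, head_size // 2]
#   sin: [num_tokens, head_size // 2]
import torch


def rotary_neox_ref(x: torch.Tensor, cos: torch.Tensor,
                    sin: torch.Tensor) -> torch.Tensor:
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    cos, sin = cos[:, None, :], sin[:, None, :]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
```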
/*
 * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/reduce_kernel_utils.cuh
 * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");

@@ -17,7 +17,7 @@
 */
#pragma once

namespace vllm {

template<typename T>
__inline__ __device__ T warpReduceSum(T val) {

@@ -48,4 +48,4 @@ __inline__ __device__ T blockReduceSum(T val) {
  return val;
}

} // namespace vllm

# vLLM documents

## Build the docs

...