Commit 24eacbc0 authored by chenzk

v1.0
Welcome to vLLM!
================
.. figure:: ./assets/logos/vllm-logo-text-light.png
    :width: 60%
    :align: center
    :alt: vLLM
    :class: no-scaled-link

.. raw:: html

    <p style="text-align:center">
    <strong>Easy, fast, and cheap LLM serving for everyone
    </strong>
    </p>

    <p style="text-align:center">
    <script async defer src="https://buttons.github.io/buttons.js"></script>
    <a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
    <a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
    <a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
    </p>

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Optimized CUDA kernels

vLLM is flexible and easy to use with:

* Seamless integration with popular HuggingFace models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server

For more information, check out the following:

* `vLLM announcement blog post <https://vllm.ai>`_ (intro to PagedAttention)
* `vLLM paper <https://arxiv.org/abs/2309.06180>`_ (SOSP 2023)
* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency <https://www.anyscale.com/blog/continuous-batching-llm-inference>`_ by Cade Daniel et al.

Documentation
-------------
.. toctree::
    :maxdepth: 1
    :caption: Getting Started

    getting_started/installation
    getting_started/quickstart

.. toctree::
    :maxdepth: 1
    :caption: Serving

    serving/distributed_serving
    serving/run_on_sky
    serving/deploying_with_triton
    serving/deploying_with_docker

.. toctree::
    :maxdepth: 1
    :caption: Models

    models/supported_models
    models/adding_model

.. toctree::
    :maxdepth: 1
    :caption: Quantization

    quantization/auto_awq
.. _adding_a_new_model:

Adding a New Model
==================

This document provides a high-level guide on integrating a `HuggingFace Transformers <https://github.com/huggingface/transformers>`_ model into vLLM.
.. note::
    The complexity of adding a new model depends heavily on the model's architecture.
    The process is considerably easier if the model shares a similar architecture with an existing model in vLLM.
    However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.

.. tip::
    If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ repository.
    We will be happy to help you out!

0. Fork the vLLM repository
--------------------------------

Start by forking our `GitHub <https://github.com/vllm-project/vllm/>`_ repository and then :ref:`build it from source <build_from_source>`.
This gives you the ability to modify the codebase and test your model.
1. Bring your model code
------------------------
Clone the PyTorch model code from the HuggingFace Transformers repository and put it into the `vllm/model_executor/models <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models>`_ directory.
For instance, vLLM's `OPT model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/opt.py>`_ was adapted from HuggingFace's `modeling_opt.py <https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py>`_ file.

.. warning::
    When copying the model code, make sure to review and adhere to the code's copyright and licensing terms.

2. Rewrite the :code:`forward` methods
--------------------------------------

Next, you need to rewrite the :code:`forward` methods of your model by following these steps:

1. Remove any unnecessary code, such as the code only used for training.
2. Change the input parameters:

   .. code-block:: diff

       def forward(
           self,
           input_ids: torch.Tensor,
      -    attention_mask: Optional[torch.Tensor] = None,
      -    position_ids: Optional[torch.LongTensor] = None,
      -    past_key_values: Optional[List[torch.FloatTensor]] = None,
      -    inputs_embeds: Optional[torch.FloatTensor] = None,
      -    labels: Optional[torch.LongTensor] = None,
      -    use_cache: Optional[bool] = None,
      -    output_attentions: Optional[bool] = None,
      -    output_hidden_states: Optional[bool] = None,
      -    return_dict: Optional[bool] = None,
      -) -> Union[Tuple, CausalLMOutputWithPast]:
      +    positions: torch.Tensor,
      +    kv_caches: List[KVCache],
      +    input_metadata: InputMetadata,
      +    cache_events: Optional[List[torch.cuda.Event]],
      +) -> SamplerOutput:

3. Update the code by considering that :code:`input_ids` and :code:`positions` are now flattened tensors.
4. Replace the attention operation with either :code:`PagedAttention`, :code:`PagedAttentionWithRoPE`, or :code:`PagedAttentionWithALiBi`, depending on the model's architecture (see the sketch after the note below).

.. note::
    Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
    If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
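
To make the target shape concrete, here is a rough sketch of what a rewritten attention block can look like. It is only an illustration: the class name :code:`MyAttention` is made up, and the import paths, constructor arguments, and call signature of :code:`PagedAttentionWithRoPE` may differ between vLLM versions, so copy the exact pattern from an existing model file such as :code:`llama.py` rather than from this snippet.

.. code-block:: python

    from typing import Optional, Tuple

    import torch
    from torch import nn

    # Import paths are assumptions and may differ in your vLLM version.
    from vllm.model_executor.input_metadata import InputMetadata
    from vllm.model_executor.layers.attention import PagedAttentionWithRoPE

    KVCache = Tuple[torch.Tensor, torch.Tensor]


    class MyAttention(nn.Module):
        """Illustrative decoder self-attention block rewritten for vLLM."""

        def __init__(self, hidden_size: int, num_heads: int):
            super().__init__()
            self.head_dim = hidden_size // num_heads
            # Plain projections for now; tensor parallelism is added in a later step.
            self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
            self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
            # PagedAttention variant with rotary positional embeddings.
            self.attn = PagedAttentionWithRoPE(num_heads,
                                               self.head_dim,
                                               self.head_dim**-0.5,
                                               rotary_dim=self.head_dim)

        def forward(
            self,
            positions: torch.Tensor,
            hidden_states: torch.Tensor,
            kv_cache: KVCache,
            input_metadata: InputMetadata,
            cache_event: Optional[torch.cuda.Event],
        ) -> torch.Tensor:
            qkv = self.qkv_proj(hidden_states)
            q, k, v = qkv.chunk(chunks=3, dim=-1)
            k_cache, v_cache = kv_cache
            # The paged attention op reads from and writes to the KV cache internally.
            attn_output = self.attn(positions, q, k, v, k_cache, v_cache,
                                    input_metadata, cache_event)
            return self.o_proj(attn_output)
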
3. (Optional) Implement tensor parallelism and quantization support
-------------------------------------------------------------------
If your model is too large to fit on a single GPU, you can use tensor parallelism to manage it.
To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
For the embedding layer, you can simply replace :code:`nn.Embedding` with :code:`VocabParallelEmbedding`. For the output LM head, you can use :code:`ParallelLMHead`.
When it comes to the linear layers, we provide the following options to parallelize them:
* :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
* :code:`RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
* :code:`ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
* :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple `ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
* :code:`QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When the number of key/value heads is less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
Note that all the linear layers above take `linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
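
As an illustration, the sketch below shows how a LLaMA-style gated MLP block can be built from these layers. Treat it as a sketch under assumptions: the import paths and constructor signatures of :code:`MergedColumnParallelLinear`, :code:`RowParallelLinear`, and :code:`SiluAndMul` shown here may differ across vLLM versions, so mirror an existing model file for the exact API.

.. code-block:: python

    from torch import nn

    # Import paths are assumptions and may differ in your vLLM version.
    from vllm.model_executor.layers.activation import SiluAndMul
    from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
                                                   RowParallelLinear)


    class MyMLP(nn.Module):
        """Illustrative gated-SiLU MLP using tensor-parallel linear layers."""

        def __init__(self, hidden_size: int, intermediate_size: int,
                     linear_method=None):
            super().__init__()
            # Gate and up projections merged into one column-parallel layer.
            self.gate_up_proj = MergedColumnParallelLinear(
                hidden_size, [intermediate_size] * 2,
                bias=False, linear_method=linear_method)
            # Row-parallel projection back to the hidden size (all-reduce inside).
            self.down_proj = RowParallelLinear(
                intermediate_size, hidden_size,
                bias=False, linear_method=linear_method)
            self.act_fn = SiluAndMul()

        def forward(self, x):
            # The parallel linear layers return (output, optional bias).
            gate_up, _ = self.gate_up_proj(x)
            x = self.act_fn(gate_up)
            x, _ = self.down_proj(x)
            return x
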
4. Implement the weight loading logic
-------------------------------------
You now need to implement the :code:`load_weights` method in your :code:`*ForCausalLM` class.
This method should load the weights from the HuggingFace checkpoint file and assign them to the corresponding layers in your model. Specifically, for `MergedColumnParallelLinear` and `QKVParallelLinear` layers, if the original model has separate weight matrices, you need to load the different parts separately.
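
The exact helpers differ between vLLM versions, but the overall pattern usually looks like the sketch below, adapted from the style of existing model files. The :code:`load_weights` signature, the helpers :code:`hf_model_weights_iterator` and :code:`default_weight_loader`, and the :code:`stacked_params_mapping` entries are illustrative assumptions; follow an existing model such as :code:`llama.py` for the precise code in your version.

.. code-block:: python

    # Helper names are assumptions; see vllm/model_executor/weight_utils.py.
    from vllm.model_executor.weight_utils import (default_weight_loader,
                                                  hf_model_weights_iterator)


    class MyModelForCausalLM(nn.Module):
        ...

        def load_weights(self, model_name_or_path, cache_dir=None,
                         load_format="auto", revision=None):
            # Map separated HF weights onto vLLM's merged layers:
            # (vLLM param name, HF weight name, shard id).
            stacked_params_mapping = [
                ("qkv_proj", "q_proj", "q"),
                ("qkv_proj", "k_proj", "k"),
                ("qkv_proj", "v_proj", "v"),
                ("gate_up_proj", "gate_proj", 0),
                ("gate_up_proj", "up_proj", 1),
            ]
            params_dict = dict(self.named_parameters())
            for name, loaded_weight in hf_model_weights_iterator(
                    model_name_or_path, cache_dir, load_format, revision):
                for (param_name, weight_name, shard_id) in stacked_params_mapping:
                    if weight_name not in name:
                        continue
                    # Load one shard of a merged (stacked) parameter.
                    param = params_dict[name.replace(weight_name, param_name)]
                    param.weight_loader(param, loaded_weight, shard_id)
                    break
                else:
                    # Regular parameter: use its weight loader if it has one.
                    param = params_dict[name]
                    weight_loader = getattr(param, "weight_loader",
                                            default_weight_loader)
                    weight_loader(param, loaded_weight)
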
5. Register your model
----------------------
Finally, include your :code:`*ForCausalLM` class in `vllm/model_executor/models/__init__.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/__init__.py>`_ and register it to the :code:`_MODEL_REGISTRY` in `vllm/model_executor/model_loader.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/model_loader.py>`_.
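
Concretely, for a hypothetical :code:`MyModelForCausalLM`, the registration amounts to something like the following (the module name :code:`my_model` is a placeholder, and the exact shape of :code:`_MODEL_REGISTRY` may differ between vLLM versions):

.. code-block:: python

    # In vllm/model_executor/models/__init__.py
    from vllm.model_executor.models.my_model import MyModelForCausalLM

    # In vllm/model_executor/model_loader.py: map the architecture name that
    # appears in the HuggingFace config.json to the vLLM implementation.
    _MODEL_REGISTRY = {
        # ... existing entries ...
        "MyModelForCausalLM": MyModelForCausalLM,
    }
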
.. _supported_models:

Supported Models
================
vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
The following is the list of model architectures that are currently supported by vLLM.
Alongside each architecture, we include some popular models that use it.
.. list-table::
    :widths: 25 25 50
    :header-rows: 1

    * - Architecture
      - Models
      - Example HuggingFace Models
    * - :code:`AquilaForCausalLM`
      - Aquila
      - :code:`BAAI/Aquila-7B`, :code:`BAAI/AquilaChat-7B`, etc.
    * - :code:`BaiChuanForCausalLM`
      - Baichuan
      - :code:`baichuan-inc/Baichuan-7B`, :code:`baichuan-inc/Baichuan-13B-Chat`, etc.
    * - :code:`ChatGLMModel`
      - ChatGLM
      - :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc.
    * - :code:`BloomForCausalLM`
      - BLOOM, BLOOMZ, BLOOMChat
      - :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc.
    * - :code:`FalconForCausalLM`
      - Falcon
      - :code:`tiiuae/falcon-7b`, :code:`tiiuae/falcon-40b`, :code:`tiiuae/falcon-rw-7b`, etc.
    * - :code:`GPT2LMHeadModel`
      - GPT-2
      - :code:`gpt2`, :code:`gpt2-xl`, etc.
    * - :code:`GPTBigCodeForCausalLM`
      - StarCoder, SantaCoder, WizardCoder
      - :code:`bigcode/starcoder`, :code:`bigcode/gpt_bigcode-santacoder`, :code:`WizardLM/WizardCoder-15B-V1.0`, etc.
    * - :code:`GPTJForCausalLM`
      - GPT-J
      - :code:`EleutherAI/gpt-j-6b`, :code:`nomic-ai/gpt4all-j`, etc.
    * - :code:`GPTNeoXForCausalLM`
      - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
      - :code:`EleutherAI/gpt-neox-20b`, :code:`EleutherAI/pythia-12b`, :code:`OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, :code:`databricks/dolly-v2-12b`, :code:`stabilityai/stablelm-tuned-alpha-7b`, etc.
    * - :code:`InternLMForCausalLM`
      - InternLM
      - :code:`internlm/internlm-7b`, :code:`internlm/internlm-chat-7b`, etc.
    * - :code:`LlamaForCausalLM`
      - LLaMA, LLaMA-2, Vicuna, Alpaca, Koala, Guanaco
      - :code:`meta-llama/Llama-2-13b-hf`, :code:`meta-llama/Llama-2-70b-hf`, :code:`openlm-research/open_llama_13b`, :code:`lmsys/vicuna-13b-v1.3`, :code:`young-geng/koala`, etc.
    * - :code:`MistralForCausalLM`
      - Mistral, Mistral-Instruct
      - :code:`mistralai/Mistral-7B-v0.1`, :code:`mistralai/Mistral-7B-Instruct-v0.1`, etc.
    * - :code:`MPTForCausalLM`
      - MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter
      - :code:`mosaicml/mpt-7b`, :code:`mosaicml/mpt-7b-storywriter`, :code:`mosaicml/mpt-30b`, etc.
    * - :code:`OPTForCausalLM`
      - OPT, OPT-IML
      - :code:`facebook/opt-66b`, :code:`facebook/opt-iml-max-30b`, etc.
    * - :code:`PhiForCausalLM`
      - Phi-1.5
      - :code:`microsoft/phi-1_5`, etc.
    * - :code:`QWenLMHeadModel`
      - Qwen
      - :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc.
    * - :code:`YiForCausalLM`
      - Yi
      - :code:`01-ai/Yi-6B`, :code:`01-ai/Yi-34B`, etc.
If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.
Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` for instructions on how to implement support for your model.
Alternatively, you can raise an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ project.
.. tip::
    The easiest way to check if your model is supported is to run the program below:

    .. code-block:: python

        from vllm import LLM

        llm = LLM(model=...)  # Name or path of your model
        output = llm.generate("Hello, my name is")
        print(output)

    To use a model from ModelScope (www.modelscope.cn) instead of the HuggingFace Hub, set the following environment variable:

    .. code-block:: shell

        $ export VLLM_USE_MODELSCOPE=True

    .. code-block:: python

        from vllm import LLM

        llm = LLM(model=..., revision=..., trust_remote_code=True)  # Name or path of your model
        output = llm.generate("Hello, my name is")
        print(output)

If vLLM successfully generates text, it indicates that your model is supported.
.. _auto_awq:

AutoAWQ
=======
To create a new 4-bit quantized model, you can leverage `AutoAWQ <https://github.com/casper-hansen/AutoAWQ>`_.
Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%.
The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the `400+ models on Huggingface <https://huggingface.co/models?sort=trending&search=awq>`_.
.. code-block:: console

    $ pip install autoawq
After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize Vicuna 7B v1.5:
.. code-block:: python

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = 'lmsys/vicuna-7b-v1.5'
    quant_path = 'vicuna-7b-v1.5-awq'
    quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

    # Load model
    model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Quantize
    model.quantize(tokenizer, quant_config=quant_config)

    # Save quantized model
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
To run an AWQ model with vLLM, you can use `TheBloke/Llama-2-7b-Chat-AWQ <https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ>`_ with the following command:
.. code-block:: console

    $ python examples/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
AWQ models are also supported directly through the LLM entrypoint:
.. code-block:: python

    from vllm import LLM, SamplingParams

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Create an LLM.
    llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
.. _deploying_with_docker:

Deploying with Docker
============================

You can build and run vLLM from source via the provided Dockerfile. To build vLLM:
.. code-block:: console

    $ DOCKER_BUILDKIT=1 docker build . --target vllm --tag vllm --build-arg max_jobs=8
To run vLLM:
.. code-block:: console

    $ docker run --runtime nvidia --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        -p 8000:8000 \
        --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
        vllm <args...>
.. _deploying_with_triton:

Deploying with NVIDIA Triton
============================
The `Triton Inference Server <https://github.com/triton-inference-server>`_ hosts a tutorial demonstrating how to quickly deploy a simple `facebook/opt-125m <https://huggingface.co/facebook/opt-125m>`_ model using vLLM. Please see `Deploying a vLLM model in Triton <https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton>`_ for more details.
.. _distributed_serving:

Distributed Inference and Serving
=================================
vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We manage the distributed runtime with `Ray <https://github.com/ray-project/ray>`_. To run distributed inference, install Ray with:
.. code-block:: console

    $ pip install ray
To run multi-GPU inference with the :code:`LLM` class, set the :code:`tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
.. code-block:: python

    from vllm import LLM

    llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
    output = llm.generate("San Francisco is a")
To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument when starting the server. For example, to run the API server on 4 GPUs:
.. code-block:: console

    $ python -m vllm.entrypoints.api_server \
    $     --model facebook/opt-13b \
    $     --tensor-parallel-size 4
To scale vLLM beyond a single machine, start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM:
.. code-block:: console

    $ # On head node
    $ ray start --head

    $ # On worker nodes
    $ ray start --address=<ray-head-address>
After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node and setting :code:`tensor_parallel_size` to the total number of GPUs across all machines.
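
For example, with a hypothetical setup of two nodes with 4 GPUs each, start Ray on every node as shown above and then run the following on the head node:

.. code-block:: python

    from vllm import LLM

    # 2 nodes x 4 GPUs each (hypothetical setup) -> 8-way tensor parallelism.
    llm = LLM("facebook/opt-13b", tensor_parallel_size=8)
    output = llm.generate("San Francisco is a")
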
.. _on_cloud:

Running on clouds with SkyPilot
===============================

.. raw:: html

    <p align="center">
        <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
    </p>
vLLM can be run on the cloud to scale to multiple GPUs with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud.
To install SkyPilot and set up your cloud credentials, run:
.. code-block:: console

    $ pip install skypilot
    $ sky check
See the vLLM SkyPilot YAML for serving: `serving.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml>`__.
.. code-block:: yaml

    resources:
        accelerators: A100

    envs:
        MODEL_NAME: decapoda-research/llama-13b-hf
        TOKENIZER: hf-internal-testing/llama-tokenizer

    setup: |
        conda create -n vllm python=3.9 -y
        conda activate vllm
        git clone https://github.com/vllm-project/vllm.git
        cd vllm
        pip install .
        pip install gradio

    run: |
        conda activate vllm
        echo 'Starting vllm api server...'
        python -u -m vllm.entrypoints.api_server \
            --model $MODEL_NAME \
            --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
            --tokenizer $TOKENIZER 2>&1 | tee api_server.log &
        echo 'Waiting for vllm api server to start...'
        while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
        echo 'Starting gradio server...'
        python vllm/examples/gradio_webserver.py
Start serving the LLaMA-13B model on an A100 GPU:
.. code-block:: console

    $ sky launch serving.yaml
Check the output of the command. There will be a shareable Gradio link (like the last line of the following). Open it in your browser to use the LLaMA model for text completion.
.. code-block:: console

    (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
**Optional**: Serve the 65B model instead of the default 13B and use more GPUs:
.. code-block:: console

    sky launch -c vllm-serve-new -s serve.yaml --gpus A100:8 --env MODEL_NAME=decapoda-research/llama-65b-hf
"""Example Python client for vllm.entrypoints.api_server"""
import argparse
import json
from typing import Iterable, List
import requests
def clear_line(n: int = 1) -> None:
LINE_UP = '\033[1A'
LINE_CLEAR = '\x1b[2K'
for _ in range(n):
print(LINE_UP, end=LINE_CLEAR, flush=True)
def post_http_request(prompt: str,
api_url: str,
n: int = 1,
stream: bool = False) -> requests.Response:
headers = {"User-Agent": "Test Client"}
pload = {
"prompt": prompt,
"n": n,
"use_beam_search": True,
"temperature": 0.0,
"max_tokens": 16,
"stream": stream,
}
response = requests.post(api_url, headers=headers, json=pload, stream=True)
return response
def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
for chunk in response.iter_lines(chunk_size=8192,
decode_unicode=False,
delimiter=b"\0"):
if chunk:
data = json.loads(chunk.decode("utf-8"))
output = data["text"]
yield output
def get_response(response: requests.Response) -> List[str]:
data = json.loads(response.content)
output = data["text"]
return output
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--n", type=int, default=4)
parser.add_argument("--prompt", type=str, default="San Francisco is a")
parser.add_argument("--stream", action="store_true")
args = parser.parse_args()
prompt = args.prompt
api_url = f"http://{args.host}:{args.port}/generate"
n = args.n
stream = args.stream
print(f"Prompt: {prompt!r}\n", flush=True)
response = post_http_request(prompt, api_url, n, stream)
if stream:
num_printed_lines = 0
for h in get_streaming_response(response):
clear_line(num_printed_lines)
num_printed_lines = 0
for i, line in enumerate(h):
num_printed_lines += 1
print(f"Beam candidate {i}: {line!r}", flush=True)
else:
output = get_response(response)
for i, line in enumerate(output):
print(f"Beam candidate {i}: {line!r}", flush=True)
import argparse
import json

import gradio as gr
import requests


def http_bot(prompt):
    headers = {"User-Agent": "vLLM Client"}
    pload = {
        "prompt": prompt,
        "stream": True,
        "max_tokens": 128,
    }
    response = requests.post(args.model_url,
                             headers=headers,
                             json=pload,
                             stream=True)

    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["text"][0]
            yield output


def build_demo():
    with gr.Blocks() as demo:
        gr.Markdown("# vLLM text completion demo\n")
        inputbox = gr.Textbox(label="Input",
                              placeholder="Enter text and press ENTER")
        outputbox = gr.Textbox(label="Output",
                               placeholder="Generated result from the model")
        inputbox.submit(http_bot, [inputbox], [outputbox])
    return demo


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, default=None)
    parser.add_argument("--port", type=int, default=8001)
    parser.add_argument("--model-url",
                        type=str,
                        default="http://localhost:8000/generate")
    args = parser.parse_args()

    demo = build_demo()
    demo.queue(concurrency_count=100).launch(server_name=args.host,
                                             server_port=args.port,
                                             share=True)
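
# Example usage (assumptions: this file is saved as gradio_webserver.py and the
# vLLM API server is already running, e.g. started with
# `python -m vllm.entrypoints.api_server --model facebook/opt-125m`):
#
#   python gradio_webserver.py --model-url http://localhost:8000/generate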
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

prompts = [
    # "input: 请告诉我把大象放进冰箱需要几步?output:",
    # "input:请给我写一首描绘极寒的冬天的古诗 output:",
    # "input:周杰伦都写了哪些歌呢?尽可能多地列举一些 output:",
    # "input:铁做氧化还原反应的化学方程式是什么 output:",
    # "input:linux里 watch -n1 file.txt命令是什么意思 output:",
    # "input:帮我计算一下 12 + 13 - 2 * 5 = ? 请一步一步来 output:",
    # "input:这段python代码 ```def add_number_minus1(x, y):\n return x + y - 1\n``` 用C语言应该怎么写?output:",
    # "input:假唱是指什么?请剖析五月天假唱的正面影响和负面影响, 他们应该假唱吗? output:",
    "Q:Which songs has Jay Chou written? Please list as many as possible. A:",
    "Q:What is the chemical equation for the redox reaction of iron? A:",
    "Q:Tell me a joke about a classmate who wanted to go to the bathroom during class. A:",
    "Q:What does lip-syncing mean? Please analyze the positive and negative impacts of the band Mayday lip-syncing. Should they lip-sync? A:"
]

# NOTE: the list below overrides the list above; only these prompts are used.
prompts = [
    "<用户>Write five words that start with “en”, then write the result of “77+33”<AI>",
    "<用户>Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.<AI>",
    "<用户>A group of students are planning to go on a field trip to a museum. They need to decide how many buses to rent and how to divide the students among the buses. Each bus can hold up to 40 students, but the museum can only accommodate 120 students at a time. The group has a budget of $800 for the bus rental, and each bus costs $200 per day. How many buses should the group rent, and how many students should go on each bus? Explain your reasoning.<AI>",
    """<用户>Selena, Jennifer and Miley wear a blue dress, yellow dress, and green dress in an unknown order. It is known that:
1) If Selena wears blue, then Jennifer wears green.
2) If Selena wears yellow, then Miley wears green.
3) If Jennifer does not wear yellow, then Miley wears blue.
What is the color of the dress Selena is wearing?<AI>""",
    """<用户>Given the following premise:
It is given that Amy, Bernadette and Penny are good friends of Sheldon and Leonard.
Leslie is a good friend of Leonard.
We can deduce that Leslie is a friend of Sheldon as well.
Does this deduction follow the premise? Explain your reasoning.<AI>""",
    """<用户>A group of five friends are going to watch a movie together. They have to choose between three genres: comedy, horror, and action. Each friend has a different preference for the genre. Here are some clues to help you figure out their preferences: Alice likes comedy more than horror, but less than action. Bob likes horror more than comedy, but less than action. Carol likes action more than horror, but less than comedy. David likes comedy more than action, but less than horror. Eve likes horror more than action, but less than comedy. What is the order of preference for each friend from most to least liked genre? Write your answer using the following format: Friend: Genre > Genre > Genre <AI>""",
    "<用户>If you were in a race and passed the person in second place, what place would you be in now?<AI>",
    "<用户>Which one is more south? California or New York?<AI>",
    "<用户>linux里 watch -n1 file.txt命令是什么意思<AI>",
    "<用户>Translate this sentence into Russian: '如何用Python创建一个简单的网页爬虫?'.<AI>",
    """<用户>Translate this sentence into French: "I am a fresh man on Chinese, do you know how this sentence is translated: 如何用Python创建一个简单的网页爬虫?" <AI>""",
    "<用户>Micro-expressions mean that people express their inner feelings to each other by making some expressions.Between different expressions made by people or in a certain expression, the face will \"leak\" other information.The shortest-lasting micro-expressions can last 1 / 25 seconds, although a subconscious expression may only last for a moment, it is easy to expose emotions.When the face is making an expression, these extremely short-term expressions suddenly flash by, and sometimes the opposite mood.\nAccording to the above definition, which of the following is a micro-expression?\nA.After Wang was frightened, his face continued to twitch\nB.The spy sends a message to associates in the blink of an eye\nC.The sales clerk may flash a contemptuous smirk when he smiles in front of a shabby customer.\nD.Walking against the biting cold wind, Xiao Li's upper and lower teeth kept shaking and colliding\nA:<AI>",
    "<用户>A’s brother was half her age when she was 6 how old is her brother when she’s 42?<AI>",
]

for sent in prompts:
    inputs = tokenizer(sent, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length=200)
    text = tokenizer.batch_decode(outputs)[0]
    print("-" * 20)
    print(text)
    print("=" * 20)
# Convert a HuggingFace-format checkpoint (config.json + pytorch_model.bin) into
# the BMT/"cpm_dragonfly" format used downstream, copying the tokenizer files as well.
import argparse
import json
import os
import shutil
from tqdm import tqdm
from collections import OrderedDict

import torch


def convert_model(config, ckpt):
    # config
    config_bmt = OrderedDict(
        {
            "_dtype": "bf16",
            "activate_fn": "silu",
            "architectures": [
                "CPMDragonflyForCausalLM"
            ],
            "model_type": "cpm_dragonfly",
            "base": 10000,
            "dim_ff": config['intermediate_size'],
            "dim_head": config['hidden_size'] // config['num_attention_heads'],
            "dim_model": config['hidden_size'],
            "dim_model_base": 256,
            "dropout_p": 0.0,
            "eps": config['rms_norm_eps'],
            "init_std": config['initializer_range'],
            "num_heads": config['num_attention_heads'],
            "num_kv_heads": config['num_key_value_heads'],
            "num_layers": config['num_hidden_layers'],
            "orig_max_length": 4096,
            "pose_prob": 0.0,
            "pose_scaling_factor": 1.0,
            "qk_norm": False,
            "rope_scaling_factor": 1,
            "rope_scaling_type": "",
            "scale": True,
            "scale_depth": config['scale_depth'],
            "scale_emb": config['scale_emb'],
            "tie_lm_head": True,
            "tp": 0,
            "transformers_version": "4.35.0",
            "vocab_size": config['vocab_size']
        }
    )
    # Rename the checkpoint tensors from HF layout to the BMT layout.
    model_bmt = OrderedDict()
    model_bmt["input_embedding.weight"] = ckpt['model.embed_tokens.weight'].contiguous()
    model_bmt["encoder.output_layernorm.weight"] = ckpt['model.norm.weight'].contiguous()
    for lnum in tqdm(range(config_bmt['num_layers'])):
        hf_pfx = f"model.layers.{lnum}"
        bmt_pfx = f"encoder.layers.{lnum}"
        model_bmt[f"{bmt_pfx}.self_att.layernorm_before_attention.weight"] = ckpt[f"{hf_pfx}.input_layernorm.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.self_att.self_attention.project_q.weight"] = ckpt[f"{hf_pfx}.self_attn.q_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.self_att.self_attention.project_k.weight"] = ckpt[f"{hf_pfx}.self_attn.k_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.self_att.self_attention.project_v.weight"] = ckpt[f"{hf_pfx}.self_attn.v_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.self_att.self_attention.attention_out.weight"] = ckpt[f"{hf_pfx}.self_attn.o_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.ffn.layernorm_before_ffn.weight"] = ckpt[f"{hf_pfx}.post_attention_layernorm.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.ffn.ffn.w_in.w_0.weight"] = ckpt[f"{hf_pfx}.mlp.gate_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.ffn.ffn.w_in.w_1.weight"] = ckpt[f"{hf_pfx}.mlp.up_proj.weight"].contiguous()
        model_bmt[f"{bmt_pfx}.ffn.ffn.w_out.weight"] = ckpt[f"{hf_pfx}.mlp.down_proj.weight"].contiguous()
    return config_bmt, model_bmt


def load_model_ckpt(args):
    with open(os.path.join(args.load, "config.json"), 'r') as fin:
        config = json.load(fin)
    ckpt = torch.load(os.path.join(args.load, "pytorch_model.bin"))
    os.makedirs(f"{args.save}", exist_ok=True)
    # model and config
    hf_config, hf_ckpt = convert_model(config, ckpt)
    with open(os.path.join(args.save, "config.json"), 'w') as fout:
        json.dump(hf_config, fout, indent=4)
    torch.save(hf_ckpt, f"{args.save}/pytorch_model.pt")
    # tokenizer
    shutil.copyfile(f"{args.load}/tokenizer.json", f"{args.save}/tokenizer.json")
    shutil.copyfile(f"{args.load}/tokenizer.model", f"{args.save}/tokenizer.model")
    shutil.copyfile(f"{args.load}/special_tokens_map.json", f"{args.save}/special_tokens_map.json")
    shutil.copyfile(f"{args.load}/tokenizer_config.json", f"{args.save}/tokenizer_config.json")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--load", type=str, default="")
    parser.add_argument("--save", type=str, default="")
    args = parser.parse_args()
    load_model_ckpt(args)
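
# Example usage (the script name and paths below are placeholders):
#
#   python convert_hf_to_bmt.py --load /path/to/hf_checkpoint_dir --save /path/to/output_dir
#
# The --load directory is expected to contain config.json, pytorch_model.bin,
# and the tokenizer files copied above.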
from vllm import LLM, SamplingParams
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default="")
parser.add_argument("--prompt_path", type=str, default="")
args = parser.parse_args()

with open(args.prompt_path, "r") as f:
    prompts = f.readlines()

prompt_template = "<用户>{}<AI>"
prompts = [prompt_template.format(prompt.strip()) for prompt in prompts]

params_dict = {
    "n": 1,
    "best_of": 1,
    "presence_penalty": 1.0,
    "frequency_penalty": 0.0,
    "temperature": 0.5,
    "top_p": 0.8,
    "top_k": -1,
    "use_beam_search": False,
    "length_penalty": 1,
    "early_stopping": False,
    "stop": None,
    "stop_token_ids": None,
    "ignore_eos": False,
    "max_tokens": 1000,
    "logprobs": None,
    "prompt_logprobs": None,
    "skip_special_tokens": True,
}

# Create a sampling params object.
sampling_params = SamplingParams(**params_dict)

# Create an LLM.
# llm = LLM(model=args.model_path, tensor_parallel_size=1, dtype='bfloat16')
llm = LLM(model=args.model_path, tensor_parallel_size=1, dtype='float16')

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
for prompt in prompts:
    outputs = llm.generate(prompt, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print("================")
        # Find the first <用户> and remove the text before it.
        clean_prompt = prompt[prompt.find("<用户>") + len("<用户>"):]
        print(f"""<用户>: {clean_prompt.replace("<AI>", "")}""")
        print("<AI>:")
        print(generated_text)
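
# Example usage (the script name and paths are placeholders; the prompt file is
# plain text with one prompt per line, which the script wraps in the
# "<用户>{}<AI>" chat template):
#
#   python run_vllm_chat.py --model_path /path/to/model --prompt_path prompts.txt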
from vllm import LLM, SamplingParams
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default="")
parser.add_argument("--prompt_path", type=str, default="")
args = parser.parse_args()

with open(args.prompt_path, "r") as f:
    prompts = f.readlines()

prompt_template = "<用户>{}<AI>"
prompts = [prompt_template.format(prompt.strip()) for prompt in prompts]

params_dict = {
    "n": 1,
    "best_of": None,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "repetition_penalty": 1.0,
    "temperature": 1.0,
    "top_p": 0.5,
    "top_k": -1,
    "use_beam_search": False,
    "length_penalty": 1.5,
    "early_stopping": False,
    "stop": None,
    "stop_token_ids": None,
    "ignore_eos": False,
    "max_tokens": 1000,
    "logprobs": None,
    "prompt_logprobs": None,
    "skip_special_tokens": True,
}

# Create a sampling params object.
sampling_params = SamplingParams(**params_dict)

# Create an LLM.
llm = LLM(model=args.model_path, tensor_parallel_size=1, dtype='bfloat16')

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
for prompt in prompts:
    outputs = llm.generate(prompt, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print("================")
        # Find the first <用户> and remove the text before it.
        clean_prompt = prompt[prompt.find("<用户>") + len("<用户>"):]
        print(f"""<用户>: {clean_prompt.replace("<AI>", "")}""")
        print("<AI>:")
        print(generated_text)
from vllm import LLM, SamplingParams
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default="")
parser.add_argument("--prompt_path", type=str, default="")
parser.add_argument("--output_path", type=str, default="")
args = parser.parse_args()

with open(args.prompt_path, "r") as f:
    data_list = json.load(f)

prompts = [data["prompt"] for data in data_list]
prompt_template = "{}"
prompts = [prompt_template.format(prompt.strip()) for prompt in prompts]

params_dict = {
    "n": 1,
    "best_of": None,
    "presence_penalty": 1.0,
    "frequency_penalty": 0.0,
    "temperature": 0.3,
    "top_p": 0.8,
    "top_k": -1,
    "use_beam_search": False,
    "length_penalty": 1.0,
    "early_stopping": False,
    "stop": None,
    "stop_token_ids": None,
    "ignore_eos": False,
    "max_tokens": 1000,
    "logprobs": None,
    "prompt_logprobs": None,
    "skip_special_tokens": True,
}

# Create a sampling params object.
sampling_params = SamplingParams(**params_dict)

# Create an LLM.
llm = LLM(model=args.model_path, tensor_parallel_size=1, dtype='bfloat16')

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
for data in data_list:
    prompt = data["prompt"]
    outputs = llm.generate(prompt, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print("================")
        # Find the first <用户> and remove the text before it.
        clean_prompt = prompt[prompt.find("<用户>") + len("<用户>"):]
        print(f"""<用户>: {clean_prompt.replace("<AI>", "")}""")
        print("<AI>:")
        print(generated_text)
        data['cpm_new'] = generated_text
        with open(args.prompt_path + args.output_path, "a") as f:
            f.write(json.dumps(data, ensure_ascii=False, indent=4))
            f.write(",\n")
================
<用户>: 你是谁?
<AI>: ' 我是一个大型语言模型。'
================
<用户>: Who are you?
<AI>: ' I am an AI language model created by OpenAI. My purpose is to assist you in generating human-like text based on the prompts you provide. I do not have personal feelings or emotions, but I am designed to understand and respond to your questions and requests.'
================
<用户>: 给我写一份去东京旅游的游记,要求每句话都以“哈哈”开头。
<AI>: ' 哈哈,终于决定去东京旅游了!这次旅行让我充满了期待和兴奋。\n\n哈哈,一到达东京,我就被这座城市的繁华和活力所吸引。我立刻去了新宿站,这里人山人海,热闹非凡。\n\n哈哈,我去了东京塔,站在塔顶上,俯瞰整个东京市,感觉非常震撼。\n\n哈哈,我还去了浅草寺,这里有着悠久的历史和文化底蕴,让我感受到了日本传统文化的魅力。\n\n哈哈,在东京的街头巷尾,我品尝了各种美食,比如寿司、拉面、天妇罗等等,每一口都让我回味无穷。\n\n哈哈,我还去了迪士尼乐园,这里的游乐设施和表演都非常精彩,让我度过了一个难忘的假期。\n\n哈哈,最后一天,我在银座购物,买了一些纪念品和礼物,准备带回家和家人分享。\n\n哈哈,这次东京之行让我收获了很多美好的回忆和体验,我会一直珍藏它们。'
================
<用户>: 假唱是指什么?请剖析五月天假唱的负面影响, 他们应该假唱吗?
<AI>: ' 假唱是指歌手在演唱时使用录音或现场效果器代替真实的声音。五月天是台湾著名的摇滚乐队,他们曾多次被质疑假唱。\n\n假唱的负面影响主要有以下几点:\n\n1. 影响观众体验:假唱会使观众感受到音乐的真实性和情感的缺失,影响他们的观赏体验。\n2. 损害歌手形象:假唱会降低歌手的专业素养和艺术水准,损害其形象和声誉。\n3. 影响音乐产业:假唱会破坏音乐产业的生态平衡,导致原创音乐受到排挤,对音乐产业的发展造成负面影响。\n\n五月天是否应该假唱是一个有争议的问题。一方面,假唱可以保证演唱会的顺利进行,避免因歌手嗓子不好而影响演出效果。另一方面,假唱也会损害歌手的形象和声誉,降低观众对音乐的信任度。因此,五月天是否应该假唱需要权衡利弊,综合考虑各种因素。'
================
<用户>: 请告诉我把大象放进冰箱需要几步?
<AI>: ' 把大象放进冰箱是一个不可能完成的任务,因为大象的尺寸远远超过冰箱的容量。因此,这个问题实际上是不存在的。'
================
<用户>: 请给我写一首描绘极寒的冬天的七言绝句。
<AI>: ' 雪花纷飞白茫茫,\n寒风凛冽刺骨寒。\n枯枝落叶随风舞,\n冬日寒意入心间。'
================
<用户>: 周杰伦都写了哪些歌呢?尽可能多地列举一些
<AI>: ' 周杰伦是一位非常著名的华语流行歌手和音乐制作人,他的歌曲涵盖了多个风格和主题。以下是一些周杰伦的经典歌曲:\n\n1. 《青花瓷》\n2. 《千里之外》\n3. 《爱情转移》\n4. 《霍元甲》\n5. 《晴天》\n6. 《爱在西元前》\n7. 《夜曲》\n8. 《东风破》\n9. 《双截棍》\n10. 《倒带》\n11. 《不该》\n12. 《不能说的秘密》\n13. 《告白气球》\n14. 《发如雪》\n15. 《不该让你等》\n16. 《简单爱》\n17. 《等你下课》\n18. 《世界末日》\n19. 《听妈妈的话》\n20. 《稻香》\n\n以上只是周杰伦众多经典歌曲中的一部分,他的音乐作品非常丰富多样,涵盖了多种风格和主题,包括摇滚、流行、民谣、电子等。'
================
<用户>: 铁做氧化还原反应的化学方程式是什么
<AI>: ' 铁(Fe)与氧气(O2)反应生成氧化铁(Fe2O3)的化学方程式如下:\n\n4 Fe + 3 O2 → 2 Fe2O3\n\n这个方程式表示了铁和氧气在适当条件下发生氧化还原反应的过程。在这个反应中,铁原子失去电子被氧化为三价铁离子(Fe3+),而氧气分子接受这些电子被还原为氧离子(O2-)。'
================
<用户>: linux里 watch -n1 file.txt命令是什么意思
<AI>: ' `watch -n1 file.txt` 命令在 Linux 系统中用于监视文件的变化。其中,`watch` 命令是一个用于监视文件变化的工具,`-n` 表示每秒钟检查一次文件的变化,而 `1` 表示检查的时间间隔为 1 秒。\n因此,`watch -n1 file.txt` 命令的含义是每秒钟检查一次 `file.txt` 文件的变化。如果文件发生了任何变化,命令会立即输出变化的内容。'
================
<用户>: Translate this sentence into Russian: '如何用Python创建一个简单的网页爬虫?'.
<AI>: ' Как создать простый скрипт для сбора данных из веб-страниц, используя Python?'
================
<用户>: Translate this sentence into French: "I am a fresh man on Chinese, do you know how this sentence is translated: 如何用Python创建一个简单的网页爬虫?"
<AI>: ' "Je suis un nouveau-bien dans le chinois, savez-vous comment traduire cette phrase en Python : Comment créer un simple web scraper ?"'
================
<用户>: Micro-expressions mean that people express their inner feelings to each other by making some expressions.Between different expressions made by people or in a certain expression, the face will "leak" other information.The shortest-lasting micro-expressions can last 1 / 25 seconds, although a subconscious expression may only last for a moment, it is easy to expose emotions.When the face is making an expression, these extremely short-term expressions suddenly flash by, and sometimes the opposite mood.
According to the above definition, which of the following is a micro-expression?
A.After Wang was frightened, his face continued to twitch
B.The spy sends a message to associates in the blink of an eye
C.The sales clerk may flash a contemptuous smirk when he smiles in front of a shabby customer.
D.Walking against the biting cold wind, Xiao Li's upper and lower teeth kept shaking and colliding
A:
<AI>: " Micro-expressions, as defined in the question, are brief, involuntary facial expressions that reveal an individual's inner feelings to others. These expressions can last from 1/25 seconds to a moment but are often easily detectable due to their brevity and the fact that they can be triggered by subconscious emotions.\n\nLet's analyze each option:\n\nA. After Wang was frightened, his face continued to twitch - This is not a micro-expression because it describes a continuous action (twitching) rather than a brief expression.\n\nB. The spy sends a message to associates in the blink of an eye - This is not a micro-expression either. It refers to a quick action or communication, not a facial expression.\n\nC. The sales clerk may flash a contemptuous smirk when he smiles in front of a shabby customer - This is a micro-expression because it describes a brief, involuntary change in facial expression that reveals the sales clerk's underlying emotion (contempt).\n\nD. Walking against the biting cold wind, Xiao Li's upper and lower teeth kept shaking and colliding - This is not a micro-expression either. It describes a physical reaction to cold weather rather than a facial expression revealing an emotional state.\n\nTherefore, the correct answer is C, as it is the only option that clearly describes a brief, involuntary change in facial expression that reveals an underlying emotion..\n\nThe answer is C."
================
<用户>: How many legs did a three-legged dog have **before** it lost one leg?
<AI>: ' A three-legged dog had four legs before it lost one leg.'
================
<用户>: A’s brother was half her age when she was 6 how old is her brother when she’s 42?
<AI>: " Step 1: First, let's calculate the age difference between A and her brother. When A was 6, her brother was half her age, so her brother was 6 / 2 = 3 years old at that time.\nStep 2: Now, let's calculate the current age of A when she is 42. The age difference between A and her brother remains the same, so when A is 42, her brother would be 42 - 3 = 39 years old.\n\nAnswer:\n39"