You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""  # noqa: E501
class Llama2Adapter(BasicAdapterFast):
    """Adapter for llama2.

    Llama2 uses the following template, and the first user prompt
    should contain a system prompt. Users can specify the system prompt
    with a <<SYS>> tag; otherwise the default system prompt is
    prepended to the user's input.
    """
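# A minimal sketch of the first-turn layout described above, assuming
# the standard Llama-2 chat markers ([INST], <<SYS>>). The helper name
# `build_llama2_prompt` is illustrative, not part of the adapter's
# actual interface.
def build_llama2_prompt(user_msg: str, system_prompt: str) -> str:
    """Wrap the first user turn with the Llama-2 instruction markers."""
    return (f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n'
            f'{user_msg} [/INST]')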
@dataclass
class TurbomindEngineConfig:
    """TurboMind Engine config.

    Args:
        model_name (str): the name of the deployed model, deprecated
            and has no effect when version > 0.2.1
        model_format (str): the layout of the deployed model. It can be
            one of the following values: [hf, llama, awq], `hf` meaning
            `hf_llama`, `llama` meaning `meta_llama`, `awq` meaning the
            quantized model by AWQ
        tp (int): the number of GPU cards used in tensor parallelism,
            default to 1
        session_len (int): the max session length of a sequence,
            default to None
        max_batch_size (int): the max batch size during inference,
            default to 128
        cache_max_entry_count (float): the percentage of GPU memory
            occupied by the k/v cache. For lmdeploy versions between
            `v0.2.0` and `v0.2.1`, it defaults to 0.5, depicting the
            percentage of TOTAL GPU memory to be allocated to the k/v
            cache. For lmdeploy versions greater than `v0.2.1`, it
            defaults to 0.8, signifying the percentage of FREE GPU
            memory to be reserved for the k/v cache
        quant_policy (int): default to 0. When the k/v cache is
            quantized into 8 bit, set it to 4
        rope_scaling_factor (float): scaling factor used for dynamic
            ntk, default to 0.0. TurboMind follows the implementation
            of transformer LlamaAttention
        use_logn_attn (bool): whether or not to use logn attention,
            default to False
        download_dir (str): directory to download and load the weights,
            default to the default cache directory of huggingface
        revision (str): the specific model version to use. It can be a
            branch name, a tag name, or a commit id. If unspecified,
            will use the default version
        max_prefill_token_num (int): the number of tokens of each
            iteration during prefill, default to 8192
    """  # noqa: E501
    model_name: Optional[str] = None
    model_format: Optional[str] = None
    tp: int = 1
    session_len: Optional[int] = None
    max_batch_size: int = 128
    cache_max_entry_count: float = 0.8
    quant_policy: int = 0
    rope_scaling_factor: float = 0.0
    use_logn_attn: bool = False
    download_dir: Optional[str] = None
    revision: Optional[str] = None
    max_prefill_token_num: int = 8192
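# Usage sketch (values are illustrative): a 2-GPU AWQ deployment with
# the k/v cache quantized to 8 bit. Under the >= v0.2.1 semantics,
# cache_max_entry_count=0.5 on a card with, say, 60 GiB of free memory
# reserves roughly 30 GiB for the k/v cache. A config like this is
# typically passed to lmdeploy's `pipeline(..., backend_config=...)`
# entry point.
_example_backend_config = TurbomindEngineConfig(
    model_format='awq',          # weights quantized by AWQ
    tp=2,                        # shard the model across 2 GPUs
    cache_max_entry_count=0.5,   # 50% of free GPU memory for k/v cache
    quant_policy=4,              # quantize the k/v cache into 8 bit
)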
@dataclass
class PytorchEngineConfig:
"""PyTorch Engine Config.
Args:
model_name (str): name of the given model.
tp (int): Tensor Parallelism. default 1.
session_len (int): Max session length. Default None.
max_batch_size (int): Max batch size. Default 128.
cache_max_entry_count (float): the percentage of gpu memory occupied
by the k/v cache. For lmdeploy versions greater than `v0.2.1`,
it defaults to 0.8, signifying the percentage of FREE GPU memory
to be reserved for the k/v cache
eviction_type (str): What action to perform when kv cache
is full, ['recompute', 'copy'], Default 'recompute'.
prefill_interval (int): Interval to perform prefill,