Inference with CodeLlama-7b-Instruct-hf generates noise
I want to deploy CodeLlama-7b-Instruct-hf with this project and DCUs in a Kubernetes cluster, but the generated responses are noise.
Steps to reproduce:

- Download the model files from https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf into the PVC named llm (one way to do this is sketched right after this item).
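For reference, one way to populate the PVC is to run snapshot_download from the huggingface_hub package in a helper Pod that mounts the PVC (a minimal sketch; the mount path /mnt/llm and the helper Pod itself are assumptions, not part of the setup described here):

```python
# Minimal sketch: download CodeLlama-7b-Instruct-hf into the PVC.
# Assumes the PVC "llm" is mounted at /mnt/llm in a helper Pod (hypothetical path).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="codellama/CodeLlama-7b-Instruct-hf",
    # The target directory must match the subPath used in the Pod spec below.
    local_dir="/mnt/llm/CodeLlama-7b-Instruct-hf",
)
```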
- Create the following Pod. Since each DCU has only 16 GB of VRAM (the bf16 weights of a 7B model alone take roughly 14 GB, leaving little room for the KV cache), tensor-parallel-size=2 is used to shard the model across two DCUs.
apiVersion: v1
kind: Pod
metadata:
  name: vllm
spec:
  containers:
  - command:
    - python
    - -m
    - vllm.entrypoints.openai.api_server
    args:
    - --model=/var/lib/t9k/model
    - --served-model-name=codellama
    - --trust-remote-code
    - --tensor-parallel-size=2
    - --enforce-eager
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: void
    image: image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.3.3-dtk24.04-zk-centos7.6-py310-v2
    imagePullPolicy: IfNotPresent
    name: server
    ports:
    - containerPort: 8000
      protocol: TCP
    resources:
      limits:
        cpu: "4"
        hygon.com/dcu: "2"
        memory: 64Gi
      requests:
        cpu: "4"
        hygon.com/dcu: "2"
        memory: 64Gi
    securityContext:
      capabilities:
        add:
        - SYS_PTRACE
      privileged: true
    volumeMounts:
    - mountPath: /opt/hyhal/
      name: hyhal
      readOnly: true
    - mountPath: /dev/kfd
      name: dev-kfd
      readOnly: true
    - mountPath: /dev/dri
      name: dev-dri
      readOnly: true
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /var/lib/t9k/model
      name: model-volume
      subPath: CodeLlama-7b-Instruct-hf
  volumes:
  - hostPath:
      path: /opt/hyhal/
      type: ""
    name: hyhal
  - hostPath:
      path: /dev/kfd
      type: ""
    name: dev-kfd
  - hostPath:
      path: /dev/dri
      type: ""
    name: dev-dri
  - emptyDir:
      medium: Memory
      sizeLimit: 80Gi
    name: dshm
  - name: model-volume
    persistentVolumeClaim:
      claimName: llm
- The Pod prints the following logs:
INFO 07-04 18:17:44 api_server.py:228] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='codellama', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/var/lib/t9k/model', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-04 18:17:44 config.py:420] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
2024-07-04 18:17:47,543 INFO worker.py:1724 -- Started a local Ray instance.
INFO 07-04 18:17:58 llm_engine.py:87] Initializing an LLM engine with config: model='/var/lib/t9k/model', tokenizer='/var/lib/t9k/model', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0704 18:18:07.442831 1 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309498768
(RayWorkerVllm pid=2084) WARNING: Logging before InitGoogleLogging() is written to STDERR
(RayWorkerVllm pid=2084) I0704 18:18:07.436853 2084 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310870032
I0704 18:18:08.710465 1 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
I0704 18:18:08.728509 1 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310540816
I0704 18:18:08.729008 1 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=294564160
(RayWorkerVllm pid=2084) I0704 18:18:08.725971 2084 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=313156592
(RayWorkerVllm pid=2084) I0704 18:18:08.726452 2084 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=311203584
I0704 18:19:02.314718 1 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
INFO 07-04 18:19:38 llm_engine.py:357] # GPU blocks: 1622, # CPU blocks: 1024
INFO 07-04 18:20:00 serving_chat.py:302] Using default chat template:
INFO 07-04 18:20:00 serving_chat.py:302] {% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\n' + system_message + '\n<</SYS>>\n\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content | trim + ' ' + eos_token }}{% endif %}{% endfor %}
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
- Get the Pod's IP and send it a chat request; the response is as follows:
curl 10.233.64.14:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "codellama", "messages": [{"role": "user", "content": "hello"}], "temperature": 0.5, "max_tokens": 100}'
{"id":"cmpl-6b59d12db8a3432cb9d82bc40cbf0202","object":"chat.completion","created":719297,"model":"codellama","choices":[{"index":0,"message":{"role":"assistant","content":"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":10,"total_tokens":110,"completion_tokens":100}}