Closed
Created Jul 04, 2024 by acane (@acane)

Inference with CodeLlama-7b-Instruct-hf produces noise as generated text

I am trying to deploy CodeLlama-7b-Instruct-hf with this project on DCUs in a Kubernetes cluster, but the generated response is noise.

Steps to reproduce:

  1. Download the model files from https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf into the PVC llm, for example as sketched below.
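A minimal sketch of one way to fetch the weights, using the huggingface_hub Python package; the path /mnt/llm is a hypothetical mount point of the PVC llm, not something taken from this report:

from huggingface_hub import snapshot_download

# Hypothetical local path where the PVC "llm" is mounted on the machine
# doing the download; adjust to your environment.
snapshot_download(
    repo_id="codellama/CodeLlama-7b-Instruct-hf",
    local_dir="/mnt/llm/CodeLlama-7b-Instruct-hf",
)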

  2. Create the following Pod. Since a single DCU has only 16 GB of device memory, tensor-parallel-size=2 is used.

apiVersion: v1
kind: Pod
metadata:
  name: vllm
spec:
  containers:
  - command:
    - python
    - -m
    - vllm.entrypoints.openai.api_server
    args:
    - --model=/var/lib/t9k/model
    - --served-model-name=codellama
    - --trust-remote-code
    - --tensor-parallel-size=2
    - --enforce-eager
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: void
    image: image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.3.3-dtk24.04-zk-centos7.6-py310-v2 
    imagePullPolicy: IfNotPresent
    name: server
    ports:
    - containerPort: 8000
      protocol: TCP
    resources:
      limits:
        cpu: "4"
        hygon.com/dcu: "2"
        memory: 64Gi
      requests:
        cpu: "4"
        hygon.com/dcu: "2"
        memory: 64Gi
    securityContext:
      capabilities:
        add:
        - SYS_PTRACE
      privileged: true
    volumeMounts:
    - mountPath: /opt/hyhal/
      name: hyhal
      readOnly: true
    - mountPath: /dev/kfd
      name: dev-kfd
      readOnly: true
    - mountPath: /dev/dri
      name: dev-dri
      readOnly: true
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /var/lib/t9k/model
      name: model-volume
      subPath: CodeLlama-7b-Instruct-hf
  volumes:
  - hostPath:
      path: /opt/hyhal/
      type: ""
    name: hyhal
  - hostPath:
      path: /dev/kfd
      type: ""
    name: dev-kfd
  - hostPath:
      path: /dev/dri
      type: ""
    name: dev-dri
  - emptyDir:
      medium: Memory
      sizeLimit: 80Gi
    name: dshm
  - name: model-volume
    persistentVolumeClaim:
      claimName: llm
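As a sketch of how the manifest above can be applied and the Pod IP retrieved programmatically (this assumes the kubernetes Python client and a manifest file named vllm-pod.yaml, both hypothetical; kubectl apply -f works equally well):

import yaml
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

# vllm-pod.yaml is a hypothetical file holding the manifest shown above
with open("vllm-pod.yaml") as f:
    pod_manifest = yaml.safe_load(f)

v1 = client.CoreV1Api()
v1.create_namespaced_pod(namespace="default", body=pod_manifest)

# Once the Pod is Running, its IP is used for the chat request in step 4
pod = v1.read_namespaced_pod(name="vllm", namespace="default")
print(pod.status.phase, pod.status.pod_ip)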
  3. The Pod prints the following logs:
INFO 07-04 18:17:44 api_server.py:228] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='codellama', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/var/lib/t9k/model', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-04 18:17:44 config.py:420] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
2024-07-04 18:17:47,543 INFO worker.py:1724 -- Started a local Ray instance.
INFO 07-04 18:17:58 llm_engine.py:87] Initializing an LLM engine with config: model='/var/lib/t9k/model', tokenizer='/var/lib/t9k/model', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0704 18:18:07.442831     1 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309498768
(RayWorkerVllm pid=2084) WARNING: Logging before InitGoogleLogging() is written to STDERR
(RayWorkerVllm pid=2084) I0704 18:18:07.436853  2084 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310870032
I0704 18:18:08.710465     1 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
I0704 18:18:08.728509     1 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310540816
I0704 18:18:08.729008     1 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=294564160
(RayWorkerVllm pid=2084) I0704 18:18:08.725971  2084 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=313156592
(RayWorkerVllm pid=2084) I0704 18:18:08.726452  2084 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=311203584
I0704 18:19:02.314718     1 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
INFO 07-04 18:19:38 llm_engine.py:357] # GPU blocks: 1622, # CPU blocks: 1024
INFO 07-04 18:20:00 serving_chat.py:302] Using default chat template:
INFO 07-04 18:20:00 serving_chat.py:302] {% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\n' + system_message + '\n<</SYS>>\n\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content | trim + ' ' + eos_token }}{% endif %}{% endfor %}
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
  4. Get the Pod's IP and send it a chat request; the response received is as follows:
curl 10.233.64.14:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "codellama", "messages": [{"role": "user", "content": "hello"}], "temperature": 0.5, "max_tokens": 100}'
{"id":"cmpl-6b59d12db8a3432cb9d82bc40cbf0202","object":"chat.completion","created":719297,"model":"codellama","choices":[{"index":0,"message":{"role":"assistant","content":"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":10,"total_tokens":110,"completion_tokens":100}}