Unverified commit f60f2931, authored by Jiří Suchomel, committed by GitHub

[k8s] Clarified the usage of shared memory. (#4341)

parent 17000d2b
......@@ -39,6 +39,8 @@ spec:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
        - name: hf-cache
          mountPath: /root/.cache/huggingface
          readOnly: true
......@@ -52,6 +54,10 @@ spec:
          initialDelaySeconds: 30
          periodSeconds: 10
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 10Gi
      - name: hf-cache
        hostPath:
          path: /root/.cache/huggingface
......
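The manifest change above mounts a memory-backed `emptyDir` at `/dev/shm` with a `sizeLimit` of `10Gi`. A minimal sketch of how one might confirm, from inside the running container, that the mount is at least that large follows. The helper names `parse_quantity` and `shm_is_large_enough` are hypothetical and not part of sglang or Kubernetes, and only binary suffixes (`Ki`, `Mi`, `Gi`, `Ti`) are handled. Depending on the cluster version and configuration, the reported tmpfs size may be larger than `sizeLimit`, so the check is "at least", not "exactly".

```python
import shutil

# Binary suffixes used by Kubernetes resource quantities (subset only;
# decimal suffixes such as "G" or "M" are deliberately not handled here).
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '10Gi' (or a plain byte count) to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def shm_is_large_enough(required: str, path: str = "/dev/shm") -> bool:
    """Return True if the filesystem at `path` has at least `required` capacity."""
    total_bytes = shutil.disk_usage(path).total
    return total_bytes >= parse_quantity(required)


if __name__ == "__main__":
    # "10Gi" mirrors the sizeLimit in the manifest; adjust it to your own value.
    ok = shm_is_large_enough("10Gi")
    print("/dev/shm is large enough" if ok else "/dev/shm is smaller than expected")
```

Running the script in the pod (for example through `kubectl exec`) gives a quick signal of whether the volume was mounted as intended.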
......@@ -21,6 +21,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```
- See [hyperparameter tuning](hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- For Docker and Kubernetes runs, you need to set up shared memory, which is used for communication between processes. Use `--shm-size` for Docker; for Kubernetes, update the `/dev/shm` size in the manifest.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```bash
......