OpenDAS / vllm_cscc
Commit 211fe91a (unverified)
Authored Oct 30, 2024 by Woosuk Kwon; committed by GitHub, Oct 30, 2024
[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
Parent: 6aa6020f
Showing 3 changed files with 11 additions and 10 deletions (+11 -10)
Dockerfile.tpu (+1 -1)
docs/source/getting_started/tpu-installation.rst (+2 -2)
vllm/worker/tpu_worker.py (+8 -7)
Dockerfile.tpu
-ARG NIGHTLY_DATE="20240828"
+ARG NIGHTLY_DATE="20241017"
 ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_$NIGHTLY_DATE"
 FROM $BASE_IMAGE
...
docs/source/getting_started/tpu-installation.rst
...
@@ -56,8 +56,8 @@ First, install the dependencies:
 $ pip uninstall torch torch-xla -y
 $ # Install PyTorch and PyTorch XLA.
-$ export DATE="20240828"
-$ export TORCH_VERSION="2.5.0"
+$ export DATE="20241017"
+$ export TORCH_VERSION="2.6.0"
 $ pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-${TORCH_VERSION}.dev${DATE}-cp310-cp310-linux_x86_64.whl
 $ pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-${TORCH_VERSION}.dev${DATE}-cp310-cp310-linux_x86_64.whl
...
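The two `pip install` lines in the documentation change build their wheel URLs from `DATE` and `TORCH_VERSION`. As a quick check of what the new values expand to, here is a small sketch that performs the same substitution in Python. The helper name `wheel_url` and the constant `BASE_URL` are illustrative only and are not part of the commit.

```python
# Values taken from the updated tpu-installation.rst in this commit.
DATE = "20241017"
TORCH_VERSION = "2.6.0"

# Hypothetical constant: the URL prefix shared by both pip install lines.
BASE_URL = "https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm"


def wheel_url(package: str) -> str:
    """Mirror the shell substitution used in the install instructions."""
    return (f"{BASE_URL}/{package}-{TORCH_VERSION}.dev{DATE}"
            "-cp310-cp310-linux_x86_64.whl")


torch_url = wheel_url("torch")
torch_xla_url = wheel_url("torch_xla")
print(torch_url)
print(torch_xla_url)
```

Both packages share the same version and nightly date, so changing the two exported variables is enough to move the pair of wheels together.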
vllm/worker/tpu_worker.py
...
@@ -133,18 +133,19 @@ class TPUWorker(LoraNotSupportedWorkerBase, LocalOrDistributedWorkerBase):
         # Synchronize before measuring the memory usage.
         xm.wait_device_ops()

-        dtype_btyes = get_dtype_size(self.cache_dtype)
-        block_size = self.cache_config.block_size
-        block_size_bytes = (dtype_btyes * block_size * num_layers * 2 *
-                            head_size * num_kv_heads)
-
-        # Calculate the TPU KV cache size based on profiling.
+        # Get the maximum amount of memory used by the model weights and
+        # intermediate activations.
         m = xm.get_memory_info(self.device)
         total_memory_size = m["bytes_limit"]
+        profiled = m["peak_bytes_used"]  # Weights + intermediate activations.
+
+        # Calculate the TPU KV cache size based on profiling.
         usable_memory_size = int(total_memory_size *
                                  self.cache_config.gpu_memory_utilization)
-        profiled = m["bytes_used"]  # Weights + intermediate activations.
         tpu_kv_cache_bytes = max(usable_memory_size - profiled, 0)
+        dtype_btyes = get_dtype_size(self.cache_dtype)
+        block_size_bytes = (dtype_btyes * self.cache_config.block_size *
+                            num_layers * 2 * head_size * num_kv_heads)
         num_tpu_blocks = tpu_kv_cache_bytes // block_size_bytes
         num_tpu_blocks = (num_tpu_blocks // 8) * 8  # Round down to 8.
...
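To see why switching from `bytes_used` to `peak_bytes_used` matters, the block-count arithmetic in the hunk above can be run in isolation. The following is a minimal sketch, not the worker's actual code: it takes a plain dict in place of the result of `xm.get_memory_info`, the function name `compute_num_tpu_blocks` and its `usage_key` parameter are hypothetical, and the memory figures are made up for illustration.

```python
def compute_num_tpu_blocks(
    memory_info: dict,
    memory_utilization: float,
    dtype_bytes: int,
    block_size: int,
    num_layers: int,
    head_size: int,
    num_kv_heads: int,
    usage_key: str = "peak_bytes_used",
) -> int:
    """Estimate how many KV cache blocks fit in the memory left after profiling."""
    total_memory_size = memory_info["bytes_limit"]
    # Memory already claimed by weights and intermediate activations.
    profiled = memory_info[usage_key]

    usable_memory_size = int(total_memory_size * memory_utilization)
    kv_cache_bytes = max(usable_memory_size - profiled, 0)

    # Factor of 2: each layer stores both a key and a value tensor.
    block_size_bytes = (dtype_bytes * block_size * num_layers * 2 *
                        head_size * num_kv_heads)
    num_blocks = kv_cache_bytes // block_size_bytes
    return (num_blocks // 8) * 8  # Round down to a multiple of 8.


# Made-up numbers: a 16 GiB device whose usage peaked at 9 GiB during the
# profiling run but had dropped back to 6 GiB by the time it was measured.
example_info = {
    "bytes_limit": 16 * 1024**3,
    "bytes_used": 6 * 1024**3,
    "peak_bytes_used": 9 * 1024**3,
}
model_shape = dict(dtype_bytes=2, block_size=16, num_layers=32,
                   head_size=128, num_kv_heads=8)

blocks_from_peak = compute_num_tpu_blocks(example_info, 0.9, **model_shape)
blocks_from_current = compute_num_tpu_blocks(
    example_info, 0.9, usage_key="bytes_used", **model_shape)
print(blocks_from_peak, blocks_from_current)
```

With these illustrative numbers, sizing the cache from current usage yields noticeably more blocks than sizing it from peak usage. The extra blocks would occupy memory that the transient activations need again at run time, which is the over-allocation the commit title describes fixing.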