Unverified commit aba8d6ee, authored by Harry Mellor, committed by GitHub

[Doc] Move examples into categories (#11840)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 2a0596bc
...@@ -20,12 +20,12 @@ Before incorporating the FP8 datatype for inference workloads, you must adhere t ...@@ -20,12 +20,12 @@ Before incorporating the FP8 datatype for inference workloads, you must adhere t
### 2. Convert HF model into a quantized HF model. ### 2. Convert HF model into a quantized HF model.
Note: The following steps are adapted from the [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/README.md). Note: The following steps are adapted from the [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/README.md).
`quantize.py` (examples/fp8/quantizer/quantize.py) uses the quantization toolkit (AMMO) to calibrate the PyTorch models and export TensorRT-LLM checkpoints. Each TensorRT-LLM checkpoint contains a config file (in .json format) and one or several rank weight files (in .safetensors format). `quantize.py` (examples/other/fp8/quantizer/quantize.py) uses the quantization toolkit (AMMO) to calibrate the PyTorch models and export TensorRT-LLM checkpoints. Each TensorRT-LLM checkpoint contains a config file (in .json format) and one or several rank weight files (in .safetensors format).
The detailed quantization toolkit (AMMO) conversion guide for FP8 can be found at `examples/fp8/quantizer/README.md`. The detailed quantization toolkit (AMMO) conversion guide for FP8 can be found at `examples/other/fp8/quantizer/README.md`.
### 3. Extract KV Cache Scaling Factors from quantized HF model. ### 3. Extract KV Cache Scaling Factors from quantized HF model.
`extract_scales.py` (examples/fp8/extract_scales.py) can be utilized to extract the KV cache scaling factors from your quantized HF model, however at the moment, this tool exclusively supports Llama 2 models. It is also important to note the following: `extract_scales.py` (examples/other/fp8/extract_scales.py) can be utilized to extract the KV cache scaling factors from your quantized HF model, however at the moment, this tool exclusively supports Llama 2 models. It is also important to note the following:
1. **File Structure**: The utility operates under the assumption that all parameters, including KV cache scaling factors, corresponding to a particular Tensor Parallelism (TP) rank are stored in a single file. These files must adhere to a specific naming convention where the TP rank is immediately identified after a specific keyword (e.g., "rank") in the filename. 1. **File Structure**: The utility operates under the assumption that all parameters, including KV cache scaling factors, corresponding to a particular Tensor Parallelism (TP) rank are stored in a single file. These files must adhere to a specific naming convention where the TP rank is immediately identified after a specific keyword (e.g., "rank") in the filename.
2. **TP Decomposition**: The utility assumes consistency between the TP decomposition employed by the quantizer tool and that used by vLLM. 2. **TP Decomposition**: The utility assumes consistency between the TP decomposition employed by the quantizer tool and that used by vLLM.
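
The file naming convention in item 1 can be illustrated with a minimal sketch. It assumes the TP rank is the run of digits directly following the keyword; the helper name `tp_rank_from_filename` and the sample filenames are hypothetical, and the actual parsing in `extract_scales.py` may differ.

```python
import re
from typing import Optional


def tp_rank_from_filename(filename: str, keyword: str = "rank") -> Optional[int]:
    """Return the integer that directly follows `keyword` in `filename`.

    Hypothetical helper for illustration only; it is not taken from
    extract_scales.py.
    """
    # re.escape guards against keywords containing regex metacharacters.
    match = re.search(re.escape(keyword) + r"(\d+)", filename)
    return int(match.group(1)) if match else None


# Sample filenames are assumptions, not names mandated by the tool.
for name in ("rank0.safetensors", "llama_tp2_rank1.safetensors", "config.json"):
    print(name, tp_rank_from_filename(name))
```

Files without the keyword (such as the config file) yield `None`, so a caller can skip them when grouping weight files by rank.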
...@@ -35,7 +35,7 @@ The detailed quantization toolkit (AMMO) conversion guide for FP8 can be found a ...@@ -35,7 +35,7 @@ The detailed quantization toolkit (AMMO) conversion guide for FP8 can be found a
```python ```python
# prerequisites: # prerequisites:
# - Quantized HF LLaMa 2 model # - Quantized HF LLaMa 2 model
python3 examples/fp8/extract_scales.py --help python3 examples/other/fp8/extract_scales.py --help
Usage: extract_scales.py [-h] --quantized_model QUANTIZED_MODEL [--load_format {auto,safetensors,npz,pt}] [--output_dir OUTPUT_DIR] [--output_name OUTPUT_NAME] [--tp_size TP_SIZE] Usage: extract_scales.py [-h] --quantized_model QUANTIZED_MODEL [--load_format {auto,safetensors,npz,pt}] [--output_dir OUTPUT_DIR] [--output_name OUTPUT_NAME] [--tp_size TP_SIZE]
KV Scale Extraction Example KV Scale Extraction Example
...@@ -52,7 +52,7 @@ Optional arguments: ...@@ -52,7 +52,7 @@ Optional arguments:
``` ```
```python ```python
Example: Example:
python3 examples/fp8/extract_scales.py --quantized_model <QUANTIZED_MODEL_DIR> --tp_size <TENSOR_PARALLEL_SIZE> --output_dir <PATH_TO_OUTPUT_DIR> python3 examples/other/fp8/extract_scales.py --quantized_model <QUANTIZED_MODEL_DIR> --tp_size <TENSOR_PARALLEL_SIZE> --output_dir <PATH_TO_OUTPUT_DIR>
``` ```
### 4. Load KV Cache Scaling Factors into VLLM. ### 4. Load KV Cache Scaling Factors into VLLM.
This script evaluates the inference throughput of language models using various backends such as vLLM. It measures the time taken to process a given number of prompts and generate sequences for each prompt. The recently generated KV cache scaling factors are now integrated into the benchmarking process and allow for KV cache scaling factors to be utilized for FP8. This script evaluates the inference throughput of language models using various backends such as vLLM. It measures the time taken to process a given number of prompts and generate sequences for each prompt. The recently generated KV cache scaling factors are now integrated into the benchmarking process and allow for KV cache scaling factors to be utilized for FP8.
......
...@@ -25,7 +25,7 @@ https://github.com/coreweave/tensorizer ...@@ -25,7 +25,7 @@ https://github.com/coreweave/tensorizer
To serialize a model, install vLLM from source, then run something To serialize a model, install vLLM from source, then run something
like this from the root level of this repository: like this from the root level of this repository:
python -m examples.tensorize_vllm_model \ python -m examples.offline_inference.tensorize_vllm_model \
--model facebook/opt-125m \ --model facebook/opt-125m \
serialize \ serialize \
--serialized-directory s3://my-bucket \ --serialized-directory s3://my-bucket \
...@@ -45,7 +45,7 @@ providing a `--keyfile` argument. ...@@ -45,7 +45,7 @@ providing a `--keyfile` argument.
To deserialize a model, you can run something like this from the root To deserialize a model, you can run something like this from the root
level of this repository: level of this repository:
python -m examples.tensorize_vllm_model \ python -m examples.offline_inference.tensorize_vllm_model \
--model EleutherAI/gpt-j-6B \ --model EleutherAI/gpt-j-6B \
--dtype float16 \ --dtype float16 \
deserialize \ deserialize \
...@@ -63,11 +63,11 @@ shard's rank. Sharded models serialized with this script will be named as ...@@ -63,11 +63,11 @@ shard's rank. Sharded models serialized with this script will be named as
model-rank-%03d.tensors model-rank-%03d.tensors
For more information on the available arguments for serializing, run For more information on the available arguments for serializing, run
`python -m examples.tensorize_vllm_model serialize --help`. `python -m examples.offline_inference.tensorize_vllm_model serialize --help`.
Or for deserializing: Or for deserializing:
`python -m examples.tensorize_vllm_model deserialize --help`. `python -m examples.offline_inference.tensorize_vllm_model deserialize --help`.
Once a model is serialized, tensorizer can be invoked with the `LLM` class Once a model is serialized, tensorizer can be invoked with the `LLM` class
directly to load models: directly to load models:
...@@ -88,7 +88,7 @@ TensorizerConfig arguments desired. ...@@ -88,7 +88,7 @@ TensorizerConfig arguments desired.
In order to see all of the available arguments usable to configure In order to see all of the available arguments usable to configure
loading with tensorizer that are given to `TensorizerConfig`, run: loading with tensorizer that are given to `TensorizerConfig`, run:
`python -m examples.tensorize_vllm_model deserialize --help` `python -m examples.offline_inference.tensorize_vllm_model deserialize --help`
under the `tensorizer options` section. These can also be used for under the `tensorizer options` section. These can also be used for
deserialization in this example script, although `--tensorizer-uri` and deserialization in this example script, although `--tensorizer-uri` and
......
...@@ -20,7 +20,7 @@ build-backend = "setuptools.build_meta" ...@@ -20,7 +20,7 @@ build-backend = "setuptools.build_meta"
line-length = 80 line-length = 80
exclude = [ exclude = [
# External file, leaving license intact # External file, leaving license intact
"examples/fp8/quantizer/quantize.py" "examples/other/fp8/quantizer/quantize.py"
] ]
[tool.ruff.lint.per-file-ignores] [tool.ruff.lint.per-file-ignores]
......
...@@ -5,7 +5,7 @@ def test_platform_plugins(): ...@@ -5,7 +5,7 @@ def test_platform_plugins():
import os import os
example_file = os.path.join( example_file = os.path.join(
os.path.dirname(os.path.dirname(os.path.dirname(current_file))), os.path.dirname(os.path.dirname(os.path.dirname(current_file))),
"examples", "offline_inference.py") "examples", "offline_inference/offline_inference.py")
runpy.run_path(example_file) runpy.run_path(example_file)
# check if the plugin is loaded correctly # check if the plugin is loaded correctly
......
...@@ -163,8 +163,8 @@ def test_deserialized_hf_model_has_same_outputs(hf_runner, vllm_runner, ...@@ -163,8 +163,8 @@ def test_deserialized_hf_model_has_same_outputs(hf_runner, vllm_runner,
def test_vllm_model_can_load_with_lora(vllm_runner, tmp_path): def test_vllm_model_can_load_with_lora(vllm_runner, tmp_path):
multilora_inference = import_from_path( multilora_inference = import_from_path(
"examples.multilora_inference", "examples.offline_inference.multilora_inference",
EXAMPLES_PATH / "multilora_inference.py", EXAMPLES_PATH / "offline_inference/multilora_inference.py",
) )
model_ref = "meta-llama/Llama-2-7b-hf" model_ref = "meta-llama/Llama-2-7b-hf"
......
...@@ -31,7 +31,7 @@ if __name__ == "__main__": ...@@ -31,7 +31,7 @@ if __name__ == "__main__":
type=str, type=str,
required=True, required=True,
help="json trace file output by " help="json trace file output by "
"examples/offline_profile.py") "examples/offline_inference/offline_profile.py")
parser.add_argument("--phase", parser.add_argument("--phase",
type=str, type=str,
required=True, required=True,
......
...@@ -534,11 +534,11 @@ def main( ...@@ -534,11 +534,11 @@ def main(
if __name__ == "__main__": if __name__ == "__main__":
parser = argparse.ArgumentParser() parser = argparse.ArgumentParser()
parser.add_argument( parser.add_argument("--json-trace",
"--json-trace", type=str,
type=str, required=True,
required=True, help="json trace file output by \
help="json trace file output by examples/offline_profile.py") examples/offline_inference/offline_profile.py")
parser.add_argument("--output-directory", parser.add_argument("--output-directory",
type=str, type=str,
required=False, required=False,
......
...@@ -22,7 +22,7 @@ NOTE: If you want to not only transfer KV caches, but adjust the model execution ...@@ -22,7 +22,7 @@ NOTE: If you want to not only transfer KV caches, but adjust the model execution
## Disaggregated prefilling ## Disaggregated prefilling
The example usage is in [this file](../../../examples/disaggregated_prefill.sh). The example usage is in [this file](../../../examples/online_serving/disaggregated_prefill.sh).
Here is the diagram of how we run disaggregated prefilling. Here is the diagram of how we run disaggregated prefilling.
......
...@@ -452,9 +452,9 @@ class TensorizerLoader(BaseModelLoader): ...@@ -452,9 +452,9 @@ class TensorizerLoader(BaseModelLoader):
"""Load a serialized model with tensorizer to the CPU. """Load a serialized model with tensorizer to the CPU.
This is only necessary when the model isn't vLLM-tensorized (see This is only necessary when the model isn't vLLM-tensorized (see
examples/tensorize_vllm_model.py) This should still be faster than examples/other/tensorize_vllm_model.py) This should still
default HuggingFace loading, but will be slower than loading a be faster than default HuggingFace loading, but will be slower than
vLLM-tensorized model. loading a vLLM-tensorized model.
""" """
device_config = vllm_config.device_config device_config = vllm_config.device_config
model_config = vllm_config.model_config model_config = vllm_config.model_config
...@@ -472,7 +472,7 @@ class TensorizerLoader(BaseModelLoader): ...@@ -472,7 +472,7 @@ class TensorizerLoader(BaseModelLoader):
"""Load a serialized model with tensorizer. """Load a serialized model with tensorizer.
Expects a vLLM-tensorized model. See the Expects a vLLM-tensorized model. See the
examples/tensorize_vllm_model.py example script examples/other/tensorize_vllm_model.py example script
for serializing vLLM models.""" for serializing vLLM models."""
device_config = vllm_config.device_config device_config = vllm_config.device_config
...@@ -529,7 +529,8 @@ class ShardedStateLoader(BaseModelLoader): ...@@ -529,7 +529,8 @@ class ShardedStateLoader(BaseModelLoader):
Model loader that directly loads each worker's model state dict, which Model loader that directly loads each worker's model state dict, which
enables a fast load path for large tensor-parallel models where each worker enables a fast load path for large tensor-parallel models where each worker
only needs to read its own shard rather than the entire checkpoint. See only needs to read its own shard rather than the entire checkpoint. See
`examples/save_sharded_state.py` for creating a sharded checkpoint. `examples/offline_inference/save_sharded_state.py` for creating a sharded
checkpoint.
""" """
DEFAULT_PATTERN = "model-rank-{rank}-part-{part}.safetensors" DEFAULT_PATTERN = "model-rank-{rank}-part-{part}.safetensors"
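
As a rough illustration of the per-worker shard naming, the pattern above can be expanded with `str.format`. This is a sketch only: the helper `shard_filename` is hypothetical, and how `ShardedStateLoader` actually consumes the pattern is not shown in this diff.

```python
# Pattern string copied from the diff above.
DEFAULT_PATTERN = "model-rank-{rank}-part-{part}.safetensors"


def shard_filename(rank: int, part: int, pattern: str = DEFAULT_PATTERN) -> str:
    """Build the filename for one part of one worker's shard.

    Hypothetical helper for illustration; not part of vLLM.
    """
    return pattern.format(rank=rank, part=part)


# Each tensor-parallel worker would only need the files carrying its own rank.
for rank in range(2):
    print(shard_filename(rank=rank, part=0))
```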
......
...@@ -155,7 +155,7 @@ class TensorizerArgs: ...@@ -155,7 +155,7 @@ class TensorizerArgs:
encryption_keyfile: File path to a binary file containing a encryption_keyfile: File path to a binary file containing a
binary key to use for decryption. `None` (the default) means binary key to use for decryption. `None` (the default) means
no decryption. See the example script in no decryption. See the example script in
examples/tensorize_vllm_model.py. examples/other/tensorize_vllm_model.py.
s3_access_key_id: The access key for the S3 bucket. Can also be set via s3_access_key_id: The access key for the S3 bucket. Can also be set via
the S3_ACCESS_KEY_ID environment variable. the S3_ACCESS_KEY_ID environment variable.
s3_secret_access_key: The secret access key for the S3 bucket. Can also s3_secret_access_key: The secret access key for the S3 bucket. Can also
...@@ -363,12 +363,12 @@ class TensorizerAgent: ...@@ -363,12 +363,12 @@ class TensorizerAgent:
def tensorizer_weights_iterator( def tensorizer_weights_iterator(
tensorizer_args: "TensorizerArgs" tensorizer_args: "TensorizerArgs"
) -> Generator[Tuple[str, torch.Tensor], None, None]: ) -> Generator[Tuple[str, torch.Tensor], None, None]:
logger.warning( logger.warning("Deserializing HuggingFace models is not optimized for "
"Deserializing HuggingFace models is not optimized for " "loading on vLLM, as tensorizer is forced to load to CPU. "
"loading on vLLM, as tensorizer is forced to load to CPU. " "Consider deserializing a vLLM model instead for faster "
"Consider deserializing a vLLM model instead for faster " "load times. See the "
"load times. See the examples/tensorize_vllm_model.py example " "examples/other/tensorize_vllm_model.py example script "
"script for serializing vLLM models.") "for serializing vLLM models.")
deserializer_args = tensorizer_args.deserializer_params deserializer_args = tensorizer_args.deserializer_params
stream_params = tensorizer_args.stream_params stream_params = tensorizer_args.stream_params
......
...@@ -503,7 +503,8 @@ def kv_cache_scales_loader( ...@@ -503,7 +503,8 @@ def kv_cache_scales_loader(
KV cache scaling factors. The serialization should represent a dictionary KV cache scaling factors. The serialization should represent a dictionary
whose keys are the TP ranks and values are another dictionary mapping layers whose keys are the TP ranks and values are another dictionary mapping layers
to their KV cache scaling factors. to their KV cache scaling factors.
Keep this function in sync with the output of examples/fp8/extract_scales.py Keep this function in sync with the output of
examples/other/fp8/extract_scales.py
""" """
try: try:
with open(filename) as f: with open(filename) as f:
......