dist\_checkpointing package
===========================

A library for saving and loading distributed checkpoints.
A "distributed checkpoint" can use various underlying formats (the current default format is based on Zarr),
but it has one distinctive property: a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism)
can be loaded in a different parallel configuration.
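
The resharding property can be illustrated with a small, library-independent sketch.
It is not the ``dist_checkpointing`` API: the helpers ``shard_bounds``, ``save_sharded`` and ``load_sharded``
are hypothetical names, and a plain dictionary stands in for checkpoint storage. The assumption is that each
shard is stored together with its offset in the global tensor, so a reader with a different number of ranks
can reassemble exactly the slice it owns.

.. code-block:: python

   from typing import Dict, List, Tuple

   # Hypothetical in-memory "checkpoint": (tensor key, global offset) -> chunk.
   Storage = Dict[Tuple[str, int], List[int]]


   def shard_bounds(global_len: int, num_ranks: int, rank: int) -> Tuple[int, int]:
       """Return the [start, stop) range of a 1-D tensor owned by ``rank``."""
       base, extra = divmod(global_len, num_ranks)
       start = rank * base + min(rank, extra)
       stop = start + base + (1 if rank < extra else 0)
       return start, stop


   def save_sharded(key: str, shards: List[List[int]], storage: Storage) -> None:
       """Write every rank's local shard under its global offset."""
       offset = 0
       for shard in shards:
           storage[(key, offset)] = list(shard)
           offset += len(shard)


   def load_sharded(key: str, global_len: int, num_ranks: int, rank: int,
                    storage: Storage) -> List[int]:
       """Assemble the slice owned by ``rank`` from whichever chunks overlap it."""
       start, stop = shard_bounds(global_len, num_ranks, rank)
       out: List[int] = []
       for (chunk_key, offset), chunk in sorted(storage.items()):
           if chunk_key != key:
               continue
           lo = max(start, offset)
           hi = min(stop, offset + len(chunk))
           if lo < hi:  # this chunk overlaps the requested range
               out.extend(chunk[lo - offset:hi - offset])
       return out


   # Save a 10-element "weight" that was split across 2 ranks ...
   weight = list(range(10))
   storage: Storage = {}
   save_sharded("layer.weight", [weight[:5], weight[5:]], storage)

   # ... and load it back split across 3 ranks.
   reloaded = [load_sharded("layer.weight", len(weight), 3, r, storage) for r in range(3)]

Because the stored chunks are addressed by global offset rather than by the rank that wrote them,
the loading side does not need to know how many ranks produced the checkpoint.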

Using the library requires defining sharded ``state_dict`` dictionaries with functions from the *mapping* and *optimizer* modules.
These state dicts can then be saved or loaded with the *serialization* module, using strategies from the *strategies* module.

Safe Checkpoint Loading
-----------------------

Since **PyTorch 2.6**, ``torch.load`` defaults to ``weights_only=True``.
With this setting, only tensors, primitive types, and allow-listed classes are unpickled, which reduces the risk of arbitrary code execution.

If you encounter an error such as:

.. code-block:: text

   WeightsUnpickler error: Unsupported global: GLOBAL argparse.Namespace was not an allowed global by default.

you can fix it by explicitly allow-listing the missing class in your script:

.. code-block:: python

   import argparse
   import torch

   # Allow-list argparse.Namespace before calling torch.load on the checkpoint.
   torch.serialization.add_safe_globals([argparse.Namespace])
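
To see why an allow-list resolves this kind of error, the mechanism can be sketched with Python's standard
``pickle`` module. This is an analogy, not PyTorch's implementation: ``AllowListUnpickler`` and
``restricted_loads`` are hypothetical names, and the sketch only assumes that the loader refuses to resolve
any global that has not been explicitly allowed.

.. code-block:: python

   import argparse
   import io
   import pickle


   class AllowListUnpickler(pickle.Unpickler):
       """Unpickler that resolves only the globals named in an allow-list."""

       def __init__(self, file, allowed=()):
           super().__init__(file)
           # Map (module, qualified name) -> class for every allowed global.
           self._allowed = {(c.__module__, c.__qualname__): c for c in allowed}

       def find_class(self, module, name):
           # Called whenever the pickle stream references a class or function.
           try:
               return self._allowed[(module, name)]
           except KeyError:
               raise pickle.UnpicklingError(
                   f"Unsupported global: {module}.{name}"
               ) from None


   def restricted_loads(data, allowed=()):
       """Unpickle ``data``, rejecting any global that is not in ``allowed``."""
       return AllowListUnpickler(io.BytesIO(data), allowed).load()


   payload = pickle.dumps({"args": argparse.Namespace(lr=0.1), "step": 3})

   # Without the allow-list, the Namespace class is rejected.
   try:
       restricted_loads(payload)
       rejected = False
   except pickle.UnpicklingError:
       rejected = True

   # With the class allow-listed, the same payload loads normally.
   restored = restricted_loads(payload, allowed=[argparse.Namespace])

Only allow-list classes whose construction you trust, since allowing a class also allows any pickle stream to instantiate it.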


Subpackages
-----------

.. toctree::
   :maxdepth: 4

   dist_checkpointing.strategies

Submodules
----------

dist\_checkpointing.serialization module
----------------------------------------

.. automodule:: core.dist_checkpointing.serialization
   :members:
   :undoc-members:
   :show-inheritance:

dist\_checkpointing.mapping module
----------------------------------

.. automodule:: core.dist_checkpointing.mapping
   :members:
   :undoc-members:
   :show-inheritance:

dist\_checkpointing.optimizer module
------------------------------------

.. automodule:: core.dist_checkpointing.optimizer
   :members:
   :undoc-members:
   :show-inheritance:

dist\_checkpointing.core module
-------------------------------

.. automodule:: core.dist_checkpointing.core
   :members:
   :undoc-members:
   :show-inheritance:

dist\_checkpointing.dict\_utils module
--------------------------------------

.. automodule:: core.dist_checkpointing.dict_utils
   :members:
   :undoc-members:
   :show-inheritance:


dist\_checkpointing.utils module
--------------------------------

.. automodule:: core.dist_checkpointing.utils
   :members:
   :undoc-members:
   :show-inheritance:

Module contents
---------------

.. automodule:: core.dist_checkpointing
   :members:
   :undoc-members:
   :show-inheritance: