Commit 0ed05516 authored by huteng.ht

feat: upgrade to sdk v1 latest version



* 70b2701 on master
Signed-off-by: huteng.ht <huteng.ht@bytedance.com>
parent 61d052cb
# veTurboIO

[En](./README.md) | [中文](./README.zh.md)

A Python library developed by Volcano Engine for high-performance reading and writing of PyTorch model files. It is built mainly on the safetensors file format to store and read tensor data efficiently.

## Install

It can be installed directly in the following way:

```bash
pip install veturboio -f https://veturbo-cn-beijing.tos-cn-beijing.volces.com/veturboio/index.html
```

Tip: this command prefers a prebuilt whl that matches the current Python and PyTorch versions; if no matching whl is found, it automatically downloads the source code and compiles it. When installing from source, add `--no-build-isolation` to compile and install in the current environment (otherwise pip will try to create an isolated build environment).

If the installation fails, you can also try downloading the source code, then compiling and installing it manually:

```bash
cd veturboio
python setup.py get_libcfs

# CUDA ops, default
python setup.py install --cuda_ext
# NPU ops
python setup.py install --npu_ext
# CPU only
python setup.py install --cpu_ext
```
## Quick Start

### Read and write model files

```python
import torch
import veturboio

tensors = {
    "weight1": torch.zeros((1024, 1024)),
    "weight2": torch.zeros((1024, 1024)),
}
veturboio.save_file(tensors, "model.safetensors")

reloaded_tensor = veturboio.load("model.safetensors", map_location="cpu")

# check if the tensors are the same
for k, v in tensors.items():
    assert torch.allclose(v, reloaded_tensor[k])
```
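Besides `save_file`, the package also exports `save_model` and `save_pt` (see `veturboio/__init__.py` later in this diff). A minimal sketch of saving a module directly with `save_model`, assuming it round-trips a state dict the same way `save_file` does:

```python
import torch
import veturboio

# save the module's state dict straight from the model object
model = torch.nn.Linear(1024, 1024)
veturboio.save_model(model, "linear.safetensors")

# load it back and restore into a fresh module
state_dict = veturboio.load("linear.safetensors", map_location="cpu")
new_model = torch.nn.Linear(1024, 1024)
new_model.load_state_dict(state_dict)
```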
### Use pinned memory to accelerate consecutive loads to the GPU

```python
import torch
import veturboio

tensors1 = {
    "weight1": torch.zeros((1024, 1024)),
    "weight2": torch.zeros((1024, 1024)),
}
veturboio.save_file(tensors1, "model1.safetensors")

tensors2 = {
    "weight1": torch.zeros((1024, 1024)),
    "weight2": torch.zeros((1024, 1024)),
}
veturboio.save_file(tensors2, "model2.safetensors")

helper = veturboio.init_io_helper()
reloaded_tensor1 = veturboio.load("model1.safetensors", map_location="cuda:0", use_pinmem=True, helper=helper)
# the map_location may be different
reloaded_tensor2 = veturboio.load("model2.safetensors", map_location="cuda:0", use_pinmem=True, helper=helper)

# check if the tensors are the same
for k, v in tensors1.items():
    assert torch.allclose(v.cuda(), reloaded_tensor1[k])
for k, v in tensors2.items():
    assert torch.allclose(v.cuda(), reloaded_tensor2[k])
```
### Convert existing PyTorch files

```bash
python -m veturboio.convert -i model.pt -o model.safetensors
```
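The converter defined in `veturboio/convert.py` (updated later in this diff) also accepts optional flags such as `--dry-run`, `--overwrite`, `--validate-result`, and `--use-direct-io`; for example:

```bash
# rehearse the conversion without writing anything
python -m veturboio.convert -i model.pt -o model.safetensors --dry-run

# convert for real, replacing any existing output and checking that
# every tensor in the output matches the input
python -m veturboio.convert -i model.pt -o model.safetensors --overwrite --validate-result
```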
## Performance test

Run directly:

```bash
bash bench/io_bench.sh
```

Then you will get results like the following:

```
fs_name  tensor_size  veturboio load_time(s)  torch load_time(s)
shm      1073741824   0.08                    0.63
shm      2147483648   0.19                    1.26
shm      4294967296   0.36                    2.32
```

You can also tune the benchmark parameters; run the following command for the full list of options:

```bash
python bench/io_bench.py -h
```
## Features

- [x] Multi-threaded, high-performance file reads
- [x] Zero-copy reads with no extra memory cost
- [x] Loading directly to CUDA devices
- [x] bfloat16 data type support
- [x] Pinned memory support, letting the GPU repeatedly read large files quickly
- [x] Compatible with the standard PyTorch format (no performance gain)
- [x] Compatible with the safetensors format
- [x] A dedicated encrypted storage format

## Advanced Features

### Using veMLP to accelerate reading and writing

Volcano Engine Machine Learning Platform (veMLP) provides a distributed cache file system based on the physical disks of the GPU cluster.

<p align="center">
<img src="./docs/imgs/SFCS.png" style="zoom:15%;">
</p>

When a cluster-level task needs to read a model file, the caching system can efficiently distribute the model file between GPU machines via RDMA transfer, avoiding network transfer bottlenecks. On this system, veTurboIO can maximize its performance advantage.

### Encrypt and decrypt model files

veTurboIO supports encryption and decryption of model files. You can read the [tutorial](./docs/encrypt_model.md) to learn how to keep your model files secure. When you use a GPU as the target device, veTurboIO can decrypt the model file on the fly.

## Benefits

A standard PyTorch model file goes through two operations, zip and pickle, both of which severely limit read speed, and unpickling also carries potential security risks. We use a custom model format to store tensor data in order to address these problems with the standard PyTorch format. The advantages realized so far:

- Multi-threaded reads: model files are now mostly stored in the cloud, and a single process cannot reach the bandwidth ceiling of cloud storage; multi-threaded reads are required to achieve maximum read speed. Reading the standard PyTorch format is bottlenecked by pickle parsing and falls far short of the cloud-storage bandwidth limit.
- Cloud adaptation: built on the characteristics of Volcano Engine cloud storage (vePFS, SFCS) to make maximum use of its bandwidth.
- Security: pickle objects are no longer used, avoiding pickle's security issues.
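Since faster loading is the central claim, here is a minimal timing sketch for comparing the two formats on your own storage. It assumes `model.safetensors` and `model.pt` were written as in the Quick Start; absolute numbers will vary with hardware and file system:

```python
import time

import torch
import veturboio

def timed(fn, *args, **kwargs):
    # run one call and return its wall-clock duration in seconds
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start

st_time = timed(veturboio.load, "model.safetensors", map_location="cpu")
pt_time = timed(torch.load, "model.pt", map_location="cpu")
print(f"veturboio.load: {st_time:.2f}s, torch.load: {pt_time:.2f}s")
```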
## Changelog

See [CHANGELOG](./CHANGELOG.md) for more.

## License

[Apache License 2.0](./LICENSE)
# API
::: veturboio.io
# Encrypting and decrypting model files

Under the hood, this library reads and writes through two interfaces: the SFCS SDK and POSIX. A file path prefixed with `sfcs://` is treated as using the SFCS SDK. The required credentials can be obtained from the `unix domain socket` of the Volcano Engine trusted service, or by setting the following three environment variables:

| Environment variable | Meaning |
| ------------------------------ | --------------------------------- |
| SFCS_ACCESS_KEY | AK of the SFCS file system |
| SFCS_SECRET_KEY | SK of the SFCS file system |
| SFCS_NAMENODE_ENDPOINT_ADDRESS | NameNode address of the SFCS file system |

Encrypting and decrypting model files requires a data key and iv. There are three ways to obtain them, tried in the following order of priority:

- [1] The encrypted data key and iv are stored in the header of the encrypted model file, and Volcano Engine KMS decrypts them into the plaintext data key.
  - [1.1] The AK/SK/ST needed to access KMS is obtained from the unix domain socket of the Volcano Engine trusted service, which requires an extra mount.
  - [1.2] The AK/SK/ST needed to access KMS is read from environment variables.
- [2] The data key and iv are fetched directly from the unix domain socket of the Volcano Engine trusted service, which requires an extra mount.
- [3] The data key and iv are set directly through environment variables.

The environment variables required by each method are as follows:

| Environment variable | Meaning |
| ------------------------------ | --------------------------------- |
| VETURBOIO_KMS_HOST | [1] KMS service address; defaults to open.volcengineapi.com |
| VETURBOIO_KMS_REGION | [1] Region of the KMS service; defaults to cn-beijing |
| VETURBOIO_KMS_KEYRING_NAME | [1] Keyring name used by KMS to decrypt the data key |
| VETURBOIO_KMS_KEY_NAME | [1] Master key name used by KMS to decrypt the data key |
| DATAPIPE_SOCKET_PATH | [1.1][2] Path of the trusted-service unix domain socket |
| VETURBOIO_KMS_ACCESS_KEY | [1.2] AK for KMS authentication |
| VETURBOIO_KMS_SECRET_KEY | [1.2] SK for KMS authentication |
| VETURBOIO_KMS_SESSION_TOKEN | [1.2] Temporary token for KMS authentication; optional |
| VETURBOIO_KEY | [3] base64 encoding of the 128-bit data key |
| VETURBOIO_IV | [3] base64 encoding of the 128-bit initialization vector |
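For instance, a minimal setup for method [3] could look like the following; the key material here is generated on the spot purely for illustration, so use your own managed 16-byte key and iv in practice:

```bash
# Method [3]: pass the 128-bit data key and iv directly as base64.
# /dev/urandom is used only to produce placeholder values.
export VETURBOIO_KEY=$(head -c 16 /dev/urandom | base64)
export VETURBOIO_IV=$(head -c 16 /dev/urandom | base64)
```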
After setting things up in one of the three ways above, you can enable encryption and decryption when reading and writing model files, as in the following code:
```python
import torch
import veturboio

tensors = {
    "weight1": torch.zeros((1024, 1024)),
    "weight2": torch.zeros((1024, 1024)),
}

# use cpu to encrypt
veturboio.save_file(tensors, "sfcs://model.safetensors", use_cipher=True)

# use cpu to decrypt if map_location is cpu
reloaded_tensor1 = veturboio.load("sfcs://model.safetensors", map_location="cpu", use_cipher=True)

# use gpu to decrypt if map_location is cuda
reloaded_tensor2 = veturboio.load("sfcs://model.safetensors", map_location="cuda:0", use_cipher=True)

# check if the tensors are the same
for k, v in tensors.items():
    assert torch.allclose(v, reloaded_tensor1[k])
for k, v in tensors.items():
    assert torch.allclose(v, reloaded_tensor2[k])
```
......@@ -16,19 +16,35 @@ limitations under the License.
import os
import platform
import sys
import requests
import setuptools
import torch
from pkg_resources import parse_version
from setuptools import find_packages, setup
from torch.utils.cpp_extension import BuildExtension, CppExtension, CUDAExtension
from setuptools import Extension, find_packages, setup
from torch.utils.cpp_extension import BuildExtension, CppExtension, include_paths
# initialize variables for compilation
IS_LINUX = platform.system() == "Linux"
IS_DARWIN = platform.system() == "Darwin"
IS_WINDOWS = platform.system() == "Windows"
this_dir = os.path.dirname(os.path.abspath(__file__))
def get_option():
if os.getenv("NPU_EXTENSION_ENABLED", "0") == "1":
sys.argv.append("--npu_ext")
elif "--cuda_ext" not in sys.argv and "--npu_ext" not in sys.argv and "--cpu_ext" not in sys.argv:
print(
'''No known extension specified, default to use --cuda_ext. Currently supported:
--cuda_ext
--npu_ext
--cpu_ext'''
)
sys.argv.append("--cuda_ext")
def get_version():
import importlib.util
......@@ -37,7 +53,12 @@ def get_version():
m = importlib.util.module_from_spec(spec)
spec.loader.exec_module(m)
return m.__version__
if "--cpu_ext" in sys.argv:
return m.__version__ + "+cpu"
elif "--npu_ext" in sys.argv:
return m.__version__ + "+npu"
else:
return m.__version__
def make_relative_rpath(path):
......@@ -50,6 +71,7 @@ def make_relative_rpath(path):
def get_veturboio_extension():
get_option()
# prevent ninja from using too many resources
try:
import psutil
......@@ -71,41 +93,108 @@ def get_veturboio_extension():
# Since PyTorch1.8.0, it has a default value so users do not need
# to pass an empty list anymore.
# More details at https://github.com/pytorch/pytorch/pull/45956
extra_compile_args = {'cxx': [], 'nvcc': ['-O3']}
extra_compile_args = {'cxx': ['-fvisibility=hidden'], 'nvcc': ['-O3']}
if parse_version(torch.__version__) <= parse_version('1.12.1'):
extra_compile_args['cxx'] = ['-std=c++14']
extra_compile_args['cxx'].append('-std=c++14')
else:
extra_compile_args['cxx'] = ['-std=c++17']
extra_compile_args['cxx'].append('-std=c++17')
name = "veturboio_ext"
sources = [
"veturboio/ops/csrc/pybind.cpp",
"veturboio/ops/csrc/posix.cpp",
"veturboio/ops/csrc/sfcs.cpp",
"veturboio/ops/csrc/io_helper_cpu_common.cpp",
"veturboio/ops/csrc/cipher.cpp",
]
include_dirs = include_paths()
include_dirs.append("veturboio/ops/csrc/include")
torch_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
library_dirs = [torch_dir]
library_dirs.append("veturboio/ops/csrc/lib")
libraries = ["cloudfs", ":libfastcrypto_gpu.so.0.3"]
include_dirs = ["veturboio/ops/csrc/include"]
library_dirs = ["veturboio/ops/csrc/lib"]
libraries = ["cfs", ":libfastcrypto_gpu.so.0.3"]
extra_link_args = [make_relative_rpath("veturboio/ops/csrc/lib")]
return CUDAExtension(
name="veturboio_ext",
sources=[
"veturboio/ops/csrc/pybind.cpp",
"veturboio/ops/csrc/load_utils.cpp",
"veturboio/ops/csrc/sfcs.cpp",
"veturboio/ops/csrc/io_helper.cu",
"veturboio/ops/csrc/cipher.cpp",
],
define_macros=define_macros,
include_dirs=include_dirs,
library_dirs=library_dirs,
libraries=libraries,
extra_compile_args=extra_compile_args,
extra_link_args=extra_link_args,
)
# Refer to: https://github.com/pytorch/pytorch/blob/main/torch/utils/cpp_extension.py#L918
# In torch 2.0 this flag is False, and the *.so lib was built with it set to False.
# In newer torch versions the flag is True; to stay compatible with the *.so lib we
# force it to False so that g++ gets '-D_GLIBCXX_USE_CXX11_ABI=0' when building
# veturboio_ext, otherwise 'undefined symbol' errors involving std::string are thrown.
torch._C._GLIBCXX_USE_CXX11_ABI = False
if "--cuda_ext" in sys.argv:
sys.argv.remove("--cuda_ext")
extra_compile_args['nvcc'].append('-O3')
sources.append("veturboio/ops/csrc/io_helper.cu")
define_macros.append(("USE_CUDA", "1"))
from torch.utils.cpp_extension import CUDAExtension
return CUDAExtension(
name=name,
sources=sources,
define_macros=define_macros,
include_dirs=include_dirs,
library_dirs=library_dirs,
libraries=libraries,
extra_compile_args=extra_compile_args,
extra_link_args=extra_link_args,
)
else:
extra_compile_args['cxx'].append('-O3')
libraries.append("torch_cpu")
libraries.append("torch_python")
extra_link_args.append(f"-Wl,--rpath={torch_dir},--enable-new-dtags")
if "--npu_ext" in sys.argv:
sys.argv.remove("--npu_ext")
sources.append("veturboio/ops/csrc/io_helper_npu.cpp")
define_macros.append(("USE_NPU", "1"))
return Extension(
name=name,
sources=sources,
define_macros=define_macros,
include_dirs=include_dirs,
library_dirs=library_dirs,
libraries=libraries,
extra_compile_args=extra_compile_args,
extra_link_args=extra_link_args,
)
elif "--cpu_ext" in sys.argv:
sys.argv.remove("--cpu_ext")
sources.append("veturboio/ops/csrc/io_helper_cpu.cpp")
return Extension(
name=name,
sources=sources,
define_macros=define_macros,
include_dirs=include_dirs,
library_dirs=library_dirs,
libraries=libraries,
extra_compile_args=extra_compile_args,
extra_link_args=extra_link_args,
)
class GetLibCfsCommand(setuptools.Command):
"""get libcfs from url"""
description = 'get libcfs from url'
user_options = [('src=', 's', 'source url of libcfs.so'), ('dst=', 'd', 'dest filepath of libcfs.so')]
user_options = [('src=', 's', 'source url of libcloudfs.so'), ('dst=', 'd', 'dest filepath of libcloudfs.so')]
def initialize_options(self):
from veturboio.utils.load_veturboio_ext import LIBCFS_DEFAULT_PATH, LIBCFS_DEFAULT_URL
......@@ -117,7 +206,7 @@ class GetLibCfsCommand(setuptools.Command):
pass
def run(self):
print(f"download libcfs.so from {self.src}, save to {self.dst}")
print(f"download libcloudfs.so from {self.src}, save to {self.dst}")
r = requests.get(self.src, timeout=60)
with open(self.dst, 'wb') as f:
f.write(r.content)
......@@ -133,10 +222,12 @@ setup(
install_requires=[
"safetensors",
"numpy",
"netifaces",
"loguru",
"requests-unixsocket",
"requests",
],
include_package_data=True,
cmdclass={"get_libcfs": GetLibCfsCommand, "build_ext": BuildExtension},
dependency_links=['https://mirrors.ivolces.com/pypi/'],
)
......@@ -168,7 +168,6 @@ class TestCipherInfo(TestCase):
os.environ.pop(ENV_KMS_SK, None)
DataPipeClient.DATAPIPE_SOCKET_PATH = self.server_address
info = CipherInfo(True, header_bytes)
info = CipherInfo(True, header_bytes)
self.assertTrue(info.use_cipher)
self.assertTrue(info.use_header)
self.assertTrue(np.array_equal(info.key, self.target_key_2))
......@@ -176,7 +175,8 @@ class TestCipherInfo(TestCase):
def test_fetch_from_datapipe(self):
DataPipeClient.DATAPIPE_SOCKET_PATH = self.server_address
info = CipherInfo(True)
DataPipeClient.ENCRYPT_HEADER['X-Encrypt-Caller-Pod'] = 'test-pod-name'
info = CipherInfo(True, None, '/maas_model/test_path')
self.assertTrue(info.use_cipher)
self.assertTrue(np.array_equal(info.key, self.target_key))
self.assertTrue(np.array_equal(info.iv, self.target_iv))
......@@ -190,12 +190,12 @@ class TestCipherInfo(TestCase):
self.assertTrue(np.array_equal(info.key, self.target_key))
self.assertTrue(np.array_equal(info.iv, self.target_iv))
def test_fallback(self):
def test_raise_error(self):
DataPipeClient.DATAPIPE_SOCKET_PATH = '/path/not/exist'
os.environ['VETURBOIO_KEY'] = base64.b64encode(b'abcdefgh12').decode('ascii')
os.environ['VETURBOIO_IV'] = base64.b64encode(b'1234567887').decode('ascii')
info = CipherInfo(True)
self.assertFalse(info.use_cipher)
with self.assertRaises(RuntimeError):
info = CipherInfo(True)
@classmethod
def tearDownClass(cls):
......@@ -232,19 +232,9 @@ class TestCredentials(TestCase):
self.assertEqual(cred['SessionToken'], 'ST' * 12)
def test_sfcs_conf(self):
# case 1: a xml file already exists, do nothing
sfcs_conf = os.path.join(os.getcwd(), 'base_model.xml')
generate_sfcs_conf_xml(sfcs_conf, {'test': 'test'})
init_sfcs_conf('/base_model/tensor.pt')
self.assertEqual(os.environ['LIBCFS_CONF'], sfcs_conf)
self.assertEqual(len(credentials_helper.threads), 0)
self.assertEqual(len(credentials_helper.running), 0)
os.remove(sfcs_conf)
for e in SFCS_REQ_ENV_LIST:
os.environ[e] = 'test-value'
# case 2: env SFCS_ACCESS_KEY and SFCS_SECRET_KEY and SFCS_NAMENODE_ENDPOINT_ADDRESS exists
# case 1: env SFCS_ACCESS_KEY and SFCS_SECRET_KEY and SFCS_NAMENODE_ENDPOINT_ADDRESS exists
os.environ['SFCS_ACCESS_KEY'] = 'A' * 12
os.environ['SFCS_SECRET_KEY'] = 'S' * 12
os.environ['SFCS_NAMENODE_ENDPOINT_ADDRESS'] = '100.67.19.231'
......@@ -252,13 +242,13 @@ class TestCredentials(TestCase):
if os.path.exists(sfcs_conf):
os.remove(sfcs_conf)
init_sfcs_conf('/base_model2/tensor.pt')
self.assertEqual(os.environ['LIBCFS_CONF'], sfcs_conf)
self.assertEqual(os.environ['LIBCLOUDFS_CONF'], sfcs_conf)
self.assertEqual(len(credentials_helper.threads), 0)
self.assertEqual(len(credentials_helper.running), 0)
self.assertTrue(os.path.exists(sfcs_conf))
os.remove(sfcs_conf)
# case 3: use datapipe socket to get and refresh ak, sk, st and namenode_ip
# case 2: use datapipe socket to get and refresh ak, sk, st and namenode_ip
DataPipeClient.DATAPIPE_SOCKET_PATH = self.server_address
os.environ.pop('SFCS_ACCESS_KEY', None)
os.environ.pop('SFCS_SECRET_KEY', None)
......@@ -277,12 +267,15 @@ class TestCredentials(TestCase):
self.assertTrue(credentials_helper.running['base_model4'])
self.assertTrue(os.path.exists(sfcs_conf3))
self.assertTrue(os.path.exists(sfcs_conf4))
for i in range(5):
os.remove(sfcs_conf3)
os.remove(sfcs_conf4)
sleep(3)
self.assertTrue(os.path.exists(sfcs_conf3))
self.assertTrue(os.path.exists(sfcs_conf4))
print(credentials_helper.threads)
os.remove(sfcs_conf3)
os.remove(sfcs_conf4)
sleep(3)
self.assertTrue(os.path.exists(sfcs_conf3))
self.assertTrue(os.path.exists(sfcs_conf4))
print(credentials_helper.threads)
def test_sfcs_conf_json(self):
for e in SFCS_REQ_ENV_LIST:
......@@ -308,17 +301,18 @@ class TestCredentials(TestCase):
self.assertTrue(credentials_helper.running['base_model2'])
self.assertTrue(os.path.exists(sfcs_conf1))
self.assertTrue(os.path.exists(sfcs_conf2))
for i in range(5):
sleep(3)
self.assertTrue(os.path.exists(sfcs_conf1))
self.assertTrue(os.path.exists(sfcs_conf2))
print(credentials_helper.threads)
os.remove(sfcs_conf1)
os.remove(sfcs_conf2)
sleep(3)
self.assertTrue(os.path.exists(sfcs_conf1))
self.assertTrue(os.path.exists(sfcs_conf2))
print(credentials_helper.threads)
@classmethod
def tearDownClass(cls):
credentials_helper.stop()
os.environ.pop('LIBCFS_CONF', None)
os.environ.pop('LIBCLOUDFS_CONF', None)
for e in SFCS_REQ_ENV_LIST:
os.environ.pop(e, None)
for e in SFCS_OPT_ENV_LIST:
......
......@@ -46,19 +46,19 @@ class TestLoad(TestCase):
cls.tensors_0 = {
"weight1": torch.randn(2000, 10),
"weight2": torch.randn(2000, 10),
"weight2": torch.IntTensor(2000, 10),
}
cls.tensors_1 = {
"weight1": torch.randn(2000, 10),
"weight2": torch.randn(2000, 10),
"weight3": torch.randn(2000, 10),
"weight2": torch.IntTensor(2000, 10),
"weight3": torch.BoolTensor(2000, 10),
}
cls.filepath_0 = os.path.join(cls.tempdir.name, "model_0.safetensors")
cls.filepath_1 = os.path.join(cls.tempdir.name, "model_1.safetensors")
veturboio.save_file(cls.tensors_0, cls.filepath_0)
veturboio.save_file(cls.tensors_1, cls.filepath_1)
veturboio.save_file(cls.tensors_1, cls.filepath_1, enable_fast_mode=True)
cls.pt_filepath = os.path.join(cls.tempdir.name, "model.pt")
torch.save(cls.tensors_0, cls.pt_filepath)
......@@ -70,7 +70,7 @@ class TestLoad(TestCase):
cls.filepath_0_enc = os.path.join(cls.tempdir.name, "model_0_enc.safetensors")
cls.filepath_1_enc = os.path.join(cls.tempdir.name, "model_1_enc.safetensors")
veturboio.save_file(cls.tensors_0, cls.filepath_0_enc, use_cipher=True)
veturboio.save_file(cls.tensors_1, cls.filepath_1_enc, use_cipher=True)
veturboio.save_file(cls.tensors_1, cls.filepath_1_enc, use_cipher=True, enable_fast_mode=True)
cls.pt_filepath_enc = os.path.join(cls.tempdir.name, "model_enc.pt")
veturboio.save_pt(cls.tensors_0, cls.pt_filepath_enc, use_cipher=True)
......@@ -82,6 +82,7 @@ class TestLoad(TestCase):
cls.pt_filepath_enc_h = os.path.join(cls.tempdir.name, "model_enc_h.pt")
veturboio.save_pt(cls.tensors_0, cls.pt_filepath_enc_h, use_cipher=True)
del os.environ["VETURBOIO_CIPHER_HEADER"]
if torch.cuda.is_available():
cls.cuda_tensors_0 = deepcopy(cls.tensors_0)
......@@ -94,12 +95,15 @@ class TestLoad(TestCase):
@classmethod
def tearDownClass(cls):
# cls.tempdir.cleanup()
pass
cls.tempdir.cleanup()
def _run_pipeline(self, tensors, filepath, map_location, use_cipher, enable_fast_mode=True):
def _run_pipeline(self, tensors, filepath, map_location, use_cipher, enable_fast_mode=True, state_dict=None):
loaded_tensors = veturboio.load(
filepath, map_location=map_location, use_cipher=use_cipher, enable_fast_mode=enable_fast_mode
filepath,
map_location=map_location,
use_cipher=use_cipher,
enable_fast_mode=enable_fast_mode,
state_dict=state_dict,
)
for key in tensors.keys():
self.assertTrue(torch.allclose(tensors[key], loaded_tensors[key]))
......@@ -110,6 +114,30 @@ class TestLoad(TestCase):
self._run_pipeline(self.tensors_0, self.filepath_0_enc, "cpu", use_cipher=True)
self._run_pipeline(self.tensors_0, self.filepath_0, "cpu", use_cipher=False, enable_fast_mode=False)
self._run_pipeline(self.tensors_0, self.filepath_0_enc, "cpu", use_cipher=True, enable_fast_mode=False)
pre_allocated_tensors = {
"weight1": torch.randn(2000, 10),
"weight2": torch.IntTensor(2000, 10),
}
self._run_pipeline(self.tensors_0, self.filepath_0, "cpu", use_cipher=False, state_dict=pre_allocated_tensors)
self._run_pipeline(
self.tensors_0, self.filepath_0_enc, "cpu", use_cipher=True, state_dict=pre_allocated_tensors
)
self._run_pipeline(
self.tensors_0,
self.filepath_0,
"cpu",
use_cipher=False,
enable_fast_mode=False,
state_dict=pre_allocated_tensors,
)
self._run_pipeline(
self.tensors_0,
self.filepath_0_enc,
"cpu",
use_cipher=True,
enable_fast_mode=False,
state_dict=pre_allocated_tensors,
)
@unittest.skipIf(not torch.cuda.is_available(), "CUDA not available")
def test_pipeline_cuda(self):
......@@ -117,6 +145,32 @@ class TestLoad(TestCase):
self._run_pipeline(self.cuda_tensors_0, self.filepath_0_enc, "cuda:0", use_cipher=True)
self._run_pipeline(self.cuda_tensors_0, self.filepath_0, "cuda:0", use_cipher=False, enable_fast_mode=False)
self._run_pipeline(self.cuda_tensors_0, self.filepath_0_enc, "cuda:0", use_cipher=True, enable_fast_mode=False)
pre_allocated_tensors = {
"weight1": torch.randn(2000, 10).cuda(),
"weight2": torch.IntTensor(2000, 10).cuda(),
}
self._run_pipeline(
self.cuda_tensors_0, self.filepath_0, "cuda:0", use_cipher=False, state_dict=pre_allocated_tensors
)
self._run_pipeline(
self.cuda_tensors_0, self.filepath_0_enc, "cuda:0", use_cipher=True, state_dict=pre_allocated_tensors
)
self._run_pipeline(
self.cuda_tensors_0,
self.filepath_0,
"cuda:0",
use_cipher=False,
enable_fast_mode=False,
state_dict=pre_allocated_tensors,
)
self._run_pipeline(
self.cuda_tensors_0,
self.filepath_0_enc,
"cuda:0",
use_cipher=True,
enable_fast_mode=False,
state_dict=pre_allocated_tensors,
)
def test_read_multi_state_dict_cpu(self):
load_tensor_0 = self._run_pipeline(self.tensors_0, self.filepath_0, "cpu", use_cipher=False)
......@@ -165,16 +219,13 @@ class TestLoad(TestCase):
self.assertTrue(torch.allclose(self.cuda_tensors_0[key], loaded_tensors_enc[key]))
def test_load_cipher_header_cpu(self):
os.environ["VETURBOIO_CIPHER_HEADER"] = "1"
self._run_pipeline(self.tensors_0, self.filepath_0_enc_h, "cpu", use_cipher=True)
self._run_pipeline(self.tensors_0, self.pt_filepath_enc_h, "cpu", use_cipher=True)
self._run_pipeline(self.tensors_0, self.filepath_0_enc_h, "cpu", use_cipher=True, enable_fast_mode=False)
self._run_pipeline(self.tensors_0, self.pt_filepath_enc_h, "cpu", use_cipher=True, enable_fast_mode=False)
del os.environ["VETURBOIO_CIPHER_HEADER"]
@unittest.skipIf(not torch.cuda.is_available(), "CUDA not available")
def test_load_cipher_header_cuda(self):
os.environ["VETURBOIO_CIPHER_HEADER"] = "1"
self._run_pipeline(self.cuda_tensors_0, self.filepath_0_enc_h, "cuda:0", use_cipher=True)
self._run_pipeline(self.cuda_tensors_0, self.pt_filepath_enc_h, "cuda:0", use_cipher=True)
self._run_pipeline(
......@@ -183,12 +234,30 @@ class TestLoad(TestCase):
self._run_pipeline(
self.cuda_tensors_0, self.pt_filepath_enc_h, "cuda:0", use_cipher=True, enable_fast_mode=False
)
del os.environ["VETURBOIO_CIPHER_HEADER"]
def test_load_directIO_fall_back(self):
with tempfile.NamedTemporaryFile(dir="/dev/shm") as tmpFile:
veturboio.save_file(self.tensors_0, tmpFile.file.name)
veturboio.save_file(self.tensors_0, tmpFile.name)
tmpFile.flush()
loaded_tensors = veturboio.load(tmpFile.name, map_location="cpu", use_direct_io=True)
for key in self.tensors_0.keys():
self.assertTrue(torch.allclose(self.tensors_0[key], loaded_tensors[key]))
def test_load_to_shmem(self):
shmem = veturboio.load_to_shmem(self.filepath_0, use_cipher=False)
loaded_tensors = veturboio.load(
os.path.join("/dev/shm/", shmem.name), map_location="cpu", enable_fast_mode=False, use_cipher=False
)
for key in self.tensors_0.keys():
self.assertTrue(torch.allclose(self.tensors_0[key], loaded_tensors[key]))
shmem.close()
shmem.unlink()
shmem = veturboio.load_to_shmem(self.filepath_0_enc, use_cipher=True)
loaded_tensors = veturboio.load(
os.path.join("/dev/shm/", shmem.name), map_location="cpu", enable_fast_mode=False, use_cipher=False
)
for key in self.tensors_0.keys():
self.assertTrue(torch.allclose(self.tensors_0[key], loaded_tensors[key]))
shmem.close()
shmem.unlink()
......@@ -31,7 +31,8 @@ class TestSave(TestCase):
def setUpClass(cls):
cls.tensors_0 = {
"weight1": torch.randn(2000, 10),
"weight2": torch.randn(2000, 10),
"weight2": torch.IntTensor(2000, 10),
"weight3": torch.BoolTensor(2000, 10),
}
class MockModel(torch.nn.Module):
......@@ -46,6 +47,7 @@ class TestSave(TestCase):
cls.tempdir = tempfile.TemporaryDirectory()
cls.filepath_0 = os.path.join(cls.tempdir.name, "model_0.safetensors")
cls.filepath_1 = os.path.join(cls.tempdir.name, "model_0.pt")
cls.filepath_2 = os.path.join(cls.tempdir.name, "model_0_fast.safetensors")
cls.filepath_3 = os.path.join(cls.tempdir.name, "model_1.safetensors")
@classmethod
......@@ -55,7 +57,14 @@ class TestSave(TestCase):
def test_save_file(self):
veturboio.save_file(self.tensors_0, self.filepath_0)
with safe_open(self.filepath_0, framework="pt", device="cpu") as f:
assert len(f.keys()) == 2
assert len(f.keys()) == 3
for key in f.keys():
self.assertTrue(torch.allclose(self.tensors_0[key], f.get_tensor(key)))
# enable fast mode
veturboio.save_file(self.tensors_0, self.filepath_2, enable_fast_mode=True)
with safe_open(self.filepath_2, framework="pt", device="cpu") as f:
assert len(f.keys()) == 3
for key in f.keys():
self.assertTrue(torch.allclose(self.tensors_0[key], f.get_tensor(key)))
......
......@@ -14,7 +14,7 @@ See the License for the specific language governing permissions and
limitations under the License.
'''
from veturboio.io import load, save_file, save_model, save_pt
from veturboio.ops.load_utils import init_io_helper
from veturboio.io import load, load_to_shmem, save_file, save_model, save_pt
from veturboio.ops.io_utils import init_io_helper
__all__ = ["load", "save_file", "save_model", "init_io_helper", "save_pt"]
__all__ = ["load", "load_to_shmem", "save_file", "save_model", "init_io_helper", "save_pt"]
......@@ -15,21 +15,228 @@ limitations under the License.
'''
import argparse
import gc
import logging
import os
import sys
import traceback
from datetime import datetime
import torch
from safetensors.torch import _find_shared_tensors, _is_complete
from veturboio import save_file
import veturboio
parser = argparse.ArgumentParser()
parser.add_argument("--input", "-i", type=str, required=True)
parser.add_argument("--output", "-o", type=str, required=True)
def to_valid_state_dict(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
invalid_key = [k for k, v in state_dict.items() if not isinstance(v, torch.Tensor)]
if len(invalid_key) > 0:
logger.warning(f"invalid keys to removed: {invalid_key}")
state_dict = {k: v for k, v in state_dict.items() if k not in invalid_key}
result = {}
shared_tensor_groups = _find_shared_tensors(state_dict)
for group in shared_tensor_groups:
# check that all shared tensors have the same data ptr and the same shape
shared_tensors = [state_dict[k] for k in group]
data_ptrs = [t.data_ptr() for t in shared_tensors]
shapes = [t.shape for t in shared_tensors]
if len(set(data_ptrs)) != 1 or len(set(shapes)) != 1:
raise Exception(f"shared tensors {group} are not equal")
# make sure these tensors are complete and identical
converted_tensor = shared_tensors[0]
if not _is_complete(converted_tensor):
converted_tensor = converted_tensor.clone()
for t in group:
result[t] = converted_tensor
for k, v in state_dict.items():
if k not in result:
result[k] = v
return result
def add_handlers(logger: logging.Logger):
"""
Add handlers to logger
"""
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter(fmt="[%(levelname)s %(asctime)s] %(filename)s: %(lineno)d %(message)s")
handler.setFormatter(formatter)
logger.addHandler(handler)
def validate_result(input_state_dict: dict[str, torch.Tensor], output_state_dict: dict[str, torch.Tensor]):
input_state_dict = {k: v for k, v in input_state_dict.items() if isinstance(v, torch.Tensor)}
output_state_dict = {k: v for k, v in output_state_dict.items() if isinstance(v, torch.Tensor)}
input_key_set = set(input_state_dict.keys())
output_key_set = set(output_state_dict.keys())
if input_key_set != output_key_set:
not_in_output_key_set = input_key_set - output_key_set
not_in_input_key_set = output_key_set - input_key_set
raise Exception(
f"key set not equal, not in output key set: {not_in_output_key_set}, not in input key set: {not_in_input_key_set}"
)
not_equal_tensor = []
for key in input_state_dict:
if not torch.allclose(input_state_dict[key], output_state_dict[key]):
not_equal_tensor.append(key)
if len(not_equal_tensor) > 0:
raise Exception(f"result is not valid, not equal tensors: {not_equal_tensor}")
logger.info(f"all {len(input_key_set)} keys in state dict are equal")
def _get_available_cpu() -> int:
avail_cpu = os.cpu_count()
if os.path.isfile('/sys/fs/cgroup/cpu/cpu.cfs_quota_us'):
cpu_quota = int(open('/sys/fs/cgroup/cpu/cpu.cfs_quota_us').read().rstrip())
if cpu_quota != -1 and os.path.isfile('/sys/fs/cgroup/cpu/cpu.cfs_period_us'):
cpu_period = int(open('/sys/fs/cgroup/cpu/cpu.cfs_period_us').read().rstrip())
avail_cpu = int(cpu_quota / cpu_period)
logger.info(f"get veturboio thread {avail_cpu} from cgroup info")
return avail_cpu
class Pt2SafeTensorConverter:
def __init__(
self,
input_path: str,
output_path: str,
dry_run: bool,
enable_to_valid_state_dict: bool,
overwrite: bool,
use_direct_io: bool,
):
self.input_path = input_path
self.output_path = output_path
self.dry_run = dry_run
self.enable_to_valid_state_dict = enable_to_valid_state_dict
self.use_direct_io = use_direct_io
if self.input_path.startswith("sfcs://"):
try:
self.input_file_size = veturboio.ops.sfcs_utils.sfcs_get_file_size(self.input_path)
except BaseException as Exp:
raise FileNotFoundError("can't get size of sfcs file", Exp)
else:
if not os.path.exists(self.input_path):
raise Exception(f"file not exist: {self.input_path}")
# convert to abs path
if not os.path.isabs(self.input_path):
self.input_path = os.path.abspath(self.input_path)
self.input_file_size = os.path.getsize(self.input_path)
if not self.input_path.endswith(".pt"):
raise Exception("input file must end with .pt")
if self.output_path is None:
self.output_path = self.input_path.replace(".pt", ".safetensors")
elif not self.output_path.startswith("sfcs://") and not os.path.isabs(self.output_path):
self.output_path = os.path.abspath(self.output_path)
if not self.output_path.endswith(".safetensors"):
raise Exception("output file must end with .safetensors")
if overwrite:
if self.output_path.startswith("sfcs://"):
raise Exception("overwrite flag cannot be set when using sfcs")
if os.path.exists(self.output_path):
logger.info(f"overwrite output file {self.output_path}")
if not dry_run:
os.remove(self.output_path)
elif not self.output_path.startswith("sfcs://") and os.path.exists(self.output_path):
raise Exception(f"output file {self.output_path} already exists")
def convert(self):
logger.info(f"converting {self.input_path} to {self.output_path}")
available_cpus = _get_available_cpu()
ext_name = self.output_path.split(".")[-1]
state_dict = {}
if ext_name != "safetensors":
raise ValueError("output file should be safetensors file")
logger.info(f"start loading the pt file, the pt file has size of {self.input_file_size // 1000 // 1000}MB")
start_time = datetime.now()
if self.dry_run:
logger.info("dry run finished for veturboio.load_pt_file")
else:
state_dict = veturboio.load(
self.input_path, num_thread=available_cpus, use_direct_io=self.use_direct_io, enable_fast_mode=True
)
end_time = datetime.now()
logger.info(f"finish loading the pt file with duration {end_time - start_time}")
logger.info("start saving the safetensors file")
start_time = datetime.now()
if self.dry_run:
logger.info("dry run finished for veturboio.save_safetensors_file")
else:
if self.enable_to_valid_state_dict:
state_dict = to_valid_state_dict(state_dict)
veturboio.save_file(state_dict, self.output_path, force_save_shared_tensor=True)
end_time = datetime.now()
logger.info(f"finish saving the safetensors file with duration {end_time - start_time}")
del state_dict
gc.collect()
logger.info(f"gc finished")
def validate(self):
available_cpus = _get_available_cpu()
logger.info(f"validating if {self.input_path} in equal to {self.output_path}")
input_state_dict = veturboio.load(
self.input_path, num_thread=available_cpus, use_direct_io=self.use_direct_io, enable_fast_mode=True
)
logger.info(f"{self.input_path} loaded")
output_state_dict = veturboio.load(
self.output_path, num_thread=available_cpus, use_direct_io=self.use_direct_io, enable_fast_mode=True
)
logger.info(f"{self.output_path} loaded")
validate_result(input_state_dict, output_state_dict)
if __name__ == "__main__":
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
add_handlers(logger)
parser = argparse.ArgumentParser(description="converter used to convert a .pt model to .safetensors")
parser.add_argument(
"--input",
"-i",
type=str,
required=True,
help="indicate the path of .pt file, both posix path" "and sfcs prefix are supported",
)
parser.add_argument(
"--output",
"-o",
type=str,
required=False,
help="indicate the path of .safeTensor file, both "
"posix path and sfcs prefix are supported."
"will be placed into the same dir of the .pt "
"file if left empty",
)
parser.add_argument("--dry-run", "-d", action="store_true", help="just dry run, not really convert")
parser.add_argument("--overwrite", action="store_true", help="overwrite the output file if it exists")
parser.add_argument(
"--enable-to-valid-state-dict",
action="store_true",
help="execute to_valid_state_dict function before save to .safetensors",
)
parser.add_argument("--validate-result", action="store_true", help="validate result", default=False)
parser.add_argument("--use-direct-io", action="store_true", help="use direct io to load file", default=False)
args = parser.parse_args()
print(f"convert {args.input} to {args.output}")
ext_name = args.output.split(".")[-1]
if ext_name != "safetensors":
raise ValueError("output file should be safetensors file")
state_dict = torch.load(args.input)
save_file(state_dict, args.output, force_save_shared_tensor=True)
instance = Pt2SafeTensorConverter(
args.input, args.output, args.dry_run, args.enable_to_valid_state_dict, args.overwrite, args.use_direct_io
)
try:
instance.convert()
if args.validate_result:
instance.validate()
except Exception as e:
logger.error(f"convert failed.")
traceback.print_exc()
exit(1)
......@@ -15,16 +15,15 @@ limitations under the License.
'''
import os
from multiprocessing import shared_memory
from typing import Dict, Optional
import torch
from loguru import logger
from safetensors.torch import _remove_duplicate_names
from safetensors.torch import save_file as safetenors_save_file
from safetensors.torch import save_model as safetensors_save_model
from veturboio.loader import FasterPosixLoader, PosixLoader, SfcsClientLoader
from veturboio.ops.load_utils import IOHelper
from veturboio.ops.io_utils import IOHelper
from veturboio.safetensors import SafetensorsFile
from veturboio.saver import PosixSaver, SfcsClientSaver
from veturboio.types import FILE_PATH
......@@ -33,6 +32,8 @@ from veturboio.types import FILE_PATH
def is_sfcs_path(file: FILE_PATH):
if len(file) > 7 and file[:7] == "sfcs://":
return True, file[6:]
elif len(file) > 9 and file[:9] == "/dev/shm/":
return False, file
elif os.environ.get("VETURBOIO_USE_SFCS_SDK", "0") == "1":
return True, file
else:
......@@ -47,7 +48,8 @@ def load(
helper: Optional[IOHelper] = None,
use_pinmem: Optional[bool] = False,
use_direct_io: Optional[bool] = False,
use_cipher: Optional[bool] = False,
use_cipher: Optional[bool] = None,
state_dict: Dict[str, torch.Tensor] = None,
) -> Dict:
"""Load state dict object from checkpoint file. The file can be both safetensors file and pytorch file.
If the file is safetensors file, it will be loaded by veturboio and the loading speed will be accelerated.
......@@ -56,10 +58,14 @@ def load(
file (FILE_PATH): file path
map_location (str, optional): map location. Defaults to "cpu".
enable_fast_mode (bool, optional): enable fast mode. Defaults to True.
helper (IOHelper, optional): use IOHelper. Defaults to None.
use_pinmem (bool, optional): use pin memory. Defaults to False.
num_thread (int, optional): number of threads. Defaults to 32.
use_direct_io (bool, optional): open file in direct io mode. Defaults to False.
use_cipher (bool, optional): decrypt file. Defaults to False.
use_cipher (bool, optional): decrypt file. Defaults to None. Note: the cipher is
forcibly disabled when use_cipher is set to False; otherwise it is enabled when
use_cipher is set to True or the environment variable VETURBOIO_USE_CIPHER is set to '1'.
state_dict (Dict, optional): pre-allocated state dict. Defaults to None.
Returns:
state_dict (Dict): state dict
......@@ -97,7 +103,56 @@ def load(
)
safetensors_file = SafetensorsFile(file, loader, use_cipher)
return safetensors_file.load(map_location=map_location)
return safetensors_file.load(map_location=map_location, state_dict=state_dict)
def load_to_shmem(
file: FILE_PATH,
num_thread: Optional[int] = 32,
helper: Optional[IOHelper] = None,
use_direct_io: Optional[bool] = False,
use_cipher: Optional[bool] = None,
) -> shared_memory.SharedMemory:
"""Load checkpoint file to shmem.
Args:
file (FILE_PATH): file path
num_thread (int, optional): number of threads. Defaults to 32.
helper (IOHelper, optional): use IOHelper. Defaults to None.
use_cipher (bool, optional): decrypt file. Defaults to None. Note: the cipher is
forcibly disabled when use_cipher is set to False; otherwise it is enabled when
use_cipher is set to True or the environment variable VETURBOIO_USE_CIPHER is set to '1'.
Returns:
shmem (shared_memory.SharedMemory): shared memory object.
Examples:
```
import veturboio
shmem_file = veturboio.load_to_shmem("sfcs://model.safetensors")
```
"""
if helper is None:
helper = IOHelper()
use_sfcs_sdk, file = is_sfcs_path(file)
if use_sfcs_sdk:
loader = SfcsClientLoader(
helper=helper,
file=file,
num_thread=num_thread,
)
else:
loader = FasterPosixLoader(
file,
helper,
num_thread=num_thread,
use_direct_io=use_direct_io,
)
safetensors_file = SafetensorsFile(file, loader, use_cipher)
return safetensors_file.load_to_shmem()
def save_file(
......@@ -108,6 +163,8 @@ def save_file(
force_clone_shared_tensor: bool = False,
metadata: Dict[str, str] = None,
use_cipher: Optional[bool] = False,
helper: Optional[IOHelper] = None,
enable_fast_mode: Optional[bool] = False,
) -> None:
"""Save state dict object to safetensors file.
......@@ -120,6 +177,8 @@ def save_file(
when force_save_shared_tensor is enabled. Defaults to False.
metadata (Dict[str, str], optional): metadata. Defaults to None.
use_cipher (bool, optional): decrypt file. Defaults to False.
helper (IOHelper, optional): use IOHelper. Defaults to None.
enable_fast_mode (bool, optional): enable fast mode. Defaults to False.
Examples:
```
......@@ -130,18 +189,21 @@ def save_file(
veturboio.save_file(state_dict, "model.safetensors")
```
"""
if helper is None:
helper = IOHelper()
use_sfcs_sdk, file = is_sfcs_path(file)
if use_sfcs_sdk:
saver = SfcsClientSaver(file=file, use_cipher=use_cipher)
saver = SfcsClientSaver(file=file, use_cipher=use_cipher, helper=helper)
else:
saver = PosixSaver(file=file, use_cipher=use_cipher)
saver = PosixSaver(file=file, use_cipher=use_cipher, helper=helper)
# TODO: there are some bugs while state_dict is loaded from veturboio
if not force_save_shared_tensor:
if force_clone_shared_tensor:
logger.warning("force_clone_shared_tensor won't take any effect while force_save_shared_tensor is False;")
try:
saver.save_file(state_dict, metadata=metadata)
saver.save_file(state_dict, metadata=metadata, enable_fast_mode=enable_fast_mode)
except ValueError as e:
msg = str(e)
raise ValueError(msg)
......@@ -165,7 +227,7 @@ def save_file(
if force_contiguous:
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
return saver.save_file(state_dict, metadata=metadata)
return saver.save_file(state_dict, metadata=metadata, enable_fast_mode=enable_fast_mode)
def save_model(model: torch.nn.Module, file: FILE_PATH, use_cipher: Optional[bool] = False) -> None:
......
......@@ -37,7 +37,12 @@ class BaseLoader:
def load_to_bytes(self, offset: int, count: int, cipher_info: CipherInfo = CipherInfo(False)) -> bytes:
raise NotImplementedError
def load_safetensors(self, safetensors_file: Any, map_location: str = "cpu") -> Dict[str, torch.Tensor]:
def load_safetensors(
self,
safetensors_file: Any,
map_location: str = "cpu",
state_dict: Dict[str, torch.Tensor] = None,
) -> Dict[str, torch.Tensor]:
raise NotImplementedError
def init_aligned_tensor(self, device, device_id: int, file_size, base_offset: int) -> torch.Tensor:
......@@ -74,20 +79,24 @@ class PosixLoader(BaseLoader):
decrypt(cipher_info, arr, arr, offset - h_off)
return arr.tobytes()
def load_safetensors(self, safetensors_file: Any, map_location: str = "cpu") -> Dict[str, torch.Tensor]:
state_dict = {}
def load_safetensors(
self,
safetensors_file: Any,
map_location: str = "cpu",
state_dict: Dict[str, torch.Tensor] = None,
) -> Dict[str, torch.Tensor]:
if not state_dict:
state_dict = {}
base_offset = safetensors_file.tensor_offset
device = torch.device(map_location)
cipher_info = safetensors_file._cipher_info
mp_mode = "c" if cipher_info.use_cipher else "r"
for tensor_meta in safetensors_file.meta.values():
tensor_bytes = np.memmap(
safetensors_file.file,
dtype=np.uint8,
mode=mp_mode,
mode="c",
offset=base_offset + tensor_meta.data_offsets[0],
shape=tensor_meta.data_offsets[1] - tensor_meta.data_offsets[0],
)
......
......@@ -16,13 +16,17 @@ limitations under the License.
import io
import os
import random
import string
from multiprocessing import shared_memory
from typing import Dict
import numpy as np
import torch
from veturboio.ops.cipher import CipherInfo, decrypt
from veturboio.ops.load_utils import IOHelper, load_file_to_tensor
from veturboio.ops.io_utils import IOHelper, load_file_to_tensor
from veturboio.ops.posix_utils import posix_read_file
from veturboio.safetensors import SafetensorsFile
from veturboio.types import FILE_PATH
......@@ -45,7 +49,10 @@ class FasterPosixLoader(PosixLoader):
self.use_direct_io = use_direct_io
def load_safetensors(
self, safetensors_file: SafetensorsFile, map_location: str = "cpu"
self,
safetensors_file: SafetensorsFile,
map_location: str = "cpu",
state_dict: Dict[str, torch.Tensor] = None,
) -> Dict[str, torch.Tensor]:
file_size = os.path.getsize(safetensors_file.file)
base_offset = safetensors_file.tensor_offset
......@@ -55,22 +62,70 @@ class FasterPosixLoader(PosixLoader):
else:
device_id = -1
total_tensor = self.init_aligned_tensor(device, device_id, file_size, base_offset)
load_file_to_tensor(
file_path=safetensors_file.file,
total_tensor=total_tensor,
sample_tensor=torch.ones([], dtype=torch.uint8),
offset=base_offset,
helper=self.helper,
device_id=device_id,
if state_dict:
for tensor_meta in safetensors_file._meta.values():
tensor = state_dict[tensor_meta.name]
if not tensor.is_contiguous():
raise RuntimeError("allocated tensor not contiguous")
if not tensor.dtype == tensor_meta.dtype:
raise RuntimeError("allocated tensor dtype not match")
offset = tensor_meta.data_offsets[0]
length = tensor_meta.data_offsets[1] - tensor_meta.data_offsets[0]
tensor_length = torch.numel(tensor) * tensor.element_size()
if tensor_length < length:
raise RuntimeError("allocated tensor size not enough")
load_file_to_tensor(
file_path=safetensors_file.file,
total_tensor=tensor,
length=length,
offset=base_offset + offset,
helper=self.helper,
device_id=device_id,
num_thread=self.num_thread,
use_pinmem=self.use_pinmem,
use_sfcs_sdk=False,
use_direct_io=self.use_direct_io,
cipher_info=safetensors_file._cipher_info,
)
tensor = tensor.resize_(tensor_meta.shape)
state_dict[tensor_meta.name] = tensor
return state_dict
else:
total_tensor = self.init_aligned_tensor(device, device_id, file_size, base_offset)
load_file_to_tensor(
file_path=safetensors_file.file,
total_tensor=total_tensor,
offset=base_offset,
helper=self.helper,
device_id=device_id,
num_thread=self.num_thread,
use_pinmem=self.use_pinmem,
use_sfcs_sdk=False,
use_direct_io=self.use_direct_io,
cipher_info=safetensors_file._cipher_info,
)
return SafetensorsFile.split_tensor_to_state_dict(total_tensor, safetensors_file)
def load_to_shmem(self, cipher_info: CipherInfo = CipherInfo(False)) -> shared_memory.SharedMemory:
file_size = os.path.getsize(self.file)
file_name = ''.join(random.sample(string.ascii_lowercase + string.ascii_uppercase, 10))
shm = shared_memory.SharedMemory(name=file_name, create=True, size=file_size)
h_off = CipherInfo.HEADER_SIZE if cipher_info.use_header else 0
candidate = np.frombuffer(shm.buf, dtype=np.byte)
posix_read_file(
self.file,
candidate,
length=file_size - h_off,
offset=h_off,
num_thread=self.num_thread,
use_pinmem=self.use_pinmem,
use_sfcs_sdk=False,
cipher_info=cipher_info,
use_direct_io=self.use_direct_io,
cipher_info=safetensors_file._cipher_info,
)
return SafetensorsFile.split_tensor_to_state_dict(total_tensor, safetensors_file)
return shm
def load_pt(
self, map_location: str = "cpu", cipher_info: CipherInfo = CipherInfo(False)
......
......@@ -15,7 +15,10 @@ limitations under the License.
'''
import os
import random
import string
from io import BytesIO
from multiprocessing import shared_memory
from typing import Dict
import numpy as np
......@@ -24,8 +27,14 @@ from numpy import ndarray
from veturboio.loader.base_loader import BaseLoader
from veturboio.ops.cipher import CipherInfo
from veturboio.ops.load_utils import IOHelper, load_file_to_tensor
from veturboio.ops.sfcs_utils import init_sfcs_conf, sfcs_get_file_size, sfcs_read_file
from veturboio.ops.io_utils import IOHelper, load_file_to_tensor
from veturboio.ops.sfcs_utils import (
init_sfcs_conf,
path_mapper,
sfcs_default_config,
sfcs_get_file_size,
sfcs_read_file,
)
from veturboio.safetensors import SafetensorsFile
from veturboio.types import FILE_PATH
......@@ -46,52 +55,110 @@ class SfcsClientLoader(BaseLoader):
self.num_thread = num_thread
self.use_pinmem = use_pinmem
self.use_direct_io = use_direct_io
init_sfcs_conf(file)
self._mount_path = init_sfcs_conf(file)
self._sfcs_valid_path = path_mapper(self.file, self._mount_path)
def load_to_bytes(self, offset: int, count: int, cipher_info: CipherInfo = CipherInfo(False)) -> bytes:
file_size = sfcs_get_file_size(self.file)
file_size = sfcs_get_file_size(self._sfcs_valid_path)
if offset + count > file_size:
count = file_size - offset
file_bytes = bytes(count)
candidate = np.frombuffer(file_bytes, dtype=np.byte)
sfcs_read_file(
self.file, candidate, length=count, offset=offset, num_thread=self.num_thread, cipher_info=cipher_info
self._sfcs_valid_path,
candidate,
length=count,
offset=offset,
num_thread=self.num_thread,
cipher_info=cipher_info,
)
return file_bytes
def load_to_shmem(self, cipher_info: CipherInfo = CipherInfo(False)) -> shared_memory.SharedMemory:
file_size = sfcs_get_file_size(self._sfcs_valid_path)
file_name = ''.join(random.sample(string.ascii_lowercase + string.ascii_uppercase, 10))
shm = shared_memory.SharedMemory(name=file_name, create=True, size=file_size)
h_off = CipherInfo.HEADER_SIZE if cipher_info.use_header else 0
candidate = np.frombuffer(shm.buf, dtype=np.byte)
sfcs_read_file(
self._sfcs_valid_path,
candidate,
length=file_size - h_off,
offset=h_off,
num_thread=self.num_thread,
cipher_info=cipher_info,
)
return shm
def load_safetensors(
self, safetensors_file: SafetensorsFile, map_location: str = "cpu"
self,
safetensors_file: SafetensorsFile,
map_location: str = "cpu",
state_dict: Dict[str, torch.Tensor] = None,
) -> Dict[str, torch.Tensor]:
file_size = sfcs_get_file_size(safetensors_file.file)
# TODO should be the same as self.loader
sfcs_valid_path = path_mapper(safetensors_file.file, self._mount_path)
file_size = sfcs_get_file_size(sfcs_valid_path)
base_offset = safetensors_file.tensor_offset
device = torch.device(map_location)
if device.type == "cuda":
device_id = device.index if device.index is not None else torch.cuda.current_device()
else:
device_id = -1
total_tensor = self.init_aligned_tensor(device, device_id, file_size, base_offset)
load_file_to_tensor(
file_path=safetensors_file.file,
total_tensor=total_tensor,
sample_tensor=torch.ones([], dtype=torch.uint8),
offset=base_offset,
helper=self.helper,
device_id=device_id,
num_thread=self.num_thread,
use_pinmem=self.use_pinmem,
use_sfcs_sdk=True,
use_direct_io=self.use_direct_io,
cipher_info=safetensors_file._cipher_info,
)
if state_dict:
for tensor_meta in safetensors_file._meta.values():
tensor = state_dict[tensor_meta.name]
if not tensor.is_contiguous():
raise RuntimeError("allocated tensor not contiguous")
if not tensor.dtype == tensor_meta.dtype:
raise RuntimeError("allocated tensor dtype not match")
offset = tensor_meta.data_offsets[0]
length = tensor_meta.data_offsets[1] - tensor_meta.data_offsets[0]
tensor_length = torch.numel(tensor) * tensor.element_size()
if tensor_length < length:
raise RuntimeError("allocated tensor size not enough")
load_file_to_tensor(
file_path=sfcs_valid_path,
total_tensor=tensor,
length=length,
offset=base_offset + offset,
helper=self.helper,
device_id=device_id,
num_thread=self.num_thread,
use_pinmem=self.use_pinmem,
use_sfcs_sdk=True,
use_direct_io=self.use_direct_io,
cipher_info=safetensors_file._cipher_info,
)
tensor = tensor.resize_(tensor_meta.shape)
state_dict[tensor_meta.name] = tensor
return state_dict
else:
total_tensor = self.init_aligned_tensor(device, device_id, file_size, base_offset)
load_file_to_tensor(
file_path=sfcs_valid_path,
total_tensor=total_tensor,
offset=base_offset,
helper=self.helper,
device_id=device_id,
num_thread=self.num_thread,
use_pinmem=self.use_pinmem,
use_sfcs_sdk=True,
use_direct_io=self.use_direct_io,
cipher_info=safetensors_file._cipher_info,
)
return SafetensorsFile.split_tensor_to_state_dict(total_tensor, safetensors_file)
def load_pt(
self, map_location: str = "cpu", cipher_info: CipherInfo = CipherInfo(False)
) -> Dict[str, torch.Tensor]:
file_size = sfcs_get_file_size(self.file)
file_size = sfcs_get_file_size(self._sfcs_valid_path)
h_off = CipherInfo.HEADER_SIZE if cipher_info.use_header else 0
file_bytes = self.load_to_bytes(offset=h_off, count=file_size - h_off, cipher_info=cipher_info)
return torch.load(BytesIO(file_bytes), map_location=map_location)