"tools/vscode:/vscode.git/clone" did not exist on "3a5a2010f01d9ad71fcfc9a8456de502dd66dc4e"
Commit 524a1b6e authored by mashun
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
datasets/*
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu22.04-dtk24.04.3-py3.10
MIT License
Copyright (c) 2022 Huilin Qu, Congqiao Li, Sitian Qian
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# particle_transformer
## Paper
`Particle Transformer for Jet Tagging`
* https://arxiv.org/abs/2202.03772
## Model Architecture
The model is a Transformer-based architecture, enhanced with pairwise particle interaction features that are incorporated into the multi-head attention as a bias before the softmax.
<img src="readme_imgs/arch.png" style="zoom:70%">
## Algorithm
The algorithm processes particle-cloud data with a Transformer architecture. The self-attention mechanism captures the complex relations between particles and learns global features of the jet.
<img src="readme_imgs/alg.png" style="zoom:100%">
## Environment Setup
### Docker (Option 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu22.04-dtk24.04.3-py3.10
docker run --shm-size 50g --network=host --name=pt --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to this project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -r requirements.txt
```
### Dockerfile (Option 2)
```bash
docker build -t <IMAGE_NAME>:<TAG> .
docker run --shm-size 50g --network=host --name=pt --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to this project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -r requirements.txt
```
### Anaconda (Option 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded from the HPC developer community: https://developer.hpccube.com/tool/
   - DTK driver: dtk24.04.3
   - python: python3.10
   - torch: 2.1.0
   - torchvision: 0.16.0

   Tip: the DTK driver, python, and torch versions above must match each other exactly.
2. Install the remaining (non-DCU-specific) libraries from requirements.txt:
```bash
pip install -r requirements.txt
```
## Datasets
The datasets can be downloaded with the provided script, or via the SCNet high-speed mirrors:
```bash
./get_datasets.py [JetClass|QuarkGluon|TopLandscape] [-d DATA_DIR]
```
SCNet high-speed mirrors:
[JetClass](http://113.200.138.88:18080/aidatasets/project-dependency/jetclass)
| [QuarkGluon](http://113.200.138.88:18080/aidatasets/project-dependency/quarkgluon)
| [TopLandscape](http://113.200.138.88:18080/aidatasets/project-dependency/toplandscape)
```
datasets/
├── JetClass
│   └── Pythia
│       ├── test_20M
│       ├── train_100M
│       └── val_5M
├── QuarkGluon
│   ├── test_file_*.parquet
│   └── train_file_*.parquet
└── TopLandscape
    ├── test_file.parquet
    ├── train_file.parquet
    └── val_file.parquet
```
## Training
```bash
pip install 'weaver-core>=0.4'
```
```bash
# train on the JetClass dataset
./train_JetClass.sh [ParT|PN|PFN|PCNN] [kin|kinpid|full] ...
# train on the QuarkGluon dataset
./train_QuarkGluon.sh [ParT|ParT-FineTune|PN|PN-FineTune|PFN|PCNN] [kin|kinpid|kinpidplus] ...
# train on the TopLandscape dataset
./train_TopLandscape.sh [ParT|ParT-FineTune|PN|PN-FineTune|PFN|PCNN] [kin] ...
```
The first argument selects the network:
- ParT: Particle Transformer
- PN: ParticleNet
- PFN: Particle Flow Network
- PCNN: P-CNN
- xxx-FineTune: fine-tune from a pre-trained model

The second argument selects the input feature set:
- kin: only kinematic inputs
- kinpid: kinematic inputs + particle identification
- full: kinematic inputs + particle identification + trajectory displacement
### Multi-GPU Training
```bash
# DP - pytorch
./train_JetClass.sh ParT full --gpus 0,1,2,3 --batch-size [total_batch_size] ...
# DDP - pytorch
DDP_NGPUS=4 ./train_JetClass.sh ParT full --batch-size [batch_size_per_gpu] ...
```
## Inference
```bash
bash test_QuarkGluon_demo.sh
```
Note: this inference script is for reference only; see [weaver-core](https://github.com/hqucms/weaver-core) for details.
## Results
![result](readme_imgs/result.png)
### Accuracy
All runs use ParT. The results below only illustrate the difference in training accuracy between accelerators under identical configurations; they do not represent the best achievable accuracy. Only AvgAcc is recorded.

|Accelerator|JetClass|QuarkGluon(kinpidplus)|TopLandscape(kin)|
|:---:|:---:|:---:|:---:|
|K100_AI|0.6219|0.8495|0.93975|
|GPU|0.620|0.84921|0.93987|
## Application Scenarios
### Algorithm Category
`AI for science`
### Target Industries
`high-energy physics, healthcare, finance`
## Source Repository & Feedback
* https://developer.sourcefind.cn/codes/modelzoo/particle_transformer_pytorch
## References
* https://github.com/jet-universe/particle_transformer
* https://github.com/hqucms/weaver-core
# Particle Transformer
This repo is the official implementation of "[Particle Transformer for Jet Tagging](https://arxiv.org/abs/2202.03772)". It includes the code, pre-trained models, and the JetClass dataset.
![jet-tagging](figures/jet-tagging.png)
## Updates
### 2023/07/06
We added a [helper function](dataloader.py) to read the JetClass dataset into regular numpy arrays. To use it, simply download the file [dataloader.py](dataloader.py) and do:
```python
from dataloader import read_file
x_particles, x_jets, y = read_file(filepath)
```
The return values are:
- `x_particles`: a zero-padded numpy array of particle-level features in the shape `(num_jets, num_particle_features, max_num_particles)`.
- `x_jets`: a numpy array of jet-level features in the shape `(num_jets, num_jet_features)`.
- `y`: a one-hot encoded numpy array of the truth labels in the shape `(num_jets, num_classes)`.
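If integer class indices are needed instead of the one-hot encoding, `argmax` along the class axis recovers them (a small illustrative snippet, using a toy 3-class array in place of a real JetClass file):

```python
import numpy as np

# toy stand-in for the one-hot `y` returned by read_file
y = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
class_idx = y.argmax(axis=1)  # -> array([1, 0, 2])
```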
## Introduction
### JetClass dataset
**[JetClass](https://zenodo.org/record/6619768)** is a new large-scale jet tagging dataset proposed in "[Particle Transformer for Jet Tagging](https://arxiv.org/abs/2202.03772)". It consists of 100M jets for training, 5M for validation and 20M for testing. The dataset contains 10 classes of jets, simulated with [MadGraph](https://launchpad.net/mg5amcnlo) + [Pythia](https://pythia.org/) + [Delphes](https://cp3.irmp.ucl.ac.be/projects/delphes):
![dataset](figures/dataset.png)
### Particle Transformer (ParT)
The **Particle Transformer (ParT)** architecture is described in "[Particle Transformer for Jet Tagging](https://arxiv.org/abs/2202.03772)", which can serve as a general-purpose backbone for jet tagging and similar tasks in particle physics. It is a Transformer-based architecture, enhanced with pairwise particle interaction features that are incorporated in the multi-head attention as a bias before softmax. The ParT architecture outperforms the previous state-of-the-art, ParticleNet, by a large margin on various jet tagging benchmarks.
![arch](figures/arch.png)
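The bias mechanism can be sketched as a minimal single-head attention in numpy (illustrative names and a deliberate simplification; the actual multi-head implementation lives in the ParT model code): the pairwise interaction matrix `u` is added to the scaled dot-product logits before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(q, k, v, u):
    """Single-head attention over n particles; `u` is an (n, n) bias
    derived from pairwise interaction features."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + u  # bias added before softmax
    return softmax(logits) @ v
```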
## Getting started
### Download the datasets
To download the JetClass/QuarkGluon/TopLandscape datasets:
```
./get_datasets.py [JetClass|QuarkGluon|TopLandscape] [-d DATA_DIR]
```
After download, the dataset paths will be updated in the `env.sh` file.
### Training
The ParT models are implemented in PyTorch and the training is based on the [weaver](https://github.com/hqucms/weaver-core) framework for dataset loading and transformation. To install `weaver`, run:
```
pip install 'weaver-core>=0.4'
```
**To run the training on the JetClass dataset:**
```
./train_JetClass.sh [ParT|PN|PFN|PCNN] [kin|kinpid|full] ...
```
where the first argument is the model:
- ParT: [Particle Transformer](https://arxiv.org/abs/2202.03772)
- PN: [ParticleNet](https://arxiv.org/abs/1902.08570)
- PFN: [Particle Flow Network](https://arxiv.org/abs/1810.05165)
- PCNN: [P-CNN](https://arxiv.org/abs/1902.09914)
and the second argument is the input feature set:
- [kin](data/JetClass/JetClass_kin.yaml): only kinematic inputs
- [kinpid](data/JetClass/JetClass_kinpid.yaml): kinematic inputs + particle identification
- [full](data/JetClass/JetClass_full.yaml) (_default_): kinematic inputs + particle identification + trajectory displacement
Additional arguments will be passed directly to the `weaver` command, such as `--batch-size`, `--start-lr`, `--gpus`, etc., and will override existing arguments in `train_JetClass.sh`.
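For reference, the bracketed numbers in the feature-set YAML files (e.g. `[part_pt_log, 1.7, 0.7]`) are per-variable standardization parameters. A sketch of the transform they appear to describe, assuming shift by `subtract_by`, scale by `multiply_by`, then clip; the authoritative behavior is in `weaver-core`:

```python
import numpy as np

def standardize(x, subtract_by=0.0, multiply_by=1.0, clip_min=-5.0, clip_max=5.0):
    # shift, scale, then clip to the configured range
    return np.clip((x - subtract_by) * multiply_by, clip_min, clip_max)

# e.g. the `[part_pt_log, 1.7, 0.7]` entry:
standardize(np.array([0.0, 1.7, 10.0]), 1.7, 0.7)
# -> [-1.19, 0.0, 5.0]  (the last value, 5.81, is clipped to 5)
```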
**Multi-gpu support:**
- using PyTorch's DataParallel multi-gpu training:
```
./train_JetClass.sh ParT full --gpus 0,1,2,3 --batch-size [total_batch_size] ...
```
- using PyTorch's DistributedDataParallel:
```
DDP_NGPUS=4 ./train_JetClass.sh ParT full --batch-size [batch_size_per_gpu] ...
```
**To run the training on the QuarkGluon dataset:**
```
./train_QuarkGluon.sh [ParT|ParT-FineTune|PN|PN-FineTune|PFN|PCNN] [kin|kinpid|kinpidplus] ...
```
**To run the training on the TopLandscape dataset:**
```
./train_TopLandscape.sh [ParT|ParT-FineTune|PN|PN-FineTune|PFN|PCNN] [kin] ...
```
The argument `ParT-FineTune` or `PN-FineTune` will run the fine-tuning using [models pre-trained on the JetClass dataset](models/).
## Citations
If you use the Particle Transformer code and/or the JetClass dataset, please cite:
```
@InProceedings{Qu:2022mxj,
author = "Qu, Huilin and Li, Congqiao and Qian, Sitian",
title = "{Particle Transformer} for Jet Tagging",
booktitle = "{Proceedings of the 39th International Conference on Machine Learning}",
pages = "18281--18292",
year = "2022",
eprint = "2202.03772",
archivePrefix = "arXiv",
primaryClass = "hep-ph"
}
@dataset{JetClass,
author = "Qu, Huilin and Li, Congqiao and Qian, Sitian",
title = "{JetClass}: A Large-Scale Dataset for Deep Learning in Jet Physics",
month = "jun",
year = "2022",
publisher = "Zenodo",
version = "1.0.0",
doi = "10.5281/zenodo.6619768",
url = "https://doi.org/10.5281/zenodo.6619768"
}
```
Additionally, if you use the ParticleNet model, please cite:
```
@article{Qu:2019gqs,
author = "Qu, Huilin and Gouskos, Loukas",
title = "{ParticleNet: Jet Tagging via Particle Clouds}",
eprint = "1902.08570",
archivePrefix = "arXiv",
primaryClass = "hep-ph",
doi = "10.1103/PhysRevD.101.056019",
journal = "Phys. Rev. D",
volume = "101",
number = "5",
pages = "056019",
year = "2020"
}
```
For the QuarkGluon dataset, please cite:
```
@article{Komiske:2018cqr,
author = "Komiske, Patrick T. and Metodiev, Eric M. and Thaler, Jesse",
title = "{Energy Flow Networks: Deep Sets for Particle Jets}",
eprint = "1810.05165",
archivePrefix = "arXiv",
primaryClass = "hep-ph",
reportNumber = "MIT-CTP 5064",
doi = "10.1007/JHEP01(2019)121",
journal = "JHEP",
volume = "01",
pages = "121",
year = "2019"
}
@dataset{komiske_patrick_2019_3164691,
author = {Komiske, Patrick and
Metodiev, Eric and
Thaler, Jesse},
title = {Pythia8 Quark and Gluon Jets for Energy Flow},
month = may,
year = 2019,
publisher = {Zenodo},
version = {v1},
doi = {10.5281/zenodo.3164691},
url = {https://doi.org/10.5281/zenodo.3164691}
}
```
For the TopLandscape dataset, please cite:
```
@article{Kasieczka:2019dbj,
author = "Butter, Anja and others",
editor = "Kasieczka, Gregor and Plehn, Tilman",
title = "{The Machine Learning landscape of top taggers}",
eprint = "1902.09914",
archivePrefix = "arXiv",
primaryClass = "hep-ph",
doi = "10.21468/SciPostPhys.7.1.014",
journal = "SciPost Phys.",
volume = "7",
pages = "014",
year = "2019"
}
@dataset{kasieczka_gregor_2019_2603256,
author = {Kasieczka, Gregor and
Plehn, Tilman and
Thompson, Jennifer and
Russel, Michael},
title = {Top Quark Tagging Reference Dataset},
month = mar,
year = 2019,
publisher = {Zenodo},
version = {v0 (2018\_03\_27)},
doi = {10.5281/zenodo.2603256},
url = {https://doi.org/10.5281/zenodo.2603256}
}
```
selection:
   ### use `&`, `|`, `~` for logical operations on numpy arrays
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression

new_variables:
   ### [format] name: formula
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression
   part_mask: ak.ones_like(part_energy)
   part_pt: np.hypot(part_px, part_py)
   part_pt_log: np.log(part_pt)
   part_e_log: np.log(part_energy)
   part_logptrel: np.log(part_pt/jet_pt)
   part_logerel: np.log(part_energy/jet_energy)
   part_deltaR: np.hypot(part_deta, part_dphi)
   part_d0: np.tanh(part_d0val)
   part_dz: np.tanh(part_dzval)

preprocess:
   ### method: [manual, auto] - whether to use manually specified parameters for variable standardization
   method: manual
   ### data_fraction: fraction of events to use when calculating the mean/scale for the standardization
   data_fraction: 0.5

inputs:
   pf_points:
      length: 128
      pad_mode: wrap
      vars:
         - [part_deta, null]
         - [part_dphi, null]
   pf_features:
      length: 128
      pad_mode: wrap
      vars:
         ### [format 1]: var_name (no transformation)
         ### [format 2]: [var_name,
         ###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),
         ###              multiply_by(optional, default=1),
         ###              clip_min(optional, default=-5),
         ###              clip_max(optional, default=5),
         ###              pad_value(optional, default=0)]
         - [part_pt_log, 1.7, 0.7]
         - [part_e_log, 2.0, 0.7]
         - [part_logptrel, -4.7, 0.7]
         - [part_logerel, -4.7, 0.7]
         - [part_deltaR, 0.2, 4.0]
         - [part_charge, null]
         - [part_isChargedHadron, null]
         - [part_isNeutralHadron, null]
         - [part_isPhoton, null]
         - [part_isElectron, null]
         - [part_isMuon, null]
         - [part_d0, null]
         - [part_d0err, 0, 1, 0, 1]
         - [part_dz, null]
         - [part_dzerr, 0, 1, 0, 1]
         - [part_deta, null]
         - [part_dphi, null]
   pf_vectors:
      length: 128
      pad_mode: wrap
      vars:
         - [part_px, null]
         - [part_py, null]
         - [part_pz, null]
         - [part_energy, null]
   pf_mask:
      length: 128
      pad_mode: constant
      vars:
         - [part_mask, null]

labels:
   ### type can be `simple`, `custom`
   ### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels
   type: simple
   value: [label_QCD, label_Hbb, label_Hcc, label_Hgg, label_H4q, label_Hqql, label_Zqq, label_Wqq, label_Tbqq, label_Tbl]
   ### [option 2] otherwise use `custom` to define the label, then `value` is a map
   # type: custom
   # value:
   #    truth_label: label.argmax(1)

observers:
   - jet_pt
   - jet_eta
   - jet_phi
   - jet_energy
   - jet_nparticles
   - jet_sdmass
   - jet_tau1
   - jet_tau2
   - jet_tau3
   - jet_tau4

weights:
selection:
   ### use `&`, `|`, `~` for logical operations on numpy arrays
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression

new_variables:
   ### [format] name: formula
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression
   part_mask: ak.ones_like(part_energy)
   part_pt: np.hypot(part_px, part_py)
   part_pt_log: np.log(part_pt)
   part_e_log: np.log(part_energy)
   part_logptrel: np.log(part_pt/jet_pt)
   part_logerel: np.log(part_energy/jet_energy)
   part_deltaR: np.hypot(part_deta, part_dphi)

preprocess:
   ### method: [manual, auto] - whether to use manually specified parameters for variable standardization
   method: manual
   ### data_fraction: fraction of events to use when calculating the mean/scale for the standardization
   data_fraction: 0.5

inputs:
   pf_points:
      length: 128
      pad_mode: wrap
      vars:
         - [part_deta, null]
         - [part_dphi, null]
   pf_features:
      length: 128
      pad_mode: wrap
      vars:
         ### [format 1]: var_name (no transformation)
         ### [format 2]: [var_name,
         ###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),
         ###              multiply_by(optional, default=1),
         ###              clip_min(optional, default=-5),
         ###              clip_max(optional, default=5),
         ###              pad_value(optional, default=0)]
         - [part_pt_log, 1.7, 0.7]
         - [part_e_log, 2.0, 0.7]
         - [part_logptrel, -4.7, 0.7]
         - [part_logerel, -4.7, 0.7]
         - [part_deltaR, 0.2, 4.0]
         - [part_deta, null]
         - [part_dphi, null]
   pf_vectors:
      length: 128
      pad_mode: wrap
      vars:
         - [part_px, null]
         - [part_py, null]
         - [part_pz, null]
         - [part_energy, null]
   pf_mask:
      length: 128
      pad_mode: constant
      vars:
         - [part_mask, null]

labels:
   ### type can be `simple`, `custom`
   ### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels
   type: simple
   value: [label_QCD, label_Hbb, label_Hcc, label_Hgg, label_H4q, label_Hqql, label_Zqq, label_Wqq, label_Tbqq, label_Tbl]
   ### [option 2] otherwise use `custom` to define the label, then `value` is a map
   # type: custom
   # value:
   #    truth_label: label.argmax(1)

observers:
   - jet_pt
   - jet_eta
   - jet_phi
   - jet_energy
   - jet_nparticles
   - jet_sdmass
   - jet_tau1
   - jet_tau2
   - jet_tau3
   - jet_tau4

weights:
selection:
   ### use `&`, `|`, `~` for logical operations on numpy arrays
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression

new_variables:
   ### [format] name: formula
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression
   part_mask: ak.ones_like(part_energy)
   part_pt: np.hypot(part_px, part_py)
   part_pt_log: np.log(part_pt)
   part_e_log: np.log(part_energy)
   part_logptrel: np.log(part_pt/jet_pt)
   part_logerel: np.log(part_energy/jet_energy)
   part_deltaR: np.hypot(part_deta, part_dphi)

preprocess:
   ### method: [manual, auto] - whether to use manually specified parameters for variable standardization
   method: manual
   ### data_fraction: fraction of events to use when calculating the mean/scale for the standardization
   data_fraction: 0.5

inputs:
   pf_points:
      length: 128
      pad_mode: wrap
      vars:
         - [part_deta, null]
         - [part_dphi, null]
   pf_features:
      length: 128
      pad_mode: wrap
      vars:
         ### [format 1]: var_name (no transformation)
         ### [format 2]: [var_name,
         ###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),
         ###              multiply_by(optional, default=1),
         ###              clip_min(optional, default=-5),
         ###              clip_max(optional, default=5),
         ###              pad_value(optional, default=0)]
         - [part_pt_log, 1.7, 0.7]
         - [part_e_log, 2.0, 0.7]
         - [part_logptrel, -4.7, 0.7]
         - [part_logerel, -4.7, 0.7]
         - [part_deltaR, 0.2, 4.0]
         - [part_charge, null]
         - [part_isChargedHadron, null]
         - [part_isNeutralHadron, null]
         - [part_isPhoton, null]
         - [part_isElectron, null]
         - [part_isMuon, null]
         - [part_deta, null]
         - [part_dphi, null]
   pf_vectors:
      length: 128
      pad_mode: wrap
      vars:
         - [part_px, null]
         - [part_py, null]
         - [part_pz, null]
         - [part_energy, null]
   pf_mask:
      length: 128
      pad_mode: constant
      vars:
         - [part_mask, null]

labels:
   ### type can be `simple`, `custom`
   ### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels
   type: simple
   value: [label_QCD, label_Hbb, label_Hcc, label_Hgg, label_H4q, label_Hqql, label_Zqq, label_Wqq, label_Tbqq, label_Tbl]
   ### [option 2] otherwise use `custom` to define the label, then `value` is a map
   # type: custom
   # value:
   #    truth_label: label.argmax(1)

observers:
   - jet_pt
   - jet_eta
   - jet_phi
   - jet_energy
   - jet_nparticles
   - jet_sdmass
   - jet_tau1
   - jet_tau2
   - jet_tau3
   - jet_tau4

weights:
selection:
   ### use `&`, `|`, `~` for logical operations on numpy arrays
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression

new_variables:
   ### [format] name: formula
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression
   part_mask: ak.ones_like(part_deta)
   part_pt: np.hypot(part_px, part_py)
   part_pt_log: np.log(part_pt)
   part_e_log: np.log(part_energy)
   part_logptrel: np.log(part_pt/jet_pt)
   part_logerel: np.log(part_energy/jet_energy)
   part_deltaR: np.hypot(part_deta, part_dphi)
   jet_isQ: label
   jet_isG: 1-label

preprocess:
   ### method: [manual, auto] - whether to use manually specified parameters for variable standardization
   method: manual
   ### data_fraction: fraction of events to use when calculating the mean/scale for the standardization
   data_fraction: 0.5

inputs:
   pf_points:
      length: 128
      pad_mode: wrap
      vars:
         - [part_deta, null]
         - [part_dphi, null]
   pf_features:
      length: 128
      pad_mode: wrap
      vars:
         ### [format 1]: var_name (no transformation)
         ### [format 2]: [var_name,
         ###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),
         ###              multiply_by(optional, default=1),
         ###              clip_min(optional, default=-5),
         ###              clip_max(optional, default=5),
         ###              pad_value(optional, default=0)]
         - [part_pt_log, 1.7, 0.7]
         - [part_e_log, 2.0, 0.7]
         - [part_logptrel, -4.7, 0.7]
         - [part_logerel, -4.7, 0.7]
         - [part_deltaR, 0.2, 4.0]
         - [part_deta, null]
         - [part_dphi, null]
   pf_vectors:
      length: 128
      pad_mode: wrap
      vars:
         - [part_px, null]
         - [part_py, null]
         - [part_pz, null]
         - [part_energy, null]
   pf_mask:
      length: 128
      pad_mode: constant
      vars:
         - [part_mask, null]

labels:
   ### type can be `simple`, `custom`
   ### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels
   type: simple
   value: [jet_isQ, jet_isG]
   ### [option 2] otherwise use `custom` to define the label, then `value` is a map
   # type: custom
   # value:
   #    truth_label: label.argmax(1)

observers:
   - jet_pt
   - jet_eta

weights:
selection:
   ### use `&`, `|`, `~` for logical operations on numpy arrays
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression

new_variables:
   ### [format] name: formula
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression
   part_mask: ak.ones_like(part_deta)
   part_pt: np.hypot(part_px, part_py)
   part_pt_log: np.log(part_pt)
   part_e_log: np.log(part_energy)
   part_logptrel: np.log(part_pt/jet_pt)
   part_logerel: np.log(part_energy/jet_energy)
   part_deltaR: np.hypot(part_deta, part_dphi)
   jet_isQ: label
   jet_isG: 1-label

preprocess:
   ### method: [manual, auto] - whether to use manually specified parameters for variable standardization
   method: manual
   ### data_fraction: fraction of events to use when calculating the mean/scale for the standardization
   data_fraction: 0.5

inputs:
   pf_points:
      length: 128
      pad_mode: wrap
      vars:
         - [part_deta, null]
         - [part_dphi, null]
   pf_features:
      length: 128
      pad_mode: wrap
      vars:
         ### [format 1]: var_name (no transformation)
         ### [format 2]: [var_name,
         ###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),
         ###              multiply_by(optional, default=1),
         ###              clip_min(optional, default=-5),
         ###              clip_max(optional, default=5),
         ###              pad_value(optional, default=0)]
         - [part_pt_log, 1.7, 0.7]
         - [part_e_log, 2.0, 0.7]
         - [part_logptrel, -4.7, 0.7]
         - [part_logerel, -4.7, 0.7]
         - [part_deltaR, 0.2, 4.0]
         - [part_charge, null]
         - [part_isChargedHadron, null]
         - [part_isNeutralHadron, null]
         - [part_isPhoton, null]
         - [part_isElectron, null]
         - [part_isMuon, null]
         - [part_deta, null]
         - [part_dphi, null]
   pf_vectors:
      length: 128
      pad_mode: wrap
      vars:
         - [part_px, null]
         - [part_py, null]
         - [part_pz, null]
         - [part_energy, null]
   pf_mask:
      length: 128
      pad_mode: constant
      vars:
         - [part_mask, null]

labels:
   ### type can be `simple`, `custom`
   ### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels
   type: simple
   value: [jet_isQ, jet_isG]
   ### [option 2] otherwise use `custom` to define the label, then `value` is a map
   # type: custom
   # value:
   #    truth_label: label.argmax(1)

observers:
   - jet_pt
   - jet_eta

weights:
selection:
   ### use `&`, `|`, `~` for logical operations on numpy arrays
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression

new_variables:
   ### [format] name: formula
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression
   part_mask: ak.ones_like(part_deta)
   part_pt: np.hypot(part_px, part_py)
   part_pt_log: np.log(part_pt)
   part_e_log: np.log(part_energy)
   part_logptrel: np.log(part_pt/jet_pt)
   part_logerel: np.log(part_energy/jet_energy)
   part_deltaR: np.hypot(part_deta, part_dphi)
   part_isCHad: (np.abs(part_pid)==211) + (np.abs(part_pid)==321)*0.5 + (np.abs(part_pid)==2212)*0.2
   part_isNHad: (np.abs(part_pid)==130) + (np.abs(part_pid)==2112)*0.2
   jet_isQ: label
   jet_isG: 1-label

preprocess:
   ### method: [manual, auto] - whether to use manually specified parameters for variable standardization
   method: manual
   ### data_fraction: fraction of events to use when calculating the mean/scale for the standardization
   data_fraction: 0.5

inputs:
   pf_points:
      length: 128
      pad_mode: wrap
      vars:
         - [part_deta, null]
         - [part_dphi, null]
   pf_features:
      length: 128
      pad_mode: wrap
      vars:
         ### [format 1]: var_name (no transformation)
         ### [format 2]: [var_name,
         ###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),
         ###              multiply_by(optional, default=1),
         ###              clip_min(optional, default=-5),
         ###              clip_max(optional, default=5),
         ###              pad_value(optional, default=0)]
         - [part_pt_log, 1.7, 0.7]
         - [part_e_log, 2.0, 0.7]
         - [part_logptrel, -4.7, 0.7]
         - [part_logerel, -4.7, 0.7]
         - [part_deltaR, 0.2, 4.0]
         - [part_charge, null]
         - [part_isCHad, null]
         - [part_isNHad, null]
         - [part_isPhoton, null]
         - [part_isElectron, null]
         - [part_isMuon, null]
         - [part_deta, null]
         - [part_dphi, null]
   pf_vectors:
      length: 128
      pad_mode: wrap
      vars:
         - [part_px, null]
         - [part_py, null]
         - [part_pz, null]
         - [part_energy, null]
   pf_mask:
      length: 128
      pad_mode: constant
      vars:
         - [part_mask, null]

labels:
   ### type can be `simple`, `custom`
   ### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels
   type: simple
   value: [jet_isQ, jet_isG]
   ### [option 2] otherwise use `custom` to define the label, then `value` is a map
   # type: custom
   # value:
   #    truth_label: label.argmax(1)

observers:
   - jet_pt
   - jet_eta

weights:
selection:
   ### use `&`, `|`, `~` for logical operations on numpy arrays
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression

new_variables:
   ### [format] name: formula
   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression
   part_mask: ak.ones_like(part_deta)
   part_pt: np.hypot(part_px, part_py)
   part_pt_log: np.log(part_pt)
   part_e_log: np.log(part_energy)
   part_logptrel: np.log(part_pt/jet_pt)
   part_logerel: np.log(part_energy/jet_energy)
   part_deltaR: np.hypot(part_deta, part_dphi)
   jet_isTop: label
   jet_isQCD: 1-label

preprocess:
   ### method: [manual, auto] - whether to use manually specified parameters for variable standardization
   method: manual
   ### data_fraction: fraction of events to use when calculating the mean/scale for the standardization
   data_fraction: 0.5

inputs:
   pf_points:
      length: 128
      pad_mode: wrap
      vars:
         - [part_deta, null]
         - [part_dphi, null]
   pf_features:
      length: 128
      pad_mode: wrap
      vars:
         ### [format 1]: var_name (no transformation)
         ### [format 2]: [var_name,
         ###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),
         ###              multiply_by(optional, default=1),
         ###              clip_min(optional, default=-5),
         ###              clip_max(optional, default=5),
         ###              pad_value(optional, default=0)]
         - [part_pt_log, 1.7, 0.7]
         - [part_e_log, 2.0, 0.7]
         - [part_logptrel, -4.7, 0.7]
         - [part_logerel, -4.7, 0.7]
         - [part_deltaR, 0.2, 4.0]
         - [part_deta, null]
         - [part_dphi, null]
   pf_vectors:
      length: 128
      pad_mode: wrap
      vars:
         - [part_px, null]
         - [part_py, null]
         - [part_pz, null]
         - [part_energy, null]
   pf_mask:
      length: 128
      pad_mode: constant
      vars:
         - [part_mask, null]

labels:
   ### type can be `simple`, `custom`
   ### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels
   type: simple
   value: [jet_isTop, jet_isQCD]
   ### [option 2] otherwise use `custom` to define the label, then `value` is a map
   # type: custom
   # value:
   #    truth_label: label.argmax(1)

observers:
   - jet_pt
   - jet_eta

weights:
import numpy as np
import awkward as ak
import uproot
import vector

vector.register_awkward()


def read_file(
        filepath,
        max_num_particles=128,
        particle_features=['part_pt', 'part_eta', 'part_phi', 'part_energy'],
        jet_features=['jet_pt', 'jet_eta', 'jet_phi', 'jet_energy'],
        labels=['label_QCD', 'label_Hbb', 'label_Hcc', 'label_Hgg', 'label_H4q',
                'label_Hqql', 'label_Zqq', 'label_Wqq', 'label_Tbqq', 'label_Tbl']):
    """Loads a single file from the JetClass dataset.

    **Arguments**

    - **filepath** : _str_
        - Path to the ROOT data file.
    - **max_num_particles** : _int_
        - The maximum number of particles to load for each jet.
          Jets with fewer particles will be zero-padded,
          and jets with more particles will be truncated.
    - **particle_features** : _List[str]_
        - A list of particle-level features to be loaded.
          The available particle-level features are:
            - part_px
            - part_py
            - part_pz
            - part_energy
            - part_pt
            - part_eta
            - part_phi
            - part_deta: np.where(jet_eta>0, part_eta-jet_eta, -(part_eta-jet_eta))
            - part_dphi: delta_phi(part_phi, jet_phi)
            - part_d0val
            - part_d0err
            - part_dzval
            - part_dzerr
            - part_charge
            - part_isChargedHadron
            - part_isNeutralHadron
            - part_isPhoton
            - part_isElectron
            - part_isMuon
    - **jet_features** : _List[str]_
        - A list of jet-level features to be loaded.
          The available jet-level features are:
            - jet_pt
            - jet_eta
            - jet_phi
            - jet_energy
            - jet_nparticles
            - jet_sdmass
            - jet_tau1
            - jet_tau2
            - jet_tau3
            - jet_tau4
    - **labels** : _List[str]_
        - A list of truth labels to be loaded.
          The available label names are:
            - label_QCD
            - label_Hbb
            - label_Hcc
            - label_Hgg
            - label_H4q
            - label_Hqql
            - label_Zqq
            - label_Wqq
            - label_Tbqq
            - label_Tbl

    **Returns**

    - x_particles (_3-d numpy.ndarray_), x_jets (_2-d numpy.ndarray_), y (_2-d numpy.ndarray_)
        - `x_particles`: a zero-padded numpy array of particle-level features
          in the shape `(num_jets, num_particle_features, max_num_particles)`.
        - `x_jets`: a numpy array of jet-level features
          in the shape `(num_jets, num_jet_features)`.
        - `y`: a one-hot encoded numpy array of the truth labels
          in the shape `(num_jets, num_classes)`.
    """

    def _pad(a, maxlen, value=0, dtype='float32'):
        if isinstance(a, np.ndarray) and a.ndim >= 2 and a.shape[1] == maxlen:
            return a
        elif isinstance(a, ak.Array):
            if a.ndim == 1:
                a = ak.unflatten(a, 1)
            a = ak.fill_none(ak.pad_none(a, maxlen, clip=True), value)
            return ak.values_astype(a, dtype)
        else:
            x = (np.ones((len(a), maxlen)) * value).astype(dtype)
            for idx, s in enumerate(a):
                if not len(s):
                    continue
                trunc = s[:maxlen].astype(dtype)
                x[idx, :len(trunc)] = trunc
            return x

    table = uproot.open(filepath)['tree'].arrays()
    p4 = vector.zip({'px': table['part_px'],
                     'py': table['part_py'],
                     'pz': table['part_pz'],
                     'energy': table['part_energy']})
    table['part_pt'] = p4.pt
    table['part_eta'] = p4.eta
    table['part_phi'] = p4.phi
    x_particles = np.stack([ak.to_numpy(_pad(table[n], maxlen=max_num_particles)) for n in particle_features], axis=1)
    x_jets = np.stack([ak.to_numpy(table[n]).astype('float32') for n in jet_features], axis=1)
    y = np.stack([ak.to_numpy(table[n]).astype('int') for n in labels], axis=1)
    return x_particles, x_jets, y
#!/bin/bash
export DATADIR_JetClass=
export DATADIR_TopLandscape=
export DATADIR_QuarkGluon=
#!/usr/bin/env python3
import argparse
import os
import shutil
from utils.dataset_utils import get_file, extract_archive
datasets = {
    'JetClass': {
        'Pythia/train_100M': [
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_train_100M_part0.tar', 'de4fd2dca2e68ab3c85d5cfd3bcc65c3'),
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_train_100M_part1.tar', '9722a359c5ef697bea0fbf79bf50f003'),
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_train_100M_part2.tar', '1e9f66cd1f915f9d10e90ae1d7761720'),
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_train_100M_part3.tar', '47348fc8985319fa4806da87500482fa'),
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_train_100M_part4.tar', '6b0ce16bd93b442a8d51914466990279'),
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_train_100M_part5.tar', '416e347512e716de51d392bee327b8e9'),
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_train_100M_part6.tar', 'e9b9c1557b1b39bf0a16e4ab631ae451'),
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_train_100M_part7.tar', '5bfc6cb285ccb7680cefa9ac82ad1a2e'),
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_train_100M_part8.tar', '540c1a0d66dfad78d2b363c5740ccf86'),
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_train_100M_part9.tar', '668f40b3275167ff7104c48317c0ae2a'),
        ],
        'Pythia/': [
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_val_5M.tar', '7235ccb577ed85023ea3ab4d5e6160cf'),
            ('https://zenodo.org/record/6619768/files/JetClass_Pythia_test_20M.tar', '64e5156d26d101adeb43b8388207d767'),
        ],
    },
    'TopLandscape': {
        # converted from https://zenodo.org/record/2603256
        '../': [
            ('https://hqu.web.cern.ch/datasets/TopLandscape/TopLandscape.tar', '4fca2e47afbf321b0f201da6b804c404'),
        ],
    },
    'QuarkGluon': {
        # converted from https://zenodo.org/record/3164691
        '../': [
            ('https://hqu.web.cern.ch/datasets/QuarkGluon/QuarkGluon.tar', 'd8dd7f71a7aaaf9f1d2ee3cddef998f9'),
        ],
    },
}


def download_dataset(dataset, basedir, envfile, force_download):
    info = datasets[dataset]
    datadir = os.path.join(basedir, dataset)
    if force_download:
        if os.path.exists(datadir):
            print(f'Removing existing dir {datadir}')
            shutil.rmtree(datadir)
    for subdir, flist in info.items():
        for url, md5 in flist:
            fpath, download = get_file(url, datadir=datadir, file_hash=md5, force_download=force_download)
            if download:
                extract_archive(fpath, path=os.path.join(datadir, subdir))
    datapath = f'DATADIR_{dataset}={datadir}'
    with open(envfile) as f:
        lines = f.readlines()
    with open(envfile, 'w') as f:
        for l in lines:
            if f'DATADIR_{dataset}' in l:
                l = f'export {datapath}\n'
            f.write(l)
    print(f'Updated dataset path in {envfile} to "{datapath}".')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('dataset', choices=datasets.keys(), help='datasets to download')
    parser.add_argument('-d', '--basedir', default='datasets', help='base directory for the datasets')
    parser.add_argument('-e', '--envfile', default='env.sh', help='env file with the dataset paths')
    parser.add_argument('-f', '--force', action='store_true', help='force to re-download dataset')
    args = parser.parse_args()
    download_dataset(args.dataset, args.basedir, args.envfile, args.force)
# Unique model identifier
modelCode=1096
# Model name
modelName=particle_transformer_pytorch
# Model description
modelDescription=Deep-learning-based jet tagging
# Application scenarios
appScenario=training, inference, AI for science, high-energy physics, healthcare, finance
# Framework type
frameType=pytorch