# Install git-lfs
```
sudo apt-get update
sudo apt-get install git-lfs
```
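After installing, initialize git-lfs once (the standard git-lfs setup step) so that large files are actually fetched when cloning the repositories below:
```
git lfs install
```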
# Paper
Transfer learning enables predictions in network biology
https://www.nature.com/articles/s41586-023-06139-9
# Download
## Download the dataset
```
#git clone https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M
mkdir cell_type_train_data.dataset
cd cell_type_train_data.dataset
wget https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M/resolve/main/example_input_files/cell_classification/cell_type_annotation/cell_type_train_data.dataset/dataset.arrow
wget https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M/resolve/main/example_input_files/cell_classification/cell_type_annotation/cell_type_train_data.dataset/dataset_info.json
wget https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M/resolve/main/example_input_files/cell_classification/cell_type_annotation/cell_type_train_data.dataset/state.json
```
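To sanity-check the download, the directory can be opened with the `datasets` library (pinned in the environment list below); the column names in the comment are examples:
```
from datasets import load_from_disk

# the directory created above holds dataset.arrow / dataset_info.json / state.json
ds = load_from_disk("cell_type_train_data.dataset")
print(ds)            # row count and columns, e.g. input_ids / cell_type / organ_major
print(ds[0].keys())  # fields of a single example
```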
# Model architecture
Transfer learning has revolutionized fields such as natural language understanding and computer vision by leveraging deep learning models pretrained on large-scale general datasets that can then be fine-tuned towards a vast array of downstream tasks with limited task-specific data. Here we developed Geneformer, a context-aware, attention-based deep learning model pretrained on a large-scale corpus of approximately 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model's attention weights in a completely self-supervised manner. Fine-tuning towards downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.
# Algorithm
Pretrained Geneformer architecture: each single-cell transcriptome is encoded as a rank value encoding and then passed through six transformer encoder units, with an input size of 2048 (fully representing 93% of the rank value encodings in Genecorpus-30M), an embedding dimension of 256, four attention heads per layer, and a feed-forward size of 512. Geneformer uses full dense self-attention across the 2048 input size. Extractable outputs include contextual gene and cell embeddings, contextual attention weights, and contextual predictions.
![Alt text](image.png)
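For intuition, here is a minimal sketch of the rank value encoding described above. It is illustrative only: the gene ids and corpus-wide normalization factors are toy placeholders, and the repo's own tokenizer is the authoritative implementation.
```
import numpy as np

def rank_value_encode(expr, gene_medians, gene_ids, max_len=2048):
    """Rank genes in one cell by expression normalized by each gene's
    corpus-wide factor, highest first, truncated to the model input size."""
    norm = expr / gene_medians          # deprioritizes ubiquitously high genes
    nonzero = norm > 0                  # only expressed genes enter the encoding
    order = np.argsort(-norm[nonzero])  # descending normalized expression
    return gene_ids[nonzero][order][:max_len]

# toy example: 5 genes, one cell
expr = np.array([100.0, 3.0, 0.0, 8.0, 8.0])      # raw counts in this cell
medians = np.array([200.0, 1.0, 5.0, 4.0, 80.0])  # corpus-wide normalization
gene_ids = np.array([0, 1, 2, 3, 4])
print(rank_value_encode(expr, medians, gene_ids))  # -> [1 3 0 4]
```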
# Environment setup
Docker (Method 1)
Running via Docker is recommended; pull the provided image:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -dit --shm-size 80g --network=host --name=geneformer --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10 /bin/bash
docker exec -it geneformer /bin/bash
```
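Optionally, verify inside the container that the DCU build of torch sees the devices (the DCU stack exposes the CUDA-compatible API that the scripts below rely on):
```
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```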
## Model download
Install the dependencies not included in the Docker image:
```
git clone https://hf-mirror.com/ctheodoris/Geneformer
cd Geneformer
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
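requirements.txt covers the dependencies; the `geneformer` package itself (listed as geneformer 0.1.0 in the environment below) is presumably installed from the cloned repo in the usual editable way:
```
pip install -e .
```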
# Environment deployment
Dockerfile (Method 2)

Build the image from the Dockerfile provided at the end of this page. The package list below shows the resulting environment after deployment; the build and run commands follow the list.
## Environment after deployment
```
accelerate 0.33.0
accumulation_tree 0.6.2
aiohappyeyeballs 2.3.6
aiohttp 3.10.3
aiosignal 1.3.1
anndata 0.10.8
array_api_compat 1.8
async-timeout 4.0.3
attrs 24.2.0
certifi 2024.7.4
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
contourpy 1.2.1
cycler 0.12.1
datasets 2.21.0
dill 0.3.8
exceptiongroup 1.2.2
filelock 3.15.4
fonttools 4.53.1
frozenlist 1.4.1
fsspec 2024.6.1
future 1.0.0
geneformer 0.1.0
h5py 3.11.0
huggingface-hub 0.24.5
hyperopt 0.2.7
idna 3.7
Jinja2 3.1.4
joblib 1.4.2
jsonschema 4.23.0
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
legacy-api-wrap 1.4
llvmlite 0.43.0
loompy 3.0.7
MarkupSafe 2.1.5
matplotlib 3.9.2
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.16
natsort 8.4.0
networkx 3.3
numba 0.60.0
numpy 1.26.4
numpy-groupies 0.11.2
packaging 24.1
pandas 2.2.2
patsy 0.5.6
pillow 10.4.0
pip 24.2
protobuf 5.27.3
psutil 6.0.0
py4j 0.10.9.7
pyarrow 17.0.0
pynndescent 0.5.13
pyparsing 3.1.2
python-dateutil 2.9.0.post0
pytz 2024.1
pyudorandom 1.0.0
PyYAML 6.0.2
ray 2.34.0
referencing 0.35.1
regex 2024.7.24
requests 2.32.3
rpds-py 0.20.0
safetensors 0.4.4
scanpy 1.10.2
scikit-learn 1.5.1
scipy 1.14.0
seaborn 0.13.2
session_info 1.0.0
setuptools 72.1.0
six 1.16.0
statsmodels 0.14.2
stdlib-list 0.10.0
sympy 1.13.2
tdigest 0.5.2.2
threadpoolctl 3.5.0
tokenizers 0.19.1
torch 2.1.0+git540102b.abi0.dtk2404
tqdm 4.66.5
transformers 4.44.0
typing_extensions 4.12.2
tzdata 2024.1
umap-learn 0.5.6
urllib3 2.2.2
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
```
```
docker build -t geneformer:latest .
docker run -dit --shm-size 80g --network=host --name=geneformer --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro geneformer:latest /bin/bash
docker exec -it geneformer /bin/bash
```
Conda (Method 3)
1. Create a conda virtual environment:
```
conda create -n geneformer python=3.10
conda activate geneformer
```
2. The toolkit and deep learning libraries this project requires for DCU GPUs can all be downloaded from the Hygon developer community:
- [DTK 24.04.1](https://cancon.hpccube.com:65024/directlink/1/DTK-24.04.1/Ubuntu20.04.1/DTK-24.04.1-Ubuntu20.04.1-x86_64.tar.gz)
- [Pytorch 2.1](https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.2/torch-2.1.0+das.opt1.dtk24042-cp310-cp310-manylinux_2_28_x86_64.whl)

Install the DCU build of torch from the downloaded wheel (`pip install <wheel>`). Tips: the DTK driver, Python, DeepSpeed, and other tool versions above must correspond to each other exactly.

3. Install the remaining dependencies per requirements.txt:
```
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
# Model training
```
# single-GPU run
python train.py
```
For details, see Geneformer/examples/cell_classification.ipynb. Alternatively, after updating the dataset path inside the script, run:
```
python test_cell_classifier.py
```
# Model inference
```
python geneformer/classifier.py --classifier="cell" --cell_state_dict='{"state_key": "disease", "states": "all"}' --forward_batch_size=200 --nproc=1
```
Running this directly will raise an error; see Geneformer/examples/cell_classification.ipynb for the working workflow.
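Since the direct CLI call errors out, here is a hedged sketch of the equivalent Python usage of the `Classifier` class; the paths and output prefix are placeholders, and the notebook remains the reference for the full workflow:
```
from geneformer import Classifier

# cell-state classifier over the "disease" column, mirroring the CLI flags above
cc = Classifier(classifier="cell",
                cell_state_dict={"state_key": "disease", "states": "all"},
                forward_batch_size=200,
                nproc=1)
cc.prepare_data(input_data_file="/path/to/input_data.dataset",  # placeholder
                output_directory="/path/to/output_dir",         # placeholder
                output_prefix="disease_classifier")
```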
# Download
## Install git-lfs
git-lfs is required for cloning the full model and dataset repositories below; install and initialize it as described at the top of this page.
For usage, see [examples](https://huggingface.co/ctheodoris/Geneformer/tree/main/examples) for:
- tokenizing transcriptomes
- pretraining
- hyperparameter tuning
- fine-tuning
- extracting and plotting cell embeddings
- in silico perturbation
## Download the dataset
```
mkdir -p /path/to/
cd /path/to
git clone https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M
```
Please note that the fine-tuning examples are meant to be generally applicable and the input datasets and labels will vary dependent on the downstream task. Example input files for a few of the downstream tasks demonstrated in the manuscript are located within the [example_input_files directory](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files) in the dataset repository, but these only represent a few example fine-tuning applications.
Please note that GPU resources are required for efficient usage of Geneformer. Additionally, we strongly recommend tuning hyperparameters for each downstream fine-tuning application as this can significantly boost predictive potential in the downstream task (e.g. max learning rate, learning schedule, number of layers to freeze, etc.).
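As one way to follow this advice, the `transformers` Trainer supports automated search via `Trainer.hyperparameter_search` (the Ray Tune backend matches the `ray` pin in the environment list above). A minimal sketch, assuming a `model_init` function and the datasets/metrics defined as in the cell classification example later on this page:
```
from transformers import Trainer

# assumes training_args_init, organ_trainset, organ_evalset and compute_metrics
# are defined as in the cell classification example below, and that model_init()
# returns a freshly initialized classification model for each trial
trainer = Trainer(model_init=model_init,
                  args=training_args_init,
                  train_dataset=organ_trainset,
                  eval_dataset=organ_evalset,
                  compute_metrics=compute_metrics)

best_run = trainer.hyperparameter_search(direction="maximize",  # e.g. macro F1
                                         backend="ray",
                                         n_trials=10)
print(best_run.hyperparameters)  # tuned values, e.g. learning rate
```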
## Model download
### Geneformer model download
(Hugging Face model card metadata: datasets ctheodoris/Genecorpus-30M; license apache-2.0)
# Geneformer
Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.
- See [our manuscript](https://rdcu.be/ddrx0) for details.
- See [geneformer.readthedocs.io](https://geneformer.readthedocs.io) for documentation.

Download the model and install the geneformer package:
```
cd /path/to
git clone -b pr146_branch https://hf-mirror.com/ctheodoris/Geneformer
cd Geneformer
pip install -e .
```
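To confirm that git-lfs fetched the actual weights rather than pointer files, the pretrained checkpoint can be loaded directly; Geneformer is pretrained with a masked objective, so the BERT MLM head applies (assuming the 6-layer checkpoint sits at the repo root as in the original layout):
```
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("/path/to/Geneformer")
print(model.config.num_hidden_layers, model.config.hidden_size)  # expect 6, 256
```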
# Model Description
Geneformer is a foundation transformer model pretrained on [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), a pretraining corpus comprised of ~30 million single cell transcriptomes from a broad range of human tissues. We excluded cells with high mutational burdens (e.g. malignant cells and immortalized cell lines) that could lead to substantial network rewiring without companion genome sequencing to facilitate interpretation. Each single cell’s transcriptome is presented to the model as a rank value encoding where genes are ranked by their expression in that cell normalized by their expression across the entire Genecorpus-30M. The rank value encoding provides a nonparametric representation of that cell’s transcriptome and takes advantage of the many observations of each gene’s expression across Genecorpus-30M to prioritize genes that distinguish cell state. Specifically, this method will deprioritize ubiquitously highly-expressed housekeeping genes by normalizing them to a lower rank. Conversely, genes such as transcription factors that may be lowly expressed when they are expressed but highly distinguish cell state will move to a higher rank within the encoding. Furthermore, this rank-based approach may be more robust against technical artifacts that may systematically bias the absolute transcript counts value while the overall relative ranking of genes within each cell remains more stable.
The rank value encoding of each single cell’s transcriptome then proceeds through six transformer encoder units. Pretraining was accomplished using a masked learning objective where 15% of the genes within each transcriptome were masked and the model was trained to predict which gene should be within each masked position in that specific cell state using the context of the remaining unmasked genes. A major strength of this approach is that it is entirely self-supervised and can be accomplished on completely unlabeled data, which allows the inclusion of large amounts of training data without being restricted to samples with accompanying labels.
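A minimal sketch of this masked-gene objective (generic BERT-style masking over rank-encoded gene ids; the token ids here are hypothetical, and the repo's pretraining collator is the authoritative implementation):
```
import torch

def mask_genes(input_ids, mask_token_id, mask_prob=0.15):
    """Mask 15% of gene positions; the model is trained to predict the
    original gene id at each masked position from the unmasked context."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = -100                 # loss is computed only on masked positions
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id
    return corrupted, labels

# toy batch: 2 cells, 8 rank-encoded gene tokens each (hypothetical ids)
ids = torch.randint(5, 100, (2, 8))
x, y = mask_genes(ids, mask_token_id=1)
# x feeds the transformer; cross-entropy is taken against y at masked spots
```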
We detail applications and results in [our manuscript](https://rdcu.be/ddrx0).
During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model’s attention weights in a completely self-supervised manner. With both zero-shot learning and fine-tuning with limited task-specific data, Geneformer consistently boosted predictive accuracy in a diverse panel of downstream tasks relevant to chromatin and network dynamics. In silico perturbation with zero-shot learning identified a novel transcription factor in cardiomyocytes that we experimentally validated to be critical to their ability to generate contractile force. In silico treatment with limited patient data revealed candidate therapeutic targets for cardiomyopathy that we experimentally validated to significantly improve the ability of cardiomyocytes to generate contractile force in an iPSC model of the disease. Overall, Geneformer represents a foundational deep learning model pretrained on ~30 million human single cell transcriptomes to gain a fundamental understanding of gene network dynamics that can now be democratized to a vast array of downstream tasks to accelerate discovery of key network regulators and candidate therapeutic targets.
In [our manuscript](https://rdcu.be/ddrx0), we report results for the 6 layer Geneformer model pretrained on Genecorpus-30M. We additionally provide within this repository a 12 layer Geneformer model, scaled up with retained width:depth aspect ratio, also pretrained on Genecorpus-30M.
Both the 6 and 12 layer Geneformer models were pretrained in June 2021.
# Application
The pretrained Geneformer model can be used directly for zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning towards the relevant downstream task, such as gene or cell state classification.
Example applications demonstrated in [our manuscript](https://rdcu.be/ddrx0) include:

*Fine-tuning*:
- transcription factor dosage sensitivity
- chromatin dynamics (bivalently marked promoters)
- transcription factor regulatory range
- gene network centrality
- transcription factor targets
- cell type annotation
- batch integration
- cell state classification across differentiation
- disease classification
- in silico perturbation to determine disease-driving genes
- in silico treatment to determine candidate therapeutic targets

*Zero-shot learning*:
- batch integration
- gene context specificity
- in silico reprogramming
- in silico differentiation
- in silico perturbation to determine impact on cell state
- in silico perturbation to determine transcription factor targets
- in silico perturbation to determine transcription factor cooperativity

# Model training
Single-GPU run for gene classification:
```
cd geneformer/
python train.py
```
For details, see Geneformer/examples/cell_classification.ipynb. To run the cell classifier, update the dataset path inside the script, then run:
```
python train_cell.py
```
# References
......
Dockerfile (used by Method 2 above):
```
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
COPY requirements.txt requirements.txt
RUN source /opt/dtk-24.04.1/env.sh
RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo 'Asia/Shanghai' >/etc/timezone
ENV LANG C.UTF-8
RUN pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
Gene classification example (dosage-sensitive transcription factors; paths are placeholders):
```
import datetime
import os
import pickle

from geneformer import Classifier

# timestamped output directory
current_date = datetime.datetime.now()
datestamp = f"{str(current_date.year)[-2:]}{current_date.month:02d}{current_date.day:02d}{current_date.hour:02d}{current_date.minute:02d}{current_date.second:02d}"
datestamp_min = f"{str(current_date.year)[-2:]}{current_date.month:02d}{current_date.day:02d}"

output_prefix = "tf_dosage_sens_test"
output_dir = f"/path/to/output_dir/{datestamp}"
os.makedirs(output_dir)

# gene labels: dosage-sensitive vs. dosage-insensitive transcription factors
with open("/path/to/Genecorpus-30M/dosage_sensitivity_TFs.pickle", "rb") as fp:
    gene_class_dict = pickle.load(fp)

cc = Classifier(classifier="gene",
                gene_class_dict=gene_class_dict,
                max_ncells=10_000,
                freeze_layers=4,
                num_crossval_splits=5,
                forward_batch_size=200,
                nproc=16)

# label the tokenized input data with the gene classes above
cc.prepare_data(input_data_file="/path/to/Genecorpus-30M/dosage_sensitive_tfs",
                output_directory=output_dir,
                output_prefix=output_prefix)

# cross-validated fine-tuning and evaluation from the pretrained model
all_metrics = cc.validate(model_directory="/home/Geneformer",
                          prepared_input_data_file=f"{output_dir}/{output_prefix}_labeled.dataset",
                          id_class_dict_file=f"{output_dir}/{output_prefix}_id_class_dict.pkl",
                          output_directory=output_dir,
                          output_prefix=output_prefix)
```
Cell type classification example, following the notebook at https://gitee.com/hf-models/Geneformer/blob/main/examples/cell_classification.ipynb (portions elided in the diff are filled in per that notebook; the hyperparameter values are its defaults and should be tuned; update the dataset and Geneformer paths):
```
import os
GPU_NUMBER = [0]
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join([str(s) for s in GPU_NUMBER])
os.environ["NCCL_DEBUG"] = "INFO"

# imports
from collections import Counter
import datetime
import pickle
import subprocess
from datasets import load_from_disk
from sklearn.metrics import accuracy_score, f1_score
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from geneformer import DataCollatorForCellClassification

# load cell type dataset (includes all tissues); update this path
train_dataset = load_from_disk("/path/to/Genecorpus-30M/example_input_files/cell_classification/cell_type_annotation/cell_type_train_data.dataset/")

dataset_list = []
evalset_list = []
organ_list = []
target_dict_list = []

for organ in Counter(train_dataset["organ_major"]).keys():
    # collect list of tissues for fine-tuning (immune and bone marrow are included together)
    if organ in ["bone_marrow"]:
        continue
    elif organ == "immune":
        organ_ids = ["immune", "bone_marrow"]
    else:
        organ_ids = [organ]
    organ_list += [organ]

    print(organ)

    # filter datasets for given organ
    def if_organ(example):
        return example["organ_major"] in organ_ids
    trainset_organ = train_dataset.filter(if_organ, num_proc=16)

    # per scDeepsort published method, drop cell types representing <0.5% of cells
    celltype_counter = Counter(trainset_organ["cell_type"])
    total_cells = sum(celltype_counter.values())
    cells_to_keep = [k for k, v in celltype_counter.items() if v > (0.005 * total_cells)]
    def if_not_rare_celltype(example):
        return example["cell_type"] in cells_to_keep
    trainset_organ_subset = trainset_organ.filter(if_not_rare_celltype, num_proc=16)

    # shuffle datasets and rename columns
    trainset_organ_shuffled = trainset_organ_subset.shuffle(seed=42)
    trainset_organ_shuffled = trainset_organ_shuffled.rename_column("cell_type", "label")
    trainset_organ_shuffled = trainset_organ_shuffled.remove_columns("organ_major")

    # create dictionary of cell types : label ids
    target_names = list(Counter(trainset_organ_shuffled["label"]).keys())
    target_name_id_dict = dict(zip(target_names, [i for i in range(len(target_names))]))
    target_dict_list += [target_name_id_dict]

    # change labels to numerical ids
    def classes_to_ids(example):
        example["label"] = target_name_id_dict[example["label"]]
        return example
    labeled_trainset = trainset_organ_shuffled.map(classes_to_ids, num_proc=16)

    # create 80/20 train/eval splits
    labeled_train_split = labeled_trainset.select([i for i in range(0, round(len(labeled_trainset) * 0.8))])
    labeled_eval_split = labeled_trainset.select([i for i in range(round(len(labeled_trainset) * 0.8), len(labeled_trainset))])

    # filter dataset for cell types in corresponding training set
    trained_labels = list(Counter(labeled_train_split["label"]).keys())
    def if_trained_label(example):
        return example["label"] in trained_labels
    labeled_eval_split_subset = labeled_eval_split.filter(if_trained_label, num_proc=16)

    dataset_list += [labeled_train_split]
    evalset_list += [labeled_eval_split_subset]

trainset_dict = dict(zip(organ_list, dataset_list))
traintargetdict_dict = dict(zip(organ_list, target_dict_list))
evalset_dict = dict(zip(organ_list, evalset_list))

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy and macro f1 using sklearn's function
    acc = accuracy_score(labels, preds)
    macro_f1 = f1_score(labels, preds, average="macro")
    return {
        'accuracy': acc,
        'macro_f1': macro_f1
    }

# set model parameters
# max input size
max_input_size = 2 ** 11  # 2048

# set training hyperparameters
# max learning rate
max_lr = 5e-5
# how many pretrained layers to freeze
freeze_layers = 0
# batch size for training and eval
geneformer_batch_size = 12
# learning schedule
lr_schedule_fn = "linear"
# warmup steps
warmup_steps = 500
# number of epochs
epochs = 10
# optimizer
optimizer = "adamw"

for organ in organ_list:
    organ_trainset = trainset_dict[organ]
    organ_evalset = evalset_dict[organ]
    organ_label_dict = traintargetdict_dict[organ]

    # set logging steps
    logging_steps = round(len(organ_trainset) / geneformer_batch_size / 10)

    # reload pretrained model; update the Geneformer path
    model = BertForSequenceClassification.from_pretrained("/home/Geneformer",
                                                          num_labels=len(organ_label_dict.keys()),
                                                          output_attentions=False,
                                                          output_hidden_states=False).to("cuda")

    # define output directory path
    current_date = datetime.datetime.now()
    datestamp = f"{str(current_date.year)[-2:]}{current_date.month:02d}{current_date.day:02d}"
    output_dir = f"/path/to/models/{datestamp}_geneformer_CellClassifier_{organ}_L{max_input_size}_B{geneformer_batch_size}_LR{max_lr}_LS{lr_schedule_fn}_WU{warmup_steps}_E{epochs}_O{optimizer}_F{freeze_layers}/"

    # ensure not overwriting previously saved model
    saved_model_test = os.path.join(output_dir, "pytorch_model.bin")
    if os.path.isfile(saved_model_test) == True:
        raise Exception("Model already saved to this directory.")

    # make output directory
    subprocess.call(f'mkdir {output_dir}', shell=True)

    # set training arguments
    training_args = {
        "learning_rate": max_lr,
        "do_train": True,
        "do_eval": True,
        "evaluation_strategy": "epoch",
        "save_strategy": "epoch",
        "logging_steps": logging_steps,
        "group_by_length": True,
        "length_column_name": "length",
        "disable_tqdm": False,
        "lr_scheduler_type": lr_schedule_fn,
        "warmup_steps": warmup_steps,
        "weight_decay": 0.001,
        "per_device_train_batch_size": geneformer_batch_size,
        "per_device_eval_batch_size": geneformer_batch_size,
        "num_train_epochs": epochs,
        "load_best_model_at_end": True,
        "output_dir": output_dir,
    }
    training_args_init = TrainingArguments(**training_args)

    # create the trainer
    trainer = Trainer(
        model=model,
        args=training_args_init,
        data_collator=DataCollatorForCellClassification(),
        train_dataset=organ_trainset,
        eval_dataset=organ_evalset,
        compute_metrics=compute_metrics
    )
    # train the cell type classifier
    trainer.train()
    predictions = trainer.predict(organ_evalset)
    with open(f"{output_dir}predictions.pickle", "wb") as fp:
        pickle.dump(predictions, fp)
    trainer.save_metrics("eval", predictions.metrics)
    trainer.save_model(output_dir)
```