# Install git-lfs
```
sudo apt-get update
sudo apt-get install git-lfs
```


# Download

## Download the dataset
```

# git clone https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M
mkdir cell_type_train_data.dataset
cd cell_type_train_data.dataset
wget https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M/resolve/main/example_input_files/cell_classification/cell_type_annotation/cell_type_train_data.dataset/dataset.arrow
wget https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M/resolve/main/example_input_files/cell_classification/cell_type_annotation/cell_type_train_data.dataset/dataset_info.json
wget https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M/resolve/main/example_input_files/cell_classification/cell_type_annotation/cell_type_train_data.dataset/state.json

```
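After the three files finish downloading, it is worth sanity-checking the directory before training. A minimal sketch (stdlib only; the directory layout follows the `wget` commands above):

```python
import json
from pathlib import Path

# The three files a saved `datasets` directory is expected to contain,
# per the download commands above.
EXPECTED = ("dataset.arrow", "dataset_info.json", "state.json")

def check_dataset_dir(path):
    """Return the list of expected files missing from the dataset directory."""
    d = Path(path)
    return [name for name in EXPECTED if not (d / name).is_file()]

def peek_info(path):
    """Read dataset_info.json (written by the `datasets` library) if present."""
    info_file = Path(path) / "dataset_info.json"
    if info_file.is_file():
        return json.loads(info_file.read_text())
    return None
```

Once the check passes, `datasets.load_from_disk("cell_type_train_data.dataset")` is the usual way to open the directory.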

## Download the model

```
git clone https://hf-mirror.com/ctheodoris/Geneformer
cd Geneformer
```
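The cloned checkpoint loads with the standard `transformers` BERT classes, since Geneformer is a BERT-style encoder (the manuscript's base model has six encoder layers). A rough sketch; the dimensions below are assumptions for illustration, while the real values ship in the checkpoint's `config.json`:

```python
from transformers import BertConfig, BertForMaskedLM

# Illustrative config only; in practice, load the real one with
# BertForMaskedLM.from_pretrained("/path/to/Geneformer") after cloning.
config = BertConfig(
    vocab_size=25426,              # assumption: size of the gene-token vocabulary
    num_hidden_layers=6,           # the 6-layer model described in the manuscript
    num_attention_heads=4,         # assumption
    hidden_size=256,               # assumption
    intermediate_size=512,         # assumption
    max_position_embeddings=2048,  # assumption: length of the ranked-gene input
)
model = BertForMaskedLM(config)
```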



# Environment setup


```
conda create -n geneformer python=3.10
conda activate geneformer 
pip install torch    # DCU build of torch
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```

## Installed packages after setup
```
accelerate                0.33.0
accumulation_tree         0.6.2
aiohappyeyeballs          2.3.6
aiohttp                   3.10.3
aiosignal                 1.3.1
anndata                   0.10.8
array_api_compat          1.8
async-timeout             4.0.3
attrs                     24.2.0
certifi                   2024.7.4
charset-normalizer        3.3.2
click                     8.1.7
cloudpickle               3.0.0
contourpy                 1.2.1
cycler                    0.12.1
datasets                  2.21.0
dill                      0.3.8
exceptiongroup            1.2.2
filelock                  3.15.4
fonttools                 4.53.1
frozenlist                1.4.1
fsspec                    2024.6.1
future                    1.0.0
geneformer                0.1.0
h5py                      3.11.0
huggingface-hub           0.24.5
hyperopt                  0.2.7
idna                      3.7
Jinja2                    3.1.4
joblib                    1.4.2
jsonschema                4.23.0
jsonschema-specifications 2023.12.1
kiwisolver                1.4.5
legacy-api-wrap           1.4
llvmlite                  0.43.0
loompy                    3.0.7
MarkupSafe                2.1.5
matplotlib                3.9.2
mpmath                    1.3.0
msgpack                   1.0.8
multidict                 6.0.5
multiprocess              0.70.16
natsort                   8.4.0
networkx                  3.3
numba                     0.60.0
numpy                     1.26.4
numpy-groupies            0.11.2
packaging                 24.1
pandas                    2.2.2
patsy                     0.5.6
pillow                    10.4.0
pip                       24.2
protobuf                  5.27.3
psutil                    6.0.0
py4j                      0.10.9.7
pyarrow                   17.0.0
pynndescent               0.5.13
pyparsing                 3.1.2
python-dateutil           2.9.0.post0
pytz                      2024.1
pyudorandom               1.0.0
PyYAML                    6.0.2
ray                       2.34.0
referencing               0.35.1
regex                     2024.7.24
requests                  2.32.3
rpds-py                   0.20.0
safetensors               0.4.4
scanpy                    1.10.2
scikit-learn              1.5.1
scipy                     1.14.0
seaborn                   0.13.2
session_info              1.0.0
setuptools                72.1.0
six                       1.16.0
statsmodels               0.14.2
stdlib-list               0.10.0
sympy                     1.13.2
tdigest                   0.5.2.2
threadpoolctl             3.5.0
tokenizers                0.19.1
torch                     2.1.0+git540102b.abi0.dtk2404
tqdm                      4.66.5
transformers              4.44.0
typing_extensions         4.12.2
tzdata                    2024.1
umap-learn                0.5.6
urllib3                   2.2.2
wheel                     0.43.0
xxhash                    3.4.1
yarl                      1.9.4
```
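When reproducing this environment, a quick way to compare a few of the pins above against what is actually installed (stdlib only; the package names passed in are examples taken from the list):

```python
from importlib.metadata import version, PackageNotFoundError

def check_pins(pins):
    """Return {name: problem} for pins that are missing or mismatched."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != expected:
            problems[name] = f"found {installed}, expected {expected}"
    return problems
```

With the environment above, `check_pins({"datasets": "2.21.0", "transformers": "4.44.0"})` should come back empty.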

# Model training

```
# single-GPU run
python train.py

# For details, see Geneformer/examples/cell_classification.ipynb

# or run
python test_cell_classifier.py    # replace the dataset path inside the .py file
```

# Model inference

```
python geneformer/classifier.py --classifier="cell" --cell_state_dict='{"state_key": "disease", "states": "all"}' --forward_batch_size=200 --nproc=1    # running this directly will raise an error; see Geneformer/examples/cell_classification.ipynb for details


```



For usage, see [examples](https://huggingface.co/ctheodoris/Geneformer/tree/main/examples) for:
- tokenizing transcriptomes
- pretraining
- hyperparameter tuning
- fine-tuning
- extracting and plotting cell embeddings
- in silico perturbation

Please note that the fine-tuning examples are meant to be generally applicable and the input datasets and labels will vary dependent on the downstream task. Example input files for a few of the downstream tasks demonstrated in the manuscript are located within the [example_input_files directory](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files) in the dataset repository, but these only represent a few example fine-tuning applications.

Please note that GPU resources are required for efficient usage of Geneformer. Additionally, we strongly recommend tuning hyperparameters for each downstream fine-tuning application as this can significantly boost predictive potential in the downstream task (e.g. max learning rate, learning schedule, number of layers to freeze, etc.).




Dataset: ctheodoris/Genecorpus-30M. License: apache-2.0.
# Geneformer
Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.

- See [our manuscript](https://rdcu.be/ddrx0) for details.
- See [geneformer.readthedocs.io](https://geneformer.readthedocs.io) for documentation.

# Model Description
Geneformer is a foundation transformer model pretrained on [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), a pretraining corpus comprised of ~30 million single cell transcriptomes from a broad range of human tissues. We excluded cells with high mutational burdens (e.g. malignant cells and immortalized cell lines) that could lead to substantial network rewiring without companion genome sequencing to facilitate interpretation. Each single cell’s transcriptome is presented to the model as a rank value encoding where genes are ranked by their expression in that cell normalized by their expression across the entire Genecorpus-30M. The rank value encoding provides a nonparametric representation of that cell’s transcriptome and takes advantage of the many observations of each gene’s expression across Genecorpus-30M to prioritize genes that distinguish cell state. Specifically, this method will deprioritize ubiquitously highly-expressed housekeeping genes by normalizing them to a lower rank. Conversely, genes such as transcription factors that may be lowly expressed when they are expressed but highly distinguish cell state will move to a higher rank within the encoding. Furthermore, this rank-based approach may be more robust against technical artifacts that may systematically bias the absolute transcript counts value while the overall relative ranking of genes within each cell remains more stable.
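The rank value encoding described above can be sketched in a few lines of NumPy: divide a cell's counts by per-gene normalization factors computed across the corpus, then order the expressed genes by the normalized value. This is a simplified sketch; the real pipeline derives the factors from expression across Genecorpus-30M, and the factors and gene names below are hypothetical:

```python
import numpy as np

def rank_value_encode(cell_counts, corpus_norm_factors, gene_ids):
    """Return gene_ids ordered by corpus-normalized expression, highest first."""
    normalized = cell_counts / corpus_norm_factors
    expressed = np.flatnonzero(cell_counts > 0)            # drop unexpressed genes
    order = expressed[np.argsort(-normalized[expressed], kind="stable")]
    return [gene_ids[i] for i in order]

# Toy example: the housekeeping gene is highly expressed everywhere, so its
# large corpus-wide factor pushes it down the ranking, while a lowly expressed
# but cell-state-distinguishing transcription factor rises to the top.
genes = ["HOUSEKEEPING", "TF", "OTHER", "SILENT"]
counts = np.array([900.0, 30.0, 100.0, 0.0])
norm = np.array([1000.0, 10.0, 100.0, 50.0])   # hypothetical corpus-wide factors
encoding = rank_value_encode(counts, norm, genes)
```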

The rank value encoding of each single cell’s transcriptome then proceeds through six transformer encoder units. Pretraining was accomplished using a masked learning objective where 15% of the genes within each transcriptome were masked and the model was trained to predict which gene should be within each masked position in that specific cell state using the context of the remaining unmasked genes. A major strength of this approach is that it is entirely self-supervised and can be accomplished on completely unlabeled data, which allows the inclusion of large amounts of training data without being restricted to samples with accompanying labels.
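The masking step can be sketched as follows: choose 15% of positions at random, record the original tokens as labels, and replace them with a mask token. This is a simplified version (BERT-style pretraining also applies an 80/10/10 replacement rule, omitted here), and the mask-token id is an assumption:

```python
import numpy as np

MASK_TOKEN = 0  # assumption: reserved id for the mask token

def mask_genes(token_ids, mask_frac=0.15, seed=0):
    """Mask a fraction of positions; labels are -100 except where masked."""
    rng = np.random.default_rng(seed)
    tokens = np.asarray(token_ids).copy()
    n_mask = max(1, int(round(mask_frac * tokens.size)))
    positions = rng.choice(tokens.size, size=n_mask, replace=False)
    labels = np.full(tokens.shape, -100)   # -100 is ignored by the loss
    labels[positions] = tokens[positions]  # model must recover these tokens
    tokens[positions] = MASK_TOKEN
    return tokens, labels
```

The model is then trained to predict, from the unmasked context, which gene belongs at each masked position.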

We detail applications and results in [our manuscript](https://rdcu.be/ddrx0).

During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model’s attention weights in a completely self-supervised manner. With both zero-shot learning and fine-tuning with limited task-specific data, Geneformer consistently boosted predictive accuracy in a diverse panel of downstream tasks relevant to chromatin and network dynamics. In silico perturbation with zero-shot learning identified a novel transcription factor in cardiomyocytes that we experimentally validated to be critical to their ability to generate contractile force. In silico treatment with limited patient data revealed candidate therapeutic targets for cardiomyopathy that we experimentally validated to significantly improve the ability of cardiomyocytes to generate contractile force in an iPSC model of the disease. Overall, Geneformer represents a foundational deep learning model pretrained on ~30 million human single cell transcriptomes to gain a fundamental understanding of gene network dynamics that can now be democratized to a vast array of downstream tasks to accelerate discovery of key network regulators and candidate therapeutic targets.

In [our manuscript](https://rdcu.be/ddrx0), we report results for the 6 layer Geneformer model pretrained on Genecorpus-30M. We additionally provide within this repository a 12 layer Geneformer model, scaled up with retained width:depth aspect ratio, also pretrained on Genecorpus-30M.

Both the 6 and 12 layer Geneformer models were pretrained in June 2021.

# Application
The pretrained Geneformer model can be used directly for zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning towards the relevant downstream task, such as gene or cell state classification.

Example applications demonstrated in [our manuscript](https://rdcu.be/ddrx0) include:

*Fine-tuning*:
- transcription factor dosage sensitivity
- chromatin dynamics (bivalently marked promoters)
- transcription factor regulatory range
- gene network centrality
- transcription factor targets
- cell type annotation
- batch integration
- cell state classification across differentiation
- disease classification
- in silico perturbation to determine disease-driving genes
- in silico treatment to determine candidate therapeutic targets

*Zero-shot learning*:
- batch integration
- gene context specificity
- in silico reprogramming
- in silico differentiation
- in silico perturbation to determine impact on cell state
- in silico perturbation to determine transcription factor targets
- in silico perturbation to determine transcription factor cooperativity



# References
https://hf-mirror.com/ctheodoris/Geneformer