Commit b83756f8 authored by wangsen

add readme.md

parent 076ea1f8
# CIRI-deep
- CIRI-deep is a deep-learning model that predicts differentially spliced circRNAs between two biological samples from total RNA sequencing data.
- An adapted version of CIRI-deep, CIRI-deepA, was trained for poly(A) selected RNA-seq data.
# Installation
The CIRI-deep model is built on Keras. An `environment.yaml` file is provided, and the dependencies can be installed as follows:
```
git clone https://github.com/gyjames/CIRIdeep.git
cd CIRIdeep
conda env create -n CIRIdeep -f ./environment.yaml
conda activate CIRIdeep
```
# Usage
The main program `CIRIdeep.py` can predict differentially spliced circRNAs with CIRIdeep or CIRIdeep(A), or train your own model.
## Predict
**Prediction with CIRIdeep using total RNA-seq data**
CIRIdeep provides the probability that a given circRNA is differentially spliced between any two samples. Prediction with CIRIdeep requires the expression values of 1499 RBPs (listed in `./demo/RBPmax_totalRNA.tsv`) and the splicing amount (derived from SAM alignment files) for both samples. The order of RBP expression values in each sample must be exactly the same as in the `RBP max value file`. We recommend processing raw total RNA-seq FASTQ files with `CIRIquant`, which provides the junction ratio of each circRNA and the expression value of each gene in a one-stop manner. SAM files generated with BWA are recommended for producing splicing amount values.
```
python CIRIdeep.py predict -geneExp_absmax ./demo/RBPmax_totalRNA.tsv -seqFeature ./demo/cisfeature.tsv -splicing_max ./demo/splicingamount_max.tsv -predict_list ./demo/predict_list.txt -model_path ./models/CIRIdeep.h5 -outdir ./outdir -RBP_dir ./demo/RBPexp_total -splicing_dir ./demo/splicingamount
```
Several files are needed for prediction.
- `-geneExp_absmax`: the maximum expression value (TPM) of each of the 1499 RBPs across the training datasets, used for normalization.
- `-seqFeature`: normalized cis features of the circRNAs to be predicted. A table containing cis features of 71459 circRNAs has been constructed.
- `-splicing_max`: the maximum splicing amount of each circRNA across the training datasets, used for normalization.
- `-predict_list`: a two-column file. The first column contains the names of the two samples in a pair, separated by `_`. The second column contains the path to a file listing the circRNAs to be predicted. CircRNAs are given as coordinates on the `hg19` genome, like `chr10:102683732|102685776`.
- `-model_path`: path to the model file. A fully trained CIRIdeep model is provided.
- `-outdir`: directory for the prediction results.
- `-RBP_dir`: directory containing the RBP expression values (TPM) of the samples to be predicted.
- `-splicing_dir`: directory containing the splicing amount of the circRNAs to be predicted in each sample.
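As a minimal sketch of the coordinate format described above, the following Python snippet splits a circRNA ID such as `chr10:102683732|102685776` into chromosome, start and end. The names `CircCoord` and `parse_circ_id` are hypothetical and not part of CIRIdeep; the code only assumes the `chrom:start|end` layout shown in the example.

```python
from typing import NamedTuple


class CircCoord(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_circ_id(circ_id: str) -> CircCoord:
    """Parse 'chrom:start|end' into its parts.

    Hypothetical helper for illustration; not part of CIRIdeep.
    """
    chrom, colon, span = circ_id.strip().partition(":")
    start_s, bar, end_s = span.partition("|")
    if not (chrom and colon and bar):
        raise ValueError(f"not in chrom:start|end form: {circ_id!r}")
    # int() raises ValueError on non-numeric positions
    return CircCoord(chrom, int(start_s), int(end_s))


print(parse_circ_id("chr10:102683732|102685776"))
```

A check like this can be run over the circRNA list referenced in the `predict list file` before starting a prediction, so that malformed IDs are caught early.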
**Prediction with CIRIdeep(A) using poly(A) selected RNA-seq data**
CIRIdeep(A) gives three probabilities, which sum to one, indicating that the circRNA is unchanged, has a higher junction ratio in sample A, or has a higher junction ratio in sample B.
The order of the samples (A, B) is the same as in the sample pair name given in the `predict list file`.
Because in some cases, such as scRNA-seq or spatial transcriptomics data, only a gene expression matrix is available, CIRIdeep(A) does not require the splicing amount.
```
python CIRIdeep.py predict -geneExp_absmax ./demo/RBPmax_polyA.tsv -seqFeature ./demo/cisfeature.tsv -predict_list ./demo/predict_list.txt -model_path ./models/CIRIdeepA.h5 -outdir ./outdir -RBP_dir ./demo/RBPexp_polyA --CIRIdeepA
```
`--CIRIdeepA` This flag is required when predicting with CIRIdeep(A).

The input files are similar to those of CIRIdeep, except that the splicing amount files are not needed. **Notably**, the `RBP max value file` differs from the one used for CIRIdeep, and all expression values should be derived from poly(A) selected RNA-seq data. As with CIRIdeep, the order of RBP expression values in each sample must be exactly the same as in the `RBP max value file`.

**Install the environment** (commands from the new version of the README in this commit)
```
conda create -n CIRIdeep python=3.7
source activate CIRIdeep
pip install tensorflow-1.15.1+git06e2e8aa.dtk2404-cp37-cp37m-linux_x86_64.whl
pip install -r requirements.txt
pip install 'h5py<3.0.0'
pip install protobuf==3.20.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

**Generation of input files**
Here we give the necessary instructions for generating the input files from different datasets.
**RBP expression of total RNA-seq data**
The RBP expression file has two columns: the first gives the gene symbol and the second gives the expression level of the RBP in TPM. The order of genes must be exactly the same as in `demo/RBPmax_totalRNA.tsv`.
| Gene Name | TPM |
|------|-------|
|A1CF|12.5|
|AAR2|23.9|
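
Because the gene order must match the RBP max value file exactly, a quick check before prediction can catch mismatches. The sketch below compares the first column of two files. The function names are hypothetical, and the layout (tab-separated, gene symbol in the first column, optional header row) is an assumption to adjust to the actual files.

```python
import csv
from typing import List, Optional


def read_gene_column(path: str, skip_header: bool = False) -> List[str]:
    """Return the first column (gene symbols) of a tab-separated file."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if skip_header:
        rows = rows[1:]
    return [row[0] for row in rows]


def first_order_mismatch(sample_genes: List[str],
                         reference_genes: List[str]) -> Optional[int]:
    """Return the first index where the two gene lists differ, or None if identical."""
    for i, (a, b) in enumerate(zip(sample_genes, reference_genes)):
        if a != b:
            return i
    if len(sample_genes) != len(reference_genes):
        # One list is a prefix of the other; they diverge where the shorter ends.
        return min(len(sample_genes), len(reference_genes))
    return None


# In-memory demonstration; with real files, pass the lists from read_gene_column().
print(first_order_mismatch(["A1CF", "AAR2"], ["A1CF", "AAR2"]))
print(first_order_mismatch(["AAR2", "A1CF"], ["A1CF", "AAR2"]))
```

With real data, the reference list would come from `demo/RBPmax_totalRNA.tsv` (or `demo/RBPmax_polyA.tsv` for CIRIdeep(A)) and the sample list from each file in the RBP expression directory.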
**Splicing amount**
The splicing amount feature is used in CIRI-deep. A basic script, `script_splicingamount.py`, is provided to compute the splicing amount for each sample.
**RBP expression of poly(A) RNA-seq data**
The format is the same as that of the RBP expression file used for total RNA-seq data. The order of genes must be exactly the same as in `demo/RBPmax_polyA.tsv`.
**RBP expression of single-cell RNA-seq data**
When analyzing differentially spliced circRNAs between cell clusters, the mean RBP expression level (in CPM or TPM) of each cluster is used. The order of genes must be exactly the same as in `demo/RBPmax_polyA.tsv`.
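
The per-cluster mean described above can be sketched in plain Python. The input layout (a mapping from cell ID to a list of CPM or TPM values already in the `demo/RBPmax_polyA.tsv` gene order, plus a cell-to-cluster mapping) and all names are assumptions for illustration, not part of CIRIdeep.

```python
from collections import defaultdict
from typing import Dict, List


def cluster_mean_expression(
    cell_expr: Dict[str, List[float]],
    cell_cluster: Dict[str, str],
) -> Dict[str, List[float]]:
    """Average per-cell expression vectors within each cluster.

    Every vector is assumed to follow the same gene order, so averaging
    position by position preserves that order.
    """
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for cell, values in cell_expr.items():
        cluster = cell_cluster[cell]
        if cluster not in sums:
            sums[cluster] = [0.0] * len(values)
        if len(values) != len(sums[cluster]):
            raise ValueError(f"cell {cell} has a different number of genes")
        sums[cluster] = [s + v for s, v in zip(sums[cluster], values)]
        counts[cluster] += 1
    return {c: [s / counts[c] for s in vec] for c, vec in sums.items()}


means = cluster_mean_expression(
    {"cell1": [10.0, 0.0], "cell2": [30.0, 4.0], "cell3": [5.0, 5.0]},
    {"cell1": "T", "cell2": "T", "cell3": "B"},
)
print(means)
```

Each resulting vector can then be written out as one two-column RBP expression file per cluster.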
**RBP expression of spatial transcriptome data**
We recommend performing an imputation step before extracting the expression levels of RBPs. Tangram, gimVI and SpaGE are good choices. After imputation, the gene expression values should be normalized as: $$Exp^i = \frac{Exp_{imputed}^i}{\sum_j Exp_{imputed}^j} \times scalefactor$$
We used 300,000 as the scale factor. The order of genes must be exactly the same as in `demo/RBPmax_polyA.tsv`.
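
The normalization above can be sketched in a few lines of Python. The function name `normalize_imputed` is hypothetical. The default of 300,000 is the scale factor stated above, and the code assumes the sum in the formula runs over all imputed genes of one spot or sample.

```python
from typing import List


def normalize_imputed(values: List[float],
                      scale_factor: float = 300_000) -> List[float]:
    """Scale imputed expression values so that they sum to scale_factor."""
    total = sum(values)
    if total == 0:
        raise ValueError("total imputed expression is zero; cannot normalize")
    return [v / total * scale_factor for v in values]


normalized = normalize_imputed([1.0, 3.0, 6.0])
print(normalized)
```

After this step every spot has the same total expression, which keeps the RBP values on a scale comparable across spots.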
## Train
**CIRIdeep training**
```
python CIRIdeep.py train -geneExp_absmax /path/to/file -seqFeature /path/to/file -splicing_max /path/to/file -outdir /out/path -RBP_dir /RBP/path -splicing_dir /splicing/path
```
Hyperparameters are given in `config.py`, which must be in the same directory as `CIRIdeep.py`.
**CIRIdeep(A) training**
```
python CIRIdeep.py train -geneExp_absmax /path/to/file -seqFeature /path/to/file -outdir /out/path -RBP_dir /RBP/path --CIRIdeepA
```
**Test**
The new version of the README in this commit tests the installation by running the two demo prediction commands from the Predict section, first with CIRI-deep and then with CIRI-deepA (adding `--CIRIdeepA`). Its CIRI-deep example shows the line `Resources are waiting to be loaded...` after the command.
## Contact
Zihan Zhou. zhouzihan2018m@big.ac.cn
Please open an issue if you find bugs.
GSM4117496_GSM4117278-chr10:102683732|102685776 0.93381387 0.04291529 0.023270903
GSM4117496_GSM4117278-chr10:103427643|103436193 0.6981813 0.0053234557 0.29649526
GSM4117496_GSM4117278-chr10:103916776|103917971 0.999851 2.1411643e-05 0.00012754099
GSM4117496_GSM4117278-chr10:112356156|112358048 0.9994598 5.9756778e-05 0.0004803477
GSM4117496_GSM4117278-chr10:112723883|112745523 0.17680527 0.0019056727 0.82128906
GSM4117496_GSM4117278-chr10:12039671|12056183 0.83356875 0.0030888005 0.16334245
GSM4117496_GSM4117278-chr10:12123471|12162266 0.9978904 0.0010932159 0.0010164027
GSM4117496_GSM4117278-chr10:123661906|123683844 0.9997274 3.153922e-05 0.0002409183
GSM4117496_GSM4117278-chr10:126370176|126370948 0.5997549 0.048025265 0.35221982
GSM4117496_GSM4117278-chr10:126631026|126631876 0.039053295 0.0046387627 0.95630795
GSM4117496_GSM4117278-chr10:126727566|126799662 0.99925905 3.6211062e-05 0.0007047521
GSM4117496_GSM4117278-chr10:126799559|126811437 0.77496356 0.21577464 0.009261748
GSM4117496_GSM4117278-chr10:128768966|128788867 0.9612441 0.03768977 0.0010661611
GSM4117496_GSM4117914-chr10:102683732|102685776 0.96108353 0.022447577 0.016468925
GSM4117496_GSM4117914-chr10:103427643|103436193 0.93654233 0.05269837 0.010759323
GSM4117496_GSM4117914-chr10:103916776|103917971 0.99998367 3.7056334e-06 1.2626175e-05
GSM4117496_GSM4117914-chr10:105197772|105198565 0.9984523 0.0013729727 0.00017471335
GSM4117496_GSM4117914-chr10:112356156|112358048 0.9990736 0.00012950468 0.00079690054
GSM4117496_GSM4117914-chr10:112723883|112745523 0.9335758 0.008081945 0.05834227
GSM4117496_GSM4117914-chr10:126370176|126370948 0.6822433 0.28132218 0.036434624
GSM4117496_GSM4117914-chr10:126631026|126631876 0.72082007 0.27297902 0.0062008724
GSM4117496_GSM4117914-chr10:126727566|126799662 0.99745744 2.642908e-05 0.0025161307
GSM4117496_GSM4117914-chr10:126799559|126811437 0.8097568 0.16435811 0.025885087
GSM4117496_GSM4117914-chr10:128768966|128788867 0.9850642 0.0073393574 0.007596481
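
The rows above appear to be CIRIdeep(A)-style output: an ID of the form `sampleA_sampleB-chrom:start|end` followed by three probabilities. A minimal parsing sketch follows. It assumes whitespace-separated fields, that the first `-` separates the sample pair from the circRNA ID, and that the three columns follow the order given in the Predict section (unchanged, higher in A, higher in B). That column order is an assumption rather than something the file itself states, and all names are hypothetical.

```python
from typing import List, Tuple

# Assumed column order, taken from the prose description of CIRIdeep(A) output.
LABELS = ("unchanged", "higher_in_A", "higher_in_B")


def parse_prediction_line(line: str) -> Tuple[str, str, List[float], str]:
    """Split one output row into (sample pair, circRNA ID, probabilities, top label)."""
    identifier, *prob_fields = line.split()
    pair, _, circ = identifier.partition("-")
    probs = [float(p) for p in prob_fields]
    if len(probs) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} probabilities: {line!r}")
    top = LABELS[max(range(len(probs)), key=probs.__getitem__)]
    return pair, circ, probs, top


row = "GSM4117496_GSM4117278-chr10:126631026|126631876 0.039053295 0.0046387627 0.95630795"
pair, circ, probs, top = parse_prediction_line(row)
print(pair, circ, top, round(sum(probs), 3))
```

Sample names that themselves contain `-` would break the split on the first hyphen, so that part needs adapting to the naming scheme in use.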
@@ -3,4 +3,4 @@ matplotlib==3.3.4
 numpy==1.19.2
 pandas==1.1.5
 scikit-learn==0.24.2
-python==3.6.3
+#python