GitHub - shaojunyu/DNA-probe-efficiency: Model building and efficiency prediction based on DNA sequences

DNA Probe Targeting Efficiency

A deep learning tool for predicting DNA probe on-target efficiency in targeted sequencing

Table of Contents

About The Project
Getting Started
- Prerequisites
- Installation
Usage
Roadmap
License
Contact

About The Project

A deep learning tool for building models that predict the on-target efficiency of DNA probes from their sequences. We provide sample data and pre-trained models for testing and evaluation. Use this tool to train customized models on your own dataset, or to predict efficiency for new datasets with the pre-trained models. It can also be modified and applied to other sequence regression problems.

Getting Started

This tool is developed in Python, so you need to set up a proper Python environment to run the code.

Prerequisites

The recommended way to run this code is to create a conda environment and install all the dependencies inside that environment.

```
$ conda create -n probe python=3.7
$ conda activate probe
```

Install the dependencies:

```
(probe) $ conda install pandas scipy numpy tqdm scikit-learn -c conda-forge
(probe) $ conda install pytorch -c pytorch
```

To use a GPU to accelerate this tool, make sure the appropriate CUDA driver is installed, then follow the installation instructions from PyTorch.
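Before passing a GPU ID to the tool, it can help to confirm that PyTorch actually sees a CUDA device. The snippet below is a minimal sketch and not part of this tool; `pick_device` is a hypothetical helper name. It falls back to the CPU when PyTorch is not installed, CUDA is unavailable, or the requested ID is out of range.

```python
def pick_device(gpu_id=None):
    """Return a device string such as 'cuda:0', or 'cpu' as a fallback."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "cpu"
    if (
        gpu_id is not None
        and torch.cuda.is_available()
        and 0 <= gpu_id < torch.cuda.device_count()
    ):
        return f"cuda:{gpu_id}"
    return "cpu"


if __name__ == "__main__":
    # Prints 'cuda:0' only when a usable CUDA device with ID 0 exists.
    print(pick_device(0))
```

If this prints `cpu` on a machine that has a GPU, the driver or the PyTorch build is the likely culprit.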

Installation

Download the latest release of this tool from the release page and unzip it. The tool is then ready to use; the following command prints the help message:

```
(probe) $ python3 DNA_Probe.py -h
```


Usage

Here are some simple examples of how to use this tool. For more options, refer to the supported arguments listed under each subcommand. The main program has two subcommands: train and predict. In train mode, you train a new model on your own dataset. In predict mode, you use a pre-trained model to predict probe efficiency.

- Train

Train a new model with the input data and save the model to output. The model will only use the sequence features.

```
python3 DNA_Probe.py train \
-input data/pig_probe_effiency_150bp_train.tsv.gz \
-output models/pig_150bp_model.h5
```

Train a new model that includes structure information, with the learning rate set to 2e-5. The model uses the sequence features together with the corresponding structure information for each sequence.
```
python DNA_Probe.py train \
-input data/human_probe_effiency_120bp_with_struc_train.tsv.gz \
-use_struct \
-output models/human_120bp_struct.h5 \
-lr 2e-5
```

Use GPU to accelerate the training process.

```
python DNA_Probe.py train \
-input data/human_probe_effiency_120bp_with_struc_train.tsv.gz \
-use_struct \
-output models/human_120bp_struct.h5 \
-lr 2e-5 \
-gpu 0
```
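When training several models in a row, it may be convenient to assemble these command lines from a script. The following is a rough sketch that uses only the flags shown in the examples above; `build_train_command` is a hypothetical helper and not part of the tool, and it assumes `DNA_Probe.py` is in the current working directory.

```python
import sys


def build_train_command(input_path, output_path, use_struct=False, lr=None, gpu=None):
    """Assemble the argument list for `DNA_Probe.py train` from documented flags."""
    cmd = [sys.executable, "DNA_Probe.py", "train",
           "-input", input_path, "-output", output_path]
    if use_struct:
        cmd.append("-use_struct")      # flag without a value, as in the examples
    if lr is not None:
        cmd += ["-lr", str(lr)]
    if gpu is not None:
        cmd += ["-gpu", str(gpu)]      # omit to let the tool run on the CPU
    return cmd


if __name__ == "__main__":
    cmd = build_train_command(
        "data/human_probe_effiency_120bp_with_struc_train.tsv.gz",
        "models/human_120bp_struct.h5",
        use_struct=True, lr=2e-5, gpu=0,
    )
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the training run.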

Supported arguments of train:

input : str, required
The file path of the input data.
output : str, required
The file path of the output model.
gpu : int, optional
The GPU device ID used to accelerate the process. Leave it unset to run on the CPU, for example when no GPU is available. Default: None.
kmer : int, optional
The k-mer length for the DNA sequence. The default value is 1, which corresponds to one-hot encoding of the DNA. Any value larger than 1 encodes the DNA sequence as k-mers first. Note that k-mer encoding does not work when the use_struct option is set to True.
onehot : bool, optional
If true [default: True], use one-hot encoding for DNA sequences and structure sequences. Note that this argument overrides the kmer setting.
use_struct : bool, optional
If true, incorporate the structure information in the model; only one-hot encoding is available in this case. Default: False.
embed_dim : int, optional
Set the embedding dimension [default: 32] for input sequences.
epochs : int, optional
Set the number of epochs [default: 60] for model training.
batch_size : int, optional
Set the batch size [default: 64] for model training.
lr : float, optional
Set the learning rate [default: 1e-4] for model training.
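To make the difference between the `kmer` and `onehot` settings concrete, here is an illustrative sketch of both encodings. It is not the tool's own implementation: the function names, the A/C/G/T column order, the use of overlapping k-mers with a stride of 1, and the vocabulary ordering are all assumptions made for the example.

```python
from itertools import product

BASES = "ACGT"  # assumed column order for the one-hot vectors


def one_hot(seq):
    """Encode each base as a 4-element indicator vector (k-mer length 1)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def kmer_tokens(seq, k):
    """Split a sequence into overlapping k-mers and map each to an integer index."""
    # Vocabulary of all 4**k possible k-mers, in lexicographic order.
    vocab = {"".join(p): i for i, p in enumerate(product(BASES, repeat=k))}
    seq = seq.upper()
    return [vocab[seq[i:i + k]] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    print(one_hot("ACGT"))         # one row per base
    print(kmer_tokens("ACGT", 2))  # indices of the k-mers AC, CG, GT
```

Integer k-mer indices like these are the kind of input an embedding layer (see `embed_dim`) would typically consume, whereas one-hot vectors can be fed to a model directly.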

- Predict

Predict efficiency on new data and save the result to a file.

```
python DNA_Probe.py predict \
-input data/human_probe_effiency_120bp_with_struc_test.tsv.gz \
-model models/human_120bp_struct.h5_bk \
-output prediction.txt
```

Use GPU to accelerate the prediction.

```
python DNA_Probe.py predict \
-input data/human_probe_effiency_120bp_with_struc_test.tsv.gz \
-model models/human_120bp_struct.h5_bk \
-output prediction.txt \
-gpu 0
```

Supported arguments of predict:

input : str, required
The file path of the input data.
output : str, required
The file path of the prediction output.
model : str, required
The file path of the pre-trained model.
gpu : int, optional
The GPU device ID used to accelerate the process. Leave it unset to run on the CPU, for example when no GPU is available. Default: None.
batch_size : int, optional
Set the batch size [default: 128] for prediction.

- Data Format

Example datasets are in the data folder. Reviewing them is a good way to see how to prepare your own datasets.

Input data for training:

- Header-less TSV (tab-separated values) file.
- At least 2 columns, or at least 3 when structure information is used.
- The 1st column is the DNA sequence. All sequences must be the same length.
- Without structure information, the 2nd column is the efficiency value. With structure information, the 2nd column is the structure sequence in dot-bracket notation and the 3rd column is the efficiency value.

Example data:

AGCTTAACGAAGGGCCAGGAGAAGGTTTCTCTGTAGCCTCAGTCTGCCGGACGAACACATCCTTAGGCGACTTGGGACCGTTTCTTTTATCTTATCAAAGTCTACTACACATCGAAGAAT	26.779413773688
AGGGGTAGGACCAGAGGGCGGAGGAAGAGTATGGACAGACTCCTACTTCGACCAGCTTCACCACGACGGTAGCCTAGAAAAGTTGGACGAGGAGGCCCAACACCACGGAGCCCGGTGGAC	16.6173844090768

Example data with structure:

AGCAGGTTTCGAGACAGGTGAAACTGACGAGTGTAATGTCATCAAGAAAACAAGAAACCTGGTACACAGAAATAAATACGGACCGGTAAGGGGTAGTTCAGTAATCTATTTAAGGAACGA	(.((((((((((((((......(((....)))....)))).))..........)))))))).)..................(((.......))).((((..(((.....)))..))))..	17.7492339387625
CCGTGTAAGAACCCGAGTATTACCAGTCTATCACCTCCCCGAATGTATCCCGGTGTATAGACAGTTTCCGGTACCGATACTCGTCGTGGTAGAGGTGTGGTGGTTGCTGCACCTACTTCT	..(((((..((((........((((......(((((((((((..((((((((((((....)))....))))....)))))...))).))..))))))))))))))..)))))........	57.5186324677292
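The training format rules above can be checked before a long training run. The following is a minimal sketch, not part of the tool: `is_balanced` and `validate_rows` are hypothetical helpers. It assumes that sequences contain only A, C, G and T, and that the structure string has one character per base, both of which go beyond what is stated here.

```python
def is_balanced(structure):
    """Check that a dot-bracket string has matching parentheses."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ')' appeared before its '('
                return False
        elif ch != ".":
            return False           # only '(', ')' and '.' are accepted here
    return depth == 0


def validate_rows(lines, use_struct=False):
    """Return a list of (line_number, problem) for rows that break the format."""
    problems = []
    expected_cols = 3 if use_struct else 2
    seq_len = None
    for n, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < expected_cols:
            problems.append((n, "too few columns"))
            continue
        seq = fields[0]
        if seq_len is None:
            seq_len = len(seq)     # the first row defines the expected length
        elif len(seq) != seq_len:
            problems.append((n, "sequence length differs"))
        if set(seq.upper()) - set("ACGT"):   # assumption: ACGT only
            problems.append((n, "non-ACGT base"))
        if use_struct:
            struct = fields[1]
            if len(struct) != len(seq) or not is_balanced(struct):
                problems.append((n, "invalid structure"))
        try:
            float(fields[expected_cols - 1])  # efficiency column
        except ValueError:
            problems.append((n, "efficiency is not a number"))
    return problems
```

For a gzip-compressed file, the lines can be supplied with `gzip.open(path, "rt")`; an empty result means no problems were found under these assumptions.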

Input data for predicting:
- Same as the training input but without the efficiency value: just the DNA sequence and, optionally, the structure sequence.
- 1 or 2 columns.
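One way to obtain a prediction input is to drop the efficiency column from an existing training file. The sketch below illustrates this under stated assumptions; it is not part of the tool. `make_predict_input` is a hypothetical helper, and it assumes both files are gzip-compressed TSV, as the `.tsv.gz` example files suggest.

```python
import csv
import gzip


def make_predict_input(train_path, predict_path, use_struct=False):
    """Copy the sequence (and optional structure) columns, dropping efficiency."""
    keep = 2 if use_struct else 1   # number of leading columns to retain
    count = 0
    with gzip.open(train_path, "rt", newline="") as src, \
            gzip.open(predict_path, "wt", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            if not row:             # skip blank lines
                continue
            writer.writerow(row[:keep])
            count += 1
    return count                    # number of rows written
```

Keeping the known efficiency values aside in this way also makes it possible to compare them with the predictions afterwards.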


Roadmap

Predict on-target efficiency based on the probe sequence
Identify the sequence features that lead to high efficiency
Design highly efficient probes
- Sequence modification (adaptors, primers)
- Verification by experiments

License

Distributed under the BSD License. See LICENSE for more information.


Contact

Shaojun Yu - [email protected]
Zhuqing Zheng - [email protected]
Project Link: https://github.com/shaojunyu/DNA-probe-efficiency

