themudo/genomic_sequence_downloader.py

 
 

genomic_sequence_downloader.py

Introduction


genomic_sequence_downloader.py is a Python script that downloads, in FASTA format, the genomic sequence underlying a given target gene annotation across multiple species with annotated genomes available at NCBI. The script is built on the NCBI Entrez APIs and uses the latest annotation version of each target species' genome.
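As a rough illustration of what an Entrez lookup can look like, the sketch below builds an E-utilities ESearch URL for one gene in one species. It only constructs the URL and does not contact NCBI. The function name build_gene_search_url and the exact search term are assumptions made for this example; the queries the script actually issues may differ.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_gene_search_url(gene_name, species_name):
    """Return an ESearch URL that looks up a gene by name in one species.

    Hypothetical helper: the search term is an assumption for illustration,
    not necessarily the query used by genomic_sequence_downloader.py.
    """
    term = f"{gene_name}[Gene Name] AND {species_name}[Organism]"
    params = {"db": "gene", "term": term, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = build_gene_search_url("RAG1", "Homo sapiens")
print(url)
```

Fetching that URL returns the matching Gene database IDs, which could then be resolved to genomic coordinates with further Entrez calls.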

The script can be run in batch mode using batch_gsd.py. Batch mode requires two text input files: one listing the scientific names of the species of interest, one per line; and one listing the genes of interest, where each line holds (tab-separated) the name of the gene followed by the names of three upstream genes and three downstream genes.
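The two batch input files described above could be read along these lines. This is a minimal sketch that assumes exactly seven tab-separated fields per gene line, in the order target, three upstream genes, three downstream genes. The function names and the placeholder gene names are hypothetical, and batch_gsd.py may parse its inputs differently.

```python
import csv
import io


def read_species_list(handle):
    """Return the non-empty lines of a species file, one scientific name each."""
    return [line.strip() for line in handle if line.strip()]


def read_gene_table(handle):
    """Parse the tab-separated gene file into one dict per target gene."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        fields = [f.strip() for f in fields]
        if not any(fields):
            continue  # skip blank lines
        if len(fields) != 7:
            raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")
        rows.append({
            "target": fields[0],
            "upstream": fields[1:4],
            "downstream": fields[4:7],
        })
    return rows


# In-memory stand-ins for the two input files (placeholder names).
species = read_species_list(io.StringIO("Homo sapiens\nMus musculus\n\n"))
genes = read_gene_table(io.StringIO("GENE_A\tUP1\tUP2\tUP3\tDOWN1\tDOWN2\tDOWN3\n"))
print(species)
print(genes)
```

With real files, the io.StringIO objects would be replaced by open(path, newline="") handles.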

Dependencies


Usage

genomic_sequence_downloader.py requires a set of 10 arguments:

  • the name of the gene of interest (-target_gene_name);
  • the names of the three downstream and three upstream genes flanking the target gene (-1st_downstream_flanking_gene_name to -3rd_downstream_flanking_gene_name and -1st_upstream_flanking_gene_name to -3rd_upstream_flanking_gene_name); if the target gene is not annotated in a given species' genome, the script uses these to download the genomic region where the target gene is most likely located (following the principle of synteny conservation across evolution);
  • the input path to a .txt file listing the species of interest (scientific names, one per line) (-target_species_list_file_path);
  • the output path to a .fasta file that will contain the sequence downloaded for each species (-sequences_content_output_file_path);
  • the output path to a .csv file that will contain metadata for each downloaded sequence (-sequences_data_output_file_path), including, among others, from left to right: the scientific and common name of the species, the genomic sequence ID, the coordinates on that sequence delimiting the extracted portion, the ID of the genome assembly, and the method used to define the extracted sequence (either the Annotated Gene-Based Method or the Synteny Conservation-Based Method).

usage: python3 genomic_sequence_downloader.py
                         -target_gene_name
                         -1st_downstream_flanking_gene_name
                         -2nd_downstream_flanking_gene_name
                         -3rd_downstream_flanking_gene_name
                         -1st_upstream_flanking_gene_name
                         -2nd_upstream_flanking_gene_name
                         -3rd_upstream_flanking_gene_name
                         -target_species_list_file_path
                         -sequences_content_output_file_path
                         -sequences_data_output_file_path
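The choice between the two methods named in the argument list above can be pictured with the following sketch. It is not taken from the script: it assumes that flanking genes are given nearest first, that coordinates are 1-based and inclusive, and that the synteny region is simply the gap between the nearest annotated flanking gene on each side. GeneHit and choose_region are hypothetical names.

```python
from typing import NamedTuple


class GeneHit(NamedTuple):
    """Location of an annotated gene (hypothetical structure)."""
    name: str
    seq_id: str
    start: int
    end: int


def choose_region(target, downstream, upstream):
    """Return (seq_id, start, end, method), or None if no region can be set.

    target is a GeneHit or None; downstream and upstream are lists of
    GeneHit-or-None, assumed to be ordered from nearest to farthest.
    """
    if target is not None:
        return (target.seq_id, target.start, target.end,
                "Annotated Gene-Based Method")
    # Fall back to the nearest annotated flanking gene on each side.
    near_down = next((g for g in downstream if g is not None), None)
    near_up = next((g for g in upstream if g is not None), None)
    if near_down is None or near_up is None:
        return None
    if near_down.seq_id != near_up.seq_id:
        return None  # flanking genes sit on different sequences
    left, right = sorted([near_down, near_up], key=lambda g: g.start)
    if left.end + 1 > right.start - 1:
        return None  # flanking genes overlap or touch, so there is no gap
    return (left.seq_id, left.end + 1, right.start - 1,
            "Synteny Conservation-Based Method")


region = choose_region(
    None,
    downstream=[None, GeneHit("DOWN2", "SEQ_1", 100, 200), None],
    upstream=[GeneHit("UP1", "SEQ_1", 900, 1000), None, None],
)
print(region)
```

Passing three names per side gives the fallback several chances to find an annotated neighbour when the closest one is also missing from a genome.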

Example

Download the script available at script/genomic_sequence_downloader.py, then try the following example, which targets the RAG1 protein-coding gene across a set of 160 mammalian species:

python3 genomic_sequence_downloader.py RAG1 TRAF6 PRR5L COMMD9 IFTAP LRRC4C API5 input_species.txt sequences_output.fasta sequences_data.csv

With these arguments and the input_species.txt file from the example folder, the script should generate the output files found in that same folder (sequences_output.fasta and sequences_data.csv).
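One quick way to inspect the FASTA output is to count the records and their lengths. The sketch below uses only the standard library and makes no assumption about how the script formats its header lines; fasta_lengths is a hypothetical helper, and the sample records are made up for the demonstration.

```python
import io


def fasta_lengths(handle):
    """Map each FASTA header (without '>') to its total sequence length."""
    lengths = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            lengths[header] += len(line)
    return lengths


# Made-up records standing in for sequences_output.fasta.
demo = io.StringIO(">seq1 example\nACGT\nAC\n>seq2\nGGG\n")
lengths = fasta_lengths(demo)
print(lengths)
```

For the real output, open sequences_output.fasta and pass the file handle instead of the io.StringIO object; the number of records can then be compared with the number of rows in sequences_data.csv.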


Enjoy it and please let me know if you have any specific questions!
