Esprit 2 is a software package to detect split genes in a proteome using a statistical test based on reconstructed gene trees with related reference genomes. The approach is described in detail in `http://xxx`_.
Clone the git repository from https://github.com/dessimozlab/esprit2
git clone https://github.com/dessimozlab/esprit2
Adjust your file load_env according to your environment, i.e. python environment, paths to required software,...
You need to provide a folder with fasta files, one per gene family and an orthoxml file. The simplest way to obtain these files is by running OMA Standalone on your dataset. The produced files /HierarchicalGroups.orthoxml/ and the folder /HOGFasta/ are the two input files you need for Esprit 2. Please put them to your esprit2 directory.
As of now, Esprit 2 needs a SunGridEngine (SGE) scheduler. This will likely change in the future and will be extended to other HPC schedulers.
Run Esprit by
./pipeline.sh <ID_prefix> <path_to_family_folder> <path_to_orthoxml>
ID_prefix
is a unique species- or chromosome-specific substring of a gene ID. For example, in case of wheat where all gene IDs follow the format Traes_chromosomeArm_string
(e.g., Traes_1BL_6EC2AB17D
) or TRAES3Bstring
(e.g., TRAES3BF036000300CFD
) for 3B reference assembly, an ID_prefix
could be:
Traes_chromosomeArm
, e.g.,Traes_1BL
- for detecting split genes within a chromosome arm, e.g., long arm of chromosome 1BTraes_chromosome
, e.g.,Traes_1B
- for detecting split genes within a chromosome, e.g., chromosome 1B. This will probably yield candidate pairs where one fragment has been assigned to 1BL (long arm) and the other to 1BS (short arm).Traes
- for detecting split genes within the whole wheat genome. Please be aware that the set of candidate pairs will contain fragments coming from different chromosomes.TRAES3B
- for detecting split genes within 3B reference assembly
By calling the pipeline.sh script with -h
, it will output additional parameters
you can specify together with their default values.
collapsing_results.txt
- columns: gene1, gene2, sister taxa before collapsing (True/False), sister taxa after collapsing (True/False)
lrt_summaries.tar.gz, lrt_summary.txt, missing_lk.txt
- tar.gz contains a summary per case tested, lrt_summary.txt provides test statistics and p-values for all cases, missing_lk.txt indicates cases where the tree likelihood wasn't computed (please have a look at these computations and investigate what went wrong)
predictions_ambiguous.txt, predictions_unambiguous.txt
- contain gene IDs for predictions
alignment_positions.txt
TSV file with the following columns:
- HOG ID
- gene1
- gene2
- start position of gene1 in the MSA
- end position of gene1 in the MSA
- start position of gene2 in the MSA
- end position of gene2 in the MSA
- overlap start position (or -1 if no overlap)
- overlap end position (or -1 if no overlap)
- %overlap of aligned gene1
- %overlap of aligned gene2
cuts.txt
- columns: HOG ID, gene1, gene2, their cut/middle position in the alignment
mapping.txt
- mapping between OMA IDs and IWGSC IDs
sequence_lengths.txt
TSV file with following columns: HOG ID, gene1, length of gene1, gene2, length of gene2
Contains also pairs with short sequence(s) which didn't pass the min sequence length criteria
aln_c.tar.gz, aln.tar.gz, phy_c.tar.gz, phy.tar.gz
- contain aligned families in FASTA format (aln_c, aln) and phylip (phy_c, phy). aln_c and phy_c contain families with n-1 sequences whereas aln and phy contain n sequences
hog_aln.tar.gz
- alignments of HOGs which contain at least 2 wheat genes from the chromosome of interest
bootstrap_aln.tar.gz, bootstrap_s_aln.tar.gz, bootstrap_phy.tar.gz, bootstrap_s_phy.tar.gz
- similar as above but for bootstrap samples. bootstrap_aln.tar.gz and bootstrap_phy.tar.gz contain samples with n-1 sequences whereas bootstrap_s_aln.tar.gz and bootstrap_s_phy.tar.gz contain samples with n sequences
collapsed.tar.gz
- contains trees after collapsing
n_1_res.tar.gz, n_notop_res.tar.gz, n_top_res.tar.gz, n_1_b_res.tar.gz, n_b_notop_res.tar.gz, n_b_top_res.tar.gz
- contain stats output from FastTree
n_1_trees.tar.gz, n_trees_notop.tar.gz, n_1_b_trees.tar.gz
- contain the infered FastTree trees
n_1_trees_s.tar.gz, n_1_b_trees_s.tar.gz
- contain input topologies for tree reconstructions with input topology