This repository accompanies Rice, Novembre, and Desai (2018). It includes:
jupyter
notebooks, along with processed data and simulation output, for reproducing the figures in the manuscript.snakemake
pipelines for reproducing simulations and data analysis from scratch.python
source code for computing two-site frequency spectra (2-SFS) and frequency pointwise mutual information (fPMI) in Kingman coalescent models of population growth and beta coalescent models.
First, clone this repository using:
git clone https://github.com/dp-rice/multiplemergers.git
The repository contains a conda
environment file for reproducing the computing environment where the project was originally carried out. Conda will make sure that your python
version is the same one we used and that you have all of the necessary packages, including jupyter
and snakemake
. To use this environment:
-
Install
conda
, following the instructions here. -
From inside the project directory, create a new
conda
environment using:
conda env create -f config/environment.yml
- Activate the
conda
environment:
source activate mmc
See the conda
documentation for more information about using environments.
The coalescent simulations were performed using a modified version of msprime
, which allows for multiple mergers (lambda) coalescence.
Most of the under-the-hood work was done by Joe Zhu, but we made some modifications to allow access from the python API.
We would intend to contribute this feature to the main msprime
codebase.
For now, if you would like to reproduce our multiple mergers coalescent simulations, you will need to install our version of msprime
:
- Clone the repository wherever you would like on your machine:
git clone https://github.com/dp-rice/msprime-lambda.git
- With the
conda
environment activated, installmsprime
locally usingpip
:
cd [BASE/DIRECTORY/OF/MSPRIME-LAMBDA]; pip install .
You will now be able to use import msprime
within python
code to use the msprime python API whenever the mmc
conda environment is activated.
In order to reproduce the selective sweep simulations, you will need to install SLiM
.
Follow the instructions at: https://messerlab.org/slim/
In order to reproduce our coalescent model fits to Drosophila melanogaster diversity data, you will need to install fastNeutrino
.
Follow the instructions at: https://sourceforge.net/projects/fastneutrino/
If you would like to reproduce the figures in the manuscript without re-running the simulations or data pre-processing, we have provided the output of these steps.
Our simulation output is in a tar
archive. To unpack the archive, use the command: tar -xzf simulations.tgz
.
The processed Drosophila data is in the directory data/DPGP3/minor_allele_counts
. Each file contains the number of genotyped samples (out of 100) and minor allele count at every site in one chromosome arm.
The commands for reproducing our figures are in three jupyter
notebooks:
-
notebooks/figures.ipynb
contains computes two-site frequency spectra from theoretical calculations and simulation output and generates most of the figures in the paper. -
notebooks/figure5.ipynb
produces Figure 5 (hiloPMI vs. genetic distance) from simulation output. -
notebooks/DPGP3_figures.ipynb
computes the 2-SFS from Drosophila data and compares it to coalescent simulations of population growth.
If you would like to reproduce the analysis from scratch or do something similar, we have provided two snakemake
pipelines:
-
snakefiles/simulations.snake
contains the pipeline for running all of the simulations. -
snakefiles/dpgp3.snake
contains the pipeline for processing the Drosophila data. You will need to download the raw data from the Drosophila Genome Nexus. You can also find a description of the data there. To download the data, use:
wget http://pooldata.genetics.wisc.edu/dpgp3_sequences.tar.bz2
To run snakemake locally, use commands like snakemake msprime_all
, which will run all the msprime
simulations. See the snakefiles for all rules. There is also a bash script run_snakemake.sh
, which will submit jobs to a parallel computing cluster. It is currently configured for SLURM
job scheduling.
The src
directory contains python
code for computing the various statistics described in the manuscript.
src/coalescentmoments.py
contains functions for calculating the 2-SFS for the beta coalescent.src/zivkovic.py
contains functions for calculating the 2-SFS for the Kingman coalescent with population growth.src/simulate_joint_sfs.py
is a command-line interface for running msprime simulations and calculating the 2-SFS (sometimes called "joint SFS in the code").
For examples of how to use these functions, see the notebooks and snakemake
files.