Prism is an implementation of the statistical phasing algorithm used in
Kuleshov, Xie, Chen, et al., Cost-effective whole-genome haplotyping using a combination of dilution and statistical methods.
Whole-genome haplotyping (also known as genome phasing) is the problem of determining the differences between the paternal and maternal copies of a chromosome. Current sequencing methods are presently unable to adequately resolve these differences; in practice, one uses either statistical methods, which can approximately infer the most likely genomic phase by looking at patterns in the genome of many individuals, or molecular approaches, which are often expensive and time consuming.
Prism uses a new phasing algorithm that augments traditional molecular haplotyping approaches with statistical methods. It determines haplotypes statistically where typical molecular phasing methods fail. By using Prism, one can substantially reduce the amount of molecular phasing that needs to be performed, and thus substantially reduce haplotyping costs. The paper by Kuleshov, Xie, Chen, et al. demonstrates how Prism can be combined with an inexpensive dilution haplotyping method to produce haplotypes that match the quality of existing methods at a fraction of the cost. The resulting technology, statistically-aided dilution haplotyping (SDH) represents the first commercially available haplotyping product and is offered by Illumina Inc.
This document contains instructions for running Prism, as well as a step-by-step tutorial that goes over running Prism on a 1 Mbp region of chromosome 22.
The Prism phaser was written by Volodymyr Kuleshov. The algorithm and source code are the property of Illumina Inc.
The source code of Prism is released under the Illumina Open Source License, available at https://github.com/sequencing/licenses.
A package containing Prism is available at http://www.stanford.edu/~kuleshov/prism.tar.gz. To install the package, extract the contents, and run the setup.py script:
wget http://www.stanford.edu/~kuleshov/prism.tar.gz
tar -zxvf prism.tar.gz
cd prism
python setup.py install
Prism is written in Python and C, and requires numpy (>=1.5.1) and scipy (>=0.8.0) as well as a relatively recent version of gcc. It makes use of the SciPy Weave package to compile inline C code into Python. It has been tested on a stock Ubuntu 12.04 Linux machine.
Prism takes as input a set of haplotype blocks assembled through a form of dilution haplotyping (which we call "local" blocks) and outputs for each block its most likely phase relative to its predecessor, as well as a confidence score. Given a desired accuracy threshold, one can combine the local blocks into global statistically-assembled superblocks. When used with the SDH technology described in Kuleshov, Xie, Chen, et al, these global blocks attain N50 lengths of 500 Kbp or more and contain on average fewer than one long switching error per Mbp.
Prism phases locally-assembled blocks based on a phased reference panel. It was designed to be used with the Thousand Genomes project pre-phased reference panel that is distributed as part of the IMPUTE2 software package.
Statistical phasing is performed as a two-stage process, each associated with one of two commands: prism
and prism-interleaving
. The prism
command computes the phase for a subset of the local blocks that do not overlap among each other. It produces also a path file, which contains a "path" through the phased reference panel, i.e. which describes the pair of reference panel haplotypes that most closely matches the subject at each genomic position. The prism-interleaving
, then determines the phase of the remaining blocks based on the path file.
More precisely, the prism
command works as follows:
prism
--blocks <local-blocks>
--positions <reference panel at position that need to be phased>
--start <genomic coordinate>
--end <genomic coordinate>
--phase <output file containing the phase of each block>
--path <output file describing the path through the reference panel>
--K <number of reference haplotypes to use in current window>
--N <effective population size parameter>
Each input block must be tagged with an ID. The phase output file has the following format:
ID PHASE SCORE
where PHASE
equals 0 or 1. If it equals 1, the block identified with ID should be flipped. The path output file uses the format
POS R1 R1 A1 A2
where POS
is a genomic position from the positions file, R1
and R2
are the indices of the reference panel that best describe the sample at that position, and A1
, A2
are the allele of the reference panel haplotypes at that position.
The prism-interleaving
command requires the following input
prism-interleaving
--unphased-blocks <set of interleaving blocks to be phased>
--path <output file describing the path through the reference panel>
--vcf <vcf file of the subject>
--chr <chromosome to consider>
--reference-panel <the reference panel for the chromosome; this requires a .hap.gz file in IMPUTE2 format>
--reference-legend <the reference panel for the chromosome; this requires a .legend.gz file in IMPUTE2 format>
--reference-sample <the reference panel for the chromosome; this requires a .sample file in IMPUTE2 format>
--phased-blocks <output file containing output blocks>
The phased blocks returned by prism-interleaving
are in the same format as the input blocks, but some of them have been flipped to match the phase that was deemed the most likely. The algorithm used behind prism-interleaving
is very simple; it simply assigns to each block the phase that is closest to the haplotypes defined in the "path" input file.
A set of locally-phased blocks is transformed into global blocks in two stages. Initially, the blocks must be split into two subsets, in a way that the first subset contains no overlapping blocks.
The following is a step-by-step tutorial on how to use Prism over a small ~1Mbp region of chromosome 22 of HapMap sample NA12878. We suggest that the reader run these steps on a stock Ubuntu 12.04 Linux machine, for example on an new m1.large Amazon EC2 instance. Depending on the chosen parameter settings, Prism can use several gigabytes of RAM over the course of its execution.
We recommend manually setting up the packages required for running Prism:
sudo apt-get update
sudo apt-get install gcc
sudo apt-get install python-setuptools
sudo apt-get install python-numpy
sudo apt-get install python-scipy
On a stock Ubuntu 12.04 machine, extract the contents of the prism.tar.gz package and run the installation script:
tar -zxvf prism.tar.gz
cd prism
sudo python setup.py install
This will make availble two new commands: prism
and prism-interleaving
.
Next, download the data package with various input files for NA12878 chromosome 22 (for simplicity, we recommend downloading directly to the prism folder):
wget http://www.stanford.edu/~kuleshov/prism-data.tar.gz
tar -zxvf prism-data.tar.gz
The prism-data
package contains the following files:
na12878.chr22.local.blocks
: Set of non-interleaving blocks obtained from the local phasing stage of SDH.na12878.chr22.local.interleaving.blocks
: Set of interleaving blocks from the local phasing stage of SDH. Together with the above file, these represent all the local blocks for chromosome 22. Note that there are many more interleaving blocks; however, they are all very short. They mostly represent single SNPs that fell within a larger local block, but could not be connected to it via a long fragment.
na12878.chr22.positions
: File representing the set of all positions to be considered for global phasing. Each position also contains its genetic distance from the very first position in the file, as well as the haplotypes of the reference panel.na12878.chr22.vcf
: The VCF of the subject.na12878.chr22.true-phase
: The true phase of the subject, determined by trio phasing.impute2-panel/
: Data for chromosome 22 from the IMPUTE2 phasing panel.The first step in a typical Prism workflow would be to phase the non-interleaving blocks in na12878.chr22.local.blocks
. These are long and accurate blocks obtained by local phasing. We may assess their accuracy using the evaluate_blocks.py
helper script in the scripts/
subfolder:
python scripts/evaluate_blocks.py
--blocks prism-data/na12878.chr22.local.blocks
--true-phase prism-data/na12878.chr22.true-phase
The long switch accuracy for these blocks should be 99.95%. Unfortunately, these blocks are relatively short (N50 length of about 60 Kbp); let's now use the prism
phaser to compute the most likely phase of each block relative to the other within a small genomic region: chr22:18890344-19737410
.
prism
--blocks prism-data/na12878.chr22.local.blocks
--positions prism-data/na12878.chr22.positions
--start 18890344
--end 19737410
--K 75
--path region.path
--phase region.phase
We manually specified a --K
value of 75, indicating that prism
should perform inference with the 75 haplotypes that are the closest to the subject within that particular region. Although larger values of K
may yield better performance, the running time and memory usage of the algorithm increases superlinearly with K
. A value of 75 should be suitable for a demonstration, but if the phaser takes too much time, we suggest moving to a value of 50 or 40.
Over the course of the execution, prism
will automatically compile its C functions using SciPy Weave.
On a m1.large Amazon EC2 instance, the above command takes about 5 minutes to finish with --K 75
.
After prism
finishes successfully, it will have created two files: region.phase
and region.path
. For now, we may disregard the latter; the phase inferred for every block (identified by its ID) is found in the first file. To apply this phase to our local blocks, we use the phase_local_blocks.py
helper script, which will flip the blocks the way the phaser specified:
python scripts/phase_local_blocks.py
--phase region.phase
--unphased-local-blocks prism-data/na12878.chr22.local.blocks
--phased-local-blocks phased.local.blocks
We may assess the accuracy of the resulting statistical phasing using the evaluate_statistical_accuracy.py
helper script:
python scripts/evaluate_statistical_accuracy.py
--true-phase prism-data/na12878.chr22.true-phase
--blocks phased.local.blocks
--threshold 0
We should see a switch accuracy of 94.8%. This is slightly higher than we would normally expect for non-overlapping blocks. Typically, such blocks will be relatively far apart from each other, and will exhibit less linkage disequilibrium (LD) than, say, interleaving blocks. However, the region we chose seems to have somewhat higher LD than usual.
In addition to reporting the phase of each block, prism
also provided us with confidence scores for each block. We may use these scores to disregard the suggested phase of some blocks. For instance, we may look only at the phase between blocks that have a confidence score above 0.8:
python scripts/evaluate_statistical_accuracy.py
--true-phase prism-data/na12878.chr22.true-phase
--blocks phased.local.blocks
--threshold 0.8
This should give us a much higher confidence score: 97.4%. Typically, across the entire genome, the statistical phasing accuracy at this stage should be slightly above 96%.
Next, we need to phase the remaining local blocks: ones in na12878.chr22.local.interleaving.blocks
. For that we will need the region.path
file generated at the previous stage, as well as the full Thousand Genomes reference panel for chromosome 22. The region.path
specifies which pair of haplotypes from the reference panel best describe the subject at each genomic position that was considered by the prism
command (i.e. the ones in the file na12878.chr22.positions
). To determine the phase of each interleaving block, we will compare it to the haplotypes specified by these two files, and pick the phase that best matches them:
prism-interleaving
--unphased-blocks prism-data/na12878.chr22.local.interleaving.blocks
--path region.path
--vcf prism-data/na12878.chr22.vcf
--chr chr22
--reference-panel prism-data/impute2-panel/ALL_1000G_phase1integrated_v3_chr22_impute.hap.gz
--reference-legend prism-data/impute2-panel/ALL_1000G_phase1integrated_v3_chr22_impute.legend.gz
--reference-sample prism-data/impute2-panel/ALL_1000G_phase1integrated_v3.sample
--phased-blocks phased.interleaving.blocks
This command will take about a minute to complete. It will produce a file in which the interleaving blocks whose phase could have been determined; they will be already flipped accordingly.
Let's now combine them with the blocks obtained at the previous stage:
cat phased.local.blocks phased.interleaving.blocks | sort -n -k 3 > phased.blocks
We may now estimate again the statistical accuracy of our phasing:
python scripts/evaluate_statistical_accuracy.py
--true-phase prism-data/na12878.chr22.true-phase
--blocks phased.blocks
--threshold 0.8
The statistical phasing accuracy should have now improved to 98%. This is typically what we would expect at this stage, using a threshold of 0.8. Interleaving blocks have a high statistical phasing accuracy because they are physically closer to neighboring blocks.
We now have assigned a phase and a confidence score to every block obtained from the local stage. We may use this information to combine these haplotypes into global super-blocks. We will do so by going from left to right across the genome and adding the next local block to the current global block if its confidence score is above our minimum threshold. If it is not, we stop the current block, and start a new one. We recommend using a threshold of 0.8 for a good tradeoff between block size and quality.
python scripts/get_global_blocks.py
--phased-local-blocks phased.blocks
--global-blocks global.blocks
--threshold 0.8
We can now evaluate the accuracy of these blocks:
python scripts/evaluate_accuracy.py
--true-phase prism-data/na12878.chr22.true-phase
--blocks global.blocks
This should give a long switch accuracy of 99.8%, which is low by our standards, suggesting that an error slipped in somewhere in this region. A look at global.blocks shows we have two blocks of 250 and 600 Kbp:
cut -f 2,3 global.blocks | awk '{print $2 - $1 + 1}'
A more closer inspection would show that the second block has one long switch error in it, which accounts for the 0.2% loss in accuracy:
1111111111111101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
00000000000000000000000000000000000000000000000000000000000000011111111111111111111111011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
Strangely, this error was not detected through the confidence scores. A closer look at the local blocks can explain why:
49 0000000000000000000000000000000
50 000000000
51 00000001111111111
52 1111111111
53 111
54 011111111
55 111111111111111111111111
56 111111111111111111111111111111111111
57 11
58 11111111111111111
It turns out that a switch error occurred in the middle of local block #51. Thus, the statistical algorithm actually made a sensible decision, as it aligned blocks 52-58 to continue the phase that was started in block 51.
Fortunately, local blocks have a high accuracy (recall that we measured it to be 99.95%) and errors as in the above case are quite rare. Most genomic windows, such as chr22:24700000-25500000
or chr22:20300000-21000000
exhibit no global errors. Across the chromsome 22, the long switch accuracy of the global blocks equals 99.9%.
The above tutorial showed how phase a 1Mbp window of chromosome 22. Extending this approach to the whole genome is conceptually simple, and requires partitioning the genome into many such overlapping windows. These windows can then all be phased in parallel on a compute cluster, and merged together to obtain an entire phased genome. For greater accuracy, one may use a larger panel size, although we did not see any improvements in accuracy beyond -K 200
. Finally, it may be interesting to play with the --N
parameter (effective population size in the Li & Stephens (2002) model), which may lead to a better accuracy on some samples, especially one that are very heterozygous.