Typhon Manual


Overview

Typhon is a tool for indexing a multiple alignment. It supports the core functionality of Wu-BLAST, and can be used as such on a single sequence. However, its chief purpose is to build indexes directly from multiple alignments. It also accepts multiple alignments as queries, although the alignment is merely reduced to its consensus sequence for use as a query. If you have many queries to run, put them in .mfa format, and make sure to use the -sepq option. Without this option the queries are treated as a single multiple alignment, with undefined results.

Typhon is run most simply as typhon [database] [query]. The sequences must be in .mfa format; that is, the sequences should be concatenated, separated by lines of the form of '>[sequence name]'. Note: the sequence names will be determined exactly from those in the .mfa file.
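As an illustration of the .mfa layout described above, here is a minimal Python sketch that reads such a file into (name, sequence) pairs. The function name read_mfa and the sample species names are hypothetical; this is not Typhon's own parser, and it assumes that a sequence may be wrapped over several lines.

```python
def read_mfa(text):
    """Parse .mfa text into a list of (name, sequence) pairs.

    The name is taken exactly as written after '>' (assumption: only
    trailing whitespace is dropped).
    """
    records = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>human
ACGT-ACGT
ACGT
>mouse
ACGTTACGT
ACGA
"""
records = read_mfa(example)
```

Because the names are kept verbatim, the same list can later be compared against the names in a tree file.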

Different Algorithms Supported
Scoring Options
Output Options
Filtering Options
Performance Hints

Different Algorithms Supported

To run the basic BLAST algorithm (consecutive kmers, seed, ungapped extend, gapped extend), use the -runblast option. This will compress the database into its consensus sequence and run the standard BLAST algorithm. To query each species in the database (effectively running BLAST simultaneously for each species in the alignment), add the -runallpairs option.
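The manual does not say how the consensus sequence is computed. The following Python sketch shows one common choice, a per-column majority vote that ignores gaps; the function name, the 'N' placeholder for all-gap columns, and the voting rule itself are assumptions, not Typhon's documented behavior.

```python
from collections import Counter


def consensus(rows):
    """Majority-vote consensus of equal-length aligned rows ('-' is a gap)."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must be non-empty in number and equal in length")
    out = []
    for column in zip(*rows):
        counts = Counter(c.upper() for c in column if c != "-")
        # Assumption: a column made only of gaps becomes 'N'.
        out.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(out)


alignment = ["ACGT-", "ACGA-", "AGGTC"]
result = consensus(alignment)
```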

If you wish to run BLAST with a user-specified pattern, add the -pattern [bitstring] option. The pattern should be given as a bit string, with 0s representing don't-care positions. The WABA pattern would therefore be specified by -pattern 11011011011011. To run multiple patterns (effectively PatternHunter or Mandala), use the -runpattern option. You can specify the patterns you want to use with -patternfile [string], where the string is the name of the file containing the patterns. The file should have one pattern per line, with nothing else, in bit string format. Note: only the first -P patterns will be used, where -P [uint] is specified on the command line. By default Typhon will not read in every pattern in the file, so make sure to specify this. Also, each pattern will be truncated or expanded to fit the weight and span specified by -w [uint] and -s [uint]. Again, the weight and span are not set by default to those of the patterns in the file, so be careful. The value for -B [float] will be honored for this algorithm, but will effectively override -P.
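To make the bit string notation concrete, here is a small Python sketch. It uses the usual spaced-seed convention that the weight is the number of 1s and the span is the total length; the manual does not define these terms, so treat that as an assumption. The helper names are hypothetical, and the truncation or expansion that Typhon applies for -w and -s is not reproduced.

```python
def weight(pattern):
    """Number of positions that must match (the 1s)."""
    return pattern.count("1")


def span(pattern):
    """Total length of the pattern, including don't-care positions."""
    return len(pattern)


def spaced_word(sequence, start, pattern):
    """Bases of `sequence` at the 1 positions of `pattern`, from `start`.

    Returns None if the pattern would run past the end of the sequence.
    """
    if start + len(pattern) > len(sequence):
        return None
    return "".join(sequence[start + i]
                   for i, bit in enumerate(pattern) if bit == "1")


def read_patterns(lines, limit):
    """Keep the first `limit` patterns, checking they are bit strings."""
    patterns = [ln.strip() for ln in lines if ln.strip()][:limit]
    for p in patterns:
        if set(p) - {"0", "1"}:
            raise ValueError("not a bit string: " + p)
    return patterns


waba = "11011011011011"
```

Under this convention the WABA pattern has weight 10 and span 14.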

If neither -runblast nor -runpattern is specified, the Typhon algorithm is run. Although honored for multiple-pattern behavior, -B [float] is Typhon-specific and specifies the average number of patterns indexed per position in the database; it therefore controls the index size. -P [uint] determines the total number of candidate patterns used in the index, which can be specified using -patternfile [string].

It is strongly recommended that you specify a phylogenetic tree; this can be done using -tree [string]. The argument value should be the name of a file containing the tree in Phylip format. Branch lengths are also recommended and should be in units of substitutions per site. The tree should include all species in the alignment database. Note: The names in the tree must match those in the .mfa file exactly. Any species in the database not found in the tree will be inserted under the root. If your tree is not binary it will be made so. You should also include a hypothetical query in the tree; this can be done using -queryname [string], which should again match the entry in the tree exactly. If no query is specified the root will be used as the position of the query.
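Since names must match exactly, it can help to check the tree against the .mfa names before running. The Python sketch below pulls leaf names out of a parenthesized tree string with a regular expression and reports alignment names that are missing from the tree. The function names and the example tree are hypothetical, and the simple regular expression assumes unquoted names without spaces.

```python
import re


def leaf_names(tree_text):
    """Leaf names of a parenthesized tree: tokens that follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", tree_text)


def missing_from_tree(alignment_names, tree_text):
    """Names in the alignment that have no exact match among the leaves."""
    leaves = set(leaf_names(tree_text))
    return sorted(n for n in alignment_names if n not in leaves)


tree = "((human:0.1,mouse:0.3):0.05,(dog:0.2,query:0.15):0.02);"
missing = missing_from_tree(["human", "mouse", "dog", "cow"], tree)
```

Here "cow" would be reported as missing, which according to the text above means it would be inserted under the root.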

The parameter -numregionclasses [uint] is described in the Typhon paper. It has (apparently) little effect on the performance of the algorithm, but feel free to change it from its default value.

For any of the above algorithms, the -hitdist [uint] parameter will apply extensions only to pairs of seeds that lie within a window on the diagonal. This window is the argument to -hitdist.
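The idea of pairing seeds on a diagonal can be sketched in Python as follows. A seed at database position d and query position q lies on diagonal d - q; two seeds on the same diagonal are paired when they are close enough. Measuring the window between seed start positions and treating it as inclusive are assumptions, as is the function name.

```python
from collections import defaultdict


def seed_pairs_within(seeds, hitdist):
    """Pair neighbouring seeds on the same diagonal at most `hitdist` apart.

    `seeds` is an iterable of (db_pos, query_pos) tuples.
    Returns a list of (diagonal, first_db_pos, second_db_pos).
    """
    by_diagonal = defaultdict(list)
    for db_pos, q_pos in seeds:
        by_diagonal[db_pos - q_pos].append(db_pos)
    pairs = []
    for diagonal, positions in sorted(by_diagonal.items()):
        positions.sort()
        # Compare each seed with its nearest successor on the diagonal.
        for first, second in zip(positions, positions[1:]):
            if 0 < second - first <= hitdist:
                pairs.append((diagonal, first, second))
    return pairs


seeds = [(10, 5), (30, 25), (100, 95), (50, 10)]
pairs = seed_pairs_within(seeds, 40)
```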

Scoring Options

Typhon incorporates Karlin-Altschul e-values as a means of assessing the significance of alignment scores. There are three e-values used. -E [double] gives the final e-value, above which alignments are not kept. -E1 [double] should probably not be changed, but will be discussed below. -gapE [double] determines which alignments are extended using full Smith-Waterman. Only alignments more significant than this threshold will be extended with gaps. -gapE has nothing to do with whether alignments are kept. Note: this will be ignored unless -gap is specified on the command line. If you like, -S [uint], -S1 [uint], and -gapS [uint] do exactly what -E, -E1 and -gapE do (and override whatever you set for those values), but are specified in units of normalized bit scores.
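For reference, the standard Karlin-Altschul relations between a raw score, a normalized bit score, and an e-value can be written as a short Python sketch. The values of lambda and K below are placeholders chosen for illustration, not Typhon's parameters, and whether Typhon's -S options use exactly this normalization is an assumption.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance hits: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """Same quantity from the bit score: E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Placeholder parameters (hypothetical): query length m, database length n.
LAM, K, M, N = 0.0176, 0.1, 1000, 1000000
bits = bit_score(3000, LAM, K)
```

The two e-value functions agree, which is why a threshold can be given either as an e-value or as a bit score.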

The significance of -E1 (-S1) is only apparent when considering the ungapped extension phase of alignment. Typhon uses two rounds of extension. The first stops when the score falls -X1 [uint] below the maximum score seen. If alignments score higher than -S1, they are extended further. This second round stops when they fall -X2 [uint] below the best score seen; -X2 is typically larger than -X1.
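The two rounds can be sketched in Python as follows. This is a simplified illustration, not Typhon's code: it extends only to the right of a seed, it restarts the second round from the seed instead of continuing where the first stopped, and it stops as soon as the drop reaches the X value. All of those choices, and the function names, are assumptions.

```python
def extend_right(db, query, db_pos, q_pos, score_fn, x_drop):
    """Ungapped rightward extension with an X-drop stopping rule.

    Returns (best_score, length_at_best_score).
    """
    score = best = best_len = 0
    i = 0
    while db_pos + i < len(db) and q_pos + i < len(query):
        score += score_fn(db[db_pos + i], query[q_pos + i])
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= x_drop:
            break  # the score has fallen x_drop below the best seen
    return best, best_len


def two_round_extend(db, query, db_pos, q_pos, score_fn, x1, x2, s1):
    """First round with x1; extend again with x2 only if the score beats s1."""
    best, length = extend_right(db, query, db_pos, q_pos, score_fn, x1)
    if best > s1:
        best, length = extend_right(db, query, db_pos, q_pos, score_fn, x2)
    return best, length


def unit_score(a, b):
    """Toy scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


db, query = "AAAATTAAAAAA", "AAAACCAAAAAA"
```

With x1 = 2 the first round stops at the two mismatches; if that score passes s1, the larger x2 = 3 lets the second round bridge them.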

Alignments are kept if they score higher than -S, typically higher than -S1. All alignments making it past the second round will be printed, but they will be extended with gaps if they score higher than -gapS and -gap is specified. This phase ignores cells in the alignment matrix that have score more than -gapX [uint] below the best score seen.

The scoring matrix used is by default the HOXD scoring matrix, with a gap open penalty of 400 and gap extension penalty of 25. To specify your own, use -matrix [filename]. The format is shown below using the default parameters as a template:

#ACGT-
SubstitutionMatrix
  91 -114  -31 -123  -25
-114  100 -125  -31  -25
 -31 -125  100 -114  -25
-123  -31 -114   91  -25
 -25  -25  -25  -25    0
GapOpen -400
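A file in this layout can be read with a few lines of Python. The sketch below assumes that the values in each row are separated by whitespace and that rows appear in the order given on the '#' line; both points are assumptions about the file format, and parse_matrix is a hypothetical helper, not part of Typhon.

```python
def parse_matrix(text):
    """Parse a scoring matrix file into ({(row, col): score}, gap_open)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    alphabet = lines[0].lstrip("#")
    if lines[1] != "SubstitutionMatrix":
        raise ValueError("expected a SubstitutionMatrix line")
    size = len(alphabet)
    scores = {}
    for row_symbol, line in zip(alphabet, lines[2:2 + size]):
        values = [int(v) for v in line.split()]
        if len(values) != size:
            raise ValueError("wrong number of columns in row " + row_symbol)
        for col_symbol, value in zip(alphabet, values):
            scores[(row_symbol, col_symbol)] = value
    gap_open = None
    for line in lines[2 + size:]:
        key, value = line.split()
        if key == "GapOpen":
            gap_open = int(value)
    return scores, gap_open


default_text = """#ACGT-
SubstitutionMatrix
91 -114 -31 -123 -25
-114 100 -125 -31 -25
-31 -125 100 -114 -25
-123 -31 -114 91 -25
-25 -25 -25 -25 0
GapOpen -400
"""
scores, gap_open = parse_matrix(default_text)
```

The '-' row and column hold the gap extension penalty of 25 mentioned above, and the default matrix is symmetric.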

Output Options

By default output is written to standard out. To write to a file use -o [filename]. There are three output levels; shown below is a sample hit in each format:

Several output options may be useful if you are using Typhon as part of a global aligner. The parameters -startdb, -enddb, -startq, -endq [uint] use only subsets of the database and/or query while searching for alignments. The positions are base 1. For instance, to query only positions 100-200 of the database with positions 50-60 of the query, use -startdb 100 -enddb 200 -startq 50 -endq 60.
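When Typhon is driven from a script, base-1 positions are easy to get wrong. The Python sketch below converts a base-1 range to a slice and assembles the argument list for the example above. It assumes the end positions are inclusive and that options may follow the database and query arguments; the helper names and file names are hypothetical.

```python
def subrange(sequence, start, end):
    """Return positions start..end of `sequence`, base 1, end inclusive."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("range outside the sequence")
    return sequence[start - 1:end]


def build_command(database, query, startdb, enddb, startq, endq):
    """Argument list for a restricted search (option placement is assumed)."""
    return ["typhon", database, query,
            "-startdb", str(startdb), "-enddb", str(enddb),
            "-startq", str(startq), "-endq", str(endq)]


command = build_command("db.mfa", "query.mfa", 100, 200, 50, 60)
```

The resulting list can be handed to subprocess.run once the typhon executable is on the path.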

If you want ungapped alignments to be shorter than a threshold length, use the -mhl [uint] parameter to specify the maximum length of an ungapped alignment (or portion of a gapped alignment). Alignments longer than this value will be broken into shorter pieces satisfying the length restriction.
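The effect of -mhl can be pictured with a short Python sketch that cuts an interval into consecutive pieces no longer than the limit. How Typhon actually chooses the break points is not documented here, so the left-to-right cutting and the function name are assumptions.

```python
def split_alignment(start, end, max_len):
    """Split the base-1 inclusive interval [start, end] into pieces
    of at most `max_len` positions, cutting from left to right."""
    if max_len < 1:
        raise ValueError("max_len must be positive")
    pieces = []
    pos = start
    while pos <= end:
        piece_end = min(pos + max_len - 1, end)
        pieces.append((pos, piece_end))
        pos = piece_end + 1
    return pieces


pieces = split_alignment(1, 25, 10)
```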

Filtering Options

Typhon has two basic options for filtering repeats. First, very frequently occurring words in the index are discarded. Those whose number of occurrences exceeds the mean by more than a certain number of standard deviations are tossed; the number of deviations is specified with -devs [float]. Note: For very small databases this may end up ignoring everything. Use caution. To turn off filtering of this type, use -devs 0. Second, filtering of low-complexity regions in the query is handled using DUST. By default, the level is 15. To turn off DUST, use -nodust. To change the level of masking, use -dustl [uint].
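The word-frequency rule can be sketched in Python as follows. Words whose count exceeds the mean by more than the given number of standard deviations are dropped, and a value of 0 disables the filter as described above. Using the population standard deviation is an assumption, and the function name and word counts are made up for illustration.

```python
from statistics import mean, pstdev


def filter_frequent_words(counts, devs):
    """Drop words occurring more than mean + devs * stdev times.

    `counts` maps each word to its number of occurrences.
    """
    if devs == 0 or not counts:
        return dict(counts)  # filtering is off (-devs 0) or nothing to do
    values = list(counts.values())
    cutoff = mean(values) + devs * pstdev(values)
    return {word: c for word, c in counts.items() if c <= cutoff}


counts = {"AAAA": 100, "ACGT": 2, "CGTA": 3, "GTAC": 1, "TACG": 4}
kept = filter_frequent_words(counts, 1.5)
```

In this toy index only the highly repeated word "AAAA" lies above the cutoff.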

If you like, you can hard-mask repeats in your input files; any base given as 'N' will be ignored.

Performance Hints

You can limit memory use (somewhat) using -memopt.

To improve running times, try one or more of the following:

