Compare Prospector

Compare Prospector Inputs

User Information:

We store submissions on the server and process them in order. Please give us your full email address so that we can email the result back to you. Most of the time, you should get the result within a day. Currently we store a maximum of 20 submissions; the 21st submission will overwrite the 1st. To ensure that all users get correct and timely results, please submit only one job at a time, and make sure you have received the answer for your submission before submitting another. We thank you for your cooperation!

Input Sequences:

You can specify a file containing your sequences. Since CompareProspector uses a probability matrix to represent a motif, it is advantageous for the input to have between 20 and 400 sequences. With too few sequences, it is hard to characterize the motif with a probability matrix; with too many sequences (e.g. all 6000 or so yeast genes), it is hard for the motif to converge. Of course, neither would be a problem if the motif is very strong (very conserved consensus) or if the motif occurs very frequently. Currently, CompareProspector recognizes input in FASTA format:

1) FASTA: every sequence is written as follows (see example):

>sequence name [return] (the ">" is very important here)

sequence as one or more lines [return/s]

CompareProspector recognizes only DNA sequences, in either upper case (ATGC) or lower case (atgc). Other characters, such as N or U, are randomly replaced with A, T, G, or C. Each sequence must be less than 32766 bases long and the total input size must be less than 200KB.
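The limits above can be checked locally before submitting. The following is a minimal sketch, not the server's own validation: the function names are hypothetical, and "200KB" is assumed to mean 200 * 1024 bytes.

```python
VALID_BASES = set("ACGTacgt")
MAX_SEQ_LEN = 32766            # each sequence must be shorter than this
MAX_TOTAL_BYTES = 200 * 1024   # assumption: "200KB" read as 200 * 1024 bytes


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_input(text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if len(text.encode("utf-8")) >= MAX_TOTAL_BYTES:
        problems.append("total input size is not below 200KB")
    for name, seq in parse_fasta(text):
        if len(seq) >= MAX_SEQ_LEN:
            problems.append(f"{name}: {len(seq)} bases, not below {MAX_SEQ_LEN}")
        other = sorted(set(seq) - VALID_BASES)
        if other:
            # Not an error for the server, but these bases get randomized.
            problems.append(f"{name}: characters {other} would be randomly reassigned")
    return problems
```

Calling check_input on the contents of your sequence file returns one message per problem, so an empty list suggests the file is within the stated limits.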

Conservation of input sequences: please specify either an alignment file in mfa format or a window percent identity file:

For each sequence in the input sequence file, please identify its ortholog/homolog from other species. Commonly used species pairs are: human-mouse, C. elegans-C. briggsae, various yeast species, etc. Then align the orthologous/homologous sequences. You can use pairwise alignment methods (if there are two sequences) or multiple alignment methods (if there are more than two). Many such alignment methods are available, such as LAGAN and BLASTZ. See the instructions below on how to handle input sequences without identifiable orthologs/homologs.

Alignment file in mfa format:

If you use LAGAN, you can save the alignments in mfa format and use them as input. The mfa format can be generated by specifying '-mfa' when running lagan.pl, or converted from a binary alignment with bin2mfa in the utils directory. You will also need to specify the window size (default 20). We will calculate the window percent identity values for you.

Format of the alignment file: the genes in the mfa alignment file should be in the same order as genes in the input sequence file. If any of the genes in the input sequences does not have an ortholog, please put a line starting with '#' as the alignment for that sequence.

For example, if the sequences in your input sequence file are seq1, seq2, seq3, and seq4 (in this order), and seq2 does not have an ortholog from the other species, you should lay out the alignment file as:

alignment of seq1 and its orthologs in mfa format (please list seq1 rather than its ortholog(s) first)
#
alignment of seq3 and its orthologs in mfa format (please list seq3 rather than its ortholog(s) first)
alignment of seq4 and its orthologs in mfa format (please list seq4 rather than its ortholog(s) first)

Here is an example of the alignment file.
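If your per-gene alignments are already available as mfa text, the layout above can be assembled with a short script. This is only a sketch: build_alignment_file and the alignments dictionary are hypothetical names, and the check that the input sequence comes first is a loose substring match on the first header line.

```python
def build_alignment_file(seq_names, alignments):
    """Concatenate per-gene mfa alignments in input order, using '#' for gaps.

    seq_names  -- names in the same order as the input sequence file
    alignments -- dict mapping a name to its mfa alignment text (hypothetical
                  container; a missing or None entry means no ortholog)
    """
    parts = []
    for name in seq_names:
        mfa = alignments.get(name)
        if mfa is None:
            parts.append("#")  # placeholder line for a gene without an ortholog
            continue
        lines = mfa.strip().splitlines()
        if not lines or not lines[0].startswith(">"):
            raise ValueError(f"{name}: alignment does not start with a '>' header")
        if name not in lines[0]:
            # Loose check only: the input sequence should be listed first.
            raise ValueError(f"{name}: input sequence should be listed first")
        parts.append("\n".join(lines))
    return "\n".join(parts) + "\n"
```

With seq_names set to ["seq1", "seq2", "seq3", "seq4"] and no entry for seq2, the output contains a single '#' line between the seq1 and seq3 alignments.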

Input Window Percent Identity File:

If you prefer to use other alignment programs, you can also generate the window percent identity file on your own and use it as input. Please specify a file containing the percent identity values for each input sequence. Each line in the file corresponds to one sequence, with each number (between 0 and 1) giving the percent identity value of one nucleotide in that sequence. The numbers are tab-delimited. The sequences should be in the same order as in the input sequence file. If a sequence has no percent identity values (e.g., when no orthologs can be identified), please use a blank line in the input percent identity file.

see example
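As an illustration of this file layout, the sketch below turns a list of per-sequence values into tab-delimited lines, with a blank line for any sequence that has no values. The function name and the choice of two decimal places are assumptions, not requirements of the server.

```python
def format_identity_file(per_sequence_values):
    """Format window percent identity values, one input sequence per line.

    per_sequence_values -- list with one entry per input sequence, in input
                           order; each entry is a list of floats in [0, 1],
                           or None when the sequence has no values
    """
    lines = []
    for values in per_sequence_values:
        if values is None:
            lines.append("")  # blank line: no ortholog could be identified
            continue
        if any(v < 0 or v > 1 for v in values):
            raise ValueError("percent identity values must lie between 0 and 1")
        # Two decimal places is an arbitrary choice for this sketch.
        lines.append("\t".join(f"{v:.2f}" for v in values))
    return "\n".join(lines) + "\n"
```

For instance, format_identity_file([[1.0, 0.5], None, [0.25]]) produces three lines, the second of which is blank.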

The following is one way to calculate window percent identities:

First convert the alignment into a score for each nucleotide. For a pairwise alignment, a nucleotide scores 1 if it is the same in both sequences and 0 otherwise. For a multiple alignment, a nucleotide scores 1 if it is completely conserved in all sequences and 0 otherwise. Then the percent identity value of a nucleotide is calculated as the average score over a window of a certain size centered at that nucleotide. The window size we use is 20 bp.
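The two steps can be sketched in Python as follows. Several details are assumptions of this sketch rather than a description of the server: the input sequence is the first row of the alignment, gaps are written as '-', columns where the input sequence has a gap are skipped, an even-sized window extends window // 2 positions to the left, and windows are truncated at the sequence ends.

```python
def identity_scores(aligned):
    """Per-nucleotide 0/1 scores for the first (input) row of an alignment.

    aligned -- list of equal-length aligned strings, input sequence first
    """
    ref, others = aligned[0], aligned[1:]
    if any(len(row) != len(ref) for row in others):
        raise ValueError("aligned rows must have equal length")
    scores = []
    for col, base in enumerate(ref):
        if base == "-":
            continue  # gap in the input sequence: no nucleotide to score
        # 1 only if every other row carries the same base in this column.
        same = all(row[col].upper() == base.upper() for row in others)
        scores.append(1 if same else 0)
    return scores


def window_percent_identity(scores, window=20):
    """Average the scores over a window placed around each nucleotide."""
    half = window // 2
    values = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i - half + window)  # truncated at the ends
        values.append(sum(scores[lo:hi]) / (hi - lo))
    return values
```

The same identity_scores function covers both cases: with two rows it is the pairwise rule, and with more rows a column scores 1 only when it is completely conserved.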

Window Percent Identity Thresholds:

Please specify a high and a low window percent identity threshold. Both thresholds should be > 0 and <= 1. In our paper, we used (0.8, 0.5) as (high, low) thresholds for human-mouse comparisons, and (0.5, 0.3) as (high, low) thresholds for C. elegans-C. briggsae comparisons.

Motif Model:

CompareProspector searches for one-block motifs. You should specify the width of the motif. Motif widths are usually 8 to 15 bp.

Please also specify whether the motif occurs in each of the input sequences, and whether both the forward and reverse complement strands should be searched.
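For readers unfamiliar with the term, the reverse complement strand is the sequence read backwards with each base swapped for its pairing partner. A small illustration (the function name is ours, not part of CompareProspector):

```python
# Translation table mapping each base to its pairing partner, case preserved.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

A motif site that reads AACG on the forward strand therefore appears as CGTT on the reverse complement strand.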

Background Model:

A good background model greatly improves the specificity of CompareProspector. Use the input sequences as the background model only if the motif signal is very strong, or if you have no idea or data about the correct background.

If you have a dataset of sequences that you think represents the background (non-motif) very well, please specify a background sequence file. The program automatically detects sequences in FASTA format (example) (check input sequence format). The background Markov dependency order is determined by the size of your background sequences, so the larger your background sequence set, the better the background is characterized. However, considering the server hard drive space and the time needed to transfer big files over the Internet, we limit the background file size to less than 400 KB.
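To make the idea of a Markov background concrete, the sketch below estimates the probability of each base given the preceding bases from a set of background sequences. It is an illustration only: how CompareProspector chooses the dependency order from the background size is not described here, so the order is passed in explicitly, and no pseudocounts or smoothing are applied.

```python
from collections import defaultdict


def markov_background(sequences, order):
    """Estimate P(base | previous `order` bases) from background sequences.

    Returns a dict mapping each observed context string to a dict of
    base -> probability. Contexts containing non-ACGT characters are skipped.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if set(context + base) <= set("ACGT"):
                counts[context][base] += 1
    model = {}
    for context, nexts in counts.items():
        total = sum(nexts.values())
        model[context] = {b: n / total for b, n in nexts.items()}
    return model
```

With order 0 the single context is the empty string and the model reduces to plain base frequencies. Higher orders need more data because the number of possible contexts grows as 4 to the power of the order, which is why a larger background set supports a higher order.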

We have precomputed the background for several genomes. If you have another genome sequence file larger than 400KB that you want to use as background, send it to iliu@smi.stanford.edu as an email attachment so that we can add this genome to our selection of precomputed backgrounds.

Motif Search Strategy:

CompareProspector tries to find the motif a number of times and reports the motifs with the highest scores. You can specify the number of top motifs you want to get. Sometimes the top motifs you get are similar. This is an indication that the motif is quite strong (conserved and abundant in the data). However, sometimes users want to see what other motifs are present besides the strong one. We are in the process of implementing this feature. Until it is ready, you can manually delete the strong motif segments from your sequences and run the motif search again.

Enjoy the search!