Input Format

BioProspector Input

User Information:

We store submissions on the server and process them in order of job ID. Please give us your full email address so we can email the result back to you. You should receive the result within a day. To ensure that all users get correct and timely results, the server accepts a job submission only if: 1) you have received an email from your previous submission; or 2) you have waited a day since your last submission. We thank you for your cooperation!

Run Information:

You can give your run a name to remind you of the sequences and parameters you used for the run. This information will be included in the result email to you.

Input Sequences

You can specify a file containing your sequences, or paste your sequences into the text field if the data set is relatively small. Since BioProspector uses a probability matrix to represent a motif, it is advantageous for the input to contain between 20 and 400 sequences. With too few sequences (e.g. 3-5), it is hard to characterize the motif with a probability matrix; with too many sequences (e.g. all 6000 or so yeast genes), it is hard for the motif to converge. Of course, these would not be problems if the motif is very strong (a highly conserved consensus) or if the motif occurs very frequently.
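To illustrate why very few sequences make a probability matrix hard to estimate, here is a minimal sketch of building one from aligned motif sites. It is not BioProspector's own code; the function name `probability_matrix` and the `pseudocount` parameter are assumptions made for this example, and the sites are assumed to contain only A, C, G, and T.

```python
def probability_matrix(sites, pseudocount=0.5):
    """Build a per-position base probability matrix from equal-width sites.

    Hypothetical helper for illustration; assumes sites use only ACGT/acgt.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    matrix = []
    for col in range(width):
        # Start every base at the pseudocount so unseen bases are not zero.
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[col].upper()] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in "ACGT"})
    return matrix
```

With only three sites, each column is estimated from three observations, so a single differing base shifts a probability by a third; with dozens of sites the estimates become much more stable.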

Currently, BioProspector accepts FASTA format (example). BioProspector recognizes only DNA sequences, in either the ACGT or acgt alphabet. Other bases such as N or U are randomly assigned to A, C, G, or T. Each sequence must be less than 32766 bases long. The web server accepts a maximum of 200 sequences with a total size of 100 KB, and a minimum of 5 sequences with a total size of 200 bp. For other input sizes, please download a copy of the program and run it locally.
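If you want to check your input against these limits before submitting, a minimal sketch follows. It is not the server's own validation code: the function names are hypothetical, and it assumes that "100 KB" means 100 * 1024 sequence characters and that the 200 bp minimum applies to the combined length of all sequences.

```python
import random

MAX_SEQ_LEN = 32766     # each sequence must be shorter than this
MIN_SEQS, MAX_SEQS = 5, 200
MIN_TOTAL = 200         # bp, assumed to be the combined length
MAX_TOTAL = 100 * 1024  # assumption: "100 KB" counted in sequence characters

def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def resolve_ambiguous(seq, rng=random):
    """Replace every character outside ACGT/acgt with a random base."""
    return "".join(c if c in "ACGTacgt" else rng.choice("ACGT") for c in seq)

def check_web_limits(records):
    """Return a list of problems; an empty list means the input fits."""
    problems = []
    total = sum(len(seq) for _, seq in records)
    if not MIN_SEQS <= len(records) <= MAX_SEQS:
        problems.append(f"{len(records)} sequences (need {MIN_SEQS}-{MAX_SEQS})")
    if not MIN_TOTAL <= total <= MAX_TOTAL:
        problems.append(f"total size {total} (need {MIN_TOTAL}-{MAX_TOTAL})")
    for name, seq in records:
        if len(seq) >= MAX_SEQ_LEN:
            problems.append(f"{name}: {len(seq)} bases (must be < {MAX_SEQ_LEN})")
    return problems
```

Typical use would be `check_web_limits(parse_fasta(open("my_seqs.fa").read()))`, where `my_seqs.fa` is a placeholder file name.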

Motif Model:

BioProspector has 3 motif models:

1) One-block motifs require one motif width
2) Two-block motifs require first motif block width, second motif block width, the estimated minimum gap between the two blocks (from the end of the first block to the beginning of the second block), and the estimated maximum gap between the two blocks.
3) Palindrome motifs (a special case of two-block motifs) require only one motif width (since the second motif block has the same width as the first motif block), plus the estimated minimum and maximum gap between the two blocks. An example of a palindrome motif is:
Forward strand ATGACA ...gap... TGTCAT --->
Reverse complement strand <--- TACTGT ...gap... ACAGTA
If your motif is like ATCAGCTGAT, then specify motif block width to be 5 and both max and min gap to be 0.
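The palindrome relationship above can be checked with a short sketch: the second block must be the reverse complement of the first. The helper names here are hypothetical and are not part of BioProspector.

```python
# Translation table mapping each base to its complement, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindrome_pair(block1, block2):
    """True if block2 is the reverse complement of block1 (case-insensitive)."""
    return block2.upper() == reverse_complement(block1).upper()

def split_gapless_palindrome(motif):
    """Split an even-length motif with gap 0 into its two equal-width blocks."""
    half, remainder = divmod(len(motif), 2)
    if remainder:
        raise ValueError("a gapless palindrome motif must have even length")
    return motif[:half], motif[half:]
```

For ATCAGCTGAT, `split_gapless_palindrome` gives two blocks of width 5 (ATCAG and CTGAT), and `is_palindrome_pair` confirms that they form a palindrome with a gap of 0.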

Since BioProspector samples two-block motif alignments (i.e., the starting positions of the two motif blocks in each sequence) from their joint distribution, it is advantageous to specify a relatively small gap range (= max gap - min gap). It is OK to set max gap to 40, as long as min gap is also fairly large. But a run with max gap = 40 and min gap = 0 would dilute the information too much and would not result in a very good motif. Currently the server allows a two-block (including palindrome) motif search only if max gap - min gap <= 15.
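As a rough illustration of why a wide gap range dilutes the signal, the sketch below checks the server's gap-range rule and counts how many two-block placements are possible in a single sequence. This is a simple counting argument, not BioProspector's actual sampler, and the function names are assumptions made for this example.

```python
MAX_GAP_RANGE = 15  # server limit on (max gap - min gap)

def gap_range_allowed(min_gap, max_gap):
    """True if the gap range satisfies the web server's limit."""
    if min_gap < 0 or max_gap < min_gap:
        raise ValueError("need 0 <= min_gap <= max_gap")
    return max_gap - min_gap <= MAX_GAP_RANGE

def count_two_block_placements(seq_len, width1, width2, min_gap, max_gap):
    """Count (start position, gap) combinations that fit in one sequence."""
    return sum(
        max(0, seq_len - (width1 + gap + width2) + 1)
        for gap in range(min_gap, max_gap + 1)
    )
```

For a 100-base sequence and two blocks of width 6, a fixed gap of 0 allows 89 placements, while each additional permitted gap value adds nearly as many again, so the true alignment must compete with many more alternatives.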

Please also specify whether the motif occurs in each of the input sequences, and whether we need to search both the forward and reverse complement strands.

Background Model:

A good background model greatly improves the specificity of BioProspector. Use the input sequences as the background model only if the motif signal is very strong, or if you have no idea or data about the correct background.

If you have a dataset of sequences that you think represents the background (non-motif) very well, then either paste the sequences into the text field or specify a background sequence file. The server recognizes sequences in FASTA (example) format. The background Markov dependency order is determined by the size of your background sequences, so the larger your background sequence set, the better you characterize the background. However, considering the server's hard drive space and the time needed to transfer big files over the Internet, we limit the background file size to less than 200 KB.
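To make the link between background size and Markov order concrete, here is a minimal sketch of estimating a Markov background from sequences. It is not the server's implementation: the rule BioProspector uses to choose the order is not stated here, so `pick_order` and its thresholds are a purely hypothetical heuristic, and `markov_background` with its `pseudocount` parameter is likewise an assumption.

```python
from collections import Counter, defaultdict

def markov_background(seqs, order, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from sequences."""
    counts = defaultdict(Counter)
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows that contain anything other than A, C, G, T.
            if set(context + base) <= set("ACGT"):
                counts[context][base] += 1
    model = {}
    for context, seen in counts.items():
        total = sum(seen.values()) + 4 * pseudocount
        model[context] = {b: (seen[b] + pseudocount) / total for b in "ACGT"}
    return model

def pick_order(total_bases, max_order=3, min_obs_per_param=50):
    """Hypothetical heuristic: raise the order while data stays plentiful.

    An order-k model has 4 ** (k + 1) probabilities to estimate.
    """
    order = 0
    while order < max_order and total_bases >= min_obs_per_param * 4 ** (order + 2):
        order += 1
    return order
```

Under this illustrative rule, a higher-order model is used only when the background set is large enough to estimate its many more parameters, which is the general reason a larger background set characterizes the background better.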

We have precomputed the background for several genomes. If your input sequences are from yeast, using the corresponding precomputed background could speed up the computation and give you better results. If you have another large genome sequence file (larger than 200 KB) that you want to use as background, send the file to xsliu@jimmy.harvard.edu as an email attachment so we can add this genome to our selection of precomputed backgrounds.

Motif Search Strategy:

BioProspector tries to find the motif a number of times and reports the motifs with the highest scores. You can specify the number of top motifs you want to get. Sometimes the top motifs you get are similar. This is an indication that the motif is quite strong (conserved and abundant in the data). However, sometimes users want to see what other motifs are present besides the strong one. We are in the process of implementing this feature. Until it is ready, you can manually delete the strong motif segments from your sequences and perform the motif search again.
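If you prefer to script the manual deletion step, a minimal sketch is shown below. The function name `delete_sites` and the (start, width) site format with 0-based start positions are assumptions made for this example; adapt them to however you record the sites from your result email.

```python
def delete_sites(seq, sites):
    """Remove each (start, width) segment from seq; starts are 0-based.

    Hypothetical helper; overlapping segments are handled by collecting
    every position to drop before rebuilding the sequence.
    """
    drop = set()
    for start, width in sites:
        drop.update(range(start, start + width))
    return "".join(base for i, base in enumerate(seq) if i not in drop)
```

For example, removing a 5-base site that starts at position 3 of AAATGACAGGG leaves AAAGGG, which can then be resubmitted for another motif search.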

Enjoy the search!