Input Format

MDscan Input

User Information:

We store the submissions on the server and process them in order of job ID. Please give us your full email address so that we can email the result back to you. You should receive the result within a day. To ensure that all users get correct and timely results, the server accepts a job submission only if: 1) you have received the result email for your previous submission; or 2) you have waited a day since your last submission. We thank you for your cooperation!

Run Information:

You can give your run a name to remind you of the sequences and parameters you used for the run. This information will be included in the result email to you.

Input Sequences:

You can specify a file containing your sequences, or paste your sequences into the text field if the data set is relatively small. Since MDscan uses a probability matrix to represent a motif, it is advantageous for the input to have between 20 and 400 sequences. With too few sequences (e.g. 3-5), you won't be able to specify the top sequences, and it is hard to characterize the motif with a probability matrix; with too many sequences (e.g. all 6000 or so yeast genes), it is hard for the motif to converge.

Currently, BioProspector accepts FASTA format (example). BioProspector recognizes only DNA sequences, in either the ACGT or the acgt alphabet. Other characters such as N or U are ignored. Each sequence must be shorter than 32766 bases. The web server accepts a maximum of 200 sequences with a total size of 100 KB, and a minimum of 5 sequences with a total size of 200 bp. For other input sizes, please download a copy and run the program locally.
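The limits above can be checked locally before submitting. The following is only a minimal sketch, not the server's own validation: the function names are hypothetical, and it assumes that "100KB" means 100 * 1024 bytes of pasted text, that the 200 bp minimum counts only A/C/G/T characters, and that the per-sequence length limit applies to the raw sequence string.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_input(text, min_seqs=5, max_seqs=200, min_total_bp=200,
                max_bytes=100 * 1024, max_seq_len=32765):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    records = parse_fasta(text)
    if len(records) < min_seqs:
        problems.append(f"only {len(records)} sequences, need at least {min_seqs}")
    if len(records) > max_seqs:
        problems.append(f"{len(records)} sequences, at most {max_seqs} allowed")
    if len(text.encode("utf-8")) > max_bytes:
        problems.append(f"input is larger than {max_bytes} bytes")
    # Assumption: only A/C/G/T (either case) count towards the base total,
    # because other characters are ignored.
    total_bp = sum(sum(ch in "ACGTacgt" for ch in seq) for _, seq in records)
    if total_bp < min_total_bp:
        problems.append(f"only {total_bp} bp in total, need at least {min_total_bp}")
    for header, seq in records:
        if len(seq) > max_seq_len:
            problems.append(f"sequence '{header}' is longer than {max_seq_len} bases")
    return problems
```

An empty result means the input passed these particular checks; the server may still apply checks that are not described here.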

The MDscan algorithm fails when there are non-functional simple repeats in the input sequences, e.g. AAAAAAAAAAAAA or CACACACACACACACACAC. We suggest that you run a Repeat Masker program to remove these simple repeats before looking for motifs with MDscan.
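To get a feel for what masking does, here is a rough sketch that replaces runs of a repeated 1- or 2-base unit with N. It is not a substitute for a real Repeat Masker program: the function name, the choice of N as the masking character, and the 10-base threshold are all assumptions made for illustration.

```python
import re

# A unit of one or two bases followed by at least one more copy of itself.
# The lazy {1,2}? tries the one-base unit first, so poly-A runs match whole.
_SIMPLE_REPEAT = re.compile(r"([ACGTacgt]{1,2}?)\1+")


def mask_simple_repeats(seq, min_len=10):
    """Replace mono- and dinucleotide runs of at least min_len bases with N."""
    def _mask(match):
        run = match.group(0)
        # Short runs (e.g. "TT") are returned unchanged.
        return "N" * len(run) if len(run) >= min_len else run
    return _SIMPLE_REPEAT.sub(_mask, seq)


masked = mask_simple_repeats("GCT" + "A" * 13 + "GCT")
```

Masking with N keeps the sequence coordinates unchanged, which makes it easier to map reported motif positions back to the original sequence. A trailing partial unit (such as the final C in CACAC) is left unmasked by this simple pattern.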

Background Model:

A good background model greatly improves the specificity of MDscan. Use the input sequences as the background model only if the motif signal is very strong, or if you have no idea or data about the correct background.

If you have a dataset of sequences that you think represents the background (non-motif) very well, either paste the sequences into the text field or specify a background sequence file. The server recognizes sequences in FASTA (example) format. The background Markov dependency order is determined by the size of your background sequences, so the larger your background sequence set, the better you characterize the background. However, considering the server's hard drive space and the time needed to transfer big files over the Internet, we limit the background file size to less than 200 KB.
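As an illustration of what a Markov background model is, the sketch below estimates the probability of each base given the preceding bases from a set of background sequences. It is a generic estimator under stated assumptions, not the server's implementation: the function name and the pseudocount are hypothetical, and the order is passed in explicitly because the rule the server uses to pick the order from the data size is not described here.

```python
from collections import defaultdict


def markov_background(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from background sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing anything other than A, C, G, T.
            if set(context + base) <= set("ACGT"):
                counts[context][base] += 1
    model = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values()) + 4 * pseudocount
        model[context] = {
            base: (count + pseudocount) / total
            for base, count in base_counts.items()
        }
    return model


model = markov_background(["ACGTACGTACGT"], order=1)
```

A higher order has 4 times as many contexts per extra base, so it needs correspondingly more background sequence to be estimated reliably, which is why a larger background set allows a better model.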

We have precomputed the background for several genomes. If your input sequences are from yeast, using the precomputed background could speed up the computation and give you better results. If you have another big genome sequence file larger than 200 KB that you want to use as background, send the file to xsliu@jimmy.harvard.edu as an email attachment so that we can include this genome in our selection of precomputed backgrounds.

Motif Model:

You need to provide an estimate of the width of the motif.

MDscan first searches for similar words in the top sequences, i.e. the sequences in which you are most confident that the motif occurs abundantly. Only a limited number of good candidate motifs are kept; these are then updated and refined using the whole set of input sequences. The best few final motifs are reported to the user.

Please specify how many top sequences you are confident contain the motif abundantly, how many good candidate motifs to keep for the refinement step, and how many final top motifs you want to see.
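The role of the motif width and the two candidate parameters can be illustrated with a toy version of the first step only. This is a simplified sketch and not the MDscan algorithm itself: it ranks each word in the top sequences by how many similar words it has, whereas MDscan uses its own scoring function and then refines the candidates against all input sequences. The function names and the max_mismatch parameter are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def seed_candidates(top_seqs, width, max_mismatch, n_candidates):
    """Rank words of the given width by how many similar words they have."""
    # Every window of the motif width in the top sequences is a word.
    words = [s[i:i + width] for s in top_seqs
             for i in range(len(s) - width + 1)]
    scored = {}
    for seed in set(words):
        # Toy score: count of words within max_mismatch mismatches of the seed.
        scored[seed] = sum(hamming(seed, w) <= max_mismatch for w in words)
    # Highest count first; ties are broken alphabetically for determinism.
    ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n_candidates]


top = ["CCGATTACACC", "GGGATTACAGG", "TTGATTACATT"]
best = seed_candidates(top, width=7, max_mismatch=0, n_candidates=1)
```

Here the word shared by all three top sequences is ranked first. The cost of this step grows with the square of the number of words, which is one reason to restrict it to a small set of top sequences.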

Enjoy the search!