Research

My research area is computational genomics. The broad goal of my research is to develop efficient and accurate methodologies for the analysis of genomic data.

In recent years, genomics has rapidly become one of the main ways in which to study biology. Key milestones along this development were the draft release of the human genome in 2000, the subsequent sequencing of tens of vertebrate genomes including chimpanzee, mouse, rat, dog, and chicken, with which the human genome can be compared, sequencing of tens of fruitfly, yeast, plant, and hundreds of bacterial species, and the availability of global datasets on the expression and interactions of genes in human and key laboratory organisms. To enable the analysis of such genomic data, drawing from string algorithms, combinatorial optimization, machine learning, and efficient software development, we build methods and systems for a range of bioinformatics problems: sequence assembly, whole-genome multiple alignment, protein sequence alignment, RNA structure prediction, gene finding, motif finding, protein association network  alignment, and population ancestry inference on genotype data. All our tools are publicly available as source code and executables.

Genomic Sequence Assembly

Whole-Genome Assembly of Pyrosequencing Reads.  Today there is a significant push by both academia and industry to develop new technologies for genomic sequencing. Such technologies ideally should be cheaper than Sanger sequencing by orders of magnitude. That would enable determination of the DNA of each individual for medical applications, or to sequence a very large number of animals for tracing the evolutionary path of every functional element in our genomes, or to comprehensively sequence environmental samples so as to see how microorganisms in environments such as soil or within our bodies, respond to changes in conditions and evolve over time. Our group collaborates with Mostafa Ronaghi at the Stanford Genome Technology Center; Ronaghi's group develops a pyrosequencing-based technology for genomic sequencing that promises orders-of-magnitude cheaper sequencing, and we develop sequencing protocols and algorithms for de novo fragment assembly of data that come from pyrosequencing. We developed SHRAP [2], a method for assembling complex genomes using short DNA fragments, and demonstrated with simulations that assembling a whole mammalian genome with such fragments is feasible.

Gene Finding

Once a genome is available, one of the first analysis tasks is to annotate protein-coding genes. We developed CONTRAST, the first system for vertebrate gene prediction that successfully uses multiple alignments to achieve greater accuracy than methods based on two-species alignments [16]. CONTRAST predicts 65% more exact genes than the previous state-of-the-art method, misses 46% fewer exons, and displays comparable gains in specificity. Methodologically CONTRAST employs SVMs for scoring various coding region boundaries, a conditional random field (CRF) for integrating all features of gene structure, and a novel objective function for training, based on maximizing the accuracy of detected coding region boundaries. We applied CONTRAST in the context of the Drosophila Comparative Sequencing and Analysis Consortium to contribute whole-genome gene predictions for 12 Drosophila genomes [17]. In related methodological work we developed the maximum parse accuracy procedure for training CRFs [18] as an alternative to max-margin methods and are planning to employ it for gene finding.

Comparative Genomics

Genomic Alignment Methods. We have been developing algorithms and systems for aligning DNA sequences. LAGAN is a global pairwise aligner; MLAGAN is a multiple aligner [3], and SLAGAN [13] is a tool that simultaneously aligns and finds microrearrangements. Our systems are available at http://lagan.stanford.edu.  See also TreeRefiner, a tool for refining multiple sequence alignments.

We have applied our tools to test the power of comparative sequence analyses [4, 5] prior to the sequencing of mammalian genomes that is now undergoing, in order to help establish a set of most informative mammals to sequence. We aligned the whole human, mouse, and rat genomes as part of the International Rat Sequencing Consortium [6, 7], three fungal genomes [8] as part of the Aspergillus consortium, and 1% of the human genome with 23 mammals as part of the ENCODE consortium [9, 10, 11].

Whole Genome Alignments. In collaboration with Inna Dubchak's group, and with Arend Sidow's group, we aligned the entire human, mouse, and rat genomes, and are studying their evolution.  These alignments are accessible through the VISTA Browser gateway. Inna Dubchak also provides visualizations of our ENCODE alignments. Currently we are developing a new tool, Revolver, a new tool for automated whole-genome multiple mammalian alignments, and Evolver, a simulator of mammalian genome evolution.

Multiple Local Alignment. More and more, genomic databases will consist of multiple alignments of related species (mammals, flies, yeasts) rather than individual genomes. Therefore, one may wish to search a query sequence against a database of alignments. As it turns out, BLAST-style local alignment can be done with better speed and sensitivity in this setting. Typhon [14] is a new tool that achieves that.

Protein Alignments. Protein multiple alignment is challenging, especially at the "twilight" zone of low sequence identity. We developed ProbCons [19, 20], a tool that exhibits high accuracy and practical running times in benchmark datasets. ProbCons is based on a hidden Markov model, and on a probabilistic method we call probabilistic consistency that leverages comparisons to all sequences of the set, during progressive alignment of two (multi-)sequences. ProbCons also uses maximum expected accuracy, rather than Viterbi, when computing alignments. ProbCons is available to run or download at http://probcons.stanford.edu.  The work was presented as an abstract to ISMB/ECCB 2004, and received the Best Paper Award together with Muscle, which is a really scalable and accurate protein aligner.

Subsequently, we developed an improved protein aligner, CONTRAlign [21], which combined the ProbCons algorithm with the first discriminative training framework for protein alignment, to provide further accuracy gains without loss of efficiency. CONTRAlign can use a diverse set of features such as hydrophobicity, solvent accessibility, and secondary structure information; users can specify feature sets and train their own alignment models easily and rigorously using CONTRAlign.

Protein Alignment with Repeats and Rearrangements.  We developed ProDA [22], a multiple alignment program for proteins that have different domain architectures and therefore their homology may exhibit repeats and rearrangements.

RNA Structure Prediction

Together with proteins, RNA structures are the key biomolecules that perform regulatory, catalytic, structural and other biological functions in a cell. RNA structure is determined primarily by the base pairing of nucleotides in the RNA sequence, and computational methods have been among the main ways to predict RNA structure. For several decades, free-energy minimization methods have been the dominant computational strategy. More recently, stochastic context-free grammars (SCFGs) have emerged as an alternative probabilistic method, enabling fully automated statistical learning algorithms to derive model parameters. However, energy minimization methods such as Mfold have clearly dominated probabilistic methods in predicting structures accurately. We developed CONTRAfold [23], a secondary structure prediction method based on conditional log-linear models (CLLMs), which is the first method enabling automated parameter learning to outperform energy minimization methods, and in particular Mfold which had been the best method for almost two decades. Our result demonstrates that statistical learning procedures provide an effective alternative to laborious experimental measurements of thermodynamic parameters for RNA secondary structure prediction.

In earlier work, we defined a multiple RNA structure alignment model of theoretical interest, proved that the computational problem of structural alignment is APX-hard, and provided an O(log2N) approximation algorithm [24]. Returning to multiple RNA structural alignment, we recently developed CONTRAfold-multi, which is a practical method of the highest accuracy to-date for aligning multiple RNA structures. This method is based on our CLLM algorithm for RNA folding [23] in combination with our probabilistic consistency algorithm [21] for sequence alignment.

Motif Finding

Regulatory motifs are short DNA sequences that reside in the vicinity of genes, bind transcription factors, and control the binding of RNA polymerase and consequent transcription of a gene; therefore motifs control the amount and timing of gene expression. The computational problems of modeling motifs, finding over-represented motifs in genomic regions, and scanning a genome for specific motifs, have been studied extensively. The most popular motif models are based on the position-specific scoring matrix (PSSM) model and its variations: a motif is a probabilistic word m = m1…mk, where each letter mi is a probability distribution over {A, C, G, T}. Pairwise letter dependencies and more complex models have also been tried. Learning richer models, however, requires large numbers of motif instances. The most widely used motif finding methods search for a PSSM with either Expectation Maximization or Gibbs Sampling.

We developed CompareProspector [25, 26], the first Gibbs sampler for motif finding that successfully leveraged cross-species comparisons during the search procedure. Next, we departed from the PSSM framework and modeled a motif as simply a bag, or multiset, consisting of all known occurrences of the motif. This motif model grows with additional known instances, and sidesteps the difficult task of parameterizing complex motif substructure. We incorporated this model in MotifScan [27], a method for searching a genome for a known motif, and showed that our model outperforms PSSMs. Motivated by this new model, we developed MotifCut [28], a method that translates motif finding into finding maximum-density subgraphs in a graph that represents the input sequences. This is a convex optimization problem; with heuristic optimizations we solve it in practice in sub-quadratic time. MotifCut is practical and significantly outperforms previous state-of-the-art methods. Moreover, because our methodology is fundamentally different than Gibbs samplers and EM-based methods, MotifCut often finds motifs where all other methods fail, and complements the existing motif finding toolbox.

Protein Association Networks

Today we have the complete genomes of almost 500 microbes and tens of mammals and other organisms. The first steps in learning biology from these genomes are finding genes and regulatory elements, and comparing genomes and proteins across species. An important next step is to discover which groups of proteins participate in the same functional processes. In protein association networks, proteins are nodes and edges connect pairs of proteins that are involved in the same functional process, such as a metabolic pathway or multiprotein complex. A new and rapidly growing area of biocomputation involves integrating such networks from diverse systems-biology and sequence data (network integration), and comparing networks across species to reveal common functional processes (network alignment). We developed a novel network integration algorithm, SRINI [29], which we recently applied to all sequenced microbes to obtain networks available online at http://networks.stanford.edu [30]. To compare networks across multiple species, we developed a network alignment algorithm, Graemlin [31], which is the first algorithm capable of aligning more than 3 networks, and provides orders-of-magnitude speedup and significantly better accuracy than previous approaches for network alignment.

Population Genetics

Today we are experiencing a revolution in our ability to inexpensively obtain personalized genomic information, in the form of genotyping chips or whole-genome resequencing. Challenging computational problems are emerging, with enormous potential impact in medicine and in understanding our genetic heritage. We recently addressed the problem of ancestry reconstruction. The genome of an individual is a mosaic of chromosomal blocks, each following the statistical properties of variations seen in the populations to which the individual’s ancestors belong. We developed HAPAA [32], a novel method for inferring the population ancestry of each chromosomal block, given the genotype of an individual.  HAPAA employs a non-parametric representation of allelic variation in the source populations, which in contrast to previous methods, captures the signal of linkage disequilibrium. As a result, HAPAA is substantially more accurate, and finds the source populations of chromosomal blocks contributed to an individual’s genome by ancestors that existed 20 or more generations ago.

Microarrays

We explored the application of Independent Component Analysis (ICA) to microarrays. ICA provides a model of underlying biological processes within a cell. Based on the ICA model, genes can be clustered according to their loads in the derived components. This clustering technique demonstrated remarkable performance in a variety of datasets. Our software and results can be found at http://www.stanford.edu/~silee/ICA/. We are also working on methods that combine comparative genomics and microarrays to computationally find motifs that are putative sites of gene cis-regulation.

 

References

1.       Batzoglou S, Jaffe D, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES.  ARACHNE: A whole genome shotgun assembler. Genome Research 12:177-189, 2002.

2.       Sundquist A, Ronaghi M, Tang H, Pevzner P, Batzoglou S. Whole-genome sequencing and assembly with high-throughput short-read technologies.  PLOS One 2(5): e484, 2007.

3.       Brudno M, et al.  LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Research 13: 721-731, 2003. http://lagan.stanford.edu.

4.       Cooper GM, et al.  Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Research 13:813-820, 2003.

5.       Cooper GM, Brudno M, Stone ES, Dubchak I, Batzoglou S, Sidow A.  Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Research 14:539–548, 2004.

6.       Rat Genome Sequencing Project Consortium (RGSPC).  Genome sequence of the Brown Norway Rat yields insights into mammalian evolution. Nature 428:493–521, 2004.

7.       Brudno M, et al.  Automated whole-genome multiple alignment of Rat, Mouse, and Human. Genome Research 14:685–692, 2004.

8.       Galagan JE, et al.  Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzaeNature, 438: 1105–1115, 2005.

9.       The ENCODE Project Consortium. The ENCODE Project. Science 306:636-640, 2004.

10.    The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799-816, 2007.

11.    Margulies et al. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Research, 17(6): 760-774, 2007.

12.    Brudno M, Malde S, Poliakov A, Do C, Couronne O, Dubchak I, Batzoglou S.  Glocal alignment: finding rearrangements during alignment. Proceedings of the ISMB 2003, Bioinformatics 19: 54i-62i, 2003.

13.    Sundararajan M, Brudno M, Small K, Sidow A, Batzoglou S.  Chaining algorithms for alignment of draft sequence.  Proceedings of the 4th Workshop on Algorithms in Bioinformatics (WABI2004).

14.    Flannick J, Batzoglou S.  Using multiple alignments to improve seeded local alignment algorithms.  Nucleic Acids Research, 33(14): 4563–4577, 2005. http://typhon.stanford.edu.

15.    Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES.  Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Research 10:950-958, 2000.

16.    Gross SS, Do CB, Sirota M, Batzoglou S. CONTRAST: A discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biology, in press. http://contra.stanford.edu/contrast.

17.    Drosophila Comparative Genome Sequencing and Analysis Consortium. Evolution of genes and genomes in the context of the Drosophila phylogeny. Nature, in press.

18.    Gross SS, Russakovsky O, Do CB, Batzoglou S.  Training conditional random fields for maximum parse accuracy.  NIPS 2006.

19.    Do CB, Brudno M, Batzoglou S.  ProbCons: Probabilistic consistency-based multiple alignment of amino acid sequences. Best Paper Award, ISMB/ECCB 2004. http://probcons.stanford.edu.

20.    Do CB, Mahabhashyam MS, Brudno M, Batzoglou S.  ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Research 15:330–340, 2005.

21.    Do CB, Gross SS, Batzoglou S. CONTRAlign: Discriminative Training for Protein Sequence Alignment.  Proceedings of the Tenth Annual International Conference on Computational Molecular Biology, (RECOMB 2006), pp. 160–164. http://contra.stanford.edu/contralign.

22.    Phuong TM, Do CB, Edgar RC, Batzoglou S.  Multiple alignment of protein sequences with repeats and rearrangements.  Nucleic Acids Research, 34(20): 5932-5942, 2006.

23.    Do CB, Woods DA, Batzoglou S.  CONTRAfold: RNA Secondary Structure Prediction without Physics-Based Models. Best Paper Award, ISMB 2006. http://contra.stanford.edu/contrafold.

24.    Davydov E, Batzoglou S.  A computational model for RNA multiple structural alignment. Combinatorial Pattern Matching 2004.

25.    Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S.  Eukaryotic regulatory element conservation and identification using comparative genomics. Genome Research 14:451-458, 2004. http://compareprospector.stanford.edu.

26.    Liu Y, Wei L, Batzoglou S, Brutlag DL, Liu JS, Liu XS.  A suite of web-based programs to search for transcriptional regulatory motifs. Nucleic Acids Research 32:W204 - W207, 2004.

27.    Naughton B, Fratkin E, Batzoglou S, Brutlag DL.  A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites.  Nucleic Acids Reesearch, 34(20): 5730-5739, 2006. http://motifscan.stanford.edu.

28.    Fratkin E, Naughton B, Brutlag DL, Batzoglou S.  MotifCut: Finding Regulatory Motifs with Maximum Density Subgraphs.  Proceedings ISMB2006, Bioinformatics 22: e150-e157, 2006. http://motifcut.stanford.edu.

29.    Srinivasan B, Novak A, Flannick J, Batzoglou S, McAdams H. Integrated Protein Interaction Networks for 11 Microbes.  Proceedings of RECOMB 2006, pp. 1–14.

30.    http://networks.stanford.edu.

31.    Flannick J, Novak A, Srinivasan BS, McAdams HH, Batzoglou S.  Graemlin: general and robust alignment of multiple large interaction networks.  Genome Research, 16:1169-1181, 2006. http://graemlin.stanford.edu.

32.    Sundquist A, Fratkin E, Do CB, Batzoglou S. HAPAA: HMM-based analysis of polymorphisms in admixed ancestries. Submitted. http://hapaa.stanford.edu.

 

Posters

Short Read Assembly.  A poster by Andreas Sundquist describing his sequencing protocol and assembly algorithm for reconstructing a whole mammalian genome from short unpaired reads, such as those that Pyrosequencing or some other high-throughput sequencing technology may be able to produce in the near future.

Typhon: a local multiple aligner. A poster by Jason Flannick describing his method for indexing a multiple alignment for seeded homology search, and his Typhon tool that implements the method. This was presented in the Genome Informatics meeting of CSHL in October 2005.

An extended-model DNA alignment system.  A poster by Chuong Do describing an extension that he implemented of the LAGAN pairwise aligner, incorporating a more elaborate alignment model that includes: (1) a double-affine gap model for alignment; (2) a non-conserved state to absorb unalignable sequence; (3) a coding model for translated alignment of protein-coding regions.

Draft alignment. Some of our unpublished work on aligning two genomes that are given in draft assembly form. Eugene Davydov recently proved that this problem is NP-complete.

Application of ICA to microarrays. A poster by Suin Lee describing our preliminary results of applying independent component analysis to microarrays. It was presented in the May 2003 Genome Informatics CSHL meeting.

LAGAN GUI. Marina Sirota started developing a gui for LAGAN and MLAGAN during Summer 2003. This is her CURIS student poster.