CS 374 - Algorithms in Biology
Fall 2005

 

 

 

Course Description

This course will cover algorithms and computational models applied to molecular biology. Current, exciting algorithms from a variety of biological areas will be covered. The topics should be of interest to computer scientists and biologists alike. In Fall 2005 we will cover topics from genomics and evolution of DNA, such as sequence comparison methods, annotating DNA with genes and evolutionarily important elements, genomic rearrangements, microarray analysis, and new sequencing technologies. We will also cover topics from protein structure, protein surface and interaction modeling, multiple alignment of proteins, phylogenetic trees, and DNA-based computation. The course will consist primarily of student presentations of topics in the syllabus, which will be prepared with the help of the instructor. Students will help form the syllabus by choosing the topics they would like to present.

 

Class Schedule

          Lecture: TTh 3:15-4:30, Clark Center room S361

Staff and Office hours

        Instructor: Serafim Batzoglou
        Office: S266 Clark Center 
        Phone: (650) 723-3334 
        E m a il: serafim (at the address of) cs period stanford period edu (so as to avoid spam)
        Office hours: Tuesday 1:15-3:15PM.

TA: Relly Brandman

 

Prerequisites
These are recommended but will not be strictly enforced.

Course Requirements and Grading

1.     Lecture.  The main course requirement is to select a topic and prepare a presentation based on 2 papers on the topic. The instructor and TA will meet with each student to help with the preparation and to ensure that the resulting presentation is interesting and accessible to students in the class who are not experts in the given topic. Most of the topics have a strong algorithmic flavor, but some are geared more towards biology. Please sign up for topics to present on a first-come, first-served basis (see Topics below).

2.     Scribing.  The second requirement is scribing a lecture.  Lecture notes should give students taking the class a useful resource for remembering the material presented. Ideally, they should be written so that they are readable by next year's students, who will not necessarily have read the papers that were presented. Here is a sample of what lecture notes should look like in terms of format and organization. We suggest that you use it as a template to prepare your lecture notes in Word.

Please sign up for scribing on a first-come, first-served basis. To do so, please email both the instructor and the TA with the subject "CS374, signing up for scribing". Lecture notes are due 1 week after the presentation.

3.    Summaries.  As a third requirement, you should select one of the first 10 lectures and one of the remaining lectures. For each selected lecture, you should find one paper, in addition to the 2 presented, that is related to the topic. Recent papers (2001-2005) are preferable. Then write a 1-page summary of what the paper presents and how it relates to the other two. Each summary is due 1 week after the selected lecture and will be made available online 2 weeks after that lecture, once we have edited it together.

Here is a sample structure of this short summary:

- Paper reference

- Abstract: in your own words (preferably a simple description), what does the paper present?

- Discussion: how do these results relate to the topic? Is it an advance over what was described, or a different approach? What are the main advantages/disadvantages?

4.    As this is a seminar-style class, attendance is mandatory. Each student may miss up to 2 classes without affecting his/her grade.

Taking the class for 2 units: If you take the class for 2 units, you can drop (2) or (3) above; or, if enrollment is too high, we will consider dropping (1) if you prefer.

Communication

Questions should be sent to the instructor and TA directly by email, or communicated to course staff in person after lecture or during office hours.
 

 

Topics

Students will select topics from the following list and sign up for a presentation date, both on a first-come, first-served basis. Please email both the instructor and the TA with the subject "CS374, signing up for presentation". Each lecture will cover 2, or occasionally 3, papers. Underlined topics have been assigned.

In selecting topics, note that some of them have several papers, which are always grouped. Please select just one group of papers.

Color code: The topics below are color coded to roughly correspond to the subject area. Sky blue broadly denotes DNA sequence & genomics papers, red is systems and modular biology, green is protein-related papers, purple is biological computation, orange is non-CS biology, and gray is miscellaneous topics. Ordering of the topics is random.

  Topic Papers
1 Genomic rearrangements

 

     Genome Rearrangements in Mammalian Evolution: Lessons from Human and Mouse Genomes

     Transforming Men into Mice: the Nadeau-Taylor Chromosomal Breakage Model Revisited

 

2 Repetitive DNA detection and classification

 

     Piler: identification and classification of genomic repeats

     De novo identification of repeat families in large genomes

 

3 Networks of Protein Interactions

A  Integration

     A probabilistic functional network of yeast genes

      A Bayesian framework for combining heterogeneous data sources for gene function prediction

 

B  Network Alignment

      Conserved patterns of protein interaction in multiple species

      Pairwise local alignment of protein interaction networks guided by models of evolution

 

C  Mathematical Properties

     Network biology: understanding the cell's functional organization

      Evidence for dynamically organized modularity in the yeast protein-protein interaction network

      Subnets of scale-free networks are not scale free: sampling properties of networks

 

D  Systems Biology

      Stochasticity in gene expression: from theories to phenotypes

      Robustness in bacterial chemotaxis

      Directed evolution of a genetic circuit

     

E  Misc. graph algorithms

      Efficient algorithms for detecting signaling pathways in protein interaction networks

      Mining coherent dense subgraphs across massive biological networks for functional discovery

 

F  Signal Transduction Networks

     Network motifs: simple building blocks of complex networks

      Interlinked fast and slow positive feedback loops drive reliable cell decisions

 

4 Indexing large databases for string similarity search

A  Seeded database search

     BLAT--The BLAST-like alignment tool

      Designing Seeds for Similarity Search in Genomic DNA

 

B  Multiple seeds and multiple alignments

     Designing Multiple Simultaneous Seeds for DNA Search

      Using multiple alignments to improve seeded local alignment algorithms

 

5 Regulatory motif finding

 

      Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals

      Ab initio prediction of transcription factor targets using structural knowledge

 

6 Protein structure and prediction

A  Finding the Beta Helix motif

      Predicting the beta-helix fold from protein sequence data

      Segmentation Conditional Random Fields (SCRFs): a new approach for protein fold recognition

 

B  Computational musings on protein domains

      A novel method for multiple alignment of sequences with repeated and shuffled elements

      Graph theoretical insights into evolution of multidomain proteins

 

C  Molecular Dynamics Simulation of Drug-Target Proteins

      HIV-1 protease molecular dynamics: Possible contributions to drug resistance and a potential new target site for drugs

      Dimerization of the p53 oligomerization domain: identification of a folding nucleus by molecular dynamics simulations

 

D  Graphical Models for Protein Kinetics

      Stochastic roadmap simulation: an efficient representation and algorithm for analyzing molecular motion

      Using path sampling to build better Markovian state models. Predicting the folding rate and mechanism of a tryptophan zipper beta hairpin

 

7 Protein classification

A  Kernel-based methods

     Support vector machine applications in computational biology

      Semi-supervised protein classification using cluster kernels

 

B  Graph flow-based methods

     Protein ranking: from local to global structure in the protein similarity network

      Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps

 

8 Phylogenetic trees

 

      A Structural EM Algorithm for Phylogenetic Inference

      A hybrid micro-macroevolutionary approach to gene tree reconstruction

 

9 Haplotype reconstruction

 

      Minimum-Recombinant Haplotyping in Pedigrees

      An exact solution for finding minimum recombinant haplotype configurations on pedigrees with missing data by integer linear programming

 

10 Finding elements in DNA that are conserved by evolution

A  Methods for finding conserved elements

     Distribution and intensity of constraint in mammalian genomic sequence

      Identification and Characterization of Multi-Species Conserved Sequences

 

B  Statistical power of detecting conserved elements

     Subtree power analysis and species selection for comparative genomics

      A model of the statistical power of comparative genome sequence analysis

 

11 Protein multiple alignment

 

      MUSCLE: a Multiple Sequence Alignment Method with Reduced Time and Space Complexity

      ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment

 

12 Modeling the origin and migration of human populations

 

      The application of molecular genetic approaches to the study of human evolution

      Recovering the geographic origin of early modern humans by realistic and spatially explicit simulations

 

13 Finding genes based on comparative genomics

 

      Multiple organism gene finding by collapsed Gibbs sampling

      Using multiple alignments to improve gene prediction

 

14 Mining the medical literature

 

     Using Text Analysis to Identify Functionally Coherent Gene Groups

      Extracting Synonymous Gene and Protein Terms from Biological Literature

 

15 Modeling regulatory networks

A  Probabilistic Modeling

      Module Networks: Identifying Regulatory Modules and their Condition-Specific Regulators from Gene Expression Data

      Probabilistic Discovery of Overlapping Cellular Processes and their Regulation

 

B  Role of noise in gene expression

     The effect of transcription and translation initiation frequencies on the stochastic fluctuations in procaryotic gene expression

      Noise propagation in gene networks

 

16 Classic Papers

This presentation, if selected by a student, will be different from the usual format. We will cover a historical perspective based on three classic papers on Chromosomes (1903), Genes (1933), and the Central Dogma of molecular biology (1970).

      The Chromosomes in Heredity

      What is a Gene?

      Central Dogma of Molecular Biology

17 DNA-based computation and self-assembly

 

      Complexity of Self-Assembled Shapes

      Self-healing tile sets

 

Additional References: 1, 2

 

18 Transforming cells into automata

 

      Genetic Circuit Building Blocks for Cellular Computation

      Optimizing genetic circuits by global sensitivity analysis

 

                                                                                                        

Schedule

The schedule will be filled in as students sign up for topics. Click on the scribe's name for lecture notes.

# | Topic | Date | Presenter | Short Paper Summaries | Scribe
1 | Introduction | 9-27 | Serafim Batzoglou | | Abhishek Rathod
2 | Comparative Genomics | 9-29 | Serafim Batzoglou | | Vignesh Ganapathy
3 | Finding Genes Based on Comparative Genomics | 10-4 | Sam Gross | | Ross Bayer
4 | Classic Papers in Genetics | 10-6 | Chihiro Fukami | |
5 | Networks of Protein Interactions -- A. Introduction and Integration | 10-11 | Balaji Srinivasan | |
6 | Networks of Protein Interactions -- B. Network Alignment | 10-13 | Tony Novak | |
7 | Indexing Large Databases for String Similarity -- A. Seeded Database Search | 10-18 | Ross Bayer | Indexing a MSA; Spaced Seeds |
8 | Protein Multiple Alignment | 10-20 | Konstantin Davydov | SPEM aligner |
9 | Networks of Protein Interactions -- C. Mathematical Properties | 10-25 | Abhishek J Rathod | Functional Topology | Chihiro Fukami
10 | Signal Transduction Networks -- no slides, Onn used chalk :-) | 10-27 | Onn Brandman | ModularAnalysis |
11 | Graphical Models for Understanding Protein Kinetics | 11-1 | Nina Singhal | |
12 | Regulatory Motif Finding | 11-3 | Wenxiu Ma | Phylogenetic Motif Finder 1, 2; GEMODA | Marcin Mejran
13 | DNA-based Computation and Self-Assembly | 11-8 | Ho-Lin Chen | |
14 | Protein Classification | 11-10 | Serafim Batzoglou | Coherent Subgraphs |
15 | Protein Structure and Prediction -- A. Finding the Beta Helix Motif | 11-15 | Marcin Mejran | Ab initio prediction |
16 | Modeling the Origin and Migration of Human Populations | 11-17 | Michael Palmer | | Melroy Saldanha
17 | Mining the Medical Literature | 11-29 | Vignesh Ganapathy | | Konstantin Davydov
18 | Networks of Protein Interactions -- D. Systems Biology | 12-1 | Ophelia Venturelli | |
19 | Regulatory Motif Finding, Part II | 12-6 | Balaji Srinivasan | | Wenxiu Ma
20 | Phylogenetic Trees | 12-8 | Melroy Saldanha | | Ophelia Venturelli