I529:  Bioinformatics in Molecular Biology and Genetics: Practical Applications (4CR)

Spring Semester 2007
Lecture: M/W 4-5:15pm, I107

Office Hour: TBA, Eigenmann 1008
Lab: Fri, 4-5:30pm I109
Instructor: Haixu Tang
AI: Huijun Wang


Description: We aim to introduce a broad range of applications, from fundamental to advanced, of bioinformatics methods and tools for solving problems in genomics and molecular biology. Before taking this class, students should have learned the basic methods and theories of bioinformatics, e.g. by taking I519. In this class, we will focus on how to apply them to real biological problems. Some advanced computational techniques that are widely applied in bioinformatics, e.g. hidden Markov models (HMMs) and Bayesian networks (BNs), will be discussed in detail. The important themes covered by this course include:

- Sequence modeling and classification
- Genome annotation
- Motif finding
- Genome comparison
- Protein families
- Non-coding RNAs
- MicroRNAs and their targets
- Functional prediction
- Phylogenetics
- Mass spectrometry and proteomics

This class will have a separate lab section, in which students will be taught how to solve biological problems in a step-by-step fashion. The programs covered in the lab include:

- Sequence modeling using Markov chains: seq++
- Pair HMM: SLAM, TwinScan, QRNA
- HMM: Genscan
- Profile HMM: Hmmer, Pfam
- Stochastic Context-Free Grammar (SCFG): COVE
- Non-coding RNA search: Rsearch
- Phylogenetics: PHYLIP, PAML

Students will be instructed to write scripts (Perl and PHP preferred) and/or programs that make use of existing implementations of sophisticated algorithms, such as HMM, BN, and SVM, to solve biological problems.
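As a rough illustration of the kind of script the labs have in mind, the sketch below fits a first-order Markov chain to a few DNA sequences and uses a log-likelihood ratio to decide which of two models explains a query better. It is written in Python only for brevity (the course itself prefers Perl and PHP), and it does not reflect how seq++ works internally. The function names, the pseudocount, and the toy sequences are all assumptions made for this example.

```python
import math

ALPHABET = "ACGT"

def train_markov_chain(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | current base)."""
    # Start every count at the pseudocount so unseen transitions keep nonzero probability.
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in ALPHABET}
    return probs

def log_likelihood(seq, probs):
    """Sum of log transition probabilities; the first base is conditioned on, not scored."""
    return sum(math.log(probs[cur][nxt]) for cur, nxt in zip(seq, seq[1:]))

# Toy training sets, invented for illustration only.
cpg_like = ["CGCGCGCG", "GCGCGCGC", "CGCGGCGC"]
at_rich = ["ATATATAT", "TTAATTAA", "AATATTTA"]
cpg_model = train_markov_chain(cpg_like)
at_model = train_markov_chain(at_rich)

# A positive log-likelihood ratio favors the CpG-like model.
query = "CGCGCG"
score = log_likelihood(query, cpg_model) - log_likelihood(query, at_model)
label = "CpG-like" if score > 0 else "AT-rich"
print(label, round(score, 3))
```

The same comparison of two trained models is the basic idea behind sequence classification with Markov chains; real tools add higher-order models and much larger training sets.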

This course is designed for advanced bioinformatics graduate students who have taken I519. Graduate students with either biology or physical/computer science backgrounds who are interested in bioinformatics applications in molecular biology are also welcome to take this course.


Textbook: Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999 (BSA).
Some of the topics in the course cannot be found in this book. We will distribute complementary lecture notes and reading materials for these topics throughout the course. We also recommend that students read Nello Cristianini and Matthew W. Hahn, Introduction to Computational Genomics: A Case Studies Approach, Cambridge University Press, 2006.

Assignments: We will have five take-home assignments and one class project.

Grading: Combined assignments (30%), one midterm exam (20%), final exam (25%), class project (20%), attendance (5%).

Office hour: Haixu Tang: TBA, Eigenmann 1008, or by appointment
Office hour: Huijun Wang: TBA, Ph.D. student office, or by appointment

Prerequisite: I519 or equivalent knowledge of bioinformatics is required.

Group Assignment: The class will be divided into several small groups for mini projects. Group assignments will be determined in the first class.

Final projects (please email me if you have any questions regarding these projects)

  • Project 1: Construction of asthma-related protein-protein interactions from the literature and their application to the analysis of gene expression data. Contact: Dr. Sun Kim.
  • Project 2: Analysis of the effect of drug-induced hypermethylation on gene transcription. (Note: this project requires prior knowledge of DNA hypermethylation.) Contact: Dr. Sun Kim.
  • Project 3: Probabilistic modeling of oligosaccharides. Unlike nucleic acids or proteins, oligosaccharides (sugar molecules) have tree-like structures instead of linear sequences, which makes their probabilistic modeling more complicated. In this project, we will build a hidden tree Markov model (HTMM), an extension of the classical HMM, to discover and mine the structural motifs of oligosaccharides. Reference: Ueda N, et al. A probabilistic model for mining labeled ordered trees: capturing patterns in carbohydrate sugar chains. IEEE Transactions on Knowledge and Data Engineering, 17, 1051-1064, 2005. Contact: Haixu.
  • Project 4: Hidden Markov models for the identification of LTR retrotransposons. LTR retrotransposons are a major type of transposable element. Current methods for identifying LTR retrotransposons are based mainly on their sequence features, in particular the pairs of LTRs. In this project, we want to build integrative hidden Markov models to recognize remotely homologous elements. Reference: Andrieu O, et al. Detection of transposable elements by their compositional bias. BMC Bioinformatics 5:94, 2004. Contact: Haixu.
  • Project 5: Probabilistic modeling of time series data. Time series data, which can be viewed as vectors of data points obtained from a sequence of measurements, are commonly observed in the biomedical sciences, ranging from molecular profiling experiments using microarrays, mass spectrometry (MS), or capillary electrophoresis (CE) to clinical measurements of disease progression. Biomarker discovery aims to identify patterns in the time series data that are associated with biological functions or disease prognosis. In this project, we want to build probabilistic models to cluster large amounts of time series data and discover the common patterns among them. Reference: Schliep A, et al. Robust inference of groups in gene expression time-courses using mixtures of HMM. ISMB2004. Bioinformatics, 20:I283-I289, 2004. Contact: Haixu.
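
Projects 3 through 5 all build on hidden Markov models or extensions of them, so a small decoding example may help when choosing a project. The sketch below runs the Viterbi algorithm in log space on a two-state toy HMM that labels each base of a DNA sequence as belonging to an AT-rich or a GC-rich region. The state names, all probabilities, and the sequence are invented for illustration and are not taken from any of the referenced papers. The function assumes a non-empty observation sequence and strictly positive probabilities.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at position t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    # back[t][s]: predecessor of state s on that best path.
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Toy two-state model (all numbers are made up for this example).
states = ["AT", "GC"]
start_p = {"AT": 0.5, "GC": 0.5}
trans_p = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
emit_p = {
    "AT": {"A": 0.45, "T": 0.45, "C": 0.05, "G": 0.05},
    "GC": {"A": 0.05, "T": 0.05, "C": 0.45, "G": 0.45},
}

sequence = "AAAATTTTGGGGCCCC"
path = viterbi(sequence, states, start_p, trans_p, emit_p)
print(path)
```

With these parameters the decoded path stays in the AT-rich state for the first eight bases and switches once to the GC-rich state, because a single switch is cheaper than emitting eight unlikely bases.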

Preliminary syllabus [This may change!]:


    Lecture notes
    1/8 Mon.
    Introduction to the class

    BSA 1.1 - 1.2
    The primer of Perl
    Hypertext Preprocessor PHP
    -- we will use it for the web site design in this class.

    1/10 Wed.
    Probabilistic modeling
    BSA 1.4, Chapter 11

    1/12 Fri.
    Lab1: Web site design using PHP and MySQL
    (Homework 1)

    1/15 Mon.
    No class (Martin Luther King Jr. Day)

    1/17 Wed.
    Probabilistic sequence modeling I: frequency and profiles

    1/19 Fri.
    Lab2: Alignment algorithms: Smith-Waterman, FASTA and BLAST

    1/22 Mon.
    Probabilistic sequence modeling I: frequency and profiles

    1/24 Wed.
    Probabilistic sequence modeling II: Markov chain
    BSA Chapter 4

    1/26 Fri.

    Lab3: Modeling biological sequences using seq++; blocks and related tools; Sequence weblogo
    (Homework 1 due)

    1/29 Mon.
    Probabilistic sequence modeling II: Markov chain
    (Homework 2)

    1/31 Wed.
    Hidden Markov Model I: Model structure
    BSA Chapter 3

    2/2 Fri.

    Group Discussion

    2/5 Mon.
    Hidden Markov Model I: Model structure

    2/7 Wed.
    Hidden Markov Model II: GHMM

    2/9 Fri.

    Lab4: GeneMark.HMM & Genscan

    2/12 Mon.
    HMM III: Parameter estimation
    (Homework 3)
    BSA Chapter 3

    2/14 Wed.
    HMM III: parameter estimation

    2/16 Fri.

    Group discussion
    (Homework 2 due)

    2/19 Mon.
    EM algorithm

    2/21 Wed.
    EM algorithm

    2/23 Fri.

    Lab5: SLAM, TwinScan, QRNA

    2/26 Mon.
    Pair HMM I
    BSA Chapter 4

    2/28 Wed.
    Pair HMM II

    3/5 Mon.
    Profile HMM I
    (Homework 4)
    BSA Chapter 5

    3/7 Wed.
    Spring recess

    3/19 Mon.
    Profile HMM II

    BSA Chapter 5

    3/21 Wed.
    Profile HMM III

    3/23 Fri.

    Lab5: Pfam & Hmmer
    (Homework 3 due)

    3/26 Mon.
    Gibbs Sampling

    3/28 Wed.
    Advanced probabilistic graphical models

    3/30 Fri.

    Group Discussion

    4/2 Mon.
    Phylogenetics: distances and evolutionary models
    (Homework 5)
    BSA Chapter 7

    4/4 Wed.
    Phylogenetics: Neighbor joining (NJ) tree
    BSA Chapter 7

    4/6 Fri.

    Lab6: ClustalW, Phylip, Treeview/ATV
    (Homework 4 due)

    4/9 Mon.
    Phylogenetics: parsimony
    BSA Chapter 7

    4/11 Wed.
    Phylogenetics: bootstrap
    BSA Chapter 7

    4/13 Fri.
    Lab7: Phylip: more examples

    4/16 Mon.
    Phylogenetics: phylogeny and alignment
    BSA 7

    4/18 Wed.
    Phylogenetics: probabilistic models of evolution
    BSA 9

    4/20 Fri.

    Lab8: PAML
    (Homework 5 due)

    4/23 Mon.
    Phylogenetics: Maximum likelihood (ML) method
    BSA 9

    4/25 Wed.
    Project presentations (continued on Friday, 4/27)

    4/30 Mon.
    Final Exam

    5/4 Fri. Final project report due

    Last updated : 12/25/2006