Training - Hongyu, Zhang


Bioinformatics
Methods and Applications
Dr. Hongyu Zhang
Ceres Inc.
Goals of the talk
• The major battlefields in bioinformatics research
• The most popular weapons used in the battle
History
• Human genome project
• Overlapping with other branches
– Computational Biology
– Biocomputing
– Biostatistics
– Cheminformatics
The Central Dogma of
Molecular Biology
DNA → (Transcription) → RNA → (Translation) → Protein
Major battlefields in bioinformatics
• DNA
– Genome sequencing
– Gene discovery
• mRNA
– Micro-array analysis
– Sequencing
• Protein
– Structure modeling and prediction
– Proteomics
• …
Major weapons
• Computational algorithms
– Hash method
– Dynamic programming
– String and Tree (binary, suffix)
– Clustering
• Probability and Statistical theory and methods
– Bayesian theorem, Markov chain (HMM), Principal component
– Monte Carlo simulation
– Neural Network
• Physical chemistry
– Functions to describe the physical chemistry interactions in bio-molecules
– Molecular mechanics, Molecular dynamics algorithm
• Data storage and access
– Database: Oracle, MySQL etc.
– Web interface
• Large-scale computing platform
– Hardware
– Software
Genome sequencing: Celera shotgun assembly
Venter et al. 2001
Gene discovery based on sequence comparison
• Finding new genes based on their sequence
similarity and evolutionary relationship to known
genes
• Methods
– Hash-based database search method, like BLAST
(PSI-BLAST), FASTA, BLAT etc.
– Sequence alignment using Dynamic Programming
algorithm
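The hash-based search idea can be sketched as follows. This is only a minimal illustration of exact k-mer seeding, not how BLAST, FASTA or BLAT are actually implemented; the function names, the toy sequences and the word size k=4 are all assumptions made for the example.

```python
from collections import defaultdict

def build_kmer_index(sequences, k):
    """Hash every k-mer to the (sequence id, offset) positions where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index

def find_seeds(query, index, k):
    """Return (query offset, sequence id, database offset) for each exact k-mer hit."""
    seeds = []
    for q_offset in range(len(query) - k + 1):
        word = query[q_offset:q_offset + k]
        for seq_id, db_offset in index.get(word, []):
            seeds.append((q_offset, seq_id, db_offset))
    return seeds

# Toy database (hypothetical sequences, for illustration only).
database = {"geneA": "ATGGCGTACGTTAG", "geneB": "TTGACCGGTA"}
index = build_kmer_index(database, k=4)
seeds = find_seeds("GCGTACG", index, k=4)
print(seeds)
```

Real tools then extend such seeds into longer alignments and score them statistically; that step is left out here.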
BLAST database search
(http://www.ncbi.nih.gov/BLAST/)
[Figure: a query sequence searched against the database sequences]
Sequence alignment
• Example
BLAST
||| |
BLA-T
• Programs
– CLUSTALW
– DIALIGN
Dynamic programming algorithm
Sequence A = (A1, A2, …, Ai, …, Am)
Sequence B = (B1, B2, …, Bj, …, Bn)
\[
H_{i,j} = \max
\begin{cases}
H_{i-1,j-1} + S(A_i, B_j) \\
H_{i,j-1} + S(-, B_j) \\
H_{i-1,j} + S(A_i, -)
\end{cases}
\]
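The dynamic programming recurrence for sequence alignment can be sketched in a few lines. The slides do not specify the score function S, so the values below (match +1, mismatch -1, gap -1) are assumptions, and a single linear gap penalty stands in for both gap terms.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the score matrix H, then trace back one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned against gaps only
        H[i][0] = H[i - 1][0] + gap
    for j in range(1, cols):          # first row: b aligned against gaps only
        H[0][j] = H[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,   # align A_i with B_j
                          H[i][j - 1] + gap,     # gap in A
                          H[i - 1][j] + gap)     # gap in B
    # Traceback from the bottom-right corner.
    i, j, top, bottom = len(a), len(b), [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return H[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

score, aligned_a, aligned_b = global_align("BLAST", "BLAT")
print(score)
print(aligned_a)
print(aligned_b)
```

With these assumed scores the sketch reproduces the BLAST / BLA-T example: four matches and one gap.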
Ab initio gene prediction methods
• Statistics based gene prediction
– Nucleotide frequency distributions in the
coding regions
– Exon/intron boundary signals
• Examples
– GenScan, Burge and Karlin 1997
– Fgenesh, Solovyev and Salamov 1994
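As a very rough illustration of an exon/intron boundary signal, the sketch below lists candidate splice sites using the canonical GT (donor) and AG (acceptor) dinucleotides. It is not how GenScan or Fgenesh work; those programs score signals and coding statistics probabilistically. The function name and the toy sequence are assumptions.

```python
def candidate_splice_sites(dna):
    """Return positions of GT (candidate donor) and AG (candidate acceptor) dinucleotides."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors

# Toy sequence, for illustration only.
donors, acceptors = candidate_splice_sites("ATGGTAAGTCCAGGA")
print(donors, acceptors)
```

Most of these candidates are false positives in real genomes, which is why statistical models of the surrounding sequence are needed.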
Hybrid gene prediction method
• Example: Celera Otto program
– BLAST against RefSeq database
– BLAST against EST database, other genomic
sequences etc.
– Genscan, Fgenesh
Problems in Gene discovery
• Example: Given a cDNA sequence, find its true
location in the genome map among lots of alternatives
[Figure: segments 1, 2, 3 of the query transcript/protein matched to segments 1’, 2’, 3’ of a genomic component]
Two-step solution
1. BLAST search of the cDNA sequence against the
whole genome map
2. Using an LIS (longest increasing subsequence) algorithm
to find the correct genomic component hit

\[
\begin{aligned}
l_0 &= \{hsp_0\} \\
l_i &= \{\max_{0 \le j < i} l_j,\ hsp_i\}, \quad \text{if } 0 \le s_i - e_j \le \text{Cutoff} \\
LIS &= \max_{0 \le i \le n} l_i
\end{aligned}
\]
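A minimal sketch of the LIS step, under stated assumptions: the HSPs (high-scoring segment pairs from BLAST) are sorted by their position on the query, s and e are taken to be the genomic start and end of an HSP, and a chain is scored by the number of HSPs it contains. The `Hsp` type, the coordinates and the cutoff value are hypothetical.

```python
from typing import List, NamedTuple

class Hsp(NamedTuple):
    query_start: int
    genome_start: int
    genome_end: int

def longest_hsp_chain(hsps: List[Hsp], cutoff: int) -> List[Hsp]:
    """Longest chain of HSPs whose genomic gaps satisfy 0 <= s_i - e_j <= cutoff."""
    ordered = sorted(hsps, key=lambda h: h.query_start)
    if not ordered:
        return []
    best_len = [1] * len(ordered)   # best chain length ending at each HSP
    prev = [-1] * len(ordered)      # back-pointer to the previous HSP in that chain
    for i, cur in enumerate(ordered):
        for j in range(i):
            gap = cur.genome_start - ordered[j].genome_end
            if 0 <= gap <= cutoff and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    end = max(range(len(ordered)), key=lambda i: best_len[i])
    chain = []
    while end != -1:
        chain.append(ordered[end])
        end = prev[end]
    return chain[::-1]

# Three collinear exon hits plus one distant spurious hit (made-up coordinates).
hits = [Hsp(0, 1000, 1100), Hsp(100, 1500, 1620),
        Hsp(100, 90000, 90120), Hsp(220, 2000, 2080)]
chain = longest_hsp_chain(hits, cutoff=10000)
print(chain)
```

The distant hit is dropped because it cannot be chained with its neighbours within the cutoff.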
Phylogenetic analysis
• Goal: study the functional and evolutionary relationships
among a group of genes
– Divide homologous genes into functional families
– Find the evolutionary relationship between orthologous
genes belonging to different species (e.g., the Out of
Africa theory)
• Methods
– Hierarchical clustering
– Neighbor-joining etc.
• PHYLIP program, Univ. of Washington
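Hierarchical clustering of genes or species can be sketched with average linkage (often called UPGMA). This is an illustrative toy, not the PHYLIP implementation, and the distance values are invented for the example rather than taken from real data.

```python
def upgma(names, dist):
    """Average-linkage clustering on a distance matrix; returns a nested-tuple tree."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}   # id -> (tree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                 # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        for c in clusters:                       # distance from the merged cluster to the rest
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]

# Hypothetical pairwise distances between four homologous genes.
names = ["human", "chimp", "mouse", "chicken"]
dist = [[0, 2, 10, 20],
        [2, 0, 10, 20],
        [10, 10, 0, 20],
        [20, 20, 20, 0]]
tree = upgma(names, dist)
print(tree)
```

Neighbor-joining differs in that it does not assume a constant rate of evolution across branches; it is not shown here.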
Micro-array analysis
• Expression-genomics
• Primary goals
– Look for genes with different expression
levels between experiments, which are candidate
functional genes
– Look for groups of genes that have correlated
gene expression levels, which could suggest that
they are in the same biological pathway
• Methods
– General probability and statistics methods
– Dimension reduction
• Principal components
• Lowess
– Clustering
• Tools
– S-plus, R
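The two primary goals can be illustrated with a log fold change (differential expression) and a Pearson correlation (co-expression). The sketch uses Python rather than S-plus or R, and the expression values are invented; real analyses also need normalization and significance testing.

```python
import math

def log2_fold_change(treated, control):
    """Log2 ratio of expression between two experiments."""
    return math.log2(treated / control)

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression levels over four time points.
gene_a = [1.0, 2.0, 4.0, 8.0]
gene_b = [2.0, 4.0, 8.0, 16.0]   # rises together with gene_a
gene_c = [8.0, 4.0, 2.0, 1.0]    # falls while gene_a rises

print(log2_fold_change(8.0, 2.0))
print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```

Genes with strongly correlated profiles, like gene_a and gene_b here, would be grouped together by a clustering step.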
Example
• Herbicide
– Plants were treated with herbicide to observe
the gene expression profiles over a series of
time steps.
– The genes that appear right before the plant
dies (12 hours) are the possible “death” genes
– If we knock down the “death” genes in
normal plants, they could last longer than
the herbs.
Protein structure prediction
• Why is protein structure important?
– The function of a gene depends on the
structure of its translated protein
• Protein binding with its ligands
• Protein-protein interactions
– A protein molecule usually keeps one stable
structure under normal physiological
conditions (Anfinsen, 1960s)
– Drug design
• Docking and high throughput drug screening.
[Figure: Sequence, Bioinformatics, Protein structure, Function]
Protein structure prediction methods
Homology modeling procedure
1. Protein sequence
2. Database search
3. Select template structure
4. Sequence alignment
5. Build conserved regions first
6. Loop modeling
7. Build side-chains
8. Optimizing
Homology modeling programs
• Academic software
– MODELLER, Sali A.
– COMPOSER, Blundell T.
– SWISS-MODEL
– RasMol (graphics)
• Commercial software
– QUANTA, MSI inc.
– SYBYL, TRIPOS inc.
Threading
• Find the best fold candidates among a limited number of
choices
• Add 3D information to the score function of dynamic
programming
Ab initio protein structure principle
• Threading programs
– Topits, Eisenberg D.
– Threader, Jones D.
– ProSup, Sippl M.
– 123D, Alexandrov N.
• Ab initio programs
– Rosetta, David Baker
Current status in the protein
structure prediction field
• Moult J., CASP (Critical Assessment of
Techniques for Protein Structure Prediction).
• Homology modeling is very mature already
• Threading and ab initio methods have been used
in industry
• Structural genomics
Large scale computing platform
• Hardware
– Super-computers
• Cray/SGI
• DEC/Compaq
• Intel
– Linux clusters
– Blade
• Software
– Parallel computing (MPP,
PVM etc.)
– Linux
– Grid computing: the Globus
Project
Linux clusters
Data storage and access
• Bioinformatics is producing a huge amount of
data each day
– How to organize and store data
– How to access data
• Database software
– Commercial
• Oracle, DB2, Sybase
– Freeware
• MySQL, PostgreSQL
• Currently popular databases
– DNA, protein sequence, like GenBank, Swiss-Prot, PIR etc.
– Protein structure, like PDB, SCOP
– DNA, mRNA, protein function, like GO, Pfam
Database example: Gene Ontology (GO)
• Molecular function
• Cellular component
• Biological process
Data access
• Web interface
– Protocol
• CGI, JSP, ASP
– Computer languages
• Perl, Java, C/C++, Visual Basic, Visual C++
Looking forward
• Where are the markets
– Develop new programs
– Assemble current programs to build more efficient data mining
pipelines
– Data storage and access
– Integrate the current databases to use them more effectively
– Computing platform, including hardware, software support,
consulting etc.
• What we can offer
– Multi-talents
– Team work
– Networking
http://www.hongyu.org/paper/bioinformatics.ppt