Bioinformatics and Industrial Engineering

Download Report

Transcript Bioinformatics and Industrial Engineering

Opportunities of Systems
Engineering/Operations Research
in Bioinformatics
Hyoungtae Kim
(Joint work with Wiljeana Jackson, S.C. LIN and Dr. JC LU)
2
Outline
• Introduction on Bioinformatics
• Paradigm Shift in Biology
• Systems Engineering/Operations Research
for Bioinformatics
• About Funding Opportunities
• Conclusions
What is Bioinformatics/Computational
Molecular Biology?
• An application of mathematical, statistical, and computational
tools in the analysis of the huge size biological data
• Most of the cases, it involves analyzing information stored in
large databases
• Multi-disciplinary:
-Biology
-Physics
-Engineering
-Mathematics
-Chemistry
-Statistics
-Computer Science
It has not yet found its own natural home department
3
4
Why Bioinformatics?
• Current data analysis tools are far from being efficient for
analyzing vast amount of biological data
• The pace of biological understanding is much slower than the
pace of the technology advance that have powered
experimental discovery and data collection
• Benefits:
Advances in detection and treatment of disease and the
production of genetically engineered foods
Profound impact on health and medicine
Three Elements of Bioinformatics
Research
Significant Biological problems
-Gene, motif, signal recognition
-Protein structure prediction
-Metabolic pathway deduction
-Etc.
Theory & Methods
-Algorithms
-Statistical Methods
-Ontologies
-Etc.
Data
-Microarrays
-Mass Spectroscopy
-Etc.
5
6
Prerequisites of Bioinformatics
Scientific Mind
+
• Basic knowledge in Molecular Biology
– Prokaryotic and Eukaryotic cells
– Genes, Codons, DNA, RNA, Central dogma of biology
– Etc.
• Computing Skills
– Program Languages: Python, Perl, Java, etc.
– Knowledge in Relational Databases, etc.
• Other Skills
– General Statistical Knowledge
– Optimization Tools: Math Programming, Network
Optimization, etc.
7
Various Problems in Bioinformatics
Standard Problems
• DNA and Protein Sequence Analysis
• Gene Finding and Prediction
• Etc.
Emerging Problems
•
•
•
•
Microarray Experiment and Data Analysis
Protein Structure Prediction
Deduction of Metabolic Pathways
And more…
8
Outline
• Introduction to Bioinformatics
• Paradigm Shift in Biology
• Systems Engineering/Operations Research
for Bioinformatics
• About Funding Opportunities
• Concluding Remarks
9
Paradigm Shift in Biology
• The Human Genome Project (HGP)
– Working Draft of the human genome (2001)
– Goal of the HGP = sequencing of the human genome
– Hypothesis driven reductionism discovery science
approach
– Drive-forced the development of high throughput
technologies and computer applications to transmit,
analyze, and model very large size data sets
10
Paradigm Shift in Biology
• High-throughput Technologies
– Microarrays – allow the expression of
thousands of genes to be surveyed at
one time
– Protein Arrays – can examine all
proteins in a cell and check if they are
interacting under designed conditions
– Mass Spectrometry – The basic
modality is protein mass fingerprinting
11
Paradigm Shift in Biology
• Microarray Chip Technology
– Allows data collection in high-throughput manner
– Can put all genes in a microbe on a chip
genes
genes
samples
genes
“Raw Data”
Expression
Level
Similarity
Matrix
Or
Distance
Matrix
• Interpretation of the data is very challenging
253x15154 Microarray Gene
Expression Data: 162 cancer vs
91 normal patients
patients with or without
cancer
genes
12
13
Paradigm Shift in Biology
Genes and proteins
Protein-protein
interaction data
Gene activity data
Protein structure
data
Metabolite data
Black
box
Proteomic data
Regulatory
elements
Gigantic amount of biological information is hidden in
these data and their inter-data relationship!
14
Paradigm Shift in Biology
• Concept of Systems Biology
– The Reductionist paradigm has been
phenomenally successful in biology
since 1950’s
– Genomics era  exhaustive lists of
biological parts (i.e. genes and
proteins) together with their functional
characteristics
– A System-level perspective is required
to make sense of how all of these
individual parts emerge and act
collectively to perform a biological
function
15
Outline
• Introduction to Bioinformatics
• Paradigm Shift in Biology
• Systems Engineering/Operations Research tools
for Bioinformatics
• About Funding Opportunities
• Concluding Remarks
16
Systems Engineering/Operations Research tools
• Three Categories
Network & Optimization
Statistics
Stochastics
–MLE
–Hidden Markov Models –Combinatorial
–Regression
–MCMC
–Integer Programming
–Sampling
–Simulation Models
–Dynamic Programming
–Etc.
–Network Optimization
–Linear Model
–Cross Validation
–Minimum Spanning Tree
–Statistical Estimation and Test
–Etc.
–Multivariate Analysis (or ANOVA)
–Wavelet Transformation
–Bayesian Networks, Etc.
17
Systems Engineering tools for Bioinformatics
• Some Examples
•
•
•
•
Hidden Markov Model for Gene Finding
Dynamic Programming for Sequence Alignment
Integer Programming for Protein Folding
Minimum Spanning Tree approach to Clustering
for Motif Identification (Xu et al. (2001)
• And many more …
18
A Significant Biological Problem
• Identification of Transcription Factor Binding
Sites(Motifs)
•
•
A gene’s transcriptional level is regulated by proteins
(transcription factors), which bind to specific sites in the gene’s
promoter region, called binding sites
The binding-site identification problem is to find short
“conserved” fragments, from a set of genomic sequences
 Features of transcription factor binding site
1. These short DNA fragments in the upstream regions of genes
are generally very similar to each other
2. Relatively high frequencies compared to other sequence
fragments
19
Data Collection
• Data Set (D)= Set of All Short DNA fragments in
the upstream regions of genes
• Microarray gene expression technologies allow simultaneous
view of the transcription levels of many thousands of genes
under various cellular conditions
Upstream regions of
genes
GATCACCTGACATCAGGAGTTCAAGACCAGCCTGCCAACG
CCATCTCTACTAAAAATAGGAAATTCACCTGGTGGCAGGT
Experiment
CCAGCTACTCGGGAGGCTGAGGCAGAAGAATCGCTTGAAT
Find group
of genes
having
correlated
expression
profiles
GAGATTGCACTGAGCTGAGATCACGCCACTGCGCTCCAGC
GAGCAAGACTCCATAAAAAAAAAAATTATAACCTAATGAT
AGGGAAGAGCTTACCACAATTGCTGGCCCATGGCCAATGC
ACAGCTACTGCAAACAACCATGATGATGATACATCTCTTG
GGTTGTTTGAGACACATTCTATGCTCCTTGATTTGATTGG
GGTTCCTTGGGGACTTGGAGGTGACGAAAGCCTCCCTGGG
ACCTTCACTTCTCTAATATCAAGCTTCAGCAACCTGCTCC
CAGGGTTGGACAGGCCCAACAACAGAGGAAATCCACAAAG
CACATACATCCACGGGGTCTAACGAGGTGAGGCCAATGAC
CACCCCAGCCAGACTCTGACTTCACTCCCGGCAGGTTTCA
CAGCAGTTGGAGCGAGCTGGCTTCTTGCGGTAGGCAGCCA
GCTCCCAATAGTCCTCGTTTCCTGGTAATCTCATGCTTGG
20
• Some testing data sets are available on the
internet or in the literature
For example 
• CRP binding sites: 18 sequences with 105 BPs
• Yeast binding sites: 8 sequences with 1000 BPs
• Human binding sites: 113 sequences with 30 BPs
21
CRP binding sites: 18 sequences with 105 BPs
A
C
T
G
22
Theory & Methods
• Traditional approaches
•
•
•
•
Various sampling techniques including Gibbs sampling
EM Algorithm
Greedy Algorithm
Multi-Order Markov Chain Algorithm
All these are heuristic algorithms so this problem remains as a
challenging and unsolved problem
23
Brief Review: Minimum Spanning Tree
• Input = A graph, G = (V,E), with weighted edges
• Output = the cheapest subset of edges that keeps the
graph in one connected component
• Two Popular Algorithms
• Kruskal’s Algorithm
• Prim’s Algorithm
24
Theory & Methods
• Minimum Spanning Tree approach
• Step1: Define a distance measure () on the data set
(D), and compute distances b/w each pair of data
points (i.e., (A,B) for all A, B in D)
• Higher the sequence similarity b/w two fragments, smaller
the distance is b/w their mapped positions
25
Theory & Methods
• Minimum Spanning Tree approach
• Step2: Find the MST ,T, representing D with its edge
weight defined by  and treat it as a data clustering
problem
c1
c4
T
e1
e2
c2
e3
Remove three edges
e1,e2,e3
c3
4 Clusters, c1~c4, are identified
26
Evaluation of the MST Method
• Comparison with Other Methods
• MST is based on a combinatorial approach
 can identify all clusters of possible binding sites
• While existing heuristic methods are likely to miss
some clusters
• Implemented result is at least as good as results
by other methods
• While Simple structure of a tree facilitates
efficient implementations of rigorous algorithm
27
Outline
• Introduction to Bioinformatics
• Paradigm Shift in Biology
• Systems Engineering/Operations Research tools
for Bioinformatics
• About Funding Opportunities
• Concluding Remarks
Funding Overviews by Funding
Institutions(Top)/Field of Research(Bot)
US DA
3%
A ll other
a gencies
7%
28
Total of $54.1 billion in FY2004
NS F
7%
NA S A 2
10%
DOE
10%
5%
HHS
52%
DoD
11%
2%
2%
3%
Environmental7%
science
10%
Physical science
Life science
Engineering
$9.1 billion
54%
$29.3 billion
17%
Percentage of Total Federal Funding: Preliminary 2004 Statistics
Source: National Science Foundation/Division of Science Resources Statistics, Survey of Federal Funds for Research
29
How to Search for Funding Opportunities?
• NIH Computer Retrieval of Information on Scientific
Projects (CRISP)
• http://crisp.cit.nih.gov
• NIH Office of Extramural Research (OER)
• http://grants1.nih.gov
• Other Websites
• http://www.grants.gov
• http://fedgrants.gov
• http://www.nsf.gov/pubsys/ods/index.html
30
Growing Opportunities in Bioinformatics
From CRISP Search Data
N e w Gra nts for Bioinform a tic s
350
300
# N ew Gr ants
300
250
200
170
191
150
100
50
0
2002
2003
Fis ca l Y e a r
2004
31
NIH Funded Projects in 2004
From CRISP Search Data
• Searched all Related Institutes, Centers, and States for
the 2004 Fiscal Year
# NIH Grants in Bioinformatics, 826
Microarray, 214 grants
Cancer,63 grants
Systems Biology, 80 grants
32
NIH Funding Opportunities for 2004 ~
From http://grants1.nih.gov
• 2004 Program Announcement (PA)
•
•
•
•
Total 171 PAs
Larger variety of topics
Cancer most prevalent topic
Many wish to have “multidisciplinary” outlook on topics
• 2005 Requests For Application (RFA)
•
•
•
•
Total 68 RFAs
Although listed for 2005, some application deadlines have passed
2 directly related to bioinformatics
Cancer still most prevalent topic
33
Outline
• Introduction to Bioinformatics
• Paradigm Shift in Biology
• Systems Engineering/Operations Research
for Bioinformatics
• About Funding Opportunities
• Conclusions
34
Developing Potential Research Plans
• Two Takeaways
1. Systems Engineers/Operations
Research Society already have tools to
solve various bioinformatics problems
2. Moneys are there to support your
research
Then, what do we need to start?
Biological Problems to solve
35
Concluding Remarks!!
• The main driving force of
bioinformatics/computational biology is the highthroughput data production
• I.E. tools together with computing power can
play an important role in this process
• Funding opportunities in this area are very
rich
36
Thank you!
Any Questions?
37
Level of Organization and Related Field of
Study
38
39
Central Dogma of Biology
DNA
RNA
Transcription
Protein
Translation
example
TTG CTG CGG
Transcription
Translation
UUG CUG CGG
Leu Leu Arg
40
Transcription and Translation
41
Gene
• A gene is a region of DNA that controls a hereditary characteristic,
usually corresponding to a single mRNA carrying the information for
constructing a protein.
• The human genome contains about 30,000 genes. (February 2001)
42
Introns and Exons
43
44
Pair-wise Sequence Alignment
VLSPADKTNVKAAWAKVGAHAAGHG
||| |
|
|||| |
||||
VLSEAEWQLVLHVWAKVEADVAGHG
45
Sequence Alignment
 Purposes:
 Learn about evolutionary relationships
 Finding genes, domains, signals …
 Classify protein families (function, structure).
 Identify common domains (function, structure).
46
Multiple Sequence Alignment
47
Scoring Systems for Alignment
Simple case
actaccagttcatttgatacttctcaaa
Sequence 1
taccattaccgtgttaactgaaaggacttaaagact
Sequence 2
Scoring
matrix
A
G
C
T
A
1
0
0
0
G
0
1
0
0
C
0
0
1
0
T
0
0
0
1
Match: 1
Mismatch: 0
Score = 5
DNA
48
Scoring Systems for Alignment
Complex case
Sequence 1
PTHPLASKTQILPEDLASEDLTI
Sequence 2
PTHPLAGERAIGLARLAEEDFGM
Scoring
matrix
C
C
S
T
P
A
G
N
9
S -1
4
T
-1
1
5
P -3
-1
-1
7
A
0
1
0
-1
4
G -3
0
-2
-2
0
6
N -3
1
0
-2
-2
0
5
D -3
0
-1
-1
-2
-1
1
.
D
6
.
.
T:G
= -2
T:T
= 5
Score = 48
Protein
49
Protein Structure
Alpha-helices
Quaternary
Beta-sheets
Tertiary
Diitrogenase as an example
50
Public Databases
• Big 3 Centers
National Center for Biotechnology
Information
EBI
DNA Database Bank of Japan
51
The Human Genome
• 23 pairs of chromosomes comprise the human genome.
• The human genome contains 3,164.7 million (or 3 Billion) nucleotide
base.
• The average gene consists of 3,000 bases, but sizes vary greatly,
with the largest known human gene being dystrophin at 2.4 million
bases.
• The total number of genes is estimated at 30,000 to 40,000
• The total number of protein variant is estimated as 1 Million.
52
Various Fields in Biology
DNA
Genomics
RNA
Transcriptomics
Proteins
Proteomics
Metabolites
Metabolomics
53
Trends in Molecular Biology
Reverse Genetics
Functional Genomics
Function (Mutation)
Gene
Gene
Genome Project
High Throughput Tech
Function
Genome
Genomics
Structural
Genomics
Functional
Genomics
54
DNA & Bases
A (Adenine), G (Guanine), C (Cytosine), T(Thymine)