Transcript Slides

Bioinformatics
Predrag Radivojac
INDIANA UNIVERSITY
Basics of Molecular Biology
Eukaryotic cell
Can we understand how cells function?
Bioinformatics is multidisciplinary!
• What is Bioinformatics?
– Integrates: computer science,
statistics, chemistry, physics,
and molecular biology
– Goal: organize and store huge
amounts of biological data and
extract knowledge from it
• Major areas of research
– Genomics
– Proteomics
– Databases
• Practical discipline
Some major applications
• Drug design
• Evolutionary studies
• Genome characterization
Interesting Problems
• Sequence alignment
Interesting Problems
• Sequence assembly
Goal: solve the puzzle, i.e., connect the pieces into one genomic sequence
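The "puzzle" can be illustrated with a toy greedy assembler that repeatedly merges the two pieces with the longest suffix-prefix overlap. This is a minimal sketch, not the method used by any particular assembler: the reads are made up, the function names and the `min_len` parameter are hypothetical, and the reads are assumed to be error-free, from one strand, and not contained in one another.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Toy greedy assembly: keep merging the pair with the largest overlap.

    Returns the contigs that remain once no overlap >= min_len is left.
    """
    contigs = sorted(set(reads))  # sorted only to make the run deterministic
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Made-up reads that tile a 19-letter toy "genome".
reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))  # ['ATTAGACCTGCCGGAATAC']
```

Greedy merging can fail on repeats, which is one reason real assembly is hard.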
Interesting Problems
• Proteomics
Mass spectrometry
[Figure: mass spectrum, S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6, T: + c d Full ms2 638.00 [165.00 - 1925.00]; x-axis: m/z (200 to 2000), y-axis: Relative Abundance (0 to 100); labeled peaks at m/z 226.9, 326.0, 397.1, 425.0, 489.1, 524.9, 588.1, 589.2, 629.0, 687.3, 850.3, 851.4, 949.4, 1048.6, 1049.6]
Interesting Problems
• Microarray data
Interesting Problems
• Functional Genomics
• Gene Regulation
Diseases are interconnected…
Goh et al. PNAS, 104: 8685 (2007).
Disease
• Development of tools that can be used to understand and treat human disease
• Prediction of disease-associated genes
• Important from
– biological standpoint
– medical standpoint
– computational standpoint
• Background
– human genome
– low-throughput data
– high-throughput data
– ontologies for protein function at multiple levels
The Time is Right!
www.cancer.gov
Alzheimer’s disease
Top PhenoPred hits:
1) CDK5
2) NTN1
AUC = 77.5%
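For readers unfamiliar with AUC (area under the ROC curve): it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The scores are invented for illustration and have nothing to do with the PhenoPred result quoted on the slide; the function name is hypothetical.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. Quadratic in the number of examples, which is
    fine for a small illustration.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Invented prediction scores: 3 disease genes, 4 non-disease genes.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2, 0.1]
print(round(auc(positives, negatives), 4))  # 0.9167 (11 of 12 pairs ranked correctly)
```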
Loss/Gain of function and disease
E6V
4hhb
2hbs
Sickle Cell Disease:
Autosomal recessive disorder
E6V in HBB causes interaction with F85 and L88
Formation of amyloid fibrils
Abnormally shaped red blood cells lead to sickle cell anemia
Manifestation of the disease differs vastly across patients
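The notation E6V reads "glutamate (E) at position 6 replaced by valine (V)". A minimal sketch of applying such a substitution to a sequence string follows. It assumes 1-based numbering of the mature beta-globin chain (initiator methionine removed), and uses only the first eight residues of HBB; the function name is hypothetical.

```python
import re


def apply_substitution(seq, mutation):
    """Apply a single amino acid substitution such as 'E6V' to `seq`.

    Positions are 1-based. Raises ValueError if the notation is malformed
    or the wild-type residue does not match the sequence.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(seq):
        raise ValueError(f"position {position} is outside the sequence")
    if seq[position - 1] != wild_type:
        raise ValueError(f"expected {wild_type} at {position}, found {seq[position - 1]}")
    return seq[:position - 1] + mutant + seq[position:]


# First 8 residues of mature human beta-globin (assumption: Met-removed numbering).
hbb_start = "VHLTPEEK"
print(apply_substitution(hbb_start, "E6V"))  # VHLTPVEK
```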
Pauling et al. Science 110: 543 (1949). Chui & Dover. Curr Opin Pediatr, 13: 22 (2001).
http://gingi.uchicago.edu/hbs2.html
Lipitor (ATORVASTATIN)
E6V
Proteins = chains of amino acids
• biomolecule, macromolecule
– more than 50% of the dry
weight of cells is proteins
• polymer of amino acids
connected into linear chains
• strings of symbols
• machinery of life
– play central role in the
structure and function of
cells
– regulate and execute many
biological functions
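Treating a protein as a string of symbols makes simple statistics easy to compute. As a small illustration, the sketch below computes amino acid composition over the 20-letter alphabet. The function name is hypothetical and the example sequence is only a short fragment (the first eight residues of mature human beta-globin).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids


def composition(seq):
    """Fraction of each of the 20 standard amino acids in `seq`."""
    counts = Counter(seq)
    unknown = set(counts) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard symbols: {sorted(unknown)}")
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


fractions = composition("VHLTPEEK")
print(fractions["E"])  # 0.25 (2 of 8 residues)
```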
a) amino acid b) amino acid chain
Introduction to Protein Structure by Branden and Tooze
Protein structure
• peptide bonds are planar and strong
• by rotating about the backbone bonds on either side of each amino acid's α-carbon (the φ and ψ angles), proteins adopt structure
Introduction to Protein Structure by Branden and Tooze
Protein function
• Multi-level phenomenon
– biochemical function
– biological function
– phenotypical function
• Example: kinase
– biochemical function –
transferase
– biological function – cell
cycle regulation
– phenotypical function –
disease
• Function is everything that
happens to or through a
protein (Rost et al. 2003)
Protein contact graph
C- C< 6A
Myoglobin 1.4A X-ray PDB: 2jho 153 residues
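A contact graph of this kind can be built by connecting every pair of residues whose representative atoms lie closer than a cutoff. The sketch below assumes the slide's 6 Å threshold applies to distances between Cα atoms, one per residue. The coordinates are invented, not taken from 2jho, and the function name is hypothetical.

```python
import math
from itertools import combinations


def contact_graph(ca_coords, cutoff=6.0):
    """Adjacency sets of the contact graph.

    ca_coords: list of (x, y, z) tuples in angstroms, one per residue
               (assumed to be C-alpha positions).
    Two residues are linked if their distance is strictly below `cutoff`.
    """
    adjacency = {i: set() for i in range(len(ca_coords))}
    for i, j in combinations(range(len(ca_coords)), 2):
        if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


# Four invented positions; residues 0 and 2 are 7.6 angstroms apart (no contact).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.8, 0.0)]
print(contact_graph(coords))  # {0: {1, 3}, 1: {0, 2, 3}, 2: {1, 3}, 3: {0, 1, 2}}
```

Real coordinates would come from a PDB file; parsing is left out to keep the sketch self-contained.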
Residue neighborhood
Notation:
S113 of isocitrate dehydrogenase
G = (V, E)
f: V → A
g: V → {−1, +1}
A = {A, C, D, …, W, Y}
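The notation maps onto plain data structures: an adjacency map for G = (V, E), one dictionary for the amino acid labels f, and one for the class labels g. The sketch below is an illustration under assumptions the slide does not state: g is read as a positive/negative class label per residue, and the neighborhood of a pivot residue is taken to be all vertices within a hypothetical `radius` number of edges.

```python
from collections import deque

ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def check_labeled_graph(adj, f, g):
    """Validate G = (V, E) with f: V -> ALPHABET and g: V -> {-1, +1}."""
    for v, neighbours in adj.items():
        if f[v] not in ALPHABET:
            raise ValueError(f"vertex {v}: label {f[v]!r} not in alphabet")
        if g[v] not in (-1, 1):
            raise ValueError(f"vertex {v}: class label must be -1 or +1")
        for u in neighbours:
            if v not in adj[u]:
                raise ValueError(f"edge {v}-{u} is not symmetric")


def neighbourhood(adj, pivot, radius):
    """Breadth-first distances of all vertices within `radius` edges of `pivot`."""
    dist = {pivot: 0}
    queue = deque([pivot])
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue  # do not expand beyond the radius
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


# Invented 4-residue path graph 0-1-2-3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
f = {0: "S", 1: "A", 2: "G", 3: "K"}
g = {0: 1, 1: -1, 2: -1, 3: -1}
check_labeled_graph(adj, f, g)
print(neighbourhood(adj, 0, 2))  # {0: 0, 1: 1, 2: 2}
```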
Graphlets are small non-isomorphic connected graphs.
Different positions of the pivot vertex with respect to the graphlet correspond to the graph-theoretical concept of automorphism orbits, or simply orbits.
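For graphs this small, automorphism orbits can be found by brute force: try every permutation of the vertices, keep those that map the edge set onto itself, and group vertices that some automorphism maps onto each other. The sketch below does exactly that. It is only practical for a handful of vertices, and the function name is hypothetical.

```python
from itertools import permutations


def automorphism_orbits(n, edges):
    """Orbits of the vertices 0..n-1 under the automorphisms of the graph.

    Brute force over all n! permutations, so only suitable for tiny graphs.
    """
    edge_set = {frozenset(e) for e in edges}
    automorphisms = [
        p for p in permutations(range(n))
        if {frozenset((p[u], p[v])) for u, v in edges} == edge_set
    ]
    # The automorphisms form a group, so the orbit of v is the set of its images.
    orbits = {frozenset(p[v] for p in automorphisms) for v in range(n)}
    return sorted(sorted(orbit) for orbit in orbits)


print(automorphism_orbits(3, [(0, 1), (1, 2)]))          # path: [[0, 2], [1]]
print(automorphism_orbits(3, [(0, 1), (1, 2), (0, 2)]))  # triangle: [[0, 1, 2]]
```

The 3-vertex path has two orbits (end and middle) while the triangle has one, so the pivot can sit in three distinct positions across the 3-graphlets.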
Przulj et al. Bioinformatics 20: 3508 (2004).
Results
Key insight: efficient combinatorial enumeration of graphlets / orbits over 7 disjoint cases
breadth-first search
2-graphlets: 01
3-graphlets: 011, 012
4-graphlets: 0111, 0112, 0122, 0123
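The strings listed for 2-, 3- and 4-graphlets can be read as the sorted breadth-first distances of each vertex from the pivot; that reading is an assumption, not something the slide states. Under it, the sketch below enumerates every connected graph on 2 to 4 vertices with the pivot at vertex 0 and collects the distinct distance strings, which come to seven in total. Function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(n, edges, source=0):
    """Breadth-first distances from `source`; unreachable vertices are absent."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def bfs_cases(n):
    """Distinct sorted distance strings over all connected n-vertex graphs rooted at 0."""
    pairs = list(combinations(range(n), 2))
    cases = set()
    for size in range(n - 1, len(pairs) + 1):  # a connected graph needs >= n-1 edges
        for edges in combinations(pairs, size):
            dist = bfs_distances(n, edges)
            if len(dist) == n:  # every vertex reached, so the graph is connected
                cases.add("".join(str(d) for d in sorted(dist.values())))
    return sorted(cases)


for n in (2, 3, 4):
    print(n, bfs_cases(n))
# 2 ['01']
# 3 ['011', '012']
# 4 ['0111', '0112', '0122', '0123']
```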
A C D E F G H I K … Y AA AC AD …
o1: |A|
o2: |A|^2
o5, o6, o11: |A|^3
o3, o4: ?
A = {0, 1}: 00, 01 = 10, 11 (3)
A = {0, 1, 2}: 00, 11, 22, 01 = 10, 02 = 20, 12 = 21 (6)
binomial (multinomial) coefficients
|A| = 20, dimensionality = 1,062,420
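One way to arrive at the stated dimensionality is to count, for each orbit, the pivot label times the label choices for the remaining vertices, using a multiset coefficient C(|A| + k − 1, k) for every group of k vertices that are interchangeable once the pivot is fixed. The orbit list below is a reconstruction, not taken from the slide: it covers graphlets of 1 to 4 vertices and uses descriptive names instead of orbit numbers. Under that assumption the count for 20 labels matches the number on the slide.

```python
from math import comb


def multisets(n_labels, k):
    """Number of unordered selections of k labels, repetition allowed."""
    return comb(n_labels + k - 1, k)


# (orbit description, sizes of the groups of interchangeable non-pivot vertices)
ORBITS = [
    ("single vertex", ()),
    ("edge", (1,)),
    ("3-path, end", (1, 1)),
    ("3-path, middle", (2,)),
    ("triangle", (2,)),
    ("4-path, end", (1, 1, 1)),
    ("4-path, inner", (1, 1, 1)),
    ("3-star, leaf", (1, 2)),
    ("3-star, centre", (3,)),
    ("4-cycle", (2, 1)),
    ("tailed triangle, tail end", (1, 2)),
    ("tailed triangle, triangle vertex away from tail", (1, 1, 1)),
    ("tailed triangle, degree-3 vertex", (1, 2)),
    ("diamond, degree-2 vertex", (2, 1)),
    ("diamond, degree-3 vertex", (1, 2)),
    ("4-clique", (3,)),
]


def dimensionality(n_labels):
    """Total number of labeled orbits over all graphlets in ORBITS."""
    total = 0
    for _, groups in ORBITS:
        count = n_labels  # label of the pivot vertex
        for k in groups:
            count *= multisets(n_labels, k)
        total += count
    return total


print(multisets(2, 2), multisets(3, 2))  # 3 6, as in the {0, 1} and {0, 1, 2} examples
print(dimensionality(20))                # 1062420
```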
Graphlet kernel
Inner product between vectors of counts of labeled orbits:
K(x, y) = ⟨φ(x), φ(y)⟩ = Σ_i φ_i(x) φ_i(y)
where φ_i(x) is the number of times labeled orbit i occurs in the graph
[Figure: two count vectors, each indexed by labeled orbits A, C, D, E, F, G, H, I, K, …, Y, AA, AC, AD, …]
K is a kernel because matrices of inner products are symmetric and positive definite (proof due to David Haussler).
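A minimal sketch of the kernel as an inner product of sparse count vectors follows. To stay short it counts only the labeled orbits of 1- and 2-graphlets around a pivot (keys such as "S" and "SA"), whereas the slides count graphlets of up to 4 vertices. The graphs and function names are invented for illustration.

```python
from collections import Counter


def labeled_orbit_counts(adj, labels, pivot):
    """Counts of labeled 1- and 2-graphlet orbits rooted at `pivot`.

    Larger graphlets are deliberately omitted in this sketch.
    """
    counts = Counter()
    counts[labels[pivot]] += 1                   # 1-graphlet: the pivot alone
    for u in adj[pivot]:
        counts[labels[pivot] + labels[u]] += 1   # 2-graphlet: pivot plus one neighbour
    return counts


def graphlet_kernel(phi_x, phi_y):
    """Inner product of two sparse count vectors."""
    return sum(c * phi_y.get(key, 0) for key, c in phi_x.items())


# Two invented neighbourhoods, both centred on a serine (vertex 0).
adj_x = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
lab_x = {0: "S", 1: "A", 2: "A", 3: "G"}
adj_y = {0: {1, 2}, 1: {0}, 2: {0}}
lab_y = {0: "S", 1: "A", 2: "K"}

phi_x = labeled_orbit_counts(adj_x, lab_x, 0)  # S: 1, SA: 2, SG: 1
phi_y = labeled_orbit_counts(adj_y, lab_y, 0)  # S: 1, SA: 1, SK: 1
print(graphlet_kernel(phi_x, phi_y))  # 3  (1*1 for S plus 2*1 for SA)
```

Storing only the nonzero counts matters here, since the full vector has over a million dimensions while a single neighborhood touches few of them.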