CAP5510 - Bioinformatics - Department of Computer and

CIS 4930/6930 – Recent
Advances in Bioinformatics
Spring 2014
Tamer Kahveci
CISE Department
University of Florida
1
Vital Information
• Instructor: Tamer Kahveci
• Office: E566
• Time: Mon/Wed/Fri 9:35 - 10:25 AM
• Office hours: Mon/Thu 2:00-2:50 PM
• Course page:
– http://www.cise.ufl.edu/~tamer/teaching/spring2014
2
Goals
• This course discusses cutting-edge
developments in bioinformatics and
computational biology. We will discuss
recent publications in depth, with emphasis
on computer science challenges and
contributions, particularly on biological
networks.
3
Bioinformatics & Systems Biology
• Bioinformatics is the
science in which
computational and
information sciences are
used to understand
biological data.
• Systems biology studies
the interactions between
the components of
biological systems, and
how these interactions
give rise to the function
and behavior of that
system.
4
This Course will
• Give you exposure to research topics in
bioinformatics.
• Strongly encourage you to explore research
problems and make contributions.
5
This Course will not
• Teach you biology or fundamentals of
bioinformatics.
• Teach you programming.
• Teach you how to be an expert user of
off-the-shelf molecular biology computer
packages.
6
Course Outline
• Introduction to terminology
• Biological networks
• Comparison of biological networks
• Network motifs
• Essentiality in networks
• Network reconstruction
7
Grading
How can I get an A?
• Components: paper presentations, project, HW & quizzes
• 90+ = A- & above
• 80+ = B & above
• 70+ = C & above
Bonus
• 2.5% attendance
• 2.5% project contribution
8
Expectations
• Required
– Data structures and algorithms
– Coding (C, Java)
• Encouraged
– Actively participating in classroom discussions
– Reading the bioinformatics literature in general
– Attending colloquia on campus
• Academic honesty
9
Text Book
• Not required, but recommended.
• Class notes + papers.
10
Where to Look?
• Journals
– Bioinformatics
– Genome Research
– PLOS Computational Biology
– Journal of Computational Biology
– IEEE/ACM Transactions on Computational Biology and Bioinformatics
• Conferences
– RECOMB
– ISMB
– ECCB
– PSB
– BCB
11
A Gentle Introduction to
Molecular Biology
12
Goals
• Understand major components of
biological data
– DNA, protein sequences, expression arrays,
protein structures
• Get familiar with basic terminology
• Learn commonly used data formats
13
Genetic Material: DNA
• Deoxyribonucleic
Acid, 1950s
– Basis of inheritance
– Eye color, hair color,
…
• 4 nucleotides
– A, C, G, T
14
Chemical Structure of Nucleotides
Pyrimidines
Purines
15
Making of Long Chains
5’ -> 3’
16
DNA structure
• Double stranded,
helix (Watson &
Crick)
• Complementary
– A-T
– G-C
• Antiparallel
– 5’ -> 3’ (downstream)
– 3’ -> 5’ (upstream)
• Animation (ch3.1)
17
Base Pairs
18
Question
• 5’ – GTTACA – 3’
• 5’ – XXXXXX – 3’ ?
• 5’ – TGTAAC – 3’
• Reverse complements.
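
The answer to the question above can be checked with a minimal Python sketch. It assumes the input uses only the letters A, C, G, T; the function name `reverse_complement` is illustrative and does not come from the slides.

```python
# Watson-Crick pairing from the DNA structure slide: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of a 5'->3' DNA string, also read 5'->3'."""
    # Reversing accounts for the antiparallel strands; the lookup pairs the bases.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("GTTACA"))  # TGTAAC
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.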
19
Repetitive DNA
• Tandem repeats: highly repetitive
– Satellites (100 k – 1 Gbp) / (a few hundred bp)
– Mini satellites (1 k – 20 kbp) / (9 – 80 bp)
– Micro satellites (< 150 bp) / (1 – 6 bp)
– DNA fingerprinting
• Interspersed repeats: moderately repetitive
– LINE
– SINE
• Proteins contain repetitive patterns too
20
Genetic Material: an Analogy
• Nucleotide => letter
• Gene => sentence
• Contig => chapter
• Chromosome => book
– Traits: gender, hair/eye color, …
– Disorders: Down syndrome, Turner syndrome, …
– Chromosome number varies between species
– We have 46 (23 + 23) chromosomes
• Complete genome => volumes of encyclopedia
• The Hershey & Chase experiment showed that DNA is the
genetic material. (ch14)
21
Functions of Genes 1/2
• Signal transduction: sensing a physical signal
and turning into a chemical signal
• Enzymatic catalysis: accelerating chemical
transformations otherwise too slow.
• Transport: getting things into and out of
separated compartments
– Animation (ch 5.2)
22
Functions of Genes 2/2
• Movement: contracting in order to pull
things together or push things apart.
• Transcription control: deciding when
other genes should be turned ON/OFF
– Animation (ch7)
• Structural support: creating the shape
and pliability of a cell or set of cells
23
Central Dogma
24
Introns and Exons 1/2
25
Introns and Exons 2/2
• Humans have about 25,000 genes, covering
about 40,000,000 DNA bases, which is < 3% of
the total DNA in the genome.
• The remaining 2,960,000,000 bases include
control information (e.g., when, where, and
for how long genes are expressed).
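
As a quick check of the "< 3%" figure, using only the two base counts given on this slide:

```latex
\[
\frac{4.0\times 10^{7}}{4.0\times 10^{7} + 2.96\times 10^{9}}
= \frac{4.0\times 10^{7}}{3.0\times 10^{9}}
\approx 0.0133 \approx 1.3\% < 3\%
\]
```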
26
DNA
(Genotype)
Protein
Gene expression
Phenotype
27
Gene Expression
• Building proteins from DNA
– Promoter sequence: start of a gene
– About 13 nucleotides.
• Positive regulation: proteins that bind to DNA
near promoter sequences increase
transcription.
• Negative regulation
28
Microarray
Animation on creating microarrays
29
Amino Acids
• 20 different amino acids
– ACDEFGHIKLMNPQRSTVWY
but not BJOUXZ
• ~300 amino acids in an average protein;
hundreds of thousands of known protein
sequences
• How many nucleotides are needed to encode one
amino acid?
– 4^2 < 20 < 4^3
– E.g., Q (glutamine) = CAG
– Degeneracy
– Triplet code (codon)
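
The counting argument above can be sketched in a few lines of Python. The helper name `min_codon_length` is hypothetical and used only for illustration.

```python
NUCLEOTIDES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes listed above


def min_codon_length(alphabet_size: int, targets: int) -> int:
    """Smallest k such that alphabet_size**k words can label all targets."""
    k = 1
    while alphabet_size ** k < targets:
        k += 1
    return k


k = min_codon_length(len(NUCLEOTIDES), len(AMINO_ACIDS))
print(k, 4 ** 2, 4 ** 3)  # 3 16 64
```

Two nucleotides give only 16 words, too few for 20 amino acids, while three give 64. Because 64 > 20, several codons can map to the same amino acid, which is the degeneracy mentioned above.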
30
Triplet Code
31
Molecular Structure of Amino Acid
Side Chain
C
•Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)
•Polar, Hydrophilic (S, T, C, Y, N, Q)
•Electrically charged (D, E, K, R, H)
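
The three side-chain groups above can be written as a simple lookup table. This is a sketch only: the group names mirror the slide, and the helper `class_counts` is a hypothetical name introduced for illustration.

```python
# Side-chain groups exactly as listed on the slide.
CLASSES = {
    "non-polar": "GAVLIMFWP",
    "polar": "STCYNQ",
    "charged": "DEKRH",
}
# Invert the table: one-letter code -> group name.
RESIDUE_CLASS = {aa: name for name, group in CLASSES.items() for aa in group}


def class_counts(seq: str) -> dict:
    """Count how many residues of a protein string fall into each group."""
    counts = {name: 0 for name in CLASSES}
    for aa in seq:
        counts[RESIDUE_CLASS[aa]] += 1
    return counts


print(class_counts("ERAGPV"))
```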
32
Peptide Bonds
33
Direction of Protein Sequence
Animation on protein synthesis (ch15)
34
Data Format
• GenBank
• EMBL (European Mol. Biol. Lab.)
• SwissProt
• FASTA
• NBRF (Nat. Biomedical Res. Foundation)
• Others
– IG, GCG, Codata, ASN, GDE, Plain ASCII
35
Primary Structure of Proteins
>2IC8:A|PDBID|CHAIN|SEQUENCE
ERAGPVTWVMMIACVVVFIAMQILG
DQEVMLWLAWPFDPTLKFEFWRYFT
HALMHFSLMHILFNLLWWWYLGGA
VEKRLGSGKLIVITLISALLSGYVQQK
FSGPWFGGLSGVVYALMGYVWLRGER
DPQSGIYLQRGLIIFALIWIVAGWFD
LFGMSMANGAHIAGLAVGLAMAFVD
SLNA
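
The record above is in FASTA format: a header line starting with ">" followed by sequence lines. A minimal parsing sketch in Python follows; the function name `parse_fasta` is illustrative, and the example uses only the first two sequence lines of the record to stay short.

```python
def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


example = """>2IC8:A|PDBID|CHAIN|SEQUENCE
ERAGPVTWVMMIACVVVFIAMQILG
DQEVMLWLAWPFDPTLKFEFWRYFT
"""
sequences = parse_fasta(example)
print(len(sequences["2IC8:A|PDBID|CHAIN|SEQUENCE"]))  # 50
```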
36
Secondary Structure: Alpha Helix
• 1.5 Å translation per residue
• 100 degree rotation per residue
• Phi ≈ -60 degrees
• Psi ≈ -45 degrees
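
A short worked calculation from the first two helix parameters above (100 degree rotation and 1.5 Å translation per residue):

```latex
\[
\frac{360^\circ}{100^\circ} = 3.6 \ \text{residues per turn},
\qquad
3.6 \times 1.5\ \text{\AA} = 5.4\ \text{\AA} \ \text{per turn}
\]
```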
37
Secondary Structure: Beta sheet
anti-parallel
Phi = -135
Psi = 135
parallel
38
Tertiary Structure
[Figure: backbone torsion angles phi1, psi1, phi2, ...; 2N angles in total]
39
Tertiary Structure
• 3-d structure of a polypeptide sequence
– interactions between non-local atoms
tertiary structure of
myoglobin
40
Ramachandran Plot
Sample pdb entry ( http://www.rcsb.org/pdb/ )
41
Quaternary Structure
• Arrangement of protein subunits
quaternary structure
of Cro
human hemoglobin
tetramer
42
Structure Summary
• 3-d structure determined by protein
sequence
• Prediction remains a challenge
• Diseases caused by misfolded proteins
– Mad cow disease
• Classification of protein structure
43
Systems biology
• A biological system is made up of components (e.g., proteins,
genes, compounds) that interact with and affect one another.
Together, these interactions carry out the functions of the system.
• Internal factors can alter the network.
– E.g., gene expression and regulation.
• External factors can alter the network.
– E.g., drugs, radiation, food, temperature, bacteria, and viruses.
• We develop quantitative mathematical models that can explain
how the interactions take place.
– E.g., Boolean, stochastic, ordinary differential equation, and
probabilistic models.
• We develop algorithmic methods to analyze the networks under
these models.
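
As one concrete instance of the Boolean models mentioned above, here is a minimal sketch of a synchronous Boolean network in Python. The three components A, B, C and their update rules are hypothetical and chosen only for illustration; they do not describe a real pathway.

```python
# Hypothetical rules: A is a constant input, A activates B, C inhibits B,
# and B activates C.
RULES = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"] and not s["C"],
    "C": lambda s: s["B"],
}


def step(state: dict) -> dict:
    """Synchronous update: every component reads the same previous state."""
    return {name: bool(rule(state)) for name, rule in RULES.items()}


def simulate(state: dict, n_steps: int) -> list:
    """Return the list of states visited, starting with the initial state."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory


for s in simulate({"A": True, "B": False, "C": False}, 4):
    print(s)
```

With A switched on, the negative feedback from C onto B makes the system cycle with period 4 instead of settling into a fixed state.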
44
Signal Transduction Networks
• Vertices are proteins.
• A directed edge from
vertex X to vertex Y if
X changes the activity
level of Y under
certain conditions
45
Transcription regulation networks
• Two types of vertices:
proteins (transcription
factors, or TF’s) and
genes
• Edges are directed
from TF’s to genes.
• An edge from TF X to
gene Y if X regulates the
transcription of Y
46
Post-transcription regulation
• Two types of vertices
– RNA binding proteins
– RNA
• Directed edge from
proteins to RNA
RNA binding protein
47
Metabolic networks 1/2
• Various
representations
– Vertices are
compounds and
directed edges
are biochemical
reactions
– Two types of
vertices, one for
compounds one
for reactions.
Directed edges
from one type
to the other.
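
The two representations above can be built from the same reaction list. The sketch below uses the first two steps of glycolysis as sample data; the reaction IDs `R1` and `R2` and the function names are illustrative assumptions, not identifiers from any database.

```python
# reaction ID -> (substrates, products)
REACTIONS = {
    "R1": (["glucose", "ATP"], ["G6P", "ADP"]),
    "R2": (["G6P"], ["F6P"]),
}


def compound_graph(reactions: dict) -> set:
    """Vertices are compounds; each directed edge is labeled by its reaction."""
    edges = set()
    for rid, (substrates, products) in reactions.items():
        for s in substrates:
            for p in products:
                edges.add((s, p, rid))
    return edges


def bipartite_graph(reactions: dict) -> set:
    """Two vertex types: edges run substrate -> reaction -> product."""
    edges = set()
    for rid, (substrates, products) in reactions.items():
        edges.update((s, rid) for s in substrates)
        edges.update((rid, p) for p in products)
    return edges


print(len(compound_graph(REACTIONS)), len(bipartite_graph(REACTIONS)))  # 5 6
```

The bipartite form keeps track of which substrates are consumed together in one reaction. The compound-only form loses that grouping unless the edge labels are consulted.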
48
Metabolic networks 2/2
• Reactions
– Catabolism: breaking
down large molecules,
for example to harvest
energy in cellular
respiration
– Anabolism: using
energy to construct
components of cells,
such as proteins and
nucleic acids
49
Protein-protein interaction (PPI)
network
• Vertices are proteins.
• An edge between two
vertices if the two
proteins interact (i.e.,
form a protein
complex).
• Undirected edges.
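
An undirected PPI network can be stored as a symmetric adjacency structure. A minimal Python sketch follows; the protein names P1 to P4 and the function `build_ppi` are hypothetical placeholders.

```python
from collections import defaultdict


def build_ppi(interactions) -> dict:
    """Store each undirected edge in both directions; duplicates collapse."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


# ("P2", "P1") repeats ("P1", "P2") because edge direction carries no meaning.
ppi = build_ppi([("P1", "P2"), ("P2", "P3"), ("P2", "P4"), ("P2", "P1")])
degree = {protein: len(partners) for protein, partners in ppi.items()}
print(degree["P2"])  # 3
```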
50
Gene expression network
• Vertices are genes.
• An edge between two
vertices implies that the
genes corresponding
to those two vertices
have similar
expression patterns
• Edges are undirected
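
One way to make "similar expression patterns" concrete is to threshold a correlation score. The sketch below uses Pearson correlation with a cutoff of 0.9. Both choices are assumptions made for illustration, and the gene names and expression values are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y) -> float:
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def coexpression_edges(profiles: dict, threshold: float = 0.9) -> set:
    """Undirected edge between every gene pair whose correlation >= threshold."""
    edges = set()
    for g1, g2 in combinations(sorted(profiles), 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            edges.add((g1, g2))
    return edges


profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
print(coexpression_edges(profiles))
```

Each pair is stored once in sorted order, which matches the undirected edges described above.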
51
Phylogenetic networks
• Two types of vertices
– Leaf nodes:
taxonomic units
(e.g., genes, proteins,
organisms)
– Internal nodes:
inferred ancestors
• Directed, acyclic
(often rooted tree)
• Edges from X to Y if X
can evolve into Y.
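
For the common case of a rooted tree, the structure above can be sketched as a parent-to-children map. The node names `root` and `ancestor1` are hypothetical labels for inferred ancestors, and the three species are sample leaves chosen for illustration.

```python
# Directed edges point from an inferred ancestor to its descendants.
CHILDREN = {
    "root": ["ancestor1", "mouse"],
    "ancestor1": ["human", "chimp"],
}


def leaves(children: dict, node: str = "root") -> list:
    """Collect the leaf nodes (taxonomic units) below a given node."""
    kids = children.get(node, [])
    if not kids:
        return [node]  # a node without children is a leaf
    result = []
    for kid in kids:
        result.extend(leaves(children, kid))
    return result


print(leaves(CHILDREN))  # ['human', 'chimp', 'mouse']
```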
52
Ecological networks
Zombie
Human
53
Some interaction network datasets
• KEGG: http://www.genome.jp/kegg/
• BioCyc: http://biocyc.org/
• MIPS: http://mips.helmholtz-muenchen.de/proj/ppi/
• DIP: http://dip.doe-mbi.ucla.edu/
• GRID: http://biodata.mshri.on.ca/grid/servlet/Index
• BIND: http://bind.ca/
• STRING: http://www.bork.embl-heidelberg.de/STRING/
• IntAct: http://www.ebi.ac.uk/intact/index.html
• MINT: http://cbm.bio.uniroma2.it/mint/
54