BIONF/BENG 203:
Functional Genomics
Sources of Functional Data
Lectures 1 and 2
Lecture TI 1
Trey Ideker
UCSD Department of Bioengineering
Grading
40% Problem Sets (best 4 of 5)
30% Midterm
30% Final Project
Outline of the course
Biological data sources (2)
Data preprocessing (6)
Unsupervised: Clustering, Inference
Supervised: Classification
Population Genetics and Linkage
Single Source
MultiSource (2)
Project Presentations (2)
FINAL PROJECT
Total of 17 lectures
Other lecture counts shown in the diagram: (3), (3), (1)
Functional Genomics Data
– Expression: mRNA, protein
– Molecular interactions: protein, mRNA, small molecules
– Knockout phenotypes: 1st, 2nd, higher orders
– SNP sequence (polymorphism) data
– Imaging data: sub-cellular localization, cell morphology
– Gene ontology
Dividing the data into two classes of information:
Biological Networks and Network States
1) Directly observe the network “wires” themselves
– Protein-protein interactions: two-hybrid system, coIP, protein antibody arrays (databases: BIND, DIP)
– Protein-DNA interactions: chromatin IP (databases: BIND, Transfac, SCPD)
– Other types not yet possible: e.g., protein-small molecule
2) Observe molecular states that result from the interaction wiring
– DNA/RNA gene expression: DNA microarrays, SAGE
– Protein levels, locations, and modifications: mass spectrometry, fluorescence microscopy, protein arrays
– Gross phenotypes: e.g., growth rates of single and double deletion strains
High-throughput methods for
measuring cellular states
Gene expression levels: RT-PCR, arrays
Protein levels, modifications: mass spec
Protein locations: fluorescent tagging
Metabolite levels: NMR and mass spec
Systematic phenotyping
The transcriptome and proteome
The transcriptome is the full complement of RNA molecules
produced by a genome
The proteome is the full complement of proteins enabled by the
transcriptome
DNA → RNA → protein
Genome → transcriptome → proteome
30,000 genes → ??? RNAs → ??? proteins?
For example, the Drosophila gene Dscam can generate roughly 40,000
distinct transcripts through alternative splicing.
What is the minimum number of exons that would be required?
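One way to read the question is sketched below. It assumes every alternative exon is independently either included or skipped, which is a simplification (Dscam itself uses clusters of mutually exclusive exons). Under that assumption k exons give 2^k possible transcripts, so we need the smallest k with 2^k ≥ 40,000. The function name is hypothetical.

```python
def min_binary_exons(n_transcripts: int) -> int:
    """Smallest k such that 2**k >= n_transcripts.

    Assumption: each alternative exon is independently kept or skipped,
    so k exons yield 2**k distinct transcripts.
    """
    k = 0
    while 2 ** k < n_transcripts:
        k += 1
    return k


if __name__ == "__main__":
    # 40,000 is the transcript count quoted on the slide.
    print(min_binary_exons(40_000))
```

Other splicing models (for example, choosing one exon from each of several clusters) give different minima, so the answer depends on the model chosen.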
Expression: High-throughput approaches
RNA
DNA Microarrays
cDNA / EST sequencing
RT-PCR
Differential display
SAGE
Massively parallel signature sequencing (MPSS)
Proteins
2D PAGE
Mass spectrometry
Gene expression arrays
They are really, really, really, really, really, really,
really, really, really, really, really, really, really
important
Microarrays
Monitors the level of each gene:
– Is it turned on or off in a particular biological condition?
– Is this on/off state different between two biological conditions?
A microarray is a rectangular grid of spots printed on a glass
microscope slide, where each spot contains DNA for a different gene
Two-color DNA
microarray design
Reverse
Transcription
cDNA-chip of brain glioblastoma
Types of microarrays
Spotted (cDNA)
– Robotic transfer of cDNA clones or PCR products
– Spotting on nylon membranes or glass slides coated with poly-lysine
Synthetic (oligo)
– Direct oligo synthesis on solid microarray substrate
– Uses photolithography (Affymetrix) or ink-jet printing (Agilent)
All configurations assume the DNA on the array is in excess of the
hybridized sample; thus the kinetics are linear and the spot intensity
reflects the amount of hybridized sample.
Labeling can be radioactive, fluorescent (one-color), or two-color
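Because spot intensity is taken to reflect the amount of hybridized sample, a two-color experiment is commonly summarized per spot as a log ratio of the two channels. The sketch below is illustrative only: the spot names, intensities, and the simple background subtraction are made-up assumptions, not values from the lecture.

```python
import math


def log2_ratio(red: float, green: float, background: float = 0.0) -> float:
    """log2(red / green) after subtracting a common background value."""
    r = red - background
    g = green - background
    if r <= 0 or g <= 0:
        raise ValueError("background-corrected intensities must be positive")
    return math.log2(r / g)


if __name__ == "__main__":
    # Hypothetical (red, green) intensities for three spots.
    spots = {"geneA": (2000.0, 500.0), "geneB": (300.0, 1200.0), "geneC": (800.0, 800.0)}
    for name, (red, green) in spots.items():
        print(name, round(log2_ratio(red, green), 2))
```

A positive value means more signal in the red channel, a negative value more in the green channel, and zero means no difference between the two conditions.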
Microarray Spotter
Affymetrix High Density Arrays
Microarrays (continued)
Imaging
– Radioactive 32P labeling: autoradiography or phosphorimager
– Fluorescent labeling: confocal microscope (invented by Marvin Minsky!!)
Feature density
– Nylon membrane macroarrays: 100-1000 features
– Glass slide spotted array: 5,000 features / cm²
– Synthesized arrays: 50,000 features / cm²
Microarray
confocal scanner
Collects sharply defined optical
sections from which 3D renderings can
be created
The key is spatial filtering to eliminate
out-of-focus light or glare in specimens
whose thickness exceeds the
immediate plane of focus.
Two lasers for excitation
Two color scan in less than 10 minutes
High resolution, 10 micron pixel size
cDNA / EST sequencing projects
cDNA = complementary or copy DNA
EST = Expressed Sequence Tag
The microarray could be described as a “closed system”
because information about RNAs is limited by the targets
available for hybridization. RNAs not represented on the
array are not interrogated.
Direct sequencing of cDNAs (yielding ESTs) overcomes
this problem by large-scale random sampling of sequences
from a whole-cell RNA extract
Statistical counting of distinct sequences provides an
estimate of expression level
Conversely, a cDNA library can be normalized to capture rare
messages
Requires large scale sequencing to get statistical
significance
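To get a feel for why large-scale sequencing is needed, here is a minimal sketch. It assumes clones are sampled independently and that a transcript makes up a fixed fraction f of the library. The chance of missing it in n clones is (1 - f)^n, so n must satisfy (1 - f)^n ≤ 1 - p to see it at least once with probability p. The function name and the example fractions are assumptions for illustration.

```python
import math


def clones_needed(fraction: float, prob: float = 0.95) -> int:
    """Clones to sequence so a transcript at `fraction` of the library
    is seen at least once with probability `prob` (independent sampling)."""
    if not (0 < fraction < 1 and 0 < prob < 1):
        raise ValueError("fraction and prob must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - prob) / math.log(1 - fraction))


if __name__ == "__main__":
    # Hypothetical abundances: 1 in 1,000 up to 1 in 1,000,000 messages.
    for f in (1e-3, 1e-4, 1e-5, 1e-6):
        print(f, clones_needed(f))
```

Rare messages dominate the cost: each tenfold drop in abundance needs roughly ten times more clones.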
cDNA / EST Sequencing:
Preparation of a cDNA
library in phage λ vector
SAGE Technology
Serial Analysis of Gene Expression
Takes idea of sequence sampling to the extreme
Generates short ESTs (9-14nt) which are joined into long
concatemers and then sequenced
4^9 = 262,144 possible 9-nt tags, ~5-fold the number of human genes
The count of each type of tag estimates RNA copy number
>50X more efficient than cDNA sequencing because many
RNAs are represented in a single sequencing run
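The tag-space arithmetic can be checked directly. With four bases, a tag of length n has 4^n possible sequences. This small sketch prints that number for the 9-14 nt range quoted above; the function name is hypothetical.

```python
def tag_space(tag_length: int) -> int:
    """Number of distinct DNA sequences of the given length (4 bases)."""
    return 4 ** tag_length


if __name__ == "__main__":
    for n in range(9, 15):
        print(n, tag_space(n))
```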
Steps to SAGE
Copy mRNA → ds cDNA using biotinylated oligo(dT)
Cleave with anchoring enzyme (AE), which cleaves
within ~250bp of poly-A tail at 3’ end.
Capture this segment on streptavidin beads
Ligate to linkers containing a type IIs restriction site;
the type IIs enzyme cleaves DNA 14 bp away from this site.
Ligate sequences to each other and PCR amplify
Cleave with AE to remove linkers
Concatenate, clone, and sequence
Velculescu et al., Science (1995)
WHY DI-TAGS?
Ditags are used to detect bias in the PCR amplification step.
The probability of any two tags being coupled in the same ditag is small.
Biased amplification can be detected as many ditags always having the
same 2 tags present.
(Diagram labels: linkers A and B, PrimerA, PrimerB)
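A minimal sketch of that check: since two given tags should rarely be coupled by chance, a ditag seen many times is suspicious and can be collapsed to one copy before tags are counted. The ditag strings below are invented, and the choice to flag anything seen more than once is an assumption, not a rule from the lecture.

```python
from collections import Counter


def repeated_ditags(ditags, max_copies=1):
    """Return {ditag: count} for ditags seen more than `max_copies` times."""
    counts = Counter(ditags)
    return {d: c for d, c in counts.items() if c > max_copies}


def collapse_ditags(ditags):
    """Keep one copy of each ditag, preserving first-seen order."""
    return list(dict.fromkeys(ditags))


if __name__ == "__main__":
    # Hypothetical ditags read from sequenced concatemers.
    observed = ["AAACCCGGGTTT", "ACGTACGTACGT", "AAACCCGGGTTT", "AAACCCGGGTTT"]
    print(repeated_ditags(observed))
    print(collapse_ditags(observed))
```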
SAGE (continued)
Example of a concatemer:
CATGACCCACGAGCAGGGTACGATGATACATGGAAACCTATGCACCTTGGGTAGCACATG
TAG1
TAG2
TAG3
TAG4
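The example concatemer can be taken apart in a few lines. This sketch assumes the CATG visible in the example is the anchoring-enzyme site separating ditags. It also assumes each ditag splits evenly into two tags, with the second tag read as the reverse complement. These are illustrative assumptions about layout and orientation, not a statement of the published protocol.

```python
CONCATEMER = "CATGACCCACGAGCAGGGTACGATGATACATGGAAACCTATGCACCTTGGGTAGCACATG"
ANCHOR_SITE = "CATG"  # assumed anchoring-enzyme site, as seen in the example
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def split_ditags(concatemer: str, site: str = ANCHOR_SITE) -> list:
    """Split on the anchoring site and drop empty end pieces."""
    return [piece for piece in concatemer.split(site) if piece]


def ditag_to_tags(ditag: str) -> tuple:
    """Assume two equal-length tags; the second lies on the opposite strand."""
    half = len(ditag) // 2
    return ditag[:half], reverse_complement(ditag[half:])


if __name__ == "__main__":
    for ditag in split_ditags(CONCATEMER):
        print(ditag, ditag_to_tags(ditag))
```

Applied to the example, this yields two ditags and therefore four tags, matching the TAG1 to TAG4 labels on the slide.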
Counting the tags:
Tag Sequence    Count
ATCTGAGTTC      1075
GCGCAGACTT      125
TCCCCGTACA      112
TAGGACGAGG      92
GCGATGGCGG      91
TAGCCCAGAT      83
GCCTTGTTTA      80
GCGATATTGT      66
TACGTTTCCA      66
TCCCGTACAT      66
TCCCTATTAA      66
GGATCACAAT      55
AAGGTTCTGG      54
CAGAACCGCG      50
GGACCGCCCC      48
Proteomics
SDS PAGE
2D PAGE
MS/MS
An example
SDS-PAGE
How many
proteins are in
a band?
Protein stains:
Silver
Copper
Coomassie Blue
2D-PAGE
Dimension 1: isoelectric focusing gel
Dimension 2: size
2D gel from macrophage phagosomes
Mass spectrometry
Mass spectrometers consist of three essential parts
– Ionization source: converts peptides into gas-phase ions (MALDI, ESI)
– Mass analyzer: separates ions by mass-to-charge (m/z) ratio (ion trap, time of flight, quadrupole)
– Ion detector: current over time indicates amount of signal at each m/z value
MS/MS Overview
A raw fragmentation spectrum
By calculating the molecular weight difference between ions of the same
type the sequence can be determined.
SEQUEST uses the fragmentation pattern to search through a complete
protein database to identify the sequence which best fits the pattern.
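The mass-difference idea can be illustrated with an idealized ladder. The sketch assumes a complete series of singly charged b-type ions with no noise and no modifications, which real spectra rarely provide. It uses standard monoisotopic residue masses. It is not how SEQUEST works (that is a database search); it only shows how consecutive differences spell out residues. Leucine and isoleucine have identical masses, so only L is listed. Function names are hypothetical.

```python
PROTON = 1.00728  # Da, added to give singly charged ions

# Monoisotopic residue masses in Da (I omitted: same mass as L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def b_ion_ladder(peptide: str) -> list:
    """Idealized m/z values of the b-ion series for an unmodified peptide."""
    ladder, total = [], PROTON
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def read_sequence(ladder: list, tol: float = 0.01) -> str:
    """Name the residue matching each consecutive mass difference."""
    sequence, previous = [], PROTON
    for mz in ladder:
        diff = mz - previous
        hits = [aa for aa, m in RESIDUE_MASS.items() if abs(m - diff) <= tol]
        sequence.append(hits[0] if len(hits) == 1 else "?")
        previous = mz
    return "".join(sequence)


if __name__ == "__main__":
    ladder = b_ion_ladder("EACDPLR")  # peptide shown on the ICAT slide
    print([round(m, 3) for m in ladder])
    print(read_sequence(ladder))
```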
Tandem Mass Spec (MS/MS)
Typical nanoelectrospray source
Isotope Coded Affinity Tags (ICAT)
Mass spec based method for measuring relative protein abundances
between two samples
ICAT Reagents:
Heavy reagent: d8-ICAT (X=deuterium)
Normal reagent: d0-ICAT (X=hydrogen)
Reagent structure: biotin tag, linker (d0 or d8, carrying the eight X positions), thiol-specific reactive group
Protein Quantification & Identification
via ICAT Strategy
Mixture 1 and Mixture 2: ICAT-labeled cysteines (light and heavy)
Combine and proteolyze (trypsin)
Affinity separation (avidin)
Quantitation: light and heavy peaks (m/z axis 550-580)
Peptide shown: NH2-EACDPLR-COOH (m/z axis 0-800)
ICAT Flash animation:
http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/ICAT/ICAT.html
Protein identification
ICAT continued
The heavy (blue) and light (gray) peptides are separated and
quantified to produce a ratio for each peptide – here, a single
peptide ratio is shown
Each peptide is subjected to CID fragmentation in the second MS
stage in order to identify it
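A rough sketch of the quantitation arithmetic follows. It assumes the heavy reagent differs from the light one by exactly eight deuterium-for-hydrogen substitutions per labeled cysteine, so the peak pair is spaced by about 8.05 Da divided by the charge. The peak intensities in the example are invented.

```python
MASS_H = 1.007825  # Da, hydrogen-1
MASS_D = 2.014102  # Da, deuterium


def icat_mass_shift(n_cys: int = 1) -> float:
    """Heavy-minus-light mass difference for n labeled cysteines (d8 vs d0)."""
    return 8 * (MASS_D - MASS_H) * n_cys


def heavy_mz(light_mz: float, charge: int, n_cys: int = 1) -> float:
    """Expected m/z of the heavy partner of a light peak at a given charge."""
    return light_mz + icat_mass_shift(n_cys) / charge


def heavy_to_light_ratio(heavy_intensity: float, light_intensity: float) -> float:
    """Relative abundance of the peptide between the two samples."""
    return heavy_intensity / light_intensity


if __name__ == "__main__":
    print(round(icat_mass_shift(), 4))
    print(round(heavy_mz(550.0, charge=2), 3))
    # Hypothetical peak intensities for one peptide pair.
    print(heavy_to_light_ratio(heavy_intensity=40.0, light_intensity=100.0))
```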
Metabolomic measurements
2D NMR or mass spectrometry
Currently not global and in less widespread use
than microarrays, but have tremendous potential
Gene knockout and RNAi libraries for model species
Example from yeast:
Replacement of yeast ORFs with kanMX gene flanked by unique oligo
barcodes (Yeast Deletion Project Consortium)
YFP tagging for protein localization
YFP is green, transmitted light is red
NIC96: nuclear pore
TUB1: tubulin cytoskeleton
HHF2: histone, nucleus
BNI4: bud neck
Images courtesy T. Davis lab
See also recent work by
Weissman and O’Shea labs at UCSF
Systematic phenotyping
Barcodes (UPTAG): CTAACTC, TCGCGCA, TCATAAT
Deletion strains: yfg2Δ, yfg3Δ, yfg1Δ, …
Rich media
Growth 6hrs in minimal media (how many doublings?)
Harvest and label genomic DNA
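The doublings question comes down to growth time divided by doubling time. The slide gives no doubling time, so the 90 and 120 minute values below are assumptions. The sketch also shows how a slower strain's share of the pool would shrink relative to a reference strain over the same period.

```python
def doublings(hours: float, doubling_time_min: float) -> float:
    """Number of population doublings in `hours` of exponential growth."""
    return hours * 60.0 / doubling_time_min


def relative_abundance(hours: float, strain_dt_min: float, reference_dt_min: float) -> float:
    """Fold change of a strain relative to a reference after co-culture."""
    return 2.0 ** (doublings(hours, strain_dt_min) - doublings(hours, reference_dt_min))


if __name__ == "__main__":
    # Assumed doubling times; the slide does not state them.
    print(doublings(6, 90))
    print(relative_abundance(6, strain_dt_min=120, reference_dt_min=90))
```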
Systematic phenotyping with a
barcode array
Ron Davis and friends…
These oligo barcodes are also spotted
on a DNA microarray
Growth time in minimal media:
– Red: 0 hours
– Green: 6 hours
Molecular Interactions
Among proteins, mRNA, small molecules, and so on…
Protein→DNA interactions (▲ Chromatin IP); gene levels (on/off) (▼ DNA microarray)
Protein-protein interactions (▲ Protein coIP); protein levels (present/absent) (▼ Mass spectrometry)
Biochemical reactions (▲ Not yet!!!); biochemical levels (▼ Metabolic flux measurements)
Like sequence data, protein interaction data are also growing exponentially…
(As are the false positives!!!)
EMBL Database Growth: total nucleotides (gigabases, axis 0-10), 1980-2000
DIP Database Growth: total interactions (axis 0-90,000), 2000-2005
High-throughput methods for
measuring interaction networks
2-hybrid
co-immunoprecipitation w/ mass spec
chIP-on-chip
systematic genetic analysis
Yeast two-hybrid method
Fields and Song
Detection of protein interactions with
antibody arrays
MacBeath and Schreiber
Kinase-target interactions
Mike Snyder and colleagues
Protein interactions by protein immunoprecipitation
followed by mass spectrometry
TEV = Tobacco Etch Virus proteolytic site
CBP = Calmodulin binding peptide
Protein A = IgG binding from Staphylococcus
Gavin / Cellzome
TAP purification
Image courtesy of
Bertrand Seraphin
ChIP-chip measurement of protein→DNA interactions
From Figure 1 of Simon et al. Cell 2001
Genetic interactions: synthetic lethals and suppressors
Genetic Interactions:
Widespread method used by
geneticists to discover
pathways in yeast, fly, and
worm
Implications for drug targeting
and drug development for
human disease
Thousands are now reported in
literature and systematic
studies
As with other types, the number
of known genetic interactions
is exponentially increasing…
Adapted from Tong et al., Science 2001
Most recorded genetic interactions are
synthetic lethal relationships
(Diagram: the four genotypes A B, A ΔB, ΔA B, and ΔA ΔB)
Adapted from Hartman, Garvik, and Hartwell, Science 2001
Synthetic-lethal protein interaction
(Diagram labels: A, B, ΔA, ΔB, X)
Suppressor protein interaction
(Diagram labels: A, B, ΔA, ΔB, X)
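The two interaction types can be told apart from the growth of the single and double mutants. The sketch below uses the usual definitions: synthetic lethal means both single mutants are viable but the double is not; suppression means a single mutant is inviable but the double grows. The fitness scale (wild type = 1.0) and the 0.2 viability cutoff are arbitrary assumptions for illustration.

```python
def classify_interaction(fitness_a: float, fitness_b: float, fitness_ab: float,
                         viable: float = 0.2) -> str:
    """Label a gene pair from single- and double-mutant fitness.

    Fitness is relative to wild type (1.0); `viable` is an assumed cutoff.
    """
    a_ok = fitness_a > viable
    b_ok = fitness_b > viable
    ab_ok = fitness_ab > viable
    if a_ok and b_ok and not ab_ok:
        return "synthetic lethal"
    if ab_ok and (not a_ok or not b_ok):
        return "suppression"
    return "no interaction called"


if __name__ == "__main__":
    # Hypothetical fitness values for deletions of A, B, and both.
    print(classify_interaction(1.0, 0.9, 0.0))
    print(classify_interaction(0.0, 1.0, 0.8))
    print(classify_interaction(1.0, 1.0, 1.0))
```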
Interpretation of genetic interactions (Guarente T.I.G. 1990)
Parallel Effects (Redundant or Additive)
Single A or B mutations typically abolish their biochemical activities
Sequential Effects (Additive)
Single A or B mutations typically reduce their biochemical activities
GOAL: Identify downstream physical pathways
(Diagram labels: a, A, B, w)