Chapter 18
Functional Genomics
•High-throughput analysis of gene function at the genome level
•Transcriptomics
•Reveal how genes work together to form metabolic, regulatory
and signaling pathways and networks
•Reveals co-expressed and co-regulated genes
•Insight into biological function of the whole genome
Expressed sequence tags (ESTs)
The protocol
•Libraries of cDNA clones are prepared by reverse transcription of
mRNA using oligo-d(T) primers
•Clones are randomly selected for sequencing of 200-400 nt from the 3’ or
5’ end
•Thus, if the mRNA of a gene was present, there is a likelihood that its
corresponding EST should be found
•The number of each EST gives an indication of the level of mRNA
(~expression level)
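The counting idea above can be sketched in a few lines. This is a minimal illustration, assuming each sequenced EST has already been assigned to a gene; the gene names and the function `est_abundance` are hypothetical and not part of any EST toolkit.

```python
from collections import Counter

def est_abundance(est_gene_ids):
    """Return each gene's share of the sequenced ESTs (a rough proxy for mRNA level)."""
    counts = Counter(est_gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

# Hypothetical gene assignments for six randomly selected clones.
ests = ["actin", "actin", "actin", "gapdh", "gapdh", "myc"]
for gene, share in est_abundance(ests).items():
    print(gene, round(share, 2))
```

Because clones are picked at random, the shares are only estimates, and rare transcripts may not be sampled at all.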
Drawbacks
•High error rates
•Vector sequence contamination
•Double inserts: vector_left-5’-gene_1-3’-5’-gene_2-3’-vector_right
•Highly expressed genes dominate the EST library
EST Index Construction
•Remove vector sequences (VecScreen)
•Clustering to associate ESTs with single, unique genes
•Derive consensus sequence to produce EST contig
•Use HMM to delineate coding region
•Annotate by database similarity search of translated sequence
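The consensus step can be illustrated with a simple majority vote per column. This is only a sketch: it assumes the ESTs of one cluster have already been aligned to equal length, with "-" marking positions a read does not cover. Real assemblers also weigh base quality, which is ignored here. The function name `consensus` is hypothetical.

```python
from collections import Counter

def consensus(aligned_ests):
    """Majority-vote consensus of equal-length, pre-aligned ESTs ('-' = not covered)."""
    length = len(aligned_ests[0])
    if any(len(seq) != length for seq in aligned_ests):
        raise ValueError("all aligned ESTs must have the same length")
    result = []
    for column in zip(*aligned_ests):
        bases = [b for b in column if b != "-"]
        # Columns covered by no read are reported as N.
        result.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return "".join(result)

# Three hypothetical overlapping reads; the third carries one sequencing error.
reads = ["ACGTAC--",
         "-CGTACGT",
         "ACGAACG-"]
print(consensus(reads))  # ACGTACGT
```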
UniGene
•http://www.ncbi.nlm.nih.gov/unigene/
•Database of overlapping, processed EST sequences
•Sequences imported from dbEST and GenBank
•Only EST sequences with 3’ ends are used (avoid double inserts)
•Vector sequences removed
•ESTs searched against database of known genes
•EST sequence corrected to that of known gene
•Partitioned into clusters and assembled into contigs
TIGR Gene Indices
•http://compbio.dfci.harvard.edu/tgi/
Serial Analysis of Gene Expression (SAGE)
SAGE
•The number of times that a sequence tag is present in the sequenced
pool is proportional to the level of the corresponding mRNA
•Tags typically 15bp in length
•Sequencing most costly step
•10,000 clones representing ~500,000 tags are typically sequenced
•Tag ID very sensitive to sequencing error due to short length
•One tag can map to more than one gene
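A small sketch of tag counting, including the ambiguity noted in the last bullet. The tags, gene names and the `tag_to_genes` lookup are made up for illustration; a real analysis would take the lookup from a reference such as SAGEmap.

```python
from collections import Counter

# Hypothetical 15 bp tags read from the sequenced concatemers.
observed_tags = [
    "CATGAAACCCGGGTT", "CATGAAACCCGGGTT", "CATGAAACCCGGGTT",
    "CATGTTTGGGCCCAA",
]

# Hypothetical tag-to-gene lookup; the first tag matches two genes.
tag_to_genes = {
    "CATGAAACCCGGGTT": {"geneA", "geneB"},
    "CATGTTTGGGCCCAA": {"geneC"},
}

def summarise(tags, lookup):
    """Return (tag, count, sorted candidate genes, ambiguous?) for every observed tag."""
    rows = []
    for tag, count in Counter(tags).items():
        genes = sorted(lookup.get(tag, set()))
        rows.append((tag, count, genes, len(genes) > 1))
    return rows

for row in summarise(observed_tags, tag_to_genes):
    print(row)
```

Counts for ambiguous tags cannot be attributed to a single gene without further evidence.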
SAGEmap
•http://www.ncbi.nlm.nih.gov/SAGE/
•Use cDNA sequence to find corresponding SAGE tag and find
expression level
SAGExProfiler
•Allows subtraction of one SAGE library from another to view differences
•Allows identification of genes differentially expressed in control versus
experimental samples
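Library subtraction can be sketched as follows. This is an illustration of the idea only, not the method SAGExProfiler actually uses: counts are first scaled to tags per million so that libraries of different size are comparable, then the control is subtracted from the experiment. The function names and tag labels are hypothetical.

```python
def normalise(counts, scale=1_000_000):
    """Scale raw tag counts to tags per `scale` sequenced tags."""
    total = sum(counts.values())
    return {tag: n * scale / total for tag, n in counts.items()}

def subtract_libraries(experiment, control):
    """Return normalised experiment minus control for every tag seen in either library."""
    exp_n, ctl_n = normalise(experiment), normalise(control)
    tags = set(exp_n) | set(ctl_n)
    return {tag: exp_n.get(tag, 0.0) - ctl_n.get(tag, 0.0) for tag in tags}

# Hypothetical tag counts from two small libraries.
experiment = {"tagA": 30, "tagB": 70}
control = {"tagA": 10, "tagB": 90}
print(subtract_libraries(experiment, control))
```

Positive values indicate tags enriched in the experimental sample, negative values tags enriched in the control.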
Microarray analysis
•Primers (25-70nt) or cDNA spotted onto poly-lysine coated
microscope slide
•The spots are called probes that hybridize to fluorescently labeled
DNA samples
•Probes must be specific enough to minimize cross-hybridization
•Use BLAST to find unique regions in genes
•Use RepeatMasker to remove low-complexity regions (nonspecific)
•No stable internal secondary structures (use Mfold)
•All probes should have similar Tm and GC% ~55%
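The last criterion can be checked with a short script. This is a sketch under stated assumptions: GC% is computed exactly, but the melting temperature uses a common rough approximation, Tm = 64.9 + 41 (G + C - 16.4) / N, which ignores salt and probe concentration and is not necessarily what tools such as OligoWiz use. The tolerance defaults are arbitrary.

```python
def gc_percent(seq):
    """Percentage of G and C bases in the probe."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def approx_tm(seq):
    """Rough melting temperature in degrees C (assumed approximation, see text)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)

def uniform_probes(probes, target_gc=55.0, gc_tol=5.0, tm_tol=3.0):
    """Keep probes near the target GC% whose Tm is close to the median Tm of that set."""
    near_gc = [p for p in probes if abs(gc_percent(p) - target_gc) <= gc_tol]
    if not near_gc:
        return []
    tms = sorted(approx_tm(p) for p in near_gc)
    median_tm = tms[len(tms) // 2]
    return [p for p in near_gc if abs(approx_tm(p) - median_tm) <= tm_tol]

print(gc_percent("GGCCAATT"))  # 50.0
```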
OligoWiz
•http://www.cbs.dtu.dk/services/OligoWiz/
•Java client for oligo design
Microarray analysis
Image processing
•Extract total RNA or mRNA from tissues or cells
•Incorporate Cy3 (absorption: 550nm, emission 570nm) or Cy5
(absorption 650nm, emission 670nm) fluorescent dye into two cDNA
samples (control and experiment) during RT step
•Mix two samples and hybridize simultaneously to microarray slide
•Visualize hybridization of one cDNA set at 570nm and other at 670nm
•Subtract background
•Express as ratio experiment : control
•Log2 transform: 2:1 gives log2(2) = 1; 1:2 gives log2(0.5) = -1
•Initial “raw analysis” performed in GenePix
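The background subtraction and log2 transform can be written out as a small function. This is a minimal sketch with hypothetical intensity values; spots whose background-corrected signal is not positive are returned as `None`, which is one possible convention rather than what GenePix does.

```python
import math

def log2_ratio(exp_signal, exp_background, ctl_signal, ctl_background):
    """Background-corrected log2(experiment / control) for one spot."""
    exp = exp_signal - exp_background
    ctl = ctl_signal - ctl_background
    if exp <= 0 or ctl <= 0:
        return None  # signal not above background, ratio undefined
    return math.log2(exp / ctl)

print(log2_ratio(2100, 100, 1100, 100))  # 1.0  (2:1)
print(log2_ratio(600, 100, 1100, 100))   # -1.0 (1:2)
```

The transform makes up- and down-regulation symmetric around zero, so a two-fold increase and a two-fold decrease have the same magnitude.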
Data transformation and normalization
•Background-corrected ratios (experiment : control) are further
normalised to correct for technical errors and biases
•Non-linear relation in intensity-ratio plot (panel C) can be corrected by
Lowess normalisation (localised, weighted linear regression)
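The idea of Lowess normalisation can be sketched in plain Python: fit a local, tricube-weighted straight line around every spot in the intensity-ratio plot and subtract the fitted trend from the log ratio. This is a simplified illustration (no robustness iterations), and the function names, the `frac` default and the sample values are assumptions rather than the procedure of any particular package.

```python
def lowess_fit(x, y, frac=0.5):
    """Locally weighted linear regression; returns the fitted y at every x."""
    n = len(x)
    k = max(2, int(round(frac * n)))            # neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12               # bandwidth: distance to k-th nearest point
        # Tricube weights: 1 at x0, falling to 0 at the bandwidth.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalise(intensity, log_ratio, frac=0.5):
    """Subtract the intensity-dependent trend from the log ratios."""
    trend = lowess_fit(intensity, log_ratio, frac)
    return [m - t for m, t in zip(log_ratio, trend)]

# Hypothetical spots whose log ratio drifts upward with intensity.
intensity = [float(i) for i in range(10)]
log_ratio = [0.5 * a + 0.2 for a in intensity]
print([round(v, 6) for v in lowess_normalise(intensity, log_ratio)])
```

After normalisation the log ratios of unchanged genes should scatter around zero at every intensity.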
ArrayPlot
•http://transcriptome.ens.fr/arrayplot/
•Windows program that allows visualization, filtering and normalization
of microarray data
Statistical analysis to identify differentially expressed genes
Arbitrary 2-fold cut-off
Replicates
t-Test and ANOVA
Calculate 95% confidence level
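One way to apply a t-test to replicates is a one-sample test of the replicate log2 ratios against zero (no change). The sketch below assumes that design; the replicate values are hypothetical, and the table holds the standard two-sided 95% critical values of Student's t for up to five degrees of freedom only.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def one_sample_t(log_ratios):
    """t statistic for the mean replicate log2 ratio against zero."""
    n = len(log_ratios)
    sd = statistics.stdev(log_ratios)
    if sd == 0:
        raise ValueError("replicates are identical; t is undefined")
    return statistics.mean(log_ratios) / (sd / math.sqrt(n))

def is_differential(log_ratios):
    """True if the mean log2 ratio differs from zero at the 95% level."""
    df = len(log_ratios) - 1
    return abs(one_sample_t(log_ratios)) > T_CRIT_95[df]

print(is_differential([1.1, 0.9, 1.0, 1.2]))    # consistent ~2-fold induction
print(is_differential([0.5, -0.4, 0.1, -0.2]))  # noisy, centred on zero
```

Unlike the arbitrary 2-fold cut-off, this takes the spread between replicates into account.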
Microarray Data classification
Identify groups of genes with similar expression profiles
Partition data into groups on grounds of similarity
Distance measure
•Euclidean distance between genes X and Y under conditions i=1, 2, …, n
$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
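The Euclidean distance between two expression profiles can be computed directly from its definition. A minimal sketch, with hypothetical profile values:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same conditions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Euclidean distance is sensitive to the magnitude of expression, so two genes with the same profile shape but different amplitudes are considered far apart.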
Pearson correlation
$r(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - x_{av}}{sd_x}\right) \left(\frac{y_i - y_{av}}{sd_y}\right)$
•If profiles are identical, corr = +1, anti-correlated -1, no correlation 0
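The Pearson correlation can be sketched the same way. With the 1/n factor, the standard deviations are population standard deviations (`statistics.pstdev`); the profile values are hypothetical.

```python
import statistics

def pearson(x, y):
    """Pearson correlation of two expression profiles (undefined for flat profiles)."""
    n = len(x)
    x_av, y_av = statistics.mean(x), statistics.mean(y)
    sd_x, sd_y = statistics.pstdev(x), statistics.pstdev(y)
    return sum(((xi - x_av) / sd_x) * ((yi - y_av) / sd_y)
               for xi, yi in zip(x, y)) / n

print(round(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # 1.0
print(round(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]), 6))  # -1.0
```

Unlike Euclidean distance, the correlation depends only on the shape of the profile, not on its amplitude.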
Supervised and unsupervised classification
•Genes can be grouped by the distances between their expression
profiles
•Supervised – classification into a set of predefined categories
•Unsupervised – no predefined categories; define categories by data
similarities
•Genes within a category are more similar to each other than to genes in
other categories
•Co-regulated genes often reflect similar functions
•Functions of unknown genes may thus be assigned
•Clustering algorithms of two types
Agglomerative
•Find the two most similar data points, merge them, and repeat the
process of merging until all points are joined in a single cluster
Divisive
•Lump all data together and successively remove most distant data
points until all data points are resolved
Hierarchical clustering
•Uses agglomerative approach to construct relationship tree
•Different linkage types
•Single: minimum distance between members of two clusters
•Complete: maximum distance
•Average: mean of the distances
•Number of clusters depends on arbitrary user set threshold
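A naive sketch of agglomerative clustering with the three linkage types and a user-set distance threshold. It recomputes all pairwise distances on every round, so it is only suitable for small examples; the function names and data are hypothetical, and it returns flat clusters rather than a full tree.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# How the distances between members of two clusters are combined.
LINKAGE = {
    "single": min,                            # minimum distance
    "complete": max,                          # maximum distance
    "average": lambda d: sum(d) / len(d),     # mean of the distances
}

def agglomerate(profiles, threshold, linkage="average"):
    """Merge the closest clusters until the closest pair is further apart than threshold."""
    combine = LINKAGE[linkage]
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [euclidean(profiles[i], profiles[j])
                         for i in clusters[a] for j in clusters[b]]
                d = combine(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break                             # remaining clusters are too far apart
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

profiles = [[0, 0], [0, 1], [5, 5], [5, 6]]
print(agglomerate(profiles, threshold=2.0))   # [[0, 1], [2, 3]]
```

Raising the threshold merges more clusters, which is why the number of clusters reported depends on the user's choice.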
k-Means clustering
•Does not produce a dendrogram
•Classifies data through single-step partition
•The average of each group is calculated, along with the distance of each point from this average
•Randomly reassign all data points to new clusters
•Recompute distances
•If distance to group average is smallest, retain point in group
•If not, reassign in next round
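A compact sketch of k-means in its most common formulation (random start, then alternate between computing group averages and moving each point to its nearest average). This may differ in detail from the procedure on the slide. The function names, the `labels` and `seed` parameters and the data are hypothetical.

```python
import random

def centroid(members):
    """Coordinate-wise average of a non-empty group of points."""
    return [sum(col) / len(members) for col in zip(*members)]

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, labels=None, seed=0, max_rounds=100):
    """Return a cluster label (0..k-1) for every point."""
    rng = random.Random(seed)
    if labels is None:
        labels = [rng.randrange(k) for _ in points]   # random initial partition
    labels = list(labels)
    for _ in range(max_rounds):
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster is re-seeded with a randomly chosen point.
            centroids.append(centroid(members) if members else list(rng.choice(points)))
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break                                     # no point changed group
        labels = new_labels
    return labels

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(kmeans(points, k=2, labels=[0, 1, 0, 1, 0, 1]))  # [0, 0, 0, 1, 1, 1]
```

The result can depend on the starting partition, so k-means is usually run several times with different random starts.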
Self organizing maps (SOMs)
•Similar to k-means
•Define number of nodes, and assign data points randomly to nodes
•Calculate distances to node averages
•Redistribute data points, and recalculate
•Repeat until no decrease in distance is obtained
•Nodes are not isolated groups, but are connected
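The connectedness of the nodes can be illustrated with a small one-dimensional SOM. In this sketch the nodes form a chain, and each time a data point is presented the winning node and, to a lesser degree, its neighbours on the chain move toward it. The learning-rate and neighbourhood schedules are arbitrary choices, and the function names and data are hypothetical.

```python
import math
import random

def best_node(nodes, point):
    """Index of the node whose weight vector is closest to the point."""
    return min(range(len(nodes)),
               key=lambda j: sum((n - p) ** 2 for n, p in zip(nodes[j], point)))

def train_som(points, n_nodes, epochs=50, lr0=0.5, seed=0):
    """Train a 1-D chain of nodes; returns the node weight vectors."""
    rng = random.Random(seed)
    nodes = [list(rng.choice(points)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs               # decays from 1 towards 0
        lr = lr0 * frac
        radius = max(1.0, (n_nodes / 2.0) * frac)
        for p in rng.sample(points, len(points)):
            bmu = best_node(nodes, p)
            for j, node in enumerate(nodes):
                # Neighbours of the winner on the chain are pulled along too.
                influence = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(len(node)):
                    node[d] += lr * influence * (p[d] - node[d])
    return nodes

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
nodes = train_som(points, n_nodes=3)
print([best_node(nodes, p) for p in points])
```

Because neighbouring nodes are updated together, adjacent nodes end up representing similar expression profiles, which k-means does not guarantee.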
Clustering programs
Cluster
http://rana.lbl.gov/EisenSoftware.htm
Windows program, capable of hierarchical clustering, SOM and k-means clustering
Treeview
http://rana.lbl.gov/EisenSoftware.htm
Visualize data from Cluster program