Expression Profile Clustering

Transcript Expression Profile Clustering

Expression Profile Clustering
• Expression Profile
= the pattern of signal values for one gene over several chips.
• Expression Profile Clustering
= the clustering of “similar” profiles
• Why?
– Similar expression profiles suggest
• regulation (by shared factor or cluster member)
• related function
– ALSO: opposite expression profiles suggest
• regulation (e.g. inhibition)
• Software:
– EpClust
– J-Express
– etc.
[Figure: one example expression profile; signal value (0 to 4) plotted against chips 1 to 5]
Guided Tour of EpClust
Data file format
• Can enter in a variety of formats.
– See EpClust’s data upload page
• One simple way:
– Download tab-delimited data from NASC
– Open with (or paste into) Excel
– Delete all but one “name” column and the signal value columns.
– Do not include any hyphens!
– (I try to avoid all punctuation)
– Save as tab-delimited text
• Annotation can be added as either:
– A second file with the same first column of names
– A single second column in the signal file (if specified)

Gene_ID     Chip1   Chip2   Chip3
AT1G01010   40      48      65
AT1G01030   2       1       4
AT1G01040   35      79      88
AT1G01050   713     744     671
AT1G01060   56      80      111
AT1G01070   75      109     94
AT1G01080   39      33      33
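As a rough check before uploading, a file in the layout above can be read back and tested for hyphens in the name column. This is only a sketch: the function name `read_profiles` and the inline example text are assumptions made for illustration, and nothing here is part of EpClust itself.

```python
import csv
import io

# Hypothetical example text in the layout described above:
# one name column, then one signal value column per chip, tab-delimited.
EXAMPLE = (
    "Gene_ID\tChip1\tChip2\tChip3\n"
    "AT1G01010\t40\t48\t65\n"
    "AT1G01050\t713\t744\t671\n"
    "AT1G01080\t39\t33\t33\n"
)

def read_profiles(handle):
    """Return (chip_names, {gene_id: [signal, ...]}) from tab-delimited text."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    chips = header[1:]
    profiles = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene_id, values = row[0], row[1:]
        if "-" in gene_id:
            # the slide's advice: no hyphens in the file
            raise ValueError(f"hyphen found in name: {gene_id!r}")
        if len(values) != len(chips):
            raise ValueError(f"{gene_id}: expected {len(chips)} signal values")
        profiles[gene_id] = [float(v) for v in values]
    return chips, profiles

chips, profiles = read_profiles(io.StringIO(EXAMPLE))
print(chips)
print(profiles["AT1G01050"])
```

With a real file, `open("myfile.txt", newline="")` (file name hypothetical) would take the place of `io.StringIO(EXAMPLE)`.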
To Upload Data
To select uploaded data
Select Specific Experiments within Input File
To Remove Unreliable Data
To Choose Algorithm Type
Hierarchical
• Measures distance between each profile (i.e. gene) and each other profile
– So time increases steeply with each gene added (roughly with the square of the number of genes, as every pair is compared)
• Then clusters closest genes together, followed by increasingly distant ones
– Into a tree of clusters within clusters
• Tree can be visualised and finally split at chosen distance, with knowledge of tree

K-means
• Initially choose a set number (K) of clusters
1) Chooses the K most different profiles
2) Clusters each remaining profile with one of the K
• So, much less time than Hierarchical (if many genes)
• Cluster size can vary, so can return too many or too few genes, with no way to select.
– Then must repeat with different K values.
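The two approaches can be sketched in a few lines. This is a simplified illustration, not EpClust's implementation: the hierarchical part assumes single linkage (clusters are joined through their closest members), the K part does one pass only (real K-means goes on to recompute cluster centres and repeat), and the function names and toy profiles are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_cut(profiles, cut_distance, dist=euclidean):
    """Single-linkage clusters obtained by cutting the tree at cut_distance.

    Every pair of profiles is compared: n genes need n * (n - 1) / 2 distances.
    """
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # follow links up to the representative of this cluster
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist(profiles[a], profiles[b]) <= cut_distance:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())

def k_seed_clusters(profiles, k, dist=euclidean):
    """1) pick K mutually distant profiles, 2) put every profile with its nearest one."""
    names = list(profiles)
    if not 1 <= k <= len(names):
        raise ValueError("k must be between 1 and the number of profiles")
    seeds = [names[0]]
    while len(seeds) < k:
        # next seed: the profile farthest from all seeds chosen so far
        candidates = [n for n in names if n not in seeds]
        seeds.append(max(
            candidates,
            key=lambda n: min(dist(profiles[n], profiles[s]) for s in seeds),
        ))
    clusters = {s: [] for s in seeds}
    for n in names:
        nearest = min(seeds, key=lambda s: dist(profiles[n], profiles[s]))
        clusters[nearest].append(n)
    return sorted(sorted(members) for members in clusters.values())

# Toy profiles (made-up values): g1/g2 are close, g3/g4 are close.
toy = {"g1": [1, 2, 3], "g2": [1, 2, 4], "g3": [10, 9, 8], "g4": [10, 9, 7]}
print(hierarchical_cut(toy, cut_distance=2.0))
print(k_seed_clusters(toy, k=2))
```

Note the difference in what has to be chosen: the hierarchical version takes a cut distance after all distances are known, while the K version needs the number of clusters up front.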
So Which is “Better”?
• K-Means is better able to manage large data-sets.
• Hierarchical seems a more objective approach
– In that you don’t need to decide cluster number at start
But which is more biologically informative?
• Opinion divided.
• Both artificial.
• Don’t prove, just suggest.
• K-means gives more consistent results.
• ….and then there’s SOTA too!
Alternative Distance Measurements
Pearson-Based Distance Measurement
• The most commonly used
• How similar the SHAPES of the two profiles are
• Based on average of values and the standard deviation
• Rates from identical (1) to completely uncorrelated (0) to perfect opposites (-1)
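The 1 / 0 / -1 scale can be checked with a small sketch of the standard Pearson correlation coefficient. The gene profiles below are made-up values chosen to hit the three cases; the function assumes neither profile is completely flat (a flat profile has zero standard deviation).

```python
import math

def pearson(x, y):
    """Pearson correlation: compares the shapes of two profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # co-variation of the two profiles around their own averages
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # spread of each profile around its own average
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

gene1 = [1.0, 2.0, 3.0, 4.0, 5.0]
same_shape = [11.0, 12.0, 13.0, 14.0, 15.0]   # same shape, higher signal
unrelated = [3.0, 1.0, 2.0, 1.0, 3.0]         # no linear relation to gene1
opposite = [5.0, 4.0, 3.0, 2.0, 1.0]          # mirror image of gene1

print(round(pearson(gene1, same_shape), 6))   # close to 1
print(round(pearson(gene1, unrelated), 6))    # close to 0
print(round(pearson(gene1, opposite), 6))     # close to -1
```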
Centred Test?
If profiles have identical shape, but offset from each other by a fixed value (or magnitude)
– Centred: Identical (1)
– Uncentred: Not (<1)
Absolute Test?
If profiles have perfect opposite expression patterns
– Absolute: Identical (1)
– Non-Absolute: perfect opposites (-1)
Parametric Test?
– Parametric: assumes normal distribution. More rigorous where there are no outliers.
– Non-parametric (= Spearman rank): More rigorous where there are outliers.
[Figure: three panels, each plotting signal value (0 to 7) against chips 1 to 5 for Gene1 and Gene2]
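The three choices (centred or not, absolute or not, parametric or rank-based) can be illustrated with a short sketch. The formulas are the standard uncentred correlation, Pearson correlation and Spearman rank correlation; the function names and example profiles are hypothetical, and profiles are assumed not to be completely flat.

```python
import math

def uncentred(x, y):
    """Correlation without subtracting each profile's average first."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def centred(x, y):
    """Pearson correlation: subtract each profile's average, then correlate."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return uncentred([a - mean_x for a in x], [b - mean_y for b in y])

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Non-parametric: Pearson correlation of the ranks."""
    return centred(ranks(x), ranks(y))

gene1 = [1.0, 2.0, 3.0, 4.0, 5.0]
offset = [4.0, 5.0, 6.0, 7.0, 8.0]        # same shape, shifted by a fixed value
opposite = [5.0, 4.0, 3.0, 2.0, 1.0]      # perfect opposite pattern
outlier = [1.0, 2.0, 3.0, 4.0, 100.0]     # same order, one extreme value

print(round(centred(gene1, offset), 3), round(uncentred(gene1, offset), 3))
print(round(centred(gene1, opposite), 3), round(abs(centred(gene1, opposite)), 3))
print(round(centred(gene1, outlier), 3), round(spearman(gene1, outlier), 3))
```

The last line shows why the rank-based measure copes better with outliers: the single extreme value pulls the parametric score well below 1, while the ranks are unchanged.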
Euclidean-Based Distance Measurement
• Measures distance between gene expression levels directly, based on magnitude of changes
– More about signal VALUES.
– Less about profile shapes.
• Data must be suitably normalized
– e.g. use log-ratios of signals
• Euclidean: shortest path between points
• Manhattan: the sum of distances along each dimension
[Figure: Gene1 to Gene4 plotted as points, with chip1 signal, chip2 signal and chip3 signal as the axes (0 to 8)]
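Both distances, and the log-ratio normalization mentioned above, are short enough to sketch directly. The choice of reference signal for the ratio (here a hypothetical control value per chip) is an assumption; the slides do not say what the signals are divided by.

```python
import math

def euclidean(a, b):
    """Shortest path between two points (one dimension per chip)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the distances along each dimension (chip)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def log_ratios(signals, reference):
    """log2(signal / reference) per chip; assumes all values are positive."""
    return [math.log2(s / r) for s, r in zip(signals, reference)]

# Two genes measured on two chips (made-up values).
gene1 = [1.0, 1.0]
gene2 = [4.0, 5.0]
print(euclidean(gene1, gene2))   # 5.0
print(manhattan(gene1, gene2))   # 7.0

# Hypothetical reference signals, e.g. from a control sample.
print(log_ratios([8.0, 4.0, 2.0], [2.0, 2.0, 2.0]))   # [2.0, 1.0, 0.0]
```

Unlike the correlation-based measures, two profiles with identical shape but different signal levels are far apart here, which is why normalization matters.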
To Choose Algorithm Type
RESULTS
[Figure: two results panels, each labelled _MyGene]
Extra Options
[Screenshot labels: Export Results as Picture; Statistics; Text Lists Format; Phylip Format]
Search for Promoter Motifs
• Must prepare file of upstream sequences
– Can get from TAIR using a list of gene names
– Currently must arrange with EpClust staff to upload
• Visualise promoter region next to each profile
• Highlight motifs
– You must tell it the sequences of motifs though.
– Can get from querying a database such as PlantCARE using the upstream sequence of our particular gene of most interest
Export clusters as:
• text (lists of gene names), from tree cut at chosen height
• or Phylip formatted (for tree drawing software)
PlantCARE
Cis-Acting Regulatory Elements
http://intra.psb.ugent.be:8080/PlantCARE/
• Database of CAREs
• Tools
– Search for CARE
• Enter upstream sequence of a gene
• Lists and then highlights known motifs from database
– Motif Sampler
• Enter upstream sequences of your cluster
• Highlights 8mers conserved more in cluster members than in genome
– Other tools
• Clustering
• Query for info on motifs
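The idea behind looking for 8mers that are more common in a cluster than elsewhere can be sketched as a simple counting exercise. This is not PlantCARE's Motif Sampler algorithm, only a naive illustration: the function names, the `min_ratio` threshold, the pseudo-frequency used for unseen 8mers, and the toy sequences are all assumptions.

```python
def kmers(seq, k=8):
    """Set of all k-letter words in one upstream sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def fraction_containing(sequences, k=8):
    """For each k-mer, the fraction of sequences that contain it at least once."""
    counts = {}
    for seq in sequences:
        for kmer in kmers(seq, k):
            counts[kmer] = counts.get(kmer, 0) + 1
    return {kmer: c / len(sequences) for kmer, c in counts.items()}

def enriched_kmers(cluster_seqs, background_seqs, k=8, min_ratio=2.0):
    """k-mers found in a larger share of cluster sequences than background ones."""
    cluster = fraction_containing(cluster_seqs, k)
    background = fraction_containing(background_seqs, k)
    # assumed floor so that k-mers absent from the background do not divide by zero
    floor = 1 / (len(background_seqs) + 1)
    hits = {}
    for kmer, frac in cluster.items():
        ratio = frac / max(background.get(kmer, 0.0), floor)
        if ratio >= min_ratio:
            hits[kmer] = ratio
    return hits

# Toy upstream sequences (made up): every cluster member carries ACGTACGT.
cluster_seqs = ["TTTTACGTACGTAAAA", "GGGGACGTACGTCCCC", "CACAACGTACGTTGTG"]
background_seqs = ["TTTTTTTTTTTTTTTT", "GGGGCCCCGGGGCCCC", "CACACACATGTGTGTG"]
print(enriched_kmers(cluster_seqs, background_seqs))
```

In practice the background would be upstream sequences from across the genome rather than three short strings.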
Gene Ontology
A hierarchical structure to describe gene function.
As PlantCARE compares:
• Expression Profile Clusters to Promoter Motif Conservation
There are also tools to compare:
• Expression Profile Clusters to Gene Ontology