Genome-wide Dissections of DNA Damage Induced

APO-SYS workshop on data
analysis and pathway charting
Igor Ulitsky
Ron Shamir’s Computational
Genomics Group
Part I: Presentations
 EXPANDER
 AMADEUS
 SPIKE
 MATISSE
Part II: Hands-on Session
 EXPANDER
 MATISSE
 SPIKE
EXPression ANalyzer and
DisplayER
Adi Maron-Katz
Chaim Linhart
Amos Tanay
Rani Elkon
Israel Steinfeld
Seagull Shavit
Igor Ulitsky
Roded Sharan
Yossi Shiloh
Ron Shamir
http://acgt.cs.tau.ac.il/expander
EXPANDER
– Low level analysis:
• Missing data estimation (KNN or manual)
• Normalization: quantile, loess
• Filtering: fold change, variation, t-test
• Standardization: mean 0 std 1, take log, fixed norm
– High level gene partition analysis:
• Clustering
• Biclustering
– Ascribing biological meaning to patterns:
• Enriched functional categories (Gene Ontology)
• Identify transcriptional regulators – promoter analysis
• Built-in support for 9 organisms:
– human, mouse, rat, chicken, zebrafish, fly, worm, arabidopsis, yeast
EXPANDER analysis flow (diagram):
Input data → Normalization/Filtering → Clustering (CLICK, SOM, K-means, Hierarchical) or Biclustering (SAMBA) → Functional enrichment (TANGO) and Promoter signals (PRIMA)
Supporting components: Visualization utilities; Links to public annotation databases
EXPANDER - Preprocessing
• Input data:
- Expression matrix (probe-row; condition-column)
• One-channel data (e.g., Affymetrix)
• Dual-channel data (cDNA microarrays, data are (log)
ratios between the Red and Green channels)
• ‘.cel’ files
- ID conversion file: map probes to genes
- Gene sets data

• Data definitions:
- Defining condition subsets
- Data type & scale (log)
EXPANDER – Preprocessing (II)
 Data Adjustments:
- Missing value estimation (KNN or arbitrary)
- Merging conditions
 Normalization: removal of systematic biases from the analyzed chips
 Implemented methods: quantile, lowess
 Visualization: box plots, scatter plots (simple, M vs. A)
EXPANDER – Preprocessing (III)
 Filtering: Focus downstream analysis on the set
of “responding genes”
 Fold-Change
 Variation
 Statistical tests (T-test)
 Standardization : Create a common scale
 For each probe Mean=0, STD=1
 Log data (base 2)
 Fixed Norm (divide by norm of probe vector)
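The three standardization options can be sketched as small per-probe functions. This is a hedged illustration only: the function names are hypothetical, the standard deviation is the population form (dividing by the number of conditions), and the slides do not say which form EXPANDER uses.

```python
import math


def standardize(profile):
    """Rescale one probe's profile to mean 0 and standard deviation 1."""
    mean = sum(profile) / len(profile)
    # Population standard deviation (an assumption, see lead-in).
    std = math.sqrt(sum((x - mean) ** 2 for x in profile) / len(profile))
    if std == 0:
        return [0.0] * len(profile)  # flat profile: nothing to scale
    return [(x - mean) / std for x in profile]


def log2_transform(profile):
    """Take log base 2 of each (strictly positive) expression level."""
    return [math.log2(x) for x in profile]


def fixed_norm(profile):
    """Divide a probe's profile by the Euclidean norm of the probe vector."""
    norm = math.sqrt(sum(x * x for x in profile))
    return [x / norm for x in profile] if norm else list(profile)


z = standardize([1.0, 2.0, 3.0, 4.0])
```

Standardizing puts probes with very different absolute levels on a common scale, so that clustering compares the shape of the response rather than its magnitude.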
Normalization/
Filtering
Clustering
(CLICK, SOM,
K-means, Hierarchical)
Biclustering
(SAMBA)
Functional enrichment
Promoter signals
(TANGO)
(PRIMA)
Visualization utilities
Links to public annotation databases
Input data
Cluster Analysis
• Partition the responding genes into distinct sets,
each with a particular expression pattern
 Identify major patterns in the data: reduce the
dimensionality of the problem
 co-expression → co-function
 co-expression → co-regulation
• Partition the genes to achieve:
 Homogeneity: genes inside a cluster show
highly similar expression pattern.
 Separation: genes from different clusters have
different expression patterns.
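The two goals can be quantified. The sketch below scores a partition by the average Pearson correlation of gene pairs inside the same cluster (homogeneity) and of gene pairs in different clusters (separation). The choice of Pearson correlation, the pairwise averaging, and all names are assumptions made for illustration; profiles are assumed to be non-constant.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)


def homogeneity_and_separation(profiles, labels):
    """Average similarity within clusters and between clusters."""
    within, between = [], []
    for i, j in combinations(range(len(profiles)), 2):
        r = pearson(profiles[i], profiles[j])
        (within if labels[i] == labels[j] else between).append(r)
    return sum(within) / len(within), sum(between) / len(between)


# Toy data: two rising genes (cluster 0), two falling genes (cluster 1).
profiles = [[1, 2, 3], [2, 4, 6.5], [3, 2, 1], [6, 4, 1.5]]
labels = [0, 0, 1, 1]
homogeneity, separation = homogeneity_and_separation(profiles, labels)
```

A good partition has high homogeneity and low separation similarity; in the toy data the within-cluster average is close to 1 and the between-cluster average is close to -1.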
Cluster Analysis (II)
• Implemented algorithms:
– CLICK, K-means, SOM, Hierarchical
• Visualization:
– Mean expression patterns
– Heat-maps
Example study: responses to ionizing radiation
Ionizing radiation → double-strand breaks → sensors → ATM → effectors (p53, BRCA1, CHK2)
Downstream responses: DNA repair, cell cycle arrest, stress responses, apoptosis
Example study: experimental design
• Genotypes: Atm-/- and control w.t.
mice
• Tissue: Lymph node
• Treatment: Ionizing radiation
• Time points: 0, 30 min, 120 min
• Microarrays: Affymetrix U74Av2
(12k probesets)
Test case - Data Analysis
• Dataset: six conditions (2 genotypes, 3 time points)
• Normalization
• Filtering step – define the ‘responding genes’ set: genes whose expression level is changed by at least 1.75 fold
• Over 700 genes met this criterion
• The set contains genes with various response patterns – we applied CLICK to this set of genes
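The fold-change filter can be sketched as follows. The slides do not state what the change is measured against, so this example assumes a comparison of each condition with a reference condition (index 0, for example the untreated sample) on linear-scale data, in either direction. The function name and the toy values are hypothetical.

```python
def responding_probes(matrix, min_fold=1.75, reference=0):
    """Return probes changed by at least min_fold, up or down,
    relative to the reference condition (linear-scale levels)."""
    selected = []
    for probe_id, levels in matrix.items():
        base = levels[reference]
        if base <= 0:
            continue  # a ratio is undefined for non-positive levels
        ratios = [x / base for i, x in enumerate(levels)
                  if i != reference and x > 0]
        if any(r >= min_fold or r <= 1 / min_fold for r in ratios):
            selected.append(probe_id)
    return selected


# Toy data: one genotype at 0, 30 and 120 minutes (hypothetical values).
expression = {
    "probe_a": [100.0, 110.0, 190.0],  # induced at the last time point
    "probe_b": [200.0, 190.0, 210.0],  # essentially unchanged
    "probe_c": [300.0, 150.0, 280.0],  # repressed at 30 minutes
}
responders = responding_probes(expression)
```

Only probes passing the filter are handed to the clustering step, which keeps the analysis focused on genes that actually respond.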
Major Gene Clusters – Irradiated Lymph Node
Atm-dependent early responding genes
Major Gene Clusters – Irradiated Lymph Node
Atm-dependent 2nd wave of responding genes
Ascribe Functional Meaning to
the Clusters
• Gene Ontology (GO) annotations for
human, mouse, rat, chicken, fly,
worm, Arabidopsis, Zebrafish and
yeast.
• TANGO: Apply statistical tests that
seek over-represented GO
functional categories in the clusters.
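The slide does not name the statistical test. A common choice for over-representation is the hypergeometric tail, which is the assumption in this minimal sketch; it is not a reproduction of TANGO, and the function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_tail(population, annotated, cluster_size, overlap):
    """P(X >= overlap): chance that a random cluster of cluster_size genes,
    drawn without replacement from population genes of which annotated
    carry the category, contains at least overlap annotated genes."""
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / total


# Hypothetical numbers: 1000 genes on the chip, 50 annotated "cell cycle",
# a cluster of 40 genes of which 10 carry the annotation (2 expected).
p_value = hypergeom_tail(1000, 50, 40, 10)
```

Because many GO categories are tested against every cluster, raw p-values of this kind need a multiple-testing correction before they are interpreted.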
Functional Enrichment - Visualization
Functional Categories
cell cycle control (p < 1x10^-6)
Functional Categories
Cell cycle control (p < 5x10^-6)
Apoptosis (p = 0.001)
Identify Transcriptional
Regulators
Clues are in the promoters
Diagram: ATM → hidden layer of transcription factors (p53, TF-A, TF-B, TF-C, and new, still unknown regulators marked “?”) → observed layer of target genes (g1 to g13)
‘Reverse engineering’ of transcriptional
networks
• Infers regulatory mechanisms from gene
expression data
– Assumption:
co-expression → transcriptional co-regulation →
common cis-regulatory promoter elements
• Step 1: Identification of co-expressed genes
using microarray technology (clustering algs)
• Step 2: Computational identification of cis-regulatory elements that are over-represented in
promoters of the co-expressed genes
PRIMA – general description
• Input:
– Target set (e.g., co-expressed genes)
– Background set (e.g., all genes on the chip)
• Analysis:
– Identify transcription factors whose binding
site signatures are enriched in the ‘Target set’
with respect to the ‘Background set’.
• TF binding site models – TRANSFAC DB
• Default promoter region: from -1000 bp to +200 bp relative to the TSS
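The target-versus-background comparison can be illustrated with a simple enrichment factor: the fraction of target promoters with a binding-site hit divided by the same fraction in the background. This is a sketch under assumptions, not PRIMA's actual scoring. The set `genes_with_hit` is a hypothetical, precomputed result of scanning each promoter with one TF's binding-site model, and all gene names are invented.

```python
def enrichment_factor(target, background, genes_with_hit):
    """Hit frequency in the target set divided by hit frequency
    in the background set, for one transcription factor."""
    target_rate = len(target & genes_with_hit) / len(target)
    background_rate = len(background & genes_with_hit) / len(background)
    return target_rate / background_rate


# Hypothetical example: 100 genes on the chip, 10 of them with a hit.
background = {f"g{i}" for i in range(100)}
genes_with_hit = {f"g{i}" for i in range(10)}
# A 10-gene cluster in which 5 promoters carry the hit.
target = {f"g{i}" for i in range(5)} | {f"g{i}" for i in range(50, 55)}

factor = enrichment_factor(target, background, genes_with_hit)
```

Here half of the target promoters carry the site against one tenth of the background, giving a factor of about 5. The statistical significance of such a ratio is a separate calculation that this sketch does not cover.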
Promoter Analysis - Visualization
PRIMA – Results
Transcription factor    Enrichment factor    P-value
CREB                    2.6                  6.0x10^-5
PRIMA – Results
Transcription factor    Enrichment factor    P-value
NF-κB                   5.1                  3.8x10^-8
p53                     4.2                  9.6x10^-7
STAT-1                  3.2                  5.4x10^-6
Sp-1                    1.7                  6.5x10^-4
Biclustering
 Clustering becomes too
restrictive on large datasets:
• Seeks global partition of genes
according to similarity in their
expression across ALL
conditions
 Relevant knowledge can be
revealed by identifying
genes with common pattern
across a subset of the
conditions
• Biclustering algorithmic
approach
A. Tanay, R. Sharan, R. Shamir RECOMB 02
Biclustering: SAMBA
Statistical Algorithmic Method for Bicluster Analysis
* Bicluster (=module) : subset of genes with similar
behavior in a subset of conditions
* Computationally challenging: has to consider
many combinations of sub-conditions
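To show why the search is hard, the sketch below enumerates every subset of conditions and collects the genes that respond in all of them. This brute-force approach is deliberately naive and is not the SAMBA algorithm. The input format (a set of responding conditions per gene), the function name, and the toy data are all assumptions.

```python
from itertools import combinations


def exhaustive_biclusters(responses, min_conditions=2, min_genes=2):
    """responses maps each gene to the set of conditions in which it
    responds. Returns (genes, conditions) pairs where every listed gene
    responds in every listed condition."""
    conditions = sorted(set().union(*responses.values()))
    found = []
    # The number of subsets grows as 2 ** len(conditions).
    for size in range(min_conditions, len(conditions) + 1):
        for subset in combinations(conditions, size):
            genes = sorted(g for g, conds in responses.items()
                           if set(subset) <= conds)
            if len(genes) >= min_genes:
                found.append((genes, subset))
    return found


# Toy data: four genes, three conditions (hypothetical).
responses = {
    "g1": {"c1", "c2"},
    "g2": {"c1", "c2", "c3"},
    "g3": {"c3"},
    "g4": {"c2", "c3"},
}
biclusters = exhaustive_biclusters(responses)
```

With three conditions there are only four subsets of size two or more to check, but with 100 conditions there are roughly 2^100, which is why a dedicated algorithm is needed. Note that g2 appears in both biclusters found in the toy data, something a global clustering into disjoint sets cannot express.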
Biclustering Visualization