Transcript robust fit

About genewise regulation analysis with regression
Janne Nikkilä
Contents of the talk
• Biological background of gene expression regulation
• Previous work in statistical modeling of gene regulation
• Our motivation
• Our analysis approach
• Some results
• Discussion
Biological background for gene expression regulation
• Gene expression is regulated at several stages:
  – DNA unpacking (demethylation, histone acetylation)
  – Transcription
  – Alternative RNA splicing
  – mRNA degradation
  – Translation initiation
  – Protein processing and degradation
• Transcription is believed to be the most important one
Biological background for gene transcription regulation
• In transcription, the mRNA corresponding to the coding DNA sequence is formed
• Transcription initiation is mainly controlled by the binding of specific protein complexes, transcription factors (tfs), to the gene promoter region
• Tfs may enhance, suppress, or do both
• As tfs are composed of proteins, which are coded by genes, tf activities can be analyzed by studying the expression of the genes that code for the tfs
Analysis methods used in the literature
• Modelling of gene interactions by
  – Boolean networks
  – Differential equations
  – Linear regression
  – Clustering
  – Probabilistic models (e.g. Bayesian networks)
Our motivation
• None of the previous methods seems to work adequately
• This may be due to the methods, the quality of the data, or perhaps the cumulative effect of these two factors
➔ We wish to find some evidence that gene regulation mechanisms can be inferred from gene expression data
A simple approach
• Study the expression of one gene at a time and try to explain it with the sum of the transcription factor component activities
  – Intuitive interpretation of the setup and the results
• Regression as the model
  – Easy to interpret, computationally feasible
• Evaluate the results statistically
  – Somewhat quantitative interpretation of the results
Data
• Expression data
  – 300 different knockout mutations of yeast (300 arrays)
  – Over 6000 yeast genes on each cDNA array
• Binding data
  – Binding activity of 147 transcription factors to all yeast genes (147 arrays)
  – About the same genes on the arrays as above
  – Used to choose a set of candidate tfs for each gene
Preprocessing of the data
• The normal quantity provided by cDNA arrays is the log-ratio of the sample and control intensities from each spot
  – May hinder the discovery of normal regulation mechanisms
• Plain log-intensities separately?
  • Not possible because of spotwise variation
➔ Only the arraywise and genewise averages were removed, and the normal log-ratios were used in the analysis
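The preprocessing step above can be sketched in a few lines. This is a minimal illustration, assuming the log-ratios are stored as a matrix with genes as rows and arrays as columns; the function name `center_log_ratios` and the toy numbers are hypothetical, and missing values are not handled.

```python
import numpy as np

def center_log_ratios(log_ratios):
    """Remove arraywise (column) and genewise (row) averages.

    Assumed layout: rows = genes, columns = arrays.
    """
    x = np.asarray(log_ratios, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)  # subtract each array's average
    x = x - x.mean(axis=1, keepdims=True)  # subtract each gene's average
    return x

# Toy data: 4 genes on 3 arrays (made-up numbers, not from the talk)
toy = np.array([[0.2, -0.1, 0.5],
                [1.0, 0.8, 1.3],
                [-0.4, -0.6, 0.1],
                [0.0, 0.3, -0.2]])
centered = center_log_ratios(toy)
```

After both subtractions, every row and every column of `centered` averages to zero, so the order of the two steps does not matter for complete data.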
The regression model
• The expression of a gene, y, is modelled as a weighted sum of x, the expressions of a set of transcription factor genes
• The error e is assumed to be normally distributed
• As a result, each transcription factor gene is assigned a coefficient, which denotes its role in gene regulation
• Fitted with a robust fit method
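A robust fit of such a model could look roughly like the sketch below. It writes the model as y = b0 + b1*x1 + ... + bp*xp + e and estimates the coefficients by iteratively reweighted least squares with Huber weights. The talk does not say which robust estimator or software was used, so the Huber weighting, the intercept b0, the tuning constant `k`, and the function name `robust_linear_fit` are all assumptions made for illustration.

```python
import numpy as np

def robust_linear_fit(X, y, k=1.345, n_iter=50, tol=1e-8):
    """Huber-weighted IRLS for y = b0 + X @ b + e (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])    # add intercept column
    beta = np.linalg.lstsq(A, y, rcond=None)[0]  # ordinary least-squares start
    for _ in range(n_iter):
        resid = y - A @ beta
        # Robust residual scale: median absolute deviation / 0.6745
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale < 1e-12:
            break  # essentially perfect fit, nothing to reweight
        u = np.abs(resid) / (k * scale)
        w = 1.0 / np.maximum(u, 1.0)             # Huber weights: 1 inside, 1/u outside
        sw = np.sqrt(w)
        new_beta = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Toy example: 300 "arrays", 3 candidate tf genes, a few gross outliers
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
true_beta = np.array([0.5, 2.0, -1.0, 0.0])      # intercept, then three tf weights
y = true_beta[0] + X @ true_beta[1:] + 0.1 * rng.normal(size=300)
y[:15] += 10.0                                    # contaminate 5% of the observations
beta_hat = robust_linear_fit(X, y)
```

The downweighting keeps the contaminated observations from dominating the estimate, which an ordinary least-squares fit would not do.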
Statistical analysis of the results
• A subset of nine genes: some confirmed transcription factors, the binding activities of the 50 tfs, and the significances of the same 50 tfs in the regression model
• Tests:
  – Test whether binding activity and the regression model produce the same kind of information about the roles of the tfs for each gene -> no statistical significance
  – Test whether the confirmed tfs are found among the most significant ones in either binding or regression -> no statistical significance
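The talk does not name the tests that were used. One plausible way to carry out the first comparison, shown purely as an assumption, is a permutation test on the rank correlation between a gene's binding scores and its regression significances over the same tfs. The function names and the random toy data are hypothetical.

```python
import numpy as np

def rank(values):
    """Ranks 0..n-1; ties are broken by position, adequate for continuous scores."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return np.corrcoef(rank(a), rank(b))[0, 1]

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the rank correlation of a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = spearman(a, b)
    count = 0
    for _ in range(n_perm):
        # Shuffling a breaks any pairing between the two score vectors
        if abs(spearman(rng.permutation(a), b)) >= abs(observed):
            count += 1
    return observed, (count + 1) / (n_perm + 1)

# Toy data for one gene: 50 tfs with unrelated scores (made-up numbers)
rng = np.random.default_rng(1)
binding_scores = rng.random(50)
regression_scores = rng.random(50)
rho, p_value = permutation_pvalue(binding_scores, regression_scores)
```

A large p-value here would correspond to the "no statistical significance" outcome reported above: the two data sources would not rank the tfs in a related way.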
Discussion
• Clearly, there is no linear association between the regulator genes and the regulated genes in this data set
• The biggest problem is perhaps the type of the data: cDNA data without a time dimension -> changing to Affymetrix and/or time-series data might help
• Another problem may be an oversimplified model, but with this kind of data, statistical models for gene interactions seem to be fruitless