Bioinformatics, Academic Year 2006/2007
Bioinformatics
Master's Degree Programme in Computer Science
Microarrays and Biomarkers
06/05/2011
Classification of microarray samples
• We are given a set (called the learning set) of microarray
expression data coming from several classes of samples
(patients).
• To simplify the problem we consider only two classes,
case and control, so the learning set is a set of case/control pairs.
• Examples: cancer/normal, metastatic/non-metastatic,
etc.
• Goal: build a classifier able to decide to which class a new,
unclassified sample belongs.
Expression profiling data analysis
• A supervised approach to classification:
• Identify genes (or microRNAs) that are
differentially expressed in the two classes of
samples.
• Discretize the set of discriminant genes
• Use these genes to build a classifier able to
classify new (unknown) samples
Two classes/1
• Rank Product
– Rank Product is a non-parametric statistical method
based on the ranks of fold changes. Given n genes and k
replicates, let e_{g,i} be the fold change (case/control
ratio) of gene g in the i-th replicate and r_{g,i} the rank
of gene g in the i-th replicate.
– The rank product of gene g is the geometric mean of its
ranks over the k replicates:
RP_g = \left( \prod_{i=1}^{k} r_{g,i} \right)^{1/k}
– A simple permutation-based estimate is used to
determine how likely it is that a given RP value was
obtained by chance.
R. Breitling, P. Armengaud, A. Amtmann, P. Herzyk. Rank products: a simple, yet powerful, new
method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters,
Volume 573, Issue 1, Pages 83-92.
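The rank product and its permutation test can be sketched in a few lines of Python. This is a minimal illustration, not the Bioconductor implementation: the function names, the toy fold changes, and the choice to give rank 1 to the largest fold change (up-regulation) are assumptions made here, and the p-value is a simplified estimate (the fraction of permuted rank products that are at least as small as the observed one).

```python
import random
from math import prod

def rank_products(fold_changes):
    """fold_changes[i][g] is the fold change of gene g in replicate i."""
    k = len(fold_changes)
    n = len(fold_changes[0])
    ranks = []
    for replicate in fold_changes:
        # Assumption: rank 1 = largest fold change (most up-regulated).
        order = sorted(range(n), key=lambda g: replicate[g], reverse=True)
        rank = [0] * n
        for position, g in enumerate(order, start=1):
            rank[g] = position
        ranks.append(rank)
    # Geometric mean of the k ranks of each gene.
    return [prod(r[g] for r in ranks) ** (1.0 / k) for g in range(n)]

def permutation_pvalues(fold_changes, n_perm=200, seed=0):
    """Fraction of permuted rank products <= the observed one, per gene."""
    rng = random.Random(seed)
    observed = rank_products(fold_changes)
    n = len(observed)
    hits = [0] * n
    for _ in range(n_perm):
        # Shuffle gene labels independently within each replicate.
        shuffled = [rng.sample(rep, len(rep)) for rep in fold_changes]
        null = rank_products(shuffled)
        for g in range(n):
            hits[g] += sum(v <= observed[g] for v in null)
    return [h / (n_perm * n) for h in hits]

# Toy data: 3 replicates, 4 genes (hypothetical values).
fold_changes = [
    [4.0, 1.0, 0.5, 2.0],
    [3.0, 0.9, 0.6, 1.5],
    [5.0, 1.1, 0.4, 2.5],
]
rp = rank_products(fold_changes)
pvals = permutation_pvalues(fold_changes)
```

Down-regulated genes can be scored the same way by ranking in the opposite direction (drop `reverse=True`).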
Two classes/2
• Identification of differentially expressed genes between two classes. The
identification consists of two parts: finding the genes that are up-regulated and
those that are down-regulated in class a compared to class b.
• These results have been obtained using the Rank Product package (RankProd, v. 2.16.0)
of the Bioconductor library for R.
More than two classes
• Many statistical tests are available:
– Kruskal-Wallis
– ANOVA (for Gaussian data only)
– SAM (?)
– Linear model (R limma package)
Discretization
• Discretization algorithms play an important role in data
mining and knowledge discovery.
• They not only produce a concise summarization of
continuous attributes to help the experts understand the
data more easily, but also make learning more accurate
and faster.
• Discretization algorithms can be classified along five different
axes:
– supervised versus unsupervised;
– static versus dynamic;
– global versus local;
– top-down (splitting) versus bottom-up (merging);
– direct versus incremental.
Class-Attribute Contingency Coefficient
• Given the quanta matrix, the contingency coefficient is commonly used to measure the
strength of dependence between the variables, where:
– q_ir (i = 1,2,...,S; r = 1,2,...,n) denotes the total number of examples belonging to the i-th class that are
within the interval (d_{r-1}, d_r];
– M_i+ is the total number of examples belonging to the i-th class;
– M_+r is the total number of examples that are within the interval (d_{r-1}, d_r];
– n is the number of intervals.
C.J. Tsai, C.-I. Lee, W.-P. Yang. A discretization algorithm based on Class-Attribute
Contingency Coefficient. Information Sciences 178:3 (2008) 714-731.
CACC Pseudo-code
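A greedy top-down discretization driven by the class-attribute contingency coefficient can be sketched as follows. The sketch assumes the definition from the cited paper as recalled here, cacc = sqrt(y / (y + M)) with y = M [ Σ_i Σ_r q_ir² / (M_i+ M_+r) − 1 ] / log(n), where M is the total number of examples. The function names, the use of midpoints between distinct values as candidate cut points, and the stopping rule (stop as soon as no candidate increases cacc) are assumptions of this sketch.

```python
import math

def cacc(values, labels, cuts):
    """CACC value of the discretization defined by sorted interior cuts."""
    n = len(cuts) + 1                      # number of intervals
    if n < 2:
        return 0.0                         # a single interval carries no information
    classes = sorted(set(labels))
    q = [[0] * n for _ in classes]         # quanta matrix q[i][r]
    for v, c in zip(values, labels):
        r = sum(1 for d in cuts if v > d)  # index of interval (d_{r-1}, d_r]
        q[classes.index(c)][r] += 1
    M = len(values)
    row = [sum(qi) for qi in q]                                   # M_i+
    col = [sum(q[i][r] for i in range(len(classes))) for r in range(n)]  # M_+r
    s = sum(q[i][r] ** 2 / (row[i] * col[r])
            for i in range(len(classes)) for r in range(n) if col[r] > 0)
    y = max(0.0, M * (s - 1.0) / math.log(n))
    return math.sqrt(y / (y + M))

def cacc_discretize(values, labels):
    """Greedily add the cut point that most increases cacc."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    cuts, best = [], 0.0
    while candidates:
        score, cut = max((cacc(values, labels, sorted(cuts + [c])), c)
                         for c in candidates)
        if score <= best:
            break                          # no candidate improves the coefficient
        cuts, best = sorted(cuts + [cut]), score
        candidates.remove(cut)
    return cuts, best

# Toy attribute: expression values of one gene in six samples (hypothetical).
values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
cuts, best = cacc_discretize(values, labels)
```

On this toy attribute the search keeps a single cut between the two groups of values, since a second cut lowers the coefficient through the log(n) term.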
Associative classification
• Associative classification mining is a successful
approach that uses association rule discovery
techniques to build classification systems.
• Maximal Frequent Itemsets (e.g. computed with the MAFIA
algorithm)
– Given the set of discretized discriminant genes, consider all the
pairs [gene, interval] as the items of the data mining analysis. For
each class k we compute a set of maximal frequent itemsets
(MFI). A frequent itemset for a class k is a set of items
that appear together in a fraction of the samples of the class greater
than a given percentage threshold t. It is maximal if no proper
superset of it is frequent.
– For each class k = 0,…,K−1, the set of all MFIs,
MFI(k) = {mfi_1(k), ..., mfi_hk(k)}, is computed.
Class k is then assigned the set of rules R_k:
∧ mfi_1(k) → class k
.
.
∧ mfi_hk(k) → class k
where ∧ mfi denotes the conjunction of the items in mfi.
Burdick D, Calimlim M, Flannick J, Gehrke J, Yiu T: MAFIA: A Maximal Frequent Itemset
Algorithm. IEEE Transactions on Knowledge and Data Engineering 2005, 17:1490–1504.
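The definition of maximal frequent itemsets can be illustrated with a small level-wise search. This is not the MAFIA algorithm, only a brute-force sketch suitable for toy inputs; the function name, the sample data, and the representation of items as (gene, interval) tuples are assumptions of this example.

```python
from itertools import combinations

def maximal_frequent_itemsets(samples, t):
    """samples: list of sets of (gene, interval) items for ONE class.
    t: fraction of samples an itemset must exceed to be frequent."""
    min_count = t * len(samples)
    items = sorted(set().union(*samples))
    frequent = []
    level = [frozenset([item]) for item in items]
    while level:
        # Keep the itemsets contained in more than t * |samples| samples.
        kept = [s for s in level
                if sum(s <= sample for sample in samples) > min_count]
        frequent.extend(kept)
        # Join step: merge frequent itemsets that differ by one item.
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1})
    # Maximal = no proper superset is frequent.
    return [s for s in frequent if not any(s < other for other in frequent)]

# Three discretized samples of one class (hypothetical data).
samples = [
    {("g1", "high"), ("g2", "low")},
    {("g1", "high"), ("g2", "low"), ("g3", "high")},
    {("g1", "high"), ("g3", "low")},
]
mfi = maximal_frequent_itemsets(samples, t=0.5)
```

Here the only maximal frequent itemset is {(g1, high), (g2, low)}, which yields the rule "g1 high ∧ g2 low → class k" for that class.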
Evaluation
– Unknown phenotypes are discretized with the same intervals and then assigned to
a class k with a score, by using the association rules. The assignment
that yields the highest score establishes the class.
– Let x = {I_1,...,I_m} be an unknown discretized phenotype. We
evaluate how many rules are satisfied, even partially, in each R_k.
The sample is assigned to the class with the largest number of
satisfied rules. For a fixed class k, we evaluate x under a generic rule r_v^k = {I_i
, ..., I_j}, assigning a score in the following way:
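The rule-based assignment can be sketched as below. The scoring formula of the original slide is not reproduced in this transcript, so the score used here (the fraction of a rule's items found in the sample, summed over the rules of a class) is only a stand-in assumption that captures "partially satisfied" rules. All names and data are hypothetical.

```python
def rule_score(sample, rule):
    """ASSUMED score: fraction of the rule's items present in the sample."""
    return len(sample & rule) / len(rule)

def classify(sample, rules_by_class):
    """Return the class with the highest total score, and all totals."""
    totals = {k: sum(rule_score(sample, rule) for rule in rules)
              for k, rules in rules_by_class.items()}
    return max(totals, key=totals.get), totals

# One rule (maximal frequent itemset) per class, hypothetical.
rules_by_class = {
    "case": [frozenset({("g1", "high"), ("g2", "low")})],
    "control": [frozenset({("g1", "low"), ("g3", "high")})],
}
x = {("g1", "high"), ("g2", "low"), ("g3", "high")}
label, totals = classify(x, rules_by_class)
```

The sample fully satisfies the "case" rule and half of the "control" rule, so it is assigned to "case".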
General schema
[Figure: workflow diagram. Boxes: Profiling data; Filtering (i.e. discriminant genes); Discretization; Binary strategy; Model validation (KFCV); Superset of robust biomarkers; Gene patterns (data mining: maximal frequent itemsets); Filtering based on permutation test; Bayesian network construction (reverse engineering); Pathway perturbation; microRNA analysis]
Bayesian networks
• Two components:
– G: a directed acyclic graph whose nodes are the random variables X1,…,Xn
– For each variable, a conditional probability distribution given its
precursors (parents) in G.
• Together, these two components define a unique joint distribution on X1,…,Xn.
• Markov assumption
• Each variable Xi is directly influenced only by the variables
that precede it in G (its parents): given its parents, Xi is
independent of its non-descendants. The joint distribution therefore factorizes as
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid parents(X_i))
Where:
parents(Xi) = set of precursors of Xi in G
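The way the graph and the conditional distributions together define one joint distribution can be checked on a tiny binary network. The three-node chain Raf → Mek → Erk and all probability values below are invented for illustration only; they are not taken from the slides or from any dataset.

```python
from itertools import product

# Graph G: each variable maps to the tuple of its parents (hypothetical).
parents = {"Raf": (), "Mek": ("Raf",), "Erk": ("Mek",)}

# cpt[X][parent values] = P(X = 1 | parents(X)); numbers are made up.
cpt = {
    "Raf": {(): 0.3},
    "Mek": {(0,): 0.1, (1,): 0.8},
    "Erk": {(0,): 0.2, (1,): 0.9},
}

def joint(assignment):
    """P(X1,...,Xn) as the product of P(Xi | parents(Xi))."""
    p = 1.0
    for var, pa in parents.items():
        p_one = cpt[var][tuple(assignment[u] for u in pa)]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p

# Summing over all 2^3 assignments must give 1.
total = sum(joint(dict(zip(parents, values)))
            for values in product((0, 1), repeat=len(parents)))
p_all_on = joint({"Raf": 1, "Mek": 1, "Erk": 1})  # 0.3 * 0.8 * 0.9
```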
Tools for Bayesian networks construction
• Banjo
Hartemink, A., Gifford, D., Jaakkola, T., & Young, R. (2001) “Using Graphical Models and Genomic
Expression Data to Statistically Validate Models of Genetic Regulatory Networks.” In Pacific
Symposium on Biocomputing 2001 (PSB01), Altman, R., Dunker, A.K., Hunter, L., Lauderdale, K., &
Klein, T., eds. World Scientific: New Jersey. pp. 422–433.
• Biolearn
– Dana Pe’er Lab
• http://www.c2b2.columbia.edu/danapeerlab/html/biolearn.html
Build a Bayesian network
[Figure: Bayesian network built from the MFI(K) set. Nodes: PKC, PKA, Raf, Jnk, P38, Mek, Erk, Akt]
Pathway Perturbation
• Our goal is to apply an analysis model that uses both
– the statistically significant number of differentially
expressed genes (or miRNAs)
– the biologically meaningful changes on a given pathway.
• Input: a set of pathways describing sub-systems of the given
organism that involve the given variables (genes).
S. Draghici, P. Khatri, A.L. Tarca, K. Amin, A. Done, C. Voichita, C. Georgescu, and R. Romero. A
systems biology approach for pathway level analysis. Genome Research, 17:1537-1545, 2007.
• Output
– Rank the sub-systems in decreasing order of the
amount of disruption suffered
– If possible, identify those sub‐systems for which the
disruption is “significant”
• Gene perturbation factor
– PF(g) = perturbation factor of gene g
– α = a priori type of impact expected from that gene
– ΔE(g) = change in expression level of g (fold change)
– US_g = set of genes directly upstream of g in the pathway
– N_ds(u) = number of genes directly downstream of u in
the pathway
– β_ug = efficiency of the connection between u and g
• Pathway perturbation factor
– N_de(Pi) = number of differentially expressed genes on
the given pathway Pi
– PF(g) = perturbation factor of the gene g
– mean fold change of the differentially expressed genes.
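The quantities listed above can be combined as in the sketch below. It assumes the form used in the cited impact-analysis work as recalled here: PF(g) = α(g)·ΔE(g) + Σ_{u ∈ US_g} β_ug · PF(u) / N_ds(u), and a pathway perturbation term Σ_g |PF(g)| / (mean |ΔE| · N_de(Pi)). The function names, the defaults α = 1 and β = 1, the use of the mean absolute fold change, and the restriction to acyclic pathways (so that PF can be computed by simple recursion) are assumptions of this example.

```python
def perturbation_factors(delta_e, upstream, alpha=None, beta=None):
    """delta_e: gene -> fold change; upstream: gene -> direct upstream genes.
    Assumes the pathway graph has no cycles."""
    alpha = alpha or {}
    beta = beta or {}
    n_ds = {}                                  # N_ds(u): direct downstream count
    for g, ups in upstream.items():
        for u in ups:
            n_ds[u] = n_ds.get(u, 0) + 1
    pf = {}

    def visit(g):
        if g not in pf:
            own = alpha.get(g, 1.0) * delta_e.get(g, 0.0)
            inherited = sum(beta.get((u, g), 1.0) * visit(u) / n_ds[u]
                            for u in upstream.get(g, ()))
            pf[g] = own + inherited
        return pf[g]

    for g in set(delta_e) | set(upstream):
        visit(g)
    return pf

def pathway_perturbation(pf, delta_e, de_genes):
    """Sum of |PF(g)| normalised by mean |fold change| and N_de."""
    mean_fc = sum(abs(delta_e[g]) for g in de_genes) / len(de_genes)
    return sum(abs(v) for v in pf.values()) / (mean_fc * len(de_genes))

# Toy pathway: A -> B, A -> C, B -> C (hypothetical genes and values).
upstream = {"B": ["A"], "C": ["A", "B"]}
delta_e = {"A": 2.0, "B": 1.0, "C": 0.0}
pf = perturbation_factors(delta_e, upstream)
score = pathway_perturbation(pf, delta_e, de_genes=["A", "B"])
```

Gene C is not differentially expressed itself, yet it receives a perturbation of 3.0 from its upstream genes, which is the point of propagating changes along the pathway.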
• In this model, the significance (p-value) of the impact factor IF
of a set of genes (for example those of an MFI belonging to Pi)
on a pathway Pi can be estimated by repeatedly
replacing that set with a random set of genes in Pi of
the same cardinality.
• The perturbation factor of Pi and this p-value together give
a measure of the relevance of the MFI on that
pathway.
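The random-replacement test can be sketched as a standard permutation p-value. The statistic used here (the sum of absolute per-gene scores over the set) is a stand-in for the impact factor, and the function name, the scores, and the number of permutations are assumptions of this example.

```python
import random

def set_pvalue(gene_set, pathway_scores, n_perm=1000, seed=0):
    """P-value of a gene set against random sets of the same size in Pi.
    pathway_scores: gene -> per-gene score (e.g. a perturbation factor)."""
    rng = random.Random(seed)
    genes = sorted(pathway_scores)

    def impact(selected):
        # ASSUMED statistic: total absolute score of the selected genes.
        return sum(abs(pathway_scores[g]) for g in selected)

    observed = impact(gene_set)
    hits = sum(impact(rng.sample(genes, len(gene_set))) >= observed
               for _ in range(n_perm))
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy pathway with 20 genes; the tested set holds the 3 largest scores.
pathway_scores = {f"g{i}": float(i) for i in range(20)}
mfi_genes = {"g17", "g18", "g19"}
p = set_pvalue(mfi_genes, pathway_scores)
```

A small p-value indicates that few random gene sets of the same size reach the impact of the tested set on that pathway.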