gene-associate

Download Report

Transcript gene-associate

Mining Association Rules from
Microarray Gene Expression Data
Association Rules
• Form: LHS => RHS, where LHS and
RHS are disjoint itemsets.
• Frequent itemset: supportT(I) ≥ α,
where supportT(I) is the number of
transactions in T that contains all the
items in I and α is the minimum
support.
2
Objective Measurements of Rule
Interestingness
• Support of the rule LHS => RHS:
Support(LHS U RHS), frequency that LHS
and RHS occur together in a transaction.
• Confidence of the rule LHS => RHS:
Support (LHS U RHS)/Support (LHS),
frequency that RHS occurs when LHS
occurs.
3
•
•
•
•
•
Association Rule Discovery (ARD)
algorithms
Originally used in market basket analysis
Unsupervised and domain independent
Example: Apriori algorithm (Agrawal,
1993)
Advantage: Finds all associations
Disadvantage: Very large number of
associations
4
Microarray technology
•
•
•
•
Measurement of gene expression levels in cells
Simultaneous measurement of thousands of genes
Facilitates the study of gene interactions
Expression profile=set of values for different conditions
–
–
•
One condition = one slide
condition: tissue, treatment or time point
Data matrices
5
Rules Applied to Gene Expression
• Each gene expression experiment is a single
transaction and each gene is an item.
– Gene value may be numerical, may need to be binned
as being up (expressed), down (repressed), or neither.
• Items in a gene expression transaction can also
include relevant facts describing the cellular
environment.
• Find frequent itemsets: apply Apriori algorithm.
• Generate association rules from the frequent
itemsets.
6
Example - association rules
• Database
T1:
T2:
T3:
T4:
T5:
A
A
A
A
A
B
B
B
B
B
C
C
C
C
C
D
D
D
D
D
E
E
E
E
E
• Rules
R1:
R2:
R3:
A  B  C  E (sup=0.40, conf=0.67)
A  B  D (sup=0.40, conf=1.00)
A  B (sup=0.60, conf=1.00)
7
Rule induction algorithms
G1 G2 G3 G4 Class
S1     Cancer
S2     NonCancer
. . . . . .
. . . . . .
. . . . . .
Sn     NonCancer
G1
1, is Gene 1, S1 is sample

andmeans overexpressed ,
 means underexpressed.
Association rule discovery
algorithm
*
Rule set
R1 : G1  G3  Cancer
R2 : G1  G4  Non-Cancer
R3: G1  G3   G2 
Cancer
8
Forming Association rules
• Any frequent itemset of size greater than
one can be divided into two itemsets, LHS
and RHS.
• Using objective measures: If the
confidence of a candidate rule exceeds a
speficied minimum confidence criterion,
the rule may be included.
• A very large set of rules is usually
generated.
9
Example
• 28 treatments were recorded on 28
microarray chips– 28 transactions.
• Each chip contains expression levels of
approximately 6200 genes (items).
• After Apriori algorithm, 70,000,000 rules
were generated.
10
Subjective Measures for Selecting
Rules
• Select genes before finding frequent items –
limit the number of items.
• Limit the size of LHS or RHS
E.g., LHS contains only one item.
• Domain specific measures -- Use knowledge
about the domain under study to specify
patterns.
11
Rule Filtering and Group
(Tuzhilin and Adomavicius 2002)
• Rule templates – specify restrictions on the
combinations of genes and their expression levels that
can appear in the body and head of the rule.
RulePart HAS Quantifier OF C1, C2, ..., CN [ONLY]
– RulePart: BODY, HEAD, or RULE.
– C1, C2, ..., CN is a comparison set, representing a list of genes
against which the discovered rules will be compared:
•
•
•
•
•
A gene, e.g., G17
A gene with a particular expression level, e.g., G17↑
A group (category) of genes, e.g., [DNA_Repair]
A group of genes with an expression level, e.g., [DNA_Repair]
A group of genes with a list of allowable or unallowable expression
levels, e.g., [DNA_Repair] = {↑, #}
12
Rule Filtering con’t
• Quantifier: a keyword or an expression
specifying how many genes specified by C1,
C2, ..., CN List have to be contained in RulePart.
– ALL, ANY, NONE, specifying the number of
genes from C1, C2, …, CN the RulePart must have
– A numeric value; e.g., 2, specifying a rule must
have exactly 2 genes from the comparison set
– A range of numeric values; e.g., 1,3,5-7
• ONLY is used to indicate RulePart can have
only the genes in the C1, C2, …, CN list.
13
Rule Filtering con’t
• All rules that contain at least one of the
following genes: G1, G5, G7:
– RULE HAS (ANY) of G1, G5, G7,
– Matching rules: G1↑ => G3↑.
• "When genes involved in the DNA repair are upregulated,
what other gene categories are also up- or
downregulated?"
– BODY HAS (ANY) OF [DNA_Repair] AND HEAD HAS
(ANY) OF [All_Genes]={, }
14
Macro Templates
(Tuzhilin and Adomavicius 2002)
• Detect unexpected rules :
CONTRADICT (GeneExprSet, G, ExpLevel)=
BODY HAS (ALL) of GeneExprSet AND
HEAD HAS (ALL) OF G ≠ {ExpLevel}
– CONTRADICT({G1, G2},G4,}
– Unexpected rule: G1G2G3 G4
15
Rule Grouping
• Group similar rules together into classes to
be analyzed
• Gene hierarchy: group genes based on
their functions.
ALL
F1
F2
G1 G2 G3 G4
G5
16
Rule Grouping con’t
• Aggregated rules
– Groups: F1={G1,G2,G3}, F2={G4,G5}
– Rules: R1={G1G4}, R2={G1G5},
R3={G1G3G5}
– Aggregated rule: R=F1F2, R'=F1F2
– {R1,R2,R3}R, {R1,R2}R', R3R'
17
Rule Derivation Procedure
Database of transactions
Some features may be irrelevant or a
reduction may be required for
efficiency reasons
Feature selection
Reduced database
Rule induction
A rule induction algorithm is applied to
the database (e.g. association rule
algorithm or decision tree algorithm)
Initial rule set
Rule selection
Rules are assessed and ranked using
different measures of interestingness
Relevant rule set
Rule validation
The rules need to be validated to be
accepted as knowledge. Can be done by
more detailed biological experiments in
the context of gene expression data
Rules representing knowledge
18
Mining Association Rules from
Clinical Data
Association Rules in Medical Data
• Medical record data
– Millions of “claims” for medical procedures
– Each patient may have several claims
– Each claim may have several line items or records,
one for each procedure performed.
– A diagnosis was reported with each procedure.
• Data: patient code, procedure code, diagnosis
code
• Goal: discover relationships between procedure
performed on a patient and the reported
diagnosis.
20
Con’t
• Data items: the set of all procedure and diagnosis
codes (7,365 + 9,383=16,748)
• Transaction: the set of procedure and diagnosis
codes for each patient (1,257,645 patients)
– May cause unexpected rule: {cast} => {heart disease}
• Each itemset consists of one or more
procedure/diagnosis codes.
• The support of an itemset I is the number of
patients whose set of items include all the items in
I (> 1%).
21
Formulating Rules
• Applying Aprior algorithm to generate all
frequent itemsets.
• Restricting to those frequent itemsets which
contain both procedures and diagnoses
• For each selected frequent itemset, one rule is
formulated with all procedure codes on the left
and all diagnosis codes on the right.
• Computing confidence and eliminating rules
with confidence less than 65%.
22