Learning disjunctions in Geronimo's regression trees Felix Sanchez

Learning disjunctions in
Geronimo’s regression trees
Felix Sanchez Garcia
supervised by Prof. Dana Pe’er
Motivation
• Glioblastoma: most common primary brain tumour in adults.
• Newly diagnosed patients have an average survival of 1 year.
• Need for better models of the underlying regulatory network.
• Data used to create models: microarrays
– # genes ≈ 8000
– # candidate regulators ≈ 800
– # samples: 120
Module networks
• Bayesian model that benefits from high correlation of groups of
variables [2]
• Algorithm similar to EM, but with hard assignments. Loop:
– Module assignment step: assign variables to modules
– Structure search step: calculate the conditional probability distribution (CPD) for each module
[Figure: variables grouped into Module 1, Module 2, Module 3 and Module 4]
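The two-step loop above can be sketched in a few lines of Python. This is only a minimal illustration under strong assumptions: each module's CPD is replaced by a mean expression profile (so the loop reduces to k-means-style clustering), whereas real module networks learn a regression tree per module. The function name `hard_em_modules` and the toy data are hypothetical.

```python
def hard_em_modules(data, assignment, n_modules, max_iter=50):
    """Alternate between fitting modules and hard-assigning variables.

    data: list of expression profiles, one list of sample values per variable.
    assignment: initial module index for each variable (needed to start the loop).
    """
    n_samples = len(data[0])
    assignment = list(assignment)
    for _ in range(max_iter):
        # Stand-in for the structure search step: fit one mean profile per module.
        # (Assumption: a real module network would learn a regression tree here.)
        profiles = []
        for m in range(n_modules):
            members = [data[i] for i, a in enumerate(assignment) if a == m]
            if members:
                profiles.append([sum(col) / len(members) for col in zip(*members)])
            else:
                profiles.append([0.0] * n_samples)

        # Module assignment step: each variable moves to its best-fitting module
        # (hard decision, no soft responsibilities as in standard EM).
        new_assignment = []
        for row in data:
            errors = [sum((x - p) ** 2 for x, p in zip(row, prof)) for prof in profiles]
            new_assignment.append(errors.index(min(errors)))

        if new_assignment == assignment:
            break  # converged: no variable changed module
        assignment = new_assignment
    return assignment


# Toy data: two variables that rise together, two that fall together.
toy_data = [
    [1.0, 1.0, -1.0, -1.0],
    [1.1, 0.9, -1.0, -1.2],
    [-1.0, -1.0, 1.0, 1.0],
    [-0.9, -1.1, 1.2, 1.0],
]
modules = hard_em_modules(toy_data, assignment=[0, 1, 1, 1], n_modules=2)
```

Like any hard-decision procedure of this kind, the result depends on the initial assignment and may stop at a local optimum.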
Regression trees as CPD
• Regression trees are used for each module’s CPD
• Internal nodes: condition on a single variable
• Leaf nodes: parameters for normal distribution
• Bayesian score: combines a prior on structure (complexity + biological penalties) with the pdf of a normal-gamma
[Figure: example regression tree with node conditions x<0.3 and y>-0.2]
• The algorithm exhaustively calculates the score of every split for every regulator
[Figure: target gene’s values sorted by regulator]
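The exhaustive search can be illustrated as follows: for each regulator, sort the target gene's values by that regulator and score every split point. This is a hedged sketch, not Geronimo's actual scoring code. It uses the standard closed-form marginal likelihood of normally distributed data under a normal-gamma prior as the leaf score, leaves out the structure prior (complexity and biological penalties), and the hyperparameters `mu0`, `kappa0`, `alpha0`, `beta0` as well as all function and regulator names are assumptions made for illustration.

```python
import math


def log_marginal_normal_gamma(values, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Log marginal likelihood of `values` under a normal model with a
    normal-gamma prior on (mean, precision). Hyperparameters are illustrative."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    sum_sq = sum((v - mean) ** 2 for v in values)
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * sum_sq + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n)
    return (math.lgamma(alpha_n) - math.lgamma(alpha0)
            + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
            + 0.5 * (math.log(kappa0) - math.log(kappa_n))
            - 0.5 * n * math.log(2.0 * math.pi))


def best_split(target, regulators):
    """Try every split of every regulator; return (score, regulator, threshold).

    target: list of target-gene values, one per sample.
    regulators: dict mapping regulator name -> list of values, one per sample.
    """
    best = None
    for name, reg in regulators.items():
        # Sort the target gene's values by this regulator's expression.
        order = sorted(range(len(target)), key=lambda i: reg[i])
        sorted_reg = [reg[i] for i in order]
        sorted_target = [target[i] for i in order]
        for k in range(1, len(target)):
            if sorted_reg[k] == sorted_reg[k - 1]:
                continue  # no threshold separates equal regulator values
            score = (log_marginal_normal_gamma(sorted_target[:k])
                     + log_marginal_normal_gamma(sorted_target[k:]))
            threshold = 0.5 * (sorted_reg[k - 1] + sorted_reg[k])
            if best is None or score > best[0]:
                best = (score, name, threshold)
    return best


# Toy example with hypothetical regulators: one informative, one not.
target = [-2.0, -2.1, -1.9, 2.0, 2.1, 1.9]
regulators = {
    "reg_informative": [0.1, 0.2, 0.15, 0.8, 0.9, 0.85],
    "reg_noise": [0.5, 0.9, 0.1, 0.2, 0.8, 0.4],
}
score, regulator, threshold = best_split(target, regulators)
```

In this toy example the informative regulator wins, because splitting on it yields two leaves whose values are each tightly clustered.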
Incorporating pathway information
• Biological pathways: contain sets of genes and represent
chains of biochemical reactions that perform some function
• Aberrations in glioblastoma tend to occur as disjunctions
within pathways: deregulating one component is usually
enough to alter the function of the whole pathway [4]
• Idea: use pathway information to obtain a better model
• Methodology: extend node conditions to disjunctions of
conditions on pathway elements
• We will use 15 sets of regulators (20-30 genes per set)
– 5 sets of regulators of pathways known to be related to
cancer.
– 5 sets of regulators of other pathways
– 5 sets of regulators chosen at random
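To make the extended node condition concrete, the sketch below evaluates a disjunction of single-variable threshold conditions: a sample goes down the "true" branch if at least one condition holds. The representation of a condition as a `(variable, operator, threshold)` tuple and the function names are assumptions for illustration; the variables `x` and `y` and their thresholds simply reuse the conditions shown in the regression tree example.

```python
def disjunction_holds(sample, conditions):
    """Return True if any single-variable threshold condition is satisfied.

    sample: dict mapping variable name -> expression value.
    conditions: list of (variable, operator, threshold), operator "<" or ">".
    """
    for variable, operator, threshold in conditions:
        value = sample[variable]
        if operator == "<" and value < threshold:
            return True
        if operator == ">" and value > threshold:
            return True
    return False


def split_samples(samples, conditions):
    """Partition sample indices into the true and false branches of a node."""
    true_branch, false_branch = [], []
    for index, sample in enumerate(samples):
        if disjunction_holds(sample, conditions):
            true_branch.append(index)
        else:
            false_branch.append(index)
    return true_branch, false_branch


# Node condition: (x < 0.3) OR (y > -0.2)
node_conditions = [("x", "<", 0.3), ("y", ">", -0.2)]
samples = [
    {"x": 0.1, "y": -0.5},  # x condition holds
    {"x": 0.5, "y": 0.0},   # y condition holds
    {"x": 0.5, "y": -0.5},  # neither holds
]
true_branch, false_branch = split_samples(samples, node_conditions)
```

A single-variable node is the special case of a disjunction with one condition, so the same evaluation covers both kinds of internal node.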
Problem setting
• Concept class: disjunctions of threshold functions, each on a single
variable
• Loss function: negative Bayesian score (biological penalty?)
• Potential number of hypotheses: 2^{m}
• Related classification problem tackled by Marchand and Shah
(2005) and Kestler et al. (2006).
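The 2^{m} figure can be checked by brute force on a small case. The sketch below assumes that m counts the candidate single-variable threshold conditions and that each subset of them defines one disjunction (the empty subset included); this reading of m is an assumption, since the slide does not define it. The condition labels are placeholders.

```python
from itertools import combinations


def enumerate_disjunctions(conditions):
    """List every subset of the candidate conditions; each subset is one disjunction."""
    subsets = []
    for size in range(len(conditions) + 1):
        subsets.extend(combinations(conditions, size))
    return subsets


# Four placeholder conditions give 2^4 = 16 candidate disjunctions.
candidates = ["c1", "c2", "c3", "c4"]
all_disjunctions = enumerate_disjunctions(candidates)

# With one fixed condition per gene in a 25-gene set, the count is already 2^25.
hypotheses_for_25 = 2 ** 25
```

Enumeration is feasible for a handful of conditions, but at the 20-30 genes per regulator set mentioned earlier the space reaches millions to billions of hypotheses, which is why exhaustive scoring of disjunctions does not scale the way single-variable splits do.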
Bibliography
1. Pe'er, D., Bayesian Network Analysis of Signaling Networks: A Primer. Sci. STKE, 2005. 2005(281): p. pl4.
2. Segal, E., et al., Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet, 2003. 34(2): p. 166-176.
3. Lee, S.-I., et al., Identifying regulatory mechanisms using individual variation reveals key role for chromatin modification. Proceedings of the National Academy of Sciences, 2006. 103(38): p. 14062-14067.
4. The Cancer Genome Atlas Research Network, Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 2008. 455(7216): p. 1061-1068.
5. Kestler, H., W. Lindner, and A. Müller, Learning and Feature Selection Using the Set Covering Machine with Data-Dependent Rays on Gene Expression Profiles, in Artificial Neural Networks in Pattern Recognition. 2006. p. 286-297.
6. Marchand, M. and M. Shah, PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data, in Advances in Neural Information Processing Systems 17. 2005, MIT Press: Cambridge, MA. p. 881-888.