COT6930 Course Project
Outline
• Gene Selection
• Sequence Alignment
Why Gene Selection
• Identify marker genes that characterize
different tumor statuses.
• Many genes are redundant and introduce
noise that lowers performance.
• Can eventually lead to a diagnostic chip
(“breast cancer chip”, “liver cancer chip”).
Gene Selection
• Methods fall into three categories:
– Filter methods
– Wrapper methods
– Embedded methods
Filter methods are the simplest and the most
frequently used in the literature
Wrapper methods are likely the most accurate
ones
Filter Method
• Features (genes) are scored according to evidence
of predictive power and then ranked.
• The top s genes with the highest scores are selected
and used by the classifier (sketched below).
– Scores: t-statistics, F-statistics, signal-to-noise ratio, …
– The number of features selected, s, is then determined
by cross-validation.
• Advantage: Fast and easy to interpret.
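The filter approach fits in a few lines of code. Below is a minimal sketch in Python, assuming a gene-expression matrix X (samples × genes) and binary labels y; the function name, the choice of the t-statistic as the score, and the synthetic data are illustrative assumptions, not part of the project.

```python
import numpy as np
from scipy import stats

def filter_select(X, y, s):
    """Rank genes by two-sample t-statistic and keep the top s.

    X: (n_samples, n_genes) expression matrix
    y: binary labels (0/1), one per sample
    s: number of genes to keep (tuned by cross-validation)
    """
    # Score each gene independently: |t| between the two classes.
    t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    scores = np.abs(t)
    # Indices of the s highest-scoring genes.
    top = np.argsort(scores)[::-1][:s]
    return top, scores[top]

# Example: 100 samples, 10,000 genes, keep the 50 top-ranked genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))
y = rng.integers(0, 2, size=100)
selected, sel_scores = filter_select(X, y, s=50)
```

As the slide notes, s itself would be chosen by cross-validation rather than fixed in advance.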
Good versus bad features
Filter Method: Problem
• Genes are considered independently.
– Redundant genes may be included.
– Genes that are individually weak but jointly
have strong discriminant power will be
ignored.
• Good single features do not necessarily form a
good feature set.
• The filtering procedure is independent of
the classification method.
– The selected features can be applied to all types
of classification methods.
Wrapper Method
• Iterative search: many “feature subsets” are
scored based on classification performance, and
the best one is used.
– Select a good subset of features
• Subset selection: forward selection, backward
selection, and their combinations (see the sketch
after this list).
– Exhaustive search is impossible.
– Greedy algorithms are used instead.
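As a concrete illustration of greedy forward selection, here is a hedged sketch built on scikit-learn; the k-NN classifier, 5-fold cross-validation, and the stop-when-no-improvement rule are assumptions chosen for brevity, not prescribed by the slides.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select(X, y, max_features):
    """Greedily add the feature that most improves CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = 0.0
    clf = KNeighborsClassifier(n_neighbors=3)
    while remaining and len(selected) < max_features:
        # Score each candidate subset: current selection plus one new feature.
        scores = [
            (cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean(), f)
            for f in remaining
        ]
        score, best_f = max(scores)
        if score <= best_score:      # no candidate improves: stop early
            break
        best_score = score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_score
```

Note the cost the next slide warns about: every candidate subset triggers a full cross-validated classifier build.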
Wrapper Method: Problem
• Computationally expensive
– For each feature subset considered, the
classifier is built and evaluated.
• Exhaustive search is impossible
– Greedy search only.
• Easy to overfit.
Embedded Method
• Attempts to jointly (simultaneously) train
both a classifier and a feature subset.
• Often optimizes an objective function that
jointly rewards classification accuracy
and penalizes the use of more features.
• Intuitively appealing (see the illustration below).
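One classic embedded method, offered here purely as an illustration (the slides do not name a specific algorithm), is L1-regularized logistic regression: the penalty drives most gene weights to exactly zero, so training the classifier and selecting the feature subset happen in a single optimization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The L1 objective jointly rewards classification accuracy (the loss term)
# and penalizes the number of genes used (nonzero coefficients).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1_000))   # illustrative synthetic data
y = rng.integers(0, 2, size=100)
clf.fit(X, y)

# Genes with nonzero weight form the "selected" subset.
selected = np.flatnonzero(clf.coef_[0])
print(f"{len(selected)} genes selected out of {X.shape[1]}")
```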
Relief-F
• Relief-F is a filter approach for feature selection,
extending the original Relief algorithm.
Relief-F
• The original Relief can only handle binary classification
problems; an extension was made to handle the multi-class problem.
Relief-F
• Categorical attributes: diff(A, I1, I2) = 0 if the two values are equal, 1 otherwise
• Numerical attributes: diff(A, I1, I2) = |value(A, I1) − value(A, I2)| / (max(A) − min(A))
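A compact sketch of the original (binary-class) Relief update for numerical attributes, using the diff definition above; the variable names and the use of an L1 distance for finding neighbours are assumptions.

```python
import numpy as np

def relief(X, y, m, seed=0):
    """Original (binary-class) Relief weight estimation.

    X: (n, a) data matrix; y: 0/1 labels; m: # of sampled instances.
    Numerical diff: |x1 - x2| / (max - min), per attribute.
    """
    rng = np.random.default_rng(seed)
    n, a = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                    # guard against constant genes
    w = np.zeros(a)
    for i in rng.integers(0, n, size=m):
        d = np.abs(X - X[i]).sum(axis=1)     # L1 distance to instance i
        d[i] = np.inf                        # exclude the instance itself
        hit = np.where(y == y[i], d, np.inf).argmin()    # nearest same-class
        miss = np.where(y != y[i], d, np.inf).argmin()   # nearest other-class
        # Decrease weights by the diff to the hit, increase by the diff
        # to the miss: attributes that separate the classes gain weight.
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span / m
    return w
```

Relief-F additionally averages over the k nearest hits/misses and weights misses by class priors; that bookkeeping is omitted here.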
Relief-F Problem
• Time Complexity
– m × (m·a + c·m·a + a) = O(cm²a)
– Assume m = 100, c = 3, a = 10,000
– Time complexity ≈ 300×10⁶ operations
• Considers only one attribute at a time; cannot
select a subset of jointly “good” genes
Solution: Parallel Relief-F
• Version 1:
– Cluster nodes run Relief-F in parallel, and the
updated weight values are collected at the
master (sketched below).
– Theoretical time complexity O(cm²a/p)
• p is the # of cluster nodes
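A single-machine stand-in for Version 1, with Python processes playing the role of cluster nodes: each worker runs the relief() sketch shown earlier on its own share of the m sampled instances, and the parent process acts as the master that collects and sums the partial weight vectors. The even split of m and all names are assumptions.

```python
import numpy as np
from multiprocessing import Pool

def worker(args):
    # Each "node" runs Relief on its own share of sampled instances;
    # relief() is the sketch shown earlier.
    X, y, m_local, seed = args
    return relief(X, y, m_local, seed=seed)

def parallel_relief_v1(X, y, m, p):
    """Version 1: p workers run Relief independently; master sums weights."""
    jobs = [(X, y, m // p, seed) for seed in range(p)]
    with Pool(p) as pool:
        partials = pool.map(worker, jobs)
    return np.sum(partials, axis=0)   # weights collected at the master
```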
Parallel Relief-F
• Version 2:
– Cluster nodes run Relief-F in parallel, and each
node directly updates the global weight
values (sketched below).
– Each node also considers the current weight
values when selecting nearest-neighbour instances.
– Theoretical time complexity O(cm²a/p)
• p is the # of cluster nodes
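A rough single-machine sketch of the Version 2 idea: a multiprocessing.Array stands in for the cluster-wide global weight vector, each worker updates it directly under a lock, and the current weights bias the nearest-neighbour distance. The weighted-distance form, the locking discipline, and the unlocked reads are all assumptions; a real cluster would replace the shared array with distributed state.

```python
import numpy as np
from multiprocessing import Array, Lock, Process

def worker_v2(X, y, m_local, shared_w, lock, seed):
    rng = np.random.default_rng(seed)
    n, a = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    w = np.frombuffer(shared_w.get_obj())      # view of the global weights
    for i in rng.integers(0, n, size=m_local):
        # Weight-aware metric: attributes with higher current weight count more.
        scale = 1.0 + np.clip(w, 0.0, None)
        d = (np.abs(X - X[i]) * scale).sum(axis=1)
        d[i] = np.inf
        hit = np.where(y == y[i], d, np.inf).argmin()
        miss = np.where(y != y[i], d, np.inf).argmin()
        delta = (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span / m_local
        with lock:                              # directly update global weights
            w += delta

def parallel_relief_v2(X, y, m, p):
    shared_w = Array('d', X.shape[1])           # zero-initialised, shared
    lock = Lock()
    procs = [Process(target=worker_v2, args=(X, y, m // p, shared_w, lock, s))
             for s in range(p)]
    for pr in procs:
        pr.start()
    for pr in procs:
        pr.join()
    return np.frombuffer(shared_w.get_obj()).copy()
```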
Parallel Relief-F
• Version 3:
– Consider selecting a subset of important
features.
– Compare the difference between including and
excluding a specific feature to understand the
importance of a gene with respect to an
existing subset of features.
– Discussion in private!
Outline
• Gene Selection
• Sequence Alignment
– Given a dataset D with N = 1,000 sequences
(e.g., length 1,000 each)
– Given an input sequence x,
– Do pair-wise global sequence alignment
between x and all sequences in D
• Dispatch jobs to cluster nodes
• Aggregate the results (sketched below)
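A self-contained sketch of this pipeline: a textbook Needleman-Wunsch global alignment score (match +1, mismatch -1, gap -1; the scoring scheme is an assumption) is computed between x and every sequence in D, with a process pool standing in for the cluster, and the results are aggregated into a ranked list.

```python
import numpy as np
from functools import partial
from multiprocessing import Pool

def nw_score(x, s, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = np.arange(len(s) + 1, dtype=float) * gap   # first DP row
    for i, xi in enumerate(x, start=1):
        cur = np.empty_like(prev)
        cur[0] = i * gap
        for j, sj in enumerate(s, start=1):
            sub = match if xi == sj else mismatch
            cur[j] = max(prev[j - 1] + sub,   # align xi with sj
                         prev[j] + gap,       # gap in s
                         cur[j - 1] + gap)    # gap in x
        prev = cur
    return prev[-1]

def align_all(x, D, p=4):
    """Dispatch pair-wise alignments of x against D to p worker processes."""
    with Pool(p) as pool:
        scores = pool.map(partial(nw_score, x), D)
    # Aggregate: pair each sequence with its score, best match first.
    return sorted(zip(D, scores), key=lambda t: -t[1])

# Toy example: rank three sequences by alignment score to x.
if __name__ == "__main__":
    D = ["GATTACA", "GACTATA", "TTTTTTT"]
    print(align_all("GATTACA", D, p=2))
```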