Some slides from Week 7


CSC 556 – DBMS II, Spring 2013, Week 7
Bayesian Inference
Paul Graham’s Plan for Spam, +
A Patent Application for Learning Mobile Preferences, + some text-related matters
Bayes and Incremental (Lazy) Learning
• Bayes’ formula
• P(selected | K) = (P(K | selected) x P(selected)) / P(K)
• Definitions of the terms on the right-hand side:
– A priori probability of events
• P(selected) = size of SelectedSet / size of InspectedSet
• P(K) = count of K in InspectedSet / size of InspectedSet
– A posteriori probability (conditional probability)
• P(K|selected) = count of K in SelectedSet / size of SelectedSet
• InspectedSet contains the available alternatives that the mobile user
inspected. It corresponds to all emails seen in Graham.
• SelectedSet contains the available alternatives contacted by the mobile user.
It corresponds to the spam vs. non-spam choices in Graham (see the sketch below).
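
A minimal sketch of the formula above, assuming simple count bookkeeping for the two sets; the function and variable names are mine, not the patent's.

    # Minimal sketch of P(selected | K) from raw counts; assumes the counts for
    # SelectedSet and InspectedSet are already maintained. Names are illustrative.
    def p_selected_given_k(k_in_selected, selected_size, k_in_inspected, inspected_size):
        p_selected = selected_size / inspected_size             # a priori P(selected)
        p_k = k_in_inspected / inspected_size                   # a priori P(K)
        p_k_given_selected = k_in_selected / selected_size      # a posteriori P(K | selected)
        return p_k_given_selected * p_selected / p_k

    # Keyword K appears in 3 of 5 selected alternatives and 4 of 20 inspected ones:
    print(p_selected_given_k(3, 5, 4, 20))   # (0.6 * 0.25) / 0.2 = 0.75
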
Graham’s Formula (Hackers and Painters)
• rg = min(1, 2 * (good(w) / G))
• rb = min(1, bad(w) / B)
• Pspam|w = max(.01, min(.99, rb / (rg + rb)))
• w is a token from an email; good(w) and bad(w) are occurrence counts of w in
good & bad emails; G and B are the numbers of good & bad emails.
• The factor of 2 biases against false positives (don’t over-discard); see the sketch below.
• False positives are non-linearly undesirable.
• min & max related to Laplace Estimator (textbook)
Graham continued
• Graham uses the 15 most interesting tokens.
• They occur > 5 times and have the greatest distance from .5.
• Pspam = (∏i=1..15 Pspam|wi) / (∏i=1..15 Pspam|wi + ∏i=1..15 (1 − Pspam|wi))
• Assign a .4 spam probability to unknown tokens.
• Treat an email as spam if Pspam > .9 (a combining sketch follows this slide).
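
A sketch of combining the per-token probabilities into one spam score, assuming the 15 most interesting values are already chosen; math.prod requires Python 3.8+.

    # Combine the most interesting tokens' probabilities into one spam score.
    import math

    def combined_spam_probability(token_probs):
        prod_p = math.prod(token_probs)                     # product of Pspam|wi
        prod_q = math.prod(1 - p for p in token_probs)      # product of (1 - Pspam|wi)
        return prod_p / (prod_p + prod_q)

    # Three strongly "spammy" tokens and one near-neutral one:
    print(combined_spam_probability([0.99, 0.95, 0.90, 0.40]))   # ~0.9999, over the .9 threshold
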
Spam of the Future (is now)
“Assuming they could solve the problem of the headers, the
spam of the future will probably look something like this:
Hey there. Thought you should check out the following:
http://www.27meg.com/foo
because that is about as much sales pitch as content-based
filtering will leave the spammer room to make. (Indeed, it will be
hard even to get this past filters, because if everything else in the
email is neutral, the spam probability will hinge on the url, and it
will take some effort to make that look neutral.)”
See my example from March 12.
Better Bayesian Filtering (1/2003)
• Train on lots and lots of data.
• Don’t ignore headers, don’t stem tokens.
• Related to textbook section 3.5 (n-grams) for small n ≈ 3.
• Use only the most significant entropy-reducing
tokens (Graham’s 15).
• Bias against false positives.
• Graham spends a lot of time talking about tokenization. Data cleaning, data
to include, and data format are all key & work-intensive (see the tokenizer sketch below).
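
A hedged sketch of a Graham-style tokenizer: no stemming, case preserved, headers fed through the same tokenizer as the body. The exact token definition here is my assumption, not Graham's published tokenizer.

    # Keep case, keep $ and ' and - inside tokens, never stem.
    import re

    TOKEN_RE = re.compile(r"[A-Za-z0-9$'\-]+")

    def tokenize(raw_message):
        return TOKEN_RE.findall(raw_message)

    print(tokenize("Subject: FREE!!! Click now to save $100"))
    # ['Subject', 'FREE', 'Click', 'now', 'to', 'save', '$100']
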
Parson, Glossner & Jinturkar
• Mobile users have direct associations to their
“friends and relatives” (and establishments, etc.).
• These directional associations have strengths 0.0
through 1.0 borrowed from spreading activation.
• http://en.wikipedia.org/wiki/Spreading_activation
• http://en.wikipedia.org/wiki/Semantic_network
• When arriving / departing, weighted associations scale with
transitive strength × proximity.
• The InspectedSet consists of the alternatives inspected by the user on the
handset (a spreading-activation sketch follows this slide).
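
A hedged sketch of spreading activation over directional, strength-weighted associations. This is a generic illustration of the borrowed idea, not the patent's algorithm; the decay factor, threshold, and example graph are assumptions.

    def spread_activation(graph, source, decay=0.8, threshold=0.1):
        """graph maps node -> [(neighbor, strength in 0.0..1.0), ...] (directed edges)."""
        activation = {source: 1.0}
        frontier = [source]
        while frontier:
            node = frontier.pop()
            for neighbor, strength in graph.get(node, []):
                value = activation[node] * strength * decay   # transitive strength with decay
                if value > threshold and value > activation.get(neighbor, 0.0):
                    activation[neighbor] = value
                    frontier.append(neighbor)
        return activation

    friends = {"me": [("alice", 0.9), ("bob", 0.4)], "alice": [("cafe", 0.7)]}
    print(spread_activation(friends, "me"))
    # {'me': 1.0, 'alice': 0.72, 'bob': 0.32, 'cafe': ~0.40}
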
Centile Ranking
• For learning Associations, Preferences, and for
Adjusting their Weights
• Raw Score: NumberOfTimesSelected / NumberOfTimesInspected.
• Sort by Raw Score. Form Equivalence Classes.
• Position in Sort gives Percentile Rank (CentileWeight).
• NewStrength = ((OldStrength * DragFactor) + CentileWeight) / (DragFactor + 1)
• New Preference Learning ranks Keywords according to the Association
weights that reach those Keywords, applies Centile Ranking to the top
“selected” Keywords, then forms a single Keyword query formula for the
top-ranked Keywords (a sketch of the update follows this slide).
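
A minimal sketch of the centile-ranking update above. The tie handling (equivalence classes share one rank) and the percentile scaling are my assumptions; the slide only specifies the raw score and the NewStrength formula.

    def centile_weights(raw_scores):
        """raw_scores maps item -> NumberOfTimesSelected / NumberOfTimesInspected."""
        distinct = sorted(set(raw_scores.values()))              # equivalence classes
        rank = {score: i for i, score in enumerate(distinct)}
        top = max(1, len(distinct) - 1)
        return {item: rank[score] / top for item, score in raw_scores.items()}

    def new_strength(old_strength, centile_weight, drag_factor=3.0):
        return (old_strength * drag_factor + centile_weight) / (drag_factor + 1)

    scores = {"alice": 0.8, "bob": 0.5, "cafe": 0.5, "gym": 0.1}
    weights = centile_weights(scores)            # alice 1.0, bob/cafe 0.5, gym 0.0
    print(new_strength(0.6, weights["alice"]))   # (0.6*3 + 1.0) / 4 = 0.7
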
Bayesian Preference Rules
• P(selected | K) = (P(K | selected) x P(selected)) / P(K)
• P(selected) = size of SelectedSet / size of InspectedSet
• P(K) = count of K in InspectedSet / size of InspectedSet
• P(K|selected) = count of K in SelectedSet / size of SelectedSet
• InspectedSet are the available, inspected alternatives.
• SelectedSet are the available alternatives contacted by the mobile user.
• NewStrength = ((OldStrength x DragFactor) + P(selected | K)) / (DragFactor + 1)
• Use only Keywords reached via Associations, i.e., contacts that were
actually made. The top percentage are treated as “selected.” This avoids storing
data for unselected keywords. Multiple keywords connected by and, or, minus are
possible; and is implemented (see the sketch after this slide).
• Garbage collect weak Associations & Preferences (via Threshold).
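
A sketch of the Bayesian strength update, where P(selected | K) would come from the counts as in the earlier Bayes sketch; the DragFactor value and example numbers are made up.

    def new_strength(old_strength, p_selected_given_k, drag_factor=3.0):
        # DragFactor damps change: the old strength dominates, and the new
        # evidence nudges it toward P(selected | K).
        return (old_strength * drag_factor + p_selected_given_k) / (drag_factor + 1)

    print(new_strength(0.5, 0.75))   # (0.5*3 + 0.75) / 4 = 0.5625
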
Deficiencies of Weka’s relational approach
• Some attributes are naturally sets of values
from a domain. Some are counted multisets.
• Some are naturally maps K → V.
• Multi-instance classifiers in Weka are only a small subset.
• They do not address multi-value attributes.
• Text mining is also outside Weka’s scope.
Data Mining Scrabble
• It requires text processing.
• Text is not conventional prose.
• It poses problems for data representation /
normalization.
• It will support multi-instance classification.
• Each “play” in a game is an instance of attributes.
• Plays across games have relationships.
• It may support analysis of social interactions.
What to collect?
• Each instance contains gameID, moveNumber,
playerID, row or column of play, pointGain,
pointSum, remainderInTileBag, isWinner/rank.
• These are conventional scalar attribute values.
• Multi-value attributes are harder.
• There are zero or more PLAYED, LEAVED, CROSSED, BORROWED
(from sides) and SWAPPED letters per play.
• Bindings for blanks and 3word, 2word, 3letter, 2letter bonus locations
vary in how many are available and in what they bind to (word vs. letter).
• How do we deal with strings, fragments, and letter blends? Contiguous
consonants and vowels? (A representation sketch for a single play follows.)
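
A sketch of one "play" instance with both conventional scalar attributes and the harder multi-value attributes (lists of letters, a map of bonus bindings). Only a subset of the slide's fields is shown, and the class itself is an illustration, not a Weka schema.

    from collections import Counter
    from dataclasses import dataclass, field

    @dataclass
    class Play:
        game_id: int
        move_number: int
        player_id: str
        point_gain: int
        point_sum: int
        remainder_in_tile_bag: int
        played: list = field(default_factory=list)           # letters placed this play
        leaved: list = field(default_factory=list)           # letters left on the rack
        crossed: list = field(default_factory=list)          # letters crossed / borrowed
        bonus_bindings: dict = field(default_factory=dict)   # e.g. {"3letter": "Q"}

    p = Play(7, 3, "alice", 24, 61, 52, played=list("QUA"), leaved=list("EST"),
             bonus_bindings={"3letter": "Q"})
    print(Counter(p.played + p.leaved))   # counted multiset of letters in the play
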