A Unified Model of Spam Filtration

A Unified Model of Spam Filtration
William Yerazunis (1), Shalendra Chhabra (2), Christian Siefkes (3), Fidelis Assis (4), Dimitrios Gunopulos (2)
(1) Mitsubishi Electric Research Labs, Cambridge, MA
(2) University of California, Riverside, CA
(3) GKVI / FU Berlin, Germany
(4) Embratel, Rio de Janeiro, Brazil
“Spam is a Nuisance”
Spam Statistics
The Problem of Spam Filtering
Given a large set of email readers (who wish to communicate reciprocally, without prearrangement, with each other) and another set of email spammers (who wish to communicate with readers who do not wish to communicate with the spammers), the filtering actions are the steps readers take to “maximize their desired communication and minimize their undesired communications”.
Motivation
All current spam filters have very similar
performance – why?
Can we learn anything about current filters
by modeling them?
Does the model point out any obvious flaws
in our current designs?
What do you MEAN?
Similar Performance?
Gordon Cormack's “Mr. X” test suite – all
filters (heuristic, statistical) get about 99%
accuracy.
Large variance of reported real-user results
for all filters – everything from “nearly
perfect” to “unusable” on the same
software.
Conclusion: This problem is evil.
Method
Examine a large number of spam filters
Think Real Hard (Feynman method of deduction)
Come up with a unified model for spam
filters
Examine the model for weaknesses and
scalability
Consider remedies for the weaknesses
The Model
Models information flow
Considers stateful vs. stateless computations and computational complexity
Simplification: training is usually done “offline”, so we assume a stationary filter configuration (no learning during filtering) for the computational model.
Empirical Stages in Filtering Pipeline
Preprocessing – text-to-text transform
Tokenization
Feature Generation
Statistics Table Lookup
Mathematical Combination
Thresholding
Everybody does it this way!
Including SpamAssassin
The Obligatory Flowchart
Note that this is now a pipeline – anyone for a scalable solution?
Step 1: Preprocessing: Arbitrary Text-to-Text Transformation
Character Set Folding / Case Folding
Stopword Removal
MIME Normalization / Base64 Decoding
HTML Decommenting
Hypertextus Interruptus
Heuristic Tagging
  “FORGED_OUTLOOK_TAGS”
Identifying Lookalike Transformations
  ‘@’ instead of ‘a’, ‘$’ instead of ‘S’
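A minimal Python sketch of a few of these transforms. The function name and the lookalike substitution table are illustrative assumptions, not taken from any particular filter's source:

import re

LOOKALIKES = {"@": "a", "$": "s", "0": "o", "1": "l"}    # assumed mapping

def preprocess(text):
    text = text.lower()                                  # case folding
    text = re.sub(r"<!--.*?-->", "", text, flags=re.S)   # HTML decommenting
    text = text.translate(str.maketrans(LOOKALIKES))     # undo lookalike obfuscation
    return text

print(preprocess("V1AGR@ <!-- sneaky comment --> $ALE"))   # -> "vlagra  sale"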
Step 2: Tokenization
Converting incoming text (and heuristic tags) into features
Two-step process:
  Use a regular expression (regex) to segment the incoming text into tokens.
  Then convert this token stream T into features.
This conversion is not necessarily 1:1
Step 3: Converting Tokens into Features
A “Token” is a piece of text that meets the
requirements of the tokenizing regex
A “Feature” is a mostly-unique identifier that the
filter's trained database can convert into a
mathematically manipulable value
Converting Tokens into Features
First: convert text to an arbitrary integer
For each unknown text token, convert that text into a (preferably unique) arbitrary integer.
This conversion can be done by dictionary lookup (e.g. “foo” is the 82,934th word in the dictionary) or by hashing.
The output of this stage is a stream of unique integer IDs. Call this row vector T.
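A minimal Python sketch of the two-step conversion: a regex segments the text into tokens, and each token is hashed to an integer ID. The regex and the choice of crc32 are illustrative assumptions, not part of the model:

import re
import zlib

TOKEN_RE = re.compile(r"\S+")     # tokens = maximal runs of non-whitespace

def tokenize(text):
    tokens = TOKEN_RE.findall(text)                             # step 1: regex segmentation
    return [zlib.crc32(tok.encode("utf-8")) for tok in tokens]  # step 2: token -> integer ID

T = tokenize("buy cheap meds now")
print(T)    # the row vector T of (mostly unique) integer IDs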
Second: Matrix-based Feature Generation
Use a rectangular feature generation profile matrix P; each column of P is a vector containing small primes (or zero).
The dot product of each column of P against a segment of the token stream T is a unique encoding of a segment of the feature stream at that position.
The number of features out per element of T is equal to the number of columns of P.
Zero elements in a P column indicate that the token in that position is disregarded in this particular feature.
Matrix-based Feature Generation
Example – Unigram + Bigram Features
A P matrix for unigram plus bigram features is:
  1 2
  0 3
If the incoming token stream T is 10, 20, 30... then successive dot products of T•P yield:
  10 * 1 + 20 * 0 = 10  <-- first position, first column
  10 * 2 + 20 * 3 = 80  <-- first position, second column
  20 * 1 + 30 * 0 = 20  <-- second position, first column
  20 * 2 + 30 * 3 = 130 <-- second position, second column
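A small Python sketch that reproduces this example: slide a window of the token stream T over the columns of the profile matrix P and take dot products. The function name and layout are illustrative only:

P = [[1, 2],
     [0, 3]]          # unigram + bigram profile, stored as rows

def generate_features(T, P):
    window = len(P)                   # rows of P = window length
    n_cols = len(P[0])                # one feature per column per position
    features = []
    for i in range(len(T) - window + 1):
        segment = T[i:i + window]
        for c in range(n_cols):
            features.append(sum(t * P[r][c] for r, t in enumerate(segment)))
    return features

print(generate_features([10, 20, 30], P))   # -> [10, 80, 20, 130]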
Matrix-based Feature Generation
Nice Side Effects
Zeroes in a P matrix indicate “disregard this location in the token stream”.
Identical primes in a column of a P matrix allow order-invariant feature outputs, e.g.:
  1 2
  0 2
generates the unigram features, AND the set of bigrams in an order-invariant style (i.e. “foo bar” == “bar foo”).
Matrix-based Feature Generation
Nice Side Effects II
Identical primes in different columns of a P matrix allow order-sensitive, distance-invariant feature outputs, e.g.:
  1 2 2
  0 3 0
  0 0 3
generates the unigram features and the set of bigrams in an order-sensitive, distance-invariant style (i.e. “foo bar” == “foo <skip> bar”).
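A standalone check (with made-up token IDs) that the two bigram columns of this P matrix emit the same feature value whether or not a token is skipped between “foo” and “bar”:

P_SKIP = [[1, 2, 2],
          [0, 3, 0],
          [0, 0, 3]]

def window_features(segment, P):
    """Dot each column of P against one window of the token stream."""
    return [sum(t * P[r][c] for r, t in enumerate(segment))
            for c in range(len(P[0]))]

FOO, BAR, SKIP = 7, 11, 13
print(window_features([FOO, BAR, SKIP], P_SKIP))   # [7, 47, 53]: column 2 gives 2*7 + 3*11 = 47
print(window_features([FOO, SKIP, BAR], P_SKIP))   # [7, 53, 47]: column 3 gives the same 47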
Step 4: Feature Weighting by Lookup
Feature lookup tables are pre-built by the learning side of the filter, e.g.:
Strict Probability:
  LocalWeight = TimesSeenInClass / TimesSeenOverAllClasses
Buffered Probability:
  LocalWeight = TimesSeenInClass / (TimesSeenOverAllClasses + Constant)
Document Count Certainty:
  Weight = (TimesSeenInThisClass * DocumentsTrainedIntoThisClass) / ((TimesSeenOverAllClasses + Constant) * TotalDocumentsActuallyTrained)
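A Python sketch of these three weightings. The per-class counts would normally come from the learner's lookup table; the smoothing constant is an assumed value:

def strict_probability(seen_in_class, seen_all_classes):
    return seen_in_class / seen_all_classes

def buffered_probability(seen_in_class, seen_all_classes, constant=1.0):
    return seen_in_class / (seen_all_classes + constant)

def doc_count_certainty(seen_in_class, seen_all_classes,
                        docs_in_class, docs_total, constant=1.0):
    return (seen_in_class * docs_in_class) / \
           ((seen_all_classes + constant) * docs_total)

print(strict_probability(3, 4))             # 0.75
print(buffered_probability(3, 4))           # 0.6
print(doc_count_certainty(3, 4, 50, 100))   # 0.3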
In This Model, Weight Generators Do Not Have To Be Statistical
SpamAssassin: weights generated by a genetic optimization algorithm
SVM: weights generated by linear algebra
Winnow Algorithm:
  uses additive weights stored in the database
  each feature's weight starts at 1.0000
  weights are multiplied by a promotion / demotion factor if the results are below / above a predefined threshold during learning
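A minimal sketch of the Winnow-style update just described. The promotion and demotion factors and the threshold are illustrative values, not taken from any specific implementation:

weights = {}                                   # feature id -> weight, starts at 1.0
PROMOTE, DEMOTE, THRESHOLD = 1.25, 0.8, 5.0

def score(features):
    return sum(weights.setdefault(f, 1.0) for f in features)

def train(features, is_spam):
    s = score(features)
    if is_spam and s <= THRESHOLD:             # scored too low: promote active features
        for f in features:
            weights[f] *= PROMOTE
    elif not is_spam and s > THRESHOLD:        # scored too high: demote active features
        for f in features:
            weights[f] *= DEMOTE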
Unique Features --> Nonuniform Weights
Because all features are unique, we can have nonuniform weighting schemes:
  Nonuniform weights can be compiled into the weighting lookup tables
  A row vector of weights can be passed in from the feature generation column vectors
Step 5: Weight Combination (1st Stateful Step)
Winnow combining
Bayesian combining
Chi-squared combining
Sorted weight approach – “only use the most extreme N weights found in the document”
Uniqueness – use only the 1st occurrence of any feature
Output: a state vector – may be of length 1, or longer, and may be nonuniform.
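As one concrete combiner, a sketch of chi-squared (Fisher-style) combining over per-feature spam probabilities, in the spirit of the statistical filters listed above. The survival-function helper is a textbook approximation for even degrees of freedom, not code from any particular filter:

import math

def chi2q(x2, dof):
    """Survival function of the chi-squared distribution (even dof only)."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi2_combine(probs):
    n = len(probs)
    s = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)        # spam evidence
    h = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)  # ham evidence
    return (1.0 + s - h) / 2.0    # near 1 = spam, near 0 = ham, 0.5 = unsure

print(chi2_combine([0.99, 0.95, 0.90, 0.60]))   # well above 0.5: spammy
print(chi2_combine([0.01, 0.05, 0.10, 0.40]))   # well below 0.5: hammy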
Final Threshold
Comparison to either a fixed threshold, or
Comparison of one part of the output state to another part of the output state (useful when the underlying mathematical model includes a null hypothesis)
Emulation of Other Filtering Methods in the Unified Filtration Model
Emulating Whitelists and Blacklists:
  Voting-style whitelists
  Prioritized-rule blacklists
--> Compile entries into lookup tables with superincreasing weights for each whitelisted or blacklisted term
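A toy sketch of the superincreasing-weight compilation: each entry's weight exceeds the sum of all lower-priority weights, so one higher-priority match outvotes every lower-priority match. The terms, weights, and threshold convention are invented for illustration:

rules = [("offer", -1), ("mortgage", -1), ("boss@example.com", +1)]  # low -> high priority

weights, magnitude = {}, 1.0
for term, sign in rules:
    weights[term] = sign * magnitude
    magnitude *= 2.0                 # superincreasing: 1, 2, 4, ...

def classify(tokens):
    return "ham" if sum(weights.get(t, 0.0) for t in tokens) >= 0 else "spam"

print(classify(["mortgage", "offer", "boss@example.com"]))   # whitelist entry wins -> ham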
Emulation of Other Filtering Methods in the Unified Filtration Model
Emulating Heuristic Filters:
  Use text-to-text translation to insert a text tag string corresponding to each heuristic satisfied
  Compile entries in the lookup table corresponding to the positive or negative weight of each heuristic text tag string; ignore all other text
  Sum outputs and threshold
This is the SpamAssassin model
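A toy sketch of this emulation: the preprocessing stage appends a tag string per satisfied heuristic, the lookup table holds weights only for those tags, and the combiner sums and thresholds. The rules, tags, weights, and threshold are invented, not SpamAssassin's:

import re

HEURISTICS = [(r"(?i)viagra", "TAG_DRUG_SPAM"), (r"<!--", "TAG_HTML_COMMENT")]
TAG_WEIGHTS = {"TAG_DRUG_SPAM": 2.5, "TAG_HTML_COMMENT": 0.5}   # all other tokens ignored
THRESHOLD = 3.0

def insert_tags(text):
    return text + " " + " ".join(tag for rx, tag in HEURISTICS if re.search(rx, text))

def is_spam(text):
    score = sum(TAG_WEIGHTS.get(tok, 0.0) for tok in insert_tags(text).split())
    return score >= THRESHOLD

print(is_spam("Cheap ViAgRa <!-- hide --> here"))   # 2.5 + 0.5 = 3.0 -> True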
Emulation of Other Filtering Methods in the Unified Filtration Model
Emulating Bayesian Filters:
  Text-to-text translation as desired
  Use a unigram (SpamBayes) or digram (DSPAM) P matrix to generate the features
  Preload the lookup tables with local probabilities
  Use Bayes' Rule or chi-squared to combine probabilities, and threshold at some value
This is just about every statistical filter...
Conclusions
Everybody's filter is more or less the same!
But filtration on only the text limits our information horizon.
We need to broaden the information input horizon...
...but from where? How?
Conclusions
Better information input – examples:
CAMRAM – use outgoing email as an automatic whitelist
Smart Squid – look at your web surfing to infer what might be legitimate email (say, receipts from online merchants)
Honeypots – sources of new spam and of newly compromised zombie spambots (a.k.a. realtime blacklists, inoculation, peer-to-peer filter systems). Large ISPs have a big statistical advantage here over single users.
Thank You!
questions or comments?