
Multilayer Filtering
or
The Dangerous Economics of Spam Control
2008 MIT Spam Conference
By
Alena Kimakova and Reza Rajabiun
York University and COMDOM Software
Toronto, Canada
I.1 Spam as an empirical problem
Two historical observations (2002-2008)
A)
Spam ratio in 2002 = 20%-30% of all email messages
Spam ratio in 2008 = 70%-90% of all email messages
Increased sophistication of spam (PDF, image, search-engine spam,
etc.)
B) Increased sophistication and accuracy of statistical
content filters: 98% accuracy, 0.1% false positives
(Cormack and Lynam, 2007)
Empirical puzzle:
Why more spam after the adoption of technical and regulatory
countermeasures?
I.2 Methodology: Positive analysis
How can we explain the growth and sophistication of spam?
Hypothesis: A technological trade-off between speed and accuracy
facing network owners and operators.
Approach combines:
A) Game theoretical models: Large volumes of spam because
of asymmetries in the distribution of filter quality across
the Internet.
B) Evolution of the technological possibilities frontier facing
ISPs and other operators from the early 2000s.
Problem with existing studies in economics and computer science:
Do not account for incentives of spammers and ISPs.
General point: Importance of interdisciplinary cooperation between
economists and computer scientists in designing spam filtering
bundles and regulatory countermeasures.
II.1 Technological Choice
Advances in content filtering accuracy → constrained sensory threat
However:
High noise/signal ratio → network costs of spam rise
Of particular concern in developing countries with relatively lower:
a) Bandwidth
b) Processing capacity
c) Administrative capacity
Spam and the Digital Divide (Rajabiun, 2007)
The literature in computer science and economics almost
exclusively focuses on false negative/positive problem
II.2 End user and network costs
More realistic assumption: End user (E) and network costs of spam
(N) are likely to be closely linked.
General problem facing an ISP (Server level problem)
Costs of Spam = C(E(E1, E2), N(E1, E2, S))
E1 – Expected false negative rate
E2 – Expected false positive rate
S – Number of servers
Theory: Little is known about these relationships, but they are not static.
Practice: Can be estimated for individual ISPs based on:
a) Accounting information
b) Features of antispam systems available at a point in time
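The server-level cost function above can be sketched as a toy model. The functional forms and weights below are invented placeholders, not estimates from the talk; as the slide notes, real forms would be estimated per ISP from accounting data and the antispam systems available at the time:

```python
# Illustrative sketch of the slide's cost function:
#   Costs of Spam = C(E(E1, E2), N(E1, E2, S))
# All functional forms and weights are hypothetical placeholders.

def end_user_cost(e1, e2, attention_cost=1.0, loss_cost=50.0):
    """End-user cost E: false negatives (e1) waste attention; false
    positives (e2) lose legitimate mail, weighted far more heavily."""
    return attention_cost * e1 + loss_cost * e2

def network_cost(e1, e2, servers, per_server=10.0):
    """Network cost N: spam that slips through loads every server;
    over-aggressive filtering adds retry/bounce traffic."""
    return servers * per_server * (e1 + 0.5 * e2)

def total_cost(e1, e2, servers):
    """C combines end-user and network components (assumed additive)."""
    return end_user_cost(e1, e2) + network_cost(e1, e2, servers)
```

The point of the decomposition is that E1 and E2 enter both components, so a filter choice that looks optimal for end users alone can still be costly at the network level.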
II.3 Antispam technology
Basic filtering methods available since the late 1990s
Server level:
Adoption of (fuzzy) fingerprinting (2001-2005) and reputation-based
systems (2004-2006) upstream (fast, but not accurate)
End user level:
Statistical (Bayesian) content filters (accurate, but not fast)
Other technical + public policy measures: Aiming to increase the
costs of sending spam (Hashcash, civil/criminal law, do-not-call
registries)
Optimal choice of filter depends on identity of end user/ISP
Upstream ISPs more sensitive to speed, downstream to
accuracy
Divergence between (socially) optimal and actual technological
choice
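The speed/accuracy trade-off between the two filter families can be illustrated with a minimal sketch: a fingerprint filter does one hash lookup but misses any variant, while a naive-Bayes content score does per-token work but survives paraphrasing. The messages, fingerprint set, and token probabilities are all invented for illustration:

```python
# Sketch contrasting fingerprinting (fast, exact-match) with a
# Bayesian content score (slower, variant-tolerant).
import hashlib
import math

# Hypothetical corpus of known-spam fingerprints.
KNOWN_SPAM_FINGERPRINTS = {
    hashlib.sha256(b"buy cheap meds now").hexdigest(),
}

def fingerprint_filter(message: str) -> bool:
    """O(1) lookup: flags only exact duplicates of known spam."""
    return hashlib.sha256(message.encode()).hexdigest() in KNOWN_SPAM_FINGERPRINTS

# Hypothetical per-token spam probabilities (a real filter would
# learn these from a training corpus).
TOKEN_SPAM_PROB = {"buy": 0.8, "cheap": 0.9, "meds": 0.95, "hello": 0.1}

def bayes_spam_score(message: str) -> float:
    """Combine token probabilities in log-odds space (naive Bayes);
    unknown tokens are treated as neutral (p = 0.5)."""
    log_odds = 0.0
    for tok in message.lower().split():
        p = TOKEN_SPAM_PROB.get(tok, 0.5)
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

A one-word variant ("buy cheap meds today") evades the exact fingerprint yet still scores near 1.0 on the content filter, which is the accuracy side of the trade-off; the price is per-token work on every message.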
III.1 The long tail
Distribution of taste for spam for each sub-network: not normal
Khong (2004): Mechanisms that connect spammers and those with
a taste for spam → first best solution (open channel argument)
Blocking and filters second best
Empirically: More spam after widespread adoption of open
channels, rather than less.
Loder et al. (2004): Attention Bond Mechanism (ABM) first best
because it allows for price negotiations between senders and
receivers.
Basic economic assumption: The subjective theory of value
Ex: Search for affordable drugs for the uninsured in the U.S.
The long tail in natural sciences: Phase transition/multiple equilibria
Game theory: Strategic complementarities
III.2 Sender side countermeasures
In Microeconomic theory:
Long-tailed distributions are associated with markets where markups are
invariant to the number of sellers (e.g. mutual funds)
Margins for spammers, i.e. expected response rates, are invariant to
the number of spammers at play
Implications: Legal sanctions and IP reputation systems increase
costs of spamming, drive some spammers out of the market, but do
not thin out the market.
Intuition: As in wars against prostitution and drugs, “hang them all”
strategies are ineffective and increase social costs (Becker-Friedman).
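The invariant-margin argument can be made concrete with a toy free-entry model: if the expected response rate does not fall as spammers exit, removing some of them does not reduce the incentive for the rest, or for new entrants. All numbers below are invented for illustration:

```python
# Toy free-entry model of the spam market with an invariant margin.
# Assumption (from the slide): the expected response rate does not
# depend on how many spammers are active.

RESPONSE_RATE = 1e-4          # invariant to the number of spammers
REVENUE_PER_RESPONSE = 20.0   # hypothetical revenue per response

def spammer_profit(cost_per_msg: float, messages: int = 1_000_000) -> float:
    """Expected profit of one spammer sending a fixed volume."""
    return messages * (RESPONSE_RATE * REVENUE_PER_RESPONSE - cost_per_msg)

def equilibrium_spammers(cost_per_msg: float, candidates: int = 1000) -> int:
    """Free entry: every candidate with positive expected profit
    enters, regardless of how many others are active."""
    return candidates if spammer_profit(cost_per_msg) > 0 else 0
```

With a per-message cost below the margin, every candidate enters; removing half the candidates (sanctions, IP blocking) leaves the other half fully active, because each one's margin is unchanged. Only pushing the per-message cost above the margin empties the market, which is the escalation problem the next section turns to.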
IV.1 Strategic conflicts
Trivial model of spam: Tragedy of the commons
→ Generic solution is to increase costs on spammers, but results in
escalating spam wars.
Empirically: Increasing sender costs since early 2000s, but more
spam.
Escalation → development and adoption of new spamming
techniques
Androutsopoulos et al. (2005) → a 2-player game between senders
and receivers has a single Nash equilibrium
→ Settles in infinitely repeated games, unless there are changes to
underlying technologies or the taste for spam
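A minimal 2x2 sender/receiver game in the spirit of Androutsopoulos et al. (2005) can be checked for pure Nash equilibria directly. The payoffs below are invented to produce a single equilibrium (spam is sent, receivers filter); they are not taken from the cited paper:

```python
# Pure Nash equilibrium check for a 2x2 sender/receiver game.
# Payoffs are (sender, receiver); all values are illustrative.
from itertools import product

PAYOFFS = {
    ("spam",    "filter"):    (1, -1),   # cheap to send, filtering still costly
    ("spam",    "no_filter"): (3, -5),   # unfiltered spam hurts the receiver most
    ("no_spam", "filter"):    (0, -1),   # filtering has a false-positive cost
    ("no_spam", "no_filter"): (0,  0),
}
SENDER = ["spam", "no_spam"]
RECEIVER = ["filter", "no_filter"]

def pure_nash_equilibria():
    """A profile is Nash if neither player gains by deviating alone."""
    eq = []
    for s, r in product(SENDER, RECEIVER):
        best_s = all(PAYOFFS[(s, r)][0] >= PAYOFFS[(s2, r)][0] for s2 in SENDER)
        best_r = all(PAYOFFS[(s, r)][1] >= PAYOFFS[(s, r2)][1] for r2 in RECEIVER)
        if best_s and best_r:
            eq.append((s, r))
    return eq
```

With these payoffs the unique equilibrium is ("spam", "filter"): spam keeps flowing and receivers keep filtering, matching the slide's point that the game settles unless the underlying technologies or the taste for spam change.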
IV.2 Spam growth and filter quality
Reshef and Solan (2006): Blame filters for growth of spam due to
differences in filter quality.
When the costs of sending messages are not too high
→ the effect of improved filter quality on the total volume of spam is
ambiguous
Eaton et al. (2008): Complementarities between filters and sender
side countermeasures. Improving filtering alone results in more
spam. Given ineffective sender side countermeasures, they suggest
receiver side payments (as in SMS systems).
Kearns (2005): Spam as a source of both costs and revenues for
ISPs → economic incentive to adopt inefficient filters
V.1 Speed versus accuracy
Existing literature:
Even if they could read end user preferences accurately,
upstream backbone providers do not have sufficient financial
incentives to adopt the right technological countermeasures.
Argument here: Not necessarily because of financial factors alone.
Hypothesis: ISPs faced technological trade-offs in terms of speed v.
accuracy
Coordination failure not between senders and receivers as in the
tragedy of commons or Khong (2004), but between upstream and
downstream entities/servers.
Downstream better off with less incoming spam, but cannot force
upstream to do the optimal filtering for them.
V.2 Bundles and layers
Bundles of countermeasures facing spammers:
a) Ad hoc feature selection rules (late 1990s): centralized
b) Fingerprinting/checksum filters (2000-2005): centralized
c) IP reputation/authentication mechanism (2004-2006):
centralized
d) Statistical content filters (since late 1990s): distributed
Asymmetric filter quality (2000-2006):
(b and c) fast relative to 1st generation statistical content filters
(5x), but less accurate (-5% and -30% respectively).
Response by income-smoothing spammers: higher noise/signal
ratio, more variants, one-shot BGP spectrum agility
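Taking the slide's figures at face value (a 98% accuracy baseline from Cormack and Lynam (2007), minus 5% for fingerprinting and minus 30% for reputation systems), the accuracy asymmetry can be translated into misclassified messages per million as a quick sanity check:

```python
# Back-of-the-envelope reading of the slide's 2000-2006 asymmetry:
# layers (b) and (c) ran ~5x faster than 1st-generation statistical
# filters, at -5% and -30% accuracy respectively. Baseline figures
# are from the slides; the dictionaries are just a convenient layout.

BAYES_GEN1  = {"relative_speed": 1.0, "accuracy": 0.98}         # Cormack & Lynam (2007)
FINGERPRINT = {"relative_speed": 5.0, "accuracy": 0.98 - 0.05}  # layer (b)
REPUTATION  = {"relative_speed": 5.0, "accuracy": 0.98 - 0.30}  # layer (c)

def missed_spam_per_million(f):
    """Messages misclassified per million, given an accuracy figure."""
    return round((1.0 - f["accuracy"]) * 1_000_000)
```

The 5x speed advantage thus buys between 3.5x and 16x more misclassifications per million messages, which is the asymmetry spammers exploited by flooding the fast layers with variants.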
VI.1 The response
A) Coordination by operators to strengthen authentication protocols
(SPF, DKIM)
Problem: A wide range of techniques is available to bypass the
protocols, and even to use them as an instrument for sending
more spam!
B) Closing the gap between fast and accurate filters: Further
optimization of the methods for distributed content scanning,
learning, and classification
1st versus 2nd generation Bayesian filters (CRM114, Bogofilter,
COMDOM)
VI.2 Evolution of Bayesian content filters
VI.3 Findings
Technological trade-off between speed and accuracy now closed
with distributed 2nd generation Bayesian filters (at least 30x
differential in throughput relative to 1st generation):
Note: Fingerprinting was 5x faster than 1st generation Bayesian
filters in terms of throughput
Fixed versus variable costs of message processing
→ Substantial reductions in variable costs of scanning, minor
improvements in fixed costs of classification
VII. Summary
More spam is an instrument for:
a) Evading filters
b) Searching for people with a taste for spam
Normative question for policy makers: Should spamming be illegal?
Legal sanctions may induce a moral hazard problem and, by pushing
spammers toward more costly strategies/technologies, potentially
exacerbate the problem at the aggregate level (especially important
for developing countries).
For designers of antispam systems/bundles: Should we retain layers
that aim to increase the costs of spamming through ad hoc
centralized control (e.g. IP reputation, fingerprinting)?