Secure 2008 Corporate Template

Download Report

Transcript Secure 2008 Corporate Template

Adaptive Language Parsing
Teaching Parsers to Program Themselves
J. Zdziarski
[email protected]
7/18/2015
©2008 Secure Computing Corporation. All Rights Reserved.
1
The Problem
Adaptive spam filters have proven to work
Heuristics have proven to decompose
So how is this a problem?
2
The Problem
It’s a problem because:
Adaptive spam filters still use heuristics!
Well, most do anyway
3
Rules-Based Parsing
The Problem with Rules-Based Parsers
• They make assumptions about language syntax
• Many languages have their own set of rules
• Requires foreknowledge of the languages being used
• Spammers can’t obey RFC, let alone proper English
• A machine can learn how to read better than you can
• Some languages don’t support whitespace…
THEREDWELLSAMISSLATE
ENDANGERSSPARSEAMANSWORDS
THE RED WELL, SAM IS SLATE
ENDANGER! SPAR, SEAMAN SWORDS!
END ANGERS; PARSE A MAN’S WORDS
THERE DWELLS A MISS, LATE
(D. Higgs)
(G. Sinnamon)
Parse-o-Grams Courtesy of Robert Craigen, Univ. of Manitoba
4
Adaptive Language Parsing
3 Steps to Adaptive Parsing
1. Build a statistical hypothesis space for all parsing options
This can be all ASCII chars, wchars, biGram separators, legacy heuristic rules
2. Calculate the probability that each parsing rule yields
interesting data
For each potential delimiter or rule, how often was it found in an uninteresting
token (LOW) vs. how often was it found in an interesting token (HI).
3. Use this data to reprogram the parser
Take the most uninteresting N possible delimiters and use them to parse the
document differently; wash, rinse, and repeat.
PDELIM= (LOW
LOWDELIM/ LOWTOTAL
DELIM/
LOWTOTAL) + (HIDELIM/ HITOTAL)
* Counters are per-token, not per-message
5
Adaptive Language Parsing
Some Examples
Final Delimiter Set for a SpamAssassin Corpus run:
Header Delimiters:
378z049Y;mF:w"O@!^N\%$(>
Body Delimiters:
T?N,I?OS.pEmroaicthldesn
Interesting Data Generated
[0.990000] ,+click (8s, 0i)
Click more interesting when with comma
[0.940828] igh (105s, 2i)
Foreign Character Sets
[0.990000] !+ESC(B (50s, 0i)
[0.990000] ESC$B(-ESC(B (19s, 0i)
[0.990000] ESC$B!!!!!!!!!!ESC(B (29s, 0i)
|-|igh, High, H-IGH Interest Mortgage
[0.990000] $888 (15s, 0i)
Yup, we knew about this one. So does
I have no idea what this means, but the
machine says it’s Japanese spam.
Hal.
[0.990000] ional_Inc.+Now (6s, 0i)
[0.990000] s0r+C|ubs (12s, 0i)
Junk can be very useful to the machine
6
Adaptive Language Parsing
Some Tests
SpamAssassin Corpus
TP
TN
FP
FN
Precision
Recall
FScore
1640
4143
7
256
0.9957
0.8649
0.925
Static Defaults 1654
4145
5
242
0.9969
0.8723
0.930
Adaptive,W24 1760
4138
12
136
0.9932
0.9282
0.959
Adaptive,W32 1723
4140
10
173
0.9942
0.9087
0.949
Pure,W28
1756
4137
13
140
0.9926
0.9261
0.958
Pure,W36
1666
4142
8
230
0.9952
0.8786
0.933
TP
TN
FP
FN
Precision
Recall
Fscore
21085
9034
24
3699
0.9988
0.8507
0.918
Static Defaults 22787
8782
66
2109
0.9971
0.9152
0.954
Kakasi*
18745
8904
108
1438
0.9942
0.9287
0.960
Adaptive,W24 23952
9224
40
1104
0.9983
0.9559
0.976
Adaptive,W32 24068
4222
42
988
0.9982
0.9596
0.978
Whitespace
Chinese ISP Corpus
Whitespace
* Kakasi was not designed for Chinese, but it works pretty well anyway
7
Counter-Example
So can we break it too?
SpamAssassin Corpus
TP
TN
FP
FN
Precision
Recall
FScore
1640
4143
7
256
0.9957
0.8649
0.925
Static Defaults 1654
4145
5
242
0.9969
0.8723
0.930
Adaptive,W24 1760
4138
12
136
0.9932
0.9282
0.959
Adaptive,W32 1723
4140
10
173
0.9942
0.9087
0.949
Pure,W28
1756
4137
13
140
0.9926
0.9261
0.958
Pure,W36
1666
4142
8
230
0.9952
0.8786
0.933
Counter-
1578
4146
4
318
0.9974
0.8322
0.9073
Whitespace
Example
More junk tokens = lower efficiency, and of course less ability to catch anything (good or bad)
Answer: To some degree
8
Future Work
What Else Could We Do With This?
• Extend support to statistically stem words and parse inflection
Extend the hypothesis space to include biGram and triGram delimiters, and position of split
(before, after, or as delimiter).
• Character Set Detection
Certain parsing models will no doubt adhere to specific character sets
• Fuzzy Data Mining
Improve text retrieval by parsing documents to be more machine-coherent
• Apply to binary parsing challenges
Parse executable files, forensic recovery of hard drives, pixel border detection, etc.
9
Questions
Questions?
Jonathan Zdziarski ([email protected])
http://www.securecomputing.com
10