
Automatic Authorship Identification (Part II)
Diana Michalek, Ross T. Sowell, Paul Kantor,
Alex Genkin, David Madigan, Fred Roberts,
and David D. Lewis
Acknowledgements
• Support
– U.S. National Science Foundation
• DIMACS REU 2004
• Knowledge Discovery and Dissemination Program
• Disclaimer
– The views expressed in this talk are those of the
authors, and not of any other individuals or
organizations.
Outline
I. Recap
II. New Federalist Paper Results
III. New E-mail Data Results
IV. Conclusions and Future Work
The Authorship Problem
• Given:
– A piece of text with unknown author
– A list of possible authors
– A sample of their writing
• Problem:
– Can we automatically determine which person
wrote the text?
• Approach:
– Use style markers to identify the author
The Federalist Papers
• 85 Total
• 12 Disputed
Previous Work: Mosteller and Wallace (1964)
• Function Words
Upon
Also
An
By
Of
On
There
This
To
Although
Both
Enough
While
Whilst
Always
Though
Commonly
Consequently
Considerable(ly)
According
Apt
Direction
Innovation(s)
Language
Vigor(ous)
Kind
Matter(s)
Particularly
Probability
Work(s)
Our Previous Work: Trials with the Federalist Papers
• Wrote scripts in Perl and Python to
compute
– Sentence length frequencies
– Word length frequencies
– Ratios of 3-letter words to 2-letter words
• Analyzed our data with graphing and
statistics software.
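The three measurements listed above can be sketched in a few lines of Python. This is only an illustration, not the original scripts: the function names, the regular-expression tokenizer, and the naive sentence splitter are all assumptions made here.

```python
import re
from collections import Counter

def words(text):
    # Assumption: a "word" is a maximal run of ASCII letters.
    return re.findall(r"[A-Za-z]+", text)

def split_sentences(text):
    # Assumption: sentences end at '.', '!' or '?'.
    # Abbreviations such as "Mr." will be split incorrectly.
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]

def sentence_length_freq(text):
    # Maps sentence length (in words) to the number of such sentences.
    return Counter(len(words(s)) for s in split_sentences(text))

def word_length_freq(text):
    # Maps word length (in letters) to the number of such words.
    return Counter(len(w) for w in words(text))

def ratio_3_to_2(text):
    # Ratio of 3-letter words to 2-letter words; None if there are no 2-letter words.
    freq = word_length_freq(text)
    if freq[2] == 0:
        return None
    return freq[3] / freq[2]

sample = "It is to be so. The cat sat on the mat."
print(sentence_length_freq(sample))
print(word_length_freq(sample))
print(ratio_3_to_2(sample))
```

Dividing each count by the total number of words (or sentences) would turn these raw counts into rates that can be compared across papers of different lengths.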
Previous Conclusions
• Not too helpful…but there is hope!
– Try more features
– Try different features
Feature Selection
• Which features work best?
• One way to rank features:
– Make a contingency table for each feature F
– Compute abs ( log ( ad / bc ) )
– Rank the log values

            F    Not F
Madison     a    b
Hamilton    c    d
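A minimal sketch of this ranking, assuming that a and b count Madison's documents with and without feature F, and c and d count Hamilton's. The slide does not say how empty cells were handled, so the `smooth` constant added to every cell is an assumption of this sketch, and the feature names and counts in the example are invented for illustration.

```python
import math

def abs_log_odds(a, b, c, d, smooth=0.5):
    # smooth is added to every cell so that a zero count does not cause
    # a division by zero or log(0). This correction is an assumption.
    a, b, c, d = (x + smooth for x in (a, b, c, d))
    return abs(math.log((a * d) / (b * c)))

def rank_features(tables):
    # tables maps a feature name to its (a, b, c, d) contingency counts.
    # Returns (feature, score) pairs, highest score first.
    scores = {f: abs_log_odds(*cells) for f, cells in tables.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented counts, for illustration only.
example = {
    "feature_x": (5, 5, 5, 5),   # same split for both authors
    "feature_y": (9, 1, 1, 9),   # split differs strongly by author
}
for name, score in rank_features(example):
    print(name, round(score, 3))
```

A score near zero means the feature is split the same way for both authors; a large score means it separates them, in either direction, which is why the absolute value is taken.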
49 Ranked Features
Linear Discriminant Analysis
• A technique for classifying data
• Available in the R statistics package
• Input:
– Table of training data
– Table of test data
• Output:
– Classification of test data
Linear Discriminant Analysis: example
Input training data:

     upon    2-letter  3-letter
M    0.000   206.943   194.927
M    0.000   212.915   194.665
M    0.369   202.583   190.775
M    0.000   201.891   213.712
M    0.000   236.943   206.221
H    3.015   235.176   187.940
H    2.458   226.647   201.082
H    4.955   232.432   192.793
H    2.377   232.937   186.078
H    3.788   224.116   196.338

Input test data:

     upon    2-letter  3-letter
     0.000   226.277   203.163
     0.908   205.268   181.653
     0.000   225.536   182.627
     0.000   217.273   183.053
     1.003   232.581   184.962

Output:
mmmmh
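For readers without R at hand, a two-class linear discriminant can be sketched in plain Python. This is not the R routine used for the slides: it is a textbook Fisher discriminant with a pooled covariance matrix, and it assumes equal prior probabilities for the two authors. The helper names are hypothetical. The numbers are the training and test rows shown above; whether this sketch reproduces the slide's output has not been verified here.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def pooled_cov(groups):
    # Pooled within-class covariance over all classes.
    p = len(groups[0][0])
    S = [[0.0] * p for _ in range(p)]
    dof = 0
    for rows in groups:
        mu = mean_vec(rows)
        for r in rows:
            d = [r[j] - mu[j] for j in range(p)]
            for i in range(p):
                for j in range(p):
                    S[i][j] += d[i] * d[j]
        dof += len(rows) - 1
    return [[v / dof for v in row] for row in S]

def solve(A, b):
    # Gaussian elimination with partial pivoting; solves A x = b.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def fit_lda(class_m, class_h):
    # Direction w = S^-1 (mu_m - mu_h); threshold at the midpoint of the means
    # (this is where the equal-priors assumption enters).
    mu_m, mu_h = mean_vec(class_m), mean_vec(class_h)
    w = solve(pooled_cov([class_m, class_h]), [m - h for m, h in zip(mu_m, mu_h)])
    mid = [(m + h) / 2 for m, h in zip(mu_m, mu_h)]
    return w, sum(wi * xi for wi, xi in zip(w, mid))

def predict(model, rows):
    w, threshold = model
    return "".join(
        "m" if sum(wi * xi for wi, xi in zip(w, r)) > threshold else "h"
        for r in rows
    )

# Columns: upon, 2-letter, 3-letter (rows copied from the slide).
madison = [[0.000, 206.943, 194.927], [0.000, 212.915, 194.665],
           [0.369, 202.583, 190.775], [0.000, 201.891, 213.712],
           [0.000, 236.943, 206.221]]
hamilton = [[3.015, 235.176, 187.940], [2.458, 226.647, 201.082],
            [4.955, 232.432, 192.793], [2.377, 232.937, 186.078],
            [3.788, 224.116, 196.338]]
test_rows = [[0.000, 226.277, 203.163], [0.908, 205.268, 181.653],
             [0.000, 225.536, 182.627], [0.000, 217.273, 183.053],
             [1.003, 232.581, 184.962]]

labels = predict(fit_lda(madison, hamilton), test_rows)
print(labels)
```

With five training rows per author, equal priors coincide with priors estimated from class proportions, so the midpoint threshold is a reasonable stand-in for this particular example.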
Some more LDA results
• 12 to Madison:
– upon, 1-letter, 2-letter
– upon, enough, there
– upon, there
• 11 to Madison:
– upon, 2-letter, 3-letter
• Fewer than 6 to Madison:
– 2-letter, 3-letter
– there, 1-letter, 2-letter
Some more LDA results
Class   Output of lda   Features tested
12 M    mmmmmmmmmmmm    upon apt 9 2
12 M    mmmmmmmmmmmm    to upon 2 3
11 M    mmmmmmhmmmmm    on there 2 13
11 M    hmmmmmmmmmmm    an by 5 10
10 M    mmmmmmhmmmhm    particularly probability 3 9
8 M     mmmmmmhhhmhm    also of 1 4
8 M     mmmhmmhhmmhm    always of 1 3
7 M     hmmhmhhmhmmm    of work 5 2
6 M     mmhmmmhhmhhh    there language 1 8
5 M     mhmhhmhhhmmh    consequently direction 5 11
Feature Selection Part II
• Which combinations of features are best
for LDA?
• Are the features independent?
• We did some random sampling:
– Choose features a, b, c, d
– Compute x = log a + log b + log c + log d
– Compute y = log (a+b+c+d)
– Plot x versus y
Selecting more features
• What happens when more than 4 features
are used for the lda?
• Greedy approach
– Add features one at a time from two lists
– Perform lda on all features chosen so far
• Is overfitting a problem?
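The greedy loop can be sketched as follows. The slides do not say how features are drawn from the two lists, so strict alternation is an assumption of this sketch, and `run_lda` is a hypothetical callable standing in for whatever trains and applies the classifier on a given feature set.

```python
from itertools import chain, zip_longest

def interleave(list_a, list_b):
    # Alternate between the two ranked lists, skipping repeats.
    # Assumption: the slides do not specify the order of selection.
    seen, order = set(), []
    for f in chain.from_iterable(zip_longest(list_a, list_b)):
        if f is not None and f not in seen:
            seen.add(f)
            order.append(f)
    return order

def greedy_runs(list_a, list_b, run_lda):
    # Add one feature per iteration and rerun the classifier on
    # every feature chosen so far.
    chosen, history = [], []
    for f in interleave(list_a, list_b):
        chosen.append(f)
        history.append((f, run_lda(list(chosen))))
    return history

# Stand-in classifier for illustration: reports how many features it was given.
demo = greedy_runs(["2-letter", "1-letter"], ["upon", "there"], len)
for feature, result in demo:
    print(feature, result)
```

Recording the result after every addition makes it easy to see whether extra features keep helping or whether the classification starts to degrade, which is one symptom of overfitting.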
First few greedy iterations
6 M 6 H     hmhhmhmmhmhm    2-letter words
12 M 0 H    mmmmmmmmmmmm    upon
12 M 0 H    mmmmmmmmmmmm    1-letter words
12 M 0 H    mmmmmmmmmmmm    5-letter words
11 M 1 H    mmmmmhmmmmmm    4-letter words
12 M 0 H    mmmmmmmmmmmm    there
12 M 0 H    mmmmmmmmmmmm    enough
11 M 1 H    mmmmmmhmmmmm    whilst
12 M 0 H    mmmmmmmmmmmm    3-letter words
11 M 1 H    mmmmmmhmmmmm    15-letter words
Listserv Data
• 70 Listserv archives
• Over 1 million e-mail messages
• Data was gathered by Andrei Anghelescu
– http://mms-02.rutgers.edu/ListServ/
Our Data
• One Listserv, “CINEMA-L”
• 992 authors, 41263 messages
• We look at 3 authors
– sstone: 1077 messages
– thea70: 1253 messages
– jmiles_2: 1481 messages
Frustration
Feature Selection
• How do we find “good” features?
More Frustration
A Measure of Variance
Summary of LDA Results
• Ran LDA using “I”, “is”, and “think”
• Trained on 80%, tested on 20%
• Correctly classified 122/186 documents
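The evaluation protocol above (train on 80%, test on 20%, count correct classifications) can be sketched as below. The random shuffle, the fixed seed, and the function names are assumptions of this sketch; the slides do not say how the split was drawn.

```python
import random

def split_80_20(items, seed=0):
    # Shuffle a copy, then cut at 80%. The seed only makes the split repeatable.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    # Fraction of test items whose predicted label matches the true label.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

train, test = split_80_20(range(10))
print(len(train), len(test))

# The reported result, 122 correct out of 186, as a rate:
print(round(122 / 186, 3))
```

For reference, 122/186 is about 65.6%; with three candidate authors, a natural baseline to compare against is always guessing the author with the most messages in the test set.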
Future Work
• Finish our 3 author experiment
• Use more and different features
– Structural
– E-mail specific features
• Analyze the relationships among features
• Other authorship identification problems
– Many authors
– Odd-man-out
Thanks!!!