TextAsData_ICPSR

Text as Data in the
Social Sciences
Introduction to Computing for Complex Systems
(Session XVI)
ICPSR – August 11, 2010
Abe Gong
www-personal.umich.edu/~agong
1. Big Picture
2. The field of NLP
3. Automated text classification
4. A census of the political web
Agenda
Big Picture…
1. Language is the root of conscious
thought, culture, and shared meaning.
2. Artificial and human intelligence are
complementary tools for scientific inquiry.
3. Computers are surprisingly good at
understanding human language.
4. Suddenly, huge amounts of
digitized text are available.
The field of NLP…
Buzz word | Parent field | Emphasis
Natural language processing (NLP) | Computer science | Algorithmic extraction of meaning from text
Computational linguistics | Linguistics, psychology | Understanding words and text through statistics
Machine learning | Computer science | Open-ended automated problem solving
Automated content analysis | Political science, sociology | Large-n content analysis
Information retrieval | Computer science | Efficient storage and retrieval of data
NLP and Related fields
Supervised learning
Using a large set of labeled data, the
computer learns to mimic humans on
some task
Applications
◦ Handwriting, speech, and pattern recognition
◦ Spam filtering
◦ Bioinformatics
◦…
Learning Modes
Supervised learning
Using a large set of labeled data, the
computer learns to mimic humans on
some task
Strengths
◦ Very flexible
◦ Easy to adapt to existing theory
Weaknesses
◦ Specifying ontologies can be time-consuming
◦ Requires substantial training data
Learning Modes
Unsupervised learning
Using raw, unlabeled data, the computer
looks for patterns and regularities
Applications
◦ Clustering
◦ Neural networks
◦ Algorithmic stock trading
◦ Data-driven marketing
◦…
Learning Modes
Unsupervised learning
Using raw, unlabeled data, the computer
looks for patterns and regularities
Strengths
◦ Does not require labeled data
◦ Discovers new patterns
Weaknesses
◦ Often difficult to relate to existing theory
Learning Modes
Active learning
Supervised learning, but the computer
selects or generates training examples
◦ Optimal experimental design
◦ Performance boost for supervised learning
Semi-supervised learning
Blend of supervised and unsupervised
learning
◦ Algorithmic forecasting, stock trading
◦ Topic maps
◦ Machine summarization
Learning Modes
In all of these applications, a large degree
of control is turned over to the computer.
“Data Mining” is not always a dirty word.
Bad: Re-run statistical models until p < .05
Good: Tap all the data available for
patterns and inference
“Data Mining”
Google Image Search: “data mining books”
“Data Mining”
Topic tracking and sentiment analysis
Track trends in attention and opinion over
time.
http://www.google.com/trends
http://memetracker.org
http://textmap.com
http://www.ccs.neu.edu/home/amislove/twittermood/
Current applications
Data visualization
Clever ways to make data accessible
http://manyeyes.alphaworks.ibm.com/manyeyes
http://flowingdata.com
http://morningside-analytics.com
Current applications
Machine translation
Translate text from one language to
another.
http://babelfish.yahoo.com/
Machine summarization
Summarize the most important points from
a document or group of related
documents.
http://newsblaster.cs.columbia.edu/
http://www.newsinessence.com/
Current applications
Miscellaneous
• Language detection
http://www.google.com/uds/samples/language/detect.html
• Part-of-speech tagging
• Word-sense disambiguation
• Probabilistic parsing
• Spell checking
• Grammar checking
• Spam filtering
Current applications
• Speeches
• Legislation
• Amendments
• Hearings
• Rules
• Floor debate
• Public comments
• Judicial opinions
• Legal Briefs
• Party Manifestos
• Media coverage
• Blogs
• Treaties
• Reports
• Anything on the public record…
Data sources
http://bulk.resource.org/
Data sources
Data sources
Two options
• Out-of-the-box software
◦ Nice for getting started
◦ Methodology is constrained
◦ Lags the development curve
• Build it yourself
◦ High overhead
◦ Requires skill development
◦ Extremely flexible
◦ Make sure to use existing libraries!
Software
Ex: Provalis WordStat
• Out of Box, Plug and Play
• Software Package Developed by Provalis
◦ http://www.provalisresearch.com/
• Booth at Midwest & APSA -- 2008, 2009
• The Full Package: WordStat, QDA Miner, SimStat
Software
Programming languages
Perl, C++, Java, Ruby…
Python
If you’re going to learn a language, make it Python
• Free, open source
• Intuitive syntax
• Enormous code and user base
• Well-documented, with excellent references
• Multiplatform, mature distribution
• Strong NLP capability
• Ex: nltk, lxml, numpy, scipy, scikits libraries
Software
5-minute demo
Train a classifier to recognize the difference
between Twain’s Huck Finn and Stoker’s
Dracula.
Get python here:
http://www.python.org/download/
Download the script here:
http://www-personal.umich.edu/~agong/temp/text_classifier_demo.zip
Download the books here:
http://www.gutenberg.org/files/32325/32325-h/32325-h.htm
http://www.gutenberg.org/files/345/345-h/345-h.htm
Demo
Automated text classification
Goal: Sort documents into predefined
categories, based on their text.
• Task
• Document
• Corpus
• Token
• Feature
• Feature string
• Feature vector
• Bag-of-words classifiers
Terminology
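To make the terminology concrete, here is a minimal sketch of how a corpus of documents becomes tokens and then bag-of-words feature vectors. The two example sentences, the regular-expression tokenizer, and all function names are illustrative assumptions, not the code from the talk's demo script.

```python
import re
from collections import Counter

def tokenize(document):
    """Split a document into lowercase word tokens."""
    return re.findall(r"[a-z']+", document.lower())

def build_vocabulary(corpus):
    """Map each distinct token in the corpus to a column index (alphabetical)."""
    tokens = sorted({token for doc in corpus for token in tokenize(doc)})
    return {token: index for index, token in enumerate(tokens)}

def feature_vector(document, vocabulary):
    """Bag of words: count each vocabulary token, ignoring word order."""
    counts = Counter(tokenize(document))
    return [counts[token] for token in vocabulary]

# A tiny made-up corpus of two documents.
corpus = [
    "The senate passed the bill",
    "The game went to overtime",
]
vocabulary = build_vocabulary(corpus)
vectors = [feature_vector(doc, vocabulary) for doc in corpus]

for doc, vector in zip(corpus, vectors):
    print(doc, "->", vector)
```

Each position in a vector is one feature (here, the count of one word), so every document ends up with a vector of the same length regardless of how long the text is.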
Naïve Bayes Classifiers
Assume words are drawn independently,
conditional on document class. Infer each
document’s class from its words.
Strengths
• Clear statistical foundation
• Fast to train and implement
• Lightweight
Weaknesses
• Noticeably less effective than other approaches
• Statistical foundation is based on false
assumptions
Algorithms and Estimators
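The independence assumption can be sketched in a few lines of Python. This is a multinomial Naive Bayes with add-one smoothing, which is one common variant and an assumption here; the class name, the smoothing choice, and the four training passages (invented stand-ins, not quotations from either novel) are all illustrative.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with additive smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = Counter()            # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocabulary = set()

    def train(self, labeled_docs):
        for tokens, label in labeled_docs:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def log_scores(self, tokens):
        """Return log P(class) + sum of log P(token | class) for each class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # skip words never seen in training
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.smoothing)
                    / (total_words + self.smoothing * vocab_size)
                )
            scores[label] = score
        return scores

    def classify(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Invented stand-in passages, labeled by author.
training = [
    ("the raft drifted down the river".split(), "twain"),
    ("we fished off the raft by the river".split(), "twain"),
    ("the count left the castle at night".split(), "stoker"),
    ("the castle was dark and the count was pale".split(), "stoker"),
]
model = NaiveBayes()
model.train(training)
print(model.classify("a raft on the river".split()))
print(model.classify("the count in the castle".split()))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.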
Support Vector Machines (SVM)
Vectorize documents, then find the
maximum-margin separating hyperplane.
Strengths
• High accuracy
• Intuitive explanation
• Work with little training data
Weaknesses
• No explicit statistical foundation
• Training is slow with large data sets
Algorithms and Estimators
Logistic regression
Maximum likelihood estimator
Algorithms and Estimators
Decision Trees
Like playing 20 questions.
Strengths
• Able to capture subtle details
Weaknesses
• Require large amounts of training data
• Classification is often “brittle”
Algorithms and Estimators
Percent agreement
Precision
Recall
F-measure
Cohen’s kappa
Krippendorff’s alpha
Evaluation
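Several of these measures can be computed directly from two parallel lists of binary codes. The sketch below covers percent agreement, precision, recall, F-measure (the balanced F1 form, an assumption), and Cohen's kappa; Krippendorff's alpha is left out because it needs more machinery. The example codes are made up.

```python
def confusion_counts(truth, predicted):
    """Count true/false positives and negatives for binary (0/1) codes."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def percent_agreement(truth, predicted):
    tp, fp, fn, tn = confusion_counts(truth, predicted)
    return (tp + tn) / (tp + fp + fn + tn)

def precision(truth, predicted):
    tp, fp, _, _ = confusion_counts(truth, predicted)
    return tp / (tp + fp)

def recall(truth, predicted):
    tp, _, fn, _ = confusion_counts(truth, predicted)
    return tp / (tp + fn)

def f_measure(truth, predicted):
    """Harmonic mean of precision and recall (F1)."""
    p, r = precision(truth, predicted), recall(truth, predicted)
    return 2 * p * r / (p + r)

def cohens_kappa(truth, predicted):
    """Agreement corrected for the agreement expected by chance."""
    n = len(truth)
    observed = percent_agreement(truth, predicted)
    p_truth = sum(truth) / n
    p_pred = sum(predicted) / n
    expected = p_truth * p_pred + (1 - p_truth) * (1 - p_pred)
    return (observed - expected) / (1 - expected)

# Made-up codes: 1 = in the category, 0 = not.
human = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
computer = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(percent_agreement(human, computer), cohens_kappa(human, computer))
```

Kappa comes out lower than raw agreement on the same codes because some matches would occur by chance alone.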

Bias plot and difficulty curve
Evaluation
A Census of the Political Web
Why study politics online?
1. Impact of new technology on politics
◦ Barack Obama did 60% of his record-breaking fundraising online
◦ Trent Lott, Dan Rather, Howard Dean
2. New data on age-old political behavior
◦ Examples to follow shortly
Motivation
• “No complete index of political websites exists.”
• Unable to use sampling theory
◦ Size, representativeness, generalizability, etc.
◦ Possible bias, error in existing methods
Motivation
Goal: A complete census of the political web
Web site: http://domain
Web page: http://domain/path
Examples
(3 sites and 1 page)
http://www.yahoo.com
http://www.yahoo.com/politics
http://www.dailykos.com
http://abegong.dailykos.com
Web sites v. web pages
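The site/page distinction maps directly onto URL parsing. The sketch below assumes, following the http://domain pattern on the slide, that a site is identified by its scheme and host name, so a subdomain counts as its own site; the helper names are illustrative.

```python
from urllib.parse import urlsplit

def site_of(url):
    """Reduce a URL to its site: scheme plus host name, with no path."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}"

def is_page(url):
    """A URL names a page (rather than a bare site) if it has a path."""
    return urlsplit(url).path not in ("", "/")

urls = [
    "http://www.yahoo.com",
    "http://www.yahoo.com/politics",
    "http://www.dailykos.com",
    "http://abegong.dailykos.com",
]
sites = {site_of(url) for url in urls}
pages = [url for url in urls if is_page(url)]
print(len(sites), "sites,", len(pages), "page")
```

Applied to the four example URLs, this grouping reproduces the slide's count of 3 sites and 1 page.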
1. Sites correspond with human beings
2. Feasibility
◦ ~230 million websites
◦ ~30 billion web pages
Why web sites?
1. Train an automated text classifier to recognize political content.
2. Start from a seed batch of political sites.
3. Download and classify each site in the batch.
4. For political sites:
a. Harvest all outbound hyperlinks.
b. Add previously unvisited links to the next batch.
5. Repeat until no new links are found.
Automated snowball census
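The snowball procedure is a batch-by-batch crawl that only follows links out of sites classified as political. The sketch below shows that control flow on a toy link graph; the `fetch_links` and `is_political` callables, the `.example` host names, and the lookup-table "classifier" are hypothetical stand-ins for real downloading and the trained text classifier.

```python
def snowball_census(seed_sites, fetch_links, is_political):
    """Crawl in batches, harvesting links only from political sites."""
    visited = set()
    political = set()
    batch = set(seed_sites)
    while batch:                      # repeat until no new links are found
        visited |= batch
        next_batch = set()
        for site in sorted(batch):    # download and classify each site
            if not is_political(site):
                continue
            political.add(site)
            for link in fetch_links(site):   # harvest outbound hyperlinks
                if link not in visited:      # keep previously unvisited links
                    next_batch.add(link)
        batch = next_batch
    return political, visited

# Toy web: each site maps to the sites it links to.
toy_links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "d.example"],
    "c.example": ["e.example"],
    "d.example": [],
    "e.example": [],
}
toy_political = {"a.example", "b.example", "d.example", "e.example"}

political, visited = snowball_census(
    seed_sites=["a.example"],
    fetch_links=lambda site: toy_links.get(site, []),
    is_political=lambda site: site in toy_political,
)
print(sorted(political), sorted(visited))
```

In the toy graph, e.example is political but is linked only from a non-political site, so the crawl never reaches it. The census can only find sites connected to the seed batch through chains of political sites.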
How can we know if the automated classifier
is working properly?
The same way we know if a human coder is
working properly: compare coding with
others
1. Hand-code a training set (n=1,000 x 1)
2. Train the classifier
3. Hand-code a testing set (n=200 x 4)
4. Compare results
a. Human-human
b. Human-computer
Evaluation
• Intuitive definition
• Minimal training
• Amazon Mechanical Turk
Coding protocol
Human-human coding
.733 Ordinal Kripp. Alpha

Even with minimal training, our shared
definition of political content is quite
strong.
Sites in the gray area: www.msnbc.com,
www.rff.org, …
Reliability
Prob(political) ≈ logit⁻¹(α + βX)
X = Vector of word counts
α = Bias term
β = Word weights
Max. Likelihood Estimator
• Asymp. unbiased
• Asymp. efficient
Word | β
obama | 8.186
polit | 6.696
govern | 5.542
senat | 4.709
presid | 4.649
american | 4.417
… | …
art | -2.994
ago | -3.044
game | -3.244
home | -3.301
amp | -5.873
Training a text classifier
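To see how the weights turn word counts into a probability, the sketch below scores a document with the logistic (inverse-logit) function applied to α + βX. The weights are the ones listed on the slide; the bias of 0.0 is a hypothetical placeholder because the fitted α is not reported, and treating unlisted words as weight 0 is a simplification, since the full model presumably has weights for many more words.

```python
import math

# Word weights (β) as listed on the slide; keys appear to be word stems.
WORD_WEIGHTS = {
    "obama": 8.186, "polit": 6.696, "govern": 5.542, "senat": 4.709,
    "presid": 4.649, "american": 4.417,
    "art": -2.994, "ago": -3.044, "game": -3.244, "home": -3.301,
    "amp": -5.873,
}
BIAS = 0.0  # hypothetical α: the slide does not report the fitted value

def inverse_logit(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def prob_political(word_counts, weights=WORD_WEIGHTS, bias=BIAS):
    """Prob(political) = inverse_logit(α + Σ β_word × count_word)."""
    z = bias + sum(weights.get(word, 0.0) * count
                   for word, count in word_counts.items())
    return inverse_logit(z)

print(prob_political({"senat": 1, "presid": 1}))
print(prob_political({"game": 1, "home": 1}))
```

Positive weights push the probability toward 1 and negative weights toward 0, which is why stems like "senat" and "game" pull in opposite directions.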
Measure | Human-Human | Human-Computer
Binary percent agreement | .809 | .810
Binary Kripp. alpha | .617 | .620
Automated classification is just as accurate and reliable as human classification.
Reliability
Threshold | Precision | Recall
.4 | [.95] | [.90]
Thresholds
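Choosing a threshold trades precision against recall: sites scoring at or above the threshold are labeled political. The sketch below illustrates the trade-off on made-up scores and labels; none of the numbers come from the census, and the function name is illustrative.

```python
def precision_recall_at(threshold, scores, labels):
    """Label scores >= threshold as positive, then compare with true labels."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# Made-up classifier probabilities and hand-coded labels.
scores = [0.95, 0.90, 0.80, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [True, True, True, False, True, False, False, False]

for threshold in (0.4, 0.7):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, the lower threshold catches every true positive but admits one false positive, while the higher threshold does the reverse.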
Runtime: 23 hrs
Hard drive: 120 GB
Sites visited: 1.8 million
Political sites: 650,000
Est. false positives: 112,000
Est. false negatives: 60,000
Results
• Stability across time
◦ Is the political web today the same as the web last year?
• Clutter
◦ Advertising, spam, etc.
• Private sites
◦ Password protection: Facebook, MySpace, Twitter
• Improved classifier
◦ Other predictors of political-ness (esp. links)
Limitations
• Survey
• Content analysis
◦ By author
◦ Over time
◦ In panel
• Network analysis
Uses
Are estimates really unbiased?
Classifier predictions have known certainty.
Allows us to estimate the gray area in our definition.
Estimating the gray area
http://textmap.org/
Sentiment analysis
Abe Gong - Evaluating text classifiers and
text generators
[Figure: density plots of Prob(X|f) in two panels, "A hard task" and "An easy task"]
The Bayesian approach to content coding
• Intuition
• Applications
◦ Data compression
◦ Telecommunications
◦ Cryptography
Example:
http://www.invacua.com/markov_gen.html
Markov text generation
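The intuition behind Markov text generation fits in a short sketch: record which word follows each word (or each run of words) in a source text, then walk those transitions at random. The training sentence, function names, and default order of 1 are illustrative assumptions, and this is not the code behind the linked generator.

```python
import random
from collections import defaultdict

def build_chain(tokens, order=1):
    """Map each run of `order` tokens to the tokens observed right after it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        chain[state].append(tokens[i + order])
    return chain

def generate(chain, start, length, rng):
    """Walk the chain from `start`, sampling up to `length` new tokens."""
    state = tuple(start)
    output = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state was never followed by anything
        next_token = rng.choice(followers)
        output.append(next_token)
        state = state[1:] + (next_token,)
    return output

source = "the cat sat on the mat and the cat ran".split()
chain = build_chain(source, order=1)
generated = generate(chain, start=("the",), length=8, rng=random.Random(0))
print(" ".join(generated))
```

Because followers are stored with repetition, sampling from the list reproduces the observed transition frequencies. Raising `order` makes the output more faithful to the source but less novel.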