The Liberal Media and Right-Wing Conspiracies: Using Cocitation Information to Estimate Political Orientation in Web Documents
Miles Efron
School of Information, University of Texas
ACM CIKM 2004 (19% acceptance)
Objective
• Identify the political orientation of web
documents in the US
• Left versus right
Approach
• Pure link-based approach
• Web pages that are frequently co-cited with
left-leaning (right-leaning) pages are assumed
to have a left (right) orientation
• Use PMI-IR model on links
PMI for Links
• Pointwise Mutual Information:
PMI(di, dj) = log [ p(di, dj) / (p(di) p(dj)) ]
• p(di) = probability that a document in the
community contains a link to di
• p(di,dj) = probability that a document in the
community contains links to both di and dj
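The two probabilities above can be estimated by simple counting. The following is a minimal sketch, not the author's code: the toy community, the `.example` URLs, the function name `link_pmi`, and the choice of a base-2 logarithm are all assumptions made for illustration.

```python
import math

def link_pmi(community, di, dj):
    """PMI between two link targets, estimated over a community of documents.

    community: list of sets; each set holds the URLs one document links to.
    """
    n = len(community)
    p_i = sum(1 for links in community if di in links) / n
    p_j = sum(1 for links in community if dj in links) / n
    p_ij = sum(1 for links in community if di in links and dj in links) / n
    if p_ij == 0:
        # Never co-cited: treat as the lowest possible association.
        return float("-inf")
    return math.log2(p_ij / (p_i * p_j))

# Hypothetical toy community: four documents and their outgoing links.
community = [
    {"a.example", "b.example"},
    {"a.example", "b.example"},
    {"c.example"},
    {"a.example", "c.example"},
]
pmi_ab = link_pmi(community, "a.example", "b.example")  # co-cited more often than chance
pmi_ac = link_pmi(community, "a.example", "c.example")  # co-cited less often than chance
```

A positive value means the two pages are linked together more often than independence would predict; a negative value means less often.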
Political Orientation Score
• PO(di) = PMI(di, R) - PMI(di, L)
• R : set of exemplar right-leaning web
pages
• L : set of exemplar left-leaning web pages
• PO(di) > 0: right orientation
• PO(di) = 0: neutral
• PO(di) < 0: left orientation
Political Score Estimation
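• After some simple math (p(di) and the collection size cancel), we get
PO(di) = log [ (|link:di AND link:R| × |link:L|) / (|link:di AND link:L| × |link:R|) ]
• “link:di” is the number of pages linking to di, as
reported by AltaVista
• “link:R” is the query “link:R1 OR link:R2
OR … OR link:R|R|” to AltaVista; “link:L” is built the same way from L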
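Because PO(di) is a difference of two PMI values that share p(di), the score reduces to a ratio of four hit counts: log of hits(di AND R) × hits(L) over hits(di AND L) × hits(R). The sketch below illustrates that arithmetic under stated assumptions: the function names, the `smoothing` constant that guards against log(0), the base-2 logarithm, and the example counts are hypothetical and not taken from the talk. No search engine is called; the hit counts are passed in as plain numbers.

```python
import math

def or_query(urls):
    """Build a disjunctive link query such as 'link:a OR link:b'."""
    return " OR ".join(f"link:{u}" for u in urls)

def political_orientation(hits_d_and_r, hits_d_and_l, hits_r, hits_l, smoothing=0.01):
    """PO(d) = PMI(d, R) - PMI(d, L), written in terms of raw hit counts.

    hits_r and hits_l must be positive. `smoothing` is a hypothetical
    additive constant that keeps the logarithm defined when a co-citation
    count is zero.
    """
    numerator = (hits_d_and_r + smoothing) * hits_l
    denominator = (hits_d_and_l + smoothing) * hits_r
    return math.log2(numerator / denominator)

def label(score):
    """Map the sign of the score to an orientation."""
    if score > 0:
        return "right"
    if score < 0:
        return "left"
    return "neutral"

query_r = or_query(["r1.example", "r2.example"])
# Hypothetical counts: the page is co-cited with R four times as often as with L.
score = political_orientation(hits_d_and_r=40, hits_d_and_l=10, hits_r=1000, hits_l=1000)
orientation = label(score)
```

Swapping the two co-citation counts flips the sign of the score, which is the symmetry the PO definition implies.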
Exemplar Partisan Documents
• 19 web pages per side
• Use Open Directory Project
(www.dmoz.org)
• Left Side: Use web pages in
Politics:Liberalism:Social Liberalism
category with at least 1000 incoming links
• Right Side: Use Politics:Conservatism
category
Experiment 1: Web Data
• Use Yahoo! and Open Directory Project to
build a corpus of 695 testing documents
• Baseline: SVM and Naïve Bayes using
lexical features
– A training corpus of 2,412 documents is
built from the first two levels of the exemplar
web pages
Experiment 1 Result
Experiment 2: Blogs
• 119 left-wing blogs
• 43 right-wing blogs
Experiment 2 Result
Method        %Accuracy
Cocitation    98.8 (2 mistakes)
SVM           60.5 (64 mistakes)
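The reported accuracies can be checked against the blog counts given for Experiment 2. This short sketch assumes the test set is exactly the 119 + 43 blogs listed above; the helper name `accuracy_pct` is illustrative.

```python
left_blogs, right_blogs = 119, 43
total = left_blogs + right_blogs  # all blogs in the experiment

def accuracy_pct(mistakes, n):
    """Percentage of correctly classified items, rounded to one decimal."""
    return round(100 * (n - mistakes) / n, 1)

cocitation_acc = accuracy_pct(2, total)
svm_acc = accuracy_pct(64, total)
# For context: always guessing the larger class (left) would score this much.
majority_baseline = round(100 * left_blogs / total, 1)
```

Both mistake counts reproduce the reported percentages on 162 blogs, and the class imbalance means a constant "left" guess would score higher than the SVM figure.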
Experiment 3: Non-Partisan Documents
• Select 72 non-partisan web pages from
Open Directory Project
• Test whether the orientation score is close
to 0
Experiment 3 Result
Future Work
• Combine lexical information with links