Using Topic Modeling to Study Everyday *Civic Talk* and

Using Topic Modeling to
Study Proto-politics and the
global climate debate
Veikko Eranti (& Tuukka Ylä-Anttila)
Universities of Helsinki & Tampere
Citizens in the Making (Kone Foundation, Academy of Finland)
blogs.uta.fi/cim | blogs.helsinki.fi/eranti
@VeikkoEranti
Background
• A larger project combining
ethnographic and digital
methods
• Citizenship as action, as
process – “grown into”
• Political Sociology: political
argumentation &
justification
• online proto-politics and
politics
Sociologists playing with
language
• We are not interested in describing corpora as a
whole
• Rather, how do we find interesting sociological
needles in the haystack of text?
• And how do we follow single ideas, or frames, through
time?
-> We aim at “understanding” something
Methods
• Topic modeling: unsupervised machine learning
• We use Latent Dirichlet Allocation (LDA) by Blei,
Ng & Jordan (2003)
• Takes text as bag-of-words documents, outputs
“topics”: sets of words that occur together in
documents, and lists of the documents ranked by
how strongly each topic is present in them
• Fast-rising method, the one unsupervised text
classification method social scientists are using
(to the extent they are using any)
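To make the input and output of LDA concrete, here is a minimal sketch in plain Python: a toy collapsed Gibbs sampler, not the MALLET implementation used in the study. The four example documents, the hyperparameters (alpha, beta), the iteration count and the function names are all assumptions made for illustration.

```python
import random
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; word order is discarded by the model."""
    return text.lower().split()

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    Returns (topic_word, doc_topic): word counts per topic and
    topic counts per document. Hyperparameters are illustrative defaults.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # Resample: topics common in this document and
                # topics that already use this word are favoured.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                topic_word[z][w] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word, doc_topic

# Invented example documents, not data from the study.
raw = [
    "climate summit emissions climate treaty",
    "emissions treaty climate summit talks",
    "tax money government tax budget",
    "budget government money tax cut",
]
docs = [tokenize(text) for text in raw]
topic_word, doc_topic = fit_lda(docs, num_topics=2)

for k in range(2):
    top_words = [w for w, _ in topic_word[k].most_common(3)]
    ranked_docs = sorted(range(len(docs)), key=lambda d: doc_topic[d][k], reverse=True)
    print(f"topic{k}: words={top_words} documents={ranked_docs}")
```

The two outputs correspond to the two things named above: the most frequent words of each topic, and the documents ranked by how many of their tokens were assigned to that topic.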
One question and two examples
?. What are topics and how to make sense of
them?
1. Global Climate Debate in New York Times and
The Hindu
2. Politics and proto-politics in Suomi24
What are topics?
Topics or frames?
• Interpreting the output of the LDA algorithm as a
“topic” is a choice in itself
• In cultural and political sociology, a frame is
either a deliberate or a non-deliberate way of
organizing reality – issues can be framed in many
ways
• Can frames, discourses, justifications and other objects
of cultural sociology be operationalized as topics?
(DiMaggio, Nag & Blei 2013)
Global Climate Debate
in New York Times and
The Hindu
Combining manual analysis with topic modeling
Aim of the study
• Make sense of the global climate debate
• Part of a larger study about global civic society
• Climate change and activism
• How do different actors/speakers justify their
political positions?
History & materials
• 2000 newspaper articles
• 6 countries, 8 coders, 13 variables, 158 possible
codes, 18 pages of codebook
…
• In this presentation: Climate summit coverage
from the New York Times and The Hindu
• Hand-coded political claims as the corpus
Validation of interpretations
• VALIDATE, VALIDATE, VALIDATE!
• Context-specific deep knowledge of your data –
read it!
• Internal validation, external validation (Evans
2014, Grimmer & Stewart 2013)
• Structural topic modeling vs. using a mishmash of
filenames and command line tools
Politics and protopolitics in Suomi24
Distilling politics from online discussions
Materials
• Project: several social media datasets
• Here: Suomi24 (Finland24) forum
• Subset of 2.5M words (whole 2001–2015 dataset:
2.5B words)
• A general interest forum, largest of its kind in Finland
• Sub-forums: local municipalities, cars, hobbies, home
& DIY, pets, travel, Jesus, sex, and Jesus & sex
• Dedicated sections for political discussion, but it also
“leaks” to other discussion areas
• We look at proto-political talk on the forum as a whole
Modeling
• We run a 50-topic LDA model with MALLET to find
(proto)political talk in everyday debates
• 50 sets of words which often occur together:
topics of discussion
• Iterative stop-word hacking
• A single message on the message board is used
as the document
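The slides do not say how the stop-word list was refined, so the following is only one possible sketch of such a loop: after each modeling round, flag words that show up among the top words of many topics, let the analyst decide which of them are really uninformative, filter the messages and re-run the model. The function names, the 0.5 threshold and the example words are hypothetical.

```python
from collections import Counter

def suggest_stopwords(topic_top_words, max_share=0.5):
    """Suggest words found among the top words of more than max_share of topics."""
    counts = Counter(w for words in topic_top_words for w in set(words))
    limit = max_share * len(topic_top_words)
    return sorted(w for w, c in counts.items() if c > limit)

def remove_stopwords(docs, stopwords):
    """Drop stop words from tokenized messages; discard messages left empty."""
    filtered = ([w for w in doc if w not in stopwords] for doc in docs)
    return [doc for doc in filtered if doc]

# Hypothetical top words from one modeling round (not the study's output).
round_one = [
    ["that", "finland", "tax", "money"],
    ["that", "finland", "church", "language"],
    ["that", "car", "engine", "finland"],
    ["that", "dog", "cat", "food"],
]
candidates = suggest_stopwords(round_one)
print(candidates)

# The analyst reviews the candidates by hand: here "that" is dropped,
# while "finland" is kept because it may carry meaning in political talk.
stopwords = {"that"}
messages = [["that", "tax", "finland"], ["that"]]
print(remove_stopwords(messages, stopwords))
```

The suggestion step only proposes candidates; the decision stays with the reader of the data, which is why the process is iterative.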
Examples of topics
(top 15 words)
topic17: new need Finland through produce change
problem build small action future use nowadays
opportunity option
topic23: Finland Sweden language church Finnish
Swedish speak school country learn Catholic religion
belong study Islam
topic32: Finland pay Euro money tax billion state million
poor cut government economy rich count large
Interpreting topics
• These are political words / topics, but they don’t really
represent a political articulation (a position, a
justification or even a policy theme)
• We interpret 9 of 50 topics as political or protopolitical
• How to get closer to political articulations from this
general “civic talk”?
• Let’s pick “proto-political” topics from the 50 and reduce
the dataset to the 100 most important messages from
each
• Reduced to 827 messages (from ~42 000)
• 30-topic LDA model on them
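The reduction step can be sketched as follows, assuming that “most important” means the highest share of the topic in a message (the slides do not define it). With 9 selected topics and 100 messages each, the upper bound is 900 messages; a smaller figure such as 827 would be consistent with some messages ranking highly for more than one topic. The function names and the toy proportions are invented for illustration.

```python
def top_documents(doc_topic, topic, n):
    """Indices of the n documents with the highest share of the given topic."""
    ranked = sorted(range(len(doc_topic)), key=lambda d: doc_topic[d][topic], reverse=True)
    return ranked[:n]

def reduce_corpus(doc_topic, selected_topics, n):
    """Union of the top-n documents of every selected topic, without duplicates."""
    keep = set()
    for topic in selected_topics:
        keep.update(top_documents(doc_topic, topic, n))
    return sorted(keep)

# Toy document-topic proportions: 5 messages x 3 topics (invented values).
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
]

# Topics 0 and 1 play the role of the hand-picked proto-political topics.
subset = reduce_corpus(doc_topic, selected_topics=[0, 1], n=3)
print(subset)  # 4 messages rather than 6, because two are shared by both topics
```

The messages in the subset would then be passed to a second, smaller model, as in the 30-topic “submodel” described here.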
Examples of topics in
“submodel”
topic3: Marx work workingclass capitalism teacher socialism worker
create pay workingtime value long wellbeing production product
topic12: Finland Niinistö parliament Soini president TrueFinn party
Halla-aho choose minister leader chairman foreignminister
memberofparliament Russia
topic22: member association function union expel organization
name right important only Halonen membershipfee forum join DDR
21 of 30 topics are rather clear political articulations!
Example
Conclusions and questions
• What are topics?
• Interplay of computational analysis and
qualitative reading
• Drawing a map of big datasets for further
qualitative exploration
• Instead of death of theory, we see new avenues
and instruments for interplay of data and theory
@veikkoeranti
References
• Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent
Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.
• Dahlgren, Peter. 2000. “The Internet and the Democratization of Civic
Culture.” Political Communication 17: 335–40.
• DiMaggio, Paul, Manish Nag, and David M. Blei. 2013. “Exploiting
Affinities between Topic Modeling and the Sociological Perspective on
Culture: Application to Newspaper Coverage of U.S. Government Arts
Funding.” Poetics 41(6): 570–606.
• Evans, Michael S. 2014. “A Computational Approach to Qualitative
Analysis in Large Textual Datasets.” PLoS ONE 9(2): 1–10.
• Farrell, Justin. 2016. “Network Structure of the Climate Change
Counter-Movement.” Nature Climate Change 6: 370–74.
• Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The
Promise and Pitfalls of Automatic Content Analysis Methods for
Political Texts.” Political Analysis 21(3): 267–97.
• Etc.