General presentation, Utrecht University

WAHSP/BILAND
Towards flexible and stable CLARIN-supported open-source web applications for historical data mining in public media
April 8, 2016
Research team:
Stephen Snelders (UU), Pim Huijnen (UU),
Daan Odijk (ISLA, UvA),
Fons Laan (ISLA), Maarten de Rijke (ISLA),
Toine Pieters (UU)
Research
Creating big-data resources
National library of the Netherlands
Digital Newspaper Archive
> 1,200 titles
> 30,000,000 articles
> 10,000,000 pages
Period covered: 1618-1995
Still growing...
How did/do you study 30 million newspaper articles?
Dutch press on Germany
Frank van Vree (1989)
Sampling
> 1,200 titles
> 31,000,000 articles
Period covered: 1618-1995
Developing semantic document selection tools
WE NEED:
A semi-automatic and interactive open-source application.
An application that does not replace, but supports, the intuition and insights of the historical researcher with expert knowledge of a specific topic or domain.
An application that is user-friendly.
Problem:
Context and background of Dutch drug and eugenics debates over time
Aim
Understanding and evaluation of public debates around drugs, addiction and eugenics in the Netherlands, 1900-1945
Research question
What are the dynamics (in terms of patterns and trends) of public debates and sentiments around drugs, addiction and eugenics in Dutch newspapers in the first half of the twentieth century?
Poe’s detective finds the truth by using data in those newspaper articles that do not concern the murder.
In a similar way we will find terms and sentiments in those newspaper articles that may seem irrelevant, but are not.
Information extraction
Recognize structure in text
Part of speech: noun, verb, …
Entities: people, organisations, locations, temporal expressions, …
Relations: who, what, with whom, how, why
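A minimal sketch of what recognizing structure in text can mean in practice, using only regular expressions. It is not the tooling used in the project; the sample sentence, the function names and the year range are assumptions made for illustration, and real systems rely on trained taggers rather than patterns like these.

```python
import re

# Invented example sentence; not taken from the newspaper archive.
TEXT = "In 1924 the police in Rotterdam seized opium from a ship."

def extract_years(text):
    """Return four-digit years (1600-1999), a crude stand-in for temporal expressions."""
    return [int(match) for match in re.findall(r"\b(1[6-9]\d{2})\b", text)]

def extract_capitalized(text):
    """Return capitalized words that do not open a sentence, as crude entity candidates."""
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[A-Za-z]+", sentence)
        # Skip the first token: it is capitalized because of its position, not its type.
        candidates.extend(token for token in tokens[1:] if token[0].isupper())
    return candidates

print(extract_years(TEXT))
print(extract_capitalized(TEXT))
```

The capitalization rule misses entities at the start of a sentence and says nothing about relations, which is why the slide lists these as separate layers of analysis.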
E-everything
Information extraction (2)
Enjoyable, but what does it tell us?
Start Query: Opium
Drugs and drug policy
Odijk, D., de Rooij, O., Peetz, M.-H., Pieters, T., de Rijke, M., Snelders, S. (2012). "Semantic Document Selection". TPDL 2012: Theory and Practice of Digital Libraries. Springer, September.
Combining and clustering queries
By carefully inspecting the word counts, we found quantitative evidence for historical turning points that indicated the criminalization of the drugs debate around 1924.
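The slides do not say how the word counts were computed. As a rough illustration, assuming articles are available as (year, text) pairs, one could track the share of a term per year and look for the largest year-on-year rise. The toy articles, years and function names below are invented for the sketch.

```python
from collections import Counter
import re

# Invented toy data: (year, article text). Not drawn from the archive.
ARTICLES = [
    (1901, "opium trade in the harbour"),
    (1902, "opium and the police"),
    (1903, "police arrest police raid police"),
]

def term_share_per_year(articles, term):
    """Map each year to the fraction of its tokens that equal `term`."""
    hits, totals = Counter(), Counter()
    for year, text in articles:
        tokens = re.findall(r"\w+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

def largest_jump(shares):
    """Return the year with the biggest rise over the preceding year in the data."""
    years = sorted(shares)
    jumps = {later: shares[later] - shares[earlier]
             for earlier, later in zip(years, years[1:])}
    return max(jumps, key=jumps.get) if jumps else None

shares = term_share_per_year(ARTICLES, "police")
print(shares)
print(largest_jump(shares))
```

Shares rather than raw counts are used here because the number of digitized articles differs from year to year; a jump found this way is only a pointer to passages the historian still has to read.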
Eugenics case:
query overerving (heredity), 1867
Primarily associations with health-related terms/entities
Eugenics case:
query overerving (heredity), 1935
In 1935, however, the medical context of using the term inheritance made way for a legal and racial context.
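One simple way to compare the context of a query term between two years is to count which words appear in the same articles as the term. The sketch below assumes this article-level co-occurrence approach, which may differ from what the application does; the Dutch sample texts are invented (ziekte = disease, arts = doctor, wet = law, ras = race).

```python
from collections import Counter
import re

# Invented toy articles for two periods; not taken from the archive.
ARTICLES_EARLY = ["overerving ziekte arts", "overerving ziekte"]
ARTICLES_LATE = ["overerving wet ras", "overerving wet"]

def cooccurring_terms(articles, query):
    """Count, per term, the number of articles in which it appears alongside `query`."""
    counts = Counter()
    for text in articles:
        tokens = set(re.findall(r"\w+", text.lower()))
        if query in tokens:
            counts.update(tokens - {query})
    return counts

early = cooccurring_terms(ARTICLES_EARLY, "overerving")
late = cooccurring_terms(ARTICLES_LATE, "overerving")
print(early.most_common(1))
print(late.most_common(1))
```

On real newspaper text a stop-word list and some normalization of OCR errors would be needed before such counts say anything about a shift in context.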
NEW HORIZONS in DIGITAL HUMANITIES
E-Humanity Approaches to Reference Cultures: The Emergence of the United States in Public Discourse in the Netherlands, 1890-1990
Challenges:
1. OCR repair
2. Improving text-mining software and data infrastructure
3. Developing new historical research strategies
4. Educating historians and other humanities researchers