Web application for archives


Social Context as a part of News-Archive-Explorer
Web application for exploratory browsing of news streams and archives
Marko Grobelnik
Jasna Škrbec
Jozef Stefan Institute
Introduction
News publishers generate content archives
The goal is to build a system to make such archives usable through text mining & visualization
Archive characteristics:
Large corpora (up to a few million articles)
Rich metadata (specific to each archive)
Different input formats (XML structures)
Poor search interfaces (not specialized for archives)
What do we want?
An application to…
help users search and browse through archives
help users read more about topics related to their search
visualize how things are connected in time, place, stories, etc.
draw users' attention to other related issues
tell users more about the searched content
Architecture
[Diagram labels: Server side, Client side, Archive, Preprocessing, SQL server, Enrycher]
Database model
Already done
Import archive XML files
New York Times archive (15M articles)
NYTimes LDC (1.7M articles)
Nature (300k articles)
Reuters (830k articles)
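
As a rough illustration of the import step, the sketch below parses one article from XML into a flat record. The element names (`article`, `title`, `author`, `date`, `body`) and the sample document are hypothetical; each archive has its own schema, so a real importer would presumably need one field mapping per archive.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real archives each use their own XML schema.
SAMPLE_XML = """<article id="a1">
  <title>Example headline</title>
  <author>Jane Doe</author>
  <date>1996-11-05</date>
  <body>Example body text.</body>
</article>"""

# Assumed mapping from record field names to element paths for this sample schema.
FIELD_PATHS = {"title": "title", "author": "author", "date": "date", "body": "body"}

def parse_article(xml_text, field_paths=FIELD_PATHS):
    """Turn one article XML document into a flat dict of text fields."""
    root = ET.fromstring(xml_text)
    record = {"id": root.get("id")}
    for field, path in field_paths.items():
        # findtext returns None when the element is missing
        value = root.findtext(path)
        record[field] = value.strip() if value is not None else None
    return record

record = parse_article(SAMPLE_XML)
```

Keeping the mapping in a dictionary means a second archive format can be supported by passing a different `field_paths` argument rather than writing a new parser.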
Server side
Import into a PostgreSQL database
Preprocessing with Enrycher
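
The slides do not show the database model, so the sketch below is only a guess at a minimal layout: one table for articles and one for the entities found during preprocessing. It uses Python's built-in `sqlite3` as a stand-in so that it runs without a server; the actual system uses PostgreSQL. The table names, column names, and sample record are hypothetical.

```python
import sqlite3

# In-memory SQLite as a stand-in for PostgreSQL (assumption, for a runnable sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    id        TEXT PRIMARY KEY,
    title     TEXT,
    author    TEXT,
    published TEXT,   -- ISO date string
    body      TEXT
);
CREATE TABLE article_entity (
    article_id TEXT REFERENCES article(id),
    entity     TEXT,
    PRIMARY KEY (article_id, entity)
);
""")

def store_article(conn, record, entities):
    """Insert one article and the entities extracted from it."""
    conn.execute(
        "INSERT INTO article (id, title, author, published, body) VALUES (?, ?, ?, ?, ?)",
        (record["id"], record["title"], record["author"], record["date"], record["body"]),
    )
    conn.executemany(
        "INSERT OR IGNORE INTO article_entity (article_id, entity) VALUES (?, ?)",
        [(record["id"], e) for e in entities],
    )
    conn.commit()

# Hypothetical record; the entity list stands in for the preprocessing output.
store_article(
    conn,
    {"id": "a1", "title": "Example headline", "author": "Jane Doe",
     "date": "1996-11-05", "body": "Example body text."},
    ["Clinton", "NATO"],
)
rows = conn.execute(
    "SELECT entity FROM article_entity WHERE article_id = ? ORDER BY entity", ("a1",)
).fetchall()
```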
Client side
Faceted search interface (author, entity, keyword, publish date, category)
Showing context around searched content/article
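
To make the idea of faceted search concrete, here is a minimal in-memory sketch: it keeps the articles that match every selected facet value and then counts the facet values that remain in the hits, which is what a facet sidebar typically displays. The sample articles and the function names are hypothetical, and a production system would run this inside the database rather than in Python.

```python
from collections import Counter

# Hypothetical in-memory articles; field names mirror some of the facets listed above.
ARTICLES = [
    {"id": 1, "author": "A. Smith", "category": "politics", "entities": ["Clinton", "NATO"]},
    {"id": 2, "author": "B. Jones", "category": "politics", "entities": ["Clinton"]},
    {"id": 3, "author": "A. Smith", "category": "science", "entities": ["NASA"]},
]

def as_list(value):
    """Treat single-valued and multi-valued facet fields uniformly."""
    return value if isinstance(value, list) else [value]

def faceted_search(articles, selected):
    """Keep articles matching every selected facet value, then count facet values in the hits."""
    hits = [
        a for a in articles
        if all(value in as_list(a[facet]) for facet, value in selected.items())
    ]
    counts = {
        facet: Counter(v for a in hits for v in as_list(a[facet]))
        for facet in ("author", "category", "entities")
    }
    return hits, counts

hits, counts = faceted_search(ARTICLES, {"entities": "Clinton"})
```

Selecting a further facet value, for example `{"entities": "Clinton", "author": "A. Smith"}`, narrows the hits again, which is how the user drills down.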
Current version of the GUI
Showing relationships between entities
Plans for the future
Improve search (with narrowing criteria, suggestions)
Add visualizations to show content in time, space and other contexts
Add links to similar content (stories)
Add links to outside resources (like DBpedia) or bring these resources into the application
Integrate with tools developed in AILab to improve search and presentation of articles (SearchPoint, DocAtlas, …)
Improve usability & appearance of the user interface
Topic landscape of the query “Clinton” from Reuters news 1996-1997
[Screenshot labels: Query, Search Results, Topic Map, Selected group of news, Selected story]
Visualization of social relationships between “Clinton” and other entities
[Screenshot labels: Query, Named entities in relation]
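
The slides do not say how relationships between entities are computed. One simple possibility, shown here purely as an assumption, is to count how often other named entities appear in the same article as the query entity. The sample entity sets and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical per-article entity sets, e.g. as produced by a named-entity extractor.
ARTICLE_ENTITIES = [
    {"Clinton", "Yeltsin", "NATO"},
    {"Clinton", "Dole"},
    {"Clinton", "Yeltsin"},
    {"Dole", "Kemp"},
]

def related_entities(article_entities, query):
    """Count entities that appear in the same article as the query entity."""
    counts = Counter()
    for entities in article_entities:
        if query in entities:
            # Count every co-occurring entity except the query itself
            counts.update(entities - {query})
    return counts

related = related_entities(ARTICLE_ENTITIES, "Clinton")
```

The resulting counts could serve as edge weights in a graph drawn around the query entity.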
Topic Trends Tracking of the documents including “Clinton”
[Screenshot labels: Query, Result set, Topic Trends Visualization, Topics description; topics shown: US Elections, US Budget, NATO-Russia, Mid-East conflict]
WW2 query “Pearl Harbor” in the NYTimes archive: Dec 7th 1941
WW2 query “Belgrade” in the NYTimes archive: Apr 6th 1941
WW2 query “Normandy” in the NYTimes archive: June 1944
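
Each of the WW2 queries above is paired with a date. One way to relate a query to a point in time, sketched here under assumptions, is to count matching articles per day and look for the peak. The sample records are invented for illustration, and the substring match is a simplification rather than the application's actual search.

```python
from collections import Counter
from datetime import date

# Invented records for illustration: (publication date, headline).
ARTICLES = [
    (date(1941, 12, 6), "Talks continue"),
    (date(1941, 12, 8), "Pearl Harbor attacked"),
    (date(1941, 12, 8), "Congress reacts to Pearl Harbor"),
    (date(1941, 12, 9), "Pearl Harbor losses reported"),
]

def hits_per_day(articles, query):
    """Count articles whose headline contains the query (case-insensitive), per day."""
    q = query.lower()
    return Counter(day for day, headline in articles if q in headline.lower())

counts = hits_per_day(ARTICLES, "Pearl Harbor")
peak_day, peak_count = counts.most_common(1)[0]
```

Plotting `counts` over the whole archive would give the kind of timeline on which such a peak stands out.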