Pascal Visualization Challenge
Blaž Fortuna, IJS
Marko Grobelnik, IJS
Steve Gunn, US
Part I: Challenge details
ePrints
- Database of around 1600 papers published by Pascal members
- Papers are described with:
  - Authors (unique Pascal Id)
  - Title
  - Abstract (most papers)
  - Publication date (some papers only have the year)
Challenge Goal
Two main goals:
- to test and compare different text visualization methods, ideas and algorithms on a common dataset,
- to contribute to the Pascal dissemination and promotion activities by using data about scientific publications from Pascal’s EPrints server.
Task
Visualize and present the Pascal ePrints data in a novel way which enables:
- discovering the main areas covered by the papers and people in Pascal,
- discovering how areas and people develop through time,
- helping researchers with recommendations on which papers to read,
- helping to find the right reviewers for new papers.
Data
- Raw XML file from Pascal ePrints server
- Processed data for easier use:
  - Bag-of-words (TextGarden, Matlab)
  - Graph (Matlab, Pajek)
- Data processed for different possible scenarios.
Raw XML file
- Cleaned data from Pascal ePrints server.
- Data is given as a list of papers, each paper is described by:
  - Title
  - Abstract
  - Year of publication
  - List of authors
- Each author is described by a unique Pascal Id and institution.
<paper id="2080" year="2006">
<title>Synthesis of Maximum…</title>
<abstract>In this presentation…</abstract>
<subjects>
<subject id="CS">Computati…</subject>
<subject id="LO">Learning…</subject>
<subject id="TA">Theory …</subject>
</subjects>
<authors>
<author id="452" institution_id="1">
Sandor Szedmak
</author>
<author id="1" institution_id="1">
John Shawe-Taylor
</author>
</authors>
<institutions>
<institution id="1">Universit…</institution>
</institutions>
</paper>
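A minimal sketch of reading one such record in Python with the standard library. The sample string is the record from the slide with its "…" truncations left untouched; the function name `parse_paper` and the output dictionary layout are illustrative assumptions, and the real file presumably wraps many `<paper>` elements in a root element that is not shown here.

```python
import xml.etree.ElementTree as ET

# Record copied from the slide; truncated values ("…") are kept as-is.
SAMPLE = """<paper id="2080" year="2006">
<title>Synthesis of Maximum…</title>
<abstract>In this presentation…</abstract>
<subjects>
<subject id="CS">Computati…</subject>
<subject id="LO">Learning…</subject>
<subject id="TA">Theory …</subject>
</subjects>
<authors>
<author id="452" institution_id="1">
Sandor Szedmak
</author>
<author id="1" institution_id="1">
John Shawe-Taylor
</author>
</authors>
<institutions>
<institution id="1">Universit…</institution>
</institutions>
</paper>"""


def parse_paper(xml_text):
    """Turn one <paper> element into a plain dictionary (hypothetical layout)."""
    paper = ET.fromstring(xml_text)
    # Map institution id -> institution name for this paper.
    institutions = {
        inst.get("id"): (inst.text or "").strip()
        for inst in paper.iter("institution")
    }
    return {
        "id": paper.get("id"),
        "year": int(paper.get("year")),
        "title": paper.findtext("title", default=""),
        "abstract": paper.findtext("abstract", default=""),
        "subjects": [s.get("id") for s in paper.iter("subject")],
        "authors": [
            {
                "id": a.get("id"),
                "name": (a.text or "").strip(),
                "institution": institutions.get(a.get("institution_id")),
            }
            for a in paper.iter("author")
        ],
    }


paper = parse_paper(SAMPLE)
print(paper["year"], [a["name"] for a in paper["authors"]])
```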
Bag-of-words
Covered scenarios:
- Document == Paper
- Document == Author
- Document == Institution
Available formats:
- TextGarden: text file where one line equals one document
- Matlab: data available in the form of a sparse Term-Document matrix
TextGarden (www.textmining.net):
Format:
Document_name !Subject DocumentList
Example:
Support_Vector_Machine_to_synthesise_kernels !Machine_Vision !Theory_and_Algorithms Support Vector Machine to synthesise kernels - Suppose we are given two sets of …
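A rough sketch of splitting one such line in Python. The layout is inferred only from the example on this slide (a name, then zero or more `!`-prefixed subjects, then the text); it is not TextGarden's own parser, and `parse_textgarden_line` is a hypothetical helper name.

```python
def parse_textgarden_line(line):
    """Split one line into (document name, subjects, words).

    Assumed layout, taken from the slide's example:
    Document_name !Subject1 !Subject2 word word word ...
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    name = tokens[0]
    subjects = []
    pos = 1
    # Every token right after the name that starts with "!" is a subject.
    while pos < len(tokens) and tokens[pos].startswith("!"):
        subjects.append(tokens[pos][1:])
        pos += 1
    return name, subjects, tokens[pos:]


example = (
    "Support_Vector_Machine_to_synthesise_kernels "
    "!Machine_Vision !Theory_and_Algorithms "
    "Support Vector Machine to synthesise kernels - Suppose we are given two sets of …"
)
name, subjects, words = parse_textgarden_line(example)
print(name, subjects, words[:3])
```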
Matlab:
Sparse matrix saved in a text file; it can be read into Matlab with:
X = spconvert(load('papers.dat'));
- Documents are columns in the matrix
- Names of columns (document names) and rows (words) are provided.
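For readers without Matlab, the same kind of file can be read in plain Python. This sketch assumes the file holds one "row column value" triplet per line with 1-based indices, which is the layout `spconvert` accepts; the sample triplets are made up for illustration and `read_triplets` is a hypothetical helper.

```python
from collections import defaultdict


def read_triplets(lines):
    """Parse 'row col value' lines into {column: {row: value}}.

    Columns are documents and rows are words, as on the slide.
    Repeated (row, col) entries are summed.
    """
    columns = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        # Indices may be written as floats (e.g. 1.0000e+00) in saved files.
        row, col = int(float(parts[0])), int(float(parts[1]))
        value = float(parts[2])
        columns[col][row] = columns[col].get(row, 0.0) + value
    return dict(columns)


# Made-up triplets: 3 words (rows) x 2 documents (columns).
sample = ["1 1 2", "3 1 1", "2 2 4", "3 2 1"]
columns = read_triplets(sample)
print(columns)
```

With a real file, the lines would come from `open("papers.dat")` instead of the sample list.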
Graph
Covered scenarios:
- Vertex == Word, Edge == Co-Appearance
- Vertex == Author, Edge == Co-Authors
- Vertex == Institution, Edge == Collaboration
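As an illustration of the author scenario, a co-authorship graph can be derived from the per-paper author lists. This is only a sketch: the challenge already ships the graphs, the edge weight (number of shared papers) is an assumption, and the author id 7 is made up.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(papers):
    """Count, for every pair of authors, how many papers they share.

    papers: iterable of author-id lists, one list per paper.
    Returns a Counter keyed by (smaller_id, larger_id).
    """
    edges = Counter()
    for author_ids in papers:
        # set() guards against an author listed twice on one paper.
        for a, b in combinations(sorted(set(author_ids)), 2):
            edges[(a, b)] += 1
    return edges


# Ids 452 and 1 appear in the sample record; 7 is hypothetical.
papers = [[452, 1], [1, 7, 452], [7]]
edges = coauthor_edges(papers)
print(dict(edges))
```

The same function covers the institution scenario if institution ids are passed instead of author ids.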
Available formats:
- Matlab: data available in the form of a sparse adjacency matrix
- Pajek: software for network analysis
Matlab:
Sparse matrix saved in a text file; it can be read into Matlab with:
X = spconvert(load('words.dat'));
Names of vertices (words, authors, institutions) are provided.
Pajek:
Can be downloaded from:
vlado.fmf.uni-lj.si/pub/networks/pajek
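For orientation, a small sketch that writes a weighted undirected graph in Pajek's plain-text .net layout (a `*Vertices` block followed by an `*Edges` block). The helper name `to_pajek` and the example edge weight are illustrative assumptions; the provided challenge files may carry additional attributes.

```python
def to_pajek(labels, edges):
    """Serialise a graph as Pajek .net text.

    labels: list of vertex labels (vertex numbers start at 1).
    edges: {(label_a, label_b): weight}
    """
    index = {label: i for i, label in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{i} "{label}"' for label, i in index.items()]
    lines.append("*Edges")
    for (a, b), weight in sorted(edges.items()):
        lines.append(f"{index[a]} {index[b]} {weight}")
    return "\n".join(lines) + "\n"


labels = ["Sandor Szedmak", "John Shawe-Taylor"]
# Weight 1 is illustrative: one shared paper in the sample record.
text = to_pajek(labels, {("Sandor Szedmak", "John Shawe-Taylor"): 1})
print(text)
```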
Submissions
The results can be:
- images,
- movies,
- Web sites,
- VRML files,
- executables (Windows, Linux),
- etc.
For an interactive tool, also provide a video showing the use of the tool on the Pascal ePrints data.
Evaluation
- Usability of visualization: the goal is to assess the usability of a particular visualization in different practical contexts.
- Innovativeness: the goal is to estimate how innovative the ideas used for the visualization are.
- Aesthetics of the image: here we aim to identify the "nicest" images from the challenge.
- General voting by Pascal researchers over the web about "who likes what".
Since all the criteria are subjective, we will hire experts to judge the quality.
Each of the criteria will generate a separate ranking.
Part II: Examples
Visualization example 1/2: Document Atlas
Bag-of-words approach:
- Document == Author
- An author is described by the sum of all the abstracts from the papers they co-authored.
- We construct a separate profile for papers from year 2004 and papers from year 2005.
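The author-as-document idea can be sketched as follows: each author's profile for a year is the summed word counts of the abstracts they co-authored that year. The tokenisation (lowercased alphabetic words, no stop-word removal or weighting) and the toy papers are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def author_profiles(papers):
    """Return {(author_id, year): Counter of words} summed over abstracts."""
    profiles = defaultdict(Counter)
    for paper in papers:
        words = re.findall(r"[a-z]+", paper["abstract"].lower())
        for author in paper["authors"]:
            profiles[(author, paper["year"])].update(words)
    return profiles


# Toy data, not taken from the ePrints database.
papers = [
    {"year": 2004, "authors": [1, 2], "abstract": "Kernel methods for text"},
    {"year": 2004, "authors": [1], "abstract": "Kernel learning"},
    {"year": 2005, "authors": [1], "abstract": "Text visualization"},
]
profiles = author_profiles(papers)
print(profiles[(1, 2004)])
```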
Dimensionality reduction
Documents are mapped from bag-of-words space to two dimensions in two steps:
- Latent Semantic Indexing: 13,000 dim => 110 dim
- Multidimensional Scaling: 110 dim => 2 dim
The background reflects the density of documents.
[Figure: document map over a density background; label: "document"]
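The two-step mapping (LSI, then MDS) can be sketched with NumPy. Two assumptions are worth stating: LSI is done here as a plain truncated SVD of the term-document matrix, and classical (Torgerson) MDS is swapped in for whichever MDS variant Document Atlas actually uses, which the slide does not specify. The matrix sizes are toy values, not 13,000 and 110.

```python
import numpy as np


def lsi(term_doc, k):
    """Truncated SVD: represent each document (column) in k dimensions."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k]).T  # shape: (n_docs, k)


def classical_mds(points, dim=2):
    """Classical MDS on Euclidean distances between the rows of `points`."""
    sq = np.sum(points**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * points @ points.T  # squared distances
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering  # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:dim]  # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))


rng = np.random.default_rng(0)
term_doc = rng.random((50, 12))  # toy data: 50 words x 12 documents
reduced = lsi(term_doc, k=5)  # 50 dim => 5 dim
coords = classical_mds(reduced)  # 5 dim => 2 dim
print(coords.shape)
```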
Background words
- Each part of the map is assigned a keyword which is most representative of the documents in that area.
- We get a “map” of the topics covered within the documents.
- In the case of the Pascal ePrints data, areas on the map correspond to the areas covered within the Pascal Network.
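One simple way to pick such keywords, shown only as a sketch: split the map into square cells, sum the term weights of the documents that fall into each cell, and take the heaviest term. Both the grid layout and "most representative = highest summed weight" are assumptions made here; the slide does not describe Document Atlas's actual rule.

```python
from collections import Counter, defaultdict


def region_keywords(docs, cell_size=0.5):
    """docs: list of (x, y, Counter of term weights).

    Returns {(cell_x, cell_y): keyword} for every non-empty grid cell.
    """
    totals = defaultdict(Counter)
    for x, y, terms in docs:
        cell = (int(x // cell_size), int(y // cell_size))
        totals[cell].update(terms)
    return {
        cell: counts.most_common(1)[0][0]
        for cell, counts in totals.items()
        if counts  # skip cells whose documents carry no terms
    }


# Toy positions and term weights, made up for illustration.
docs = [
    (0.1, 0.2, Counter(kernel=3, text=1)),
    (0.3, 0.4, Counter(kernel=1, text=2)),
    (0.9, 0.1, Counter(graph=2)),
]
keywords = region_keywords(docs)
print(keywords)
```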
Time dynamics
- For each author we have a profile for years 2004 and 2005.
- By showing the difference we can see how authors’ research focus developed between 2004 and 2005.
[Figure label: gradient]
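The comparison of two yearly profiles can be sketched as a difference of normalised term frequencies, listing the terms that gained and lost the most weight. This is an assumed, simplified stand-in for the map-based display on the slide; the profiles below are toy data.

```python
from collections import Counter


def profile_shift(before, after, top=3):
    """Return (gaining_terms, losing_terms) between two word-count profiles."""

    def normalise(counts):
        total = sum(counts.values()) or 1
        return {term: n / total for term, n in counts.items()}

    b, a = normalise(before), normalise(after)
    delta = {t: a.get(t, 0.0) - b.get(t, 0.0) for t in set(a) | set(b)}
    # Largest increase first; ties broken alphabetically.
    ranked = sorted(delta.items(), key=lambda kv: (-kv[1], kv[0]))
    gaining = [t for t, d in ranked[:top] if d > 0]
    losing = [t for t, d in reversed(ranked[-top:]) if d < 0]
    return gaining, losing


profile_2004 = Counter(kernel=4, text=1)
profile_2005 = Counter(kernel=1, text=2, visualization=2)
gaining, losing = profile_shift(profile_2004, profile_2005)
print(gaining, losing)
```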
Co-Authorships
Live Demo
Visualization example 2/2: IST World
- Web portal developed within the IST World EU project
- Uses search and visualization methods to:
  - discover the main research areas and collaborations within the PASCAL organizations
  - produce recommendations on which papers to read (e.g. papers on image recognition, or the kernel trick)
  - find the right reviewers for a new paper (e.g. a paper on "brain computer interface") and assess their competence
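Reviewer search of this kind is often approached by ranking author profiles against the new paper's words. The sketch below uses cosine similarity over raw word counts; this is an assumption for illustration, not a description of how IST World works, and the names and profiles are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_reviewers(paper, profiles, exclude=()):
    """Order candidate reviewers by similarity to the paper, best first.

    exclude: names to leave out, e.g. the paper's own authors.
    """
    scores = [
        (cosine(paper, profile), name)
        for name, profile in profiles.items()
        if name not in exclude
    ]
    return [name for _, name in sorted(scores, key=lambda s: (-s[0], s[1]))]


paper = Counter(brain=2, computer=1, interface=1)
profiles = {  # hypothetical author profiles
    "A": Counter(brain=3, interface=1),
    "B": Counter(kernel=2, text=1),
    "C": Counter(computer=1, vision=3),
}
ranking = rank_reviewers(paper, profiles)
print(ranking)
```

The same ranking, applied to paper profiles instead of author profiles, gives a basic paper recommender.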
Research areas
- Institutions are placed on the map of research areas from the Pascal Network
- The example shows which areas are closely related to JSI
Collaborations
- Collaboration of institutions
- Collaboration of authors working on “text mining”
Paper Recommendation
Competence Search
Live Demo
Thank you!