Visualizing Document Collections
cs5764: Information Visualization
Chris North
Where are we?
• Multi-D
• 1D
• 2D
• 3D
• Trees
• Graphs
• Document collections
• Design Principles
• Empirical Evaluation
• Visual Overviews
Structured Document Collections
• Multi-dimensional
• author, title, date, journal, …
• Trees
• Dewey decimal
• Graphs
• web, citations
Envision
• Ed Fox, et al.
• Multi-D
• Similar to Spotfire
Citation Networks
• Butterfly Browser
• Mackinlay et al (PARC)
Butterfly:
Left = refs
Right = citers
Yellow = #citers
Blue = visited
3D plot: date, name, # citers
Unstructured Document Collections
• Focus on Full Text
• Examples:
• digital libraries, news archives, web pages
• email archives, image galleries
• Tasks:
• Search
• Browse
• Classification, structurization
• Statistics, keyword usage, languages
• Subjects, themes, coverage
Visualization Strategies
• Cluster Maps
• Keyword Query
• Relationships
• Reduced representation
• User controlled layout
Cluster Map
• Create a “map” of the document collection
• Similar documents near each other
• Dissimilar documents far apart
• “Grocery store” concept
Document Vectors
          “aardvark”  “banana”  “chris”  …
  Doc1    1           2         0
  Doc2    2           1         0
  Doc3    0           0         3
  …
• Similarity between a pair of docs = dot product of their vectors
• Lay out documents in a 2-D map by similarity
• Similar to the spring model for graph layout
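As a minimal sketch of the dot-product similarity named on this slide, the snippet below uses the term counts listed for Doc1 to Doc3 (columns “aardvark”, “banana”, “chris”). The function names `dot` and `similarity_matrix` are illustrative, not taken from any particular system.

```python
# Term counts per document; columns are "aardvark", "banana", "chris".
doc_vectors = {
    "Doc1": [1, 2, 0],
    "Doc2": [2, 1, 0],
    "Doc3": [0, 0, 3],
}

def dot(u, v):
    """Dot product of two equal-length term-count vectors."""
    return sum(a * b for a, b in zip(u, v))

def similarity_matrix(vectors):
    """Return {(doc_a, doc_b): similarity} for every unordered pair of docs."""
    names = sorted(vectors)
    return {
        (a, b): dot(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

sims = similarity_matrix(doc_vectors)
for pair, score in sims.items():
    print(pair, score)
# ('Doc1', 'Doc2') 4
# ('Doc1', 'Doc3') 0
# ('Doc2', 'Doc3') 0
```

Doc1 and Doc2 share vocabulary and score 4, while Doc3 shares no terms with either and scores 0, so a similarity-based layout would place Doc1 and Doc2 close together and Doc3 far from both.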
Cluster Algorithms
• Partition clustering: partition into k subsets
• Pick k seeds
• Iteratively attract nearest neighbors
• Hierarchical clustering: dendrogram
• Group nearest-neighbor pair
• Iterate
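The two strategies above can be sketched roughly as follows. This is an illustration under assumptions the slide does not state: Euclidean distance between document vectors, a k-means-style update in which each seed moves to the centroid of the points it attracted, and single linkage for the hierarchical merge. All function names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance (an assumed metric; the slide does not name one)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def partition_cluster(points, seeds, iterations=10):
    """Partition clustering: k seeds iteratively attract their nearest points."""
    centers = list(seeds)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each seed to the centroid of its group (keep it if the group is empty).
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return groups

def hierarchical_cluster(points):
    """Hierarchical clustering: repeatedly group the nearest pair of clusters.

    Returns the merge history, which is the information a dendrogram draws.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

docs = [(1, 2, 0), (2, 1, 0), (0, 0, 3)]
groups = partition_cluster(docs, seeds=[(1, 2, 0), (0, 0, 3)])
merges = hierarchical_cluster(docs)
```

With these three vectors, both approaches group the first two documents together before the third: the partition result has two subsets, and the first hierarchical merge joins the nearest-neighbor pair.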
Landscapes
• Wise et al, “Visualizing the non-visual”
• ThemeScapes, Cartia
• PNNL
• Mountain height = Cluster size
Kohonen Maps
• Xia Lin, “Document Space”
• http://faculty.cis.drexel.edu/sitemap/index.html
WebSOM
• http://websom.hut.fi/websom/
Map.net
• http://maps.map.net/start
Galaxy of News
• MIT
• Cluster map with full-text zooming
Cluster Map
• Good:
• Map of collection
• Major themes and sizes
• Relationships between themes
• Scales up
• Bad:
• Where to locate documents with multiple themes?
» Both mountains, between mountains, …?
• Relationships between documents, within documents?
• Algorithm becomes (too) critical
Keyword Query
• Keyword query, Search engine
• Rank ordered list
• “Information Retrieval”
• Visualization of results
Keyword Distributions
• Hearst, “TileBars”
• http://elib.cs.berkeley.edu/tilebars/
• Keyword distributions within documents
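A rough sketch of the idea of showing keyword distributions within a document: split the text into segments and count each query term per segment, then draw one row of shaded tiles per term. The fixed-size segments, the shading characters, and the function names here are assumptions made for illustration; they are not a description of how TileBars itself segments or renders text.

```python
def keyword_distribution(text, terms, num_segments=4):
    """Count each term in consecutive, roughly equal-sized word segments."""
    words = text.lower().split()
    size = max(1, -(-len(words) // num_segments))  # ceiling division
    segments = [words[i:i + size] for i in range(0, len(words), size)]
    return {t: [seg.count(t.lower()) for seg in segments] for t in terms}

def render(distribution):
    """Draw one text row per term; darker characters mean more hits."""
    shades = " .:#"
    lines = []
    for term, counts in distribution.items():
        bar = "".join(shades[min(c, len(shades) - 1)] for c in counts)
        lines.append(f"{term:>10} |{bar}|")
    return "\n".join(lines)

text = "map cluster map text text query cluster map"
dist = keyword_distribution(text, ["map", "query"], num_segments=4)
print(render(dist))
```

Reading across a row shows where in the document a term occurs; reading down a column shows which terms co-occur in the same passage.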
Document Distributions
• Korfhage, “VIBE”
• http://www.pitt.edu/~korfhage/interfaces.html
• Documents located between query keywords using spring model
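One simple way to picture the spring placement: fix each query keyword at an anchor point and attach the document to every anchor with a zero-length spring whose stiffness is the document's score for that keyword. The equilibrium of such springs is the score-weighted average of the anchor positions. This is a sketch under that assumption, with hypothetical names and made-up anchor coordinates; it is not claimed to be VIBE's exact algorithm.

```python
def spring_position(anchors, weights):
    """Equilibrium point of springs pulling a document toward keyword anchors.

    anchors: {keyword: (x, y)} fixed positions of the query keywords
    weights: {keyword: score} how strongly the document matches each keyword
    Returns None when the document matches no keyword at all.
    """
    total = sum(weights.values())
    if total == 0:
        return None
    x = sum(w * anchors[k][0] for k, w in weights.items()) / total
    y = sum(w * anchors[k][1] for k, w in weights.items()) / total
    return (x, y)

# Hypothetical keyword anchors and document scores.
anchors = {
    "visualization": (0.0, 0.0),
    "retrieval": (4.0, 0.0),
    "clustering": (2.0, 4.0),
}
balanced = spring_position(anchors, {"visualization": 1, "retrieval": 1, "clustering": 2})
skewed = spring_position(anchors, {"visualization": 3, "retrieval": 1})
print(balanced)  # (2.0, 2.0)
print(skewed)    # (1.0, 0.0)
```

A document that matches only one keyword sits on that keyword's anchor, and a document that matches several is pulled in between them in proportion to its scores.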
VR-VIBE
Keyword Query
• Good:
• Reduces the browsing space
• Map according to user’s interests
• Bad:
• What keywords do I use?
• What about other related documents that don’t use these keywords?
• No initial overview
• Mega-hit, zero-hit problem
Relationships
• Show inter-relationships
• Matrix or Complete Graph
• Similarity measure between all pairs of docs
• Threshold level
• Salton
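The matrix-plus-threshold idea can be sketched as follows: score every pair of documents, then keep only the pairs whose score reaches the threshold as links to draw. Cosine similarity is used here as an assumption; the slide only says “similarity measure”. The function names and the sample vectors are illustrative.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two term-count vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def similarity_links(vectors, threshold):
    """Return (doc_a, doc_b, score) for every pair scoring at or above threshold."""
    links = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            links.append((a, b, round(score, 3)))
    return links

vectors = {"Doc1": [1, 2, 0], "Doc2": [2, 1, 0], "Doc3": [0, 0, 3]}
links = similarity_links(vectors, threshold=0.5)
print(links)  # [('Doc1', 'Doc2', 0.8)]
```

Raising the threshold thins the graph toward only the strongest relationships; lowering it approaches the complete graph.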
Variations
• Docs + Paragraphs
• Themes
Relationships
• Better for smaller, more detailed map
• Scale up: Network visualization
• Good:
• Can see more complex relationships between/within documents
• Can act like hyperlinks!
• Bad:
• Finding specific documents
• Scale up difficult
Reduced Visual Representation
• Bederson, “Image browsing”
User Controlled Layout
• Card, “WebBook and Web Forager”
• http://vtopus.cs.vt.edu/~north/infoviz/webbook.mpa
Data Mountain
• Robertson, “Data Mountain” (Microsoft)