Data Visualization

Download Report

Transcript Data Visualization

Structural Web Search Using a
Graph-Based Discovery System
Nitish Manocha, Diane J. Cook, and
Lawrence B. Holder
University of Texas at Arlington
[email protected]
http://www-cse.uta.edu/~cook
Structured Web Search




Existing search engines use linear feature match
Web contains structural information as well
 Hyperlink information
 Web viewed as a graph [Kleinberg]
Subdue searches based on structure
 Use as foundation of a structural search engine
Incorporation of WordNet allows for synonym
match




Discovers structural patterns in input graphs
A substructure is connected subgraph
An instance of a substructure is a subgraph
that is isomorphic to substructure definition
Pattern discovery, classification, clustering
Input Database
T1
Substructure S1
(graph form)
shape
C1
S1
Compressed Database
triangle
object
R1
on
T2
T3
T4
S2
S3
S4
shape
object
C1
S1
R1
square
S1
S1
S1
Subdue Algorithm
•
•
•
•
•
Start with individual vertices
Keep only best substructures on queue
Expand substructure by adding
edge/vertex
Compress graph and repeat to generate
hierarchical description
Optional use of background knowledge
Inexact Graph Match
Some variations may occur between
instances
 Want to abstract over minor differences
 Difference = cost of transforming one
graph to make it isomorphic to another
 Match if cost/size < threshold

Application Domains






Protein data
Human Genome DNA data
Spatial-temporal domains
 Earthquake data
 Aircraft Safety and Reporting System
Telecommunications data
Program source code
Web data
Represent Web as Graph

Breadth-first search of domain to generate graph
 Nodes represent pages / documents
 Edges represent hyperlinks
 Additional nodes represent document keywords
subdu
e
texas
university
word
projects
word
work
hyperlink
page
page
parallel
group
learning
robotics
planning
WebSubdue’s Structural Search
Formulate query as graph
 Use Subdue’s predefined substructure
option to search for instances of query

Instructor
http
Teaching
Robotics
http
Research
Robotics
Postscript
| PDF
Publicatio
n
Robotics
Query: Find all pages which link to
a page containing term ‘Subdue’
Subgraph vertices:
Subdue
word
page
hyperlink
page
1 page
URL: http://cygnus.uta.edu
7 page
URL: http://cygnus.uta.edu/projects.html
8 Subdue
[1->7] hyperlink
[7->8] word
/* Vertex ID Label */
/* Edge Vertex 1 Vertex 2 Label */
s
v 1 page
v 2 page
v 3 Subdue
d 1 2 hyperlink
d 2 3 word
Search for Presentation Pages
page
hyperlink
hyperlink
hyperlink
page
page
hyperlink

WebSubdue
 22 instances

page
hyperlink
AltaVista

Query “host:www-cse.uta.edu AND
image:next_motif.gif AND
image:up_motif.gif AND
image:previous_motif.gif.”

12 instances
Search for Reference Pages
page
hyperlink
hyperlink
hyperlink
page


page
…
page
Search for page with at least 35 in links
 WebSubdue found 5 pages in www-cse
AltaVista cannot perform this type of search
Inclusion of WordNet


When generating graph
 Use common stopword list
When searching for subgraph instances
 Morphology functions
 October = Oct
 teaching = teach
 Synsets
 Optional allowance of synonyms
Search for pages on ‘jobs in
computer science’



Inexact match: allow one level of synonyms
WebSubdue found 33 matches
 Words include employment, work, job,
problem, task
AltaVista found 2 matches
page
word
jobs
word
word
computer
science
Search for ‘authority’ hub and authority pages
page
HUBS
AUTHORITIES word
algorithms

page
page
page
hyperlink
page

page
WebSubdue found 3
hub (and 3 authority)
pages
AltaVista cannot
perform this type of
search
word
word
algorithms


algorithms
Inexact match applied
with threshold = 0.2 (4.2
transformations
allowed)
WebSubdue found 13
matches
Subdue Learning from Web Data

Distinguish professors’ and students’ web pages
 Learned concept (professors have “box” in
address field)
word
page

box
Distinguish online stores and professors’ web pages
 Learned concept (stores have more levels in
graph)
page
page
page
page
page
page
page
Conclusions
WebSubdue can be used to search for
structural web data
 Could be enhanced with additional
WordNet features such as synset path
length
 Efficient structural search necessary for
future of web search tools

To Learn More
cygnus.uta.edu/subdue
[email protected]
http://www-cse.uta.edu/~cook