Data Visualization
Download
Report
Transcript Data Visualization
Structural Web Search Using a
Graph-Based Discovery System
Nitish Manocha, Diane J. Cook, and
Lawrence B. Holder
University of Texas at Arlington
[email protected]
http://www-cse.uta.edu/~cook
Structured Web Search
Existing search engines use linear feature match
Web contains structural information as well
Hyperlink information
Web viewed as a graph [Kleinberg]
Subdue searches based on structure
Use as foundation of a structural search engine
Incorporation of WordNet allows for synonym
match
Discovers structural patterns in input graphs
A substructure is connected subgraph
An instance of a substructure is a subgraph
that is isomorphic to substructure definition
Pattern discovery, classification, clustering
Input Database
T1
Substructure S1
(graph form)
shape
C1
S1
Compressed Database
triangle
object
R1
on
T2
T3
T4
S2
S3
S4
shape
object
C1
S1
R1
square
S1
S1
S1
Subdue Algorithm
•
•
•
•
•
Start with individual vertices
Keep only best substructures on queue
Expand substructure by adding
edge/vertex
Compress graph and repeat to generate
hierarchical description
Optional use of background knowledge
Inexact Graph Match
Some variations may occur between
instances
Want to abstract over minor differences
Difference = cost of transforming one
graph to make it isomorphic to another
Match if cost/size < threshold
Application Domains
Protein data
Human Genome DNA data
Spatial-temporal domains
Earthquake data
Aircraft Safety and Reporting System
Telecommunications data
Program source code
Web data
Represent Web as Graph
Breadth-first search of domain to generate graph
Nodes represent pages / documents
Edges represent hyperlinks
Additional nodes represent document keywords
subdu
e
texas
university
word
projects
word
work
hyperlink
page
page
parallel
group
learning
robotics
planning
WebSubdue’s Structural Search
Formulate query as graph
Use Subdue’s predefined substructure
option to search for instances of query
Instructor
http
Teaching
Robotics
http
Research
Robotics
Postscript
| PDF
Publicatio
n
Robotics
Query: Find all pages which link to
a page containing term ‘Subdue’
Subgraph vertices:
Subdue
word
page
hyperlink
page
1 page
URL: http://cygnus.uta.edu
7 page
URL: http://cygnus.uta.edu/projects.html
8 Subdue
[1->7] hyperlink
[7->8] word
/* Vertex ID Label */
/* Edge Vertex 1 Vertex 2 Label */
s
v 1 page
v 2 page
v 3 Subdue
d 1 2 hyperlink
d 2 3 word
Search for Presentation Pages
page
hyperlink
hyperlink
hyperlink
page
page
hyperlink
WebSubdue
22 instances
page
hyperlink
AltaVista
Query “host:www-cse.uta.edu AND
image:next_motif.gif AND
image:up_motif.gif AND
image:previous_motif.gif.”
12 instances
Search for Reference Pages
page
hyperlink
hyperlink
hyperlink
page
page
…
page
Search for page with at least 35 in links
WebSubdue found 5 pages in www-cse
AltaVista cannot perform this type of search
Inclusion of WordNet
When generating graph
Use common stopword list
When searching for subgraph instances
Morphology functions
October = Oct
teaching = teach
Synsets
Optional allowance of synonyms
Search for pages on ‘jobs in
computer science’
Inexact match: allow one level of synonyms
WebSubdue found 33 matches
Words include employment, work, job,
problem, task
AltaVista found 2 matches
page
word
jobs
word
word
computer
science
Search for ‘authority’ hub and authority pages
page
HUBS
AUTHORITIES word
algorithms
page
page
page
hyperlink
page
page
WebSubdue found 3
hub (and 3 authority)
pages
AltaVista cannot
perform this type of
search
word
word
algorithms
algorithms
Inexact match applied
with threshold = 0.2 (4.2
transformations
allowed)
WebSubdue found 13
matches
Subdue Learning from Web Data
Distinguish professors’ and students’ web pages
Learned concept (professors have “box” in
address field)
word
page
box
Distinguish online stores and professors’ web pages
Learned concept (stores have more levels in
graph)
page
page
page
page
page
page
page
Conclusions
WebSubdue can be used to search for
structural web data
Could be enhanced with additional
WordNet features such as synset path
length
Efficient structural search necessary for
future of web search tools
To Learn More
cygnus.uta.edu/subdue
[email protected]
http://www-cse.uta.edu/~cook