Transcript Slide 1

Growing Parallel Paths for
Entity-Page Retrieval
Tim Weninger, Cindy Xide Lin, and Jiawei Han
Department of Computer Science
University of Illinois Urbana-Champaign, Urbana, IL
Work Submitted to VLDB'10
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Problem: Entity Page Retrieval
Given: Reference page
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Problem: Entity Page Retrieval
…Can We find Entity Pages of the same Type?
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Problem: Entity Page Retrieval
…Can We find Entity Pages of the same Type?
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Definitions:
Defn 1: Root to link path:
◊ - hrefX contains
HTML-TABLE-TR1—TD-hrefX
Defn 2: Parallel Links:
Share a root to link path.
i.e., lists of links
Defn 3: Intra-page parallel paths:
◊ - hrefC ǁ ◊ - hrefB
◊ - hrefC ǁ ◊ - hrefX
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Definitions:
Defn 4: Inter-page parallel
◊ - hrefC in Page A ǁ ◊ - hrefW in Page B
Advanced Data Mining
May 4, 2010
Defn 5: Parallel Web site paths
Share intra or inter-page parallel paths
across multiple pages
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Properties of Parallel Paths
Prop. 1: Equal Path Length Property:
Parallel paths must contain the same number of pages.
Prop. 2: Parallel Page Property:
The test of two paths being in parallel is equivalent to the result of tests of
respective pages.
Prop. 3: Equal Page Length Property:
Parallel paths must have the same number of nodes across pages.
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Properties of Parallel Paths
Prop. 4: Divergent Path Property:
Parallel Paths can extend through separate pages
Prop. 5: Early Termination Property:
The test of two paths can be terminated at the first occurrence of a dissimilar
node
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Finding Paths
Naive Method
Can be very costly
Growing Parallel Paths
First find example path
Then grow paths which are in parallel to the example
Repeat with alternate paths
This makes magic happen
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Repeating with alternate paths
k-shortest paths
Do k-shortest path
search. Explore all of
these paths
Removing links
After exploring a path remove the edges from the graph
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Interpreting the Output
Side Effect of Repeating with Alternate paths
Given: Jiawei Han
Result:
Jiawei Han
Cheng Zhai
Kevin Chang
Dan Roth
Vikram Adve
Roy Campbell
40
38
38
32
4
3
…
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Interpreting the Output
Side Effect of Path Finding
What does the link labels on the path tell us about the entity
First path
People
Faculty
Jiawei Han
Personal Site
Second path
Research
Data Mining
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Experiments
Top 25 CS Departments in US (according to US News)
Find all professors
United States Congress
Find all senators, representatives, and committees
UIUC only
Find all courses
Final all research groups
Baseline
Google’s find similar search (essentially TFIDF-type ranking)
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Results
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Results
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Results
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Conclusions and Future Work
Given a reference page and an example entity type we
can retrieve all entity pages of the same type
Implications:
We can use this for information integration
Search, retrieval can be enhanced
Shortcomings:
Most errors due to incorrect list finding
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign
Questions?
Advanced Data Mining
May 4, 2010
Data and Information Systems Laboratory
University of Illinois Urbana-Champaign