The Web of the Future

Download Report

Transcript The Web of the Future

The BINGO! System for
Information Portal Generation
and Expert Web Search
Sergej Sizov, Michael Biwer, Jens Graupmann,
Stefan Siersdorfer, Martin Theobald,
Gerhard Weikum, Patrick Zimmer
University of the Saarland, Saarbrücken, Germany
[email protected]
http://www-dbs.cs.uni-sb.de
Outline:
• Problem & research challenge
• Our work in progress
• Future research avenues
1
(In-) Effectivity of Web Search Engines
query = „ARIES recovery But
openthere
source
download“
is hope:
Google (and Yahoo) top 3:
• exploit structure
CpSc 862 Assignment 6
• explore
neighborhood
... You can open it by double clicking
the Mars ...
(logfunc ... and any deviations from•the
standard
Aries recovery ...
use
topic
directory
www.cs.clemson.edu/~jzwang/cpsc862/hw07/cpsc862hw07.htm
Tutorial: Application Servers and Associated Technologies
... an OIA for his inventions (ARIES, ARIES/IM, Commit_LSN ...
www.almaden.ibm.com/u/mohan/ AppServersTutorial_SIGMOD2002_Slid
Recovery from Spiritual Abuse
... with you no matter how long your recovery takes ... powerful form of
spiritual abuse and a source of spiritual ... their denial in hopes that ...
www.nacronline.com/dox/library/ biblestudies/sp_abuse.pdf
Sourceforge:
Topic –> Database –> Database Engines/Servers
BZFlag - OpenSource OpenGL Multiplayer Multiplatform battlezone ...
JBoss.org - The JBoss/Server is the leading Open Source ...
Dev-C++ - Dev-C++ is an full-featured Integrated ... libraries ...
2
From Observations to Research Avenues
Key observation:
yes, there are ways to find what you are searching,
but intellectual time is expensive !
 requires „intelligent“ automation
Research Avenues:
Structure and annotate information: XML
Organize documents „semantically“: ontologies
Leverage machine learning: automatic classification
More computer time for better result: focused crawling
Goal:
should be able to find results for advanced info request
in one day with < 5 min intellectual effort
that the best human experts can find with infinite time
3
Challenge: Expert Web Queries
Where can I download an open source implementation
of the ARIES recovery algorithm?
Find the text and notes of the western song Raw Hide.
What are Chernoff-Hoeffding bounds?
Find Fermat‘s last / Wiles‘ theorem in MathML format.
Are there any theorems isomorphic to my new conjecture?
Find related theorems.
Which professors from D are teaching
DBS and have research projects on XML?
D
DBS
XML
Which Shakespeare drama has a scene where a woman
talks a Scottish nobleman into murder?
Who was the Italian woman that I met at the PC meeting
where Moshe Vardi was PC Chair?
4
Challenge: (Meta-) Portal Building
Who are the top researchers in the database system community?
Who is working on using machine learning techniques for
searching XML data?
What are the most important results in large deviation theory?
Find information about public subsidies for plumbers.
Find new EU regulations that affect an electrician‘s business.
Which gene expression data from Barrett tissue in the esophagus
exhibit high levels of gene A01g ?
Are there metabolic models for acid reflux that could be related
to the gene expression data?
5
Crawler
Classifier
DB / Cache / Queue Mgr
Portal
Explorer
Deep
Web
Focus
Mgr
Web Service Generator
(UDDI, WSDL)
HTML2XML
Auto Annotation
BINGO!
...
Filter
Ontologies (XML ...)
Search Engine
Web Applications
Chamber of Crafts
Portal
Materialography
Portal
XXL
Query Processor
Index
Mgr
Ontology
Service
Intranet Applications
Our Research Agenda
Ontology
Mgr
Semantic
Web
6
Focused Crawling
automatially build
personal topic directory
(Soumen Chakrabarti et al. 1999)
WWW
......
.....
......
.....
seeds
Crawler
training
Classifier
Link Analysis
Root
DB Core
Technology
Semistrutured
Data
Web
Retrieval
Data
Mining
XML
critical issues:
• classifier accuracy
• feature selection
• quality of training data
7
The BINGO! Focused Crawler
WWW
......
.....
......
.....
seeds
Crawler
training
Classifier
Link Analysis
high SVM
high authority
confidence
score
re-training
Root
DB Core
Technology
Semistrutured
Data
Web
Retrieval
Data
Mining
XML
topic-specific
archetypes
8
Classification using Support Vector Machines

(SVM)
w x b 0
x2
?
C
C
n training vectors with
components (x1, ..., xm, C)
with C = +1 oder -1


x1
Training:

Compute separating hyperplane w x  b  0 that maximizes the
min. distance of all positive and negative samples to the hyperplane
 solve quadratic programming problem
Decision:
m


Test new vector y for membership in C: ( w y  b )   wi yi  b  0

i 1
y
Distance of
from hyperplane yields classification confidence
9
BINGO! Adaptive Re-training
and Focus Control
for each topic V do {
archetypes(V) := top docs of SVM confidence ranking
 top authorities of V ;
remove from archetypes(V) all docs d that do not satisfy
confidence(d)  mean confidence(V) ;
recompute feature selection based on archetypes(V) ;
recompute SVM model for V with archetypes(V) as training data }
combine re-training with two-phase crawling strategy:
• learning phase:
aims to find archetypes (high precision)
 hard focus for crawling vicinity of training docs
• harvesting phase:
aims to find results (high recall)
 soft focus for long-range exploration with tunnelling
10
Feature Space Construction & Meta Strategies
possible strategies:
• single term frequencies or tf*idf with top n cross-entropy terms
• term pairs within proximity window
(e.g., support vector, match about world championship)
• text terms from hyperlink neighbors
• anchor text terms from neighbors
(e.g., <a href=...> click here for soccer results </a>)
meta strategies (over m feature spaces for class k):
m
• unanimous decision: Ck ( d j )  1 if  Ck( )  m
 1
m
( ) ( )
~
C
(
d
)

1
if
p
 k Ck  
k
j
• weighted average:
 1
~p (  )
with  estimator k (Joachims‘00)
strategy  with best ratio of
for precision of model  for class k
estimated precision to
based on leave-one-out training
runtime cost
11
Experiment on Information Portal Generation (1)
for single-topic portal on „Database Research“
start with only 2 initial seeds: homepages of DeWitt and Gray
goal: automatic gathering of DBLP author homepages
(with DBLP excluded from crawl)
learning phase for improved feature selection and classification:
depth-first crawl limited to domains of seeds
followed by archetype selection and retraining (for high precision)
harvesting phase for building a rich portal:
prioritized breadth-first crawl (for high recall)
12
Experiment on Information Portal Generation (2)
result after 12 hours (on commodity PC):
• 3 mio. URLs crawled on 30 000 hosts, 1 mio. pages analyzed,
• 0.5 mio. pages positively classified
• found 7000 homepages out of 30 000 DBLP authors,
712 authors of the top 1000 DBLP authors
with 267 among the 1000 best crawl results
+ postprocessing for querying and analysis:
• ranking by SVM confidence, authority score, etc.
• clustering, relevance feedback, etc.
13
Ongoing and Future Work
Deep Web exploration with auto-generated queries
Exploiting ontological knowledge
e.g.: search for a „woman talking someone into murder“
Construct richer feature spaces
Exploiting linguistic analysis methods
e.g.: „... cut his throat ...“
act: killing
subject:... object:...
Generalized links & semantic joins, e.g. named entities
Identifying semantically coherent units
Combining focused crawling with XML search
 auto-annotation of HTML, Latex, PDF, etc. docs
 cross-document querying à la XXL
User guidance & portal admin methodology
Towards “Semantically Coherent” Units
Homepage
of ...
Teaching:
DBS
IR
...
Homepage
of ...
Teaching
Research
vs.
Address: ...
Research:
XML
Auto-tuning
...
My work:
J2EE
XML
WSDL
...
My hobbies:
Poisonous frogs
South Amerian stamps
Canoeing
...
Ongoing and Future Work
Deep Web exploration with auto-generated queries
Exploiting ontological knowledge
e.g.: search for a „woman talking someone into murder“
Construct richer feature spaces
Exploiting linguistic analysis methods
e.g.: „... cut his throat ...“
act: killing
subject:... object:...
Generalized links & semantic joins, e.g. named entities
Identifying semantically coherent units
Combining focused crawling with XML search
 auto-annotation of HTML, Latex, PDF, etc. docs
 cross-document querying à la XXL
User guidance & portal admin methodology
Concluding Remarks
long-term goal: exploit the Web‘s potential
for being the world‘s largest knowledge base
XML and Semantic Web are key assets,
but by themselves not sufficient; we need to
cope with diversity, incompleteness, and uncertainty
 absolute need for ranked retrieval
view information organization and information search
as dual views of the same problem
combine techniques from DBS, IR, CL, AI, and ML
need better theory about quality/efficiency tradeoffs
as well as large-scale experiments
17