Workshop on Web Mining Technology and Applications: Panel


Web Mining for Unknown Term Translation
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering
[email protected]
http://myweb.ncku.edu.tw/~whlu
Web Mining Research Problems
• Difficulties in automatic construction of multilingual translation lexicons
  – Techniques: parallel/comparable corpora
  – Bottlenecks: lack of diverse/multilingual resources
• Difficulties in query translation for cross-language information retrieval (CLIR)
  – Techniques: bilingual dictionaries / machine translation / parallel corpora
  – Bottlenecks: multiple-sense / short / diverse / unknown queries
• Challenges: Web queries are often
  – Short: 2-3 words (Silverstein et al. 1998)
  – Diverse: wide-scoped topics
  – Unknown (out of vocabulary): 74% are unavailable in CEDICT, a Chinese-English electronic dictionary containing 23,948 entries
  – E.g.
    • Proper names: 愛因斯坦 (Einstein), 海珊 (Hussein)
    • New terminology: 嚴重急性呼吸道症候群 (SARS), 院內感染 (nosocomial infection)
Cross-Language Information Retrieval
• Query in the source language and retrieve relevant documents in the target languages
• (Diagram: a source query such as SARS, 愛因斯坦 (Einstein), 老年癡呆症 (Alzheimer's disease), or National Palace Museum is passed through query translation to obtain a target translation, which drives information retrieval over target documents.)
Difficulties in Web Query Translation Using Machine Translation
• English source query: National Palace Museum
• Machine-translated Chinese: 全國宮殿博物館 (a word-by-word literal translation that misses the proper name)
Research Paradigm
• New approach: build a live translation lexicon by Web mining
• (Diagram: Web mining over the Internet has two components, anchor-text mining over multilingual anchor texts and search-result mining over language-mixed texts in search-result pages. Both feed term-translation extraction, which maintains the live translation lexicon. Applications: cross-language information retrieval and cross-language Web search.)
Anchor-Text Mining with Probabilistic Inference Model
• Asymmetric translation models (conventional): $P(s \to t) = \frac{P(s \cap t)}{P(s)} \neq P(t \to s)$
• Symmetric model with link information, combining co-occurrence with the conventional translation model over the linked pages $u_1, \dots, u_n$:

  $P(s \leftrightarrow t) = \frac{P(s \cap t)}{P(s \cup t)}
    = \frac{\sum_{i=1}^{n} P(s \cap t \mid u_i)\, P(u_i)}{\sum_{i=1}^{n} P(s \cup t \mid u_i)\, P(u_i)}
    = \frac{\sum_{i=1}^{n} P(s \cap t \mid u_i)\, P(u_i)}{\sum_{i=1}^{n} \left[ P(s \mid u_i) + P(t \mid u_i) - P(s \cap t \mid u_i) \right] P(u_i)}
    = \frac{\sum_{i=1}^{n} P(s \mid u_i)\, P(t \mid u_i)\, P(u_i)}{\sum_{i=1}^{n} \left[ P(s \mid u_i) + P(t \mid u_i) - P(s \mid u_i) P(t \mid u_i) \right] P(u_i)}$

  where $P(u_i) = \frac{L(u_i)}{\sum_{j=1}^{n} L(u_j)}$ and $L(u_j)$ is the number of $u_j$'s in-links (page authority).
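A minimal sketch, in Python, of how the symmetric estimate above could be computed from anchor-text data. The corpus representation (a list of linked pages, each with an in-link count and the anchor-text terms pointing to it) and the unsmoothed estimates of P(s|u_i) and P(t|u_i) are assumptions made for illustration, not the authors' implementation.

def symmetric_translation_prob(s, t, pages):
    """Estimate P(s <-> t) from anchor-text data.

    `pages` is an assumed structure: a list of (in_link_count, anchor_terms)
    pairs, one per linked page u_i, where anchor_terms holds the terms
    appearing in anchor texts pointing to u_i.
    """
    total_links = sum(links for links, _ in pages) or 1
    num, den = 0.0, 0.0
    for links, terms in pages:
        p_u = links / total_links                 # page authority P(u_i)
        n_terms = len(terms) or 1
        p_s = terms.count(s) / n_terms            # P(s | u_i)
        p_t = terms.count(t) / n_terms            # P(t | u_i)
        p_both = p_s * p_t                        # independence assumption
        num += p_both * p_u                       # co-occurrence part
        den += (p_s + p_t - p_both) * p_u         # union via inclusion-exclusion
    return num / den if den > 0 else 0.0

# toy example: two pages linked with mixed Chinese/English anchor texts
pages = [
    (120, ["新力", "Sony", "Sony", "首頁"]),
    (30,  ["Sony", "homepage", "新力"]),
]
print(symmetric_translation_prob("新力", "Sony", pages))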
Transitive Translation Model for Multilingual Translation
• Direct translation model (probabilistic inference), e.g. 新力 (Traditional Chinese) ↔ ソニー (Japanese):
  $P_{direct}(s, t) = P(s \leftrightarrow t)$
• Indirect translation model through an intermediate translation m, e.g. Sony (English):
  $P_{indirect}(s, t) = \sum_{m} P(s \leftrightarrow m, m \leftrightarrow t)\, P(m) \approx \sum_{m} P(s \leftrightarrow m)\, P(m \leftrightarrow t)\, P(m)$,
  where $P(m)$ is the occurrence probability of m in the corpus.
• Transitive translation model:
  $P_{trans}(s, t) = \begin{cases} P_{direct}(s, t), & \text{if } P_{direct}(s, t) > \theta \\ P_{indirect}(s, t), & \text{otherwise,} \end{cases}$
  where $\theta$ is a predefined threshold, s is the source term, t the target translation, and m an intermediate translation.
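A short sketch of the transitive back-off under these definitions. `direct_prob` stands in for the symmetric probabilistic-inference score above; the candidate intermediate set, the occurrence-probability function, and the threshold value are illustrative assumptions.

def transitive_translation_prob(s, t, direct_prob, intermediates,
                                occurrence_prob, theta=0.1):
    """Back off to indirect translation when the direct score is weak.

    direct_prob(a, b)   -> P(a <-> b), e.g. the symmetric model above (assumption)
    intermediates       -> candidate intermediate terms m (assumption)
    occurrence_prob(m)  -> P(m), occurrence probability of m in the corpus
    theta               -> predefined threshold (assumed value)
    """
    p_direct = direct_prob(s, t)
    if p_direct > theta:
        return p_direct
    # indirect model: sum over intermediate translations m
    return sum(direct_prob(s, m) * direct_prob(m, t) * occurrence_prob(m)
               for m in intermediates)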
Promising Results for Automatic Construction of Multilingual Translation Lexicons

Source term (Traditional Chinese) | English     | Simplified Chinese | Japanese
新力       | Sony        | 索尼   | ソニー
耐吉       | Nike        | 耐克   | ナイキ
史丹佛     | Stanford    | 斯坦福 | スタンフォード
雪梨       | Sydney      | 悉尼   | シドニー
網際網路   | internet    | 互联网 | インターネット
網路       | network     | 网络   | ネットワーク
首頁       | homepage    | 主页   | ホームページ
電腦       | computer    | 计算机 | コンピューター
資料庫     | database    | 数据库 | データベース
資訊       | information | 信息   | インフォメーション
Search-Result Mining
• Goal: improve translation coverage for diverse queries
• Idea
  – Chi-square test: co-occurrence relation
  – Context-vector analysis: context information
• Chi-square similarity measure, based on a 2-way contingency table over Web pages (a: pages containing both s and t; b: s but not t; c: t but not s; d: neither):
  $S_{\chi^2}(s, t) = \frac{N \times (ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$
• Context-vector similarity measure (cosine similarity of weighted context vectors):
  $S_{CV}(s, t) = \frac{\sum_{i=1}^{m} w_{s_i} w_{t_i}}{\sqrt{\sum_{i=1}^{m} w_{s_i}^2}\,\sqrt{\sum_{i=1}^{m} w_{t_i}^2}}$
• Weighting scheme (TF*IDF), with a sketch of both measures below:
  $w_{t_i} = \frac{f(t_i, d)}{\max_j f(t_j, d)} \times \log\left(\frac{N}{n}\right)$,
  where $f(t_i, d)$ is the frequency of $t_i$ in search-result page $d$, $N$ is the total number of Web pages, and $n$ is the number of pages containing $t_i$.
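A small sketch of both similarity measures, assuming the contingency counts and the per-page term-frequency dictionaries have already been collected from search-result pages.

import math

def chi_square_similarity(a, b, c, d):
    # a: pages containing both s and t, b: s only, c: t only, d: neither
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def tfidf_vector(term_freqs, doc_freqs, total_pages):
    # TF normalized by the page's max frequency, IDF = log(N / n)
    max_tf = max(term_freqs.values(), default=1)
    return {t: (f / max_tf) * math.log(total_pages / doc_freqs.get(t, 1))
            for t, f in term_freqs.items()}

def context_vector_similarity(ws, wt):
    # cosine similarity between two weighted context vectors
    dot = sum(ws[x] * wt[x] for x in set(ws) & set(wt))
    norm = math.sqrt(sum(w * w for w in ws.values())) \
         * math.sqrt(sum(w * w for w in wt.values()))
    return dot / norm if norm else 0.0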
Workshop on Web Mining Technology and Applications (Dec. 13, 2006)
Panel
Web Mining: Recent Development and Trends
Prof. Vincent S. Tseng (曾新穆)
Department of Computer Science and Information Engineering, National Cheng Kung University
Main Categories of Web Mining
• Web content mining
• Web usage mining
• Web structure mining
Web Content Mining
• Trends
  – Deep web mining
  – Semantic web mining
  – Vertical search
  – Web multimedia content mining
    • Web image/video search
    • Web image/video annotation/classification/clustering
    • Web multimedia content filtering (example: YouTube)
  – Integration with web log mining
Web Usage Mining
• Developed techniques
  – Mining of frequent usage patterns
    • Association rules, sequential patterns, traversal patterns, etc.
• Trends
  – Personalization
  – Recommendation (e.g. Web ads)
  – Incorporation of content semantics/ontology
  – Consideration of temporality
  – Extension to mobile web applications
  – Multidiscipline integration
Problems: Under-utilization of Clickstream Data
• Shop.org: U.S.-based visits to retail Web sites exceeded 10% of total Internet traffic for the first time ever on Thanksgiving, 2004
• Top retail sites: eBay, Amazon.com, Dell.com, Walmart.com, BestBuy.com, and Target.com
• Aberdeen Group: 70% of site companies use clickstream data only for basic website management!
Challenges for Clickstream Data Mining
- Arun Sen et al., Communications of the ACM, Nov. 2006
• Problems with data
– Data incompleteness
– Very large data size
– Messiness in the data
– Integration problems with Enterprise Data
• Too Many Analytical Methodologies
– Web Metric-based Methodologies
– Basic Marketing Metric-based Methodologies
– Navigation-based Methodologies
– Traffic-based Methodologies
• Data Analysis Problems
– Across-dimension analysis problems
– Timeliness of data mining under very large data size
– Determination of useful/actionable analysis under thousands of
metrics
Web Information Extraction: The Issues for Unsupervised Approaches
Dr. Chia-Hui Chang (張嘉惠)
Department of Computer Science and Information Engineering, National Central University, Taiwan
(Talk given at the 2006 Workshop on Web Mining Technology and Trends)
Outline
• Web Information Extraction
  – The key to web information integration
• Three dimensions
  – Task definition
  – Automation degree
  – Technology
• Focus on the template-page IE task
  – Issues for record-level IE
  – Techniques for solving these issues
Introduction
• The coverage of Web information is very wide and diverse
  – The Web has changed the way we obtain information.
  – Information search on the Web is not enough anymore.
  – The need for Web information integration, both for businesses and individuals, is stronger than ever.
  – Understanding Web pages and discovering valuable information from them is called Web content mining.
  – Information extraction is one of the keys to web content mining.
Web Information Integration
• From information search, to information extraction, to information mapping
  1. Focused crawling / Web page gathering
     • Information search
  2. Information (data) extraction
     • Discovering structured information from the input
  3. Schema matching
     • With a unified interface / single ontology
Three Dimensions to See IE
• Task Definition
– Input (Unstructured free texts, semi-structured Web
pages)
– Output Targets (record-level, page-level, site-level)
• Automation Degree
– Programmer-involved, annotation-based or
annotation-free approaches
• Techniques
– Learning algorithm: specific/general to
general/specific
– Rule type: regular expression rules vs. logic rules
– Deterministic finite-state transducers vs. probabilistic hidden Markov models
IE from Nearly-Structured Documents
• (Screenshots: a Google search-result page as a multiple-record Web page; Amazon.com book pages as single-record pages.)
IE from Semi-Structured Documents
• (Screenshot: ungrammatical snippets from a publication list of selected articles.)
Information Extraction From Free Texts
Filling slots in a database from sub-segments of text (named entity extraction).

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying...

IE output:
NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Software Foundation

[Excerpted from Cohen & McCallum's talk.]
Information Extraction From Free Texts
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering

(Applying these steps to the same news excerpt as above yields the associated tuples:)
Microsoft Corporation    | CEO     | Bill Gates
Microsoft                |         | Gates
Microsoft                | VP      | Bill Veghte
Free Software Foundation | founder | Richard Stallman

[Excerpted from Cohen & McCallum's talk.]
Dimension 1: Task Definition - Input
Dimension 1: Task Definition - Output
• Attribute level (single-slot)
– Named entity extraction, concept annotation
• Record level
– Relation between slots
• Page level
– All data embedded in a dynamic page
• Site level
– All information about a web site
Template Page Generation & Extraction
• Generation/Encoding: a CGI program fills a template T with data x from a database to produce the output pages (T, x).
• Extraction/Decoding: reverse engineering that recovers the data from the output pages.
Dimension 2: Automation Degree
• Programming-based
– For programmers
• Supervised learning
– A bunch of labeled examples
• Semi-supervised learning/Active learning
– Interactive wrapper induction
• Unsupervised learning
– Mostly for template pages only
Tasks vs. Automation Degree
• High Automation Degree (Unsupervised)
– Template page IE
• Semi-Automatic / Interactive
– Semi-structured document IE
• Low Automation Degree (Supervised)
– Free text IE
Dimension 3: Technologies
• Learning Technology
– Supervised: rule generalization, hypothesis testing,
statistical modeling
– Unsupervised learning: pattern mining, clustering
• Features used
– Plain text information: tokens, token class, etc.
– HTML information: DOM tree path, sibling, etc.
– Visual information: font, style, position, etc.
• Rule Types (Expressiveness of the rules)
– Regular expressions, first-order logic rules, hidden Markov models
Issues for Unsupervised Approaches
• For record-level extraction
  1. Data-rich section discovery
  2. Record boundary (separator) mining
  3. Schema detection & data annotation
• For page-level extraction
  – Schema detection: differentiate template tokens from data tokens
(Figure: a search-result page annotated with its data-rich section, record boundaries, and attributes.)
Some Related Works on Unsupervised Approaches
• Record-level
  – IEPAD [Chang and Liu, WWW 2001]
  – DeLa [Wang and Lochovsky, WWW 2003]
  – DEPTA [Zhai and Liu, WWW 2005]
  – ViPER [Simon and Lausen, CIKM 2005]
  – ViNT [Zhao et al., WWW 2005]
• Page-level
  – RoadRunner [Crescenzi et al., VLDB 2001]
  – EXALG [Arasu and Garcia-Molina, SIGMOD 2003]
  – MSE [Zhao et al., VLDB 2006]
Issue 1: Data-Rich Section Discovery
• Comparing a normal page with a no-result page: ViNT [Zhao et al., WWW 2005]
• Comparing two normal pages: MSE [Zhao et al., VLDB 2006]
  – Locate static text lines, e.g. "Books", "Related Searches", "Narrow or Expand Results", "Showing Results", ... (a sketch follows below)
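A rough sketch of the "compare two normal pages" idea: lines that repeat verbatim across result pages are treated as static template text, and the longest run of non-static lines is taken as the data-rich section. The line-based page representation is a simplifying assumption.

def data_rich_section(page_a_lines, page_b_lines):
    """Return the longest run of non-static lines in page_a.
    A line is treated as 'static' if it also appears verbatim in page_b
    (e.g. "Related Searches", "Narrow or Expand Results")."""
    static = set(page_b_lines)
    best, current = [], []
    for line in page_a_lines:
        if line in static:
            if len(current) > len(best):
                best = current
            current = []
        else:
            current.append(line)
    return best if len(best) >= len(current) else current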
Issue 1: Data-Rich Section Discovery (Cont.)
• Similarity between two adjacent leaf nodes
• 1-dimensional clustering
• Pitch estimation, HL(R) [Papadakis et al., SAINT 2005]
Issue 2: Record Boundary Mining
• String pattern mining: IEPAD [Chang and Liu, WWW 2001], DeLa [Wang and Lochovsky, WWW 2003]
  <P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>  ->  <P><A>T</A>T</P> <P><A>T</A>T</P>
• Tree pattern mining: DEPTA [Zhai and Liu, WWW 2005]
  <html><body><b>T</b><ol>
  <li><b>T</b>T<b>T</b>T</li>
  <li><b>T</b>T<b>T</b></li>
  </ol></body></html>
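A toy sketch in the spirit of string pattern mining: the page is encoded as a tag-token string (text abstracted to "T", as in the example above) and the tag subsequences that repeat are proposed as candidate record separators. The token granularity and the scoring are simplifications, not the IEPAD algorithm itself.

import re
from collections import Counter

def candidate_record_patterns(html, max_len=6):
    """Find repeated tag-token patterns as candidate record boundaries.
    Text content is abstracted to a single token 'T'."""
    tokens = ['T' if not t.startswith('<') else t
              for t in re.findall(r'<[^>]+>|[^<]+', html)]
    counts = Counter()
    for length in range(2, max_len + 1):
        for i in range(len(tokens) - length + 1):
            counts[tuple(tokens[i:i + length])] += 1
    # keep patterns that actually repeat; most frequent and longest first
    repeated = [(p, c) for p, c in counts.items() if c > 1]
    return sorted(repeated, key=lambda pc: (pc[1], len(pc[0])), reverse=True)

html = "<P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>"
print(candidate_record_patterns(html)[:3])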
Issue 2: Record Boundary Mining (Cont.)
• Finding repeated separators from visually encoded content lines: ViNT [Zhao et al., WWW 2005], ViPER [Simon and Lausen, CIKM 2005]
• Heuristics (visual cues)
  – A line following an HR-LINE
  – A unique line in a block that starts with a number
  – The line in a block with the smallest position code (only one)
  – The line following a BLANK line is the first line
Issue 3: Data Schema Detection
• Alignment of the multiple records found (see the alignment sketch below)
  – Handling missing attributes and multiple-value attributes
  – String alignment or tree alignment
  – Examining two records at a time
• Differentiate template tokens from data tokens with some assumptions
  – Tag tokens are considered part of the template
  – Text lines are usually part of the data, except for static text lines
• Similar to the problem of page-level IE tasks
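A minimal sketch of pairwise record alignment using Python's difflib; real systems use edit-distance based string alignment or partial tree alignment, so this is only illustrative of how gaps mark missing or multi-valued attributes.

from difflib import SequenceMatcher

def align_records(rec_a, rec_b, gap="-"):
    """Align two token sequences (records); unmatched positions get a gap,
    which typically corresponds to a missing or multi-valued attribute."""
    aligned_a, aligned_b = [], []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, rec_a, rec_b).get_opcodes():
        a_part, b_part = rec_a[i1:i2], rec_b[j1:j2]
        width = max(len(a_part), len(b_part))
        aligned_a += a_part + [gap] * (width - len(a_part))
        aligned_b += b_part + [gap] * (width - len(b_part))
    return aligned_a, aligned_b

a = ["<li>", "<b>", "T", "</b>", "T", "<b>", "T", "</b>", "T", "</li>"]
b = ["<li>", "<b>", "T", "</b>", "T", "<b>", "T", "</b>", "</li>"]
print(align_records(a, b))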
Page-level IE: EXALG [Arasu and Garcia-Molina, SIGMOD 2003]
• Identifying static markers (tag and word tokens) from multiple pages
  – An occurrence vector for each token
  – Critical point: tags are not easy to differentiate, compared with the text lines used in [Zhao et al., VLDB 2006]
• Differentiating token roles
  – By DOM tree path
  – By position in the EC
• Equivalence class (EC)
  – Group tokens with the same occurrence vector
• LFECs (large and frequent ECs) form the template
  – e.g. <1,1,1,1>: {<html>, <body>, <table>, </table>, </body>, </html>}
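A small sketch of the equivalence-class idea: compute each token's occurrence vector across pages and group tokens that share the same vector; tokens in large, consistent classes are template candidates. The tokenization here is simplified for illustration.

import re
from collections import defaultdict

def equivalence_classes(pages):
    """Group tokens by their occurrence vector (count per page).
    Tokens sharing a vector such as <1,1,1,1> are template candidates."""
    vocab = set()
    page_tokens = []
    for html in pages:
        tokens = re.findall(r'<[^>]+>|\w+', html.lower())
        page_tokens.append(tokens)
        vocab.update(tokens)
    classes = defaultdict(list)
    for tok in vocab:
        vector = tuple(toks.count(tok) for toks in page_tokens)
        classes[vector].append(tok)
    return classes

pages = ["<html><body><b>Price</b> 10 USD</body></html>",
         "<html><body><b>Price</b> 25 USD</body></html>"]
for vec, toks in equivalence_classes(pages).items():
    print(vec, sorted(toks))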
On the use of techniques
• From supervised to unsupervised
approaches
• From string alignment (IEPAD,
RoadRunner) to tree alignment (DEPTA,
Thresher)
• From two page summarization (MSE) to
multiple page summarization (EXALG)
Summary
• Content of this talk
  – Web Information Extraction
  – Three dimensions
  – Focus on the template-page IE task
    • Issues for unsupervised approaches
    • Techniques for solving these issues
• Content not in this talk
  – Probabilistic models for free-text IE tasks
Personal Vision
• From information search to information
integration
• Better UI for information integration
– Information collection: focused crawling
– Information extraction
– Schema matching and integration
• Not only for business but also for individuals
References – Record Level
• C.-H. Chang, S.-C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW 2001.
• B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. SIGKDD 2003.
• Y. Zhai, B. Liu. Web Data Extraction Based on Partial Tree Alignment. WWW 2005.
• K. Simon and G. Lausen. ViPER: Augmenting Automatic Information Extraction with Visual Perceptions. CIKM 2005.
• H. Zhao, W. Meng, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. WWW 2005.
References – Page Level & Survey
• A. Arasu, H. Garcia-Molina. Extracting Structured Data from Web Pages. SIGMOD 2003.
• V. Crescenzi, G. Mecca, P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB 2001.
• H. Zhao, W. Meng, and C. Yu. Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages. VLDB 2006.
• A. Laender, B. Ribeiro-Neto, A. da Silva, J. Teixeira. A Brief Survey of Web Data Extraction Tools. ACM SIGMOD Record, 2002.
• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan. A Survey of Web Information Extraction Systems. IEEE TKDE, 2006.
Taxonomic Information Integration: Challenges and Applications
Cheng-Zen Yang (楊正仁)
Department of Computer Science and Engineering, Yuan Ze University
[email protected]
Outline
• Introduction
• Problem statement
• Integration approaches
– Flattened catalog integration
– Hierarchical catalog integration
• Applications
• Conclusions and future work
Introduction
• As the Internet develops rapidly, the number of on-line Web pages has become very large.
  – Many Web portals offer taxonomic information (catalogs) to facilitate information search [AS2001].
• These catalogs may need to be integrated if Web portals are merged.
  – B2B electronic marketplaces bring together many online suppliers and buyers.
• An integrated Web catalog service can help users
  – gain more relevant and organized information in one catalog, and
  – save much time surfing among different Web catalogs.
B2C e-commerce: Amazon
The taxonomic information integration problem
• Taxonomic information integration is more than a simple classification task.
• When some implicit source information is exploited, the integration accuracy can be greatly improved.
• Past studies have shown that the naïve Bayes classifier, SVMs, and the Maximum Entropy model enhance the accuracy of Web catalog integration in a flattened catalog integration structure.
The problem statement (1/2)
• Flattened catalog integration
  – The source catalog S, containing categories S1, S2, ..., Sm, is to be integrated into the destination catalog D, consisting of categories D1, D2, ..., Dn.
(Diagram: each source category Si, with its documents Si1, Si2, ..., Sik, is integrated into a destination category Di alongside the existing documents Di1, Di2, ..., Dik.)
The problem statement (2/2)
• Hierarchical catalog integration
(Diagram: catalog S contains categories S1 {URLs f, g} and S2 {URLs h, i}; catalog D contains categories D1 {URLs a, b}, D2 {URLs b, c}, and D3 {URLs d, e}. The source categories must be integrated into the hierarchically organized destination catalog.)
Integration Approaches for Flattened Catalogs

The enhanced naïve Bayes approach
• The pioneering work [AS2001]
  – It exploits the implicit source information to improve the integration accuracy.
  – Naïve Bayes approach (d: test document in the source catalog; Ci: category in the destination catalog; S: category in the source catalog):
    $\Pr(C_i \mid d) = \frac{\Pr(C_i)\,\Pr(d \mid C_i)}{\Pr(d)}$
  – The enhanced naïve Bayes approach:
    $\Pr(C_i \mid d, S) = \frac{\Pr(C_i \mid S)\,\Pr(d \mid C_i)}{\Pr(d \mid S)}$
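A hedged sketch of the enhanced naïve Bayes decision rule. The way Pr(Ci|S) is estimated here, from smoothed counts of how a base classifier labels the documents of the same source category, follows the spirit of [AS2001], but the exact weighting and smoothing are illustrative assumptions.

import math

def enhanced_nb_classify(doc_tokens, categories, source_votes):
    """Pick argmax_i Pr(Ci|S) * Pr(d|Ci), the enhanced NB decision rule.

    categories:   name -> class-conditional word probabilities (already smoothed)
    source_votes: name -> number of documents from the same source category S
                  that a base classifier assigned to Ci (used to estimate
                  Pr(Ci|S)); how these counts are obtained is outside this sketch
    """
    total = sum(source_votes.values())
    best, best_score = None, float("-inf")
    for name, word_probs in categories.items():
        # Laplace-smoothed estimate of Pr(Ci | S)
        p_ci_given_s = (source_votes.get(name, 0) + 1) / (total + len(categories))
        log_likelihood = sum(math.log(word_probs.get(w, 1e-6)) for w in doc_tokens)
        score = math.log(p_ci_given_s) + log_likelihood
        if score > best_score:
            best, best_score = name, score
    return best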
Probabilistic enhancement and topic restriction
• NB and SVM [TCCL2003]
• Probabilistic enhancement (x: test document in the source catalog; vt: label of a class in the destination catalog; s: the class label of x in the source catalog):
  $v_{PE}(x) = \arg\max_{v_t \in H_2} \frac{\Pr(v_t \mid x)\,\Pr(v_t \mid s)}{\Pr(v_t)}$
  where $H_2$ denotes the set of candidate destination-catalog classes.
• Topic restriction
  (Diagram: catalog S contains S1 {URLs f, g} and S2 {URLs h, i}; catalog D contains D1 {URLs a, f}, D2 {URLs b, f}, and D3 {URLs d, e}. The candidate destination categories for a source category are restricted to related ones, e.g. D1 and D2, which share URL f with S1.)
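A small sketch of the probabilistic-enhancement re-ranking; the base classifier probabilities, the estimate of Pr(vt|s), and the optional topic-restriction candidate set are assumed to be given.

def probabilistic_enhancement(p_vt_given_x, p_vt_given_s, p_vt, candidates=None):
    """argmax over destination labels vt of Pr(vt|x) * Pr(vt|s) / Pr(vt).

    p_vt_given_x: dict vt -> base classifier probability for the test document x
    p_vt_given_s: dict vt -> probability of vt given the source category s
    p_vt:         dict vt -> prior of vt in the destination catalog
    candidates:   optional topic restriction, i.e. the subset of destination
                  labels considered for this source category (assumption)
    """
    labels = candidates if candidates is not None else p_vt_given_x.keys()
    return max(labels,
               key=lambda vt: p_vt_given_x.get(vt, 0.0)
                              * p_vt_given_s.get(vt, 0.0)
                              / max(p_vt.get(vt, 1e-9), 1e-9))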
The pseudo relevance
feedback approach
• Iterative-Adapting SVM [CHY2005]
An Application Example: Searching for Multilingual News Articles
• Many Web portals provide monolingual news integration services.
• Unfortunately, users cannot effectively find the related news in other languages.
The basic idea
• Web portals have grouped related news articles.
• These articles should be about the same main story.
• Can we discover these mappings?
Techniques in our current work
• Machine translation
• Taxonomy integration
• Mapping finding
Taxonomy integration
• The cross-training process [SCG2003]
  – To make better inferences about label assignments in another taxonomy
(Diagram: English news features train a 1st SVM and Chinese news features train a 2nd SVM; semantically overlapped features connect the two classifiers, producing English-Chinese news category mappings.)
Mapping decision
• The SVM-BCT classifiers calculate the positively mapped ratios as the mapping score (MSi) to predict the semantic overlap [YCC2006].
• The mapping score MSi is computed for each candidate mapping Si -> Dj.
• The mappings can then be ranked according to their scores.
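A hedged sketch of the mapping-score idea: the score of Si -> Dj is the fraction of Si's documents that a destination-side classifier accepts as positive for Dj. The classifier interface is an assumption of this sketch.

def mapping_scores(source_categories, dest_labels, classify_positive):
    """Score each (Si, Dj) pair by the positively mapped ratio.

    source_categories: dict Si -> list of documents
    dest_labels:       list of destination categories Dj
    classify_positive(doc, dj) -> bool, a destination-side classifier (assumption)
    """
    scores = {}
    for si, docs in source_categories.items():
        for dj in dest_labels:
            positives = sum(1 for doc in docs if classify_positive(doc, dj))
            scores[(si, dj)] = positives / len(docs) if docs else 0.0
    # rank candidate mappings by score, best first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)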
Performance evaluation
• NLP resources
– Standard Segmentation Corpus from ACLCLP
• 42023 segmented words
– Bilingual wordlists (version 2.0) from
Linguistic Data Consortium (LDC)
• Chinese-to-English version 2 (ldc2ce) with about
120K records
• English-to-Chinese (ldc2ec) with 110K records
Experimental datasets
• Properties
  – News reports in the international news category of the Google News Taiwan and U.S. versions
  – May 10, 2005 - May 23, 2005
  – 20 news event categories per day
  – Chinese-to-English: 46.9 MB
  – English-to-Chinese: 80.2 MB
  – 29,182 news stories
Conclusions and Future Work
Conclusions
• Taxonomic information integration is an emerging issue for Web information mining.
• New approaches for flattened catalog integration and hierarchical catalog integration are still needed.
• Our approaches are a first step toward taxonomic information integration.
Future work
• Taxonomy alignment
  – Heterogeneous catalog integration (Jung 2006)
• Incorporation of more conceptual information
  – WordNet, Sinica BOW, etc.
• Evaluation of other classifiers
  – EM, ME, etc.
References
• [AS2001] Agrawal, R., Srikant, R.: On Integrating Catalogs. Proc. the 10th WWW Conf. (WWW10), (May 2001) 603–612
• [BOYAPATI2002] Boyapati, V.: Improving Hierarchical Text Classification Using Unlabeled Data. Proc. the 25th Annual ACM Conf. on Research and Development in Information Retrieval (SIGIR'02), (Aug. 2002) 363–364
• [CHY2005] Chen, I.-X., Ho, J.-C., Yang, C.-Z.: An Iterative Approach for Web Catalog Integration with Support Vector Machines. Proc. of Asia Information Retrieval Symposium 2005 (AIRS2005), (Oct. 2005) 703–708
• [DC2000] Dumais, S., Chen, H.: Hierarchical Classification of Web Content. Proc. the 23rd Annual ACM Conf. on Research and Development in Information Retrieval (SIGIR'00), (Jul. 2000) 256–263
• [HCY2006] Ho, J.-C., Chen, I.-X., Yang, C.-Z.: Learning to Integrate Web Catalogs with Conceptual Relationships in Hierarchical Thesaurus. Proc. the 3rd Asia Information Retrieval Symposium (AIRS 2006), (Oct. 2006) 217–229
• [JOACHIMS1998] Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proc. the 10th European Conf. on Machine Learning (ECML'98), (1998) 137–142
• [JUNG2006] Jung, J. J.: Taxonomy Alignment for Interoperability Between Heterogeneous Digital Libraries. Proc. the 9th Int'l Conf. on Asian Digital Library (ICADL 2006), (Nov. 2006) 274–282
• [KELLER1997] Keller, A. M.: Smart Catalogs and Virtual Catalogs. In Ravi Kalakota and Andrew Whinston, editors, Readings in Electronic Commerce. Addison-Wesley (1997)
• [KKL2002] Kim, D., Kim, J., Lee, S.: Catalog Integration for Electronic Commerce through Category-Hierarchy Merging Technique. Proc. the 12th Int'l Workshop on Research Issues in Data Engineering: Engineering e-Commerce/e-Business Systems (RIDE'02), (Feb. 2002) 28–33
• [MLW2003] Marron, P. J., Lausen, G., Weber, M.: Catalog Integration Made Easy. Proc. the 19th Int'l Conf. on Data Engineering (ICDE'03), (Mar. 2003) 677–679
• [RR2001] Rennie, J. D. M., Rifkin, R.: Improving Multiclass Text Classification with the Support Vector Machine. Tech. Report AI Memo AIM-2001-026 and CCL Memo 210, MIT (Oct. 2001)
• [SCG2003] Sarawagi, S., Chakrabarti, S., Godbole, S.: Cross-Training: Learning Probabilistic Mappings between Topics. Proc. the 9th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, (Aug. 2003) 177–186
• [SH2001] Stonebraker, M., Hellerstein, J. M.: Content Integration for e-Commerce. Proc. of the 2001 ACM SIGMOD Int'l Conf. on Management of Data, (May 2001) 552–560
• [SLN2003] Sun, A., Lim, E.-P., Ng, W.-K.: Performance Measurement Framework for Hierarchical Text Classification. Journal of the American Society for Information Science and Technology (JASIST), Vol. 54, No. 11, (June 2003) 1014–1028
• [TCCL2003] Tsay, J.-J., Chen, H.-Y., Chang, C.-F., Lin, C.-H.: Enhancing Techniques for Efficient Topic Hierarchy Integration. Proc. the 3rd Int'l Conf. on Data Mining (ICDM'03), (Nov. 2003) 657–660
• [WTH2005] Wu, C.-W., Tsai, T.-H., Hsu, W.-L.: Learning to Integrate Web Taxonomies with Fine-Grained Relations: A Case Study Using Maximum Entropy Model. Proc. of Asia Information Retrieval Symposium 2005 (AIRS2005), (Oct. 2005) 190–205
• [YCC2006] Yang, C.-Z., Chen, C.-M., Chen, I.-X.: A Cross-Lingual Framework for Web News Taxonomy Integration. Proc. the 3rd Asia Information Retrieval Symposium (AIRS 2006), (Oct. 2006) 270–283
• [YL1999] Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. Proc. the 22nd Annual ACM Conference on Research and Development in Information Retrieval, (Aug. 1999) 42–49
• [ZADROZNY2002] Zadrozny, B.: Reducing Multiclass to Binary by Coupling Probability Estimates. In: Dietterich, T. G., Becker, S., Ghahramani, Z. (eds.): Advances in Neural Information Processing Systems 14 (NIPS 2001). MIT Press (2002)
• [ZL2004WWW] Zhang, D., Lee, W. S.: Web Taxonomy Integration using Support Vector Machines. Proc. WWW2004, (May 2004) 472–481
• [ZL2004SIGIR] Zhang, D., Lee, W. S.: Web Taxonomy Integration through Co-Bootstrapping. Proc. SIGIR'04, (July 2004) 410–417
Mining in the Middle: From Search to Integration on the Web
Kevin C. Chang
Joint with the UIUC and Cazoodle teams
To Begin With: What is "the Web"?
Or: how do search engines view the Web?
• Version 0.1: "The Web is a SET of PAGES."
• Version 1.1: "The Web is a GRAPH of PAGES."
• But... what have you been searching lately?
  – Structured data: prevalent but ignored!
• Version 2.1 (our view): the Web is "Distributed Bases" of "Data Entities".
• Challenges on the Web come in a "dual" form: getting access to the structured information!
• Kevin's 4-quadrants: Access and Structure, over the Deep Web and the Surface Web.
We are inspired: from search to integration, mining in the middle!
(Diagram: the Access and Structure challenges span the Deep Web and the Surface Web; mining sits between search and integration.)
Challenge of the Deep Web
Access: how to get there?
MetaQuerier: Holistic Integration over the Deep Web
• The previous Web: search used to be "crawl and index".
• The current Web: search must eventually resort to integration.
MetaQuerier: Exploring and Integrating the Deep Web
(Diagram: over deep-web sources such as Cars.com, Amazon.com, Apartments.com, and 411localte.com.)
• Explorer (FIND sources, building a "db of dbs"): source discovery, source modeling, source indexing
• Integrator (QUERY sources through a unified query interface): source selection, schema integration, query mediation
The challenge – How to deal with “deep”
semantics across a large scale?
“Semantics” is the key in integration!
• How to understand a query interface?
– Where is the first condition? What’s its
attribute?
• How to match query interfaces?
– What does “author” on this source match on
that?
• How to translate queries?
– How to ask this query on that source?
Survey the frontier before going to the battle.
• Challenge reassured:
  – 450,000 online databases
  – 1,258,000 query interfaces
  – 307,000 deep web sites
  – A 3-7 times increase in 4 years
• Insight revealed:
  – Web sources are not arbitrarily complex
  – "Amazon effect": convergence and regularity naturally emerge
"Amazon effect" in action...
• Attributes converge in a domain!
• Condition patterns converge even across domains!
Search moves on to integration.
Don’t believe me? See what Google
has to say…
DB People: Buckle Up!
To embrace the burgeoning of structured
data on the Web.
Challenge of the Surface Web
Structure: what to look for?
WISDM: Holistic Search over the Surface Web
• Despite all the glorious search engines... are we searching for what we want?
•
•
•
•
•
•
•
•
What is the email of Marc Snir?
What is Marc Snir’s research area?
Who are Marc Snir’s coauthors?
What are the phones of CS database faculty?
How much is “Canon PowerShot A400”?
Where is SIGMOD 2006 to be held?
When is the due date of SIGMOD 2006?
Find PDF files of “SIGMOD 2006”?
NO!
Regardless of what you want,
you are searching for pages…
Your creativity is amazing: a few examples
• WSQ/DSQ at Stanford
  – Uses page counts to rank term associations
• QXtract at Columbia
  – Generates keywords to retrieve documents useful for extraction
• KnowItAll at Washington
  – Both ideas in one framework
• And there must be many I don't know yet...
Time to distill these into a better "mining" engine?
What is an "entity"?
Your target of information, or anything:
• Phone number
• Email address
• PDF
• Image
• Person name
• Book title, author, ...
• Price (of something)
We take an entity view of the Web: how different is "entity search"?
How to define such searches? Let's motivate by contrasting page retrieval with entity search, considering the entire process.
Page Retrieval
1. Input: pages.
2. Criteria: content keywords.
3. Scope: each page itself.
4. Output: one page per result (e.g. pages mentioning "Marc Snir").
Entity search is thus different:
Entity Search
1. Input: probabilistic entities.
2. Criteria: contextual patterns.
3. Scope: holistic aggregates.
4. Output: associative results.
What are the technical challenges?
Or: how to write (reviewer-friendly) papers?
More issues...
• Tagging/merging of basic entities?
  – Application-driven tagging
  – The Web's redundancy will alleviate the accuracy demand.
• Powerful pattern language
  – Linguistic; visual
• Advanced statistical analysis
  – Correlation; sampling
• Scalable query processing
  – Do the new components scale?
Promises of the Concepts
• From page-at-a-time to entity-tuple-at-a-time
  – Getting directly to the target information and evidence
• From IR to a mining engine
  – Not only page retrieval but also construction
• From offline to online Web mining and integration
  – Enabling large-scale ad-hoc mining over the Web
• From the Web to controlled corpora
  – Enhancing not only efficiency but also effectiveness
• From passive to active application-driven indexing
  – Enabling mining applications
Conclusion: Mining in just the middle!
• Dual challenges:
  – Getting access to the deep Web.
  – Getting structure from the surface Web.
• Central techniques:
  – Holistic mining for both search and integration.
(Slide motif: Search, Mining, Integration.)
What will such a Mining Engine be? You tell me!
Students' imagination knows no bounds.