slides - Persone

Download Report

Transcript slides - Persone

Next generation
search engines
Paolo Ferragina
Dipartimento di Informatica, Pisa
Our journey today!
Web
search engines
XML
search engines
Basic Research
on data compression,
indexing and mining
Much interest...

More than 85% users arrive to a site from a SE

Web Searches: 45% Google, 29% Yahoo, 13% MSN, 5% ASK,...

Toolbar searches: 49.6% Google, 46.1% Yahoo,...

SE impact onto: Web structure, knowledge and
understanding, social behavior.... and

marketing !!
33% users believe that “the results of a query are the best
place where to buy things” !!

Ads (4B$ in USA, 2B€ in Europe, 180M€ in Italy)


Paid search: 65% Google, 25% Yahoo, 8% MSN,...
Portal search: 15% Yahoo, 10% MSN, 7% AOL-Google,...
Goal of a Search Engine
Retrieve the docs that are “relevant” for the
user query

Doc: file word or pdf, web page, email, blog, e-book,...

Query: paradigm “bag of words”
 Relevant
?!?
...We face many difficulties, especially on the Web!!!
Web is huge: 8 bil pages
[Google]
We need to “rank” the results !!
Web is heterogeneous
Languages/Encodings

Hundreds of languages: 55 (Jul01)

Home pages:


In 1997: English 82%, the next 15 take 13%
In 2001: English 53%, the next 9 take 30%
Distributed authorship


Millions of people creating pages with their own style…
Not all have the purest motives in providing high-quality
information - commercial motives drive “spamming”.
Extracting “significant data” is difficult !!
Web is highly dynamic
[154 sites, 2004]
Normalized
wrt first week
A “good” coverage of the
indexed Web is difficult !!
User Queries are “difficult”

Query composition:

Short



2001: 2.54 terms avg
80% less than 3 terms

Imprecise terms

78% of the queries are not modified
Query results:

Users are lazy: 85% look at just one page of results
User Needs are “variegate”

Informational – want to learn about something (~40%)
Asthma

Navigational – want to go to a page (~25%)
Alitalia

Transactional – want to do something (~35%)

Access a service

Downloads

Shop
NY weather
Mars surface images
Nikon CoolPix
Evolution of Search Engines

First generation -- use only on-page, web-text data


1995-1997 AltaVista,
Excite, Lycos, etc
Second generation -- use off-page, web-graph data



Word frequency and language
Link (or connectivity) analysis
Anchor-text (How people refer to a page)
1998: Google,
now everyone
Third generation -- answer “the need behind the query”




Focus on “user need”, rather than on query
Integrate multiple data-sources
Click-through data
Query mining
Fourth generation  Information Supply
[Andrei Broder, VP emerging search tech, Yahoo! Research]
No winner yet !!
Various players:
Google, Yahoo, Msn,
Ask,…
Yesterday.....
...Today
Yesterday...
...Today
All these tools
are built upon a
Search Engine
Structure of (Web) Search Engines
Page archive
Crawler
Query
Page
Analizer
Indexer
Query
resolver
Control
Indexing
data structures
Ranker
Size of search engines
[2005]
Google vs Yahoo:
20-30% sharing of results
Ranking: Google vs Yahoo!
Ranking: Google.com - Google.cn
Not only Web Searches...
 Clustering

engines
Vivisimo, Snaket,...
 Suggestions
 Products
 Local
searches
 News,
Blogs, ....
Web search and mining
We +
Yahoo! World

Search





Yahoo!
Yahoo!
Yahoo!
Yahoo!
Yahoo!
Image,
Video,
Local,
News,
Shopping Search,

Mobile:



Commerce:


Communication









Yahoo! Mail,
Yahoo! Messenger,
My Web,
Yahoo! Personals,
Yahoo! 360º,
Yahoo! Photos,
Flickr, delicious, ...
Yahoo! Answers











Yahoo! Sports,
Yahoo! Finance,
Yahoo! Music,
Yahoo! Movies,
Yahoo! News,
Yahoo! Games.
My Yahoo!






Yahoo!
Yahoo!
Yahoo!
Yahoo!
Shopping,
Autos,
Auctions,
Travel,
Small Business:

Content:
Yahoo! Mobile
Yahoo! Go
Yahoo! Small Business
Yahoo! Domains,
Yahoo! Web Hosting,
Yahoo! Merchant Solutions,
Yahoo! Business Email,
HotJobs
Advertising:


Yahoo! Search Marketing
Yahoo! Publisher Network.
[source: R. Baeza-Yates]
Yahoo! numbers
[April, ‘06]

15 languages, 20 countries, 6B users

Each day:




1 million new accounts
3.4 billion page views
10 Tb of data processed (total, 20Pb)
2 billion Mail+Messenger sent
[source: R. Baeza-Yates]
Yahoo! Research Barcelona

Starting date: May 2006, Barcelona
Director: Ricardo Baeza-Yates

Areas: Web Mining and Web Search

People: more than 10 and… fast growing !!

Why me ?


First academic grant in Europe

Three years project on “Data compression and
indexing on hierarchical memories”
[source: R. Baeza-Yates]
Data to be mined or searched

Crawled data





(large, heterogeneous, …)
Web Pages & Links
Blogs
Items for sale: Shopping, Travel, etc.
RSS Feeds
Produced data


(high quality, sparse,…)
Yahoo’s Web: YCars, YHealth, Ytravel,…
Edited news, purchased news,…
Direct interaction

Social links

Tagged content
(quality??)
[source: R. Baeza-Yates]
What is Flickr ?
The wisdom of the crowd can
be used to improve the
search and extraction process
Observed data

Query Logs


spelling, synonyms, phrases (named entities), substitutions,…
Clicks

relevance, intent, …
“There is a new type of economics that has emerged and that the
world doesn't understand,”
“Web usage data is an amazing leading indicator because it tells you
where intent is heading”
U. Fayyad, Yahoo Chief Data Officer
Our future goals…
 Deploy user actions, e.g. queries + clicks + …
 Implicit semantic information

It's free and unbiased

Large volume
… the Semantic Web

Hypothesis - Explicit Semantic Information

Obstacle - Us
Possible uses:
• Query suggestion
• Query disambiguation
• Adv suggestions
• Web-site design
...
XML search and mining
We +
An XML excerpt
<dblp>
<book>
<author> Donald E. Knuth </author>
<title> The TeXbook </title>
<publisher> Addison-Wesley </publisher>
<year> 1986 </year>
</book>
<article>
<author> Donald E. Knuth </author>
<author> Ronald W. Moore </author>
<title> An Analysis of Alpha-Beta Pruning </title>
<pages> 293-326 </pages>
<year> 1975 </year>
<volume> 6 </volume>
<journal> Artificial Intelligence </journal>
</article>
...
</dblp>
The literature on XML indexing...



XML document types

data centric

text centric
[relational data: DB exports]
[literary texts, reports, emails, news, …]
Various tools are available

TreSy
[Cribecu, 1997]

eXist
[TU Darmstadt, 2002]

GalaTex [AT&T, 2004]
Some of their limitations

Run on a single machine

Use a lot of computational resources (time, space,…)

Limit the indexable XML document structure
Our proposal:
Query interface
• XML based
Tauro
Application Level
Query solver
• analysis + optimization
Result retriever
• indexing data structure
Data Collection manager
• data compression
• snippet extraction
The first scenario: Client-Server
Context of use : Biblio search,...
The second scenario: Peer-to-Peer
Context of use: Collaborative search
Our goal...
Exploit the power of the crowd

The largest library of XML tagged text collections
…and the power of search engines

A suite of search + text mining tools

Syntactic text comparison

Motifs extraction for text pattern identification

Concept identification via LSI
You find already loaded
rare texts
in editions and translations
coming from ‘400 and ‘500
My documents
5
You may compose
sophisticated queries
you can
visually compose
sophisticated
structural queries
Stay in touch...
Everything on the finger tips of humanists
 Nokia 770, Origami (Microsoft ), SmartPhones, …
http://signum.sns.it
Basic research

Recurrent themes of this talk

Large volume of data

Efficient search

Hierarchical memory systems: L1-L2 caches, RAM,
(Multi-) Disks, (Web) Network, …

Basic algorithmic tools
 Indexing data structures
 Data compression
Do we face a paradoxical situation ?
Six years ago...
[now, J. ACM 05]
Opportunistic Data Structures with Applications
P. Ferragina, G. Manzini
Survey by Navarro-Makinen cites more
than 50 papers on the subject !!
[December 2003]
[January 2005]
Joint effort with Navarro’s
group at Univ. Chile
Some figures over hundreds of MBs of data:
• Count(P) takes few millisecs
• Locate(P) takes few millisecs for each occurrence of P
• Space is about
[bzip ~ 20%]
• 22% (support just Count ops)
• 35% (Count, Locate ops)
Compressed index for XML
60%
[Ferragina et al, WWW ’06]
UniPi is
50%
patenting it !!
40%
30%
20%
10%
0%
DBLP
Huffword
Pathways
XPress
XQzip
News
XBzipIndex
XBzip
Query (counting) time  8 ms, Navigation time  3 ms
Next generation search engines
Paolo Ferragina
University of Pisa
Thanks !!
An XML excerpt
<dblp>
<book>
<author> Donald E. Knuth </author>
<title> The TeXbook </title>
<publisher> Addison-Wesley </publisher>
<year> 1986 </year>
It
</book>
<article>
<author> Donald E. Knuth </author>
<author> Ronald W. Moore </author>
<title> An Analysis of Alpha-Beta Pruning </title>
<pages> 293-326 </pages>
<year> 1975 </year>
<volume> 6 </volume>
<journal> Artificial Intelligence </journal>
</article>
...
</dblp>
is verbose !
A tree interpretation...
Subset of XPath [W3C]
 XML document exploration  Tree navigation
 XML document search
 Labeled subpath searches
The Problem
We wish to devise a compressed representation for a labeled tree T that
efficiently supports some operations:



Navigational operations
Subpath and content searches
Visualization operation
 XML-aware compressors (like XMill, XmlPpm, ScmPpm,...)
XML-native search engines
 need the whole decompression
might exploit this tool as a core block for
query optimization
(compressed)
storage
 XML-queriable compressors
(like XPress,and
XGrind,
XQzip,...)
 poor compression and scan of the whole (compressed) file
 Summary indexes (like Dataguide, 1-index or 2-index)
 large space and do not support “content” searches
A transform for “labeled trees”
[Ferragina et al, IEEE Focs ’05]
 We propose the XBW-transform that linearizes a
labeled tree T in 2 arrays such that:
 the compression of T reduces to the compression of these two
arrays (via e.g. gzip, bzip2, ppm,...)
 the indexing of T reduces to implement simple rank/select
query operations over these two arrays
A = a b a a a c b c d a b e c d ...
a , 7 2)°=a#a
Select( a Rank(
, 2 ) = pos
= 3in A[1,7] = 4
The XBW-Transform
C
S
a
B
D
c
A
c
a
b
a
B
D
D
c
b
a
Step 1.
Visit the tree in pre-order.
For each node, write down its label
and the labels on its upward path
Permutation
of tree nodes
Sp
C
B
D
c
a
c
A
b
a
D
c
B
D
b
a
e
C
BC
DB
DB
BC
C
AC
AC
AC
DA
C
BC
DB
BC
C
C
C
C
upward labeled paths
The XBW-Transform
C
S
a
B
D
c
A
c
a
b
a
B
D
D
c
b
Step 2.
Stably sort according to Sp
a
Sp
C
b
a
D
D
c
D
a
B
A
B
c
c
a
b
e
AC
AC
AC
BC
BC
BC
BC
C
C
C
DA
DB
DB
DB
C
C
C
C
upward labeled paths
The XBW-Transform
C
B
D
c
Slast S
A
c
a
b
a
B
D
D
c
b
a
Step 3.
Add a binary array Slast marking the
rows corresponding to last children
1
0
0
1
0
1
0
1
0
0
1
1
0
1
1
a
C
b
a
D
D
c
D
a
B
A
B
c
c
a
b
XBW
Sp
e
AC
AC
AC
BC
BC
BC
B C XBW can be
C built and inverted
C in optimal time
C
DAC
D B XBW
C
takes
DB
C
optimal
space
DBC
An illustrative example
Tags, Attributes and the symbol =
XBW is compressible:
 Sa and Spcdata are locally homogeneous
Pcdata
 Slast has some structure
XBzip = XBW + PPMd
[Ferragina et al, WWW ’06]
25%
20%
15%
10%
5%
0%
DBLP
gzip
bzip2
Pathways
ppmdi
xmill + ppmdi
News
scmppm
XBzip
A general algorithmic paradigm

Basic approach (…now only for text and labelled trees)

Transform the input data in few arrays

Index (+compress) to support Rank/Select
A lot of interest around it:
Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1)
Experimental: Wea ’06 (2)
You can test it:
http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl
A general algorithmic paradigm

Basic (magic ?!?) approach

Transform the input data in few arrays

Index (+compress) them to support Rank/Select ops
A = a b a a a c b c d a b e c d ...
a , 7 2)°=a#a
Select( a Rank(
, 2 ) = pos
= 3in A[1,7] = 4
A lot of interest around it:
Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1)
Experimental: Wea ’06 (2)