powerpoint - Berkeley Database Research
Download
Report
Transcript powerpoint - Berkeley Database Research
Search and Query:
An {Over, Re}view
Joseph M. Hellerstein
Computer Science Division
UC Berkeley
[email protected]
Background
My interests: Database & Information
Systems
Tech: Indexing, Query Processing,
Dataflow, Parallel/Distributed/Federated Systems
Apps: Interactive Data Analysis, eMarkets,
eCatalogs, Structured Querying on Internet,
Querying Sensors
My (lack of) agenda for today
Examples to defamiliarize today’s modes of
operation
Technical breakdown/review of system
components
Core/common vs. Value-Added
A Note of Caution
Contemplating a redesign? Fine. But…
Familiarity often trumps other desiderata
Memories are short, cross-fertilization is poor
Even in the technical community
Trap: common culture/techniques:
“What’s hot”: everybody knows web searching
“What’s packaged”: Database technology = Oracle
Rather than “What’s core technology”
In Expanding DNS, be sure to review known
technology
What’s core under the packaging of related technologies?
Reuse/generalize common ideas
Leave specialized ideas above the core (minimize core
constraints)
Beware of Today’s Common
Culture
Rise of the web => Rise of Full Text Search
Used to be a library app (yawn)
Suddenly the dominant paradigm (??)
Beware: Internet Web!
DNS is not document-centric
Neither are Peer-to-Peer filesharing
or “Deep Web” Facts and Figures
Keyword search may not be core functionality
If DNS is to serve many needs…
Must separate core search services from varying
apps!
Try to accommodate what’s hot tomorrow
Road Map
Defamiliarization
Some new Internet “search” apps
Provided by Telegraph
An {over, re}view
Some basics of querying
Some separation of the issues
Discussion…
Peer-to-Peer Filesharing
Napster, Gnutella, Aimster, etc.
And a pile of ill-defined startups
This is not the web, but it is part of the Internet
Lessons already!
Distributed files and search by substring
Isn’t that heaven?!
Would like to handle spelling, language translation (pig latin),
etc.
A layer of canonicalization (“nameprep”)
Well, what about these translations…
I post my name mapping scheme on my website
Download the Billboard Top Ten
Download tunes I’ve paid for
No problem…
TeleNap: Amazon Meets Napster
Album Information
OpenNap Servers
Telenap: Amazon Meets Napster
“Search” vs. Query
“Search” can
return only
what’s been
stored
E.g., best
match at iWon,
Google,
AskJeeves top
ten:
“Search” vs. Query
“Search” can
return only
what’s been
stored
E.g., best
match at iWon,
Google,
AskJeeves top
ten:
But the Basic Facts are There!
Query >> Search:
http://fff.cs.berkeley.edu
“Federated Facts and
Figures”
Yahoo join FECInfo
Imagine DNS
analogies:
Find IP addresses of
hosts registered by
companies that donated
to GW Bush
FECInfo join WHOIS
Note: errors!
Can do fuzzy matching
and ranking, too
Query >> Search:
http://fff.cs.berkeley.edu
“Federated Facts and
Figures”
APBNews join FECInfo
Imagine DNS
analogies:
Show # of live nodes,
Roll Up By Domain
Show Avg Ping time,
Roll Up By Domain
Show Best Ping Time
per Group, Roll Up By
Domain
Slippery Slope of “Mapping”
Step 1: Exact Match Lookups
Step 2: Simple canonicalization
E.g. toupper()
Step 3: Complex canonicalization
E.g. “nameprep”
Step 4: Join w/Arbitrary Mapping Tables
Centralized “keywords” (RealNames)
Personalized mapping, “bookmarks”
Geo-located mapping tables for Localization
Load-balancing mapping via net status queries
Ad-hoc mapping: ad hoc database join queries!
E.g. “The lightest-loaded machine containing >5 of the Billboard
Top Ten Songs”
Q: How much of this belongs in DNS??
Slippery Slope of “Mapping”
Input Value
Output Value
Cs.Berkeley.EDU
128.32.35.1
cHeese
CHEESE
Nueva_York_Yanquis
NYYankees
Cheese
www.kraft.com
Dear old Dad
[email protected]
Thai Food
www.plearn.com
White Album
Back in the USSR
White Album
Dear Prudence
www.yahoo.com
server47.yahoo.com
Slippery Slope of Querying
Step 1: Single collection, exact lookup
Step 2: Single collection, fancy lookup
“Search”
Step 2: Combine multiple collections
E.g. Mapping (join), “Metasearch” (union)
Step 3: Roll-up/drill-down
Step 4: Statistical analyses
Step 5: “Turing-complete” queries
Prolog/Datalog
How much of this belongs in DNS??
Note: Step 5 was even rejected in databases!
Where Does it End??
All this functionality seems plausibly
useful
Some of it even necessary!
But what should be in network
infrastructure?
Vs. centralized, replicated services
Who pays for these services?
Querying 101
Query Optimization
Bulk Data Processing
Access Methods
This is Core and Common to
All Query systems.
e.g. Rewriting in canonical form,
Choosing AMs and Processing Options.
Very different techniques in Databases,
Search Engines, though not mutually
exclusive.
e.g. Joins, Intersect, Union, Rank/Sort,
Group/Aggregate (Stat. Summaries)
Used in Databases, Search Engines
Parallelized on Well-Maintained Clusters
e.g. B-trees, Hash Indexes
Used in Directories, Databases,
Search Engines. Can be distributed.
Access Methods
Local search:
Main-memory data structures
Disk-based data structures
binary trees, hashtables, skip lists, etc.
B-trees, linear hash indexes, etc.
Typically equality and/or range lookups
Distributed search
Flat partitioning (hash, range), w/replication
Hierarchical partitioning
More recent multi-hop search & replication
E.g. CAN, Chord, PAST, Tapestry, Pastry
Equality lookups only (so far), no need for hierarchies
Proposed as a DNS replacement by networking
Data Processing
Flow the data through processing code
Absent in Directory Services
Used in database systems (in the box)
Ad hoc combinations of operators possible
Constrained use in text search engines (in the box)
1 collection: (word, docID, position, score, …)
Indexed by stemmed word
OR, AND, NOT: Union/Intersect/Subtract
Map docIDs to URLs, snippets, etc.
Sort by a (magic) function of position, score, etc.
Text search is one (highly tuned!) database query
Fun research: genericize this dataflow technology
Goal of Telegraph project: adaptive dataflow (out of the box)
Cluster-based implementation
Distributed (P2P) implementation
Query Optimization
Text Search
Query rewrite: stemming, stop words, thesaurus
Scheduling: which machine(s) on cluster
Based on data partitioning, load & data statistics
Database Systems
Query rewrite: authorization, “views”
Choices among (redundant) access methods
Choices among data processing algorithms (joins)
Choices of reorderings for these algorithms
Scheduling: which machines(s) on cluster
Based on data partitioning, load & data statistics
Lots of fancy tricks here!
And what about storage
semantics??
So far we only looked at the query side
Query results only as good as the data!
Again, varying solutions here…
Replication & Data Consistency
Databases do Transactions
Atomic, durable updates across multiple records
Data consistency guaranteed
Distributed transactions possible, but slow
Most people do “warm” replication
Two-Phase Commit
Log-shipping w/xactional networking -- MQ
Heavyweight technology!
Brewer’s CAP “theorem”
Directories tend to use “leases” (TTL)
Tend to be per record or collection
Cross-object consistency not guaranteed
A little drift is often OK
In a scenario with mapping, may need to think about atomicity
across records/tables
One View: Core vs. Apps
Ensure that core services:
Scale to large # of machines (~size of Internet)
Work in presence of failures
Well-defined standard results: no surprises/ambiguity!
E.g. SQL subsets, Boolean text search
Need not be a user in the loop!
Unanticipated scenarios in future
Support ad hoc queries -- think cross-paradigm
Apps:
Scale to clusters, need not scale to size of Internet
Allow for mapping/customization
Allow result browsing, analysis
Results can be preference-dependent
What about time-/geo-dependent??
Fuzzy results, roll-ups, summaries all OK: user in the loop!
Impose a query paradigm
More?
[email protected]
http://www.cs.berkeley.edu/~jmh
Telegraph:
http://telegraph.cs.berkeley.edu
Federated Facts and Figures:
http://fff.cs.berkeley.edu