Introduction to Stanford DB Group Research

Download Report

Transcript Introduction to Stanford DB Group Research

Introduction to Stanford DB
Group Research
Li Ruixuan
http://cs.hust.edu.cn/rxli/
[email protected]
1
Contents
Introduction
 Past projects
 Current projects
 Events
 References
 Links

2
The Stanford Database Group

“Mainstream” faculty
–
–
–
–

Hector Garcia-Molina
Jennifer Widom
Jeff Ullman
Gio Wiederhold
“Adjunct” faculty
– Chris Manning (natural language processing)
– Rajeev Motwani (theory)
– Terry Winograd (human-computer interaction)

A.k.a. Stanford InfoLab
3
Database Group (cont’d)
Approximately 25 Ph.D. students
 Varying numbers of M.S. and undergraduate
students
 Handful of visitors
 One senior research associate
 One systems administrator, one programmer
 Excellent administrative staff
 Resident photographer

4
Research Areas (very coarse)
Digital libraries
 Peer-to-peer systems
 Data streams
 Replication, caching, archiving,
broadcast, …
 The Web
 Ontologies, semantic Web
 Data mining
 Miscellaneous

5
Past Projects





LIC: Large-Scale Interoperation and Composition (1999) –
mediator (SKC, OntoWeb, CHAIMS, SmiQL, image DB)
SKC: Scalable Knowledge Composition (2000) - semantic
heterogeneity
TID: Trusted Image Distribution (2001) - Image Filtering
for Secure Distribution of Medical Information
Image Database: Content-based Image Retrieval (2003)
SimQL:Simulation Access Language (2001) - Software
modules in manufacturing, acquisition, and planning
systems
6
Past Projects (cont’d)
TSIMMIS: Wrapping and mediation for
heterogenous information sources (1998)
 Lore: A Database Management System for XML
(2000)
 WHIPS: WareHouse Information Prototype at
Stanford (1998) - Data warehouse creation and
maintenance
 MIDAS: Mining Data at Stanford (1999)
 WSQ: Web-Supported Queries (2000) Integrating database queries and Web searches7

Current Projects







WebBase: Crawling, storage, indexing, and querying of
large collections of Web pages. (Molina)
STREAM: A Database Management System for Data
Streams (Widom)
Peers: Building primitives for peer-to-peer systems (Molina)
Digital Libraries: Interoperating on-line services for enduser support (TID,WebBase,OntoAgents) (Molina)
TRAPP: Approximate data caching: trading precision for
performance (Widom)
CHAIMS: Compiling High-level Access Interfaces for
Multi-site Software (1999) (Wiederhold)
OntoAgents: Ontology based Infrastructure for Agents
(2002) (Wiederhold)
8
WebBase: Objectives
Provide a storage infrastructure for Web-like
content
 Store a sizeable portion of the Web
 Enable researchers to easily build indexes of
page features across large sets of pages
 Distribute Webbase content via multicast
channels
 Support structure and content-based querying
over the stored collection

9
WebBase: Architecture
WWW
Page Repository
Crawler
Indexing
Module
Analysis
Module
Multicast
Module
I
n
d
e
x
A
P
I
Query
Engine
Retrieval Indexes
Feature Repository
Client
Client
Indexing
Client
W
e
b
B
a
s
e
A
P
I
Client
Client
10
WebBase: Current Status

Efficient “smart” crawler
– Parallelism
– Freshness & Relevance

Efficient and scalable indexing
– Distributed Web-scale content indexes
– Indexes over graph structure

Unicast dissemination
– Within Stanford
– External clients: Columbia, U.Wash, U.C.Berkeley
11
WebBase: In Progress

WebBase Infrastructure
– Multicast dissemination
– Complex queries

Other work
–
–
–
–
PageRank extensions
Clustering and similarity search
Structured data extraction
Hidden Web crawling
12
Data Streams: Motivation
Traditional DBMS -- data stored in finite,
persistent data sets
 New applications -- data as multiple, continuous,
rapid, time-varying data streams

–
–
–
–
–
–
–
Network monitoring and traffic engineering
Security applications
Telecom call records
Financial applications
Web logs and click-streams
Sensor networks
Manufacturing processes
13
STREAM: Architecture
Register
Query
Streamed
Result
Stored
Result
DSMS
Input streams
Archive
Scratch Store
Stored
Relations
14
STREAM: Challenges
Multiple, continuous, rapid, time-varying
streams of data
 Queries may be continuous (not just one-time)

– Evaluated continuously as stream data arrives
– Answer updated over time

Queries may be complex
– Beyond element-at-a-time processing
– Beyond stream-at-a-time processing
15
DBMS versus DSMS

Persistent relations

Transient streams (and
persistent relations)

One-time queries

Continuous queries

Random access

Sequential access

Access plan determined
by query processor and
physical DB design

Unpredictable data
arrival and
characteristics

“Unbounded” disk store

Bounded main memory
16
STREAM: Current Status
Data streams and stored relations
 Declarative language for registering
continuous queries
 Flexible query plans
 Designed to cope with high data rates and
query workloads

– Graceful approximation when needed
– Careful resource allocation and usage

Relational, centralized (for now)
17
STREAM: Ongoing Work
Algebra for streams
 Semantics for continuous queries
 Synopses and algorithmic issues
 Memory management issues
 Exploiting constraints on streams
 Approximation in query processing
 Distributed stream processing
 System development

18
STREAM: Related Work









Amazon/Cougar (Cornell) – sensors
Aurora (Brown/MIT) – sensor monitoring, dataflow
Hancock (AT&T) – telecom streams
Niagara (OGI/Wisconsin) – Internet XML databases
OpenCQ (Georgia) – triggers, incr. view maintenance
Stream (Stanford) – general-purpose DSMS
Tapestry (Xerox) – pub/sub content-based filtering
Telegraph (Berkeley) – adaptive engine for sensors
Tribeca (Bellcore) – network monitoring
19
Peer-To-Peer Systems
Multiple sites (at edge)
 Distributed resources
 Sites are autonomous (different owners)
 Sites are both clients and servers
 Sites have equal functionality

20
P2P Benefits



Pooling available (inexpensive) resources
High availability and fault-tolerance
Self-organization
21
P2P Challenges

Search
–
–
–
–
–

Query Expressiveness
Comprehensiveness
Topology
Data Placement
Message Routing
Resource Management
– fairness
– load balancing

Security & Privacy
–
–
–
–
Anonymity
Reputation
Accountability
Information
Preservation
– Information Quality
– Trust
– Denial of service
attacks
22
Peers: Stanford Research
New Architectures
 Performance Modeling and Optimization
 Security and Trust
 Distributed Resource Management
 Applications

23
Digital Library Project: Overview
Payment
Institutions
Internet
Libraries
User Interfaces
and Annotations
Copyright
Services
Telnet
HTTP
Z39.50
Search
Agents
Query/Data
Conversion
Commercial
Information Brokers &
Providers
24
DigLib Projects: DLI1,DLI2
Resource Discovery
 Retrieving Information
 Interpreting Information
 Managing Information
 Sharing Information

25
DigLib: Resource Discovery

Geographic Views (Tools to assist you in
more systematically locating different types
of information from a large and diverse
number of information sources)
26
DigLib: Retrieving Information
Information Tiling
 PalmPilot Infrastructure (PDA)
 Power Browsing (PDA)
 Query Translator
 SDLIP (Simple Digital Library
Interoperability Protocol)
 Value Filtering
 WebBase

27
DigLib: Interpreting Information
Murals (Tools to help a user interpret and
organize search results)
 Web Clustering

28
DigLib: Managing Information
Archival Repositories
 Archiving Movie
 InterBib (a tool for maintaining
bibliographic information)
 Medical Transport Info
 PhotoBrowser

29
DigLib: Sharing Information





Diet ORB (PDA, based on MICO)
Digital Wallets
Mobile Info Delivery
Mobile Security
Multicasting
30
DLI1 Projects (95-99)








AHA
ComMentor
DLITE
Google
GLOSS
FAB
Grassroots
Metadata Architecture







RManage/FIRM
SenseMaker
SCAM
Shopping Models, U-PAI
SONIA
STARTS
WebWriter
31
TRAPP: Overview
TRAPP: Tradeoff in Replication Precision
and Performance
 A.k.a: Approximate Data Caching
 Project goal: investigating techniques to
permit controlled and explicit relaxation of
data precision in exchange for improved
performance

32
TRAPP: Motivation
Transactional consistency too expensive
 Even nontransactional propagation of every
update still too expensive in many cases


Solution: Approximate Caching
– Exploit the fact that many applications do not
require exact consistency
– Avoid propagating insignificant updates
– Trade cache precision for network load
33
Example: TRAPP Over Numeric Data
Caches store intervals
that bound the exact
source values
 Sources refresh when
value leaves interval

cache
[2, 5]
[-1, 0.8]
refreshes
3.9
source
refreshes
0.2
source
Query answers are intervals
 Precision constraints specify maximum width

34
Eg(cont’d): Querying in TRAPP
For one-time aggregation queries:
– Answers computed by combining approximate cached
data and exact source data
– At query-time: Find low-cost subset of sources to
probe so final answer will have adequate precision
– Algorithm determined by aggregation function
» Some easy, some hard
Query: X + Y
(within 2)
cache
[2, 5]
[-1, 0.8]
probe
X 3.9
source
Y 0.2
source
Answer:
[2.9, 4.7]
35
TRAPP: Approximate Caching
Two common scenarios:
• Minimize bandwidth usage, precision fixed
» TRAPP: caches store bounds as approximations
» Queries select combination of cached & source data
» Adaptive bound adjustment for good precision level
• Bandwidth fixed, maximize precision
» Best-Effort Synchronization: caches store stale copies
» Refreshing based on priority scheduling
» Global priority order via threshold
» Adaptive threshold setting for flow control
36
TRAPP: Status
Past work: focused on an approximate data
caching architecture that permits finegrained control of the precisionperformance tradeoff for numerical data in
data caching environments.
 Current work: applying the above
techniques and others to more complex data
such as Web pages.

37
CHAIMS: Overview




CHAIMS: Compiling High-level Access Interfaces for
Multi-site Software
Objective: Investigate revolutionary approaches to largescale software composition.
Approach: Develop and validate a composition-only
language, a protocol for large, distributed, heterogeneous
and autonomous megamodules, and a supporting system.
Planned contributions:
–
–
–
–
Asynchrony by splitting up CALL-statement.
Hardware and software platform independence.
Potential for multi-site dataflow optimization.
Performance optimization by invocation scheduling.
38
CHAIMS: Overview
Megaprogram for composition,
written by domain programmer
CHAIMS
Megamodules
CHAIMS system automates
generation of client for
distributed system
Megamodules, provided
by various megamodule
providers
39
CHAIMS: Architecture
Megamodule
Provider
Megaprogrammer
writes
wraps non-CHAIMS
compliant megamodules
information
Megaprogram
(in CHAIMS language)
information
adds information to
Wrapper
Templates
CHAIMS
Repository
CHAIMS
Compiler
b
generates
CSRT
(compiled megaprogram)
a
d e
c
MEGA modules
Distribution System (CORBA, RMI…)
40
OntoAgents: Objective
OntoAgents goal: establish an agent
infrastructure on the WWW or WWW-like
networks
 Such an agent infrastructure requires an
information food chain: every part of the
food chain provides information, which
enables the existence of the next part.

41
OntoAgents: Architecture
•End
User
•Ontology
Articulation Toolkit
•Ontology
Construction Tool
•Agents
•Community
Portal
•Ontologies
•Inference
Engine
•Webpage
Annotation Tool
•Annotated
Webpages
•Metadata
Repository
42
Events: DB Seminars
Academic
Year
Fall
Winter
Spring
2002/2003
no seminar in the fall
quarter
Database Seminar
(CS545)
Genome Databases
(CS545G)
2001/2002
Past, Present, and Future
of Database Technology
Genome Databases
Database Seminar to
come
2000/2001
Interoperation,
Databases and the
Semantic Web
Image Databases
Databases and the
Semantic Web
1999/2000
Ontologies, E-Commerce,
n/a
XML & Metadata
1998/1999
Digital Libraries
Image Databases
1997/1998
Data Warehousing
Image Databases
1996/1997
Fall Quarter 96
Image Databases
Ontologies, ECommerce, XML &
Metadata
Internet and
Databases
Internet and
Databases
Spring Quarter
43 97
Events: Meetings
Stanford Computer Science Forum - Annual
Affiliates Meeting, Stanford, May 2003.
 SWiM (the Stream Winter Meeting): About 35
researchers in the data streams are came
together at Stanford for SWiM, Jan. 2003.

– Stream Team: A few data streams research groups
held some informal get-togethers, 2002.

Conference Talk: ACM SIGMOD/PODS,
VLDB, ICDT, ICDE, ICDCS, CIDR
44
References: WebBase



Junghoo Cho, Hector Garcia-Molina. "Parallel
Crawlers," In Proceedings of the Eleventh World Wide
Web Conference, May 2002.
Taher Haveliwala, Aristides Gionis, etc. "Evaluating
Strategies for Similarity Search on the Web,"
Proceedings of the Eleventh International World Wide
Web Conference, May 2002.
Taher Haveliwala. "Topic-Sensitive PageRank,"
Proceedings of the Eleventh International World Wide
Web Conference, May 2002.
45
References: STREAM



R. Motwani, J. Widom, etc. Query Processing, Resource
Management, and Approximation in a Data Stream
Management System
In Proc. of the 2003 Conference on Innovative Data Systems
Research (CIDR), January 2003
A. Arasu, B. Babcock. etc. STREAM: The Stanford Stream
Data Manager
In Proc. of the ACM Intl Conf. on Management of Data
(SIGMOD 2003), June 2003
B. Babcock, S. Babu, etc. Models and Issues in Data Stream
Systems
Invited paper in Proc. of the 2002 ACM Symp. on Principles
of Database Systems (PODS 2002), June 2002
46
References: Peers
Neil Daswani, Hector Garcia-Molina and
Beverly Yang. Open Problems in Data-Sharing
Peer-to-Peer Systems, In ICDT, 2003.
 Hector Garcia-Molina. Peer-To-Peer Data
Management, Key-notes In ICDE, 2002.
 Hrishikesh Deshpande, Mayank Bawa, and
Hector Garcia-Molina. Streaming Live Media
over a Peer-to-Peer Network.

47
References: TRAPP


C. Olston and J. Widom. Best-Effort Cache
Synchronization with Source Cooperation. ACM
SIGMOD 2002 International Conference on
Management of Data, Madison, Wisconsin, June 2002,
pp. 73 -84.
C. Olston, B. T. Loo and J. Widom. Adaptive Precision
Setting for Cached Approximate Values. ACM
SIGMOD 2001 International Conference on
Management of Data, Santa Barbara , California, May
2001, pp. 355-366.
48
Useful Links

Database Group: http://www-db.stanford.edu/
 STREAM: http://www-db.stanford.edu/stream/
Peers: http://www-db.stanford.edu/peers/
 DigLib: http://www-diglib.stanford.edu/
 TRAPP: http://www-db.stanford.edu/trapp/
 WebBase: http://wwwdiglib.stanford.edu/~testbed/doc2/WebBase/

49