Probabilistic Databases - Home | Georgia State University

Download Report

Transcript Probabilistic Databases - Home | Georgia State University

Presented by:
Chaitanya Sambhara
- Graduate Student
- CSc 8712 ( Advanced Database Systems )
- Spring 2008
- Instructor : Dr Yingshu Li
Index
•
•
•
•
•
•
•
•
•
•
Why this topic
Introduction
What is a probabilistic database
Why probabilistic databases/ the need
Where did this concept come from and gains
Uncertainties and their types
Dependencies and their types
Constraints
Drawbacks
Current research and challenges
Why this topic ?
• The traditional databases are deterministic.
• Every item is either in the database or is not, 0
or 1.
• But modern enterprise applications need to deal
with lots of imprecise data: different
representation of the same object in multiple
data sources.
• Imprecise data is best modeled as probabilistic
data, and managed by a probabilistic database
system.
Relational Databases
• Usually a relational database (RDB) is
defined a finite collection of relations where
each relation is a subset of cartesian product
of sets referred to as domains.
• Where as a probabilistic databases provide
means of representing types of information
that cannot be captured by relational
database.
What is a probabilistic
Database?
• Traditional databases store solid “facts” that can
be considered certain
• In many cases, we don’t know things precisely
• I saw a bird, but not sure if it was a dove or a
sparrow
• Uncertain, incomplete data is becoming more
and more common
Why probabilistic databases ?
• A conventional relational database model cannot handle
probabilistic information well, though there is much
application of it.
•
To treat such information in relational models, database
users themselves need to determine how to represent
such information and define its semantics.
• Thus, there has been a need for probabilistic database
models which can represent such information.
Need for probabilistic databases
• Example : Supplier X supplies part Y
• From the point of view of data-modelling
many situations require more complex
forms of information.
Need for probabilistic
databases
• How reliable is part Y when supplied by the
supplier X?
• Is the probability that a person of type X will
purchase product Y greater if that person has
also purchased product Z ?
• If X is known how much additional information
about Y is provided by knowledge of Z ?
Where did this concept come
from ?
Application pull:
• The need to manage imprecision in data
Technology push:
• Advances in query processing techniques
The gain ?
• Probabilistic Databases provide a framework
that generalizes the relational database model.
• Extends all the concepts that have been
developed to deal with the collection of yes/no
facts to apply.
• Also to facts about which one is uncertain
(probabilistic databases) or about one has
vague or fuzzy information (fuzzy databases).
Types of Uncertainty's related to
probabilistic databases
• Tuple-level uncertainty
• Attribute-level uncertainty
Tuple Level Uncertainty
• All attributes in a tuple are known
precisely; existence of the tuple is
uncertain.
• Tuple (“GSU”, ...) will be present in the
answer with some uncertainty
Trio
•
Witness 1: Car green; driver was a man, 20 years.
•
Witness 2: Car yellow; driver was 35 years old. •
• Statements are partly contradictory. Statements are partly
independent.
•
Need to gather facts and weigh options: – What is the probability
that the driver was a man?
• What is the probability that the car is green and the driver is 35
years old? – (Which witness do you trust; do you trust a witness all
or nothing?)
Inconsistent Data
• Types of imprecision addressed:
• Data from different sources is
contradictory.
• Data is uncertain, hence, arguably,
probabilistic
• Query answers are probabilistic
A probabilistic would…
• Give better recall and precision!
• Needs to support disjoint tuple events
Example for a tuple level
uncertainty
•This figure represents that
a tuple (p,e) exists in the
database with probability
0.4, which is indicated by a
special attribute named pS
.However, these models
assume that the existence
of each tuple is
independent, and are not
able to represent
dependency among tuples
such as “at least one tuple
of two or more tuples must
exist in a database.”
State Event pS
p
e 0.4
p
f
0.6
q
e 0.5
q
f
0.4
Example for a tuple level
uncertainty
•For example, in a this
figure , the probability that
neither of the two tuples
(p,e) and (p,f) exists in the
database is given by (1-0.4)
* (1-0.6) = 0.24
•0.24>0 , so it cannot
represent the case that at
least one tuple of the two
must exist in the database.
State Event pS
p
e 0.4
p
f
0.6
q
e 0.5
q
f
0.4
Attribute level uncertainty
• Tuples (identified by keys) exist for certain;
an attribute value is however uncertain
• Tomorrow temperature will be somewhere
between 50F and 70F
Attribute level uncertainty
• Attribute-level uncertainty means the
attribute values are uncertain.
• The attributes are independent of each
other
Example for attribute level
uncertainty
• Tuples exist with certainty
• Temperature at time t1 at location 1 etc.
• But the attribute values (temperatures) are
uncertain
• In particular, each temperature value is a
Gaussian
Example for attribute level
uncertainty
•Tuples exist with certainty.
Time temp temp
low high
1
20
21
•• Temperature at time t1 at
location 1 etc.
2
22
23
3
18
19
•• But the attribute values
(temperatures) are
uncertain
Probabilistic Data
Dependencies
• In simple relational database applications
data dependencies can be inferred from
the meaning of the attributes as
determined by the application.
• In complex probabilistic databases though
dependencies are not known beforehand.
An important analogue of
Probabilistic database design
• To determine which dependencies exist
• Relative strength of various dependencies
when they do not exist in absolute sense.
• Hence the probabilistic dependencies can
broadly be classified in to three types.
Types of probabilistic
dependencies
• JOIN, FUNCTIONAL and MULTIVALUED
dependencies.
• They generalize the database in the sense
that• They apply both to probabilistic and relational databases
• They are straightforward, by application of results from
information theory.
Join Dependency
• A join dependency hold when a relational
or probabilistic database system can be
decomposed into a collection of
(sub)systems such that the system is
equal to the join of the subsystems.
Functional Dependencies
• In relational databases XY
• For probabilistic databases the same
applies, that is, once the tuple values for
attributes X are known, then there is no
uncertainty regarding the possible tuple
values for attributes Y.
Multivalued Dependencies
• X  Y
• What does it mean ?
• X Y if knowledge of R-(Y-X) gives the
same or less information about Y than
does the knowledge of X alone
Example of multivalued
dependency
Drinkers(name, addr, phones, beersLiked)
• A drinker’s phones are independent of the
beers they like.
– name->->phones and name ->->beersLiked.
• Thus, each of a drinker’s phones appears
with each of the beers they like in all
combinations.
• This repetition is unlike FD redundancy.
name->addr is the only FD.
Constraints
• The constraints for probabilistic databases
are nearly same as those for relational
databases.
• Such as domain, referential, key and
entity.
• Intergrity constraints might have issues
Integrity constraints
• Integrity constraints might not be forced
enforced or satisfied.
• In some environments checking
consistency of constraints might be too
expensive.
• Like work loads with high update rates.
Query Evaluation
• Queries in
• Databases and Information retrieval :
Fundamentally different
Example
Example
Query Evauluation
• But what if a query looks like:
Support for uncertain matches
• SQL query with uncertain predicates.
• Assign probability to each tuple according
to how well it matches the uncertain
predicates.
• Outputs are now ranked and have
probabilistic semantics
Example
Example
Drawbacks
• Commercial database systems have been
relatively slow to incorporate imprecision
capabilities
• Practitioners are also concerned with
compatibility.
• Many of the query evaluation algorithms for
matching imprecise data or for processing
imprecise queries are fairly complex and
inefficient.
Current research and
challenges
• Preliminary work in this area includes
investigation of incompleteness in logical
databases.
• The design of a knowledge-rich system
that can deal with various aspects of
imprecision
Current research and
challenges
• The method of data organization
presumes knowledge of probabilistic
independence among attributes, which
may not be known at design time.
• A model is proposed with some restrictions
such as: each relational scheme contains
a set of deterministic key attributes.
Bibliography
•
Query Evaluation on Probabilistic Databases
By : Christopher R´e, Nilesh Dalvi and Dan Suciu
University of Washington year 2006
http://www.cs.washington.edu/homes/nilesh/papers/debul06probdb.pdf
•
Probabilistic databases with correlated tuples
By : Prithviraj Sen, Amol Deshpande and Lise Getoor
A journal published at Dept of Computer science
University of Maryland College Park 2006
http://www.cs.purdue.edu/probdb/updb06/prob_db_wkshp.pdf
•
A General Probabilistic database model
By :Veronica Biazzo, Alfredo Ferro, Angelo Gilio and Rosalba
Giugno
Dipartimento di Matematica e Informatica, Viale A. Doria, 6 - 95125
Catania, Italy December 18, 2002
http://www.dmi.unict.it/~giugno/ipmu00.pdf
Bibliography
•
Models for Incomplete and Probabilistic information
By : Todd J. Green and Val Tannen
Department of Computer and Information Science
University of Pennsylvania
International Workshop on Inconsistency and Incompleteness in
Databases
March 26, 2006
http://www.cis.upenn.edu/~tjgreen/iidb06-talk.pdf
•
Efficient Management of Inconsistent Databases.
By : Ariel Fuxman, Elham Fazli, and Renee J. Miller. In ACM SIGMOD
Conference ,
155-166, 2005
http://www.cs.toronto.edu/~afuxman/publications/sigmod05.pdf
•
A System for Efficient Query Answering Over Inconsistent Databases.
Ariel Fuxman, Diego Fuxman, and Renee J. Miller.
In International Conference
on Very Large Databases (VLDB) ,
1354-1357, 2005
http://www.cs.toronto.edu/~afuxman/publications/vldbdemo05.pdf
Suggestions/ Question time