#### Transcript mcdb

Managing Uncertain Data MCDB: A MONTE CARLO APPROACH J A M P A N I (1), X U (1), W U (1) (1), P E R E Z (1), J E R M A I N E (1), H A S S (2) UNIVERSITY OF FLORIDA, (2) IBM RESEARCH CENTER DANIEL BLUM SDBI H E B R E W U N I V E R S I T Y, C O M P U T E R S C I E N C E D E P T. 2008 Outline Introduction – Uncertainty in Database Systems Motivation (Problems in Probabilistic Databases) Monte Carlo Method MCDB Introduction Schema Specification VG Functions Inference and Accuracy Queries, Tuple Bundles and Operations Experiments and Conclusions Outline Introduction – Uncertainty in Database Systems Motivation (Problems in Probabilistic Databases) Monte Carlo Method MCDB Introduction Schema Specification VG Functions Inference and Accuracy Queries, Tuple Bundles and Operations Experiments and Conclusions Introduction – Uncertainty in Database System Types of Uncertainty Uncertainty – Can’t determine whether an assertion in the model is true or false. “The age of John is 35” Imprecision – The information is not as specific as it should be. “The age of John is between 30 and 40” Vagueness – Elements in the model are inherently vague. “John is in early middle age” Inconsistency – The model contains two or more assertions that can’t be true at the same time. “John is 35” and “The age of John is between 25 and 30” Ambiguity – Some elements of the model lack complete semantics. “John’s salary is 15,000$” (Month? Year?) Introduction – Uncertainty in Database System Categories of Uncertainty Description Uncertainty – The actual data which is stored in the database is uncertain. Transaction Uncertainty – Uncertain queries which might return uncertain data or could update the database with uncertain data. Processing Uncertainty – The algorithms used for processing might: Sacrifice accuracy for the sake of simplicity/complexity Use statistical model and approximations Fail in giving accurate results because a lack of resources during processing Introduction – Uncertainty in Database System Description Uncertainty What is affected? Uncertain data values The time of a given flight in the database is uncertain. Tuple uncertainty There is no actual flight from A to B. Relation\Structural uncertainty There is no direct flight from A to B (There are limitations in the representation of the data in the relation). Introduction – Uncertainty in Database System Description Uncertainty Source for uncertainty? Unavailable information John has a well defined age which isn’t available. Unreliable information sources Faulty readings, forms which were filled incorrectly, noise etc. Information gathering methods Usage of estimations or approximations. Model limitations Database stores only two occupations per employee. Introduction – Uncertainty in Database System Description Uncertainty: Degrees of uncertainty Existence There is uncertainty if the tuple even exists. Disjunctive – Set of alternatives An attribute that has a set of alternatives. Probabilistic We have a range and probability for each value. Introduction – Uncertainty in Database System Dealing with uncertainty Null values First Name Last Name Grade Daniel Blum Null Joint probabilities (ERM – Extended Relational Model) First Name Last Name Grade Probability Daniel Blum 99 0.01 Adding “lineage” in addition to stored probabilities ID Name Course ID Course Grade ID Name Grade 11 Daniel SDBI 21 SDBI 99 31 Daniel 99 Lineage: (31) = {11,21} Outline Introduction – Uncertainty in Database Systems Motivation (Problems in Probabilistic Databases) Monte Carlo Method MCDB Introduction Schema Specification VG Functions Inference and Accuracy Queries, Tuple Bundles and Operations Experiments and Conclusions Motivation (Problems in Probabilistic Databases) Problem 1 – “Hard Wired” Data ID Name Grade Probability 31 Daniel 99 0.01 ID Name Grade Probability 31 Daniel 99 0.01 Lineage: (31) = {11,21} If a new unanticipated manifestation of uncertainty is found? Motivation (Problems in Probabilistic Databases) Problem 2 – Flexibility and Expressiveness ID Name Grade Probability 31 Daniel 99 0.01 Lineage: (31) = {11,21} How can we update the probability? What if the we use the a complex statistical model? Can we dynamically parameterize the uncertainty on the global state of the database? What about on the results from queries? Motivation (Problems in Probabilistic Databases) Problem 3 – “Extrapolation Uncertainty” Product Price Q1 Profits Q2 Profits Q3 Profits Q4 Profits iPod nano 149$ 56M$ 44M$ 20M$ 66M$ iPod touch 229$ 72M$ 78M$ 40M$ 90M$ MacBook 999$ 32M$ 40M$ 38M$ 55M$ Can we use the current state of the database in order to answer the following question What would our profits have been in the last 4 quarters if we have raised the prices by 5%? Motivation (Problems in Probabilistic Databases) Problem 3 – “Extrapolation Uncertainty” How can we solve this problem? Bayesian approach – Uses a “prior” distribution model for the customers demands, and create a “posterior” distribution demand for the new prices Why is it difficult to implement this analysis in ERM? This statistical model is quite unique The parameterization of the model depends on the current state The posterior distribution for the new demand is complex We might not be able to represent it in a closed-form Outline Introduction – Uncertainty in Database Systems Motivation (Problems in Probabilistic Databases) Monte Carlo Method MCDB Introduction Schema Specification VG Functions Inference and Accuracy Queries, Tuple Bundles and Operations Experiments and Conclusions Monte Carlo Method From Wikipedia “Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results”. Useful in studying systems with large number of coupled degrees of freedom. Example Modeling phenomena with significant uncertainty in inputs, such as the calculation of risk in business. Monte Carlo Method Example – Calculating Pi (π) Draw a square and inscribe a circle within Uniformly scatter objects of uniform size (rice) Count the number of objects in the circle, multiply by 4 and divide by the total number of objects in the square The proportion will approximate π/4 Monte Carlo Method The General Pattern Define a domain of possible inputs Generate inputs randomly from the domain Perform a deterministic computation using the inputs Aggregate the results of the individual computations into the final result Outline Introduction – Uncertainty in Database Systems Motivation (Problems in Probabilistic Databases) Monte Carlo Method MCDB Introduction Schema Specification VG Functions Inference and Accuracy Queries, Tuple Bundles and Operations Experiments and Conclusions MCDB - Introduction MCDB – The Monte Carlo Database System Allows the user to attach arbitrary stochastic models to the database data in order to “guess” the values for unknown or inaccurate data. The models produce multiple possible database instances (“possible worlds”) and the query is run over each instance (but in an efficient way…) MCDB - Introduction The Monte Carlo Database System Doesn’t encode uncertainty within the data model itself. Instead allows to define arbitrary variable generation (VG) functions (This will encode the logics of the Bayesian approach) Uses parameter tables rather than probabilities, which are easy to change The VG functions generate a large number of i.i.d “possible worlds”, and the query is run over each of them MCDB - Introduction Monte Carlo Benefits Can deal with several types of queries which are problematic in exact computation: Complex query (“EXISTS”, “NOT IN”…) causes significant problems (P#) Also queries such as “SUM” and ‘AVG” pose significant challenge It is often unclear how to analyze query output (e.g. quantiles). Same general purpose methods apply to correlated or uncorrelated model Can easily deal with arbitrary continuous distributions, even if there isn’t a closed-from representation MCDB - Introduction Limitations of MCDB Exact evaluation of some query results is not feasible Tuple appearance probabilities Expected value of an aggregation query This is caused by the generality of the VG (“black box”) which makes it hard to quantify the relationship of it to the query result MCDB - Introduction Two natural concerns to MCDB Performance Implementation of an efficient system Estimation of output results MCDB merely estimates the output results “How do we generate or calculate the probabilities in the probabilistic database in the first place?” MCDB - Introduction Contributions of MCDB First “pure” Monte Carlo approach toward managing uncertain data Powerful and flexible representation of data uncertainty Provide a syntax which requires small changes to SQL VG are similar to the User Defined Functions (UDF) Provide new query processing algorithms that execute a query plan only once Acceptable performance according to experiments Outline Introduction – Uncertainty in Database Systems Motivation (Problems in Probabilistic Databases) Monte Carlo Method MCDB Introduction Schema Specification VG Functions Inference and Accuracy Queries, Tuple Bundles and Operations Experiments and Conclusions MCDB – Schema Specification Definitions A relation is “deterministic” if its realization is the same in all possible worlds, otherwise it is “random” Each “random” relation is specified by a schema and a set of VG functions The output of a query is a probability distribution over possible answers MCDB – Schema Specification Create-Table and Correlations An extended version of SQL CREATE TABLE and VG functions generate relation instances Each random relation is viewed as a union of blocks of correlated tuples Tuples in different blocks are independent MCDB – Schema Specification Syntax – Example (1) “Patient systolic blood pressure” Exact values are unavailable, but values are distributed normally Default blood pressure is 100 mm Hg We know the mean and standard deviation Patients are assumed independent SBP_PARAM PATIENTS Mean Std PID Gender 10 5 1 Female 2 Male 3 Female MCDB – Schema Specification Syntax – Example (2) “Patient systolic blood pressure” CREATE TABLE SBP_DATA(PID, GENDER, SBP) AS FOR EACH p in PATIENTS WITH SBP AS Normal ( One parameter (SELECT s.MEAN, s.STD VG function FROM SBP_PARAM s)) SELECT p.PID, p.GENDER, b.VALUE FROM SBP b MCDB – Schema Specification Parameterizing VG Functions (1) “Customers expenses” We know the region in which a customer lives CID Region CUST_ATTRS For region we associate a probability that a customer lives there Name Gender Region Prob CITIES Money attribute follows a gamma distribution Shift – Same for all Scale – Regional Shape – Customer level Shift MONEY_SHIFT Region Scale MONEY_SCALE CID Shape MONEY_SHAPE MCDB – Schema Specification Parameterizing VG Functions (2) “Customers expenses” CREATE TABLE CUST (CID, GENDER, MONEY, LIVES_IN) AS FOR EACH d in CUST_ATTRS WITH MONEY AS Gamma ( (SELECT n.SHAPE FROM MONEY_SHAPE n WHERE n.CID = d.CID), (SELECT sc.SCALE FROM MONEY_SCALE sc WHERE sc.REGION = d.REGION), (SELECT SHIFT FROM MONEY_SHIFT)) WITH LIVES_IN AS DiscreteChoice ( (SELECT c.NAME, c.PROB FROM CITIES c WHERE c.REGION= d.REGION)) SELECT d.CID, d.GENDER, m.VALUE, l.VALUE FROM MONEY m, LIVES_IN l Three parameter VG function One parameter VG function MCDB – Schema Specification Capturing ERM Functionality (attribute-value) Lets look on a variant of the previous example in which each customer is associated with a set of possible cities Name Region Prob ID Name Prob A small change in the definition of the 2nd VG and we capture the ability to represent attribute-value uncertainty WITH LIVES_IN AS DiscreteChoice ( (SELECT c.NAME, c.PROB FROM CITIES c WHERE c.REGION= d.REGION)) WITH LIVES_IN AS DiscreteChoice ( (SELECT c.NAME, c.PROB FROM CITIES c WHERE c.CID= d.CID)) MCDB – Schema Specification Capturing ERM Functionality (tuple-inclusion) We can add to the same example from previous slides tupleinclusion uncertainty in the following manner CID Gender Region CID Gender Region INCL_PROB Add the VG WITH IN_TABLE AS Bernoulli (VALUES(d.INCL_PROB)) And by adding the following line to the SELECT clause SELECT d.CID, d.GENDER, m.VALUE, l.VALUE FROM MONEY m, LIVES_IN l WHERE i.VALUE = true MCDB – Schema Specification Structural Uncertainty – Fuzzy Queries (1) Customers locations LID City LOCATIONS Customers transaction records SID Name Name Amount SALES We define a random table which simulate a fuzzy join regarding the Name field MCDB – Schema Specification Structural Uncertainty – Fuzzy Queries (2) CREATE TABLE LS_JOIN (LID, SID) AS FOR EACH t in ( SELECT l.LID, l.NAME AS NAME1, s.SIN, s.NAME AS NAME2 FROM LOCATIONS l, SALES s) WITH JOINS AS Bernoulli ( VALUES(Sim(t.NAME1, t,NAME2))) SELECT t.LID, t.SID FROM JOINS j WHERE j.VALUE = true Sim returns a value between 0 and 1 according string similarity Bernoulli returns “true” with probability p as given and “false” with (1-p) MCDB – Schema Specification Structural Uncertainty – Fuzzy Queries (3) Overall result is given by the query SELECT l.CITY, SUM(s.AMOUNT) FROM LOCATIONS l, SALES s, LS_JOIN j WHERE l.TID = j.lid AND s.SID = j.SID GROUP BY l.CITY MCDB – Schema Specification Correlated Attributes If income and city of residence are correlated we can handle this in the following manner CREATE TABLE CUST (CID, GENDER, MONEY, LIVES_IN) AS FOR EACH d in CUST_ATTRS WITH MLI AS MyJointDistribution (…) SELECT d.CID, d.GENDER, MLI.VALUE1, MLI.VALUE2 FROM MLI MCDB – Schema Specification Correlated Tuples (1) “Temperature Sensors” Each sensor reading as mean of normal probability distribution Sensors are divided into groups according physical closeness Stored data ID LAD LONG ID MEAN ID1 ID2 GID S_PARAMS MEANS COVAR COVARS MCDB – Schema Specification Correlated Tuples (2) “Temperature Sensors” CREATE TABLE SENSORS(ID, LAD, LONG, TEM c.ID1 P) AS FOR EACH g in (SELECT DISTINCT GID FROM S_PARAMS) MDNoraml is WITH TEMP AS MDNormal ( invoked once per (SELECT m.ID, m.MEAN group, and FROM MEANS m, S_PARAMS ss WHERE m.ID=ss.ID AND ss.GID=g.GID), returns multi-row table having one (SELECT c.ID1, c.ID2, c.COVER row per group FROM COVERS c, S_PARAMS ss member WHERE c.ID1=ss.ID AND ss.GID=g.GID), SELECT s.ID, s.LAD, s.LONG, t.VALUE FROM S_PARAMS s, TEMP t WHERE s.ID=t.ID Outline Introduction – Uncertainty in Database Systems Motivation (Problems in Probabilistic Databases) Monte Carlo Method MCDB Introduction Schema Specification VG Functions Inference and Accuracy Queries, Tuple Bundles and Operations Experiments and Conclusions MCDB – VG Functions Basic Interface Implemented in C++ Each VG function has the following four public methods Initialize() – Seed for the VG function TakeParams() – Gets the parameters, one vector at a time OutputVals() – One row returned per call Finalize() – Deletes the internal VG function data structures Outline Introduction – Uncertainty in Database Systems Motivation (Problems in Probabilistic Databases) Monte Carlo Method MCDB Introduction Schema Specification VG Functions Inference and Accuracy Queries, Tuple Bundles and Operations Experiments and Conclusions MCDB – Inference and Accuracy Inference Operator – Definition In general MCDB returns its query results as a set of (ti, fi) ti – Are the distinct tuples produced in the N MC iterations fi – The fraction of the N possible worlds which tuple ti appears SELECT SUM(sales) FROM T – Where T is a random table doesn’t return a fixed number, but a random number MCDB – Inference and Accuracy Accuracy We can use statistical tools (central limit theorem) in order to calculate the number of Monte Carlo iterations needed to estimate E[x] to within a desired precision We can also perform statistical tests of hypotheses such as “The expected value of the result Q1 is greater than the expected value of the result of Q2” Outline Introduction – Uncertainty in Database Systems Motivation (Problems in Probabilistic Databases) Monte Carlo Method MCDB Introduction Schema Specification VG Functions Inference and Accuracy Queries, Tuple Bundles and Operations Experiments and Conclusions MCDB – Queries, Tuples Bundles and Operations Query – Processing (Naïve) Generate an instance of each random relation as specified by the various CREATE TABLE statements Once an entire instance of the database has been materialized execute the query Q Append every tuple in Q’s answer set with a number identifying the current Monte Carlo iteration Once N different answer sets have been generated, compute the for each tuple the number of worlds it belongs to Naïve implementation is impractical – MCDB has other strategy! MCDB – Queries, Tuples Bundles and Operations Query – Processing in MCDB (1) MCDB runs each query one time, regardless of N MCDB delays random attribute materialization Each tuple in MCDB is actually a “bundle” of tuples where t[i] is the value of t in the i-th Monte Carlo database instance They will be materialized right before random attributes are used by some relational operation In MCDB values of attributes are reproducible MCDB ensures that each tuple carries the pseudorandom number seeds that is supplies to the VG functions The seed is a compressed representation of the random attribute values in the tuple bundle MCDB – Queries, Tuples Bundles and Operations Tuple Bundle – Definition The only requirement on a set of tuple bundles t1,t2, … tk is that, for each i the set r i-th realization of R i t j [i ] corresponds precisely to the j A tuple bundle in MCDB may have a special random attribute called “isPresent”. And t[i].isPres=true iff there is a tuple in the bundle which actually appears in the i-th MC instance. If it is not explicitly represented it assumed to be true for all i. This is implemented as a binary array MCDB – Queries, Tuples Bundles and Operations New Operations – Introduction Because we use the tuple-bundle notation we need to add some additional operations in order to handle different situations The classical relational operations must be modified slightly to handle the fact that tuple bundles move through the query plan MCDB – Queries, Tuples Bundles and Operations New Operations – Seed Operator For a given random table R and VG function V, the SEED operator appends each tuple created by R’s FOR EACH statement an integer unique to the (tuple, VG) pair This serves the pseudorandom seed for V when expanding the tuple into uncompressed tuple bundle MCDB – Queries, Tuples Bundles and Operations New Operations – Instantiate Operator (1) The most unique and fundamental operator used by MCDB For a random table R, this operator uses a VG function to generate a set of attribute values corresponding to a MC iteration – which is appended to individual tuple bundles in R. Lets follow an example… MCDB – Queries, Tuples Bundles and Operations New Operations – Instantiate Operator (2) The operator gets the following 7 parameters from the CREATE TABLE statement Qout – The answer set for the FOR EACH VG – The VG function VG Atts – Set of attributes produced by the VG (Normal.VALUE) Out Atts – Set of attributes in the result (p.ID and p.GENDER) Qin1, Qin2,… - The answer sets for the inner queries for the VG (SELECT s.MEAN…s.GENDER FROM SBP_PARAM) In Atts1, In Atts2,… - Set of attributes from the i-th inner query (s.MEAN and s.STD) B1, B2, … - The boolean join (s.GENDER = p.GENDER) MCDB – Queries, Tuples Bundles and Operations New Operations – Instantiate Operator (3) MCDB – Queries, Tuples Bundles and Operations New Operations – Split Operator (1) In case that one of the attributes in the tuple-bundle is not constant (equal on all instances) we need to have a split operator for being able to use other operators The SPLIT takes a tuple-bundle and returns a new set of tuple bundles in which the given attributes are constant MCDB – Queries, Tuples Bundles and Operations New Operations – Split Operator (2) Fname Lname age Lets say we have 4 MC instances in which Fname and Lname are constant, only the age isn’t In i=1,3 age=20 and in i=2,4 age=21. T = (Jane, Smith, (20,21,20,21), (T,T,T,T)) After the SPLIT on where atts={age} T1 = (Jane, Smith, 20, (T,F,T,F)) T2 = (Jane, Smith, 20, (F,T,F,T)) MCDB – Queries, Tuples Bundles and Operations New Operations – Inference Operator This operator returns a set of distinct unbundled tuples, and for each the fraction of the MC iterations for which it appears This is done in the following way Call SPLIT. Call Duplicate Removal (in a few slides) Counts the number of i values for which t[i].isPres=true. MCDB – Queries, Tuples Bundles and Operations Standard Relational Operations – Selection Given a boolean relational selection predicate B and a tuple bundle t: For each i t[i].isPres = B(t[i]) and t[i].isPres Standard Relational Operations – Projection If a non-constant attribute is projected away, the entire array of values for that attribute is removed. MCDB – Queries, Tuples Bundles and Operations Standard Relational Operations – Join Cartesian Product t[i] = r[i] s[i] (Tuple concatenation) t[i].isPres = r[i].isPres and s[i].isPres Join Cartesian Product Applying the modified Selection operator MCDB – Queries, Tuples Bundles and Operations Standard Relational Operations – Duplicate Removal Call SPLIT Sort lexicographically according attribute values For each group (T) one result tuple is output t[i].isPres = V(t’ in T) t’[i].isPres Outline Introduction – Uncertainty in Database Systems Motivation (Problems in Probabilistic Databases) Monte Carlo Method MCDB Introduction Schema Specification VG Functions Inference and Accuracy Queries, Tuple Bundles and Operations Experiments and Conclusions MCDB – Experiments and Conclusions Experiment – Setup 20,000 lines of C++ code 20GB instance of TCP-H database 3,000$ server with 4 160GB ATA HDs and 8 2.0GHz cores partitioned over 2 CPUs. 8 GB RAM Ubuntu Linux OS MCDB – Experiments and Conclusions Experiment – Query 1 This query guesses the revenue gain for products supplied by Japanese companies next year (assumed to be 1996), assuming that current sales trends hold. The ratio of sales volume in 1995 to 1994 is first computed on a percustomer basis. Then the 1996 sales are generated by replicating each 1995 order a random number of times, according to a Poisson distribution with mean . This process approximates a “bootstrapping” re-sampling scheme. Once 1996 is generated, the additional revenue is computed. MCDB – Experiments and Conclusions Experiment – Query 2 This query estimates the number of days until all orders that were placed today are delivered. Using past data, the query computes the mean and variance of both time-to-shipment and time-to-delivery for each part. For each order placed today, instances of these two random delays are generated according to discretized gamma distributions with the computed means and variances. Once all of the times are computed for each component of each order, the maximum duration is selected. MCDB – Experiments and Conclusions Experiment – Query 3 One shortcoming of the TPC-H schema is that, for a given supplier and part, only the current price is maintained in the database. Thus, it is difficult to ask, “What would the total amount paid to suppliers in 1995 have been if we had always gone with the most inexpensive supplier?” Query Q3 starts with the current price for each item from each supplier and then performs a random walk to guess prices from December, 1995 back to January, 1995. MCDB – Experiments and Conclusions Experiment – Query 4 This query is the one mentioned at the beginning of the lecture, which estimates the effect of a 5% customer price increase on an organization’s profits. MCDB – Experiments and Conclusions Experiment – Results MCDB - Experiments and Conclusions Contributions of MCDB First “pure” Monte Carlo approach toward managing uncertain data Powerful and flexible representation of data uncertainty Provide a syntax which requires small changes to SQL VG are similar to the User Defined Functions (UDF) Provide new query processing algorithms that execute a query plan only once Acceptable performance according to experiments MCDB – Experiments and Conclusions Future Work Query optimization (improving the optimizer) Error control (currently user selects the number of iterations) Improved risk assessment (estimating quantiles) Correlated relations (finding the best way to handle) Lineage (explicitly) Non-relational applications (Uncertain XML) Thank You “Gambling: The sure way of getting nothing from something” ~Wilson Mizner Resources The Paper: MCDB: a Monte Carlo approach to managing uncertain data Management of Uncertainty in Database Systems http://cs.gmu.edu/~ami/research/publications/pdf/modern94.pdf Other Resources: http://en.wikipedia.org/wiki/Monte_Carlo_method http://www.tpc.org/tpch/default.asp http://portal.acm.org/citation.cfm?doid=1376616.1376686 http://www.tpc.org/tpch/default.asp http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.7 320