Using SQL to Build User-Defined Aggregates and Extenders for O
Download
Report
Transcript Using SQL to Build User-Defined Aggregates and Extenders for O
Using SQL to Build User-Defined
Aggregates and Extenders
for O-R Systems
Haixun Wang, Carlo Zaniolo
[email protected], [email protected]
Computer Science Dept.
University of California, Los Angeles
State of the Art
DBMSs struggling to keep up with new applications
New language constructs (based on aggregates)
multimedia, time series, spatial/temporal DB
OLAP, data mining
Grouping Sets, Rollup, Cube, OLAP Functions, …
UDFs: the main extension mechanism
Data Blades for time series, spatial, XML, etc.
Very Hard to use
No aggregate functions supported
2
User-Defined Aggregates (UDAs)
Not in SQL99—though in earlier SQL3 drafts, and
supported in Informix (and others)
SQL3 UDAs suffer from serious limitations and easeof-use problems
Claim: we have a great solution for the UDA problem
Our UDAs are better than UDFs for extending DBs
AXL – a system to make it easy to define UDAs
AXL – can be used to do Data Mining in SQL
3
UDAs in SQL3
AGGREGATE FUNCTION myavg(val NUMBER)
RETURN NUMBER
STATE state
INITIALIZE myavg_init
ITERATE myavg_iterate
TERMINATE myavg_return
INITIALIZE: gives an initial value to the aggregate
ITERATE : computes the intermediate value for each new record
TERMINATE: returns the final value computed for the aggregate
myavg_init, myavg_iterate, myavg_return are 3 functions that the user
must write in a procedural programming language
On line aggregation: this would require another function for
early returns
4
Limitation of SQL3 UDAs
Cannot define on-line aggregation or rollups
Aggregation as a function from a set (or multiset) to a
single value
Aggregates can not be used inside recursion (the
nonmonotonicity curse)
Ease of use is a major issue
5
Ease of Use
THE PROBLEM: UDFs are very hard to write and debug. In “unfenced
mode” they jeopardize the integrity of the system. UDAs defined using
several UDFs are prone to the same problem.
A SOLUTION: Use a high-level language for defining UDAs. But who
wants a new DB language?
THE IDEAL SOLUTION: Use SQL to define new aggregates.
Substantial benefits:
Users are already familiar with SQL
No impedance mismatch of data types and programming paradigms
DB advantages: scalability, data independence, optimizability,
parallelizability
But how far can we take SQL?
7
AXL by Example: average
AGGREGATE avg(value INT) : REAL
{
TABLE state(sum INT, cnt INT);
INITIALIZE: {
INSERT INTO state (value, 1);
}
ITERATE: {
UPDATE state SET sum=sum+value, cnt=cnt+1;
}
TERMINATE: {
INSERT INTO RETURN SELECT sum/cnt FROM state;
}
}
8
Second Example
Show the average salary of senior managers who make 3
times more than the average employees.
SQL:
SELECT avg(salary)
FROM employee
WHERE title = ‘senior manager’ AND
salary > 3 * (SELECT avg(salary) FROM employee)
Two scans of the employee table required
With AXL UDAs:
SELECT sscan(title, salary)
FROM employee
9
AXL: Using a Single Scan
AGGREGATE sscan(title CHAR(20), salary INT) : REAL
{
TABLE state(sum INT, cnt INT) AS VALUES (0,0);
TABLE seniors(salary INT);
INITIALIZE: ITERATE: {
UPDATE state SET sum=sum+salary, cnt=cnt+1;
INSERT INTO seniors VALUES(salary) WHERE title = ‘senior manager’;
}
TERMINATE: {
INSERT INTO RETURN
SELECT avg(s.salary) FROM seniors AS s
WHERE s.salary > 3 * (SELECT sum/cnt FROM state);
}
}
10
Early Returns
AVG normally converges early: an early
approximation is all is needed in several
applications
Online aggregation means that early returns are
produced during the computation
Early returns are useful in many other
computations: for instance to find the local max
and min in a sequence of values -- and various
temporal aggregates
14
Return avg for Every 100 Records
AGGREGATE olavg(value INT): REAL
{
TABLE state(sum INT, cnt INT);
INITIALIZE: {
INSERT INTO state VALUES (value,1); }
ITERATE: {
UPDATE state SET sum=sum+value, cnt=cnt+1;
INSERT INTO RETURN
SELECT sum/cnt FROM state
WHERE cnt MOD 100 = 0; }
TERMINATE: {
INSERT INTO RETURN SELECT sum/cnt FROM state; }
}
15
Temporal Coalescing
AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME)
{
TABLE state(cFrom TIME, cTo TIME);
INITIALIZE: {
INSERT INTO state VALUES (from, to);
}
ITERATE: {
UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to;
INSERT INTO RETURN
SELECT cFrom, cTo FROM state WHERE cTo < from;
UPDATE state SET cFrom = from, cTo = to WHERE cTo < from;
}
TERMINATE: {
INSERT INTO RETURN SELECT cFrom, cTo FROM state;
}
}
16
Recursive Aggregates
In AXL, aggregates can call other aggregates.
Particularly, an aggregate can call itself
recursively.
AGGREGATE alldesc(P CHAR(20)): CHAR(20)
{
INITIALIZE: ITERATE: {
INSERT INTO RETURN
VALUES(P);
INSERT INTO RETURN
SELECT alldesc(Child)
FROM children
WHERE Parent = P;
}
}
Find all the descendents of Tom:
SELECT alldesc(Child)
FROM children
WHERE Parent = ‘Tom’;
21
AXL: Where SQL and Data Mining Intersects
Loosely-coupled:
Tightly-coupled:
Cache Mining: current data mining applications
have a loose connection with databases
UDFs based data mining functions
Ideal:
AXL powered by recursive aggregates
22
Decision Tree Classifiers
Outlook
Sunny
Sunny
Overcast
Rain
Rain
Rain
Overcast
Sunny
Sunny
Rain
Sunny
Overcast
Overcast
Rain
Temp
Hot
Hot
Hot
Mild
Cool
Cool
Cool
Mild
Cool
Mild
Mild
Mild
Hot
Mild
Humidity
High
High
High
High
Normal
Normal
Normal
High
Normal
Normal
Normal
High
Normal
High
Training set: tennis
Wind
Weak
Strong
Weak
Weak
Weak
Strong
Strong
Weak
Weak
Weak
Strong
Strong
Weak
Strong
PlayTennis
No
Yes
Yes
Yes
Yes
Yes
No
No
Yes
Yes
Yes
Yes
Yes
no
RecId
1
1
1
1
2
2
2
2
ColId Value
1
Sunny
2
Hot
3
High
4
Weak
1
Sunny
2
Hot
3
High
4
Strong
PlayTennis
No
No
No
No
Yes
Yes
Yes
Yes
14
14
14
14
1
2
3
4
Rain
Mild
High
Strong
No
No
No
No
Stream of Column/Value Pairs
(together with RecId and Category)
23
Convert training set to column/value pairs
AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT)
: (col INT, val INT, YorN INT)
{ INITIALIZE: ITERATE: {
INSERT INTO RETURN
VALUES(1, v1, yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn);}
}
CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT)
SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis)
FROM tennis;
SELECT sprint(recId, col, val, YorN)
FROM col-val-pairs;
24
SPRINT Algorithm in AXL
[ 1] AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT)
[ 2] {
TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value));
[ 3]
TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT);
[ 4]
TABLE split(Rec INT, LeftOrRight INT, KEY (RecId));
[ 5]
TABLE mincol(Col INT, Val REAL, Gini REAL);
[ 6]
TABLE node(Node INT) AS VALUES(iNode);
[ 7]
INITIALIZE: ITERATE: {
[ 8]
INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN);
[ 9]
UPDATE summary
[10]
SET Yc=Yc+iYorN, Nc=Nc+1-iYorN, (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal)
[11]
WHERE Col=iCol;
[12]
}
[13]
TERMINATE: {
[14]
INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary;
[15]
INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n;
[16]
INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m
[17]
WHERE t.Col = m.Col AND m.Gini > 0;
[18]
SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN)
[19]
FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec
[20]
GROUP BY s.LeftOrRight;
[21]
}
[22] }
Performance
SPRINT Algorithm: AXL vs. C
Categorical Classifier: AXL vs. C
26
Implementation of AXL
AXL V1.2: approaching 40,000 lines of code
AXL compiler translates AXL programs into C++ code
Open interface of physical data model.
DB2 add-on: emulate UDAs using UDFs
Standalone: running under both Win98/NT and UNIX
Currently using Berkeley DB as our storage manager
In memory tables
Limited Optimization
Using B+-Tree indexes to support equality/range query
Predicate push-down / push-up
27