Using SQL to Build New Aggregates and Extenders for Object
Download
Report
Transcript Using SQL to Build New Aggregates and Extenders for Object
User-Defined Aggregates for Advanced
Database Applications
Haixun Wang
[email protected]
Computer Science Dept.
University of California, Los Angeles
State of the Art: the big picture
Relational Databases:
Single data model of spartan simplicity
Logic-based database languages
Commercial success and dominance through SQL
standards (in the 80’s)
A new wave of DB applications (in the 90’s)
Knowledge discovery, Data Mining, OLAP, Time Series
analysis, Spatial/Temporal DBs, XML, …
Underscores limitations of RDBMS
Prompted major extensions leading to SQL3 standards
(SQL99 is a subset of SQL3)
2
State of the Art: R&D highlights
Deductive DBs
OO-DBs
Rule based syntax
Logic based formalization of query language semantics –
e.g., nonmonotonic reasoning and stratification
Recursive queries
Complex data types / inheritance
Expressive power by merging PL and query languages
OR-DBs
Rich data types / Path Expression (SQL)
UDFs and Database Extenders (Data Blades)
3
State of the Art: the seamy side
A patchwork of major extensions
Still not powerful enough
DBMSs have become more powerful but much hard and
complex to build and use
Data Mining: clustering, classification, association
New language constructs not helping either
Limited expressive power in other applications
Bill of Materials (BoM) type of applications
Temporal reasoning
4
This thesis:
Many of the problems can be solved by UDAs
User Defined Aggregates (UDAs):
insufficient support in commercial world and DB
standards
Our claim: UDAs provide a more general
and powerful mechanism for DB extensions
AXL – a system to make it easier to define
UDAs
AXL – where SQL and Data Mining intersect
5
A Brief History of AXL (and of my thesis)
Logic formalization of aggregates
SQL-AG:
Simple Aggregate Definition Language
[ICDE’00]
using SQL to define new aggregates
easy to use, but with limited performance and power
AXL:
Implementing and extending SQL3 UDAs
[DBPL’99]
Implemented on DB2 using extended user-defined aggregates with
‘early returns’
SADL:
[DDLP’98, LBAI’00]
Early returns, monotonic aggregates: used freely in recursive queries.
Extensions of LDL++ (Logic Database Language)
Aggregate eXtension Language
[VLDB’00]
powerful, efficient and still SQL-based
6
Defining UDAs in SQL3
AGGREGATE FUNCTION myavg(val NUMBER)
RETURN NUMBER
STATE state
INITIALIZE myavg_init
ITERATE myavg_iterate
TERMINATE myavg_return
INITIALIZE: gives an initial value to the aggregate
ITERATE : computes the intermediate aggregate value for each new
record
TERMINATE: returns the final value computed for the aggregate
myavg_init, myavg_iterate, myavg_return are 3 functions that the user
must write in a procedural programming language
7
Limitation of SQL3 UDAs
UDAs in SQL3, Postgres, Informix, and early versions of
LDL++ share the same limitations:
Aggregates can not be used inside recursion
No support for early returns and on-line aggregation
Also Ease of Use is a major issue (except for LDL++)
8
Ease of Use
THE PROBLEM: UDFs are very hard to write and debug. In “unfenced
mode” they jeopardize the integrity of the system. UDAs defined using
several UDFs are prone to the same problem.
A SOLUTION: Use a high-level language for defining UDAs. But there
are many potential problems with any new language.
THE IDEAL SOLUTION: use SQL to define new aggregates.
Substantial benefits:
Users are already familiar with SQL
No impedance mismatch of data types and programming paradigms
DB advantages: scalability, data independence, optimizability,
parallelizability
9
Simple Aggregates
AGGREGATE avg(value INT) : REAL
{
TABLE state(sum INT, cnt INT);
INITIALIZE: {
INSERT INTO state (value, 1);
}
ITERATE: {
UPDATE state SET sum=sum+value, cnt=cnt+1;
}
TERMINATE: {
INSERT INTO RETURN SELECT sum/cnt FROM state;
}
}
10
Avoiding Multiple Scans
Show the average salary of senior managers who make
3 times more than the average employees.
SQL:
SELECT avg(salary)
FROM employee
WHERE title = ‘senior manager’
AND salary > 3 * (SELECT avg(salary) FROM employee)
Two scans of the employee table required
With AXL UDAs:
SELECT sscan(title, salary)
FROM employee
11
AXL: Using a Single Scan
AGGREGATE sscan(title CHAR(20), salary INT) : REAL
{
TABLE state(sum INT, cnt INT) AS VALUES (0,0);
TABLE seniors(salary INT);
INITIALIZE: ITERATE: {
UPDATE state SET sum=sum+salary, cnt=cnt+1;
INSERT INTO seniors VALUES(salary) WHERE title = ‘senior manager’;
}
TERMINATE: {
INSERT INTO RETURN
SELECT avg(s.salary) FROM seniors AS s
WHERE s.salary > 3 * (SELECT sum/cnt FROM state);
}
}
12
Ordered Sequences and Time Series
We have a sequence of events, each of which is active
during a certain interval (from, end). Find out at which
point of time we have the largest number of active
events.
SQL:
Group-by on the start time of each interval, and count!
With AXL UDAs:
SELECT density(from, end)
FROM events
13
AXL: Using a Single Scan
AGGREGATE density(start TIME, end TIME) : (time TIME, count INT)
{ TABLE state(time TIME, count INT) AS (0, 0);
TABLE active(endpoint TIME);
INITIALIZE: ITERATE: {
DELETE FROM active WHERE endpoint < start;
INSERT INTO active VALUES(end);
UPDATE state SET time=start, count =count + 1
WHERE count < (SELECT count(*) FROM active);
}
TERMINATE: {
INSERT INTO RETURN
SELECT time, count FROM state;
}
}
14
Early Returns
AVG normally converges early: an early
approximation is all is needed in several
applications
Online aggregation means that early returns are
produced during the computation
Many applications: e.g., find the local max and
min in a sequence of values, various temporal
aggregates, rollups, etc.
These might depend on the order – same as new
OLAP extensions in SQL3
16
Return avg for Every 100 Records
AGGREGATE olavg(value INT): REAL
{
TABLE state(sum INT, cnt INT);
INITIALIZE: {
INSERT INTO state VALUES (value,1);
}
ITERATE: {
UPDATE state SET sum=sum+value, cnt=cnt+1;
INSERT INTO RETURN
SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0;
}
TERMINATE: {
INSERT INTO RETURN SELECT sum/cnt FROM state;
}
}
17
Temporal Coalescing
AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME)
{
TABLE state(cFrom TIME, cTo TIME);
INITIALIZE: {
INSERT INTO state VALUES (from, to);
}
ITERATE: {
UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to;
INSERT INTO RETURN
SELECT cFrom, cTo FROM state WHERE cTo < from;
UPDATE state SET cFrom = from, cTo = to WHERE cTo < from;
}
TERMINATE: {
INSERT INTO RETURN SELECT cFrom, cTo FROM state;
}
}
18
Recursive Aggregates
In AXL, aggregates can call other aggregates.
Particularly, an aggregate can call itself
recursively.
AGGREGATE alldesc(P CHAR(20)): CHAR(20)
{
INITIALIZE: ITERATE: {
INSERT INTO RETURN
VALUES(P);
INSERT INTO RETURN
SELECT alldesc(Child)
FROM children
WHERE Parent = P;
}
}
Find all the descendents of Tom:
SELECT alldesc(Child)
FROM children
WHERE Parent = ‘Tom’;
19
Check Point
Simple applications: AXL UDAs provide a solution
with better performance and good ease of use.
Many advanced database applications
Time Series, Temporal Database, Spatial
Database…
In particular data mining applications
20
Data Mining and Database Systems
Current Approach:
Cache-Mine:
Cursor-based: loose-coupling, stored procedures
UDFs: ease of use problems
Using DBMSs as containers of data
Many attempts to closely integrate data
mining functions into DBMS have shown
major problems
21
What we need …
SQL-aware Data Mining Systems
Surajit Chaudhuri “Data Mining and Database Systems: Where is the
Intersection?” IEEE Data Engineering Bulletin, 1998
Decision Tree Classifiers
Outlook
Sunny
Sunny
Overcast
Rain
Rain
Rain
Overcast
Sunny
Sunny
Rain
Sunny
Overcast
Overcast
Rain
Temp
Hot
Hot
Hot
Mild
Cool
Cool
Cool
Mild
Cool
Mild
Mild
Mild
Hot
Mild
Humidity
High
High
High
High
Normal
Normal
Normal
High
Normal
Normal
Normal
High
Normal
High
Training set: tennis
Wind
Weak
Strong
Weak
Weak
Weak
Strong
Strong
Weak
Weak
Weak
Strong
Strong
Weak
Strong
PlayTennis
No
Yes
Yes
Yes
Yes
Yes
No
No
Yes
Yes
Yes
Yes
Yes
no
RecId
1
1
1
1
2
2
2
2
ColId Value
1
Sunny
2
Hot
3
High
4
Weak
1
Sunny
2
Hot
3
High
4
Strong
PlayTennis
No
No
No
No
Yes
Yes
Yes
Yes
14
14
14
14
1
2
3
4
Rain
Mild
High
Strong
No
No
No
No
Stream of Column/Value Pairs
(together with RecId and Category)
25
Convert training set to column/value pairs
AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT)
: (col INT, val INT, YorN INT)
{
INITIALIZE: ITERATE: {
INSERT INTO RETURN
VALUES(1, v1, yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn);
}
}
CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT)
SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis)
FROM tennis;
SELECT classify(recId, col, val, YorN)
FROM col-val-pairs;
26
Categorical Classifier in AXL
[ 1] AGGREGATE classify(RecId INT, iNode INT, iCol INT, iValue REAL, iYorN INT)
[ 2] {
TABLE treenodes(RecId INT,Node INT, Col INT, Val REAL, YorN INT);
[ 3]
TABLE summary(Col INT, Value INT, Yc INT, Nc INT, KEY {Col, Value});
[ 4]
TABLE mincol(Col INT, MinGini REAL);
[ 5]
TABLE ginitable(Col INT, Gini REAL);
[ 6]
INITIALIZE: ITERATE: {
[ 7]
INSERT INTO treenodes VALUES (RecId, iNode, iCol, iValue, iYorN);
[ 8]
UPDATE summary
[ 9]
SET Yc=Yc+iYorN, Nc=Nc+1-iYorN WHERE Col = iCol AND Value = iValue;
[10]
INSERT INTO summary SELECT iCol, iValue, iYorN, 1-iYorN WHERE SQLCDE=0;
[11]
}
[12]
TERMINATE: {
[13]
INSERT INTO ginitable SELECT Col, sum((Yc*Nc)/(Yc+Nc))/sum(Yc+Nc)
[14]
FROM summary GROUP BY Col;
[15]
INSERT INTO mincol SELECT minpointvalue(Col, Gini) FROM ginitable;
[16]
INSERT INTO result SELECT iNode, Col FROM mincol;
[17]
SELECT classify(t.RecId, t.Node*MAXVALUE+m.Value+1, t.Col, t.Value, t.YorN)
[18]
FROM treenodes AS t,
[19]
(SELECT tt.RecId RecId, tt.Value Value FROM treenodes tt, mincol m
[20]
WHERE tt.Col=m.Col AND m.MinGini>0 ) AS m
[21]
WHERE t.RecId = m.RecId GROUP BY m.Value;
[22]
}
[23] }
Performance
SPRINT Algorithm: AXL vs. C
Categorical Classifier: AXL vs. C
28
SPRINT Algorithm in AXL
[ 1] AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT)
[ 2] {
TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value));
[ 3]
TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT);
[ 4]
TABLE split(Rec INT, LeftOrRight INT, KEY (RecId));
[ 5]
TABLE mincol(Col INT, Val REAL, Gini REAL);
[ 6]
TABLE node(Node INT) AS VALUES(iNode);
[ 7]
INITIALIZE: ITERATE: {
[ 8]
INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN);
[ 9]
UPDATE summary
[10]
SET Yc=Yc+iYorN, Nc=Nc+1-iYorN, (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal)
[11]
WHERE Col=iCol;
[12]
}
[13]
TERMINATE: {
[14]
INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary;
[15]
INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n;
[16]
INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m
[17]
WHERE t.Col = m.Col AND m.Gini > 0;
[18]
SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN)
[19]
FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec
[20]
GROUP BY s.LeftOrRight;
[21]
}
[22] }
Comparison with Other Architectures
Integrating Association Rule Mining with
Relational Database Systems
S. Sarawagi et al SIGMOD 98
Time in Sec
4000
3000
Pass
Pass
Pass
Pass
2000
1000
4
3
2
1
0
Cache
S-Proc
UDF
SQL
Datasets with 6.6 million records, support level .25%.
30
Implementation of AXL
Standalone Mode
Berkeley DB
DB2 Add-on Mode
.axl
.axl
.cc
.cc
.exe
UDFs
DB2 .lib
SQL
DB2
31
Implementation of AXL
Open interface of physical data model.
Limited Optimization
Currently using Berkeley DB as our storage manager
In memory tables
Using B+-Tree indexes to support equality/range query
Predicate push-down / push-up
User Defined Aggregates
Hash based
Return multiple rows: ‘early return’
Return multiple columns: employee’s name and salary
32
Implementation of AXL (cont’d)
Non-blocking aggregation
Keeping the state of aggregation between calls to the
aggregate routines
Local tables defined inside aggregation are passed as
parameters to the aggregates
Explicit sorting (and implicit hash-based
aggregation)
AXL V1.2: above 30,000 lines of code
33
Check Point
Simple applications: AXL UDAs provide a
solution with better performance and good
ease of use.
Data Mining applications
Formal Semantics of Aggregates and
Monotonic Aggregation
34
Aggregates in Recursion
Stratification:
shaves(barber, X) :- villager(X), shaves(X, X).
villager(barber).
Aggregates:
p count(p) =0
Aggregates in many applications are actually
monotonic (and should be allowed inside
recursion).
35
Beyond Stratification
Significant previous efforts…
I. S. Mumick, H. Pirahesh and R. Ramakrishnan, “The
magic of duplicates and aggregates”, VLDB 1990
A. Van Gelder, “Foundations of aggregates in deductive
databases”, DOOD 1993
K. A. Ross and Y. Sagiv, “Monotonic aggregates in
deductive databases”, JCSS 1997
S. Greco and C. Zaniolo, “Greedy algorithms in Datalog
with choice and negation”, JICSLP 1998
36
Formal Semantics of Aggregates
choice((X),(Y))
Ordering a domain
Enforcing functional dependency. FD: X->Y
Multiplicity of stable models, monotonic transformation
Once (X,Y) is generated, choice ensures this is the only arc
leaving source node X and entering sink node Y
Formal semantics of UDA
…
return(Y,V) :- ordered(X,Y), ordered(Y,_), terminate(Y,V).
…
37
Early Returns Monotonic Aggregates
Aggregates with only ‘early returns’ and no ‘final returns’ are
monotonic w.r.t. set containment:
AGGREGATE mcount(): INT
{
TABLE state(cnt INT) AS VALUES (0);
INITIALIZE: ITERATE: {
UPDATE state SET cnt=cnt+1;
INSERT INTO RETURN SELECT cnt FROM state;
}
}
38
Early Returns Monotonic Aggregates
SELECT mcount(*) FROM employee;
v.s.
SELECT count(*) FROM employee;
Name
John
Mary
Tom
Name
John
Mary
Tom
Jerry
{1,2,3}
count{John, Mary, Tom} 3
mcount{John, Mary, Tom}
{1,2,3,4}
count{John, Mary, Tom, Jerry} 4
mcount{John, Mary, Tom, Jerry}
39
Return sum at the nth value
AGGREGATE sumat(value INT, n INT): INT
{
TABLE state (sum INT, cnt INT) AS VALUES (0,0);
INITIALIZE: ITERATE: {
UPDATE state SET sum=sum+value, cnt=cnt +1;
INSERT INTO RETURN
SELECT sum FROM state WHERE cnt = n;
}
}
40
Monotonic Aggregation
Monotonic aggregates can be used without any
restriction and without changing the underlying
implementation.
This solves the problem that had eluded database
researchers since the introduction of relational
systems:
BoM, Company Control, Join-the-Party…
Greedy optimization algorithms, such as Dijkstra’s
single source shortest path.
42
Join-the-Party Problem
Some people will come to the
party no matter what, and their
names are stored in a
sure(PName) relation. But
many other people will join only
after they know that at least K=3
of their friends will be there.
WITH wllcm(Name) AS
((SELECT Pname
FROM sure)
UNION ALL
(SELECT f.Pname
FROM friend AS f, wllcm AS w
WHERE w.Name =f.Fname
GROUP BY f.Pname
HAVING mcount()=3))
SELECT Name FROM wllcm;
Density-based Clustering [M. Ester et al. KDD 96]
43
BoM: Cost of Parts
assembly
001
1
5
1
3
002
004
003
100
2
5
10
005
8
3
Part
001
001
001
001
002
003
003
004
004
004
Qty
5
1
1
100
10
2
5
3
8
3
part-cost
Basic Part
005
006
006
Subpart
002
003
004
005
005
006
005
006
005
003
Cost
10
15
fan-out
Part
001
002
003
004
ChC
4
1
2
3
44
BoM: Using AXL
WITH cst(part, cost) AS
((SELECT part, cost
FROM part-cost)
UNION ALL
(SELECT a.part,
sumat(a.qty * c.cost, p.ChC)
FROM assembly AS a, cst AS c, fan-out AS p
WHERE a.subpart = c.part
AND p.part = a.part
GROUP BY a.part))
SELECT part, cost
FROM cst;
Bottom up solution.
Computes the cost for each
part once and only once.
Monotonic sumat(cost, n)
returns sum when exactly n
items are aggregated.
Works in DB2 after AXL
rewrites callings of sumat()
to callings of automatically
generated UDFs.
45
BoM: Using Recursive SQL
WITH mpath(subpart, qty) AS
((SELECT subpart, qty
FROM assembly
WHERE part = ‘001’)
UNION ALL
(SELECT c.subpart, m.qty * c.qty
FROM mpath m, assembly c
WHERE m.subpart = c.part))
SELECT sum(m.qty * c.cost)
FROM mpath m, part_cost c
WHERE m.subpart = c.part ;
Top down solution:
computes the cost of part
‘001’
Explosion: all edges that
descend from part ‘001’ are
“stored” in mpath
What if we want to
compute the cost of each
part?
46
Check Point
Simple applications
Data Mining and Decision Support
Formal Semantics & Monotonic aggregates
OLAP and other aggregate extensions
SUCH THAT
CUBE, ROLLUP, GROUPING SET
OLAP Functions
47
“Such That”
For each division, show the
average salary of senior
managers who make 3 times
more than the average
employees, and the average
salary of senior engineers who
make 2 times more than the
average employees (in the same
output record).
D. Chatziantoniou, Kenneth Ross,
VLDB 1996
SELECT division,
avg(X.salary),
avg(Y.salary)
FROM employee
GROUP BY division: X, Y
SUCH THAT
X.title = ‘senior manager’ AND
X.salary > 3*avg(salary) AND
Y.title = ‘senior engineer’ AND
Y.salary > 2*avg(salary)
48
Expressing “Such That” in AXL
TABLE seniors(salary INT);
AGGREGATE sscan2(title CHAR(20), salary INT, qtitle CHAR(20), ratio INT): REAL
{
TABLE state(sum INT, cnt INT) AS VALUES (0,0);
INITIALIZE: ITERATE: {
UPDATE state SET sum=sum+salary, cnt=cnt+1;
INSERT INTO seniors VALUES (salary) WHERE title = qtitle;
}
TERMINATE: {
SELECT avg(s.salary)
FROM seniors AS s
WHERE s.salary > ratio * (SELECT sum/cnt FROM state);
}
}
49
Using UDA sscan2
SELECT
division,
sscan2(title, salary, ‘senior manager’, 3),
sscan2(title, salary, ‘senior engineer’, 2)
FROM employee
GROUP BY division;
No joins or sub-queries required.
One pass through the employee relation (standard
SQL requires at least 2 passes).
50
Other Aggregate Extensions
GROUPING SETS, ROLL-UP, CUBE
New OLAP extensions
Windows containing a partitioning, an ordering of rows
and an aggregate group
“… every standard must be prepared to tackle new issues
that arise as the market evolves. If SQL does not respond
positively to this challenge, SQL risks becoming irrelevant
…” -- F. Zemke, K. Kulkarni, A. Witkowski, B. Lyle
Introduction to OLAP Functions
51
Contributions
Adding extended UDAs to O-R systems
Tightly couple data mining functions with DBMS
high level language, minimal additions to SQL
monotonic aggregates, recursive aggregates
designed for general purpose applications
SPRINT Algorithm, Categorical Classifier, …
Performance
More efficient than cursor-based languages like PL/SQL,
JDBC and UDF-based approaches …
56
Future Directions
Parallelization
Extenders/Data Blades
Decision Support
build on top of UDAs instead of UDFs
the Apriori algorithm
Windows and OLAP functions
Spatial/Temporal extensions
the TENOR system
57
Future Direction: Parallelization
Current parallel aggregation algorithms valid for AXL:
Inter-Partition parallelism: all tuples of the same group-by
value are in one node
Two phase algorithm: user provides a COMBINE routine
Unlike SQL3, AXL’s aggregate routines are written in SQL,
so we can apply traditional query parallelization
techniques to INITIALIZE, ITERATE, and TERMINATE
Since aggregate routines are written in SQL, the COMBINE
routine can be generated automatically by the system for
simple UDAs
58