pptx - CS Technion

Download Report

Transcript pptx - CS Technion

LOGO
Towards Declarative Queries on
Simon Zeltser
Based on the article by Nicolas Bruno and Pablo Castro
Contents
1
Introduction
2
LINQ on Rich Data Structures
3
LINQ Query Optimization
4
Conclusions and Discussion
Seminar in Database Systems
Technion
Introduction
 THE PROBLEM



There is an increasing number of applications that need
to manage data outside the DBMS
Need for a solution to simplify the interaction between
objects and data sources
Current solutions lack rich declarative query mechanism
THE NEED

Unified way to query various data sources
THE SOLUTION

LINQ (Language Integrated Query)
Seminar in Database Systems
Technion
Introduction
LINQ : Microsoft.NET 3.5 Solution



Accessing multiple data sources via the same API
Technology integrated into the programming language
Supports operations:
 Traversal – grouping, joins
 Filter – which rows
 Projection –which columns
var graduates = from student in students
where student.Degree = “Graduate”
orderby student.Name,
student.Gender, student.Age
select student;
BUT… The default implementation is simplistic

Appropriate for small ad-hoc structures in memory
Seminar in Database Systems
Technion
Introduction
THE GOAL OF THIS SESSION



Introduce LINQ key principles
Show model of customization of LINQ’s Execution
Model on Rich Data Structures
Evaluate the results
Seminar in Database Systems
Technion
LINQ – High Level Architecture
LINQ – Enabled Data Sources
<xml
>
Objects
Seminar in Database Systems
Databases
XML
Technion
Compare two approaches
Iteration
LINQ
List<String> matches = new List<String>();
// Find and sort matches
var matches = from n in data
where n.StartsWith("Eric")
orderby n
select n;
// Find the matches
foreach (string item in data) {
if (item.StartsWith("Eric")) {
matches.Add(item);
}
}
// Sort the matches
matches.Sort();
// Print out the matches
foreach (string item in matches)
}
Console.WriteLine(item);
{
Seminar in Database Systems
// Print out the matches
foreach (var match in matches)
}
Console.WriteLine(match);
{
Technion
Language Integration
Function
int StringLength(String s)
{
return s.Length();
{
var matches = from n in data
where n.StartsWith("Eric")
orderby n
select n;
Lambda Expression
s => s.Length();
var matches = data
.Where(n => n.StartsWith("Eric"))
.OrderBy(n => n)
.Select(n => n)
public static IEnumerable<TSource> Where<TSource>
(this IEnumerable<TSource> source,
Func<TSource, bool> predicate)
var name = "Eric";
var age = 43;
var person = new { Name = "Eric", Age = 43 };
var names = new [] {"Eric", "Ryan", "Paul" };
foreach (var item in names)
Seminar in Database Systems
Technion
LINQ - Example
// Retrieve all CS students with more
// than 105 points
var query =
from stud in students
where ( stud.Faculty == “CS” &&
stud.Points > 105)
orderby stud.Points descending
select new { Details = stud.Name +
“:” + stud.Phone
};
// Iterate over results
foreach(var student in query)
{
Console.WriteLine(student.Details);
}
Seminar in Database Systems
Technion
Customizing LINQ Execution Model
EXPRESSION TREES


*
LINQ represents queries as in-memory abstract syntax tree
Query description and implementation are not tied together
+
1
5
7
THE PROBLEM

The default implementation of the operations uses fixed, general
purpose algorithms
SUGGESTED SOLUTION


Change how the query is executed without changing how it’s
expressed
Analyze alternative implementations of a given query and
dynamically choose the most appropriate version depending on the
context.
Seminar in Database Systems
Technion
Customizing LINQ Execution Model (2)
PROBLEM EXAMPLE
WHERE operator is implemented by performing a sequential scan
over the input and evaluating the selection predicate on each tuple!
int[] A = {1, 2, 3, 10, 20, 30};
var q = from x in A
where x < 5
select 2*x;
foreach(int i in q)
Console.WriteLine(i);
3
Query Implementation:
1
var q = A.Where(x=>x<5)
.Select(x=>2*x);
2
IEnumerable<int> q = Enumerable.Project(
Enumerable.Where(A, AF1),
AF2);
bool AF1(int x) { return x<5;
int AF2(int x) { return 2*x;
}
}
IEnumerable<int> res = new List<int>();
foreach(int a in A)
if (AF1(a)) res.Add(AF2(a));
return res;
Seminar in Database Systems
Technion
Rich Data Structures - DataSet
 In-memory cache of data
 Typically populated from a database
 Supports indexing of DataColumns via DataViews
DataSet object
DataTable object
DataTable object
DataRow
Unique
Constraint
Data
Column
Unique
Constraint
ForeignKeyConstraint
We will use LINQ on DataSet for demonstrating query optimization techniques
Seminar in Database Systems
Technion
LINQ on Rich Data Structures
Enable LINQ to work over DataSets.
EXAMPLE

Given R and S – two DataTables
from r in R.AsEnumerable()
join s in S.AsEnumerable()
on r.Field<int>(“x”) equals
s.Field<int>(“y”)
select new { a = r.Field<int>(“a”),
b = s.Field<int>(“b”) };
Compile and run-time phases on an implementation of our prototype
LINQ on
DataSet
Standard
C# Code
Interm.
Language
Compile Time
Expression
Tree
Optimized
Expression
Tree
Run Time
Self-tuning
State
Seminar in Database Systems
Interm.
Language
DataSet
Technion
Expression Tree Optimizer
Our solution will be built according to the following architecture
Self Tuning Organizer
Cost Model
Seminar in Database Systems
Technion
Query Cost Estimator
Self Tuning Organizer
Cost Model
Seminar in Database Systems
Technion
Query Estimation - Cost Model
Follow traditional database approach:


COST: {execution plans} -> [expected execution time]
Relies on:
 a set of statistics maintained in DataTables for some
of its columns
 formulas to estimate selectivity of predicates and
cardinality of sub-plans
 formulas to estimate the expected costs of query
execution for every operator
Seminar in Database Systems
Technion
Cardinality Estimation
Returns an approximate number of rows that
each operator in a query plan would output


To reduce the overhead, we will use only these
statistical estimators:
 maxVal – maximum number of distinct values
 minVal – minimum number of distinct values
 dVal – number of distinct values in a column
If statistics are unavailable, rely on “magic numbers”
until automatically creation of statistics
Seminar in Database Systems
Technion
Predicate Selectivity Estimation
Let: σp(T ) be an arbitrary expression.
The cardinality of T is defined: Card(σp(T )) =sel(p)·

Under this definition we define:
Predicate
Selectivity Estimation
sel(p1 ^ p2)
sel(p1)· sel(p2)
sel(p1 v p2)
sel(p1) + sel(p2)−sel(p1 ^ p2)
sel(c = c0)
(dVal(c))-1
sel(c0<=c<=c1)
COSTT(Execution Plan) = Σ (COST(p))
EXAMPLE: Consider full table scan of table T):
COST(T) = Card(T) * MEM_ACCESS_COST
For each p in {operators of T}
Average Cost Of Memory Access
Seminar in Database Systems
Technion
Predicate
Selectivity Estimation
Intuition:
c1
maxVal(c)
c
model
sel(c
the probability to get a “c”
o<=c<=c1) as
Let:
σp(T )We
be
an
arbitrary
expression.
value in interval [c0, c1] among all possible “c” values
c0
minVal(c)
The
cardinality of T is defined: Card(σp(T )) =sel(p)·

Under this definition we define:
Predicate
Selectivity Estimation
sel(p1 ^ p2)
sel(p1)· sel(p2)
sel(p1 v p2)
sel(p1) + sel(p2)−sel(p1 ^ p2)
sel(c = c0)
(dVal(c))-1
sel(c0<=c<=c1)
COSTT(Execution Plan) = Σ (COST(p))
EXAMPLE: Consider full table scan of table T:
COST(T) = Card(T) * MEM_ACCESS_COST
For each p in {operators of T}
Average Cost Of Memory Access
Seminar in Database Systems
Technion
Consider
now a Selectivity
join predicate:
T1 c1=c2T2
Predicate
Estimation
Card (T 1) Card (T 2)
Let:
σ
(T
)
be
an
arbitrary
expression.
min(
dVal
(
c
1
),
dVal
(
c
2
))
*
*
Card(T1 p c1=c2 T2)=
dVal (c1) dVal (c 2)
The cardinality of T is defined: Card(σp(T )) =sel(p)·

Under this definition we define:
Predicate
Selectivity Estimation
sel(p1 ^ p2)
sel(p1)· sel(p2)
sel(p1 v p2)
sel(p1) + sel(p2)−sel(p1 ^ p2)
sel(c = c0)
(dVal(c))-1
sel(c0<=c<=c1)
COSTT(Execution Plan) = Σ (COST(p))
EXAMPLE: Consider full table scan of table T):
COST(T) = Card(T) * MEM_ACCESS_COST
For each p in {operators of T}
Average Cost Of Memory Access
Seminar in Database Systems
Technion
Query Analyzer
Self Tuning Organizer
Cost Model
Seminar in Database Systems
Technion
Execution Alternatives
Rely on indexes on DataColumns when possible

Example:
σa=7∧(b+c)<20
Alternative 1:
5
Index on “a”
column
Alternative 2:
Full Table Scan
2
3
7
4
a=7
b+c < 20
a
b
c
a
b
c
a
b
c
7
1
3
7
1
3
7
1
3
2
3
76
2
3
76
2
3
76
5
34
32
5
34
32
5
34
32
7
14
8
7
14
8
7
14
8
4
9
9
94
9
9
4
9
9
7
4
23
7
4
23
7
4
23
3
1
3
83
1
3
3
1
3
Seminar in Database Systems
Technion
Analyzing Execution Plans
 Global vs. Local Execution Plan – EXAMPLE:
Global Execution Plan
Join
Products
Carts
Local Execution Plan
 HashJoin?
 IndexJoin?
 MergeJoin?
Join
Filter
Customers
Seminar in Database Systems
Technion
Enumeration Architecture
 Two phases:


First phase: Join reordering based on estimated cardinalities
Second phase: Choose the best physical implementation for
each operator
 EXAMPLE: Suppose we analyze JOIN operator.

We evaluate the following JOIN implementations:
 Hash Join
 Merge Join (inputs must be sorted in the join columns)
 Index Join (index on the inner join column must be
available)
 Other possible calculation options
 Choose the alternative with the smallest cost
Seminar in Database Systems
Technion
Query Analysis
Self Tuning Organizer
Cost Model
Seminar in Database Systems
Technion
Self Tuning Organization
We want to reach the smallest query execution
time.

Indexes can be used to speedup query execution
PROBLEM:

It might become problematic to forecast in advance
what indexes to build for optimum performance
SOLUTION:

Continuous monitoring/tuning component that
addresses the challenge of choosing and building
adequate indexes and statistics automatically
Seminar in Database Systems
Technion
Self Tuning Organization - Example
Consider the following execution plan:
 The selection predicate
Name=“Pam” over
Customers DataTable can
be improved if an index on
Customers(Name) is built
 Both hash joins can be
improved if indexes I2 and
I3 are available, since we
can transform hash join
into index join
* The three sub-plans enclosed in
dotted lines might be improved if
suitable indexes were present
Seminar in Database Systems
Technion
Algorithm for automatic index tuning
Seminar in Database Systems
Technion
Index Tuning
High-Level Description:

Identify a good set of candidate indexes that would
improve performance if they were available.

Later, when the optimized queries are evaluated, we
aggregate the relative benefits of both candidate and
existing indexes.

Based on this information, we periodically trigger
index creations or deletions, taking into account
storage constraints, overall utility of the resulting
indexes, and the cost to creating and maintaining them.
Seminar in Database Systems
Technion
Algorithm for automatic index tuning
Seminar in Database Systems
Technion
Index tuning algorithm
Notation:


H – a set of candidate indexes to materialize
T – task set for query qi
H (initially empty)
Task Set
I1, δI1 I2, δI2
. . … In, δIn
I – either a candidate or an existing index
δIi – amount that I would speed up query q
 i

Seminar in Database Systems
Technion
Algorithm for automatic index tuning
Seminar in Database Systems
Technion
Index tuning algorithm
Notation:


ΔI – value maintained for each index I
Materialized index – already created one
H
I1
Task Set
I1, δI1 I2, δI2


. . … In, δIn
SELECT query: ΔI = ΔI + δI
UPDATE query: ΔI = ΔI – δI
Seminar in Database Systems
Technion
Index Tuning Algorithm
The purpose of ΔI:
 We maintain ΔI on every
query evaluation
 If the potential
aggregated benefit of
materializing a candidate
index exceeds its creation
cost, we should create it,
since we gathered enough
evidence that the index is
useful
Seminar in Database Systems
Technion
Algorithm for automatic index tuning
Seminar in Database Systems
Technion
Index tuning algorithm
Remove “bad” indexes phase
Notation:





Δmin – minimum Δ value for index I
Δmax – maximum Δ value for index I
BI – the cost of creating index I
Residual(I) = BI – (Δmax – Δ)
(the “slack” an index has before being deemed “droppable”)
IF (Residual(I)) <= 0) THEN Drop(I)
Net-Benefit(I) = (Δ-Δmin)-BI
(the benefit from creating the index)
IF (Net-Benefit(I) >= 0) THEN Add(I)
Seminar in Database Systems
Technion
Algorithm for automatic index tuning
Seminar in Database Systems
Technion
Index tuning algorithm
Notation:




ITM – all the indexes from H which creation is cost effective
ITD – subset of existing indexes such that:
 ITD fits in existing memory
 It’s still cost effective to create new index I after possibly
dropping members from ITD
If creating index I is more effective than maintaining
existing indexes in ITD, DROP(ITD) && CREATE(I)
Remove I from H (set of candidate indexes to materialize)
Seminar in Database Systems
Technion
Experimental Evaluation
Consider the following schema:
Generated:
• 200,000 products
• 50,000 customers
• 1,000 categories
• 5,000 items in the shopping carts
Seminar in Database Systems
checkCarts($1) =
from p in Products.AsEnumerable()
join cart in Carts.AsEnumerable()
on p.Field<int>("id")
equals cart.Field<int>("p_id")
join c in Customers.AsEnumerable()
on cart.Field<int>("cu_id")
equals c.Field<int>("id")
where c.name = $1
select new { cart, p }
browseProducts($1) =
from p in Products.AsEnumerable()
join c in Categories.AsEnumerable()
on p.Field<int>("ca_id") equals
c.Field<int>("id")
where c.par id = $1
select p
Possible Indexes
I1 Categories(par_id)
I2 Products(c_id)
I3 Carts(cu_id)
I4 Products(ca_id)
I5 Customers(name)
Technion
Execution plans for evaluation queries
Seminar in Database Systems
Technion
Experimental Evaluation – Cont.
Generated schedule when tuning was disabled
Seminar in Database Systems
Technion
Experimental Evaluation – Cont.
Generated schedule when tuning was enabled
Seminar in Database Systems
Technion
Summary
We’ve discussed:



LINQ – for declarative query formulation
DataSet - a uniform way of representing in-memory
data.
A lightweight optimizer for automatically adjusting
query execution strategies
Article’s main contribution:


NOT a new query processing technique
BUT: careful engineering of traditional database
concepts in a new context
Seminar in Database Systems
Technion
LOGO
Simon Zeltser
LINQ Execution Model
Query syntax is
converted to function
calls and lambda
expressions
Compiler merges
LINQ extension
methods
 Adds query
operations to
IEnumerable<T>
 At compile time
Compiler infers
types produced
by queries
Compiler finds a
query pattern
 Datasets are
strongly typed
 Operations on
data sets are strongly
typed
 Specialized or base
 Can optimize and
re-write query
Seminar in Database Systems
Lambda expressions
are converted to
expression trees
 Expressions are evaluated at
run-time
 Parsed and type checked at
compile-time
Query is executed
lazily
 Expressions and operations
can execute remotely
 At run-time, when results are used
 We can force evaluations (ToArray())
Technion