pptx - CS Technion
Download
Report
Transcript pptx - CS Technion
LOGO
Towards Declarative Queries on
Simon Zeltser
Based on the article by Nicolas Bruno and Pablo Castro
Contents
1
Introduction
2
LINQ on Rich Data Structures
3
LINQ Query Optimization
4
Conclusions and Discussion
Seminar in Database Systems
Technion
Introduction
THE PROBLEM
There is an increasing number of applications that need
to manage data outside the DBMS
Need for a solution to simplify the interaction between
objects and data sources
Current solutions lack rich declarative query mechanism
THE NEED
Unified way to query various data sources
THE SOLUTION
LINQ (Language Integrated Query)
Seminar in Database Systems
Technion
Introduction
LINQ : Microsoft.NET 3.5 Solution
Accessing multiple data sources via the same API
Technology integrated into the programming language
Supports operations:
Traversal – grouping, joins
Filter – which rows
Projection –which columns
var graduates = from student in students
where student.Degree = “Graduate”
orderby student.Name,
student.Gender, student.Age
select student;
BUT… The default implementation is simplistic
Appropriate for small ad-hoc structures in memory
Seminar in Database Systems
Technion
Introduction
THE GOAL OF THIS SESSION
Introduce LINQ key principles
Show model of customization of LINQ’s Execution
Model on Rich Data Structures
Evaluate the results
Seminar in Database Systems
Technion
LINQ – High Level Architecture
LINQ – Enabled Data Sources
<xml
>
Objects
Seminar in Database Systems
Databases
XML
Technion
Compare two approaches
Iteration
LINQ
List<String> matches = new List<String>();
// Find and sort matches
var matches = from n in data
where n.StartsWith("Eric")
orderby n
select n;
// Find the matches
foreach (string item in data) {
if (item.StartsWith("Eric")) {
matches.Add(item);
}
}
// Sort the matches
matches.Sort();
// Print out the matches
foreach (string item in matches)
}
Console.WriteLine(item);
{
Seminar in Database Systems
// Print out the matches
foreach (var match in matches)
}
Console.WriteLine(match);
{
Technion
Language Integration
Function
int StringLength(String s)
{
return s.Length();
{
var matches = from n in data
where n.StartsWith("Eric")
orderby n
select n;
Lambda Expression
s => s.Length();
var matches = data
.Where(n => n.StartsWith("Eric"))
.OrderBy(n => n)
.Select(n => n)
public static IEnumerable<TSource> Where<TSource>
(this IEnumerable<TSource> source,
Func<TSource, bool> predicate)
var name = "Eric";
var age = 43;
var person = new { Name = "Eric", Age = 43 };
var names = new [] {"Eric", "Ryan", "Paul" };
foreach (var item in names)
Seminar in Database Systems
Technion
LINQ - Example
// Retrieve all CS students with more
// than 105 points
var query =
from stud in students
where ( stud.Faculty == “CS” &&
stud.Points > 105)
orderby stud.Points descending
select new { Details = stud.Name +
“:” + stud.Phone
};
// Iterate over results
foreach(var student in query)
{
Console.WriteLine(student.Details);
}
Seminar in Database Systems
Technion
Customizing LINQ Execution Model
EXPRESSION TREES
*
LINQ represents queries as in-memory abstract syntax tree
Query description and implementation are not tied together
+
1
5
7
THE PROBLEM
The default implementation of the operations uses fixed, general
purpose algorithms
SUGGESTED SOLUTION
Change how the query is executed without changing how it’s
expressed
Analyze alternative implementations of a given query and
dynamically choose the most appropriate version depending on the
context.
Seminar in Database Systems
Technion
Customizing LINQ Execution Model (2)
PROBLEM EXAMPLE
WHERE operator is implemented by performing a sequential scan
over the input and evaluating the selection predicate on each tuple!
int[] A = {1, 2, 3, 10, 20, 30};
var q = from x in A
where x < 5
select 2*x;
foreach(int i in q)
Console.WriteLine(i);
3
Query Implementation:
1
var q = A.Where(x=>x<5)
.Select(x=>2*x);
2
IEnumerable<int> q = Enumerable.Project(
Enumerable.Where(A, AF1),
AF2);
bool AF1(int x) { return x<5;
int AF2(int x) { return 2*x;
}
}
IEnumerable<int> res = new List<int>();
foreach(int a in A)
if (AF1(a)) res.Add(AF2(a));
return res;
Seminar in Database Systems
Technion
Rich Data Structures - DataSet
In-memory cache of data
Typically populated from a database
Supports indexing of DataColumns via DataViews
DataSet object
DataTable object
DataTable object
DataRow
Unique
Constraint
Data
Column
Unique
Constraint
ForeignKeyConstraint
We will use LINQ on DataSet for demonstrating query optimization techniques
Seminar in Database Systems
Technion
LINQ on Rich Data Structures
Enable LINQ to work over DataSets.
EXAMPLE
Given R and S – two DataTables
from r in R.AsEnumerable()
join s in S.AsEnumerable()
on r.Field<int>(“x”) equals
s.Field<int>(“y”)
select new { a = r.Field<int>(“a”),
b = s.Field<int>(“b”) };
Compile and run-time phases on an implementation of our prototype
LINQ on
DataSet
Standard
C# Code
Interm.
Language
Compile Time
Expression
Tree
Optimized
Expression
Tree
Run Time
Self-tuning
State
Seminar in Database Systems
Interm.
Language
DataSet
Technion
Expression Tree Optimizer
Our solution will be built according to the following architecture
Self Tuning Organizer
Cost Model
Seminar in Database Systems
Technion
Query Cost Estimator
Self Tuning Organizer
Cost Model
Seminar in Database Systems
Technion
Query Estimation - Cost Model
Follow traditional database approach:
COST: {execution plans} -> [expected execution time]
Relies on:
a set of statistics maintained in DataTables for some
of its columns
formulas to estimate selectivity of predicates and
cardinality of sub-plans
formulas to estimate the expected costs of query
execution for every operator
Seminar in Database Systems
Technion
Cardinality Estimation
Returns an approximate number of rows that
each operator in a query plan would output
To reduce the overhead, we will use only these
statistical estimators:
maxVal – maximum number of distinct values
minVal – minimum number of distinct values
dVal – number of distinct values in a column
If statistics are unavailable, rely on “magic numbers”
until automatically creation of statistics
Seminar in Database Systems
Technion
Predicate Selectivity Estimation
Let: σp(T ) be an arbitrary expression.
The cardinality of T is defined: Card(σp(T )) =sel(p)·
Under this definition we define:
Predicate
Selectivity Estimation
sel(p1 ^ p2)
sel(p1)· sel(p2)
sel(p1 v p2)
sel(p1) + sel(p2)−sel(p1 ^ p2)
sel(c = c0)
(dVal(c))-1
sel(c0<=c<=c1)
COSTT(Execution Plan) = Σ (COST(p))
EXAMPLE: Consider full table scan of table T):
COST(T) = Card(T) * MEM_ACCESS_COST
For each p in {operators of T}
Average Cost Of Memory Access
Seminar in Database Systems
Technion
Predicate
Selectivity Estimation
Intuition:
c1
maxVal(c)
c
model
sel(c
the probability to get a “c”
o<=c<=c1) as
Let:
σp(T )We
be
an
arbitrary
expression.
value in interval [c0, c1] among all possible “c” values
c0
minVal(c)
The
cardinality of T is defined: Card(σp(T )) =sel(p)·
Under this definition we define:
Predicate
Selectivity Estimation
sel(p1 ^ p2)
sel(p1)· sel(p2)
sel(p1 v p2)
sel(p1) + sel(p2)−sel(p1 ^ p2)
sel(c = c0)
(dVal(c))-1
sel(c0<=c<=c1)
COSTT(Execution Plan) = Σ (COST(p))
EXAMPLE: Consider full table scan of table T:
COST(T) = Card(T) * MEM_ACCESS_COST
For each p in {operators of T}
Average Cost Of Memory Access
Seminar in Database Systems
Technion
Consider
now a Selectivity
join predicate:
T1 c1=c2T2
Predicate
Estimation
Card (T 1) Card (T 2)
Let:
σ
(T
)
be
an
arbitrary
expression.
min(
dVal
(
c
1
),
dVal
(
c
2
))
*
*
Card(T1 p c1=c2 T2)=
dVal (c1) dVal (c 2)
The cardinality of T is defined: Card(σp(T )) =sel(p)·
Under this definition we define:
Predicate
Selectivity Estimation
sel(p1 ^ p2)
sel(p1)· sel(p2)
sel(p1 v p2)
sel(p1) + sel(p2)−sel(p1 ^ p2)
sel(c = c0)
(dVal(c))-1
sel(c0<=c<=c1)
COSTT(Execution Plan) = Σ (COST(p))
EXAMPLE: Consider full table scan of table T):
COST(T) = Card(T) * MEM_ACCESS_COST
For each p in {operators of T}
Average Cost Of Memory Access
Seminar in Database Systems
Technion
Query Analyzer
Self Tuning Organizer
Cost Model
Seminar in Database Systems
Technion
Execution Alternatives
Rely on indexes on DataColumns when possible
Example:
σa=7∧(b+c)<20
Alternative 1:
5
Index on “a”
column
Alternative 2:
Full Table Scan
2
3
7
4
a=7
b+c < 20
a
b
c
a
b
c
a
b
c
7
1
3
7
1
3
7
1
3
2
3
76
2
3
76
2
3
76
5
34
32
5
34
32
5
34
32
7
14
8
7
14
8
7
14
8
4
9
9
94
9
9
4
9
9
7
4
23
7
4
23
7
4
23
3
1
3
83
1
3
3
1
3
Seminar in Database Systems
Technion
Analyzing Execution Plans
Global vs. Local Execution Plan – EXAMPLE:
Global Execution Plan
Join
Products
Carts
Local Execution Plan
HashJoin?
IndexJoin?
MergeJoin?
Join
Filter
Customers
Seminar in Database Systems
Technion
Enumeration Architecture
Two phases:
First phase: Join reordering based on estimated cardinalities
Second phase: Choose the best physical implementation for
each operator
EXAMPLE: Suppose we analyze JOIN operator.
We evaluate the following JOIN implementations:
Hash Join
Merge Join (inputs must be sorted in the join columns)
Index Join (index on the inner join column must be
available)
Other possible calculation options
Choose the alternative with the smallest cost
Seminar in Database Systems
Technion
Query Analysis
Self Tuning Organizer
Cost Model
Seminar in Database Systems
Technion
Self Tuning Organization
We want to reach the smallest query execution
time.
Indexes can be used to speedup query execution
PROBLEM:
It might become problematic to forecast in advance
what indexes to build for optimum performance
SOLUTION:
Continuous monitoring/tuning component that
addresses the challenge of choosing and building
adequate indexes and statistics automatically
Seminar in Database Systems
Technion
Self Tuning Organization - Example
Consider the following execution plan:
The selection predicate
Name=“Pam” over
Customers DataTable can
be improved if an index on
Customers(Name) is built
Both hash joins can be
improved if indexes I2 and
I3 are available, since we
can transform hash join
into index join
* The three sub-plans enclosed in
dotted lines might be improved if
suitable indexes were present
Seminar in Database Systems
Technion
Algorithm for automatic index tuning
Seminar in Database Systems
Technion
Index Tuning
High-Level Description:
Identify a good set of candidate indexes that would
improve performance if they were available.
Later, when the optimized queries are evaluated, we
aggregate the relative benefits of both candidate and
existing indexes.
Based on this information, we periodically trigger
index creations or deletions, taking into account
storage constraints, overall utility of the resulting
indexes, and the cost to creating and maintaining them.
Seminar in Database Systems
Technion
Algorithm for automatic index tuning
Seminar in Database Systems
Technion
Index tuning algorithm
Notation:
H – a set of candidate indexes to materialize
T – task set for query qi
H (initially empty)
Task Set
I1, δI1 I2, δI2
. . … In, δIn
I – either a candidate or an existing index
δIi – amount that I would speed up query q
i
Seminar in Database Systems
Technion
Algorithm for automatic index tuning
Seminar in Database Systems
Technion
Index tuning algorithm
Notation:
ΔI – value maintained for each index I
Materialized index – already created one
H
I1
Task Set
I1, δI1 I2, δI2
. . … In, δIn
SELECT query: ΔI = ΔI + δI
UPDATE query: ΔI = ΔI – δI
Seminar in Database Systems
Technion
Index Tuning Algorithm
The purpose of ΔI:
We maintain ΔI on every
query evaluation
If the potential
aggregated benefit of
materializing a candidate
index exceeds its creation
cost, we should create it,
since we gathered enough
evidence that the index is
useful
Seminar in Database Systems
Technion
Algorithm for automatic index tuning
Seminar in Database Systems
Technion
Index tuning algorithm
Remove “bad” indexes phase
Notation:
Δmin – minimum Δ value for index I
Δmax – maximum Δ value for index I
BI – the cost of creating index I
Residual(I) = BI – (Δmax – Δ)
(the “slack” an index has before being deemed “droppable”)
IF (Residual(I)) <= 0) THEN Drop(I)
Net-Benefit(I) = (Δ-Δmin)-BI
(the benefit from creating the index)
IF (Net-Benefit(I) >= 0) THEN Add(I)
Seminar in Database Systems
Technion
Algorithm for automatic index tuning
Seminar in Database Systems
Technion
Index tuning algorithm
Notation:
ITM – all the indexes from H which creation is cost effective
ITD – subset of existing indexes such that:
ITD fits in existing memory
It’s still cost effective to create new index I after possibly
dropping members from ITD
If creating index I is more effective than maintaining
existing indexes in ITD, DROP(ITD) && CREATE(I)
Remove I from H (set of candidate indexes to materialize)
Seminar in Database Systems
Technion
Experimental Evaluation
Consider the following schema:
Generated:
• 200,000 products
• 50,000 customers
• 1,000 categories
• 5,000 items in the shopping carts
Seminar in Database Systems
checkCarts($1) =
from p in Products.AsEnumerable()
join cart in Carts.AsEnumerable()
on p.Field<int>("id")
equals cart.Field<int>("p_id")
join c in Customers.AsEnumerable()
on cart.Field<int>("cu_id")
equals c.Field<int>("id")
where c.name = $1
select new { cart, p }
browseProducts($1) =
from p in Products.AsEnumerable()
join c in Categories.AsEnumerable()
on p.Field<int>("ca_id") equals
c.Field<int>("id")
where c.par id = $1
select p
Possible Indexes
I1 Categories(par_id)
I2 Products(c_id)
I3 Carts(cu_id)
I4 Products(ca_id)
I5 Customers(name)
Technion
Execution plans for evaluation queries
Seminar in Database Systems
Technion
Experimental Evaluation – Cont.
Generated schedule when tuning was disabled
Seminar in Database Systems
Technion
Experimental Evaluation – Cont.
Generated schedule when tuning was enabled
Seminar in Database Systems
Technion
Summary
We’ve discussed:
LINQ – for declarative query formulation
DataSet - a uniform way of representing in-memory
data.
A lightweight optimizer for automatically adjusting
query execution strategies
Article’s main contribution:
NOT a new query processing technique
BUT: careful engineering of traditional database
concepts in a new context
Seminar in Database Systems
Technion
LOGO
Simon Zeltser
LINQ Execution Model
Query syntax is
converted to function
calls and lambda
expressions
Compiler merges
LINQ extension
methods
Adds query
operations to
IEnumerable<T>
At compile time
Compiler infers
types produced
by queries
Compiler finds a
query pattern
Datasets are
strongly typed
Operations on
data sets are strongly
typed
Specialized or base
Can optimize and
re-write query
Seminar in Database Systems
Lambda expressions
are converted to
expression trees
Expressions are evaluated at
run-time
Parsed and type checked at
compile-time
Query is executed
lazily
Expressions and operations
can execute remotely
At run-time, when results are used
We can force evaluations (ToArray())
Technion