Decision Tree Construction

Download Report

Transcript Decision Tree Construction

ORDB Implementation
Discussion
From RDB to ORDB
Issues to address when
adding OO extensions
to DBMS system
Layout of Data
Deal with large data types : ADTs/blobs
– special-purpose file space for such data, with special access
methods
Large fields in one tuple :
– One single tuple may not even fit on one disk page
– Must break into sub-tuples and link via disk pointers
Flexible layout :
– constructed types may have flexible sized sets, , e.g., one
attribute can be a set of strings.
– Need to provide meta-data inside each type concerning layout of
fields within the tuple
– Insertion/deletion will cause problems when contiguous layout of
‘tuples’ is assumed
Layout of Data
More layout design choices (clustering on disk):
– Lay out complex object nested and clustered on disk (if nested
and not pointer based)
– Where to store objects that are referenced (shared) by possibly
several other and different structures
– Many design options for objects that are in a type hierarchy with
inheritance
– Constructed types such as arrays require novel methods, like
array chunking into (4x4) subarrays for non-continuous access
Objects/OIDs
OID generation : uniqueness across time and
system
Object reference handling :
– must avoid dangling references
– semantics for object manipulation for shared objects
ADTs
– Type representation: size/storage
– Type access : import/export
– Type manipulation: special methods to serve as
filter predicates and join predicates
– Special-purpose index structures : efficiency
ADTs
Mechanism to add index support along with ADT:
– External storage of index file outside DBMS
– Provide “access method interface” a la:
• Open(), close(), search(x), retrieve-next()
• Plus, statistics on external index
– Or, generic ‘template’ index structure
• Generalized Search Tree (GiST) – user-extensible
• Concurrency/recovery provided
Query Processing
Query Parsing :
– Type checking for methods
– Subtyping/Overriding
Query Rewriting:
– May translate path expressions into join operators
– Deal with collection hierarchies (UNION?)
– Indices or extraction out of collection hierarchy
Query Optimization Core
– New algebra operators must be designed :
• such as nest, unnest, array-ops, values/objects, etc.
– Query optimizer must integrate them into optimization
process :
• New Rewrite rules
• New Costing
• New Heuristics
Query Optimization Revisited
– Existing algebra operators revisited : SELECT
– Where clause expressions can be expensive
– So SELECT pushdown may be bad heuristic
Selection Condition Rewriting
EXAMPLE:
(tuple.attribute < 50)
– Only CPU time (on the fly)
(tuple.location OVERLAPS lake-object)
– Possibly complex CPU-heavy computations
– May Involve both IO and CPU costs
State-of-art:
– consider reduction factor only
Now, we must consider both factors:
– Cost factor : dramatic variations
– Reduction factor: unrelated to cost factor
Operator Ordering
op1
op2
Ordering of SELECT Operators
–
–
–
–
–
Cost factor : dramatic variations
Reduction factor: orthogonal to cost factor
We want: maximal reduction and minimal cost
Rank ( operator ) = (reduction) * ( 1/cost )
Order operators by increasing ‘rank’
– High rank (good) -> low in cost, and large reduction
– Low rank (bad) -> high in cost, and small reduction
Access Methods ( on what ?)
Indexes that are ADT specific
Indexes on navigation path
Indexes on methods, not just on columns
Indexes over collection hierarchies (trade-offs)
Indexes for new WHERE clause expressions not just
=, <, > ; but also “overlaps”
Registering New Index (to Optimizer)
What WHERE conditions it supports
Estimated cost for “matching tuple”
– Given by index designer (user?)
– Monitor statistics; even construct test plans
Estimation of reduction factors/join factors:
Register auxiliary function to estimate factor
Provide simple defaults
Estimation of method costs (~IO/CPU)
Methods
Dynamic linking of methods (outside DB)
Overwriting methods for type hierarchy
Use of “methods” with implied semantics
Incorporation of methods into query process :
termination?
“untrusted” methods : methods corrupt server or modify
DB content (side effects)
Handling of “untrusted” methods :
– restrict language; interpret vs compile, separate address
space as DB server
Query Optimization with Methods
Estimation of “costs” of method predicates
Optimization of Method execution:
– Similar idea as handling correlated nested subqueries; must
recognize repetition and rewrite physical plan.
– Provide some level of precomputation and reuse
Optimization of Method execution:
– 1. If called on same input, cache that one result
– 2. If on full column, presort column first (groupby)
– 3. Or, precompute results of methods for each possible value in
domain; and put in hash-table : fct (val );
Look up in hash-table during query processing or even join with it,
instead of recomputing : val  fct (val)
Query Processing
User-defined aggregate functions:
– E.g., “second largest” or “second yellowest”
Distributive aggregates: incremental computation
Provide:
– Initialize(): set up state space
– Iterate(): per tuple update the state
– Terminate(): compute final result based on state; and cleanup state
For example : “second largest”
– Initialize(): 2 fields
– Iterate(): per tuple compare numbers
– Terminate(): remove 2 fields
Following Disk Pointers?
Complex object structures with object pointers may
exist (~ disk pointers)
Navigate complex object into memory for a long-running
transaction like in CAD design
What to do about “pointers” between subobjects or
related objects ?
– Swizzle = replace OIDs dereferences by in-memory pointers,
and unswizzle back at end.
– Issues : In-memory table of OIDs and their state; indicate in
each object pointer via a bit.
– Different policies for swizzling: on access, attached to object
brought in, etc.
Models of Persistence
Different models of persistence for OODB implementations:
Parallel type systems:
– E.g., int and dbint
– User must make decision at object creation time
– Allow for user control by “casting” types
Persistence by container management:
– Objects must be placed into “persistent containers” such as relations in order
to stay around
– Eg., Insert o into Collection MyBooks;
– Could be rather dynamic control without casting
Persistence by reachability :
– Use global variable names to objects and structures
– Objects being referenced by other objects that are reachable by application,
they by transitivity are also persistent.
– need garbage collection
Summary
A lot of work to get there:
From physical database design/layout issues up
to logical query optimizer extensions
ORDB: reuses existing implementation base
and incrementally adds new features on (but
relation is first-class citizen)