PPTX - SFU computer science
Download
Report
Transcript PPTX - SFU computer science
Topic for Thursday?
1
Miscellaneous Topics
in Databases
PARALLEL DBMS
WHY PARALLEL ACCESS TO DATA?
At 10 MB/s
1.2 days to scan
1 Terabyte
10 MB/s
1,000 x parallel
1.5 minute to scan.
1 Terabyte
Parallelism:
divide a big problem
into many smaller ones
to be solved in parallel.
4
PARALLEL DBMS: INTRO
Parallelism is natural to DBMS processing
Pipeline parallelism: many machines each doing one
step in a multi-step process.
Partition parallelism: many machines doing the same
thing to different pieces of data.
Both are natural in DBMS!
Pipeline
Partition
Any
Sequential
Program
Sequential
Any
Sequential
Sequential
Program
Any
Sequential
Program
Any
Sequential
Program
outputs split N ways, inputs merge
5
Speed-Up
More resources means
proportionally less time
for given amount of
data.
Xact/sec.
(throughput)
SOME || TERMINOLOGY
Ideal
Realistic
degree of ||-ism
Scale-Up
If resources increased
in proportion to
increase in data size,
time is constant.
Why Realistic <> Ideal?
sec./Xact
(response time)
Realistic
Ideal
degree of ||-ism
6
INTRODUCTION
Parallel machines are becoming quite common
and affordable
Prices of microprocessors, memory and disks have
dropped sharply
Recent desktop computers feature multiple processors
and this trend is projected to accelerate
Databases are growing increasingly large
large volumes of transaction data are collected and
stored for later analysis.
multimedia objects like images are increasingly stored
in databases
Large-scale parallel database systems
increasingly used for:
storing large volumes of data
processing time-consuming decision-support queries
providing high throughput for transaction processing
7
Google data centers around the world, as of 2008
8
PARALLELISM IN DATABASES
Data can be partitioned across multiple disks for
parallel I/O.
Individual relational operations (e.g., sort, join,
aggregation) can be executed in parallel
data can be partitioned and each processor can work
independently on its own partition
Results merged when done
Different queries can be run in parallel with each
other.
Concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to
parallelism.
9
PARTITIONING
Horizontal partitioning (shard)
involves putting different rows into different tables
Ex: customers with ZIP codes less than 50000 are
stored in CustomersEast, while customers with ZIP
codes greater than or equal to 50000 are stored in
CustomersWest
Vertical partitioning
involves creating tables with fewer columns and using
additional tables to store the remaining columns
partitions columns even when already normalized
called "row splitting" (the row is split by its columns)
Ex: split (slow to find) dynamic data from (fast to find)
static data in a table where the dynamic data is not
used as often as the static
10
COMPARISON OF PARTITIONING TECHNIQUES
Evaluate how well partitioning techniques
support the following types of data access:
1.Scanning the entire relation.
2.Locating a tuple associatively – point queries.
E.g., r.A = 25.
3.Locating all tuples such that the value of a
given attribute lies within a specified range –
range queries.
E.g., 10 r.A < 25.
11
HANDLING SKEW USING HISTOGRAMS
Balanced partitioning vector can be constructed from
histogram in a relatively straightforward fashion
Assume uniform distribution within each range of the histogram
Histogram can be constructed by scanning relation, or
sampling (blocks containing) tuples of the relation
12
INTERQUERY PARALLELISM
Queries/transactions execute in parallel with one
another
concurrent processing
Increases transaction throughput; used primarily
to scale up a transaction processing system to
support a larger number of transactions per
second.
Easiest form of parallelism to support
13
INTRAQUERY PARALLELISM
Execution of a single query in parallel on multiple
processors/disks; important for speeding up longrunning queries
Two complementary forms of intraquery
parallelism :
Intraoperation Parallelism – parallelize the
execution of each individual operation in the query
(each CPU runs on a subset of tuples)
Interoperation Parallelism – execute the different
operations in a query expression in parallel.
(each CPU runs a subset of operations on the data)
14
PARALLEL JOIN
The join operation requires pairs of tuples to be
tested to see if they satisfy the join condition, and
if they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs
to be tested over several processors. Each
processor then computes part of the join locally.
In a final step, the results from each processor can
be collected together to produce the final result.
15
QUERY OPTIMIZATION
Query optimization in parallel databases is more complex
than in sequential databases
Cost models are more complicated, since we must take into
account partitioning costs and issues such as skew and
resource contention
When scheduling execution tree in parallel system, must
decide:
How to parallelize each operation
how many processors to use for it
What operations to pipeline
what operations to execute independently in parallel
what operations to execute sequentially
Determining the amount of resources to allocate for each
operation is a problem
E.g., allocating more processors than optimal can result
in high communication overhead
16
DEDUCTIVE DATABASES
OVERVIEW OF DEDUCTIVE DATABASES
Declarative Language
Language to specify rules
Inference Engine (Deduction Machine)
Can deduce new facts by interpreting the rules
Related to logic programming
Prolog language (Prolog => Programming in logic)
Uses backward chaining to evaluate
Top-down application of the rules
Consists of:
Facts
Similar to relation specification without the necessity of
including attribute names
Rules
Similar to relational views (virtual relations that are not stored)
18
PROLOG/DATALOG NOTATION
Facts are provided as predicates
Predicate has
a name
a fixed number of arguments
Convention:
Constants are numeric or character strings
Variables start with upper case letters
E.g., SUPERVISE(Supervisor, Supervisee)
States that Supervisor SUPERVISE(s) Supervisee
19
PROLOG/DATALOG NOTATION
Rule
Is of the form head :- body
where :- is read as if and only iff
E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)
20
PROLOG/DATALOG NOTATION
Query
Involves a predicate symbol followed by some variable
arguments to answer the question
where :- is read as if and only iff
E.g., SUPERIOR(james,Y)?
E.g., SUBORDINATE(james,X)?
21
Prolog notation
Supervisory tree
22
PROVING A NEW FACT
23
24
DATA MINING
DEFINITION
Data mining is the exploration and analysis of large
quantities of data in order to discover valid, novel,
potentially useful, and ultimately understandable
patterns in data.
Example pattern (Census Bureau Data):
If (relationship = husband), then (gender = male). 99.6%
26
DEFINITION (CONT.)
Data mining is the exploration and analysis of large quantities
of data in order to discover valid, novel, potentially useful,
and ultimately understandable patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern
beforehand.
Useful: We can devise actions from the
patterns.
Understandable: We can interpret and
comprehend the patterns.
27
WHY USE DATA MINING TODAY?
Human analysis skills are inadequate:
Volume and dimensionality of the data
High data growth rate
Availability of:
Data
Storage
Computational power
Off-the-shelf software
Expertise
28
THE KNOWLEDGE DISCOVERY PROCESS
Steps:
Identify business problem
Data mining
Action
Evaluation and measurement
Deployment and integration into businesses
processes
29
PREPROCESSING AND MINING
Knowledge
Patterns
Target
Data
Preprocessed
Data
Interpretation
Model
Construction
Original Data
Preprocessing
Data
Integration
and Selection
30
DATA MINING TECHNIQUES
Supervised learning
Unsupervised learning
Classification and regression
Clustering
Dependency modeling
Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection
31
EXAMPLE APPLICATION: SKY SURVEY
Input
data: 3 TB of image data with 2
billion sky objects, took more than six
years to complete
Goal: Generate a catalog with all objects
and their type
Method: Use decision trees as data mining
model
Results:
94% accuracy in predicting sky object classes
Increased number of faint objects classified by
300%
Helped team of astronomers to discover 16 new
high red-shift quasars in one order of
magnitude less observation time
32
CLASSIFICATION EXAMPLE
Example
training
database
Two predictor attributes:
Age and Car-type (Sport,
Minivan and Truck)
Age is ordered, Car-type is
categorical attribute
Class label indicates
whether person bought
product
Dependent attribute is
categorical
Age Car
20 M
30 M
25
T
30
S
40
S
20
T
30 M
25 M
40 M
20
S
Class
Yes
Yes
No
Yes
Yes
No
Yes
Yes
Yes
No
33
GOALS AND REQUIREMENTS
Goals:
To produce an accurate classifier/regression function
To understand the structure of the problem
Requirements on the model:
High accuracy
Understandable by humans, interpretable
Fast construction for very large training databases
34
WHAT ARE DECISION TREES?
Age
<30
>=30
YES
Car Type
Minivan
YES
Sports, Truck
Minivan
YES
Sports,
Truck
NO
YES
NO
0
30
60 Age
35
DENSITY-BASED CLUSTERING
A cluster is defined as a connected dense component.
Density is defined in terms of number of neighbors of
a point.
We can find clusters of arbitrary shape
36
MARKET BASKET ANALYSIS
Consider shopping cart filled with several items
Market basket analysis tries to answer the
following questions:
Who makes purchases?
What do customers buy together?
In what order do customers purchase items?
37
MARKET BASKET ANALYSIS (CONTD.)
Coocurrences
Association rules
80% of all customers purchase items X, Y and Z together.
60% of all customers who purchase X and Y also buy Z.
Sequential patterns
60% of customers who first buy X also purchase Y within
three weeks.
38
SPATIAL DATA
40
WHAT IS A SPATIAL DATABASE?
Database that:
Stores spatial objects
Manipulates spatial objects just like other objects in
the database
41
WHAT IS SPATIAL DATA?
Data which describes either location or shape
e.g.House or Fire Hydrant location
Roads, Rivers, Pipelines, Power lines
Forests, Parks, Municipalities, Lakes
In the abstract, reductionist view of the computer,
these entities are represented as Points, Lines,
and Polygons.
42
Roads are represented as Lines
Mail Boxes are represented as Points
43
TOPIC THREE
Land Use Classifications are
represented as Polygons
44
TOPIC THREE
Combination of all the previous data
45
SPATIAL RELATIONSHIPS
Not just interested in location, also interested in
“Relationships” between objects that are very
hard to model outside the spatial domain.
The most common relationships are
Proximity : distance
Adjacency : “touching” and “connectivity”
Containment : inside/overlapping
46
SPATIAL RELATIONSHIPS
Distance between a toxic waste dump and a piece of
property you were considering buying.
47
SPATIAL RELATIONSHIPS
Distance to various pubs
48
SPATIAL RELATIONSHIPS
Adjacency: All the lots which share an edge
49
Connectivity: Tributary relationships in river
networks
50
MOST ORGANIZATIONS HAVE SPATIAL
DATA
Geocodable addresses
Customer location
Store locations
Transportation
tracking
Statistical/Demograph
ic
Cartography
Epidemiology
Crime patterns
Weather Information
Land holdings
Natural resources
City Planning
Environmental
planning
Information
Visualization
Hazard detection
51
ADVANTAGES OF SPATIAL DATABASES
Able to treat your spatial data like anything else
in the DB
transactions
backups
integrity checks
less data redundancy
fundamental organization and operations handled by
the DB
multi-user support
security/access control
locking
52
ADVANTAGES OF SPATIAL DATABASES
Offset complicated tasks to the DB server
organization and indexing done for you
do not have to re-implement operators
do not have to re-implement functions
Significantly lowers the development time of
client applications
53
ADVANTAGES OF SPATIAL DATABASES
Spatial querying using SQL
use simple SQL expressions to determine spatial
relationships
distance
adjacency
containment
use simple SQL expressions to perform spatial
operations
area
length
intersection
union
buffer
54
Original Polygons
Union
Intersection
55
Buffered rivers
Original river network
56
ADVANTAGES OF SPATIAL DATABASES
… WHERE distance(<me>,pub_loc) < 1000
SELECT distance(<me>,pub_loc)*$0.01 + beer_cost
…
... WHERE touches(pub_loc, street)
… WHERE inside(pub_loc,city_area) and city_name
= ...
57
ADVANTAGES OF SPATIAL DATABASES
Simple value of the proposed lot
Area(<my lot>) * <price per acre>
+ area(intersect(<my log>,<forested area>) ) * <wood value per
acre>
- distance(<my lot>, <power lines>) * <cost of power line laying>
58
New Electoral Districts
• Changes in areas between 1996 and
2001 election.
• Want to predict voting in 2001 by
looking at voting in 1996.
• Intersect the 2001 district polygon with
the voting areas polygons.
• Outside will have zero area
• Inside will have 100% area
• On the border will have partial area
• Multiply the % area by 1996 actual
voting and sum
• Result is a simple prediction of 2001
voting
More advanced: also use demographic
data.
59
DISADVANTAGES OF SPATIAL DATABASES
Cost to implement can be high
Some inflexibility
Incompatibilities with some GIS software
Slower than local, specialized data structures
User/managerial inexperience and caution
60
PICTOGRAMS - SHAPES
Types: Basic Shapes, Multi-Shapes, Derived
Shapes, Alternate Shapes, Any possible Shape,
User-Defined Shapes
Basic Shapes
Alternate Shapes
Multi-Shapes
Any Possible Shape
N
Derived Shapes
0, N
*
User Defined Shape
!
61
SPATIAL DATA ENTITY CREATION
Form an entity to hold county names, states,
populations, and geographies
CREATE TABLE County(
Name
varchar(30),
State
varchar(30),
Pop
Integer,
Shape
Polygon);
Form an entity to hold river names, sources,
lengths, and geographies
CREATE TABLE River(
Name
varchar(30),
Source varchar(30),
Distance Integer,
Shape
LineString);
62
EXTENDING THE ER DIAGRAM
Standard ER
Diagram
Spatial ER
Diagram
63
64