Transcript Document

Pig Latin & RDF Data Analysis
The MD-join: An Operator for Complex OLAP, ICDE 2001
MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD 2008
RAPID: Enabling Scalable Ad-Hoc Analytics on the Semantic Web, ISWC 2009
Towards Scalable RDF Graph Analytics on MapReduce, MDAC 2010
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2010-08-20
presented by Jaeseok Myung
MapReduce
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
MapReduce
 MapReduce: Simplified Data Processing on Large Cluster, 2004

Map: (k1, v1) -> list (k2, v2)

Reduce: (k2, list (v2)) -> (k3, v3)
JTeam, http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 3/38
Programming with MapReduce
map (String key, String value):
// key : document name
// value : document contents
for each word w in value:
EmitIntermediate (w, “1”);
reduce (String key, Iterator values):
// key : a word
// values : a list of counts
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(key, AsString(result));
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 4/38
Why Is MapReduce Useful?
 Scalability

Capable of handling large amount of data
 Extensibility

Commodity machines

Shared-nothing architecture
 Fault-tolerance

Writing intermediate results into the physical storage
 Flexibility

Enabling users to write custom codes
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 5/38
Pig Latin
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
Motivation
 SQL wrench programmers away from their preferred method of
analyzing data, namely writing imperative scripts or code, toward
writing declarative queries in SQL, which they often find
unnatural, and overly restrictive
 MapReduce paradigm is too low-level and rigid, and leads to a
great deal of custom user code that is hard to maintain, and
reuse
 We have developed a new language called Pig Latin that we
have designed to fit in a sweet spot between the declarative
style of SQL, and low-level, procedural style of MapReduce
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 7/38
Pig Latin - Example
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 8/38
Pig Latin
 Data Model

Atom, Tuple, Bag, Map
 Language

Expressions

Operators -> MapReduce
–
LOAD, STORE, SPLIT
–
FOREACH, FILTER
–
COGROUP, GROUP, JOIN
–
Aggregate Functions
–
UDF (User Defined Function)
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 9/38
Data Model

Atom


Tuple


A tuple is a sequence of fields, each of which can be any of the data types, e.g.,
(‘alice’, ‘lakers’)
Bag


An atom contains a simple atomic value such as a string or a number, e.g., ‘alice’
A bag is a collection of tuples with possible duplicates
Map

A map is a collection of data items, where each item has an associated key through
which it can be looked up
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 10/38
Expressions in Pig Latin
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 11/38
RDF Data Analysis
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
RDF Data Model
Graph representation
Statements (triples)
Sub
Prop
Obj
R1
type
Ranking
R1
pageRank
11
R1
pageURL
Url1
R1
avgDuration
97
UV1
type
UserVisits
UV1
srcIP
158.112.27.3
UV1
destURL
url1
UV1
adRevenue
339.08142
UV1
visitDate
1979/12/12
UV1
userAgent
SCOPE
UV1
cCode
VNM
UV1
iCode
VNM-KH
UV1
sKeyword
comets
UV1
avgTime
3
Center for E-Business Technology
Rankings
Groups = Stars
UserVisits
Copyright  2010 by CEBT
IDS Lab. Seminar – 14/38
Basic Steps for Analytical Query Processing
Compute the average pageRank and total adRevenue for all pages visited by a
particular srcIP with visitDate between 1979/12/01 and 1979/12/30

Rankings
map1
reduce1
UserVisits
FILTER
FILTER
JOIN
JOIN
JOIN
map2
reduce2
map3
reduce3
GROUP BY
map4
FOREACH reduce4
Step 321 : Aggregation
Pattern Matching
Grouping
Copyright  2010 by CEBT
IDS Lab. Seminar – 15/38
Positioning
 RAPID+

RDF Graph Pattern Matching
 RAPID

Grouping & Aggregation
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 16/38
RAPID+
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
Pattern Matching in Pig: Approach 1
Rankings
type
R1
RankingsStarPattern =
Ranking
JOIN triples1 ON Sub,
pageRank
11
pageURL
triples2 ON Sub,
url1
Triple store
triples1
triples3 ON Sub;
triples3
triples2
Sub
Prop
Obj
Sub
Prop
Obj
R1
R1
R1
UV1
UV1
type
pageRank
pageURL
type
srcIP
Ranking
11
Url1
UserVisits
158.112.27.
3
R1
R1
R1
UV1
UV1
type
pageRank
pageURL
type
srcIP
Ranking
11
Url1
UserVisits
158.112.27.
3
Sub
Prop
R1
R1
R1
UV1
UV1
type
pageRank
pageURL
type
srcIP
Obj
Ranking
11
Url1
UserVisits
158.112.27.
3
Rankings star pattern = 3-way self-join
UserVisits star pattern = 5-way self-join
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 18/38
Pattern Matching in Pig: Approach 2
LOAD all the RDF triples
SPLIT
typeRanking
destURL
Sub Prop
Obj
UV1 destURL url1
UV2 destURL url1
Sub Prop Obj
R1 type Ranking
R2 type Ranking
pageRank
pageURL
Sub Prop
Obj
R1 pageURL url1
R2 pageURL url2
typeUV
visitDate
srcIP
Sub Prop
Obj
R1 pageRank 11
R2 pageRank 27
Sub Prop
Obj
UV1 visitDate 1979/12/12
UV2 visitDate 1980/02/02
Sub Prop Obj
UV1 type userVisits
UV2 type userVisits
Sub Prop Obj
UV1 scrIP 158.112.27.3
UV2 scrIP 159.222.21.9
adRev
Sub Prop Obj
UV1 adRev 339.08142
UV2 adRev 330.51248
Filte
r
visitDate
Sub Prop
Obj
UV1 visitDate 1979/12/12
UV4 visitDate 1979/12/02
UserVisits = JOIN
(compute Star Pattern)
Ranking = JOIN
(compute Star Pattern)
JOIN between Ranking, UserVisits
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 19/38
RAPID+
 Propose UDFs for RDF pattern matching
 Goal: Minimize I/O costs
 Strategy:

Concurrent computation of star patterns using grouping-based
algorithm

Can improve efficiency using Operator-coalescing and Look-ahead
processing
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 20/38
Concurrent Star Pattern Matching
Sub
R1
R1
Ranking
R1
R1
UV1
UV1
UV1
UV1
UV1
UserVisits
UV1
UV1
UV1
UV1
UV1
Prop
type
pageRank
pageURL
avgDuration
type
srcIP
destURL
adRevenue
visitDate
userAgent
cCode
iCode
sKeyword
avgTime
Obj
Ranking
11
Url1
97
UserVisits
158.112.0
url1
339.08142
1979/12/12
SCOPE
VNM
VNM-KH
comets
3
Compute the average pageRank and total
adRevenue for all pageURLs visited by a
particular srcIP with visitDate between
1979/12/01 and 1979/12/30
Single
intermediate
result
Filter
irrelevant
properties
Copyright  2010 by CEBT
Sub
Prop
Obj
R1
type
Ranking
R1
pageRank
11
R1
pageURL
Url1
UV1
type
UserVisits
UV1
srcIP
158.112.0
UV1
destURL
url1
UV1
adRevenue
339.08142
UV1
visitDate
1979/12/12
IDS Lab. Seminar – 21/38
Operator Coalescing
Filter irrelevant triples by coalescing LOAD and FILTER
operators
Our Approach
Using Pig Latin
LOAD
map1
FILTER
Operator
Coalescing
map1
LOAD
loadFilter
input = LOAD ‘\data’ using
loadFilter ( pageRank,
pageURL, type:Ranking,
destURL, adRevenue, srcIP,
visitDate, type:UserVisits )
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 22/38
Grouping-based Pattern Matching
starSubgraphs = GROUP input BY $0;
Sub
Prop
Obj
R1
type
Ranking
R1
pageRank
11
R1
pageURL
Url1
UV1
type
UserVisits
UV1
srcIP
UV1
destURL
158.112.27.
3
url1
UV1
adRevenue
339.08142
UV1
visitDate
1979/12/12
Center for E-Business Technology
GROUP
BY
Subject
Copyright  2010 by CEBT
IDS Lab. Seminar – 23/38
Filtering the Groups
 Structure-based filtering
eliminate sub graphs
with missing properties
visitDate between 1979/12/01
Missing
srcIP
and
1979/12/30
 Value-based filtering
validate each sub graph
against filter condition
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 24/38
Look-ahead Processing
Star Pattern Matching  Joining the Stars
Look-Ahead - Annotate bag based on join key
Join between the star
sub graphs
Eliminate properties irrelevant for future processing (join and filter prop)
 Minimize size of intermediate results
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 25/38
Comparison: Pig vs RAPID+
Pig Approach
RAPID+
Multiple map-reduce cycles
- N star sub graphs  N cycles
Single cycle
- N star sub graphs  1 cycle
Potential for increased I/O
(i)Disk spills (SPLIT operator)
(ii)Materialization of several
intermediate results due to sequential
computation of star patterns
Minimized I/O
(i)Filtering in triple storage model +
load-filter coalescing
(ii)Concurrent computation of star
patterns (single intermediate result)
Would require advanced optimization
techniques
- Introduce project operator to
eliminate unneeded columns
Smaller intermediate result sizes
- Eliminate tuples and columns not
necessary in future steps of processing
Not applicable
Minimize repeated tuple handling by
look-ahead processing
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 26/38
Experiment
Cost Analysis for Task A (PM)
5-node cluster
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 27/38
RAPID
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
RAPID
 Propose a language, RAPID, which extends Yahoo’s Pig Latin
language with query primitives for dealing with the graph
structured nature of RDF
 Propose an approach for achieving the scalable processing of the
non-trivial analytical tasks on RDF datasets that is based on an
efficient multidimensional query operator called the MD-Join
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 29/38
Architecture of RAPID
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 30/38
Data Expression
 Class Expression

type:Class
 Property Expression
 Path Expression
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 31/38
MD-Join
 The MD-Join: An Operator for Complex OLAP, ICDE 2001


Complex OLAP query is often difficult to express and difficult to
optimize using standard relational operators
–
ex. compute the total sales broken down by all possible combinations of
attributes prod, month, state
–
Tight coupling of the GROUPBY and aggregation clause is one reason of
the problem
MD-Join decouples the grouping and the aggregation clauses in
query expressions, in order to achieve great flexibility and simplicity
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 32/38
MD-Join

The MD-Join operator MD (B, R, l, ) defines a relation with

B: base table (contains all combinations of key values representing each
group on which aggregations have to be computed)

R: fact table (contains all the data that needs to be aggregated)

l: a list of aggregation functions that need to be computed over attributes

 : is a set of conditions involving the attributes of B and R
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 33/38
MD-Join
 MD (B, R, sum(price), Prod=R.Prod & Month = R.Month)
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 34/38
Extending Pig Latin for Analytical Processing RDF
 Generating Fact Table (GFD)

fact_dataset = LOAD ‘input.rdf’ USING GFD (…);
 Generating Base Table (GBD)

base_dataset = LOAD ‘input.rdf’ USING GBD(…);
 Results of GFD & GBD are stored in the same MDJ.rdf file
 MD-Join

output_dataset = LOAD ‘MDJ.rdf’ USING MDJ(…);
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 35/38
Experiment
 DBLP
 BSBM
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 36/38
Experimental Result
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 37/38
Summary
 데이터 분석 도메인에서 MR 기반의 접근이 활용되고 있음
 MR이 too low-level 이라 Pig 같은 도구들이 많이 사용됨
 대용량 RDF 분석에도 MR 및 Pig의 활용이 시도되고 있음

RDF 모델 특성 때문에 기존 도구들에 약간의 수정이 필요
 Pig 기반의 RDF 처리는 RAPID, RAPID+ 가 유이한 솔루션

RDF에 대해 Pig 보다는 좋은 성능을 보이고 있음
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 38/38