Transcript pptx

Presented by: Dardan Xhymshiti
Spring 2016
Authors:
Sean Chester*
Darius Šidlauskas†
Ira Assent*
Kenneth S. Bøgh*
*Data-Intensive Systems Group, Aarhus University, Denmark
†Data-Intensive Applications and Systems Laboratory, EPFL, Switzerland
Publication:

ICDE 2015
Type:

Research Paper
2

The skyline is expensive to compute, especially on large datasets.

Recent multi-core skyline algorithms do not effectively reduce the number of
dominance tests.

State-of-the-art sequential skyline algorithms outperform existing multi-core algorithms.

Most multi-core skyline algorithms use the divide-and-conquer approach,
which has two drawbacks:

If the number of local skyline points is large, the merging step is expensive.

As the cardinality of the dataset increases, the computation becomes more expensive.
3

To develop a new multi-core skyline algorithm that eliminates as many dominance
tests as possible.
4

Provides an overview of the skyline operator.

Introduces the new Hybrid skyline algorithm.

Provides experiments showing that the algorithm outperforms both multi-core
and sequential algorithms.
5

How to increase the performance of skyline algorithms:

Implementation on GPUs.

Implementation on multi-core CPUs.

Implementation in distributed environments such as MapReduce.

The authors have developed an algorithm called Hybrid.

The authors chose multi-core CPUs for the implementation of the algorithm
because:

shared data structures are cheaper, and

parallel work need not be isolated.
6

All the skyline points are maintained in a shared global data structure.

This data structure gets updated regularly and is read by all threads.

The skyline points in the data structure are ordered to detect dominance relationships
quickly.

Points are processed in blocks, which guarantees that each point is compared to at
most 𝛼 more points than in a sequential algorithm (see the sketch below).
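A minimal single-threaded sketch of this idea (not the authors' Hybrid implementation; the shared structure is stood in for by a plain list, and the α-bounded block logic is simplified): presorting by a monotone score such as the Manhattan norm guarantees that no point can dominate one that precedes it, so each candidate only needs dominance tests against the already-confirmed skyline.

```python
def dominates(p, q):
    """p dominates q (minimisation): p is <= q in every dimension and
    strictly smaller in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline_blockwise(points, block_size=64):
    """SFS-style sketch: presort by the Manhattan norm, then consume the
    input block by block, testing each candidate only against the points
    already confirmed as skyline points."""
    ordered = sorted(points, key=sum)        # monotone presort
    skyline = []                             # stand-in for the shared structure
    for start in range(0, len(ordered), block_size):
        for p in ordered[start:start + block_size]:
            if not any(dominates(s, p) for s in skyline):
                skyline.append(p)            # final: later points cannot dominate it
    return skyline

# Example: the skyline of {(1, 4), (2, 2), (4, 1), (3, 3)} is the first three points.
print(skyline_blockwise([(3, 3), (1, 4), (2, 2), (4, 1)]))
```

In the actual algorithm, several threads would consume blocks in parallel and publish survivors to the shared, sorted global structure.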
7


Sort-based algorithms (quickly detect dominance relationships)

SFS (Sort Filter Skyline)

LESS

SaLSa (Sort and Limit Skyline algorithm)
Object-based space partitioning algorithms (quickly detect incomparability; see the sketch below)

Object-Space Partitioning

BSkyTree-P
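As a rough illustration of why partitioning detects incomparability quickly (this shows the general pivot-bitmask argument, not the exact BSkyTree-P procedure): relative to a pivot, a point can dominate another only if its mask is a superset of the other point's mask, so pairs with incomparable masks need no per-dimension dominance test at all.

```python
def region_mask(p, pivot):
    """Bit i is set when p is strictly better (smaller) than the pivot in
    dimension i; the mask identifies the region the point falls into."""
    return sum(1 << i for i, (a, b) in enumerate(zip(p, pivot)) if a < b)

def masks_incomparable(m1, m2):
    """If neither mask is a subset of the other, the two points are
    guaranteed to be incomparable."""
    return (m1 | m2) != m1 and (m1 | m2) != m2

pivot = (5, 5)
p, q = (3, 7), (8, 2)   # p is better only in dimension 0, q only in dimension 1
print(masks_incomparable(region_mask(p, pivot), region_mask(q, pivot)))   # True
```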
8
Presented by: Dardan Xhymshiti
Spring 2016
Authors:
Michael Shekelyan
Gregor Jossé
Matthias Schubert
Institute of Informatics, Ludwig-Maximilians-University Munich
Publication:

ICDE 2015
Type:

Research Paper
10

In many application areas, data is organized as a network or graph.

Important task: compute a cost-optimal path between a start node and a target node.

Example:


Road networks (Cost criteria: travel time, travel distance, energy consumption etc.)

Computer networks (Cost criteria: bandwidth and the latency between routers.)
Cost vector: when more than one criterion is considered at a time, the cost of a complete
path is called a cost vector.
[Figure: cost vector with cost criteria 1, 2, and 3]
11

How is an optimal path defined?

1. Map the cost vector to a single value by employing a monotonic combination
function, and then sort the paths. The top-n paths are the optimal ones.
Problems:
a) It is hard to find a suitable function.
b) Different types of cost might have different scales.

2. Compute the Pareto-optimal (the mathematical definition of the skyline) cost
vectors. (This is also known as the conventional path skyline; a small example
follows below.)
Problems:
a) The number of Pareto-optimal paths might increase exponentially as a
function of the distance and the number of considered cost criteria.
b) Showing the user a large number of results is not very helpful.
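A tiny hedged example of option 2 (the cost vectors are invented for illustration): a Pareto filter keeps every vector that no other vector dominates, which is exactly the conventional path skyline.

```python
def dominated(c, others):
    """c is dominated if some other vector is <= in every criterion and not
    equal to c (all criteria are to be minimised)."""
    return any(all(o <= a for o, a in zip(other, c)) and other != c
               for other in others)

# Hypothetical cost vectors of four paths, e.g. (travel time, energy consumption).
costs = [(1, 9), (9, 1), (6, 6), (7, 7)]
pareto = [c for c in costs if not dominated(c, costs)]
print(pareto)   # [(1, 9), (9, 1), (6, 6)] -- the conventional path skyline
```

Note that (6, 6) survives the Pareto filter, yet for every positive weighting of the two criteria either (1, 9) or (9, 1) has a smaller weighted sum; such vectors are exactly what the linear path skyline, introduced next, excludes.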
12

Solutions for computing linear path skylines already exist, but they are restricted
to the special case of exactly two cost criteria.
Problem: they cannot be generalized to more criteria.

The number of skyline paths increases exponentially with the distance between the
locations and the number of cost criteria. Thus the result set might be too big.
13

To come up with a new approach to computing the result set of path skylines
that provides better results, faster.
14

Recall: the conventional path skyline contains all potentially optimal paths, but
the result set is too big.

Idea: reduce the result set to show only the paths that are optimal under a
weighted sum function, i.e., a linear combination of the cost criteria.

Intuitively: the user weights each type of cost with a percentage describing its
importance.

How to compute the linear path skyline?

Naive approach: compute the conventional path skyline and then compute
the convex hull of the resulting cost vectors (inefficient; see the sketch below).
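A hedged sketch of that naive two-step approach for the two-criteria case only (the paper's actual algorithm avoids materialising the full Pareto set and handles more criteria): the linearly optimal cost vectors are the ones lying on the lower-left convex hull of the Pareto set, which a monotone-chain construction finds.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def linear_skyline_2d(pareto_vectors):
    """Lower-left convex hull of the Pareto-optimal cost vectors:
    exactly the vectors that minimise some positive weighted sum."""
    pts = sorted(set(pareto_vectors))
    hull = []
    for p in pts:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

# Continuing the toy example above: (6, 6) drops out of the linear path skyline.
print(linear_skyline_2d([(1, 9), (9, 1), (6, 6)]))   # [(1, 9), (9, 1)]
```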
15

What is Convex Hull?

Definition: In mathematics, the convex hull or convex envelope of a set X of
points in the Euclidean plane or Euclidean space is the smallest convex set
that contains X.
16

The authors propose an algorithm called LSCH (Linear Skyline Convex Hull), which
constructs the linear path skyline successively while adding only paths that are
members of the result set.

Implementation overview:

1. To add a new cost vector, a single search is performed that combines the cost vectors
based on the normal vectors of the hyperplanes currently limiting the linear path skyline.

2. To identify the areas of the linear skyline where additional results might still exist, the
algorithm applies multidimensional convex hull computation.
17

Experiments are run on two different types of networks:

1. The Munich road network with five cost criteria.

2. Artificial lattice graphs that make it possible to simulate different problem instances
and parameter settings.
18

Algorithms for computing route skylines.

Convex hull algorithms.
19
Presented by: Shahab Helmi
Spring 2016
Authors:
Publication:

ICDE 2015
Type:

Research Paper
21
A recent study suggests that the routes provided by a leading navigation service often fail
to agree with the routes chosen by local drivers. Why?

They consider only a limited number of travel costs, e.g., distance or travel time.

With the rapid development and continuing use of vehicle tracking technologies, it is
possible to learn and update individual drivers' driving preferences from their
trajectories.
22

It proposes a novel problem: personalized route recommendation based on big
trajectory data.

It proposes techniques to model and update driving preferences from drivers’
trajectories.

The proposed driving-preference model can support an arbitrary number of travel costs
of interest and arbitrary distributions of cost ratios.

Comprehensive experiments were conducted on a substantial, real trajectory
data set to show efficiency and effectiveness.
23
1. Indexing the road network.
2. Modeling drivers' preferences from their trajectories, considering multiple travel costs.
3. Selecting sub-trajectories according to source, destination, departure time, and the
driver's preferences.
4. Constructing a small graph with appropriate edge weights reflecting how the driver
would like to use the edges, based on the selected trajectories.
5. Returning the shortest route in the small graph as the personalized route to the driver
(see the sketch below).
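A minimal sketch of step 5 only, assuming step 4 has already produced the small graph as an adjacency dict whose edge weights reflect the driver's preferences (the graph, node names, and weights below are hypothetical):

```python
import heapq

def personalized_route(graph, source, target):
    """Plain Dijkstra over the small preference-weighted graph built in
    step 4; returns the personalized route as a list of nodes.
    Assumes the target is reachable from the source."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue                       # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    route, node = [], target
    while node != source:
        route.append(node)
        node = prev[node]
    return [source] + route[::-1]

# Hypothetical toy graph: lower weights mean the driver prefers that edge.
g = {"s": [("a", 1.0), ("b", 4.0)], "a": [("b", 1.5), ("t", 5.0)], "b": [("t", 1.0)]}
print(personalized_route(g, "s", "t"))    # ['s', 'a', 'b', 't']
```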
24

Route Planning.

Route planning using trajectories (no driver modeling):


Most popular route (MPR)

Time period-based most popular route (TPMPR)

Top-k popular routes (TKPR)
Personalized Route Planning:

TRIP: the closest work; tested on a smaller dataset; can only model travel time.
25


GPS records:

52,211 taxis in Beijing.

collected from 2012-09-30 to 2012-11-30.

One GPS record is collected every 5 seconds or less.
Road network:

The road network within Beijing's 6th Ring Road.

28,342 vertices and 38,577 edges.

A 60 km × 60 km square region.
26


Trajectories:

32,379,248 trajectories.

Each trajectory starts when a passenger gets into the taxi and ends when the passenger gets off.
Travel Costs:

Travel distance.

Travel time.

Fuel consumption.
27
Presented by: Shahab Helmi
Spring 2016
Authors:
Publication:

ICDE 2015
Type:

Research Paper
29
An episode (serial episode) is usually defined as a totally ordered set of events
that occur relatively close to each other.
Frequency of an episode: how frequently it occurs in a sequence. Frequency count
methods:

Window-based

Non-overlapped occurrences

Non-interleaved occurrences

Total Frequency
Counting minimal occurrences can capture the most intense correlation between events
(see the sketch below).
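A small, hedged illustration of minimal occurrences for the simplest case, a two-event serial episode (event names and timestamps are made up): each occurrence of the second event is paired with the latest preceding occurrence of the first event, which yields exactly the windows that contain the episode but no smaller window that also does.

```python
def minimal_occurrences(sequence, first, second):
    """Minimal occurrences of the serial episode <first -> second> in a
    timestamped event sequence (sorted by time): windows [t1, t2] that
    contain the episode but no strictly smaller window that also does."""
    occurrences, last_first = [], None
    for t, event in sequence:
        if event == first:
            last_first = t                      # remember the latest 'first'
        elif event == second and last_first is not None:
            occurrences.append((last_first, t))
            last_first = None                   # a later 'second' alone would
    return occurrences                          # only give a larger window

seq = [(1, "A"), (2, "B"), (3, "A"), (5, "B"), (6, "B")]
print(minimal_occurrences(seq, "A", "B"))       # [(1, 2), (3, 5)]
```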
30
Previous studies on frequent episode mining (FEM) mostly process the data in offline
mode. Shortcomings:

This process may take hours or days.

Testing whether an episode occurs in a sequence is an NP-complete problem.
Real world application challenges:

Fast-growing data.

Recency effect: only the freshest patterns from recent events are of interest (in
high-frequency trading, a trader holds a stock for only 22 seconds on average).

Time-critical analysis: a delay may lead to drastic loss (predictive maintenance of data
centers).
31
Online FEM algorithm challenges:
1. Infrequent events at the current moment could become frequent in the future, so they
cannot be discarded.
2. A compact and effective data structure is therefore required to store all episode
occurrences.
3. Mining all minimal occurrences of an episode also becomes a big challenge over the
growing sequence.

For #1 and #2, a data structure named the episode trie is proposed. It is a prefix tree
similar to the one used in FP-Growth (see the sketch after this list).

For #3, an algorithm named mining frequent serial episodes via last occurrence
(MESELO) is proposed. It is claimed to be the first online FEM algorithm.

The last-occurrence concept is defined for the first time: the last time at which an
episode occurred. Using the last-occurrence information, MESELO can generate new
minimal occurrences of frequent episodes after receiving new data.

Complexity analysis: O(MW); it is claimed that in most real-world applications W is
small, so the algorithm works well.
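A very rough sketch of a prefix-tree node of the kind an episode trie builds on; it only shows how shared episode prefixes and last-occurrence timestamps can live in one structure, and is not the paper's exact design:

```python
class TrieNode:
    """One node per event type; a path from the root spells an episode.
    Each node keeps the minimal occurrences observed so far and the end
    time of the most recent one (the 'last occurrence')."""
    def __init__(self):
        self.children = {}            # event type -> TrieNode
        self.occurrences = []         # list of (start, end) minimal windows
        self.last_occurrence = None   # end time of the newest occurrence

    def insert(self, episode, window):
        """Register a minimal occurrence `window` of `episode` (a tuple of
        event types), creating shared prefix nodes as needed."""
        node = self
        for event in episode:
            node = node.children.setdefault(event, TrieNode())
        node.occurrences.append(window)
        node.last_occurrence = window[1]

root = TrieNode()
root.insert(("A", "B"), (1, 2))       # episode A -> B, minimal occurrence [1, 2]
root.insert(("A", "B", "C"), (1, 4))  # shares the A -> B prefix
print(root.children["A"].children["B"].last_occurrence)   # 2
```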
32

There is no previous work on online FEM.

Frequent episode mining on event sequences (main difference is the frequency count
method):

Alarm sequences in telecom networks

Web navigation logs

Timestamped fault reports in car manufacturing plants

Sales transactions and stock data

News

Breadth-first approaches (Apriori-based)

Depth-first approaches (prefix-tree-based)
33

Online frequent itemset mining (approximate and exact methods):

MOMENT

Can-Tree

SWIM

FP-Stream

FDPM

…
These do not apply to episode mining since, in frequent itemset mining, the time information
is not important, while episodes are ordered according to their occurrence times. Hence,
keeping the tree up to date is hard. Using the last-occurrence concept makes this more efficient.
34
Server 1 (for algorithm):

Intel Xeon E5-2620 2.00 GHz processor

32 GB memory

Windows Server 2008
Server 2 (for MySQL database):

Intel Xeon E5-2620 2.00 GHz processor

16 GB memory

Linux
35
36
1. Online mode: MESELOBS is similar to MESELO but does not use the concept of last
episode occurrence. (The x-axis shows MinSupport.)
2. Offline mode: the performance of MESELO is compared to baseline algorithms. (The
x-axis shows the window size; MinSupport is 10.)
37