MalStone: Towards A Benchmark for Analytics on Large Data Clouds

Collin Bennett, David Locke, Robert L. Grossman, Jonathan Seidman, Steve Vejcik
Open Data Group
400 Lathrop Ave Suite 90
River Forest IL 60305
KDD’10, July 25–28, 2010, Washington, DC, USA
OUTLINE
0. ABSTRACT
1. INTRODUCTION
2. Common Elements
3. MalStone A & B
4. MalGen
5. THREE IMPLEMENTATIONS
6. EXPERIMENTAL STUDIES
7. DISCUSSION
8. RELATED WORK
9. SUMMARY
0. ABSTRACT
Terasort
 MalStone
 MalGen

1. INTRODUCTION

Data mining for clouds: HBase, Apache Pig, Hive, and ZooKeeper

There are no similar benchmarks for comparing
two large data clouds that support building
analytic models on large datasets.

This talk introduces MalStone as such a benchmark and describes the
implementation of MalGen, a data generator for MalStone.
2. Common Elements

Time stamps
Sites, e.g. Web sites, computers, network devices
Entities, e.g. visitors, users, flows
Log files fill disks, many, many disks
Behavior occurs at all scales
Want to identify phenomena at all scales
Need to group “similar behavior”
Need to do statistics (not just sorting)
2. Common Elements
Abstract the Problem Using Site-Entity Logs
Example | Sites | Entities
Measuring online advertising | Web sites | Consumers
Drive-by exploits | Web sites | Computers (identified by cookies or IP)
Compromised systems | Compromised computers | User accounts
3. MalStone A & B
MalStone Benchmark



Benchmark developed by Open Cloud Consortium
for clouds supporting data intensive computing.
Code to generate the required synthetic data is
available from code.google.com/p/malgen
Stylized analytic computation that is easy to
implement in MapReduce and its generalizations.
3. MalStone A & B
MalStone A computes $\rho_j$ for all sites $j$ in the log files.
MalStone B computes $\rho_{j,t}$ for sites $j$ in the log files.
3. MalStone A & B

Let $A_j$ be the set of entities that visit site $j$ during the exposure
window, and let $B_j$ be the set of all entities $e_i \in A_j$ that become
marked at any time in the monitor window.
3. MalStone A & B

$\rho_j = |B_j| / |A_j|$, where $B_j$ is the set of entities in $A_j$
that become marked at any time during the monitor window.
3. MalStone A & B
The statistic is (1 + 0 + 0)/(1 + 1 + 0) = 1/2
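The ratio above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes each log record is a (site, entity, marked) triple and that the exposure and monitor windows cover the whole log, so time stamps are ignored. The function name `malstone_a` and the record layout are illustrative assumptions, not part of the MalStone definition.

```python
from collections import defaultdict


def malstone_a(records):
    """Return {site: fraction of visiting entities that became marked}.

    records: iterable of (site_id, entity_id, marked) tuples (hypothetical layout).
    Assumption: an entity counts as marked if any of its records is flagged,
    and both windows span the entire log.
    """
    visitors = defaultdict(set)  # site -> set of entities that visited it
    marked = set()               # entities flagged as marked anywhere in the log
    for site, entity, is_marked in records:
        visitors[site].add(entity)
        if is_marked:
            marked.add(entity)
    # Ratio of marked visitors to all visitors, per site.
    return {site: len(ents & marked) / len(ents) for site, ents in visitors.items()}


# Three entities: e1 and e2 visit site1, e3 does not; only e1 becomes marked.
log = [
    ("site1", "e1", True),
    ("site1", "e2", False),
    ("site2", "e3", False),
]
print(malstone_a(log))
```

For site1, one of the two visiting entities is marked, which gives the same 1/2 as the worked example.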
4. MalGen







Tens of millions of sites
Hundreds of millions of entities
Billions of events
Most sites have only a few events
Some sites have many events
Most entities visit a few sites
Some entities visit many sites
4. MalGen

For generating site-entity log files
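To make the shape of such data concrete, the following sketch generates a small site-entity log in which a few sites and entities account for most events. It is not MalGen's algorithm or record format: the 1/rank weighting, the function name `generate_log`, and the (timestamp, site, entity) layout are all assumptions chosen for illustration.

```python
import random


def generate_log(n_events, n_sites, n_entities, seed=0):
    """Return a list of (timestamp, site_id, entity_id) tuples.

    Assumption: popularity follows a Zipf-like 1/rank weighting, so
    low-numbered sites and entities appear far more often than the rest.
    """
    rng = random.Random(seed)  # seeded for reproducible output
    site_weights = [1.0 / rank for rank in range(1, n_sites + 1)]
    entity_weights = [1.0 / rank for rank in range(1, n_entities + 1)]
    sites = rng.choices(range(n_sites), weights=site_weights, k=n_events)
    entities = rng.choices(range(n_entities), weights=entity_weights, k=n_events)
    # The event index stands in for a time stamp in this sketch.
    return [(t, s, e) for t, (s, e) in enumerate(zip(sites, entities))]


sample = generate_log(n_events=1000, n_sites=50, n_entities=200)
print(sample[:3])
```

At benchmark scale the records would be written to files across many nodes rather than held in a list.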
5. THREE IMPLEMENTATIONS

HDFS, Hadoop Streams and Python

Hadoop HDFS and MapReduce

Sector and Sphere UDFs (User Defined Functions)
6. EXPERIMENTAL STUDIES
Sector/Sphere v1.20
MalStone B: 44 min
# Nodes: 20
# Records: 10 billion
Size of Dataset: 1 TB
Tests done on Open Cloud Testbed.
7. DISCUSSION


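Hadoop streams does not require writing code against the Java MapReduce
API; any executable that reads standard input and writes standard output
can serve as mapper or reducer.
Python programs can be invoked by Hadoop streams.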
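As an illustration of a Python program driven by Hadoop streams, the sketch below pairs a mapper and a reducer that read and write tab-separated lines. The input layout (timestamp, site, entity, flag), the flag value "1" for a marked entity, and the `map`/`reduce` command-line argument are assumptions made for this example; it is not the implementation used in the experiments.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Re-key each record by site: emit 'site<TAB>entity<TAB>flag'."""
    for line in lines:
        _timestamp, site, entity, flag = line.rstrip("\n").split("\t")
        yield f"{site}\t{entity}\t{flag}"


def reducer(lines):
    """Input must be sorted by site; emit 'site<TAB>ratio' for each site."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for site, group in groupby(rows, key=lambda row: row[0]):
        seen, marked = set(), set()
        for _, entity, flag in group:
            seen.add(entity)
            if flag == "1":  # assumed encoding of a marked entity
                marked.add(entity)
        yield f"{site}\t{len(marked) / len(seen)}"


# Run as a streaming step only when called with 'map' or 'reduce'.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same pipeline can be checked locally without a cluster by sorting the mapper output before passing it to the reducer, which mimics the shuffle step.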
8. RELATED WORK

In 2008, Hadoop ran Terasort in 297 sec.
In 2009, Hadoop ran Terasort in 209 sec.
Terasort has since been replaced by MinuteSort,
which sorts as much data as possible in about 1 min.

[MapReduce for machine learning on multicore]
uses MapReduce, but does not describe a
computation similar to the MalStone statistic.
9. SUMMARY

MalGen creates large amounts of synthetic data.

Performance depends on which cloud
middleware is used for the computation.