Project on
“SaaS Log Data Processing and Mining Based on Hadoop”

Presented by:
Priyanka B Marihal
1EW13SCS14, 4th sem M.Tech

Under the Guidance of:
Sahanadevi K. J.
Assistant Professor, Dept. of Computer Science, EWIT
1. Abstract
2. Introduction
3. Problem Statement
4. Three modules:
   I. Amazon Web Services
   II. Hadoop Architecture
       A. Hadoop Distributed File System (HDFS)
       B. Hadoop’s MapReduce Engine
   III. Apriori Algorithm
5. System Architecture
6. Advantages
7. Publication Details
8. References
Nowadays, software is delivered as a service over the internet (SaaS).
End users do not have to worry about installation, upgrades, or maintenance of the software. Example: Google Docs (free software delivered as a service to end users).
End users pay only for the functionality they use.
Service providers need to understand the usage patterns of end users so that they can provision resources efficiently.
These usage patterns are recorded in log files that can run into terabytes.
As a result, processing and mining the log data is a major challenge.
SaaS applications are deployed in data centers built on cloud computing technologies.
The usage information of a SaaS application consists of data generated while users access the application. Example: access time, user source, functions used, and resources consumed are typical usage information.
While using SaaS applications to fulfill their business goals, users need to know this usage information as well.
SaaS applications produce huge amounts of log data, which the service provider needs to process and mine in order to provision resources. Processing and mining log files of this size with presently available technologies such as RDBMS packages is not efficient with respect to time and cost. A SaaS log data processing and mining method based on a Hadoop cluster built on commodity hardware can achieve a higher level of scalability, reliability, and performance.
Amazon Web Services provides a variety of cloud-based computing services, including a wide selection of compute instances that can scale up and down automatically to meet the needs of your application.
It offers a broad set of global compute, storage, database, application, and deployment services that help organizations move faster, lower IT costs, and scale applications.
We extract the log data from the Amazon web server and place it into HDFS, as sketched below.
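A minimal sketch of this extraction step, assuming the web-server logs have been staged in an S3 bucket; the bucket name, key, and paths below are hypothetical placeholders, and the Hadoop command-line client is assumed to be available on the node:

import subprocess
import boto3  # AWS SDK for Python; assumes AWS credentials are already configured

# Hypothetical locations; replace with the real bucket, key, and HDFS path.
BUCKET = "example-saas-logs"
KEY = "webserver/access.log"
LOCAL_FILE = "/tmp/access.log"
HDFS_DIR = "/user/hadoop/logs"

# Pull the raw log file out of S3 onto local disk.
s3 = boto3.client("s3")
s3.download_file(BUCKET, KEY, LOCAL_FILE)

# Copy it into HDFS with the standard "hdfs dfs" client so that the
# MapReduce jobs described later can read it.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_FILE, HDFS_DIR], check=True)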
Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53
Mar 29 2004 09:54:19: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/3842 gaddr 10.0.0.187/53 laddr 192.168.0.2/53
Mar 29 2004 09:54:19: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/36205 gaddr 10.0.0.187/53 laddr 192.168.0.2/53
Mar 29 2004 09:54:26: %PIX-4-106023: Deny icmp src outside:Some-Cisco dst inside:10.0.0.187 (type 3, code 1) by access-group "outside_access_in"
Mar 29 2004 09:54:27: %PIX-4-106023: Deny icmp src outside:Some-Cisco dst inside:10.0.0.187 (type 3, code 1) by access-group "outside_access_in"
Mar 29 2004 09:54:29: %PIX-4-106023: Deny icmp src outside:Some-Cisco dst inside:10.0.0.187 (type 3, code 1) by access-group "outside_access_in"
Mar 29 2004 09:54:30: %PIX-6-106015: Deny TCP (no connection) from 192.168.0.2/2794 to 192.168.216.1/2357 flags SYN ACK on interface inside
Mar 29 2004 09:54:32: %PIX-6-302006: Teardown UDP connection for faddr 192.168.245.1/137 gaddr 10.0.0.187/2789 laddr 192.168.0.2/2789 ()
Mar 29 2004 09:54:32: %PIX-6-302006: Teardown UDP connection for faddr 192.168.110.1/137 gaddr 10.0.0.187/2790 laddr 192.168.0.2/2790 ()
Mar 29 2004 09:54:32: %PIX-6-302006: Teardown UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53
Mar 29 2004 09:54:33: %PIX-6-106015: Deny TCP (no connection) from 192.168.0.2/2794 to 192.168.216.1/2357 flags SYN ACK on interface inside
Mar 29 2004 09:54:38: %PIX-6-302005: Built UDP connection for faddr 194.224.52.6/36455 gaddr 10.0.0.187/53 laddr 192.168.0.2/53
Mar 29 2004 09:54:39: %PIX-6-106015: Deny TCP (no connection) from 192.168.0.2/2794 to 192.168.216.1/2357 flags SYN ACK on interface inside
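For illustration, a hedged sketch of how one of the Cisco PIX syslog lines above could be parsed before mining; the field names (timestamp, severity, msg_id, message) are labels of my own choosing rather than part of any existing tool:

import re

# Matches lines such as:
# Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr ...
PIX_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}): "
    r"%PIX-(?P<severity>\d)-(?P<msg_id>\d+): "
    r"(?P<message>.*)$"
)

def parse_pix_line(line):
    """Return a dict of named fields for one PIX log line, or None if it does not match."""
    match = PIX_PATTERN.match(line.strip())
    return match.groupdict() if match else None

sample = ("Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection "
          "for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53")
print(parse_pix_line(sample))
# -> {'timestamp': 'Mar 29 2004 09:54:18', 'severity': '6',
#     'msg_id': '302005', 'message': 'Built UDP connection for faddr ...'}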
Hadoop is an open-source software framework based on Google’s MapReduce and distributed file system work.
Hadoop is designed to be deployed on commonly available, general-purpose infrastructure (commodity hardware).
Key parts of the Hadoop framework include the following:
A. Hadoop Distributed File System (HDFS): it achieves fault tolerance and high performance by breaking data into blocks and spreading them across large numbers of worker nodes.
B. Hadoop’s MapReduce Engine: it accepts jobs from applications and divides those jobs into tasks that it assigns to various worker nodes (see the sketch below).
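To make the map/reduce split concrete, here is a minimal sketch in the Hadoop Streaming style, where the mapper and reducer read stdin and write tab-separated key/value pairs to stdout. It counts log lines per PIX message ID; the file name logcount.py and all paths are placeholders of my own, not part of the project:

#!/usr/bin/env python3
# logcount.py -- run with "python3 logcount.py map" as the mapper and
# "python3 logcount.py reduce" as the reducer of a Hadoop Streaming job.
import re
import sys
from itertools import groupby

MSG_ID = re.compile(r"%PIX-\d-(\d+):")

def mapper():
    # Map phase: emit one (message_id, 1) pair per matching log line.
    for line in sys.stdin:
        match = MSG_ID.search(line)
        if match:
            print(f"{match.group(1)}\t1")

def reducer():
    # Reduce phase: the framework delivers mapper output sorted by key, so
    # consecutive lines with the same message_id can be summed directly.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for msg_id, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{msg_id}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

Such a job could then be submitted with something along the lines of: hadoop jar hadoop-streaming.jar -input /user/hadoop/logs -output /user/hadoop/counts -mapper "python3 logcount.py map" -reducer "python3 logcount.py reduce", with the jar path and HDFS directories adjusted to the actual cluster.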
Apriori is a basic algorithm for finding frequent itemsets for association rules.
It iteratively finds frequent itemsets of size 1 up to k (k-itemsets).
The basic idea is to reduce the search space by using the Apriori principle: any subset of a frequent itemset must itself be frequent; that is, if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
There are two main steps in Apriori, sketched in the example below:
Join: candidates are generated by joining the frequent itemsets level-wise.
Prune: an itemset is discarded if its support is less than the minimum threshold value.
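A minimal sketch of the join and prune steps on a toy transaction list; the session data and the minimum support of 2 are invented purely for illustration, and a full implementation would also prune candidates that contain an infrequent subset:

from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: every distinct item is a candidate 1-itemset.
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Prune step: keep only the candidates whose support meets the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: join frequent k-itemsets level-wise; two k-itemsets whose
        # union has k+1 elements share exactly k-1 items, which is the Apriori join.
        k += 1
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
    return frequent

# Toy usage: each "transaction" is the set of SaaS functions used in one user session.
sessions = [{"login", "report", "export"},
            {"login", "report"},
            {"login", "export"},
            {"report", "export"}]
print(apriori(sessions, min_support=2))  # frequent itemsets with their support counts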
The Hadoop-based approach is fault tolerant.
It is scalable and reliable.
It is a cheap and efficient solution.
It can process structured, semi-structured, and unstructured data, so there is no need to prepare the dataset.
Communication overhead is low.
I have published a paper titled “A Survey on SaaS Log Data Processing and Mining Based on Hadoop” in the International Journal of Emerging Technology & Advanced Engineering (ISSN 2250-2459, ISO 9001:2008 Certified Journal), Volume 5, Issue 1, January 2015.
The paper can be downloaded from the following IJETAE website link:
http://www.ijetae.com/files/Volume5Issue1/IJETAE_0115_33.pdf
[1] V. Pradeep Kumar and R. V. Krishnaiah, “Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis.”
[2] R. Agrawal and R. Srikant, “Mining sequential patterns,” P. S. Yu and A. S. P. Chen, Eds., IEEE Computer Society Press, Taipei, Taiwan.
[3] Andrew Kusiak, “Association Rules - The Apriori Algorithm” [Online]. Available: http://www.engineering.uiowa.edu/~comp/Public/Apriori.pdf
[4] D. N. Goswami, Anshu Chaturvedi, and C. S. Raghuvanshi, “An Algorithm for Frequent Pattern Mining Based On Apriori,” (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 04, 2010, pp. 942-947, ISSN: 0975-3397.
[5] R. Agrawal and J. Shafer, “Parallel mining of association rules,” IEEE Transactions on Knowledge and Data Engineering, Vol. 8, 1996, pp. 962-969.