Transcript Big Data

Big Data
1
World Cup Soccer
German soccer team: 2014.07.05
IoT + Bigdata
2
What is big data?
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
3
Big Data Is Everywhere!
• Lots of data is being collected and warehoused
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/credit card transactions
• Social networks
4
5
What does big data do?
6
Time of Big Data
What is Big Data?
http://www.youtube.com/watch?v=7D1CQ_LOizA
The most popular big data platform is Hadoop:
What is Hadoop?
http://www.youtube.com/watch?v=9svSeWej1U
7
Evolution of Names
• Artificial Intelligence
• Machine Learning
• Business Intelligence
• Data mining
• Big Data/Data Sciences
8
What Is Data Mining?
• Data mining (knowledge discovery in
databases):
• A process of identifying hidden patterns and
relationships within data (Groth)
• Data mining:
• Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large
databases
9
DM and Business Decision Support
• Database Marketing
• Target marketing
• Customer relationship management
• Credit scoring
• Clinical decision support
• Credit Risk Management
• Fraud Detection
• Healthcare Informatics
10
Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: KDD process. Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
11
Data mining software:
• SAS Enterprise Miner (EM)
• Clementine for SPSS
• R
• Python
12
Government
• In 2012, the Obama administration announced the Big Data
Research and Development Initiative, which explored how big data
could be used to address important problems faced by the
government. The initiative was composed of 84 different big data
programs spread across six departments.
• Big data analysis played a large role in Barack Obama's successful
2012 re-election campaign.
• The United States Federal Government owns six of the ten most
powerful supercomputers in the world.
• The Utah Data Center is a data center currently being constructed
by the United States National Security Agency. When finished, the
facility will be able to handle yottabytes of information collected
by the NSA over the Internet.
13
Business
• Amazon.com handles millions of back-end operations every day, as well as queries
from more than half a million third-party sellers. The core technology that keeps
Amazon running is Linux-based and as of 2005 they had the world’s three largest
Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
• Walmart handles more than 1 million customer transactions every hour, which is
imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes)
of data – the equivalent of 167 times the information contained in all the books in the
US Library of Congress.
• Facebook handles 50 billion photos from its user base.
• FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts
world-wide.
• The volume of business data worldwide, across all companies, doubles every 1.2 years,
according to estimates.
• Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers
to help new home buyers determine their typical drive times to and from work
throughout various times of the day.
14
Bigdata in Google Trends
15
Bigdata case
Movement of carts: Product display
16
Wildfires in Korea (1991 – 2011)
17
Google Flu Trends Service
18
Find a Location for Your Business
19
Crime Mapping in San Francisco: 71% accuracy
20
Evolution of bigdata
• Artificial Intelligence
• Data mining
• Business Intelligence
• Bigdata
• Business Analytics
• Data Sciences
21
22
Future direction of bigdata
23
bigdata 2013
bigdata 2014
24
Google Glass
Mashup, bigdata, visualisation -> analysis of commerce area
25
IoT
Key: Smart & Intelligence
26
3D Printer
Healthy food, organ,
face recommended?
27
A Case on Bigdata
(Association Rule Analysis)
28
Association Rules Analysis
As an Example of a Data Mining Tool: Market Basket Analysis
29
What Is Association Mining?
• Association rule mining:
• Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
• Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples:
• Rule form: “Body → Head [support, confidence]”
• buys(x, “cookie”) → buys(x, “milk”) [0.5%, 60%]
30
Support and Confidence
• Support
• Percentage of samples that contain both A and B
• support(A → B) = P(A ∩ B)
• Confidence
• Percentage of samples containing A that also contain B
• confidence(A → B) = P(B|A)
• Example
• Sliced pork → lettuce [support = 2%, confidence = 60%]
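The two definitions above can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the five sample baskets are assumptions made up for this example, not data from the slides.

```python
def support(baskets, antecedent, consequent):
    """P(A ∩ B): fraction of all baskets containing every item of A and of B."""
    both = set(antecedent) | set(consequent)
    return sum(1 for b in baskets if both <= set(b)) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """P(B|A): among baskets containing A, the fraction that also contain B."""
    a = set(antecedent)
    with_a = [b for b in baskets if a <= set(b)]
    if not with_a:
        return 0.0  # the rule never applies, so report zero confidence
    both = a | set(consequent)
    return sum(1 for b in with_a if both <= set(b)) / len(with_a)


# Hypothetical baskets, invented for illustration only.
baskets = [
    {"cookie", "milk"},
    {"cookie", "milk", "bread"},
    {"cookie"},
    {"milk"},
    {"bread"},
]

s = support(baskets, {"cookie"}, {"milk"})
c = confidence(baskets, {"cookie"}, {"milk"})
print(f"cookie -> milk: support = {s:.0%}, confidence = {c:.0%}")
```

Note that support is measured against all baskets, while confidence is measured only against the baskets that contain the rule body.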
31
A store selling fruits and vegetables
Which items are sold together
frequently?
32
An Example of Market Basket (1)
• There are 8 transactions on three items: A (Apple), B (Banana), C (Carrot).
• Check associations for the two cases below.
(1) A (apple) → B (banana)

#   Basket
1   A
2   B
3   C
4   A, B
5   A, C
6   B, C
7   A, B, C
8   A, B, C
33
An Example of Market Basket (2)
• Basic probabilities are below:
(1) A → B
Coverage    P(A) = 5/8 = 0.625
Support     P(A∩B) = 3/8 = 0.375
Confidence  P(B|A) = 3/5 = 0.6
Lift        P(A∩B) / (P(A)*P(B)) = 0.375/(0.625*0.625) = 0.375/0.391 = 0.96
Leverage    P(A∩B) - P(A)*P(B) = 0.375 - 0.391 = -0.016
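The figures for A → B can be checked with a short Python sketch over the 8 baskets of the example. The helper name `prob` is an assumption for illustration; exact fractions are used so that rounding does not hide anything.

```python
from fractions import Fraction

# The 8 baskets from the example: A = apple, B = banana, C = carrot.
baskets = ["A", "B", "C", "AB", "AC", "BC", "ABC", "ABC"]
n = len(baskets)


def prob(items):
    """Fraction of baskets that contain every item in `items`."""
    return Fraction(sum(1 for b in baskets if set(items) <= set(b)), n)


p_a, p_b, p_ab = prob("A"), prob("B"), prob("AB")

coverage = p_a                # P(A)
support = p_ab                # P(A ∩ B)
confidence = p_ab / p_a       # P(B|A)
lift = p_ab / (p_a * p_b)     # P(A ∩ B) / (P(A) * P(B))
leverage = p_ab - p_a * p_b   # P(A ∩ B) - P(A) * P(B)

for name, value in [
    ("coverage", coverage),
    ("support", support),
    ("confidence", confidence),
    ("lift", lift),
    ("leverage", leverage),
]:
    print(f"{name:<10} = {value} = {float(value):.3f}")
```

Lift comes out below 1 and leverage below 0, so in this small data set apples and bananas appear together slightly less often than independence would predict.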
34
Lift
• What are good association rules?
(How to interpret them?)
• If lift is close to 1, it means there is no
association between two items (sets).
• If lift is greater than 1, it means there is a
positive association between two items (sets).
• If lift is less than 1, it means there is a negative
association between two items (sets).
35
Leverage
• Leverage = P(A∩B) - P(A)*P(B); there are three cases:
① Leverage > 0: the two items (sets) are positively associated
② Leverage = 0: the two items (sets) are independent
③ Leverage < 0: the two items (sets) are negatively associated
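Lift and leverage always point the same way: leverage is positive exactly when lift is above 1. A minimal sketch of the three cases follows. The function name and the tolerance `tol` are assumptions for illustration; the first input is the apple/banana example, the second is the tomatoes and lettuce rule from the lab output in these slides, and the third is an invented independent pair.

```python
import math


def describe_association(p_a, p_b, p_ab, tol=1e-9):
    """Classify a rule A -> B from P(A), P(B) and P(A ∩ B)."""
    expected = p_a * p_b            # joint probability under independence
    lift = p_ab / expected
    leverage = p_ab - expected
    if math.isclose(leverage, 0.0, abs_tol=tol):
        kind = "independent"
    elif leverage > 0:
        kind = "positively associated"
    else:
        kind = "negatively associated"
    return lift, leverage, kind


examples = {
    "apple -> banana": (0.625, 0.625, 0.375),
    "tomatoes -> lettuce": (0.263, 0.217, 0.111),
    "hypothetical independent pair": (0.5, 0.5, 0.25),
}

for name, (p_a, p_b, p_ab) in examples.items():
    lift, leverage, kind = describe_association(p_a, p_b, p_ab)
    print(f"{name}: lift = {lift:.2f}, leverage = {leverage:.4f}, {kind}")
```

With real data, leverage is almost never exactly 0, so in practice a significance test (such as the p value reported by Magnum Opus) is a better guide than a fixed tolerance.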
36
Lab on Association Rules(1)
• SAS Enterprise Miner and SPSS Clementine both include association rule software.
• For this exercise, however, we use Magnum Opus.
• Download the Magnum Opus evaluation version (click)
37
• After you install the program, you will see the initial screen below. From the menu, choose File – Import Data (Ctrl – O).
38
• Demo data sets are already included. Magnum Opus supports two types of data sets: transaction data (*.idi, *.itl) and attribute-value data (*.data, *.nam).
• Transaction data comes in the two formats below (*.idi, *.itl).

idi (identifier-item file):
001, apples
001, oranges
001, bananas
002, apples
002, carrots
002, lettuce
002, tomatoes

itl (item list file):
apples, oranges, bananas
apples, carrots, lettuce, tomatoes
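The two transaction formats carry the same information, which a short Python sketch can show by converting identifier-item lines into item-list lines. The function name is hypothetical, and the exact file syntax (comma separator, optional spaces) is assumed from the sample on this slide rather than taken from the Magnum Opus documentation.

```python
def idi_to_itl(idi_text):
    """Group 'identifier, item' lines into one comma-separated line per identifier."""
    baskets = {}  # dicts keep insertion order, so baskets stay in file order
    for line in idi_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ident, item = (part.strip() for part in line.split(",", 1))
        baskets.setdefault(ident, []).append(item)
    return "\n".join(", ".join(items) for items in baskets.values())


idi_sample = """\
001, apples
001, oranges
001, bananas
002, apples
002, carrots
002, lettuce
002, tomatoes
"""

itl_text = idi_to_itl(idi_sample)
print(itl_text)
```

The identifier itself is dropped in the item-list form: each line is simply one basket.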
39
• If you open tutorial.idi in Notepad, you can see the file contents, as shown on the left.
• The example on the left has 5 transactions (baskets).
40
• Choose File – Import Data (or click the corresponding toolbar icon), then click Tutorial.idi.
• Check Identifier – item file and click Next >.
41
• Leave the other settings as they are.
• Search by: LIFT
• Minimum lift: 1
• Maximum no. of rules: 10
• Click GO
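To get a feel for what these settings ask for, here is a rough brute-force sketch in Python that lists rules by lift, with a minimum lift and a maximum number of rules. It is not how Magnum Opus searches internally; the function names are hypothetical, treating "minimum lift" as inclusive is an assumption, and the data is the 8-basket apple/banana/carrot example from earlier in these slides rather than tutorial.idi.

```python
from itertools import combinations

# The 8 baskets from the earlier example: A = apple, B = banana, C = carrot.
BASKETS = [set(b) for b in ["A", "B", "C", "AB", "AC", "BC", "ABC", "ABC"]]


def prob(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    return sum(1 for b in baskets if items <= b) / len(baskets)


def find_rules(baskets, min_lift=1.0, max_rules=10):
    """Enumerate rules LHS -> item and keep the highest-lift ones."""
    items = sorted(set().union(*baskets))
    rules = []
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            lhs = frozenset(lhs)
            coverage = prob(lhs, baskets)
            if coverage == 0:
                continue  # LHS never occurs, so no rule can be formed
            for rhs in items:
                if rhs in lhs:
                    continue
                support = prob(lhs | {rhs}, baskets)
                lift = support / (coverage * prob({rhs}, baskets))
                if lift >= min_lift:  # inclusive threshold is an assumption
                    rules.append((lift, sorted(lhs), rhs))
    rules.sort(key=lambda rule: -rule[0])  # highest lift first
    return rules[:max_rules]


for lift, lhs, rhs in find_rules(BASKETS, min_lift=1.0, max_rules=10):
    print(f"{' & '.join(lhs)} -> {rhs}  [Lift={lift:.2f}]")
```

On this tiny data set every single-item rule has a lift of 0.96 and is filtered out, so only the three two-item rules survive. Brute force is fine for three items but grows exponentially with the number of items, which is why dedicated tools use smarter search.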
42
• Results are saved in the tutorial.out file.
• Below is an example of a derived rule:
tomatoes -> lettuce
[Coverage=0.263 (263); Support=0.111 (111);
Strength=0.422; Lift=1.94; Leverage=0.0539 (53.9);
p=2.35E-019]
43
Output from association rule analysis
Only 55 rules satisfy the specified constraints.
tomatoes -> lettuce
[Coverage=0.263 (263); Support=0.111 (111); Strength=0.422; Lift=1.94; Leverage=0.0539 (53.9); p=2.35E-019]
lettuce -> tomatoes
[Coverage=0.217 (217); Support=0.111 (111); Strength=0.512; Lift=1.94; Leverage=0.0539 (53.9); p=2.35E-019]
tomatoes -> carrots
[Coverage=0.263 (263); Support=0.085 (85); Strength=0.323; Lift=1.85; Leverage=0.0390 (39.0); p=1.83E-012]
carrots -> tomatoes
[Coverage=0.175 (175); Support=0.085 (85); Strength=0.486; Lift=1.85; Leverage=0.0390 (39.0); p=1.83E-012]
onions -> potatoes
[Coverage=0.189 (189); Support=0.082 (82); Strength=0.434; Lift=1.53; Leverage=0.0285 (28.5); p=5.30E-007]
potatoes -> onions
[Coverage=0.283 (283); Support=0.082 (82); Strength=0.290; Lift=1.53; Leverage=0.0285 (28.5); p=5.30E-007]
lettuce & carrots -> tomatoes
[Coverage=0.045 (45); Support=0.039 (39); Strength=0.867; Lift=3.30; Leverage=0.0272 (27.2); p=3.16E-008]
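Rules in this text format are easy to load for further analysis. The Python sketch below parses two of the rules listed above and checks that Strength is Support divided by Coverage, i.e. the confidence of the rule. The regular expressions and function name are assumptions based only on the output shown on this slide, not on a documented file format.

```python
import re

RULE_RE = re.compile(r"^(?P<lhs>.+?) -> (?P<rhs>.+)$")
METRIC_RE = re.compile(r"(\w+)=([-+0-9.Ee]+)")


def parse_rules(text):
    """Parse 'lhs -> rhs' lines followed by one or more '[Name=value; ...]' lines."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        rule = RULE_RE.match(line)
        if rule:
            rules.append({
                "lhs": [item.strip() for item in rule.group("lhs").split("&")],
                "rhs": rule.group("rhs").strip(),
                "metrics": {},
            })
        elif rules:
            # Metric lines may wrap, so attach them to the most recent rule.
            for name, value in METRIC_RE.findall(line):
                rules[-1]["metrics"][name] = float(value)
    return rules


sample = """\
tomatoes -> lettuce
[Coverage=0.263 (263); Support=0.111 (111); Strength=0.422; Lift=1.94; Leverage=0.0539 (53.9); p=2.35E-019]
lettuce & carrots -> tomatoes
[Coverage=0.045 (45); Support=0.039 (39); Strength=0.867; Lift=3.30; Leverage=0.0272 (27.2); p=3.16E-008]
"""

parsed = parse_rules(sample)
for r in parsed:
    m = r["metrics"]
    ratio = m["Support"] / m["Coverage"]
    print(f"{' & '.join(r['lhs'])} -> {r['rhs']}: "
          f"Strength={m['Strength']}, Support/Coverage={ratio:.3f}, Lift={m['Lift']}")
```

The counts in parentheses, such as (263), are ignored by this parser; only the values after each "=" are kept.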
44