Cloud Computing Era (Practice)
Phoenix Liau
Trend Micro
Three Major Trends to Change the World
Cloud Computing
Big Data
Mobile
What is Cloud Computing?
The U.S. National Institute of Standards and Technology (NIST) defines it in terms of:
• Essential Characteristics
• Service Models
• Deployment Models
An as-a-service business model that uses Internet technologies to deliver scalable and elastic IT capabilities to users
It’s About the Ecosystem
[Diagram: Cloud Computing (SaaS, PaaS, IaaS) and the Enterprise Data Warehouse (structured and semi-structured data) generate Big Data, which leads to Business Insights, creating Competition, Innovation, and Productivity]
What is Big Data?
A set of files
A database
A single file
What is the problem?
• Getting the data to the processors becomes the bottleneck
• Quick calculation:
– Typical disk data transfer rate: 75 MB/sec
– Time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
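Working the slide's numbers through explicitly: 100 GB ≈ 102,400 MB, and 102,400 MB ÷ 75 MB/sec ≈ 1,365 seconds ≈ 22.8 minutes, which is the "approx. 22 minutes" quoted above.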
The Era of Big Data – Are You Ready?
• Businesses are driving the growth of big data. Storing huge volumes of data reliably, managing them efficiently, and turning them into business value are major challenges for enterprises.
• Overwhelming quantities of big data will challenge enterprise storage infrastructure and data center architecture, causing chain reactions across database storage, data mining, business intelligence, cloud computing, and computing applications.
Data volumes for business analysis
• 2011: multi-terabyte (TB)
• 2020: 35.2 ZB (1 ZB = 1 billion TB)
Who Needs It?
Hadoop – when to use:
• Affordable Storage/Compute
• Unstructured or Semi-structured data
• Resilient Auto Scalability
Enterprise Database – when to use:
• Ad-hoc Reporting (<1 sec)
• Multi-step Transactions
• Lots of Inserts/Updates/Deletes
Hadoop!
• Apache Hadoop project
– Inspired by Google's MapReduce and Google File System papers
• An open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware
• Open Source Software + Commodity Hardware
– IT cost reduction
Hadoop Core
MapReduce
HDFS
HDFS
• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool
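To make the Java API bullet above concrete, here is a minimal sketch (not from the original deck) that writes a file to HDFS once and reads it back; the path and contents are made up, and the cluster settings are assumed to come from the usual Hadoop configuration files:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt");    // hypothetical path
    // Write once ...
    FSDataOutputStream out = fs.create(file);
    out.writeBytes("Hello HDFS\n");
    out.close();
    // ... read many times
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());
    in.close();
  }
}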
MapReduce
• Two phases (Map and Reduce), borrowed from functional programming
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Java API
Hadoop Core
[Diagram: Java interfaces on top of both MapReduce and HDFS]
Word Count Example
• Map input – Key: byte offset, Value: line
0: The cat sat on the mat
22: The aardvark sat on the sofa
• Map output – Key: word, Value: count
• Reduce output – Key: word, Value: sum of counts
The Hadoop Ecosystem
The Ecosystem is the System
• Hadoop has become the kernel of the distributed
operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
Relation Map
• Hue (Web Console)
• Mahout (Data Mining)
• Oozie (Job Workflow & Scheduling)
• ZooKeeper (Coordination)
• Sqoop/Flume (Data Integration)
• Pig/Hive (Analytical Language)
• MapReduce Runtime (Dist. Programming Framework)
• HBase (Column NoSQL DB)
• Hadoop Distributed File System (HDFS)
Zookeeper – Coordination Framework
What is ZooKeeper?
• A centralized service for maintaining configuration information and providing distributed synchronization
• A set of tools to build distributed applications that can safely handle partial failures
• ZooKeeper was designed to store coordination data (see the sketch below):
– Status information
– Configuration
– Location information
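As a hedged illustration of storing such coordination data, a minimal sketch using the ZooKeeper Java client; the ensemble address, znode path, and payload are assumptions for the example:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
  public static void main(String[] args) throws Exception {
    // Connect to a (hypothetical) ZooKeeper ensemble; the no-op watcher ignores session events
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> { });
    // Publish a small piece of configuration as a persistent znode
    zk.create("/demo-config", "batch.size=128".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // Any process in the cluster can now read it back
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}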
Flume / Sqoop – Data Integration Framework
What’s the problem with data collection?
• Data collection is currently a priori and ad hoc
– A priori – you decide what you want to collect ahead of time
– Ad hoc – each kind of data source goes through its own collection path
What is Flume (and how can it help?)
• A distributed data collection service
• It efficiently collects, aggregates, and moves large amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• A one-stop solution for collecting data of all formats
Flume: High-Level Overview
• Logical Node
• Source
• Sink
Flume Architecture
[Diagram: many log sources feed Flume Nodes, which forward and aggregate the data into HDFS]
Flume Sources and Sinks
• Local Files
• HDFS
• Stdin, Stdout
• Twitter
• IRC
• IMAP
Sqoop
• Easy, parallel database import/export
• What do you want to do?
– Import data from an RDBMS into HDFS
– Export data from HDFS back into an RDBMS
[Diagram: Sqoop moves data in both directions between an RDBMS and HDFS]
Sqoop Examples
$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...
Pig / Hive – Analytical Language
Hue
Mahout
(Web Console)
(Data Mining)
Oozie
(Job Workflow & Scheduling)
Zookeeper
(Coordination)
Sqoop/Flume
(Data integration)
Pig/Hive (Analytical Language)
MapReduce Runtime
(Dist. Programming Framework)
Hbase
(Column NoSQL DB)
Hadoop Distributed File System (HDFS)
Why Hive and Pig?
• Although MapReduce is very powerful, it can also be
complex to master
• Many organizations have business or data analysts who
are skilled at writing SQL queries, but not at writing Java
code
• Many organizations have programmers who are skilled
at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
to help such people analyze huge amounts of data via
MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!
Hive
– Developed by Facebook
• What is Hive?
– An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data
summarization and ad hoc querying on top of Hadoop
– MapReduce for execution
– HDFS for storage
• Hive Query Language
– Basic-SQL : Select, From, Join, Group-By
– Equi-Join, Multi-Table Insert, Multi-Group-By
– Batch query
SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid
[Diagram: SQL query → Hive → MapReduce jobs]
Pig
– Initiated by Yahoo!
• A high-level scripting language (Pig Latin)
• Process data one step at a time
• Simpler than writing MapReduce programs directly
• Easy to understand
• Easy to debug
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
[Diagram: Pig Latin script → Pig → MapReduce jobs]
Hive vs. Pig
• Language – Hive: HiveQL (SQL-like); Pig: Pig Latin, a scripting language
• Schema – Hive: table definitions stored in a metastore; Pig: a schema is optionally defined at runtime
• Programmatic Access – Hive: JDBC, ODBC; Pig: PigServer (a JDBC sketch for Hive follows below)
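As a hedged sketch of the JDBC route listed above, assuming a Hive server on the default port 10000 and the purchases table from the earlier slide (the driver class name is the HiveServer1-era one that matches the vintage of these slides):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");   // HiveServer1-era driver class
    Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // The query is compiled into MapReduce jobs behind the scenes
    ResultSet rs = stmt.executeQuery("SELECT storeid, COUNT(*) FROM purchases GROUP BY storeid");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}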
WordCount Example
• Input
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
• The reduce just sums up the values, so for the given input it emits
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: for each input line, emit <word, 1> for every token
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);   // ships the jar containing Map/Reduce to the cluster
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) as count;
DUMP C;
WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
The Story So Far
[Diagram: SQL goes through Hive and scripts go through Pig, both compiling to Java MapReduce jobs over HDFS; Sqoop exchanges data with RDBMSs and Flume with (POSIX) file systems]
HBase – Column NoSQL DB
Structured-data vs Raw-data
HBase – inspired by Google's BigTable
• Coordinated by ZooKeeper
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API:
– PUT
– GET
– DELETE
– SCAN
HBase – Data Model
• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
HBase – Workflow
HBase Examples
hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'
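The same put/get/scan operations through the Java client, as a hedged sketch using the classic HTable API of that era (the table and column names reuse the shell example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseApiSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml (ZooKeeper quorum etc.)
    HTable table = new HTable(conf, "mytable");
    // PUT
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("mycf"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
    table.put(put);
    // GET
    Result row = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("mycf"), Bytes.toBytes("col1"))));
    // SCAN (rows come back sorted by row key)
    ResultScanner scanner = table.getScanner(new Scan());
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();
    table.close();
  }
}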
Oozie – Job Workflow & Scheduling
What is Oozie?
• A Java web application
• Oozie is a workflow scheduler for Hadoop (a client sketch follows the feature list below)
• "Crond" for Hadoop
• Triggered by:
– Time
– Data
[Diagram: a workflow of dependent jobs, Job 1 through Job 5]
Oozie Features
• Component independent – works with:
– MapReduce
– Hive
– Pig
– Sqoop
– Streaming
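A hedged sketch of driving Oozie from Java with its client API; the server URL, the HDFS application path, and the nameNode/jobTracker property values are assumptions for the example:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitSketch {
  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");   // hypothetical server URL
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/demo/wordcount-wf");  // dir holding workflow.xml
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "jobtracker:8021");
    String jobId = oozie.run(conf);                              // submit and start the workflow
    System.out.println(oozie.getJobInfo(jobId).getStatus());     // e.g. RUNNING
  }
}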
Mahout – Data Mining
What is Mahout?
• A machine-learning tool
• Distributed and scalable machine-learning algorithms on the Hadoop platform
• Makes building intelligent applications easier and faster
Mahout Use Cases
• Yahoo: Spam Detection
• Foursquare: Recommendations
• SpeedDate.com: Recommendations
• Adobe: User Targeting
• Amazon: Personalization Platform
Use case Example
• Predict what the user likes based on
– His/Her historical behavior
– Aggregate behavior of people similar to him/her (see the sketch below)
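A hedged sketch of that idea with Mahout's Taste collaborative-filtering API; the ratings.csv file (one userID,itemID,preference per line), the neighborhood size, and the user ID are assumptions for the example:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Historical behavior: one "userID,itemID,preference" line per event (hypothetical file)
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // "People similar to him/her": the 10 nearest users by Pearson correlation
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 recommendations for user 1
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}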
Conclusion
Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop
ecosystem
Recap – Hadoop Ecosystem
• Hue (Web Console)
• Mahout (Data Mining)
• Oozie (Job Workflow & Scheduling)
• ZooKeeper (Coordination)
• Sqoop/Flume (Data Integration)
• Pig/Hive (Analytical Language)
• MapReduce Runtime (Dist. Programming Framework)
• HBase (Column NoSQL DB)
• Hadoop Distributed File System (HDFS)
Trend Micro Cloud-Based Antivirus: Case Study
Collaboration in the underground
Internet threats are growing explosively
All kinds of malware variants, spam, unknown download sources, and other Internet-borne threats evade detection by traditional security systems and keep growing explosively, posing a serious information security threat.
New Unique Malware Discovered
• 1M unique malware samples every month
New Design Concept for Threat Intelligence
[Diagram: intelligence sources (Human Intelligence, CDN / xSP, Honeypot, Web Crawler) combined with Trend Micro Mail Protection, Web Protection, and Endpoint Protection, backed by 150M+ worldwide endpoints/sensors]
Challenges We Face
• The concept is great, but ...
• 6 TB of data and 15 billion lines of logs are received daily by Trend Micro
• It becomes a Big Data challenge!
Issues to Address
• Volume: infinite
• Time: no delay
• Target: constantly changing threats
[Diagram: raw data must be turned into information and then into threat intelligence/solutions, with SPN feedback closing the loop]
SPN High Level Architecture
[Diagram: SPAM, CDN logs, web pages, and HTTP POST/download traffic enter through L4 load balancers and Log Receivers, then Log Post Processing, into the SPN infrastructure: HDFS, MapReduce, HBase, and ad-hoc query (Pig), alongside Lumber Jack, Circus (Ambari), the Tracking Logging System (TLS), Malware Classification, a Correlation Platform, and the Global Object Cache (GOC); feedback information flows over a message bus to the applications: Email Reputation Service, Web Reputation Service, and File Reputation Service]
Trend Micro Big Data Processing Capacity
Data volume the cloud-based antivirus service must process every day:
• 8.5 billion Web Reputation queries
• 3 billion Email Reputation queries
• 7 billion File Reputation queries
• 6 TB of raw logs collected from around the world
• Connections from 150 million endpoint devices
Trend Micro: Web Reputation Services
[Pipeline: 8 billion queries/day arrive from user traffic and honeypots; the Akamai CDN cache filters about 40%; 4.8 billion/day reach the rating servers for known threats, which filter 82%; the remaining 860 million unknown/pre-filtered URLs per day go through page download and web crawling on the Hadoop cluster, where threat analysis with machine learning and data mining filters 99.98%, yielding roughly 25,000 malicious URLs per day (high-throughput web service in front, batch processing behind)]
Block a malicious URL within 15 minutes once it goes online!
Big Data Cases
LINE Data on HBase
• LINE data
– MODEL: <key> -> <model>
– INDEX: <key> -> <property in model>
• User: <userID> -> <User obj>, <userID> <-> <phone>
• Consistency in HBase
• Contact model: uses column qualifiers to store contacts
• Supports range queries (e.g. the message box; see the sketch below)
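To illustrate the range-query point, a hedged sketch that assumes a hypothetical row-key design of <userID>#<timestamp>, so one user's message box is a contiguous, sorted key range in HBase (table and key values are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MessageBoxScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "messages");            // hypothetical table
    // Rows are sorted by key, so ["user42#", "user42$") covers every "user42#<timestamp>" row
    Scan scan = new Scan(Bytes.toBytes("user42#"), Bytes.toBytes("user42$"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();
    table.close();
  }
}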
Pig at LinkedIn
LinkedIn – Pig Example
• views = LOAD '/data/awesome' USING VoldemortStorage();
• views = LOAD '/data/etl/tracking/extracted/profile-view' USING VoldemortStorage('date.range', 'num.days=90;days.ago=1');
Facebook Messages
Facebook Open Source Stack
• Memcached --> App Server Cache
• ZooKeeper --> Small Data Coordination Service
• HBase --> Database Storage Engine
• HDFS --> Distributed FileSystem
• Hadoop --> Asynchronous Map-Reduce Jobs
Questions?
Thank you!