Network Traffic Analysis
using HADOOP Architecture
Shan Zeng
HEPiX, Beijing
17 Oct 2012
Outline
• Introduction to Hadoop
• Traffic Information Capture
• Traffic Information Resolution
• Traffic Information Storage
• Traffic Information Analysis
• Traffic Information Display
• Conclusion
ZENG SHAN/CC/IHEP
Introduction to Hadoop
What can Hadoop do?
• Hadoop is an open-source software framework.
• It was originally developed to support distribution for
the Nutch search engine project.
• Supports data-intensive distributed applications.
• Licensed under the Apache v2 license.
• It enables applications to work with thousands of
computation-independent computers and petabytes of data
Traffic Information Capture
What is a flow?
• A network flow is a sequence of packets from a source computer to a
destination, which may be another host, a multicast group, or a
broadcast domain.
• Flow monitoring measures sequences of IP packets that share a common
feature as they pass between devices.
• Flow formats:
  • NetFlow (Cisco)
  • J-Flow (Juniper)
  • sFlow (HP)
  • ….
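The flow definition above can be made concrete with a minimal sketch. It assumes that the "common feature" is the classic 5-tuple (source address, destination address, source port, destination port, protocol); the slides do not state which key is used, and the packet dictionaries below are hypothetical sample data, not captured traffic.

```python
from collections import defaultdict


def aggregate_flows(packets):
    """Group packets into flows keyed by an assumed 5-tuple."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["size"]
    return dict(flows)


# Hypothetical packets: two belong to the same TCP flow, one is a separate UDP flow.
packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 40000, "dport": 80, "proto": "TCP", "size": 1500},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 40000, "dport": 80, "proto": "TCP", "size": 500},
    {"src": "10.0.0.3", "dst": "10.0.0.2", "sport": 53000, "dport": 53, "proto": "UDP", "size": 80},
]
flows = aggregate_flows(packets)
```

Each flow record then carries a packet count and a byte count, which correspond to the quantities that flow export formats such as NetFlow report.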
What is nProbe?
• nProbe is an open-source tool
• It captures packets flowing on an Ethernet segment, computes flows, and
exports them to the specified collectors.
• Features:
  • Ability to keep up with Gbit speeds on Ethernet networks, handling thousands of
  packets per second without packet sampling on commodity hardware.
  • Support for major operating systems, including Unix, Windows and Mac OS X
  • It is designed for environments with limited resources
Traffic Information Resolution
nfcapd
• nfcapd is the NetFlow capture daemon; it reads NetFlow data
from the network and stores it in files.
• The output file is automatically rotated and renamed every
n minutes - typically 5 min
e.g. nfcapd.201205030900 contains the data from May 3rd 2012
09:00 onward
• Usage:
/usr/local/bin/nfcapd -p 2055 -t 300 -l /home/zengshan/netflow/nfcapd_file/IHEP &
-p portnum Specifies the port number to listen on. The default port is 9995.
-t interval Specifies the time interval in seconds at which files are rotated.
-l base_directory Specifies the base directory in which the output files are stored.
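To locate the file that holds a given moment of traffic, the rotation naming scheme can be sketched as follows. This is an illustration only: the helper name `nfcapd_filename` is hypothetical, and it assumes that rotation slots are aligned to multiples of the interval counted from midnight, which matches the nfcapd.201205030900 example but is not confirmed by the slides.

```python
from datetime import datetime, timedelta


def nfcapd_filename(ts, interval_seconds=300):
    """Return the assumed name of the rotated file whose slot contains ts."""
    seconds_into_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    # Round down to the start of the rotation slot.
    slot_start = seconds_into_day - seconds_into_day % interval_seconds
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return "nfcapd." + (midnight + timedelta(seconds=slot_start)).strftime("%Y%m%d%H%M")


# A flow seen at 09:03:27 on May 3rd 2012 falls into the 09:00 slot.
name = nfcapd_filename(datetime(2012, 5, 3, 9, 3, 27))
print(name)  # nfcapd.201205030900
```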
nfdump
• nfdump reads the NetFlow data from the files stored by nfcapd
and dumps it to text
nfdump output text format
Tag     Description
%ts     Start time (first seen)
%te     End time (last seen)
%td     Duration
%pr     Protocol
%sa     Source address
%da     Destination address
%sap    Source address:port
%dap    Destination address:port
%sp     Source port
%dp     Destination port
%sas    Source AS
%das    Destination AS
%in     Input interface number
%out    Output interface number
%pkt    Packet count in this flow
%byt    Byte count in this flow
%fl     Number of flows
%flg    TCP flags
%tos    Type of service
%bps    Bits per second
%pps    Packets per second
%bpp    Bytes per packet
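Once nfdump has written its text output, each line has to be turned into a record before it can be analysed. The sketch below assumes a comma-separated output whose columns correspond, in order, to the tags %ts, %td, %pr, %sa, %sp, %da, %dp, %pkt and %byt; the actual format string used at IHEP is not given in the slides, and the sample line is made up for illustration.

```python
# Assumed column order; the real nfdump format string is not given in the slides.
FIELDS = ["ts", "td", "pr", "sa", "sp", "da", "dp", "pkt", "byt"]


def parse_flow_line(line):
    """Parse one comma-separated flow line into a dictionary."""
    values = [value.strip() for value in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    record["td"] = float(record["td"])  # duration in seconds
    for key in ("sp", "dp", "pkt", "byt"):
        record[key] = int(record[key])
    return record


# Hypothetical sample line using documentation-only addresses.
sample = "2012-05-03 09:00:01.120, 2.500, TCP, 192.0.2.10, 40000, 192.0.2.20, 80, 12, 9000"
record = parse_flow_line(sample)
```

Converting the numeric fields at parse time keeps the later aggregation step free of string handling.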
Traffic Information Storage
HDFS
• HDFS is short for Hadoop Distributed File System
• HDFS can provide high throughput access to application data
• Differences from other distributed file systems:
  • Highly fault-tolerant
  • Designed to be deployed on low-cost hardware
  • Portable across heterogeneous hardware and software platforms
  • Applications that run on HDFS need streaming access to their data sets
  • Provides high-throughput access to application data
  • Suitable for applications that have large data sets
  • Moving computation is cheaper than moving data
HDFS master/slave architecture
• NameNode
  • Manages the namespace of the file system
  • Regulates access to files by clients
  • Determines the mapping of blocks to DataNodes
• DataNodes
  • Responsible for serving read and write requests from the clients
  • Perform block creation, deletion and replication upon instruction from the
  NameNode
Data Replication in HDFS
• To ensure fault tolerance in HDFS, the blocks of a file are
replicated; the number of replicas of a block can be specified by the application
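The idea of a NameNode mapping replicated blocks to DataNodes can be illustrated with a toy model. This is not how HDFS chooses replica locations (its real placement policy is rack-aware); the round-robin rule, the function name `place_blocks` and the node names are assumptions made purely for illustration.

```python
def place_blocks(file_size, block_size, datanodes, replication):
    """Toy block map: split a file into blocks and assign replicas round-robin."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive nodes guarantee that the replicas of one block differ.
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# Hypothetical 200 MB file, 64 MB blocks, four DataNodes, three replicas per block.
placement = place_blocks(file_size=200, block_size=64,
                         datanodes=["dn1", "dn2", "dn3", "dn4"], replication=3)
```

Because every block lives on three distinct nodes in this model, losing any single DataNode still leaves two copies of each block.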
Traffic Information Analysis
Map/Reduce
• MapReduce is a programming model for processing large data sets
• It is typically used for distributed computing
on clusters of computers
• It can take advantage of data locality, processing data on
or near the storage assets to reduce data transmission.
• The model is inspired by the map and reduce functions commonly used in functional programming
  • "Map" step: The master node takes the input, divides it into smaller sub-problems, and
  distributes them to slave nodes. Each slave node processes its sub-problem and
  passes the answer back to the master node.
  • "Reduce" step: The master node then collects the answers to all the sub-problems and
  combines them in some way to form the final output
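The two steps can be sketched in a few lines of single-process Python. This is only a model of the programming pattern, not Hadoop code: the function names are hypothetical, the grouping that Hadoop performs between map and reduce is simulated with a dictionary, and the flow records are invented sample data using the %sa and %byt fields.

```python
from collections import defaultdict


def map_flow(record):
    """Map step: emit a (source address, byte count) pair for one flow record."""
    yield record["sa"], record["byt"]


def reduce_bytes(key, values):
    """Reduce step: combine all byte counts emitted for one source address."""
    return key, sum(values)


def run_mapreduce(records):
    """Run map, group the emitted pairs by key, then run reduce on each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_flow(record):
            groups[key].append(value)
    return dict(reduce_bytes(key, values) for key, values in groups.items())


# Hypothetical flow records.
records = [
    {"sa": "192.0.2.10", "byt": 9000},
    {"sa": "192.0.2.11", "byt": 400},
    {"sa": "192.0.2.10", "byt": 1000},
]
totals = run_mapreduce(records)
print(totals)  # {'192.0.2.10': 10000, '192.0.2.11': 400}
```

In a real cluster the map calls would run on the nodes that hold the input blocks, which is where the data-locality benefit comes from.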
Traffic Information Display
Drawing tools
• RRDtool
  • Acronym for round-robin database tool
  • The data are stored in a round-robin database (circular buffer)
  • It also includes tools to extract RRD data in a graphical format
  • Used to draw flow trend graphs in three dimensions: flow count,
  packet count, traffic count
• Highstock
  • Highstock lets you create stock or general timeline charts in pure
  JavaScript, including sophisticated navigation options such as a small
  navigator series, preset date ranges, a date picker, scrolling and
  panning.
  • We only need to write an API between HDFS and Highstock
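One possible shape for the server side of such an API is sketched below. It assumes the analysis results are available as a mapping from a 5-minute slot label (in the nfcapd timestamp style) to a count, and that slot labels are in UTC; both are assumptions, as are the function name and the sample values. It relies on the fact that Highstock accepts series data as [timestamp in milliseconds, value] pairs.

```python
import json
from datetime import datetime, timezone


def to_highstock_series(name, counts_by_slot):
    """Convert {slot label: count} into a series with [ms timestamp, value] points."""
    data = []
    for slot, value in sorted(counts_by_slot.items()):
        moment = datetime.strptime(slot, "%Y%m%d%H%M").replace(tzinfo=timezone.utc)
        data.append([int(moment.timestamp() * 1000), value])
    return {"name": name, "data": data}


# Hypothetical flow counts for two consecutive 5-minute slots.
series = to_highstock_series("Flow count", {"201205030905": 1340, "201205030900": 1200})
payload = json.dumps([series])  # JSON body that a chart page could fetch
```

Sorting by slot label keeps the points in ascending time order, which time-series charts generally expect.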
Architecture
Conclusion
• Network flow trend charts from IHEP, updated every 5 minutes, in
three dimensions: flow count, packet count, traffic count
• Detailed traffic information charts
  • Select a single timeslot to get its detailed traffic information
  • Select a time window to get its detailed traffic information
  • Hovering over the chart displays a tooltip with the traffic information for
  each point and series
  • The tooltip follows the mouse as the user moves over the graph
• Traffic information related to the HEP experiments
  • Available once the IP addresses of the machines involved in an
  experiment's data transfers are specified
  • We already have DYB/YBJ/CMS/ATLAS
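Attributing a flow to an experiment from its IP address can be sketched as a lookup against per-experiment address ranges. The networks below are documentation-only placeholder ranges, not the real addresses of any experiment's machines, and the function name `experiment_for` is hypothetical.

```python
import ipaddress

# Placeholder ranges for illustration; the real experiment addresses are not in the slides.
EXPERIMENT_NETWORKS = {
    "DYB": [ipaddress.ip_network("192.0.2.0/25")],
    "CMS": [ipaddress.ip_network("198.51.100.0/24")],
}


def experiment_for(address):
    """Return the experiment whose networks contain the address, or None."""
    ip = ipaddress.ip_address(address)
    for experiment, networks in EXPERIMENT_NETWORKS.items():
        if any(ip in network for network in networks):
            return experiment
    return None


print(experiment_for("192.0.2.10"))   # DYB
print(experiment_for("203.0.113.1"))  # None
```

Applying this lookup to the source or destination address of each flow record would allow the per-experiment charts to be built from the same aggregated data.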
Thank You
Questions?