Intro to Big Data - Seidenberg School of CSIS


Data Science and Big Data Analytics
Chap1: Intro to Big Data Analytics
Charles Tappert
Seidenberg School of CSIS, Pace University
1.1 Big Data Overview

Industries that gather and exploit data
  Credit card companies monitor purchases
    Good at identifying fraudulent purchases
  Mobile phone companies analyze calling patterns – e.g., even on rival networks
    Look for customers who might switch providers
For social networks, data is the primary product
  Intrinsic value increases as data grows
Attributes Defining
Big Data Characteristics

Huge volume of data
  Not just thousands/millions, but billions of items
Complexity of data types and structures
  Variety of sources, formats, structures
Speed of new data creation and growth
  High velocity, rapid ingestion, fast analysis
Sources of Big Data Deluge

Mobile sensors – GPS, accelerometer, etc.
Social media – 700 Facebook updates/sec in 2012
Video surveillance – street cameras, stores, etc.
Video rendering – processing video for display
Smart grids – gather and act on information
Geophysical exploration – oil, gas, etc.
Medical imaging – reveals internal body structures
Gene sequencing – more prevalent, less expensive; healthcare would like to predict personal illnesses
Sources of Big Data Deluge
Example:
Genotyping from 23andme.com
1.1.1 Data Structures:
Characteristics of Big Data

Structured – defined data type, format, structure
  Transactional data, OLAP cubes, RDBMS, CSV files, spreadsheets
Semi-structured
  Text data with discernible patterns – e.g., XML data
Quasi-structured
  Text data with erratic data formats – e.g., clickstream data
Unstructured
  Data with no inherent structure – text docs, PDFs, images, video
Example of Structured Data
Example of Semi-Structured Data
Example of Quasi-Structured Data
visiting 3 websites adds 3 URLs to user’s log files
Example of Unstructured Data
Video about Antarctica Expedition
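As a rough illustration of the four categories above, the following Python sketch handles one tiny sample of each. All sample values (the CSV rows, the XML snippet, the clickstream URL, the sentence) are invented for illustration and do not come from the slides.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Structured: fixed columns and types, e.g. a CSV file (hypothetical rows).
csv_text = "customer_id,amount\n101,25.50\n102,7.25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: self-describing tags, e.g. XML (hypothetical snippet).
xml_text = "<order id='7'><item>book</item><item>pen</item></order>"
items = [e.text for e in ET.fromstring(xml_text).findall("item")]

# Quasi-structured: erratic format that takes effort to parse,
# e.g. a URL from a clickstream log (hypothetical URL).
click = "http://example.com/search?q=big+data&page=2"
parsed = urlparse(click)
query = parse_qs(parsed.query)

# Unstructured: no inherent structure, e.g. free text.
# Only crude features such as a word count are easy to extract.
text = "Our Antarctica expedition began in heavy snow."
words = re.findall(r"[A-Za-z]+", text)

print(total, items, parsed.netloc, query["q"], len(words))
```

The amount of parsing effort grows from top to bottom: the CSV reader needs only the header row, while the free text yields little beyond a word count without further analysis.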
1.1.2 Types of Data Repositories
from an Analyst Perspective
1.2 State of the Practice
in Analytics

Business Intelligence (BI) versus Data Science
Current Analytical Architecture
Drivers of Big Data
Emerging Big Data Ecosystem and a New Approach to Analytics
Business Drivers
for Advanced Analytics
1.2.1 Business Intelligence (BI)
versus Data Science
1.2.2 Current Analytical Architecture
Typical Analytic Architecture
Current Analytical Architecture

Data sources must be well understood
EDW – Enterprise Data Warehouse
From the EDW, data is read by applications
Data scientists get data for downstream analytics processing
1.2.3 Drivers of Big Data
Data Evolution & Rise of Big Data Sources
1.2.4 Emerging Big Data Ecosystem
and a New Approach to Analytics

Four main groups of players
  Data devices
    Games, smartphones, computers, etc.
  Data collectors
    Phone and TV companies, Internet, Gov’t, etc.
  Data aggregators – make sense of data
    Websites, credit bureaus, media archives, etc.
  Data users and buyers
    Banks, law enforcement, marketers, employers, etc.
Emerging Big Data Ecosystem
and a New Approach to Analytics
1.3 Key Roles for the
New Big Data Ecosystem
1. Deep analytical talent
   Advanced training in quantitative disciplines – e.g., math, statistics, machine learning
2. Data savvy professionals
   Savvy but less technical than group 1
3. Technology and data enablers
   Support people – e.g., DB admins, programmers, etc.
Three Key Roles of the
New Big Data Ecosystem
Three Recurring
Data Scientist Activities
1. Reframe business challenges as analytics challenges
2. Design, implement, and deploy statistical models and data mining techniques on Big Data
3. Develop insights that lead to actionable recommendations
Profile of Data Scientist
Five Main Sets of Skills





Quantitative skill – e.g., math, statistics
Technical aptitude – e.g., software engineering, programming
Skeptical mindset and critical thinking – ability to examine work critically
Curious and creative – passionate about data and finding creative solutions
Communicative and collaborative – can articulate ideas, can work with others
1.4 Examples of
Big Data Analytics

Retailer Target
  Uses life events: marriage, divorce, pregnancy
Apache Hadoop
  Open source Big Data infrastructure innovation
  MapReduce paradigm, ideal for many projects
Social Media Company LinkedIn
  Social network for working professionals
  Can graph a user’s professional network
  250 million users in 2014
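To make the MapReduce paradigm mentioned under Apache Hadoop more concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to word counting. It is plain Python and does not use any Hadoop API; the function names and the input documents are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster, map and reduce calls run in parallel on many machines;
    # here they simply run one after another in a single process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


docs = ["big data big value", "data grows"]  # hypothetical input
counts = word_count(docs)
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be split across machines, which is the property that makes the paradigm suit very large data sets.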
Data Visualization of User’s
Social Network Using InMaps
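A professional network of this kind can be modeled as a graph whose nodes are people and whose edges are connections. The sketch below is only an assumption about how such a graph might be stored and queried, not a description of how LinkedIn or InMaps works; the names in the sample network are invented.

```python
def second_degree(network, user):
    """Return people connected to the user's connections but not to the user."""
    direct = network.get(user, set())
    result = set()
    for contact in direct:
        result |= network.get(contact, set())
    # Exclude the user and the people they already know directly.
    return result - direct - {user}


# Hypothetical undirected network, stored as adjacency sets.
network = {
    "ana": {"ben", "cho"},
    "ben": {"ana", "dev"},
    "cho": {"ana", "dev", "eli"},
    "dev": {"ben", "cho"},
    "eli": {"cho"},
}
print(sorted(second_degree(network, "ana")))
```

Second-degree connections are a common starting point for suggesting new contacts, since they are reachable through exactly one shared acquaintance.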
Summary

Big Data comes from myriad sources
  Social media, sensors, IoT, video surveillance, and sources only recently considered
Companies are finding creative and novel ways to use Big Data
Exploiting Big Data opportunities requires
  New data architectures
  New machine learning algorithms, ways of working
  People with new skill sets
Always Review Chapter Exercises
Focus of Course

Focus on quantitative disciplines – e.g., math, statistics, machine learning
Provide overview of Big Data analytics
In-depth study of several key algorithms