Big Data and Business Intelligence

Ryan J. Baxter, Ph.D.
Boise State University
EMBA Session - November 18, 2016
Session Goals
• Explore the history, current use, and trajectory of Big Data, including the underlying technologies and their role in enabling Big Data.
• Critically consider the challenges and opposition to Big Data.
• Review and analyze industry-specific examples and insights of Big Data and Analytics.
Agenda
12:30-1:15 (45 min) Overview and Exploration
1:15-1:45 (30 min) Team Breakout: Analyzing Industries and Looking for Patterns
1:45-2:00 (15 min) Break
2:00-2:50 (50 min) Reconvene and Share Insights
2:50-3:00 (10 min) Break
3:00-4:00 (60 min) Mark Bastian – Clearwater Analytics: Systematically Converting Unstructured Data into Value
Changing data alone won’t solve problems
• A concrete example…
Key Takeaways
• Changing the reference system affects:
• Use of Tools
• Culture, Habits, Customs
• A more data-intensive reference system allows for:
• Tighter coordination
• Orchestration of complex routines
What is Big Data?
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
How many Vs does it take to define Big Data?
• Volume
• Variety
• Velocity
• Veracity
• Variability
• Visualization
• Value
3 Vs of Big Data: Volume, Variety, Velocity
4 Vs of Big Data by IBM: Volume, Variety, Velocity, Veracity
Where is the data coming from?
Decreasing cost of data storage
Average Cost Per Gigabyte (USD), by year:
1980: $437,500
1985: $105,000
1990: $11,200
1995: $1,120
2000: $11.00
2005: $1.24
2010: $0.090
2013: $0.050
2014: $0.030
2015: $0.022
2016: $0.019
Recreated from source: http://www.statisticbrain.com/average-cost-of-hard-drive-storage/
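The drop shown on the storage-cost chart can be summarized as a compound annual rate of change. The sketch below is illustrative only: it assumes the chart's first and last data labels ($437,500 and $0.019) belong to 1980 and 2016, and the function name is made up for this example.

```python
# Cost per gigabyte (USD) as read from the chart's data labels.
# Assumption: the first and last labels pair with 1980 and 2016.
cost_per_gb = {1980: 437_500.0, 2016: 0.019}


def annual_change(start_year, end_year, costs):
    """Compound annual rate of change between two years (negative = decline)."""
    years = end_year - start_year
    ratio = costs[end_year] / costs[start_year]
    return ratio ** (1.0 / years) - 1.0


rate = annual_change(1980, 2016, cost_per_gb)
print(f"Average change per year: {rate:.1%}")
```

A steady decline of this size, compounded over 36 years, is what turns a six-figure cost per gigabyte into a couple of cents.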
Miniaturization and Mobility of Computing Technology and Sensors
http://www.computerhistory.org/atchm/the-worlds-smallest-computer/
By Kopiersperre (talk) - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36391402
By Author of Carna Botnet "Internet Census 2012", https://commons.wikimedia.org/w/index.php?curid=26114329
Automotive
Appliances
Computers
Consumer Electronics
Healthcare
Industrial
Military
https://www.ncta.com/platform/broadband-internet/behind-the-numbers-growth-in-the-internet-of-things-2/
Customer Interaction Evolution
Small Merchants:
• Tight relationship with customer
• Rich, organic, credible narrative data
Early Large Merchant:
• Loose relationship with customer
• Little personal data, but lots of general data
Maturing National Merchant:
• Tightening relationship with customer
• Increasing personal data + lots of aggregate data
Current National Merchant:
• Intimate relationship with customer
• Lots of personal and aggregate data
Future Global and SME Merchants:
• Multi-faceted relationship with customer
• Huge amount of personal and aggregate data
https://www.flickr.com/photos/gleonhard/8979555482/
https://www.flickr.com/photos/davedugdale/5102910864/in/photostream/
What other trends or advances are contributing to data growth?
Analytics, Big Data, Business Intelligence, Decision Support Systems, Data Mining…
How do these fit together?
How do we deal with this data?
Volume, Variety, Velocity
The relational database
• Good
• Avoid redundant data (save space!)
• Transaction friendly
• Consistency during update
• Bad
• Scaling
• High volume availability
• Sensitive to small changes
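The "Good" points above (no redundant data, transaction friendly, consistency during update) can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch, not part of the original session: the tables, the names, and the transfer function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The customer name is stored once and referenced by key (no redundancy).
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE account (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    INSERT INTO customer VALUES (1, 'Ada');
    INSERT INTO account VALUES (10, 1, 100), (11, 1, 0);
""")


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint blocked an overdraft


ok = transfer(conn, 10, 11, 30)        # valid transfer
rejected = transfer(conn, 10, 11, 500)  # would overdraw account 10
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, rejected, balances)
```

The second transfer credits one account before the debit fails, yet the rollback leaves both balances as they were after the first transfer. That all-or-nothing behavior is what "transaction friendly" refers to.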
Relational Databases are Sensitive to Change
• “This notion of thinking about data in a structured, relational database is dead.” 1
• Each year, billions of dollars are spent on data modeling and ETL* processes to create and recreate more “perfect” data models that will never change. BUT THEY ALWAYS DO. 2
1. 2009, Vivek Kundra, Former CIO of the U.S. Federal Government (cited in 2).
2. 2016, Matt Allen, “Relational Databases Are Not Designed To Handle Change”
* ETL = Extract, Transform, and Load
Necessity is the mother of …
Big Data Technologies Leverage
• Controlling clusters of commodity hardware
• Non-relational databases
• Open source
• Rapidly evolving
NoSQL: “Not only SQL” – Non Relational
• Characteristics:
• Non-relational
• Schema-less (on input)
• Open source
• Cluster-friendly
• Real-time (fast read/write)
• Why?
• Large dataset – scale horizontally
• Ease of programming
• Schema-less
• Data variety
• Faster capture
• Redundant
Additional resource: https://www.youtube.com/watch?v=qI_g07C_Q5I (Introduction to NoSQL by Martin Fowler)
Normalization vs. Aggregation
Source: https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
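To make the contrast concrete, here is a small sketch that stores the same order in a normalized form (separate keyed collections, as in a relational design) and as one aggregate document (as a document-oriented NoSQL store might hold it). All field names and values are hypothetical.

```python
import json

# Normalized (relational style): each fact is stored once, linked by keys.
customers = {1: {"name": "Ada"}}
orders = [{"order_id": 100, "customer_id": 1}]
order_lines = [
    {"order_id": 100, "sku": "A-1", "qty": 2},
    {"order_id": 100, "sku": "B-7", "qty": 1},
]


def to_aggregate(order_id):
    """Join the normalized rows into one self-contained order document."""
    order = next(o for o in orders if o["order_id"] == order_id)
    return {
        "order_id": order_id,
        # Customer details are copied into the document (deliberate redundancy).
        "customer": dict(customers[order["customer_id"]]),
        "lines": [
            {"sku": line["sku"], "qty": line["qty"]}
            for line in order_lines
            if line["order_id"] == order_id
        ],
    }


doc = to_aggregate(100)
print(json.dumps(doc, indent=2))
```

The aggregate can be read or written in one operation and placed on a single node of a cluster, at the cost of repeating the customer data in every order.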
Apache Hadoop
• Open source
• Large scale, distributed storage and processing
• Clusters of commodity hardware (high failure tolerance)
• Immutability of Data
• Batch oriented
Resource: https://developer.yahoo.com/hadoop/tutorial/module1.html
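Hadoop's batch processing follows the MapReduce pattern. The sketch below imitates that pattern in plain Python on one machine. It does not use Hadoop's API, and the function names and input data are invented for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Each "split" stands in for a block of a file stored on a different node.
splits = [["big data big insights"], ["data lake", "big cluster"]]

mapped = [pair for split in splits for line in split for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each split is mapped independently, the work can be spread across commodity machines, and a failed split can simply be rerun elsewhere.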
Immutability of Data
• All data appended
• No rewriting/updating
• Learn from “streams of change”
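One way to picture append-only data is an event log whose current state is derived by replaying it. This is a minimal sketch with hypothetical event fields, not a description of how any particular Hadoop component stores data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class Event:
    seq: int
    customer: str
    field: str
    value: str


log = []  # append-only: existing entries are never edited or removed


def append(customer, field, value):
    """Record a new fact at the end of the log."""
    event = Event(len(log), customer, field, value)
    log.append(event)
    return event


def current_state(upto=None):
    """Replay the log (optionally only the first `upto` events) to derive state."""
    state = {}
    for event in log[:upto]:
        state.setdefault(event.customer, {})[event.field] = event.value
    return state


append("c1", "city", "Boise")
append("c1", "plan", "basic")
append("c1", "city", "Seattle")  # a move is a new fact, not an overwrite

now = current_state()
before_move = current_state(upto=2)
history = [e.value for e in log if e.field == "city"]
print(now, before_move, history)
```

Nothing is lost when the customer moves: the earlier state can still be rebuilt, and the sequence of changes is itself available for analysis.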
Criticisms of Big Data
Privacy – Asymmetry of Power
“… these capabilities, most of which are not visible or available to the average consumer, also create an asymmetry of power between those who hold the data and those who intentionally or inadvertently supply it.”1
1. Source: BIG DATA: SEIZING OPPORTUNITIES, PRESERVING VALUES (Executive Office of the President May 2014)
http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_5.1.14_final_print.pdf
2. By Toby Hudson (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Data will Help us Manifesto… (http://datawillhelp.us/)
• “…we’re abandoning timeless decision-making tools like wisdom, morality, and personal experience for a new kind of logic which simply says: ‘show me the data’.”
“Big data has arrived, but big insights have not.”
Big Data Articles of Faith:
1. It’s accurate
2. All data captured - (no need for sampling)
3. Causation is unimportant
4. “…the numbers speak for themselves”
Theory-free analysis is fragile. “If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.”
Source: http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3MByvnOn8
Mirai Bot – IoT – What’s Going On?
• DDoS attacks using IoT devices compromised through default credentials
• 61 default username/password pairs
• No industry minimum or standard
• Future regulation?
Partial List: http://www.csoonline.com/article/3126924/security/here-are-the-61-passwords-that-powered-the-mirai-iot-botnet.html
Medical Devices are vulnerable
“In our recent assessment of medical devices used in clinics and hospital around the country, weak encryption, lack of key management, poor authentication and authorization protocols, and insecure communications were all common findings.”
- Chandu Ketkar, Technical Manager at Cigital
https://www.bitsighttech.com/press-releases/news/industry-analysis-reveals-healthcare-and-pharmaceutical-industry-lags-in-security-effectiveness
Case Study: St. Jude Medical Devices Vulnerabilities
• Watch Video at: http://www.bloomberg.com/news/articles/2016-08-25/in-an-unorthodox-move-hacking-firm-teams-up-with-short-sellers
• “A number of associations in the model were really problematic,”
• “It’s scary enough to think that private companies are gathering endless amounts of data on us. It’d be even worse if the conclusions they reach from that data aren’t even right.” (Lazar)
Crime Prediction and Prevention
• Police leverage real-time analytics to provide actionable intelligence
that can be used to understand criminal behavior, identify
crime/incident patterns, and uncover location-based threats.
• That reminds me of a movie I once watched…
https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-government
Prediction?
Source: http://paperathensupm59.files.wordpress.com/2010/11/schermafbeelding-2010-11-29-om-19-34-10.png
Gaining or Losing from lost Privacy?
• “When we lose privacy, we gain so much more. For example, if we
open all our medical data for everybody to have, we can have
insights.” (Kira Radinsky – CTO and co-founder of SalesPredict)
• Crowdsourcing Health Data
• 23andme genetic research
• Ouraring and WeAreCurious
Hold on! Are you leveraging existing data opportunities?
Little Data?
• Management and work practices alignment
• Data quality
• Data synchrony
• Scorecard – Evidence-based management
• Coaching
• Business rules management (aligning operational decisions with strategy)
Best Practices for New Initiatives
• Well-defined use cases
• Hypotheses
• Build Infrastructure
• Measure
• Adapt
• Iterate…
• Leverage increasing infrastructure to explore
Cycle diagram:
1. Use Case
2. Hypotheses
3. Build Infrastructure
4. Measure
5. Adapt
6. New Use Case
7. Hypotheses
8. Increase/Refine Infrastructure
9. Measure
10. Adapt
11. Iterate
Next wave… Data-Driven Automation of Business Decisions
• Operational Analytics by Bill Franks
• Focus on breadth (good enough vs. perfect)
• Design connections from data to decisions
• Prototype, Test, Refine
See an overview at: http://www.theanalyticsrevolutionbook.com/
Discussion After Group Breakout:
What are the keys to evolving to a data-driven/centric organization?
Additional Resources and Issues
Data Mining
• Techniques for learning patterns in data using statistical methods.
• Training
• Classifying, Clustering, Associations
• Predictive
• Resource: https://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-english/
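As one illustration of the clustering bullet, here is a bare-bones k-means on one-dimensional data. It is a teaching sketch under simplifying assumptions (fixed starting centroids, made-up spending figures), not a production implementation.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            updated.append(sum(members) / len(members) if members else centroids[k])
        if updated == centroids:  # nothing moved, so the clusters are stable
            break
        centroids = updated
    return centroids, labels


spend = [12, 15, 14, 95, 102, 99]  # hypothetical monthly spend per customer
centers, groups = kmeans_1d(spend, centroids=[0, 50])
print(centers, groups)
```

No labels are supplied in advance: the algorithm separates low spenders from high spenders purely from the structure of the data, which is what distinguishes clustering from classification.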
Public Data Sets
Listings – e.g.
• https://github.com/caesar0301/awesome-public-datasets
• https://aws.amazon.com/datasets/
• https://www.google.com/publicdata/directory
• https://www.reddit.com/r/datasets/
Facebook Data Set Example:
https://docs.google.com/spreadsheets/d/1mLO7SFqHmUaZEpp87cwkM0luJutSwmwKMx7kaM9348U/edit#gid=1042851424
http://www.wsj.com/articles/whats-all-that-data-worth-1413157156
• “A lot of what is going on at the companies is not being reflected in public disclosures or the accounting,” (Glen Kernick, a managing director at investment-banking and valuation advisory firm Duff & Phelps Corp.)
• “the accounting profession has completely failed modern business in not being able to catch up to new forms of property” (Alex Poltorak, CEO of General Patent Corp)
Designing Data Repositories
• Data Warehouse – Structured – schema is applied when data is written
• Data Lake – Raw – structure is applied when data is read
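The difference between the two designs can be sketched in a few lines. In this hypothetical example the warehouse validates every record against a fixed schema before storing it, while the lake keeps the raw payload and imposes structure only when it is read. The schema, field names, and payloads are assumptions made for illustration.

```python
import json

SCHEMA = {"customer_id": int, "amount": float}  # hypothetical warehouse schema


def warehouse_write(table, record):
    """Schema on write: reject records that do not match before storing."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    table.append(record)


def lake_write(lake, raw_text):
    """Data lake: store the raw payload untouched."""
    lake.append(raw_text)


def lake_read(lake):
    """Schema on read: impose structure only when the data is consumed."""
    for raw in lake:
        data = json.loads(raw)
        if "customer_id" in data and "amount" in data:
            yield {
                "customer_id": int(data["customer_id"]),
                "amount": float(data["amount"]),
            }


table, lake = [], []
payloads = [
    '{"customer_id": 1, "amount": 9.5}',
    '{"customer_id": "2", "amount": "4", "note": "phone order"}',
]

rejected = 0
for raw in payloads:
    lake_write(lake, raw)
    try:
        warehouse_write(table, json.loads(raw))
    except ValueError:
        rejected += 1

rows = list(lake_read(lake))
print(len(table), rejected, rows)
```

The second payload never reaches the warehouse table, but it survives in the lake, extra field included, and can still be interpreted later by a reader that decides how to structure it.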