Big Data
What is it?
How will it affect business?
Copyright © 2014-2016 Curt Hill
What is big data?
• Large – more data than will fit in a
single spreadsheet
• Complex – many different formats
– Binary
– JSON
– Tab delimited
• Unstructured or semi-structured
– Not easy to make sense of by a
machine
– Consider medical notes
– Little metadata
Where does it come from?
• In the last two years we have generated
90% of the world’s data
• Most of this is machine-generated
data
– Sensors
– Mobile devices
– Smart devices
– Web data
Why is big data important?
• Has the potential to revolutionize:
– Science
– Business
– Government
• Let’s consider some examples
First
• The first big data center was built
in 1965 by the US government
• The goal was to store federal data
– 742 million tax returns
– 175 million sets of fingerprints
• This is not large by today’s standards
but was very impressive then
Science
• Astronomy
– Sloan Digital Sky Survey has altered the jobs
of astronomers
– They used to spend a significant amount of
time taking pictures of the sky
– The data is now in a database
• Biology
– A single Next Generation Sequencing
machine can produce Terabytes of data per
day
• The Large Hadron Collider produces 11
TByte per second
– Only 2 TByte can be retained
Business
• Web-focused companies use the
vast quantity of data to customize
advertising and content
recommendations for their users
• Health care companies monitor and
analyze data from hospital and home
care devices
• Energy companies are using the
consumption data to make their
production scheduling more
efficient
• Many other examples
Government
• Census data was perhaps the
first big data application
• Los Angeles is using sensors and
big data analysis to make decisions
concerning traffic
• Use sensors to model and analyze
seismic activity, weather data,
etc.
4 Vs Define Big Data
• Volume
• Velocity
• Variety
• Variability or Veracity
Volume
• What is the most data that can be
stored in a single disk drive?
– There are physical limits
• We can get past these limits by
distributing the data over many boxes
– Scale up – increase the capacity of one box
– Scale out – distribute over many boxes
• Then we have the well-known
communication problems
Velocity
• The data is coming in fast
– Faster than can be structured by a
person
• 1 Terabyte of data for each New
York Stock Exchange day
• 15 billion network connections in a
day
• 100 sensors per car
• Store first
– Then figure out if you want it
Variety
• Formats of data:
– JSON
– Tab delimited / CSV
– Tweets and Facebook updates
• The formats of the future are yet to
be revealed
• We need to be able to handle these
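One way to handle several formats is to convert each incoming record into a common shape as it arrives. The sketch below is only an illustration: the function name `parse_record`, the format labels, the field names and the sample records are all assumptions made for this example.

```python
import csv
import io
import json


def parse_record(raw, fmt, fieldnames=None):
    """Convert one raw text record into a plain dict.

    fmt is "json" or "tsv"; these labels are hypothetical and exist
    only for this sketch.
    """
    if fmt == "json":
        return json.loads(raw)
    if fmt == "tsv":
        # csv.reader handles quoting and embedded delimiters for us.
        reader = csv.reader(io.StringIO(raw), delimiter="\t")
        values = next(reader)
        return dict(zip(fieldnames, values))
    raise ValueError(f"unsupported format: {fmt}")


# Two records in different formats end up with the same structure.
json_rec = parse_record('{"sensor": "s1", "temp": "21.5"}', "json")
tsv_rec = parse_record("s2\t19.0", "tsv", fieldnames=["sensor", "temp"])
print(json_rec)
print(tsv_rec)
```

A new format then only needs one more branch (or one more parser function), while the rest of the pipeline keeps working with dicts.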
Variability or Veracity
• Making sense of the data on its own is
problematic
• The meaning may change over the
course of time
• The reliability issues must be
considered
– False signal from a sensor about to
malfunction
Challenges
• If we had only one of these Vs to
deal with, we could handle it
successfully
• The problem is we often have to deal
with two or more at a time
– This may make traditional relational
databases too slow to respond
Big Data Life Cycle
• Several phases:
– Acquisition
– Extraction and cleaning
– Integration, Aggregation and
Representation
– Modeling and Analysis
– Interpretation
Acquisition
• The data is usually recorded by
sensors and transmitted to a
computer
– In the case of computer data the sensor
is software: an application, a web
server, web browser or part of the
network
• Often the data is so large that
real-time filtering/compression is
required to reduce it
– E.g. Large Hadron Collider (11 TB/sec)
or the upcoming Square Kilometre
Array telescope (100 million TB/day)
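Real-time filtering can be pictured as a function that looks at each reading once and decides on the spot whether to keep it. The sketch below is a deliberately simple stand-in: the name `keep_interesting`, the threshold rule and the sample values are assumptions, not how any particular instrument actually selects data.

```python
def keep_interesting(readings, threshold):
    """Yield only readings whose magnitude is at or above the threshold.

    Works on any iterable, so it can sit directly on an incoming stream
    without first storing everything.
    """
    for value in readings:
        if abs(value) >= threshold:
            yield value


# Hypothetical stream: mostly small background values, two large ones.
stream = [0.1, 0.3, 5.2, 0.2, -7.8, 0.05]
kept = list(keep_interesting(stream, threshold=1.0))
print(kept)
```

Here only two of the six readings survive, so the volume passed downstream shrinks before anything is written to storage.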
Extraction and cleaning
• Data arrives in a variety of formats
• Consider health care data
– Raw data from sensors, each in its own
format
– Admission data from an existing
database
– Physician comments
• These diverse sources must be
formatted to be usable
Cleaning
• Sensor data may have errors from
transmission or interference
• Transcription of handwritten
information may reflect the bias of
the author
• Part of the extraction process is to
attempt to clean up the data
• No predefined way to do so
– Dependent on the source of the data,
which determines the types of errors
possible
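Because the cleaning rule depends on the source, a cleaner is usually written per kind of data. As one hedged example, the sketch below applies a plausibility range check to sensor readings; the function name `clean_readings`, the body-temperature bounds and the sample values are assumptions chosen for illustration.

```python
def clean_readings(readings, low, high):
    """Split readings into plausible values and rejected ones.

    A reading is rejected if it is missing (None) or falls outside
    the closed range [low, high].
    """
    kept, rejected = [], []
    for value in readings:
        if value is not None and low <= value <= high:
            kept.append(value)
        else:
            rejected.append(value)
    return kept, rejected


# Hypothetical body temperatures in degrees Celsius, including a
# dropped sample, a misplaced decimal point and an impossible value.
raw = [36.8, 37.1, None, 370.0, 36.9, -5.0]
good, bad = clean_readings(raw, low=30.0, high=45.0)
print(good)
print(bad)
```

Keeping the rejected values rather than silently discarding them makes it possible to check later whether the rule itself was too strict.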
Integration, Aggregation
and Representation
• How is the data stored for use?
• The use of data from multiple
sources complicates this
• Since the data represents many
different components it needs to be
stored in a form useful for modeling
and analysis
Provenance
• How and where the data was
collected
– Context information
• Multiple usage data may need to
record the provenance
– Web data originally collected to inform
customized advertising may be used to
study traffic patterns
• Single usage data, such as health
records, is easier
– Even health records might be
anonymized for statistical studies
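One possible way to record provenance is to carry it alongside the data and append a note at every processing step. The sketch below assumes hypothetical `Provenance` and `Record` classes, a hypothetical `anonymize` step and made-up field names; it shows the idea, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    source: str            # where the data was collected
    collected_for: str     # the original purpose
    steps: tuple = ()      # processing applied since collection

    def with_step(self, step):
        """Return a new Provenance with one more processing step."""
        return Provenance(self.source, self.collected_for,
                          self.steps + (step,))


@dataclass
class Record:
    payload: dict
    provenance: Provenance


def anonymize(record):
    """Drop the identifying field and note that in the provenance."""
    payload = {k: v for k, v in record.payload.items() if k != "user_id"}
    return Record(payload, record.provenance.with_step("anonymized"))


original = Record({"user_id": 42, "page": "/maps"},
                  Provenance("web-server-log", "advertising"))
cleaned = anonymize(original)
print(cleaned)
```

A later traffic-pattern study can then see that the record came from a web server log, was collected for advertising, and has had identifiers removed.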
Modeling and Analysis
• Once the data is established in a
data warehouse the data mining can
proceed
• Data Mining extracts previously
unknown and potentially useful
information
• Big data raises issues not present
in normal data mining
• Let us digress to traditional data
mining
A Local Retailer
• Sales tickets are collected at the
point of sale terminal
• These live in the store’s transactional
database to guide low- or mid-level
management with the following
information:
– Income and expenses
– Products selling poorly or well
• After a short time these are purged
from this database
The Data Warehouse
• At a corporate data warehouse the
information no longer needed for
day-to-day operations is
accumulated
– Every ticket from every store
• Once the data arrives it is retained
for years
• Data mining is used to give insight
into:
– Sales trends
– Types of shoppers
– How product
arrangement affects sales
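The simplest warehouse question is an aggregation over every ticket from every store. The sketch below assumes a made-up ticket layout (`store`, `product`, `amount_cents`) and a hypothetical helper `sales_by`; real data mining goes well beyond totals, but it starts from groupings like this.

```python
from collections import defaultdict


def sales_by(tickets, key):
    """Total the sales amount (in cents) grouped by the given field."""
    totals = defaultdict(int)
    for ticket in tickets:
        totals[ticket[key]] += ticket["amount_cents"]
    return dict(totals)


# Hypothetical tickets accumulated from two stores.
tickets = [
    {"store": "north", "product": "milk", "amount_cents": 350},
    {"store": "north", "product": "bread", "amount_cents": 250},
    {"store": "south", "product": "milk", "amount_cents": 350},
]

by_product = sales_by(tickets, "product")
by_store = sales_by(tickets, "store")
print(by_product)
print(by_store)
```

Amounts are kept as integer cents so that the totals are exact.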
Contrast
• Although the previous example is a big data
application, it also differs from many
others
• The data in this example is well
formed, accurate and from well
controlled sources
• Big data tends to be noisy, dynamic,
inter-related and not always
trustworthy
– Fortunately, the magnitude of the data
allows statistical methods to ignore the
noise
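As a small illustration of statistics shrugging off noise, compare the mean and the median of a set of readings that contains one bad value. The median is just one example of a robust statistic, chosen here for brevity; the readings are invented.

```python
import statistics

# Hypothetical temperature readings; 95.0 is a faulty sensor value.
readings = [20.1, 19.9, 20.0, 20.2, 19.8, 20.0, 95.0, 20.1, 19.9]

mean = statistics.mean(readings)
median = statistics.median(readings)

# The mean is dragged upward by the single outlier,
# while the median stays at a typical value.
print(f"mean={mean:.2f} median={median:.2f}")
```

With more readings, a single bad value matters even less, which is the sense in which sheer volume helps.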
Interpretation
• What does the data tell us?
– This is the really hard part and must be
done by a person
• Part of the problem is understanding
the pipeline from acquisition to
model
• This involves understanding the
types of errors:
– Data errors
– Programming errors
– Assumptions made in the models
Challenges of Big Data
• Heterogeneity
– People are better at processing
different types of data than machines
– Also preserving the provenance
throughout the pipeline
• Inconsistency and Incompleteness
– Sensors are notoriously unreliable
– Statistics can help, but the modeling
must account for the possibility
Challenges 2
• Scale
– Data volume has been increasing much
faster than Moore’s Law
– Storing the data is a problem
– Processing the data in reasonable
times requires substantial computer
power
• Timeliness
– We generally cannot store all the data
– Filtering and compressing the data
becomes important
– This must be done without loss of
usefulness
Challenges 3
• Privacy and data ownership
– Health care has the most legislation
concerning what may be shared and
how it may be shared
– After Edward Snowden revealed what
the NSA was saving, there have been
considerable privacy concerns
– For example, most smart phones have
some location awareness
– This is good for finding a restaurant
nearby and even better for tracking you
Up and Out
• Scaling Up
– Specialized hardware – expensive
– Simplified software
– Single point of failure
• Scaling Out
– Commoditized hardware
– Specialized software for replication,
querying and communications
• This is the hard part
– Any point may fail, but with no apparent
loss of availability
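The replication idea behind scaling out can be sketched in a few lines: each key is stored on more than one node, so a read can fall back to a copy when a node is down. Everything here is an assumption for illustration (the function names `nodes_for_key` and `read`, the node names, two replicas per key, and the simple hash placement); production systems use considerably more elaborate schemes.

```python
import hashlib


def nodes_for_key(key, nodes, replicas=2):
    """Pick `replicas` distinct nodes for a key using a stable hash.

    Assumes replicas <= len(nodes).
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def read(key, nodes, failed, replicas=2):
    """Return the first replica node that has not failed, or None."""
    for node in nodes_for_key(key, nodes, replicas):
        if node not in failed:
            return node
    return None


cluster = ["node-a", "node-b", "node-c", "node-d"]
placement = nodes_for_key("ticket-1001", cluster)
primary = placement[0]

# Even with the primary down, the read is served by the second copy.
survivor = read("ticket-1001", cluster, failed={primary})
print(placement, survivor)
```

The read only fails when every replica of a key is down at once, which is why commodity hardware can still give good availability.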
Summary
• Big data is characterized by 4 Vs
– Volume
– Velocity
– Variety
– Veracity
• Hadoop is an open source system
– Replicates and distributes the data
– Uses map and reduce scripts to
process
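The map and reduce idea can be shown in plain Python with the classic word count. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the input lines are assumptions, and the grouping step that a framework would perform across many machines is done here in a single process.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]

# Map each line independently, then reduce each key independently.
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each line is mapped on its own and each key is reduced on its own, both phases can be spread across many boxes.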