Dr. Bin Zhou guest lecture on data mining
Big Data
• What is Big Data?
• Recently much good science, whether physical,
biological, or social, has been forced to confront - and has often benefited from - the Big Data
phenomenon.
• Big Data refers to the explosion in the quantity
(and sometimes, quality) of available and
potentially relevant data, largely the result of
recent and unprecedented advancements in
data recording and storage technology. (p. 115)
Diebold, F.X. (2003), "Big Data Dynamic Factor Models for Macroeconomic Measurement
and Forecasting: A Discussion of the Papers by Reichlin and Watson," in M. Dewatripont, L.P. Hansen and S. Turnovsky (eds.), Advances in Economics and Econometrics:
Theory and Applications, Eighth World Congress of the Econometric Society, Cambridge
University Press, 115-122.
Big data spans four dimensions: Volume, Velocity, Variety, and Veracity
• The original 3Vs definition (Volume, Velocity, Variety) is widely used by
Gartner and much of the industry
• A fourth V, “Veracity”, has been introduced by some
organizations
• Volume: Enterprises are awash with ever-growing data of all types, easily
amassing terabytes, even petabytes, of information.
– Turn 12 terabytes of Tweets created each day into improved product
sentiment analysis
– Convert 350 billion annual meter readings to better predict power
consumption
• Velocity: Sometimes 2 minutes is too late. For time-sensitive processes
such as catching fraud, big data must be used as it streams into your
enterprise in order to maximize its value.
– Scrutinize 5 million trade events created each day to identify
potential fraud
– Analyze 500 million daily call detail records in real-time to predict
customer churn faster
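To make "used as it streams in" concrete, here is a minimal sketch of scoring events one at a time instead of waiting for a batch. It keeps a running mean and variance with Welford's online algorithm and flags values far from what has been seen so far. The trade amounts, the `threshold` of 3 standard deviations, and the `warmup` count are illustrative assumptions, not values from the lecture, and real fraud detection uses far richer features.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; 0.0 until at least two values are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_outliers(stream, threshold=3.0, warmup=5):
    """Yield (index, value) for events far from the running mean.

    threshold and warmup are illustrative assumptions, not tuned values.
    """
    stats = RunningStats()
    for i, x in enumerate(stream):
        # Score each event against statistics of the events seen before it.
        if stats.n >= warmup:
            sd = stats.std()
            if sd > 0 and abs(x - stats.mean) / sd > threshold:
                yield i, x
        stats.update(x)


# Hypothetical trade amounts; the 5000.0 is an injected anomaly.
trades = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 5000.0, 103.0]
suspicious = list(flag_outliers(trades))
print(suspicious)
```

Note that this sketch folds the flagged value back into the running statistics, which inflates the variance afterwards; a production system would likely exclude or down-weight flagged events.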
• Variety: Big data is any type of data: structured and unstructured data
such as text, sensor data, audio, video, click streams, log files and
more. New insights are found when analyzing these data types together.
– Monitor hundreds of live video feeds from surveillance cameras to
target points of interest
– Exploit the 80% data growth in images, video and documents to improve
customer satisfaction
• Veracity: 1 in 3 business leaders
don’t trust the information they
use to make decisions.
– How can you act upon information if you
don’t trust it?
– Establishing trust in big data presents a
huge challenge as the variety and number
of sources grow.
Where Does Big Data Come From?
• Our Data-driven World
– Science
• Databases from astronomy, genomics,
environmental data, transportation data,
…
– Humanities and Social Sciences
• Scanned books, historical documents,
social interactions data, new technology
like GPS, …
– Business & Commerce
• Corporate sales, stock market
transactions, census, airline traffic, …
– Entertainment
• Internet images, Hollywood movies, MP3
files, …
– Medicine
• MRI & CT scans, patient records, …
Usage Example in Big Data
US 2012 Election
Obama campaign
- predictive modeling
- mybarackobama.com
- drive traffic to other campaign sites
  - Facebook page (33 million “likes”)
  - YouTube channel (240,000 subscribers and 246 million page views)
- a contest to dine with Sarah Jessica Parker
- every single night, the team ran 66,000 computer simulations
- Amazon Web Services
Romney campaign
- Orca big-data app (however, ORCA suffered many failures)
- YouTube channel (23,700 subscribers and 26 million page views)
Usage Example in Big Data (cont.)
Data analysis predictions for the US 2012 Election
- Drew Linzer, June 2012
  - Predicted 332 electoral votes for Obama, 206 for Romney
- Nate Silver’s FiveThirtyEight blog
  - Predicted Obama had an 86% chance of winning
  - Predicted all 50 states correctly
- Sam Wang, the Princeton Election Consortium
  - Put the probability of Obama’s re-election at more than 98%
- Meanwhile, the media continued reporting the race as very tight
Big Challenge in Big Data
• How to convert big data into useable
information by identifying patterns and
deviations from those patterns?
• The big data challenge requires talent
– People highly skilled in programming and data
analysis who can extract meaningful information
and insights
Big Data Techniques and Technologies
• Common Skill Sets
– Data analysis is the cornerstone
– Education and experience in data analysis, business
analytics, mathematics, statistics, quantitative skills
• A/B testing
• Association rule learning
• Classification
• Cluster analysis
• Crowdsourcing
• Data fusion and data integration
• Data mining
• Ensemble learning
• Genetic algorithms
• Machine learning
• Natural language processing
• Neural networks
• Network analysis
• Optimization
• Pattern recognition
• Predictive modeling
• Regression
• Sentiment analysis
• Signal processing
• Spatial analysis
• Statistics
• Supervised learning
• Simulation
• Time series analysis
• Unsupervised learning
• Visualization
• …
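Several of the techniques listed above reduce to short calculations. As one example, A/B testing compares an outcome rate between two variants and asks whether the difference is larger than chance would explain. The sketch below uses a standard two-proportion z-test with only the Python standard library; the visitor and conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z, p_value). Relies on the normal approximation, so it
    assumes reasonably large samples in both groups.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: variant A converts 200 of 5000 visitors,
# variant B converts 260 of 5000.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between variants is unlikely to be due to chance alone; the significance cutoff (often 0.05) is a convention chosen before the experiment, not something the test itself provides.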
Big Questions about Big Data
• What happens in a world of radical transparency,
with data widely available?
• If you could test all your decisions, how would
that change the way you compete?
• How would your business change if you used big
data for widespread, real time customization?
• How can big data augment or even replace
management?
• Could you create a new business model based on
data?
• …
Related Careers in Big Data
• Data scientist
– Often at the top of the big data hierarchical chart
– Typically proven professionals who possess deep
analytical talent
• Data architect
– Computer programmers who are skilled in working
with undefined data and disparate types of data
• Data visualizer
– Professionals who are able to translate data into
information that people can effectively use
• Data change agent
– Use data analytics to recommend and drive changes
within an organization
• Data engineer and operator
– Designers, builders and managers of big data systems
Job Opportunities in Big Data
Demand for Deep Analytical Talent in US
Source: McKinsey
• There will be a shortage of talent necessary for organizations
to take advantage of big data.
• By 2018, the United States alone could face a shortage of
140,000 to 190,000 people with deep analytical skills, as well as
1.5 million managers and analysts with the know-how to use the
analysis of big data to make effective decisions.
• The Big Data industry is worth more than $100 billion and is
growing at almost 10% a year (roughly twice as fast as the software
business).
Relevant IS Courses
• IS 410: Introduction to Database Design
– Discuss the process of database development,
including data modeling, database design, and
database implementation
• IS 420: Database Application Development
– Offer hands-on experience for developing
client/server database applications using a major
database management system
• IS 427: Introduction to Artificial Intelligence:
Concepts and Applications
– Provide an introduction to, and hands-on
experience with, several Artificial Intelligence (AI)
techniques
• IS 428: Data Mining Techniques and
Applications
– Learn both how data mining techniques work and
how to apply data mining to various business and
organizational contexts