machine_learning_overview_ICS90_Oct2015x

Transcript machine_learning_overview_ICS90_Oct2015x

Machine Learning: Adventures in Big Data
Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine
2
A Revolution in the Technology of Data
Graphic from Ray Kurzweil, singularity.com
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
3
200 million
emails sent
350,000
new tweets
2.5 million
search queries
issued
Graphic from http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html, downloaded in 2011
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
4
Graphics from Lars Backstrom, ESWC 2012
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
5
Google Search query = “beer”, over time
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
6
Google Search query = “beer”, over time
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
7
Property Database
+
Property
Listings
Mortgage
Applications
90 Million
Property Tax
Database
2.6 Million Active
Property Listings
99% of US Properties
+
+
+
Loan Data
50 million
active loans
Tax history on
125 M parcels
+
Bankruptcies
27 Million
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
8
Geolocated (GPS) Twitter Data
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
9
Real-Time Sports Data Analysis
Graphics from www.stats.com/sportvu/sportvu.asp
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
10
Self-Driving Cars
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
11
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
12
MACHINE LEARNING FOR AUTOMATED CLASSIFICATION
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
13
Real-World Examples of Classification
Application
Spam email
Sentiment detection
Credit approval
Face detection
Speech recognition
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
14
Real-World Examples of Classification
Application
Input Variables
Spam email
Words in an email
Sentiment detection
Words in a review
Credit approval
Customer attributes
Face detection
Image pixel values
Speech recognition
Audio features
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
15
Real-World Examples of Classification
Application
Input Variables
Output Classification
Spam email
Words in an email
Spam or not?
Sentiment detection
Words in a review
Positive review?
Credit approval
Customer attributes
Good risk or not?
Face detection
Image pixel values
Face in image or not?
Speech recognition
Audio features
What word was spoken?
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
16
Real-World Examples of Classification
Application
Input Variables
Output Classification
Spam email
Words in an email
Spam or not?
Sentiment detection
Words in a review
Positive review?
Credit approval
Customer attributes
Good risk or not?
Face detection
Image pixel values
Face in image or not?
Speech recognition
Audio features
What word was spoken?
Goal of machine learning is to learn a mapping from inputs to output
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
17
Each dot is a 2-dimensional point
(or vector) representing one person
= [ AGE, MONTHLY INCOME]
14000
12000
MONTHLY INCOME
10000
8000
6000
4000
2000
0
0
10
20
30
40
50
60
70
80
90
AGE
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
18
Blue dots = good loans
Red dots = bad loans
14000
Good
boundary?
12000
MONTHLY INCOME
10000
8000
6000
4000
Better
boundary?
2000
0
0
10
20
30
40
50
60
70
80
90
AGE
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
19
Mathematics of Decision Boundaries
The equation of a line in 2-dimensions is
ax+by+c =0
where x and y are the AGE and INCOME values
a, b, c are the “weights” (or parameters) of our classifier
Learning = finding the values for the weights (e.g., a, b, c) that
minimize an error function, e.g., how many points are misclassified
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
20
Initial guess for a, b, c
(not very good, high error)
14000
12000
MONTHLY INCOME
10000
8000
6000
4000
2000
0
0
10
20
30
40
50
60
70
80
90
AGE
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
21
Initial guess for a, b, c
(not very good, high error)
14000
12000
MONTHLY INCOME
10000
8000
6000
Final solution for a, b, c
(much better, low error)
4000
2000
0
0
10
20
30
40
50
60
70
80
90
AGE
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
22
Machine Learning Algorithms in Practice
Instead of 2 dimensions we often have 100,000 dimensions
– e.g., text documents where each word is a dimension
Often have a highly non-linear boundary instead of a line
– e.g., a neural network model
Learning algorithm uses sophisticated search and optimization methods
– e.g., multidimensional gradient information
All of this is based on fundamental concepts in
–
–
–
–
–
Multidimensional geometry
Multivariable calculus
Linear algebra
Optimization and search algorithms
Probability and statistics
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
23
An Artificial Neuron Model
x1
x2
f(x)
x3
Each “edge” has an associated weight or parameter
x4
Weighted sum goes through a non-linearity
Output is a weighted sum of the inputs
Goal: learn the weights that best predict the output
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
24
A Neural Network with 1 Hidden Layer
x1
x2
f(x)
x3
x4
Can recursively create more complex prediction models
Many more weights now….requires more data to estimate
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
25
Deep Learning: Models with 2 or More Hidden Layers
We can build on this idea to create “deep models” with many hidden layers
x1
x2
f(x)
x3
x4
The model is now a very flexible highly non-linear function
“Deep learning” is a very hot topic in machine learning just now
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
26
Optimization
Easy
(convex)
E(a)
a
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
27
Optimization
Easy
(convex)
E(a)
a
E(a)
Hard
(non-convex)
a
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
28
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
29
Figure from Krizhevsky, Sutskever, Hinton, 2012
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
30
Example: Typing a Query into Google
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
31
Steps:
1. Find matching documents
2. Compute features
3. Predict relevance given
features
4. Rank by relevance
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
32
Steps:
1. Find matching documents
2. Compute features
3. Predict relevance given
features
4. Rank by relevance
This requires machine learning, i.e.,
learning a predictive model offline
given billions of previous searches
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
33
?
?
?
Each ? represents
an “ad slot”
In a fraction of a
second, algorithms
predict which ads
you are most likely
to click on (from
1000’s of ads)
?
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
34
The ads that are
most likely to lead
to a click are
selected and
displayed
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
35
HOW CAN YOU GET INVOLVED IN MACHINE LEARNING?
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
36
Useful Course Background
AI and Machine Learning Courses
Applied Mathematics
– Linear algebra, numerical methods, optimization
Probability and Statistics
– Basic principles, distributions, Markov chains, etc
Algorithms, Data Structures, Databases
– E.g. efficient algorithms for indexing data
Graph Theory
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
37
Intelligent Systems Specialization in Computer Science Major
Required “core” courses
– CS 171: Introduction to Artificial Intelligence
– CS 175: Project in Artificial Intelligence
– CS 178: Machine Learning and Data Mining
Relevant elective courses
–
–
–
–
CS 121: Information Retrieval
CS 125: Next Generation Search Systems
CS 172B: Neural Networks and Deep Learning
CS 179: Algorithms for Probabilistic and Deterministic Graphical Models
Additional useful courses
– CS 122A: Introduction to Data Management
– CS 163: Graph Algorithms
– CS 169: Introduction to Optimization
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
38
New Undergraduate Major in Data Science
Motivation
Increasing industry demand for data analysis skills:
“The United States alone faces a shortage of 140,000 to 190,000 people with deep
analytical skills as well as 1.5 million managers and analysts to analyze big data and
make decisions based on their findings. The shortage of talent is just beginning.”
(McKinsey Global Institute Study on Big Data, 2011)
Goal
Provide students with skills at the intersection of computing and statistics
• Databases, algorithms, machine learning, visualization (computer science)
• Statistical models, probability, prediction (statistics)
Current Status
Freshmen can enroll in the major from Fall onwards
Freshman seminar in Data Science (Stats 5) will be offered in Winter quarter
Program is run jointly by Statistics and Computer Science departments
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
39
Years 1 and 2: foundational courses in computer science, mathematics,
statistics, including statistical computing
Year 3: sample program
Fall
Winter
Spring
Stats 110, Statistical Methods
for Data Analysis I
CS 161, Design and Analysis of
Algorithms
In4matx 43, Introduction to
Software Engineering
GE IV/VIII,
Stats 111, Statistical Methods
for Data Analysis II
CS 178, Machine Learning and
Data-Mining
ICS 139W, Critical Writing on
Information Technology
GE III/VII,
Stats 112, Statistical Methods
for Data Analysis III
CS 122A, Introduction to Data
Management
In4matx 143, Information
Visualization
GE VI,
Year 4: two-quarter capstone “data-intensive” project, + statistics and CS electives
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
40
Careers in Machine Learning and Data Science
• Internet/Computing
– Google, Facebook, Microsoft, Apple, Intel, IBM, + many more
– Ecommerce: Walmart, Amazon, eBay, etc
• Banking and Finance
– Banks, credit card companies (Experian), insurance companies
– Investment firms, Wall Street
• Health
– Health insurers, personalized medicine/genomics
• Engineering
– Automotive companies, sensor analytics, systems monitoring
• Government Labs
– NASA/JPL, Department of Defense, intelligence agencies.
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
41
Research in Machine Learning/Data Science
• PhD programs
– Usually in computer science departments
– Undergraduate degree in computer science, statistics, math, EE
– Strong knowledge of mathematics is a plus
• Typically takes 5 years to do a PhD
–
–
–
–
First 2 years mostly graduate course-work
Last 3 years doing research, publishing papers, writing a thesis
Work closely with a thesis advisor (faculty)
Can spend summers working in industry (e.g., Silicon Valley)
• Career opportunities
– Industry: same as for undergraduate, with more research, more responsibility
– Research labs in industry and government (often require a PhD)
– Academic faculty and instructor positions (always requires a PhD)
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
42
Thank you…..questions?
More Information about Machine Learning @ UCI
Center for Machine Learning and Intelligent Systems
http://cml.ics.uci.edu
UC Irvine Data Science Initiative
http://datascience.uci.edu
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
43
BACKUP SLIDES
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
44
Visualizing what the Hidden Units are Learning
Figure from Lee et al., ICML 2009
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
45
Concluding Comments
• “Big Data” is all around us
– Mostly non-traditional data such as text, Web clicks, etc
• Predictive modeling is very widely applied
– email, search queries, product recommendation, social networks, etc
• Foundational ideas are relatively simple
– Data is often a big sparse matrix
– Machine Learning = {Model + Objective Function + Optimization}
• Successful solutions require a combination of
– Probability and statistics (basic concepts are a good start)
– Algorithms (machine learning ideas)
– Data Engineering (indexing, parallelization, etc)
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
46
from IEEE Intelligent Systems, 2009
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
47
Particle Physics: Large Hadron Collider at CERN
700 Mbytes/second
60 Terabytes/day
20 Petabytes/year
1 Terabyte = 1012 bytes
1 Petabyte = 1015 bytes
Detecting new types of particles = “Needle in a haystack”
New algorithms for searching massive amounts of data to
find unusual patterns
Professor Daniel Whiteson
Department of Physics, UC Irvine
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
48
Email Data
Time plot of 1/3 million emails sent by Stephen Wolfram over 20 yearssince
1989
From:
Working on book
Return to “normal” life
Figures from The Personal Analytics of My Life
blog.stephenwolfram.com, March 2012
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
49
Life’s Daily Rhythms
Figures from The Personal Analytics of My Life
blog.stephenwolfram.com, March 2012
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
50
Example: Detecting Faces and Pedestrians in Images
Figures from Le Cun and Ranzato,
ICML 2013 Tutorial
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
51
Example: Face Detection and Recognition
A particular pixel location
Pixels
A particular
window in an
image
Image
Windows
Can replace pixels by “local features”: edges, averages, filter responses,…..
Could easily have 1 million columns – very high-dimensional!
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
52
Example: Gene Expression Data
Data Matrix:
Rows = genes
Columns = patients
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
53
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
54
The Vector Representation
• Each component = measurement/variable
• Examples:
Text -> vector of word counts, d = 105, 106
Web -> vector of number visits to different sites, d = 105
Images -> vector of pixel values, d = 108
Purchases -> vector of counts of purchases in categories, d = 104
• Sparsity
– Many counts = 0
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
55
Particle Physics: Large Hadron Collider at CERN
700 Mbytes/second
60 Terabytes/day
20 Petabytes/year
1 Terabyte = 1012 bytes
1 Petabyte = 1015 bytes
Detecting new types of particles = “Needle in a haystack”
New algorithms for searching massive amounts of data to
find unusual patterns
Professor Daniel Whiteson
Department of Physics, UC Irvine
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
56
Basic Concepts
Training data
e.g., {x, y} pairs, where y = target, x = input
x is typically a vector, d dimensions
Prediction Model
y = f( x | a),
e.g., f( x | a) = a 0 + a 1 x1 + a 2 x2 + …. + a d xd
– The a’s are the parameters of the model
– Goal: learn good values for the a’s are from the training data
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
57
Learning a Model as Optimization
Model parameters
Empirical Loss
Target y
Prediction at input x
Sum over all rows
in the data matrix
We want to find q to minimize the empirical loss
This is an optimization problem, can be solved by any
of a variety of optimization techniques
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
58
Data Matrices
• Treating data as a matrix is very useful
– Can leverage many concepts from linear algebra
– Can handle any data that can be represented as a matrix
– Same mathematics and algorithms for many types of data
• Allows us to think about data modeling in an abstract and general way
• Models used in practice are often quite simple
– e.g., weighted sums of the variables (many variations on this theme)
• But what about documents, images, speech?
– Key idea: represent each document or image as a fixed length vector/row
– Set of documents/images is just another matrix
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
59
The Importance of Optimization
Typical optimization problem in machine learning:
Minimize E(a) =
Si d ( yi
, f( xi | a ) )
where d is some distance function like squared error
Here a is a vector of parameters, e.g., 100 or 1000 or more parameters
So we are performing this optimization in a high-dimensional space
Usually no direct solution – many iterative gradient-based techniques
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
60
Your Data as a Vector
Your email
Your Web clicks
Your search queries
d-dimensional vector x
(often sparse, high-dimensional)
Your purchases
Your photos
Many problems can be effectively represented
using vectors, but not all
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
61
Sets of Vectors
• Each vector = point in d-dimensional space
• Naturally leads to geometric notions for data analysis
– “distance”
– “clusters”
– “class boundaries”
• N individuals, each with a d-dimensional vector of measurements
• Data Matrix = N x d matrix or array of data
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
62
Example: Document Classification
A particular word, e.g., “cat”
Words
A particular
document
Number of
times “cat” is in
the document
Documents
Given a matrix representation of the data we can now apply various algorithms
Idea is widely used in practice and is surprisingly powerful
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
63
Example: Credit Scoring
Attributes of Applicants (AGE, INCOME, etc)
Past Loan
Awardees
Training Data
(used to learn multiple models)
Target Variable
is Known
Test Data
(used to test the models)
New Loan
Applicants
Future Data
Target Variable
is Unknown
(using the model to make predictions)
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
64
Recommender Systems
e.g., for Web sites that recommend books, movies, etc
Items
1
3
4
3
5
4
5
5
5
2
2
3
Users
3
2
5
2
3
1
1
3
1
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
65
Recommender Systems
e.g., for Web sites that recommend books, movies, etc
Items
1
3
4
3
5
4
?
5
5
5
2
2
3
Users
3
2
?
5
2
3
1
1
3
1
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
66
The $1 Million Question
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
67
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
68
Example: Representing Documents with Word Counts
A particular word, e.g., “cat”
Words
A particular
document
Number of
times “cat” is in
the document
Documents
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
69
Training and Prediction
Input Variables
Labeled
Examples
Training Data
(used to learn the model)
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
70
Training and Prediction
Input Variables
Labeled
Examples
Training Data
Class Labels
are Known
(used to learn the model)
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
71
Training and Prediction
Input Variables
Labeled
Examples
Training Data
Class Labels
are Known
(used to learn the model)
Unlabeled
Examples
Future Data
Class Labels
are Unknown
(using the model to make predictions)
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
72
Learning to Suggest Friends
Problem: automatically suggest friends
Restrict to “friends of friends”
– Still leaves 40,000 possibilities on average
From Lars Backstrom, ESWC 2011
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015
73
Machine Learning Solution
Machine learning resulted in a
major increase in click-through
rate
From Lars Backstrom, ESWC 2011
Padhraic Smyth, UC Irvine: ICS 90, Fall 2015

machine_learning_overview_ICS90_Oct2015x

Transcript machine_learning_overview_ICS90_Oct2015x

Directory