
Introduction to Supervised
Machine Learning Concepts
PRESENTED BY B. Barla Cambazoglu | February 21, 2014
Guest Lecturer’s Background
Lecture Outline
• Basic concepts in supervised machine learning
• Use case: Sentiment-focused web crawling
Basic Concepts
What is Machine Learning?
• Wikipedia: “Machine learning is a branch of artificial intelligence, concerning the construction and study of systems that can learn from data.”
• Arthur Samuel: “Field of study that gives computers the ability to learn without being explicitly programmed.”
• Tom M. Mitchell: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Unsupervised versus Supervised Machine Learning
• Unsupervised learning
  › Assumes unlabeled data (the desired output is not known)
  › Objective is to discover the structure in the data
• Supervised learning
  › Trained on labeled data (the desired output is known)
  › Objective is to generate an output for previously unseen input data
Supervised Machine Learning Applications
• Common
  › Spam filtering
  › Recommendation and ranking
  › Fraud detection
  › Stock price prediction
• Not so common
  › Recognize the user of a mobile device based on how he holds and moves the phone
  › Predict whether someone is a psychopath based on his Twitter usage
  › Identify whales in the ocean based on audio recordings
  › Predict in advance whether a product launch will be successful or not
Terminology
• Instance
• Label
• Feature
• Training set
• Test set
• Learning model
• Accuracy

Toy problem: to predict the income level of a person based on his/her facial attributes.
Instances
Categorical Labels
Numeric Labels
(figure: instances labeled with incomes of $1K, $2K, $3K, $4K, $5K, $7K, $8K, $9K, $11K, and $12K)
Features
(figure: feature values for four example instances, e.g., Blonde / No / White / No / Male / 5cm; Bald / No / White / Yes / Male / 0cm; White / No / Black / Yes / Male / 3cm; Dark / Yes / White / No / Female / 12cm)
Training Set
(figure: the same four labeled instances shown above form the training set)
Test Set
(figure: four previously unseen instances, e.g., Dark / No / White / No / Female / 14cm; Dark / No / White / Yes / Male / 6cm; Dark / No / Black / No / Male / 6cm; Dark / Yes / White / No / Female / 15cm)
Training and Testing
(diagram: a set of training instances is used in training to build a model; in testing, the model produces a prediction for each test instance)
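The diagram corresponds to the fit/predict pattern found in most machine-learning libraries. Below is a minimal sketch of that flow, assuming scikit-learn and a tiny made-up dataset loosely modeled on the toy income-prediction problem (the lecture itself does not prescribe a library or these exact features):

```python
# Minimal training/testing flow; library choice (scikit-learn), features,
# and labels are illustrative assumptions, not part of the lecture.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Training instances: facial attributes with known income-level labels.
train_instances = [
    {"hair": "blonde", "gender": "male",   "hair_length_cm": 5},
    {"hair": "bald",   "gender": "male",   "hair_length_cm": 0},
    {"hair": "white",  "gender": "male",   "hair_length_cm": 3},
    {"hair": "dark",   "gender": "female", "hair_length_cm": 12},
]
train_labels = ["low", "low", "high", "high"]  # hypothetical labels

# Encode categorical features as numeric vectors.
vectorizer = DictVectorizer(sparse=False)
X_train = vectorizer.fit_transform(train_instances)

# Training: learn a model from the labeled instances (any classifier would do here).
model = DecisionTreeClassifier().fit(X_train, train_labels)

# Testing: produce a prediction for a previously unseen instance.
test_instance = {"hair": "dark", "gender": "male", "hair_length_cm": 6}
print(model.predict(vectorizer.transform([test_instance])))
```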
Accuracy
(figure: actual labels versus predicted labels for four test instances)
Accuracy = number of correct predictions / total number of predictions = 2 / 4 = 50%
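The 2-out-of-4 example on the slide can be reproduced in a few lines (the label values below are hypothetical):

```python
# Hypothetical actual vs. predicted labels giving 2 correct out of 4.
actual    = ["low", "high", "high", "low"]
predicted = ["low", "low",  "high", "high"]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"Accuracy = {correct} / {len(actual)} = {accuracy:.0%}")  # 2 / 4 = 50%
```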
Precision and Recall
• In certain cases, there are two class labels and predicting a particular class correctly is more important than predicting the other.
• A good example is top-k ranking in web search.
• Performance measures:
  › Recall
  › Precision
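A minimal sketch of both measures for a binary problem, with hypothetical labels and class 1 taken as the "important" (positive) class:

```python
# Precision and recall, treating label 1 as the positive class (hypothetical data).
actual    = [1, 0, 1, 1, 0, 1]
predicted = [1, 1, 1, 0, 0, 1]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # fraction of predicted positives that are correct
recall    = tp / (tp + fn)  # fraction of actual positives that are found
print(precision, recall)    # 0.75 0.75
```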
Some Practical Issues
• Problem: Missing feature values. Solution:
  › Training: use the most frequently observed (or average) feature value in the instance’s class.
  › Testing: use the most frequently observed (or average) feature value in the entire training set.
• Problem: Class imbalance. Solution:
  › Oversampling: duplicate the training instances in the small class.
  › Undersampling: use fewer instances from the bigger class.
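Both fixes are easy to express in code; a minimal sketch with hypothetical data (the imputation shown is the class-average variant, and the sampling ratios are illustrative):

```python
import random
from collections import Counter
from statistics import mean

# --- Missing feature values (None marks a missing value; data is hypothetical) ---
train = [{"len": 5, "label": "low"},  {"len": None, "label": "low"},
         {"len": 12, "label": "high"}, {"len": 14, "label": "high"}]

# Training-time fix: fill with the average feature value of the instance's class.
for cls in {"low", "high"}:
    cls_avg = mean(x["len"] for x in train if x["label"] == cls and x["len"] is not None)
    for x in train:
        if x["label"] == cls and x["len"] is None:
            x["len"] = cls_avg

# Testing-time fix: fill with the average over the entire training set.
test_instance = {"len": None}
if test_instance["len"] is None:
    test_instance["len"] = mean(x["len"] for x in train)

# --- Class imbalance (hypothetical label distribution) ---
labels = ["a"] * 13 + ["b"] * 4
majority = [l for l in labels if l == "a"]
minority = [l for l in labels if l == "b"]

oversampled  = majority + minority * (len(majority) // len(minority))  # duplicate the small class
undersampled = random.sample(majority, len(minority)) + minority       # fewer from the big class
print(Counter(oversampled), Counter(undersampled))
```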
Majority Classifier
• Training: find the class with the largest number of instances.
• Testing: for every test instance, predict that class as the label, independent of the features of the test instance.
(figure: an example with class sizes of 13, 8, and 4)
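A minimal sketch of the majority classifier, using the class sizes from the figure as hypothetical training labels:

```python
from collections import Counter

# Training: find the class with the largest number of instances.
train_labels = ["a"] * 13 + ["b"] * 8 + ["c"] * 4
majority_class, _ = Counter(train_labels).most_common(1)[0]

# Testing: predict that class for every test instance, ignoring its features.
def predict(test_instance):
    return majority_class

print(predict({"any": "features"}))  # 'a'
```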
k-Nearest Neighbor Classifier
• Training: none! (known as a lazy classifier).
• Testing: find the k instances that are most similar to the test instance and use majority voting to decide on the label.
(figure: an example with k = 3)
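A minimal k-NN sketch with k = 3, assuming numeric feature vectors and Euclidean distance (the lecture does not fix a particular similarity measure):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Distance from the test instance to every training instance.
    dists = [(math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)]
    # Majority vote among the k nearest neighbors.
    nearest = sorted(dists, key=lambda d: d[0])[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 2-D instances and labels.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["red", "red", "red", "blue", "blue", "blue"]
print(knn_predict(train_X, train_y, (2, 2)))  # 'red'
```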
Decision Tree Classifier
• Training: build a tree where leaves represent labels and branches represent features that lead to those labels.
• Testing: traverse the tree using the feature values of the test instance.
(figure: an example tree branching on whether the hair color is black or not black, with yes/no leaves)
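A minimal sketch of the testing-time traversal over a hand-built tree; the features, branch values, and labels are hypothetical, and in practice the tree structure would be learned from the training set:

```python
# A hand-built tree in the spirit of the figure: branch on hair color first,
# then on gender; leaves hold predicted income levels (all hypothetical).
tree = {
    "feature": "hair",
    "branches": {
        "black": {"label": "low"},
        "not black": {
            "feature": "gender",
            "branches": {"male": {"label": "high"}, "female": {"label": "low"}},
        },
    },
}

def tree_predict(node, instance):
    # Follow the branches matching the instance's feature values until a leaf.
    while "label" not in node:
        node = node["branches"][instance[node["feature"]]]
    return node["label"]

print(tree_predict(tree, {"hair": "not black", "gender": "male"}))  # 'high'
```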
Naïve Bayes Classifier
• Training: for every feature value v and class c pair, we compute and store in a lookup table the conditional probability P(v | c).
• Testing: for each class c, we compute a score from the stored probabilities P(v | c) of the test instance’s feature values and predict the class with the highest score.
(figure: example lookup-table entries, e.g., P(v | c) = 0.40, 0.65, and 0.78)
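A minimal Naïve Bayes sketch over categorical features (hypothetical data, no smoothing): training builds the P(v | c) lookup table, and testing multiplies the class prior by the stored conditional probabilities and picks the highest-scoring class:

```python
from collections import Counter, defaultdict

# Hypothetical categorical training data for the toy income problem.
train = [({"hair": "blonde", "gender": "male"},   "low"),
         ({"hair": "bald",   "gender": "male"},   "low"),
         ({"hair": "white",  "gender": "male"},   "high"),
         ({"hair": "dark",   "gender": "female"}, "high")]

# Training: count (feature value, class) co-occurrences to estimate P(v | c).
class_counts = Counter(label for _, label in train)
cond_counts = defaultdict(Counter)  # (feature, class) -> Counter of values
for features, label in train:
    for feat, value in features.items():
        cond_counts[(feat, label)][value] += 1

def p_value_given_class(feat, value, label):
    return cond_counts[(feat, label)][value] / class_counts[label]

# Testing: score each class as P(c) * product over features of P(v | c).
def predict(features):
    scores = {}
    for label, count in class_counts.items():
        score = count / len(train)  # class prior P(c)
        for feat, value in features.items():
            score *= p_value_given_class(feat, value, label)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict({"hair": "dark", "gender": "male"}))  # 'high'
```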
Other Commonly Used Classifiers
• Support vector machines
• Boosted decision trees
• Neural networks
Use Case: Sentiment-Focused Web Crawling
G. Vural, B. B. Cambazoglu, and P. Senkul, “Sentiment-focused web crawling”, CIKM’12, pp. 2020-2024.
Problem
• Early discovery of the opinionated content on the Web is important.
• Use cases
  › Measuring brand loyalty or product adoption
  › Politics
  › Finance
• We would like to design a sentiment-focused web crawler that aims to maximize the amount of sentimental/opinionated content fetched from the Web within a given amount of time.
Web Crawling
(figure: crawl graph growing from a seed page)
• Subspaces
  › Downloaded pages
  › Discovered pages (the frontier)
  › Undiscovered pages
Sentiment-Focused Web Crawling
• Challenge: to predict the sentimentality of an “unseen” web page, i.e., without having access to the page content.
(figure: crawl graph growing from a starting page)
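One way to picture such a crawler is as a best-first crawl loop whose frontier is a priority queue ordered by predicted sentimentality. A minimal sketch (not the paper's implementation), where `predict_sentimentality` and `fetch_and_extract_links` are hypothetical stand-ins for the learned model and the fetching machinery:

```python
import heapq

def sentiment_focused_crawl(seed_urls, budget, predict_sentimentality, fetch_and_extract_links):
    # Frontier of discovered-but-not-downloaded URLs, ordered by predicted
    # sentimentality (negated scores turn Python's min-heap into a max-heap).
    frontier = [(-predict_sentimentality(url), url) for url in seed_urls]
    heapq.heapify(frontier)
    downloaded, total_sentimentality = set(), 0.0

    while frontier and len(downloaded) < budget:
        _, url = heapq.heappop(frontier)
        if url in downloaded:
            continue
        page_sentimentality, out_links = fetch_and_extract_links(url)
        downloaded.add(url)
        total_sentimentality += page_sentimentality  # accumulated sentimentality
        for link in out_links:  # newly discovered pages enter the frontier
            if link not in downloaded:
                heapq.heappush(frontier, (-predict_sentimentality(link), link))
    return total_sentimentality
```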
Features
• Assumption: sentimental pages are more likely to be linked by other sentimental pages.
• Idea: build a learning model using features extracted from
  › Textual content of referring pages
  › Anchor text on the hyperlinks
  › URL of the target page
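A rough sketch of how such features might be assembled for a not-yet-downloaded URL; the feature set, the `SENTIMENT_WORDS` lexicon, and the `referring_pages` structure are illustrative assumptions, not the paper's exact features:

```python
import re
from statistics import mean

SENTIMENT_WORDS = {"love", "hate", "great", "awful", "amazing", "terrible"}  # hypothetical lexicon

def url_features(target_url, referring_pages):
    # referring_pages: list of (referring_page_sentiment_score, anchor_text) pairs.
    anchor_tokens = [t for _, anchor in referring_pages
                     for t in re.findall(r"[a-z]+", anchor.lower())]
    url_tokens = re.findall(r"[a-z]+", target_url.lower())
    return {
        # Textual content of referring pages (summarized here by their scores).
        "avg_referring_sentiment": mean(score for score, _ in referring_pages),
        # Anchor text on the hyperlinks pointing to the target page.
        "sentimental_anchor_tokens": sum(t in SENTIMENT_WORDS for t in anchor_tokens),
        # URL of the target page.
        "sentimental_url_tokens": sum(t in SENTIMENT_WORDS for t in url_tokens),
    }

print(url_features("http://example.com/i-love-this-product",
                   [(5.2, "great review"), (1.0, "home page")]))
```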
Labels
• Our data (ClueWeb09-B) lacks ground-truth sentiment scores.
• We created a ground-truth using the SentiStrength tool.
  › It assigns a sentiment score (between 0 and 8) to each web page as its label.
• A small-scale user study was conducted with three judges to verify the suitability of this ground-truth.
  › 500 random pages were sampled from the collection.
  › Pages were labeled as sentimental or not sentimental.
• Observations
  › 22% of the pages are labeled as sentimental.
  › High agreement between judges: the overlap is above 85%.
Learner and Performance Metric
• As the learner, we use the LibSVM software in regression mode.
• We rebuild the prediction model at regular intervals throughout the crawling process.
• As the main performance metric, we compute the total sentimentality score accumulated after fetching a certain number of pages.
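A minimal sketch of the regression setup and the metric, assuming scikit-learn's SVR (which wraps LibSVM) as a stand-in for the LibSVM tools and random numbers as placeholder features and scores:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical training data: feature vectors for already-downloaded pages and
# their sentimentality scores (labels in the 0-8 range, as with SentiStrength).
X_train = np.random.rand(100, 3)
y_train = np.random.uniform(0, 8, size=100)

# The prediction model would be rebuilt at regular intervals during the crawl.
model = SVR().fit(X_train, y_train)

# Predicted sentimentality scores for discovered-but-not-downloaded pages.
frontier_scores = model.predict(np.random.rand(10, 3))

# Main performance metric: total sentimentality accumulated after fetching N pages.
fetched_scores = y_train[:50]  # stand-in for the first 50 fetched pages
print(frontier_scores[:3], fetched_scores.sum())
```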
Evaluated Crawlers
• Proposed crawlers
  › based on the average sentiment score of referring page content
  › based on machine learning
• Baseline crawlers
  › random
  › indegree-based
  › breadth first
• Oracle crawlers
  › highest sentiment score
  › highest spam score
  › highest PageRank
Performance
(figure: performance comparison of the evaluated crawlers)