Transcript Lecture 7

IS6146 Databases for Management
Information Systems
Lecture 7: Introduction to Unstructured
Data and Predictive Analytics
Rob Gleasure
[email protected]
robgleasure.com
IS6146

Today’s session
 Data and predictive analytics
 What?
 Why?
 Unstructured data
 Types of unstructured data
 Data mining vs. information retrieval
 Regression
 Supervised learning
 Classification
 Unsupervised learning
 Clustering
Why are we interested in data?

A 2015 HBR article by Morey et al. argues companies use data in
three different ways
 Improve product or service
 Facilitate targeted marketing
 Sell data to third parties

Google search is an example of a digital business that combines all
of these
 Your search results are customised based on your past behaviour
 Ads are placed in front of you according to your history and
location
 Click-through behaviour and user overviews are provided to third
parties
Uses of data
Moving beyond self-reported data

That same article argues people create roughly three types of data
 Self-reported data
 Digital exhaust
 Profiling data
(Our discussions of structured and semi-structured data have really focused on self-reported data)

Due to the growth in biotechnologies and sensors, there’s an argument that ‘profiling data’ could be further broken down to differentiate between ‘digital behaviour profile’ data and ‘biometric data’
Self-reported data



Self-reported data is great for telling us about people’s perceptions,
conscious intentions, beliefs, e.g.
 How would users describe something?
 What do users think is important?
 Why do users make the choices they make?
However, this perception-based data is prone to several
inaccuracies, notably
 Focus group fever
 Explanation by rationalisation
 Impression management
It also has to be actively created…
Direct-measurement data

Direct-measurement data provides a record of what people are
actually doing

The move to the cloud means more and more of this is recorded/recordable, either passively or as a lower-effort by-product

However, because the focus is not on creating ‘data’ per se but on
performing some actions, expressing opinions, etc. the structure of
that data is not carefully prepared in advance – it is typically
unstructured
Unstructured Data

Unstructured data is generated by both humans and machines. This
includes
 Text and other multimedia
 Machine-to-machine communication

Examples of unstructured data include
 Social messages (e.g. emails, tweets, blog and Facebook posts)
 Business documents (e.g. reports, presentations, minutes from
meetings)
 Audio-visual content (e.g. audio files, photos, videos)
 Sensor readings (e.g. scanner feeds, imagery)
Types/sources of unstructured data
Image from http://www.datasciencecentral.com/profiles/blogs/structured-vs-unstructured-data-the-rise-of-data-anarchy
Structured vs. Unstructured Data
Image from http://www.datasciencecentral.com/profiles/blogs/structured-vs-unstructured-data-the-rise-of-data-anarchy
Predictive analytics

As the amount of data increases, we can look for patterns to make
sense of what’s happening, why it’s happening, what will happen in
the future, and what we should do to make things happen
Image from http://timoelliott.com/blog/2013/02/gartnerbi-emea-2013-part-1-analytics-moves-to-the-core.html
Mining unstructured data

Different ways of looking at it


E.g. “Non-trivial extraction of implicit, previously unknown and
potentially useful information from data”
(Piatetsky-Shapiro & Frawley, 1991)
Or, alternatively "Torturing data until it confesses ... and if you
torture it enough, it will confess to anything"
(Jeff Jonas, IBM)
Data Mining
Image from http://www.slideshare.net/VisionGEOMATIQUE2014/gagnon-20141112vision
Types of analyses of unstructured data

Different types of analysis include:
 Entity analysis – finding ways to group people or organisations
 Topic analysis – finding the topics or themes that occur most
frequently
 Sentiment analysis – finding how people feel about something (usually positive or negative); a crude sketch of this follows after this list
 Feature analysis – finding ways of viewing things that capture their most important qualities or characteristics of interest, e.g. visual patterns, interaction patterns, mentions of terms
 Relationship analysis – finding the causal or correlational links
between different groups, topics, sentiments, or features
 Lots of others
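To make sentiment analysis concrete, here is a minimal sketch of a crude word-list sentiment scorer; the word lists and example tweets are made up for illustration, and real sentiment analysis would use far richer lexicons or trained models.

```python
# A crude word-list sentiment scorer (illustrative only; word lists and
# tweets below are made up).
POSITIVE = {"great", "love", "happy", "excellent"}
NEGATIVE = {"awful", "hate", "sad", "terrible"}

def sentiment_score(text):
    # Positive minus negative word counts: > 0 suggests positive sentiment
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

for tweet in ["I love this, it is great", "Awful service, very sad"]:
    print(tweet, "->", sentiment_score(tweet))
```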
Data Mining

Some approaches are predictive (i.e. analyses are used to pre-empt
future states), e.g.
 Classification
 Regression

Other approaches are descriptive (i.e. analyses are used to spot
trends and patterns that may otherwise go unnoticed)
 Clustering

These make some fundamentally different assumptions
Regression

Regression measures one or more independent variables, then uses
them to predict dependent variables
 E.g. imagine we have a regression problem where we wish to
determine if happy or sad tweets predict how long someone has
been a customer
 We could go through the same process described previously; however, for each tweet we would also record the number of days since that user registered as a customer
 We could then plot each individual tweet’s happiness/sadness against the number of days since its originator registered
 A reliable trend indicates that tweets’ happiness/sadness does predict the number of days (a minimal sketch of this follows below)
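A minimal sketch of this regression idea, using made-up happiness scores and registration ages; numpy’s least-squares line fit stands in for whatever regression tool you actually use.

```python
# Hypothetical data: one happiness score per tweet and, for each tweet's
# author, the number of days since they registered as a customer.
import numpy as np

happiness = np.array([0.9, 0.2, 0.7, 0.1, 0.5, 0.8])
days_registered = np.array([400, 60, 310, 45, 200, 365])

# Fit a straight line: days = slope * happiness + intercept (ordinary least squares)
slope, intercept = np.polyfit(happiness, days_registered, deg=1)
print(f"days ~ {slope:.1f} * happiness + {intercept:.1f}")

# The correlation coefficient gives a rough sense of how reliable the trend is
print("correlation:", np.corrcoef(happiness, days_registered)[0, 1])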
Regression (continued)

Typically divided into
 Linear regression (where techniques are used to search for linear
relationships in a continuous dependent variable)
 Logistic regression (where techniques are used to predict one of two outcomes)

Uses of linear regression include
 Predicting market trends (e.g. what books someone will buy,
whether someone will like a movie)
 Predicting returns on expenditures

Uses of logistic regression include
 Predicting corporate fraud, loan defaults, etc. (a minimal sketch follows below)
 Predicting brand preferences
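A minimal sketch of logistic regression for one of these uses (loan defaults), with made-up features and labels and scikit-learn assumed to be available.

```python
# Predicting one of two outcomes (1 = defaulted, 0 = repaid) from two
# made-up features: income in EUR thousands and debt-to-income ratio.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[25, 0.1], [5, 0.6], [40, 0.2], [8, 0.7], [30, 0.3], [6, 0.8]])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

new_case = [[20, 0.4]]
print("predicted class:", model.predict(new_case))        # 0 or 1
print("probabilities:  ", model.predict_proba(new_case))  # P(repaid), P(default)
```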
Supervised vs. Unsupervised Learning
[Diagram: machine learning divides into supervised learning and unsupervised learning]
Supervised Learning - Classification

Typically, most predictive approaches require supervised learning,
i.e. we help the algorithm to ‘learn by examples’
 E.g. imagine we have a classification problem where we wish
to classify tweets as either happy or sad
 We could read one tweet, then label it happy, read another,
then label it sad. Eventually we would have a large training set
of tweets.
 Our learning algorithm could then look for similarities and
differences in happy and sad tweets in this training set
 These similarities and differences are then used to create an
inferred function that can be applied to map happy or sad
values to new tweets
Steps in Classification

In more abstract terms, the steps required for supervised learning are (a sketch mapping them to code follows below)
1. Define a suitable type of training examples (e.g. individual tweets)
2. Gather the training set
3. Define the feature vector (the things to be considered in the learning algorithm) for training examples (e.g. do we treat hashtags differently? Should we measure tweet length? Should we note the time of day of a tweet?)
4. Select a suitable learning algorithm (e.g. decision trees, support vector machine)
5. Run the training examples through the learning algorithm to produce the inferred function
6. Test the accuracy of the inferred function on a new set of examples (called a validation set)
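A sketch of how these six steps might look with scikit-learn; the tweets and labels are invented, and word counts are just one possible feature vector.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Steps 1-2: training examples are individual tweets, labelled happy or sad
tweets = ["love this so much", "great day with friends", "awful weather today",
          "so sad about the result", "really happy right now", "terrible service again"]
labels = ["happy", "happy", "sad", "sad", "happy", "sad"]

# Step 3: define the feature vector (here, simple word counts per tweet)
X = CountVectorizer().fit_transform(tweets)

# Hold some labelled examples back so step 6 has unseen data to test on
X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.33, random_state=1)

# Steps 4-5: choose a learning algorithm and produce the inferred function
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Step 6: test the accuracy of the inferred function on the validation set
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```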
Uses for Classification

Sentiment analysis (see previous example)

Document retrieval
 Some documents are tagged as relevant for some task/search
terms, others as not relevant (or many grades in between)

Targeted marketing
 Some customers are tagged as high priority, who then become the focus of marketing initiatives

Image processing
 Some images will contain specific features, others won’t (e.g.
medical scans picking up tumours, images of Mars showing
craters)
An example technique: Decision trees

Tweet   Links   Hashtags   Personal, professional, or company   Happy
1       Yes     Yes        Personal                             Yes
2       Yes     Yes        Professional                         No
3       No      Yes        Company                              Yes
4       No      Yes        Professional                         Yes
5       No      No         Personal                             Yes
6       Yes     No         Professional                         Yes
7       Yes     Yes        Company                              Yes
8       Yes     Yes        Personal                             Yes
9       No      No         Professional                         No
10      No      Yes        Professional                         Yes
11      No      No         Company                              Yes
12      Yes     Yes        Personal                             Yes
13      No      Yes        Professional                         Yes
14      Yes     Yes        Professional                         No
15      No      Yes        Personal                             No
Decision trees

[Decision tree diagram: the tweets above are split first by account type (4 happy / 1 not for personal, 4 happy / 3 not for professional, 3 happy / 0 not for company), then further by hashtags and links until each leaf contains only happy or only not-happy tweets]
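As a sketch, the fifteen tweets in the table above could be fed to scikit-learn’s decision tree learner roughly as follows; the one-hot encoding and the entropy criterion are choices made here for illustration, not part of the slide.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The fifteen example tweets from the table above
data = pd.DataFrame({
    "links":    ["Yes","Yes","No","No","No","Yes","Yes","Yes","No","No","No","Yes","No","Yes","No"],
    "hashtags": ["Yes","Yes","Yes","Yes","No","No","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes"],
    "account":  ["Personal","Professional","Company","Professional","Personal","Professional",
                 "Company","Personal","Professional","Professional","Company","Personal",
                 "Professional","Professional","Personal"],
    "happy":    ["Yes","No","Yes","Yes","Yes","Yes","Yes","Yes","No","Yes","Yes","Yes","Yes","No","No"],
})

X = pd.get_dummies(data[["links", "hashtags", "account"]])  # one-hot encode the features
y = data["happy"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # text version of the learned tree
```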
Issues With Supervised Learning

Seriously reliant on a representative training set. Seriously, seriously
reliant.

Also seriously reliant on both comprehensiveness and parsimony in
the feature vector

The more complex the problem, the more training examples are
required
 Otherwise you run the risk of
 Assuming a classification or relationship exists when it doesn’t
(sometimes called a Type 1 or ‘alpha’ error)
 Assuming a classification or relationship does not exist when it does (sometimes called a Type 2 or ‘beta’ error); a small sketch of both error types follows below
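A small sketch of the two error types, treating ‘happy’ as the positive class; the labels and predictions are made up.

```python
actual    = ["happy", "sad", "happy", "sad", "sad", "happy", "sad", "happy"]
predicted = ["happy", "happy", "happy", "sad", "happy", "sad", "sad", "happy"]

# Type 1 (alpha): predicting a class/relationship that isn't there (false positive)
type1 = sum(a == "sad" and p == "happy" for a, p in zip(actual, predicted))
# Type 2 (beta): missing a class/relationship that is there (false negative)
type2 = sum(a == "happy" and p == "sad" for a, p in zip(actual, predicted))

print("Type 1 (alpha) errors:", type1)
print("Type 2 (beta) errors:", type2)
```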
Unsupervised Learning – Clustering

Descriptive approaches can often make use of unsupervised
learning, i.e. the algorithm runs without our explicitly training it
 E.g. imagine we have a clustering problem where we have no
idea how we want to divide up a set of tweets (e.g. a political
debate has just finished and we want to find common themes in
what people are saying but don’t really know what to look for)
 We could map out different features of tweets and see which
features create ‘clusters’ of tweets
 We could then compare clusters and look for occasions where
clusters on one dimension predict clusters on another
(multidimensional clusters)
 E.g. maybe many tweets are very short, contain an image, and use ellipses (in which case they may be quips or sentiment-heavy), or are very long, contain a link, and use question marks (in which case they are meant to be more discursive)
Steps in Clustering

Clustering techniques vary significantly; however, several steps are generally required (a sketch mapping them to code follows below)
1. Define the feature vector (the things to be considered in the clustering algorithm; again, are we coding hashtags, links (and different characteristics of links), punctuation, demographics of users, etc.?)
2. Select a suitable clustering algorithm (e.g. k-means, hierarchical, two-step, DBSCAN)
3. Define appropriate algorithm parameters (e.g. number of expected clusters, the distance function)
4. Run the algorithm on the data
5. Analyse clusters semantically
6. Refine parameters and rerun as appropriate
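A sketch of these steps using scikit-learn’s k-means; the feature vector of tweet length, link presence, and question-mark count is one illustrative choice, and the numbers are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

# Step 1: one feature vector per tweet: [length in words, contains a link, question marks]
X = np.array([[5, 1, 0], [40, 1, 2], [6, 0, 0], [38, 1, 3],
              [4, 0, 0], [45, 0, 2], [7, 1, 0], [42, 1, 1]])

# Steps 2-3: choose the algorithm and its parameters (two expected clusters here)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# Step 4: run the algorithm on the data
cluster_labels = kmeans.fit_predict(X)

# Steps 5-6: inspect the clusters semantically, then refine parameters and rerun
print("cluster labels:", cluster_labels)
print("cluster centres:\n", kmeans.cluster_centers_)
```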
Uses of Clustering

Thematic analysis (see previous example)

Market segmentation
 Identify non-obvious ways to separate users/customers

Content distillation
 Sort large volumes of documents, emails, etc. into clusters that
can subsequently be analysed

Crime and policing
 Find domains, areas, markets, etc. where certain crimes occur
repeatedly to allow focused investigation
An example technique: K-means

Typically used if we can reasonably say how many clusters we
expect and our variables are continuous or ordered

 Follows a simple process: assign each point to its nearest cluster centre, recompute each centre as the mean of its assigned points, and repeat until the assignments stop changing (a minimal sketch follows below)
Time to play cards!
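That simple process can be sketched from scratch in a few lines; this is a minimal illustration with made-up two-dimensional points, and empty-cluster handling and convergence checks are omitted for brevity.

```python
import numpy as np

def k_means(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen points as the initial cluster centres
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # 1. Assign each point to its nearest centre
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # 2. Move each centre to the mean of its assigned points, then repeat
        centres = np.array([points[assignments == j].mean(axis=0) for j in range(k)])
    return assignments, centres

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
assignments, centres = k_means(points, k=2)
print(assignments)
print(centres)
```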
Issues With Unsupervised Learning



Massive processing power required, as you need to cast the net
wide to avoid missing things (and because you don’t know in
advance what’s relevant)
When you finish, you don’t really know how well you’ve done in
terms of insights gained vs insights possible (apart from a subjective
interpretation of how useful the whole thing was)
Often used as a precursor to supervised learning, e.g.
 Lets you find the features of interest, which can then be fed into an input vector
 Gives you clusters to feed into predictive testing
Want to read more?

Mayer-Schonberger, V. & Cukier, K. (2013). Big data. A revolution
that will transform how we live, work, and think. John Murray
Publishers, UK.

Mitchell, T.M. (1997). Machine Learning
 Free pdf at
http://personal.disco.unimib.it/Vanneschi/McGrawHill__Machine_Learning_-Tom_Mitchell.pdf

Bishop, C. (2007). Pattern Recognition and Machine Learning
 Free pdf at http://www.rmki.kfki.hu/~banmi/elte/Bishop%20%20Pattern%20Recognition%20and%20Machine%20Learning.pdf