Presentation on a specialist topic in Data Mining and Text Analytics


By Klejdi Muca & Stephen Quinn
Data mining is a method used by companies like IMDB or Netflix to turn raw data into
useful information. For example:
• It helps companies concentrate on the most important behavioural data that they
have collected from their users and even potential users.
• It enables companies such as Blockbuster to mine their video rental history
database to recommend rentals to individual customers.
• The techniques and algorithms that data mining uses do not just change how the data
is presented; they discover previously unknown relationships in the data.
The Internet Movie Database (IMDB) provides current film and TV programme information
freely to the user. IMDB includes plot summaries, actors and production crew and,
significantly, offers a rating system that allows users to rate films on a scale of one to
ten. “The database aims to capture any and all information associated with movies
from any part of the world, starting with the earliest cinema to the very latest
releases.” IMDB uses data mining techniques to find relationships in its dataset and
structures the data so that users can navigate the website easily and efficiently.
In 2012, AIUB (American International University-Bangladesh) started a project in
which they attempted to create a classification scheme for pre-release movie
popularity based on inherent attributes, using C4.5 (an algorithm used to generate a
decision tree). Their aim was to create a system that would predict how popular a
film or TV title would be, based on the relationships found in data gathered from
other film and TV titles.
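As a rough illustration of how a C4.5-style tree chooses which attribute to split on, the sketch below computes the gain ratio (information gain divided by split information) for two attributes. The records and their values are invented for illustration and are not taken from the AIUB project; a full C4.5 implementation also handles continuous attributes and pruning, which are omitted here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """Information gain from splitting on `attribute`, divided by split information."""
    total = len(rows)
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy([r[attribute] for r in rows])
    gain = base - remainder
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical pre-release records (invented values, illustration only).
films = [
    {"budget": "high", "language": "English", "popular": "yes"},
    {"budget": "high", "language": "French",  "popular": "yes"},
    {"budget": "low",  "language": "English", "popular": "no"},
    {"budget": "low",  "language": "French",  "popular": "no"},
]

for attr in ("budget", "language"):
    print(attr, round(gain_ratio(films, attr, "popular"), 3))
```

In this toy data the budget attribute separates popular from unpopular titles perfectly, so it receives the higher gain ratio and would be chosen as the first split.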
The data gathered included:
• production budget
• actors
• directors
• country
• language
• release date
All of this information was parsed and inserted into an SQL database, where queries
were used to build the final data sets. These were then analysed with WEKA for
patterns in the relationships. Examples would be whether spending more money on a
film results in a greater financial return, or whether films directed by a certain
director are more likely to be popular.
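A query of the kind described above might look like the following sketch, which uses Python's built-in sqlite3 module. The table name, column names and rows are assumptions made for illustration; the project's actual schema is not given in the source.

```python
import sqlite3

# In-memory database with a hypothetical schema (not the project's real one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE films (title TEXT, director TEXT, budget REAL, gross REAL)"
)
# Invented rows, budget and gross in the same arbitrary units.
conn.executemany("INSERT INTO films VALUES (?, ?, ?, ?)", [
    ("Film A", "Director X", 10.0, 30.0),
    ("Film B", "Director X", 20.0, 50.0),
    ("Film C", "Director Y", 50.0, 40.0),
    ("Film D", "Director Y", 80.0, 60.0),
])

# Average financial return (gross divided by budget) per director.
query = """
    SELECT director, AVG(gross / budget) AS avg_return
    FROM films
    GROUP BY director
    ORDER BY avg_return DESC
"""
results = conn.execute(query).fetchall()
for director, avg_return in results:
    print(director, round(avg_return, 2))
```

A result set like this could then be exported for analysis in a tool such as WEKA.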
Figure 1
“The model and theoretical machine learning steps as shown in this paper will benefit
various internet sites that are dealing with movie information. It will also aid
producers and directors. It will also assist the film financing organizations to make
decisions on movie rentals, streaming services, brand sponsorship, etc.”
Netflix is an American-based internet streaming service that provides on-demand TV
programmes and films to its subscribers. Netflix uses data mining to its advantage by
mining the films and TV programmes that each subscriber has watched, as well as the
ratings that they gave. Netflix then uses data mining techniques to find patterns in
the data and produce recommendations for the subscriber.
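To make the idea of turning viewing history and ratings into recommendations concrete, here is a minimal nearest-neighbour sketch. It is not Netflix's actual algorithm; the subscriber names, titles, ratings and the similarity measure are all assumptions chosen to keep the example small.

```python
def similarity(a, b):
    """Score in (0, 1] based on the mean rating gap over shared titles; 0 if none shared."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    gap = sum(abs(a[t] - b[t]) for t in shared) / len(shared)
    return 1.0 / (1.0 + gap)

def recommend(ratings, user, min_rating=4):
    """Suggest unseen titles that the most similar other subscriber rated highly."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: similarity(ratings[user], ratings[u]))
    return sorted(
        title for title, rating in ratings[nearest].items()
        if rating >= min_rating and title not in ratings[user]
    )

# Invented star ratings (1 to 5) for hypothetical subscribers and titles.
ratings = {
    "ann": {"Film A": 5, "Film B": 1},
    "bob": {"Film A": 5, "Film B": 2, "Film C": 5, "Film D": 2},
    "cat": {"Film A": 1, "Film B": 5, "Film E": 5},
}

print(recommend(ratings, "ann"))
```

Here "ann" rates films most like "bob", so she is offered the title that "bob" rated highly and that she has not yet seen.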
On October 2nd 2006 the 'Netflix Prize' began. The aim of the competition was for its
competitors to create a collaborative filtering algorithm that improved Netflix's
prediction accuracy by 10%. The winners of the competition were the BellKor's
Pragmatic Chaos team, who in 2009 achieved an improvement of 10.06%.
Why did they do this? Customer satisfaction/retention is key to Netflix – they would
really like to improve their recommendation systems.
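The Netflix Prize measured prediction accuracy with root mean squared error (RMSE), so an improvement figure of this kind can be understood as a percentage reduction in RMSE relative to a baseline. The sketch below shows the arithmetic; the ratings and predictions are invented and far smaller than the real competition data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

def improvement_percent(baseline_rmse, new_rmse):
    """Percentage reduction in RMSE relative to the baseline."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse

# Invented star ratings and two sets of hypothetical predictions.
actual   = [4, 3, 5, 2]
baseline = [3, 3, 4, 3]
improved = [4, 3, 4, 3]

gain = improvement_percent(rmse(baseline, actual), rmse(improved, actual))
print(f"RMSE reduced by {gain:.2f}%")
```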
Classification: this technique is commonly used for predicting a specific outcome,
such as a star rating or whether the user is likely to watch a TV programme or film.
Attribute importance: this technique is used to rank the strength of each attribute's
relationship with the target attribute, for example the budget of a film and its
relationship with how popular the film will be. The same can be done with the actors,
actresses or directors involved with a film, to estimate how likely the film is to be
popular based on those attributes.
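One simple way to rank attributes by the strength of their relationship with a target is the absolute Pearson correlation, sketched below. This is a stand-in chosen for brevity and only works for numeric attributes; the attribute names and values are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0

def rank_attributes(rows, attributes, target):
    """Sort attributes by absolute correlation with the target, strongest first."""
    ys = [r[target] for r in rows]
    scores = {a: abs(pearson([r[a] for r in rows], ys)) for a in attributes}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented records: budget in arbitrary units, popularity on an arbitrary scale.
films = [
    {"budget": 10, "release_month": 12, "popularity": 2},
    {"budget": 20, "release_month": 9,  "popularity": 4},
    {"budget": 30, "release_month": 11, "popularity": 6},
    {"budget": 40, "release_month": 10, "popularity": 8},
]

ranking = rank_attributes(films, ["budget", "release_month"], "popularity")
for name, score in ranking:
    print(name, round(score, 2))
```

Categorical attributes such as actors or directors would need a different measure, for example the information-based scores used by decision tree algorithms.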
Anomaly detection: this technique is used to detect results that do not follow the
normal pattern. A good example of this comes from the Netflix Prize, when the film
‘Napoleon Dynamite’ caused problems for the participants because of users' varying
ratings of the film. Some users rated the film poorly whereas others rated it very
highly, making it very hard to predict how popular the film was going to be. Some
contestants claimed to be out by eight-tenths of a star on average, but on films such
as ‘Napoleon Dynamite’ they were off by an average of 1.2 stars.
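Titles that split opinion in this way can be flagged by looking at how widely their ratings are spread. The sketch below uses the standard deviation of each title's ratings with an arbitrary threshold; the titles, ratings and threshold are all assumptions for illustration, not figures from the competition.

```python
import statistics

def polarising_titles(ratings_by_title, threshold=1.5):
    """Return titles whose rating standard deviation exceeds the threshold."""
    return sorted(
        title for title, ratings in ratings_by_title.items()
        if statistics.pstdev(ratings) > threshold
    )

# Invented star ratings (1 to 5) for two hypothetical titles.
ratings_by_title = {
    "Consensus Film": [4, 4, 3, 4, 4, 5],
    "Divisive Film":  [1, 5, 1, 5, 5, 1],
}

print(polarising_titles(ratings_by_title))
```

A title whose ratings cluster at both extremes has a high spread even when its average looks ordinary, which is why an average-based prediction performs poorly on it.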
Clustering: this technique is used to find natural groups within a data set, for
example movie genres, films by certain directors, and TV programmes or films that
contain a specific actor or actress.
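As one possible illustration of finding natural groups, the sketch below runs a very small k-means procedure on two-dimensional points. k-means is an assumed choice of algorithm, and the points stand for hypothetical numeric features of titles (for instance an action score and a romance score); none of this comes from the source data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Group points into k clusters, starting from the first k points as centroids."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Invented feature values for six hypothetical titles.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.0, 8.0)]

clusters = kmeans(points, k=2)
for index, cluster in enumerate(clusters):
    print(index, cluster)
```

Starting from the first k points keeps the example deterministic; practical implementations usually choose starting centroids more carefully.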
Text analytics is the process of finding high-quality information and knowledge in a
piece of text.
This is done through the use of software such as:
• Autonomy
• AeroText
• Medallia
These pieces of software analyse the text to find patterns and trends through statistical
pattern learning.
It is commonly estimated that around 80% of the world's information is currently
stored in unstructured textual format.
We can analyse a film or TV programme's popularity by extracting reviews from websites
such as Rotten Tomatoes, IMDB and Twitter. Both Rotten Tomatoes and Twitter provide
APIs (application programming interfaces) that allow us to write a program that
interacts with the data set and extracts the data that we need. IMDB, however, does
not provide an API, meaning we would have to extract the data manually.
From Twitter we can search for a title by using its hashtag or any words that relate
to it. For example, for the TV series Breaking Bad a user can type in Breaking Bad or
#BreakingBad and get other users' opinions about it from around the world.
Or, if the user wants to be more specific and refine the results, they can simply
search for Breaking Bad together with other key words such as good, amazing or
terrible, and they will be presented with other people's reviews of it.
Each tweet can be analysed to find key words
and phrases that are commonly used, to get an
understanding of the trends and patterns.
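A first pass at finding commonly used key words could be a simple word count, as in the sketch below. The tweets are invented, the stop-word list is a small assumed sample, and no Twitter API call is made; the texts are taken to have been collected already.

```python
import re
from collections import Counter

# Small assumed stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "is", "was", "it", "and", "of", "this", "so", "i"}

def top_keywords(tweets, n=3):
    """Return the n most frequent words across all tweets, ignoring stop words."""
    words = []
    for tweet in tweets:
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word not in STOPWORDS:
                words.append(word)
    return Counter(words).most_common(n)

# Invented example tweets.
tweets = [
    "#BreakingBad is amazing",
    "The finale was amazing and so tense",
    "Terrible pacing, amazing acting",
]

print(top_keywords(tweets, n=1))
```

Counting single words ignores context such as negation ("not amazing"), so phrase-level analysis would be needed for a more reliable picture of opinion.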