Transcript ppt - MIS

Data Mining and
Knowledge Acquisition
Chapter 7
Data Mining Overview and Exam Questions
2013/2014 Summer
1
Data Mining

Methodology
- Problem definition
- Data set selection
- Preprocessing transformations
- Functionalities
  - Classification/prediction
  - Clustering
  - Association
  - Sequential analysis
  - Others
2
Methodology cont.

Algorithms
- For classification you can use
  - Decision trees: ID3, C4.5, and CHAID are algorithms
- For clustering you can use
  - Partitioning methods: k-means, k-medoids
  - Hierarchical: AGNES
  - Probabilistic: EM is an algorithm
Presenting results
- Back transformations
- Reports
Taking action
3
Two basic styles of data mining

Descriptive
- Cross tabulations, OLAP, attribute-oriented induction, clustering, association
Predictive
- Classification, prediction
Questions answered by these styles
Difference between classification and prediction
4
Classification



Methods
- Decision trees
- Neural networks
- Bayesian
- k-NN or memory-based reasoning
Advantages and disadvantages
Given a problem, which data processing techniques are required
5
Classification (cont.)

Accuracy of the model
- Measures for classification/numerical prediction
- How to better estimate
  - Holdout, cross-validation, bootstrapping
- How to improve
  - Bagging, boosting
For unbalanced classes
What to do with models
- Lift charts
6
Clustering

Distance measures
- Dissimilarity or similarity
- For different types of variables
  - Ordinal, binary, nominal, ratio, interval
Why the data need to be transformed
Partitioning methods
- k-means, k-medoids
- Advantages and disadvantages
Hierarchical
Density based
Probabilistic
7
Association





Apriori or FP-Growth
How to measure the strength of rules
- Support and confidence
- Other measures: critique of support and confidence
Multiple levels
Constraints
Sequential patterns
8
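As a hedged illustration of the support and confidence measures listed on the Association slide, and of why they are often criticised, the following Python sketch computes support, confidence, and lift on a small transaction set. The items, the transactions, and the function names are hypothetical and chosen only for illustration; they do not come from the course material.

```python
# Hypothetical toy transaction database (assumption, not from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset, db):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)


def confidence(lhs, rhs, db):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)


def lift(lhs, rhs, db):
    """Confidence divided by the support of rhs; 1.0 indicates independence."""
    return confidence(lhs, rhs, db) / support(rhs, db)


if __name__ == "__main__":
    print("support(bread, milk)      =", support({"bread", "milk"}, transactions))
    print("confidence(bread -> milk) =", confidence({"bread"}, {"milk"}, transactions))
    print("lift(bread -> milk)       =", lift({"bread"}, {"milk"}, transactions))
```

In this toy data the rule bread -> milk has fairly high support and confidence, yet its lift is below 1, so bread buyers are slightly less likely than average to buy milk. This is the kind of case that motivates measures beyond support and confidence.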
OLAP





Concept of cube
Fact table
- Measures
Dimensions
Schemas
- Star, snowflake
Concept hierarchies
- Set grouping, such as price or age
- Parent-child
9
Preprocessing

Missing values
Inconsistencies
Redundant data
Outliers
Data reduction
- Attribute elimination
- Attribute combination
- Sampling
Histograms
10
Exam Questions








Introduction
Basic functionalities
Data description
Data preparation
Data warehousing/OLAP
Clustering
Classification/numerical prediction
Frequent pattern mining
11
Introduction


Defining data mining problems
Data mining functionalities
12
Define data mining problems

1. Suppose that a data warehouse for Big-University
Library consists of the following three dimensions:
users, books, time, and each dimension has four
levels not including the all level. There are three
measures: You are asked to perform a data mining
study on that warehouse (25 pnt)

Define three data mining problems on that
warehouse: involving association, classification
and clustering functionalities respectively. Clearly
state the importance of each problem. What is the
advantage of the data being organized as OLAP
cubes compared to relational table organisation?
13
Define data mining problems

In the data preprocessing stage of KDD:
- What are the reasons for missing values, and how do you handle them?
- What are the possible data inconsistencies?
- Do you apply any discretization?
- Do you apply any data transformations?
- Do you apply any data reduction strategies?
14
Define data mining problems



Define your target and input variables in
classification. Which classification techniques and
algorithms do you use in solving the classification
problem? Support your answer.
Define your variables, indicating their categories, in
clustering. Which clustering techniques and
algorithms do you use in solving the clustering
problem? Support your answer.
Describe the association task in detail, specifying the
algorithm, interestingness measures, or constraints, if
any.
15
Data mining on MIS

A data warehouse for the MIS department
consists of the following four dimensions:
student, course, instructor, semester and
each dimension has five levels including the
all level. There are two measures: count and
average grade. At the lowest level, the average
grade is the actual grade of a student. You
are asked to perform a data mining study on
that warehouse (25 pnt)
16
Data mining on MIS 2


Define three data mining problems on that
warehouse: involving association, classification and
clustering functionalities respectively. Clearly state
the importance of each problem. What is the
advantage of the data being organized as OLAP
cubes compared to relational table organisation?
In the data preprocessing stage of KDD:
- What are the reasons for missing values, and how do you handle them?
- What are the possible data inconsistencies?
- Do you apply any discretization?
- Do you apply any data transformations?
- Do you apply any data reduction strategies?
17
Data mining on MIS 3



Define your target and input variables in
classification. Which classification techniques and
algorithms do you use in solving the classification
problem? Support your answer.
Define your variables, indicating their categories, in
clustering. Which clustering techniques and
algorithms do you use in solving the clustering
problem? Support your answer.
Describe the association task in detail, specifying the
algorithm, interestingness measures, or constraints, if
any.
18
Final 2010/2011 Spring (MIS)





3. (35 pt.) The aim of Knowledge Discovery from
Databases (KDD) is to extract interesting, potentially
useful, …, knowledge from data. The extracted
knowledge can be represented in a knowledge base
similar to a database. Considering the data mining
functionalities and algorithms we covered in this course,
describe five different knowledge types. For each type,
discuss the following aspects:
a) From which functionality and algorithm are they
obtained?
b) How are they represented in the knowledge base? (Do
not consider data structures.)
c) What are their quality characteristics?
d) How are they used in the deployment phase?
19
BIS 541 2011/2012 Final





1. For each of the following problems, identify the relevant
data mining tasks.
a) A weather analyst is interested in calculating
the likely change in temperature for the coming days.
b) A marketing analyst is looking for groups of
customers so as to apply different CRM strategies for
each group.
c) A medical doctor must decide whether a set of
symptoms is an indication of a particular disease.
d) An educational psychologist would like to
identify exceptional students to suggest them for
special educational programs.
20
BIS 541 2012/2013 Final





For each of the following problems, identify the relevant data
mining tasks with a brief explanation.
a) A weather analyst is interested in whether the
temperature will be up or down for the coming day.
b) An insurance analyst intends to group policy holders
according to the characteristics of customers and policies.
c) A medical researcher is looking for symptoms that
occur together among a large set of patients.
d) An educational program director would like to
determine the likely GPA of applicants to an MA program from
their ALES scores, undergraduate GPAs, and entrance
exam scores.
21
Basic Functionalities

Decision tree - ID3
- Information gain
Association - Apriori
Clustering - k-means
22
Information gain
1. Consider a data set of two attributes A and B.
A is continuous, whereas B is categorical,
having two values, “y” and “n”, which can
be considered the class of each observation.
When attribute A is discretized into two
equi-width intervals, no information is provided
about the class attribute B, but when it is discretized
into three equi-width intervals, perfect
information is provided about B. Construct a simple
dataset obeying these characteristics.
23
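One possible way to check a candidate answer to the information gain question above is sketched below in Python. The six data points, the equal-width binning rule over the observed range of A, and all function names are assumptions made for illustration; other data sets would work equally well.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def equal_width_bin(value, lo, hi, k):
    """Index (0..k-1) of the equal-width interval of [lo, hi] containing value."""
    if value == hi:
        return k - 1
    return int((value - lo) / (hi - lo) * k)


def info_gain(values, labels, k):
    """Information gain about the class after cutting values into k equal-width bins."""
    lo, hi = min(values), max(values)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(equal_width_bin(v, lo, hi, k), []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical data: the class is "n" only in the middle third of the range of A.
A = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
B = ["y", "y", "n", "n", "y", "y"]

if __name__ == "__main__":
    print("gain with 2 intervals:", info_gain(A, B, 2))
    print("gain with 3 intervals:", info_gain(A, B, 3))
    print("entropy of B:         ", entropy(B))
```

With two intervals each half keeps the overall class mix (two "y" to one "n"), so the gain is zero. With three intervals every interval is pure, so the gain equals the entropy of B.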
Decision tree

2. a) Construct a data set that generates the
tree shown below. In addition, the following
conditions are satisfied:

[Figure: decision tree. The first split is on attribute A.
Node 2: A=a1, decision Y.
Node 3: A=a2, split further on attribute B.
Node 4: B=b1, decision N.
Node 5: B=b2, decision Y.]

24
Midterm 2006/2007 Spring (MIS)

2. Show that entropy is not a symmetric measure
of association in the way the correlation coefficient is.
Construct a simple data set of two categorical
attributes A and B such that i) knowing the
values of A provides perfect information to
predict B but ii) knowing the values of B does
not provide perfect information to predict A.
25
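The asymmetry asked about above can be checked numerically with conditional entropy. The Python sketch below is only one possible illustration: the six-row data set and the function names are assumptions, not part of the exam question.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given): expected entropy of target within each value of given."""
    groups = defaultdict(list)
    for t, g in zip(target, given):
        groups[g].append(t)
    n = len(target)
    return sum(len(v) / n * entropy(v) for v in groups.values())


# Hypothetical data: B is a function of A, but A is not a function of B.
A = ["a1", "a2", "a3", "a1", "a2", "a3"]
B = ["y", "y", "n", "y", "y", "n"]

if __name__ == "__main__":
    print("H(B | A) =", conditional_entropy(B, A))
    print("H(A | B) =", conditional_entropy(A, B))
```

Here knowing A removes all uncertainty about B, while knowing B = "y" still leaves a1 and a2 equally likely, so the two conditional entropies differ.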



at a particular node
when information gain is 0
when it gets maximum value
26
Associations
1. In a particular database, A → C and B → C are
strong association rules based on the support and
confidence measures. A and B are
independent items. Does this imply that
A ∧ B → C is also a strong rule based on the lift
measure? A, B, and C are items in a transaction
database.
2. If A → B and B → C are strong, is A → C a
strong rule?
3. If A → B and A → C are strong, is B → C a
strong rule?
27
Data Description/Preprocessing
28
Midterm 2004/2005 Spring (MIS)

Consider the correlation coefficient between
two numerical variables. Is its numerical value
affected by the units of measure of these
variables (such as measuring temperature in
°C or °F)?
29
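A quick numerical check of the unit question above can be sketched as follows. The temperature and sales figures are hypothetical, and `pearson` is a plain-Python helper written for this illustration rather than a library function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical observations (assumption, not from the exam).
celsius = [12.0, 15.5, 19.0, 23.5, 28.0]
sales = [20.0, 24.0, 31.0, 38.0, 47.0]

# Same temperatures expressed in another unit: a positive linear rescaling.
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

if __name__ == "__main__":
    print("r with Celsius:   ", pearson(celsius, sales))
    print("r with Fahrenheit:", pearson(fahrenheit, sales))
```

Because the conversion is a linear map with a positive slope, both the covariance and the standard deviation are scaled by the same factor and the coefficient is unchanged up to rounding. A rescaling with a negative slope would only flip the sign.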
Midterm 2011/2012 Fall generate data



5. (10 points) Consider two continuous
variables X and Y. Generate data sets
a) where PCA (principal component analysis)
cannot reduce the dimensionality from two to
one
b) where, although the two variables are related
(a functional relationship exists between these
two variables), PCA is not able to reduce the
dimensionality from two to one
30
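For part b) above, one candidate data set is a symmetric parabola, where Y is an exact function of X but the linear correlation is zero. The Python sketch below compares it with a linear relation by computing the share of variance captured by the first principal component. The data, the 2x2 closed-form eigenvalue computation, and the function name are assumptions made for illustration.

```python
import math


def first_component_share(xs, ys):
    """Share of total variance captured by the first principal component (2-D data)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Largest eigenvalue of the covariance matrix [[sxx, sxy], [sxy, syy]].
    largest = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return largest / (sxx + syy)


# Hypothetical data: X on a symmetric grid from -2 to 2.
xs = [i / 10 for i in range(-20, 21)]
linear = [2 * x + 1 for x in xs]   # Y is a linear function of X
parabola = [x * x for x in xs]     # Y is a function of X, but not a linear one

if __name__ == "__main__":
    print("linear relation:   ", first_component_share(xs, linear))
    print("parabolic relation:", first_component_share(xs, parabola))
```

In the linear case essentially all variance lies on one component. In the parabolic case the first component captures only a little more than half of the variance, so dropping the second component would lose much of the structure even though Y is fully determined by X.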
Midterm 2010/2011 Spring (MIS)





3. (25 points) Consider a data set of two continuous
variables X and Y. X is right skewed and Y is left
skewed. Both represent measures of the same quantity
(sales categories, exam grades, …).
a) Draw typical distributions of X and Y separately.
b) Draw box plots of X and Y separately.
c) Draw q-plots (quantile) of X and Y separately.
d) Draw q-q plot of X and Y.
31
MIS 541 2012/2013 Final



1. (20 pts) Consider a data set of two
continuous variables X and Y. Both have the
same mean and both have no skewness
(symmetric). X has a higher variance than Y.
Both represent measures of the same quantity
(sales categories, exam grades, …).
a) Draw typical distributions of X and Y on
the same graph.
b) Draw box plots of X and Y separately.
32
Final 2011/2012 Fall data description



1. (20 points) Give two examples of outliers:
a) Where outliers are useful and essential
patterns to be mined.
b) Where outliers are useless, stemming from error or
noise.
33
Final 2011/2012 Fall preprocessing

2. (20 points) Considering the classification
methods we covered in class, describe two
distinct reasons why continuous input variables
have to be normalized for classification
problems (each reason 10 points).
34
Midterm 2008/2009 Spring






4. (20 points) Principal components analysis is used for dimensionality
reduction and may then be followed by cluster analysis, say for
segmentation purposes. Consider a problem with two continuous
variables. Using scatter plots:
a) Generate a data set where PCA reduces the dimensionality
from two to one.
b) Generate a data set where, although there is a relation between
the two variables, PCA is not able to reduce the dimensionality to one.
c) Generate a data set where there are natural clusters and PCA
can reduce the dimensionality.
d) Generate a data set where there are natural clusters but PCA is
not the appropriate method for reducing the dimensionality.
35
Midterm 2012/2013 Fall (MIS)





1. (20 pts) Consider a data set of two continuous
variables X and Y. Both have the same mean and both
have no skewness (symmetric). X has a higher variance
than Y. Both represent measures of the same quantity
(sales categories, exam grades, …).
a) Draw typical distributions of X and Y on the same
graph.
b) Draw box plots of X and Y separately.
c) Draw q-plots (quantile) of X and Y separately.
d) Draw q-q plot of X and Y.
36
Data Warehousing/OLAP


Design of OLAP cubes
Measures
37
Midterm 2005/2006 Spring (MIS)

A large hypermarket has many branches
throughout the country. The quantity purchased Qi
and the price Pi of each item i are stored in a
warehouse. Top management is interested
in finding the cheapest of the best-selling items,
minp(maxq item i). Is it possible to accomplish
this in a distributive manner? In other words, is
minp(maxq item i) a distributive measure?
38
Final 2007/2008 Spring (MIS)



1. (25 pnt) Suppose an aggregation is to be
designed to obtain weekly dollar values from
daily values in the two different ways described
below. Can they be computed in a distributive
manner? (The database has day ID and dollar
value fields. Records are randomly selected and
assigned to different processing units.)
a) Taking the daily averages
b) Taking the last day’s value of the week
39
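The question above can be explored with a small Python sketch. It is a hedged illustration under stated assumptions: the seven records, the split into two processing units, and the helper names `partial` and `combine` are all hypothetical. The sketch shows that averaging the per-unit averages can give the wrong weekly figure, while carrying a small fixed-size summary (sum, count, and the latest day seen with its value) from each unit is enough to recover both aggregates.

```python
# Hypothetical records for one week: (day_id, dollar_value).
records = [(1, 10.0), (2, 12.0), (3, 11.0), (4, 15.0),
           (5, 14.0), (6, 13.0), (7, 16.0)]

# Two processing units with uneven shares of the records (assumption).
partitions = [records[:5], records[5:]]


def partial(part):
    """Per-unit summary: (sum of values, record count, (latest day_id, its value))."""
    total = sum(value for _, value in part)
    latest = max(part)  # tuples compare by day_id first
    return total, len(part), latest


def combine(summaries):
    """Merge per-unit summaries into (weekly average, last day's value)."""
    total = sum(s[0] for s in summaries)
    count = sum(s[1] for s in summaries)
    latest = max(s[2] for s in summaries)
    return total / count, latest[1]


weekly_avg, last_value = combine([partial(p) for p in partitions])

# Naive alternative: average the per-unit averages directly.
naive_avg = sum(sum(v for _, v in p) / len(p) for p in partitions) / len(partitions)

if __name__ == "__main__":
    print("weekly average from (sum, count):", weekly_avg)
    print("average of unit averages:        ", naive_avg)
    print("last day's value:                ", last_value)
```

The average cannot be obtained by applying the same averaging function to the per-unit results, but it can be derived from two distributive quantities, sum and count. Likewise, the last day's value needs the day ID to be carried along with the value.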
Data warehouse for library

A data warehouse is constructed for the library of a university to
be used as a multi-purpose DSS. Suppose this warehouse
consists of the following dimensions: user , books , time
(time_ID, year, quarter, month, week, academic year, semester,
day), and . “Week” is considered not to be less than “month”.
Each academic semester starts and ends at the beginning and
end of a week respectively. Hence, week<semester.

Describe concept hierarchies for the three dimensions.
Construct meaningful attributes for each of the dimension tables
above. Describe at least two meaningful measures in the
fact table. Each dimension can be viewed at its ALL level as
well.

What is the total number of cuboids for the library cube?

Describe three meaningful OLAP queries and write SQL
expressions for one of them.
40
OLAP Big University

2. (Han page 100, 2.4) Suppose that the data
warehouse for Big-University consists of the
following dimensions:
student, course, instructor, semester, and two measures,
count and average_grade. At the lowest
conceptual level (for a given student, instructor, course,
and semester), the average_grade measure stores the
actual grade of the student. At higher conceptual levels,
average_grade stores the average grade for the
given combination. (For example, when student is MIS, semester is
2005 all terms, course is MIS 541, and instructor is Ahmet Ak,
average_grade is the average of the students' grades in that
course by that instructor in all semesters of 2005.)
41
cont.



a) Draw a snowflake schema diagram for that
warehouse.
What are the concept hierarchies for the
dimensions?
b) What is the total number of cuboids?
42
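For the cuboid-count question above, a commonly used textbook formula multiplies (L_i + 1) over all dimensions, where L_i is the number of levels of dimension i excluding the virtual top level "all". The Python sketch below applies it. It assumes that each dimension's concept hierarchy is a single chain of levels, and the level counts in the examples are taken from the question wordings elsewhere in this deck (four levels per dimension besides "all"); the function name is hypothetical.

```python
from math import prod


def total_cuboids(levels_excluding_all):
    """Number of cuboids when dimension i has levels_excluding_all[i] levels
    below the virtual 'all' level and each hierarchy is a single chain."""
    return prod(levels + 1 for levels in levels_excluding_all)


if __name__ == "__main__":
    # Four dimensions (student, course, instructor, semester), each with
    # five levels including 'all', i.e. four levels excluding it.
    print(total_cuboids([4, 4, 4, 4]))
    # Three dimensions (users, books, time), four levels each excluding 'all'.
    print(total_cuboids([4, 4, 4]))
```

If a hierarchy is a lattice rather than a chain (for example, a time dimension where week does not roll up to month), the per-dimension factor would instead be the number of distinct levels in that lattice, including "all".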
MIS 542 Final S06 1 olap




1. The MIS department wants to revise its academic
strategies for the following ten years. Relevant
questions are: What portion of the courses are
required or elective? What is the full-time/part-time
distribution of instructors? What is the
course load of instructors? What percent of
technical or managerial courses are taught by
part-time instructors? How have all these things
43
MIS 542 Final S06 1 cont.





changed over the years? You can add similar
strategic questions of your own. Do not consider
student aspects of the problem for the time
being. Design an OLAP schema to be used as a
strategic tool. You are free to decide the
dimensions and the fact table. Describe the
concept hierarchies, virtual dimensions, and calculated
members. Finally, show OLAP operations to
answer three of these strategic questions.
44
Midterm 2006/2007 Spring

1. A data warehouse is constructed for the web site of
an e-commerce company to be used for customer
segmentation. Each visitor's click stream data is recorded.
Each session has an ID. Suppose this warehouse
consists of the following dimensions: visitor, time,
product. There is a concept hierarchy for products, which
is reflected in the design of the web site so that
products can be seen in a hierarchical manner. When a
product is seen, it can be purchased. Only registered
customers can use the system, so each visitor has an
ID. When registering, a form is filled out so that socio-demographic information is taken from the customer.
Suppose income (a numerical variable), birthday,
gender, profession, and marital status are asked.
45
cont.




a) Describe concept hierarchies for the three
dimensions. Construct meaningful attributes for each of the
dimension tables above. (What transformations are
required before constructing these attributes?) Describe
at least two meaningful measures in the fact table.
b) Each dimension can be viewed at its ALL level as
well.
Describe three meaningful OLAP queries and write SQL
expressions for one of them.
c) Define a clustering problem: Which variables are
important? Is there a missing value problem? What
data transformations are needed? Which algorithm
would you suggest?
46
Midterm 2007/2008 Spring



1. (20 points) Consider a shipment company
responsible for shipping items from one location to
another on predetermined due dates. Design a star
schema OLAP cube for this problem to be used by
managers for decision-making purposes. The
dimensions are time, item to be shipped, person
responsible for shipping the item, and location. For each of
these dimensions, determine three levels in the concept
hierarchy. Design the fact table with appropriate
measures and keys (include two measures and at least
one calculated member in the fact table).
Show one drill-down and one roll-up operation.
Show the SQL query for one of the cuboids.
47
Midterm 2008/2009 Spring



1. (25 points) In an organization, a data warehouse is to be
designed for evaluating the performance of employees. To evaluate
the performance of an employee, a survey questionnaire consisting of a
set of questions on a 5-point Likert scale is answered by other
employees in the same company at specified times. That is, the
performance of employees is rated by other employees.
Each employee has a set of characteristics including department,
education, … Each survey is conducted on a particular date and applied
to some of the employees. Questions are aimed at evaluating broad
categories of performance such as motivation, cooperation
ability, …
Typically, a question in a survey, aiming to measure a specific
attitude of an employee, is evaluated by another employee
(rated from 1 to 5). Data is available at the question level.
48
cont.





Cube design: a star schema
Fact table: Design the fact table; it should contain one
calculated member. What are the measures and keys?
Dimension tables: Employee and Time are the two
essential dimensions; include Survey and Question
dimensions as well. For each dimension, show a concept
hierarchy.
State three questions that can be answered by that
OLAP cube.
Show drill-down and roll-up operations related to these
questions.
49
MIS 541 2012/2013 Final






2. (20 pts) Suppose that a data warehouse for a
hospital consists of the following dimensions: time,
doctor, and patient, and the two measures count and
charge, where charge is the fee a doctor charges a
patient for a visit.
Design a warehouse with star schema:
a) Fact table: Design the fact table.
b) Dimension tables: For each dimension show a
reasonable concept hierarchy.
c) State two questions that can be answered by that
OLAP cube.
d) Show drilldown and roll up operations related to one
of these questions
50
BIS 541 2011/2012 Final




2. Develop a data warehouse for an insurance company
using a fact constellation schema. The company holds
the insurance premiums paid by its customers for different
types of policies as well as the payments made to its customers in case of
accidents. There are two fact tables,
for premiums and payments respectively. The
dimensions are customer, time, policy, and accident; some are
shared by the two fact tables.
a) Design the fact tables: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) Show one roll-up and one drill-down operation.
51
BIS 541 2012/2013 Final





Develop a data warehouse for a weather bureau
that has many probes located all over a large region,
using a star schema. These probes collect basic weather
data such as temperature, air pressure, humidity, … at
each hour. All the data is sent to a central station to be
processed.
a) Design the fact table: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) State two questions that can be answered by
querying the warehouse.
d) Show one roll-up and one drill-down operation about
one of these questions.
52
BIS 541 2011/2012 Final





2. Develop a data warehouse for holding the academic performance of
a university’s faculty members. The dimensions are time (here the
academic year is important, but the day of publication is too
detailed), faculty member, and paper. For an article published by a
faculty member in a particular paper, the number of citations
received and the impact factor of that paper are important. Papers
can be journal articles or conference proceedings. Journals can be in
SCI or SSCI, and each such journal or conference has a prestige
factor, a continuous variable.
a) Design the fact table: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) Describe in words five different types of queries that can be
answered by the OLAP cube.
d) Show two roll-up and two drill-down operations.
53
Clustering
54
Clustering preferences

Consider a popular song competition. There are N competitors
A1, A2, …, AN. The number of voters is very large: a substantial
fraction of the population of the country. Each voter is able to
rank the competitors from best to worst, e.g. for voter 1
(A4>A2>A3>A1), meaning that there are four competitors and
A4 is the best for voter 1, A1 being the worst. Suppose
preference data is available for a sample of n voters at the
beginning of the competition.

Develop a distance measure between the preferences of two
voters i and j.

Suppose you have the k-means algorithm available in a
package. Describe how you can use the k-means
algorithm to cluster voters according to their preferences.
55
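One possible direction for the preference question above is sketched below in Python. It is an illustration under assumptions, not the expected exam answer: the distance used is the Kendall tau distance (the number of competitor pairs that two voters order differently), and `rank_vector` turns a ranking into a numeric vector that a k-means package could take as input. Voter 1 is the example from the question; voters 2 and 3 and all function names are hypothetical.

```python
from itertools import combinations


def rank_vector(preference, competitors):
    """Position (1 = best) of each competitor in one voter's ranking."""
    return [preference.index(c) + 1 for c in competitors]


def kendall_tau_distance(pref_i, pref_j):
    """Number of competitor pairs that the two voters order differently."""
    pos_i = {c: k for k, c in enumerate(pref_i)}
    pos_j = {c: k for k, c in enumerate(pref_j)}
    return sum(
        (pos_i[a] - pos_i[b]) * (pos_j[a] - pos_j[b]) < 0
        for a, b in combinations(pref_i, 2)
    )


competitors = ["A1", "A2", "A3", "A4"]
voter1 = ["A4", "A2", "A3", "A1"]  # the example ranking from the question
voter2 = ["A4", "A3", "A2", "A1"]  # hypothetical: swaps A2 and A3
voter3 = ["A1", "A3", "A2", "A4"]  # hypothetical: the reverse of voter 1

if __name__ == "__main__":
    print(kendall_tau_distance(voter1, voter2))
    print(kendall_tau_distance(voter1, voter3))
    print(rank_vector(voter1, competitors))
```

Because k-means needs coordinates and means, feeding it the rank vectors amounts to using Euclidean distance on ranks rather than the pair-counting distance; which of the two better reflects closeness of preferences is a design choice to be argued in the answer.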
clustering

Construct simple data sets showing the
inadequacies of k-means clustering (20 pnt)

This algorithm is not suitable even for
spherical clusters of different sizes.

What are the advantages and disadvantages
of using k-means?
56
clustering
1. Consider a delivery center location decision
problem in a city where a set of related
products are to be delivered to markets
located in the city. Design an algorithm for this
location selection problem by extending an
algorithm we covered in class. State clearly the
algorithm and its extensions for this particular
problem.
57
MIS 542 Final S06 clustering


3. a) Describe how to modify the k-means
algorithm so as to handle categorical variables
(binary, ordinal, nominal).
b) What is a disadvantage of the agglomerative
hierarchical clustering method in the case of
large data? Suggest a way of eliminating this
disadvantage while retaining the advantages
of agglomerative methods.
58
MIS 542 Midterm S08 clustering



Generate a data set of two continuous variables X
and Y. Consider clustering based on density.
When clustered with one variable (either
X or Y), there is one cluster.
When clustered with both variables, there
are two clusters.
59
Final 2007/2008 Spring (MIS)





2. Considering the advantages and disadvantages of
partitioning methods such as k-means and density-based
methods of clustering, generate a two-dimensional
data set that is
a) (5 pnt) Successfully clustered by k-means and
DBSCAN
b) (5 pnt) Successfully clustered by k-means but not by
DBSCAN
c) (5 pnt) Successfully clustered by DBSCAN but not by
k-means
d) (10 pnt) Suggest a clustering procedure combining
the two methods
60
Midterm 2008/2009 Spring (MIS)

5. (20 points) In a clustering problem, either the z
transformation or the logistic transformation (y =
1/(1+exp(-z))) is applied to the original variables.
Discuss the effects of these transformations on the
quality and nature of clusters for a problem with two
continuous variables. Suppose the k-means algorithm is then
used for clustering. In particular, what is the consequence
of these transformations (logistic and z) on the
similarity/dissimilarity between objects and the nature of
the clusters formed?
61
Final 2010/2011 Spring (MIS)


1. (35 pt.) Consider a time series problem: a
continuous variable observed at regularly
spaced time steps, such as the daily dollar/TL
exchange rate (for each day a $/TL value is
available) or the monthly inflation rate.
a) Suppose data for some time periods (days) is
not available. Suggest a method for handling
missing value problems in time series data.
62
cont.


b) The continuous variable is to be discretized into piecewise linear
segments characterized by slope and duration. Slope can take, say,
five distinct values: very high, high, horizontal, low, very low.
Duration can take, say, three values: short, medium, and long. Plot a
time series. Plot the piecewise linear discrete form on the
same graph. Propose a method for obtaining such piecewise linear
segments.
c) The following are examples of rules extracted from the piecewise
linear segments: a long period of boom (very high slope) is
followed by a short period of down movement; a medium period of down
movement is followed by a long period of horizontal behavior.
Suggest a method for extracting such rules from piecewise linear
segments.
63
Midterm 2011/2012 Fall





In Questions 3-5, artificial data sets are generated for
given situations.
3. (10 points) Consider a data set of two continuous
variables X and Y. There are two clusters (k=2).
Considering the advantages and disadvantages of
the partitioning methods k-means and k-medoids,
generate a two-dimensional data set that
a) (5 pnt) produces almost the same clusters by k-medoids and k-means
b) (5 pnt) produces different clusters by k-medoids and
k-means
64
Final 2011/2012 Fall


3. a) (10 points) Generate data sets for two clustering
problems with two continuous variables. There are two natural
clusters under the notion of density-based clustering, but
the quality of these clusters is low for a partitioning
approach based on dissimilarity, such as k-means.
3. b) (10 points) Consider the advantages and
disadvantages of the partitioning and hierarchical
agglomerative clustering approaches. Design a method
for combining the two approaches to improve
clustering quality. (In the end there are hierarchies of
clusters.)
65
Midterm 2011/2012 Fall


6. (25 points) A retail company has asked to
segment its customers. The following variables are
available for each customer: age, income,
gender, number of children, occupation, house
owner, has a car or not. There are 6 categories
of goods sold by the company, and total
purchases from each category are available for
each customer. In addition, the average
inter-purchase time is also included in the
database.
66
Midterm 2011/2012 Fall cont.





a) What are the types and scales of these variables?
b) If your tool has only k-means algorithm which of
these variables are more suitable for the segmentation
problem?
c) What data transformations are to be applied?
d) How do you reduce number of variables used in the
analysis?
e) If you want to include categorical variables into your
clustering, how would you treat them?
67
Midterm 2010/2011 Spring (MIS)

5. (25 points) Consider a data set representing the
interactions among a set of people. The degree of
interaction is a positive real number; high values can be
interpreted as the two members being closely related
(they have close interactions, such as heavy telephone
calls or mail traffic between them). In other words,
rather than including the coordinates of variables
directly, the similarity/dissimilarity matrix is given. This
is a symmetric matrix. Develop an algorithm for
clustering similar objects into the same clusters. Assume
that the number of clusters (k) is given.
68
Midterm 2010/2011 Spring (MIS)

4. (25 points) A strategy for clustering high-dimensional
data of continuous variables is:
first apply principal components to reduce the
dimensionality of the data set, then apply
clustering to the reduced form of the data.
Discuss the drawback(s) of this approach.
69
Midterm 2012/2013 Fall (MIS)




4. (20 pts) Consider a data set of two continuous
variables X and Y. Consider two data points P and Q.
Suppose Q is the origin (0,0). P is one unit away from
Q.
a) Draw the locus of points for P based on the Euclidean,
Manhattan, and Chebyshev distance notions.
b) Suppose the relative importance of X with respect to Y
is controlled by weighting X in the distance formulas.
What is the meaning of this weight being greater than
one and less than one respectively?
c) Draw the locus of points for P for the three
distance notions in part a) when X is weighted greater
than one and less than one respectively.
70
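The three distance notions in the question above can be written down compactly, as in the Python sketch below. The way the weight enters (multiplying the absolute X difference by `wx` before combining) is an assumption made for this illustration, since the question does not fix the exact form; the function names and sample points are hypothetical as well.

```python
def euclidean(p, q, wx=1.0):
    """Euclidean distance with the X difference scaled by wx."""
    dx, dy = wx * abs(p[0] - q[0]), abs(p[1] - q[1])
    return (dx * dx + dy * dy) ** 0.5


def manhattan(p, q, wx=1.0):
    """Manhattan (city block) distance with the X difference scaled by wx."""
    dx, dy = wx * abs(p[0] - q[0]), abs(p[1] - q[1])
    return dx + dy


def chebyshev(p, q, wx=1.0):
    """Chebyshev (maximum) distance with the X difference scaled by wx."""
    dx, dy = wx * abs(p[0] - q[0]), abs(p[1] - q[1])
    return max(dx, dy)


Q = (0.0, 0.0)

if __name__ == "__main__":
    # Points that are one unit from Q under one notion but not the others.
    for p in [(0.6, 0.8), (0.5, 0.5), (1.0, 1.0)]:
        print(p, euclidean(p, Q), manhattan(p, Q), chebyshev(p, Q))
    # With wx = 2 the point (0.5, 0) is already one unit away from Q.
    print(euclidean((0.5, 0.0), Q, wx=2.0))
```

Under this convention a weight above one makes differences in X count more, so the unit locus shrinks along the X axis, and a weight below one stretches it along X.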
BIS 541 2011/2012 Final




3. Consider a customer segmentation problem
to be solved with the k-means algorithm. The
following variables are available in the dataset:
gender, member card information, total
spending in TL, and education level.
a) What are the scales of these variables?
b) How would you transform the data before
applying clustering?
c) How do you find the similarity/dissimilarity
between two customers?
71
BIS 541 2011/2012 Final



1. Generate two different data sets of two
continuous input variables X1 and X2 for a
clustering problem
a) that would give almost the same
clustering results when solved by k-means and
k-medoids
b) that would give different sets of clusters
when solved by k-means and k-medoids
72
Comparing clustering methods



Clustering methods
- Partitioning, hierarchical, density-based, model-based (probabilistic EM)
Compare clustering methods
- Output
- Interpretation
- Sensitivity to outliers
- Speed of computation
73
Classification





Decision trees
Neural networks
k-NN
Bayesian classification
Measuring and Improving Accuracy
74
MIS 542 Final S06 2











2. Given the training data set with missing values:

A (size)   B (color)   C (shape)   Class
small      yellow      round       A
big        yellow      round       A
big        yellow      red         A
small      red         round       A
small      black       round       B
big        black       cube        B
big        yellow      cube        B
big        black       round       B
small      yellow      cube        B
75
MIS 542 Final S06 2 cont.




a) Apply the C4.5 algorithm to construct a decision
tree.
b) Given the new inputs X: size=small, color=missing,
shape=round, and Y: size=big, color=yellow,
shape=missing, what is the prediction of the tree for X
and Y?
c) How do you classify the new data points given in
part b) using Bayesian classification?
d) Analyse the possibility of pruning the tree. You can
make the normal approximation to the binomial distribution
even though the number of observations is low. The z value for the upper
confidence limit of c=25% is 0.69.
76
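For part c) above, one common way to handle a missing input with a naive Bayes classifier is simply to leave that attribute out of the product. The Python sketch below follows that route. It is a hedged illustration: the rows are copied as listed in the question (including the shape value "red" in the third row), no smoothing is applied, and the function and variable names are hypothetical.

```python
from collections import Counter

# Training rows as listed in the question: (size, color, shape, class).
rows = [
    ("small", "yellow", "round", "A"),
    ("big", "yellow", "round", "A"),
    ("big", "yellow", "red", "A"),
    ("small", "red", "round", "A"),
    ("small", "black", "round", "B"),
    ("big", "black", "cube", "B"),
    ("big", "yellow", "cube", "B"),
    ("big", "black", "round", "B"),
    ("small", "yellow", "cube", "B"),
]
ATTRS = ("size", "color", "shape")


def naive_bayes_scores(query):
    """Unnormalised P(class) * product of P(attr=value | class).

    Attributes absent from query (missing values) are skipped.
    Relative frequencies are used without smoothing (assumption).
    """
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / len(rows)
        for idx, attr in enumerate(ATTRS):
            value = query.get(attr)
            if value is None:
                continue
            matches = sum(1 for r in rows if r[-1] == cls and r[idx] == value)
            score *= matches / n_cls
        scores[cls] = score
    return scores


X = {"size": "small", "shape": "round"}   # color is missing
Y = {"size": "big", "color": "yellow"}    # shape is missing

if __name__ == "__main__":
    for name, query in (("X", X), ("Y", Y)):
        scores = naive_bayes_scores(query)
        print(name, scores, "->", max(scores, key=scores.get))
```

With a data set this small, a zero count for any attribute value would wipe out a class score, so a smoothed estimate could be preferable in practice.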
MIS 542 Final S06 neural networks

4. Consider a classification problem with two
classes, C1 and C2. There are two numerical
input variables X1 and X2, taking values
between 0 and infinity. All observations are of
class C1 if they are above the X2 = 1/X1 curve (a
hyperbola). All other observations are of class C2.
Describe how multilayer perceptrons can
separate such a boundary using as few hidden
nodes as possible.
77
MIS 542 Midterm S08 2 classification






Consider a classification problem with two continuous
variables X and Y and a categorical output with two
distinct values, C1 and C2.
Generate data sets such that
A) Decision trees are appropriate for classification.
B) Decision trees are not appropriate for classification,
but a perceptron can classify the data successfully.
C) Even a single perceptron is not enough to classify
the data.
D) How do you incorporate a perceptron into decision
trees so that the cases in B and C can be classified by a
hybrid approach of DTs and perceptrons?
78
Final 2010/2011 Spring




2. (30 pt.) Consider a prediction problem, e.g. predicting
weight using height (a continuous variable) as input,
solved by neural networks. Such methods as back
propagation try to minimize the prediction error but it is
claimed that the magnitude of error depends on the
weight: a prediction error of 0.5 for a baby with a short
height should not be the same as for an adult with a
height of 2.00 meters.
a) Make a scatter plot of such a hypothetical data set
for a two variable problem.
b) Plot the prediction error on another graph
c) Do you need to modify the back propagation
algorithm so as to handle such a situation? If so explain
your modification.
79
Final 2011/2012 Fall supervised learning



4. Illustrate the overfitting of neural networks
for the following cases by generating data sets.
a) (10 points) For a binary classification
problem with two continuous inputs.
b) (10 points) For a numerical prediction
problem (output being continuous) with one
continuous input variable.
80
Midterm 2011/2012 Fall generate data



4. (10 points) Consider a classification problem
solved by a decision tree. Consider a categorical
input variable A having two distinct values. The
output variable B has two distinct classes as
well. At a particular node of the tree there are
N data objects. Generate a partitioning of the data by
input variable A for the following cases:
a) A does not provide any information: it does
not reduce the entropy at all (zero information gain).
b) A provides perfect information: it reduces
the entropy as much as possible (maximum information gain).
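The two cases can be checked numerically. The following is a minimal sketch, assuming N = 8 objects and hypothetical labels; `entropy` and `information_gain` are illustrative helper names.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(a_values, b_labels):
    """Entropy of B minus the weighted entropy of B after splitting on A."""
    n = len(b_labels)
    remainder = 0.0
    for value in set(a_values):
        subset = [b for a, b in zip(a_values, b_labels) if a == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(b_labels) - remainder

# Hypothetical node with N = 8 objects, 4 of each class.
b = ["yes"] * 4 + ["no"] * 4
# Case a) each value of A keeps the same 50/50 class mix as the node.
a_uninformative = ["a1", "a1", "a2", "a2", "a1", "a1", "a2", "a2"]
# Case b) each value of A isolates exactly one class.
a_perfect = ["a1"] * 4 + ["a2"] * 4

print(information_gain(a_uninformative, b))
print(information_gain(a_perfect, b))
```

In case a) the class proportions inside each branch equal those of the parent node, so nothing is learned; in case b) both branches are pure.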
81
MIS 541 2012/2013 Final

5. (20 pts) Consider a classification problem
solved by k-NN. Suppose in your dataset all
inputs are continuous variables. Why do you
need to apply any data transformations? What
data transformation is applied? Suppose the
variables are to be weighted after the
transformations. Devise a method for
determining optimal weights for the variables as well
as the optimal k value, considering that
k-NN is a supervised learning method.
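One possible approach can be sketched as follows: rescale the inputs so that no variable dominates the distance, then use the class labels to score candidate weights and k values by leave-one-out cross-validation. This is a minimal sketch under those assumptions; the data, the candidate grids and all function names are hypothetical.

```python
import math
from collections import Counter

def min_max_scale(rows):
    """Rescale every column to [0, 1] so all variables share one scale."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def knn_predict(train_x, train_y, query, k, weights):
    """Majority vote among the k nearest neighbours (weighted Euclidean)."""
    dists = sorted(
        (math.sqrt(sum(w * (a - b) ** 2
                       for w, a, b in zip(weights, x, query))), y)
        for x, y in zip(train_x, train_y))
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

def loo_accuracy(xs, ys, k, weights):
    """Leave-one-out accuracy: classify each point from all the others."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, xs[i], k, weights) == ys[i]
    return hits / len(xs)

def grid_search(xs, ys, ks, weight_grid):
    """Return (accuracy, k, weights) for the best candidate combination."""
    return max(((loo_accuracy(xs, ys, k, w), k, w)
                for k in ks for w in weight_grid), key=lambda t: t[0])

# Hypothetical data: [age, income]; the class depends on age only.
raw = [[25, 30000], [27, 52000], [30, 41000], [28, 60000],
       [50, 31000], [55, 50000], [52, 42000], [58, 61000]]
labels = ["no"] * 4 + ["yes"] * 4
scaled = min_max_scale(raw)
best = grid_search(scaled, labels, ks=[1, 3],
                   weight_grid=[(1.0, 1.0), (1.0, 0.0), (0.0, 1.0)])
print(best)
```

An exhaustive grid is only practical for a few variables; for more, the same cross-validated accuracy could serve as the objective of any other search procedure.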
82
MIS 541 2012/2013 Final

1. (20 pts) Consider a decision tree with only
two branches at each node, in which the attribute
selection measure is entropy. Bearing in mind that each
candidate input attribute may have more than
two distinct values, how do you modify the ID3
algorithm to handle such a constraint on the
number of branches of the tree?
83
MIS 541 2012/2013 Final

2. (20 pts) Illustrate with plots of two
continuous inputs and a binary class that neural
networks with one hidden layer are enough to classify
convex class boundaries, whereas two hidden layers are
enough to capture even non-convex class
boundaries.
84
MIS 541 2012/2013 Final



5. (20 pts) The following table consists of
training data from an employee database.
The predicted variable is Status. Age, Salary and
Department are the inputs.
Design a multilayer feedforward neural network
for the given data. Label the nodes in the
input, hidden and output layers. Describe how
you encode the input and output variables, and
specify the parameters of the network that can
be changed by the backpropagation algorithm.
85
Department  Status  Age    Salary
Sales       Senior  31-35  46K-50K
Sales       Junior  26-30  26K-30K
Sales       Junior  31-35  31K-35K
Systems     Junior  21-25  46K-50K
Systems     Senior  31-35  66K-70K
Systems     Junior  26-30  46K-50K
Systems     Senior  41-45  66K-70K
Marketing   Senior  36-40  46K-50K
Marketing   Junior  31-35  41K-45K
Secretary   Senior  46-50  36K-40K
Secretary   Junior  26-30  26K-30K
86
Midterm 2007/2008 Spring

(20 pts) The MIS department has several
criteria for choosing graduate students, such as
GPA, ALES score and interview score. Some
students fail to complete the program, while
most others graduate successfully. Treating
this outcome as a binary variable, describe how
the best weighting of the entrance criteria
could be determined. (Assume
enough data is available in our database.)
87
Final 2003/2004 Spring




4. Consider the network topology shown below. There is
one input X and two outputs Y1 and Y2. The activation
function in nodes 2, 3 and 4 is the hyperbolic tangent
$\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$. Currently all weights and
biases are zero.
a) Derive the backpropagation rule for the tanh
activation function for hidden and output units.
b) Perform one iteration of the algorithm when the
data point (X=0, Y1=1, Y2=-1) is presented to the
network.
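The key step for part a) is the derivative of the activation function. The following is a sketch under stated assumptions: squared error $E = \frac{1}{2}\sum_k (t_k - o_k)^2$, learning rate $\eta$, $o_j$ denoting the output of unit $j$ and $w_{jk}$ the weight from unit $j$ to unit $k$. None of this notation is given on the slide.

```latex
\frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x)

% Output unit k with target t_k:
\delta_k = (t_k - o_k)\,(1 - o_k^{2})

% Hidden unit j feeding the units k:
\delta_j = (1 - o_j^{2}) \sum_k w_{jk}\,\delta_k

% Weight and bias updates:
\Delta w_{ij} = \eta\,\delta_j\,o_i, \qquad \Delta b_j = \eta\,\delta_j
```

For part b), with all weights and biases at zero every net input is 0, so every unit outputs $\tanh(0) = 0$ and every derivative factor equals 1. Under the assumptions above the output deltas reduce to the targets, $\delta_{Y1} = 1$ and $\delta_{Y2} = -1$, while the hidden deltas vanish because the outgoing weights are zero. Since X = 0 and all unit outputs are 0, only the output biases change in this first iteration.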
88
[Figure: network topology with nodes 1, 2, 3 and 4]
89
Midterm 2010/2011 Spring

2. (25 points) Consider a prediction problem with one
continuous input and one output that is solved by a
network topology as follows: there are n layers, and in each
layer there is only one node with a logistic transfer
function and no bias (constant) term. The nodes in the
layers are indexed from 1 (the output node) to n
(the input is sent to that node); wi is the weight for layer i.
So w1 is the weight for the output node and wn is the
weight applied to the input before sending it to node
n (the first node). Derive the backpropagation weight
update formula for weight wi (i=1 for the output
weight and i>1 for hidden node weights). Note that
nodes are indexed in reverse order (from the
output to the input) for convenience. The derivative
of the logistic function y = 1/(1+exp(-x)) is y*(1-y).
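The chain structure makes the gradient a running product. The following is a minimal sketch, assuming a squared error E = 0.5*(y1 - t)^2 for a single training pair (x, t); the function names and numeric values are hypothetical. It uses the recursion delta_1 = (y1 - t)*y1*(1 - y1) and delta_i = delta_(i-1)*w_(i-1)*y_i*(1 - y_i), with dE/dw_i = delta_i * y_(i+1) and y_(n+1) = x.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x):
    """w[i-1] holds w_i; node n receives the input, node 1 is the output.

    Returns outs with outs[i] = output of node i (i = 1..n) and
    outs[n+1] = x; index 0 is unused.
    """
    n = len(w)
    outs = [0.0] * (n + 2)
    outs[n + 1] = x
    for i in range(n, 0, -1):
        outs[i] = sigmoid(w[i - 1] * outs[i + 1])
    return outs

def gradients(w, x, target):
    """dE/dw_i for i = 1..n, stored at index i-1."""
    n = len(w)
    outs = forward(w, x)
    grads = [0.0] * n
    delta = (outs[1] - target) * outs[1] * (1 - outs[1])   # delta_1
    grads[0] = delta * outs[2]
    for i in range(2, n + 1):
        # delta_i = delta_{i-1} * w_{i-1} * y_i * (1 - y_i)
        delta = delta * w[i - 2] * outs[i] * (1 - outs[i])
        grads[i - 1] = delta * outs[i + 1]
    return grads

def update(w, x, target, eta=0.1):
    """One gradient-descent step: w_i <- w_i - eta * dE/dw_i."""
    return [wi - eta * g for wi, g in zip(w, gradients(w, x, target))]

w = [0.5, -0.3, 0.8]            # hypothetical weights w1, w2, w3
print(gradients(w, x=1.2, target=0.7))
```

Each extra layer multiplies the gradient by a factor w*y*(1-y) with y*(1-y) at most 0.25, so the updates for weights far from the output shrink quickly in deep chains.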
90
Accuracy measures


For class balance or imbalance problems
Output variables with an ordinal scale
- How do you modify the accuracy measure for
an ordinal output variable with three
different values?
- Give an example of such a variable
91
Midterm 2008/2009 Spring

2. (20) Consider a classification problem in which
customers taking consumer credits from a
bank are classified into three risk groups. The input
variables are age (discretized into 4 groups), income (4
groups), education (4 groups), gender, the number
of months the customer has been dealing with the bank,
the average delay of payments in months, and the current
value of the account balance. The output variable has 3
categories (normal, risky or highly risky), calculated by
some procedure and provided to the data miner.
Design an encoding scheme for the input and output
variables so that the problem can be solved by a neural
network. Show a typical topology of a feedforward
network architecture.
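One possible encoding can be sketched as follows, assuming one-hot codes for the grouped variables, a single binary input for gender, and min-max scaling for the numeric variables. The group labels, value ranges and function names are hypothetical. A thermometer code would be an alternative for the ordered groups.

```python
AGE_GROUPS = ["age1", "age2", "age3", "age4"]
INCOME_GROUPS = ["inc1", "inc2", "inc3", "inc4"]
EDUCATION_GROUPS = ["edu1", "edu2", "edu3", "edu4"]
RISK_CLASSES = ["normal", "risky", "highly risky"]

def one_hot(value, categories):
    """One input (or output) node per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def min_max(value, lo, hi):
    """Map a numeric value to [0, 1]; lo and hi are assumed known ranges."""
    return (value - lo) / (hi - lo)

def encode_inputs(c):
    return (
        one_hot(c["age"], AGE_GROUPS)
        + one_hot(c["income"], INCOME_GROUPS)
        + one_hot(c["education"], EDUCATION_GROUPS)
        + [1.0 if c["gender"] == "F" else 0.0]
        + [min_max(c["months"], 0, 240),           # assumed range
           min_max(c["delay"], 0, 12),             # assumed range
           min_max(c["balance"], -10000, 100000)]  # assumed range
    )

def encode_output(risk):
    return one_hot(risk, RISK_CLASSES)

customer = {"age": "age2", "income": "inc3", "education": "edu1",
            "gender": "F", "months": 60, "delay": 3, "balance": 1000}
x = encode_inputs(customer)
print(len(x), encode_output("risky"))
```

Under these assumptions the network has 16 input nodes and 3 output nodes; the number of hidden nodes is a design choice to be tuned.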
92
Midterm 2008/2009 Spring

3. (20 points) Consider a classification problem
solved by a decision tree. There are two
categorical input variables A and B having two
distinct values each. The output variable C has
two distinct classes. Suppose the dataset is
suitable for using decision trees. Does the order
of selection of the variables affect the
classification error? Support your answer by
generating data sets pictorially. (The stopping
condition is that either a pure class is obtained or no
variable remains to be tested.)
93
Midterm 2012/2013 Fall (MIS)


3. (20 pts) A data mining study for a targeted
marketing problem reveals that the only variable (X)
explaining the buying behavior is previous spending
(continuous). The probability of a customer responding to the
mail offer is P(buy) = 0.1*X, where
0<=X<=10. Suppose there are 100 customers whose X
values are uniformly distributed between 0 and 10.
Suppose the cost of dealing with a customer is c and the
revenue from a buyer is r (r > c). What is the break-even
point in terms of the previous spending X? That is,
what is the threshold value of X that decides whether
the company should target a new customer?
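The break-even point follows from the expected profit per contacted customer. This is a sketch, assuming the cost c is paid for every contacted customer and the revenue r is earned only when the customer buys; the symbol $X^{*}$ for the threshold is introduced here for illustration.

```latex
E[\text{profit} \mid X] = r \cdot P(\text{buy}) - c = 0.1\,r\,X - c

0.1\,r\,X^{*} - c = 0 \quad\Longrightarrow\quad X^{*} = \frac{10\,c}{r}
```

Because r > c, the threshold satisfies $X^{*} < 10$, so it lies inside the range of X. Contacting a customer has non-negative expected profit when $X \ge X^{*}$. With X uniform on [0, 10], the expected share of customers above the threshold is $1 - c/r$, that is, about $100\,(1 - c/r)$ of the 100 customers.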
94
Midterm 2012/2013 Fall (MIS)

5. (20 pts) Consider a classification problem
solved by k-NN. Suppose in your dataset all
inputs are continuous variables. Why do you
need to apply any data transformations? What
data transformation is applied? Suppose the
variables are to be weighted after the
transformations. Devise a method for
determining optimal weights for the variables as well
as the optimal k value, considering that
k-NN is a supervised learning method.
95
BIS 541 2011/2012 Final

4. Construct a particular node of a decision tree.
There are 6 data points at that node. The
output is a categorical variable with two distinct
values. Generate a data set of three variables,
one being the output (Y) and the others inputs
(X1 and X2), such that X1 reduces the
entropy as much as possible (maximum information gain) whereas
X2 does not reduce the entropy at all (zero information gain).
96
BIS 541 2011/2012 Final



3. Generate data sets for a supervised learning
problem solved by neural networks.
a) There are two continuous independent
variables X1 and X2 and a class variable with
two different values such as yes and no. On the
same artificially generated dataset, illustrate
the concept of overfitting by neural networks.
b) Illustrate the behavior of the training and test
errors as the complexity of the network
increases.
97
BIS 541 2011/2012 Final



4. Consider a classification problem to be solved by the
k-NN method. The output is whether the customer will
buy a product or not. The inputs are the income, age,
education level and profession of the
customer (the latter having three distinct values).
a) Describe the data transformations needed in the
preprocessing step to prepare the data set to be
classified by k-NN.
b) How do the data transformations differ from those
needed when the same problem is solved by neural networks?
98
BIS 541 2012/2013 Final









Based on a sample of 30 observations, the population regression
model
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
is estimated.
The least squares estimate of the intercept is 10.0.
The sums of the values of the dependent and independent variables are 450
and 150, respectively.
The estimated variance of the dependent variable is 25; the variance of the
residuals is 4.
a) What is the least squares estimate of the slope coefficient? Interpret
the figure.
b) What are the values of SSR and SSE?
c) Find and interpret the coefficient of determination.
d) Test the null hypothesis that the explanatory variable X does not
have a significant effect on Y at a confidence level of 95%. Critical
value: $F_{0.05}(1, 28) = 4.20$
99
BIS 541 2012/2013 Final





Evaluate the four classification methods
(decision trees, neural networks, Bayesian
classification and k-NN) in terms of
a) accuracy
b) speed of model development and use
c) understandability and interpretability of
the output
d) handling of outliers if not handled in the
preprocessing step
100
Frequent Pattern Mining






Association rules
- Apriori, FP-Growth
Multilevel rules
Quantitative variables
Interestingness measures
Constraint-based association rule mining
Sequential pattern mining
101
MIS 541 2012/2013 Final

3. (20 pts) Consider association rules X → Y
where X is a categorical variable with more
than two values and Y is originally continuous
but discretized into categories. Give example
variables for X and Y. Illustrate that confidence
as an interestingness measure may be
misleading. Suggest a modification to the
classical confidence so as to eliminate its
drawback for this type of variable.
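One way confidence misleads is when the consequent is frequent on its own. The following is a minimal sketch with hypothetical counts, taking X = education level and Y = discretized income as example variables. It uses lift, a standard correlation-based measure, as one possible correction; other modifications are possible.

```python
def confidence(n_xy, n_x):
    """P(Y | X) estimated from counts."""
    return n_xy / n_x

def lift(n_xy, n_x, n_y, n):
    """Confidence divided by the baseline frequency of Y."""
    return confidence(n_xy, n_x) / (n_y / n)

# Hypothetical counts for the rule education = high school -> income = low
n = 1000        # all customers
n_x = 200       # education = high school
n_y = 700       # income = low
n_xy = 120      # both

print(confidence(n_xy, n_x))
print(round(lift(n_xy, n_x, n_y, n), 3))
```

Here the confidence of 0.6 looks strong, yet 70% of all customers have low income, so the rule actually lowers the chance of the consequent (lift below 1). How Y is discretized changes the baseline frequency of each category, which is why the measure should be compared against it.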
102
MIS 541 2012/2013 Final



4. (20 pts) The price of each item is
nonnegative. For the following cases
indicate the type of constraint
(monotone, anti-monotone, tough,
strongly convertible or succinct):
a) the sum of the prices of the items is less than
or equal to 10
b) the average price of the items is less than
or equal to 20
103
BIS 541 2012/2013 Final




Questions about constraint-based
association rule mining
The price of each item is nonnegative. For the
following cases indicate the type of constraint
(monotonic, anti-monotonic or none):
a) the sum of the prices of the items is less than or
equal to 10
b) the average price of the items is less than or
equal to 20
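The two properties can be probed by brute force on a small example. This is a minimal sketch with a hypothetical price list; such a check can refute a property with a counterexample, but passing it on one price list is not a proof.

```python
from itertools import combinations

def all_itemsets(items):
    """Every nonempty subset of items, as frozensets."""
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_anti_monotone(constraint, items):
    """If a set satisfies the constraint, so must all its nonempty subsets."""
    sets = list(all_itemsets(items))
    return all(constraint(sub)
               for s in sets if constraint(s)
               for sub in sets if sub < s)

def is_monotone(constraint, items):
    """If a set satisfies the constraint, so must all its supersets."""
    sets = list(all_itemsets(items))
    return all(constraint(sup)
               for s in sets if constraint(s)
               for sup in sets if s < sup)

# Hypothetical nonnegative prices
prices = {"a": 2, "b": 5, "c": 8, "d": 30, "e": 40}
items = list(prices)

def sum_le_10(s):
    return sum(prices[i] for i in s) <= 10

def avg_le_20(s):
    return sum(prices[i] for i in s) / len(s) <= 20

print(is_anti_monotone(sum_le_10, items), is_monotone(sum_le_10, items))
print(is_anti_monotone(avg_le_20, items), is_monotone(avg_le_20, items))
```

With nonnegative prices, adding an item can never decrease the sum, so a set that violates sum <= 10 stays violated in every superset. The average can move in either direction when an item is added, which is why the second constraint fails both checks.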
104
MIS 542 midterm S06 association
constraint



The price of each item in a store is
nonnegative. For the following cases indicate
the type of constraint (such as monotone,
anti-monotone, tough, strongly convertible or
succinct):
a) Containing at least one Nintendo game.
b) The average price of the items is between
100 and 500.
105
Tips for the exam



Data description for
Single variables
- Ordinal, nominal, continuous
For two variables
- One categorical, the other continuous
- Both continuous: correlation coefficient
106