Bayesian modeling


Bayesian Nominal Indicator
Modeling
Petri Nokelainen
[email protected]
Tampere University of Technology, Finland
Outline
• Overview
• Introduction to Bayesian Modeling
• Bayesian Classification Modeling
• Bayesian Dependency Modeling
• Bayesian Unsupervised Model-based Visualization
2
Overview
• Current issues (big data, MOOCs, learning analytics) have
increased efforts to develop and apply predictive modeling not
only within statistics, but also in other disciplines such as
physics, economics, bioinformatics, linguistics, computer science
and education.
• Predictiveness guards against over-fitting and serves as a natural
criterion for the quality of the model.
– Classical statistical literature does not emphasize this aspect much,
since models are kept relatively simple to avoid over-fitting and to keep
the calculations reasonable.
• Nowadays increased computing power allows more complicated
models, such as Bayesian, fuzzy and neural networks, to be used.
3
Overview
• According to Breiman (2001), there are two statistical modeling
cultures:
The Data Modeling Culture assumes that the data are
generated by a given stochastic data model (such as linear
or logistic regression).
The Algorithmic Modeling Culture treats the data
mechanism as unknown, using, for example, decision trees
and neural networks.
• Although the first of these two cultures, focusing on data models,
is still dominating, many fields outside statistics are rapidly
adopting a wide variety of tools.
4
Overview
• Examples of the Algorithmic Modeling Culture in educational and
psychological research:
– Musso, M. F., Kyndt, E., Cascallar, E. C., & Dochy, F. (2013). Predicting general academic
performance and identifying differential contribution of participating variables using artificial
neural networks. Frontline Learning Research, 1, 42-71.
– Nokelainen, P., Silander, T., Ruohotie, P., & Tirri, H. (2007). Investigating the Number of
Non-linear and Multi-modal Relationships between Observed Variables Measuring a
Growth-oriented Atmosphere. Quality & Quantity, 41(6), 869-890.
– Nokelainen, P., & Ruohotie, P. (2009). Non-linear Modeling of Growth Prerequisites in a
Finnish Polytechnic Institution of Higher Education. Journal of Workplace Learning, 21(1),
36-57.
– Nokelainen, P., Tirri, K., Campbell, J. R., & Walberg, H. (2007). Factors that Contribute or
Hinder Academic Productivity: Comparing two groups of most and least successful Olympians.
Educational Research and Evaluation, 13(6), 483-500.
– Villaverde, J. E., Godoy, D., & Amandi, A. (2006). Learning styles’ recognition in e-learning
environments with feed-forward neural networks. Journal of Computer Assisted Learning, 22,
197–206.
5
(Nokelainen & Silander, 2015.)
Overview
(Figure: statistical software packages: SPSS, the SPSS extension AMOS, and MPlus.)
6
Overview
BDM = Bayesian
Dependency Modeling
BCM = Bayesian
Classification Modeling
BUMV = Bayesian
Unsupervised Model-based Visualization
B-Course
BayMiner
7
(Nokelainen, Silander,
Ruohotie & Tirri, 2007.)
(Nokelainen & Ruohotie, 2009.)
http://b-course.hiit.fi/obc/
Bayesian Classification Modeling
The classification accuracy of the best model found is 83.48% (58.57%).
COMMON FACTORS: PUB_T, CC_PR, CC_HE, PA, C_SHO, C_FAIL, CC_AB, CC_ES
8
http://b-course.hiit.fi/obc/
Bayesian Dependency Modeling
9
http://www.bayminer.com
Bayesian Unsupervised Model-based Visualization
10
Outline
• Overview
• Introduction to Bayesian Modeling
• Bayesian Classification Modeling
• Bayesian Dependency Modeling
• Bayesian Unsupervised Model-based Visualization
11
Introduction to Bayesian Modeling
• From the social science researcher's point of view, the
requirements of traditional frequentistic statistical
analysis are very challenging.
• For example, the assumption of normality of both the
phenomenon under investigation and the data is a
prerequisite for traditional parametric frequentistic
calculations.
Continuous (range 0 to ∞): age, income, temperature, ..
Discrete (0, 1, 2, ..): FSIQ in the WAIS-III, Likert scale, favourite colors, gender, ..
12
Introduction to Bayesian Modeling
• In situations where
– a latent construct cannot be appropriately represented as a
continuous variable,
– ordinal or discrete indicators do not reflect underlying
continuous variables,
– the latent variables cannot be assumed to be normally
distributed,
traditional Gaussian modeling is clearly not
appropriate.
• In addition, normal distribution analysis sets minimum
requirements for the number of observations, and the
measurement level of variables should be continuous.
13
Introduction to Bayesian Modeling
• Frequentistic parametric statistical techniques are
designed for normally distributed (both theoretically
and empirically) indicators that have linear
dependencies.
– Univariate normality
– Multivariate normality
– Bivariate linearity
14
Introduction to Bayesian Modeling
15
(Nokelainen, 2008, p. 119)
Introduction to Bayesian Modeling
• The upper part of the figure
contains two sections, namely
“parametric” and “non-parametric”
divided into eight sub-sections
(“DNIMMOCS OLD”).
• The parametric approach is viable only if
– 1) both the phenomenon modeled
and the sample follow normal
distribution,
– 2) sample size is large enough (at
least 30 observations),
– 3) indicators are continuous, and
– 4) dependencies between the
observed variables are linear.
• Otherwise non-parametric
techniques should be applied.
16
D = Design (ce = controlled experiment, co =
correlational study)
N = Sample size
IO = Independent observations
ML = Measurement level (c = continuous, d = discrete,
n = nominal)
MD = Multivariate distribution (n = normal, s = similar)
O = Outliers
C = Correlations
S = Statistical dependencies (l = linear, nl = non-linear)
Introduction to Bayesian Modeling
N = 11 500
17
Introduction to Bayesian Modeling
Bayesian method
(1) is parameter-free: user input is not required; instead,
prior distributions of the model offer a theoretically justifiable
method for affecting the model construction;
(2) works with probabilities and can hence be expected to
produce robust results with discrete data containing nominal and
ordinal attributes;
(3) has no limit for minimum sample size;
(4) is able to analyze both linear and non-linear
dependencies;
(5) assumes no multivariate normal model;
(6) allows prediction.
18
(Nokelainen, 2008.)
Introduction to Bayesian Modeling
• Probability is a mathematical construct that behaves in
accordance with certain rules and can be used to
represent uncertainty.
– The classical statistical inference is based on a frequency
interpretation of probability, and the Bayesian inference is
based on ’subjective’ or ’degree of belief’ interpretation.
• Bayesian inference uses conditional probabilities to
represent uncertainty.
P(H | E,I) - the probability of unknown things or
”hypothesis” (H), given the evidence (E) and
background information (I).
19
Introduction to Bayesian Modeling
• The essence of Bayesian inference is in the rule, known
as Bayes' theorem, that tells us how to update our
initial probabilities P(H) if we see evidence E, in order to
find out P(H|E).
• A priori probability
• Conditional probability
• Posterior probability

P(H|E) = [P(E|H) · P(H)] / [P(E|H) · P(H) + P(E|~H) · P(~H)]
20
Introduction to Bayesian Modeling
• The theorem was invented by the English
reverend Thomas Bayes (1701-1761) and
published posthumously (1763).
• Pierre-Simon Laplace (1749-1827) published
the general form in 1812.
• Harold Jeffreys (1891-1989) further
developed Bayesian probability (e.g.,
Jeffreys prior).
21
Introduction to Bayesian Modeling
• Bayesian inference comprises the following three
principal steps:
(1) Obtain the initial probabilities P(H) for the unknown
things. (Prior distribution.)
(2) Calculate the probabilities of the evidence E (data)
given different values for the unknown things, i.e.,
P(E | H). (Likelihood or conditional distribution.)
(3) Calculate the probability distribution of interest
P(H | E) using Bayes' theorem. (Posterior
distribution.)
• Bayes' theorem can be used sequentially.
22
Introduction to Bayesian Modeling
– If we first receive some evidence E (data), and
calculate the posterior P(H | E), and at some later
point in time receive more data E', the calculated
posterior can be used in the role of prior to calculate
a new posterior P(H | E,E') and so on.
– The posterior P(H | E) expresses all the necessary
information to perform predictions.
– The more evidence we get, the more certain we will
become of the unknowns, until all but one value
combination for the unknowns have probabilities so
close to zero that they can be neglected.
23
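The sequential use of Bayes' theorem described above can be sketched in a few lines of Python. The two-hypothesis `update` function and the numbers fed to it are illustrative assumptions (borrowed from the screening example that follows), not part of the slides:

```python
def update(prior, p_e_given_h, p_e_given_not_h):
    """One application of Bayes' theorem for a binary hypothesis H."""
    evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / evidence

# Start from a prior P(H) and update twice with the same kind of evidence:
# the posterior after E serves as the prior when E' arrives.
posterior_1 = update(0.01, 0.9, 0.1)         # P(H | E)
posterior_2 = update(posterior_1, 0.9, 0.1)  # P(H | E, E')

print(round(posterior_1, 3))  # 0.083
print(round(posterior_2, 3))  # 0.45
```

Note how each new piece of evidence moves the posterior further from the prior, exactly as the slide describes.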
C_Example 1: Applying Bayes’ Theorem
• Company A is employing workers on short term jobs
that are well paid.
• The job sets certain prerequisites for applicants'
linguistic abilities.
• Earlier all the applicants were interviewed, but
nowadays it has become an impossible task as both the
number of open vacancies and applicants has increased.
• Personnel department of the company was ordered to
develop a questionnaire to preselect the most suitable
applicants for the interview.
24
C_Example 1: Applying Bayes’ Theorem
• The psychometrician who developed the instrument
estimates that it would work correctly on 90 out of 100
applicants, if they are honest.
• We know on the basis of earlier interviews that the
terms (linguistic abilities) are valid for one out of 100
people in the target population.
• The question is: If an applicant gets enough points to
participate in the interview, is he or she hired for the job
(after an interview)?
25
C_Example 1: Applying Bayes’ Theorem
• A priori probability P(H) is described by the number of
those people in the target population that really are
able to meet the requirements of the task (1 out of 100
= .01).
• The counter assumption of the a priori is P(~H), which
equals 1 − P(H) = .99.
• The psychometrician's belief about how the instrument
works is called the conditional probability P(E|H) = .9.
– The instrument's failure to indicate non-valid applicants, i.e., those
who are not able to succeed in the following interview, is
stated as P(E|~H), which equals .1.
– These values need not sum to one!
26
C_Example 1: Applying Bayes’ Theorem
• A priori probability
• Conditional probability
• Posterior probability

P(H|E) = [P(E|H) · P(H)] / [P(E|H) · P(H) + P(E|~H) · P(~H)]

P(H|E) = [(.9) · (.01)] / [(.9) · (.01) + (.1) · (.99)] ≈ .08
27
C_Example 1: Applying Bayes’ Theorem
28
C_Example 1: Applying Bayes’ Theorem
• What if the measurement error of the psychometrician's
instrument had been 20 per cent?
– P(E|H)=0.8 P(E|~H)=0.2
29
C_Example 1: Applying Bayes’ Theorem
30
C_Example 1: Applying Bayes’ Theorem
• What if the measurement error of the
psychometrician's instrument had been only one
per cent?
– P(E|H)=0.99 P(E|~H)=0.01
31
C_Example 1: Applying Bayes’ Theorem
32
C_Example 1: Applying Bayes’ Theorem
• Quite often people tend to estimate the
probabilities to be too high or low, as they are not
able to update their beliefs even in simple
decision making tasks when situations change
dynamically (Anderson, 1995).
• Monty Hall problem
– https://en.wikipedia.org/wiki/Monty_Hall_problem
33
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• One of the most important rules scientific journals in
educational science apply to judge the scientific merits of any
submitted manuscript is that all reported results should be based
on the so-called 'null hypothesis significance testing procedure'
(NHSTP) and its featured product, the p-value.
• Gigerenzer, Krauss and Vitouch (2004, p. 392) describe ‘the null
ritual’ as follows:
– 1) Set up a statistical null hypothesis of “no mean difference”
or “zero correlation.” Don’t specify the predictions of your
research or of any alternative substantive hypotheses;
– 2) use 5 per cent as a convention for rejecting the null. If
significant, accept your research hypothesis;
– 3) always perform this procedure.
34
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
– A p-value is the probability of the observed data (or
of more extreme data points), given that the null
hypothesis H0 is true, P(D|H0) (id.).
• The first common misunderstanding is that the p-value of,
say t-test, would describe how probable it is to have the
same result if the study is repeated many times
(Thompson, 1994).
• Gerd Gigerenzer and his colleagues (id., p. 393) call this the
replication fallacy, as "P(D|H0) is confused with 1 − P(D)."
35
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• The second misunderstanding, shared by both applied
statistics teachers and students, is that the p-value
would prove or disprove H0. However, a significance test
can only provide probabilities, not prove or disprove the
null hypothesis.
• Gigerenzer (id., p. 393) calls this fallacy an illusion of
certainty: “Despite wishful thinking, p(D|H0) is not the
same as P(H0|D), and a significance test does not and
cannot provide a probability for a hypothesis.”
– A Bayesian statistics provide a way of calculating a
probability of a hypothesis (discussed later).
36
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• My statistics course grades (Autumn 2006, n = 12)
ranged from one to five as follows: 1) n = 3; 2) n = 2; 3) n
= 4; 4) n = 2; 5) n = 1, showing that the lowest grade
frequency (”1”) from the course is three (25.0%).
– Previous data from the same course (2000-2005) shows that only five
students out of 107 (4.7%) had the lowest grade.
• Next, we will use the classical statistical approach (the
likelihood principle) and Bayesian statistics to calculate
if the number of the lowest course grades is
exceptionally high on my latest course when compared
to my earlier stat courses.
37
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• There are numerous possible reasons behind such
development; for example, I have become more critical
in my assessment, or the students are less motivated in
learning quantitative techniques.
• However, I believe that the most important difference
between the last and preceding courses is that the
assessment was based on a test with statistical
computations.
– The preceding courses also had computational exercises, but were
assessed only with essay answers.
38
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• I assume that the 12 students earned their grade
independently (independent observations) of each
other as the computational exercise was conducted
under my (or assistant’s) supervision.
• I further assume that the chance of getting the lowest
grade, θ, is the same for each student.
– Therefore X, the number of lowest grades ("1") on the scale from
1 to 5 among the 12 students in the latest stat course, has a
binomial (12, θ) distribution: X ~ Bin(12, θ).
– For any integer r between 0 and 12,

P(r | θ, n) = (12 choose r) · θ^r · (1 − θ)^(12−r)
39
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• The expected number of lowest grades is 12(5/107) =
0.561.
• Theta is obtained by dividing the expected number of
lowest grades by the number of students: 0.561 / 12 ≈
0.05.
• The null hypothesis is formulated as follows: H0: θ =
0.05, stating that the rate of the lowest grades on the
current stat course is nothing exceptional and is
comparable to the previous courses' rates.
40
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• Three alternative hypotheses (H1, H2, H3) are formulated
to address the concern of the increased number of
lowest grades (from 5 to 6, 7 and 8, respectively):
– H1: 12/(107/6) = .67 → .67/12 = .056 ≈ .06
– H2: 12/(107/7) = .79 → .79/12 = .065 ≈ .07
– H3: 12/(107/8) = .90 → .90/12 = .075 ≈ .08
• H1: θ = 0.06; H2: θ = 0.07; H3: θ = 0.08.
41
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• To compare the hypotheses, we calculate binomial
distributions for each value of θ.
• For example, the null hypothesis (H0) calculation yields

P(3 | .05, 12) = (12 choose 3) · .05^3 · (1 − .05)^(12−3)
= [12! / (3! · (12 − 3)!)] · .05^3 · (1 − .05)^(12−3)
= (479001600 / 2177280) · .05^3 · .95^9
≈ .017
42
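The binomial likelihoods for all four hypotheses can be checked with Python's standard library; this is a sketch of the same pmf the slide computes by hand:

```python
from math import comb

def binom_pmf(r, n, theta):
    # P(r | theta, n) for the binomial distribution
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Likelihood of observing r = 3 lowest grades among n = 12 students
for theta in (0.05, 0.06, 0.07, 0.08):
    print(theta, round(binom_pmf(3, 12, theta), 3))
# 0.05 -> 0.017, 0.06 -> 0.027, 0.07 -> 0.039, 0.08 -> 0.053
```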
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• The results for the alternative hypotheses are as
follows:
– P_H1(3 | .06, 12) ≈ .027;
– P_H2(3 | .07, 12) ≈ .039;
– P_H3(3 | .08, 12) ≈ .053.
• The ratio of the hypotheses is roughly 1:2:2:3 and could
be verbally interpreted with statements like “the second
and third hypothesis explain the data about equally
well”, or “the fourth hypothesis explains the data about
three times as well as the first hypothesis”.
43
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• Lavine (1999) reminds us that P(r | θ, n), as a function of r
(3) and θ ∈ {.05, .06, .07, .08}, describes only how well the
hypotheses explain the data; no value of r other than 3
is relevant.
– For example, P(4|.05, 12) is irrelevant as it does not describe
how well any hypothesis explains the data.
– This likelihood principle, that is, to base statistical inference
only on the observed data and not on a data that might have
been observed, is an essential feature of Bayesian approach.
44
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• The Fisherian, so-called 'classical approach' to test the
null hypothesis (H0: θ = .05) against the alternative
hypothesis (H1: θ > .05) is to calculate the p-value that
defines the probability under H0 of observing an
outcome at least as extreme as the outcome actually
observed:

p = P(r = 3 | θ = .05) + P(r = 4 | θ = .05) + … + P(r = 12 | θ = .05)
45
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• As an example, the first part of the formula is solved as
follows:
P(r  3 |   .05) 
46
n!
12!
 r (1   ) nr 
.053 (1  .05)123  .017
r!(n  r )!
3!(12  3)!
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• After calculations, the p-value of .02 would suggest H0
rejection, if the rejection level of significance is set at 5
per cent.
– Calculation of the p-value violates the likelihood principle by using
P(r | θ, n) for values of r other than the observed value of r = 3
(Lavine, 1999):
• The summands of P(4|.05, 12), P(5|.05, 12), …, P(12|.05,
12) do not describe how well any hypothesis explains
observed data.
47
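The p-value from the preceding slides is the upper tail of the same binomial distribution; a short sketch of that sum:

```python
from math import comb

def binom_pmf(r, n, theta):
    # P(r | theta, n) for the binomial distribution
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# p-value under H0: P(r >= 3 | theta = .05, n = 12)
p_value = sum(binom_pmf(r, 12, 0.05) for r in range(3, 13))
print(round(p_value, 3))  # 0.02
```

Only the first summand (r = 3) describes the observed data; the rest are the "more extreme" outcomes the likelihood principle objects to.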
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• A Bayesian approach will continue from the same point
as the classical approach, namely probabilities given by
the binomial distributions, but also make use of other
relevant sources of a priori information.
– In this domain, it is plausible to think that the computer test
(“SPSS exam”) would make the number of total failures more
probable than in the previous times when the evaluation was
based solely on the essays.
– On the other hand, the computer test has only 40 per cent
weight in the equation that defines the final stat course grade:
[.3(Essay_1) + .3(Essay_2) + .4(Computer test)]/3 = Final grade.
48
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
– Another aspect is to consider the nature of the
aforementioned tasks, as the essays are distance work
assignments while the computer test is to be performed under
observation.
– Perhaps the course grades of my earlier stat courses have a
narrower distribution due to violation of the independent
observations assumption?
• For example, some students may have copy-pasted text from other
sources or collaborated with other students.
– As we see, there are many sources of a priori information that
I judge to be inconclusive; thus, I define the null
hypothesis to be as likely true as false.
49
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• This a priori judgment is expressed mathematically as
P(H0) ≈ 1/2 ≈ P(H1) + P(H2) + P(H3).
• I further assume that the alternative hypotheses H1, H2
and H3 share the same likelihood: P(H1) ≈ P(H2) ≈ P(H3) ≈
1/6.
• These prior distributions summarize the knowledge
about θ prior to incorporating the information from the
course grades.
50
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• An application of Bayes' theorem yields

P(H0 | r = 3) = [P(r = 3 | H0) · P(H0)] /
[P(r = 3 | H0) · P(H0) + P(r = 3 | H1) · P(H1) + P(r = 3 | H2) · P(H2) + P(r = 3 | H3) · P(H3)]
= (.017 · 1/2) / (.017 · 1/2 + .027 · 1/6 + .039 · 1/6 + .053 · 1/6)
≈ 0.30
51
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• Similar calculations for the alternative hypotheses yield
P(H1 | r = 3) ≈ .16; P(H2 | r = 3) ≈ .23; P(H3 | r = 3) ≈ .31.
• These posterior distributions summarize the knowledge
about θ after incorporating the grade information.
• The odds are about 1 to 2 (.30 vs. .70) that the latest
stat course had a higher rate of lowest grades than 0.05.
52
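The full posterior calculation for the four hypotheses can be sketched as follows, combining the binomial likelihoods with the priors from the preceding slides:

```python
from math import comb

def binom_pmf(r, n, theta):
    # P(r | theta, n) for the binomial distribution
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

priors = [1 / 2, 1 / 6, 1 / 6, 1 / 6]   # P(H0), P(H1), P(H2), P(H3)
thetas = [0.05, 0.06, 0.07, 0.08]
likes = [binom_pmf(3, 12, t) for t in thetas]

# Bayes' theorem: posterior proportional to prior times likelihood
evidence = sum(p * l for p, l in zip(priors, likes))
posteriors = [p * l / evidence for p, l in zip(priors, likes)]
print([round(p, 2) for p in posteriors])  # [0.3, 0.16, 0.23, 0.31]
```

The posteriors sum to one, and the evidence against H0 (.70) matches the "1 to 2" odds on the slide.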
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• The difference between the classical and
Bayesian statistics would be only
philosophical (probability vs. inverse
probability) if they always led to
similar conclusions.
– In this case the p-value would suggest
rejection of H0 (p = .02).
– Bayesian analysis would also suggest
evidence against  = .05 (.30 vs. .70,
ratio of .43).
53
C_Example 2: Comparison of Traditional Frequentistic and
Bayesian Approach
• What if the number of the lowest grades
in the last course had been two?
– The classical approach would no longer
suggest H0 rejection (p = .12).
– The Bayesian result would still say that
there is more evidence against than
for H0 (.39 vs. .61, ratio of .64).
54
Outline
• Overview
• Introduction to Bayesian Modeling
• Bayesian Classification Modeling
• Bayesian Dependency Modeling
• Bayesian Unsupervised Model-based Visualization
55
BCM = Bayesian
Classification Modeling
BDM = Bayesian
Dependency Modeling
BUMV = Bayesian
Unsupervised Model-based Visualization
B-Course
56
Bayesian Classification Modeling
• Bayesian Classification Modeling (BCM) is implemented
in the B-Course software, which is based on discrete
Bayesian methods.
– This also applies to Bayesian Dependency Modeling, which is
discussed later.
• 'Quantitative' indicators with a high measurement level
(continuous, interval) lose more information in the
discretization process than 'qualitative' indicators
(ordinal, nominal), as all are treated in the analysis
as nominal (discrete) indicators.
57
Bayesian Classification Modeling
• For example, variable ”gender” may include numerical
values ”1” (Female) or ”2” (Male) or text
values ”Female” and ”Male” in discrete Bayesian
analysis.
• This will inevitably lead to a loss of power (Cohen, 1988;
Murphy & Myors, 1998).
• However, ensuring that sample size is large enough is a
simple way to address this problem.
58
Sample size estimation
• N
– Population size.
• n
– Estimated sample size.
• Sampling error (e)
– Difference between the true (unknown)
value and observed values, if the survey
were repeated (=sample collected)
numerous times.
• Confidence interval
– Spread of the observed values that would
be seen if the survey were repeated
numerous times.
• Confidence level
– How often the observed values would be
within sampling error of the true value if
the survey were repeated numerous times.
59
(Murphy & Myors, 1998.)
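The terms defined above (sampling error, confidence level, population size) connect in the standard normal-approximation sample-size formula. The formula itself is not given on the slide; it is added here as a common textbook approximation, with z = 1.96 (95% confidence) and p = 0.5 (most conservative proportion) as assumed defaults:

```python
from math import ceil

def sample_size(e, z=1.96, p=0.5, N=None):
    """Textbook approximation n = z^2 * p(1-p) / e^2,
    optionally corrected for a finite population of size N."""
    n0 = z**2 * p * (1 - p) / e**2
    if N is not None:
        n0 = n0 / (1 + (n0 - 1) / N)  # finite population correction
    return ceil(n0)

print(sample_size(0.05))         # 385 (95% confidence, +/- 5% error)
print(sample_size(0.05, N=300))  # 169 for a population of 300
```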
Bayesian Classification Modeling
• Aim of the BCM is to select the variables that are best
predictors for different class memberships (e.g., gender,
job title, level of giftedness).
• In the classification process, the automatic search is
looking for the best set of variables to predict the class
variable for each data item.
60
Bayesian Classification Modeling
• The search procedure resembles the traditional linear
discriminant analysis (LDA, see Huberty, 1994), but the
implementation is totally different.
– For example, a variable selection problem that is addressed
with forward, backward or stepwise selection procedure in
LDA is replaced with a genetic algorithm approach (e.g.,
Hilario, Kalousisa, Pradosa & Binzb, 2004; Hsu, 2004) in the
Bayesian classification modeling.
61
Bayesian Classification Modeling
• The genetic algorithm approach means that variable
selection is not limited to one (or two or three) specific
approach; instead many approaches and their
combinations are exploited.
– One possible approach is to begin with the presumption that
the models (i.e., possible predictor variable combinations)
that resemble each other a lot (i.e., have almost same
variables and discretizations) are likely to be almost equally
good.
– This leads to a search strategy in which models that resemble
the current best model are selected for comparison, instead of
picking models randomly.
62
Bayesian Classification Modeling
– Another approach is to abandon the habit of always rejecting
the weakest model and instead collect a set of relatively good
models.
– The next step is to combine the best parts of these models so
that the resulting combined model is better than any of the
original models.
• B-Course is capable of mobilizing many more viable
approaches, for example, occasionally rejecting the better
model (algorithms like hill climbing and simulated annealing)
or trying to avoid picking a similar model twice (tabu
search).
63
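The neighbourhood-based search idea described above (compare models that resemble the current best, one change at a time) can be sketched as a simple hill climb over variable subsets. The scoring function below is a toy stand-in, not B-Course's actual Bayesian score:

```python
import random

def hill_climb(score, n_vars, iters=200, seed=1):
    """Greedy local search: flip one variable's inclusion at a time,
    keeping the change when the score does not decrease."""
    rng = random.Random(seed)
    best = [False] * n_vars
    for _ in range(iters):
        cand = best[:]
        i = rng.randrange(n_vars)
        cand[i] = not cand[i]  # a neighbouring model
        if score(cand) >= score(best):
            best = cand
    return best

# Toy score: variables 0 and 2 act as 'good predictors', the rest add noise.
toy_score = lambda m: sum(2 if on and i in (0, 2) else (-1 if on else 0)
                          for i, on in enumerate(m))
print(hill_climb(toy_score, 5))  # [True, False, True, False, False]
```

Real implementations combine this with restarts, model pooling, or tabu lists, as the slides describe.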
Bayesian Classification Modeling
64
(Nokelainen, Ruohotie, & Tirri, 1999.)
For an example of
practical use of BCM, see
Nokelainen, Tirri,
Campbell and Walberg
(2007).
65
Modeling of Vocational Excellence in Air Traffic
Control
•The study describes the characteristics and
predictors that explain air traffic controllers'
(ATCO) vocational expertise and excellence.
•It analyzes the role of natural abilities, self-regulative
abilities and environmental conditions in
ATCOs' vocational development.
66
(Pylväs, Nokelainen, & Roisko, 2015.)
Modeling of Vocational Excellence in Air Traffic
Control
•The target population of the study consisted of
ATCOs in Finland (N=300) of which 28,
representing four different airports, were
interviewed.
•The research data also included interviewees’
aptitude test scoring, study records and employee
assessments.
67
Modeling of Vocational Excellence in Air Traffic
Control
• The research questions were examined by using
theoretical concept analysis.
• The qualitative data analysis involved both
content analysis and Bayesian classification
modeling.
68
Modeling of Vocational Excellence in Air Traffic
Control
69
Modeling of Vocational Excellence in Air Traffic
Control
RQ1a
What are the differences in characteristics between
the air traffic controllers representing vocational
expertise and vocational excellence?
70
Modeling of Vocational Excellence in Air Traffic
Control
"…the natural ambition of wanting to be good. Air
traffic controllers have perhaps generally a strong
professional pride."
”Interesting and rewarding work, that is the basis
of wanting to stay in this work until retiring.”
71
Modeling of Vocational Excellence in Air Traffic
Control
•"I read all the regulations and instructions
carefully and precisely, and try to think …the
majority wave aside of them. It reflects on work."
"…but still I consider myself more precise than the
majority […]a bad air traffic controller have delays,
good air traffic controllers do not have delays
which is something that also pilots appreciate
because of the strict time limits.”
72
Modeling of Vocational Excellence in Air Traffic
Control
73
Modeling of Vocational Excellence in Air Traffic
Control
74
Modeling of Vocational Excellence in Air Traffic
Control
75
Modeling of Vocational Excellence in Air Traffic
Control
76
Modeling of Vocational Excellence in Air Traffic
Control
77
Classification accuracy = 89.0 %.
78
Modeling of Vocational Excellence in Air Traffic
Control
79
Modeling of Vocational Excellence in Air Traffic
Control
80
Outline
• Research Overview
• Introduction to Bayesian Modeling
• Investigating Non-linearities with Bayesian Networks
• Bayesian Classification Modeling
• Bayesian Dependency Modeling
• Bayesian Unsupervised Model-based Visualization
81
BCM = Bayesian
Classification Modeling
BDM = Bayesian
Dependency Modeling
BUMV = Bayesian
Unsupervised Model-based Visualization
B-Course
82
Bayesian Dependency Modeling
• Bayesian dependency modeling
(BDM) is applied to examine
dependencies between variables
through both their visual
representation and the probability
ratio of each dependency.
• Graphical visualization of
Bayesian network contains two
components:
– 1) Observed variables visualized as
ellipses.
– 2) Dependencies visualized as lines
between nodes.
83
(Figure: a simple Bayesian network with nodes Var 1, Var 2 and Var 3.)
C_Example 4: Calculation of Bayesian Score
• Bayesian score (BS), that is, the probability of the model
P(M|D), allows the comparison of different models.
Figure 9. An Example of Two Competing Bayesian
Network Structures
84
(Nokelainen, 2008, p. 121.)
C_Example 4: Calculation of Bayesian Score
• Let us assume that we have the following data:
x1 x2
1 1
1 1
2 2
1 2
1 1
• Model 1 (M1) represents the two variables, x1 and x2
respectively, without statistical dependency, and the model 2
(M2) represents the two variables with a dependency (i.e., with a
connecting arc).
– The binomial data might be the result of an experiment where the five
participants have drunk a nice cup of tea before (x1) and after (x2) a test
of geographic knowledge.
85
C_Example 4: Calculation of Bayesian Score
• In order to calculate P(M1,2|D), we need to solve
P(D|M1,2) for the two models M1 and M2.
– Probability of the data given the model is solved by
using the following marginal likelihood equation
(Congdon, 2001, p. 473; Myllymäki, Silander, Tirri, &
Uronen, 2001; Myllymäki & Tirri, 1998, p. 63):
P(D | M) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [Γ(N′_ij) / Γ(N′_ij + N_ij)] · ∏_{k=1}^{r_i} [Γ(N′_ijk + N_ijk) / Γ(N′_ijk)]
86
C_Example 4: Calculation of Bayesian Score
• In the equation, the following symbols are
used:
– n is the number of variables (i indexes
variables from 1 to n);
– ri is the number of values in the i:th variable (k
indexes these values from 1 to ri);
– qi is the number of possible configurations
of the parents of the i:th variable;
• The marginal likelihood equation
produces a Bayesian Dirichlet score that
allows model comparison (Heckerman et
al., 1995; Tirri, 1997; Neapolitan & Morris,
2004).
– Nij describes the number of rows in the data that have the j:th
configuration for the parents of the i:th variable;
– Nijk describes how many rows in the data that have the k:th value for
the i:th variable also have the j:th configuration for the parents of the
i:th variable;
– N′ is the equivalent sample size, set to be the average number of
values divided by two.

P(D | M) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [Γ(N′_ij) / Γ(N′_ij + N_ij)] · ∏_{k=1}^{r_i} [Γ(N′_ijk + N_ijk) / Γ(N′_ijk)]
87
C_Example 4: Calculation of Bayesian Score
• First, P(D|M1) is calculated given the values of variable
x1 (with equivalent sample size N′ = 2/2 = 1.00):

P(D_x1 | M1) = [Γ(N′) / Γ(N′ + N)] · [Γ(N′/2 + N_1) / Γ(N′/2)] · [Γ(N′/2 + N_2) / Γ(N′/2)]
= [Γ(1.00) / Γ(1.00 + 5)] · [Γ(0.50 + 4) / Γ(0.50)] · [Γ(0.50 + 1) / Γ(0.50)]
≈ 0.008 × 6.563 × 0.500
≈ 0.027

x1 x2
1 1
1 1
2 2
1 2
1 1
88
C_Example 4: Calculation of Bayesian Score
• Second, the values for x2 are calculated:

P(D_x2 | M1) = [Γ(1.00) / Γ(1.00 + 5)] · [Γ(0.50 + 3) / Γ(0.50)] · [Γ(0.50 + 2) / Γ(0.50)]
≈ 0.008 × 1.875 × 0.750
≈ 0.012

x1 x2
1 1
1 1
2 2
1 2
1 1
89
C_Example 4: Calculation of Bayesian Score
• The BS, the probability of the first model P(M1|D), is
0.027 × 0.012 ≈ 0.000324.
90
C_Example 4: Calculation of Bayesian Score
• Third, P(D|M2) is calculated given the values of variable
x1:

P(D_x1 | M2) = [Γ(1.00) / Γ(1.00 + 5)] · [Γ(0.50 + 4) / Γ(0.50)] · [Γ(0.50 + 1) / Γ(0.50)]
≈ 0.008 × 6.563 × 0.500
≈ 0.027
91
Example 4: Calculation of Bayesian Score
• Fourth, the values of x2 for the first parent configuration (x1 = 1) are calculated (now q = 2, so N'/q = 0.50 and N'/(rq) = 0.25):

\frac{\Gamma(0.50)\,\Gamma(0.25 + 3)\,\Gamma(0.25 + 1)}{\Gamma(0.50 + 4)\,\Gamma(0.25)\,\Gamma(0.25)} \approx 0.152 \times 0.703 \times 0.250 \approx 0.027
92
Example 4: Calculation of Bayesian Score
• Fifth, the values of x2 for the second parent configuration (x1 = 2) are calculated:

\frac{\Gamma(0.50)\,\Gamma(0.25 + 0)\,\Gamma(0.25 + 1)}{\Gamma(0.50 + 1)\,\Gamma(0.25)\,\Gamma(0.25)} \approx 2.000 \times 1.000 \times 0.250 \approx 0.500
93
Example 4: Calculation of Bayesian Score
• The BS, the probability of the data given the second model, P(D|M2), is 0.027 × 0.027 × 0.500 ≈ 0.000365.
94
Example 4: Calculation of Bayesian Score
• Bayes' theorem enables the calculation of the ratio of the two models, M1 and M2.
– As both models share the same a priori probability, P(M1) = P(M2), the priors cancel out.
– The probability of the data, P(D), also cancels out, as it appears in the same position in both formulas:

\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1)\,P(M_1)\,/\,P(D)}{P(D \mid M_2)\,P(M_2)\,/\,P(D)} = \frac{0.000324}{0.000365} \approx 0.88
95
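The whole worked example can be verified numerically. The sketch below recomputes the slides' figures with `math.lgamma`; the small helper `g` and the variable names are my own:

```python
from math import lgamma, exp

def g(a, b):
    """Gamma(a + b) / Gamma(a), computed via log-gamma for stability."""
    return exp(lgamma(a + b) - lgamma(a))

# M1: x1 and x2 independent (q = 1, r = 2, N' = 1.00 for both variables).
p_x1 = g(0.5, 4) * g(0.5, 1) / g(1.0, 5)   # x1 counts 4 and 1 -> ~0.027
p_x2 = g(0.5, 3) * g(0.5, 2) / g(1.0, 5)   # x2 counts 3 and 2 -> ~0.012
p_m1 = p_x1 * p_x2

# M2: x1 -> x2, so x2 has q = 2 parent configurations (N'/q = 0.50, N'/(rq) = 0.25).
p_x2_c1 = g(0.25, 3) * g(0.25, 1) / g(0.5, 4)   # x1 = 1 rows: x2 counts 3, 1 -> ~0.027
p_x2_c2 = g(0.25, 0) * g(0.25, 1) / g(0.5, 1)   # x1 = 2 rows: x2 counts 0, 1 -> ~0.500
p_m2 = p_x1 * p_x2_c1 * p_x2_c2

ratio = p_m1 / p_m2   # ~0.88: the ratio is below 1, so M2 is more probable
```

Computed with full precision the ratio is exactly 0.875, which rounds to the 0.88 reported on the slide.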
Example 4: Calculation of Bayesian Score
• The result of the model comparison shows that, since the ratio is less than 1, M2 is more probable than M1.
• This result becomes explicit when we investigate the sample data more closely.
• Even a sample this small (n = 5) shows a clear dependency between the values of x1 and x2 (four out of five value pairs are identical).

x1  x2
1   1
1   1
2   2
1   2
1   1
96
• How many models are there? For n variables there are 2^(n(n-1)/2) of them.
97
http://b-course.hiit.fi/obc/howmanymodels.html
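The count on this slide grows explosively with n, which is why exhaustive model comparison quickly becomes infeasible. A small illustrative helper (the function name is my own):

```python
def model_count(n):
    """Number of models for n variables, as given on the slide: 2^(n(n-1)/2)."""
    return 2 ** (n * (n - 1) // 2)

# The count explodes quickly:
# model_count(2) -> 2, model_count(5) -> 1024, model_count(10) -> 2**45
```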
For an example of
practical use of BDM, see
Nokelainen and Tirri
(2010).
98
Our hypothesis regarding the first research question was that intrinsic goal
orientation (INT) is positively related to moral judgment (Batson & Thompson,
2001; Kunda & Schwartz, 1983).
It was also hypothesized, based on Blasi's (1999) argument that emotions cannot be predictors of moral action, that fear of failure (the affective motivational section) is not related to moral judgment.
Research evidence supported both hypotheses: firstly, only intrinsic motivation was directly (and positively) related to moral judgment, and secondly, the affective motivational section was not present in the predictive model.
99
(Nokelainen & Tirri, 2010.)
Conditioning on the three levels of moral judgment showed a positive statistical relationship between moral judgment and intrinsic goal orientation. The probability of belonging to the most intrinsically motivated group, group three (M = 3.7–5.0), increases from 15 per cent to 90 per cent as moral judgment abilities increase. There is a similar but less steep increase in extrinsic goal orientation (from 5% to 12%), but we believe it is mostly tied to the increase in intrinsic goal orientation.
100
(Nokelainen & Tirri, 2010.)
For an example of
practical use of BDM see
Nokelainen and Tirri
(2007).
101
102
(Nokelainen & Tirri, 2007.)
In conflict situations, my superior is able
to draw out all parties and understand
the differing perspectives.
My superior sees other people in positive
rather than in negative light.
My superior has an optimistic "glass half
full" outlook.
103
[Figures on slides 104-106 show Bayesian network views of the three items above (EL_iv_17_49, EL_ii_09_26, EL_ii_09_25), with probability pairs 2% vs. 90% and 21% vs. 78%, then 69% and 66%, then 95% and 85%.]
104-106
Outline
•
•
•
•
•
107
Overview
Introduction to Bayesian Modeling
Bayesian Classification Modeling
Bayesian Dependency Modeling
Bayesian Unsupervised Model-based Visualization
BCM = Bayesian
Classification Modeling
BDM = Bayesian
Dependency Modeling
BUMV = Bayesian
Unsupervised Model-based Visualization
BayMiner
108
Bayesian Unsupervised Model-based Visualization
[Figure: taxonomy of the techniques discussed in this section]
SUPERVISED: LDA; BSMV
UNSUPERVISED: VISUALIZATION TECH.; CLUSTER ANALYSIS; EFA; DISC. MULTIV. ANAL.
  VISUALIZATION TECH.: NON-REDUC.; REDUCING (PROJECTION TECH.)
    PROJECTION TECH.: LINEAR (PCA; PROJ.PUR.); NON-LINEAR (MDS; NEUR.N.; SOM; PRIN.C.; ICA; BUMV)
109
Bayesian Unsupervised Model-based Visualization
• Supervised techniques, for example, linear discriminant analysis
(LDA) and supervised Bayesian networks (BSMV, see Kontkanen,
Lahtinen, Myllymäki, Silander & Tirri, 2000) assume a given
structure (Venables & Ripley, 2002, p. 301).
• Unsupervised techniques, for example, exploratory factor analysis (EFA), discover the variable structure from the evidence of the data matrix.
• Unsupervised techniques are further divided into four subcategories: 1) visualization techniques; 2) cluster analysis; 3) factor analysis; 4) discrete multivariate analysis.
110
Bayesian Unsupervised Model-based Visualization
• According to Venables and Ripley (id.), visualization techniques are often more effective than clustering techniques at discovering interesting groupings in the data, and they avoid the danger of over-interpretation of the results, as the researcher does not specify the number of expected latent dimensions.
• In cluster analysis the centroids that represent the clusters are
still high-dimensional, and some additional illustration techniques
are needed for visualization (Kaski, 1997), for example MDS (Kim,
Kwon, & Cook, 2000).
112
Bayesian Unsupervised Model-based Visualization
• Several graphical means have been proposed for visualizing high-dimensional data items directly, by letting each dimension govern some aspect of the visualization and then integrating the results into one figure.
• These techniques can be used to visualize any kind of high-dimensional data vectors, either the data items themselves or vectors formed of some descriptors of the data set, such as the five-number summaries (Tukey, 1977).
113
Bayesian Unsupervised Model-based Visualization
• The simplest technique to visualize a data set is to plot a "profile" of each item, that is, a two-dimensional graph in which the dimensions are enumerated on the x-axis and the corresponding values on the y-axis.
• Other alternatives are scatter plots and pie diagrams.
114
Bayesian Unsupervised Model-based Visualization
• The major drawback that applies to all these techniques is that
they do not reduce the amount of data.
– If the data set is large, the display consisting of all the data items
portrayed separately will be incomprehensible. (Kaski, 1997.)
• Techniques reducing the dimensionality of the data items are
called projection techniques.
115
Bayesian Unsupervised Model-based Visualization
• The goal of the projection is to represent the input data items in a
lower-dimensional space in such a way that certain properties of
the structure of the data set are preserved as faithfully as
possible.
– The projection can be used to visualize the data set if a sufficiently small
output dimensionality is chosen. (id.)
• Projection techniques are divided into two major groups, linear
and non-linear projection techniques.
117
Bayesian Unsupervised Model-based Visualization
• Linear projection techniques consist of principal component
analysis (PCA) and projection pursuit.
– In exploratory projection pursuit (Friedman, 1987) the data is projected linearly, but this time a projection is sought that reveals as much of the non-normally distributed structure of the data set as possible.
– This is done by assigning a numerical “interestingness” index to each
possible projection, and by maximizing the index.
– The definition of interestingness is based on how much the projected data
deviates from normally distributed data in the main body of its
distribution.
119
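The index-maximization idea can be sketched in a few lines. This is a toy version: squared excess kurtosis stands in for Friedman's more refined interestingness index, and the data set and function names are illustrative:

```python
from math import cos, sin, pi

# Toy 2-D data: clearly bimodal along the x-axis, mildly spread along y.
data = [(x, y) for x in (-2.0, 2.0) for y in (-1.5, -0.75, 0.0, 0.75, 1.5)]

def interestingness(angle):
    """Deviation of the 1-D projection from normality (squared excess kurtosis)."""
    proj = [cos(angle) * x + sin(angle) * y for x, y in data]
    n = len(proj)
    mean = sum(proj) / n
    var = sum((p - mean) ** 2 for p in proj) / n
    m4 = sum((p - mean) ** 4 for p in proj) / n
    return (m4 / var ** 2 - 3.0) ** 2   # zero for exactly normal-looking data

# Pursue the most interesting projection by scanning candidate angles.
best = max((i * pi / 180 for i in range(180)), key=interestingness)
```

For this data the scan picks the angle 0, i.e. the x-direction, where the projection is bimodal and hence maximally non-normal by this index.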
Bayesian Unsupervised Model-based Visualization
• Non-linear unsupervised projection techniques consist of
multidimensional scaling, principal curves and various other
techniques including SOM, neural networks and Bayesian
unsupervised networks (Kontkanen, Lahtinen, Myllymäki & Tirri,
2000).
121
Bayesian Unsupervised Model-based Visualization
• The aforementioned PCA technique, despite its popularity, cannot
take into account non-linear structures, structures consisting of
arbitrarily shaped clusters or curved manifolds since it describes
the data in terms of a linear subspace.
• Projection pursuit tries to express some non-linearities, but if the data set is high-dimensional and highly non-linear, it may be difficult to visualize it with linear projections onto a low-dimensional display even if the "projection angle" is chosen carefully (Friedman, 1987).
123
Bayesian Unsupervised Model-based Visualization
• Several approaches have been proposed for reproducing non-linear higher-dimensional structures on a lower-dimensional display.
• The most common techniques allocate a representation for each
data point in the lower-dimensional space and try to optimize
these representations so that the distances between them would
be as similar as possible to the original distances of the
corresponding data items.
• The techniques differ in how the different distances are weighted
and how the representations are optimized. (Kaski, 1997.)
124
Bayesian Unsupervised Model-based Visualization
• Multidimensional scaling (MDS) is not one specific tool; rather, it refers to a group of techniques widely used, especially in the behavioral, econometric, and social sciences, to analyze subjective evaluations of pairwise similarities of entities.
• The starting point of MDS is a matrix consisting of the pairwise
dissimilarities of the entities.
• The basic idea of the MDS technique is to approximate the
original set of distances with distances corresponding to a
configuration of points in a Euclidean space.
125
Bayesian Unsupervised Model-based Visualization
• MDS can be considered to be an alternative to factor analysis.
• In general, the goal of the analysis is to detect meaningful
underlying dimensions that allow the researcher to explain
observed similarities or dissimilarities (distances) between the
investigated objects.
• In factor analysis, the similarities between objects (e.g., variables)
are expressed in the correlation matrix.
126
Bayesian Unsupervised Model-based Visualization
• With MDS we may analyze any kind of similarity or dissimilarity
matrix, in addition to correlation matrices, specifying that we
want to reproduce the distances based on n dimensions.
• After the formation of the matrix, MDS attempts to arrange the "objects" (e.g., factors of a growth-oriented atmosphere) in a space with a particular number of dimensions so as to reproduce the observed distances.
• As a result, the distances are explained in terms of underlying
dimensions.
127
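The distance-reproduction idea above can be sketched as a minimal raw-stress minimization by gradient descent; the 3-object dissimilarity matrix and all names are illustrative, and real MDS software uses more robust optimizers:

```python
from math import dist

# Pairwise dissimilarities of three "objects" (a 3-4-5 triangle, so an exact
# two-dimensional configuration exists).
target = {(0, 1): 3.0, (0, 2): 4.0, (1, 2): 5.0}

# Start from an arbitrary non-degenerate configuration in the plane.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

def stress(pts):
    """Raw stress: squared mismatch between current and target distances."""
    return sum((dist(pts[i], pts[j]) - d) ** 2 for (i, j), d in target.items())

# Minimize the stress by plain gradient descent on the point coordinates.
lr = 0.02
for _ in range(5000):
    grad = [[0.0, 0.0] for _ in points]
    for (i, j), d in target.items():
        dij = dist(points[i], points[j])
        coef = 2.0 * (dij - d) / dij          # derivative of the stress term
        for axis in range(2):
            g = coef * (points[i][axis] - points[j][axis])
            grad[i][axis] += g
            grad[j][axis] -= g
    for p, g in zip(points, grad):
        p[0] -= lr * g[0]
        p[1] -= lr * g[1]
```

After the descent the configured distances match the targets almost exactly, i.e. the stress is driven to (near) zero, which is the best case: the dissimilarities are fully explained by two underlying dimensions.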
Bayesian Unsupervised Model-based Visualization
• MDS based on Euclidean distance does not, in general, properly reflect the properties of complex problem domains.
• In real-world situations the similarity of two vectors is not a universal property; from different points of view the same two vectors may in the end appear quite dissimilar (Kontkanen, Lahtinen, Myllymäki, Silander, & Tirri, 2000).
• Another problem with the MDS techniques is that they are
computationally very intensive for large data sets.
128
Bayesian Unsupervised Model-based Visualization
• Bayesian unsupervised model-based visualization (BUMV) is
based on Bayesian Networks (BN).
• BN is a representation of a probability distribution over a set of random variables. It consists of a directed acyclic graph (DAG) in which the nodes correspond to domain variables and the arcs define a set of independence assumptions, which allow the joint probability distribution for a data vector to be factorized as a product of simple conditional probabilities. Two vectors are considered similar if they lead to similar predictions when given as input to the same Bayesian network model. (Kontkanen, Lahtinen, Myllymäki, Silander, & Tirri, 2000.)
129
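The factorization can be sketched with the two-variable x1 -> x2 network from Example 4 earlier in the deck; estimating the conditional probability tables as simple relative frequencies of the five-row data is my own illustration:

```python
# Joint distribution of the x1 -> x2 network factorizes as P(x1) * P(x2 | x1).
# Conditional probability tables from the example data:
# x1 = (1, 1, 2, 1, 1), x2 = (1, 1, 2, 2, 1).
p_x1 = {1: 4 / 5, 2: 1 / 5}
p_x2_given_x1 = {1: {1: 3 / 4, 2: 1 / 4},   # among the four rows with x1 = 1
                 2: {1: 0.0,   2: 1.0}}     # among the single row with x1 = 2

def joint(x1, x2):
    """P(x1, x2) as a product of simple conditional probabilities."""
    return p_x1[x1] * p_x2_given_x1[x1][x2]

# The factorized joint sums to one over all configurations.
total = sum(joint(a, b) for a in (1, 2) for b in (1, 2))
```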
Bayesian Unsupervised Model-based Visualization
• Naturally, there are numerous viable alternatives to BUMV, such as the Self-Organizing Map (SOM) and Independent Component Analysis (ICA).
• SOM is a neural network algorithm that has been used for a wide
variety of applications, mostly for engineering problems but also
for data analysis (Kohonen, 1995).
– SOM is based on a neighborhood-preserving topological map tuned according to the geometric properties of the sample vectors.
• ICA seeks a transformation in which the components are as statistically independent as possible, minimizing the statistical dependence between them (Hyvärinen & Oja, 2000).
– The usage of ICA is comparable to that of PCA: the aim is to present the data in a manner that facilitates further analysis.
130
Bayesian Unsupervised Model-based Visualization
• The first major difference between the Bayesian and neural network approaches for an educational science researcher is that the former operates with a familiar symmetrical probability range from 0 to 1, while the upper limit of the asymmetrical probability scale in the latter approach is unknown.
• The second fundamental difference between the two types of
networks is that a perceptron in the hidden layers of neural
networks does not in itself have an interpretation in the domain
of the system, whereas all the nodes of a Bayesian network
represent concepts that are well defined with respect to the
domain (Jensen, 1995).
131
Bayesian Unsupervised Model-based Visualization
• The meaning of a node and its probability table can be subject to
discussion, regardless of their function in the network, but it does
not make any sense to discuss the meaning of the nodes and the
weights in a neural network: Perceptrons in the hidden layers
only have a meaning in the context of the functionality of the
network.
• Construction of a Bayesian network requires detailed knowledge
of the domain in question.
– If such knowledge can only be obtained through a series of examples (i.e.,
a data base of cases), neural networks seem to be an easier approach. This
might be true in cases such as the reading of handwritten letters, face
recognition, and other areas where the activity is a 'craftsman like' skill
based solely on experience.
132
(Jensen, 1995.)
Bayesian Unsupervised Model-based Visualization
• It is often criticized that in order to construct a Bayesian network
you have to ‘know’ too many probabilities.
– However, there is not a considerable difference between this number and
the number of weights and thresholds that have to be ‘known’ in order to
build a neural network, and these can only be learnt by training.
• A weakness of neural networks is that you are unable to utilize the knowledge you might have in advance.
• Probabilities, on the other hand, can be assessed using a combination of theoretical insight, empirical studies independent of the constructed system, training, and various more or less subjective estimates.
133
(Jensen, 1995.)
Bayesian Unsupervised Model-based Visualization
• In the construction of a neural network, it is decided in advance which relations information is gathered about, and which relations the system is expected to compute (the route of inference is fixed).
• Bayesian networks are much more flexible in that respect.
134
(Jensen, 1995.)
For an example of
practical use of BUMV,
see Nokelainen and
Ruohotie (2009).
135
Results showed that managers and teachers had higher growth motivation and level
of commitment to work than other personnel, including job titles such as cleaner,
caretaker, accountant and computer support.
Employees across all job titles in the organization who had temporary or part-time contracts had higher self-reported growth motivation and commitment to work and organization than their established colleagues.
136
137
Links
• B-Course
• BayMiner
138
http://b-course.hiit.fi/obc
http://www.bayminer.com
References
• Anderson, J. (1995). Cognitive Psychology and its Implications. New York:
Freeman.
• Bayes, T. (1763). An essay towards solving a problem in the doctrine of
chances. Philosophical Transactions of the Royal Society, 53, 370-418.
• Bernardo, J., & Smith, A. (2000). Bayesian theory. New York: Wiley.
• Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science,
16(3), 199–231.
• Congdon, P. (2001). Bayesian Statistical Modelling. Chichester: John Wiley &
Sons.
• Friedman, J. (1987). Exploratory Projection Pursuit. Journal of American
Statistical Association, 82, 249-266.
• Gigerenzer, G. (2000). Adaptive thinking. New York: Oxford University Press.
139
References
• Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you
always wanted to know about significance testing but were afraid to ask. In D.
Kaplan (Ed.), The SAGE handbook of quantitative methodology for the social
sciences (pp. 391-408). Thousand Oaks: Sage.
• Gill, J. (2002). Bayesian methods. A Social and Behavioral Sciences Approach.
Boca Raton: Chapman & Hall/CRC.
• Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian
networks: The combination of knowledge and statistical data. Machine
Learning, 20(3), 197-243.
• Hilario, M., Kalousisa, A., Pradosa, J., & Binzb, P.-A. (2004). Data mining for
mass-spectra based diagnosis and biomarker discovery. Drug Discovery Today:
BIOSILICO, 2(5), 214-222.
• Huberty, C. (1994). Applied Discriminant Analysis. New York: John Wiley &
Sons.
140
References
• Hyvärinen, A., & Oja, E. (2000). Independent Component Analysis: Algorithms
and Applications. Neural Networks, 13(4-5), 411-430.
• Jensen, F. V. (1995). Paradigms of Expert Systems. HUGIN Lite 7.4 User
Manual.
• Kaski, S. (1997). Data exploration using self-organizing maps. Doctoral
dissertation. Acta Polytechnica Scandinavica, Mathematics, Computing and
Management in Engineering Series No. 82. Espoo: Finnish Academy of
Technology.
• Kim, S., Kwon, S., & Cook, D. (2000). Interactive Visualization of Hierarchical
Clusters Using MDS and MST. Metrika, 51(1), 39–51.
• Kohonen, T. (1995). Self-Organizing Maps. Berlin: Springer.
• Kontkanen, P., Lahtinen, J., Myllymäki, P., Silander, T., & Tirri, H. (2000).
Supervised Model-based Visualization of High-dimensional Data. Intelligent
Data Analysis, 4, 213-227.
141
References
• Kontkanen, P., Lahtinen, J., Myllymäki, P., & Tirri, H. (2000). Unsupervised
Bayesian Visualization of High-Dimensional Data. In R. Ramakrishnan, S. Stolfo,
R. Bayardo, & I. Parsa (Eds.), Proceedings of the Sixth International Conference
on Knowledge Discovery and Data Mining (pp. 325-329). New York, NY: The
Association for Computing Machinery.
• Lavine, M. L. (1999). What is Bayesian Statistics and Why Everything Else is
Wrong. The Journal of Undergraduate Mathematics and Its Applications, 20,
165-174.
• Lindley, D. V. (1971). Making Decisions. London: Wiley.
• Lindley, D. V. (2001). Harold Jeffreys. In C. C. Heyde & E. Seneta (Eds.), Statisticians of the Centuries (pp. 402-405). New York: Springer.
• Murphy, K. R., & Myors, B. (1998). Statistical Power Analysis. A Simple and
General Model for Traditional and Modern Hypothesis Tests. Mahwah, NJ:
Lawrence Erlbaum Associates.
142
References
• Myllymäki, P., Silander, T., Tirri, H., & Uronen, P. (2002). B-Course: A Web-Based Tool for Bayesian and Causal Data Analysis. International Journal on Artificial Intelligence Tools, 11(3), 369-387.
• Myllymäki, P., & Tirri, H. (1998). Bayes-verkkojen mahdollisuudet [Possibilities
of Bayesian Networks]. Teknologiakatsaus 58/98. Helsinki: TEKES.
• Neapolitan, R. E., & Morris, S. (2004). Probabilistic Modeling Using Bayesian
Networks. In D. Kaplan (Ed.), The SAGE handbook of quantitative methodology
for the social sciences (pp. 371-390). Thousand Oaks, CA: Sage.
• Nokelainen, P. (2008). Modeling of Professional Growth and Learning:
Bayesian Approach. Tampere: Tampere University Press.
• Nokelainen, P., & Ruohotie, P. (2009). Investigating Growth Prerequisites in a
Finnish Polytechnic for Higher Education. Journal of Workplace Learning, 21(1),
36-57.
143
References
• Nokelainen, P., Silander, T., Ruohotie, P., & Tirri, H. (2007). Investigating the
Number of Non-linear and Multi-modal Relationships Between Observed
Variables Measuring A Growth-oriented Atmosphere. Quality & Quantity,
41(6), 869-890.
• Nokelainen, P., & Tirri, K. (2007). Empirical Investigation of Finnish School
Principals' Emotional Leadership Competencies. In S. Saari & T. Varis (Eds.),
Professional Growth (pp. 424-438). Hämeenlinna: RCVE.
• Nokelainen, P., Ruohotie, P., & Tirri, H. (1999). Professional Growth
Determinants: Comparing Bayesian and Linear Approaches to Classification. In
P. Ruohotie, H. Tirri, P. Nokelainen, & T. Silander (Eds.), Modern Modeling of
Professional Growth, vol. 1 (pp. 85-120). Hämeenlinna: RCVE.
• Nokelainen, P., & Silander, T. (2014). Using New Models to Analyze Complex
Regularities of the World: Commentary on Musso et al. (2013). Frontline
Learning Research, 2(3), 78-82.
144
References
• Nokelainen, P., & Tirri, K. (2010). Role of Motivation in the Moral and Religious
Judgment of Mathematically Gifted Adolescents. High Ability Studies, 21(2),
101-116.
• Nokelainen, P., Tirri, K., Campbell, J. R., & Walberg, H. (2004). Cross-cultural
Factors that Account for Adult Productivity. In J. R. Campbell, K. Tirri, P.
Ruohotie, & H. Walberg (Eds.), Cross-cultural Research: Basic Issues, Dilemmas,
and Strategies (pp. 119-139). Hämeenlinna: RCVE.
• Nokelainen, P., Tirri, K., & Merenti-Välimäki, H.-L. (2007). Investigating the
Influence of Attribution Styles on the Development of Mathematical Talent.
Gifted Child Quarterly, 51(1), 64-81.
• Pylväs, L., Nokelainen, P., & Roisko, H. (2015). Modeling of Vocational
Excellence in Air Traffic Control. Journal of Workplace Learning.
• Tirri, H. (1997). Plausible Prediction by Bayesian Inference. Department of
Computer Science. Series of Publications A. Report A-1997-1. Helsinki:
University of Helsinki.
145
References
• Tukey, J. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
• Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S.
Fourth edition. New York: Springer.
146