Interpreting statistics


Unit VI. Image interpretation
MSc in Computational Sciences
Dr. Felipe Orihuela-Espina
Outline
• Interpreting statistics
• Causality
• Data mining
• Pattern recognition, machine learning
• Representation learning and manifold embedding
• Deep learning
• Knowledge representation and discovery
• Interpretation guidelines
Typical fMRI processing
Figure source: [Wellcome Trust; Tutorial on SPM]
Typical fNIRS processing
[Figure: fNIRS processing pipeline — raw signal, detrending, low-pass filtering (decimation), averaging; panels show the raw and the decimated and detrended signals.]
The three levels of analysis
• Data analysis often comprises 3 steps:
  • Processing: Output domain matches input domain
    • Preparation of data: data validation, cleaning, normalization, etc.
  • Analysis: Re-express data in a more convenient domain
    • Summarization of data: feature extraction, computation of metrics, statistics, etc.
  • Understanding: Abstraction to achieve knowledge generation
    • Interpretation of data: concept validation, re-expression in natural language, etc.
The three levels of analysis
Processing
• f: X→X such that X (domain) and X (co-domain) share the same space (even the semantics of the space)
• E.g.: apply a filter to a signal or image and you get another signal or image
Analysis
• f: X→Y such that X and Y do not share the same space (the dimensionality might be the same but the semantics may change)
• E.g.: apply a mask to a signal or image and you get the discontinuities, edges or a segmentation
Interpretation (a.k.a. Understanding)
• f: X→H such that H is (natural) language encoding domain knowledge
• E.g.: apply a model to a signal or image and you get some knowledge useful for a human expert
INTERPRETING STATISTICS
Inferential Statistics
• "If your experiment needs statistics, you ought to have done a better experiment."
Lord Ernest Rutherford of Nelson
New Zealander / British, 1871-1937
Father of nuclear physics
Discoverer of the proton
Nobel Prize in Chemistry, 1908
Quotations about statistical significance
• [BlandM1996] "Acceptance of statistics, though gratifying to the medical statistician, may even have gone too far. More than once I have told a colleague that he did not need me to prove that his difference existed, as anyone could see it, only to be told in turn that without the magic p-value he could not have his paper published."
• [Nicholls in KatzR2001] "In general, however, null hypothesis significance testing tells us little of what we need to know and is inherently misleading. We should be less enthusiastic about insisting on its use."
Quotations about statistical significance
• [Falk in KatzR2001] "Significance tests do not provide the information that scientists need, neither do they solve the crucial questions that they are characteristically believed to answer. The one answer that they do give is not a question that we have asked."
• [DuPrelJB2009] "Unfortunately, statistical significance is often thought to be equivalent to clinical relevance. Many research workers, readers, and journals ignore findings which are potentially clinically useful only because they are not statistically significant. At this point, we can criticize the practice of some scientific journals of preferably publishing significant results [...] ("publication bias")."
Quotations about statistical significance
• [GardnerMJ1986, co-authored by Altman] "...the use of statistics in medical journals has increased tremendously. One unfortunate consequence has been a shift in emphasis away from the basic results towards an undue concentration on hypothesis testing. In this approach data are examined in relation to a statistical "null" hypothesis, and the practice has led to the mistaken belief that studies should aim at obtaining "statistical significance". [...] The excessive use of hypothesis testing at the expense of other ways of assessing results has reached such a degree that levels of significance are often quoted alone in the main text and abstracts of papers, with no mention of actual concentrations, proportions, etc, or their differences. The implication of hypothesis testing – that there can always be a simple "yes" or "no" answer as the fundamental result from a medical study – is clearly false, and used in this way hypothesis testing is of limited value."
Modelling
• Deterministic model: values of the independent and controlled variables → values of the dependent variables
• Stochastic model: values of the independent and controlled variables → expectation of the dependent variables
Stochastic analysis
• In stochastic dependencies two closely related major analyses can be carried out:
  • Regression analysis
    • It determines the type of relation (linear, exponential, logarithmic, hyperbolic, etc.) between the variables
    • It produces an equation, a.k.a. model, describing the relation
  • Correlation analysis
    • It determines the degree and consistency of the in/dependence, or the degree of association, between the variables
    • It produces a single value summarizing the strength of the assumed relation
Regression analysis
• Regression analysis involves a number of statistical approaches for estimating relations between variables.
• Regression analysis is widely used for:
  A. Inference of relations between variables (modelling), and
  B. Prediction of new outcomes and observations (simulation)
Linear univariate regression (deterministic)

y = b + m·x, or in a more general notation, y = β0 + β1·x

• y: dependent variable; x: independent variable
• m (β1): slope; b (β0): intersection with the ordinate axis
• β0, β1 are the parameters
Linear univariate regression (stochastic)

Y = β0 + β1·x + ε

• β0 + β1·x: deterministic model; ε: uncertainty
• Together they form the stochastic model
Linear univariate regression (stochastic)

Yi = β0 + β1·xi + εi

• Expressing the uncertainty (error) explicitly for each observation.
• The error εi is the difference between the i-th observation and its expectation; in other words, the difference between the measurement and the real value: εi = Yi − E[Yi].
Linear univariate regression (stochastic)
• For j independent and controlled variables:

Y = β0 + β1·x1 + β2·x2 + … + βj·xj + ε

• This is known as the additive linear model
• It relates one dependent variable with j independent variables
• Note that the unknowns are the βi coefficients (a.k.a. parameters). Modelling consists of estimating these coefficients according to a certain criterion.
Linear univariate regression (stochastic)
• In general, for n cases, a full system of equations is generated:

Yi = β0 + β1·xi1 + β2·xi2 + … + βj·xij + εi, for i = 1…n
General linear model
• We can conveniently express the previous model using matrices:

Y = Xβ + ε

• where Y is the vector of observations, X the design matrix, β the vector of parameters and ε the vector of errors.
• The 1s in the first column of X are necessary for the intersection with the ordinate axis β0. Sometimes the model is presented without a constant term, and thus this column disappears.
General linear model

Y (n×1) = X (n×(j+1)) · β ((j+1)×1) + ε (n×1)
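For illustration, a minimal numpy sketch of the matrix formulation (synthetic data; all numbers are invented): it builds the design matrix X with its column of 1s and estimates β by least squares.

import numpy as np

# Simulated data: n cases, j independent variables (values illustrative)
rng = np.random.default_rng(0)
n, j = 100, 2
X = np.column_stack([np.ones(n),                   # column of 1s for the intercept beta_0
                     rng.normal(size=(n, j))])     # the j independent variables
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)  # Y = X·beta + epsilon

# Least-squares estimate of the (j+1)x1 parameter vector beta
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)                                    # close to [1.0, 2.0, -0.5]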
Covariance
• Covariance expresses the trend or tendency in the (linear) relation between the variables
  • If sXY > 0 ⇒ if X increases, Y increases
  • If sXY < 0 ⇒ if X increases, Y decreases
Figure from: [http://biplot.usal.es/ALUMNOS/BIOLOGIA/5BIOLOGIA/Regresionsimple.pdf]
Correlation coefficient
• The Pearson correlation coefficient is an index expressing the magnitude of the linear association between two quantitative random variables*, and corresponds to the normalization of the covariance:

r = sXY / (sX·sY)

• where sXY is the covariance and sX, sY are the standard deviations.
* For a formal definition of random variable, please check my slides of the course Introduction to Statistics.
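As a quick numeric illustration (synthetic data), normalizing the covariance by the standard deviations reproduces what np.corrcoef computes:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)        # y linearly related to x plus noise

s_xy = np.cov(x, y)[0, 1]                            # covariance sXY
r = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))   # normalize by sX and sY
print(r, np.corrcoef(x, y)[0, 1])                    # both give the same Pearson r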
Correlation coefficient
Figure from: [en.wikipedia.org]
Correlation coefficient
Beware! This table is out of date. Some of the cells marked as "not developed" ("no desarrollados") are already available. I didn't have the time to update the table.
Table from: [http://pendientedemigracion.ucm.es/info/mide/docs/Otrocorrel.pdf]
Adjustment
• Coefficient of determination R2:
  • A key output of the regression analysis; it represents the proportion of the variance in the dependent variable that is predictable from the independent variable.
  • The coefficient of determination is NOT the linear correlation coefficient r (that's Pearson), but as you can imagine it is closely related
  • Yep! You guessed it: one is the square of the other, R2 = r2.
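A small synthetic check that R2 equals r2 for simple linear regression (illustrative data only):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(size=100)

# Fit y = b0 + b1*x by least squares
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# R^2 = 1 - residual variance / total variance
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)   # numerically equal for simple linear regression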
Adjustment
Figure from: [Wolfram MathWorld]
Hypothesis testing
• Considered the father of inferential statistics
  • The creator of ANOVA among other models
• Worked at Cambridge and UCL; he was a member of the Royal Society
  • He actually succeeded Pearson in his chair at UCL
• As the genius he was, he also worked and achieved recognition for his contributions to many other fields: mathematics, evolutionary biology, genetics, etc.
  • In fact, he is also the father of population genetics, describing evolutionary phenomena as a function of the variation and distribution of allelic frequency
  • He further found the usefulness of Latin squares to improve experimental designs for agriculture.
Sir Ronald Aylmer Fisher (1890-1962), British
A biography and some links:
http://www-history.mcs.st-andrews.ac.uk/Biographies/Fisher.html
Null and Alternative Hypothesis
• Statistical testing is used to accept/reject hypotheses
  • Null hypothesis (H0): There is no difference or relation, and any observed difference is due to chance
    • H0: μ1 = μ2
  • Alternative hypothesis (Ha): There is a difference or relation unlikely to be attributable to chance.
    • Ha: μ1 ≠ μ2
• Example:
  • Research question: Are men taller than women?
  • Null hypothesis: There is no height difference between genders
  • Alternative hypothesis: Gender makes a difference in height.
Hypothesis Type / Directionality:
One-tail vs Two-tail
• One-tailed: Used for directional hypothesis testing
  • Alternative hypothesis: There is a difference and we anticipate the direction of that difference
    • Ha: μ1 < μ2
    • Ha: μ1 > μ2
• Two-tailed: Used for non-directional hypothesis testing
  • Alternative hypothesis: There is a difference but we do not anticipate the direction of that difference
    • Ha: μ1 ≠ μ2
• Example:
  • Research question: Are men taller than women?
  • Null hypothesis: There is no height difference between genders
  • Alternative hypothesis:
    • One tail: Men are taller than women
    • Two tail: One gender is taller than the other.
[Figures from: http://www.mathsrevision.net/alevel/pages.php?page=64]
Significance Level (α) and test power (1−β)

Decision \ Reality     | H0 true / Ha false   | H0 false / Ha true
Accept H0; Reject Ha   | Ok (p = 1−α)         | Type II Error (β)
Reject H0; Accept Ha   | Type I Error (p = α) | Ok (power = 1−β)

• The probability of making Type I Errors can be decreased by altering the level of significance (α)
  • Unfortunately, this in turn increments the risk of Type II Errors…
  • …and vice versa
• The decision on the significance level should be made (not arbitrarily but) based on the type of error we want to reduce.
Figure from: [http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/reference/reference_manual_02.html]
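If statsmodels is available, its power utilities can put numbers on the α/β trade-off; a hedged sketch (effect size and sample size are invented for illustration):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.01, 0.001):
    # Power of a two-sample t-test at a medium effect size, n=50 per group
    power = analysis.power(effect_size=0.5, nobs1=50, alpha=alpha)
    print(alpha, 1 - power)   # beta (Type II error risk) grows as alpha shrinks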
Hypothesis Type / Directionality:
One-tail vs Two-tail
• Hypothesis directionality affects statistical power
  • One-tail tests provide more statistical power to detect an effect
• Choosing a one-tailed test for the sole purpose of attaining significance is not appropriate. You may lose the difference in the other direction!
• Choosing a one-tailed test after running a two-tailed test that failed to reject the null hypothesis is not appropriate.
Source: [http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests.htm]
Figure from: [http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/reference/reference_manual_02.html]
Independence of observations:
Paired vs Unpaired
• Paired: There is a one-to-one (bijective) correspondence between the samples of the groups
  • If samples in one group are reorganised then so should samples in the other.
  • Examples:
    • Randomized block experiments with two units per block
    • Studies with individually matched controls
    • Repeated measurements on the same individual
• Unpaired: There is no correspondence between the samples of the groups.
  • Samples in one group can be reorganised independently of the other
• Pairing is a strategy of design, not analysis (pairing occurs before data collection!). Pairing is used to reduce bias and increase precision
• Example of paired data: N sets of twins, to know if the 1st born is more aggressive than the second

Twin Pair | Aggressiveness score (1st born) | Aggressiveness score (2nd born)
1         | 86 | 88
2         | 71 | 77
3         | 77 | 76
…         | …  | …
N         | 87 | 72

Example adapted from [DinovI2005]
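A hedged sketch of the paired analysis on the three complete twin pairs shown in the table above, using scipy.stats (ttest_rel for paired data, ttest_ind for the unpaired version, shown only for contrast):

from scipy import stats

first_born = [86, 71, 77]
second_born = [88, 77, 76]

# Paired test: each twin pair is one observation of the difference
t, p = stats.ttest_rel(first_born, second_born)
print(t, p)

# The unpaired version ignores the pairing and is inappropriate here
t_u, p_u = stats.ttest_ind(first_born, second_born)
print(t_u, p_u)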
Parametric vs non-parametric
• Parametric testing: Assumes a certain distribution of the variable in the population to which we plan to generalize our data
• Non-parametric testing: No assumption regarding the distribution of the variable in the population
  • That is distribution free, NOT ASSUMPTION FREE!!
  • Non-parametric tests look at the rank order of the values
  • Parametric tests are more powerful than non-parametric ones and so should be used if possible
[GreenhalghT 1997 BMJ 315:364]
Source: 2.ppt (Author unknown)
One way, two way,… N-way analysis
• Experimental design may be one-factorial, two-factorial,… N-factorial
  • i.e. one research question at a time, two research questions at a time, … N research questions at a time.
  • The more ways, the more difficult the interpretation of the analysis
• One-way analysis measures the significance of effects of one factor only.
• Two-way analysis measures the significance of effects of two factors simultaneously.
• Etc…
Steps to apply a significance test
1. Define a hypothesis
2. Collect data
3. Determine the test to apply
4. Calculate the test value (t, F, χ2) and re-express it as a probability p
5. Accept/Reject the null hypothesis based on degrees of freedom and significance threshold
[GurevychI2011]
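A hedged end-to-end sketch of steps 3-5 with scipy.stats on synthetic height data (the means, spreads and sample sizes are invented):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
men = rng.normal(175, 7, size=40)     # hypothetical heights, cm
women = rng.normal(165, 7, size=40)

# Two-sample t-test (two-tailed): H0 is mu1 == mu2
t, p = stats.ttest_ind(men, women)

alpha = 0.05
print(f"t={t:.2f}, p={p:.4f}",
      "reject H0" if p < alpha else "fail to reject H0")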
Which test to apply?
• Selecting the right test depends on several aspects of the data:
  • Sample count (Low < 30; High > 30)
  • Independence of observations (Paired, Unpaired)
  • Number of groups or datasets to be compared
  • Data types (Numerical, categorical, etc.)
  • Assumed distributions
  • Hypothesis type (One-tail, Two-tail).
[GurevychI2011]
Which test to apply?

Independent Variable                     | Dependent Variable   | Test                                  | Statistic
1 population (N/A)                       | 1, Continuous normal | One-sample t-test                     | Mean
2 independent populations (2 categories) | 1, Normal            | Two-sample t-test                     | Mean
2 independent populations (2 categories) | 1, Non-normal        | Mann-Whitney, Wilcoxon rank sum test  | Median
2 independent populations (2 categories) | 1, Categorical       | Chi-square test, Fisher's exact test  | Proportion
3 or more populations (categorical)      | 1, Normal            | One-way ANOVA                         | Means
…                                        | …                    | …                                     | …

More complete tables can be found at:
• http://www.ats.ucla.edu/stat/mult_pkg/whatstat/choosestat.html
• http://bama.ua.edu/~jleeper/627/choosestat.html
• http://www.bmj.com/content/315/7104/364/T1.expansion.html
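A hedged sketch of this selection logic for the two-group rows of the table (a simplification; the normality check, threshold and data are all illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.normal(size=35)
b = rng.exponential(size=35)   # clearly non-normal

def compare_two_groups(x, y, alpha=0.05):
    # Shapiro-Wilk normality test on each group
    _, px = stats.shapiro(x)
    _, py = stats.shapiro(y)
    if px > alpha and py > alpha:
        return "two-sample t-test", stats.ttest_ind(x, y)
    return "Mann-Whitney U", stats.mannwhitneyu(x, y)

print(compare_two_groups(a, b))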
CAUSALITY
Cogito ergo sum
• Cause: Cogito → Effect: Sum
Causation defies (1st level) logic…
• Input:
  • "If the floor is wet, then it rained"
  • "If we break this bottle, the floor will get wet"
• Logic output:
  • "If we break this bottle, then it rained"
Example taken from [PearlJ1999]
Why is causality so problematic?
A very silly example
• It cannot be computed from the data alone
• Systematic temporal precedence is not sufficient
• Co-occurrence is not sufficient
• It is not always a direct relation (indirect relations, transitivity/mediation, etc. may be present), let alone linear…
• It may occur across frequency bands
• YOU NAME IT HERE…
Which process causes which?
Causality is so difficult that "it would be very healthy if more researchers abandoned thinking of and using terms such as cause and effect" [Muthen1987 in PearlJ2011]
Causality requires time/order!
• "…there is little use in the practice of attempting to discuss causality without introducing time" [Granger, 1969]
  • …whether philosophical, statistical, econometrical, topological, etc…
• Actually, "time" is NOT necessarily to be strictly understood in the chronological sense (although most times it is). "Time" here means a mathematical relation of order in a set.
  • Note that the use of order in a set is close to Lamport's causality [Lamport L (1978) Comm. ACM, 21(7):558-565].
• Also, in topological causality this chronological causality is often referred to as "timelike" to indicate that it lies along the negative signed dimension of the Minkowski space
Causality requires directionality/context!
• Algebraic equations, e.g. regression, "do not properly express causal relationships […] because algebraic equations are symmetrical objects […] To express the directionality of the underlying process, Wright augmented the equation with a diagram, later called path diagram, in which arrows are drawn from causes to effects" [PearlJ2009]
• Feedback and instantaneous causality in any case are a double causation.
• In topological causality this is referred to as "non-spacelike" causality
A real example [OrihuelaEspinaF2010]
[Figure: an ECG trace]
[KaturaT2006] only claims that there are interrelations (quantified using MI)
Statistical dependence
• Statistical dependence is a type of relation between any two variables [WermuthN1998]: if we find one, we can expect to find the other
[Figure: a spectrum ranging from statistical independence, through association (symmetric or asymmetric), to deterministic dependence]
• The limits of statistical dependence
  • Statistical independence: The distribution of one variable is the same no matter at which level changes occur in the other variable
    • X and Y are independent ⟺ P(X∩Y) = P(X)P(Y)
  • Deterministic dependence: Levels of one variable occur in an exactly determined way with changing levels of the other.
• Association: Intermediate forms of statistical dependency
  • Symmetric
  • Asymmetric (a.k.a. response) or directed association
Associational Inference ≡ Descriptive Statistics!!!
• The most detailed information linking two variables is given by the joint distribution:
P(X=x, Y=y)
• The conditional distribution describes how the values of X change as Y varies:
P(X=x|Y=y) = P(X=x, Y=y) / P(Y=y)
• Associational statistics is simply descriptive (estimates, regressions, posterior distributions, etc…) [HollandPW1986]
• Example: The regression of X on Y is the conditional expectation E(X|Y=y)
Statistical dependence vs Causality
• Statistical dependence provides associational relations and can be expressed in terms of a joint distribution alone
• Causal relations CANNOT be expressed in terms of statistical association alone [PearlJ2009]
• Associational inference ≠ Causal inference [HollandPW1986, PearlJ2009]
  • …ergo, statistical dependence ≠ causal inference
• In associational inference, time is merely operational
Regression and Correlation;
two common forms of associational inference
• Regression Analysis: "the study of the dependence of one or more response variables on explanatory variables" [CoxDR2004]
• Correlation is a relation over mean values; two variables correlate as they move over/under their mean together (correlation is a "normalization" of the covariance)
Regression and Correlation;
two common forms of associational inference
• Correlation ≠ Statistical dependence
  • If X and Y are statistically independent then r=0 (absence of correlation), but the opposite is not true [MarrelecG2005].
• Correlation ≠ Causation [YuleU1900 in CoxDR2004, WrightS1921]
  • Yet, causal conclusions from a carefully designed (often synonym of randomized) experiment are often (not always) valid [HollandPW1986, FisherRA1926 in CoxDR2004]
  • Strong regression ≠ causality [Box1966]
  • Prediction systems ≠ causal systems [CoxDR2004]
Coherence:
yet another common form of associational inference
• Coherence: Often understood as "correlation in the frequency domain"

Cxy = |Gxy|2 / (Gxx·Gyy)

  • where Gxy is the cross-spectral density,
  • i.e. coherence is the ratio between the squared magnitude of the cross-spectral density and the product of the auto-spectral densities.
• Coherence measures the degree to which two series are related
• Coherence alone does not imply causality! The temporal lag of the phase difference between the signals must also be considered.
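A hedged sketch of magnitude-squared coherence between two noisy sinusoids using scipy.signal.coherence (Welch's method; the frequencies, noise levels and sampling rate are invented):

import numpy as np
from scipy import signal

fs = 250.0                               # sampling rate, Hz
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(5)
x = np.sin(2 * np.pi * 10 * t) + rng.normal(scale=0.5, size=t.size)
y = np.sin(2 * np.pi * 10 * t + 0.7) + rng.normal(scale=0.5, size=t.size)

f, Cxy = signal.coherence(x, y, fs=fs)   # Cxy = |Gxy|^2 / (Gxx * Gyy)
print(f[np.argmax(Cxy)])                 # highest coherence near 10 Hz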
From association to causation
• Barriers between classical statistics and causal analysis [PearlJ2009]
  1. Coping with untested assumptions and changing conditions
  2. Inappropriate mathematical notation
Causality
[Margin annotation: gradient from stronger to weaker]
• Zero-level causality: a statistical association, i.e. non-independence, which cannot be removed by conditioning on allowable alternative features.
  • e.g. Granger's, topological
• First-level causality: Use of a treatment over another causes a change in outcome
  • e.g. Rubin's, Pearl's
• Second-level causality: Explanation via a generating process, provisional and hardly lending itself to formal characterization, either merely hypothesized or solidly based on evidence
  • e.g. Suppes', Wright's path analysis
  • e.g. Smoking causes lung cancer
Inspired from [CoxDR2004]
It is debatable whether second-level causality is indeed causality
Variable types and their joint probability distribution
• Variable types:
  • Background variables (B) – specify what is fixed
  • Potential causal variables (C)
  • Intermediate variables (I) – surrogates, monitoring, pathways, etc.
  • Response variables (R) – observed effects
• Joint probability distribution of the variables:

P(RICB) = P(R|ICB) · P(I|CB) · P(C|B) · P(B)

…but it is possible to integrate over I (marginalize):

P(RCB) = P(R|CB) · P(C|B) · P(B)

In [CoxDR2004]
Granger's Causality
• Granger's causality:
  • Y is causing X (Y→X) if we are better able to predict X using all available information (Z) than if the information apart from Y had been used.
• The groundbreaking paper:
  • Granger "Investigating causal relations by econometric models and cross-spectral methods" Econometrica 37(3): 424-438
• Granger's causality is only a statement about one thing happening before another!
  • It rejects instantaneous causality → considered as slowness in the recording of information
Sir Clive William John Granger (1934-2009) – University of Nottingham – Nobel Prize winner
Granger's Causality
• "The future cannot cause the past" [Granger 1969]
  • "the direction of the flow of time [is] a central feature"
  • Feedback is a double causation; X→Y and Y→X, denoted X⇄Y
• "causality…is based entirely on the predictability of some series…" [Granger 1969]
  • Causal relationships may be investigated in terms of coherence and phase diagrams
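A hedged sketch of a Granger causality test with statsmodels (grangercausalitytests checks whether the second column helps predict the first; the simulated coupling is invented):

import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(6)
n = 500
y = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.6 * y[t - 1] + rng.normal(scale=0.5)   # x driven by the past of y

data = np.column_stack([x, y])      # test: does y Granger-cause x?
res = grangercausalitytests(data, maxlag=2, verbose=False)
print(res[1][0]["ssr_ftest"])       # F statistic and p-value at lag 1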
Topological causality
• "A causal manifold is one with an assignment to each of its points of a convex cone in the tangent space, representing physically the future directions at the point. The usual causality in M0 extends to a causal structure in M'." [SegalIE1981]
• Causality is seen as embedded in the geometry/topology of manifolds
  • Causality is a curve function defined over the manifold
• The groundbreaking book:
  • Segal IE "Mathematical Cosmology and Extragalactic Astronomy" (1976)
• The father of causal manifolds is likely to be Lorentz. Nevertheless, Segal's contribution to the field of causal manifolds is simply overwhelming…
Irving Ezra Segal (1918-1998) Professor of Mathematics at MIT
Causal (homogeneous Lorentzian) Manifolds: The topological view of causality
• The cone of causality [SegalIE1981, RainerM1999, MosleySN1990, KrymVR2002]
[Figure: the causal cone — future, instant present, past]
Causal (homogeneous Lorentzian) Manifolds: The topological view of causality
• A relation of causality between the points of a pseudo-Riemannian manifold may be [Kronheimer and Penrose, 1967, Proc. Camb. Phil. Soc. 63:481-501]:
  • Horismos: meaning that y lies on the causal cone
  • Chronological or timelike: meaning that y lies inside the causal cone
  • Non-spacelike (sometimes referred to simply as causal): meaning that y lies not outside the causal cone
Rubin Causal Model
• Rubin Causal Model:
  • "Intuitively, the causal effect of one treatment relative to another for a particular experimental unit is the difference between the result if the unit had been exposed to the first treatment and the result if, instead, the unit had been exposed to the second treatment"
• The groundbreaking paper:
  • Rubin "Bayesian inference for causal effects: The role of randomization" The Annals of Statistics 6(1): 34-58
• The term Rubin causal model was coined by his student Paul Holland
Donald B. Rubin (1943–) – John L. Loeb Professor of Statistics at Harvard
Rubin Causal Model
• Causality is an algebraic difference:
  the treatment causes the effect Ytreatment(u) − Ycontrol(u)
  …or in other words, the effect of a cause is always relative to another cause [HollandPW1986]
• The Rubin causal model establishes the conditions under which associational (e.g. Bayesian) inference may infer causality (it makes the assumptions for causality explicit).
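A hedged numeric sketch of the idea: with synthetic potential outcomes and randomized assignment, the difference of group means recovers the causal effect (all numbers are invented):

import numpy as np

rng = np.random.default_rng(7)
n = 1000
y_control = rng.normal(10, 2, size=n)       # Y_control(u)
y_treatment = y_control + 1.5               # Y_treatment(u): true effect = 1.5

# We only ever observe ONE of the two outcomes per unit
assigned = rng.integers(0, 2, size=n).astype(bool)   # randomized assignment
observed_t = y_treatment[assigned]
observed_c = y_control[~assigned]

# Randomization lets the difference of group means estimate the causal effect
print(observed_t.mean() - observed_c.mean())         # close to 1.5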
Fundamental Problem of Causal Inference
• Only Ytreatment(u) or Ycontrol(u) can be observed on a phenomenon, but not both.
• Causal inference is impossible without making untested assumptions
  • …yet causal inference is still possible under uncertainty [HollandPW1986] (two otherwise identical populations u must be prepared and all appropriate background variables must be considered in B).
• Again! Causal questions cannot be computed from the data alone, nor from the distributions that govern the data [PearlJ2009]
Relation between Granger, Rubin and Suppes causalities

                                | Granger | Rubin's model
Cause (Treatment)               | Y       | t
Effect                          | X       | Ytreatment(u)
All other available information | Z       | Z (pre-exposure variables)

• Granger's noncausality:
  • X is not a Granger cause of Y (relative to the information in Z) ⟺ X and Y are conditionally independent (i.e. P(Y|X,Z) = P(Y|Z))
  • Granger's noncausality is equal to Suppes' spurious case
Modified from [HollandPW1986]
Pearl's statistical causality (a.k.a. structural theory)
• "Causation is encoding behaviour under intervention […] Causality tells us which mechanisms [stable functional relationships] is to be modified [i.e. broken] by a given action" [PearlJ1999_IJCAI]
• Causality, intervention and mechanisms can be encapsulated in a causal model
• The groundbreaking book:
  • Pearl J "Causality: Models, Reasoning and Inference" (2000)*
• Pearl's results do establish conditions under which first-level causal conclusions are possible [CoxDR2004]
[PearlJ2000, Lauritzen2000, DawidAP2002]
* With permission of his 1995 Biometrika paper masterpiece
Judea Pearl (1936–) Professor of computer science and statistics at UCLA
Sewall Green Wright (1889-1988) – Father of path analysis (graphical rules)
Statistical causality
• Conditioning vs Intervening [PearlJ2000]
  • Conditioning: P(R|C) = ΣB P(R|CB)·P(B|C) → useful but inappropriate for causality, as changes in the past (B) occur before the intervention (C)
  • Intervention: P(R║C) = ΣB P(R|CB)·P(B) → Pearl's definition of causality
• Underlying assumption: The distribution of R (and I) remains unaffected by the intervention.
  • Watch out! This is not trivial → serious interventions may distort all relations [CoxDR2004]
• βCB = 0 (structural coefficient) ⟺ C ╨ B (conditional independence) ⟹ P(R|C) = P(R║C)
  • i.e. if C and B are independent there is no difference between conditioning and intervention
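A hedged numeric sketch of the two formulas above, with binary B, C, R (all probabilities invented). Because B influences both C and R, conditioning and intervening give different answers:

P_B = {0: 0.5, 1: 0.5}                                  # P(B)
P_C_given_B = {0: 0.9, 1: 0.2}                          # P(C=1 | B=b)
P_R_given_CB = {(0, 0): 0.1, (0, 1): 0.3,               # P(R=1 | C=c, B=b)
                (1, 0): 0.6, (1, 1): 0.8}

c = 1  # we look at C = 1

# P(B=b | C=1) by Bayes' rule
pc = sum(P_C_given_B[b] * P_B[b] for b in (0, 1))
P_B_given_C = {b: P_C_given_B[b] * P_B[b] / pc for b in (0, 1)}

# Conditioning: P(R=1 | C=1) = sum_B P(R=1|CB) P(B|C)
conditioning = sum(P_R_given_CB[(c, b)] * P_B_given_C[b] for b in (0, 1))

# Intervening: P(R=1 || C=1) = sum_B P(R=1|CB) P(B); B unaffected by do(C)
intervening = sum(P_R_given_CB[(c, b)] * P_B[b] for b in (0, 1))

print(conditioning, intervening)   # roughly 0.64 vs 0.70: they differ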
DATA MINING
Initial definitions
• In a conditional probability P(x|y), the probabilities P(y) are called the priors.
• The likelihood function is the probability of the evidence given the parameters, i.e. the model: p(x|θ).
• The posterior probability is the probability of the parameters, i.e. the model, given the evidence: p(θ|x).
Initial definitions
• Factors of variation: Aspects of the data that can vary separately.
  • i.e. the intrinsic dimensionality of the manifold
• Computational element or unit: A mathematical function or block that can be reused to express more complex mathematical functions.
  • Examples: basic logic gates (AND, OR, NOT), artificial neurons, decision trees, etc.
  • Fan-in: Maximum number of inputs of a particular element
Initial definitions
• System or computational model: A set of interconnected computational elements, at times represented by a graph.
  • Size of a system: Number of elements in the system.
  • Important to justify deep learning is the observation that reorganizing the way in which computational units are composed or connected can have a drastic effect on the efficiency of representation size [BengioY2009, pg 19].
• Types or classes of models:
  • Generative models: Models for randomly generating observable data, P(X,Y). These include HMMs, GMMs, restricted Boltzmann machines, etc.
  • Discriminative or conditional models: Models for capturing the dependence of an unobserved variable Y on an observed variable X, P(Y|X). These include linear discriminant analysis, SVM, linear regressors, ANN, ...
Posterior probability
• Using Bayes' rule:

p(θ|x) = [p(x|θ)·p(θ)] / p(x)

• ...which can be "re-expressed" for easy remembering as the directly proportional (∝) relation:
  • Posterior probability ∝ Likelihood ✕ Prior probability
• …or in other words, since the joint distribution p(x,θ) = p(x|θ)·p(θ), then
  • Posterior probability ∝ Joint distribution
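A tiny numeric sketch of Bayes' rule with two candidate models, θ ∈ {0, 1} (the prior and likelihood values are invented):

prior = {0: 0.5, 1: 0.5}                 # p(theta)
likelihood = {0: 0.2, 1: 0.7}            # p(x | theta) for the observed x

evidence = sum(likelihood[t] * prior[t] for t in prior)         # p(x)
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}
print(posterior)   # {0: 0.22.., 1: 0.77..}: posterior ∝ likelihood × prior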
Posterior probability
• From the above (i.e. the previous slide), two basic approximations for estimating posterior probabilities follow [ResnikP2010]:
  • The maximum likelihood estimation (MLE), which amounts to counting and then normalizing so that the probabilities sum to 1:
    • MLE produces the choice most likely to have generated the observed data.
  • The maximum a posteriori (MAP) estimation
    • The MAP estimate is the choice that is most likely given the observed data.
Posterior probability
• Both MLE and MAP give us the best estimate according to their respective definitions of "best".
• In contrast to MLE, MAP estimation applies Bayes's rule, so that our estimate can take into account prior knowledge about what we expect θ to be in the form of a prior probability distribution P(θ).
• Neither MLE nor MAP gives a whole distribution P(θ|x).
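A hedged sketch contrasting MLE and MAP for a coin's heads probability θ (the counts and the Beta prior are invented; the closed-form MAP used here is the standard Beta-Bernoulli result):

# Observed data: 7 heads, 3 tails
heads, tails = 7, 3

# MLE: count and normalize
theta_mle = heads / (heads + tails)                          # 0.7

# MAP with a Beta(2, 2) prior (a mild belief that the coin is fair)
a, b = 2, 2
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)    # 8/12 = 0.666...

print(theta_mle, theta_map)   # the prior pulls the MAP estimate toward 0.5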
Patterns
• Patterns are regularities in data. [Wikipedia:Pattern_recognition]
• Patterns refer to models (regression or classification) or components of models (e.g. a linear term in a regression) [FayyadU1996, pg 51]
[Fayyad et al (1996) AI magazine Fall:37-54, >6500 citations!]
Data mining
• Data mining is:
  • "the application of specific algorithms for extracting patterns from data." [FayyadU1996]
  • "the computational process of discovering patterns in large data sets" [Wikipedia:Data_mining]
  • the analysis step of the "Knowledge Discovery in Databases" (KDD) process [FayyadU1996, Wikipedia:Data_mining]
Data mining — different names for the same thing?
• Data mining: Discovering patterns in large data sets [Wikipedia:Data_mining]
• Pattern recognition:
  • Recognition of regularities (patterns) in data [Wikipedia:Pattern_recognition]
  • Data-driven classification [JainAK2000]
  • Nearly synonymous with machine learning [Wikipedia:Pattern_recognition]
• Machine learning:
  • Construction and study of algorithms that can learn (act of acquiring new knowledge) from data
  • Often overlaps with computational statistics [Wikipedia:Machine_learning]
• Knowledge discovery:
  • Data-driven discovery of knowledge
  • It adds processing (cleaning, selection) steps to data mining [FayyadU1996]
Data mining
[Figure slides from Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!, and Fayyad et al (1996) AI magazine Fall:37-54, >6500 citations!]
Data mining
• Classification is strongly related to regression [FayyadU1996]:
  • Regression is learning a function that maps a data item to a real-valued prediction variable.
  • Classification is learning a function that maps (classifies) a data item into one of several predefined classes.
Learning
• Goal:
  • The objective of learning in AI is giving computers the ability to understand our world in terms of inferring semantic concepts and relationships among these concepts.
• Scope:
  • Single task: Observations come from a single task
  • Multi-task: Observations come from several tasks at once
Types of learning
• Supervised: Relies on known (labelled) examples, a.k.a. the training set, to find a discrete regressor
• Unsupervised: Finds regularities and structures (i.e. fits probability distributions) to observations
• Reinforced: Updates the currently learned model based on rewards assessing its outputs
• Semi-supervised: From an initially learned supervised model, it evolves unsupervisedly by generating synthetic "rewards" proportional to the likelihood of the new observations.
  • Active: A particular case of semi-supervised learning in which the new observations are chosen or selected from all arriving new observations according to a certain criterion.
  • Transfer: A particular case of semi-supervised learning in which new observations come from a new domain or task.
Basic problems in learning
• Modelling: It refers to encoding dependencies between variables under a given chosen form.
  • In fact, modelling per se just refers to choosing this form, and in its most minimalistic case it does not require the model to be representative of the phenomenon, explicative nor predictive! It may be just nuts, a silly model!
• Learning: It refers to optimizing the parameters of the model by minimizing the loss functional, i.e. a particular criterion, e.g. least squares error.
• Inference or reconstruction: It refers to estimating posterior probabilities of hidden variables given observed ones, P(h|x) or h = f(x)
Data mining
[A sequence of figure slides from Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!, covering among others:]
• Feature selection vs feature extraction
• Clustering [Fayyad et al (1996) AI magazine Fall:37-54, >6500 citations!]
Optimizing model selection
• Assumes LTI
• Xi,pre → Combination of preprocessing methods
• Y1…Npre → Hyperparameters for preprocessing
• Xi,fs → Feature selection method
• Y1…Nfs → Hyperparameters for feature selection
• Xi,class → Classifier method
• Y1…Nclass → Hyperparameters for classification
[EscalanteHJ2009]
Data mining
• "Overfitting: When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies." [FayyadU1996]
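A hedged sketch of detecting overfitting via cross-validation with scikit-learn (the dataset and the unconstrained tree are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# An unconstrained tree can memorise the training set (fits the noise too)
tree = DecisionTreeClassifier(random_state=0)
print(tree.fit(X, y).score(X, y))                  # training accuracy: 1.0
print(cross_val_score(tree, X, y, cv=5).mean())    # held-out accuracy: lower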
REPRESENTATION LEARNING
AND MANIFOLD EMBEDDING
Representation learning and manifold embedding
• A manifold is a topological space* that is locally Euclidean.
  • The concept of manifold is the generalisation of the traditional Euclidean (linear) space to adapt to non-Euclidean topologies.
  • Note that "locally Euclidean" does not mean that it is constrained to a Euclidean metric globally, but only that it is locally homeomorphic to a Euclidean space.
• In other words, a manifold is a k-dimensional object placed in an n-dimensional ambient space
  • A k-dimensional manifold is a submanifold with k degrees of freedom, i.e. one that can be described with only k coordinates
* Remember: a topological space is a set with a topology (structure). A topology is a set of subsets of the original space that satisfy the following axioms: (i) the empty set and the set itself are in the topology, (ii) the union of an arbitrary collection of sets in the topology is also in the topology, and (iii) the intersection of a finite collection of sets in the topology is also in the topology.
Representation learning and manifold embedding
• If the manifold is infinitely differentiable then it is called a smooth manifold.
• A smooth manifold with a metric imposed to induce the topology is called a Riemannian manifold.
• A submanifold is a subset of a manifold which is itself a manifold.
[Wolfram, World of Maths] [Carreira-Perpiñán, 1997]
Representation learning and manifold embedding
• A homeomorphism is a continuous bijective transformation between topological spaces X and Y:

f: X→Y

  • The fact that it is continuous means that points which are close in X are also close in Y, and points which are far in X are also far in Y.
  • The fact that it is bijective (or 1-to-1) means that it is injective and surjective, and also implies that there exists the inverse

f-1: Y→X

  • If the homeomorphism is differentiable, i.e. if the derivative and its inverse exist, then it is called a diffeomorphism.
Representation learning and manifold embedding
• An embedding is a map f: X→Y such that f is a diffeomorphism from X to f(X), and f(X) is a smooth submanifold of Y.
• An embedding is the representation of a topological object (e.g. a manifold, graph, lattice, etc.) in a certain (sub-)space so that its topology is preserved.
  • In particular, for manifolds, it preserves the open sets in the underlying topology T.
[Roweis, 2000] [Maaten, 2007] [Bonatti, 2006]
Representation learning and manifold embedding
• Summarizing…
  • A manifold is any object which is locally linear (flat).
  • An embedding is a function from one space to another such that the topology (shape) is preserved through deformations (twisting and stretching)
• Ergo…
  • Manifold embedding refers to the transformation of your data whilst ensuring you do not alter the intrinsic relations among the observations.
Manifold Embedding: Nomenclature
• Manifold embedding is also called
  • Manifold learning [Souvernir 2005]
  • Multivariate data projection [[Mao, 1995] in Demartines, 1997], or simply projection [Venna2007]
  • Data embedding [Yang2004]
  • Representation learning [BengioY2010]
• The origin space is sometimes called:
  • High dimensional (input) space [Tenenbaum, 2000][Demartines, 1997][Venna2007]
  • Vector space [Roweis, 2000][Sammon, 1969][Brand, 2003]
  • Data space [Souvernir 2005]
  • Observation space [Silva 2002]
  • Domain space [Yang 2004, 2005]
  • Feature space (usually in the context of pattern recognition and analysis)
• The destination space is usually more consistently called
  • Low-dimensional space
  • But other names include output space [Demartines, 1997][Venna2007]
  • …and I personally like: embedding space [Leff, 2007]
Manifold Embedding
• Dimensionality reduction is a particular case of manifold embedding, in which the dimension of the destination space is lower than that of the original data space
  • Domain-specific data are often distributed on (lie on, or close to) a low dimensional manifold in a high dimensional space [Yang, 2004]
  • Topology or structure is retained/preserved if the pairwise distances in the low dimensional space approximate the corresponding pairwise distances in the feature space. [Sammon, 1969]
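For a feel of the difference between a linear projection and a non-linear embedding, a hedged sketch on the classic swiss-roll manifold with scikit-learn (the sample size is arbitrary):

from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=1000, random_state=0)  # 3-D ambient space, 2-D manifold

X_pca = PCA(n_components=2).fit_transform(X)     # linear projection: the roll stays folded
X_iso = Isomap(n_components=2).fit_transform(X)  # preserves geodesic distances: "unrolls" it

print(X_pca.shape, X_iso.shape)                  # both (1000, 2)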
Manifold Embedding
• Variants
  • Multiple manifold embedding
    • Data lie in more than 1 manifold
  • Multi-class manifold embedding
    • Data lie in a single manifold, but the sampling contains large gaps, perhaps even fragmenting connected components
Manifold embedding
• The intrinsic dimensionality (ID) of a manifold has been defined as "the number of independent variables that explains satisfactorily" that manifold.
• Determination of the ID eliminates the possibility of over- or under-fitting.
• Since it is always possible to find a manifold of any dimension which passes through all points in a data set given enough parameters, the problem of estimating the ID of a dataset is ill-posed in the Hadamard sense
  • Note that this is the case of interpolation, which finds a 1-D curve to fit a dataset!
[Carreira-Perpiñán, 1997] Figure modified from [Carreira-Perpiñán, 1997]
Manifold embedding
• Topological dimension is the "local" dimensionality at every point
  • i.e. the dimension of the tangent space
• The topological dimension is a lower bound of the ID
• Example: Sphere:
  • ID: 3
  • Topological dimension: 2 (at every point the sphere can be approximated by a surface)
[Camastra, 2003]
Representation learning and manifold embedding
• In manifold embedding, there are methods for:
  • Estimating the intrinsic dimensionality of the data, without actually projecting the data.
  • Generating a (meaningful) configuration by means of a projection (data projection methods).
    • a.k.a. Representation learning
    • If the configuration is low dimensional then it is often referred to as dimensionality reduction.
Manifold embedding
• Example of methods for estimating the intrinsic dimensionality of data (without projection)
  • Bennet's algorithm [Bennet, 1969]
  • Fukunaga and Olsen algorithm [Fukunaga et al, 1971]
    • Local eigenvalue estimator [Verveer et al, 1995]
  • Bruske and Sommer's work based on topology preserving maps [Bruske et al 1998]
  • Trunk's statistical approach (near neighbour techniques) [Trunk, 1968] [[Trunk, 1976] in [Camastra, 2003]]
    • Pettis' algorithm – adds the assumption of uniformly distributed sampling to derive a simple expression.
    • Near neighbour estimator [Verveer et al, 1995]
  • Fractal based methods [Review by Camastra, 2003]
  • Broomhead's topological dimension of a time series [Broomhead, 1987]
Representation learning and manifold embedding
• Example of linear data projection methods
  • PCA (Principal Component Analysis) [Refs - LOTS!!]
  • MDS (Multidimensional Scaling, a.k.a. Principal coordinate analysis) [Refs - LOTS!! – [Kruskal, 1974][Cox, 1994]]
  • ICA (Independent Component Analysis) [Comon, 1994]
  • CCA (Canonical Correlation Analysis) [Friman, 2002]
  • PP (Projection pursuit) [Carreira-Perpiñán, 1997]
Representation learning and manifold embedding
• Example of non-linear data projection methods
  • Sammon's non-linear mapping (NLM) [Sammon, 1969]
    • GeoNLM [Yang, 2004b]
  • Kohonen's self organising maps (SOM) [Kohonen, 1997], a.k.a. topologically continuous maps, and Kohonen maps
    • Temporal Kohonen maps [Chappell, 1993]
  • Laplacian eigenmaps [Belkin, 2002, 2003]
    • Laplacian eigenmaps with fast N-body methods [Wang, 2006]
  • PCA based:
    • Non-linear PCA [Fodor, 2002], Kernel PCA [Scholkopf, 1998], Principal Curves [Carreira-Perpignan, 1997], Space partition and locally applied PCA [Olsen and Fukunaga, 1973]
Representation learning and manifold embedding
• Example of non-linear data projection methods
  • Isomap [Tenenbaum, 2000]
    • FR-Isomap [Lekadir, 2006], S-Isomap [Geng, 2005], ST-Isomap [Jenkins, 2004], L-Isomap [Silva, 2002], C-Isomap [Silva, 2002]
  • Locally linear embedding (LLE) [Roweis, 2000]
    • Hessian Eigenmaps, a.k.a. Hessian Locally Linear Embedding [Donoho, 2003]
  • Curvilinear Component Analysis [Demartines, 1997]
    • Curvilinear Distance Analysis (CDA) [Lee, 2002, 2004]
Representation learning and manifold embedding
• Example of non-linear data projection methods
  • Kernel ICA [Bach, 2003]
  • Manifold charting [Brand, 2003]
  • Stochastic neighbour embedding [Hinton, 2002]
  • Triangulation method [Lee, 1977]
  • Tetrahedral methods: Distance preserving projection [Yang, 2004]
Representation learning and manifold embedding
• Example of non-linear data projection methods
  • Semidefinite embedding (SDE)
    • Minimum Volume Embedding [Shaw, 2007]
  • Conformal Eigenmaps [Maaten, 2007]
    • Maximally angle preserving
  • Maximum Variance Unfolding (MVU) [Maaten, 2007]
    • Variant of LLE
  • Diffusion Maps (DM)
    • Based on a Markov random walk on the high dimensional graph to get a measure of proximity between data.
Representation learning and manifold embedding
• Data representation refers simply to the chosen feature space, i.e. the feature vector [BengioY2013].
• The construction or learning of this feature space goes under the name of feature engineering and includes more rudimentary subproblems such as feature selection and extraction, e.g. processing and transformations.
Representation learning and manifold embedding
• A good representation is one that disentangles the underlying factors of variation [BengioY2013].
• As soon as there is a notion of representation, one can think of a manifold [BengioY2013].
Local vs non-local generalization
• Local generalization
  • It refers to an underlying assumption made by many learning algorithms: the output f(x1) is similar to f(x2) iff x1 is similar to (i.e. close to / in the neighbourhood of) x2.
• Non-local generalization
  • Learning a function that behaves differently in different regions of the data-space requires different parameters for each of these regions.
Local generalization
• Local generalization is closely related to manifold learning:
  • Since a manifold is locally Euclidean, it can be approximated locally by linear patches tangent to the manifold surface.
  • If it is smooth, then these patches (i.e. the computational units) will be reasonably large and the number of patches needed (i.e. the size of the computational model) will be small.
  • However, if the manifold is highly curved (i.e. a complex highly varying function) then the patches will have to be small, increasing the number of patches needed to characterise the manifold.
Figure reproduced from [BengioY2009, pg 25]
Local generalization
• Local generalization is related to the curse of dimensionality.
  • However, what matters for generalization is not the [extrinsic] dimensionality, but the number of variations of the function [i.e. the intrinsic dimensionality] that we want to learn.
• Generalization is mostly achieved by a form of local interpolation between neighbouring training examples.
Representation learning and manifold embedding
• Types of representations
  • Expressive representations
  • Distributed representations
  • Overcomplete representations
  • Invariant representations
Representation learning and manifold embedding
• Expressive representations: It refers to the ability of capturing a huge number of input configurations with a reasonably sized representation. In other words, having few features suffices to cover most of the data space.
  • That's good old content validity meets computational spatial efficiency (Felipe's dixit)
  • Traditional algorithms require O(N) parameters (and/or O(N) training examples) to distinguish O(N) input regions.
  • Linear features, e.g. those learnt by PCA, cannot be stacked to form deeper, more abstract representations, since the composition of linear operations yields another linear operation.
    • However, it is still possible to use the linear features in deep learning, e.g. by inserting a non-linearity between learned single-layer linear projections.
Representation learning and manifold embedding
• Distributed representations: It refers to having more than one computational unit charting a certain region of the data space at the same time. Distributed representations are often (always?) expressive.
  • Example: Imagine one binary classifier over a certain space. It partitions the space into 2 subregions. But having 3 classifiers over that same space can partition the space into exponentially more regions.
  • Distributed representations can alleviate the curse of dimensionality and the limitations of local generalization.
Figure reproduced from [BengioY2009, pg 27]
Representation learning and manifold embedding
• Overcomplete representations: It refers to having more (hidden) computational units, i.e. degrees of freedom, than training examples.
  • Often leads to overfitting, endangering generalization.
    • May still be useful for ad-hoc predictive value or denoising
  • However: "importantly, DBMs, (in the case of MNIST despite having millions of parameters and only 60k training samples), do not appear to suffer much from overfitting" [SalakhutdinovR2009, pg 453]
    • ...hmmm, not sure about this; Salakhutdinov says so, but he does not provide any evidence that this is the case.
Representation learning and manifold embedding
• Invariant representations: It refers to having computational units which, by having learnt abstract concepts, achieve outputs which are invariant to local changes of the input. This often needs highly non-linear transfer functions.
  • Invariance and abstraction go hand in hand.
  • Having invariant features is a long standing goal in pattern recognition.
  • Achieving invariance, i.e. reducing sensitivity along a certain direction of the data, does not guarantee having disentangled a certain factor of variance in the data. Although invariance is often good, the ultimate goal is not to achieve invariance, but to disentangle explanatory factors [BengioY2013]
    • …that's manifold embedding!
    • Therefore, the goal of building invariant features should be removing sensitivity to directions of variance that are uninformative to the task.
  • Building invariant representations often involves two steps:
    • Low level features are selected to account for the data
    • Higher level features are extracted from the low level features
DEEP LEARNING
Deep learning
• Much of the actual effort in deploying machine learning algorithms goes into feature engineering.
• Representation learning, closely related to deep learning, is about learning a representation of the data, i.e. a feature space, that makes it easier to extract useful information when building predictors (e.g. classifiers, regressors, etc.).
• → Deep learning is a particular case of representation learning.
  • …it is just that right now (2015-16) it is on the crest of the wave
Deep learning
• Deep architectures are model architectures composed of multiple levels of non-linear operations or computational elements.
• The number of levels, i.e. the longest path from an input node to an output node, is referred to as the depth of the architecture.
Deep learning
• An architecture may be:
  • Shallow architecture: often up to 3 levels of depth
  • Deep architecture: more than 3 levels
    • Example: Brain anatomy; 5-10 levels in the visual system [SerreT2007]
  • Funny enough, examples and systems used in scientific papers devoted to deep learning hardly go beyond 3 levels, e.g. [SalakhutdinovR2013_TPAMI]. So not that deep!
Deep learning
• Pros and cons in a nutshell

Pros:
• Relaxes the need for feature engineering
• Modelling becomes truly data-driven
• Bigger compartmentalization of the search space achieved (with a fixed number of hidden variables)

Cons:
• Higher complexity of the model
• Larger number of parameters
• "Direct" training becomes intractable
Deep learning
• Deep Boltzmann Machines (DBM): A variant of Boltzmann machines that, instead of having one single layer of hidden variables (in contrast to the RBM), has multiple layers of hidden variables, with units in odd-numbered layers being conditionally independent given the even-numbered layers, and vice versa.
Figure: Deep Boltzmann Machine with 3 layers. Figure reproduced from [SalakhutdinovR2013_TPAMI]
Questions that I'm unable to answer at the moment
• Overfitting.
  • Clearly deep models are prone to overfitting considering they use overcomplete representations.
    • …it's not me, but Bengio who warns about this!
  • From his particular example with MNIST images, [SalakhutdinovR2009, pg 453] claims this does not seem to be the case.
    • However, he says so but fails to provide any evidence that this is the case.
Deep learning
• To know more:
  • [BengioY2009] Bengio, Y. (2009) "Learning deep architectures for AI" Foundations and Trends in Machine Learning, 2(1):1-127
  • [BengioY2013] Bengio, Y.; Courville, A.; Vincent, P. (2013) "Representation learning: a review and new perspectives" IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828
  • [DavisRA2001] Davis, R.A. (2001) Gaussian Processes, Encyclopedia of Environmetrics, Section on Stochastic Modeling and Environmental Change (D. Brillinger, Editor), Wiley, New York
  • [HintonGE2006] Hinton, Geoffrey E.; Osindero, Simon; Teh, Yee-Whye (2006) "A Fast Learning Algorithm for Deep Belief Nets" Neural Computation 18:1527-1554
  • [LeCunY2006] LeCun, Yann; Chopra, Sumit; Hadsell, Raia; Ranzato, Marc'Aurelio; Huang, Fu Jie (2006) "A tutorial on energy-based learning" in Bakir, G.; Hofman, T.; Schölkopf, B.; Smola, A.; Taskar, B. (Eds), Predicting Structured Data, MIT Press
  • [ResnikP2010] Resnik, Philip and Hardisty, Erick (2010) "Gibbs sampling for the uninitiated" Technical Report CS-TR-4956, Institute for Advanced Computer Studies, University of Maryland, 23 pp.
  • [SalakhutdinovR2008_ICML] Salakhutdinov, Ruslan and Murray, Iain (2008) "On the Quantitative Analysis of Deep Belief Networks" 25th International Conference on Machine Learning (ICML), Helsinki, Finland
  • [SalakhutdinovR2009_AISTATS] Salakhutdinov, Ruslan and Hinton, Geoffrey (2009) "Deep Boltzmann Machines" 12th International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, Florida, USA, pgs. 448-455
  • [SalakhutdinovR2013_TPAMI] Salakhutdinov, Ruslan; Tenenbaum, Joshua B. and Torralba, Antonio "Learning with hierarchical-deep models" IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1958-1971
  • [SerreT2007] Serre, T.; Kreiman, M. K.; Cadieu, U.; Knoblich, U.; Poggio, T. (2007) "A quantitative theory of immediate visual recognition" Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165:33-56
  • [TehYW2010] Teh, Y.W. (2010) Dirichlet process. Encyclopedia of Machine Learning. Springer
© 2015-16 Dr. Felipe Orihuela-Espina
130
KNOWLEDGE REPRESENTATION
AND DISCOVERY
© 2015-16 Dr. Felipe Orihuela-Espina
131
Knowledge representation
 “Knowledge representation includes
ontologies, new concepts for representing,
storing, and accessing knowledge. Also
included are schemes for representing
knowledge and allowing the use of prior
human knowledge about the underlying
process by the knowledge discovery
system.” [FayyadU1996]
© 2015-16 Dr. Felipe Orihuela-Espina
132
Knowledge generation
 To arrive at knowledge from experimentation, 3 steps
are taken:
 Data harvesting: involves all observational and
interventional experimentation tasks needed to acquire data
 Data acquisition: experimental design, evaluation metrics,
capturing raw data
 Data reconstruction: translates raw data into domain
data.
 Inverts the data formation process.
 E.g.: if you captured your data with a certain sensor and the
sensor throws electric voltages as output, then reconstruction
involves converting those voltages into a meaningful domain
variable (see the sketch after this list).
 E.g.: image reconstruction
 Data analysis: from domain data to domain knowledge
 When big data is involved, this is often referred to as knowledge
discovery
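In the fNIRS setting, for instance, reconstruction maps measured optical-density changes to chromophore concentration changes via the modified Beer-Lambert law. A minimal sketch; the extinction coefficients below are placeholders, not tabulated values:

import numpy as np

# Rows: wavelengths (e.g. 690 nm, 830 nm); columns: [HbO2, HbR].
# PLACEHOLDER extinction coefficients; real ones come from published tables.
E = np.array([[1.5, 3.8],
              [2.5, 1.8]])
d, dpf = 3.0, 6.0  # source-detector distance (cm) and differential pathlength factor

def reconstruct(delta_od):
    """Modified Beer-Lambert law: dOD = (E @ dC) * d * dpf,
    hence dC = solve(E, dOD) / (d * dpf)."""
    return np.linalg.solve(E, delta_od) / (d * dpf)

d_hbo, d_hbr = reconstruct(np.array([0.012, 0.018]))
print(d_hbo, d_hbr)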
© 2015-16 Dr. Felipe Orihuela-Espina
133
Knowledge discovery
Figure from [Fayyad et al, 1996]
© 2015-16 Dr. Felipe Orihuela-Espina
134
Data interpretation
 Research findings generated depend on the philosophical approach
used [LopezKA2004]
 Assumptions drive methodological decisions
 Different (philosophical) approaches for data interpretation
[PriestH2001, part 1; LopezKA2004; but basically philosophy in
general]
 Interpretive (or hermeneutic) phenomenology:
 Systematic reflection/exploration on the phenomena as a means to grasp the
absolute, logical, ontological and metaphysical spirit behind the phenomena
 Affected by the researcher's bias
 Kind of your classical hypothesis-driven interpretation approach [Felipe's dixit]
 Descriptive (or eidetic) phenomenology:
 Favours data-driven over hypothesis-driven research [Felipe's dixit based upon
the following]
 "the researcher must actively strip his or her consciousness of all prior expert
knowledge as well as personal biases (Natanson, 1973). To this end, some researchers
advocate that the descriptive phenomenologist not conduct a detailed literature review
prior to initiating the study and not have specific research questions other than the
desire to describe the lived experience of the participants in relation to the topic of
study" [LopezKA2004]
Important note: I do NOT
understand these very well
© 2015-16 Dr. Felipe Orihuela-Espina
135
Data interpretation
 Different (philosophical) approaches for data interpretation
[PriestH2001, part 1; LopezKA2004; but basically philosophy
in general] (Cont.)
 Grounded theory analysis
 Generates theory through inductive examination of data
 Systematization to break down data, conceptualise it and re-arrange it
in new ways
 Content analysis
 Facilitates the production of core constructs formulated from the contextual
settings from which data were derived
 Emphasizes reproducibility (enabling others to establish similar results)
 Interpretation (analysis) becomes continual checking and questioning
 Narrative analysis
 Qualitative
 Results (often from interviews) are revisited iteratively, removing words
or phrases until the core points are extracted.
Important note: I do NOT
understand these very well
© 2015-16 Dr. Felipe Orihuela-Espina
136
Data analysis: more than just choosing your statistical test…
Figure source: [OrihuelaEspinaF2012, Workshop on Foundations of
Biomedical Knowledge Representation]
[Figure: a concept map over the three levels of analysis (Processing, Analysis, Understanding), summarising the decisions involved at each stage:]
• Purpose: making sense of the past (explaining occurring phenomena, establishing associational or causal relations), supporting present decision making, or addressing the future (inferring outcomes, reasoning, prediction, planning, optimization)
• Hypothesis-driven vs data-driven; quantitative vs qualitative
• Causality (zero-level, one-level, two-level)
• Incorporation of domain knowledge (priors)
• Algorithm: complexity (order), strategy (e.g. greedy), serial/parallel, exact real-number computation
• Problem complexity (NP-complete, P-hard…) and problem size (information representation, regularization)
• Data relations and behaviour
• Validation theory: type (construct, face, convergent, ecological, external, internal, etc.), technique (leave-one-out, cross-fold, gold standard, ground truth)
• Dimensionality (intrinsic vs explicit)
• Learning (supervised, unsupervised, reinforcement)
• Comparison (metric and performance definition)
• Data quality and SNR
• Data collection: direct (intervention) vs indirect (sensing); sampling; interviewing, behavioural simulation, observational; synthetic, experimental, database; positive vs negative/complement
• Data characteristics: type (discrete, continuous, categorical/nominal, ordinal/ranked); digital vs analogue; nature (time vs space); deterministic vs stochastic; observable vs non-observable; one-way, two-way, N-way; fundamental vs derived
© 2015-16 Dr. Felipe Orihuela-Espina
137
Brain map of data analysis
NOT INTENDED TO
BE
COMPREHENSIVE!
© 2015-16 Dr. Felipe Orihuela-Espina
139
Why KR models for biomedical engineering?
 GOAL:
 Formalizing concepts and relations common in
biomedical imaging
 Affording more time for interpretation
 Advantages:
 Favours automated data processing, automated
knowledge and data integration, and semantic
integration [HoehndorfR2012]
 The formalization of experimental knowledge is expected
to make such knowledge more easily reusable to answer
other scientific questions [KingRD2009]
 Ensures reproducibility and quality of results
[OrihuelaEspinaF2010]
 Leaves interpretation to humans!
© 2015-16 Dr. Felipe Orihuela-Espina
140
Knowledge generation can be streamlined:
e.g. Automated identification of natural laws
 A computer program extrapolated the
laws of motion from a pendulum's
swings in just over a day.
 This took physicists centuries to
complete!
 Based on symbolic regression
 Symbolic regression automatically
searches both for the parameters to fit an
equation and for the equation form,
simultaneously (a toy sketch follows).
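This is only a toy random-search illustration of the idea; the original system used genetic programming guided by partial-derivative agreement, which this sketch does not implement:

import random

OPS = [("add", lambda a, b: a + b), ("sub", lambda a, b: a - b),
       ("mul", lambda a, b: a * b)]

def random_expr(depth=3):
    """Grow a random expression tree over variable x and random constants."""
    if depth == 0 or random.random() < 0.3:
        return ("x",) if random.random() < 0.5 else ("const", random.uniform(-2, 2))
    name, fn = random.choice(OPS)
    return (name, fn, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    if expr[0] == "x":
        return x
    if expr[0] == "const":
        return expr[1]
    return expr[1](evaluate(expr[2], x), evaluate(expr[3], x))

# Toy target: rediscover f(x) = x^2 - 1 from sampled data
data = [(x / 10.0, (x / 10.0) ** 2 - 1.0) for x in range(-20, 21)]
best, best_err = None, float("inf")
for _ in range(20000):
    cand = random_expr()   # searches over BOTH form and parameters at once
    err = sum((evaluate(cand, x) - y) ** 2 for x, y in data)
    if err < best_err:
        best, best_err = cand, err
print(best_err)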
© 2015-16 Dr. Felipe Orihuela-Espina
141
Automating Science?
 “Computers with intelligence can design and
run experiments, but learning from the results
to generate subsequent experiments requires
even more intelligence.” [WaltzD2009]
 Goals of automation in science [WaltzD2009]:
 increase productivity by increasing efficiency
(e.g., with rapid throughput)
 improve quality (e.g., by reducing error)
 cope with scale
© 2015-16 Dr. Felipe Orihuela-Espina
142
Knowledge generation can be streamlined:
e.g. Robot scientist
Figure: robot scientist ADAM and researcher Prof. King
 LABORS (Laboratory Ontology for Robot Scientists) ontology [KingRD2011]
 Formalizes Adam's functional genomics experiments
 Based on EXPO (ontology of scientific experiments)
 Closing the loop: ADAM can decide which experiment to do next
[WaltzD2009]
 Limited to hypothesis-led discovery [KingRD2009]
© 2015-16 Dr. Felipe Orihuela-Espina
143
Knowledge generation can be streamlined:
EXPO
 EXPO: ontology of scientific experiments
 Defines over 200 concepts for creating
semantic markup about scientific experiments
 Written in the OWL language
 EXPO formalises generic knowledge about
scientific experimental design, methodology,
and results representation
 [SoldatovaLN2006]
 EXPO is available at
http://expo.sourceforge.net/ (a markup sketch follows)
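To give a flavour of such semantic markup, here is a minimal sketch using rdflib; the EXPO class and property names below are hypothetical stand-ins, and the real identifiers must be taken from the ontology itself:

from rdflib import Graph, Literal, Namespace, RDF

EXPO = Namespace("http://expo.sourceforge.net/expo.owl#")  # hypothetical base IRI
EX = Namespace("http://example.org/lab#")                  # hypothetical lab namespace

g = Graph()
g.bind("expo", EXPO)
# Describe one experiment instance with hypothetical EXPO terms
g.add((EX.study42, RDF.type, EXPO.ScientificExperiment))
g.add((EX.study42, EXPO.hasHypothesis,
       Literal("Motor task increases HbO2 concentration in M1")))
g.add((EX.study42, EXPO.hasExperimentalDesign,
       Literal("block design, 5 repetitions")))
print(g.serialize(format="turtle"))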
© 2015-16 Dr. Felipe Orihuela-Espina
144
An overview of EXPO
[KingRD2006 presentation on EXPO]
© 2015-16 Dr. Felipe Orihuela-Espina
145
AN EXAMPLE OF KR WITH
FNIRS
© 2015-16 Dr. Felipe Orihuela-Espina
146
Challenges in KR in fNIRS experimentation
 How to choose?
 The region to interrogate?
 The best (most fair) analysis?
[OrihuelaEspina2010_OHBM]
 Inc. processing, parameterization, and analysis flow
 How to avoid:
 Physiological noise / systemic effects?
 Artefacts (e.g. optode movement, ambient light)?
 How to ensure:
 Physiological plausibility?
 Integrity / validity? [OrihuelaEspina2010_PMB]
 Reuse of formalized experiment information? [KingRD2009]
© 2015-16 Dr. Felipe Orihuela-Espina
147
Challenges in KR in fNIRS experimentation:
Parameterization
[OrihuelaEspinaF2010_OHBM]
© 2015-16 Dr. Felipe Orihuela-Espina
148
Challenges in KR in fNIRS experimentation:
Modelling
[Figure: modelling pipeline. Light enters and exits the tissue; a light model maps the measured light to chromophore concentration; via neurovascular coupling, a physiological model maps concentration to physiological information.]
[Inspired from Banaji, fNIRS Conference, 2012]
© 2015-16 Dr. Felipe Orihuela-Espina
149
Challenges in KR in fNIRS experimentation:
Modelling
 Are the data validated?
 Do we really need a physiological model?
 A model is useful only if it fulfils very high standards of
predictive capability and reliability
 We learn about the phenomenon while building the model
(a vicious circle)
 Purposes of models:
 Explain data / highlight gaps in understanding
 Raising open questions
 Predict hard-to-measure quantities
 Develop understanding and intuition
 Prepare us for experimental data
 Challenge dogmas
 May force us to ignore priors!
[Banaji, fNIRS Conference, 2012; Banaji, JTB, 2006, Banaji, PLoS CB, 2008]
© 2015-16 Dr. Felipe Orihuela-Espina
150
Challenges in KR in fNIRS experimentation:
Modelling
 What are the principles that we should
follow to build our model?
 How is the model going to interact with the
data? (a toy sketch follows)
[Figure: two examples of model-data interaction. Interaction 1: simulated data drive the model to produce modelled data, which are compared against the data observed from a subject/cohort. Interaction 2: the model produces modelled data that are compared directly against the observed data from a subject/cohort.]
[Banaji, fNIRS Conference, 2012]
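A minimal sketch contrasting the two interactions on a toy exponential-decay model (everything here is illustrative, not an actual fNIRS physiological model):

import numpy as np
from scipy.optimize import curve_fit

model = lambda t, a, k: a * np.exp(-k * t)   # toy stand-in for a physiological model
t = np.linspace(0, 10, 50)
observed = model(t, 2.0, 0.3) + np.random.default_rng(2).normal(0, 0.05, t.size)

# Interaction 1: fix parameters a priori, simulate, then compare with observations
simulated = model(t, 2.0, 0.25)
print("rms mismatch:", np.sqrt(np.mean((simulated - observed) ** 2)))

# Interaction 2: fit the model to the observed data, then compare
(a_hat, k_hat), _ = curve_fit(model, t, observed, p0=(1.0, 0.1))
print("fitted parameters:", a_hat, k_hat)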
© 2015-16 Dr. Felipe Orihuela-Espina
151
Challenges in KR in fNIRS experimentation
 Closing the loop:
 From experiment design and data collection to hypothesis
formation and revision, and from there to new experiments
[WaltzD2009]
 Complex experiments having different sources
 Different NIRS devices (HITACHI, SHIMADZU, NIRx), but
also different sources: eye-tracking, EEG, etc.
 Accommodating different optical modalities
 Lack of a standard "final" representation format
 Medical standard DICOM: not as standard as claimed
 Each provider has its own file format
 SNIRF: Shared Near Infrared File Format specification (see the reading sketch below)
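SNIRF files are HDF5 containers, so they can be read with a generic HDF5 library. A minimal sketch, assuming the dataset layout of the SNIRF draft specification (dataset names may differ across versions; 'example.snirf' is a hypothetical file):

import h5py

with h5py.File("example.snirf", "r") as f:
    version = f["formatVersion"][()]          # spec version string
    t = f["nirs/data1/time"][()]              # time vector
    y = f["nirs/data1/dataTimeSeries"][()]    # time x channels measurements
print(version, y.shape)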
© 2015-16 Dr. Felipe Orihuela-Espina
152
Challenges in KR in fNIRS experimentation
 Problem size
 Information representation (relational, object-oriented)
 Sample size (extrapolation, generalization,
regularization, ill-posed problems, i.e. number of
observations vs number of covariates)
 Data mining and KD strategy [FayyadU1996]
 Model identification and parameterization (see the sketch after this list)
 Underparameterization: low flexibility to explain
complex data
 Overparameterization: a spurious model can explain any
data; difficulties in parameter identification
 Level of detail
 Model boundaries, parameters, variables, purpose
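The under/overparameterization trade-off is easy to reproduce with polynomial fits; a minimal NumPy sketch on toy data (not from the slides):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)   # noisy observations

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)
    rms = np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2))
    print(degree, rms)
# Degree 1 underfits (inflexible, high error); degree 9 drives the training
# error toward zero by fitting the noise: a spurious model that can "explain"
# almost any data.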
© 2015-16 Dr. Felipe Orihuela-Espina
153
Concept map: experimentation
[Figure: concept map of fNIRS experimentation, including the light model and the physiological model.]
© 2015-16 Dr. Felipe Orihuela-Espina
154
[OrihuelaEspinaF2010, PMB]
Taxonomy of factors in fNIRS experimentation
© 2015-16 Dr. Felipe Orihuela-Espina
155
Experimental factors limit interpretation
© 2015-16 Dr. Felipe Orihuela-Espina
156
INTERPRETATION
GUIDELINES
© 2015-16 Dr. Felipe Orihuela-Espina
157
Interpretation; generating knowledge
 Every analysis must translate the
physiological, biological, experimental, etc.
concepts into a correct mathematical
abstraction.
 Every interpretation must translate the "maths"
back into real-world domain concepts.
 A common mistake in many papers is to
forget about understanding and merely state
the patterns/findings uncovered during the
analysis.
© 2015-16 Dr. Felipe Orihuela-Espina
158
Interpretation; generating knowledge
 Understanding is by far the hardest part of
data analysis.
 …and alas it is also the part where
maths/stats/computing are (so far) less
helpful.
© 2015-16 Dr. Felipe Orihuela-Espina
159
Interpretation guidelines
 Interpretation of results must be strictly
confined to the limits imposed by the
assumptions made during image
formation, acquisition, reconstruction,
processing and analysis.
 Rule of thumb: data analysis takes at least 3
to 5 times the data collection time. If it has taken
less, then your analysis is likely to be weak,
coarse or careless.
 Example: one month collecting data – 5 months'
worth of analysis.
© 2015-16 Dr. Felipe Orihuela-Espina
160
Interpretation guidelines
 Look at your data! Know them by heart.
Visualize them in as many ways
as you can imagine, and then a few more.
 Have a huge background. Read
everything out there closely and loosely
related to your topic.
© 2015-16 Dr. Felipe Orihuela-Espina
161
Interpretation guidelines
 Always try more than one analysis
(convergent validity).
 Quantitative analysis is often desirable, but
never underestimate the power of good
qualitative analysis.
 All scales are necessary and complementary:
 Structural, functional, effective
 Inter-subject, intra-subject
 Neuron-level, region-level
© 2015-16 Dr. Felipe Orihuela-Espina
162
Interpretation guidelines
 The laws of physics are what they are…
 …but research/experimentation results are
not immutable.
 They strongly depend on the decisions made
during the data harvesting, data reconstruction
and the three stages of the analysis process.
 It is the duty of the researcher to make the
best decisions to arrive at the most robust
outcome.
 Interpretation, interpretation, interpretation…
LOOK at your data!
© 2015-16 Dr. Felipe Orihuela-Espina
163
THANKS, QUESTIONS?
© 2015-16 Dr. Felipe Orihuela-Espina
165