Transcript Slide 1

Unit V. Image interpretation
Dr. Felipe Orihuela-Espina
An apology…
 ☞ This unit contains some material that I
prepared for different talks.
 Some of the original slides were in Spanish,
so some untranslated slides may remain. I am
working on the translation but also on updating
the examples to medical images.
 Please accept my apologies for any
inconvenience this may cause.
© 2015. Dr. Felipe Orihuela-Espina
Outline
 Causality
 Interpreting statistics
 Data mining
 Pattern recognition, machine learning,
knowledge discovery
 Knowledge representation
 Interpretation guidelines
The three levels of analysis
 Data analysis often comprises 3 steps:
 Processing: Output domain matches input
domain
 Preparation of data: data validation, cleaning,
normalization, etc.
 Analysis: Re-express data in a more convenient
domain
 Summarization of data: feature extraction, computation
of metrics, statistics, etc.
 Understanding: Abstraction to achieve
knowledge generation
 Interpretation of data: concept validation, re-expression
in natural language, etc.
07/07/2015
INAOE
The three levels of analysis
Processing
• f:XX’ such that X and X share the same space
• E.g.: Apply a filter to a signal or image and you get another signal or
image
Analysis
• f:XY such that X and Y do not share the same space
• E.g.: Apply a mask to a signal or image and you get the discontinuities,
edges or a segmentation
Interpretation (a.k.a. Understanding)
• f:XH such that H is natural language
• E.g.: Apply a model to a signal or image and you get some knowledge
useful for a human expert
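The three mappings can be illustrated on a toy 1-D signal. This is my own sketch, not course material; the function names are invented for illustration:

```python
import numpy as np

def process(x, k=5):
    """Processing: output lives in the same domain as the input (a filtered signal)."""
    kernel = np.ones(k) / k
    return np.convolve(x, kernel, mode="same")  # moving-average filter

def analyze(x, thr=0.5):
    """Analysis: output lives in a different domain (indices of large jumps)."""
    return np.flatnonzero(np.abs(np.diff(x)) > thr)

def interpret(edges, n):
    """Understanding: output is a natural-language statement."""
    return f"{len(edges)} abrupt transitions found in {n} samples"

x = np.concatenate([np.zeros(50), np.ones(50)])  # a step signal
xs = process(x)            # same space as x
edges = analyze(x, 0.5)    # new space: sample indices
print(interpret(edges, len(x)))
```

Each function stays at one level: `process` maps signal to signal, `analyze` maps signal to a different space, and `interpret` maps to text a human can act on.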
Typical fMRI processing
Figure source: [Wellcome Trust; Tutorial on SPM]
Typical fNIRS processing
[Figure: fNIRS processing pipeline: raw → detrended → low-pass filtered (decimated) → averaged]
CAUSALITY
Cogito ergo sum
[Diagram: Cause (Cogito) → Effect (Sum)]
Felipe Orihuela-Espina (INAOE)
Causation defies (1st level) logic…
 Input:
 “If the floor is wet, then it rained”
 “If we break this bottle, the floor will get wet”
 Logic output:
 “If we break this bottle, then it rained”
Example taken from [PearlJ1999]
Causality requires time!
 “…there is little use in the practice of
attempting to discuss causality without
introducing time” [Granger,1969]
 …whether philosophical, statistical,
econometric, topological, etc.
Why is causality so problematic?
A very silly example
 Cannot be computed from the data alone
 Systematic temporal precedence is not sufficient
 Co-occurrence is not sufficient
 It is not always a direct relation (indirect relations,
transitivity/mediation, etc. may be present), let alone linear…
 It may occur across frequency bands
 YOU NAME IT HERE… 
Which process causes which?
Causality is so difficult that “it would be
very healthy if more researchers
abandoned thinking of and using terms
such as cause and effect” [Muthen1987 in
PearlJ2011]
A real example
An ECG
[KaturaT2006] only claims that there are
interrelations (quantified using MI)
[OrihuelaEspinaF2010]
Statistical dependence
 Statistical dependence is a type of relation between any two
variables [WermuthN1998]: if we find one, we can expect to
find the other
Statistical independence ↔ Association
(symmetric or asymmetric) ↔ Deterministic dependence
 The limits of statistical dependence:
 Statistical independence: The distribution of one variable is the
same no matter at which level changes occur in the other
variable
X and Y are independent ⟺ P(X∩Y)=P(X)P(Y)
 Deterministic dependence: Levels of one variable occur in an
exactly determined way with changing levels of the other.
 Association: Intermediate forms of statistical dependency
 Symmetric
 Asymmetric (a.k.a. response) or directed association
Associational Inference ≡ Descriptive
Statistics!!!
 The most detailed information linking two
variables is given by the joint distribution:
P(X=x, Y=y)
 The conditional distribution describes how the
values of X change as Y varies:
P(X=x|Y=y) = P(X=x, Y=y) / P(Y=y)
 Associational statistics is simply descriptive
(estimates, regressions, posterior
distributions, etc.) [HollandPW1986]
 Example: Regression of X on Y is the
conditional expectation E(X|Y=y)
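The regression-as-conditional-expectation idea can be made concrete with a discrete joint table (the numbers are a toy example of my own):

```python
import numpy as np

# A toy joint distribution P(X=x, Y=y); rows index x in {0,1,2}, cols y in {0,1}
joint = np.array([[0.10, 0.05],
                  [0.20, 0.15],
                  [0.10, 0.40]])
assert np.isclose(joint.sum(), 1.0)   # a valid probability table

xs = np.array([0, 1, 2])
p_y = joint.sum(axis=0)                          # marginal P(Y=y)
cond = joint / p_y                               # P(X=x | Y=y), column-wise
e_x_given_y = (xs[:, None] * cond).sum(axis=0)   # E(X | Y=y) for each y
print(e_x_given_y)
```

Each column of `cond` is a full conditional distribution; the "regression of X on Y" is just its mean, one number per value of Y.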
Regression and Correlation;
two common forms of associational inference
 Regression Analysis: “the study of the dependence of one or more
response variables on explanatory variables” [CoxDR2004]
 Strong regression ≠ causality [Box1966]
 Prediction systems ≠ Causal systems [CoxDR2004]
 Correlation is a relation over mean values; two variables correlate as
they move over/under their means together (correlation is a
“normalization” of the covariance)
 Correlation ≠ Statistical dependence
 If X and Y are statistically independent then r=0 (i.e. absence of correlation), but the
opposite is not true [MarrelecG2005].
 Correlation ≠ Causation [YuleU1900 in CoxDR2004, WrightS1921]
 Yet, causal conclusions from a carefully designed (often a synonym of randomized)
experiment are often (not always) valid [HollandPW1986, FisherRA1926 in CoxDR2004]
Coherence:
yet another common form of associational inference
 Often understood as “correlation in the frequency
domain”
Cxy = |Gxy|² / (Gxx·Gyy)
 where Gxy is the cross-spectral density and Gxx, Gyy
are the power spectral densities,
 i.e. coherence behaves like a (squared) correlation
coefficient at each frequency.
 Coherence measures the degree to which two series
are related
 Coherence alone does not imply causality! The
temporal lag of the phase difference between the signals
must also be considered.
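A quick numerical illustration using SciPy's Welch-based coherence estimate; the sampling rate, frequencies and noise levels below are arbitrary choices of mine:

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(1)
fs = 250.0                        # sampling rate in Hz (arbitrary)
t = np.arange(0, 20, 1 / fs)

# Two signals sharing a 10 Hz component, plus independent noise
shared = np.sin(2 * np.pi * 10 * t)
x = shared + rng.normal(0, 1, t.size)
y = 0.5 * shared + rng.normal(0, 1, t.size)

# Welch estimate of Cxy = |Gxy|^2 / (Gxx * Gyy), one value per frequency
f, cxy = coherence(x, y, fs=fs, nperseg=1024)
peak = f[np.argmax(cxy)]
print(f"coherence peaks near {peak:.1f} Hz")
```

The coherence spectrum is high only around the shared 10 Hz component and near zero elsewhere; note the estimate says nothing about which signal drives the other.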
Statistical dependence vs Causality
 Statistical dependence provides associational
relations and can be expressed in terms of a
joint distribution alone
 Causal relations CANNOT be expressed in
terms of statistical association alone [PearlJ2009]
 Associational inference ≠ Causal Inference
[HollandPW1986, PearlJ2009]
 …ergo, Statistical dependence ≠ Causal
Inference
 In associational inference, time is merely
operational
Causality requires directionality!
 Algebraic equations, e.g. regression “do not
properly express causal relationships […]
because algebraic equations are symmetrical
objects […] To express the directionality of
the underlying process, Wright augmented
the equation with a diagram, later called path
diagram in which arrows are drawn from
causes to effects” [PearlJ2009]
 Feedback and instantaneous causality in any
case are a double causation.
From association to causation
 Barriers between classical statistics and
causal analysis [PearlJ2009]
1. Coping with untested assumptions and
changing conditions
2. Inappropriate mathematical notation
Causality
Stronger
 Zero-level causality: a statistical association, i.e.
non-independence which cannot be removed by
conditioning on allowable alternative features.
 e.g. Granger’s, topological
 First-level causality: Use of one treatment over
another causes a change in outcome
 e.g. Rubin’s, Pearl’s
Weaker
 Second-level causality: Explanation via a
generating process, provisional and hardly lending
itself to formal characterization, either merely
hypothesized or solidly based on evidence
 e.g. Suppes’, Wright’s path analysis
 e.g. Smoking causes lung cancer
Inspired from [CoxDR2004]
(It is debatable whether second-level causality is indeed causality)
Variable types and their joint probability
distribution
 Variable types:
 Background variables (B) – specify what is fixed
 Potential causal variables (C)
 Intermediate variables (I) – surrogates, monitoring,
pathways, etc.
 Response variables (R) – observed effects
 Joint probability distribution of the variables:
P(RICB) = P(R|ICB) P(I|CB) P(C|B) P(B)
…but it is possible to integrate over I (marginalize):
P(RCB) = P(R|CB) P(C|B) P(B)
In [CoxDR2004]
Granger’s Causality
 Granger’s causality:
 Y is causing X (Y→X) if we are better able
to predict X using all available
information (Z) than if the information
apart from Y had been used.
 The groundbreaking paper:
 Granger “Investigating causal
relations by econometric models and
cross-spectral methods” Econometrica
37(3): 424-438
 Granger’s causality is only a
statement about one thing
happening before another!
 Rejects instantaneous causality ⇒
considered as slowness in the recording
of information
Sir Clive William John Granger (1934–2009) – University of
Nottingham – Nobel Prize winner
Granger’s Causality
 “The future cannot cause the past” [Granger
1969]
 “the direction of the flow of time [is] a central
feature”
 Feedback is a double causation; X→Y and Y→X,
denoted X⇄Y
 “causality…is based entirely on the
predictability of some series…” [Granger
1969]
 Causal relationships may be investigated in terms
of coherence and phase diagrams
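Granger's idea, Y→X if adding Y's past improves the prediction of X, can be sketched with two autoregressions, one with and one without the lagged Y. The simulation and its coefficients are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Simulate Y Granger-causing X: x[t] depends on y[t-1]
y = rng.normal(0, 1, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + rng.normal(0, 1)

def resid_var(target, regressors):
    """Residual variance of an OLS fit of target on the given regressors."""
    X = np.column_stack(regressors)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return (target - X @ beta).var()

# Restricted model: x[t] ~ x[t-1]; full model adds y[t-1]
v_restricted = resid_var(x[1:], [x[:-1]])
v_full = resid_var(x[1:], [x[:-1], y[:-1]])
print(v_restricted, v_full)   # v_full is clearly smaller: Y helps predict X
```

A formal Granger test would compare these residual variances with an F statistic; the sketch only shows the core quantity being compared.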
Topological causality
 “A causal manifold is one with an
assignment to each of its points of a
convex cone in the tangent space,
representing physically the future
directions at the point. The usual
causality in MO extends to a causal
structure in M’.” [SegalIE1981]
 Causality is seen as embedded in the
geometry/topology of manifolds
 Causality is a curve function defined over the
manifold
 The groundbreaking book:
 Segal IE “Mathematical Cosmology and
Extragalactic Astronomy” (1976)
 I am not sure whether Segal is the father
of causal manifolds, but his contribution
to the field is simply overwhelming…
Irving Ezra Segal (1918-1998) Professor of Mathematics at MIT
Causal (homogeneous Lorentzian)
Manifolds: The topological view of causality
 The cone of causality [SegalIE1981,RainerM1999,
MosleySN1990, KrymVR2002]
[Diagram: the causal cone: future directions above, the instant present at the apex, the past below]
Rubin Causal Model
 Rubin Causal Model:
 “Intuitively, the causal effect of one
treatment relative to another for a
particular experimental unit is the
difference between the result if the
unit had been exposed to the first
treatment and the result if, instead,
the unit had been exposed to the
second treatment”
 The groundbreaking paper:
 Rubin “Bayesian inference for
causal effects: The role of
randomization” The Annals of
Statistics 6(1): 34-58
 The term Rubin causal model
was coined by his student Paul
Holland
Donald B. Rubin (1943– ) –
John L. Loeb Professor of Statistics at Harvard
Rubin Causal Model
 Causality is an algebraic difference:
treatment causes the effect Ytreatment(u)-Ycontrol(u)
…or in other words; the effect of a cause is always
relative to another cause [HollandPW1986]
 Rubin causal model establishes the conditions
under which associational (e.g. Bayesian)
inference may infer causality (makes assumptions
for causality explicit).
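The unit-level effect Ytreatment(u) − Ycontrol(u), and why randomization lets an observed group difference estimate it, can be simulated. This is a toy of my own; the effect size and distributions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Simulated potential outcomes for each unit u (never both observed in practice)
y_control = rng.normal(10, 2, n)
y_treatment = y_control + 1.5          # true unit-level causal effect: +1.5

# Randomized assignment: the observed group difference estimates the effect
treated = rng.random(n) < 0.5
observed_effect = y_treatment[treated].mean() - y_control[~treated].mean()
true_effect = (y_treatment - y_control).mean()
print(true_effect, observed_effect)
```

In the simulation we can compute both potential outcomes and hence the true effect; in reality only one of the two is observed per unit, which is exactly the fundamental problem stated on the next slide.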
Fundamental Problem of Causal Inference
 Only Ytreatment(u) or Ycontrol(u) can be observed on a
phenomenon, but not both.
 Causal inference is impossible without making
untested assumptions
 …yet causal inference is still possible under
uncertainty [HollandPW1986] (two otherwise identical
populations u must be prepared and all appropriate
background variables must be considered in B).
 Again! (see slide #15 “Statistical dependence vs
Causality”); causal questions cannot be computed
from the data alone, nor from the distributions that
govern the data [PearlJ2009]
Relation between Granger, Rubin
and Suppes causalities
Correspondence between Granger’s and Rubin’s terms:
• Cause (treatment): Y (Granger) ↔ t (Rubin)
• Effect: X (Granger) ↔ Ytreatment(u) (Rubin)
• All other available information: Z (Granger) ↔ Z, pre-exposure variables (Rubin)
 Granger’s noncausality:
X is not a Granger cause of Y (relative to the information in
Z) ⟺ X and Y are conditionally independent (i.e.
P(Y|X,Z)=P(Y|Z))
 Granger’s noncausality is equal to Suppes’ spurious
case
Modified from [HollandPW1986]
Pearl’s statistical causality
(a.k.a. structural theory)
 “Causation is encoding behaviour under
intervention […] Causality tells us which
mechanisms [stable functional
relationships] is to be modified [i.e.
broken] by a given action”
[PearlJ1999_IJCAI]
 Causality, intervention and mechanisms
can be encapsulated in a causal model
 The groundbreaking book:
 Pearl J “Causality: Models, Reasoning and
Inference” (2000)*
 Pearl’s results do establish conditions
under which first level causal
conclusions are possible [CoxDR2004]
* With permission of his 1995 Biometrika paper masterpiece
Judea Pearl (1936-) Professor of computer science and
statistics at UCLA
Sewall Green Wright
(1889-1988) – Father of
path analysis (graphical
rules)
[PearlJ2000, Lauritzen2000, DawidAP2002]
Statistical causality
 Conditioning vs Intervening [PearlJ2000]
 Conditioning: P(R|C) = ΣB P(R|CB)P(B|C) ⇒ useful but
inappropriate for causality, as changes in the past (B)
occur before the intervention (C)
 Intervention: P(R║C) = ΣB P(R|CB)P(B) ⇒ Pearl’s
definition of causality
 Underlying assumption: the distribution of R (and
I) remains unaffected by the intervention.
 Watch out! This is not trivial ⇒ serious interventions
may distort all relations [CoxDR2004]
 βCB=0 (structural coefficient) ⇒ C╨B (conditional
independence) ⇒ P(R|C)=P(R║C)
 i.e. there is no difference between
conditioning and intervention
INTERPRETING STATISTICS
Inferential statistics
 “If your experiment needs
statistics, you ought to have
done a better experiment.”
Lord Ernest Rutherford of Nelson
New Zealander / British, 1871-1937
Father of nuclear physics
Discoverer of the proton
Nobel Prize in Chemistry 1908
Modeling
[Diagram: values of the independent and/or controlled variables feed a
deterministic model, yielding values of the dependent variables, or a
stochastic model, yielding the expectation of the dependent variables]
Univariate linear regression
 Under stochastic dependence we can carry out
2 closely related types of analysis:
 Regression analysis
 Defines the “type” (linear, exponential/logarithmic,
hyperbolic, etc.) of relation between the variables
 Produces an equation describing the relation between
the variables (close to functional dependence)
 Correlation analysis
 Defines the degree and consistency of in/dependence, or
degree of association, between the variables
 Produces a value summarizing the strength of the relation
between the variables
Regression analysis
 Regression analysis is a set of statistical
techniques for estimating relations between variables.
 Regression analysis is widely used for:
A. Inference of relations between variables (modeling), and
B. Prediction of new outcomes/observations
(simulation)
 Machine learning is strongly related
to regression analysis.
 Example: Classifiers are (discrete or continuous)
regression models.
Univariate linear regression (deterministic)
y = β0 + β1·x
 where y is the dependent variable, x the independent
variable, β1 the slope, and β0 the intercept (the cut on
the ordinate axis).
 In a slightly more general notation, the βi are the
parameters of the model.
Univariate linear regression (stochastic)
Deterministic model: y = β0 + β1·x
In the presence of uncertainty, the stochastic model
becomes: E[Y] = β0 + β1·x
Univariate linear regression (stochastic)
Stochastic model, expressing the uncertainty (error)
explicitly for each observation:
Yi = β0 + β1·xi + εi
The error is the difference of the i-th observation from its
expectation; in other words, the difference between the
measurement and the true value (Yi − E[Yi]).
Multivariate linear regression (stochastic)
 For j independent/controlled variables:
Yi = β0 + β1·x1i + β2·x2i + … + βj·xji + εi
 This is known as the additive linear model, relating
one dependent variable to j independent variables.
Note that the unknowns are the coefficients βi.
Modeling consists of calculating or estimating these
coefficients (often called parameters).
Multivariate linear regression (stochastic)
 In general, for n cases, a system of
equations is formed:
Yi = β0 + β1·x1i + β2·x2i + … + βj·xji + εi,  for i = 1, …, n
General linear model
 We can express the previous multiple regression
model more compactly with matrices:
Y = Xβ + ε
 where the first column of X is all 1s, needed for the
intercept β0 (the cut on the ordinate axis). Sometimes
the model is presented without the constant term, in
which case this column disappears.
General linear model
Y (n×1) = X (n×(j+1)) · β ((j+1)×1) + ε (n×1)
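The matrix form Y = Xβ + ε maps directly onto a least-squares fit. A small sketch of my own with synthetic data and arbitrarily chosen coefficients:

```python
import numpy as np

rng = np.random.default_rng(4)
n, j = 200, 2

# Design matrix X: a leading column of ones for the intercept β0
x = rng.normal(0, 1, (n, j))
X = np.column_stack([np.ones(n), x])          # shape (n, j+1)

beta_true = np.array([2.0, -1.0, 0.5])        # β0, β1, β2 (arbitrary)
y = X @ beta_true + rng.normal(0, 0.1, n)     # Y = Xβ + ε

# Least-squares estimate of β
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to [2.0, -1.0, 0.5]
```

Dropping the column of ones would force the fitted line/plane through the origin, which is the "model without constant term" mentioned on the slide.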
Covariance
 The covariance expresses the tendency in the
(linear) relation between the variables
 If sXY>0 ⇒ as X grows, Y grows
 If sXY<0 ⇒ as X grows, Y decreases
Figure from: [http://biplot.usal.es/ALUMNOS/BIOLOGIA/5BIOLOGIA/Regresionsimple.pdf]
Correlation coefficient
 Pearson’s correlation coefficient is an index
measuring the magnitude of the linear association
between two quantitative random variables, and
corresponds to the normalization of the covariance:
r = sXY / (sX·sY)
 i.e. the covariance divided by the standard deviations.
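The covariance-normalization definition can be verified against NumPy's built-in correlation; the data below are a toy example of my own:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 1000)
y = 2 * x + rng.normal(0, 1, 1000)   # linearly related plus noise

# Pearson's r as the normalized covariance: r = s_xy / (s_x * s_y)
s_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = s_xy / (x.std(ddof=1) * y.std(ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)   # identical; strongly positive here
```

Both routes give the same number, so `np.corrcoef` is literally the covariance matrix rescaled by the standard deviations.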
Correlation coefficient
Figure from: [en.wikipedia.org]
Goodness of fit
 Coefficient of determination R²:
 The coefficient of determination is not the sample
linear correlation coefficient r (Pearson’s correlation
coefficient), but it is closely related
 In fact, as you may guess, one is the
square of the other… 
 Recommended reading:
 My slides for the statistics course
Goodness of fit
Figure from: [Wolfram MathWorld]
Correlation coefficient
Careful! To my knowledge this table is outdated; some of the coefficients marked as “not
developed” have since been developed. But I have not had time to confirm it.
Table from: [http://pendientedemigracion.ucm.es/info/mide/docs/Otrocorrel.pdf]
Quotes on statistical significance
 [BlandM1996] “Acceptance of statistics, though gratifying
to the medical statistician, may even have gone too far.
More than once I have told a colleague that he did not
need me to prove that his difference existed, as anyone
could see it, only to be told in turn that without the magic
p-value he could not have his paper published.”
 [Nicholls in KatzR2001] “In general, however, null
hypothesis significance testing tells us little of what we
need to know and is inherently misleading. We should be
less enthusiastic about insisting on its use.”
Quotes on statistical significance
 [Falk in KatzR2001] “Significance tests do not provide the
information that scientists need, neither do they solve the
crucial questions that they are characteristically believed to
answer. The one answer that they do give is not a question
that we have asked.”
 [DuPrelJB2009] “Unfortunately, statistical significance is often
thought to be equivalent to clinical relevance. Many research
workers, readers, and journals ignore findings which are
potentially clinically useful only because they are not
statistically significant. At this point, we can criticize the
practice of some scientific journals of preferably publishing
significant results [...] ("publication bias").”
Quotes on statistical significance
 [GardnerMJ1986, co-authored by Altman] “...the use of
statistics in medical journals has increased tremendously.
One unfortunate consequence has been a shift in emphasis
away from the basic results towards an undue concentration
on hypothesis testing. In this approach data are examined in
relation to a statistical "null" hypothesis, and the practice has
led to the mistaken belief that studies should aim at obtaining
"statistical significance”. [...] The excessive use of hypothesis
testing at the expense of other ways of assessing results has
reached such a degree that levels of significance are often
quoted alone in the main text and abstracts of papers, with no
mention of actual concentrations, proportions, etc, or their
differences. The implication of hypothesis testing- that there
can always be a simple "yes" or "no" answer as the
fundamental result from a medical study-is clearly false and
used in this way hypothesis testing is of limited value.”
Hypothesis testing
 Considered the father of inferential statistics
 Creator of ANOVA, among others
 Worked mainly at Cambridge and UCL; was a
fellow of the Royal Society
 Replaced Pearson in his chair at UCL
 Like any good genius he worked in other fields:
mathematics, evolutionary biology, genetics,
etc.
 In fact, he is also the father of population
genetics, which describes evolutionary
phenomena in terms of the variation and
distribution of allele frequency
 He also discovered the usefulness of Latin
squares for significantly improving agricultural
methods
Sir Ronald Aylmer Fisher (1890-1962)
British
A biography and some links:
http://www-history.mcs.st-andrews.ac.uk/Biographies/Fisher.html
Null and Alternative Hypothesis
 Statistical testing is used to accept/reject
hypotheses
 Null hypothesis (H0): There is no difference or
relation
 H0: μ1=μ2
 Alternative hypothesis (Ha): There is a difference or
relation
 Ha: μ1≠μ2
 Example:
 Research question: Are men taller than women?
 Null hypothesis: There is no height difference between genders
 Alternative hypothesis: Gender makes a difference in height.
Hypothesis Type / Directionality:
One-tail vs Two-tail
 One-tailed: Used for directional hypothesis testing
 Alternative hypothesis: There is a difference and we anticipate
the direction of that difference
 Ha: μ1<μ2
 Ha: μ1>μ2
 Two-tailed: Used for non-directional hypothesis testing
 Alternative hypothesis: There is a difference but we do not
anticipate the direction of that difference
 Ha: μ1≠μ2
 Example:
 Research question: Are men taller than women?
 Null hypothesis: There is no height difference between genders
 Alternative hypothesis:
 One tail: Men are taller than women
 Two tail: One gender is taller than the other.
[Figures from: http://www.mathsrevision.net/alevel/pages.php?page=64]
Significance Level (α)
and test power (1-β)
 The probability of making each type of error:

Decision \ Reality | H0 true / Ha false | H0 false / Ha true
Accept H0; Reject Ha | Ok (p=1-α) | Type II Error (β)
Reject H0; Accept Ha | Type I Error (p=α) | Ok (1-β)

 Type I Errors can be decreased by altering the
level of significance (α)
 Unfortunately, this in turn increments the risk
of Type II Errors
 …and vice versa
 The decision on the significance level should
be made (not arbitrarily but) based on the type
of error we want to reduce.
Figure from: [http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/reference/reference_manual_02.html]
Hypothesis Type / Directionality:
One-tail vs Two-tail
 Hypothesis directionality affects statistical power
 One-tailed tests provide more statistical power to
detect an effect
 Choosing a one-tailed test for the sole purpose
of attaining significance is not appropriate. You
may lose the difference in the other direction!
 Choosing a one-tailed test after running a
two-tailed test that failed to reject the null
hypothesis is not appropriate.
[Figure: one-tail vs two-tail rejection regions]
Source: [http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests.htm]
Figure from: [http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/reference/reference_manual_02.html]
Independence of observations:
Paired vs Unpaired
 Paired: There is a one-to-one
(bijective) correspondence between
the samples of the groups
 If samples in one group are
reorganised then so should be the
samples in the other.
 Examples:
 Randomized block experiments with two
units per block
 Studies with individually matched
controls
 Repeated measurements on the same
individual
 Unpaired: There is no correspondence
between the samples of the groups.
 Samples in one group can be
reorganised independently of the other
 Pairing is a strategy of design, not
analysis (pairing occurs before data
collection!). Pairing is used to reduce
bias and increase precision
[DinovI2005]
 Example of paired data:
 N sets of twins, to know whether the 1st born is
more aggressive than the second

Twin aggressiveness score
Pair | 1st born | 2nd born
1 | 86 | 88
2 | 71 | 77
3 | 77 | 76
… | … | …
N | 87 | 72

Example adapted from [DinovI2005]
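With SciPy the paired design is handled by `ttest_rel`, which operates on the within-pair differences. A sketch using only the four pairs visible in the table (no claim about significance is intended):

```python
import numpy as np
from scipy.stats import ttest_rel

# The four twin pairs visible in the slide's table (1st born vs 2nd born)
first_born = np.array([86, 71, 77, 87])
second_born = np.array([88, 77, 76, 72])

# ttest_rel tests whether the mean within-pair difference is zero
t_stat, p_value = ttest_rel(first_born, second_born)
print(t_stat, p_value)
```

Using `ttest_ind` here instead would discard the pairing, exactly the design information the slide says must be respected.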
Parametric vs non-parametric
 Parametric testing: Assumes a certain
distribution of the variable in the population to
which we plan to generalize our data
 Non-parametric testing: No assumption
regarding the distribution of the variable in the
population
 That is distribution free, NOT ASSUMPTION FREE!!
 Non-parametric tests look at the rank order of the
values
 Parametric tests are more powerful than
non-parametric ones and so should be used if possible
[GreenhalghT 1997 BMJ 315:364]
Source: 2.ppt (Author unknown)
One way, two way,… N-way analysis
 Experimental design may be one-factorial,
two-factorial, … N-factorial
 i.e. one research question at a time, two research
questions at a time, … N research questions at a time.
 The more ways, the more difficult the analysis
interpretation
 One-way analysis measures significance effects
of one factor only.
 Two-way analysis measures significance effects
of two factors simultaneously.
 Etc…
Steps to apply a significance test
1. Define a hypothesis
2. Collect data
3. Determine the test to apply
4. Calculate the test value (t,F,χ2,p)
5. Accept/Reject null hypothesis based on
degrees of freedom and significance
threshold
[GurevychI2011]
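The five steps map directly onto a few lines of SciPy. The height data below are simulated; the means, spreads and sample sizes are invented for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(6)

# 1-2. Define the hypothesis (H0: equal means) and collect data (simulated)
heights_m = rng.normal(176, 7, 60)   # fictitious male heights, cm
heights_f = rng.normal(163, 6, 60)   # fictitious female heights, cm

# 3-4. Determine the test (two-sample t-test) and calculate its value
t_stat, p_value = ttest_ind(heights_m, heights_f)

# 5. Accept/reject H0 against a significance threshold
alpha = 0.05
reject_h0 = p_value < alpha
print(t_stat, p_value, reject_h0)
```

With a large simulated difference between the groups the p-value is tiny and H0 is rejected; with overlapping groups the same code would retain it.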
Which test to apply?
 Selecting the right test depends on several
aspects of the data:
 Sample count (Low <30; High >30)
 Independence of observations (Paired,
Unpaired)
 Number of groups or datasets to be compared
 Data types (Numerical, categorical, etc)
 Assumed distributions
 Hypothesis type (one-tail, two-tail).
[GurevychI2011]
Which test to apply?
Independent variable | Dependent variable | Test | Statistic
1 population (N/A) | 1, continuous normal | One-sample t-test | Mean
2 independent populations (2 categories) | 1, normal | Two-sample t-test | Mean
2 independent populations (2 categories) | 1, non-normal | Mann-Whitney / Wilcoxon rank-sum test | Median
2 independent populations (2 categories) | 1, categorical | Chi-square test, Fisher’s exact test | Proportion
3 or more populations (categorical) | 1, normal | One-way ANOVA | Means
… | … | … | …
More complete tables can be found at:
•http://www.ats.ucla.edu/stat/mult_pkg/whatstat/choosestat.html
•http://bama.ua.edu/~jleeper/627/choosestat.html
•http://www.bmj.com/content/315/7104/364/T1.expansion.html
DATA MINING
Initial definitions
 In a conditional probability P(x|y), P(y) is
called the prior.
 The likelihood function is the probability of
the evidence given the parameters, i.e. the
model: p(x|θ).
 The posterior probability is the probability
of the parameters, i.e. the model, given the
evidence: p(θ|x).
Initial definitions
 Factors of variation: Aspects of the data that
can vary separately.
 i.e. the intrinsic dimensionality of the manifold
 Computational element or unit: A
mathematical function or block that can be
reused to express more complex
mathematical functions.
 Examples: basic logic gates (AND, OR, NOT),
artificial neurons, decision trees, etc
 Fan-in: Maximum number of inputs of a particular
element
Initial definitions
 System or computational model: A set of
interconnected computational elements, at times
represented by a graph.
 Size of a system: Number of elements in the system.
 An observation important in justifying deep
learning: reorganizing the way in which
computational units are composed or connected
can have a drastic effect on the efficiency of the
representation size [BengioY2009, pg 19].
 Types or classes of models:
 Generative models: Models for randomly generating
observable data P(X,Y). These include HMMs, GMMs,
restricted Boltzmann Machines, etc
 Discriminative or conditional models: Models for capturing
the dependence of an unobserved variable Y on an observed
variable X, P(Y|X). These include linear discriminant
analysis, SVM, linear regressors, ANN, ...
Posterior probability
 Using Bayes’ rule:
 p(θ|x) = [p(x|θ)p(θ)]/p(x)
 ...which can be “re-expressed” for easy
remembering as the directly proportional (∝)
relation:
 Posterior probability ∝ Likelihood ✕ Prior probability
 …or in other words, since the joint distribution
p(x,θ)=p(x|θ)p(θ), then
 Posterior probability ∝ Joint distribution
Posterior probability
 From the previous slide, two basic approximations for
estimating posterior probabilities follow [ResnikP2010]:
 The maximum likelihood estimation (MLE), which amounts to counting
and then normalizing so that the probabilities sum to 1:
 MLE produces the choice most likely to have generated the observed data.
 The maximum a posteriori (MAP) estimation:
 The MAP estimate is the choice that is most likely given the observed data.
 In contrast to MLE, MAP estimation applies Bayes’s rule, so that our
estimate can take into account prior knowledge about what we expect θ
to be, in the form of a prior probability distribution P(θ).
 Both MLE and MAP give us the best estimate according to their
respective definitions of “best”.
 Neither MLE nor MAP gives a whole distribution P(θ|x).
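A coin-flip toy makes the MLE/MAP contrast concrete. The data and the prior parameters below are my own choices; the Beta-prior MAP formula is the standard closed form for this conjugate setting:

```python
import numpy as np

# Observed coin flips: 7 heads out of 10
heads, n = 7, 10

# MLE: count and normalize
theta_mle = heads / n                     # 0.7

# MAP with a Beta(a, b) prior on θ (a=b=5 encodes a belief the coin is fair);
# for a Beta prior the MAP has the closed form (heads+a-1)/(n+a+b-2)
a, b = 5, 5
theta_map = (heads + a - 1) / (n + a + b - 2)
print(theta_mle, theta_map)  # MAP is pulled toward 0.5 by the prior
```

Neither number is a distribution over θ; a fully Bayesian treatment would keep the whole Beta posterior instead of a single point estimate, which is exactly the last bullet's point.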
Patterns
 Patterns are regularities in data.
[Wikipedia:Pattern_recognition]
 Patterns refers to models (regression or
classification) or components of models
(e.g. a linear term in a regression)
[FayyadU1996, pg 51]
[Fayyad et al (1996) AI Magazine Fall:37-54, >6500 citations!]
Data mining
 Data mining is:
 “the application of specific algorithms for
extracting patterns from data.” [FayyadU1996]
 “the computational process of discovering
patterns in large data sets”
[Wikipedia:Data_mining]
 the analysis step of the "Knowledge Discovery
in Databases" (KDD) process [FayyadU1996,
Wikipedia:Data_mining]
Data mining
Different names for the same thing?
• Data mining: discovering patterns in large data sets
[Wikipedia:Data_mining]
• Pattern recognition: recognition of regularities (patterns) in
data [Wikipedia:Pattern_recognition]; data-driven
classification [JainAK2000]; nearly synonymous with
machine learning [Wikipedia:Pattern_recognition]
• Machine learning: construction and study of algorithms that
can learn (the act of acquiring new knowledge) from data;
often overlaps with computational statistics
[Wikipedia:Machine_learning]
• Knowledge discovery: data-driven discovery of knowledge;
it adds processing (cleaning, selection) steps to data
mining [FayyadU1996]
Data mining
[Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!]
Data mining
[Fayyad et al (1996) AI Magazine Fall:37-54, >6500 citations!]
Data mining
 Classification is strongly related to regression
[FayyadU1996]:
 Regression is learning a function that maps a
data item to a real-valued prediction variable.
 Classification is learning a function that maps
(classifies) a data item into one of several
predefined classes.
 That’s regression with a threshold! [Felipe’s dixit]
Learning
 Goal
 The objective of learning in AI is giving
computers the ability to understand our world
in terms of inferring semantic concepts and
relationships among these concepts.
 Scope:
 Single task: Observations come from a
single task
 Multi-task: Observations come from several
tasks at once
07/07/2015
INAOE
77
Types of learning
 Supervised: Relies on known (labelled) examples, a.k.a. the training set, to find a discrete regressor
 Unsupervised: Finds regularities and structures (i.e. fits probability distributions) in observations
 Reinforced: Updates the currently learned model based on rewards assessing its outputs
 Semi-supervised: Starting from an initially learned supervised model, it evolves unsupervisedly by generating synthetic "rewards" proportional to the likelihood of the new observations.
 Active: A particular case of semi-supervised learning in which the new observations are chosen or selected from all arriving new observations according to a certain criterion.
 Transfer: A particular case of semi-supervised learning in which new observations come from a new domain or task.
07/07/2015
INAOE
78
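A minimal sketch of the supervised case above (data and labels are invented for illustration): a 1-nearest-neighbour classifier "learns" simply by storing the labelled training set, and predicts the label of the closest stored example.

```python
# Supervised learning in its simplest form: 1-nearest-neighbour.
# The labelled examples below ARE the model; prediction is a lookup
# of the nearest training point.

training_set = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
]

def predict(x):
    """Return the label of the training example closest to x."""
    def dist2(example):
        point, _label = example
        return (point[0] - x[0]) ** 2 + (point[1] - x[1]) ** 2
    return min(training_set, key=dist2)[1]

print(predict((0.9, 1.1)))  # closest to the "A" examples -> "A"
print(predict((5.2, 5.1)))  # closest to the "B" examples -> "B"
```

An unsupervised learner would instead receive only the points, without the "A"/"B" labels, and have to discover the two clusters itself.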
Basic problems in learning
 Modelling: It refers to encoding dependencies between variables under a given chosen form.
 In fact, modelling per se just refers to choosing this form, and in its most minimalistic case it does not require the model to be representative of the phenomenon, explicative, nor predictive! It may be just nuts, a silly model!
 Learning: It refers to optimizing the parameters of the model by minimizing a loss functional, i.e. a particular criterion, e.g. least squares error.
 Inference or reconstruction: It refers to estimating posterior probabilities of hidden variables given observed ones, P(h|x) or h=f(x)
07/07/2015
INAOE
79
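The "learning = optimizing parameters by minimizing a loss" step can be sketched directly (a toy example, not from the slides): gradient descent on the least-squares loss L(a,b) = Σ(a·x + b − y)² for a line fit.

```python
# Learning as loss minimization: gradient descent on the summed
# squared error of a line fit y = a*x + b.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]     # generated by y = 2x + 1, no noise

a, b = 0.0, 0.0               # initial model parameters
lr = 0.02                     # learning rate (small enough to converge)
for _ in range(5000):
    # gradients of the loss with respect to a and b
    ga = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys))
    gb = sum(2 * (a * x + b - y) for x, y in zip(xs, ys))
    a -= lr * ga
    b -= lr * gb

print(round(a, 3), round(b, 3))   # approaches a=2, b=1
```

The model form (a line) was chosen in the modelling step; learning only tunes a and b against the chosen criterion.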
Local vs non-local generalization
 Local generalization
 It refers to an underlying assumption made by
many learning algorithms; the output f(x1) is
similar to f(x2) iff x1 is similar to (i.e. close to/in
the neighbourhood of) x2.
 Non-local generalization
 Learning a function that behaves differently in
different regions of the data-space requires
different parameters for each of these regions.
07/07/2015
INAOE
80
Local generalization
 Local generalization is closely
related to manifold learning;
 Since a manifold is locally
Euclidean, it can be
approximated locally by linear
patches tangent to the manifold
surface.
 If it is smooth, then these
patches (i.e. the computational
units) will be reasonably large
and the number of patches
needed (i.e. the size of the
computational model) will be
small.
 However, if the manifold is highly curved (i.e. a complex, highly varying function) then the patches will have to be small, increasing the number of patches needed to characterise the manifold.
Figure reproduced from [BengioY2009, pg 25]
07/07/2015
INAOE
81
Local generalization
 Local generalization is related to the curse
of dimensionality.
 However, what matters for generalization is not the [extrinsic] dimensionality, but the number of variations of the function [i.e. intrinsic dimensionality] that we want to learn.
 Generalization is mostly achieved by a
form of local interpolation between
neighbouring training examples.
07/07/2015
INAOE
82
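The "local interpolation between neighbouring training examples" mentioned above can be sketched as follows (a toy example; the training points sample x²):

```python
# Generalization by local interpolation: predict f(x) by linearly
# interpolating between the two neighbouring training points, so
# nearby inputs receive nearby outputs (the local-generalization prior).

train = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]  # samples of x^2

def predict(x):
    pts = sorted(train)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)        # position within the patch
            return y0 + t * (y1 - y0)       # linear patch between neighbours
    raise ValueError("x outside the training range: no local neighbours")

print(predict(1.5))  # interpolates between (1,1) and (2,4) -> 2.5
```

Note the failure mode the slides warn about: away from the training points (outside the sampled range, or where the true function curves sharply between samples) the local patches say nothing reliable.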
Data mining
[Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!]
© 2015. Dr. Felipe Orihuela-Espina
83
Data mining
[Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!]
© 2015. Dr. Felipe Orihuela-Espina
84
Data mining
[Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!]
© 2015. Dr. Felipe Orihuela-Espina
85
Data mining
[Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!]
© 2015. Dr. Felipe Orihuela-Espina
86
Data mining
[Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!]
© 2015. Dr. Felipe Orihuela-Espina
87
Data mining
[Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!]
88
© 2015. Dr. Felipe Orihuela-Espina
Data mining
[Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!]
© 2015. Dr. Felipe Orihuela-Espina
89
Data mining
 Clustering
[Fayyad et al (1996) AI magazine Fall:37-54, >6500 citations!]
© 2015. Dr. Felipe Orihuela-Espina
90
Data mining
[Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!]
© 2015. Dr. Felipe Orihuela-Espina
91
Data mining
[Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!]
© 2015. Dr. Felipe Orihuela-Espina
92
Optimizing model selection
[EscalanteHJ2009]
 Xi,pre -> Combination of preprocessing methods
 Y1…Npre -> Hyperparameters for preprocessing
 Xi,fs -> Feature selection method
 Y1…Nfs -> Hyperparameters for feature selection
 Xi,class -> Classifier method
 Y1…Nclass -> Hyperparameters for classification
 Assumes LTI
07/07/2015
INAOE
93
Data mining
 “Overfitting: When the algorithm searches
for the best parameters for one particular
model using a limited set of data, it can
model not only the general patterns in the
data but also any noise specific to the data
set, resulting in poor performance of the
model on test data. Possible solutions
include cross-validation, regularization,
and other sophisticated statistical
strategies.” [FayyadU1996]
© 2015. Dr. Felipe Orihuela-Espina
94
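One of the anti-overfitting strategies in the quote, cross-validation, rests on a simple mechanism that can be sketched directly (an illustrative helper, not from the slides): split the data into k folds so that every example is used for testing exactly once, and performance is always estimated on data the model was not fitted on.

```python
# k-fold cross-validation index generator: each example lands in
# exactly one test fold; the remaining examples form the training set.

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 5):
    print(test)   # each index appears in exactly one test fold
```

A model that has merely memorised the noise in its training fold will score poorly on the held-out fold, which is precisely the signal cross-validation provides.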
Deep learning
 Data representation refers simply to the
chosen feature space, i.e. the feature vector
[BengioY2013].
 The construction or learning of this feature space
goes under the name of feature engineering and
includes more rudimentary subproblems such as
feature selection and extraction e.g. processing
and transformations.
 A good representation is one that
disentangles the underlying factors of
variation [BengioY2013].
07/07/2015
INAOE
95
Deep learning
 Much of the actual effort in deploying machine
learning algorithms goes into feature engineering.
 Representation learning a.k.a. deep learning, is
about learning a representation of the data i.e.
feature space, that makes it easier to extract
useful information when building predictors (e.g.
classifiers, regressors, etc).
 As soon as there is a notion of representation, one
can think of a manifold [BengioY2013].
07/07/2015
INAOE
96
Deep learning
 Expressive representations: It refers to the ability of capturing a huge number of input configurations with a reasonably sized representation. In other words, having few features suffices to cover most of the data space.
 That's good old content validity meets computational spatial efficiency (Felipe's dixit)
 Traditional algorithms require O(N) parameters (and/or O(N) training examples) to distinguish O(N) input regions.
 Linear features, e.g. those learnt by PCA, cannot be stacked to form deeper, more abstract representations since the composition of linear operations yields another linear operation.
 However, it is still possible to use the linear features in deep learning; e.g. inserting a non-linearity between learned single-layer linear projections.
07/07/2015
INAOE
97
Deep learning
 Distributed representations: It
refers to having more than one
computational units charting a
certain region of the data space at
the same time. Distributed
representations are often
(always?) expressive.
 Example: Imagine one binary
classifier over certain space. It
partitions the space into 2
subregions. But having 3 classifiers
over that certain space can partition
the space into exponentially more
regions.
 Distributed representations can
alleviate the curse of the
dimensionality and the limitations
of local generalization.
Figure reproduced from [BengioY2009, pg 27]
07/07/2015
INAOE
98
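The 3-classifier example above can be verified numerically (the three separating lines are arbitrary choices for illustration): each point is described by the sign pattern of the three linear units, and counting distinct patterns over a grid shows that three units jointly carve the plane into seven regions, while any single unit yields only two.

```python
# Expressiveness of a distributed representation: count the distinct
# regions (sign patterns) induced jointly by three linear classifiers.

classifiers = [
    lambda x, y: x > 0,        # unit 1
    lambda x, y: y > 0,        # unit 2
    lambda x, y: x + y > 1,    # unit 3
]

patterns = set()
steps = [i * 0.25 for i in range(-8, 9)]   # grid over [-2, 2]^2
for x in steps:
    for y in steps:
        patterns.add(tuple(c(x, y) for c in classifiers))

# 3 lines in the plane create up to 7 regions; a single unit gives 2.
print(len(patterns))   # -> 7
```

Seven regions from three units, against two from one unit: the units share the work of charting the space, which is the "distributed" part, and the number of distinguishable regions grows much faster than the number of parameters.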
Deep learning
 Overcomplete representations: It refers to
having more (hidden) computational units i.e.
degrees of freedom, than training examples.
 Often lead to overfitting endangering
generalization.
 May still be useful for denoising [Felipe's inferred from
BengioY2009, pg 46]
 However; “importantly, DBMs, (in the case of
MNIST - despite having million of parameters and
only 60k training samples), do not appear to
suffer much from overfitting”
[SalakhutdinovR2009, pg453]
 ...hmmm, not sure about this; Salakhutdinov says so,
but he does not provide any evidence that this is the
case.
07/07/2015
INAOE
99
Deep learning
 Invariant representations: It refers to having computational
units which by having learn abstract concepts, they achieve
outputs which are invariant to local changes of the input. This
often need highly non-linear transfer functions.
 Invariance and abstraction goes hand in hand.
 Having invariant features is a long standing goal in pattern
recognition.
 Achieving invariance i.e. reducing sensitivity along a certain
direction of the data, does not guarantee to have disentangle a
certain factor of variance in the data. Although, invariance is
often good, the ultimate goal is not to achieve invariance, but to
disentangle explanatory factors [BengioY2013], that's manifold
embedding!. Therefore; the goal of building invariant features
should be removing sensitivity to directions of variance that are
uninformative to the task.
 Building invariant representation often involves two steps;
 Low level features are selected to account for the data
 Higher level features are extracted from low level features
07/07/2015
INAOE
100
Deep learning
 Deep architectures are model
architectures composed of multiple levels
of non-linear operations or computational
elements.
 The number of levels i.e. the longest path
from an input node to an output node, is
referred to as depth of the architecture.
07/07/2015
INAOE
101
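The depth definition above (longest input-to-output path) can be computed mechanically; here is a sketch on a small made-up computation graph:

```python
# Depth of an architecture = longest path from an input node to the
# output node, computed by relaxation over a small DAG (toy example).

# edges: node -> list of successor nodes
graph = {
    "in1": ["h1"], "in2": ["h1", "h2"],
    "h1": ["h2", "out"], "h2": ["out"],
    "out": [],
}

def depth(graph, inputs, output):
    longest = {n: 0 for n in inputs}   # longest known path from any input
    changed = True
    while changed:                      # relax until fixed point (DAGs only)
        changed = False
        for node, succs in graph.items():
            if node not in longest:
                continue
            for s in succs:
                d = longest[node] + 1
                if d > longest.get(s, -1):
                    longest[s] = d
                    changed = True
    return longest[output]

print(depth(graph, ["in1", "in2"], "out"))   # -> 3
```

By the slide's convention this toy graph (depth 3) would still count as a shallow architecture.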
Deep learning
 An architecture may be:
 Shallow architecture; often up to 3 levels of depth
 Deep architectures: More than 3 levels
 Example: Brain anatomy; 5-10 levels in the visual
system [SerreT2007]
 Funnily enough, examples and systems used in scientific papers devoted to deep learning hardly go beyond 3 levels, e.g. [SalakhutdinovR2013_TPAMI]. So not that deep!
07/07/2015
INAOE
102
Deep learning
 Pros and cons in a nutshell
Pros
• Relaxes the need for feature engineering
• Modelling becomes truly data-driven
• Bigger compartmentalization of the search space achieved (with a fixed number of hidden variables)
Cons
• Higher complexity of the model
• Larger number of parameters
• "Direct" training becomes intractable
07/07/2015
INAOE
103
Deep learning
 Deep Boltzmann Machines (DBM): A variant of Boltzmann machines that, instead of having one single layer of hidden variables (in contrast to the RBM), has multiple layers of hidden variables, with units in odd-numbered layers being conditionally independent given even-numbered layers and vice versa.
Figure: Deep Boltzmann Machine with 3 layers. Figure reproduced from [SalakhutdinovR2013_TPAMI]
07/07/2015
INAOE
104
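The layered structure behind that conditional-independence property can be sketched with a toy connectivity check (layer sizes are arbitrary): in a DBM, edges only connect adjacent layers, so no two units in odd-numbered layers are ever directly connected.

```python
# Toy DBM connectivity: build edges between adjacent layers only, then
# verify that no edge joins two units that both sit in odd layers --
# the structural basis of the conditional independence on the slide.

layers = [3, 4, 4, 2]   # visible layer + 3 hidden layers (arbitrary sizes)

ids = []                # unit ids grouped per layer
unit = 0
for size in layers:
    ids.append(list(range(unit, unit + size)))
    unit += size

edges = set()           # layer-wise fully connected, adjacent layers only
for a, b in zip(ids, ids[1:]):
    for u in a:
        for v in b:
            edges.add((u, v))

odd_units = {u for i, layer in enumerate(ids) if i % 2 == 1 for u in layer}
no_odd_odd = not any(u in odd_units and v in odd_units for u, v in edges)
print(no_odd_odd)   # -> True: odd layers have no edges among themselves
```

Since every edge straddles an even and an odd layer, clamping the even layers leaves the odd units with no remaining interactions, and vice versa, which is what makes block Gibbs sampling over the two groups tractable.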
Questions that I'm unable to answer at the
moment
 Overfitting.
 Clearly deep models are prone to overfitting
considering they use overcomplete
representations.
 …it’s not me, but Bengio who warns about this!
 From its particular example with MNIST images,
[SalakhutdinovR2009, pg453] claims this does
not seem to be the case.
 However, he says so but fails to provide any evidence
that this is the case.
 I'm still unconvinced that, in general, deep
learning models do not simply overfit data.
07/07/2015
INAOE
105
Deep learning
 To know more:
 [BengioY2009] Bengio, Y. (2009) "Learning deep architectures for AI" Foundations and Trends in Machine Learning, 2(1):1-127
 [BengioY2013] Bengio, Y.; Courville, A.; Vincent, P. (2013) "Representation learning: a review and new perspectives" IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828
 [DavisRA2001] Davis, R.A. (2001) "Gaussian Processes" Encyclopedia of Environmetrics, Section on Stochastic Modeling and Environmental Change (D. Brillinger, Ed.), Wiley, New York
 [HintonGE2006] Hinton, G.E.; Osindero, S.; Teh, Y.-W. (2006) "A Fast Learning Algorithm for Deep Belief Nets" Neural Computation 18:1527-1554
 [LeCunY2006] LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.A.; Huang, F.J. (2006) "A tutorial on energy-based learning" in Bakir, G.; Hofman, T.; Schölkopf, B.; Smola, A.; Taskar, B. (Eds), Predicting Structured Data, MIT Press
 [ResnikP2010] Resnik, P.; Hardisty, E. (2010) "Gibbs sampling for the uninitiated" Technical Report CS-TR-4956, Institute for Advanced Computer Studies, University of Maryland, 23 pp.
 [SalakhutdinovR2008_ICML] Salakhutdinov, R.; Murray, I. (2008) "On the Quantitative Analysis of Deep Belief Networks" 25th International Conference on Machine Learning (ICML), Helsinki, Finland
 [SalakhutdinovR2009_AISTATS] Salakhutdinov, R.; Hinton, G. (2009) "Deep Boltzmann Machines" 12th International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, Florida, USA, pgs. 448-455
 [SalakhutdinovR2013_TPAMI] Salakhutdinov, R.; Tenenbaum, J.B.; Torralba, A. (2013) "Learning with hierarchical-deep models" IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1958-1971
 [SerreT2007] Serre, T.; Kreiman, G.; Kouh, M.; Cadieu, C.; Knoblich, U.; Poggio, T. (2007) "A quantitative theory of immediate visual recognition" Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165:33-56
 [TehYW2010] Teh, Y.W. (2010) "Dirichlet process" Encyclopedia of Machine Learning, Springer
07/07/2015
INAOE
106
KNOWLEDGE
REPRESENTATION
© 2015. Dr. Felipe Orihuela-Espina
107
Knowledge representation
 “Knowledge representation includes
ontologies, new concepts for representing,
storing, and accessing knowledge. Also
included are schemes for representing
knowledge and allowing the use of prior
human knowledge about the underlying
process by the knowledge discovery
system.” [FayyadU1996]
© 2015. Dr. Felipe Orihuela-Espina
108
Automating Science?
 “Computers with intelligence can design and
run experiments, but learning from the results
to generate subsequent experiments requires
even more intelligence.” [WaltzD2009]
 Goals of automation in science [WaltzD2009]:
 increase productivity by increasing efficiency
(e.g., with rapid throughput)
 improve quality (e.g., by reducing error)
 cope with scale
07/07/2015
INAOE
109
Knowledge generation can be streamlined:
e.g. Robot scientist
Robot scientist ADAM and researcher Prof. King
 LABORS (Laboratory Ontology for Robot Scientists) ontology [KingRD2011]
 Formalizes Adam’s functional genomics experiments
 Based on EXPO (Ontology of scientific experiments)
 Closing the loop: ADAM can decide on what experiment to do next [WaltzD2009]
 Limited to hypothesis-led discovery [KingRD2009]
07/07/2015
© 2015. Dr. Felipe Orihuela-Espina
111
Knowledge generation can be streamlined:
EXPO
 EXPO: Ontology of scientific experiments
 Defines over 200 concepts for creating
semantic markup about scientific experiments
 OWL language
 EXPO formalises generic knowledge about scientific experimental design, methodology, and results representation.
 [SoldatovaLN2006]
 EXPO is available at
http://expo.sourceforge.net/
07/07/2015
© 2015. Dr. Felipe Orihuela-Espina
112
An overview of EXPO
07/07/2015
[KingRD2006 presentation on EXPO]
© 2015. Dr. Felipe Orihuela-Espina
113
Knowledge generation
 To arrive at knowledge from experimentation, 3 steps are taken:
 Data harvesting: Involving all observational and interventional experimentation tasks to acquire data
 Data acquisition: experimental design, evaluation metrics, capturing raw data
 Data reconstruction: Translates raw data into domain data.
 Inverts the data formation process.
 E.g.: If you captured your data with a certain sensor and the sensor throws electric voltages as output, then reconstruction involves converting those voltages into a meaningful domain variable.
 E.g.: Image reconstruction
 Data analysis: From domain data to domain knowledge
 When big data is involved, it is often referred to as Knowledge discovery
07/07/2015
© 2015. Dr. Felipe Orihuela-Espina
114
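The voltage-to-domain-variable example of data reconstruction can be sketched as follows. The linear calibration, its gain, and its offset are entirely hypothetical placeholders; real devices publish their own calibration curves.

```python
# Hypothetical data reconstruction: inverting the data formation
# process by mapping raw sensor voltages to a domain variable through
# an invented linear calibration (gain and offset are illustrative).

GAIN = 2.5      # domain units per volt (hypothetical)
OFFSET = -0.5   # domain units at 0 V (hypothetical)

def reconstruct(voltages):
    """Convert raw voltages into the meaningful domain variable."""
    return [GAIN * v + OFFSET for v in voltages]

raw = [0.2, 0.4, 0.6]      # raw voltages as thrown by the sensor
print(reconstruct(raw))    # domain-space values
```

In imaging the same step is far richer (image reconstruction inverts a full forward model, e.g. light transport in tissue for fNIRS), but the principle is identical: undo the instrument's data formation process.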
Knowledge discovery
Figure from [Fayyad et al, 1996]
© 2015. Dr. Felipe Orihuela-Espina
115
Data interpretation
 Research findings generated depend on the philosophical approach used [LopezKA2004]
 Assumptions drive methodological decisions
 Different (philosophical) approaches for data interpretation [PriestH2001, part 1; LopezKA2004; but basically philosophy in general]
 Interpretive (or hermeneutic) phenomenology:
 Systematic reflection/exploration on the phenomena as a means to grasp the absolute, logical, ontological and metaphysical spirit behind the phenomena
 Affected by the researcher’s bias
 Kind of your classical hypothesis-driven interpretation approach [Felipe’s dixit]
 Descriptive (or eidetic) phenomenology:
 Favours data-driven over hypothesis-driven research [Felipe’s dixit based upon the following]
 “the researcher must actively strip his or her consciousness of all prior expert knowledge as well as personal biases (Natanson, 1973). To this end, some researchers advocate that the descriptive phenomenologist not conduct a detailed literature review prior to initiating the study and not have specific research questions other than the desire to describe the lived experience of the participants in relation to the topic of study” [LopezKA2004]
 Important note: I do NOT understand these very well, so do not ask me! READ.
07/07/2015
© 2015. Dr. Felipe Orihuela-Espina
116
Data interpretation
 Different (philosophical) approaches for data interpretation [PriestH2001, part 1; LopezKA2004; but basically philosophy in general] (Cont.)
 Grounded theory analysis
 Generates theory through inductive examination of data
 Systematization to break down data, conceptualise it and re-arrange it
in new ways
 Content analysis
 Facilitates the production of core constructs formulated from contextual
settings from which data were derived
 Emphasizes reproducibility (enabling others to establish similar results)
 Interpretation (analysis) becomes continual checking and questioning
 Narrative analysis
 Qualitative
 Results (often from interviews) are revisited iteratively detracting words
or phrases until core points are extracted.
 Important note: I do NOT understand these very well, so do not ask me! READ.
07/07/2015
© 2015. Dr. Felipe Orihuela-Espina
117
Why KR models for biomedical engineering?
 GOAL:
 Formalizing concepts and relations common in
biomedical imaging
 Affording more time for interpretation
 Advantages:
 favours automated data processing, automated
knowledge and data integration and semantic
integration [HoehndorfR2012]
 Formalizing experimental knowledge means that such knowledge can be more easily reused to answer other scientific questions [KingRD2009]
 Ensure reproducibility and quality results
[OrihuelaEspinaF2010]
 Leaves interpretation to humans!
07/07/2015
© 2015. Dr. Felipe Orihuela-Espina
118
AN EXAMPLE OF KR WITH
FNIRS
© 2015. Dr. Felipe Orihuela-Espina
119
Challenges in KR in fNIRS experimentation
 How to choose?
 The region to interrogate?
 The best (most fair) analysis?
[OrihuelaEspina2010_OHBM]
 Including processing, parameterization, and analysis flow
 How to avoid:
 Physiological noise /systemic effect?
 Artefacts (e.g optode movement, ambient light)?
 How to ensure:
 Physiological plausibility?
 Integrity / validity? [OrihuelaEspina2010_PMB]
 reuse of formalized experiment information? [KingRD2009]
07/07/2015
INAOE
120
Challenges in KR in fNIRS experimentation:
Parameterization
[OrihuelaEspinaF2010_OHBM]
07/07/2015
INAOE
121
Challenges in KR in fNIRS experimentation:
Modelling
light
tissue
light
Light
model
Chromophore
concentration
Neurovascular
coupling
Physiological
model
Physiological
information
[Inspired from Banaji, fNIRS Conference, 2012]
07/07/2015
INAOE
122
Challenges in KR in fNIRS experimentation:
Modelling
 Is the data validated?
 Do we really need a physiological model?
 A model is useful only if it fulfils very high standards of predictive capability and reliability
 We learn about the phenomenon while building the model (vicious circle)
 Purposes of models:
 Explain data / highlight gaps in understanding
 Raising open questions
 Predict hard-to-measure quantities
 Develop understanding and intuition
 Prepare us for experimental data
 Challenge dogmas
 May force us to ignore priors!
[Banaji, fNIRS Conference, 2012; Banaji, JTB, 2006, Banaji, PLoS CB, 2008]
07/07/2015
INAOE
123
Challenges in KR in fNIRS experimentation:
Modelling
 What are the principles that we should
follow to build our model?
 How is the model going to interact with the
data?
Example of interaction 1: Model -> Simulated data, compared against the Observed data from the Subject / Cohort
Example of interaction 2: Model -> Modelled data, compared against the Observed data from the Subject / Cohort
[Banaji, fNIRS Conference, 2012]
07/07/2015
INAOE
124
Challenges in KR in fNIRS experimentation
 Closing the loop:
 from experiment design and data collection to hypothesis
formation and revision, and from there to new experiments
[WaltzD2009]
 Complex experiments having different sources
 Different NIRS devices (HITACHI, SHIMADZU, NIRx), but also different sources: eye-tracking, EEG, etc.
 Accommodating different optical modalities
 Lack of a standard “final” representation format
 Medical standard DICOM; not as standard as claimed
 Each provider has its own file format.
 SNIRF: Shared Near Infrared File Format Specification
07/07/2015
INAOE
125
Challenges in KR in fNIRS experimentation
 Problem size
 Information representation (relational, object
oriented)
 Sample size (Extrapolation, generalization, regularization, ill-posed problems, i.e. number of observations vs number of covariates)
 Data mining and KD strategy [FayyadU1996]
 Model identification and parameterization
 Underparameterization: low flexibility to explain complex data
 Overparameterization: A spurious model can explain any data. Difficulties in parameter identification
 Level of detail
 Model boundaries, parameters, variables, purpose
07/07/2015
INAOE
126
Concept map: experimentation
Light
model
Physiological
model
07/07/2015
INAOE
127
Data analysis: more than just thinking your statistical test…
Brain map of data analysis (NOT INTENDED TO BE COMPREHENSIVE!)
Figure source: [OrihuelaEspinaF2012, Workshop on Foundations of Biomedical Knowledge Representation]
• Purpose
 Past: make sense of bygone situations or explain an occurring phenomenon, establishing associational or causal relations
 Present: decision making
 Future: infer outcomes, reasoning, prediction, planning, optimization
• Approach: hypothesis driven vs data driven; quantitative vs qualitative
• Stages: Processing, Analysis, Understanding
• Analysis considerations
 Causality (zero level, one level, two level)
 Incorporation of domain knowledge (priors)
 Algorithm: complexity (order), strategy (e.g. greedy), serial/parallel, exact real-number computation
 Problem complexity (NP-complete, P-hard…)
 Problem size (information representation, regularization)
 Data relations and behaviour
 Validation theory: type (construct, face, convergent, ecological, external, internal, etc.), technique (leave-one-out, cross-fold, gold standard, ground truth)
 Dimensionality (intrinsic vs explicit)
 Learning (supervised, unsupervised, reinforcement)
 Comparison (metric and performance definition)
 Data quality and SNR
• Data collection
 Direct (intervention) vs indirect (sensing)
 Sampling
 Interviewing, behavioural simulation, observational
 Synthetic, experimental, database
• Data characteristics
 Positive vs negative/complement
 Type: discrete, continuous, categorical/nominal, ordinal/ranked
 Digital vs analogue
 Nature of data: time vs space
 Deterministic vs stochastic
 Observable vs non-observable
 One-way, two-way, N-way
 Fundamental vs derived
07/07/2015
INAOE
130
[OrihuelaEspinaF2010, PMB]
Taxonomy of factors in fNIRS experimentation
07/07/2015
INAOE
131
Experimental factors limit interpretation
07/07/2015
INAOE
132
INTERPRETATION
GUIDELINES
© 2015. Dr. Felipe Orihuela-Espina
133
Interpretation guidelines
 Understanding is by far the hardest part of
data analysis.
 …and alas it is also the part where
maths/stats/computing are less helpful.
 Look at your data! Know them by heart.
Visualize them in as many possible ways as
you can imagine and then a few more.
 Have a huge background. Read everything
out there closely and loosely related to your
topic.
134
Interpretation guidelines
 Always try more than one analysis
(convergent validity).
 Quantitative analysis is often desirable, but
never underestimate the power of good
qualitative analysis.
 All scales are necessary and complementary;
 Structural, functional, effective
 Inter-subject, intra-subject
 Neuron-level, region-level
135
Interpretation guidelines
 Every analysis must translate the physiological,
biological, experimental, etc concepts to a correct
mathematical abstraction. Every interpretation must
translate the “maths” to real world domain concepts.
 Again: Interpretation of results must be confined to the
limits imposed by the assumptions made during the
image reconstruction
 Rule of thumb: Data analysis takes at least 3 to 5
times data collection time. If it has taken less, then
your analysis is likely to be weak, coarse or careless.
 Example: One month collecting data – 5 months worth of
analysis.
136
Interpretation guidelines
 The laws of physics are what they are…
 …but research/experimentation results are
not immutable.
 They strongly depend on the decisions made
during the data harvesting, data reconstruction
and the three stages of the analysis process.
 It is the duty of the researcher to make the
best decision to arrive at the most robust
outcome.
 Interpretation, interpretation, interpretation…
LOOK at your data!
07/07/2015
INAOE
137
Final remarks
 Inferential statistics (SPM) are (currently) by
far the most popular approach.
 …perhaps due to their utter simplicity and mathematical elegance together with their flexibility to accommodate virtually every experimental design
 …yet they are not the only option, and sometimes not the best for a given goal (e.g. graph theory’s superb competence for connectivity analysis)
 Analytical modelling (when correct) will always be the safe bet, but complexity often prevents accurate modelling
138
THANKS, QUESTIONS?
© 2015. Dr. Felipe Orihuela-Espina
139