
Finding outliers in multivariate
data with measurement errors
Somak Raychaudhury
School of Physics and Astronomy
Jianyong Sun and Ata Kabán
School of Computer Science
University of Birmingham, UK
Publications
# Robust Mixtures in the Presence of Measurement
Errors, Jianyong Sun, Ata Kabán and Somak Raychaudhury,
International Conference on Machine Learning (ICML07),
Corvallis, OR, June 2007
(astro-ph/0709.0928)
# Robust Visual Mining of Data with Error Information,
Jianyong Sun, Ata Kabán and Somak Raychaudhury,
European Conference on Machine Learning/Practice of
Knowledge Discovery in Databases (ECML/PKDD07),
Warsaw, Sep 2007
Designer Algorithms® for Astronomy
• Started 2004
• Algorithm development involving machine learning (ANN, SVM, Gaussian processes), latent variables (e.g. GTM), Bayesian methods (e.g. ICA) and genetic algorithms (model finding and fitting)
• Emphasis on algorithm development, not applying existing software
• 1 PhD thesis complete: kernel regression for time delays in gravitational lenses; four other PhD students involved
• 12 refereed publications, including ICML, ECML/PKDD, SDM, MNRAS, A&A
Physics and Astronomy: Somak Raychaudhury, Trevor Ponman, Ian Stevens, Bill Chaplin, Louisa Nolan
Computer Science: Xin Yao, Peter Tino, Ata Kaban, Jianyong Sun, Ela Claridge, Richard Dearden
www.sr.bham.ac.uk/algorithms
Parameter Space of Observables
[Diagram: axes of the observable parameter space include flux (1), flux (2), time, energy/spectrum, shape parameters, RA and Dec]
Along each axis the measurements are characterized by the position, extent, sampling and resolution. All astronomical measurements span some volume in this parameter space.
An example: SDSS quasar catalog
§ Integers, real numbers, characters
§ Continuous and discrete variables
§ Logarithmic or linear variables
§ Upper and lower limits
§ Infinite range, or bounded above or below
§ Missing data
§ Errors!
Courtesy: http://astrostatistics.psu.edu/ and Schneider et al 2005
An Example: Discoveries of High-Redshift Quasars and Type-2 Quasars in DPOSS
[Figure panels: High-z QSO, Type-2 QSO. Courtesy: Djorgovski et al.]
A Galilean Dialogue
Astronomer: I have a list of objects with some measured parameters, and I'd like to discover unusual objects among them.
Computer Scientist: All right, that's a straightforward outlier detection problem.
Astronomer: OK, here's a list of quasars with four colours from SDSS.
(Three weeks later) Computer Scientist: Here is the list of top outlying objects.
Astronomer: Ah, of course, many of these have large measurement errors. They might not be genuine outliers.
Computer Scientist: Error? What's an error?
Astronomer: Erm, they are the uncertainties on the measurements.
Computer Scientist: You mean these are mistakes? Surely you could have been more careful!
Astronomer: Oh, no, no… The errors depend on instrumental limitations and observational conditions, and are unavoidable with physical measurements.
Computer Scientist: Um… OK, then give us your measurement errors.
Finding outliers
Robust density modelling aims at capturing the
structure of typical observations while dealing with
outliers
§ Required to avoid biases in parameter estimates (here outliers need to be thrown out)
§ Sometimes peculiar objects are of interest, for identifying
candidates of possibly new kinds of objects (e.g. from
archives of multi-wavelength astronomical images) that
deserve follow-up study (e.g. using spectroscopy).
§ Bottleneck: the likely overabundance of interesting objects found; we need to limit the number of likely candidates
Goal of the exercise
• Given measurements from N objects, over d features,
together with their associated measurement errors
• Find the peculiar objects whose peculiarity is not due to
measurement errors (‘genuine’, potentially interesting
outliers),
• along with a model of the density of non-outliers.
• An outlier-robust mixture model for data with
known measurement errors
• Solve by a structured-variational EM algorithm
• Find the outliers
– Controlled experiments & Real data
-- Sun, Kabán and Raychaudhury (astro-ph/0709.0928)
Robust density modelling
Peel & McLachlan: Robust mixture modeling using the t distribution, Statistics & Computing, 2000.
It assumes that the data are points. Instead, real scientific data also contain error estimates.
[Figure: two-dimensional scatter plot of the data]
Can knowledge of the errors help us infer the 'genuine' outliers?
[Figures: two-dimensional scatter plots of the data]
Models with Latent (‘Hidden’) Variables
• In many applications there are 2 sets of variables:
– Variables whose values we can directly measure
– Variables that are “hidden”, cannot be measured
• Examples:
– Speech recognition:
• Observed: acoustic voice signal
• Hidden: label of the word spoken
– Face tracking in images
• Observed: pixel intensities
• Hidden: position of the face in the image
– Text modelling
• Observed: counts of words in a document
• Hidden: topics that the document is about
Slide adapted from P Smyth: KDD tutorial
Some existing latent variable models
A wide class of latent variable models has the following form:
y = f(s, θ) + n
where y is the observed data, s the latent variables, θ the parameters and n the noise.
Specifications required: 1) p(s)  2) f(·)
Linear models with Gaussian latent prior: FA, PPCA, PCA
Linear models with discrete latent prior: FMM
Linear models with non-Gaussian latent prior: IFA, ICA
Linear models with latent prior over positive domain & positive parameters: NMF
Non-linear models with uniform latent prior: GTM, LTM
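To make the generic form y = f(s, θ) + n concrete, here is a minimal NumPy sketch of the simplest case, a linear model with a Gaussian latent prior (the FA/PPCA family); all sizes and parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes for the illustration: N objects, d observed features, q latent variables
N, d, q = 1000, 4, 2

# Parameters theta of the linear map f(s, theta) = W s + mu
W = rng.normal(size=(d, q))     # factor loadings
mu = rng.normal(size=d)         # mean offset
sigma = 0.1                     # observation noise level

# 1) latent prior p(s): standard Gaussian (the FA/PPCA choice)
s = rng.normal(size=(N, q))

# 2) f(.): linear map, plus Gaussian noise n
y = s @ W.T + mu + sigma * rng.normal(size=(N, d))
```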
What are all these acronyms?
• FA = Factor Analysis
• PCA = Principal Component Analysis
• PPCA = Probabilistic Principal Component Analysis
• FMM = Finite Mixture Models
• IFA = Independent Factor Analysis
• ICA = Independent Component Analysis (noise-free IFA)
• NMF = Non-negative Matrix Factorisation
• GTM = Generative Topographic Mapping
• LTM = Latent Trait Model
See astro-ph/0709.0928
The Student t distribution
“For many applied problems, the tails of the normal
distribution are shorter than required”
The t distributions were discovered by William Gosset in 1908. He wrote under the pseudonym 'Student'.
S(t) = \frac{\Gamma((k+1)/2)}{\sqrt{k\pi}\,\Gamma(k/2)} \left(1 + t^2/k\right)^{-(k+1)/2}
Determining the number of components
• We used a lower bound to the Minimum Message Length in conjunction with our likelihood bound. Other methods are possible.
• The MML criterion is to maximise: [equation not reproduced; see astro-ph/0709.0928]
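The explicit MML bound is given in the paper (astro-ph/0709.0928) and is not reproduced here. Purely as an illustration of choosing the number of components with a penalised-likelihood score, here is a sketch that uses the BIC of a standard Gaussian mixture from scikit-learn as a stand-in criterion, not the authors' MML bound:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_n_components(X, k_max=10):
    """Fit mixtures with 1..k_max components and return the k with the lowest BIC."""
    bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in range(1, k_max + 1)]
    return int(np.argmin(bics)) + 1
```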
Learning with Latent Variables
• Guess some initial parameters θ⁰
• E-step (Inference)
  – For each case, and each unknown variable, compute p(S | known data, θ⁰)
• M-step (Optimization)
  – Maximize L(θ) using p(S | …)
  – This yields new parameter estimates θ¹
• This is the EM algorithm:
  – Guaranteed to converge to a (local) maximum of L(θ)
  – Dempster, Laird and Rubin, 1977
SDSS quasars: can we find the ones at z > 2?
• 10,000 quasars
extracted from the
SDSS (DR4) quasar
catalogue
• Five optical filters (u, g, r, i, z). From these, we construct four colour features: u-r, g-r, i-r, r-z (see the sketch below).
• Spectroscopic
redshifts used to
validate the results.
Fan et al 2000
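Building the four colour features from the five magnitudes is straightforward; a sketch assuming the catalogue magnitudes have been loaded into a dictionary of arrays keyed 'u', 'g', 'r', 'i', 'z' (the container and key names are assumptions, not the SDSS schema):

```python
import numpy as np

def colour_features(mags):
    """Return the N x 4 matrix of (u-r, g-r, i-r, r-z) colours from magnitude arrays."""
    u, g, r, i, z = (np.asarray(mags[band]) for band in "ugriz")
    return np.column_stack([u - r, g - r, i - r, r - z])
```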
Bottom line: z > 2.5 quasars as outliers
AUC ("outlierness") vs. different redshift thresholds. A large fraction of quasars at redshift 2.5 or higher is detected with high probability (a sketch of the AUC computation follows below).
This is not possible with 2D projections.
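One way to produce such an AUC-versus-threshold curve is sketched below, using scikit-learn; outlier_score (larger meaning more outlying) and the spectroscopic redshift array are assumed inputs, not values from the talk:

```python
from sklearn.metrics import roc_auc_score

def auc_vs_redshift(outlier_score, redshift, thresholds=(2.0, 2.5, 3.0, 3.5)):
    """AUC of the outlier score for recovering quasars above each redshift threshold."""
    return {z0: roc_auc_score(redshift > z0, outlier_score) for z0 in thresholds}
```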
Conclusions
• Knowledge of errors is useful for making the
hunt for the ‘interesting’ outliers more directed
• Keeping some of the dependencies in the
variational posterior density can increase
accuracy
• The method can efficiently find high redshift
quasars as outliers from the SDSS photometric
quasar catalogue
• We are extending these findings to combined
statistical and visual analysis of data with
outliers and error information.
Implications of the choice for prior p(s) – Types of models
• Dense & distributed (compression)
• Prototype-based (clustering)
• Sparse & distributed (compression, clustering, structure discovery)
Lee & Seung, Nature 401, 788 (1999)
Peel & McLachlan, Statistics & Computing, 2000.
Mixture of t-distributions (Robust mixture)
• Two sets of latent variables:
  – A discrete class-variable z
  – A Gamma variable u
• Maximum likelihood estimation via EM
• Inferred 'peculiarity' = E[u | t_n], which is now obtained via marginalising over the class-variable.
[Graphical model: mixing weights π, class variables z_n, Gamma variables u_n, component parameters ν_k, μ_k, Σ_k, observations t_n; plates over the N objects and K components]
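For a single t component the inferred 'peculiarity' has a closed form, E[u | t] = (ν + d) / (ν + δ), where δ is the squared Mahalanobis distance of t from the component (Peel & McLachlan 2000); in the mixture, these values are averaged over components using the responsibilities. A minimal NumPy sketch:

```python
import numpy as np

def expected_u(t, mu, Sigma, nu):
    """E[u | t] for one t component; small values flag peculiar points."""
    d = len(mu)
    diff = np.asarray(t) - np.asarray(mu)
    delta = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    return (nu + d) / (nu + delta)
```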
A robust mixture for data with errors
• A heteroscedastic Gaussian error model (with unknown mean and known variance) will account for the measurement errors
• A mixture of Student-t densities will model the density of w:

p(w_n) = \sum_{k=1}^{K} \pi_k \, \mathrm{St}(w_n \mid \mu_k, \Sigma_k, \nu_k)

• Putting these together, the data likelihood is:

p(t_n) = \sum_{k=1}^{K} \pi_k \int \mathrm{Gam}(u_n \mid \nu_k/2, \nu_k/2) \, N(t_n \mid \mu_k, \Sigma_k/u_n + S_n) \, du_n

• The joint likelihood of all variables factorises according to the graphical model:
[Graphical model: mixing weights π, class variables z_n, Gamma variables u_n, component parameters ν_k, μ_k, Σ_k, latent 'true' values w_n, observations t_n, known error covariances S_n; plates over the N objects and K components]
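A minimal sketch of simulating data from this generative story (pick a component, draw the latent 'true' value w_n from a t density, then add Gaussian measurement error with a known per-object variance S_n); all sizes and parameter values are arbitrary choices, with identity Σ_k and diagonal S_n for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dataset(N=500, d=2, K=3, nu=3.0, error_scale=0.5):
    """Simulate observations t_n and known error variances S_n from a t-mixture plus noise."""
    pis = np.ones(K) / K
    mus = rng.normal(scale=5.0, size=(K, d))               # component means
    z = rng.choice(K, size=N, p=pis)                       # component labels
    u = rng.gamma(nu / 2.0, scale=2.0 / nu, size=N)        # Gamma(nu/2, rate nu/2) scale variables
    w = mus[z] + rng.normal(size=(N, d)) / np.sqrt(u)[:, None]  # t-distributed 'true' values
    S = error_scale * rng.uniform(size=(N, d))             # known per-object (diagonal) error variances
    t = w + np.sqrt(S) * rng.normal(size=(N, d))           # observed values with measurement error
    return t, S, z

t_obs, S_obs, labels = sample_dataset()
```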
Experiments and Results
1) How accurate is the structured variational approximation employed?
- How does it compare to a fully factorial approximation?
- How does it compare to the 'ground truth' Markov-Chain-EM?
2) To what extent does knowledge of measurement errors help us to infer the outliers that are not due to these errors?
- How does it compare to ignoring the knowledge about the errors?
- How does it compare to the idealised case of having the clean data with no measurement errors?
3) Application in Astronomy: finding peculiar objects of interest from the SDSS quasar catalogue (high-redshift quasars).
2) Accuracy of detecting 'genuine' outliers
• Synthetic data sets sampled from the model, starting from three well-separated Gaussians, with genuine outliers drawn from a uniform distribution
• Five different error levels defined: the diagonals of S range over [0, 0.01], [0, 0.1], [0, 1], [0, 10] and [0, 100] respectively
• Computed the Area Under the ROC curve (AUC) achieved by the inferred outlierness against the true outliers, averaged over 10 independent repeats
Accuracy of detecting "genuine" outliers: Results on the synthetic data sets
[Figure panels: In sample, Out of sample]
• Comparison with a Maximum A Posteriori (MAP) estimator (for u_n) in terms of outlier finding and cluster finding, computed from 10 independent realizations of the synthetic data:

                       MAP            Var-EM          P-value
  AUC                  0.915 ± 0.13   0.964 ± 0.015   0.0890
  Clust accuracy (%)   0.920 ± 0.15   0.931 ± 0.014   0.0079

– MAP is more variable and less accurate than Var-EM

Some early results
• Synthetic data in 10-D: 3 clusters & outliers
• It is beneficial to represent the outlierness in a 3rd
dimension.