Simplicial Depth: An Improved Definition, Analysis, and

Download Report

Transcript Simplicial Depth: An Improved Definition, Analysis, and

Simplicial Depth: An Improved
Definition, Analysis, and Efficiency in
the Finite Sample Case
Michael A. Burr, Eynat Rafalin, and Diane L. Souvaine
Tufts University
www.cs.tufts.edu/research/geometry
CCCG 2004
NSF grant #EIA-99-96237
Introduction
• Introduction to Data Depth
– Why?
– Examples
– Desirable Properties
• Simplicial Depth
– Definition
– Properties
– Problems
• Revised Definition
– Definition
– Properties
• Ongoing work
What is Data Depth and Why?
• Measures how deep
(central) a
d
given point p  R is relative to a
distribution or a data cloud.
– Deals with the shape of the data.
– Can be thought of as a measure
of how well a point characterizes
a data set
• Provides an alternative to classical
statistical analysis.
– No assumption about the
underlying distribution of the
data.
• Deals with outliers.
• Why study?
– Many measures are geometric in
nature.
– Can be computationally
expensive to compute depth.
Examples
•
•
•
•
Half-Space (Tukey, Location) (Tukey 75)
Regression Depth (Rousseeuw and Hubert 94)
Simplicial Depth (Liu 90)
… and many more.
S
2 Data Points in this Half-plane
p
HDS ; p   2
3 Data Points in this Half-plane
Desirable Properties of Data Depth
• Liu (90) / Serfling and Zuo (00)
–
–
–
–
P1 – Affine Invariance
P2 – Maximality at Center
P3 – Monotonicity Relative to Deepest Point
P4 – Vanishing at Infinity
• We propose (BRS 04)
– P5 – Invariance Under Dimensions Change
Affine Invariance (P1)
AS 
S
A – affine transformation
p
A p 
Maximality at Center (P2)
p is the center
q is any point
Monotonicity Relative to Deepest Point (P3)
point between p and q
p is the deepest point
r  0,1
DS ; q   DS ; p  r q  p 
q is any point
Vanishing at Infinity (P4)
q is far from the data cloud
lim DS ; q   0
q 
Invariance Under Dimensions Change (P5)
2
Is this an R data set?
1
Is this an R data set?
D d S ; p 
 C d ,b
b
D S ; p 
p
Simplicial Depth (Liu 90)
• The simplicial depth of a point p with respect to a
probability distribution F in R d is the probability that
a random closed simplex in R d contains p.
SDLiu F ; p   PF  p  S X1 ,, X d 1 
where SX1,, X d 1  is a closed simplex formed by
d+1 random observations from F.
• The simplicial depth of a point p with respect to a
data set S  X1,, X n  in R d is the fraction of closed
simplicies formed by d+1 points of S containing p.
n
SD S ; p      I p  S X , , X 
 3
where I is the indicator function.
1
Liu
i1
id 1
Sample Version of Simplicial Depth
• The simplicial depth of a point p with respect to a data set
S  X1 ,, X n in R d is the fraction of closed simplicies formed
by d+1 points of S containing p.
Total number of simplicies=
6
(3) =20
.2
p.3
.3
.3
.4
.3
.2
p is contained in 6 simplicies
6
The depth of p= __ =.3
20
.3
.4
.2
.3
.3
.4
.4
.4
.2
.3 .3
.3
.2
Properties
• Is a statistical depth function in the continuous
case. (Liu 90)
• Is affine invariant (P1) and vanishes at infinity
(P4) in the sample case. (Serfling and Zuo 00)
Problems in the Sample Case
• Does not always attain maximality at the
center (P2) and does not always have
monotonicity relative to the deepest point (P3).
(Serfling and Zuo 00)
• The depth on the boundary of cells is at least
the depth in each of the adjacent cells – causes
discontinuities.
• Does not have invariance under dimensions
change (P5).
Simplicial Depth (Liu 90) (BRS 04)
.6 .3
B
.4
.3
E
.8 .5
.4
X
.5 .35
C
.3
.4
.5
.3
Y
.7 .4
.4
.3
.3.6
.6 .3
.6 .3
D
Averaging number of closed and
open simplicies containing a point
A
Total number of simplicies = ( 53 ) = 10
Revised Definition (BRS 04)
• The simplicial depth of a point p with respect to a
data set S  X1,, X n  in R d is the average of the
fraction of closed simplicies containing p and the
fraction of open simplicies containing p, formed by
d+1 points of S.
• Equivalently
1
SDBRS S ; p    S ; p    S ; p 
2
•  S; p  - the fraction of simplicies with data points as vertices
which contain p in their open interior.
•  S; p - the fraction of simplicies with data points as vertices
which contain p in their boundary.
Properties of the Revised Definition
• Reduces to the original definition, for continuous
distributions and for points lying in the interior of
cells.
• Keeps ranking order of data points
• Can be calculated using the existing algorithms, with
slight modifications.
• Fixes Zuo and Serfling’s counterexamples.
• The depth on the boundary of two cells is the average
of the two adjacent cells.
• Invariant under dimensions change (P5) for the
change from R 2 to R1.
Invariance Under Dimension Change (P5)
• Degenerate simplicies
– Both points C and A (a point between B and C) lie within the open
(degenerate) simplex BCD – think of it as a very thin triangle.
– Both points B and D are vertices of the1(degenerate) simplex BCD.
SD S ; p 
• For a point, p, consider the ratio: SD2 S ; p 
– For both definitions, the ratio for a position (non-data point) is 2/3.
– For Liu’s definition, the ratio for a data point is not 2/3.
– For the BRS definition, the ratio for a data point is 2/3.
Remaining Problems (P2 and P3)
355
1120
587
1120
Remaining Problems (Data Points)
• Data points are still over counted – there can still be
discontinuities at data points. However, to fix the depth at data
points, more features need to be considered.
n 1
– Data points are inherently part of  2  simplicies (a point makes a
triangle with every other pair of points) and edges are inherently part of
 n  2

 simplicies (the two endpoints of an edge make a triangle with
 1 
every other vertex).
– To retain invariance under dimensions change (P5), given a data set in
R b, which lies on a d-flat, then the depth of a point when the data set is
evaluated as a d-dimensional data set should be a multiple of the depth
when the data set is evaluated as a b-dimensional data set.
• Neither of the above ideas completely solve the problem and it
appears that the best solutions take into account the geometry
of the entire data set.
Ongoing Work
• The current algorithm for finding the median (the
deepest point) is O(n4) to walk an arrangement of
O(n2) segments.
– We can improve this algorithm by comparing simplicial
depth and half-space depth.
– We are further improving this by considering simplicial
depth in the dual.
• The problems with data points are improved by
generalizing this work to higher dimensions.
• To find the depth at all points, we are using local
information to form an approximation for the depth
measure.
References
•
•
•
•
•
•
•
•
G. Aloupis, C. Cortes, F. Gomez, M. Soss, and G. Toussaint. Lower bounds for
computing statistical depth. Computational Statistics & Data Analysis, 40(2):223229, 2002.
G. Aloupis, S. Langerman, M. Soss, and G. Toussaint. Algorithms for bivariate
medians and a Fermat-Torricelli problem for lines. In Proc. 13th CCCG, pages 2124, 2001.
M. Burr, E. Rafalin, and D. L. Souvaine. Simplicial depth: An improved definition,
analysis, and efficiency for the sample case. Technical report 2003-28, DIMACS,
2003.
A. Y. Cheng and M. Ouyang. On algorithms for simplicial depth. In Proc. 13th
CCCG, pages 53-56, 2001.
J. Gil, W. Steiger, and A. Wigderson. Geometric medians. Discrete Math., 108(13):37-51, 1992. Topological, algebraical and combinatorial structures. Frolík's
memorial volume.
S. Khuller and J. S. B. Mitchell. On a triangle counting problem. Inform. Process.
Lett., 33(6):319-321, 1990.
R. Liu. On a notion of data depth based on random simplices. Ann. of Statist.,
18:405-414, 1990.
Y. Zuo and R. Serfling. General notions of statistical depth function. Ann. Statist.,
28(2):461-482, 2000.