Data interpretation = equivalence partition

Download Report

Transcript Data interpretation = equivalence partition

Real world data analysis and
interpretation
Dmitry Volchenkov
Project FP7 – ICT-318723 MATHEMACS
Big challenges of big data
May 22, 2013 — A full 90% of all the
data in the world has been generated
over the last two years.
All possible paths are taken into account, some paths are
more preferable then others.
Such a “path integral” distance induces geometry (volumes)!
Data interpretation = equivalence partition
• The data interpretation
always is based on an
equivalence partition on
the set of walks over a
database;
Evolution tree = the geometric
image of the database under
the equivalence partition
suggested by Linnaeus
Systema Naturæ (1735)
Spices are
the identical
“walks” over the
database of
morphological
taxa
Interpretation
Data interpretation = equivalence partition
• The data interpretation
always is based on an
equivalence partition on
the set of walks over a
database;
• Interpretation does not
necessary reveal a "true
meaning" of the data, but
rather represent a selfconsistent point of view on
that.
“Astrological” equivalence partition: walks of the given length n starting at the
same node are equivalent (Same day born people inherit a same/similar
personality).
Plan of my talk
Random
Walks
Geometry of
Data
How can we
save Europe?
• What is that?
• Automated
data
interpretation
• “Path integral”
distance
• Examples
• Analysis of the
GDP,
Inequality,
Polity data
series
Random Walks: What is that?
Physical model
Mathematical meaning
P1  P2  P3  P4  1
, a permutatio n matrix
if ,        0,
Symmetry of route choice:
the equivalent paths are equiprobable
then   Aut  
T,    0,
T
ij
 1, a stochastic matrix
j
RW is a stochastic automorphism expressing structural symmetries:
Equivalent walks are equiprobable
When equivalent paths are taken as equiprobable
Utility
functions

Dn  diag deg n (1),  deg n ( N ) , deg n (i )   A n ij
D1  D2   Dn 
j
Graph


A

Random
walks

T1
 T2
  Tn
 T




1
 2
  n

Maximal
complexity
I.C. are
important

  An

 
A 
Tn  Dn1 An , Tn , An  0
n
ij
 # Paths n i  j ,


# Paths 1  i  
# Paths 1 G 
# Paths n  i  
 in  
# Paths n G 
Maximal entropy
    i1 
H T      iTij log Tij
i, j
the entropy rate
Bits
Centrality
measures

 A2

A   max 

T  i





j
jV
i 1,... N
deg i 
,
2E
 i   i

jV
j
random walks
Mixing
the past-future mutual information
Defect
" Empiricism "
insensitiv e
Scale of random walks, Tn
PFMI  lim H S n   n  h 
n
Defect & boundary
repelling
" Theories" , good for all
practical purposes
Geometry of Data & Graphs
• Path integral sums over all
RWs to compute a
propagator.
• Propagator is the Green’s
function of the diffusion
operator:
T,   0,
G
 Tn 
n 
1
 " L1"
1 T
• The Drazin generalized
inverse (the group inverse
w.r.t. matrix multiplication)
preserves symmetries of
the Laplace operator:
LGL  L, GLG  G,
G, L  0
• Given two distributions x,y,
their scalar product:
x, y T
 x, G y 
• The (squared) norm of a
distribution:
x
•
2
T
 x, Gx 
The Euclidean distance:
xy
2
T
 x
2
T
 y
2
T
 2x, y T
Probabilistic geometry of graphs
Graph  A  T  D 1 A, D  diag deg(1), deg( N )
ˆ  D1 2 AD 1 2 ,  T
ˆ    ,   N ,
T
l
l l
l
1  s ,i  s , j
s  2 1   s  1,i  1, j
N
Gij  

 2 ,i

   1,i 1   2
 

 
N ,i

   1,i 1   N

 2, j
 
 
   1, j 1   2
 , 



 N, j
 
   1, j 1   N
 
First-passage time:
2
N
N
2
1  k ,i
i 
  i H ij
2
T
k 2 1  k  1,i
i 1

1
Commute time:
 
Kij  i  j
2
T

 k, j
 k ,i
 


 1, j 1  k
k  2   1,i 1  k
N
1  1     N ,  12,i   i 




2

2 ,i
2, j
, 3, j 
degi 
2E



 
 

 i , j T  ei , Ge j 



  PR N 1
 

j
, 3,i 

i
Can we hear first-passage times?
F. Liszt Consolation-No1
P. Tchaikovsky, Danse Napolitaine
V.A. Mozart, Eine Kleine Nachtmusik
Bach_Prelude_BWV999
R. Wagner, Das Rheingold
(Entrance of the Gods)
Can we hear first-passage times?
First-passage time
Recurrence time
Tonality: the
hierarchy of
harmonic
intervals
Tonality of
music
The basic pitches for the
E minor scale are "E",
"F#", "G", "A", "B".
The recurrence time vs. the first
passage time over 804 compositions
of 29 Western composers.
Can we see the first-passage times?
Tax assessment value of land ($)
Manhattan, 2005
(Mean) First passage time
(Mean) first-passage times in the city graph of Manhattan
SoHo
Federal Hall
10
East Village
100
1,000
Bowery
East Harlem
5,000
10,000
Why are mosques located close to railways?
NEUBECKUM:
Isolation Moschee   10  log
first - passage time (Moschee)
 12 dB
Min first - passage time
first - passage time (Kirche)
Isolation Kirche   10  log
 3 dB
Min first - passage time
Social isolation vs. structural isolation
Principal components by random walks
Representations of graphs & databases in the probabilistic
geometric space are essentially multidimensional!
1000 × 1000 data table (or a connected graph of 1000 nodes) is
embedded into 999-dimensional space!
Dimensions are unequal! ~
1
, k  2.... N
1  k
Kernel principal component analysis (KPCA) with
1
G

T

"L "

the kernel
1 T
n
n 
1
Nonlinear principal components by random walks
MILCH
K
= MILK
Matrix of lexical
distances, A
Dmilch, milk   2 5 ;

Stochastic
normalizat ion, T
d l1 , l2  
 G
T
n 
1
# List of words
n

 D
List of words
l1
,  l2 
1
 " L1"  Kernel PCA
1 T
In contrast to the covariance matrix which best explains the variance in the data with
respect to the mean, the kernel G traces out all higher order dependencies among data
entries.
How can we save Europe?
Could random walks help us to approach the problem?
Crisis for Europe as trust hits record low
No common trends for EU
Maddison historical GDP data
Kalman filter based on GDP data
+ Average over many evolution scenarios
… if we play the previous history.
No common trends for EU
SCENARIO #1
SCENARIO #2
… if we play the previous history.
Economic recovery after the WWII came at different rates in different parts of
Europe.
Maddison’s database retells us the
story about recovering after the WWII
Industrial countries have an edge on competitors as
GDP variations are limited to ± $500/year at most
Traditional capital shelters thrive for
larger variations
Fancy years are shown to elucidate tendencies
Maddison’s database predicts bankruptcy to the
countries that remained uninvolved in the global
recovery process.
IRAQ
Fancy years are shown to elucidate tendencies
To catch up with new tendencies,
we have to add more databases
Evolution of political Regimes
Inequality
Democracy/Autocracy indices
Top income shares
Polity IV tells us that
“Political distance” – the minimal number of political changes (reforms) required to convert
the political system of one country into that of another.
Trends in Governance in 1810
Trends in Governance in 2012
Polity IV tells us that
• Positive feedback,
reinforcing the
multiplication of the
number of polities;
dN
N
dt
• We witness the very
beginning of a chain
reaction process (of
atomization of the
polity landscape)
The World Top Income database
There
many inequality metrics
Too poor
toare
be rich,
Rising inequality marks wars
too rich to be poor
The Pareto principle:
income follows
a power law
probability distribution.
Vilfredo Pareto
Parabolic fit (sic!)
→ number of people →
I used the inverse Pareto-Lorenz coefficient (IPLC)
→ wealth →
… and wars multiply states
The World Top Income database
Too poor to be rich,
too rich to be poor
Rapidly rising inequality marks
wars & conflicts
Parabolic fit(!)
… and wars multiply polities
If the GDP-gain substantially outmatches/ lags below the mean (red
line), it probably comes at the cost of increasing inequality.
7,563 governance configurations
Regulation of
chief executive
recruitment
Unregulated
Openness of
Executive
Recruitment
Closed
Competitiveness
of Executive
Recruitment
Unregulated
Dual executive
designation
Transitional
Selection
Dual executive
election
Regulated
Open
Executive
constrains
Regulation of
Participation
Competitivene
ss of
participation
Unlimited
Authority
Unregulated
Unregulated
Intermediate
Multiple
identity
Repressed
Slight to
moderate
limitations
Sectarian
Suppressed
Intermediate
Substantial
limitations
Dual
hereditary/comp
etitive
Factional
Restricted
Transitional
Intermediate
Regulated
Competitive
Executive Parity
+ Interruption (foreign occupation) + Interregnum (anarchy) + Transitional
232 configurations have been
observed since 1800
"Tajikistan", 2013
"Nepal", 1945
"Korea North", 2013
"Libya", 2010
Foreign interruption
"Cuba", 2005
"Thailand", 2013
"Korea South", 2013
"United States", 2013
"Czech Republic", 2013
New configurations arise from
time to time …
"Estonia", 2013
Random walks on the graph of
political regimes
Transition matrix between types
of governance
Each political regime has its own
dynamics for GDP and IPLC
Process starts from the actual data
(GDPPC & IPLC) for 2013
+ Averaging over all collected histories
Most transitions happen within the groups of authoritarian states and presidential
republics, while liberal democracies and dictatorships are quite “sticky”.
Random walks on the graph of political
regimes
A state insists on a common economic and political destiny of its citizens.
However, the actual trends of different economic groups might be statistically
inconsistent.
There can be a common European
trend if …..
Germany vs. Greece
• if the workforce will be
able to migrate freely,
• and polities will be able to
split without wars.
Back to the City-States?
Possible splitting of a country is visible as the statistically inconsistent trends.
Polities proliferation score
Main factors resulting in multiplying scores:
1. inequality (stretches bandwidth of boxes); 2. Authoritarian regimes are short-lived,
quickly transforming to other modes of authoritarianism, provoking instability
Greece vs. Russia
Expected number of countries
Strong inequality worsens perspectives,
authoritarian governance worsens perspectives
USA vs. China
IPLC ~ O(GDPPC2)
Battle in Asia, concord in Europe
China (red) vs. Indonesia (blue)
Germany (dark) vs. Austria (light)
Conclusions
RWs formalize the process
of data interpretation
playing the role of
stochastic automorphisms
The method for summing
up all RWs →
Probabilistic geometry
RWs can be used in order
to combine different
(incomplete) databases
Kernel Principal Component
Analysis handles high-order
dependences in data
Some references
D.V., Ph. Blanchard, Mathematical Analysis of Urban Spatial Networks, ©
Springer Series Understanding Complex Systems, Berlin / Heidelberg. ISBN
978-3-540-87828-5, 181 pages (2009).
D.V., Ph. Blanchard, “Introduction to Random Walks on Graphs and Databases”, ©
Springer Series in Synergetics , Vol. 10, Berlin / Heidelberg , ISBN 978-3-64219591-4 (2011).
Volchenkov, D., “Markov Chain Scaffolding of Real World Data”, Discontinuity, Nonlinearity, and
Complexity 2(3) 289–299 (2013)| DOI: 10.5890/DNC.2013.08.005.
Volchenkov, D., Jean-René Dawin, “Musical Markov Chains ”, International Journal of Modern Physics:
Conference Series, 16 (1) , 116-135 (2012) DOI: 10.1142/S2010194512007829.
Volchenkov, D., Ph. Blanchard, J.-R. Dawin, “Markov Chains or the Game of Structure and Chance. From
Complex Networks, to Language Evolution, to Musical Compositions”, The European Physical Journal Special Topics 184, 1-82 © Springer Berlin / Heidelberg (2010).
Volchenkov, D., “Random Walks and Flights over Connected Graphs and Complex Networks”,
Communications in Nonlinear Science and Numerical Simulation, 16 (2011) 21–55
http://dx.doi.org/10.1016/j.cnsns.2010.02.016 (2010).