by the first passage time to it


We have no physical image of the network or database,
only individual objects recognized as nodes.
The distance between two vertices in a graph is the number of
edges in a shortest path connecting them.


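$T: \quad \sum_j T_{ij} = 1$, a stochastic matrix, with $[T, A] = 0$.
A particular case: $\Pi$, a permutation matrix, $[\Pi, A] = 0 \Leftrightarrow \Pi \in \mathrm{Aut}(G)$.
$T: \quad [T, A] = 0, \qquad \psi_k \hat{T} = \mu_k \psi_k, \qquad 1 = \mu_1 > \mu_2 \ge \dots \ge \mu_N \ge -1$
defines a random walk on $G$.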
 
ij
2
T
  s ,i
 s, j
 


 1, j 1  s
s  2   1,i 1  s
N




2
 x0 , x1,..., xn1, xn  i, xn1, xn2 ,..., xnl  j, xnl 1,..., xs1, xs  i, xs1, xs2 ,...

i
2
T
  s ,i
 

s  2   1,i 1  s
N




2
N 

 
 s ,i s , j
i , j T     1      1 1T   Tn
s  2  1,i 1, j
n0
s 
The Moore-Penrose pseudoinverse
First-Encounters restore the Euclidean space structure:
$$\| e_i - e_j \|_T^2 = \| e_i \|_T^2 + \| e_j \|_T^2 - 2\,(e_i, e_j)_T$$
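A minimal numerical sketch of these relations, assuming an undirected, connected graph given by a symmetric adjacency matrix A and the walk T = D^{-1}A (the function name `random_walk_geometry` is illustrative and not from the slides). It reads the squared norm of a node as the first-passage time to it and the squared distance between two nodes as their commute time, using the eigenvectors of the symmetrized operator D^{-1/2} A D^{-1/2}.

```python
import numpy as np

def random_walk_geometry(A):
    """First-passage and commute times from the spectrum of a random walk.

    Assumes A is the symmetric adjacency matrix of a connected graph.
    """
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    pi = deg / deg.sum()                      # stationary distribution, psi_{1,i}^2
    T_hat = A / np.sqrt(np.outer(deg, deg))   # symmetrized transition operator
    mu, psi = np.linalg.eigh(T_hat)           # eigenvalues in ascending order
    order = np.argsort(mu)[::-1]              # reorder so that mu_1 = 1 comes first
    mu, psi = mu[order], psi[:, order]
    weights = 1.0 / (1.0 - mu[1:])            # 1 / (1 - mu_s) for s >= 2
    Y = psi[:, 1:] / np.sqrt(pi)[:, None]     # psi_{s,i} / psi_{1,i}
    gram = (Y * weights) @ Y.T                # inner products (e_i, e_j)_T
    fpt = np.diag(gram).copy()                # squared norms ||e_i||_T^2
    commute = fpt[:, None] + fpt[None, :] - 2.0 * gram
    return fpt, commute

# Path graph 0 - 1 - 2: the middle node is the most accessible one.
fpt, commute = random_walk_geometry([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(fpt)
print(commute)
```

For the three-node path, the first-passage time to the middle node is smaller than to either end, which matches the intuition that isolated nodes are harder to reach.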
[Figure labels: 1 A; R_i = 1 Ω]
3 examples:
"We shape our buildings,
and afterwards our buildings shape us."
• Sir Winston Churchill (October 28, 1943, while requesting
that the House of Commons be rebuilt exactly as before,
although it remained insufficient to seat all its members.)
The more isolated a place is, the worse the situation there.
First-passage times to Venetian canals
[Figure labels: SoHo, East Village, Times Square, Federal Hall, Bowery, East Harlem]
The data on the mean household income per year provided by
Growth (bell)  log10  PEmax PEmin 
The data taken from the
$$I_{dB}(\text{Mosque}) = 10 \log \frac{FPT(i)}{FPT_{min}} \approx 12$$
$$I_{dB}(\text{Church}) \approx 3$$
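The isolation measure can be sketched as a one-line computation. This assumes the logarithm is base 10, as is usual for decibels (the slide does not state the base), and the function name `isolation_db` is hypothetical.

```python
import math

def isolation_db(fpt, fpt_min):
    """Isolation of a place in decibels, relative to the most accessible place.

    Assumption: base-10 logarithm, following the usual decibel convention.
    """
    if fpt <= 0 or fpt_min <= 0:
        raise ValueError("first-passage times must be positive")
    return 10.0 * math.log10(fpt / fpt_min)

# A place whose first-passage time is twice the minimum is about 3 dB isolated.
print(round(isolation_db(2.0, 1.0), 2))
```

Read this way, 3 dB corresponds to a first-passage time about twice the minimum, and 12 dB to roughly sixteen times the minimum.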
From Gray, R. D. and Q. D. Atkinson. 2003. Language tree divergence times
support the Anatolian theory of Indo-European origin. Nature 426: 435-439.
The tree-reconstruction phylogenetic methods based on the simple relation of ancestry fail to reveal the full
complexity of the multidimensional phylogenetic signal, where language affinity is characterized by many
phonetic, morphophonemic, lexical, and grammatical isoglosses:
• evolutionary trees conflict with each other and with the traditionally accepted family arborescence;
• the languages known as isolates cannot be reliably classified into any branch with other living languages.
1. We present a fully automated method for building genetic language taxonomies where
the relationships between different languages in the language family are represented
geometrically, in terms of distances and angles, as in the Euclidean geometry of everyday
intuition.
2. We have tested our method on the 50 major languages of the Indo-European language
family.
3. We have then investigated the Austronesian phylogeny, again considered over 50 languages.
Challenges (encoding):
1. Languages which belong to the same family may not share many words in common, while languages
in two distinct families may share many words in common.
2. The effect of bias between orthographic and phonetic realizations of meanings.

1. Rather than a complete dictionary, we have used a short list of 200 words (Swadesh's list), known to
change at a very slow rate and containing terms which are common to all cultures, adopted to reconstruct
systematic sound correspondences between the languages.
2. Swadesh's lists for the languages written in different alphabets had already been
transliterated into English by Dyen et al. (1997) and Greenhill et al. (2008).
3. We have studied languages within a language family.
$D(\text{milch}, \text{milk}) = 2/5 \in [0, 1]$
Brahui is Dravidian by
the syntactic structure,
but 85% of all words
are Indo-European.
Levenshtein distance
(edit distance) is a measure
of the similarity between two
strings: the minimum number of
deletions, insertions, or
substitutions required to
transform one into another.
MILCH → MILK: substitute C with K, delete H (2 edits).
The lexical distance between l1 and l2 can be interpreted as the average probability to distinguish them by a mismatch between two
characters randomly chosen from the orthographic realizations of Swadesh's meanings.
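A short sketch of this lexical distance, assuming the edit distance is normalized by the length of the longer word (which reproduces D(milch, milk) = 2/5) and then averaged over meanings. The function names and the two-word sample lists are illustrative only, not the Swadesh data used in the talk.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    """Edit distance divided by the longer word's length, so the value lies in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def lexical_distance(words1, words2):
    """Average normalized distance over meanings; the two lists are paired by index."""
    pairs = list(zip(words1, words2))
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)

print(normalized_distance("milch", "milk"))
print(lexical_distance(["milch", "hand"], ["milk", "hand"]))
```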
Challenges (representation):
The multivariate lexical signal is strongly correlated → PCA, ICA
No historical development in language can be described only in terms of 'pair-wise'
interactions; it reflects a genuine higher-order influence among the different language
groups.
The kernel PCA method (Schölkopf et al., 1998) generalizes PCA to the case where we are
interested in taking into account all higher-order correlations between data instances.
The appropriate kernel was found in Blanchard & Volchenkov (2008):
The lexical distance between l1 and l2 is the average probability to
distinguish them by a mismatch between two characters randomly chosen
from the orthographic realizations of a Swadesh’s meaning.
P is the total probability of successful classification by an infinite series of matchings, for the two
languages in the language family,
$$P(l_i, l_j) = \lim_{n \to \infty} \sum_{k=0}^{n} T^k(l_i, l_j) = \left( \frac{1}{1 - T} \right)_{ij}, \qquad T_{ij} = \frac{d(l_i, l_j)}{\sum_{k=1}^{N} d(l_i, l_k)}$$
$$P q_k = \lambda_k q_k, \qquad 0 \le \lambda_1 \le \lambda_2 \le \dots \le \lambda_N$$
The rank-ordering of data traits in accordance with their eigenvalues provides us with a natural geometric framework for dimensionality
reduction.
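A hedged sketch of the representation step, assuming a symmetric matrix d of pairwise lexical distances with positive row sums. It row-normalizes d into the stochastic matrix T and rank-orders the eigenvectors of T; any function of T, such as 1/(1 - T), shares these eigenvectors wherever it is defined, so the sketch works with T directly. The function name, the sample matrix, and the choice of which eigenvectors to use as coordinates are assumptions, not taken from the slides.

```python
import numpy as np

def language_coordinates(d):
    """Row-normalize a symmetric distance matrix and rank-order its eigenvectors."""
    d = np.asarray(d, dtype=float)
    row_sums = d.sum(axis=1)
    T = d / row_sums[:, None]                       # T_ij = d_ij / sum_k d_ik
    # T is similar to this symmetric matrix, so its eigenvalues are real.
    S = d / np.sqrt(np.outer(row_sums, row_sums))
    mu, v = np.linalg.eigh(S)
    order = np.argsort(mu)[::-1]                    # largest eigenvalue first
    mu, v = mu[order], v[:, order]
    q = v / np.sqrt(row_sums)[:, None]              # right eigenvectors of T
    return T, mu, q

# Hypothetical distances between three languages.
d = [[0.0, 0.4, 0.8],
     [0.4, 0.0, 0.6],
     [0.8, 0.6, 0.0]]
T, mu, q = language_coordinates(d)
print(mu)
```

The first column of q belongs to the eigenvalue 1 and is constant across languages, so one option is to use the following columns as coordinates of the language points.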
1. The four well-separated monophyletic spines
represent the four biggest traditional IE language
groups: Romance & Celtic, Germanic, Balto-Slavic,
and Indo-Iranian.
2. The Greek, Romance, Celtic, and Germanic
languages form a class characterized by
approximately the same azimuth angle (they belong to
one plane).
3. The Indo-Iranian, Balto-Slavic, Armenian, and
Albanian languages form another class, with
respect to the zenith angle.
The systematic sound correspondences between Swadesh's words across the different
languages coincide perfectly with the well-known centum-satem isogloss of the IE family
(reflecting the IE numeral '100'), related to the evolution of the phonetically unstable
palatovelar order.
The normal probability plots fitting the distances r of
language points from the ‘center of mass’ to univariate
normality.
The data points were ranked and then plotted against their expected values
under normality, so that departures from linearity signify departures from
normality.
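The construction of a normal probability plot can be sketched with the standard library alone. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, as are the function names; the slides do not say which convention was used. The linearity score is the Pearson correlation between expected and observed values and needs non-constant data.

```python
from statistics import NormalDist, mean

def normal_probability_plot(data):
    """Return (expected normal quantiles, ranked data) for a normal probability plot."""
    observed = sorted(data)
    n = len(observed)
    std_normal = NormalDist()
    # Assumed plotting positions (i - 0.5) / n for the i-th ranked point.
    expected = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return expected, observed

def linearity(expected, observed):
    """Pearson correlation of the plotted points; values near 1 suggest normality."""
    mx, my = mean(expected), mean(observed)
    sxy = sum((x - mx) * (y - my) for x, y in zip(expected, observed))
    sxx = sum((x - mx) ** 2 for x in expected)
    syy = sum((y - my) ** 2 for y in observed)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical distances r of language points from the center of mass.
expected, observed = normal_probability_plot([0.9, 1.4, 1.1, 1.0, 1.2])
print(linearity(expected, observed))
```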
interpretation
The univariate normal distribution is closely related to the time evolution of a
mass-density function under homogeneous diffusion in one dimension,
$$\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \;\longleftrightarrow\; \frac{1}{\sqrt{2\pi t}} \exp\left( -\frac{(x-\mu)^2}{2t} \right),$$
in which the mean value μ is interpreted as the coordinate of the point where all mass was
initially concentrated, and the variance σ² ∝ t grows linearly with time.
This has nothing to do with the traditional glottochronological assumption about the steady
borrowing rates of cognates (Embleton, 1986)!
The values of the variance σ² give a statistically consistent estimate of age for each language
group.
Anchor events:
1. the last Celtic migration (to the Balkans and Asia Minor) (300 BC),
2. the division of the Roman Empire (500 AD),
3. the migration of German tribes to the Danube River (100 AD),
4. the establishment of the Avars Khaganate (590 AD) overspreading
Slavic people who did the bulk of the fighting across Europe.
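Since the variance is taken to grow linearly with time, the anchor events can calibrate a single time–variance ratio. The sketch below assumes a least-squares fit through the origin, which the slides do not specify, and every number in it is a hypothetical placeholder rather than a value from the study.

```python
def time_variance_ratio(variances, ages):
    """Least-squares slope c of age = c * variance (a line through the origin)."""
    numerator = sum(v * a for v, a in zip(variances, ages))
    denominator = sum(v * v for v in variances)
    return numerator / denominator

def estimate_age(variance, ratio):
    """Age of a language group implied by its variance and the calibrated ratio."""
    return ratio * variance

# Hypothetical anchor groups: variance of each group and years since its anchor event.
anchor_variances = [1.0, 2.0, 3.0]
anchor_ages = [1000.0, 2000.0, 3000.0]
ratio = time_variance_ratio(anchor_variances, anchor_ages)
print(estimate_age(4.4, ratio))
```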
From the time–variance ratio we can retrieve the probable dates for:
• The break-up of the Proto-Indo-Iranian continuum (by 2,400 BC).
The migration from the early Andronovo archaeological horizon (Bryant, 2001).
• The end of common Balto-Slavic history (before 1,400 BC).
The archaeological dating of the Trzciniec-Komarov culture.
• The separation of Indo-Aryans from Indo-Iranians (before 400 BC).
Probably a result of the Aryan migration across India to Ceylon, as early as
483 BC (McLeod, 2002).
• The division of the Persian polity into a number of
Iranian tribes, after the end of the Greco-Persian wars (Green,
1996).
The Anatolian hypothesis suggests the
origin in Neolithic Anatolia and
associates the expansion with the Neolithic
agricultural revolution in the 8th and 6th
millennia BC (Renfrew, 1987).
[Image: Einkorn wheat (Triticum boeoticum)]
The graphical test to check three-variate normality of the distribution of the distances of the five proto-languages from a
statistically determined central point is presented by extending the notion of the normal probability plot. The χ-square distribution
is used to test for goodness of fit of the observed distribution: departures from three-variate normality are indicated by
departures from linearity.
The use of the previously determined time–variance ratio then dates the initial break-up of the
Proto-Indo-Europeans back to 7,400 BC, pointing to an early Neolithic date.
The components probe for a sample of 50 Austronesian (AU) languages immediately uncovers both the
Formosan (F) and Malayo-Polynesian (MP) branches of the entire language family.
Headhunters
The distribution of languages spoken within Maritime Southeast Asia,
Melanesia, Western Polynesia and of the Paiwan language group in
Taiwan over the distances from the center of the diagram conforms to
univariate normality, suggesting that an interaction sphere had
existed encompassing the whole region, from the Philippines
and Southern Indonesia through the Solomon Islands to Western
Polynesia, where ideas and cultural traits were shared and spread, as
attested by trade (Bellwood and Koon, 1989; Kirch, 1997) and
translocation of animals (Matisoo-Smith and Robins, 2004;
Larson et al., 2007) among shoreline communities.
By 550 AD
… well before 600–1200 AD, when descendants from Melanesia settled in the distant apices of the
Polynesian triangle, as evidenced by archaeological records (Kirch, 2000; Anderson and Sinoto, 2002;
Hurles et al., 2003).
A system for using dice to compose music at random, without having to know either
the techniques of composition or the rules of harmony, named Musikalisches
Würfelspiel (Musical Dice Game, MDG), became quite popular throughout Western
Europe in the 18th century:
• "The Ever Ready Composer of Polonaises and Minuets" was devised by Ph. Kirnberger as early as
1757.
• The famous chance music machine attributed to W.A. Mozart ("K 516f"), known since 1787,
consisted of numerous two-bar fragments of music named after the different letters of the Latin
alphabet and destined to be combined together either at random or following an anagram of
your beloved's name.
Every pitch in a musical piece is characterized with respect to the entire structure of the Markov chain by
its level of accessibility, estimated by the first passage time to it, that is, the expected number of steps
a random walk takes to first reach the pitch from any other pitch randomly chosen over the musical score.
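A minimal sketch of this estimate, under assumptions that are not in the slides: the score is a plain sequence of pitch names, transitions are counted between successive notes with the score treated as cyclic (so every pitch has a successor), and starting pitches are weighted by how often they occur. It solves the standard linear system for hitting times rather than using a spectral formula, and the function name is hypothetical.

```python
import numpy as np

def pitch_first_passage_times(notes):
    """Mean first-passage time to each pitch, averaged over the other pitches.

    Assumes at least two distinct pitches; the score is wrapped around cyclically.
    """
    notes = list(notes)
    states = sorted(set(notes))
    index = {s: k for k, s in enumerate(states)}
    n = len(states)
    counts = np.zeros((n, n))
    for a, b in zip(notes, notes[1:] + notes[:1]):   # cyclic successor pairs
        counts[index[a], index[b]] += 1
    T = counts / counts.sum(axis=1, keepdims=True)   # Markov transition matrix
    weights = np.array([notes.count(s) for s in states], dtype=float)
    result = {}
    for i, target in enumerate(states):
        others = [j for j in range(n) if j != i]
        Q = T[np.ix_(others, others)]                # walk restricted to non-target pitches
        hitting = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
        w = weights[others]
        result[target] = float(w @ hitting / w.sum())
    return result

# Toy score: C is visited after every other note, so it is the most accessible pitch.
print(pitch_first_passage_times(["C", "D", "C", "E"]))
```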
The values of first passage times to notes are strictly ordered in accordance with their
role in the tone scale of the musical composition.
By analyzing the typical magnitudes of first passage times to notes in one octave, we
can discover the individual creative style of a composer and trace the stylistic
influences between different composers.
Correlation and covariance matrices calculated for the medians of the first passage times in
a single octave provide the basis for the classification of composers, with respect to their
tonality preferences.