Musical dice game


Markov chain methods: Cases of study
Markov chain methods in Language
Evolution and Musical Dice Games
Dimitri Volchenkov (Bielefeld University)- 2nd 45′ talk
Changes in languages go on constantly affecting words through
various innovations and borrowings.
Although tree diagrams have become ubiquitous in representations of language
taxonomies, they fail to reveal the full complexity of language affinity, which is characterized
by many phonetic, morphophonemic, lexical, and grammatical isoglosses.
Not least, the simple relation of ancestry underlying a branching
family tree structure cannot capture the complex social, cultural, and political factors
molding the extreme historical language contacts.
• Evolutionary trees conflict
with each other and with
the traditionally accepted
family arborescence;
• The languages known as
isolates cannot be reliably
classified into any branch
with other living languages.
From Gray, R. D. and Q. D. Atkinson. 2003. Language tree divergence times
support the Anatolian theory of Indo-European origin. Nature 426: 435-439.
A number of additional edges and the considerable reticulation in the
central part of the usual phylogenetic trees represent
conflicts between the different splits, which arise from contacts and combined
interactions between languages.
The more comprehensive the graphical model is, the less clear its visual
apprehension and interpretation become.
From "The Shape and Fabric of Language Evolution" by S.J. Greenhill, Q.D.
Atkinson, A. Meade, R.D. Gray
Idea: To geometrize phylogenetic
relations using the Markov chain
approach!
1. We present a fully automated method for building genetic language taxonomies where
the relationships between different languages in the language family are represented
geometrically, in terms of distances and angles, as in Euclidean geometry of
everyday intuition.
2. We have tested our method on the 50 major languages of the Indo-European language
family;
3. and then investigated the Austronesian phylogeny, again over 50 languages.
• encoding: Express the relations between some linguistic features in a numerical form.
• representation: Introduce a metric; implement the various clustering techniques to simplify the representations of the data.
• interpretation: The meanings of the identified components have to be assessed.
The idea: assess the phylogeny of languages via the similarity between words
having the same meanings.
D. D’Urville, ”Sur les îles du Grand Océan”, Bulletin de la
Société de Géographie 17, 1-21 (1832).
During his voyages in the Pacific aboard the “Astrolabe” from 1826
to 1829, Dumont d'Urville (1790–1842) collected comparative lists of 115 basic terms
posited as especially stable.
« La langue est partout la même! » (The language is everywhere the same!)
He detected the Austronesian group of languages.
Glottochronology
The idea is to count the number of words that
have been replaced in a language, using
a list of terms that are
common to all cultures and concern
the basic activities of humans.
The choice is motivated by the fact that the
vocabulary learned during childhood changes
very slowly over time.
Morris Swadesh (1909 - 1967)
Swadesh list of words
1. I (Pers.Pron.1.Sg.)
2. You (2.sg)
3. We
4. This
5. That
6. Who?
7. What?
8. Not
9. All (of a number)
10. Many
11. One
12. Two
13. Big
etc…
Glottochronologists use the percentage of shared cognates
(words inferred to have a common historical origin) in order to
compute the distance between pairs of languages.
$$ \mathrm{dist}(l_1, l_2) \propto \log t $$
Changes in vocabulary are supposed to accumulate year after
year, and two languages initially similar become more and more
different.
* Identification of cognates is a matter of sensibility,
personal knowledge, and historical records:
Greek: gala (genitive galactos)
Latin: lac
Spanish: leche
* The rates of lexical change differ from word to word,
probably in relation to the frequency of use of the
associated meanings.
encoding
* Comparison over a large vocabulary is LESS ACCURATE, as many
similar words carry information about the extreme historical
contacts rather than about the actual language similarity.
Brahui is Dravidian in its syntactic structure,
although 85% of all its words are Indo-European.
* Bias between orthographic and phonetic realizations of meanings!
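Levenshtein’s distance:
The edit distance is a measure of the similarity between two strings: the number of
deletions, insertions, or substitutions required to transform one string into
another.
MILCH → MILK: substitute C with K and delete H, i.e. 2 edits over the 5 characters of the longer word:
$$ D(\text{milch}, \text{milk}) = \frac{2}{5} \in [0, 1] $$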
The normalized edit distance between the orthographic realizations of two words can
be interpreted as the probability of mismatch between two characters picked from the
words at random.
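The normalized edit distance can be sketched in a few lines of Python. This is a minimal illustration, assuming normalization by the length of the longer word (which reproduces the MILCH/MILK value of 2/5); the function names are illustrative and do not come from the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of deletions, insertions, or substitutions turning a into b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(normalized_distance("milch", "milk"))  # 0.4
```

Dividing by the longer length keeps the value in [0, 1], so it can be read as a mismatch probability.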
The short list of meanings and its stability:
The stability of the meaning α over a sample of N languages is defined by
$$ S(\alpha) = 1 - \frac{2}{N(N-1)} \sum_{i>j} D_\alpha(l_i, l_j) \in [0, 1], $$
where D_α(l_i, l_j) is the normalized edit distance between the realizations of the meaning α in the languages l_i and l_j.
The averaged distance on the r.h.s. is smaller for words corresponding to meanings with a lower
rate of lexical evolution, since they tend to remain more similar in two languages. Therefore, a larger
S(α) corresponds to greater stability.
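A minimal Python sketch of the stability measure, under the assumption that S(α) is one minus the average pairwise normalized edit distance between the realizations of the meaning α (normalization by the longer word is also an assumption). The three spellings of "milk" are only an illustrative input, and the helper names are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer word length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def stability(realizations):
    """One minus the mean pairwise distance between realizations of one meaning."""
    pairs = list(combinations(realizations, 2))
    if not pairs:
        raise ValueError("need at least two languages")
    mean = sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean


# Illustrative input: "milk" in German, English, and Dutch.
print(round(stability(["milch", "milk", "melk"]), 3))  # 0.583
```

Meanings whose spellings barely differ across the sample score close to 1; meanings with unrelated spellings score close to 0.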
One should keep all
the meanings with
higher information,
take at least some of
the most stable
meanings in the linear
part of the curve and
exclude completely
those meanings with
lower information.
stable vocabulary
A WELL ADJUSTED INPUT VOCABULARY EXHIBITING UNIFORMLY HIGH
STABILITY OF ITEMS, WITH RESPECT TO THE DEFINED DISTANCE.
The 20 most stable words for
the IE and AU language
families, with their stability
values within the family.
Different meanings are stable
in different language groups!
1. Rather than a complete dictionary, we have used a short list of 200 words (Swadesh’s list), adopted to
reconstruct systematic sound correspondences between the languages. It contains terms which are
common to all cultures and known to change at a very slow rate.
2. Swadesh’s lists for the languages written in different alphabets had already been
transliterated into English by Dyen et al. (1997) and Greenhill et al. (2008).
3. We have studied languages within a language family.
distance
A DISTANCE ACCUMULATING THE DIFFERENCES IN SYSTEMATIC SOUND
CORRESPONDENCES BETWEEN THE REALIZATIONS OF INDIVIDUAL
MEANINGS.
The lexical distance between l1 and l2, can be interpreted
as the average probability to distinguish them by a mismatch
between two characters randomly chosen from the
orthographic realizations of Swadesh’s meanings.
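The lexical distance can be sketched as an average of per-meaning distances. The Python below is a hedged illustration: it assumes the average runs over the meanings present in both word lists and that each per-meaning distance is the edit distance normalized by the longer word. The tiny German/English lists and all function names are illustrative only.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer word length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def lexical_distance(words_1, words_2):
    """Average normalized distance over the meanings shared by both lists."""
    shared = sorted(set(words_1) & set(words_2))
    if not shared:
        raise ValueError("the two word lists share no meanings")
    total = sum(normalized_distance(words_1[m], words_2[m]) for m in shared)
    return total / len(shared)


# Illustrative three-meaning lists (a real Swadesh list has about 200 entries).
german = {"milk": "milch", "water": "wasser", "one": "eins"}
english = {"milk": "milk", "water": "water", "one": "one"}
print(round(lexical_distance(german, english), 3))  # 0.494
```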
clustering
A CLUSTERING MAPS THE MATRIX OF LEXICAL
DISTANCES CALCULATED OVER THE OPTIMIZED
VOCABULARY INTO LOW-DIMENSIONAL SPACE
OF LANGUAGE GROUPS.
No historical development in language
can be described only in terms of
‘pair-wise’ interactions; it reflects a
genuine higher-order influence among
the different language groups.
The kernel PCA method (Schölkopf et
al., 1998) generalizes PCA to the case
where we are interested in taking into account all
higher-order correlations between data
instances.
The lexical distance between l1 and l2 is the average
probability to distinguish them by a mismatch between
two characters randomly chosen from the orthographic
realizations of a Swadesh’s meaning.
$$ P(l_i, l_j) = \lim_{n \to \infty} \sum_{k=0}^{n} T^k(l_i, l_j) = \left( \frac{1}{1 - T} \right)_{ij}, \qquad T_{ij} = \frac{d(l_i, l_j)}{\sum_{k=1}^{N} d(l_i, l_k)} $$
P is the total probability of successful
classification by an infinite series of
matchings, for the two languages in the
language family.
$$ P q_k = \lambda_k q_k, \qquad 0 \le \lambda_1 \le \lambda_2 \le \dots \le \lambda_N $$
The rank-ordering of data traits in accordance with their
eigenvalues provides us with the natural geometric
framework for dimensionality reduction.
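The first step of this construction, turning the matrix of lexical distances into the random-walk matrix T by normalizing each row, can be sketched as follows. This is a minimal Python illustration; the 3×3 matrix holds made-up values for three hypothetical languages, and the function name is not from the talk.

```python
def transition_matrix(d):
    """Row-normalize a distance matrix: T[i][j] = d[i][j] / sum_k d[i][k]."""
    T = []
    for row in d:
        total = sum(row)
        if total == 0:
            raise ValueError("a row of zero distances cannot be normalized")
        T.append([x / total for x in row])
    return T


# Made-up lexical distances between three hypothetical languages.
d = [[0.0, 0.4, 0.6],
     [0.4, 0.0, 0.25],
     [0.6, 0.25, 0.0]]

T = transition_matrix(d)
for row in T:
    print([round(x, 3) for x in row])
```

Each row of T sums to one, so T defines a random walk over the languages. Note that T is in general not symmetric even though d is, because every row is divided by its own total.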
representation
1. The four well-separated monophyletic spines
represent the four biggest traditional IE language
groups: Romance & Celtic, Germanic, Balto-Slavic, and
Indo-Iranian;
2. The Greek, Romance, Celtic, and Germanic languages
form a class characterized by approximately the
same azimuth angle (they belong to one plane);
3. The Indo-Iranian, Balto-Slavic, Armenian, and Albanian
languages form another class, with respect to the
zenith angle.
The systematic sound correspondences between the Swadesh words across the different
languages perfectly coincide with the well-known centum-satem isogloss of the IE family
(reflecting the IE numeral ‘100’), related to the evolution of the phonetically unstable palatovelar
order.
The normal probability plots fitting the distances r of
language points from the ‘center of mass’ to univariate
normality.
The data points were ranked and then plotted against their expected values under
normality, so that departures from linearity signify departures from normality.
interpretation
The univariate normal distribution is closely related to the time evolution of a
mass-density function under homogeneous diffusion in one dimension,
$$ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \;\longleftrightarrow\; \frac{1}{\sqrt{2\pi\sigma^2 t}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2 t} \right) $$
in which the mean value μ is interpreted as the coordinate of a point where all mass was
initially concentrated, and variance σ2 ∝ t grows linearly with time.
This has nothing to do with the traditional glottochronological assumption about the steady
borrowing rates of cognates (Embleton, 1986)!
The values of variance σ2 give a statistically consistent estimate of age for each language
group.
Anchor events:
1. the last Celtic migration (to the Balkans and Asia Minor) (300 BC),
2. the division of the Roman Empire (500 AD),
3. the migration of German tribes to the Danube River (100 AD),
4. the establishment of the Avars Khaganate (590 AD) overspreading
Slavic people who did the bulk of the fighting across Europe.
From the time–variance ratio we can retrieve the probable dates for:
• The break-up of the Proto-Indo-Iranian continuum: by 2,400 BC
(cf. the migration from the early Andronovo archaeological horizon; Bryant, 2001).
• The end of common Balto-Slavic history: before 1,400 BC
(cf. the archaeological dating of the Trzciniec-Komarov culture).
• The separation of Indo-Aryans from Indo-Iranians: before 400 BC
(probably as a result of the Aryan migration across India to Ceylon, as early as 483 BC; Mcleod, 2002).
• The division of the Persian polity into a number of Iranian tribes, after the end of the Greco-Persian wars
(Green, 1996): before 400 BC.
The Kurgan scenario postulates the IE origin among the people of the
“Kurgan culture” (early 4th millennium BC) in the Pontic
steppe (Gimbutas, 1982).
The Anatolian hypothesis suggests the origin
in Neolithic Anatolia and associates the
expansion with the Neolithic agricultural
revolution in the 8th and 6th millennia
BC (Renfrew, 1987).
Einkorn wheat
The graphical test for three-variate normality of the distribution
of the distances of the five proto-languages from a statistically
determined central point is obtained by extending the notion of the
normal probability plot. The χ-square distribution is used to test the
goodness of fit of the observed distribution: departures from
three-variate normality are indicated by departures from linearity.
The previously determined time–variance ratio
then dates the initial break-up of the Proto-Indo-Europeans
back to 7,400 BC, pointing at an early Neolithic date.
The components probe for a sample of 50 AU languages immediately uncovers both the
Formosan (F) and Malayo-Polynesian (MP) branches of the entire language family.
Headhunters
The distribution of languages spoken within Maritime Southeast Asia,
Melanesia, Western Polynesia and of the Paiwan language group in
Taiwan over the distances from the center of the diagram conforms
to univariate normality. This suggests that an interaction sphere
had existed encompassing the whole region, from the
Philippines and Southern Indonesia through the Solomon Islands to
Western Polynesia, where ideas and cultural traits were shared and
spread, as attested by trade (Bellwood and Koon, 1989; Kirch, 1997) and
translocation of animals (Matisoo-Smith and Robins, 2004;
Larson et al., 2007) among shoreline communities.
By 550 AD: well before 600–1200 AD, when descendants from Melanesia settled in the distant apices of the
Polynesian triangle, as evidenced by archaeological records (Kirch, 2000; Anderson and Sinoto, 2002;
Hurles et al., 2003).
The distributions of languages spoken in the islands of East Polynesia and of the Atayal language groups in Taiwan
over the radial coordinate from the center of the geometric representation depart from normality.
They seem to have evolved without extensive contacts with Melanesian populations, perhaps because of a rapid
movement of the ancestors of the Polynesians from South-East Asia, as suggested by the ‘express train’ model
(Diamond, 1988). This is consistent with the multiple lines of evidence for comparatively reduced genetic variation among
human groups in Remote Oceania (Lum et al., 2002; Kayser et al., 2006; Friedländer et al., 2008).
In the ‘adiabatic’ model of evolution, contact borrowings are improbable, while
the orthographic realizations of Swadesh’s meanings accumulate emergent variations in spelling,
so that the radial coordinate of a remote language can formally grow unboundedly with isolation time.
Recognized in 1820
Recognized in 1750
• Tahiti is the foremost Austronesian settlement in
Remote Oceania, attested as early as 300 BC
(Kirch, 2000);
• According to archaeological reconstructions (Kirch, 2000;
Anderson & Sinoto, 2002; Hurles et al., 2003),
Hawaii had been settled by 600 AD;
• New Zealand by 1000 AD, testifying to the earliest
outset dates for the related languages.
The log-linear plot fitting the distances to remote languages
riding an ‘express train’ in the geometric representation to an
exponential distribution.
encoding
The lexical distances between languages are taken as the average probability to
distinguish them by a mismatch between two characters randomly chosen from
the orthographic realizations of a Swadesh’s meaning.
representation
We considered an infinite sequential process of language classification described
by random walks on the matrix of lexical distances. As a result, the relationships
between languages belonging to one and the same language family are translated
into distances and angles in multidimensional Euclidean space.
interpretation
The derived geometric representations of language taxonomy are used to test
various statistical hypotheses about the evolution of languages and to make accurate
inferences about the most significant events of human history by tracking changes in language
families through time.
The proposed method is fully automated and computationally simple.
Markov Chain Analysis of Musical Dice Games
A system for using dice to compose music randomly, without having to know
either the techniques of composition or the rules of harmony, named
Musikalisches Würfelspiel (Musical dice game, MDG), had become quite popular
throughout Western Europe in the 18th century. "The Ever Ready Composer of
Polonaises and Minuets" was devised by Ph. Kirnberger as early as 1757.
The famous chance music machine attributed to W.A. Mozart ("K
516f"), known since 1787, consisted of numerous two-bar fragments of music named
after the different letters of the Latin alphabet and destined to be
combined together either at random or following an anagram of
your beloved's name.
On the one hand, studies of Markov chains
aggregating pitches in musical pieces might
provide a neat route to efficient algorithms
for identifying the musical features
important for a listener.
On the other hand, the analysis of weighted
directed graphs corresponding to time-irreversible random walks defined on a
finite set of states (pitches) belonging to
a cyclic group, under the assumption of
octave equivalency, is a daunting task for the
contemporary theory of networks and is
therefore of special theoretical interest.
In the MDG, we consider a note as an elementary event providing a natural
discretization of musical phenomena.
Namely, given the entire keyboard K of 128 notes (standard for MIDI representations
of music), corresponding to a pitch range of about 10.5 octaves, each divided into 12
semitones, we regard a note as a discrete random variable X_t.
In the musical dice game, a piece is generated by patching notes X_t, taking values
from the set of pitches that sound good together, into a temporal sequence {X_t}, t ≥ 1.
Musical Dice Game is not a particular musical composition!
(*)
The relations between notes in (*) are described in terms of probabilities and
expected numbers of random steps rather than physical time. Thus the actual length N
of a composition is formally taken as N → ∞, or for as long as you keep rolling the dice.
F. Liszt, Consolation No. 1
W.A. Mozart, Eine kleine Nachtmusik
J.S. Bach, Prelude BWV 999
R. Wagner, Das Rheingold (Entrance of the Gods)
Markov’s chains determining random walks on such graphs are not
ergodic: it may be impossible to go from every note to every other note
following the score of the musical piece.
The values of first passage times to notes are strictly ordered in accordance
to their role in the tone scale of the musical composition.
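A hedged Python sketch of how such quantities could be computed from a score. It assumes that transition probabilities are estimated from consecutive notes reduced to pitch classes (MIDI number modulo 12), and it uses the standard hitting-time equations; the talk's exact definition of first passage time (for example, how start notes are averaged) is not given here, so this is only one plausible reading. The note sequence is made up, and all function names are illustrative.

```python
from collections import Counter, defaultdict


def transition_matrix(midi_notes):
    """Estimate note-to-note transition probabilities under octave equivalence."""
    classes = [n % 12 for n in midi_notes]  # pitch classes 0..11
    counts = defaultdict(Counter)
    for a, b in zip(classes, classes[1:]):
        counts[a][b] += 1
    return {a: {b: c / sum(row.values()) for b, c in row.items()}
            for a, row in counts.items()}


def first_passage_times(T, target, sweeps=10_000):
    """Expected number of steps to first reach `target` from each state.

    Iterates h[i] = 1 + sum_k T[i][k] * h[k] with h[target] = 0. The iteration
    converges only if `target` can be reached from every state, which may fail
    for the non-ergodic chains mentioned above.
    """
    states = set(T) | {b for row in T.values() for b in row}
    h = {s: 0.0 for s in states}
    for _ in range(sweeps):
        h = {s: 0.0 if s == target else
             1.0 + sum(p * h[k] for k, p in T.get(s, {}).items())
             for s in states}
    return h


def recurrence_time(T, target):
    """Expected number of steps to return to `target` after leaving it."""
    h = first_passage_times(T, target)
    return 1.0 + sum(p * h[k] for k, p in T[target].items())


# Made-up sequence on E, F#, G (MIDI 64, 66, 67), not taken from any score.
notes = [64, 66, 67, 64, 67, 66, 64]
T = transition_matrix(notes)
print(round(recurrence_time(T, 4), 6))  # 3.0
```

In this toy chain every note moves to either of the other two with probability 1/2, so each note is reached after 2 steps on average and revisited every 3 steps.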
The basic pitches for the E minor
scale are E, F#, G, A, B, C, and D.
The E major scale is based on E, F#,
G#, A, B, C#, and D#.
The A major scale consists of A, B,
C#, D, E, F#, and G#.
Tonality of Western music
The log-log scatter plot contains 12×804 points representing the recurrence time vs. the
first passage time to the 12 notes of one octave, over the MDG based on 804
compositions of 29 composers.
First passage times to notes characterize a composer
By analyzing the typical magnitudes of
first passage times to notes in one
octave, we can discover the individual
creative style of a composer and trace
the stylistic influences between
different composers.
Correlation and covariance matrices calculated for the medians of the first
passage times in a single octave provide the basis for the classification of
composers, with respect to their tonality preferences.
The correlogram allows for identifying
a number of groups of composers
exhibiting similar preferences in
the use of tone scales, as correlations
are positive and strong within
each tone group while being weak or
even negative between the different
groups.
Groups labelled in the correlogram: Classical period; Romantic period in classical music; Middle Romantic era; Late Romantic era.
Interestingly, the names of composers that are contiguous in the correlogram are often
found together in musical concerts and on records performed by commercial
musicians.