Lemmatizing and tagging a corpus :
which information for which linguistic purposes?
The example of the Greek and Latin LASLA databases
compared to others
Dominique LONGRÉE, LASLA – Université de Liège et FUSL (Bruxelles)
0. Introduction : objectives
• to share the expertise built up over 50 years by LASLA,
the « Laboratoire d’Analyse statistique des Langues anciennes »,
set up in 1961 at the University of Liège
• to open a discussion
1) which information should a database hold, and for which purposes ?
2) how does that information influence the results of our linguistic studies ?
• to compare the lemmatizing and tagging practices of LASLA
with the practices of other (Greek and Latin) databases
0. Introduction : plan
• LASLA and its databases
• the research project LatLem
• the Opera Latina Web interface and the Hyperbase-Latin CD-Rom
• the process of tokenization
• the process of lemmatization
• the process of tagging (morphosyntactic tags)
• the process of tagging (syntactic, semantic and pragmatic tags)
• the research project LatSynt
1. The Lasla Databases : Greek – Latin
• The Laboratory for Statistical Analysis of Classical Languages (L.A.S.L.A.)
- set up in September 1961
- the first research centre aiming to study classical languages (Greek and Latin)
using automatic data processing technologies
- part of the Faculty of Philosophy and Letters at the University of Liège
• Missions :
1) a detailed study of Greek and Latin languages and literatures using
computer techniques as well as statistical and quantitative methods;
2) building literary data banks and computer tools, in order to distribute
those data banks and make the most of them through all available media.
1. The Lasla Greek Database
1,200,000 words/tokens :
Attic orators : Andocides, Antiphon, Isocrates and Lysias
Aristotle : De Anima, De partibus animalium, Categories, Metaphysics,
Physics, Historia animalium.
Plato : 8 dialogues
All the classical tragedies : Aeschylus, Sophocles, Euripides and fragments
Pausanias
Christian authors, for example :
St John Chrysostom, De sacerdotio
Hesychius of Jerusalem, Homilies
1. The Lasla Greek Database
Facing each word form, the following data appear :
1. the reference of the word form, according to the ars citandi.
2. the lemma (the word as it appears in the dictionary of reference, which is
the Greek-English Lexicon of H. G. Liddell, R. Scott and H. S. Jones).
3. the grammatical category of the word (POS)

lemma          token          reference                POS
ὦ 2            ὦ              2 1 1 1 1    1    A      λ
κοινός         κοινὸν         2 1 1 2 2    2    A      γ
αὐτάδελφος     αὐτάδελφον     2 1 1 3 3    3    A      γ
Ἰσμήνη         Ἰσμήνης        2 1 1 4 4    4    A      β
κάρα 1         κάρα           2 1 1 5 5    5    A      β
ἆρα            ἆρ’            2 1 2 1 6    6    A      μ
οἶδα           οἶσθ’          2 1 2 2 7    7    A      ζ
…
1. The Lasla Latin Database
• Latin classical texts: 2,000,000 tokens
• The LASLA method :
- Étienne ÉVRARD, « Le laboratoire d’analyse statistique des
langues anciennes de l’Université de Liège », Mouvement
scientifique en Belgique, 9, 1962, p. 163-169 ;
- Joseph DENOOZ, « L’ordinateur et le latin, Techniques et
méthodes », Revue de l’organisation internationale pour l’étude des
langues anciennes par ordinateur, 1978, 4, p. 1-36.
1. The Lasla Latin Database
• the available fully lemmatized and encoded texts :
Classical texts (more than 2,000,000 words/tokens)
Caesar et alii
Cato
Catullus
Cicero :
rhetorical works : all;
philosophical works : in part
Curtius
Horatius
Iuvenalis
Lucretius
Ovidius
Persius
Petronius
Plautus : 8 plays
Plinius (Iunior)
Propertius
Sallustius
Seneca
Tacitus
Tibullus
Virgilius
1. The Lasla Latin Database
• the available fully lemmatized and encoded texts : other texts
Medieval Latin :
Sedulius Scottus
Hagiographic texts (300,000 words)
Neo-Latin :
Descartes
Spinoza
• next available texts : works in progress
Cicero (letters)
Cornelius Nepos
Livius
Suetonius
Historia Augusta
Busbecq (by L. Grailet)
1. The Lasla Latin Database
• 2,500,000 words/tokens
• Bibliotheca Teubneriana Latina : 13 million tokens
• fully lemmatized texts,
• with full morphosyntactic tagging and 1 syntactic tag
• systematically verified by a philologist
1. The Lasla Latin Database
For each word of the text :
1. the lemma (the word as it appears in the dictionary of reference, the Lexicon
totius latinitatis of Forcellini, ed. Corradini, Padua, 1864)
2. an index which distinguishes homograph lemmas
ET 1 = adverb, ET 2 = coordinating conjunction
or flags proper names and adjectives derived from proper names
N beside Roma means “proper name”
3. the form as it appears in the text
4. the reference, according to the ars citandi
5. the complete morphological analysis in alphanumeric format
6. for verbs, syntactic indications :
main-clause verbs
subordinate-clause verbs (sorted by subordination type)
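The six pieces of information listed above can be pictured as one record per word. The sketch below is only an illustration: the class name, field names and the `lemma_key` helper are assumptions, not the LASLA file format. The sample values are the ones shown for the sentence "Urbem Romam a principio reges habuere".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LaslaRecord:
    """One word of the text (hypothetical layout, not the LASLA file format)."""
    lemma: str                     # 1. dictionary headword, e.g. "ET"
    index: Optional[str]           # 2. homograph number ("1", "2") or "N" for a proper name
    form: str                      # 3. the form as it appears in the text
    reference: str                 # 4. reference according to the ars citandi
    analysis: str                  # 5. alphanumeric morphological code
    syntax: Optional[str] = None   # 6. verbs only: main or subordinate clause code

    def lemma_key(self) -> tuple:
        """Key that keeps homograph lemmas apart (ET 1 vs. ET 2)."""
        return (self.lemma, self.index)


records = [
    LaslaRecord("VRBS", None, "URBEM", "41 001 0001", "13C00"),
    LaslaRecord("ROMA", "N", "ROMAM", "41 001 0001", "11C00"),
    LaslaRecord("HABEO", None, "HABUERE", "41 001 0001", "52L14", "&"),
    LaslaRecord("ET", "2", "ET", "41 001 0001", "81000"),
]

proper_names = [r.form for r in records if r.index == "N"]
main_verbs = [r.form for r in records if r.syntax == "&"]
print(proper_names, main_verbs)
```

Keeping the index next to the lemma means a query for ET 2 (conjunction) never returns ET 1 (adverb).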
1. The Lasla Latin Database
• the information available for each Latin form

Lemma + index    Text form     Reference                  Analysis
VRBS             URBEM         41 001 0001   001   001    13C00
ROMA N           ROMAM         41 001 0001   002   002    11C00
AB               A             41 001 0001   003   003    70600
PRINCIPIVM       PRINCIPIO     41 001 0001   004   004    12F00
REX              REGES         41 001 0001   005   005    13J00
HABEO            HABUERE       41 001 0001   006   006    52L14 &
LIBERTAS         LIBERTATEM    41 001 0001   007   007    13C00
ET 2             ET            41 001 0001   008   008    81000
CONSVLATVS       CONSULATUM    41 001 0001   009   009    14C00
LVCIVS N         L.            41 001 0001   010   010    12A00
BRVTVS N         BRUTUS        41 001 0001   011   011    12A00
INSTITVO         INSTITUIT     41 001 0001   012   012    53C14 &

Index :
N : name
2 : homograph number (ET 1 = adverb, ET 2 = coordinating conjunction)
1. The Lasla Latin Database
• the information available for each Latin form : reading the analysis codes of the records above
Analysis
urbem : 13C00
1 : Noun ; 3 : 3rd Decl. ; C : Acc. sing.
habuere : 52L14 &
5 : Verb ; 2 : 2nd Conj. Act. ; L : 3rd pers. plur. ; 1 : Ind. ; 4 : Perfectum ; & : main clause
1. The Lasla Latin Database
• the information available for each Latin form

Lemma       Index  Form          Reference         Analysis  Synt.
AT          2      AT            PL0130003001001   81000              01 03300
HERCVLES           HERCULE       PL0130003002002   90000              01 03301
MEMORIA            MEMORIA       PL0130003003003   11F00              01 03302
PARENS      1      PARENTUM      PL0130003004004   13M00              01 03303
CLAVDIVS    N      CLAUDIUM      PL0130003005005   12C00              01 03304
CAESAR      N      CAESAREM      PL0130003006006   13C00              01 03305
FERO               FERUNT        PL0130003007007   56L11     &        01,03306
CVM         3      CUM           PL0130003008008   82032              01 03307
IN                 IN            PL0130003009009   70600              01 03308
PALATIVM           PALATIO       PL0130003010010   12F00              01 03309
SPATIOR            SPATIARETUR   PL0130003011011   5JC32     BN       01 03310
AVDIO              AUDISSET      PL0130003012012   54C35     BN       01 03311
QVE                QUE           PL0130003013013   81000              01 03312
CLAMOR             CLAMOREM      PL0130003014014   13C00              01,03313
CAVSA              CAUSAM        PL0130003015015   11C00              01 03314
REQVIRO            REQUISISSE    PL0130003016016   53074     - AG     01;03315

Analysis
spatiaretur : 5JC32 BN
5 : Verb ; J : 1st Conj. Dep. ; C : 3rd pers. sing. ; 3 : Subj. ; 2 : Imperfect ; BN : cum clause
requisisse : 53074 AG
5 : Verb ; 3 : 3rd Conj. Act. ; 0 : non-personal ; 7 : Inf. ; 4 : Perfectum ; AG : Accusativus cum Infinitivo
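To make the reading of these codes concrete, here is a minimal decoding sketch. It assumes the five-character layout suggested by the decoded examples (category, declension or conjugation, case/number or person/number, mood, tense). The lookup tables are deliberately partial: they contain only the values spelled out on the slides, and the function name and wording of the labels are illustrative.

```python
# Partial tables: only values that are decoded on the slides.
CATEGORY = {"1": "noun", "2": "adjective", "5": "verb"}
VERB_CLASS = {
    "2": "2nd conjugation, active",
    "3": "3rd conjugation, active",
    "J": "1st conjugation, deponent",
}
NOMINAL_FORM = {"A": "nominative singular", "C": "accusative singular"}
VERBAL_FORM = {"C": "3rd person singular", "L": "3rd person plural", "0": "non-personal"}
MOOD = {"1": "indicative", "3": "subjunctive", "7": "infinitive"}
TENSE = {"2": "imperfect", "4": "perfectum"}
CLAUSE = {"&": "main clause", "BN": "cum clause", "AG": "accusativus cum infinitivo"}


def _look(table, key, label):
    """Look a character up, flagging values this partial table does not cover."""
    return table.get(key, f"{label} {key!r} (not in this partial table)")


def decode(analysis, syntax=""):
    """Turn a five-character analysis code into a list of readable labels."""
    if len(analysis) != 5:
        raise ValueError("expected a five-character analysis code")
    cat, sub, form, mood, tense = analysis
    labels = [_look(CATEGORY, cat, "category")]
    if cat == "5":
        labels += [
            _look(VERB_CLASS, sub, "conjugation"),
            _look(VERBAL_FORM, form, "person/number"),
            _look(MOOD, mood, "mood"),
            _look(TENSE, tense, "tense"),
        ]
    else:
        labels += [f"declension/class {sub}", _look(NOMINAL_FORM, form, "case/number")]
    if syntax:
        labels.append(_look(CLAUSE, syntax, "clause type"))
    return labels


print(decode("52L14", "&"))
print(decode("13C00"))
```

Unknown characters are reported rather than guessed, which suits a table that is known to be incomplete.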
1. The Lasla system for tagging
• old-fashioned
• the LatLem project
1. The Lasla Greek and Latin Database
accessible through:
- indexes published by :
- G. Olms (Hildesheim)
- the Centre Informatique de Philosophie et Lettres (CIPL-Liège)
- for Greek texts : specific software
- for Latin texts :
- the Opera Latina Web interface
- the Hyperbase-Latin CD-Rom
1. The Lasla Latin Database
accessible through “Opera Latina” : www.ulg.ac.be/cipl/lsl.htm
1. The Lasla Latin Database
accessible through the CD-Rom “Hyperbase-Latin”
in collaboration with the UMR 6039 « Bases, corpus, langage » (CNRS-University of Nice)
2. “Tokenizing” a text : establishing the text
2. “Tokenizing” a text : segmenting the text into sentences
/Accusa senatum, accusa equestrem ordinem...,
accusa omnes ordines, omnes ciues../
/Accusa senatum,/
/accusa equestrem ordinem..., /
/accusa omnes ordines, omnes ciues../
2. “Tokenizing” a text : segmenting the text into words
compare with CD-Rom PHI 05 of the Packard Humanities Institute
• The string -ibil-
2. “Tokenizing” a text : segmenting the text into words
➢ compare with CD-Rom PHI 05 of the Packard Humanities Institute
• Vergil’s Aeneid :
arma virumque cano
arma uirumque cano
• clitic –que :
/que<blank>/, /que<,>/ , /que<;>/, /que<:>/, /que<.>/
atque, ubique, undique, quicumque
• amatus est / amatust
• animum aduertere / animaduertere
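The difficulty with the clitic can be sketched in a few lines. The tokenizer below is a toy under stated assumptions: it writes v as u so that both spellings of the Aeneid line match, and it detaches a final -que unless the word is in a small exception set. The set holds only the four words quoted on the slide; the function names are illustrative.

```python
import re

# Words ending in -que that do not contain the enclitic: the four quoted on
# the slide. A usable list would be much longer.
NOT_ENCLITIC = {"atque", "ubique", "undique", "quicumque"}


def normalise(form):
    """Lower-case and write v as u, so that virumque and uirumque match."""
    return form.lower().replace("v", "u")


def tokenize(text):
    """Split a text into word tokens, detaching the enclitic -que."""
    tokens = []
    for raw in re.findall(r"[^\W\d_]+", text):
        word = normalise(raw)
        if word.endswith("que") and len(word) > 3 and word not in NOT_ENCLITIC:
            tokens.extend([word[:-3], "que"])
        else:
            tokens.append(word)
    return tokens


print(tokenize("arma virumque cano"))
print(tokenize("atque ubique"))
```

A plain string search for "que" followed by a blank or a punctuation mark would also return atque or ubique, which is why the exception set is needed. Forms such as amatust or animaduertere would need similar, separately listed rules.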
3. Lemmatizing a text
➢ to allow the recognition of the same lemma in its various occurrences in a text,
independently of the variety of its forms in those occurrences
1) Greek and Latin are inflected languages

Case    Singular                                 Plural
        M            F            Nt             M               F               Nt
N.      ingens       ingens       ingens         ingent-ês       ingent-ês       ingent-ia
V.      ingens       ingens       ingens         ingent-ês       ingent-ês       ingent-ia
Acc.    ingent-em    ingent-em    ingens         ingent-ês/îs    ingent-ês/îs    ingent-ia
G.      ingent-is    ingent-is    ingent-is      ingent-ium      ingent-ium      ingent-ium
D.      ingent-î     ingent-î     ingent-î       ingent-ibus     ingent-ibus     ingent-ibus
Abl.    ingent-î     ingent-î     ingent-î       ingent-ibus     ingent-ibus     ingent-ibus
3. Lemmatizing a text
➢ to allow the recognition of the same lemma in its various occurrences in a text,
independently of the variety of its forms in those occurrences
2) the Latin spelling is not completely fixed
- assimilation phenomena (inlicio/illicio; adtuli/attuli; quidquid/quicquid)
- haplologies (exspecto/expecto)
- weak phonological status of some phonemes
(harena/arena, exhibeo/exibeo, mihi/mi, consul/cosul, etc.)
- transformation of diphthongs into monophthongs
(saeta/seta, plaudite/plodite, poenicus/punicus)
3. Lemmatizing a text
➢ to allow the recognition of the same lemma in its various occurrences in a text,
independently of the variety of its forms in those occurrences
2) the Latin spelling is not completely fixed
- elision, epenthesis, apheresis, contraction, as well as abbreviation
- disjunction of the parts of compound words, or tmesis
- res publica for respublica
- quo modo for quomodo
- ante... quam for antequam
- diachronic and synchronic morphological variants
- pater familias/pater familiae
- siet/sit
- igni/igne
- fecerit/faxit.
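One simple way to handle unstable spelling before lemmatization is a lookup from variant to reference spelling, plus a rule that rejoins disjoined compounds. The sketch below uses only pairs quoted on the slides. Which member of each pair counts as the reference spelling is an assumption made here for illustration, and the names are hypothetical.

```python
# Variant spelling -> reference spelling (direction of each mapping is assumed).
VARIANTS = {
    "inlicio": "illicio",
    "adtuli": "attuli",
    "quicquid": "quidquid",
    "expecto": "exspecto",
    "arena": "harena",
    "seta": "saeta",
    "plodite": "plaudite",
}

# Two-word sequences that the slides give as written-apart compounds.
DISJOINED = {("res", "publica"): "respublica", ("quo", "modo"): "quomodo"}


def normalise_spelling(tokens):
    """Rejoin disjoined compounds, then map variant spellings to the reference one."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in DISJOINED:
            out.append(DISJOINED[pair])
            i += 2
        else:
            out.append(VARIANTS.get(tokens[i], tokens[i]))
            i += 1
    return out


print(normalise_spelling(["res", "publica", "expecto"]))
```

A table like this cannot decide by itself whether quo modo in a given passage is the compound or two independent words, nor can it handle a discontinuous tmesis; those cases still call for a philologist's check.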
3. Lemmatizing a text
➢ populus 1, “the people”, and populus 2, “the poplar”
➢ licet 1, “it is allowed”, and licet 2, “although”
3. Lemmatizing a text
Dux word form : co-occurring word forms
Deviation Corpus Extract Word
038 141 141 dux
005 336 8 romanus
005 170 7 auctor
004 2733 19 erat
004 626 8 bello
004 506 8 exercitus
004 482 8 hostium
004 299 7 miles
004 151 5 comes
004 119 4 militiae
004 113 4 deae
004 106 4 cohortibus
004 87 4 campis
004 53 3 diuersis
004 44 3 copiarum
004 39 3 uoluntatis
004 37 3 rati
004 20 3 gregis
Dux lemma : co-occurring lemmas
Deviation Corpus Extract Word
038 932 934 dvx
015 1616 120 miles
014 1285 105 exercitvs1
009 25801 604 qve
009 2447 107 bellvm
009 2298 98 romanvsa
009 1910 88 hostis
008 1725 78 arma
008 1059 58 legio
007 1113 55 castra2
007 862 45 copia
007 615 37 imperator
007 519 36 avctor
006 40004 802 et2
006 1968 70 vrbs
006 786 39 tot
006 536 31 cohors
006 493 29 agmen
006 371 26 comes
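The contrast between the two tables comes from what is counted: a search on the form dux sees only that form and its neighbouring forms, while a search on the lemma gathers every inflected form. The sketch below shows the counting step only, on a hypothetical toy sample invented for illustration. It does not reproduce the deviation statistic of the tables, whose formula is not given here, and it assumes the sentence as the co-occurrence window.

```python
from collections import Counter

# Hypothetical toy sample: each "sentence" is a list of (form, lemma) pairs.
SENTENCES = [
    [("dux", "DVX"), ("milites", "MILES"), ("ducit", "DVCO")],
    [("ducem", "DVX"), ("militum", "MILES"), ("timent", "TIMEO")],
    [("duces", "DVX"), ("miles", "MILES"), ("audit", "AVDIO")],
]


def cooccurrents(sentences, pivot, level):
    """Count what shares a sentence with `pivot`, at form (0) or lemma (1) level."""
    counts = Counter()
    for sentence in sentences:
        items = [pair[level] for pair in sentence]
        if pivot in items:
            counts.update(item for item in items if item != pivot)
    return counts


by_form = cooccurrents(SENTENCES, "dux", level=0)
by_lemma = cooccurrents(SENTENCES, "DVX", level=1)
print(by_form)
print(by_lemma)
```

In the toy sample the form-level query finds one sentence, while the lemma-level query finds all three and groups milites, militum and miles under MILES.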
4. Morphosyntactic tagging
➢ using the POS tag
• research on Greek determiners (UMR 6039-Nice, Michèle Biraud)
- sequences α γ ε β and α μ γ ε β attested in the LASLA files
α : article
β : noun
γ : adjective
ε : adjective/pronoun
μ : particle
- οἱ ἄλλοι πάντες ἄνθρωποι
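Searching a tagged file for a sequence of POS letters amounts to sliding a window over the tags. The sketch below is a generic illustration, not the tool used in the research mentioned above. Its sample data are the seven tokens and POS letters of the Greek record shown earlier, and the function name is an assumption.

```python
def find_pos_sequences(tagged, pattern):
    """Return the token sequences whose POS tags match `pattern` exactly."""
    n = len(pattern)
    hits = []
    for start in range(len(tagged) - n + 1):
        window = tagged[start:start + n]
        if [pos for _, pos in window] == list(pattern):
            hits.append([token for token, _ in window])
    return hits


# (token, POS letter) pairs from the Greek sample record.
SAMPLE = [
    ("ὦ", "λ"), ("κοινὸν", "γ"), ("αὐτάδελφον", "γ"),
    ("Ἰσμήνης", "β"), ("κάρα", "β"), ("ἆρ’", "μ"), ("οἶσθ’", "ζ"),
]

print(find_pos_sequences(SAMPLE, ["γ", "γ", "β"]))
print(find_pos_sequences(SAMPLE, ["α", "γ", "ε", "β"]))
```

The second query, for the article-adjective-pronoun-noun sequence, returns nothing on this short sample because it contains no article.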
4. Morphosyntactic tagging
➢ using the POS tag
• research on parallel and reminiscent passages between literary works
(Koen Van Haegendoren, Liège)
4. Morphosyntactic tagging
➢ using the POS tag : to characterise authors and genres
4. Morphosyntactic tagging
➢ using the POS tag : LASLA Latin texts and BFM French medieval texts
4. Morphosyntactic tagging
➢ lemmatization and POS : the case of adjectives used as nouns
• a solution : amicus 1 (noun) vs. amicus 2 (adjective)
• but “sunt christiani” ?
> sanctus, beatus, fidelis or impius…
• another solution : sanctus is analyzed as
21A00_4 when used as an adjective (2 for adjective,
1 for first class,
A for nominative singular,
4 for masculine)
21A0014 when used as a noun (the additional 1 indicating this use).
• likewise for fideles / omnes fideles, credentes / omnes credentes, laudantes,
audientes, legentes
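With this second solution the lemma stays the same and the noun use is read from the code. A minimal check could look like the sketch below. It assumes a seven-character code in which the sixth character is "1" for a noun use and otherwise blank (written "_" above); the function name is illustrative.

```python
def is_used_as_noun(code):
    """True when the sixth character of a seven-character adjective code is '1'."""
    if len(code) != 7 or code[0] != "2":
        raise ValueError("expected a seven-character adjective code")
    return code[5] == "1"


print(is_used_as_noun("21A0014"))  # sanctus used as a noun
print(is_used_as_noun("21A00_4"))  # sanctus used as an adjective
```

Because both uses share one lemma, a query can count all occurrences of sanctus or only the substantivized ones, without splitting the dictionary entry.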
4. Morphosyntactic tagging
➢ full morphosyntactic analysis : Greek declension in Latin
4. Morphosyntactic tagging
➢ full morphosyntactic analysis : 4th conjugation
4. Morphosyntactic tagging
➢ full morphosyntactic analysis : the deviant forms
• a solution : a special code for the whole declension (domus)
• another solution : several lemmas
ex : a masculine accusative plural saxos, instead of the neuter plural saxa
- a form of a new masculine lemma saxus (not attested in the dictionaries);
- an anomalous form of the neuter lemma saxum
in both cases with the same codification (12 : 1 for noun, 2 for 2nd decl.)
but :
ex : facta est tonitrua in aera
4. Morphosyntactic tagging
➢ full morphosyntactic analysis : the deviant forms
ex : facta est tonitrua in aera
- tonitrua used as a nominative singular of the first declension.
- in the dictionaries,
tonitrus, us (4th decl.)
tonitruum, i (2nd decl.)
tonitrus, i (m) (2nd decl.)
tonitru (n) (4th decl.)
but no tonitrua.
- explained as a neuter plural of tonitruum reinterpreted
as a feminine singular, but how to lemmatize it?
• a solution :
- to consider tonitrua as a form of the lemma tonitruum;
- to record the peculiarity of its use only in the tag corresponding
to the morphosyntactic analysis
4. Morphosyntactic tagging
➢ full morphosyntactic analysis : the deviant forms
ex :
- regular forms of the lemma dulcis, “sweet”,
- tagged with the code of the adjectives of the second class in -is (24)
- anomalous form dulciam
- tagged as a form of the lemma dulcis
- with the code of the adjectives of the first class (21)
and in the Classical Latin corpus:
caelum (n) and caelus (m)
inferni (m) and inferna (n)
cingula (f), cingulus (m) and cingulum (n)
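Keeping the regular lemma and recording the anomaly in the tag has a practical benefit: deviant forms can be retrieved by comparing each record's class code with the regular class of its lemma. The sketch below illustrates the idea. The records are hypothetical: their codes are assembled from values attested on the slides (24 and 21, A for nominative singular, C for accusative singular) and are not taken from the database.

```python
# Regular two-character class code per lemma; only the value given above.
REGULAR_CLASS = {"DVLCIS": "24"}


def deviant_forms(records):
    """Return (lemma, form, code) where the code's class differs from the regular one."""
    found = []
    for lemma, form, code in records:
        regular = REGULAR_CLASS.get(lemma)
        if regular is not None and code[:2] != regular:
            found.append((lemma, form, code))
    return found


# Hypothetical records for illustration only.
records = [
    ("DVLCIS", "dulcis", "24A00"),
    ("DVLCIS", "dulciam", "21C00"),
]

print(deviant_forms(records))
```

Had dulciam been given a lemma of its own, this comparison would have nothing to work with and the link to dulcis would be lost.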
4. Morphosyntactic tagging
➢ using the full morphosyntactic analysis : narrative indicative tenses
4. Morphosyntactic tagging
➢ using the full morphosyntactic analysis : repeated sequences (adj.-adj.)
5. Syntactic tagging
➢ the ‘Treebank’ approaches
- Index Thomisticus
- Latin Treebanks at Perseus
based on :
- dependency grammar (Prague Dependency Treebank)
- the Latin grammar of H. Pinkster
➢ a training corpus is tagged manually and other corpora are encoded
by using automatic taggers
➢ problems :
- a method imposing a specific linguistic framework
- mixing theoretical linguistic frameworks
- producing data which are not verified
5. Syntactic tagging : the Project LatSynt
➢ an original and innovative research project on word order and Latin sentence
structures
• Objectives:
- to develop automatic parsing procedures based on word order rules
(in order to offer an alternative to ‘Treebank’ approaches)
- to evaluate the relevance of recent linguistic descriptions
- to offer new tools for textual data analysis (TDA)
- for modelling enunciative structure
- for classifying and segmenting Latin texts
• Methods:
- to develop automatic procedures grounded on
- the morphological information already encoded in the LASLA
database
- the linearity of the text
- to refine and improve the computer programmes in successive stages
5. Syntactic tagging : the Project LatSynt – the first stage
• Objective : - to mark out the boundaries of personal verb clauses
(provided with a subordinating word)
- to specify the level of their subordination (their “embedding”)
• from the alphanumeric data of the LASLA database:

1. Lemma   2. Index   3. Form       4. Reference       5. Morpho.   6. Synt.
SVM        1          erant         CE0060001001001    56L12        &
OMNINO                omnino        CE0060001002002    60000
ITER                  itinera       CE0060001003003    13J000
DVO                   duo           CE0060001004004    31J00 5
QVI        1          quibus        CE0060001005005    46O32 1
ITER                  itineribus    CE0060001006006    13O00
DOMVS                 domo          CE0060001007007    12F00
EXEO       1          exire         CE0060001008008    56071
POSSVM     1          possent       CE0060001009009    56L32        - LN
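The first stage can be sketched as a single pass over such records, using a stack to pair each subordinating word with the verb that closes its clause. This is an illustration under several assumptions, not the LatSynt programme: tags are built as the syntactic code plus the last two characters of the verb's analysis (mood and tense), following the pattern of tags such as LN14 and &0014; subordinators are recognised from a tiny hypothetical lemma set; and the analysis codes are cut to their first five characters.

```python
# Hypothetical and far from complete: only the subordinator of the example.
SUBORDINATORS = frozenset({"QVI"})


def clause_tags(records, subordinators=SUBORDINATORS):
    """Return clause tags in text order: '&00xx' for a main verb, '+'/'-' pairs otherwise."""
    tags = []
    waiting = []  # stack of positions in `tags` still waiting for their verb
    for lemma, form, analysis, syntax in records:
        if lemma in subordinators:
            waiting.append(len(tags))
            tags.append(None)  # placeholder, filled in when the verb is met
        elif syntax == "&":
            tags.append("&00" + analysis[3:5])
        elif syntax:
            code = syntax + analysis[3:5]
            if waiting:
                tags[waiting.pop()] = "+" + code
            tags.append("-" + code)
    return [tag for tag in tags if tag is not None]


# (lemma, form, analysis, syntactic code) from the records above.
CAESAR = [
    ("SVM", "erant", "56L12", "&"),
    ("OMNINO", "omnino", "60000", ""),
    ("ITER", "itinera", "13J00", ""),
    ("DVO", "duo", "31J00", ""),
    ("QVI", "quibus", "46O32", ""),
    ("ITER", "itineribus", "13O00", ""),
    ("DOMVS", "domo", "12F00", ""),
    ("EXEO", "exire", "56071", ""),
    ("POSSVM", "possent", "56L32", "LN"),
]

print(" ".join(clause_tags(CAESAR)))
```

The stack relies on the subordinate verb closing its clause; a clause with no subordinating word leaves only a closing tag, which matches the restriction of this stage to clauses provided with a subordinating word.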
5. Syntactic tagging : the Project LatSynt – the first stage
• the same records as in section 1 (AT HERCULE MEMORIA PARENTUM CLAUDIUM CAESAREM
FERUNT CUM IN PALATIO SPATIARETUR AUDISSET QUE CLAMOREM CAUSAM REQUISISSE),
with two fields highlighted
Analysis
audisset : BN = cum clause
cum : 82032, where 32 = subj. imp.
5. Syntactic tagging : the Project LatSynt – the first stage
- 1st stage :
quem […] uidi : the tag LN14 (‘subordination in QVI’ & ‘perfect indicative’) is
transferred to the records of both
- the quem form (lemma QVI) and
- the uidi form (lemma VIDEO)
- 2nd stage :
&0014 +LN14 -LN14 +LN12 +GK32 -GK32 -LN12
5. Syntactic tagging : the Project LatSynt – the first stage
- 3rd stage :
<&0014>[+LN14 -LN14]{+LN12 [+GK32 -GK32] -LN12}.
- Final stage:
Tacitus, Annals, 13,11,2 / P2849 /
<&0014>[+LN14-LN14]{+LN12[+GK32-GK32]-LN12}
<&secuta (est)> que lenitas in Plautium Lateranum [+quem ob
adulterium Messalinae ordine demotum -reddidit] senatui clementiam
suam obstringens crebris orationibus {+quas Seneca testificando
[+quam honesta -praeciperet] uel iactandi ingenii uoce principis uulgabat}
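The level of subordination can be read off the flat tag line with a stack: each opening tag raises the depth, each closing tag lowers it. The sketch below computes that depth and checks that the tags are balanced. It does not try to reproduce the choice between square and curly brackets, whose rule is not stated here, and the function name is an assumption.

```python
def embedding_depths(tag_line):
    """Map each subordinate clause tag to its embedding depth (1 = directly under the main clause)."""
    depths = []
    stack = []
    for tag in tag_line.split():
        if tag.startswith("+"):
            stack.append(tag[1:])
            depths.append((tag[1:], len(stack)))
        elif tag.startswith("-"):
            if not stack or stack.pop() != tag[1:]:
                raise ValueError(f"unbalanced clause tag: {tag}")
        # tags starting with "&" mark the main clause and carry no depth
    if stack:
        raise ValueError(f"unclosed clause(s): {stack}")
    return depths


STAGE_2 = "&0014 +LN14 -LN14 +LN12 +GK32 -GK32 -LN12"
print(embedding_depths(STAGE_2))
```

For the Tacitus sentence this gives depth 1 for the two relative clauses (LN14, LN12) and depth 2 for the clause embedded in the second one (GK32).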
5. Syntactic tagging : the Project LatSynt – the first stage
➢ analysing left dislocations:
5,22,2 P0909 1 [+BN35-BN35]<&0014>
ii [+cum ad castra -uenissent], nostri eruptione facta multis eorum
interfectis, capto etiam nobili duce Lugotorige suos incolumes
<&reduxerunt>
5. Syntactic tagging : the Project LatSynt – the first stage
•Results : to bring out
- linguistic regularities (prolepsis)
- distances between texts (Caesar – Tacitus)
- the importance of semantic and pragmatic phenomena
• Perspectives :
- to mark out the boundaries of complex syntagms (in order to mark out the
boundaries of subordinate clauses without a subordinator)
- to promote interactions with other research on text topology (at
the micro- and macro-structural levels)
- repeated segments (Hyperbase-Latin, in collaboration with BCL–
Nice/CNRS)
- syntactic and multidimensional « motifs » (in collaboration with BCL–Nice/CNRS)
- to use the results for text segmentation and classification
6. Tagging : what else ?
➢ semantic and pragmatic information
- semantic functions: Goal, Recipient, Agent, etc.
- pragmatic functions: Rheme, Topic, Focus, etc.
➢ building databases available for all kinds of research
➢ without imposing specific linguistic frameworks or analyses
➢ tokenization, lemmatization and tagging :
- are not trivial processes
- require thorough theoretical thinking