Text Pre-Processing and Categorization (Meta-Data)

The Corpógrafo – a Web-based environment for corpora research
Corpógrafo V3: two years after…
• Two years after its debut at CL2003, Corpógrafo reaches version 3
• Corpógrafo is now a mature environment, ready to be further expanded
• More than 100 regular users. More than 400 user accounts.
• Many lessons learned from practice: usability, technology, linguistics
• A corpus linguistics research community has grown along with Corpógrafo
• Large Terminology / Knowledge Engineering projects are now possible
Where to find Corpógrafo?
Have a look at (version 3 will be on-line in August 2005):
http://www.linguateca.pt/corpografo
Corpógrafo’s workflow overview:
[Workflow diagram; its labels, as far as they can be recovered from the poster:]
• Collect Texts: from the Web, or by Text Extraction from PDF, PS, TXT, DOC and HTML files
• Text Pre-Processing and Categorization (Meta-Data)
• Corpora (General Corpora Studies): Regexp Concordance, KWIC / Window, N-Grams
• Terminology Extraction: extract Term Candidates (Term Candidates list)
• search Term Definitions and Semantic Relations
• create and manage multilingual Terminology DB’s (several languages)
• store terminological entries, examples and Meta-Data in DB
1. edit term meta-data (source, authors, morphology, etc.)
2. match bilingual equivalents
3. obtain statistical information from corpora about each term
• Media file repository (JPEG, WMF, DCR, QT, WAV). Associate:
1. explanation videos / pictures
2. Sound file (pronunciation)
Under the hood
• Corpógrafo is built over SAGI, a web operating system
developed by Linguateca. SAGI uses “LAMP”:
Linux OS, Apache Web Server, MySQL RDBMS, Perl
• SAGI allows complete control over CGI processes and
helps programmers build web interfaces
Motivation
Build an environment that helps users in the entire process of corpora research. The tool
should not require advanced computer skills and should be easy to use by all types of
users, from students to researchers. Functionalities required:
• Web access: use anywhere, anytime from any computer. No software installations.
• Collect texts: text extraction from structured files, downloading texts from the Web
• Text pre-processing: “cleaning” text, segmentation, text annotation, text encoding in a
searchable or exchangeable format;
• Corpus search: regular expression concordances, collocation extraction, frequency-based
statistics (N-gram counts);
• Information extraction: terminology, semantic relations, conceptual maps
• Knowledge-resource building: specific-domain glossaries, thesauri, terminological
databases and ontologies; categorized word-lists;
• Comparable corpora studies: compilation and search over comparable corpora
• Exporting results to other formats and applications: to standard terminological
databases, translation memories, etc.
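As an illustration of the corpus search functionality listed above (regular expression concordances and N-gram counts), here is a minimal sketch in Python. It is not Corpógrafo's code, which runs on Perl; the function names `tokenize`, `kwic` and `ngram_counts`, the sample sentence and the context window size are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (Unicode-aware, so accented words stay whole)."""
    return re.findall(r"\w+", text.lower())


def kwic(text, pattern, window=30):
    """Return (left context, match, right context) for every regex match in text.

    The window is measured in characters, which is an assumption of this sketch.
    """
    rows = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        rows.append((left, m.group(0), right))
    return rows


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text, used only to exercise the functions.
sample = ("The corpus search tool finds terms in the corpus. "
          "A corpus search returns each term in context.")

if __name__ == "__main__":
    # Print a simple keyword-in-context listing, one match per line.
    for left, hit, right in kwic(sample, r"\bterms?\b", window=20):
        print(f"{left:>20} [{hit}] {right}")
    # Print the most frequent bigrams in the sample.
    print(ngram_counts(tokenize(sample), 2).most_common(3))
```

A real corpus tool would work over indexed, pre-processed texts rather than a single string; the sketch only shows the shape of the two operations.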
FLUP/CLUP: http://www.letras.up.pt
LINGUATECA: http://www.linguateca.pt
Corpora
1. query DB, navigate DB
2. export DB to XML file
3. automatic generation of documentation (HTML)
The future…
• Corpógrafo under GPL soon.
• Multiple Corpógrafos installed in several university
departments and countries: the “Corpógrafo Community”
• Centralized database to collect terminology / conceptual
maps from the Corpógrafo Community
• Large-Scale Terminology / Knowledge Resources for
Specialized Search Engines, Technical Writing, Translation, etc.
Linguateca – Our mission!
• Improving processing and research of the Portuguese language
• Fostering collaboration among researchers
• Providing public and free-of-charge tools to the community
Luís Sarmento
Belinda Maia
Diana Santos
Luís Cabral
Ana Sofia Pinto