
Corpora and the ‘general public’
Belinda Maia and Luís Sarmento
Universidade do Porto
Message on translation mailing list (12.03.03)
I'm considering doing a research about: Using Reference Material in Translation. But I don't know if there is any resources to depend on. And to be frank, I don't understand what is the topic about. What have reference materials to do with translation?
Reply on translation mailing list (13.03.03)
Look for anything dealing with parallel and comparable corpora. There are a number of relevant texts on the subject.
Corpora is the magic new word in the translation world.
Reply on translation mailing list (13.03.03)
Parallel texts, dictionaries, terminology databases, etc. can be considered reference material.
If you use translation memories, reference material has a narrow definition: it is the text (stored in a translation memory database) selected to be reused in your current translation.
btw, where does your interest on the topic come from?
Corpora and the ‘converted’
• Converted = linguists, lexicographers, and others who have followed the development of corpora over a period of years
• It seems so obvious to us that corpora are useful!
Corpora and the teacher
Corpora and the student
Simplicity of traditional dictionaries
Using dictionaries
Digital dictionaries + concordances
Problems with corpora use
Using corpora
Findings on use of corpora
Hits: 30106
Hits using regular expressions: 9882
Average results returned per hit: 1253
Hits with 0 results: 41%
Hits with fewer than 10 results: 65%
Hits with fewer than 100 results: 82%
Hits with more than 1253 results: 5%
Users who make one hit and never return: 23%
Users who make at least 10 hits: 46%
Users who make at least 100 hits: 8%
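The figures above are simple shares over a query log. A minimal sketch of how such a summary could be computed, assuming the log has already been reduced to one result count per hit: the function name `result_count_summary` and the sample counts are illustrative inventions, not the tool's actual code or data.

```python
def result_count_summary(result_counts, thresholds=(10, 100)):
    """Summarise how many results each hit (query) returned.

    result_counts: one integer per logged hit (hypothetical input format).
    Returns the number of hits, the mean results per hit, the share of
    hits with zero results, and the share below each threshold.
    """
    counts = list(result_counts)
    if not counts:
        raise ValueError("no hits logged")
    total = len(counts)
    summary = {
        "queries": total,
        "mean_results": sum(counts) / total,
        "share_zero": sum(1 for c in counts if c == 0) / total,
    }
    for t in thresholds:
        summary[f"share_below_{t}"] = sum(1 for c in counts if c < t) / total
    return summary


# Invented sample: most hits return little or nothing, one returns a lot.
sample_counts = [0, 0, 3, 7, 12, 45, 150, 9000]
print(result_count_summary(sample_counts))
```

In this invented sample the mean is above 1000 results per hit even though three quarters of the hits return fewer than 100. A few very large result sets can dominate the average in the same way, which is how a high mean can sit alongside a large share of near-empty hits.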
Statistics for those making at least 15 hits
Total no. of sessions: 2820
No. of requests per session: 9
Maximum no. of requests per session: 428
Average length of session: 11 minutes
Users with at least 10 sessions: 21%
Users with at least 20 sessions: 11%
Users with at least 100 sessions: 1.6%
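Session figures like these depend on how a "session" is cut out of the request log, which the slides do not state. The sketch below assumes a common convention: a new session starts when a user has been inactive for more than 30 minutes. The timeout, the `(user_id, timestamp)` log format and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout


def split_sessions(requests, gap=SESSION_GAP):
    """Group (user_id, timestamp) pairs into per-user sessions."""
    by_user = defaultdict(list)
    for user, ts in requests:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])     # long pause: new session
            else:
                user_sessions[-1].append(ts)   # same session continues
        sessions[user] = user_sessions
    return sessions


def session_stats(sessions):
    """Aggregate counts and durations over all sessions of all users."""
    flat = [s for user_sessions in sessions.values() for s in user_sessions]
    n = len(flat)
    return {
        "sessions": n,
        "mean_requests": sum(len(s) for s in flat) / n,
        "max_requests": max(len(s) for s in flat),
        "mean_minutes": sum((s[-1] - s[0]).total_seconds() for s in flat) / n / 60,
    }


# Invented log: user "a" works in the morning and returns in the afternoon.
log = [
    ("a", datetime(2003, 3, 12, 10, 0)),
    ("a", datetime(2003, 3, 12, 10, 5)),
    ("a", datetime(2003, 3, 12, 10, 11)),
    ("a", datetime(2003, 3, 12, 14, 0)),
    ("b", datetime(2003, 3, 12, 9, 0)),
]
stats = session_stats(split_sessions(log))
print(stats)
```

A shorter or longer timeout would change the number of sessions and their average length, so figures from different tools are only comparable when the cut-off is the same.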
Training in corpus use
Training in corpora analysis
Training with tools
GC – Integrated Web Environment for Corpora Linguistics
What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work
environment for Corpora-Based Linguistic Research. GC allows users to:
• access several Corpora tools from a single entry point using a regular web browser
• access and query generic Corpora (BNC, Reuters, COMPARA, CETEMPúblico)
• build personal simple, parallel and comparable Corpora from text files (PDF, PS, Word, HTML, TXT)
• use several (on-line/off-line) tools with their personal Corpora (statistics, POS-taggers, filters, etc.)
• communicate and exchange results with other users
Motivation
• Lack of comprehensive, wide-scope Corpora tools
• Commercial packages are usually difficult to integrate/customize
• Tools are not prepared to support cooperative work
• Linguistic knowledge is not usually integrated in tools
Internet Integration
GC provides seamless integration with the World Wide Web, allowing users to:
• search specific Corpora resources on the Internet
• use available translation engines in parallel
• query the web for concordances
Developer’s Tasks:
• Integrate Existing Tools/Resources
• Develop Additional Generic Tools
• Interact with Users/Administrator
• Develop Custom Tools for particular research needs
[Diagram labels: BNC, CETEMPúblico, COMPARA, Others (each with a Custom Interface); DEV]
Administrator’s Tasks:
• Users, Groups and Disk Quotas
• Corpora Taxonomy (see box)
• Documentation Organization
• Access Service Statistics
Tool Pool
• Concordance Engine
• Corpora Bot
• Taggers
• Statistics
• Aligner (Semi-Auto)
• Custom Tools
[Diagram labels: Internet, Terminology DB, Inter-user Communication, ADM, USER]
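To make the Concordance Engine entry concrete, here is a minimal keyword-in-context (KWIC) sketch that accepts a regular expression as the search pattern. It is not GC's implementation: the function `kwic`, its `width` parameter and the sample sentence are hypothetical, and a real engine would work on indexed, tokenised corpora.

```python
import re


def kwic(text, pattern, width=30):
    """Return (left context, match, right context) for each regex match.

    width is the number of characters of context kept on each side;
    the left context is padded so that the matches line up in a column.
    """
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left.rjust(width), m.group(0), right))
    return lines


sample = ("Corpora are useful. A parallel corpus pairs texts; "
          "comparable corpora share a domain.")
hits = kwic(sample, r"\bcorp(us|ora)\b", width=20)
for left, keyword, right in hits:
    print(f"{left} [{keyword}] {right}")
```

A pattern that matches nothing simply yields an empty list, which is the "0 results" case a query interface would have to report to the user.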
Teacher’s Tasks:
• Provide on-line tutorials
• Provide links to:
• on-line teaching material
• bibliography and other resources
Inter-User Communication
• Tagging and Aligning Cooperatively
• Messaging Service
• Exchange of Corpora Resources
[Diagram labels: Virtual Desktop, Personal Corpora, Terminology Extraction Tool (Auto/Semi-Auto); file formats PS, TXT, RTF, HTML, PDF, DOC]
Corpora Taxonomy
• Medium: written, spoken, multimedia
• Domain: Engineering, medicine, etc.
• Genre: scientific, technical, informative, etc.
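The taxonomy above maps naturally onto a small record type. The following is a sketch only, assuming each corpus carries exactly one value per dimension: the class `CorpusRecord`, the helper `filter_corpora` and the sample entries are hypothetical and do not describe GC's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class Medium(Enum):
    WRITTEN = "written"
    SPOKEN = "spoken"
    MULTIMEDIA = "multimedia"


@dataclass(frozen=True)
class CorpusRecord:
    name: str
    medium: Medium
    domain: str   # e.g. "engineering", "medicine"
    genre: str    # e.g. "scientific", "technical", "informative"


def filter_corpora(records, medium=None, domain=None, genre=None):
    """Return the records matching every criterion that is not None."""
    def matches(r):
        return ((medium is None or r.medium == medium)
                and (domain is None or r.domain == domain)
                and (genre is None or r.genre == genre))
    return [r for r in records if matches(r)]


# Invented catalogue entries for illustration.
catalogue = [
    CorpusRecord("lecture-recordings", Medium.SPOKEN, "medicine", "informative"),
    CorpusRecord("journal-articles", Medium.WRITTEN, "medicine", "scientific"),
    CorpusRecord("manuals", Medium.WRITTEN, "engineering", "technical"),
]
written_medical = filter_corpora(catalogue, medium=Medium.WRITTEN, domain="medicine")
print([r.name for r in written_medical])
```

Medium is modelled as a closed enumeration because the box lists exactly three values, while domain and genre are left as free strings because their lists end in "etc.".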