First evaluation of Esfinge – a question answering system


BACO
A large database of text and co-occurrences
Luís Sarmento
Universidade do Porto (NIAD&R) and Linguateca
[email protected]
Global Motivation
* Obtain fast text query methods for a variety of "data-driven" NLP techniques
* Develop practical methods for querying current gigabyte corpora (web collections…)
* Experiment with scalable methods for querying the next generation of terabyte corpora

Optimize Queries…
* Text at sentence level: QA, Definition Extraction
* 1-4 word window contexts: find MWE, collocations
* Word co-occurrence data: WSD, context clustering

Setup
* 2.8 GHz Pentium IV
* 2 GB RAM
* 160 GB IDE HD
* Fedora Core 2
* Perl 5.6
* MySQL 5.0.15
* DBI + DBD-Mysql

Stage 1: Data preparation and loading
[Pipeline diagram; recovered labels: WPT03, 12 GB; duplicate removal; 1.5M docs, 6 GB; (by Nuno Seco [email protected]); sentence splitting; document metadata; metadata + text sentences; tabular format; load data; index data; indexed database]
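The Stage 1 steps named in the diagram (duplicate removal, sentence splitting, tabular output ready for loading) can be sketched as follows. This is a minimal illustration in Python, not the poster's Perl implementation: the function names, the exact-hash notion of "duplicate", the punctuation-based splitter, and the column layout are all assumptions.

```python
import hashlib
import re

# Naive splitter: break after '.', '!' or '?' followed by whitespace.
# (Assumption: the poster does not describe the real sentence splitter.)
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty sentences of a text, whitespace-normalized."""
    parts = _SENT_END.split(text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def prepare(docs):
    """docs: iterable of (url, text) pairs.

    Returns (metadata_rows, sentence_rows) as tab-separated strings,
    skipping documents whose text is an exact duplicate of an earlier one.
    Hypothetical columns: doc_id, url, n_sentences / doc_id, position, text.
    """
    seen = set()
    metadata_rows, sentence_rows = [], []
    doc_id = 0
    for url, text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate: drop it
        seen.add(digest)
        doc_id += 1
        sentences = split_sentences(text)
        metadata_rows.append(f"{doc_id}\t{url}\t{len(sentences)}")
        for pos, sent in enumerate(sentences, start=1):
            sentence_rows.append(f"{doc_id}\t{pos}\t{sent}")
    return metadata_rows, sentence_rows
```

Rows in this shape can be bulk-loaded with MySQL's `LOAD DATA INFILE`, whose default field separator is a tab; whether BACO used that statement is not stated on the poster.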
WPT03 - A public resource
* WPT03 is a resource built by the XLDB Group (xldb.di.fc.ul.pt) and distributed by Linguateca (www.linguateca.pt)
* 12 GB, 3.7M web documents and ~1.6B words
* Obtained from the Portuguese web search engine TUMBA! http://www.tumba.pt

Some Practical Problems
* How to compile lists of n-grams (2, 3, 4…) in a 1B word collection?
* How to obtain co-occurrence information for all pairs of words in a 1B word collection?
* Which data structures are best (and easily available in Perl): hash tables? Trees? Others (Judy? T-Trees?)…
* How should all this data be stored and indexed in a standard RDBMS?
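One common way to approach the first two questions when the counts do not fit in memory is to count in batches with a hash table, write each batch as a sorted temporary file, and then merge the sorted files while summing counts. The sketch below illustrates that idea in Python; it is not the poster's Perl code. All function names, the whitespace tokenization, the batch size, and the choice of "same sentence" as the co-occurrence window are assumptions.

```python
import heapq
import itertools
import os
import tempfile
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cooc_pairs(tokens):
    """Unordered pairs of distinct words that share a sentence."""
    return [" ".join(p) for p in itertools.combinations(sorted(set(tokens)), 2)]


def write_sorted_batch(counter, directory):
    """Dump one batch of counts to a temp file, sorted by key."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".tsv")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for key in sorted(counter):
            out.write(f"{key}\t{counter[key]}\n")
    return path


def read_batch(path):
    """Stream (key, count) pairs back from a sorted temp file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, count = line.rstrip("\n").split("\t")
            yield key, int(count)


def count_in_batches(sentences, extract, batch_size, directory):
    """Count extract(tokens) items over all sentences, batch_size at a time."""
    paths, counter = [], Counter()
    for i, sentence in enumerate(sentences, start=1):
        counter.update(extract(sentence.lower().split()))
        if i % batch_size == 0:  # flush the hash table to disk
            paths.append(write_sorted_batch(counter, directory))
            counter = Counter()
    if counter:
        paths.append(write_sorted_batch(counter, directory))
    # k-way merge of the sorted files; equal keys become adjacent and are summed.
    merged = heapq.merge(*(read_batch(p) for p in paths))
    return [(key, sum(c for _, c in group))
            for key, group in itertools.groupby(merged, key=lambda kc: kc[0])]
```

For example, `count_in_batches(sentences, lambda t: ngrams(t, 2), 100000, tmpdir)` would yield bigram counts, and passing `cooc_pairs` instead yields co-occurrence counts. Memory use is bounded by the largest batch rather than by the whole collection.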
Stage 2: compiling dictionary + 2,3,4-grams + co-occurrence pairs
[Pipeline diagram; recovered labels: text sentences; DIC: single pass; 2-GRAMS: disjoint division based on number of chars; 3-GRAMS; 4-GRAMS; CO-OC PAIRS; 3, 4-grams + co-occurrence pairs: multiple iterations, N documents per iteration, temp files are sorted; 13 iterations; load data; index data]
Final Tables:
* metadata
* text sentences
* Dictionary
* 2,3,4-grams
* co-occurrence pairs

Statistics
BACO Table       # tuples (millions)   Table size (GB)   Index size (GB)
Metadata         1.529                 0.2               0.05
Sentences        35.575                6.55              5.90
Dictionary       6.834                 0.18              0.27
2-grams          54.610                1.50              0.92
3-grams          173.608               5.43              2.97
4-grams          293.130               10.40             6.35
Co-occurrence    761.044               20.10             7.56
BACO total       -                     44.4              ~24

Current Deliverables
BACO: BAse de Co-Ocorrências (co-occurrence base)
* MySQL-encoded database of text, n-grams and information about co-occurrence pairs
* Perl module to easily query BACO instances

Some conclusions
* RDBMSs are a good alternative for querying gigabyte text collections for NLP purposes
* Complex data pre-processing tasks, data modeling and system tuning may be required
* The current implementation deals with raw text, but the models may be extended to annotated corpora
* Query speed depends on internal details of the MySQL indexing mechanism
* Current performance may be improved by a more efficient database schema and parallelization
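To make the point about indexing concrete, here is a small sketch of a co-occurrence lookup against a relational table. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server. The table name `cooc`, its columns, and the index are hypothetical and are not BACO's actual schema.

```python
import sqlite3


def build_demo_db(pairs):
    """pairs: iterable of (word1, word2, freq). Hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cooc ("
        "word1 TEXT NOT NULL, word2 TEXT NOT NULL, freq INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO cooc VALUES (?, ?, ?)", pairs)
    # Index on the lookup column, so a query for one word need not
    # scan the whole table.
    conn.execute("CREATE INDEX idx_cooc_word1 ON cooc (word1, freq)")
    conn.commit()
    return conn


def top_cooccurrences(conn, word, limit=10):
    """Most frequent co-occurring words for `word`, ties broken by word2."""
    cur = conn.execute(
        "SELECT word2, freq FROM cooc WHERE word1 = ? "
        "ORDER BY freq DESC, word2 LIMIT ?",
        (word, limit),
    )
    return cur.fetchall()


def uses_index(conn, word):
    """True if SQLite's query plan for the lookup mentions our index."""
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT word2, freq FROM cooc WHERE word1 = ?",
        (word,),
    ).fetchall()
    return any("idx_cooc_word1" in str(row[-1]) for row in plan)
```

MySQL offers the analogous `EXPLAIN` statement for checking whether a query is served by an index, which is one way to investigate the dependence of query speed on the indexing mechanism.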
NIAD&R
* Research group started in 1998 as part of the LIACC (AI Lab) @ Universidade do Porto
* Research topics: Multi-Agent Systems, E-business Technology, Machine Learning, Robotics, Ontologies
http://www.fe.up.pt/~eol/

Linguateca
* Improving processing and research on the Portuguese language
* Fostering collaboration among researchers
* Providing public and free-of-charge tools and resources to the community
http://www.linguateca.pt