GLOSSARY COMPILATION
Alex Kotov (akotov2)
Hanna Zhong (hzhong)
Hoa Nguyen (hnguyen4)
Zhenyu Yang (zyang2)
Roadmap
Problem definition
Motivation
Solution Framework
Demo
Conclusion
Problem definition
The purpose of an automatic glossary compiler is to
aid in the construction of a list of definitions across
a large collection of documents.
A definition is a concise description of what an entity
is.
Challenges:
Multiple ways to phrase a definition
A single term can have multiple definitions, so
clustering is needed
Motivation
Benefits for everyone:
Construct a glossary without marking index words
by hand;
Quickly look up the definition of a term in a book,
journal articles, a set of books or collection of
papers on a particular topic.
No similar tool currently exists.
Solution framework
Query processing: Yahoo API;
Definition extraction: Minipar;
Clustering algorithm: K-means;
Technology: IE Toolbar.
Page processing
Goals
Fetch pages for a given query (use multi-threading to accelerate)
Convert multiple formats (e.g., PDF files) into text format
Filter: remove garbage (HTML tags, incomplete tokens…)
Detect sentence boundaries.
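The multi-threaded fetching goal can be sketched roughly as follows. This is only an illustration, not the project's code: `fetch_url`, `fetch_all`, and the `max_workers` default are hypothetical names and values, and the slides do not say which threading mechanism was actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_url(url, timeout=10):
    """Download one page and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_url, max_workers=8):
    """Fetch every URL concurrently; map each URL to its content or None."""
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # A dead link should not abort the whole batch.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Passing the downloader in as the `fetch` argument keeps the sketch testable without network access; in practice the URL list would come from the search result set.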
Page processing (cont.)
[Flowchart: query string → Process Query (Yahoo API) → result set → Fetch URL → HTML: Remove Tag / PDF: Convert to TXT → .TXT → Sentence Segmentation → Garbage Cleaning → query pages]
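A minimal sketch of the text-side steps (tag removal, sentence segmentation, garbage cleaning), assuming regex-based heuristics. The function names and the thresholds `min_words` and `min_alpha_ratio` are illustrative assumptions; the slides do not describe how the original filter decided what counts as garbage.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# Split after ., ! or ? when followed by whitespace and an uppercase letter or quote.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"“])")


def strip_tags(page):
    """Drop script/style bodies and tags, decode entities, collapse whitespace."""
    page = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", page)
    text = html.unescape(TAG_RE.sub(" ", page))
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive sentence boundary detection; abbreviations will fool it."""
    return [s.strip() for s in SENTENCE_END_RE.split(text) if s.strip()]


def is_garbage(sentence, min_words=4, min_alpha_ratio=0.6):
    """Flag very short fragments and strings that are mostly non-letters."""
    if len(sentence.split()) < min_words:
        return True
    alpha = sum(ch.isalpha() or ch.isspace() for ch in sentence)
    return alpha / len(sentence) < min_alpha_ratio


def clean_page(page):
    """Run the three steps in the order shown in the flowchart."""
    return [s for s in split_sentences(strip_tags(page)) if not is_garbage(s)]
```

A real pipeline would likely need an HTML parser and a trained sentence splitter; the regexes here only show where each stage sits.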
Definition extraction
Dependency parser (MINIPAR):
Based on the theory of dependency grammars;
Broad coverage parser;
Output is a parse tree representing head-modifier
relations.
Generic definition patterns:
Use generic semantic patterns to overcome the
syntactic variability (expressing the same meaning
with the same set of words by employing different
syntactic structures of a sentence);
Extensible, easily coded in XML, and require minimal
knowledge of linguistics.
Definition extraction
“Data Mining, also known as knowledge discovery in data bases, is the process
of automatically searching large volumes of data for patterns.”
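To illustrate the idea of definition patterns kept in XML, here is a toy matcher applied to the sentence above. It is a deliberate simplification: the slides describe patterns over MINIPAR dependency trees, whereas this sketch substitutes surface-level regular expressions, and the XML schema (`<patterns>`, `<pattern name="…">`, `<regex>`) is a hypothetical one invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern file: group 1 is the term, group 2 the definition.
PATTERNS_XML = r"""
<patterns>
  <pattern name="is-a">
    <regex>^([A-Z][\w -]*?)(?:, [^,]+,)? is ((?:a|an|the) .+?)\.?$</regex>
  </pattern>
  <pattern name="defined-as">
    <regex>^([A-Z][\w -]*?) can be defined as "?(.+?)"?\.?$</regex>
  </pattern>
</patterns>
"""


def load_patterns(xml_text):
    """Parse the XML and compile one regex per <pattern> element."""
    root = ET.fromstring(xml_text.strip())
    return [(p.get("name"), re.compile(p.findtext("regex").strip()))
            for p in root.iter("pattern")]


def extract_definition(sentence, patterns):
    """Return the first matching pattern's term and definition, else None."""
    for name, regex in patterns:
        match = regex.match(sentence.strip())
        if match:
            return {"pattern": name,
                    "term": match.group(1),
                    "definition": match.group(2)}
    return None


patterns = load_patterns(PATTERNS_XML)
sentence = ("Data Mining, also known as knowledge discovery in data bases, "
            "is the process of automatically searching large volumes of "
            "data for patterns.")
print(extract_definition(sentence, patterns))
```

Adding a new way of phrasing a definition then means adding one `<pattern>` element rather than changing code, which is the extensibility point the slides make. Surface regexes miss most syntactic variation, which is why a dependency parser is used in the actual system.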
Definition extraction
Simple and complex definitions;
Although it is usually used in relation to analysis of data, data
mining, like artificial intelligence, is an umbrella term and is used
with varied meaning in a wide range of contexts;
Data Mining can be defined as "The nontrivial extraction of
implicit, previously unknown, and potentially useful information
from data".
Simple and complex terms being defined;
Data Mining;
Core of comparative genome analysis.
Extensible;
High accuracy (limited by the parser).
Clustering
Algorithm: K-means;
Similarity measure: Vector space model;
Challenges:
Define k;
Define similarity measure.
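A small sketch of K-means over a vector space model, assuming bag-of-words term-frequency vectors compared by cosine similarity. The weighting scheme, the seeding (the first k definitions become the initial centroids), and the sample definitions are all assumptions made for illustration; the slides leave both k and the similarity measure as open challenges.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector, scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())


def centroid(vectors):
    """Sum the member vectors and rescale the result to unit length."""
    total = Counter()
    for vec in vectors:
        for word, weight in vec.items():
            total[word] += weight
    norm = math.sqrt(sum(x * x for x in total.values())) or 1.0
    return {word: x / norm for word, x in total.items()}


def kmeans(texts, k, iterations=20):
    """Cluster texts into k groups; returns one cluster label per text."""
    vectors = [vectorize(t) for t in texts]
    centroids = vectors[:k]  # simplistic deterministic seeding
    labels = None
    for _ in range(iterations):
        new_labels = [max(range(k), key=lambda c: cosine(vec, centroids[c]))
                      for vec in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels


# Made-up definitions of one ambiguous term, two senses.
definitions = [
    "a python is a large snake that kills prey by constriction",
    "python is a programming language with dynamic typing",
    "the python is a snake found in africa and asia",
    "python is a high level programming language",
]
print(kmeans(definitions, k=2))
```

Without stop-word removal or TF-IDF weighting, common words such as "is" and "a" contribute to every similarity score, so the choice of weighting matters as much as the choice of k.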
Demo
Thank you!
Questions?