GLOSSARY COMPILATION

Alex Kotov (akotov2)
Hanna Zhong (hzhong)
Hoa Nguyen (hnguyen4)
Zhenyu Yang (zyang2)
Roadmap

Problem definition
Motivation
Solution framework
Demo
Conclusion
Problem definition



The purpose of an automatic glossary compiler is to
aid in the construction of a list of definitions across
a large collection of documents.
A definition is a concise description of what an entity
is.
Challenges:
  There are multiple ways to phrase a definition
  A single term can have multiple definitions, so
  clustering is needed
Motivation

Benefits for everyone:
  Construct a glossary without marking index words
  by hand;
  Quickly look up the definition of a term in a book,
  journal article, set of books, or collection of
  papers on a particular topic.
No similar tool currently exists.
Solution framework

Query processing: Yahoo API;
Definition extraction: MINIPAR;
Clustering algorithm: K-means;
Technology: IE Toolbar.
Page processing

Goals

Fetch pages for a given query
  Use multi-threading to accelerate
Convert multiple formats into text format
  e.g., PDF files
Filter
  Remove HTML tags, incomplete tokens…
  Detect sentence boundaries
  Remove garbage
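The filter goals above can be sketched with the Python standard library. This is a minimal illustration, not the presenters' implementation: the function names, the sentence-splitting rule, and the garbage heuristic (its `min_words` and letter-ratio thresholds) are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(html):
    """Remove HTML tags and collapse the whitespace they leave behind."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def split_sentences(text):
    """Naive boundary rule (assumption): split after . ! ? when the next
    token starts with an uppercase letter or a double quote."""
    pieces = re.split(r'(?<=[.!?])\s+(?=[A-Z"])', text)
    return [p.strip() for p in pieces if p.strip()]


def looks_like_garbage(sentence, min_words=4):
    """Hypothetical heuristic: too few words, or mostly non-letters."""
    if len(sentence.split()) < min_words:
        return True
    letters = sum(ch.isalpha() for ch in sentence)
    return letters / len(sentence) < 0.6


def clean_page(html):
    """Tags -> text -> sentences -> sentences that survive the garbage filter."""
    return [s for s in split_sentences(strip_tags(html)) if not looks_like_garbage(s)]
```

A real sentence segmenter has to handle abbreviations, decimals, and quotations, which this rule does not. Joining text nodes with a space can also split a word that is interrupted by inline markup.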
Page processing (cont.)
[Flowchart: query string -> Process Query (Yahoo API) -> result set -> Fetch URL -> query pages; HTML pages go through Remove Tag and PDF pages through Convert to TXT, producing .TXT; then Sentence Segmentation -> Garbage Cleaning]
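The fetch-and-convert stage of this flow, with the multi-threading mentioned in the goals, might look like the sketch below. The search call is left out because the slides do not show how the Yahoo API was invoked. `fetch` and `converters` are hypothetical parameters supplied by the caller (for example, an HTTP download function and per-format text converters), so the sketch itself does no network or PDF work.

```python
from concurrent.futures import ThreadPoolExecutor


def to_text(url, raw, converters):
    """Pick a converter by file extension; fall back to the HTML converter."""
    last_segment = url.rsplit("/", 1)[-1]
    ext = last_segment.rsplit(".", 1)[-1].lower() if "." in last_segment else "html"
    convert = converters.get(ext, converters["html"])
    return convert(raw)


def fetch_all(urls, fetch, converters, max_workers=8):
    """Fetch URLs concurrently and convert each page to plain text.

    Returns a dict mapping url -> text. Pages whose download or
    conversion raises an exception are skipped.
    """
    def work(url):
        try:
            return url, to_text(url, fetch(url), converters)
        except Exception:
            return url, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(work, urls))
    return {url: text for url, text in results if text is not None}
```

Threads fit this stage because the work is dominated by waiting on the network, so many downloads can overlap even in a single Python process.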
Definition extraction

Dependency parser (MINIPAR):
  Based on the theory of dependency grammars;
  Broad-coverage parser;
  Output is a parse tree representing head-modifier
  relations.
Generic definition patterns:
  Use generic semantic patterns to overcome
  syntactic variability (expressing the same meaning
  with the same set of words by employing different
  syntactic structures of a sentence);
  Extensible, easily coded in XML, and require minimal
  knowledge of linguistics.
Definition extraction
“Data Mining, also known as knowledge discovery in data bases, is the process
of automatically searching large volumes of data for patterns.”
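To make the idea of a definition pattern concrete, here is a deliberately simplified sketch. It matches surface strings with regular expressions rather than MINIPAR dependency trees and XML-coded patterns, so it is a stand-in for illustration only. The three patterns and the `extract_definition` name are assumptions, not the system's actual pattern set.

```python
import re

# Hypothetical surface patterns; the real system matches dependency parses.
PATTERNS = [
    # "<term>, also known as <alias>, is <definition>"
    re.compile(r"^(?P<term>[A-Z][\w -]*?), also known as (?P<alias>[^,]+), is (?P<definition>.+)$"),
    # "<term> can be defined as <definition>"
    re.compile(r"^(?P<term>[A-Z][\w -]*?) can be defined as (?P<definition>.+)$"),
    # "<term> is a/an/the <definition>"
    re.compile(r"^(?P<term>[A-Z][\w -]*?) is (?P<definition>(?:a|an|the) .+)$"),
]


def extract_definition(sentence):
    """Return term, aliases, and definition if a pattern matches, else None."""
    sentence = sentence.strip()
    for pattern in PATTERNS:
        match = pattern.match(sentence)
        if match:
            groups = match.groupdict()
            definition = groups["definition"].rstrip(".").strip('"\u201c\u201d ')
            return {
                "term": groups["term"],
                "aliases": [groups["alias"]] if groups.get("alias") else [],
                "definition": definition,
            }
    return None


example = ("Data Mining, also known as knowledge discovery in data bases, "
           "is the process of automatically searching large volumes of data for patterns.")
result = extract_definition(example)
```

Surface patterns like these break as soon as the word order changes, for instance when a clause is inserted between the term and the verb. That fragility is the syntactic variability that patterns over a dependency parse are meant to absorb.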
Definition extraction

Simple and complex definitions:
  Although it is usually used in relation to analysis of data, data
  mining, like artificial intelligence, is an umbrella term and is used
  with varied meaning in a wide range of contexts;
  Data Mining can be defined as "The nontrivial extraction of
  implicit, previously unknown, and potentially useful information
  from data".
Simple and complex terms being defined:
  Data Mining;
  Core of comparative genome analysis.
Extensible;
High accuracy (limited by the parser).
Clustering

Algorithm: K-means;
Similarity measure: Vector space model;
Challenges:
  Define k;
  Define similarity measure.
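As a rough sketch of how K-means and a vector space model fit together, the code below groups the extracted definitions of one term by sense. Several choices here are assumptions that the slides leave open: term-frequency weights, a tiny hypothetical stopword list, cosine similarity as the measure, a deterministic farthest-first initialisation, and a value of k supplied by the caller.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "that", "with", "is", "and", "to"}


def tf_vector(text):
    """Bag-of-words term frequencies, scaled to unit length."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def centroid(vectors):
    """Sum of the member vectors, rescaled to unit length."""
    total = Counter()
    for vec in vectors:
        for w, x in vec.items():
            total[w] += x
    norm = math.sqrt(sum(x * x for x in total.values())) or 1.0
    return {w: x / norm for w, x in total.items()}


def kmeans(texts, k, max_iter=20):
    """Cluster texts into k groups; returns one cluster label per text."""
    vectors = [tf_vector(t) for t in texts]
    # Farthest-first seeding: start from the first text, then repeatedly add
    # the text least similar to the centers chosen so far.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda j: cosine(v, centers[j])) for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    return labels


definitions = [
    "an island of Indonesia located south of Borneo",
    "a programming language that runs on a virtual machine",
    "a large island in Indonesia with many volcanoes",
    "an object oriented programming language for the virtual machine",
]
labels = kmeans(definitions, k=2)
```

In this example the two island definitions land in one cluster and the two programming-language definitions in the other. The caller still has to choose k, which is the first challenge listed above.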
Demo
Thank you!
Questions?