Text Mining Project Engineering Project


Engineering Project
Text Mining Project
Anna Matveev
Daniel Itzhak
Ori Chajuss
The Idea
Our project applies Named Entity Recognition (NER)
methods to the extraction of product names.
We focus on two domains:
1. Hi-Tech/Electronics
2. Healthcare (drugs)
The Framework
• The project is carried out in cooperation
with the company Digital Trowel.
• Digital Trowel collects, distills and
disseminates data from the Internet
into the knowledge one needs to
make smart business decisions.
• Our project stands on its own, but the
company will use it in its future
projects.
Named Entity Recognition
What NER has to offer now
Most existing NER tools focus on extracting names of people,
organizations and locations from text.
What we are offering
We apply existing NER methods to the extraction of product names.
(!) Product extraction is considered a more challenging task,
since the contextual clues are typically less indicative.
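To make the task concrete, product NER is commonly framed as labeling each token of a sentence. The sketch below shows one such encoding (the BIO scheme) for a product mention. The label names `B-PROD`, `I-PROD` and `O`, the helper `bio_encode`, and the example sentence are assumptions made for illustration, not part of the project itself.

```python
def bio_encode(tokens, product_spans):
    """Label each token as B-PROD (first token of a product name),
    I-PROD (later token of the same name) or O (outside any name).

    product_spans holds (start, end) token indices, end exclusive.
    """
    labels = ["O"] * len(tokens)
    for start, end in product_spans:
        labels[start] = "B-PROD"
        for i in range(start + 1, end):
            labels[i] = "I-PROD"
    return labels


# Hypothetical example sentence; tokens 3 and 4 form the product name.
tokens = "I bought the Nexus 1 last week".split()
labels = bio_encode(tokens, [(3, 5)])
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

With this framing, recognizing product names reduces to predicting one label per token, which is the kind of problem sequence models are built for.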
Main Goals and Profits
Target
Merit
Necessity
Extracting product names from
given text.
This will be done using well-known
machine learning algorithms such as
HMM (hidden Markov models) and CRF (conditional random fields).
Product tagging opens up many new opportunities
to analyze data from the web.
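As an illustration of the HMM approach named above, here is a minimal sketch of a first-order HMM tagger: it estimates transition and emission probabilities by counting, applies add-one smoothing, and decodes with the Viterbi algorithm. The toy corpus, the made-up product "Acme Phone", the `B-PROD`/`I-PROD`/`O` labels and all function names are assumptions for illustration. A real system would use far more data and richer features, and this is not the project's actual implementation.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Count label transitions and token emissions from (token, label) pairs."""
    trans = defaultdict(Counter)  # trans[previous_label][label]
    emit = defaultdict(Counter)   # emit[label][lowercased token]
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for token, label in sentence:
            trans[prev][label] += 1
            emit[label][token.lower()] += 1
            vocab.add(token.lower())
            prev = label
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under the given counts."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(tokens, trans, emit, vocab):
    """Return the most probable label sequence for a non-empty token list."""
    labels = sorted(emit)
    n_labels = len(labels)
    n_words = len(vocab) + 1  # one extra slot for unseen words
    words = [t.lower() for t in tokens]
    # best[i][label] = (log score of the best path ending in label, previous label)
    best = [{
        lab: (log_prob(trans["<s>"], lab, n_labels)
              + log_prob(emit[lab], words[0], n_words), None)
        for lab in labels
    }]
    for i in range(1, len(words)):
        column = {}
        for lab in labels:
            e = log_prob(emit[lab], words[i], n_words)
            column[lab] = max(
                (best[i - 1][p][0] + log_prob(trans[p], lab, n_labels) + e, p)
                for p in labels
            )
        best.append(column)
    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda lab: best[-1][lab][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Toy training data, invented for this sketch.
toy_corpus = [
    [("I", "O"), ("bought", "O"), ("the", "O"), ("Nexus", "B-PROD"), ("1", "I-PROD")],
    [("The", "O"), ("Acme", "B-PROD"), ("Phone", "I-PROD"), ("is", "O"), ("great", "O")],
    [("I", "O"), ("like", "O"), ("the", "O"), ("Nexus", "B-PROD"), ("1", "I-PROD")],
]
trans, emit, vocab = train_hmm(toy_corpus)
print(viterbi("I bought the Nexus 1".split(), trans, emit, vocab))
```

A CRF would replace the generative emission and transition counts with weighted features of the whole input (capitalization, digits, neighboring words), which is one reason it is often preferred when contextual clues are weak.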
Example for Future Work
Cellphone Review
Without product tagging
Identifying the name of a cellphone (e.g. Nexus 1)
in a review is impossible.
When products are tagged
The analysis is much easier, and a market
review can be done with much better precision.
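The benefit of tagging can be sketched as follows: once tokens carry product labels, collecting the product names mentioned across reviews is a short loop. The label names (`B-PROD` for the first token of a product name, `I-PROD` for its continuation, `O` otherwise), the helper `extract_products` and the two sample reviews are assumptions made for illustration.

```python
from collections import Counter


def extract_products(tokens, labels):
    """Join consecutive B-PROD/I-PROD tokens into product-name strings."""
    products, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-PROD":
            if current:  # a new name starts right after another one
                products.append(" ".join(current))
            current = [token]
        elif label == "I-PROD" and current:
            current.append(token)
        else:
            if current:
                products.append(" ".join(current))
            current = []
    if current:  # name that runs to the end of the sentence
        products.append(" ".join(current))
    return products


# Hypothetical tagged reviews, invented for this sketch.
tagged_reviews = [
    ("The Nexus 1 screen is sharp".split(),
     ["O", "B-PROD", "I-PROD", "O", "O", "O"]),
    ("My Nexus 1 battery died".split(),
     ["O", "B-PROD", "I-PROD", "O", "O"]),
]

mentions = Counter()
for tokens, labels in tagged_reviews:
    mentions.update(extract_products(tokens, labels))
print(mentions.most_common())
```

Mention counts like these are the starting point for the kind of market review the slide describes, for example by linking each mention to the opinion words around it.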
Material
• Books
  - Ronen Feldman, “Text Mining Handbook”.
• Papers
  - David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 2007.
  - Binyamin Rosenfeld, Moshe Fresko, and Ronen Feldman. A Systematic Comparison of Feature-Rich Probabilistic Classifiers for NER Tasks. PKDD 2005, pp. 217-227.
  - Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.
  - Jun’ichi Kazama and Kentaro Torisawa. Exploiting Wikipedia as External Knowledge for Named Entity Recognition. EMNLP 2007.
  - O. Etzioni, M. Cafarella, D. Downey, A. M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Unsupervised Named-Entity Extraction from the Web: An Experimental Study. Artificial Intelligence 165, pp. 91-134, 2005.
• Software
  - Python, with the Natural Language Toolkit (NLTK).