Slides - UW Faculty Web Server

MEBI 591C/598 – Text Mining/NLP
Subproblems
Meliha Yetisgen-Yildiz
From last week’s discussion
Presentation

• Schedule: http://faculty.washington.edu/melihay/MEBI591C.htm
• 50 minutes: presentation + discussion + question answering
• Content:
  – Research/Project Idea
    – Motivation + Problem + Potential Solution
  – Survey or literature review
    – A general area
      – Text mining: named entity recognition, gene name identification
      – Data mining: classification, clustering
    – Available resources for a given area
      – Open source libraries
      – Data resources
  – Paper
    – Conference or journal article
• Preparation:
  – Email the plan + reading list at least 3 days prior to class
GoMap Discussion List
System Design

• Team: Marcin, Wynona, Karl, Stella, Francisco, Jeffry, Safiyyah (not registered)
• Example data released: https://www.i2b2.org/NLP/Relations/Documentation.php
• The fourth i2b2 challenge is a three-tiered challenge that studies:
  1. extraction of medical problems, tests, and treatments
  2. classification of assertions made on medical problems
  3. relations of medical problems, tests, and treatments
2010 - i2b2 Challenge

• Important Dates:
  – March 5th – Registration opens
  – April 15th – Commitment to Participate in Challenge & Training Data Release
  – July 15th – Test Data Release
  – September 1st – Short papers due
  – October 1st – Invitations to present at the Workshop
  – November, 2010 – Workshop
Preparations

• Linux server + accounts (meliha)
  – Accounts
  – Dev environment
  – Subversion?
Text Mining/NLP Sub-problems – Part 1




Sentence Delimiters
Tokenizers
Part-of-Speech Tags
Collocations
Sentence Delimiters



• Document -> Paragraph -> Sentences
• Sentence boundary disambiguation (SBD) is the problem in NLP of deciding where sentences begin and end.
• Sentence boundary identification is challenging because punctuation marks are often ambiguous.
  – A period may denote an abbreviation, a decimal point, or part of an email address. About 47% of the periods in the Wall Street Journal corpus denote abbreviations.
  – Question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.
• Tools:
  – OpenNLP has a class for sentence detection
  – NaCTeM: http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
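
The ambiguity of the period can be illustrated with a small rule-based splitter. This is only a sketch under stated assumptions, not how OpenNLP or the NaCTeM detector works: the abbreviation list and the function name split_sentences are hypothetical, and a candidate boundary is taken to be any run of ".", "!" or "?" followed by whitespace or the end of the text.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "vs.", "etc."}

# A candidate boundary is ., ! or ? followed by whitespace or end of text.
# Periods inside "2.5" or "j.doe@example.org" are never candidates,
# because the character after them is not whitespace.
CANDIDATE = re.compile(r"[.!?]+(?=\s|$)")


def split_sentences(text):
    """Split text into sentences, skipping periods that end a known abbreviation."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        chunk = text[start:end].strip()
        if not chunk:
            continue
        last_token = chunk.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # e.g. "Dr." is not a sentence end
        sentences.append(chunk)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = ("Dr. Smith gave 2.5 mg to the patient. "
              "Email j.doe@example.org for details! Is that clear?")
    for sentence in split_sentences(sample):
        print(sentence)
```

A fixed abbreviation list fails on abbreviations it has never seen and on abbreviations that really do end a sentence, which is one reason trained sentence detectors are usually preferred.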
Tokenization


• Document -> Paragraph -> Sentence -> Tokens
• Based on white-space characters
• In Unicode (Unicode Character Database) the following code points are defined as whitespace:
  – U+0009–U+000D (control characters, containing Tab, CR and LF)
  – U+0020 SPACE
  – U+0085 NEL (control character next line)
  – U+00A0 NBSP (NO-BREAK SPACE)
  – U+1680 OGHAM SPACE MARK
  – U+180E MONGOLIAN VOWEL SEPARATOR
  – U+2000–U+200A (different sorts of spaces)
  – U+2028 LS (LINE SEPARATOR)
  – U+2029 PS (PARAGRAPH SEPARATOR)
  – U+202F NNBSP (NARROW NO-BREAK SPACE)
  – U+205F MMSP (MEDIUM MATHEMATICAL SPACE)
  – U+3000 IDEOGRAPHIC SPACE
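
A whitespace tokenizer over exactly the code points listed above can be sketched as follows. The function name whitespace_tokenize is hypothetical. The character class is built by hand from the list rather than relying on a library's own notion of whitespace, since later Unicode versions are understood to have dropped U+180E from the whitespace set.

```python
import re

# Character class built from the code points listed on the slide.
SLIDE_WHITESPACE = (
    "\u0009-\u000D"   # Tab, LF, VT, FF, CR
    "\u0020"          # SPACE
    "\u0085"          # NEL
    "\u00A0"          # NO-BREAK SPACE
    "\u1680"          # OGHAM SPACE MARK
    "\u180E"          # MONGOLIAN VOWEL SEPARATOR
    "\u2000-\u200A"   # different sorts of spaces
    "\u2028\u2029"    # LINE SEPARATOR, PARAGRAPH SEPARATOR
    "\u202F\u205F"    # NARROW NO-BREAK SPACE, MEDIUM MATHEMATICAL SPACE
    "\u3000"          # IDEOGRAPHIC SPACE
)
SPLITTER = re.compile("[" + SLIDE_WHITESPACE + "]+")


def whitespace_tokenize(text):
    """Split text on runs of the whitespace code points above; drop empty tokens."""
    return [token for token in SPLITTER.split(text) if token]


if __name__ == "__main__":
    print(whitespace_tokenize("The\u00A0girl kissed\tthe\u3000boy."))
```

Punctuation stays attached to the neighbouring word ("boy."), which is why practical tokenizers add rules on top of whitespace splitting.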
Part-of-Speech Tagging
“The process of assigning a part-of-speech or
other lexical class marker to each word in a
corpus” (Jurafsky and Martin)
WORDS: the girl kissed the boy on the cheek
TAGS: N, V, P, DET
Tagged: the/DET girl/N kissed/V the/DET boy/N on/P the/DET cheek/N
Penn Treebank POS Tags
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
Applications of Tagging




• Partial parsing: syntactic analysis
• Information Extraction: tagging and partial parsing help identify useful terms and relationships between them.
• Information Retrieval: noun phrase recognition and query-document matching based on meaningful units rather than individual terms.
• Question Answering: analyzing a query to understand what type of entity the user is looking for and how it is related to other noun phrases mentioned in the question.
Information Sources in Tagging

• How do we decide the correct POS for a word?
  – Syntagmatic Information: Look at the tags of other words in the context of the word we are interested in.
  – Lexical Information: Predict a tag based on the word concerned. Words with several possible parts of speech usually occur predominantly as one particular POS.
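
Lexical information alone already gives a simple baseline: tag every word with the tag it carried most often in training data. The sketch below assumes a toy training set invented for illustration; the function names and the NN default for unseen words are also assumptions.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Map each (lowercased) word to the tag it received most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_with_lexicon(words, model, default_tag="NN"):
    """Tag each word with its most frequent training tag; unseen words get default_tag."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]


# Hypothetical toy training data: "can" is a modal twice and a noun once.
TRAIN = [
    [("I", "PRP"), ("can", "MD"), ("swim", "VB")],
    [("you", "PRP"), ("can", "MD"), ("go", "VB")],
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
]

if __name__ == "__main__":
    model = train_most_frequent_tag(TRAIN)
    print(tag_with_lexicon(["the", "can", "fell"], model))
```

The baseline tags "can" as MD even after "the", where the noun reading is correct; that kind of error is what syntagmatic (context) information is meant to fix.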
POS Approaches – Rule-Based
• Basic Idea:
  – Assign all possible tags to words
  – Remove tags according to a set of rules of the type: if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like “consider” then eliminate non-adv else eliminate adv.
  – Typically more than 1000 hand-written rules, but may be machine-learned.
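
The "assign all tags, then eliminate" idea can be sketched with a toy lexicon and two hand-written rules. Everything named here (the lexicon, the two rules, the function names) is hypothetical and far simpler than the rule quoted above. Each rule keeps the original tag set if it would otherwise remove every reading.

```python
# Hypothetical toy lexicon: word -> all tags the word may take.
LEXICON = {
    "the": {"DT"},
    "can": {"MD", "NN", "VB"},
    "rusts": {"VBZ", "NNS"},
}


def drop_verbs_after_determiner(tags, i):
    """If the previous word is unambiguously a determiner, drop verb and modal readings."""
    if i > 0 and tags[i - 1] == {"DT"}:
        kept = {t for t in tags[i] if t != "MD" and not t.startswith("VB")}
        return kept or tags[i]
    return tags[i]


def prefer_verb_after_singular_noun(tags, i):
    """If the previous word is unambiguously a singular noun, keep only verb readings."""
    if i > 0 and tags[i - 1] == {"NN"}:
        kept = {t for t in tags[i] if t.startswith("VB")}
        return kept or tags[i]
    return tags[i]


RULES = [drop_verbs_after_determiner, prefer_verb_after_singular_noun]


def disambiguate(words):
    """Assign every possible tag, then apply the elimination rules left to right."""
    tags = [set(LEXICON.get(word.lower(), {"NN"})) for word in words]
    for i in range(len(tags)):
        for rule in RULES:
            tags[i] = rule(tags, i)
    return tags


if __name__ == "__main__":
    print(disambiguate(["the", "can", "rusts"]))
```

Words may stay ambiguous when no rule applies, which is one reason real rule sets grow to a thousand rules or more.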
POS Approaches – Machine Learning
• Based on probability of a certain tag occurring given various possibilities
• Requires a training corpus
• Training corpus may be different from test corpus.
• Examples
  – Hidden Markov Model Taggers
  – Transformation Based Taggers
  – Maximum Entropy Taggers
Ling572 (Advanced Statistical Methods in NLP): http://courses.washington.edu/ling572/winter10/teaching_slides/new_syllabus.htm
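
As one illustration of the probabilistic approach, a bigram Hidden Markov Model tagger picks the tag sequence with the highest joint probability using the Viterbi algorithm. The sketch below is a minimal version under stated assumptions: the three-tag set and all probabilities are invented toy values rather than estimates from a corpus, and unseen events get a small floor probability instead of proper smoothing.

```python
import math

TAGS = ["DT", "NN", "VB"]

# Hypothetical toy parameters (in practice estimated from a training corpus).
START = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT = {
    "DT": {"the": 0.9, "a": 0.1},
    "NN": {"dog": 0.4, "walk": 0.2, "park": 0.4},
    "VB": {"walk": 0.5, "dog": 0.1, "park": 0.4},
}
FLOOR = 1e-6  # probability assigned to events missing from the tables


def logp(table, key):
    """Log probability of key in table, with a floor for unseen events."""
    return math.log(table.get(key, FLOOR))


def viterbi(words):
    """Return the most probable tag sequence for words under the toy HMM."""
    if not words:
        return []
    # Each column maps tag -> (best log score of a path ending in tag, previous tag).
    columns = [{t: (logp(START, t) + logp(EMIT[t], words[0]), None) for t in TAGS}]
    for word in words[1:]:
        previous = columns[-1]
        column = {}
        for t in TAGS:
            score, back = max((previous[p][0] + logp(TRANS[p], t), p) for p in TAGS)
            column[t] = (score + logp(EMIT[t], word), back)
        columns.append(column)
    # Follow back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


if __name__ == "__main__":
    print(viterbi(["the", "dog", "walk"]))
```

Working in log space avoids numerical underflow on long sentences. "walk" is ambiguous between NN and VB in the emission table, and the transition probabilities decide between the two readings.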
Tagging Accuracy


• Ranges from 95% to 97%
• Depends on:
  – Amount of training data available.
  – Difference between the training corpus and dictionary and the corpus of application.
  – Unknown words in the corpus of application.
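
Accuracy here is the fraction of tokens that receive the correct tag. Reporting it separately for words missing from the training vocabulary makes the effect of unknown words visible. The sketch below uses a hypothetical function name and invented example data.

```python
def tagging_accuracy(gold, predicted, training_vocabulary):
    """Return (overall accuracy, accuracy on words absent from training_vocabulary).

    gold and predicted are parallel lists of (word, tag) pairs.
    A value is None when there are no tokens to score.
    """
    total = correct = unknown_total = unknown_correct = 0
    for (word, gold_tag), (_, predicted_tag) in zip(gold, predicted):
        hit = gold_tag == predicted_tag
        total += 1
        correct += hit
        if word.lower() not in training_vocabulary:
            unknown_total += 1
            unknown_correct += hit
    overall = correct / total if total else None
    unknown = unknown_correct / unknown_total if unknown_total else None
    return overall, unknown


if __name__ == "__main__":
    # Invented example: "blogger" and "tweeted" were not seen in training.
    gold = [("the", "DT"), ("blogger", "NN"), ("tweeted", "VBD"),
            ("a", "DT"), ("link", "NN")]
    predicted = [("the", "DT"), ("blogger", "NN"), ("tweeted", "NN"),
                 ("a", "DT"), ("link", "NN")]
    print(tagging_accuracy(gold, predicted, {"the", "a", "link"}))
```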
Tagging Unknown Words
• New words are added to (newspaper) language at a rate of 20+ per month
• Plus many proper names …
• Increases error rates by 1-2%
• Method 1: assume they are nouns
• Method 2: assume the unknown words have a probability distribution similar to words only occurring once in the training set.
• Method 3: Use morphological information, e.g., words ending with -ed tend to be tagged VBN.
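
Methods 1 and 3 can be combined in a few lines: try morphological cues first and fall back to "noun". The suffix table, the capitalization cue for proper names, and the minimum stem length below are all assumptions made for this sketch, not rules taken from any particular tagger.

```python
# Hypothetical suffix cues, checked in order (Method 3).
SUFFIX_RULES = [
    ("ing", "VBG"),
    ("ed", "VBN"),
    ("ly", "RB"),
    ("s", "NNS"),
]
MIN_STEM = 3  # assumed minimum number of characters left before the suffix


def guess_tag(word):
    """Guess a Penn Treebank tag for a word missing from the lexicon."""
    if word[:1].isupper():
        return "NNP"  # capitalized unknown words are often proper names
    lower = word.lower()
    for suffix, tag in SUFFIX_RULES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= MIN_STEM:
            return tag
    return "NN"  # Method 1: assume unknown words are nouns


if __name__ == "__main__":
    for word in ["blogged", "googling", "Zuckerberg", "blog"]:
        print(word, guess_tag(word))
```

The capitalization cue misfires on sentence-initial words, so a real tagger would also take the word's position into account.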
POS Taggers
Freely downloadable part-of-speech taggers:
• Stanford POS tagger: Loglinear tagger in Java (by Kristina Toutanova).
• hunpos: An HMM tagger with models available for English and Hungarian. A reimplementation of TnT in OCaml. Pre-compiled models. Runs on Linux, Mac OS X, and Windows.
• MBT: Memory-based Tagger. Based on TiMBL.
• TreeTagger: A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) Page has links to sites where you can run it online.
• SVMTool: POS tagger based on SVMs (uses SVMlight). LGPL.
• ACOPOST (formerly ICOPOST): Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license.
• MXPOST: Adwait Ratnaparkhi's Maximum Entropy part of speech tagger. Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. Original version was only JDK1.1; later version worked with JDK1.3+. Class files, not source.
• fnTBL: A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
• mu-TBL: An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things, by Torbjörn Lager. Web demo also available. Prolog.
• YamCha: SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
• QTAG: Part of speech tagger. An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
Collocations


• A collocation is an expression consisting of two or more words that corresponds to some conventional way of saying things.
• Methods:
  – Simplest solution: counting
    – Google 5-gram corpus (2006), sample entries with counts:
        ceramics collectables fine 130
        ceramics collected by 52
        ceramics collection , 144
        ceramics collection . 247
  – Use POS Tags
  – Use Noun Phrase Chunking / Parsing
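
Counting combined with a POS filter can be sketched as follows: count adjacent word pairs, but keep only pairs whose tags match a pattern likely to form a phrase. The two patterns (adjective + noun, noun + noun), the function name, and the toy input are assumptions for illustration; the input is taken to be already tokenized and tagged.

```python
from collections import Counter

# Assumed tag patterns for candidate collocations: adjective + noun, noun + noun.
GOOD_PATTERNS = {("JJ", "NN"), ("NN", "NN")}


def candidate_collocations(tagged_tokens, min_count=2):
    """Count adjacent word pairs whose tag pair is in GOOD_PATTERNS.

    Returns (pair, count) tuples with count >= min_count, most frequent first.
    """
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in GOOD_PATTERNS:
            counts[(w1.lower(), w2.lower())] += 1
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Invented, pre-tagged toy input.
    tokens = [
        ("strong", "JJ"), ("tea", "NN"), ("of", "IN"), ("the", "DT"),
        ("strong", "JJ"), ("tea", "NN"), ("of", "IN"), ("the", "DT"),
        ("day", "NN"),
    ]
    print(candidate_collocations(tokens))
```

Without the filter, the pair "of the" would be as frequent as "strong tea" in this input; the tag patterns remove such function-word pairs.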
NLP/Text Mining POINTERS

• NLP Books:
  – Manning and Schütze, Foundations of Statistical Natural Language Processing (MIT Press, 1999).
  – Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall.
• Books on Regular Expressions:
  – Jeffrey E.F. Friedl, Mastering Regular Expressions, O’Reilly.
  – Jan Goyvaerts, Regular Expressions Cookbook, O’Reilly.
NLP Research Groups

• Stanford NLP Group
  – http://nlp.stanford.edu/
• CMU NLP Group
  – http://www.cs.cmu.edu/~nasmith/nlp-cl.html
• UPenn NLP Group
  – http://nlp.cis.upenn.edu/
• NaCTeM – National Centre for Text Mining
  – http://www.nactem.ac.uk/
• UW – Turing Center
  – http://turing.cs.washington.edu/
NLP Libraries

• List of tools from Stanford NLP webpage
  – http://nlp.stanford.edu/links/statnlp.html
• Mallet – Machine learning for language toolkit
  – MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
  – UMASS - http://mallet.cs.umass.edu/
• Minorthird
  – MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text.
  – CMU - http://sourceforge.net/apps/trac/minorthird/wiki
• OpenNLP
  – OpenNLP hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package.
  – http://opennlp.sourceforge.net/
• GATE
  – General architecture for NLP tasks
  – http://gate.ac.uk/
Biomedical NLP and Text Mining Tools

• MetaMap (MMTx) - NLM
  – http://mmtx.nlm.nih.gov/
• NegEx, ConText – University of Pittsburgh – BluLab
  – http://www.dbmi.pitt.edu/blulab/index.html
• cTAKES – Mayo Clinic
  – https://cabigkc.nci.nih.gov/Vocab/KC/index.php/OHNLP_Documentation_and_Downloads
Biomedical Text Mining Tools

• Chilibot: A tool for finding relationships between genes or gene products.
• EBIMed: A web application that combines Information Retrieval and Extraction from Medline.
• FABLE: A gene-centric text-mining search engine for Medline.
• GOAnnotator: An online tool that uses semantic similarity for verification of electronic protein annotations using GO terms automatically extracted from literature.
• GoPubMed: Retrieves Medline abstracts for your search query, then detects ontology terms from the Gene Ontology and Medical Subject Headings in the abstracts and allows the user to browse the search results by exploring the ontologies and displaying only papers mentioning selected terms, their synonyms or descendants.
• Information Hyperlinked Over Proteins (iHOP): "A network of concurring genes and proteins extends through the scientific literature touching on phenotypes, pathologies and gene function. iHOP provides this network as a natural way of accessing millions of Medline abstracts. By using genes and proteins as hyperlinks between sentences and abstracts, the information in Medline can be converted into one navigable resource, bringing all advantages of the internet to scientific literature research."
• LitInspector: Gene and signal transduction pathway data mining in Medline abstracts.
• NextBio: Life sciences search engine with a text mining functionality that utilizes Medline abstracts and clinical trials to return concepts relevant to the query based on a number of heuristics including ontology relationships, journal impact, publication date, and authorship.
• PubAnatomy: An interactive visual search engine that provides new ways to explore relationships among Medline literature, text mining results, anatomical structures, gene expression and other background information.
• PubGene: Co-occurrence network display of gene and protein symbols as well as MeSH, GO, PubChem and interaction terms (such as "binds" or "induces") as these appear in Medline records (that is, PubMed titles and abstracts).
• TexFlame: An online tool that renders a single Medline abstract as a Systems Biology Graphical Notation (SBGN)-like graph. The graph is a complete syntactic-semantic representation of the abstract.
• Whatizit: Identifies molecular biology terms and links them to publicly available databases.
• XTractor: Discovering Newer Scientific Relations Across PubMed Abstracts. A tool to obtain manually annotated, expert-curated relationships for Proteins, Diseases, Drugs and Biological Processes as they get published in Medline.
Literature-based discovery tools



• Arrowsmith: UIC-based site for searching links between two literatures within Medline. Also contains the Author-ity tool for disambiguating authors on scientific papers, and the Anne O'Tate tool for summarizing the results of a PubMed query.
• BITOLA: Helps biomedical researchers make new discoveries by discovering potentially new relations between biomedical concepts.
• Manjal: Another LBD tool, by Padmini Srinivasan.