Transcript Shanker
SIG New Grad 2010
Vijay Shanker
General Info
• Research interests: Natural Language
Processing, Text Mining, Machine Learning,
NLP/Software Engineering
• Funding: NSF, NIH/NLM, USDA
• Teaching: Logic, Theory of Computation,
Machine Learning/Data Mining, Topics in
Natural Language Processing
Machine Learning with Minimal
Supervision
• Supervised Machine Learning requires annotated
data – what should the output be for each
instance.
• More success leads to wider use, but
– Lot of data to be annotated
– Often requires expert annotator
• Minimal Supervision
– Active learning
– Domain Adaptation
– Semi-supervised or unsupervised
Active Learning
Michael Bloodgood (2009)
• The learned model decides which instance should be
annotated next.
• Mike’s dissertation focused on active learning in a
situation where there is data imbalance
• Partially learned model (SVM) would figure out the
hard instances
• For imbalance, adjusted the model to penalize errors on the
minority class according to the degree of imbalance
• Statistical methods to infer the degree of imbalance
with high confidence
• Success in a variety of IR (e.g., text classification), NLP
and other domains
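A minimal sketch of the two ideas above, assuming the partially learned model exposes a signed decision value per unlabeled instance (for an SVM, the distance to the hyperplane). The function names are hypothetical, and using the raw majority/minority ratio as the error-cost multiplier is an assumption; the slides do not give the exact weighting used in the dissertation.

```python
def select_most_uncertain(decision_values, k=1):
    """Return indices of the k instances closest to the decision boundary.

    A small absolute decision value means the current model is least sure,
    so these are the "hard" instances to send to the annotator next.
    """
    ranked = sorted(range(len(decision_values)),
                    key=lambda i: abs(decision_values[i]))
    return ranked[:k]


def minority_cost_weight(n_majority, n_minority):
    """Error-cost multiplier for the minority class (assumed: imbalance ratio)."""
    if n_minority <= 0:
        raise ValueError("need at least one minority-class example")
    return n_majority / n_minority


# Hypothetical decision values for five unlabeled instances.
scores = [2.1, -0.05, 0.8, -1.7, 0.3]
print(select_most_uncertain(scores, k=2))   # [1, 4]
print(minority_cost_weight(900, 100))       # 9.0
```

In a full loop, the selected instances would be annotated, added to the training set, and the model retrained with the minority-class weight before the next selection round.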
Active Learning Contd.
• When do we stop? A fundamental but
understudied research problem
• When performance (accuracy) is good enough
– but we can’t use annotated data to decide
• Method based on stabilization of predictions
• ICML workshop on active Learning (2008),
HLT/NAACL (2009), CoNLL (2009)
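One way to picture a stopping rule based on stabilization of predictions: after each round, the current model labels a fixed set of unannotated instances, and learning stops once successive models agree on those labels for several rounds in a row. The sketch below is illustrative only. The plain agreement fraction, the 0.99 threshold, and the window of 3 rounds are assumptions, not values taken from the slides.

```python
def agreement(prev, curr):
    """Fraction of instances on which two prediction lists agree."""
    if not prev or len(prev) != len(curr):
        raise ValueError("prediction lists must be non-empty and equal length")
    return sum(p == c for p, c in zip(prev, curr)) / len(prev)


def should_stop(history, threshold=0.99, window=3):
    """Decide whether predictions have stabilized.

    history: one list of predicted labels per active-learning round,
             all made on the same unannotated "stop set".
    Stops when each of the last `window` consecutive pairs of rounds
    agrees at or above `threshold`.
    """
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return all(agreement(a, b) >= threshold
               for a, b in zip(recent, recent[1:]))


# Hypothetical predictions on a 4-instance stop set over 5 rounds.
rounds = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
]
print(should_stop(rounds[:4]))  # False: round 1 -> 2 still changed a label
print(should_stop(rounds))      # True: last 3 transitions are identical
```

No annotated data is needed for this check, which is the point of the approach: the stop set only has to be representative, not labeled.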
Domain Adaptation
• ML/Statistical NLP methods not successful when
application domain is different from training
domain
• Use same model but adapt to domain
• Figure out what is independent of domain and
what varies
• Use unannotated data from domain to make a
good guess and bootstrap with learned model
• Example– POS tagging (J. Miller & M. Torii)
– Syntax same but lexicon different (EMNLP 2008)
Learning from Positive Data
• Existence of incomplete databases (especially
in Bioinformatics)
• Databases only provide positive data
• But can’t learn without negative data
• Again issues involving bootstrapping, choosing
examples, imbalance etc.
Text Mining and Information
Extraction
• Mining – look for nuggets
• Text Mining – look for information/patterns in large
amounts of text
• Information Extraction – look for specific kind of
information from documents (text, websites, etc)
• Relation Extraction – two or more entities in specific
relation – from unstructured data to structured data
(Databases)
• Employer-Employee relation, 2 proteins that interact,
books and authors
• Much of my current work on biomedical research
articles
Text Mining
eGIFT -- Oana Tudor
• Large scale experimental data involve
thousands of genes
• Tens of thousands of genes exist, but a biologist might
know only a couple of hundred of them well
• Incomplete databases
• Most of the information is still only in the literature;
besides, the papers give the relevant context
to interpret the information
– but significant IR issues in search for genes
eGIFT
• Addresses gene based search (IR)
• Compares gene specific literature vs
background set to identify “key terms”
• Categorizes terms
• Links to sentences and literature
• Picking one good sentence to suggest relation
• Summary of gene and its properties
DNA Topoisomerase II alpha
• Identify “keywords” without any human intervention
• Top2a literature -- 758 abstracts
– nuclear (0.3), break (0.1), corepressor (0.003)
• Background -- 1.98 million abstracts
– nuclear (0.1), break (0.001), corepressor (0.05)
• Statistical tests for significance
• Extend or not?
– Nuclear, double strand break, anticancer, chromosome
segregation, …
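A rough sketch of the comparison above, using the numbers on this slide. It reads the parenthesized values as the fraction of abstracts mentioning each term (an assumption), and it uses a pooled two-proportion z-test with a cutoff of 3.0 as a stand-in for the unspecified "statistical tests for significance". The actual eGIFT test may differ.

```python
import math


def two_proportion_z(p_gene, n_gene, p_bg, n_bg):
    """z-score for a term being more frequent in the gene set than in background."""
    pooled = (p_gene * n_gene + p_bg * n_bg) / (n_gene + n_bg)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_gene + 1 / n_bg))
    return (p_gene - p_bg) / std_err


N_GENE = 758          # Top2a abstracts (from the slide)
N_BG = 1_980_000      # background abstracts (from the slide)
gene_freq = {"nuclear": 0.3, "break": 0.1, "corepressor": 0.003}
bg_freq = {"nuclear": 0.1, "break": 0.001, "corepressor": 0.05}
Z_THRESHOLD = 3.0     # assumed cutoff, not from the slide

z_scores = {term: two_proportion_z(gene_freq[term], N_GENE, bg_freq[term], N_BG)
            for term in gene_freq}

# Keep only terms that are significantly over-represented, strongest first.
key_terms = sorted((t for t, z in z_scores.items() if z > Z_THRESHOLD),
                   key=lambda t: -z_scores[t])
print(key_terms)  # ['break', 'nuclear']
```

"corepressor" gets a negative score because it is rarer in the Top2a literature than in the background, so it is not proposed as a key term.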
Identify Sentences
• Pick sentences with keywords based on where they
appear, where gene name appears, # of keywords
etc.
– DNA Topoisomerase IIA expression levels are related to tumor
growth and to resistance to anticancer chemotherapy
– DNA Topoisomerase IIalpha is an essential enzyme for
chromosome segregation during mitosis
– DNA topoisomerase II is a nuclear enzyme whose decatenating
activity on newly replicated DNA is essential to successful cell
division
• SMBM 2008, BMC Bioinformatics 2010
Current and Future Extensions
• Selecting and Ranking sentences – SVM rank
– Features – vagueness and sentence complexity
• Summarizing a gene and its properties –
wikigene
• Assist in interpretation of large scale
experiments, hypothesis generation and
knowledge discovery
Information Extraction
• Several relation extraction tasks, many from
biomedical text
• Research issues are not specific to any domain
or task
– Beyond sentences
– What kind of features (lexical, syntactic, semantic
features) help?
– Learn starting from positive data
– Could sentence simplification help?
Text Simplification for RE
• Sentences in WSJ, paper abstracts, etc. are
quite structurally complicated
• To humans the information to be extracted
may look straightforward
• But not straightforward for systems, e.g.,
because parsing (especially in new domains)
remains a hard task. So often these systems
rely on much “shallower” features.
Simplification Example
• MAPK phosphorylates BCL-2 at Serine 112.
• MAPK isolated from rat brain tissue
phosphorylates BCL-2 at Serine 112.
• MAPK phosphorylates BCL-2, a member of the
BAX family, at Serine 112.
• MAPK phosphorylates BCL-2 at Serine 112 and
BAD at Tyrosine 25.
Sentence Simplification
• MAPK isolated from rat brain tissue
phosphorylates BCL-2 at Serine 112.
– MAPK phosphorylates BCL-2 at Serine 112.
– MAPK was isolated from rat brain tissue.
• MAPK phosphorylates BCL-2 at Serine 112 and
BAD at Tyrosine 25.
– MAPK phosphorylates BCL-2 at Serine 112.
– MAPK phosphorylates BAD at Tyrosine 25.
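A toy illustration of the second example above, where a coordinated object is split into two simple sentences. It is only a sketch: the pattern is hard-coded to the surface shape "SUBJECT VERB OBJECT at SITE and OBJECT at SITE." and the function name is hypothetical. A real simplifier would work from a syntactic analysis, and the first example (the reduced relative clause "isolated from rat brain tissue") is not handled here.

```python
import re

# Surface pattern: "<subj> <verb> <obj1> at <site1> and <obj2> at <site2>."
COORDINATION = re.compile(r"(\S+) (\S+) (\S+) at (.+?) and (\S+) at (.+)\.")


def simplify(sentence):
    """Split one coordinated sentence into two; return others unchanged."""
    match = COORDINATION.fullmatch(sentence)
    if match is None:
        return [sentence]
    subj, verb, obj1, site1, obj2, site2 = match.groups()
    return [
        f"{subj} {verb} {obj1} at {site1}.",
        f"{subj} {verb} {obj2} at {site2}.",
    ]


for simple in simplify(
        "MAPK phosphorylates BCL-2 at Serine 112 and BAD at Tyrosine 25."):
    print(simple)
# MAPK phosphorylates BCL-2 at Serine 112.
# MAPK phosphorylates BAD at Tyrosine 25.
```

Each output sentence now states a single relation, which is the form a shallow-feature relation extractor handles most easily.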
NLPA for SE
• Collaboration with Lori Pollock
• Code as a different type of NL document
• Abbreviation expansion -- wcStrBrk
– Acronyms – International Business Machines (IBM)
vs AuctionEntry ae;
• Importance of verb object relation
– NL – verb classification e.g., drink, sip, gulp vs. eat,
bite, …
– Code – no subject; object in name or first parameter
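A small sketch of the first steps such an analysis needs: splitting an identifier into its words, then reading a method name as a verb plus object. The helper names are hypothetical, and treating the first word as the verb is a simplifying assumption. Expanding the pieces (e.g., of wcStrBrk) into full words would additionally need a dictionary or the surrounding code as context, which is not shown.

```python
import re


def split_identifier(name):
    """Split a camelCase identifier into words, keeping acronym runs together."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)


def verb_object(method_name, first_param_type=None):
    """Return (verb, object) for a method name.

    Assumes the first word is the verb. The object is taken from the rest
    of the name, or from the first parameter's type if the name has none.
    """
    words = [w.lower() for w in split_identifier(method_name)]
    if not words:
        raise ValueError("method name has no recognizable words")
    verb, rest = words[0], words[1:]
    if rest:
        return verb, " ".join(rest)
    if first_param_type:
        param_words = split_identifier(first_param_type)
        return verb, " ".join(w.lower() for w in param_words)
    return verb, None


print(split_identifier("wcStrBrk"))          # ['wc', 'Str', 'Brk']
print(verb_object("saveAuction"))            # ('save', 'auction')
print(verb_object("save", "AuctionEntry"))   # ('save', 'auction entry')
```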
Software Word Usage Model
• SWUM – virtual remodularization (OO vs
procedural) – where in the code is some
action completed
– Software search -- query expansion (Shepherd et al.
AOSD 2006), query reformulation (Hill et al. ICSE
2009)
– Software navigation – Hill et al. ASE 2009
– Comment generation – Giri et al. ASE 2010, ICSE
2011
Summary Comment Generation
• Giriprasad ASE 2010
• NL vs software code
– Location in document – differing importance
– Flow of information – lexical chains vs control and
data flow
• High level actions (ICSE 2011)
– Structural aspects, code conventions and design
patterns
• Text Segmentation vs code segmentation
– Readability (para), refactoring code, internal
comments
Naming Conventions
Classes and methods
• Call site – imperative statements vs. comment
– saveAuction(ae)
– savesAuction() or c.contains(str)
– auctionSaved()
– savedAuction
• Boolean returns – proposition and sentential
– window.isClosed()
• Verbs to nouns. Tyranny of nouns.
– Traditional vs action-oriented classes
– book, sequenceManager, xmlParser
Other Projects
• Colin Kern – with Prof. Liao
– Transmembrane protein structure prediction
– Incorporating local and long distance relations
• Dan Blanchard – with Prof. Jeffrey Heinz
– Unsupervised word segmentation (e.g., speech)
– What if we didn’t have a lexicon? Infant language
acquisition