Presentation - British Computer Society
Flexible Text Mining using Interactive Information Extraction
David Milward
Text mining vs. Data Mining
• Data mining
– getting new knowledge from databases
– suggesting new relationships, trends, patterns
• Text mining
– getting nuggets of information from text
– extracting relationships
– structured results to feed into data mining, visualisation or databases

company   activity   company
Sanofi    bid        Aventis
Roche     partner    Antisoma
Text Data Mining
• Emphasizes finding new knowledge from text
• Typically knowledge that is implicit within multiple documents
What is the relationship to IR?
• IR finds the most relevant documents
• Text mining finds information from within documents, or across documents
– What drugs are used for psoriasis treatment?
– Who are associated directly or indirectly with the Board of Exxon?
• There is overlap …
– we often search to answer a question, not to find a document
Traditional Information Extraction
• Uses natural language processing to distinguish
– Sanofi bid for Aventis
– Aventis bid for Sanofi
• Provides structured results for easy review and analysis

company   activity   company
Sanofi    bid        Aventis
Roche     partner    Antisoma

• Uses normalised terminology to allow integration with databases, e.g.
– Preferred term: Sanofi
– Synonyms: Sanofi Pasteur, Sanofi Synthelabo, Sanofi Synthélabo …
• But:
– typically limited to patterns on a single sentence
– constructing, testing and running queries can take days
• Appropriate if you always have the same question, e.g. you want to run the same query over a newsfeed every night
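The points above about word order and normalised terminology can be illustrated with a minimal sketch. It uses a regular expression as a crude stand-in for real natural language processing, and the `TERMINOLOGY` table, the `extract_bids` function and the "bid for" pattern are all assumptions made for illustration, not the method of any particular extraction system.

```python
import re

# Hypothetical terminology: preferred term -> synonyms (Sanofi entries taken from the slide).
TERMINOLOGY = {
    "Sanofi": ["Sanofi", "Sanofi Pasteur", "Sanofi Synthelabo", "Sanofi Synthélabo"],
    "Aventis": ["Aventis"],
}

# Reverse lookup: any synonym -> its preferred term.
SYNONYM_TO_PREFERRED = {
    synonym: preferred
    for preferred, synonyms in TERMINOLOGY.items()
    for synonym in synonyms
}

# Longest synonyms first, so "Sanofi Pasteur" is tried before plain "Sanofi".
_company = "|".join(
    re.escape(s) for s in sorted(SYNONYM_TO_PREFERRED, key=len, reverse=True)
)

# The order of the two groups preserves who bid for whom.
BID_PATTERN = re.compile(rf"(?P<subj>{_company}) bid for (?P<obj>{_company})")


def extract_bids(sentence):
    """Return (bidder, 'bid', target) rows using preferred terms."""
    return [
        (SYNONYM_TO_PREFERRED[m["subj"]], "bid", SYNONYM_TO_PREFERRED[m["obj"]])
        for m in BID_PATTERN.finditer(sentence)
    ]
```

Because each row carries the preferred term rather than the surface string, rows extracted from differently worded sources can be joined against a database keyed on that term.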
I2E: Interactive Information Extraction
• A new concept
• Encompasses
– keywords → documents
– patterns → relationships (structured output)
• Queries ranging from:
– General Motors
– General Motors & acquisition in the same document
– Automotive companies & acquisitions in the same sentence
– What companies is General Motors associated with?
• Not limited to patterns within sentences, e.g.
– Merger and acquisition activity in documents mentioning Japan
• Fast, scalable, versatile
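The difference between the document-level and sentence-level queries listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the sentence splitter is naive, matching is plain case-insensitive substring search, the function names are hypothetical, and the sample text (including the company "Acme Motors") is invented.

```python
import re

# Invented sample text; "Acme Motors" is a made-up company.
DOC = (
    "General Motors reported quarterly results. "
    "Separately, Acme Motors announced the acquisition of a parts supplier."
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def contains_any(text, terms):
    """True if any term occurs in text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def cooccur_in_document(doc, terms_a, terms_b):
    """Both term classes appear somewhere in the document."""
    return contains_any(doc, terms_a) and contains_any(doc, terms_b)


def cooccur_in_sentence(doc, terms_a, terms_b):
    """Return the sentences in which both term classes appear together."""
    return [
        s for s in split_sentences(doc)
        if contains_any(s, terms_a) and contains_any(s, terms_b)
    ]
```

On the sample text, "General Motors" and "acquisition" co-occur at document level but in no single sentence, which is why narrowing the scope from document to sentence tends to trade recall for precision.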
I2E
[Diagram labels: Information Extraction, NLP, Structured Output, Text Search, Taxonomies/Ontologies]
Linguistic Processing
• Groups words into meaningful units
• Morphology allows search for different forms of words
[Figure labels: sentences; noun phrases - match entities; verb groups - match actions; morphology - different forms]
We find that p42mapk phosphorylates c-Myb on serine and threonine .
Purified recombinant p42 MAPK was found to phosphorylate Wee1 .
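A minimal sketch of how the two example sentences above could be matched by a single query despite the different verb forms ("phosphorylates" / "phosphorylate") and entity spellings ("p42mapk" / "p42 MAPK"). The suffix stripping is a crude stand-in for real morphological analysis, and `verb_stem`, `squash` and `mentions` are hypothetical helper names, not part of any named tool.

```python
import re

# Suffixes tried in order; a crude approximation of English verb morphology.
SUFFIXES = ("ing", "es", "ed", "s", "e")


def verb_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def squash(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def mentions(sentence, entity, verb):
    """True if the sentence contains the entity (ignoring case, spaces and
    punctuation) and any form of the verb."""
    has_entity = squash(entity) in squash(sentence)
    stems = {verb_stem(tok) for tok in re.findall(r"[A-Za-z]+", sentence)}
    return has_entity and verb_stem(verb) in stems


S1 = "We find that p42mapk phosphorylates c-Myb on serine and threonine ."
S2 = "Purified recombinant p42 MAPK was found to phosphorylate Wee1 ."
```

Squashing the whole sentence can in principle match an entity across word boundaries, so a real system would match against tokenised noun phrases instead.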
Monitoring Merger and Acquisition Activity
Company Positions
Using I2E in the Life Sciences
• Good resources
– Scientific abstracts are readily available in XML
– Large number of existing taxonomies/terminologies
• Very large scale
– 16 million abstracts relevant to life sciences
– Large numbers of internal reports and full-text articles
– Internal documents can often be > 1000 pages, may be PDF images
– Taxonomies/terminologies are large, often deeply structured, e.g. 350K nodes, > 400K synonyms
– Still need to augment terminology for specific areas
[Other fragments from an overlapping text layer on this slide: "Relatively large scale"; "17 million abstracts"; "Growing ???? a year"; "> 100K concepts, ??? synonyms"]
Examples of Pharma Questions
• R&D
– Which proteins interact with metabolite X?
– What are the reaction kinetics for canonical pathway Y?
– What attributes are common to sets of biomarker genes?
– What are the known associations between expressed genes and environmental factors?
– What dosages of compound B cause adverse reactions?
• Competitive Intelligence
– Which companies are working on technology C?
– What compounds are available for in-licensing in a disease area?
– Which research groups are my competitors collaborating with?
Linking Drugs to Adverse Events
Measurements
• Extraction of numerical parameters
– e.g. amounts, dosages, concentrations
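As a rough illustration of extracting numerical parameters, the sketch below pulls value/unit pairs out of free text with a regular expression. The unit list, the pattern and the `extract_measurements` name are assumptions chosen for the example; a production system would need a far larger unit vocabulary plus handling for ranges and written-out numbers.

```python
import re

# Illustrative unit list only; the longer unit "mg/kg" is listed before "mg".
MEASUREMENT = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mg/kg|mg|ml|mM|nM|g)\b"
)


def extract_measurements(text):
    """Return (value, unit) pairs in order of appearance."""
    return [(float(m["value"]), m["unit"]) for m in MEASUREMENT.finditer(text)]
```

For instance, `extract_measurements("Patients received 50 mg twice daily.")` yields one pair with value `50.0` and unit `"mg"`, a structured row that can then be filtered or aggregated like any other numeric column.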
Benefits of Flexible Text Mining
• The ideal final query may use
– co-occurrence of terms within a document or sentence
– a precise linguistic pattern
– a mixture of both
• It depends on
– the nature of the task
– the availability of terminologies
– the kind of documents (news vs. science, abstract vs. full text)
– the time available to check results
• Flexibility to mix different techniques is also critical for fast development of queries
– e.g. start with broad queries to explore the “results space”, then home in
I2E: Better Results, Faster
[Chart: Count of Link (axis 0 to 10) per [c] Gene2 (BCL2, CDKN1A, DMPK, EPHB2, INS, MAP2K1, MAPK1, MAPK3, MAPK7, RB1, STK3, VIM), broken down by [c] Reln (activate, bind, block, co-express, inactivate, induce, inhibit, interact, mediate, phosphorylate, regulate, suppress)]
• Fast query creation
• Fast return of results
• Fast review and analysis
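A chart like the one on this slide (count of links per gene, broken down by relation) can be produced from extracted rows with a simple tally. The sketch below is an assumption about how such a summary might be computed, not a description of how the original chart was built; the function names are hypothetical, and any rows passed in are the caller's own data, not values read off the chart.

```python
from collections import Counter


def count_links(rows):
    """Tally identical (gene, relation) pairs from extracted rows."""
    return Counter(rows)


def pivot(counts):
    """Reshape {(gene, relation): n} into {gene: {relation: n}} for charting."""
    table = {}
    for (gene, relation), n in counts.items():
        table.setdefault(gene, {})[relation] = n
    return table
```

Each inner dictionary corresponds to one stacked bar: its keys are the relation types in the legend and its values are the segment heights.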
Impact of I2E
• Significant reduction in time spent searching/reading the literature
– weeks reduced to days or hours
• Structure the unstructured to
– provide systematic and comprehensive review of information content
– enable integration with traditional structured data
– allow complex analysis of literature-derived information
– generate hypotheses, gain insight