Neurocognitive Approach to Clustering PubMed
Neurocognitive approach to
clustering of PubMed query results
P. Matykiewicz, Włodzisław Duch,
Dept. of Informatics, Nicolaus Copernicus University, Toruń, Poland
P.M. Zender, K.A. Crutcher, J.P. Pestian
Cincinnati Children's Hospital Medical Center, Ohio, USA
Google: W. Duch
ICONIP 2008, Auckland, NZ
Plan
How can we help medical professionals
to find relevant information?
• Neurocognitive informatics.
• Semantic memory and other types of memory.
• Creating semantic memory.
• UMLS as a semantic memory.
• Spreading activation.
• Literature based discovery.
• Neurocognitive approach to literature based discovery.
• Plans for the future.
Neurocognitive informatics
Computational Intelligence. An International Journal (1984)
+ 10 other journals with “Computational Intelligence” in the title.
D. Poole, A. Mackworth, R. Goebel,
Computational Intelligence - A Logical Approach.
(OUP 1998), GOFAI book, logic and reasoning.
• CI: lower cognitive functions, perception, signal analysis, action control, sensorimotor behavior.
• AI: higher cognitive functions, thinking, reasoning, planning etc.
• Neurocognitive informatics: brain processes can be a great inspiration for AI algorithms, if only we could understand them ...
What are the neurons doing? Perceptrons, basic units in multilayer
perceptron networks, use threshold logic – NN inspirations.
What are the networks doing? Specific transformations, memory,
estimation of similarity.
How do higher cognitive functions map to the brain activity?
Neurocognitive informatics = abstractions of this process.
Types of memory
Neurocognitive approach to NLP: at least 4 types of memories.
Long term (LTM): recognition, semantic, episodic + working memory.
Input (text, speech) pre-processed using recognition memory model to
correct spelling errors, expand acronyms etc.
For dialogue/text understanding episodic memory models are needed.
Working memory: an active subset of semantic/episodic memory.
All 3 LTM types are mutually coupled, providing context for recognition.
Semantic memory is a permanent storage of conceptual data.
• “Permanent”: data is collected throughout the whole lifetime of the system, old information is overridden/corrected by newer input.
• “Conceptual”: contains semantic relations between words and uses them to create concept definitions.
Semantic Memory Models
Endel Tulving, “Episodic and Semantic Memory”, 1972.
Semantic memory refers to the memory of meanings and understandings.
It stores concept-based, generic, context-free knowledge.
Permanent container for general knowledge (facts, ideas, words etc).
[Figures: hierarchical model (Collins & Quillian, 1969); semantic network (Collins & Loftus, 1975)]
Semantic memory
Hierarchical model of semantic memory (Collins and Quillian, 1969),
followed by most ontologies.
Connectionist spreading activation model (Collins and Loftus, 1975), with
mostly lateral connections.
Our implementation is based on the connectionist model and uses a relational
database with an object access layer API.
The database stores three types of data:
• concepts, or objects being described;
• keywords (features of concepts extracted from data sources);
• relations between them.
The IS-A relation is used to build the ontology tree, which serves for activation
spreading, i.e. feature inheritance down the ontology tree.
Types of relations (like “x IS y”, or “x CAN DO y” etc.) may be defined
when input data is read from dictionaries and ontologies.
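The three kinds of stored data and the IS-A inheritance described above can be sketched as follows. This is a minimal in-memory illustration, not the relational database and API of the actual system: the class name `SemanticMemory`, the `HAS` relation label and the example concepts are all assumptions made for the sketch.

```python
from collections import defaultdict


class SemanticMemory:
    """Toy store of concepts, keywords and typed relations (hypothetical API)."""

    def __init__(self):
        # (subject, relation_type) -> set of related objects
        self.relations = defaultdict(set)

    def add(self, subject, relation_type, obj):
        self.relations[(subject, relation_type)].add(obj)

    def features(self, concept, _seen=None):
        """Return the concept's own keywords plus those inherited via IS-A."""
        seen = _seen if _seen is not None else set()
        if concept in seen:  # guard against cycles in the ontology
            return set()
        seen.add(concept)
        result = set(self.relations.get((concept, "HAS"), set()))
        for parent in self.relations.get((concept, "IS-A"), set()):
            result |= self.features(parent, seen)
        return result


sm = SemanticMemory()
sm.add("animal", "HAS", "metabolism")
sm.add("bird", "IS-A", "animal")
sm.add("bird", "HAS", "wings")
sm.add("canary", "IS-A", "bird")
sm.add("canary", "HAS", "yellow")

canary_features = sm.features("canary")
print(sorted(canary_features))
```

Features attached high in the tree are stored once and reach every descendant through the IS-A links, which is the inheritance the slide refers to.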
SM & neural distances
Activations of groups of neurons presented in activation space define
similarity relations in a geometrical model (McClelland, McNaughton,
O’Reilly, Why there are complementary learning systems, 1994).
Similarity between concepts
Left: MDS on vectors from neural network.
Right: MDS on data from psychological experiments with perceived
similarity between animals.
Vector and probabilistic models are approximations to this process.
$S_{ij} \sim \langle (w_i, \mathrm{Cont}) \mid (w_j, \mathrm{Cont}) \rangle$
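One common vector approximation to similarity between concepts is the cosine between their context vectors. The slides do not say which measure was used, so cosine similarity is a stand-in here, and the feature vectors below are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine between two sparse vectors given as {feature: weight} dicts."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical context vectors: each word described by co-occurring features.
dog = {"fur": 1.0, "barks": 1.0, "four_legs": 1.0}
cat = {"fur": 1.0, "meows": 1.0, "four_legs": 1.0}
canary = {"feathers": 1.0, "sings": 1.0, "wings": 1.0}

sim_dog_cat = cosine_similarity(dog, cat)
sim_dog_canary = cosine_similarity(dog, canary)
print(sim_dog_cat, sim_dog_canary)
```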
Creating SM
The API serves as a data access layer providing logical
operations between raw data and higher application layers.
Data stored in the database is mapped into application
objects and the API allows for retrieving specific concepts/keywords.
Two major types of data sources for semantic memory:
1. machine-readable structured dictionaries directly convertible into
semantic memory data structures;
2. blocks of text, definitions of concepts from dictionaries/encyclopedias.
Three machine-readable data sources are used:
• The Suggested Upper Merged Ontology (SUMO) and the Mid-Level Ontology (MILO), over 20,000 terms and 60,000 axioms.
• WordNet lexicon, more than 200,000 word-sense pairs.
• ConceptNet, concise knowledge base with 200,000 assertions.
Creating SM – free text
WordNet hypernymic (a kind of ...) IS-A relation + hyponym
and meronym relations between synsets (converted into
concept/concept relations), combined with ConceptNet relations
such as: CapableOf, PropertyOf, PartOf, MadeOf ...
Relations were added only if present in both WordNet and ConceptNet.
Free-text data: Merriam-Webster, WordNet and Tiscali.
Whole word definitions are stored in SM linked to concepts.
For each concept, a set of the most characteristic words is extracted from its definitions.
One set of words is taken from each source dictionary and replaced with synset
words; the subset common to all 3 is mapped back to synsets, which are most
likely related to the initial concept.
They were stored as a separate relation type.
Articles and prepositions: removed using a manually created stop-word list.
Phrases were extracted using ApplePieParser; concept-phrase relations were
compared with concept-keyword relations, and only phrases that matched
keywords were used.
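The keyword selection step above (stop-word removal, then keeping only words common to all source dictionaries) can be sketched as below. This is a simplified illustration: it skips the mapping to and from WordNet synsets, the stop-word list is a tiny hypothetical one, and the three definitions are made up rather than quoted from Merriam-Webster, WordNet or Tiscali.

```python
# Tiny hypothetical stop-word list; the original one was created manually.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "with", "that",
              "and", "or", "to", "is"}


def content_words(definition):
    """Lower-case the definition, strip punctuation, drop stop words."""
    words = (w.strip(".,;:()").lower() for w in definition.split())
    return {w for w in words if w and w not in STOP_WORDS}


def common_keywords(definitions):
    """Words that survive in the definitions from every source."""
    word_sets = [content_words(d) for d in definitions]
    return set.intersection(*word_sets) if word_sets else set()


# Invented definitions of "dog", one per imagined source dictionary.
definitions = [
    "A domesticated carnivorous mammal that barks.",
    "The dog is a mammal, domesticated in prehistoric times.",
    "Domesticated mammal related to the wolf.",
]

keywords = common_keywords(definitions)
print(sorted(keywords))
```

Requiring agreement between all sources trades recall for precision: fewer keywords survive, but those that do are unlikely to be incidental wording of a single dictionary.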
UMLS: Expert Semantic Memory
Biomedical domain: hundreds of controlled vocabularies, hierarchies
and ontologies.
• GO - gene ontology, used for gene annotation.
• ICD-9-CM - used for billing in US hospitals.
• SNOMED CT - used in electronic medical record systems.
• MeSH - used in annotation of biomedical literature in PubMed.
• Psychological Index Terms - used to annotate articles in psychology/psychiatry domain in PsycARTICLES citation database.
All of these sources and ~90 other sources connected together create:
Unified Medical Language System (UMLS).
This is the most detailed description of concepts and relations between
them created so far.
Some facts about UMLS
UMLS version 2007AC has:
• 92 English sources, including SNOMED CT, MeSH, ICD-9-CM,
ICD-10 etc.
• 54,245 ambiguous phrases;
• 3,723,408 unique English phrases;
• 1,516,299 concepts.
Concepts have:
• 16,918,281 unique structural (semantic) relations.
• 13,226,382 unique co-occurrence (associative) relations (e.g.
PubMed medical subject headings co-occurrence).
• attributes, contexts, definitions, semantic types, ...
Is it a good basis for semantic/episodic memory and spreading
activation networks approximating associations in an expert’s brain?
Enhancing representations
Experts reading the text activate their semantic memory and add a lot of
knowledge that is not explicitly present in the text.
Semantic memory is difficult to create: co-occurrence statistics do not
capture structural relations of real objects and features.
Better approximation (not as good as SM): use ontologies, adding parent
concepts to those discovered in the text.
Ex: IBD => [C0021390] Inflammatory Bowel Diseases
-> [C0341268] Disorder of small intestine
-> [C0012242] Digestive System Disorders
-> [C1290888] Inflammatory disorder of digestive tract
-> [C1334233] Intestinal Precancerous Condition
-> [C0851956] Gastrointestinal inflammatory disorders NEC
-> [C1285331] Inflammation of specific body organs
-> [C0021831] Intestinal Diseases
-> [C0178283] [X]Non-infective enteritis and colitis
[C0025677] Methotrexate (Pharmacologic Substance) =>
-> [C0003191] Antirheumatic Agents
-> [C1534649] Analgesic/antipyretic/antirheumatic
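Adding parent concepts to those found in the text can be sketched as a lookup in a parent map. The concept identifiers below are a subset of those in the example above; the hard-coded `PARENTS` dictionary and the `enhance` function are assumptions standing in for queries against the UMLS relation tables.

```python
# Subset of the parent relations listed in the example above.
PARENTS = {
    "C0021390": {"C0021831", "C0012242"},  # Inflammatory Bowel Diseases
    "C0025677": {"C0003191"},              # Methotrexate
}


def enhance(concepts, parents=PARENTS, steps=1):
    """Add parent concepts, going up to `steps` levels in the ontology."""
    enhanced = set(concepts)
    frontier = set(concepts)
    for _ in range(steps):
        frontier = {p for c in frontier for p in parents.get(c, ())} - enhanced
        if not frontier:
            break
        enhanced |= frontier
    return enhanced


doc_concepts = {"C0021390", "C0025677"}
enhanced = enhance(doc_concepts)
print(sorted(enhanced))
```

Two documents that mention different bowel disorders share no concept before enhancement, but may share a parent such as Intestinal Diseases afterwards, which is what makes the enhanced representation useful for clustering.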
Example without inhibition
Literature based discovery
Biomedical research is divided into highly specialized fields and
subfields, with poor communication between them.
The rate of growth of publications makes it difficult for a
researcher to derive connections between concepts from
different research specialties.
Mining hidden connections among biomedical concepts from
large amounts of scientific literature is one of the important goals
pursued in this field.
Swanson explored biomedical literature to find novel connections
between medical concepts. He proposed that “Fish Oil” may be
used as a treatment for “Raynaud's Disease”.
Researchers followed up on his finding and the hypothesis turned
out to be true.
Literature based discovery example
• Swanson found the hidden connection between “Fish Oil” and
“Raynaud's Disease” by finding the set of concepts common to the
document sets on “Fish Oil” and “Raynaud's Disease”.
[Diagram: Fish Oil and Raynaud’s disease connected through the shared concepts “High blood viscosity” and “Platelet aggregation”]
You can make medical discoveries!
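The idea of finding concepts shared by two otherwise separate literatures can be sketched as a set intersection. The document sets below are invented toy data (only the two linking concepts come from the slide), and `linking_concepts` is a hypothetical helper, not Swanson's actual procedure.

```python
def linking_concepts(docs_a, docs_c, exclude=()):
    """Concepts that occur in both literatures, minus the query terms."""
    concepts_a = set().union(*docs_a)
    concepts_c = set().union(*docs_c)
    return (concepts_a & concepts_c) - set(exclude)


# Each document is represented by the set of concepts recognized in it.
fish_oil_docs = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation", "triglycerides"},
]
raynaud_docs = [
    {"raynaud disease", "blood viscosity", "vasoconstriction"},
    {"raynaud disease", "platelet aggregation"},
]

links = linking_concepts(fish_oil_docs, raynaud_docs,
                         exclude={"fish oil", "raynaud disease"})
print(sorted(links))
```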
Literature based discovery using
Visual Language System
VLS hypothesis:
interesting relations are recognized more quickly when the graph is
presented as icons.
First, consistent graphs
are needed.
Graphs of consistent concepts
General GCC idea:
• when the text is read and understood, activation of a semantic subnetwork
in the expert's brain spreads to new patterns, corresponding to related
concepts;
• new concepts automatically have to fit the active network, assuming
meanings that increase overall network activation, or the consistency of
text interpretation.
Many approximations of this process may be defined.
Success depends on the quality of the semantic network.
Explicit competition/inhibition among network nodes is important.
1. Recognition of concepts.
2. Spreading activation from concepts that are in the text to related
concepts.
3. Build the graph, inhibiting concepts that are irrelevant.
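Steps 2 and 3 can be sketched on a small weighted graph. This is one of the many possible approximations, not the algorithm used in the talk: the `decay` and `threshold` parameters, the simple cut-off used as inhibition, and the edge weights of the toy graph are all assumptions.

```python
def spread_activation(graph, seeds, steps=1, decay=0.5, threshold=0.3):
    """Spread activation from seed concepts, then prune weak nodes.

    graph: {concept: {neighbour: weight}}; seeds: concepts found in the text.
    """
    activation = {c: 1.0 for c in seeds}
    for _ in range(steps):
        incoming = {}
        for node, act in activation.items():
            for neighbour, weight in graph.get(node, {}).items():
                incoming[neighbour] = (incoming.get(neighbour, 0.0)
                                       + decay * weight * act)
        for node, extra in incoming.items():
            activation[node] = min(1.0, activation.get(node, 0.0) + extra)
        # Inhibition stand-in: drop weakly activated concepts, keep the seeds.
        activation = {n: a for n, a in activation.items()
                      if a >= threshold or n in seeds}
    return activation


# Hypothetical relations and weights, for illustration only.
graph = {
    "alzheimer disease": {"dementia": 1.0, "amyloid": 0.8},
    "apolipoprotein e": {"amyloid": 0.8, "cholesterol": 0.4},
}
seeds = {"alzheimer disease", "apolipoprotein e"}

result = spread_activation(graph, seeds)
print(sorted(result))
```

A concept supported by both seeds accumulates activation from two directions and survives, while one reached through a single weak link falls below the threshold and is removed from the graph.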
PubMed queries
Searching for:
"Alzheimer disease"[MeSH Terms] AND "apolipoproteins e"[MeSH Terms] AND "humans"[MeSH Terms]
returns 2899 citations with 1924 MeSH terms.
Out of 16 MeSH hierarchical trees only 4 trees have been
selected: Anatomy; Diseases; Chemicals & Drugs;
Analytical, Diagnostic and Therapeutic Techniques &
Equipment. The number of concepts is 1190.
Loop over:
Cluster analysis;
Feature space enhancement through UMLS relations between
MeSH concepts;
Inhibition, leading to filtering of concepts.
Create graphical representation.
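The loop above can be sketched on toy data. Everything here is a simplified stand-in for the real pipeline: single-link clustering on Jaccard similarity replaces the cluster analysis, a document-frequency filter replaces inhibition, and the `related` map with its concepts is invented rather than taken from UMLS.

```python
from collections import Counter


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(docs, min_sim=0.3):
    """Single-link clustering: merge documents whose similarity >= min_sim."""
    labels = list(range(len(docs)))
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(docs[i], docs[j]) >= min_sim:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels


def enhance_docs(docs, related):
    """Add concepts related to those already present in each document."""
    return [d | {r for c in d for r in related.get(c, ())} for d in docs]


def inhibit(docs, original, min_docs=2):
    """Keep original concepts; keep added ones only if shared by enough docs."""
    counts = Counter(c for d in docs for c in d)
    return [{c for c in d if c in o or counts[c] >= min_docs}
            for d, o in zip(docs, original)]


def iterate(docs, related, steps=1):
    original = [set(d) for d in docs]
    current = original
    history = [cluster(current)]
    for _ in range(steps):
        current = inhibit(enhance_docs(current, related), original)
        history.append(cluster(current))
    return current, history


docs = [{"amyloid plaques"}, {"neurofibrillary tangles"}, {"cholesterol"}]
related = {
    "amyloid plaques": {"brain pathology"},
    "neurofibrillary tangles": {"brain pathology"},
    "cholesterol": {"lipids"},
}
final_docs, history = iterate(docs, related)
print(history)
```

In this toy run the first two documents share nothing initially, gain a common added concept after enhancement and end up in one cluster, while the added concept that appears in only one document is filtered out.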
Initial step - MDS showing clusters
First step of activation:
new concepts that are added
First step of activation: concepts that
represent clusters/relations between them
First step of activation:
clusters after enhancement
2nd step of activation:
new concepts that are added
2nd step of activation: concepts that
represent clusters/relations between them
6th step of activation: concepts that
represent clusters/relations
6th step of activation: concepts that are
used in all steps of spreading activation
6th step of activation: clusters
Future work
Collaborative work with:
Graphical designers
• Design glyphs as a basis for icons
• Design rules for how glyphs are connected to create an icon
• Design layout for consistent graphs
Computer scientists
• Study effects of inhibition (different feature selection
methods)
• Study properties of spreading activation algorithm
• Apply to other fields (e.g. text classification)
Field experts
• Study performance of experts when text graph vs. icon
graph is presented
• Rate graphs based on their content
Thank you for lending your ears ...
Google: W. Duch => Papers/presentations/projects