Visualization of Relational Text Information
Visualization of Relational Text Information for Biomedical Knowledge Discovery
James W. Cooper
IBM T J Watson Research Center
Hawthorne, NY
Overview
Prior work
Java based text mining
Computation of unnamed relations
Graphical display of relations
Relations between terms
Noun phrase co-occurrence statistics [Roark,
Charniak]
Choose seed words and look for terms near them.
[Brin] [Gravano, Agichtein]
– Repeat
Biomedical domain
– Blaschke used dictionary of common verbs
– Pustejovsky found inhibit relations
Stevens, Palakal, Mostafa
– Detected abstract-wide co-occurrence using dictionary
of genes and useful verbs.
Graphical Displays
BioLayout – protein similarity
ProtInAct – interactive system using yFiles
Zhang – interactive 3D system
Jenssen – gene network
Leroy – GeneScene
BioLayout – Enright and Ouzounis
Five related protein families and their
corresponding relationships.
Spheres represent proteins and lines
represent protein similarities.
ProInAct – Spencer and Bennett
Proteins clustered by functional interaction
Zhang-Protein interaction mapping
Jenssen – A literature network
Lines connect genes that have co-occurred in 1 or more papers.
Leroy – GeneScene
What would we like to do?
Find scientifically meaningful connections
between important terms.
– Such as Swanson’s connection between Raynaud’s disease
and fish oil.
Allow exploration of relations by user.
Filter the relations by ontology or term
types
Perform path analysis
Let the user vary the graphical display.
Data we analyzed
Two sets of patent data
– 584 patents on Viagra and phosphodiesterase
inhibitors.
– 1514 patents on quinolones (like Cipro)
Recognized major technical terms in each
patent.
Filtered organic chemical nomenclature.
The Talent text mining system
Text Analysis and Language Engineering
Tools
– Finds multiword noun phrases
– Does shallow parse
– Can extract NPs and VGs
As well as all other sentence parts
The JTalent Library
Java class library with JNI interface
– To Talent DLL
Creates database load files of terms
– Paragraph
– Sentence
– Offset
– Term type (NP, VG)
TalentShow Demo
The KSS Library
Java class library of functions for
– Accessing a database (DB2, Access)
– Manipulating a search engine
– Manipulating tables of information created by
JTalent.
Database Tables
Documents
– Title, author, URL, ID
TermDocs
– Term
– Paragraph
– Sentence
– Offset
– Type
Dictionary of terms, types and IDs
– Such as MeSH
Computing term information
Compute unique terms from Termdocs
Compute frequency
Compute salience
– Based on frequency
– Number of docs they appear in more than
once
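The two quantities named above can be sketched as follows. This is a minimal illustration, not the KSS library code: the class `TermStats` and its `Occurrence` row type are hypothetical names, and because the slides do not say how frequency and document counts are combined into a single salience score, the sketch simply returns both.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of JTalent or KSS.
class TermStats {
    // A simplified TermDocs row: one occurrence of a term in one document.
    static final class Occurrence {
        final String term;
        final int docId;
        Occurrence(String term, int docId) {
            this.term = term;
            this.docId = docId;
        }
    }

    // Total number of occurrences of each unique term across all documents.
    static Map<String, Integer> frequency(Iterable<Occurrence> rows) {
        Map<String, Integer> freq = new HashMap<>();
        for (Occurrence o : rows) {
            freq.merge(o.term, 1, Integer::sum);
        }
        return freq;
    }

    // Number of documents in which each term appears more than once.
    static Map<String, Integer> docsWithRepeats(Iterable<Occurrence> rows) {
        // term -> (document id -> occurrences in that document)
        Map<String, Map<Integer, Integer>> perDoc = new HashMap<>();
        for (Occurrence o : rows) {
            perDoc.computeIfAbsent(o.term, k -> new HashMap<>())
                  .merge(o.docId, 1, Integer::sum);
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> e : perDoc.entrySet()) {
            int docs = 0;
            for (int count : e.getValue().values()) {
                if (count > 1) {
                    docs++;
                }
            }
            result.put(e.getKey(), docs);
        }
        return result;
    }
}
```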
Compute term relations
Named relations based on abbreviation
expansions.
Unnamed relations based on proximity,
with weight based on how frequently they
occur near each other.
Mutual information weight:
$$m = \log \frac{\text{totalterms} \cdot \text{paircount}}{\text{freq1} \cdot \text{freq2}}$$
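As a rough sketch of how an unnamed relation might be weighted, the code below counts term pairs and applies the mutual information weight m = log(totalterms · paircount / (freq1 · freq2)). Several points are assumptions made for illustration: "near each other" is taken to mean "in the same sentence", the logarithm is natural, and the class name `ProximityRelations` and the `|` pair-key separator are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper; the proximity window (one sentence) is an assumption.
class ProximityRelations {
    // Counts how often each unordered pair of distinct terms shares a sentence.
    // Keys have the form "termA|termB" with the two terms in sorted order.
    static Map<String, Integer> pairCounts(List<List<String>> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> sentence : sentences) {
            // Sort and de-duplicate so each pair is counted once per sentence.
            List<String> terms = new ArrayList<>(new TreeSet<>(sentence));
            for (int i = 0; i < terms.size(); i++) {
                for (int j = i + 1; j < terms.size(); j++) {
                    counts.merge(terms.get(i) + "|" + terms.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Mutual information weight: log(totalTerms * pairCount / (freq1 * freq2)).
    static double weight(long totalTerms, long pairCount, long freq1, long freq2) {
        if (totalTerms <= 0 || pairCount <= 0 || freq1 <= 0 || freq2 <= 0) {
            throw new IllegalArgumentException("all counts must be positive");
        }
        return Math.log((double) totalTerms * pairCount / ((double) freq1 * freq2));
    }
}
```

A weight of zero means the pair co-occurs about as often as chance would predict; larger positive values indicate a stronger association.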
Tuning Computed relations
Select only terms above a salience
threshold.
Keep only relations in which one or both terms are
members of an ontology.
Store relations in a database table for rapid
access:
Term | weight | term
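The two tuning rules could be applied to Term | weight | term rows along these lines. This is only a sketch: `RelationFilter` and `Relation` are hypothetical names, the salience scores are assumed to be available in a map, and requiring both terms to pass the salience threshold is one possible reading of the first rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper illustrating the two filters; not the original code.
class RelationFilter {
    // One row of the relations table: Term | weight | term.
    static final class Relation {
        final String term1;
        final double weight;
        final String term2;
        Relation(String term1, double weight, String term2) {
            this.term1 = term1;
            this.weight = weight;
            this.term2 = term2;
        }
    }

    // Keeps relations whose terms are both above the salience threshold
    // (an assumption) and in which one or both terms belong to the ontology.
    static List<Relation> filter(List<Relation> relations,
                                 Map<String, Double> salience,
                                 double threshold,
                                 Set<String> ontology) {
        List<Relation> kept = new ArrayList<>();
        for (Relation r : relations) {
            boolean salient = salience.getOrDefault(r.term1, 0.0) > threshold
                    && salience.getOrDefault(r.term2, 0.0) > threshold;
            boolean inOntology = ontology.contains(r.term1) || ontology.contains(r.term2);
            if (salient && inOntology) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```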
Original System
Visual client
SOAP server
– Queries database to get relations
– Round trip for each new query
Instead, we export the data for the user to
visualize as they wish.
Exporting relations
Save relations and ontology information in an XML file.
<relation>
  <term>
    <iq>78</iq>
    <source>MeSH</source>
    <relationDocuments>
      <doc>34</doc>
    </relationDocuments>
  </term>
  <term> </term>
</relation>
This XML file is a portable version of the computed
relations that we can then use with any number of
viewers.
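A viewer might read such a file with the standard Java DOM parser, roughly as sketched below. The element names `relation`, `term`, `iq`, `source`, and `doc` come from the fragment above; the class `RelationXmlReader`, its `Term` holder, and any enclosing root element are assumptions, since the slides show only part of the format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the exported relations file.
class RelationXmlReader {
    static final class Term {
        Integer iq;      // null when the <iq> element is absent
        String source;   // ontology source such as "MeSH"; null when absent
        final List<Integer> docs = new ArrayList<>();
    }

    // Returns one list of terms per <relation> element, in document order.
    static List<List<Term>> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<List<Term>> relations = new ArrayList<>();
            NodeList relNodes = doc.getElementsByTagName("relation");
            for (int i = 0; i < relNodes.getLength(); i++) {
                Element rel = (Element) relNodes.item(i);
                List<Term> terms = new ArrayList<>();
                NodeList termNodes = rel.getElementsByTagName("term");
                for (int j = 0; j < termNodes.getLength(); j++) {
                    Element t = (Element) termNodes.item(j);
                    Term term = new Term();
                    String iq = firstText(t, "iq");
                    if (iq != null) {
                        term.iq = Integer.valueOf(iq);
                    }
                    term.source = firstText(t, "source");
                    NodeList docNodes = t.getElementsByTagName("doc");
                    for (int k = 0; k < docNodes.getLength(); k++) {
                        term.docs.add(Integer.valueOf(docNodes.item(k).getTextContent().trim()));
                    }
                    terms.add(term);
                }
                relations.add(terms);
            }
            return relations;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse relations XML", e);
        }
    }

    // Trimmed text of the first descendant with the given tag, or null if none.
    private static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```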
A Graphical Relations Viewer
Creates a Java Relations object for each
relation it reads from the XML file.
Inserts them into a Trie structure based on
lower cased first term.
– If there is already a Relation at that point, it
adds the new one to a Vector for that term.
Creates an alphabetical list of all terms in a
2nd Trie.
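The trie described above might look roughly like the following sketch. It is not the viewer's actual code: `TermTrie` is a hypothetical class, it stores the related term as a plain string rather than a full Relation object, it uses an `ArrayList` where the slides mention a `Vector`, and it folds the term list (kept in a second trie in the original) into the same structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical trie keyed on the lower-cased first term of each relation.
class TermTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so traversal yields alphabetical order.
        final Map<Character, Node> children = new TreeMap<>();
        final List<String> relations = new ArrayList<>();
        boolean isTerm;
    }

    private final Node root = new Node();

    // Stores a relation under the lower-cased first term; repeated terms
    // accumulate their relations in the same list.
    void add(String term, String related) {
        Node node = root;
        for (char c : term.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isTerm = true;
        node.relations.add(related);
    }

    // All relations stored for exactly this term (case-insensitive).
    List<String> relationsOf(String term) {
        Node node = find(term);
        if (node == null) {
            return new ArrayList<String>();
        }
        return new ArrayList<String>(node.relations);
    }

    // All stored terms that begin with the given fragment, alphabetically.
    List<String> termsStartingWith(String prefix) {
        List<String> out = new ArrayList<>();
        Node node = find(prefix);
        if (node != null) {
            collect(node, new StringBuilder(prefix.toLowerCase()), out);
        }
        return out;
    }

    private Node find(String key) {
        Node node = root;
        for (char c : key.toLowerCase().toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isTerm) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }
}
```

Prefix lookup of this kind is what lets a term list be narrowed as the user types a fragment, and the exact lookup returns the relations to show once a term is selected.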
Using the Viewer
When you enter part of a
term, it shows all terms
starting with that fragment in
the left list box.
When you click on a term, it
shows all its relations in the
right list box.
Lexical Navigation
Displays relations
between terms
graphically and allows
you to explore them
without formulating a
specific query.
Possible enhancements
Show only terms belonging to an ontology.
Show only higher IQ terms
Show the documents the relations occur in.
Show the ontology reference.
Show computed paths
Show more kinds of named relations.
– Inhibits, expresses
Evaluations of Information
Visualization
Few, if any, graphical displays have been
evaluated thus far for effectiveness.
Usability studies are hard to construct and carry
out.
Intuition seems to show
– that exploration may result in discoveries.
– Relations more than one step apart seem best
displayed graphically.
Remains to be shown that such visualizations are
actually useful.
Differences in Intent
Displays may represent information your
system has discovered.
– Gene – protein relations
Or they may represent data from which the
user may discover new information.
– New 2nd or 3rd order relationships
These are rather different applications of
visualization technology
Summary
Java-based text mining system
Database of terms and positions
Computation of relations
Export as XML
Graphical relations viewer
The value of such visual interfaces has not
yet been established.
Acknowledgements
Bhavani Iyer – XML export
Eric Brown – DictMatcher hash code
Daniel Tunkelang – graphical layout
Bob Mack – paper suggestions