Visualization of Relational Text Information for Biomedical Knowledge Discovery
James W. Cooper
IBM T J Watson Research Center
Hawthorne, NY

Overview
• Prior work
• Java-based text mining
• Computation of unnamed relations
• Graphical display of relations

Relations between terms
• Noun phrase co-occurrence statistics [Roark, Charniak]
• Choose seed words and look for terms near them. [Brin] [Gravano, Agichtein]
– Repeat
• Biomedical domain
– Blaschke used a dictionary of common verbs
– Pustejovsky found inhibit relations
• Stevens, Palakal, Mostafa
– Detected abstract-wide co-occurrence using a dictionary of genes and useful verbs.

Graphical Displays
• BioLayout – protein similarity
• ProtInAct – interactive system using yFiles
• Zhang – interactive 3D system
• Jenssen – gene network
• Leroy – GeneScene

BioLayout – Enright and Ouzounis
Five related protein families and their corresponding relationships. Spheres represent proteins and lines represent protein similarities.

ProInAct – Spencer and Bennett
Proteins clustered by functional interaction

Zhang – Protein interaction mapping

Jenssen – A literature network
Lines connect genes that have co-occurred in one or more papers.

Leroy – GeneScene

What would we like to do?
• Find scientifically meaningful connections between important terms.
– Such as Swanson’s Raynaud’s disease – fish oil connection.
• Allow exploration of relations by the user.
• Filter the relations by ontology or term types
• Perform path analysis
• Let the user vary the graphical display.

Data we analyzed
• Two sets of patent data
– 584 patents on Viagra and phosphodiesterase inhibitors.
– 1514 patents on quinolones (like Cipro)
• Recognized major technical terms in each patent.
• Filtered organic chemical nomenclature.

The Talent text mining system
• Text Analysis and Language Engineering Tools
– Finds multiword noun phrases
– Does a shallow parse
– Can extract noun phrases (NPs) and verb groups (VGs), as well as all other sentence parts

The JTalent Library
• Java class library with JNI interface
– To Talent DLL
• Creates database load files of terms
– Paragraph
– Sentence
– Offset
– Term type (NP, VG)

TalentShow Demo

The KSS Library
• Java class library of functions for
– Accessing a database (DB2, Access)
– Manipulating a search engine
– Manipulating tables of information created by JTalent.

Database Tables
• Documents
– Title, author, URL, ID
• TermDocs
– Term
– Paragraph
– Sentence
– Offset
– Type
• Dictionary of terms, types and IDs
– Such as MeSH
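
As a rough illustration of the TermDocs layout above, one row could be modeled as a small Java value class and written out as a single line of a database load file. The class name `TermDoc`, the `docId` field, the pipe delimiter, and the field order are all assumptions made for this sketch; the slides do not specify the actual load-file format.

```java
// Hypothetical model of one TermDocs row. Names and file format are assumptions.
final class TermDoc {
    final int docId;      // assumed link to the ID column of the Documents table
    final String term;    // the recognized term text
    final int paragraph;  // paragraph in which the term occurs
    final int sentence;   // sentence in which the term occurs
    final int offset;     // position of the term
    final String type;    // term type, for example "NP" or "VG"

    TermDoc(int docId, String term, int paragraph, int sentence, int offset, String type) {
        this.docId = docId;
        this.term = term;
        this.paragraph = paragraph;
        this.sentence = sentence;
        this.offset = offset;
        this.type = type;
    }

    // Formats the row as one pipe-delimited load-file line (assumed format).
    String toLoadLine() {
        return docId + "|" + term + "|" + paragraph + "|" + sentence + "|" + offset + "|" + type;
    }

    public static void main(String[] args) {
        TermDoc row = new TermDoc(34, "phosphodiesterase inhibitor", 2, 1, 17, "NP");
        System.out.println(row.toLoadLine());
    }
}
```
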
Computing term information
• Compute unique terms from TermDocs
• Compute frequency
• Compute salience
– Based on frequency
– Number of docs they appear in more than once
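
The steps above can be sketched in a few lines of Java. This is a minimal sketch, not the KSS implementation: the class name `TermStats`, the input shape (one `{docId, term}` pair per TermDocs row), and the lower-casing of terms are assumptions. The slides do not give the salience formula, so the sketch only computes the two ingredients they name: total frequency and the number of documents in which a term appears more than once.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that derives per-term statistics from TermDocs rows.
final class TermStats {
    // Total occurrences of each unique term across all documents.
    final Map<String, Integer> frequency = new TreeMap<>();
    // Number of documents in which each term occurs more than once.
    final Map<String, Integer> docsWithRepeats = new TreeMap<>();

    // Each element of occurrences is {docId, term}, one per TermDocs row.
    static TermStats compute(String[][] occurrences) {
        TermStats stats = new TermStats();
        // term -> (docId -> occurrences of the term in that document)
        Map<String, Map<String, Integer>> perDoc = new HashMap<>();
        for (String[] occ : occurrences) {
            String docId = occ[0];
            String term = occ[1].toLowerCase(); // assumption: terms are case-insensitive
            stats.frequency.merge(term, 1, Integer::sum);
            perDoc.computeIfAbsent(term, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
        for (Map.Entry<String, Map<String, Integer>> e : perDoc.entrySet()) {
            int repeats = 0;
            for (int count : e.getValue().values()) {
                if (count > 1) {
                    repeats++;
                }
            }
            stats.docsWithRepeats.put(e.getKey(), repeats);
        }
        return stats;
    }
}
```

The keys of `frequency` are the unique terms; a salience score could then be derived from the two maps in whatever way the application prefers.
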
Compute term relations
• Named relations based on abbreviation expansions.
• Unnamed relations based on proximity, with weight based on how frequently they occur near each other.
• Mutual information weight:

\[ m = \log\left( \frac{\mathit{totalterms} \times \mathit{paircount}}{\mathit{freq}_1 \times \mathit{freq}_2} \right) \]
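
The mutual information weight can be computed directly from the four counts. The following is a minimal sketch; the class and method names are hypothetical, and the natural logarithm is an assumption, since the slide does not state the base.

```java
// Hypothetical helper for the mutual information weight of a term pair.
final class MutualInformation {
    // m = log((totalTerms * pairCount) / (freq1 * freq2)), using the natural log.
    static double weight(long totalTerms, long pairCount, long freq1, long freq2) {
        if (totalTerms <= 0 || pairCount <= 0 || freq1 <= 0 || freq2 <= 0) {
            throw new IllegalArgumentException("all counts must be positive");
        }
        // Convert to double before multiplying to avoid long overflow on large corpora.
        return Math.log(((double) totalTerms * pairCount) / ((double) freq1 * freq2));
    }

    public static void main(String[] args) {
        System.out.println(weight(1000, 20, 100, 100));
    }
}
```

For example, with 1000 total terms, two terms that each occur 100 times and co-occur 20 times get m = log 2, about 0.69. A pair that co-occurs only as often as chance would predict (10 times in this example) gets m = 0.
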


Tuning Computed relations
• Select only terms above a salience threshold.
• Keep only relations in which one or both terms are members of an ontology.
• Store relations in a database table for rapid access:
– Term | weight | term
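
The two filters above can be sketched as a single pass over the computed relations. This is an illustrative sketch only: the class names, the in-memory `Map` of salience scores, and the `Set` standing in for an ontology are assumptions, and requiring both terms to pass the salience threshold is one reading of the first bullet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical filter over computed relations.
final class RelationFilter {
    // One stored relation row: term | weight | term.
    static final class Relation {
        final String term1;
        final double weight;
        final String term2;

        Relation(String term1, double weight, String term2) {
            this.term1 = term1;
            this.weight = weight;
            this.term2 = term2;
        }
    }

    // Keeps relations whose terms are both above the salience threshold
    // and in which at least one term is a member of the ontology.
    static List<Relation> filter(List<Relation> relations, Map<String, Double> salience,
                                 double threshold, Set<String> ontology) {
        List<Relation> kept = new ArrayList<>();
        for (Relation r : relations) {
            boolean salient = salience.getOrDefault(r.term1, 0.0) > threshold
                    && salience.getOrDefault(r.term2, 0.0) > threshold;
            boolean inOntology = ontology.contains(r.term1) || ontology.contains(r.term2);
            if (salient && inOntology) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```
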

Original System
• Visual client
• SOAP server
– Queries database to get relations
– Round trip for each new query
• Instead, we export the data for the user to visualize as they wish.

Exporting relations
• Save relations and ontology information in an XML file.

<relation>
  <term>
    <iq>78</iq>
    <source>MeSH</source>
    <relationDocuments>
      <doc>34</doc>
    </relationDocuments>
  </term>
  <term> </term>
</relation>

• This XML file is a portable version of the computed relations that we can then use with any number of viewers.
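
A viewer could read such a file with the standard Java DOM parser. The sketch below is an assumption-laden illustration, not the exporter's actual schema: the slide shows only the `iq`, `source`, `relationDocuments`, and `doc` elements, so the `name` element holding the term text is hypothetical, as are the class and method names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for the exported relations XML.
final class RelationXmlReader {
    // Returns one summary per <term> element in the form "name|iq|source|doc,doc,...".
    static List<String> readTerms(String xml) {
        Document document;
        try {
            document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse relations XML", e);
        }
        List<String> summaries = new ArrayList<>();
        NodeList terms = document.getElementsByTagName("term");
        for (int i = 0; i < terms.getLength(); i++) {
            Element term = (Element) terms.item(i);
            List<String> docIds = new ArrayList<>();
            NodeList docs = term.getElementsByTagName("doc");
            for (int j = 0; j < docs.getLength(); j++) {
                docIds.add(docs.item(j).getTextContent().trim());
            }
            // "name" is an assumed element; the slide does not show where the term text lives.
            summaries.add(firstText(term, "name") + "|" + firstText(term, "iq") + "|"
                    + firstText(term, "source") + "|" + String.join(",", docIds));
        }
        return summaries;
    }

    // Trimmed text of the first descendant with the given tag, or "" if there is none.
    private static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```
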
A Graphical Relations Viewer
• Creates a Java Relations object for each relation it reads from the XML file.
• Inserts them into a Trie structure keyed on the lower-cased first term.
– If there is already a Relation at that point, it adds them to a Vector for that term.
• Creates an alphabetical list of all terms in a 2nd Trie.
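
The trie described above might look roughly like the following. This is a simplified sketch rather than the viewer's code: it uses a single trie for both relation lookup and alphabetical prefix listing (the slides describe two), an `ArrayList` instead of a `Vector`, and plain strings in place of Relations objects. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical trie keyed on the lower-cased first term of each relation.
final class TermTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so traversal yields alphabetical order.
        final Map<Character, Node> children = new TreeMap<>();
        // All relations whose first term ends at this node.
        final List<String> relations = new ArrayList<>();
        boolean isTerm;
    }

    private final Node root = new Node();

    // Stores a relation under its lower-cased first term.
    void insert(String firstTerm, String relation) {
        Node node = root;
        for (char c : firstTerm.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isTerm = true;
        node.relations.add(relation);
    }

    // Returns the relations stored for a term, ignoring case.
    List<String> relationsFor(String term) {
        Node node = find(term.toLowerCase());
        return node == null ? new ArrayList<>() : new ArrayList<>(node.relations);
    }

    // Returns all stored terms that start with the prefix, in alphabetical order.
    List<String> termsWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        String p = prefix.toLowerCase();
        Node node = find(p);
        if (node != null) {
            collect(node, new StringBuilder(p), out);
        }
        return out;
    }

    private Node find(String key) {
        Node node = root;
        for (char c : key.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isTerm) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }
}
```

Typing a fragment in the viewer would correspond to `termsWithPrefix`, and clicking a term to `relationsFor`.
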
Using the Viewer
• When you enter part of a term, it shows all terms starting with that fragment in the left list box.
• When you click on a term, it shows all its relations in the right list box.

Lexical Navigation
• Displays relations between terms graphically and allows you to explore them without formulating a specific query.

Possible enhancements
• Show only terms belonging to an ontology.
• Show only higher IQ terms
• Show the documents the relations occur in.
• Show the ontology reference.
• Show computed paths
• Show more kinds of named relations.
– Inhibits, expresses
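
One way the "computed paths" enhancement could work is a breadth-first search over the relation graph, which finds the shortest chain of relations linking two terms. The sketch below is an assumption about how this might be done, not something the slides describe: it treats relations as undirected, ignores weights, and uses hypothetical class and method names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical path finder over an undirected, unweighted relation graph.
final class RelationPaths {
    private final Map<String, Set<String>> neighbors = new HashMap<>();

    // Records a relation between two terms in both directions.
    void addRelation(String a, String b) {
        neighbors.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        neighbors.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Returns the shortest chain of terms from start to goal, or an empty list if none exists.
    List<String> shortestPath(String start, String goal) {
        Map<String, String> cameFrom = new HashMap<>(); // term -> predecessor on the path
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                // Walk predecessors back to the start, then reverse.
                List<String> path = new ArrayList<>();
                for (String t = goal; t != null; t = cameFrom.get(t)) {
                    path.add(t);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : neighbors.getOrDefault(current, Collections.emptySet())) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>();
    }
}
```

A path of three terms, A to B to C, corresponds to a second-order relation between A and C that no single document needs to state.
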
Evaluations of Information Visualization
• Few, if any, graphical displays have been evaluated thus far for effectiveness.
• Usability studies are hard to construct and carry out.
• Intuition suggests
– that exploration may result in discoveries.
– that relations more than one step apart are best displayed graphically.
• It remains to be shown that such visualizations are actually useful.

Differences in Intent
• Displays may represent information your system has discovered.
– Gene – protein relations
• Or they may represent data from which the user may discover new information.
– New 2nd or 3rd order relationships
• These are rather different applications of visualization technology.

Summary
• Java-based text mining system
• Database of terms and positions
• Computation of relations
• Export as XML
• Graphical relations viewer
• The value of such visual interfaces has not yet been established.

Acknowledgements
• Bhavani Iyer – XML export
• Eric Brown – DictMatcher hash code
• Daniel Tunkelang – graphical layout
• Bob Mack – paper suggestions
