[Figure: diagram labeled Data, Information, Wisdom, Knowledge]
An Evaluation of Text Mining Tools
as Applied to Selected Scientific
and Engineering Literature
Walter J. Trybula, Ph.D., IEEE Fellow
Ronald E. Wyllys, Ph.D.
ASIS 2000 – Chicago, Illinois
November 14, 2000
Introduction
• Data volume is growing and sources of
information are more diverse
• There is a need to evaluate this information
• There are tools that claim to be able to find
information in textbases
• An investigation of existing tools would
provide a measure of their ability.
• If such tools worked, it might be possible to
discover new knowledge. [1]
[1] Described by Swanson as “Undiscovered Public Knowledge.”
[email protected]
14 November 2000
Objective/Goals
• Provide a means of testing the existing
instruments to determine their ability to
“find” knowledge.
• Determine if any of these instruments
provide useful insight to the data.
• Evaluate the findings of domain experts to
determine if the instruments are helpful.
• Develop recommendations based on the
results of the experiments.
Overview of Process
• Selected a technical area with known
commonality (lithography masks).
• Collected the most recent reports available.
• Compiled the collected material into a textbase for
analysis by text mining tools.
• Had domain experts evaluate the results.
• Analyzed their conclusions and drew
recommendations for future directions.
Example of Commonality
Selection of Information
• Information from leading researchers was
collected.
– Asian efforts on X-ray technology.
– U.S. efforts on X-ray technology.
– European efforts on Ion Projection Lithography.
– U.S. efforts on Electron Projection Lithography.
– U.S. efforts on Extreme UltraViolet technology.
• The data consisted of each group's annual technology
progress update, prepared for yearly review.
• All reports, presentations, and data were
assembled into a single textbase for analysis.
Sources of Data
• Regions: U.S., Europe, Asia
• Concerns:
– Language
– Terminology
– Program (format)
Text Mining Tools
• Selected three types of Text Mining
Instruments available for desktop
operation.
– Key terms identified with pointers to text
– Excerpt presentation format
– Hierarchical tree-structure presentation
• Did not include Self-Organizing Maps (SOMs)
• Included a search engine for baseline
evaluation of the results (AltaVista).
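The first instrument type, key terms identified with pointers to text, can be pictured with a minimal Python sketch. The slides do not describe how the evaluated tools work internally, so everything here is an assumption made for illustration: the stopword list, the frequency-based ranking, the function names, and the two toy documents.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "on", "for", "and", "in", "to", "is", "are"}

def build_key_term_index(textbase):
    """Map each non-stopword term to a list of (doc_id, char_offset) pointers."""
    index = defaultdict(list)
    for doc_id, text in textbase.items():
        # Terms are runs of two or more lowercase letters or hyphens.
        for match in re.finditer(r"[a-z][a-z-]+", text.lower()):
            term = match.group()
            if term not in STOPWORDS:
                index[term].append((doc_id, match.start()))
    return index

def top_key_terms(index, n=3):
    """Return the n most frequent terms; ties are broken alphabetically."""
    return sorted(index, key=lambda t: (-len(index[t]), t))[:n]

# Toy textbase; the documents and their wording are invented for this sketch.
textbase = {
    "report_a": "Progress on x-ray mask fabrication and mask inspection.",
    "report_b": "Electron projection lithography requires a stencil mask.",
}

index = build_key_term_index(textbase)
key_terms = top_key_terms(index)

# Each key term is listed with pointers back into the source documents.
for term in key_terms:
    print(term, index[term])
```

Raw frequency is only a stand-in for whatever ranking a real instrument applies. The point of the sketch is the output shape: a short list of terms, each carrying pointers that a domain expert can follow back to the source text.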
Text Mining Tools
Text Mining Tool that returns Key Terms
Text Mining Tools
Text Mining Tool that returns Excerpts
Text Mining Tools
Text Mining Tool that returns Hierarchy
Results
• No method provided any novel results. There was
some difficulty with mixed-format documents.
• Domain experts were required to evaluate the
output and determine importance of delivered
information.
• Graphical information presentation was preferred
over simple text.
• The search engine provided many pointers to
occurrences of search terms.
• There was no evidence that this approach provided
any novel knowledge.
Conclusions
• Text Mining instruments are in a developmental
stage and need refinement to be more useful.
• Text Mining instruments must be able to handle
data in various formats, e.g., documents,
spreadsheets, and presentations.
• Without a defined goal of what data will be
delivered, there is no commonality among the
various instruments.
• Experts had difficulty retrieving information that
was known to be present, because of the methodology
used to evaluate information in the textbase.
• There must be a cohesive direction provided for the
development of these instruments.
Recommendations
Future Directions – Information Needs
• An Instrument that evaluates the text in the
textbase and provides an accurate representation
of the information contained therein.
• An Instrument that provides this information in a
manner that can be accurately and quickly
evaluated by the intended user.
• An Instrument that draws the best elements from
existing work and provides information based on
proven methodologies. (In rapidly evolving
technologies, efforts in one area may ignore
developments in others. This is not acceptable.)
Recommendations
Data Mining Process
Start with existing methodology.
Recommendations
Text Mining Process
Develop new methodology from existing ones.
Recommendations
Future Directions – Instrument Needs
• There needs to be a cohesive direction for future
work. The existing development must draw on the
knowledge developed in the Library Science field.
• Can build from Data Mining to derive Text Mining
functionality. A key concern will remain the method
of presenting the results.
• Need to have some agreement on the purpose of
the Text Mining Instruments:
– What is the purpose of “mining” text?
– What kind of user will there be?
– What is the anticipated outcome?
• Consider the application of the latest software
developments, e.g., Groove, Napster, for
information sharing.
Recommendations
Challenges
• Establish a “goal” for the results of Text Mining.
What will be accomplished?
• Drive toward widespread application, i.e., desktop
and handheld applications.
• Incorporate the latest hardware developments, e.g.,
distributed and parallel processing and wireless
communications.
• Deliver what the intended user needs.
• Don’t reinvent the “wheel”
– Have the Library Science, the Information Science, and the
Computer Science people work together.
Acknowledgements
• Dean Brooke Sheldon, Sanda Erdelez, Mary
Lynn Rice-Lively (GSLIS, University of
Texas at Austin).
• John Konopka of IBM.
• The International SEMATECH team
including Scott Mackay, Mark Mason, Phil
Seidel, David Stark.
• The various technology champions for their
efforts in providing the latest technology
information.