TextMOLE: Text Mining Operations
Library and Environment
Daniel B. Waegel
and
April Kontostathis, Ph.D.
Ursinus College
Collegeville PA
What?
Advanced application for indexing and
searching a text database.
Allows users to quickly analyze a corpus
of documents and determine which
parameters will provide maximal retrieval
performance.
Who?
Instructors - demonstrate information retrieval
concepts in the classroom
Students – hands-on exploration of concepts
often covered in an introductory course in
information retrieval or artificial intelligence
Researchers - ‘quick and dirty’ analysis of an
unfamiliar collection
Juniors and Seniors – capstone experiences in
computer science
Why?
Students unfamiliar with applications which require
manipulation of unstructured text
IR students develop basic IR systems, but do not have
time to implement and test a variety of parameters
Existing systems do not tightly integrate indexing and
retrieval functions
– R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
Addison Wesley/ACM Press, New York, 1999.
– R. K. Belew. Finding Out About. Cambridge University Press, 2000.
– G. Salton. The SMART Retrieval System–Experiments in Automatic
Document Processing. Prentice Hall, Englewood Cliffs, New Jersey,
1971.
Time! Students in AI do not even have time to
implement a basic IR system.
How?
Overview of the Application
– Indexing
– Single Query Retrieval
– Multiple Query Retrieval
Sample Assignments
– Artificial Intelligence
– Information Retrieval
– Capstone Projects
Indexing
Single Query Specification
Single Query Results
Multiple Query Specification
Multiple Query Results
Information Retrieval Course
Assignment 2
– Assumes Assignment 1 had students develop
their own rudimentary IR systems
– Using a corpus provided by the instructor or
developed by the student (min. 100 documents)
Convert to XML format
Parse with TextMOLE
Identify a set of standard queries for the collection (truth set
not necessary)
Vary parameters (stemming vs. no stemming, various
weighting schemes, various stop lists)
Decide which set of parameters works best for your collection.
Write a paper describing your experiments and the results;
be sure to defend your conclusions!
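The parameter variations in this assignment (stemming vs. no stemming, stop list vs. no stop list) can be pictured with a small sketch. This is not TextMOLE code: the function names, the tiny stop list, and the crude suffix stripper are all assumptions made for illustration, and a real experiment would use a proper stemmer such as Porter's.

```python
import itertools
import re

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text, use_stop_list, use_stemming):
    """Turn raw text into index terms under one parameter setting."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if use_stop_list:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def parameter_grid():
    """All (use_stop_list, use_stemming) combinations to try."""
    return list(itertools.product([False, True], repeat=2))


if __name__ == "__main__":
    sample = "The indexing of documents"
    for use_stop_list, use_stemming in parameter_grid():
        print(use_stop_list, use_stemming,
              index_terms(sample, use_stop_list, use_stemming))
```

Running each query against every setting in the grid gives students a table of results to compare and defend in the paper.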
Information Retrieval Course
Assignment 3 or 4
– Using the corpus from the previous assignment
(minimum of 100 documents)
– Develop a set of standard queries
– Determine which documents are truly relevant to
these queries (involves lots of reading and frustration)
– Use the Multiple Query function of TextMOLE to
determine precision and recall
Alternate
– Use one or more of the Gold Standard Collections
that have a set of standard queries with truth sets
(TextMOLE can convert them to XML format)
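For students checking the precision and recall figures by hand, the standard set-based definitions can be sketched as follows. The function name and the document identifiers are hypothetical; how TextMOLE itself computes or reports these values is not described here.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids in the truth set for the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty result lists and empty truth sets.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical query: four documents retrieved, three truly relevant.
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this example two of the four retrieved documents are relevant, and two of the three relevant documents were found.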
Artificial Intelligence Course
IR Assignment
– Instructor provides a set of documents in XML format
and a set of standard queries (with or without result set)
– Instructor provides students with parameters to use
(e.g., stemming, log-entropy weighting for both
indexing and retrieval)
– Students try to find the ‘best’ stop word list for this
collection
– Write brief paper describing experiments and results
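As background for the log entropy weighting mentioned above, here is a minimal sketch of one common formulation: local weight log(1 + tf) multiplied by a global weight 1 + sum_j(p_ij log p_ij) / log(n), where p_ij is the share of a term's occurrences that fall in document j and n is the number of documents. The exact variant TextMOLE uses is not stated in these slides, so treat the formula choice and the function name as assumptions.

```python
import math


def log_entropy_weights(term_doc_counts, num_docs):
    """Compute log-entropy weights.

    term_doc_counts: dict mapping term -> {doc_id: raw count}
    num_docs:        total number of documents in the collection
    Returns a dict mapping term -> {doc_id: weight}.
    """
    weights = {}
    for term, counts in term_doc_counts.items():
        global_freq = sum(counts.values())
        entropy_sum = 0.0
        for tf in counts.values():
            p = tf / global_freq
            entropy_sum += p * math.log(p)
        # With a single document the entropy term is undefined; use 1.0.
        if num_docs > 1:
            global_weight = 1.0 + entropy_sum / math.log(num_docs)
        else:
            global_weight = 1.0
        weights[term] = {
            doc: math.log(1 + tf) * global_weight
            for doc, tf in counts.items()
        }
    return weights


if __name__ == "__main__":
    # Hypothetical four-document collection.
    counts = {
        "the": {0: 1, 1: 1, 2: 1, 3: 1},  # spread evenly: weight near 0
        "mole": {2: 3},                   # concentrated: global weight 1
    }
    print(log_entropy_weights(counts, num_docs=4))
```

A term spread evenly over every document gets a global weight near zero, which is why this scheme interacts with the choice of stop word list that students are asked to explore.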
Capstone Experiences in
Computer Science
Migrate TextMOLE to another
platform
– Open GL
– Java
– Web based
– Relational Database
– Library Functions
Add additional parameters to
basic Search and Retrieval
– N-grams instead of words
– Noun phrases (using a tool
like flex)
– Clustering
– Latent Semantic Indexing
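As a starting point for the "N-grams instead of words" project idea, the sketch below extracts character n-grams from text. It is an illustration only: the function name is hypothetical, and whether a capstone project would use character or word n-grams is left open by the slide.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of text, in order."""
    if n <= 0:
        raise ValueError("n must be positive")
    # An input shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("mole", 3))
```

Indexing these fragments in place of whole words is one way to make matching tolerant of spelling variants without a stemmer.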
Add additional IR applications
– Emerging trend detection
– Classification
– First Story Detection
– Filtering
– Summarization
Research in Computer
Science
– Develop your own weighting
scheme
– Identify additional features for
indexing
– Develop a new Gold Standard
collection
Where?
Version 1.0 now available online!
http://webpages.ursinus.edu/akontostathis/TextMOLE
Contact [email protected] with
questions and comments