Text Mining with D2K/T2K
July 9, 2004
Duane Searsmith
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
Office: (217) 244-9129
http://alg.ncsa.uiuc.edu
Michael Welge, Director
Loretta Auvil, Project Manager, (217) 265-8021
Outline
• Text Mining Brief Intro
• Unsupervised
• Supervised
• Information Extraction
•…
• ALG Technology Pieces
• Demonstrations
• Discussion
alg | Automated Learning Group
What is text mining?
• In simplified, practical terms, it is the extraction of a relatively small amount of information of interest from a massive amount of text data.
But …
• You might not know what you’re looking for.
  • Discovering patterns in the haystack. (clustering, mining associations)
• How to recognize a needle.
  • Sifting through the haystack. (model building, supervised learning)
• Just the facts please.
  • Enumerating the make and model of every needle. (information extraction)
Common Tasks for Text Mining & Analysis
• Information retrieval
• Automatic grouping (clustering) of documents
• (Active) Classification
• Information extraction
• Topic detection and tracking
• Automatic summarization
• “Understanding” text and question answering
• Machine Translation
Text Preprocessing
• Preprocessing (Text -> Numeric Representation)
• Tokenization
• Sentence Splitting
• Part-of-Speech Tagging
• Term Normalization (Stemming)
• Filtering (Stops)
• Chunking
• Term Extraction
• Filtering (Again)
• Term Weighting
• Other Transformations
• Resource Taxing
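The preprocessing chain above (tokenization, stop filtering, stemming, term weighting) can be sketched in a few lines of Python. This is a minimal illustration and not T2K code: the tiny stop list, the crude suffix-stripping stemmer, and the function names `tokenize`, `stem`, `preprocess`, and `tfidf` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real stop lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, filter stop words, then normalize the remaining terms."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def tfidf(docs):
    """Weight terms by tf * log(N / df); one {term: weight} dict per document."""
    term_lists = [preprocess(d) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in term_lists for term in set(terms))
    n = len(docs)
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(terms).items()}
        for terms in term_lists
    ]
```

A term that occurs in every document receives weight zero under this scheme, which is one reason a second filtering pass after term extraction can be useful.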
Clustering: Document Self-Organization
• Agglomerative (bottom up)
• Quadratic time complexity
• Sampling
• Random
• Partition
• Hard vs. Soft
• Unsupervised method
Figure labels: strongly similar arcs kept; weakly similar arcs broken.
• Basic notion to all of these approaches is some heuristic for
measuring similarity between documents and document
groups (term co-occurrence)
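The bottom-up approach can be illustrated with a small single-link sketch in Python. It assumes cosine similarity over raw term counts as the similarity heuristic and a hypothetical `threshold` parameter that decides when a link is too weak to merge; neither choice is taken from the slides.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def agglomerate(docs, threshold=0.3):
    """Single-link, bottom-up clustering.

    Starts with one cluster per document and repeatedly merges the two most
    similar clusters until no pair is at least `threshold` similar. Comparing
    all pairs makes the cost at least quadratic in the number of documents.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, pair = max(
            (link(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break  # only weakly similar pairs remain; leave them unmerged
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Because every document is compared with every other, running a method like this on a sample of a large collection is a natural way to keep the cost manageable.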
How to Recognize a Needle
• To classify your data you often need to build a model.
• To build a model you typically need examples from a “teacher” – metaphorically speaking.
• Finding good examples can be hard.
• T2K can also use active learning to help find good examples faster, making model building easier.
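One common form of active learning is uncertainty sampling: the current model picks the unlabeled example it is least sure about and asks the teacher to label that one next. The slides do not say which strategy T2K uses, so the sketch below is only an illustration of the general idea; the centroid-based scoring and the names `margin` and `next_query` are assumptions.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid(texts):
    """Sum of the term-count vectors of the given texts."""
    total = Counter()
    for t in texts:
        total.update(vectorize(t))
    return total

def margin(text, labeled):
    """Positive values lean toward label True, negative toward label False."""
    pos = centroid(t for t, label in labeled if label)
    neg = centroid(t for t, label in labeled if not label)
    v = vectorize(text)
    return cosine(v, pos) - cosine(v, neg)

def next_query(pool, labeled):
    """Pick the unlabeled text the current model is least sure about."""
    return min(pool, key=lambda text: abs(margin(text, labeled)))
```

For example, with `labeled = [("stock market prices", True), ("football match score", False)]`, a pool entry such as "market score report" overlaps both classes equally, so it is the one the teacher would be asked to label.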
Pattern Mining
• Finding frequent item sets -> Rule Discovery
• Many methods: Apriori, Charm, FPGrowth, CLOSET
• Working with Jiawei Han and students Hwanjo Yu and Xiaolei Li
• Application: topic tree construction
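Of the methods listed, Apriori is the simplest to sketch. The Python below is a minimal, unoptimized illustration of level-wise frequent item set mining, treating each document as a set of terms; the function name `apriori` and the `min_support` parameter (an absolute count) are assumptions for the example and not part of T2K.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate, keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    current = count({frozenset([item]) for t in transactions for item in t})
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent
```

Rules then follow from the support counts: the confidence of a rule A -> B is the support of A ∪ B divided by the support of A.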
Just the Facts Please
• Finding a document that has the
information you need is often not the end
goal.
• To extract information you must first
recognize it – you need to build a model,
and that means you need to have
examples.
• Levels of IE: What’s hard and what’s
harder?
D2K
D2K Overview
D2K Features
• Extension of existing API
  • Provides the capability to programmatically connect modules and set properties.
  • Allows D2K-driven applications to be developed.
  • Provides ability to pause and restart an itinerary.
• Enhanced Distributed Computing
  • Allows modules that are re-entrant to be executed remotely.
  • Uses Jini services to look up distributed resources.
  • Includes interface for specifying the runtime layout of a distributed itinerary.
• Processor Status Overlay
  • Shows utilization of distributed computing resources.
• Distributed Checkpointing
• Resource Manager
  • Provides a mechanism for treating selected data structures as if they were stored in global memory.
  • Provides memory space that is accessible from multiple modules running locally as well as remotely.
• Batch Processing / Web Services
D2K/T2K/I2K - Data, Text, and Image Analysis
Information Visualization
The Technology Pieces
• The Engine (distributed, parallelized, persistent)
• Core Modules (building blocks)
• T2K is a specialized set of modules for text analysis
• I2K is a specialized set of modules for image analysis
• D2K Toolkit (rapid development environment)
• ThemeWeaver is an independent application that uses the D2K engine to run algorithms constructed from T2K modules. It is a demonstration platform.
• Other D2K driven applications (StreamLined, EMO, …)
Diagram labels: Applications, Toolkit, Core Modules (T2K, I2K), D2K Engine
T2K Core 1.0 (Beta)
• Tokenization
• POS Tagging
• Stemming
• Chunking
• Filters
• Term Weighting
• Supervised / Unsupervised Learning
• GATE Integration
• Pattern Mining
• Text Streams
• Summarization
ThemeWeaver
ThemeWeaver: Prototype Text Clustering Application
• Hard clustering algorithms
  • Modified K-means (3 sampling methods)
• Soft clustering
  • Suffix tree based algorithm
  • Can be used for longer documents
• Visualizations
  • “Single link” graph representation
  • Dendrogram cluster tree
  • Clusters over time
• Drill down and backtrack UI
• D2K/T2K Driven
The ALG Team
Staff
Loretta Auvil
Peter Bajcsy
Colleen Bushell
Dora Cai
David Clutter
Lisa Gatzke
Vered Goren
Chris Navarro
Greg Pape
Tom Redman
Duane Searsmith
Andrew Shirk
Anca Suvaiala
David Tcheng
Michael Welge
Students
Tyler Alumbaugh
Bradley Berkin
Jacob Biehl
John Cassel
Peter Groves
Olubanji Iyun
Sang-Chul Lee
Young-Jin Lee
Xiaolei Li
Brian Navarro
Scott Ramon
Sunayana Saha
Martin Urban
Bei Yu
Hwanjo Yu
* Demo / Discussion *