Text Based Information Retrieval


Text Based Information
Retrieval - Text Mining
PKB - Antonie
Background
• Humans have difficulty processing huge amounts of information
• Computers do better with mathematics
– why not also use computers to process huge
amounts of information?
• A large text collection to search, e.g.:
– Terrorist attacks in 1995?
– The relation between terrorist movements and bombs?
• Relates to Information Retrieval, Data Mining,
and Text Mining
Terminology
• Data Mining
A step in the knowledge discovery process consisting of
particular algorithms (methods) that produces a particular
enumeration of patterns (models) over the data.
• Data Mining is a process of discovering advantageous patterns
in data.
• Knowledge Discovery Process
The process of using data mining methods (algorithms) to
extract (identify) what is knowledge according to the
specifications of measures and thresholds, using a database
along with any necessary preprocessing or transformations.
What kind of data in Data Mining?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems
– Object-Relational
– Multimedia
– Text
– Heterogeneous and Distributed
– WWW
Data Mining Applications:
• Market analysis
• Risk analysis and
management
• Fraud detection and
detection of unusual patterns
(outliers)
• Text mining (news group,
email, documents) and Web
mining
• Stream data mining
Knowledge Discovery
Required effort for each KDD Step
• Arrows indicate the direction in which we hope the effort will shift.
What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual
documents in various ways, including …discovery of patterns and
trends in data, associations among entities, predictive rules, etc.”
(Grobelnik et al., 2001)
“Another way to view text data mining is as a process of exploratory data
analysis that leads to heretofore unknown information, or to answers
for questions for which the answer is not currently known.” (Hearst,
1999)
“The non-trivial extraction of implicit, previously unknown, and
potentially useful information from (large amounts of) textual data.”
An exploration and analysis of textual (natural-language) data by
automatic and semi-automatic means to discover new knowledge.
Text Mining (2)
• What is “previously unknown” information?
– Strict definition
• Information that not even the writer knows.
– Lenient definition
• Rediscovering the information that the author
encoded in the text
• e.g., Automatically extracting a product’s name
from a web-page.
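The lenient sense above can be made concrete with a toy extractor. The HTML snippet, the `product` class name, and the regex are all hypothetical; real extractors use proper HTML parsers rather than regular expressions.

```python
import re

# Toy illustration of the "lenient" definition: re-extracting information
# that the author already encoded in the page markup.
# The page content and the class name "product" are invented for this sketch.
html = '<html><body><h1 class="product">Acme Toaster 3000</h1></body></html>'

match = re.search(r'<h1 class="product">(.*?)</h1>', html)
print(match.group(1))  # Acme Toaster 3000
```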
Text Mining Methods
• Information Retrieval
– Indexing and retrieval of textual documents
• Information Extraction
– Extraction of partial knowledge in the text
• Web Mining
– Indexing and retrieval of textual documents and
extraction of partial knowledge using the web
• Clustering
– Generating collections of similar text documents
Text Mining Application
• Email: Spam filtering
• News Feeds: Discover what is interesting
• Medical: Identify relationships and link
information from different medical fields
• Marketing: Discover distinct groups of potential
buyers and make suggestions for other products
• Industry: Identifying groups of competitors' web
pages
• Job Seeking: Identify parameters in searching
for jobs
Information Retrieval (1)
• Given:
– A source of textual documents
– A well-defined, limited query (text based)
• Find:
– Sentences with relevant information
– Extract the relevant information and
ignore non-relevant information (important!)
– Link related information and output in a predetermined format
• Examples: news stories, e-mails, web pages,
photographs, music, statistical data, biomedical data, etc.
• Information items can be in the form of text, image,
video, audio, numbers, etc.
Information Retrieval (2)
• Two basic information retrieval (IR) processes:
– Browsing or navigation system
• The user skims the document collection by jumping from
one document to another via hypertext or
hypermedia links until a relevant document is found
– Classical IR system: question answering
system
• Query: question in natural language
• Answer: directly extracted from text of document
collection
• Text Based Information Retrieval:
– Information item (document) :
• Text format (written/spoken) or has textual description
– Information need (query):
• Usually in text format
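The text-based retrieval setting above can be sketched in a few lines. The corpus and the ranking rule (count of shared query words) are illustrative assumptions; production IR systems use inverted indexes and schemes such as TF-IDF or BM25.

```python
# Minimal sketch of text-based retrieval: a text query against a toy
# corpus of textual documents (hypothetical data, illustrative ranking).
documents = {
    "d1": "terrorist attack reported in 1995",
    "d2": "the stock market rallied today",
    "d3": "bomb linked to terrorist movement",
}

def retrieve(query, docs):
    """Return ids of documents sharing at least one query word,
    ranked by how many distinct query words they contain."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.lower().split())), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)
            if score > 0]

print(retrieve("terrorist bomb", documents))  # ['d3', 'd1']
```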
Classical IR System Process
Intelligent Information Retrieval
• meaning of words
– Synonyms “buy” / “purchase”
– Ambiguity “bat” (baseball vs. mammal)
• order of words in the query
– hot dog stand in the amusement park
– hot amusement stand in the dog park
Why Mine the Web?
• Enormous wealth of textual information on the Web.
– Book/CD/Video stores (e.g., Amazon)
– Restaurant information (e.g., Zagats)
– Car prices (e.g., Carpoint)
• Lots of data on user access patterns
– Web logs contain sequence of URLs accessed by users
• Possible to retrieve “previously unknown”
information
– People who ski also frequently break their leg.
– Restaurants that serve seafood in California are likely to be
outside San Francisco
Mining the Web
[Diagram: the Web is crawled by a Spider into a document source; a
user Query goes to the IR / IE System, which returns ranked
documents (1. Doc1, 2. Doc2, 3. Doc3, …).]
What is Web Clustering?
• Given:
– A source of textual documents
– A similarity measure
• e.g., how many words are common
in these documents
• Find:
– Several clusters of documents
that are relevant to each other
[Diagram: the document source and the similarity measure feed a
Clustering System, which outputs groups of similar documents.]
Text characteristics
• Large textual databases
– Efficiency considerations
• over 2,000,000,000 web pages
• almost all publications are also in electronic form
• High dimensionality (sparse input)
– Consider each word/phrase as a dimension
• Dependency
– relevant information is a complex conjunction of
words/phrases
• e.g., document categorization, pronoun disambiguation
Text characteristics
• Ambiguity
– Word ambiguity
• Pronouns (he, she …)
• “buy”, “purchase”
– Semantic ambiguity
• The king saw the rabbit with his glasses. (how many meanings?)
• Noisy data
• Example: Spelling mistakes
• Not well-structured text
– Chat rooms
• “r u available ?”
• “Hey whazzzzzz up”
– Speech
Text mining process
• Text preprocessing
– Syntactic/Semantic
text analysis
• Feature Generation
– Bag of words
• Feature Selection
– Simple counting
– Statistics
• Text/Data Mining
– Classification (supervised
learning)
– Clustering (unsupervised
learning)
• Analyzing results
Syntactic / Semantic Text
Analysis
• Part-of-speech (POS) tagging
• Find the corresponding POS for each word
e.g., John (noun) gave (verb) the (det) ball (noun)
• Word sense disambiguation
• Context-based or proximity-based
• Very accurate
• Parsing
• Generates a parse tree (graph) for each sentence
• Each sentence is a stand-alone graph
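The POS-tagging example on this slide can be mimicked with a toy lexicon lookup. The lexicon is invented for illustration; real taggers are statistical (e.g., HMM- or neural-network-based) and handle unknown words and ambiguity.

```python
# Toy POS tagger using a hand-built lexicon, just to make the slide's
# example concrete. LEXICON is a hypothetical, hand-written mapping.
LEXICON = {"john": "noun", "gave": "verb", "the": "det", "ball": "noun"}

def pos_tag(sentence):
    """Tag each word with its lexicon entry, or 'unknown' if absent."""
    return [(w, LEXICON.get(w.lower(), "unknown")) for w in sentence.split()]

print(pos_tag("John gave the ball"))
# [('John', 'noun'), ('gave', 'verb'), ('the', 'det'), ('ball', 'noun')]
```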
Feature Generation: Bag of Words
• A text document is represented by the words it contains
(and their occurrences)
– e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
– Highly efficient
– Makes learning far simpler and easier
– Order of words is not that important for certain applications
• Stemming: identifies a word by its root
– Reduces dimensionality
– e.g., flying, flew → fly
– Use the Porter algorithm
• Stop words: the most common words are unlikely to help
text mining
– e.g., “the”, “a”, “an”, “you” …
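The three steps above (bag of words, stemming, stop-word removal) can be sketched together. The stop-word list is a tiny sample, and the suffix-stripping "stemmer" is a deliberate simplification of the Porter algorithm the slide mentions.

```python
from collections import Counter

# Minimal bag-of-words builder with stop-word removal and a crude
# suffix-stripping stemmer (simplified stand-in for the Porter algorithm;
# it only strips -ing/-s, for illustration).
STOP_WORDS = {"the", "a", "an", "you", "of"}

def stem(word):
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    # Lowercase, drop stop words, stem, and count occurrences.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(stem(t) for t in tokens)

print(bag_of_words("the lord of the rings"))  # Counter({'lord': 1, 'ring': 1})
```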
Feature selection
• Reduce dimensionality
– Learners have difficulty addressing tasks with
high dimensionality
• Irrelevant features
– Not all features help!
• e.g., the existence of a noun in a news article is
unlikely to help classify it as “politics” or “sport”
• Use term weighting
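The slide calls for weighting without naming a scheme; TF-IDF is the standard choice, sketched below on an invented toy corpus. Words that occur in every document get weight 0 and can be dropped as uninformative features.

```python
import math

# Minimal TF-IDF sketch (toy corpus; TF-IDF is one common weighting
# scheme, chosen here for illustration).
docs = [["england", "football", "fan"],
        ["italy", "football", "game"],
        ["salesman", "earns", "money"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)          # term frequency in this doc
    df = sum(1 for d in corpus if term in d) # document frequency
    idf = math.log(len(corpus) / df)         # rare terms score higher
    return tf * idf

print(tf_idf("football", docs[0], docs))  # common term -> lower weight
print(tf_idf("england", docs[0], docs))   # rarer term -> higher weight
```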
Text Mining: Classification
definition
• Given: a collection of labeled records (training set)
– Each record contains a set of features (attributes), and
the true class (label)
• Find: a model for the class as a function of the
values of the features
• Goal: previously unseen records should be
assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it
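The slides do not fix a particular model; as one concrete instance of "a model for the class as a function of the features", here is a minimal Naive Bayes text classifier trained on labeled records and applied to an unseen one. The toy records are invented, loosely modeled on the hooligan example later in the deck.

```python
from collections import Counter, defaultdict
import math

# Toy labeled training set (invented for illustration).
train = [("an english football fan", "yes"),
         ("italian football fans were cheering", "yes"),
         ("an average salesman earns 75k", "no"),
         ("the quarterly sales report", "no")]

def fit(records):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in records:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def predict(text, word_counts, class_counts):
    """Pick the class with the highest log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_lp = None, float("-inf")
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        lp = math.log(n_docs / sum(class_counts.values()))
        for w in text.split():
            # Laplace smoothing so unseen words do not zero the probability
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

model = fit(train)
print(predict("a danish football fan", *model))  # yes
```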
Text Mining: Clustering definition
• Given: a set of documents and a similarity
measure among documents
• Find: clusters such that:
– Documents in one cluster are more similar to one
another
– Documents in separate clusters are less similar to one
another
• Goal:
– Finding a meaningful grouping of the documents
Similarity Measures:
• Euclidean distance, if attributes are continuous
• Other problem-specific measures
– e.g., how many words are common in these documents
Supervised vs. Unsupervised
Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
Evaluation: What Is Good
Classification?
• Correct classification: The known label
of a test sample is identical to the class
predicted by the classification model
• Accuracy ratio: the percentage of test
set samples that are correctly classified
by the model
• A distance measure between classes
can be used
– e.g., classifying “football” document as a
“basketball” document is not as bad as
classifying it as “crime”.
Evaluation: What Is Good
Clustering?
• Good clustering method: produce high
quality clusters with . . .
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering method is also
measured by its ability to discover some or
all of the hidden patterns
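The quality criterion above can be checked numerically: average intra-cluster similarity should exceed average inter-cluster similarity. Documents are modeled as word sets and similarity as shared-word count; the two clusters are invented toy data.

```python
# Sketch of the clustering quality criterion: high intra-cluster,
# low inter-cluster similarity (toy word-set documents).
def sim(d1, d2):
    return len(d1 & d2)  # shared-word count as similarity

def avg_pairwise_sim(docs_a, docs_b):
    pairs = [(a, b) for a in docs_a for b in docs_b if a is not b]
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

sport = [{"football", "game", "fan"}, {"football", "league", "game"}]
finance = [{"stock", "market", "sales"}, {"sales", "report", "market"}]

intra = avg_pairwise_sim(sport, sport)
inter = avg_pairwise_sim(sport, finance)
print(intra, inter)  # intra-cluster similarity exceeds inter-cluster
```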
Text Classification: An Example
Training set:
Ex#  Document                                            Hooligan
1    An English football fan …                           Yes
2    During a game in Italy …                            Yes
3    England has been beating France …                   Yes
4    Italian football fans were cheering …               No
5    An average USA salesman earns 75K                   No
6    The game in London was horrific                     Yes
7    Manchester city is likely to win the championship   Yes
8    Rome is taking the lead in the football league      Yes
Test set:
9    A Danish football fan                               ?
10   Turkey is playing vs. France. The Turkish fans …    ?
The training set is used to learn a classifier model, which is then
applied to the test set.
Decision Tree: A Text Example
[Figure: the training table from the previous slide shown next to a
decision tree. Internal nodes test splitting attributes (e.g., MarSt:
Single, Divorced vs. Married; Income: > 80K vs. < 80K); branches are
the test outcomes; leaves carry the class labels YES / NO.]
The splitting attribute at a node is
determined based on a specific
attribute selection algorithm.
Classification by DT Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases:
– Tree construction
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown
sample
– Test the attributes of the sample against the decision
tree
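The "classify an unknown sample" step can be sketched with a hand-built tree in exactly the flow-chart form described above: internal nodes test an attribute, branches are outcomes, leaves are class labels. The attributes (`contains_football`, `mentions_fans`) are invented for illustration, not taken from the slides' figure.

```python
# A hand-built decision tree: dicts are internal nodes, strings are leaves.
# Attribute names and the tree shape are hypothetical, for illustration.
tree = {
    "attribute": "contains_football",
    "branches": {
        True: {"attribute": "mentions_fans",
               "branches": {True: "hooligan", False: "not_hooligan"}},
        False: "not_hooligan",
    },
}

def classify(sample, node):
    # Walk from the root, testing one attribute per internal node,
    # until a leaf (a plain class label) is reached.
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node

doc = {"contains_football": True, "mentions_fans": True}
print(classify(doc, tree))  # hooligan
```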
Summary
• Text is tricky to process, but “ok” results are easily
achieved
• There exist several text mining systems
– e.g., D2K - Data to Knowledge
– http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
• Additional Intelligence can be integrated with text
mining
– One may play with any phase of the text mining process
Summary
• There are many other scientific and statistical text mining
methods developed but not covered in this talk.
– http://www.cs.utexas.edu/users/pebronia/text-mining/
– http://filebox.vt.edu/users/wfan/text_mining.html
• Also, it is important to study theoretical foundations of
data mining.
– Data Mining: Concepts and Techniques, J. Han & M. Kamber
– Machine Learning, T. Mitchell