Text Mining - Index of

Text-Mining: analysis
of text data
Dunja Mladenić
J.Stefan Institute, Ljubljana, Slovenia
and Carnegie Mellon University, USA
http://www-ai.ijs.si/DunjaMladenic/
http://www.cs.cmu.edu/~dunja/
Web user profiling
• imagine a user browsing the Web,
mostly by clicking hyperlinks
• goal: help the user by highlighting the
hyperlinks they are likely to click (we assume that
the user clicks on interesting hyperlinks)
– induce a profile for each user separately
– the profile can be used to predict clicking
on hyperlinks (in our case), to collect
interesting Web-pages, to compare
different users and share knowledge
between them (collaborative agents)
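A minimal sketch of inducing a per-user profile from clicked and ignored hyperlinks. It assumes a naive Bayes model over the words of the hyperlink text; the slides do not say which learner Personal WebWatcher uses, and the names `ClickProfile`, `observe` and `click_score` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption, kept deliberately simple)."""
    return text.lower().split()


class ClickProfile:
    """Hypothetical per-user profile: naive Bayes over hyperlink text."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.examples = {True: 0, False: 0}

    def observe(self, link_text, clicked):
        """Record one hyperlink the user saw and whether it was clicked."""
        self.counts[clicked].update(tokenize(link_text))
        self.examples[clicked] += 1

    def click_score(self, link_text):
        """Log-odds that the link gets clicked; above 0 suggests highlighting it."""
        vocab = set(self.counts[True]) | set(self.counts[False])
        score = math.log((self.examples[True] + 1) / (self.examples[False] + 1))
        for word in tokenize(link_text):
            for label, sign in ((True, 1), (False, -1)):
                total = sum(self.counts[label].values())
                # Laplace smoothing so unseen words do not zero out the score.
                p = (self.counts[label][word] + 1) / (total + len(vocab) + 1)
                score += sign * math.log(p)
        return score
```

One profile object per user keeps the profiles separate, as the slide requires; comparing users would then mean comparing their word counts.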
Structure of the personal browsing
assistant - Personal WebWatcher
[Diagram: User → URL → Personal WebWatcher proxy (adviser, with user profile) → URL → The Web; page → proxy → modified page → User]
Personal WebWatcher in action (1996)
[Screenshot: interesting hyperlinks highlighted]
Data Pyramid
• Wisdom: knowledge plus experience
• Knowledge: information plus rules
• Information: data plus context
• Data
What is Data Mining?
• Data mining (knowledge discovery in
databases - KDD, business
intelligence):
– finding interesting (non-trivial, hidden,
previously unknown and potentially
useful) regularities in large datasets
• “Say something interesting about the data.”
• “Describe this data.”
Data Mining: Potential usage
• Market analysis
• Risk analysis
• Fraud detection
• Text Mining
• Web Mining
• ...
Why text analysis?
• The amount of text data on
electronic media is growing daily
– e-mail, business documents, the Web,
organized databases of documents,...
• There is a lot of information
contained in the text
• Available methods and approaches
make it possible to solve interesting,
non-trivial problems
Problem description (I)
• Text information filtering
• Help with browsing the Web
• Generation and analysis of user profiles
• Automatic document categorization and
keyword assignment to documents
• Document clustering
• Document visualization
• Document authorship detection
• Document copying identification
• Language identification in text
Document categorization
[Diagram: labeled documents → ??? → Document Classifier; unlabeled document → Document Classifier → document category (label)]
[Screenshot: Yahoo! page for one category]
Automatic document
categorization
• Problem: we are given a set of content categories
filled with documents.
• Goal: automatically insert a new
document (assign one or more relevant
categories to it).
• Content categories can be structured (e.g.,
Yahoo, Medline) or unstructured (e.g.,
Reuters)
• The problem is similar to assigning
keywords to documents
Document to categorize:
CFP for CoNLL-2000
Some
predicted
categories
Our approach to document
categorization
• Data is obtained from an existing collection
of manually categorized documents whose
content categories are structured
• Using Text Mining methods, we constructed
a model that captures the manual work of
the editors
• The model is used to automatically assign
content categories and the corresponding
keywords to new, previously unseen
documents
System architecture
[Diagram: labeled documents from the Web (Yahoo! hierarchy) pass through feature construction (vectors of n-grams), subproblem definition, feature selection and classifier construction; the resulting Document Classifier maps an unlabeled document to a document category (label)]
Summary of experiments
and results
• learning from categorization hierarchy:
considering only promising categories during
the classification (5%-15% of categories)
• extended document representation:
new features for sequences of two words
• feature subset selection: Odds ratio using
50-100 best features (0.2%-5%)
• More can be found at our project page:
www.cs.cmu.edu/~TextLearning/pww/yplanet.html
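A rough sketch of two of the ingredients listed above: features for sequences of two words, and feature subset selection by Odds ratio. The smoothing constants and the function names (`features`, `odds_ratio_scores`, `select_features`) are assumptions made for illustration, not the project's actual code.

```python
import math


def features(text):
    """Single words plus sequences of two adjacent words."""
    words = text.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return set(words + pairs)


def odds_ratio_scores(pos_docs, neg_docs):
    """Log odds ratio of each feature for the positive category."""
    pos_sets = [features(d) for d in pos_docs]
    neg_sets = [features(d) for d in neg_docs]
    vocab = set().union(*pos_sets, *neg_sets)
    scores = {}
    for f in vocab:
        # Smoothed document frequencies keep both probabilities away from 0 and 1.
        p_pos = (sum(f in s for s in pos_sets) + 0.5) / (len(pos_sets) + 1)
        p_neg = (sum(f in s for s in neg_sets) + 0.5) / (len(neg_sets) + 1)
        scores[f] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_features(pos_docs, neg_docs, k):
    """Keep the k best-scoring features (ties broken alphabetically)."""
    scores = odds_ratio_scores(pos_docs, neg_docs)
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]
```

With k in the range of 50-100 this corresponds to the small feature subsets mentioned on the slide; a classifier would then be trained on the selected features only.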
Document authorship
detection
• Problem: based on a database of
documents and authors, assign the
most probable author to a new
document
• Solution is based on the fact that
each author uses a characteristic
frequency distribution over words and
phrases
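The idea of a characteristic frequency distribution can be sketched as follows. This assumes word-level relative frequencies compared by cosine similarity, which is one simple choice among many; the function names and the toy data layout are hypothetical.

```python
import math
from collections import Counter


def profile(text):
    """Relative frequency of each word in the text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def most_probable_author(known_texts, new_text):
    """known_texts maps an author name to a list of that author's documents."""
    target = profile(new_text)
    profiles = {a: profile(" ".join(texts)) for a, texts in known_texts.items()}
    return max(profiles, key=lambda a: cosine(profiles[a], target))
```

Phrases could be handled the same way by adding word sequences to the profile.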
Document copying
identification
• Problem: predict the probability that a
given document was copied (partially
or completely) from some other
document(s) in our database
• The algorithm uses complex indexing
methods on parts of documents (of
different lengths) and compares them
against the given document
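The slides do not describe the indexing method itself, so the following is only a stand-in: an inverted index over word shingles (overlapping runs of n words), with the fraction of matched shingles used as a crude overlap score rather than a calibrated probability. All names are hypothetical.

```python
from collections import defaultdict


def shingles(text, n):
    """Set of all runs of n consecutive words in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(database, n=3):
    """database maps doc_id -> text; returns shingle -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in database.items():
        for s in shingles(text, n):
            index[s].add(doc_id)
    return index


def copy_scores(index, text, n=3):
    """Fraction of the given document's shingles found in each indexed document."""
    parts = shingles(text, n)
    hits = defaultdict(int)
    for s in parts:
        for doc_id in index.get(s, ()):
            hits[doc_id] += 1
    return {doc_id: count / len(parts) for doc_id, count in hits.items()}
```

Building several indexes with different values of n would approximate the "different length" parts mentioned above.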
Natural language
identification
• Text data analysis systems commonly
use some language-dependent
methods
• so the natural language a document is
written in has to be identified first
• Problem: for a given text, identify the
natural language it is written in,
selecting among the predefined
languages
Algorithm for natural
language identification
• The basic algorithm is simple: for each
language, build a characteristic frequency
table of letter pairs and triples, which
can then be used to identify the language
of a document (TextCat, a publicly available
system, covers 60 languages)
• Short documents are a problem; in this
case we can detect language-dependent
stop-words (stop-words are frequent in all languages)
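A small sketch of the frequency-table idea, assuming cosine similarity between tables of letter pairs and triples. This is not TextCat's implementation, and the single training sentence per language used in the usage note is far too little data for real use.

```python
import math
from collections import Counter


def ngram_profile(text, sizes=(2, 3)):
    """Frequency table of letter pairs and triples (spaces included)."""
    text = " ".join(text.lower().split())
    return Counter(
        text[i:i + n] for n in sizes for i in range(len(text) - n + 1)
    )


def similarity(p, q):
    """Cosine similarity between two n-gram frequency tables."""
    dot = sum(c * q[g] for g, c in p.items())
    norm = math.sqrt(
        sum(c * c for c in p.values()) * sum(c * c for c in q.values())
    )
    return dot / norm if norm else 0.0


def identify_language(text, profiles):
    """Pick the predefined language whose table is closest to the text's table."""
    doc = ngram_profile(text)
    return max(profiles, key=lambda lang: similarity(doc, profiles[lang]))
```

Usage: build `profiles = {"english": ngram_profile(english_sample), "german": ngram_profile(german_sample)}` from sample text in each language, then call `identify_language(text, profiles)`.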
Problem description (II)
• Topic identification and tracking in
time series of documents
• Document indexing based on content
and not only keywords
• Content segmentation of text
• Document summarization
• Link analysis
• Information extraction
Topic identification and tracking
in time series of documents
• Problem: given a time sequence of
documents (news), we want to:
– identify documents that introduce a new
topic
– among newly arriving documents,
identify those about existing topics
and connect them into a topic sequence
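One simple way to picture both subtasks is a single pass over the documents in time order. The sketch below assumes Jaccard similarity on word sets and a fixed threshold; both are illustrative choices and not taken from the slides.

```python
def words(text):
    """Set of lowercase words in a document."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap of two word sets, between 0 and 1."""
    return len(a & b) / len(a | b) if a | b else 0.0


def track_topics(documents, threshold=0.2):
    """Return topic sequences, each a list of document indices in time order."""
    topics = []  # each topic: {"vocab": set of words, "docs": list of indices}
    for i, text in enumerate(documents):
        doc = words(text)
        best = max(topics, key=lambda t: jaccard(doc, t["vocab"]), default=None)
        if best is not None and jaccard(doc, best["vocab"]) >= threshold:
            # Document continues an existing topic.
            best["docs"].append(i)
            best["vocab"] |= doc
        else:
            # Document introduces a new topic.
            topics.append({"vocab": set(doc), "docs": [i]})
    return [t["docs"] for t in topics]
```

The first index in each returned list is the document that introduced the topic.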
Text segmentation based on
content
• Problem: divide text that has no given
structure (table of contents, paragraphs,
etc.) into segments with similar
content
• Example applications:
– topic tracking in news (spoken news)
– identification of topics in large,
unstructured text databases
Algorithm for text
segmentation
• Algorithm:
– Divide text into sentences
– Represent each sentence with words and
phrases it contains
– Calculate similarity between the pairs of
sentences
– Find a segmentation (sequence of delimiters) such
that the similarity between sentences
within a segment is maximized and the
similarity between segments is minimized
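The steps above can be sketched in a simplified, greedy form: instead of searching over all segmentations, the sketch only compares adjacent sentences and cuts where the similarity is lowest. Jaccard similarity on word sets and the requirement to fix the number of segments in advance are assumptions.

```python
import re


def split_sentences(text):
    """Divide text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_words(sentence):
    """Represent a sentence by the set of words it contains."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def segment(text, num_segments):
    """Cut after the sentences that are least similar to their successor."""
    sentences = split_sentences(text)
    reps = [sentence_words(s) for s in sentences]
    gaps = [similarity(reps[i], reps[i + 1]) for i in range(len(reps) - 1)]
    weakest = sorted(range(len(gaps)), key=lambda i: gaps[i])[:num_segments - 1]
    segments, start = [], 0
    for i in sorted(weakest):
        segments.append(sentences[start:i + 1])
        start = i + 1
    segments.append(sentences[start:])
    return segments
```

A fuller version would score whole candidate segmentations using all sentence pairs, as the slide describes.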
Text Summarization
• Task: Given a text document create a summary
reflecting the document’s contents
• Three main phases:
– Analyzing the source text
– Determining its important points
– Synthesizing an appropriate output
• Most methods adopt a linear weighting model:
each text unit (sentence) is assessed by:
– Weight(U) = LocationInText(U) + CuePhrase(U) + Statistics(U) + AdditionalPresence(U)
• …the output consists of the top-ranked text units
(sentences)
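A toy instance of the linear weighting model. How each of the four terms is computed is an assumption here: location favors the first sentence, the cue phrase list is invented, statistics is average word frequency, and additional presence is overlap with the title.

```python
from collections import Counter

# Assumed example cue phrases; a real system would use a curated list.
CUE_PHRASES = ("in conclusion", "in summary", "we show")


def summarize(sentences, title, k=2):
    """Return the k highest-weighted sentences, in their original order."""
    tokens = [s.lower().split() for s in sentences]
    freq = Counter(w for t in tokens for w in t)
    top_freq = max(freq.values(), default=1)
    title_words = set(title.lower().split())

    def weight(i):
        location = 1.0 if i == 0 else 0.0
        cue = 1.0 if any(c in sentences[i].lower() for c in CUE_PHRASES) else 0.0
        statistics = sum(freq[w] for w in tokens[i]) / ((len(tokens[i]) or 1) * top_freq)
        presence = len(title_words & set(tokens[i])) / (len(title_words) or 1)
        return location + cue + statistics + presence

    best = sorted(range(len(sentences)), key=weight, reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]
```

Each term is scaled to at most 1 so that no single term dominates; real systems typically tune per-term coefficients instead.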
Information extraction
• Collect a set of Home pages from the
Web and build a “soft” database of
people (name, address, coworkers,
research areas and publications,
biography...)
• Collect electronic seminar
announcements and extract location
(room number), start and end time,
name of the speaker
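For the seminar announcement example, a hand-written pattern sketch shows what the extracted record could look like. The field labels ("Speaker:", "Room") and the time format are assumptions about the announcements; real announcements are free-form, which is why extraction rules are usually learned rather than written by hand.

```python
import re

TIME = r"\d{1,2}:\d{2}\s*(?:am|pm)?"


def extract_seminar(text):
    """Extract location, start/end time and speaker from one announcement."""
    fields = {}
    times = re.search(rf"({TIME})\s*(?:-|to)\s*({TIME})", text, re.I)
    if times:
        fields["start_time"] = times.group(1).strip()
        fields["end_time"] = times.group(2).strip()
    room = re.search(r"\broom\s+(\w+)", text, re.I)
    if room:
        fields["location"] = room.group(1)
    speaker = re.search(r"^speaker:\s*(.+)$", text, re.I | re.M)
    if speaker:
        fields["speaker"] = speaker.group(1).strip()
    return fields


# Made-up announcement, used only to illustrate the output format.
example = (
    "Seminar on text learning\n"
    "Speaker: Jane Doe\n"
    "Time: 3:30 pm - 5:00 pm\n"
    "Place: Wean Hall, Room 4623\n"
)
record = extract_seminar(example)
```

Fields that are not found are simply left out of the record, which matches the "soft" database idea above.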
Where are we now?
• Growing interest in, and need for, handling
large collections of text
• The area has been present in Slovenia for over 5
years, with strong international connections
– joint R&D projects with Microsoft Research and
European and American research institutions;
cooperation with Boeing
• Organization of international events
focused on Text Mining (ICML-99, KDD-2000, ICDM-2001)
Instead of conclusions...
• Text Mining enables solving some
problems that are often not expected to
be addressed by computers:
– document authorship detection,
identification of related content or finding
“interesting” people, document segmentation
and organization, automatic collection of
officer names for companies in a selected
sector, finding experts in some area, who
is involved with whom (discovering social
networks), ...
To find more information check:
<http://www-personal.umich.edu/~wfan/text_mining.html>
<http://ai.about.com/library/weekly/aa102899.htm>
<http://extractor.iit.nrc.ca/bibliographies/ml-applied-to-ir.html>
<http://www.content-analysis.de/>
get research papers at <http://www.researchindex.com>
• KDD-2000 Text Mining Workshop
<http://www.cs.cmu.edu/~dunja/WshKDD2000.html>
• ECAI-2000 ML for Information Extraction
<http://www.dcs.shef.ac.uk/~fabio/ecai-workshop.html>
• PRICAI-2000 Text and Web Mining Workshop
<http://textmining.krdl.org.sg/cfp.html>
• IJCAI-2001 Adaptive Text Extraction and Mining Workshop
<http://www.smi.ucd.ie/ATEM2001/>, Text Learning: Beyond Supervision
<http://www.cs.cmu.edu/~mccallum/textbeyond/>
• ICDM-2001 Text Mining Workshop
<http://www-ai.ijs.si/DunjaMladenic/TextDM01/>
• ECML/PKDD-2001 Text Mining tutorial
<http://www-ai.ijs.si/DunjaMladenic/TextDM01/Tutorial.ps>
Link Analysis
• Mechanisms for detecting which
vertices in the graph (pages on the
web) are more important on the basis
of link structure:
– HITS algorithm (Hubs & Authorities)
(Kleinberg 1998)
– PageRank weighting (Page 1999), used by
Google to rank good pages higher
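A compact sketch of the HITS iteration on a small directed graph: a vertex gets a high authority weight when good hubs link to it, and a high hub weight when it links to good authorities. The edge-list input format and the fixed iteration count are assumptions.

```python
import math


def normalize(weights):
    """Scale weights to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {n: v / norm for n, v in weights.items()}


def hits(edges, iterations=50):
    """edges is a list of (source, target) links; returns (hub, authority) weights."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub weights of the vertices linking to n.
        auth = normalize(
            {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        )
        # Hub: sum of authority weights of the vertices n links to.
        hub = normalize(
            {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        )
    return hub, auth
```

For example, with `edges = [("a", "c"), ("b", "c"), ("a", "d")]`, vertex `c` receives the largest authority weight and `a` the largest hub weight.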
Link analysis on Amazon data
• We downloaded product pages from
the Amazon.com web site:
– …products are connected by a cross-sell
relation (“customers who bought this product
also bought the following products…”)
– 130,000 books and 32,000 music CDs connected
into a graph
• Question: which products (books or CDs)
are the most important?
• …we used the HITS algorithm to calculate the
weights
– Harry Potter and the Beatles came out on top.
Popular books
1. Harry Potter and the Goblet of Fire (Book 4): J K Rowling, Mary Grandpre
2. The Beatles Anthology: The Beatles, Paul McCartney, George Harrison, Ringo Starr, Lennon, John Lennon
3. Prodigal Summer: Barbara Kingsolver
4. Harry Potter and the Sorcerer's Stone (Book 1): J K Rowling
5. The Mark : The Beast Rules the World (Left Behind #8): Tim LaHaye, Jerry B Jenkins
6. Harry Potter and the Chamber of Secrets (Book 2): J K Rowling
7. Harry Potter and the Prisoner of Azkaban (Book 3): J K Rowling, Mary Grandpre
8. The Sibley Guide to Birds (Audubon Society Nature Guides Ser.): David Allen Sibley
• ...
Popular CDs
1. The Beatles
2. A Day Without Rain: Enya
3. Lovers Rock: Sade
4. All That You Can't Leave Behind: U2
5. Riding With The King: Eric Clapton, BB King
6. Black and Blue: Backstreet Boys
7. Sailing To Philadelphia: Mark Knopfler
8. You're The One: Paul Simon
9. Kid A: Radiohead
10. Music: Madonna
11. Red Dirt Girl: Emmylou Harris
12. Renee Fleming
• ...