Transcript document

From Textual Information
to Numerical Vectors
Shilpa Venkataramana
Kent State University
1
Introduction






To mine text, we need to process it into a form that data mining procedures can use.
As discussed in the earlier chapter, this involves generating features in a spreadsheet format.
Classical data mining looks at highly structured data.
The spreadsheet model is the embodiment of a representation that is supportive of predictive modeling.
Predictive text mining is simpler and more restrictive than open-ended data mining.
Text is called unstructured because it is very far from the spreadsheet model that we need in order to process data for prediction.
2
Introduction




Transformation of text to the spreadsheet model is a methodical, carefully organized procedure for filling in the cells of a spreadsheet.
We have to determine the nature of each column in the spreadsheet.
Some features are easy to obtain (e.g., the words in a text); others are difficult (e.g., the grammatical function of a word in a sentence).
To discuss: how to obtain the kinds of features generated from text.
3
Collecting Documents




The first step in text mining is to collect the data (the documents).
A web page retrieval application for an intranet implicitly specifies the relevant documents to be the web pages on the intranet.
If the documents are identified, they can be obtained; the main issue is to cleanse the samples and ensure high quality.
For a web application comprising a number of autonomous websites, one may deploy a software tool such as a web crawler to collect the documents.
4
Collecting Documents





In other applications, a logging process is attached to an input data stream for a length of time (e.g., for an email audit, incoming and outgoing messages are logged at the mail server for a period of time).
For research and development work in text mining, we need generic data: a corpus.
The corpus accompanying the software is from Reuters, the Reuters corpus (RCV1).
In the early days (1960s and 1970s), one million words was considered a large collection; the Brown corpus consists of 500 samples of about 2,000 words each of American English text.
5
Collecting Documents

A European corpus was modeled on the Brown corpus, but for British English.
In the 1970s and 80s more resources became available, many of them government sponsored.
Some widely used corpora include the Penn Treebank, a collection of manually parsed sentences from the Wall Street Journal.
Another resource is the World Wide Web itself. Web crawlers can build collections of pages from a particular site such as Yahoo.
Given the size of the web, such collections require cleaning before use.
6
Document Standardization




When documents are collected, they may arrive in different formats.
Some documents may be collected in Word format, others as simple ASCII text. To process these documents, we have to convert them to a standard format.
The standard format used here is XML, the Extensible Markup Language.
7
Document Standardization-XML



XML is a standard way to insert tags into text to identify its parts.
Each document is marked off from the corpus by XML tags.
Typical tags include:
<Date>
<Subject>
<Topic>
<Text>
<Body>
<Header>
8
XML – An Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
9
XML
The main reason to identify the parts is to allow selection of those parts that will be used to generate features.
The selected parts of a document are concatenated into strings, separated by tags.
Document standardization: why should we care?
The advantage of standardization is that mining tools can be applied without having to consider the pedigree of each document.
10
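As a concrete illustration of selecting tagged parts, here is a minimal sketch that parses the example <note> document with Python's standard xml.etree.ElementTree and concatenates the chosen parts into one string for later tokenization. Which tags to keep (heading and body here) is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

# The example document in the standard XML format (the <note> example above).
doc = """<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

root = ET.fromstring(doc)

# Select only the parts we want to generate features from
# (heading and body -- an illustrative choice).
selected_tags = ["heading", "body"]
parts = [root.findtext(tag, default="") for tag in selected_tags]

# Concatenate the selected parts into one string.
text = " ".join(parts)
print(text)  # -> "Reminder Don't forget me this weekend!"
```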
Tokenization



Once the documents are collected in XML format, we can examine the data.
The character stream is broken into words, called tokens.
Each token is an instance of a type; the number of tokens is therefore higher than the number of types.
For example, if "the" occurs twice in a sentence, there are two tokens of the single type "the"; a token refers to an occurrence of a type.
Space and tab characters are not tokens but white space.
A comma or colon between letters is a token (e.g., USA,INDIA), but between digits it is a delimiter (e.g., 121,135).
The apostrophe has a number of uses: it can be a delimiter or part of a token (e.g., D'Angelo), and when it is followed by a terminator it is an internal quote (e.g., Tess'.).
11
Tokenization - Pseudocode
A dash is a terminator when the token is preceded or followed by another dash (e.g., 522-3333).
Without identifying tokens, it is difficult to imagine extracting higher-level information from a document.
12
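The full pseudocode is not reproduced in the transcript. Below is a minimal tokenizer sketch in Python using a regular expression instead of a character-by-character scan; the exact rules (keeping apostrophes and internal dashes inside tokens, treating other punctuation as delimiters or white space) are simplifying assumptions.

```python
import re

# Tokens: runs of letters/digits, optionally joined by internal
# apostrophes or dashes (D'Angelo, 522-3333). Everything else is
# treated as white space or a delimiter. This is a simplification
# of the rules discussed above.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Call 522-3333, ask for D'Angelo in the USA,INDIA office."))
# ['Call', '522-3333', 'ask', 'for', "D'Angelo", 'in', 'the', 'USA', 'INDIA', 'office']
```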
Lemmatization


Once the character stream has been segmented into a sequence of tokens, the next step is to convert each token to a standard form, known as stemming or lemmatization. (Whether this is done is application dependent.)
This reduces the number of distinct types in the corpus and increases the frequency of occurrence of individual types.
English speakers agree that the nouns "book" and "books" are two forms of the same word, and it is advantageous to eliminate this kind of variation.
Normalization that regularizes such grammatical variants is called inflectional stemming.
13
Stemming to a Root





Grammatical variants include singular/plural and present/past forms.
It is usually advantageous to eliminate this kind of variation before further processing.
When normalization is confined to regular grammatical variants such as singular/plural and present/past, the process is called inflectional stemming.
More aggressive stemmers intend to reach a root form with no inflectional or derivational prefixes or suffixes; the end result of such aggressive stemming is to reduce the number of types in the text.
14
Stemming Pseudocode
15
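The stemming pseudocode itself is not reproduced in the transcript. Below is a minimal inflectional-stemming sketch in Python using a few suffix rules (plural -s/-es/-ies, past -ed, progressive -ing); the rule set is an illustrative assumption, not the full algorithm.

```python
def inflectional_stem(token):
    """Very small rule-based inflectional stemmer (illustrative only)."""
    word = token.lower()
    # Order matters: try longer suffixes first.
    for suffix, replacement in (("ies", "y"), ("sses", "ss"), ("ing", ""),
                                ("ed", ""), ("es", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

print([inflectional_stem(w) for w in ["books", "studies", "exited", "exiting", "glasses"]])
# ['book', 'study', 'exit', 'exit', 'glass']
```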
Vector Generation for prediction






Consider the problem of categorizing documents.
The characteristic features are the tokens or words the documents contain.
Without deep analysis, we can choose to describe each document by features that represent the most frequent tokens.
The collection of features is called a dictionary.
The tokens or words in the dictionary form the basis for creating a spreadsheet of numeric data corresponding to the document collection.
Each row is a document; each column is a feature.
16
Vector Generation for prediction








Each cell in the spreadsheet is a measurement of a feature for a document.
In the basic model of the data, we simply check the presence or absence of words.
Checking for words is simple because we do not scan the whole dictionary for each word; instead we build a hash table. Large samples of digital documents are readily available, which gives confidence about the variations and combinations of words that occur.
If prediction is our goal, then we need one more column for the correct answer.
In preparing data for learning, this information is available from the document labels. Our labels are binary answers, also called the class.
Instead of generating a global dictionary for all classes, we can consider only the words in the class that we are trying to predict.
If this class is far smaller than the negative class, which is typical, the local dictionary is far smaller than the global dictionary.
Another reduction in dictionary size is to compile a list of stopwords and remove them from the dictionary.
17
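A sketch of this basic presence/absence model: the dictionary is built as a hash table (a Python dict) mapping word to column index, then one binary row is produced per document. The stopword list and toy documents are illustrative assumptions.

```python
STOPWORDS = {"a", "an", "the", "it", "they", "and", "of"}  # tiny illustrative list

docs = ["the cat sat on the mat", "the dog chased a cat"]

# Dictionary as a hash table: word -> column index (stopwords removed).
dictionary = {}
for doc in docs:
    for word in doc.split():
        if word not in STOPWORDS and word not in dictionary:
            dictionary[word] = len(dictionary)

# Spreadsheet model: one row per document, 1 = word present, 0 = absent.
rows = []
for doc in docs:
    row = [0] * len(dictionary)
    for word in doc.split():
        if word in dictionary:
            row[dictionary[word]] = 1
    rows.append(row)

print(dictionary)  # {'cat': 0, 'sat': 1, 'on': 2, 'mat': 3, 'dog': 4, 'chased': 5}
print(rows)        # [[1, 1, 1, 1, 0, 0], [1, 0, 0, 0, 1, 1]]
```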





Stopwords almost never have any predictive capability; examples are articles such as "a" and "the" and pronouns such as "it" and "they".
Frequency information from the word counts can be quite useful for reducing the dictionary size and improving predictive performance.
Very frequent words are often stopwords and can be deleted.
An alternative to local dictionary generation is to generate a global dictionary from all documents in the collection. Special feature selection routines then attempt to select the subset of words with the greatest potential for prediction; these selection methods rank the words independently.
If we have 100 topics to categorize, then we have 100 problems to solve. Our choice is between 100 small local dictionaries and one global dictionary.
18








The vectors implied by the spreadsheet model are then regenerated to correspond to the smaller dictionary.
Instead of placing every word form in the dictionary, we can follow the practice of a printed dictionary and avoid storing every variation of a word (no separate singular/plural or past/present forms).
Verbs are stored in stemmed form.
This adds a layer of complexity to the text processing, but performance is gained and the dictionary size is reduced.
A universal procedure that trims words to their root form can lose differences in meaning (e.g., "exit" and "exiting" have different meanings in the context of programming).
With a small dictionary, the best words can be captured easily.
The use of tokens and stemming are examples of procedures that help produce smaller dictionaries and improve the manageability of learning and its accuracy.
In this way a document collection can be converted to a spreadsheet.
19




Each column is a feature; each row is a document.
The model of data for predictive text mining is a spreadsheet populated by ones and zeros, where the cells represent the presence or absence of dictionary words in the document collection.
For higher accuracy, additional transformations can be applied:
Word pairs and collocations
Frequency counts
tf-idf
Word pairs and collocations: they serve to increase the size of the dictionary and improve the performance of prediction.
Frequency: instead of 0s and 1s in the cells, the frequency of the word can be used (if the word "the" occurs 10 times, the count 10 is used).
Counts often give better results than binary values in the cells.
This can lead to more compact solutions: the same solution as the binary data model is reachable, yet the additional frequency information often yields a simpler solution.
20


Frequencies are helpful for prediction but add complexity to the solutions.
A compromise that works well is a three-value system, 0/1/2:
0 - the word did not occur
1 - the word occurred once
2 - the word occurred two or more times
This captures much of the added value of frequency without adding much complexity to the model.
Another variant is to zero out values below a threshold, so that a token must reach a minimum frequency before being considered of any use.
This reduces the complexity of the spreadsheet used by the data mining algorithms.
Other methods to reduce complexity are chi-square, mutual information, odds ratio, etc.
The next step beyond counting frequency is to modify the count by the perceived importance of that word.
21
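A small sketch of the three-value coding and of the minimum-frequency threshold applied to raw counts; the threshold value of 2 is an arbitrary assumption.

```python
def three_value(count):
    """Map a raw word count to the 0/1/2 coding described above."""
    return 0 if count == 0 else (1 if count == 1 else 2)

def threshold(count, min_freq=2):
    """Zero out counts below a minimum frequency (min_freq is illustrative)."""
    return count if count >= min_freq else 0

counts = [0, 1, 3, 10]
print([three_value(c) for c in counts])            # [0, 1, 2, 2]
print([threshold(c, min_freq=2) for c in counts])  # [0, 0, 3, 10]
```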




tf-idf computes weightings or scores for words.
The values are positive numbers that capture more than the simple absence or presence of a word.
In Eq. (a), the weight assigned to word j is its term frequency modified by a scale factor for the importance of the word. The scale factor is the inverse document frequency of Eq. (b), which simply checks the number of documents containing word j, df(j), and reverses the scaling:

tf-idf(j) = tf(j) * idf(j)      (a)
idf(j) = log(N / df(j))         (b)

When a word appears in many documents, the scale factor is lowered, perhaps to zero. If a word is relatively unique and appears in only a few documents, the scale factor zooms upward and the word appears more important.
Alternatives to this tf-idf formulation exist, but the motivation is the same. The result is a positive score that replaces the simple frequency or binary (true/false) entry in the cell of the spreadsheet.
22
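A sketch of Eqs. (a) and (b) in Python, computing tf-idf weights from raw term frequencies. The toy collection is an assumption, and the natural logarithm is an arbitrary choice since the base is not specified above.

```python
import math

# Toy collection: each document is a dict of term frequencies tf(j).
docs = [
    {"vase": 2, "broken": 1},
    {"vase": 1, "clean": 3},
    {"broken": 1, "dirty": 1},
]
N = len(docs)

# df(j): number of documents containing word j.
df = {}
for doc in docs:
    for word in doc:
        df[word] = df.get(word, 0) + 1

def tf_idf(doc):
    # tf-idf(j) = tf(j) * log(N / df(j))   (Eqs. (a) and (b))
    return {word: tf * math.log(N / df[word]) for word, tf in doc.items()}

for doc in docs:
    print(tf_idf(doc))
```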


Another variant is to weight tokens differently depending on which part of the document they come from.
Which data transformation method is best?
There is no universal answer.
The best predictive accuracy depends on matching these methods to the data.
The best variation for one problem may not be the best for another; test them all.
We have described the data as populating a spreadsheet whose cells are mostly zero, since each document contains only a small subset of the dictionary words.
In text classification, a corpus may have thousands of dictionary words, while each individual document has relatively few unique tokens, so most of that document's row is zero.
Rather than store all the zeros, it is better to represent the spreadsheet as a set of sparse vectors: each row is a list of pairs, where one element of the pair is the column and the other is the corresponding nonzero value. By not storing the zero entries, memory use is greatly reduced.
23
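A sketch of the sparse representation: each row is stored as a list of (column, value) pairs instead of a dense 0/1 row. The dense row below reuses the illustrative example from the dictionary sketch above.

```python
def to_sparse(dense_row):
    """Convert a dense row of cell values to a list of (column, value) pairs,
    keeping only the nonzero entries."""
    return [(col, val) for col, val in enumerate(dense_row) if val != 0]

def to_dense(sparse_row, n_columns):
    """Rebuild the dense row when a learning method needs it."""
    row = [0] * n_columns
    for col, val in sparse_row:
        row[col] = val
    return row

dense = [1, 0, 0, 0, 1, 1]
sparse = to_sparse(dense)
print(sparse)                                 # [(0, 1), (4, 1), (5, 1)]
print(to_dense(sparse, len(dense)) == dense)  # True
```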
Multi Word Features





So far, features have been associated with single words (tokens delimited by white space).
This simple scenario can be extended to include pairs of words, e.g., "bon" and "vivant": instead of separating them, we can treat "bon vivant" as a single feature.
Why stop at pairs? Why not consider multiword features?
Unlike word pairs, the words in a multiword feature need not be consecutive.
E.g., with "Don Smith" as a feature, we can ignore his middle name "Leroy" that may appear in some references to the person.
Similarly, we may have to accommodate references to a noun that involve a number of adjectives, with the desired adjective not adjacent to the noun; e.g., we want to accept the phrase "broken and dirty vase" as an instance of the feature "broken vase".
24







A multiword feature is a set of x words occurring within a maximum window of y words (y >= x, naturally).
How are such features extracted from text? Specialized methods are needed.
If we use frequency methods, we look for combinations of words that are relatively frequent.
A straightforward implementation simply enumerates combinations of x words within a window of size y.
Measuring the value of a potential multiword feature is done by the correlation between the words in it; measures such as mutual information or the likelihood ratio can be used.
An algorithm can generate multiword features, but a straightforward implementation consumes a lot of memory.
Multiword features are not found too often in a document collection, but when they do occur they are highly predictive.
A candidate multiword feature T can be scored with the association measure

AM(T) = \frac{size(T)\,\log_{10}(freq(T))\,freq(T)}{\sum_{word_i \in T} freq(word_i)}
25
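A sketch of the association measure AM(T) above; the phrase and token frequencies used here are illustrative assumptions.

```python
import math

def association_measure(phrase_tokens, phrase_freq, token_freq):
    """AM(T) = size(T) * log10(freq(T)) * freq(T) / sum of freq(word_i) over word_i in T."""
    size_t = len(phrase_tokens)
    denom = sum(token_freq[w] for w in phrase_tokens)
    return size_t * math.log10(phrase_freq) * phrase_freq / denom

# Illustrative counts: the phrase "broken vase" occurs 12 times,
# while the individual tokens occur far more often on their own.
token_freq = {"broken": 300, "vase": 150}
print(association_measure(["broken", "vase"], 12, token_freq))
```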
26
Labels for Right Answers:









For prediction, an extra column is added to the spreadsheet.
This last column contains the labels; it looks no different from the others.
It holds a 0 or 1 indicating the right answer, i.e., true or false.
In the sparse vector format, the label is appended to each vector separately as either a one (positive class) or a zero (negative class).
Feature Selection by Attribute Ranking:
In addition to frequency-based approaches, feature selection can be done in a number of ways.
One is to select a set of features for each category to form a local dictionary for that category.
This is done by independently ranking the feature attributes according to their predictive abilities for the category under consideration.
The predictive ability of an attribute can be measured by a quantity that says how well it is correlated with the label.
Assume we have n documents; let x_j denote the presence or absence of attribute j in a document x, and let y denote the label of the document in the last column.
27

A commonly used ranking score is the information gain criterion:
IG(j) = L_{label} - L(j)

L_{label} = \sum_{c=0}^{1} \Pr(y=c)\,\log_2\!\frac{1}{\Pr(y=c)}

L(j) = \sum_{v=0}^{1}\sum_{c=0}^{1} \Pr(x_j=v)\,\Pr(y=c \mid x_j=v)\,\log_2\!\frac{1}{\Pr(y=c \mid x_j=v)}


The quantity L(j) is the number of bits required to encode the label and the attribute j together, minus the number of bits required to encode the attribute alone.
The probabilities needed to compute L(j) can be easily estimated from the data using the estimators
\Pr(x_j = v) = \frac{freq(x_j = v) + 1}{n + 2}

\Pr(y = c \mid x_j = v) = \frac{freq(x_j = v,\; label = c) + 1}{freq(x_j = v) + 2}
28
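A sketch of the information gain computation for a binary attribute and binary labels, using the smoothed estimators above. The toy data are assumptions, and the label prior Pr(y=c) is smoothed the same way as Pr(x_j=v), which is an additional assumption.

```python
import math

def information_gain(x_col, y_col):
    """IG(j) = L_label - L(j) for one binary attribute column and binary labels,
    using the +1 / +2 smoothed estimators above."""
    n = len(y_col)

    def log2_inv(p):
        return math.log2(1.0 / p)

    # L_label: bits to encode the label alone.
    l_label = 0.0
    for c in (0, 1):
        p_c = (sum(1 for y in y_col if y == c) + 1) / (n + 2)
        l_label += p_c * log2_inv(p_c)

    # L(j): bits to encode the label given attribute j.
    l_j = 0.0
    for v in (0, 1):
        freq_v = sum(1 for x in x_col if x == v)
        p_v = (freq_v + 1) / (n + 2)
        for c in (0, 1):
            freq_vc = sum(1 for x, y in zip(x_col, y_col) if x == v and y == c)
            p_c_given_v = (freq_vc + 1) / (freq_v + 2)
            l_j += p_v * p_c_given_v * log2_inv(p_c_given_v)

    return l_label - l_j

# Toy example: the attribute mostly agrees with the label.
x = [1, 1, 0, 0, 1, 0]
y = [1, 1, 0, 0, 1, 1]
print(information_gain(x, y))
```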
Sentence Boundary Determination







If the XML markup for the corpus does not mark sentence boundaries, it may be necessary to mark them ourselves.
It is then necessary to determine when a period is part of a token and when it is not.
More sophisticated linguistic parsing algorithms often require a complete sentence as input.
Information extraction algorithms operate on text one sentence at a time.
For these algorithms to work well, sentences must be identified clearly.
Sentence boundary determination is the problem of deciding which instances of a period followed by white space are sentence delimiters and which are not (characters such as ? and ! are assumed to always end a sentence); it is thus a classification problem.
A simple algorithm already gives good accuracy, and adjustments give even better performance.
29
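A minimal rule-based sketch of this classification: every '.', '?' or '!' followed by white space is a candidate boundary, and a period is rejected when it ends a known abbreviation or a single initial. The abbreviation list and rules are illustrative assumptions, not the trained classifier discussed above.

```python
import re

# Illustrative abbreviation list; a real system would learn or extend this.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "etc", "e.g", "i.e", "p.m", "a.m"}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundaries: '.', '?' or '!' followed by white space.
    for match in re.finditer(r"[.?!](?=\s)", text):
        end = match.end()
        if match.group() == ".":
            prev_word = text[start:end - 1].split()[-1].lower().rstrip(".")
            # Not a boundary if the period ends an abbreviation or an initial.
            if prev_word in ABBREVIATIONS or len(prev_word) == 1:
                continue
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived at 9 p.m. yesterday. He was late! Ask him why."))
# ['Dr. Smith arrived at 9 p.m. yesterday.', 'He was late!', 'Ask him why.']
```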
30
Thank you  !!!!
31