Natural Language Processing and Textual Analysis
in Finance and Accounting
Tim Loughran and Bill McDonald
University of Notre Dame
Overview
Data/Programs
Sample App
Stemming
Word Lists
Resources
… “‘Cause you know sometimes words have two meanings.”
• What do we call this?
– Textual analysis
– Natural language processing
– Sentiment analysis
– Content analysis
– Computational linguistics
• Increased interest attributable to:
– Bigger, faster computers
– Availability of large quantities of text
– New technologies derived from search engines
• Examples of data sources:
– EDGAR (1994-2011, 22.7 million filings)
– WSJ News Archive (XML-encapsulated, 2000 onward)
– Audio transcripts (e.g., conference calls)
– Web sites
– Google searches
– Twitter / Stocktwits
• Programs
– Black boxes (Wordstat, Lexalytics, Diction …)
– Two critical components
• Ability to download data and convert into string/character variable
• Ability to parse large quantities of text
• Most modern languages provide for both of these functions:
– Perl
– Python
– SAS Text Miner
– VB.net
• Parsing large quantities of text: REGEX
• Regular expressions example
– Regex that attempts to identify sentences:
(?<=^|[\.!\?]\s+|\n{2,})[A-Z][^\.!\?\n]{20,}(?=([\.!\?](\s|$)))
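The look-behind in the slide's pattern has variable width, which Python's built-in `re` module rejects. A minimal sketch of an adaptation, assuming Python is the working language: the leading context is consumed by a non-capturing group and the sentence body is captured instead. The names `SENTENCE_RE` and `find_sentences` and the sample text are illustrative, not from the slides.

```python
import re

# Python adaptation of the slide's pattern. The leading context is consumed by
# a non-capturing group instead of a look-behind; the sentence body is group 1.
SENTENCE_RE = re.compile(
    r"(?:^|[.!?]\s+|\n{2,})"   # start of text, end of previous sentence, or blank line
    r"([A-Z][^.!?\n]{20,})"    # capital letter, then 20+ non-terminator characters
    r"(?=[.!?](?:\s|$))"       # followed by a terminator and whitespace or end of text
)


def find_sentences(text):
    """Return candidate sentences without their closing punctuation."""
    return [match.group(1) for match in SENTENCE_RE.finditer(text)]


# Invented sample text; the trailing "Ok." is too short to qualify.
sample = (
    "Revenue increased during the fiscal year. "
    "Costs also rose sharply in the fourth quarter. Ok."
)
sentences = find_sentences(sample)
```

Like the original, this is only an attempt: abbreviations such as "Inc." or "U.S." will still end a candidate sentence early.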
• Summary of technical literature:
Natural languages are messy and difficult to parse with computers.
Current Issues in Parsing Technology, Masaru Tomita, Kluwer Academic Publishing, 1991, p. 1
• Tripwires – some examples
– Parsing out 10-K segments
– “May”
– Disambiguation of abbreviations
– Older files are less structured
• Download 10-X
– Download master files for each year/qtr
"ftp://ftp.sec.gov/edgar/full-index/YYYY/QTR#"
• Identify target forms from master file
• Download forms
– http://www.sec.gov/Archives/target file name
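The "identify target forms" step can be sketched as below, assuming the master file is pipe-delimited with the columns CIK, Company Name, Form Type, Date Filed, and Filename. The sample rows, the `TARGET_FORMS` set, and the function name are invented for illustration; the download itself is left out.

```python
# Root taken from the slide; the file name from the index is appended to it.
ARCHIVE_ROOT = "http://www.sec.gov/Archives/"

# Hypothetical choice of form types to keep.
TARGET_FORMS = {"10-K", "10-K405", "10-Q"}

# Invented rows in the assumed master-index layout.
SAMPLE_INDEX = """\
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000001|EXAMPLE CORP|10-K|2011-03-01|edgar/data/1000001/0000000000-11-000001.txt
1000002|SAMPLE HOLDINGS INC|8-K|2011-03-02|edgar/data/1000002/0000000000-11-000002.txt
1000003|DEMO INDUSTRIES|10-K405|2011-03-03|edgar/data/1000003/0000000000-11-000003.txt
"""


def target_urls(index_text, forms=TARGET_FORMS):
    """Return download URLs for index rows whose form type is in `forms`."""
    urls = []
    for line in index_text.splitlines():
        fields = line.split("|")
        # Skip preamble, separator, and header lines.
        if len(fields) != 5 or fields[0] == "CIK":
            continue
        cik, company, form_type, date_filed, filename = (f.strip() for f in fields)
        if form_type in forms:
            urls.append(ARCHIVE_ROOT + filename)
    return urls


urls = target_urls(SAMPLE_INDEX)
```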
• Iterate thru forms:
– Clean up text file
• Remove ASCII-encoded segments (e.g., graphics, PDFs)
• Remove XBRL
• Remove tables (<TABLE>.*?</TABLE>)
• Remove all remaining markup tags (HTML)
• Decode character entity references (e.g., &amp; becomes &)
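A minimal sketch of three of the clean-up steps above (tables, remaining tags, entity references), assuming the filing is already in a Python string. Removal of ASCII-encoded segments and XBRL is not shown. The table pattern is widened slightly from the slide's so that tags with attributes also match; that change and the function name are assumptions.

```python
import html
import re

# Slide pattern <TABLE>.*?</TABLE>, widened to allow attributes in the open tag.
TABLE_RE = re.compile(r"<TABLE\b[^>]*>.*?</TABLE>", re.IGNORECASE | re.DOTALL)
# Any remaining markup tag.
TAG_RE = re.compile(r"<[^>]+>")


def clean_filing(raw):
    """Strip tables and tags, decode entities, and collapse runs of spaces."""
    text = TABLE_RE.sub(" ", raw)
    text = TAG_RE.sub(" ", text)
    # Decode entities last so that &lt; and &gt; cannot be mistaken for tags.
    text = html.unescape(text)
    return re.sub(r"[ \t]+", " ", text).strip()


# Invented fragment of a filing.
raw = "<P>Profit &amp; loss</P><TABLE><TR><TD>1</TD></TR></TABLE><B>Risk</B>"
cleaned = clean_filing(raw)
```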
• Iterate thru forms (continued):
– Parse form into tokens
• Regex: (?i:\b[-A-Z]{2,}\b)
• Iterate thru each token to see if it matches an entry in a master dictionary
• Tabulate words
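The token loop above can be sketched as follows, using the slide's token pattern with a case-insensitive flag. The sketch assumes the master dictionary is held as a set of upper-case words; the tiny dictionary and sample sentence are invented.

```python
import re
from collections import Counter

# Slide pattern: runs of two or more letters or hyphens, case-insensitive.
TOKEN_RE = re.compile(r"\b[-A-Z]{2,}\b", re.IGNORECASE)


def tabulate(text, master_dictionary):
    """Count tokens that appear in the master dictionary (upper-case keys)."""
    counts = Counter()
    for token in TOKEN_RE.findall(text):
        word = token.upper()
        if word in master_dictionary:
            counts[word] += 1
    return counts


# Invented stand-in for a master dictionary.
master = {"COMPANY", "MAY", "LOSS", "LOSSES"}
counts = tabulate("The Company may incur a loss; losses may recur.", master)
```

Tokens that are not in the dictionary (here "The", "incur", "recur") are simply dropped, so the tabulation only ever contains dictionary words.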
When creating word lists, should we list root words
(lexemes) and stem, or expand all root words to include
inflections?
• Stemming
– Programmatically collapse words down to root lexeme:
• expensive, expensed, expensing => expense
• Inflection
– depreciate => depreciated / depreciates / depreciating / depreciation
– Avoids conflating unrelated word pairs such as: blind / blinds; odd / odds; bitter / bitters
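The contrast can be illustrated with a deliberately crude sketch. `naive_stem` is a toy suffix stripper written for this example, not a real stemming algorithm, and the inflection table is a hypothetical hand-built one. The stemmer collapses "odds" into "odd", while the explicit inflection list only admits the forms someone chose to include.

```python
def naive_stem(word):
    """Toy stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-built inflection table.
INFLECTIONS = {
    "depreciate": ["depreciated", "depreciates", "depreciating", "depreciation"],
}


def expand(word_list, inflections):
    """Return the word list plus every explicitly listed inflection."""
    expanded = set(word_list)
    for word in word_list:
        expanded.update(inflections.get(word, []))
    return expanded


expanded = expand(["depreciate", "odd"], INFLECTIONS)
stemmed_pairs = {w: naive_stem(w) for w in ("blinds", "odds", "bitters")}
```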
• The text processing literature shows that stemming does not, in general, improve performance. Essentially, stemming does not work for morphologically rich languages.
• Loughran/McDonald JF 2011 word lists
– Create a dictionary of all words occurring in 10-Ks from 1994-2007.
– Classify words occurring in 5% or more of the documents.
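The 5% screen is a document-frequency filter: a word counts once per document, no matter how often it appears there. A minimal sketch, assuming each document has already been reduced to a list of tokens; the function name and the toy corpus are invented.

```python
from collections import Counter


def frequent_words(documents, min_share=0.05):
    """Return words that occur in at least `min_share` of the documents."""
    doc_freq = Counter()
    for tokens in documents:
        # set() so that repeats inside one document count only once.
        doc_freq.update(set(tokens))
    total = len(documents)
    return {word for word, n in doc_freq.items() if n / total >= min_share}


# Invented corpus of 40 tiny documents.
documents = [["the", "company"] for _ in range(40)]
documents[0] += ["loss", "loss", "zebra"]
documents[1] += ["loss"]
documents[2] += ["loss"]

candidates = frequent_words(documents)
```

Here "loss" appears in 3 of 40 documents (7.5%) and is kept, while "zebra" appears in 1 of 40 (2.5%) and is dropped.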
• Loughran/McDonald JF 2011 word lists
– Fin-Neg – negative words (e.g., loss, bankruptcy, indebtedness, felony, misstated, discontinued, expire, unable). N = 2,349
– Fin-Pos – positive words (e.g., beneficial, excellent, innovative). N = 354
Notice that in financial reporting it is unlikely that negative words will be negated (e.g., not terrible earnings), whereas positive words are easily qualified or compromised. Although you can easily account for simple negation, typical forms of negation are difficult to detect.
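"Simple negation" can be sketched as a look-back window: a positive word is treated as negated if a negator appears within a few words before it. The negator list, the window of three words, and the short positive list below are illustrative assumptions, not settings taken from the slides.

```python
import re

# Illustrative negators and window size (assumptions).
NEGATORS = {"no", "not", "none", "neither", "never", "nobody"}
WINDOW = 3


def count_positive(text, positive_words, window=WINDOW):
    """Return (plain, negated) counts of positive words in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    plain = negated = 0
    for i, token in enumerate(tokens):
        if token in positive_words:
            preceding = tokens[max(0, i - window):i]
            if any(p in NEGATORS for p in preceding):
                negated += 1
            else:
                plain += 1
    return plain, negated


positive = {"beneficial", "excellent", "innovative"}
result = count_positive(
    "Results were excellent. The new product was not innovative.", positive
)
```

Anything subtler than this, such as "hardly beneficial" or a qualifier several clauses away, slips through, which is the difficulty the slide points to.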
• Loughran/McDonald JF 2011 word lists
– Fin-Unc – uncertainty words. The emphasis here is on uncertainty more than on risk (e.g., ambiguity, approximate, assume, risk). N = 291
– Fin-Lit – litigious words (e.g., admission, breach, defendant, plaintiff, remand, testimony). N = 871
• Loughran/McDonald JF 2011 word lists
– Modal Strong – e.g., always, best, definitely, highest, lowest, will. N = 19
– Modal Weak – e.g., could, depending, may, possibly, sometimes. N = 27
• Use of word lists:
“Content analysis stands or falls by its categories.
Particular studies have been productive to the extent
that the categories were clearly formulated and well
adapted to the problem”
Berelson (1952, p. 92)
• Zipf’s law – the most frequent word will appear twice as often as the second most frequent word and three times as often as the third, etc. Much like the distribution of market cap in finance.
• Always look at the words driving your counts
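One way to look at the words driving the counts is a rank/frequency table that sets each observed count beside the count Zipf's law would predict (top frequency divided by rank). A minimal sketch with an invented token list built to follow the law exactly; the function name is illustrative.

```python
from collections import Counter


def zipf_table(tokens, top=5):
    """Return (rank, word, observed count, Zipf-predicted count) rows."""
    ranked = Counter(tokens).most_common(top)
    if not ranked:
        return []
    top_freq = ranked[0][1]
    return [
        (rank, word, freq, top_freq / rank)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


# Invented tokens: 60, 30, and 20 occurrences.
tokens = ["the"] * 60 + ["of"] * 30 + ["and"] * 20
table = zipf_table(tokens)
```

With real text the observed and predicted columns will only roughly agree, but the table makes it obvious when a handful of words account for most of a category's count.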
• Resources:
– www.nd.edu/~mcdonald/Word_Lists.html
• Sentiment dictionaries
• Master dictionary
• Lists of stop words
• 1994-2011 10-X file summaries spreadsheet