Natural Language Processing and Textual Analysis in Finance and Accounting

Oct 24, 2013


Tim Loughran and Bill McDonald


University of Notre Dame


Overview



Data/Programs

Sample App

Stemming

Word Lists

Resources




… “‘Cause you know sometimes words have two meanings.”



What do we call this?



Textual analysis



Natural language processing



Sentiment analysis



Content analysis



Computational linguistics




Increased interest attributable to:



Bigger, faster computers



Availability of large quantities of text



New technologies derived from search engines




Examples of data sources:

EDGAR (1994-2011, 22.7 million filings)

WSJ News Archive (XML encapsulated, 2000 onward)

Audio transcripts (e.g., conference calls)

Web sites

Google searches

Twitter / Stocktwits



Programs



Black boxes (Wordstat, Lexalytics, Diction…)

Two critical components

Ability to download data and convert into string/character variable

Ability to parse large quantities of text



Most modern languages provide for both of these functions:



Perl


Python


SAS Text Miner


VB.net



Parsing large quantities of text: REGEX



Regular expressions example


Regex that attempts to identify sentences


(?<=^|[\.!\?]\s+|\n{2,})[A-Z][^\.!\?\n]{20,}(?=([\.!\?](\s|$)))
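The pattern above relies on a variable-width lookbehind, which some regex engines accept but Python's built-in `re` module rejects. The following is a minimal sketch of one possible Python adaptation, not the authors' code. It assumes that collapsing runs of spaces before matching is acceptable, and the name `find_sentences` is purely illustrative.

```python
import re

# Adapted pattern: the start-of-text anchor is moved outside the lookbehind
# and each lookbehind has a fixed width, as Python's re module requires.
SENTENCE_RE = re.compile(
    r"(?:^|(?<=[.!?]\s)|(?<=\n\n))"  # text start, after end punctuation, or after a blank line
    r"[A-Z][^.!?\n]{20,}"            # a capital letter, then 20+ non-terminating characters
    r"(?=[.!?](?:\s|$))"             # followed by end punctuation and whitespace or end of text
)

def find_sentences(text):
    """Return candidate sentences (without their final punctuation mark)."""
    text = re.sub(r"[ \t]+", " ", text)  # assumption: single spaces between sentences
    return SENTENCE_RE.findall(text)

sample = (
    "We may face losses in the coming fiscal year. "
    "Our revenue increased substantially over the prior period! "
    "Short one. The U.S. economy slowed during the second half."
)
sentences = find_sentences(sample)
```

In this sample the short fragment and the sentence containing "U.S." are both missed, which illustrates why abbreviations are a tripwire for sentence detection.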



Summary of technical literature:




Natural languages are messy and difficult to parse with computers.

Current Issues in Parsing Technology, Masaru Tomita, Kluwer Academic Publishing, 1991, p. 1





Tripwires: some examples

Parsing out 10-K segments



“May”




Disambiguation of abbreviations




Older files are less structured



Download 10-X

Download master files for each year/qtr

"ftp://ftp.sec.gov/edgar/full-index/YYYY/QTR#"
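As a rough illustration of the year/quarter loop, the sketch below only builds the index locations from the template on the slide and performs no network access. The function name and the `base` parameter are hypothetical, and the FTP host is taken from the slide as-is, so it may no longer be reachable.

```python
def master_index_urls(first_year, last_year,
                      base="ftp://ftp.sec.gov/edgar/full-index"):
    """Return one index location per year and quarter (YYYY/QTR# template)."""
    return [
        f"{base}/{year}/QTR{qtr}"
        for year in range(first_year, last_year + 1)
        for qtr in range(1, 5)  # quarters 1 through 4
    ]

# The 1994-2011 range matches the EDGAR coverage cited earlier in the deck.
urls = master_index_urls(1994, 2011)
```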






Identify target forms from master file




Download forms


http://www.sec.gov/Archives/<target file name>
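The two steps above might be sketched as follows. This assumes the master file is pipe-delimited with the fields CIK, company name, form type, date filed, and file name, a layout the slides do not spell out. The set of target form types and the sample rows are hypothetical.

```python
ARCHIVE_ROOT = "http://www.sec.gov/Archives/"          # prefix shown on the slide
TARGET_FORMS = {"10-K", "10-K405", "10-KSB", "10-Q"}   # hypothetical choice of 10-X variants

def target_form_urls(master_lines, target_forms=TARGET_FORMS):
    """Return download URLs for master-file rows whose form type is targeted."""
    urls = []
    for line in master_lines:
        fields = line.strip().split("|")
        if len(fields) != 5:
            continue  # skip descriptive header text and malformed rows
        cik, company, form_type, date_filed, filename = fields
        if form_type in target_forms:
            urls.append(ARCHIVE_ROOT + filename)
    return urls

# Made-up rows, for illustration only.
sample_master = [
    "Description: Master Index of EDGAR Dissemination Feed",
    "CIK|Company Name|Form Type|Date Filed|Filename",
    "0000000001|EXAMPLE CORP|10-K|2010-03-01|edgar/data/1/0000000001-10-000001.txt",
    "0000000002|SAMPLE INC|8-K|2010-03-02|edgar/data/2/0000000002-10-000002.txt",
]
form_urls = target_form_urls(sample_master)
```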





Iterate thru forms:



Clean up text file


Remove ASCII-encoded segments (e.g., graphics, PDFs)

Remove XBRL

Remove tables (<TABLE>.*?</TABLE>)

Remove all remaining markup tags (HTML)

Decode character entity references (e.g., &amp; = &)
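A minimal sketch of the last three clean-up steps, assuming Python's standard `re` and `html` modules. Removal of ASCII-encoded segments and XBRL is left out because the slides give no pattern for them. The function name and the whitespace normalisation at the end are additions for illustration.

```python
import html
import re

def clean_filing(raw):
    """Strip tables and markup, then decode entity references."""
    # Remove tables; [^>]* also tolerates attributes on the opening tag.
    text = re.sub(r"<TABLE[^>]*>.*?</TABLE>", " ", raw,
                  flags=re.IGNORECASE | re.DOTALL)
    # Remove all remaining markup tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode character entity references after tag removal, so that a decoded
    # "<" is never mistaken for the start of a tag.
    text = html.unescape(text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_filing(
    "<P>Revenue &amp; income rose.</P>\n"
    "<TABLE border=1><TR><TD>1</TD></TR></TABLE>\n"
    "<B>Done</B>"
)
```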



Iterate thru forms (continued):

Parse form into tokens

Regex: (?i:\b[-A-Z]{2,}\b)

Iterate thru each token to see if it matches an entry in a master dictionary

Tabulate words
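The tokenise, match, and tabulate steps can be sketched in a few lines of Python. The regex is the one on the slide, with case-insensitivity supplied as a flag. The function name, the tiny `master` set, and the sample sentence are placeholders rather than the actual master dictionary.

```python
import re
from collections import Counter

# Tokens are runs of two or more letters or hyphens, matched case-insensitively.
TOKEN_RE = re.compile(r"\b[-A-Z]{2,}\b", re.IGNORECASE)

def tabulate_words(text, master_dictionary):
    """Count tokens that appear in the master dictionary (upper-case entries)."""
    counts = Counter()
    for token in TOKEN_RE.findall(text):
        word = token.upper()
        if word in master_dictionary:
            counts[word] += 1
    return counts

# Hypothetical stand-in for a master dictionary.
master = {"THE", "COMPANY", "MAY", "INCUR", "LOSS", "LOSSES", "BE"}
counts = tabulate_words(
    "The Company may incur a loss; losses may be long-term.", master
)
```

Here the one-letter word "a" never becomes a token, and "long-term" is tokenised as a single hyphenated word but dropped because it is not in the placeholder dictionary.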






When creating word lists, should we list root words
(lexemes) and stem, or expand all root words to include
inflections?




Stemming



Programmatically collapse words down to root lexeme:

expensive, expensed, expensing => expense

Inflection

depreciate => depreciated/depreciates/depreciating/depreciation

Avoids morphologies like: blind / blinds; odd / odds; bitter / bitters
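To make the contrast concrete, here is a small sketch comparing a toy suffix stripper with an explicit inflection list. The stripper is a deliberately crude stand-in written for this example, not the Porter algorithm or any particular stemmer, and all names are illustrative.

```python
def naive_stem(word):
    """Toy suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Explicit inflection list keyed by root word (example from the slide).
INFLECTIONS = {
    "depreciate": ["depreciated", "depreciates", "depreciating", "depreciation"],
}

# Reverse lookup: every listed form, and the root itself, maps to the root.
FORM_TO_ROOT = {form: root for root, forms in INFLECTIONS.items() for form in forms}
FORM_TO_ROOT.update({root: root for root in INFLECTIONS})

def lookup_root(word):
    """Return the root for a listed form, or None if the word is not listed."""
    return FORM_TO_ROOT.get(word)
```

With the toy stripper, "odds" collapses to "odd" and "blinds" to "blind", merging words with different meanings. The explicit list maps only the forms someone chose to include, so "odds" stays unmatched.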




The text processing literature shows that stemming
does not in general improve performance. Essentially
stemming does not work for morphologically rich
languages.





Loughran/McDonald JF 2011 word lists

Create a dictionary of all words occurring in 10-Ks from 1994-2007.

Classify words occurring in 5% or more of the documents.
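The 5% screen is a document-frequency filter. A minimal sketch, assuming each filing has already been reduced to a list of tokens. The function name and the `min_share` parameter are illustrative.

```python
from collections import Counter

def words_to_classify(documents, min_share=0.05):
    """Return words that occur in at least min_share of the documents.

    documents: iterable of token lists, one list per filing.
    """
    doc_freq = Counter()
    n_docs = 0
    for tokens in documents:
        n_docs += 1
        doc_freq.update(set(tokens))  # count each word once per document
    return {word for word, n in doc_freq.items() if n / n_docs >= min_share}

# Tiny made-up corpus; a higher threshold is used so the effect is visible.
docs = [["loss", "revenue"], ["revenue"], ["revenue", "gain"], ["revenue"]]
frequent = words_to_classify(docs, min_share=0.5)
```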




Loughran/McDonald JF 2011 word lists

Fin-Neg: negative words (e.g., loss, bankruptcy, indebtedness, felony, misstated, discontinued, expire, unable). N = 2,349

Fin-Pos: positive words (e.g., beneficial, excellent, innovative). N = 354

Notice that in financial reporting it is unlikely that negative words will be negated (e.g., not terrible earnings), whereas positive words are easily qualified or compromised. Although you can easily account for simple negation, typical forms of negation are difficult to detect.
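One way to account for simple negation is to discount a positive word when a negator appears shortly before it. The sketch below is an assumption-laden illustration. The negator list and the three-word window are choices made for this example and are not stated on the slide, and the function name is hypothetical.

```python
# Assumed negator list and window size; neither is given on the slide.
NEGATORS = {"no", "not", "none", "neither", "never", "nobody"}

def count_positive(tokens, positive_words, window=3):
    """Return (plain, negated) counts of positive words.

    A positive word counts as negated when any of the `window` tokens
    immediately before it is a negator.
    """
    plain = negated = 0
    for i, token in enumerate(tokens):
        if token in positive_words:
            preceding = tokens[max(0, i - window):i]
            if any(t in NEGATORS for t in preceding):
                negated += 1
            else:
                plain += 1
    return plain, negated

positive = {"beneficial", "excellent", "innovative"}  # examples from the slide
tokens = "results were not excellent but the new product line is innovative".split()
plain, negated = count_positive(tokens, positive)
```

A rule like this catches "not excellent" but misses the indirect qualifications the slide warns about, such as a positive word undercut later in the sentence.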





Loughran/McDonald JF 2011 word lists

Fin-Unc: uncertainty words. Note here the emphasis is more so on uncertainty than risk (e.g., ambiguity, approximate, assume, risk). N = 291

Fin-Lit: litigious words (e.g., admission, breach, defendant, plaintiff, remand, testimony). N = 871




Loughran/McDonald JF 2011 word lists

Modal Strong: e.g., always, best, definitely, highest, lowest, will. N = 19

Modal Weak: e.g., could, depending, may, possibly, sometimes. N = 27



Use of word lists:


“Content analysis stands or falls by its categories.
Particular studies have been productive to the extent
that the categories were clearly formulated and well
adapted to the problem”


Berelson (1952, p. 92)





Zipf's law: the most frequent word will appear twice as often as the second most frequent word and three times as often as the third, etc. Much like the distribution of market cap in finance.



Always look at the words driving your counts
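A quick way to see which words drive a count is to rank them and report each word's share of the total next to the count a Zipf-style 1/rank rule would suggest. This is a sketch only. The function name and the sample counts are invented for illustration.

```python
from collections import Counter

def top_contributors(counts, n=5):
    """Return (rank, word, count, share_of_total, zipf_expected) rows.

    zipf_expected is the top word's count divided by the rank, i.e. the
    count implied by the "twice as often as the second" rule of thumb.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = Counter(counts).most_common(n)
    top_count = ranked[0][1]
    return [
        (rank, word, count, count / total, top_count / rank)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]

# Made-up counts for three negative words.
rows = top_contributors({"loss": 60, "unable": 30, "felony": 10})
```

If one or two words account for most of a category's total, it is worth checking that they carry the intended meaning in the filings being studied.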




Resources:

www.nd.edu/~mcdonald/Word_Lists.html

Sentiment dictionaries

Master dictionary

Lists of stop words

1994-2011 10-X file summaries spreadsheet
