isaca-kc.org

Download Report

Transcript isaca-kc.org

Data Mining
Steven Haenchen
CPA/ABV, CFE
Presentation to the Institute of Greater Kansas City Chapter of ISACA
Kansas City, Missouri
September 11, 2008
Today's objectives
Understand What Data Mining Is
Options for looking for the needle in the
haystack
Considerations for looking for evidence
when a crime is known or suspected
Understand Limits to Data Mining
ISASC-KC 9/11/2008
2
Objectives
Respond to
Request for Documents
Prepare
For
Case
Internal Investigation –
Proactive
ISASC-KC 9/11/2008
3
What Data Mining Is
Get Data
Extract Information
Find Great, Exciting
Results!
ISASC-KC 9/11/2008
4
Get Data
Data Warehouses are generally built by
someone other than the data miner.
Many platforms (mainframes, PCs,
notebooks, paper files, Web pages, cell
phones, fax machines, notepads)
Many languages (English, Spanish,
French, Legal, Medical, Numeric,
Abbreviations, Code, Image)
ISASC-KC 9/11/2008
5
Start Simple
Begin with data you can easily obtain
Begin with data you understand
Use simple techniques
Use creativity and enthusiasm with an
iterative process in mind!
Expand the data
Expand the mining
ISASC-KC 9/11/2008
6
Data You Can Obtain
Server data files (financial, human
resources, marketing, email, Web site)
PC/Notebook files (Spreadsheets, Letters,
Memos, email, pictures)
Choices:
– Use original data or obtain a copy
– Use original format or convert to “my” format
ISASC-KC 9/11/2008
7
Paradigm – The Wrapper (No)
Report 1
DATA 1
DATA 2
Routines 1
Routines 2
ISASC-KC 9/11/2008
Report 2
8
Paradigm – The Wrapper (Yes)
DATA 1
CommonFormatted
DATA
Data Conversion 1
Report 1
DATA 2
Data Conversion 2
Routines
ISASC-KC 9/11/2008
Report 2
9
Data You Can Obtain - Example
The Customer, Vender, and Employee
master lists obtained from Finance in
Excel
The Employee master list obtained from
Human Resources in Excel
The Stock Investor list obtained from
Treasury in Excel
List of Businesses in Yellow Pages
(ignored)
ISASC-KC 9/11/2008
10
Data You Understand
State the problem and
formulate the hypothesis
– Can’t do that well if you
don’t understand the data
Perceived
Opportunity
Preprocess the Data
– Benefits
– Costs
Pressure
ISASC-KC 9/11/2008
Rationalization
11
Data You Understand - Example
(State the problem and formulate the hypothesis)
Looking for fictitious payees
Fictitious payees have the same address of
another payee
(Preprocess the data)
All lists “normalized” to contain the same data
fields for each record
Address fields “normalized” to standard
abbreviations, spaces removed, upper case.
ISASC-KC 9/11/2008
12
Use simple techniques
Identify every address and count the times
it occurs
Begin looking at customers, vendors, etc.
with addresses repeated the most often
ISASC-KC 9/11/2008
13
Use creativity and enthusiasm with
an iterative process in mind!
123 Apple (123APPLE) did not match with
123 Apple Street (123APPLEST)
When matching addresses, only match the
shorter of the two strings!
ISASC-KC 9/11/2008
14
Use creativity and enthusiasm with
an iterative process in mind!
123 Ap Place (123APPL) matched with
123 Apple Street (123APPLEST)
Much more involved, but use a scoring
method of each string to other with longer
matches scored higher than shorter
strings, each to the other, averaged. Only
use score > xx%.
Understand you will have some “noise”.
ISASC-KC 9/11/2008
15
Read the following words:
Enugh
Lnuch
G00D F00D
N00B
Emial
ISASC-KC 9/11/2008
16
Expand the data
Obtain the amount paid each employee
(expenses) and vendor
Eliminate employees and vendors with
payments less than $xxxx
ISASC-KC 9/11/2008
17
Use creativity and enthusiasm with
an iterative process in mind!
What makes sense for the business?
– What states should employees reside in? Any
outside those states or ZIP codes?
– Same with customers and vendors
– Investigate large payments to Post Office
boxes (Is vendor known? In phone book?)
– Do employee/vendor’s expenses get charged
to the same account every time?
ISASC-KC 9/11/2008
18
Expand the mining - Example
Categorize employee/vendors by expense
type
Identify “relevant” types and exclude
remainder
For relevant types, determine mean and
standard deviations of each invoice
Analyze invoices that are “outliers”
ISASC-KC 9/11/2008
19
Today's objectives
Understand What Data Mining Is
Options for looking for the needle in the
haystack
Considerations for looking for evidence
when a crime is known or suspected
Understand Limits to Data Mining
ISASC-KC 9/11/2008
20
Refresher
Stated the problem and formulated the
hypothesis
Collected the data
Preprocessed the data
Kept it Simple in an area with Knowledge
READY for Creativity and
Enthusiasm!
ISASC-KC 9/11/2008
21
Preprocessing
Simple transformations
Cleansing and scrubbing, outliers
Integration
Aggregation
Normalizing
Data smoothing
Differences
Ratios
ISASC-KC 9/11/2008
22
Preprocessing (continued)
Benefits – Simpler data, faster processing
Costs – may not be as accurate as the
original data; may not have the ability to
expand testing with details no longer
available
ISASC-KC 9/11/2008
23
Analysis Techniques
Missing data
Outlier analysis
Statistical inference
Cluster analysis
Decision trees
Association rules
Artificial neural networks
Genetic algorithms
Fuzzy logic
Visualization
ISASC-KC 9/11/2008
24
Analysis Techniques
Missing data
Outlier analysis
Statistical inference
Cluster analysis
Decision trees
Association rules
Artificial neural networks
Genetic algorithms
Fuzzy logic (partial)
Visualization
ISASC-KC 9/11/2008
25
Missing Data
Ignore the records
Investigate the records
Estimate the data
Compute the probable data
WHAT MAKES SENSE?
Noise happens!
ISASC-KC 9/11/2008
26
Outlier Analysis
Ignore the records
Investigate the records
BOTH
WHAT MAKES SENSE?
Noise happens!
ISASC-KC 9/11/2008
27
Statistical Inference
Find your outliers
Find repetitive transactions
Find outliers after
adjustment for cycles
Find outliers after
adjustment for relationships
Benford’s Law
ISASC-KC 9/11/2008
28
Cluster Analysis
Group similar together
– Based on your knowledge
– Based on algorithms
Compare groups
Compare group ratios
– e.g. office supplies per
employee by office; toner
usage per office
ISASC-KC 9/11/2008
29
Decision Trees
Advanced cluster
analysis
ISASC-KC 9/11/2008
30
Association Rules
Market-Basket Analysis
– Airfare with motel?
– Per diem with motel?
– Hardware with installation charge?
Algorithm Apriori
– Airfare, Motel, Cell Phone relationships
Web Mining
– Hits
– Paths (to, from)
ISASC-KC 9/11/2008
31
Artificial Neural Networks
Advanced topic
Letting the
computer do
your work
ISASC-KC 9/11/2008
32
Genetic Algorithms
Advanced topic
Mathematically letting the computer
“evolve” your prediction
You investigate variances from prediction
ISASC-KC 9/11/2008
33
Fuzzy Logic
Text Mining
Canonical forms
Keyword searching
Keyword relationship
Context
– The boy hit it over the fence.
Occurrence
ISASC-KC 9/11/2008
34
Fuzzy Logic
Text Mining
Canonical forms
Keyword searching
Keyword relationship
Context
– The boy hit it over the fence.
Occurrence
ISASC-KC 9/11/2008
35
Visualization
Series
Descri
ption
Market yield on U.S.
Treasury
securities at 1month
constant
maturity,
quoted on
investment
basis
Market yield on U.S.
Treasury
securities at 3month
constant
maturity,
quoted on
investment
basis
Market yield on U.S.
Treasury
securities at 6month
constant
maturity,
quoted on
investment
basis
Market yield on U.S.
Treasury
securities at 1year constant
maturity,
quoted on
investment
basis
Market yield on U.S.
Treasury
securities at 2year constant
maturity,
quoted on
investment
basis
Market yield on U.S.
Treasury
securities at 3year constant
maturity,
quoted on
investment
basis
Unit:
Percent:_Per_Year
Percent:_Per_Year
Percent:_Per_Year
Percent:_Per_Year
Percent:_Per_Year
Percent:_Per_Year
Multiplier:
1
1
1
1
1
1
Currency:
NA
NA
NA
NA
NA
NA
Unique
Identifi
er:
H15/H15/RIFLGFCM0
1_N.B
H15/H15/RIFLGFCM0
3_N.B
H15/H15/RIFLGFCM0
6_N.B
H15/H15/RIFLGFCY0
1_N.B
H15/H15/RIFLGFCY0
2_N.B
H15/H15/RIFLGFCY0
3_N.B
Time Period
RIFLGFCM01_N.B
RIFLGFCM03_N.B
RIFLGFCM06_N.B
RIFLGFCY01_N.B
RIFLGFCY02_N.B
RIFLGFCY03_N.B
10/20/2003
0.900
0.950
1.040
1.320
1.880
2.420
10/21/2003
0.920
0.930
1.040
1.320
1.890
2.410
10/22/2003
0.920
0.960
1.030
1.280
1.800
2.330
10/23/2003
0.920
0.960
1.030
1.300
1.840
2.370
10/24/2003
0.920
0.950
1.030
1.270
1.760
2.260
10/27/2003
0.950
0.980
1.050
1.310
1.830
2.350
10/28/2003
0.980
0.960
1.030
1.250
1.710
2.230
10/29/2003
0.980
0.960
1.040
1.290
1.790
2.320
10/30/2003
0.970
0.960
1.040
1.320
1.860
2.390
ISASC-KC 9/11/2008
36
1 month
1 year
ISASC-KC 9/11/2008
11/12/2007
10/29/2007
10/15/2007
10/1/2007
9/17/2007
9/3/2007
8/20/2007
8/6/2007
7/23/2007
7/9/2007
6/25/2007
6/11/2007
Visualization
Treasury Yields
6
5
4
3
2
1
0
30 year
37
Visualization
One Month Treasury Yields
6.000
5.000
4.000
3.000
2.000
1.000
6/11/2007 7/11/2007
8/11/2007
9/11/2007 10/11/2007 11/11/2007
ISASC-KC 9/11/2008
38
Visualization
Bach Agen
Net Profit From Business Per Schedule C
80,000
70,000
60,000
50,000
40,000
30,000
20,000
10,000
2000
2001
2002
ISASC-KC 9/11/2008
2003
2004
39
Visualization
11/16/2007
1 month
1 year
30 year
ISASC-KC 9/11/2008
40
Today's objectives
Understand What Data Mining Is
Options for looking for the needle in the
haystack
Considerations for looking for evidence
when a crime is known or suspected
Understand Limits to Data Mining
ISASC-KC 9/11/2008
41
Refresher
Stated the problem and formulated the
hypothesis
READY for Creativity and
Enthusiasm!
ISASC-KC 9/11/2008
42
Gather the Data
- Determine if indeed a crime has been
committed
- Determine status of crime (ongoing?)
- Review organization security and audit
policy
- Determine need for law enforcement
assistance
ISASC-KC 9/11/2008
43
Gather the Data (continued)
- Handling of Evidence
- Relevant
- Supported by a foundation for its introduction
into court
- Legally obtained
- Properly identified
- Properly preserved
- Integrity of the Evidence
ISASC-KC 9/11/2008
44
Gather the Data (continued)
Depends on the allegations
–
–
–
–
–
–
–
–
–
–
System data
Access logs (servers, routers, physical security)
Hard drive images (computers, printers)
Cell phone storage
USB drives
Email server
Backup tapes
CDs, DVDs, diskettes
Fax logs, fax memory
Surveillance video
ISASC-KC 9/11/2008
45
Assumptions
Financial data is examined
Losses calculated
– Usually an iterative process as you learn
more
– Make sure calculations are reasonable
Looking for additional evidence – primarily
on subject’s imaged drive(s)
ISASC-KC 9/11/2008
46
Preprocess Data
Heterogeneous data
Recovered files that had been deleted
Partially recovered files that had been
deleted
Deleted files found elsewhere: backup
tapes or backups to server drives,
diskettes, USB drives, etc.
ISASC-KC 9/11/2008
47
Data Mining
Identify concepts specific to the case
– Names
– Accounts
– Amounts
– Etc.
Search data
– Physically
– Logically
ISASC-KC 9/11/2008
48
Data Mining
Iterative process
READY for Creativity and
Enthusiasm!
ISASC-KC 9/11/2008
49
Fuzzy Logic
Text Mining
Canonical forms
Keyword searching
Keyword relationship
Context
– The boy hit it over the fence.
Occurrence
ISASC-KC 9/11/2008
50
Special Consideration
Hidden files
Files with false extensions
– Renamed
– Doc in Xls
– CRC validation tables
Encrypted files
Messages embedded in pictures
ISASC-KC 9/11/2008
51
Encrypted File Cracking
Characters in Password: 40
Try 60 / minute
0
1
0
1
40
1
2
1,600
27
3
64,000
1,067
4
2,560,000
42,667
5
102,400,000
1,706,667
6
4,096,000,000
68,266,667
7
163,840,000,000
2,730,666,667
8
6,553,600,000,000
109,226,666,667
ISASC-KC 9/11/2008
52
Encrypted File Cracking (con’t)
Bit Encryption
32
4,294,967,296
64
18,446,744,073,709,600,000
96
79,228,162,514,264,300,000,000,000,000
128
340,282,366,920,938,000,000,000,000,000,000,000,000
256
115,792,089,237,316,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
ISASC-KC 9/11/2008
53
Special Consideration
Infected files
Passwords
Pulling the plug
– Automatic updates
– Remote access
– Email updates
The interview
ISASC-KC 9/11/2008
54
Data Limitations
Must understand computer hardware and
systems
– Company email in email server if deleted by
subject, but Internet access to personal email
is not
– What information is hidden in the registry?
– What information is hidden in temp files and
cookies?
– What are retention processes in place (not
policies)
ISASC-KC 9/11/2008
55
Today's objectives
Understand What Data Mining Is
Options for looking for the needle in the
haystack
Considerations for looking for evidence
when a crime is known or suspected
Understand Limits to Data Mining
ISASC-KC 9/11/2008
56
Available Data
Yours – everything
Theirs – reasonably accessible;
reasonably expected to aid in
understanding the issues of the case
Backup tapes?
– Cost sharing
ISASC-KC 9/11/2008
57
Understand Limits to
Data Mining
It is tempting to develop a theory to fit an oddity found
in the data.
One can find evidence to support any preconception if
you let the computer churn long enough.
A finding makes more sense if there is a plausible
theory for it. But a beguiling story can disguise
weaknesses in the data.
The more factors or features in a data set the
computer considers, the more likely the program will
find a relationship, valid or not.
Peter Coy, Business Week
ISASC-KC 9/11/2008
58
Questions???
Steven Haenchen
CPA/ABV, CFE
MCSD, MCDBA, MCSA, MCSE
Haenchen Valuation Services, Inc.
Communicating Complex Messages Clearly
10436 Oakmont
Overland Park, KS 66215
Web: www.haenchenvaluations.com
Email: [email protected] or
[email protected]
Work: (913) 825-5235
Cell: (913) 488-7187
ISASC-KC 9/11/2008
59