Theoretical database


What is the Proof
necessary for
Truth
(whatever that is)
Presentation to
FRIAM Group's Applied Complexity Lecture Series
Santa Fe, NM USA
24 August 2005
Tom Johnson
Managing Director
Institute for Analytic Journalism
Santa Fe, New Mexico
What is the IAJ?
 Analysis using a variety of tools and methods from multiple disciplines
 Understand multiple phenomena
 Communicate results to multiple audiences in a variety of ways.
Cornerstones of IAJ
 General Systems Theory
 Statistics
 Visual statistics/infographics
 Simulation modeling
Problem of the day
So what's the problem of the day for analytic journalists?
So what’s the problem?
 Ever-increasing -- beyond estimate -- number of public-records databases
 DBs increasingly used for a broad spectrum of decision-making
 Assumption that data, as given, is correct. Anecdotal evidence suggests that's not so.
Examples of bad data
 St. Louis Post-Dispatch, 1997-98: 350 S. Ill. sex offenders
   “…found that hundreds of convicted sex offenders don't actually live at the addresses listed on the sex offender registries for St. Louis, St. Louis County and the Metro East area.”
   Every record carried a 30-50% probability of error
 1999 - City of St. Louis: “About 700 Sex Offenders Do Not Appear To Live At The Addresses Listed On A St. Louis Registry.”
 Boston, 2000
   BPD - 6 detectives assigned to cleaning up the sex-offender DB
Examples of bad data
2000 - Florida voter registration rolls
 State hires DBT Online/ChoicePoint to “purge rolls.”
 “Some [counties] found the list too unreliable and didn't use it at all. … Counties that did their best to vet the file discovered a high level of errors, with as many as 15 percent of names incorrectly identified as felons.”
Source: Palast, Greg. http://www.gregpalast.com/detail.cfm?artid=55
More bad data
2004 - Dallas Morning News
 “…The state criminal convictions database is so riddled with holes that law enforcement officials say public safety is at risk.
 “…the state has only 69 percent of the complete criminal history records for 2002. In 2001, the state had only 60 percent. Hundreds of thousands of records are missing.”
Surely there is a simple solution….
 Is there a methodology to measure, to know -- or to anticipate -- the quality, i.e. veracity, of a given database?
 What are the best -- and most objective -- ways to “X-ray” a DB to note internal problems or potential problems?
 Hoping for answers from statisticians, data miners, forensic accountants, bioinformatics, genomics, physics, etc., 'cause journalists don't have much of a clue.
Approaches to database analysis
 Theoretical/statistical
   What can we know about a database only based on its size and whether a record's field/cell is occupied?
   Are there cheap, fast and good templates/tools to X-ray the DB?
 Contextual/statistical
   How would knowing the context/meaning of data -- or lack of data -- in cells change our answers to the previous questions?
   Are there methodologies to help us weigh the importance of a variable relative to the veracity of a record? E.g., is “name” more important than SS#?
Approaches to database analysis
 Theoretical/statistical
   What can we know about a database -- and its potential validity -- only based on its size and whether a record's field/cell is occupied?
   Are there cheap, fast and good templates/tools to X-ray the DB?
 Contextual/statistical
   How would knowing the context/meaning of data -- or lack of data -- in cells change our answers to the previous questions?
   Are there methodologies to help us weigh the importance of a variable relative to the veracity of a record? E.g., is “name” more important than SS#?
 Both/all approaches vary with the question(s) being asked.
Theoretical database structure
 DB =
 Metadata
 Coding sheet
 Fields/elements
 Field tag (name)
 Character limited/open field
 Numeric/alpha
 End-of-Record character
 Number of records
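
To make the outline above concrete, here is a minimal Python sketch of one way the theoretical DB structure might be represented. The class and attribute names are illustrative assumptions, not terminology from the talk.

```python
# A minimal sketch of the theoretical DB structure described above.
# Class and attribute names are illustrative assumptions, not part of the talk.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Field:
    tag: str                    # field tag (name)
    max_chars: Optional[int]    # character limit, or None for an open field
    numeric: bool               # True for numeric, False for alpha

@dataclass
class Database:
    metadata: dict              # free-form metadata / coding-sheet notes
    fields: list[Field]         # ordered field definitions
    records: list[list]         # one value (or None) per field, per record
    end_of_record: str = "\n"   # end-of-record character

    @property
    def n_records(self) -> int:
        return len(self.records)
```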
Theoretical database
 Assume a matrix: 100 records, 10 fields
 Assume a given -- and occupied -- index field (serial record number)
 Does a record's LCI (Loaded Cell Index), from 10% to 100%, constitute “proof” of anything?
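
A minimal sketch of how the LCI could be computed, assuming "occupied" simply means a non-empty cell. The toy 100 x 10 matrix and the helper names are invented for illustration.

```python
# Sketch: a Loaded Cell Index (LCI) for each record of a 100-record x 10-field
# matrix. "Occupied" here just means non-empty; the matrix is a toy example.
import random

N_RECORDS, N_FIELDS = 100, 10

def lci(record) -> float:
    """Fraction of a record's cells that are occupied (0.1 to 1.0 here,
    since field 0, the serial index, is always occupied)."""
    occupied = sum(1 for cell in record if cell not in (None, ""))
    return occupied / len(record)

# Toy matrix: field 0 is the always-occupied serial index; the other nine
# fields are randomly left empty to mimic a patchy public-records DB.
matrix = [[i] + [random.choice([None, "x"]) for _ in range(N_FIELDS - 1)]
          for i in range(N_RECORDS)]

for rec in matrix[:3]:
    print(rec[0], f"LCI = {lci(rec):.0%}")
```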
Theoretical database
 LAs (logical adjacencies) are not necessarily physically adjacent in the record layout.
 As in a genome, data present -- or not present -- in one field can trigger the presence or lack of data in another.
[Diagram: a single record laid out as Fld #1 through Fld #10]
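
A rough sketch of how such trigger relationships might be written down and checked mechanically. The field indices and rule pairs below are hypothetical examples, not drawn from any real registry layout.

```python
# Sketch: expressing logical adjacencies (LAs) as "if field A is occupied,
# field B should also be occupied" rules and listing violations per record.
# The field indices below are hypothetical, purely for illustration.

def occupied(cell) -> bool:
    return cell not in (None, "")

# (trigger_field_index, dependent_field_index)
LA_RULES = [(2, 3),   # e.g. if a street address is present, a ZIP should be too
            (5, 6)]   # e.g. if a conviction date is present, an offense should be too

def la_violations(record, rules=LA_RULES):
    """Return the LA rules this record violates."""
    return [(a, b) for a, b in rules
            if occupied(record[a]) and not occupied(record[b])]

# A record with a street address (field 2) but no ZIP (field 3):
print(la_violations(["17", "Smith, R.", "123 Main St", None, None,
                     "2002-05-01", "burglary", None, None, None]))
# -> [(2, 3)]
```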
Assumptions???
 The greater a record's LCI, the greater the potential (probability?) that the record has enough “Proof” to achieve “True Data” status.
 Do we think this is true? Probably, even when we have no idea what the data is/means. Still, “proof” seems to occupy a density-of-data continuum reaching for some critical mass.
 How do we measure that criticality?
(When software achieves critical mass, it can never be fixed; it can only be discarded and rewritten. Same for DBs? How do programmers measure that critical mass?)
Assumptions???
 Probably, even when we have no idea what the data is/means. Still, “proof” seems to occupy a continuum reaching for some critical mass. How do we measure that criticality?
 When the focus is on an individual record, we must have context/meaning/definition for the variables/elements; otherwise it is a nonsensical array of possibly random numbers, and there is no opportunity for Proof of anything, much less Truth.
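
One speculative way to put a number on "critical mass": look at how LCI is distributed across the whole DB and draw a provisional cutoff. The lower-quartile choice below is arbitrary, purely for illustration.

```python
# Sketch: treat the lower quartile of per-record LCI scores as a provisional
# "critical mass" line. The quartile choice is arbitrary; lci() mirrors the
# earlier sketch.
from statistics import quantiles

def lci(record) -> float:
    return sum(1 for cell in record if cell not in (None, "")) / len(record)

def lci_cutoff(matrix) -> float:
    """Lower quartile of per-record LCI across the whole DB (needs >= 2 records)."""
    return quantiles([lci(rec) for rec in matrix], n=4)[0]

# Records below the cutoff become candidates for "not enough proof":
# weak = [rec for rec in matrix if lci(rec) < lci_cutoff(matrix)]
```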
Search for patterns (in 100k+ records)
Are there patterns? How can I quickly identify them?
Are there consistencies?
Do populated cells suggest anything about a hierarchy of importance?
Are there “Logical Adjacencies” (LAs)?
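
One cheap, fast way to surface such patterns is to reduce every record to its occupancy pattern (which cells are filled) and count how often each pattern recurs. The sketch assumes nothing about what the fields mean.

```python
# Sketch: profile a big DB by its occupancy patterns alone.
from collections import Counter

def occupancy_pattern(record) -> str:
    """Reduce a record to a string of 1s (occupied cells) and 0s (empty cells)."""
    return "".join("1" if cell not in (None, "") else "0" for cell in record)

def pattern_profile(matrix, top=10):
    """The most common occupancy patterns; a few dominant patterns often say
    more about the data-entry process than about the world."""
    return Counter(occupancy_pattern(rec) for rec in matrix).most_common(top)

# for pattern, count in pattern_profile(matrix):   # matrix from the earlier sketch
#     print(pattern, count)
```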
Demographics of a database
Logical Adjacencies
Patterns in LAs?
Is there a hierarchy of import/value of LAs?
Are there various thresholds of LAs present? I.e., is it better Proof to have four LAs than three?
Maybe, maybe not. So how do we create rules to weigh (a) a cell and (b) LAs?
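
A sketch of what such rules could look like as a single weighted score per record. Every weight and rule pair here is invented; choosing them well is exactly the open question this slide raises.

```python
# Sketch: one possible rule set for weighing (a) individual cells and
# (b) satisfied logical adjacencies (LAs) into a single record score.
# All weights and rule pairs are invented for illustration only.

CELL_WEIGHTS = [3, 2, 2, 1, 1, 1, 1, 1, 1, 1]   # e.g. index and name fields count more
LA_RULES = [(2, 3), (5, 6)]                      # hypothetical (trigger, dependent) pairs
LA_BONUS = 2                                     # extra credit per satisfied LA rule

def occupied(cell) -> bool:
    return cell not in (None, "")

def record_score(record) -> int:
    """Weighted count of occupied cells, plus a bonus for each LA rule
    whose trigger and dependent fields are both occupied."""
    base = sum(w for w, cell in zip(CELL_WEIGHTS, record) if occupied(cell))
    satisfied = sum(1 for a, b in LA_RULES
                    if occupied(record[a]) and occupied(record[b]))
    return base + LA_BONUS * satisfied

# e.g. record_score(["1001", "Smith, R.", None, "63101", None,
#                    "2002-05-01", "burglary", None, None, None])
```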
Demographics of a database
Logical Adjacencies
If a record does not meet some standard of LA-ness, do we discard it from the analysis because it lacks the potential for Proof? (The discarded-outlier problem.)
Do patterns of populated cells suggest anything about a hierarchy of importance, or only about the data-input process?
Are some records “better” records?
Any “truth” to be found?
Tools to quickly, easily see these answers?
Working with the real stuff
 Fundrace 2004 Neighbor Search
http://www.fundrace.org/neighbors.php
 Political Money Line
http://www.fecinfo.com/cgi-win/indexhtml.exe?MBF=zipcode
Missing data problem. Significant?
Realities of DBs
The NAME problem
Can this be “cleaned”
automatically?
“Dirty” campaign contributions
Same person?
“Dirty” campaign contributions
Same person? Same job?
How do we easily spot these problems in a large DB?
How do we rectify them in a large DB?
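
A crude first pass at spotting probable duplicates automatically: a string-similarity test from the Python standard library. The 0.85 threshold is an assumption, and real cleaning would also compare address, employer, occupation, etc.

```python
# Sketch: flag probable duplicate contributor names with a simple
# string-similarity ratio. The threshold is an illustrative assumption.
from difflib import SequenceMatcher

def probably_same(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    a, b = name_a.lower().strip(), name_b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(probably_same("Smith, Robert J.", "Smith, Robert"))   # True: likely the same donor
print(probably_same("Smith, Robert", "Jones, Robert"))      # False: different surname
```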
Wrong data
Huh?
Is there any way to vet this cell's data?
How many triangulated DBs are necessary to meet some “proof” index?
Does this field have enough importance (in the hierarchy of importance?) to be worth X time/money to verify?
Is there a better way than drawing a sample and tracking down the original data?
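
One possible shape for such a "proof" index: the share of independent databases whose value for the cell agrees after trivial normalization. The sources, the cleanup and the example values in this sketch are all assumptions.

```python
# Sketch: a crude per-cell "proof index" -- the fraction of other,
# independent databases that agree with this cell's value.
# The normalisation and example values are illustrative assumptions.

def proof_index(value: str, other_sources: list[str]) -> float:
    """Fraction of other sources whose value matches after trivial cleanup."""
    norm = lambda s: " ".join(s.lower().split())
    if not other_sources:
        return 0.0
    agree = sum(1 for v in other_sources if norm(v) == norm(value))
    return agree / len(other_sources)

# e.g. an "employer" cell checked against two other public DBs:
print(proof_index("Acme Corp", ["ACME  Corp", "Acme Corporation"]))  # -> 0.5
```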
Ver 1.0 workshop
 April 9-12, 2006. Seeking suggestions: a workshop on public database verification for journalists and social scientists.
 “Ver” as in “verification” and “verify” and, from the Spanish verb ver: “to see; to look into; to examine.”
Ver 1.0 Objectives
1. Developing new statistical methods for DB verification;
2. Building a flowchart/decision tree for the DB verification process;
3. Developing rules for creation of a hierarchy of importance/significance of record elements, i.e. variables, in common databases.
 Automated
 Affordable
 Easily understood
 With error trapping
 Generic or easily adapted to various DBs
 Easy to learn/apply