Data Quality Slides
Barna Saha, AT&T Research Laboratory
Joint work with:
Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava
Data Quality
Data Cleaning
IRS vs. Federal Mathematician
“You owe us $10,000 plus accrued interest
in taxes for last year. You earned $36,000
last year but only had $1 withheld from
your paycheck for Federal taxes.”
The federal government agency had only allocated enough storage on the computer to handle withholding amounts of $9,999.99 or less. The amount withheld was $10,001.00; the last $1 made the crucial difference.
“How could I work the entire year and only have $1 withheld? I do not have time to waste on this foolishness. Goodbye!”
The Risk of Massive ID Fraud
In May 2004, Ryan Pirozzi of Edina, Minnesota, opened his mailbox and found more than a dozen bank statements inside.
None of the accounts were his!
Because of a data entry error made by a
clerk at the processing center of Wachovia
Corp, a large bank headquartered in the
Southeastern USA, over the course of 9
months, Pirozzi received the financial
statements of 73 strangers. Their names, SSNs, and bank account numbers constitute an identity thief's dream!
The Risk of Massive ID Fraud
Pirozzi began receiving completed
1099 tax forms belonging to many of
these people.
Finally, one day in January 2005, a strange thing happened. Mr. Pirozzi went to his mailbox and discovered an envelope from Wachovia that contained his own completed 1099 tax form. That was the first piece of correspondence he received from the bank that actually belonged to him.
Source of these stories
800 houses in Montgomery County, Maryland, were put on the auction block in 2005 due to mistakes in the tax payment data of Washington Mutual Mortgage.
Data Quality Tools
Real world data is often dirty:
Inconsistent
Inaccurate
Incomplete
Stale
Enterprises typically find data error rates of approximately 1%-5%; for some companies it is above 30%.
Dirty data costs US businesses 600 billion dollars annually.
Data cleaning accounts for 30%-80% of development time and budget in most data warehouse projects.
Quality Tools:
• Detect and repair errors
• Differentiate between dirty and clean data
A Systematic Approach to Improve Data
Quality
Impose integrity constraints
Semantic rules for data
Errors and inconsistencies in the data emerge as violations of the constraints
Integrity Constraints: Functional
Dependency [Codd, 1972]
Name | Type      | State      | Price | Vat
CH1  | Clothing  | Maryland   | $50   | $3
BK45 | Book      | New Jersey | $120  | $15
FN30 | Furniture | Washington | $100  | $0
CH1  | Clothing  | Maryland   | $100  | $6
BK66 | Book      | Washington | $80   | $10

Functional Dependency: (Name, Type, State) → (Price, Vat)
New Integrity Constraints for Data Quality
Name | Type      | State      | Price | Vat
CH1  | Clothing  | Maryland   | $50   | $3
BK45 | Book      | New Jersey | $120  | $15
FN30 | Furniture | Washington | $100  | $0
CH1  | Clothing  | Maryland   | $100  | $6
BK66 | Book      | Washington | $80   | $10

Functional Dependency: (Name, Type, State) → (Price, Vat)
1. If (Type = Book) then the above FD holds.
2. If (State = Washington) then (Vat = 0).
Conditional Functional Dependency
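To make these rules concrete, here is a minimal Python sketch (not from the slides; the helper names are illustrative, and the last two rows follow the table reconstruction above) that flags violations of the plain FD and of the conditional rule for Washington:

```python
# Minimal sketch: check the example table against the FD and the Washington CFD.
from collections import defaultdict

rows = [
    # (Name, Type, State, Price, Vat)
    ("CH1",  "Clothing",  "Maryland",   50,  3),
    ("BK45", "Book",      "New Jersey", 120, 15),
    ("FN30", "Furniture", "Washington", 100, 0),
    ("CH1",  "Clothing",  "Maryland",   100, 6),   # same key as row 1, different (Price, Vat)
    ("BK66", "Book",      "Washington", 80,  10),  # Washington row with nonzero Vat
]

def fd_violations(rows):
    """Groups of rows that share (Name, Type, State) but disagree on (Price, Vat)."""
    groups = defaultdict(set)
    for name, typ, state, price, vat in rows:
        groups[(name, typ, state)].add((price, vat))
    return {k: v for k, v in groups.items() if len(v) > 1}

def cfd_violations(rows):
    """Rows where State = Washington but Vat != 0 (conditional rule 2)."""
    return [r for r in rows if r[2] == "Washington" and r[4] != 0]

print(fd_violations(rows))   # rows 1 and 4 disagree on (Price, Vat)
print(cfd_violations(rows))  # [('BK66', 'Book', 'Washington', 80, 10)]
```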
New Integrity Constraints for Data Quality
Conditional Functional Dependency
Sequential Dependency
Aggregation Dependency: Discovering Conservation Rules
Motivations
Infrastructure networks are continuously monitored
over time.
Example:
Highway monitoring systems use road sensors to identify traffic
buildup and suggest alternate routes.
Routers in an IP telecommunications network maintain counters to
keep track of the traffic flowing through them.
Power meters measure electricity flowing through different systems.
These measurements are monitored to troubleshoot customer problems, check network performance, and understand provisioning requirements.
Data Quality Problems in Infrastructure
Networks
Missing or delayed data, especially over large intervals of time, can be detrimental to any attempt to ensure a reliable and well-functioning network.
IP network monitoring typically uses the UDP protocol, so
measurements can be delayed (or even lost) when there is high
network congestion.
Sometimes a new router interface is activated and traffic is flowing
through it, but this interface is not known to the monitoring system; in
this case, there is missing data that is hard to detect.
Data Quality Problems in Infrastructure
Networks
Missing or delayed data, especially over large intervals of time, can be detrimental to any attempt to ensure a reliable and well-functioning network.
Monitoring road networks in the presence of sensor failures or
unmonitored road segments
Monitoring electricity networks in the presence of hacked power
meters or if someone is diverting (stealing) electricity, etc.
Detecting data quality issues is difficult when
monitoring large and complex networks
Approach
Impose integrity constraints to capture the semantics
of data
Efficiently provide a concise summary of the data where the rules hold/fail.
Integrity Constraint: Conservation Rules
In many infrastructure networks, there exists a
conservation law between related quantities in
monitored data.
Kirchhoff’s Node Law of Conservation of Electricity: The current flowing into
a node in an electric circuit equals the
current flowing out of the node.
Road Network Monitoring: Every car
that enters an intersection must exit.
Telecommunication Networks: Every
packet entering a router must exit.
And many more…
Conservation Rules
One-to-One Matching
Match each incoming event to an outgoing event, and report the average delay/loss as a measure of violation of the conservation laws.
Infeasible with respect to storage and processing costs to collect individual packets / monitor individual events.
Monitoring systems provide aggregate counts at regular
intervals.
Conservation Rules
(Figure: incoming and outgoing traffic time series at a router, plotted over time; we expect the two time series to be identical.)
Matching incoming and outgoing aggregated traffic at every time point may
not reveal true data quality issues.
Clock synchronization error
Queuing delay
Compare aggregated total over time windows.
Conservation Rules: Confidence of an
Interval
(Figure: incoming traffic at a router, IN, with counts b1, ..., b5 = 10, 8, 6, 4, 6, and outgoing traffic at a router, OUT, with counts a1, ..., a5 = 10, 8, 6, 4, 6; each totals 34.)
Confidence of an interval = ∑ ai / ∑ bi
Ignores the duration of violation.
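As a small illustration (my own sketch, not from the slides), this ratio-based confidence can be computed directly from the aggregate counts:

```python
# Minimal sketch of the ratio-based confidence of an interval:
# total outgoing count divided by total incoming count.
def ratio_confidence(in_counts, out_counts):
    """Confidence = sum(a_i) / sum(b_i); it ignores *when* the traffic left."""
    total_in = sum(in_counts)
    return sum(out_counts) / total_in if total_in else 1.0

b = [10, 8, 6, 4, 6]   # incoming counts b1..b5 (IN), total 34
a = [10, 8, 6, 4, 6]   # outgoing counts a1..a5 (OUT), total 34
print(ratio_confidence(b, a))  # 1.0, even if every packet actually left late
```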
Conservation Rules: Confidence of an
interval
(Figure: rightward matching between IN (b1, ..., b5 = 10, 8, 6, 4, 6) and OUT (a1, ..., a5 = 10, 8, 6, 4, 6) at a router; one alignment gives Confidence = 1, the other gives Confidence = 0.)
Earth Mover Distance
A measure of distance between two distributions over
some region D.
Interpret the distributions as two different ways of
piling up a certain amount of dirt over the region D.
EMD is the minimum cost of turning one pile into the
other.
Cost is assumed to be the amount of dirt moved times the distance by which it is moved.
Also known as Wasserstein distance.
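For one-dimensional distributions such as these time series, the EMD has a particularly simple form: assuming equal totals and unit-spaced time slots, it equals the sum of absolute differences of the cumulative counts. A minimal sketch (my own illustration):

```python
# 1-D earth mover (Wasserstein) distance between two histograms with equal
# total mass: sum of absolute differences of the running cumulative totals.
from itertools import accumulate

def emd_1d(p, q):
    assert sum(p) == sum(q), "this shortcut assumes equal total mass"
    return sum(abs(cp - cq) for cp, cq in zip(accumulate(p), accumulate(q)))

print(emd_1d([10, 8, 6, 4, 6], [10, 8, 6, 4, 6]))   # 0: identical piles
print(emd_1d([10, 8, 6, 4, 6], [0, 10, 8, 6, 10]))  # 28: some traffic leaves late
```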
Rightward Matching (RM): A special case of
Earth Mover Distance (EMD)
(Figure: incoming traffic IN = 10, 8, 6, 4, 6 and outgoing traffic OUT = 10, 8, 6, 4, 6 at a router. When OUT aligns with IN: Confidence = 1, EMD = 0. When OUT is maximally delayed: Confidence = 0, EMD = 114 = the maximum EMD possible.)
Only right shifting (mass may only be moved forward in time).
A simple greedy algorithm works.
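A sketch of one natural greedy rightward matching (my own illustration, under the assumption that outgoing traffic at time t may only be matched to incoming traffic at time ≤ t, FIFO-style; the algorithm in the talk may differ in details):

```python
from collections import deque

def rightward_matching_cost(in_counts, out_counts):
    """Total delay of matched traffic, plus any outgoing mass with no earlier
    incoming mass to explain it."""
    pending = deque()              # [arrival_time, remaining_mass] of unmatched IN
    cost, unmatched_out = 0, 0
    for t, (b, a) in enumerate(zip(in_counts, out_counts)):
        if b:
            pending.append([t, b])
        while a:
            if not pending:        # OUT that cannot be right-shifted from any IN
                unmatched_out += a
                break
            t_in, m = pending[0]
            matched = min(m, a)
            cost += matched * (t - t_in)   # each matched unit pays its delay
            pending[0][1] -= matched
            a -= matched
            if pending[0][1] == 0:
                pending.popleft()
    return cost, unmatched_out

print(rightward_matching_cost([10, 8, 6, 4, 6], [10, 8, 6, 4, 6]))   # (0, 0)
print(rightward_matching_cost([10, 8, 6, 4, 6], [0, 10, 8, 6, 10]))  # (28, 0)
```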
RM: Interpretation by area over
cumulative counts
(Figure: cumulative incoming count CUM-IN and cumulative outgoing count CUM-OUT plotted over time.)
Confidence of an interval I = area(CUM-OUT(I)) / area(CUM-IN(I))
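A minimal sketch of this area interpretation (my own illustration, with the simplifying assumptions that cumulative counts restart at the beginning of the interval and time slots have unit width):

```python
from itertools import accumulate

def area_confidence(in_counts, out_counts, start, end):
    """area(CUM-OUT(I)) / area(CUM-IN(I)) over the slots start..end, inclusive."""
    cum_in = list(accumulate(in_counts[start:end + 1]))
    cum_out = list(accumulate(out_counts[start:end + 1]))
    area_in, area_out = sum(cum_in), sum(cum_out)
    return area_out / area_in if area_in else 1.0

b = [10, 8, 6, 4, 6]      # IN:  area under CUM-IN over all 5 slots is 114
a = [0, 10, 8, 6, 10]     # OUT: everything leaves one slot late
print(area_confidence(b, a, 0, 4))   # about 0.75 -- CUM-OUT lags behind CUM-IN
```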
RM: Interpretation by area over
cumulative counts
Find all intervals with confidence >= 0.9 (say)
RM: Interpretation by area over
cumulative counts
Return a minimum collection of intervals with confidence >= 0.9 (say) covering at least 95% (say) of the data
Finding intervals with high confidence
• Trivial in O(n³) time
• Try all ~n² possible intervals
• For each interval, compute the confidence in O(n) time
Finding intervals with high confidence
• Easy to do in O(n²) time
• Compute, in linear time, the confidence of all intervals that start at a given point (a sketch follows below)
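A sketch of the O(n²) approach (same simplifying assumptions as the earlier area sketch): fix a starting point, sweep the ending point to the right, and maintain running areas so that each interval's confidence is updated in O(1):

```python
def high_confidence_intervals(in_counts, out_counts, c):
    """Return all (start, end) pairs whose area-based confidence is >= c."""
    n = len(in_counts)
    result = []
    for s in range(n):
        cum_in = cum_out = area_in = area_out = 0
        for e in range(s, n):
            cum_in += in_counts[e]
            cum_out += out_counts[e]
            area_in += cum_in        # the area grows by the current height
            area_out += cum_out
            if area_out >= c * area_in:
                result.append((s, e))
    return result

print(high_confidence_intervals([10, 8, 6, 4, 6], [0, 10, 8, 6, 10], 0.9))
```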
Finding intervals with high confidence
• How do you solve it in sub-quadratic time?
• Only maximal intervals
Finding intervals with high confidence
Relax confidence:
If the algorithm outputs I, then conf(I) ≥ c/(1+ε)
(no false positives)
If conf(I*) ≥ c for some interval I*, then the algorithm outputs some I ⊇ I* with conf(I) ≥ c/(1+ε)
(no false negatives)
Finding intervals with high confidence
Algorithm 1
Finding intervals with high confidence
Generating Sparse Set of Intervals
Compute the confidence of intervals whose area grows geometrically by a factor of (1+ε)
Finding intervals with high confidence
Running time depends on the area
Finding intervals with high confidence:
Avoiding dependency on area
• Main Idea:
• Consider each possible ending point of intervals instead of starting points
• Compute the confidence of intervals whose lengths grow geometrically by a factor of (1+ε) (a sketch of the candidate generation follows below)
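A sketch of the candidate generation only (my own illustration; the full algorithm additionally needs an argument that every high-confidence interval is approximately contained in some candidate, which is what yields the c/(1+ε) guarantee):

```python
import math

def candidate_intervals(n, eps):
    """For each ending point, yield intervals whose lengths grow geometrically
    by a factor of (1 + eps): 1, (1+eps), (1+eps)^2, ... (rounded, deduplicated)."""
    for end in range(n):
        seen = set()
        length = 1.0
        while True:
            start = end - math.ceil(length) + 1
            if start < 0:
                break
            if start not in seen:
                seen.add(start)
                yield (start, end)
            length *= 1 + eps

# Roughly n * log_{1+eps}(n) candidates are examined:
print(len(list(candidate_intervals(1000, 0.1))))  # far fewer than the 500,500 possible intervals
```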
Discount Models
Finding minimum collection of maximal
intervals with support threshold
Partial set cover on a line
Can be solved exactly in quadratic time using dynamic programming
Can be solved in linear time if we allow a constant-factor approximation using a greedy algorithm
Greedy gives a 7-approximation
Finding minimum collection of maximal
intervals with support threshold
Partial set cover using greedy algorithm
If OPT chooses t intervals, then:
The greedy algorithm chooses at most t intervals that do not intersect any of the OPT intervals.
It chooses at most 6 intervals that intersect a particular OPT interval.
Hence greedy chooses at most t + 6t = 7t intervals in total.
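A sketch of the greedy strategy for the partial cover step (my own generic illustration; the linear-time version mentioned above exploits the fact that the sets are intervals on a line, which this simple version does not):

```python
def greedy_partial_cover(n, intervals, support=0.95):
    """Pick intervals (start, end index pairs over n data points) until at
    least `support` fraction of the points is covered."""
    covered, chosen = set(), []
    remaining = list(intervals)
    while len(covered) < support * n and remaining:
        # take the interval that covers the most still-uncovered points
        best = max(remaining, key=lambda iv: len(set(range(iv[0], iv[1] + 1)) - covered))
        gain = set(range(best[0], best[1] + 1)) - covered
        if not gain:
            break
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen

print(greedy_partial_cover(10, [(0, 3), (2, 6), (5, 9), (8, 9)], support=0.9))
# [(2, 6), (5, 9), (0, 3)]
```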
Experimental results (figures):
• Credit Card Data (Dec-Jan)
• Entrance-Exit Data
• Network Monitoring Data
• Running Time on Job-log Data (area-based vs. non-area-based algorithms)
Summary
We study data quality problems that arise frequently in
many infrastructure networks.
We propose rules that express conservation laws
between related quantities, such as those between the
inbound and outbound counts reported by network
monitoring systems.
We present several confidence metrics for
conservation rules.
We give efficient approximation algorithms for finding a concise set of intervals that satisfy (or fail) a supplied conservation rule, given a confidence threshold.