Class slides July 12, 2000 - NYU Computer Science Department

Data Quality
Class 8
Agenda
• Tests will be graded by next week
• Project
• Discovery
– Domain discovery
– Domain identification
– Nulls
– Clustering for rule discovery
Discovery
• Domain Discovery
• Mapping Discovery
• Key Discovery
• Rule Discovery
First...
• Consult the manual
• Consult the experts
Automated Discovery
• Searching for:
– obvious
– insight
– patterns
– knowledge
What are characteristics of domains?
– They are typically finite data sets
– They are relatively static
– They may be used in many contexts
– The number of values is relatively small
– “intuitive distribution”
– Attributes that use domains are typically not null
Intuitive Distribution
• The distribution may be an even distribution
• The distribution of values takes on the
characteristics of the context
• Example: Security transactions
– Most will be denominated in US dollars
– Many will be denominated in £ and ¥
– Fewer in other currencies
– The currency will draw its values from a domain, but the distribution will be largely skewed to USD, then GBP, JPY, then others...
Domain Discovery
• The Brute Force method
• Good for “seeding” a set of domains when no
information is already available
• For each column, take all distinct values
• Apply heuristics:
– How does the number of distinct values compare to the
number of records from which the values were
extracted?
– Is distribution relatively equal?
– How many nulls are there?
– Are the data values similar to other attributes’ values?
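A minimal Python sketch of these brute-force heuristics for a single column; the null markers and the distinct-value threshold are illustrative assumptions, not values from the slides.

from collections import Counter

def profile_column(values, max_distinct_ratio=0.01):
    # treat a few common markers as null; adjust for the data at hand
    non_null = [v for v in values if v not in (None, "", "N/A")]
    counts = Counter(non_null)
    distinct = len(counts)
    return {
        "records": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": distinct,
        # few distinct values relative to the record count suggests a domain
        "distinct_ratio": distinct / max(len(values), 1),
        # rough evenness check: share taken by the most common value
        "top_share": counts.most_common(1)[0][1] / len(non_null) if non_null else 0.0,
        "candidate_domain": distinct > 1 and distinct / max(len(values), 1) <= max_distinct_ratio,
    }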
Pattern-based Domain Discovery
• Analyze value patterns
• Start from the bottom up:
• Tag each character
• digit
• letter
• punctuation
• white space
• Sort found patterns, see if there is any
overriding similarity
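A small sketch of the character-tagging step in Python; the single-letter tags (9, A, P, _) are an assumed encoding chosen for readability.

from collections import Counter

def char_pattern(value):
    tags = []
    for ch in str(value):
        if ch.isdigit():
            tags.append("9")      # digit
        elif ch.isalpha():
            tags.append("A")      # letter
        elif ch.isspace():
            tags.append("_")      # white space
        else:
            tags.append("P")      # punctuation / other
    return "".join(tags)

def pattern_frequencies(values):
    # sort patterns by frequency to spot an overriding similarity
    return Counter(char_pattern(v) for v in values).most_common()

For example, pattern_frequencies(["10012", "10013-1234"]) returns [("99999", 1), ("99999P9999", 1)], which already hints at a ZIP-code-like domain.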
Pattern-based Domains - 2
• If nothing stands out, go up to next level
• Characterize individual strings by classes:
• alphabetic
• numeric
• alphanumeric
• verbs
• nouns
• name words
• business words
• address words
• email addresses
Patterns... – 3
• Where do we get these word classifications?
• Experience
• USPS
• Census department
• SIC classifications
• Domain registry
• Standards bodies
• It may be worthwhile to collect words by
classification as reference data
• These tables can also be scanned using the same
methods for fast matching
Domain Identification
• Given an already existing set of domains, how can
we determine if a column makes use of one of those
domains?
• Algorithm:
– All distinct values are extracted
– Perform a match against known domains
– Compute 3 ratios: agreement, overlap,
disagreement
Fast Matching: Method 1
• We must be able to quickly match a pair of
data value sets
• Method 1: Hash Tables
– Load first domain into a hash table (O(n))
– Search for each value of second domain (O(n))
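A quick sketch using Python's built-in hash-based sets, which gives the same O(n) load / O(n) probe behavior; the example values are hypothetical.

known_domain = {"USD", "GBP", "JPY", "EUR"}      # load first domain (hash table)
column_values = {"USD", "GBP", "CHF"}            # distinct values from a column

matched = column_values & known_domain           # values found in the domain
unmatched = column_values - known_domain         # values not in the domain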
Fast Matching: Method 2
• Bloom Filters
• An approximate method for testing set
membership
• If the Bloom filter test says no, the answer is
no
• If the Bloom filter says yes, it still might be
no, with low (limited) probability
Bloom Filters
• Use k hash functions (h1, h2, .., hk)
• For each member of the set, invoke each of the
hash functions, and set a bit in a bit vector for each
result
• For testing a value for membership:
– invoke all hash functions
– if not all result bits are set in bit vector, the value is not
in the set
– If all the bits are set, there is still a small chance the
value is not in the set
– That probability is bounded based on the number of bits
in the bit vector and the number of hash functions
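A compact sketch of this scheme; deriving the k hash functions by salting Python's built-in hash is an assumption for illustration, not part of the slides.

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m          # the bit vector

    def _positions(self, value):
        # k salted hash functions, each mapped into the bit vector
        return [hash((i, value)) % self.m for i in range(self.k)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False means definitely not a member; True means probably,
        # with a small, bounded false-positive probability
        return all(self.bits[pos] for pos in self._positions(value))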
Bloom Filters – 2
• Table is used optimally when it is half-filled with
1’s
• The best choice for sizing the table is based on
expectation of finding a false positive
• This is calculated as a function of n, the number of
words, and m, the size of the table, and k, the
number of hash functions
• k is approximately ln 2 * m/n
• The error probability is 2^(-k)
• (email me for more details on this analysis)
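A worked instance of the sizing rule, assuming n = 10,000 domain values and m = 100,000 bits:

import math

n, m = 10_000, 100_000
k = round(math.log(2) * m / n)    # about 7 hash functions
false_positive = 2 ** -k          # roughly 0.008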
Three Ratios
• agreement is calculated as the ratio of distinct
attribute values that are present in a domain to the
total number of distinct values in the attribute
• overlap is calculated as the number of domain
member values that do not appear in the attribute
divided by the number of domain values
• disagreement is calculated as the number of values that
appear in the attribute but are not members of the domain
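A sketch of the three ratios in Python; the denominator for disagreement is assumed to be the number of distinct attribute values, which is consistent with the 100% case on the next slide.

def domain_ratios(attribute_values, domain_values):
    attr = set(attribute_values)      # distinct values used in the attribute
    dom = set(domain_values)          # the known domain
    agreement = len(attr & dom) / len(attr) if attr else 0.0
    overlap = len(dom - attr) / len(dom) if dom else 0.0
    disagreement = len(attr - dom) / len(attr) if attr else 0.0
    return agreement, overlap, disagreement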
Domain Matching
1. All of the values used in the attribute are members of the known domain, and all of the values in the domain are used in the attribute. (agreement = 100%, overlap = 0%, disagreement = 0%)
• GOOD MATCH
2. All of the values in the attribute are members of the known domain, but there are domain members that are not used in the attribute. (agreement = 100%, overlap > 0%, disagreement = 0%)
• Also GOOD MATCH (explore as subdomain?)
Domain Matching
3. Some of the attribute values are members of the known domain, but some of the values used in the attribute are not members of the known domain. (agreement < 100%, disagreement > 0%)
• Either not a match, or perhaps the known domain is a subdomain
4. None of the values used in the attribute are taken from the known domain. (agreement = 0%, overlap = 100%, disagreement = 100%)
• Not a match
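A small sketch mapping the three ratios onto the four cases above; the returned labels are informal.

def classify_match(agreement, overlap, disagreement):
    if agreement == 1.0 and overlap == 0.0:
        return "good match"                                   # case 1
    if agreement == 1.0 and overlap > 0.0:
        return "good match (explore as subdomain?)"           # case 2
    if agreement == 0.0:
        return "not a match"                                  # case 4
    return "not a match, or the known domain is a subdomain"  # case 3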
Sub and Super Domains
• subdomain: a set of values that is a subset of
another domain
• superdomain: a set of values that is a superset of
the known domain
• Overloaded attribute: some of the values
belong to one domain, other values belong
to another domain
Mapping Discovery
• Much harder than domain discovery
• Must look for connectivity patterns
• one-to-one mappings are easiest
One-to-One Mappings
• Both attributes must belong to a domain
• O(n²) algorithm
• Extract value pairs from both attributes
• When all distinct pairs have been extracted, there
should be no duplicate entries of the source
attribute value
• Because different source values can map to the
same target value, this last characteristic does not
hold for the target attribute value
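A sketch of the pairwise check described above; it tests only the "no duplicate source values" condition, assuming rows is an iterable of (source, target) value pairs.

def is_one_to_one_candidate(rows):
    seen = {}
    for source, target in set(rows):            # distinct value pairs
        if source in seen and seen[source] != target:
            return False                        # one source value maps to two targets
        seen[source] = target
    return True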
Mapping Heuristics
• We can also apply some heuristics:
– See if the number of distinct values in the
source attribute is greater than or equal to the
number of distinct values in the target attribute.
– See if the attribute names appear together in
more than one table or relation
• We can also apply rule discovery, since 1-1
mappings are similar to functional
dependencies
Rule Discovery
• Null rules
• Value ranges
• Key discovery
• Association rules
Null Value Rules
• For each attribute:
– Are there missing values?
• Opens other questions, such as whether nulls are really
allowed, what kinds of nulls there are, etc.
– If so, is there any relationship to other attributes?
• This starts to look like a mapping from another attribute to the
null value
– Are there existing null value representations?
• This would be reflected potentially as anomalous patterns that
appear with some frequency
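A small profiling sketch for null rules; the list of suspect null representations is an assumption for illustration.

from collections import Counter

SUSPECT_NULLS = {"", "N/A", "NA", "NULL", "NONE", "-", "?", "9999"}

def null_profile(values):
    counts = Counter(values)
    return {
        "missing": sum(1 for v in values if v is None),
        # frequent anomalous values may be home-grown null representations
        "suspect_nulls": {v: c for v, c in counts.items()
                          if isinstance(v, str) and v.strip().upper() in SUSPECT_NULLS},
    }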
Rule Discovery Techniques
• Clustering
• Decision Trees
• Association Rules
Clustering
• A set of algorithms that are used to segment items
in a set into separate partitions, where each value
is assigned to a specific partition
• We group items by some set of properties, and
assign a meaning to each cluster
• Clustering hinges on the notion of “distance”
between items in a set, making use of a distance
function to determine how close any two items are
to each other
• We can cluster value sets composed of values
pulled from multiple attributes
Clustering – 2
• We can discover relationships between values
based on assignment to a cluster
• Clusters are then sources for rules
• Two kinds of clustering:
– Hierarchical (such as a classification taxonomy)
– Non-hierarchical (such as a distribution into
equivalence classes)
• Clustering can be performed in one of 2 ways:
– Agglomerative – collect clusters together
– Divisive – break clusters apart
Similarity and Distance
• To assign data items to clusters, we need a notion
of distance and similarity
• Obviously, similarity is measured as a function of
distance
• All items to be explored are projected into an n-dimensional space
• Distance can be measured based on traditional
distance functions, or specialized distance
functions
Distance Measures
• Euclidean distance: the distance between any two
points is the square root of the sum of the squares of
the differences between the corresponding coordinates
• City Block distance: this measures the distance in
terms of walking along the line segments in a grid
• Exact-match distance: For variables whose values
are not allocated along continuous dimensions, the
values assigned are either 1 for an exact match, or
0 for not a match
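The three measures as small Python functions; exact_match follows the slide's convention of returning 1 for a match and 0 otherwise.

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def exact_match(a, b):
    return 1 if a == b else 0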
Divisive Clustering
1. Assign all values into a single cluster.
2. Determine a division of each cluster that will
maximize the similarity between all the elements
within one side of the division and maximize
the difference between any pair of items across
the division.
3. Repeat step 2 until either some predefined limit
of clusters has been found, or some predefined
limit of cluster size (minimum, of course, is 1)
has been reached.
Divisive Clustering – 2
• Requires the storage of a lot of data
• Each pair of clusters needs to be examined
to measure the difference between them
• Computationally intensive
• Not very popular
Agglomerative Clustering
1. Assign each data element to its own
cluster
2. Find the two closest clusters and combine
them into a single cluster
3. Repeat step 2 until there is only one
cluster left
• The tricky part is determining which
clusters are closest to each other
Agglomerative Clustering – 2
• Methods for determining cluster closeness:
– Single link method: In this method, the distance
between any two clusters is determined by the distance
between the nearest neighbors in the two clusters
– Complete link: In this method, the distance between
any two clusters is determined by the greatest distance
between any two data items in the two different clusters
– Unweighted pair-group average: In this method, the
distance between any two clusters is calculated as the
average distance between all pairs of data items in the
two different clusters
Agglomerative Clustering – 3
– Weighted pair-group average: This method is similar to
the unweighted pair-group average, except that the size
of the cluster is used as a weight
– Unweighted pair-group centroid: A centroid is a point
calculated as the average point in a space formed by a
cluster. For any multidimensional space, the centroid
effectively represents the center of gravity for all the
points in a cluster. In this method, the distance between
any two clusters is measured as the distance between
centroids of the two clusters
– Weighted pair-group centroid: This is similar to the
unweighted pair-group centroid method, except that the
size of the cluster is used as a weight
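A naive agglomerative sketch with single-link and complete-link closeness; it is O(n³) and meant only to mirror the three steps above, with the distance function supplied by the caller.

def agglomerate(points, distance, target_clusters, linkage="single"):
    clusters = [[p] for p in points]              # step 1: each item is its own cluster
    pick = min if linkage == "single" else max    # nearest pair vs. farthest pair
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = pick(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                            # step 2: combine the two closest clusters
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters                               # step 3: repeat until the target count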
Value Ranges Using Clustering
• This uses an agglomerative method
• Look at aggregating values based on closeness in
Euclidean space
• Assume each value is in its own cluster
• Aggregate based on distance to other clusters
• For a single dimension (such as integer values),
each cluster will represent a value range
• We can set limits on distance as a way to stop the
iteration
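A one-dimensional sketch of this idea: neighbouring values are merged into a range whenever their gap is within a distance limit, which is what stops the iteration.

def value_ranges(values, max_gap):
    ranges = []
    for v in sorted(set(values)):
        if ranges and v - ranges[-1][1] <= max_gap:
            ranges[-1][1] = v                 # extend the current range
        else:
            ranges.append([v, v])             # start a new range (cluster)
    return [(lo, hi) for lo, hi in ranges]

For example, value_ranges([1, 2, 3, 10, 11, 50], max_gap=2) gives [(1, 3), (10, 11), (50, 50)].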
Clustering for Classification
• We can use clustering as a way to classify records
based on attributes of our choice
• If we have some ideas about rule relations, we can
test them out using clustering:
– Extract the attribute values expected in the rule
– Normalize any values
– Apply any weights
– Perform clustering
• Once the clustering has been done, those
characteristics that are the drivers in determining
similarity can be used as the basis for defining
new rules
K-Means Clustering
• K stands for the number of clusters we would
like to find
1. Arbitrarily select K seeds that are to be the first guess
at determining the centroids of the clusters.
2. For each record, assign the record to a cluster based
on the cluster centroid to which it is nearest.
3. After all records have been assigned to a cluster,
recompute the centroids of the clusters.
4. If the centroids changed from the previous iteration,
then repeat the process starting at step 2.
5. If the centroids did not change from the previous step,
then the process is finished.
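A compact sketch of the loop above for numeric records; seeding with the first K records (rather than an arbitrary selection) is a simplifying assumption.

import math

def kmeans(records, k, max_iter=100):
    centroids = [list(r) for r in records[:k]]             # step 1: initial seeds
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for r in records:                                   # step 2: assign to nearest centroid
            nearest = min(range(k), key=lambda i: math.dist(r, centroids[i]))
            clusters[nearest].append(list(r))
        new_centroids = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
                         for i, c in enumerate(clusters)]   # step 3: recompute centroids
        if new_centroids == centroids:                      # step 5: stop when nothing changed
            return centroids, clusters
        centroids = new_centroids                           # step 4: otherwise iterate again
    return centroids, clusters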