Learning to mine Rank Correlated Numerical Attributes.


Mining Rank-Correlated Sets of Numerical Attributes
Index
Introduction
Need for finding new measures for mining numerical datasets
– Problems with previous association rules based on discretization methods
– Discretization example and its disadvantages
Proposed methods and concepts
Rank and rank correlation measures
– What are rank and rank correlation?
– Introducing the new rank correlation measures
– Observations
– Defining the support for the new measures
– Example dataset
– Calculating the rank correlation measure values on the example dataset
– Properties of the new rank correlation measures
Categorical attributes combined with numerical attributes
– Explaining the new measures on categorical and mixed data
Defining confidence and association rules with the new support measures
– Example and explanation
Finding ranks with attribute value ties in the dataset
– Defining new support measures with average rank values
Algorithms for finding frequent itemsets using the new rank-based support measures
– Explicit-pairs method for converting the dataset to a normal one
– Using algorithms with spatial indices
– Explaining the search traversal techniques Range-tree and Kd-tree
Experimental evaluation
– Results of applying the new support measures on meteorological data
– Performance evaluation for the new support measures
Introduction
The motivation for the research reported upon in this paper is an application where we
want to mine frequently occurring patterns in a meteorological dataset containing measurements from
various weather stations in Belgium over the past few years. Each record contains a set of
measurements (such as temperature or pressure) taken in a given station at a given time point,
together with extra information about the stations (such as location or altitude).
The classical association rule framework [1], however, is not adequate to deal with
numerical data directly. Most previous approaches to association rule mining for numerical attributes
were based on discretization.
For mining rank-correlated numerical datasets, three new support measures are proposed. Their application to a real-life meteorological dataset and the corresponding performance evaluations are also presented.
Discretization and Concept Hierarchy
Discretization
• reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Concept hierarchies
• reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
E.g., if you have an attribute Age with values (22, 23, 24, 35, 36, 37, 55, 56, 57), applying discretization and breaking this into intervals we get:
Age: Young (20…35), MiddleAge (35…55), Old (55…80)
Here,
Young gives you (22, 23, 24)
MiddleAge gives you (35, 36, 37)
Old gives you (55, 56, 57)
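A minimal Python sketch of this binning, assuming half-open intervals so that the boundary values 35 and 55 fall into the higher bucket, as in the example:

def discretize_age(age):
    # Concept hierarchy from the example: Young [20, 35), MiddleAge [35, 55), Old [55, 80]
    if 20 <= age < 35:
        return "Young"
    if 35 <= age < 55:
        return "MiddleAge"
    if 55 <= age <= 80:
        return "Old"
    return None  # value outside the defined hierarchy

ages = [22, 23, 24, 35, 36, 37, 55, 56, 57]
print([discretize_age(a) for a in ages])
# ['Young', 'Young', 'Young', 'MiddleAge', 'MiddleAge', 'MiddleAge', 'Old', 'Old', 'Old']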
Discretization Example
Table: People

RecordID   Age   Married Status   NumCars
100        23    No               1
200        25    Yes              1
300        29    No               0
400        34    Yes              2
500        38    Yes              2
Minimum Support = 40%, Minimum Confidence = 50%

Rule                                                 Support   Confidence
(Age: 30…39) and (Married: Yes) => (NumCars: 2)      40%       100%
(NumCars: 0…1) => (Married: No)                      40%       66.66%
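These numbers can be verified directly; a small Python sketch over the People table (the function names are illustrative, not from the paper):

people = [
    {"Age": 23, "Married": "No",  "NumCars": 1},
    {"Age": 25, "Married": "Yes", "NumCars": 1},
    {"Age": 29, "Married": "No",  "NumCars": 0},
    {"Age": 34, "Married": "Yes", "NumCars": 2},
    {"Age": 38, "Married": "Yes", "NumCars": 2},
]

def rule_stats(rows, antecedent, consequent):
    # Support: fraction of rows matching both sides;
    # confidence: matching rows divided by rows matching the antecedent.
    matches = [r for r in rows if antecedent(r)]
    both = [r for r in matches if consequent(r)]
    return len(both) / len(rows), len(both) / len(matches)

print(rule_stats(people,
                 lambda r: 30 <= r["Age"] <= 39 and r["Married"] == "Yes",
                 lambda r: r["NumCars"] == 2))     # (0.4, 1.0)
print(rule_stats(people,
                 lambda r: r["NumCars"] <= 1,
                 lambda r: r["Married"] == "No"))  # (0.4, 0.666...)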
The above simple approach (discretization) to the quantitative database has two major drawbacks:
"MinSup":
If the number of intervals for a quantitative attribute is large, the support for any single interval can be low. Hence, without using larger intervals, some rules involving this attribute may not be found because they lack minimum support.
"MinConf":
Some information is lost whenever we partition values into intervals. Some rules may reach minimum confidence only when an item in the antecedent consists of a single value (or a small interval). This information loss increases as the interval sizes become larger.
For example, in the People table above, the rule
(NumCars: 0) => (Married: No) has 100% confidence and 20% support.
But if we had partitioned the attribute NumCars into intervals such that 0 and 1 cars end up in the same partition, then the closest rule is
(NumCars: 0…1) => (Married: No), which only has 66.67% confidence and 40% support.
Conclusion: as the interval size increases, support increases but confidence may decrease, and vice versa.
A "catch-22" situation: if the intervals are too large, some rules may not have minimum confidence; if they are too small, some rules may not have minimum support.
Disadvantages of Discretization
Discretization's disadvantages can be listed as:
• First of all, it always incurs an information loss, since values falling in the same bucket become indistinguishable and small differences in attribute values become unnoticeable.
• On the other hand, very small changes in values close to a discretization border may cause unjustifiably large changes in the set of active rules.
• If there are too many discretization intervals, discovered rules are replicated in each interval, making the overall trends hard to spot.
• It is also possible that rules will fail to meet the minimum support criterion when they are split among many narrow intervals.
What are Rank and Rank Correlation Measures?
Rank:
Let D be the database and H its header, i.e., the set of attributes. The rank of the value t.A of a numerical attribute A in a tuple t, denoted r(t.A), is the index of that value when the values of A are ordered ascending; that is, the smallest value of A gets rank 1, the second smallest rank 2, etc. Consider the following table:
Sample Table: Person

Person   Height   Weight
A        5.5      53
B        5.7      55
C        5.8      45
D        5.9      50
E        6.0      58
F        6.1      65
G        6.2      68
H        6.3      60

Same table represented with rank values:

Person           A   B   C   D   E   F   G   H
Rank by height   1   2   3   4   5   6   7   8
Rank by weight   3   4   1   2   5   7   8   6
Rank Correlation:
“Rank correlation” is the study of relationships between different rankings on the same
set of items. It deals with measuring correspondence between two rankings, and
assessing the significance of this correspondence. Calculating the Rank Correlation
Coefficients is described in the next section.
Assumption: For most of this paper it is assumed that all values of numerical attributes are distinct.
Rank correlation measures
• Kendall's τ
• Spearman's Footrule (F)
• Spearman's ρ
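The coefficient formulas themselves appeared on the slide; for n = |D| tuples and per-tuple rank difference d_t = r(t.A) − r(t.B), the standard definitions are as follows (the footrule normalization shown is assumed to match the paper's convention):

$$\rho(A,B) = 1 - \frac{6\sum_{t \in D} d_t^2}{n(n^2-1)}, \qquad F(A,B) = 1 - \frac{3\sum_{t \in D} |d_t|}{n^2-1}, \qquad \tau(A,B) = \frac{n_c - n_d}{\binom{n}{2}},$$

where n_c and n_d are the numbers of concordant and discordant pairs of tuples.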
1. Observations for ρ and F:
When two attributes are highly positively correlated, the ranks of their values will be close within most tuples. Hence the sums ∑_{t∈D} (r(t.A) − r(t.B))² and ∑_{t∈D} |r(t.A) − r(t.B)| will be small if A and B are correlated. On the other hand, if A and B are uncorrelated, the expected absolute difference between the ranks of A and B is large (it grows linearly with |D|), resulting in very high values for these sums. The formulas are chosen in such a way that they are 1 for total positive correlation (e.g. A = B) and -1 for total negative correlation (e.g. the values of A and B are in reverse order).
2. Observations for τ:
The formula is based on the probability that a randomly chosen pair of tuples (s, t) satisfies s.A < t.A and s.B < t.B, rescaled so that it too is 1 for total positive correlation and -1 for total negative correlation. Due to this connection to probability, this formula is more natural than the other two.
Measuring the support for Numerical attributes
By using the above coefficients, the following support measures are proposed, where I is a set of numerical attributes.
suppρ and suppF: aggregate, for each tuple, the difference between the maximum and minimum rank values over the attributes of I.
suppτ: all pairs of tuples (s, t) are considered, and the number of pairs such that s comes before t on all attributes is counted. This number is then divided by the maximal number of such possible pairs.
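The formal definitions were shown on the slide; the following reconstruction is consistent with the verbal descriptions above, with normalizations assumed to generalize the two-attribute coefficients:

$$supp_\rho(I) = 1 - \frac{6\sum_{t\in D}\bigl(\max_{A\in I} r(t.A) - \min_{A\in I} r(t.A)\bigr)^2}{|D|\,(|D|^2-1)}$$

$$supp_F(I) = 1 - \frac{3\sum_{t\in D}\bigl(\max_{A\in I} r(t.A) - \min_{A\in I} r(t.A)\bigr)}{|D|^2-1}$$

$$supp_\tau(I) = \frac{\bigl|\{(s,t)\in D^2 : s.A < t.A \text{ for all } A\in I\}\bigr|}{\binom{|D|}{2}}$$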
Example Weather Dataset:
Consider the following example weather dataset, where
W = wind speed, P = pressure,
T = temperature, H = humidity.
The numbers in brackets indicate the rank of each value within its attribute.
In order to compute suppτ({T, P, H}), we need to count the number of pairs of tuples such that the first tuple is smaller than the second tuple on T, P and H; e.g. the pair (t5, t1) is such a pair, and (t2, t3) is not. As there are 6 pairs of tuples that are in order, (t5, t3), (t5, t2), (t5, t1), (t4, t3), (t4, t2) and (t4, t1), suppτ is 6/10.
• suppτ is very expensive to compute.
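To make the computation concrete, here is a small Python sketch; since the slide's weather table was a figure, the rank values below are hypothetical, but they are chosen to be consistent with the example (6 ordered pairs out of 10):

from itertools import combinations

# Hypothetical per-tuple ranks of T, P and H (not the slide's actual data).
ranks = {
    "t1": {"T": 3, "P": 5, "H": 3},
    "t2": {"T": 4, "P": 4, "H": 5},
    "t3": {"T": 5, "P": 3, "H": 4},
    "t4": {"T": 1, "P": 2, "H": 1},
    "t5": {"T": 2, "P": 1, "H": 2},
}

def supp_tau(ranks, I):
    # Count unordered pairs that are ordered consistently (in either
    # direction) on every attribute of I, divided by the number of pairs.
    ordered = 0
    for s, t in combinations(ranks.values(), 2):
        if all(s[A] < t[A] for A in I) or all(t[A] < s[A] for A in I):
            ordered += 1
    n = len(ranks)
    return ordered / (n * (n - 1) // 2)

print(supp_tau(ranks, ["T", "P", "H"]))  # 0.6, i.e. 6/10 as in the example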
Properties:
1. Average suppτ:
Let A1, A2, …, Ak be k statistically independent numerical attributes. Then the expected (average) value of suppτ is given below.
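The formula itself was on the slide; it can be reconstructed from the definition: each independent attribute orders a given pair of tuples one way or the other with probability 1/2, so the pair is ordered consistently on all k attributes (in either direction) with probability 2 · (1/2)^k. Hence, presumably,

$$E\bigl[supp_\tau(\{A_1,\dots,A_k\})\bigr] = 2^{\,1-k}.$$

For k = 1 this gives 1, and it halves with every additional independent attribute.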
2. The support calculated by Kendall's formula is never greater than the support calculated by Spearman's Footrule: suppτ(I) ≤ suppF(I).
The property "the support of a superset cannot be greater than the support of its subset" also holds for these support measures. For I ⊆ J it can be stated as:
0 ≤ suppm(J) ≤ suppm(I) ≤ 1
The bound suppτ(I) ≤ suppF(I) can be used to obtain an upper bound on suppτ, because suppF, in contrast to suppτ itself, is easy to compute.
Categorical attributes
The above definitions of support are also applicable to ordinal attributes, as their domains are ordered.
Integrating categorical attributes with numerical attributes:
Categorical attributes will be used to define the context in which the support of the numerical attributes is computed.
Example: We have attributes related to weather data. Then a sample rule could be:
"In measurements of Antwerp and Brussels, the temperature and wind speed in Antwerp are higher than in Brussels in 70% of the cases."
Association rules
As we have defined the support for categorical attributes, we can easily define the confidence of a rule.
The interpretation of an association rule with confidence c is: for all pairs of tuples (t1, t2) with t1 satisfying J1 and t2 satisfying J2, if t1 is smaller than t2 on I1, then with probability c, t1 is also smaller than t2 on I2.
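The formal definition was given on the slide; by analogy with the classical framework, it is presumably the ratio of the support of the whole rule to that of its antecedent, e.g. for the suppτ measure:

$$c(I_1 \Rightarrow I_2) = \frac{supp_\tau(I_1 \cup I_2)}{supp_\tau(I_1)}$$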
Example:
Consider the rule:
"In Antwerp, if the wind speed in one measurement is greater than in another measurement, then there is a probability of 65% that the cloudiness is higher as well."
Measuring Ranks with ties
At the start of this presentation we made the assumption that all numerical values of an attribute are distinct. That won't be the case in real data.
This happens with ordinal data too, e.g. an attribute Cloudiness that can take the values (low, medium, high). For this attribute at least one third of the pairs will be tied.
Modifying Support Measures:
suppτ:
For suppτ we can simply keep the original definition; that is, if two tuples have a tie on an attribute of I, they do not contribute to suppτ(I).
suppρ & suppF:
Suppose a value t.A is repeated m times in attribute A, and there are p tuples with an A-value smaller than t.A. Then the rank of the value t.A can be defined in any of the following ways:
1. Any value in the range p + 1, …, p + m.
2. The average rank: ar(t.A) = ((p + 1) + (p + m))/2 = p + (m + 1)/2.
3. The minimal rank p + 1 or the maximal rank mr(t.A) = p + m.
Assigning the average rank to tied values is the most natural choice, since the sum of all ranks then stays the same as in the tie-free case. With the minimal or maximal rank this sum respectively decreases or increases, hence introducing a bias.
For rankings with ties, suppρ(I) and suppF(I) are redefined using these average ranks.
The definition of suppτ does not change.
For ranks with no ties the modified formulas reduce to the original ones.
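As a concrete illustration, a small Python sketch of the average-rank convention (the function name and representation are illustrative, not from the paper):

from collections import Counter

def average_ranks(values):
    # For a value repeated m times with p strictly smaller values,
    # the ranks p+1, ..., p+m are averaged to p + (m+1)/2.
    counts = Counter(values)
    smaller = {}
    p = 0
    for v in sorted(counts):
        smaller[v] = p
        p += counts[v]
    return [smaller[v] + (counts[v] + 1) / 2 for v in values]

print(average_ranks([5.5, 5.7, 5.7, 6.0]))  # [1.0, 2.5, 2.5, 4.0]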
Algorithms
1. As only ranks are compared instead of the actual values in the database, we first transform the database by replacing every attribute value by its rank w.r.t. this attribute, which can be done in O(|H||D| log(|D|)) time.
2. Generate all sets of attributes satisfying the minimum support threshold.
3. In the case of suppρ or suppF, this can be done by the standard level-wise search in which candidate itemsets are generated only when all of their subsets are known to be frequent. Counting the supports requires only a single scan through the data in every iteration.
4. In the case that suppτ is used as the support measure, the mining task becomes inherently more difficult, since the number of pairs of transactions is quadratic in the size of the database.
Explicit creation of pairs of transactions
A brute-force solution to the presented mining problem is to explicitly combine all pairs of transactions from the original database D and create a Boolean database D′ in which every pair of distinct tuples (s, t) yields one transaction containing exactly those attributes A for which s.A < t.A.
With this new database we can apply the general methods for finding frequent itemsets. So the problem is reduced to standard frequent itemset mining on the new database D′.
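A minimal Python sketch of this construction (the encoding and names are assumptions, not the paper's code; the paper may store each unordered pair only once, which halves the size):

from itertools import permutations

def explicit_pairs(D, attributes):
    # One Boolean transaction per ordered pair (s, t) of distinct tuples:
    # it contains attribute A exactly when s.A < t.A.
    return [frozenset(A for A in attributes if s[A] < t[A])
            for s, t in permutations(D, 2)]

def supp_tau_from_pairs(D_prime, I, n):
    # Each unordered pair contributes at most one transaction containing I,
    # so dividing by n*(n-1)/2 yields supp_tau(I).
    I = frozenset(I)
    return sum(I <= T for T in D_prime) / (n * (n - 1) // 2)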
• Disadvantages:
1. The size of the new database D′: it is quadratic in the size of the original database.
2. The time needed to create it.
3. Hence it can be applied only to small databases.
• Advantage: it is most of the time very efficient in combination with depth-first frequent itemset mining algorithms, because all pairs that do not support an itemset are removed from its branch. Such a fine level of pruning is not possible in the other approaches.
Using Spatial Indices
Spatial Indices:
These are data structures especially designed for searching in multidimensional data, allowing searches in several dimensions simultaneously.
When the database is too large and the brute-force method is not applicable, we need an efficient mechanism to count suppτ for all candidate itemsets. For a given attribute set I, this comes down to the nested for loop sketched below.
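A Python sketch of that double loop over the rank-transformed database (names are assumptions; `ranks` is a list of per-tuple rank dictionaries):

def count_ordered_pairs(ranks, I):
    count = 0
    for s in ranks:                       # outer loop over all tuples
        for t in ranks:                   # inner loop: find the tuples that
            if s is not t and all(s[A] < t[A] for A in I):  # dominate s on I
                count += 1
    n = len(ranks)
    return count / (n * (n - 1) // 2)     # = supp_tau(I)

It is this inner loop that a spatial index can replace with a single multidimensional range query per tuple.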
Disadvantage:
1. The double loop is not feasible, as it performs a quadratic number of operations for each candidate set. If we can replace the inner for loop by some other search technique for finding the tuples that are smaller on all attributes of I, this becomes an efficient technique.
Optimization by using the support property of suppτ and suppF
Recall the property of the new support measures suppτ and suppF: suppτ(I) ≤ suppF(I).
As suppF can be computed a lot more efficiently than suppτ, we first compute it for every candidate itemset. In case it gives a value under the minimum support threshold, we can already prune the itemset from the search space without computing the more expensive suppτ.
After generating the candidate itemsets, our next task is to prune those that do not satisfy the support constraints.
Generating all candidate itemsets in "reverse depth-first" order gives the best result because it guarantees that all subsets of a given candidate itemset are generated before the candidate itself.
E.g. the subsets of {a, b, c, d} can be generated in the following order, in which each set appears only after all of its own subsets:
{d}, {c}, {c, d}, {b}, {b, d}, {b, c}, {b, c, d},
{a}, {a, d}, {a, c}, {a, c, d}, {a, b}, {a, b, d}, {a, b, c}, {a, b, c, d}
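A small Python sketch of a candidate-generation order with this subsets-first property (an illustration of the idea, not the paper's implementation):

def reverse_depth_first(items):
    # Pre-order traversal that starts from the largest item and always
    # extends a set with strictly larger items, taken largest-first;
    # every subset of a set is yielded before the set itself.
    def rec(prefix, cands):
        for i in range(len(cands) - 1, -1, -1):
            current = prefix + [cands[i]]
            yield current
            yield from rec(current, cands[i + 1:])
    yield from rec([], sorted(items))

for s in reverse_depth_first("abcd"):
    print(s)  # ['d'], ['c'], ['c', 'd'], ['b'], ['b', 'd'], ...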
Kd-trees and Range-trees are the two spatial search structures used to replace the inner-loop search over tuples.
Experimental Evaluation
Rules found in the weather data:
In the table, the following abbreviations are used: the altitude of the sun (solar alt), the amount of precipitation (precip), the amount of cloudiness (cloud), and the wind speed (w speed).
Association rules found in the weather data of one station in Brussels.
• The first rule simply claims that in 66% of the cases, the temperature is higher if the altitude of the sun is higher. The second rule indicates that if it is raining harder, then there is a probability of 64% that the cloudiness is higher as well. Intuitively, one would expect these confidences to be higher.
• The last three rules indicate that higher cloudiness or higher wind speed positively influences the amount of rain, and apparently, with both higher cloudiness and wind speed this effect becomes even stronger. Indeed, for two randomly selected tuples, the probability of the first having a higher precipitation is only 32%.
• When, however, we know that the cloudiness in the first tuple is higher, or the wind speed in the first tuple is higher, or both, the probability of the first tuple having a higher precipitation increases to 48%, 44%, and 57% respectively.
Performance evaluation of suppτ
Data and Resources Used:
1. A real-life meteorological dataset and a number of benchmark datasets.
2. All algorithms implemented in the C programming language.
3. Tested on a 2.2 GHz Opteron system with 2 GB RAM.
4. The evaluation concentrates on suppτ, which is the most natural of the three measures but also the hardest to compute.
Observations:
1. The implementation using explicit pairs ran out of memory at 1,000 records.
2. Kd-trees incur almost no memory overhead and fit in main memory.
3. Range-trees use a lot of memory but scaled to hundreds of thousands of records, depending on the minsup parameter.
Conclusions:
1. Overall, the Range-tree method is the most efficient as long as the minimal support is not too low; for very low thresholds it generates a lot of itemsets and runs out of memory.
2. For 1% minimum support, Kd-trees achieve the best performance because they incur almost no memory overhead.
3. The explicit-pairs method cannot handle data of 10,000 records at all, because the 49,995,000 generated records do not fit into memory.
Performance of suppρ and suppF
Mining patterns with both of these measures is much more efficient than with suppτ. But both measures are very sensitive to the minimum support level chosen; e.g. for suppρ the value of the minimum support must be very low, in the range of tenths of millionths of a percent. This shows that the theoretical maximum value is rarely achieved.
Conclusions:
1. Three new support measures for continuous attributes, based on rank methods, are introduced.
2. The discretization previously required for numerical data is no longer needed.
3. Algorithms for mining frequent itemsets based on those definitions are presented.
4. Test results on real-world meteorological data are presented, and the efficiency of the algorithms has been verified on large databases.
Questions