Efficient Itemset Extraction Using IMine Index

Transcript Efficient Itemset Extraction Using IMine Index

EFFICIENT ITEMSET EXTRACTION
USING IMINE INDEX
By
U.P.Pushpavalli
II Year ME(CSE)
OBJECTIVE
The main objective is to provide an index support for
frequent itemset mining.
To provide a compact and complete structure for item set
extraction .
Implemented by FP based and LCM based algorithms.
 A frequent itemset is an itemset whose support is ≥
minsup
 Support:
For rule of form A=>B, Support refers to percentage
of transaction in D that contain AUB.
 Confidence:
For rule of form A=>B, confidence is the conditional
probability that B is true when A is known to be true.
 support(LHS U RHS) / support(LHS)
Existing-Apriori Algorithm

Uses database scan and pattern matching to
collect counts for the candidate itemsets

Any subset of a frequent itemset must be
frequent.
Apriori –Example
Database D
TID
10
20
30
40
Items
a, c, d
b, c, e
a, b, c, e
b, e
1-candidates
Scan D
Min_sup=2
3-candidates
Scan D
Itemset
bce
Freq 3-itemsets
Itemset
bce
Sup
2
Itemset
a
b
c
d
e
Freq 1-itemsets
Sup
2
3
3
1
3
Itemset
a
b
c
e
Freq 2-itemsets
Itemset
ac
bc
be
ce
Sup
2
2
3
2
2-candidates
Sup
2
3
3
3
Counting
Itemset
ab
ac
ae
bc
be
ce
Sup
1
2
1
2
3
2
Itemset
ab
ac
ae
bc
be
ce
Scan D
Bottleneck of Apriori:


Huge candidate sets
Multiple scans of database
Mining Frequent Patterns- Without
Candidate Generation
Large database is compressed into a compact,
Frequent-Pattern tree (FP-tree) structure

Highly condensed, but complete for frequent
pattern mining

Avoids costly database scans

Divide-and-conquer methodology

Avoids candidate generation
FP-tree
TID
100
200
300
400
500
Items bought
(ordered) frequent items
{f, a, c, d, g, i, m, p}
{f, c, a, m, p}
{a, b, c, f, l, m, o}
{f, c, a, b, m}
{b, f, h, j, o}
{f, b}
{b, c, k, s, p}
{c, b, p}
{a, f, c, e, l, p, m, n}
{f, c, a, m, p}
{}
Header Table
Item frequency head
f
4
c
4
a
3
b
3
m
3
p
3
f:4
c:3
c:1
b:1
a:3
m:2
b:1
p:1
b:1
min_support = 3
Drawbacks:
 Requires two database scans
 Rebuilding tree for every support count
 Memory utilization high
IMINE-PROPOSED SYSTEM

Covering index.
No constraints are enforced during the index
creation phase.

Efficiently exploited by various item set extraction
algorithms.

Physical organization supports efficient data
access during item set extraction.

Support item set extraction in large data sets.
System Flow Diagram
Creating I-Tree based on the FP-tree data structure
Creating I-Btree based on the B+Tree structure
Extraction task – Reading selected I-Tree portions.
Data access methods frequent-item,Support and Itembased projection
Designing IMine Physical organization to reduce I/O
Item set mining- Implementing FP-based and LCM
algorithms
Performance evaluation
MODULES:
Implementation of I-tree
I-Btree
IMine Data Access Methods
IMine Physical Organization
Item set mining using FP-based and LCM algorithms
Index Structure
Characterized by 2 components and provide 2
levels of indexing
 I-Tree (Itemset-Tree)
Prefix-tree based on FP-tree data structure.
Scans the database once.

I-Btree (Item-Btree)
Reading selected I-Tree portions during
extraction .
IMine
Parent pointer
First child pointer
Right brother pointer
I-Tree
IMine
I-Btree
I-TREE
I-Tree layers:
 Top layer
Very frequently accessed during the mining
process.
Nodes with high support are stored.

Middle layer
Quite frequently accessed during the mining
process.

Bottom layer
Rarely accessed during the mining process
Nodes with unitary support are stored.
Physical organization:
 Minimize the cost of reading the data needed for
the current extraction process
 Correlation types:
Intratransaction correlation
 I-Tree layers
Intertransaction correlation
 I-Tree path correlation
I/O analysis for index data access:
 Through I-Btree, block 3 is loaded in the buffer
cache.
 Following the node parent, block 1 is loaded
[p:3]→[d:5] →[h:7] →[e:7] →[b:10] is in
memory
 If the 2 blocks are still in the buffer cache,
reading other prefix path does not require
IMine data access method:
 Frequent-item based projection
Support projection-based algorithm
 FP-growth

Support-based projection
Support level-based and array-based algorithm
 Apriori and LCM v.2

Item-based projection
Load all transactions
Loading frequent-item based projected DB:
 Ex: item p appears in 2 nodes [p:3] , [p:2]

Starting from I-Btree and reading 2
prefix path for p
[p:3→d:5→h:7→e:7→b:10]
[p:2→i:2→h:3→e:3]
Loading Support-based projected DB:

Given the I-Tree ,subpaths between the I-Tree
roots and the first node with an infrequent item.

Reads a node subtree by means of a top-down
depth-first I-Tree visit exploiting both the node
child and brother pointers.
Item Set Mining
Step1:
 The needed index data is loaded
Step2:
 Item set extraction takes place on loaded data
I-MINE
I_BTree
LCM
IMINE -Execution Time
IMINE-Memory Usage
Software Specification
Operating system
: Windows XP/Vista
Language
: JDK 1.6.1 and above
Back End
: SQLServer2000
Conclusion
Provide a complete and compact representation of
transactional data
Supports different algorithmic approaches to item set
extraction
Performance better than the existing FP-growth ,
LCM v.2 algorithms.
Future Enhancements
Compact structure
distributions
suitable
Incremental update of the index
for
different
data
Thank You

Efficient Itemset Extraction Using IMine Index

Transcript Efficient Itemset Extraction Using IMine Index

Directory