Enhancing Data Analysis with Noise Removal Hui Xiong, Member


Kamiya Chaudhary
Daniel Green
Noise – irrelevant, inconsistent, duplicate, and
missing data
Introduced into the system by hardware failures,
programming errors, and gibberish input
Impact - unnecessary growth in stored data;
can result in wrong analysis because incorrect data
is evaluated
Ideally, the data used for analysis should contain
only relevant data so that the results are accurate.
However, in the real world it is practically impossible
to have a system with no noisy data. Therefore,
techniques have been proposed to remove as much
noisy data as possible and so improve the
quality of the analysis.
The goal is to remove the data that hinders the
analysis
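As a rough illustration of removing two of the noise types named above (duplicate and missing data), here is a minimal Python sketch. The function name `clean_records`, the record layout, and the rule that two records are duplicates when all required fields match are assumptions made for this example, not something the slides specify.

```python
def clean_records(records, required_fields):
    """Drop records with missing required fields, then drop duplicates.

    Two records count as duplicates when they agree on every required
    field (an assumption made for this sketch).
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Missing data: a required field is absent, None, or empty.
        if any(rec.get(f) in (None, "") for f in required_fields):
            continue
        key = tuple(rec[f] for f in required_fields)
        # Duplicate data: an identical key has already been kept.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


raw = [
    {"id": 1, "city": "Austin"},
    {"id": 1, "city": "Austin"},   # duplicate
    {"id": 2, "city": ""},         # missing value
    {"id": 3, "city": "Boston"},
]
cleaned = clean_records(raw, ["id", "city"])
```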




Distance Based
Density Based
Clustering Based
HCleaner
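To make the first technique in the list concrete, the following is a minimal sketch of one common distance-based approach: score each point by the distance to its k-th nearest neighbor and drop the highest-scoring points. The function names, the Euclidean distance, and the parameters `k` and `n_remove` are assumptions for illustration; the slides do not say which variant was used.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores


def remove_distance_noise(points, k, n_remove):
    """Return the points left after dropping the n_remove highest scores."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    noise_idx = set(ranked[:n_remove])
    return [p for i, p in enumerate(points) if i not in noise_idx]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kept = remove_distance_noise(points, k=2, n_remove=1)
```

In this toy data the isolated point (10, 10) has the largest k-th neighbor distance, so it is the one removed.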
HCleaner: Hyperclique-Based Data Cleaner
Eliminates data objects that are not tightly
connected to other data objects in the data set.
Hyperclique patterns are generated instead of
frequent patterns.
A hyperclique pattern contains items that are
strongly correlated with each other.
Hyperclique patterns                                               Support   H-confidence
{earrings, gold ring, bracelet}                                    0.019%    45.8%
{nokia battery, nokia adapter, nokia wireless phone}               0.049%    52.8%
{coffee maker, can opener, toaster}                                0.014%    61.5%
{baby bumper pad, diaper stacker, baby crib sheet}                 0.028%    72.7%
{skirt tub, 3pc bath set, shower curtain}                          0.26%     74.4%
{jar cookie, container 3pc, box bread, soup tureen, goblets 8ps}   0.012%    77.8%
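The two measures in the table can be computed as follows. This is a minimal sketch that assumes the usual definitions: support is the fraction of transactions containing every item of the pattern, and h-confidence is the pattern's support divided by the largest support of any single item in it. The `baskets` data is made up for illustration and is unrelated to the figures in the table.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def h_confidence(itemset, transactions):
    """supp(pattern) / max over items of supp({item})."""
    max_single = max(support({item}, transactions) for item in itemset)
    if max_single == 0:
        return 0.0
    return support(itemset, transactions) / max_single


# Hypothetical market baskets, invented for this example.
baskets = [
    {"earrings", "gold ring", "bracelet"},
    {"earrings", "gold ring", "bracelet"},
    {"earrings", "milk"},
    {"milk", "bread"},
    {"milk"},
]
jewelry = {"earrings", "gold ring", "bracelet"}
jewelry_support = support(jewelry, baskets)       # 2 of 5 baskets
jewelry_hconf = h_confidence(jewelry, baskets)    # 0.4 / 0.6
```

A pattern can have very low support and still a high h-confidence, which is why the table shows supports well below 1% next to h-confidence values of 45% and more.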





Generates hyperclique patterns
Eliminates objects that are not part of any
hyperclique pattern at the chosen h-confidence threshold
H-confidence is a measure that reflects the
overall affinity among the items within a
pattern
The higher the h-confidence threshold, the
more closely related the objects in a pattern are
Uses size-3 hyperclique patterns rather than size-2 patterns because
an irrelevant object can be strongly correlated with one other irrelevant object.
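The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is binary, each object is represented by the set of attribute ids it has, objects play the role of items when patterns are formed, and no minimum support threshold is applied. The brute-force enumeration of triples stands in for a real hyperclique miner, and the names `object_h_confidence` and `hcleaner_sketch` are hypothetical.

```python
from itertools import combinations


def object_h_confidence(objects, data):
    """h-confidence of a set of objects.

    Equals (attributes shared by all objects) / (largest attribute
    count of any single object); the total attribute count cancels.
    """
    common = set.intersection(*(data[o] for o in objects))
    largest = max(len(data[o]) for o in objects)
    return len(common) / largest if largest else 0.0


def hcleaner_sketch(data, hc_threshold):
    """Split objects into (kept, noise) using size-3 hyperclique patterns."""
    covered = set()
    # Brute force over all triples: fine for a toy example only.
    for triple in combinations(sorted(data), 3):
        if object_h_confidence(triple, data) >= hc_threshold:
            covered.update(triple)
    noise = set(data) - covered
    return covered, noise


# Hypothetical objects described by the attribute ids they contain.
data = {
    "a": {1, 2, 3},
    "b": {1, 2, 3, 4},
    "c": {1, 2, 3},
    "d": {7, 8},
    "e": {2, 9},
}
kept, noise = hcleaner_sketch(data, hc_threshold=0.5)
```

Here only the triple (a, b, c) reaches the threshold, so d and e are labeled as noise without the caller having to say how many objects to remove.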



Therefore, an object appearing in a size-3
pattern has at least 2 other
objects with a guaranteed pairwise
similarity to it
The computational cost comes from generating
the size-3 hyperclique patterns
The data to be labeled as noise is not an input to
the algorithm but a result of it.
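The pairwise guarantee mentioned above can be made explicit for binary data with cosine similarity. The symbols below are assumptions introduced for this note: P is a hyperclique pattern, i and j are any two of its members, supp is support, hconf is h-confidence, and h_c is the h-confidence threshold.

```latex
\cos(i, j)
  = \frac{\mathrm{supp}(\{i, j\})}{\sqrt{\mathrm{supp}(\{i\})\,\mathrm{supp}(\{j\})}}
  \;\ge\; \frac{\mathrm{supp}(P)}{\max_{k \in P} \mathrm{supp}(\{k\})}
  = \mathrm{hconf}(P)
  \;\ge\; h_c
```

The first inequality holds because {i, j} is a subset of P, so its support is at least supp(P), and because the geometric mean of two supports cannot exceed the largest single-member support in P.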



Noise reduction is an essential step in data
cleansing
Data cleansing contributes to accurate
analysis results
As proposed in the paper, HCleaner does not
need to go beyond size-3 hyperclique patterns, so
no combinatorial growth of the pattern space is
involved, which makes it an efficient and scalable
algorithm.



http://wwwusers.cs.umn.edu/~kumar/papers/noise_removal_tkde.pdf
http://www.mimuw.edu.pl/~son/datamining/DM/4preprocess.pdf
http://books.google.com/books?id=gdkY4QHy0XIC&pg=PA887&lpg=PA887&dq=noise+reduction+data+warehouse&source=bl&ots=280EhMs0dB&sig=1TKs37HYn9LFlqqPCoHvAHGTKY&hl=en&sa=X&ei=lstaVPidJ8f6oQT8goLYCw&ved=0CFYQ6AEwBg#v=onepage&q=noise%20reduction%20data%20warehouse&f=false
Kamiya Chaudhary
Daniel Green


Problem statement - Dormant accounts are scattered across banks. It
would be beneficial if the data were centralized.
Proposed solution - Build a centralized system for all
dormant bank accounts that can be used by
various other systems.



One centralized location for all information
Easier to search
Answers various questions and may also lead
to new questions about where dormant
accounts exist.



Collect data from various systems
Data cleansing
Generate snowflake schema
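A snowflake schema for this proposal might look roughly like the sketch below, built with Python's standard `sqlite3` module. Every table and column name here is hypothetical; the slides only state that a snowflake schema is generated. The fact table references a branch dimension, which is itself normalized into bank and region tables, and that normalization of a dimension is what makes the design a snowflake rather than a star.

```python
import sqlite3

# Hypothetical schema: all names are assumptions for illustration.
SCHEMA = """
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE dim_bank (bank_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE dim_branch (
    branch_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    bank_id INTEGER NOT NULL REFERENCES dim_bank(bank_id),
    region_id INTEGER NOT NULL REFERENCES dim_region(region_id)
);
CREATE TABLE fact_dormant_account (
    account_id INTEGER PRIMARY KEY,
    branch_id INTEGER NOT NULL REFERENCES dim_branch(branch_id),
    balance REAL NOT NULL
);
"""


def build_warehouse():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def dormant_totals_by_region(conn):
    """Return (region, account count, total balance) rows."""
    return conn.execute(
        """
        SELECT r.name, COUNT(*), SUM(f.balance)
        FROM fact_dormant_account AS f
        JOIN dim_branch AS b ON b.branch_id = f.branch_id
        JOIN dim_region AS r ON r.region_id = b.region_id
        GROUP BY r.name
        ORDER BY r.name
        """
    ).fetchall()


conn = build_warehouse()
conn.execute("INSERT INTO dim_region VALUES (1, 'North'), (2, 'South')")
conn.execute("INSERT INTO dim_bank VALUES (1, 'Example Bank')")
conn.execute(
    "INSERT INTO dim_branch VALUES (1, 'Main St', 1, 1), (2, 'Harbor', 1, 2)"
)
conn.execute(
    "INSERT INTO fact_dormant_account VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 25.0)"
)
totals = dormant_totals_by_region(conn)
```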
Daniel Green
Kamiya Chaudhary






Motivation
Background
Problem statement
Proposed solution
Benefits
Schedule


Make customers aware of unclaimed assets
Help businesses identify and prevent dormant
accounts
When a customer sets up a bank account, there is
a potential for the customer to abandon the
account. Accounts with a positive balance
become dormant after a certain period of time.
Assets inside a dormant account cannot
legally be utilized by the business.
Build an application to help identify where
accounts have gone dormant.
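The dormancy rule described above (a positive balance plus a period with no activity) can be expressed as a small check. This is a sketch under assumptions: the `Account` fields are hypothetical, and the 365-day period is a placeholder, since the slides say only "a certain period of time" and the real period depends on the bank and jurisdiction.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Placeholder value: the actual dormancy period is not given in the slides.
ASSUMED_DORMANCY_PERIOD = timedelta(days=365)


@dataclass
class Account:
    account_id: str
    balance: float
    last_activity: date


def is_dormant(account, today, period=ASSUMED_DORMANCY_PERIOD):
    """True when the balance is positive and the account has been idle
    for at least the given period."""
    idle = today - account.last_activity
    return account.balance > 0 and idle >= period


accounts = [
    Account("A-1", 120.0, date(2012, 3, 1)),   # idle, positive balance
    Account("A-2", 0.0, date(2012, 3, 1)),     # idle, but empty
    Account("A-3", 75.0, date(2014, 10, 1)),   # recently active
]
today = date(2014, 11, 1)
dormant_ids = [a.account_id for a in accounts if is_dormant(a, today)]
```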



Understand basic facts about dormant
accounts
Identify the areas where dormant accounts
exist according to factors such as
geography, bank, bank branch, and amount.
Provide better answers to questions in
the future
Week          Progress
1.5 weeks     Data warehouse consolidation, including data cleansing
2 weeks       Developing data mining system and front end
0.5 week      Testing




https://www.ki.informatik.huberlin.de/mac/lehre/lehrmaterial/Informationsintegration/Rahm00.pdf
http://www.edmontonjournal.com/news/unclaimed-bank-accounts/index.html
http://www.helpwithmybank.gov/getanswers/bank-accounts/inactive-accounts/bankaccounts-inactive-accounts-quesindx.html
http://www.helpwithmybank.gov/getanswers/bank-accounts/inactive-accounts/faqbank-accounts-inactive-accounts-01.html