Data Quality, Data Cleaning and Treatment of Noisy Data

DIMACS Workshop
November 3-4, 2003
Organizer: Tamraparni Dasu, AT&T Labs - Research
Why DQ?
• Data quality problems are expensive
– DQ problems cost hundreds of billions of dollars each year
• Lost revenues, credibility, customer retention
– Resolving data quality problems is often the biggest
effort in a data mining study.
• 50%-80% of time in data mining projects spent on DQ
– Interest in streamlining business operations
databases to increase operational efficiency (e.g.
cycle times), reduce costs, and conform to legal requirements
The Data Quality Continuum
• Data/information is not static; it flows through a data
collection and usage process
– Data gathering
– Data delivery
– Data storage
– Data integration
– Data retrieval
– Data mining/analysis
• Problems can and do arise at all of these stages
• End-to-end, continuous monitoring needed
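One way to picture end-to-end monitoring is to count records at each stage of the continuum and flag any hand-off that loses more than a tolerated fraction. The sketch below is only an illustration, not a method from the workshop: the function name `flag_losses`, the `tolerance` parameter and the record counts are all hypothetical.

```python
from typing import Dict, List, Tuple

# Stage names taken from the data quality continuum above.
STAGES = ["gathering", "delivery", "storage",
          "integration", "retrieval", "analysis"]


def flag_losses(counts: Dict[str, int],
                tolerance: float = 0.01) -> List[Tuple[str, str, float]]:
    """Return (from_stage, to_stage, loss_fraction) for every hand-off
    between consecutive stages that loses more than `tolerance` of its records."""
    flagged = []
    for prev, curr in zip(STAGES, STAGES[1:]):
        before, after = counts[prev], counts[curr]
        if before == 0:
            continue  # nothing to lose at this hand-off
        loss = (before - after) / before
        if loss > tolerance:
            flagged.append((prev, curr, loss))
    return flagged


# Hypothetical record counts observed at each stage.
counts = {"gathering": 1000, "delivery": 1000, "storage": 995,
          "integration": 900, "retrieval": 900, "analysis": 900}

for prev, curr, loss in flag_losses(counts):
    print(f"{prev} -> {curr}: lost {loss:.1%} of records")
```

A count check like this catches only one kind of problem (dropped records), and a drop at integration may be legitimate, for example after de-duplication. In practice each stage would need its own checks.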
Technical Approaches
• Need a multi-disciplinary approach
– No single approach solves all problems
• Process management
– Pertains to data process and flows
– Checks and controls, audits
• Database
– Storage, access, manipulation and retrieval
• Metadata / domain expertise
– Interpretation and understanding
• Analysis – Data Mining, Statistics
– Analysis, diagnosis, model fitting, prediction, decision making …
Meaning of Data Quality –1
• Conventional definitions: completeness,
uniqueness, consistency, accuracy, etc. Are these
measurable?
• Modernize the definition of DQ with respect to the
DQ continuum
• Depends on data paradigms (data gathering,
– Federated, High dimensional, Descriptive,
Longitudinal, Streaming, Web (scraped), Numeric,
Text data
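The slide above asks whether the conventional definitions are measurable. As a minimal sketch, two of them (completeness and uniqueness) can be turned into simple ratios over a set of records. The formulas, function names and toy records below are assumptions chosen for illustration; they are one possible operationalization, not a definition given in the talk.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Optional[Any]]


def completeness(records: List[Record], field: str) -> float:
    """Fraction of records whose `field` is present and not None."""
    if not records:
        return 1.0  # convention: an empty set has no missing values
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def uniqueness(records: List[Record], key: str) -> float:
    """Fraction of non-missing `key` values that are distinct."""
    values = [r[key] for r in records if r.get(key) is not None]
    if not values:
        return 1.0
    return len(set(values)) / len(values)


# Hypothetical toy records: one missing zip code, one duplicated id.
customers = [
    {"id": 1, "zip": "07932"},
    {"id": 2, "zip": None},
    {"id": 2, "zip": "08854"},
    {"id": 4, "zip": "08854"},
]

print(completeness(customers, "zip"))  # 0.75
print(uniqueness(customers, "id"))     # 0.75
```

Consistency and accuracy are harder to score this way, since they need rules or a reference source to compare against.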
DQ Meaning - 2
• Depends on applications (delivery, integration, analysis)
– Business operations, Aggregate analysis, prediction
– Customer relations …
• Data Interpretation
– Know all the rules used to generate the data
• Data Suitability
– Use of proxy data
– Relevant data is missing
Increased DQ → increased reliability and usability
(directionally correct)
• Talks cover different aspects of the
complex DQ issue
• Outstanding set of speakers from
academia, industrial labs and industry
• Cover theoretical, methodological,
applied aspects – case studies!
• From a wide range of disciplines
Renee Miller
• University of Toronto
• Renee is an Associate Professor of
Computer Science at the University of
Toronto. S.B., Mathematics, MIT. S.B.,
Cognitive Science, MIT. Ph.D., Computer
Science, U. Wisconsin-Madison.
• Heterogeneous databases, data mining,
and data warehousing.
• “Managing Inconsistency in Data
Exchange and Integration”
Grace Zhang
• Morgan Stanley Institutional Equity Division
IT. Master of Philosophy in Computer
Science from Columbia University, and a
Master and B.S. in Computer Science from
Zhongshan University, China.
• Develop tools to check data quality issues in
equity trading data, design and build the
standard destination referential data
• “Data Quality in Trading Surveillance”
Ted Johnson
• AT&T Labs – Research
• Database Research department. B.S. in
Mathematics, Johns Hopkins University,
Ph.D. in Computer Science, New York
University, 1990.
• Data warehousing and data mining
• “Bellman - A Data Quality Browser”
Ron Pearson
• Daniel Baugh Institute for Functional
Genomics and Computational Biology,
Thomas Jefferson University. B.S. in physics
from the University of Arkansas at Monticello
and M.S.E.E. and PhD in electrical
engineering from M.I.T. in 1982.
• Design and analysis of nonlinear digital
filters, exploratory data analysis and the
validation of analytical results.
• “The Data Cleaning Problem -- Some Key
Issues and Practical Approaches”
Dhammika Amaratunga, Javier
Cabrera, Nandini Raghavan
• Johnson & Johnson, Rutgers
University, Johnson & Johnson
• “Pre-processing of Microarray
S. Muthukrishnan
• Rutgers University, AT&T Labs – Research
• Associate Professor of Computer Science
• Design and analysis of algorithms
• “Checks and Balances: Monitoring
Data Quality Problems in Network
Traffic Databases”
T. Bonates, P. Hammer, A.
Kogan, and I. Lozina
• RutCOR, Rutgers University
• Operations Research
• “Maximum Patterns and Outliers in
the Logical Analysis of Data (LAD)”
Jiawei Han
• Professor, Simon Fraser University. Currently
at the University of Illinois at Urbana-Champaign. Ph.D. from
the University of Wisconsin-Madison in 1985.
• Data mining (knowledge discovery in
databases), data warehousing, spatial
databases, multimedia databases, deductive
and object-oriented databases, and logic
• “Data Mining: A Powerful Tool for Data
Jon Hill
• British Telecommunications
• Jon leads a team of information experts
to deliver solutions within asset
management, process control and
billing assurance. Jon uses a wide range
of information quality tools within
projects and has extensive experience
in investigation and solving IQ
• “A $220 Million Success Story”
G. Vesonder, J. Wright & T. Dasu
• AT&T Labs - Research
• Head of Adaptive Systems
• AI, Knowledge Engineering, Expert
• “Life Cycle Datamining”
Andrew Hume
• AT&T Labs – Research
• Very large data systems, string
searching, performance
• Tamed many legacy systems
• “Managing Data Streams”
Bing Liu
• Associate Professor at the National
University of Singapore, on leave at the
University of Illinois at Chicago
• Data mining and knowledge
discovery; web, text and image
mining; Bioinformatics
• Web page cleaning for web data
R.K. Pearson and M. Gabbouj
• Collaboration with Moncef Gabbouj
from the Tampere University of
Technology in Finland.
• “Relational Nonlinear FIR Filters”
Thank you!