Transcript Data mining

Chapter 7: The Future of Data Mining,
Warehousing, and Visualization
Modern Data Warehousing,
Mining, and Visualization: Core
Concepts
7-1: The Future of Data
Warehousing
• As a DW becomes a mature part of an organization,
it is likely that it will become as “transparent” as any
other part of the IS.
• One challenge is devising a workable set of rules
that ensure privacy while still facilitating the use
of large data sets.
• Another is the need to store unstructured data such
as multimedia, maps and sound.
• The growth of the Internet allows integration of
external data into a DW, but its varying quality is
likely to lead to the evolution of third-party
intermediaries whose purpose is to rate data quality.
Predicting the Future
• In a technology-intensive area, it doesn’t pay to get
too far ahead of the curve.
– “The past is the best prelude to history.”
• Old example: Josephson junctions.
– A switching element based on superconductivity, rendered useless by ICs.
• New Example: Quantum computing.
– We’re inventing clever algorithms for a device that
may well never exist.
The data explosion
• The amount of data stored in electronic
storage media increases at a fast pace
• UC Berkeley estimated that 5 Exabytes of
new data were generated in 2002
• Amount of data doubles every
18-24 months
• 1 Exabyte = 1 billion Gigabytes
• It took 300,000 years for
humans to accumulate 12
Exabytes of information; it
took only 2.5 more years for
the next 12 Exabytes.
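To make the doubling claim concrete, here is a minimal sketch that projects a data volume forward under a fixed doubling period. The function name `projected_exabytes` is hypothetical, and using the 5 exabytes generated in 2002 as the starting point is an illustrative assumption, not a figure the estimates above endorse.

```python
GIGABYTES_PER_EXABYTE = 10**9  # 1 exabyte = 1 billion gigabytes


def projected_exabytes(start_eb: float, months: float, doubling_months: float) -> float:
    """Volume after `months`, assuming it doubles every `doubling_months`."""
    return start_eb * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # Compare the fast (18-month) and slow (24-month) ends of the stated range.
    for period in (18, 24):
        eb = projected_exabytes(5.0, 48, period)
        print(f"doubling every {period} months: {eb:.1f} EB after 4 years")
```

The gap between the two ends of the 18-24 month range widens quickly, which is one reason such forecasts are usually quoted as ranges.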
A guide to collective names for
scientific units
• Kilo = 10^3
• Mega = 10^6
• Giga = 10^9
• Tera = 10^12
• Peta = 10^15
• Exa = 10^18 (a billion gigs)
• Zetta = 10^21
• Yotta = 10^24
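The prefixes above can be applied mechanically. The sketch below picks the largest prefix that fits a byte count; the helper name `with_prefix` and the lowercase prefix strings are illustrative choices, and decimal (power-of-ten) prefixes are assumed throughout.

```python
# Decimal prefixes, largest first, as (name, power of ten).
PREFIXES = [
    ("yotta", 24), ("zetta", 21), ("exa", 18), ("peta", 15),
    ("tera", 12), ("giga", 9), ("mega", 6), ("kilo", 3),
]


def with_prefix(n_bytes: int):
    """Return (scaled value, prefix name) using the largest prefix that fits."""
    for name, exponent in PREFIXES:
        if n_bytes >= 10**exponent:
            return n_bytes / 10**exponent, name
    return float(n_bytes), ""


if __name__ == "__main__":
    # A billion gigabytes is one exabyte.
    print(with_prefix(10**9 * 10**9))
```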
The data explosion
• In March 2007, an IDC study reported that 161 Exabytes
of new data were generated in the year 2006. At the
same time, 185 Exabytes of storage were available.
The data explosion
• In March 2008, another IDC study reported that, at 281
billion gigabytes (281 exabytes), the digital universe in
2007 was 10% bigger than originally estimated!
• http://www.emc.com/digital_universe
The data explosion
• According to the June 2009 update of the
Cisco Visual Networking Index IP traffic
forecast, by 2013, annual global IP traffic
will reach two-thirds of a zettabyte or 667
exabytes.
• Internet video will generate over 18 exabytes
per month in 2013.
• Global mobile data traffic will
grow at a CAGR of 131 percent
between 2008 and 2013,
reaching over two exabytes per
month by 2013.
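A compound annual growth rate (CAGR) is easy to misread, so here is a small worked sketch of what 131 percent per year implies over the five years from 2008 to 2013. The function names are hypothetical, and the implied 2008 baseline is derived from the forecast figures rather than quoted from the Cisco report.

```python
def cagr_growth_factor(rate: float, years: int) -> float:
    """Total multiplication after `years` of compounding at `rate` (1.31 = 131%)."""
    return (1.0 + rate) ** years


def implied_start(end_value: float, rate: float, years: int) -> float:
    """Starting value that would grow to `end_value` at the given CAGR."""
    return end_value / cagr_growth_factor(rate, years)


if __name__ == "__main__":
    factor = cagr_growth_factor(1.31, 5)
    print(f"growth factor over 5 years: {factor:.1f}x")
    # Assumes exactly 2 exabytes per month in 2013; the forecast says "over two".
    print(f"implied 2008 level: {implied_start(2.0, 1.31, 5):.3f} EB per month")
```

A 131 percent CAGR means traffic multiplies by 2.31 each year, which compounds to roughly a 66-fold increase over five years.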
Long-Lived Themes
• Very high-level query languages.
– If you are going to deal with very large amounts of
data, there has to be a lot of uniformity in what you
do.
– SQL-based user interfaces, like QBE in Access,
will be central to the future of data warehouses.
• Query optimization.
– The success of a very high-level language
depends on the ability to produce efficient
implementations.
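As a small illustration of both themes, the sketch below states a query declaratively in SQL and then asks the engine how it plans to execute it. It uses Python's built-in `sqlite3` module purely as a convenient stand-in for a warehouse engine; the `sales` table, its columns, and the index name are made up for the example.

```python
import sqlite3


def regional_totals(year: int):
    """Return (totals, plan) for a declarative aggregate over a toy sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("East", 2007, 100.0), ("West", 2007, 50.0),
         ("East", 2007, 25.0), ("East", 2006, 10.0)],
    )
    # An index gives the optimizer an alternative to scanning the whole table.
    conn.execute("CREATE INDEX idx_sales_year ON sales (year)")
    query = (
        "SELECT region, SUM(amount) FROM sales "
        "WHERE year = ? GROUP BY region ORDER BY region"
    )
    totals = conn.execute(query, (year,)).fetchall()
    # EXPLAIN QUERY PLAN reports the strategy the optimizer chose.
    plan = conn.execute("EXPLAIN QUERY PLAN " + query, (year,)).fetchall()
    conn.close()
    return totals, plan


if __name__ == "__main__":
    totals, plan = regional_totals(2007)
    print(totals)
    for step in plan:
        print(step)
```

The query says only what is wanted; whether the index is used is the optimizer's decision, which is the division of labor the two bullets describe.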
Some Good, New Directions
• Languages and systems for automating
the process of integrating databases.
– Everyone acts as if this problem were solved,
but it is not.
• Stream data collection and processing.
– Many applications where data whizzes by so
fast that storage and processing are limited.
– E.g., telecom billing, intrusion detection, etc.
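When data arrives too fast to store, one common approach is to keep only a bounded window of recent events and update the answer in a single pass. The sketch below is one such technique, a sliding-window event counter; the class name is hypothetical and timestamps are assumed to arrive in nondecreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events seen within the last `window_seconds`, in one pass."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()  # timestamps still inside the window

    def add(self, timestamp: float) -> int:
        """Record an event and return how many events are in the window."""
        self.events.append(timestamp)
        cutoff = timestamp - self.window
        # Discard events that have aged out; memory stays bounded by the window.
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events)


if __name__ == "__main__":
    counter = SlidingWindowCounter(window_seconds=10)
    for t in (0, 5, 10, 21):
        print(t, counter.add(t))
```

An intrusion detector might, for example, raise an alert when the count for one source address exceeds a threshold, without ever storing the full stream.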
More New Directions
• New kinds of data
– e.g., images, audio.
• Data mining:
– SAS Enterprise Miner-type GUIs
• Automation of database design and tuning.
• Exploiting new architectures:
– Parallel database machines.
– Peer-to-peer and distributed systems.
Integrated Architecture
• Historically, market and business forces
have moved organizations toward ineffective
nonintegrated DW systems.
• Far too often, a “silo” DW simply replaces a
silo OLTP system.
• To survive in a future world of low-cost,
turnkey application systems, the transition to
a federated architecture must be made.
Typical Nonintegrated
Information Architecture
[Figure labels: i2 Supply Chain; Oracle Financials; Siebel CRM; 3rd Party Data; Supply Chain Data Mart; Oracle Financial DW; Marketing DW; Subset Non-Architected Data Marts]
Federated Integrated Information
Architecture
[Figure labels: i2 Supply Chain; Oracle Financials; Siebel CRM; 3rd Party Data; Common Data Staging Area; Federated Supply Chain Data Mart; Federated Financial DW; Federated Marketing DW; Subset Non-Architected Data Marts]
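The role of a common data staging area can be sketched in a few lines: each source system's records are mapped to one shared schema before any mart or warehouse reads them. Everything specific below is hypothetical, including the source names used as keys, the field names, and the merge rule; it is a sketch of the idea, not of any vendor's product.

```python
# Hypothetical per-source mappings from local field names to a common schema.
FIELD_MAPS = {
    "supply_chain": {"cust_no": "customer_id", "qty": "quantity"},
    "financials": {"customer": "customer_id", "amt": "amount"},
    "crm": {"cid": "customer_id", "segment": "segment"},
}


def stage(source: str, record: dict) -> dict:
    """Translate one source record into the common staging schema."""
    mapping = FIELD_MAPS[source]
    staged = {common: record[local] for local, common in mapping.items() if local in record}
    staged["source"] = source
    return staged


def merge_by_customer(staged_rows: list) -> dict:
    """Combine staged rows into one integrated view per customer."""
    merged = {}
    for row in staged_rows:
        fields = {k: v for k, v in row.items() if k != "source"}
        merged.setdefault(row["customer_id"], {}).update(fields)
    return merged


if __name__ == "__main__":
    rows = [
        stage("supply_chain", {"cust_no": 42, "qty": 3}),
        stage("financials", {"customer": 42, "amt": 99.5}),
        stage("crm", {"cid": 42, "segment": "retail"}),
    ]
    print(merge_by_customer(rows))
```

Because every downstream store reads the same staged schema, the marts can agree on what a customer is, which is what the nonintegrated silo architecture cannot guarantee.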
Future
• The future of data warehousing is clearly
multi-faceted.
• There is a lot of blurring today with:
– CRM, Enterprise Systems and E-commerce
initiatives.
– Data warehousing is really becoming the method for
storing analytic-capable data for all these applications
and more, many of which are packaged.
• Architectures will need to be more tightly
integrated.
• E-commerce is cranking up data volumes.
7-2: Alternate Storage and the
Data Warehouse
• Surprisingly, the future of data warehousing
is not high-performance disk storage, but an
array of alternative storage.
• Involves two forms of alternative storage:
– Near-line storage involves an automated silo where
tape cartridges are handled automatically.
– Secondary storage, which is slower and less
expensive, such as CD-ROMs and floppy disks.
• Firms like Teradata, Inc., Storage Technology
Corp. [STC] and others specialize in high-volume
storage systems.
Speed and Capacity of Various Near-Line Storage Media
(WO = write once, WM = write many)
Device          | Capacity       | Data Access Speed           | Media Lifetime | WO/WM
DAT DDS2        | 4-8 Gbyte      | 510 Kbyte/s                 | 10-25 Yrs      | WM
DAT DDS3        | 12-24 Gbyte    | 1 Mbyte/s                   | 10-25 Yrs      | WM
CD-ROM          | 640 Mbyte      | X times 1.5 Mbits/s to read | 10 Yrs Plus    | WO
CD-RW           | 640 Mbyte      | X times 1.5 Mbits/s to read | 10 Yrs Plus    | WM
Exabyte         | 20-40 Gbyte    | 3-6 Mbyte/s                 | 10-25 Yrs      | WM
DLT [Tape]      | 35 Gbyte       | 5 Mbyte/s                   | 30 Yrs         | WM
DVD             | up to 15 Gbyte | Not Known                   | Not Known      | WO
DTF [Tape]      | 42 Gbyte       | 12 Mbyte/s                  | 10-25 Yrs      | WM
Data D3         | 50 Gbyte       | 12 Mbyte/s                  | 10-25 Yrs      | WM
DVD-RAM         | up to 3 Gbyte  | Not Known                   | Not Known      | WM
Magneto-optical | 2.6-5.6 Gbyte  | Not Known                   | Not Known      | WM
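One way to read the capacity and speed figures together is to ask how long it takes to stream an entire cartridge. The sketch below does that arithmetic for three of the tape formats listed; it assumes decimal units (1 Gbyte = 1000 Mbyte) and sustained transfer at the quoted rate, both of which are simplifications, and the function name is hypothetical.

```python
def full_read_hours(capacity_gbyte: float, speed_mbyte_per_s: float) -> float:
    """Hours to stream a full cartridge at the quoted sustained rate."""
    seconds = capacity_gbyte * 1000.0 / speed_mbyte_per_s
    return seconds / 3600.0


if __name__ == "__main__":
    # (capacity in Gbyte, speed in Mbyte/s) taken from the figures listed above.
    media = {"DLT": (35, 5), "DTF": (42, 12), "Data D3": (50, 12)}
    for name, (capacity, speed) in media.items():
        print(f"{name}: {full_read_hours(capacity, speed):.2f} hours")
```

Times on the order of an hour or two per cartridge are acceptable for long sequential scans of stable data, but not for operational access.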
Typical Near-Line Tape
Storage Silo
[Figure panels: main view of the tape silo; close-up of tape storage carousel; robotic tape retrieval arm; view through silo entry door]
Why Use Alternative Storage?
1. The data in a DW are stable. They are placed there
once and left alone, so do not need to be updated
at high speed.
2. The queries that operate on the DW data often
require long streams of data stored sequentially.
Operational access requires different units of data
from different storage areas.
3. The DW is of indeterminate size and is always
increasing in volume, requiring flexible capacity.
4. When data gets accessed less often as it ages, it
can be moved to secondary storage, making
access to newer data more efficient.
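Point 4 amounts to an age-based placement policy. The following is a minimal sketch of such a policy; the tier names echo the storage forms discussed in this section, but the 90-day and 365-day thresholds and the function name `choose_tier` are purely illustrative assumptions.

```python
from datetime import date


def choose_tier(last_access: date, today: date,
                nearline_after_days: int = 90,
                secondary_after_days: int = 365) -> str:
    """Pick a storage tier from the number of days since the data was last read."""
    age_days = (today - last_access).days
    if age_days >= secondary_after_days:
        return "secondary"
    if age_days >= nearline_after_days:
        return "near-line"
    return "disk"


if __name__ == "__main__":
    today = date(2009, 6, 30)
    for last in (date(2009, 6, 1), date(2009, 1, 15), date(2007, 3, 1)):
        print(last, "->", choose_tier(last, today))
```

In practice the thresholds would be tuned from observed access patterns rather than fixed in advance.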
7-3: Trends in Data Warehousing
• Customer interaction and learning relationships
require capturing information “everywhere” and
massive scalability.
• Enterprise applications generate data that is
doubling every 9-12 months.
• The time available for working with data is
shrinking and the need for 24×7 access is
becoming the norm.
• Fast implementation and ease of management are
becoming more and more important.
• In the future, more organizations will build Web
applications that operate in conjunction with the
DW.
7-4: The Future of Data Mining
As promising as the field may be, it has
pitfalls:
– The quality of data can make or break the data
mining effort.
– In order to mine the data, companies first have to
integrate, transform and cleanse it.
– To obtain value from data mining, organizations
must be able to change their mode of operation
and maintain the effort (agile corporations).
– Finally, there are concerns about privacy.
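To give a feel for the transform-and-cleanse step, here is a minimal sketch that normalizes fields, drops records missing a key value, and removes duplicates. The record layout (`name`, `email`) and the choice of email as the matching key are assumptions made for the example; real cleansing rules depend on the data.

```python
def cleanse(rows: list) -> list:
    """Normalize names and emails, drop rows without an email, remove duplicates."""
    seen = set()
    clean = []
    for row in rows:
        email = (row.get("email") or "").strip().lower()
        # Collapse runs of whitespace, then standardize capitalization.
        name = " ".join((row.get("name") or "").split()).title()
        if not email or email in seen:
            continue
        seen.add(email)
        clean.append({"name": name, "email": email})
    return clean


if __name__ == "__main__":
    raw = [
        {"name": "  ada   lovelace ", "email": "ADA@Example.com "},
        {"name": "Ada Lovelace", "email": "ada@example.com"},
        {"name": "No Email", "email": None},
    ]
    print(cleanse(raw))
```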
Personalization versus
Privacy
• Companies that use data mining for target marketing
walk a tightrope between personalization and
privacy.
• Implementation of the recent FTC guidelines about
information practices can be a problem since
companies often do not know how they will use
information ahead of time. Signed releases from
customers are increasingly required.
• Further, technology appears to create new ways to
acquire information faster than the legal system can
handle the ethical and property issues.
7-5: Using Data Mining to
Protect Privacy
• While Internet use has grown, so have the problems
of network intrusion.
• One current intrusion detection technique is
misuse detection – scanning for malicious
activity patterns identified by known signatures.
• Another technique is anomaly detection, which
attempts to identify malicious activity
based on deviations from norms.
• Most intrusion detection systems operate by the
signature approach.
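The two techniques can be contrasted in a few lines of code. In this sketch, misuse detection is a substring match against a list of known signatures, and anomaly detection flags a measurement that lies more than a chosen number of standard deviations from a baseline. The signature strings, the threshold of 3, and the function names are all illustrative assumptions, not a description of any particular product.

```python
from statistics import mean, pstdev

# Hypothetical signatures of known malicious requests.
SIGNATURES = ("/etc/passwd", "' OR '1'='1")


def misuse_detect(payload: str) -> bool:
    """Signature approach: flag the payload if it contains a known pattern."""
    return any(signature in payload for signature in SIGNATURES)


def anomaly_detect(baseline: list, observed: float, threshold: float = 3.0) -> bool:
    """Anomaly approach: flag values far from the baseline norm (z-score test)."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold


if __name__ == "__main__":
    print(misuse_detect("GET /etc/passwd"))
    requests_per_minute = [100, 102, 98, 101, 99]
    print(anomaly_detect(requests_per_minute, 150))
```

The signature check can only find what is already on its list, while the anomaly check needs no list but depends entirely on how well the baseline represents normal behavior.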
Shortfalls of Current Detection
Schemes
• Variants – although signature lists are updated
frequently, minor changes in the “exploit code” can
produce a “new undetected” intruder.
• False positives – a detection system may be too
conservative and declare an intrusion when there is
none.
– E.g., Intruder scoring techniques for email …
• False negatives – an intrusion won’t be detected
until a signature has been identified.
• Data overload – as traffic grows, the ability to find
new hacks becomes harder and harder.
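False positives and false negatives can be quantified once alerts are compared against ground truth. The sketch below computes both rates from two parallel lists of booleans; the function name and the toy data are assumptions for illustration.

```python
def error_rates(actual: list, flagged: list):
    """Return (false positive rate, false negative rate).

    `actual[i]` is True if event i really was an intrusion;
    `flagged[i]` is True if the detector raised an alert for it.
    """
    false_pos = sum(1 for a, f in zip(actual, flagged) if f and not a)
    false_neg = sum(1 for a, f in zip(actual, flagged) if a and not f)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    fp_rate = false_pos / negatives if negatives else 0.0
    fn_rate = false_neg / positives if positives else 0.0
    return fp_rate, fn_rate


if __name__ == "__main__":
    actual = [True, True, False, False, False, False]
    flagged = [True, False, True, False, False, False]
    print(error_rates(actual, flagged))
```

Tuning a detector usually trades one rate against the other, so both need to be tracked together.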
7-6: Trends Affecting the Future of
Data Mining
• While the available data increases exponentially,
the number of new data analysts graduating each
year has been fairly constant. Either a lot of data
will go unanalyzed or automatic procedures will
be needed.
• Increases in hardware speed and capacity make
it possible to analyze data sets that were too
large just a few years ago.
• The next generation Internet will connect sites
100 times faster than current speeds.
• To be more profitable, businesses will need to
react more quickly and offer better service, and
do it all with fewer people and at a lower cost.
7-7: The Future of Data
Visualization
• Weapons performance and safety –
– Data visualization coupled with simulation models can show how
weapons perform under typical conditions and the effect of
weapons aging.
• Medical trauma treatment –
– Today’s surgeons use computer vision to assist in surgery. In
the future this trend suggests that local medical personnel can
also be assisted from afar by specialists through telepresence.
– X-ray transmission resolution now at acceptable limits
[Figure: Augmented-reality headset worn by surgeon]
7-8: Components of Future
Visualization Applications
• The data visualization environment links the
critical components and enables the smooth
flow of information among the components.
• In the future, the bounds between
computers, graphics and human knowledge
will become more blurred.
• Many advances in technology will be needed to
handle the visualization environment of the
future.
• Intelligent file systems and data management
software will contend with thousands of
coupled storage devices.
In Conclusion
• Data explosion: recent years have seen a dramatic
increase in the amount of information stored in
electronic format.
• It has been estimated that the amount of information
in the world doubles every 20 months and the size
and number of databases are increasing even faster.
• Data and information are crucial for decision making,
especially in business operations. As a prominent
top manager said,
– "Whoever has information fastest and uses it wins"
[Watterson K., from BYTE].
Future Vision
• The objective of taking a view on the future is
not so much about trying to guess lottery
numbers; it is about combining the past and
the present with what we think is likely to
occur.
• That way we believe we are able to forecast
with some accuracy.
• Predicting the future is like predicting the
weather: unexpected events will occur, and
geniuses have a habit of seeing things
differently, leading to major shifts in the
way things are done.
Accounting and DW
• http://ledgerism.net/datamart.htm
• http://www.finance.state.mn.us/agencyapps/training/ia/ia150s_accounting.pdf
• http://www.geocities.com/SiliconValley/Horizon/9144/artdb003.html