Data Mining Engineering
Data Analysis for Decision and
Management Processes
Univ.-Prof. Dr. Peter Brezany
Institute of Scientific Computing
Faculty for Information Science
University of Vienna
E-mail : [email protected]
WWW: http://www.par.univie.ac.at/~brezany
http://artemis.wszib.edu.pl/~brezany/
P.Brezany
Institute of Scientific Computing – University of Vienna
Institute of Scientific Computing –
Research Profile
The primary objectives of the Institute are
- to conduct research in high-performance advanced
data analysis, knowledge management,
programming languages,
compilers, programming environments and software
tools for high performance computing systems,
- to actively contribute to a transfer of technology to
industry
- to disseminate knowledge in the fields of parallel and
distributed computing and software technology
Institute for Software Science –
Main Research Projects and Cooperations
Participation in 14 EU projects (coordination of 1 project)
The European Centre of Excellence for Parallel Computing,
a department of the Institute, founded by the EU
Coordination of the CEI-PACT project (Austria, Slovakia,
Czech Republic, Poland, Italy, Hungary, Slovenia)
Special Research Program AURORA of the Austrian Science
Fund (1997-2007)
Many international cooperations (NASA, CalTech, CERN, ...)
New Research Field: GRID COMPUTING
The Grid – a new distributed computing infrastructure for science
and engineering.
The Grid consists of physical resources (computers, disks, networks,
databases, sensors, laboratory equipment) and “middleware“ software
that ensures access to and the coordinated use of such resources.
Media That Radically Influenced Society
1500s: Printing Press
1840s: Penny Post
1850s: Telegraph
1920s: Telephone
1930s: Radio
1950s: TV
1990s: Web
20xx: Grid
Outline
• Business Intelligence, knowledge management
• Relation: data, information, knowledge
• Knowledge discovery process – System Architectures
• Data warehousing and data webhousing
• Data preparation:
– selection, preprocessing (cleaning, transformation), integration
• Data mining techniques
– association rules, sequences, classification, prediction, neural
networks, clustering, meta-learning
• Advanced topics
– Multi-agent and mobile agent systems
– Web mining
– intelligent search engines
– semantic web
– information and knowledge management on computing grids
– security issues
Basic Literature
Mark and Mary Whitehorn: Business Intelligence:
The IBM Solution. Springer-Verlag, 2000.
R. Kimball: The Data Warehouse Toolkit. John Wiley, 1996.
J. Han, M. Kamber: Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers, 2000.
M. Ester, J. Sander: Knowledge Discovery in Databases.
Springer-Verlag, 2000 (in German).
I.H. Witten, E. Frank: Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations.
Morgan Kaufmann Publishers, 2000.
Time Schedule
• Monday, Feb 27: 17.15 – 20.30 (4 hours)
• Tuesday, Feb 28: 10.00 – 13.15 (4 hours)
• Wednesday, Mar 01: 15.30 – 18.45 (3 hours)
• Thursday, Mar 02: 16.00 – 18.15 (3 hours)
Location: s.1 AK4
Business Intelligence
Definition:
Business Intelligence is an umbrella term, broadly covering the
processes involved in extracting valuable business information
and knowledge from the mass of data that exists within a typical
enterprise, and knowledge management (knowledge storage in
an appropriate form and knowledge distribution).
What is meant by information and knowledge? This is best understood
by imagining a chain linking data → information → knowledge.
Data → Information → Knowledge
• Data are the facts about events or processes.
• Information is the organization of, associations between, and
constraints upon data that allow it to be used by a user or a
machine.
• Knowledge is the interpretation of information and its use in a
problem solving context. Knowledge can lead to new insights,
which in turn lead to new innovations and ultimately to wealth
creation and improvements in the quality of life.
• Wisdom arises when one understands the foundational
principles responsible for the patterns representing knowledge.
(She or he can answer questions like Why ... ? and knows how to
find or derive new knowledge.)
Data
Example: When a customer visits a gas station and buys
petrol, it is possible to describe this transaction with the
following data: date/time, volume, price.
However, these data do not say why this customer chose
this station and not another, and it is not possible to find
out from these data whether he will come again, or whether this
station is good or bad.
Data alone possess almost no meaning or purpose. They are
the base material for obtaining information.
Information
• A piece of information can be described as a
message.
• Like all messages, information has a sender and a
receiver.
• Information is meant to shape the receiver's opinion of
or attitude to a problem and to influence his
behavior.
• We can also think of information as data that
changes, forms, or influences something.
• The word "inform" originally meant "to give
form to a thing or person".
Information (2)
• Data become information when the receiver adds some
meaning to the data. Such data upgrading can be done in
different ways, for example:
– Contextualizing: We know for what purpose the data was collected.
– Calculation: The data can be mathematically analyzed and statistically
enriched.
– Correction: Errors are removed from the data.
– Condensing: The data is transformed into a more compact form; the main
components of the data have to be identified.
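The upgrading steps above can be sketched in a few lines of Python, using the petrol station transactions as input. This is only an illustration: the records, the field layout, and the validity rule (volume and price must be positive) are hypothetical and not taken from the lecture.

```python
from statistics import mean

# Hypothetical raw transactions: (date/time, litres, price paid).
raw = [
    ("2006-02-27 08:10", 40.0, 44.0),
    ("2006-02-27 09:30", -5.0, 5.5),   # impossible volume: a recording error
    ("2006-02-27 17:45", 30.0, 33.0),
    ("2006-02-28 07:55", 50.0, 55.0),
]

# Contextualizing: record the purpose for which the data was collected.
context = "daily sales report for one station"

# Correction: drop records that cannot be valid (assumed rule: values > 0).
clean = [r for r in raw if r[1] > 0 and r[2] > 0]

# Calculation: derive statistics from the corrected records.
avg_volume = mean(r[1] for r in clean)
price_per_litre = sum(r[2] for r in clean) / sum(r[1] for r in clean)

# Compact form: summarize many transactions as one total per day.
per_day = {}
for timestamp, litres, _price in clean:
    day = timestamp.split()[0]
    per_day[day] = per_day.get(day, 0.0) + litres

print(context, avg_volume, round(price_per_litre, 2), per_day)
```

The raw tuples say nothing by themselves; the daily totals and the average price per litre are what a station manager could actually act on.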
Information Management
Information management: all management tasks that deal
with information and communication within an enterprise.
Knowledge
• Knowledge is the production factor of the future,
which will replace energy and materials.
• Knowledge is produced by mental activity
and by processes that model mental activity.
• Transformation process Information → Knowledge:
– Comparison: How does information about the current
situation compare with other known situations?
– Consequence: How will the information influence decisions and
activities?
– Connection: Which relations exist between one concrete information
element and another?
– Conversation: What do other people think about a certain piece
of information?
Knowledge (2)
• People gain knowledge through experience – they see,
hear, touch, and taste the world around them.
• We can associate something we see with something we
hear, thereby gaining new knowledge about the world.
• Suppose we know that the sun is hot, balls are round,
and the sky is blue. These facts are knowledge about
the world. How do we store this knowledge in our
brain? How could we store this knowledge in a
computer?
• This problem, called knowledge representation, is one
of the first, most fundamental issues that
researchers in artificial intelligence had to face.
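Knowledge Pyramid
(Figure: pyramid, listed here from bottom to top, with the dimension
added at each level)
Characters
Data: characters plus Syntax
Information: data plus Semantics (Meaning)
Knowledge: information plus Pragmatics (Associated with Context and
Experience)
Decision, Action
Knowledge has 3 Dimensions: Syntax, Semantics, and Pragmatics.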
Example
• Characters: t i s n i o r o l a n i l w
• Data:
With the right syntax (here, the order of the letters), the above
characters give the statement „It will rain soon“.
• Information:
The above statement means:
„Water drops fall from the sky“.
• Knowledge:
The information „Water drops fall from the sky“ is
connected with experience and expectations such as:
„One can get wet; it can rain into the flat“.
• Action:
Based on this knowledge, activities are developed:
„I will take an umbrella, I will close the window, etc.“
Knowledge Management
Knowledge management: all management tasks of the
enterprise that deal with obtaining, utilizing, and
further developing knowledge.
Knowledge Representation
• Procedural representation
– Perhaps the most common technique for representing knowledge in
computers is to use procedural knowledge. Procedural code not only
encodes facts (constants and variables) but also defines a
sequence of operations for using and manipulating those facts.
Thus, program code is a perfectly natural way of encoding
procedural knowledge. This „hardcoded“ logic is typically not
considered to be part of artificial intelligence per se.
• Declarative representation
– A user simply states facts, rules, and relationships. However,
declarative knowledge must be processed by some procedural
code. Most of the knowledge representation techniques studied in
artificial intelligence are declarative. Some of them are shown on
the following slides.
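The contrast between the two representations can be sketched in Python. The temperature example, the rule format `(attribute, threshold, conclusion)`, and all function names are hypothetical, chosen only to show the difference in where the knowledge lives.

```python
# Procedural representation: the fact (the threshold 30) and the way it is
# used are fused together in program code.
def is_hot_procedural(temperature_celsius):
    return temperature_celsius > 30


# Declarative representation: facts and rules are stated as plain data ...
facts = {"temperature_celsius": 35}
rules = [
    # (attribute, threshold, conclusion): a hypothetical rule format
    ("temperature_celsius", 30, "hot"),
]


# ... and a small piece of generic procedural code processes them.
def conclusions(facts, rules):
    return [
        label
        for attribute, threshold, label in rules
        if facts.get(attribute, float("-inf")) > threshold
    ]


print(is_hot_procedural(35), conclusions(facts, rules))
```

In the declarative version a new rule is added by appending a tuple to `rules`; the interpreting function `conclusions` stays unchanged.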
Knowledge Representation - Rules
General form of a predicate logic rule:
if antecedent(s) then consequent(s)
(Instead of antecedent, other names, e.g., precondition, are used.
Instead of consequent, other names, e.g., conclusion, action,
hypothesis, are used.)
Rules can have the following forms:
• if P then Q
• if P1 and P2 and ... and Pn then Q1 and Q2 and ... and Qm
• if P1 and P2 or ... or Pn then Q
Rules that produce new facts are called production rules.
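The three rule forms can be written down as data in Python. This is a minimal sketch: the `Rule` class and the symbols P, P1, Q, etc. as plain strings are illustrative assumptions, and the mixed form is read here as "(P1 and P2) or P3", which is only one possible reading of the third form.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

Facts = FrozenSet[str]


@dataclass(frozen=True)
class Rule:
    antecedent: Callable[[Facts], bool]   # the "if" part, tested on the facts
    consequents: FrozenSet[str]           # the "then" part, facts to add

    def fire(self, facts: Facts) -> Facts:
        """Return the facts extended by the consequents if the rule applies."""
        if self.antecedent(facts):
            return facts | self.consequents
        return facts


# if P then Q
simple = Rule(lambda f: "P" in f, frozenset({"Q"}))
# if P1 and P2 then Q1 and Q2
conjunctive = Rule(lambda f: {"P1", "P2"} <= f, frozenset({"Q1", "Q2"}))
# if P1 and P2 or P3 then Q   (read as: (P1 and P2) or P3)
mixed = Rule(lambda f: ("P1" in f and "P2" in f) or "P3" in f, frozenset({"Q"}))

print(sorted(simple.fire(frozenset({"P"}))))
```

Each of these is a production rule in the sense above: firing it adds new facts to the set it is given.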
Rules (2)
Architecture of a Production System
(Figure: a knowledge base containing the rules and a fact base
containing the facts, linked by the inference mechanisms, which run
a Recognize, Select, Act cycle.)
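A production system with a recognize, select, act cycle can be sketched as a small forward-chaining loop in Python. The rule contents (a weather example), the tuple format, and the "first matching rule" selection strategy are assumptions made for illustration; real systems use more elaborate conflict resolution.

```python
# Knowledge base: rules as (name, antecedents, consequents). Hypothetical content.
RULES = [
    ("r1", {"dark clouds", "falling pressure"}, {"rain soon"}),
    ("r2", {"rain soon"}, {"take umbrella", "close window"}),
]


def run(initial_facts, rules):
    """Apply rules until no rule can add a new fact; return facts and firing order."""
    facts = set(initial_facts)   # fact base
    fired = []
    while True:
        # Recognize: rules whose antecedents all hold and that would add a new fact.
        conflict_set = [r for r in rules if r[1] <= facts and not r[2] <= facts]
        if not conflict_set:
            return facts, fired
        # Select: simplest possible strategy, take the first matching rule.
        name, _antecedents, consequents = conflict_set[0]
        # Act: add the selected rule's consequents to the fact base.
        facts |= consequents
        fired.append(name)


final_facts, fired = run({"dark clouds", "falling pressure"}, RULES)
print(fired, sorted(final_facts))
```

The loop terminates because every firing adds at least one new fact and the set of possible facts is finite.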
Semantic Nets
Semantic nets are used to define the meaning of a concept by its
relationships to other concepts.
A graph data structure is used, with nodes holding concepts and
links with natural language labels showing the relationships.
A portion of a semantic net representation of the vehicle domain is
shown on the next slide.
Remark: The standard relationships such as is-a, has-part, and instance
should be familiar to readers with object-oriented design
experience.
A Semantic Net Example
(Figure: semantic net of the vehicle domain.
Nodes: Vehicle, Automobile, Sports Car, Corvette, Doors, Wheels, Motor,
Small, 4, 2.
Link labels: is-a, has-part, size, num-wheels, num-doors, instance.)
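A semantic net of this kind can be stored as a list of labelled links and queried with inheritance along is-a and instance links. The sketch below uses the node and link names of the vehicle example, but the figure's layout is not recoverable from this transcript, so which node each link is attached to is an assumption, as are the function names.

```python
# Each link is (source concept, label, target concept).
# ASSUMPTION: the attachment of each link to a source node is guessed.
LINKS = [
    ("Automobile", "is-a", "Vehicle"),
    ("Automobile", "has-part", "Doors"),
    ("Automobile", "has-part", "Wheels"),
    ("Automobile", "has-part", "Motor"),
    ("Automobile", "num-wheels", 4),
    ("Sports Car", "is-a", "Automobile"),
    ("Sports Car", "size", "Small"),
    ("Sports Car", "num-doors", 2),
    ("Corvette", "instance", "Sports Car"),
]


def targets(node, label):
    """Targets of all links with the given label that start at node."""
    return [t for s, l, t in LINKS if s == node and l == label]


def ancestors(node):
    """Follow instance and is-a links upward, nearest concept first."""
    chain = []
    frontier = targets(node, "instance") + targets(node, "is-a")
    while frontier:
        current = frontier.pop(0)
        if current not in chain:
            chain.append(current)
            frontier += targets(current, "is-a")
    return chain


def lookup(node, label):
    """Values of label for node, including those inherited from ancestors."""
    values = []
    for concept in [node] + ancestors(node):
        values += targets(concept, label)
    return values


print(ancestors("Corvette"), lookup("Corvette", "has-part"))
```

The meaning of "Corvette" is thus defined purely by its relationships: it has no links of its own except `instance`, yet a lookup finds its parts and size through the concepts above it.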
Business Intelligence Tools
• Data warehouses
• OLAP (On-Line Analytical Processing) tools
• Data mining tools
• Text mining tools
• Data joiners
• Business Intelligence portals, etc.
Business Intelligence Tools (cont.)
• Data warehouse - a repository of multiple heterogeneous data
sources, organized under a unified schema at a single site in order
to facilitate management decision making.
• OLAP – analysis techniques with functionalities such as summarization,
consolidation, and aggregation, as well as the ability to view
information from different angles.
• Data mining – extracting or “mining“ knowledge from large data sets.
• Text mining – “mining“ large textual (document) databases. Related
term – web mining.
• Data joiner – a tool for working with data from disparate,
heterogeneous data sources.
• Business Intelligence portal – a Web site designed to be the first
point of entry for visitors to information about a company. With the
help of the portal's personalising functions, the user can choose the
information sources that he needs for performing a specific task. The
portal allows easy access to valuable information and data analyses,
which improves the basis for competent decisions.
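The OLAP idea of aggregating the same facts along different dimensions ("viewing information from different angles") can be illustrated with a small roll-up in Python. The sales records, the dimension names, and the `roll_up` function are hypothetical; an actual OLAP tool works on a data warehouse rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, revenue).
SALES = [
    ("North", "petrol", "Q1", 100),
    ("North", "diesel", "Q1", 60),
    ("North", "petrol", "Q2", 120),
    ("South", "petrol", "Q1", 80),
    ("South", "diesel", "Q2", 40),
]
DIMENSIONS = {"region": 0, "product": 1, "quarter": 2}


def roll_up(facts, *dims):
    """Sum revenue grouped by the chosen dimensions (none = grand total)."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)


by_region = roll_up(SALES, "region")                       # one angle
by_product_quarter = roll_up(SALES, "product", "quarter")  # another angle
grand_total = roll_up(SALES)                               # full consolidation

print(by_region, by_product_quarter, grand_total)
```

Choosing fewer dimensions consolidates the data further; choosing different ones gives a different view of the same underlying facts.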