
What is data mining?
Włodzisław Duch
Dept. of Informatics,
Nicholas Copernicus University,
Toruń, Poland
http://www.phys.uni.torun.pl/~duch
ISEP Porto, 8-12 July 2002
What is it about?
• Data used to be precious! Now it is overwhelming ...
• In many areas of science, business and commerce
people are drowning in data.
• Ex: astronomy super-telescope – data mining in
existing databases.
• Database technology makes it possible to store and
retrieve large amounts of data of any kind.
• There is knowledge hidden in data.
• Data analysis requires intelligence.
Ancient history
• 1960: first databases, collections of data.
• 1970: RDBMS, relational data model most popular
today, large centralized systems.
• 1980: application-oriented data models, specialized for
scientific, geographic, engineering data, time series,
text, object-oriented models, distributed databases.
• 1990: multimedia and Web databases, data
warehousing (subject-oriented DB for decision
support), and on-line analytical processing (OLAP),
deduction and verification of hypothetical patterns.
• Data mining: first conference in 1989, book 1996,
discover something useful!
Data Mining History
• 1989 IJCAI Workshop on Knowledge Discovery in
Databases (Piatetsky-Shapiro and W. Frawley 1991)
• 1991-1994 Workshops on KDD
• 1996 Advances in Knowledge Discovery and Data Mining
(Fayyad et al.)
• 1995-1998 International Conferences on Knowledge
Discovery in Databases and Data Mining (KDD’95-98)
• 1997 Journal of Data Mining and Knowledge Discovery
• 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences,
and SIGKDD Explorations
• Many conferences on data mining: PAKDD, PKDD, SIAM Data Mining, (IEEE) ICDM, etc.
References, papers
KDD WWW Resources:
http://www.kdd.org
http://www.kdnuggets.com
http://www.the-data-mine.com
http://www.acm.org/sigkdd/
ResearchIndex: http://citeseer.nj.nec.com/cs
AI & ML aspects: http://www.phys.uni.torun.pl/kmk
NN & Statistics: http://www.phys.uni.torun.pl/kmk
Comparison of results on many datasets: http://www.phys.uni.torun.pl/kmk
Data Mining and statistics
• Statisticians deal with data: what’s new in DM?
• Many DM methods have roots in statistics.
• Statistics used to deal with small, controlled
experiments, while DM deals with large, messy
collections of data.
• Statistics is based on analytical probabilistic models,
DM is based on algorithms that find patterns in data.
• Many DM algorithms came from other sources and are
slowly acquiring statistical justification.
• A key factor for DM is computer cost/performance.
• Sometimes DM is more art than science …
Types of Data
• Statistical data – clean, numerical, controlled
experiments, vector space model.
• Relational data – marketing, finances.
• Textual data – Web, NLP, search.
• Complex structures – chemistry, economics.
• Sequence data – bioinformatics.
• Multimedia data – images, video.
• Signals – dynamic data, biosignals.
• AI data – logical problems, games, behavior …
What is DM?
• Discovering interesting patterns, finding useful
summaries of large databases.
• DM is more than database technology and On-Line
Analytic Processing (OLAP) tools.
• DM is more than statistical analysis, although it
includes classification, association, clustering,
outlier and trend analysis, decision rules,
prototype cases, multidimensional visualization,
etc. Understanding the data has not been an
explicit goal of statistics, which focuses on
predictive models.
DM applications
• Many applications, but spectacular new knowledge is
rarely discovered.
Some examples:
– “Diapers and beer” correlation: place them close
together and put potato chips in between.
– Mining astronomical catalogs (Skycat, Sloan Sky
Survey): a new subtype of stars has been discovered!
– Bioinformatics: more precise characterization of some
diseases, many discoveries to be made?
– Credit card fraud detection (HNC company).
– Discounts of air/hotel for frequent travelers.
Important issues in data mining.
• Use of statistical and CI methods for KDD.
• What makes an interesting pattern?
• Handling uncertainty in the data.
• Handling noise, outliers and missing or unknown data.
• Finding linguistic variables, discretization of continuous
data, presentation and evaluation of knowledge.
• Knowledge representation for structural data,
heterogeneous information, textual databases & NLP.
• Performance, scalability, distributed data, incremental or
“on-line” processing.
• Best form of explanation depends on the application.
DM dangers
• If too many conclusions are drawn, some inferences
will be true merely by chance, an artifact of too-small
data samples (Bonferroni's principle).
Example 1: David Rhine (Duke Univ) ESP tests.
1 person in 1000 guessed correctly color (red or black) of
10 cards: is this evidence for ESP?
Retesting of these people gave average results.
Rhine’s conclusion: telling people that they have ESP
interferes with their ability …
Example 2: in a random sequence of length N built from
m letters, practically all possible subsequences of
length log_m(N) can be found => Bible code!
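A quick simulation makes the point; this is my own sketch (the alphabet size m and sequence length N are arbitrary toy choices, not from the talk):

```python
import itertools
import math
import random
import string

# Bonferroni's principle in action: in a random sequence of length N
# over m letters, most patterns of length ~ log_m(N) occur by chance.
m, N = 4, 100_000                                # assumed toy parameters
alphabet = string.ascii_lowercase[:m]
text = "".join(random.choice(alphabet) for _ in range(N))

k = int(math.log(N, m))                          # pattern length ~ log_m(N)
kgrams = {text[i:i + k] for i in range(N - k + 1)}
patterns = ["".join(p) for p in itertools.product(alphabet, repeat=k)]
found = sum(p in kgrams for p in patterns)
print(f"{found} of {len(patterns)} length-{k} patterns occur by pure chance")
```

Most of the 4^8 = 65536 patterns show up in purely random text, so finding a "meaningful" one proves nothing.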
Data Mining process
Knowledge discovery in databases (KDD):
a search process for understandable and useful
patterns in data.
[Process diagram] Operational Databases => Clean, Collect, Summarize (most effort) => Data Warehouse => Data Preparation => Training Data => Data Mining => Model, Patterns => Verification, Evaluation.
Stages of DM process
• Data gathering, data warehousing, Web crawling.
• Preparation of the data: cleaning, removing outliers and
impossible values, removing wrong records, finding
missing data.
• Exploratory data analysis: visualization of different
aspects of data.
• Finding relevant features for questions that are asked,
preparing data structures for predictive methods,
converting symbolic values to numerical representation.
• Pattern extraction, discovery, rules, prototypes.
• Evaluation of knowledge gained, finding useful patterns,
consultation with experts.
Multidimensional Data Cuboids
• Data warehouses use multidimensional data model.
• Projections (views) of data on different dimensions
(attributes) form “data cuboids”.
• In DB warehousing literature:
base cuboid: original data, N-Dim.
apex cuboid: 0-D cuboid, highest-level summary;
data cube: lattice of cuboids.
• Ex: Sales data cube, viewed in multiple dimensions
– Dimension tables, ex. item (item_name, brand, type), or
time(day, week, month, quarter, year)
– Fact tables, measures (such as cost), and keys to each of the
related dimension tables
Data Cube: A Lattice of Cuboids
0-D (apex) cuboid: none
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
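Since the lattice of cuboids is simply the power set of the dimension list, it can be enumerated directly; a minimal sketch (my own illustration):

```python
from itertools import combinations

# Enumerate the lattice of cuboids: every subset of the dimensions
# defines one cuboid, from the 0-D apex to the 4-D base.
dims = ["time", "item", "location", "supplier"]
for k in range(len(dims) + 1):
    label = "(apex)" if k == 0 else "(base)" if k == len(dims) else ""
    print(f"{k}-D {label}:",
          ["(" + ",".join(c) + ")" for c in combinations(dims, k)])
```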
Forms of useful knowledge
AI/Machine Learning camp:
Neural nets are black boxes.
Unacceptable! Symbolic rules forever.
But ... knowledge accessible to humans is in:
• symbols,
• similarity to prototypes,
• images, visual representations.
What type of explanation is satisfactory?
Interesting question for cognitive scientists.
Different answers in different fields.
Forms of knowledge
• Humans remember examples of each category
and refer to such examples – as similarity-based
or nearest-neighbor methods do.
• Humans create prototypes out of many
examples – as Gaussian classifiers, RBF
networks, neurofuzzy systems do.
• Logical rules are the highest form of
summarization of knowledge.
Types of explanation:
• exemplar-based: prototypes and similarity;
• logic-based: symbols and rules;
• visualization-based: exploratory data
analysis, maps, diagrams, relations ...
Computational Intelligence
[Diagram] Data => Knowledge. Computational Intelligence encompasses soft computing (neural networks, evolutionary algorithms, fuzzy logic), pattern recognition, and expert systems, and overlaps with Artificial Intelligence, machine learning, probabilistic methods, visualization, and multivariate statistics.
CI methods for data mining
• Provide non-parametric (“universal”), predictive
models of data.
• Classify new data to pre-defined categories,
supporting diagnosis & prognosis.
• Discover new categories, clusters, patterns.
• Discover interesting associations, correlations.
• Allow one to understand the data by creating fuzzy
or crisp logical rules, or prototypes.
• Help to visualize multi-dimensional
relationships among data samples.
Association rules
• Classification rules: X => C(X)
• Association rules: looking for correlations
between components of X, i.e. probabilities
p(Xi | X1, ..., Xi-1, Xi+1, ..., Xn).
• “Market basket” problem: many items selected
from an available pool to a basket; what are the
correlations?
• Only frequent items are interesting:
itemsets with high support, i.e. appearing
together in many baskets.
Search for rules above support threshold > 1%.
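A minimal support-counting sketch (toy baskets and a toy 50% threshold instead of the 1% above; my own illustration, not a full Apriori implementation):

```python
from itertools import combinations

# Count the support of all 1- and 2-itemsets: the fraction of
# baskets in which the itemset appears.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "milk", "beer"},
]
min_support = 0.5                       # toy threshold

counts = {}
for basket in baskets:
    for k in (1, 2):
        for itemset in combinations(sorted(basket), k):
            counts[itemset] = counts.get(itemset, 0) + 1

n = len(baskets)
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
print(frequent)                         # ('beer', 'diapers') has support 0.75
```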
Association rules - related
• Related problems to market basket:
correlation between documents – high for
plagiarism;
phrases in documents – high for semantically
related documents.
• Causal relations matter, although they may be
difficult to determine:
lower the price of diapers, keep high beer price,
or try the reverse – what will happen?
• More general approach:
Bayesian belief networks, causal networks,
graphical models.
Clustering
• Given points in a multidimensional space, divide them
into groups that are “similar”.
• Ex: if an epidemic breaks out, look for the location of
cases on the map (cholera in London).
Documents in the space of words cluster according to
their topics.
• How to measure similarity?
• Hierarchical approaches: start from single cases, join
them forming clusters; ex: dendrogram.
Centroid approaches: assume a few centers and adapt
their position; ex: k-means, LVQ, SOM.
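A minimal k-means sketch of the centroid approach, assuming a few centers and adapting their positions (my own illustration on toy 2-D points):

```python
import random

def kmeans(points, k, iters=20):
    centers = random.sample(points, k)          # initial center guesses
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: move each center to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
centers, clusters = kmeans(pts, k=2)
print(centers)
```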
Neural networks
• Inspired by neurobiology: simple elements
cooperate changing internal parameters.
• Large field, dozens of different models, over
500 papers on NN in medicine each year.
• Supervised networks: heteroassociative
mapping X=>Y, symptoms => diseases,
universal approximators.
• Unsupervised networks: clustering,
competitive learning, autoassociation.
• Reinforcement learning: modeling behavior,
playing games, sequential data.
Unsupervised NN example
Clustering and visualization of the quality of life
index (UN data) by SOM map.
Poor classification, inaccurate visualization.
Real and artificial neurons
[Figure] A real neuron (dendrites, synapses, axon, propagating signals) compared with an artificial network, where nodes play the role of neurons and weights the role of synapses.
Neural network for MI diagnosis
[Figure] Network estimating ~ p(MI|X) = 0.7 for Myocardial Infarction. Inputs (e.g. Sex = -1, Age = 65, Smoking = 1, Pain Intensity = 5, Pain Duration = 3, ECG: ST Elevation = 1) are combined through input weights into hidden nodes, and through output weights into the output node.
MI network function
Training: setting the values of weights and
thresholds, efficient algorithms exist.
Effect: non-linear regression function
$$F_{\mathrm{MI}}(X) = \sigma\Big(\sum_{i=1}^{5} W_{i}\,\sigma\Big(\sum_{k=1}^{6} W_{ik} X_{k}\Big)\Big)$$
Such networks are universal approximators:
they may learn any mapping X => Y
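A forward pass through such a network is only a few lines of code; a minimal sketch (the weights here are made up for illustration, not the trained MI network):

```python
import math

def sigma(z):                 # logistic transfer function
    return 1.0 / (1.0 + math.exp(-z))

def mlp(x, W_hidden, w_out):  # F(X) = sigma(sum_i W_i * sigma(sum_k W_ik * X_k))
    hidden = [sigma(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    return sigma(sum(w * h for w, h in zip(w_out, hidden)))

# 6 inputs: sex, age, smoking, pain intensity, pain duration, ECG ST elevation
x = [-1, 65, 1, 5, 3, 1]
W_hidden = [[0.1, -0.01, 0.5, 0.2, 0.1, 0.8]] * 5   # made-up hidden weights
w_out = [0.4, -0.2, 0.3, 0.1, 0.5]                  # made-up output weights
print(f"p(MI|X) ~ {mlp(x, W_hidden, w_out):.2f}")
```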
Knowledge from networks
Simplify networks: force most weights to 0,
quantize remaining parameters, be constructive!
• Regularization: mathematical technique
improving predictive abilities of the network.
• Result: MLP2LN neural networks that are
equivalent to logical rules.
MLP2LN
Converts MLP neural networks into a network
performing logical operations (LN).
[Diagram] MLP2LN architecture: input layer => aggregation (better features) => linguistic units (windows, filters) => rule units (threshold logic) => output layer (one node per class).
Learning dynamics
Decision regions shown every 200 training epochs in x3, x4
coordinates; borders are optimally placed with wide margins.
Neurofuzzy systems
Fuzzy: mx0,1 (no/yes) replaced by a degree
mx[0,1]. Triangular, trapezoidal, Gaussian ... MF.
M.f-s in many
dimensions:
Feature Space Mapping (FSM) neurofuzzy system.
Neural adaptation, estimation of probability density
distribution (PDF) using single hidden layer network
(RBF-like) with nodes realizing separable functions:
$$G(X; P) = \prod_{i=1}^{N} G_i(X_i; P_i)$$
GhostMiner Philosophy
GhostMiner, data mining tools from our lab.
http://www.fqspl.com.pl/ghostminer/
• Separate the process of model building and
knowledge discovery from model use =>
GhostMiner Developer & GhostMiner Analyzer.
• There is no free lunch – provide different types of tools
for knowledge discovery: decision tree, neural,
neurofuzzy, similarity-based, committees.
• Provide tools for visualization of data.
• Support the process of knowledge discovery/model
building and evaluating, organizing it into projects.
Heterogeneous systems
Homogenous systems: one type of “building blocks”, same
type of decision borders.
Ex: neural networks, SVMs, decision trees, kNNs ….
Committees combine many models together, but lead to
complex models that are difficult to understand.
• Discovering the simplest class structure, with the right
inductive bias, requires heterogeneous adaptive systems (HAS).
• Ockham's razor: simpler systems are better.
HAS examples:
• NN with many types of neuron transfer functions.
• k-NN with different distance functions.
• DT with different types of test criteria.
Wine data example
Chemical analysis of wine from grapes grown in
the same region in Italy, but derived from three
different cultivars.
Task: recognize the source of wine sample.
13 continuous features measured:
• alcohol content
• malic acid content
• ash content
• alkalinity of ash
• magnesium content
• total phenols content
• flavanoids content
• nonflavanoid phenols content
• proanthocyanins phenols content
• color intensity
• hue
• OD280/D315 of diluted wines
• proline
Exploration and visualization
General info about the data
Exploration: data
Inspect the data
Exploration: data statistics
Distribution of feature values
Proline has very large values; the data should be standardized
before further processing.
Exploration: data standardized
Standardized data: unit standard deviation, about 2/3 of all
data should fall within [mean-std,mean+std]
Other options: normalize to fit in [-1,+1], or normalize
rejecting some extreme values.
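Standardization itself is a one-liner per feature; a minimal sketch (toy values standing in for proline):

```python
# Transform a feature to zero mean and unit standard deviation;
# afterwards roughly 2/3 of the values fall within [-1, +1].
values = [680.0, 1320.0, 735.0, 1450.0, 510.0]    # made-up proline-like values

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
standardized = [(v - mean) / std for v in values]
print([round(z, 2) for z in standardized])
```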
Exploration: 1D histograms
Distribution of feature values in classes
Some features are more useful than others.
Exploration: 1D/3D histograms
Distribution of feature values in
classes, 3D
Exploration: 2D projections
Projections (cuboids) on selected 2D
Visualize data
Relations in more than 3D are hard to imagine.
SOM mappings: popular for visualization, but
rather inaccurate, no measure of distortions.
Measure of topographical distortions: map all Xi
points from Rn to xi points in Rm, m < n, and ask:
How well are Rij = D(Xi, Xj) distances reproduced by
distances rij = d(xi,xj) ?
Use m = 2 for visualization,
use higher m for dimensionality reduction.
Visualize data: MDS
Multidimensional scaling: invented in
psychometry by Torgerson (1952), re-invented
by Sammon (1969) and myself (1994) …
Minimize a measure of topographical distortion by
moving the x coordinates.
$$S_1(x) = \frac{1}{\sum_{i<j} R_{ij}^2} \sum_{i<j} \big(R_{ij} - r_{ij}(x)\big)^2 \qquad \text{MDS}$$

$$S_2(x) = \frac{1}{\sum_{i<j} R_{ij}} \sum_{i<j} \frac{\big(R_{ij} - r_{ij}(x)\big)^2}{R_{ij}} \qquad \text{Sammon}$$

$$S_3(x) = \frac{1}{\sum_{i<j} R_{ij}} \sum_{i<j} \Big(1 - \frac{r_{ij}(x)}{R_{ij}}\Big)^2 \qquad \text{MDS, more local}$$
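A minimal MDS sketch (my own illustration): minimize the unnormalized stress Σ(R_ij - r_ij(x))² (the constant denominator of S_1 does not change the minimum) by plain gradient descent on 2-D coordinates:

```python
import math
import random

def mds(R, dim=2, steps=2000, lr=0.01):
    n = len(R)
    x = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            for j in range(i + 1, n):
                d = math.dist(x[i], x[j]) + 1e-12
                g = 2 * (d - R[i][j]) / d        # gradient factor of (R - d)^2
                for k in range(dim):
                    delta = lr * g * (x[i][k] - x[j][k])
                    x[i][k] -= delta
                    x[j][k] += delta
    return x

# toy distances: three mutually close points and one far outlier
R = [[0, 1, 1, 9],
     [1, 0, 1, 9],
     [1, 1, 0, 9],
     [9, 9, 9, 0]]
print(mds(R))
```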
Visualize data: Wine
3 clusters are clearly distinguished, 2D is fine.
The green outlier can be identified easily.
Decision trees
Simplest things first:
use decision tree to find logical rules.
Test single attribute, find good point to split the
data, separating vectors from different classes.
DT advantages: fast, simple, easy to understand,
easy to program, many good algorithms.
Wine data result: 4 attributes used, 10 errors, 168 correct, 94.4% accuracy.
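The core step is easy to sketch: scan a single attribute for the threshold that best separates the classes (a minimal sketch with made-up proline values, using a plain misclassification count rather than the SSV criterion):

```python
def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_err, best_thr = float("inf"), None
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint candidate
        left = [l for v, l in pairs if v < thr]
        right = [l for v, l in pairs if v >= thr]
        # errors if each side predicts its majority class
        err = sum(len(s) - max(s.count(c) for c in set(s))
                  for s in (left, right) if s)
        if err < best_err:
            best_err, best_thr = err, thr
    return best_thr, best_err

proline = [680, 1320, 735, 1450, 510, 1280]   # made-up feature values
wine_class = [2, 1, 2, 1, 3, 1]
print(best_split(proline, wine_class))        # -> (1007.5, 1)
```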
Decision borders
Univariate trees:
test the value of a single attribute x < a.
Multivariate trees: test on combinations of
attributes, hyperplanes.
Result: feature space is divided into cuboids.
Wine data: univariate
decision tree borders for
proline and flavanoids
Logical rules
Crisp logic rules: for continuous x use linguistic
variables (predicate functions).
s_k(x) = True iff x ∈ [X_k, X'_k], for example:
small(x) = True{x | x < 1}
medium(x) = True{x | x ∈ [1, 2]}
large(x) = True{x | x > 2}
Linguistic variables are used in crisp
(propositional, Boolean) logic rules:
IF small-height(X) AND has-hat(X) AND has-beard(X)
THEN (X is a Brownie)
ELSE IF ... ELSE ...
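In code, linguistic variables are just predicates; a minimal sketch of the Brownie rule above:

```python
# Crisp linguistic variables as Boolean predicates.
def small(x):  return x < 1
def medium(x): return 1 <= x <= 2
def large(x):  return x > 2

def is_brownie(height, has_hat, has_beard):
    # IF small-height(X) AND has-hat(X) AND has-beard(X) THEN Brownie
    return small(height) and has_hat and has_beard

print(is_brownie(0.8, True, True))   # True
```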
Crisp logic decisions
Crisp logic is based on rectangular
membership functions:
True/False values jump from 0 to 1.
Step functions are used for
partitioning of the feature space.
Very simple hyper-rectangular
decision borders.
A severe limitation on the expressive
power of crisp logical rules!
Logical rules - advantages
Logical rules, if simple enough, are preferable.
• Rules may expose limitations of black box solutions.
• Only relevant features are used in rules.
• Rules may sometimes be more accurate than
NN and other CI methods.
• Overfitting is easy to control; rules usually
have a small number of parameters.
Rules forever !?
A logical rule about logical rules is:
IF the number of rules is relatively small
AND the accuracy is sufficiently high,
THEN rules may be an optimal choice.
Logical rules - limitations
Logical rules are preferred but ...
• Only one class is predicted, p(Ci|X,M) = 0 or 1; such a
black-and-white picture may be inappropriate in
many applications.
• Discontinuous cost functions allow only non-gradient
optimization.
• Sets of rules are unstable: a small change in the
dataset leads to a large change in the structure of
complex sets of rules.
• Reliable crisp rules may reject some cases as
unclassified.
• Interpretation of crisp rules may be misleading.
• Fuzzy rules are not so comprehensible.
Rules - choices
Simplicity vs. accuracy.
Confidence vs. rejection rate.
$$P(\text{true}\,|\,\text{predicted}) = \begin{pmatrix} p_{++} & p_{+-} \\ p_{-+} & p_{--} \end{pmatrix}$$

$p_{++}$ is a hit; $p_{-+}$ a false alarm; $p_{+-}$ a miss; $p_{+r}, p_{-r}$ are the probabilities of rejecting a case from each class.

Accuracy (overall): $A(M) = p_{++} + p_{--}$
Error rate: $L(M) = p_{-+} + p_{+-}$
Rejection rate: $R(M) = p_{+r} + p_{-r} = 1 - L(M) - A(M)$
Sensitivity: $S_+(M) = p_{+|+} = p_{++}/p_+$
Specificity: $S_-(M) = p_{-|-} = p_{--}/p_-$
Rules – error functions
The overall accuracy is a combination of sensitivity and
specificity, weighted by the a priori probabilities:

$$A(M) = p_+ S_+(M) + p_- S_-(M)$$

Optimization of rules for the C+ class; large $\gamma$ means
no errors but a high rejection rate:

$$E(M;\gamma) = \gamma\, L(M) - A(M) = \gamma\,(p_{+-} + p_{-+}) - (p_{++} + p_{--})$$
$$\min_M E(M;\gamma) \Leftrightarrow \min_M \{(1+\gamma)\, L(M) + R(M)\}$$

Optimization with different costs of errors:

$$\min_M E(M;\alpha) = \min_M \{p_{+-} + \alpha\, p_{-+}\} = \min_M \big\{ p_+ (1 - S_+(M)) - p_+ r_+(M) + \alpha\,[\,p_- (1 - S_-(M)) - p_- r_-(M)\,] \big\}$$

ROC (Receiver Operating Characteristic) curve: plot of $p_{++}(p_{-+})$, the hit rate as a function of the false-alarm rate.
Wine example – SSV rules
Decision trees provide rules of different
complexity.
Simplest tree: 5 nodes, corresponding to 3 rules;
25 errors, mostly Class2/3 wines mixed.
Wine – SSV 5 rules
Lower pruning leads to more complex tree.
7 nodes, corresponding to 5 rules;
10 errors, mostly Class2/3 wines mixed.
Wine – SSV optimal rules
What is the optimal complexity of rules?
Use crossvalidation to estimate generalization.
Various solutions may be found, depending on the search:
5 rules with 12 premises, making 6 errors,
6 rules with 16 premises and 3 errors,
8 rules, 25 premises, and 1 error.
if OD280/D315 > 2.505 ∧ proline > 726.5 ∧ color > 3.435 then class 1
if OD280/D315 > 2.505 ∧ proline > 726.5 ∧ color < 3.435 then class 2
if OD280/D315 < 2.505 ∧ hue > 0.875 ∧ malic-acid < 2.82 then class 2
if OD280/D315 > 2.505 ∧ proline < 726.5 then class 2
if OD280/D315 < 2.505 ∧ hue < 0.875 then class 3
if OD280/D315 < 2.505 ∧ hue > 0.875 ∧ malic-acid > 2.82 then class 3
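Such a rule set translates directly into code; a sketch with the thresholds copied from the slide (argument names are my abbreviations):

```python
def classify_wine(od, proline, color, hue, malic_acid):
    # od = OD280/D315 of diluted wines
    if od > 2.505 and proline > 726.5 and color > 3.435: return 1
    if od > 2.505 and proline > 726.5 and color < 3.435: return 2
    if od < 2.505 and hue > 0.875 and malic_acid < 2.82: return 2
    if od > 2.505 and proline < 726.5:                   return 2
    if od < 2.505 and hue < 0.875:                       return 3
    if od < 2.505 and hue > 0.875 and malic_acid > 2.82: return 3
    return None   # boundary cases (e.g. od == 2.505) are not covered

print(classify_wine(od=2.8, proline=800, color=4.0, hue=1.0, malic_acid=2.0))  # 1
```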
Wine – FSM rules
SSV: hierarchical rules
FSM: density estimation with feature selection.
Complexity of rules depends on desired accuracy.
Use rectangular functions for crisp rules.
Optimal accuracy may be evaluated using crossvalidation.
FSM discovers simpler rules, for example:
if proline > 929.5 then class 1
(48 cases, 45 correct, 2 recovered by other rules).
if color < 3.79285 then class 2
(63 cases, 60 correct)
Examples of interesting
knowledge discovered!
The most famous example of knowledge
discovered by data mining:
correlation between beer, milk and diapers.
Other examples:
2 subtypes of galactic spectra forced
astrophysicists to reconsider stellar evolutionary
processes.
Several examples of knowledge found by us in
medical and other datasets follow.
Mushrooms
The Mushroom Guide: no simple rule for mushrooms; no rule
like: ‘leaflets three, let it be’ for Poisonous Oak and Ivy.
8124 cases, 51.8% are edible, the rest non-edible.
22 symbolic attributes, up to 12 values each, equivalent to
118 logical features, or 2^118 ≈ 3·10^35 possible input vectors.
Odor: almond, anise, creosote, fishy, foul, musty, none,
pungent, spicy
Spore print color: black, brown, buff, chocolate, green,
orange, purple, white, yellow.
Safe rule for edible mushrooms:
odor ∈ {almond, anise, none} ∧ spore-print-color ≠ green
48 errors, 99.41% correct.
This is why animals have such a good sense of smell!
What does it tell us about odor receptors?
Mushrooms rules
To eat or not to eat, this is the question! Not any more ...
A mushroom is poisonous if:
R1) odor ∉ {almond, anise, none}; 120 errors, 98.52%
R2) spore-print-color = green; 48 errors, 99.41%
R3) odor = none ∧ stalk-surface-below-ring = scaly
∧ stalk-color-above-ring ≠ brown; 8 errors, 99.90%
R4) habitat = leaves ∧ cap-color = white; no errors!
R1 + R2 are quite stable, found even with 10% of the data;
R3 and R4 may be replaced by other rules, e.g.:
R'3) gill-size = narrow ∧ stalk-surface-above-ring ∈ {silky, scaly}
R'4) gill-size = narrow ∧ population = clustered
Only 5 of 22 attributes used! Simplest possible rules?
100% in CV tests - structure of this data is completely clear.
Recurrence of breast cancer
Data from: Institute of Oncology, University
Medical Center, Ljubljana, Yugoslavia.
286 cases: 201 no-recurrence (70.3%), 85 recurrence (29.7%).
Example record: no-recurrence-events, 40-49, premeno, 25-29,
0-2, ?, 2, left, right_low, yes
9 nominal features: age (9 bins), menopause,
tumor-size (12 bins), nodes involved (13 bins),
node-caps, degree-malignant (1,2,3), breast,
breast quad, radiation.
Rules for breast cancer
Data from: Institute of Oncology, University
Medical Center, Ljubljana, Yugoslavia.
Many systems used, 65-78% accuracy reported.
Single rule:
IF nodes-involved ∉ [0,2] ∧ degree-malignant = 3
THEN recurrence, ELSE no-recurrence
76.2% accuracy, only trivial knowledge in the data:
Highly malignant breast cancer involving many
nodes is likely to strike back.
Recurrence - comparison.
Method                     10xCV accuracy (%)
MLP2LN, 1 rule             76.2
SSV DT, stable rules       75.7 ± 1.0
k-NN, k=10, Canberra       74.1 ± 1.2
MLP + backprop             73.5 ± 9.4 (Zarndt)
CART DT                    71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes        71.7 ± 6.8
Naive Bayes                69.3 ± 10.0 (Zarndt)
Other decision trees       < 70.0
Breast cancer diagnosis.
Data from University of Wisconsin Hospital,
Madison, collected by dr. W.H. Wolberg.
699 cases, 9 features quantized from 1 to 10:
clump thickness, uniformity of cell size, uniformity
of cell shape, marginal adhesion, single epithelial
cell size, bare nuclei, bland chromatin, normal
nucleoli, mitoses
Task: distinguish benign from malignant cases.
Breast cancer rules.
Data from University of Wisconsin Hospital,
Madison, collected by dr. W.H. Wolberg.
Simplest rule from MLP2LN, large regularization:
IF uniformity of cell size < 3
THEN benign
ELSE malignant
Sensitivity=0.97, Specificity=0.85
More complex NN solutions, from 10CV estimate:
Sensitivity =0.98, Specificity=0.94
Breast cancer comparison.
Method                         10xCV accuracy (%)
k-NN, k=3, Manhattan           97.0 ± 2.1 (GM)
FSM, neurofuzzy                96.9 ± 1.4 (GM)
Fisher LDA                     96.8
MLP + backprop                 96.7 (Ster, Dobnikar)
LVQ                            96.6 (Ster, Dobnikar)
IncNet (neural)                96.4 ± 2.1 (GM)
Naive Bayes                    96.4
SSV DT, 3 crisp rules          96.0 ± 2.9 (GM)
LDA (linear discriminant)      96.0
Various decision trees         93.5 - 95.6
Melanoma skin cancer
• Collected in the Outpatient Center of Dermatology
in Rzeszów, Poland.
• Four types of melanoma: benign, blue, suspicious,
or malignant.
• 250 cases, with almost equal class distribution.
• Each record in the database has 13 attributes:
asymmetry, border, color (6), diversity (5).
• TDS (Total Dermatoscopy Score) – a single aggregated index.
• Goal: hardware scanner for preliminary diagnosis.
Melanoma rules
R1: IF TDS ≤ 4.85 AND C-BLUE IS absent
THEN MELANOMA IS Benign-nevus
R2: IF TDS ≤ 4.85 AND C-BLUE IS present
THEN MELANOMA IS Blue-nevus
R3: IF TDS > 5.45
THEN MELANOMA IS Malignant
R4: IF TDS > 4.85 AND TDS < 5.45
THEN MELANOMA IS Suspicious
5 errors (98.0%) on the training set
0 errors (100 %) on the test set.
Feature aggregation is important!
Without TDS 15 rules are needed.
Melanoma results
Method                        Rules   Training %    Test %
MLP2LN, crisp rules           4       98.0 (all)    100
SSV Tree, crisp rules         4       97.5 ± 0.3    100
FSM, rectangular f.           7       95.5 ± 1.0    100
knn + prototype selection     13      97.5 ± 0.0    100
FSM, Gaussian f.              15      93.7 ± 1.0    95 ± 3.6
knn k=1, Manh, 2 features     --      97.4 ± 0.3    100
LERS, rough rules             21      --            96.2
Summary
Data mining is a large field; only a few issues have been
mentioned here.
DM involves many steps; here only those related to
pattern recognition were stressed, but in practice
scalability and efficiency issues may be the most important.
Neural networks are still used mostly for building predictive
data models, but they may also provide a simplified
description in the form of rules.
Rules are not the only form of data understanding.
Rules may be a beginning for a practical application.
Some interesting knowledge has been discovered.
Challenges
Fully automatic universal data analysis systems:
press the button and wait for the truth …
• Discovery of theories rather than data models
• Integration with image/signal analysis
• Integration with reasoning in complex domains
• Combining expert systems with neural networks
We are slowly getting there.
More & more computational intelligence tools
(including our own) are available.
Disclaimer
A few slides/figures were taken from various presentations found
on the Internet; unfortunately I cannot identify the original
authors at the moment, since these slides went through different
iterations. I have to apologize for that.