February 6 & 13, 2003
Data Mining in Bioinformatics
Peter Bajcsy, PhD
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
[email protected]
Outline
• Introduction
  — Interdisciplinary Problem Statement
  — Microarray Problem Overview
• Microarray Data Processing
  — Image Analysis and Data Mining
  — Prior Knowledge
  — Data Mining Methods
  — Database and Optimization Techniques
  — Visualization
• Validation
• Artificial Immune Systems
• Summary
Introduction: Recommended Literature
1. P. Baldi & S. Brunak, Bioinformatics – The Machine Learning Approach, 2nd edition, The MIT Press, 2001
2. J. Han & M. Kamber, Data Mining – Concepts and Techniques, Morgan Kaufmann Publishers, 2001
3. R. Duda, P. Hart & D. Stork, Pattern Classification, 2nd edition, John Wiley & Sons, 2001
Bioinformatics, Computational Biology, Data Mining
• Bioinformatics is an interdisciplinary field concerned with the information processing problems of computational biology and with a unified treatment of the data mining methods for solving these problems.
• Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g.,
  — Genomes (viruses, bacteria, fungi, plants, insects, …)
  — Proteins and Proteomes
  — Biological Sequences
  — Molecular Function and Structure
• Data Mining is searching for knowledge in data; it also goes by the names
  — Knowledge mining from databases
  — Knowledge extraction
  — Data/pattern analysis
  — Data dredging
  — Knowledge Discovery in Databases (KDD)
Basic Terms in Biology
Example:
• The human body contains ~100 trillion cells
• Inside each cell is a nucleus
• Inside the nucleus are two complete sets of the human genome (except in egg cells, sperm cells and red blood cells)
• Each set of the genome includes 30,000–80,000 genes on the same 23 chromosomes
• Gene – A functional hereditary unit that occupies a fixed location on a chromosome, has a specific influence on phenotype, and is capable of mutation
• Chromosome – A DNA-containing linear body of the cell nucleus responsible for the determination and transmission of hereditary characteristics
Basic Terms in Data Mining
• Data Mining: A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.
• Knowledge Discovery Process: The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
• A pattern is a conservative statement about a probability distribution.
  — Webster: A pattern is (a) a natural or chance configuration, (b) a reliable sample of traits, acts, tendencies, or other observable characteristics of a person, group, or institution.
Introduction: Problems in Bioinformatics Domain
• Problems in Bioinformatics Domain
— Data production at the levels of molecules, cells,
organs, organisms, populations
— Integration of structure and function data, gene
expression data, pathway data, phenotypic and
clinical data, …
— Prediction of Molecular Function and Structure
— Computational biology: synthesis (simulations) and
analysis (machine learning)
MICROARRAY PROBLEM
Microarray Problem: Major Objective
• Major Objective: Discover a comprehensive theory of life’s organization at the molecular level
  — The major actors of molecular biology: the nucleic acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA)
  — The central dogma of molecular biology: DNA → RNA → protein
• Proteins are complex molecules built from 20 different amino acids.
Input and Output of Microarray Data Analysis
• Input: Laser image scans (data) and the underlying experiment hypotheses or experiment designs (prior knowledge)
• Output:
  — Conclusions about the input hypotheses or knowledge about the statistical behavior of measurements
  — The theory of biological systems learnt automatically from data (machine learning perspective)
    – Model fitting, inference process
Overview of Microarray Problem
[Diagram: the Biology application domain (experiment design and hypothesis, microarray experiment) feeds data analysis (image analysis, data mining, data warehouse), with validation closing the loop; the analysis side draws on the Artificial Intelligence (AI), Statistics, and Knowledge Discovery in Databases (KDD) communities.]
Statistics Community
• Random Variables
• Statistical Measures
• Probability and Probability Distribution
• Confidence Interval Estimations
• Test of Hypotheses
• Goodness of Fit
• Regression and Correlation Analysis

Example data: GeneFilter Comparison Report
GeneFilter 1 Name: O2#1 8-20-99adjfinal
GeneFilter 2 Name: N2#1finaladj
Intensities are reported raw and normalized for each filter (GF1, GF2), with the difference and ratio of the normalized values.

ORF NAME  GENE NAME  CHRM  F G R C   RAW GF1   RAW GF2    NORM GF1    NORM GF2   DIFFERENCE   RATIO
YAL001C   TFC3        1    1 A 1 2     12.03      7.38      403.83      209.79       194.04    1.92
YBL080C   PET112      2    1 A 1 3     53.21     35.62    1,786.11    1,013.13       772.98    1.76
YBR154C   RPB5        2    1 A 1 4     79.26     78.51    2,660.73    2,232.86       427.87    1.19
YCL044C               3    1 A 1 5     53.22     44.66    1,786.53    1,270.12       516.41    1.41
YDL020C   SON1        4    1 A 1 6     23.80     20.34      799.06      578.42       220.64    1.38
YDL211C               4    1 A 1 7     17.31     35.34      581.00    1,005.18      -424.18   -1.73
YDR155C   CPH1        4    1 A 1 8    349.78    401.84   11,741.98   11,428.10       313.88    1.03
YDR346C               4    1 A 1 9     64.97     65.88    2,180.87    1,873.67       307.21    1.16
YAL010C   MDM10       1    1 A 2 2     13.73      9.61      461.03      273.36       187.67    1.69
Artificial Intelligence (AI) Community
• Design Cycle of Predictive Modeling: Collect Data → Choose Features → Choose Model → Train Classifier → Evaluate Classifier
• Issues:
  — Prior knowledge (e.g., invariance)
  — Model deviation from the true model
  — Sampling distributions
  — Computational complexity
  — Model complexity (overfitting)
Knowledge Discovery in Databases (KDD) Community
[Figure: a database feeding the knowledge discovery process, illustrated with the GeneFilter Comparison Report shown above.]
Microarray Data Mining and Image Analysis Steps
• Image Analysis
  — Normalization
  — Grid Alignment
  — Spot Quality Assurance Control
  — Feature construction (selection and extraction)
• Data Mining
  — Prior knowledge
  — Statistics
  — Machine learning
  — Pattern recognition
  — Database techniques
  — Optimization techniques
  — Visualization
• Validation
  — Issues
  — Cross validation techniques
[Background figure: an excerpt of the GeneFilter Comparison Report shown above.]
MICROARRAY IMAGE ANALYSIS
Microarray Image Analysis
DATA MINING OF MICROARRAY DATA
Why Data Mining ? Sequence Example
Biology: Language and Goals
• A gene can be defined as a region of DNA.
• A genome is one haploid set of chromosomes with the genes they contain.
• Goal: perform competent comparison of gene sequences across species and account for inherently noisy biological sequences due to random variability amplified by evolution.
• Assumption: if a gene has high similarity to another gene, then they perform the same function.
Analysis: Language and Goals
• A feature is an extractable attribute or measurement (e.g., gene expression, location).
• Pattern recognition tries to characterize patterns in data (e.g., similar gene expressions, equidistant gene locations).
• Data mining is about uncovering patterns, anomalies and statistically significant structures in data (e.g., find two similar gene expressions with confidence > x).
Types of Expected Data Mining and Analysis Results
Hypothetical Examples:
• Binary answers using tests of hypotheses
— Drug treatment is successful with a confidence level x.
• Statistical behavior (probability distribution functions)
  — A class of genes with functionality X follows a Poisson distribution.
• Expected events
  — As the amount of treatment increases, the gene expression level will decrease.
• Relationships
  — Expression level of gene A is correlated with expression level of gene B under varying treatment conditions (gene A and B are part of the same pathway).
• Decision trees
  — Classification of a new gene sequence by a “domain expert”.
PRIOR KNOWLEDGE
Prior Knowledge: Experiment Design
Design cycle: Collect Data → Data Cleaning and Transformations → Choose Features → Choose Model and Data Mining Method
• Microarray sources of systematic and random errors
• Feature selection and variability
• Expectations and hypotheses
• Data cleaning and transformations
• Data mining method selection
• Interpretation
Prior Knowledge from Experiment Design
Complexity Levels of Microarray Experiments:
1. Compare single gene in a control situation versus a treatment situation
• Example: Is the level of expression (up-regulated or down-regulated)
significantly different in the two situations? (drug design application)
• Methods: t-test, Bayesian approach
2. Find multiple genes that share common functionalities
• Example: Which genes are related and mutually dependent?
• Methods: Clustering (hierarchical, k-means, self-organizing maps,
neural network, support vector machines)
3. Infer the underlying gene and protein networks that are responsible
for the patterns and functional pathways observed
• Example: What is the gene regulation at system level?
• Directions: mining regulatory regions, modeling regulatory networks
on a global scale
Goal of Future Experiment Designs: Understand biology at the system level,
e.g., gene networks, protein networks, signaling networks, metabolic
networks, immune system and neuronal networks.
Data Mining Techniques
Data mining techniques draw from:
• Statistics
• Machine learning
• Pattern recognition
• Database techniques
• Optimization techniques
• Visualization
STATISTICS
Statistics
• Descriptive Statistics – describe data
• Inductive Statistics – make forecasts and inferences (e.g., are two sample sets identically distributed?)
Statistical t-test
• Gene expression level in Control and Treatment situations
• Is the behavior of a single gene different in the Control situation than in the Treatment situation?
• The normalized distance t is computed from the sample means (m) and variances (s) of the two situations.
• The normalized distance t follows a Student distribution with f degrees of freedom.
• If t > threshold, the control and treatment data populations are considered to be different.
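A minimal sketch of this test in Python, assuming the standard two-sample (Welch) form of the normalized distance; the expression values, sample sizes and the 0.05 significance threshold are illustrative assumptions, not values from the slides.

```python
import numpy as np
from scipy import stats

# Hypothetical normalized expression levels of one gene in the two situations
control = np.array([403.8, 461.0, 285.4, 581.0, 399.1])
treatment = np.array([209.8, 273.4, 150.1, 305.2, 240.6])

# Welch's two-sample t-test (does not assume equal variances)
t, p_value = stats.ttest_ind(control, treatment, equal_var=False)
print(f"t = {t:.3f}, p = {p_value:.4f}")

if p_value < 0.05:   # threshold corresponds to the chosen confidence level
    print("Control and treatment populations are considered different.")
```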
MACHINE LEARNING AND PATTERN RECOGNITION
Machine Learning
• Unsupervised learning – “natural groupings”
• Supervised learning – learning from labeled examples
• Reinforcement learning
Pattern Recognition
Approaches include:
• Statistical models (k-nearest neighbors, support vectors)
• Linear correlation and regression
• Locally weighted learning
• Decision trees
• Neural networks – NN representation with gradient-based optimization, or NN representation with genetic-algorithm-based optimization
Unsupervised Learning and Clustering
• A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters (“natural groupings”).
• Examples of data objects: gene expression levels, sets of co-regulated genes (pathways), protein structures
• Categories of Clustering Methods
  — Partitioning Methods
  — Hierarchical Methods
  — Density-Based Methods
Unsupervised Clustering: Partitioning Methods
Example: Centroid-Based Technique
• K-means Algorithm partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low.
• Input: number of desired clusters k
• Output: k labels assigned to n objects
• Steps:
  1. Select k initial cluster centers
  2. Compute similarity as a distance between an object and each cluster center
  3. Assign a label to an object based on the minimum distance
  4. Repeat for all objects
  5. Re-compute the cluster centers as the mean of all objects assigned to a given cluster
  6. Repeat from Step 2 until objects do not change their labels.
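A minimal NumPy sketch of these steps, assuming Euclidean distance as the similarity measure; the synthetic data, k = 3 and the random initialization are illustrative assumptions.

```python
import numpy as np

def kmeans(objects, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k initial cluster centers (k distinct objects at random)
    centers = objects[rng.choice(len(objects), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 2-4: distance from every object to every center; label by the nearest center
        distances = np.linalg.norm(objects[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # Step 6: labels no longer change
        labels = new_labels
        # Step 5: re-compute each center as the mean of the objects assigned to it
        centers = np.array([objects[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Usage: cluster 100 hypothetical 2-D gene-expression profiles into k = 3 groups
expression = np.random.default_rng(1).normal(size=(100, 2))
labels, centers = kmeans(expression, k=3)
print(np.bincount(labels))
```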
Unsupervised Clustering: Partitioning Methods
Example: Representative-Based Technique
• K-medoids Algorithm partitions a set of n objects into k clusters so that it minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
• Input: number of desired clusters k
• Output: k labels assigned to n objects
• Steps:
  1. Select k initial objects as the initial medoids
  2. Compute similarity as a distance between an object and each cluster medoid
  3. Assign a label to an object based on the minimum distance
  4. Repeat for all objects
  5. Randomly select a non-medoid object and swap it with the current medoid if the swap would decrease the intra-cluster square error
  6. Repeat from Step 2 until objects do not change their labels.
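A minimal Python sketch of the swap step; the synthetic data, k = 3, the squared-Euclidean error and the fixed number of random swap trials are illustrative assumptions.

```python
import numpy as np

def total_error(objects, medoids):
    # Squared distance of every object to its nearest medoid, summed
    d = np.linalg.norm(objects[:, None, :] - objects[medoids][None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum()), d.argmin(axis=1)

rng = np.random.default_rng(0)
objects = rng.normal(size=(60, 2))            # 60 hypothetical 2-D objects
k = 3
medoids = list(rng.choice(len(objects), size=k, replace=False))
error, labels = total_error(objects, medoids)

for _ in range(500):                          # random swap trials (Steps 5-6)
    i = int(rng.integers(k))                  # which medoid to try replacing
    candidate = int(rng.integers(len(objects)))
    if candidate in medoids:
        continue
    trial = medoids.copy()
    trial[i] = candidate
    new_error, new_labels = total_error(objects, trial)
    if new_error < error:                     # keep only swaps that decrease the error
        medoids, error, labels = trial, new_error, new_labels

print("medoid indices:", medoids, " intra-cluster error:", round(error, 3))
```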
Unsupervised Clustering: Hierarchical Clustering
• Hierarchical Clustering partitions a set of n objects into a tree of clusters
• Types of Hierarchical Clustering
  — Agglomerative hierarchical clustering
    – Bottom-up strategy of building clusters
  — Divisive hierarchical clustering
    – Top-down strategy of building clusters
Unsupervised Agglomerative Hierarchical Clustering
• Agglomerative Hierarchical Clustering partitions a set of n objects into a tree of clusters with a bottom-up strategy.
• Steps:
  1. Assign a unique label to each data object and form n clusters
  2. Find the nearest clusters and merge them
  3. Repeat Step 2 until the desired number of clusters remains.
• Types of Agglomerative Hierarchical Clustering
  — The nearest neighbor algorithms (minimum or single-linkage algorithm, minimal spanning tree)
  — The farthest neighbor algorithms (maximum or complete-linkage algorithm)
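A minimal sketch of single-linkage and complete-linkage agglomerative clustering using SciPy; the random expression matrix and the cut at four clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

expression = np.random.default_rng(0).normal(size=(20, 5))   # 20 genes x 5 conditions

single = linkage(expression, method="single")      # nearest-neighbor (single linkage)
complete = linkage(expression, method="complete")  # farthest-neighbor (complete linkage)

# Cut the complete-linkage tree to obtain, e.g., 4 clusters
labels = fcluster(complete, t=4, criterion="maxclust")
print(labels)
```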
Unsupervised Clustering: Density-Based Clustering
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN) aggregates objects into clusters if the objects are density connected.
• Density connected objects:
  — Simplified explanation: P and Q are density connected if there is an object O such that both P and Q are density connected to O.
  — Aggregate P and Q if they are density connected with respect to an R-radius neighborhood and a minimum-number-of-objects criterion.
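A minimal sketch using scikit-learn's DBSCAN implementation; eps plays the role of the R-radius and min_samples the minimum-object criterion, and both values, like the synthetic points, are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.default_rng(0).normal(size=(200, 2))      # synthetic objects
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(points)

# Label -1 marks noise objects that are not density connected to any cluster
print("clusters:", sorted(set(labels) - {-1}), " noise points:", int((labels == -1).sum()))
```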
Supervised Learning or Classification
• Classification is a two-step process consisting of learning classification rules followed by assignment of a classification label.
Supervised Learning: Decision Tree
• Decision tree algorithm constructs a tree structure in a top-down recursive divide-and-conquer manner.
• Example: Car Insurance Risk Assessment

  Age   Car Type   Risk
  23    family     High
  17    sports     High
  43    sports     High
  68    family     Low
  32    truck      Low
  20    family     High

• Resulting tree: attributes are tested at the internal nodes and answers label the branches – Age < 25? yes → Risk: High; no → Sports car? yes → Risk: High, no → Risk: Low.
• The leaf regions visualize the decision boundaries.
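A minimal scikit-learn sketch of the car-insurance example above; the one-hot encoding of the Car Type attribute and the entropy split criterion are implementation assumptions, so the learned tree may differ in detail from the one drawn on the slide.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Age":     [23, 17, 43, 68, 32, 20],
    "CarType": ["family", "sports", "sports", "family", "truck", "family"],
    "Risk":    ["High", "High", "High", "Low", "Low", "High"],
})
X = pd.get_dummies(data[["Age", "CarType"]])   # one-hot encode the categorical attribute
y = data["Risk"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# Classify a new applicant: a 30-year-old sports-car driver
new = pd.get_dummies(pd.DataFrame([{"Age": 30, "CarType": "sports"}]))
print(tree.predict(new.reindex(columns=X.columns, fill_value=0)))
```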
Supervised Learning: Bayesian Classification
• Bayesian Classification is based on Bayes’ theorem and it can predict class membership probabilities.
• Bayes’ theorem (X – data sample, H – hypothesis of the data label)
  — P(H|X) posterior probability
  — P(H) prior probability
• Classification selects the maximum a posteriori hypothesis.
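For reference, the theorem the slide relies on, in standard notation (the likelihood P(X|H) and the evidence P(X) are implied but not named on the slide), together with the maximum a posteriori classification rule:

```latex
P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)},
\qquad
H_{\mathrm{MAP}} = \arg\max_{H}\; P(X \mid H)\,P(H)
```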
Statistical Models: Linear Discriminant
• Linear Discriminant Functions form boundaries between data classes.
• Finding Linear Discriminant Functions is achieved by minimizing a criterion error function.
• Forms: linear discriminant function, quadratic discriminant function
• Finding the w coefficients:
  — Gradient Descent Procedures
  — Newton’s algorithm
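The formulas themselves did not survive the transcript; presumably the labels refer to the standard forms (as in Duda, Hart & Stork) and the usual gradient descent update:

```latex
g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i \quad\text{(linear)},
\qquad
g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij}\, x_i x_j \quad\text{(quadratic)},
\qquad
\mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k \,\nabla J(\mathbf{w}_k)
```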
Artificial Neural Networks
• An Artificial Neural Network (ANN) is a computational analogue of biological neurons.
• An artificial neural network is a set of connected input/output units where each connection has a weight associated with it.
• Phase I: learning – adjust the weights such that the network accurately predicts the class labels of the input samples
• Phase II: classification – assign labels by passing an unknown sample through the network
• Network topology or “structure”
Artificial Neural Networks (cont.)
• Unit (node) j: weighted inputs combined through an activation function
• Steps:
  1. Initialize weights from [-1, 1]
  2. Propagate the inputs forward
  3. Backpropagate the error
  4. Terminate learning (training) if (a) delta w < threshold, or (b) the percentage of misclassified samples < threshold, or (c) the maximum number of iterations has been exceeded
• Interpretation of the trained network
• Pros & cons of ANN: good performance with noisy data and rule extraction is possible, but long training, poor interpretability, and trial-and-error network design
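A minimal NumPy sketch of the training loop described in the steps above (initialize weights, forward pass, backpropagate the error, stop on a weight-change threshold); the network size, sigmoid activation, learning rate, toy data and stopping threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                         # 100 samples, 4 input features
y = (X[:, 0] + X[:, 1] > 0).astype(float)[:, None]    # toy binary class labels

# Step 1: initialize weights from [-1, 1]
W1 = rng.uniform(-1, 1, size=(4, 6))                  # input -> hidden
W2 = rng.uniform(-1, 1, size=(6, 1))                  # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = 0.5                                             # learning rate

for epoch in range(2000):
    # Step 2: propagate the inputs forward
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)

    # Step 3: backpropagate the error (gradient of the squared error)
    delta_out = (output - y) * output * (1 - output)
    delta_hid = (delta_out @ W2.T) * hidden * (1 - hidden)
    dW2 = hidden.T @ delta_out / len(X)
    dW1 = X.T @ delta_hid / len(X)
    W2 -= eta * dW2
    W1 -= eta * dW1

    # Step 4(a): terminate when the weight change falls below a threshold
    if max(np.abs(dW1).max(), np.abs(dW2).max()) < 1e-4:
        break

accuracy = float(((output > 0.5) == (y > 0.5)).mean())
print(f"stopped at epoch {epoch}, training accuracy {accuracy:.2f}")
```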
Support Vector Machines (SVM)
• The SVM algorithm finds a separating hyperplane with the largest margin and uses it for classification of new samples.
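A minimal sketch of largest-margin classification with scikit-learn's SVC; the synthetic two-class data and the linear kernel are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, size=(50, 2)),    # class 0
               rng.normal(+1, 0.5, size=(50, 2))])   # class 1
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)          # largest-margin separating hyperplane
print("number of support vectors:", len(clf.support_vectors_))
print("prediction for a new sample:", clf.predict([[0.2, -0.1]]))
```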
DATABASE TECHNIQUES AND OPTIMIZATION TECHNIQUES
Data Types and Databases
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems
  — Object-Relational
  — Spatial and Temporal
  — Time-Series
  — Multimedia
  — Text
  — Heterogeneous, Legacy, and Distributed
  — WWW
• Microarray data types: Structure – 3D Anatomy; Function – 1D Signal; Metadata – Annotation (e.g., the GeneFilter Comparison Report shown earlier)
Database Techniques
• Database Design and Modeling (tables, procedures, functions, constraints)
• Database Interface to the Data Mining System
• Efficient Import and Export of Data
• Database Data Visualization
• Database Clustering for Access Efficiency
• Database Performance Tuning (memory usage, query encoding)
• Database Parallel Processing (multiple servers and CPUs)
• Distributed Information Repositories (data warehouses)
Search and Optimization Techniques: Search Types
• Types of search methods:
  — Calculus-based
    – Indirect (solve a nonlinear set of equations)
    – Direct (follow the local gradient – hill climbing)
  — Enumerative (search objective function values at every point – dynamic programming)
  — Random (search with random sampling)
• Randomized search methods guide the search with random processes – simulated annealing, genetic programming
Search and Optimization Techniques: Challenges
• Search and optimization challenges:
  — Global versus local maxima
  — Existence of derivatives (calculus-based methods)
  — High dimensionality
  — Highly nonlinear search space (global versus local maxima)
  — Large search space
• Example: A genome with N genes can encode 2^N states (active or inactive; regulation is not considered). Human genome: ~2^30,000 patterns; nematode genome: ~2^20,000 patterns.
Genetic Algorithm
• Genetic Algorithm (GA) based optimization is a computational analogue of Darwin’s theory of evolution (survival of the fittest).
• Description of GA-based optimization:
  — Uses a coding of the parameter set (not the parameters themselves)
  — Searches from a population of points (not a single point)
  — Uses an objective function (not derivatives or other auxiliary knowledge)
  — Employs probabilistic transition rules (not deterministic rules)
  — Is composed of three operators: reproduction (or selection), crossover, and mutation
Reference: D. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley Publishing Co., 1989.
Genetic Algorithm: Additional Operators
• Niching for optimization of multimodal and multiobjective functions
  — Fitness sharing: the number of individuals residing near any peak will be proportional to the height of that peak (reduce individual fitness according to their similarity); see the sharing formula after this list
  — Crowding: spread individuals among the most prominent peaks and do not allocate individuals proportionally to fitness (maintain diversity)
• Speciation for optimization of multimodal functions
  — Mating restriction scheme (restrict mating or crossover according to the similarity among individuals)
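For reference, the usual fitness-sharing rule this bullet summarizes (in the form popularized by Goldberg); the sharing radius sigma_share and shape parameter alpha are not given on the slide:

```latex
f'_i = \frac{f_i}{\sum_{j=1}^{n} \mathrm{sh}(d_{ij})},
\qquad
\mathrm{sh}(d) =
\begin{cases}
1 - (d/\sigma_{\mathrm{share}})^{\alpha}, & d < \sigma_{\mathrm{share}} \\
0, & \text{otherwise}
\end{cases}
```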
Genetic Algorithm: Example
Objective function example: an (on, off, on, off) input sequence of gene states is converted to the string 1010, and the objective function is evaluated on such strings.
• Steps:
  1. Randomly generate an initial population of size n = 2; e.g., strings 0110 & 1100
  2. Reproduction is a process of copying strings according to their objective function – “a roulette wheel”
  3. Crossover proceeds in two steps: (1) random mating of strings and (2) selecting random positions of each string for mating; e.g., obtain 1110 & 0100
  4. Mutation is the occasional random alteration of the value of a string position to protect against premature loss of information; e.g., obtain 0110 & 0100
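A minimal bit-string GA sketch in Python combining the three operators above (roulette-wheel reproduction, one-point crossover, occasional mutation); the toy objective function, string length, population size and rates are illustrative assumptions.

```python
import random

random.seed(0)
L, POP, GENERATIONS, P_MUT = 4, 4, 30, 0.05
objective = lambda s: sum(s)          # toy fitness: number of "on" genes in the string

population = [[random.randint(0, 1) for _ in range(L)] for _ in range(POP)]

def roulette(pop):
    # Reproduction: copy strings with probability proportional to their objective value
    weights = [objective(s) + 1e-9 for s in pop]
    return random.choices(pop, weights=weights, k=2)

for _ in range(GENERATIONS):
    new_pop = []
    while len(new_pop) < POP:
        a, b = roulette(population)
        point = random.randrange(1, L)            # crossover position
        for child in (a[:point] + b[point:], b[:point] + a[point:]):
            if random.random() < P_MUT:           # occasional mutation of one position
                i = random.randrange(L)
                child[i] = 1 - child[i]
            new_pop.append(child)
    population = new_pop[:POP]

best = max(population, key=objective)
print("best string:", "".join(map(str, best)), " fitness:", objective(best))
```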
VISUALIZATION
Visualization
• Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
• Results: pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution
[Figures: pie chart, parallel coordinates, temporal evolution]
Novel Visualization of Features
[Figure: feature selection and visualization – selected features shown as a mean feature image]
Novel Visualization of Clustering Results
[Figure: class labeling and visualization – Isodata (k-means) clustering of the mean feature image produces a label image]
VALIDATION
Why Validation?
• Validation type:
  — Within the existing data
  — With newly collected data
• Errors and uncertainties:
  — Systematic or random errors
  — Unknown variables – number of classes
  — Noise level – statistical confidence due to noise
  — Model validity – error measure, model over-fit or under-fit
  — Number of data points – measurement replicas
• Other issues
  — Experimental support of general theories
  — Exhaustive sampling is not feasible
Error Detection: Example of Spot Screening
[Figures: mask image with no screening; mask image after location and size screening; mask image after SNR screening]
Cross Validation: Example
• One-tier cross validation
  — Train on different data than the test data
• Two-tier cross validation
  — The score from one-tier cross validation is used by the bias optimizer to select the best learning algorithm parameters (e.g., # of control points). The more you optimize, the more you over-fit. The second tier measures the level of over-fit (an unbiased measure of accuracy).
  — Useful for comparing learning algorithms whose control parameters are optimized.
  — The number of folds is not optimized.
• Computational complexity:
  — #folds of top tier × #folds of bottom tier × #control points × CPU cost of the algorithm
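A minimal two-tier (nested) cross-validation sketch with scikit-learn: the inner tier optimizes a control parameter, the outer tier provides the unbiased accuracy estimate; the synthetic data, the SVM learner and the parameter grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Bottom tier: optimize the control parameter C over 3 folds
inner = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=3)

# Top tier: 5 outer folds estimate how well the tuned learner generalizes
outer_scores = cross_val_score(inner, X, y, cv=5)
print("accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```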
ARTIFICIAL IMMUNE SYSTEMS
Artificial Immune Systems
• Artificial Immune Systems (AIS) are adaptive systems, inspired by theoretical immunology and observed immune functions, principles and models, which are applied to problem solving.
• Other types of AIS are hybrids of ANN, GA and fuzzy systems combined with theoretical immunology models.
• Applications of AIS:
  — Pattern recognition (surveillance of infectious diseases)
  — Fault and anomaly detection (image inspection and segmentation)
  — Data analysis (reinforced, unsupervised learning)
  — Agent-based systems
  — Scheduling (adaptive scheduling)
  — Autonomous navigation and control (walking robots)
  — Search and optimization methods (constrained, time-dependent optimization)
  — Security of information systems (virus detection, network intrusion)
Basic Terms Used in Artificial Immune Systems
• The immune system is understood as a complex set of cells and molecules that protect our bodies against infection while under constant attack by antigens (foreign or self-antigens).
• The immune system consists of a two-tier line of defense: the adaptive (lymphocytes: B-cells & T-cells) and innate (granulocytes & macrophages) immune systems. Both depend upon the activity of white blood cells (leukocytes).
• The organs that make up the immune system (lymphoid organs) are the thymus & bone marrow (primary) and the tonsils, adenoids, spleen, appendix, lymph nodes, lymphatic vessels, and Peyer’s patches (secondary).
Mechanisms Adapted in Artificial Immune Systems
• Pattern recognition: lymphocytes (B-cells & T-cells) carry surface receptors capable of recognizing antigens
  — Example: recognition via complementary regions
• The clonal selection principle: only cells capable of recognizing an antigen stimulus will proliferate and differentiate into effector cells
• Immune learning and memory: reinforced learning strategy
• Self/Nonself discrimination: distinguish between molecules of the body’s own cells (self) and foreign molecules (nonself) – positive and negative selection, clonal expansion and ignorance
Why Artificial Immune System?
• Pattern recognition: cells and molecules of the immune system have several ways of recognizing patterns
• Uniqueness: each individual possesses its own immune system
• Self identity: elements that are not native to the body can be recognized and eliminated by the immune system
• Diversity: there exist varying types of elements that together protect the body
• Disposability: no single native element is essential for the functioning of the immune system
• Autonomy: there is no central element controlling the immune system
• Multi-layered: multiple layers of different mechanisms provide overall security
• No secure layer: any cell of the organism can be attacked by the IS
• Anomaly detection: the IS can recognize and react to pathogens that the body has never encountered before
• Dynamically changing coverage: the IS maintains a circulating repertoire of lymphocytes constantly being changed through cell death, production and reproduction
Why Artificial Immune System? (cont.)
• Distributivity: the immune elements are distributed all over the body
• Noise tolerance: an absolute recognition of pathogens is not required (tolerance to molecular noise)
• Resilience: the IS is capable of functioning despite disturbances
• Fault tolerance: the complementary roles of several immune components allow the re-allocation of tasks to other elements
• Robustness: diversity & number of immune elements
• Immune learning and memory: the molecules of the IS can adapt themselves, structurally and in number, to the antigenic challenges
• Predator–prey pattern of response: as the number of pathogens goes up, the number of immune cells goes up
• Self-organization: clonal selection and affinity maturation are responsible for selecting the most adapted cells to be maintained as long-living memory cells
• Integration with other systems: the IS communicates with other parts of the body
General Framework for Artificial Immune Systems
• General Framework for AIS:
  — A representation for the components of the system
  — A set of mechanisms to evaluate the interaction of individuals with the environment and each other (input stimuli, 1 to N fitness functions or other means) – affinity measures
  — Procedures of adaptation that govern the dynamics of the system (e.g., behavior over time) – algorithms
Reference: L. N. de Castro and J. Timmis, Artificial Immune Systems: A New Computational Intelligence Approach, Springer, 2002.
Components of Artificial Immune Systems
• Representation:
  — The generalized shape of any molecule in shape-space is described by an attribute string (a set of coordinates) of length L.
  — Shape-space describes interactions between molecules of the immune system and antigens.
  — The immune system is represented as a pattern (molecular) recognition system that is designed to identify shapes.
• Affinity Measures (a small sketch follows this list):
  — Euclidean, Manhattan and Hamming distances
  — Real-valued, integer, symbolic or alphabet sub-string spaces
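A minimal sketch of the three affinity measures named above, for real-valued and symbolic attribute strings of length L; the example strings are made up.

```python
import numpy as np

def euclidean(a, b):
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def manhattan(a, b):
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))   # number of mismatched positions

print(euclidean([0.1, 0.5, 0.9], [0.2, 0.4, 0.7]))
print(manhattan([0.1, 0.5, 0.9], [0.2, 0.4, 0.7]))
print(hamming("AGCT", "AGGT"))
```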
Components of Artificial Immune Systems
• Immune system algorithms:
  — Bone marrow model: generate a repertoire of cells and molecules (generate random attribute strings)
  — Thymus model: generate a repertoire of cells and molecules capable of performing self/non-self discrimination (positive selection: initialize strings, evaluate affinity and keep strings with affinity < threshold; negative selection: eliminate strings with affinity > threshold); a negative-selection sketch follows this list
  — Clonal selection algorithms: model the control of the interaction between the IS and the external environment or antigens (similar to a GA without crossover, with reproduction and mutation proportional to affinity)
  — Immune network models: simulate immune networks (differential equations describing the dynamics)
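A minimal sketch of the thymus (negative selection) model over binary attribute strings: candidate detectors whose affinity to any self string reaches the threshold are eliminated; the string length, affinity threshold, detector count and self set are illustrative assumptions.

```python
import random

random.seed(0)
L, THRESHOLD, N_DETECTORS = 8, 6, 20

def affinity(a, b):
    # Hamming-style affinity: number of positions that agree
    return sum(x == y for x, y in zip(a, b))

self_set = ["10110010", "10110011", "00110010"]          # "self" patterns

def random_string():
    return "".join(random.choice("01") for _ in range(L))

# Negative selection: keep only detectors whose affinity to every self string
# stays below the threshold
detectors = []
while len(detectors) < N_DETECTORS:
    d = random_string()
    if all(affinity(d, s) < THRESHOLD for s in self_set):
        detectors.append(d)

# Monitoring: a new string matched by any detector is flagged as non-self
incoming = "01001101"
nonself = any(affinity(d, incoming) >= THRESHOLD for d in detectors)
print("non-self detected:", nonself)
```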
Examples of Artificial Immune Systems
Comparison of ANN, GA and AIS:
• Representation
  — ANN: artificial neurons & interconnections of neurons (summing junction, connection strength, activation)
  — GA: genetic representation (gene = a single bit or a block of bits, chromosome = bit string)
  — AIS: IS representation of molecules (strings of coordinates) and their interactions (shape-space)
• Affinity Measures
  — ANN: backpropagation measures
  — GA: fitness function
  — AIS: affinity defined on a shape-space
• Algorithms
  — ANN: learning algorithms
  — GA: procedures for reproduction, genetic variation and selection
  — AIS: procedures for generation, cloning, selection and IS network dynamics
SUMMARY
Summary: Interdisciplinary Science
• CS and ECE have been used to gain a better understanding of biological processes through modeling and simulation
• CS and ECE have been enriched with the introduction of biological ideas, e.g., ANN, GA, cellular automata, artificial life, artificial immune systems (AIS)
• New fields: bio-informatics, bio-medical engineering
• Bilateral interactions between CS, ECE and Biology:
  — Biologically motivated computing (ANN, GA, artificial immune systems)
  — Computationally motivated biology (cellular automata)
  — Computing with biological mechanisms (silicon-based computing => quantum and DNA computing)
Summary: Bioinformatics
• Bioinformatics and the Microarray problem
  — Interdisciplinary Challenges: Terminology
  — Understanding Biology and Computer Science
• Data mining and image analysis steps
  — Image Analysis
  — Experiment Design as Prior Knowledge
  — Expected Results of Data Mining
  — Which Data Mining Technique to Use?
  — Data Mining Challenges: Complexity, Data Size, Search Space
• Validation
  — Confidence in Obtained Results?
  — Error Screening
  — Cross validation techniques
• Artificial Immune Systems
  — Biologically motivated computing