Genetic Algorithm Optimization of Membership Functions for

Download Report

Transcript Genetic Algorithm Optimization of Membership Functions for

Introduction to Artificial
Intelligence:
Applications in
Computational Biology
Susan M. Bridges [email protected]
Intelligent Systems Laboratory
Outline
•
•
•
•
•
•
What is AI?
Search
Expert systems
Uncertainty
Machine learning
Data mining
Intelligent Systems Laboratory
Intelligent Systems and
Computational Biology
•
•
First applications (DNA) in which great
progress was made were digital
•
•
Signal processing algorithms
Text processing techniques
Many of the most interesting and difficult
problems to be tackled are analog
•
•
•
Protein structure
Gene expression
Metabolic networks
Intelligent Systems Laboratory
Definitions of AI
(What is AI?)
•
Rich, E. and K. Knight . 1991. Artificial
Intelligence. New York: McGraw-Hill.
“Artificial intelligence (AI) is the study of how to
make computers do things which at the moment,
people do better.”
Intelligent Systems Laboratory
Another definition of AI
• Winston, Patrick Henry. 1984. Artificial Intelligence.
1984. Addison-Wesley, Reading, MA.
“Artificial Intelligence is the study of ideas that
enable computers to be intelligent. Intelligence
includes: ability to reason, ability to acquire and
apply knowledge, ability to perceive and manipulate
things in the physical world, and others.”
Intelligent Systems Laboratory
Why Study AI?
•
•
Understand human human intelligence
Develop “intelligent” machines
•
•
Robotics
Programs with intelligent properties
Intelligent Systems Laboratory
Acting Rationally:
Turing Test Approach
Interrogator
Intelligent Systems Laboratory
AI Tasks
•
Mundane tasks
• Perception
•
» Vision
» Speech
» Geometry
» Logic
» Integral calculus
• Natural Language
» Understanding
» Generation
» Translation
• Common sense reasoning
• Robot control
Formal tasks
• Games
• Mathematics
•
Expert tasks
• Engineering
• Scientific analysis
• Medical diagnosis
• Financial analysis
Intelligent Systems Laboratory
Intelligent Agents
• Agent
•
•
• Perceives its environment using sensors
• Acts on environment using effectors
Rational agent
• An agent that does the right thing
• Basis for action
» A measure of degree of success.
» Knowledge of what has been perceived so far.
» The actions that the agent can perform
Autonomous Agent
• Learns from experience
• Makes independent decisions
Intelligent Systems Laboratory
Major Topics
•
•
•
Search
Knowledge Representation
Machine Learning
Intelligent Systems Laboratory
Problem-solving agent
•
•
•
•
A type of goal-based agent
Find sequence of actions that lead to a
desirable state
Intelligent agents should make a set of
changes in the state of the environment that
maximizes the performance measure
Life is simpler if we can set a goal and aim to
satisfy it.
Intelligent Systems Laboratory
Components of a problem
•
•
Initial state
Set of possible actions
• actions can be described as operators
» an operator describes an action by specifying the state that can be
reached by carrying out an action in a particular state
State x
Operator a
State y
• actions can be described in terms of a successor function S. Given
a particular state x, S(x) returns the set of states reachable from x
by any single action.
Intelligent Systems Laboratory
State Space
•
•
•
•
The set of all states reachable from the initial
state by any sequence of actions
A path in the state space is a sequence of
actions leading from one state to another
The agent can apply a goal test to any single
state to determine if it is a goal state.
If one path is preferable to another, then we
may need to compute path cost (g).
Intelligent Systems Laboratory
5
4
1
6
1
8
8
7
3
2
7
Initial State
2
3
4
6
Goal State
States
Goal Test
Operators
Path Cost
Intelligent Systems Laboratory
5
Problem: Find route from
Louisville to West Point
West Point
Pheba
Mathiston
Mayhew
Maben
Columbus
Starkville
Sturgis
Artesia
Ackerman
Crawford
Louisville
Intelligent Systems Laboratory
Brooksville
A. The initial state
Louisville
Louisville
B. After expanding
Louisville
Ackerman
Starkville
Brooksville
Louisville
C. After expanding
Ackerman
Maben
Ackerman
Sturgis
Starkville
Louisville
Intelligent Systems Laboratory
Brooksville
Some terms
•
•
•
•
•
New states are generated from old states by
operators.
This is called expanding the state.
The choice of which state to expand first is
called the search strategy
Result is called a search tree
The set of nodes waiting to be expanded is
called the fringe or frontier
Intelligent Systems Laboratory
Search Strategies
•
Requirements for a good search strategy
•
•
•
•
causes motion
is systematic
State space can usually be represented as a
tree or a graph
Two important parameters of a tree
•
•
branching factor (b)
depth (d)
Intelligent Systems Laboratory
Two Types of Searches
•
Uninformed or blind search
•
•
•
systematically generate states
test states to see if they are goal states
Informed or heuristic search
•
•
•
use knowledge about the problem domain
explore search space more efficiently
may sacrifice accuracy for speed
Intelligent Systems Laboratory
Breadth-first search
•
All nodes at each depth d are expanded
before any nodes at depth d+1
Intelligent Systems Laboratory
Depth-first search
l
l
Always expands one of the nodes at the
deepest level of the tree
Parameter m is the maximum depth
Intelligent Systems Laboratory
What is a heuristic?
(rule of thumb)
•
•
A heuristic is a formalized rule for choosing
those branches in a state space that are most
likely to lead to an acceptable solution (Luger
and Stubblefield, 1998).
Used two ways
•
•
some problems do not have exact solutions, so we
just do the best we can (medical diagnosis)
there may be an exact solution, but it may be very
expensive to find
Intelligent Systems Laboratory
Hill Climbing
•
•
•
•
Use an heuristic function (or objective or
evaluation function) to decide which direction to
move in the search space.
Always move toward the state that appears to be
best (basing all decisions on local information).
Assume that we want to maximize the value of the
function.
Can also be used for minimization (called gradient
descent)
Intelligent Systems Laboratory
1
2
8
7
6
3
1
2
3
Steepest Ascent Hill Climbing
4
7
8
4
Using Manhattan Distance
5
6
5
Heuristic
Goal
1
h=
2
7
6
h=
8
3
1
2
3
4
7
8
4
5
6
6
5
h=
1
2
3
1
2
3
7
8
4
7
8
4
7
6
5
6
6
5
1
2
3
7
8
4
6
5
h=
Intelligent Systems Laboratory
h=
A* Search
•
•
•
Minimizing the total path cost
Combines uniform-cost search and greedy search.
Evaluation function:
f(n) = g(n) + h(n)
g(n): cost of path from start to node n
h(n): estimate of cost of path from n to goal
f(n): estimated cost of the cheapest solution through n
Intelligent Systems Laboratory
Goal: Minimum length path.
Is h(n) an admissible heuristic?
f(n) = g(n) + h(n)
A(22)
5
3
K (18)
3
S (18)
11
F(7)
4
L ( 3)
4
C (21)
2
M(2)
14
d=1
D (8)
4
7
G (9)
H(6)
12
E(12)
3
10
B (18)
6
d=0
7
1
N(9)
O(5)
8
11
I (13)
5
12
P(2)
6
d=2
J(14)
3
Q(10)
4
R(12)
d=3
5
T(0)
U (0)
Numbers in parentheses are h(n)
Numbers on edges are operator costs
Intelligent Systems Laboratory
d=4
Multiple Sequence Alignment
•
•
DNA and protein sequences
Alignment of multiple sequences created by
inserting gaps to shift characters to matching
positions
ATCG
TGA
GAT
•
-ATCG--T-GA
GAT---
Optimal alignment maximizes the number of
matching positions
Intelligent Systems Laboratory
Multiple Sequence Alignment
As State-Space Search
(Eric Hansen, Rong Zhou)
Space Complexity: O (LN)
Time Complexity: O (2NLN)
Where L is the average length of sequences and
N is the number of sequences
start
ATCG
TGA
GAT
-ATCG--T-GA
GAT---
goal
An Illustration of Anytime A*
09
18
38
19
37
16
28
18
19
19
29
20
gh
f
47
18
47
18
56
17
65
16
Nodes pruned by Anytime A*
gh
= expanded node
f
g h = stored but not
f
expanded node
74
15
83
14
f = g + 2h
92
13
Goal
11 0
11
Total number of nodes stored = 8
Genetic Algorithms
•
•
•
Search procedure based on a simple model
of evolution
Uses a “random” process to explore search
space
Has been applied in many domains
Intelligent Systems Laboratory
Terminology
• Begin with a population of individuals. Each
•
•
•
•
individual represents a solution to the problem we are
trying to solve.
A data structure describes the genetic structure of the
individual. (Assume for initial discussion that this is a
string of 0’s and 1’s).
In genetics, the strings are called chromosomes and
the bits are called genes.
The string associated with each individual is its
genotype
Selection is based on fitness of individuals
Intelligent Systems Laboratory
The Genetic Algorithm
•
•
Each evolving population of individuals is
called a generation.
Given a population of individuals
corresponding to one generation, the
algorithm simulates natural selection and
reproduction in order to obtain the next
generation.
Intelligent Systems Laboratory
Three basic operations
•
Reproduction:
•
•
Crossover:
•
•
Individuals from one generation are selected for
the next generation
Genetic material from one individual is exchanged
with genetic material from another individual
Mutation:
• Genetic material is altered
Intelligent Systems Laboratory
General GA Procedure
Selection, crossover, and mutation operations
Initial
population
Evaluate fitness
Parent candidate pool
Father and Mother
Select
parents
no
Crossover and mutate
Evaluate fitness
and replace
yes
Converge?
Offspring
Next generation
population
Intelligent Systems Laboratory
Example of General GA
Procedure
Selection, crossover, and mutation operations
11
13
2
9
1101
1011
0100
1001
1101
1011
1011
1001
Generation n
Reproduction
Crossover
1
011
1
101
10 01
1 0 0 11 1
1
011
0 101
1 0 1010
1 0 0 11 1
Mutation
Generation n+ 1
15
8
1
Intelligent Systems Laboratory
13
Two keys to the success of a GA
•
•
Data structures for
» Genes
» Chromosomes
» Population
Fitness evaluation function
Intelligent Systems Laboratory
Knowledge Representation
•
•
•
•
•
Semantic networks
Frame based systems
Rule based expert systems
Ontologies
Neural networks
Intelligent Systems Laboratory
Anything
AbstractObjects
Sets
Numbers
Events
Representational
Objects
Places
Intervals
Sentences
Processes
Physical
Objects
Measurements
Moments
Categories
Times
Weights
Things
Animals
Stuff
Agents
Humans
Intelligent Systems Laboratory
Expert Systems
• Rule based systems
•
•
•
•
•
Garnered a great deal of attention in the 1980’s
Most famous examples are in medical domains
Stimulated interest in “logic programming”
Encode knowledge of people as sets of rules
Still widely used
• Knowledge acquisition bottleneck
Intelligent Systems Laboratory
Representing Uncertainty
•
•
Fuzzy logic
Bayesian reasoning
Intelligent Systems Laboratory
Uncertainty versus Vagueness
•
Certainty–degree of belief
•
•
•
there is a 50% probability of rain today
I am 30 % sure the patient is suffering from
pneumonia
Vagueness–the degree to which an item
belongs to a category
• the man is tall
• move the wheel slightly to the left
• the patient’s lungs are highly congested
Intelligent Systems Laboratory
Fuzzy Sets Represent
Vagueness
•
•
•
•
Lotfi Zadeh popularized the idea in the 60’s
Popular concept in Eastern philosophy
Reasoning with fuzzy sets is called fuzzy
logic
Fuzzy logic is also called
•
•
approximate reasoning
continuous logic
Intelligent Systems Laboratory
Fuzzy Set Definitions
•
•
Set membership can be expressed using a
characteristic (or descrimination) function
Classic (or crisp) sets
If objects x are chosen from some universe X
1 if x is an element of set A
0 if x is not an element of set A
 A (x)  
•
Fuzzy sets - an element can be a partial
member of a set (grade of membership)
0   A (x)  1
Intelligent Systems Laboratory
Examples of Fuzzy Concepts
from Natural Language
•
•
•
•
•
•
•
John is tall
The weather is rainy
Turn the volume up a little
Dr. Bridges’ tests are long
Add water until the dough is the right
consistency
There was very little change in the cost
The water bill was somewhat high
Intelligent Systems Laboratory
Representing Fuzzy Sets
•
Enumeration of membership values of all
elements with non-zero membership
TALL = {.125/5.5, .5/6, .875/6.5, 1/7, 1/7.5, 1/8}
• Represent membership with a function
Intelligent Systems Laboratory
Functional Representations
Fuzzy Set Tall
Membership
1
Tall
0
4
5
6
7
Intelligent Systems Laboratory
Height in feet
Linguistic (or Fuzzy)Variable
•
•
•
Usually corresponds to a noun
The values of a linguistic variable are fuzzy
sets (which correspond to adjectives)
Examples:
Linguistic variable
Height
Weight
Temperature
Speed
Fuzzy sets
short medium tall
light average heavy
cold cool typical warm hot
slow medium fast
Intelligent Systems Laboratory
Linguistic Variable
Temperature
Cold
Normal
Hot
1
0
30
40
50
60
70
80
90
100
Intelligent Systems Laboratory
Some Fuzzy Set Operations
•
Set union A  B
A  B(x) max(A(x),B(x)) for all x X
alternate syntax (join operator)
A  B(x) A(x)B(x)) for all x X
•
Set intersection A B
AB(x) min(A(x),B(x)) for all x X
alternate syntax (meet operator)
A  B(x) A(x) B(x)) for all x X
Intelligent Systems Laboratory
Fuzzy Reasoning
•
A fuzzy proposition is a statement that asserts
a value for a linguistic (or fuzzy) variable
•
Example: Joe’s height is medium
» Linguistic variable (noun)
Joe’s height
» Fuzzy set (adjective)
medium
» The fuzzy set “medium” is a value of the linguistic
variable “Joe’s height”
•
•
A fuzzy rule relates two or more fuzzy
propositions
Fuzzy inference techniques are used to draw
conclusions using fuzzy rules
Intelligent Systems Laboratory
Example Fuzzy Rule
If speed is normal
then braking.force is medium
Speed
Normal = (0/0, .1/20, .8/40, 1/60, .1/80, 0/100)
braking.force
Medium = (0/0, .5/1, 1/2, 1/3, .2/4, 0/5)
Intelligent Systems Laboratory
J. Dickerson, D. Bedeant, Z.Cox, W. Qi, D. Ashlock, and E. Wurtele, Atlantic
Symposium on Computational Biology and Genome Information Systems & Technology (CBGIST 2000,
26-30.
Intelligent Systems Laboratory
Bayesian Reasoning
•
•
Bayesian networks: Represent knowledge as
a network of random variables
Many names and many variations
• Belief networks
• Probabilistic networks
• Causal networks
• Knowledge Maps
• Influence Diagram (extension)
• Decision Network (extension)
Intelligent Systems Laboratory
Belief Network
P(B)
0.001
P(E)
0.002
Burglary
Earthquake
B
T
T
F
F
Alarm
JohnCalls
A
T
F
P(J|A)
0.90
0.05
MaryCalls
Intelligent Systems Laboratory
E
T
F
T
F
P(A|B,E)
0.95
0.94
0.29
0.01
A
T
F
P(M|A)
0.70
0.01
Intelligent Systems Laboratory
Classification of Learning
Systems
•
Supervised learning
•
•
•
•
Give the system a set of examples and an
“answer” for each example.
Train the system until it can give the correct
response to these examples (or most of them).
Unsupervised learning
•
Give the system a set of examples and let it
discover interesting patterns in the examples.
Reinforcement learning
•
Learn from rewards and penalties
Intelligent Systems Laboratory
Feature Vectors
• Simple representation
•
used by most learning
systems.
Represents each
example as a vector or
numbers
• Quantities
• Nominal data
• Ordinal data
#53 5.3
cold 3.2
Intelligent Systems Laboratory
blue 1
Neural Networks
• Computational models “loosely” based on the
•
structure of the brain
Characteristics of the brain
• Large number of simple processing units (neurons)
• Highly connected
• No central control
• Neurons are slow devices compared to digital computers
• Can perform complex tasks in a short period of time
• Neurons are failure-prone devices
• Handles fuzzy situations very well.
• Information accessed on the basis of content
• Learns from experience
Intelligent Systems Laboratory
Neural Networks
•
•
•
•
•
Based on model of nervous system
Large numbers of simple processing units
Units are highly connected and connections
are weighted.
Highly parallel distributed control
Emphasis on learning internal
representations automatically
Intelligent Systems Laboratory
Neural Network Concepts
• Cell or unit or neuron or node
• Autonomous processing unit that models a neuron
• Purpose
•
» Receives information from other cells
» Performs simple processing
» Sends results on to one or more cells
Layers
• A collection of cells that perform a common function
• Types:
» Input layer
» Hidden layer
» Output layer
Intelligent Systems Laboratory
Layers of Neurons
I1
H1
O1
I2
H2
I3
Input Layer
Hidden Layer
Output Layer
Intelligent Systems Laboratory
Properties
•
•
•
In general, there is no interconnection
between cells in the same layer
Connections are one or two way
communications links between two cells
Weights are the strength of the connections.
A weight wij is a real number than indicates
the influence that cell ui has on cell uj
Intelligent Systems Laboratory
More about weights
•
•
•
•
Positive weights indicate reinforcement
Negative weights indicate inhibition
Weight of 0 indicates no influence or connection
Weights may be initialized to one of these:
•
•
•
•
0
predefined values
random values
Weights are altered by experience
Intelligent Systems Laboratory
Multilayer Feed-Forward
Networks
•
•
Networks that are connected acyclic graphs
Backpropagation
•
•
•
•
Most popular training method for feed forward layered
networks.
Invented in 1969 by Bryson and Ho
Ignored until 80’s
Supervised learning technique
Intelligent Systems Laboratory
Back Propagation
•
•
•
•
•
Initialize the network with random weights
Show it an input instance
Compute the output
Determine how much the output differs from
the goal.
Feed small adjustments to the weights back
through the network based on the error.
Intelligent Systems Laboratory
General Algorithm for Training
Network
Initialize NN
for epoch = 1 to MAXEPOCHS
for each input-output pair in the training set
present an example and compute error
adjust weights to reduce the error
compute mean-square-error for training set
for each input-output pair in the test set
compute error
compute mean-square-error for test set
Intelligent Systems Laboratory
Generalization
Training set
%
correct
Test set
Training time (# of epochs)
Intelligent Systems Laboratory
Computational Biology Applications
•
•
•
•
•
•
Protein classification
Sequence analysis
DNA fragment assembly
Prediction of transmembrane regions
Phylogenetic classification of ribosome
sequences
And many more
Intelligent Systems Laboratory
Self-Organizing Maps
•
•
•
Also called Kohonen maps
Used for unsupervised learning
Widely applied for comparison of gene
expression data
Intelligent Systems Laboratory
Principle of Self-Organizing Maps
Intelligent Systems Laboratory
Self-Organizing Map from Yeast Gene Expression Data
(German Florez)
Intelligent Systems Laboratory
Knowledge Discovery
•
•
Definition: Non-trivial extraction of implicit,
previously unknown, and potentially useful
information from data.
Applications in biology
•
•
Text mining
Association rule mining
Intelligent Systems Laboratory
KDD Process
Interpretation/
Evaluation
Data Mining
Knowledge
Transformation
Preprocessing
Selection
--- --- --- ----- --- --- --Preprocessed
Data
Patterns
Transformed
Data
Target
Data
Data
Intelligent Systems Laboratory
Iterative Clustering Procedure
( Wan, Bridges, J.Boyle, A.Boyle)
Download data from Genbank
Represent data using an appropriate method
Construct feature vector for each gene from
POSITIONAL WEIGHT MATRICES
Cluster with K-clustering
Select clustering without clear patterns for
further clustering
Visualize results
Conclusions
Intelligent Systems Laboratory
Positional Weight Matrix
Representation
Clustering Results
S. solfataricus with A, G box (2976)
Clustering window (-48 to -1)
With A box
(1656)
Nearby with A, G box (286)
Cluster window (-24 to -1)
With A and G box
(142)
With G box
(1320)
Distant with A box
(1370)
Nearby with G box
(571)
With A box (146)
Intelligent Systems Laboratory
Distant with G box
(749)