Machine Learning and Motion Planning
Dave Millman
October 17, 2007
Machine Learning intro
• Machine Learning (ML)
– The study of algorithms that improve automatically through experience. - Mitchell
• General description
– Data driven
– Extract some information from data
– Mathematically based
• Probability, Statistics, Information theory,
Computational learning theory, optimization
A very small set of uses of ML
– Text
• Document labeling, Part of speech tagging,
Summarization
– Vision
• Object recognition, Handwriting recognition, Emotion labeling, Surveillance
– Sound
• Speech recognition, music genre classification
– Finance
• Algorithmic trading
– Medical, Biological, Chemical, on and on and on…
A few types of ML
• Supervised
– Given: labeled data
– Usual goal: learn function
– Ex: SVM, Neural Networks, Boosting etc.
• Unsupervised
– Given: unlabeled data
– Usual goal: cluster data, learn conditional
probabilities
– Ex: k-means, agglomerative (nearest-neighbor) clustering
A few types of ML (cont.)
• Semi-Supervised
– Given: labeled and unlabeled data
– Usual goal: use the unlabeled data to augment the labeled data
– Ex: cluster all the data, then label the unlabeled points from the clusters
• Reinforcement
– Given: Reward function and set of actions
– Goal: Learn a function which optimizes the reward
function
– Ex: Q-learning, Ant-Q
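To make the supervised vs. unsupervised distinction from the last two slides concrete, here is a small sketch (mine, not from the talk) using scikit-learn on invented 2-D data: the SVM is handed labels and learns a classification function, while k-means is handed the same points without labels and only groups them.

# Illustrative sketch (assumes scikit-learn; data is invented).
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),   # cloud around (0, 0)
               rng.normal(4.0, 1.0, (50, 2))])  # cloud around (4, 4)
y = np.array([0] * 50 + [1] * 50)               # labels, used only by the SVM

# Supervised: labeled data in, prediction function out.
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.2, -0.3], [4.1, 3.9]]))   # expect [0 1]

# Unsupervised: same points, no labels; the goal is just to group them.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_[:3], km.labels_[-3:])          # two clusters, arbitrary ids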
General Idea Supervised
General Idea Unsupervised
General Idea SemiSupervised
General Idea Reinforcement
• Markov Decision Process (MDP)
– State space (fully or partially observable)
– Action space (static or time dependent)
– Transition function produces the next state (it depends only on the present state and action, not on the past: the Markov property)
– Reward function (based on the state and action)
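To pin these pieces down, here is a minimal sketch (mine, not the speaker's) of a two-state MDP and value iteration over it; the states, actions, transition probabilities, and rewards are invented for illustration.

# Tiny invented MDP: T[s][a] = list of (probability, next_state), R[s][a] = reward.
T = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}
R = {"s0": {"stay": 0.0, "go": 1.0},
     "s1": {"stay": 0.5, "go": 0.0}}
gamma = 0.9   # discount factor

# Value iteration: V(s) <- max_a [ R(s, a) + gamma * E[ V(s') ] ].
# Note the Markov property: each update looks only at the present state and action.
V = {s: 0.0 for s in T}
for _ in range(100):
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                for a in T[s])
         for s in T}
print(V)   # approximate optimal state values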
Why regression is not enough… The XOR problem
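The point is easy to check numerically: no linear model, even with a bias term, can fit XOR. A quick least-squares check (mine, not from the slides):

import numpy as np

# The four XOR cases and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best linear fit with a bias column.
A = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(A @ w)   # all four predictions come out 0.5

Every prediction lands on 0.5, which is why nonlinear models (multi-layer networks, kernel methods such as SVMs) are needed.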
Textbook Q-Learning [MI06]
• Learning flocking behavior
– N agents
– discrete time steps
– Agent i partner j
– Define the Q-value Q(st, at)
– st - the state at time t
– at - the action at time t
Our textbook example
– State of i
• [R] = floor(|i - j|), the discretized distance between agent i and partner j
– Actions for i
• a1 - Attract to j
• a2 - Parallel positive orientation to j
• a3 - Parallel negative orientation to j
• a4 - Repulsion from j
Reward Function - no predator
• Distances R1, R2, R3 s.t. R1 < R2 < R3
st             at                r
0 < [R] ≤ R1   a4                 1
0 < [R] ≤ R1   a1, a2, a3        -1
R1 < [R] ≤ R2  a2                 1
R1 < [R] ≤ R2  a1, a3, a4        -1
R2 < [R] ≤ R3  a1                 1
R2 < [R] ≤ R3  a2, a3, a4        -1
R3 < [R]       a1, a2, a3, a4     0
Reward Function - predator
• Distances R1, R2, R3 s.t. R1 < R2 < R3
st             at                r
0 < [R] ≤ R3   a4                 1
0 < [R] ≤ R3   a1, a2, a3        -1
R3 < [R]       a1, a2, a3, a4     0
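Putting the reward tables to work, here is a compact sketch of the tabular Q-learning update for this example. The thresholds, learning rate, discount, and epsilon-greedy exploration below are my own illustrative choices rather than values from [MI06]; the reward follows the no-predator table.

import random
from collections import defaultdict

ACTIONS = ["a1", "a2", "a3", "a4"]    # attract, parallel+, parallel-, repulse
R1, R2, R3 = 2, 5, 10                 # illustrative distance thresholds
alpha, gamma, eps = 0.1, 0.9, 0.1     # illustrative learning parameters

def reward(r_bin, action):
    # No-predator reward table: repulse when too close, align at middle range,
    # attract when far, no signal beyond R3.
    if r_bin <= R1:
        return 1 if action == "a4" else -1
    if r_bin <= R2:
        return 1 if action == "a2" else -1
    if r_bin <= R3:
        return 1 if action == "a1" else -1
    return 0

Q = defaultdict(float)                # Q[(state, action)], state = distance bin [R]

def choose_action(state):
    if random.random() < eps:         # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])   # exploit

def update(state, action, r, next_state):
    # Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (r + gamma * best_next - Q[(state, action)])

A simulation of agent i and partner j would call choose_action on the current distance bin, apply the motion, compute the reward from the table, and call update with the new bin.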
Don’t repeat work!!
• Basic planners work from scratch
• Ex: path planning for parking; no difference between the first time and the hundredth time
• Ideally, learn some general, higher-level "strategies" that can be reused
• General solution patterns in the problem
space
Viability Filtering [KP07]
• Agent can "see": perceptual information
– Range-finder-like virtual sensors
• Database of successfully perceptually-parameterized motions
– From its own experimentation or an external source
• Database exploited for future queries
– Search is based on what has previously been successful in similar situations
Sensors in Viability Filtering - some defs
• X - set of agent states
• E - set of environment states
• def x+ ∈ X+ = {x+ = (x, e) | x ∈ X, e ∈ E}
• Sensor functions σi: X+ → R, i = 1, …, n
• At a specific augmented state x+ ∈ X+, def sensor state s = (σ1(x+), …, σn(x+))
• And sensor space: s ∈ S, where S is the set of all sensor state values
Finally
• Def: locally situated state of the agent ξ = (s, x') ∈ Ξ, where x' is some state information independent of the agent's sensors
• Now we want to collect data to train a function V(ξ): Ξ → {viable, nonviable}
• Note: errors in V(ξ) could cause problems
Check Viability not Collision
Function IS_NONVIABLE(x+)
    if is_collision(x+) then
        return True
    s := (σ1(x+), …, σn(x+))
    x' := extract_internal_state(x+)
    ξ := (s, x')
    return ¬V(ξ)
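For concreteness, the same check in Python; sensors, is_collision, extract_internal_state, and viability_model are hypothetical stand-ins for the components described above, so this is a sketch of the structure rather than the paper's code.

def is_nonviable(x_plus, sensors, is_collision, extract_internal_state, viability_model):
    """Reject the augmented state x+ = (x, e) if it is in collision or judged nonviable."""
    if is_collision(x_plus):                       # hard geometric failure
        return True
    s = tuple(sigma(x_plus) for sigma in sensors)  # s = (sigma_1(x+), ..., sigma_n(x+))
    x_prime = extract_internal_state(x_plus)       # sensor-independent internal state x'
    xi = (s, x_prime)                              # locally situated state
    return not viability_model(xi)                 # V(xi) -> True (viable) / False (nonviable)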
Results and Further work
• Bootstrapping
• Use of history to create macroscopic plans
• Model transfer
Training a Dog [B02]
• MIT Media Lab system where the user interactively trains the dog using "clicker training"
• Uses acoustic patterns as cues for actions
• Can be taught cues based on different acoustic patterns
• Can create new actions from state-space search
• Simplified Q-learning based on animal-training techniques
Training a Dog (cont.)
• Predictable regularities
– animals tend toward successful states
– small time window
• Maximize use of supervisor feedback
– limit the state space by only looking at states that matter, e.g., if utterance u followed by action a produces a reward, then utterance u is important
• Easy to train
– credit accumulation
– and allowing a state-action pair to delegate credit to another state-action pair
Alternatives to Q-Learning
• Q-decomp [RZ03]
[Figure from RZ03: a simple world with initial state S0 and three terminal states SL, SU, SR, each with an associated reward in dollars and/or euros. The discount factor is γ ∈ (0, 1).]
– Complex agent as a set of simpler subagents
– Each subagent has its own reward function
– An arbitrator decides the best actions based on "advice" from the subagents
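As a sketch of the arbitration idea (my illustration, not code from [RZ03]): each subagent j reports its own Q_j(s, a), and the arbitrator acts greedily on the sum of that advice.

def arbitrate(state, actions, subagent_qs):
    # Q-decomposition arbitration: pick the action maximizing the summed advice.
    return max(actions, key=lambda a: sum(q(state, a) for q in subagent_qs))

# Hypothetical subagents for the dollars/euros world in the figure.
q_dollars = lambda s, a: {"left": 1.0, "up": 0.2, "right": 0.0}[a]
q_euros = lambda s, a: {"left": 0.0, "up": 0.2, "right": 1.5}[a]
print(arbitrate("S0", ["left", "up", "right"], [q_dollars, q_euros]))  # -> "right"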
Learning Behavior with
Q-Decomp [CT06]
• Q-Decomp as the learning technique
• Reward function - Inverse
Reinforcement Learning (IRL) [NR00]
– Mimicking behavior from an “expert”
Support Vector Path Planning
• Idea: use the SVM algorithm to generate a smooth path
• Not really machine learning, but a neat application of an ML algorithm
• Here is the idea
Support Vector Path Planning
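A rough sketch of that idea as I understand [M06]: sample points from the obstacles on either side of the corridor, label them as two classes, fit a kernel SVM, and read the smooth path off the maximum-margin decision boundary. The data and SVM parameters below are invented for illustration, using scikit-learn.

import numpy as np
from sklearn.svm import SVC

# Invented obstacle samples: one boundary above the free corridor, one below.
xs = np.linspace(0.0, 10.0, 40)
upper = np.c_[xs, 2.0 + 0.5 * np.sin(xs)]       # class +1 obstacle points
lower = np.c_[xs, -2.0 + 0.5 * np.cos(xs)]      # class -1 obstacle points
X = np.vstack([upper, lower])
y = np.hstack([np.ones(len(upper)), -np.ones(len(lower))])

# The maximum-margin surface between the two obstacle sets is the candidate path.
svm = SVC(kernel="rbf", C=10.0, gamma=0.5).fit(X, y)

# Trace the path: for each x, find the y where the decision function crosses zero.
path = []
for x in xs:
    ys = np.linspace(-3.0, 3.0, 601)
    vals = svm.decision_function(np.c_[np.full_like(ys, x), ys])
    path.append((x, float(ys[np.argmin(np.abs(vals))])))
print(path[:3])    # smooth waypoints roughly centered between the obstacles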
Videos
• Robot learning to pick up objects
– http://www.cs.ou.edu/~fagg/movies/index.html#torso_2004
• Training a Dog
– http://characters.media.mit.edu/projects/dobie.html
References
[NR00] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proc. 17th International Conf. on Machine Learning, pages 663-670. Morgan Kaufmann, San Francisco, CA, 2000.
[B02] B. Blumberg et al. Integrated learning for interactive synthetic characters. In SIGGRAPH '02: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pages 417-426, New York, NY, USA, 2002. ACM Press.
[RZ03] S. J. Russell and A. Zimdars. Q-decomposition for reinforcement learning agents. In ICML, pages 656-663, 2003.
[MI06] K. Morihiro, T. Isokawa, H. Nishimura, and N. Matsui. Emergence of flocking behavior based on reinforcement learning. In Knowledge-Based Intelligent Information and Engineering Systems, pages 699-706, 2006.
[CT06] T. Conde and D. Thalmann. Learnable behavioural model for autonomous virtual agents: low-level learning. In AAMAS '06: Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 89-96, New York, NY, USA, 2006. ACM Press.
[M06] J. Miura. Support vector path planning. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 2894-2899, 2006.
[KP07] M. Kalisiak and M. van de Panne. Faster motion planning using learned local viability models. In ICRA, pages 2700-2705, 2007.
Machine Learning References
[M07] Mehryar Mohri. Foundations of Machine Learning course notes. http://www.cs.nyu.edu/~mohri/ml07.html
[M97] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[RN05] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Series in Artificial Intelligence, Englewood Cliffs, New Jersey, 1995.
[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20, 1995.
[V98] Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[KV94] Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.