Overview of project


Transcript

Far Reaching Research (FRR) Project
IBM Research
See, Hear, Do:
Language and Robots
Jonathan Connell
Exploratory Computer Vision Group
Etienne Marcheret
Speech Algorithms & Engines Group
Sharath Pankanti (ECVG)
Josef Vopicka (Speech)
Challenge = Multi-modal instructional dialogs
Use speech, language, and vision to learn objects & actions
Innate perception abilities (objects / properties)
Innate action capabilities (navigation / grasping)
Easily acquire terms not knowable a priori
Example dialog (illustrating command following, verb learning, noun learning, and advice taking):
Round up my mug.
I don’t know how to “round up” your mug.
Walk around the house and look for it.
When you find it bring it back to me.
I don’t know what your “mug” looks like.
It is like this <shows another mug> but sort of orange-ish.
OK … I could not find your mug.
Try looking on the table in the living room.
OK … Here it is!
Language Learning & Understanding is an AAAI Grand Challenge
http://www.aaai.org/aitopics/pmwiki/pmwiki.php/AITopics/GrandChallenges#language
Eldercare as an application
 Example tasks:
Pick up dropped phone
Get blanket from another room
Bring me the book I was reading yesterday
 Large potential market
Many affluent societies have a demographic imbalance (Japan, EU, US)
Institutional care can be very expensive (to person, insurance, state)
 A little help can go a long way
Can be supplied immediately (no waiting list for admission)
Allows person to stay at home longer (generally easier & less expensive)
Boosts independence and feeling of control (psychological advantage)
 Note: We are not attempting to address the whole problem. Out of scope:
Aggressive production cost containment
Robust self-recharging and stair traversal
Bathing and bathroom care, patient transfer, cooking
OSHA, ADA, FDA, FCC, UL or CE certification
Novel approach: Linguistically-guided robots
Use language as the core of the operating system, not something tacked on after the fact
 Interface
Much easier than programming (textual or graphical)
More natural for unskilled users
Less effort for “one-off” activities
 Interaction
Simple progress / error reporting (“I am entering the kitchen”)
Easy to request missing information (“Please tell me where X is located.”)
Clarification dialogs possible (“Which box did you want, red or blue?”)
 Learning
Can direct attention to specific objects or areas (e.g. “this object”)
Can focus learning on relevant properties (e.g. color, location)
Less trial and error since richer feedback (i.e. faster acquisition)
ELI the robot
 Power supply
528 Wh sealed lead-acid batteries
28 lbs for balancing counterweight
Estimated 4-5 hr run-time (≈105-130 W average draw)
 Drive system
Two wheel differential steer
Two 4 in rear casters (blue)
47 in/sec (2.7 mph) top speed
Handles 10 deg slope, ½ in bumps
 Motorized lift
For arm & sensors (offset 27 in up)
Floor to 36 in (counter) range
16 in/sec (about 2.3 sec bottom to top)
 Computation
Platform for quad-core GPU laptop
Single USB cable for interface
 Overall
About 65 lbs total weight
Stable ±10 deg in any direction
15 in wide, 24 in long, 45-66 in tall
Joystick control video: Picking up a dropped object
eli_kitchen.wmv
Speech interaction video: Far-field speech interpretation
eli_voice.wmv
Detached Arm for dialog development
 Hardware
– Single color camera 25 in above surface
– Arm = 3 positional DOF, Wrist = 3 angular DOF
– Gripper augmented with compliant closure
– Workspace = 2 ft wide, 1 ft deep, +8/-2 in high
 Software
– Serial control code optimized
– Joint control via manual gamepad
– Inverse kinematic solver
[Photo: camera and arm over a tabletop with OTC medications (Advil & Gaviscon)]
Speech manipulation video: Selecting and disambiguating objects
eli_table.wmv
Dialog phenomena handled
 “Grab it.” (1 object)
<grabs object>
 no confusion since only 1 choice for “it”
 “Grab it.” (4 objects)
“I'm confused. Which of the 4 things do you mean?”
 knows a unique target is required
 “What color is the object on the left?” (4 objects)
“It’s blue.”
 understands positions & colors
 “Grab it.” (4 objects)
<grabs blue object>
 uses “it” from previous interaction
 “Grab that object” (human points)
<grabs object>
 understands human gesture
 “Grab the white thing.” (2 white objects)
“Do you mean this one?” <robot points>
 uses gesture to suggest alternative
 “No, the other one.”
<grabs other object>
 uses “other” from previous interaction
 “Grab the green thing.”
“Sorry, that’s too big for me.”
 sensitive to physical constraints
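
The behaviors above amount to a small reference-resolution policy: remember the last referent, demand a unique target, and fall back to pointing when in doubt. Below is a minimal Python sketch of such a policy; the DialogContext class, its data layout, and the property sets are illustrative assumptions, not the robot's actual code.

# Hypothetical sketch of the reference-resolution behaviors listed above.
# DialogContext and its fields are illustrative, not the robot's actual API.
class DialogContext:
    def __init__(self):
        self.last_referent = None                # object picked out on a previous turn

    def resolve(self, phrase, objects):
        """Return one target object, or a clarifying question to speak."""
        phrase = phrase.lower()
        if phrase == "it":
            if self.last_referent is not None:
                return self._pick(self.last_referent)     # reuse prior referent
            if len(objects) == 1:
                return self._pick(objects[0])             # only one possible "it"
            return f"I'm confused. Which of the {len(objects)} things do you mean?"
        if "other" in phrase:
            others = [o for o in objects if o is not self.last_referent]
            if len(others) == 1:
                return self._pick(others[0])              # unique alternative to the last one
        hits = [o for o in objects if o["props"] & set(phrase.split())]
        if len(hits) == 1:                                # e.g. "the blue thing"
            return self._pick(hits[0])
        return "Do you mean this one?"                    # point at a candidate to confirm

    def _pick(self, obj):
        self.last_referent = obj
        return obj

ctx = DialogContext()
objs = [{"props": {"white"}}, {"props": {"white"}}, {"props": {"blue"}}]
print(ctx.resolve("it", objs))                # clarifying question: three candidates, no prior "it"
print(ctx.resolve("the blue thing", objs))    # the unique blue object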
Noun Learning Scenario
Features:
 Automatically finds objects
 Selects by position, size, color
 Understands user pointing
 Robot points for emphasis
 Grabs selected object
 Passes object to/from user
 Adds new nouns to grammar
 Builds visual models
 Identifies objects from models
eli_noun_sub.wmv
Multi-modal Dialog Script
1. “Eli, what is the object on the left?”
No existing visual model matches object
“I don’t know.”
2. “Eli, that is aspirin.”
New word added to grammar
New visual model for object
“Okay. This <points> is aspirin.”
3. “Eli, this object <points> is Advil.”
Word already known
New visual model for object
“Okay. That is Advil.”
Model = size + shape + colors
4. “Eli, how many Advil do you see?”
Matching = nearest neighbor
Uses existing visual model to find item(s)
dist = Σ w[i] * | v[i] – m[i] |
“I see two.”
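
The matching rule in step 4 is a weighted L1 nearest-neighbor comparison between the observed feature vector v and each stored model m, dist = Σ w[i] · |v[i] – m[i]|. The Python sketch below shows that computation on toy data; the feature layout, weights, and acceptance threshold are assumptions for illustration, not the project's actual values.

import numpy as np

def model_distance(v, m, w):
    """Weighted L1 distance between observed features v and a stored model m."""
    v, m, w = map(np.asarray, (v, m, w))
    return float(np.sum(w * np.abs(v - m)))

def best_match(v, models, w, threshold=1.0):
    """Name of the closest stored visual model, or None if nothing is close enough."""
    dist, name = min((model_distance(v, m, w), name) for name, m in models.items())
    return name if dist < threshold else None

# toy features: [width, height, hue, saturation], all normalized to [0, 1]
models = {"aspirin": [0.30, 0.55, 0.05, 0.60],
          "Advil":   [0.28, 0.50, 0.95, 0.70]}
weights = [1.0, 1.0, 2.0, 0.5]                 # e.g. weight color more heavily than size
observed = [0.29, 0.52, 0.93, 0.65]
print(best_match(observed, models, weights))   # -> Advil

Counting ("how many Advil do you see?") is then just applying the same match to every detected object and tallying the hits.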
Multi-modal Dialog Script (continued)
5. “Eli, give me the Tylenol.”
Uses existing visual model to find item(s)
<gets bottle> “Here you go”
Waits for user hand motion
<releases>
Waits for user hand motion
<regrabs bottle> “Thanks.”
<replaces bottle>
6. “Eli, where is the aspirin?”
Uses existing visual model to find item(s)
“Here.” <points>
Collaboration with Tokyo Research Lab
 Principal researchers:
• Michiharu Kudoh
• Risa Nishiyama
 “BRAINS” project goal:
Make the robot respond appropriately as if it understands social rules
[Architecture diagram: the Eli Robot at Watson (vision, ASR, objects, parser, vocabulary, visual models, kinematics, sequencer) connects over a network to the Brainy Response System at Tokyo (reasoning, semantic memory, action models, talk archive, lifelog), exchanging context updates, retrievals, and vetoes/recommendations.]
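
One plausible reading of the diagram is a simple request/response loop: the robot sends a context update describing the user, the request, and the visible objects, and the remote service answers with a veto or a recommendation. The Python sketch below shows such a round trip; the message fields and the cloud_decide stand-in are invented for illustration and are not the actual BRAINS protocol.

import json

def make_context_update(user, requested_item, visible_items):
    """Serialize what the robot currently knows for the remote reasoner."""
    return json.dumps({"user": user, "request": requested_item, "visible": visible_items})

def cloud_decide(message):
    """Local stand-in for the Tokyo service: veto or approve, with something to say."""
    ctx = json.loads(message)
    if ctx["user"] == "Alice" and ctx["request"] == "aspirin":
        return {"veto": True, "say": "But that will hurt your stomach."}
    return {"veto": False, "say": None}

reply = cloud_decide(make_context_update("Alice", "aspirin", ["aspirin", "Tums"]))
if reply["veto"]:
    print(reply["say"])    # the robot voices the objection instead of handing over the item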
Combined Demo
Features:
 Learns object names
 Learns object appearances
 Grabs and passes objects
 Vetoes actions based on DB
 Picks alternates using ontology
 Checks for valid dose interval
 Real-time cloud connection
eli_bottles_sub.wmv
Combined Demo Script
1. “Eli, this <points> object is aspirin.”
New word added to grammar
New visual model for object
“Okay. That is aspirin”
2. “Eli, the object on the right is called Tums.”
Word already known
New visual model for object
“Okay. This <points> is Tums.”
3. “Eli, give me some aspirin.”
Uses existing visual model to find item(s)
Check against personal database [diagram: Alice + aspirin → NO]
“But that will hurt your stomach.”
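
A minimal sketch of the personal-database check in step 3; the table contents, field names, and the check_request helper are invented for illustration.

# Hypothetical personal database keyed by user; entries are invented examples.
PERSONAL_DB = {
    "Alice": {"contraindicated": {"aspirin"},
              "reasons": {"aspirin": "that will hurt your stomach"}},
}

def check_request(user, drug):
    """Return (allowed, spoken objection) for a medication request."""
    record = PERSONAL_DB.get(user, {})
    if drug in record.get("contraindicated", set()):
        return False, "But " + record["reasons"][drug] + "."
    return True, None

print(check_request("Alice", "aspirin"))   # -> (False, 'But that will hurt your stomach.')
print(check_request("Alice", "Tums"))      # -> (True, None)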
Combined Demo Script (continued)
4. “Eli, give me some Tylenol instead.”
Uses existing visual model to find item(s)
<gets bottle> “Here you go”
Waits for user hand motion
<releases>
Waits for user hand motion
<regrabs bottle> “Thanks.”
<replaces bottle>
Records dose in lifelog [lifelog history diagram: 7:14 AM …, 8:39 AM …, 9:01 AM took Tylenol]
5. “Eli, give me some Rolaids.”
No visual model for item
“I don’t know what Rolaids looks like.”
Ontology used to find available alternative(s) [ontology diagram: antacid → Rolaids (requested), Tums (present)]
“Do you want another antacid, Tums?”
6. “Eli, just give me some Tylenol.”
Uses existing visual model to find item(s)
Lifelog consulted for last dose
“You just had Tylenol.”
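
Steps 5 and 6 pair an ontology lookup (offer a visible item of the same class) with a lifelog check (refuse a repeat dose too soon). The Python sketch below shows both checks on toy data; the ontology entries, the four-hour minimum interval, and the timestamps are assumptions, not project values.

from datetime import datetime, timedelta

# Toy ontology and lifelog; contents are invented for illustration.
ONTOLOGY = {"Rolaids": "antacid", "Tums": "antacid", "Tylenol": "analgesic"}
lifelog = [(datetime(2012, 6, 1, 9, 1), "took Tylenol")]

def alternative(requested, visible):
    """Suggest a visible item in the same ontology class as the requested one."""
    wanted = ONTOLOGY.get(requested)
    return next((v for v in visible if ONTOLOGY.get(v) == wanted and v != requested), None)

def dose_ok(item, now, min_gap=timedelta(hours=4)):
    """True if no dose of this item was logged within the minimum interval."""
    return not any(event == f"took {item}" and now - when < min_gap
                   for when, event in lifelog)

print(alternative("Rolaids", ["Tums", "Tylenol"]))        # -> Tums
print(dose_ok("Tylenol", datetime(2012, 6, 1, 9, 30)))    # -> False: "You just had Tylenol."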
Verb Learning Scenario
Features:
 Handles relative motion commands
 Responds to incremental positioning
 Learns action sequences
 Applies new actions to other objects
eli_verb_sub.wmv
Verb Learning Script
1. “Eli, poke the thing in the middle.”
Resolves visual target based on position
No existing action sequence to link
New action sequence opened for input
[Action sequence diagram: “poke” = point 1.0, out 1.0, out -1.0]
“I don’t know how to poke something.”
2. “Eli, point at it.”
Resolves pronoun from previous selection
Moves relative to visual target
<points>
3. “Eli, extend your hand.”
Low level incremental move
<advances>
4. “Eli, retract your hand.”
Low level incremental move
<retreats>
Verb Learning Script (continued)
5. “Eli, that is how you poke something.”
Recognizes closing of action block
Links action sequence to word
“Okay. Now I know how to poke something.”
[Stored action sequence: “poke” = point 1.0, out 1.0, out -1.0 → DB]
6. “Eli, poke the red object.”
Resolves visual target based on color
Retrieves action sequence for verb and executes
<pokes>
7. “Eli, poke the object on the left.”
Resolves visual target based on position
Retrieves action sequence for verb and executes
<pokes>
8. “Eli, poke the Tylenol.”
Resolves visual target based on known object model
Retrieves action sequence for verb and executes
<pokes>
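
The script treats a verb as a recorded sequence of parameterized primitives: an unknown verb opens a buffer, each subsequent low-level command is appended, and “that is how you X” closes the block and links it to the word for later replay on any resolved target. A compact Python sketch of that mechanism follows; the VerbRecorder class and the do_move callback are hypothetical names, with the primitives and amounts taken from the poke example above.

# Hypothetical sketch of verb learning as a recorded sequence of primitives.
class VerbRecorder:
    def __init__(self):
        self.known = {}            # verb -> list of (primitive, amount)
        self.open_verb = None
        self.buffer = []

    def start(self, verb):
        """Unknown verb heard: open a new action sequence for input."""
        self.open_verb, self.buffer = verb, []

    def record(self, primitive, amount):
        """Each commanded move while a sequence is open gets appended."""
        if self.open_verb is not None:
            self.buffer.append((primitive, amount))

    def finish(self):
        """'That is how you X': link the recorded sequence to the word."""
        self.known[self.open_verb] = list(self.buffer)
        self.open_verb = None

    def execute(self, verb, target, do_move):
        """Replay a learned verb relative to a newly resolved target."""
        for primitive, amount in self.known[verb]:
            do_move(primitive, amount, target)

rec = VerbRecorder()
rec.start("poke")
rec.record("point", 1.0); rec.record("out", 1.0); rec.record("out", -1.0)
rec.finish()
rec.execute("poke", "red object", lambda p, a, t: print(p, a, "->", t))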
Project Milestones
 Year 1 : Establishing the Language Framework (2011)
table-top environment with off-the-shelf arm / cameras / mics
Visual detection & identification of objects
Visual servoing of arm to grasp objects
Speech-based naming of objects
Speech-based learning of motion routines
 Year 2 : Extension to Application Scenario (2012)
port to mobile platform with on-board power & processing
Vision-based obstacle avoidance
Visual grounding for rooms / doors / furniture
Speech adaptation for different users & rooms
Speech-based place naming & fetch routines
Overcoming obstacles to widespread robotics
 Perception
Robots do not conceptualize the world as people do (e.g. what is an object?)
 Focus on nouns using partial scene segmentation
 Separate using depth boundaries and homogeneous regions (see the sketch at the end of this list)
 Recognize with interest points and bulk properties
 Programming
Hard to tell robots what to do short of C++ programming
 Use speech and (constrained) natural language
 Learn word associations to objects and places
 Simply remember spatial paths and action procedures
 Cost
Robots are too expensive for generic activities or personal use
 Substitute sensing and computation for precise mechanicals
 Use cameras only, not (low volume) special-purpose sensors
 Use graphics processors (GPU) instead of CPU when possible
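
As referenced under the Perception bullet above, separating objects at depth boundaries can be prototyped in a few lines: mask out pixels where the depth gradient jumps, then label the remaining connected regions as candidate objects. The sketch below does this on a synthetic scene; the threshold, the scene, and the single-cue segmentation (no homogeneous-region merging) are illustrative assumptions, not the project's vision code.

import numpy as np
from scipy import ndimage

def segment_by_depth(depth, jump=0.05):
    """Label connected regions whose internal depth varies smoothly.

    Pixels where the depth gradient exceeds 'jump' (meters per pixel) are treated
    as object boundaries; the remaining pixels are grouped into candidate objects.
    """
    gy, gx = np.gradient(depth)
    boundary = np.hypot(gx, gy) > jump
    labels, n = ndimage.label(~boundary)
    return labels, n

# synthetic scene: a flat table at 1.0 m with a box 0.2 m closer to the camera
depth = np.full((120, 160), 1.0)
depth[40:80, 60:110] = 0.8
labels, n = segment_by_depth(depth)
print(n, "regions found")          # -> 2 regions: the table surface and the box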