Far Reaching Research (FRR) Project
IBM Research
See, Hear, Do: Language and Robots
Jonathan Connell
Exploratory Computer Vision Group
Etienne Marcheret
Speech Algorithms & Engines Group
Sharath Pankanti (ECVG)
Josef Vopicka (Speech)
Challenge = Multi-modal instructional dialogs
Use speech, language, and vision to learn objects & actions
Innate perception abilities (objects / properties)
Innate action capabilities (navigation / grasping)
Easily acquire terms not knowable a priori
Example dialog (command following, verb learning, noun learning, advice taking):
Human: Round up my mug.
Robot: I don’t know how to “round up” your mug.
Human: Walk around the house and look for it. When you find it bring it back to me.
Robot: I don’t know what your “mug” looks like.
Human: It is like this <shows another mug> but sort of orange-ish.
Robot: OK … I could not find your mug.
Human: Try looking on the table in the living room.
Robot: OK … Here it is!
Language Learning & Understanding is an AAAI Grand Challenge
http://www.aaai.org/aitopics/pmwiki/pmwiki.php/AITopics/GrandChallenges#language
Eldercare as an application
Example tasks:
Pick up dropped phone
Get blanket from another room
Bring me the book I was reading yesterday
Large potential market
Many affluent societies have a demographic imbalance (Japan, EU, US)
Institutional care can be very expensive (to person, insurance, state)
A little help can go a long way
Can be supplied immediately (no waiting list for admission)
Allows person to stay at home longer (generally easier & less expensive)
Boosts independence and feeling of control (psychological advantage)
Note: We are not attempting to address the whole problem. Out of scope:
✗ Aggressive production cost containment
✗ Robust self-recharging and stair traversal
✗ Bathing and bathroom care, patient transfer, cooking
✗ OSHA, ADA, FDA, FCC, UL or CE certification
Novel approach: Linguistically-guided robots
Use language as the core of the operating system
not something tacked-on after-the-fact
Interface
Much easier than programming (textual or graphical)
More natural for unskilled users
Less effort for “one-off” activities
Interaction
Simple progress / error reporting (“I am entering the kitchen”)
Easy to request missing information (“Please tell me where X is located.”)
Clarification dialogs possible (“Which box did you want, red or blue?”)
Learning
Can direct attention to specific objects or areas (e.g. “this object”)
Can focus learning on relevant properties (e.g. color, location)
Less trial and error since richer feedback (i.e. faster acquisition)
ELI the robot
Power supply
528 Wh sealed lead-acid batteries
28 lbs for balancing counterweight
Estimate 4-5 hr run-time
Drive system
Two wheel differential steer
Two 4 in rear casters (blue)
47 in/sec (2.7 mph) top speed
Handles 10 deg slope, ½ in bumps
Motorized lift
For arm & sensors (offset 27 in up)
Floor to 36 in (counter) range
16 in / sec = 2.3 sec bottom to top
Computation
Platform for quad-core GPU laptop
Single USB cable for interface
Overall
About 65 lbs total weight
Stable +/- 10 degs any direction
15 in wide, 24 in long, 45-66 in tall
Joystick control video: Picking up a dropped object (eli_kitchen.wmv)
Speech interaction video: Far-field speech interpretation (eli_voice.wmv)
Detached Arm for dialog development
Hardware
– Single color camera 25 in above surface
– Arm = 3 positional DOF, Wrist = 3 angular DOF
– Gripper augmented with compliant closure
– Workspace = 2 ft wide, 1 ft deep, +8/-2 in high
Software
– Serial control code optimized
– Joint control via manual gamepad
– Inverse kinematic solver
[Figure: overhead camera and detached arm with OTC medications (Advil & Gaviscon) on the work surface]
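The software list above only names an inverse kinematic solver. As a rough sketch of what such a solver does for an arm with 3 positional DOF, the Python below rotates the base toward the target and recovers shoulder and elbow angles from the law of cosines. The link lengths, units, and elbow-down convention are illustrative assumptions, not the actual arm geometry.

```python
import math

# Hypothetical link lengths in inches; the real arm's geometry is not given on the slide.
L1, L2 = 9.0, 9.0

def ik_3dof(x, y, z):
    """Tiny inverse-kinematics sketch for a 3-DOF positional arm:
    rotate the base about the vertical axis, then solve the remaining
    2-link planar problem (law of cosines) for shoulder and elbow."""
    base = math.atan2(y, x)                  # base rotation toward the target
    r = math.hypot(x, y)                     # horizontal reach in the arm's plane
    d2 = r * r + z * z                       # squared distance shoulder-to-target
    cos_elbow = (d2 - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    if abs(cos_elbow) > 1.0:
        raise ValueError("target out of reach")
    elbow = math.acos(cos_elbow)             # elbow-down solution
    shoulder = math.atan2(z, r) - math.atan2(L2 * math.sin(elbow),
                                             L1 + L2 * math.cos(elbow))
    return base, shoulder, elbow

print([round(math.degrees(a), 1) for a in ik_3dof(10.0, 4.0, 6.0)])
```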
Speech manipulation video: Selecting and disambiguating objects (eli_table.wmv)
Dialog phenomena handled
“Grab it.” (1 object)
<grabs object>
no confusion since only 1 choice for “it”
“Grab it.” (4 objects)
“I'm confused. Which of the 4 things do you mean?”
knows a unique target is required
“What color is the object on the left?” (4 objects)
“It’s blue.”
understand positions & colors
“Grab it” (4 objects)
<grabs blue object>
uses “it” from previous interaction
“Grab that object” (human points)
<grabs object>
understands human gesture
“Grab the white thing.” (2 white objects)
“Do you mean this one?” <robot points>
uses gesture to suggest alternative
“No, the other one.”
<grabs other object>
uses “other” from previous interaction
“Grab the green thing.”
“Sorry, that’s too big for me.”
sensitive to physical constraints
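All of the exchanges above hinge on resolving referring expressions against the currently visible objects and the previous turn. The sketch below is a minimal, hypothetical illustration of that bookkeeping (“it”, “the other one”, and a clarification question when the referent is ambiguous); it is not the project’s actual dialog code.

```python
# Minimal sketch of the pronoun handling shown above: "it" refers back to the
# last unique referent, "the other one" excludes it, and an ambiguous "it"
# with several candidates triggers a clarification question.
# All names here are illustrative, not the project's actual API.

class DialogContext:
    def __init__(self):
        self.last_referent = None          # object chosen in the previous turn

    def resolve(self, phrase, visible):
        if phrase == "it":
            if self.last_referent in visible:
                return self.last_referent
            if len(visible) == 1:
                return visible[0]
            raise LookupError(f"I'm confused. Which of the {len(visible)} things do you mean?")
        if phrase == "the other one":
            others = [o for o in visible if o != self.last_referent]
            if len(others) == 1:
                return others[0]
            raise LookupError("Which other one?")
        return phrase                       # assume a full description handled elsewhere

    def commit(self, obj):
        self.last_referent = obj            # remember for the next "it"

ctx = DialogContext()
objs = ["blue box", "white box A", "white box B", "green box"]
try:
    ctx.resolve("it", objs)                 # 4 candidates -> clarification
except LookupError as e:
    print(e)
ctx.commit("blue box")
print(ctx.resolve("it", objs))              # -> blue box
```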
Noun Learning Scenario
Features:
Automatically finds objects
Selects by position, size, color
Understands user pointing
Robot points for emphasis
Grabs selected object
Passes object to/from user
Adds new nouns to grammar
Builds visual models
Identifies objects from models
eli_noun_sub.wmv
Multi-modal Dialog Script
1. “Eli, what is the object on the left?”
No existing visual model matches object
“I don’t know.”
2. “Eli, that is aspirin.”
New word added to grammar
New visual model for object
“Okay. This <points> is aspirin.”
3. “Eli, this object <points> is Advil.”
Word already known
New visual model for object
“Okay. That is Advil.”
Model = size + shape + colors
4. “Eli, how many Advil do you see?”
Matching = nearest neighbor
Uses existing visual model to find item(s)
dist = Σ w[i] * | v[i] – m[i] |
“I see two.”
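Step 4 states the matching rule directly: each learned noun keeps a visual model (size + shape + colors) and a detected object is labeled by the nearest model under the weighted L1 distance dist = Σ w[i] · |v[i] – m[i]|. The sketch below implements that rule; the feature layout, weights, and rejection threshold are made-up values for illustration.

```python
# Sketch of the matching rule on this slide: each learned noun stores a
# feature vector m (size + shape + color measures), and a detected object v
# is labeled with the nearest model under the weighted L1 distance
#   dist = sum_i w[i] * |v[i] - m[i]|
# Feature layout, weights, and threshold here are illustrative only.

def weighted_l1(v, m, w):
    return sum(wi * abs(vi - mi) for wi, vi, mi in zip(w, v, m))

def classify(v, models, w, thresh=1.0):
    """Return the best-matching noun, or None if nothing is close enough."""
    best, best_d = None, float("inf")
    for name, m in models.items():
        d = weighted_l1(v, m, w)
        if d < best_d:
            best, best_d = name, d
    return best if best_d <= thresh else None   # None -> "I don't know."

w = [1.0, 0.5, 2.0, 2.0]                         # e.g. size, elongation, hue, saturation
models = {"aspirin": [0.30, 1.2, 0.05, 0.10],
          "Advil":   [0.25, 1.0, 0.60, 0.80]}
print(classify([0.26, 1.1, 0.58, 0.75], models, w))   # -> Advil
```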
Multi-modal Dialog Script (continued)
5. “Eli, give me the Tylenol.”
Uses existing visual model to find item(s)
<gets bottle> “Here you go.”
Waits for user hand motion
<releases>
Waits for user hand motion
<regrabs bottle> “Thanks.”
<replaces bottle>
6. “Eli, where is the aspirin?”
Uses existing visual model to find item(s)
“Here.” <points>
Collaboration with Tokyo Research Lab
Principal researchers:
• Michiharu Kudoh
• Risa Nishiyama
“BRAINS” project goal:
Make the robot respond appropriately as if it understands social rules
[Architecture diagram: the Eli robot at Watson (Vision, ASR, Parser, Kinematics, Sequencer, and Reasoning, backed by Objects, Vocabulary, Visual models, Semantic memory, and Action models) connects over the network to the Brainy Response System at Tokyo (Talk, Archive, Retrieve, Lifelog). The robot sends context updates and receives vetoes and recommendations.]
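The diagram shows the robot sending context updates to the Tokyo system and getting back vetoes or recommendations, but no wire format is given. Below is a hypothetical sketch of that round trip, with invented field names and a local stand-in for the cloud side.

```python
import json

# Hypothetical message shapes for the Watson <-> Tokyo link in the diagram:
# the robot posts a context update (who, which action, which object) and the
# Brainy Response System replies with a veto or a recommendation.
# Field names are invented for illustration; the real protocol is not shown.

context_update = {
    "user": "Alice",
    "requested_action": "give",
    "object": "aspirin",
}

def brainy_response(message_text):
    """Stand-in for the Tokyo-side reasoning service."""
    update = json.loads(message_text)
    if update["user"] == "Alice" and update["object"] == "aspirin":
        return {"verdict": "veto", "reason": "But that will hurt your stomach."}
    return {"verdict": "ok", "recommendation": None}

reply = brainy_response(json.dumps(context_update))   # serialized as it would be on the wire
print(reply["verdict"], "-", reply.get("reason"))
```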
Combined Demo
Features:
Learns object names
Learns object appearances
Grabs and passes objects
Vetoes actions based on DB
Picks alternates using ontology
Checks for valid dose interval
Real-time cloud connection
eli_bottles_sub.wmv
Combined Demo Script
1. “Eli, this <points> object is aspirin.”
New word added to grammar
New visual model for object
“Okay. That is aspirin.”
2. “Eli, the object on the right is called Tums.”
Word already known
New visual model for object
“Okay. This <points> is Tums.”
3. “Eli, give me some aspirin.”
Uses existing visual model to find item(s)
Check against Alice’s personal database: aspirin → NO
“But that will hurt your stomach.”
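The veto in step 3 amounts to a per-user lookup before the handover. A minimal sketch, assuming a simple table of contraindicated medications keyed by user (the table contents are invented):

```python
# Sketch of the personal-database check in step 3: before handing an item
# over, the requested medication is looked up against the user's known
# contraindications and the action is vetoed with an explanation if it hits.

personal_db = {
    "Alice": {"aspirin": "hurt your stomach"},   # e.g. a known stomach problem
}

def check_request(user, medication):
    reason = personal_db.get(user, {}).get(medication)
    if reason is not None:
        return False, f"But that will {reason}."
    return True, None

print(check_request("Alice", "aspirin"))    # -> (False, 'But that will hurt your stomach.')
print(check_request("Alice", "Tylenol"))    # -> (True, None)
```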
Combined Demo Script (continued)
4. “Eli, give me some Tylenol instead.”
Uses existing visual model to find item(s)
<gets bottle> “Here you go.”
Waits for user hand motion
<releases>
Waits for user hand motion
<regrabs bottle> “Thanks.”
<replaces bottle>
Records dose in lifelog (lifelog history: 7:14 AM xxxxx, 8:39 AM zzzzz, 9:01 AM took Tylenol)
5. “Eli, give me some Rolaids.”
No visual model for item
“I don’t know what Rolaids looks like.”
Ontology used to find available alternative(s): antacid → Rolaids (requested), Tums (present)
“Do you want another antacid, Tums?”
6. “Eli, just give me some Tylenol.”
Uses existing visual model to find item(s)
Lifelog consulted for last dose
“You just had Tylenol.”
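Steps 4–6 combine three pieces of bookkeeping: doses are appended to the lifelog, an ontology of drug categories offers an available substitute for an unknown item, and the lifelog is consulted before a repeat dose. The sketch below illustrates all three under assumptions; the categories, the 4-hour minimum interval, and the log entries are invented.

```python
from datetime import datetime, timedelta

# Sketch of the lifelog + ontology behavior in steps 4-6. The drug categories,
# the 4-hour minimum interval, and the log contents are illustrative only.

ontology = {"Rolaids": "antacid", "Tums": "antacid", "Tylenol": "pain reliever"}
visible  = {"Tums", "Tylenol", "aspirin"}      # items in view with visual models
lifelog  = []                                  # list of (time, event) entries

def record_dose(med, when):
    lifelog.append((when, f"took {med}"))      # step 4: "Records dose in lifelog"

def suggest_alternative(requested):
    """Step 5: use the ontology to offer an available item of the same category."""
    cat = ontology.get(requested)
    for item in sorted(visible):
        if item != requested and ontology.get(item) == cat:
            return f"Do you want another {cat}, {item}?"
    return f"I don't know what {requested} looks like."

def dose_ok(med, now, min_gap=timedelta(hours=4)):
    """Step 6: consult the lifelog for the last dose of this medication."""
    last = [t for t, event in lifelog if event == f"took {med}"]
    return not last or now - max(last) >= min_gap

record_dose("Tylenol", datetime(2012, 5, 14, 9, 1))
print(suggest_alternative("Rolaids"))                     # -> "Do you want another antacid, Tums?"
print(dose_ok("Tylenol", datetime(2012, 5, 14, 9, 30)))   # -> False, i.e. "You just had Tylenol."
```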
Verb Learning Scenario
Features:
Handles relative motion commands
Responds to incremental positioning
Learns action sequences
Applies new actions to other objects
eli_verb_sub.wmv
Verb Learning Script
1. “Eli, poke the thing in the middle.”
Resolves visual target based on position
No existing action sequence to link
New action sequence opened for input
(action sequence being recorded for “poke”: point 1.0 → out 1.0 → out -1.0)
“I don’t know how to poke something.”
2. “Eli, point at it.”
Resolves pronoun from previous selection
Moves relative to visual target
<points>
3. “Eli, extend your hand.”
Low level incremental move
<advances>
4. “Eli, retract your hand.”
Low level incremental move
<retreats>
Verb Learning Script (continued)
5. “Eli, that is how you poke something.”
Recognizes closing of action block
Links action sequence to word
“Okay. Now I know how to poke something.”
Stored in DB: “poke” = point 1.0 → out 1.0 → out -1.0
6. “Eli, poke the red object.”
Resolves visual target based on color
Retrieves action sequence for verb and executes
<pokes>
7. “Eli, poke the object on the left.”
Resolves visual target based on position
Retrieves action sequence for verb and executes
<pokes>
8. “Eli, poke the Tylenol.”
Resolves visual target based on known object model
Retrieves action sequence for verb and executes
<pokes>
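The whole script boils down to recording primitive moves while a verb is being taught, binding the finished sequence to the word, and replaying it on whatever target later commands resolve. The class below is an illustrative sketch of that mechanism; primitive names mirror the slide, but nothing here is the project’s actual code.

```python
# Sketch of the verb-learning mechanism in the script above: while a new verb
# is "open", each primitive move is recorded; "that is how you ..." closes the
# block and stores the sequence, which later commands replay on any resolved
# target. Primitive names mirror the slide; the class itself is illustrative.

class VerbLearner:
    def __init__(self):
        self.known = {}            # verb -> list of (primitive, amount)
        self.recording = None      # verb currently being taught
        self.trace = []

    def start(self, verb):
        self.recording, self.trace = verb, []
        return f"I don't know how to {verb} something."

    def primitive(self, name, amount):
        if self.recording is not None:
            self.trace.append((name, amount))   # record while teaching is open

    def finish(self):
        verb, self.recording = self.recording, None
        self.known[verb] = list(self.trace)     # e.g. poke = point 1.0, out 1.0, out -1.0
        return f"Okay. Now I know how to {verb} something."

    def execute(self, verb, target):
        for name, amount in self.known[verb]:
            print(f"{name}({amount}) relative to {target}")

v = VerbLearner()
print(v.start("poke"))
v.primitive("point", 1.0); v.primitive("out", 1.0); v.primitive("out", -1.0)
print(v.finish())
v.execute("poke", "the red object")
```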
Project Milestones
Year 1 : Establishing the Language Framework (2011)
table-top environment with off-the-shelf arm / cameras / mics
Visual detection & identification of objects
Visual servoing of arm to grasp objects
Speech-based naming of objects
Speech-based learning of motion routines
Year 2 : Extension to Application Scenario (2012)
port to mobile platform with on-board power & processing
Vision-based obstacle avoidance
Visual grounding for rooms / doors / furniture
Speech adaptation for different users & rooms
Speech-based place naming & fetch routines
Overcoming obstacles to widespread robotics
Perception
Robots do not conceptualize world as people do (e.g. what is an object?)
Focus on nouns using partial scene segmentation
Separate using depth boundaries and homogeneous regions (see the sketch at the end of this slide)
Recognize with interest points and bulk properties
Programming
Hard to tell robots what to do short of C++ programming
Use speech and (constrained) natural language
Learn word associations to objects and places
Simply remember spatial paths and action procedures
Cost
Robots are too expensive for generic activities or personal use
Substitute sensing and computation for precise mechanicals
Use cameras only, not (low volume) special-purpose sensors
Use graphics processors (GPU) instead of CPU when possible
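Referring back to the Perception bullets: the sketch below shows the depth-boundary half of the idea on a synthetic depth image, grouping pixels with no depth discontinuity into connected regions so each object candidate gets its own label. The data, threshold, and use of scipy are assumptions for illustration; a real system would add the color-homogeneity and interest-point cues.

```python
import numpy as np
from scipy import ndimage

# Sketch of "separate using depth boundaries": pixels whose local depth
# gradient is small are grouped into connected components, so each smooth
# surface (object candidate) gets its own label. The synthetic depth image
# and the 5 cm threshold are illustrative only.

depth = np.full((60, 80), 100.0)       # background plane at 100 cm
depth[20:40, 10:30] = 70.0             # one object 30 cm closer
depth[25:45, 50:70] = 80.0             # another object

gy, gx = np.gradient(depth)
boundary = np.hypot(gx, gy) > 5.0      # large depth jump = object boundary
labels, n = ndimage.label(~boundary)   # connected smooth regions

print(f"{n} regions found")            # background + 2 object candidates
sizes = ndimage.sum(np.ones_like(depth), labels, index=range(1, n + 1))
print(sizes)                           # bulk property: pixel area per region
```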