Confucius: an intelligent multimedia interpretation & presentation

CONFUCIUS:
an Intelligent MultiMedia storytelling
interpretation & presentation system
Minhua Eunice Ma
Supervisor: Prof. Paul Mc Kevitt
School of Computing and Intelligent Systems
Faculty of Informatics
University of Ulster, Magee
Objectives of CONFUCIUS
• To interpret natural language story and movie (drama) script input and to extract conceptual semantics from the natural language
• To generate 3D animation and virtual worlds automatically from natural language
• To integrate 3D animation with speech and non-speech audio, to form an intelligent multimedia storytelling system for presenting multimodal stories
CONFUCIUS’ context diagram
Storywriter/playwright → Movie/drama script → CONFUCIUS → 3D animation → User/story listener
Previous systems
• Schank’s CD Theory (1972)
  - Primitives & scripts
  - SAM & PAM
• Automatic Text-to-Graphics Systems
  - WordsEye (Coyne & Sproat, 2001)
  - ‘Micons’ and CD-based language animation (Narayanan et al. 1995)
  - Spoken Image (Ó Nualláin & Smith, 1994) & its successor SONAS (Kelleher et al. 2000)
• MultiModal interactive storytelling
  - AesopWorld
  - KidsRoom
  - Larsen & Petersen’s Interactive Storytelling
  - Oz
  - Computer games
Virtual humans & embodied agents
• BEAT (Cassell et al., 2000)
• Jack (University of Pennsylvania)
• Improv (Perlin and Goldberg, 1996)
• SimHuman
• Gandalf
• PPP persona
Architecture of CONFUCIUS
[Figure: architecture of CONFUCIUS]
Input: natural language stories; Script writer; Script parser
Knowledge base (prefabricated objects): language knowledge (lexicon, grammar, etc.); visual knowledge (3D graphic library, drawing on 3D authoring tools, existing 3D models & character models); mapping
Processing modules: Natural Language Processing (semantic representations); Animation generation (visual knowledge); Text To Speech; Sound effects
Output: Synchronizing & fusion; 3D world with audio in VRML
Semantic representations
Categories
(1) general knowledge representation & reasoning
(2) physical knowledge representation & reasoning (inc. spatial/temporal reasoning)
Knowledge representations
• rule-based representation
• FOPC (First Order Predicate Calculus)
• semantic networks
• Schank’s scripts
• frame-based representations
• XML-based representations
• Conceptual Dependency (CD)
• event-logic truth conditions
• x-schema and f-structure
• Decomposition: Jackendoff’s Lexical-Conceptual Semantics (LCS), decomposite predicate-argument representation
Typical applications
• expert systems
• sentence representation, expert systems
• lexical semantics
• story understanding
• multimodal semantics
• dynamic vision (movement) recognition & generation
MultiModal semantic representation
[Figure: levels of multimodal semantic representation]
High level: multimodal semantics, a media-independent, XML/frame-based representation
Intermediate level: visual media-dependent representation; audio media-dependent representation
Modalities: visual modality; language modality; non-speech audio modality
Mental imagery & meaning processing
[Figure: mental imagery & meaning processing]
Content: meanings, communicable ideas, thoughts, manifestable messages, proverbs, examples, parables, etc.
Worlds: mental world; physical world; virtual world
Processes: communication (simulation: presentation via language or other modalities); cognition (simulation: image recognition); re-cognition (simulation: language understanding)
Knowledge base of CONFUCIUS
knowledge base
• Language knowledge
  - Semantic knowledge: lexicons (e.g. WordNet)
  - Syntactic knowledge: grammars
  - Statistical models of language
  - Associations between words
• Visual knowledge
  - Object model (nouns)
  - Event model (event verbs; describes the motion of objects)
  - Functional information
  - Internal coordinate axes (for spatial reasoning)
  - Associations between objects
• World knowledge
• Spatial & qualitative reasoning knowledge
Graphic library
• objects/props: simple geometry files
• characters: geometry & joint hierarchy files
• motions: animation library (key frames)
• instantiation
Data Flow Diagram
[Figure: data flow diagram]
Processes: Script writer; Script parser; Natural language processor; Animation generator; TTS; Sound effect driver; Media coordination
Data stores: Primitives library; Music library
Data flows: story; script; scene & actor descriptions; dialogues; non-speech audio script; visual semantics; VRML without sound nodes; synthesized animation
Animation generator
[Figure: flowchart of the animation generator]
LCS representation → verb semantic analysis → match basic motions in library? (Y/N)
If no match: use lexical relations (WordNet) to replace synonyms, scripts application, etc.
Subsequent stages: motion decomposition; animation controller; motion instantiation; environment placement
Output: VRML format of the virtual story world
examples demo
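The lookup step in the flowchart above can be sketched in Python. This is only one plausible reading of the diagram, not the system's actual code: the motion library contents, the synonym table (a stand-in for WordNet lexical relations), the decomposition table, and the function name `plan_motions` are all hypothetical.

```python
# Hypothetical stand-ins: none of these tables come from CONFUCIUS itself.
MOTION_LIBRARY = {"walk", "jump", "turn"}            # basic motions with key frames
SYNONYMS = {"stroll": "walk", "leap": "jump"}        # stand-in for WordNet lookups
DECOMPOSITIONS = {"pace": ["walk", "turn", "walk"]}  # verb -> sequence of basic motions


def plan_motions(verb):
    """Return the list of basic motions that could animate `verb`."""
    # 1. Direct match against the basic motions in the library.
    if verb in MOTION_LIBRARY:
        return [verb]
    # 2. No match: try replacing the verb with a synonym that is in the library.
    substitute = SYNONYMS.get(verb)
    if substitute in MOTION_LIBRARY:
        return [substitute]
    # 3. Otherwise decompose the verb into basic motions, if a recipe is known.
    parts = DECOMPOSITIONS.get(verb)
    if parts and all(part in MOTION_LIBRARY for part in parts):
        return list(parts)
    raise LookupError(f"no visual definition found for verb {verb!r}")
```

The resulting motion list would then be handed to the later stages (motion instantiation and environment placement), which are not modelled here.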
Categories of events
Atomic entities
• Change physical location such as position and orientation, e.g. “bounce”, “turn”
• Change intrinsic attributes such as shape, size, color, and texture, e.g. “bend”, and even visibility, e.g. “disappear”, “fade” (in/out)
Non-atomic entities
• Non-character events
  - Two or more individual objects fuse together, e.g. “melt” (in)
  - One object divides into two or more individual parts, e.g. “break” (into pieces)
  - Change sub-components (their position, size, color), e.g. “blossom”
  - Environment events (weather verbs), e.g. “snow”, “rain”
• Character events
  - Action verbs: intransitive verbs; transitive verbs
  - Non-action verbs (stative, emotion, possession, mental activities, cognition & perception)
  - Idioms & metaphor verbs
Categories of action verbs
• Intransitive verbs
  - Biped kinematics, e.g. “walk”, “swim”, & other motion models like “fly”
  - Facial expressions, e.g. “laugh”, “anger”
  - Lip movement, e.g. “speak”, “say” (involve speech modality)
• Transitive verbs
  - single object, e.g. “throw”, “push”, “kick”
  - multiple objects
    · direct and indirect objects, e.g. “give”, “pass”, “show”
    · indirect object & the instrument, e.g. “cut”, “hammer”
Visual definition & word sense
[Figure: mapping from verbs to word senses to visual definition entries]
verb (many) to word sense (many): polysemy and synonymy
word sense (one) to visual definition entry (many): mapping
Example: “close” (a door)
1. a normal door (rotation on y axis)
2. a sliding door (moving on x axis)
3. a rolling shutter door (a combination of rotation on x axis and moving on y axis)
word sense: the minimal complete unit of meaning in the language modality
visual definition entry: the minimal complete unit of meaning in the visual modality
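As a rough illustration of the one-to-many mapping from a word sense to visual definition entries, the “close” (a door) example can be written as a small lookup. The class `VisualDefinition`, its fields, and the helper `visual_definition_for` are hypothetical names chosen for this sketch; the slides do not describe how CONFUCIUS stores these entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualDefinition:
    """One visual definition entry: how a word sense looks for one object type."""
    object_type: str
    motions: tuple  # (kind, axis) pairs, e.g. ("rotate", "y")


# One word sense, "close (a door)", mapped to three visual definition entries.
CLOSE_DOOR = [
    VisualDefinition("normal door", (("rotate", "y"),)),
    VisualDefinition("sliding door", (("translate", "x"),)),
    VisualDefinition("rolling shutter door", (("rotate", "x"), ("translate", "y"))),
]


def visual_definition_for(entries, object_type):
    """Pick the entry matching the object's type, or None if nothing matches."""
    for entry in entries:
        if entry.object_type == object_type:
            return entry
    return None
```

Selecting by object type is an assumption of this sketch: it reflects the idea that the same word sense needs a different animation depending on which kind of door is in the scene.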
Troponyms &
verbs derived from adjectives/nouns
• Troponyms
  - elaborate the manner of a base verb (Fellbaum 1998)
  - examples: “trot”-“walk” (fast), “gulp”-“eat” (quickly)
  - base verb + adverb: present the base verb + modify the manner (speed, the agent’s state, duration of the activity, iteration, etc.)
• Verbs derived from adjectives or nouns
  - change objects’ properties (size, color, shape) or the world state
  - verbs with affixes such as -en, -ify, or -ize, e.g. “lengthen”
  - using predicates scale(), squash() or changing the corresponding property fields of the object in VRML
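The "base verb + adverb" treatment of troponyms above can be sketched as a table lookup followed by a manner adjustment. Everything here is illustrative: the table `TROPONYMS`, the numeric speed factor, and both function names are hypothetical, and speeding up a motion by shortening an animation cycle is an assumed strategy rather than one stated in the slides.

```python
# Hypothetical table: troponym -> (base verb, manner modifiers).
# The speed factor 2.0 is an arbitrary illustrative value.
TROPONYMS = {
    "trot": ("walk", {"speed": 2.0}),
    "gulp": ("eat", {"speed": 2.0}),
}


def resolve_troponym(verb):
    """Return (base verb, manner); verbs not in the table are their own base."""
    base, manner = TROPONYMS.get(verb, (verb, {}))
    return base, dict(manner)


def adjusted_cycle_interval(base_interval, manner):
    """Shorten the animation cycle in proportion to the requested speed."""
    return base_interval / manner.get("speed", 1.0)
```

For instance, `resolve_troponym("trot")` yields the base verb "walk" plus a speed modifier, so the walking animation could be reused with a shorter cycle.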
Representing active & passive voice
• active and passive voice
• converse verb pairs such as “give/take”, “buy/sell”, “lend/borrow”
• same activity from different points of view
• use of VRML Viewpoint node
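One possible way to treat converse verb pairs as the same activity seen from different points of view is to reduce both verbs to a single canonical event and record whose perspective the sentence takes. The table `CONVERSE_OF`, the function `canonical_event`, and the `viewpoint` key (standing in for where a VRML Viewpoint node might be placed) are all assumptions of this sketch.

```python
# Hypothetical table mapping a verb to the converse verb used as canonical form.
CONVERSE_OF = {"take": "give", "buy": "sell", "borrow": "lend"}


def canonical_event(verb, subject, other):
    """Describe the activity once, keeping the sentence subject as viewpoint."""
    if verb in CONVERSE_OF:
        # "A takes X from B" is the same activity as "B gives X to A".
        return {
            "verb": CONVERSE_OF[verb],
            "agent": other,
            "recipient": subject,
            "viewpoint": subject,
        }
    return {"verb": verb, "agent": subject, "recipient": other, "viewpoint": subject}
```

Under this sketch, "Mary takes a book from John" and "John gives a book to Mary" share one animation and differ only in the viewpoint character.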
Implementation: semantics → VRML
Example: “A ball is bouncing”
(a) Visual definition of “bounce”
bounce(ball):[moveTo(ball, [0,0,0]),
              moveTo(ball, [0,20,0])]L.
(b) VRML code of a static ball
DEF ball Transform {
  translation 0 0 0
  children [
    Shape {
      appearance Appearance {
        material Material {}
      }
      geometry Sphere {
        radius 5
      }
    }
  ]
}
(c) Output: VRML code of a bouncing ball
DEF ball Transform {
  translation 0 0 0
  children [
    DEF ball-TIMER TimeSensor {
      loop TRUE
      cycleInterval 0.5
    },
    DEF ball-POS-INTERP PositionInterpolator {
      key [0, 0.5, 1]
      keyValue [0 0 0, 0 20 0, 0 0 0]
    },
    Shape {
      appearance Appearance {
        material Material {}
      }
      geometry Sphere { radius 5 }
    }
  ]
  ROUTE ball-TIMER.fraction_changed TO ball-POS-INTERP.set_fraction
  ROUTE ball-POS-INTERP.value_changed TO ball.set_translation
}
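The step from the visual definition of “bounce” to the bouncing-ball VRML can be sketched as a small generator. This is a minimal Python illustration only; the slides state that CONFUCIUS itself uses Java for changing VRML code. The function names `fmt` and `animate_vrml`, their parameters, and the reading of the trailing `L` as "loop back to the first position" are assumptions.

```python
def fmt(vec):
    """Format a 3D vector as space-separated VRML numbers."""
    return " ".join(f"{c:g}" for c in vec)


def animate_vrml(name, positions, cycle_interval=0.5, radius=5):
    """Emit a VRML97 Transform moving a sphere through `positions` in a loop."""
    # Return to the first position so each loop cycle ends where it started.
    path = list(positions) + [positions[0]]
    last = len(path) - 1
    keys = ", ".join(f"{i / last:g}" for i in range(len(path)))
    values = ", ".join(fmt(p) for p in path)
    lines = [
        f"DEF {name} Transform {{",
        f"  translation {fmt(path[0])}",
        "  children [",
        f"    DEF {name}-TIMER TimeSensor {{ loop TRUE cycleInterval {cycle_interval:g} }},",
        f"    DEF {name}-POS-INTERP PositionInterpolator {{",
        f"      key [{keys}]",
        f"      keyValue [{values}]",
        "    },",
        "    Shape {",
        "      appearance Appearance { material Material {} }",
        f"      geometry Sphere {{ radius {radius:g} }}",
        "    }",
        "  ]",
        f"  ROUTE {name}-TIMER.fraction_changed TO {name}-POS-INTERP.set_fraction",
        f"  ROUTE {name}-POS-INTERP.value_changed TO {name}.set_translation",
        "}",
    ]
    return "\n".join(lines)
```

Calling `animate_vrml("ball", [(0, 0, 0), (0, 20, 0)])` produces the same key and keyValue lists as the bouncing-ball listing: the two `moveTo` targets become evenly spaced key frames, with the start position repeated at the end.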
Categories of adjectives
Visually observable
• Objects’ attributes/states: dark/light, large/small, big/little, white/black (color adj.), long/short, new/old, high/low, full/empty, open/closed
• Observable human attributes
  - Feelings: happy/sad, angry, excited, surprised, terrified
  - Others: old/young, beautiful/ugly, strong/weak, poor/rich, fat/thin
• Relational adj.: nasal (nose), mural (wall), dental (teeth)
Visually unobservable
• Perceivable by other modalities: wet/dry, warm/cold, coarse/smooth, hard/soft, heavy/light
• Abstract attributes
  - Unobservable human attributes (virtue): good/evil, kind, mean, ambitious
  - Others: easy/difficult, real, important, particular, right/wrong, early/late
• Reference-modifying adj.: possible/impossible, former, past/present, last, other, different/same
Software Analysis
• Java programming language
  - parsing intermediate representation
  - changing VRML code to create/modify animation
  - integrating modules
• Natural language processing tools
  - GATE (pre-processing)
  - PC-PARSE (morphological and syntactic analysis)
  - WordNet (lexicon, semantic inference)
• 3D graphic modelling
  - existing 3D models on the Internet
  - 3D Studio Max (props & stage)
  - VRML (Virtual Reality Modelling Language) 97, H-Anim 2001 spec.
• The Actors – using embodied agents
  - Microsoft Agent (the narrator and minor actors)
  - Character Studio, Internet Character Animator (protagonists)
Natural Language Processing
[Figure: natural language processing components]
Pre-processing
PC-PARSER: part-of-speech tagger; syntactic parser; morphological parser (lexicon & morphological rules; features)
Semantic inference (WordNet 1.6)
Coreference resolution
Temporal reasoning
Contribution & prospective applications
Contributions
• multimodal semantic representation of natural language
• automatic animation generation
• multimodal fusion and coordination
Prospective applications
• Children’s education
• Multimedia presentation
• Movie/drama production
• Script writing
• Computer games
• Virtual Reality
Conclusion
The objectives of CONFUCIUS address the challenging problems in language visualisation:
• formalizing the meaning of action verbs and states
• mapping language primitives to visual primitives
• building a reusable ‘common sense’ knowledge base for other systems
• sophisticated spatial and temporal reasoning
• representing stories by temporal multimedia, which requires significant coordination