Confucius: an intelligent multimedia interpretation & presentation
CONFUCIUS:
an Intelligent MultiMedia storytelling
interpretation & presentation system
Minhua Eunice Ma
Supervisor: Prof. Paul Mc Kevitt
School of Computing and Intelligent Systems
Faculty of Informatics
University of Ulster, Magee
Objectives of CONFUCIUS
To interpret natural language story and movie
(drama) script input and to extract conceptual
semantics from the natural language
To generate 3D animation and virtual worlds
automatically from natural language
To integrate 3D animation with speech and non-speech audio, to form an intelligent multimedia
storytelling system for presenting multimodal
stories
CONFUCIUS’ context diagram
Storywriter/playwright
Movie/drama script
CONFUCIUS
3D animation
User/story listener
Previous systems
Schank’s CD Theory (1972)
Primitive & scripts
SAM & PAM
Automatic Text-to-Graphics Systems
WordsEye (Coyne & Sproat, 2001)
‘Micons’ and CD-based language animation
(Narayanan et al. 1995)
Spoken Image (Ó Nualláin & Smith, 1994)
& its successor SONAS (Kelleher et al.
2000)
MultiModal interactive storytelling
AesopWorld
KidsRoom
Larsen & Petersen’s Interactive Storytelling
Oz
Computer games
Virtual humans & embodied agents
BEAT (Cassell et al., 2000)
Jack (University of Pennsylvania)
Improv (Perlin and Goldberg, 1996)
SimHuman
Gandalf
PPP persona
Architecture of CONFUCIUS
Natural language stories
Script writer
Script parser
Prefabricated objects
(knowledge base)
Language knowledge
mapping
3D authoring tools,
existing 3D models &
character models
visual knowledge
(3D graphic library)
lexicon
grammar
etc
Natural
Language
Processing
Text To
Speech
Sound
effects
semantic
representations
visual
knowledge
Animation
generation
Synchronizing & fusion
3D world with audio in VRML
Semantic representations
Categories
(1) general
knowledge
representation &
reasoning
Knowledge representations
rule-based representation
Typical applications
expert systems
FOPC
(First Order Predicate Calculus)
semantic networks
sentence representation,
expert systems
lexical semantics
Schank’s scripts
story understanding
frame-based representations
XML-based representations
Conceptual Dependency (CD)
(2) physical
knowledge
representation &
reasoning (inc.
spatial /temporal
reasoning)
Decomposition
event-logic truth conditions
multimodal semantics
x-schema and f-structure
Jackendoff’s Lexical-Conceptual
Semantics (LCS)
decomposite predicate-argument
representation
dynamic vision (movement)
recognition & generation
MultiModal semantic representation
High-level multimodal
semantic representation:
XML/frame-based
Multimodal semantics
Media-independent representation
Visual media-dependent representation
Intermediate level
Visual modality
Audio media-dependent representation
Language modality
Non-speech audio modality
Mental imagery & meaning processing
Meanings, communicable ideas,
thoughts, manifestable
messages, proverbs, examples,
parables, etc.
Simulation:
presentation via language or other modalities
Mental world
Communication
Simulation:
Image recognition
Cognition
Physical world
Mental world
Simulation:
Language understanding
Re-cognition
Virtual world
Knowledge base of CONFUCIUS
knowledge base
Language knowledge
Visual knowledge
Semantic knowledge - lexicons (e.g. WordNet)
Syntactic knowledge - grammars
Statistical models of language
Associations between words
Object model (nouns)
Event model (event verbs, describes the motion of objects)
Functional information
Internal coordinate axes (for spatial reasoning)
Associations between objects
World knowledge
Spatial & qualitative reasoning knowledge
Graphic library
objects/props
characters
Simple geometry files
geometry & joint hierarchy
files
instantiation
motions
animation library
(key frames)
Data Flow Diagram
Primitives library
Natural
language
processor
Visual
semantics
Animation
generator
VRML without sound nodes
Scene & Actor descriptions
dialogues
script
Script
parser
Non-speech audio
script
story
TTS
Sound effect
driver
Script
writer
Music library
Media
coordination
Synthesized
animation
Animation generator
LCS representation
verb
semantic analysis
match basic motions
in library?
use lexical relations (WordNet)
to replace synonyms, scripts
application, etc.
Y
N
motion
decomposition
animation controller
motion
instantiation
environment
placement
VRML format of the virtual story world
examples demo
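The lookup step in the flow above can be sketched as follows. This is a minimal Python illustration, not the system's own code (the slides name Java as the implementation language). The table contents, the function name `resolve_motion`, and the order in which the fallbacks are tried are all assumptions made for the example; the synonym table merely stands in for WordNet lexical relations.

```python
# Hypothetical library of basic motions that have ready-made animations.
MOTION_LIBRARY = {"walk", "jump", "turn"}

# Hypothetical synonym table standing in for WordNet lexical relations.
SYNONYMS = {"stroll": "walk", "leap": "jump"}

# Hypothetical decompositions of complex verbs into basic motions.
DECOMPOSITIONS = {"pace": ["walk", "turn", "walk"]}


def resolve_motion(verb):
    """Return the list of basic motions that realise `verb`, or None."""
    # "Y" branch: the verb matches a basic motion directly.
    if verb in MOTION_LIBRARY:
        return [verb]
    # "N" branch: try to replace the verb with a synonym that is in the library.
    synonym = SYNONYMS.get(verb)
    if synonym in MOTION_LIBRARY:
        return [synonym]
    # Otherwise try to decompose the verb into a sequence of basic motions.
    parts = DECOMPOSITIONS.get(verb)
    if parts and all(part in MOTION_LIBRARY for part in parts):
        return parts
    return None
```

A verb that cannot be resolved returns `None`, leaving it to later stages (for example script application) to decide how to present it.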
Categories of events
Atomic entities
Change physical location such as position and orientation, e.g. “bounce”, “turn”
Change intrinsic attributes such as shape, size, color, and texture, e.g. “bend”,
and even visibility, e.g. “disappear”, “fade” (in/out)
Non-atomic entities
Non-character events
Two or more individual objects fuse together, e.g. “melt” (in)
One object divides into two or more individual parts, e.g. “break” (into
pieces)
Change sub-components (their position, size, color), e.g. “blossom”
Environment events (weather verbs), e.g. “snow”, “rain”
Character events
Action verbs
Intransitive verbs
Transitive verbs
Non-action verbs (stative, emotion, possession, mental activities,
cognition & perception)
Idioms & metaphor verbs
Categories of action verbs
Intransitive verbs
Biped kinematics, e.g. “walk”, “swim”, & other motion models
like “fly”
Facial expressions, e.g. “laugh”, “anger”
involve speech modality
Lip movement, e.g. “speak”, “say”
Transitive verbs
single object, e.g. “throw”, “push”, “kick”
multiple objects
direct and indirect objects, e.g. “give”, “pass”, “show”
indirect object & the instrument, e.g. “cut”, “hammer”
Visual definition & word sense
verb to word sense: many-to-many (polysemy, synonymy)
word sense to visual definition entry: one-to-many (mapping)
Example: “close” (a door)
1. a normal door (rotation on y axis)
2. a sliding door (moving on x axis)
3. a rolling shutter door (a combination of rotation on x axis and moving on y axis)
word sense -- minimal complete unit of meaning in
the language modality
visual definition entry -- minimal complete unit of
meaning in the visual modality
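The “close” example can be written down as a small lookup table. The sketch below is illustrative Python, not the system's own representation; the tuple encoding of motions and the names `VISUAL_DEFINITIONS` and `visual_definition` are assumptions introduced here.

```python
# One word sense of "close" mapped to three visual definition entries.
# Each entry lists (motion, axis) pairs taken from the door example.
VISUAL_DEFINITIONS = {
    ("close", "normal door"): [("rotate", "y")],
    ("close", "sliding door"): [("translate", "x")],
    ("close", "rolling shutter door"): [("rotate", "x"), ("translate", "y")],
}


def visual_definition(verb, object_type):
    """Pick the visual definition entry for a verb applied to an object type."""
    return VISUAL_DEFINITIONS[(verb, object_type)]
```

Keying the table on both the verb and the object type reflects the point of the slide: the word sense alone does not determine the animation.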
Troponyms &
verbs derived from adjectives/nouns
troponym
elaborates the manners of a base verb (Fellbaum 1998)
examples: “trot”-“walk” (fast), “gulp”-“eat” (quickly)
base verb + adverb
present the base verb + modify the manner (speed, the agent’s state,
duration of the activity, iteration, etc.)
Verbs derived from adjectives or nouns
change objects’ properties (size, color, shape) or the world
state
verbs with affixes such as -en, -ify, or -ize, e.g. “lengthen”
using predicates scale(), squash() or changing the
corresponding property fields of the object in VRML
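Both ideas on this slide can be sketched in a few lines. The Python below is only an illustration: the function names, the choice of the x axis for “lengthen”, and the default factor of 1.5 are assumptions, and `scale` here is a plain function rather than the system's own predicate.

```python
# Troponyms from the slide: troponym -> (base verb, manner adverb).
TROPONYMS = {"trot": ("walk", "fast"), "gulp": ("eat", "quickly")}


def expand_troponym(verb):
    """Return (base_verb, manner); manner is None for a plain base verb."""
    return TROPONYMS.get(verb, (verb, None))


def scale(size, factors):
    """Multiply an (x, y, z) size by per-axis factors."""
    return tuple(s * f for s, f in zip(size, factors))


def lengthen(size, factor=1.5):
    """Assumed reading of "lengthen": stretch along the x axis only."""
    return scale(size, (factor, 1.0, 1.0))
```

The manner returned by `expand_troponym` would then adjust how the base verb's animation is played, for instance its speed or duration.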
Representing active & passive voice
active and passive voice
converse verb pairs such as “give/take”,
“buy/sell”, “lend/borrow”
same activity from different point of view
use of VRML Viewpoint node
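One way to treat converse verb pairs as the same activity is to rewrite one member of each pair into the other and swap the participants. The Python sketch below is an assumption-laden illustration: the choice of “give”, “sell” and “lend” as canonical forms, the function name, and the rule that the viewpoint follows the sentence subject are not stated on the slide.

```python
# Converse pairs from the slide, keyed by the verb to be rewritten.
# Treating "give", "sell" and "lend" as canonical is an assumption.
CONVERSE = {"take": "give", "buy": "sell", "borrow": "lend"}


def canonical_event(verb, subject, other):
    """Return (verb, source, target, viewpoint) for a transfer event."""
    if verb in CONVERSE:
        # The subject of "take", "buy" or "borrow" is the receiving party.
        return (CONVERSE[verb], other, subject, subject)
    return (verb, subject, other, subject)
```

Under this sketch, “Mary takes a book from John” and “John gives a book to Mary” share one animation and differ only in the viewpoint value, which could select a VRML Viewpoint node.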
Implementation: semantics to VRML
Example: “A ball is bouncing”
(a) Visual definition of “bounce”
bounce(ball):[moveTo(ball, [0,0,0]),
moveTo(ball,[0,20,0])]L.
(b) VRML code of a static ball
DEF ball Transform {
  translation 0 0 0
  children [
    Shape {
      appearance Appearance {
        material Material {}
      }
      geometry Sphere {
        radius 5
      }
    }
  ]
}
(c) Output VRML code of a bouncing ball
DEF ball Transform {
  translation 0 0 0
  children [
    DEF ball-TIMER TimeSensor {
      loop TRUE
      cycleInterval 0.5
    },
    DEF ball-POS-INTERP PositionInterpolator {
      key [0, 0.5, 1]
      keyValue [0 0 0, 0 20 0, 0 0 0]
    },
    Shape {
      appearance Appearance {
        material Material {}
      }
      geometry Sphere { radius 5 }
    }
  ]
  ROUTE ball-TIMER.fraction_changed TO ball-POS-INTERP.set_fraction
  ROUTE ball-POS-INTERP.value_changed TO ball.set_translation
}
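The step from the visual definition in (a) to the animated VRML in (c) can be sketched as a small generator. The Python below is an illustration only, not the system's Java code: the function name, the even spacing of keys, and the decision to loop by returning to the first waypoint are assumptions chosen so that the ball example reproduces the key and keyValue fields shown in (c).

```python
def position_interpolator(name, waypoints, cycle_interval=0.5):
    """Build TimeSensor, PositionInterpolator and ROUTE text for a looping path."""
    # Return to the first waypoint so the motion can repeat seamlessly.
    path = list(waypoints) + [waypoints[0]]
    steps = len(path) - 1
    # Keys are spread evenly over the animation fraction 0..1.
    keys = ", ".join(f"{i / steps:g}" for i in range(steps + 1))
    values = ", ".join(" ".join(f"{c:g}" for c in point) for point in path)
    return "\n".join([
        f"DEF {name}-TIMER TimeSensor {{ loop TRUE cycleInterval {cycle_interval:g} }}",
        f"DEF {name}-POS-INTERP PositionInterpolator {{",
        f"  key [{keys}]",
        f"  keyValue [{values}]",
        "}",
        f"ROUTE {name}-TIMER.fraction_changed TO {name}-POS-INTERP.set_fraction",
        f"ROUTE {name}-POS-INTERP.value_changed TO {name}.set_translation",
    ])


# The two moveTo targets from the visual definition of "bounce".
bounce_nodes = position_interpolator("ball", [(0, 0, 0), (0, 20, 0)])
```

The returned text still has to be placed inside the `DEF ball Transform` node alongside the existing Shape, as in (c).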
Categories of adjectives
Visually observable
Objects’ attributes/states: dark/light, large/small, big/little, white/black
(color adj.), long/short, new/old, high/low, full/empty, open/closed
Observable human attributes
Feelings: happy/sad, angry, excited, surprised, terrified
Others: old/young, beautiful/ugly, strong/weak,
poor/rich, fat/thin
Relational adj.: nasal (nose), mural (wall), dental (teeth)
Visually unobservable
Perceivable by other modalities: wet/dry, warm/cold, coarse/smooth,
hard/soft, heavy/light
Abstract attributes
Unobservable human attributes (virtue): good/evil, kind, mean, ambitious
Others: easy/difficult, real, important, particular,
right/wrong, early/late
Reference-modifying adj.: possible/impossible, former, past/present,
last, other, different/same
Software Analysis
Java programming language
parsing intermediate representation
changing VRML code to create/modify animation
integrating modules
Natural language processing tools
GATE (pre-processing)
PC-PARSE (morphological and syntactic analysis)
WordNet (lexicon, semantic inference)
3D graphic modelling
existing 3D models on the Internet
3D Studio Max (props & stage)
VRML (Virtual Reality Modelling Language) 97, H-anim 2001 spec.
The Actors – using embodied agents
Microsoft Agent (the narrator and minor actors)
Character Studio, Internet Character Animator (protagonists)
Natural Language Processing
Pre-processing
PC-PARSER
Part-of-speech tagger
Syntactic parser
Semantic
inference
WordNet 1.6
Coreference
resolution
FEATURES
Temporal
reasoning
LEXICON &
MORPHOLOGICAL RULES
morphological
parser
Contribution & prospective applications
multimodal semantic representation of natural language
automatic animation generation
multimodal fusion and coordination
Children’s education
Multimedia presentation
Movie/drama production
Script writing
Computer games
Virtual Reality
Conclusion
The objectives of CONFUCIUS address challenging
problems in language visualisation:
formalizing the meaning of action verbs and states
mapping language primitives to visual primitives
building a reusable ‘common sense’ knowledge base for other systems
supporting sophisticated spatial and temporal reasoning
representing stories by temporal multimedia, which requires
significant coordination