Transcript audition

Cosc 6326/Psych6750X
Audition and Auditory Displays
Use of auditory displays
Sound in information display
• speech provides a high bandwidth
communication channel
• audition is a long distance sense without field of
view restrictions
• Sound is useful for information display (Cohen
& Wenzel 1995)
– when origin of message is a sound (voice, music)
– when message is simple and short (e.g. event
markers)
– when message will not be referred to later (e.g.
time)
– when message deals with events in time
– warnings or prompts (hearing is always on, no field
of view issues)
– continuously changing information (e.g. countdown)
– when other systems (e.g. vision) are overloaded
– when verbal response is required
(compatibility)
– when illumination or disability prevents vision
(e.g. alarm clock, limited field of view,
blindness)
– when the user moves from place to place
(sound as a ubiquitous I/O channel)
Sonification
• In ‘visualization’ situations, ‘sonification’ of
data can assist in the exploration of complex
datasets
• In these applications ‘realism’ is typically
not a major issue
• Sound can help interpret complex or
multidimensional data; can provide an
independent display dimension
• In addition to information display, in
immersive displays sound contributes to:
– realism, situational awareness and presence
– ambience and emotive context
– cueing visual attention
– natural communication
– space perception
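As a rough illustration of sonification, the sketch below maps each data value to the pitch of a short tone, so that a rising series is heard as a rising melody. The function names, the 220 to 880 Hz range and the exponential mapping are assumptions chosen for this example; they are not taken from the lecture.

```python
import numpy as np

def data_to_frequencies(values, f_low=220.0, f_high=880.0):
    """Map data values onto frequencies between f_low and f_high (assumed range)."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    # Normalise to 0..1; a constant series maps to f_low.
    norm = (v - v.min()) / span if span > 0 else np.zeros_like(v)
    # Exponential mapping: equal data steps become equal musical intervals.
    return f_low * (f_high / f_low) ** norm

def sonify(values, note_duration=0.2, fs=22050):
    """Return one short sine tone per data value, concatenated into one signal."""
    t = np.arange(int(round(note_duration * fs))) / fs
    tones = [np.sin(2 * np.pi * f * t) for f in data_to_frequencies(values)]
    return np.concatenate(tones)

freqs = data_to_frequencies([0.0, 1.0, 2.0])
signal = sonify([0.0, 1.0, 2.0])
```

Pitch is only one possible display dimension; loudness, timbre or spatial position could be driven by other data columns in the same way.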
Realism and ambience
• High quality sound improves perceived
‘quality’ of visual displays
• Sounds in the environment provide vital
information that contributes to situational
awareness
• Persistence of sounds of objects out of field
of view may help maintain object
permanence
• Sound is believed to be vital for conveying
emotion and ambience in movies
• Ambient sounds can be realistic or abstract
(e.g. music to set mood)
• Absence of appropriate sound degrades
realism
• If background sounds are not well matched to the
visuals, the participant may feel detached and
‘presence’ may be degraded
• Relation between presence and realism is not
straightforward (later lecture)
• Hearing is an omni-directional sense and may help
the user feel immersed in the VE
• Auditory collision cues may help in navigating a VE
(especially with HMDs)
Audition
Sound
• Sound is “mechanical vibrations and waves
of an elastic medium, particularly in the
frequency range of human hearing (16 Hz to
20 kHz)”
• Normally, the medium is air. Sound is an air
pressure wave.
• Sound is usually used to describe the
physical stimulus.
• Audition refers to perception.
• An auditory event is usually elicited by a
sound event.
• A sinusoidal pressure wave is known as a
pure tone.
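• Sinusoid
– $x(t) = A \cos(2\pi f_0 t + \phi)$
– $A$ is amplitude, $f_0$ is frequency, $\phi$ is phase
– $T_0 = 1/f_0$ is the period
– $\phi$ is related to the time shift of the peak
– wavelength $\lambda = c/f$, where $c$ is the speed of sound
[Figure: plot of x(t) against t, with the period T0 = 1/f0 marked]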
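The sinusoid definition can be tried out numerically. This is a minimal sketch: the sample rate of 44100 Hz and the speed of sound of 343 m/s (air at roughly 20 °C) are assumptions, and the function names are made up for this example.

```python
import numpy as np

def pure_tone(amplitude, f0, phase, duration, fs=44100):
    """Sample x(t) = A cos(2*pi*f0*t + phase) at fs Hz for `duration` seconds."""
    t = np.arange(int(round(duration * fs))) / fs
    return t, amplitude * np.cos(2 * np.pi * f0 * t + phase)

def period(f0):
    """Period T0 = 1/f0 in seconds."""
    return 1.0 / f0

def wavelength(f, c=343.0):
    """Wavelength = c/f in metres; c = 343 m/s is an assumed speed of sound."""
    return c / f

t, x = pure_tone(amplitude=1.0, f0=440.0, phase=0.0, duration=0.01)
print(period(440.0), wavelength(440.0))
```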
Dimensions of sound
• Harmonic content: pitch, melody, harmony,
waveshape, timbre, vibrato
• Timing: duration, tempo, rhythm
• Loudness, envelope
• Spatial: azimuth, elevation, distance
• Ambience: resonance, reverberation,
spaciousness
• Representation: literal, auditory icons,
abstract
• Perceptual and physical dimensions are
analogous but distinct
– pitch and frequency (directly related for pure
tones)
– loudness and intensity
– timbre and complexity
Matlin and Foley, Sensation and Perception
Kandel et al, Principles of Neural Science
Physiology and psychophysics
• Cochlea performs mechanical spectral
analysis of sound signal
• Pure tone induces traveling wave in basilar
membrane.
– maximum mechanical displacement along
membrane is function of frequency (place coding)
• Displacement of basilar membrane changes
with compression and rarefaction (frequency
coding)
Perception of pitch
• Along the basilar membrane, hair cell
response is tuned to frequency
– each neuron in the auditory nerve responds to
acoustic energy near its preferred frequency
– preferred frequency is place coded along the
cochlea. Frequency coding believed to have a
role at lower frequencies
• Higher auditory centers maintain frequency
selectivity and are ‘tonotopically mapped’
• Pitch is related to frequency for pure tones.
• For periodic or quasi-periodic sounds the
pitch typically corresponds to inverse of
period
• Some sounds have no perceptible pitch (e.g. clicks,
noise)
• Sounds can have the same pitch but differ in
spectral content or temporal envelope; this
difference is heard as timbre
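The statement that pitch typically corresponds to the inverse of the period can be illustrated with a signal whose period is 1/200 s even though it contains no energy at 200 Hz. The sketch below uses autocorrelation as a simple period estimator; that choice, the search range of 50 to 1000 Hz and the function name are assumptions for this example, not a model of the auditory system.

```python
import numpy as np

def estimate_pitch(x, fs, fmin=50.0, fmax=1000.0):
    """Estimate the pitch as fs / (lag of the strongest autocorrelation peak)."""
    x = x - np.mean(x)
    # Keep the non-negative lags 0..N-1 of the autocorrelation.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo = int(fs / fmax)  # shortest period considered
    hi = int(fs / fmin)  # longest period considered
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    return fs / lag

fs = 8000
t = np.arange(int(round(0.2 * fs))) / fs
f0 = 200.0
# Harmonics 2, 3 and 4 only: the waveform still repeats every 1/f0 seconds.
x = sum(np.cos(2 * np.pi * k * f0 * t) for k in (2, 3, 4))
pitch_hz = estimate_pitch(x, fs)
print(pitch_hz)
```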
Perception of loudness
• Intensity is measured on a logarithmic scale
in decibels
• Range from threshold to pain is about 120
dB-SPL
• Loudness is related to intensity but also
depends on many other factors (attention,
frequency, harmonics, …)
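The decibel scale can be made concrete with a short calculation. The sketch assumes the standard reference pressure of 20 µPa for dB-SPL; the function names are illustrative.

```python
import math

P_REF = 20e-6  # reference RMS pressure for dB-SPL, in pascals

def spl_db(p_rms):
    """Sound pressure level in dB-SPL for an RMS pressure in pascals."""
    return 20.0 * math.log10(p_rms / P_REF)

def pressure_from_spl(db):
    """RMS pressure in pascals for a level in dB-SPL."""
    return P_REF * 10 ** (db / 20.0)

# A 120 dB range is a factor of 10**6 in pressure (10**12 in intensity).
print(spl_db(P_REF), pressure_from_spl(120.0))
```

The logarithmic scale is what lets a range of about six orders of magnitude in pressure fit into 120 dB.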
Spatial hearing
• Auditory events can be perceived in all
directions from observer
• Auditory events can be localized internally
or externally at various distances
• Audition also supports motion perception
– change in direction
– Doppler shift
• Ability to localize depends on sound source
and environment
– a tone in reverberant room is difficult to locate
in time and space
– a click in an anechoic chamber, on the other
hand, is precisely located and time limited
Auditory Scene Analysis
• Process of separating out the different
sources present in the environment
• Detection and segregation of distinct
sources
• Grouping of sounds in spatial and temporal
proximity into single streams
Cocktail party effect
• In environments with many sound sources it
is easier to process auditory streams if they
are separated spatially
• Spatial sound techniques can help in sound
discrimination, detection and speech
comprehension in busy immersive
environments
Spatial Auditory Cues
• Two basic types of head-centric direction
cues
– binaural cues
– spectral cues
Binaural Directional Cues
• When a source is located eccentrically it is
closer to one ear than the other
– sound arrives later and weaker at one ear
– head ‘shadow’ also weakens sound arriving at
the opposite ear
• Binaural cues are robust but ambiguous
http://headwize.com/tech/aureal1_tech.htm
• Interaural time differences (ITD)
– ITD increases with directional deviation from
the median plane. It is about 600 µs for a source
located directly to one side.
– Humans are sensitive to as little as 10 µs ITD.
Sensitivity decreases with ITD.
– For a given ITD, phase difference is a linear
function of frequency
– For pure tones, phase based ITD is ambiguous
– At low to moderate frequencies phase
difference can be detected. At high frequencies
the ITD in the signal envelope can be used.
– ITD cues appear to be integrated over a
window of 100-200 ms (binaural sluggishness,
Kollmeier & Gilkey, 1990)
• Interaural intensity differences (IID)
– With lateral sources head shadow reduces
intensity at opposite ear
– Effect of head shadow most pronounced for high
frequencies.
– IID cues are most effective above about 2000 Hz
– IIDs of less than 1 dB are detectable. At 4000 Hz a
source located at 90° gives about 30 dB IID
(Matlin and Foley, 1993)
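The size of the ITD and its phase ambiguity can be estimated with a spherical-head approximation. The sketch below uses Woodworth's formula, ITD = (a/c)(θ + sin θ), which is a swapped-in textbook model rather than anything given in these slides; the head radius of 8.75 cm and speed of sound of 343 m/s are assumed values.

```python
import math

def itd_woodworth(azimuth_deg, head_radius=0.0875, c=343.0):
    """Approximate ITD in seconds for a distant source, azimuth 0..90 degrees.

    Spherical-head (Woodworth) approximation; radius and c are assumptions.
    """
    theta = math.radians(azimuth_deg)
    return head_radius / c * (theta + math.sin(theta))

def interaural_phase(f, itd):
    """Interaural phase difference in radians: linear in frequency for fixed ITD."""
    return 2 * math.pi * f * itd

def max_unambiguous_frequency(itd):
    """Frequency at which the phase difference reaches half a cycle (pi)."""
    return 1.0 / (2.0 * itd)

itd_side = itd_woodworth(90.0)
print(itd_side * 1e6, "microseconds")
print(max_unambiguous_frequency(itd_side), "Hz")
```

With these assumed values the ITD for a source directly to one side comes out in the same range as the figure of about 600 µs quoted above, and the phase cue for pure tones becomes ambiguous somewhat below 1 kHz.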
Ambiguity and Lateralization
Goldstein, Sensation and Perception
• These binaural cues are ambiguous. The same
ITD/IID can arise from sources anywhere along
a ‘cone of confusion’
• Spectral cues and changes in ITD/IID with
observer/object motion can help disambiguate
• When directional cues are used in headphone
systems, sounds are lateralised left versus right
but seem to emanate from inside the head (not
localised)
• Also, for near sources (less than 1 m) there is
significant IID due to differences in distance
to each ear, even at lower frequencies
(Shinn-Cunningham et al 2000)
• Intersection of these ‘near field’ IID curves
with cones of confusion constrains them to
toroids of confusion
Spectral Cues
• The pinnae (outer ears) and head shadow each
ear and create frequency dependent
attenuation of sounds that depends on the
direction of the source
• Because the pinnae are relatively small, spectral
cues are effective predominantly at higher
frequencies (i.e. above 6000 Hz)
• Direction estimation requires separation of
spectrum of sound source from spectral
shaping by the pinnae
• Shape of the pinnae shows large individual
differences, which are reflected in differences
in spectral cues
Distance Cues
• anechoic
– intensity decreases
with distance
– attenuation is higher at
high frequency
– confounded with
spectrum and intensity
of source
• Near field IID
http://headwize.com/tech/aureal1_tech.htm
• reverberation
– ratio of direct to reverberant energy indicates
distance with respect to the environment
– reverberation pattern indicates ‘spaciousness’ of
the environment
– reverberation is more realistic but can degrade
localisation, speech recognition …
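The intensity and reverberation cues can be put into numbers under simple assumptions. The sketch assumes a point source in a free field (inverse-square law) and a diffuse reverberant field whose level does not change with distance; the "critical distance", where direct and reverberant energy are equal, is a term introduced here for illustration.

```python
import math

def level_change_db(d_ref, d):
    """Change in direct-sound level (dB) when moving from d_ref to d metres.

    Assumes a point source in a free field (inverse-square law).
    """
    return -20.0 * math.log10(d / d_ref)

def direct_to_reverberant_db(d, critical_distance):
    """Direct-to-reverberant ratio (dB) at distance d.

    Assumes the reverberant level is constant across the room, so the ratio
    is 0 dB at the critical distance and falls as the source moves away.
    """
    return -20.0 * math.log10(d / critical_distance)

print(level_change_db(1.0, 2.0))             # each doubling costs about 6 dB
print(direct_to_reverberant_db(4.0, 2.0))
```

The first function shows why intensity alone is confounded with source level: a quiet near source and a loud far source can give the same level at the ear.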
Visual-Auditory Interactions
• Auditory cues associated with visual targets
can cue visual attention
• Latency for audition is less than for vision
• A sound associated with a visual target
– can speed visual search
– can reduce response times
– can facilitate saccadic eye movements
– can cue attention outside the field of view
• Ventriloquism and visual capture
– When a visual and auditory source are grouped,
the sound is usually perceived in the direction
of the visual target
Auditory/Aural Displays
• Headphone displays
– Precise independent control of inputs to each ear.
– Individual display.
– Closed ear type can exclude external sounds.
Reduces interference from external sources;
simplifies AR systems.
– Entail an encumbrance.
– Diotic, dichotic (stereo) and spatialised displays
– Head fixed frame of reference. Display needs to
be head tracked to register with virtual world.
• Speaker systems
– Simpler, less encumbrance, multi-user
– Cannot ‘occlude’ real world sounds but can
sometimes mask
– Complication with echoes and cross-coupling
between channels
– Interference from/with visual displays
– World frame of reference.
– Subwoofer allows for deep bass. Could
augment headphones
Spatialised audio
• simple ITD, IID cues in a display lateralize
a sound. Sound is not ‘externalized’
• spatialised audio: generate most of the
spatial cues in real world environment using
signal processing
• with appropriate modeling of sound sources
and user tracking can provide a compelling
illusion of spatial sound in a VE
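A display that uses only simple ITD and IID cues can be sketched as a delay and a gain applied to one channel of a mono signal. This is an illustrative sketch with made-up function and parameter names; as noted above, a signal processed this way is lateralized rather than externalized.

```python
import numpy as np

def lateralize(mono, fs, itd_s, iid_db, source_on_left=True):
    """Return an (N, 2) stereo array with the far ear delayed and attenuated.

    itd_s is the interaural time difference in seconds and iid_db the
    interaural intensity difference in dB (both taken as magnitudes).
    """
    delay = int(round(abs(itd_s) * fs))       # ITD as a whole-sample delay
    gain = 10 ** (-abs(iid_db) / 20.0)        # IID as a linear gain
    near = np.concatenate([mono, np.zeros(delay)])
    far = gain * np.concatenate([np.zeros(delay), mono])
    left, right = (near, far) if source_on_left else (far, near)
    return np.stack([left, right], axis=1)

fs = 48000
click = np.zeros(10)
click[0] = 1.0
stereo = lateralize(click, fs, itd_s=0.0005, iid_db=6.0)
```

Rounding the ITD to whole samples limits the resolution to 1/fs seconds; a fractional-delay filter would be needed for finer steps.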
Binaural recording
http://www.engr.sjsu.edu/~knapp/HCIROD3D/3D_sys1/binaural.htm
• Head related transfer function (HRTF)
– describes how sound at a given location is
transformed (by pinnae etc.) as it travels to the
ear, as a function of frequency
– function of source direction and distance and
frequency (4D)
– equivalent to the Fourier transform of the
response to an impulse source at the desired
position
– IID and ITD as well as spectral cues are
incorporated (interaural differences in HRTF)
[Figure panels: 0.15 m and 1.0 m; Shilling & Shinn-Cunningham 2001]
• To simulate a source at a given location
– correct HRTF for response of the speaker
system
– convolve source with impulse response
corresponding to corrected HRTF.
– multiple sources possible by adding up HRTF
transformed signals
• To measure HRTF
– place microphones in ear canals
– measure microphone response to short clicks at
various locations
– correct for response characteristics of
microphones
• Lengthy, painstaking process.
• Storage requirements for dense sampling
Cohen and Wenzel, 1995
• Limitations in practice:
– sampling: often one distance and limited
number of directions
– interpolated for other locations
– generic versus individualized HRTFs
(front/back confusion and elevation errors)
– HRTF is a characteristic of the user and does
not model effects of environment.
– need to track head position. Delay can be
problematic.
HRTF measurement using model
head (KEMAR)
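The simulation steps listed earlier (convolve the source with the impulse response for each ear, then add up the transformed signals for multiple sources) can be sketched as follows. The impulse responses here are toy placeholders made of a single delayed, scaled sample; they are hypothetical stand-ins, not measured HRTF data, and the function names are invented for this example.

```python
import numpy as np

def toy_hrir(delay_samples, gain, length=32):
    """Hypothetical placeholder impulse response: one delayed, scaled sample."""
    h = np.zeros(length)
    h[delay_samples] = gain
    return h

def spatialize(source, hrir_left, hrir_right):
    """Convolve a mono source with a left/right impulse-response pair.

    Both impulse responses are assumed to have the same length.
    """
    left = np.convolve(source, hrir_left)
    right = np.convolve(source, hrir_right)
    return np.stack([left, right], axis=1)

def mix(rendered):
    """Sum several (N, 2) binaural signals, zero-padding to the longest."""
    n = max(r.shape[0] for r in rendered)
    out = np.zeros((n, 2))
    for r in rendered:
        out[:r.shape[0]] += r
    return out

src = np.array([1.0, 0.0, 0.0])
binaural = spatialize(src, toy_hrir(2, 1.0), toy_hrir(5, 0.5))
scene = mix([binaural, binaural])
```

With measured responses the same structure applies; for long responses an FFT-based convolution is usually preferred for speed.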
Room Modeling
• Can model the effects of reverberation,
echoes etc. with a room transfer function
– varies with listener and source position
– can have a very long response
– combinatorially impractical
• There have been efforts to develop efficient
methods for acoustic modeling of rooms
• Improves realism and distance estimation
but difficult for real-time immersive VEs
Shilling & Shinn-Cunningham 2001
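To give a feel for why room responses are long, the sketch below builds a synthetic room impulse response as exponentially decaying noise and applies it by convolution. This is a common rough stand-in for a real room, not a room model from the lecture; the reverberation time parameter (RT60, the time for the level to fall by 60 dB) and the function names are assumptions.

```python
import numpy as np

def synthetic_rir(rt60, fs, seed=0):
    """Toy room impulse response: noise decaying by 60 dB over rt60 seconds."""
    rng = np.random.default_rng(seed)
    n = int(round(rt60 * fs))
    t = np.arange(n) / fs
    envelope = 10 ** (-3.0 * t / rt60)   # amplitude factor 1/1000 at t = rt60
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0                         # direct path
    return rir

def add_reverb(dry, rir):
    """Apply the room response to a dry signal by convolution."""
    return np.convolve(dry, rir)

fs = 16000
rir = synthetic_rir(rt60=0.5, fs=fs)
dry = np.zeros(100)
dry[0] = 1.0
wet = add_reverb(dry, rir)
print(len(rir), "taps for a 0.5 s response")
```

Even this modest half-second response needs thousands of filter taps per source and ear, and a real response changes whenever the listener or source moves.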
Speaker Systems
• Spatialised audio complicated by fact that
both ears hear each speaker and that
reverberation will occur
• Effectiveness is sensitive to speaker
placement
• Stereo speakers: sound seems to be
localised between the speakers
• increasing the number of speakers increases the
ability to localise sounds (e.g. 5.1 surround
sound systems)
• more complex schemes are possible using
DSP but very challenging (‘ambisonics’)
– cancel interaural cross-talk based on HRTF
corresponding to speaker location
– computations are complex, not robust and must
be done in real time if head tracked
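For the plain stereo case, a sound can be placed between the two speakers by splitting it between the channels. The sketch below uses a constant-power pan law, which is a common audio convention assumed here for illustration and not something specified in the slides.

```python
import math

def constant_power_pan(position):
    """Return (left_gain, right_gain) for position in [-1, 1].

    -1 is fully left, 0 is centre, +1 is fully right. The squared gains
    always sum to 1, so total power stays constant as the sound moves.
    """
    angle = (position + 1.0) * math.pi / 4.0   # maps -1..1 onto 0..pi/2
    return math.cos(angle), math.sin(angle)

left, right = constant_power_pan(0.0)
print(left, right)
```

Panning of this kind only manipulates level differences between speakers; it does not address cross-talk, since both ears still hear both speakers.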
Auditory Rendering
• Auditory modeling/rendering of VEs
– sampling
– synthesis of complex sounds
• spectral
• physical models
• granular synthesis
– Filtering: HRTFs, reverberation, room
modeling
– Object occlusion, air absorption, Doppler
motion
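Of the rendering effects listed, the Doppler shift is the simplest to compute. The sketch uses the standard formula for motion along the line joining source and listener in still air; the speed of sound of 343 m/s and the function name are assumptions.

```python
def doppler_frequency(f_source, v_source, v_listener=0.0, c=343.0):
    """Perceived frequency for motion along the source-listener line.

    v_source is positive when the source moves toward the listener and
    v_listener is positive when the listener moves toward the source
    (both in m/s). c is an assumed speed of sound in air.
    """
    if abs(v_source) >= c:
        raise ValueError("source speed must be below the speed of sound")
    return f_source * (c + v_listener) / (c - v_source)

approaching = doppler_frequency(440.0, 10.0)
receding = doppler_frequency(440.0, -10.0)
print(approaching, receding)
```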