Week 2 Power Point Slides


Introduction
• Phonetics: Speech production and perception
• Phonology: Study of sound combinations
• Orthography: Writing Systems
We’ll talk about each area and how they impact
Natural Language Processing
Phonetics
Study of speech production and perception
• Phone – any one of the sounds that humans can articulate
• Phoneme - Distinct family of phones in a language
– Languages utilize 15–40 phonemes
– Note: Too few distinct sounds for a language vocabulary
– Ears tuned to hear a language’s distinct phonemes
– Languages are easy to speak and still be understood
– Infer phoneme set: find words differing in only one sound
• Allophone: variant realizations of a phoneme
– Can be separate phonemes in other language
• Segment : All phones, phonemes, and allophones
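The "words differing in only one sound" test above (minimal pairs) can be sketched in a few lines. This is only an illustration: the toy lexicon and the function names are assumptions, not part of the slides.

```python
from itertools import combinations

# Hypothetical toy lexicon: word -> phoneme sequence (ARPABET-style symbols).
LEXICON = {
    "pet": ["p", "eh", "t"],
    "bet": ["b", "eh", "t"],
    "bit": ["b", "ih", "t"],
    "bat": ["b", "ae", "t"],
    "fat": ["f", "ae", "t"],
}


def differing_positions(a, b):
    """Return the indices at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def minimal_pairs(lexicon):
    """Yield (word1, word2, sound1, sound2) for words differing in one sound."""
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        if len(p1) != len(p2):
            continue
        diff = differing_positions(p1, p2)
        if len(diff) == 1:
            i = diff[0]
            yield w1, w2, p1[i], p2[i]


def contrastive_sounds(lexicon):
    """Collect every sound that takes part in at least one minimal pair."""
    sounds = set()
    for _, _, s1, s2 in minimal_pairs(lexicon):
        sounds.update((s1, s2))
    return sounds
```

Each sound returned by `contrastive_sounds` changes the meaning of some word, which is the evidence that it belongs to a separate phoneme.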
Overview of the Noisy Channel
The Noisy Channel
Computational Linguistics
1. Replace the ear with a microphone
2. Replace the brain with a computer algorithm
Production
• We have a complete but approximate model of how
speech is produced
• We cannot accurately predict the audio signal
corresponding to given articulatory positions
• The best synthesis methods, for now, use
concatenation-based algorithms to create
computerized speech.
• Model: Pulmonic egressive air-stream from the
source (glottis) through the vocal tract, operating
as a source-filter system.
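A minimal sketch of the source-filter idea, assuming an idealized impulse-train source and a single two-pole resonator as the filter. The 100 Hz pitch, the 700 Hz resonance, the 100 Hz bandwidth, and all function names are illustrative assumptions. A real vocal tract has several resonances and a smoother glottal pulse.

```python
import math


def impulse_train(f0_hz, sample_rate, n_samples):
    """Idealized voiced source: one unit pulse per fundamental period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator standing in for one vocal tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def magnitude_at(signal, freq_hz, sample_rate):
    """Magnitude of a single DFT component, used to inspect the spectrum."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    re = sum(s * math.cos(w * n) for n, s in enumerate(signal))
    im = sum(s * math.sin(w * n) for n, s in enumerate(signal))
    return math.hypot(re, im)


SAMPLE_RATE = 8000                                  # assumed sampling rate, Hz
source = impulse_train(100, SAMPLE_RATE, 800)       # 0.1 s at F0 = 100 Hz
speech = resonator(source, 700, 100, SAMPLE_RATE)   # emphasize ~700 Hz
```

The source contains equal energy at every multiple of F0. After filtering, the harmonic at 700 Hz dominates the one at 2500 Hz, which is the "filter" half of the model.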
Vocal Source
• Speaker alters vocal tension of the vocal folds
– If folds are opened, speech is unvoiced resembling background noise
– If folds are stretched close, speech is voiced
• Air pressure builds and vocal folds blow open releasing pressure
and elasticity causes the vocal folds to fall back
• Average fundamental frequency (F0): 60 Hz to 300 Hz
• Speakers control vocal tension, which alters F0 and the perceived pitch
[Figure labels: Open, Closed, Period]
Formants
• Definition: resonances of the vocal tract that emphasize particular harmonics of F0
– F1, F2, F3, etc.
– Adds timbre to voiced sounds
– Vowels have distinct harmonic patterns
– Vocal articulators change emphasis of the
harmonics and alter their frequencies
– There are complex relationships between
formants dependent on vocal musculature
– Formants spread out as the pitch goes higher
Formant Speaker Variance
Vowel Formants
[Figure labels: e, eh, ae, o, u, ih, uh, ah, a, w]
Vocal Tract
Note: The velum is the soft palate; the epiglottis protects the vocal cords
Another look at the vocal tract
Different Voices
• Falsetto – The vocal cords are stretched and become
thin causing high frequency
• Creaky – Only the front vocal folds vibrate, giving a
low frequency
• Breathy – Vocal cords vibrate, but air is escaping
through the glottis
• Each person tends to consistently use particular
phonation patterns. This makes the voice uniquely
theirs.
Vowels
No restriction of the vocal tract, articulators alter the formants
• Diphthong: Syllabics which show a marked glide
from one vowel to another, usually a steady vowel
plus a glide
• Nasalized: Some air flow through the nasal cavity
• Rounding: Shape of the lips
• Tense: Sound more extreme (further from the schwa)
and tend to have the tongue body higher
• Relaxed: Sounds closer to schwa (tonally neutral)
• Tongue position: Front to back, high to low
Vowel Characteristics
Demo: http://faculty.washington.edu/dillon/PhonResources/vowels.html
Vowel  Word    high  low  front  back  round  tense  F1 (Hz)  F2 (Hz)
iy     Feel    +     -    +      -     -      +      300      2300
ih     Fill    +     -    +      -     -      -      360      2100
ae     Gas     -     +    +      -     -      +      750      1750
aa     Father  -     +    -      -     -      +      680      1100
ah     Cut     -     -    -      -     -      +      720      1240
ao     Dog     -     -    -      -     -      -      600      900
ax     Comply  -     -    +      -     -      -      720      1240
eh     Pet     -     -    -      +     +      +      570      1970
ow     Tone    +     -    -      +     -      -      600      900
uh     Good    +     -    -      +     -      +      380      950
uw     Tool                                          300      940
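As a rough illustration of how F1 and F2 separate vowels, the sketch below picks the vowel whose listed (F1, F2) pair is closest to a measured pair. Only vowels with distinct, complete values in the table are included. The plain Euclidean distance in Hz and the function name are assumptions made for this example. Real systems normalize for speaker differences.

```python
import math

# (F1, F2) in Hz, taken from the vowel table above.
VOWEL_FORMANTS = {
    "iy": (300, 2300),
    "ih": (360, 2100),
    "ae": (750, 1750),
    "ah": (720, 1240),
    "eh": (570, 1970),
    "ow": (600, 900),
    "uh": (380, 950),
    "uw": (300, 940),
}


def nearest_vowel(f1, f2, table=VOWEL_FORMANTS):
    """Return the vowel whose (F1, F2) is closest to the measurement."""
    def distance(item):
        _, (t1, t2) = item
        return math.hypot(f1 - t1, f2 - t2)

    return min(table.items(), key=distance)[0]
```

For example, a measurement of F1 = 310 Hz and F2 = 2250 Hz falls closest to iy (as in "feel").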
Consonants
• Significant obstruction in the nasal or oral cavities
• Occur in pairs or triplets and can be voiced or unvoiced
• Sonorant: continuous voicing
• Unvoiced: less energy
• Plosive: Period of silence and then sudden energy burst
• Lateral, semi vowels, retroflex: partial air flow block
• Fricatives, affricates: Turbulence in the waveform
Manner of Articulation
• Voiced: The vocal cords are vibrating, Unvoiced: vocal cords don’t vibrate
• Obstruent: Noise-like sounds
– Fricative: Air flow not completely shut off
– Affricate: A sequence of a stop followed by a fricative
– Sibilant: a consonant characterized by a hissing sound (like s or sh)
• Trill: A rapid vibration of one speech organ against another (Spanish r).
• Aspiration: burst of air following a stop.
• Stop: Air flow is cut off
– Ejective: airstream and the glottis are closed and suddenly released (/p/).
– Plosive: Voiced stop followed by sudden release
– Flap: A single, quick touch of the tongue (t in water).
• Nasality: Lowering the soft palate allows air to flow through the nose
• Glides: vowel-like, syllable position makes them short without stress (w, y)
– On-glide: glide before vowel, off-glide: glide after vowel
• Approximant (semi-vowels): Active articulator approaches the passive
articulator, but doesn’t totally shut off the air flow (L and R).
– Laterality: The air flow proceeds around the side of the tongue
Place of the Articulation
Articulation: Shaping the speech sounds
• Bilabial – The two lips (p, b, and m)
• Labio-dental – Lower lip and the upper teeth (v)
• Dental – Upper teeth and tongue tip or blade (thing)
• Alveolar –Alveolar ridge and tongue tip or blade (d, n, s)
• Post alveolar –Area just behind the alveolar ridge and tongue
tip or blade (jug ʤ, ship ʃ, chip ʧ, vision ʒ)
• Retroflex – Tongue curled and back (rolling r)
• Palatal – Tongue body touches the hard palate (j)
• Velar – Tongue body touches soft palate (k, g, ŋ (thing))
• Glottal – larynx (uh-uh, voiced h)
English Consonants
Type              Phones                        Mechanism
Plosive           b, p, d, t, g, k              Close oral cavity
Nasal             m, n, ng                      Open nasal cavity
Fricative         v, f, z, s, dh, th, zh, sh    Turbulent
Affricate         jh, ch                        Stop + turbulent
Retroflex liquid  r                             Tongue high and curled
Lateral liquid    l                             Side airstreams
Glide             w, y                          Vowel-like
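The consonant table maps naturally onto a lookup structure. The sketch below inverts it so that a phone can be described by type and voicing. The type lists come from the table. The set of unvoiced phones is the standard English split and is added here as an assumption, as are the names used.

```python
# Consonant types and their phones, as listed in the table above.
CONSONANT_TYPES = {
    "plosive": ["b", "p", "d", "t", "g", "k"],
    "nasal": ["m", "n", "ng"],
    "fricative": ["v", "f", "z", "s", "dh", "th", "zh", "sh"],
    "affricate": ["jh", "ch"],
    "retroflex liquid": ["r"],
    "lateral liquid": ["l"],
    "glide": ["w", "y"],
}

# Assumption (not in the table): the unvoiced English consonants.
UNVOICED = {"p", "t", "k", "f", "s", "th", "sh", "ch"}

# Invert the table: phone -> type.
PHONE_TO_TYPE = {
    phone: kind for kind, phones in CONSONANT_TYPES.items() for phone in phones
}


def describe(phone):
    """Return a short description such as 'unvoiced plosive'."""
    kind = PHONE_TO_TYPE[phone]
    voicing = "unvoiced" if phone in UNVOICED else "voiced"
    return f"{voicing} {kind}"
```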
Consonant Place and Manner
                    Labial  Labio-dental  Dental  Alveolar  Palatal  Velar  Glottal
Plosive             p b                           t d                k g    ?
Nasal               m                             n                  ng
Fricative                   f v           th dh   s z       sh zh           h
Retroflex sonorant                                r
Lateral sonorant                                  l
Glide               w                                       y
Speech Production Analysis
• Plate attached to roof of mouth measuring contact
• Collar around the neck measuring glottis vibrations
• Measure air flow from mouth and nose
• Three dimension images using MRI
Note: IPA was designed before the above technologies
existed. Its categories were devised by linguists looking into
someone’s mouth or feeling how sounds are made.
Perception
• Some perceptual components are understood,
but knowledge concerning the entire human
perception model is rudimentary
• Understood Components
1. The inner ear works as a filter bank
2. Sounds are perceived on a logarithmic scale
3. Some sounds will mask others
The Inner Ear
Two sensory organs are located in the inner ear.
– The vestibule is the organ of equilibrium.
– The cochlea is the organ of hearing.
Note: Basilar Membrane
shown unrolled
Basilar Membrane
• Thin elastic fibers stretched across the cochlea
– Short, narrow, stiff, and closely packed near the oval window
– Long, wider, flexible, and sparse near the end of the cochlea
– The membrane connects to a ligament at its end.
• Separates two liquid filled tubes that run along the cochlea
– The fluids are very different chemically and carry the pressure waves
– A leakage between the two tubes causes a hearing breakdown
• Provides a base for sensory hair cells
– The hair cells above the resonating region fire more profusely
– The fibers vibrate like the strings of a musical instrument.
Place Theory
Decomposing the sound spectrum
• Georg von Bekesy’s Nobel Prize discovery
– High frequencies excite the narrow, stiff part near the oval window
– Low frequencies excite the wide, flexible part by the apex
• Auditory nerve input
– Hair cells on the basilar membrane fire near the vibrations
– The auditory nerve receives frequency coded neural signals
– A large frequency range is possible because the basilar
membrane’s stiffness varies exponentially along its length
Demo at: http://www.blackwellpublishing.com/matthews/ear.html
Hair Cells
• The hair cells are in rows along the basilar membrane.
• Individual hair cells have multiple strands or stereocilia.
– The sensitive hair cells have many tiny stereocilia which form a conical
bundle in the resting state
– Pressure variations cause the stereocilia to
dance wildly and send electrical impulses
to the brain.
Firing of Hair Cells
• There is a voltage difference across the cell
– The stereocilia project into the endolymph fluid (+60mV)
– The perilymph fluid surrounds the membrane of the hair cells (-70mV)
• When the hair cells move
– The potential difference increases
– The cells fire
Speech Perception
• We don't perceive speech linearly
• The cochlea has rows of hair cells. Each row acts as a
frequency filter.
• The frequency filters overlap
From early place theory experiments
Absolute Hearing Threshold
• The hearing threshold varies with frequency.
• An empirical formula approximates the SPL threshold in dB, with f in Hz:
$SPL(f) = 3.65 (f/1000)^{-0.8} - 6.5 e^{-0.6 (f/1000 - 3.3)^2} + 10^{-3} (f/1000)^4$
• The table measures the threshold for men (M) and women (W)
ages 20 through 60
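The empirical threshold formula can be evaluated directly. The sketch below uses the coefficients exactly as given on the slide. The function name and the frequency grid are assumptions.

```python
import math


def hearing_threshold_db(freq_hz):
    """Approximate absolute hearing threshold (dB SPL) at freq_hz."""
    k = freq_hz / 1000.0  # frequency in kHz
    return (
        3.65 * k ** -0.8
        - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
        + 1e-3 * k ** 4
    )


# Scan 100 Hz .. 10 kHz to find where the ear is most sensitive.
GRID_HZ = range(100, 10001, 100)
MOST_SENSITIVE_HZ = min(GRID_HZ, key=hearing_threshold_db)
```

The curve is high at low frequencies, dips to its minimum between roughly 3 and 4 kHz, and rises again at high frequencies because of the fourth-power term.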
Sound Threshold Measurements
Intensity and Neural Response
• Auditory response is a function of intensity
• The response saturates at a maximum intensity level
From CMU Robust
Speech Group
Bark and Mel Scales
Mel scale: $Mel(f) = 2595 \log_{10}(1 + f/700)$
Bark scale: $Bark(f) = \frac{26.81 f}{1960 + f} - 0.53$
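Both scale formulas translate directly into code. This is a small sketch with assumed function names. The inverse Mel mapping is derived algebraically from the Mel formula and is not on the slide.

```python
import math


def hz_to_mel(freq_hz):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def hz_to_bark(freq_hz):
    """Bark(f) = 26.81 * f / (1960 + f) - 0.53."""
    return 26.81 * freq_hz / (1960.0 + freq_hz) - 0.53
```

The Mel formula maps 1000 Hz to almost exactly 1000 mel. Above that point, equal steps in Hz produce progressively smaller steps on both scales, reflecting the logarithmic character of perception.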
Comparison of Frequency Perception Scales
[Figure: perceptual scale versus frequency, 0 to 5000 Hz. Blue: Bark scale; Red: Mel scale; Green: ERB scale]
Equivalent Rectangular Bandwidth (ERB) is an unrealistic but
simple rectangular approximation to model the filters in the cochlea
Masking
• Masking is a phenomenon in which perception of one sound
is obscured by the presence of another sound
• Masking occurs in both the time and frequency domains
– Time: One tone occurs shortly before another tone
– Frequency: One tone is near the frequency of another
• Experiment (most involve single sine waves)
– Fix one sound at a frequency and intensity
– Vary a second sine wave’s intensity
– When is the second sound heard?
• Amplification of perception
– Tones below the threshold of hearing can be perceived if they
occur simultaneously and the total energy within a frequency
band exceeds the threshold.
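The "amplification of perception" point can be checked with a little arithmetic. The sketch below assumes the tones fall in the same frequency band and that their energies simply add. The levels and threshold in the usage note are made-up illustrative numbers.

```python
import math


def combined_level_db(levels_db):
    """Total level (dB) of several tones whose energies add within one band."""
    total_power = sum(10.0 ** (level / 10.0) for level in levels_db)
    return 10.0 * math.log10(total_power)


def audible_together(levels_db, threshold_db):
    """True if the summed energy of the tones exceeds the threshold."""
    return combined_level_db(levels_db) > threshold_db
```

With a hypothetical threshold of 10 dB, a single 8 dB tone is inaudible, but two simultaneous 8 dB tones sum to about 11 dB and cross the threshold.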
Masking Patterns
• A narrow band of noise at 410 Hz
• Note the asymmetrical pattern
From CMU Robust
Speech Group
Time Domain Masking
• Noise will mask a tone if:
– The noise is sufficiently loud
– The delay is short
– Intensity of the noise needs to increase with the delay length
• There are two types of masking
– Forward: Noise masking a tone that follows
– Backward: A tone is masked by noise that follows
• Delays
– Beyond 100–200 ms, no forward masking occurs
– Beyond 20 ms, no backward masking occurs. Training can reduce or
eliminate the perceived backward masking.
Phonology
• Study of sound combinations
• Rule based
– A finite state grammar can represent valid sound
combinations in a language
– Unfortunately, these rules are language-specific
• Statistics based
– Most other areas of Natural Language Processing
are trending toward statistics-based methods
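To make the rule-based option concrete, here is a toy finite-state acceptor for syllables of the shape (C)(C)V(C). The pattern, the five-letter vowel set, and the state numbering are assumptions chosen for illustration. They are not the phonotactics of English or of any other language.

```python
# Hypothetical vowel inventory for the toy grammar.
VOWELS = {"a", "e", "i", "o", "u"}

# States: 0 = start, 1 = one onset consonant, 2 = two onset consonants,
#         3 = vowel seen, 4 = coda consonant seen. States 3 and 4 accept.
TRANSITIONS = {
    (0, "C"): 1, (0, "V"): 3,
    (1, "C"): 2, (1, "V"): 3,
    (2, "V"): 3,
    (3, "C"): 4,
}
ACCEPTING = {3, 4}


def sound_class(sound):
    """Classify a sound as vowel ('V') or consonant ('C')."""
    return "V" if sound in VOWELS else "C"


def is_valid_syllable(sounds):
    """Run the acceptor; reject as soon as no transition applies."""
    state = 0
    for sound in sounds:
        state = TRANSITIONS.get((state, sound_class(sound)))
        if state is None:
            return False
    return state in ACCEPTING
```

Because the transition table is language-specific, supporting another language means writing a new table, which is the drawback noted above.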
Syllables
• Organizational phonological unit
– Vowel between two consonants
– Ambiguous positioning of consonants into
syllables
– Tree structured representation
• Basic unit of prosody
– Lexical stress: inherent property of a word
– Sentential stress: speaker choice to emphasize or
clarify
Representing Stress
• There have been unsuccessful attempts to
automatically assign stress to phonemes
• Notations for representing stress
– IPA (International Phonetic Alphabet) has a diacritic
symbol for stress
– Numeric representation
• 0: reduced, 1: normal, 2: stressed
– Relative
• Reduced (R) or Stressed (S)
• No notation means undistinguished
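The numeric notation is easy to handle in code. The sketch below assumes stress is written as a digit appended to a phoneme symbol (for example `ax0`). That spelling convention and the function name are assumptions. The digit meanings follow the list above.

```python
# Digit meanings as given above: 0 reduced, 1 normal, 2 stressed.
STRESS_LABELS = {"0": "reduced", "1": "normal", "2": "stressed"}


def parse_stress(tokens):
    """Split tokens such as 'ax0' into (phoneme, stress label or None)."""
    parsed = []
    for token in tokens:
        if token[-1] in STRESS_LABELS:
            parsed.append((token[:-1], STRESS_LABELS[token[-1]]))
        else:
            parsed.append((token, None))  # no notation: undistinguished
    return parsed
```

Other resources number stress levels differently, so keeping the mapping in one dictionary makes it easy to swap conventions.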
Phonological Grammars
• SPE: The Sound Pattern of English
– 13 features for 8192 combinations
– Complete descriptive grammar
• Recent research
– Trend towards context-sensitive descriptions
– Little thought concerning computational feasibility
– It’s unlikely that listeners apply thousands of rules
to perceive speech
Morphology
• How phonemes combine to make words
• Important for speech synthesis
• Example: singular to plural
– Run to runs: z sound (voiced)
– Hit to Hits: s sound (unvoiced)
• Devise sets of rules of pronunciation
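A minimal sketch of such a pronunciation rule for the regular English plural, using ARPABET-style symbols. The voiced/unvoiced cases follow the slide. The third case (an extra vowel after sibilants, as in "roses") is not on the slide and is added here as a labeled assumption.

```python
# Assumption beyond the slide: sibilant-final words take a vowel plus z.
SIBILANTS = {"s", "z", "sh", "zh", "ch", "jh"}
# Unvoiced, non-sibilant final sounds take s.
UNVOICED = {"p", "t", "k", "f", "th"}


def plural_suffix(final_phone):
    """Choose the plural ending from the last sound of the singular."""
    if final_phone in SIBILANTS:
        return ["ix", "z"]
    if final_phone in UNVOICED:
        return ["s"]
    return ["z"]  # all remaining (voiced) sounds


def pluralize(phones):
    """Return the phoneme sequence of the regular plural."""
    return phones + plural_suffix(phones[-1])
```

So "run" (ending in voiced n) takes z, while "hit" (ending in unvoiced t) takes s, matching the example above.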
Orthography: Writing Systems
• Diacritics – Accent marks
• Prosody – Stress, loudness, pitch, tone, intonation, and length
• Written symbolic representation of speech
– Broad: symbol set representing a speech message
– Narrow: symbol set representing a speech signal
• English-based phonetic transcriptions: ARPABET, TIMIT
• IPA: International Phonetic Alphabet
– International standard attempt at a narrow transcription
– Intent: represent all sounds of known languages
– Disadvantages:
• Misses articulator interrelationships
• Multiple realizations of the same sound
• Non-linearity of speech, articulators always moving
Narrow transcription Difficulties
• Realizations are points in continuous space, not discrete
• Sounds take characteristics of adjacent sounds (assimilation)
• Sounds that are combinations of two (co-articulation)
• Articulator targets are often not reached
• Diphthongs combine different phonemes
• Adding (epenthesis) or deleting (elision)
• Missing word, phrase boundaries, endings
• Many tonal variations during speech
• Varied vowel durations
• Common knowledge, familiar background leads to more
sloppy speech with additional non-linearities.
Written English
• Spellings are not consistent with regard to sounds
– Same spelling, different sounds: low vs. cow
– Different spelling, same sounds: cow, bough
• Pronunciations of written languages evolve over time
• If current written English were phonetically accurate
– It would only apply to a single dialect
– It would be wrong as soon as the population altered its
speech patterns
George Bernard Shaw’s System
His goal: Replace the Latin alphabet with one that is phonetically accurate
Result: It didn't work. Language phonetics are not static, and the
population was not willing to switch to a new writing system
Pitman Shorthand
ARPABET: English-based phonetic system
Phone  Example    Phone  Example    Phone  Example
[iy]   beat       [b]    bet        [p]    pet
[ih]   bit        [ch]   chet       [r]    rat
[eh]   bet        [d]    debt       [s]    set
[ah]   but        [f]    fat        [sh]   shoe
[ae]   bat        [g]    get        [t]    ten
[ao]   bought     [hh]   hat        [th]   thick
[ow]   boat       [hy]   high       [dh]   that
[uh]   book       [jh]   jet        [dx]   butter
[ey]   bait       [k]    kick       [v]    vet
[er]   bert       [l]    let        [w]    wet
[ay]   buy        [m]    met        [wh]   which
[oy]   boy        [em]   bottom     [y]    yet
[axr]  dinner     [n]    net        [z]    zoo
[aw]   down       [en]   button     [zh]   measure
[ax]   about      [ng]   sing       [-]    silence
[ix]   roses      [eng]  washing
[aa]   cot
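Because ARPABET is plain ASCII, converting it to IPA is a table lookup. The sketch below covers only a subset of symbols whose IPA equivalents are unambiguous. The function name and the choice to pass unknown symbols through in brackets are assumptions.

```python
# Partial ARPABET -> IPA mapping (a deliberately incomplete subset).
ARPABET_TO_IPA = {
    "iy": "i", "ih": "ɪ", "eh": "ɛ", "ae": "æ", "ax": "ə",
    "p": "p", "b": "b", "t": "t", "d": "d", "k": "k",
    "m": "m", "n": "n", "ng": "ŋ", "s": "s", "z": "z",
    "sh": "ʃ", "zh": "ʒ", "th": "θ", "dh": "ð",
    "ch": "tʃ", "jh": "dʒ", "l": "l", "w": "w", "y": "j",
}


def to_ipa(phones):
    """Convert ARPABET symbols to IPA; unknown symbols stay in brackets."""
    return "".join(ARPABET_TO_IPA.get(p, f"[{p}]") for p in phones)
```

For example, `to_ipa(["sh", "ih", "p"])` gives the IPA spelling of "ship", and a symbol missing from the table, such as `uw`, is left visible as `[uw]` rather than silently dropped.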
The International
Phonetic Alphabet
IPA Vowels
Caution: English tongue positions don’t exactly match the chart.
For example, ‘father’ in English does not have the tongue position
as far back as the IPA vowel chart shows.
IPA Diacritics
IPA: Tones and Word Accents
IPA: Supra-segmental Symbols
Newer Technologies
• Voice XML
– Framework for integrating human/machine dialogues
– W3 Consortium standard
– Input: audio files or human speech
– Output: synthesized
– Script interpreted by voice-browsers
• SSML (speech synthesis markup language)
– XML-based technology to standardize manipulation of
synthesized speech
• Others
– SABLE (1998 Consortium)
– SAPI (Microsoft Speech API )
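As a small illustration of what SSML markup looks like, the sketch below builds a document with Python's standard `xml.etree.ElementTree`. The `speak`, `prosody`, and `break` elements are part of SSML. The helper name, its parameters, and the default values are assumptions for this example, and how the markup is rendered depends on the synthesizer.

```python
import xml.etree.ElementTree as ET

SSML_NAMESPACE = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, rate="medium", pitch="medium", pause_ms=0):
    """Wrap text in SSML that requests a speaking rate, pitch, and pause."""
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NAMESPACE, "xml:lang": "en-US"},
    )
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = text
    if pause_ms > 0:
        # Optional silence after the spoken text.
        ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


EXAMPLE = build_ssml("Welcome to week two.", rate="slow", pause_ms=300)
```

The resulting string can be handed to any SSML-aware synthesizer. The markup describes how the text should be spoken without tying the script to one speech engine.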