
Outline of Unit 13: Sensational Computing
• The digital divide
• Speech audio interfaces
• Non-speech sound
• Handwriting recognition
• Tangible computing and gesture computing
• Ubiquitous computing
Arab Open University - Riyadh
1
Sensational computing
 Studying the use of the senses of hearing
and touch to represent information and to
mediate human–computer interaction.
You have seen in Unit 12 how to represent
information using the visual sense.
Why is it important to use such other
senses to represent information?
2
Sensational computing
 Some reasons for using other senses to represent
information are:
 To make information accessible to as many people
as possible.
 To exploit these senses and modes of information
representation to overcome the digital divide.
 To increase the richness of communication between
humans and computers.
 Some kinds of information, such as music, are best
represented using non-visual forms.
3
The digital divide
The information gap between developed countries
and developing ones.
It also describes the information gap between
people within the same country.
A major reason for this gap is the poor availability
of the computers and telecommunications infrastructure
necessary to use the information.
4
The digital divide
Other reasons for the gap:
• The cost of computers and the lack of technical
infrastructure.
• A conventional computer is physically large and needs
space to site and use it.
• The need for an electricity distribution system.
• Many people in developing countries are illiterate.
• High levels of illiteracy are partly due to the fact that many
languages, especially African ones, do not have a written
form, and many languages are not supported by computer
applications.
• There are still many people around the world who are not
familiar with the English language.
5
The Simputer
 To develop a solution to the digital divide, we
need user-friendly interfaces that don’t rely on text,
but utilize different senses.
One of the solutions is the Simputer.
Simputer stands for ‘Simple, Inexpensive,
Multilingual People’s computer’.
The Simputer project has been led by a group of
Indian scientists and engineers.
6
The Simputer
 The Simputer is a self-contained,
hand-held computer, designed for
use in environments where
computing devices such as PCs are
deemed inappropriate. Thanks to its
low cost, it was also deemed
appropriate for bringing computing
power to developing countries.
It was designed to help the poor and
illiterate join the information age.
7
The Simputer
 The Simputer has software that reads web
pages aloud in native Indian languages, so that
the 35% of Indians who cannot read can find out
about aid projects targeted at them.
 To keep the cost down, open standards were
used.
 The Linux operating system was used.
8
Information Markup Language
The primary interface of the Simputer is a
browser that can render the Information Mark-up
Language (IML).
IML is a new XML application that has been
developed by the project team.
One of IML’s main roles is to specify how pages
should be displayed on the Simputer and which text
on a web page should be read out. The text can be
turned into artificial-sounding but nevertheless
understandable speech in languages such as Hindi,
Kannada and Tamil, using the library of sounds
stored on the computer.
9
Outline of Unit 13
 The digital divide
• Speech audio interfaces
• Non-speech sound
• Handwriting recognition
• Tangible computing and gesture computing
• Ubiquitous computing
10
Speech audio interfaces
 Speech becomes important when providing interfaces for
people who are illiterate or have poor literacy skills.
 Speech is also important to people with visual
impairment.
 Speech is important in situations where a keyboard can't
be used or the eyes can’t read a computer display.
 Speech recognition: computers recognize spoken
words.
 Speech synthesis: computers utter recognizable
speech.
11
Speech recognition
 There are two uses for speech recognition
systems:
1. Dictation: translation of the spoken word into
written text.
2. Computer Control: control of the computer,
and software applications by speaking
commands.
 Speech recognition is one of the most desired
assistive technology systems. People believe
speech recognition is a natural and easy
method of accessing the computer.
12
Speech recognition
 As the user speaks, the microphone transforms the sound
waves into an analogue signal.
 The analogue signal is then converted into a digital one.
 The digital signal is then split into words. (A stretch of
low signal means a break between one word and the next.)
 Once the words are separated, they must be recognized.
 This is done by a speech recognition system.
13
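The word-splitting step above can be sketched in a few lines of code. This is a toy illustration, not a real recognizer: the amplitude threshold and minimum pause length are invented values, and real systems work on framed, filtered audio rather than raw sample lists.

```python
def split_words(samples, threshold=0.1, min_pause=3):
    """Cut a digitized signal into 'words' wherever the amplitude
    stays below a threshold for long enough (a pause)."""
    words, current, quiet = [], [], 0
    for s in samples:
        if abs(s) < threshold:
            quiet += 1
            # A long-enough stretch of low signal ends the current word.
            if quiet >= min_pause and current:
                words.append(current)
                current = []
        else:
            quiet = 0
            current.append(s)
    if current:
        words.append(current)
    return words

# Two bursts of sound separated by silence → two "words".
signal = [0.5, 0.7, 0.6, 0.0, 0.0, 0.0, 0.4, 0.8]
print(len(split_words(signal)))  # 2
```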
Speech recognition
 Types of speech recognition systems:
1. Simple speech recognition systems:
• The telephone answering system is an example, where key
presses are replaced with spoken numbers. A simple speech
recognition system recognizes the individual words (numbers).
• Such recognition systems are called Isolated Word
Recognizers, designed to recognize individual words only.
• A break between one word and the next allows each word
to be isolated and then recognized on its own.
• Another category of speech recognizers is called Speaker-
Enrolment Systems, where the software has been trained to
recognize a single individual. In general, speaker-independent
systems will recognize a much smaller vocabulary than those
systems which have been trained to recognize one person’s
speech.
14
Speech recognition
2. Advanced speech recognition systems:
• There might be a large number of candidate words that match
the word spoken by the user, due to background noise, lack of
clarity on the part of the speaker, or the conversion process from
sounds to electrical signals.
• Therefore, a sophisticated speech recognition system needs to be
given large databases of words, language and grammar rules,
information on the frequency with which words are used in the
user’s language, and probabilities that a certain word follows
another word, in order to identify likely words from a range of
possibilities.
• As an example: suppose that I have dictated the words ‘The dog barked
in the morning’ and the speech recognition system has identified that the
first two words were ‘the’ and ‘dog’, and that possible candidates for the
third word were ‘barged’, ‘barked’, ‘barred’ and ‘boiled’. Rules in the
speech recognition system could reveal that the probability of the word
‘barked’ following the noun ‘dog’ is the highest compared to the other
words. ‘Barked’ would then be chosen as the recognized word.
15
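The ‘dog barked’ example can be sketched as a lookup over word-pair (bigram) probabilities: among the candidate words, pick the one most likely to follow the previous word. The probability values below are invented for illustration; real systems derive them from large text corpora.

```python
# Invented bigram probabilities: P(second word | first word).
bigram_prob = {
    ("dog", "barked"): 0.40,
    ("dog", "barged"): 0.02,
    ("dog", "barred"): 0.05,
    ("dog", "boiled"): 0.01,
}

def pick_word(previous, candidates):
    """Choose the candidate with the highest probability of
    following the previous recognized word."""
    return max(candidates, key=lambda w: bigram_prob.get((previous, w), 0.0))

print(pick_word("dog", ["barged", "barked", "barred", "boiled"]))  # barked
```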
Speech recognition
 Speech recognition is not speech understanding; this is a common
misconception.
 New computer systems with Artificial Intelligence attempt to
achieve some understanding of the meaning of words.
 This understanding is based on common sense: knowledge that we
take for granted when determining the meaning of words.
 Common sense knowledge about barking will be linked with common
sense knowledge about dogs, and so the linking goes on.
 The drawback of the common sense approach is that it requires
many millions of pieces of knowledge, incredibly sophisticated
programming and enormous amounts of computing power to
correctly interpret that knowledge.
16
Speech synthesis
 Speech synthesis: using a machine to reproduce human
speech.
Automata (forerunners of modern robots) were among the
early inventions capable of sounding individual
vowels and consonants.
The difficulties associated with speech synthesis systems are
related to formulating the rules for converting the source
text into speech.
Speech using stored fragments: a computer stores
fragments of speech which are assembled as required to
form complete sentences (e.g. a telephone service that tells
the time).
17
Speech synthesis techniques
 Techniques used by speech synthesis systems:
1. Using Phonemes to produce speech.
2. Using Diphones to produce speech.
3. Model-based speech synthesis to produce
speech
18
Speech synthesis techniques
1. Using Phonemes to produce speech.
 The individual sounds produced by humans are
called phonemes.
 Each language has a number of phonemes
(English uses about 45 phonemes, while Chinese
uses about 2000).
 A speech synthesizer joins together appropriate
phonemes in order to construct words.
 Example: the word CAT can be constructed by joining
the 3 phonemes K, A and T.
 The speech system concatenates phonemes to
produce speech.
19
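The concatenation step can be sketched as follows. The ‘audio fragments’ here are just short lists of numbers standing in for real recorded waveforms; a real synthesizer would store digitized recordings of each phoneme.

```python
# Placeholder waveforms for three phonemes (illustrative values only).
phoneme_audio = {
    "k": [0.1, 0.2],
    "a": [0.5, 0.6, 0.5],
    "t": [0.3, 0.1],
}

def synthesize(phonemes):
    """Build a word's waveform by joining its phoneme fragments in order."""
    out = []
    for p in phonemes:
        out.extend(phoneme_audio[p])
    return out

cat = synthesize(["k", "a", "t"])   # the CAT example from the slide
print(len(cat))  # 7
```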
Speech synthesis techniques
2. Using Diphones to produce speech.
 Diphones are fragments that span two phonemes.
They stretch from the middle of one phoneme to the
middle of the following phoneme.
 If we continue with our ‘cat’ example, a diphone
consisting of the second half of the ‘k’ phoneme and
the first half of the ‘a’ phoneme would be
concatenated with a second diphone consisting of
the second half of ‘a’ and the first half of ‘t’, and so on.
 Diphones are then joined together to form
sentences.
 This produces much smoother speech than the phoneme
approach.
20
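How the diphone units for a word are derived can be sketched directly: each adjacent pair of phonemes yields one diphone, so ‘cat’ (k-a-t) needs the units k+a and a+t. Real systems also handle the half-phonemes at word boundaries, which this sketch ignores.

```python
def diphones(phonemes):
    """List the adjacent phoneme pairs a diphone synthesizer
    would concatenate (boundary half-phonemes ignored)."""
    return [(phonemes[i], phonemes[i + 1]) for i in range(len(phonemes) - 1)]

print(diphones(["k", "a", "t"]))  # [('k', 'a'), ('a', 't')]
```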
Speech synthesis techniques
3. Model-based speech synthesis to
produce speech:
 The most advanced technique.
 Relies on modelling the way in which humans speak.
 It simulates the human vocal tract (which produces the
sound and then shapes it in order to speak).
21
Speech Synthesis
 Problems facing speech systems that
convert text into speech are:
1. How to pronounce a word of text.
2. Ambiguous words which are spelt identically but
have different pronunciations, called homographs
(e.g. read/read).
3. The computer has no understanding of what it is
reading, so it cannot infer the correct pronunciation
while speaking the text.
 Overcoming these problems produces more
advanced speech synthesis systems.
22
Outline of Unit 13
 The digital divide
 Speech audio interfaces
• Non-speech sound
• Handwriting recognition
• Tangible computing and gesture computing
• Ubiquitous computing
23
Different types of sound
 The different types of sound may be categorized according to the
type of information the sound contains, the ways in which the
sounds are used, or how they support our interactions with a
computer.
1. Music: which can accompany other media to enhance
enjoyment or create atmosphere, or be enjoyed for itself.
2. Alerts: sound effects such as beeps used for attention
getting.
3. Warnings: loud sound effects used for attention
grabbing.
4. Noise: unwanted sounds that can appear at different
frequencies and amplitudes.
24
Music
Digital technologies are becoming an increasingly
important part of music technology.
One reason is that music stored in digital form can be easily
copied without any loss of quality.
This is not true of the analogue form: there will be a
difference in quality between an original tape and a copy
of it.
Recording is a type of representation medium for music,
whether it is stored in an analogue way or a digital one (in
this unit we will be concerned with the digital one).
25
Music
 Sampling is the technique used to convert
analogue sounds into digital form.
 Digital sounds are then stored in CD or MP3
format.
 CD recording offers higher fidelity.
 MP3 recording produces smaller files, enabling
easy transmission.
26
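Sampling can be illustrated in a few lines: an ‘analogue’ waveform (here an ideal 440 Hz sine tone) measured at regular intervals becomes a list of numbers. 44 100 samples per second is the rate used on audio CDs.

```python
import math

def sample(freq, rate, duration):
    """Measure a sine wave of the given frequency at 'rate'
    evenly spaced points per second, for 'duration' seconds."""
    n = int(rate * duration)
    return [math.sin(2 * math.pi * freq * i / rate) for i in range(n)]

samples = sample(440, 44100, 0.01)   # 10 ms of a 440 Hz tone
print(len(samples))  # 441
```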
Manipulating digital music
 After storing sound we need to manipulate it.
 Computer systems enable the recording
technician to easily join parts of different
performances.
 Unwanted noises such as coughs can be
removed from a recording.
 Old recordings can be corrected.
27
Musical Instrument Digital Interface
Another way of storing and manipulating music by
computers is using the MIDI interface.
MIDI: Musical Instrument Digital Interface, widely used in
the music industry.
A MIDI file contains instructions that electronic instruments
(such as an electronic keyboard) can interpret in order to play
individual notes.
MIDI defines an interface standard for connecting electronic
instruments to your PC that allows playing back or even
recording music through these instruments.
A piece of music can be orchestrated for different
instruments.
Files can be edited and individual notes can be changed.
28
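To make ‘instructions that play individual notes’ concrete, here is what a single MIDI note-on message looks like at the byte level. The three-byte layout (a status byte of 0x90 plus the channel, then the note number, then the velocity) is part of the MIDI standard; this sketch only builds the bytes and does not drive a real instrument.

```python
def note_on(channel, note, velocity):
    """Build the 3-byte MIDI note-on message:
    status (0x90 | channel), note number, velocity."""
    return bytes([0x90 | channel, note, velocity])

# Middle C (note 60), moderately loud, on MIDI channel 1 (channel index 0).
msg = note_on(0, 60, 100)
print(msg.hex())  # 903c64
```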
Digital composition
Digital synthesizers are typically controlled by an
electronic keyboard like the piano keyboard.
Digital synthesizers work on a digitized sound source.
The sound is then transformed by changing the
frequency or by filtering, and then converted into an
analogue form suitable for loudspeakers.
 Two ways in which the computer is used in music
composition (capturing and processing the notes):
• A computer program is used to input musical scores using direct
annotation of notes (the keyboard and the mouse can be used to
input notes).
• The MIDI input format is used, which allows the data to be
captured and also edited later.
29
Using sound effects in computer interfaces
Hearing is our richest sense after sight, so there is scope for
more applications that use sound in user interfaces.
 Our visual and auditory senses are independent, but they work well
together.
 Sound reduces the load on the user’s visual system.
 Sound reduces the visual attention that must be paid to a device.
 Sound is attention grabbing.
 Sound helps computers to be more usable by people with visual
impairment.
For these reasons researchers are working on the use of
non-speech sound in human–computer interfaces.
The term Earcons is used to describe the non-verbal
messages that are used in interfaces to give the user
information about computer operations.
30
Outline of Unit 13
 The digital divide
 Speech audio interfaces
 Non-speech sound
• Handwriting recognition
• Tangible computing and gesture computing
• Ubiquitous computing
31
Writing systems
 Most computers are programmed to respond to mouse clicks and
keystrokes on a keyboard.
The keyboard is a device that contains a number of keys
arranged in a seemingly arbitrary layout known as QWERTY
(the same as the early mechanical design of manual typewriters).
 For those of us who use the Latin alphabet (26 letters, 10 numerals
and a few other characters), this keyboard works well.
 This keyboard is not suitable for many other writing systems, such
as Japanese, which is made up of many thousands of characters.
 An alternative to the keyboard is needed:
 Handwriting
32
Handwriting recognition
A computer market that has only become viable in the last
few years is the hand-held or pocket computer.
Decreasing the size of the device means decreasing the size
of the keyboard, which creates an input problem (poor
usability).
Handwriting recognition via a touch-sensitive screen is
the solution.
33
Handwriting recognition
 Difficulties facing handwriting recognition:
• The wide diversity of writing systems (Latin, Arabic, …);
each language has its own rules, including writing direction.
• Large individual differences in writing style; each person
writes the characters in a different shape.
• Human beings are extremely good at resolving
ambiguity in characters (we find it simple to
distinguish the number 5 from the letter S), but generating
a programming solution for this problem is difficult.
• Humans rely on common sense knowledge (we
don't expect to find a number 5 in the middle of a
word, so it must be an S); this is difficult to codify in a
computer program.
34
Simplifying handwriting recognition
 Techniques and conventions used to simplify the
task of handwriting recognition are:
• Restricting the range of symbols that can be used, for
example to just the upper-case letters.
• Requiring that characters are written in predefined
boxes.
• Accepting only handwritten characters that are not
joined up.
• Redesigning the interface so that it is very clear what
input is required.
35
Neural networks and their use in handwriting
recognition
Many handwriting recognition systems use the technology of neural
networks to overcome these difficulties.
‘Neural network’ is a term that originally refers to the network of
neurons inside the nervous system of human beings.
It also refers to artificial neural networks (programming
constructs that mimic the properties of the neurons of the nervous
system).
An artificial neural network must be trained before it becomes useful;
this is done by presenting the network with known data and recording
its response.
If the network produces the correct answer, it moves on to the next
example; otherwise the software repeats the test, adjusting the
network, until the answer is correct.
This technique has been used by handwriting recognition systems.
36
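The train-until-correct loop described above can be sketched with a single artificial neuron (a perceptron), which is far simpler than the networks real handwriting recognizers use. Here it learns a toy two-input AND function by nudging its weights after every wrong answer.

```python
def train(examples, epochs=20):
    """Perceptron rule: present each known example, and adjust the
    weights whenever the output disagrees with the target."""
    w = [0, 0]
    b = 0
    for _ in range(epochs):
        for x, target in examples:
            out = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = target - out          # 0 when correct, ±1 when wrong
            w[0] += err * x[0]
            w[1] += err * x[1]
            b += err
    return w, b

# A toy training set: the logical AND of two inputs.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train(data)
print(all((1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0) == t
          for x, t in data))  # True
```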
Neural networks and their use in handwriting
recognition
A good example of performing handwriting recognition with neural
networks is the Newton MessagePad, released by Apple in 1993.
It used powerful neural-net software to interpret handwriting.
The Newton had two advantages: there was no need to change your
handwriting style, and it would learn to recognize your writing.
When the user enters a word, the system attempts to match it with the
words in its internal dictionary; if it is found, then it recognizes the
word.
Otherwise the user can tell the Newton to add the word to its internal
dictionary.
As time went on, the Newton became more and more accurate.
37
Neural networks and their use in handwriting
recognition
With the Newton there were still some problems that had to be overcome
in order to produce a viable handwriting recognition system.
 A handwriting recognition system must run on a pocket computer, yet it
is a difficult computing task that needs large amounts of
memory and processing power.
 Large amounts of memory and processing power greatly reduce the
battery life.
Palm Computing wanted to produce a pocket computer with a lower price
and a battery life of weeks or even months, by using a slow
microprocessor and a small memory.
Simplifying the way letters are entered simplifies the task of
recognition.
The solution is glyphs.
38
Graffiti – an alternative to handwriting
recognition
A glyph is an element of writing.
The glyphs Palm company used are highly stylized equivalents to
letters, numbers and common punctuation characters. Most glyphs can
be completed in a single stroke of the stylus and each is sufficiently
different from all the others to make the recognition process tractable,
even on a relatively slow microprocessor.
 Palm called their handwriting recognition system Graffiti.
Users needed to first learn the Graffiti alphabet
before using the Palm system.
Experiments showed that the Palm system
overcame most of the problems seen
with the Newton system.
39
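A rough illustration of why single-stroke glyphs make recognition tractable: a stroke can be reduced to a short sequence of pen-movement directions and matched against stored templates, which is cheap even on a slow microprocessor. The glyph templates below are invented for illustration and are not Palm's actual Graffiti definitions.

```python
def directions(points):
    """Quantize a stroke into R/L/U/D moves between successive
    pen positions (screen y grows downwards)."""
    dirs = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if abs(x1 - x0) >= abs(y1 - y0):
            dirs.append("R" if x1 > x0 else "L")
        else:
            dirs.append("D" if y1 > y0 else "U")
    return "".join(dirs)

# Hypothetical single-stroke glyph templates, NOT Palm's real ones.
templates = {"L": "DR", "V": "DU"}

stroke = [(0, 0), (0, 2), (2, 2)]   # pen moves down, then right
code = directions(stroke)
match = [c for c, t in templates.items() if t == code]
print(match)  # ['L']
```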
Outline of Unit 13
 The digital divide
 Speech audio interfaces
 Non-speech sound
 Handwriting recognition
• Tangible computing and gesture computing
• Ubiquitous computing
40
Tangible computing and gesture
computing
Two related means of communicating and interacting with
computers use our sense of touch:
 tangible computing, which involves devices that can be used
to interact with representations of information in the digital
world;
 gesture computing, where computers are programmed to
interpret human gestures and movements.
This area of human–computer interaction is sometimes
known as haptic computing.
In contrast to the visual and auditory senses, which are
primarily used for communication from the computer to the
user, haptic computing is bi-directional.
41
Tangible computing
A tangible interface is an interface that gives a physical
form to digital information.
A physical object can be both a representation of digital
information and a controller for that information.
The PDA (personal digital assistant) is an example of a
tangible user interface in which extra controls and sensors are
added so that the PDA can be physically manipulated.
Tilting or squeezing it, for instance, controls the display of
information on the PDA screen.
Some devices can provide feedback through the
sensation of resistance to movement (e.g. driving simulators that
give the user a feeling of resistance through the steering wheel when
turning a corner too fast).
42
Gesture computing
Another means of communication that is appropriate when
it is necessary to use a computer without a keyboard or
screen, or when a user is not able to hear or even to speak.
The most commonly used sign language
is probably American Sign Language (ASL).
Gesture computing is based on recognizing the special signs
made by the user and responding to them.
So the problem is to develop a recognizer that recognizes
the sign language.
43
Gesture recognition
 Developing such a recognizer is not an easy matter for
several reasons, especially as sign language is
performed free-form, in the air, primarily by the hands.
 2 solutions to this problem:
1. The person making the signs wears special gloves,
which make it easier for an image-recognition system to track
the hands against a general background.
2. The signer wears special sensors, which allow the
computer to track the position of the hands in 3 dimensions.
 New research is now being done on the movements
made by the human eye, trying to recognize them.
44
Outline of Unit 13
 The digital divide
 Speech audio interfaces
 Non-speech sound
 Handwriting recognition
 Tangible computing and gesture computing
• Ubiquitous computing
45
Ubiquitous computing
 Making many computers available throughout the
physical environment, while making them invisible to the
user.
 Sensors inside the washing machine, ……
 The computers are embedded inside the physical
environment and other equipment.
 Computers will be small, unlike conventional computers.
 The computers will be invisible in the sense that users will
not be aware that they are using a computer.
46
What’s next?
• Unit 14 : Hiding data: an introduction to
security.
• Check out TMA04
47