INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Massimo Poesio
LECTURE 11 (Lab): Probability reminder
MACHINE LEARNING
• The next series of lectures will be on methods
to learn how to solve problems from data
(MACHINE LEARNING)
• Most methods of this type presuppose some
knowledge of probability and statistics
WHY PROBABILITY THEORY
• Suppose you’ve already texted the characters
“There in a minu”
• You’d like your mobile phone to guess the most likely
completion of “minu” rather than MINUET or MINUS or
MINUSCULE
• In other words, you’d like your mobile phone to know that
given what you’ve texted so far, MINUTE is more likely than
those other alternatives
• PROBABILITY THEORY was developed to formalize the notion
of LIKELIHOOD
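A minimal Python sketch of the idea, with made-up counts standing in for how often each candidate completion occurs in some reference text:

```python
# Toy completion ranking: the counts below are made up for illustration,
# standing in for how often each word occurs in some reference text.
completion_counts = {"minute": 1200, "minuet": 3, "minus": 150, "minuscule": 8}

total = sum(completion_counts.values())
# Relative frequency as an estimate of how likely each completion is
probabilities = {word: count / total for word, count in completion_counts.items()}

best = max(probabilities, key=probabilities.get)
print(best)  # -> minute
```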
TRIALS (or EXPERIMENTS)
• A trial is anything that may have a certain
OUTCOME (on which you can make a bet, say)
• Classic examples:
– Throwing a die (outcomes: 1, 2, 3, 4, 5, 6)
– A horse race (outcomes?)
• In NLE:
– Looking at the next word in a text
– Having your NL system perform a certain task
(ELEMENTARY) OUTCOMES
• The results of an experiment:
– In a coin toss, HEADS or TAILS
– In a race, the names of the horses involved
• Or if we are only interested in whether a particular
horse wins: WIN and LOSE
• In NLE:
– When looking at the next word: the possible
words
– In the case of a system: RIGHT or WRONG
EVENTS
• Often, we want to talk about the likelihood of
getting one of several outcomes:
– E.g., with dice, the likelihood of getting an even
number, or a number greater than 3
• An EVENT is a set of possible OUTCOMES
(possibly just a single elementary outcome):
– E1 = {4}
– E2 = {2,4,6}
– E3 = {3,4,5,6}
SAMPLE SPACES
• The SAMPLE SPACE is the set of all possible outcomes:
– For the case of a die, sample space S = {1,2,3,4,5,6}
– For the case of a coin toss, sample space S = {H,T}
• For the texting case:
– Texting a word is a TRIAL,
– The word texted is an OUTCOME,
– EVENTS which result from this trial are: texting the word “minute”,
texting a word that begins with “minu”, etc
– The set of all possible words is the SAMPLE SPACE
• (NB: the sample space may be very large, or even infinite)
PROBABILITY FUNCTIONS
• The likelihood of an event is indicated using a PROBABILITY
FUNCTION P
• The probability of an event E is specified by a function P(E),
with values between 0 and 1
– P(E) = 1: the event is CERTAIN to occur
– P(E) = 0: the event is certain NOT to occur
• Example: in the case of throwing a die,
– P(E’ = ‘getting a number between 1 and 6’) = P({1,2,3,4,5,6}) = 1
– P(E’’ = ‘getting a 7’) = 0
• The sum of the probabilities of all elementary outcomes = 1
EXERCISES: ANALYTIC PROBABILITIES
• When we know the entire sample space, and
we can assume that all outcomes are equally
likely, we can compute the probability of
events such as
– P(1)
– P(EVEN)
– P(>3)
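A minimal Python sketch of these analytic probabilities, counting favourable outcomes over the whole sample space:

```python
from fractions import Fraction

# Fair die: all six outcomes are equally likely
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    # Analytic probability: favourable outcomes / all outcomes
    return Fraction(len(event & sample_space), len(sample_space))

print(prob({1}))          # P(1)    = 1/6
print(prob({2, 4, 6}))    # P(EVEN) = 1/2
print(prob({4, 5, 6}))    # P(>3)   = 1/2
```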
PROBABILITIES AND RELATIVE
FREQUENCIES
• In the case of a die, we know all of the possible outcomes ahead of time,
and we also know a priori what the likelihood of a certain outcome is. But
in many other situations in which we would like to estimate the likelihood
of an event, this is not the case.
• For example, suppose that we would like to bet on horses rather than on
dice. Harry is a race horse: we do not know ahead of time how likely it is
for Harry to win. The best we can do is to ESTIMATE P(WIN) using the
RELATIVE FREQUENCY of the outcome `Harry wins’
• Suppose Harry raced 100 times, and won 20 races overall. Then
– P(WIN) = NUMBER OF WINS / TOTAL NUMBER OF RACES = 20/100 = .2
– P(LOSE) = .8
• The use of probabilities we are interested in (estimating the probability of
certain sequences of words) is of this type
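A minimal sketch of this relative-frequency estimate in Python, using the counts from the slide:

```python
# Relative-frequency estimate of P(WIN) and P(LOSE) for Harry
total_races = 100
races_won = 20

p_win = races_won / total_races   # 0.2
p_lose = 1 - p_win                # 0.8
print(p_win, p_lose)
```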
LOADED DICE
• The assumption that all outcomes have equal
probability is very strong
• In most real situations (and with most real
dice) probabilities of the outcomes are slightly
different
– P(1) = 1/4, P(2) = .15, P(3) = .15, P(4) = .15,
P(5) = .15, P(6) = .15
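A small simulation sketch of such a loaded die: the probabilities above still sum to 1, and the relative frequencies observed over many throws come out close to them:

```python
import random

# A loaded die: outcome 1 is favoured; the probabilities still sum to 1
loaded_die = {1: 0.25, 2: 0.15, 3: 0.15, 4: 0.15, 5: 0.15, 6: 0.15}
assert abs(sum(loaded_die.values()) - 1.0) < 1e-9

# Simulate many throws and check the relative frequency of each outcome
rolls = random.choices(list(loaded_die), weights=loaded_die.values(), k=100_000)
for outcome in sorted(loaded_die):
    print(outcome, rolls.count(outcome) / len(rolls))  # close to the probabilities above
```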
JOINT PROBABILITIES
• We are often interested in the probability of TWO events happening:
– When throwing a die TWICE, the probability of getting a 6 both times
– The probability of finding a sequence of two words: `the’ and `car’
• We use the notation A&B to indicate the conjunction of two events, and
P(A&B) to indicate the probability of such a conjunction
– Because events are SETS, the probability is often also written as P(A ∩ B)
• We use the same notation with WORDS: P(‘the’ & ‘car’)
JOINT PROBABILITIES: TOSSING A DIE
TWICE
• Sample space =
{ <1,1>, <1,2>, <1,3>, <1,4>, <1,5>, <1,6>,
<2,1>, …, <2,6>,
…
<6,1>, <6,2>, <6,3>, …, <6,6> }
EXERCISES: PROBABILITY OF TWO
EVENTS
– P(first toss=1 & second toss=3)
– P(first toss=even & second toss=even)
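A minimal Python sketch of these two exercises, enumerating all 36 ordered pairs of outcomes:

```python
from fractions import Fraction
from itertools import product

# All 36 ordered pairs of outcomes for two tosses of a fair die
two_tosses = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for t in two_tosses if event(t)), len(two_tosses))

# P(first toss = 1 & second toss = 3)
print(prob(lambda t: t[0] == 1 and t[1] == 3))          # 1/36
# P(first toss = even & second toss = even)
print(prob(lambda t: t[0] % 2 == 0 and t[1] % 2 == 0))  # 1/4
```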
OTHER COMBINATIONS OF EVENTS
• A ∪ B: either event A or event B happens
– P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
– NB: If A ∩ B = ∅, P(A ∪ B) = P(A) + P(B)
• ¬A: event A does not happen
– P(¬A) = 1 – P(A)
EXERCISES: ADDITION RULE
• P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
• P(first toss = 1 ∪ second toss = 1)
• P(sum of two tosses = 6 ∪ sum of two tosses = 3)
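A minimal Python sketch of these two exercises, checking the addition rule by direct enumeration:

```python
from fractions import Fraction
from itertools import product

two_tosses = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for t in two_tosses if event(t)), len(two_tosses))

def a(t): return t[0] == 1        # first toss = 1
def b(t): return t[1] == 1        # second toss = 1
# Both sides of the addition rule give the same value
print(prob(lambda t: a(t) or b(t)))                        # 11/36
print(prob(a) + prob(b) - prob(lambda t: a(t) and b(t)))   # 11/36

def c(t): return sum(t) == 6      # sum of two tosses = 6
def d(t): return sum(t) == 3      # sum of two tosses = 3
# c and d cannot both happen, so the intersection term is 0
print(prob(lambda t: c(t) or d(t)), prob(c) + prob(d))     # 7/36 7/36
```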
PRIOR PROBABILITY VS CONDITIONAL
PROBABILITY
• The prior probability P(WIN) is the likelihood of an event occurring
irrespective of anything else we know about the world
• Often, however, we DO have additional information that can help us make a
more informed guess about the likelihood of a certain event
• E.g., take again the case of Harry the horse. Suppose we know that it was
raining during 30 of the races that Harry raced, and that Harry won 15 of
these races. Intuitively, the probability of Harry winning when it’s raining is
.5 - HIGHER than the probability of Harry winning overall
– We can make a more informed guess
• We indicate the probability of an event A happening given that we know
that event B happened as well – the CONDITIONAL PROBABILITY of A given
B – as P(A|B)
Conditional probability
[Venn diagram: the set of ALL RACES, containing the overlapping subsets RACES WHEN IT RAINS and RACES WON BY HARRY]
Conditional probability
• Conditional probability is DEFINED as follows:
P(A|B) = P(A&B) / P(B)
• Intuitively, you RESTRICT the range of trials in consideration
to those in which event B took place, as well (most easily seen
when thinking in terms of relative frequency)
EXAMPLE
• Consider the case of Harry the horse again:
P(WIN|RAIN) = P(WIN & RAIN) / P(RAIN)
• Where:
– P(WIN&RAIN) = 15/100 = .15
– P(RAIN) = 30/100 = .30
• This gives:
P(WIN|RAIN) = 0.15 / 0.3 = 0.5
• (in agreement with our intuitions)
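A minimal Python sketch of this calculation, using the counts from the slides; the last line shows the "restrict the trials" intuition directly:

```python
# Harry's record, as on the slides: 100 races, 30 in the rain, 15 of those won
total_races = 100
rainy_races = 30
rainy_wins = 15

p_win_and_rain = rainy_wins / total_races   # 0.15
p_rain = rainy_races / total_races          # 0.30
print(p_win_and_rain / p_rain)              # P(WIN|RAIN) = 0.5

# The same number, restricting attention to the rainy races only
print(rainy_wins / rainy_races)             # 0.5
```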
EXERCISES
• P(sum of two dice = 3)
• P(sum of two dice = 3 | first die = 1)
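A minimal Python sketch of these two exercises; conditioning on "first die = 1" is done by restricting the sample space:

```python
from fractions import Fraction
from itertools import product

two_dice = list(product(range(1, 7), repeat=2))

def prob(event, space):
    return Fraction(sum(1 for t in space if event(t)), len(space))

# P(sum of two dice = 3): only <1,2> and <2,1>
print(prob(lambda t: sum(t) == 3, two_dice))       # 1/18

# P(sum = 3 | first die = 1): restrict the trials to those where the first die is 1
first_is_1 = [t for t in two_dice if t[0] == 1]
print(prob(lambda t: sum(t) == 3, first_is_1))     # 1/6
```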
THE MULTIPLICATION RULE
• The definition of conditional probability can be
rewritten as:
– P(A&B) = P(A|B) P(B)
– P(A&B) = P(B|A) P(A)
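As a quick sketch, applying the rule to Harry's numbers recovers the joint probability that was estimated earlier from the counts:

```python
# Multiplication rule with Harry's numbers: P(WIN & RAIN) = P(WIN|RAIN) P(RAIN)
p_rain = 0.30
p_win_given_rain = 0.5
print(p_win_given_rain * p_rain)   # 0.15, the same joint probability as 15/100
```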
INDEPENDENCE
• Additional information does not always help. For example, knowing the
color of a die usually doesn’t help us predict the result of a throw;
knowing the name of the jockey’s girlfriend doesn’t help predict how
well the horse he rides will do in a race; etc. When this is the case, we say
that the two events are INDEPENDENT
• The notion of independence is defined in probability theory using the
definition of conditional probability
• Consider again the multiplication rule:
– P(A&B) = P(A|B) P(B)
• We say that two events are INDEPENDENT if:
– P(A&B) = P(A) P(B)
– or, equivalently, P(A|B) = P(A)
EXERCISES
• P(H & H)
• P(sum of two tosses greater than 6 & first toss = 1)
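A minimal Python sketch of these two exercises: the two coin tosses come out independent, while "sum greater than 6" and "first toss = 1" do not:

```python
from fractions import Fraction
from itertools import product

def prob(space, event):
    return Fraction(sum(1 for t in space if event(t)), len(space))

# Two coin tosses: P(H & H) equals P(H) * P(H), so the tosses are independent
coins = list(product("HT", repeat=2))
p_h_and_h = prob(coins, lambda t: t == ("H", "H"))   # 1/4
print(p_h_and_h == prob(coins, lambda t: t[0] == "H") * prob(coins, lambda t: t[1] == "H"))  # True

# Two die tosses: "sum > 6" and "first toss = 1" are NOT independent
dice = list(product(range(1, 7), repeat=2))
p_joint = prob(dice, lambda t: sum(t) > 6 and t[0] == 1)                        # 1/36
p_product = prob(dice, lambda t: sum(t) > 6) * prob(dice, lambda t: t[0] == 1)  # 7/72
print(p_joint, p_product, p_joint == p_product)                                 # 1/36 7/72 False
```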
THE CHAIN RULE
• The multiplication rule generalizes to the so-called
CHAIN RULE:
– P(w1,w2,w3,…,wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1,…,wn-1)
• The chain rule plays an important role in statistical
NLE:
– P(the big dog) = P(the) P(big|the) P(dog|the big)
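A minimal sketch in Python of how the chain rule is used with relative-frequency estimates; the corpus counts below are made up purely for illustration:

```python
# Relative-frequency sketch of the chain rule (made-up counts from a hypothetical corpus)
count_tokens = 20_000       # total word tokens
count_the = 1_000           # occurrences of "the"
count_the_big = 40          # occurrences of the bigram "the big"
count_the_big_dog = 4       # occurrences of the trigram "the big dog"

p_the = count_the / count_tokens
p_big_given_the = count_the_big / count_the
p_dog_given_the_big = count_the_big_dog / count_the_big

# P(the big dog) = P(the) P(big|the) P(dog|the big)
print(p_the * p_big_given_the * p_dog_given_the_big)   # 0.0002 with these made-up counts
```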
Bayes’ theorem
• Suppose you’ve developed an IR system for searching a big
database (say, the Web)
• Given any search, about 1 in 100,000 documents is relevant
(REL)
• Suppose your system is pretty good:
– P(YES|REL) = .95
– P(YES|¬REL) = .005
• What is the probability that the document is relevant, when
the system says YES?
– P(REL|YES)?
Bayes’ Theorem
• Bayes’ Theorem is a pretty trivial consequence of the
definition of conditional probability, but it is very useful in
that it allows us to use one conditional probability to compute
another
• We already saw that the definition of conditional probability
can be rewritten equivalently as:
– P(A&B) = P(A|B) P(B)
– P(A&B) = P(B|A) P(A)
• The two left-hand sides are the same, so the right-hand sides must be equal;
solving for P(B|A) gives Bayes’ theorem:
P(B|A) = P(A|B) P(B) / P(A)
Application of Bayes’ theorem
P(REL|YES) = P(YES|REL) P(REL) / P(YES)
= P(YES|REL) P(REL) / [P(YES|REL) P(REL) + P(YES|¬REL) P(¬REL)]
= (0.95 × 0.00001) / (0.95 × 0.00001 + 0.005 × 0.99999) ≈ 0.002
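A minimal Python sketch of this calculation, plugging in the numbers from the slides:

```python
# Bayes' theorem with the numbers from the slides
p_rel = 0.00001             # P(REL): about 1 in 100,000 documents is relevant
p_yes_given_rel = 0.95      # P(YES|REL)
p_yes_given_notrel = 0.005  # P(YES|¬REL)

# Total probability of the system saying YES
p_yes = p_yes_given_rel * p_rel + p_yes_given_notrel * (1 - p_rel)

# P(REL|YES) = P(YES|REL) P(REL) / P(YES)
print(p_yes_given_rel * p_rel / p_yes)   # about 0.0019
```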