Multinomial Distribution Lecture(s)

Multinomial Distribution
• Multinomial coefficients
• Definition
• Marginals are binomial
• Maximum likelihood
• Hypothesis tests
Multinomial Coefficient:
From n objects, the number of ways to choose
• n1 of type 1
• n2 of type 2
• …
• nk of type k
with n1 + n2 + … + nk = n, is n! / (n1! n2! ⋯ nk!).
Of 30 graduating students, how many
ways are there for 15 to be employed in a
job related to their field of study, 10 to be
employed in a job unrelated to their field
of study, and 5 unemployed?
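The count asked for here is a multinomial coefficient, 30! / (15! 10! 5!). A minimal sketch in Python follows; the function name `multinomial_coefficient` is an illustrative choice, not something from the lecture.

```python
from math import factorial


def multinomial_coefficient(counts):
    """Number of ways to split sum(counts) objects into groups of the given sizes."""
    result = factorial(sum(counts))
    for c in counts:
        # Each division is exact, so integer division loses nothing.
        result //= factorial(c)
    return result


# 15 in a related job, 10 in an unrelated job, 5 unemployed
ways = multinomial_coefficient([15, 10, 5])
print(ways)
```

The same number can be obtained by choosing in stages: 15 of the 30 for the first group, then 10 of the remaining 15 for the second.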
Multinomial Distribution
• Statistical experiment with k outcomes
• Repeated independently n times
• Pr(Outcome j) = pj, j = 1, …, k
• Number of times outcome j occurred is xj, j = 1, …, k
• A multivariate distribution
Marginals are also multinomial
Observe
• Summing the joint probability over xk-1 throws category k-1 into the “leftover” category.
• Labels 1, …, k are arbitrary, so this
means you can combine any 2
categories and the result is still
multinomial.
• k is arbitrary, so you can keep doing it
and combine any number of categories.
• When only two categories are left, the
result is binomial
• E(xj) = npj, Var(xj) = npj(1-pj)
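The claim that a single count is binomial can be checked by brute force for a small case. The sketch below sums the joint multinomial probabilities over the other categories and compares the result with the binomial formula; the values n = 6 and p = (0.6, 0.3, 0.1) are arbitrary choices for illustration.

```python
from math import comb, factorial


def multinomial_pmf(counts, probs):
    """Joint probability of the given counts under M(n, probs), n = sum(counts)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob


n = 6
p = (0.6, 0.3, 0.1)

# Marginal of x1: add the joint pmf over all x2 (x3 is then determined).
marginal_x1 = []
for x1 in range(n + 1):
    total = 0.0
    for x2 in range(n - x1 + 1):
        x3 = n - x1 - x2
        total += multinomial_pmf((x1, x2, x3), p)
    marginal_x1.append(total)

# Binomial(n, p1) probabilities for comparison.
binomial_x1 = [comb(n, x1) * p[0] ** x1 * (1 - p[0]) ** (n - x1)
               for x1 in range(n + 1)]

# Mean and variance computed from the marginal distribution.
mean_x1 = sum(x1 * pr for x1, pr in enumerate(marginal_x1))
var_x1 = sum((x1 - mean_x1) ** 2 * pr for x1, pr in enumerate(marginal_x1))

for x1 in range(n + 1):
    print(x1, round(marginal_x1[x1], 6), round(binomial_x1[x1], 6))
print("mean", mean_x1, "variance", var_x1)
```

The two columns should agree up to rounding, and the mean and variance should match np1 and np1(1-p1).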
Sample problem
• P(Job related to field of study) = 0.60
• P(Job unrelated to field of study) = 0.30
• P(No job) = 0.10
• Of 30 randomly chosen students, what is the probability that 15 are employed in a job related to their field of study, 10 are employed in a job unrelated to their field of study, and 5 are unemployed?
• What is the probability that exactly 5 are unemployed?
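Both questions can be answered directly from the definitions: the first with the multinomial probability, the second with the binomial marginal for the "No job" category. This is a sketch using only the standard library; the helper name `multinomial_pmf` is an assumption made for illustration.

```python
from math import comb, factorial


def multinomial_pmf(counts, probs):
    """Joint probability of the given counts under M(n, probs), n = sum(counts)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob


p = (0.60, 0.30, 0.10)  # related job, unrelated job, no job

# P(15 related, 10 unrelated, 5 unemployed) out of n = 30
prob_joint = multinomial_pmf((15, 10, 5), p)

# P(exactly 5 unemployed): the marginal count is Binomial(30, 0.10)
prob_5_unemployed = comb(30, 5) * 0.10 ** 5 * 0.90 ** 25

print(prob_joint)
print(prob_5_unemployed)
```

The second probability is much larger than the first because it places no restriction on how the other 25 students split between the two job categories.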
Data File
Case   Job   x1   x2   x3
1      1     1    0    0
2      3     0    0    1
3      2     0    1    0
4      1     1    0    0
…      …     …    …    …
N      2     0    1    0
Total
Lessons from the data file
• Cases (N of them) are independent M(1,p),
so E(xi,j) = pj.
• Column totals count the number of times
each category occurs: Joint distribution is
M(N,p)
• These are the table (cell) frequencies! They
are random variables, and now we know their
joint distribution.
• Each individual table frequency is B(N,pj)
• Expected value of frequency j is mj = Npj
• Tables of 2 or more dimensions present no problems: treat each combination of categories as a single category (combination variables).
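The data-file layout can be sketched as follows: each case's job code becomes a row of 0/1 indicators, the column totals are the cell frequencies, and the column means are the sample proportions. The five job codes below are made-up values for illustration, not data from the lecture.

```python
# Hypothetical job codes for five cases: 1 = related job, 2 = unrelated job, 3 = no job
jobs = [1, 3, 2, 1, 2]
k = 3  # number of categories

# One indicator row (x1, x2, x3) per case; each row is a single M(1, p) observation.
indicators = [[1 if job == j else 0 for j in range(1, k + 1)] for job in jobs]

# Column totals are the table frequencies, jointly M(N, p).
frequencies = [sum(column) for column in zip(*indicators)]

# Column means are the sample proportions.
N = len(jobs)
proportions = [f / N for f in frequencies]

for case, row in enumerate(indicators, start=1):
    print(case, jobs[case - 1], row)
print("Total", frequencies)
print("Proportions", proportions)
```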
More about the frequencies
We are in the familiar situation of estimating expected
values with sample means. And these sample means are
just sample proportions.
Simple Tools for Estimation
• So the (multivariate) sample mean is an
unbiased estimator of the vector of
multinomial probabilities.
• The Law of Large Numbers says the sample mean converges to p as N increases.
• CLT says multivariate sample mean has
an approximate multivariate normal
distribution for large N.
• Basis of large-sample tests and
confidence intervals.
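A small simulation can illustrate these points. The sketch below draws N cases from the jobs distribution p = (0.60, 0.30, 0.10) and prints the sample proportions next to the large-sample standard error sqrt(pj(1-pj)/N). The seed and the sample sizes are arbitrary assumptions.

```python
import random
from math import sqrt

p = (0.60, 0.30, 0.10)
categories = (1, 2, 3)
rng = random.Random(312)  # arbitrary seed so the run is reproducible


def sample_proportions(n_cases):
    """Draw n_cases independent categories and return the vector of sample proportions."""
    draws = rng.choices(categories, weights=p, k=n_cases)
    return [draws.count(c) / n_cases for c in categories]


results = {}
for n_cases in (100, 10_000, 200_000):
    results[n_cases] = sample_proportions(n_cases)
    std_errors = [sqrt(pj * (1 - pj) / n_cases) for pj in p]
    print(n_cases, results[n_cases], [round(se, 4) for se in std_errors])
```

As N grows, the proportions settle near p and the standard errors shrink at the rate 1/sqrt(N).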
Maximum Likelihood
• Product of N probability mass functions, each
M(1,p)
• Depends upon the sample data only through the
vector of k frequency counts.
• By the factorization theorem, the vector of frequency counts is a sufficient statistic
• All the information about the parameter in the
sample data is contained in the sufficient
statistic.
Following the book’s notation
• Write the frequencies as x1, …, xk.
• Later, x values with multiple subscripts will
refer to frequencies in a multi-dimensional
table, like xi,j,k will be the frequency in row i
and column j of sub-table k.
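• Write likelihood function as L(p) = p1^x1 p2^x2 ⋯ pk^xk
Log likelihood: x1 log p1 + … + xk-1 log pk-1 + xk log(1 - p1 - … - pk-1), a function of k-1 parameters
Set all k-1 derivatives to zero and solve for p1, …, pk-1. Verify that pi = xi/N for i = 1, …, k-1 works: MLE is the sample mean.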
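The verification step can also be done numerically. With pk = 1 - p1 - … - pk-1, the derivative of the log likelihood with respect to pj is xj/pj - xk/pk. The sketch below evaluates these derivatives at pj = xj/N, using the frequencies 106, 74, 20 from the jobs example that appears later in the lecture. The function names are illustrative assumptions.

```python
from math import log

x = (106, 74, 20)  # frequencies from the jobs example
N = sum(x)


def log_likelihood(free_p):
    """Log likelihood as a function of the k-1 free parameters p1, ..., pk-1."""
    p_last = 1 - sum(free_p)
    return sum(xj * log(pj) for xj, pj in zip(x, free_p)) + x[-1] * log(p_last)


def score(free_p):
    """Partial derivatives of the log likelihood: xj/pj - xk/pk for j = 1, ..., k-1."""
    p_last = 1 - sum(free_p)
    return [xj / pj - x[-1] / p_last for xj, pj in zip(x, free_p)]


p_hat = [xj / N for xj in x[:-1]]  # candidate MLE: sample proportions

print("p_hat:", p_hat)
print("derivatives at p_hat:", score(p_hat))
print("log likelihood at p_hat:", log_likelihood(p_hat))
print("log likelihood at (0.6, 0.3):", log_likelihood([0.6, 0.3]))
```

The derivatives vanish at the sample proportions up to floating-point error, and the log likelihood there is higher than at other parameter values.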
Likelihood Ratio Tests
Under H0, G2 has an approximate chi-square
distribution for large N. Degrees of freedom =
number of (non-redundant, linear) equalities
specified by H0. Reject when G2 is large.
Degrees of Freedom
Express H0 as a set of linear combinations of
the parameters, set equal to constants
(usually zeros).
Degrees of freedom = number of non-redundant
linear combinations.
p = (p1,p2,p3,p4,p5)
• H0: p1=0.25, p2=(p3+p4)/2,p4=p5 so df=3
• H0: p1=1/5, p2=1/5, p3=1/5, p4=1/5, p5=1/5
so df=4 not 5, because probabilities add to
one, so one equality is redundant.
If θ is a k×1 vector and H0: Cθ = h where C is an r×k matrix, the degrees of freedom is the row rank (number of linearly independent rows) of C, usually r. But remember, if θ = p for the multinomial, there are really k-1 parameters.
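The two df calculations above can be checked by computing a row rank. In this sketch each H0 is rewritten in terms of the four free parameters p1, …, p4, with p5 = 1 - p1 - p2 - p3 - p4 substituted in, and the rank is found by Gaussian elimination with exact fractions. The function `row_rank` and the way the matrices are written out are assumptions made for illustration.

```python
from fractions import Fraction


def row_rank(matrix):
    """Number of linearly independent rows, by Gaussian elimination with exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below position `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


half = Fraction(1, 2)

# H0: p1 = 0.25, p2 = (p3 + p4)/2, p4 = p5
# With p5 = 1 - p1 - p2 - p3 - p4, the last equality becomes p1 + p2 + p3 + 2*p4 = 1.
C_first = [
    [1, 0, 0, 0],
    [0, 1, -half, -half],
    [1, 1, 1, 2],
]

# H0: p1 = ... = p5 = 1/5
# The fifth equality becomes -p1 - p2 - p3 - p4 = -4/5, which repeats the first four.
C_second = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [-1, -1, -1, -1],
]

print("df for first H0:", row_rank(C_first))
print("df for second H0:", row_rank(C_second))
```

Writing the hypotheses in the k-1 free parameters is what exposes the redundant fifth equality in the second example.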
Example
University administrators recognize that the percentage
of students who are unemployed after graduation will
vary depending upon economic conditions, but they
claim that still, about twice as many students will be
employed in a job related to their field of study,
compared to those who get an unrelated job. To test
this hypothesis, they select a random sample of 200
students from the most recent class, and observe 106
employed in a job related to their field of study, 74
employed in a job unrelated to their field of study, and
20 unemployed. Test the hypothesis using a large-sample likelihood ratio test and significance level α = 0.05. State your conclusions in symbols and words.
• What is the model?
• What is the null hypothesis, in symbols?
• What are the degrees of freedom for
this test?
What is the restricted MLE? Your answer is a symbolic
expression. It’s a vector. Show your work.
• What is the unrestricted MLE? Your answer is
a numeric vector: 3 numbers.
• What is the restricted MLE? Your answer is a
numeric vector: 3 numbers.
• What are the estimated expected frequencies
under the null hypothesis? Your answer is a
numeric vector: 3 numbers.
Calculate G2. Show your work.
Or, with R
State your conclusions
• In symbols: Reject H0: p1=2p2 at alpha
= 0.05
• In words: More graduates appear to be
employed in jobs unrelated to their
fields of study than expected.
Statement in words is justified because
Observed 106 74 20
Expected 120 60 20
Obs-Exp -14 14 0
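The calculation behind this example can be sketched in Python. Under H0: p1 = 2p2 the probabilities can be written as (2a, a, 1 - 3a); maximizing the likelihood over a gives a = (x1 + x2)/(3N). That step is worked out here rather than quoted from the slides, but it reproduces the expected frequencies 120, 60, 20 shown above. The statistic is G2 = 2 Σ xj log(xj/mj), and for df = 1 the chi-square tail probability equals erfc(sqrt(G2/2)).

```python
from math import erfc, log, sqrt

observed = (106, 74, 20)
N = sum(observed)

# Unrestricted MLE: sample proportions
unrestricted = [x / N for x in observed]

# Restricted MLE under H0: p1 = 2*p2, i.e. p = (2a, a, 1 - 3a)
a = (observed[0] + observed[1]) / (3 * N)
restricted = [2 * a, a, 1 - 3 * a]

# Estimated expected frequencies under H0
expected = [N * p for p in restricted]

# Likelihood ratio statistic
g2 = 2 * sum(x * log(x / m) for x, m in zip(observed, expected))

# p-value for a chi-square with 1 degree of freedom
p_value = erfc(sqrt(g2 / 2))

print("unrestricted MLE:", unrestricted)
print("restricted MLE:", restricted)
print("expected:", expected)
print("G2:", g2, "p-value:", p_value)
```

The erfc shortcut is specific to one degree of freedom; for other df a chi-square routine from a statistics library would be needed.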
For a general hypothesis about a multinomial
Two chi-square formulas
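• Likelihood Ratio: G2 = 2 Σ xj log(xj / mj)
• Pearson: X2 = Σ (xj - mj)^2 / mj
• Here xj is the observed frequency and mj the expected frequency in cell j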
• Summation is over all cells
• By expected frequency, we mean estimated expected
frequency.
• Asymptotically equivalent
• Same degrees of freedom
• Book's formula for df applies only to log-linear
models. Use the approach given here, for now.
Pearson Chi-square on the jobs data
Observed 106 74 20
Expected 120 60 20
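A minimal sketch of the Pearson calculation on these frequencies follows, using df = 1 as in the likelihood ratio test. The critical value 3.841 is the approximate 0.95 quantile of the chi-square distribution with one degree of freedom, and the erfc expression for the p-value is valid only for df = 1.

```python
from math import erfc, sqrt

observed = (106, 74, 20)
expected = (120, 60, 20)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# p-value for a chi-square with 1 degree of freedom
p_value = erfc(sqrt(x2 / 2))

critical_value = 3.841  # approximate 0.95 quantile, chi-square with df = 1
reject = x2 > critical_value

print("X2:", x2)
print("p-value:", p_value)
print("reject H0 at alpha = 0.05:", reject)
```

The third cell contributes nothing because its observed and expected frequencies are equal, so the statistic comes entirely from the two job categories.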