DATA ANALYSIS - DCU School of Computing
DATA ANALYSIS
Module Code: CA660
Lecture Block 3
MEASURING PROBABILITIES – RANDOM
VARIABLES & DISTRIBUTIONS
(Primer) If a statistical experiment only gives rise to real numbers, the outcome of the experiment is called a random variable. If a random variable X takes values x1, x2, ..., xn with probabilities p1, p2, ..., pn, then the expected or average value of X is defined as

$$E[X] = \sum_{j=1}^{n} p_j x_j$$

and its variance is

$$\mathrm{VAR}[X] = E[X^2] - E[X]^2 = \sum_{j=1}^{n} p_j x_j^2 - E[X]^2$$
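As a minimal illustration of these two definitions in Python (the fair-die values and probabilities below are my own example, not from the notes):

```python
# Expected value and variance of a discrete random variable from its
# probability table: E[X] = sum p_j x_j, VAR[X] = E[X^2] - E[X]^2.

def expectation(values, probs):
    return sum(p * x for x, p in zip(values, probs))

def variance(values, probs):
    e_x = expectation(values, probs)
    e_x2 = expectation([x ** 2 for x in values], probs)
    return e_x2 - e_x ** 2

# Illustrative example: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
print(expectation(values, probs))  # 3.5
print(variance(values, probs))     # ~2.9167
```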
Random Variable PROPERTIES
• Sums and Differences of Random Variables
Define the covariance of two random variables to be

$$\mathrm{COVAR}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

If X and Y are independent, COVAR[X, Y] = 0 and E[XY] = E[X]E[Y]. In general,

$$\mathrm{VAR}[X \pm Y] = \mathrm{VAR}[X] + \mathrm{VAR}[Y] \pm 2\,\mathrm{COVAR}[X, Y]$$

and, for a constant k,

$$E[kX] = k\,E[X], \qquad \mathrm{VAR}[kX] = k^2\,\mathrm{VAR}[X]$$

These standard results (lemmas) are used throughout the block.
Example: R.V. characteristic properties
Joint frequencies of two discrete R.V.s B and R, each taking values 1, 2, 3, with marginal totals:

            B=1   B=2   B=3   Totals
R=1           8    10     9     27
R=2           5     7     4     16
R=3           6     6     7     19
Totals       19    23    20     62

$$E[B] = \frac{1(19) + 2(23) + 3(20)}{62} = 2.02, \qquad E[B^2] = \frac{1^2(19) + 2^2(23) + 3^2(20)}{62} = 4.69$$

VAR[B] = ?

$$E[R] = \frac{1(27) + 2(16) + 3(19)}{62} = 1.87, \qquad E[R^2] = \frac{1^2(27) + 2^2(16) + 3^2(19)}{62} = 4.23$$

VAR[R] = ?
Example Contd.
$$E[B + R] = \frac{2(8) + 3(10) + 4(9) + 3(5) + 4(7) + 5(4) + 4(6) + 5(6) + 6(7)}{62} = 3.89$$

$$E[(B + R)^2] = \frac{2^2(8) + 3^2(10) + 4^2(9) + 3^2(5) + 4^2(7) + 5^2(4) + 4^2(6) + 5^2(6) + 6^2(7)}{62} = 16.47$$

VAR[B + R] = ? *

$$E[BR] = \frac{1(8) + 2(10) + 3(9) + 2(5) + 4(7) + 6(4) + 3(6) + 6(6) + 9(7)}{62} = 3.77$$

COVAR(B, R) = ?

Alternative calculation to *: VAR[B] + VAR[R] + 2 COVAR[B, R]. Comment?
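The open questions above can be checked directly. A short Python sketch using the joint frequency table (variable names are illustrative):

```python
# Check the worked example: moments of B, R and B+R from the joint
# frequency table, and the identity
# VAR[B+R] = VAR[B] + VAR[R] + 2 COVAR[B, R].

freq = {
    (1, 1): 8, (2, 1): 10, (3, 1): 9,
    (1, 2): 5, (2, 2): 7,  (3, 2): 4,
    (1, 3): 6, (2, 3): 6,  (3, 3): 7,
}
n = sum(freq.values())  # 62

def E(g):
    """Expectation of g(b, r) under the empirical joint distribution."""
    return sum(f * g(b, r) for (b, r), f in freq.items()) / n

var_b = E(lambda b, r: b**2) - E(lambda b, r: b)**2
var_r = E(lambda b, r: r**2) - E(lambda b, r: r)**2
var_br = E(lambda b, r: (b + r)**2) - E(lambda b, r: b + r)**2
covar = E(lambda b, r: b * r) - E(lambda b, r: b) * E(lambda b, r: r)

print(var_b, var_r, covar)        # ~0.629, ~0.725, ~0.0021
print(var_br)                     # ~1.358
print(var_b + var_r + 2 * covar)  # same value as var_br
```

COVAR(B, R) turns out to be very close to zero here, so the alternative calculation gives VAR[B] + VAR[R] + 2 COVAR[B, R] ≈ VAR[B] + VAR[R] ≈ VAR[B + R], which is the comment being invited.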
DISTRIBUTIONS - e.g. MENDEL’s PEAS
P.D.F./C.D.F.
• If X is a R.V. with a finite or countable set of possible outcomes {x1, x2, ...}, then the discrete probability distribution of X is

$$f(x) \;\big(\text{or } p_X(x_i)\big) = \begin{cases} P\{X = x_i\} & \text{if } x = x_i,\; i = 1, 2, \ldots \\ 0 & \text{if } x \neq x_i \end{cases}$$

and the D.F. or C.D.F. is

$$F(x_i) = P\{X \le x_i\} = \sum_{x_j \le x_i} P\{X = x_j\}$$

• While, similarly, for X a R.V. taking any value along an interval of the real number line,

$$F(x) = P\{X \le x\} = \int_{-\infty}^{x} f(u)\,du$$

So if the first derivative F'(x) exists, then

$$f(x) = F'(x) = \frac{dF(x)}{dx}$$

is the continuous p.d.f., with

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
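A small sketch of both cases, assuming SciPy is available; the 0/1/2 support and the exponential density are illustrative choices, not from the slides:

```python
# Discrete case: build the C.D.F. F(x_i) as the running sum of the
# p.m.f. over x_j <= x_i.
from itertools import accumulate

outcomes = [0, 1, 2]             # illustrative support
pmf = [0.25, 0.5, 0.25]          # illustrative probabilities
cdf = list(accumulate(pmf))      # [0.25, 0.75, 1.0]

# Continuous case: f(x) = F'(x), and integrating f over its whole
# range must give 1. Illustrated with an exponential density.
from scipy.integrate import quad
import math

lam = 2.0
f = lambda x: lam * math.exp(-lam * x)
total, _ = quad(f, 0, math.inf)
print(cdf, total)                # [0.25, 0.75, 1.0] ~1.0
```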
EXPECTATION/VARIANCE
• Clearly,

$$E(X) = \begin{cases} \sum_{i \in S} x_i f(x_i) & \text{discrete} \\ \int_{-\infty}^{\infty} x f(x)\,dx & \text{continuous} \end{cases}$$

• and

$$\mathrm{Var}(X) = \begin{cases} \sum_{x_i \in S} [x_i - E(X)]^2 f(x_i) & \text{discrete} \\ \int_{-\infty}^{\infty} [x - E(X)]^2 f(x)\,dx & \text{continuous} \end{cases}$$
Moments and M.G.F.s
• For a R.V. X and any non-negative integer k, the kth moment about the origin is defined as the expected value of X^k.
• Central Moments (about the Mean): the 1st is 0, since E{X} = μ; the second is the variance, Var{X}.
• To obtain moments, use the Moment Generating Function (m.g.f.).
• If X has a p.d.f. f(x), the m.g.f. is the expected value of e^{tX}. For a continuous variable,

$$\mathrm{mgf}(X) = E\{e^{tX}\} = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$

and for a discrete variable,

$$\mathrm{mgf}(X) = E\{e^{tX}\} = \sum_{x} e^{tx} f(x)$$

• Generally: the rth moment of the R.V. is the rth derivative of the m.g.f. evaluated at t = 0.
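For instance, moments can be read off an m.g.f. symbolically; a sketch with SymPy, assuming the standard Poisson m.g.f. exp(λ(e^t − 1)) (a known result, not derived on the slide):

```python
# Moments from an m.g.f.: the r-th derivative evaluated at t = 0.
import sympy as sp

t, lam = sp.symbols('t lam', positive=True)
mgf = sp.exp(lam * (sp.exp(t) - 1))       # Poisson m.g.f. (assumed)

m1 = sp.diff(mgf, t, 1).subs(t, 0)        # E[X]   = lam
m2 = sp.diff(mgf, t, 2).subs(t, 0)        # E[X^2] = lam + lam^2
var = sp.simplify(m2 - m1 ** 2)           # Var[X] = lam
print(m1, sp.expand(m2), var)
```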
PROPERTIES - Expectation/Variance etc. of Prob. Distributions (p.d.f.s)
• As for R.V.'s generally: for X a discrete R.V. with p.d.f. p{X}, and any real-valued function g,

$$E\{g(X)\} = \sum_{x} g(x)\, p\{X = x\}$$

• e.g.

$$E\{X + Y\} = E\{X\} + E\{Y\}, \qquad E\{XY\} = E\{X\}E\{Y\} \;\; (X, Y \text{ independent})$$

This applies for more than 2 R.V.s also.
• Variance again has properties similar to those given previously, e.g.

$$V\{aX + b\} = a^{2} V\{X\} = a^{2}\left(E\{X^{2}\} - [E\{X\}]^{2}\right)$$
MENDEL’s Example
• Let X record the no. of dominant A alleles in a randomly chosen genotype; then X is a R.V. with sample space S = {0, 1, 2}.
• Outcomes in S correspond to events:

$$X = \begin{cases} 0 & \text{if } aa \\ 1 & \text{if } aA,\, Aa \\ 2 & \text{if } AA \end{cases}$$

• Note: further, any function of X is also a R.V., e.g.

$$Z = g(X) = \begin{cases} 0 & \text{if } aa \;(X = 0) \\ 1 & \text{if } AA,\, Aa,\, aA \;(X > 0) \end{cases}$$

• where Z is a variable for the seed character phenotype.
Example contd.
• So that, for Mendel's data,

$$Z = \begin{cases} 0 & \text{Wrinkled} \\ 1 & \text{Round} \end{cases}$$

• And f(z) is given by

$$P\{Z = 0\} = \tfrac{1}{4}, \qquad P\{Z = 1\} = \tfrac{3}{4}, \qquad \text{with } E(Z) = \tfrac{3}{4}$$

• And

$$\mathrm{Var}(Z) = \sum_i [z_i - E(Z)]^2 f(z_i) = \left(0 - \tfrac{3}{4}\right)^2 \tfrac{1}{4} + \left(1 - \tfrac{3}{4}\right)^2 \tfrac{3}{4} = \tfrac{3}{16}$$

• Note: Z = 'dummy' or indicator variable. Could have chosen e.g. Q as a function of X s.t. Q = 0 round (X > 0), Q = 1 wrinkled (X = 0). Then the probabilities for Q are opposite to those for Z, with

$$E(Q) = \tfrac{1}{4} \quad \text{and} \quad \mathrm{Var}(Q) = \sum_i [q_i - E(Q)]^2 f(q_i) = \left(0 - \tfrac{1}{4}\right)^2 \tfrac{3}{4} + \left(1 - \tfrac{1}{4}\right)^2 \tfrac{1}{4} = \tfrac{3}{16}$$
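A quick check of both indicator variances (Fractions keep the 3/16 exact):

```python
# Indicator ("dummy") variables Z and Q from Mendel's 3:1 phenotype
# ratio: both have variance p(1 - p) = 3/16.
from fractions import Fraction as F

def var_indicator(p):
    """Variance of a 0/1 variable with P{X = 1} = p."""
    return (0 - p) ** 2 * (1 - p) + (1 - p) ** 2 * p

p_z = F(3, 4)   # P{Z = 1} (round)
p_q = F(1, 4)   # P{Q = 1} (wrinkled)
print(var_indicator(p_z), var_indicator(p_q))  # 3/16 3/16
```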
JOINT/MARGINAL DISTRIBUTIONS
• Joint cumulative distribution of X and Y, marginal cumulative for X (without regard to Y), and joint distribution (p.d.f.) of X and Y are, respectively,

$$F(x, y) = P\{X \le x, Y \le y\} \qquad (1)$$

$$F_X(x) = P\{X \le x, Y \le \infty\} \qquad (2)$$

$$p(x, y) = P\{X = x, Y = y\} \qquad (3)$$

with

$$\sum_i \sum_j p(x_i, y_j) = 1$$

• and similarly for the continuous case, e.g. (2) becomes

$$F_X(x) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(u, v)\,dv\,du = \int_{-\infty}^{x} f_X(u)\,du \qquad (2a)$$
Example: Backcross 2-locus model (AaBb × aabb)
Observed (and expected) frequencies
Genotypic S.R. 1:1; expected S.R. within each cross 1:1:1:1

Genotype     Cross 1     Cross 2    Cross 3     Cross 4     Pooled
AaBb         310 (300)   36 (30)    360 (300)    74 (60)     780 (690)
Aabb         287 (300)   23 (30)    230 (300)    50 (60)     590 (690)
aaBb         288 (300)   23 (30)    230 (300)    44 (60)     585 (690)
aabb         315 (300)   38 (30)    380 (300)    72 (60)     805 (690)

Marginal A:
Aa           597 (600)   59 (60)    590 (600)   124 (120)   1370 (1380)
aa           603 (600)   61 (60)    610 (600)   116 (120)   1390 (1380)

Marginal B:
Bb           598 (600)   59 (60)    590 (600)   118 (120)   1365 (1380)
bb           602 (600)   61 (60)    610 (600)   122 (120)   1395 (1380)

Sum          1200        120        1200        240         2760
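The marginal and pooled rows follow from the genotype rows by addition; a short sketch recomputing them from the observed counts:

```python
# Recompute the marginal (A and B) and pooled counts from the
# observed genotype frequencies in the backcross table.
observed = {
    "AaBb": [310, 36, 360, 74],
    "Aabb": [287, 23, 230, 50],
    "aaBb": [288, 23, 230, 44],
    "aabb": [315, 38, 380, 72],
}

def row_sum(rows):
    """Element-wise sum over crosses for the given genotypes."""
    return [sum(vals) for vals in zip(*(observed[g] for g in rows))]

print("Aa:", row_sum(["AaBb", "Aabb"]))   # [597, 59, 590, 124]
print("aa:", row_sum(["aaBb", "aabb"]))   # [603, 61, 610, 116]
print("Bb:", row_sum(["AaBb", "aaBb"]))   # [598, 59, 590, 118]
print("bb:", row_sum(["Aabb", "aabb"]))   # [602, 61, 610, 122]
print("pooled:", {g: sum(v) for g, v in observed.items()})
```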
CONDITIONAL DISTRIBUTIONS
• Conditional distribution of X, given that Y = y:

$$p(x \mid y) = P\{X = x \mid Y = y\} = \frac{p(x, y)}{p(y)} = \frac{P\{X = x, Y = y\}}{P\{Y = y\}}$$

i.e. the JOINT probability over the marginal; and similarly for p(y | x).
• where, for X and Y independent, p(x | y) = p(x) and p(y | x) = p(y).
• Example: Mendel's expt. Probability that a round seed (Z = 1) is a homozygote AA, i.e. (X = 2). The numerator is the joint probability (AND, i.e. intersection, as above):

$$P\{X = 2 \mid Z = 1\} = \frac{P\{X = 2, Z = 1\}}{P\{Z = 1\}} = \frac{1/4}{3/4} = \frac{1}{4} \times \frac{4}{3} = \frac{1}{3}$$
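The same conditioning step in code, using the joint p.m.f. of (X, Z) implied by Mendel's 1:2:1 genotype ratio:

```python
# Conditional probability from the joint p.m.f. of (X, Z):
# P{X=2 | Z=1} = P{X=2, Z=1} / P{Z=1}.
from fractions import Fraction as F

# Joint p.m.f. over (X = no. of A alleles, Z = phenotype indicator).
joint = {(0, 0): F(1, 4), (1, 1): F(1, 2), (2, 1): F(1, 4)}

p_z1 = sum(p for (x, z), p in joint.items() if z == 1)   # 3/4
p_x2_z1 = joint[(2, 1)]                                  # 1/4
print(p_x2_z1 / p_z1)                                    # 1/3
```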
Standard Statistical Distributions
Importance:
• Modelling practical applications
• Mathematical properties are known
• Described by few parameters, which have natural interpretations

Bernoulli Distribution
This is used to model a trial/expt. which gives rise to two outcomes: success/failure, male/female, 0/1, ....
Let p be the probability that the outcome is one and q = 1 - p the probability that the outcome is zero:

$$P\{X = 1\} = p, \qquad P\{X = 0\} = 1 - p$$

$$E[X] = p(1) + (1 - p)(0) = p$$

$$\mathrm{VAR}[X] = p(1)^2 + (1 - p)(0)^2 - E[X]^2 = p(1 - p)$$
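A simulation sketch of these two results (p = 0.3 is an arbitrary illustrative value):

```python
# Bernoulli mean and variance, checked against simulation.
import random

p = 0.3
sample = [1 if random.random() < p else 0 for _ in range(100_000)]
mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / len(sample)
print(mean, var)   # ~0.3 = p, ~0.21 = p(1 - p)
```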
Standard distributions - Binomial
Binomial Distribution
Suppose that we are interested in the number of successes X in n independent repetitions of a Bernoulli trial, where the probability of success in an individual trial is p. Then

$$P\{X = k\} = \binom{n}{k} p^{k} (1 - p)^{n - k}, \qquad k = 0, 1, \ldots, n$$

$$E[X] = np, \qquad \mathrm{VAR}[X] = np(1 - p)$$

[Figure: bar chart of the binomial p.m.f. for n = 4, p = 0.2]

This is the appropriate distribution to model e.g. the number of recombinant gametes produced by a heterozygous parent under a 2-locus model; the extension for 3 loci is the multinomial.
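A sketch reproducing the plotted n = 4, p = 0.2 case with SciPy (the Bernoulli distribution is the n = 1 special case):

```python
# Binomial p.m.f., mean and variance for n = 4, p = 0.2.
from scipy.stats import binom

n, p = 4, 0.2
rv = binom(n, p)
print([round(rv.pmf(k), 4) for k in range(n + 1)])
# [0.4096, 0.4096, 0.1536, 0.0256, 0.0016]
print(rv.mean(), rv.var())   # 0.8 = np, 0.64 = np(1-p)
```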
Standard distributions - Poisson
Poisson Distribution
The Poisson distribution arises as a limiting case of the binomial distribution, where n → ∞ and p → 0 in such a way that np → λ (constant):

$$P\{X = k\} = \frac{e^{-\lambda} \lambda^{k}}{k!}, \qquad k = 0, 1, 2, \ldots$$

$$E[X] = \lambda, \qquad \mathrm{VAR}[X] = \lambda$$

The Poisson is used to model the no. of occurrences of a certain phenomenon in a fixed period of time or space, e.g.
• particles emitted by a radioactive source in a fixed direction over an interval T
• people arriving in a queue in a fixed interval of time
• genomic mapping functions, e.g. cross-over as a random event
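The limiting behaviour can be seen numerically; a sketch comparing binomial and Poisson p.m.f.s for large n and small p with np = λ (λ = 2 is illustrative):

```python
# Poisson as the binomial limit: n large, p small, np = lam.
from scipy.stats import binom, poisson

lam, n = 2.0, 10_000
p = lam / n
ks = range(6)
print([round(binom.pmf(k, n, p), 4) for k in ks])
print([round(poisson.pmf(k, lam), 4) for k in ks])
# The two rows agree to about 4 decimal places.
```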
Other Standard examples: e.g.
Hypergeometric, Exponential….
• Consider a population of M items, of which W are deemed to be successes. Let X be the number of successes that occur in a sample of size n, drawn without replacement from the finite population. Then

$$P\{X = k\} = \frac{\binom{W}{k}\binom{M - W}{n - k}}{\binom{M}{n}}, \qquad k = 0, 1, 2, \ldots$$

• with

$$E[X] = \frac{nW}{M}, \qquad \mathrm{VAR}[X] = \frac{nW(M - W)(M - n)}{M^{2}(M - 1)}$$

• Exponential: a special case of the Gamma distribution (with n = 1), used e.g. to model the inter-arrival time of customers or the time to arrival of the first customer in a simple queue, fragment lengths in genome mapping, etc. The p.d.f. is

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0,\; \lambda > 0; \qquad f(x) = 0 \;\text{otherwise}$$
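A sketch checking the hypergeometric mean/variance formulas against SciPy (the M, W, n values are illustrative), plus the exponential mean 1/λ:

```python
# Hypergeometric mean/variance identities, checked with scipy.
from scipy.stats import hypergeom, expon

M, W, n = 50, 10, 5                  # population, successes, draws
rv = hypergeom(M, W, n)
print(rv.mean(), n * W / M)          # both 1.0
print(rv.var(), n * W * (M - W) * (M - n) / (M**2 * (M - 1)))

# Exponential inter-arrival times with rate lam (scale = 1/lam).
lam = 0.5
print(expon(scale=1 / lam).mean())   # 2.0 = 1/lam
```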
Standard p.d.f.’s - Gaussian/ Normal
• A random variable X has a normal distribution with mean μ and standard deviation s if it has density

$$f(x) = \begin{cases} \dfrac{1}{s\sqrt{2\pi}}\, \exp\!\left[-\dfrac{1}{2}\left(\dfrac{x - \mu}{s}\right)^{2}\right] & -\infty < x < \infty \\ 0 & \text{otherwise} \end{cases}$$

with E(X) = μ and V(X) = s².
• Arises naturally as the limiting distribution of the average of a set of independent, identically distributed random variables with finite variances.
• Plays a central role in sampling theory and is a good approximation to a large class of empirical distributions. The default assumption in many empirical studies is that each observation is approx. ~ N(μ, s²).
• Statistical tables of the Normal distribution are of great importance in analysing practical data sets. X is said to be a Standardised Normal variable if μ = 0 and s = 1.
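A sketch of standardisation, which is why a single table of N(0, 1) values suffices for any normal distribution (the μ and s values are illustrative):

```python
# If X ~ N(mu, s^2) then Z = (X - mu)/s ~ N(0, 1), so probabilities
# for X can be read from the standard normal C.D.F.
from scipy.stats import norm

mu, s = 100.0, 15.0
x = 115.0
z = (x - mu) / s                     # 1.0
print(norm.cdf(x, loc=mu, scale=s))  # ~0.8413
print(norm.cdf(z))                   # same value from N(0, 1)
```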
Standard p.d.f.’s :
Student’s t-distribution
• A random variable X has a t-distribution with n d.o.f. (t_n) if it has density

$$f(t) = \begin{cases} \dfrac{\Gamma\!\left(\frac{n+1}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)\sqrt{n\pi}} \left(1 + \dfrac{t^{2}}{n}\right)^{-\frac{n+1}{2}} & -\infty < t < \infty \\ 0 & \text{otherwise} \end{cases}$$

Symmetrical about the origin, with E[X] = 0 and V[X] = n / (n - 2).
• For small n, the t_n distribution is very flat.
• For n ≥ 25, the t_n distribution ≈ the standard normal curve.
• Suppose Z is a standard Normal variable, W has a χ²_n distribution, and Z and W are independent; then the r.v. formed as

$$X = \frac{Z}{\sqrt{W/n}} \sim t_n$$

• If x1, x2, ..., xn is a random sample from N(μ, σ²), and we define

$$s^{2} = \frac{\sum_i (x_i - \bar{x})^{2}}{n - 1}, \qquad \text{then} \qquad \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$
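A sketch of the last result: the one-sample t statistic computed by hand agrees with SciPy's (the data are illustrative):

```python
# One-sample t statistic: (xbar - mu) / (s / sqrt(n)) ~ t_{n-1}.
import math
from scipy import stats

x = [5.1, 4.9, 5.6, 4.7, 5.3, 5.0]
mu0 = 5.0
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
t_stat = (xbar - mu0) / math.sqrt(s2 / n)

print(t_stat)
print(stats.ttest_1samp(x, mu0).statistic)   # same value
```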
Chi-Square Distribution
• A r.v. X has a Chi-square distribution with n degrees of freedom (n a positive integer) if it is a Gamma distribution with shape n/2 and rate λ = 1/2, so its p.d.f. is

$$f(x) = \begin{cases} \dfrac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2 - 1} e^{-x/2} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

[Figure: χ²_ν(x) density curve]

$$E[X] = n, \qquad \mathrm{Var}[X] = 2n$$

• Two important applications:
- If X1, X2, ..., Xn is a sequence of independently distributed Standardised Normal random variables, then the sum of squares X1² + X2² + ... + Xn² has a χ² distribution with n degrees of freedom.
- If x1, x2, ..., xn is a random sample from N(μ, σ²), then with

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \text{and} \qquad s^{2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n - 1},$$

the quantity (n - 1)s²/σ² has a χ² distribution with n - 1 d.o.f., and the r.v.'s x̄ and s² are independent.
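A simulation sketch of the first application: sums of n squared standard normals have mean n and variance 2n (n = 5 is illustrative):

```python
# Sum of n squared standard normals ~ chi-square with n d.o.f.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000
z = rng.standard_normal((reps, n))
q = (z ** 2).sum(axis=1)

print(q.mean(), q.var())   # ~5 and ~10, i.e. n and 2n
```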
F-Distribution
• A r.v. X has an F distribution with m and n d.o.f. if it has a density function equal to a ratio of gamma functions for x > 0, and 0 otherwise.
• Its moments are

$$E[X] = \frac{n}{n - 2} \quad (n > 2), \qquad \mathrm{Var}[X] = \frac{2n^{2}(m + n - 2)}{m(n - 4)(n - 2)^{2}} \quad (n > 4)$$

• For X and Y independent r.v.'s with X ~ χ²_m and Y ~ χ²_n,

$$\frac{X/m}{Y/n} \sim F_{m,n}$$

• One consequence: if x1, x2, ..., xm (m ≥ 2) is a random sample from N(μ1, σ1²), and y1, y2, ..., yn (n ≥ 2) a random sample from N(μ2, σ2²), with σ1² = σ2², then

$$\frac{\sum_i (x_i - \bar{x})^{2} / (m - 1)}{\sum_i (y_i - \bar{y})^{2} / (n - 1)} \sim F_{m-1,\, n-1}$$
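A simulation sketch of this consequence, assuming equal variances (the sample sizes and seed are illustrative):

```python
# Ratio of independent sample variances from two normals with the
# same sigma^2 follows F with (m-1, n-1) d.o.f.
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
m, n, reps = 8, 12, 100_000
x = rng.normal(0.0, 1.0, (reps, m))
y = rng.normal(0.0, 1.0, (reps, n))
ratio = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)

print(ratio.mean(), f(m - 1, n - 1).mean())   # both ~1.22 = (n-1)/(n-3)
```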