Biostatistics - Amirkabir University of Technology
Biostatistics
• Statistics refers to the analysis and
interpretation of data with a view toward
objective evaluation of the reliability of the
conclusions based on the data.
• Statistics applied to biological problems is
called biostatistics / biometry.
Applications of Statistics in Bioinformatics
• Descriptive summaries
• Clinical diagnosis
• Equipment calibration
• Experimental data analysis
• Gene expression prediction
• Gene hunting
• Gene prediction
• Genetic linkage analysis
• Laboratory automation
• Nucleotide alignment
• Population studies
• Protein function prediction
• Protein structure prediction
• Quantifying uncertainty
• Quality control
• Sequence similarity
Basic Concepts - I
• Data
– Any kind of numbers
– Statistical analyses need numbers
• Statistics
– Concerned with collection, organization, and analysis
of data
– Drawing inferences about a population when only a
sample of the population is studied
• Summary
– Data are numbers, numbers contain information,
statistics investigate and evaluate the nature and
meaning of this information
Basic Concepts - II
• Sources of Data
– Routinely kept records
– Surveys
– Experiments / Research Studies
• Biostatistics
– Statistics applied to biological sciences and medicine
– Statistics including not only analytic techniques but
also study design issues
Variables
• Variable: A characteristic that differs from
one biological entity to another.
• Continuous variable: A variable for which
there is a possible value between any other
two possible values. E.g., height.
• Discrete variable: A variable that can take
only certain values. E.g., number of leaves.
Accuracy & Precision
• Accuracy is the nearness of a measurement
to the actual value of the variable being
measured.
• Precision refers to the closeness to each
other of repeated measurements of the same
quantity.
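The distinction above can be made concrete with a small sketch. It assumes a known reference value, which real measurements often lack; the function names `bias` and `spread` and the numbers are hypothetical, chosen only for illustration.

```python
import statistics

def bias(measurements, true_value):
    """Accuracy: how far the average measurement lies from the true value."""
    return statistics.mean(measurements) - true_value

def spread(measurements):
    """Precision: how close repeated measurements are to each other."""
    return statistics.stdev(measurements)

true_value = 10.0  # hypothetical known reference value

# Tightly clustered readings that are all too high: precise, not accurate.
precise_but_inaccurate = [12.1, 12.0, 11.9, 12.0]
# Scattered readings centred on the true value: accurate on average, not precise.
accurate_but_imprecise = [8.0, 12.0, 9.0, 11.0]

for name, data in [("precise", precise_but_inaccurate),
                   ("accurate", accurate_but_imprecise)]:
    print(name, "bias:", round(bias(data, true_value), 3),
          "spread:", round(spread(data), 3))
```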
Frequency table & Frequency
Distributions.
• Frequency table: involves a listing of all the
observed values of the variable being studied and
how many times each value is observed. Helps
summarize large amounts of data.
• Frequency distribution: The distribution of the
total number of observations among the various
categories is called a frequency distribution.
• Represented graphically as a bar graph,
histogram, Frequency polygons etc.
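A frequency table is easy to build by hand for small data sets. The sketch below is one possible way to do it in Python; the leaf counts are made-up data and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count) pairs sorted by observed value."""
    counts = Counter(values)
    return sorted(counts.items())

# Hypothetical data: number of leaves counted on ten plants.
leaf_counts = [3, 4, 4, 5, 3, 4, 6, 5, 4, 3]
table = frequency_table(leaf_counts)

# Print the table with a crude text bar graph of the distribution.
for value, count in table:
    print(f"{value}\t{count}\t{'#' * count}")
```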
Population and Samples
• Population: The entire collection of measurements
about which one wishes to draw conclusions is the
population / universe.
• Sample: A subset of the measurements in the
population is called a sample.
• Random sampling: The selection of any member
of the population in no way influences the
selection of any other member, i.e., each member of
the population has an equal and independent
chance of being selected.
Randomness
• Data are inherently noisy and randomness is inherent in any sampling
process.
• Every measurement system introduces noise (random variability) into
the desired signal.
• The noise can be minimized by controlling the external environment or
more often by reducing the bandwidth of the system using statistical
techniques.
• Reducing the bandwidth of acceptable (good) data makes it easier to
differentiate good data from bad data.
Eg: Analysis of intra-array spot fluorescence intensity can be
used to control for contamination and other sources of variability.
Simple Random Sample
• Reason
– sample a ‘small’ number of subjects from a population
to make inference about the population
– Essence of statistical inference
• Definition
– A sample of size n drawn from a population of size N in
such a way that every possible sample of size n has the
same chance of being selected
• Sampling with and without replacement
– In biostatistics, most sampling done without
replacement
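As a minimal sketch of the two sampling schemes, Python's standard `random` module provides `sample` (without replacement) and `choices` (with replacement). The population of subject IDs and the seed are assumptions made only so the example is repeatable.

```python
import random

rng = random.Random(42)           # fixed seed only so the sketch is repeatable
population = list(range(1, 101))  # hypothetical subject IDs 1..100 (N = 100)

# Without replacement: every subset of size n is equally likely,
# and no subject can appear twice.
without_replacement = rng.sample(population, k=10)

# With replacement: each draw is made from the full population,
# so the same subject may be drawn more than once.
with_replacement = rng.choices(population, k=10)

print(sorted(without_replacement))
print(sorted(with_replacement))
```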
Interface Noise
• Much of bioinformatics work involves interfacing
mechanical, biological and electronic systems and each
interface introduces noise and variability in the overall
process.
Eg: Translating analog fluorescence intensity to a digital signal
introduces noise, decreases overall system dynamic range and adds
non-linearities and variability to the gene expression data.
Similarly the mechanical and optical-to-digital interfaces in a
nucleotide sequencing machine contribute noise, errors and
random variability to sequence data.
Descriptive Statistics Measures of Location
• Descriptive measure computed from sample data: statistic
• Descriptive measure computed from population data: parameter
• Most common measures of location
– Mean
– Median
– Mode
– Geometric mean
Descriptive Statistics - Arithmetic mean
• Probably most common of the measures of central tendency
– a.k.a. ‘average’
• Definition
– $\bar{x} = \sum x_i / n$
– Suited to the normal distribution, although we tend to use it regardless of
distribution
• Weakness
– Influenced by extreme values
• Translations
– Additive
– Multiplicative
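The sketch below illustrates the weakness and the two translation properties with made-up numbers. It reads "additive" as adding a constant to every value (the mean shifts by that constant) and "multiplicative" as multiplying every value by a constant (the mean is multiplied by it); that reading is an assumption about what the slide intends.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

data = [2.0, 4.0, 6.0, 8.0]          # hypothetical observations
base = mean(data)

# Weakness: one extreme value pulls the mean far from the bulk of the data.
with_outlier = mean(data + [100.0])

# Additive translation: adding 10 to every value adds 10 to the mean.
shifted = mean([x + 10 for x in data])

# Multiplicative translation: tripling every value triples the mean.
scaled = mean([3 * x for x in data])

print(base, with_outlier, shifted, scaled)
```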
Descriptive Statistics
Median
• Frequently used if there are extreme values in a
distribution or if the distribution is non-normal
• Definition
– That value that divides the ‘ordered array’ into two
equal parts
• If an odd number of observations, the median will be the
(n+1)/2 observation
– ex.: median of 11 observations is the 6th observation
• If an even number of observations, the median will be the
midpoint between the middle two observations
– ex.: median of 12 observations is the midpoint between 6th and
7th
• Comparison of mean and median indicates
skewness of distribution
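The odd and even cases above can be sketched as follows. The helper name `median` and the data are illustrative; Python's `statistics.median` gives the same results.

```python
def median(values):
    """Middle value of the ordered array (midpoint of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the (n+1)/2-th observation, 1-based
    return (ordered[mid - 1] + ordered[mid]) / 2  # midpoint of the two middle values

odd_case = median(range(1, 12))    # 11 observations: the 6th one
even_case = median(range(1, 13))   # 12 observations: between the 6th and 7th

# A right-skewed hypothetical sample: the mean sits well above the median.
skewed = [1, 2, 3, 4, 100]
skew_mean = sum(skewed) / len(skewed)
skew_median = median(skewed)
print(odd_case, even_case, skew_mean, skew_median)
```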
Descriptive Statistics
Mode
• Not used very frequently in practice
• Definition
– Value that occurs most frequently in data set
• If all values different, no mode
• May be more than one mode
– Bimodal or multimodal
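A small sketch of these rules, using a hypothetical helper `modes` that returns an empty list when all values are different. Note that this convention differs from Python's `statistics.multimode`, which returns every value in that case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # all values different: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 2, 3]))        # one mode
print(modes([1, 2, 3]))           # no mode
print(modes([1, 1, 2, 2, 3]))     # bimodal
```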
Descriptive Statistics Geometric mean
• Used to describe data with an extreme skewness to
the right
– Ex., laboratory data: lipid measurements
• Definition
– Antilog of the mean of the log xi
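The definition translates directly into code. This sketch uses natural logarithms (any base gives the same result as long as the antilog uses the same base) and assumes all values are strictly positive; the data are invented to show a right-skewed sample.

```python
import math

def geometric_mean(values):
    """Antilog of the mean of the logs; requires strictly positive values."""
    logs = [math.log(x) for x in values]
    return math.exp(sum(logs) / len(logs))

right_skewed = [1.0, 10.0, 100.0]   # hypothetical measurements
gm = geometric_mean(right_skewed)
am = sum(right_skewed) / len(right_skewed)
print(gm, am)  # the geometric mean is pulled far less by the large value
```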
Descriptive Statistics
Measures of Dispersion
• Dispersion of a set of observations is the variety
exhibited by the observations
– If all values are the same, no dispersion
– More the values are spread, the greater the dispersion
• Many distributions are well-described by measure
of location and dispersion
• Common measures
– Range
– Quantiles
– Variance
– Standard deviation
– Coefficient of variation
Descriptive Statistics
Range
• Range is the difference between the smallest and
largest values in the data set
– Heavily influenced by two most extreme values and
ignores the rest of the distribution
Descriptive Statistics
Variance
• Variance measures distribution of values around
their mean
• Definition of sample variance
$s^2 = \sum (x_i - \bar{x})^2 / (n - 1)$
• Degrees of freedom
– n-1 used because if we know n-1 deviations, the nth
deviation is known
– Deviations have to sum to zero
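The sample variance and the degrees-of-freedom argument can be checked numerically. The data below are hypothetical; `sample_variance` is an illustrative helper equivalent to `statistics.variance`.

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
x_bar = sum(data) / len(data)
deviations = [x - x_bar for x in data]

# Deviations sum to zero, so the last one is fixed by the other n - 1.
last_from_others = -sum(deviations[:-1])

variance = sample_variance(data)
print(variance, sum(deviations), last_from_others, deviations[-1])
```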
Descriptive Statistics
Standard Deviation
• Definition of sample standard deviation
$s = \sqrt{s^2}$
• Standard deviation in same units as mean
– Variance in units squared
• Translations
– Additive
– Multiplicative
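A short numerical check of the translation properties, assuming "additive" means adding a constant to every value and "multiplicative" means multiplying every value by a constant. Under that reading, adding a constant leaves the standard deviation unchanged, while multiplying by a constant c multiplies it by |c|. The data are hypothetical.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

s = statistics.stdev(data)                             # sample standard deviation
s_shifted = statistics.stdev([x + 10 for x in data])   # additive change
s_scaled = statistics.stdev([3 * x for x in data])     # multiplicative change

print(s, s_shifted, s_scaled)
```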
Descriptive Statistics
Coefficient of Variation
• Relative variation rather than absolute variation
such as standard deviation
• Definition of C.V.
$\mathrm{C.V.} = (s / \bar{x}) \times 100$
• Useful in comparing variation between two
distributions
– Used particularly in comparing laboratory measures to
identify those determinations with more variation
– Also used in QC analyses for comparing observers
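To see why a relative measure is useful, consider two hypothetical laboratory assays with the same standard deviation but very different means. The helper name and the numbers are assumptions for illustration only.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

assay_a = [98.0, 100.0, 102.0]   # hypothetical assay with a high mean
assay_b = [8.0, 10.0, 12.0]      # hypothetical assay with a low mean

cv_a = coefficient_of_variation(assay_a)
cv_b = coefficient_of_variation(assay_b)
print(cv_a, cv_b)  # same absolute spread, very different relative spread
```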
Sampling and distributions
• Population mean and variance are estimated by sampling population
data and drawing inferences from the sample data based in part on
assumptions of how the data are distributed in the population.
• Distributions used in statistical analysis:
Discrete random variables: Binomial, Poisson and Hypergeometric
distributions.
Continuous random variables: Normal distribution, Z distribution.
Eg: The analysis of discrete random variables, such as the position of a
nucleotide on a given sequence may use techniques based on a binomial
distribution and not techniques that assume a normal distribution.
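As a sketch of a binomial calculation, the code below computes the probability of seeing exactly k occurrences of a given nucleotide in n independent positions. The per-position probability of 0.25 (all four nucleotides equally likely) and the window length are assumptions, not values from the slides.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.25   # hypothetical: 10 positions, 25% chance of a given nucleotide

distribution = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k, prob in enumerate(distribution):
    print(k, round(prob, 4))
```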
Hypothesis Testing
• Hypothesis testing deals with the null hypothesis
and the alternative hypothesis.
• The null hypothesis is usually assumed to hold
unless there is enough evidence to reject it.
Eg: In Microarray work, a typical hypothesis is that two
microarrays that have been subjected to the same
spotting and hybridization process will produce
identical gene expression fluorescence results.
The degree to which this hypothesis is true can be
estimated by examining the gene expression scatter
plots created from data gleaned from each microarray
and correlating the values mathematically.
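One common way to correlate paired values mathematically is the Pearson correlation coefficient; the slides do not say which measure is meant, so this choice is an assumption. The intensity values below are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical fluorescence values for the same five genes on two microarrays.
array_1 = [120.0, 340.0, 560.0, 80.0, 910.0]
array_2 = [118.0, 352.0, 549.0, 95.0, 900.0]

r = pearson_r(array_1, array_2)
print(round(r, 4))  # values close to 1 suggest the two arrays agree well
```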
Z score
• A statistic commonly used in alignment searches.
• It is a measure of the distance from the mean,
measured in standard deviation units.
• If each sequence to be aligned is randomized and
an optimal alignment is made, the result is a series
of scores (S) for the alignment of two sequences
with a mean (µ) and standard deviation (σ).
• The Z score
Z = (S - µ) / σ
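The calculation can be sketched as below, assuming the alignment scores of the randomized sequences have already been computed by some alignment program. The scores are made-up numbers, and the use of the sample standard deviation is a choice made for this sketch.

```python
import statistics

def z_score(score, randomized_scores):
    """Distance of a score from the mean of the randomized scores, in standard deviations."""
    mu = statistics.mean(randomized_scores)
    sd = statistics.stdev(randomized_scores)  # sample standard deviation (an assumption)
    return (score - mu) / sd

# Hypothetical scores from aligning shuffled versions of the two sequences.
randomized_scores = [20, 22, 18, 21, 19]
real_score = 36   # hypothetical score of the unshuffled alignment

z = z_score(real_score, randomized_scores)
print(round(z, 2))
```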
Z score
• The advantage of a Z score over a simple
percentage score is that it corrects for
compositional biases in the sequence and accounts
for varying length of sequences.
• Z scores assume a normal distribution, whereas
alignment scores generally do not follow a normal distribution.
• As a result, a higher Z score is taken as the threshold
of significance.
Graphical Methods
Bar Graphs and Histogram
• Histogram: graph of frequencies, a special form of
bar graph
– Can be used to visually compare frequencies
– Easier to assess magnitude of differences rather than
trying to judge numbers
• Frequency polygon - similar to histogram
Summary
• In practice, descriptive statistics play a major role
– Always the first 1-2 tables/figures in a paper
– Statistician needs to know about each variable before
deciding how to analyze to answer research questions
• In any analysis, 90% of the effort goes into setting
up the data
– Descriptive statistics are part of that 90%
Distributions in Bioinformatics
• Binomial distributions are used for spotting
stretches of DNA with unusual nucleotide
sequences and pair-wise sequence comparisons.
• Normal distributions are used for modeling
continuous random variables with applications
such as the statistical significance of pairwise
sequence comparison.
• Multinomial distributions are used for spotting
stretches of DNA with unusual content,
distinguishing tests for introns by composition and
quantifying relative codon frequency.
Software
• Statistical software
– SAS
– SPSS
– Stata
– BMDP
– MINITAB
– Excel??
• Graphical software
– From list above
– Sigmaplot
– Harvard Graphics
– Axum
Case Study - Microarray
• Microarrays offer an efficient method of gathering
data that can be used to determine the expression
patterns of tens of thousands of genes in only a
few hours.
• Microarrays allow researchers to examine the
mRNA from different tissues in normal and
disease states to determine which genes and
environmental conditions lead to disease
Microarray analysis
• Analysis of the fluorescence data includes a check
for microarray-to-microarray variability using a
scatter plot.
• Gene expression levels are measured by
adequately quantifying the fluorescence associated
with each spot.
• The most common method of achieving this is to
rely on simple descriptive statistics such as mean,
mode and median.
Microarray analysis
• The total pixel intensity is the sum of all pixels
corresponding to fluorescence in an area.
• The volume measure is the sum of signal intensity
above background noise for each pixel.
• The role of statistical analysis in reading the intensity
value associated with each spot is to control for
variability. Inter- and intra-microarray
comparisons are used to identify contamination
and other sources of variability.
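The two spot measures can be sketched as follows. The code assumes a single constant background level for the whole spot and reads "above background" as subtracting that level from each pixel that exceeds it; both are simplifying assumptions, and the pixel values are invented.

```python
def total_pixel_intensity(pixels):
    """Sum of all pixel values in the spot area."""
    return sum(pixels)

def volume(pixels, background):
    """Sum of signal above the background level, pixel by pixel."""
    return sum(p - background for p in pixels if p > background)

spot_pixels = [12, 15, 40, 55, 60, 14]   # hypothetical pixel intensities in one spot
background_level = 15                     # hypothetical background noise level

total = total_pixel_intensity(spot_pixels)
vol = volume(spot_pixels, background_level)
print(total, vol)
```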
Microarray analysis
• The mean is the average pixel density over a spot,
corresponding to the average fluorescence
intensity. The advantage of measuring the mean
intensity level is that it decreases the error due to
variance in DNA deposition during microarray
work.
• The mode is the most likely intensity value,
represented by the highest peak in the
fluorescence plot.
• The median is the mid-point in the intensity plot.
Microarray analysis
• A quick check for data validity is to create a
scatter plot of fluorescence data from two
identically treated microarrays.
• (Refer to Fig. 6-4, p. 226, Bioinformatics
Computing book)
• The ideal condition is when gene expressions
measured by the microarrays are identical as
indicated by data on the 45-degree ID line as in
(A).
Microarray analysis
• If the amplitude of gene expression on one
microarray is greater than the other, data fall off
the ID line as in (B) and (C).
• The scatter plot also provides a measure of gene
expression amplitude, in that the greater the
distance from the origin, the greater the expression
amplitude.
• For example, the gene plotted at position (C)
has a greater expression amplitude than the gene at
position (A).
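The two readings of the scatter plot, offset from the ID line and distance from the origin, can be computed per gene. This is a sketch with hypothetical helper names and made-up coordinates; it does not reproduce the figure referenced above.

```python
import math

def offset_from_id_line(x, y):
    """Zero when both microarrays report the same value for a gene."""
    return y - x

def expression_amplitude(x, y):
    """Distance of the plotted point from the origin of the scatter plot."""
    return math.hypot(x, y)

# Hypothetical (microarray 1, microarray 2) expression values for two genes.
low_gene = (1.0, 1.0)    # on the 45-degree ID line, close to the origin
high_gene = (3.0, 4.0)   # off the ID line, farther from the origin

for name, (x, y) in [("low", low_gene), ("high", high_gene)]:
    print(name, offset_from_id_line(x, y), expression_amplitude(x, y))
```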