A day in the life of an IEEE802.11 Station
HY539: Mobile Computing and Wireless Networks
Basic Statistics / Data Preprocessing
Tutorial: November 21, 2006
Elias Raftopoulos
Prof. Maria Papadopouli
Assistant Professor
Department of Computer Science
University of North Carolina at Chapel Hill
Basic Terminology
An Element of a sample or population is a specific
subject or object (for example, a person, firm, item, state,
or country) about which information is collected.
A Variable is a characteristic under study that assumes
different values for different elements.
An Observation is the value of a variable for an
element.
Data set is a collection of observations on one or more
variables.
Measures of Central Tendency
Mean: Sum of all values divided by the number of
cases.
Median: The value of the middle term in a data set
that has been ranked in increasing order. 50% of the
data lies below this value and 50% above.
Mode: the value that occurs with the highest
frequency in a data set.
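These three measures can be computed directly; a minimal Python sketch, using a made-up sample of values:

```python
import statistics

# Hypothetical sample (values chosen for illustration only)
data = [5.2, 5.8, 5.8, 6.1, 6.4, 7.0, 9.5]

mean = sum(data) / len(data)      # sum of all values / number of cases
median = statistics.median(data)  # middle term of the ranked data set
mode = statistics.mode(data)      # value with the highest frequency

print(mean, median, mode)
```

Note that half the data lies below the median (6.1), while the large value 9.5 pulls the mean above it.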
Measures of Dispersion
Standard deviation: the square root of the average squared
deviations from the mean
This measures how the data values differ from the mean
A small standard deviation implies most values are near the mean
A large standard deviation indicates that values are widely
spread above and below the mean
Range = Largest value – Smallest value
The range, like the mean, has the disadvantage of being
influenced by outliers
Percentiles
Values that divide cases below which certain percentages of
values fall
Standard Error (SE): Standard deviation of the mean.
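A short sketch of these dispersion measures, reusing a made-up sample:

```python
import math

data = [5.2, 5.8, 5.8, 6.1, 6.4, 7.0, 9.5]  # illustrative values only
n = len(data)
mean = sum(data) / n

# Standard deviation: square root of the average squared deviation
# from the mean (population form)
std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

rng = max(data) - min(data)   # Range = largest value - smallest value
se = std / math.sqrt(n)       # Standard Error: std deviation of the mean

print(std, rng, se)
```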
Data Analysis
Discovery of Missing Values
Data treatment
Outliers Detection
Outliers Removal [Optional]
Data Normalization [Optional]
Statistical Analysis
Why Data Preprocessing?
Data in the real world is dirty
incomplete
noisy
inconsistent
No quality data, no quality statistical
processing
Quality decisions must be based on quality data
Data Cleaning Tasks
Handle missing values, due to
Sensor malfunction
Random disturbances
Network protocol (e.g., UDP)
Identify outliers, smooth out noisy data
Recover Missing Values
Linear Interpolation
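A minimal sketch of gap filling by linear interpolation, assuming equally spaced samples and gaps only in the interior of the series (`None` marks a missing value):

```python
def interpolate_missing(series):
    """Fill None entries by interpolating linearly between the
    nearest known neighbors (interior gaps only)."""
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None:
            lo = i - 1                 # nearest known value to the left
            while filled[lo] is None:
                lo -= 1
            hi = i + 1                 # nearest known value to the right
            while filled[hi] is None:
                hi += 1
            frac = (i - lo) / (hi - lo)
            filled[i] = filled[lo] + frac * (filled[hi] - filled[lo])
    return filled

filled = interpolate_missing([1.0, None, 3.0, None, None, 9.0])
print(filled)
```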
Recover Missing Values
Moving Average
A simple moving average is the unweighted mean of
the previous n data points in the time series
A weighted moving average is a weighted mean of
the previous n data points in the time series
A weighted moving average is more responsive to recent
movements than a simple moving average
An exponentially weighted moving average
(EWMA or just EMA) is an exponentially weighted mean
of previous data points
The parameter of an EWMA can be expressed as a proportional
percentage - for example, in a 10% EWMA, each time period is
assigned a weight that is 90% of the weight assigned to the next
(more recent) time period
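The EWMA is usually written recursively with a smoothing weight alpha on the newest point; alpha = 0.1 gives the "10% EWMA" mentioned above, since each period's weight is 90% of the next, more recent period's. A sketch with an illustrative series:

```python
def ewma(xs, alpha):
    """Exponentially weighted moving average: out[t] is a mean of all
    points so far, with weights decaying by (1 - alpha) per step back."""
    out = [xs[0]]
    for x in xs[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

# Illustrative series; alpha = 0.1 -> a 10% EWMA
smoothed = ewma([2.0, 4.0, 6.0, 8.0, 10.0], 0.1)
print(smoothed)
```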
Recover Missing Values
Moving Average (cont’d)
Symmetric Linear Filters
Moving Average
Sm(x_t) = (1 / (2q + 1)) · Σ_{r = −q}^{q} x_{t+r}
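The symmetric moving-average filter can be sketched directly (my own illustration; the output is shorter than the input because the window of 2q + 1 points must fit entirely inside the series):

```python
def symmetric_ma(xs, q):
    """Sm(x_t) = (1 / (2q + 1)) * sum of x_{t+r} for r = -q .. q,
    computed for every t where the full window exists."""
    width = 2 * q + 1
    return [sum(xs[t - q:t + q + 1]) / width
            for t in range(q, len(xs) - q)]

print(symmetric_ma([1.0, 2.0, 3.0, 4.0, 5.0], 1))
```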
What are outliers in the data?
An outlier is an observation that lies an
abnormal distance from other values in a
random sample from a population
It is left to the analyst (or a consensus process) to
decide what will be considered abnormal
Before abnormal observations can be singled out,
it is necessary to characterize normal observations
What are outliers in the data? (cont’d)
An outlier is a data point that comes from a
distribution different (in location, scale, or
distributional form) from the bulk of the data
In the real world, outliers have a range of causes,
from as simple as
operator blunders
equipment failures
day-to-day effects
batch-to-batch differences
anomalous input conditions
warm-up effects
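One common (but not the only) way to operationalize "abnormal distance" is a z-score rule; the cut-off k below is the analyst's choice, and the sample values are made up:

```python
import statistics

def zscore_outliers(data, k=3.0):
    """Return points lying more than k standard deviations from the
    mean; k is an analyst-chosen threshold (3 is a common default)."""
    mu = statistics.mean(data)
    sd = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) > k * sd]

sample = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0]
print(zscore_outliers(sample, 2.0))
```

A caveat: the outlier itself inflates the mean and standard deviation used in the test, which is why robust rules (e.g., based on the interquartile range) are often preferred.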
Scatter Plot: Outlier
Scatter plot here reveals
A basic linear relationship between X and Y for most of the data
A single outlier (at X = 375)
Symmetric Histogram with Outlier
A symmetric distribution is one in which the 2 "halves" of the
histogram appear as mirror-images of one another.
The above example is symmetric with the exception of outlying
data near Y = 4.5
Confidence Intervals
In statistical inference we want to estimate
population parameters using observed sample
data.
A confidence interval gives an estimated
range of values which is likely to include an
unknown population parameter
the estimated range calculated from a given set of
sample data
Confidence Intervals
Common choices for the
confidence level C are 0.90, 0.95,
and 0.99
Levels correspond to percentages
of the area of the normal density
curve
Interpretation of Confidence Intervals
e.g. 95% C.I. for µ is (-1.034, 0.857).
Right: In the long run over many random
samples, 95% of the C.I.’s will contain the
true mean µ.
Wrong: There is a 95% probability that the
true mean lies in this interval (-1.034, 0.857).
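A sketch of the usual normal-approximation interval for the mean, mean ± z·SE (z = 1.96 for C = 0.95; the sample values are made up):

```python
import math
import statistics

def normal_ci(data, z=1.96):
    """Approximate CI for the population mean, assuming roughly normal
    data: sample mean +/- z * SE, with SE = s / sqrt(n)."""
    n = len(data)
    mean = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)
    return mean - z * se, mean + z * se

lo, hi = normal_ci([4.8, 5.1, 4.9, 5.3, 5.0, 5.2, 4.7, 5.1])
print(lo, hi)
```

Per the interpretation above: over many repeated samples, about 95% of intervals built this way would cover the true mean.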
Smoothing Data
If your data is noisy, you might need to apply a smoothing
algorithm to expose its features and to provide a reasonable
starting point for parametric fitting
Two basic assumptions that underlie smoothing are
The relationship between the response data and the predictor data is
smooth
The smoothing process results in a smoothed value that is a better
estimate of the original value because the noise has been reduced
The smoothing process attempts to estimate the average of the
distribution of each response value
The estimation is based on a specified number of neighboring
response values
Smoothing Data (cont’d)
Moving average filtering
Lowpass filter that takes the average of
neighboring data points
Lowess and loess
Locally weighted scatter plot smooth
Savitzky-Golay filtering
A generalized moving average where you derive
the filter coefficients by performing an unweighted
linear least squares fit using a polynomial of the
specified degree
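The Savitzky-Golay idea can be sketched with NumPy's least-squares polynomial fit (a simplified illustration, not SciPy's optimized `savgol_filter`): fit a polynomial to each window and keep its value at the window center.

```python
import numpy as np

def savgol_like(xs, window, degree):
    """Fit an unweighted least-squares polynomial of the given degree to
    each odd-length window and evaluate it at the center point."""
    half = window // 2
    out = []
    for t in range(half, len(xs) - half):
        x = np.arange(-half, half + 1)              # window positions
        y = xs[t - half:t + half + 1]               # window values
        coeffs = np.polyfit(x, y, degree)           # least-squares fit
        out.append(float(np.polyval(coeffs, 0.0)))  # value at center
    return out

noisy = [0.0, 1.2, 1.9, 3.1, 4.0, 5.1, 5.9]
print(savgol_like(noisy, 5, 2))
```

A polynomial of low degree preserves local features (peaks, slopes) that a plain moving average would flatten.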
Smoothing Data (e.g., Robust Lowess)
Normalization
Normalization is a process of scaling the numbers in a data set
to improve the accuracy of the subsequent numeric
computations
Most statistical tests and intervals are based on the assumption
of normality
This leads to tests that are simple, mathematically tractable, and
powerful compared to tests that do not make the normality
assumption
Most real data sets are in fact not approximately normal
An appropriate transformation of a data set can often yield a
data set that does follow approximately a normal distribution
This increases the applicability and usefulness of statistical
techniques based on the normality assumption.
Box-Cox Transformation
The Box-Cox transformation is a particularly
useful family of transformations
y(t) = (x_t^λ − 1) / λ   if λ ≠ 0
y(t) = log(x_t)          if λ = 0
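The two branches of the transformation in code (my sketch; x must be positive):

```python
import math

def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1)/lam for lam != 0, log(x) for
    lam == 0 (the lam -> 0 limit of the first branch)."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

print(box_cox(4.0, 0.5))  # (sqrt(4) - 1) / 0.5 = 2.0
print(box_cox(1.0, 0.0))  # log(1) = 0.0
```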
Measuring Normality
Given a particular transformation such as the Box-Cox
transformation defined above, it is helpful to define a
measure of the normality of the resulting transformation
One measure is to compute the correlation coefficient of
a normal probability plot
The correlation is computed between the vertical and horizontal
axis variables of the probability plot and is a convenient measure
of the linearity of the probability plot (the more linear the
probability plot, the better a normal distribution fits the data).
The Box-Cox normality plot is a plot of these correlation
coefficients for various values of the parameter λ. The value
of λ corresponding to the maximum correlation on the
plot is then the optimal choice for λ
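A sketch of that correlation measure using the standard library's NormalDist; the plotting positions use a common (i − 0.375)/(n + 0.25) approximation for the normal order-statistic medians, and the sample values are made up:

```python
from statistics import NormalDist, mean

def pearson(xs, ys):
    # Plain Pearson correlation coefficient
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def probplot_correlation(data):
    """Correlation of the sorted data against theoretical normal
    quantiles: the closer to 1, the more linear the probability plot."""
    n = len(data)
    quantiles = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]
    return pearson(sorted(data), quantiles)

roughly_normal = [-1.2, -0.6, -0.1, 0.0, 0.2, 0.7, 1.3]
print(probplot_correlation(roughly_normal))
```

Computing this for a grid of λ values after the Box-Cox transform reproduces the normality plot described above.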
Measuring Normality (cont’d)
The histogram in the upper
left-hand corner shows a
data set that has significant
right skewness
And so does not follow a
normal distribution
The Box-Cox normality plot
shows that the maximum
value of the correlation
coefficient is at λ = -0.3
The histogram of the data
after applying the Box-Cox
transformation with λ = -0.3
shows a data set for which
the normality assumption is
reasonable
This is verified with a normal
probability plot of the
transformed data.
Normal Probability Plot
The normal probability plot is a graphical technique
for assessing whether or not a data set is
approximately normally distributed
The data are plotted against a theoretical normal
distribution in such a way that the points should form
an approximate straight line. Departures from this
straight line indicate departures from normality
The normal probability plot is a special case of the
probability plot
Normal Probability Plot (cont’d)
PDF Plot
Probability density function
(pdf) represents a probability
distribution in terms of integrals
A probability density function is
non-negative everywhere and its
integral from −∞ to +∞ is equal
to 1
If a probability distribution has
density f(x), then intuitively the
infinitesimal interval [x, x + dx]
has probability f(x) dx
CDF Plot
Plot of empirical cumulative distribution
function
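An empirical CDF can be built directly from the sample: for each x, the fraction of observations ≤ x (illustrative values):

```python
def ecdf(data):
    """Return the empirical cumulative distribution function:
    F(x) = (number of sample values <= x) / n."""
    xs = sorted(data)
    n = len(xs)
    def F(x):
        return sum(1 for v in xs if v <= x) / n
    return F

F = ecdf([1, 2, 2, 3, 5])
print(F(2), F(4), F(5))
```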
Percentiles
Percentiles provide a way of estimating proportions of
the data that should fall above and below a given
value
The pth percentile is a value, Y(p), such that at most
(100p)% of the measurements are less than this
value and at most 100(1- p)% are greater
The 50th percentile is called the median
Percentiles split a set of ordered data into hundredths
For example, 70% of the data should fall below the 70th
percentile.
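A simple nearest-rank sketch of Y(p), one of several common conventions (p given as a fraction; sample values made up):

```python
import math

def percentile(data, p):
    """Nearest-rank Y(p): smallest sample value such that at least
    100p% of the measurements are <= it (0 < p <= 1)."""
    xs = sorted(data)
    k = max(1, math.ceil(p * len(xs)))
    return xs[k - 1]

scores = [12, 5, 7, 9, 15, 11, 8, 10, 6, 14]
print(percentile(scores, 0.5), percentile(scores, 0.7))
```

Under this rule at least 70% of the ten values (i.e., 7 of them) fall at or below percentile(scores, 0.7).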
Interquartile Range
Quartiles are three summary measures that divide a ranked data
set into four equal parts
Second quartile is the same as the median of a data set
First quartile is the value of the middle term among the
observations that are less than the median
Third quartile is the value of the middle term among the
observations that are greater than the median
Interquartile range:
The difference between the third and first quartiles, i.e., the
spread of the middle 50% of the data
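The quartiles and IQR via the standard library, on an illustrative sample (`method="inclusive"` treats the quartile positions as interpolated ranks within the data):

```python
import statistics

data = [3, 5, 7, 8, 12, 13, 14, 18, 21]

# statistics.quantiles with n=4 returns the three quartiles Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1   # spread of the middle 50% of the ranked data

print(q1, q2, q3, iqr)
```

As stated above, the second quartile equals the median of the data set.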