A day in the life of an IEEE802.11 Station
HY436: Mobile Computing and Wireless Networks
Data sanitization
Tutorial: November 7, 2005
Elias Raftopoulos
Ploumidis Manolis
Prof. Maria Papadopouli
Assistant Professor
Department of Computer Science
University of North Carolina at Chapel Hill
Data Analysis
Discovery of Missing Values
Data treatment
Outliers Detection
Outliers Removal [Optional]
Data Normalization [Optional]
Statistical Analysis
Why Data Preprocessing?
Data in the real world is dirty
incomplete
noisy
inconsistent
No quality data, no quality statistical
processing
Quality decisions must be based on quality data
Data Cleaning Tasks
Handle missing values, due to
Sensor malfunction
Random disturbances
Network protocol behavior [e.g., UDP, which does not retransmit lost packets]
Identify outliers, smooth out noisy data
Recover Missing Values
Linear Interpolation
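As a minimal sketch of linear interpolation for missing values, the function below fills each interior gap along the straight line between its nearest known neighbors. It assumes equally spaced samples with missing entries marked as None; the name interpolate_missing is illustrative, not from the slides. Gaps at the start or end of the series are left as None because there is no second neighbor to interpolate from.

```python
from typing import List, Optional


def interpolate_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior None gaps by linear interpolation between known neighbors."""
    filled = list(values)
    # Indices of the samples that were actually observed.
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap > 1:
            # Constant increment per step along the straight line.
            step = (values[right] - values[left]) / gap
            for k in range(1, gap):
                filled[left + k] = values[left] + k * step
    return filled


samples = [10.0, None, None, 16.0, None, 20.0, None]
print(interpolate_missing(samples))
# [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, None]
```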
Recover Missing Values
Moving Average
A simple moving average is the unweighted mean of
the previous n data points in the time series
A weighted moving average is a weighted mean of
the previous n data points in the time series
A weighted moving average is more responsive to recent
movements than a simple moving average
An exponentially weighted moving average
(EWMA or just EMA) is an exponentially weighted mean
of previous data points
The parameter of an EWMA can be expressed as a proportional
percentage - for example, in a 10% EWMA, each time period is
assigned a weight that is 90% of the weight assigned to the next
(more recent) time period
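The three averages above can be sketched as follows. The function names and the recursive form of the EWMA (seeded with the first sample) are assumptions made for illustration. With alpha = 0.1, each older sample carries 90% of the weight of the sample after it, which matches the 10% EWMA described above.

```python
from typing import Sequence


def simple_ma(series: Sequence[float], n: int) -> float:
    """Unweighted mean of the last n points."""
    window = series[-n:]
    return sum(window) / len(window)


def weighted_ma(series: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of the last len(weights) points.

    Weights are listed oldest to newest; the series is assumed to hold
    at least len(weights) points.
    """
    window = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, window)) / sum(weights)


def ewma(series: Sequence[float], alpha: float) -> float:
    """Exponentially weighted mean, seeded with the first sample."""
    avg = series[0]
    for x in series[1:]:
        # New sample gets weight alpha; all history is scaled by (1 - alpha).
        avg = alpha * x + (1 - alpha) * avg
    return avg


series = [2.0, 4.0, 6.0, 8.0]
print(simple_ma(series, 3))            # 6.0
print(weighted_ma(series, [1, 2, 3]))  # about 6.67, pulled toward the latest value
print(ewma(series, 0.5))               # 6.25
```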
Recover Missing Values
Moving Average (cont’d)
Symmetric Linear Filters
Moving Average
$$ Sm(x_t) = \frac{1}{2q+1} \sum_{r=-q}^{q} x_{t+r} $$
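A minimal sketch of the symmetric moving average: each smoothed value is the mean of the point itself and its q neighbors on either side, that is, 2q + 1 points in total. How to treat the first and last q positions is not specified in the slides; returning None there is an assumption of this sketch.

```python
from typing import List, Optional, Sequence


def symmetric_ma(series: Sequence[float], q: int) -> List[Optional[float]]:
    """Centered moving average over a window of 2q + 1 points."""
    n = len(series)
    out: List[Optional[float]] = [None] * n
    for t in range(q, n - q):
        # Window covers x[t-q] .. x[t+q] inclusive.
        out[t] = sum(series[t - q : t + q + 1]) / (2 * q + 1)
    return out


smoothed = symmetric_ma([1.0, 2.0, 6.0, 4.0, 3.0], q=1)
print(smoothed[1], smoothed[2])  # 3.0 4.0
```

To recover a missing sample at time t with this filter, one option is to average only the neighbors that are present in the window; that variant is not shown here.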
What are outliers in the data?
An outlier is an observation that lies an
abnormal distance from other values in a
random sample from a population
It is left to the analyst (or a consensus process) to
decide what will be considered abnormal
Before abnormal observations can be singled out,
it is necessary to characterize normal observations
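Since the definition of "abnormal" is left to the analyst, the sketch below uses one common convention that does not come from the slides: the interquartile-range (IQR) rule. Normal observations are characterized by the quartiles, and anything more than k times the IQR beyond them is flagged. The function name and the default k = 1.5 are assumptions.

```python
import statistics
from typing import List, Sequence


def iqr_outliers(data: Sequence[float], k: float = 1.5) -> List[float]:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))  # [95]
```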
Outliers
An outlier is a data point that comes from a
distribution different (in location, scale, or
distributional form) from the bulk of the data
In the real world, outliers have a range of causes, including
operator blunders
equipment failures
day-to-day effects
batch-to-batch differences
anomalous input conditions
warm-up effects
Scatter Plot: Outlier
Scatter plot here reveals
A basic linear relationship between X and Y for most of the data
A single outlier (at X = 375)
Symmetric Histogram with Outlier
A symmetric distribution is one in which the 2 "halves" of the
histogram appear as mirror-images of one another.
The above example is symmetric with the exception of outlying
data near Y = 4.5
Normalization
Normalization is a process of scaling the numbers in a data set
to improve the accuracy of the subsequent numeric
computations
Many statistical tests and intervals are based on the assumption
of normality
This leads to tests that are simple, mathematically tractable, and
powerful compared to tests that do not make the normality
assumption
Many real data sets are in fact not approximately normal
An appropriate transformation of a data set can often yield a
data set that does follow approximately a normal distribution
This increases the applicability and usefulness of statistical
techniques based on the normality assumption.
Box-Cox Transformation
The Box-Cox transformation is a particularly
useful family of transformations
$$ y(t) = \begin{cases} (x_t^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \log x_t, & \lambda = 0 \end{cases} $$
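A minimal sketch of the Box-Cox transformation: each value x is mapped to (x^λ - 1)/λ when λ is nonzero and to log x when λ is zero. The function name is illustrative. The input check reflects the fact that the transformation is only defined for strictly positive data.

```python
import math
from typing import List, Sequence


def box_cox(xs: Sequence[float], lam: float) -> List[float]:
    """Apply the Box-Cox transformation with parameter lam to each value."""
    if any(x <= 0 for x in xs):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


print(box_cox([1.0, 4.0, 9.0], 0.5))  # square-root-like scaling
print(box_cox([1.0, 10.0, 100.0], 0))  # natural logarithm
```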
Measuring Normality
Given a particular transformation such as the Box-Cox
transformation defined above, it is helpful to define a
measure of the normality of the resulting transformation
One measure is to compute the correlation coefficient of
a normal probability plot
The correlation is computed between the vertical and horizontal
axis variables of the probability plot and is a convenient measure
of the linearity of the probability plot (the more linear the
probability plot, the better a normal distribution fits the data).
The Box-Cox normality plot is a plot of these correlation
coefficients for various values of the parameter λ. The value
of λ corresponding to the maximum correlation on the
plot is then the optimal choice for λ
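The selection procedure above can be sketched as a grid search. For each candidate λ the data are transformed, the sorted values are paired with standard normal quantiles, and the Pearson correlation of the pairs is recorded; the λ with the highest correlation wins. The plotting positions (i + 0.5)/n, the candidate grid, and all function names are assumptions of this sketch, and other plotting-position conventions exist.

```python
import math
from statistics import NormalDist
from typing import List, Sequence


def box_cox(xs: Sequence[float], lam: float) -> List[float]:
    """Box-Cox transform; requires strictly positive data."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def normal_ppcc(xs: Sequence[float]) -> float:
    """Correlation between sorted data and standard normal quantiles."""
    n = len(xs)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(theoretical, sorted(xs))


def best_lambda(xs: Sequence[float], lambdas: Sequence[float]) -> float:
    """Return the candidate lambda with the highest normality correlation."""
    return max(lambdas, key=lambda lam: normal_ppcc(box_cox(xs, lam)))


# Synthetic right-skewed sample: exponentials of evenly spread normal quantiles.
sample = [math.exp(NormalDist().inv_cdf((i + 0.5) / 20)) for i in range(20)]
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(best_lambda(sample, grid))  # 0.0, i.e. the log transform
```

The synthetic sample is lognormal by construction, so the log transform (λ = 0) straightens its probability plot almost perfectly; real data will generally give a less clear-cut maximum.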
Measuring Normality (cont’d)
The histogram in the upper
left-hand corner shows a
data set that has significant
right skewness
And so does not follow a
normal distribution
The Box-Cox normality plot
shows that the maximum
value of the correlation
coefficient is at λ = -0.3
The histogram of the data
after applying the Box-Cox
transformation with λ = -0.3
shows a data set for which
the normality assumption is
reasonable
This is verified with a normal
probability plot of the
transformed data.
Normal Probability Plot
The normal probability plot is a graphical technique
for assessing whether or not a data set is
approximately normally distributed
The data are plotted against a theoretical normal
distribution in such a way that the points should form
an approximate straight line. Departures from this
straight line indicate departures from normality
The normal probability plot is a special case of the
probability plot
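As a rough sketch, the coordinates of a normal probability plot can be computed by pairing the sorted data with the quantiles of a standard normal distribution. The plotting positions (i + 0.5)/n and the function name are assumptions; statistical packages use slightly different position formulas. Plotting the returned pairs and checking them against a straight line gives the visual test described above.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple


def normal_probability_points(data: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (theoretical normal quantile, ordered observation) pairs."""
    n = len(data)
    std_normal = NormalDist()
    ordered = sorted(data)
    return [
        (std_normal.inv_cdf((i + 0.5) / n), x)
        for i, x in enumerate(ordered)
    ]


for quantile, value in normal_probability_points([3.1, 2.4, 2.9, 3.6, 2.7]):
    print(f"{quantile:+.3f}  {value}")
```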
Normal Probability Plot (cont’d)
CDF Plot
Plot of empirical cumulative distribution
function
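A minimal sketch of the empirical cumulative distribution function: F(x) is the fraction of observations less than or equal to x. The helper name make_ecdf is illustrative. Evaluating the returned function at each sorted observation gives the step heights for a CDF plot.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def make_ecdf(data: Sequence[float]) -> Callable[[float], float]:
    """Build F(x) = (number of observations <= x) / n."""
    ordered = sorted(data)
    n = len(ordered)

    def ecdf(x: float) -> float:
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return ecdf


F = make_ecdf([3, 1, 2, 2])
print(F(0), F(1), F(2), F(3))  # 0.0 0.25 0.75 1.0
```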