Chapter 2 Describing Data

The central tendency is the center of the distribution of a data set. You can think of this value as where the middle of a distribution lies.
Measures of central tendency: Numbers that describe what is average or typical of the distribution: mean, median, mode.
Mean: The sum of all the data values divided by the number of values.
Median: The middle number when the data are arranged in order (or the mean of the two middle numbers when the number of values is even).
Mode: The value that occurs most frequently in the data.
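
As a minimal sketch of these three definitions, the snippet below applies Python's standard-library `statistics` module to the twelve-value data set used in this chapter's quartile example. The variable names are illustrative only.

```python
import statistics

# Data set from the quartile example in this chapter
data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]

mean = statistics.mean(data)      # sum of the values divided by the number of values
median = statistics.median(data)  # middle value; mean of the two middle values for an even count
mode = statistics.mode(data)      # value that occurs most frequently

print("mean:", mean, "median:", median, "mode:", mode)
```

For this data set the three measures differ noticeably, which is a reminder that "typical" depends on which measure you choose.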





Data: 4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11
1. Order your data (put the values in numerical order).
3, 4, 4, 4, 7, 10, 11, 12, 14, 16, 17, 18
2. Find the median of your data. The median divides the data into two halves.
Median: (10+11)/2 = 10.5
3. To divide the data into quarters, find the medians of these two halves.
Lower half: 3, 4, 4, 4, 7, 10   Median: 4
Upper half: 11, 12, 14, 16, 17, 18   Median: 15
4. Now you have three points. These three points divide the entire data set into quarters and are called "quartiles."
◦ Quartile 1 (Q1) = (4+4)/2 = 4
◦ Quartile 2 (Q2) = (10+11)/2 = 10.5
◦ Quartile 3 (Q3) = (14+16)/2 = 15

Once you have these three points, Q1, Q2, and Q3, together with the minimum and maximum values, you have all you need to draw a simple box-and-whisker plot.
http://www.mathsisfun.com/data/quartiles.html
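
The median-of-halves steps above can be sketched in Python as follows. The function name `quartiles` is hypothetical, and the handling of an odd number of values (leaving the middle value out of both halves) is an assumed convention, since the example here only covers an even count. The sketch also assumes at least two data values.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)          # step 1: order the data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # lower half of the ordered data
    upper = ordered[half + n % 2:]    # upper half; skips the middle value when n is odd (assumption)
    return (statistics.median(lower),    # Q1: median of the lower half
            statistics.median(ordered),  # Q2: median of all the data
            statistics.median(upper))    # Q3: median of the upper half

data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]
q1, q2, q3 = quartiles(data)
print("Q1:", q1, "Q2:", q2, "Q3:", q3)
```

Other quartile conventions exist and can give slightly different values, so results from software packages may not match this method exactly.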
Percentile rank is calculated by taking the number of data points with values less than the value we want, and dividing that count by the total number of data points.
Example: 34.05 + 2(14.68) = 63.41
Notice that all the data values in the bins up to 60 are less than 63.41.
Adding the frequencies of the bins up to 60 gives 37.
37 out of 40 (total) is 92.5%, so 63.41 is approximately the 93rd percentile.
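
The calculation above can be checked with a short Python sketch. The function `percentile_rank` is a hypothetical helper that applies the definition to raw values. Because the 40 values behind the histogram are not listed here, the function is demonstrated on the twelve-value quartile data set from this chapter instead.

```python
def percentile_rank(values, x):
    """Percentage of data points whose value is strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Arithmetic quoted in the example above
threshold = 34.05 + 2 * 14.68       # the value whose percentile rank we want
rank_from_counts = 100 * 37 / 40    # 37 of the 40 values fall below the threshold

# The same definition applied to raw values (quartile data set from this chapter)
data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]
rank_of_12 = percentile_rank(data, 12)   # 7 of the 12 values are below 12

print(round(threshold, 2), rank_from_counts, round(rank_of_12, 1))
```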


Deviations measure the signed differences between the data values and the mean.
The variance is a measure of variability that is equal to the sum of the squares of the deviations divided by one less than the number of values.
Example: Semester assignment scores
Oscar’s mean: 84
Connie’s mean: 84
These are Connie’s and Oscar’s scores and their deviations from
the mean score for each student.
How can we combine the deviations into a single value that
reflects the spread in a data set?
Should we find the sum of the deviations?
Let’s try that...
Of course, they cancel out: the sum is zero!
So we need to eliminate the effect of the different signs!
Any ideas?
When you sum the squares of the deviations, the sum is no longer zero!
The sum of the squares of the deviations, divided by one less than the number of values, is called the variance of the data.
The square root of the variance is called the standard deviation of the data.
The standard deviation provides one way to judge the “average difference” between data values and the mean. It is a measure of how the data are spread around the mean.
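
The chain from deviations to variance to standard deviation can be sketched in Python. Connie's and Oscar's individual scores are not reproduced here, so this sketch uses the twelve-value quartile data set from this chapter instead. The last lines compare the hand computation with the standard-library `statistics.variance` and `statistics.stdev`, which also divide by one less than the number of values.

```python
import math
import statistics

data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]

mean = sum(data) / len(data)
deviations = [x - mean for x in data]              # signed differences from the mean

sum_of_deviations = sum(deviations)                # positives and negatives cancel
sum_of_squares = sum(d ** 2 for d in deviations)   # squaring removes the signs

variance = sum_of_squares / (len(data) - 1)        # divide by one less than the number of values
std_dev = math.sqrt(variance)                      # square root of the variance

print("sum of deviations:", sum_of_deviations)
print("variance:", variance, "library:", statistics.variance(data))
print("standard deviation:", std_dev, "library:", statistics.stdev(data))
```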
A histogram is a graphical representation of a data set, with
columns to show how the data are distributed across different
intervals of values.
The columns of a histogram are called bins and should not be confused with the bars of a bar graph.
The bars of a bar graph indicate categories: how many data items either have the same value or share a characteristic (such as eye color).
The bins of a histogram indicate how many numerical data values fall within a certain interval.
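
To make the idea of bins concrete, here is a minimal Python sketch that counts how many values fall in each interval. The helper `bin_counts`, its `start` parameter, and the choice of half-open intervals (left edge included, right edge excluded) are assumptions made for illustration. The data are the twelve values from this chapter's quartile example.

```python
from collections import Counter

def bin_counts(values, bin_width, start=0):
    """Count the values in each interval [left, left + bin_width)."""
    counts = Counter()
    for v in values:
        # Left edge of the bin that contains v
        left = start + ((v - start) // bin_width) * bin_width
        counts[left] += 1
    return dict(sorted(counts.items()))

data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]

wide = bin_counts(data, bin_width=5)     # keys are the left edges 0, 5, 10, 15
narrow = bin_counts(data, bin_width=1)   # one bin per distinct integer value

print("width 5:", wide)
print("width 1:", narrow)
```

Whatever the bin width, the bin frequencies add up to the number of data values, and the narrower bins here have shorter columns than the wider ones.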
Symmetric: The median (Q2) lies in the middle of the first and third quartiles. The minimum and maximum do not have to be equally far away from the median.
Skewed right: The median (Q2) is closer to the first quartile. The mean is typically greater than the median.
Skewed left: The median (Q2) is closer to the third quartile. The mean is typically less than the median.
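
The relationship between mean and median in skewed data can be illustrated with a small Python sketch. Both lists below are hypothetical values invented for illustration (they do not come from this chapter). The second list is a mirror image of the first.

```python
import statistics

# Hypothetical values chosen only for illustration
skewed_right = [1, 1, 2, 2, 3, 5, 8, 12]       # long tail toward larger values
skewed_left = [1, 5, 8, 10, 11, 11, 12, 12]    # long tail toward smaller values

right_mean = statistics.mean(skewed_right)
right_median = statistics.median(skewed_right)
left_mean = statistics.mean(skewed_left)
left_median = statistics.median(skewed_left)

print("skewed right: mean", right_mean, "median", right_median)
print("skewed left:  mean", left_mean, "median", left_median)
```

In these two lists the tail pulls the mean toward it, while the median stays near the bulk of the data.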
Shatevia took a random sample of 50 students who own MP3 players at her high school and asked how many songs they have stored. The two graphs were constructed from the data in the table.
a. What is the range of the data?
The number of songs goes from a low of 765 songs to a high
of 1013 songs. The range is 248 songs.
b. What is the bin width of each graph?
The bin width of Graph A is 50 songs, and the bin width of
Graph B is 10 songs.
c. How can you know if the graph accounts for all 50 values?
The sum of all the bin frequencies is 50 for each of the graphs.
d. Why are the columns shorter in Graph B?
Each bin in Graph A holds the values of up to five bins from Graph B.
With smaller bin widths you will usually have shorter columns.
e. Which graph is better at showing the overall shape of the
distribution? What is that shape?
Graph A shows that the distribution is skewed left. This fact is
harder to see with all the ups and downs in Graph B.
f. Which graph is better at showing the gaps and clusters in the
data?
With more bins you can see gaps and clusters in the data. A dot
plot is like a histogram with a very small bin width. Graph B is
the better graph for seeing gaps and clusters.
g. What percentage of the players have fewer than 850 songs
stored?
Add the bin frequencies for the bins below (to the left of) 850
songs. There are 10 such data values, so 10 out of 50, or 20% of
the sample, had fewer than 850 songs.
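
The arithmetic behind answers (a), (d), and (g) can be checked with a few lines of Python. Only the figures quoted in the answers are used, and the variable names are illustrative.

```python
# Figures quoted in the answers above
low, high = 765, 1013
data_range = high - low                  # part (a): range of the data

width_a, width_b = 50, 10
bins_b_per_bin_a = width_a // width_b    # part (d): Graph B bins covered by one Graph A bin

below_850, sample_size = 10, 50
percent_below_850 = 100 * below_850 / sample_size   # part (g)

print(data_range, bins_b_per_bin_a, percent_below_850)
```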