1) Descriptive statistics


EXISTING STATISTICS
QUANTITATIVE DATA ANALYSIS
BUSN 364 – Week 13
Özge Can
Existing Statistics/ Documents



Many types of data are already available in the
form of statistical documents (books, reports, etc.)
=> Secondary Data
Researchers first find out what data are available
Then they decide how it can be used to address their
own research question
Existing Statistics/ Documents
Existing statistics research is most appropriate:


For testing hypotheses that involve variables in
official reports of social, economic, and political
conditions;
When you are comparing over time and across nations
(large-scale, widespread data)
Locating Data

Main providers of existing statistics:
- Governments
- International agencies
- Industry/sector bodies
- NGOs, universities and research institutions
- Other private sources

Most existing documents are “free” (available to the
public), but time and effort are required to search for
specific information in them
Locating Data: Some Key Sources






Turkish government statistics (TUIK): www.tuik.gov.tr
UN statistics: http://data.un.org/
OECD statistics portal: http://www.oecd.org/statsportal/
World Bank data: http://data.worldbank.org/
EU statistics (Eurostat): http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/
US Census Bureau - Statistical Abstracts: http://www.census.gov/compendia/statab/
Example: Financial Databases




Compustat (www.compustat.com) => annual income
statements, balance sheets, cash flow and other data
items from North American companies
Datastream (www.datastream.com) => one of the
largest financial statistical databases
The Wharton Research Data Service
(www.whartonwrds.com) => databases in the field of
finance, accounting, banking, economics, management
The Center for Research in Security Prices (CRSP,
www.crsp.com) => security prices, returns, volume data
from stock markets
Research Example Using Existing Statistics:



An androgynous first name is one that can be given to
either a girl or a boy without clearly marking the
child’s gender (= unisex name)
What is the extent of gender segregation in
naming?
Why do parents name their children in certain
ways?
Research Example Using Existing Statistics:



Lieberson et al. (2000) examined existing statistical
data in the form of computerized records from the
birth certificates of 11 million births in Illinois, US
from 1916 to 1989.
They found that androgynous first names are
rare (increasing only in very recent years)
Parents give such names to girls more often than to
boys
Limitations of Using Secondary Data

Existing data may not be appropriate for your
research question. You need to consider:
- The units of analysis; the time and place of
data collection; the sampling method used

You must understand the topic the data cover so that you
don’t make false assumptions or interpretations
Limitations of Using Secondary Data

There are also problems regarding variable
attributes:
- Validity problems: Your theoretical definition does
not match that of the organization that collected the
information
- Reliability problems: Variable definitions or the
method of collecting data change over time
- Missing data problems: Government agencies start
or stop collecting data for political, budgetary or
other reasons, so the data may not be complete
Analysis of Quantitative Data

Dealing with data:
- Coding data
- Entering data
- Cleaning data
- Analyzing data
Coding Data




Coding: Systematically reorganizing raw data into
a format that is easy to analyze using statistics
Coding procedure => a set of rules stating that you
will assign certain numbers to variable attributes
Codebook => a document (one or more pages)
describing the coding procedure
Both should be prepared before collecting the data
Codebook Example:
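A minimal sketch of what a codebook can look like when written down as data. The variables, attribute labels, numeric codes and the missing-data code 9 are all hypothetical, chosen only to illustrate the idea of a coding procedure:

```python
# Hypothetical codebook: variable name -> {attribute label: numeric code}.
# None of these variables or codes come from the lecture; they are illustrative.
CODEBOOK = {
    "gender": {"female": 1, "male": 2},
    "agreement": {
        "strongly disagree": 1,
        "disagree": 2,
        "neutral": 3,
        "agree": 4,
        "strongly agree": 5,
    },
}
MISSING = 9  # assumed code for missing or unusable answers


def code_response(raw):
    """Apply the coding procedure to one raw questionnaire response."""
    coded = {}
    for variable, codes in CODEBOOK.items():
        answer = str(raw.get(variable, "")).strip().lower()
        coded[variable] = codes.get(answer, MISSING)
    return coded


print(code_response({"gender": "Female", "agreement": "Agree"}))
print(code_response({"gender": "male"}))  # no answer given for "agreement"
```

Writing the rules down once, before data collection, means every response is coded the same way.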
Entering and Cleaning Data



Most computer programs designed for numerical
data analysis require that the data be in a grid
format (rows and columns)
Accuracy is very important: the errors you make
when coding and entering data threaten the validity
of the measures and results
Carefully check your coding and how you enter the
data
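One common way to check entered data is to look for codes that the coding procedure does not allow. The sketch below assumes a small grid of hypothetical cases and a hypothetical set of valid codes per variable; it only flags suspicious cells and does not correct them:

```python
# Hypothetical valid codes per variable (9 = missing); illustrative only.
ALLOWED = {"gender": {1, 2, 9}, "agreement": {1, 2, 3, 4, 5, 9}}

# Data in grid format: one row per case, one column per variable.
rows = [
    {"id": 1, "gender": 1, "agreement": 4},
    {"id": 2, "gender": 2, "agreement": 7},  # 7 is not a valid code
    {"id": 3, "gender": 3, "agreement": 2},  # 3 is not a valid code
]


def find_invalid_codes(rows, allowed):
    """Return (case id, variable, value) for every cell holding an invalid code."""
    problems = []
    for row in rows:
        for variable, valid in allowed.items():
            if row[variable] not in valid:
                problems.append((row["id"], variable, row[variable]))
    return problems


for case_id, variable, value in find_invalid_codes(rows, ALLOWED):
    print(f"case {case_id}: {variable} = {value} is not in the codebook")
```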
Two Types of Quantitative Analysis:
1) Descriptive statistics => for summarizing
and describing data
2) Inferential statistics => for drawing
conclusions from data
1) Descriptive Statistics

Frequency Distributions: Summarize the data as the
frequencies/percentages of cases in different
categories

Common types of graphical representations:
Bar chart, pie chart, histogram
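A frequency distribution can be sketched in a few lines. The ten responses below are invented for illustration, and the row of `#` characters is only a rough text stand-in for a bar chart:

```python
from collections import Counter

# Hypothetical answers to one survey question (illustrative data).
responses = [
    "agree", "neutral", "agree", "disagree", "agree",
    "neutral", "agree", "disagree", "agree", "neutral",
]

counts = Counter(responses)
total = len(responses)

# category -> (frequency, percentage), most frequent category first
table = {cat: (n, 100 * n / total) for cat, n in counts.most_common()}

for cat, (n, pct) in table.items():
    print(f"{cat:<10}{n:>3}{pct:>7.1f}%  {'#' * n}")
```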
Graphical Representations of Frequency
Distributions:
1) Descriptive Statistics

Measures of Central Tendency: Statistical
measures that summarize the values/scores of a
variable into a single number



Mode => the most frequent or common score
Median => the middle point: the score at which half
of the cases are above it and half below it
Mean (arithmetic average) => the sum of all scores
divided by the total number of them. Most widely
used measure of central tendency
1) Descriptive Statistics



Measures of Central Tendency
If the frequency distribution of the data forms a
normal distribution or bell-shaped curve, the three
measures of central tendency are equal to each other
If it is a skewed distribution (more cases in the
upper or lower scores), then the three will not be
equal
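The contrast can be illustrated with two small made-up data sets. The first is symmetric around its middle value (a rough stand-in for a bell-shaped distribution); the second has a long tail of high scores:

```python
import statistics

# Illustrative data, not from the lecture.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14]  # a few high scores pull the mean up


def central_tendency(scores):
    """Return (mode, median, mean) for a list of scores."""
    return (
        statistics.mode(scores),
        statistics.median(scores),
        statistics.mean(scores),
    )


print(central_tendency(symmetric))  # mode, median and mean are all 3
print(central_tendency(skewed))     # mode 1, median 2, mean 4
```

In the skewed data the mean is pulled toward the tail, while the median and mode stay near the bulk of the cases.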
Measures of Central Tendency
1) Descriptive Statistics




Measures of Variation (Spread): The dispersion
or distribution of the data around the mean
Range => the distance between the highest and
lowest scores
Percentile => the percentage of cases at or below a
given score or point
Standard deviation => roughly, the average distance
between each score and the mean
Standard Deviation:
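A minimal sketch of the three measures of spread, using invented scores. It assumes the population form of the standard deviation (square root of the mean squared distance from the mean, dividing by N); the sample form divides by N - 1 instead and is available as `statistics.stdev`:

```python
import math
import statistics

# Illustrative scores, not from the lecture.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: distance between the highest and lowest scores
score_range = max(scores) - min(scores)


def percentile_rank(scores, point):
    """Percentage of cases at or below the given point."""
    at_or_below = sum(1 for s in scores if s <= point)
    return 100 * at_or_below / len(scores)


# Population standard deviation, step by step
mean = sum(scores) / len(scores)
variance = sum((s - mean) ** 2 for s in scores) / len(scores)
std_dev = math.sqrt(variance)

print("range:", score_range)                              # 7
print("percentile rank of 5:", percentile_rank(scores, 5))  # 75.0
print("mean:", mean, "standard deviation:", std_dev)      # 5.0 and 2.0
```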
Exercise:
What are the mode, median, mean and range for the
below data?
13, 13, 13, 13, 14, 14, 16, 18, 21
Mode: 13
Median: 14
Mean: 15
Range: 8
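The exercise can be worked through by hand or checked with a short script. The sketch below spells out each step instead of calling a statistics library, so the arithmetic stays visible:

```python
from collections import Counter

data = [13, 13, 13, 13, 14, 14, 16, 18, 21]

ordered = sorted(data)
n = len(ordered)

# Mode: the score that occurs most often (13 appears four times)
mode = Counter(ordered).most_common(1)[0][0]

# Median: the middle score of the ordered list (the 5th of 9 values)
if n % 2:
    median = ordered[n // 2]
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Mean: sum of all scores divided by their number (135 / 9)
mean = sum(ordered) / n

# Range: highest minus lowest score (21 - 13)
value_range = ordered[-1] - ordered[0]

print(mode, median, mean, value_range)
```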
Statistical Relationships


An expression of whether there is an association
between two variables
Covariation => whether the variables tend to vary
together (or are independent)
Statistical Relationships




To indicate statistical relationships:
Scattergram => a diagram displaying the relationship
between two variables
Contingency table => a summary format for two or
more variables by showing the percentage or number
of cases at the intersection of variable categories
Measure of association => a single number that
expresses the strength of a relationship. There are
many of them (e.g. chi-square, rho, lambda, correlation
coefficient)
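As one example of a measure of association, the Pearson correlation coefficient can be computed directly from its definition. The two variables and their values below are hypothetical:

```python
import math

# Hypothetical paired observations (illustrative only), e.g.
# x = hours of training, y = performance rating.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def pearson_r(x, y):
    """Pearson correlation: covariation of x and y scaled by their spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


r = pearson_r(x, y)
print(f"r = {r:.2f}")  # about 0.77: a fairly strong positive association
```

The coefficient runs from -1 to +1; values near 0 indicate little or no linear association.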
A Scattergram:
A Contingency Table:
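A contingency table can be sketched by counting the cases at each intersection of two variables' categories. The 20 firms below, and their classification by type and innovativeness, are invented for illustration:

```python
from collections import Counter

# Hypothetical cases: (firm type, innovative?) for 20 firms.
cases = (
    [("start-up", "yes")] * 6 + [("start-up", "no")] * 4
    + [("large firm", "yes")] * 3 + [("large firm", "no")] * 7
)

cell_counts = Counter(cases)
row_totals = Counter(firm_type for firm_type, _ in cases)


def row_percentage(firm_type, answer):
    """Percentage of firms of this type that gave this answer."""
    return 100 * cell_counts[(firm_type, answer)] / row_totals[firm_type]


print(f"{'':<12}{'yes':>12}{'no':>12}")
for firm_type in ("start-up", "large firm"):
    cells = [
        f"{cell_counts[(firm_type, a)]} ({row_percentage(firm_type, a):.0f}%)"
        for a in ("yes", "no")
    ]
    print(f"{firm_type:<12}{cells[0]:>12}{cells[1]:>12}")
```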
Statistical Control



For causal relationships, temporal order and
association are not enough
We must eliminate alternative explanations that
can make the hypothesized relationship spurious
In non-experimental research, we can statistically
control for alternative explanations by adding
control variables
Statistical Control: Example




“The relationship between height and liking of
basketball”
Control variable: Gender
If it has no effect => Both tall males and tall females
like basketball more than short males and short females
(suspected alternative explanation has no effect)
If it has an effect => Tall males are no more likely than
short males to like basketball; and tall females are no
more likely to like it than short females (gender, not
height, is the true explanation)
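The second case can be made concrete with invented counts. In the hypothetical data below, tall people appear to like basketball more overall, but once gender is held constant the difference between tall and short people disappears:

```python
# Hypothetical counts: (gender, height, likes basketball?, number of cases).
# The numbers are constructed so that the height effect is spurious.
cells = [
    ("male", "tall", True, 56), ("male", "tall", False, 24),
    ("male", "short", True, 14), ("male", "short", False, 6),
    ("female", "tall", True, 6), ("female", "tall", False, 14),
    ("female", "short", True, 24), ("female", "short", False, 56),
]


def pct_liking(height, gender=None):
    """Percent who like basketball at a given height, optionally within one gender."""
    likes = total = 0
    for g, h, liked, n in cells:
        if h == height and (gender is None or g == gender):
            total += n
            if liked:
                likes += n
    return 100 * likes / total


# Without the control variable: height seems to matter (62% vs 38%)
print("all:   ", pct_liking("tall"), pct_liking("short"))

# With gender controlled: no height difference inside either group
print("male:  ", pct_liking("tall", "male"), pct_liking("short", "male"))
print("female:", pct_liking("tall", "female"), pct_liking("short", "female"))
```

The overall gap arises only because most tall cases are male and males like basketball more in this made-up data.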
Multiple Regression



A very popular statistical technique used with interval
or ratio-level data
Its great advantage is its ability to adjust for several
control variables simultaneously
Its results tell two things:
- The overall predictive power of the set of independent
and control variables on the dependent variable (R-squared)
- The direction and size of the effect of each variable on
the dependent variable
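A minimal sketch of both outputs, assuming ordinary least squares solved through the normal equations in plain Python (statistical packages do this for you). The data are invented, and the outcome is constructed as exactly 1 + 2*x1 + 3*x2 so the fit should recover those coefficients with an R-squared of 1; real data would include noise and give a lower R-squared:

```python
def fit_ols(rows, y):
    """Least-squares fit; returns ([intercept, b1, b2, ...], R-squared)."""
    X = [[1.0] + list(r) for r in rows]  # prepend a column of 1s for the intercept
    k = len(X[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    # Solve by Gauss-Jordan elimination with partial pivoting
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    coefs = [aug[i][k] / aug[i][i] for i in range(k)]
    # R-squared: share of the variation in y accounted for by the predictors
    predicted = [sum(c * v for c, v in zip(coefs, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coefs, 1 - ss_res / ss_tot


# Hypothetical data: (x1, x2) per case, with outcome = 1 + 2*x1 + 3*x2
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5), (6, 2)]
outcome = [9, 8, 19, 18, 26, 19]

coefs, r_squared = fit_ols(predictors, outcome)
print("intercept and effects:", [round(c, 3) for c in coefs])
print("R-squared:", round(r_squared, 3))
```

Each coefficient's sign gives the direction of that variable's effect and its magnitude gives the size, holding the other variables constant.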
2) Inferential Statistics



Build on probability theory to test hypotheses
formally; permit inferences from a sample to a
population
Statistical significance => the probability of finding
a relationship in the sample when there is none in the
population (tells us what is likely)
Levels of significance => expression of statistical
significance in terms of levels (e.g. the results are
significant at the .05 level)
2) Inferential Statistics

There are two objectives:
I. To find out differences among groups
- Are small start-up businesses more innovative than
large bureaucratic ones?
- Are there more men than women on the boards of
companies?
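The slides do not name a specific test for group differences. As one illustration, the sketch below uses an exact permutation test, chosen here only because it shows the logic of significance directly: if group membership made no difference, how often would a random split of the same scores produce a gap at least as large as the observed one? The innovation scores are invented:

```python
from itertools import combinations

# Hypothetical innovation scores (0-10 scale) for eight firms.
startups = [8, 9, 7, 10]
large_firms = [4, 5, 6, 3]


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b):
    """Share of all possible regroupings with a mean gap >= the observed gap."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # small tolerance guards against floating-point rounding
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


p = permutation_p_value(startups, large_firms)
print(f"p = {p:.4f}")
print("significant at the .05 level" if p < 0.05 else "not significant at the .05 level")
```

With these made-up scores only 2 of the 70 possible regroupings are as extreme as the observed one, so the difference would count as significant at the .05 level.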
2) Inferential Statistics

There are two objectives:
II. To find out relationships
- Is there a link between sunny weather and how
people feel at work?
- Does a company’s greater use of social media (such as
Facebook and Twitter) improve its reputation
with customers?