Transcript Week5

Basic statistics for corpus linguistics
05-06.10.2016
Types of studies in CL
• What kind of x in y? Which adjectives are used to describe politicians in
different newspapers and magazines?
• Are a and b different? Do the frequencies of negative and positive
adjectives differ across newspapers and magazines?
• Is there a correlation between x and y? Is it true that the more right-wing
the magazine is, the higher the frequency of negative adjectives
describing left-wing politicians?
• Usually you need a benchmark: a reference corpus, a frequency
expected under a random distribution, or two distributions that
you compare with each other
What do you need for a quantitative corpus study?
• A defined corpus in which you can count the frequencies of the objects
you are interested in
• Frequencies
• Benchmark
• Statistical significance (H0 – null hypothesis, some tool for calculations: on-line
calculator, SPSS, R, Excel)
Descriptive statistics
• Tells you how the data are distributed:
• What are the lowest and the highest value
• What is the mean value
• What is the mode (most frequent value)
• What is the median (half of the values are smaller and half are
larger than the median)
• What are the first and the third quartile (25% of the data lie below
the 1st quartile and 75% below the 3rd quartile)
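The measures listed above can be computed with Python's standard `statistics` module. This is only a minimal sketch: the list `sentence_lengths` is invented for illustration and does not come from a real corpus.

```python
import statistics

# Hypothetical data: sentence lengths (in words) from a small text sample
sentence_lengths = [4, 7, 7, 9, 10, 12, 12, 12, 15, 21, 30]

lowest = min(sentence_lengths)
highest = max(sentence_lengths)
mean = statistics.mean(sentence_lengths)
mode = statistics.mode(sentence_lengths)      # most frequent value
median = statistics.median(sentence_lengths)  # middle value of the sorted data

# quantiles(n=4) returns the three cut points: 1st quartile, median, 3rd quartile
q1, q2, q3 = statistics.quantiles(sentence_lengths, n=4)

print(f"min={lowest} max={highest} mean={mean:.2f} mode={mode} median={median}")
print(f"1st quartile={q1} 3rd quartile={q3}")
```

In this invented sample the mean is larger than the median, which is what a few very long sentences (a right-skewed distribution) would produce.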
Types of distribution
• Skewed, symmetric, normal distribution
Inferential statistics
• Testing statistical significance, collocational strength etc.
• Think of what you want to test
• Try to find a suitable method
• Check whether your data fulfil the conditions, think of possible problems
• Conduct the test
• Interpret the results statistically (usually you can find guidelines in the
source where you looked up the test)
• Think about what the statistical information tells you about the language
• How far can you extrapolate from your data?
Normalized frequencies
• In order to compare data obtained from samples of different
sizes you must normalize them:
• raw_freq / corpus_size * x (x is usually 100, 1,000 or 1,000,000)
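As a small illustration of the normalization formula, the sketch below compares one word in two corpora of different sizes. The function name `normalized_frequency` and all counts are assumptions made up for this example.

```python
def normalized_frequency(raw_freq, corpus_size, base=1_000_000):
    """Return the number of occurrences per `base` tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Same as raw_freq / corpus_size * base; multiplying first avoids rounding noise
    return raw_freq * base / corpus_size

# Hypothetical counts: the word is rarer in absolute terms in corpus A,
# but relatively more frequent once corpus size is taken into account.
per_million_a = normalized_frequency(150, 500_000)
per_million_b = normalized_frequency(400, 2_000_000)

print(per_million_a, per_million_b)
```

Raw counts would suggest that the word is more typical of corpus B (400 vs. 150 hits); the normalized values show the opposite.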
Types of variables
• In order to conduct a quantitative study you need numbers, but there
are different types of numbers:
• http://www.indiana.edu/~educy520/sec5982/week_2/variable_types.pdf
Null hypothesis
• A hypothesis that you try to reject in your study
• Instead of asking whether a and b are different, you ask: how probable would the observed data be
if a and b were not correlated / not different / came from the same distribution?
• The p-value is the probability of obtaining a result at least as extreme as the observed one
if the null hypothesis holds; in other words: the p-value tells you how surprising your data
would be if a and b were not correlated
• You need to decide in advance which threshold is satisfactory: 0.05, 0.01, 0.001?
• p=0.05 means: if a and b were not correlated, a result like yours would turn up by chance
in 5% of samples. It does not mean that there is a 95% chance that they are correlated.
• In linguistics p=0.05 is the usual standard, but in medicine it might not be
satisfactory. Would you take a treatment whose apparent effect would show up by chance
in 5 out of 100 trials even if it did nothing?
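One way to make the p-value concrete is an exact calculation on a toy case. Assume (hypothetically) that a word occurs 20 times in total in two equally large corpora A and B, and 16 of these occurrences are in A. Under H0 every occurrence is equally likely to fall into A or B, so the p-value is the probability of a split at least as uneven as 16:4. The function name and the numbers are illustrative assumptions, not part of the lecture.

```python
from math import comb

def two_sided_binomial_p(k, n):
    """P-value for observing k of n occurrences in corpus A under H0: p = 0.5."""
    observed_distance = abs(k - n / 2)
    # Count all outcomes that are at least as far from an even split as the observed one
    extreme_outcomes = sum(
        comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= observed_distance
    )
    return extreme_outcomes / 2 ** n

p_value = two_sided_binomial_p(16, 20)
alpha = 0.05                   # significance threshold chosen before the test
reject_h0 = p_value < alpha

print(f"p = {p_value:.4f}, reject H0 at {alpha}: {reject_h0}")
```

Here the p-value is below 0.05 but above 0.01, so whether H0 is rejected depends on the threshold chosen beforehand.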
Useful statistical methods
• Tests of statistical significance
• chi-square test – observations on big data sets, more evenly distributed (the expected value in each cell must be >5)
• log-likelihood (LL) test – preferred test
• Fisher’s Exact test – observations on small data sets
• Collocation statistics
• Mutual information (MI) – the higher the MI score, the stronger the link between two items
• An MI score of 3.0 or higher: the pair is likely to be a collocation
• The closer the MI score is to 0, the higher the probability that the co-occurrence is random
• A negative MI score indicates that the two candidates tend to avoid each other
• z score
• Testing correlation – simple linear regression
• More complicated methods
• Multiple linear regression
• Multifactorial analysis
• Clustering
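The significance tests and the MI score listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a replacement for the tools below: the table is a 2x2 contingency table (the word vs. all other tokens, in corpus A vs. corpus B), the log-likelihood value is the standard G2 statistic over all four cells, and MI is computed as log2(observed / expected) without any correction for the collocation window. All function names and frequencies are hypothetical.

```python
from math import erfc, log, log2, sqrt

def contingency_tests(freq_a, size_a, freq_b, size_b):
    """Return (chi-square, log-likelihood G2, smallest expected value) for a 2x2 table."""
    observed = [[freq_a, size_a - freq_a], [freq_b, size_b - freq_b]]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, total - freq_a - freq_b]
    chi2 = g2 = 0.0
    min_expected = float("inf")
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            min_expected = min(min_expected, expected)
            chi2 += (observed[i][j] - expected) ** 2 / expected
            if observed[i][j] > 0:  # a cell with 0 observations contributes 0 to G2
                g2 += 2 * observed[i][j] * log(observed[i][j] / expected)
    return chi2, g2, min_expected

def p_value_df1(statistic):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2))

def mutual_information(freq_pair, freq_node, freq_collocate, corpus_size):
    """MI = log2(observed co-occurrences / co-occurrences expected by chance)."""
    expected = freq_node * freq_collocate / corpus_size
    return log2(freq_pair / expected)

# Hypothetical frequencies: 120 hits in corpus A, 80 in corpus B, 100,000 tokens each
chi2, g2, min_expected = contingency_tests(120, 100_000, 80, 100_000)
print(f"chi2={chi2:.2f} p={p_value_df1(chi2):.4f}  LL={g2:.2f} p={p_value_df1(g2):.4f}")
print(f"smallest expected value: {min_expected} (chi-square needs > 5)")

# Hypothetical collocation: node 1,000 hits, collocate 500 hits, 32 co-occurrences
mi = mutual_information(32, 1000, 500, 1_000_000)
print(f"MI={mi:.1f}")
```

The smallest expected value is returned so that the chi-square condition can be checked before the result is interpreted; when it fails, Fisher's Exact test is the alternative named in the list.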
Tools
• http://ucrel.lancs.ac.uk/llwizard.html
• https://www.r-project.org/ + https://cran.r-project.org/web/packages/languageR/languageR.pdf
• SPSS
• Calculating manually, e.g. with the help of an Excel table
Useful literature
• http://www.linguistics.ucsb.edu/faculty/stgries/research/sflwr/sflwr.html
• http://www.sfs.uni-tuebingen.de/~hbaayen/
• http://www.cambridge.org/us/academic/subjects/languages-linguistics/grammar-and-syntax/analyzing-linguistic-data-practical-introduction-statistics-using-r