
Everybody Learns: Faculty and Student Perspectives on Interdisciplinary Undergraduate Research
Center for Interdisciplinary Research




College Statistics Course -> Statistics Concentration
Statistics Concentration -> Graduate School
Create a Statistics Community
Promote connections between researchers on campus
Linguistics & Statistics I


Approached by Linguistics Faculty, Dr. Rika Ito, for assistance with linguistics analysis software
Haley, statistics student, also pursuing a linguistics concentration
Learning about Linguistics

Initial meeting of faculty and students
– Description of the field of linguistics
– Description of the particular problem of interest

Readings in linguistics
– Seminal reading concerning the overall field
– Readings addressing the problem of interest
Linguistics Workshop
Weekly meetings of faculty & students
Learning about Statistics


Done on an as-needed basis
Faculty model interdisciplinary communication
– Stating and re-stating the problem
– Hypothetical results
Students eventually do the communicating
Weekly meetings
Example



Linguistic software outputs
Statisticians connected this to logistic regression
This connection allowed for increased flexibility in modeling linguistics data
Statistical Analysis of Phonetic Vowel Shifts
Haley Hedlin and Stacey Wood; Faculty Advisors Rika Ito and Julie Legler
Introduction
Linguists analyze speech characteristics by measuring the frequencies associated with the production of sound. These frequencies are decomposed into different components called formants. In this study, we focus on the first and second formants, F1 and F2. F1 and F2 relate to the vertical and horizontal position of the tongue, respectively. Linguists then plot (F1, F2) on a coordinate system, which represents the individual's vowel space. Vowel spaces differ across geographic regions and also over time. Of particular interest is the movement of one vowel, /æ/ as in “cat,” relative to a more stable vowel, /ɛ/ as in “bed.” To ascertain the existence of a shift, linguists typically use a two-sample t-test. We explore the use of a more flexible approach, a random effects linear regression model, and compare these two methods.
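As a rough illustration of the classic approach mentioned above, the sketch below computes a two-sample t statistic for one speaker. It assumes Welch's unequal-variance form, which the poster does not specify, and the function name `welch_t` and the formant values are hypothetical, not data from the study.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    # Squared standard error contributed by each sample
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical F1 measurements (Hz) for one speaker's /ae/ and "bed"-vowel tokens
f1_ae = [712.0, 698.0, 730.0, 705.0, 721.0]
f1_eh = [655.0, 670.0, 648.0, 662.0, 659.0]
t_stat, dof = welch_t(f1_ae, f1_eh)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Because the test is run once per speaker, each run sees only that speaker's few tokens, which is the power limitation the Results section describes.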
Linguistic Methods
• 36 subjects from rural Northern Michigan pronounced 102 words, which focused on different vowel sounds [2]
• For each word, F1 and F2 frequencies were recorded for all individuals
• Frequencies were plotted to determine an individual’s vowel space (see Figure 2)
• Used linguistics methods to normalize the vowel spaces of all subjects [3]
• Calculate G (the log of the geometric mean over all speakers) and S (the corresponding log mean for an individual speaker) according to the following:

$$G = \frac{\sum_{k=1}^{p} \sum_{j=1}^{m} \sum_{i=1}^{n_k} \ln(F_{i,j,k})}{m \sum_{k=1}^{p} n_k}$$

$$S = \frac{\sum_{j=1}^{m} \sum_{i=1}^{n} \ln(F_{i,j})}{m \, n}$$

where p = number of speakers, m = number of formants, n = number of words for a particular speaker (written n_k for speaker k), and F = measured frequency. Subscripts i, j, and k refer to particular words, formants, and speakers, respectively.
Research Questions
• Is there evidence of NCVS in rural Northern Michigan?
• If so, which subpopulations exhibit this shift?
• Calculate a uniform scaling factor, K, unique to each individual speaker:

$$K = \exp(G - S)$$
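The normalization quantities G, S, and K can be sketched in a few lines. This is only an illustration: the data layout (a dictionary of per-speaker (F1, F2) pairs), the function names, and the frequencies are hypothetical, and the final step of multiplying each raw frequency by K is an assumption about how the scaling factor is applied, which the poster does not spell out.

```python
import math


def log_mean(freqs):
    """Mean of the natural logs of a list of frequencies."""
    return sum(math.log(f) for f in freqs) / len(freqs)


def scaling_factors(speakers):
    """Return K = exp(G - S) for each speaker.

    speakers maps a speaker name to a list of (F1, F2) pairs, one per word.
    """
    # G pools every formant of every word of every speaker
    pooled = [f for tokens in speakers.values() for pair in tokens for f in pair]
    g = log_mean(pooled)
    factors = {}
    for name, tokens in speakers.items():
        # S uses only this speaker's measurements
        s = log_mean([f for pair in tokens for f in pair])
        factors[name] = math.exp(g - s)
    return factors


# Hypothetical data: speaker_b's frequencies are exactly double speaker_a's
speakers = {
    "speaker_a": [(500.0, 1500.0), (700.0, 1800.0)],
    "speaker_b": [(1000.0, 3000.0), (1400.0, 3600.0)],
}
factors = scaling_factors(speakers)
# Assumed use of K: rescale each raw frequency by the speaker's factor
normalized = {
    name: [(factors[name] * f1, factors[name] * f2) for f1, f2 in tokens]
    for name, tokens in speakers.items()
}
print(factors)
```

In this toy case the two speakers differ only by a constant multiple, so their rescaled vowel spaces coincide.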
[Figure 3 panels: Predicted Means by Gender; Predicted Means by Age; Predicted Means by Economic Class]
Results
• T-test results
  • No significant difference between /æ/ and /ɛ/ vowels within individuals
  • Unable to make inferences across people
  • Does not control for other factors affecting variability
  • Lack of power resulting from repeated testing and limited data within subjects
• Random effects model results
  • Significant difference in individuals’ vowel space centers
  • Significant difference in frequency between genders but not between ages or economic classes after accounting for elements of study design and uniqueness of individual vowel spaces (see Figure 3)
  • Significant difference in frequency between the /æ/ and /ɛ/ vowels after controlling for differences among social variables, formant, word, and the uniqueness of individual vowel spaces
  • Vowel shift more advanced among females than males
Background
• A new urban speech pattern, the Northern Cities Vowel Shift (NCVS), has drawn attention because it is spreading into rural areas [1]
• NCVS involves a chain shift of vowels within the vowel space (see Figure 1)
• The shift begins when the vowel /æ/ moves from below and behind the vowel /ɛ/ to above and in front of it
• /ɛ/ is assumed to be a more stable vowel because it is later in the chain shift
• For these reasons, we decided to focus on the position of the vowel /æ/ relative to the vowel /ɛ/ to assess the existence of NCVS
Figure 2 legend (additional entries):
• Magenta: /o/ “hope”
• Maroon: /u/ “boot”
Figure 1: Diagram of the Northern Cities Vowel Shift [1]. Each vowel gradually moves to the position indicated by the arrows in the specified order. /æ/ is the first to move. /ɛ/ is considered more stable because it is fourth to move.
References
1. Labov, William. (2006) http://www.ling.upenn.edu/phono_atlas/ICSLP4.html#Heading4
2. Ito, Rika. (1999) “Diffusion of Urban Sound Change in Rural Michigan: A Case of the Northern Cities Shift.” East Lansing, MI: Michigan State University dissertation.
3. Labov, William. (2003) Plotnik 7.0 Documentation.
[Figure 3 plot axes: F1 (Hz) and F2 (Hz)]
Figure 3 Legend
o /ɛ/ vowel
/æ/ vowel
• female
• male
• young
• old
• middle class
• working class

Example Vowel Space
[Plot axes: F1 (Hz) and F2 (Hz)]
Legend
• Red: /ɛ/ as in “pen”
• Black: /æ/ “apple”
• Blue: /^/ “bun”
• Gold: /a/ “Bob”
• Pink: /I/ “tin”
• Purple: /U/ “good”
• Green: /i/ “sleep”
• Gray: /e/ as in “make”
• Light blue: /ɔ/ “caught”
• Light green: /oi/ “toy”
• Navy blue: /aw/ “loud”
• Orange: /ai/ “bite”
Figure 2: Sample vowel space of a working class male. Note that the x and y axes are inverted to better
reflect the relationship between formants and tongue position. A point at the top of the graph reflects a
high tongue position. A point on the left of the graph reflects a forward tongue position.
Statistical Methods
• The relative positions of /æ/ and /ɛ/ were compared in order to determine the existence of a shift
• Classic approach: t-tests on normalized data
  • Tests each speaker separately
• Proposed approach: random effects regression model predicting frequency using raw data
  • Each subject receives a random intercept to account for differences in individual vowel spaces
  • Model accounts for elements of design by including formant, vowel, and word variables
  • Model accounts for social factors such as gender, age, and economic class
  • Model includes terms representing the important interactions between factors
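One way to write the proposed model down, as a sketch only: the poster does not give the fitted equation, so the notation below is an assumption, including the normality of the random intercept and error terms. Here $F_{ijk}$ is the raw frequency for word $i$, formant $j$, and speaker $k$; $\mathbf{x}_{ijk}$ collects the fixed-effect covariates (formant, vowel, word, gender, age, economic class, and the chosen interactions); and $u_k$ is the speaker-specific random intercept.

```latex
\begin{align*}
F_{ijk} &= \mathbf{x}_{ijk}^{\top}\boldsymbol{\beta} + u_k + \varepsilon_{ijk}, \\
u_k &\sim N(0, \sigma_u^2), \qquad \varepsilon_{ijk} \sim N(0, \sigma^2).
\end{align*}
```

Under this formulation, the vowel shift is assessed through the coefficients in $\boldsymbol{\beta}$ that involve the vowel indicator, while $u_k$ absorbs differences between individual vowel spaces without normalizing the data first.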
Figure 3: Point estimates for mean vowel frequencies by subgroups of gender, age, and economic class. The plot for gender shows much larger differences than the plots for age and economic class.
Conclusion
• Our research suggests that the use of random effects models provides a
more powerful and flexible option for linguists than t-tests
Future Directions
• Expand model to include other vowels
• Explore the effect of different consonants surrounding the vowel
• Create confidence regions around point estimates with bootstrapping
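As a hedged sketch of the bootstrapping idea in the last bullet, the code below builds a percentile bootstrap interval for a single mean. This is a one-dimensional simplification of the two-dimensional (F1, F2) confidence regions the poster has in mind, it ignores the random effects structure, and the function name and sample values are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical F2 measurements (Hz) for one vowel in one subgroup
f2_values = [1810.0, 1895.0, 1760.0, 1930.0, 1842.0, 1788.0, 1905.0, 1870.0]
low, high = bootstrap_ci(f2_values)
print(f"95% bootstrap interval for the mean: ({low:.1f}, {high:.1f})")
```

Extending this to a region would mean resampling (F1, F2) pairs together and summarizing the resulting cloud of mean pairs rather than a single sorted list.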
Acknowledgements
Special thanks to Rika Ito for inviting us to join this research project and Julie
Legler for all her statistical advice and guidance along the way. We would also like to
thank the Center for Interdisciplinary Research and the National Science Foundation
(Grant DMS-0354308) for providing us with the funding and the facilities to conduct our
research.
Linguistics & Statistics II

Linguist Dr. Maggie Broner received reviews asking her to use logistic regression to reanalyze the data in her manuscript

Previous year’s student educated the incoming student about the study of linguistics
Advanced Methods

Data structure required advanced methods in statistics - methods new to both the students and the faculty

Method suggested by one of the students from a summer internship at NIH
Broadening the Use of Statistical Analysis in Second Language Research
Kirsten Eilertson, Haley Hedlin, Mark Holland, Maggie Broner and Julie Legler
St. Olaf College
•One of the assumptions behind logistic regression is that there are no empty or nearly empty cells (see Figure 1; note that Marvin speaks only Spanish to adults in our data set)
•Examines when children in an immersion school
use their native language or second language,
Spanish
•Response variable is language of utterance
[Figure 1 interlocutor categories: Adult, Carolina, Leonard, Peer, Total]
•Consider the environmental factors influencing the
use of Spanish
•Adjust for various phenomena in the data, such as
interactions and complete separation of variables
Figure 1: The frequency of English and Spanish utterances by
interlocutor for Marvin
•With the instances of Spanish and English usage so unevenly distributed across interlocutors, it is difficult to find unique coefficient estimates.
•In our case the likelihood surface increases continually (a monotone likelihood), so no maximum exists and the estimation fails to converge (see Figure 2).
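A small numerical illustration of monotone likelihood, under simplifying assumptions: the data below are hypothetical and completely separated, and the model is a one-parameter logistic regression without an intercept. The log likelihood keeps rising toward zero as the slope grows, so there is no finite maximizer.

```python
import math


def log_likelihood(slope, x, y):
    """Log likelihood of a no-intercept logistic model at the given slope."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = slope * xi
        # log P(y_i) = -log(1 + exp(-eta)) if y_i == 1, -log(1 + exp(eta)) otherwise
        signed = eta if yi == 1 else -eta
        total += -math.log1p(math.exp(-signed))
    return total


# Hypothetical, completely separated data: y is 1 exactly when x is positive
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 0, 1, 1, 1]
slopes = (1.0, 5.0, 25.0)
values = [log_likelihood(b, x, y) for b in slopes]
for b, v in zip(slopes, values):
    print(f"slope = {b:5.1f}  log likelihood = {v:.6f}")
```

An iterative fitting routine chasing this ever-increasing surface never settles on an estimate.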
Methods:
•Data collected from three fifth grade students in a
full K-5 Spanish immersion school
•Dependent variable coded as English, Spanish,
Mix-English base or Mix-Spanish base
Statistical Methods
•Analysis was done using Stata 8.2 and R
•Used a package within R that maximizes a penalized likelihood to adjust for monotone likelihood in one or more of the predictors [3]
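The analysis itself used an R package [3]. The Python sketch below is not that package; it is a minimal illustration of the same idea for an intercept plus a single predictor. It solves the modified score equations for logistic regression described by Heinze and Schemper [2], in which each residual y_i - p_i is replaced by y_i - p_i + h_i(1/2 - p_i), where h_i is a diagonal element of the hat matrix. The function name, the plain Fisher-scoring iteration without step halving, and the data are all assumptions made for illustration.

```python
import math


def firth_logistic(x, y, max_iter=200, tol=1e-10):
    """Firth-penalized logistic regression with an intercept and one predictor.

    Assumes x takes at least two distinct values and the linear predictor
    stays moderate in size (no overflow protection is included).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information I(beta) = X'WX for the design matrix [1, x]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det  # inverse of I(beta)
        # Hat-matrix diagonals h_i = w_i * x_i' I(beta)^{-1} x_i
        h = [wi * (v00 + 2.0 * v01 * xi + v11 * xi * xi) for wi, xi in zip(w, x)]
        # Modified score contributions
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))
        # Scoring step: beta <- beta + I(beta)^{-1} U*(beta)
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Hypothetical, completely separated data: ordinary maximum likelihood diverges here
x = [0, 0, 0, 1, 1, 1]
y = [0, 0, 0, 1, 1, 1]
intercept, slope = firth_logistic(x, y)
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```

For a single binary predictor, the penalized estimates coincide with ordinary logistic regression after adding one half to each cell of the 2x2 table, which gives a convenient check on the result.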
Combo  Interlocutor  On/off    Content
1      Peer          Off task  Non-language related
2      Peer          Off task  Language related
3      Marvin        Off task  Non-language related
4      Peer          On task   Non-language related
5      Marvin        Off task  Language related
6      Peer          On task   Language related
7      Marvin        On task   Non-language related
8      Marvin        On task   Language related

[Figure 4 panels: Frequency of combinations; Predicted probability of speaking Spanish by combination]
Figure 4: Histogram of combination of situations and predicted probabilities with confidence intervals for Leonard’s
predicted probability of speaking Spanish in the 8 different situations
Results:
•Using penalized log likelihood we were able to successfully build models
that incorporated our data involving the teacher and other adult
interlocutors.
•Using Stata 8.2 we were able to explore potential interactions of
predictors used to model a student’s probability of using Spanish.
Specifically, this proved to be important for Leonard.
•Two were selected at random; the third was
studied for his unusual propensity to speak
Spanish
•Transcribed from thirteen 25- to 80-minute classroom sessions taped and annotated by an observer
•New models confirmed the hypothesis that factors such as the
interlocutor, type of task, and whether the student was on or off task all
have varying effects on the students’ usages of Spanish and English.
Linguistic Methods
•The children were provided with wireless lapel
microphones for the taping sessions
Purpose:
•Environmental and linguistic explanatory variables
Data
Leonard's Data
[Figure 1: Marvin's Spanish v. English Usage. Axes: Interlocutor, Frequency; languages: English, Spanish]
Introduction:
•Arose from an applied problem in second language research in children [1]

A statistical summary like Leonard’s (Figure 4) was generated for each student, illustrating the probability of that student speaking Spanish in varying situations described by the predictors interlocutor, on/off task, and content.

Research Problem
Figure 2: Unadjusted maximum likelihood surface

Firth penalized likelihood method
•The Firth penalized likelihood method uses the equation [2]:

$$U(\beta_r)^* = U(\beta_r) + \frac{1}{2}\,\mathrm{trace}\left[ I(\beta)^{-1} \left\{ \frac{\partial I(\beta)}{\partial \beta_r} \right\} \right] = 0$$

•This method adjusts the likelihood surface using the second derivative of the log likelihood of the coefficients, $I(\beta)$, resulting in a new likelihood surface where the maximum exists and represents the best estimate for coefficients (see Figure 3)
Figure 3: Adjusted maximum likelihood surface

References:
1. Broner, M.A. (In preparation) “A variationist view of first and second language use
in full immersion contexts.”
2. Heinze, G. and Schemper, M. (2002) “A solution to the problem of separation in
logistic regression.”
3. Ploner, M.; Dunkler, D.; Southworth, H.; and Heinze, G. (2005). logistf: Firth's bias
reduced logistic regression. R package version 1.03.
http://www.meduniwien.ac.at/msi/biometrie/programme/fl/index.html
Acknowledgements:
Special thanks to Maggie Broner for inviting us to join this research project and
Julie Legler for all her statistical advice and guidance along the way. We would also
like to thank the Center for Interdisciplinary Research for providing us with the
funding and the facilities to conduct our research.
Linguistics & Statistics III



Back to Rika
Objective: Characterize the differences
between vowel spaces
Communication challenge
– Our understanding of the field
– Lack of readily available statistical tools
– Discipline-specific tradition
Comprehensive Program
Connecting with High Schools
Continue study of statistics as undergrad
Graduate School in Statistics or related field
Post-doc
Return to teach in undergraduate institution
Center for Interdisciplinary Research (CIR)

CIR Fellows awarded Stipend & Credit
Operon Prediction in the Tuberculosis Genome

Center for Interdisciplinary Research (CIR)

Physical Location
Adolescent Mothers & Infants in a School-based Intervention Program

Center for Interdisciplinary Research (CIR)

Weekly Research Skills Seminar with Meals
Assessing Baseball Performance using Hierarchical Models

Center for Interdisciplinary Research (CIR)

Interdisciplinary Research Teams
The Use of Moral Schemas in Decision-making
Comprehensive Program

Post-docs
Quantitative Analysis of Admission Trends
Everybody Learns!

Promotes interdisciplinary research involving faculty and students from across campus
Modeling Bluebird Predation