Text Use in Online Dating Profiles
James Fung | Christo Sims
ANLP | Final Presentation
Instructor Marti Hearst
12.04.06
Overview
Goal
An exploratory exercise: can we use the text someone provides
in their dating profile to assign them to various classes?
The Text
Can't wait to get to know you
Nice, warm and sweet, as most of my friends would describe me. I love
to laugh all the time. I have a strong passion towards life even through
little things. I tend to be quiet in a large group but generally great with
one on one basis. I am ambitious about love and romance. And I am
very respectful of the needs and wants of other people. Life is a
beautiful journey. I am seeking someone who would appreciate the
value of life, family, have a warm heart and nice peronality to share this
journey with. If you are that person, I can't wait to get to know you.
(female, asian, college grad)
Possible Classes
• Education
• Gender
• Attend Services
• Income
• Ethnicity
• Marital Status
• Want kids
• Others
– Astrology?
Approach
Scraping Profiles
• Yahoo! Personals
• 200 Male seeking Female
• 200 Female seeking Male
• Within 50 Miles of San Francisco
• Ages 25-35
Feature Extraction (Python)
• Token frequency
– Words: TF, TF.IDF
– Bigrams
– Weighted headlines
• Readability measures
– Characters, syllables, words, complex words, sentences
– Ratios of the above
– Gunning-Fog and six others
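The token and readability features listed above could be computed roughly as follows. This is a minimal sketch, not the project's actual extraction code: the function names, the headline weight of 2, and the vowel-group syllable heuristic are all assumptions made for illustration, and the Gunning-Fog version here skips the usual exclusions for proper nouns and common suffixes.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; an apostrophe is kept inside a word."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def bigrams(tokens):
    """Adjacent token pairs joined by a space, e.g. 'me laugh'."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def term_frequencies(headline, body, headline_weight=2):
    """Word and bigram counts; headline terms count headline_weight times.

    The weight of 2 is a hypothetical value, not taken from the slides.
    """
    counts = Counter()
    for text, weight in ((headline, headline_weight), (body, 1)):
        tokens = tokenize(text)
        for term in tokens + bigrams(tokens):
            counts[term] += weight
    return counts

def tf_idf(tf_per_doc):
    """Scale each term count by log(N / number of documents containing it)."""
    n_docs = len(tf_per_doc)
    doc_freq = Counter(term for tf in tf_per_doc for term in tf)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in tf.items()}
        for tf in tf_per_doc
    ]

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per vowel group."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Gunning-Fog index: 0.4 * (words/sentences + 100 * complex/words)."""
    words = tokenize(text)
    if not words:
        return 0.0
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))
```

Profiles with little or no punctuation collapse into a single "sentence" under this split, which inflates the words-per-sentence term.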
Feature Selection & Classification (Weka)
• Use Weka’s built-in feature selection tools
– Chi-Squared, Information Gain
– Subset Eval (not working well with most of the classes)
• Explore a variety of classification algorithms for a variety of possible classes
– Multinomial Naïve Bayes
– K-Nearest Neighbors
– Decision Tree
– Support Vector Machines
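To make the Information Gain criterion concrete, here is a minimal pure-Python sketch that scores a binary "term appears in profile" feature against the class labels. It illustrates the criterion only and is not Weka's implementation; the function names and the representation of each profile as a set of tokens are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(docs, labels, term):
    """Entropy reduction from splitting profiles on presence of `term`.

    `docs` is a list of token sets, one per profile (an assumed representation).
    """
    with_term = [y for doc, y in zip(docs, labels) if term in doc]
    without_term = [y for doc, y in zip(docs, labels) if term not in doc]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - remainder

def rank_terms(docs, labels, top_n=10):
    """Terms sorted by information gain, ties broken alphabetically."""
    vocabulary = set().union(*docs)
    return sorted(vocabulary, key=lambda t: (-information_gain(docs, labels, t), t))[:top_n]
```

A term that appears about equally often in every class scores near zero, which is one way a selector can end up with very few relevant features.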
Preliminary Results
Able to beat a naïve baseline in a few cases, usually where
there are only two or three possible classification
categories:
Gender (~69% Accuracy)
Category               # Instances
Women seeking a man    196
Man seeking a woman    200 (51%)
Want (more) kids (~65% Accuracy)
Category    # Instances
Yes         202 (62.3%)
Not sure    105
No          17
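The percentages next to the largest category are consistent with a majority-class baseline: the accuracy obtained by always predicting the most frequent category. The sketch below assumes that reading of "naïve baseline" and uses the Want (more) kids counts from the table; the function name is hypothetical.

```python
def majority_baseline(class_counts):
    """Accuracy of always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

# Instance counts for "Want (more) kids" as reported on the slide.
want_kids = {"Yes": 202, "Not sure": 105, "No": 17}

baseline = majority_baseline(want_kids)
print(f"baseline {baseline:.1%}")  # prints: baseline 62.3%
```

Against this baseline, the reported ~65% accuracy is a gain of under three percentage points.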
Preliminary Results
More difficult with more classification categories:
Education (47.4% Accuracy)
Category            # Instances
Post-Graduate       94
College Grad        175 (44%)
Some College        86
High School Grad    13
Some High School    3
• Income (61% null reply)
• Employment Status (75% Full-time)
• Political Views
• Attend Services
• Ethnicity
Preliminary Results (cont.)
Some interesting statistics about feature probability for a
given class (from Multinomial Naïve Bayes output):
Gender
– “man” - over 2x as likely in female profiles
– “sense” - over 2x as likely in female profiles
– “honest” - over 2x as likely in female profiles
– “independent” - over 3x as likely in female profiles
– “loving” - over 3x as likely in female profiles
– “crazy” - over 4x as likely in male profiles
– “company” - over 3x as likely in male profiles
– “friendship” - over 2x as likely in female profiles
– “me laugh” - almost 4x as likely in female profiles
– “great sense” - over 6x as likely in female profiles
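Ratios like these can be read off the per-class term probabilities that a multinomial Naïve Bayes model estimates. The sketch below shows one way to compute them with add-one (Laplace) smoothing; the smoothing constant, the function names, and the list-of-token-lists input are assumptions, and Weka's exact estimates may differ.

```python
from collections import Counter

def term_probabilities(token_lists, vocabulary, alpha=1.0):
    """P(term | class) for one class, with add-alpha smoothing (alpha assumed)."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    total = sum(counts[t] for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}

def likelihood_ratios(class_a_docs, class_b_docs):
    """How many times more likely each term is in class A than in class B."""
    vocabulary = {t for doc in class_a_docs + class_b_docs for t in doc}
    p_a = term_probabilities(class_a_docs, vocabulary)
    p_b = term_probabilities(class_b_docs, vocabulary)
    return {t: p_a[t] / p_b[t] for t in vocabulary}

# Invented toy profiles, for illustration only (not project data).
female_docs = [["loving", "honest", "loving"]]
male_docs = [["crazy", "honest"]]
ratios = likelihood_ratios(female_docs, male_docs)
```

A ratio above 1 means the term is more probable in the first class; its reciprocal gives the "x times as likely" figure for the second class.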
Preliminary Results (cont.)
Some interesting statistics about features (from Multinomial
Naïve Bayes output):
Want (more) kids:
– “caring” - over 3x as likely in the “yes” than the “not sure” class
– “heart” - over 2x as likely in the “yes” than the “not sure” class
– “sometimes” - over 2x as likely in the “not sure” than the “yes” class
– “beautiful” - 2x as likely in the “yes” than the “not sure” class
– “real” - 2x as likely in the “not sure” than the “yes” class
– “dancing” - 2x as likely in the “not sure” than the “yes” class
– “games” - almost 2x as likely in the “not sure” than the “yes” class
Challenges
Not Enough Instances
For most classes, we don’t have enough instances for
meaningful training:
Education:
– Some College                 87
– College Grad                 176
– Post-Graduate                97
– High School Grad             14
– Some High School             3
Ethnicity:
– Hispanic/Latino              33
– Caucasian (white)            202
– Asian                        74
– Inter-racial                 14
– African American (black)     38
– Other                        13
– Pacific Islander             7
– Native American              1
– East Indian                  8
– Middle Eastern               1
Features Aren’t Working
• Weka identifies few relevant features
• Subset Eval selects subsets of size 1-3
• Difficult to overcome strong a priori probability:
                    PG    CG    SC    HSG   SHS
Post-Graduate       0     94    0     0     0
College Grad        0     175   0     0     0
Some College        0     86    0     0     0
High School Grad    0     12    0     0     1
Some High School    0     2     0     0     1
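The matrix can be checked with a few lines of code. The sketch assumes rows are actual classes and columns are predicted classes, which fits the row sums (94, 175, 86, 13, 3) matching the Education instance counts reported earlier; the helper names are hypothetical.

```python
# Education confusion matrix from the slide.
# Assumed layout: rows = actual class, columns = predicted class.
labels = ["PG", "CG", "SC", "HSG", "SHS"]
confusion = [
    [0, 94, 0, 0, 0],   # Post-Graduate
    [0, 175, 0, 0, 0],  # College Grad
    [0, 86, 0, 0, 0],   # Some College
    [0, 12, 0, 0, 1],   # High School Grad
    [0, 2, 0, 0, 1],    # Some High School
]

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

def predicted_totals(matrix):
    """How many instances were assigned to each class (column sums)."""
    return [sum(column) for column in zip(*matrix)]

print(f"accuracy {accuracy(confusion):.1%}")          # prints: accuracy 47.4%
print(dict(zip(labels, predicted_totals(confusion))))
```

Under that reading, 369 of 371 instances are assigned to College Grad, so the 47.4% accuracy comes almost entirely from the class prior.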
Additional Challenges
I am 33old woman from Ireland living here for a few years
Love this country and love the out doors, favourite thing is
mountain biking and hiking tooSan Francisco has so much to
offer, nice restaurants which i love Thai food and so many live
music shows which i love to out and listen every month …
No punctuation
Additional Challenges (cont.)
I am into swimming, sunrises, Vinyasa Yoga at the Loft, cafes,
people watching, warm drinks in the morning, laughing, crying,
feeling all of it, freshly squeezed juice, tennis, painting, spirals,
Abraham Hicks, Life as Art, singing, swinging, sushi,
backgammon, remembering my dreams, warm weather, soft
textures, calligraphy, episopalian upbringing gone buddhist
tendencies, handmade paper, dancing, fire, telling stories …
Lists, not sentences
Additional Challenges (cont.)
I work alot but in my free time i love to play a round of golf and
spend time out with my dog. I love going to the beach with him or
going to the park and just chillin out. At night i love goin out with
friend and having a few drinks.
Lack of complex words
Additional Challenges
• Scraping profiles requires a user login
– Easy in PHP, not in Python
– Have to save profiles by hand, limits corpus size
• The profile text in Yahoo! Personals doesn’t seem as
thoughtful as profiles on Match.com
– Shorter profile text
– Spam?
– How dedicated are the participants?
Where we’re headed
Future Work (cont.)
• Need more profiles!
– PHP
– manually saved
• Different features
– Use of capitalization: emphasis, grammar
– Tailored features
– More token features
“I go to church I am very sincere in my faith and my striving to
become more Like Jesus. By know means am I perfect, however,
NEW MERCIES EVERYDAY! … I am a very real and straight
forward person, however,HUMBLE to God's word and voice in my
life. Always looking to HIM for my direction and HE is my
SOURCE”
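Capitalization features of the kind proposed above might look like the following sketch. The specific features (share of all-caps words for emphasis, count of a lowercase standalone "i" as a grammar cue) and all names are assumptions chosen to illustrate the idea, not features the project has implemented.

```python
import re

def capitalization_features(text):
    """Simple capitalization cues: all-caps emphasis and lowercase 'i'."""
    words = re.findall(r"[A-Za-z']+", text)
    # Single letters such as "I" or "A" are not counted as emphasis.
    all_caps = [w for w in words if len(w) > 1 and w.isupper()]
    lowercase_i = [w for w in words if w == "i"]
    return {
        "all_caps_ratio": len(all_caps) / len(words) if words else 0.0,
        "lowercase_i_count": len(lowercase_i),
        "all_caps_words": all_caps,
    }

features = capitalization_features("NEW MERCIES EVERYDAY! i am HUMBLE")
```

Whether such cues separate any of the classes would still have to be tested with the same feature selection tools as the token features.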