Exploring MySpace:
Measurement and Analysis of the Online
Social Network Site
Bill Gauvin
21-Jan-2009
Online Social Network Sites: A Definition
[Boyd & Ellison]
Web-based services that allow individuals to
(1) construct a public or semi-public profile within a bounded
system
(2) articulate a list of other users with whom they share a
connection
(3) view and traverse their list of connections and those made
by others within the system
The Rise of Online Social Networks (OSN)
• 1997: SixDegrees allowed users to create
profiles, list friends, and surf friend lists
• 1997-2001: a number of community tools
supported profiles and friend lists, e.g., AsianAvenue,
BlackPlanet, MiGente, LiveJournal
• 2001 - : business and professional social
networks emerged, e.g., Ryze, LinkedIn
The Rise of Online Social Networks (OSN)
• 2003: MySpace attracts teens and bands,
among others, and grows into the largest OSN
• 2004: Facebook, designed for college
networking (Harvard), expanded to other
colleges, high schools, and then everyone
• a global phenomenon: ???
Online Social Networking Goes Mobile
Displays locations of friends with
their presence status (available,
away, etc) visually on maps or on
lists
MySpace
• launched in Santa Monica, CA, in 2003
• grew rapidly and attracted Friendster’s users, bands, etc
• teenagers began joining en masse in 2004
• three distinct populations began to form:
• musicians/artists
• teenagers
• post-college urban social crowd
• purchased by News Corporation for $580M in 2005
• arguably the largest online social network site
MySpace Profile
• each user has a profile that contains age, gender, location,
last login time, and other information
• each user has a unique id associated with the profile
• some profiles claim neutral gender, e.g., bands
• user can set his/her profile to be private (default is public)
or customize the layout
MySpace Profile
• user can search for and add friends to his/her friend list
• user can post messages to a friend’s blog space
• only friends have access to a private profile’s friend list and
blog space
• other functions: IM/Call, Block/Rank User, Add to Group
favorite, etc
Measurement: SnailCrawler
• generate random ids uniformly between 1 and max
(15,000,000,000)
• many ids are not occupied (invalid)
• retrieve profile information from MySpace (HTTP)
• name, id, gender, age, location,
public/private/custom
• other information when available for public profiles:
company, religion, marriage, children, smoke/drink,
orientation, zodiac, education, ethnicity, occupation,
hometown, body-type, mood, last login, etc
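The sampling step above can be sketched roughly as follows. This is a minimal illustration, not SnailCrawler's actual code: `fetch_profile` is a hypothetical callable standing in for the HTTP retrieval, and it is assumed to return `None` for an unoccupied id.

```python
import random
from typing import Callable, Iterable, Iterator, Optional

MAX_ID = 15_000_000_000  # upper bound quoted on the slide


def sample_ids(n: int, max_id: int = MAX_ID, seed: Optional[int] = None) -> Iterator[int]:
    """Yield n ids drawn uniformly at random from [1, max_id]."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.randint(1, max_id)  # randint includes both endpoints


def crawl(ids: Iterable[int], fetch_profile: Callable[[int], Optional[dict]]):
    """Split sampled ids into retrieved profiles and invalid (unoccupied) ids.

    fetch_profile is a hypothetical stand-in for the HTTP request; it is
    assumed to return None when no profile exists for the id.
    """
    profiles, invalid = [], []
    for uid in ids:
        record = fetch_profile(uid)
        if record is None:
            invalid.append(uid)
        else:
            profiles.append(record)
    return profiles, invalid


if __name__ == "__main__":
    # Toy stand-in for the network call: pretend only even ids are occupied.
    def fake_fetch(uid):
        return {"id": uid} if uid % 2 == 0 else None

    found, missing = crawl(sample_ids(10, max_id=100, seed=1), fake_fetch)
    print(len(found), "valid;", len(missing), "invalid")
```

Passing the fetch function in as a parameter keeps the sampling logic testable without touching the network.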
Measurement: SnailCrawler
For public profiles
• scan the friend list and record id of each friend
• scan the blog space and record for each entry
(publisher id, time, word/object/image/reference counts)
• secondary scan to retrieve the profile information of
publishers
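The per-entry counts listed above (words, objects, images, references) could be collected along these lines. This is a sketch under assumptions: the slides do not say how SnailCrawler parses entries, so the tag choices here (`img` for images, `object`/`embed` for objects, `a` with an `href` for references) are illustrative only.

```python
from html.parser import HTMLParser


class EntryCounter(HTMLParser):
    """Count words, images, embedded objects, and links in one blog entry."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.images = 0
        self.objects = 0
        self.references = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag in ("object", "embed"):
            # Assumption: both tags count as "objects"; a nested
            # <object><embed> pair is therefore counted twice.
            self.objects += 1
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            self.references += 1

    def handle_data(self, data):
        # Whitespace-separated tokens in text nodes count as words.
        self.words += len(data.split())


def count_entry(html: str) -> dict:
    """Return the four counts for a single entry's HTML."""
    parser = EntryCounter()
    parser.feed(html)
    parser.close()
    return {
        "words": parser.words,
        "images": parser.images,
        "objects": parser.objects,
        "references": parser.references,
    }


if __name__ == "__main__":
    sample = '<p>hello big world <a href="http://example.com">link</a> <img src="a.png"/></p>'
    print(count_entry(sample))
```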
Recorded information of a profile: an example
<Profile>
  <Name>Tantric Daydream</Name>
  <Alias>214935142</Alias>
  <ScannerIndex>214935142</ScannerIndex>
  <Locale>USA</Locale>
  <Gender>Male</Gender>
  <Age>23 years old</Age>
  <City>all around</City>
  <State>California</State>
  <Country />
  <SNtool>MySpace</SNtool>
  <ID>214935142</ID>
  <Private>False</Private>
  <Music>False</Music>
  <Video>False</Video>
  <Location />
  <snID>6221,67105862,113546935,72060317,182521602,17617975,70660559,</snID>
  <Company>Tantric Daydream's Companies</Company>
  <Religion />
  <Marriage>Single</Marriage>
  <Children>Someday</Children>
  <SmokeDrink />
  <Orientation>Straight</Orientation>
  <Zodiac>Virgo</Zodiac>
  <Education>Some college</Education>
  <Ethnicity>White / Caucasian</Ethnicity>
  <Occupation>USAF</Occupation>
  <Hometown>Saint Louis</Hometown>
  <BodyType>5' 11" / Athletic</BodyType>
  <Mood><span class="searchMonkey mood">pretty</span></Mood>
  <MembershipReason />
  <MemberSince />
  <LastLogin><span class="searchMonkey lastLogin">3/12/2008</span></LastLogin>
  <NumberFriends>7</NumberFriends>
  <NumberBlogs>3</NumberBlogs>
</Profile>
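A recorded profile like the one above could be loaded for analysis roughly as follows. This is a sketch, not the author's tooling: it uses a shortened copy of the example record, and it assumes a private profile is recorded with the literal text `True` (only `False` appears in the example).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example record, kept to the fields used below.
RECORD = """<Profile>
  <ID>214935142</ID>
  <Gender>Male</Gender>
  <Age>23 years old</Age>
  <Private>False</Private>
  <snID>6221,67105862,113546935,72060317,182521602,17617975,70660559,</snID>
  <NumberFriends>7</NumberFriends>
  <Religion />
</Profile>"""


def parse_profile(xml_text: str) -> dict:
    """Turn one recorded <Profile> element into typed Python values."""
    root = ET.fromstring(xml_text)

    def text(tag: str) -> str:
        # Missing and empty elements both come back as "".
        return (root.findtext(tag) or "").strip()

    age_tokens = text("Age").split()  # e.g. ["23", "years", "old"]
    return {
        "id": int(text("ID")),
        "gender": text("Gender"),
        "age": int(age_tokens[0]) if age_tokens and age_tokens[0].isdigit() else None,
        # Assumption: private profiles are recorded as "True".
        "private": text("Private") == "True",
        # The friend list has a trailing comma, so empty tokens are dropped.
        "friends": [int(t) for t in text("snID").split(",") if t.strip()],
        "religion": text("Religion") or None,
    }


if __name__ == "__main__":
    profile = parse_profile(RECORD)
    print(profile["id"], profile["age"], len(profile["friends"]))
```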
Measurement: issues
Profile format changes over time
Some ids went away?
Some data are lost when the program crashes
MySpace rate-limits an IP address that requests invalid ids
MySpace Analysis
• Profile Analysis
– Distribution of number of Friends and Publishers
– Correlation between Friends and Publishers
• Publishing Analysis
– Age/Gender
– Times Published
– Day Published
– Day/Month/Year Published
– Distribution of number of blog entries
– Distribution of inter-blog time (need this)
• Content Analysis
– Number of words published
– Number of HREFs, objects, images used
– Distribution of message length (number of words) (need this)
Data
• Scanned: 1,397,840
– Female: 668,898
– Male: 473,531
– Neutral: 255,411
• Public: 750,945
– Female: 294,947
– Male: 296,334
– Neutral: 159,664
• Private: 562,217
– Female: 373,951
– Male: 177,197
– Neutral: 11,069
• Blogs: 67,045
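The counts above can be cross-checked with a few lines of arithmetic. The sketch below only re-derives quantities from the numbers on this slide; the dictionary names are chosen for illustration.

```python
scanned = {"female": 668_898, "male": 473_531, "neutral": 255_411}
public = {"female": 294_947, "male": 296_334, "neutral": 159_664}
private = {"female": 373_951, "male": 177_197, "neutral": 11_069}

# The per-gender counts add up to the totals reported on the slide.
assert sum(scanned.values()) == 1_397_840
assert sum(public.values()) == 750_945
assert sum(private.values()) == 562_217

# Fraction of each group's scanned profiles that are private.
private_share = {g: private[g] / scanned[g] for g in scanned}

# Scanned profiles counted as neither public nor private.
unclassified = {g: scanned[g] - public[g] - private[g] for g in scanned}

if __name__ == "__main__":
    for g in scanned:
        print(f"{g:8s} private share {private_share[g]:6.1%}  "
              f"neither public nor private: {unclassified[g]:,}")
```

By this arithmetic roughly 56% of female, 37% of male, and 4% of neutral profiles are private, and the 84,678 scanned profiles that are neither public nor private are all neutral. The slide does not say how those profiles were classified.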
Friends Distribution
[Figure: distribution of the number of friends, with separate curves for female, male, and neutral profiles]
How to model the two scaling regimes? Can it be modeled as superposition of two power-law distributions?
Can the neutral curve be fitted by a power-law distribution?
• friends distributions of female and male profiles are similar
• friends distribution of neutral profiles is different
• for female/male profiles, there appear to be two
distinct scaling regimes
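One way to start on the power-law questions above is a maximum-likelihood estimate of the exponent. The sketch below is an illustration, not the author's analysis: it uses the standard continuous estimator alpha = 1 + n / sum(ln(x / xmin)), which is only an approximation for integer friend counts, and it runs on synthetic data because the crawl data is not available here. The function names are chosen for illustration.

```python
import math
import random


def powerlaw_alpha_mle(samples, xmin):
    """Continuous MLE of a power-law exponent for the tail x >= xmin.

    Implements alpha = 1 + n / sum(ln(x / xmin)) over the n tail samples.
    """
    tail = [x for x in samples if x >= xmin]
    if not tail:
        raise ValueError("no samples at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("all tail samples equal xmin; exponent undefined")
    return 1.0 + len(tail) / log_sum


def synthetic_powerlaw(n, alpha, xmin, seed=0):
    """Draw n samples with density proportional to x**(-alpha) for x >= xmin."""
    rng = random.Random(seed)
    # Inverse-transform sampling; 1 - random() lies in (0, 1], so the
    # base of the power is never zero.
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    data = synthetic_powerlaw(20_000, alpha=2.5, xmin=1.0, seed=42)
    print("estimated exponent:", powerlaw_alpha_mle(data, xmin=1.0))
```

Repeating the estimate for several values of `xmin` gives a rough indication of whether the exponent changes between a low and a high range of friend counts. A proper test of two scaling regimes would need a fit that accounts for the upper cutoff of the first regime.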
Publisher Distribution
• similar to the friends distribution; the male/female turning point is
smoother
• a sharp turning point for neutral profiles at the high end
Number of Blog Entry of Profiles Distribution
Can this be modeled by a power-law distribution?
Further analysis needed
Publisher Age and Gender
• ages 16 and under are protected by law and aggregated at 0 in the figure
• teenagers and people in their twenties post the most blogs
• false ages at 98-100 years old?
• among teenagers 16-19, females publish more than males; after 20, there are no
significant differences, and males often publish more than females
Blog publish time
[Figure: number of blogs published over time; annotations mark Christmas and Valentine’s Day; time axis labels include Feb, Sept, Dec]
• females publish more than males, and males more than neutral profiles
• spikes on holidays, e.g., Valentine’s Day, Christmas
Blog publish time
[Figures: number of blogs published by month (Jan to Dec) and by day of week]
• females publish more than males
• more blogs posted May to Oct
• slightly more blogs posted during weekdays
Blog publish time
• big jump at 1 pm
• people tend to publish from the afternoon through midnight
• peak around 10pm, bottom around 5am
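The month, day-of-week, and hour-of-day breakdowns on the publish-time slides amount to binning entry timestamps. A minimal sketch follows; the timestamp format is an assumption (the slides only show a date such as 3/12/2008 and do not give the recorded time format), and the function name is illustrative.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp format; the slides do not specify how time is recorded.
TIME_FORMAT = "%m/%d/%Y %H:%M"

# Fixed names avoid locale-dependent strftime output; weekday() is 0 for Monday.
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def bin_publish_times(stamps, fmt=TIME_FORMAT):
    """Count entries per hour of day, per day of week, and per month."""
    by_hour, by_weekday, by_month = Counter(), Counter(), Counter()
    for stamp in stamps:
        t = datetime.strptime(stamp, fmt)
        by_hour[t.hour] += 1
        by_weekday[WEEKDAYS[t.weekday()]] += 1
        by_month[t.month] += 1
    return by_hour, by_weekday, by_month


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    hours, days, months = bin_publish_times(
        ["2/14/2008 22:05", "12/25/2007 13:30", "2/14/2008 22:40"]
    )
    print(dict(hours), dict(days), dict(months))
```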
Publisher vs Friend
[Figure: number of publishers vs. number of friends, shown in two panels: linear scale and log-log scale]
• only friends can publish in a user’s blog, but some profiles have more publishers
than friends, i.e., they fall above the 45-degree line. This is because some friend profiles
were removed by their owners or by MySpace after inactivity and hence are not counted
• the number of publishers remains relatively flat as the number of friends increases
Publisher-Friend ratio
spikes caused by integer round-up problem?
Need finer granularity data.
Blog Contents Analysis
[Figure: blog content counts for words, HREFs, images, and objects]
• # pure-word blogs >> # HREF blogs > # image blogs > # object blogs
• females write more words, post more images and references
• males post more objects
Time Interval Distribution (need this)