P - Data Communication and Data Management Laboratory

Download Report

Transcript P - Data Communication and Data Management Laboratory

Social Network
Something Interesting
Ruizhi Gao
Contents
•
•
•
•
The Born of Social Networks
New types of Social Networks
My Social Networks
Research related
History
Early Years
They are client-based application which can allow you to add friends and have friends list.
You can communicate with each other by send text message, image and other information
ICQ
AIM
Problems: You have to remember the account ID if you want to add new friends.
Your friends list is not visible
It’s more like a communication tool but SNS
SixDrgree.com
Combine the function of many different tools like ICQ and AIM, and provide a new search
approach based on users’ own information.
SixDegrees.com was a social network service website that lasted from 1997 to 2001 and was
based on the Web of Contacts model of social networking. It was named after the six
degrees of separation concept and allowed users to list friends, family members and
acquaintances both on the site and externally; external contacts were invited to join the
site. Users could send messages and post bulletin board items to people in their first,
second, and third degrees, and see their connection to any other user on the site. It was one
of the first manifestations of social networking websites in the format now seen today. Six
Degrees was followed by more successful social networking sites based on the "social-circles
network model" such as Friendster, MySpace, LinkedIn, XING and Facebook.
----- Wikipedia
http://en.wikipedia.org/wiki/SixDegrees.com
LiveJournal.com
It gives a new way that people mark
others as Friends to follow their
journals and manage privacy settings
SNS on business networks
• Ryze
• Tribe.net
• Friendster
MySpace
• Myspace could grow rapidly because many other SNS are trying to collect fees.
• MySpace open some public webpages to famous bands or singers so their fans can
follow.
• MySpace did not restrict users from add HTML code into the forms to make their
own pages special.
Today
Twitter is an online social networking service and microblogging service that enables its
users to send and read text-based messages of up to 140 characters, known as "tweets"
Harvard - Only  High School Networks  Corporate Networks
Reference
• Wikipedia
• Boyd, d. m., & Ellison, N. B. (2007). Social network
sites: Definition, history, and scholarship. Journal of
Computer-Mediated Communication, 13(1), article
11.
• Visual Academy
http://www.onlineschools.org/visualacademy/history-of-social-networking/
New
Types of Social Networks
• General
Facebook, Twitter …
• School, college
Classmates.com, Friends Reunited(UK)…
• Art Community
deviantART, Taltopia
• Movies , TV series
Youtube, douban, Flixster, Filmow …
• Photo sharing
Flickr, Fotki, DailyBooth …
• …
SoLoMo
• SoLoMo, short for social-local-mobile, refers to
a more mobile-centric version of the addition of
local entries to search engine results.(GPS)
• SoLoMo based social network.
The most famous one is Foursquare
Foursquare
You can check-in
in different place
(restaurant,
movie center … )
and leave your
comments. The
more you
checked in, more
point you earn.
You also can be
the “mayor of
one place” if you
check-in many
times
Private Social Network
• General social network may not allow you to
upload some “sensitive” information or
private information. However private social
network allow you to create a group-centric
SNS in which you can share more if you want.
Sgrouples
•
Sgrouples (sounds like ‘scruples’) is our very own private, group-centric social
network designed to mimic how small groups of people interact in their real lives.
Sgrouples allows you to easily post content to different groups based on your real
life interests – friends, family, work, sports teams, and hobby groups.
Reference
• http://blog.sgrouples.com/new-socialnetworking-sites/
• http://en.wikipedia.org/wiki/List_of_social_ne
tworking_websites
Experience
X-land Project
• We designed the Xland project as a 3D immersive blog.
Xland was part of the CHIPS (CHina Innovation Program
for Students) program sponsored by Sun Microsystems
and the Chinese Education Department.
• It may be not called as a blog but a social space. Every
user has its own room and we provide a open space
like a community. People can decorate their own room
and share information with each other. The most
important thing is it’s not a client based but can be
accessed from your web browser, which is very hard at
that time. You need to consider very carefully about
using the resource.
Trailer
Functions
Friend list and live chat HUD
Keyboard piano
Decorate your room
Albums
Functions
Background music of your room
Dairy
We tried to transfer the “my page” in facebook
Into “my room” in a 3D world. Good idea but
impractical like what Kaifu Li did  3D browser
Our blog:
http://blogs.openwonderland.org/2010/07/23/xland-3d-blog/
Visitor
AlienSandal
• A real start up project  SoLoMo based social
network cooperated with students in UC Berkley
• AlienSandal is a social platform based on Google
Map, which enable users to trace and customize Life
Track and connect to people who have similar tracks.
• Trailer
http://www.youtube.com/watch?v=ACYFEsodwrY
Different Types of check-in
Free draw in the map
Difficulties
• What is the most important thing in SNS?
• Why do you want to use a SNS application?
• Why can SNS do?
• Answer from VC, “friends, money and …”
How to make your SNS popular?
 RECOMMENDATION
Clustering
Why clustering is useful
• Grouping users in social networks is an important process that improves
matching and recommendation activities in social networks. The data
mining methods of clustering can be used in grouping the users in social
networks [1]
• [1]S. Alsaleh, R. Nayak, Y. Xu, “Grouping People in Social Networks
Using a Weighted Multi-Constraints Clustering Method,” in Proceedings of
WCCI 2012, June,10-15. Brisbane, Australia.
[2] GAN, G., MA, C. & WU, J. (2007) Data clustering: theory,
algorithms,and applications. ASASIAM Series on Statistics and Applied
Probability, 20, 219-230.
[3] NAYAK, R. (2011) Utilizing past relations and user similarities in a
social matching system. Advances in Knowledge Discovery and Data
Mining, 99-110.
First Question
• How many cluster???????
K-means, Fuzzy C means, K-medoids …
Rule of Thumb [1]
But this is not reliable….
[1]Kanti Mardia et al. (1979). Multivariate Analysis. Academic Press.
Cluster Estimation 1
• Stephen L. Chiu, “Fuzzy Model Identification Based on Cluster
Estimation,” Journal of Intelligent and Fuzzy Systems, Vol.2, pp
267-278, 1994.
• Suppose we have a collection of n data points {x1,x2,…,xn}, in our
case. For each data points, we can assign a potential value.
n
Pi   e
 || xi  x j ||2
j 1
and  
4
r2
ra is a positive constant. We have ra = 0.5 here (suggested by paper).
and ||xi – xj|| is the distance function between data xi and xj.
Kendall tau Distance
• Kendall tau distance
Example:
Two sample data
R1: 1,2,3,4,5
R2: 3,4,1,2,5
There are 4 disorder between R1 and R2
The distance will be
4
1
5(5  1) 
2
For R1
For R2
Count
1<2
3<4
1<3
3>1
*
1<4
3>2
*
1<5
3<5
2<3
4>1
*
2<4
4>2
*
2<5
4<5
3<4
1<2
3<5
1<5
4<5
2<5
Euclidean Distance
Two sample data
R1: 1,2,3,4,5
R2: 3,4,1,2,5
• De =
(1  3) 2  (2  4) 2  (3  1) 2  (4  2) 2  (5  5) 2
Hamming Distance
Two sample data
R1: 1,2,3,4,5
R2: 3,4,1,2,5
4 and 2 are
different
• DH = 1+1+1+1+0
1 and 3 are
different
2 and 4 are
different
5 and 5 are
different
3 and 1 are
different
Cluster Estimation 1
• After we have potential value for each data point, we select the
*
x
highest potential as the first cluster center. Let 1 be the data
point of the first cluster center and P1* be its potential value.
Then we will revise the potential of each data point xi by
Pi  Pi  P  e
*
1
• In which, we have:  
and rb  1.5ra
4
rb2
  || xi  x1* ||2
Cluster Estimation 1
• Next we select the data point with the highest remaining
potential as the second cluster center. We further reduce the
potential of each data point according to their distance to the
second cluster center. In general, after the kth cluster center has
been obtained, we revise the potential of each data point by
Pi  Pi  P  e
*
k
  || xi  xk* ||2
*
Where xk is the location of the kth cluster center and Pk* is its
potential value.
Cluster Estimation 1
•
Every time we got the Pk*, we need to decide whether we should select that as a new
center , we have following process.
*
*
if Pk   P1
Accept xk* as a cluster center and continue.
*
*
else if Pk   P1
Reject x* and end the selecting process
k
else
*
Let dmin = [shortest of the distance between xk and all previously found cluster centers]
if
dmin Pk*
 * 1
ra
P1
*
Accept xk as a cluster center and continue the whole process
else
*
*
Reject xk and set the potential at xk to 0.
*
Select the data point with the next highest potential as new xk and re-test
end if
end if
In which, we have   0.5 and
  0.15 (suggested in paper)
Running Example 1
• Suppose we have the following rankings set (which may
represents different pages you viewed. 1 is page 1, 2 is page 2)
R1 = {1,2,3,4,5,6,7},
R2 = {1,2,4,3,5,7,6},
R3 = {7,6,4,5,3,1,2},
R4 = {7,6,5,4,1,3,2}.
First we will assign potential value for each of them by
We have P for each of them
P1 = 1.9311
P2 = 1.9336
P3 = 1.7451
P4 = 1.7496
n
Pi   e
j 1
 || xi  x j ||2
Running Example 1
• We choose the highest value P2 = 1.9336, that is, we choose R2 as our first
center and P1*  1.9336
• After this, we need to revise the potential value for each ranking by
Pi  Pi  P  e
*
1
  || xi  x1* ||2
So we got:
P1 = 0.06006
P2 = 0.0
P3 = 1.69382
P4 = 1.57027
We will choose P3 then go on the process.
Running Example 1
• Since we have  P1*  0.290049 and  P1*  0.966833
P3 = 1.69382 which is greater than  P1*  0.966833
Refer to if Pk*   P1*
*
Accept xk as a cluster center and continue.
So P3 will be our second center.
Then we will revise the potential value again by
Pi  Pi  P  e
Then we will have
P1 = -0.02683
P2 = -0.04499
P3 = 0.0
P4 = 0.085356
So P4 will be our next choice.
*
1
  || xi  x2* ||2
Running Example 1
• Since we have  P1*  0.290049 and  P1*  0.966833
P4 = 0.085356 which is less than  P1*  0.290049
Refer to if Pk*   P1*
*
Reject xk and end the selecting process
So we will stop the whole process
• Out of 4 data set, we have estimate 2 clusters and 2 centers which are R2
and R3, this result makes sense.
Cluster Estimation 2: Max-min Distance
•
•
•
ZHOU Shi-bing, XU Zhen-yuan, TANG Xu-qing, “New method for
determining optimal number of clusters in K-means clustering algorithm,”
Computer Engineering and Applications, Vol.46, pp 27-31, 2010.
Beliakov, Gleb and King, Matthew 2006, Density based fuzzy c-means
clustering of non-convex patterns, European journal of operational research,
vol. 173, no. 3, pp. 717-728.
Suppose we have a collection of n data points {x1,x2,…,xn}, in our case, they
are the ranks. For each two data points, we calculate the distance.(We use
Kendall tau distance here)
Step 1(in paper):
Randomly choose one data point as the first center c1.
Step 2(in paper):
Choose the data point which has largest distance from the first center c1 as the
second center c2.
Cluster Estimation 2: Max-min Distance
• Step 3:
Get the distance between the rest of the data and all the centers.
dij  xi  c j
j  1, 2....k (# center )
then we get di  min(di1 , di 2 ...dik ) i  1, 2....n(# data)
Then, if D  max{di }    || c1  c2 || then, we choose xi as our next
center.
 , we choose 0.6
Repeat this process until we cannot find any more centers.
Running Example 2
• Suppose we have the following rankings set.
R1 = {1,2,3,4,5,6,7},
R2 = {1,2,4,3,5,7,6},
R3 = {7,6,4,5,3,1,2},
R4 = {7,6,5,4,1,3,2}.
• The Kendall tau distance between each other is:
R1
R2
R3
R4
R1
0
0.095238
0.904762
0.904762
R2
0.095238
0
1
0.809524
R3
0.904762
1
0
0.190476
R4
0.904762
0.809524
0.190476
0
Running Example 2
• The largest distance is between R2 and R3 which is 1.
So we choose these 2 as two centers then
d1  min(di1 , di 2 )  0.095238
d4  min(di1 , di 2 )  0.190476
Since D  max{di }  0.190476  0.6 || c1  c2 ||
We stop the process and choose R2 and R3 as the centers.
Pros and Cons
• Estimation 1
ra,  ,  , at least 3 parameters needs to be decided in the process, and all of this
parameters are application sensitive. For now, there is no adaptive way to adjust
such parameter.
• Estimation 2
The problem is on the initial center selection. The result will be affected by the
noise data point
• Question