Group Members:
07005029 – Abhinav Gokari
07005030 – Sudheer Kumar
07d05004 – Ignatius Pereira
07d05019 – Praveen Dhanala
Under the guidance of Prof. Pushpak Bhattacharyya
Outline
Motivation
Social Networks and Homophily
Experiments
Statistical Methods Used
Advantages of Analyzing Social Networks for Homophily
Conclusion
References
Motivation
Social networks are an ongoing phenomenon: Orkut, Facebook, Twitter, etc.
Almost 1/10th of the world's population uses Facebook.
There is great scope for innovation and development in Social Computing, which deals with creating social contexts through the use of software and technology.
Interesting problems arise, such as:
Social network analysis
Target marketing and improving e-commerce
Friendship (or relationship) suggestions
Social Networks
A social network is a social structure made up of individuals (or organizations) called "nodes", which are tied (connected) by one or more specific types of interdependency such as friendship, kinship, etc.
Online social networks are attribute-independent networks.
In online social networks, a relationship between two individuals is mutually self-declared and binary.
Friendship is not functional, and its reasons can be subtle, e.g. an offline/online meeting, a common workplace, or pure visual interest.
Homophily
It is the tendency of individuals to associate or bond with others with a similar set of interests or attributes ("birds of a feather flock together").
People choose friends who share common interests and characteristics.
One of the most general and least contested theoretical principles in sociology is the principle of homophily.
Homophily (contd.)
A social system is homophilous if contacts are more similar to one another than to strangers in terms of their individual attributes and behavior.
If homophily is a robust aspect of human behavior, it can be used to deduce a particular person's attributes from his/her friends' attributes in an online social network.
We shall now examine the following experiments to observe homophily in social networks.
Expt. based on the paper by Apoorv et al.
A "Travel Site" is chosen with mutually self-declared friendships.
A data set of 181 nodes is selected with 1214 friendship links, with additional information such as their attributes, characteristics, etc.
Each user can select their hobbies from a list of 26 pre-defined hobbies.
Each user also has additional characteristics such as "language spoken", "country they live in", etc.
Based on "Predicting Interests of People on Online Social Networks" by Apoorv et al., 2009.
Contd.
The information gathered on the website was:
Friends' Network: the mutually self-declared friends network matrix.
Hobbies: members declare their hobbies by clicking on boxes next to a list of 26 possible hobbies.
Languages Spoken: there is a list of 139 languages from which members select a maximum of three languages they speak.
Age group: the age group is given as ranges (e.g. under 20, 20-25, 26-30), from which the user chooses one. There are a total of 12 ranges.
Statistics about the data
F – This is the 181 by 181 friends network matrix. If person p1 has a friend p2, then F[p1, p2] will be 1; otherwise it will be 0.
H – This is the 181 by 26 hobbies matrix: 181 for the number of people and 26 for the number of different hobbies a person may have. For example, if p50 has three hobbies – Acting, Dancing and Theatre – then H[p50, Acting], H[p50, Dancing] and H[p50, Theatre] will be 1 and all the other cells in row H[p50] will be 0.
L – This is the languages-spoken matrix. It is exactly like H, with the only difference that the columns are the different languages a person can speak. This matrix is therefore 181 by 139.
P – This is the places-visited matrix. It is similar to H and L, with the only difference that the columns are the different places visited by a person. This matrix is therefore 181 by 263.
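As an illustration, the matrices described above can be represented directly as binary NumPy arrays (toy sizes and links below; the real matrices are 181 × 181, 181 × 26, etc.):

```python
import numpy as np

# Toy versions of the paper's matrices (real sizes: F 181x181, H 181x26).
n_people, n_hobbies = 5, 4

# F: symmetric binary friendship matrix; F[p1, p2] = 1 iff p1 and p2 are friends.
F = np.zeros((n_people, n_people), dtype=int)
for p1, p2 in [(0, 1), (1, 2), (3, 4)]:   # hypothetical friendship links
    F[p1, p2] = F[p2, p1] = 1

# H: binary hobby matrix; H[p, h] = 1 iff person p declared hobby h.
H = np.zeros((n_people, n_hobbies), dtype=int)
H[0, [0, 2]] = 1      # person 0 declared hobbies 0 and 2
H[1, 2] = 1

print(F.sum() // 2)   # number of friendship links: 3
print(H[0])
```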
The hypothesis of the experiment is that there is a correlation between mutual self-declared friendship links in online social networks and attributes listed in the profiles of said friends, presumably because of homophily.
The GFHF (Gaussian Fields and Harmonic Functions) algorithm is extremely sensitive to the correctness of the weight matrices. Thus, GFHF allows us to test our hypothesis.
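As a sketch of what GFHF computes (assuming the standard Gaussian Fields and Harmonic Functions formulation of Zhu et al.; the graph and labels below are toy values): known labels are clamped, and unlabeled nodes receive the harmonic solution over the weight matrix.

```python
import numpy as np

def gfhf(W, labeled_idx, labels):
    """Gaussian Fields / Harmonic Functions label propagation.
    W: symmetric nonnegative weight matrix; labeled_idx: node indices whose
    binary labels are given in `labels`. Returns a score for every node."""
    n = W.shape[0]
    labeled = set(labeled_idx)
    unlabeled = [i for i in range(n) if i not in labeled]
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    Luu = L[np.ix_(unlabeled, unlabeled)]
    Wul = W[np.ix_(unlabeled, labeled_idx)]
    f = np.zeros(n)
    f[labeled_idx] = labels                     # clamp the known labels
    f[unlabeled] = np.linalg.solve(Luu, Wul @ np.asarray(labels, float))
    return f

# Hypothetical 4-node chain 0-1-2-3; node 0 labeled 1, node 3 labeled 0.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
f = gfhf(W, [0, 3], [1, 0])
print(f)   # interior nodes get interpolated scores (2/3 and 1/3)
```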
GFHF Example
Based on the paper by Amir Saffari et al., 2010.
Support Vector Machines
A classifier derived from statistical learning theory by Vapnik et al. in 1992.
Currently, SVMs are widely used in object detection and recognition, content-based image retrieval, text recognition, biometrics, speech recognition, regression analysis, etc.
V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1992.
Linear classifier
The goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to.
A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector.
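The decision rule above can be sketched in a few lines (the weights and bias here are illustrative, not learned):

```python
import numpy as np

# A linear classifier: the decision is the sign of a linear combination
# w . x + b of the feature vector x (w and b are illustrative values).
w = np.array([2.0, -1.0])
b = -0.5

def classify(x):
    return 1 if np.dot(w, x) + b > 0 else -1

print(classify(np.array([1.0, 0.0])))   # 2*1 - 1*0 - 0.5 = 1.5 > 0 -> 1
print(classify(np.array([0.0, 2.0])))   # 0 - 2 - 0.5 < 0 -> -1
```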
SVM in test
MATLAB Support Vector Machine Toolbox
The toolbox provides routines for support vector
classification and support vector regression.
A GUI is included which allows the visualisation of simple classification and regression problems. (The MATLAB optimisation toolbox, or an alternative quadratic programming routine, is required.)
http://www.isis.ecs.soton.ac.uk/isystems/kernel/
Support Vector Machine run on sample data.
http://users.ecs.soton.ac.uk/srg/publications/pdf/SVM.pdf
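As a language-independent counterpart to the MATLAB toolbox, a minimal from-scratch linear SVM can be sketched with the Pegasos subgradient method (the data, seed, and hyperparameters below are illustrative, not from the slides):

```python
import numpy as np

# Linear SVM trained by minimizing the regularized hinge loss with the
# Pegasos subgradient method; toy two-cluster data stands in for real input.
rng = np.random.default_rng(0)

# Two well-separated clusters; a constant 1 feature lets w absorb the bias.
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
X = np.hstack([X, np.ones((40, 1))])
y = np.array([1] * 20 + [-1] * 20)

lam, T = 0.01, 2000
w = np.zeros(3)
for t in range(1, T + 1):
    i = rng.integers(len(y))
    eta = 1.0 / (lam * t)                 # Pegasos step size
    margin = y[i] * (X[i] @ w)
    w *= (1 - eta * lam)                  # shrink (regularization step)
    if margin < 1:                        # margin violated: hinge subgradient
        w += eta * y[i] * X[i]

preds = np.sign(X @ w)
print((preds == y).mean())                # training accuracy
```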
Why SVMs?
Experiment by Thorsten Joachims et al. on text categorisation with support vector machines.
Text categorisation is the classification of documents into a fixed number of predefined categories, where each document can be in one, multiple, or no category at all.
SVMs are well suited to categorisation tasks with many features.
SVMs are robust and do not require parameter tuning.
Thorsten Joachims, Text Categorisation with Support Vector Machines: Learning with Many Relevant Features, 1998.
Why SVM’s Contd.?
SVMs are based on the Structural Risk Minimization principle.
The idea of structural risk minimization is to find a hypothesis h for which we can guarantee the lowest true error, i.e. the probability that h will make an error on an unseen, randomly selected test sample.
SVMs are universal learners:
Their ability to learn is independent of the dimensionality of the feature space.
With the use of a simple kernel function, they can be used to learn polynomial classifiers.
SVMs vs. MLPs
Experiment by Barabino et al. on Support Vector Machines vs. Multi-Layer Perceptrons for particle identification in physics.
SVMs are based on minimization of structural risk, whereas MLPs are based on minimization of empirical risk.
Findings:
1) Very similar performance: SVMs perform as well as MLPs.
2) SVMs work well in the case of large training sets drawn from input spaces of small dimension.
M. Barabino et al., Support Vector Machines versus Multi-Layer Perceptrons in Particle Identification, 1999.
Back to the Expt.
To accept or reject our research hypothesis, we consider the prediction capability of GFHF using two weight matrices:
Randomly generated binary weight matrix (Gr)
Self-declared friends network (Gf)
To incorporate the effect of other attributes, a Support Vector Machine (SVM) is used along with GFHF.
Two feature sets are used with the SVMs:
The set with only personal characteristics (Sc)
The set with all the hobbies except the one being predicted (St)
Contd.
GFHF is run 30 times, each time for a random configuration of ni labeled data points, where ni ∈ N = {10, 30, 50, 70, 90}.
These predictions are calculated for all 26 hobbies under consideration.
Therefore, for each weight matrix, Gr and Gf, we get a corresponding 26 x 5 x 30 matrix, where 26 is the number of hobbies, 5 is the number of different labeled-set sizes, and 30 is the number of trials.
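The aggregation over trials can be sketched as follows (random placeholder accuracies; the real values come from the GFHF runs):

```python
import numpy as np

# Placeholder accuracy tensor: 26 hobbies x 5 labeled-set sizes x 30 trials.
rng = np.random.default_rng(1)
acc = rng.uniform(0.4, 0.9, size=(26, 5, 30))

# Averaging over the 30 trials gives one mean accuracy per hobby and size.
mean_acc = acc.mean(axis=2)
print(mean_acc.shape)   # (26, 5)
```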
Explanation of Results
The table shows the accuracy of running GFHF with the random matrix (Gr) and with the friends matrix (Gf) for 26 hobbies and across 3 different training set sizes (numbers of labeled data points).
The numbers are averages over the 30 trials with the same configuration.
The second-to-last column shows the average difference in accuracy between Gf and Gr across all training set sizes, and the last column shows the difference in accuracy between St and Gf, again averaged across all training set sizes.
Contd.
The results show that in most cases Gf performs significantly better than Gr, which implies that the underlying friends network is in fact important for prediction.
For some hobbies, the difference in the performance of Gf and Gr is extremely high. These are precisely the hobbies that over 50% of the people in the network have.
Contd.
There are quite a few hobbies for which the friends
network does not provide any useful information.
We see that the friends network does not consistently
help over a random network if the hobby has a relative
incidence of 41% or less.
At 47% and above, the friends network consistently
outperforms the random network.
Contd.
The results corresponding to Sc and St are also similar.
In general, St performs better than Sc, which in turn performs better than Gf.
From this table we also observe that as we increase the data, prediction accuracy increases for the SVM.
Expt. based on the paper by Akshay Patil
The data was gathered from a large online social networking site.
The data is essentially in the form of a huge network of interconnected nodes, with nodes representing actual people or users and the ties between them denoting relationships in the social network.
Each node also stores information regarding the individual user. This information makes up the node or user profile, and is essentially a list of attribute:value pairs.
Akshay N. Patil. Homophily Based Link Prediction in Social Networks. 2009.
Statistics of Data Set
Definitions
The nodes are distinguished as:
Class of Near Nodes N(u) – nodes within a 2-hop radius.
Class of Far Nodes F(u) – all nodes other than near nodes.
We introduce a t-bit vector associated with every pair of nodes (to denote the attributes on which the pair agrees), whereby we place a '1' at the i-th position if the two nodes match on attribute Ai, or a '0' if they do not match.
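A minimal sketch of this match vector (the attribute names and values are hypothetical; in the paper, t covers all profile attributes):

```python
# Building the t-bit match vector for a pair of nodes: bit i is 1 iff the
# two profiles agree on attribute Ai (attribute names here are hypothetical).
ATTRS = ["location", "age_group", "religion", "language"]

def match_vector(u, v):
    return [1 if u.get(a) == v.get(a) else 0 for a in ATTRS]

u = {"location": "Mumbai", "age_group": "20-25", "religion": "X", "language": "Hindi"}
v = {"location": "Mumbai", "age_group": "26-30", "religion": "X", "language": "Tamil"}
print(match_vector(u, v))   # [1, 0, 1, 0]
```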
Contd.
Now, for each attribute Ai in the network, we define a 2 × 2 contingency matrix as shown in Table 3.1, where:
C00: pairs of nodes in FS not matching on Ai.
C01: pairs of nodes in FS matching on Ai.
C10: pairs of nodes in NS not matching on Ai.
C11: pairs of nodes in NS matching on Ai.
|Cij| = kij
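Given such a 2 × 2 table, the chi-square statistic used on the following slides can be computed from observed and expected counts (the counts below are illustrative):

```python
# Chi-square statistic for one attribute's 2x2 contingency table
# C = [[C00, C01], [C10, C11]] (counts are illustrative).
def chi_square(C):
    n = sum(sum(row) for row in C)
    row = [sum(r) for r in C]                                  # class marginals (FS, NS)
    col = [sum(C[i][j] for i in range(2)) for j in range(2)]   # match marginals
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n                     # under independence
            chi2 += (C[i][j] - expected) ** 2 / expected
    return chi2

print(round(chi_square([[40, 10], [20, 30]]), 3))   # 16.667
```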
X² (chi-square) Measure
The statistical measure we use to detect homophily is the X² (chi-square) measure.
The X² measure aggregates the deviation of observed values from the expected values under the independence hypothesis.
The independence hypothesis in our case can be stated as follows:
"An attribute plays no role in the classification of a node into a particular class Cij",
where Ai refers to a particular attribute, C refers to the classes defined, klm refers to the number of users in class l having value m for attribute Ai, and n refers to the total number of users.
The larger the X² value, the lower is the belief in the independence hypothesis, and hence the larger is the role played by the particular attribute in relationship formation.
We can rewrite the formula for the X² measure in known terms using the probabilities of each class/attribute and the independence hypothesis as follows:
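One standard way to write this (stated as an assumption, since the slide's own formula is not reproduced here), with p_l and q_m the class and attribute-value marginals:

```latex
p_l = \frac{1}{n}\sum_m k_{lm}, \qquad
q_m = \frac{1}{n}\sum_l k_{lm}, \qquad
\chi^2 = \sum_{l,m} \frac{\left(k_{lm} - n\,p_l\,q_m\right)^2}{n\,p_l\,q_m}
```

Here n·p_l·q_m is the count expected in cell (l, m) under the independence hypothesis.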
In this way, we calculate the X² value associated with each attribute in the network.
The Odds Ratio
The X² measure assesses how statistically unlikely the lack of association is between similarity on an attribute and the probability of a social relationship.
What the X² measure cannot tell us is whether the association is positive or negative.
Yet we need such a directional measure to test the principle of homophily, which predicts a positive relationship.
A negative relationship would imply negative homophily: a tendency for individuals to associate not with alikes, but with different others.
We therefore also compute the odds ratio for each attribute.
The odds ratio is simply the odds that two similar individuals are connected divided by the odds that two dissimilar individuals are connected.
The odds ratio for an attribute can be defined as follows:
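In terms of the contingency counts above, this presumably reduces to (C11 · C00) / (C10 · C01); a small sketch with illustrative counts:

```python
# Odds ratio from the 2x2 contingency counts: the odds that two similar
# individuals are connected (C11/C01) divided by the odds that two
# dissimilar individuals are connected (C10/C00), which simplifies to
# OR = (C11 * C00) / (C10 * C01).  OR > 1 suggests positive homophily.
def odds_ratio(c00, c01, c10, c11):
    return (c11 * c00) / (c10 * c01)

# Illustrative counts: near pairs match on the attribute far more often.
print(odds_ratio(c00=900, c01=100, c10=40, c11=60))   # 13.5
```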
Explanation of Results
Trends that are visible from the online social network results are as follows:
Geographical location is the strongest factor affecting how
relationships shape up in a social network.
The results also indicate that relationships are more likely to
develop between individuals belonging to the same age group.
Religious affiliation and ethnicity are also dominant factors in
relationship formation, as demonstrated by attributes like religion
and languages spoken by individuals.
Likes, hobbies, etc. are less likely to influence how ties are made in a social network. Relationships are less likely to be formed between individuals who, for example, enjoy the same movies or music, read the same books, etc.
Advantages of analyzing Homophily
Some offline friendships may be absent in online social communities, and are thus detectable. Friends may not know of each other that they are members of the same online community. This is especially true for young online communities or new users to the system. Facebook has a 'people you may know' feature, with which people who are possibly friends are suggested to connect online.
Advantages in target marketing and e-commerce are straightforward. For example, Orkut shows us ads on our profile which are based on our profile information.
Information spread in social networks is being used in diverse fields such as marketing campaigns.
Contd.
Link prediction may also turn out to be useful for
suggesting links that are likely to develop in the future,
thus steering the evolution of a social community.
In the case of large organizations or companies, there is often an official hierarchy for collaboration and interaction. Methods for link prediction could be effectively used to uncover beneficial interactions or collaborations that have not yet been fully utilized, and which would otherwise be hidden by this official hierarchy.
Conclusion
It has been widely observed that social networks exhibit homophily.
We have seen how to detect homophily, and some important applications of this phenomenon.
More research needs to be done on different kinds of sample data to analyze homophily more accurately and exploit it.
References
Apoorv Agarwal, Owen Rambow, Nandini Bhardwaj. Predicting Interests of People on Online Social Networks. In the Proceedings of IEEE CSE 09, 12th IEEE International Conference on Computational Science and Engineering, IEEE Computer Society Press, Vancouver, Canada, 2009.
Akshay N. Patil. Homophily Based Link Prediction in Social Networks. 2009.
Miller McPherson, Lynn Smith-Lovin, and James M. Cook. Birds of a Feather: Homophily in Social Networks. Annual Review of Sociology, 2001.
References (Contd.)
V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1992.
Thorsten Joachims. Text Categorisation with Support Vector Machines: Learning with Many Relevant Features. 1998.
M. Barabino et al. Support Vector Machines versus Multi-Layer Perceptrons in Particle Identification. 1999.
Amir Saffari, Christian Leistner, Horst Bischof. Semi-Supervised Learning in Vision. CVPR, San Francisco, 2010.
http://www.dtreg.com/svm.htm
Wikipedia