Under the guidance of Prof.
Social Networks and
Statistical Methods Used
Advantages of Analyzing Social Networks for
Social Networks are the ongoing phenomenon
, Twitter, etc.,
of the world’s population use
There is a great scope for innovation and development
in Social Computing which deals with creating social
contexts through the use of software and technology.
Interesting problems arise like :
Social Network analysis
Target marketing and improving e-commerce
Friendship(or relationship) suggestions
A social network is a social structure made up of
individuals (or organizations) called "nodes", which
are tied (connected) by one or more specific types of
interdependency, such as friendship or kinship.
Online social networks are attribute independent
In online social networks, a relationship between two
individuals is mutually self-defined and binary.
Friendship is not functional and reasons could be
subtle. E.g. an offline/online meeting, common
workplace, pure visual interest.
Homophily is the tendency of individuals to associate or bond
with others who have a similar set of interests or attributes.
("Birds of a feather flock together.")
People choose friends who share common
interests and characteristics.
One of the most general and least contested
theoretical principles in sociology is the principle of
homophily.
A social system is homophilous if contacts are more
similar to one another than to strangers in terms of
their individual attributes and behavior.
Since homophily is a robust aspect of human behavior, it
can be used to deduce a particular person’s attributes
from his/her friends’ attributes in an online social
network.
We shall now examine the following experiments to
detect homophily in social networks.
Expt. based on the paper by
A “Travel Site” is chosen with mutually self-declared
friendship links. A set of 181 nodes is selected with 1214
friendship links, along with additional information such as
their attributes and characteristics.
Each user can select their hobbies from a list of 26
possible hobbies.
Each user also has additional characteristics such as
“language spoken”, “country they live in”, etc.
Based on “Predicting Interests of People on Online Social Networks” by
et al, 2009.
The information gathered on the website was:
Friends’ Network: This is the mutually self-declared
friends network matrix.
Hobbies: Members declare their hobbies by clicking on
boxes next to a list of 26 possible hobbies.
Languages Spoken: There is a list of 139 languages from
which members select a maximum of three languages.
Age group: The age group is given in ranges, for example
under 20, 20-30, etc., from which the user chooses
one. There are a total of 12 ranges.
Statistics about the data
F: This is the 181 by 181 friends network matrix. If person p1 has a friend p2
then F[p1,p2] will be 1, otherwise it will be 0.
H: This is the hobbies matrix, 181 by 26: 181 for the number of people and 26 for
the number of different hobbies a person may have. For example, if p50 has three
hobbies, Acting, Dancing and Theatre, then H[p50, Acting], H[p50, Dancing] and H[p50,
Theatre] will be 1 and all the other cells in row H[p50] will be 0.
L: This is the languages spoken matrix. It is structured exactly like H,
with the only difference that the columns here are the different languages a
person can speak. This matrix is therefore 181 by 139.
This is the places visited matrix. It is structured like H and L, with the
only difference that the columns here are the different places visited by a person.
This matrix is therefore 181 by 263.
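A minimal Python sketch of how such 0/1 matrices could be built. The function names, the integer person indices and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
def build_friend_matrix(n_people, friendships):
    """Return an n x n symmetric 0/1 matrix F from (p1, p2) index pairs."""
    F = [[0] * n_people for _ in range(n_people)]
    for p1, p2 in friendships:
        F[p1][p2] = 1
        F[p2][p1] = 1  # friendship is mutual, so F is symmetric
    return F


def build_indicator_matrix(n_people, categories, declared):
    """Return an n x len(categories) 0/1 matrix from {person: [category, ...]}."""
    column = {name: j for j, name in enumerate(categories)}
    M = [[0] * len(categories) for _ in range(n_people)]
    for person, names in declared.items():
        for name in names:
            M[person][column[name]] = 1
    return M


# Toy data (hypothetical): 4 people, 3 hobbies.
hobbies = ["Acting", "Dancing", "Theatre"]
F = build_friend_matrix(4, [(0, 1), (1, 2)])
H = build_indicator_matrix(4, hobbies, {0: ["Acting", "Dancing"], 2: ["Theatre"]})
```

The same indicator builder would serve for the languages and places matrices, since they differ from H only in their columns.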
The hypothesis of the experiment is that there is a
correlation between mutually self-declared friendship
links in online social networks and attributes listed in
the profiles of said friends, presumably because of
homophily.
The GFHF algorithm is extremely sensitive to the
correctness of the weight matrices. Thus, GFHF allows
us to test our hypothesis.
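A rough sketch of why the weight matrix matters, assuming GFHF refers to the Gaussian fields and harmonic functions method of semi-supervised learning. In that method each unlabeled node takes the weighted average of its neighbours' scores while labeled nodes stay fixed. The sketch uses simple repeated averaging rather than the closed-form matrix solution; the function name and the toy graph are hypothetical.

```python
def harmonic_label_propagation(W, labels, iterations=200):
    """Approximate the harmonic solution on a weighted graph.

    W      : symmetric weight matrix as a list of lists
    labels : {node_index: 0 or 1} for the labeled nodes
    Returns one score in [0, 1] per node.
    """
    n = len(W)
    # Unlabeled nodes start at 0.5; labeled nodes are clamped to their label.
    f = [float(labels.get(i, 0.5)) for i in range(n)]
    for _ in range(iterations):
        new_f = f[:]
        for i in range(n):
            if i in labels:
                continue  # labeled nodes never change
            degree = sum(W[i])
            if degree > 0:
                # Weighted average of the neighbours' current scores.
                new_f[i] = sum(W[i][j] * f[j] for j in range(n)) / degree
        f = new_f
    return f


# Toy path graph 0 - 1 - 2 - 3 with node 0 labeled 1 and node 3 labeled 0.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = harmonic_label_propagation(W, {0: 1, 3: 0})
```

Because every prediction is an average over the links in W, replacing the friends network with a random matrix changes the result directly, which is what makes the comparison a test of the hypothesis.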
on the paper by Amir
et al. , 2010
Support Vector Machines
A classifier derived from statistical learning theory
, et al. in 1992
Currently, SVM is widely used in object detection
and recognition, content-based image retrieval, text
recognition, biometrics, speech recognition,
regression analysis, etc.
The Nature of Statistical Learning Theory
New York, 1992
The goal of classification is to use an
object's characteristics to identify which class (or
group) it belongs to.
A linear classifier achieves this by making a
classification decision based on the value of a linear
combination of the characteristics. An object's
characteristics are also known as feature values and are
typically presented to the machine in a vector called a
feature vector.
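A small Python sketch of the linear decision rule described above. The weights, bias and feature values are made up for illustration; an SVM would learn the weights and bias from training data.

```python
def linear_classify(weights, bias, features):
    """Return +1 or -1 from the sign of a linear combination of the features."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1


# Hypothetical weights and two feature vectors.
weights = [1.0, -1.0]
bias = 0.0
label_a = linear_classify(weights, bias, [2.0, 1.0])  # score = 1.0
label_b = linear_classify(weights, bias, [1.0, 3.0])  # score = -2.0
```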
SVM in test
MATLAB Support Vector Machine Toolbox
The toolbox provides routines for support vector
classification and support vector regression.
A GUI is included which allows the visualisation of
simple classification and regression problems. (The
MATLAB optimisation toolbox, or an alternative
quadratic programming routine is required.)
A Support Vector Machine run on sample data.
et al. on Text
Categorization with support vector machines.
Text categorization is the classification of documents
into a fixed number of predefined categories, where
each document can be in one, multiple or no category.
SVM’s well suited for the task with
SVMs are robust and do not require parameter tuning.
with support vector machines:
Learning with many relevant features
Why SVM’s Contd.?
SVMs are based on Structural Risk Minimization.
The idea of structural risk minimization is to find a
hypothesis h for which we can guarantee the lowest
true error, i.e. the probability that h will make an error on
an unseen, randomly selected test sample.
SVM’s are universal learners
Ability of learning is independent of the
dimensionality of the feature space.
With the use of a simple kernel function, they can be
used to learn polynomial classifiers.
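A short sketch of the kernel idea, assuming the homogeneous degree-2 polynomial kernel on 2-dimensional inputs. The kernel value equals an ordinary dot product in an expanded feature space, so a linear method applied with this kernel behaves like a polynomial classifier. The helper names are illustrative.

```python
import math


def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, degree=2):
    """Homogeneous polynomial kernel K(x, y) = (x . y) ** degree."""
    return dot(x, y) ** degree


def phi(x):
    """Explicit degree-2 feature map for a 2-dimensional input."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


x = [1.0, 2.0]
y = [3.0, 4.0]
kernel_value = poly_kernel(x, y)       # computed in the original 2-D space
explicit_value = dot(phi(x), phi(y))   # computed in the expanded 3-D space
```

Both values agree up to rounding, but the kernel never builds the expanded vectors, which is why the dimensionality of the feature space does not limit learning.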
SVM’s v/s MLP’s
et al. on Support Vector Machines
v/s Multiple Linear
in particle Identification in
SVMs are based on minimization of structural risk, whereas
MLPs are based on minimization of empirical risk.
1) The two show very similar performance: SVMs perform as
well as MLPs.
2) SVMs work well in the case of large training sets drawn from
input spaces of small dimension.
Support Vector Machines versus
in Particle Identification
Back to the Expt.
To accept or reject our research hypothesis, we
consider the prediction capability of GFHF using two
weight matrices:
Randomly generated binary weight matrix
Self-declared friends network
To incorporate the effect of other attributes, a Support
Vector Machine (SVM) is used along with GFHF.
Two feature sets are used with SVMs:
The set with only personal characteristics
The set with all the hobbies except the one being
predicted
GFHF is run 30 times, each time for a random
number of labeled data points
N = (10; 30; 50; 70; 90)
These predictions are calculated for all 26 hobbies
Therefore, for each weight matrix,
we get a
corresponding 26 x 5 x 30 matrix, where 26 is the
number of hobbies, 5 is the number of different labeled
set sizes and 30 is the number of trials.
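A sketch of the trial loop that produces the 26 x 5 x 30 result structure. The `evaluate` callback is a hypothetical placeholder; in the experiment it would run GFHF for one hobby with N labeled points and return an accuracy. The dummy version here returns a random number only so that the loop can run.

```python
import random


def run_trials(n_hobbies, label_sizes, n_trials, evaluate, seed=0):
    """Return results[hobby][size_index][trial] filled by evaluate()."""
    rng = random.Random(seed)
    results = [[[0.0] * n_trials for _ in label_sizes] for _ in range(n_hobbies)]
    for h in range(n_hobbies):
        for s, n_labeled in enumerate(label_sizes):
            for t in range(n_trials):
                results[h][s][t] = evaluate(h, n_labeled, rng)
    return results


def mean_over_trials(results):
    """Average the trial axis, leaving a hobbies x sizes table."""
    return [[sum(trials) / len(trials) for trials in per_size] for per_size in results]


def dummy_evaluate(hobby, n_labeled, rng):
    # Placeholder accuracy; a real run would call the predictor here.
    return rng.random()


results = run_trials(26, [10, 30, 50, 70, 90], 30, dummy_evaluate)
averages = mean_over_trials(results)
```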
Explanation of Results
The table shows the accuracy of running GFHF with the
random matrix and with the friends matrix
for 26 hobbies and across 3 different training set sizes
(numbers of labeled data points).
The numbers are averages over the 30 trials.
The final columns show differences in accuracy between
the methods compared, each as an
average across all training set sizes.
The results show that in most cases the friends matrix
performs significantly better than the random matrix,
which implies that the
underlying friends network is in fact important for
prediction.
For some hobbies, the difference in the performance of
the two matrices is extremely high. These are precisely the
hobbies that over 50% of the people in the network
have.
There are quite a few hobbies for which the friends
network does not provide any useful information.
We see that the friends network does not consistently
help over a random network if the hobby has a relative
incidence of 41% or less.
At 47% and above, the friends network consistently
outperforms the random network.
The results corresponding to
are also similar.
performs better than
, which performs
From this table we also observe that as we increase the
data, prediction accuracy increases for the SVM
Expt. Based on paper by
The data was gathered from a large online social
network.
The data is essentially in the form of a huge network of
interconnected nodes, with nodes representing actual
people or users and the ties between them denoting
relationships in the social network.
Each node also stores information regarding the
individual user. This information makes up the node or user
profile, and is essentially a list of attribute: value pairs.
Based Link Prediction in Social Networks. 2009
Statistics of Data Set
The nodes are distinguished as
Class of Near Nodes N(u)
Nodes within 2
Class of Far Nodes F(u)
All nodes other than Near
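A sketch of how the two classes could be computed with a breadth-first search. It assumes that "within 2" means within 2 hops of u; that reading, the adjacency-dictionary format and the toy graph are all assumptions.

```python
from collections import deque


def near_and_far(adjacency, u, max_hops=2):
    """Split all nodes other than u into near (within max_hops) and far."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == max_hops:
            continue  # do not expand beyond the hop limit
        for w in adjacency.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    near = set(dist) - {u}
    far = set(adjacency) - near - {u}
    return near, far


# Toy chain a - b - c - d - e (hypothetical).
adjacency = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
near, far = near_and_far(adjacency, "a")
```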
We introduce a t-bit vector associated with every pair
of nodes (to denote the attributes of a node), whereby
we place '1' at the i-th position if the two nodes match
on attribute A_i, or '0' if they do not match.
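A minimal sketch of the match vector for one pair of nodes. Profiles are assumed to be dictionaries of attribute: value pairs, and an attribute missing from either profile is treated as a non-match; both choices, and the sample profiles, are assumptions.

```python
def match_vector(profile_u, profile_v, attributes):
    """Return one bit per attribute: 1 if both profiles share the value."""
    bits = []
    for attr in attributes:
        a = profile_u.get(attr)
        b = profile_v.get(attr)
        # A missing value on either side counts as "no match".
        bits.append(1 if a is not None and a == b else 0)
    return bits


# Hypothetical attribute list and profiles.
attributes = ["city", "age_group", "religion"]
user_u = {"city": "Pune", "age_group": "20-30"}
user_v = {"city": "Pune", "age_group": "30-40"}
bits = match_vector(user_u, user_v, attributes)
```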
Now, for each attribute A_i
in the network, we define a 2 x 2
contingency matrix as shown in Table 3.1, whose four cells count:
Pairs of nodes in FS not matching on A_i
Pairs of nodes in FS matching on A_i
Pairs of nodes in NS not matching on A_i
Pairs of nodes in NS matching on A_i
χ² (chi-square) Measure
The statistical measure we use to detect homophily is the
χ² (chi-square) measure.
The χ² measure aggregates the deviation of observed values
from the expected values under the independence hypothesis.
The independence hypothesis in our case can be stated as:
“An attribute plays no role in classification of a node into
a particular class.”
A refers to a particular attribute, C refers to the classes defined,
refers to the number of
users in class having value m for attribute A and n refers to the total number of users.
The larger the χ² value, the lower is the
belief in the
independence hypothesis, and hence the larger is the role
played by the particular attribute in relationship formation.
We can rewrite the χ² measure in known
terms using the probabilities of each class/attribute
and the independence hypothesis as follows:
In this way, we calculate the χ² value associated with
each attribute in the network.
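A sketch of the computation for one attribute, assuming the standard Pearson form of the statistic: the sum over the four cells of (observed - expected)² / expected, where the expected count of a cell is its row total times its column total divided by the total number of pairs. The argument names follow the four cells of the contingency matrix; the counts are invented, and all row and column totals are assumed to be positive.

```python
def chi_square_2x2(fs_no_match, fs_match, ns_no_match, ns_match):
    """Pearson chi-square statistic for a 2 x 2 table of pair counts."""
    observed = [[fs_no_match, fs_match], [ns_no_match, ns_match]]
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if class and matching were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Invented counts: no association versus a strong association.
independent = chi_square_2x2(25, 25, 25, 25)
associated = chi_square_2x2(40, 10, 10, 40)
```

The first table gives a value of zero because every cell equals its expected count; the second gives a large value, so the independence hypothesis would be doubted for that attribute.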
The Odds Ratio
The χ² measure assesses how statistically unlikely the
lack of association between similarity on an attribute
and the probability of a social relationship is.
What the χ² measure cannot tell us is whether the
association is positive or negative.
Yet we need such a directional measure to test the
homophily hypothesis, which predicts a positive
association.
A negative relationship would imply negative
homophily, a tendency for individuals to associate, not
with similar others, but with different others.
We therefore also compute the odds ratio for each
attribute.
The odds ratio is simply the odds that two similar
individuals are connected divided by the odds that two
dissimilar individuals are connected.
The odds ratio for an attribute can be defined as:
odds ratio = odds(connected | similar) / odds(connected | dissimilar)
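A sketch of the odds ratio computed from the four contingency counts. It assumes that pairs in the near set stand for "connected" and pairs in the far set for "not connected", and that no count is zero; the counts themselves are invented.

```python
def odds_ratio(fs_no_match, fs_match, ns_no_match, ns_match):
    """Odds that similar pairs are connected over odds for dissimilar pairs."""
    odds_similar = ns_match / fs_match            # matching pairs: near vs far
    odds_dissimilar = ns_no_match / fs_no_match   # non-matching pairs: near vs far
    return odds_similar / odds_dissimilar


# Invented counts.
positive = odds_ratio(40, 10, 10, 40)   # similar pairs are connected more often
neutral = odds_ratio(25, 25, 25, 25)    # no association
```

A value above 1 indicates a positive association, as homophily predicts; a value below 1 indicates a negative association.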
Explanation of Results
Trends that are visible from the online social network results are
Geographical location is the strongest factor affecting how
relationships shape up in a social network.
The results also indicate that relationships are more likely to
develop between individuals belonging to the same age group.
Religious affiliation and ethnicity are also dominant factors in
relationship formation, as demonstrated by attributes like religion
and languages spoken by individuals.
Likings, hobbies etc. are less likely to influence how ties are made in
a social network. Relationships are less likely to be formed between
individuals who for example enjoy the same movies or music, read
the same books etc.
Advantages of analyzing
Some offline friendships may be absent in online social
communities, and are thus detectable. Friends may not know of
each other that they are members of the same online community.
This is especially true for young online communities or new
users to the system.
has a feature of ‘people you may
know’, with which, people who are possibly friends are suggested
to be connected online.
Advantages in target marketing and e-commerce are
straightforward. For example,
shows us ads on our profile which
are based on our profile information.
Information spread in
social networks is being used in diverse fields such as marketing
Link prediction may also turn out to be useful for
suggesting links that are likely to develop in the future,
thus steering the evolution of a social community.
In the case of large organizations or companies, there is
often an official hierarchy for collaboration and interaction.
Methods for link prediction could be effectively used to
uncover beneficial interactions or collaborations that have
not yet been fully utilized, which would otherwise be
hidden by this official hierarchy
It has been widely observed that social networks
We have seen how to detect homophily and
some important applications of this phenomenon.
More research has to be done on different kinds of
sample data to analyze homophily more accurately
and exploit it.
Predicting Interests of People on Online Social Networks. In
Proceedings of IEEE CSE 09, 12th IEEE International Conference
on Computational Science and Engineering, IEEE Computer
Society Press, Vancouver, Canada, 2009.
Based Link Prediction in Social Networks. 2009.
Miller McPherson, Lynn Smith-Lovin, and James M. Cook. Birds
of a feather: Homophily in social networks. Annual Review of
Sociology.
The Nature of Statistical Learning Theory. Springer,
New York, 1992.
Text categorization with support vector
machines: Learning with many relevant features, 1998.
et al., Support Vector Machines versus
in Particle Identification, 1999.
Learning in Vision. CVPR San Francisco, 2010