PPTX - Department of Computer Science and Engineering

Nov 30, 2013

Group Members:

07005029  Abhinav Gokari

07005030  Sudheer Kumar

07d05004  Ignatius Pereira

07d05019  Praveen Dhanala

Under the guidance of Prof. Pushpak Bhattacharyya

Outline

Motivation

Social Networks and Homophily

Experiments

Statistical Methods Used

Advantages of Analyzing Social Networks for Homophily

Conclusion

References

Motivation

Social networks are an ongoing phenomenon: Orkut, Facebook, Twitter, etc.

Almost 1/10th of the world's population uses Facebook.

There is great scope for innovation and development in Social Computing, which deals with creating social contexts through the use of software and technology.

Interesting problems arise, such as:

Social network analysis

Target marketing and improving e-commerce

Friendship (or relationship) suggestions


Social Networks

A social network is a social structure made up of individuals (or organizations) called "nodes", which are tied (connected) by one or more specific types of interdependency such as friendship, kinship, etc.

Online social networks are attribute-independent networks.

In online social networks, a relationship between two individuals is mutually self-defined and binary.

Friendship is not functional, and the reasons for it can be subtle, e.g. an offline/online meeting, a common workplace, or pure visual interest.


Homophily

Homophily is the tendency of individuals to associate or bond with others who have a similar set of interests or attributes ("birds of a feather flock together").

People choose friends who share common interests and characteristics.

One of the most general and least contested theoretical principles in sociology is the principle of homophily.




Homophily (contd.)

A social system is homophilous if contacts are more similar to one another than to strangers in terms of their individual attributes and behavior.

If homophily is a robust aspect of human behavior, it can be used to deduce a particular person's attributes from his/her friends' attributes in an online social network.

We shall now examine the following experiments to observe homophily in social networks.


Expt. based on the paper by Apoorv et al.

A "Travel Site" is chosen with mutually self-declared friendships.

A data set of 181 nodes is selected, with 1214 friendship links and additional information such as the users' attributes and characteristics.

Each user can select their hobbies from a list of 26 pre-defined hobbies.

Each user also has additional characteristics such as "language spoken", "country they live in", etc.

- Based on "Predicting Interests of People on Online Social Networks" by Apoorv et al., 2009.


Contd.

The information gathered on the website was:

Friends' Network: the mutually self-declared friends network matrix.

Hobbies: members declare their hobbies by clicking on boxes next to a list of 26 possible hobbies.

Languages Spoken: there is a list of 139 languages from which members select a maximum of three languages they speak.

Age group: the age group is given as ranges (for example under 20, 20-25, 26-30, etc.), from which the user chooses one. There are a total of 12 ranges.


Statistics about the data

F - This is the 181 × 181 friends network matrix. If person p1 has a friend p2, then F[p1, p2] will be 1; otherwise it will be 0.

H - This is the hobbies matrix, 181 × 26: 181 for the number of people and 26 for the number of different hobbies a person may have. For example, if p50 has three hobbies - Acting, Dancing and Theatre - then H[p50, Acting], H[p50, Dancing] and H[p50, Theatre] will be 1, and all the other cells in row H[p50] will be 0.

L - This is the languages-spoken matrix. It is exactly like H, with the only difference being that the columns here are the different languages a person can speak. This matrix is therefore 181 × 139.

P - This is the places-visited matrix. It is similar to H and L, with the only difference being that the columns here are the different places visited by a person. This matrix is therefore 181 × 263.
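The four matrices can be illustrated with a small sketch; the toy sizes and entries below are our own, not values from the paper's data set.

```python
import numpy as np

# Toy illustration of the matrix layout described above,
# using 5 people instead of 181.
n_people, n_hobbies = 5, 26

# F: symmetric binary friendship matrix, F[p1, p2] = 1 if p1 and p2 are friends.
F = np.zeros((n_people, n_people), dtype=int)
F[0, 1] = F[1, 0] = 1  # person 0 and person 1 are friends

# H: binary hobby matrix, H[p, h] = 1 if person p lists hobby h.
H = np.zeros((n_people, n_hobbies), dtype=int)
H[0, [2, 5, 7]] = 1    # person 0 declares three hobbies

# L (181 × 139) and P (181 × 263) follow the same one-hot layout,
# with languages and places along the columns.
print(F[0, 1], H[0].sum())  # 1 3
```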


The hypothesis of the experiment is that there is a correlation between mutual self-declared friendship links in online social networks and the attributes listed in the profiles of said friends, presumably because of homophily.

The GFHF algorithm is extremely sensitive to the correctness of the weight matrices; thus, GFHF allows us to test our hypothesis.
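As a minimal sketch of how GFHF propagates labels over a weight matrix, the standard harmonic-function solution (following Zhu et al.'s formulation; the function and variable names here are our own) solves for the unlabeled nodes given the labeled ones:

```python
import numpy as np

def gfhf(W, y, labeled):
    """Sketch of GFHF label propagation over weight matrix W.

    W: (n, n) symmetric weight matrix (e.g. the friendship matrix).
    y: (n,) label vector; only entries at `labeled` indices are used.
    labeled: indices of the labeled nodes.
    Returns the unlabeled indices and their predicted soft labels.
    """
    n = W.shape[0]
    unlabeled = np.setdiff1d(np.arange(n), labeled)
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian
    L_uu = L[np.ix_(unlabeled, unlabeled)]
    L_ul = L[np.ix_(unlabeled, labeled)]
    # Harmonic solution: f_u = -L_uu^{-1} L_ul f_l
    f_u = np.linalg.solve(L_uu, -L_ul @ y[labeled])
    return unlabeled, f_u

# Toy chain 0 - 1 - 2 with nodes 0 and 2 labeled 0 and 1:
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
idx, f = gfhf(W, np.array([0.0, 0.0, 1.0]), np.array([0, 2]))
print(idx, f)   # node 1 gets the harmonic average 0.5
```

Because the predictions are harmonic averages over the neighbors given by W, a wrong weight matrix directly corrupts the output, which is what makes GFHF a usable test of the hypothesis.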


GFHF Example

Based on the paper by Amir Saffari et al., 2010.

Support Vector Machines

A classifier derived from statistical learning theory by Vapnik et al. in 1992.

Currently, SVMs are widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, regression analysis, etc.

V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.



Linear classifier

The goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to.

A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector.
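A linear decision rule of this form can be sketched in a few lines; the weights and bias below are hypothetical values, not the result of any training.

```python
import numpy as np

# Minimal sketch of a linear classifier: the decision is the sign of a
# linear combination w.x + b of the feature vector x.
w = np.array([2.0, -1.0])   # weight vector (hypothetical values)
b = -0.5                    # bias term (hypothetical value)

def classify(x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(classify(np.array([1.0, 0.5])))   # 2*1 - 0.5 - 0.5 = 1.0 > 0, so 1
print(classify(np.array([0.0, 1.0])))   # -1 - 0.5 = -1.5 < 0, so -1
```

An SVM is one way of choosing w and b: it picks the hyperplane with the maximum margin between the two classes.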

SVM in test

MATLAB Support Vector Machine Toolbox

The toolbox provides routines for support vector classification and support vector regression.

A GUI is included which allows the visualisation of simple classification and regression problems. (The MATLAB optimisation toolbox, or an alternative quadratic programming routine, is required.)

http://www.isis.ecs.soton.ac.uk/isystems/kernel/

Support Vector Machine run on sample data.

http://users.ecs.soton.ac.uk/srg/publications/pdf/SVM.pdf

Why SVMs?

Experiment by Thorsten Joachims et al. on text categorization with support vector machines.

Text categorization is the classification of documents into a fixed number of predefined categories, where each document can be in one, multiple, or no category at all.

SVMs are well suited to categorization tasks with many features.

SVMs are robust and don't require parameter tuning.

Thorsten Joachims, Text Categorisation with Support Vector Machines: Learning with Many Relevant Features, 1998.

Why SVMs? (Contd.)

SVMs are based on the Structural Risk Minimization principle.

The idea of structural risk minimization is to find a hypothesis h for which we can guarantee the lowest true error, i.e., the probability that h will make an error on an unseen, randomly selected test sample.

SVMs are universal learners.

Their ability to learn is independent of the dimensionality of the feature space.

With the use of a simple kernel function, they can be used to learn polynomial classifiers.


SVMs vs. MLPs

Experiment by Barabino et al. on Support Vector Machines vs. Multilayer Perceptrons in particle identification in physics.

SVMs are based on minimization of structural risk, whereas MLPs are based on minimization of empirical risk.

Findings:

1) Very similar performance overall; the SVMs perform as well as the MLPs.

2) SVMs work well in the case of large training sets drawn from input spaces of small dimension.

M. Barabino et al., Support Vector Machines versus Multilayer Perceptrons in Particle Identification, 1999.



Back to the Expt.

To accept or reject our research hypothesis, we consider the prediction capability of GFHF using two weight matrices:

A randomly generated binary weight matrix (Gr)

The self-declared friends network (Gf)

To incorporate the effect of other attributes, a Support Vector Machine (SVM) is used along with GFHF.

Two feature sets are used with the SVMs:

The set with only personal characteristics (Sc)

The set with all the hobbies except the one being predicted (St)




Contd.

GFHF is run 30 times, each time for a random configuration of n_i labeled data points, where n_i ∈ N = {10, 30, 50, 70, 90}.

These predictions are calculated for all 26 hobbies under consideration.

Therefore, for each weight matrix, Gr and Gf, we get a corresponding 26 × 5 × 30 matrix, where 26 is the number of hobbies, 5 is the number of different labeled-data-point counts, and 30 is the number of trials.
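The evaluation loop above can be sketched as follows; run_gfhf is a hypothetical stand-in for a single GFHF run, included only to show how the 26 × 5 × 30 accuracy array is filled.

```python
import numpy as np

# Sketch of the evaluation loop: accuracy is recorded per hobby, per
# labeled-set size, per trial, yielding a 26 x 5 x 30 array.
N = [10, 30, 50, 70, 90]          # labeled-set sizes n_i
n_hobbies, n_trials = 26, 30

def run_gfhf(hobby, n_labeled, rng):
    # Hypothetical stand-in: a real run would propagate labels with GFHF
    # and return the prediction accuracy for this configuration.
    return rng.random()

rng = np.random.default_rng(0)
acc = np.empty((n_hobbies, len(N), n_trials))
for h in range(n_hobbies):
    for j, n_labeled in enumerate(N):
        for t in range(n_trials):
            acc[h, j, t] = run_gfhf(h, n_labeled, rng)

print(acc.shape)   # (26, 5, 30)
```

Averaging over the trial axis then gives the per-hobby, per-size numbers reported in the results table.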

Explanation of Results

The table shows the accuracy of running GFHF with the random matrix (Gr) and with the friends matrix (Gf) for 26 hobbies and across 3 different training set sizes (numbers of labeled data points).

The numbers are averages over the 30 trials with the same configuration.

The second-to-last column shows the average difference in accuracy between Gf and Gr across all training set sizes, and the last column shows the difference in accuracy between St and Gf, again averaged across all training set sizes.

Contd.

The results show that in most of the cases Gf performs significantly better than Gr, which implies that the underlying friends network is in fact important for prediction.

For some hobbies, the difference in the performance of Gf and Gr is extremely high. These are precisely the hobbies that over 50% of the people in the network have.

Contd.

There are quite a few hobbies for which the friends network does not provide any useful information.

We see that the friends network does not consistently help over a random network if the hobby has a relative incidence of 41% or less.

At 47% and above, the friends network consistently outperforms the random network.

Contd.

The results corresponding to Sc and St are also similar.

In general, St performs better than Sc, which in turn performs better than Gf.

From this table we also observe that as we increase the data, prediction accuracy increases for the SVM.



Expt. based on the paper by Akshay Patil

The data was gathered from a large online social networking site.

The data is essentially in the form of a huge network of interconnected nodes, with nodes representing actual people or users and the ties between them denoting relationships in the social network.

Each of the nodes also stores information regarding the individual user. This information makes up the node or user profile, and is essentially a list of attribute: value pairs.

- Akshay N Patil. Homophily Based Link Prediction in Social Networks. 2009.


Statistics of Data Set

Definitions

The nodes are distinguished as:

Class of Near Nodes N(u): nodes within a 2-hop radius.

Class of Far Nodes F(u): all nodes other than near nodes.

We introduce a t-bit vector associated with every pair of nodes (to denote the attributes of a node), whereby we place a '1' at the i-th position if the two nodes match on attribute Ai, or a '0' if they do not match.
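The t-bit match vector can be sketched as follows; the attribute names and profile values here are hypothetical, chosen only to illustrate the encoding.

```python
# Sketch of the t-bit match vector: for a pair of nodes, bit i is 1 when
# the two profiles agree on attribute A_i and 0 otherwise.
def match_vector(profile_u, profile_v, attributes):
    return [1 if profile_u[a] == profile_v[a] else 0 for a in attributes]

attrs = ["location", "age_group", "language"]   # hypothetical attributes
u = {"location": "Mumbai", "age_group": "20-25", "language": "Hindi"}
v = {"location": "Mumbai", "age_group": "26-30", "language": "Hindi"}
print(match_vector(u, v, attrs))   # [1, 0, 1]
```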



Contd.

Now, for each attribute Ai in the network, we define a 2 × 2 contingency matrix as shown in Table 3.1, where:

C00: pairs of nodes in FS not matching on Ai.

C01: pairs of nodes in FS matching on Ai.

C10: pairs of nodes in NS not matching on Ai.

C11: pairs of nodes in NS matching on Ai.

|Cij| = kij
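Filling these counts for one attribute can be sketched as below; the `pairs` input is an assumed representation (one (is_near, matches) flag pair per node pair), not a structure from the paper.

```python
# Sketch of filling the 2x2 contingency matrix for one attribute A_i.
# Layout: C[near][match], where near is 0 for FS pairs and 1 for NS pairs,
# and match is 0 for non-matching and 1 for matching pairs.
def contingency(pairs):
    C = [[0, 0], [0, 0]]
    for is_near, matches in pairs:
        C[int(is_near)][int(matches)] += 1
    return C

pairs = [(True, True), (True, False), (False, False), (False, True),
         (True, True)]
print(contingency(pairs))   # [[1, 1], [1, 2]]
```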

X² (chi-square) Measure

The statistical measure we use to detect homophily is the X² (chi-square) measure.

The X² measure aggregates the deviation of observed values from the expected values under the independence hypothesis.

The independence hypothesis in our case can be stated as follows: "An attribute plays no role in the classification of a node into a particular class Cij."

In the X² formula, Ai refers to a particular attribute, C refers to the classes defined, klm refers to the number of users in the class having value m for attribute A, and n refers to the total number of users.

The larger the X² value, the lower the belief in the independence hypothesis, and hence the larger the role played by the particular attribute in relationship formation.

We can rewrite the formula for the X² measure in known terms, using the probabilities of each class/attribute and the independence hypothesis.

In this way, we calculate the X² value associated with each attribute in the network.
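The X² computation on the 2 × 2 counts can be sketched as follows; the counts used here are hypothetical. Expected counts come from the independence hypothesis (row total × column total / n), and X² sums the squared deviations of observed from expected counts.

```python
# Sketch of the chi-square measure on a 2x2 contingency matrix C,
# with C[i][j] = k_ij as defined for the attribute under test.
def chi_square(C):
    n = sum(sum(row) for row in C)
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence: row total * column total / n
            expected = sum(C[i]) * sum(row[j] for row in C) / n
            x2 += (C[i][j] - expected) ** 2 / expected
    return x2

C = [[40, 10], [20, 30]]           # hypothetical counts
print(round(chi_square(C), 3))     # 16.667
```

A large value, as here, would count against the independence hypothesis for that attribute.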

The Odds Ratio

The X² measure assesses how statistically unlikely the lack of association between similarity on an attribute and the probability of a social relationship is.

The X² measure cannot tell us whether the association is positive or negative.

Yet we need such a directional measure to test the principle of homophily, which predicts a positive relationship.

A negative relationship would imply negative homophily: a tendency for individuals to associate not with those alike, but with different others.

We therefore also compute the odds ratio for each attribute.

The odds ratio is simply the odds that two similar individuals are connected, divided by the odds that two dissimilar individuals are connected.
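From the same 2 × 2 counts, the odds ratio can be sketched as below (the counts are hypothetical); a value above 1 indicates positive homophily on that attribute, a value below 1 negative homophily.

```python
# Sketch of the odds ratio from 2x2 counts with layout C[near][match]:
# C[1][1] = NS matching, C[0][1] = FS matching,
# C[1][0] = NS non-matching, C[0][0] = FS non-matching.
def odds_ratio(C):
    # Odds a matching pair is near, divided by odds a non-matching pair is near.
    return (C[1][1] / C[0][1]) / (C[1][0] / C[0][0])

C = [[40, 10], [20, 30]]           # hypothetical counts
print(odds_ratio(C))               # (30/10) / (20/40) = 6.0
```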

Explanation of Results

The trends visible from the online social network results are as follows:

Geographical location is the strongest factor affecting how relationships shape up in a social network.

The results also indicate that relationships are more likely to develop between individuals belonging to the same age group.

Religious affiliation and ethnicity are also dominant factors in relationship formation, as demonstrated by attributes like religion and the languages spoken by individuals.

Likings, hobbies, etc. are less likely to influence how ties are made in a social network; for example, enjoying the same movies or music, or reading the same books, is a much weaker predictor of a relationship.


Advantages of analyzing Homophily

Some offline friendships may be absent in online social communities, and are thus detectable: friends may not know that they are both members of the same online community. This is especially true for young online communities or new users of the system. Facebook has a 'people you may know' feature, with which people who are possibly friends are suggested as connections.

Advantages in target marketing and e-commerce are straightforward. For example, Orkut shows us ads on our profile which are based on our profile information. Information spread in social networks is also being used in diverse fields such as marketing campaigns.


Contd.

Link prediction may also turn out to be useful for suggesting links that are likely to develop in the future, thus steering the evolution of a social community.

In the case of large organizations or companies, there is often an official hierarchy for collaboration and interaction. Methods for link prediction could be effectively used to uncover beneficial interactions or collaborations that have not yet been fully utilized, and which would otherwise be hidden by this official hierarchy.

Conclusion

It has been widely observed that social networks exhibit homophily.

We have seen how to detect homophily, and some important applications of this phenomenon.

More research needs to be done on different kinds of sample data to analyze homophily more accurately and exploit it.

References

Apoorv Agarwal, Owen Rambow, Nandini Bhardwaj. Predicting Interests of People on Online Social Networks. In Proceedings of IEEE CSE 09, 12th IEEE International Conference on Computational Science and Engineering, IEEE Computer Society Press, Vancouver, Canada, 2009.

Akshay N Patil. Homophily Based Link Prediction in Social Networks. 2009.

Miller McPherson, Lynn Smith-Lovin, and James M. Cook. Birds of a Feather: Homophily in Social Networks. Annual Review of Sociology, 2001.

References (Contd.)

V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

Thorsten Joachims. Text Categorisation with Support Vector Machines: Learning with Many Relevant Features. 1998.

M. Barabino et al. Support Vector Machines versus Multilayer Perceptrons in Particle Identification. 1999.

Amir Saffari, Christian Leistner, Horst Bischof. Semi-supervised Learning in Vision. CVPR, San Francisco, 2010.

http://www.dtreg.com/svm.htm

Wikipedia