Open Problems in Relational Data Clustering


Example Data Sets

Prior Research



Join related objects to form independent compound objects, cluster normally (Yin et al., 2005).

Use attribute-based distance measures as weights in a relation graph; adapt a graph cutting algorithm to use edge weights (Neville et al., 2003).

Probabilistic relational model with an adapted EM algorithm (Taskar et al., 2001).

Calculate a hybrid metric that linearly combines relation similarity and attribute similarity, run single-link algorithm (Bhattacharya and Getoor, 2005).
University of Maryland Baltimore County

Adam Anthony aanthon2@umbc.edu

Marie desJardins mariedj@cs.umbc.edu

Overview



Data clustering is the task of detecting patterns in a set of data.

Most algorithms take non-relational data as input and are sometimes unable to find significant patterns.

Many data sets can include relational information, as well as independent object attributes.

Relational data clustering techniques can help find strong patterns in such sets.

Two areas of interest in relational data clustering are: clustering heterogeneous data, and relation selection.

Feature Space


A feature space is a set of objects with attributes,

$FS = \{o_1, o_2, \ldots, o_n\}$,

where

$o_i = \langle a_1, a_2, \ldots, a_m \rangle$

Internet Movie Database

Attributes include personal data such as awards
received, financial earnings, age, gender, or
Hollywood stock exchange rating. Examples of
relations are acted-in, directed, and sequel.

CIA World Factbook

Attribute values come from categories like
government, economics, and population. Relations
can be derived from sources such as common
membership in international organizations.



Relation Space



A relation space is a set of relation graphs,

$RS = \{RG_1, RG_2, \ldots, RG_K\}$,

where

$RG_i = \{O_i, R_i\}$,

$O_i \subseteq FS$, and $R_i$ is a set of edges for a specific relation
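
The two definitions can be mirrored in a small data structure. The sketch below is only one possible representation: the class name `RelationGraph`, the helper `is_valid_relation_space`, the object ids, the attribute values, and the choice of undirected edges are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RelationGraph:
    """One relation graph RG_i = {O_i, R_i} for a single relation."""
    name: str
    objects: set                              # O_i: ids of participating objects
    edges: set = field(default_factory=set)   # R_i: undirected edges (assumed)

    def add_edge(self, u, v):
        """Add an edge; both endpoints must already be in O_i."""
        if u not in self.objects or v not in self.objects:
            raise ValueError("both endpoints must belong to O_i")
        self.edges.add(frozenset((u, v)))


def is_valid_relation_space(feature_space, relation_space):
    """Check that O_i is a subset of FS for every relation graph."""
    all_ids = set(feature_space)
    return all(rg.objects <= all_ids for rg in relation_space)


# Feature space FS: object id -> attribute vector <a_1, ..., a_m>.
# The ids and values are made-up placeholders.
feature_space = {
    "o1": [0.2, 0.7],
    "o2": [0.3, 0.6],
    "o3": [0.9, 0.1],
}

# Relation space RS: one graph per relation.
rg1 = RelationGraph("relation_1", {"o1", "o2", "o3"})
rg1.add_edge("o1", "o2")
rg1.add_edge("o2", "o3")
rg2 = RelationGraph("relation_2", {"o1", "o3"})
rg2.add_edge("o1", "o3")
relation_space = [rg1, rg2]

print(is_valid_relation_space(feature_space, relation_space))
```

Keeping one edge set per relation, rather than a single merged graph, makes it possible to add or drop an entire relation without touching the others.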


Heterogeneous Data

It can be very difficult to compare objects of different types.
For example, how can actors be compared to directors? One
possibility is an inter-cluster relation signature.


Relation Selection

It is intuitive that, just as some features are not helpful for
clustering a data set, some relations might provide little
information for a relational clustering algorithm, or even harm
the performance of an algorithm. As relational clustering
algorithms continue to develop, detecting such graphs will
become more important.

Conclusion



Early research in relational clustering has been successful.



Analyzing relational patterns can help us develop methods
for comparing heterogeneous data objects.



Development of relation selection techniques will help
improve existing relational clustering algorithms.

1. Cluster one set of homogeneous data. This is the reference clustering.

2. For each object, create a vector that records the number of links from that object to each cluster discovered in step 1. This is the inter-cluster relation signature.

3. Cluster all objects based on the inter-cluster relation signatures.
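
The three steps above can be sketched as follows. The reference clustering of step 1 is assumed to be given, and step 3 is reduced to grouping objects with identical signatures, which is only a stand-in for a real clustering algorithm. All identifiers (`p1`, `m1`, and so on), function names, and links are hypothetical.

```python
from collections import defaultdict


def relation_signatures(objects, links, reference_clustering, num_clusters):
    """Step 2: count the links from each object to each reference cluster."""
    signatures = {obj: [0] * num_clusters for obj in objects}
    for source, target in links:
        signatures[source][reference_clustering[target]] += 1
    return signatures


def group_by_signature(signatures):
    """Step 3 (stand-in): put objects with identical signatures together."""
    groups = defaultdict(list)
    for obj, signature in signatures.items():
        groups[tuple(signature)].append(obj)
    return dict(groups)


# Step 1 (assumed already done): movies m1..m4 assigned to clusters 0, 1, 2.
reference_clustering = {"m1": 0, "m2": 0, "m3": 1, "m4": 2}

# Links from people (actors or directors) to movies, regardless of type.
people = ["p1", "p2", "p3", "p4"]
links = [
    ("p1", "m1"), ("p1", "m3"),
    ("p2", "m2"), ("p2", "m3"),
    ("p3", "m1"),
    ("p4", "m4"),
]

signatures = relation_signatures(people, links, reference_clustering, 3)
groups = group_by_signature(signatures)
print(signatures)
print(groups)
```

Because a signature depends only on which reference clusters an object links to, `p1` and `p2` end up together even if one is an actor and the other a director, which is the point of the technique.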

[Figure: relation graph linking countries (Botswana, Kenya, Thailand, Japan, China, US, UK, Italy) through shared membership in international organizations (AU, G-77, AsDB, G-8, UNSC).]

This research was funded by NSF grant #0545726.

The graph on the right includes an additional
relation graph (blue links) that represents the
World Trade Organization, which fully connects all
countries shown (redundant links omitted).

Including the WTO as one of the relation graphs
obscures the patterns that can be seen in the
graph on the left, making a clustering harder to
find.

We find this situation to be similar to cases in the
feature space where an attribute has the same
value for all objects. Removing the WTO graph
reduces the size of the total graph, and makes
finding patterns easier.
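
One simple way to detect such a relation graph is to measure its edge density and drop graphs that connect (nearly) every pair of objects. The sketch below is a hypothetical heuristic, not a method proposed on this poster: the `max_density` threshold, the function names, and the choice of the G-8 as the contrasting relation are assumptions for illustration.

```python
from itertools import combinations


def clique(members):
    """All undirected edges among members of one organization."""
    return {frozenset(pair) for pair in combinations(sorted(members), 2)}


def edge_density(num_objects, edges):
    """Fraction of all possible undirected edges that are present."""
    possible = num_objects * (num_objects - 1) // 2
    return len(edges) / possible if possible else 0.0


def select_relations(objects, relation_space, max_density=0.95):
    """Keep only relation graphs that do not connect almost everything."""
    return {
        name: edges
        for name, edges in relation_space.items()
        if edge_density(len(objects), edges) < max_density
    }


countries = ["Botswana", "Kenya", "Thailand", "Japan",
             "China", "US", "UK", "Italy"]

relation_space = {
    "G-8": clique(["US", "UK", "Italy", "Japan"]),
    "WTO": clique(countries),  # fully connects all countries shown
}

kept = select_relations(countries, relation_space)
print(sorted(kept))
```

A density test only catches the extreme case described here; a relation could be sparse and still be uninformative, so this should be read as a first filter rather than a full relation selection technique.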

[Figure: second relation graph over the same countries (Botswana, Kenya, Thailand, Japan, China, US, UK, Italy) and organizations (AU, G-77, AsDB, G-8, UNSC).]

[Figure: relation graphs linking Ron Howard, Norman Jewison, Carl Weathers, and Talia Shire through directed and acted-in relations, with labels Boxing, Comedy, and Drama.]

References

Bhattacharya, I., & Getoor, L. (2005). Entity resolution in graph data (Technical Report CS-TR-4758). University of Maryland.

Neville, J., Adler, M., & Jensen, D. (2003). Clustering relational data using attribute and link information. Proceedings of the Text Mining and Link Analysis Workshop.

Taskar, B., Segal, E., & Koller, D. (2001). Probabilistic classification and clustering in relational data. Proceedings of IJCAI-01, 17th International Joint Conference on Artificial Intelligence (pp. 870-878). Seattle, US.

Yin, X., Han, J., & Yu, P. S. (2005). Cross-relational clustering with user’s guidance. KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (pp. 344-353). New York, NY, USA: ACM Press.