X

disturbedtonganeseΒιοτεχνολογία

2 Οκτ 2013 (πριν από 3 χρόνια και 10 μήνες)

68 εμφανίσεις

A Visual Analytics Approach to Augmenting Formal Concepts
with Relational Background Knowledge in a Biological Domain

7
th

December


2010

Elma
Akand
*, Mike Bain, Mark Temple

*CSE,
UNSW/School
of Biomedical and Health
Sciences,UWS


1

The Sixth Australasian Ontology Workshop,
Adelaide
University of South Australia

Outline


Machine learning and data mining in bioinformatics



Domain Ontologies in biomedical applications



Formal Concept Analysis



MCW algorithm (Mining Closed
itemsets

for
Web apps)



BioLattice



a web based browser



Experimental Application:
systems biology


Part
-
1: Concept ranking by gene interaction




Part
-
2: Relational learning of multiple
-
stress rules


Machine learning & Data mining in Bioinformatics



Bioinformatics

“Bioinformatics is the study of information content and information flow in biological
systems and processes” (
Michael Liebman,1995
)


Machine Learning & Data mining

-
Can
offer automatic knowledge acquisition

-
Process
to discover knowledge by analyzing data from different perspectives and
can contribute greatly in building knowledge
base


Our work: focus on knowledge
-
based machine learning

-

Previous work
:

learning
from
ontologies


-

Current work
:

ontology
construction by
learning

-

Potential application areas:
ontologies



central
to eCommerce,
eHealth

-

Current application area:
s
ystems

biology



predict gene function, data
integration



Ontology



In philosophy
-

concerned with nature and relations of being




In knowledge representation
-

study of categorization of things:



Informal Ontology

Formal Ontology

Natural language

First order logic or a
variant

Upper Ontology

Domain Ontology


Specific


General

Ontology

Ontology



"s
pecification of
a
conceptualization”
(Gruber, 1993)


Conceptualization



"formalization of knowledge in declarative
form”


(Genesereth and Nilsson, 1987)

Gene Ontology


Missing
concepts and

relations



One
gene
annotated with different
GO
terms
with a term specialization of other






a

b

x

y

x

gene: x

concepts : a ,b

relations :


(i)

x


a



(
ii)

x


b

and


(iii)

b



a

Formal Concept
Analysis (FCA)


Mathematical
order theory
(
Rudolf
Wille

in the early 80s)



-
D
erives

conceptual

structures

out

of

data



-
Method
for data analysis, knowledge representation and information
management



Components

-
Formal
context, concept , concept lattice




four
-
legged

hair
-
covered

intelligent

marine

thumbed

cats

x

x

dogs

x

x

dolphins

x

x

gibbons

x

x

x

humans

x

x

whales

x

x

Formal concepts in a concept lattice

({
cats, gibbons, dogs, dolphins,
humans, whales
}
, {

}
)

Bottom

({
gibbons, dolphins, humans,
whales
}
,
{
intelligent
})

({
dolphins, whales
}
,
{
intelligent, marine
})

({
cats, gibbons, dogs
}
,
{
hair
-
covered
}
)

({
cats, dogs
}
,
{
hair
-
covered, four
-
legged
}
)

({
gibbons, humans
}
,
{
intelligent, thumbed
})

({
gibbons
}
,
{
intelligent,
hair
-
covered,
thumbed
})

({

}
,
{
intelligent, hair
-
covered,
thumbed, marine, four
-
legged
})

2

1

5

6

Top

3

4


Formal context: an
n

by
m

Boolean matrix

m attributes

A

columns

n objects


O

rows



Formal concept:
Galois
connection



<
X,
Y> X
is a subset of A, Y is a subset of O




Concept
lattice
loosely interpretable
in ontology
terms
:

concept
definitions and


cf.
T
-
box

sub
-
concept
relations



concept
membership


cf.
A
-
box

by
objects




FCA
in data mining


FCA can be seen as a clustering technique in machine learning

-
Most
of the work is in

a propositional framework




In data mining closed
itemset

mining

is an
e
fficient
alternative to
FCA


A frequent
itemset

X is closed if there exists no proper superset Y such that

Y
⊃X
with support(Y)=support(X)


E.g., if X = {a,b,c,d
}

and Y ={a,b,c,d,e} and
support(Y)=support(X
), then
X is not closed



Parameters to avoid
building entire lattice

-
Extent
size must be greater than
minsup







E
xisting closed
itemset

mining algorithms

-
Data structures to speed up closed
itemset

mining

-
But may
not build lattice, or include extents


MCW algorithm (Mining Closed
itemsets

for Web apps)


V
ertical data format



IT
-
tree
(
itemset
-
tidset

tree) search space



-
node has
X

x

t
(
X
)
and all children have prefix
X



Pruning

-

4 set difference closure operators



Subsumption

check


-

A look
-
up table to record all attributes and their occurrences
in closed concepts





Lattice



-

adding concepts following a general to specific order


D

2

4

5

6

A

1

3

4

5

C

1

2

3

4

5

6

T

1

3

5

6

W

1

2

3

4

5

attribute

Concept_id

D

C1,C2

T

C3,C4

A

C4,C5

W

C2,C4,C5,C6

C

C1,C2,C3,C4,
C5,C6,C7


Is {TA}{135} closed?

i(135)={TAWC}


Closure operators

{TA}{135}={TW}{135}
-
>{TAW}{135}

{D}{2456}

{C}{123456}
-
>{DC}{2456}

{D}{
2456}
and
{W}{12345}
-
>{DW}{245}

D

2

4

5

6

A

1

3

4

5

C

1

2

3

4

5

6

T

1

3

5

6

W

1

2

3

4

5

Based on CHARM (
Zaki
, 2005)


Visual analytics


-
combination of information visualization with machine learning and data analysis (Keim
et al., 2008)



V
isualization

of concept lattice

-

provides overview of the structure of the domain

-
means for further data analysis, e.g., classification, clustering, implication discovery, rule
learning



Previous work

-

lattice navigation since Godin et

al. (1993)

-
Browsable

concept lattice, e.g., Kim
&
Compton (
2004)




Our current work

-

on augmenting concept lattice by integrating multiple sources of knowledge (Gene
Ontology, protein interactions) for further analysis & machine learning

Concept lattice as a visual
analytics approach

Case study: Yeast systems biology

Browsable

concept lattice

m
ore
general

Biological validation (1) : synthetic lethality

Synthetic lethal interaction

if cell is viable when either
gene A or B are individually
deleted, but cannot grow
when both are deleted.

Our results show that 72 (119) concepts in the lattice more likely than
random chance at
p

< 0.01
(
p

< 0.05
) to contain synthetic lethal pairs.

Protein
-
protein interaction data

Microarray gene
-
expression data

Transcription factor binding
data (
ChIP
-
chip)

Ontology data

Biochemical pathway data

Inductive
Logic
Programming

concept(A):
-

ppi
(B,A,C),
ppi
(B,A,E
),
ppi
(B,C,E)

tfbinds
(D,C),
fbinds
(F,E)

First
-
order rule

Biological validation (2) : ILP learning of concept definitions

Transcription
factors

RSM19 required for
H2O2 response;
RSM19, RSM22 and
MRPS17 in
“mitochondrial
ribosomal small
subunit” stable
complex; and
RSM22, MRPS17
bound by
transcription factors
under amino acid
starvation.

Example rule
:

Conclusions


Many real
-
world domains are data
-
intensive



Machine learning and data mining applications required to generate predictive and
useful outputs



We focus on knowledge
-
based learning for comprehensibility


use ontologies



Formal concept analysis as a framework for ontology structure



Use data mining techniques for efficient concept lattice generation



Visual analytics approach:
browsable

lattice, added background knowledge



Initial validation on a case study from yeast systems biology




Investigate pseudo
-
intents to simplify concept lattice



Investigate variants of concept lattice structures

-
e.g., concept lattice of inverse context



Add concept definitions to background knowledge in ILP





Future work