Learning probabilistic Datalog rules for information classification and transformation

Henrik Nottelmann (nottelmann@ls6.cs.uni-dortmund.de)
Norbert Fuhr (fuhr@ls6.cs.uni-dortmund.de)
Department of Computer Science, University of Dortmund, 44221 Dortmund, Germany
ABSTRACT
Probabilistic Datalog is a combination of classical Datalog (i.e., function-free Horn clause predicate logic) with probability theory: probabilistic weights may be attached to both facts and rules. But it is often impossible to assign exact rule weights or even to construct the rules themselves. Instead of specifying them manually, learning algorithms can be used to learn both rules and weights. In practice, these algorithms are very slow because they need a large example set and have to test a high number of rules. We apply a number of extensions to these algorithms in order to improve efficiency. Several applications demonstrate the power of learning probabilistic Datalog rules, showing that learning rules is suitable for low-dimensional problems (e.g., schema mapping) but inappropriate for higher dimensions as e.g. in text classification.
1. INTRODUCTION
The logical view on information retrieval (IR) treats retrieval as inference. For a query q, the system searches for documents d that logically imply the query, i.e. for which the logical formula d → q is true. Due to the intrinsic vagueness and imprecision in IR, a logic that allows for uncertain reasoning should be used. In [16], a probabilistic approach is discussed for this purpose; document retrieval can thus be based on the estimation of the probability Pr(d → q).
This probabilistic approach is combined with Datalog in [3]. Datalog is a (function-free) variant of Horn predicate logic which is widely used for deductive databases. In probabilistic Datalog, probabilistic weights may be attached to both facts and rules. As rule weights can only be used with certain restrictions, [17] describes another approach for using uncertain logical rules in general, based on exponentially many conditional probabilities.
For evaluating probabilistic Datalog programs, we developed the inference engine HySpirit.
In practice, it is often very difficult to calculate the rule weights or even the rule bodies themselves. Applying methods from the field of machine learning can remedy this problem. [17, 18] describe two algorithms for learning the rule probabilities and the rules themselves, respectively.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 2001 ACM 0-89791-886/97/05 ...$5.00.
One drawback of these algorithms is that they are very slow. There are four major reasons for this behaviour:
1. The most expensive operations are the rule evaluations performed with HySpirit.
2. To obtain a guaranteed quality for the probability learning algorithm, a large example set is needed. Thus, each evaluation will take a long time.
3. The rule learning algorithm has to evaluate a high number of rules.
4. The rules to be tested may be rather complex because they are generated automatically. Length and complexity of a rule have a direct impact on runtime.
In this paper, we describe several extensions of the basic algorithms which lead to efficiency improvements of several orders of magnitude.
A number of applications demonstrate the feasibility of our approach, such as different (text) classification tasks and the problem of schema mapping in digital libraries.
The rest of this text is organised as follows: The next section gives a brief overview of probabilistic Datalog and the new approach of handling rule probabilities. Section 3 introduces the two algorithms for learning rule weights and rule bodies, respectively. Our improvements are described in section 4. Then, we present some applications and finish with a conclusion and an outlook on future work.
2. PROBABILISTIC DATALOG
Datalog (see [14]) is a variant of predicate logic based on function-free Horn clauses. Negation is allowed, but its use is limited in order to achieve a correct and complete model. Rules have the form h ← b_1 ∧ … ∧ b_n, where h (the head) and the b_i (the subgoals of the body) denote literals with variables and constants as arguments. A rule can be seen as the clause {h, ¬b_1, …, ¬b_n}. A fact f is a rule with only constants in the head and an empty body (the ← can be omitted in this case).
In probabilistic Datalog (see [3]), every fact or rule has a probabilistic weight attached, prepended to the fact or rule. Evaluation is based on the notion of event keys and event expressions. Facts and instantiated rules are basic events; each of them is assigned an event key. Each derived fact is associated with an event expression that is a Boolean combination of the event keys of the underlying basic events.
Problems arise if more than one rule is formulated for the same head predicate: Given only two rules for a predicate p specifying Pr(p|r_1) and Pr(p|r_2), we are not able to estimate Pr(p|r_1 ∧ r_2) for the case when the bodies of both rules are fulfilled (see [3]). Thus, we have to know Pr(p|¬r_1 ∧ ¬r_2), Pr(p|r_1 ∧ ¬r_2), Pr(p|¬r_1 ∧ r_2) and Pr(p|r_1 ∧ r_2). This is equivalent to the situation in probabilistic inference networks, where a link matrix with the same conditional probabilities has to be given.
EXAMPLE 1. The following rules state that two documents are semantically related iff they have the same author or are linked together:

  sameauthor(D1,D2) ← author(D1,A) ∧ author(D2,A);
  0.5 related(D1,D2) ← link(D1,D2);
  0.2 related(D1,D2) ← sameauthor(D1,D2).

But since two documents may share the same author and be linked together as well, we have to reformulate the rule set in order to cover all possible cases:

  sameauthor(D1,D2) ← author(D1,A) ∧ author(D2,A);
  0.5 related(D1,D2) ← link(D1,D2) ∧ ¬sameauthor(D1,D2);
  0.2 related(D1,D2) ← ¬link(D1,D2) ∧ sameauthor(D1,D2);
  0.7 related(D1,D2) ← link(D1,D2) ∧ sameauthor(D1,D2).
In general, we have n rules with the common (target) head predicate p. To specify rule #i, its head predicate is renamed to p_i. With the vector X⃗ = (X_1, …, X_l)^T, where each X_k is a variable, p(X⃗) and p_i(X⃗) are atoms. Then, the renamed rules together with the n rules p(X⃗) ← p_i(X⃗) are equivalent to the original rule set. Further, we set Γ := {p_1, …, p_n}. Then, we need 2^n conditional probabilities (see [17]):

DEFINITION 2. The function γ: 2^Γ → [0,1] is defined as

  γ(A) := Pr(p(X⃗) | ⋀_{a∈A} a(X⃗) ∧ ⋀_{a∈Γ∖A} ¬a(X⃗)).

Furthermore, we define γ(∅) := 0.
EXAMPLE 3. For the second rule set in example 1, we have

  γ(∅) = 0,  γ({related_1}) = 0.5,  γ({related_2}) = 0.2,
  γ({related_1, related_2}) = 0.7.

These probabilities have to be integrated into probabilistic Datalog (coding them into rules and/or facts) so that the uncertain rules can be evaluated e.g. by HySpirit without changes.
[17, p. 254] and [7, pp. 103-105] show that the obvious construction (using the γ values as probabilities of new rules, see example 1) does not necessarily yield modularly stratified programs (and, thus, cannot be evaluated by HySpirit, see [12]).
Instead, Wüthrich transforms the 2^n - 1 probabilities into 2^n - 1 additional facts f(A), ∅ ≠ A ⊆ Γ, with probabilities x(A):
DEFINITION 4. The weight x(A) is defined recursively (using x(B) for ∅ ≠ B ⊊ A) as

  x(A) = ( Σ_{∅≠B⊆Γ} (-1)^(|B|-1) γ(B) - Σ_{∅≠B⊆Γ∖A} (-1)^(|B|-1) γ(B) ) / Π_{∅≠B⊊A} x(B).
Then, the transformed rule set is constructed according to the following scheme:

  p(X⃗) ← p_i(X⃗) ∧ ⋀_{{p_i}⊆A} f(A);
  p_i(X⃗) ← …;
  x(A) f(A).

The correctness of this construction is proven in [17] (for the case of three rules) and in [7] (for n rules).
3. LEARNING PROBABILISTIC DATALOG RULES
In this section, we present two algorithms for learning rule probabilities (i.e., γ values) and probabilistic Datalog rules, respectively. They have a similar structure, which we describe in the first subsection.

3.1 Structure of the learning algorithms
The input to the algorithms consists of a set of n rules with a common head predicate (the target predicate) p. A hypothesis h_P is a non-empty set P of rules with head predicate p (with renamed head predicates p_i, respectively) and corresponding γ_P values. The hypothesis space H is a set of hypotheses h_P.
Examples are probabilistic facts for the target predicate. The algorithms have no direct access to these examples, but can call the oracle EXAMPLES. EXAMPLES returns an arbitrary example c⃗ and a label l ∈ {0,1} with Pr(l = 1) = Pr(p(c⃗)). Furthermore, the algorithms have access to a background knowledge base KB containing rules and facts (used to evaluate the rules, e.g. with HySpirit).
We search for the hypothesis h_P ∈ H with minimum expected quadratic loss (the prediction error of Pr(p(c⃗)), where c⃗ is a randomly drawn constant vector, which has some well-known, useful properties), but we must use approximations for the γ_P values and for the expected quadratic loss, respectively.
In the rest of this section, EX = {(c⃗_i, y_i)}, y_i ∈ {0,1}, is a set of examples drawn with EXAMPLES.
DEFINITION 5. Let EX' ⊆ EX contain all (c⃗_i, y_i) for which ⋀_{a∈A} a(c⃗_i) ∧ ⋀_{a∈P∖A} ¬a(c⃗_i) holds. Then, the approximation of γ_P(A) is defined as

  γ_P(A) := |{(c⃗_i, y_i) ∈ EX' | y_i = 1}| / |EX'|.

Thus, we need |P| evaluations (of the single rules p_i ∈ P) for computing x(A).
EXAMPLE 6. Let P := {p_1, p_2}, and consider ten examples, five labelled positive and five negative, with the following truth values of the rule bodies:

             positive   |  negative
  p_1:   0 1 1 1 0  |  1 0 1 0 1
  p_2:   1 1 1 1 0  |  1 1 0 0 0

Then γ_P({p_1}) = 0, γ_P({p_2}) = 0.5, γ_P({p_1, p_2}) = 0.75.
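The estimation of definition 5 can be sketched in Python (a hypothetical helper, not part of HyLearner; the rule evaluations are represented by precomputed sets of fired rule bodies per example, using one set of ten examples consistent with the stated γ values):

```python
def gamma_P(examples, A, P):
    """Approximate gamma_P(A) as in definition 5: the fraction of positive
    labels among the examples where exactly the rules in A (and no other
    rule in P) fire."""
    ex_prime = [y for fired, y in examples if fired & P == A]
    return sum(ex_prime) / len(ex_prime) if ex_prime else 0.0

P = frozenset({"p1", "p2"})

def fired(p1, p2):
    """Set of rules whose bodies hold for an example."""
    return frozenset(r for r, v in (("p1", p1), ("p2", p2)) if v)

# five positive and five negative examples, as in example 6
examples = ([(fired(0, 1), 1), (fired(1, 1), 1), (fired(1, 1), 1),
             (fired(1, 1), 1), (fired(0, 0), 1)] +
            [(fired(1, 1), 0), (fired(0, 1), 0), (fired(1, 0), 0),
             (fired(0, 0), 0), (fired(1, 0), 0)])

print(gamma_P(examples, frozenset({"p1"}), P))        # 0.0
print(gamma_P(examples, frozenset({"p2"}), P))        # 0.5
print(gamma_P(examples, frozenset({"p1", "p2"}), P))  # 0.75
```

Note that one pass over the examples yields all γ_P(A) estimates at once, since each example contributes to exactly one subset A.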
DEFINITION 7. The empirical quadratic loss of a hypothesis h_P with examples EX is defined as

  E_EX[Q_{h_P}] := (1/|EX|) Σ_{i=1}^{|EX|} (Pr(h_P(c⃗_i)) - y_i)².
Pr(h_P(c⃗_i)) is the probability of p(c⃗_i) as computed by HySpirit using the hypothesis h_P. We need only one evaluation of the transformed rule set for computing the empirical quadratic loss.
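Definition 7 amounts to a mean squared error over the drawn examples; a minimal sketch, with the hypothesis probabilities Pr(h_P(c⃗_i)) assumed to be precomputed (e.g. by HySpirit):

```python
def empirical_quadratic_loss(predictions, labels):
    """E_EX[Q_h]: mean squared difference between the hypothesis
    probability Pr(h_P(c_i)) and the 0/1 label y_i (definition 7)."""
    assert predictions and len(predictions) == len(labels)
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

# e.g. three examples with predicted probabilities and labels
print(empirical_quadratic_loss([0.9, 0.2, 0.5], [1, 0, 1]))
# ~ 0.1  (= (0.01 + 0.04 + 0.25) / 3)
```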
For both approximations, we draw a new example set EX. The resulting generic structure is depicted in figure 1.
3.2 Algorithm 1
Wüthrich presents in [17] a PAC algorithm which learns γ values. PAC (probably approximately correct) learning (introduced in [15] for deterministic propositional logic) is often used in the field of machine learning. PAC algorithms return a nearly correct result only in most cases (and can fail otherwise), but guarantee the goodness properties described below.
The algorithm is based on the generic architecture. It also takes two parameters ε and δ as input. The hypothesis space is H = {h_P | ∅ ≠ P ⊆ Γ}. For both approximations, we draw poly(1/ε, 1/δ) new examples EX. Then, the output of the algorithm is with probability 1 - δ an ε-good model of probability (i.e., the probability that the error is greater than ε is at most δ).

  INPUT: target predicate p, background knowledge base KB,
         examples EXAMPLES, hypothesis space H
  OUTPUT: optimum hypothesis h ∈ H with minimum loss E_EX'(Q_h)
  ALGORITHM:
  l := ∞
  for(all h' ∈ H) do
    h'' := learnGammas(p, h', KB, EXAMPLES)
    l'' := computeLoss(h'', KB, EXAMPLES)
    if(l'' < l) then
      l := l''
      h := h''
      if(l = 0) then stop fi
    fi
  od

  Figure 1: Generic structure of both algorithms

We implemented this probability learning algorithm (with the modification described in section 4.1) as algorithm 1.
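The generic structure of figure 1 is a simple argmin loop over the hypothesis space; in Python terms (learnGammas and computeLoss stand in for the expensive HySpirit evaluations and are assumptions here):

```python
import math

def learn_optimum(hypotheses, learn_gammas, compute_loss):
    """Generic structure of both algorithms (figure 1): fit the gamma
    values for every hypothesis and keep the one with minimum loss."""
    best, best_loss = None, math.inf
    for h in hypotheses:
        h_fitted = learn_gammas(h)      # estimate the gamma_P values
        loss = compute_loss(h_fitted)   # empirical quadratic loss
        if loss < best_loss:
            best, best_loss = h_fitted, loss
            if best_loss == 0:          # perfect hypothesis: stop early
                break
    return best, best_loss

# toy usage: hypotheses are plain numbers, "fitting" is the identity,
# and the loss is the distance to 3
h, l = learn_optimum([1, 5, 3, 4], lambda h: h, lambda h: abs(h - 3))
print(h, l)  # 3 0
```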
3.3 Algorithm 2
We also adapted a second algorithm by Wüthrich which learns not only the conditional probabilities of given rules but also the rules themselves (see [18]). Learning (certain) logical rules (inductive logic programming, ILP) is well studied in the field of machine learning (e.g., see [5]), but the methods have to be modified for probabilistic Datalog.
We use the common specialisation order on rules to avoid testing all possible rules with head predicate p:

DEFINITION 8. A rule r_1 is a specialisation of a rule r_2, r_1 ⊑ r_2, iff there is a substitution θ with r_2 θ ⊆ r_1 (viewing a rule as a clause, i.e. as a set of literals containing the head and the negations of the subgoals).
A rule r_1 is a most general (direct) specialisation of a rule r_2, r_1 ⊏ r_2, iff for every rule r with r_1 ⊑ r ⊑ r_2, r ≡ r_1 or r ≡ r_2 holds.
E.g., given the rules

  r_1: father(X,Y) ← parent(X,Y);
  r_2: father(X,X) ← parent(X,X);
  r_3: father(X,Y) ← parent(X,Y) ∧ parent(Y,Z);
  r_4: father(X,X) ← parent(X,X) ∧ parent(X,Z);

the rules r_2 and r_3 are both (direct) specialisations of r_1 (using θ = {X/X, Y/X} and θ = {X/X, Y/Y}, respectively), and r_4 is a direct specialisation of r_2 and r_3.
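The subsumption test of definition 8 can be sketched by a brute-force search over substitutions (an illustrative helper, not from the paper; rules are represented as clauses, i.e. sets of (name, args) literals, with subgoals negated by a leading '~' and variables written as uppercase names):

```python
from itertools import product

def terms(clause):
    """All argument terms occurring in a clause."""
    return {t for _, args in clause for t in args}

def is_specialisation(r1, r2):
    """r1 is a specialisation of r2 iff some substitution theta maps the
    clause r2 into a subset of the clause r1 (definition 8)."""
    vars2 = sorted(t for t in terms(r2) if t[0].isupper())
    targets = sorted(terms(r1))
    for combo in product(targets, repeat=len(vars2)):
        theta = dict(zip(vars2, combo))
        image = {(name, tuple(theta.get(a, a) for a in args))
                 for name, args in r2}
        if image <= r1:
            return True
    return False

# r_1: father(X,Y) <- parent(X,Y);  r_2: father(X,X) <- parent(X,X)
r_gen = frozenset({("father", ("X", "Y")), ("~parent", ("X", "Y"))})
r_spec = frozenset({("father", ("X", "X")), ("~parent", ("X", "X"))})
print(is_specialisation(r_spec, r_gen))  # True (theta = {X/X, Y/X})
print(is_specialisation(r_gen, r_spec))  # False
```

The exhaustive enumeration of substitutions is exponential in the number of variables; it is meant only to make the definition concrete.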
There are two ways to generate a direct specialisation:
1. A new literal is appended to the rule. Recursion is allowed, but then the literal must be positive in order to obtain a modularly stratified Datalog program.
2. Instead of appending a literal =(X,Y), where X and Y already occur in the rule, all occurrences of X are substituted by Y. Both ways lead to equivalent rules, but evaluation of the rule with substitution is faster.
We now first present the original algorithm of Wüthrich before we describe our modification.
The algorithm of Wüthrich consists of two nested loops:
1. The inner loop generates a locally optimum rule. Starting with the most general rule r, the direct specialisation h_opt of r (without γ values) with minimum expected quadratic loss is computed. This step is repeated (via the assignment r := h_opt) until no improvement can be achieved, or until the rule reaches a specified length (number of subgoals).
Afterwards, the inner loop assigns the rule h_opt a conditional probability (note that this is not a γ value).
2. The outer loop just calls the inner loop n times to learn n rules.
Finally, the conditional probabilities and several approximations are used to solve an equation system of 2^n - 1 equations in the 2^n - 1 unknown γ values.
This algorithm only returns a local minimum and possibly misses the global optimum. The advantage is the gain in efficiency, due to not testing all possible hypotheses (i.e., all rules with head predicate p).
We modify this algorithm to reuse the generic structure. Now, the inner loop simply calls the generic algorithm with H := {direct specialisations of r} (where r is the current optimum hypothesis). As a consequence, our algorithm learns the γ values for every direct specialisation, and the loss is determined w.r.t. these probabilities. The final step (solving the equation system) is no longer necessary. Our rule learning algorithm is called "algorithm 2" in the rest of this text (see figure 2).
The major drawback of our algorithm is that it is not as fast as Wüthrich's (because the weights are computed for every hypothesis instead of only for the locally optimum rule set), but Wüthrich's way of determining the γ values leads to further inaccuracy.
4. EFFICIENCY IMPROVEMENTS
We developed the system HyLearner that implements these algorithms (calculation of the additional fact weights, learning rule probabilities, learning rule bodies). It became clear that the original learning algorithms are much too slow, for several reasons: we need a large example set to have good guaranteed results, we have to test many rules which might be rather complex due to the automatic generation (for rule learning), and the evaluation of each hypothesis is very expensive. In the following, we describe six solutions improving efficiency, which we integrated into HyLearner.

  INPUT: target predicate p, knowledge base KB,
         examples EXAMPLES, number n of rules to learn
  OUTPUT: transformed rule set R, loss l
  ALGORITHM:
  R := empty rule set
  l := 0
  for(i := 1; i <= n; i := i + 1) do
    l := ∞
    r := empty rule for p
    do
      H := {R ∪ {r'} | r' direct specialisation of r}
      (h_r, l) := learnOptimum(p, KB, EXAMPLES, H, l)
    while(better rule found)
    R := h_r
  od

  Figure 2: Rule learning algorithm (algorithm 2)
4.1 Reducing the number of examples
The rule probability learning algorithm (algorithm 1) guarantees some effectiveness properties described in [17]. The number of examples needed is polynomial in 1/ε and 1/δ. E.g., for loss computation we need at least

  16 · (1/ε)^6 · (1 - (1-δ)^(1/2^(n+1)))^(-1)

examples. Thus, useful parameters result in a high number of required examples: With ε = δ = 0.1 and n = 3 rules, we need at least 100 000 examples.
To speed up both evaluation in HySpirit and post-processing in the algorithm, we abandon this guarantee in our implementation. We allow the user to specify as many examples as she wants, and use all these examples for learning and loss computation (instead of drawing a new sample every time).
4.2 Direct specialisations of the most general rule
The rule learning algorithm (algorithm 2) first tests the direct specialisations of the most general rule (i.e., the rules with the shortest bodies). These starting rules can of course be constructed automatically, but then there are exponentially many starting rules, most of them being absurd. E.g., when learning the father relation with the EDB predicates person, sex, likes and sport, the rule

  father(X,Y) ← sport(X) ∧ sport(Y)

is a direct specialisation of the most general rule. But this rule does not make sense, as sports and persons are totally different topics.
To avoid generating a lot of absurd starting rules, this task is delegated to the user. It is very easy for her to generate reasonable starting rules, particularly when there are special relations for the domains.
In the example above, both arguments of the binary relation father should contain persons, thus a reasonable starting rule is

  father(X,Y) ← person(X) ∧ person(Y).

Of course, there may be more than one starting rule in general. In this case, the algorithm uses all of them.
4.3 Replacement patterns
It is often possible to simplify learned rules, e.g. to replace parts of a rule (a set of subgoals) by other literals, obtaining an equivalent but shorter rule. The evaluation of a rule requires the computation of a join over all relations used in the rule body. The join of m relations with n_i tuples each can be determined in time O(Π_i n_i). Thus, a shorter body leads to faster evaluation.
E.g., the two rules r_1 and r_2, given as

  r_1: father(X,Y) ← person(X) ∧ person(Y) ∧ parent(X,Y);
  r_2: father(X,Y) ← parent(X,Y);

(where person and parent are deterministic relations), are logically equivalent w.r.t. the condition that both arguments of the predicate parent must be instances of the predicate person.
The literals person(X) and person(Y) are redundant in rule r_1 and can be discarded. Rule r_2 is shorter and can be evaluated faster than rule r_1.
Therefore, it is useful to simplify rules as far as possible. Replacement patterns specify which parts of a rule can be replaced by other literals for simplification.
DEFINITION 9. A replacement pattern is a statement

  h_1 ∧ … ∧ h_s ⇝ b_1 ∧ … ∧ b_t,

where the h_i and b_j are arbitrary literals. The h_i form the head p_head, the b_j the body p_body of the pattern. The head variables are a subset of the variables in the body. Constants are allowed in both parts.
The semantics is defined as follows: Given a rule r, a replacement pattern p, and a substitution θ with p_body θ ⊆ r_body, then

  r ≡ (r ∖ p_body θ) ∪ p_head θ

should hold.
One of the most important application areas of replacement patterns is described by the example above: modelling "part of" relationships to allow for removing redundant literals. There, the replacement patterns

  rp_1: parent(A,B) ⇝ person(A) ∧ parent(A,B);
  rp_2: parent(A,B) ⇝ person(B) ∧ parent(A,B);

can be stated to express the "part of" relationships from above, stating that both arguments of the predicate parent must be instances of the predicate person.
Rule r_1 can be simplified to

  father(X,Y) ← person(Y) ∧ parent(X,Y)

using pattern rp_1 and the substitution θ = {A/X, B/Y}. The resulting rule can further be simplified using pattern rp_2 and the same substitution θ.
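Applying a replacement pattern comes down to the same matching step: find a substitution that embeds the pattern body in the rule body, then swap the matched part for the instantiated pattern head. A simplified sketch (not HyLearner code; all pattern arguments are assumed to be variables):

```python
from itertools import product

def _inst(literals, theta):
    """Apply a substitution to a set of (name, args) literals."""
    return {(n, tuple(theta.get(a, a) for a in args)) for n, args in literals}

def apply_pattern(body, p_head, p_body):
    """Simplify a rule body with a replacement pattern (definition 9):
    find theta with p_body theta contained in the body, then replace
    that part by p_head theta."""
    rule_terms = sorted({t for _, args in body for t in args})
    pat_vars = sorted({t for _, args in p_body for t in args})
    for combo in product(rule_terms, repeat=len(pat_vars)):
        theta = dict(zip(pat_vars, combo))
        matched = _inst(p_body, theta)
        if matched <= body:
            return (body - matched) | _inst(p_head, theta)
    return body  # no match: the rule is left unchanged

# rp_1: parent(A,B) ~> person(A) ^ parent(A,B)
rp1_head = {("parent", ("A", "B"))}
rp1_body = {("person", ("A",)), ("parent", ("A", "B"))}
# body of r_1: person(X) ^ person(Y) ^ parent(X,Y)
body = {("person", ("X",)), ("person", ("Y",)), ("parent", ("X", "Y"))}
print(sorted(apply_pattern(body, rp1_head, rp1_body)))
# [('parent', ('X', 'Y')), ('person', ('Y',))]  -- person(X) removed
```

A second application with rp_2 would then remove person(Y) as well, leaving just parent(X,Y).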
The probabilistic case is more difficult. Evaluation of the head and the body of the pattern must yield the same probabilities. It is the task of the user to ensure this property. If the body equals the head plus one further literal (i.e., the pattern states that this literal is redundant), this property holds if
1. the corresponding relations are deterministic (e.g., domain relations), or
2. the replacement pattern expresses a "part of" relationship which is also modelled by an IDB relation in the knowledge base.
Criterion 2 can be explained by the following example:

EXAMPLE 10. The knowledge base contains the rules

  person(X) ← parent(X,Y);
  person(Y) ← parent(X,Y).

Then, the rules r_1 and r_2 from above deliver the same event expressions for all constant tuples and, thus, the same probabilities. Therefore, the replacement patterns rp_1 and rp_2 are still valid in the probabilistic case.
4.4 Disjunction patterns
Algorithm 2 often generates rules with pairwise disjoint literals. The evaluation of these rules always yields empty result sets and is therefore superfluous. We can significantly improve efficiency if we discard the evaluation of these rules.
In the simplest case of pairwise disjoint literals, a rule contains both a literal l and its negated counterpart ¬l. This case can be tested automatically.
More general cases must be defined explicitly:

DEFINITION 11. A disjunction pattern is a rule with an empty head:

  ← b_1 ∧ … ∧ b_t.

The semantics is defined as follows: Given a rule r, a disjunction pattern p and a substitution θ with p_body θ ⊆ r_body, then r is unsatisfiable.

EXAMPLE 12. The rule

  father(X,Y) ← sex(X,male) ∧ sex(X,female) ∧ parent(X,Y)

is unsatisfiable, and its evaluation can be discarded. To detect this, we have to state the disjunction pattern

  ← sex(X,male) ∧ sex(X,female).
Another application area is the detection of wrong domains (produced by rule specialisation):

EXAMPLE 13. Obviously, the second argument of sex contains the constants male or female, but no persons. Nevertheless, the algorithm produces the rule

  father(X,Y) ← person(X) ∧ person(Y) ∧ sex(X,Y).

This can be detected using the disjunction pattern

  ← person(Y) ∧ sex(X,Y).
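Detecting that a disjunction pattern fires is again a body-matching step; a sketch (illustrative, with uppercase arguments treated as variables and everything else as constants):

```python
from itertools import product

def pattern_fires(rule_body, pattern_body):
    """True iff some substitution theta maps pattern_body into rule_body,
    i.e. the disjunction pattern fires and the rule is unsatisfiable
    (definition 11)."""
    rule_terms = sorted({t for _, args in rule_body for t in args})
    pat_vars = sorted({t for _, args in pattern_body for t in args
                       if t[0].isupper()})
    for combo in product(rule_terms, repeat=len(pat_vars)):
        theta = dict(zip(pat_vars, combo))
        image = {(n, tuple(theta.get(a, a) for a in args))
                 for n, args in pattern_body}
        if image <= rule_body:
            return True
    return False

# <- person(Y) ^ sex(X,Y): the second argument of sex is never a person
pattern = {("person", ("Y",)), ("sex", ("X", "Y"))}
# body of the rule from example 13
body = {("person", ("X",)), ("person", ("Y",)), ("sex", ("X", "Y"))}
body2 = {("person", ("X",)), ("parent", ("X", "Y"))}
print(pattern_fires(body, pattern))   # True: skip this rule's evaluation
print(pattern_fires(body2, pattern))  # False: evaluate normally
```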
This problem could also be solved with signatures. A signature assigns every argument of a predicate a type (domain). Then, comparison of types can be used to detect wrong domains:

EXAMPLE 14. In example 13, there are the disjoint domains PERSON and SEX = {male, female} and the signatures

  person(PERSON); parent(PERSON, PERSON);
  father(PERSON, PERSON); sex(PERSON, SEX).

Then, it is easy to detect that Y ∈ PERSON and Y ∈ SEX is an error.

There are several drawbacks of this approach:
1. It is less general than disjunction patterns. For the relations and signatures professor(PERSON), student(PERSON), examines(PERSON, PERSON), rules with a student as examiner cannot be found with signatures (but with the simple disjunction pattern student(X) ∧ examines(X,Y)). Thus, we would need a type hierarchy, which increases the complexity of this approach.
2. We would have to introduce a syntax for domains, relationships between domains, and signatures. In contrast, we can use the existing syntax for disjunction patterns, which is more elegant.
3. With signatures, we cannot detect disjunctions such as those in example 12.
4.5 Parallel evaluation
Both learning algorithms are based on the same generic structure (see figure 1), in which γ values and the resulting loss are computed for all hypotheses (i.e., the first part inside the for loop). This computation is completely independent of the computations for the other hypotheses. The only connection is the process of determining the hypothesis with overall minimum loss (the if statement). Thus, computation of the hypotheses can be parallelised in both learning algorithms (also allowing for distributed evaluation).
As the computation for one hypothesis requires two evaluation steps, these evaluations are the most expensive parts of the algorithms, and the algorithm itself spends most of its time waiting for the external program (e.g., HySpirit) to finish, parallelisation yields a noticeable performance improvement (of course, only by a constant factor).
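Since each hypothesis evaluation mostly waits for an external program, the loop of figure 1 parallelises directly, e.g. with a thread pool (a sketch; evaluate stands in for the per-hypothesis γ estimation plus loss computation and is an assumption here):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(h):
    """Stand-in for the expensive per-hypothesis evaluation (gamma
    estimation and loss computation with the external engine); as the
    algorithm mostly waits for that external process, threads suffice."""
    return h, abs(h - 3)  # toy loss: distance to 3

def learn_optimum_parallel(hypotheses, workers=4):
    """Figure 1 with the per-hypothesis work fanned out to a pool; only
    the final argmin over the losses is sequential."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(evaluate, hypotheses))
    return min(results, key=lambda hl: hl[1])

print(learn_optimum_parallel([1, 5, 3, 4]))  # (3, 0)
```

Note that the early-stop-at-zero-loss shortcut of figure 1 is lost here; in exchange, all hypotheses are evaluated concurrently.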
4.6 Direct evaluation in a relational database management system
This enhancement can also be applied to both learning algorithms.
The common approach to evaluating rules is to use HySpirit. Although this inference engine underwent some noticeable speedups, it is still pretty slow. A strong improvement can be achieved by directly evaluating rules (using SQL) in a standard relational database management system (RDBMS) and performing some necessary but fast post-processing.
We restricted rules so that one select statement is sufficient to evaluate them:
1. The background knowledge base KB may contain only EDB predicates (facts) but no rules.
2. Recursion is not allowed.
3. Negated predicates are allowed, but then a preprocessing step is required, and the use of negation in rules is restricted.
The EDB relations of the background knowledge base KB can then be stored in the RDBMS, one table (relation) for each predicate. Each table column corresponds to one predicate argument. There is an additional table for each predicate (which may occur in a negated subgoal) which contains the complement of that predicate w.r.t. an appropriate domain.
Algorithm 3.1 in [14] describes the transformation of a non-recursive Datalog rule with only EDB predicates in the body into an expression of relational algebra. As we do not use built-in predicates like =, <, ≤, > or ≠, we can omit steps (2) and (4) from this algorithm. Negated subgoals can be handled analogously to positive subgoals.
To transform this expression of relational algebra into a select query, the selection and projection operations must be moved to the outside, and the Cartesian products are replaced by a join operation. The resulting (equivalent) expression has the form

  π_projection(σ_selection(relations))

and can be transformed into a select query:

  select projection from relations where selection

Having uncertain facts and rules (which are in fact certain rules with additional facts in the body), the RDBMS has to compute an event expression for every result tuple. This event expression can then be evaluated in the algorithm to compute a probability for the tuple.
The select query now must return both result tuples and the corresponding event expression. There must be one event key (hence there can be no IDB predicates in the rule body) with its probability for each relation in the event expression. Thus, this event expression is a conjunction of event keys.

EXAMPLE 15. The rule

  father(X,Y) ← parent(X,Y) ∧ sex(X,male)

can be transformed into the select query

  select
    concat(sex.name, ',', parent.name2) as consts,
    concat(parent.prob, ' parent(', parent.name1, ',',
           parent.name2, ')', ' & 1.0 sex(', sex.name, ',',
           sex.value, ')') as event
  from
    parent, sex
  where
    sex.name = parent.name1 and sex.value = 'male'
  order by
    consts;
With the relations

  relation parent:            relation sex:
  name1   name2   prob       name    value
  bob     mary    0.9        peter   male
  mary    anton   1          bob     male
  anton   peter   0.5        anton   male
                             mary    female

stored in the RDBMS, this select query will return:

  consts        event
  bob,mary      0.9 parent(bob,mary) & 1 sex(bob,male)
  anton,peter   0.5 parent(anton,peter) & 1 sex(anton,male)

Having multiple rules for one predicate, each rule has to be evaluated separately. The results have to be merged together (for that reason, we use an order by statement in the query). If there are multiple event expressions for one tuple (due to multiple rules or free variables in the rule body), the disjunction of these expressions has to be built. The probability of an event expression is calculated in exactly the same way as described in [3] (except that an event expression is automatically in disjunctive normal form).
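The post-processing step, computing the probability of a DNF event expression over independent basic events, can be sketched with the inclusion-exclusion (sieve) formula (the "likes" event below is illustrative):

```python
from itertools import combinations
from math import prod

def dnf_probability(conjuncts, prob):
    """Probability of a disjunction of conjunctions of independent basic
    events: Pr(C1 v ... v Cm) = sum over non-empty subsets S of
    (-1)^(|S|+1) * Pr(AND of the C_j in S), where the probability of a
    conjunction is the product over its distinct event keys."""
    total = 0.0
    for k in range(1, len(conjuncts) + 1):
        for subset in combinations(conjuncts, k):
            keys = frozenset().union(*subset)   # distinct event keys only
            total += (-1) ** (k + 1) * prod(prob[e] for e in keys)
    return total

# two result rows for the same tuple, e.g. stemming from two rules:
# 0.9 parent(bob,mary) & 1.0 sex(bob,male)   v   0.5 likes(bob,mary)
prob = {"parent(bob,mary)": 0.9, "sex(bob,male)": 1.0, "likes(bob,mary)": 0.5}
c1 = frozenset({"parent(bob,mary)", "sex(bob,male)"})
c2 = frozenset({"likes(bob,mary)"})
print(dnf_probability([c1, c2], prob))  # ~ 0.95 (= 0.9 + 0.5 - 0.45)
```

Deduplicating the event keys inside each subset is what makes shared basic events (the same fact used by several rules) come out correctly.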
5. APPLICATIONS
We now describe the application of the learning algorithms to several test collections.

5.1 Text classification with the Reuters collection
The well-known Reuters collection [6] contains 7 775 training and 3 019 test documents (Lewis split). Each document has attached at least one out of 115 categories. The task is to learn rules for each category. For every test document, the categories in the first rank are selected for evaluation.
There are two relations:
1. docTerm contains the N terms of the documents (after stopword elimination and stemming) with the highest inverse category frequency (assuming that these terms are good discriminators):

  icf_t := log( #categories / #categories containing t )

The experiment is run twice: the first time with deterministic terms, and afterwards with probabilistic terms weighted with normalised tf·idf (see [13]).
2. doc contains all document ids.
The example set contains all training documents, with weight 1 iff the document is assigned the category, and 0 otherwise. Negated literals are prohibited by disjunction patterns.
Thus, algorithm 2 learns rules of the form

  cat(D) ← docTerm(D,t_1) ∧ … ∧ docTerm(D,t_l).
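The icf-based term selection can be sketched as follows (a hypothetical helper; the category-to-term assignments are purely illustrative):

```python
from math import log

def icf(term, categories):
    """Inverse category frequency: log(#categories / #categories
    containing the term). High icf = the term discriminates well."""
    containing = sum(1 for terms in categories.values() if term in terms)
    return log(len(categories) / containing)

def select_terms(categories, n):
    """Pick the n terms with the highest icf for the docTerm relation."""
    vocab = set().union(*categories.values())
    return sorted(vocab, key=lambda t: icf(t, categories), reverse=True)[:n]

# toy mapping of categories to their term sets
categories = {
    "grain": {"wheat", "tonnes", "price"},
    "trade": {"tariff", "price"},
    "crude": {"oil", "price", "tonnes"},
}
# "price" occurs in every category (icf = 0) and is never selected
print(select_terms(categories, 2))
```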
The following table shows the precision values for different parameter settings:

  #rules  #literals  #terms   det.    prob.
  3       2          50       42.8%   40.9%
  4       2          200      46.1%   42.4%
  4       4          100      43.0%   41.0%

The rule learning algorithm (algorithm 2) outputs as many rules as specified (except if fewer rules are sufficient to achieve a loss of zero). To this result set, we also applied the probability learning algorithm (algorithm 1), which outputs an optimum subset (the subset with minimum loss). But in most cases, the optimum subset simply was the learned rule set, and we did not achieve a better precision.
Due to high computation times, we did not consider larger parameter values. Anyway, the table suggests that no substantial improvement could be gained. Particularly, it would not help to increase the number of literals, because the learned rules were never longer than three literals.
An interesting result is that we gained better results for deterministic facts (binary indexing).
Compared with other approaches (e.g., as described in [19]) with a precision of 85% (kNN) or 65% (Rocchio), learning uncertain rules for text classification is not appropriate. One reason is the high dimensionality (high number of terms) of the learning task. To reduce the hypothesis space, we only considered 2% of the terms (w.r.t. the icf values).
5.2 Text classification to detect spam emails
The collection spambase [11] contains 4 601 emails, 1 813 (= 39.4%) of them spam. We split the collection randomly into two collections of nearly the same size. The collection only contains the frequencies of 48 terms for each email. We used all of these terms for learning.
We classified a test email emailID as spam iff

  Pr(spam(emailID)) ≥ 0.5

holds. For probabilistic term weights, no email reached this limit, thus only deterministic facts are considered.
We achieved the following results:

  #rules  #literals  precision  precision⁻
  1       1          71.6%      79.4%
  1       2          72.7%      81.3%
  2       2          74.7%      98.7%
  3       3          81.0%      88.8%
  4       4          83.4%      96.2%

Precision⁻ is the percentage of documents where no non-spam email is classified as spam (and, possibly, automatically deleted), as this is particularly interesting for the user.
The results are better than for the Reuters collection, because the dimensionality of the learning task is much lower. But compared with a precision of 93% (given in the collection as baseline value), rule learning performs badly even for this simple collection.
5.3 Detection of poisonous mushrooms
As a third example, we applied our algorithm to a collection from the field of machine learning: the mushroom collection from [10]. The task is to detect poisonous mushrooms (in the Agaricus and Lepiota family). There are 8 124 examples (23 species), each of which consists of 22 nominal attributes and a label: (possibly) poisonous (3 916 examples, 48.2%) or definitely edible (4 208 examples, 51.8%). Again, we split the collection randomly into two collections.
We used one relation for each attribute (with the id and the attribute value as arguments) and the relation mushroom containing all mushroom ids. All relations are deterministic.
Again, we classified a mushroom as poisonous iff

  Pr(poisonous(mushroomID)) ≥ 0.5

holds.
We evaluated both the learned uncertain rules and the same rule sets without probabilities:

  #rules  #literals  uncertain  certain
  1       1          74.2%      74.2%
  1       3          89.1%      89.0%
  2       2          88.8%      89.1%
  2       3          93.7%      92.6%
  3       3          96.4%      95.3%
  4       4          98.8%      97.1%
  5       5          99.8%      98.0%

Again, the rules contain no more than three literals. In most cases, the uncertain rules are slightly better than the certain ones (and in real applications, the final error reduction from 2% to 0.2% would be important).
Our result of 99.8% is close to the optimum of 100%. The optimum was reached by [1], learning 12 rules with neural nets.
5.4 Uncertain schema mapping
Finally, we applied our learning algorithm to schema mapping
in digital libraries. The schema of a digital library defines the attributes
of the documents. Following the approach in [2], an attribute
is a pair of a name and a data type (a set of values together
with a set of binary vague predicates). Examples for schemas are
Dublin Core (DC, see [4]) and MARC (see [9]). Then, binary relations
attribute(docID, value) can be used to encode documents in
Datalog:
EXAMPLE 16. The DC description of [2] contains

creator(fuhr99, fuhr_norbert),
title(fuhr99, towards_data_abstraction_in_networked
  _information_retrieval_systems),
date(fuhr99, 1999).
A query condition is a triple containing an attribute name, a predicate
and a comparison value. This can be encoded using the relation
attribute_predicate(docID, value) and rules mapping from the
attribute relation to attribute_predicate. For simplicity, we will ignore
the different predicates in this text. Thus, to retrieve documents
with the creator Norbert Fuhr, we have to ask
?- creator(X, fuhr_norbert).
In the area of federated digital libraries, the ultimate goal is to
integrate hundreds of heterogeneous digital libraries in one system.
Typically, no common schema is supported by all libraries. Thus,
we have to use a standard schema for the queries and then map
between different schemas.
For example, our standard schema used for queries is DC with
the attribute creator (which contains all authors), but a digital
library supports only another schema with the attributes author
and editor. Then we have to state

creator(X, Y) ← author(X, Y),
creator(X, Y) ← editor(X, Y),

as all authors and editors are included in creator.
If our standard query contains the attributes author and editor,
but for the documents only creator (from DC) is available,
we have to use uncertain rules. E.g., if 60% of the creators are
authors (and 40% are editors), we obtain the rules

0.6 author(X, Y) ← creator(X, Y),
0.4 editor(X, Y) ← creator(X, Y).
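Applied to deterministic creator facts, rules of this kind yield probabilistic author and editor facts. The following Python sketch illustrates this one-rule-per-fact case (the facts are hypothetical, and real pDatalog evaluation, e.g. in HySpirit, is considerably more involved):

```python
# Sketch: applying uncertain mapping rules of the form
#   0.6 author(X,Y) <- creator(X,Y)
#   0.4 editor(X,Y) <- creator(X,Y)
# to deterministic creator facts. The facts are hypothetical examples.
creator = {("fuhr99", "fuhr_norbert"), ("doc2", "smith_john")}

rules = [
    ("author", 0.6),  # weight of the author rule
    ("editor", 0.4),  # weight of the editor rule
]

# Each derived fact inherits the weight of the (single) rule deriving it.
derived = {}  # (relation, doc, value) -> probability
for relation, weight in rules:
    for doc, value in creator:
        derived[(relation, doc, value)] = weight

print(derived[("author", "fuhr99", "fuhr_norbert")])  # 0.6
```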
Of course, more complex rules or rule sets are possible as well.
In general, it is difficult to calculate the rule weights even for known
schemas. In realistic applications, dozens or hundreds of digital libraries
have to be integrated. In this setting, it is even impossible
to construct the rule bodies manually. Instead, we have to learn the
rules from examples.
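For a single rule of this shape, a natural weight estimate from examples is the fraction of creator pairs that also occur as author pairs. A minimal sketch (the example pairs are hypothetical):

```python
# Sketch: estimating the weight of "author(X,Y) <- creator(X,Y)"
# from examples as the fraction of creator pairs that are also
# author pairs. The pairs below are hypothetical.
creator = {("d1", "a"), ("d2", "b"), ("d3", "c"), ("d4", "d"), ("d5", "e")}
author  = {("d1", "a"), ("d2", "b"), ("d3", "c")}  # the remaining two are editors

weight = len(creator & author) / len(creator)
print(weight)  # 0.6
```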
As a test collection, we used the Library of Congress Open Archive
Initiative Repository 1 (see [8]). Here, documents are available in
DC and MARC.

In DC, the fields title, creator, subject, description,
publisher, date, type, identifier, language and coverage
are supported.
The MARC schema consists of a set of fields, each identified
by an ordinal number. Furthermore, these fields are subdivided into
several subfields (identified by a letter or a digit). E.g., subfield
"a" of field 100 contains a personal name, and subfield "d" may
contain any date associated with the person. We used the combination
of field and subfield identifiers as attributes and added for every
field another attribute containing the concatenation of all subfields
of this field (denoted as subfield "all").
Learning is based on exact match. The values of the document
attributes are converted into Datalog constants: upper-case letters
are transformed into their lower-case equivalents, and all characters
except letters and digits are converted into underscores. Consecutive
underscores are merged into a single one, and underscores
are deleted from the start and the end of the constant. Furthermore,
only the first 100 characters of the resulting string are
considered, for efficiency.
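This normalisation can be sketched in Python (a minimal illustration; the helper name to_datalog_constant is ours, and non-ASCII letters are simply mapped to underscores here):

```python
import re

def to_datalog_constant(value, max_len=100):
    """Convert an attribute value into a Datalog constant as described:
    lower-case letters, non-alphanumerics to underscores, consecutive
    underscores merged, underscores stripped at both ends, and the
    resulting string truncated to max_len characters."""
    s = value.lower()
    s = re.sub(r"[^a-z0-9]", "_", s)  # everything but letters/digits -> '_'
    s = re.sub(r"_+", "_", s)         # merge runs of underscores
    s = s.strip("_")                  # drop leading/trailing underscores
    return s[:max_len]                # keep only the first 100 characters

print(to_datalog_constant("Fuhr, Norbert"))  # fuhr_norbert
```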
As examples, we used a randomly chosen subset of the product
of all documents and all attribute values, and obtained approximately
8 000 examples.
First, we used DC as the standard schema and MARC as the library
schema, and learned rules for all DC attributes (3 rules, 4 subgoals,
only positive subgoals). Once more, recursion is disabled. E.g.,
our algorithm outputs the following deterministic rules for the DC
attribute creator (with no loss):

dc_creator(X, Y) ← marc_100_all(X, Y),
dc_creator(X, Y) ← marc_700_all(X, Y),
dc_creator(X, Y) ← marc_710_all(X, Y).
MARC field 100 is the personal name field mentioned above,
field 700 contains another personal name, and field 710 a corporate
name.
We invoked the algorithm with different levels of improvements:
evaluation with HySpirit or direct evaluation in the relational database
management system MySQL, with different numbers of clients, and with
and without replacement and disjunction patterns (HySpirit was
not able to evaluate without them). We achieved the following
computing (elapsed) times:

  #DB clients  patterns  time (in minutes)
  1            no        87.53
  5            no        14.15
  5            yes        8.00

We aborted the evaluation with HySpirit after three days. At
that time, HyLearner had not even completed learning the first rule
(out of three).
These results show that a significant acceleration can be achieved
with our improvements. In particular, direct evaluation in a database
system and parallelisation yield a huge efficiency improvement.
For the direction DC → MARC, HyLearner generates uncertain
rules (due to the higher specificity of MARC). For example,
for the attribute concatenation of all subfields of field 700, the
following rule was produced:

0.4 marc_100_all(X, Y) ← dc_creator(X, Y).

Of course, the output is formed as described in Section 2. But for
one rule, this is equivalent to the original rule probabilities introduced
in [3].
The results show that our algorithm can be used for learning
schema mappings: when there is equivalent information in both
schemas, HyLearner computes deterministic rules with no loss.
However, when there are schemas of different expressiveness (e.g.,
DC versus MARC), then uncertain mappings are required. In this
case, HyLearner can efficiently compute the corresponding probabilistic
rules.
6.CONCLUSION AND OUTLOOK
In this paper, we first described briefly how uncertain rules can
be used in general. Instead of assigning each rule a conditional
probability (implying that only one rule may hold for a tuple vector),
all required 2^n probabilities are modelled and used. These
probabilities can be transformed into additional facts which are
added to the rule bodies.
We gave an overview of two algorithms for learning rule probabilities
and rule bodies. These algorithms borrow ideas from common
Machine Learning methods, but modify them in order to fit
probabilistic Datalog.
Then, we have shown how Wüthrich's algorithms [17, 18]
can be extended to improve efficiency. Some practical tests showed
a huge acceleration of the learning process.
Finally, we applied the algorithm to several test collections. It
became clear that rule learning is appropriate for low-dimensional
problems.
One main application area for probabilistic Datalog rule learning
could be schema mapping.Our preliminary results are promising.
Future work is needed to verify the retrieval performance of the
approach and to develop methods for transforming attribute values.
7.ACKNOWLEDGEMENTS
Part of this work is supported by the EU commission under grant
IST-2000-26061 (MIND).
8.REFERENCES
[1] W. Duch, R. Adamczak, and K. Grabczewski. Extraction of
logical rules from training data using backpropagation
networks. In Proceedings of the 1st Online Workshop on Soft
Computing, pages 25-30, 1996. http://www.bioele.
nuee.nagoyau.ac.jp/wsc1/papers/p061.html.
[2] N. Fuhr. Towards data abstraction in networked information
retrieval systems. Information Processing and Management,
35(2):101-119, 1999.
[3] N. Fuhr. Probabilistic Datalog: Implementing logical
information retrieval for advanced applications. Journal of
the American Society for Information Science, 51(2):95-110,
2000.
[4] Dublin Core Metadata Initiative. Dublin Core metadata
element set: Reference description.
http://dublincore.org/documents/dces/.
[5] N. Lavrac and S. Dzeroski. Inductive Logic Programming:
Techniques and Applications. Ellis Horwood, 1994.
[6] D. Lewis and M. Ringuette. A comparison of two learning
algorithms for text categorization. In Symposium on
Document Analysis and Information Retrieval, Las Vegas,
1994.
[7] H. Nottelmann. Lernen unsicherer Regeln für HySpirit
(Learning uncertain rules for HySpirit, in German). Diploma
thesis, University of Dortmund, Department of Computer
Science, 2001.
http://ls6www.informatik.unidortmund.de/
~nottelma/hylearner/diplomarbeit.pdf.
[8] Library of Congress. Library of Congress Open Archive
Initiative Repository 1.
http://memory.loc.gov/cgibin/oai1_0.
[9] Library of Congress. MARC standards.
http://www.loc.gov/marc/marc.html.
[10] UCI Machine Learning Repository. Collection mushroom.
ftp://ftp.ics.uci.edu/pub/
machinelearningdatabases/mushroom/.
[11] UCI Machine Learning Repository. Collection spambase.
ftp://ftp.ics.uci.edu/pub/
machinelearningdatabases/spambase/.
[12] K. Ross. Modular stratification and magic sets for Datalog
programs with negation. Journal of the ACM,
41(6):1216-1266, November 1994.
[13] G. Salton and C. Buckley. Term-weighting approaches in
automatic text retrieval. Information Processing and
Management, 24(5):513-523, 1988.
[14] J. D. Ullman. Principles of Database and Knowledge-Base
Systems, volume I. Computer Science Press, Rockville
(Md.), 1988.
[15] L. G. Valiant. A theory of the learnable. Communications of the
ACM, 27(11):1134-1142, 1984.
[16] C. J. van Rijsbergen. A non-classical logic for information
retrieval. The Computer Journal, 29(6):481-485, 1986.
[17] B. Wüthrich. On the learning of rule uncertainties and their
integration into probabilistic knowledge bases. Journal of
Intelligent Information Systems, 2:245-264, 1993.
[18] B. Wüthrich. Knowledge discovery in databases. Technical
report, Hong Kong University of Science and Technology,
Department of Computer Science, 1996. ftp://ftp.cs.
ust.hk/pub/techreport/96/tr9604Chap7.ps.gz.
[19] Y. Yang. An evaluation of statistical approaches to text
categorization. Information Retrieval, 1(1):67-88, 1999.