# Learning probabilistic Datalog rules for information classification and transformation


Henrik Nottelmann
nottelmann@ls6.cs.uni-dortmund.de
Norbert Fuhr
fuhr@ls6.cs.uni-dortmund.de
Department of Computer Science
University of Dortmund
44221 Dortmund, Germany
ABSTRACT
Probabilistic Datalog is a combination of classical Datalog (i.e.,
function-free Horn clause predicate logic) with probability theory:
probabilistic weights may be attached to both facts and rules. But it
is often impossible to assign exact rule weights or even to construct
the rules themselves. Instead of specifying them manually, learning
algorithms can be used to learn both rules and weights. In practice,
these algorithms are very slow because they need a large example set
and have to test a high number of rules. We apply a number of
extensions to these algorithms in order to improve efficiency. Several
applications demonstrate the power of learning probabilistic Datalog
rules, showing that learning rules is suitable for low-dimensional
problems (e.g., schema mapping) but inappropriate for higher
dimensions as e.g. in text classification.
1. INTRODUCTION
The logical view on information retrieval (IR) treats retrieval as
inference. For a query q, the system searches for documents d that
imply the query logically, i.e. for which the logical formula d → q
is true. Due to the intrinsic vagueness and imprecision in IR, a
logic that allows for uncertain reasoning should be used. In [16], a
probabilistic approach is discussed for this purpose; thus, document
retrieval can be based on the estimation of the probability Pr(d → q).
Datalog is a (function-free) variant of Horn predicate logic which is
widely used for deductive databases. For probabilistic Datalog,
probabilistic weights may be attached to both facts and rules. As
rule weights can only be used with certain restrictions, [17]
describes another approach for using uncertain logical rules in
general, based on exponentially many conditional probabilities.
For evaluating probabilistic Datalog programs, we developed the
inference engine HySpirit.
In practice, it is often very difficult to calculate the rule weights
or even the rule bodies themselves. Applying methods from the field
of machine learning can remedy this problem. [17, 18] describe two
algorithms for learning the rule probabilities and the rules
themselves, respectively.
One drawback of these algorithms is that they are very slow. There
are four major reasons for this behaviour:
1. The most expensive operations are the rule evaluations performed
with HySpirit.
2. To obtain a guaranteed quality for the probability learning
algorithm, a large example set is needed. Thus, each evaluation will
take a long time.
3. The rule learning algorithm has to evaluate a high number of
rules.
4. The rules to be tested may be rather complex because they are
generated automatically. Length and complexity of a rule have a
direct impact on runtime.
In this paper, we describe several extensions of the basic algorithms
which lead to efficiency improvements of several orders of magnitude.
A number of applications demonstrate the feasibility of our approach,
such as different (text) classification tasks and the problem of
schema mapping in digital libraries.
The rest of this text is organised as follows: The next section gives
a brief overview of probabilistic Datalog and the new approach of
handling rule probabilities. Section 3 introduces the two algorithms
for learning rule weights and rule bodies, respectively. Our
improvements are described in section 4. Then, we present some
applications and finish with a conclusion and an outlook on future
work.
2. PROBABILISTIC DATALOG
Datalog (see [14]) is a variant of predicate logic based on
function-free Horn clauses. Negation is allowed, but its use is
limited to achieve a correct and complete model. Rules have the form

h ← b_1 ∧ … ∧ b_n,

where h (the head) and the b_i (the subgoals of the body) denote
literals with variables and constants as arguments. A rule can be
seen as a clause {h, ¬b_1, …, ¬b_n}. A fact f is a rule with only
constants in the head and an empty body (the ← can be omitted in this
case).
In probabilistic Datalog (see [3]), every fact or rule has a
probabilistic weight attached, prepended to the fact or rule.
Evaluation is based on the notion of event keys and event
expressions. Facts and instantiated rules are basic events, and each
of them has an event key assigned. Each derived fact is associated
with an event expression that is a Boolean combination of the event
keys of the underlying basic events.
Problems arise if more than one rule is formulated for the same head
predicate: Given only two rules for a predicate p specifying
Pr(p|r_1) and Pr(p|r_2), we are not able to estimate Pr(p|r_1 ∧ r_2)
for the case when the bodies of both rules are fulfilled (see [3]).
Thus, we have to know Pr(p|¬r_1 ∧ ¬r_2), Pr(p|r_1 ∧ ¬r_2),
Pr(p|¬r_1 ∧ r_2) and Pr(p|r_1 ∧ r_2). This is equivalent to the
situation in probabilistic inference networks, where a link matrix
with the same conditional probabilities has to be given.
EXAMPLE 1. The following rules state that two documents are
semantically related iff they have the same author or are linked
together:

sameauthor(D1,D2) ← author(D1,A) ∧ author(D2,A);
0.5 related(D1,D2) ← sameauthor(D1,D2);
0.2 related(D1,D2) ← link(D1,D2).

But since two documents may share the same author and be linked
together as well, we have to reformulate the rule set in order to
cover all possible cases:

0.5 related(D1,D2) ← sameauthor(D1,D2) ∧ ¬link(D1,D2);
0.2 related(D1,D2) ← ¬sameauthor(D1,D2) ∧ link(D1,D2);
0.7 related(D1,D2) ← sameauthor(D1,D2) ∧ link(D1,D2).
In general, we have n rules with the common (target) head predicate
p. To specify rule #i, its head predicate is renamed to p_i. With the
vector X⃗ = (X_1, …, X_l)^T, where each X_k is a variable, p(X⃗) and
p_i(X⃗) are atoms. Then, the renamed rules together with the n rules
p(X⃗) ← p_i(X⃗) are equivalent to the original rule set. Further we
set Γ := {p_1, …, p_n}. Then, we need 2^n conditional probabilities
(see [17]):

DEFINITION 2. The function γ: 2^Γ → [0,1] is defined as

γ(A) := Pr(p(X⃗) | ⋀_{a ∈ A} a(X⃗) ∧ ⋀_{a ∈ Γ∖A} ¬a(X⃗)).

Furthermore, we define γ(∅) := 0.
EXAMPLE 3. For the second rule set in example 1, we have

γ(∅) = 0,  γ({related_1}) = 0.5,  γ({related_2}) = 0.2,
γ({related_1, related_2}) = 0.7.
These probabilities have to be integrated into probabilistic Datalog
(coding them into rules and/or facts) so that the uncertain rules can
be evaluated e.g. by HySpirit without changes.
[17, p. 254] and [7, pp. 103-105] show that the obvious construction
(using the γ values as probabilities of new rules, see example 1)
does not necessarily yield modularly stratified programs (and, thus,
cannot be evaluated by HySpirit, see [12]). Therefore, the 2^n − 1
probabilities are coded into 2^n − 1 additional facts f(A) for
∅ ≠ A ⊆ Γ, with probabilities x(A):

DEFINITION 4. The weight x(A) is defined recursively (using x(B) for
∅ ≠ B ⊊ A) as

x(A) := 1 − (1 − γ(A)) / ∏_{∅ ≠ B ⊊ A} (1 − x(B)).
Then, the transformed rule set is constructed according to the
following scheme (for every ∅ ≠ A ⊆ Γ):

p(X⃗) ← ⋀_{p_i ∈ A} p_i(X⃗) ∧ f(A);
p_i(X⃗) ← …;
x(A) f(A).

The correctness of this construction is proven in [17] (for the case
of three rules) and in [7] (for n rules).
3. LEARNING PROBABILISTIC DATALOG RULES
In this section we present two algorithms for learning rule
probabilities (i.e., γ values) and probabilistic Datalog rules,
respectively. They have a similar structure, which we describe in the
first subsection.
3.1 Structure of the learning algorithms
The input to the algorithms consists of a set of n rules with a
common head predicate (the target predicate) p. A hypothesis h_P is a
non-empty set P of rules with head predicate p (with renamed head
predicates p_i, respectively) and corresponding γ_P values. The
hypothesis space H is a set of hypotheses h_P.
Examples are probabilistic facts for the target predicate. They are
drawn with the oracle EXAMPLES. EXAMPLES returns an arbitrary example
c⃗ and a label l ∈ {0,1} with Pr(l = 1) = Pr(p(c⃗)). Furthermore, a
background knowledge base KB is given, containing rules and facts
(used to evaluate the rules, e.g. with HySpirit).
We search for the hypothesis h_P ∈ H with minimum expected quadratic
loss (the prediction error of Pr(p(c⃗)), where c⃗ is a randomly drawn
constant vector, which has some well known, useful properties), but
we must use approximations for the γ_P values and for the loss.
In the rest of this section, EX = {(c⃗_i, y_i)}, y_i ∈ {0,1}, is a
set of examples drawn with EXAMPLES.
DEFINITION 5. Let EX′ ⊆ EX contain all (c⃗_i, y_i) for which
⋀_{a ∈ A} a(c⃗_i) ∧ ⋀_{a ∈ P∖A} ¬a(c⃗_i) holds. Then, the
approximation of γ_P(A) is defined as

γ̂_P(A) := |{(c⃗_i, y_i) ∈ EX′ | y_i = 1}| / |EX′|.

Thus, we need |P| evaluations (of the single rules p_i ∈ P) for
computing x(A).
EXAMPLE 6. Let P := {p_1, p_2}. For ten examples (five positive, five
negative), the rule bodies evaluate as follows:

        positive examples | negative examples
p_1:    0 1 1 1 0         | 1 0 1 0 1
p_2:    1 1 1 1 0         | 1 1 0 0 0

Grouping the examples by the fulfilled rule bodies gives

         p_1        ¬p_1
p_2:     + + + −    + −
¬p_2:    − −        + −

and thus the approximations

γ̂_P({p_1}) = 0,  γ̂_P({p_2}) = 0.5,  γ̂_P({p_1, p_2}) = 0.75.
DEFINITION 7. The empirical quadratic loss of a hypothesis h_P with
examples EX is defined as

Ê_EX[Q_{h_P}] := (1/|EX|) · Σ_{i=1}^{|EX|} (Pr(h_P(c⃗_i)) − y_i)².

Pr(h_P(c⃗_i)) is the probability of p(c⃗_i) as computed by HySpirit
using the hypothesis h_P. We need only one evaluation of the
transformed rule set for computing the empirical quadratic loss.
For both approximations, we draw a new example set EX. The resulting
generic structure is depicted in figure 1.
3.2 Algorithm 1
Wüthrich presents in [17] a PAC algorithm which learns γ values. PAC
(probably approximately correct) learning (introduced in [15] for
deterministic propositional logic) is often used in the field of
machine learning. PAC algorithms return a nearly correct result only
in most cases (and can fail otherwise), but guarantee the goodness
properties described below.
The algorithm is based on the generic architecture. It also takes two
parameters ε and δ as input. The hypothesis space is
H = {h_P | ∅ ≠ P ⊆ Γ}. For both approximations we draw
poly(1/ε, 1/δ) new examples EX. Then, the output of the algorithm is
with probability 1 − δ an ε-good model of probability (i.e., the
probability that the error is greater than ε is at most δ).

INPUT:  target predicate p, background knowledge base KB,
        examples EXAMPLES, hypothesis space H
OUTPUT: optimum hypothesis h ∈ H with minimum loss Ê_EX′(Q_h)
ALGORITHM:
l := ∞
for (all h' ∈ H) do
    h'' := learnGammas(p, h', KB, EXAMPLES)
    l'' := computeLoss(h'', KB, EXAMPLES)
    if (l'' < l) then
        l := l''
        h := h''
        if (l = 0) then stop fi
    fi
od

Figure 1: Generic structure of both algorithms

We implemented this probability learning algorithm (with the
modification described in section 4.1) as "algorithm 1".
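The generic loop of figure 1 is an exhaustive search over the hypothesis space. A minimal Python rendering, as a sketch only: the parameters `learn_gammas` and `compute_loss` stand for the two expensive external evaluations (e.g. with HySpirit) and are passed in as plain functions here:

```python
import math

def learn_optimum(hypotheses, learn_gammas, compute_loss):
    """Generic structure of both learning algorithms (figure 1):
    fit the gamma values of every hypothesis and keep the one with
    minimum empirical quadratic loss.  A loss of zero cannot be
    improved, so the search stops early in that case."""
    best, best_loss = None, math.inf
    for h in hypotheses:
        fitted = learn_gammas(h)
        loss = compute_loss(fitted)
        if loss < best_loss:
            best, best_loss = fitted, loss
            if best_loss == 0:
                break
    return best, best_loss

# toy run: hypotheses predict probabilities for three labelled examples
labels = [1, 0, 1]
hyps = [[1, 1, 1], [1, 0, 1]]
best, loss = learn_optimum(
    hyps,
    lambda h: h,  # no gamma fitting in the toy example
    lambda h: sum((p - y) ** 2 for p, y in zip(h, labels)) / len(labels))
# best == [1, 0, 1], loss == 0.0
```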
3.3 Algorithm 2
We also adapted a second algorithm by Wüthrich which learns not only
the conditional probabilities of given rules but also the rules
themselves (see [18]). Learning (certain) logical rules (inductive
logic programming, ILP) is well studied in the field of machine
learning (e.g., see [5]), but the methods have to be modified for
probabilistic Datalog.
We use the common specialisation order on rules to avoid testing all
possible rules with head predicate p:

DEFINITION 8. A rule r_1 is a specialisation of a rule r_2,
r_1 ⊑ r_2, iff there is a substitution θ with r_2 θ ⊆ r_1 (viewing a
rule as a clause, i.e. as a set of literals containing the head and
the negations of the subgoals).
A rule r_1 is a most general (direct) specialisation of a rule r_2,
r_1 ⊑_d r_2, iff for every rule r with r_1 ⊑ r ⊑ r_2, r ≡ r_1 or
r ≡ r_2 holds.
E.g., consider the rules

r_1: father(X,Y) ← parent(X,Y);
r_2: father(X,X) ← parent(X,X);
r_3: father(X,Y) ← parent(X,Y) ∧ parent(Y,Z);
r_4: father(X,X) ← parent(X,X) ∧ parent(X,Z).

The rules r_2 and r_3 are both (direct) specialisations of r_1 (using
θ = {X/X, Y/X} and θ = {X/X, Y/Y}, respectively), and r_4 is a direct
specialisation of r_2 and of r_3.
There are two ways to generate a direct specialisation:
1. A new literal is appended to the rule. Recursion is allowed, but
then the literal must be positive in order to obtain a modularly
stratified Datalog program.
2. Instead of appending a literal =(X,Y), where X and Y already occur
in the rule, all occurrences of X are substituted by Y. Both ways
lead to equivalent rules, but evaluation of the rule with the
substitution is faster.
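Both ways can be sketched as a generator over rules encoded as (head, body) pairs of (predicate, arguments) literals. This is our own simplified encoding: appended literals only reuse variables already present in the rule, and fresh variables as well as negation are left out:

```python
def specialise(rule, edb):
    """Generate direct specialisations of a rule (head, body):
    (1) append a new positive literal over the EDB predicates, or
    (2) substitute all occurrences of one variable by another
        (equivalent to appending an equality literal =(X,Y))."""
    head, body = rule
    variables = sorted({a for _, args in [head, *body]
                        for a in args if a[0].isupper()})
    # way 1: append a new literal (binary predicates only here)
    for pred, arity in edb:
        if arity != 2:
            continue
        for x in variables:
            for y in variables:
                lit = (pred, (x, y))
                if lit not in body:
                    yield (head, body + [lit])
    # way 2: substitute all occurrences of x by y
    for x in variables:
        for y in variables:
            if x == y:
                continue
            sub = lambda args: tuple(y if a == x else a for a in args)
            yield ((head[0], sub(head[1])),
                   [(p, sub(args)) for p, args in body])

r1 = (("father", ("X", "Y")), [("parent", ("X", "Y"))])
specs = list(specialise(r1, [("parent", 2)]))
# specs contains r2 = father(X,X) <- parent(X,X) via the
# substitution Y -> X, and rules with an appended parent literal
```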
We now first present the original algorithm of Wüthrich before we
describe our modification.
The algorithm of Wüthrich consists of two nested loops:
1. The inner loop generates a locally optimum rule. Starting with the
most general rule r, the direct specialisation h_opt of r (without γ
values) with minimum expected quadratic loss is computed. This step
is repeated (via the assignment r := h_opt) until no improvement can
be achieved, or until the rule reaches a specified length (number of
subgoals).
Afterwards, the inner loop assigns the rule h_opt a conditional
probability (note that this is not a γ value).
2. The outer loop just calls the inner loop n times to learn n rules.
Finally, the conditional probabilities and several approximations are
used to solve an equation system of 2^n − 1 equations in the 2^n − 1
unknown γ values.
This algorithm only returns a local minimum and possibly misses the
global optimum. The advantage is the gain in efficiency, due to not
testing all possible hypotheses (i.e., all rules with head predicate
p).
We modify this algorithm to reuse the generic structure. Now, the
inner loop simply calls the generic algorithm with H := {direct
specialisations of r} (where r is the current optimum hypothesis). As
a consequence, our algorithm learns the γ values for every direct
specialisation, and the loss is determined w.r.t. these
probabilities. The final step (solving the equation system) is no
longer necessary. Our rule learning algorithm is called "algorithm 2"
in the rest of this text (see figure 2).
The major drawback of our algorithm is that it is not as fast as
Wüthrich's (because the weights are computed for every hypothesis
instead of only for the locally optimum rule set), but Wüthrich's way
of determining the γ values leads to further inaccuracy.
4. EFFICIENCY IMPROVEMENTS
We developed the system HyLearner that implements these algorithms
(calculation of the additional fact weights, learning rule
probabilities, learning rule bodies). It became clear that the
original learning algorithms are much too slow, for several reasons:
we need a large example set to have good guaranteed results, we have
to test many rules which might be rather complex due to the automatic
generation (for rule learning), and the evaluation of each hypothesis
is very expensive. In the following, we describe six solutions
improving efficiency, which we integrated into HyLearner.

INPUT:  target predicate p, knowledge base KB,
        examples EXAMPLES, number n of rules to learn
OUTPUT: transformed rule set R, loss l
ALGORITHM:
R := empty rule set
l := 0
for (i := 1; i ≤ n; i := i + 1) do
    l := ∞
    r := empty rule for p
    do
        H := {R ∪ {r'} | r' direct specialisation of r}
        (h_r, l) := learnOptimum(p, KB, EXAMPLES, H, l)
    while (better rule found)
    R := h_r
od

Figure 2: Rule learning algorithm (algorithm 2)
4.1 Reducing the number of examples
The rule probability learning algorithm (algorithm 1) guarantees some
effectiveness properties described in [17]. The number of examples
needed is polynomial in ε and δ. E.g., for loss computation we need
at least

16 · (1/ε)^6 · (1 − (1 − δ)^(1/2^(n+1)))^(−1)

examples. Thus, useful parameters result in a high number of required
examples: With ε = δ = 0.1 and n = 3 rules, we need at least 100 000
examples.
To speed up both evaluation in HySpirit and postprocessing in the
algorithm, we abandon this guarantee in our implementation. We allow
the user to specify as many examples as she wants, and use all these
examples for learning and loss computation (instead of drawing a new
sample every time).
4.2 Direct specialisations of the most general rule
The rule learning algorithm (algorithm 2) first tests the direct
specialisations of the most general rule (i.e., the rules with the
shortest bodies). These starting rules can of course be constructed
automatically, but then there are exponentially many of them, most of
them being absurd. E.g., when learning the father relation with the
EDB predicates person, sex, likes and sport, the rule

father(X,Y) ← sport(X) ∧ sport(Y)

is a direct specialisation of the most general rule. But this rule
does not make sense, as sports and persons are totally different
topics.
To avoid generating a lot of absurd starting rules, this task is
delegated to the user. It is very easy for her to generate reasonable
starting rules, particularly when there are special relations for the
domains.
In the example above, both arguments of the binary relation father
should contain persons, thus a reasonable starting rule is

father(X,Y) ← person(X) ∧ person(Y).

Of course, there may be more than one starting rule in general. In
this case, the algorithm uses all of them.
4.3 Replacement patterns
It is often possible to simplify learned rules, e.g. to replace parts
of a rule (sets of subgoals) by other literals, obtaining an
equivalent but shorter rule. The evaluation of a rule requires the
computation of a join over all relations used in the rule body. The
join of m relations with n_i tuples each can be determined in time
O(∏_i n_i). Thus, a shorter body leads to faster evaluation.
E.g., the two rules r_1 and r_2, given as

r_1: father(X,Y) ← person(X) ∧ person(Y) ∧ parent(X,Y);
r_2: father(X,Y) ← parent(X,Y);

(where person and parent are deterministic relations), are logically
equivalent w.r.t. the condition that both arguments of the predicate
parent must be instances of the predicate person.
The literals person(X) and person(Y) are redundant in rule r_1. Thus,
rule r_2 is shorter and can be evaluated faster than rule r_1.
Therefore it is useful to simplify rules as far as possible.
Replacement patterns specify which parts of a rule can be replaced by
other literals for simplification.

DEFINITION 9. A replacement pattern is a statement

h_1 ∧ … ∧ h_s ⇐ b_1 ∧ … ∧ b_t

where the h_i and b_j are arbitrary literals. The h_i form the head
p_head of the pattern, the b_j's its body p_body. The variables in
the head are a subset of the variables in the body. Constants are
allowed in both parts.
The semantics is defined as follows: Given a rule r, a replacement
pattern p, and a substitution θ with p_body θ ⊆ r_body, then

r ≡ (r ∖ p_body θ) ∪ p_head θ

should hold.
One of the most important application areas of replacement patterns
is described by the example above: modelling part-of relationships to
allow for removing redundant literals. There, the replacement
patterns

rp_1: parent(A,B) ⇐ person(A) ∧ parent(A,B);
rp_2: parent(A,B) ⇐ person(B) ∧ parent(A,B);

can be stated to express the part-of relationships from above,
stating that both arguments of the predicate parent must be instances
of the predicate person.
Rule r_1 can be simplified to

father(X,Y) ← person(Y) ∧ parent(X,Y)

using pattern rp_1 and the substitution θ = {X/A, Y/B}. The resulting
rule can further be simplified using pattern rp_2 and the same
substitution θ.
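Applying a replacement pattern is a small matching problem. A brute-force Python sketch, using our own encoding (literals are (predicate, args) tuples, pattern arguments are assumed to be variables); the real HyLearner code is not shown in the paper:

```python
from itertools import permutations

def apply_pattern(body, pattern_head, pattern_body):
    """Apply a replacement pattern once (Definition 9): find a
    substitution theta mapping pattern_body onto a subset of the
    rule body, remove the matched literals and add the instantiated
    pattern head."""
    for cand in permutations(body, len(pattern_body)):
        theta, ok = {}, True
        for (pp, pargs), (rp, rargs) in zip(pattern_body, cand):
            if pp != rp or len(pargs) != len(rargs) or not all(
                    theta.setdefault(v, c) == c
                    for v, c in zip(pargs, rargs)):
                ok = False
                break
        if ok:
            new = [lit for lit in body if lit not in cand]
            head = (pattern_head[0],
                    tuple(theta[v] for v in pattern_head[1]))
            if head not in new:
                new.append(head)
            return new
    return body  # no match: leave the rule unchanged

# rule r1 and the patterns rp1, rp2 from above
body = [("person", ("X",)), ("person", ("Y",)), ("parent", ("X", "Y"))]
rp = ("parent", ("A", "B"))  # pattern head of both rp1 and rp2
step1 = apply_pattern(body, rp, [("person", ("A",)), ("parent", ("A", "B"))])
step2 = apply_pattern(step1, rp, [("person", ("B",)), ("parent", ("A", "B"))])
# step1 == [("person", ("Y",)), ("parent", ("X", "Y"))]
# step2 == [("parent", ("X", "Y"))]
```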
The probabilistic case is more difficult. Evaluation of head and body
of the pattern must yield the same probabilities. It is the task of
the user to ensure this property. If the body equals the head plus
one further literal (i.e., the pattern states that this literal is
redundant), this property holds if
1. the corresponding relations are deterministic (e.g., domain
relations), or
2. the replacement pattern expresses a part-of relation which is also
modelled by an IDB relation in the knowledge base.
Criterion 2 can be explained by the following example:

EXAMPLE 10. The knowledge base contains the rules

person(X) ← parent(X,Y);
person(Y) ← parent(X,Y).

Then, the rules r_1 and r_2 from above deliver the same event
expressions for all constant tuples and, thus, the same
probabilities. Therefore, the replacement patterns rp_1 and rp_2 are
still valid in the probabilistic case.
4.4 Disjunction patterns
Algorithm 2 often generates rules with pairwise disjoint literals.
The evaluation of these rules always yields empty result sets and is
therefore superfluous. We can significantly improve efficiency if we
skip the evaluation of these rules.
In the most simple case of pairwise disjoint literals, a rule
contains both a literal l and its negated counterpart ¬l. This case
can be tested automatically.
More general cases must be defined explicitly:

DEFINITION 11. A disjunction pattern is a rule with an empty head:

b_1 ∧ … ∧ b_t.

The semantics is defined as follows: Given a rule r, a disjunction
pattern p and a substitution θ with p θ ⊆ r_body, then r is
unsatisfiable.
EXAMPLE 12. The rule

father(X,Y) ← sex(X,male) ∧ sex(X,female) ∧ parent(X,Y)

is unsatisfiable, and its evaluation can be skipped. To detect this,
we have to state the disjunction pattern

sex(X,male) ∧ sex(X,female).

Another application area is the detection of wrong domains (produced
by rule specialisation):

EXAMPLE 13. Obviously, the second argument of sex contains the
constants male or female, but no persons. Nevertheless, the algorithm
produces the rule

father(X,Y) ← person(X) ∧ person(Y) ∧ sex(X,Y).

This can be detected using the disjunction pattern

person(Y) ∧ sex(X,Y).

These problems could be solved with signatures, too. A signature
assigns every argument of a predicate a type (domain). Then,
comparison of types can be used to detect wrong domains:

EXAMPLE 14. In example 13, there are the disjoint domains PERSON and
SEX = {male, female} and the signatures

person(PERSON); parent(PERSON, PERSON);
father(PERSON, PERSON); sex(PERSON, SEX).

Then, it is easy to detect that Y ∈ PERSON and Y ∈ SEX is an error.
There are several drawbacks of this approach:
1. It is less general than disjunction patterns. For the relations
and signatures professor(PERSON), student(PERSON),
examines(PERSON, PERSON), rules with a student as examiner cannot be
ruled out with signatures (but they can with the simple disjunction
pattern student(X) ∧ examines(X,Y)). Thus, we would need a type
hierarchy, which increases the complexity of this approach.
2. We have to introduce a syntax for domains, relationships between
domains, and signatures. In contrast, we can use the existing syntax
for disjunction patterns. This is more elegant.
3. With signatures, we cannot detect disjunctions such as those in
example 12.
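Matching a disjunction pattern against a rule body uses the same machinery as replacement patterns. A brute-force sketch in our own encoding (uppercase arguments are variables, everything else a constant):

```python
from itertools import permutations

def _match(parg, rarg, theta):
    """Uppercase pattern arguments are variables, others constants."""
    if parg[0].isupper():
        return theta.setdefault(parg, rarg) == rarg
    return parg == rarg

def unsatisfiable(body, patterns):
    """Definition 11: if some substitution maps all literals of a
    disjunction pattern onto literals of the rule body, the rule can
    never fire and its evaluation can be skipped."""
    for pattern in patterns:
        for cand in permutations(body, len(pattern)):
            theta = {}
            if all(pp == rp and len(pa) == len(ra)
                   and all(_match(v, c, theta) for v, c in zip(pa, ra))
                   for (pp, pa), (rp, ra) in zip(pattern, cand)):
                return True
    return False

# the pattern of example 12
patterns = [[("sex", ("A", "male")), ("sex", ("A", "female"))]]
bad = [("sex", ("X", "male")), ("sex", ("X", "female")),
       ("parent", ("X", "Y"))]
good = [("sex", ("X", "male")), ("parent", ("X", "Y"))]
# unsatisfiable(bad, patterns) -> True
# unsatisfiable(good, patterns) -> False
```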
4.5 Parallel evaluation
Both learning algorithms are based on the same generic structure (see
figure 1), in which γ values and the resulting loss are computed for
each hypothesis (i.e., the first part inside the for loop). This is
completely independent of the computations for the other hypotheses.
The only connection is the process of determining the hypothesis with
overall minimum loss (the if statement). Thus, computation of the
hypotheses can be parallelised in both learning algorithms (also
allowing for distributed evaluation).
As the computation for one hypothesis requires two evaluation steps,
these evaluations are the most expensive parts of the algorithms, and
the algorithm itself spends most of the time waiting for the external
program (e.g., HySpirit) to finish, parallelisation yields a
noticeable performance improvement (of course, only by a constant
factor).
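Since the algorithm mostly waits for an external process, even thread-based parallelism helps. A sketch, not the HyLearner implementation: the hypothetical parameter `evaluate` stands for the combined γ fitting and loss computation of one hypothesis (e.g. two calls out to HySpirit):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def learn_parallel(hypotheses, evaluate, workers=4):
    """Parallel variant of the generic loop (section 4.5): the
    per-hypothesis evaluations are independent, so they run
    concurrently; only the final minimum-loss selection is
    sequential.  evaluate(h) returns (fitted_hypothesis, loss)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(evaluate, hypotheses))
    return min(results, key=lambda r: r[1], default=(None, math.inf))

# toy run: pretend the loss of hypothesis h is (h - 3)^2
best, loss = learn_parallel(range(6), lambda h: (h, (h - 3) ** 2))
# best == 3, loss == 0
```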
4.6 Direct evaluation in a relational database management system
This enhancement can also be applied to both learning algorithms.
The common approach to evaluating rules is to use HySpirit. Although
this inference engine underwent some noticeable speedups, it is still
pretty slow. A strong improvement can be achieved by directly
evaluating rules (using SQL) in a standard relational database
management system (RDBMS) and performing some necessary but fast
postprocessing.
We restricted rules so that one select statement is sufficient to
evaluate them:
1. The background knowledge base KB may contain only EDB predicates
(facts) but no rules.
2. Recursion is not allowed.
3. Negated predicates are allowed, but then a preprocessing step is
required, and the use of negation in rules is restricted.
The EDB relations of the background knowledge base KB can then be
stored in the RDBMS, one table (relation) for each predicate. Each
table column corresponds to one predicate argument. There is an
additional table for each predicate (which may occur in a negated
subgoal) which contains the complement of that predicate against an
appropriate domain.
Algorithm 3.1 in [14] describes the transformation of a non-recursive
Datalog rule with only EDB predicates in the body into an expression
of relational algebra. As we do not use built-in predicates like
=, <, ≤, >, ≥ or ≠, we can omit steps (2) and (4) from this
algorithm. Negated subgoals can be handled analogously to positive
subgoals.
To transform this expression of relational algebra into a select
query, the selection and projection operations must be moved to the
outside and replaced by a join operation. The resulting (equivalent)
expression has the form

π_projection(σ_selection(relations))

and can be transformed into a select query:

select projection from relations
where selection
Having uncertain facts and rules (which are in fact certain rules
with additional facts in the body), the RDBMS has to compute an event
expression for every result tuple. This event expression can then be
evaluated in the algorithm to compute a probability for the tuple.
The select query now must return both result tuples and the
corresponding event expression. There must be one event key (hence
there can be no IDB predicates in the rule body) with its probability
for each relation in the event expression. Thus, this event
expression is a conjunction of event keys.

EXAMPLE 15. The rule

father(X,Y) ← parent(X,Y) ∧ sex(X,male)

can be transformed into the select query

select
  concat(sex.name, ',', parent.name2) as consts,
  concat(parent.prob, ' parent(', parent.name1, ',', parent.name2,
         ')', ' & 1.0 sex(', sex.name, ',', sex.value, ')') as event
from
  parent, sex
where
  sex.name = parent.name1 and sex.value = 'male'
order by
  consts;

With the relations

relation parent:           relation sex:
name1  name2  prob         name    value
bob    mary   0.9          peter   male
mary   anton  1            bob     male
anton  peter  0.5          anton   male
                           mary    female

stored in the RDBMS, this select query will return

consts        event
bob,mary      0.9 parent(bob,mary) & 1.0 sex(bob,male)
anton,peter   0.5 parent(anton,peter) & 1.0 sex(anton,male)
Having multiple rules for one predicate, each rule has to be
evaluated separately. The results have to be merged together (for
that reason, we use an order by statement in the query). If there are
multiple event expressions for one tuple (due to multiple rules or
free variables in the rule body), the disjunction of these
expressions has to be built. The probability of an event expression
is calculated in exactly the same way as described in [3] (except
that an event expression is automatically in disjunctive normal
form).
5. APPLICATIONS
We now describe the application of the learning algorithms to several
test collections.
5.1 Text classification with the Reuters collection
The well-known Reuters collection [6] contains 7 775 training and
3 019 test documents (Lewis split). Each document has attached at
least one out of 115 categories. The task is to learn rules for each
category. For every test document, the categories in the first rank
are selected for evaluation.
There are two relations:
1. docTerm contains the N terms of the documents (after stop word
elimination and stemming) with the highest inverse category frequency
(assuming that these terms are good discriminators):

icf_t := log( #categories / #categories containing t )

The experiment is run twice: the first time with deterministic terms,
and afterwards with probabilistic terms weighted with normalised
tf · idf (see [13]).
2. doc contains all document ids.
The example set contains all training documents, with weight 1 iff
the document is assigned the category, and 0 otherwise. Negated
literals are prohibited by disjunction patterns.
Thus, algorithm 2 learns rules of the form

cat(D) ← docTerm(D, t_1) ∧ … ∧ docTerm(D, t_l).

The following table shows the precision values for different
parameter settings:

#rules  #literals  #terms  precision (det.)  precision (prob.)
3       2          50      42.8%             40.9%
4       2          200     46.1%             42.4%
4       4          100     43.0%             41.0%

The rule learning algorithm (algorithm 2) outputs as many rules as
specified (except if fewer rules are sufficient to achieve a loss of
zero). To this result set, we also applied the probability learning
algorithm (algorithm 1), which outputs an optimum subset (the subset
with minimum loss). But in most cases, the optimum subset simply was
the learned rule set, and we did not achieve a better precision.
Due to high computation times, we did not consider larger parameter
values. Anyway, the table suggests that no substantial improvement
could be gained. In particular, it would not help to increase the
number of literals, because the learned rules were never longer than
three literals.
An interesting result is that we obtained better results for
deterministic facts (binary indexing).
Compared with other approaches (e.g., as described in [19]) with a
precision of 85% (kNN) or 65% (Rocchio), learning uncertain rules for
text classification is not appropriate. One reason is the high
dimensionality (high number of terms) of the learning task. To reduce
the hypothesis space, we only considered 2% of the terms (w.r.t. the
icf values).
5.2 Text classification to detect spam emails
The collection spambase [11] contains 4 601 emails, 1 813 (= 39.4%)
of which are spam. We split the collection randomly into two
collections of nearly the same size. The collection only contains the
frequencies of 48 terms for each email. We used all of these terms
for learning.
We classified a test email emailID as spam iff

Pr(spam(emailID)) ≥ 0.5

holds. For probabilistic term weights, no email reached this limit,
thus only deterministic facts are considered.
We achieved the following results:

#rules  #literals  precision  precision*
1       1          71.6%      79.4%
1       2          72.7%      81.3%
2       2          74.7%      98.7%
3       3          81.0%      88.8%
4       4          83.4%      96.2%

Precision* is the percentage of documents where no non-spam email is
classified as spam (and, possibly, automatically deleted), as this is
particularly interesting for the user.
The results are better than for the Reuters collection, because the
dimensionality of the learning task is much lower. But compared with
a precision of 93% (given in the collection as a baseline value),
rule learning performs badly even for this simple collection.
5.3 Detection of poisonous mushrooms
As a third example, we applied our algorithm to a collection from the
field of machine learning: the mushroom collection from [10]. The
task is to detect poisonous mushrooms (in the Agaricus and Lepiota
family). There are 8 124 examples (23 species), each of which
consists of 22 nominal attributes and a label: (possibly) poisonous
(3 916 examples, 48.2%) or definitely edible (4 208 examples, 51.8%).
Again, we split the collection randomly into two collections.
We used one relation for each attribute (with the id and the
attribute value as arguments) and the relation mushroom containing
all mushroom ids. All relations are deterministic.
Again, we classified a mushroom as poisonous iff

Pr(poisonous(mushroomID)) ≥ 0.5

holds.
We evaluated both the learned uncertain rules and the same rule sets
without probabilities:

#rules  #literals  precision (uncertain)  precision (certain)
1       1          74.2%                  74.2%
1       3          89.1%                  89.0%
2       2          88.8%                  89.1%
2       3          93.7%                  92.6%
3       3          96.4%                  95.3%
4       4          98.8%                  97.1%
5       5          99.8%                  98.0%

Again, the rules contain no more than three literals. In most cases,
the uncertain rules are slightly better than the certain ones (and in
real applications, the final error reduction from 2% to 0.2% would be
important).
Our results with 99.8% are close to the optimum of 100%. The optimum
was reached by [1], learning 12 rules with neural nets.
5.4 Uncertain schema mapping
Finally, we applied our learning algorithm to schema mapping in digital libraries. The schema of a digital library defines the attributes of the documents. Following the approach in [2], an attribute is a pair of a name and a data type (a set of values together with a set of binary vague predicates). Examples for schemas are Dublin Core (DC, see [4]) and MARC (see [9]). Then, binary relations attribute(docID, value) can be used to encode documents in Datalog:
EXAMPLE 16. The DC description of [2] contains

creator(fuhr99, fuhr_norbert),
title(fuhr99, towards_data_abstraction_in_networked_information_retrieval_systems),
date(fuhr99, 1999).
A query condition is a triple containing an attribute name, a predicate and a comparison value. This can be encoded using the relation attribute_predicate(docID, value) and rules mapping from the attribute relation to attribute_predicate. For simplicity, we will ignore the different predicates in this text. Thus, to retrieve documents with the creator Norbert Fuhr, we have to ask

?author(X, fuhr_norbert).
In the area of federated digital libraries, the ultimate goal is to integrate hundreds of heterogeneous digital libraries in one system. Typically, no common schema is supported by all libraries. Thus, we have to use a standard schema for the queries and then map between different schemas.
For example, our standard schema used for queries is DC with the attribute creator (which contains all authors), but a digital library supports only another schema with the attributes author and editor. Then we have to state

creator(X, Y) :- author(X, Y).
creator(X, Y) :- editor(X, Y).

as all authors and editors are included in creator.
If our standard query contains the attributes author and editor, but for the documents only creator (from DC) is available, we have to use uncertain rules. E.g., if 60% of the creators are authors (and 40% are editors), we obtain the rules

0.6 author(X, Y) :- creator(X, Y).
0.4 editor(X, Y) :- creator(X, Y).

Of course, more complex rules or rule sets are possible as well.
In general, it is difficult to calculate the rule weights even for known schemas. In realistic applications, dozens or hundreds of digital libraries have to be integrated. In this setting, it is even impossible to construct the rule bodies manually. Instead, we have to learn the rules from examples.
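Such weights can be estimated from examples as conditional relative frequencies. A minimal sketch (the relation contents are invented; this is not the authors' implementation):

```python
# Relations as sets of (docID, value) tuples; the data is invented so
# that 3 of 5 creator tuples are authors and 2 are editors.
creator = {("d1", "fuhr"), ("d2", "nottelmann"), ("d3", "smith"),
           ("d4", "jones"), ("d5", "brown")}
author  = {("d1", "fuhr"), ("d2", "nottelmann"), ("d3", "smith")}
editor  = {("d4", "jones"), ("d5", "brown")}

def rule_weight(head_rel, body_rel):
    """Estimate Pr(head | body) as the fraction of body tuples that
    also occur in the head relation."""
    return len(head_rel & body_rel) / len(body_rel)
```

On this example data, rule_weight(author, creator) yields 0.6 and rule_weight(editor, creator) yields 0.4, i.e. exactly the weights of the two uncertain rules above.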
As a test collection, we used the Library of Congress Open Archive Initiative Repository 1 (see [8]). Here, documents are available in DC and MARC.
In DC, the fields title, creator, subject, description, publisher, date, type, identifier, language and coverage are supported.
The MARC schema consists of a set of fields, each identified by an ordinal number. Furthermore, these fields are subdivided into several subfields (identified by a letter or a digit). E.g., subfield "a" of field 100 contains a personal name, and subfield "d" may contain any date associated with the person. We used the combination of field and subfield identifiers as attributes and added for every field another attribute containing the concatenation of all subfields of this field (noted as subfield "all").
Learning is based on exact match. The values of the document attributes are converted into Datalog constants: Upper case letters are transformed into their lower case equivalents, and all characters except letters and digits are converted into underscores. Subsequent underscores are transformed into a single one, and underscores are deleted from the start and the end of the constant. Furthermore, only the first 100 characters of the resulting string are considered for efficiency.
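The conversion just described can be sketched as a small routine (our reconstruction of the stated steps, not the authors' code):

```python
import re

def to_constant(value, max_len=100):
    """Convert an attribute value into a Datalog constant:
    lower-case all letters, turn every non-alphanumeric character into
    an underscore, collapse runs of underscores, strip underscores from
    both ends, and truncate to max_len characters."""
    s = value.lower()
    s = re.sub(r"[^a-z0-9]", "_", s)   # non-letters/digits -> underscore
    s = re.sub(r"_+", "_", s)          # collapse underscore runs
    s = s.strip("_")                   # drop leading/trailing underscores
    return s[:max_len]
```

For example, to_constant("Fuhr, Norbert") produces the constant fuhr_norbert used in Example 16.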
As examples, we used a randomly chosen subset of the product of all documents and all attribute values, and obtained approximately 8 000 examples.
First, we used DC as standard schema and MARC as the library schema, and learned rules for all DC attributes (3 rules, 4 subgoals, only positive subgoals). Once more, recursion is disabled. E.g., our algorithm outputs the following deterministic rules for the DC attribute creator (with no loss):

dc_creator(X, Y) :- marc_100_all(X, Y).
dc_creator(X, Y) :- marc_700_all(X, Y).
dc_creator(X, Y) :- marc_710_all(X, Y).
MARC eld 100 is the personal name eld mentioned above,
eld 700 contains another personal name,and eld 710 a cor-
porate name.
We invoked the algorithm with different levels of improvements: evaluation with HySpirit or direct evaluation in the relational database management system MySQL, with different client numbers, with and without replacement and disjunction patterns. (HySpirit was not able to evaluate without them.) We achieved the following computing (elapsed) times:

    #DB clients  patterns  time (in minutes)
    1            no        87.53
    5            no        14.15
    5            yes        8.00

We aborted the evaluation with the HySpirit pass after three days. At that time, HyLearner had not even completed learning the first rule (out of three).
These results show that a significant acceleration can be achieved with our improvements. In particular, direct evaluation in a database system and parallelisation yield a huge efficiency improvement.
For the direction DC!MARC,HyLearner generates uncer-
tain rules (due to the higher specicity of MARC).For example,
for the attribute concatenation of all subelds of eld 700 the
following rule was produced:
0:4 marc_100_all(X;Y) dc_creator(X;Y):
Of course,the output is formed as described in section 2.But for
one rule,this is equivalent to the original rule probabilities intro-
duced in [3].
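When several uncertain rules derive the same head tuple, their rule events are combined under pDatalog's independence assumption. The combination can be sketched as follows (our reading of that assumption, not HyLearner code):

```python
from functools import reduce

def combine(rule_probs):
    """Probability of a head tuple derived by several uncertain rules
    whose rule events are assumed independent:
    Pr(head) = 1 - prod_i (1 - p_i)."""
    return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), rule_probs, 1.0)
```

With a single rule this reduces to the rule weight itself, so combine([0.4]) matches the weight of the rule above; two independent rules with weights 0.4 and 0.5 combine to 0.7.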
The results show that our algorithm can be used for learning schema mappings: When there is equivalent information in both schemas, HyLearner computes deterministic rules with no loss. However, when the schemas have different expressiveness (e.g., DC versus MARC), uncertain mappings are required. In this case, HyLearner can efficiently compute the corresponding probabilistic rules.
6.CONCLUSION AND OUTLOOK
In this paper,we rst described briey how uncertain rules can
be used in general.Instead of assigning each rule a conditional
probability (implying that only one rule may hold for a tuple vec-
tor),all required 2
n
probabilities are modeled and used.These
probabilities can be transformed into additional facts which are
We gave an overview over two algorithms for learning rule probabilities and rule bodies. These algorithms borrow ideas from common Machine Learning methods, but modify them in order to fit probabilistic Datalog.
Then, we have shown how Wüthrich's algorithms in [17, 18] can be extended to improve efficiency. Some practical tests showed a huge acceleration of the learning process.
Finally, we applied the algorithm to several test collections. It became clear that rule learning is appropriate for low-dimensional problems.
One main application area for probabilistic Datalog rule learning
could be schema mapping.Our preliminary results are promising.
Future work is needed to verify the retrieval performance of the
approach and to develop methods for transforming attribute values.
7.ACKNOWLEDGEMENTS
Part of this work is supported by the EU commission under grant IST-2000-26061 (MIND).
8.REFERENCES
[1] … logical rules from training data using backpropagation networks. In Proceedings of the 1st Online Workshop on Soft Computing, pages 25-30, 1996. http://www.bioele.nuee.nagoya-u.ac.jp/wsc1/papers/p061.html.
[2] N. Fuhr. Towards data abstraction in networked information retrieval systems. Information Processing and Management, 35(2):101-119, 1999.
[3] N. Fuhr. Probabilistic Datalog: Implementing logical information retrieval for advanced applications. Journal of the American Society for Information Science, 51(2):95-110, 2000.
[4] Dublin Core metadata element set: Reference description. http://dublincore.org/documents/dces/.
[5] N. Lavrac and S. Dzeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
[6] D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Symposium on Document Analysis and Information Retrieval, Las Vegas, 1994.
[7] H. Nottelmann. Lernen unsicherer Regeln für HySpirit (Learning uncertain rules for HySpirit, in German). Diploma thesis, University of Dortmund, Department of Computer Science, 2001. http://ls6-www.informatik.uni-dortmund.de/~nottelma/hylearner/diplomarbeit.pdf.
[8] Library of Congress. Library of Congress Open Archive Initiative Repository 1. http://memory.loc.gov/cgi-bin/oai1_0.
[9] Library of Congress. MARC standards. http://www.loc.gov/marc/marc.html.
[10] UCI Machine Learning Repository. Collection mushroom. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/mushroom/.
[11] UCI Machine Learning Repository. Collection spambase. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/spambase/.
[12] K. Ross. Modular stratification and magic sets for Datalog programs with negation. Journal of the ACM, 41(6):1216-1266, November 1994.
[13] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[14] J. D. Ullman. Principles of Database and Knowledge-Base Systems, volume I. Computer Science Press, Rockville (Md.), 1988.
[15] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.
[16] C. J. van Rijsbergen. A non-classical logic for information retrieval. The Computer Journal, 29(6):481-485, 1986.
[17] B. Wüthrich. On the learning of rule uncertainties and their integration into probabilistic knowledge bases. Journal of Intelligent Information Systems, 2:245-264, 1993.
[18] B. Wüthrich. Knowledge discovery in databases. Technical report, Hong Kong University of Science and Technology, Department of Computer Science, 1996. ftp://ftp.cs.ust.hk/pub/techreport/96/tr96-04Chap7.ps.gz.
[19] Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1):67-88, 1999.