RECENT ADVANCES IN MECHANICS AND RELATED FIELDS
UNIVERSITY OF PATRAS 2003
in Honour of Professor Constantine L. Goudas


EFFICIENCY OF MACHINE LEARNING TECHNIQUES IN PREDICTING
STUDENTS’ PERFORMANCE IN DISTANCE LEARNING SYSTEMS

S. B. Kotsiantis, C. J. Pierrakeas, I. D. Zaharakis, P. E. Pintelas

Educational Software Development Laboratory
Department of Mathematics
University of Patras
Greece
e-mail: {sotos, chrpie, john, pintelas}@math.upatras.gr
Keywords: supervised machine learning algorithms, prediction of student performance, distance learning.

Abstract. The ability of predicting a student’s performance is very important in university-level distance
learning environments. The scope of the research reported here is to investigate the efficiency of machine
learning techniques in such an environment. To this end, a number of experiments have been conducted using
five representative learning algorithms, which were trained using data sets provided by the “informatics”
course of the Hellenic Open University. It was found that learning algorithms could enable tutors to predict
student performance with satisfying accuracy long before the final examination. A second aim of the study was to
identify the student attributes, if any, that most influence the induction of the learning algorithms. It was found
that there exist some obvious and some less obvious attributes that demonstrate a strong correlation with student
performance. Finally, a prototype version of a software support tool for tutors has been constructed, implementing
the Naive Bayes algorithm, which proved to be the most appropriate among the tested learning algorithms.
1 INTRODUCTION
The tutors in a distance-learning course must continuously support their students regardless of the distance
between them. A tool that could automatically recognize a student's level would enable the tutors to
personalize the education more effectively. While the tutors would still have the essential role in
monitoring and evaluating student progress, the tool could compile the data required for reasonable and efficient
monitoring.
This paper examines the use of Machine Learning (ML) techniques to predict students'
performance in a distance learning system. Even though ML techniques have been successfully applied in
numerous domains such as pattern recognition, image recognition, medical diagnosis, commodity trading,
computer games and various control applications, to the best of our knowledge there is no previous attempt in
the presented domain [10], [15]. Thus, we use a representative algorithm for each of the most common machine
learning techniques, namely Decision Trees [11], Bayesian Nets [6], Perceptron-based Learning [9], Instance-Based
Learning [1] and Rule-learning [5], so as to investigate the efficiency of ML techniques in such an environment.
Indeed, it is shown that learning algorithms can predict student performance with satisfying accuracy long
before the final examination.
In this work we also try to identify the student characteristics that most influence the induction of the
algorithms. This reduces the information that needs to be stored and speeds up the induction.
For the purpose of our study, the "informatics" course of the Hellenic Open University (HOU) provided the data
set. A significant conclusion of this work is that the students' sex, age, marital status, number of children and
occupation attributes do not contribute to the accuracy of the prediction algorithms.
The following section describes the data set of our study. Some elementary Machine Learning definitions and
a more detailed description of the used techniques and algorithms are given in section 3. Section 4 presents the
experimental results for the five compared algorithms. The attribute selection methodology used to find the
attributes that most influence the induction, as well as whether it improves the accuracy of the tested algorithms,
is discussed in section 5. Finally, section 6 discusses the conclusions and some future research directions.

2 HELLENIC OPEN UNIVERSITY DISTANCE LEARNING METHODOLOGY AND DATA
DESCRIPTION
For the purpose of our study, the "informatics" course of HOU provided the training set. A total of 354
examples (students' records) were collected from the module "Introduction to Informatics" (INF10) [16].
Regarding the INF10 module, during an academic year students have to hand in 4 written assignments,
optionally participate in 4 face-to-face meetings with their tutor, and sit for final examinations after an 11-month
period. A student must submit at least three of the four assignments. The total mark gathered from the handed-in
written assignments should be greater than or equal to 20 for a student to qualify to sit for the final examinations
of the module.
In the sequel, we present in Table 1 the attributes of our data set along with the values of each attribute. The set
of attributes was divided into two groups: the "Demographic attributes" group and the "Performance attributes"
group.
Student's demographic attributes
  Sex: male, female
  Age: age<32, age≥32
  Marital status: single, married, divorced, widowed
  Number of children: none, one, two or more
  Occupation: no, part-time, full-time
  Computer literacy: no, yes
  Job associated with computers: no, junior-user, senior-user
Student's performance attributes
  1st face-to-face meeting: absent, present
  1st written assignment: mark<3, 3≤mark≤6, mark>6
  2nd face-to-face meeting: absent, present
  2nd written assignment: mark<3, 3≤mark≤6, mark>6
  3rd face-to-face meeting: absent, present
  3rd written assignment: mark<3, 3≤mark≤6, mark>6
  4th face-to-face meeting: absent, present
  4th written assignment: mark<3, 3≤mark≤6, mark>6
Table 1. The attributes used and their values
The "Demographic attributes" group contains attributes collected from the Students' Registry
of the HOU concerning students' sex, age, marital status, number of children and occupation. In addition to the
above attributes, the previous (post-high-school) education in the field of informatics and the association
between students' jobs and computers were also taken into account.
The "Performance attributes" group contains attributes collected from tutors' records concerning
students' marks on the written assignments and their presence or absence in face-to-face meetings. Marks in the
written assignments were categorized in three groups (mark<3, 3≤mark≤6 and mark>6), with no submission of
an assignment counted as a mark of 0. In this work, we used the entropy-based discretization approach [4] to
discretize marks, and the results were better than those of our previous practical discretization [8], where marks
in the written assignments had been categorized in five groups: "no" meant no submission of the specific
assignment, "fail" a mark less than 5, "good" a mark between 5 and 6.5, "very good" a mark between 6.5 and 8.5,
and "excellent" a mark higher than 8.5.
Finally, as already mentioned, the examined class of the induction represents the result of the final
examination test and takes two values. "Fail" represents students with poor performance. "Pass" represents students
who completed the INF10 module getting a mark of at least 5 in the final test.
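For illustration, the three-way mark discretization just described can be sketched as a small helper (an illustrative sketch, not the entropy-based procedure of [4]; the 0-10 marking scale and the function name are assumptions):

```python
def discretize_mark(mark):
    """Map a raw written-assignment mark (0-10 scale; None = not submitted)
    to one of the three categories used in this study."""
    if mark is None:            # no submission counts as a mark of 0
        mark = 0.0
    if mark < 3:
        return "mark<3"
    elif mark <= 6:
        return "3<=mark<=6"
    return "mark>6"
```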
In order to examine the usage of the learning techniques in this domain, a representative algorithm from each of
the five most common machine learning techniques, namely Decision Trees [11], Perceptron-based Learning [9],
Bayesian Nets [6], Instance-Based Learning [1] and Rule-learning [5], is applied. In the next section we give some
elementary Machine Learning definitions and briefly describe these supervised machine-learning techniques. A
detailed description can be found in [7].

3 MACHINE LEARNING ISSUES
Inductive machine learning is the process of learning, from examples, a hypothesis or classifier that can be
used to generalize to new examples. Generally, for a two-class problem, a classifier can make two types of
classification error on new examples: it can misclassify positive instances as negative as well as negative instances
as positive. The rate of correct predictions made by the classifier is the prediction accuracy of that classifier on
the specific data set. In the sequel, we briefly describe the supervised machine learning techniques used.
A recent overview of existing work in decision trees is provided by [11]. Decision trees classify
examples by sorting them based on attribute values. Each node in a decision tree represents an attribute of an
example to be classified, and each branch represents a value that the node can take. Examples are classified
starting at the root node and are sorted based on their attribute values; the attribute that best divides the
training data becomes the root node of the tree. The same process is then repeated on each partition of the
divided data, creating subtrees, until the training data are divided into subsets of the same class. However, a
hypothesis h is said to overfit the training data if there exists another hypothesis h' that has a larger error than h
on the training data, but a smaller error than h on the entire data set. For this reason, there are two common
approaches that decision tree algorithms use to avoid overfitting the training data: 1) stop the training algorithm
before it reaches a point at which it perfectly fits the training data, or 2) prune the induced decision tree.
Another interesting work in the machine-learning domain is under the heading of "perceptrons" [9]. The
perceptron structure can deal with two-class problems. In detail, it classifies a new instance x into class 2 if

    Σ_i w_i x_i > θ

and into class 1 otherwise. It accepts instances one at a time and updates the weights w_i as necessary. It
initializes its weights w_i and θ, and then accepts a new instance (x, y), applying the threshold rule to compute
the predicted class y'. If the predicted class is correct (y' = y), the perceptron does nothing. However, if the
predicted class is incorrect, the perceptron updates its weights. The most common way the perceptron algorithm
is used for learning from a batch of training instances is to run the algorithm repeatedly through the training set
until it finds a prediction vector which is correct on all of the training set. This prediction rule is then used for
predicting the labels on the test set.
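The threshold rule and mistake-driven batch scheme described above can be sketched as follows (a minimal sketch; the classic additive update w = w + y·x is one concrete choice of update rule, since the text does not fix one):

```python
import numpy as np

def perceptron_train(X, y, theta=0.5, epochs=100):
    """Mistake-driven perceptron: predict class 2 (+1) when
    sum_i w_i * x_i > theta, class 1 (-1) otherwise; update
    the weights only on misclassified instances."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):            # yi is +1 or -1
            pred = 1 if xi @ w > theta else -1
            if pred != yi:                  # additive update on a mistake
                w += yi * xi
                mistakes += 1
        if mistakes == 0:                   # correct on the whole training set
            break
    return w
```

Training stops once a weight vector correct on all training instances is found, matching the batch scheme described above.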
An excellent book about Bayesian networks is provided by [6]. A Bayesian network is a graphical model
for probabilistic relationships among a set of attributes. The Bayesian network structure S is a directed acyclic
graph (DAG) whose nodes are in one-to-one correspondence with the attributes. The arcs represent causal
influences among the variables, while the lack of an arc in S encodes a conditional independence. Moreover, an
attribute (node) is conditionally independent of its non-descendants given its parents. Using a suitable training
method, one can induce the structure of the Bayesian network from a given training set. In spite of the
remarkable power of Bayesian networks, they have an inherent limitation: the computational difficulty of
exploring a previously unknown network. Given a problem described by n attributes, the number of possible
structure hypotheses is more than exponential in n. When the structure is unknown but the data can be assumed
complete, the most common approach is to introduce a scoring function (or score) that evaluates the "fitness" of
networks with respect to the training data, and then to search for the best network according to this score. The
classifier based on this network and on the given set of attributes X_1, X_2, ..., X_n returns the label c that
maximizes the posterior probability p(c | X_1, X_2, ..., X_n).
Instance-based learning algorithms belong to the category of lazy-learning algorithms [10], as they defer the
induction or generalization process until classification is performed. One of the most straightforward instance-based
learning algorithms is the nearest neighbour algorithm [1]. K-Nearest Neighbour (kNN) is based on the
principle that the examples within a data set will generally exist in close proximity to other examples that have
similar properties. If the examples are tagged with a classification label, then the label of an unclassified example
can be determined by observing the class of its nearest neighbours. The absolute position of the examples within
this space is not as significant as the relative distance between examples, which is determined using a distance
metric. Ideally, the distance metric should minimize the distance between two similarly classified examples, while
maximizing the distance between examples of different classes.
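The majority-vote scheme over the k nearest neighbours can be sketched as follows (a minimal sketch with the Euclidean distance metric; the names are illustrative):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training
    examples under the Euclidean distance metric."""
    neighbours = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```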
In rule induction systems, a decision rule is defined as a sequence of Boolean clauses linked by logical AND
operators that together imply membership in a particular class [5]. The general goal is to construct the smallest
rule set that is consistent with the training data. A large number of learned rules is usually a sign that the learning
algorithm tries to "remember" the training set instead of discovering the assumptions that govern it. During
classification, the left-hand sides of the rules are applied sequentially until one of them evaluates to true, and then
the class label implied by the right-hand side of that rule is offered as the class prediction.
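Sequential rule application can be sketched as follows (an illustrative sketch; the rules and attribute values shown are hypothetical, not rules learned in this study):

```python
def classify_with_rules(rules, default, example):
    """Apply an ordered rule list: the first rule whose AND-ed
    clauses all hold supplies the class label; otherwise fall
    back to a default class."""
    for conditions, label in rules:
        if all(example.get(attr) == value for attr, value in conditions):
            return label
    return default

# Hypothetical rules in the spirit of this domain.
rules = [
    ([("W_ASS-2", "mark<3")], "fail"),
    ([("W_ASS-2", "mark>6"), ("F_MEET3", "present")], "pass"),
]
```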
For the purpose of the present study, a representative algorithm for each described machine learning
technique was selected.
3.1 Brief description of the used machine learning algorithms
The widely used C4.5 algorithm [12] was the representative of the decision trees in our study. At each level in
the partitioning process, C4.5 uses a statistical property known as information gain to determine which attribute
best divides the training examples. To avoid overfitting, C4.5 converts the decision tree into a set of rules (one
for each path from the root node to a leaf) and then generalizes each rule by removing any of its conditions whose
removal improves the estimated accuracy of the rule.
The Naive Bayes algorithm was the representative of the Bayesian networks [3]. It is a simple learning
algorithm that relies on the assumption that every attribute is independent of the rest of the attributes, given the
state of the class attribute.
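The conditional-independence assumption can be sketched for categorical attributes as follows (a minimal sketch; the Laplace smoothing shown is an added implementation detail, not something the paper specifies):

```python
from collections import Counter, defaultdict

def nb_train(examples, labels):
    """Count class priors and per-class value counts for each attribute."""
    prior = Counter(labels)
    counts = defaultdict(Counter)     # (attribute index, class) -> value counts
    for x, c in zip(examples, labels):
        for i, v in enumerate(x):
            counts[(i, c)][v] += 1
    return prior, counts

def nb_predict(model, x):
    """Return the class c maximizing P(c) * prod_i P(x_i | c)."""
    prior, counts = model
    total = sum(prior.values())
    best, best_score = None, -1.0
    for c, pc in prior.items():
        score = pc / total
        for i, v in enumerate(x):
            cnt = counts[(i, c)]
            # Laplace smoothing so unseen values never zero out the product
            score *= (cnt[v] + 1) / (sum(cnt.values()) + len(cnt) + 1)
        if score > best_score:
            best, best_score = c, score
    return best
```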
We also used the 3-NN algorithm, with the Euclidean distance as the distance metric, which combines robustness
to noise with less classification time than a larger k for kNN [14]. Attributes with missing values are given
imputed values so that comparisons can be made between every pair of examples on all attributes.
The RIPPER algorithm [2] was the representative of the rule-learning techniques because it is one of the most
commonly used methods that produce classification rules. RIPPER forms rules through a process of repeated
growing and pruning. During the growing phase the rules are made more restrictive in order to fit the training
data as closely as possible. During the pruning phase, the rules are made less restrictive in order to avoid
overfitting, which can cause poor performance on unseen examples. The growing heuristic used in RIPPER is the
information gain function.
Finally, WINNOW is the representative of the perceptron-based algorithms in our study [9]. It classifies a new
instance x into the second class if

    Σ_i w_i x_i > θ

and into the first class otherwise. It initializes its weights w_i and θ to 1 and then accepts a new instance (x, y),
applying the threshold rule to compute the predicted class y'. If y' = 0 and y = 1, then the weights are too low; so,
for each feature such that x_i = 1, w_i = w_i · α, where α is a number greater than 1, called the promotion
parameter. If y' = 1 and y = 0, then the weights were too high; so, for each feature such that x_i = 1, it decreases
the corresponding weight by setting w_i = w_i · β, where 0 < β < 1 is called the demotion parameter. The vector
which is correct on all examples of the training set is then used for predicting the labels on the test set.
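The promotion/demotion scheme just described can be sketched over Boolean features as follows (a minimal sketch; α = 2 and β = 0.5 are illustrative parameter choices):

```python
def winnow_train(X, y, alpha=2.0, beta=0.5, theta=1.0, epochs=100):
    """Winnow over Boolean features: predict 1 when sum_i w_i * x_i > theta.
    A mistake on a positive example promotes the active weights by alpha;
    a mistake on a negative example demotes them by beta."""
    w = [1.0] * len(X[0])                 # weights initialized to 1, as described
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            pred = 1 if sum(wi * v for wi, v in zip(w, xi)) > theta else 0
            if pred == 0 and yi == 1:     # weights too low: promote
                w = [wi * alpha if v else wi for wi, v in zip(w, xi)]
                mistakes += 1
            elif pred == 1 and yi == 0:   # weights too high: demote
                w = [wi * beta if v else wi for wi, v in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:                 # correct on the whole training set
            break
    return w
```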
A detailed description of all these algorithms can be found in [7]. It must also be mentioned that for our
experiments we used the freely available source code for these algorithms provided by [15], before implementing
the most accurate algorithm in the software support tool for the tutors.

4 EXPERIMENTS AND RESULTS
The experiments took place in two distinct phases. During the first phase (training phase), every algorithm
was trained using the data collected from the academic year 2000-1. The training phase was divided into 9
consecutive steps. The 1st step included the demographic data and the resulting class (pass or fail); the 2nd step
included the demographic data along with the data from the first face-to-face meeting and the resulting
class; the 3rd step included the data used for the 2nd step plus the data from the first written assignment, and so
on, until the 9th step, which included all attributes described in Table 1.
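The cumulative nine-step scheme can be sketched as follows (the names follow the row labels of Table 2; the helper itself is illustrative):

```python
# Order in which the attribute groups become available across the nine steps.
STEP_GROUPS = [
    "DEM_DAT", "F_MEET1", "W_ASS-1", "F_MEET2", "W_ASS-2",
    "F_MEET3", "W_ASS-3", "F_MEET4", "W_ASS-4",
]

def attributes_for_step(step):
    """Attribute groups available at a given 1-based training step:
    each step adds one more group to everything seen so far."""
    return STEP_GROUPS[:step]
```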
Subsequently, ten groups of data of the new academic year (2001-2) were collected from 10 tutors. Each of
these ten groups was used to measure the prediction accuracy within these groups (testing phase). Similarly,
the testing phase also took place in 9 steps. During the 1st step, the demographic data of the new academic year
were used to predict the class (pass or fail) of each student. This step was repeated 10 times (once for every
tutor's data) and the average prediction accuracy is denoted in the row labeled "DEM_DAT" in Table 2 for each
algorithm. During the 2nd step, these demographic data along with the data from the first face-to-face meeting
were used to predict the class of each student. This step was also repeated 10 times and the average
prediction accuracy is denoted in the row labeled "F_MEET1" in Table 2 for each algorithm. The remaining
steps use data of the new academic year in the same way. These steps are also repeated 10 times and the average
prediction accuracies are denoted in the rows labeled "W_ASS-1", "F_MEET2", "W_ASS-2", "F_MEET3",
"W_ASS-3", "F_MEET4" and "W_ASS-4" respectively in Table 2.

            Naive Bayes   3-NN      RIPPER    C4.5      WINNOW
DEM_DAT     62.59%        63.13%    62.97%    61.40%    54.77%
F_MEET1     61.95%        62.67%    63.17%    61.00%    54.54%
W_ASS-1     66.59%        63.48%    65.27%    64.02%    62.61%
F_MEET2     72.43%        67.00%    71.75%    73.57%    62.78%
W_ASS-2     77.18%        73.80%    78.55%    78.73%    69.85%
F_MEET3     78.43%        77.75%    78.92%    77.59%    70.31%
W_ASS-3     81.13%        79.82%    79.06%    77.53%    76.51%
F_MEET4     81.17%        80.38%    79.99%    77.39%    76.37%
W_ASS-4     83.00%        83.03%    81.34%    78.09%    77.88%
AVERAGE     73.83%        72.34%    73.44%    72.15%    67.29%
Table 2. Accuracy of the algorithms in each testing step
In order to rank the representative algorithms used in this study, the prediction-accuracy criterion was used.
In Table 3, each cell compares the algorithm of the column with the algorithm of the row in terms of statistically
significant wins or losses. The middle figure shows the number of steps in which there is no statistically
significant difference between the algorithms, while the left and right figures give the number of wins and losses
respectively. We used a statistical test (t-test) to compare these algorithms; the resulting differences between
algorithms were considered statistically significant when p < 0.001 [13]. For example, as shown in Table 3, the
Naive Bayes algorithm had statistically significant wins over the 3-NN algorithm in 3 steps, no statistically
significant difference in 6 steps and no statistically significant losses, while the RIPPER algorithm had
statistically significant wins over the 3-NN algorithm in 2 steps, no statistically significant difference in 7 steps
and no statistically significant losses, and so on.
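The paired comparison can be sketched by computing the t statistic over the per-group accuracy differences of two algorithms (a sketch of the statistic only; the study's p < 0.001 decision additionally requires the critical value for n - 1 degrees of freedom, omitted here):

```python
import math
import statistics

def paired_t_statistic(acc_a, acc_b):
    """t statistic for paired accuracy samples of two algorithms
    measured on the same test groups."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)          # sample standard deviation
    return mean / (sd / math.sqrt(n))
```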

Algorithm      3-NN     RIPPER   C4.5     WINNOW
Naive Bayes    0/6/3    0/9/0    0/5/4    0/0/9
3-NN             -      2/7/0    2/4/3    0/1/8
RIPPER           -        -      1/5/3    0/1/8
C4.5             -        -        -      0/4/5
Table 3. Comparing the algorithms
The comparison of the algorithms showed that the Naive Bayes algorithm and RIPPER had the best
accuracy. An advantage of Naive Bayes over RIPPER, besides its better average accuracy, is the short
computational time required. Another advantage of the Naive Bayes classifier is that it can use data with missing
values as inputs, whereas RIPPER cannot. From this point of view, Naive Bayes is the most appropriate
learning algorithm for the construction of a software support tool. A prototype version of this software
support tool has already been constructed and is in use by the tutors (Figure 1).


a) An example at the 1st step (beginning of the academic year)
b) An example at the 5th step (middle of the academic year)
Figure 1. The software support tool for the tutors, which implements the Naive Bayes algorithm
Another interesting issue is the number of training examples that must be available in order for a learning
algorithm to predict the students' performance with satisfying accuracy. For this reason, we trained the Naive
Bayes algorithm with different subsets of our training set and evaluated its performance on the ten groups of data
of the new academic year (2001-2). For a given number of training examples, we randomly selected ten subsets of
the same size, and the average prediction accuracy for that number of training examples is presented in
Figure 2. From Figure 2 we conclude that even small data sets (e.g. 30 examples, which corresponds to the
number of students in a tutor's class) provide sufficient accuracy. However, it seems that at least 60
examples are needed for more satisfying predictive accuracy.


Figure 2. The accuracy of the Naive Bayes algorithm with different subsets of the initial training set
Apart from identifying the most suitable algorithm for this domain, in this work we also tried to find the
attributes that most influence the induction of the algorithms. This reduces the information that needs to be stored
and speeds up the induction. In the next section, we describe how each attribute influences the final prediction
and rank the influence of each attribute according to a statistical measure, the information gain [10], in an
attempt to show how much each attribute influences the induction.
5 ATTRIBUTE SELECTION
The training data set from this course showed that the ratio of men who passed the exams vs. men who failed
is 48-52%, while for women this ratio drops to 39-61%. Moreover, it should be noted that the percentage of
students below 32 years old who pass the exams is 46%, while the corresponding number for older students is
44%. Another interesting fact relates student performance to marital status: a married student has a 51%
probability of passing the module, while a single student has only a 41% probability. A similar situation holds
for the existence of children: a student with children has a 52% probability of passing the module, while a
student without children has only 43%. This is probably due to the fact that family obligations are known and
have been taken into consideration prior to the commencement of the studies.
However, as expected, a strong correlation exists between student performance and the existence of previous
education in the field of informatics. The ratio of students who have previous education in the field of
informatics and pass the exams vs. those who fail is 51-49%, while for the remaining students this ratio drops to
28-72%. A similar correlation exists for involvement in professional activities requiring the use of a computer:
students who use a computer in their work have a 52% probability of passing the module, while the remaining
students have only 32%.
Until now, we have described how each demographic attribute influences the prediction based on our data
set. In the sequel, in an attempt to show how much each attribute influences the induction, we rank the
influence of each one according to a statistical measure, the information gain. Information gain determines how
well an attribute separates the training data according to the target concept. It is based on a measure commonly
used in information theory, known as entropy [10].
Defined over a collection of training data S with a Boolean target concept, the entropy is defined as:

    Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)

where p(+) is the proportion of positive examples in S and p(-) the proportion of negative examples.
Note that if the numbers of positive and negative examples in the set are equal, then the entropy equals 1. If all
the examples in the set are of the same class, then the entropy of the set is 0. If the set being measured contains
unequal numbers of positive and negative examples, then the entropy lies between 0 and 1.
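The entropy formula above can be sketched directly over the positive-example proportion (a minimal illustrative sketch):

```python
import math

def entropy(p_pos):
    """Entropy of a Boolean-labelled set whose proportion of
    positive examples is p_pos."""
    if p_pos in (0.0, 1.0):        # a pure set carries no uncertainty
        return 0.0
    p_neg = 1.0 - p_pos
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
```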
Information gain is the expected reduction in entropy when partitioning the examples of a set S according to
an attribute A. It is defined as:

    Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values of attribute A and S_v is the subset of examples in S which
have the value v for attribute A. For a Boolean attribute A, Values(A) would simply contain two values. The first
term in the equation is the entropy of the original data set. The second term describes the entropy of the data set
after it is partitioned using the attribute A.
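Gain(S, A) can be sketched as follows (a minimal sketch; representing each example as a pair of an attribute dictionary and a Boolean label is an illustrative choice):

```python
import math
from collections import defaultdict

def set_entropy(labels):
    """Entropy of a list of Boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(examples, attr):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    partitions = defaultdict(list)
    for attrs, label in examples:
        partitions[attrs[attr]].append(label)
    remainder = sum(
        len(sub) / len(examples) * set_entropy(sub)
        for sub in partitions.values()
    )
    return set_entropy([label for _, label in examples]) - remainder
```

A perfectly separating attribute recovers the full entropy of the set, while an uninformative one yields zero gain.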
The average information gain and the variance of each attribute, according to the training data and the ten
groups of data collected from the 10 tutors for the academic year 2001-2, are presented in Table 4. The larger the
value of the information gain, the more the attribute influences the induction.

Attribute                        Average Information Gain
W_ASS-3                          0.385 ± 0.018
W_ASS-4                          0.382 ± 0.014
W_ASS-2                          0.261 ± 0.011
F_MEET4                          0.217 ± 0.016
F_MEET3                          0.158 ± 0.010
W_ASS-1                          0.138 ± 0.010
F_MEET2                          0.082 ± 0.009
Job associated with computers    0.061 ± 0.008
Computer literacy                0.030 ± 0.004
domestic                         0.008 ± 0.002
children                         0.007 ± 0.003
sex                              0.005 ± 0.003
occupation                       0.002 ± 0.002
F_MEET1                          0.001 ± 0.001
age                              0.000 ± 0.001
Table 4. Average Information Gain of each attribute
Thus, the demographic attributes that most influence the induction are 'computer literacy' and 'job associated
with computers'. In addition, it was found that the 1st face-to-face meeting does not have a large information
gain. The reason is that almost all students come to the first meeting, thus making the information offered by this
attribute minimal and perhaps confusing.
Running the experiment again, this time without the "domestic", "children", "sex", "occupation", "age" and
"F_MEET1" attributes, the accuracy of the algorithms improved (Table 5).


                   Naive Bayes   3-NN        RIPPER      C4.5        WINNOW
DEMOGR/F_MEET1     64.24% (o)    62.65%      63.72%      63.98% (o)  56.69% (o)
W_ASS-1            66.63%        65.43% (o)  66.33%      65.75%      63.34%
F_MEET2            74.24% (o)    71.21% (o)  73.15% (o)  73.47%      63.68%
W_ASS-2            77.61%        77.04% (o)  78.46%      79.19%      72.02% (o)
F_MEET3            78.78%        78.77%      78.38%      78.96%      71.91% (o)
W_ASS-3            81.84%        83.28% (o)  79.59%      78.46%      76.83%
F_MEET4            81.49%        82.82% (o)  80.09%      78.00%      76.34%
W_ASS-4            83.23%        84.23%      81.96%      80.35% (o)  79.17% (o)
AVERAGE            74.70%        74.23%      73.93%      73.57%      68.52%
Table 5. Accuracy of the algorithms in each testing step after attribute selection
The results show that there is almost always an improvement in the accuracy of the tested algorithms (even
though it is not always statistically significant). In addition, there is no statistically significant decrease in any
testing step. The mark (o) shows that in the corresponding testing step there is a statistically significant increase
in the accuracy of the specific algorithm with the application of attribute selection (according to a t-test with
p < 0.05).
The Naive Bayes algorithm remains the most accurate, although the 3-NN algorithm benefited more from the
attribute selection process. In Table 6 we present the comparison of the algorithms after the attribute selection
process. Each cell in Table 6 compares the algorithm of the column with the algorithm of the row in terms of
statistically significant wins or losses (according to a t-test with p < 0.001).

Algorithm      3-NN     RIPPER   C4.5     WINNOW
Naive Bayes    0/6/2    0/8/0    0/5/3    0/0/8
3-NN             -      3/2/3    4/1/3    0/1/7
RIPPER           -        -      0/8/0    0/0/8
C4.5             -        -        -      0/4/4
Table 6. Comparing the algorithms
To sum up, as far as the demographic attributes are concerned, we conclude that the attributes that most
influence the induction were (as expected) the existence of previous education in the field of informatics, as
well as the use of computers in the students' professional activities. In addition, it was found that participation
in the 1st face-to-face meeting did not improve the accuracy of the compared algorithms.

6 CONCLUSION
With the help of machine learning techniques, the tutors of a distance learning environment are in a position
to know, from the beginning of the module and based only on curriculum-based data of the students, which of
them will pass the module, with a precision that reaches 64% in the initial forecasts and exceeds 83% before the
final examinations. For this reason, a prototype version of a software support tool implementing the Naive Bayes
algorithm has already been constructed. Tracking student progress is a time-consuming job that can be handled
automatically by such a tool. While the tutors will still have an essential role in monitoring and evaluating
student progress, the tool can compile the data required for reasonable and efficient monitoring.
It was also shown that the average accuracy of the learning algorithms can be improved by an attribute
selection filter. It was found that the students' sex, age, marital status, number of children and occupation do not
add accuracy to the used algorithms; thus, there is no need to collect such information. On the contrary, the
results of this study strongly correlate student performance with the existence of previous education in the field
of informatics and with working with computers, which was of course anticipated.
After the middle of the academic period (the second written assignment), the accuracy of the algorithms is
very satisfying (greater than 77.5%). For this reason, in future work we will try to use regression methods [15]
after this training step in order to predict the students' marks; with the present work, we can only predict whether
a student will pass the module or not.
The data set used is from the module "Introduction to Informatics", but most of the conclusions are wide-ranging
and of interest for the majority of the Programs of Study of the Hellenic Open University and, more
generally, for distance education programs as a whole.

REFERENCES
[1] Aha, D. (1997), Lazy Learning, Dordrecht: Kluwer Academic Publishers.
[2] Cohen, W. (1995), “Fast Effective Rule Induction”, Proceeding of International Conference on Machine
Learning 1995, pp. 115-123.
[3] Domingos, P. and Pazzani, M. (1997), "On the optimality of the simple Bayesian classifier under zero-one
loss", Machine Learning, Vol. 29, pp. 103-130.
[4] Fayyad, U. and Irani, K. (1993), “Multi-interval discretization of continuous-valued attributes for
classification learning”, Proceedings of the Thirteenth International Joint Conference on Artificial
Intelligence, pp. 1022-1027.
[5] Furnkranz, J. (1999), “Separate-and-Conquer Rule Learning”, Artificial Intelligence Review, Vol. 13, pp. 3-
54.
[6] Jensen, F. (1996), An Introduction to Bayesian Networks, Springer.
[7] Kotsiantis, S., Zaharakis, I. and Pintelas, P. (2002), Supervised Machine Learning, TR-02-02, Department
of Mathematics, University of Patras, Hellas, pp 28.
[8] Kotsiantis, S., Pierrakeas, C. and Pintelas P. (2002), Efficiency of Machine Learning Techniques in
Predicting Students’ Performance in Distance Learning Systems, TR-02-03, Department of Mathematics,
University of Patras, Hellas, pp 42.
[9] Littlestone, N. and Warmuth M. (1994), “The weighted majority algorithm”, Information and Computation,
Vol. 108(2), pp. 212–261.
[10] Mitchell, T. (1997), Machine Learning, McGraw Hill.
[11] Murthy, S. (1998), “Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey”,
Data Mining and Knowledge Discovery, Vol. 2, pp. 345–389.
[12] Quinlan, J. R. (1993), C4.5: Programs for machine learning, Morgan Kaufmann, San Francisco
[13] Salzberg, S. (1997), “On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach”, Data
Mining and Knowledge Discovery, Vol. 1, pp. 317–328.
[14] Wettschereck, D., Aha, D. and Mohri T. (1997), “A Review and Empirical Evaluation of Feature Weighting
Methods for a Class of Lazy Learning Algorithms”, Artificial Intelligence Review, Vol. 10, pp. 1–37.
[15] Witten, I. and Frank E. (2000), Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, San Mateo.
[16] Xenos, M., Pierrakeas C. and Pintelas P. (2002), “A survey on student dropout rates and dropout causes
concerning the students in the course of informatics of the Hellenic Open University”, Computers &
Education, Vol. 39, pp. 361–377.


Summary. The ability to predict a student's performance is particularly useful in university-level distance
education. The first goal of this work was to examine whether learning algorithms can be used to predict student
performance, thereby constituting a useful tool for tutors. Experiments were carried out with five representative
algorithms, which were trained using data from the informatics course of the Hellenic Open University, and the
results are very encouraging. A second goal of this study was to find the student characteristics that most
influence the decision process of the algorithms. It was found that a strong correlation exists between student
performance and the existence of previous education in the field of informatics, as well as the use of computers
in the student's professional activities. Finally, a prototype support tool for tutors was constructed, implementing
the Naive Bayes algorithm, which on the basis of the experiments proved to be the most suitable.