RECENT ADVANCES IN MECHANICS AND RELATED FIELDS

UNIVERSITY OF PATRAS 2003

in Honour of Professor Constantine L. Goudas

EFFICIENCY OF MACHINE LEARNING TECHNIQUES IN PREDICTING STUDENTS’ PERFORMANCE IN DISTANCE LEARNING SYSTEMS

S. B. Kotsiantis, C. J. Pierrakeas, I. D. Zaharakis, P. E. Pintelas

Educational Software Development Laboratory

Department of Mathematics

University of Patras

Greece

e-mail: {sotos, chrpie, john, pintelas}@math.upatras.gr

Keywords: supervised machine learning algorithms, prediction of student performance, distance learning.

Abstract. The ability to predict a student’s performance is very important in university-level distance learning environments. The aim of the research reported here is to investigate the efficiency of machine learning techniques in such an environment. To this end, a number of experiments have been conducted using five representative learning algorithms, which were trained using data sets provided by the “informatics” course of the Hellenic Open University. It was found that learning algorithms could enable tutors to predict student performance with satisfactory accuracy long before the final examination. A second aim of the study was to identify the student attributes, if any, that most influence the induction of the learning algorithms. It was found that there exist some obvious and some less obvious attributes that demonstrate a strong correlation with student performance. Finally, a prototype version of a software support tool for tutors has been constructed implementing the Naive Bayes algorithm, which proved to be the most appropriate among the tested learning algorithms.

1 INTRODUCTION

The tutors in a distance-learning course must continuously support their students regardless of the distance between them. A tool that could automatically recognize the level of the students would enable the tutors to personalize the education more effectively. While the tutors would still have the essential role in monitoring and evaluating student progress, the tool could compile the data required for reasonable and efficient monitoring.

This paper examines the use of Machine Learning (ML) techniques to predict students’ performance in a distance learning system. Even though ML techniques have been successfully applied in numerous domains such as pattern recognition, image recognition, medical diagnosis, commodity trading, computer games and various control applications, to the best of our knowledge there has been no previous attempt in the presented domain [10], [15]. Thus, we use a representative algorithm for each of the most common machine learning techniques, namely Decision Trees [11], Bayesian Nets [6], Perceptron-based Learning [9], Instance-Based Learning [1] and Rule-learning [5], so as to investigate the efficiency of ML techniques in such an environment. Indeed, our results show that learning algorithms can predict student performance with satisfactory accuracy long before the final examination.

In this work we also try to find the student characteristics that most influence the induction of the algorithms. This reduces the information that needs to be stored and speeds up the induction. For the purpose of our study, the “informatics” course of the Hellenic Open University (HOU) provided the data set. A significant conclusion of this work was that the students’ sex, age, marital status, number of children and occupation do not contribute to the accuracy of the prediction algorithms.

The following section describes the data set of our study. Some elementary Machine Learning definitions and a more detailed description of the techniques and algorithms used are given in section 3. Section 4 presents the experimental results for the five compared algorithms. Section 5 discusses the attribute selection methodology used to find the attributes that most influence the induction, and whether it improves the accuracy of the tested algorithms. Finally, section 6 discusses the conclusions and some future research directions.

2 HELLENIC OPEN UNIVERSITY DISTANCE LEARNING METHODOLOGY AND DATA DESCRIPTION

For the purpose of our study, the “informatics” course of HOU provided the training set. A total of 354 examples (student records) were collected from the module “Introduction to Informatics” (INF10) [16]. In the INF10 module, during an academic year students have to hand in 4 written assignments, optionally participate in 4 face-to-face meetings with their tutor and sit the final examinations after an 11-month period. A student must submit at least three of the four assignments. The total mark gathered from the handed-in written assignments should be greater than or equal to 20 for a student to qualify to sit the final examinations of the module.
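The qualification rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the use of `None` for a non-submitted assignment are assumptions introduced here, not part of the HOU regulations or of the original study.

```python
def qualifies_for_final_exam(assignment_marks):
    """Return True if a student may sit the final examination.

    assignment_marks: the four written-assignment marks, with None
    (a convention assumed for this sketch) for a non-submitted assignment.
    """
    submitted = [m for m in assignment_marks if m is not None]
    # At least three of the four assignments must be handed in, and the
    # total mark of the handed-in assignments must be at least 20.
    return len(submitted) >= 3 and sum(submitted) >= 20
```

For example, a student with marks 7, 8 and 6 and one missing assignment would qualify under this reading of the rule (three submissions, total 21).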

Table 1 presents the attributes of our data set along with the values of each attribute. The attributes were divided into two groups: the “Demographic attributes” group and the “Performance attributes” group.

Student’s demographic attributes
Sex                             male, female
Age                             age<32, age≥32
Marital status                  single, married, divorced, widowed
Number of children              none, one, two or more
Occupation                      no, part-time, full-time
Computer literacy               no, yes
Job associated with computers   no, junior-user, senior-user

Student’s performance attributes
1st face-to-face meeting        absent, present
1st written assignment          mark<3, 3≤mark≤6, mark>6
2nd face-to-face meeting        absent, present
2nd written assignment          mark<3, 3≤mark≤6, mark>6
3rd face-to-face meeting        absent, present
3rd written assignment          mark<3, 3≤mark≤6, mark>6
4th face-to-face meeting        absent, present
4th written assignment          mark<3, 3≤mark≤6, mark>6

Table 1. The attributes used and their values

The “Demographic attributes” group contains attributes collected from the Student’s Registry of the HOU concerning students’ sex, age, marital status, number of children and occupation. In addition to the above attributes, previous post-high-school education in the field of informatics (the “computer literacy” attribute) and the association between students’ jobs and computers were also taken into account.

The “Performance attributes” group contains attributes collected from tutors’ records concerning students’ marks on the written assignments and their presence or absence at face-to-face meetings. Marks on the written assignments were categorized into three groups (mark<3, 3≤mark≤6 and mark>6), where an assignment that was not submitted counts as a mark of 0. In this work, we used entropy-based discretization [4] to discretize the marks, and the results were better than those of our previous manual discretization [8], in which marks on the written assignments had been categorized into five groups: “no” meant no submission of the specific assignment, “fail” meant a mark less than 5, “good” meant a mark between 5 and 6.5, “very good” meant a mark between 6.5 and 8.5 and “excellent” meant a mark higher than 8.5.
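As an illustration of the three-group categorization, the following Python sketch maps a raw mark to its group. It only applies the cut points reported above (3 and 6); it does not reproduce the entropy-based procedure of [4] that produced them. The function name, the ASCII group labels and the use of `None` for a missing submission are assumptions of this sketch.

```python
def discretize_mark(mark):
    """Map a written-assignment mark to one of the three groups.

    A non-submitted assignment (None, by the convention of this sketch)
    is treated as a mark of 0, as described in the text.
    """
    if mark is None:
        mark = 0
    if mark < 3:
        return "mark<3"
    if mark <= 6:
        return "3<=mark<=6"
    return "mark>6"
```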

Finally, the class to be predicted represents the result of the final examination and takes two values. “Fail” represents students with poor performance. “Pass” represents students who completed the INF10 module with a mark of at least 5 in the final test.

To examine the use of learning techniques in this domain, we apply five of the most common machine learning techniques, namely Decision Trees [11], Perceptron-based Learning [9], Bayesian Nets [6], Instance-Based Learning [1] and Rule-learning [5]. In the next section we give some elementary Machine Learning definitions and briefly describe these supervised machine-learning techniques. A detailed description can be found in [7].

3 MACHINE LEARNING ISSUES

Inductive machine learning is the process of learning, from examples, a hypothesis or classifier that can be used to generalize to new examples. In a two-class problem, a classifier can make two types of errors on new examples: it can misclassify positive instances as negative, and negative instances as positive. The proportion of correct predictions made by the classifier is its prediction accuracy on the specific data set. Below, we briefly describe the supervised machine learning techniques used.

A recent overview of existing work in decision trees is provided by [11]. Decision trees are trees that classify examples by sorting them based on attribute values. Each node in a decision tree represents an attribute of the example to be classified, and each branch represents a value that the attribute can take. Examples are classified starting at the root node and are sorted down the tree according to their attribute values. The attribute that best divides the training data becomes the root node of the tree. The same process is then repeated on each partition of the divided data, creating subtrees until the training data are divided into subsets of the same class. However, a decision tree h is said to overfit the training data if there exists another hypothesis h΄ that has a larger error than h when tested on the training data, but a smaller error than h when tested on the entire data set. Decision tree algorithms commonly use one of two approaches to avoid overfitting: 1) stop the training algorithm before it reaches a point at which it perfectly fits the training data, or 2) prune the induced decision tree.

Another interesting line of work in the machine-learning domain falls under the heading of “perceptrons” [9]. The perceptron structure can deal with two-class problems. In detail, it classifies a new instance x into class 2 if

$$\sum_i x_i w_i > \theta$$

and into class 1 otherwise. It accepts instances one at a time and updates the weights $w_i$ as necessary. It initializes its weights $w_i$ and θ and then accepts a new instance (x, y), applying the threshold rule to compute the predicted class y΄. If the predicted class is correct (y΄ = y), the perceptron does nothing. However, if the predicted class is incorrect, the perceptron updates its weights. The most common way to use the perceptron algorithm for learning from a batch of training instances is to run the algorithm repeatedly through the training set until it finds a prediction vector that is correct on the whole training set. This prediction rule is then used for predicting the labels on the test set.
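The mistake-driven training loop described above can be sketched as follows. The text does not specify the update rule, so this sketch assumes the classic additive perceptron update; the learning rate `lr`, the zero initialization, the epoch limit and the 0/1 encoding of the two classes (class 1 as 0, class 2 as 1) are likewise assumptions made for illustration.

```python
def predict(weights, theta, x):
    # Threshold rule: class 1 (here: label 1) if sum_i x_i * w_i > theta.
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > theta else 0


def train_perceptron(data, n_features, lr=1.0, max_epochs=100):
    """data: list of (x, y) pairs with x a sequence of numbers and y in {0, 1}."""
    weights = [0.0] * n_features
    theta = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            y_hat = predict(weights, theta, x)
            if y_hat != y:
                # Assumed additive update: move the weights towards x when
                # the prediction was too low, away from x when too high.
                delta = lr * (y - y_hat)
                weights = [w + delta * xi for w, xi in zip(weights, x)]
                theta -= delta
                mistakes += 1
        if mistakes == 0:
            # The current vector is correct on the whole training set.
            break
    return weights, theta
```

On linearly separable data such as the Boolean AND function, repeated passes of this kind end with a vector that classifies every training instance correctly; on non-separable data the loop simply stops at `max_epochs`.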

An excellent book on Bayesian networks is [6]. A Bayesian network is a graphical model of the probabilistic relationships among a set of attributes. The Bayesian network structure S is a directed acyclic graph (DAG), and the nodes in S are in one-to-one correspondence with the attributes. The arcs represent causal influences among the variables, while the absence of possible arcs in S encodes conditional independencies. Moreover, an attribute (node) is conditionally independent of its non-descendants given its parents. Using a suitable training method, one can induce the structure of the Bayesian network from a given training set. Despite the remarkable power of Bayesian networks, they have an inherent limitation: the computational difficulty of exploring a previously unknown network. Given a problem described by n attributes, the number of possible structure hypotheses is more than exponential in n. When the structure is unknown but the data can be assumed complete, the most common approach is to introduce a scoring function (or score) that evaluates the “fitness” of networks with respect to the training data, and then to search for the best network according to this score. The classifier based on this network and on the given set of attributes $X_1, X_2, \ldots, X_n$ returns the label c that maximizes the posterior probability $p(c \mid X_1, X_2, \ldots, X_n)$.

Instance-based learning algorithms belong to the category of lazy-learning algorithms [10], as they defer the induction or generalization process until classification is performed. One of the most straightforward instance-based learning algorithms is the nearest neighbour algorithm [1]. k-Nearest Neighbour (kNN) is based on the principle that the examples within a data set will generally exist in close proximity to other examples that have similar properties. If the examples are tagged with a classification label, then the label of an unclassified example can be determined by observing the class of its nearest neighbours. The absolute position of the examples within this space is not as significant as the relative distance between examples. This relative distance is determined using a distance metric. Ideally, the distance metric should minimize the distance between two similarly classified examples, while maximizing the distance between examples of different classes.

In rule induction systems, a decision rule is defined as a sequence of Boolean clauses linked by logical AND operators that together imply membership in a particular class [5]. The general goal is to construct the smallest rule set that is consistent with the training data. A large number of learned rules is usually a sign that the learning algorithm is trying to “remember” the training set instead of discovering the assumptions that govern it. During classification, the left-hand sides of the rules are applied sequentially until one of them evaluates to true, and then the class label implied by the right-hand side of that rule is offered as the class prediction.

For the purpose of the present study, a representative algorithm for each described machine learning

technique was selected.

3.1 Brief description of the used machine learning algorithms

The widely used C4.5 algorithm [12] was the representative of the decision trees in our study. At each level of the partitioning process, C4.5 uses a statistical property known as information gain to determine which attribute best divides the training examples. To avoid overfitting, C4.5 converts the decision tree into a set of rules (one for each path from the root node to a leaf) and then generalizes each rule by removing any of its conditions whose removal improves the estimated accuracy of the rule.

The Naive Bayes algorithm was the representative of the Bayesian networks [3]. It is a simple learning algorithm that captures the assumption that every attribute is independent of the rest of the attributes, given the state of the class attribute.
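Under this independence assumption, the posterior of a class is proportional to the class prior multiplied by the per-attribute conditional probabilities. The Python sketch below illustrates this for categorical attributes. It is not the implementation used in the study: the function names, the add-one (Laplace) smoothing and the convention of skipping attributes whose value is `None` are assumptions introduced for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (attribute_dict, label) pairs with categorical values."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (attribute, label) -> value counts
    attr_values = defaultdict(set)        # attribute -> values seen in training
    for attrs, label in examples:
        for attr, value in attrs.items():
            if value is None:             # missing value: contributes nothing
                continue
            value_counts[(attr, label)][value] += 1
            attr_values[attr].add(value)
    return class_counts, value_counts, attr_values


def predict_naive_bayes(model, attrs):
    class_counts, value_counts, attr_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)             # log prior
        for attr, value in attrs.items():
            if value is None or attr not in attr_values:
                continue                              # skip missing/unknown attributes
            counts = value_counts[(attr, label)]
            n = sum(counts.values())
            k = len(attr_values[attr])
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            score += math.log((counts[value] + 1) / (n + k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because a missing attribute is simply left out of the product, a classifier of this kind can still produce a prediction when some attribute values are unavailable.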

We also used the 3-NN algorithm with Euclidean distance as the distance metric, which combines robustness to noise with less classification time than kNN with a larger k [14]. Attributes with missing values are given imputed values so that comparisons can be made between every pair of examples on all attributes.
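A minimal sketch of 3-NN classification with Euclidean distance is given below. It assumes that the attributes have already been encoded as numbers and that missing values have already been imputed; the function names and the majority-vote tie handling (whatever `Counter.most_common` returns first) are choices made for this illustration only.

```python
import math
from collections import Counter


def euclidean(a, b):
    # Euclidean distance between two equally long numeric vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train, query, k=3):
    """train: list of (vector, label) pairs; query: a numeric vector."""
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```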

The RIPPER [2] algorithm was the representative of the rule-learning techniques because it is one of the most commonly used methods that produce classification rules. RIPPER forms rules through a process of repeated growing and pruning. During the growing phase, the rules are made more restrictive in order to fit the training data as closely as possible. During the pruning phase, the rules are made less restrictive in order to avoid overfitting, which can cause poor performance on unseen examples. The grow heuristic used in RIPPER is the information gain function.

Finally, WINNOW is the representative of perceptron-based algorithms in our study [9]. It classifies a new instance x into the second class if

$$\sum_i x_i w_i > \theta$$

and into the first class otherwise. It initializes its weights $w_i$ and θ to 1 and then accepts a new instance (x, y), applying the threshold rule to compute the predicted class y΄. If y΄ = 0 and y = 1, then the weights are too low; so, for each feature such that $x_i = 1$, it sets $w_i = w_i \cdot \alpha$, where α is a number greater than 1, called the promotion parameter. If y΄ = 1 and y = 0, then the weights are too high; so, for each feature such that $x_i = 1$, it decreases the corresponding weight by setting $w_i = w_i \cdot \beta$, where 0 < β < 1 is called the demotion parameter. The weight vector that is correct on all examples of the training set is then used for predicting the labels on the test set.
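The multiplicative update of WINNOW can be sketched as follows for binary features. The initialization of the weights and of θ to 1 follows the description above; the particular values α = 2 and β = 0.5, the epoch limit and the function names are assumptions chosen for illustration.

```python
def winnow_predict(weights, theta, x):
    # Threshold rule: predict 1 (second class) if sum_i x_i * w_i > theta.
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > theta else 0


def train_winnow(data, n_features, alpha=2.0, beta=0.5, max_epochs=100):
    """data: list of (x, y) pairs with binary features x_i and y in {0, 1}."""
    weights = [1.0] * n_features
    theta = 1.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            y_hat = winnow_predict(weights, theta, x)
            if y_hat == 0 and y == 1:
                # Weights too low: promote the weights of the active features.
                weights = [w * alpha if xi == 1 else w for w, xi in zip(weights, x)]
                mistakes += 1
            elif y_hat == 1 and y == 0:
                # Weights too high: demote the weights of the active features.
                weights = [w * beta if xi == 1 else w for w, xi in zip(weights, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return weights, theta
```

Unlike the additive perceptron update, only the weights of features that are active in the misclassified instance change, and they change by a constant factor rather than a constant amount.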

Detailed descriptions of all these algorithms can be found in [7]. We used the freely available source code of [15] for these algorithms in our experiments, before implementing the algorithm with the best accuracy in the software support tool for the tutors.

4 EXPERIMENTS AND RESULTS

The experiments took place in two distinct phases. During the first phase (training phase), every algorithm was trained using the data collected from the academic year 2000-1. The training phase was divided into 9 consecutive steps. The 1st step included the demographic data and the resulting class (pass or fail); the 2nd step included the demographic data along with the data from the first face-to-face meeting and the resulting class. The 3rd step included the data used for the 2nd step plus the data from the first written assignment, and so on until the 9th step, which included all attributes described in Table 1.
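The nine cumulative steps can be expressed compactly in code. The sketch below only builds the attribute list used at each step; the attribute identifiers are hypothetical names chosen here (the performance names mirror the row labels used for the results), not the field names of the original data set.

```python
# Hypothetical attribute identifiers for the demographic group of Table 1.
DEMOGRAPHIC = ["sex", "age", "marital_status", "children",
               "occupation", "computer_literacy", "job_with_computers"]
# Performance attributes in the order in which they become available.
PERFORMANCE = ["F_MEET1", "W_ASS-1", "F_MEET2", "W_ASS-2",
               "F_MEET3", "W_ASS-3", "F_MEET4", "W_ASS-4"]


def cumulative_steps(demographic, performance):
    """Return one attribute list per step: step 1 uses only the demographic
    attributes, and each later step adds the next performance attribute."""
    steps = [list(demographic)]
    for i in range(1, len(performance) + 1):
        steps.append(list(demographic) + list(performance[:i]))
    return steps
```

Training a classifier once per returned attribute list reproduces the structure of the 9-step protocol: the last list contains all 15 attributes.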

Subsequently, ten groups of data for the new academic year (2001-2) were collected from 10 tutors. Each of these ten groups was used to measure the prediction accuracy within that group (testing phase). The testing phase also took place in 9 steps. During the 1st step, the demographic data of the new academic year were used to predict the class (pass or fail) of each student. This step was repeated 10 times (once for every tutor’s data) and the average prediction accuracy for each algorithm is given in the row labeled “DEM_DAT” in Table 2. During the 2nd step, these demographic data along with the data from the first face-to-face meeting were used to predict the class of each student. This step was also repeated 10 times and the average prediction accuracy for each algorithm is given in the row labeled “F_MEET1” in Table 2. The remaining steps use data of the new academic year in the same way. These steps were also repeated 10 times and the average prediction accuracy is given in the rows labeled “W_ASS-1”, “F_MEET2”, “W_ASS-2”, “F_MEET3”, “W_ASS-3”, “F_MEET4” and “W_ASS-4”, respectively, in Table 2.

Testing step   Naive Bayes   3-NN     RIPPER   C4.5     WINNOW
DEM_DAT        62.59%        63.13%   62.97%   61.40%   54.77%
F_MEET1        61.95%        62.67%   63.17%   61.00%   54.54%
W_ASS-1        66.59%        63.48%   65.27%   64.02%   62.61%
F_MEET2        72.43%        67.00%   71.75%   73.57%   62.78%
W_ASS-2        77.18%        73.80%   78.55%   78.73%   69.85%
F_MEET3        78.43%        77.75%   78.92%   77.59%   70.31%
W_ASS-3        81.13%        79.82%   79.06%   77.53%   76.51%
F_MEET4        81.17%        80.38%   79.99%   77.39%   76.37%
W_ASS-4        83.00%        83.03%   81.34%   78.09%   77.88%
AVERAGE        73.83%        72.34%   73.44%   72.15%   67.29%

Table 2. Accuracy of the algorithms in each testing step

The algorithms used in this study were ranked by prediction accuracy. In Table 3, each cell compares the algorithm in the column with the algorithm in the row in terms of statistically significant wins and losses. The left and right figures give the number of steps in which the column algorithm wins and loses, respectively, while the middle figure gives the number of steps with no statistically significant difference between the two algorithms. We used a statistical test (t-test) to compare the algorithms, and differences were considered statistically significant when p<0.001 [13]. For example, as shown in Table 3, the Naive Bayes algorithm had statistically significant wins over the 3-NN algorithm in 3 steps, no statistically significant difference in 6 steps and no statistically significant losses, while the RIPPER algorithm had statistically significant wins over the 3-NN algorithm in 2 steps, no statistically significant difference in 7 steps and no statistically significant losses, and so on.

Algorithm     3-NN    RIPPER   C4.5    WINNOW
Naive Bayes   0/6/3   0/9/0    0/5/4   0/0/9
3-NN          -       2/7/0    2/4/3   0/1/8
RIPPER        -       -        1/5/3   0/1/8
C4.5          -       -        -       0/4/5

Table 3. Comparing the algorithms
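A win/tie/loss tally of this kind could be computed as in the sketch below. Several details are assumptions, because the text states only that a t-test with p<0.001 was used: the sketch assumes a paired, two-tailed t-test over the accuracies obtained on the 10 tutor groups at each step (9 degrees of freedom), and it hard-codes the corresponding critical value from a standard t-table rather than computing p-values. The function names are hypothetical.

```python
import math
from statistics import mean, stdev

# Two-tailed critical t value for 9 degrees of freedom at p = 0.001
# (standard t-table value; 10 tutor groups give df = 10 - 1 = 9).
T_CRIT = 4.781


def paired_t(acc_a, acc_b):
    """Paired t statistic for two equally long lists of accuracies."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    m, sd = mean(diffs), stdev(diffs)
    if sd == 0:
        # All differences identical: no evidence if zero, otherwise unbounded.
        return 0.0 if m == 0 else math.copysign(math.inf, m)
    return m / (sd / math.sqrt(len(diffs)))


def tally(steps_a, steps_b, t_crit=T_CRIT):
    """steps_a, steps_b: one list of per-group accuracies per testing step.
    Returns (wins, ties, losses) of algorithm A against algorithm B."""
    wins = ties = losses = 0
    for acc_a, acc_b in zip(steps_a, steps_b):
        t = paired_t(acc_a, acc_b)
        if t > t_crit:
            wins += 1
        elif t < -t_crit:
            losses += 1
        else:
            ties += 1
    return wins, ties, losses
```

With such a strict significance level, small differences in average accuracy mostly end up in the middle ("no significant difference") count.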

The comparison of the algorithms showed that Naive Bayes and RIPPER had the best accuracy. Besides its better average accuracy, an advantage of Naive Bayes over RIPPER is the short computation time it requires. Another advantage of the Naive Bayes classifier is that it can use data with missing values as inputs, whereas RIPPER cannot. From this point of view, Naive Bayes is the most appropriate learning algorithm for the construction of a software support tool. A prototype version of this software support tool has already been constructed and is in use by the tutors (Figure 1).

a) An example at the 1st step (beginning of the academic year)

b) An example at the 5th step (middle of the academic year)

Figure 1. The software support tool for the tutors, which implements the Naive Bayes algorithm

Another interesting issue is the number of training examples that must be available for a learning algorithm to predict students’ performance with satisfactory accuracy. For this reason, we trained the Naive Bayes algorithm with different subsets of our training set and evaluated its performance on the ten groups of data from the new academic year (2001-2). For a given number of training examples, we randomly selected ten subsets of the same size; the average prediction accuracy for each number of training examples is presented in Figure 2. From Figure 2, we conclude that even small data sets (e.g. 30 examples, which corresponds to the number of students in a tutor’s class) provide sufficient accuracy. However, it seems that at least 60 examples are needed for more satisfactory predictive accuracy.

Figure 2. The accuracy of the Naive Bayes algorithm with different subsets of the initial training set

Apart from identifying the most suitable algorithm for this domain, in this work we also tried to find the attributes that most influence the induction of the algorithms. This reduces the information that needs to be stored and speeds up the induction. In the next section, we describe how each attribute influences the final prediction and rank the attributes according to a statistical measure, the information gain [10], in an attempt to show how much each attribute influences the induction.

5 ATTRIBUTE SELECTION

The training data set from this course showed that the ratio of men who passed the exams to men who failed is 48–52%, while for women this ratio drops to 39–61%. Moreover, the percentage of students below 32 years old who pass the exams is 46%, while the corresponding figure for older students is 44%. Another interesting fact relates student performance to marital status. A married student is about as likely to pass the exams as to fail (51%), while a single student has only a 41% probability of passing the module. A similar situation holds for the existence of children: a student with children has a 52% probability of passing the module, while a student without children has only 43%. This is probably because family obligations are known and have been taken into consideration prior to the commencement of the studies.

However, as expected, a strong correlation exists between student performance and previous education in the field of Informatics. The ratio of students who have previous education in the field of Informatics and pass the exams to those who fail is 51–49%, while for the remaining students this ratio drops to 28–72%. A similar correlation exists with involvement in professional activities that require the use of a computer. Students who use a computer in their work have a 52% probability of passing the module, while the remaining students have only 32%.

So far, we have described how each demographic attribute influences the prediction based on our data set. Next, in an attempt to show how much each attribute influences the induction, we rank the influence of each one according to a statistical measure, the information gain. Information gain determines how well an attribute separates the training data according to the target concept. It is based on a measure commonly used in information theory known as entropy [10].

Over a collection of training data S with a Boolean target concept, the entropy is defined as:

$$Entropy(S) = -p_{(+)} \log_2 p_{(+)} - p_{(-)} \log_2 p_{(-)}$$

where $p_{(+)}$ is the proportion of positive examples in S and $p_{(-)}$ the proportion of negative examples.

Note that if the numbers of positive and negative examples in the set are equal, the entropy equals 1. If all the examples in the set are of the same class, the entropy of the set is 0. If the set contains an unequal number of positive and negative examples, the entropy is between 0 and 1.
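These three cases can be checked numerically with a short Python sketch of the two-class entropy. The function name and the convention that a term with zero proportion contributes nothing (0 · log 0 taken as 0) are assumptions of this sketch.

```python
import math


def entropy(n_pos, n_neg):
    """Two-class entropy of a set with n_pos positive and n_neg negative examples."""
    total = n_pos + n_neg
    result = 0.0
    for n in (n_pos, n_neg):
        if n > 0:                     # a class with no examples contributes 0
            p = n / total
            result -= p * math.log2(p)
    return result
```

For instance, a set with 5 positive and 5 negative examples has entropy 1, a set with 10 examples of a single class has entropy 0, and a set with 3 positive and 7 negative examples falls strictly between the two.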

Information gain is the expected reduction in entropy when partitioning the examples of a set S according to an attribute A. It is defined as:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)$$

where Values(A) is the set of all possible values for an attribute A and $S_v$ is the subset of examples in S which have the value v for attribute A. On a Boolean data set having only positive and negative examples, Values(A) would be defined over [+,-]. The first term in the equation is the entropy of the original data set. The second term describes the entropy of the data set after it is partitioned using the attribute A.
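The definition translates directly into code. The sketch below computes the information gain of one categorical attribute over a list of labelled examples; the function names and the representation of an example as an (attribute dictionary, label) pair are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(examples, attribute):
    """examples: list of (attribute_dict, label) pairs."""
    labels = [label for _, label in examples]
    # Partition the labels by the value v that each example has for the attribute.
    partitions = defaultdict(list)
    for attrs, label in examples:
        partitions[attrs[attribute]].append(label)
    # Second term: entropy of each subset S_v weighted by |S_v| / |S|.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder
```

An attribute whose values split the examples into pure subsets attains the full entropy of the original set as its gain, while an attribute whose subsets have the same class mix as the whole set has a gain of 0.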

The average information gain of each attribute and its variance, computed over the training data and the ten groups of data collected from 10 tutors for the academic year 2001-2, are presented in Table 4. The larger the value of the information gain, the greater the influence of the attribute on the induction.

Attribute                       Average Information Gain
W_ASS-3                         0.385 ± 0.018
W_ASS-4                         0.382 ± 0.014
W_ASS-2                         0.261 ± 0.011
F_MEET4                         0.217 ± 0.016
F_MEET3                         0.158 ± 0.010
W_ASS-1                         0.138 ± 0.010
F_MEET2                         0.082 ± 0.009
Job associated with computers   0.061 ± 0.008
Computer literacy               0.030 ± 0.004
domestic                        0.008 ± 0.002
children                        0.007 ± 0.003
sex                             0.005 ± 0.003
occupation                      0.002 ± 0.002
F_MEET1                         0.001 ± 0.001
age                             0.000 ± 0.001

Table 4. Average Information Gain of each attribute

Thus, the demographic attributes that most influence the induction are ‘computer literacy’ and ‘job associated with computers’. In addition, it was found that the 1st face-to-face meeting does not have a high information gain. The reason is that almost all students come to the first meeting, which makes the information offered by this attribute minimal and possibly confusing.

Running the experiment again, this time without the “domestic”, “children”, “sex”, “occupation”, “age” and “F_MEET1” attributes, improved the accuracy of the algorithms (Table 5).

Testing step     Naive Bayes   3-NN         RIPPER       C4.5         WINNOW
DEMOGR/F_MEET1   64.24% (o)    62.65%       63.72%       63.98% (o)   56.69% (o)
W_ASS-1          66.63%        65.43% (o)   66.33%       65.75%       63.34%
F_MEET2          74.24% (o)    71.21% (o)   73.15% (o)   73.47%       63.68%
W_ASS-2          77.61%        77.04% (o)   78.46%       79.19%       72.02% (o)
F_MEET3          78.78%        78.77%       78.38%       78.96%       71.91% (o)
W_ASS-3          81.84%        83.28% (o)   79.59%       78.46%       76.83%
F_MEET4          81.49%        82.82% (o)   80.09%       78.00%       76.34%
W_ASS-4          83.23%        84.23%       81.96%       80.35% (o)   79.17% (o)
AVERAGE          74.70%        74.23%       73.93%       73.57%       68.52%

Table 5. Accuracy of the algorithms in each testing step after attribute selection

The results showed that the accuracy of the tested algorithms improves almost always (although not always to a statistically significant degree). In addition, there is no statistically significant decrease in any testing step. The tick (o) shows that in this testing step there is a statistically significant increase in the accuracy of the specific algorithm with the application of attribute selection (according to a t-test with p<0.05).

The Naive Bayes algorithm remains the most accurate, although the 3-NN algorithm benefited more from the attribute selection process. Table 6 presents the comparison of the algorithms after the attribute selection process. Each cell in Table 6 compares the algorithm in the column with the algorithm in the row in terms of statistically significant wins and losses (according to a t-test with p<0.001).

Algorithm     3-NN    RIPPER   C4.5    WINNOW
Naive Bayes   0/6/2   0/8/0    0/5/3   0/0/8
3-NN          -       3/2/3    4/1/3   0/1/7
RIPPER        -       -        0/8/0   0/0/8
C4.5          -       -        -       0/4/4

Table 6. Comparing the algorithms

To sum up, as far as the demographic attributes are concerned, we concluded that the attributes that most influence the induction (as was expected) were the existence of previous education in the field of Informatics and the use of a computer in the students’ professional activities. In addition, it was found that participation in the 1st face-to-face meeting did not improve the accuracy of the compared algorithms.

6 CONCLUSION

With the help of machine learning techniques, the tutors of a distance learning environment are in a position to know, from the beginning of the module and based only on curriculum-based data of the students, which of them will pass the module, with a precision that reaches 64% in the initial forecasts and exceeds 83% before the final examinations. For this reason, a prototype version of a software support tool has already been constructed implementing the Naive Bayes algorithm. Tracking student progress is a time-consuming job that can be handled automatically by such a tool. While the tutors will still have an essential role in monitoring and evaluating student progress, the tool can compile the data required for reasonable and efficient monitoring.

The average accuracy of the learning algorithms was shown to improve with an attribute selection filter. It was found that the students’ sex, age, marital status, number of children and occupation do not add accuracy to the algorithms used. Thus, there is no need to collect such information. On the contrary, the results of this study strongly correlate student performance with previous education in the field of Informatics or with working with computers, which was of course anticipated.

After the middle of the academic period (the second written assignment), the accuracy of the algorithms is very satisfactory (greater than 77.5%). For this reason, in future work we will try to use regression methods [15] after this training step in order to predict the students’ marks. With the present work, we can only predict whether the student passes the module or not.

The data set used is from the module “Introduction to Informatics”, but most of the conclusions are wide-ranging and are of interest for the majority of Programs of Study of the Hellenic Open University and, more generally, for distance education programs as a whole.

REFERENCES

[1] Aha, D. (1997), Lazy Learning, Dordrecht: Kluwer Academic Publishers.

[2] Cohen, W. (1995), “Fast Effective Rule Induction”, Proceedings of International Conference on Machine Learning 1995, pp. 115-123.

[3] Domingos, P. and Pazzani, M. (1997), “On the optimality of the simple Bayesian classifier under zero-one loss”, Machine Learning, Vol. 29, pp. 103-130.

[4] Fayyad, U. and Irani, K. (1993), “Multi-interval discretization of continuous-valued attributes for classification learning”, Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 1022-1027.

[5] Furnkranz, J. (1999), “Separate-and-Conquer Rule Learning”, Artificial Intelligence Review, Vol. 13, pp. 3-54.

[6] Jensen, F. (1996), An Introduction to Bayesian Networks, Springer.

[7] Kotsiantis, S., Zaharakis, I. and Pintelas, P. (2002), Supervised Machine Learning, TR-02-02, Department of Mathematics, University of Patras, Hellas, pp 28.

[8] Kotsiantis, S., Pierrakeas, C. and Pintelas, P. (2002), Efficiency of Machine Learning Techniques in Predicting Students’ Performance in Distance Learning Systems, TR-02-03, Department of Mathematics, University of Patras, Hellas, pp 42.

[9] Littlestone, N. and Warmuth, M. (1994), “The weighted majority algorithm”, Information and Computation, Vol. 108(2), pp. 212–261.

[10] Mitchell, T. (1997), Machine Learning, McGraw Hill.

[11] Murthy, S. (1998), “Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey”, Data Mining and Knowledge Discovery, Vol. 2, pp. 345–389.

[12] Quinlan, J. R. (1993), C4.5: Programs for machine learning, Morgan Kaufmann, San Francisco.

[13] Salzberg, S. (1997), “On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach”, Data Mining and Knowledge Discovery, Vol. 1, pp. 317–328.

[14] Wettschereck, D., Aha, D. and Mohri, T. (1997), “A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms”, Artificial Intelligence Review, Vol. 10, pp. 1–37.

[15] Witten, I. and Frank, E. (2000), Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Mateo.

[16] Xenos, M., Pierrakeas, C. and Pintelas, P. (2002), “A survey on student dropout rates and dropout causes concerning the students in the course of informatics of the Hellenic Open University”, Computers & Education, Vol. 39, pp. 361–377.

Summary. The ability to predict a student’s performance is particularly useful in university-level distance education. The first aim of this work was to examine whether learning algorithms can be used to predict students’ performance, thereby serving as a useful tool for tutors. Experiments were carried out with five representative algorithms, which were trained using data from the informatics course of the Hellenic Open University, and the results are very encouraging. A second aim of this study was to find the student characteristics that most influence the decision process of the algorithms. It was found that there is a strong correlation between student performance and the existence of previous education in the field of informatics, as well as the use of a computer in the student’s professional activities. Finally, a prototype support tool for tutors was built implementing the Naive Bayes algorithm, which, based on the experiments, proved to be the most suitable.
