Use of ID3 Algorithm for Classification



IRNLP Coursework 2

Amrit Gurung 2702186



1. Abstract
1. Machine Learning
2. Information Retrieval
3. Information Organisation
3.1 Semantic Web
3.2 Classification
4. Classification Techniques
4.1.1 Probabilistic Models
4.1.2 Symbolic Learning and Rule Induction
4.1.3 Neural Networks
4.1.4 Analytic Learning and Fuzzy Logic
4.1.5 Evolution Based Models
4.2.1 Supervised Learners
4.2.2 Unsupervised Learners
4.2.3 Reinforcement Learners
5. ID3 Algorithm
5.1 Decision Trees
5.2 Training Data and Set
5.3 Entropy
5.4 Gain
5.5 Weaknesses of ID3 Algorithm
6. WEKA
7. Using ID3 to Classify Academic Documents in the Web
7.1 Attributes of a Web Document
7.2 Training Set
7.3 Training Data
7.4 Weka Output
7.5 Inferring Rules from the Decision Tree
7.6 Positive Classifications
7.7 Negative Classifications
7.8 Issues with the training data set
8. Conclusion
9. References
10. Appendix
10.1 “mytest.arff” file
10.2 Screenshot of Weka



1. Abstract

Information systems hold huge amounts of data. Information Retrieval is the act of finding the correct data that we already have, at the time we need it. "Google" is one of the companies that have been most successful at Information Retrieval, yet it is not perfect. There are millions of documents on the web, and sifting through them manually is impossible. Machine Learning tries to organise documents as they come in and to retrieve them according to the query put in by the user. The ID3 algorithm is one of the Machine Learning techniques that can help to classify data. The report looks into how ID3 fits into Information Retrieval and Machine Learning and how it works, implements it to classify some data, and finally concludes on its performance.



1. Machine Learning

Machine Learning is a part of Artificial Intelligence. In Machine Learning, computers are instructed using algorithms that help them infer and deduce: they are taught to find useful information in data. Machine Learning allows the automatic discovery of patterns and rules in large data sets, helps predict the outcome of queries and also helps classify data. Machine Learning has applications in many fields such as Information Retrieval, Natural Language Processing, Biomedicine, Genetics and the Stock Market.

2. Information Retrieval

Processed data that is meaningful is known as information. We have an enormous amount of data available on the internet, and search engines like "Google" and "Yahoo" try to find what we are looking for on the web. The data on the web is not meaningful at all if it is not available to the people who are looking for it. "Information Retrieval" is the process of finding this pre-stored information: we know it is there and we try to access it.


Fig 1 Information Retrieval

Suppose we have a set of documents "N" and a subset "A" of "N" that includes all the documents relating to the keyword "A". When a user inputs "A" as the search keyword, the ideal situation in Information Retrieval would be to return all the documents in the set "A".

Let us look at Google, the most popular search engine for Information Retrieval on the web. A query for the phrase "Implementation of ID3 algorithm for classification" in Google returns 40,000 results in English, and the Scholar version of Google returns 8,000 results. Not all of the results it produces are exact matches of what we were looking for, but the success ratio is acceptable.




There are two main processes involved in Information Retrieval: understanding what the user is looking for, and organising the data so that all related data are grouped together for correct retrieval. The user in a web search normally enters search terms as keywords. Web search engines cannot fetch good results when natural language is used in the query, so keywords are preferred. "User Modelling" can be done to check what users input and which results they accepted for their query. This helps us determine what a user was looking for when entering a particular search phrase. Organising the data that relates to the search term is known as Classification. Manually classifying data is impossible when there are millions of data sets involved. There are numerous algorithms and techniques that instruct computers how to classify data; the ID3 algorithm, together with C4.5, Neural Networks, the Bayesian model and Genetic algorithms, is one of them.

Information Retrieval differs from Data Mining in that in Information Retrieval we know exactly what we are looking for. In Data Mining we are looking at data to try to find patterns which will lead us to information that we previously did not know.

3. Information Organisation

One of the main challenges in Information Retrieval is to organise the data so that similar data are grouped together. As shown in Figure 1, we have to group all data covering the subject "A" so that when "A" is queried we return all the elements of A.

The internet is the biggest source of networked information that we have at the moment. On the internet we have heterogeneous data that is located on a multitude of servers. There are duplications, no verification of the documents, inaccurate descriptions and data that simply disappears. This makes the job of Information Retrieval even more difficult in web search.

3.1 Semantic Web

One of the main ideas for organising data and documents on the web is to make metadata compulsory. Metadata describes what 'data' the data actually holds, which would make it very efficient to collect, organise and retrieve data. But there is no authority on the web that enforces the use of metadata. HyperText Markup Language (HTML) files, text files and multimedia files like JPEG, Flash and WMV allow the use of META tags to describe their contents, size, author and other properties. We do not actively use these META tags, which makes the goal of making the web semantic more difficult. One option is to collect all the data and documents on the web that lack metadata and add it automatically by means of Machine Learning. This is a challenging proposal, but a semantic web would help tremendously in Information Retrieval.

3.2 Classification

Classification is the process of grouping together documents or data that have similar properties or are related. Our understanding of the data and documents becomes greater and easier once they are classified, and we can also infer logic based on the classification. Most of all, it makes new data easier to sort and retrieval faster, with better results.





Fig 2 Classification of Books according to Subject

The Dewey Decimal Classification is the system most used in libraries. It is hierarchical: there are ten parent classes, which are further divided into ten divisions, which in turn are divided into ten sections. Each book is assigned a number according to its class, division and section, alphabetically. The Dewey Decimal Classification is very successful in libraries, but unfortunately it cannot be implemented in web Information Retrieval. Somebody would need to hold a central catalogue of all the documents on the web, and whenever a new document was added a central committee would have to look at it, classify it, assign a number and publish it on the web. This is in strong violation of the way the internet works: an authority controlling the contents of the web would restrict the amount of data that can be added to it. We need a web that allows everyone to upload their content, together with a Machine Learning technique that finds these new data and classifies them as they come.

4. Classification Techniques

There are different Machine Learning paradigms for classification.

4.1.1 Probabilistic Models

In Probabilistic Models, the probability of each class and feature is estimated with the help of the training data set. The classification of new data is based on these probabilities. One example of Probabilistic Modelling is the Bayesian Model.

4.1.2 Symbolic Learning and Rule Induction

In Symbolic Learning and Rule Induction, the algorithms learn by being told and by looking at examples. The ID3 algorithm developed by Quinlan is one of them.

4.1.3 Neural Networks

In Neural Networks, the data and its outputs (nodes) are inter-connected in a web-like structure through programming constructs which mimic the function of neurons in the human brain. Based on these connections, when new data is placed in one of the nodes, its output can be predicted or it can be classified accordingly.



4.1.4 Analytic Learning and Fuzzy Logic

Analytic Learning and Fuzzy Logic normally have logical rules which are used to create even more complex rules. Based on these logical rules, truth is expressed as a value ranging between 0 and 1.

4.1.5 Evolution Based Models

Evolution based models are based on Darwin's theory of Natural Selection and are divided further into a) Genetic algorithms, b) Evolution strategies and c) Evolutionary programming.

These classification techniques can also be classified by how they learn:

4.2.1 Supervised Learners

Algorithms that learn by looking at input/output pairs in training data in order to find results for new data (like the ID3 algorithm).

4.2.2 Unsupervised Learners

There are no training data sets; the algorithms learn by looking at the input patterns and predict the result.

4.2.3 Reinforcement Learners

These algorithms observe the state and make predictions. Each prediction is rewarded or punished according to its accuracy. The algorithm then learns how to make the right decision.


5. ID3 Algorithm

The ID3 algorithm is an example of Symbolic Learning and Rule Induction. It is also a supervised learner, which means it looks at examples in a training data set to make its decisions. It was developed by J. Ross Quinlan back in 1979. It builds a decision tree based on mathematical calculations. The main concepts of the ID3 algorithm are described below in sections 5.1 to 5.4.

5.1 Decision Trees


Fig 3 Decision Tree



A decision tree classifies data using its attributes. The tree is drawn upside down, with the root at the top. It has decision nodes and leaf nodes. In Fig 3, the "linksFromAcademias" attribute is a decision node and the "author" attribute is a leaf node. A leaf node holds homogeneous data, which means further classification is not necessary. The ID3 algorithm keeps building such decision trees until all the leaf nodes are homogeneous.

5.2 Training Data and Set

The ID3 algorithm is a supervised learner: it needs training data sets to make decisions. The training set lists the attributes and their possible values. ID3 does not deal with continuous, numeric data, which means we have to discretise such values. An attribute such as age, which can take values from 1 to 100, is instead listed with values such as young, middle aged and old.

Attribute    Values
Age          Young, Middle aged, Old
Height       Tall, Short, Medium
Employed     Yes, No

Fig 4 Training Set

The training data is the list of instances containing actual values.

Age      Height    Employed
Young    Tall      Yes
Old      Short     No
Old      Medium    No
Young    Medium    Yes

Fig 5 Training Data
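Although the coursework itself needs no code for this step, a tiny Python sketch can make the discretisation idea concrete. The cut-off points below are invented purely for illustration; ID3 only ever sees the resulting categories of Fig 4, never the raw numbers.

def discretise_age(age):
    # Map a raw numeric age onto the discrete categories used in Fig 4.
    # The thresholds are assumptions for illustration, not part of ID3 itself.
    if age < 30:
        return "Young"
    elif age < 60:
        return "Middle aged"
    else:
        return "Old"

print(discretise_age(21), discretise_age(45), discretise_age(72))
# Young Middle aged Old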



5.3 Entropy

Fig 6 Entropy

Entropy refers to the randomness of the data. It ranges from 0 to 1. A data set with entropy 1 is as random as it can get, while a data set with entropy 0 is homogeneous. In Fig 6, the root of the tree holds the whole collection of data. It has high entropy, which means the data is random. The set of data is eventually divided into subsets 3, 4, 5 and 6, where it is now homogeneous and the entropy is 0 or close to 0.

Entropy is calculated by the formula:

E(S) = -(p+) * log2(p+) - (p-) * log2(p-)

"S" represents the set, "p+" is the proportion of elements in the set "S" with positive values and "p-" is the proportion of elements with negative values.

The purpose of the ID3 algorithm is to classify data using decision trees such that the resulting leaf nodes are all homogeneous, with zero entropy.
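To make the formula concrete, here is a small Python sketch (illustrative only, not part of the coursework) that computes the entropy of a list of class labels; with two classes it reduces to the formula above.

from collections import Counter
from math import log2

def entropy(labels):
    # E(S) = -sum over classes of p * log2(p); for two classes this is
    # -(p+)*log2(p+) - (p-)*log2(p-) as given in section 5.3.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# The full training set of section 7 has 17 positive and 11 negative instances:
print(entropy(["YES"] * 17 + ["NO"] * 11))   # about 0.967
# A set containing only one class has entropy 0: it is homogeneous.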

5.4 Gain

In decision trees, nodes are created by singling out an attribute. ID3's aim is to create leaf nodes with homogeneous data, so it has to choose the attribute that fulfils this requirement best. ID3 calculates the "Gain" of each individual attribute; the attribute with the highest gain results in nodes with the smallest entropy. To calculate Gain we use:

Gain(S, A) = Entropy(S) - Σ ((|Sv| / |S|) * Entropy(Sv))

In the formula, 'S' is the set and 'A' is the attribute, and the sum runs over every value 'v' of attribute 'A'. 'Sv' is the subset of 'S' where attribute 'A' has value 'v'. '|S|' is the number of elements in set 'S' and '|Sv|' is the number of elements in subset 'Sv'.



ID3 chooses the attribute with the highest gain to create a node in the decision tree. If the resulting subsets do not have entropy zero (or close to zero), it chooses one of the remaining attributes to create further nodes, until all the subsets are homogeneous.
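Putting sections 5.1 to 5.4 together, the following Python sketch (written for this report's small example, not Quinlan's code) computes the gain of each attribute on the Fig 5 data and then builds a tree recursively, always splitting on the highest-gain attribute until every subset is homogeneous.

from collections import Counter
from math import log2

# The illustrative training data of Fig 5; 'Employed' is treated as the class attribute.
DATA = [
    {"Age": "Young", "Height": "Tall",   "Employed": "Yes"},
    {"Age": "Old",   "Height": "Short",  "Employed": "No"},
    {"Age": "Old",   "Height": "Medium", "Employed": "No"},
    {"Age": "Young", "Height": "Medium", "Employed": "Yes"},
]

def entropy(rows, target):
    total = len(rows)
    counts = Counter(row[target] for row in rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def gain(rows, attribute, target):
    # Gain(S, A) = Entropy(S) - sum over each value v of A of (|Sv| / |S|) * Entropy(Sv)
    total = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attributes, target):
    labels = {row[target] for row in rows}
    if len(labels) == 1:              # homogeneous subset: this becomes a leaf
        return labels.pop()
    if not attributes:                # nothing left to split on: majority vote
        return Counter(row[target] for row in rows).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, [a for a in attributes if a != best], target)
    return (best, branches)           # a decision node: (attribute, sub-trees)

print(gain(DATA, "Age", "Employed"), gain(DATA, "Height", "Employed"))   # 1.0 0.5
print(build_tree(DATA, ["Age", "Height"], "Employed"))
# ('Age', {'Young': 'Yes', 'Old': 'No'}) - Age alone separates the two classes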

5.5 Weaknesses of ID3 Algorithm

ID3 uses training data sets to make decisions, which means it relies entirely on the training data. The training data is input by the programmer, and whatever is in the training data is its base knowledge; any adulteration of the training data will result in wrong classification. ID3 cannot handle continuous data like numeric values, so the values of the attributes need to be discrete. It also only considers the single attribute with the highest gain and ignores other attributes with less gain. It does not backtrack to check its nodes, which is why it is called a greedy algorithm; because of this it produces shorter trees. Sometimes we might need to consider two attributes at once as a combination, but this is not facilitated in ID3. For example, in a bank loan application we might need to consider attributes like age and earnings together: young applicants with lower earnings can potentially have more chance of promotion and better pay, which would result in a higher credit rating.

6. WEKA

"Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes." [1]

It was developed by E. Frank, M. Hall and L. Trigg et al. at the University of Waikato. In this report, the Weka software is used to classify scholar documents on the web using the ID3 algorithm. Weka does not create visual trees for the ID3 algorithm; it checks the accuracy of the classified instances, and training data sets can be loaded as '.arff' files.
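Weka reads the training set directly from the '.arff' file listed in the appendix. For completeness, the same file can also be inspected outside Weka; the sketch below is illustrative only and assumes the scipy library is installed.

from scipy.io import arff

# Load the appendix file (section 10.1); nominal values come back as byte strings.
data, meta = arff.loadarff("mytest.arff")
print(meta)                     # lists each attribute and its possible values
print(len(data), "instances")   # 28 training instances
print(data[0])                  # the first row of the training data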


7. Using ID3 to Classify Academic Documents in the Web

Google has a Scholar section where academic documents are made available to students. It tries to judge whether a document has any academic merit. The use of ID3 for this kind of classification will be tested by identifying the attribute/value pairs of academic documents and creating a training data set.

7.1 Attributes of a Web Document

A web document has many properties, such as author, date created, description and tags. To identify whether a document is academic or not, I have created a test scheme. The author should be an academic himself: a doctor, a professor or a student. (It is impossible in practice to check the author's titles, but we assume here that we can.) All journal articles are academic articles. Websites of academic institutions have ".ac" or ".edu" and governmental organisations have ".gov" in their domain names, so all documents published on those websites by a doctor, professor or student are considered academic.


If a document is published on a ".com" website and written by any of these academics, it can only be considered scholar material if it is referred to or linked to by an academic or governmental website.

Google has more than five hundred million hyperlinks in its database to help it rank pages. The documents that the hyperlinks point to are given ticks and ranked by the number of ticks they get, so analysing whether the document containing the hyperlink is from an academic or governmental website should not be troublesome.

7.2 Training Set

Attribute             Possible Values
linksFromAcademias    YES, NO
author                doctor, professor, student
domain                com, edu, ac, gov
linksTo               null, low, high
journalArticle        YES, NO
scholarMaterial       YES, NO

Fig 7 Training Set to check Academic documents

The 'linksFromAcademias' attribute checks whether a document containing a hyperlink to this document comes from an academic website or not. The 'author' attribute has the values doctor, professor and student, which means documents authored by anyone else will not be considered at all. The 'linksTo' attribute shows the number of hyperlinks in the document; ID3 cannot handle continuous numeric data, so the possible values have been discretised as null, low and high. The 'journalArticle' attribute checks whether the document is from a journal or not. Finally, the 'scholarMaterial' attribute is the decision attribute that states whether the document is academic or not.
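The scheme of Fig 7 can also be written down as a simple attribute-to-values mapping. The short Python sketch below (illustrative only) checks that an instance uses only the allowed discrete values, since ID3 cannot cope with anything outside this scheme.

# The attribute/value scheme of Fig 7 as a dictionary.
SCHEME = {
    "linksFromAcademias": {"YES", "NO"},
    "author": {"doctor", "professor", "student"},
    "domain": {"com", "edu", "ac", "gov"},
    "linksTo": {"null", "low", "high"},
    "journalArticle": {"YES", "NO"},
    "scholarMaterial": {"YES", "NO"},
}

def is_valid(instance):
    # Every attribute must be present and hold one of its discrete values.
    return all(instance.get(attr) in values for attr, values in SCHEME.items())

print(is_valid({"linksFromAcademias": "NO", "author": "student", "domain": "ac",
                "linksTo": "low", "journalArticle": "NO", "scholarMaterial": "NO"}))  # True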



7.3 Training Data

For the training data, twenty-eight instances have been input. The data set is consistent, with no contradictory instances.

linksFromAcademias  author     domain  linksTo  journalArticle  scholarMaterial
NO                  student    ac      low      NO              NO
YES                 doctor     ac      low      NO              YES
NO                  doctor     gov     null     NO              YES
NO                  student    com     null     NO              NO
YES                 student    com     low      NO              YES
YES                 professor  com     high     NO              YES
NO                  student    com     high     NO              NO
NO                  doctor     com     high     NO              NO
YES                 professor  edu     low      NO              YES
YES                 doctor     edu     low      NO              YES
NO                  professor  edu     null     NO              YES
NO                  student    com     high     NO              NO
YES                 professor  ac      null     NO              YES
NO                  student    gov     null     NO              NO
NO                  professor  gov     null     NO              YES
NO                  doctor     com     null     YES             YES
NO                  professor  com     null     YES             YES
YES                 professor  ac      high     NO              YES
YES                 student    com     high     YES             YES
NO                  professor  com     null     NO              NO
NO                  doctor     com     null     NO              NO
NO                  professor  com     null     NO              NO
YES                 student    gov     high     NO              YES
NO                  student    edu     null     NO              NO
NO                  professor  ac      null     NO              YES
NO                  doctor     com     null     NO              NO
NO                  doctor     edu     high     NO              YES
NO                  doctor     ac      low      NO              YES





7.4 Weka Output

The above data set was tried in Weka using the ID3 algorithm. The tree generated is shown below.

linksFromAcademias = YES: YES
linksFromAcademias = NO
|  author = doctor
|  |  domain = com
|  |  |  journalArticle = YES: YES
|  |  |  journalArticle = NO: NO
|  |  domain = edu: YES
|  |  domain = ac: YES
|  |  domain = gov: YES
|  author = professor
|  |  domain = com
|  |  |  journalArticle = YES: YES
|  |  |  journalArticle = NO: NO
|  |  domain = edu: YES
|  |  domain = ac: YES
|  |  domain = gov: YES
|  author = student: NO


To visualise the tree, the vertical bars have to be read as levels of nesting below the decision nodes.

Fig 8 Half tree

Figure 8 shows half of the tree as actually rendered by the Weka software for the training data set in section 7.3.


The 'linksFromAcademias' attribute had the highest gain, so the root node was created on that attribute. The subset with the value 'YES' for the 'linksFromAcademias' node was homogeneous, which meant there was no need to classify it further; there were 9 instances with the 'YES' value in that particular subset. The subset with 'NO' values for 'linksFromAcademias' had nineteen instances and was not homogeneous (the entropy was still high). The subset with the 'NO' value was then divided using the 'author' attribute, because it had the highest gain. In the resulting subsets the 'student' subset was homogeneous, but the subsets with doctor and professor values were not, so they were divided on the basis of the 'journalArticle' attribute, which resulted in perfectly homogeneous leaves. The printed tree has six decision nodes and twelve leaves in total.
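The same tree can be written down as a nested data structure and walked programmatically. The Python sketch below is not produced by Weka; it simply mirrors the tree printed above, with leaves as plain strings and decision nodes as (attribute, branches) pairs.

TREE = ("linksFromAcademias", {
    "YES": "YES",
    "NO": ("author", {
        "student": "NO",
        "doctor": ("domain", {
            "com": ("journalArticle", {"YES": "YES", "NO": "NO"}),
            "edu": "YES", "ac": "YES", "gov": "YES",
        }),
        "professor": ("domain", {
            "com": ("journalArticle", {"YES": "YES", "NO": "NO"}),
            "edu": "YES", "ac": "YES", "gov": "YES",
        }),
    }),
})

def classify(tree, instance):
    # Follow the branch matching the instance's value until a leaf label is reached.
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

doc = {"linksFromAcademias": "NO", "author": "professor",
       "domain": "com", "journalArticle": "YES"}
print(classify(TREE, doc))   # YES, matching the corresponding branch above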




Weka produced the following evaluation report:


=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances      28     100%
Incorrectly Classified Instances     0       0%
Kappa statistic                      1
Mean absolute error                  0
Root mean squared error              0
Relative absolute error              0%
Root relative squared error          0%
Total Number of Instances           28

=== Confusion Matrix ===

  a  b   <-- classified as
 17  0 |  a = YES
  0 11 |  b = NO


All twenty-eight instances were correctly classified: seventeen were classified as scholar documents and eleven as non-scholar documents. The fact that every instance was classified correctly shows that the training set data was consistent and not ambiguous.



7.5 Inferring Rules from the Decision Tree

By observing the decision tree we can create a series of if/else statements that act as rules for classifying academic documents. Weka does not do this automatically, so for now it has to be done manually. In our case the inferred rules can be described by the following if/else statements:

if linksFromAcademias = 'YES' then
    Document = academic
else if author = 'student' then
    Document != academic
else if author = 'doctor' then
    if domain = 'edu' or domain = 'ac' or domain = 'gov' then
        Document = academic
    else if domain = 'com' then
        if journalArticle = 'YES' then
            Document = academic
        else
            Document != academic
else if author = 'professor' then
    if domain = 'edu' or domain = 'ac' or domain = 'gov' then
        Document = academic
    else if domain = 'com' then
        if journalArticle = 'YES' then
            Document = academic
        else
            Document != academic



The nested if/else statements above state the rules for classifying academic documents based on the training data set.



7.6 Positive Classifications

Whenever new data comes in, we can classify it instantly by following these nested if/else statements. A document with links from other academia is instantly classified as a scholar document. Documents without links from other academia that are written by doctors or professors are all classified as academic material, unless they were published on a '.com' website and are not journal articles.


7.7 Negative Classifications

Non-journal documents written by doctors and professors that were published on '.com' sites were all correctly classified as non-academic. All documents authored by students without any links from academia were classified as non-academic.






7.8 Issues with the training data set

If a document did not have any links from academia and was written by a student then it was instantly classified as a non-academic document. This is not true in real life: we would like documents written by students in journals to be classified as academic documents. The training data set did not contain such an instance, which is what led the ID3 algorithm to this decision. To make this possible we have to add the following instance to the data set, where a student's document published in a journal without any links from academia is set as an academic document.


linksFromAcademias  author   domain  linksTo  journalArticle  scholarMaterial
NO                  student  com     low      YES             YES




When the above instance was added to the training set, the tree generated by the ID3 algorithm differed from the previous one in section 7.4.

linksFromAcademias = YES: YES
linksFromAcademias = NO
|  journalArticle = YES: YES
|  journalArticle = NO
|  |  domain = com: NO
|  |  domain = edu
|  |  |  author = doctor: YES
|  |  |  author = professor: YES
|  |  |  author = student: NO
|  |  domain = ac
|  |  |  author = doctor: YES
|  |  |  author = professor: YES
|  |  |  author = student: NO
|  |  domain = gov
|  |  |  author = doctor: YES
|  |  |  author = professor: YES
|  |  |  author = student: NO


The second node in the new decision tree uses the 'journalArticle' attribute to create new branches, where in the previous decision tree it was the 'author' attribute. This new decision tree correctly classifies journal articles without academic links, published on '.com' sites and written by students, as being academic.
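As a quick illustrative check (not part of the coursework), the regenerated tree can be condensed into a single Python function; the added instance, a student's journal article on a '.com' site with no academic links, now comes out as academic.

def is_scholar(links_from_academias, journal_article, domain, author):
    # Condensed form of the regenerated tree shown above.
    if links_from_academias == "YES":
        return True
    if journal_article == "YES":
        return True
    return domain in ("edu", "ac", "gov") and author in ("doctor", "professor")

print(is_scholar("NO", "YES", "com", "student"))   # True: the newly added instance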


The ID3 algorithm cannot handle missing attribute values. If we try to load a training data set with missing values into Weka, it gives an error.


An inconsistent training data set creates errors in the decision tree. For example, here are two similar instances with different outcomes:


linksFromAcademias  author  domain  linksTo  journalArticle  scholarMaterial
NO                  doctor  ac      low      NO              YES
NO                  doctor  ac      low      NO              NO



The above instances have the same values except for the last attribute, which is opposite. When this training data set was tried with the ID3 algorithm, the accuracy dropped to only 96%.



The 'linksTo' attribute was never tested while building the decision tree in any of the cases above. The ID3 algorithm only considers the attribute with the highest gain, and because of this it builds the fastest and shortest tree. In our case the 'linksTo' attribute was correctly left out, but in other scenarios all attributes might have to be considered equally.

8. Conclusion

Classification is essential for organising data and retrieving information correctly and swiftly. Implementing Machine Learning to classify data is not easy, given the huge amount of heterogeneous data present on the web. The ID3 algorithm depends entirely on the accuracy of the training data set for building its decision trees; it learns by supervision and has to be shown which instances have which results. Because of this, I think the ID3 algorithm cannot successfully classify documents on the web: the data on the web is unpredictable, volatile and mostly lacks metadata. The way forward for Information Retrieval on the web, in my opinion, would be to advocate the creation of a semantic web, where unsupervised and reinforcement learning algorithms are used to classify and retrieve data.

9. References

1. Weka, Machine Learning Group, University of Waikato. [Online] Available from: http://www.cs.waikato.ac.nz/ml/weka/

2. Building Classification Models: ID3 and C4.5. [Online] Available from: http://www.cis.temple.edu/~ingargio/cis587/readings/id3-c45.html [Accessed 7th Dec 08]

3. H. Hamilton, E. Gurak, F. Leah, W. Olive (2000-2). Computer Science 831: Knowledge Discovery in Databases. [Online] Available from: http://www2.cs.uregina.ca/~dbd/cs831/index.html [Accessed 7th Dec 08]

4. Bhatia, MPS and Khalid, Akshi Kumar (2008). "Information retrieval and machine learning: Supporting technologies for web mining research and practice." Webology, 5(2), Article 55. [Online] Available from: http://www.webology.ir/2008/v5n2/a55.html [Accessed 7th Dec 08]



10. Appendix

10.1 “mytest.arff” file

@relation mytest

@attribute linksFromAcademias {YES,NO}
@attribute author {doctor,professor,student}
@attribute domain {com,edu,ac,gov}
@attribute linksTo {null,low,high}
@attribute journalArticle {YES,NO}
@attribute scholarMaterial {YES,NO}


@data

NO, student, ac, low, NO, NO
YES, doctor, ac, low, NO, YES
NO, doctor, gov, null, NO, YES
NO, student, com, null, NO, NO
YES, student, com, low, NO, YES
YES, professor, com, high, NO, YES
NO, student, com, high, NO, NO
NO, doctor, com, high, NO, NO
YES, professor, edu, low, NO, YES
YES, doctor, edu, low, NO, YES
NO, professor, edu, null, NO, YES
NO, student, com, high, NO, NO
YES, professor, ac, null, NO, YES
NO, student, gov, null, NO, NO
NO, professor, gov, null, NO, YES
NO, doctor, com, null, YES, YES
NO, professor, com, null, YES, YES
YES, professor, ac, high, NO, YES
YES, student, com, high, YES, YES
NO, professor, com, null, NO, NO
NO, doctor, com, null, NO, NO
NO, professor, com, null, NO, NO
YES, student, gov, high, NO, YES
NO, student, edu, null, NO, NO
NO, professor, ac, null, NO, YES
NO, doctor, com, null, NO, NO
NO, doctor, edu, high, NO, YES
NO, doctor, ac, low, NO, YES




10.2 Screenshot of Weka