Investigation into Web Content Mining Using PSO and ACO Algorithms



A Thesis Submitted
By

Mohammad H. Abdul Rahim Al-Mishhadany





To
The Council of the College of Computer Sciences and Mathematics
University of Mosul

In Partial Fulfillment of the Requirements
For the Degree of Master of Science
In
Computer Sciences



Supervised By

Dr. Ghayda Abdul Aziz AL-Talib
Assistant Professor




2011 A.D.    1433 A.H.

The World Wide Web is a vast data repository containing many types of
data. Organizing this huge volume of web information in a well-ordered
and accurate way is crucial if it is to be used as an information
resource. One way of accomplishing this is to use data mining methods to
induce and extract useful information from web data. Since web page
classification plays a vital role in many information management and
retrieval tasks, it is important to find fast and highly accurate
classification methods that can cope with the rapid growth of web data.


In this thesis, we present a new variant of the discrete particle swarm
optimization algorithm (DPSO), a swarm-intelligence-based technique, for
mining classification rules. This inductive process automatically builds
a model by learning from a set of previously classified web pages. The
learned model is then used to classify new web pages. The classification
accuracy of the new DPSO was 89% ± 5.15. This is a good classification
accuracy, comparable to that of the Ant-Miner algorithm (also applied in
this thesis), and it is obtained with less training time. Natural
language processing is adopted for feature selection, using a
part-of-speech tagging algorithm. A new approach is used to initialize
the particle swarm in the proposed algorithm: a heuristic function based
on simple computations replaces random initialization, which speeds up
convergence to the optimal solution.
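
The abstract does not give the details of the proposed DPSO, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes that pages are binary term-presence vectors, that a rule has the form "IF all selected terms are present THEN class", that rule quality is sensitivity times specificity (the measure used by Ant-Miner), and that the initialization heuristic is the share of target-class pages containing each term. The update rule is the standard binary PSO with a sigmoid transfer function. All function names and parameter values (term_scores, heuristic_init, rule_quality, binary_pso, the coefficients 0.7 and 1.5) are hypothetical.

```python
import math
import random


def term_scores(pages, labels, target):
    """Hypothetical heuristic: share of target-class pages containing each term."""
    target_pages = [p for p, lab in zip(pages, labels) if lab == target]
    if not target_pages:
        return [0.0] * len(pages[0])
    return [sum(p[i] for p in target_pages) / len(target_pages)
            for i in range(len(pages[0]))]


def heuristic_init(scores, n_particles, rng):
    """Create bit-vector particles; bit i is set with probability scores[i]
    instead of a uniform coin flip (assumed form of non-random initialization)."""
    return [[1 if rng.random() < s else 0 for s in scores]
            for _ in range(n_particles)]


def rule_quality(mask, pages, labels, target):
    """Sensitivity * specificity of 'IF all selected terms present THEN target'."""
    if not any(mask):
        return 0.0
    tp = fp = tn = fn = 0
    for page, label in zip(pages, labels):
        covered = all(page[i] for i, bit in enumerate(mask) if bit)
        if covered and label == target:
            tp += 1
        elif covered:
            fp += 1
        elif label == target:
            fn += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens * spec


def binary_pso(pages, labels, target, n_particles=10, n_iter=30, seed=0):
    """Standard binary PSO over term masks, started from heuristic_init."""
    rng = random.Random(seed)
    scores = term_scores(pages, labels, target)
    dim = len(scores)
    swarm = heuristic_init(scores, n_particles, rng)
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in swarm]
    pbest_fit = [rule_quality(p, pages, labels, target) for p in swarm]
    g = max(range(n_particles), key=lambda k: pbest_fit[k])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for k in range(n_particles):
            for d in range(dim):
                # Inertia plus attraction to personal and global bests.
                v = (0.7 * vel[k][d]
                     + 1.5 * rng.random() * (pbest[k][d] - swarm[k][d])
                     + 1.5 * rng.random() * (gbest[d] - swarm[k][d]))
                vel[k][d] = max(-4.0, min(4.0, v))  # clamp velocity
                # Sigmoid turns the velocity into the probability of a 1 bit.
                prob = 1.0 / (1.0 + math.exp(-vel[k][d]))
                swarm[k][d] = 1 if rng.random() < prob else 0
            fit = rule_quality(swarm[k], pages, labels, target)
            if fit > pbest_fit[k]:
                pbest[k], pbest_fit[k] = swarm[k][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = swarm[k][:], fit
    return gbest, gbest_fit
```

Calling binary_pso(pages, labels, "sports") on a small labelled set returns the best term mask found and its quality. Seeding the swarm from per-term scores means particles begin near plausible rules, which is one way a non-random start could shorten the search; whether it matches the thesis's heuristic function is an assumption.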