Classification and Retrieval of Research Papers: A Semantic Hierarchical Approach



Submitted in partial fulfilment of the requirements for the degree of

Master of Philosophy

by

Mirza Nazura Abdulkarim
(Roll No. 0935009)

Supervisor:
Ms. Sumitra Binu

Department of Computer Science
CHRIST UNIVERSITY BANGALORE
2010


Declaration

I hereby declare that the dissertation entitled 'Classification and Retrieval of Research Papers: A Semantic Hierarchical Approach' submitted for the M.Phil. degree is my original work, and that the dissertation has not formed the basis for the award of any degree, associateship, fellowship or any other similar title.

Place: Bangalore
Date:

Mirza Nazura Abdulkarim
Roll No.: 0935009

Certificate

This is to certify that the dissertation entitled 'Classification and Retrieval of Research Papers: A Semantic Hierarchical Approach' is a bonafide research work carried out by Mirza Nazura Abdulkarim, student of M.Phil. (Computer Science), Christ University, Bangalore, during the year 2009-2010, in partial fulfilment of the requirements for the award of the Degree of Master of Philosophy, and that the dissertation has not formed the basis for the award previously of any degree, diploma, associateship, fellowship or any other similar title.



Place: Bangalore

Date:



Ms. Sumitra Binu





Approval Sheet

Thesis entitled 'Classification and Retrieval of Research Papers: A Semantic Hierarchical Approach' by Mirza Nazura Abdulkarim is approved for the degree of Master of Philosophy in Computer Science.

Examiners:
1. ___________________ ___________________
2. ___________________ ___________________
3. ___________________ ___________________

Supervisor(s):
1. ___________________ ___________________
2. ___________________ ___________________
3. ___________________ ___________________

Chairman: ______________________

(Seal)

Date: ___________
Place: __________





Abstract

'Classification and Retrieval of Research Papers: A Semantic Hierarchical Approach' demonstrates an effective and efficient technique for the classification of research documents pertaining to Computer Science. The explosion in the number of documents and research publications in electronic form, and the need to perform a semantic search for their retrieval, has been the incentive for this research.

The popularity and widespread use of electronic documents and publications has necessitated the development of an efficient document archival and retrieval mechanism. The objective of this thesis is to categorize journal papers by assigning them relevant and meaningful classes, predicting the latent concept or topic of research based on the relevant terms, and assigning the appropriate classification labels.

This thesis takes a semantic approach and applies text mining techniques in a hierarchical manner in order to classify the documents.

The use of a lexicon containing domain specific terms (DSL) adds a semantic dimension to classification and document retrieval. The Concept Prediction based on Term Relevance (CPTR) technique demonstrates a semantic model for assigning concepts or topics to papers.

This thesis proposes a conceptual framework for organizing and classifying research papers pertaining to Computer Science. The efficacy of the proposed concepts is demonstrated with the help of classification experiments. These experiments reveal that the DSL technique of training works efficiently when categorization is based on keywords. The CPTR technique, on the other hand, shows very high accuracy when the classification is based on the contents of the document. Both these techniques lend a semantic dimension to classification.

Narrowing down the scope of search at each level of the hierarchy enables time efficient retrieval and access of the goal documents. The hierarchical interface for Document Retrieval enables retrieval of the target documents by gradually restricting the scope of search at each level of the hierarchy.

This work comprises two main components:

1. The Framework for Hierarchical Classification.
2. The Hierarchical Interface for Document Retrieval.

Two distinct techniques for classification are proposed in this thesis. These include:

1. The use of a Domain Specific Lexicon (DSL), which is comparable to a Domain Specific Ontology.
2. The Concept Prediction based on Term Relevance (CPTR) technique.

These techniques lend a semantic dimension to classification.

Keywords: Text Mining, Classification, Document Retrieval, Hierarchical, Domain Specific Lexicon (DSL), Probabilistic Latent Semantic Analysis (PLSA), Concept Prediction based on Term Relevance (CPTR)
Table of Contents

Sr. No.  Sections
1    Abstract
2    List of Figures
3    List of Tables
4    Abbreviations
5    Chapter 1 – Introduction
       1.1 Framework for Hierarchical Classification
       1.2 The Hierarchical Interface for Document Retrieval
       1.3 The Hierarchical Approach for Classification – Level 1
       1.4 The Hierarchical Approach for Classification – Level 2
       1.5 The Hierarchical Approach for Classification – Level 3
       1.6 The Hierarchical Interface for Document Retrieval (Levels 1, 2 and 3)
5    Chapter 2 – Literature Review
       2.1 Data Mining Concepts
       2.2 Text Mining
       2.3 Handling Text Data
       2.4 Document Preprocessing
       2.5 Representing Documents – The Vector Space Model (VSM)
       2.6 Document Retrieval
       2.7 Dimensionality Reduction
       2.8 Probabilistic Latent Semantic Analysis (PLSA)
       2.9 The Text Classifiers
       2.10 Related Works / Background
       2.11 RapidMiner-5 – About RapidMiner
6    Chapter 3 – Methodology
       3.1 Module 1 – The Framework for Hierarchical Classification
       3.2 The Hierarchical Classification – Level 1
       3.3 Classification (Level 1) Experiment
       3.4 The Hierarchical Classification – Level 2
       3.5 Classification (Level 2) Experiment
       3.6 The Hierarchical Classification – Level 3
       3.7 The Simple Classification Scheme
       3.8 Classification (Level 3) Simple Classification Experiment
       3.9 Concept Prediction Using Term Relevance (CPTR)
       3.10 The CPTR Technique Experiment
       3.11 The Hierarchical Retrieval Module
7    Chapter 4 – Experiments & Findings
8    Chapter 5
       5.1 Conclusions
       5.2 Scope of Future Work
9    Appendix I – Excerpt from Codes
       9.1 Excerpt from the Code – Level 1 Classification
       9.2 Excerpt from the Code – Level 3 Classification
10   References
       Bibliography
11   Publications
12   Acknowledgement







List of Figures

Figure No.   Description
Fig. 1.1     Typical Format/Structure of a Research document
Fig. 1.2     Components of a Research paper
Fig. 1.3     The Classification Hierarchy
Fig. 1.4     The Hierarchical Approach for Classification – Level 1
Fig. 1.5     The Hierarchical Approach for Classification – Level 2
Fig. 1.6     The Hierarchical Approach for Classification – Level 3 (Simple Classification)
Fig. 1.7     The Hierarchical Approach for Classification – Level 3 (CPTR Model)
Fig. 1.8     Hierarchical Retrieval
Fig. 2.1     Classification
Fig. 2.2     Text Mining
Fig. 2.3     Efficiency of Information Retrieval
Fig. 2.4     RapidMiner Environment
Fig. 3.1     Classification Level 1 Process
Fig. 3.2     Capturing tagged Data (Classification Level 1)
Fig. 3.3     DB table journal_data (Classification Level 1)
Fig. 3.4     The Classification Process
Fig. 3.5     A sample DSL file
Fig. 3.6     A sample PDF file with the Title and the keywords
Fig. 3.7     Building a Corpus of DSL files
Fig. 3.8     Developing and Testing the Classifier using RapidMiner (Classification Level 2)
Fig. 3.9     Processing the Documents using RapidMiner (Classification Level 2)
Fig. 3.10    Predictions (Results) by the Classifier
Fig. 3.11    A sample PDF file containing keywords and abstract
Fig. 3.12    Building a Corpus of DSL (Training examples with Concepts)
Fig. 3.13    The TD Matrix with Binary Term Occurrence for the Training Set (DSL)
Fig. 3.14    The TD Matrix with Binary Term Occurrence for the Test Set
Fig. 3.15    The Concepts Predicted by the K-NN Classifier
Fig. 3.16    Relationship between Domains and Concepts
Fig. 3.17    The CPTR Process
Fig. 3.18    Transposed TD Matrix with TF for the Test Set
Fig. 3.19    The Generated CTF_matrix
Fig. 3.20    The Final CPTR_tab

List of Tables

Table No.    Description
Table 2.1    Term Document Matrix
Table 2.2    Term Document Matrix using Relative Frequency
Table 2.3    Document Table
Table 2.4    Term Table
Table 2.5    Signature File
Table 3.1    The TD Matrix based on Binary Term Occurrence for the Training Set / DSL files
Table 3.2    The TD Matrix based on Binary Term Occurrence for the Test Set
Table 3.3    Concept Matrix
Table 3.4    VSM / Term Document Matrix for Test Set
Table 3.5    The CPTR Matrix
Table 3.6    Cumulative Term Frequency Matrix (CTF_Matrix)
Table 4.1    Domains of the DSL files
Table 4.2    The Performance of the K-NN Classifier (Classification Level 2)
Table 4.3    Concepts / Topics
Table 4.4    The Performance of the Simple Classification Technique – Classification Level 3
Table 4.5    The Performance of the CPTR Technique – Classification Level 3




Abbreviations

1.  DSL – Domain Specific Lexicon
2.  DB – Database
3.  LSI – Latent Semantic Indexing
4.  PLSA – Probabilistic Latent Semantic Analysis
5.  CPTR – Concept Prediction based on Term Relevance
6.  BOW – Bag of Words
7.  TD – Term Document
8.  TF – Term Frequency
9.  TF-IDF – Term Frequency – Inverse Document Frequency
10. BTO – Binary Term Occurrence
11. VSM – Vector Space Model
12. CTF – Cumulative Term Frequency
Chapter 1 – Introduction
Text Analytics, or mining textual data in order to extract hidden patterns from semi-structured text, has become vital with the popularity of the World Wide Web and the increase in the number of electronic documents and publications.

Classification of documents involves assigning class labels to documents indicating their category. Categorizing documents enables a semantic search and retrieval of documents. Meaningful classification can be achieved by using machine learning techniques along with a domain specific and concept based lexicon.

The primary objective of this research, titled 'Classification and Retrieval of Research Papers: A Semantic Hierarchical Approach', is to assign a classification label that specifies the Domain, Sub domain and the underlying concept of the paper. This is done using text mining techniques. The approach taken is hierarchical. The levels of the hierarchy (3 levels) are based on the structure of the research document. At each level of the hierarchy the mining techniques are applied to, or restricted to, only specific contents of the document. The proposed classification framework enables a fast, accurate and meaningful search and retrieval.


This approach helps to limit the scope of mining to a restricted section of each document.

Two distinct techniques for classification are proposed in this thesis. These include:

1. The use of a Domain Specific Lexicon (DSL), which is comparable to a Domain Specific Ontology.
2. The Concept Prediction based on Term Relevance (CPTR) technique.

These techniques lend a semantic dimension to classification.

This thesis comprises two main components:

1. The Framework for Hierarchical Classification.
2. The Hierarchical Interface for Document Retrieval.

The Classification Hierarchy is based on the structure of a typical research paper (Fig. 1.1). The scope of mining is restricted only to specific contents of the paper, like the Title, Keywords and the Abstract.



Fig. 1.1: Typical Format/Structure of a Research document



A look at published research papers reveals that they are semi-structured text documents. The following are the main components of a research paper:

- Heading [Title/Topic, Journal, Authors, Publication, Date of Publication]
- Keywords [A complete list of keywords which are the highlights of the paper]
- Abstract [Captures the essence of the research work]
- Content [The entire content of the document]

Fig. 1.2: Components of a Research Paper



1.1 Framework for Hierarchical Classification

The Classification Component is proposed as a framework wherein the user can configure the parameters for the classification experiment. The parameters:

1. Enable the user to specify/select the training examples or lexicon which are used to build the classification model.
2. Let the user select the path/name of the documents used to test the classifier.
3. Let the user specify the format and name of the file used to store the classification results.
4. Let the user select the classification model – either K-NN based on Binary Term Occurrence or the Concept Prediction based on Term Relevance (CPTR) model – for Level 3 classification.
5. Let the user group files (groups of 10 PDF files each) as per the restrictions imposed by the CPTR model.


The main characteristic of the hierarchical approach to classification is that the extent of mining increases as we move down the levels of the hierarchy, but the number of documents to be mined decreases as we move down the levels of the hierarchy.


1.2 The Hierarchical Interface for Document Retrieval

The Retrieval Component provides an interface through which the user can specify his search criteria. A 'drill down' search enables the user to drill down and specify his general as well as specific search criteria. The hierarchical approach for retrieval aims at narrowing down the scope of search at each level of the hierarchy. This enables time efficient retrieval and access of the goal documents.

The primary aspects of the thesis:

- A Hierarchical Framework for Classification of Research Papers – a Conceptual Framework.
- Demonstrating the efficacy of Classification based on Semantics.
- An Interface for retrieval of the documents.

The Classification Hierarchy

Fig 1.3: The Classification Hierarchy



1.3 The Hierarchical Approach for Classification – Level 1

Preliminary classification of the document is based on tagged/structured parameters like:

- Name of the Journal
- Year in which the Paper is published

Fig. 1.4: The Hierarchical Approach for Classification – Level 1

The objective of this preliminary classification is to enable the Document Retrieval module to reduce the scope of search by eliminating/filtering the records that do not satisfy the search criterion.

(To illustrate:

- In case the user knows the Title of the research paper, he can directly enter it as the search criterion and thus access the required paper.
- In case the user wants to restrict his search to research papers from a particular journal or a particular year, he can specify the same. This would significantly reduce the scope of search.)

No text mining techniques are applied at this level of classification.


1.4 The Hierarchical Approach for Classification – Level 2

The objective of Level 2 classification is to assign classification labels that specify the Domain / Sub domain of the research paper. Research papers are partially/semi-structured documents. Level 2 of classification is based on the Title of the research paper and the Keywords.

Keywords are an integral part of the documents because:

- Keywords are like the features of the document.
- Keywords reflect the essence of the document.

The basic input for Level 2 classification is a corpus of documents in the PDF format. Each document is labeled, and the label specifies the domain/sub domain of the research paper. Each document contains 1) the Title of the research paper and 2) the Keywords extracted from the research paper.

This corpus of documents, along with their associated labels, works as a training set for the classification process.




Output Generated:

Level 2 classification yields the following output:

1. A Classification Model (Classifier).
2. Predictions, i.e. the Domain and Sub domain (the class predicted for unlabelled research documents).
3. A new database table containing the Document id, the set of keywords, the Domain–Sub domain, and a classification key based on the prediction.

Fig 1.5: The Hierarchical Approach for Classification – Level 2


1.5 The Hierarchical Approach for Classification – Level 3

The objective of Classification Level 3 is to arrive at a classification label which captures the essence of the research document, or specifies the concept explored in the research paper.

(To illustrate: if it is determined at Level 2 of classification that a document pertains to Data Mining – Text Mining, Level 3 further states that the document relates to Classification Techniques in Text Mining.)

The input for Level 3 of classification is the Keywords plus the Abstract of the research paper. The text mining techniques are applied to this extracted portion only.



Issues and Resolution

The Abstract contains many stop words and terms which are not in their root form. Thus pre-processing is essential in order to:

a) Remove the stop words.
b) Apply stemming, to reduce all words to their root form (that is, performance, performing, performances etc. are all reduced to 'perform').

The TFs or TF-IDFs of the training set and the TFs or TF-IDFs of the test set are incomparable. This is because the proposed training set – the Domain Specific Lexicon (DSL) files – contains only distinct terms, while the test set, which comprises the abstracts, may contain terms that repeat. This brings about significant differences in the TFs of the training and the test set. Thus developing and applying a classification model based on the term frequency (TF) representation of the training and test set fails to give valid results.

To overcome this, the following two methods are proposed:

1) Simple Classification (uses a VSM with Binary Term Occurrence), and
2) Concept Prediction based on Term Relevance (the CPTR model)


Fig. 1.6: The Hierarchical Approach for Classification – Level 3 (Simple Classification)

Fig 1.7: The Hierarchical Approach for Classification – Level 3 (CPTR Model)


1.6 The Hierarchical Interface for Document Retrieval

The proposed Document Retrieval module provides a suitable and user friendly interface for retrieval of the documents. The objective of this module is to enable the user to perform a 'drill down' type of search. The hierarchical interface enables retrieval of the target documents by gradually restricting the scope of search at each level of the hierarchy.


Level 1 (Retrieval)

The user is provided an interface wherein he can supply the following parameters:

- Year of Publishing and/or
- Journal and/or
- Title and/or
- Author and/or
- Domain

(All these parameters are optional, and the search criteria selected by the user determine the scope/table to be searched.)

The result is: a list containing the titles of the documents that fulfill the search criteria. Suitable hyperlinks are provided to access the associated documents.




Level 2 (Retrieval)

The user is provided an interface wherein he can supply the following parameters:

- Domain/Sub domain
- A set of at most 5 keywords

(The keywords are optional; one or more keywords can be supplied.)

The result is: a list containing the titles of the documents that fulfill the search criteria. Suitable hyperlinks are provided to access the associated documents.





Level 3 (Retrieval)

The user is provided an interface wherein he can select specific search criteria.

The result is: a list containing the titles of the documents that fulfill the search criteria. Suitable hyperlinks are provided to access the associated documents.








Fig. 1.8: Hierarchical Retrieval
(All Documents → Documents pertaining to a Domain/Sub domain → Documents containing specific Concepts)

The Highlights of the Hierarchical Approach for Retrieval

- The approach followed is quite different from search engines like Google. Google Search is mainly based on keyword search; the result produced by Google is a whole set of documents that contain the keyword/phrase. The hierarchical approach, on the other hand, helps the user to gradually drill down and thus narrow the scope of search/search space.
- It is based on classification of documents and thus ensures a time efficient and target oriented search.
- The use of a Lexicon (containing domain specific terms) and the CPTR technique add a semantic dimension to classification, thus enabling a semantic search for documents.
Chapter 2 – Literature Review
2.1 Data Mining Concepts

Data mining is the process of extracting patterns from data.

According to Jiawei Han and Micheline Kamber [1], data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.


[2] Methods of data mining help firms to find relevant data, identify patterns in data, and organize, group and reduce data.

Data mining is the automated extraction of previously unknown, interesting and potentially useful knowledge from data. Data mining techniques are applied to discover new trends and patterns of behaviour that are hidden. These can be used for prediction in a variety of applications.

Data mining techniques include techniques for:

- Association analysis
- Classification
- Prediction
- Cluster analysis
- Evolution and deviation analysis

Association Analysis:

[1] Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association rule mining finds interesting association or correlation relationships among a large set of data items.

The discovery of interesting association relationships among huge amounts of business transaction records can help catalogue design, cross-marketing, loss-leader analysis, and other business decision making processes.

Association analysis is widely used for market basket or transaction data analysis. It enables the discovery of interesting relationships between data items. These relationships are represented as association rules of the form:

Buys(x, Computer) → Buys(x, Anti Virus)

Classification:

[3] Classification is a data mining (machine learning) technique used to predict group membership for data instances.

[1] Classification is used to extract models describing important data classes or to predict future data trends. Classification predicts categorical labels.
Data classification is a two step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the context of classification, data tuples are also referred to as samples, examples, or objects. The data tuples analyzed to build the model collectively form the training data set. The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population.

Since the class label of each training sample is provided, this step is also known as supervised learning. The learned model is represented in the form of classification rules, decision trees, or mathematical formulae.

In the second step, the model generated is used for classification.














Fig. 2.1: Classification – Adapted from [4]


The popular classification techniques include:

1. Decision Tree Induction
2. Bayesian Classification
3. Classification by Backpropagation
4. Genetic Algorithms for Classification
5. K-Nearest Neighbor Classifier



Prediction:

[1] Classification predicts categorical labels (or discrete values), while prediction models continuous-valued functions. For example, a classification model may be built to categorize bank loan applications as either safe or risky, while a prediction model may be built to predict the amount of loan that can be safely disbursed. Thus a prediction model predicts continuous values (using regression techniques).

Classification and prediction have numerous applications, including credit approval, medical diagnosis, performance prediction, and selective marketing.


Cluster Analysis:

[1] The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. Objects within one cluster are similar to each other (homogeneous); objects belonging to one cluster differ from the objects belonging to another cluster.

Cluster analysis is used in many applications like pattern recognition, data analysis, image processing and marketing research. As a data mining function it can be used as a standalone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, like classification and characterization, operating on the detected clusters.

Cluster analysis is also referred to as unsupervised learning because, unlike classification, it does not rely on labeled training examples. Thus it can be described as learning by observation rather than learning by example.

The major clustering methods include the following categories:

1. Partitioning methods (K-means and K-medoids)
2. Hierarchical methods (Agglomerative and Divisive)
3. Density based methods (DBSCAN)
4. Grid based methods (STING and CLIQUE)
5. Model based methods

Evolution and Deviation Analysis:

Outlier Analysis: an outlier is a data object that does not comply with the general behavior of the data.

[1] A set of data objects that are grossly different from, or inconsistent with, the remaining set of data are called outliers of the data set.

Many data mining algorithms try to eliminate outliers or minimize their influence. But in many cases, the outliers themselves might be of interest to the user. Outlier mining has wide applications. It can be used in fraud detection for finding unusual usage of credit cards or telecommunication services, in customized marketing for finding the spending behavior of extremely rich or poor people, or in medical analysis for finding unusual responses to certain medicines or treatments.

The computer-based outlier detection methods can be categorized into three approaches: 1) the statistical approach, 2) the distance-based approach, and 3) the deviation based analysis approach.

Clustering algorithms can be modified to produce outlier detection as a byproduct of their execution.


2.2 Text Mining

[2] Text mining is the automated or partially automated processing of text. Classifying text documents, analyzing syntax, identifying relationships among documents, understanding a question expressed in natural language, extracting meaning from a message, and summarizing all involve the application of non-trivial tasks on textual data.

[5] The purpose of text mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and thus make the information contained in the text accessible to the various data mining algorithms.

Information can be extracted to derive summaries for the words contained in the documents, or to compute summaries for the documents based on the words contained in them. Hence, one can analyze words or clusters of words used in documents, or one can analyze documents and determine similarities between them or how they are related to other variables of interest in the data mining project. In the most general terms, text mining "turns text into numbers" (meaningful indices), which can then be incorporated in other analyses such as predictive data mining projects.

Fig. 2.2: Text Mining Process – adapted from [6]


Mining textual data involves the following challenges:

- Information is unstructured/semi-structured.
- Large textual databases.
- A very high number of possible "dimensions" (but sparse): all possible word and phrase types in the language.
- Word ambiguity and context sensitivity, e.g. Apple (the company) vs. apple (the fruit).
- Noisy data, e.g. spelling mistakes.
- Text databases are, in general, semi-structured.

Example: a document may carry structured attribute/value pairs such as Title, Author, Publication_Date, Length and Category, while the Abstract and Content remain unstructured.

2.3 Handling Text Data

Handling text data involves two aspects:

1. Information Retrieval (IR): locating relevant documents (e.g., given a set of keywords) in a corpus of documents. Typical IR systems: online library catalogs, online document management systems.
2. Text mining:
   - Classify documents
   - Cluster documents
   - Find patterns or trends across documents

1. Information Retrieval (IR):

This involves the retrieval of documents based on the supplied keywords or other search criteria. Since retrieval involves searching semi-structured data, the documents retrieved may not all be relevant. Thus accuracy and precision determine the efficiency of information retrieval systems.



Fig 2.3: Efficiency of Information Retrieval Systems – Adapted from [1]


Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
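To illustrate with hypothetical numbers: if a query retrieves 10 documents, of which 6 are relevant, and the corpus contains 20 relevant documents in total, then precision = 6/10 = 0.6, while recall = 6/20 = 0.3.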

2.4 Document Preprocessing

Text documents require preprocessing before mining techniques can be applied. Preprocessing techniques include:

1. Stop Word Removal: many words are not informative and are thus irrelevant for document representation, e.g. the, and, a, an, is, of, that, etc.

2. Stemming: reducing words to their root forms. A document may contain several occurrences of words like fish, fishes, fisher, and fishers (which would not be retrieved by a query with the keyword "fishing"). Different words that share the same word stem should be represented by the stem, instead of the actual word, i.e. fish.

[2] The terms identified in the document act as variables in text mining.


2.5 Representing Documents – The Vector Space Model (VSM)

Documents have to be suitably represented before text mining techniques can be applied to them. Terms from the documents serve as features and can be represented with the help of a Term Document Matrix.


Table 2.1: Term Document Matrix

             t1   t2   t3   t4
Document 1    3    2    0    1
Document 2    0    4    0    3



Table 2.2: Term Document Matrix using Relative Frequency

             t1     t2      t3   t4
Document 1   0.03   0.02    0    0.01
Document 2   0      0.004   0    0.003

Relative Frequency = number of occurrences of the term / number of words in the document
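To illustrate: if Document 1 contains 100 words (an assumed length, for illustration) and the term t1 occurs 3 times in it (its raw frequency in Table 2.1), its relative frequency is 3/100 = 0.03, the value shown in Table 2.2.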


Normalized Term Frequency (TF):

TF(d, t) = 0                                if freq(d, t) = 0
TF(d, t) = 1 + log(1 + log(freq(d, t)))     otherwise

The term frequency matrix can contain the normalized TF instead of the simple term frequency or the relative term frequency.

Inverse Document Frequency (IDF):

IDF represents the importance of a term. If a term t appears in many documents, its importance is scaled down because of its reduced discriminative power.

idf(t) = log( (1 + |d|) / |dt| )

where |d| is the number of documents and |dt| is the number of documents containing the term t. If |dt| << |d|, the term t has a large IDF scaling factor, and vice versa.

TF-IDF:

TF-IDF(d, t) = TF(d, t) * idf(t)

Thus a document can be represented by a Term Document Matrix using TF, IDF or TF-IDF.
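To make these weighting schemes concrete, here is a minimal sketch in Java (the language used for the thesis's own tooling) computing the normalized TF and TF-IDF defined above; the toy corpus reuses the raw frequencies of Table 2.1, and all class and variable names are illustrative.

    import java.util.*;

    public class TfIdfDemo {
        // Normalized TF as defined above: 0 if the term is absent,
        // 1 + log(1 + log(freq)) otherwise.
        static double tf(int freq) {
            return freq == 0 ? 0.0 : 1.0 + Math.log(1.0 + Math.log(freq));
        }

        // idf(t) = log((1 + |d|) / |dt|), |d| = corpus size, |dt| = docs containing t.
        static double idf(int numDocs, int docsWithTerm) {
            return Math.log((1.0 + numDocs) / docsWithTerm);
        }

        public static void main(String[] args) {
            // Raw term frequencies per document, as in Table 2.1.
            List<Map<String, Integer>> docs = List.of(
                Map.of("t1", 3, "t2", 2, "t4", 1),
                Map.of("t2", 4, "t4", 3));

            // Count |dt| for every term.
            Map<String, Integer> docFreq = new HashMap<>();
            for (Map<String, Integer> d : docs)
                for (String t : d.keySet()) docFreq.merge(t, 1, Integer::sum);

            // TF-IDF(d, t) = TF(d, t) * idf(t)
            for (int i = 0; i < docs.size(); i++)
                for (Map.Entry<String, Integer> e : docs.get(i).entrySet())
                    System.out.printf("doc %d, %s: tf-idf = %.4f%n", i + 1, e.getKey(),
                            tf(e.getValue()) * idf(docs.size(), docFreq.get(e.getKey())));
        }
    }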

Note: [2] Typical text mining applications have many more terms than documents, resulting in a sparse term frequency matrix. To obtain meaningful results for text mining applications, analysts examine the distribution of terms across the document collection. Very low frequency terms may be dropped from the term frequency matrix, reducing its dimensionality. After the term frequency matrix is defined, text mining turns into data mining.


2.6 Document Retrieval

Text Indexing Techniques:

1. Inverted Index

An inverted index maintains two hashed or B+ tree indexed tables:

a) a Document table
b) a Term table

Table 2.3: Document Table

Document Id   Posting list
D1            t1, t3, t4, t5, …
D2            t2, t3, t4, t5, …
Table 2.4: Term Table

Term Id   Posting list
T1        d1, d2, d3, d4, …
T2        d2, d3, d4, …
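A minimal Java sketch of the term table above: the posting lists map each term to the ids of the documents containing it, so a keyword lookup becomes a single map access. The toy documents are hypothetical.

    import java.util.*;

    public class InvertedIndexDemo {
        public static void main(String[] args) {
            // Hypothetical tokenized documents (after stop-word removal and stemming).
            List<List<String>> docs = List.of(
                List.of("text", "mine", "classif"),
                List.of("classif", "cluster", "mine"));

            // Term table: term -> posting list of document ids containing it.
            Map<String, SortedSet<Integer>> termTable = new TreeMap<>();
            for (int d = 0; d < docs.size(); d++)
                for (String term : docs.get(d))
                    termTable.computeIfAbsent(term, k -> new TreeSet<>()).add(d + 1);

            // Answering "which documents contain 'mine'?" is a single lookup.
            System.out.println(termTable);             // full term table
            System.out.println(termTable.get("mine")); // posting list: [1, 2]
        }
    }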

2. Signature File

Table 2.5: Signature File

Document Id   Signature
D1            1110000000111
D2            1100000001111

(Multiple words can be mapped to one bit.)

Drawbacks:

- The synonymy problem
- The polysemy problem




2.7 Dimensionality Reduction

Text documents contain thousands of terms. When the document corpus is represented in the form of a term frequency matrix, the matrix is sparse and suffers from the curse of dimensionality.

The popular dimensionality reduction techniques include:

1. Latent Semantic Indexing (LSI)

The term matrix can be decomposed into 3 matrices using the SVD technique. The 3 matrices are U, S and V^T (V^T is the transpose of V).

[7] This can be illustrated with a worked example in which a term-document matrix A is factored into U, S and V^T; after dimensionality reduction (k = 2), only the first two singular values and the corresponding rows and columns are retained.





Equation 1: A = U S V^T

where:

- U is a matrix whose columns are the eigenvectors of the A A^T matrix. These are termed the left eigenvectors.
- S is a matrix whose diagonal elements are the singular values of A. This is a diagonal matrix, so its non-diagonal elements are zero by definition.
- V is a matrix whose columns are the eigenvectors of the A^T A matrix. These are termed the right eigenvectors.
- V^T is the transpose of V.

When computing the SVD of a matrix it is desirable to reduce its dimensions by keeping its first k singular values. Since these are ordered in decreasing order along the diagonal of S, and this ordering is preserved when constructing U and V^T, keeping the first k singular values is equivalent to keeping the first k rows of S and V^T and the first k columns of U. Equation 1 reduces to:

Equation 2: A* = U* S* V^T*   (* indicates reduced dimension)
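To illustrate the rank-k truncation of Equation 2, the following sketch uses the SingularValueDecomposition class of the Apache Commons Math library (an assumed dependency; any linear algebra library would serve); the 4x3 matrix is hypothetical.

    import org.apache.commons.math3.linear.*;

    public class LsiDemo {
        public static void main(String[] args) {
            // Hypothetical 4x3 term-document matrix A (terms x documents).
            RealMatrix a = new Array2DRowRealMatrix(new double[][] {
                {1, 0, 1}, {1, 1, 0}, {0, 1, 1}, {0, 0, 1}});

            // A = U S V^T
            SingularValueDecomposition svd = new SingularValueDecomposition(a);
            int k = 2; // keep the first k singular values

            // Keep the first k columns of U, the k x k corner of S,
            // and the first k rows of V^T (Equation 2).
            RealMatrix uk = svd.getU().getSubMatrix(0, a.getRowDimension() - 1, 0, k - 1);
            RealMatrix sk = svd.getS().getSubMatrix(0, k - 1, 0, k - 1);
            RealMatrix vtk = svd.getVT().getSubMatrix(0, k - 1, 0, a.getColumnDimension() - 1);

            // A* = U* S* V^T*, the best rank-k approximation of A.
            RealMatrix approx = uk.multiply(sk).multiply(vtk);
            System.out.println(approx);
        }
    }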


2.8 Probabilistic Latent Semantic Analysis (PLSA)

[19] Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. Compared to standard Latent Semantic Analysis, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, this method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics.


The PLSA Algorithm:

Inputs: a term-by-document matrix X(t, d), t = 1..T, d = 1..N, and the number K of topics sought.

Output: arrays P1 and P2, which hold the estimated parameters P(t | k) and P(k | d) respectively.

Initialise the arrays P1 and P2 randomly with numbers between [0, 1] and normalise them to sum to 1 along rows. Then iterate until convergence, for d = 1..N, t = 1..T, k = 1..K, applying the EM updates:

E-step:
P(k | t, d) = P(t | k) P(k | d) / Σk' P(t | k') P(k' | d)

M-step:
P(t | k) = Σd X(t, d) P(k | t, d) / Σt' Σd X(t', d) P(k | t', d)
P(k | d) = Σt X(t, d) P(k | t, d) / Σt X(t, d)
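A compact Java sketch of the EM iteration above, under the same notation (p1 holds P(t | k), p2 holds P(k | d)); the toy matrix, topic count and iteration count are illustrative assumptions.

    import java.util.Random;

    public class PlsaDemo {
        // One EM pass over X[t][d]: p1[k][t] ~ P(t|k), p2[d][k] ~ P(k|d).
        static void emStep(double[][] x, double[][] p1, double[][] p2) {
            int T = x.length, N = x[0].length, K = p1.length;
            double[][] newP1 = new double[K][T];
            double[][] newP2 = new double[N][K];
            for (int d = 0; d < N; d++) {
                for (int t = 0; t < T; t++) {
                    if (x[t][d] == 0) continue;
                    // E-step: P(k|t,d) proportional to P(t|k) * P(k|d).
                    double[] post = new double[K];
                    double norm = 0;
                    for (int k = 0; k < K; k++) norm += post[k] = p1[k][t] * p2[d][k];
                    // M-step accumulation, weighted by the count X(t,d).
                    for (int k = 0; k < K; k++) {
                        double w = x[t][d] * post[k] / norm;
                        newP1[k][t] += w;
                        newP2[d][k] += w;
                    }
                }
            }
            normalizeRows(newP1); normalizeRows(newP2);
            copy(newP1, p1); copy(newP2, p2);
        }

        static void normalizeRows(double[][] m) {
            for (double[] row : m) {
                double s = 0;
                for (double v : row) s += v;
                if (s > 0) for (int j = 0; j < row.length; j++) row[j] /= s;
            }
        }

        static void copy(double[][] src, double[][] dst) {
            for (int i = 0; i < src.length; i++)
                System.arraycopy(src[i], 0, dst[i], 0, src[i].length);
        }

        public static void main(String[] args) {
            double[][] x = { {2, 0, 1}, {1, 1, 0}, {0, 3, 1} }; // hypothetical X(t,d)
            int K = 2, T = x.length, N = x[0].length;
            Random r = new Random(42);
            double[][] p1 = new double[K][T], p2 = new double[N][K];
            for (double[] row : p1) for (int j = 0; j < row.length; j++) row[j] = r.nextDouble();
            for (double[] row : p2) for (int j = 0; j < row.length; j++) row[j] = r.nextDouble();
            normalizeRows(p1); normalizeRows(p2);
            for (int it = 0; it < 50; it++) emStep(x, p1, p2);
            System.out.println(java.util.Arrays.deepToString(p2)); // P(k|d) per document
        }
    }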
2.9 The Text Classifiers

[2] Text categorization and methods that assess document affinities or similarities are useful preliminaries to information retrieval. The documents of the corpus may be parsed, categorized or classified in order to aid speedy retrieval. Document/text classification plays an important role in several applications, especially for organizing, classifying, searching and concisely representing large volumes of information.

The popular classifiers used for text/document classification are:

1. K-NN (K Nearest Neighbor Classifier)
2. Naïve Bayes
3. Support Vector Machine

This thesis has applied the K-NN classifier for the classification of the research documents.

















The K-NN (K Nearest Neighbor) Classifier

The K-NN classifier operates on the premise that classification of unknown instances can be done by relating the unknown to the known according to some distance/similarity function.

[11] The Nearest Neighbour Classifier: classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query.

[9] The k-Nearest Neighbour algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. k-NN is lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbour algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbours, with the object being assigned to the class most common amongst its k nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its nearest neighbour.

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.

Usually the Euclidean distance is used as the distance metric. In cases such as text classification, another metric such as the overlap metric (or Hamming distance) can be used.

The Euclidean distance between points p and q is:

d(p, q) = sqrt( Σi (pi − qi)² )
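A minimal Java sketch of the k-NN rule described above; the training vectors, labels and tie-breaking behaviour are illustrative assumptions, not the thesis's implementation (the thesis uses RapidMiner's K-NN operator).

    import java.util.*;

    public class KnnDemo {
        record Example(double[] features, String label) {}

        static double euclidean(double[] p, double[] q) {
            double sum = 0;
            for (int i = 0; i < p.length; i++) sum += (p[i] - q[i]) * (p[i] - q[i]);
            return Math.sqrt(sum);
        }

        // Majority vote among the k training examples nearest to the query.
        static String classify(List<Example> train, double[] query, int k) {
            List<Example> sorted = new ArrayList<>(train);
            sorted.sort(Comparator.comparingDouble(e -> euclidean(e.features(), query)));
            Map<String, Integer> votes = new HashMap<>();
            for (Example e : sorted.subList(0, k)) votes.merge(e.label(), 1, Integer::sum);
            return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
        }

        public static void main(String[] args) {
            // Hypothetical binary-term-occurrence vectors with domain labels.
            List<Example> train = List.of(
                new Example(new double[] {1, 1, 0, 0}, "Text Mining"),
                new Example(new double[] {1, 0, 1, 0}, "Text Mining"),
                new Example(new double[] {0, 0, 1, 1}, "Networking"));
            System.out.println(classify(train, new double[] {1, 1, 1, 0}, 3));
        }
    }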




The K-NN Classifier – Pros

1. The K-NN is a simple classifier and often comes for free.
2. This classification technique is of particular importance today because issues of poor run-time performance are not such a problem these days, given the computational power that is available.
3. With a properly tuned K, K-NN classifiers are comparable in accuracy to the best known classifiers.


The K-NN Classifier – Cons

1. [10] Classifying a test document involves as many inverted index lookups as the number of distinct terms in document dq (the query document), followed by scoring the candidate documents that overlap with dq in at least one word, sorting by overall similarity, and picking the best K documents.
2. Another problem with NN classifiers is the space overhead and redundancy in storing the training information. Since this classifier is lazy, it does not distill this data into a simpler class model like other classifiers do. Thus k-NN can have poor run-time performance if the training set is large.
3. [11] k-NN is very sensitive to irrelevant or redundant features, because all features contribute to the similarity and thus to the classification. This can be ameliorated by careful feature selection or feature weighting.



2.10 Related Works / Background

Several research papers have been published in recent years in the area of text/document classification. These papers provide guidelines for improving the efficiency of classification. Most of the recent work in this area suggests the use of a domain specific ontology to achieve meaningful classification.

Selection of features to create the vector space improves the scalability, effectiveness and accuracy of a text classifier. A good feature selection method should consider the domain [12].

The statistical techniques are not sufficient for text mining. Better classification is performed when the semantics are considered [13].
During the text mining process, ontology can be used to provide expert background knowledge about a domain. Some recent research shows the importance of domain ontology in the text classification process. This includes:

[14] A semantic based feature vector, instead of TF-IDF, for giving a semantic dimension to classification. This paper points out the problems associated with the vector space model (VSM) and suggests an 'enhanced term weighting model based on domain knowledge from ontology' to improve the classification process. It uses ontology as background knowledge for extracting concepts from the terms to improve the vector space model.

As per this paper, 'Techniques such as TF-IDF vectorize the data easily; however, dimensionality becomes an issue, since each vectorized word is used to represent the document. This leads to the number of dimensions being equal to the number of words, and each single word in the text is considered as a term. For linear classifiers based on the principle of Empirical Risk Minimization, capacity is equal to data dimensionality. For these learning machines, high dimensionality leads to "over fitting" and the hypothesis becomes too complicated to implement computationally.'

This paper proposes a weighting vector in order to take the important concept features and extract the associated terms which carry hidden semantic information.

Methods for improving the efficacy of classification include the use of a thesaurus based on Wikipedia [15]. 'Text classification has been widely used to assist users with the discovery of useful information from the Internet. The traditional classification methods based on the "Bag of Words" (BOW) representation only account for term frequency in the documents, and ignore important semantic relationships between key terms.' This paper suggests the automatic construction of a thesaurus of concepts from Wikipedia. It introduces a unified framework to expand the BOW representation with semantic relations (synonymy, hyponymy, and associative relations), and demonstrates its efficacy in enhancing previous approaches for text classification. It demonstrates that this approach achieves significant improvements with respect to the baseline algorithm.


[16] 'A Text Categorization Method Based on Local Document Frequency' discusses a different approach for categorizing documents. It uses a weight to depict the importance of each term for the categorization task. By doing so, important terms have more effect when making the classification decision.

[20] suggests a technique based on PLSA which is effective in assigning class labels to unlabelled data and works efficiently even when only a small amount of labelled positive data is available. This work shows how PLSA can be extended and how, by injecting a small amount of supervision, data can be labelled even with a small amount of labelled positive data.

The paper 'Text Mining Infrastructure in R' [17] provides an insight into the development of a framework for mining text. This paper discusses the procedures involved in pre-processing text documents so that the text mining algorithms for classification, clustering etc. can be effectively applied. It elaborates, on a conceptual level, the important ideas and tasks a text mining framework should be able to deal with.

[24] demonstrates a suitable framework for the mining of text. It shows how text mining techniques can be applied to improve the effectiveness of learning networks. The architecture and modules provide a valuable guideline for the design and structure of a framework.


2.11 RapidMiner-5 – About RapidMiner

- RapidMiner is an open source solution for data mining, analytical ETL, and predictive reporting.
- It is a freely available open-source data mining and analysis system.
- It can run on every major platform and operating system.
- It has a most intuitive process design.
- A multi-layered data view concept ensures efficient data handling.
- GUI mode, server mode (command line), or access via the Java API.
- Powerful high-dimensional plotting facilities are available.
- Most comprehensive solution available: more than 500 operators for data integration and transformation, data mining, evaluation, and visualization.
- A standardized XML interchange format for processes.
- Graphical process design for standard tasks, and a scripting language for arbitrary operations.
- The machine learning library WEKA is fully integrated.
- Access to data sources like Excel, Access, Oracle, IBM DB2, Microsoft SQL, Sybase, Ingres, mySQL, Postgres, SPSS, dBase, text files and more.
- The most comprehensive data mining solution with respect to data integration, transformation, and modelling methods.
- RapidMiner combines data mining power with an unbeatable ease of use.
- A powerful but intuitive graphical user interface for the design of analysis processes.
- Repositories for process, data and meta data handling.
- The only solution with meta data transformation: forget trial and error and inspect results already during design time.
- The only solution which supports on-the-fly error recognition and quick fixes.
- RapidMiner provides more than 500 operators for all main machine learning procedures.
- It is written in the Java programming language and therefore can work on all popular operating systems.
- It also integrates the learning schemes and attribute evaluators of the Weka learning environment.
- RapidMiner has a Text Processing module which provides innumerable operators for tokenizing and preprocessing text documents of varied formats.

[18] RapidMiner is the world-leading open-source system for data and text mining. It is available as a stand-alone application for data analysis, within the powerful enterprise server setup Rapid Analytics, and finally as a data mining engine which can be integrated into one's own products.


Fig. 2.4: RapidMiner Environment – Courtesy [18]



































Chapter 3 – Methodology

Classification of documents involves assigning class labels to documents indicating their category. Categorizing documents enables a semantic search and retrieval of documents. Meaningful classification can be achieved by using machine learning techniques along with a domain specific and concept based lexicon.


This thesis proposes a classification framework for assigning classification labels to research papers so as to enable a fast, accurate and meaningful search and retrieval. Each research paper is assigned a classification label that specifies the Domain, Sub domain and the underlying concept of the paper. This is done using text mining techniques. Classification is done based on the Title, Keywords and the Abstract of the research paper.

The approach taken is hierarchical. The levels of the hierarchy (3 levels) are based on the structure of the research document. At each level of the hierarchy the mining techniques are applied to, or restricted to, only specific portions of the document.

The Hierarchical Approach:

The hierarchical approach is proposed for this classification. Efficient and meaningful classification can be achieved by extracting portions of the document and restricting the application of text mining techniques only to such limited portions.

This approach has the following advantages:

1. The scope of mining is kept limited at each level of the hierarchy.
2. The classification is based on the most significant and relevant portions of the document.
3. The entire content of the research paper does not contribute significantly to the classification; hence restricting the application of mining techniques to the abstract increases the efficiency of classification.

3.1 Module 1 – The Framework for Hierarchical Classification

The entire task of classification is proposed as a framework. The proposed framework provides flexibility to the user and enables him to configure the parameters for the experiment. The user can specify and set up parameters relating to:

1. Construction of the Domain Specific Lexicon (DSL).
2. Selection of the DSL.
3. Building a corpus of DSL files.
4. Building a corpus of research documents used for testing the model.
5. Selecting the format of the file/database for storing the results.
6. Selecting the pathname/filename of the output files/database tables.
7. Building the Concept Matrix for the CPTR model.
8. Building a corpus of documents for testing the CPTR model.


3.2 The Hierarchical Classification – Level 1

This is the first step in the classification framework. The objective of this classification is to derive a classification key which is comprised of:

a) The Year (in which the paper is published)
b) The Journal (in which the paper is published)

This classification does not involve the usage of text mining techniques and is based on tagged parameters captured from the documents or accepted from the user.

The algorithm for Level 1 (Preliminary) Classification:

Algorithm: Capture/accept tagged data from the research document and store it in a database table along with the generated classification key.

Terms:
C  = Corpus of research documents
di = a research document, where di ∈ C, i.e. C = {d1, d2, d3, …, dn}
ti = tagged data from di

Input: ti = tagged data from each di
Output: a database record containing the tagged data plus the generated classification key, for each di ∈ C

Steps:
For each di ∈ C
    Accept tagged data (di, ti);
    Generate Classification Key = Year of publication + Journal id;
    Generate a unique primary key;
    Insert a new database record;
end;

Fig. 3.1: Classification Level 1 Process
(Input: tagged/structured parameters from the corpus → generate the preliminary classification key → update the database with the structured data + class key)


The objective of Level 1 (Preliminary) Classification:

The above algorithm performs a preliminary classification of documents by assigning a classification key. This aids the Document Retrieval process/module: it enables the retrieval module to filter out the document records that do not meet the criteria regarding the publication year and journal.


3.3 Classification (Level 1) Experiment

The classification algorithm is coded as a Java program. The program provides a suitable interface for accepting tagged data. It uses the JDBC API to connect to, and store the accepted data in, the required database (the database can be user specified).

After accepting the parameters (Fig. 3.2), the program generates a classification key which is a concatenation of the year of publishing and the journal. A unique primary key is generated for each record, and the record is then stored in the database along with its classification key. The interface provided for capturing the data and the results of this classification are displayed in the figures that follow.
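The thesis's Level 1 program is written in Java with JDBC, as stated above; the following is a minimal illustrative sketch rather than the thesis's actual code. The JDBC URL and credentials are placeholders, and only a subset of the journal_data columns listed later in this section is used.

    import java.sql.*;

    public class Level1Classifier {
        // Store one tagged record along with its generated classification key.
        public static void insertRecord(Connection con, String journal, int year,
                                        String title, String filename) throws SQLException {
            // Classification key = year of publishing concatenated with the journal id.
            String classKey = year + "_" + journal;
            String sql = "INSERT INTO journal_data (journal, year, title, classkey, filename) "
                       + "VALUES (?, ?, ?, ?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, journal);
                ps.setInt(2, year);
                ps.setString(3, title);
                ps.setString(4, classKey);
                ps.setString(5, filename);
                ps.executeUpdate();
            }
        }

        public static void main(String[] args) throws SQLException {
            // Placeholder connection details for illustration only.
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/thesis", "user", "password")) {
                insertRecord(con, "IJCS", 2009, "A Sample Paper", "papers/sample.pdf");
            }
        }
    }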


Fig. 3.2: Capturing tagged data (Classification Level 1)

Fig. 3.3: Database table (journal_data) – inserted records (Classification Level 1)


Structure of the database table (journal_data):

Attribute   Type      Comments
id          Numeric   Primary key – generated automatically
Journal     Text
Year        Numeric
Issue       Text
Author1     Text
Author2     Text
Author3     Text
Author4     Text
Title       Text
Remarks     Text
Classkey    Text      Generated classification key
Filename    Text      Path/name of the file containing the research paper



3.4 The Hierarchical Classification – Level 2

The proposed Hierarchical Classification Level 2 of research documents uses text mining techniques to assign labels to research papers indicating their domain/sub domain, or area of research. The classification labels are predicted on the basis of the title and the keywords found in the research papers. The titles and keywords from the research documents are extracted and stored in the form of PDF files. Labels are predicted for this corpus of PDF files.

Assigning/predicting class labels is equivalent to mapping a research document to its predicted class label.


Domain Specific Lexicon – The Training Examples

For correct and meaningful document-to-domain mapping, a Domain Specific Lexicon (DSL) is proposed.

A DSL is a text file which contains terminologies pertaining to the domain. The terms in a DSL file are specific to the domain and are delimited by ',' (commas). There is a distinct DSL file associated with each domain/sub domain. Each DSL file is carefully planned and contains all the typical terms that can be associated with a domain or area of research.

(To illustrate: the DSL file constructed for the classification of research papers pertaining to Data Mining – Text Mining contains the following terms:

Text Mining, Text classification, Text Clustering, Document classification, Natural Language Processing, Text analysis, Stop words, stemming, Tokenize, LSI, Latent Semantic Index, SVD, Singular Value Decomposition, PLSI, Information Retrieval, Unstructured data, semi-structured, sentence-based, document-based, corpus-based, concept analysis, conceptual term frequency, concept-based similarity, information extraction, topic tracking, Summarization, clustering, VSM, Vector Space Model, question answering, document visualization, content extraction, document collection, Association rules)


The entire set of DSL files is collectively referred to as the DSL Corpus, and serves as the training set for the classifier.


The advantages and rationale behind the use of the DSL approach:

1. A DSL file contains all the possible terms associated with a domain or area of research.
2. The DSL file can be used to train the classifier by associating the terms (in the DSL file) with the classification label.
3. This approach provides flexibility:
   a. A DSL file is a simple text file which can be built, modified and extended using any text editor.
   b. Extending the classification to more areas/domains only involves the construction of additional DSL files.
4. The number of DSL files/training documents required to train the classifiers is limited to the number of domains/sub domains.

Level 2 classification involves 3 major processes:

a) The Training Process
b) The Testing/Application Process
c) The Process for Storing Predictions (Results) and Generating the Key

a) The Training Process:

The steps involved can be stated as under:

1. Identify domain specific terms and construct the DSL file(s).
2. Associate a classification label with each DSL file.
3. Build a corpus of DSL files: C = {dsl1, dsl2, dsl3, …, dsln}.
4. Preprocess the corpus C:
   a) Apply a stemming algorithm to reduce all words to their root form.
   b) Generate a VSM or a Term Document matrix using Binary Term Occurrence, D(i, j) (where i is document i and j is the jth term of document i).
   (TF and TF-IDF are not used in the matrix because only the occurrence of a term in the DSL file is relevant for classification; the distinguishing power or rarity of the term is irrelevant in this approach. A minimal sketch of this step follows the list.)
5. Train the K-NN classifier using C as the training examples.

53


b) Testing/Application of the Classifier:

The Title of the Research paper and the keywords contained in it are extracted and stored in a separate file each. This file is in the PDF format and is referred to as the Keyword PDF. A corpus of such Keyword PDF files makes up the test set.

The generated Classification model (Classifier) is tested using the test set, i.e. a classification label (the Domain/Sub-domain) is predicted for each of the Keyword PDF file(s).

The steps involved in Testing the Classifier can be stated as under (a minimal sketch of the nearest-neighbor assignment in step 4 follows this list):

1. Extract the Title and Keywords from each Research document and store them in a separate PDF file - the Keyword PDF, i.e.
   keyword pdf1 = title1 + keywords1

2. Build a corpus of such Keyword PDF(s), i.e.
   C = {keyword pdf1, keyword pdf2 ... keyword pdfn}

3. Preprocess C. For each Keyword PDF in C:
   a) Remove stop words
   b) Apply a Stemming algorithm
   c) Generate a VSM or a TD matrix D(i, j), where i is the Keyword PDF i and j is the jth term of Keyword PDF i.

4. Apply the classification model to generate a class label for each Keyword PDF.

5. Store the above results of the Classification test in a temporary Database table (keyword_temp).
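Step 4 is where the trained model is applied. As a hedged illustration of the k = 1 case only (the thesis uses RapidMiner's K-NN operator; all names and the toy data below are assumptions), a test document can be represented by its stemmed, stop-word-filtered term set and assigned the label of the most similar DSL term set:

// Minimal 1-NN sketch over binary term sets (illustrative only; the thesis
// experiment uses RapidMiner's K-NN operator rather than hand-written code).
import java.util.*;

public class NearestDslLabel {

    // Cosine similarity between two binary term sets: |A ∩ B| / sqrt(|A|*|B|).
    static double similarity(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return inter.size() / Math.sqrt((double) a.size() * b.size());
    }

    // Returns the label of the DSL term set most similar to the test document.
    static String predict(Set<String> testTerms, Map<String, Set<String>> dslByLabel) {
        String best = null;
        double bestSim = -1;
        for (Map.Entry<String, Set<String>> e : dslByLabel.entrySet()) {
            double sim = similarity(testTerms, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy DSL corpus: label -> stemmed domain-specific terms (assumed data).
        Map<String, Set<String>> dslByLabel = Map.of(
            "Text Mining", Set.of("stemming", "tokenize", "lsi", "svd", "clustering"),
            "Networking",  Set.of("router", "tcp", "packet", "latency"));
        // Stemmed, stop-word-filtered terms from one Keyword PDF (assumed input).
        Set<String> test = Set.of("clustering", "svd", "document");
        System.out.println(predict(test, dslByLabel)); // -> Text Mining
    }
}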



















c) Process for Storing Predictions (Results) and Generating the Key:

The results generated by process b are captured in a temporary Database table. This temporary data is further processed and a classification key is generated based on the Classification label predicted in process b. The final database table is then created with the updated data.

A Java program using the JDBC API is proposed to access the temporary Database table.





The algorithm of this process can be stated as under:

Algorithm: Reading the temporary table, generating the classification key and storing the results in the final DB table.

Input: The temporary DB table keyword_temp (generated by the Classification experiment in RapidMiner)

Output: The final Keyword table

Steps:

For each record in the temporary DB table:
    Read the keywords from the associated Keyword PDF;
    Read the predicted class label from the temporary DB table keyword_temp;
    Generate the class key using the predicted class label;
    Insert a new DB record in the final Keyword_table;
end;
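Since this step is proposed as a Java program over the JDBC API, the following hedged sketch shows one possible shape for it. The connection URL, the column names (predicted_label, filename, keyword1, keyword2), the key-generation rule and readKeywordsFromPdf are all assumptions for illustration; only two of the twelve keyword columns are populated here for brevity.

// Hedged sketch of the proposed JDBC program. Connection URL, column names
// and the key-generation rule are assumptions, not the actual schema;
// readKeywordsFromPdf is a stub for the PDF-extraction step.
import java.sql.*;

public class KeywordTableBuilder {

    // Stub: the real program would parse the associated Keyword PDF here.
    static String[] readKeywordsFromPdf(String pdfPath) {
        return new String[] {"clustering", "svd"};
    }

    // Assumed rule mapping a predicted class label to a classification key.
    static String generateClassKey(String predictedLabel) {
        return predictedLabel.trim().replace(' ', '_').toUpperCase();
    }

    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/papers", "user", "pwd");
             Statement read = con.createStatement();
             PreparedStatement write = con.prepareStatement(
                 "INSERT INTO keyword_table (keyword1, keyword2, classkey, filename)"
               + " VALUES (?, ?, ?, ?)")) {

            ResultSet rs = read.executeQuery(
                "SELECT predicted_label, filename FROM keyword_temp");
            while (rs.next()) {                      // for each record in keyword_temp
                String[] kw = readKeywordsFromPdf(rs.getString("filename"));
                write.setString(1, kw.length > 0 ? kw[0] : null);
                write.setString(2, kw.length > 1 ? kw[1] : null);
                write.setString(3, generateClassKey(rs.getString("predicted_label")));
                write.setString(4, rs.getString("filename"));
                write.executeUpdate();               // insert into final Keyword_table
            }
        }
    }
}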







Fig. 3.4 The Classification Process


3.5 Classification (Level-2) Experiment

1. DSL files were created for all the possible domains/sub-domains (Fig. 3.5).

2. PDF files were constructed with the Title and the keywords in the Research paper (Fig. 3.6).

3. RapidMiner was used to carry out this classification experiment (Fig. 3.8 & Fig. 3.9).

4. The K-NN classifier was trained with the help of a Corpus comprising 24 such DSL files. Each DSL file was assigned a label which indicated its domain/sub-domain (Fig. 3.7).

5. The DSL Corpus was processed to produce a Term Document matrix using Term Frequency (Fig. 3.9). The processing involved a) Tokenizing, b) Removing Stop Words and c) Stemming.

6. The classifier was then tested using 81 PDF files to arrive at the predictions (Fig. 3.10).

7. The Keyword_table is generated at the end of this process.







Fig. 3.5 A sample DSL file














Fig. 3.6 A sample PDF file with the Title and the keywords (Keyword PDF)




Fig. 3.7 Building a Corpus of DSL files (Training examples) using RapidMiner (Classification Level-2)





Fig. 3.8 Developing and Testing the Classifier using RapidMiner (Classification Level-2)






Fig. 3.9 Processing the Documents using RapidMiner (Classification Level-2)







Fig. 3.10 Predictions (Results) by the Classifier using RapidMiner (Classification Level-2)














The structure of the Keyword table generated at the end of the process:

Attribute    Type      Comments
---------    -------   -----------------------------------------------------
Id           Numeric   Primary Key (generated automatically)
Keyword1     Text      Keywords from the file containing the Research paper
Keyword2     Text
Keyword3     Text
Keyword4     Text
Keyword5     Text
Keyword6     Text
Keyword7     Text
Keyword8     Text
Keyword9     Text
Keyword10    Text
Keyword11    Text
Keyword12    Text
Classkey     Text      Generated Classification Key
Filename     Text      Path/Name of the file containing the Research paper
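A hedged sketch of how this table might be created through JDBC is given below; the MySQL dialect, column sizes and connection details are assumptions for illustration rather than the actual DDL used.

// Hedged sketch: creating the final Keyword table via JDBC. MySQL dialect,
// column sizes and connection details are assumptions.
import java.sql.*;

public class CreateKeywordTable {
    public static void main(String[] args) throws SQLException {
        StringBuilder ddl = new StringBuilder(
            "CREATE TABLE keyword_table (id INT AUTO_INCREMENT PRIMARY KEY");
        for (int i = 1; i <= 12; i++)                 // Keyword1 .. Keyword12
            ddl.append(", keyword").append(i).append(" VARCHAR(255)");
        ddl.append(", classkey VARCHAR(50)")          // generated classification key
           .append(", filename VARCHAR(500))");       // path/name of the paper's file

        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/papers", "user", "pwd");
             Statement st = con.createStatement()) {
            st.executeUpdate(ddl.toString());
        }
    }
}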
















3.6 The Hierarchical Classification - Level 3

The proposed Hierarchical Classification Level 3 of Research Documents uses Text Mining techniques to assign labels to research papers indicating the underlying concept or topic of research. The classification labels are predicted on the basis of the 'Keyword + Abstract' text extracted from the research papers. Each extracted portion is stored in a separate PDF file. A corpus of such PDF files makes up the Test Set.

The abstract is a very important part of a research paper as it captures the essence of the entire paper. The abstract is useful for determining the specific topic of the paper. Thus abstract-based classification enables the assignment of labels that reflect the 'Concept of the Research paper'.


Classification Level 3: Issues

The abstract of the paper contains several stop words (like is, the, on, in, there, etc.). These stop words have to be removed since they are absolutely irrelevant. This can be done with any tool having Text Processing capabilities and does not pose any problem. However, the abstract also contains certain irrelevant/insignificant terms, e.g. perform, develop, build, generate, etc. These terms are not stop words and hence cannot be removed using Text Processing tools. Assigning these terms the same significance/weight as the technical, domain specific terms (which are important for classification) leads to incorrect classification. These terms dilute the relevance of the domain specific and distinguishing terms (included in the DSL) and thus defeat the purpose of the DSL.



This thesis proposes two schemes for Hierarchical Classification Level 3:

a) Simple Classification: based on the occurrence/non-occurrence of Domain Specific terms in the abstract.

b) Concept Prediction Using Term Relevance (CPTR).








3.7 The Simple Classification Scheme

This is a simple scheme of classification based on the occurrence/non-occurrence of DSL terms in the Test set.

The Training set is a Corpus of DSL files with their corresponding labels. Each such DSL file contains terms relating to a particular concept/topic.

Simple Classification is based solely on the presence/absence of DSL terms in the Test set. Documents in the Test set (for which a classification label has to be predicted) are assigned the class label of their nearest neighbor.


Characteristics of Simple Classification:

1. A classification experiment based on this scheme is very simple.

2. Only the presence/absence of DSL terms is considered. Hence a Term Document Matrix based on Binary Term Occurrence is required to represent the Training examples (DSL files) as well as the Test Documents. The K-NN Classifier compares the Binary Term Occurrence Matrix of the training set (the DSL files) with the Binary Term Occurrence Matrix of the Test Set and assigns the nearest label. (A small sketch of this binary representation follows this list.)

3. The Simple Classification scheme yields acceptable classification labels. The accuracy of classification, however, depends on the characteristics of the document being classified. The accuracy is lower for documents with multiple concepts because the frequency or relevance of terms is not taken into account for classification.
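To make the last two points concrete, the following minimal, illustrative sketch (all names and the toy vocabulary are hypothetical) builds binary occurrence vectors and shows that a term mentioned many times contributes no more than a term mentioned once, which is precisely why multi-concept documents can be misjudged under this scheme.

// Minimal sketch: under Binary Term Occurrence, term frequency is invisible.
// An abstract mentioning "clustering" three times yields the same vector as
// one mentioning it once.
import java.util.*;

public class BinaryOccurrenceDemo {

    // Binary vector over a fixed vocabulary: 1 if the term occurs at all.
    static int[] binaryVector(List<String> docTerms, List<String> vocab) {
        Set<String> present = new HashSet<>(docTerms);
        int[] v = new int[vocab.size()];
        for (int j = 0; j < vocab.size(); j++)
            v[j] = present.contains(vocab.get(j)) ? 1 : 0;
        return v;
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("clustering", "svd", "router");
        List<String> docA = List.of("clustering", "svd");
        List<String> docB = List.of("clustering", "clustering", "clustering", "svd");
        // Both print [1, 1, 0]: frequency plays no role in this scheme.
        System.out.println(Arrays.toString(binaryVector(docA, vocab)));
        System.out.println(Arrays.toString(binaryVector(docB, vocab)));
    }
}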