An Open Source Framework for Data Pre-processing of Online Software Bug Repositories

CiiT International Journal of Data Mining and Knowledge Engineering, Vol 1, No 7, September 2009
0974 – 9683/CIIT-IJ-0441/10/$20/$100 © 2009 CiiT Published by the Coimbatore Institute of Information Technology
Software bug repositories are a great source of knowledge. They
contain a lot of useful information related to software development,
software design and common error patterns for a software project.
Most projects use a bug tracking system to manage the bugs
associated with the software. These bug tracking systems work as
online bug repositories, which can be accessed by all project
members situated at different locations. All project members can
read and update the software bug information in these online bug
repositories. In order to extract knowledge from these online
software bug repositories, some mechanism is required to extract,
parse and save the data locally for analysis. In this paper a
framework for preprocessing online software bug repositories for
data mining is proposed and implemented using open source APIs
(Application Programming Interfaces). The performance of the
implemented framework is also evaluated in terms of the time taken
to fetch and parse software bug data from online repositories.

Software bug repositories, Fetching bug repositories,
Parsing software bugs, Data preprocessing of bug repositories.

I. INTRODUCTION

A bug is a defect in software. A bug indicates unexpected behavior
of some given requirement during software development. During
software testing, unexpected behavior of requirements is identified
by software testers or quality engineers and marked as a bug. Bugs
are managed and tracked using a number of available tools such as
JIRA[19], MantisBT[25] and Bugzilla[3]. For larger projects, where a
large number of developers work together from different locations,
it is very complex to manage the software bugs and other useful
information related to the project. Bug tracking tools play an
important role here: they provide an online mechanism to manage the
software bugs, which can be accessed and updated by the project team
members. All the software bug data are kept in an online repository.
A number of open source projects also use these bug tracking tools
to streamline bug-related operations; examples of such projects are
Mozilla, MySQL, the JBoss projects and the Apache projects.

Manuscript received on November 11, 2009, review completed on
November 16, 2009 and revised on November 23, 2009.
Naresh Kumar Nagwani is with the National Institute of Technology,
Raipur, CG 492001 India (phone: +91-99933-12001; e-mail:
Dr. Shrish Verma is with the National Institute of Technology, Raipur, CG
492001 India (e-mail:
Digital Object Identifier No: DMKE102009004

Whenever a new bug is identified by a quality engineer, all the
information required for the bug is created as a new record in the
bug tracking tool. While it is being fixed, a software bug passes
through various states, which are also tracked by the bug tracking
tool. The state flow diagram of a bug is presented in Figure 1, and
one typical flow is explained here. When a new bug enters the system
it is in the new state. When a developer starts working on the bug
it moves to the in-progress state. Once the developer has fixed the
bug, the developer marks it as resolved and the bug enters the
resolved state. When a tester verifies that the resolved bug is
indeed fixed, it moves to the closed state.
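
The single flow described above can be sketched as a small state machine. This is only a minimal illustration, assuming a strictly linear life cycle; the class, enum and method names are hypothetical and not part of the framework, and real bug trackers typically allow further transitions such as reopening a bug.

```java
// Sketch of the bug life cycle described in the text (hypothetical names).
class BugLifecycle {

    // States in the order a bug passes through them in the described flow.
    enum State { NEW, IN_PROGRESS, RESOLVED, CLOSED }

    // A move is allowed only to the immediately following state.
    static boolean canMove(State from, State to) {
        return to.ordinal() == from.ordinal() + 1;
    }

    public static void main(String[] args) {
        State current = State.NEW;
        // Walk through the states and print every allowed transition.
        for (State next : State.values()) {
            if (canMove(current, next)) {
                System.out.println(current + " -> " + next);
                current = next;
            }
        }
    }
}
```

Keeping the allowed transitions in one method makes it easy to reject inconsistent status values while the bug records are being parsed.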
For the convenience of developers and other project members spread
across different locations, software bug repositories are deployed
as online databases from which all the required information can be
accessed. Software repositories present their information in HTML or
XML form through a web interface. Software repositories are becoming
an area of focus in data mining since they contain many useful
patterns related to software development.
Applying data mining operations to these online repositories is a
complex process because the records are not all available locally at
one time, and the record data is present in either HTML or XML
format. A lot of preprocessing work is therefore required on these
online bug repositories, and some mechanism is also required to
fetch the records from them. In this paper an open source framework
is proposed to fetch, parse and traverse these online bug
repositories.

Fig. 1 Bug State Diagram

The structure of this paper is as follows. Section 2 discusses
related frameworks and previous work in the same direction. Section
3 presents the proposed open source framework and its modules.
Section 4 describes the implementation of the proposed framework
using open source APIs. Section 5 describes the performance
evaluation of the implementation in terms of the timings of the
important processes. Section 6 describes some of the research areas
where the proposed framework can be useful. Section 7 summarizes the
contribution of the proposed work and the future scope of research
in the same direction.


There are a number of research problems associated with online
software bug repositories. Some of the studied research problems and
the related work are discussed here. The problem of duplicate bug
detection in online bug repositories has been addressed by many
researchers [33,37]. A similarity and duplicate detection method for
GUI bugs is proposed by Nagwani and Singh [28]. An extended
association rule based defect data analysis method for online
software defect repositories is proposed in [35].
A number of frameworks and other tools exist for similar problems,
but they address the retrieval and preprocessing of online software
code; software defect repositories are not yet covered by any of
these existing tools. The existing tools are briefly described here
for reference. An open framework for CVS repository querying,
analysis and visualization is proposed in [25]. A tool named
Sourcerer was developed to provide infrastructure for the automated
crawling, parsing, and database storage of open source software
[36]. Neither of them supports the functionality associated with
online software defect repositories; hence an approach is required
to deal with software defect repositories.
Baysal, Godfrey, and Cohen [29] have proposed a framework for
automated assignment of bug-fixing tasks. The approach in their
framework employs preference elicitation to learn the developers’
predilections in fixing bugs within a given system. The approach
infers knowledge about a developer’s expertise by analyzing the
history of bugs previously resolved by the developer. A vector space
model to recommend experts for resolving bugs was suggested. The
framework works as follows: whenever a new bug report arrives, the
system automatically assigns it to the appropriate developer
considering his or her expertise, current workload, and preferences.
A new approach that uses execution information to find duplicate
bugs in a software bug repository was given by Wang, Zhang, Xie,
Anvik and Sun [37]. In this approach, when a new bug report arrives,
its natural language information and execution information are
compared with those of the existing bug reports. Then, a small
number of existing bug reports are suggested to the triager as the
most similar bug reports to the new bug report. The triager finally
examines the suggested bug reports to determine whether the new bug
report duplicates an existing bug report. Two interrelated problems
related to software reuse repositories were addressed by Henninger
[32]. The problems addressed were acquiring the knowledge to
initially construct the repository, and modifying the repository to
meet the evolving and dynamic needs of software development
organizations. The solution proposed was to choose a retrieval
method that utilizes minimal repository structure to effectively
support the process of finding software components. A security issue
related to the Grid Security Infrastructure (GSI) was addressed by
Novotny, Tuecke, and Welch[16]. The issue was addressed using an
online credentials repository system named MyProxy. MyProxy allows
Grid Portals to use the GSI to interact with Grid resources in a
standard, secure manner.
A method that uses the source code change history of a software
project to drive and refine the search for bugs was proposed by
Williams and Hollingsworth[4]. The method works as follows: based on
the data retrieved from the source code repository, a static source
code checker searches for a commonly fixed bug and uses information
automatically mined from the source code repository to refine its
results. Some problems related to the proper usage of open bug
repositories were addressed by Anvik, Hiew and Murphy[21]. The
advantages and disadvantages associated with open bug repositories
are also studied in the same work, which also mentions a duplicate
bug detection method that applies a machine learning approach to
various parameters. The issues addressed by task visualizations that
support social inferences in software development are studied by
Halverson, Ellis, Danis, and Kellogg[6]. Their work focused on
managing change requests. A design for the visual inspection of
change requests was also created by interviewing industry and
open-source programmers. A model that combines traditional
contribution metrics with data mined from software repositories to
deliver accurate developer contribution measurements was proposed by
Gousios, Kalliamvakou and Spinellis[11]. The model creates clusters
of similar projects to extract weights that are then applied to the
actions a developer performed on project assets to extract a
combined measurement of the developer’s contribution. The
limitations of versioning system information for a software system
were studied by Robbes[31]. Versioning systems such as CVS or
Subversion store only snapshots of text files, leading to a loss of
information: the exact sequence of changes between two versions is
hard to recover. In that work an alternative information repository
is suggested which stores incremental changes to the system under
study, retrieved from the IDE used to build the software. A method
was developed by Saggion and Gaizauskas[13] that uses already
available external sources to gather knowledge about the
“definiendum” before trying to define it using the given text
collection. This knowledge consists of lists of relevant secondary
terms that frequently co-occur with the definiendum in
definition-bearing passages or “definiens”. The external sources
used to gather secondary terms are an on-line encyclopedia, a
lexical database and the Web. These secondary terms together with
the definiendum are used to select passages from the text collection
by performing information retrieval. Linguistic analysis is also
carried out on each passage to extract definition strings from the
passages using a number of criteria including the presence of main
and secondary terms or definition patterns.
The choice of a proper data analysis/exchange format while analyzing
the evolution of software systems is an important decision; this
problem is studied and addressed by Kiefer, Bernstein, and
Tappolet[7]. They suggested EvoOnt, a software repository data
exchange format based on the Web Ontology Language (OWL). EvoOnt
includes software, release, and bug-related information. OWL
describes the semantics of the data. The work also includes iSPARQL,
a SPARQL-based Semantic Web query engine containing similarity
joins. Together with EvoOnt, iSPARQL can accomplish a sizable number
of tasks sought in software repository mining projects, such as an
assessment of the amount of change between versions or the detection
of bad code smells. Mierle, Laven, Roweis, and Wilson[23] studied
the mining and analysis of CVS repositories. They extracted various
quantitative measures of student behaviour and code quality from CVS
repositories, and attempted to correlate these features with grades.
A work-oriented design project concerned with the migration of
shared, workgroup document collections was carried out by Trigg,
Blomberg, and Suchman[30]. The focus of the project was a document
collection called the "project files," a heterogeneous mix of
documents that serve as an ongoing resource for the group during a
project's course as well as an archival record at its completion. A
method was described to discover the dynamics of the standardized
classification scheme in use for the project files and the existing
practices of document filing, including routine troubles, together
with the prototype developed to move the project files online.
A technique to effectively reuse Reusable Software Libraries (RSLs)
was suggested by Poulin[19]. The implementation of an RSL depends on
many factors including the availability of quality and useful
software. Domain-specific considerations most often determine the
usefulness of software and therefore should influence how an
organization populates an RSL. This work presents IBM’s experiences
with a corporate RSL and reuse incentive programs, and summarizes
the results of an enterprise-wide initiative to develop reusable
software for the RSL. The work also explains some of the issues
surrounding a large RSL and defines a three-phased progression
typical of corporate reuse libraries. A prediction technique to
estimate the time needed to fix a software bug is given by
Panjer[24]. The technique uses data mining tools to predict the time
to fix a bug given only the basic information known at the beginning
of a bug’s lifetime. A bug history transformation process is
described and several data mining models were built and tested.
Interesting behaviours derived from the models were also documented.
A static analysis based bug finding tool is proposed by Williams and
Hollingsworth[5] to search the source code for violations of
system-specific rules. These rules describe how functions interact
in the code, how data is to be validated or how an API is to be
used. A method is proposed to automatically recover a subset of
these rules, function usage patterns, by mining the software
repository. A project snapshot and submission system named Marmoset
was developed by Spacco, Strecker, Hovemeyer, and Pugh[15]. It
allows students to submit versions of their projects to a central
server, which automatically tests them and records the results.
Marmoset collects fine-grained code snapshots: each time a user
saves the work, it is automatically committed to a CVS repository.
The data collected by Marmoset serves as a source of insight about
learning to program and software evolution.
A prototype semantic web system for OSS communities named Dhruv was
developed by Ankolekar, Sycara, James and Welty[2]. Dhruv provides
an enhanced semantic interface to bug resolution messages and
recommends related software objects and artifacts. Dhruv uses an
integrated model of the OpenACS community, the software, and the web
interactions, which is semi-automatically populated from the
existing artifacts of the community. Software repositories are full
of valuable information about the project: bug descriptions,
check-in messages, email and newsgroup archives, specifications,
design documents, product documentation, and product support logs
contain a wealth of information that can potentially help software
developers resolve crucial questions about the history, rationale,
and future plans for source code. A suite of tools is proposed by
Venolia[12] to access these useful resources using browse, full-text
search, artifact-based search, and implicit search over a software
repository. All these tools depend on an index that represents
software-related artifacts and the relationships among them. An
extensible architecture for representing and provisioning artifacts
and the relationships among them is presented in that work. The
artifacts and relationships form a typed graph. The graph is
provisioned from structured data sources, structured files, and
textual allusions to artifacts. A tool for the verified software
repository, which is part of the verifying compiler grand challenge,
was proposed by Arenas, Bicarregui, and Margaria[1]. In the FMICS
view, the repository should include proven correct software and
tools to help establish the correctness of the software in question.
Based on this view and using jETI technology, a tool is suggested
for the repository that orchestrates different tools. The concept of
the Verified Software Repository is proposed by Bicarregui, Hoare,
and Woodcock[14]. A verifying compiler is a tool that automatically
proves that a program will always meet its requirements, without
even needing to run it. The Verified Software Repository is a first
step towards the realisation of the verifying compiler. It maintains
and develops an evolving collection of state-of-the-art tools,
together with a representative portfolio of real programs and
specifications on which to test, evaluate and develop the tools. It
contributes initially to the inter-working of tools, and eventually
to their
Software version control repositories provide a uniform and stable
interface to manage documents and their version histories. Open
source systems such as CVS, Subversion, and GNU Arch are not well
suited to highly collaborative environments and fail to track
semantic changes in repositories. Watkins and Nicole[8] introduced
document provenance as a Description Logic framework to track the
semantic changes in software repositories and draw interesting
results about their historic behaviour using a rule-based inference
engine. To support the use of this framework, an online
collaborative tool named WikiWikiWeb is also developed for
leveraging the fluency. An approach for visual information retrieval
from large distributed on-line repositories was suggested by Chang,
Smith, Beigi, and Benitez[34]. Using this approach, digital images
and video are retrieved from on-line repositories for effective
human communication applications. A semi-automated approach was
presented by Anvik, Hiew and Murphy[22] to simplify one part of the
bug fixing process, namely the assignment of reports to a developer.
In this approach a machine learning algorithm is applied to the open
bug repository to learn the kinds of reports each developer
resolves. When a new report arrives, the classifier produced by the
machine learning technique suggests a small number of developers
suitable to resolve the report. They have also described the
conditions under which the approach is applicable.
A new approach is proposed here: an integrated framework for
fetching, parsing and traversing software bug repositories. The goal
is to let users preprocess the data of online software bug
repositories. The framework is implemented using open source
technologies only, and experiments are carried out for performance
analysis.


The proposed framework is divided into four major modules. The first
module fetches online bugs. The second is the parser module, which
parses the bugs available in HTML and XML format. The third is the
schema generator module for the local database, and the fourth is
the traversal module for the local database in which the parsed bug
records are stored. The four modules are:

A. Fetch Module
The fetch module fetches the bugs stored in online bug repositories.
Each online bug repository has a standard URL pattern, so a software
bug can be retrieved locally by specifying its bug id. For example,
the Mozilla online repository has the URL pattern “
bug.cgi?id=”; by appending a particular id (an integer), a
Mozilla bug can be fetched locally. If the connection requires proxy
settings, the user can also specify the proxy address and port
number. The user can also specify the location where the bugs are to
be stored locally. Ranges of bug ids can also be specified in order
to retrieve a number of bugs at a time.
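
Expanding a URL pattern and a range of bug ids into the individual bug URLs can be sketched as follows. This is a minimal illustration only; the class and method names are hypothetical, and the repository address in main is a placeholder rather than a real bug repository.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: expand a URL pattern and an id range into one URL per bug.
class BugUrlRange {

    // urlPattern is expected to end with "id=" so that the id can be appended.
    static List<String> urlsFor(String urlPattern, int firstId, int lastId) {
        List<String> urls = new ArrayList<String>();
        for (int id = firstId; id <= lastId; id++) {
            urls.add(urlPattern + id);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical repository address, used only for illustration.
        String pattern = "https://bugs.example.org/show_bug.cgi?id=";
        for (String url : urlsFor(pattern, 100, 102)) {
            System.out.println(url);
        }
    }
}
```

Each resulting URL can then be handed to the fetch step one at a time.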

B. Parser Module
Once a software bug has been retrieved to the local machine, it is
available as either HTML or XML, since online repositories support
either the HTML or the XML format. The parser module is responsible
for parsing the HTML or XML bugs and saving the bug information in
the local database. Two types of parsers are used for the parsing
process: an HTML parser and an XML parser.

C. Schema Generator Module
To store the bug information after parsing, the schema of the tables
must be defined in the local database. Using the schema generator,
the user can specify the table schema for the local database: the
number of columns in the table and the names and types of the
columns. Once the user has specified all the parameters for the
schema definition, the table structure that stores the parsed
software bug information is created using DDL (Data Definition
Language) statements issued through JDBC (Java Database
Connectivity).

D. Traversal Module
The purpose of the traversal module is to traverse the locally
stored database. The traversal module is written in a generic way,
so that it supports every schema defined for the database. The GUI
fields for the columns are generated at runtime, independent of any
particular table structure, based on the schema defined for the
local database.

Fig. 2 framework for data preparation of online software defect

The framework, with its modules and the associated processes for
preprocessing an online bug repository, is represented in Figure 2.
The process starts with a reference URL (Uniform Resource Locator)
for an online bug repository. Using the URL, the fetch module
retrieves the bug records one at a time. The bug records are
retrieved as HTML or XML files on the local machine. The parser
module then parses these files and stores the parsed information in
the local database. The schema of the local database can be
generated with the help of the schema generator module. Once the bug
records are available on the local machine, they can be traversed
and retrieved for analysis using the traversal module.


The proposed framework is implemented using Java[17], JDBC,
MySQL[27] and JAXP[18] technologies. Retrieval timings from
different defect repositories and parsing timings are calculated for
performance evaluation of the implemented framework.
All the graphical user interfaces (GUIs) are created using Java
Swing, and MySQL is used as the backend local database. The fetch
module is created with the help of the URL class of the Java
networking package, and a proxy setting option is also implemented.
The parser module is implemented using the HtmlParser class for HTML
and JAXP for XML content parsing. The schema generator is
implemented using the JDBC classes, and the traversal module is
created using the ResultSetMetaData class of the JDBC API.
Figure 2 shows the main GUI of the proposed framework, which
displays the options for all four modules together with help for
each module, shown as “?” in the figure. The user can start with any
of the modules independently for preprocessing of the software
repositories. The preferred sequence for data preparation of a
software bug repository is: fetch the bugs; parse the retrieved bugs
from the specific online format (HTML or XML) into local Java
objects; define the database schema to store the bugs and map the
local Java objects to the database created; and finally traverse the
data stored in the database.

Fig. 2 Online Bug Repository Retriever and Parser main GUI

The fetch module of the framework is depicted in Figure 3. In the
fetch module the user can select a stored URL (Uniform Resource
Locator) pattern for an online bug database or enter a new online
bug database source. The user can also specify the proxy address, if
one is required for the internet connection, and the location where
the retrieved bug files are to be stored.
All the known online bug repositories are stored in the local
database, from where they are loaded into a HashMap from the Java
collections API at runtime; new online bug repositories can also be
stored and updated in the database. The getUrl() method retrieves
the URL location for a specified online repository from the
database. The URL is shown in an editable text field, so the user
can also edit the URL location manually. The code snippet for
handling URL locations is given below:

private HashMap<String, String> urlMap = new HashMap<String, String>();

The code snippet for retrieving a software bug to the local system
is given below. The URL location is stored in the urlStr variable,
and fileFormat specifies the format in which the bug is available in
the online bug repository. Each individual software bug is stored in
a separate file, in a specified folder on the system, using the
FileOutputStream class.

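// Fetch bug number i and save it to its own local file.
URL url = new URL(urlStr + i);
String fileName = fileLocation + bugType + "-bugs-" + i + "." + fileFormat;
try (BufferedReader in = new BufferedReader(
         new InputStreamReader(url.openStream()));
     FileOutputStream foStream = new FileOutputStream(fileName)) {
    String str;
    while ((str = in.readLine()) != null) {
        // readLine() strips the line terminator, so it is written back.
        foStream.write((str + "\n").getBytes());
    }
}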

If a proxy connection is required to connect to the online software
bug repository, the proxy connection parameters can be specified
after selecting the "PROXY NEEDED" checkbox. The code snippet to
establish a proxy connection is given below:
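
A minimal sketch of such a proxy connection, assuming the standard java.net.Proxy class together with URL.openConnection(Proxy), is shown here. The class and method names, as well as the proxy host and port, are hypothetical placeholders and are not taken from the framework.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

// Sketch: route a bug request through an HTTP proxy (hypothetical names).
class ProxyFetchSketch {

    // Build a proxy description from the address and port entered by the user.
    static Proxy httpProxy(String host, int port) {
        // createUnresolved avoids a DNS lookup at construction time.
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Prepare a connection to the bug URL that goes through the proxy.
    // No network traffic happens until the connection is actually used.
    static URLConnection open(String urlStr, Proxy proxy) throws IOException {
        return new URL(urlStr).openConnection(proxy);
    }

    public static void main(String[] args) {
        Proxy proxy = httpProxy("proxy.example.org", 8080);
        System.out.println(proxy.type() + " " + proxy.address());
    }
}
```

The stream returned by getInputStream() on the resulting connection can be read in the same way as url.openStream() in the snippet above.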


The total time consumed to retrieve software bugs from the various
online software bug repositories is also measured. The measurement
is done using the System.currentTimeMillis() method of the System
class in Java: the starting and ending times are stored and their
difference gives the time consumed by the retrieval process.
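
The measurement described above can be sketched as follows. The class and method names are hypothetical, and a short sleep stands in for the actual retrieval of a bug.

```java
// Sketch: time an operation with System.currentTimeMillis().
class FetchTimer {

    // Runs the task and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.currentTimeMillis();
        task.run();
        long end = System.currentTimeMillis();
        return end - start;
    }

    public static void main(String[] args) {
        // Thread.sleep stands in for the fetch of one bug.
        long elapsed = timeMillis(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("Elapsed: " + elapsed + " ms");
    }
}
```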

Fig.3.Fetch module for the online bug repository
Figure 4 shows the parse module of the proposed framework. In the
parse module the user first selects the data format type, i.e.
whether the retrieved software bug is in HTML or XML format, and
then specifies the table in the local database where the parsed bug
is to be saved. The schema of the local database can be created
using the schema generation module.
HTML parsing can be done by extending the
HTMLEditorKit.ParserCallback class in Java. The following code
snippet instantiates an HTML parser in Java. The class MyParser
extends ParserCallback in the given code.

ParserGetter kit = new ParserGetter();
HTMLEditorKit.Parser parser = kit.getParser();
FileReader reader = new FileReader(fileName);
MyParser callback = new MyParser();
parser.parse(reader, callback, true);

The HTML parser has a number of methods to handle the various HTML
tags. handleText() is the method that deals with the text stored
inside HTML tags. The following code snippet shows how the title
information is captured for a software bug. bObj is an instance of
the BugObject class, a Java POJO (Plain Old Java Object) that stores
the bug information at runtime.

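@Override
public void handleText(char[] text, int position) {
    // Only the text inside the <title> tag is of interest here.
    if (currentTag.equals(HTML.Tag.TITLE)) {
        if (text.length > 5) {
            String titleStr = new String(text);
            // The title is split on ':'; the second and third parts
            // are taken as the bug id and the summary.
            String[] strArr = titleStr.split(":");
            if (strArr.length > 2) {
                String bugId = strArr[1];
                String summary = strArr[2];
                bugId = bugId.replaceAll("#", "");
                bObj.setSummary(summary);
            }
        }
    }
}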
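
For reference, a self-contained sketch of the same callback technique is given here. It is an illustration under stated assumptions, not the framework's code: it uses javax.swing.text.html.parser.ParserDelegator directly instead of the ParserGetter helper, it assumes that the current tag is tracked in handleStartTag() and handleEndTag(), and the class names and the sample page are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Sketch: capture the <title> text of a bug page (hypothetical names).
class TitleExtractor {

    // Remembers the most recently opened tag and stores the title text.
    static class TitleCallback extends HTMLEditorKit.ParserCallback {
        private HTML.Tag currentTag;
        String title = "";

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            currentTag = tag;
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            currentTag = null;
        }

        @Override
        public void handleText(char[] text, int pos) {
            if (HTML.Tag.TITLE.equals(currentTag)) {
                title = new String(text).trim();
            }
        }
    }

    // Parses the given HTML text and returns the content of its title tag.
    static String titleOf(String html) {
        TitleCallback callback = new TitleCallback();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.title;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Bug #42: Crash on startup</title></head>"
                + "<body><p>Details</p></body></html>";
        System.out.println(titleOf(page));
    }
}
```

The captured title string can then be split into bug id and summary as in the handleText() snippet above.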

A parsed bug is saved in an object of the class BugObject at
runtime. The structure of the BugObject class is shown below. It
consists of all the attributes of a bug and their setter and getter
methods.

public class BugObject {
    private String bugId;
    private String summary;
    private String description;
    private String submitted;
    private String modified;
    private String status;
    private ArrayList<String> comments = new ArrayList<String>();
    private ArrayList<String> descList = new ArrayList<String>();
    // Setters and Getters of all the attributes
}
The bug fix duration can also be measured by selecting the check box
"Calculate and Save Fix Duration". It is calculated by examining the
attributes bug-status, bug-creation-date and bug-modified-date. If
the bug-status is “Fixed” or “Resolved”, the fix duration is the
difference between the bug-modified-date and the bug-creation-date.
This difference can be expressed at any time granularity, e.g.
hours, days, weeks or months. Here the number of days is used as the
granularity for the bug fix duration. The Java code for calculating
the fix duration is given below.

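String submitted = bugObject.getSubmitted();
String modified = bugObject.getModified();
// The date pattern is cut off after "yyyy" in the listing; extend it
// to match the date format of the repository being parsed.
DateFormat df = new SimpleDateFormat("dd MMM yyyy");
String bugStatus = bugObject.getStatus();
if (bugStatus.equals("Fixed") || bugStatus.equals("Resolved")) {
    Calendar calenderSubmitted = Calendar.getInstance();
    Calendar calenderModified = Calendar.getInstance();
    // Load the parsed dates; otherwise both calendars hold the current time.
    calenderSubmitted.setTime(df.parse(submitted));
    calenderModified.setTime(df.parse(modified));
    long milliseconds1 = calenderSubmitted.getTimeInMillis();
    long milliseconds2 = calenderModified.getTimeInMillis();
    long diff = milliseconds2 - milliseconds1;
    // Convert milliseconds to whole days.
    diffDays = diff / (24 * 60 * 60 * 1000);
}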
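
A self-contained sketch of the fix duration calculation is given here for illustration. The class and method names are hypothetical, and the date pattern "dd MMM yyyy" and the sample dates are assumptions, since the date format actually depends on the repository.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Sketch: bug fix duration in whole days (hypothetical names and date pattern).
class FixDuration {

    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Returns the number of days between submission and last modification,
    // or -1 when the bug is not yet fixed or resolved.
    static long fixDurationDays(String status, String submitted, String modified) {
        if (!"Fixed".equals(status) && !"Resolved".equals(status)) {
            return -1;
        }
        // Assumed pattern; UTC avoids daylight-saving shifts in the difference.
        SimpleDateFormat df = new SimpleDateFormat("dd MMM yyyy", Locale.ENGLISH);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            long start = df.parse(submitted).getTime();
            long end = df.parse(modified).getTime();
            return (end - start) / MILLIS_PER_DAY;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable bug date", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fixDurationDays("Resolved", "01 Jan 2009", "11 Jan 2009")); // prints 10
        System.out.println(fixDurationDays("New", "01 Jan 2009", "11 Jan 2009"));      // prints -1
    }
}
```

Returning a sentinel value for unresolved bugs keeps such records distinguishable when the durations are later stored in the local database.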

Fig. 4.Parse module of the framework

The GUI for the table traversal module of the implemented framework
is shown in Figures 5 and 6. The user can traverse the records of a
specific table by choosing the database name and the table name.
Once the table name and database name are set, the GUI for the table
fields is generated at runtime using the ResultSetMetaData class of
JDBC (Java Database Connectivity).
First a database name is selected from the combo box; selecting a
database name populates the names of the tables present in that
database. On clicking the “Set Database” button, the database and
table names are selected and stored at runtime for traversing the
table records. Based on the database and table selection, the
traverse GUI components are created at runtime along with four
buttons: First, Last, Previous and Next.
The First button points to the first record stored in the table. The
Previous and Next buttons move backward and forward through the
table one record at a time, and the Last button points to the last
record of the table. The remaining fields of the GUI are generated
at runtime, independently of the table structure, using the
ResultSetMetaData class of Java.

Fig. 5. Specify database for traverse operation

Fig. 6. A generic traverse module used for traversing through records
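The runtime generation of table fields described above can be sketched as follows. This is a minimal illustration and not the framework's code: the class FieldLabels and the method describeColumns are hypothetical names, while getColumnCount, getColumnLabel and getColumnTypeName are standard ResultSetMetaData methods.

```java
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class FieldLabels {
    /** Builds one "label (TYPE)" entry per column, in column order. */
    public static List<String> describeColumns(ResultSetMetaData meta) throws SQLException {
        List<String> labels = new ArrayList<>();
        // JDBC column indexes start at 1, not 0.
        for (int i = 1; i <= meta.getColumnCount(); i++) {
            labels.add(meta.getColumnLabel(i) + " (" + meta.getColumnTypeName(i) + ")");
        }
        return labels;
    }
}
```

With a live connection, the metadata object would come from resultSet.getMetaData(), and each returned entry could back one label and text field pair in the GUI. One plausible way to implement the four buttons, assuming a scrollable result set (ResultSet.TYPE_SCROLL_INSENSITIVE), is to call the first(), last(), previous() and next() methods of ResultSet.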

The schema generator module uses a two-step process to create
the table schema for the software bug database at runtime. The
first step is to specify the number of columns for the schema,
as shown in Figure 7, and the second is to specify the column
names and their data types, as shown in Figure 8.

Fig. 7. Specify the number of columns for the schema generator

Before a table is created, the number of columns in the table
must be specified, so that the GUI for specifying the name
and type of each field can be generated at runtime. Once the number
of fields is specified using the GUI shown in Figure 7, another GUI,
shown in Figure 8, is used to specify the field name and data type
of each field. Common data types such as VARCHAR,
int and date are provided in the column type definitions. For
the VARCHAR data type, the user can also specify the number of
characters for the field.
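The second step of the schema generator, turning the entered column names and types into a table definition, can be sketched as below. This is an assumed illustration: the class SchemaGenerator, the method createTableSql, and the table and column names are hypothetical, and the paper does not show how the framework builds its SQL.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class SchemaGenerator {
    /** Builds a CREATE TABLE statement from column names mapped to SQL type definitions. */
    public static String createTableSql(String table, Map<String, String> columns) {
        StringJoiner cols = new StringJoiner(", ", "(", ")");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            cols.add(column.getKey() + " " + column.getValue());
        }
        return "CREATE TABLE " + table + " " + cols;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the columns in the order the user entered them.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("bug_id", "INT");
        columns.put("summary", "VARCHAR(255)");
        columns.put("submitted", "DATE");
        System.out.println(createTableSql("bugs", columns));
    }
}
```

Because the names come from user input, a real implementation would need to validate or quote the identifiers before executing the statement; the sketch omits this.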
A software bug repository can be preprocessed using the
four modules described above. The user can perform each
preprocessing activity independently, in sequence. When all
the operations have completed successfully, data mining
techniques can be applied directly to the preprocessed data.

V. PERFORMANCE EVALUATION

Retrieval timings from different defect repositories and
parsing timings are measured for the performance evaluation of the
implemented framework. Data is taken from the open source bug
repositories of the MySql, JBoss-Seam and Mozilla
projects. The timings associated with the other modules of the
framework are not included in the performance evaluation, since
their operations run on the local system and have constant time
complexity, which can be neglected. Fetching and parsing are the
operations that take the most time in the preprocessing
of a software bug repository.
Table-1 contains the fetching timings in ms of the fetch module from
different bug repositories for different numbers of software bugs.


Table-1. Fetching timings in ms
No. of Bugs   100      200      500       700       1000
MySql         129516   326859   652422    848828    1361844
Mozilla       94437    245062   724891    1120188   1959844
Seam          214782   448078   1207297   1708797   2571844

Table-2 contains the parsing timings in ms for software bugs
fetched in HTML format from different repositories, for
different numbers of bugs.

Fig. 8. Schema generator module for specifying table schema


Table-2. Parsing timings in ms
No. of Bugs   100      200      500       700       1000
MySql         134300   234100   721780    1034200   1605000
Mozilla       153900   267600   819500    1168400   1867200
Seam          232000   393120   1223100   1688400   1932100

Figure 9 depicts the fetching timings in ms from the different
online bug repositories, and Figure 10 depicts the parsing timings
in ms for the fetched bugs. Both operations behave in almost the
same way: the time consumed increases approximately linearly with
the number of bugs to fetch and parse.
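One simple way to check the linear trend is to divide each total time by the number of bugs: if the growth is linear, the average time per bug stays roughly constant. The sketch below does this for the MySql fetch timings from Table-1; the class and method names are illustrative only.

```java
public class PerBugTiming {
    /** Average time per bug in milliseconds. */
    public static double perBugMillis(long totalMillis, int bugCount) {
        return (double) totalMillis / bugCount;
    }

    public static void main(String[] args) {
        int[] counts = {100, 200, 500, 700, 1000};
        // MySql fetch timings in ms, taken from Table-1.
        long[] mysqlFetch = {129516, 326859, 652422, 848828, 1361844};
        for (int i = 0; i < counts.length; i++) {
            System.out.printf("%d bugs: %.1f ms per bug%n",
                    counts[i], perBugMillis(mysqlFetch[i], counts[i]));
        }
    }
}
```

For this row the averages stay between roughly 1.2 and 1.6 seconds per bug across all five sample sizes, which is consistent with approximately linear growth.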
A number of software bugs are retrieved from different online
software bug repositories using the fetch module, and the
overall retrieval time is measured in milliseconds. For the
experiment, 100, 200, 500, 700 and 1000 bugs are retrieved at a
time from the MySql, Mozilla and JBoss-Seam online software bug
repositories, and the corresponding graph is plotted for
performance evaluation, as depicted in Figure 9.

Fig. 9. Graph showing the fetching timings in ms for different repositories

Once the bugs are retrieved from the different online
software bug repositories, the next step is to extract the
software bug information from the retrieved bug files. MySql,
Mozilla and JBoss-Seam provide the bug information in HTML
format, so HTML file parsing is required for bug data
extraction. The time consumed for parsing 100, 200, 500, 700
and 1000 HTML bug files is measured in milliseconds and a
graph is plotted for performance evaluation of the parsing module,
as depicted in Figure 10.
The parsing operation is more complex than the software bug
retrieval operation, since parsing includes extracting
various tags from an HTML or XML file and mapping the extracted
data to the local database fields. The above experiment was run
on a system with an Intel dual core processor at 2.0 GHz,
2 GB RAM and a 2 Mbps internet connection.
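The parsing step, extracting tagged values from a fetched HTML file so that they can be mapped to database fields, can be illustrated with a small sketch. The markup assumed here (table cells carrying an id attribute) is hypothetical, as are the class and method names; each real repository uses its own page layout, and the paper does not show the framework's parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BugHtmlParser {
    // Hypothetical markup: each field is a table cell such as <td id="status">Fixed</td>.
    private static final Pattern FIELD =
            Pattern.compile("<td\\s+id=\"([^\"]+)\"\\s*>([^<]*)</td>", Pattern.CASE_INSENSITIVE);

    /** Extracts field name to value pairs, ready to be mapped to database columns. */
    public static Map<String, String> extractFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher matcher = FIELD.matcher(html);
        while (matcher.find()) {
            fields.put(matcher.group(1), matcher.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<tr><td id=\"status\">Fixed</td><td id=\"submitted\"> 01 Jan 2009 </td></tr>";
        System.out.println(extractFields(html));
    }
}
```

A regular expression is used only to keep the sketch free of external libraries. It breaks on nested tags or unusual attribute order, so a proper HTML parser is the more robust choice for real bug pages.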

Some recent research areas in which the proposed framework for
preprocessing of software bug repositories can be used
effectively are listed below:

A. Software Bug Classification
Software bug classification refers to the problem of
categorizing software bugs into predefined bug
classes. The bug classes could relate to backend bugs,
middle tier bugs or front end GUI (Graphical User Interface)
bugs. In order to classify the software bugs, classification
techniques are applied over the software bug repositories.
Before the classification techniques are applied, all the software
bugs must be available on the local machine. The proposed
framework can be used to make all the bugs of a software bug
repository available locally.

B. Software Bug Estimation
Software bug estimation is one of the prediction problems over
software bug repositories. By looking at the effort required
for previously resolved bugs, the resolution effort required for
new bugs can be predicted; this is called software
bug estimation. To obtain the fix effort of previous bugs, all
the bug related data must be captured on the local machine
from a software bug repository. Using the proposed framework,
the required data can be fetched and saved to the local machine,
where the prediction technique can be applied.

C. Duplicate Bug Detection
The problem of duplicate bug detection in a software
repository involves applying data and text mining techniques to
identify bugs that are similar to, or duplicates of, a given
software bug. Whenever a new bug enters the software bug
repository, the repository can be checked to see whether a
similar bug already exists. If the bug already exists, then the

Fig. 10. Graph showing the parsing timings in ms for bugs fetched
from different repositories.
new bug can be marked as a duplicate bug, which requires no
resolution. In the case of similar bugs, the amount of effort
required for resolution is reduced.
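As one illustration of how the similarity between a new bug and a stored bug might be scored once the data is available locally, the sketch below computes the Jaccard similarity of the word sets of two bug summaries. Jaccard similarity is chosen here only as a simple, well-known text measure; it is not a technique taken from this paper, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BugSimilarity {
    /** Lower-cases the text and splits it into a set of word tokens. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        result.remove("");  // split can yield an empty leading token
        return result;
    }

    /** Jaccard similarity: shared tokens divided by all distinct tokens, in [0, 1]. */
    public static double jaccard(String a, String b) {
        Set<String> tokensA = tokens(a);
        Set<String> tokensB = tokens(b);
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        if (union.isEmpty()) {
            return 0.0;
        }
        tokensA.retainAll(tokensB);  // tokensA now holds the intersection
        return (double) tokensA.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("Login button crashes on click",
                                   "Crash on click of login button"));
    }
}
```

A score near 1 suggests a duplicate candidate and a moderate score a similar bug; any threshold would have to be tuned on the repository at hand.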

D. Identifying Common Error Patterns
Data mining techniques can be applied over the software bug
repositories to identify the common error patterns. These
patterns could be useful for analyzing the general faults made
during the development time.

E. Developing Coding Guidelines to Reduce the Bugs
Once the common error patterns have been identified from a
software bug repository for a specific project, coding
guidelines can be developed to reduce the number of software
bugs. Coding templates and rules can be formed based on the
common error patterns and integrated with the
development IDE (Integrated Development Environment) to
prevent common types of software bugs.

F. Effective Workload Balancing for Developers
By looking at the history of a software bug repository, a suitable
or expert developer can be identified for assigning newly
arrived bugs. This technique can also be used to balance the
workloads of the developers involved in the bug resolution
process effectively. The effort spent by the
software developers can be captured from the software bug
repositories, and load balancing techniques can be applied to
balance the tasks assigned to the software developers.


Data mining requires preprocessing of the available data at the
outset. A number of techniques exist today for preprocessing
locally available data, but the problems associated with online
repositories have not yet been addressed. In this paper a new
framework for data preprocessing of software bug repositories is
proposed. The goal of this paper is to provide the research
community with a base for experimenting with new techniques in
online bug data acquisition, parsing and traversal, with the help
of a generic, interface-based framework implementation.
As a future direction of research, improvements can be made
by adding more features to the various modules. For example,
support for constraints and normalization on the local
database can be added to the schema generator module, and a few
more categories can be included in the parsing module.
Additionally, the framework can be extended to fetch and
parse online source code repositories in addition to
online software bug repositories.

[1] Alvaro Arenas, Juan Bicarregui, Tiziana Margaria, "The FMICS View on
the Verified Software Repository", Proceedings of Integrated Design and
Process Technology, IDPT-2006.
[2] Anupriya Ankolekar, Katia Sycara, James, Chris Welty, "Supporting
Online Problem Solving Communities with the Semantic Web",
Proceedings of WWW 2006, May 23–26, 2006.
[3] BugZilla, bug tracking tool:
[4] Chadd C. Williams and Jeffrey K. Hollingsworth, "Automatic Mining of
Source Code Repositories to Improve Bug Finding Techniques", IEEE
[5] Chadd C. Williams, Jeffrey K. Hollingsworth, "Recovering System
Specific Rules from Software Repositories", Proceedings of MSR 2005:
International Workshop on Mining Software Repositories, 2005.
[6] Christine A. Halverson, Jason B. Ellis, Catalina Danis, Wendy A.
Kellogg, "Designing Task Visualizations to Support the Coordination of
Work in Software Development", Proceedings of the 2006 20th
anniversary conference on Computer supported cooperative work, pp. 39
- 48, 2006.
[7] Christoph Kiefer, Abraham Bernstein, Jonas Tappolet, "Mining Software
Repositories with iSPARQL and a Software Evolution Ontology",
Proceedings of the 29th International Conference on Software
Engineering Workshops table of contents, 2007.
[8] E. Rowland Watkins, Denis A. Nicole, "Version Control in Online
Software Repositories", Version Control in Online Software Repositories.
ACM TechNews, 7 (872), 2005.
[9] Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes and Pierre
Baldi, “Mining Internet-Scale Software Repositories”, Advances in
Neural Information Processing Systems (NIPS), Vol. 21,2008.
[10] Fatudimu I.T, Musa A.G, Ayo C.K, Sofoluwe A. B, “Knowledge
Discovery in Online Repositories: A Text Mining Approach”, European
Journal of Scientific Research, ISSN 1450-216X, Vol. 22 No. 2, 2008 pp.
[11] Gina Venolia, "Textual Allusions to Artifacts in Software-related
Repositories", MSR 2006: The 3rd International Workshop on Mining
Software Repositories, 2006.
[12] Horacio Saggion and Robert Gaizauskas, "Mining on-line sources for
definition knowledge", Proceedings of the Seventeenth International
Florida Artificial Intelligence Research Society Conference, Miami
Beach, Florida, USA 2004.
[13] J. C. Bicarregui, C. A. R. Hoare, and J. C. P. Woodcock,"The Verified
Software Repository:a step towards the verifying compiler", Formal
Aspects of Computing, Springer London, Vol. 18, Number 2 / June, pp.
143-151, 2006.
[14] Jaime Spacco, Jaymie Strecker, David Hovemeyer, and William Pugh,
"Software Repository Mining with Marmoset: An Automated
Programming Project Snapshot and Testing System", Proceedings of
MSR 2005.
[15] Jason Novotny, Steven Tuecke, Von Welch, "An Online Credential
Repository for the Grid: MyProxy", Proceedings of the Tenth
International Symposium on High Performance Distributed Computing
(HPDC-10), IEEE Press, August 2001.
[16] Java, the open source programming API: http://
[17] JAXP, Java XML parsing API:
[18] Jeffrey S. Poulin, "Populating Software Repositories: Incentives and
Domain-Specific Software", Journal of Systems and Software archive,
Vol 30 , Issue 3, Special issue on software reuse, pp. 187 - 199, 1995.
[19] JIRA, bug tracking tool :
[20] John Anvik, Lyndon Hiew and Gail C. Murphy, "Coping with an Open
Bug Repository", OOPSLA workshop on eclipse technology eXchange
archive, Proceedings of the 2005 OOPSLA workshop on Eclipse
technology eXchange, pp. 35 - 39, 2005.
[21] John Anvik, Lyndon Hiew and Gail C. Murphy, "Who Should Fix This
Bug?", Proceedings of ICSE’06, Shanghai, China, 2006.
[22] Keir Mierle, Kevin Laven, Sam Roweis, Greg Wilson, "Mining Student
CVS Repositories for Performance Indicators", Proceedings of the 2005
International Workshop on Mining Software Repositories (MSR2005),
pp.41--45, May 2005.
[23] Lucas D. Panjer, "Predicting Eclipse Bug Lifetimes", Proceedings of the
Fourth International Workshop on Mining Software Repositories, 2007 .
[24] Lucian Voinea, Alexandru Telea, “An Open Framework for CVS
Repository Querying, Analysis and Visualization”, Proceedings of the
2006 international workshop on Mining software repositories table of
contents, Shanghai, China, pp. 33 - 39, 2006.
[25] MantisBT, bug tracking tool :
[26] MySql, the open source database management system:
[27] Nagwani, N.K. Singh, P., “Bug Mining Model Based on
Event-Component Similarity to Discover Similar and Duplicate GUI
Bugs", Advance Computing Conference, 2009. IACC 2009. IEEE
International, pp.1388-1392, 2009.
[28] Olga Baysal, Michael W. Godfrey, Robin Cohen, "A Bug You Like: A
Framework for Automated Assignment of Bugs", Proc. of 2009 IEEE Intl.
Conference on Program Comprehension (ICPC-09), 17-19 May 2009.
[29] Randall H. Trigg, Jeanette Blomberg, Lucy Suchman, "Moving document
collections online: The evolution of a shared repository", Proceedings of
the Sixth European Conference on Computer-Supported Cooperative
Work, 12-16 September 1999.
[30] Georgios Gousios, Eirini Kalliamvakou and Diomidis
Spinellis, "Measuring Developer Contribution from Software
Repository Data", Proceedings of the 2008 international working
conference on Mining software repositories, pp 129-132, 2008.
[31] Romain Robbes, "Mining a Change-Based Software Repository",
Proceedings in Fourth International Workshop on Mining Software
Repositories MSR '07, 2007.
[32] Scott Henninger, "An Evolutionary Approach to Constructing Effective
Software Reuse Repositories", ACM Transactions on Software
Engineering and Methodology (TOSEM) archive, Vol. 6 , Issue 2, pp.
111 - 140, 1997.
[33] Sean and Hojun: Automated Detection of Duplicate Bug Reports with
Semantic Concepts, IEEE COMPSAC, 2008.
[34] Shih-Fu Chang, John R. Smith, Mandis Beigi, and Ana Benitez, "Visual
Information Retrieval from Large Distributed On-line Repositories",
Communications of the ACM archive, Volume 40 , Issue 12, pp 63 - 71,
[35] Shuji Morisaki, Akito Monden, Tomoko Matsumura, Haruaki Tamada
and Kenichi Matsumoto: Defect Data Analysis Based on Extended
Association Rule Mining, 2007.
[36] Sushil Bajracharya, Joel Ossher, Cristina Lopes, "Sourcerer: An
internet-scale software repository", Proceedings of the 2009 ICSE
Workshop on Search-Driven Development-Users, Infrastructure, Tools
and Evaluation table of contents, pp. 1-4, 2009.
[37] Xiaoyin Wang, Lu Zhang, Tao Xie, John Anvik and Jiasu Sun, "An
Approach to Detecting Duplicate Bug Reports using Natural Language
and Execution Information", International Conference on Software
Engineering archive, Proceedings of the 30th international conference on
Software engineering, pp. 461-470, 2008.

Naresh Kumar Nagwani was born on 15
1980 at Raipur, India. He completed his graduation in
Computer Science & Engineering in 2001 from Guru
Ghasidas University, Bilaspur. He completed his post
graduation M.Tech. in Information Technology from
ABV- Indian Institute of Information Technology,
Gwalior in 2005. His area of interest is DBMS, Data
Mining, Text Mining and Information Retrieval. His
employment experience includes SSCET Bhilai, Team Lead in Persistent
Systems Limited and NIT Raipur. Presently he is assistant professor at
department of computer science & engineering, National Institute of
Technology, Raipur.

Dr. Shrish Verma has completed his graduation in
Electronics & Telecommunication Engineering and his
post graduation M.Tech. in Computer Engineering from
Indian Institute of Technology, Kharagpur. He has
completed his PhD in Engineering from Pt. Ravi Shankar
Shukla University Raipur. Presently he is head &
associate professor at department of information
technology, National Institute of Technology, Raipur.