Using Machine Learning for Software Criticality Evaluation
Miyoung Shin (1) and Amrit L. Goel (2)
(1) Bioinformatics Team, Future Technology Research Division, ETRI
Daejeon 305-350, Korea
shinmy@etri.re.kr
(2) Dept. of Electrical Engineering and Computer Science, Syracuse University
Syracuse, New York 13244, USA
goel@ecs.syr.edu
Abstract. During software development, early identification of critical components is necessary
to allocate adequate resources and so ensure high quality of the delivered system. The purpose of this
paper is to describe a methodology for modeling component criticality based on its software
characteristics. This involves developing a relationship between the input and output
variables without knowledge of their joint probability distribution. Machine learning techniques
have been remarkably successful in such applications, and one such technique is employed here.
In particular, we use software component data from the NASA metrics database and develop
radial basis function classifiers using our new algebraic algorithm for determining the model
parameters. Using a principled approach for classifier development and evaluation, we show that
design characteristics can yield parsimonious classifiers whose classification performance on test
data is comparable to that of classifiers based on coding metrics.
1. Introduction
In spite of much progress in the theory and practice of software engineering, software systems continue to
suffer from cost overruns, schedule slips and poor quality. Due to the lack of a sound mathematical basis
for this discipline, empirical models play a crucial role in planning, monitoring and control during
software system development.
Machine learning techniques concern programs that learn relationships between the inputs and
outputs of a system without requiring knowledge of their distributions. Such techniques have been found
very useful in many applications, including some aspects of software engineering. In this paper our
interest is in relationships between the characteristics of software components and their criticality levels.
Here the inputs are module characteristics and the output is the module's criticality level. In most
software engineering applications the underlying distributions are not known, and hence it is highly
desirable to employ machine learning techniques to develop input-output relationships.
Two main classes of models used in software engineering applications are effort prediction [1] and module
classification [2]. A number of studies have dealt with both types of models. In this paper we
address the second class, viz. the development of two-class classifiers using machine learning.
A large number of studies on the use of classification techniques for software components have been
reported in the literature [see, e.g., 2, 3, 4, 5, 6, 7]. In most of these, a classification model is built using
the known values of the relevant software characteristics, called metrics, and known class labels. Then,
based on metrics data about a previously unseen software component, the classification model is used to
assess its class, viz. high or low criticality. A recent case study [2] provides a good review and summary
of the main techniques used for this purpose.
Some of the commonly used techniques are listed below. For applications of these and related techniques in
software engineering, see [7, 8].
Logistic regression
Case based reasoning (CBR)
Classification and regression trees (CART)
Quinlan’s classification trees (C4.5, SEE 5.0)
Each of these, and other approaches, has its own pros and cons. The usual criteria for evaluating classifiers
are classification accuracy, time to learn, time to use, robustness, interoperability, and classifier complexity
[9, 10]. For software applications, accuracy, robustness, interoperability and complexity are of primary
concern.
In developing software classifiers, an important consideration is which metrics to use as features. An
important goal of this paper is to evaluate the efficacy and accuracy of component classifiers that use only a
few software metrics available at early stages of the software development life cycle. In particular, we
are interested in developing good machine learning classifiers based on software design metrics and
evaluating their efficacy relative to those that employ coding metrics and combined design and coding
metrics. We employ radial basis function (RBF) classifiers for three reasons: RBF is a versatile
modeling technique by virtue of its separate treatment of non-linear and linear mappings; it possesses
the mathematical properties of universal and best approximation [1, 11]; and our recent algebraic
algorithm for RBF design [10, 11] provides an easy-to-use and principled methodology for developing
classifiers. Finally, we employ Gaussian kernels, as they are the most popular kernels in RBF applications
and also possess the above mathematical properties.
This paper is organized as follows. In Section 2, we provide a brief description of the data set employed.
An outline of the RBF model and classification methodology is given in Section 3. The main results of the
empirical study are discussed in Section 4. Finally, some concluding remarks are presented in Section 5.
2. Data Description
The data for this study were obtained from a NASA software metrics database available in the public
domain. It provides various project and process measures for space-related software systems. The particular
data set used here consists of eight systems with a total of 796 components. The number of components in a
system varies from 45 to 166, with an average of about 100 components per system. The size of the data set
is a little less than half a million program lines. Since our primary interest in this study is to evaluate
classification performance based on design and coding measures, we considered only the relevant metrics.
They are listed in Table 1. The first three metrics (x1, x2, x3) are design measures and the last three
(x4, x5, x6) are coding measures. Two statistics, the average and the standard deviation, are also listed
for each metric in Table 1. The design metrics are available to software engineers much earlier in the life
cycle than the coding metrics, so predictive models based on design metrics are more useful than those
based on coding metrics. Typical values of the six metrics for five components in the database are shown
in Table 2, along with the component class label, where 1 refers to high criticality and 0 to low criticality.
Table 1: List of Metrics
Variable  Description                           Avg      Std. dev
x1        Function calls from this component    9.51     11.94
x2        Function calls to this component      3.91     8.45
x3        Input/Output component parameters     27.28    23.37
x4        Size in lines                         257.49   171.22
x5        Comment lines                         138.94   102.22
x6        Number of decisions                   21.18    21.22
y         Component class (0 or 1)
Table 2: Values of selected metrics for five components from the database
Component number  x1   x2   x3   x4    x5    x6    Class
1                 16   39   1    689   388   25    1
2                 0    4    0    361   198   25    0
3                 5    0    3    276   222   2     0
4                 4    45   1    635   305   32    0
5                 38   41   22   891   407   130   1
Metrics data for software systems tend to be highly skewed: a large number of components have
small values and a few components have large values. This is also true for the data analyzed in this case
study. In general, such metrics data tend to follow a distribution whose mean roughly equals its standard
deviation, as is the case for an exponential distribution (for a Poisson distribution, by contrast, the mean
equals the variance). Some evidence in support of this observation is provided by the fact that, for several
metrics in Table 1, the average is almost equal to the standard deviation.
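The closeness of average and standard deviation can be checked directly from Table 1. The following is a minimal sketch that computes the ratio SD / average for each metric; the dictionary and function names are illustrative only, and the numbers are copied from Table 1.

```python
# Averages and standard deviations copied from Table 1.
table1 = {
    "x1": (9.51, 11.94),
    "x2": (3.91, 8.45),
    "x3": (27.28, 23.37),
    "x4": (257.49, 171.22),
    "x5": (138.94, 102.22),
    "x6": (21.18, 21.22),
}


def sd_to_mean_ratio(avg, sd):
    """Return SD / average (the coefficient of variation)."""
    return sd / avg


ratios = {name: sd_to_mean_ratio(avg, sd) for name, (avg, sd) in table1.items()}

for name, ratio in ratios.items():
    print(f"{name}: SD/avg = {ratio:.2f}")
```

A ratio near 1 means the average and the standard deviation nearly coincide. This holds closely for x6 (about 1.00) and more loosely for x1 (about 1.26) and x3 (about 0.86), while x2 (about 2.16) is much more dispersed.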
The classification of modules into high criticality (class 1) and low criticality (class 0) was determined on
the basis of the actual number of faults detected. Modules with 5 or fewer faults were labeled class 0 and
all others class 1. This classification resulted in about 25% of the components in class 1 and about 75% in
class 0. The threshold separating high and low criticality classes is determined by the application domain;
the classification used here is quite typical of real-world applications.
The 796-module data set was randomly divided into three subsets: 398 modules for classifier development
(training), 199 for validation, and 199 for testing. Further, ten random permutations of the data set were
prepared to study the variability in classification accuracy and model complexity due to the random
partitioning of the data into three sets.
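The labeling rule and the random partitioning described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names, the use of Python's `random` module and the seeds 0 to 9 are all hypothetical choices, and components are represented only by their indices.

```python
import random


def label_component(num_faults, threshold=5):
    """Class 1 (high criticality) if more than `threshold` faults, else class 0."""
    return 1 if num_faults > threshold else 0


def partition(n_components=796, sizes=(398, 199, 199), seed=0):
    """Randomly split component indices into training, validation and test sets."""
    assert sum(sizes) == n_components
    indices = list(range(n_components))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train, n_val, _ = sizes
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Ten random permutations of the data, one per (arbitrary) seed.
permutations = [partition(seed=s) for s in range(10)]
```

Fixing the seed for each permutation keeps every split reproducible, so differences between permutations reflect only the random assignment of components to the three sets.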
3. Classification Methodology
The objective of this study is to construct a model that captures an unknown input-output mapping
on the basis of limited evidence about its nature. The evidence available is a set of labeled historical data,
called the training set, denoted as
\[ D = \{ (x_i, y_i) : x_i \in \mathbb{R}^d, \; y_i \in \{0, 1\}, \; i = 1, \ldots, n \}, \]
where n is the number of training components. Here both the d-dimensional inputs x_i and their
corresponding outputs y_i are made available, and the outputs represent the class label, zero or one.
The RBF model is of the form
\[ f(x) = \sum_{j=1}^{m} w_j \, \Phi\left( \lVert x - \mu_j \rVert / \sigma_j \right), \]
where Φ(.) is called a basis function and μ_j and σ_j are called the centre and the width of the j-th basis
function, respectively. Also, w_j is the weight associated with the j-th basis function output and m is the
number of basis functions. For the common case where σ_j = σ, j = 1,…,m, the RBF model is fully
determined by the parameter set P = (m, μ, σ, w). Also, we employ Gaussian kernels in this study so that
\[ \Phi\left( \lVert x - \mu_j \rVert / \sigma \right) = \exp\left( - \lVert x - \mu_j \rVert^2 / \sigma^2 \right). \]
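To make the roles of the parameters (m, μ, σ, w) concrete, the following minimal sketch evaluates a Gaussian RBF model with a common width. The function names, the example centres and weights, and in particular the 0.5 decision threshold used to turn the model output into a class label are assumptions made for illustration; the text does not state the decision rule.

```python
import math


def gaussian_basis(x, mu, sigma):
    """Gaussian basis function exp(-||x - mu||^2 / sigma^2)."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-sq_dist / sigma ** 2)


def rbf_output(x, centres, sigma, weights):
    """Weighted sum of m Gaussian basis functions with a common width sigma."""
    return sum(w * gaussian_basis(x, mu, sigma) for w, mu in zip(weights, centres))


def classify(x, centres, sigma, weights, threshold=0.5):
    """Assumed decision rule: class 1 if the model output reaches the threshold."""
    return 1 if rbf_output(x, centres, sigma, weights) >= threshold else 0


# Made-up model with m = 2 basis functions in d = 3 dimensions.
centres = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
weights = [0.0, 1.0]
sigma = 1.0

print(classify((1.0, 1.0, 1.0), centres, sigma, weights))  # input at the second centre
print(classify((0.0, 0.0, 0.0), centres, sigma, weights))  # input far from the second centre
```

In this toy model only the second basis function carries weight, so an input located at its centre produces an output of 1 and is assigned to class 1, while an input at the first centre produces exp(-3), roughly 0.05, and is assigned to class 0.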
In practice, we seek a classifier that is neither too simple nor too complex. A classifier that is too simple
will suffer from underfitting because it does not learn enough from the data and hence provides a poor fit.
On the other hand, a classifier that is too complex learns too many details of the data, including
noise, and thus suffers from overfitting; it cannot generalize well to unseen data. Hence an
important issue is to determine an appropriate number of basis functions (m) that represents a good
compromise between these competing goals.
In this study we employed our recently developed algorithm [1, 11] to achieve such a compromise.
In this algorithm, a complexity control parameter is specified by the user; we used a
default value of 0.5 percent. The classifier width values depend on the dimensionality of the input data. The
algorithm automatically computes the other model parameters m, μ and w. The mathematical and
computational details of the algorithm are beyond the scope of this paper; the reader is referred
to [11].
Classifiers for specified values of σ are developed using the training data, and the classification error (CE)
for each model is computed as the fraction of incorrectly classified components. Next, for each model,
the validation data is used to evaluate the validation error. The model with the smallest validation error is
the selected classifier. Finally, the classification error of the selected classifier on the test set is computed;
it provides an estimate of the classification error for future components, for which only the metrics data
would be available and the class label would need to be determined.
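The train, validate and test procedure can be sketched generically as follows. The sketch covers only the error computation and the selection step; it does not implement the design algorithm of [11]. The candidate classifiers are toy threshold rules keyed by hypothetical σ values, and all data values are made up for illustration.

```python
def classification_error(predict, inputs, labels):
    """Fraction of components whose predicted class differs from the true label."""
    wrong = sum(1 for x, y in zip(inputs, labels) if predict(x) != y)
    return wrong / len(labels)


def select_classifier(candidates, val_inputs, val_labels):
    """Return the (sigma, predict) pair with the smallest validation error."""
    return min(
        candidates.items(),
        key=lambda item: classification_error(item[1], val_inputs, val_labels),
    )


# Toy stand-ins for classifiers already developed on training data, one per sigma.
candidates = {
    0.2: lambda x: 1 if x[0] > 10 else 0,
    0.6: lambda x: 1 if x[0] > 30 else 0,
}

# Made-up validation and test data (one metric per component).
val_inputs, val_labels = [(5,), (20,), (35,), (40,)], [0, 0, 1, 1]
test_inputs, test_labels = [(8,), (25,), (50,), (33,)], [0, 1, 1, 1]

best_sigma, best_predict = select_classifier(candidates, val_inputs, val_labels)
test_error = classification_error(best_predict, test_inputs, test_labels)
print(best_sigma, test_error)
```

The test set plays no part in the selection, which is why the resulting test error can serve as an estimate of the error on future components.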
4. Classification Results
In this section we present and discuss the results for three experiments using (1) design metrics, (2) coding
metrics, and (3) combined design and coding metrics. In each case, we develop RBF classifiers according
to the methodology described above, viz. develop classifiers on the training data, select the best one based
on the validation data, and estimate future performance using the test data. The model development
algorithm used in this study is deterministic: it yields the same model for a given data set, with no
statistical variability. Therefore, we do not need to perform multiple runs for a given permutation of the
data set; the results for one run are repeatable. The results for the ten permutations, of course, will vary due
to the different random assignments of components among the training, validation and test sets.
In evaluating the results, we are primarily interested in classifier complexity and classification error. A
commonly used measure of RBF model complexity is the number of basis functions (m), and it is also
used here. The results are presented below.
4.1 Design Metrics
Model complexity and classification errors based on the design metrics data (x1, x2, x3) are listed in
Table 3. We note that m varies from 3 to 7 with an average value of 5.1. The training error varies from
21.6% to 27.1% with an average of 24.3%. The validation error varies from 21.1% to 29.2% with an
average of 25.9%, and the test error varies from 21.6% to 28.1% with an average of 25.0%. Clearly, there
is noticeable variation among the results from different permutations. A plot of the three error measures is
given in Figure 1 and illustrates the variability seen in Table 3.
Table 3. Classification results for design metrics.
                  Classification Error (%)
Permutation  m    Training  Validation  Test
1            4    27.1      29.2        21.6
2            6    25.2      23.6        24.6
3            4    25.6      21.1        26.1
4            7    24.9      26.6        22.6
5            4    21.6      27.6        28.1
6            7    24.1      25.1        24.6
7            3    22.6      26.6        24.6
8            5    24.4      28.6        24.1
9            7    24.4      28.6        24.1
10           4    23.1      24.6        27.1
In classification studies, the test data accuracy is of primary interest, while the other values provide useful
insights into the nature of the classifier and the model development process. Therefore, we concentrate only
on the test error. Below, we compute confidence bounds for the true test error based on the results from the
ten permutations. It should be noted that these confidence bounds are not the same as those that might be
obtained from only one permutation by using techniques such as the bootstrap.
The standard deviation (SD) of the test errors from Table 3 is 1.97%. Using the well-known t-statistic, 95%
confidence bounds for the true, unknown test error are given below.
{Average Test Error ± t(9; 0.025) * (SD of Test Error) / √10}
= {24.95 ± 2.26 (1.97) / √10}
= {23.60, 26.40}
The above values are interpreted as follows. Based on the test errors presented in Table 3, we are
95% confident that the true test error for the RBF model is between 23.6% and 26.4%. The 95% level is a
commonly used value; bounds for other levels can be easily computed. For example, the 90% confidence
bounds for the test error here are {23.81, 26.05}. The bounds get narrower as the confidence level
decreases and wider as it increases.
4.2 Coding and Combined Measures
Since our focus is on model complexity and test error, we present only these values in Table 4 for the
coding measures (x4, x5, x6). Also given are the values for the combined (design plus coding) measures
(x1 to x6). We note that model complexity here has much more variability than in the case of design
measures. Standard deviations and confidence bounds for these cases were obtained as above. A summary
of the test error results for the three cases is given in Table 5.
Table 4. Model complexity and test errors.
             Coding Measures       Combined Metrics
Permutation  m    Test Error (%)   m    Test Error (%)
1            2    14.6             6    20.6
2            2    26.1             2    26.1
3            14   24.1             17   26.1
4            22   18.6             6    21.1
5            4    26.1             4    27.6
6            12   25.1             16   24.6
7            4    23.6             6    22.6
8            9    23.6             16   24.6
9            3    23.6             6    22.6
10           21   24.6             28   27.6
Table 5. Summary of test error results
                                  Confidence Bounds
Metrics           Average  SD     90%              95%
Design Metrics    24.95    1.97   {23.81, 26.05}   {23.60, 26.40}
Coding Metrics    23.00    3.63   {20.89, 25.11}   {20.40, 25.60}
Combined Metrics  24.35    2.54   {22.89, 25.81}   {22.55, 26.15}
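The summary statistics for the coding and combined metrics can be recomputed from the test errors in Table 4. The sketch below applies the same t-based formula used for the design metrics; the function and variable names are illustrative, and the critical values 1.833 and 2.262 are the standard two-sided Student t table values for 9 degrees of freedom at the 90% and 95% levels.

```python
import math
import statistics

# Two-sided Student t critical values for 9 degrees of freedom (standard table values).
T_CRIT_DF9 = {0.90: 1.833, 0.95: 2.262}

# Test errors (%) for the ten permutations, copied from Table 4.
coding = [14.6, 26.1, 24.1, 18.6, 26.1, 25.1, 23.6, 23.6, 23.6, 24.6]
combined = [20.6, 26.1, 26.1, 21.1, 27.6, 24.6, 22.6, 24.6, 22.6, 27.6]


def confidence_bounds(errors, level):
    """Return (mean, sd, lower, upper) for a t-based interval on the mean error."""
    n = len(errors)
    assert n == 10  # the critical values above apply only to n - 1 = 9
    mean = statistics.mean(errors)
    sd = statistics.stdev(errors)  # sample SD (n - 1 in the denominator)
    half_width = T_CRIT_DF9[level] * sd / math.sqrt(n)
    return mean, sd, mean - half_width, mean + half_width


for name, errors in (("coding", coding), ("combined", combined)):
    for level in (0.90, 0.95):
        mean, sd, lower, upper = confidence_bounds(errors, level)
        print(f"{name} {level:.0%}: mean={mean:.2f} sd={sd:.2f} "
              f"bounds=({lower:.2f}, {upper:.2f})")
```

For the coding metrics this gives an average of 23.00 and an SD of about 3.63, with 90% bounds of roughly {20.90, 25.10}; for the combined metrics the average is 24.35 and the SD about 2.54. These values agree with Table 5 to within rounding.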
4.3 Discussion
The results in Table 5 provide insights into the two issues addressed in this paper. The first, the
variability due to the random assignment of components to the training, validation and test sets, is indicated
by the widths of the confidence intervals. For both the 90% and 95% bounds, the widths are quite
reasonable. This is especially noteworthy in light of the fact that software engineering data tend to be quite
noisy. Further, the variability for coding metrics alone is considerably higher than for the other two
cases. Regarding the second issue, the relative classification accuracies for the three cases, it appears that
classification based on the design metrics alone is comparable to the other two cases. Overall, a test error
of about 24% is a reasonable summary for the data analyzed in this study; this value is contained in all the
confidence bounds listed in Table 5.
5. Concluding Remarks
In this paper we studied an important software engineering issue: identifying fault-prone components
using selected software metrics as predictor variables. A software criticality evaluation model that is
parsimonious and employs easily available metrics can be an effective analytical tool for reducing the
likelihood of operational failures. However, the efficacy of such a tool depends on the mathematical
properties of the classifier and its design algorithm. Therefore, in this study we employed Gaussian radial
basis function classifiers, which possess the powerful properties of best and universal approximation.
Further, our new design algorithm yields consistent results using only algebraic methods. We also used
input data permutations, and classifiers for each permutation, to compute confidence bounds for the test
error. A comparison of these bounds based on design, coding and combined metrics indicated that the
errors in the three cases could be considered statistically equal. Based on the analyses presented here and
on the combination of model and algorithm properties, we believe we have established, at least empirically,
that design metrics, which are readily available early in the software development cycle, can be effective
predictors of potentially critical components. Early attention and additional resources allocated to such
components can be instrumental in minimizing operational failures and thus improving system quality.
References:
1. Shin, M. and Goel, A.: Empirical Data Modelling in Software Engineering using Radial Basis
Functions. IEEE Transactions on Software Engineering, 28 (2002) 567-576.
2. Khoshgoftaar, T. and Seliya, N.: Comparative Assessment of Software Quality Classification
Techniques: An Empirical Case Study. Empirical Software Engineering, 9 (2004) 229-257.
3. Khoshgoftaar, T., Yuan, X., Allen, E. B., Jones, W. D., and Hudepohl, J. P.: Uncertain
Classification of Fault-Prone Software Modules. Empirical Software Engineering, 7 (2002) 297-318.
4. Lanubile, F., Lonigro, A., and Visaggio, G.: Comparing Models for Identifying Fault-Prone
Software Components. 7th International Conference on Software Engineering and Knowledge
Engineering, Rockville, Maryland, June 1995, 312-319.
5. Pedrycz, W.: Computational Intelligence as an Emerging Paradigm of Software Engineering.
Fourteenth International Conference on Software Engineering and Knowledge Engineering, Ischia,
Italy, July 2002, 7-14.
6. Pighin, M. and Zamolo, R.: A Predictive Metric based on Discriminant Statistical Analysis.
International Conference on Software Engineering, Boston, MA, 1997, 262-270.
7. Zhang, D. and Tsai, J. J. P.: Machine Learning and Software Engineering. Software Quality
Journal, 11 (2003) 87-119.
8. Shull, F., Mendonça, M. G., Basili, V., Carver, J., Maldonado, J. C., Fabbri, S., Travassos, G. H.,
and Ferreira, M. C.: Knowledge-Sharing Issues in Experimental Software Engineering. Empirical
Software Engineering, 9 (2004) 111-137.
9. Goel, A. L. and Shin, M.: Tutorial on Software Models and Metrics. International Conference on
Software Engineering, Boston, MA (1997).
10. Han, J. and Kamber, M.: Data Mining. Morgan Kaufmann, 2001.
11. Shin, M. and Goel, A.: Design and Evaluation of RBF Models based on RC Criterion. Technical
Report, Syracuse University, 2003.