Int. J. Remote Sensing, 2002, vol. 23, no. 4, 725–749

An assessment of support vector machines for land cover classification

C. HUANG†
Department of Geography, University of Maryland, College Park, MD 20742, USA

L. S. DAVIS
Institute for Advanced Computing Studies, University of Maryland, College Park, MD 20742, USA

and J. R. G. TOWNSHEND
Department of Geography and Institute for Advanced Computing Studies, University of Maryland, College Park, MD 20742, USA

(Received 27 October 1999; in final form 27 November 2000)

Abstract. The support vector machine (SVM) is a group of theoretically superior machine learning algorithms. It was found competitive with the best available machine learning algorithms in classifying high-dimensional data sets. This paper gives an introduction to the theoretical development of the SVM and an experimental evaluation of its accuracy, stability and training speed in deriving land cover classifications from satellite images. The SVM was compared to three other popular classifiers, including the maximum likelihood classifier (MLC), neural network classifiers (NNC) and decision tree classifiers (DTC). The impacts of kernel configuration on the performance of the SVM, and of the selection of training data and input variables on the four classifiers, were also evaluated in this experiment.

1. Introduction

Land cover information has been identified as one of the crucial data components for many aspects of global change studies and environmental applications (Sellers et al. 1995). The derivation of such information increasingly relies on remote sensing technology due to its ability to acquire measurements of land surfaces at various spatial and temporal scales. One of the major approaches to deriving land cover information from remotely sensed images is classification. Numerous classification algorithms have been developed since the first Landsat image was acquired in the early 1970s (Townshend 1992, Hall et al. 1995). Among the most popular are the maximum likelihood classifier (MLC), neural network classifiers and decision tree classifiers.

The MLC is a parametric classifier based on statistical theory. Despite limitations due to its assumption of a normal distribution of class signatures (e.g. Swain and Davis 1978), it is perhaps one of the most widely used classifiers (e.g. Wang 1990, Hansen et al. 1996). Neural networks avoid some of the problems of the MLC by adopting a non-parametric approach. Their potential discriminating power has attracted a great deal of research effort. As a result, many types of neural networks have been developed (Lippman 1987); the most widely used in the classification of remotely sensed images is a group of networks called the multi-layer perceptron (MLP) (e.g. Paola and Schowengerdt 1995, Atkinson and Tatnall 1997).

†Current address of Chengquan Huang: Raytheon ITSS, USGS/EROS Data Center, Sioux Falls, SD 57108, USA; e-mail address: huang@edcmail.cr.usgs.gov

International Journal of Remote Sensing
ISSN 0143-1161 print / ISSN 1366-5901 online © 2002 Taylor & Francis Ltd
http://www.tandf.co.uk/journals
DOI: 10.1080/01431160110040323

A decision tree classifier takes a different approach to land cover classification. It breaks an often very complex classification problem into multiple stages of simpler decision-making processes (Safavian and Landgrebe 1991). Depending on the number of variables used at each stage, there are univariate and multivariate decision trees (Friedl and Brodley 1997). Univariate decision trees have been used to develop land cover classifications at a global scale (DeFries et al. 1998, Hansen et al. 2000). Though multivariate decision trees are often more compact and can be more accurate than univariate decision trees (Brodley and Utgoff 1995), they involve more complex algorithms and, as a result, are affected by a suite of algorithm-related factors (Friedl and Brodley 1997). The univariate decision tree developed by Quinlan (1993) is evaluated in this study.

The support vector machine (SVM) represents a group of theoretically superior machine learning algorithms. As shall be described in the following section, the SVM employs optimization algorithms to locate the optimal boundaries between classes. Statistically, the optimal boundaries should generalize to unseen samples with the fewest errors among all possible boundaries separating the classes, therefore minimizing the confusion between classes. In practice, the SVM has been applied to optical character recognition, handwritten digit recognition and text categorization (Vapnik 1995, Joachims 1998b). These experiments found the SVM to be competitive with the best available classification methods, including neural networks and decision tree classifiers. The superior performance of the SVM was also demonstrated in classifying hyperspectral images acquired from the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) (Gualtieri and Cromp 1998). While hundreds of variables were used as the input in the experiments mentioned above, there are far fewer variables in data acquired from operational sensor systems such as Landsat, the Advanced Very High Resolution Radiometer (AVHRR) and the Moderate Resolution Imaging Spectroradiometer (MODIS). Because these are among the major sensor systems from which land cover information is derived, an evaluation of the performance of the SVM using images from such sensor systems should have practical implications for land cover classification. The purpose of this paper is to demonstrate the applicability of this algorithm to deriving land cover from such operational sensor systems and to evaluate its performance systematically in comparison to other popular classifiers, including the statistical maximum likelihood classifier (MLC), a back propagation neural network classifier (NNC) (Pao 1989) and a decision tree classifier (DTC) (Quinlan 1993). The SVM was implemented by Joachims (1998a) as SVMlight.

A brief introduction to the theoretical development of the SVM is given in the following section. This is deemed necessary because the SVM is relatively new to the remote sensing community as compared to the other three methods. The data set and experimental design are presented in §3. Experimental results are discussed in the following three sections, including the impacts of kernel configuration on the performance of the SVM, the comparative performances of the four classifiers and the impacts of non-algorithm factors. The results of this study are summarized in the last section.

2. Theoretical development of SVM

There are a number of publications detailing the mathematical formulation of the SVM (see e.g. Vapnik 1995, 1998, Burges 1998). The algorithm development of this section follows Vapnik (1995) and Burges (1998).

The inductive principle behind the SVM is structural risk minimization (SRM). According to Vapnik (1995), the risk of a learning machine (R) is bounded by the sum of the empirical risk estimated from training samples ($R_{\mathrm{emp}}$) and a confidence interval ($\Psi$): $R \leq R_{\mathrm{emp}} + \Psi$. The strategy of SRM is to keep the empirical risk ($R_{\mathrm{emp}}$) fixed and to minimize the confidence interval ($\Psi$), or to maximize the margin between a separating hyperplane and the closest data points (figure 1). A separating hyperplane refers to a plane in a multi-dimensional space that separates the data samples of two classes. The optimal separating hyperplane is the separating hyperplane that maximizes the margin from the closest data points to the plane. Currently, one SVM classifier can only separate two classes. Integration strategies are needed to extend this method to classifying multiple classes.

2.1. The optimal separating hyperplane

Let the training data of two separable classes with k samples be represented by $(x_1, y_1), \ldots, (x_k, y_k)$, where $x \in R^n$ ($R^n$ being an n-dimensional space) and $y \in \{+1, -1\}$ is the class label. Suppose the two classes can be separated by two hyperplanes parallel to the optimal hyperplane (figure 1(a)):

$$w \cdot x_i + b \geq 1 \qquad \text{for } y_i = +1, \; i = 1, 2, \ldots, k \tag{1}$$

$$w \cdot x_i + b \leq -1 \qquad \text{for } y_i = -1 \tag{2}$$

Figure 1. The optimal separating hyperplane between (a) separable samples and (b) non-separable data samples.

where $w = (w_1, \ldots, w_n)$ is a vector of n elements. Inequalities (1) and (2) can be combined into a single inequality:

$$y_i [w \cdot x_i + b] \geq 1 \qquad i = 1, \ldots, k \tag{3}$$

As shown in figure 1, the optimal separating hyperplane is the one that separates the data with maximum margin. This hyperplane can be found by minimizing the norm of w, or the following function:

$$F(w) = \tfrac{1}{2}\,(w \cdot w) \tag{4}$$

under inequality constraint (3).

The saddle point of the following Lagrangian gives solutions to the above optimization problem:

$$L(w, b, \alpha) = \tfrac{1}{2}\,(w \cdot w) - \sum_{i=1}^{k} \alpha_i \{ y_i [w \cdot x_i + b] - 1 \} \tag{5}$$

where $\alpha_i \geq 0$ are Lagrange multipliers (Sundaram 1996). The solution to this optimization problem requires that the gradient of $L(w, b, \alpha)$ with respect to w and b vanishes, giving the following conditions:

$$w = \sum_{i=1}^{k} y_i \alpha_i x_i \tag{6}$$

$$\sum_{i=1}^{k} \alpha_i y_i = 0 \tag{7}$$

By substituting (6) and (7) into (5), the optimization problem becomes: maximize

$$L(\alpha) = \sum_{i=1}^{k} \alpha_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \tag{8}$$

under constraints $\alpha_i \geq 0$, $i = 1, \ldots, k$.

Given an optimal solution $\alpha^0 = (\alpha_1^0, \ldots, \alpha_k^0)$ to (8), the solution $w^0$ to (5) is a linear combination of training samples:

$$w^0 = \sum_{i=1}^{k} y_i \alpha_i^0 x_i \tag{9}$$

According to the Kuhn–Tucker theory (Sundaram 1996), only points that satisfy the equalities in (1) and (2) can have non-zero coefficients $\alpha_i^0$. These points lie on the two parallel hyperplanes and are called support vectors (figure 1). Let $x^0(1)$ be a support vector of one class and $x^0(-1)$ of the other; then the constant $b^0$ can be calculated as follows:

$$b^0 = -\tfrac{1}{2}\,[\, w^0 \cdot x^0(1) + w^0 \cdot x^0(-1) \,] \tag{10}$$

The decision rule that separates the two classes can be written as:

$$f(x) = \operatorname{sign} \Big( \sum_{\text{support vectors}} y_i \alpha_i^0 \,(x_i \cdot x) + b^0 \Big) \tag{11}$$
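The solution (6)–(11) can be checked by hand on a minimal two-point training set, for which the dual optimum is $\alpha_1 = \alpha_2 = 1/4$. The sketch below is our illustration of the formulas, not code from the paper:

```python
import numpy as np

# Two training samples on opposite sides of the origin (a hand-checkable case).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Dual solution derived by hand for this configuration:
# alpha_1 = alpha_2 = 1/4 satisfies sum(alpha_i * y_i) = 0 and maximizes (8).
alpha = np.array([0.25, 0.25])

# Equation (9): w0 as a linear combination of the support vectors.
w0 = (y * alpha) @ X          # -> [0.5, 0.5]

# Equation (10): b0 from one support vector of each class.
b0 = -0.5 * (w0 @ X[0] + w0 @ X[1])   # -> 0.0

# Equation (11): the decision rule.
def f(x):
    return np.sign((y * alpha) @ (X @ x) + b0)

# The margin constraints y_i (w0 . x_i + b0) = 1 hold at both support vectors.
print(w0, b0, f(np.array([2.0, 0.0])))
```

Both training points are support vectors here; the margin between the two parallel hyperplanes is $2/\|w^0\| = 2\sqrt{2}$.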

2.2. Dealing with non-separable cases

An important assumption in the above solution is that the data are separable in the feature space. It is easy to check that there is no optimal solution if the data cannot be separated without error. To resolve this problem, a penalty value C for misclassification errors and positive slack variables $\xi_i$ are introduced (figure 1(b)). These variables are incorporated into constraints (1) and (2) as follows:

$$w \cdot x_i + b \geq 1 - \xi_i \qquad \text{for } y_i = +1 \tag{12}$$

$$w \cdot x_i + b \leq -1 + \xi_i \qquad \text{for } y_i = -1 \tag{13}$$

$$\xi_i \geq 0, \qquad i = 1, \ldots, k \tag{14}$$

The objective function (4) then becomes

$$F(w, \xi) = \tfrac{1}{2}\,(w \cdot w) + C \Big( \sum_{i=1}^{k} \xi_i \Big)^{l} \tag{15}$$

where C is a preset penalty value for misclassification errors. If l = 1, the solution to this optimization problem is similar to that of the separable case.
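For l = 1, minimizing (15) subject to (12)–(14) is equivalent to an unconstrained problem in which the optimal slack variables are $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$, i.e. the hinge loss. The sketch below illustrates this on a tiny one-dimensional data set with one overlapping point, using plain subgradient descent; this is our illustration only, not the optimizer used by SVMlight:

```python
import numpy as np

# A tiny 1-D data set with one point on the "wrong" side, so no
# error-free separation exists and slack variables are required.
X = np.array([[-2.0], [-1.0], [1.0], [2.0], [-0.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0, 1.0])   # last sample overlaps the -1 class

C, lr = 1.0, 0.01
w, b = np.zeros(1), 0.0
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1.0                     # points with xi_i > 0
    # subgradient of 0.5*(w.w) + C * sum(max(0, 1 - y_i(w.x_i + b)))
    gw = w - C * (y[viol] @ X[viol])
    gb = -C * y[viol].sum()
    w, b = w - lr * gw, b - lr * gb

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))  # recovered slack variables
print(w, b, xi.round(2))
```

The four clean points end up correctly classified, while the overlapping point carries a non-zero slack, exactly the situation depicted in figure 1(b).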

2.3. Support vector machines

To generalize the above method to non-linear decision functions, the support vector machine implements the following idea: it maps the input vector x into a high-dimensional feature space H and constructs the optimal separating hyperplane in that space. Suppose the data are mapped into the high-dimensional space H through a mapping function $\Phi$:

$$\Phi : R^n \rightarrow H \tag{16}$$

A vector x in the feature space can be represented as $\Phi(x)$ in the high-dimensional space H. Since the only way in which the data appear in the training problem (8) is in the form of dot products of two vectors, the training algorithm in the high-dimensional space H would only depend on the data in this space through dot products, i.e. on functions of the form $\Phi(x_i) \cdot \Phi(x_j)$. Now, if there is a kernel function K such that

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \tag{17}$$

we would only need to use K in the training program without knowing the explicit form of $\Phi$. The same trick can be applied to the decision function (11), because there too the data appear only in the form of dot products. Thus if a kernel function K can be found, we can train and use a classifier in the high-dimensional space without knowing the explicit form of the mapping function. The optimization problem (8) can be rewritten as:

$$L(\alpha) = \sum_{i=1}^{k} \alpha_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{18}$$

and the decision rule expressed in equation (11) becomes:

$$f(x) = \operatorname{sign} \Big( \sum_{\text{support vectors}} y_i \alpha_i^0 \, K(x_i, x) + b^0 \Big) \tag{19}$$

A kernel that can be used to construct an SVM must meet Mercer's condition (Courant and Hilbert 1953). The following two types of kernels meet this condition and will be considered in this study (Vapnik 1995): the polynomial kernels,

$$K(x_1, x_2) = (x_1 \cdot x_2 + 1)^p \tag{20}$$

and the radial basis functions (RBF),

$$K(x_1, x_2) = e^{-c\,(x_1 - x_2)^2} \tag{21}$$
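Equation (17) can be verified directly for the polynomial kernel with p = 2 on two-dimensional inputs, where an explicit mapping $\Phi$ into a six-dimensional H is known in closed form. A sketch (function names are ours):

```python
import numpy as np

def poly_kernel(x1, x2, p=2):
    """Polynomial kernel of equation (20)."""
    return (x1 @ x2 + 1.0) ** p

def phi(x):
    """Explicit mapping into H for p = 2 and 2-D input, from expanding
    (x.z + 1)^2 = 1 + 2*x1*z1 + 2*x2*z2 + x1^2*z1^2 + x2^2*z2^2 + 2*x1*x2*z1*z2."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], x[0]**2, x[1]**2, s * x[0] * x[1]])

def rbf_kernel(x1, x2, c=1.0):
    """RBF kernel of equation (21); its feature space is infinite-dimensional,
    so no explicit phi can be written down."""
    d = x1 - x2
    return np.exp(-c * (d @ d))

a, b = np.array([0.5, -1.0]), np.array([2.0, 3.0])
print(poly_kernel(a, b), phi(a) @ phi(b))   # the two numbers agree
print(rbf_kernel(a, b))
```

The agreement of the two printed values is the point of the kernel trick: the left-hand side never leaves the original two-dimensional space.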

2.4. From binary classifier to multi-class classifier

In the above theoretical development, the SVM was developed as a binary classifier, i.e. one SVM can only separate two classes. Strategies are therefore needed to adapt this method to multi-class cases. Two simple strategies have been proposed to adapt the SVM to N-class problems (Gualtieri and Cromp 1998). One is to construct a machine for each pair of classes, resulting in N(N−1)/2 machines. When applied to a test pixel, each machine gives one vote to the winning class, and the pixel is labelled with the class having the most votes. The other strategy is to break the N-class case into N two-class cases, in each of which a machine is trained to classify one class against all others. When applied to a test pixel, a value measuring the confidence that the pixel belongs to a class can be calculated from equation (19), and the pixel is labelled with the class for which it has the highest confidence value (Vapnik 1995). Without an evaluation of the two strategies, the second one was used in this study because it only requires training N SVM machines for an N-class case, while for the same classification the first strategy requires training N(N−1)/2 machines.

With the second strategy each SVM machine is constructed to separate one class from all other classes. An obvious problem with this strategy is that, in constructing each SVM machine, the sizes of the two concerned classes can be highly unbalanced, because one of them is the aggregation of N−1 classes. For data samples that cannot be separated without error, a classifier may not be able to find a boundary between two highly unbalanced classes. For example, a classifier may not be able to find a boundary between the two classes shown in figure 2, because the classifier probably makes the fewest errors by labelling all pixels belonging to the smaller class with the larger one. To avoid this problem the samples of the smaller class are replicated such that the two classes have approximately the same size. Similar tricks were employed in constructing decision tree classifiers for highly unbalanced classes (DeFries et al. 1998).

Figure 2. An example of highly unbalanced training samples in a two-dimensional space defined by two arbitrary variables, features 1 and 2. A classifier might incur more errors by drawing boundaries between the two classes than by labelling pixels of the smaller class with the larger one.
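The one-against-the-rest strategy, together with the replication trick for unbalanced classes, can be sketched as follows. This is our illustration with hypothetical helper names, not the paper's implementation; the per-class confidences would come from the pre-sign value of equation (19):

```python
import numpy as np

rng = np.random.default_rng(0)

def balance_one_vs_rest(X, y, positive):
    """Build the two-class training set for one machine, replicating the
    smaller side so both sides end up of comparable size."""
    pos, rest = X[y == positive], X[y != positive]
    if len(pos) < len(rest):
        pos = np.tile(pos, (round(len(rest) / len(pos)), 1))
    elif len(rest) < len(pos):
        rest = np.tile(rest, (round(len(pos) / len(rest)), 1))
    Xb = np.vstack([pos, rest])
    yb = np.hstack([np.ones(len(pos)), -np.ones(len(rest))])
    return Xb, yb

def one_vs_rest_label(confidences):
    """Label a pixel with the class whose machine reports the highest
    confidence value."""
    return int(np.argmax(confidences))

X = rng.normal(size=(22, 2))
y = np.repeat([0, 1, 2], [2, 10, 10])     # class 0 is badly under-represented
Xb, yb = balance_one_vs_rest(X, y, positive=0)
print((yb == 1).sum(), (yb == -1).sum())  # roughly equal after replication
label = one_vs_rest_label(np.array([-0.3, 1.2, 0.7]))
```

Replication changes only the weight each minority sample carries in the penalty term of (15); it adds no new information.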

3. Data and experimental design

3.1. Data and preprocessing

A spatially degraded Thematic Mapper (TM) image and a corresponding reference map were used in this evaluation study. The TM image, acquired in eastern Maryland on 14 August 1985, has a spatial resolution of 28.5 m. The six spectral bands (bands 1–5 and 7) of the TM image were converted to top-of-atmosphere (TOA) reflectance according to Markham and Barker (1986). Atmospheric correction was not necessary because the image was quite clear within the study area. Three broad cover types (forest, non-forest land and water) were delimited from this image, giving a land cover map with the same spatial resolution as the TM image. This three-class scheme was selected to ensure high accuracy of the resulting land cover map at this resolution. Confused pixels were labelled according to aerial photographs and field visits covering the study area.

Both the TM image and the derived land cover map were degraded to a spatial resolution of 256.5 m with a degrading ratio of 9:1, i.e. each degraded pixel corresponds to 9 by 9 TM pixels. The main reason for evaluating the classifiers using degraded data is that a highly reliable reference land cover map with a reasonable number of classes can be generated at the degraded resolution. The image was degraded using a simulation programme embedded with models of the point spread functions (PSF) of the TM and MODIS sensors (Barker and Burelhach 1992). By considering the PSF of both sensor systems, the simulation programme gives more realistic images than spatial averaging (Justice et al. 1989). Overlaying the 256.5 m grids on the 28.5 m land cover map and calculating the proportions of forest, non-forest land and water within each 256.5 m grid gave proportion images of forest, non-forest land and water at the 256.5 m resolution. A land cover map at the 256.5 m resolution was developed by reclassifying the proportion images according to the class definitions given in table 1. These definitions were based on the IGBP classification scheme (Belward and Loveland 1996, DeFries et al. 1998). Class names were chosen to match the definitions used in this study.
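The proportion-image step can be sketched as a plain block average over 9 by 9 pixel windows. This illustration (the function name is ours) deliberately omits the PSF-based simulation of Barker and Burelhach (1992) that the study actually used for the image itself:

```python
import numpy as np

def class_proportions(labels, block=9, n_classes=3):
    """Aggregate a fine-resolution label map into per-block class
    proportions (the 9:1 degradation step for the reference map)."""
    h, w = labels.shape
    H, W = h // block, w // block
    props = np.zeros((n_classes, H, W))
    for c in range(n_classes):
        mask = (labels == c).astype(float)
        # average each block -> fraction of class c within that block
        props[c] = mask[:H * block, :W * block].reshape(H, block, W, block).mean(axis=(1, 3))
    return props

fine = np.zeros((18, 18), dtype=int)   # class 0 everywhere...
fine[:9, :9] = 1                       # ...except one 9x9 block of class 1
p = class_proportions(fine)
print(p[1])                            # block (0, 0) is 100% class 1
```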

3.2. Experimental design

Many factors affect the performance of a classifier, including the selection of training and testing data samples as well as of input variables (Gong and Howarth 1990, Foody et al. 1995). Because the impact of testing data selection on accuracy assessment has been investigated in many works (e.g. Genderen and Lock 1978, Stehman 1992), only the selection of training samples and the selection of input variables were considered in this study. In order to avoid biases in the confidence level of accuracy estimates due to inappropriately sampled testing data (Fitzpatrick-Lins 1981, Dicks and Lo 1990), the accuracy measure of each test was estimated from all pixels not used as training data.

Table 1. Definition of land cover classes for the Maryland data set.

Code  Cover type       Definition
1     Closed forest    tree cover > 60%, water ≤ 20%
2     Open forest      30% < tree cover ≤ 60%, water ≤ 20%
3     Woodland         10% < tree cover ≤ 30%, water ≤ 20%
4     Non-forest land  tree cover ≤ 10%, water ≤ 20%
5     Land-water mix   20% < water ≤ 70%
6     Water            water > 70%
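The reclassification of the proportion images by table 1 can be written as a simple rule cascade. A sketch (the function name is ours, and the treatment of boundary values is assumed to follow the table as printed):

```python
def classify_pixel(tree, water):
    """Assign a 256.5 m pixel a class code from its tree-cover and
    water proportions (both in [0, 1]), following table 1."""
    if water > 0.70:
        return 6          # Water
    if water > 0.20:
        return 5          # Land-water mix
    if tree > 0.60:
        return 1          # Closed forest
    if tree > 0.30:
        return 2          # Open forest
    if tree > 0.10:
        return 3          # Woodland
    return 4              # Non-forest land

print(classify_pixel(0.8, 0.1), classify_pixel(0.05, 0.5))  # -> 1 5
```

Testing water first makes classes 5 and 6 take precedence, which is what the "water ≤ 20%" clause attached to the forest classes implies.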

3.2.1. Training data selection

Training data selection is one of the major factors determining to what degree the classification rules can be generalized to unseen samples (Paola and Schowengerdt 1995). A previous study showed that this factor could be more important for obtaining accurate classifications than the selection of classification algorithms (Hixson et al. 1980). To assess the impact of training data size on different classification algorithms, the selected algorithms were tested using training data of varying sizes. Specifically, the four algorithms were trained using approximately 2, 4, 6, 8, 10 and 20% of the pixels of the entire image.

With data sizes fixed, training pixels can be selected in many ways. A commonly used sampling method is to identify and label small patches of homogeneous pixels in an image (Campbell 1996). However, adjacent pixels tend to be spatially correlated or have similar values (Campbell 1981). Training samples collected this way underestimate the spectral variability of each class and are likely to give degraded classifications (Gong and Howarth 1990). A simple method to minimize the effect of spatial correlation is random sampling (Campbell 1996). Two random sampling strategies were investigated in this experiment. One is called equal sample rate (ESR), in which a fixed percentage of pixels is randomly sampled from each class as training data. The other is called equal sample size (ESS), in which a fixed number of pixels is randomly sampled from each class as training data. In both strategies the total number of training samples is approximately the same as that calculated according to the predefined 2, 4, 6, 8, 10 and 20% sampling rates for the whole data set.
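The two strategies can be sketched as follows (an illustration with hypothetical function names; the paper gives no code):

```python
import numpy as np

rng = np.random.default_rng(42)

def equal_sample_rate(y, rate):
    """ESR: sample the same fixed fraction of pixels from every class."""
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        n = max(1, int(round(rate * len(members))))
        idx.append(rng.choice(members, size=n, replace=False))
    return np.concatenate(idx)

def equal_sample_size(y, total):
    """ESS: sample the same fixed number of pixels from every class."""
    classes = np.unique(y)
    per_class = total // len(classes)
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        idx.append(rng.choice(members, size=min(per_class, len(members)), replace=False))
    return np.concatenate(idx)

y = np.repeat([1, 2, 3], [600, 300, 100])   # an unbalanced class map
print(len(equal_sample_rate(y, 0.06)), len(equal_sample_size(y, 60)))
```

Both calls draw 60 training pixels here, but ESR keeps the class proportions of the map while ESS equalizes them.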

3.2.2. Selection of input variables

The six TM spectral bands roughly correspond to six MODIS bands at 250 m and 500 m resolutions (Barnes et al. 1998). Only the red (TM band 3) and near-infrared (NIR, TM band 4) bands are available at 250 m resolution. The other four TM bands are available at 500 m resolution. Because these four bands contain information that is complementary to the red and NIR bands (Townshend 1984, Toll 1985), not having them at 250 m resolution may limit the ability to derive land cover information at this resolution. Two sets of tests were performed to evaluate the impact of not having the four TM bands on land cover characterization at the 250 m resolution. In the first set only the red band, the NIR band and the normalized difference vegetation index (NDVI) were used as input to the classifiers, while in the second set the other four bands were also included. NDVI is calculated from the red and NIR bands as follows:

$$NDVI = \frac{NIR - red}{NIR + red} \tag{22}$$

Table 2 summarizes the training conditions under which the four classification algorithms were evaluated.
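Equation (22) in code form, assuming reflectance arrays as input:

```python
import numpy as np

def ndvi(nir, red):
    """Equation (22); inputs are reflectances as floats in [0, 1]."""
    return (nir - red) / (nir + red)

# Dense vegetation reflects strongly in the NIR and absorbs red light,
# so vegetated pixels yield NDVI values well above zero.
print(ndvi(np.array([0.5, 0.3]), np.array([0.05, 0.25])))
```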

Table 2. Training data conditions under which the classification algorithms were tested.

Sampling method    Sample size (% of entire image)    Number of input variables    Training case no.
Equal sample size   2    3     1
                    2    7     2
                    4    3     3
                    4    7     4
                    6    3     5
                    6    7     6
                    8    3     7
                    8    7     8
                   10    3     9
                   10    7    10
                   20    3    11
                   20    7    12
Equal sample rate   2    3    13
                    2    7    14
                    4    3    15
                    4    7    16
                    6    3    17
                    6    7    18
                    8    3    19
                    8    7    20
                   10    3    21
                   10    7    22
                   20    3    23
                   20    7    24

3.2.3. Cross validation

In the above experiment only one training data set was sampled from the image at each training size level. In order to evaluate the stability of the selected classifiers, and for the results to be statistically valid, cross validations were performed at two training data size levels: 6% of the pixels, representing a relatively small training size, and 20% of the pixels, representing a relatively large training size. At each size level ten sets of training samples were randomly selected from the image using the equal sample rate (ESR) method. As will be discussed in §6.1, this method gave slightly higher accuracies than the ESS. On each training data set the four classification algorithms were trained using three and seven variables.

3.3. Methods for performance assessment

The criteria for evaluating the performance of classification algorithms include accuracy, speed, stability and comprehensibility, among others. Which criterion or which group of criteria to use depends on the purpose of the evaluation. As the criterion most relevant to all parties and all purposes, accuracy was selected as the primary criterion in this assessment. Speed and stability are also important factors in algorithm selection and these were considered as well. Two widely used accuracy measures, the overall accuracy and the kappa coefficient, were used in this study (Rosenfield and Fitzpatrick-Lins 1986, Congalton 1991, Janssen and Wel 1994). The overall accuracy has the advantage of being directly interpretable as the proportion of pixels classified correctly (Janssen and Wel 1994, Stehman 1997), while the kappa coefficient allows for a statistical test of the significance of the difference between two algorithms (Congalton 1991).
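Both measures can be computed directly from a confusion matrix. The sketch below uses the standard definitions, overall accuracy as the diagonal proportion and kappa as chance-corrected agreement, which is what the works cited above describe:

```python
import numpy as np

def overall_accuracy(cm):
    """Proportion of correctly classified pixels: the diagonal of the
    confusion matrix over its total."""
    return np.trace(cm) / cm.sum()

def kappa(cm):
    """Kappa coefficient: observed agreement corrected for the agreement
    expected by chance from the row and column marginals."""
    n = cm.sum()
    po = np.trace(cm) / n
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2
    return (po - pe) / (1.0 - pe)

# rows = reference classes, columns = classified classes
cm = np.array([[45, 5],
               [10, 40]])
print(overall_accuracy(cm), kappa(cm))
```

For this matrix the overall accuracy is 0.85, while kappa is lower (0.7) because part of that agreement would occur by chance.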

4. Impact of kernel configuration on the performance of the SVM

According to the theoretical development of the SVM presented in §2, the kernel function plays a major role in locating complex decision boundaries between classes. By mapping the input data into a high-dimensional space, the kernel function converts non-linear boundaries in the original data space into linear ones in the high-dimensional space, which can then be located using an optimization algorithm. Therefore the selection of a kernel function and of appropriate values for the corresponding kernel parameters, referred to as the kernel configuration, may affect the performance of the SVM.

4.1. Polynomial kernels

The parameter to be predefined for using the polynomial kernels is the polynomial order p. Following previous studies (Cortes and Vapnik 1995), p values of 1 to 8 were tested for each of the 24 training cases. Rapid increases in computing time as p increases limited experiments with higher p values. Kernel performance is measured using the overall agreement between a classification and a reference map, i.e. the overall accuracy (Stehman 1997). Figure 3 shows the impact of p on kernel performance. In general, the linear kernel (p = 1) performed worse than non-linear kernels, which is expected because boundaries between many classes are more likely to be non-linear. With three variables as the input, there are obvious trends of improved accuracy as p increases (figure 3(c) and (d)). Such trends are also observed in training cases with seven input variables when p increases from 1 to 4 (figure 3(a) and (b)). This observation is in contrast to the studies of Cortes and Vapnik (1995), in which no obvious trend was observed when the polynomial order p increased from 2 to higher values. This is probably because the number of input variables used in this study is quite different from those used in previous studies. The data set used in this experiment has only several variables, while those used in previous studies had hundreds of variables. Differences between the observations of this experiment and those of previous studies suggest that the polynomial order p has different impacts on kernel performance when different numbers of input variables are used. With large numbers of input variables, complex non-linear decision boundaries can still be mapped into linear ones using relatively low-order polynomial kernels. However, if a data set has only several variables, it is necessary to try high-order polynomial kernels in order to achieve optimal performance with a polynomial SVM.

Figure 3. Performance of polynomial kernels as a function of polynomial order p (training data sizes are given as % of the pixels of the image). (a) Equal sample size, 7 variables; (b) equal sample rate, 7 variables; (c) equal sample size, 3 variables; (d) equal sample rate, 3 variables.

4.2. RBF kernels

The parameter to be preset for using the RBF kernel defined in equation (21) is c. In previous studies c values of around 1 were used (Vapnik 1995, Joachims 1998b). For this specific data set, c values between 1 and 20 gave reasonable results (figure 4). A comparison between figure 3 and figure 4 reveals that the performance of the RBF kernel is less affected by c than that of the polynomial kernel by p. With seven input variables (figure 4(a) and (b)), the overall accuracy changed only slightly when c varied between 1 and 20. With three input variables, however, the impact is more significant. Figure 4(c) and (d) show obvious trends of increased performance when c increased from 1 to 7.5. For most training cases the overall accuracy changed only slightly when c increased beyond 7.5.

The impact of a kernel parameter on kernel performance can be illustrated using an experiment performed on arbitrary data samples collected in a two-dimensional space. Figure 5 shows the data samples of two classes and the decision boundaries between the two classes as located by polynomial and RBF kernels. Notice that although the decision boundaries located by all non-linear kernels (all polynomial kernels with p > 1 and all RBF kernels) are similar, for this specific set of samples the shape of the decision boundary is adjusted slightly and misclassification errors are reduced gradually as p increases from 3 to 12 for the polynomial kernel (figure 5(a)), or as c decreases from 1 to 0.1 for the RBF kernel (figure 5(b)). With appropriate kernel parameter values both the polynomial (p = 12) and RBF (c = 0.1) kernels classified this arbitrary data set without error, though the decision boundaries defined by the two types of kernels are not exactly the same. How well these decision boundaries can be generalized to unseen samples depends on the distribution of the unseen data samples.

Figure 4. Performance of RBF kernels as a function of c (training data sizes are given as % of the pixels of the image). (a) Equal sample size, 7 variables; (b) equal sample rate, 7 variables; (c) equal sample size, 3 variables; (d) equal sample rate, 3 variables.

As will be discussed in §6, classification accuracy is affected by training sample size and the number of input variables. Figures 3 and 4 show that most SVM kernels gave higher accuracies with a larger training size and more input variables. With three input variables, however, most SVM kernels gave unexpectedly higher accuracies on the training case with 2% of pixels sampled using the equal sample size (ESS) method than on several larger training data sets selected using the same sampling method (figures 3(c) and 4(c)). This is probably because the SVM defines decision boundaries between classes using support vectors rather than statistical attributes, which are sample size dependent (figure 5). Although a larger training data set has a better chance of including the support vectors that define the actual decision boundaries, and hence should give higher accuracies, there are occasions when a smaller training data set includes such support vectors while larger ones do not. In §6.1 we will show that the other three classifiers did not have such abnormally high accuracies on this training case (see figure 8(c) later).

Figure 5. Impact of kernel configuration on the decision boundaries and misclassification errors of the SVM. Empty and solid circles represent two arbitrary classes. Circled points are support vectors. Checked points represent misclassification errors. Red and blue represent high confidence areas for class one (empty circle) and class two (solid circle) respectively. Optimal separating hyperplanes are highlighted in white.

5. Comparative performances of the four classifiers
The previous section illustrated the impact of kernel parameter settings on the accuracy of the SVM. Similarly, the performance of the other classification algorithms may also be affected by their parameter settings. For example, the performance of the NNC is influenced by the network structure (e.g. Sui 1994, Paola and Schowengerdt 1997), while that of the DTC is affected by the degree of pruning (Breiman et al. 1984, Quinlan 1993). In this experiment the NNC took a three-layer (input, hidden and output) network structure, which is considered sufficient for classifying multispectral imagery (Paola and Schowengerdt 1995). The numbers of units in the first and last layers were set to the numbers of input variables and output classes respectively. There is no guideline for determining the number of hidden units; in this experiment it was set according to the number of input variables. Three hidden layer configurations were tested on each training case: the number of hidden units equal to one, two and three times the number of input variables. A major issue in pruning a classification tree is when to stop in order to produce a tree that generalizes well to unseen data samples. Too simple a tree may not fully exploit the explanatory power of the data, while too complex a tree may generalize poorly. Yet there is no practical guideline that guarantees a 'perfect' tree that is neither too simple nor too complex. In this experiment a wide range of pruning degrees was tested.
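The configuration grid described above can be sketched as follows. This is a minimal illustration only; the helper name and the pruning-level representation are hypothetical, since the paper does not specify its NNC and DTC software interfaces.

```python
# Sketch of the classifier settings tested on each training case.
# candidate_configurations is a hypothetical helper, not from the paper.

def candidate_configurations(n_vars):
    """Return the NNC and DTC settings tested on each training case:
    three-layer networks whose hidden layer has 1x, 2x and 3x as many
    units as there are input variables, and decision trees pruned to a
    range of degrees (represented here by an arbitrary list of levels)."""
    nnc = [{"layers": (n_vars, k * n_vars, "n_classes")} for k in (1, 2, 3)]
    dtc = [{"pruning_level": p} for p in range(10)]
    return nnc, dtc

nnc, dtc = candidate_configurations(7)
print([c["layers"][1] for c in nnc])  # hidden-unit counts: [7, 14, 21]
```

For each training case, the best-performing configuration from such a grid was the one reported in the comparison below.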

Because algorithm parameters affect different algorithms in different ways, it is impossible to account for such differences when evaluating the comparative performances of the algorithms. To avoid this problem, the best performance of each algorithm on each training case was reported in the following comparison. The performances were evaluated in terms of algorithm accuracy, stability and speed.

5.1. Classification accuracy
The accuracy of classifications was measured using the overall accuracy. The significance of accuracy differences was tested using the kappa statistic according to Congalton et al. (1983) and Hudson and Ramm (1987). Figure 6 shows the overall accuracies of the four algorithms on the 24 training cases. Table 3 gives the significance values of accuracy differences between the four algorithms. Table 4 gives the mean and standard deviation of the overall accuracies of classifications developed through cross-validation at two training size levels: 6% and 20% of the pixels of the image. Several patterns can be observed from figure 6 and tables 3 and 4, as follows.

(1) Generally the SVM was more accurate than the DTC or the MLC. It gave significantly higher accuracies than the MLC in 18 out of 20 training cases (the MLC could not run on four training cases due to insufficient training samples) and than the DTC in 14 of 24 training cases. In none of the remaining training cases did the MLC or the DTC generate significantly better results than the SVM. The SVM also gave significantly better results than the NNC in six of the 12 training cases with seven input variables and, though insignificantly, gave higher accuracies than the NNC in five of the remaining six training cases. On average, when seven variables were used, the overall accuracy of the SVM was 1–2% higher than that of the NNC, and 2–4% higher than those of the DTC and the MLC (table 4). When only three variables were used, the average overall accuracies of the SVM were about 1–2% higher than those of the DTC and the MLC. These observations are in general agreement with previous works in which the SVM was found to be more accurate than either the NNC or the DTC (Vapnik 1995, Joachims 1998b). This is expected because, as discussed in §2, the SVM is designed to locate an optimal separating hyperplane, which the other three algorithms may not be able to find. Statistically, the optimal separating hyperplane located by the SVM should generalize to unseen samples with the fewest errors among all separating hyperplanes.

(2) Unexpectedly, however, the SVM did not give significantly higher accuracies than the NNC in any of the 12 training cases with three input variables. On the contrary, it was significantly less accurate than the NNC in three of those 12 training cases. The average overall accuracies of the SVM were slightly lower than those of the NNC (table 4). The lower accuracies of the SVM relative to the NNC on data with three variables are probably due to the inability of the SVM to transform non-linear class boundaries in the original space into linear ones in a high-dimensional space. According to the algorithm development detailed in §2, the applicability of the SVM to non-linear decision boundaries depends on whether those boundaries can be transformed into linear ones by mapping the input data into a high-dimensional space. With only three input variables, the SVM might have less success in transforming complex decision boundaries in the original input space into linear ones in a high-dimensional space. The complex network structure of the NNC, however, might be able to approximate complex decision boundaries even when the data contain very few variables, and therefore perform better than the SVM in such cases. The comparative performance of the SVM on data sets with very few variables should be further investigated, because data sets with so few variables were not considered in previous studies (Cortes and Vapnik 1995, Joachims 1998b).

Figure 6. Overall accuracies of classifications developed using the four classifiers. The y-axis is overall accuracy (%); the x-axis is training data size (% of the pixels of the image). (a) Equal sample size, 7 variables; (b) equal sample rate, 7 variables; (c) equal sample size, 3 variables; (d) equal sample rate, 3 variables.

Table 3. Significance values (Z) of differences between the accuracies of the four classifiers.

                Equal sample    Equal sample    Equal sample    Equal sample
Sample          size,           rate,           size,           rate,
size (%)        7 variables     7 variables     3 variables     3 variables

SVM vs. NNC
2               1.77            3.65            1.20            1.02
4               1.96            1.50            2.29            2.38
6               1.92            1.00            4.60            0.22
8               2.28            1.19            1.06            0.88
10              1.94            3.96            0.02            0.02
20              2.55            2.26            1.50            0.02

SVM vs. DTC
2               0.61            2.48            3.46            1.65
4               2.33            0.81            0.61            1.37
6               4.43            1.89            0.46            3.01
8               4.58            2.25            4.51            1.52
10              2.70            4.58            2.46            5.23
20              4.68            3.10            1.19            1.43

SVM vs. MLC
2               8.03            NA              5.04            NA
4               7.27            NA              0.33            NA
6               6.34            3.38            2.35            3.03
8               3.30            4.24            4.80            6.48
10              4.73            7.54            1.51            4.51
20              6.32            5.03            3.39            3.86

DTC vs. NNC
2               1.17            1.17            2.31            2.70
4               0.37            0.69            2.91            1.01
6               2.52            0.89            5.07            2.79
8               2.30            1.06            5.60            2.40
10              0.76            0.61            2.48            5.22
20              2.13            0.83            2.71            1.42

DTC vs. MLC
2               7.44            NA              1.60            NA
4               4.94            NA              0.28            NA
6               1.90            1.49            1.88            0.02
8               1.29            1.99            0.28            4.98
10              2.02            2.97            0.96            0.70
20              1.63            1.94            2.19            2.46

NNC vs. MLC
2               6.25            NA              3.91            NA
4               5.33            NA              2.64            NA
6               4.42            2.38            6.99            2.80
8               1.01            3.05            5.88            7.39
10              2.78            3.58            1.54            4.50
20              3.76            2.77            4.93            2.53

Notes.
1. Differences significant at the 95% confidence level (|Z| ≥ 1.96) are highlighted in bold face. A positive value indicates better performance of the first classifier, while a negative one indicates better performance of the second classifier.
2. NA indicates that the MLC did not work due to insufficient training samples for certain classes, so no comparison was made.
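The kappa-based significance test behind table 3 (Congalton et al. 1983, Hudson and Ramm 1987) reduces to a simple Z statistic on two independent kappa estimates. A minimal sketch follows; it assumes the kappa coefficients and their variances have already been computed from the error matrices, and the numbers shown are hypothetical.

```python
import math

def kappa_z(k1, var1, k2, var2):
    """Z statistic for the difference between two independent kappa
    coefficients; the difference is significant at the 95% confidence
    level when |Z| >= 1.96."""
    return (k1 - k2) / math.sqrt(var1 + var2)

# Hypothetical kappa estimates and variances for two classifiers:
z = kappa_z(0.72, 0.0004, 0.69, 0.0005)
print(round(z, 2))  # 1.0, i.e. not significant at the 95% level
```

A positive Z favours the first classifier and a negative Z the second, matching the sign convention in the notes to table 3.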


Table 4. Mean and standard deviation (s) of the overall accuracies (%) of classifications developed using ten sets of training samples randomly selected from the Maryland data set.

                                    SVM           NNC           DTC           MLC
Training condition              Mean    s     Mean    s     Mean    s     Mean    s
Training size = 20%, 7 variables  75.62  0.19  74.02  0.81  73.31  0.65  71.76  0.79
Training size = 6%, 7 variables   74.20  0.60  72.10  1.31  71.82  0.94  70.92  1.04
Training size = 20%, 3 variables  66.41  0.39  66.82  0.91  65.92  0.52  64.59  0.62
Training size = 6%, 3 variables   65.49  1.20  65.97  0.79  64.45  0.58  63.95  0.97

(3) Of the other three algorithms, the NNC gave significantly higher accuracies than the DTC in ten of the 12 training cases with three input variables and in three of the 12 training cases with seven input variables. Again, the NNC performed comparatively better on training cases with three variables than on those with seven. The DTC did not give significantly better results than the NNC in any of the remaining training cases. Both the NNC and the DTC were more accurate than the MLC: the NNC had significantly higher accuracies than the MLC in 18 of 20 training cases, and the DTC in eight of 20. The MLC did not have significantly higher accuracies than the NNC or the DTC in any of the remaining training cases.
(4) The accuracy differences between the four algorithms on the data set used in this study were generally small. However, many of them were statistically significant.

5.2. Algorithm stability and speed
The standard deviation of the overall accuracy of an algorithm estimated in cross-validation is a quantitative measure of its relative stability (table 4). Figure 7 shows the variations in the accuracies of the four classifiers. Both table 4 and figure 7 reveal that the stabilities of the algorithms differed greatly and were affected by training data size and the number of input variables. In general, the overall accuracies of the algorithms were more stable when trained using 20% of the pixels than using 6%, especially when seven variables were used (figures 7(a) and (b)). The SVM gave far more stable overall accuracies than the other three algorithms when trained using 20% of the pixels with seven variables. It also gave more stable overall accuracies than the other three algorithms when trained using 6% of the pixels with seven variables (figure 7(b)) and using 20% of the pixels with three variables (figure 7(c)). But when trained using 6% of the pixels with three variables, it gave overall accuracies in a wider range than the other three algorithms (figure 7(d)). Of the other three algorithms, the DTC gave slightly more stable overall accuracies than the NNC or the MLC, both of which gave overall accuracies in wider ranges in all cases.
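The stability measure used here, the standard deviation of overall accuracy across the ten cross-validation runs, can be computed directly with the standard library. The accuracy lists below are hypothetical, chosen only to illustrate a stable classifier against an unstable one.

```python
import statistics

# Hypothetical overall accuracies (%) from ten cross-validation runs,
# mimicking the summary reported in table 4.
svm_acc = [75.4, 75.6, 75.7, 75.5, 75.8, 75.6, 75.5, 75.9, 75.6, 75.6]
mlc_acc = [70.8, 72.4, 71.1, 72.9, 71.6, 70.5, 72.2, 71.9, 72.6, 71.6]

for name, acc in (("SVM", svm_acc), ("MLC", mlc_acc)):
    # Mean measures accuracy; the sample standard deviation measures stability.
    print(name, round(statistics.mean(acc), 2), round(statistics.stdev(acc), 2))
```

A smaller standard deviation over the ten random training sets indicates a more stable algorithm.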

Figure 7. Boxplots of the overall accuracies of classifications developed using ten sets of training samples randomly selected from the Maryland data set. (a) Training size = 20% of the pixels of the image, number of input variables = 7. (b) Training size = 6% of the pixels of the image, number of input variables = 7. (c) Training size = 20% of the pixels of the image, number of input variables = 3. (d) Training size = 6% of the pixels of the image, number of input variables = 3.

The training speeds of the four classifiers were substantially different. In all training cases, training the MLC and the DTC took no more than a few minutes on a SUN Ultra 2 workstation, while training the NNC and the SVM took hours and days respectively. Furthermore, the training speeds of these algorithms were affected by many factors, including the numbers of training samples and input variables, the noise level in the training data set, and the algorithm parameter settings. This is especially the case for the SVM and the NNC. Many studies have demonstrated that the training speed of the NNC depends on network structure, momentum rate, learning rate and convergence criteria (Paola and Schowengerdt 1995). The training of the SVM was affected by training data size, kernel parameter settings and class separability. Generally, when the training data size was doubled, the training time more than doubled. Training the SVM to classify two highly mixed classes could take several times longer than training it to classify two separable classes. For the SVM program used in this study, polynomial kernels, especially high-order kernels, took far more time to train than RBF kernels.
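The kernel settings whose training costs are compared above (polynomial order p from 1 to 8, and a range of RBF width parameters) can be laid out as a configuration grid. The sketch below uses scikit-learn's `SVC` as a stand-in for the SVM-light style program used in the paper (an assumption), and maps the paper's RBF parameter c onto `gamma` only approximately.

```python
# Configuration grid for the two kernel families discussed in the text.
# SVC is a modern stand-in, not the program used in the original study.
from sklearn.svm import SVC

poly_svms = {p: SVC(kernel="poly", degree=p) for p in range(1, 9)}
rbf_svms = {g: SVC(kernel="rbf", gamma=g) for g in (1, 2.5, 5, 7.5, 10, 15, 20)}

# Each configured estimator would then be fitted and timed on the same
# training case to reproduce the kind of speed comparison reported here.
```

In the study, the polynomial kernels, particularly the high-order ones, were the slow end of this grid.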

6. Impacts of non-algorithm factors
6.1. Impact of training sample selection
Training sample selection includes two parts: training data size and selection method. Reorganizing the numbers in figure 6 shows the impact of training data size on algorithm performance (figure 8). As expected, increases in training data size generally led to improved performance. While the increases in overall accuracy were not monotonic as training data size increased, larger training data sets (>6% of the image) generally gave better results than smaller ones (<6%).

Figure 8. Impact of training data size on the performances of the classifiers. The y-axis is overall accuracy (%); training data size is % of the pixels of the image. (a) Equal sample size, 7 variables; (b) equal sample rate, 7 variables; (c) equal sample size, 3 variables; (d) equal sample rate, 3 variables.

One of the goals of this experiment was to determine the minimum training data size for sufficient training of an algorithm. The obvious increases in overall accuracy as training data size increased from 2% to 6% indicate that, for this test data set, training pixels amounting to less than 6% of the entire image are insufficient for training the four algorithms. Beyond 6%, however, it is hard to tell when an algorithm is adequately trained. When seven variables were used and the training samples were selected using the equal sample rate (ESR) method (figure 8(b)), the largest training data set (20% of the pixels) gave the best results. For other training cases, however, the best performance of an algorithm was often achieved with training pixels amounting to less than 20% of the image (figures 8(a), (c) and (d)). Hepner et al. (1990) considered a training data size of a 10 by 10 block for each class as the minimum for training the NNC. Zhuang et al. (1994) suggested that training data sets of approximately 5–10% of an image were needed to train a neural network classifier adequately. The results of this experiment suggest that the minimum number of samples for adequately training an algorithm may depend on the algorithm concerned, the number of input variables, the method used to select the training samples, and the size and spatial variability of the study area.

The impact of the two sampling methods for selecting training data, equal sample size (ESS) and equal sample rate (ESR), on classification accuracy was assessed using the kappa statistic. Table 5 shows that the two sampling methods did give significantly different accuracies for some training cases. For most training cases, slightly higher accuracies were achieved when the training samples were selected using the ESR method. Given the ESR method's disadvantage of undersampling or entirely missing rare classes, the sampling rate of very rare classes should be increased when this method is employed.
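The two selection schemes compared above can be sketched as follows. The helper and the class labels are hypothetical; the sketch only illustrates how ESR draws the same fraction from every class while ESS draws the same count, oversampling rare classes in relative terms.

```python
import random

def select_training(labels, method, rate=0.06, size=200):
    """Select training pixel indices per class.

    ESR (equal sample rate): the same fraction of every class.
    ESS (equal sample size): the same number of pixels from every class,
    capped at the class size. Hypothetical helper, not the paper's code."""
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    selected = []
    for c, idx in by_class.items():
        if method == "ESR":
            n = max(1, round(rate * len(idx)))  # at least one pixel per class
        else:  # ESS
            n = min(size, len(idx))
        selected.extend(random.sample(idx, n))
    return selected

# A class distribution with one very rare class ("mix"):
labels = ["forest"] * 900 + ["water"] * 90 + ["mix"] * 10
esr = select_training(labels, "ESR")            # 54 + 5 + 1 = 60 pixels
ess = select_training(labels, "ESS", size=50)   # 50 + 50 + 10 = 110 pixels
```

With ESR the rare class contributes only one training pixel, which is the undersampling problem the text warns about.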

6.2. Impact of input variables
It is evident from figures 6 and 8 that substantial improvements were achieved when the classifications were developed using seven variables instead of three. The respective average improvements in overall accuracy for the SVM, NNC, DTC and the MLC were 8.8%, 5.8%, 8.0% and 5.9% when training samples were selected using the ESS method, and 8.1%, 6.1%, 7.6% and 7.3% when they were selected using the ESR method. Figure 9 shows two SVM classifications developed using three and seven variables. They were developed from the training data set consisting of 20% of the pixels of the image selected using the ESR method. A visual inspection of the two classifications reveals that using the four additional TM bands led to substantial improvements in discriminating between the four land classes (closed forest, open forest, woodland and non-forest land). Table 6 gives the number of pixels classified correctly in the two classifications. The last row shows that the relative increases in the number of pixels classified correctly for the four land classes are much higher than those for the water and land–water mix classes.

It should be noted that the improvements in classification accuracy achieved by using more variables were substantially higher than those achieved by choosing better classification algorithms or by increasing training data size, underlining the importance of using as much information as possible in land cover classification.

Table 5. Significance values (Z) of differences between classifications developed from training samples selected using the equal sample size (ESS) and equal sample rate (ESR) methods.

                       Algorithm
              SVM              DTC              NNC              MLC
Sample
rate (%)   3-band  7-band   3-band  7-band   3-band  7-band   3-band  7-band
2           2.72    3.16     0.94    1.28     0.54    5.83      —       —
4           1.04    1.92     3.01    1.21     1.19    1.53      —       —
6           3.07    1.12     0.53    1.42     1.74    0.21     2.40    1.83
8           0.81    0.85     3.83    1.47     0.63    0.24     0.85    1.80
10          2.70    2.07     0.01    0.20     2.67    0.06     0.30    0.75
20          3.13    1.74     2.93    3.35     1.64    1.24     2.67    3.06

Note. Differences significant at the 95% confidence level (|Z| ≥ 1.96) are highlighted in bold face. Positive Z values indicate higher accuracies for the ESS method while negative ones indicate higher accuracies for the ESR method.

Figure 9. SVM classifications developed for the study area in eastern Maryland, USA, using three and seven variables from the training data set consisting of 20% training pixels selected using the equal sample rate (ESR) method. The classifications cover an area of 22.5 km by 22.5 km. (a) Classification developed using three variables. (b) Classification developed using seven variables.

Table 6. Number of pixels classified correctly in the two classifications shown in figure 9 and per-class improvement due to using seven instead of three variables in the classification.

Classification      Closed  Open    Wood-  Non-forest  Land–water
developed using     forest  forest  land   land        mix         Water

Per-class agreement (number of pixels) between a classification and the reference map
Three variables     1317    587     376    612         276         974
Seven variables     1533    695     447    752         291         982

Relative increase (%) in per-class agreement when the number of input variables increased from 3 to 7
                    16.4    18.4    18.9   22.9        5.4         0.8
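The relative increases in the last row of table 6 follow directly from the per-class counts, as this short snippet shows.

```python
# Per-class agreement counts from table 6.
three = {"closed forest": 1317, "open forest": 587, "woodland": 376,
         "non-forest land": 612, "land-water mix": 276, "water": 974}
seven = {"closed forest": 1533, "open forest": 695, "woodland": 447,
         "non-forest land": 752, "land-water mix": 291, "water": 982}

# Relative increase (%) in per-class agreement, rounded to one decimal.
increase = {c: round(100 * (seven[c] - three[c]) / three[c], 1) for c in three}
print(increase)  # the four land classes gain 16-23%; water gains only 0.8%
```

The much smaller gains for water and the land–water mix confirm that the additional bands mainly helped separate the vegetated classes.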

Many studies have demonstrated the usefulness of the two mid-infrared bands of the TM sensor in discriminating between vegetation types (e.g. DeGloria 1984, Townshend 1984), yet the two bands will not be available at 250 m resolution on the MODIS instrument (Running et al. 1994). Results from this experiment show that the loss of discriminatory power due to not having the two mid-infrared bands at 250 m resolution could not be fully compensated for by using better classification algorithms or by increasing training data size. Whether the lost information can be fully compensated for by incorporating spatial and temporal information needs to be further investigated.

7. Summary and conclusions
The support vector machine (SVM) is a machine learning algorithm based on statistical learning theory. In previous studies it had been found competitive with the best available machine learning algorithms for handwritten digit recognition and text categorization. In this study, an experiment was performed to evaluate the comparative performances of this algorithm and three other popular classifiers (the maximum likelihood classifier (MLC), neural network classifiers (NNC) and decision tree classifiers (DTC)) in land cover classification. In addition to the comparative performances of the four classifiers, the impacts of SVM kernel configuration on its performance, and of the selection of training data and input variables on all four classifiers, were also evaluated.

The SVM uses kernel functions to map non-linear decision boundaries in the original data space into linear ones in a high-dimensional space. Results from this experiment revealed that the kernel type and kernel parameters affect the shape of the decision boundaries located by the SVM and thus influence its performance. For polynomial kernels, better accuracies were achieved on data with three input variables as the polynomial order p increased from 1 to 8, suggesting the need for high-order polynomial kernels when the input data have very few variables. When seven variables were used in the classification, improved accuracies were achieved as p increased from 1 to 4; further increases in p had little impact on classification accuracy. For RBF kernels the accuracy increased slightly as c increased from 1 to 7.5, and no obvious trend of improvement was observed as c increased from 5 to 20. However, an experiment using arbitrary data points revealed that the misclassification error is a function of c.
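The two kernel families summarized above can be written out explicitly. The sketch below assumes the common formulations K(x, y) = (x · y + 1)^p for the polynomial kernel and K(x, y) = exp(−γ‖x − y‖²) for the RBF kernel; the exact parametrization in the paper's SVM program (e.g. how its c relates to γ) may differ.

```python
import math

def poly_kernel(x, y, p):
    """Polynomial kernel of order p: (x . y + 1) ** p (a common form)."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** p

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Two example pixels described by three variables (e.g. red, NIR, NDVI):
x, y = [0.2, 0.5, 0.1], [0.3, 0.4, 0.2]
print(poly_kernel(x, y, 2))   # (0.28 + 1) ** 2
print(rbf_kernel(x, y, 2.0))  # exp(-2 * 0.03)
```

Raising p, or varying the RBF width parameter, changes how strongly the kernel bends the decision boundary, which is the effect observed on the classification accuracies above.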

Of the four algorithms evaluated, the MLC had lower accuracies than the three non-parametric algorithms. The SVM was more accurate than the DTC in 22 out of 24 training cases. It also gave higher accuracies than the NNC when seven variables were used in the classification. This observation is in agreement with several previous studies. The higher accuracies of the SVM should be attributed to its ability to locate an optimal separating hyperplane. As shown in figure 1, statistically, the optimal separating hyperplane found by the SVM algorithm should generalize to unseen samples with fewer errors than any other separating hyperplane that might be found by other classifiers. Unexpectedly, however, the NNC was more accurate than the SVM when only three variables were used in the classification. This is probably because the SVM had less success in transforming non-linear class boundaries in a very low-dimensional space into linear ones in a high-dimensional space. On the other hand, the complex network structure of the NNC might be more efficient in approximating non-linear decision boundaries even when the data have only very few variables. Generally, the absolute differences in classification accuracy among the four classifiers were small; however, many of the differences were statistically significant.

In terms of algorithm stability, the SVM gave more stable overall accuracies than the other three algorithms except when trained using 6% of the pixels with three variables. Of the other three algorithms, the DTC gave slightly more stable overall accuracies than the NNC or the MLC, both of which gave overall accuracies in wide ranges. In terms of training speed, the MLC and the DTC were much faster than the SVM and the NNC. While the training speed of the NNC depended on network structure, momentum rate, learning rate and convergence criteria, that of the SVM was affected by training data size, kernel parameter settings and class separability.

All four classifiers were affected by the selection of training samples. It was not possible to determine the minimum number of samples for sufficiently training an algorithm from the results of this experiment. However, the initial trend of improved classification accuracies for all four classifiers as training data size increased emphasizes the necessity of adequate training samples in land cover classification. Feature selection is another factor affecting classification accuracy. Substantial increases in accuracy were achieved when all six TM spectral bands and the NDVI were used instead of only the red band, the NIR band and the NDVI. The four additional TM bands improved the discrimination between land classes. The improvements due to their inclusion exceeded those due to the use of better classification algorithms or increased training data size, underlining the need to use as much information as possible in deriving land cover classifications from satellite images.

Acknowledgments
This study was made possible through an NSF grant (BIR9318183) and a contract from the National Aeronautics and Space Administration (NAS596060). The SVM program used in this study was made available by Mr Thorsten Joachims.

References
Atkinson, P. M., and Tatnall, A. R. L., 1997, Neural networks in remote sensing. International Journal of Remote Sensing, 18, 699–709.
Barker, J. L., and Burelhach, J. W., 1992, MODIS image simulation from Landsat TM imagery. In Proceedings ASPRS/ACSM/RT, Washington, DC, April 22–25, 1992 (Washington, DC: ASPRS), pp. 156–165.
Barnes, W. L., Pagano, T. S., and Salomonson, V. V., 1998, Prelaunch characteristics of the Moderate Resolution Imaging Spectroradiometer (MODIS) on EOS-AM1. IEEE Transactions on Geoscience and Remote Sensing, 36, 1088–1100.
Belward, A., and Loveland, T., 1996, The DIS 1 km land cover data set. Global Change News Letter, 27, 7–9.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., 1984, Classification and Regression Trees (Belmont, CA: Wadsworth International Group).
Brodley, C. E., and Utgoff, P. E., 1995, Multivariate decision trees. Machine Learning, 19, 45–77.
Burges, C. J. C., 1998, A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.
Campbell, J. B., 1981, Spatial correlation effects upon accuracy of supervised classification of land cover. Photogrammetric Engineering and Remote Sensing, 47, 355–363.
Campbell, J. B., 1996, Introduction to Remote Sensing (New York: The Guilford Press).
Congalton, R., 1991, A review of assessing the accuracy of classifications of remotely sensed data. Remote Sensing of Environment, 37, 35–46.
Congalton, R. G., Oderwald, R. G., and Mead, R. A., 1983, Assessing Landsat classification accuracy using discrete multivariate analysis statistical techniques. Photogrammetric Engineering and Remote Sensing, 49, 1671–1678.
Cortes, C., and Vapnik, V., 1995, Support vector networks. Machine Learning, 20, 273–297.
Courant, R., and Hilbert, D., 1953, Methods of Mathematical Physics (New York: John Wiley).
DeFries, R. S., Hansen, M., Townshend, J. R. G., and Sohlberg, R., 1998, Global land cover classifications at 8 km spatial resolution: the use of training data derived from Landsat imagery in decision tree classifiers. International Journal of Remote Sensing, 19, 3141–3168.
DeGloria, S., 1984, Spectral variability of Landsat-4 Thematic Mapper and Multispectral Scanner data for selected crop and forest cover types. IEEE Transactions on Geoscience and Remote Sensing, GE-22, 303–311.
Dicks, S. E., and Lo, T. H. C., 1990, Evaluation of thematic map accuracy in a land-use and land-cover mapping program. Photogrammetric Engineering and Remote Sensing, 56, 1247–1252.
Fitzpatrick-Lins, K., 1981, Comparison of sampling procedures and data analysis for a land-use and land-cover map. Photogrammetric Engineering and Remote Sensing, 47, 343–351.
Foody, G. M., McCulloch, M. B., and Yates, W. B., 1995, The effect of training set size and composition on artificial neural network classification. International Journal of Remote Sensing, 16, 1707–1723.
Friedl, M. A., and Brodley, C. E., 1997, Decision tree classification of land cover from remotely sensed data. Remote Sensing of Environment, 61, 399–409.
van Genderen, J. L., and Lock, B. F., 1978, Remote sensing: statistical testing of thematic map accuracy. Remote Sensing of Environment, 7, 3–14.
Gong, P., and Howarth, P. J., 1990, An assessment of some factors influencing multispectral land-cover classification. Photogrammetric Engineering and Remote Sensing, 56, 597–603.
Gualtieri, J. A., and Cromp, R. F., 1998, Support vector machines for hyperspectral remote sensing classification. In Proceedings of the 27th AIPR Workshop: Advances in Computer Assisted Recognition, Washington, DC, Oct. 27, 1998 (Washington, DC: SPIE), pp. 221–232.
Hall, F. G., Townshend, J. R., and Engman, E. T., 1995, Status of remote sensing algorithms for estimation of land surface state parameters. Remote Sensing of Environment, 51, 138–156.
Hansen, M., DeFries, R. S., Townshend, J. R. G., and Sohlberg, R., 2000, Global land cover classification at 1 km spatial resolution using a classification tree approach. International Journal of Remote Sensing, 21, 1331–1364.
Hansen, M., Dubayah, R., and DeFries, R., 1996, Classification trees: an alternative to traditional land cover classifiers. International Journal of Remote Sensing, 17, 1075–1081.
Hepner, G. F., Logan, T., Ritter, N., and Bryant, N., 1990, Artificial neural network classification using a minimal training set: comparison to conventional supervised classification. Photogrammetric Engineering and Remote Sensing, 56, 469–473.
Hixson, M., Scholz, D., Fuhs, N., and Akiyama, T., 1980, Evaluation of several schemes for classification of remotely sensed data. Photogrammetric Engineering and Remote Sensing, 46, 1547–1553.
Hudson, W. D., and Ramm, C. W., 1987, Correct formulation of the Kappa coefficient of agreement. Photogrammetric Engineering and Remote Sensing, 53, 421–422.
Janssen, L. L. F., and Wel, F., 1994, Accuracy assessment of satellite derived land cover data: a review. Photogrammetric Engineering and Remote Sensing, 60, 419–426.
Joachims, T., 1998a, Making large scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning, edited by B. Scholkopf, C. Burges and A. Smola (New York: MIT Press).
Joachims, T., 1998b, Text categorization with support vector machines: learning with many relevant features. In Proceedings of the European Conference on Machine Learning, Chemnitz, Germany, April 10, 1998 (Berlin: Springer), pp. 137–142.
Justice, C. O., Markham, B. L., Townshend, J. R. G., and Kennard, R. L., 1989, Spatial degradation of satellite data. International Journal of Remote Sensing, 10, 1539–1561.
Lippman, R. P., 1987, An introduction to computing with neural nets. IEEE ASSP Magazine, 4, 2–22.
Markham, B. L., and Barker, J. L., 1986, Landsat MSS and TM post-calibration dynamic ranges, exoatmospheric reflectances and at-satellite temperatures. EOSAT Landsat Technical Notes, 1, 3–8.
Pao, Y.-H., 1989, Adaptive Pattern Recognition and Neural Networks (New York: Addison-Wesley).
Paola, J. D., and Schowengerdt, R. A., 1995, A review and analysis of backpropagation neural networks for classification of remotely sensed multi-spectral imagery. International Journal of Remote Sensing, 16, 3033–3058.
Paola, J. D., and Schowengerdt, R. A., 1997, The effect of neural network structure on a multispectral land-use/land cover classification. Photogrammetric Engineering and Remote Sensing, 63, 535–544.
Quinlan, J. R., 1993, C4.5: Programs for Machine Learning (San Mateo, CA: Morgan Kaufmann Publishers).
Rosenfield, G. H., and Fitzpatrick-Lins, K., 1986, A coefficient of agreement as a measure of thematic classification accuracy. Photogrammetric Engineering and Remote Sensing, 52, 223–227.
Running, S. W., Justice, C. O., Salomonson, V., Hall, D., Barker, J., Kaufmann, Y. J., Strahler, A. H., Huete, A. R., Muller, J. P., Vanderbilt, V., Wan, Z. M., Teillet, P., and Carneggie, D., 1994, Terrestrial remote sensing science and algorithms planned for EOS/MODIS. International Journal of Remote Sensing, 15, 3587–3620.
Safavian, S. R., and Landgrebe, D., 1991, A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21, 660–674.
Sellers, P. J., Meeson, B. W., Hall, F. G., Asrar, G., Murphy, R. E., Schiffer, R. A., Bretherton, F. P., et al., 1995, Remote sensing of the land surface for studies of global change: models, algorithms, experiments. Remote Sensing of Environment, 51, 3–26.
Stehman, S. V., 1992, Comparison of systematic and random sampling for estimating the accuracy of maps generated from remotely sensed data. Photogrammetric Engineering and Remote Sensing, 58, 1343–1350.
Stehman, S. V., 1997, Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, 62, 77–89.
Sui, D. Z., 1994, Recent applications of neural networks for spatial data handling. Canadian Journal of Remote Sensing, 20, 368–380.
Sundaram, R. K., 1996, A First Course in Optimization Theory (New York: Cambridge University Press).
Swain, P. H., and Davis, S. M. (editors), 1978, Remote Sensing: the Quantitative Approach (New York: McGraw-Hill).
Toll, D. L., 1985, Effect of Landsat Thematic Mapper sensor parameters on land cover classification. Remote Sensing of Environment, 17, 129–140.
Townshend, J. R. G., 1984, Agricultural land-cover discrimination using Thematic Mapper spectral bands. International Journal of Remote Sensing, 5, 681–698.
Townshend, J. R. G., 1992, Land cover. International Journal of Remote Sensing, 13, 1319–1328.
Vapnik, V. N., 1995, The Nature of Statistical Learning Theory (New York: Springer-Verlag).
Vapnik, V. N., 1998, Statistical Learning Theory (New York: Wiley).
Wang, F., 1990, Fuzzy supervised classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 28, 194–201.
Zhuang, X., Engel, B. A., Lozano-Garcia, D. F., Fernandez, R. N., and Johannsen, C. J., 1994, Optimization of training data required for neuro-classification. International Journal of Remote Sensing, 15, 3271–3277.
