PEBLS 2.0 User's Guide



User's Guide

John Rachlin and Steven Salzberg

Department of Computer Science

Johns Hopkins University

September, 1993

Copyright (c) 1993 by The Johns Hopkins University

Table of Contents



1. Introduction

2. Obtaining PEBLS 2.0 by Anonymous FTP

3. An Overview of PEBLS

4. PEBLS Command Reference Guide

5. Error messages

6. Overview of the source code

7. Notes on modifying PEBLS



1. Introduction

PEBLS (Parallel Exemplar-Based Learning System) is a nearest
neighbor learning system designed for applications where the
instances have symbolic feature values. PEBLS has been used,
for example, to predict protein secondary structure based on
the primary amino acid sequence of protein subunits, where the
features in this case are letters of the alphabet corresponding
to each of the 20 amino acids.

PEBLS 2.0 is a serial version written entirely in ANSI C. It is
thus capable of running on a wide range of platforms. Version
2.0 incorporates a number of additions to the original PEBLS,
which is described in detail in the following paper:

Scott Cost and Steven Salzberg. A Weighted Nearest Neighbor
Algorithm for Learning with Symbolic Features, Machine
Learning, 10:1, 57-78 (1993).

Please direct all comments and questions to:

Prof. Steven Salzberg

Dept. of Computer Science

The Johns Hopkins University

Baltimore, MD 21210


Phone: (410) 516

PEBLS 2.0 incorporates a number of features intended to support
flexible experimentation in symbolic domains. We have provided
support for k-nearest neighbor learning, and the ability to
choose among different techniques for weighting both exemplars
and individual features. A number of post-processing techniques
specific to the domain of protein secondary structure have also
been provided.

The purpose of this User's Guide is to explain how PEBLS works,
and to provide a complete reference for those wishing to set up
and perform actual experiments. Section 2 explains how to obtain
the latest version of PEBLS via anonymous FTP. Section 3
provides an overview of PEBLS, and includes a discussion of the
basic algorithm. Section 4 explains how to set up actual
experiments through the use of so-called configuration files,
which allow the user to select among a variety of training and
test formats. Section 5 provides an in-depth explanation of the
PEBLS error messages that the user may encounter when using the
system. Sections 6 and 7 have been included for the benefit of
those wishing to modify the source code. Section 6 is an
overview of the source code itself, while Section 7 provides
step-by-step instructions for performing certain common kinds
of modifications, such as adding additional exemplar weighting
methods or post-processing techniques.

We hope you enjoy using PEBLS!



2. Obtaining PEBLS 2.0 by Anonymous FTP

The latest version of PEBLS is publicly available, and may be
obtained via anonymous FTP from the Johns Hopkins University
Computer Science repository.

To obtain a copy of PEBLS, type the following commands:

    UNIX_prompt> ftp
    Name: anonymous
    Password: [enter your email address]
    ftp> bin
    ftp> cd pub/pebls
    ftp> get pebls.tar.Z
    ftp> bye

    [Place the file pebls.tar.Z in a convenient subdirectory.]

    UNIX_prompt> uncompress pebls.tar.Z
    UNIX_prompt> tar -xf pebls.tar

    [Read the files "README" and "pebls.doc"]

If for any reason you have difficulty obtaining PEBLS,

please contact Steven Salzberg at the above address.



3. An Overview of PEBLS

In this section, we will present an overview of PEBLS 2.0. The
basic idea of instance-based (or exemplar-based) learning is to
treat a set of training examples as points in a
multi-dimensional feature space. Test instances are then
classified by finding the closest exemplars currently contained
in the feature space. The nearest neighbors are determined by
computing the distance to each object in the feature space
using some distance metric. These neighbors are then used to
assign a classification to the test instance.


The key problem is defining the distance metric. In domains
where features are numeric, the problem is relatively
straightforward. Distances between two instances can be
computed geometrically in terms of Euclidean distance, for
example. Indeed, nearest neighbor algorithms for learning from
examples have traditionally worked best in domains in which all
features have numeric values. When the features are symbolic,
however, a more sophisticated distance metric is required. In
the past, the standard technique for handling symbolic features
has been to use the so-called overlap method, which simply
counts the number of feature values that two instances have in
common. Although this method is simple, it suffers from poor
overall performance in certain complex domains.

PEBLS employs a more sophisticated instance-based algorithm
designed specifically for domains in which feature values are
symbolic. As described in detail in our paper, it is based on
the non-Euclidean Value Difference Metric (VDM) of Stanfill and
Waltz, in which the distance d between two symbolic values, V1
and V2, of a given feature is defined as:





$$ d(V_1, V_2) = \sum_{i=1}^{n} \left| \frac{C_{1i}}{C_1} - \frac{C_{2i}}{C_2} \right|^{k} \tag{1} $$


In this equation, V1 and V2 are two possible values for the
feature, e.g., for the protein data these would be two amino
acids. The distance between the values is a sum over all n
classes. C1i is the number of times V1 was classified into
category i, C1 is the total number of times value V1 occurred,
and k is a constant usually set to 1 or 2. This equation is
thus used to generate a VxV matrix for each feature (where V is
the number of possible values for the given feature). This
matrix is called a distance table. We compute a unique distance
table for every feature.

Unlike the overlap method, this metric is based on statistical
information derived from the examples contained in the training
set. The idea behind this metric is that we wish to establish
that values are similar if they occur with the same relative
frequency for all classifications. The term C1i/C1 represents
the likelihood that an instance will be classified as i given
that the feature in question has value V1. Thus we say that two
values are similar if they give similar likelihoods for all
possible classifications. Equation 1 computes the overall
similarity between two values by finding the sum of the
differences of these likelihoods over all classifications.
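The construction of a distance table can be sketched in a few
lines of Python. This is only an illustration of Equation 1, not
the PEBLS source (which is ANSI C); the function name
distance_table and the toy data are hypothetical.

```python
from collections import defaultdict

def distance_table(values, classes, k=1):
    """Build the Equation 1 distance table for a single feature.

    values[j]  is the symbolic value of the feature in training instance j.
    classes[j] is the class of training instance j.
    Returns a dict mapping (v1, v2) to d(v1, v2).
    """
    count = defaultdict(lambda: defaultdict(int))  # count[v][c] plays the role of C_vc
    total = defaultdict(int)                       # total[v] plays the role of C_v
    for v, c in zip(values, classes):
        count[v][c] += 1
        total[v] += 1

    class_set = sorted(set(classes))
    table = {}
    for v1 in total:
        for v2 in total:
            table[(v1, v2)] = sum(
                abs(count[v1][c] / total[v1] - count[v2][c] / total[v2]) ** k
                for c in class_set
            )
    return table

# Toy data (hypothetical): one feature observed in eight instances.
values = ["a", "a", "a", "a", "c", "c", "g", "g"]
classes = ["+", "+", "+", "-", "+", "-", "-", "-"]
table = distance_table(values, classes, k=1)
```

Here "a" is mostly positive and "g" is always negative, so the
table places them far apart, while "c" (evenly split) lies
between the two.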

The distance metric used in PEBLS is a modified value distance
metric (MVDM) based on the original value distance metric of
Stanfill and Waltz. The differences are as follows:

1. We omit a weighting term Wf contained in the Stanfill and
   Waltz VDM which makes their distance metric asymmetric. In
   PEBLS,

       D(X,Y) = D(Y,X)

   This is not necessarily the case using the VDM of Stanfill
   and Waltz.

2. Stanfill and Waltz used the value of k=2 in their version of
   Equation 1. Our experiments have indicated that equally good
   performance is achieved in most cases when k=1. For purposes
   of experimentation, we have made the k value a configurable
   parameter.

3. We have added exemplar weights to our distance formula.

Using the modified value distance metric, the total distance,
D, between two instances is thus given by:

$$ D(X, Y) = W_x W_y \sum_{i=1}^{N} d(x_i, y_i)^{r} \tag{2} $$


where X and Y represent two instances, with X being an exemplar
in memory (i.e., a training instance) and Y a new (test)
example. The variables xi and yi are the values of the i-th
feature for X and Y, where each example has N features. Wx and
Wy are the weighting terms assigned to exemplars. These weights
are assigned by the system automatically, and are used to
account for the accuracy and reliability of certain instances
as representatives of particular classes. A reliable exemplar
is given a low weight, and thus has more "drawing power" than
other exemplars. By contrast, unreliable exemplars are assigned
higher weights so that the PEBLS system will be less inclined
to choose them as nearest neighbors of the test instance. These
unreliable exemplars may represent either noise or
"exceptions." There are many ways to assign these weights.
These are discussed in Section 4. The exponent, r, in equation
(2) is an adjustable parameter normally set to 1 or 2. (In
domains with numeric features, r = 1 yields Manhattan distance,
and r = 2 corresponds to Euclidean distance.)

PEBLS 2.0 also allows the user to assign specific weights to
individual features. This technique has been shown to improve
performance under some circumstances. PEBLS allows you to
assign standard weighting curves to the entire set of features,
or you may set feature weights individually. PEBLS 2.0 also
incorporates an experimental technique that allows the system
to determine feature weights for you. Specific feature
weighting methods are explained in Section 4. With the
incorporation of feature weights, the MVDM becomes:





$$ D(X, Y) = W_x W_y \sum_{i=1}^{N} f(i)\, d(x_i, y_i)^{r} \tag{3} $$


where f(i) is the feature weight of feature i. Note that

the feature weight is not included under the exponent, r.
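As a minimal sketch of the weighted distance between an exemplar
and a test instance, the following Python function applies the
exemplar weights outside the sum and the feature weight outside
the exponent r, as described above. The function name
mvdm_distance, its argument layout, and the toy distance table
are hypothetical; PEBLS itself is written in ANSI C.

```python
def mvdm_distance(x, y, tables, wx=1.0, wy=1.0, feature_weights=None, r=1):
    """Weighted distance between exemplar x and test instance y.

    tables[i] maps a pair of symbolic values of feature i to their
    value distance d. wx and wy are exemplar weights; feature_weights[i]
    is f(i) and defaults to 1.0 for every feature.
    """
    if feature_weights is None:
        feature_weights = [1.0] * len(x)
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        # The feature weight multiplies d ** r; it is not raised to r.
        total += feature_weights[i] * tables[i][(xi, yi)] ** r
    return wx * wy * total

# Toy distance table (hypothetical), shared by both features.
toy = {("a", "a"): 0.0, ("a", "c"): 0.5, ("c", "a"): 0.5, ("c", "c"): 0.0}
distance = mvdm_distance(
    ("a", "c"), ("c", "c"), [toy, toy],
    wx=1.5, wy=1.0, feature_weights=[2.0, 1.0], r=2,
)
```

Because the exemplar weights are distance multipliers, raising
wx from 1.0 to 1.5 moves the exemplar farther from every test
instance by the same factor.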

Performing experiments with PEBLS involves decisions about how
exemplars and features should be weighted (if at all), whether
the system should find multiple nearest neighbors or only the
single nearest neighbor, and how input data should be divided
among training and test instances. The system will output a
variety of information depending, again, on how the experiment
is configured. In the next section, we explain how experiments
are set up and executed using PEBLS 2.0.



4. PEBLS Command Reference Guide

PEBLS 2.0 is designed to make life easy for those wishing to
set up machine learning experiments in symbolic domains. To
execute PEBLS, one must first create a PEBLS Configuration File
(PCF). The PCF is a small ASCII text file that defines the
parameters of the test to be conducted. It is created using any
text editor, such as vi or Emacs. In this section, we
demonstrate how to define a PCF step-by-step. This discussion
will also serve to convey the full set of features currently
supported by PEBLS 2.0. Once you have familiarized yourself
with the features of PEBLS, you may find it convenient to make
use of the quick reference sheet included as part of this
package.

Overall Format of a PCF


A PCF is simply a collection of parameter assignments:

    # This is a Comment
    # This figure shows the general format of a PCF
    <parameter-1> = <value-1>
    <parameter-2> = <value-2>
    <parameter-3> = <value-3a> <value-3b> ...





As indicated above, some parameters are assigned a single
value, while others are assigned a set of values. These values
may be numeric, symbolic, or predefined reserved words,
depending upon the syntax of the parameter itself. In addition
to parameter assignments, the PCF provides for the insertion of
comments by placing a '#' at the beginning of a line. In such
cases, everything to the end of the line is ignored.


Before explaining the parameters of a PCF, several things
should be noted.

1. A PCF is NOT case-sensitive. Parameter names and values may
   be written in upper or lower case. There are a couple of
   exceptions to this, however, such as when file names are
   specified. These exceptions will be noted below.

2. Blank lines in a PCF are ignored.

3. Except in a few rare cases, parameters and their values may
   be listed in any order, although for purposes of
   readability, it is best to keep certain parameter
   definitions together.

4. The system assigns DEFAULT values to many parameters. They
   may be explicitly defined in the configuration file for
   purposes of improved documentation if desired. For each
   configuration parameter defined below, the default values
   will be indicated.
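The format rules above (comment lines, ignored blank lines,
case-insensitive parameter names, one or more values per
parameter) can be sketched as a small reader. This is an
illustration only and is not how PEBLS itself parses a PCF; the
function name parse_pcf and the sample file contents are
hypothetical, although the parameter names used are those
documented in this section.

```python
def parse_pcf(text):
    """Parse PCF-style text into {parameter name: list of values}."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        name, _, values = line.partition("=")
        # Parameter names are folded to lower case; values are kept
        # verbatim because file names are case-sensitive.
        params[name.strip().lower()] = values.split()
    return params

SAMPLE_PCF = """
# Hypothetical PCF for the DNA promoter data
data_file     = dna.dat
DATA_FORMAT   = STANDARD
classes       = 2
class_names   = + -
features      = 57
common_values = a c g t
"""

params = parse_pcf(SAMPLE_PCF)
```

Values are returned as lists so that single-valued and
multi-valued parameters can be handled uniformly.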

We now turn to explaining each of the parameters of a PCF, and
their possible values. There are basically two different
classes of parameters:

A. data format parameters

B. train and test parameters

Data format parameters tell PEBLS about the format of the input
file containing the actual learning data. These parameters
specify, among other things, how many features each instance
has, what the possible classes are, what values each feature
may have, and so on. By contrast, train and test parameters are
used to indicate how the training and test data should be
processed. Such parameters indicate, for example, how exemplars
should be weighted, if at all, the percentage of training data
that should actually be used (if less than 100%), and the level
of output detail generated. Many other features are provided,
as will be explained momentarily.

A. Data Format Parameters



Data File

Syntax: data_file = filename.dat

Example: data_file = dna.dat

Default: None. This parameter must be provided.

Defines the name of the input file where the training and test
data is stored. Note that PEBLS is case-sensitive with respect
to the filename (because UNIX is case-sensitive).

Data Format

Syntax: data_format = STANDARD or SUBUNITS

Example: data_format = STANDARD

Default: None. This parameter must be provided.

The PEBLS system supports two basic formats, which we call
STANDARD and SUBUNITS. In STANDARD format, each learning
instance is defined by a class, an instance id, and an ordered
collection of feature values, as follows:

class, id1, feature1 feature2 feature3 ...

class, id2, feature1 feature2 feature3 ...

class, id3, feature1 feature2 feature3 ...




The DNA promoter sequence data is in STANDARD format.

+, S10, tactagcaatacgct...

+, AMPC, tgctatcctgacagt...

-, 867, atatgaacgttgaga...

-, 1169, cgaacgagtcaatca...




The commas separating the class value, instance id, and feature
values are for readability, and are not actually required.

In SUBUNITS format, the input data consists of a sequence of
sub-units from which training and test instances are
constructed. (The term SUBUNITS comes from the term given to a
protein subsection.) A subunit is defined by a set of values,
each of which is associated with a particular class:

    BEGIN subunit
    feature1 class1
    feature2 class2
    feature3 class3




The protein data used to predict secondary structure consists
of a collection of protein subunits. Here is what a typical
protein segment might look like:
















Normally, the protein segments are much longer. The feature
values correspond to particular amino acids. The class value
defines the secondary structure of the protein at that location
in the segment, where 'H' = alpha helix, 'E' = beta sheet, and
'C' = coil. A collection of instances is constructed by
scanning a fixed-length window along the entire length of the
segment. The features that fall within the window at a
particular location define the feature values for that
instance. The class of the central feature defines the class of
the entire instance. Using a window length of five, for
example, the above protein segment is converted into the
following collection of instances:

C, 1, * * G V P

C, 2, * G V P S

C, 3, G V P S F

E, 4, V P S F T

E, 5, P S F T G

C, 6, S F T G A

C, 7, F T G A F




The asterisks, '*', shown above correspond to NULL feature
values where the scanning window reaches beyond the borders of
the protein segment.
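The window scan can be sketched as follows. This is an
illustrative Python sketch, not the PEBLS implementation; the
function name window_instances is hypothetical, an odd window
length is assumed, and the segment below is a hypothetical one
whose first seven residues and classes are chosen to reproduce
the seven instances listed above (the classes of the last two
residues are assumed to be coil).

```python
def window_instances(features, classes, window=5, null="*"):
    """Slide a fixed-length window along a subunit.

    Returns (class, id, feature list) tuples, one per residue. The class
    of each instance is the class of the residue at the window's center.
    Assumes an odd window length.
    """
    half = window // 2
    padded = [null] * half + list(features) + [null] * half
    instances = []
    for i, cls in enumerate(classes):
        instances.append((cls, i + 1, padded[i:i + window]))
    return instances

features = list("GVPSFTGAF")   # amino acids of a hypothetical segment
classes = list("CCCEECCCC")    # last two classes are assumed
instances = window_instances(features, classes, window=5)
```

Padding both ends with the NULL symbol lets every residue,
including the first and last, serve as the center of a window.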



Number of Classes

Syntax: classes = <integer>

Example: classes = 2

Default: None. Must be provided.

This parameter tells the system how many different classes
there are in the input data. In the DNA promoter sequence data,
there are only two classes, + and -, corresponding to positive
and negative instances respectively. In the protein segment
data, there are three classes (alpha helix, beta sheet, and
coil).


Class Names

Syntax: class_names = name1 name2 name3 ...

Example: class_names = + -

Default: None. Must be provided.

This identifies the symbol used to represent each class. The
class name may be a single character or a word. The class names
may be listed in any order, but all class names must be listed,
and the names must be separated by spaces.


Number of features

Syntax: features = <integer>

Example: features = 57

Default: None. Must be provided.

This parameter indicates how many features there are per
instance. In SUBUNITS mode, this parameter defines the window
length used to construct instances as explained above.

Value Spacing

Syntax: value_spacing = ON or OFF

Example: value_spacing = OFF

Default: ON

This tells the system whether there is a space separating the
list of features in the data file when STANDARD mode is
employed. It is possible to eliminate spaces between feature
values if and only if each feature value is a single character
long. The DNA data is an example of this:

    +, S10, tactagcaatacgcttg....

This instance could be written:

    +, S10, t a c t a g c a a t a c g c t t g ...

but this is less space efficient. When feature values are words
rather than single characters, value spacing must be left ON,
and feature values must be separated by a blank space. In
SUBUNITS mode, value spacing is irrelevant because each
value-class pair is placed on a separate line.


Feature Values

Syntax: feature_values <N> = value1 value2 ...

Example: feature_values 1 = a c g t

The feature_values parameter is used to indicate the possible
symbolic values associated with each feature. Each feature may
have different values, and a different number of values. In the
above syntax, "N" refers to the number of the feature, starting
at 1. If N exceeds the total number of features specified
previously, an error condition results. Each symbolic value
must be separated by a space, even if the values are
single-character names. (If you were to write:
feature_values 1 = acgt, the system would think that feature 1
had only one possible value: "acgt".)


Common Values

Syntax: common_values = value1 value2 value3 ...

Example: common_values = a c g t

Default: None. This parameter must be specified if

individual feature values are not provided.

In some cases, all features have the same set of values. The
DNA promoter sequence data is an example. It would be
inconvenient to write:

    feature_values 1 = a c g t
    feature_values 2 = a c g t
    feature_values 3 = a c g t
    ...
    feature_values 57 = a c g t

Instead, PEBLS allows you to indicate that, in fact, all

features have the same set of possible values. Thus, the

simple statement,

common_values = a c g t

is sufficient.

B. Train and Test Parameters



Training Mode

Syntax: training_mode = SUBSET or
                        SPECIFIED_GROUP or
                        LEAVE_ONE_OUT

Example: training_mode = LEAVE_ONE_OUT

Default: None. This parameter must be provided.

PEBLS supports a variety of techniques for defining the

training and test instances of a particular set of data.

There are three basic options.

SUBSET trains on a random subset of the instances contained
across the entire data set. The size of this subset is
determined by the parameter training_size (see below).
SPECIFIED_GROUP is used when one has a specific predefined
training set and test set. This requires the use of two
additional parameters in the instance data file, "TRAIN" and
"TEST". These parameters define the training set and test set
respectively. Thus, an input file will be formatted as follows:

    # Header information and other comments
    TRAIN
    training_instance 1
    training_instance 2
    ...
    training_instance N
    TEST
    test_instance 1
    test_instance 2
    ...
Note that when the data consists of a collection of sub-units
(SUBUNITS data format), the TRAIN and TEST parameters separate
sub-units, not actual instances. This means that all instances
of a particular protein segment are either training instances
or test instances. Also, it should be emphasized that the TRAIN
and TEST parameters should only be used in conjunction with the
SPECIFIED_GROUP training mode, and that in such cases, training
instances should be listed first and together.

By specifying a training size (see the training_size parameter
below) it is also possible to train on a random subset of
instances taken only from those instances or subunits specified
in the TRAIN block of instances.

LEAVE_ONE_OUT is a special training mode used in conjunction
with the DNA promoter sequence data. The idea is to test each
instance by first training on all other instances in the data
set. This method can be prohibitively expensive when the data
set contains a large number of instances. Note that this method
is equivalent to n-fold cross-validation, where n is the size
of the entire data set.
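The leave-one-out procedure can be sketched as a loop that
holds out one instance at a time. This is an illustrative
Python sketch rather than the PEBLS code; the function names
are hypothetical, and a simple overlap-style nearest neighbor
classifier (counting mismatched feature values) stands in for
the PEBLS distance metric to keep the example short.

```python
def leave_one_out_accuracy(instances, classify):
    """Test each instance after training on all the others.

    instances is a list of (class, features) pairs; classify(training,
    features) returns a predicted class.
    """
    correct = 0
    for i, (cls, features) in enumerate(instances):
        training = instances[:i] + instances[i + 1:]
        if classify(training, features) == cls:
            correct += 1
    return correct / len(instances)

def overlap_nearest(training, features):
    """Stand-in classifier: class of the exemplar with the fewest mismatches."""
    def mismatches(exemplar):
        return sum(a != b for a, b in zip(exemplar[1], features))
    return min(training, key=mismatches)[0]

# Toy data (hypothetical): four instances with three features each.
data = [("+", "aac"), ("+", "aat"), ("-", "ggc"), ("-", "ggt")]
accuracy = leave_one_out_accuracy(data, overlap_nearest)
```

Each of the n instances requires a pass over the remaining
n - 1, which is why the cost grows quickly with the size of the
data set.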


Training Size

Syntax: training_size = <real>

in the range: (0.00 - 1.00)

Example: training_size = 0.50

Default: 1.00

In SUBSET training mode, the training_size parameter is used to
define the size of the training set. For example, a value of
0.50 indicates that the system is to train on half of the
instances across the entire data set. When used in conjunction
with the SPECIFIED_GROUP training mode, the training size
refers to a percentage of the training set only. Thus, if there
were 10 training instances and 30 test instances, and a value
of 0.50 were specified, the system would train on 5 of the 10
training instances, selected at random, before proceeding to
test the 30 specified test instances.


Nearest Neighbors

Syntax: nearest_neighbor = <integer>

where the integer is greater than or equal to one.

Example: nearest_neighbor = 5

Default: 1

PEBLS 2.0 can be made to classify by finding the single nearest
neighbor (nearest_neighbor = 1) or it can classify based on the
classes of multiple nearest neighbors. When performing multiple
nearest neighbor search, the chosen class may be defined by
simple majority vote, i.e., the most common class among the
nearest neighbors, or the voting may be weighted with respect
to the actual distance of each neighbor (see Nearest Neighbor
Voting below).


Nearest Neighbor Voting Scheme

Syntax: nearest_voting = MAJORITY or


THRESHOLD <class> <int> ....

Examples: nearest_voting = MAJORITY

nearest_voting = THRESHOLD C 1 H 2 E 2


This parameter defines how the located K nearest neighbors are
used to define the class of the test instance. There are three
choices. In MAJORITY mode, each neighbor gets one vote. The
test instance is assigned the same class as the most frequently
occurring class among the K nearest neighbors. In cases of a
tie, the class is chosen randomly among the tied classes. In
the distance-weighted mode, each neighbor contributes a vote
towards its particular class inversely proportional to its
distance. The farther away a neighbor is, the less of a vote it
gets. The THRESHOLD method places a threshold condition on each
class. Suppose we have the following:

    nearest_neighbors = 5
    nearest_voting = THRESHOLD C 1 H 2 E 2

This is equivalent to saying the following:

    If (# of neighbors with class = COIL) >= 1
        Class(Instance) = COIL
    If (# of neighbors with class = ALPHA_HELIX) >= 2
        Class(Instance) = ALPHA_HELIX
    If (# of neighbors with class = BETA_SHEET) >= 2
        Class(Instance) = BETA_SHEET

Thus, if there were in fact five neighbors, two alpha helix,
one beta sheet, and two coil, then the test instance would
initially be assigned the class COIL, but would then be
reassigned the class ALPHA_HELIX because the helix condition
has higher precedence (as in the pseudo code provided above).
It is important to note that the ordering in which these
threshold conditions are listed defines the precedence.
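The MAJORITY and THRESHOLD schemes can be sketched in Python as
follows. This is a hedged illustration, not the PEBLS source:
the function names are hypothetical, and returning a default
value when no threshold is met is an assumption, since that
case is not described here.

```python
import random
from collections import Counter

def majority_vote(neighbor_classes, rng=random):
    """One vote per neighbor; ties are broken at random."""
    counts = Counter(neighbor_classes)
    best = max(counts.values())
    tied = [cls for cls, n in counts.items() if n == best]
    return rng.choice(tied)

def threshold_vote(neighbor_classes, thresholds, default=None):
    """Apply (class, minimum count) conditions in the order listed.

    A later condition that is satisfied overrides an earlier one, so the
    listing order defines the precedence. The default return value for
    the case where no condition holds is an assumption.
    """
    counts = Counter(neighbor_classes)
    result = default
    for cls, minimum in thresholds:
        if counts[cls] >= minimum:
            result = cls
    return result

# Two helix, one sheet, and two coil neighbors, as in the example above.
neighbors = ["H", "H", "E", "C", "C"]
chosen = threshold_vote(neighbors, [("C", 1), ("H", 2), ("E", 2)])
```

With the thresholds C 1, H 2, E 2, the coil condition fires
first and is then overridden by the helix condition, matching
the walk-through above.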



Exemplar weighting

Syntax: exemplar_weighting = OFF or
                             ONE_PASS or
                             INCREMENT or
                             USED_CORRECT

Example: exemplar_weighting = USED_CORRECT

Default: OFF

If this parameter is set OFF, then exemplars are unweighted. As
indicated above, this is the default value. Otherwise, the
PEBLS system supports several unique strategies for weighting
exemplars.

ONE_PASS is an exemplar weighting method used in conjunction
with the nearest_neighbor parameter. Each exemplar (trained
instance) is tested in turn, and the k nearest neighbors are
found from among the remaining training set. If j neighbors
have a matching class, then the weight assigned to the current
instance is:

    weight = 1 + k - j


For example, suppose that we have set the nearest_neighbors
parameter equal to five. The exemplar weights will range from 1
to 6 depending on how many nearest neighbors have a matching
class. Remember that (counter to our intuitions about gravity)
a high weight makes the exemplar instance less attractive to
test instances because exemplar weights are distance
multipliers.

It is important to note that when finding the nearest neighbors
during the determination of exemplar weights, the weights of
all exemplars are assumed to be 1.00 even if a weight has
already been assigned. This ensures that the specific weights
assigned to each exemplar are independent of the assignment
order. It is only when processing the test set that exemplar
weights come into play.
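A minimal Python sketch of ONE_PASS weighting follows. It
assumes the weight is 1 + k - j, which is consistent with the
stated range of 1 to 6 for k = 5. The function names and toy
data are hypothetical, and a count of mismatched feature values
stands in for the PEBLS distance metric.

```python
def one_pass_weights(instances, distance, k):
    """Assign each exemplar the weight 1 + k - j.

    j is the number of its k nearest neighbors (found among the other
    training instances, all treated as unweighted) that share its class.
    """
    weights = []
    for i, (cls, feats) in enumerate(instances):
        others = [inst for j, inst in enumerate(instances) if j != i]
        others.sort(key=lambda inst: distance(feats, inst[1]))
        matching = sum(1 for c, _ in others[:k] if c == cls)
        weights.append(1 + k - matching)
    return weights

def mismatches(a, b):
    """Stand-in distance: number of positions whose values differ."""
    return sum(x != y for x, y in zip(a, b))

# Toy training set (hypothetical).
training = [("+", "aaaa"), ("+", "aaac"), ("+", "aacc"),
            ("-", "gggg"), ("-", "gggc")]
weights = one_pass_weights(training, mismatches, k=2)
```

The two negative exemplars each have only one same-class
neighbor, so with k = 2 they receive the higher weight and
attract test instances less strongly.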

The USED_CORRECT method is an alternative weighting technique
that assigns each exemplar a weight in accordance with its
performance history. The weight is defined by two integer
parameters, the number of times the exemplar is used and the
number of times it is used correctly, where:

    weight(i) = used(i) / correct(i)

For example, suppose we are assigning a weight to exemplar j.
We determine that its single nearest neighbor is exemplar i.
Thus used(i) is incremented by one. If exemplar i has the same
class as exemplar j, then correct(i) is incremented as well
because it has been used correctly. Finally, we set the used
and correct values of our current exemplar equal to the used
and correct values of its (current) nearest neighbor. These
values may change as it too is used while remaining exemplars
are assigned weights of their own.

It is important to note that under the USED_CORRECT
methodology, the final weight assignments will depend on the
order in which the actual assignments take place. PEBLS chooses
exemplars at random until all training instances have been
assigned a weight. Over multiple trials (see below) individual
exemplar weights may change.

In the INCREMENT method, all exemplars are initially assigned a
value of 1.00. Then, the single nearest neighbor of each
training instance is determined. The distance is computed
assuming that all exemplars remain unweighted (equivalent to a
weight of 1.00). Whenever the nearest neighbor has a different
class, the weight of the nearest neighbor is incremented by
one. An exemplar's weight is thus one plus the number of times
that it was used (i.e., identified as a nearest neighbor) minus
the number of times that it had the correct class. Thus, if an
exemplar is never used during the weighting process, or if it
is used and always has the correct class, its final weight will
be 1.00. Otherwise, it will be greater than 1.00.
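The INCREMENT method can be sketched in Python as follows. This
is an illustration under stated assumptions rather than the
PEBLS code: the function names and toy data are hypothetical,
and a count of mismatched feature values stands in for the
PEBLS distance metric.

```python
def increment_weights(instances, distance):
    """Start every exemplar at 1.0 and penalize misleading neighbors.

    For each training instance, find its single nearest neighbor among
    the other instances (all treated as unweighted). If that neighbor
    has a different class, the neighbor's weight grows by one.
    """
    weights = [1.0] * len(instances)
    for i, (cls, feats) in enumerate(instances):
        nearest = min(
            (j for j in range(len(instances)) if j != i),
            key=lambda j: distance(feats, instances[j][1]),
        )
        if instances[nearest][0] != cls:
            weights[nearest] += 1.0
    return weights

def mismatches(a, b):
    """Stand-in distance: number of positions whose values differ."""
    return sum(x != y for x, y in zip(a, b))

# Toy training set (hypothetical): exemplars 1 and 2 lie close together
# but belong to different classes.
training = [("+", "aaaa"), ("+", "aacc"), ("-", "accc"),
            ("-", "gggg"), ("-", "gggt")]
weights = increment_weights(training, mismatches)
```

Exemplars 1 and 2 are each other's nearest neighbor with
differing classes, so both end with a weight of 2.0, while the
remaining exemplars keep the initial weight of 1.0.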


Feature Weighting

Syntax: feature_weighting = OFF or
                            TRIANGLE or
                            GENETIC <integer-count> <real-adj>

Syntax: feature_weight <integer> = <real>

(individual feature weight assignment)

Default: OFF

Individual features may be assigned weights as part of the
value-distance metric (see Section 3). For example, in protein
sequence prediction, it is common to assign a greater weight to
features nearer to the central residue. PEBLS supports three
basic approaches to assigning feature weights. One method is to
assign weights using a predefined shape. "OFF" corresponds to a
flat curve where all features are assigned a weight of 1.00. A
TRIANGLE feature weight shape gives more weight to features
closer to the center. These predefined shapes are diagrammed
below:


    OFF:

    1.00|* * * * * * * * * * * * * * *
        |
        |
        |
        +-----------------------------
        0          features

    TRIANGLE:

    1.00|            **
        |          **  **
        |        **      **
        |      **          **
        |    **              **
        |  **                  **
        +-----------------------------
        0          features

The second alternative is to assign feature weights
individually. This method gives the user complete flexibility
over the feature weights, but is somewhat less convenient than
using one of PEBLS' predefined shapes. We will outline in
Section 7 how additional shapes may be incorporated into the
system if desired.

Thus, the command:

    feature_weight 16 = 2.00

assigns a weight of 2.00 to feature 16. Feature numbers go from
1 to N, where N is the total number of features per instance.

The third and final alternative is to allow the system to set
feature weights for you. The GENETIC method is so called
because it uses a technique suggestive of a genetic algorithm.
The idea is to tweak a random feature weight by a random amount
in the range -adj...+adj, where "adj" is the floating point
value specified as part of this option. A random training
instance is then selected, and its nearest neighbor determined.
If the neighbor's class is identical to that of the chosen
training instance, then the adjustment is accepted. Otherwise,
the adjustment is rejected. We do this <integer> times. This
method is highly experimental. It has been found to improve
performance in a few limited cases.

Defining Multiple Trials

Syntax: trials = <integer>

where the number is greater than or equal to one.

Example: trials = 10

Default: 1

In some cases, the results will vary over multiple trials due
to the built-in randomness of certain features, such as
USED_CORRECT exemplar weighting, in which the training order is
random, or SUBSET training mode, in which the training
instances are chosen at random. Under these circumstances the
user may wish to perform multiple trials.

Under certain output modes (see below) the results of each
trial will be provided. In other cases, only the averages over
all trials are reported.



Syntax: post_processing = OFF or




Example: post_processing = PROTEIN_STANDARD

Default: OFF

A post-processing routine is a function which changes the PEBLS
classification in accordance with some well-defined set of
rules. For example, it is known that the minimum length of a
protein alpha-helix is 4 residues (amino acids), while a
protein beta-sheet must be at least 2 residues long. One
standard classification technique, therefore, is to simply
change all alpha-helix and beta-sheet chains to coil when they
do not meet these minimum length restrictions. We call this
technique PROTEIN_STANDARD.
An alternative technique is to first smooth the classifications
by scanning a window of some specified length along the
complete set of classified test instances, changing the class
of the central residue according to the majority classification
within the window. Consider the following hypothetical example:

Amino True PEBLS

Acid Class Class

















In this example, PEBLS has imperfectly classified the protein
segment. The chain of four amino acids which form an
alpha-helix is instead classified as two alpha-helix chains,
one of length 1, and one of length 2. Also note that PEBLS has
generated a false beta-sheet. If we employ the standard
technique, PEBLS will correctly eliminate the spurious
beta-sheet, but will unfortunately eliminate all three
alpha-helix classifications because each of the two separate
chains is below the minimum length requirement.

If we employ the smoothing technique, using a window of length
3, the missing alpha-helix will be properly assigned, while the
lone beta-sheet classification will still be eliminated. After
this smoothing is performed, we reprocess the test instances
using the PROTEIN_STANDARD technique.

This smoothing technique is not perfect. In some cases this
type of window scan may produce additional classification
errors, but overall, this method has been found to improve the
performance of PEBLS when applied to protein data.
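The window smoothing step can be sketched in Python as follows.
This is an illustrative sketch and not the PEBLS routine: the
function name smooth is hypothetical, the window is computed
from the unsmoothed classifications, it is truncated at the
ends of the sequence, and a residue keeps its original class
when no single class holds a strict majority in the window. All
of these details are assumptions, and the predicted sequence
below is a made-up example.

```python
from collections import Counter

def smooth(classes, window=3):
    """Replace each class by the majority class in a centered window.

    The window is taken from the unsmoothed sequence and truncated at
    the ends. If the top count is tied, the original class is kept.
    """
    half = window // 2
    smoothed = []
    for i, original in enumerate(classes):
        neighborhood = classes[max(0, i - half): i + half + 1]
        counts = Counter(neighborhood).most_common()
        top_class, top_count = counts[0]
        unique_winner = len(counts) == 1 or counts[1][1] < top_count
        smoothed.append(top_class if unique_winner else original)
    return smoothed

# Hypothetical predictions: a helix broken by one coil, and a lone sheet.
predicted = list("CCHCHHCCECC")
result = "".join(smooth(predicted, window=3))
```

In this made-up sequence the gap inside the helix is filled and
the isolated beta-sheet residue is removed, although the first
helix residue is also lost, which illustrates how smoothing can
introduce errors of its own.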

As the name suggests, the PR

smooths the data only. The classifications are not then

subjected to the PROTEIN_STANDARD method of processing.

Many other post-processing techniques are possible. PEBLS has
been specifically designed to allow for the incorporation of
further techniques if so desired. Guidelines for adding
post-processing techniques are provided in Section 7.


Generating Output

Syntax: output_mode = COMPLETE or
                      DETAILED or
                      AVERAGES_ONLY

Example: output_mode = DETAILED


This parameter is used to specify the level of detail provided
when generating output results. Output always includes the
contents of the .PCF file for purposes of documentation. All
output is directed to the screen but may be redirected to a
file by use of standard UNIX shell commands, e.g.,

    $ pebls protein.pcf > results.out &

PEBLS assigns a class to each test instance. This
classification is compared with the instance's true class.
PEBLS then generates a table showing its performance record
over each class. Here is what such a table looks like for a
typical test conducted on protein data:




NAME         #      (P)      (U)    (%)      (N)      (O)    (C)

H        776.0    401.0    375.0   0.52   1411.0    267.0   0.37
E        517.0    165.0    352.0   0.32   1647.0    113.0   0.33
C       1487.0   1246.0    241.0   0.84    566.0    588.0   0.35

TOTAL   2780.0   1812.0    968.0   0.65   3624.0    968.0   0.44

PEBLS keeps track of its performance over every trial
and averages the results. In AVERAGES_ONLY mode, only the
averages are displayed. In DETAILED mode, this table is
provided for every single trial. COMPLETE mode shows how
every instance is actually classified. If post-processing
is invoked, PEBLS will show how the instance was classified
before and after processing.



Syntax: debug = ON or OFF

Example: debug = ON

Default: OFF

This parameter sets a flag which can be used by those
wishing to modify the source code. In the following code
segment, debugging information is output if and only if the
debug flag is turned on in the PCF.

if (CONFIG.debug == ON)
{
    <Print Debugging Info>
}

The debug flag thus provides a simple method for turning
all debugging information on and off.
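One way to avoid repeating this test throughout modified code is a
small helper that prints only when the flag is set. The sketch below is
an illustration under stated assumptions: debug_log is a hypothetical
name, and the config_type shown here is a one-field stand-in for the
real structure, with ON and OFF given placeholder values.

```c
#include <stdarg.h>
#include <stdio.h>

enum { OFF = 0, ON = 1 };          /* stand-in keyword values */

typedef struct {
    int debug;                     /* set from "debug = ON" in the PCF */
} config_type;                     /* stand-in for the real structure */

config_type CONFIG = { OFF };

/* Print a formatted message to stderr only when debugging is on.
 * Returns the number of characters written, or 0 when disabled. */
int debug_log(const char *fmt, ...)
{
    va_list args;
    int written;

    if (CONFIG.debug != ON)
        return 0;

    va_start(args, fmt);
    written = vfprintf(stderr, fmt, args);
    va_end(args);
    return written;
}
```

A call such as debug_log("trial %d\n", t) then produces output only
when the PCF contains debug = ON.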


5. Error messages

(Listed in alphabetical order.)


Data file <filename> does not exist.

The data file specified in the PEBLS configuration file

does not exist.

Feature value index is illegal.

The feature value index specified in the command:

feature_values <index> = {Symbol List}

is either less than or equal to 0, or it exceeds the total

number of features declared by the command:

features = <N>

The feature value index must be a value 1..N, where N is

the total number of features.


Feature weight index is illegal.

The feature weight index specified in the command:

feature_weight <index> = <real>

is either less than or equal to 0, or it exceeds the total
number of features declared by the command:

features = <N>

The feature weight index must be a value 1..N, where N is
the total number of features.

GENETIC feature weighting is only for use with

SPECIFIED_GROUP training mode.

Self-explanatory. Please refer to Section 4 for a
complete discussion of this feature weighting technique.


Illegal Training Size

The training size value must be in the range


(For more information on the use of this parameter, see

Section 4.)

Nearest neighbor parameter must be greater than 1

Self-explanatory. See Section 4 for a complete
discussion of the nearest neighbor parameters.

No training instances specified in <datafile.dat>

In SPECIFIED_GROUP training mode, the training and test
instances must be explicitly identified using the
"TRAIN" and "TEST" keywords. For more information on
training modes, please refer to Section 4.

Number of classes exceeds CLASSES_MAX (modify config.h)

The file config.h contains a defined constant
(CLASSES_MAX) used to specify the maximum number of
classes that may be specified in a given configuration
file. You can support a larger number of classes
simply by changing this constant.

Number of features exceeds FEATURES_MAX (modify config.h)

The file config.h contains a defined constant

(FEATURES_MAX) used to specify the maximum number of

features per instance that may be specified in a given

configuration file. You can support a larger number of

features per instance simply by changing this constant.

Number of values per feature exceeds VALUES_MAX (modify config.h)

The file config.h contains a defined constant
(VALUES_MAX) used to specify the maximum number of
symbolic values for any feature that may be specified in a
given configuration file. You can support a larger number of
values per feature simply by changing this constant.

Number of instances exceeds INSTANCES_MAX (modify config.h)

The file config.h contains a defined constant
(INSTANCES_MAX) used to specify the maximum total
number of training and test instances. If PEBLS reads
a data file containing too many instances, this error
is generated. You can support a larger number of
instances simply by changing this constant.

Number of nearest neighbors exceeds K_NEIGHBOR_MAX (modify config.h)

The file config.h contains a defined constant
(K_NEIGHBOR_MAX) used to specify the maximum number of
nearest neighbors that may be specified in the
configuration file. You can allow the system to find
more neighbors simply by changing this constant.

Trial size exceeds TRIALS_MAX (modify config.h)

The file config.h contains a defined constant
(TRIALS_MAX) used to specify the maximum number of
trials that may be specified in a given configuration
file. You can support a larger number of trials simply
by changing this constant.
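All of the limit errors above follow the same pattern: a value
declared in the configuration file is compared against a compile-time
constant. A minimal sketch of such a check is shown below. The helper
name within_limit and its signature are hypothetical, and the value
given for FEATURES_MAX is for illustration only; the real value is
whatever your copy of config.h defines.

```c
#include <stdio.h>

#define FEATURES_MAX 100   /* illustrative value only */

/* Compare a declared value against its compile-time maximum.
 * Returns 1 if the value fits; otherwise reports an error in the
 * style of the messages listed above and returns 0. */
int within_limit(const char *what, const char *constant_name,
                 int declared, int maximum)
{
    if (declared > maximum) {
        fprintf(stderr, "Number of %s exceeds %s (modify config.h)\n",
                what, constant_name);
        return 0;
    }
    return 1;
}
```

Because the arrays are statically allocated, raising a limit means
editing the constant and recompiling.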

Unknown constant in configuration file.

An unknown constant was encountered while processing
the configuration file. This is most likely due to a
spelling mistake.

Unknown exemplar weighting technique

Please refer to Section 4 for a list of the legal
exemplar weighting techniques.

Unknown feature weighting technique

Please refer to Section 4 for a list of the legal

feature weighting techniques.

Unknown post-processing technique

Please refer to Section 4 for a list of the legal
post-processing techniques.

Undeclared feature value encountered in instance data

All possible feature values must be declared in the

configuration file. Section 4 explains how to do this.

Undeclared class encountered in instance data

All possible class assignments must be declared in the
configuration file. Section 4 explains how to do this.

USAGE: pebls <filename.pcf>

This message occurs if the user fails to specify an
input PEBLS configuration file when executing the pebls
program. The name of an existing configuration file
must be specified. Note that if you say, for example:

pebls protein.pcf

the output will be directed to the screen, and the
process will be run in the foreground. To redirect the
results to a file while running PEBLS in the
background, as is recommended, use the following syntax:

pebls protein.pcf > protein.out &

If you want your output to contain a time and date stamp, use:

(date; pebls protein.pcf) > protein.out &



6. Overview of the source code

PEBLS 2.0 is written in ANSI C. Every attempt has been
made to keep the system modular so that modifications will
be relatively painless. Users wishing to alter the source
code should read both this section and the specific
suggestions provided in Section 7.

The source code consists of the following files:


config.h

This file contains a small number of defined constants
used to configure the system with respect to the size of
statically allocated arrays. Certain parameter values, such
as the number of classes or the number of features declared
in a PEBLS configuration file (.PCF), must not exceed the
maximum values defined in config.h. For example, if config.h
contains the line:

#define FEATURES_MAX 100

then the declared number of features per instance in the PCF
may be less than or equal to 100, but not greater.
Otherwise, an error will occur telling you to modify
config.h (see Section 5). The constants defined in
config.h are:


The maximum total number of training and test instances.



The maximum number of classes.


The maximum number of features per instance.


The maximum length of any subunit (SUBUNITS format).



The maximum number of values for any feature in an



The maximum number of trials per test run.


The maximum number of neighbors that will be



The maximum length of an instance or sub-unit id.


Used to configure the hash table containing

symbolic names.


The maximum line length for data or configuration files.



This file contains predefined keyword constants, type
declarations, and function prototypes. Of particular
interest to those wishing to make source code changes is the
C structure:

config_type CONFIG;

CONFIG is a global structure used to contain information
extracted from the PEBLS configuration file. Some field
values, such as the number of instances, are determined by
reading the actual data input file. A program statement
that checks the declared method of exemplar weighting would
be written, for example:

if (CONFIG.exemplar_weighting == ONE_PASS)
{
    <Specific code here>
}
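To make the idea concrete, here is a hedged sketch of what such a
global structure and a test against it might look like. Only the field
names debug, features, exemplar_weighting, post_processing, and
feature_weights appear in this guide; their types, the numeric values
of the keyword constants, the value of FEATURES_MAX, and the function
describe_exemplar_weighting are assumptions made for the example.
Consult the actual header for the real declaration.

```c
#define FEATURES_MAX 100            /* illustrative value only */

/* Stand-in keyword constants; the real values are in the header. */
enum { OFF, ON, ONE_PASS, USER_EWEIGHT_2 };

typedef struct {
    int   debug;                    /* ON or OFF */
    int   features;                 /* number of features per instance */
    int   exemplar_weighting;       /* e.g. ONE_PASS */
    int   post_processing;          /* post-processing technique */
    float feature_weights[FEATURES_MAX];
} config_type;                      /* assumed layout, for illustration */

config_type CONFIG;                 /* filled in from the PCF */

/* Branch on the declared exemplar weighting method. */
const char *describe_exemplar_weighting(void)
{
    switch (CONFIG.exemplar_weighting) {
    case ONE_PASS:
        return "one-pass exemplar weighting";
    case USER_EWEIGHT_2:
        return "user-defined exemplar weighting 2";
    default:
        return "other";
    }
}
```

Keeping every configuration value in one global structure means that
new code can consult the PCF settings without additional parameters.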



pebls.c contains main(), the core nearest neighbor
routines, and the training and test functions that are
responsible for invoking initialization, reading, and
processing functions contained in other source files.


Initialization routines. Clears internal arrays, and

processes the PEBLS configuration file including checks for

configuration errors.


Contains functions for reading data input files for all
supported formats. All training and test instances are read
into the array data declared globally in pebls.c:

instance_type data[INSTANCES_MAX];


Includes core functions for building distance tables
and computing distance metrics. The Modified Value Difference
Metric (MVDM) is contained in this file.
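As we understand the formulation in the Cost and Salzberg paper cited
in the introduction, the distance between two values V1 and V2 of a
feature is the sum over classes i of |C1i/C1 - C2i/C2|^k, where C1i is
the number of training instances with value V1 in class i and C1 is
the total number of training instances with value V1. The sketch below
computes one such table entry. It is an illustration, not the PEBLS
source: the function name value_distance, its argument layout, and the
choice to return 0 for a value that never occurs are assumptions.

```c
/* Distance between two symbolic values of one feature.
 * v1_counts[c], v2_counts[c] : occurrences of each value in class c
 * classes                    : number of classes
 * k                          : exponent (a small positive integer) */
double value_distance(const int v1_counts[], const int v2_counts[],
                      int classes, int k)
{
    double total1 = 0.0, total2 = 0.0, sum = 0.0;
    int c, e;

    for (c = 0; c < classes; c++) {
        total1 += v1_counts[c];
        total2 += v2_counts[c];
    }
    if (total1 == 0.0 || total2 == 0.0)
        return 0.0;                 /* assumption: unseen value */

    for (c = 0; c < classes; c++) {
        double diff = v1_counts[c] / total1 - v2_counts[c] / total2;
        double term = 1.0;

        if (diff < 0.0)
            diff = -diff;
        for (e = 0; e < k; e++)     /* diff raised to the k-th power */
            term *= diff;
        sum += term;
    }
    return sum;
}
```

Two values that occur with the same relative class frequencies have
distance 0, so they are treated as interchangeable by the metric.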


weights.c

Individual procedures for assigning weights to
exemplars and features.


This file contains routines for storing symbolic
feature value names in a hash table for fast retrieval.
Every feature value is assigned an integer which is used for
indexing into the distance tables during the computation of
distances. The size of the hash table may be configured in
config.h.
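The mapping from symbolic names to integer indices can be sketched as
a small hash table. The version below uses linear probing and is an
illustration only: HASH_SIZE, NAME_MAX_LEN, the type names, the hash
function, and symbol_index are all hypothetical, and the actual PEBLS
routines may be organized quite differently.

```c
#include <string.h>

#define HASH_SIZE    101   /* hypothetical table size */
#define NAME_MAX_LEN 32    /* hypothetical maximum name length */

typedef struct {
    char name[NAME_MAX_LEN];
    int  index;            /* integer assigned to this symbol */
    int  used;             /* nonzero once the slot is occupied */
} symbol_slot;

typedef struct {
    symbol_slot slots[HASH_SIZE];
    int count;             /* number of symbols stored so far */
} symbol_table;            /* must be zero-initialized before use */

static unsigned hash_name(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31u + (unsigned char)*s++;
    return h % HASH_SIZE;
}

/* Return the index for name, assigning the next free integer if the
 * name is new. Returns -1 if the table is full or the name too long. */
int symbol_index(symbol_table *t, const char *name)
{
    unsigned h, probe;

    if (strlen(name) >= NAME_MAX_LEN)
        return -1;
    h = hash_name(name);
    for (probe = 0; probe < HASH_SIZE; probe++) {
        symbol_slot *s = &t->slots[(h + probe) % HASH_SIZE];
        if (!s->used) {
            strcpy(s->name, name);
            s->index = t->count++;
            s->used = 1;
            return s->index;
        }
        if (strcmp(s->name, name) == 0)
            return s->index;
    }
    return -1;
}
```

Once each symbol has an integer, a distance lookup becomes a plain
array access rather than a string comparison.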



util.c

Basic utility routines, including functions for random
number generation and error reporting. This file also
contains domain-specific post-processing routines designed
to improve PEBLS classification performance in certain
domains. Currently, the post-processing routines provided
are specifically for protein secondary structure prediction.


Routines for the generation of PEBLS output.



7. Notes on modifying PEBLS

You are welcome to modify PEBLS 2.0 to suit your needs. Please
let us know what changes or improvements you have made so that we can
consider including them in future versions of the program. Please
do not redistribute PEBLS if you do in fact decide to make any changes.
If you give PEBLS to others, please make sure you give them the
complete file set, including the Machine Learning journal paper.

This section offers a few details about how to add features to
PEBLS 2.0. Such features would include additional exemplar
weighting techniques, feature weighting techniques, or post-processing
routines. We have set aside a number of predefined constants to make
the task easier, but after reading this section, programmers will also
wish to study the code of the currently supported methods.

Adding exemplar weighting methods.


Three defined constants have been set aside for

creating additional exemplar weighting methods:




For example, in the PEBLS configuration file, you would say:

exemplar_weighting = USER_EWEIGHT_2

These constants correspond to dummy procedures declared in
the file weights.c:




These routines must assign floating point values to each
training instance by referencing the array of instances:

data[i].weight = <floating_point_value>;
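A user-defined exemplar weighting routine might therefore look like
the sketch below. Everything except the data array and its weight
field is an assumption: the routine name user_eweight_sketch, the
training_size argument, the value of INSTANCES_MAX, and the used and
correct counters are hypothetical, and the ratio computed here is only
one possible weighting, chosen to show the mechanics.

```c
#define INSTANCES_MAX 1000   /* illustrative value only */

typedef struct {
    float weight;            /* exemplar weight, as described above */
    int   used;              /* hypothetical: times used as nearest neighbor */
    int   correct;           /* hypothetical: times that use was correct */
} instance_type;             /* stand-in for the real declaration */

instance_type data[INSTANCES_MAX];

/* Assign a weight to every training instance. Exemplars that are
 * often used but seldom correct receive larger weights; adding 1 to
 * both counts avoids division by zero for unused exemplars. */
void user_eweight_sketch(int training_size)
{
    int i;

    for (i = 0; i < training_size; i++)
        data[i].weight = (float)(data[i].used + 1) /
                         (float)(data[i].correct + 1);
}
```

An exemplar that has never been used receives weight 1.0 under this
scheme.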

Adding feature weighting methods.


Adding feature weighting methods is similar. There are
three predefined constants for user-defined feature
weighting methods:




Thus, your PEBLS configuration file would include a line

such as:

feature_weighting = USER_FWEIGHT_1

These constants correspond to dummy procedures also declared

in the file weights.c:




These routines must assign floating point values to the
array of feature weights contained within the global CONFIG
structure. For example, you might say:

int features = CONFIG.features;
int x;

for (x = 0; x < features; x++)
    CONFIG.feature_weights[x] = <floating_point_value>;



Adding post processing techniques.


Every instance in the data array contains three
classification fields:

data[i].class_true     (The true class of the instance)
data[i].class_nearest  (The class assigned by PEBLS)
data[i].class_pp       (The post-processing class assigned by
                        some domain-specific method)

Normally, the post-processing class will use
information about how PEBLS classified the instance;
otherwise, there would be no reason to use PEBLS in the
first place! When the data consists of sub-units (e.g.,
protein sub-units), the post-processing scheme may
depend on how PEBLS classified neighboring instances at
data[i-1] or data[i+1]. See Section 4 for details on how
current post-processing techniques work. Programmers should
also examine currently supported post-processing functions
for additional assistance.

In PEBLS, output is based on data[i].class_nearest
unless a post-processing technique is specified
(CONFIG.post_processing), in which case it is based on the
value contained in the class_pp field.
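The interplay of the three fields can be sketched as follows. This is
an illustration under stated assumptions, not code from PEBLS: the
constants NONE and USER_PP_SKETCH, the routine names, and the integer
class codes are hypothetical. The example routine fills class_pp for
every instance, overriding a classification only when both neighbors
agree with each other.

```c
enum { NONE = 0, USER_PP_SKETCH = 1 };   /* hypothetical constants */

typedef struct {
    int class_true;      /* the true class of the instance */
    int class_nearest;   /* the class assigned by PEBLS */
    int class_pp;        /* the class assigned by post-processing */
} instance_type;         /* stand-in for the real declaration */

/* Example post-processing routine: copy the PEBLS class into
 * class_pp, except where data[i-1] and data[i+1] agree, in which
 * case their class is used. */
void isolated_fix(instance_type data[], int n)
{
    int i;

    for (i = 0; i < n; i++) {
        data[i].class_pp = data[i].class_nearest;
        if (i > 0 && i < n - 1 &&
            data[i - 1].class_nearest == data[i + 1].class_nearest)
            data[i].class_pp = data[i - 1].class_nearest;
    }
}

/* Select the class to report, following the rule described above. */
int reported_class(const instance_type *inst, int post_processing)
{
    return post_processing == NONE ? inst->class_nearest
                                   : inst->class_pp;
}
```

Note that the routine reads only class_nearest and writes only
class_pp, so the original PEBLS classification remains available for
comparison in the output.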

For your convenience, we have set aside the following

predefined constants and functions (in util.c):