PEBLS 2.0 User's Guide

PEBLS 2.0

User's Guide


John Rachlin and Steven Salzberg

Department of Computer Science
Johns Hopkins University

September, 1993


Copyright (c) 1993 by The Johns Hopkins University

Table of Contents
-----------------

Section

1. Introduction
2. Obtaining PEBLS 2.0 by Anonymous FTP
3. An Overview of PEBLS
4. PEBLS Command Reference Guide
5. Error messages
6. Overview of the source code
7. Notes on modifying PEBLS








































SECTION 1. INTRODUCTION


-------------------------



PEBLS (Parallel Exemplar-Based Learning System) is a
nearest-neighbor learning system designed for applications
where the instances have symbolic feature values. PEBLS has
been used, for example, to predict protein secondary
structure based on the primary amino acid sequence of
protein sub-units, where the features in this case are
letters of the alphabet corresponding to each of the 20
amino acids.



PEBLS 2.0 is a serial version written entirely in ANSI
C. It is thus capable of running on a wide range of
platforms. Version 2.0 incorporates a number of additions
to the original PEBLS, which is described in detail in the
following paper:

    Scott Cost and Steven Salzberg. A Weighted Nearest
    Neighbor Algorithm for Learning with Symbolic Features,
    Machine Learning, 10:1, 57-78 (1993).




Please direct all comments and questions to:

    Prof. Steven Salzberg
    Dept. of Computer Science
    The Johns Hopkins University
    Baltimore, MD 21210

    Email: salzberg@cs.jhu.edu
    Phone: (410) 516-8438




PEBLS 2.0 incorporates a number of features intended to
support flexible experimentation in symbolic domains. We
have provided support for k-nearest neighbor learning, and
the ability to choose among different techniques for
weighting both exemplars and individual features. A number
of post-processing techniques specific to the domain of
protein secondary structure have also been provided.



The purpose of this User's Guide is to explain how
PEBLS works, and to provide a complete reference for those
wishing to set up and perform actual experiments. Section 2
explains how to obtain the latest version of PEBLS via
anonymous FTP. Section 3 provides an overview of PEBLS,
and includes a discussion of the basic algorithm. Section 4
explains how to set up actual experiments through the use of
so-called configuration files, which allow the user to select
among a variety of training and test formats. Section 5
provides an in-depth explanation of PEBLS error messages
that the user may encounter when using the system. Sections
6 and 7 have been included for the benefit of those wishing
to modify the source code. Section 6 is an overview of the
source code itself, while Section 7 provides step-by-step
instructions for performing certain common kinds of
modifications, such as adding additional exemplar-weighting
methods or post-processing techniques.



We hope you enjoy using PEBLS!





SECTION 2. OBTAINING PEBLS BY ANONYMOUS FTP


-------------------------------------------



The latest version of PEBLS is publicly available, and
may be obtained via anonymous FTP from the Johns Hopkins
University Computer Science repository.



To obtain a copy of PEBLS, type the following commands:

    UNIX_prompt> ftp blaze.cs.jhu.edu
    Name: anonymous
    Password: [enter your email address]

    ftp> bin
    ftp> cd pub/pebls
    ftp> get pebls.tar.Z
    ftp> bye

[Place the file pebls.tar.Z in a convenient subdirectory.]

    UNIX_prompt> uncompress pebls.tar.Z
    UNIX_prompt> tar -xf pebls.tar

[Read the files "README" and "pebls.doc"]



If for any reason you have difficulty obtaining PEBLS,

please contact Steven Salzberg at the above address.





SECTION 3. AN OVERVIEW OF PEBLS


-------------------------------




In this section, we will present an overview of PEBLS
2.0. The basic idea of instance-based (or exemplar-based)
learning is to treat a set of training examples as points in
a multi-dimensional feature space. Test instances are then
classified by finding the closest exemplar currently
contained in the feature space. The nearest neighbors are
determined by computing the distance to each object in the
feature space using some distance metric. These neighbors
are then used to assign a classification to the test
instance.



The key problem is defining the distance metric. In
domains where features are numeric, the problem is
relatively straightforward. Distances between two
instances can be computed geometrically in terms of
Euclidean distance, for example. Indeed, nearest neighbor
algorithms for learning from examples have traditionally
worked best in domains in which all features have numeric
values. When the features are symbolic, however, a more
sophisticated distance metric is required. In the past, the
standard technique for handling symbolic features has been
to use the so-called overlap method, which simply counts the
number of feature values that two instances have in common.
Although this method is simple, it suffers from poor overall
performance in certain complex domains.



PEBLS employs a more sophisticated instance-based algorithm
designed specifically for domains in which feature values are
symbolic. As described in detail in our paper, it is based on
the non-Euclidean Value Difference Metric (VDM) of Stanfill and
Waltz, in which the distance d between two symbolic values, V1
and V2, of a given feature is defined as:



                  n
                 ___
                 \                          k
     d(V1,V2) =  /__   | C1i/C1 - C2i/C2 |           (1)
                 i=1




In this equation, V1 and V2 are two possible values for
the feature, e.g., for the protein data these would be two
amino acids. The distance between the values is a sum over
all n classes. C1i is the number of times V1 was classified
into category i, C1 is the total number of times value V1
occurred, and k is a constant usually set to 1 or 2. This
equation is thus used to generate a VxV matrix for each
feature (where V is the number of possible values for the
given feature). This matrix is called a distance table. We
compute a unique distance table for every feature.
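The construction of a distance table can be sketched in a few lines. The following is a minimal Python illustration of Equation 1, not the actual PEBLS source (which is ANSI C); the function and variable names are our own.

```python
from collections import defaultdict

def vdm_table(values, classes, k=1):
    """Build the distance table for one feature per Equation 1.

    `values` holds this feature's value for every training instance,
    and `classes` holds the matching class labels.  Returns a dict
    mapping (V1, V2) to d(V1, V2).  (Names are ours, not PEBLS's.)
    """
    count = defaultdict(int)   # C1i: times value V was classified as i
    total = defaultdict(int)   # C1:  times value V occurred at all
    for v, c in zip(values, classes):
        count[(v, c)] += 1
        total[v] += 1

    class_set = sorted(set(classes))
    table = {}
    for v1 in total:
        for v2 in total:
            table[(v1, v2)] = sum(
                abs(count[(v1, c)] / total[v1]
                    - count[(v2, c)] / total[v2]) ** k
                for c in class_set)
    return table
```

For example, `vdm_table(['a','a','g','g'], ['+','-','+','+'])` gives d('a','g') = |1/2 - 2/2| + |1/2 - 0/2| = 1.0, and the table is symmetric.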



Unlike the overlap method, this metric is based on
statistical information derived from the examples contained
in the training set. The idea behind this metric is that we
wish to establish that values are similar if they occur with
the same relative frequency for all classifications. The
term C1i/C1 represents the likelihood that an instance will
be classified as i given that the feature in question has
value V1. Thus we say that two values are similar if they
give similar likelihoods for all possible classifications.
Equation 1 computes overall similarity between two values by
finding the sum of the differences of these likelihoods
over all classifications.




The distance metric used in PEBLS is a modified value
difference metric (MVDM) based on the original value
difference metric of Stanfill and Waltz. The differences
are as follows:

1. We omit a weighting term Wf contained in the Stanfill
   and Waltz VDM which makes their distance metric
   asymmetric. In PEBLS,

       D(X,Y) = D(Y,X)

   This is not necessarily the case using the VDM of
   Stanfill-Waltz.



2. Stanfill and Waltz used the value of k=2 in their
   version of Equation 1. Our experiments have indicated
   that equally good performance is achieved in most cases
   when k=1. For purposes of experimentation, we have
   made the k value a configurable parameter.

3. We have added exemplar weights to our distance formula.




Using the modified value difference metric, the total
distance, D, between two instances is thus given by:

                        N
                       ___
                       \                 r
    D(X,Y) = Wx Wy     /__   d(xi, yi)               (2)
                       i=1


where X and Y represent two instances, with X being an
exemplar in memory (i.e., a training instance) and Y a new
(test) example. The variables xi and yi are values of the
i-th feature for X and Y, where each example has N features.
Wx and Wy are the weighting terms assigned to exemplars.
These weights are assigned by the system automatically, and
are used to account for the accuracy and reliability of
certain instances as representatives of particular classes.
A reliable exemplar is given a low weight, and thus has more
"drawing power" than other exemplars. By contrast,
unreliable exemplars are assigned higher weights so that the
PEBLS system will be less inclined to choose them as one of
the test instance's nearest neighbors. These unreliable
exemplars may represent either noise or "exceptions." There
are many ways to assign these weights. These are discussed
in Section 4. The exponent, r, in equation (2) is an
adjustable parameter normally set to 1 or 2. (In domains
with numeric features, r = 1 yields Manhattan distance, and
r = 2 corresponds to Euclidean distance.)
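The total distance of Equation 2 can be sketched as follows; the same function also covers the feature-weighted form of Equation 3 below when a weight list is supplied. This is an illustrative sketch with names of our own choosing, not the PEBLS source.

```python
def instance_distance(x, y, wx, wy, tables, r=1, f=None):
    """Total distance D(X,Y) of Equation 2 (or Equation 3 when a
    feature-weight list f is given).

    `tables` is one distance table per feature, each mapping a pair
    of values to d(v1, v2); wx and wy are the exemplar weights.
    """
    n = len(x)
    fw = f if f is not None else [1.0] * n
    # The feature weight multiplies d(xi, yi)**r but is not itself
    # raised to the exponent r.
    total = sum(fw[i] * tables[i][(x[i], y[i])] ** r for i in range(n))
    return wx * wy * total
```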



PEBLS 2.0 also allows the user to assign specific
weights to individual features. This technique has been
shown to improve performance under some circumstances.
PEBLS allows you to assign standard weighting curves to the
entire set of features, or you may set feature weights
individually. PEBLS 2.0 also incorporates an experimental
technique that allows the system to determine feature
weights for you. Specific feature weighting methods are
explained in Section 4. With the incorporation of feature
weights, the MVDM becomes:



                        N
                       ___
                       \                      r
    D(X,Y) = Wx Wy     /__   f(i) d(xi, yi)          (3)
                       i=1

where f(i) is the feature weight of feature i. Note that
the feature weight is not included under the exponent, r.



Performing experiments with PEBLS involves decisions
about how exemplars and features should be weighted (if at
all), whether the system should find multiple nearest
neighbors or only the single nearest neighbor, and how
input data should be divided among training and test
instances. The system will output a variety of information
depending, again, on how the experiment is configured. In
the next section, we explain how experiments are set up and
executed using PEBLS 2.0.






SECTION 4. PEBLS COMPLETE REFERENCE GUIDE


------------------------------------------



PEBLS 2.0 is designed to make life easy for those
wishing to set up machine learning experiments in symbolic
domains. To execute PEBLS, one must first create a PEBLS
Configuration File (PCF). The PCF is a small ASCII text
file that defines the parameters of the test to be
conducted. It is created using any text editor, such as VI
or EMACS. In this section, we demonstrate how to define a
PCF step-by-step. This discussion will also serve to convey
the full set of features currently supported by PEBLS 2.0.
Once you have familiarized yourself with the features of
PEBLS, you may find it convenient to make use of the quick
reference sheet included as part of this package.



Overall Format of a PCF
-----------------------

A PCF is simply a collection of parameter assignments:

    # This is a Comment
    # This figure shows the general format of a PCF

    <parameter-1> = <value-1>
    <parameter-2> = <value-2>
    <parameter-3> = <value-3a> <value-3b> <value-3c>
    .
    .
    .
    etc.

As indicated above, some parameters are assigned a
single value, while others are assigned a set of values.
These values may be numeric, symbolic, or predefined
reserved words, depending upon the syntax of the parameter
itself. In addition to parameter assignments, the PCF
provides for the insertion of comments by placing a '#' at
the beginning of a line. In such cases, everything to the
end of the line is ignored.
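The parsing rules above are simple enough to sketch. The following is a minimal Python illustration of the format — '#' comments, blank lines ignored, names not case-sensitive — and not the actual PEBLS parser.

```python
def parse_pcf(text):
    """Parse PCF text into a dict mapping a (lower-cased) parameter
    name to its list of value tokens.  A sketch only.
    """
    params = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()   # drop comments
        if not line:
            continue                           # skip blank lines
        name, _, rhs = line.partition('=')
        # Values keep their case: file names are case-sensitive.
        params[name.strip().lower()] = rhs.split()
    return params
```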



Before explaining the parameters of a PCF, several
things should be noted.

1. A PCF is NOT case-sensitive. Parameter names and
   values may be written as upper or lower case. There
   are a couple of exceptions to this, however, such as
   when file names are specified. These exceptions will
   be noted below.

2. Blank lines in a PCF are ignored.

3. Except in a few rare cases, parameters and their values
   may be listed in any order, although for purposes of
   readability, it is best to keep certain parameter
   definitions together.

4. The system assigns DEFAULT values to many parameters.
   They may be explicitly defined in the configuration
   file for purposes of improved documentation if desired.
   For each configuration parameter defined below, the
   default values will be indicated.




We now turn to explaining each of the parameters of a
PCF, and their possible values. There are basically two
different classes of parameters:

    A. data format parameters
    B. train and test parameters

Data format parameters tell PEBLS about the format of
the input file containing the actual learning data. These
parameters specify, among other things, how many features
each instance has, what the possible classes are, what
values each feature may have, and so on. By contrast, train
and test parameters are used to indicate how the training
and test data should be processed. Such parameters
indicate, for example, how exemplars should be weighted, if
at all, the percentage of training data that should actually
be used (if less than 100%), and the level of output detail
generated. Many other features are provided, as will be
explained momentarily.



A. Data Format Parameters
-------------------------

============================================================

Data File

Syntax:  data_file = filename.dat
Example: data_file = dna.dat
Default: None. This parameter must be provided.

Defines the name of the input file where the training and
test data is stored. Note that PEBLS is case-sensitive with
respect to the filename (because UNIX is case-sensitive).




============================================================

Data Format

Syntax:  data_format = STANDARD or SUBUNITS
Example: data_format = STANDARD
Default: None. This parameter must be provided.

The PEBLS system supports two basic formats, which we call
STANDARD and SUBUNITS. In STANDARD format, each learning
instance is defined by a class, an instance id, and an
ordered collection of feature values, as follows:

    class, id1, feature1 feature2 feature3 ...
    class, id2, feature1 feature2 feature3 ...
    class, id3, feature1 feature2 feature3 ...
    .
    .
    etc.

The DNA promoter sequence data is in STANDARD format:

    +, S10,  tactagcaatacgct...
    +, AMPC, tgctatcctgacagt...
    -, 867,  atatgaacgttgaga...
    -, 1169, cgaacgagtcaatca...
    .
    .
    etc.



The commas separating the class value, instance id, and
feature values are for readability, and are not actually
required.
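Reading one STANDARD-format record might be sketched as follows; this is an illustration of the layout just described, not the PEBLS reader. (The `value_spacing` flag anticipates the Value Spacing parameter described later in this section.)

```python
def parse_standard_line(line, value_spacing=True):
    """Split one STANDARD-format record into (class, id, features).

    Commas are treated as optional separators, as described above.
    With value_spacing=False, the single-character feature values
    are run together in one token.
    """
    tokens = line.replace(',', ' ').split()
    cls, inst_id = tokens[0], tokens[1]
    if value_spacing:
        features = tokens[2:]
    else:
        features = list(tokens[2])
    return cls, inst_id, features
```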



In SUBUNITS format, the input data consists of a
sequence of sub-units from which training and test instances
are constructed. (The term SUBUNITS comes from the term
given to a protein subsection.) A subunit is defined by a
set of values, each of which is associated with a particular
class:

    BEGIN subunit-id
    feature1 class1
    feature2 class2
    feature3 class3
    .
    .
    END

The protein data used to predict secondary structure
consists of a collection of protein subunits. Here is what
a typical protein segment might look like:




    BEGIN 1
    G C
    V C
    P C
    S E
    F E
    T C
    G C
    A H
    F H
    D H
    Q H
    S C
    N C
    END




Normally, the protein segments are much longer. The feature
values correspond to particular amino acids. The class
value defines the secondary structure of the protein at that
location in the segment, where 'H' = alpha helix, 'E' = beta
sheet, and 'C' = coil. A collection of instances is
constructed by scanning a fixed-length window along the
entire length of the segment. The features that fall within
the window at a particular location define the feature
values for that instance. The class of the central feature
defines the class of the entire instance. Using a window
length of five, for example, the above protein segment is
converted into the following collection of instances:

    C, 1, * * G V P
    C, 2, * G V P S
    C, 3, G V P S F
    E, 4, V P S F T
    E, 5, P S F T G
    C, 6, S F T G A
    C, 7, F T G A F
    .
    .
    etc.

The asterisks, '*', shown above correspond to NULL feature
values where the scanning window reaches beyond the borders
of the protein segment.
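The windowing scheme above can be sketched in a few lines of Python. This is an illustration of the conversion, with names of our own choosing, not the PEBLS source.

```python
def subunit_to_instances(residues, window=5):
    """Convert one subunit -- a list of (feature, class) pairs --
    into windowed instances, padding with '*' where the window
    reaches past the borders of the segment.
    """
    half = window // 2
    padded = ['*'] * half + [f for f, _ in residues] + ['*'] * half
    instances = []
    for i, (_, cls) in enumerate(residues):
        # The class of the central feature labels the whole window.
        instances.append((cls, i + 1, padded[i:i + window]))
    return instances
```

Applied to the first five residues of the segment above, the first instance comes out as `('C', 1, ['*', '*', 'G', 'V', 'P'])`, matching the table.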



============================================================

Number of Classes

Syntax:  classes = <integer>
Example: classes = 2
Default: None. Must be provided.

This parameter tells the system how many different classes
there are in the input data. In the DNA promoter sequence
data, there are only two classes, + and -, corresponding to
positive and negative instances respectively. In the
protein segment data, there are three classes (alpha-helix,
beta-sheet, and coil).



============================================================

Class Names

Syntax:  class_names = name1 name2 name3 ...
Example: class_names = + -
Default: None. Must be provided.

This identifies the symbol used to represent each class.
The class name may be a single character or a word. The
class names may be listed in any order, but all class names
must be listed, and each name is delineated by a space.



============================================================

Number of features

Syntax:  features = <integer>
Example: features = 57
Default: None. Must be provided.

This parameter indicates how many features there are per
instance. In SUBUNITS mode, this parameter defines the
window length used to construct instances as explained
above.


============================================================

Value Spacing

Syntax:  value_spacing = ON or OFF
Example: value_spacing = OFF
Default: ON

This tells the system whether there is a space separating
the list of features in the data file when STANDARD mode is
employed. It is possible to eliminate spaces between
feature values if and only if each feature value is a single
character long. The DNA data is an example of this:

    +, S10, tactagcaatacgcttg....

This instance could be written:

    +, S10, t a c t a g c a a t a c g c t t g ...

but this is less space efficient. When feature values are
words rather than single characters, value spacing must be
left ON, and feature values must be separated by a blank
space. In SUBUNITS mode, value spacing is irrelevant
because each value-class pair is placed on a separate line.



============================================================

Feature Values

Syntax:  feature_values <N> = value1 value2 ...
Example: feature_values 1 = a c g t

The feature_values parameter is used to indicate the
possible symbolic values associated with each feature. Each
feature may have different values, and a different number of
values. In the above syntax, "N" refers to the number of
the feature, starting at 1. If N exceeds the total number
of features specified previously, an error condition occurs.
Each symbolic value must be separated by a space, even if
they are single-character names. (If you were to write:
feature_values 1 = acgt, the system would think that feature
1 had only one possible value: "acgt".)



============================================================

Common Values

Syntax:  common_values = value1 value2 value3 ...
Example: common_values = a c g t
Default: None. This parameter must be specified if
         individual feature values are not provided.

In some cases, all features have the same set of values.
The DNA promoter sequence data is an example. It would be
inconvenient to write:

    feature_values 1 = a c g t
    feature_values 2 = a c g t
    feature_values 3 = a c g t
    .
    .
    feature_values 57 = a c g t

Instead, PEBLS allows you to indicate that, in fact, all
features have the same set of possible values. Thus, the
simple statement,

    common_values = a c g t

is sufficient.



B. Train and Test Parameters
----------------------------

============================================================

Training Mode

Syntax:  training_mode = SUBSET or
                         SPECIFIED_GROUP or
                         LEAVE_ONE_OUT
Example: training_mode = LEAVE_ONE_OUT
Default: None. This parameter must be provided.

PEBLS supports a variety of techniques for defining the
training and test instances of a particular set of data.
There are three basic options.

SUBSET trains on a random subset of the instances
contained across the entire data set. The size of this
subset is determined by the parameter training_size (see
below).




SPECIFIED_GROUP is used when one has a specific
predefined training set and test set. This requires the use
of two additional parameters in the instance data file,
"TRAIN" and "TEST". These parameters define the training
set and test set respectively. Thus, an input file will be
formatted as follows:

    # Header information and other comments

    TRAIN
    training_instance 1
    training_instance 2
    .
    .
    training_instance N

    TEST
    test_instance 1
    test_instance 2
    .
    .

Note that when the data consists of a collection of sub-
units (SUBUNITS data format), the TRAIN and TEST parameters
separate sub-units, not actual instances. This means that
all instances of a particular protein segment are either
training instances or test instances. It should also be
emphasized that the TRAIN and TEST parameters should only
be used in conjunction with the SPECIFIED_GROUP training
mode, and that in such cases, training instances should be
listed first and together.

By specifying a training size (see the training_size
parameter below) it is also possible to train on a random
subset of instances taken only from those instances or
subunits specified in the TRAIN block of instances.




LEAVE_ONE_OUT is a special training mode used in
conjunction with the DNA promoter sequence data. The idea
is to test each instance by first training on all other
instances in the data set. This method can be prohibitively
expensive when the data set contains a large number of
instances. Note that this method is equivalent to n-fold
cross-validation, where n is the size of the entire data set.
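The leave-one-out loop can be sketched as follows, with `classify` standing in for any classifier (such as the nearest-neighbor rule). A sketch only, not the PEBLS source.

```python
def leave_one_out(instances, classify):
    """LEAVE_ONE_OUT evaluation: classify each instance after
    training on all the others, and return the fraction classified
    correctly.  `classify(train, features)` is any classifier.
    """
    correct = 0
    for i, (cls, features) in enumerate(instances):
        train = instances[:i] + instances[i + 1:]
        if classify(train, features) == cls:
            correct += 1
    return correct / len(instances)
```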



============================================================

Training Size

Syntax:  training_size = <real>
         in the range (0.00 - 1.00]
Example: training_size = 0.50
Default: 1.00

In SUBSET training mode, the training_size parameter is
used to define the size of the training set. For example, a
value of 0.50 indicates that the system is to train on half
of the instances across the entire data set. When used in
conjunction with the SPECIFIED_GROUP training mode, the
training size refers to a percentage of the training set
only. Thus, if there were 10 training instances and 30 test
instances, and a value of 0.50 were specified, the system
would train on 5 of 10 randomly selected training instances
before then proceeding to test the 30 specified test
instances.



============================================================

Nearest Neighbors

Syntax:  nearest_neighbor = <integer>
         where the integer is greater than or equal to one.
Example: nearest_neighbor = 5
Default: 1

PEBLS 2.0 can be made to classify by finding the single
nearest neighbor (nearest_neighbor = 1), or it can classify
based on the classes of multiple nearest neighbors. When
performing multiple nearest neighbor search, the chosen
class may be defined by simple majority vote, i.e., the most
common class among the nearest neighbors, or the voting may
be weighted with respect to the actual distance of each
neighbor (see Nearest Neighbor Voting below).



============================================================

Nearest Neighbor Voting Scheme

Syntax:   nearest_voting = MAJORITY or
                           WEIGHTED_DISTANCE or
                           THRESHOLD <class> <int> ...
Examples: nearest_voting = MAJORITY
          nearest_voting = THRESHOLD C 1 H 2 E 2
Default:  MAJORITY

This parameter defines how the located K nearest
neighbors are used to define the class of the test instance.
There are three choices. In MAJORITY mode, each neighbor
gets one vote. The test instance is assigned the same class
as the most frequently occurring class among the K nearest
neighbors. In cases of a tie, the class is chosen randomly
among the largest majorities. In WEIGHTED_DISTANCE mode,
each neighbor contributes a vote towards its particular
class inversely proportional to its distance. The farther
away a neighbor is, the less of a vote it gets. The
THRESHOLD method places a threshold condition on the
assignment of each class. Suppose we have the following:

    nearest_neighbors = 5
    nearest_voting = THRESHOLD C 1 H 2 E 2

This is equivalent to saying the following:

    If (# of neighbors with class = COIL) >= 1
        Class(Test-Instance) = COIL

    If (# of neighbors with class = ALPHA_HELIX) >= 2
        Class(Test-Instance) = ALPHA_HELIX

    If (# of neighbors with class = BETA_SHEET) >= 2
        Class(Test-Instance) = BETA_SHEET

Thus, if there were in fact five neighbors, two alpha helix,
one beta sheet, and two coil, then the test instance would
initially be assigned the class COIL, but would then be
reassigned the class ALPHA_HELIX because the helix condition
has higher precedence (as in the pseudo code provided
above). It is important to note that the ordering in which
these threshold conditions are listed defines the precedence
itself.
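The three voting schemes might be sketched as below. This is an illustration, not the PEBLS source; in particular, PEBLS breaks MAJORITY ties randomly, while this sketch simply keeps the first class counted.

```python
from collections import Counter

def vote(neighbors, scheme='MAJORITY', thresholds=None):
    """Pick a class from the K nearest neighbors, given as
    (class, distance) pairs."""
    if scheme == 'MAJORITY':
        return Counter(cls for cls, _ in neighbors).most_common(1)[0][0]
    if scheme == 'WEIGHTED_DISTANCE':
        votes = Counter()
        for cls, dist in neighbors:
            votes[cls] += 1.0 / max(dist, 1e-12)  # closer, bigger vote
        return votes.most_common(1)[0][0]
    if scheme == 'THRESHOLD':
        counts = Counter(cls for cls, _ in neighbors)
        chosen = None
        # Conditions are applied in listed order; a later satisfied
        # condition overrides an earlier one, as in the guide's example.
        for cls, minimum in thresholds:
            if counts[cls] >= minimum:
                chosen = cls
        return chosen
```

With the five neighbors from the example above (two H, one E, two C), `vote(nbrs, 'THRESHOLD', [('C', 1), ('H', 2), ('E', 2)])` yields 'H'.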



============================================================

Exemplar weighting

Syntax:  exemplar_weighting = OFF or
                              USED_CORRECT or
                              ONE_PASS or
                              INCREMENT
Example: exemplar_weighting = USED_CORRECT
Default: OFF

If this parameter is set OFF, then exemplars are
unweighted. As indicated above, this is the default value.
Otherwise, the PEBLS system supports several unique
strategies for weighting exemplars.



ONE_PASS is an exemplar weighting method used in
conjunction with the nearest_neighbor parameter. Each
exemplar (trained instance) is tested in turn, and the k
nearest neighbors are found from among the remaining
training set. If j neighbors have a matching class, then
the weight assigned to the current instance is:

    weight = 1 + k - j

For example, suppose that we have set the
nearest_neighbors parameter equal to five. The exemplar
weights will range from 1 to 6 depending on how many nearest
neighbors have a matching class. Remember that (counter to
our intuitions about gravity) a high weight makes the
exemplar instance less attractive to test instances because
exemplar weights are distance multipliers.


It is important to note that when finding the nearest
neighbors during the determination of exemplar weights, the
weights of all exemplars are assumed to be 1.00, even if a
weight has already been assigned. This ensures that the
specific weights assigned to each exemplar are independent
of the assignment order. It is only when processing the
test set that exemplar weights come into play.
test set that exemplar weights come into play.



The USED_CORRECT method is an alternative weighting
technique that assigns each exemplar a weight in accordance
with its performance history, where the weight is defined by
two integer parameters, the number of times the exemplar is
used, and the number of times it is used correctly:

    weight(i) = used(i) / correct(i)

For example, suppose we are assigning a weight to
exemplar j. We determine that its single nearest neighbor
is exemplar i. Thus used(i) is incremented by one. If
exemplar i has the same class as exemplar j, then correct(i)
is incremented as well, because it has been used correctly.
Finally, we set the used and correct values of our current
exemplar equal to the used and correct values of its
(current) nearest neighbor. These values may change as it
too is used while remaining exemplars are assigned weights
of their own.



It is important to note that under the USED_CORRECT
methodology, the final weight assignments will depend on the
order in which the actual assignments take place. PEBLS
chooses exemplars at random until all training instances
have been assigned a weight. Over multiple trials (see
below) individual exemplar weights may change.




In the INCREMENT method, all exemplars are initially
assigned a value of 1.00. Then, the single nearest neighbor
of each training instance is determined. The distance is
computed assuming that all exemplars remain unweighted
(equivalent to a weight of 1.00). Whenever the nearest
neighbor has a different class, the weight of the nearest
neighbor is incremented by one. An exemplar's weight is
thus a count of the number of times that it was used (i.e.,
identified as a nearest neighbor) minus the number of times
that it had the correct class. Thus, if an exemplar is
never used during the weighting process, or if it is used
and always has the correct class, its final weight will be
1.00. Otherwise, it will be greater than 1.00.
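The INCREMENT procedure can be sketched in a few lines. `nearest_index(exemplars, i)` is a hypothetical helper standing in for the unweighted nearest-neighbor search; a sketch only, not the PEBLS source.

```python
def increment_weights(exemplars, nearest_index):
    """INCREMENT exemplar weighting: start every weight at 1.00,
    find each training instance's single (unweighted) nearest
    neighbor, and add one to that neighbor's weight whenever its
    class differs.  `nearest_index(exemplars, i)` returns the index
    of instance i's nearest neighbor.
    """
    weights = [1.0] * len(exemplars)
    for i, (cls, _) in enumerate(exemplars):
        n = nearest_index(exemplars, i)
        if exemplars[n][0] != cls:
            weights[n] += 1.0
    return weights
```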




============================================================

Feature Weighting

Syntax:  feature_weighting = OFF or
                             TRIANGLE or
                             GENETIC <integer-count> <real-adj>

         OR:

Syntax:  feature_weight <integer> = <real>
         (individual feature weight assignment)

Default: OFF



Individual features may be assigned weights as part of
the value difference metric (see Section 3). For example, in
protein sequence prediction, it is common to assign a
greater weight to features nearer to the central residue.
PEBLS supports three basic approaches to assigning feature
weights. One method is to assign weights using a predefined
shape. "OFF" corresponds to a flat curve where all features
are assigned a weight of 1.00. A TRIANGLE feature
weight shape gives more weight to features closer to the
center. These predefined shapes are diagrammed below:




    OFF

    1.00 |* * * * * * * * * * * * * * *
         |
         |
         |
         |
    0.00 |------------------------------
         0           features


    TRIANGLE

    1.00 |              **
         |           **    **
         |        **          **
         |     **                **
         |  **                      **
         |*                            *
    0.00 |------------------------------
         0           features
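A triangle-shaped weight curve might be generated as sketched below. The exact endpoint values used by PEBLS may differ; this only illustrates the shape diagrammed above.

```python
def triangle_weights(n):
    """Generate a TRIANGLE feature-weight curve for n features:
    1.0 at the central feature, falling off linearly toward the
    two ends.
    """
    center = (n - 1) / 2.0
    return [1.0 - abs(i - center) / (center + 1.0) for i in range(n)]
```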







The second alternative is to assign feature weights
individually. This method gives the user complete
flexibility over the feature weights, but is somewhat less
convenient than using one of PEBLS' pre-defined shapes. We
will outline in Section 7 how additional shapes may be
incorporated into the system if desired. Thus, the command:

    feature_weight 16 = 2.00

assigns a weight of 2.00 to feature 16. Feature numbers go
from 1 to N, where N is the total number of features per
instance.




The third and final alternative is to allow the system
to set feature weights for you. The GENETIC method is so
called because it uses a technique suggestive of a genetic
algorithm. The idea is to tweak a random feature weight by
a random amount in the range -adj...+adj, where "adj" is the
floating point value specified as part of this option. A
random training instance is then selected, and its nearest
neighbor determined. If the neighbor's class is identical
to the chosen training instance, then the adjustment is
accepted. Otherwise, the adjustment is rejected. We do
this <integer> times. This method is highly experimental.
It has been found to improve performance in a few limited
experiments.
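The GENETIC adjustment loop can be sketched as follows, with `trial_ok(weights)` standing in for the accept test (does a random training instance's nearest neighbor then have the matching class?). A sketch with hypothetical hooks, not the PEBLS source.

```python
import random

def genetic_feature_weights(f, count, adj, trial_ok):
    """Tweak one random feature weight by a random amount in
    [-adj, +adj], `count` times, keeping each change only if the
    accept test `trial_ok(weights)` passes.
    """
    weights = list(f)
    for _ in range(count):
        i = random.randrange(len(weights))
        old = weights[i]
        weights[i] += random.uniform(-adj, adj)
        if not trial_ok(weights):
            weights[i] = old      # reject the adjustment
    return weights
```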



============================================================

Defining Multiple Trials

Syntax:  trials = <integer>
         where the number is greater than or equal to one.
Example: trials = 10
Default: 1

In some cases, the results will vary over multiple
trials due to the built-in randomness of certain features,
such as USED_CORRECT exemplar weighting, in which the
training order is random, or SUBSET training mode, in which
the training instances are chosen at random. Under these
circumstances the user may wish to perform multiple trials.

Under certain output modes (see below) the results of
each trial will be provided. In other cases, only the
averages over all trials are reported.



============================================================


Post-processing

Syntax:  post_processing = OFF or
                           PROTEIN_STANDARD or
                           PROTEIN_SMOOTH <integer> or
                           PROTEIN_SMOOTH_ONLY <integer>

Example: post_processing = PROTEIN_STANDARD

Default: OFF



     A post-processing routine is a function which changes
the PEBLS classification in accordance with some well-
defined set of rules.  For example, it is known that the
minimum length of a protein alpha-helix is 2 residues (amino
acids), while a protein beta-sheet must be at least 4
residues long.  One standard classification technique,
therefore, is to simply change all alpha-helix and beta-
sheet chains to coil when they do not meet these minimum
length restrictions.  We call this technique
PROTEIN_STANDARD.

     An alternative technique is to first smooth the
classifications by scanning a window of some specified
length along the complete set of classified test instances,
changing the class of the central residue according to the
majority classification within the window.  Consider the
following hypothetical example:



        Amino    True    PEBLS
        Acid     Class   Class
        -----    -----   -----
          I        C       C
          P        C       E
          S        C       C
          S        C       C
          E        C       C
          P        H       H
          L        H       C
          L        H       H
          D        H       H
          F        C       C
          N        C       C
          N        C       C
          Y        C       C



     In this example, PEBLS has imperfectly classified the
protein segment.  The chain of four amino acids which form
an alpha-helix is instead classified as two alpha-helix
chains, one of length 1 and one of length 2.  Also note
that PEBLS has generated a false beta-sheet.  If we employ
the standard technique, PEBLS will correctly eliminate the
spurious beta-sheet, but will unfortunately eliminate all
three alpha-helix classifications because each of the two
separate chains is below the minimum length requirement.

     If we employ the smoothing technique, using a window
of length 3, the missing alpha-helix will be properly
assigned, while the lone beta-sheet classification will
still be eliminated.  After this smoothing is performed, we
reprocess the test instances using the PROTEIN_STANDARD
technique.



     This smoothing technique is not perfect.  In some cases
this type of window scan may produce additional
classification errors, but overall, this method has been
found to improve the performance of PEBLS when applied to
protein data.

     As the name suggests, the PROTEIN_SMOOTH_ONLY method
smooths the data only.  The classifications are not then
subjected to the PROTEIN_STANDARD method of processing.

     Many other post-processing techniques are possible.
PEBLS has been specifically designed to allow for the
incorporation of further techniques if so desired.
Guidelines for adding post-processing techniques are
provided in Section 7.



============================================================


Generating Output

Syntax:  output_mode = COMPLETE or
                       DETAILED or
                       AVERAGES_ONLY

Example: output_mode = DETAILED

Default: AVERAGES_ONLY


     This parameter is used to specify the level of detail
provided when generating output results.  Output always
includes the contents of the .PCF file for purposes of
documentation.  All output is directed to the screen but may
be redirected to a file by use of standard UNIX shell
commands, e.g.,

        $ pebls protein.pcf > results.out &




     PEBLS assigns a class to each test instance.  This
classification is compared with the instance's true class.
PEBLS then generates a table showing its performance record
over each class.  Here is what such a table looks like for a
typical test conducted on protein data.



                      AVERAGES (1 TRIALS)

                                 PERCENT  CORRECT  FALSE  CORREL
 CLASS   TOTAL    HITS   MISSES  CORRECT  REJECTS  POSTV   COEFF
 NAME        #     (P)      (U)      (%)      (N)    (O)     (C)
 ------ ------  ------   ------  -------  -------  -----  ------
 H       776.0   401.0    375.0     0.52   1411.0  267.0    0.37
 E       517.0   165.0    352.0     0.32   1647.0  113.0    0.33
 C      1487.0  1246.0    241.0     0.84    566.0  588.0    0.35
 ------ ------  ------   ------  -------  -------  -----  ------
 TOTAL  2780.0  1812.0    968.0     0.65   3624.0  968.0    0.44





     PEBLS keeps track of its performance over every trial
and averages the results.  In AVERAGES_ONLY mode, only the
averages are displayed.  In DETAILED mode, this table is
provided for every single trial.  One uses COMPLETE to show
how every instance is actually classified.  If post-
processing is invoked, PEBLS will show how the instance was
classified before and after processing.



============================================================


Debugging


Syntax: debug = ON or OFF

Example: debug = ON

Default: OFF




     This parameter sets a flag which can be used by those
wishing to modify the source code.  In the following code
segment, debugging information is output if and only if the
debug flag is turned on in the PCF.

        if (CONFIG.debug == ON)
        {
            <Print Debugging Info>
        }

The debug flag thus provides a simple method for turning on
and off all debugging information.







SECTION 5. ERROR MESSAGES


(Listed in alphabetical order.)


===============================







Data file <filename> does not exist.



The data file specified in the PEBLS configuration file


does not exist.





Feature value index is illegal.

     The feature value index specified in the command:

          feature_values <index> = {Symbol List}

     is either less than or equal to 0, or it exceeds the total
     number of features declared by the command:

          features = <N>

     The feature value index must be a value 1..N, where N is
     the total number of features.





Feature weight index is illegal.

     The feature weight index specified in the command:

          feature_weight <index> = <real>

     is either less than or equal to 0, or it exceeds the total
     number of features declared by the command:

          features = <N>

     The feature weight index must be a value 1..N, where N is
     the total number of features.




GENETIC feature weighting is only for use with
SPECIFIED_GROUP training mode.

     Self-explanatory.  Please refer to Section 4 for a
     complete discussion of this feature weighting
     technique.





Illegal Training Size

     The training size value must be in the range
     (0.00..1.00].

     (For more information on the use of this parameter, see
     Section 4.)




Nearest neighbor parameter must be greater than 1

     Self-explanatory.  See Section 4 for a complete
     discussion of the nearest neighbor parameters.





No training instances specified in <datafile.dat>

     In SPECIFIED_GROUP training mode, the training and test
     instances must be explicitly identified using the
     "TRAIN" and "TEST" keywords.  For more information on
     training modes, please refer to Section 4.





Number of classes exceeds CLASSES_MAX (modify config.h)

     The file config.h contains a defined constant
     (CLASSES_MAX) used to specify the maximum number of
     classes that may be specified in a given configuration
     file.  You can support a larger number of classes
     simply by changing this constant.




Number of features exceeds FEATURES_MAX (modify config.h)

     The file config.h contains a defined constant
     (FEATURES_MAX) used to specify the maximum number of
     features per instance that may be specified in a given
     configuration file.  You can support a larger number of
     features per instance simply by changing this constant.




Number of values per feature exceeds VALUES_MAX (modify config.h)

     The file config.h contains a defined constant
     (VALUES_MAX) used to specify the maximum number of
     symbolic values for any feature that may be specified in
     a given configuration file.  You can support a larger
     number of values per feature simply by changing this
     constant.





Number of instances exceeds INSTANCES_MAX (modify config.h)

     The file config.h contains a defined constant
     (INSTANCES_MAX) used to specify the maximum total
     number of training and test instances.  If PEBLS reads
     a data file containing too many instances, this error
     is generated.  You can support a larger number of
     instances simply by changing this constant.





Number of nearest neighbors exceeds K_NEIGHBOR_MAX
(modify config.h)

     The file config.h contains a defined constant
     (K_NEIGHBOR_MAX) used to specify the maximum number of
     nearest neighbors that may be specified in the
     configuration file.  You can allow the system to find
     more neighbors simply by changing this constant.





Trial size exceeds TRIALS_MAX (modify config.h)

     The file config.h contains a defined constant
     (TRIALS_MAX) used to specify the maximum number of
     trials that may be specified in a given configuration
     file.  You can support a larger number of trials simply
     by changing this constant.





Unknown constant in configuration file.

     An unknown constant was encountered while processing
     the configuration file.  This is most likely due to a
     spelling mistake.





Unknown exemplar weighting technique

     Please refer to Section 4 for a list of the legal
     exemplar weighting techniques.





Unknown feature weighting technique

     Please refer to Section 4 for a list of the legal
     feature weighting techniques.


Unknown post-processing technique

     Please refer to Section 4 for a list of the legal
     post-processing techniques.




Undeclared feature value encountered in instance data

     All possible feature values must be declared in the
     configuration file.  Section 4 explains how to do this.


Undeclared class encountered in instance data

     All possible class assignments must be declared in the
     configuration file.  Section 4 explains how to do this.




USAGE: pebls <filename.pcf>

     This message occurs if the user fails to specify an
     input PEBLS configuration file when executing the pebls
     program.  The name of an existing configuration file
     must be specified.  Note that if you actually say, for
     example:

          pebls protein.pcf

     the output will be directed to the screen, and the
     process will be run in the foreground.  To redirect the
     results to a file while running PEBLS in the
     background, as is recommended, the following syntax is
     used:

          pebls protein.pcf > protein.out &

     If you want your output to contain a time and date
     stamp:

          (date; pebls protein.pcf) > protein.out &









SECTION 6. AN OVERVIEW OF THE SOURCE CODE


==========================================



     PEBLS 2.0 is written in ANSI C.  Every attempt has been
made to keep the system modular so that modifications will
be relatively painless.  Users wishing to alter the source
code should read both this section and the specific
suggestions provided in Section 7.

     The source code contains the following files:



config.h



     This file contains a small number of defined constants
used to configure the system with respect to the size of
statically allocated arrays.  Certain parameter values, such
as the number of declared classes or the number of features
as declared in a PEBLS configuration file (.PCF), must not
exceed the maximum values defined in config.h.  For example,
if config.h contains the line:

        #define FEATURES_MAX 100

then the declared number of features per instance in the PCF
may be less than or equal to 100, but not greater.
Otherwise, an error will occur telling you to modify
config.h.  (See Section 5.)  The constants defined in
config.h are:



     INSTANCES_MAX:
          The maximum number of total training and test
          instances.

     CLASSES_MAX:
          The maximum number of classes.

     FEATURES_MAX:
          The maximum number of features per instance.

     SUBUNIT_LENGTH_MAX:
          The maximum length of any subunit (SUBUNITS format
          only).

     VALUES_MAX:
          The maximum number of values for any feature in an
          instance.

     TRIALS_MAX:
          The maximum number of trials per test run.

     K_NEIGHBOR_MAX:
          The maximum number of neighbors that will be
          found.

     ID_LENGTH_MAX:
          The maximum length of an instance or sub-unit id.

     HASH_SIZE:
          Used to configure the hash table containing
          symbolic names.

     LINE_MAX:
          The maximum line length for data or configuration
          files.







pebls.h



     This file contains predefined keyword constants, type
declarations, and function prototypes.  Of particular
interest to those wishing to make source code changes is the
C structure:

        config_type CONFIG;

CONFIG is a global structure used to contain information
extracted from the PEBLS configuration file.  Some field
values, such as the number of instances, are determined by
reading the actual data input file.  A program statement
that checks the declared method of exemplar weighting would
be written, for example:

        if (CONFIG.exemplar_weighting == ONE_PASS)
        {
            <Specific code here>
        }



pebls.c



     Pebls.c contains main(), the core nearest neighbor
routines, and the training and test functions that are
responsible for invoking initialization, reading, and
processing functions contained in other source-code files.

init.c

     Initialization routines.  Clears internal arrays, and
processes the PEBLS configuration file, including checks for
configuration errors.

readers.c

     Contains functions for reading data input files for all
supported formats.  All training and test instances are read
into the array data declared globally in pebls.c:

        instance_type data[INSTANCES_MAX];





metric.c



     Includes core functions for building distance tables
and computing distance metrics.  The modified value distance
metric (MVDM) is contained in this file.


weights.c

     Individual procedures for assigning weights to
exemplars and features.

symtab.c

     This file contains routines for storing symbolic
feature value names in a hash table for fast retrieval.
Every feature value is assigned an integer which is used for
indexing into the distance tables during the computation of
distances.  The size of the hash table may be configured in
config.h.


util.c

     Basic utility routines, including functions for random
number generation and error reporting.  This file also
contains domain-specific post-processing routines designed
to improve PEBLS classification performance in certain
domains.  Currently, the post-processing routines provided
are specifically for secondary protein structure prediction.

output.c

     Routines for the generation of PEBLS output.






SECTION 7. NOTES ON MODIFYING PEBLS
====================================



     You are welcome to modify PEBLS 2.0 to suit your needs.  Please
let us know what changes or improvements you have made so that we can
consider including them in future versions of the program.  Please
do not redistribute PEBLS if you in fact decide to make any changes.
If you give PEBLS to others, please make sure you give them the
complete file set, including the Machine Learning journal paper.

     This section offers a few details about how to add additional
features to PEBLS 2.0.  Such features would include added exemplar
weighting techniques, feature weighting techniques, or post-processing
routines.  We have set aside a number of predefined constants to make
the task easier, but after reading this section, programmers will wish
to scrutinize the encoded functions of currently supported methods.



Adding exemplar weighting methods.

----------------------------------



     Three defined constants have been set aside for
creating additional exemplar weighting methods:

        USER_EWEIGHT_1
        USER_EWEIGHT_2
        USER_EWEIGHT_3

For example, in the PEBLS configuration file, you would say:

        exemplar_weighting = USER_EWEIGHT_2

These constants correspond to dummy procedures declared in
the file weights.c:

        user_eweight_1();
        user_eweight_2();
        user_eweight_3();

These routines must assign floating point values to each
training instance by referencing the array of instances:

        data[i].weight = <floating_point_value>;



Adding feature weighting methods.

---------------------------------



     Adding feature weighting methods is similar.  There are
three predefined constants for user-defined feature
weighting methods:

        USER_FWEIGHT_1
        USER_FWEIGHT_2
        USER_FWEIGHT_3

Thus, your PEBLS configuration file would include a line
such as:

        feature_weighting = USER_FWEIGHT_1

These constants correspond to dummy procedures also declared
in the file weights.c:

        user_fweight_1();
        user_fweight_2();
        user_fweight_3();

These routines must assign floating point values to the
array of feature weights contained within the global CONFIG
structure.  For example, you might say:

        user_fweight_1()
        {
            int features = CONFIG.features;
            int x;

            for (x = 0; x < features; x++)
                CONFIG.feature_weights[x] = YOUR_FUNCTION(x);
        }



Adding post-processing techniques.
----------------------------------

     Every instance in the data array contains three
classification fields.

     data[i].class_true     (The true class of the instance)
     data[i].class_nearest  (The class assigned by PEBLS)
     data[i].class_pp       (The post-processing class assigned
                             by some domain-specific method.)

     Normally, the post-processing class will use
information about how PEBLS classified the instance;
otherwise, there would be no reason to use PEBLS in the
first place!  When the data consists of sub-units (e.g.,
protein sub-units) then the post-processing scheme may
depend on how PEBLS classified neighboring instances at
data[i-1] or data[i+1].  See Section 4 for details on how
current post-processing techniques work.  Programmers should
also examine currently supported post-processing functions
for additional assistance.

     In PEBLS, output is based on data[i].class_nearest
unless a post-processing technique is specified
(CONFIG.post_processing).  Then it is based on the values
contained in the class_pp field.



     For your convenience, we have set aside the following
predefined constants and functions (in util.c):

        USER_POSTPROC_1
        USER_POSTPROC_2
        USER_POSTPROC_3

        user_postproc_1();
        user_postproc_2();
        user_postproc_3();