Demo Notes - Intelligensys

Nov 7, 2013

© 2012 Intelligensys Ltd - all rights reserved
1. Introduction
Four examples are illustrated in this INForm Demonstrator. The Demonstrator is based on
INForm version 5.
The four examples have continuous numeric property values for which models can be
developed using either Neural Networks (NN) or Gene Expression Programming (GEP).
All these examples use literature data. Two of the examples relate to pharmaceuticals - a
tablet formulation, and a controlled release tablet formulation. The third is for a hot melt
adhesive, and the fourth is an automotive clearcoat. The notes below (Sections 7 to 10)
attempt to explain what the formulator is trying to achieve in each case, so that you can try to
relate these to your own formulation problem.
Although on-line Help is available throughout the Demonstrator, the following guide gives a
few more tips and pointers about the specific examples given in the Demonstrator, including
things which you might want to try for yourself. For more information, though, or for full
explanations, please check out the Help.
2. Loading the Examples
When you launch the Demonstrator, you will see an Introduction screen with buttons that
include Create New Task and Open Existing Task. These are not operational in the
Demonstrator. In the full version of INForm, they allow you to proceed directly to the task that
you want to undertake.
The New, Open, Save, Save As and Recent Tasks options on the Task menu are also not
operational in the demo version. Where a feature is not available in this Demonstrator, you
will be given the "Feature not enabled in INForm Demonstrator" message.
To load an example in the Demonstrator, select Task | Load Demo (the first menu item under
the Task pull-down menu) and choose the example you want to look at from the picklist
provided. You will then be taken immediately to the Set Field Types screen for that example,
with the appropriate choices of Ingredient, Processing Condition and Property already made.
In the first instance, it is probably sensible to leave the selections as they are. However,
subsequently you might want to see what happens if you leave out some of the variables,
using the Not Used option.
If you want to go back to look at the original data, select Model | Enter/Edit Data, or click
on the Previous button. You cannot edit this data in the Demonstrator.
When you click on Next at any stage, you will be moved forward automatically to the next
step. So, clicking on Next on the Set Field Types screen will take you to the Data Analysis
screen.
3. Data Analysis
This step is omitted by many users, who simply click on the Next button to move
on to the Training screen. However, it does offer an opportunity to examine the data prior to
model development, should you wish to do so.
The Data Analysis screen looks a lot like the Enter/Edit Data screen. However, here it is not
possible to edit the data - even in the full program, if you try to replace the information in one
of the spreadsheet cells, you will see it is not possible.
You can see from the buttons across the top of the Data Analysis screen that there are
various options which can be used to analyse the data. First, let's look at the Graph capability.
Press the Graph button to get the Data Graph Options window. Pick one of the items for the
x-axis, and another for the y-axis, and press OK. The Data Graph Options window will also
give you additional options, which you might want to explore. You cannot plot classified data
using this option.
To see what happens if you want to look at Statistics, select (i.e. highlight) all the data. To do
this, hold the mouse down while you sweep across the grey fields at the top of each column,
from A to H. Then, press the Statistics button. You will see that a new sheet has been
generated (denoted by a new tab, labelled Statistics#1, at the top of the spreadsheet) that
contains the statistical data. Statistics can only be obtained for columns containing numeric
data.
The Analyze button opens the Analysis Options window. This step is optional. It is used to
check the data for outliers, either by statistical criteria (specifying the number of standard
deviations from the mean) or by specifying ranges outside which a value is treated as an
outlier.
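The statistical criterion can be sketched in a few lines of Python. This is an illustration only, not INForm's own implementation: the function name `flag_outliers`, the `n_std` parameter and the hardness values are all made up for the example.

```python
from statistics import mean, stdev


def flag_outliers(values, n_std=2.0):
    """Return the indices of values lying more than n_std sample standard
    deviations away from the mean (hypothetical helper, not an INForm call)."""
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []  # all values identical, so nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - centre) > n_std * spread]


# Made-up hardness column with one suspicious entry at the end.
hardness = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 9.7]
print(flag_outliers(hardness, n_std=2.0))
```

With so few records a single extreme value also inflates the standard deviation, so a threshold of 3 standard deviations would miss the last entry here; that is one reason to look at range checks as well.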
The Preprocess button opens the Preprocess Options screen. Here you can repair or
compress your data using hierarchical clustering, use Principal Feature Analysis to analyze
your data for input correlations, or rebalance your data.
If the data is modified using either of these options then it can be restored to its original form
using the Restore button.
Click on the Next button when you are ready to move on to develop the models, a process
called Training.
4. Training the Model
The Training screen has options for you to set up Training parameters, or to work in an
'interactive' mode that lets you change parameters for different properties. However, quite a
bit of effort has been made to ensure that the default parameters will develop reasonably
good models, provided the data points are reliable and accurate. So, you can simply press
the Train button in the first instance. By default a Neural Network (NN) model is created.
The Training display area shows the progress of the model development. The last two
columns display the R² value of the training and test data as a percentage. The R² value
normally lies between 0 and 100% and tells you how well the model has fitted the data.
The higher the R² value, the better the fit.
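For readers who want to check an R² figure by hand, the sketch below uses the conventional definition (one minus the ratio of the residual sum of squares to the total sum of squares), expressed as a percentage. These notes do not state exactly how INForm computes the value, so treat the formula and the name `r_squared_percent` as assumptions.

```python
def r_squared_percent(actual, predicted):
    """Conventional coefficient of determination, returned as a percentage."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are identical")
    return 100.0 * (1.0 - ss_res / ss_tot)


# A model that reproduces the data exactly scores 100%;
# a model that always predicts the mean scores 0%.
print(r_squared_percent([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(r_squared_percent([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```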
Once the model has trained, you will get a small window advising you that 'Training is
complete!'. Click on OK to dismiss this window. You will see that the View Results,
Output Scripts and Cost buttons, which were inactive prior to training, are now available.
If you want to investigate the effect of changing the parameters, for example changing the
model type to Gene Expression Programming (GEP), then press the Parameters button to
access the Model Training Parameters screen. Here you can set the Model Type as either
Neural Network (NN) or Gene Expression Programming (GEP). A GEP model takes much
longer to train than does a neural network model. In the examples discussed in Sections 7
to 9 we assume that the default Neural Network is used (Section 10 uses GEP); however, you
are free to experiment with the various training options.
When training, it is often useful to select some Test Data for validation. Typically this is
achieved by dividing the data into training and test data subsets. Since these literature
examples generally used designed experiments, it is not ideal that data points are removed
for validation. However, validation of the neural network is the best way to see how predictive
it will be.
To remove some data for validation, press the Test Data button to open the Set Test Data
screen, which displays a spreadsheet showing the complete data. Use the Options button at
the top of this spreadsheet to get the Test Data Options window. None will be selected by
default; we recommend that you try Smart Selection with 10% (the default) of the data
records kept for testing. Press OK on the Test Data Options window to complete the selection.
You can see which records have been selected as test data by choosing the Test Data
spreadsheet tab on the Set Test Data screen. Press OK to return to the Training window.
When you train your model you will now see that the test error and Test R² return non-zero
values.
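The idea of withholding a fraction of the records can be sketched as follows. Smart Selection itself is not described in these notes, so this example substitutes a plain random split; the function name, the truncating rule for the number of test records and the seed are assumptions for illustration.

```python
import random


def split_records(n_records, test_fraction=0.10, seed=0):
    """Randomly withhold a fraction of record indices as test data.
    A plain random split, NOT INForm's Smart Selection."""
    n_test = int(n_records * test_fraction)  # truncates: 10% of 24 records gives 2
    rng = random.Random(seed)
    test = sorted(rng.sample(range(n_records), n_test))
    train = [i for i in range(n_records) if i not in test]
    return train, test


train_idx, test_idx = split_records(24)
print(len(train_idx), len(test_idx))  # 22 2
```

Changing `seed` changes which records are withheld, which is a quick way to see how sensitive the test statistics are to the choice of test records on a small data set.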
5. Spreadsheet Results
Pressing the View Results button on the Training window will give the Training Results
window, which displays a number of tabbed spreadsheets. The one which displays on top
gives a summary of all the Models that are generated from the data. Depending on
whether the model was developed using Neural Network or GEP, the models will be displayed
in a different format.
Neural Networks develop black box models, which give little useful information on the
mathematical form of the model. For each neural network the Network Structure is displayed
in the format I(j) - HL(k) - HL(m) - O(n), where there are j inputs, k nodes in the first Hidden
Layer, m nodes in the second Hidden Layer, and n outputs (n always equals 1). Only one
hidden layer is used for most formulation problems.
GEP develops models which can be displayed as equations in terms of the ingredients,
processing conditions and possibly random number constants. An index of the ingredient
abbreviations used within the equation is listed above the equation itself.
After the Models sheet, the next most useful sheet is the one given by the Model
Statistics tab. If the model is developed for data with numeric properties, as is the case for
these examples, then this shows the ANOVA (Analysis of Variance) statistics that show how
well the models fit the training data. As mentioned above, the R² value gives an indication of
how reliable a model you have.
You might also want to look at the Training Data tab, since this recaps the training data but
also gives new columns with the values predicted by the model. It can be useful to select
Graph, and plot the actual vs predicted values for a property. Select one of the actual values
for the x axis, and the corresponding predicted value for the y axis, and make sure that Show
Linear Regression Fit Line is switched on (i.e. there is a tick in the tick-box). This will give you
a line, with the corresponding slope and intercept. Obviously in a perfect world, this would be
a straight line with slope = 1 and intercept = 0. In real life, the amount of scatter in the plot
gives a good idea of how accurately the model fits the initial data.
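The fit line on the actual vs predicted plot is an ordinary least-squares line. The sketch below shows how its slope and intercept are obtained; the function name `fit_line` and the numbers are made up, and INForm's own plotting code is not shown in these notes.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


actual = [4.0, 6.0, 8.0, 10.0]      # made-up measured values
predicted = [4.5, 5.8, 8.1, 9.6]    # made-up model predictions
slope, intercept = fit_line(actual, predicted)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A slope well below 1 together with a positive intercept usually means the model is pulling its predictions towards the middle of the data range.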
6. Consult Mode
The Model Consult screen is where you use the model, for 'what if' investigations and for
optimization.
There are many options on the Model Consult screen. We will discuss some of these for each
of the cases, below.
7. Tablet Formulation
The Tablet example is taken from Kesavan and Peck (Proc. 14th Pharm Tech Conference,
Barcelona, 1995). Chapter 6 of Intelligent Software for Product Formulation by Rowe and
Roberts (Taylor and Francis, 1998) also uses this as its illustrative example. Briefly, this is a
tablet formulation consisting of:
anhydrous caffeine (40% w/w) as a model active
dicalcium phosphate dihydrate (Ditab) or lactose (44.5-47.5% w/w) as a filler
polyvinylpyrrolidone (PVP) (2.0-5.0% w/w) as a binder
corn starch (10% w/w) as a disintegrant
magnesium stearate (0.5% w/w) as a lubricant.
Two types of granulation equipment - fluidized bed and high shear mixing - are used, and the
binder is added either dry or as a solution. The amount of caffeine and the percentage of
corn starch were held constant, so the five variables were:
Diluent (Ditab or lactose)
Diluent%
PVP%
Binder Addition (wet or dry)
Granulation Equipment (Fluidized Bed or High Shear Mixer)
Properties measured included tablet hardness, tablet friability, tablet thickness and
disintegration time. The aim of the optimization is to make hard tablets (which will be robust
and will not break up while you are carrying the bottle around in your pocket, for example) that
also disintegrate quickly (so that the drug can get to work right away).
Friability has some relationship with hardness, since typically hard tablets are not very friable.
Thickness, for the purpose of our study, is pretty unimportant.
Kesavan and Peck carried out a designed experiment, which means that the data were
determined at specified points. Therefore, withholding some of the data from training as test
data leaves some areas of design space under-represented. You will see the effect of this, in
that some of the models train poorly. However, we wanted to use readily accessible literature
data, despite this particular limitation for our purposes.
The variables Diluent, Binder-addn and Gran-equip are classified values, since they refer to
specific classes (e.g. Diluent is Lactose or Ditab).
To get you started, we suggest that you train the model using the default settings. After doing
so, press the Consult button to move onto the Model Consult screen. Press the Best Match
button to display the Best Match screen. In the Target column on the properties side of the
screen, fill out the values 10, 2, 2 and 240 for the values of Hardness, Friability, Thickness
and Disintegration time respectively. Now in the Match box on the right, press the Properties
button. You will see that the Found columns of the ingredients and properties are completed
with values that correspond to one of the known formulations. Best Match is simply a retrieval
function - it does not use the model but simply looks through existing data for the data record
that is closest to our requirements.
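A retrieval of this kind can be sketched as a nearest-record search. The notes do not say how Best Match measures closeness, so the range-scaled squared distance used here is an assumption, and the three records are invented for the example; only the target values 10, 2, 2 and 240 come from the text.

```python
def best_match(records, target):
    """Return the stored record whose properties lie closest to the target.
    Each property is scaled by its range so that no single unit dominates."""
    spans = {}
    for name in target:
        values = [r[name] for r in records]
        spans[name] = (max(values) - min(values)) or 1.0  # avoid dividing by zero

    def distance(record):
        return sum(((record[k] - target[k]) / spans[k]) ** 2 for k in target)

    return min(records, key=distance)


# Invented records; the real data set is the one loaded by the Demonstrator.
records = [
    {"id": 1, "Hardness": 6.0, "Friability": 2.5, "Thickness": 2.1, "Disintegration": 150},
    {"id": 2, "Hardness": 9.5, "Friability": 1.9, "Thickness": 2.0, "Disintegration": 260},
    {"id": 3, "Hardness": 12.0, "Friability": 1.2, "Thickness": 1.9, "Disintegration": 600},
]
target = {"Hardness": 10, "Friability": 2, "Thickness": 2, "Disintegration": 240}
print(best_match(records, target)["id"])
```

No model is involved at any point: the function only compares the target with records that already exist, which is the distinction the text draws between Best Match and Predict.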
Now, press the To Consult-> button to copy the values in the Found columns in Best Match
to the ingredient Value and property Actual columns of the Model Consult screen.
At this stage we can try a 'what if' experiment. Change the value of the Diluent% to 45, and
press the Predict button at the right hand side of the screen. The Predicted properties column
tells you what the model predicts for this ingredient combination.
The 3D graph can be used to show how one of the properties varies with two ingredients.
Make sure that there are sensible values (i.e. those lying within the data ranges) in the
ingredient Value column so that the 'hidden variables' are treated properly. Press the 3D
Graph button to open the 3D Graph Setup. Pick, for example, X and Y as Diluent% and
PVP% respectively, and select Z to be Hardness, and press OK to observe how the predicted
value of Hardness varies with respect to the values of Diluent% and PVP%.
The Optimize button will take you into the Optimizer Configuration screen, where you can set
up your objectives for an optimization session, and set up any constraints that you want. Here,
the original data had PVP% + Diluent% = 49.5, so you might want to use that. Optimization is
discussed in more detail in the next example.
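The figure of 49.5 follows from the fixed ingredients listed at the start of this section: 40% caffeine, 10% corn starch and 0.5% magnesium stearate leave 49.5% to be shared between the diluent and PVP. A small sketch of that arithmetic, with a hypothetical helper name:

```python
# Fixed ingredient levels (% w/w) taken from the formulation listed in this section.
FIXED_PERCENT = {"caffeine": 40.0, "corn starch": 10.0, "magnesium stearate": 0.5}


def diluent_percent(pvp_percent):
    """Diluent% implied by a given PVP% when the tablet must total 100% w/w."""
    if not 2.0 <= pvp_percent <= 5.0:
        raise ValueError("PVP% lies outside the 2.0-5.0 range of the original data")
    return 100.0 - sum(FIXED_PERCENT.values()) - pvp_percent


print(diluent_percent(2.0))  # 47.5
print(diluent_percent(5.0))  # 44.5
```

The two results reproduce the 44.5-47.5% w/w filler range quoted for the original data, so the equality constraint keeps the optimizer on compositions that could actually be made.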
8. In Vitro Release
The Invitro example looks at how you formulate a tablet with a specific release profile. Models
that relate in vitro profiles to in vivo release exist, so that getting the correct in vitro profile is a
key step in finding the correct formulation.
The data here are taken from information provided by A I Ware Inc., and were determined by
Y Chen and his colleagues. They published their study in the Journal of Controlled Release
59, 33-41 (1999), although the data are not actually given in the paper. Chen and colleagues
used 10 different formulation variables:
Amount of Polymer A in each tablet
Amount of Polymer B in each tablet
Amount of Dextrose
Amount of Lubricant
Tablet Weight
Drug/(Polymer+Drug) ratio
Polymer A/Polymer B ratio
Tablet hardness
Particle size
% Moisture
Several of these variables are dependent on other variables - e.g. Tablet Weight depends on
the amount of other ingredients added (since the amount of drug was a constant 9.6 g), and
obviously the Polymer A/Polymer B ratio depends directly on the amounts of Polymer A and
Polymer B. We chose to leave these dependent variables out of the model - you can see that
by noting that, when you load the data set, they are set as Not Used. If you want to try to
reproduce the Chen et al paper, you might want to set them to be ingredients.
The outputs are measured releases at different times. Clearly there is error in the
measurements, since some of the in vitro results suggest a release of over 100%.
When it comes to training, you will see that a lower R² is returned for the properties at longer
time scales. There are only 24 different experiments, and even for 7 input variables (the
number of truly independent variables in this problem) this represents relatively little data.
INForm builds the neural network models for these properties using a 2-node single hidden
layer. This very simple architecture is suggested so that the chances of over-training the
network are minimized. However, due to the designed nature of this data set it is difficult to get
good Test R² values for all the properties when 10% of the data is withheld for validation.
10% of the records is only 2 data points, and the results depend very much on which 2 data
records are used and on the random starting seed. So for this example we will not use any test
data.
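To see why such a small network is appropriate, it helps to count the adjustable parameters. The sketch below assumes a fully connected feed-forward network with one bias per node; that is a common arrangement, but the notes do not describe INForm's internal architecture, so the count is indicative only.

```python
def count_parameters(layer_sizes):
    """Weights plus biases in a fully connected feed-forward network.
    layer_sizes lists the node counts, e.g. [7, 2, 1] for I(7) - HL(2) - O(1)."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


print(count_parameters([7, 2, 1]))  # 19
print(count_parameters([7, 5, 1]))  # 46
```

Under these assumptions a 2-node hidden layer gives 19 parameters per property model, which is already close to the 24 experiments available, while a 5-node layer would have almost twice as many parameters as data records.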
After training the models press View Results and look at the Model Statistics spreadsheet on
the Training Results window to see how good the models are. A training R² value greater than
about 80% is desirable, provided it is supported by a reasonable F-ratio (greater than 4).
Close the Training window and move onto the Model Consult screen. Our goal now will be to
produce a formulation with a specific release profile. If tablet formulation is your field of
expertise, you will be able to specify a suitable profile. Otherwise, you could do as we did -
look at something where the release was 10% after 1 hour, 20% after 2, 30% after 4, 40%
after 6, 50% after 8, 60% after 10, 70% after 12, 80% after 16, 90% after 20 and 100% after
24 hours. If you have included the dependent inputs, e.g. the Polymer A/Polymer B ratio and
the Tablet Weight, you should set up constraints for the optimization to make sure that these
are satisfied. Otherwise, the optimization will think that the Polymer A/Polymer B ratio is
independent of the amounts of Polymer A and Polymer B, which is clearly not the case.
Press the Optimize button to bring up the Optimizer Configuration screen, and in the Mid1
column enter the desired release profile (for example as given above). Then press the Set
Values button and choose Mids2 = Mids1 which will automatically fill in the Mid2 column with
the same values as Mid1. In this case we will leave the Min and Max at their default values,
all the Weights as 1.0 and use the Tent Desirability Function. Press OK to move onto the
Optimizer screen, and press the Optimize button. Answer Yes to the question as to whether
to start with random ingredient values; this will start the optimization. You will observe the
ingredient values change as the optimizer attempts to maximize the desirability by
reproducing the release profile as best it can. The Optimize window and the Consult
window are coupled, so that when optimization ends, the final optimized values are filled out
in the ingredient Value and property Predicted columns on the Model Consult screen.
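The role of the Min, Mid1, Mid2 and Max columns can be pictured with a simple tent-shaped function. The sketch below is an assumption about the general shape (1 between Mid1 and Mid2, falling linearly to 0 at Min and Max) and about how the properties are combined (a weighted mean); the notes do not give INForm's exact formulas. The Min of 0 and Max of 100, the property names other than 1hr, and the predicted values are also made up.

```python
def tent_desirability(value, minimum, mid1, mid2, maximum):
    """Desirability in [0, 1]: 1 between mid1 and mid2, falling linearly
    to 0 at minimum and maximum (assumed shape, not INForm's own code)."""
    if mid1 <= value <= mid2:
        return 1.0
    if value <= minimum or value >= maximum:
        return 0.0
    if value < mid1:
        return (value - minimum) / (mid1 - minimum)
    return (maximum - value) / (maximum - mid2)


def overall_desirability(predicted, targets, weights=None):
    """Weighted mean of the per-property desirabilities (assumed aggregation)."""
    names = list(targets)
    if weights is None:
        weights = {name: 1.0 for name in names}  # all Weights left at 1.0
    total = sum(weights[name] for name in names)
    return sum(
        weights[name] * tent_desirability(predicted[name], *targets[name])
        for name in names
    ) / total


# First three points of the release profile in the text, with Mid1 = Mid2.
targets = {"1hr": (0, 10, 10, 100), "2hr": (0, 20, 20, 100), "4hr": (0, 30, 30, 100)}
predicted = {"1hr": 10.0, "2hr": 15.0, "4hr": 65.0}  # made-up model predictions
print(overall_desirability(predicted, targets))
```

Setting Mid2 equal to Mid1 collapses the flat top of the tent to a single point, so the optimizer is rewarded only for hitting the target release exactly at each time.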
The 3D plots can be useful to see which of the polymers has the greatest effect at short times,
and which is more important for long term release. One way of looking at that is to use the 3D
Graph option; a more interactive alternative is to use the 3D Plot Surface Explorer. Press the
Explorer button to open the 3D Plot Surface Explorer. For the X axis choose Polymer A and
for the Y axis choose Polymer B. Now select the Z axis as 1hr, and then change the Z axis to
one of the other properties to observe how the dependency on Polymer A and Polymer B
changes for the different release times. You can also try changing the values of the 'hidden'
ingredients on the plot by selecting a 'hidden' ingredient on the right and moving the
slider bar to change its value.
9. Hot Melt Adhesives
Hot melt adhesives are used in many polymer applications. What is required is a formulation
that melts relatively easily, that binds well to substrates, and that gives a strong adhesive
bond. And, it is helpful if it is easy to apply, so there are usually requirements for the viscosity,
too.
The example given here uses data from Setz and coworkers, reported in the Journal of
Chemometrics 11 403-418 (1997). Their hot melt adhesive is for bonding to polypropylene,
which poses challenges to the formulator - because of its low surface energy, most things
don't stick to it. In a typical formulation, oligo(propene) is mixed with SEBS (hydrogenated
polystyrene-block-polybutadiene-block-polystyrene), and a range of tackifiers. In their
particular formulation, they used
iPP10 - an isotactic oligo(propene) with Mn = 10,000
TPE - a hydrogenated polystyrene-block-polybutadiene-block-polystyrene thermoplastic
elastomer
TPEm - like TPE, but with grafted maleic anhydride
T1 - a hydrocarbon resin tackifier with Mn 690
T2 - a straight mineral oil
There is a constraint on the ingredients in this case - the amounts must add to 100%.
28 experiments were reported in their paper. The properties they measured included the lap
shear strength (tauB) and the viscosity, as well as delta (tauB). Viscosity has fewer data points
than the other two properties.
INForm recommends a 3-node hidden layer for this problem.
For this example we will use test data for model validation. Press Set Test Data and use the
Smart selection option with 10% of the data to be withheld for validation. This will select 2
data points for test data. With only 2 data points the Test R² value may be erratic and
can go below zero, especially when the points are removed from a designed data set. The results
depend very much on which 2 data records are used and on the random starting seed. It is
possible that in the case of Viscosity a record with a missing value is used for validation (which
will return a zero Test R² because there are not enough data points). When there are only 2
test data records the situation can also arise where the test values for a property are identical,
which means the Test R² cannot be calculated and a zero value is returned. You might want to try
withholding different data records, either by using Smart selection again, or by selecting
records manually. When you make these changes you will probably see that in some cases
models are better, while in other cases they are worse.
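Both effects are easy to reproduce with two test points. The sketch below uses the conventional R² definition, which is an assumption about what INForm calculates, and returns None in the undefined case where the text says INForm reports zero. The function name and all the numbers are made up.

```python
def holdout_r2_percent(actual, predicted):
    """Conventional R² in percent, or None when every actual value is
    identical and R² is therefore undefined."""
    mean_actual = sum(actual) / len(actual)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        return None
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 100.0 * (1.0 - ss_res / ss_tot)


print(holdout_r2_percent([10.0, 12.0], [10.5, 11.5]))  # close predictions: 75.0
print(holdout_r2_percent([10.0, 12.0], [13.0, 9.0]))   # wrong trend: -800.0
print(holdout_r2_percent([10.0, 10.0], [9.0, 11.0]))   # identical test values: None
```

With only two records the total variation in the denominator is tiny, so even moderate prediction errors push the value far below zero.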
There are a number of ways you can improve the Test R² values. You can experiment
by changing the Random Seed in the Model Training Parameters, or alternatively by setting the
Training Loops parameter to a value greater than 1 (for example 10). By using loops the
training will repeat using a different random seed each time, choosing the 'best' model. When
using loops you might want to increase the speed of the training procedure by reducing the
number of times the screen and plot are updated, for example by increasing the Screen Update
Rate to 50.
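The looping idea is a best-of-N restart. The sketch below shows only that control flow: `train_once` is a hypothetical stand-in that returns a placeholder score rather than training anything, and the notes do not say which criterion INForm uses to pick the 'best' model, so the use of a single score here is an assumption.

```python
import random


def train_once(seed):
    """Hypothetical stand-in for one training run. Returns (model, score);
    the score is a seeded placeholder, not a real training result."""
    rng = random.Random(seed)
    model = {"seed": seed}
    score = rng.uniform(0.0, 100.0)
    return model, score


def train_with_loops(loops, first_seed=1):
    """Repeat training with a different seed each time and keep the best run."""
    best_model, best_score = None, float("-inf")
    for seed in range(first_seed, first_seed + loops):
        model, score = train_once(seed)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score


model, score = train_with_loops(loops=10)
print(model["seed"], round(score, 1))
```

Because the first seed is included in every longer run, ten loops can never do worse than one loop by this score; the cost is roughly ten times the training time.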
By default Smart Stop is used when using test data (Smart Stop Enabled is checked on, on
the Test Data tab). You will notice that the training stops prior to reaching its target number of
iterations. The point at which it stops is determined by a combination of the training and test
error. To stop the training sooner, a higher weighting can be applied to the Test Error
Weighting parameter. At the extreme the training can be stopped based solely on the test
error value. To do this check off Auto Weight and set the value of the Test Error Weighting
to 1.0. If you still get poor values then manually select records 6 and 10 for testing, which will
produce good Test R² values.
You might want to optimize to see if you can find a formulation with high lap shear strength
and low viscosity. Remember to add the ingredient constraint, making all three ingredients
add to 100%. You can trade off to see if you need to sacrifice viscosity to get high lap shear
strength, and vice versa, if you wish. And have a look at the 3D plots, since these show how
very non-linear the models are.
10. Automotive Clearcoats
This is another literature example. Kruithof and van den Haak of Akzo Coatings B.V. have
reported a study, using statistics, of a clearcoat containing novel monomers. Here, we have
used their data, reported in Journal of Coatings Technology 62 47-52 (1990) - but we have
treated their data using a neural network. The present note therefore provides useful
comparisons with the statistical treatment.
The aim for automotive clearcoats is to increase the solids content, since this means that
there is less solvent, in line with environmental pressures. Adding monomers with linear
flexible bulky groups increases the solids content, but reduces the hardness of the coating,
and this is not desirable. Monomers with rigid bulky groups (rather than flexible ones) are
expected to improve the solids content, but without sacrificing the hardness. Kruithof and van
den Haak looked at four possible monomers.
For each monomer, there were 20 experiments, varying the film thickness, the percent of
novel monomer, and the percent of melamine formaldehyde crosslinker. Because of the
limitations of the statistical package, only three variables were used - neural nets can of
course cope with many more variables.
Kruithof and van den Haak found that, of the 4 monomers they considered, TMCMA led to the
highest hardness, so this example uses the data just for that monomer. Consequently, there
are three input variables - Monomer %, MF (melamine formaldehyde) %, and film thickness.
Two properties - the solids content and the hardness - were measured.
In their data, one point gives a higher value than usual for solids content. We have assumed
that this is a valid point, but it may be a typographical error. If the latter case is true, then we
might expect to develop relatively poor models.
In the example we will generate a GEP model. From the Training screen, press the
Parameters button. Make sure that the All Property Models radio button is selected near the
bottom of the Model Training Parameters screen. Then select Gene Expression
Programming as the Model Type. Keep the other parameters at their defaults and press OK.
Now when you train, you will notice that INForm takes significantly longer to create the models
than when using neural networks.
The ANOVA statistics generally show that the SoildCont model returns an R² value close to 80%,
while the KnoopHard model returns a lower R² value in the region of 70%. This might indicate
that other factors, which have not been measured, are having an effect on this property.
Now when you press View Results the GEP Models for the two properties are displayed.
By exploring the 3D graphs, you can see that the behaviour is quite non-linear for both
properties - and, interestingly, SoildCont is not linear with film thickness.
11. And in Conclusion
Remember, there is on-line help available throughout the Demo, to show you what the
different buttons are for.
We hope that you have found this INForm Demonstrator useful. Please refer any queries (or
provide any feedback) to
Intelligensys Ltd
Springboard Business Centre
Ellerbeck Way
Stokesley, North Yorkshire TS9 5JZ, UK
e-mail: postmaster@intelligensys.co.uk