Study on Image Pattern Selection via Support Vector Machine for Improving Chinese Herb GC×GC Data Classification and Clustering Performance

Wu Zhili
Vincent@comp.hkbu.edu.hk
Comp Dept, HKBU
and other authors
….
Abstract

Two-dimensional gas chromatography (2D-GC) has become a highly powerful technique for the analysis of complex mixtures. However, although the information-rich 2D-GC intensity image is easily visualized by experts for manual interpretation, it imposes great complexity and difficulty on computational approaches that aim to process 2D-GC data precisely and automatically, compared with the already mature signal processing methods for 1D-GC data. Complemented by some techniques used in pre-/post-processing for image analysis, this paper proposes a support vector machine (SVM) method for pattern selection from 2D-GC images. Experiments on Chinese herb data classification and clustering show the improvement gained by adopting the SVM feature selection method.
Keywords: Chinese herb, 2D-GC, SVM, feature selection, image analysis, classification, clustering.
1 Introduction

1.1 The significance and importance of Chinese herb data analysis
….

1.2 The superiority of 2D-GC over 1D-GC, and its suitability for Chinese herb analysis
….

1.3 The difficulties of analyzing 2D-GC data: computational complexity and the intractability of pattern recognition
….
As described in the previous introduction to 2D-GC, the captured data are saved in matrix form: each column holds the intensities sampled within the retention time of the second column of the 2D-GC device, and the number of columns corresponds to the total duration of the experiment. Analyzing such large data matrices therefore carries a heavy computational overhead.
A way to address the computational complexity is to reduce the matrix dimension: only significant and distinctive patterns in the image are retained as meaningful features. For example, the ANOVA method has been adopted [Ref…] for feature selection from 2D-GC data matrices. It uses a small subset of samples from each class and retains the matrix entries that have large inter-class variances and small intra-class variances.
This paper presents the linear SVM for feature selection. The linear SVM, as a linear classifier for each pair of classes, tries to assign a weighting to each matrix entry such that the weightings are separable (e.g. opposite in sign, or divided by an obvious threshold) for entries from different classes, while the absolute weightings signify the importance of the corresponding entries. The entries with the largest absolute weightings are thereby retained as meaningful features. For a dataset with more than two classes, the feature selection operates pair by pair, and the features selected by the multiple runs are unified into one combined feature set.
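The pairwise selection-and-union scheme described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `LinearSVC` as the linear SVM and an assumed per-pair budget `top_k`; the data here are random placeholders, not real chromatograms.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_svm_features(X, y, top_k=100):
    """Select features pair by pair with a linear SVM and unite the results.

    X: (n_samples, n_features) flattened 2D-GC images; y: class labels.
    top_k (features kept per class pair) is an assumed tuning knob.
    """
    selected = set()
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        clf = LinearSVC(C=1.0, tol=1e-3).fit(X[mask], y[mask])
        w = np.abs(clf.coef_).ravel()
        # Keep the entries with the largest absolute weightings.
        selected.update(int(i) for i in np.argsort(w)[-top_k:])
    return np.array(sorted(selected))

# Toy usage: 3 classes, 12 samples, 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 50))
y = np.repeat([0, 1, 2], 4)
feats = pairwise_svm_features(X, y, top_k=5)
```

The union over pairs means the combined set can be larger than `top_k` but never larger than `top_k` times the number of class pairs.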
The 2D-GC data are analyzed as images on which the patterns are regarded as stable and unique characteristics of a certain herb species or chemical component. However, properties of the patterns such as their areas, intensities, and positions always show some variation, so the comparison algorithms must be variation-tolerant while not degrading sensitivity when analyzing patterns from two different species. Some image processing techniques are therefore adopted for more accurate pattern extraction and matching.
1.4 Why machine learning approaches such as SVM and the family of classification and clustering algorithms can help Chinese herb 2D-GC image data analysis
….

Machine learning is the study of computer algorithms that improve automatically through experience [1]. As remarked in [2], machine learning will be …
2. The Feature Selection Methodology: Linear Support Vector Machine

The formalism of SVM is introduced here …..
3. SVM Applied to Pattern Extraction from Chinese Herb 2D-GC Images

Part 1. Input data:

(Please give the full name of element A and element B.)

B      A      Sample IDs
0%     100%   1 2 3 4 5
30%    70%    6 7 8 9 10
50%    50%    11 12 13 14 15

Table 1. The fifteen sample observations, grouped by blending ratio B:A.
Data format: Each sample observation forms a 400 x 510 matrix. During each 4-second time segment, 400 readings are sampled from column 2 of the GC device at a rate of 1 FID reading per 0.01 s, and a complete run of the experiment lasts 34 minutes [(510 x 4 s)/60 = 34]. The FID intensities range from 21 to nearly 4500. In the following analysis, however, we discard the readings obtained in the first 8 minutes, while the compounds are still passing through the GC device, because of the severe noise produced when booting the machine. We are thus handling 15 data matrices of size 400 x 390 in total.
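The dimension bookkeeping above can be checked in a few lines; the array here is a random placeholder standing in for a real chromatogram.

```python
import numpy as np

# Placeholder for one raw sample: 400 readings per 4 s segment
# (1 FID reading / 0.01 s) over 510 segments.
raw = np.random.rand(400, 510)

# Sanity check on the run length: (510 segments * 4 s) / 60 = 34 minutes.
assert 510 * 4 / 60 == 34

# Drop the first 8 minutes of boot-up noise: 8 * 60 / 4 = 120 segments,
# leaving the 400 x 390 matrices used in the analysis below.
n_discard = 8 * 60 // 4
trimmed = raw[:, n_discard:]
print(trimmed.shape)  # (400, 390)
```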
Part 2. 2D GC has a larger information capacity than 1D GC:

It is claimed that the 1D GC signal can be obtained by accumulating the FID signal strengths in each column of the 2D GC data matrix (Fig. 1). If that is true, it is straightforward that the 2D GC has a larger information capacity than the corresponding 1D GC (a 400 x 390 matrix vs. a 1 x 390 vector) (Fig. 2). Although it may be argued that such a reconstruction does not compare 1D GC with 2D GC under the same conditions (e.g. 2D is sampled at a frequency of 100 = 1/(0.01 s), but 1D is sampled at 0.25 = 1/(4 s)), our other experiments show that this reconstruction is credible. A set of 1D experiments were conducted at the high frequency (f = 100), also lasting 34 minutes. The FID readings were then sequentially folded into segments of length 400 and thus transferred into data matrices of size 400 x 510. Shown as images (Fig. 3), they are nearly identical to those obtained by simply reconstructing the 2D GC data and then replicating and tiling the mean readings into a matrix.
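The column-wise accumulation described above is a one-line reduction; the matrix here is a random stand-in for a trimmed sample.

```python
import numpy as np

# Placeholder 2D GC sample after trimming: 400 x 390.
gc2d = np.random.rand(400, 390)

# Accumulate the FID strengths in each column to reconstruct the
# 1D chromatogram: a 400 x 390 matrix collapses to a 1 x 390 vector.
gc1d = gc2d.sum(axis=0)
print(gc1d.shape)  # (390,)
```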
Fig. 1. (a) Image of a sample GC x GC chromatogram. (b) Reconstructed first-column chromatogram of the same GC x GC chromatogram.

Fig. 2. Extending the 1D GC data into the same matrix form, where each column can be supposed to be filled with a flat mean reading obtained from column 2 of the GC device. It is obvious that the 2D GC x GC signals are more distinct, owing to the varying strengths along each column.

Fig. 3. 1D GC data obtained by increasing the sampling frequency to f = 100.
Part 3. Further data preprocessing

3.1 Gaussian filtering with window size 3 x 7

It is generally believed that the same characteristic patterns can be observed in the graphs of two 2D GC experiments on the same compounds. Now assume a significant pattern centered at (x, y) is observed in the graph, where x is the row-wise pixel position and y is the column-wise pixel position (or y is regarded as time when referring to the 2D GC experiment). From experimental knowledge, such a pattern should not appear as a single pulse at an isolated position, but should be observed over an interval in both x and y (in the image representation, a rectangular box with some width in x and height in y). Thus, when comparing two 2D GC graphs, we cannot simply take the difference of FID signal strengths between each pair of pixels at the same (time/graph) position; instead we should consider the pattern difference over nearly the same region of the graphs. A simple way to enforce the effect of the neighboring pixels on a center pixel is to use a local filter with a small window size. Among the huge set of filters available in the field of image analysis, the Gaussian smoothing filter is popularly used. In our GC graph analysis we select a Gaussian filter with window size 3 x 7, which fits the requirement that the column-wise correlation of the 2D GC data be more accurately pinpointed.
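A 3 x 7 Gaussian smoothing of this kind might be sketched as below. The kernel orientation and the sigma values are assumptions (the text fixes only the 3 x 7 window), and `scipy.ndimage.convolve` is used as an assumed implementation vehicle.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel(shape=(3, 7), sigma=(0.8, 1.6)):
    """Build a normalized 2D Gaussian kernel of the given window size.

    shape follows the 3 x 7 window from the text; sigma is an assumed choice.
    """
    rows = np.arange(shape[0]) - (shape[0] - 1) / 2.0
    cols = np.arange(shape[1]) - (shape[1] - 1) / 2.0
    g = np.exp(-(rows[:, None] ** 2) / (2 * sigma[0] ** 2)
               - (cols[None, :] ** 2) / (2 * sigma[1] ** 2))
    return g / g.sum()  # weights sum to 1, so mean intensity is preserved

# Smooth a placeholder chromatogram so each pixel reflects its neighborhood.
gc2d = np.random.rand(400, 390)
smoothed = convolve(gc2d, gaussian_kernel(), mode="nearest")
```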
Part 4. Feature Selection Using the Linear Support Vector Classification Machine

4.1 Feature selection for 2D GC data using the linear SVM

It is necessary to reduce each 2D GC data matrix from a huge matrix to a more economical size by discarding insignificant values and keeping only the important features. This not only greatly reduces the computational burden cast upon the subsequently used classification or clustering algorithms; more importantly, it is also essential for sketching out the featured patterns in the 2D GC data for a chemist's inspection or further chemical analysis.

A novel machine learning approach to feature selection has recently been proposed, utilizing the state-of-the-art support vector machine. Following the general practice of reshaping each 2D GC matrix into a one-dimensional vector by sequentially tiling its columns, and stacking the vectors of all samples together, we form an N x d data matrix, where N = 15 and d = 400 x 390 = 156,000.
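The reshape-and-stack step reads directly as column-major flattening; the matrices below are random placeholders for the 15 trimmed chromatograms.

```python
import numpy as np

# Placeholders for the 15 trimmed 400 x 390 chromatogram matrices.
samples = [np.random.rand(400, 390) for _ in range(15)]

# Tile each matrix column by column into one long vector, then stack:
# an N x d data matrix with N = 15 and d = 400 * 390 = 156,000.
X = np.stack([m.flatten(order="F") for m in samples])
print(X.shape)  # (15, 156000)
```

Using Fortran (column-major) order means the first 400 entries of each row are exactly the first column of the corresponding matrix, matching the "tiling each column" description.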
Since the support vector machine is a classification method, some training samples are used to guide the feature selection procedure. For instance, four samples (the 1st, 2nd, 11th, and 12th) are regarded as training samples, where the 1st and 2nd samples belong to the same class (purely composed of A) and the remaining two samples (the 11th and 12th) are grouped into the opposite class (contaminated by some B). Denote the four data vectors as xi (i = 1, 2, 11, 12).
The linear support vector machine tries to construct a separation function f(x) = w·x + b such that w·x1 + b > 0, w·x2 + b > 0, w·x11 + b < 0, and w·x12 + b < 0, subject to some constraints on w and b. After a systematic solving procedure, we can obtain an explicit solution for w from the linear support vector machine. The vector w, which has the same dimension as x, expresses the importance of each dimension of xi through the corresponding term in w.
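The four-sample setup above can be sketched with scikit-learn's linear-kernel `SVC`, using the parameter values quoted later in the paper (C = 1, tol = 0.001, cache size 100 MB). The vectors are random stand-ins with a reduced dimension; the sign pattern of f(x) = w·x + b on the training samples is what the construction enforces.

```python
import numpy as np
from sklearn.svm import SVC

# Random stand-ins for the four flattened training chromatograms;
# the real vectors would have d = 400 * 390 entries.
rng = np.random.default_rng(1)
d = 200
X_train = rng.normal(size=(4, d))
y_train = np.array([1, 1, -1, -1])   # samples 1, 2 vs. samples 11, 12

# Linear SVM with C = 1, tolerance = 0.001, cache size = 100 MB.
clf = SVC(kernel="linear", C=1.0, tol=1e-3, cache_size=100).fit(X_train, y_train)
w, b = clf.coef_.ravel(), clf.intercept_[0]

# f(x) = w.x + b takes the correct sign on each training sample,
# and |w| ranks the importance of each input dimension.
print(np.sign(X_train @ w + b))
```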
B      A      Samples           Classification result
0%     100%   1 2 3 4 5         1 1 1 1 1
30%    70%    6 7 8 9 10        2 2 2 2 2
50%    50%    11 12 13 14 15    2 2 2 2 2

Table 2. Classification using parameters C = 1, tolerance = 0.001, and cache size = 100 MB. The training error is 0 and the testing error is also 0.
Constructing the linear SVM mainly aims at feature selection, although its classification performance is already encouraging enough, as shown in Table 2. After training a linear SVM, the w obtained is illustrated in the following graphs:

Fig. 4. Each FID signal (156,000 in total) is associated with a weight value.

Fig. 5. Reducing the number of features does not hurt the classification accuracy much.

Fig. 6. Fractional area of features along the threshold value.
4.2 Classification accuracy comparison between 2D and 1D data

A comparison has been carried out to validate that 2D GC data have a larger information capacity than the reconstructed 1D GC data: we use the linear SVM to separate more classes of experimental data and compare the classification accuracies. We produced 5 classes of compounds with various percentages of element B; in particular, the percentages of B are 0%, 10%, 20%, 30%, and 40%, and each specific blend was fed into the GC device 5 times to obtain a set of 5 replicated 2D measurements, from which the same number of reconstructed 1D data vectors were derived. This is in fact a multi-class classification task. We report the separation rate per pair of classes using a training sample rate of 0.4 (2 training samples per class):
            B:A=0:100  B:A=10:90  B:A=20:80  B:A=30:70  B:A=40:60
B:A=0:100      —         0.77       1.00       1.00       1.00
B:A=10:90     0.78        —         0.83       0.96       0.96
B:A=20:80     1.00       0.92        —         0.75       0.99
B:A=30:70     1.00       1.00       0.86        —         0.81
B:A=40:60     1.00       1.00       0.93       0.93        —

Table 3. The overall (both training and testing) classification accuracies for 1D and 2D GC data using the linear SVM under the parameter settings C = 1, tolerance = 0.001, cache size = 100 MB. Each table cell shows the accuracy of classifying the sample type named in the column title from the type marked in the row title. The upper triangular part shows the results for 1D GC data, and the lower triangular part for 2D data. The better accuracy of each diagonally symmetric pair of values is highlighted. The results are averaged over 10 repeated experiments with different training samples.
From Table 3 we can see that most results for classifying 2D data are better than those for 1D data. The only exception occurs when classifying 40%-B from 30%-B, but the accuracy difference is too small to disprove the superiority of using 2D data, taking into account the device noise and the limited number of samples used.
Part 5: Further optimizing the 2D GC features by image processing methods

The features selected by the linear support vector machine can effectively distinguish samples without any B mixed in from those contaminated by B. For example, extracting only 1 percent of the features still completely separates the two classes of data (Figs. 5 and 6). Although the classification results shown above are insensitive to the number of features, choosing what percentage of features to keep is still critical from the viewpoint of chemistry domain experts, because it might be too dangerous to represent an originally high-dimensional sample by an extremely small set of features. We may have to determine the optimal threshold for the w obtained from the linear SVM such that the number of features is not formidably large, yet the sample representation is not vulnerably oversimplified.
If we had plenty of training samples, classical methods such as cross-validation could be used to guide the choice of the optimal threshold on w. But in the chemometric field, and in particular for our 2D GC experiments, obtaining more samples is very time-consuming and labor-demanding. The w itself is a long vector with a one-to-one correspondence to the data vector of each 2D sample. Recalling that many unsupervised feature selection methods also achieve good results by considering only the pixel intensities and pattern spatiality of each single 2D image, we can apply the same methodology to selecting from w if we transform the w vector back into a 2D image of the same size as each 2D GC image.
To extract more reasonable features from the images without too much supervision, we can employ image processing techniques for contour/boundary detection. We adopt a set of threshold values (twenty levels in total) to locate the contours in the image valued by w. The threshold values are selected automatically for each sample image in an unsupervised manner. The following is a set of important areas found in the image of w.

Figure 7. Important areas located in the image of w by multi-level contour detection.
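The multi-level contour step might be realized as below, using `skimage.measure.find_contours` as an assumed tool. The weight image is a small random stand-in (the real one would be the w vector reshaped to 400 x 390), and the even spacing of the twenty levels is an assumption; the text only says they are chosen automatically.

```python
import numpy as np
from skimage import measure

# Small random stand-in for the w image (real size: 400 x 390,
# obtained by reshaping w in column-major order).
rng = np.random.default_rng(2)
w_img = rng.normal(size=(40, 39))

# Twenty automatically spaced threshold levels over |w|; contours traced
# at these levels enclose the candidate "important areas".
mag = np.abs(w_img)
levels = np.linspace(mag.min(), mag.max(), 22)[1:-1]  # 20 interior levels
contours = {float(lv): measure.find_contours(mag, lv) for lv in levels}
```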
Part 6: Clustering the whole dataset using the new feature vectors

6.1 PCA analyses showing the improvement due to feature selection

To verify the effectiveness of the support vector machine feature selection approach, we report the clustering results on the feature-reduced 2D GC data compared with the original raw 2D GC data. The clustering algorithm combines PCA with the K-means method: the input data are first represented by a subset of principal components through PCA, and then clustered into several groups by the K-means algorithm.
For high-dimensional input vectors, such as the raw 2D GC vectors of length 156,000, PCA with the K-L transform trick (identical to kernel PCA with a linear kernel) is used to avoid operating directly on the covariance matrix, whose size is proportional to the square of the input dimension. Even for other lower-dimensional data, linear-kernel PCA is performed alongside the conventional PCA, and the better results are reported under the single category (PCA + K-means) in the following table.
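The PCA + K-means pipeline with the linear-kernel trick can be sketched with scikit-learn: `KernelPCA(kernel="linear")` works through the 15 x 15 Gram matrix rather than the 156,000 x 156,000 covariance matrix. The data and the parameter choices (3 components, 3 clusters) are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

# Random stand-ins for the 15 flattened chromatograms (d = 400 * 390).
rng = np.random.default_rng(3)
X = rng.normal(size=(15, 400 * 390))

# Linear-kernel PCA: eigendecomposes the N x N Gram matrix (the K-L
# transform trick) instead of the huge d x d covariance matrix.
Z = KernelPCA(n_components=3, kernel="linear").fit_transform(X)

# Cluster the projected samples into the three blend classes.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
```

Since N = 15 is far smaller than d, this reduces the eigenproblem from d x d to 15 x 15 without changing the projection.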
Training   Raw 2D GC data                             Feature selection (a)   Feature selection (b)
samples    Linear SVM   K-means   PCA + K-means       PCA + K-means           PCA + K-means
2          0.9356       0.8956    0.8874              0.9244                  0.8533
4          0.9711       —         —                   0.9156                  0.9289
6          0.9911       —         —                   0.9022                  0.9244

Table 4. Validating the feature selection methods for 2D GC data.

As shown in Table 4, the linear SVM, as a supervised classification algorithm, always achieves the best results under the different training sample rates, even though it operates on the raw data. The K-means clustering, as an unsupervised method, with or without PCA, does not achieve good results on the raw 2D data. After feature selection, the clustering algorithms obtain higher accuracies with lower computational complexity.

One can also observe that feature selection scheme (b) is positively correlated with the training sample rate used at the linear SVM stage. Although scheme (b) performs worse when the training rate is small, it rises to a higher precision than scheme (a) as the training rate increases. This is also verifiable from Figures 8 and 9: the summed variance percentage of the first 3 principal components in Figure 9 reaches 80%, which is much higher than the total variance percentage of the first three principal components in Figure 8.
Figure 8. PCA and K-means on the data using feature selection scheme (a).

Figure 9. PCA and K-means on the data using feature selection scheme (b).
4. Conclusion
5. References
[1] Tom Mitchell, Machine Learning, McGraw-Hill, 1997.