A Toolkit for Remote Sensing Enviroinformatics Clustering
Fazlul Shahriar, George Bonev
Advisors: Michael Grossberg, Irina Gladkova, Srikanth Gottipati
Clustering Module:
k

means

This is a top

down clustering algorithm which attempts to find k representative centers
for the data. The initial means are selected from the training data itself.
Fuzzy k

means

This is a top

down clustering algorithm which attempts to find k representative
centers for the data. The initial means are selected from the training data itself. This algorithm uses
a slightly different gradient search than the simple standard k

means algorithm, but generally yields
the same final solution.
Expectation

Maximization

Estimates the means and covariances of components in Gaussian
Mixture Model.
Competitive learning

Competitive learning clustering, where the nearest cluster center is
updated according to the position of a randomly selected training pattern.
Leader follower

Basic leader

follower clustering, which is similar to competitive learning but
additionally generates a new cluster center whenever a new input pattern differs by more than
threshold distance
\
theta from existing clusters.
ADDC
(agglomerative clustering)

An on

line (simple

pass) clustering algorithm which accepts a
single sample at each step, updates the cluster centers and generates new centers as needed. The
algorithm is efficient in that it generates the cluster centers with a single pass of the data.
DSLVQ
(distinction sensitive linear vector quantization)

Performs earning vector quantization
(i.e., represents a data set by a small number of cluster centers) using a distinction or classification
criterion rather than a traditional sum

square

error criterion.
Minimum spanning tree
(undirected)

Builds a minimum spanning tree for a data set based on
nearest neighbors.
Connected components

Finds connected components for a data set based on nearest neighbor.
Returns a list of the connected components of the given graph.
Graph cut

Graph cut clustering is achieved by cutting edges of the graph to form a good set of
connected components such that the weights of within

components edges will be minimized
compared to across

component edges.
Spectral clustering

Spectral clustering techniques make use of the spectrum of the similarity
matrix of the data to perform dimensionality reduction for clustering in fewer dimensions.
HDR
(hierarchical dimensionality reduction)

Clusters similar features so as to reduce the
dimensionality of the data.
SOHC
(stepwise optimal hierarchical clustering)

Bottom

up clustering. The algorithm starts by
assuming each training point is its own cluster and then iteratively merges the two clusters that
change a clustering criterion the least, until the desired number of clusters, k are formed.
Utility Functions:
Make graph
(similarity matrix)

Gen a set of data points A, the similarity matrix may be defined
as a matrix S where each elements represents a measure of the similarity between points i,j in A.
Normalization
(standard deviation based)

Normalize a group of observations on a per feature
basis. This is done by dividing each feature by its standard deviation across all observations.
UniqueRand

Generates unique set of random points drawn from N(0,1)
Training

Nearest neighbor classifier is used to classify test data set with the clustering obtained
from trained data set.
UniqueVector

Computes the unique set of feature vectors from a given set of feature vectors.
Whitening transform

Performed on d

dimensional data set, it first subtracts the sample mean
from each point, and then multiplies the data set by inverse of the square root of the covariance
matrix
Validation Module:
Cross validation
–
statistical method for validating a predictive model. Subsets of the data are held
out, to be used as validating sets; a model is fit to the remaining data (a training set) and used to
predict for the validation set. Averaging the quality of the predictions across the validation sets
yields an overall measure of prediction accuracy.
Bootstrap

statistical method for estimating the sampling distribution of an estimator by sampling
with replacement from the original sample, most often with the purpose of deriving robust
estimates of standard errors and confidence intervals of a population parameter.
Jackknife

To estimate the bias and standard error in a statistic, when a random sample of
observations is used to calculate it. The basic idea behind the jackknife estimator lies in
systematically recomputing the statistic estimate leaving out one observation at a time from the
sample set.
BIC

In parametric methods, there might be various candidate models each with a different
number of parameters to represent a data set.
The Bayesian information criterion is a useful
statistical criterion for model selection for parametric methods.
AIC
–
Tool for nonparametric model selection. Given a data set, several competing models may be
ranked according to their AIC, with the one having the lowest AIC being the best.
Clustering obtained using K

means algorithm where
5 clusters were specified and run with a random
initial staring point.
Clustering obtained with preprocessing step
using whitening followed by K

means algorithm
Clustering obtained with preprocessing step
using whitening followed by EM algorithm
Clustering obtained using EM algorithm where
5 clusters were specified and run with a
random initial staring point. The algorithm
usually gets stuck in a local minima.
Modes obtained during the mean shift algorithm.
Red dots represent the local peaks of the density
estimate of the data
Clustering obtained using a combination of
mean shift and connected components algorithms
Issues:
Remotely sensing data is typically vast.
Data size requires advanced tools to explore them semi

automatically.
Clustering is one such tool.
Implementation:
Many clustering algorithms have been proposed in the literature but they are dispersed in
multiple libraries in different languages. Hence it becomes difficult to test these algorithms on
applications at hand.
Our goal is to create a single library (platform independent) so that users can test them on
remote sensing data
To accomplish this, we choose Python programming language which gives a MATLAB

like
interface and and at the same time lends to deal with large databases
Furthermore, Python allows easy integration with C/C++/R libraries.
Physics based cluster labeling: 1
Physics based cluster labeling: 2
Unsupervised nonparametric classification
Colors are the same as used in the scatter
plot above
Clustering obtained on 2

d cloud of points
by running mean shift procedure with
various initializations
Trajectories of the mean shift procedures drawn
over the density estimate computed over the same
data set. The peaks retained for final classification
are marked with red dots.
2

d cloud of points where clustering using normal distribution based
methods could fail while methods like spectral and geometric clustering
algorithms could do a better job
MODIS cloud classification over the eastern part of United States.
3

d cloud of points which could easily be clustered
using a parametric method like the Expectation

Maximization (EM) algorithm
Code fragment from clustering toolkit
IPython

MATLAB

like interface for Python
This research has been funded by NOAA

CREST grant # NA06OAR4810162
Comments 0
Log in to post a comment