Higher-Order Data Mining Using Tensors - Michael Barnathan


Mining Complex High-Order Datasets

Michael Barnathan

Temple University

Department of Computer and Information Sciences


April 23, 2010

Structure of Presentation

Introduction
  What constitutes “complex” data?
  Why functional medical imaging?
  Tensor definitions and nomenclature.
  Motivation and contributions.
Background
  Literature review.
  Matrix decompositions.
    Why SVD, LSA, and their higher-order analogs are still state-of-the-art methods.
    Minimization of Frobenius norm: theoretical justification.
    “Halfway to the Netflix Prize”: empirical justification.
    Graph-theoretic interpretation.
  Tensor operations and decompositions.
  WaveCluster.
Methods
  Tensor-theoretic multidimensional wavelet transform.
  Classification.
  Clustering.
    WaveCluster (+ several improvements)
    TWaveCluster
    Lloyd + WaveCluster.
  Latent Concept Discovery (High-Order LSA).
Results
  Datasets.
  Low- and High-Order Classification.
  Clustering Results.
  Handedness Detection and Discovered Concepts.
  Behavior vs. Sparseness.
  Approximation Accuracy.
Conclusion and Future Work
References

What makes datasets complex?

Interactions between many features:
  Analyzing simultaneous effects of many features.
  Nonscalar features.
Large scale:
  Size expands exponentially with order, so scalability is a major issue. We should use whatever tricks we can: sparseness, sensitivity to locality, approximation (nice if it has some optimality properties too), and compression using a fast method.
Features themselves can be high-order:
  For example, the diffusion tensors used in Diffusion Tensor Imaging encapsulate direction.
  More commonly encountered: the (1×3) or (1×4) pixels of an RGB(A) image.
  Analyzing each component of each feature in isolation is sometimes impractical. For example, looking for specific RGB colors entails examining all three channels at a time, and most variations of R, G, and B are not independent. Why treat them that way?
  If we want to use the full dataset, a matrix is not the right model.
And combinations of these! Tradeoffs:
  Methods for untangling interactive features can be too slow on large-scale datasets.
  Locality in space and time can be lost, or false neighborhoods can be created.

RGB images are high-order when analyzed across all channels.

Medical Imaging

Primary application of our work: medical image analysis and computer-aided diagnosis, a particularly complex imaging domain.

Spatiotemporal:
  Spatial locality is usually very important.
    Spatial activation patterns correlate to anatomical site and function.
    Normal anatomy produces patterns that would be abnormal elsewhere.
    Determination of border regions is particularly important in some applications, e.g. lesion segmentation.
  Structural medical data evolves slowly over time.
    e.g. follow-up images of the same patient.
    Even occurs in normal patients: effects of growth/aging. But this could take months or years!
  Functional data evolves much more rapidly!
    Changes occur during imaging! Temporal resolutions of seconds or less.
  Space and time (among many factors) need to be evaluated together.
    Patterns may move to different parts of an organ over time; e.g. frontal cortex (planning) -> motor cortex (execution) in motor task data.

Images vary with additional parameters: acquisition parameters, subject, and functional task.

Locality in space and time must be captured. Analysis should use all information possible, including subject and experimental design parameters.

“Modes” of Functional Data

3 spatial modes, 1 temporal mode; trends in both are important. Subject and task are also modes. The big picture: how do we combine this information?

Tensors

Matrices can’t model these datasets, certainly not all of these modes in tandem. Solution: add additional indices. The result is a multidimensional array or “data cube”. We call this a tensor.
  (The data mining definition does not imply a constraint on coordinate transform behavior. If you’ve never heard of a tensor before, don’t worry about it.)

The number of indices is the order of the tensor. “High-order”: order > 2.

Each index is a “mode”: e.g. “rows” -> mode 1, “columns” -> mode 2.

Order by example: scalar = order 0, vector = order 1, matrix = order 2, tensor = order r.

Tensors: Advantages and Disadvantages

Advantages:
  Full exploitation of the high-order structure of a dataset: can represent high-order data without losing information or creating false neighborhoods.
  Tensor techniques can make inferences across modes.
    E.g. a co-clustering derived from and reported with similarities to all modes of the tensor at once.
    Very powerful: relates modes and describes their underlying meanings.
  Excellent representation for spatiotemporal data.
    For example, functional experiments: 3 orders of volumetric data, 1 order for time, 1 order for subject, 1 order for experimental task…

Disadvantages:
  Poor efficiency: storage requirements scale exponentially with order, and tensor factorizations can take many iterations to converge.
  Methods are global: they do not take spatiotemporal locality into account.
  Low-order techniques are much more established in the literature, particularly in the biomedical domain.
  No comprehensive high-order data mining framework currently exists.

We address all of these.

Contributions

Although built on established theoretical foundations, tensors are not well studied in biomedical data mining.
  We apply our techniques in the domain of computer-aided diagnosis and discover new, potentially clinically relevant patterns derived from simultaneous analysis across all modes in motor task fMRI data.
  We present one of the first comparative analyses of tensor and matrix methods within this domain, including performance on a synthetic dataset of varying sparseness.
  Our primary dataset is large-scale (9.3 GB, order 6, 2,734,221,600 voxels), thus presenting an additional data mining challenge.

No comprehensive framework exists for performing common data mining tasks using tensors.
  We develop TWave, a higher-order framework for classification, clustering, compression, feature extraction, summarization, and latent concept discovery of tensor data.
  In the process of extending our framework, we make additional improvements to the WaveCluster algorithm and derive novel high- and low-order algorithms suited to data mining and biomedical analysis.
  We compare the performance of our framework to traditional low-order models.

Tensor methods can be slow or even intractable on large datasets.
  We utilize optimized tensor analysis techniques, such as the Memory-Efficient Tucker decomposition (Kolda and Sun, 2008), and build our methods on more efficient analysis techniques, such as PARAFAC (Harshman, 1970; Carroll and Chang, 1970).
  We develop a hybrid approach using wavelets and based on the WaveCluster algorithm (Sheikholeslami, Chatterjee, & Zhang, 2000), compressing our dataset by 98% (9.3 GB to 181 MB) and speeding analysis by two orders of magnitude (8 days to 2 hours) without a loss of subsequent classification accuracy or significant deformation of the discovered concept space.
  We assess the performance characteristics of several tensor and matrix approaches as dataset size, sparseness, and approximation accuracy change.

Most tensor methods ignore spatiotemporal locality.
  We utilize multilevel wavelet decompositions and WaveCluster, both of which naturally capture dataset locality, prior to analysis with tensor techniques.
  We present an improvement to WaveCluster which deforms grid cells with a prior clustering step. This both makes the algorithm context-aware and reduces a potentially massive partial volume effect.



Literature Review

Matrix methods and Singular Value Decomposition:
  SVD used for low-rank matrix approximation and PCA (Eckart and Young, 1936).
  Latent Semantic Analysis (Deerwester, Dumais, Furnas, Landauer, and Harshman, 1990).
    Used in clustering and co-clustering (Pan, Zhang, Wang, 2008).
    “Folding in”: a powerful recommendation engine used with great success in pursuit of the Netflix Prize (Koren, 2009).
  PCA used to remove correlations; ICA developed (Comon, 1994).
  Principal Direction Divisive Partitioning: hierarchical clustering using SVD (Boley, 1998).
  Incremental linear-time SVD developed (Brand, 2003; Brand, 2006).
  Optimal k-means clustering related to the PCA subspace (Ding and He, 2004).

Tensor decompositions (milestones):
  Introduced by Hitchcock (Hitchcock, 1927).
  Tucker decomposition (Tucker, 1963).
  PARAFAC / CANDECOMP (Harshman, 1970; Carroll and Chang, 1970).
  High-order SVD, a special case of the Tucker decomposition (De Lathauwer, De Moor, Vandewalle, 2000).
  Tensor-CUR (Mahoney, Maggioni, and Drineas, 2006).
  Incremental tensor analysis (Sun, Tao, and Faloutsos, 2006).
  Multilinear PCA (and MPCA+LDA for supervised analysis) (Lu, Plataniotis, and Venetsanopoulos, 2008).

Applications:
  High-order techniques used primarily in chemometrics (Bro, 1996; Andersson and Bro, 2000; Niazi and Mohammad, 2006; Niazi and Yazdanipour, 2007).
  Facial recognition (“TensorFaces”) and computer vision (Vasilescu & Terzopoulos, 2002; Kim, Wong, and Cipolla, 2007).
  Handwritten digit recognition (Savas, 2003).
  Network traffic analysis (Sun, Tao, and Faloutsos, 2006).
  Web link analysis (Kolda and Bader, 2006; Kolda and Bader, 2007).
  Spatiotemporal tensor mining (Sun, Tsourakakis, Hoke, Faloutsos, and Eliassi-Rad, 2008).
  Surveys: (Comon, 2002), (Martin, 2004), (Kolda and Bader, 2007), (Skillicorn, 2007), (Acar and Yener, 2009).

Gaping hole: four decompositions (SVD, Tucker, PARAFAC, NNMF) are used to do just about everything in this area. There have been improvements on the computations, but no real replacements or optimizations. There’s a reason for this.

Biomedical Applications

Outside of chemometrics, tensors are less utilized than in other domains. PARAFAC has been the technique of choice (it requires significantly less memory than Tucker, and even Memory-Efficient Tucker, in our experiments).
  EEG analysis of epilepsy data (Acar, Aykut-Bingol, Bingol, Bro, & Yener, 2007).
  EEG classification of visual potentials evoked by 3 geometric shapes (Li, Zhang, and Zhao, 2007).
  MRI studies of lingual morphology (Zheng, Hasegawa-Johnson, and Pizza, 2003).
  Fusion of EEG and fMRI modalities (Martínez-Montes, Valdés-Sosa, Miwakeichi, Goldman, and Cohen, 2004).
  The Tensor-CUR paper was applied to multispectral histology classification (Mahoney, Maggioni, & Drineas, 2006).

There is no comparative analysis of tensor analysis techniques (except CUR vs. Tucker), or of tensor techniques against matrices, in this domain. We will present such an analysis with our framework.

Why SVD?

Why hasn’t something better come along? SVD is ancient.
  Proof of existence and uniqueness: 1936 at latest. Used even earlier.
  Yet it’s still used everywhere for co-clustering and recommendation (see the previous slide).
  High-order techniques are also based on SVD and used in the same way.

Theoretical justifications:
  SVD is a linearly optimal approximation (it minimizes the Frobenius norm): the truncated SVD Aₖ solves min ‖A − B‖_F over all matrices B of rank ≤ k (Eckart and Young, 1936).
  The concepts are ranked by contribution to dataset variance. If SNR is high, noise tends to go first.

Empirical justifications:
  It works very well. A simple SVD-based recommendation engine came halfway to winning the Netflix Prize early and was a component of the winning solution (Koren, 2009).
  Simple to compute, and can greatly compress the dataset for subsequent analysis.


Singular Value Decomposition

Factors an m × n rank-r matrix A into two orthonormal projection matrices U and V, containing the left and right singular vectors, and a diagonal core matrix S containing the singular values:

A = U S Vᵀ

Full SVD: U is m × m, V is n × n, S is m × n.
Compact SVD: U is m × r, V is n × r, S is r × r.

Unique decomposition: there is only one SVD of matrix A satisfying the constraints.

Uses:
  Optimal low-rank approximation (minimizes the Frobenius norm of the approximation error).
    Made by arranging singular values (i.e. variance captured) in descending order, zeroing all but the desired number of values, and reconstituting the matrix.
  Computation of the pseudoinverse: A⁺ = V S⁻¹ Uᵀ.
  Basis of range and null space:
    Left singular vectors (columns of U) with nonzero singular values form a basis of A’s range.
    Right singular vectors with zero singular values form a basis of A’s null space.
  Determination of rank (the number of nonzero singular values in S).
  Least squares minimization: b = V S⁻¹ Uᵀ y, i.e. b = A⁺y (probably not the most efficient way to do this).
  Co-clustering (LSA) and hierarchical clustering (PDDP).
  Basis for PCA and LSA.

Straightforward computation based on the eigendecomposition of (AᵀA)^(1/2) (in thesis).

[Figure: the SVD diagram, A = U S Vᵀ.]
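As a concrete sketch (NumPy assumed; this example is illustrative and not from the original deck), the compact SVD, a rank-k approximation, and the pseudoinverse:

```python
import numpy as np

A = np.random.rand(6, 4)                           # any m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # compact SVD: A = U @ diag(s) @ Vt

k = 2                                              # number of singular values to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # optimal rank-k approximation
err = np.linalg.norm(A - A_k)                      # Frobenius norm of the residual

A_pinv = Vt.T @ np.diag(1 / s) @ U.T               # pseudoinverse A+ = V S^-1 U^T
```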

Latent Semantic Analysis: Basics

Use of SVD to extract latent concepts from a term-document matrix (Deerwester, Dumais, Furnas, Landauer, and Harshman, 1990):
  Construct a “term-document” matrix of term occurrence frequencies or weights (works well with tf-idf weighting).
  Normalize the columns.
  Perform SVD on the term-document matrix.
  The resulting U contains document-concept similarities, V contains term-concept similarities, and S contains the “strength” of concepts: their contribution to overall dataset variance.

Foundation of our approach; it works with higher-order decompositions and non-textual datasets as well: groups of voxels, groups of subjects, groups of motor tasks…

“Synonym” detection: terms that share concepts with high similarity are likely synonyms or otherwise related.

Useful for noise reduction: if SNR is sufficiently high, noise is naturally represented by “weak” concepts. Truncation of the SVD may be performed to eliminate concepts; noise tends to go first.

Unsupervised; the optimal number of concepts to keep may be determined with the aid of the elbow criterion or measures such as the Akaike or Bayesian Information Criteria.

LSA Example

Example term-document matrix for MRI reports and network RFCs:

Document   Ventricles   Hippocampus   Flow   Traffic
MRI-1      24           10            1      0
MRI-2      16            8            0      0
MRI-3       8            6            0      0
NET-1       0            0            6      17
NET-2       0            0            4      13

The SVD recovers an “MRI” concept and a “NET” concept (94.8%).
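A minimal sketch of the LSA pipeline on this toy matrix (NumPy assumed; illustrative only):

```python
import numpy as np

# Toy term-document matrix from the slide (rows: documents, columns: terms).
A = np.array([[24, 10, 1,  0],    # MRI-1
              [16,  8, 0,  0],    # MRI-2
              [ 8,  6, 0,  0],    # MRI-3
              [ 0,  0, 6, 17],    # NET-1
              [ 0,  0, 4, 13]], dtype=float)

A /= np.linalg.norm(A, axis=0)                 # normalize columns
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# U: document-concept similarities; Vt.T: term-concept; s: concept strengths.
print(np.round(U[:, :2], 2))                   # documents separate into two concepts
print(np.round(s**2 / (s**2).sum(), 3))        # fraction of variance per concept
```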

LSA Graphical Example

[Figure: rows and columns plotted in the same space! The left shift is an effect of that instance of “flow” in the MRI-1 document.]

Graph Theoretic Interpretation

LSA represents a co-clustering of rows and columns into the same space:
  # clusters = # concepts held.
  Cluster memberships = similarities in U, V.
  Cluster strengths = singular values in S.

Very powerful high-order extension: every mode of a tensor projects to a common space.

Can be done on either the adjacency matrix or the Laplacian of a graph (L = Deg. Mat − Adj. Mat).

Translates into matching vertices through a common set of conceptual “waypoints” rather than directly from one to another.

Recommendation With “Folding In”

A means of projecting a previously unseen query q into the existing concept space. The standard fold-in projection is

q′ = qᵀ Uₖ Sₖ⁻¹

q′ can now be compared to any row of U (or V), e.g. by cosine similarity: cos(q′, v) = (q′ · v) / (‖q′‖ ‖v‖).

The query can be appended to U to seed new queries, but be careful: the model will lose its optimality as a projection, and accuracy gradually degrades (Sarwar et al., 2002). Solution: periodically rebuild the SVD or use an incremental formulation (Brand, 2003).

[Figure: folding-in example.]
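A hedged sketch of folding in (hypothetical names; the orientation assumes documents are columns of A, so q′ lands in the same space as the rows of V):

```python
import numpy as np

A = np.random.rand(6, 5)                       # terms x documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
q = np.random.rand(6)                          # term counts of an unseen document

q_proj = q @ U[:, :k] / s[:k]                  # fold in: q' = q^T U_k S_k^-1
docs = Vt[:k].T                                # existing documents in concept space
cos = docs @ q_proj / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_proj))
recommend = np.argsort(cos)[::-1]              # most similar documents first
```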

Tensor Operations

Tensor (outer) product: generalizes the Kronecker product.
Extracting slices and fibers: “horizontal”, “lateral”, and “frontal” slices; “column”, “row”, and “tube” fibers.
Unfolding (matricization) and folding.
Matrix mode product: generalizes matrix multiplication.

[Figure: unfolding a small order-3 tensor on mode 1: take the mode-1 fibers and linearize them into a matrix.]
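In NumPy terms, unfolding and folding are a moveaxis plus a reshape (a hypothetical helper, not code from the thesis):

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization: the mode-n fibers become the columns of the result."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of unfold for a tensor of the given shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

X = np.arange(1, 13).reshape(2, 2, 3)          # a small order-3 tensor
assert np.array_equal(fold(unfold(X, 1), 1, X.shape), X)
```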

LSA Interpretation

LSA concepts are still usable with higher-order decompositions:
  The projection matrices contain mode-to-concept similarities.
  The core tensor / scaling vector contains concept strengths (contributions to dataset variance).

It is now possible to assess the similarity to concepts of each mode relative to the entire dataset, not just rows vs. columns:
  i.e. discrimination of handedness by subject, taking spatiotemporal patterns and motor task into account.
  This is done in parallel on all modes of the dataset.
  Modes can also be ranked and compared directly.

Tucker Decomposition

High-order analogue of the SVD. Let 𝒜 be an order-r tensor. There exists a decomposition of the form

𝒜 = 𝒮 ×₁ U₁ ×₂ U₂ ⋯ ×ᵣ Uᵣ

where U₁, U₂, …, Uᵣ are unitary projection matrices on each mode and 𝒮 is a core tensor containing concept strengths.

Not a unique decomposition: it is no longer possible to ensure both orthonormal projection matrices and a diagonal core tensor. Orthonormal matrices are generally more desirable; with orthonormal matrices, a similar optimality guarantee to SVD holds. Computation by alternating least squares (in thesis).
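A compact sketch of the non-iterative truncated HOSVD, the special case of Tucker mentioned earlier (NumPy assumed; the thesis’s ALS computation would iteratively refine these factors):

```python
import numpy as np

def unfold(X, mode):
    # mode-n matricization: mode-n fibers become the columns of the result
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(T, M, mode):
    # mode-n product: multiply matrix M into the tensor along the given mode
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def hosvd(X, ranks):
    """Truncated higher-order SVD: leading singular vectors of each unfolding."""
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
         for n, r in enumerate(ranks)]
    core = X
    for n, Un in enumerate(U):
        core = mode_product(core, Un.T, n)     # project onto each mode's basis
    return core, U

X = np.random.rand(6, 5, 4)
core, U = hosvd(X, (3, 3, 2))                  # core tensor is 3 x 3 x 2
approx = core
for n, Un in enumerate(U):
    approx = mode_product(approx, Un, n)       # reconstruct the approximation
```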

PARAFAC/CANDECOMP

High-order analogue to factor analysis / PCA. Independently discovered by Harshman and by Carroll and Chang in 1970: same technique, different names. Demonstrated as a generalization of PCA by (De Lathauwer, De Moor, Vandewalle, 2000).
  As with factor analysis, PCA is similar to PARAFAC when order <= 2 and the residuals are homoskedastic.
  Unlike PCA, there is no approximation optimality guarantee in the high-order case.
  Unique, rotationally sensitive solution.

Let 𝒜 be an order-r tensor. For a specified number of factors f, there exists a unique decomposition of the form

𝒜 ≈ Σᵢ λᵢ (uᵢ⁽¹⁾ ∘ uᵢ⁽²⁾ ∘ ⋯ ∘ uᵢ⁽ʳ⁾),  i = 1 … f

where U⁽¹⁾ … U⁽ʳ⁾ contain the factor loadings on their respective modes and the λ vector contains the variance captured within factors. Computation by alternating least squares (in thesis).

[Figure: the tensor expressed as a sum of rank-one outer products.]
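A minimal PARAFAC-ALS sketch under stated assumptions (NumPy; C-order unfoldings; no convergence check and only a simplified normalization, so treat it as illustrative rather than the thesis implementation):

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(A, B):
    # column-wise Kronecker product
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(X, f, iters=100, seed=0):
    """Rank-f PARAFAC/CANDECOMP by alternating least squares (a sketch)."""
    rng = np.random.default_rng(seed)
    U = [rng.random((d, f)) for d in X.shape]
    for _ in range(iters):
        for n in range(X.ndim):
            others = [U[m] for m in range(X.ndim) if m != n]
            kr = others[0]
            for M in others[1:]:
                kr = khatri_rao(kr, M)         # matches the C-order unfolding
            V = np.ones((f, f))
            for m in range(X.ndim):
                if m != n:
                    V *= U[m].T @ U[m]         # Hadamard product of Gram matrices
            U[n] = unfold(X, n) @ kr @ np.linalg.pinv(V)
    lam = np.linalg.norm(U[-1], axis=0)        # pull scale into the lambda vector
    U[-1] = U[-1] / lam
    return lam, U
```

Usage: `lam, U = cp_als(X, f=2)` returns the λ vector and the per-mode loading matrices.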

Wavelets

Multiresolution spectral analysis tools. The continuous wavelet transform convolves scaled and translated wavelet functions with the signal:

W(s, t) = (1/√s) ∫ f(x) ψ*((x − t)/s) dx

where * denotes the complex conjugate, s the scaling parameter, t the translation parameter, ψ the mother wavelet function, and f(x) the original signal.

The discrete analogue (a filter bank) is more often used:

a[n] = Σₖ x[k] g[2n − k],   d[n] = Σₖ x[k] h[2n − k]

This convolves x with lowpass (g) and highpass (h) filters and downsamples (“decimates”) the signal in half. Perfect reconstruction is still possible for orthogonal wavelets (“quadrature mirror filter”).

The stationary wavelet transform upsamples the filters instead. Since we can reconstruct with only half the signal, this is redundant; nevertheless, it is useful for grid-based algorithms (e.g. WaveCluster).

Multilevel Wavelet Decomposition

Multiresolution analysis is possible by cascading a Mallat tree of filters: each level re-applies the filter bank to the previous level’s approximation coefficients. It is also possible to create a multidimensional wavelet decomposition by rotating the tensor between levels.

Novel idea: let’s use tensor operations to extend this to structures of arbitrary order. We’ll come back to this momentarily.
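For reference, the Mallat cascade is a one-liner in a wavelet library (PyWavelets assumed; illustrative, not the thesis code):

```python
import pywt

x = list(range(64))                        # any 1D signal
coeffs = pywt.wavedec(x, 'db4', level=3)   # Mallat tree: [cA3, cD3, cD2, cD1]
x_rec = pywt.waverec(coeffs, 'db4')        # perfect reconstruction (orthogonal wavelet)
```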

WaveCluster

A grid- and density-based clustering algorithm using wavelets, by (Sheikholeslami, Chatterjee, and Zhang, 2000). A perfect complement to our hybrid wavelet + tensor approach.

Primary advantages:
  No k parameter, as in k-means.
  Linear-time.
  Multiresolution (uses wavelet decompositions at different scales).
  Can identify non-spherical clusters (connectivity rather than distance defines clusters).
  Identifies outliers; used in FindOut, an outlier detection method (Yu, Sheikholeslami, and Zhang, 2002).

Primary disadvantages:
  The grid introduces a quantization error: a huge partial volume effect is possible if grid cells are too large.
  Density threshold and grid size are additional parameters.
  The naïve algorithm can’t cluster image intensity data; voxels must be binary.
  Cannot identify spatially disjoint clusters.


Algorithm overview:
1. Quantize the data to a grid, using the count of each grid cell in place of the original data.
2. Apply a wavelet transformation using a hat-shaped wavelet (such as the (2,2) or (4,2) biorthogonal wavelets), retaining the approximation coefficients and emphasizing regions in which points cluster.
3. Threshold cells in the transformed space: cells with values above a user-specified density threshold are considered “significant”.
4. Apply a connected-component algorithm to the significant cells to discover and label clusters.
5. Map the cells back to the original data using a lookup table built during quantization.
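A sketch of these five steps for 2D points (PyWavelets and SciPy assumed; this is an illustration, not the in-house implementation mentioned below):

```python
import numpy as np
import pywt
from scipy import ndimage

def wavecluster(points, grid=64, density=1.0, wavelet='bior2.2'):
    """Sketch of WaveCluster for 2D points (grid size, density threshold,
    and wavelet are the algorithm's parameters)."""
    # 1. Quantize to a grid; each cell holds its point count.
    lo, hi = points.min(0), points.max(0)
    cells = np.minimum(((points - lo) / (hi - lo) * grid).astype(int), grid - 1)
    counts = np.zeros((grid, grid))
    np.add.at(counts, (cells[:, 0], cells[:, 1]), 1)
    # 2. Hat-shaped wavelet transform; keep the approximation band.
    approx, _ = pywt.dwt2(counts, wavelet, mode='periodization')
    # 3. Cells above the density threshold are "significant".
    significant = approx > density
    # 4. A connected-component pass labels the clusters.
    labels, n_clusters = ndimage.label(significant)
    # 5. Map points back through their (decimated) grid cells; label 0 = outlier.
    return labels[cells[:, 0] // 2, cells[:, 1] // 2], n_clusters
```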


In-house implementation: it was impossible to find source code, even by asking the authors. So others can avoid the same trouble, I am open-sourcing it.

WaveCluster Illustrated

[Figure: the pipeline on a toy grid: quantization → DWT → thresholding → connected components → unmapping, with one isolated cell discarded as an outlier.]

First public implementation. Code available and licensed under the GPL v3.


Tensor-Theoretic Multidimensional Wavelet Transform

Motivation: necessary for our clustering work.

Insight: a 2D wavelet transform can be constructed by transforming the rows, then transforming the (row-transformed) columns. The generalization is not as straightforward: each mode must be transformed against all other modes.

Intuition: unfolding guarantees that one mode of the tensor will be made into the rows of a matrix and all others into its columns. The DWT can be taken of this representation, then the tensor re-folded and rotated.

Method: given a tensor X of order r,
1. Unfold X on mode 2 and transpose (all other modes concatenated in rows; mode 2 becomes the columns).
2. Perform a 1D DWT on each row.
3. Transpose and fold the tensor. Mode 2 is now transformed across all other modes.
4. Circularly shift every mode of the tensor by 1.
5. Repeat r times (all modes will end in their starting positions).
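A sketch of one level of this transform (NumPy and PyWavelets assumed; not the GPL code mentioned below):

```python
import numpy as np
import pywt

def tensor_dwt(X, wavelet='db4'):
    """One level of the multidimensional DWT via unfold -> 1D DWT -> fold -> rotate.
    A sketch of the five-step procedure above."""
    for _ in range(X.ndim):
        # Unfold on mode 2 and transpose: each row is a mode-2 fiber.
        rows = np.moveaxis(X, 1, -1).reshape(-1, X.shape[1])
        cA, cD = pywt.dwt(rows, wavelet, axis=1)       # 1D DWT of every row
        coeffs = np.concatenate([cA, cD], axis=1)      # keep both coefficient bands
        # Transpose and fold back, so mode 2 is now transformed...
        other = [d for i, d in enumerate(X.shape) if i != 1]
        X = np.moveaxis(coeffs.reshape(other + [coeffs.shape[1]]), -1, 1)
        # ...then circularly shift every mode by 1 and repeat.
        X = np.moveaxis(X, 0, -1)
    return X

Y = tensor_dwt(np.random.rand(8, 8, 8, 8))             # order-4 example
```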



[Figure: illustration of the transform: rotate 90º clockwise and repeat.]

This code is also available under the GPL v3.

TWave Classification

High-order locality-preserving classification:
1. Apply a multidimensional multilevel DWT on the spatiotemporal modes of the tensor using an orthogonal wavelet (we used Daubechies-4). Concatenate each level of approximation and detail coefficients and linearize the wavelets.
   Multiresolution analysis: each level corresponds to one resolution. Purpose: encapsulate neighborhood information in the analysis.
   Retain the structure of the categorical modes of the tensor (neighborhoods among these modes do not exist).
2. Optionally threshold the wavelet coefficients.
3. Normalize each column to have a mean of zero.
4. Apply Tucker, PARAFAC, or Multilinear PCA. Any number of concepts can be retained; the number of distinct classes usually works well. Fewer concepts mean less fidelity but a greater degree of abstraction. Purpose: inter-modal reasoning and conceptual abstraction.
5. If analyzing more than one mode at once (e.g. subject and task combinations), multiply the target modes back together by plugging them into the PARAFAC/Tucker equation (without the λ scaling factor).
6. Unfold on mode 2 and transpose, concatenating the rows of the tensor to become the rows of the feature matrix.
7. Pass the unfolded and preprocessed matrix to a classifier. Despite operating on a matrix, the intuitive classification procedure on a tensor remains “features in the columns, observations on all other modes”; this is exactly what transposing the mode-2 unfolding gives us.
8. Transpose and fold the resulting vector of class labels on mode 2 to yield a d₁ × 1 × d₃ × … tensor containing class labels for each observation on each mode. The columns (features) are gone, replaced by class labels. This is not a representation of the original tensor, but a convenience for examining results.
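Steps 6-7 in miniature (NumPy and scikit-learn assumed; W and the labels y are placeholders, not the fMRI data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

W = np.random.rand(10, 32, 4, 6)                    # preprocessed concept-space tensor
F = np.moveaxis(W, 1, 0).reshape(W.shape[1], -1).T  # mode-2 unfolding, transposed:
                                                    # observations x features
y = np.random.randint(2, size=F.shape[0])           # placeholder class labels
knn = KNeighborsClassifier(n_neighbors=5).fit(F, y)
labels = knn.predict(F).reshape(10, 1, 4, 6)        # fold labels back on mode 2
```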

TWave Concept Discovery

Similar to classification using TWave:
1. Perform the same preprocessing steps as classification, up to centering.
2. Decompose using the suspected number of concepts present in the dataset (e.g. “left-handed” and “right-handed” suggests 2 concepts).
   AIC, BIC, the elbow criterion, etc. may help guide this, but they are still imprecise.
   As with any unsupervised learning, the ideal parameters are guided by domain knowledge.
3. The resulting tensor is a projection into concept space containing mode-to-concept similarities.
4. Discovered concepts may be analyzed and displayed directly or used as input to further analysis.
   If plotted, all modes can be shown in the same plot.
   If analyzed, (cosine) similarity between projected modes can be used to build a recommendation engine.

(This is something I’ve done with great success at work using Tanimoto similarity and a sparse tf-idf-weighted graph adjacency matrix, but I unfortunately can’t share those results.)

TWaveCluster

Based on the WaveCluster algorithm, modified to work on real (rather than binary) data.

WaveCluster is a density-based clustering algorithm: it counts frequencies of voxel appearance under certain constraints (in this case, their grid cells). This is precisely what we do to construct the LSA term-document matrix! So what we’re really doing is a more sophisticated form of LSA co-clustering.

Idea: replace the connected-component algorithm with a decomposition. Use each voxel’s similarity to each concept as a measure of its cluster membership; voxels with similar concept memberships will cluster together.

Advantages:
  High-order (the primary advantage of the other techniques as well).
  Naturally fuzzy (concept similarities are real-valued).
  Can discover spatially disjoint yet similar clusters (Tucker/PARAFAC are not locality-sensitive).
  Still integrates preferences for dense neighborhoods (due to the wavelets).
  Can cluster across modes of the tensor in parallel.
  Built-in cluster validity measure: the strength of the concepts!
  Comparable efficiency to WaveCluster; the data size is greatly reduced prior to invocation of the tensor decomposition due to the binning and wavelet transformation.

Disadvantages:
  Requires the number of concepts as a parameter.


TWaveCluster Algorithm

Begin as in WaveCluster:
1. Quantize the data to a grid, using the count of each grid cell in place of the original data.
2. Apply a wavelet transformation using a hat-shaped wavelet (such as the (2,2) or (4,2) biorthogonal wavelet), retaining the approximation coefficients.
3. Threshold cells in the transformed space: cells with values above a user-specified density threshold are considered “significant”.

Then:
1. Model the significant cells as a tensor X.
2. For a parameter k, run a k-concept PARAFAC analysis on X.
3. For each c from 1 to k, recompose a tensor using only column c of each projection matrix, omitting the scaling factor: 𝒳_c = u_c⁽¹⁾ ∘ u_c⁽²⁾ ∘ ⋯ ∘ u_c⁽ʳ⁾. The resulting tensor contains voxel similarities to concept c.
4. Assign every voxel the cluster label of the concept to which it exhibits the greatest similarity: label(v) = argmax_c 𝒳_c(v).
5. Threshold: keep the top s% most similar voxels within each cluster concept; discard the others.
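Steps 2-4 in miniature for an order-3 tensor, reusing the cp_als sketch from the PARAFAC slide (illustrative; X stands in for the tensor of significant cell counts):

```python
import numpy as np

X = np.random.rand(16, 16, 16)                       # placeholder for the cell tensor
k = 4
lam, U = cp_als(X, f=k)                              # k-concept PARAFAC
sims = np.stack([np.einsum('i,j,k->ijk', U[0][:, c], U[1][:, c], U[2][:, c])
                 for c in range(k)])                 # similarity to each concept
labels = sims.argmax(axis=0)                         # most similar concept wins
```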





Why can’t WaveCluster segment?

Clustering algorithms are naturally used for image segmentation, but WaveCluster is generally a poor algorithm to use for this purpose, particularly on real-valued images. This applies to TWaveCluster as well. Why?

Clean and concise margins are sometimes very important. Meaning in diagnostic radiology:
  Too wide -> tissue loss and surgical complications.
  Too small -> treatment failure.

WaveCluster is a grid-based algorithm: one point in a cell cannot cluster alone. The whole cell is labeled, all or nothing.

The wavelet transform blurs cells with their neighbors:
  Desirable for capturing locality.
  Not desirable when a rectangular cell crosses a margin: a huge partial volume effect!

The clustering ends up being “blocky”; boundaries of sparse areas are eroded and boundaries of dense areas dilated.
  Good for outlier detection (e.g. the FindOut method, Yu et al.).
  Bad for precise segmentation.

Potential solution: use (more) smaller grid cells.
  Problem: this causes the clustering to lose generality and homogeneity, and alters the scale of the analysis. Clusters are still fundamentally hyperrectangles, with potentially more false splits.

Better solution: mold cells to the underlying image margins.

Lloyd + WaveCluster

Idea: avoid the quantization error associated with WaveCluster by using k-means centroids discovered by Lloyd iteration as cells:
1. Seed the algorithm with the WaveCluster grid cells (starting state: WaveCluster grid cell boundaries).
2. Attain new cells from the algorithm.
3. Map a new grid to these cells.
4. Run WaveCluster. Instead of assigning voxels a label, assign grid cells.

Can achieve better boundary resolution with far fewer cells than WaveCluster; greatly reduces the partial volume effect of the grid. Benefits are particularly pronounced when few grid cells are present.

Tradeoff: efficiency is reduced to that of k-means, which in a very small number of cases is super-polynomial.
  If used in a public application (e.g. CBIR), an adversary could exploit this. Unlikely, but possible to arise naturally.
  This may be an acceptable tradeoff in applications such as determination of margins in medical images.

Other segmentation approaches are valid here as well: watershed, edge detection, level sets, fuzzy connectedness… Same basic principle.


Experimental Datasets: High-Order

fMRI digital opposition motor task dataset:
  Four tasks: left finger-to-thumb, left hand squeeze, right finger-to-thumb, right hand squeeze. Tasks were strongly periodic.
  11 subjects.
  120 time points per subject (resolution = 3 s).
  79×95×69 voxels per time point.
  Thus a tensor of order 6 and dimensionality 79×95×69×120×4×11; 9.3 GB in size.

Classification Results

Leave-one-out kNN experiments performed on voxelwise, wavelet-only, SVD, PARAFAC, TWave, and TWave with MPCA+LDA:

           Voxels   Wavelets   SVD      PARAFAC   TWave     TWave + MPCA/LDA
Runtime    95 min   112 min    3 days   8 days    133 min   130 min
Subjects   52%      98%        80%      88%       96%       100%
Tasks      34%      68%        56%      52%       74%       93%
Size       9.3 GB   181 MB     9.3 GB   9.3 GB    181 MB    181 MB
Lefties?   No       No         No       Yes       Yes       N/A

(kNN parameters: k = 5 and k = 4.)

SVD and PARAFAC alone were run on a supercomputer (Euler): 16 2.6 GHz dual-core CPUs and 128 GB of memory. All other methods were run on a dual-processor 2.2 GHz Opteron system with 4 GB of memory.

The Tucker decomposition initially did not complete due to memory usage; we were able to use Memory-Efficient Tucker, however (Kolda and Sun, 2008). Its Ω(n²) memory usage aside, Tucker results were nearly identical to PARAFAC in time, accuracy, and handedness detection and are not shown. The CUR decomposition could not complete on either system and is thus not included.

Concept Discovery Results

TWave was capable of separating right- and left-handed subjects in a 2-concept analysis.
  So was pure PARAFAC, but it took nearly 100x longer to do so with similar results.
  SVD could not do it: the unfolding distorted the discovered concept space.

Although only subjects are shown in the plot, the concepts have linkages to every mode in the tensor. Every voxel is represented by a point in this space as well, as is every point in time. And every task; but the tasks were far right of all subjects and tightly clustered with each other relative to the subjects, indicating they were likely orthogonal to the handedness concept.
  This makes sense: variance between subjects = 9066.85; within-subject variance between tasks = 179.29.
  Tasks thus project better onto lower concepts and begin to become more discriminative past the 9th concept.
  This is positive: it suggests that the pattern may be learned independently of the motor task being performed. It also means we could probably leave the task mode out and get similar results on subjects.

[Figure: subjects plotted in the 2-concept spaces discovered by TWave, PARAFAC, and SVD.]

Concept Discovery: Tasks

Tasks start to break up on concepts 9-10 (smaller variance -> higher-numbered concepts). [Figure: tasks in the 2-concept PARAFAC space.]

Concept Discovery: Control

Sex of the subject serves as a task-unrelated control for TWave. [Figure.] (We’ve never really been sure of him…)

TWaveCluster Results

[Figures: the original dataset (subject 6, RH; temporal mean shown); k-means results (k = 4); raw TWaveCluster results (k = 4). The highlighted region may be the precentral gyrus / “motor strip”?]

Watch what happens when we threshold…

TWaveCluster Final Results

The most salient cluster is frontal (λ₁ = 28,812). The 2nd most salient cluster is motor (λ₂ = 18,083). The 3rd most salient cluster is caudal (λ₃ = 15,958). The least salient cluster is noisy (λ₄ = 11,366).

Lloyd + WaveCluster Results

Large cell sizes (few cells) lead to quantization errors in naïve WaveCluster. Lloyd + WaveCluster lacks this quantization error, even with few cells. Many more cells and a lower density threshold are required to improve the resolution of WaveCluster, but then two adjacent cells are merged.

[Figure: k-means mean cluster time series, subject 2 (LH), k = 3.]

Low-Order MNIST Results

Classification was also performed on the low-order MNIST digit recognition dataset. There are many optimizations in the literature, but we were primarily interested in the comparative behavior of the techniques when the dataset is dense and low-order.

Hypothesis: tensor decompositions would degrade to slower versions of their low-order counterparts on a low-order dataset. And that is essentially what happened: SVD and Tucker were exactly the same except for trivial differences in sign. PARAFAC was not, but attained similar results.

             Voxels    Wavelets   SVD      Tucker    PARAFAC   TWave
Total Time   213 sec   122 sec    17 sec   135 sec   55 sec    36 sec
Prep. Time   N/A       6 sec      2 sec    96 sec    39 sec    18 sec
Accuracy     95.4%     95.6%      88%      88%       89%       90%

(kNN: k = 2.)

Interesting to note is the scalability: performance on a small dataset is dominated by the classification itself. This is not the case on the larger dataset. Tensor and wavelet decompositions compress the dataset and speed up the subsequent classification step, hence the faster time.

Approximation Accuracy

Tucker, PARAFAC, and SVD are all capable of low-rank approximation.
  SVD has a theoretical approximation optimality guarantee, which a globally optimal Tucker decomposition should also obey.
  SVD can’t operate on tensors natively, but can work on the unfolded representation. This is part of the point of using Tucker/PARAFAC: SVD can’t capture high-order structure.

Measured on a single (LH) subject’s volume represented as a 4th-order tensor, averaged over 5 runs (PARAFAC-ALS and Tucker-ALS are only locally optimal). Concepts held ranged from 1 to 11. Approximation error was measured using the MSE between the reconstructed tensor and the original.

Tucker outperforms SVD on a high-order dataset. On low-order datasets the lines completely overlap.

Approximation accuracy isn’t everything (or we’d use voxelwise methods everywhere). More important is the meaning of the summarized concepts, e.g. LH vs. RH. As previously shown, the PARAFAC results have greater discriminative power by subject, while SVD tends to group all subjects closely together. This suggests that the error is due to abstraction of useful information.

Sparseness

Our experimental data is rather sparse due to CSF masking and background removal, and we’re interested in seeing how that affects performance. Since some algorithms can take over a week per run on a large dataset, we assessed performance vs. sparseness on a small dataset:
  1000 × 100 sequential integers from 1 to 10000 (order 2).
  The bottom x% changed to zeros per run; x varied in increments of 10% from 0% to 100%*.
  *One element always nonzero; ALS could not run on a dataset of all zeros.
  The actual data is irrelevant; all that matters is that all models use the same dataset and that the sparseness is controllable.

On this small dataset, classification time tended to dominate. That was not the case on larger datasets, where decomposition time outstripped it.


Future Work

Incremental learning and better computation:
  The actual computation is something we haven’t really explored much yet.
  Is there a way to extend Brand’s result (Brand, 2003) to a high-order space?
  PARAFAC-ALS converges well but has poor efficiency (Tomasi and Bro, 2006).

Optimizing for memory usage:
  Tucker ran out of memory; Memory-Efficient Tucker barely ran.
  We strongly suspect the O(mn²) memory usage is unnecessary.
  Frobenius norm: ‖A‖_F = sqrt(Tr(AᵀA)). We just need the diagonal of the covariance matrix, not the whole thing.
  An incremental model would solve this problem as well.
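The point about the diagonal, as a one-line check (NumPy; illustrative):

```python
import numpy as np

A = np.random.rand(1000, 500)
slow = np.sqrt(np.trace(A.T @ A))   # forms the full n x n covariance matrix
fast = np.sqrt((A * A).sum())       # same value, without materializing A^T A
assert np.isclose(slow, fast)
```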


Developing higher-order manifold learning / nonlinear dimensionality reduction techniques:
  “Kernel PARAFAC” / “high-order kernel PCA”.
  Higher-order LLE / Isomap / MVU.
  Mathematically modeling abstraction in general is a fascinating area to me.

DTI dataset and structural registration:
  DTI: innately high-order features.
  Once spatially registered, DTI can be integrated into an fMRI tensor model as additional modes.
  Jackpot if voxels, time, and the modes representing the direction of the diffusion gradient cluster to a common set of concepts: this would allow us to infer functional patterns from structural data.
  (From a neurological standpoint this ideal condition is unlikely, because the variance of functional patterns is likely much greater than that of structural patterns, but it can certainly be used to pinpoint the association between structural lesions and functional deficits if nothing else.)
  The natural noise removal properties of the decompositions will also be of great use.

Conclusion

Certain datasets naturally take the form of higher-order structures. Tensors can accurately model this structure, and tensor decompositions allow us to reason across the modes of the tensor:
  “Conceptual co-clustering”: all modes are directly comparable through similarities to a shared set of latent concepts.
  “Folding in”: performing similarity analysis across modes in this clustered space.

Tensor methods have been underutilized in the traditional data mining literature; comprehensive mining frameworks are required. Tensor decompositions provide powerful analysis tools that have not yet been put to their full use.

Naïve methods do not scale and do not utilize spatiotemporal locality information. Hybrid analysis tools may be required to leverage the power of tensor decompositions while avoiding their generally poor efficiency. Wavelets capture neighborhoods, speed analysis, compress well, and represent an effective preprocessing tool.

Tensors have the ability to discover new patterns in high-order biomedical datasets.

Relevant Publications

Published:
  TWave: High-Order Analysis of Spatiotemporal Data. Michael Barnathan, et al. Accepted for publication in Proceedings of PAKDD 2010, Hyderabad, India.
  High-Order Concept Discovery in Functional Brain Imaging. Michael Barnathan, et al. Published in Proceedings of the International Symposium on Biomedical Imaging 2010, Rotterdam, The Netherlands.
  Analyzing Tree-Like Structures in Biomedical Images Based on Texture and Branching: An Application to Breast Imaging. Michael Barnathan, Jingjing Zhang, Despina Kontos, Predrag Bakic, Andrew Maidment, and Vasileios Megalooikonomou. Published in Proceedings of the International Workshop on Digital Mammography (IWDM) 2008, Tucson, Arizona, July 20-23, 2008.
  Wavelet Analysis of 4D Motor Task fMRI Data. Michael Barnathan, Rui Li, Vasileios Megalooikonomou, Feroze Mohamed, and Scott Faro. Published in Proceedings of Computer Assisted Radiology and Surgery (CARS) 2008, Barcelona, Spain, June 25-28, 2008.
  A Texture-Based Methodology for Identifying Tissue Type in Magnetic Resonance Images. Michael Barnathan, et al. Published in Proceedings of the International Symposium on Biomedical Imaging 2008, Paris, France, May 14-17, 2008.
  A Web-Accessible Framework for Automated Storage and Analysis of Biomedical Images. Michael Barnathan, Jingjing Zhang, and Vasileios Megalooikonomou. Published in Proceedings of the International Symposium on Biomedical Imaging 2008, Paris, France, May 14-17, 2008.

In Review:
  A Spatiotemporal Clustering Framework for fMRI Time Series Analysis. Rui Li, Michael Barnathan, Vasileios Megalooikonomou, Scott Faro, and Feroze Mohamed. Submitted to Human Brain Mapping.
  Efficient Techniques for Database Similarity Searches of Brain Images. Troy Schrader, et al. Submitted to AI in Medicine.

Other Publications

  A Representation and Classification Scheme for Tree-like Structures in Medical Images: Analyzing the Branching Pattern of Ductal Trees in X-ray Galactograms. Vasileios Megalooikonomou, Michael Barnathan, Despina Kontos, Predrag Bakic, and Andrew D.A. Maidment. Published in Vol. 28, Issue 4 of IEEE Transactions on Medical Imaging, pp. 487-493.
  Probabilistic Branching Node Detection Using AdaBoost and Hybrid Local Features. Tatyana Nuzhnaya, et al. (2nd author). Published in Proceedings of ISBI 2010, Rotterdam, The Netherlands, April 14-17, 2010.
  Spatial Feature Extraction Techniques for the Analysis of Ductal Tree Structures. Aggeliki Skoura, Michael Barnathan, and Vasileios Megalooikonomou. Published in Proceedings of EMBC 2009, Minneapolis, Minnesota, September 2-6, 2009.
  Probabilistic Branching Node Detection Using Hybrid Local Features. Haibin Ling, Michael Barnathan, Vasileios Megalooikonomou, Predrag Bakic, and Andrew D.A. Maidment. Published in Proceedings of ISBI 2009, Boston, Massachusetts, June 28 - July 1, 2009.
  Classification of Ductal Tree Structures in Galactograms. Aggeliki Skoura, Michael Barnathan, and Vasileios Megalooikonomou. Published in Proceedings of ISBI 2009, Boston, Massachusetts, June 28 - July 1, 2009.
  A High-Level Language for Homeland Security Response Plans. Richard Scherl and Michael Barnathan. Published in Proceedings of the 2005 AAAI Spring Symposium, March 22, 2005.

References

Kolda, T. G., & Sun, J. (2008). Scalable Tensor Decompositions for Multi-aspect Data Mining. Proceedings of the 8th IEEE International Conference on Data Mining (pp. 363-372).
Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Models and Conditions for an “Explanatory” Multimodal Factor Analysis. UCLA Working Papers in Phonetics, 16, 1-84.
Carroll, J. D., & Chang, J.-J. (1970). Analysis of Individual Differences in Multidimensional Scaling via an n-way Generalization of “Eckart-Young” Decomposition. Psychometrika, 35(3), 283-319.
Mahoney, M. W., Maggioni, M., & Drineas, P. (2006). Tensor-CUR Decompositions for Tensor-Based Data. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 327-336). Philadelphia: ACM.
Sheikholeslami, G., Chatterjee, S., & Zhang, A. (2000). WaveCluster: A Wavelet-Based Clustering Approach for Spatial Data. The VLDB Journal, 8, 289-304.
Eckart, C., & Young, G. (1936). The Approximation of One Matrix by Another of Lower Rank. Psychometrika, 1(3), 183-187.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, 391-407.
Comon, P. (1994). Independent Component Analysis: A New Concept? Signal Processing, 36(3), 287-314.
Boley, D. (1998). Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery, 2(4), 325-344.
Brand, M. (2006). Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1), 20-30.
Brand, M. (2003). Fast online SVD revisions for lightweight recommender systems. SIAM International Conference on Data Mining (pp. 37-46).
Ding, C., & He, X. (2004). K-means Clustering via Principal Component Analysis. Proceedings of the 21st International Conference on Machine Learning (pp. 225-232). Banff, Canada.
Pan, F., Zhang, X., & Wang, W. (2008). CRD: fast co-clustering on large datasets utilizing sampling-based matrix decomposition. SIGMOD 2008 (pp. 173-184). Vancouver, Canada: ACM.
Hitchcock, F. L. (1927). The Expression of a Tensor or a Polyadic as a Sum of Products. Journal of Mathematical Physics, 6, 164-189.
Tucker, L. R. (1963). Implications of Factor Analysis of Three-way Matrices for Measurement of Change. Problems in Measuring Change (University of Wisconsin Press).
De Lathauwer, L., De Moor, B., & Vandewalle, J. (2000). A Multilinear Singular Value Decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4), 1253-1278.
Sun, J., Tao, D., & Faloutsos, C. (2006). Beyond Streams and Graphs: Dynamic Tensor Analysis. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 374-383).
Lu, H., Plataniotis, K. N., & Venetsanopoulos, A. N. (2008). MPCA: Multilinear Principal Component Analysis of Tensor Objects. IEEE Transactions on Neural Networks, 19(1), 18-39.


References (continued)

Bro, R. (1996). Multi-Way Calibration. Multi-Linear PLS. Journal of Chemometrics, 10(1), 47-62.
Andersson, C. A., & Bro, R. (2000). The N-way Toolbox for MATLAB. Chemometrics & Intelligent Laboratory Systems, 52(1), 1-4.
Niazi, A., & Mohammad, S. (2006). PARAFAC and PLS applied to spectrophotometric determination of tetracycline in pharmaceutical formulation and biological fluids. Chemical and Pharmaceutical Bulletin, 54(5), 711-713.
Niazi, A., & Yazdanipour, A. (2007). PLS and PARAFAC Applied to Determination of Noscapine in Biological Fluids by Excitation-Emission Matrix Fluorescence. Pharmaceutical Chemistry Journal, 41(3), 170-175.
Vasilescu, M. A., & Terzopoulos, D. (2002). Multilinear Analysis of Image Ensembles: TensorFaces. Proceedings of the 7th European Conference on Computer Vision (pp. 447-460). Springer.
Kim, T.-K., Wong, S.-F., & Cipolla, R. (2007). Tensor Canonical Correlation Analysis for Action Classification. Proc. CVPR 2007. Minneapolis, Minnesota, USA.
Savas, B. (2003). Analyses and Tests of Handwritten Digit Recognition Algorithms. Sweden: Master's Thesis, Linkoping University.
Kolda, T. G., & Bader, B. W. (2007). Tensor Decompositions and Applications. Technical Report, Sandia National Laboratories.
Kolda, T. G., & Bader, B. W. (2006). The TOPHITS Model for Higher-order Web Link Analysis. Workshop on Link Analysis, Counterterrorism and Security.
Sun, J., Tsourakakis, C. E., Hoke, E., Faloutsos, C., & Eliassi-Rad, T. (2008). Two Heads Better than One: Pattern Discovery in Time-Evolving Multi-Aspect Data. Data Mining and Knowledge Discovery, 17(1), 111-128.
Comon, P. (2002). Tensor Decompositions, State of the Art and Applications. Mathematics in Signal Processing, 1-24.
Martin, C. D. (2004). Tensor Decompositions Workshop Discussion Notes. Palo Alto, CA: American Institute of Mathematics (AIM).
Skillicorn, D. (2007). Understanding Complex Datasets: Data Mining with Matrix Decompositions. CRC Press.
Acar, E., & Yener, B. (2009). Unsupervised Multiway Data Analysis: A Literature Survey. IEEE Transactions on Knowledge and Data Engineering, 21(1), 6-20.
Acar, E., Aykut-Bingol, C., Bingol, H., Bro, R., & Yener, B. (2007). Multiway Analysis of Epilepsy Tensors. Proc. ISMB/ECCB 2007 (pp. i10-i18). Vienna, Austria.
Li, J., Zhang, L., & Zhao, Q. (2007). Pattern Classification of Visual Evoked Potentials Based on Parallel Factor Analysis. Proceedings of the International Conference on Cognitive Neurodynamics (pp. 571-575). Springer Netherlands.
Zheng, Y., Hasegawa-Johnson, M., & Pizza, S. (2003). PARAFAC Analysis of the Three Dimensional Tongue Shape. Journal of the Acoustical Society of America, 113(1), 478-486.
Martínez-Montes, E., Valdés-Sosa, P. A., Miwakeichi, F., Goldman, R. I., & Cohen, M. S. (2004). Concurrent EEG/fMRI analysis by multiway Partial Least Squares. NeuroImage, 22(3), 1023-1034.
Yu, D., Sheikholeslami, G., & Zhang, A. (2002). FindOut: Finding Outliers in Very Large Datasets. Knowledge and Information Systems, 4(4), 387-412.
Fisher, R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, 179-188.
Koren, Y. (2009). The BellKor Solution to the Netflix Grand Prize. Retrieved from Netflix Prize: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf.
Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2002). Incremental SVD-Based Algorithms for Highly Scalable Recommender Systems. Proc. 5th International Conference on Computer and Information Technology (ICCIT) (pp. 27-28).
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.

Thanks.

Questions?