Multidimensional data processing


Nov 25, 2013




Multivariate data consist of several variables
for each observation.


In practice, almost all real-world data is multivariate.


Some variables are usually left out to
simplify collection and processing.


Removal of variables before data analysis leads to
information loss.


Information that was never collected cannot be recovered later.


One of the most common tasks is clustering or
classification.


classification


target classes are known


properties of target classes are usually unknown


goal: find rules which separate observed data into
target classes


clustering


target classes are unknown


goal: find observations with common properties
which may (or may not) represent classes in the real
world


difficult situation


we are trying to extract information from data


measurements, observations, surveys


data preparation


data adjustment


removal of invalid or incomplete observations/measurements


normalization?


best handled when collecting


extracting information


we know what we are looking for


testing a hypothesis


trying to discover something new


data exploration


preliminary analysis of the data


better understanding of its characteristics


allows us to select the right tools for preprocessing or
analysis


wrong tools may yield invalid information or hide
important patterns


also known as Exploratory Data Analysis (EDA)


a different approach


mind shift is required


concentrates on the larger view


1977+


aka visual data mining

Richard Wesley Hamming, Numerical Methods for Scientists and Engineers, 1962


steps


maximize insight into a data set


uncover underlying structure


extract important variables


detect outliers and anomalies


test underlying assumptions


develop minimalistic models


determine optimal property settings


heavily relies on graphics


numbers are very abstract



Characteristics:


N = 11


Mean of X = 9.0


Mean of Y = 7.5


Intercept = 3


Slope = 0.5


Residual standard deviation = 1.237


Correlation = 0.816



Have we realized something important?

    X       Y
10.00    8.04
 8.00    6.95
13.00    7.58
 9.00    8.81
11.00    8.33
14.00    9.96
 6.00    7.24
 4.00    4.26
12.00   10.84
 7.00    4.82
 5.00    5.68
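
The summary statistics listed above can be checked directly against the eleven pairs. The following is a minimal sketch in plain Python. It assumes the first column is X and the second is Y, and that "residual standard deviation" means the residual standard error of a least-squares line with N - 2 degrees of freedom. The function name `summarize` is made up for this illustration. The eleven pairs appear to be the first data set of Anscombe's quartet.

```python
import math

# (X, Y) pairs copied from the slide
data = [
    (10.0, 8.04), (8.0, 6.95), (13.0, 7.58), (9.0, 8.81),
    (11.0, 8.33), (14.0, 9.96), (6.0, 7.24), (4.0, 4.26),
    (12.0, 10.84), (7.0, 4.82), (5.0, 5.68),
]

def summarize(pairs):
    """Return the summary statistics of a simple least-squares line fit."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # sum of squared residuals around the fitted line
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    return {
        "n": n, "mean_x": mean_x, "mean_y": mean_y,
        "intercept": intercept, "slope": slope,
        "residual_sd": math.sqrt(sse / (n - 2)),
        "correlation": sxy / math.sqrt(sxx * syy),
    }

stats = summarize(data)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The numbers alone say nothing about the shape of the data. Very different point clouds can produce the same seven values, which is why the rest of the deck relies on plots.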


Run-sequence plot


similar to a line chart in Excel


shifts in variation


shifts in location


outliers


Histogram


center, spread, skew, multimodality


outliers


very useful


know how to create it!


nice presentations (e.g. word cloud, tag cloud)
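
Since the slide stresses knowing how to create a histogram, here is a minimal sketch of equal-width binning in plain Python. The helper name `histogram`, the bin count, and the sample values are assumptions chosen for illustration; they are not part of the original slides.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # the maximum value goes into the last bin instead of opening a new one
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges

# hypothetical sample with one obvious outlier (30)
sample = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]
counts, edges = histogram(sample, bins=7)
for count, left, right in zip(counts, edges, edges[1:]):
    print(f"{left:5.1f} .. {right:5.1f} | {'#' * count}")
```

In the printed bars the outlier sits alone in the last bin, separated from the bulk of the data by empty bins. The choice of bin count changes how much of the center, spread, and skew is visible, so it is worth trying a few values.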


Lag plot


check whether the data set is random or not


random data should have no observable
structure


lag = fixed time displacement


can be arbitrary


most common is 1


observe


weak autocorrelation


strong autocorrelation


sinusoidal model


outliers
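
The check for randomness described above can also be summarized as a number: the autocorrelation at a given lag. The following is a minimal sketch, assuming the usual sample autocorrelation coefficient. The function names `autocorrelation` and `lag_pairs` and the two example series are made up for this illustration.

```python
import math

def autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at the given lag (lag >= 1)."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series)
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance

def lag_pairs(series, lag=1):
    """Points (y[i], y[i + lag]) that a lag plot would draw."""
    return list(zip(series, series[lag:]))

# hypothetical series: a slow sine wave and a strictly alternating signal
sine = [math.sin(2 * math.pi * i / 50) for i in range(200)]
alternating = [1.0 if i % 2 == 0 else -1.0 for i in range(200)]

print(autocorrelation(sine, lag=1))         # close to +1: strong autocorrelation
print(autocorrelation(alternating, lag=1))  # close to -1: strong negative autocorrelation
```

For random data the coefficient should stay near zero at every lag, and the points returned by `lag_pairs` should form a structureless cloud.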



1 dimension


piece of cake (pie)


2 dimensions


still easy


Cartesian coordinate
system


3 dimensions


still doable in a Cartesian system


4 and more dimensions


only Chuck Norris can do that in a Cartesian system


other types of visualization are required


some may be useful only for some types of data




understanding the data is very important


good visualization can help us understand the
contained information


results need to be presented to other people


sanity check, intuition


people notice patterns
that automated methods miss


some options:


bubble chart (3-dimensional scatter plot)


scatter plot array


star plot, Radviz, Polyviz


parallel coordinates


Bubble chart


also called: 3-dimensional scatter plot


2 data dimensions


graph X and Y


3rd dimension


point size


optional 4th dimension


point color


advantages


helps uncover clusters and dependencies
between variables


easy to understand


disadvantages


different combinations need to be tried



Scatter plot array


extension of the common scatter plot


2-dimensional array of scatter plots


each combination of variables is drawn (twice)


variable names are shown on the diagonal


easy to create


messy


dependencies between more than two
variables are still hidden
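
To make the "each combination of variables is drawn (twice)" point concrete, the sketch below enumerates the cells of a scatter plot array. It assumes the four Iris variable names as labels. The function `panel_grid` is hypothetical and only lists what each cell would show; it does not draw anything.

```python
def panel_grid(variables):
    """Return a 2-D list describing each cell of a scatter plot array."""
    grid = []
    for row_var in variables:
        row = []
        for col_var in variables:
            if row_var == col_var:
                row.append(("label", row_var))             # diagonal: variable name
            else:
                row.append(("scatter", col_var, row_var))  # x = column, y = row
        grid.append(row)
    return grid

variables = ["Sepal length", "Sepal width", "Petal length", "Petal width"]
grid = panel_grid(variables)
scatter_panels = [cell for row in grid for cell in row if cell[0] == "scatter"]
print(len(scatter_panels))  # 12 panels: 6 unordered pairs, each drawn twice
```

The number of panels grows with the square of the number of variables, which is one reason the result gets messy quickly. Each panel still shows only two variables at a time.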


[Figure: Iris data; axis labels: Sepal width, Petal length, Petal width, Sepal length]


axes radiate from a central point


Star plot


values of a data point are connected to form a
polygon


can display only a small number of points


order of variables may be important


Radviz


values of a data point act as spring stiffness


values are normalized into the interval [0, 1]


each data point is placed where all spring forces are in equilibrium


order of variables becomes very important
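
The spring analogy can be written down directly. The following is a minimal sketch, assuming the anchors (one per variable) are spaced evenly on a unit circle and that each value has already been normalized into the interval from 0 to 1. `radviz_position` is a hypothetical helper name. With spring stiffness k_j and anchor S_j, the forces cancel at p = sum(k_j * S_j) / sum(k_j).

```python
import math

def radviz_position(values):
    """Equilibrium position of one data point pulled by springs toward
    anchors on the unit circle; values[j] is the stiffness of spring j."""
    m = len(values)
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls: leave the point at the centre
    x = sum(v * math.cos(2 * math.pi * j / m) for j, v in enumerate(values)) / total
    y = sum(v * math.sin(2 * math.pi * j / m) for j, v in enumerate(values)) / total
    return (x, y)

print(radviz_position([0.5, 0.5, 0.5, 0.5]))  # equal pulls cancel out near (0, 0)
print(radviz_position([1.0, 0.0, 0.0, 0.0]))  # only the first spring pulls: (1.0, 0.0)
print(radviz_position([1.0, 0.0, 1.0, 0.0]))  # opposite anchors cancel out as well
```

The first and third calls land on almost the same spot although the data points differ, so some information is lost in the projection. Reordering the variables changes which anchors oppose each other, which is one way to see why the order matters so much.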



[Figure legend: Iris-virginica, Iris-versicolor, Iris-setosa]


similar principle to Radviz


data points are not attracted to a single point


data points are attracted to an axis


circle becomes polygon → Polyviz


order of variables is less important


polygon edges become very important


candidates for classification rules


different combinations of variables


exact position of point is displayed


no information loss





Parallel coordinates


advantages


determine correlation between variables


both positive and negative


determine partial correlations


only some values of one variable are correlated with
some values of another variable


very important


disadvantages


dependent on variable ordering


not that useful without interactive software


may be hard to understand for newbies
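
Assuming the advantages and disadvantages above refer to parallel coordinates (the last option in the earlier list), the sketch below shows how such a plot is built and how correlation shows up in it. Every variable gets its own vertical axis scaled to the range 0 to 1, and every observation becomes a polyline across the axes. Between two neighbouring axes, positively correlated variables give roughly parallel segments, while negatively correlated variables give crossing segments. The helper names `to_polylines` and `crossings` and the data are hypothetical.

```python
def to_polylines(rows):
    """Scale every column to [0, 1] and return one polyline per row as a
    list of (axis_index, height) vertices."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # guard constant columns
    return [
        [(axis, (value - lows[axis]) / spans[axis]) for axis, value in enumerate(row)]
        for row in rows
    ]

def crossings(polylines, axis):
    """Count pairs of polylines that cross between `axis` and `axis + 1`."""
    count = 0
    for i in range(len(polylines)):
        for j in range(i + 1, len(polylines)):
            left = polylines[i][axis][1] - polylines[j][axis][1]
            right = polylines[i][axis + 1][1] - polylines[j][axis + 1][1]
            if left * right < 0:  # order flips between the two axes
                count += 1
    return count

# hypothetical data: column 1 rises with column 0, column 2 falls as column 1 rises
rows = [(1, 10, 40), (2, 20, 30), (3, 30, 20), (4, 40, 10)]
lines = to_polylines(rows)
print(crossings(lines, 0))  # 0: positive correlation, segments do not cross
print(crossings(lines, 1))  # 6: negative correlation, every pair crosses
```

Only neighbouring axes can be compared this way, which is why the result depends on the variable ordering and why interactive reordering helps.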




Exploratory data analysis:


http://www.itl.nist.gov/div898/handbook/eda/eda.htm


Have a look at the graphical techniques:


http://www.itl.nist.gov/div898/handbook/eda/section3/eda33.htm


Orange Canvas


open-source data mining


http://orange.biolab.si/


interface similar to IBM Clementine (SPSS Modeler)


widget documentation: http://orange.biolab.si/doc/widgets/


Sample data


http://archive.ics.uci.edu/ml/index.html


http://www-958.ibm.com/software/data/cognos/manyeyes/