# Multidimensional data processing


Nov 25, 2013


Multivariate data consist of several variables for each observation.

In practice, any serious data set is multivariate.

Some variables are usually left out to simplify collection and processing.

Removing variables before data analysis leads to information loss.

Information that was not collected can never be recovered.

One of the most common tasks is clustering or classification.

classification

target classes are known

properties of target classes are usually unknown

goal: find rules which separate observed data into target classes

clustering

target classes are unknown

goal: find observations with common properties, which may (or may not) represent classes in the real world

difficult situation

we are trying to extract information from data

measurements, observations, surveys

data preparation

removal of invalid or incomplete observations/measurements

normalization?

best handled when collecting

extracting information

we know what we are looking for

testing a hypothesis

trying to discover something new

data exploration

preliminary analysis of the data

better understanding of its characteristics

allows us to select the right tools for preprocessing or analysis

wrong tools may yield invalid information or hide important patterns

also known as Exploratory Data Analysis (EDA)

a different approach

mind shift is required

concentrates on the larger view

1977+

aka visual data mining

Richard Wesley Hamming, Numerical Methods for Scientists and Engineers, 1962

steps

maximize insight into a data set

uncover underlying structure

extract important variables

detect outliers and anomalies

test underlying assumptions

develop minimalistic models

determine optimal property settings

heavily relies on graphics

numbers are very abstract

Characteristics:

N = 11

Mean of X = 9.0

Mean of Y = 7.5

Intercept = 3

Slope = 0.5

Residual standard deviation = 1.237

Correlation = 0.816

Have we realized something important?

| X | Y |
|-------|-------|
| 10.00 | 8.04 |
| 8.00 | 6.95 |
| 13.00 | 7.58 |
| 9.00 | 8.81 |
| 11.00 | 8.33 |
| 14.00 | 9.96 |
| 6.00 | 7.24 |
| 4.00 | 4.26 |
| 12.00 | 10.84 |
| 7.00 | 4.82 |
| 5.00 | 5.68 |
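The characteristics listed before the data can be reproduced from the eleven (X, Y) pairs. The sketch below is one way to do it with the Python standard library. It assumes an ordinary least-squares line and a residual standard deviation with n - 2 degrees of freedom; the slides do not say how the numbers were obtained, and the function name `summarize` is made up for this example. The values appear to match the first data set of Anscombe's quartet, which is a reminder that such summary numbers say very little about the shape of the data.

```python
import math

# (x, y) pairs copied from the data listed above
data = [
    (10.00, 8.04), (8.00, 6.95), (13.00, 7.58), (9.00, 8.81),
    (11.00, 8.33), (14.00, 9.96), (6.00, 7.24), (4.00, 4.26),
    (12.00, 10.84), (7.00, 4.82), (5.00, 5.68),
]


def summarize(pairs):
    """Summary statistics of a simple least-squares fit y = intercept + slope * x."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # residual sum of squares around the fitted line
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        # assumption: n - 2 degrees of freedom (two fitted parameters)
        "residual_sd": math.sqrt(rss / (n - 2)),
        "correlation": sxy / math.sqrt(sxx * syy),
    }


summary = summarize(data)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```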

Run-sequence plot

similar to a line chart in Excel

shifts in variation

shifts in location

outliers

Histogram

outliers

very useful

know how to create it!

nice presentations (e.g. word cloud, tag cloud)
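Creating a histogram by hand is mostly a matter of binning. The following is a minimal sketch, assuming equal-width bins between the minimum and maximum value; the function name `histogram` and the choice of four bins are illustrative only, and the Y values are the ones from the data set shown earlier.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all values identical: a single bin holds everything
        return [lo, hi], [len(values)]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # the maximum value belongs to the last bin, not to a bin past the end
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


y_values = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
edges, counts = histogram(y_values, bins=4)

# crude text rendering: one '#' per observation in the bin
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f} | {'#' * count}")
```

The number of bins changes the picture considerably, so it is worth trying several values before drawing conclusions.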

Lag plot

check whether the data set is random or not

random data should have no observable structure

lag = fixed time displacement

can be arbitrary

most common is 1

observe

weak autocorrelation

strong autocorrelation

sinusoidal model

outliers
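What a lag plot draws, and how strongly a series depends on its own past, can be sketched as follows. This is an illustration under assumptions: the helper names `lag_pairs` and `autocorrelation` are invented here, the coefficient is the usual sample autocorrelation, and the two series (seeded pseudo-random noise and a sine wave) are synthetic examples, not data from the slides.

```python
import math
import random


def lag_pairs(series, lag=1):
    """Return the (y[i], y[i + lag]) points that a lag plot would draw."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    return list(zip(series[:-lag], series[lag:]))


def autocorrelation(series, lag=1):
    """Sample autocorrelation coefficient of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return num / denom


rng = random.Random(0)  # fixed seed so the example is repeatable
noise = [rng.random() for _ in range(500)]
wave = [math.sin(2 * math.pi * i / 50) for i in range(200)]

print("noise, lag 1:", round(autocorrelation(noise), 3))
print("wave,  lag 1:", round(autocorrelation(wave), 3))
```

For the noise the coefficient should be close to zero and the lag plot a structureless cloud; for the sine wave the coefficient is close to one and the lag plot traces a tight ellipse.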

1 dimension

piece of cake (pie)

2 dimensions

still easy

Cartesian coordinate system

3 dimensions

still doable in a Cartesian system

4 and more dimensions

only Chuck Norris can do that in a Cartesian system

other types of visualization are required

some may be useful only for some types of data

understanding the data is very important

good visualization can help us understand the information contained in the data

results need to be presented to other people

sanity check, intuition

people catch patterns that automated methods miss

some options:

bubble chart (3-dimensional scatter plot)

scatter plot array

star plot

parallel coordinates

Bubble chart

also called a 3-dimensional scatter plot

2 data dimensions

graph X and Y

3rd dimension

point size

optional 4th dimension

point color

allows us to uncover clusters and variable dependencies

easy to understand

different combinations need to be tried

Scatter plot array

extension of the common scatter plot

2 dimensional array of scatter plots

each combination of variables is drawn (twice)

diagonal descriptions

easy to create

messy

dependencies between more than two variables are still hidden

[Figure: scatter plot array with sepal length, sepal width, petal length and petal width on the diagonal]

Star plot

values of a data point are connected to form a polygon

can display only a small number of points

order of variables may be important
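The polygon of a star plot can be computed directly. The sketch below assumes the values are already scaled to [0, 1] and that variable i sits on a spoke at angle 2πi/n; both the function names and the sample observation are made up for illustration. The area calculation is only there to show that reordering the variables changes the shape.

```python
import math


def star_polygon(values):
    """Vertices of the star-plot polygon for one observation.

    Variable i is drawn on a spoke at angle 2*pi*i/n, at a distance
    from the center equal to its (already normalized) value.
    """
    n = len(values)
    return [
        (v * math.cos(2 * math.pi * i / n), v * math.sin(2 * math.pi * i / n))
        for i, v in enumerate(values)
    ]


def polygon_area(vertices):
    """Area of a polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# hypothetical observation with four normalized variables
observation = [1.0, 0.5, 0.25, 0.75]
reordered = [1.0, 0.25, 0.5, 0.75]  # same values, second and third variable swapped

print(polygon_area(star_polygon(observation)))
print(polygon_area(star_polygon(reordered)))
```

The same observation produces polygons with different areas and outlines, which is why the variable order should be kept fixed when stars are compared.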

RadViz

values of a data point act as spring stiffness

values normalized into the interval [0, 1]

object is placed in equilibrium of all forces

order of variables becomes very important
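The spring description above matches what is usually called RadViz, and the equilibrium can be written down explicitly. A minimal sketch, assuming one anchor per variable placed evenly on the unit circle and min-max normalization per variable: the forces balance where the point is the stiffness-weighted mean of the anchors. The function names and the sample values are hypothetical.

```python
import math


def normalize_columns(rows):
    """Min-max scale every variable (column) into [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def radviz_position(values):
    """Equilibrium point of springs with stiffness values[i] tied to anchor i.

    Anchor i lies on the unit circle at angle 2*pi*i/n. The forces cancel at
    the stiffness-weighted mean of the anchor positions.
    """
    n = len(values)
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # convention chosen here: no pull at all means the center
    x = sum(v * math.cos(2 * math.pi * i / n) for i, v in enumerate(values)) / total
    y = sum(v * math.sin(2 * math.pi * i / n) for i, v in enumerate(values)) / total
    return (x, y)


# same two large values, assigned to neighboring vs. opposite anchors
print(radviz_position([1.0, 1.0, 0.0, 0.0]))  # pulled toward the rim between two anchors
print(radviz_position([1.0, 0.0, 1.0, 0.0]))  # opposite pulls cancel near the center
```

The two calls use the same values in a different variable order and land in very different places, which illustrates why the order of variables matters so much here.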

[Figure legend: Iris-virginica, Iris-versicolor, Iris-setosa]

data points are not attracted to a single point

data points are attracted to an axis

circle becomes polygon → Polyviz

order of variables is less important

polygon edges become very important

candidates for classification rules

different combinations of variables

Parallel coordinates

exact position of each point is displayed

no information loss

determine correlation between variables

both positive and negative

determine partial correlations

only some values of one variable are correlated with some values of another variable

very important

dependent on variable ordering

not that useful without interactive software

may be hard to understand for newbies
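One way to see why parallel coordinates reveal the sign of a correlation: between two adjacent axes, positively correlated variables give roughly parallel line segments, while negatively correlated variables give segments that cross. The sketch below counts those crossings for one pair of neighboring axes. The function name and the sample values are made up for illustration, and the count only covers axes that are drawn next to each other, which is also why the variable ordering matters.

```python
from itertools import combinations


def count_crossings(left, right):
    """Number of pairs of observations whose segments cross between two adjacent axes.

    Two segments cross exactly when the observations are ordered one way on
    the left axis and the opposite way on the right axis.
    """
    return sum(
        1
        for (a1, b1), (a2, b2) in combinations(zip(left, right), 2)
        if (a1 - a2) * (b1 - b2) < 0
    )


# hypothetical values of three variables for four observations
x = [1, 2, 3, 4]
rises_with_x = [2, 4, 6, 8]
falls_with_x = [8, 6, 4, 2]

print(count_crossings(x, rises_with_x))  # no crossings: lines run in parallel
print(count_crossings(x, falls_with_x))  # every pair crosses: lines form an X pattern
```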

Exploratory data analysis:

http://www.itl.nist.gov/div898/handbook/eda/eda.htm

Have a look at the graphical techniques:

http://www.itl.nist.gov/div898/handbook/eda/section3/eda33.htm

Orange Canvas

open-source data mining

http://orange.biolab.si/

interface similar to IBM Clementine (SPSS Modeler)

widget documentation: http://orange.biolab.si/doc/widgets/

Sample data

http://archive.ics.uci.edu/ml/index.html

http://www-958.ibm.com/software/data/cognos/manyeyes/