Multidimensional data processing
•
Multivariate data consist of several variables
for each observation.
•
In practice, real-world data is almost always multivariate.
•
Some variables are usually left out to
simplify collection and processing.
•
Removal of variables before data analysis leads to
information loss.
•
Information that was not collected can never be recovered.
•
One of the most common tasks is clustering or
classification.
•
classification
•
target classes are known
•
properties of target classes are usually unknown
•
goal: find rules which separate observed data into
target classes
•
clustering
•
target classes are unknown
•
goal: find observations with common properties
which may (or may not) represent classes in real
world
•
difficult situation
•
we are trying to extract information from data
•
measurements, observations, surveys
•
data preparation
•
data adjustment
–
removal of invalid or
incomplete observations/measurements
•
normalization?
–
best handled when collecting
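A minimal sketch of the "removal of invalid or incomplete observations" step. The function name `drop_incomplete` and the sample rows are hypothetical, and "invalid" is assumed here to mean a missing or non-finite value; real data sets usually need domain-specific checks as well.

```python
import math

def drop_incomplete(rows):
    """Keep only observations in which every value is a finite number."""
    clean = []
    for row in rows:
        # A value counts as valid only if it is numeric and finite
        # (assumption: None marks a missing value, NaN/inf an invalid one).
        if all(isinstance(v, (int, float)) and math.isfinite(v) for v in row):
            clean.append(row)
    return clean

# Hypothetical measurements, three variables per observation.
raw = [
    (5.1, 3.5, 1.4),
    (4.9, None, 1.4),          # incomplete: one variable is missing
    (float("nan"), 3.2, 1.3),  # invalid: not a number
    (6.3, 3.3, 6.0),
]

cleaned = drop_incomplete(raw)
print(cleaned)
```

Dropping whole observations is the simplest option; it also throws away the valid values in the removed rows, which is one more reason to handle such problems when collecting.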
•
extracting information
•
we know what we are looking for
–
testing of a hypothesis
•
trying to discover something new
–
data exploration
•
preliminary analysis of the data
•
better understanding of its characteristics
•
allows us to select the right tools for preprocessing or
analysis
•
wrong tools may yield invalid information or hide
important patterns
•
also known as Exploratory Data Analysis (EDA)
•
a different approach
–
mind shift is required
•
concentrates on the larger view
•
1977+
•
aka visual data mining
Richard Wesley Hamming, Numerical Methods for Scientists and Engineers, 1962
•
steps
•
maximize insight into a data set
•
uncover underlying structure
•
extract important variables
•
detect outliers and anomalies
•
test underlying assumptions
•
develop minimalistic models
•
determine optimal property settings
•
heavily relies on graphics
•
numbers are very abstract
•
Characteristics:
•
N = 11
•
Mean of X = 9.0
•
Mean of Y = 7.5
•
Intercept = 3
•
Slope = 0.5
•
Residual standard deviation = 1.237
•
Correlation = 0.816
•
Have we realized something important?
    X      Y
10.00   8.04
 8.00   6.95
13.00   7.58
 9.00   8.81
11.00   8.33
14.00   9.96
 6.00   7.24
 4.00   4.26
12.00  10.84
 7.00   4.82
 5.00   5.68
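The characteristics listed above can be recomputed from the 11 points. The sketch below uses only the Python standard library; the helper name `summarize` is hypothetical, and the residual standard deviation is assumed to use N - 2 degrees of freedom, which reproduces the quoted 1.237. These values match the first data set of Anscombe's quartet, whose other three data sets share nearly the same summary numbers while looking completely different when plotted.

```python
import math

# (X, Y) pairs copied from the table above.
data = [
    (10.0, 8.04), (8.0, 6.95), (13.0, 7.58), (9.0, 8.81),
    (11.0, 8.33), (14.0, 9.96), (6.0, 7.24), (4.0, 4.26),
    (12.0, 10.84), (7.0, 4.82), (5.0, 5.68),
]

def summarize(points):
    """Summary statistics and least-squares line y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "residual_sd": math.sqrt(rss / (n - 2)),  # assumption: N - 2 degrees of freedom
        "correlation": sxy / math.sqrt(sxx * syy),
    }

stats = summarize(data)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```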
•
Run-sequence plot
•
similar to a line chart in Excel
•
shifts in variations
•
shifts in location
•
outliers
•
Histogram
•
center, spread, skew, multimodality
•
outliers
•
very useful
–
know how to create it!
•
nice presentations (e.g. word cloud, tag cloud)
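Since knowing how to create a histogram is stressed above, here is a minimal sketch of the counting step. It assumes equal-width bins between the minimum and the maximum, and the function name `histogram` is hypothetical. The sample values are the second column (Y) of the 11-point table earlier in the deck.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width bins spanning min(values)..max(values)."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; equal-width bins are undefined")
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return low, width, counts

y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
low, width, counts = histogram(y, bins=4)

# Text rendering: one row per bin, one '#' per observation.
for i, count in enumerate(counts):
    left = low + i * width
    print(f"{left:5.2f} .. {left + width:5.2f} | {'#' * count}")
```

The number of bins is a free choice; too few bins hide multimodality, too many turn the histogram into noise.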
•
check whether the data set is random or not
•
random data should have no observable
structure
•
lag = fixed time displacement
•
can be arbitrary
•
most common is 1
•
observe
•
weak autocorrelation
•
strong autocorrelation
•
sinusoidal model
•
outliers
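The randomness check above can be sketched numerically: pair each value with the value `lag` steps later (these pairs are what a lag plot would draw) and compute their correlation. The function names and both sample series are hypothetical; the Pearson correlation of the pairs is used here as a simple stand-in for the autocorrelation one would judge from the plot.

```python
import math
import random

def lag_pairs(series, lag=1):
    """Pairs (y[i], y[i + lag]); plotting them gives a lag plot."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    return list(zip(series[:-lag], series[lag:]))

def correlation(pairs):
    """Pearson correlation of the two coordinates of `pairs`."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    saa = sum((a - mean_a) ** 2 for a, _ in pairs)
    sbb = sum((b - mean_b) ** 2 for _, b in pairs)
    sab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    return sab / math.sqrt(saa * sbb)

# Hypothetical series: uniform noise and a sine wave with period 50.
rng = random.Random(0)
noise = [rng.random() for _ in range(200)]
wave = [math.sin(2 * math.pi * i / 50) for i in range(200)]

r_noise = correlation(lag_pairs(noise, lag=1))
r_wave = correlation(lag_pairs(wave, lag=1))
print(f"lag-1 correlation, noise: {r_noise:.3f}")
print(f"lag-1 correlation, sine:  {r_wave:.3f}")
```

For the noise the value should be close to zero (no observable structure); for the sine wave it should be close to one. A single number still cannot show the elliptical shape that reveals a sinusoidal model, or a lone outlier, which is why the plot itself is worth drawing.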
•
1 dimension
–
piece of cake (pie)
•
2 dimensions
–
still easy
–
Cartesian coordinate
system
•
3 dimensions
–
still doable in Cartesian system
•
4 and more dimensions
–
only Chuck Norris can
do that in Cartesian system
•
other types of visualization are required
•
some may be useful only for some types of data
•
understanding the data is very important
•
good visualization can help us understand the
contained information
•
results need to be presented to other people
•
sanity check, intuition
–
people capture patterns,
which are missed by automated methods
•
some options:
•
bubble chart (3-dimensional scatter plot)
•
scatter plot array
•
star plot, Radviz, Polyviz
•
parallel coordinates
•
also called a 3-dimensional scatter plot
•
2 data dimensions
–
graph X and Y
•
3rd dimension
–
point size
•
optional 4th dimension
–
point color
•
advantages
•
allows us to uncover clusters and variable
dependencies
•
easy to understand
•
disadvantages
•
different combinations need to be tried
•
extension of the common scatter plot
•
2 dimensional array of scatter plots
•
each combination of variables is drawn (twice)
•
variable descriptions on the diagonal
•
easy to create
•
messy
•
dependencies between more than two
variables are still hidden
[Figure: scatter plot array; variables: Sepal width, Petal length, Petal width, Sepal length]
•
axes radiate from central point
•
Star plot
•
values of a data point are connected to form a
polygon
•
can display only a small number of points
•
order of variables may be important
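A sketch of how the polygon of one data point could be computed for a star plot. It assumes the values are already scaled to the interval [0, 1] and that the axes are spread evenly around the centre, starting on the positive x axis; the function name `star_polygon` is hypothetical.

```python
import math

def star_polygon(values):
    """Vertices of the star-plot polygon for one data point.

    Axis i points in direction 2*pi*i/n; the vertex lies on that axis
    at a distance equal to the (already scaled) value.
    """
    n = len(values)
    vertices = []
    for i, value in enumerate(values):
        angle = 2 * math.pi * i / n
        vertices.append((value * math.cos(angle), value * math.sin(angle)))
    return vertices

# Hypothetical data point with four variables, scaled to [0, 1].
for x, y in star_polygon([1.0, 0.5, 1.0, 0.5]):
    print(f"({x:+.2f}, {y:+.2f})")
```

Swapping two variables moves their vertices to different axes and changes the polygon's shape, which is why the order of variables may matter.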
•
Radviz
•
values of a data point act as spring stiffness
•
values are normalized into the interval [0, 1]
•
object is placed in equilibrium of all forces
•
order of variables becomes very important
[Figure legend: Iris virginica, Iris versicolor, Iris setosa]
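The Radviz placement described above can be sketched as follows. It assumes the variable anchors are spread evenly on a unit circle and that every spring pulls the point toward its anchor with a force proportional to the normalized value; the equilibrium is then the value-weighted average of the anchor positions. The function names are hypothetical.

```python
import math

def min_max_normalize(rows):
    """Scale every column of `rows` into the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    # A constant column has zero range; use 1.0 to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [
        [(v - low) / span for v, low, span in zip(row, lows, spans)]
        for row in rows
    ]

def radviz_position(values):
    """Equilibrium position of one normalized data point."""
    n = len(values)
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls: leave the point in the centre
    x = sum(v * math.cos(2 * math.pi * i / n) for i, v in enumerate(values)) / total
    y = sum(v * math.sin(2 * math.pi * i / n) for i, v in enumerate(values)) / total
    return (x, y)

# The same four values in two different variable orders.
a = radviz_position([1.0, 0.0, 1.0, 0.0])  # large values on opposite anchors
b = radviz_position([1.0, 1.0, 0.0, 0.0])  # large values on neighbouring anchors
print(a, b)
```

In the first order the two strong springs cancel and the point sits in the centre; in the second it is pulled toward the edge between the two neighbouring anchors. This is the sense in which the order of variables becomes very important.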
•
similar principle to Radviz
•
data points are not attracted to a single point
•
data points are attracted to an axis
•
circle becomes polygon → Polyviz
•
order of variables is less important
•
polygon edges become very important
•
candidates for classification rules
•
different combinations of variables
•
exact position of point is displayed
–
no information loss
•
advantages
•
determine correlation between variables
•
both positive and negative
•
determine partial correlations
•
only some values of one variable are correlated with
some values of another variable
•
very important
•
disadvantages
•
dependent on variable ordering
•
not that useful without interactive software
•
may be hard to understand for newbies
•
Exploratory data analysis:
•
http://www.itl.nist.gov/div898/handbook/eda/eda.htm
•
Have a look at the graphical techniques:
•
http://www.itl.nist.gov/div898/handbook/eda/section3/eda33.htm
•
Orange Canvas
–
open-source data mining
•
http://orange.biolab.si/
•
interface similar to IBM Clementine (SPSS Modeler)
•
widget documentation: http://orange.biolab.si/doc/widgets/
•
Sample data
•
http://archive.ics.uci.edu/ml/index.html
•
http://www-958.ibm.com/software/data/cognos/manyeyes/