OPTIMCLASS: Simultaneous identification of optimal clustering method and optimal number of clusters in vegetation classification studies


Tichý Lubomír (1), Chytrý Milan (1), Botta-Dukát Zoltán (2), Hájek Michal (1), Talbot Stephen S. (3)

(1) Masaryk University, Brno, Czech Republic
(2) Hungarian Academy of Sciences, Vácrátót, Hungary
(3) U.S. Fish and Wildlife Service, Anchorage, USA


Why do we need a method for identification of optimal
clustering algorithm and optimal number of clusters?


The same dataset

- A huge variety of clustering methods produce reasonable results.
- Subjective selection of the clustering method and number of clusters is usually based on empirical experience.




Why do we need a method for identification of optimal
clustering algorithm and optimal number of clusters?


Methods published:

Most algorithms identify the optimal partition mathematically, without
considering ecological interpretation


The Method


A posteriori description of phytosociological tables is based on diagnostic species.

Diagnostic species describe a cluster. Therefore, the number of diagnostic species determines whether the classified table can be sufficiently interpreted.

Species 1    98788  12112  3.211
Species 2    51123  1223.  11132
Species 3    23132  .....  .....
Species 4    ..2.4  112..  1..5.
Species 5    .....  .1.1.  1.213


The Method


The same dataset:

The Method



- Measure of the classification quality: the total sum of diagnostic species.
- Fisher's exact test calculates the probability of the observed occurrence of a species across clusters under a right-tailed test hypothesis.
- The measure reduces the importance of very small clusters.
- Easy interpretation: the more diagnostic species in the dataset, the better the description of the clusters.
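The measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the table is reduced to presence/absence, each group of five columns in the toy table is assumed to be one cluster, and "total sum of diagnostic species" is read as the number of species-cluster pairs whose right-tailed Fisher probability falls below a chosen threshold.

```python
from math import comb


def fisher_right_tail(n_plots, n_cluster, n_species, n_joint):
    """Right-tailed Fisher's exact test for one species and one cluster.

    Returns P(X >= n_joint), where X is how many of the species'
    n_species occurrences fall into a cluster of n_cluster plots
    out of n_plots plots (hypergeometric distribution).
    """
    upper = min(n_cluster, n_species)
    favourable = sum(
        comb(n_species, k) * comb(n_plots - n_species, n_cluster - k)
        for k in range(n_joint, upper + 1)
    )
    return favourable / comb(n_plots, n_cluster)


def count_diagnostic_species(presence, labels, threshold):
    """Count species-cluster pairs with right-tail probability below threshold."""
    clusters = sorted(set(labels))
    total = 0
    for row in presence.values():
        n_species = sum(row)
        for cluster in clusters:
            n_cluster = labels.count(cluster)
            n_joint = sum(
                1 for present, lab in zip(row, labels) if present and lab == cluster
            )
            p = fisher_right_tail(len(labels), n_cluster, n_species, n_joint)
            if p < threshold:
                total += 1
    return total


# Toy table from the slides: "." is an absence, any digit a presence.
# Assumption: each group of five columns is one cluster.
TABLE = {
    "Species 1": "98788 12112 3.211",
    "Species 2": "51123 1223. 11132",
    "Species 3": "23132 ..... .....",
    "Species 4": "..2.4 112.. 1..5.",
    "Species 5": "..... .1.1. 1.213",
}
presence = {
    name: [ch != "." for ch in row.replace(" ", "")] for name, row in TABLE.items()
}
labels = [1] * 5 + [2] * 5 + [3] * 5

# Species 3 occurs in all 5 plots of cluster 1 and nowhere else: 1/3003.
print(fisher_right_tail(15, 5, 5, 5))
# Species 3 (cluster 1) and Species 5 (cluster 3) pass at 0.05: prints 2.
print(count_diagnostic_species(presence, labels, threshold=0.05))
```

The loose threshold of 0.05 is used only because the toy table has 15 plots; the probabilities shown in the results slides (10^-3, 10^-6, 10^-9) are far stricter and would be more appropriate for datasets of several hundred plots.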



The Method


Test on three different datasets

- Southern Siberia, Sayan Mountains (310 plots; forest, steppe and tundra vegetation)
- Central Europe, Carpathians (241 plots; mire vegetation)
- Alaska, Kenai Peninsula (171 plots; wetlands)

The Method


Classifications tested

- Flexible beta clustering, Ward's clustering, UPGMA (PC-ORD)
  - Cover transformations (percentages, log percentages, Braun-Blanquet, presence/absence)
  - Distance measures (Bray-Curtis, Manhattan, Euclidean)
- Ordinal cluster analysis (SYN-TAX)
  - Distance measures (Kruskal-Wallis, Kendall, Gower-Podani coefficient)
- Modified TWINSPAN classification (JUICE)
  - The sequence of splits in divisive classification is determined by the internal heterogeneity of clusters. Therefore, any number of clusters is possible.
  - Three modifications of pseudospecies cut levels
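For readers unfamiliar with the three distance measures named for the PC-ORD analyses, the following sketch shows their standard textbook definitions applied to two plots. The cover values and function names are hypothetical, and the code is not taken from PC-ORD.

```python
from math import sqrt


def manhattan(x, y):
    """Sum of absolute differences in cover between two plots."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the sum of squared differences in cover."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def bray_curtis(x, y):
    """Manhattan distance divided by the total cover of both plots."""
    total = sum(a + b for a, b in zip(x, y))
    return manhattan(x, y) / total if total else 0.0


# Hypothetical percentage covers of four species in two plots.
plot_a = [50, 10, 0, 5]
plot_b = [20, 10, 30, 0]

print(manhattan(plot_a, plot_b))    # 65
print(euclidean(plot_a, plot_b))    # sqrt(1825), about 42.7
print(bray_curtis(plot_a, plot_b))  # 65 / 125 = 0.52

# Presence/absence transformation of the same plots.
pa_a = [1 if v > 0 else 0 for v in plot_a]
pa_b = [1 if v > 0 else 0 for v in plot_b]
print(manhattan(pa_a, pa_b))        # 2
```

Squaring the differences makes Euclidean distance more sensitive than the other two to a single species with a very large cover difference, which is one reason the choice of cover transformation and distance measure interact.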


Results


Sayan Mountains, Siberia

(310 plots, 1036 species)


[Figure: three panels labelled Probability = 10^-3, Probability = 10^-6 and Probability = 10^-9. Each plots the number of diagnostic species (y-axis; up to 900, 300 and 150 respectively) against the number of clusters (x-axis, 0 to 50).]

Results


Sayan Mountains, Siberia

(310 plots, 1036 species)

[Figure: Untransformed cover data. Number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Results


Sayan Mountains, Siberia

(310 plots, 1036 species)

[Figure: Euclidean distance measure. Number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Results


Sayan Mountains, Siberia

(310 plots, 1036 species)

[Figure: Manhattan distance measure. Number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Results


Sayan Mountains, Siberia

(310 plots, 1036 species)

[Figure: Bray-Curtis distance measure. Number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Results


Sayan Mountains, Siberia

(310 plots, 1036 species)

[Figure: UPGMA. Number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Results


Sayan Mountains, Siberia

(310 plots, 1036 species)

[Figure: Ward's method. Number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Results


Sayan Mountains, Siberia

(310 plots, 1036 species)

[Figure: Flexible beta -0.25. Number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Results


Sayan Mountains, Siberia

(310 plots, 1036 species)

[Figure: Ordinal cluster analyses (SYN-TAX). Number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Results


Sayan Mountains, Siberia

(310 plots, 1036 species)

[Figure: Modified TWINSPAN. Number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

The Method


Test on three different datasets

- Southern Siberia, Sayan Mountains (310 plots; forest, steppe and tundra vegetation)
- Central Europe, Carpathians (241 plots; mire vegetation)
- Alaska, Kenai Peninsula (171 plots; wetlands)

The three datasets gave similar results.

Conclusions

- Classifications based on transformed cover values give better results than those based on percentage covers.
- Euclidean distance gives slightly poorer results than Manhattan or Bray-Curtis distances.
- The UPGMA clustering method gives poorer results than Ward's and flexible beta methods.
- There is no significant difference between the ordinal cluster analysis proposed by Podani (SYN-TAX 2000) and the other clustering methods.
- Modified TWINSPAN performs well with small numbers of clusters.



[Figure: Modified TWINSPAN classification. Number of diagnostic species occurrences (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

[Figure: Modified TWINSPAN classification. Sum of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

[Figure: Modified TWINSPAN classification. Number of clusters with more than 4 diagnostic species (y-axis, 0 to 20) against number of clusters (x-axis, 0 to 50).]
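The last axis label, the number of clusters with more than 4 diagnostic species, suggests a second way of scoring a partition. The sketch below shows one plausible way to compute it from a list of (species, cluster) pairs already judged diagnostic. The function name and the example pairs are hypothetical and not data from the study.

```python
from collections import Counter


def clusters_with_enough_diagnostic_species(diagnostic_pairs, min_species=4):
    """Count clusters that have more than min_species diagnostic species.

    diagnostic_pairs is an iterable of (species, cluster) tuples, one per
    species that was found diagnostic for that cluster.
    """
    per_cluster = Counter(cluster for _species, cluster in diagnostic_pairs)
    return sum(1 for n in per_cluster.values() if n > min_species)


# Hypothetical result: cluster "A" has 5 diagnostic species,
# "B" has 4 and "C" has 6, so only "A" and "C" exceed the limit of 4.
pairs = (
    [(f"sp{i}", "A") for i in range(5)]
    + [(f"sp{i}", "B") for i in range(5, 9)]
    + [(f"sp{i}", "C") for i in range(9, 15)]
)
print(clusters_with_enough_diagnostic_species(pairs))  # 2
```

Unlike the total sum of diagnostic species, this count ignores how many species a well-described cluster has beyond the limit and simply asks how many clusters are adequately described.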