StreamKM++: A Clustering Algorithm for Data Streams∗ - SIAM


      

  

    k    
        
          
       k 
           
        
   k       
         
        
          
           
           
         
    
       
     
         
       
           
          
       
         
           
         
          
        
        
           
 
  
          
          
         
         
         
          
        
          
          
      
      

     
      
y
       
   
z
      
 
          
         
      
       
       k
         
        
          
         
          
        
         
         
         
        
       k
         
 k     
         
  k     
        
     
         
           k
   
       
        
        
       
        
       
      
     
        
        
         
         
         
        
  k  k  
            
       
Copyright © by SIAM.
Unauthorized reproduction of this article is prohibited.
           
         
           
 k  
       
  k     
       
        
        
         
2
i
m       i    
m       
           
     
         
        
      
      
          
   k    
      
          
       
       
      
       
          
    k  
         
        
        
         
     
         
       
      
          
       
    k = 100    
        
     k 
       
        k
       
    k = 50   
    k       
40    k
        
        
      k
         
 
 
Let $\|\cdot\|$ denote the $\ell_2$-norm on $\mathbb{R}^d$. By $d(x,y) = \|x - y\|$ we denote the Euclidean distance and by $d^2(x,y) = \|x - y\|^2$ the squared Euclidean distance of $x, y \in \mathbb{R}^d$. We use $d(x,C) = \min_{c \in C} d(x,c)$, $d^2(x,C) = \min_{c \in C} d^2(x,c)$, and $\mathrm{cost}(P,C) = \sum_{x \in P} d^2(x,C)$ for $C, P \subset \mathbb{R}^d$. Analogously, for a weighted subset $S \subset \mathbb{R}^d$ with weight function $w : S \to \mathbb{R}_{\ge 0}$ we use $\mathrm{cost}_w(S,C) = \sum_{y \in S} w(y)\, d^2(y,C)$.

The Euclidean $k$-means problem is defined as follows: given an input set $P \subset \mathbb{R}^d$ with $|P| = n$ and $k \in \mathbb{N}$, find a set $C \subset \mathbb{R}^d$ with $|C| = k$ that minimizes $\mathrm{cost}(P,C)$. By
$$\mathrm{opt}_k(P) = \min_{C' \subset \mathbb{R}^d :\, |C'| = k} \mathrm{cost}(P,C')$$
we denote the cost of an optimal Euclidean $k$-means clustering of $P$.
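As a concrete reading of these definitions, the following minimal Python sketch evaluates cost(P, C) and cost_w(S, C) on small point sets. The function names (`sq_dist`, `cost`, `weighted_cost`) and the tuple-based point representation are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence

Point = Sequence[float]


def sq_dist(x: Point, y: Point) -> float:
    """Squared Euclidean distance d^2(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sq_dist_to_set(x: Point, centers: Sequence[Point]) -> float:
    """d^2(x, C): squared distance from x to its nearest center."""
    return min(sq_dist(x, c) for c in centers)


def cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """cost(P, C): sum of d^2(x, C) over all x in P."""
    return sum(sq_dist_to_set(x, centers) for x in points)


def weighted_cost(points: Sequence[Point], weights: Sequence[float],
                  centers: Sequence[Point]) -> float:
    """cost_w(S, C): sum of w(y) * d^2(y, C) over all y in S."""
    return sum(w * sq_dist_to_set(y, centers)
               for y, w in zip(points, weights))
```

For instance, with P = {(0,0), (2,0), (10,0)} and C = {(1,0), (10,0)}, the two left points each contribute 1 and the right point contributes 0.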
          
         P   
        k  
         
        
P         
         
        
       
   P      
 
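Definition. Let $k \in \mathbb{N}$ and $\varepsilon \le 1$. A weighted multiset $S \subset \mathbb{R}^d$ with positive weight function $w : S \to \mathbb{R}_{\ge 0}$ and $\sum_{y \in S} w(y) = |P|$ is called a $(k,\varepsilon)$-coreset of $P$ iff for each $C \subset \mathbb{R}^d$ of size $|C| = k$ we have
$$(1-\varepsilon)\,\mathrm{cost}(P,C) \;\le\; \mathrm{cost}_w(S,C) \;\le\; (1+\varepsilon)\,\mathrm{cost}(P,C).$$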
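The coreset inequality can be spot-checked for one particular center set C, as in the hedged sketch below. It checks a single C only, whereas the definition quantifies over every C of size k, so a passing check is a necessary condition and not a proof. All names here (`satisfies_coreset_bound`, `relative_cost_error`) are hypothetical helpers introduced for illustration.

```python
def sq_dist(x, y):
    """Squared Euclidean distance d^2(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cost(points, centers, weights=None):
    """cost(P, C), or cost_w(S, C) when weights are supplied."""
    if weights is None:
        weights = [1] * len(points)
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def satisfies_coreset_bound(P, S, w, C, eps):
    """Check (1 - eps) cost(P,C) <= cost_w(S,C) <= (1 + eps) cost(P,C)
    for this one center set C."""
    cp = cost(P, C)
    cs = cost(S, C, w)
    return (1 - eps) * cp <= cs <= (1 + eps) * cp


def relative_cost_error(P, S, w, C):
    """|cost(P,C) - cost_w(S,C)| / cost(P,C); assumes cost(P,C) > 0."""
    cp = cost(P, C)
    return abs(cp - cost(S, C, w)) / cp
```

With P = {0, 2, 10} on the real line, S = {1, 10} with weights 2 and 1, and C = {0, 10}, cost(P, C) = 4 while cost_w(S, C) = 2, so this S passes the check for ε = 0.5 but not for ε = 0.4.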
      
           
            
        
          
          

  
          k
        
      k 
       
          
        
      
Let $P \subset \mathbb{R}^d$ with $|P| = n$. The $k$-means++ seeding is an iterative process:

1. Choose an initial point $q_1 \in P$ uniformly at random.
2. Let $S$ be the set of points already chosen from $P$. Each element $p \in P$ is chosen as the next element of $S$ with probability $d^2(p,S) / \mathrm{cost}(P,S)$.
3. Repeat step 2 until $S$ contains the desired number of points.

We say that $S$ is chosen at random according to $d^2$. For an arbitrary fixed integer $m$, the coreset construction is as follows. First, choose a set $S = \{q_1, q_2, \ldots, q_m\}$ of size $m$ at random according to $d^2$. Let $Q_i$ denote the set of points from $P$ that are closest to $q_i$ (breaking ties arbitrarily). Using the weight function $w : S \to \mathbb{R}_{\ge 0}$ with $w(q_i) = |Q_i|$, we obtain the weighted set $S$ as the coreset.
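The construction can be sketched in a few lines of Python. This is a direct, non-streaming illustration that recomputes nearest-sample distances after every draw; the function name `seeding_coreset`, the `seed` parameter, and the list-of-tuples input format are assumptions made for the example.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance d^2(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def seeding_coreset(points, m, seed=0):
    """Draw up to m points according to d^2 and weight each sample q_i
    by |Q_i|, the number of input points closest to it."""
    rng = random.Random(seed)
    n = len(points)
    first = rng.randrange(n)                  # q_1 uniformly at random
    chosen = [first]
    d2 = [sq_dist(p, points[first]) for p in points]  # d^2(p, S)
    nearest = [0] * n                         # index of closest sample
    while len(chosen) < m and sum(d2) > 0.0:
        # next sample: p is drawn with probability d^2(p,S) / cost(P,S)
        idx = rng.choices(range(n), weights=d2, k=1)[0]
        chosen.append(idx)
        j = len(chosen) - 1
        for i, p in enumerate(points):
            dist = sq_dist(p, points[idx])
            if dist < d2[i]:                  # ties stay with older sample
                d2[i] = dist
                nearest[i] = j
    weights = [0] * len(chosen)
    for j in nearest:
        weights[j] += 1                       # w(q_i) = |Q_i|
    return [points[i] for i in chosen], weights
```

By construction the weights sum to |P|, matching the requirement in the coreset definition. The inner loop touches every point for every sample, which is the cost the coreset tree is meant to avoid.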
  S       
         
         d
       
          
    m   m = 200k
          
          
            
      
  (k;")      
          
 
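Theorem. If
$$m = \Theta\!\left(\frac{k \log n}{\delta^{d/2}\,\varepsilon^{d}}\;\log^{d/2}\!\left(\frac{k \log n}{\delta^{d/2}\,\varepsilon^{d}}\right)\right),$$
then with probability at least $1 - \delta$ the weighted multiset $S$ is a $(k, 6\varepsilon)$-coreset of $P$.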
        
            
       
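Lemma. Let $S \subseteq P$ be a set of $m$ points chosen at random according to $d^2$. Then
$$E[\mathrm{cost}(P,S)] \le 8(2 + \ln m)\,\mathrm{opt}_m(P).$$

Lemma. Let $\gamma > 0$. If $m \ge \left(\frac{9d}{\gamma}\right)^{d/2} k\,\lceil \log(n) + 2 \rceil$, then
$$\mathrm{opt}_m(P) \le \gamma\,\mathrm{opt}_k(P).$$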
Let $C$ be an arbitrary set of $k$ centers. For $p \in P$ let $q_p$ denote the element of $S$ closest to $p$ (breaking ties arbitrarily). Then
$$\bigl|\mathrm{cost}(P,C) - \mathrm{cost}_w(S,C)\bigr| \le \sum_{p \in P} \bigl|d^2(p,C) - d^2(q_p,C)\bigr|.$$
Let $P' = \{\, p \in P : d(p,q_p) \le \varepsilon\, d(p,C) \,\}$ and $P'' = P \setminus P'$, so that $P'$ and $P''$ partition $P$.

If $p \in P'$, then
$$\bigl|d^2(p,C) - d^2(q_p,C)\bigr| \le 3\varepsilon\, d^2(p,C).$$
If $p \in P''$, then
$$\bigl|d^2(p,C) - d^2(q_p,C)\bigr| \le \frac{3}{\varepsilon}\, d^2(p,q_p).$$
Hence
$$
\begin{aligned}
\bigl|\mathrm{cost}(P,C) - \mathrm{cost}_w(S,C)\bigr|
&\le \sum_{p \in P'} \bigl|d^2(p,C) - d^2(q_p,C)\bigr| + \sum_{p \in P''} \bigl|d^2(p,C) - d^2(q_p,C)\bigr| \\
&\le 3\varepsilon \sum_{p \in P'} d^2(p,C) + \frac{3}{\varepsilon} \sum_{p \in P''} d^2(p,q_p) \\
&\le 3\varepsilon\,\mathrm{cost}(P,C) + \frac{3}{\varepsilon}\,\mathrm{cost}(P,S).
\end{aligned}
$$
       
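By Markov's inequality applied to the bound on $E[\mathrm{cost}(P,S)]$, with probability at least $1 - \delta$ we have
$$\mathrm{cost}(P,S) \le \frac{8}{\delta}(2 + \ln m)\,\mathrm{opt}_m(P).$$
Using $\mathrm{opt}_m(P) \le \gamma\,\mathrm{opt}_k(P)$ with $\gamma = \frac{\varepsilon^2 \delta}{8(2 + \ln m)}$, we obtain
$$\mathrm{cost}(P,S) \le \frac{8}{\delta}(2 + \ln m)\,\mathrm{opt}_m(P) \le \varepsilon^2\,\mathrm{opt}_k(P) \le \varepsilon^2\,\mathrm{cost}(P,C)$$
for
$$m = \Theta\!\left(\frac{k \log n}{\delta^{d/2}\,\varepsilon^{d}}\;\log^{d/2}\!\left(\frac{k \log n}{\delta^{d/2}\,\varepsilon^{d}}\right)\right).$$
Substituting this into $\bigl|\mathrm{cost}(P,C) - \mathrm{cost}_w(S,C)\bigr| \le 3\varepsilon\,\mathrm{cost}(P,C) + \frac{3}{\varepsilon}\,\mathrm{cost}(P,S)$ gives $\bigl|\mathrm{cost}(P,C) - \mathrm{cost}_w(S,C)\bigr| \le 6\varepsilon\,\mathrm{cost}(P,C)$.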
  
  
   
      
 k      
Assume that we have chosen a sample set $S = \{q_1, q_2, \ldots, q_i\}$ from the input set $P \subset \mathbb{R}^d$, where $i < m$ and $|P| = n$. To compute the probabilities for choosing the next sample point $q_{i+1}$, we need the distance from each point in $P$ to its nearest neighbor in $S$. A standard implementation of this computation requires time $\Theta(dnm)$ to obtain all $m$ coreset points.
            m
        
        
          
         
   P      
 n         
       (log k)  
  (dnlog m)    m   
        
          
d
2
        
 S         
k 
        
          
    
        
T      P        
      P  
          
P       
           
           
           
          
     T     
 
     T        
  
     T       
       P
            
 C           
 C
    v  T    
     P
v
     q
v
 P
v
   size(v)    cost(v) 
   P
v
        v 
   P
v
        
    T       v   P
v

           
   q
v
    v  
    d
2
 P
v
     
        q
`
     `
               
      
size(v)     v        
 P
v
       cost(v) 
cost(P
v
;q
v
)        
    P
v
 q
v
   cost(v)   
  v           
       
           
       
1     T        
      1     
    P   q
1
     
    S      
      P     
    i    1;2;:::;i   
      q
1
;q
2
;:::;q
i
  
     q
i+1
      
         T 
     
      ` 
          q
i+1

  P
`
 
   q
`
 q
i+1
  P
`
   
      ` T
        
    T  u         
       u   
       u    
         v 
    u    
cost(v)
cost(u)

          
  `        q
`
  
    `   P
`
    
P    `
           
 P
`
    d
2
   p 2
P
`
   
d
2
(p;q
`
)
cost(P
`
;q
`
)
  
       P  
           
          
          
   k     
        
          
  
          
`
1
`
2
       P
`
         
 `
1
   q
`
   `
2
   
   q
i+1
   P
`
  
P
`
1
= fp 2 P
`
d(p;q
`
) < d(p;q
i+1
)g  P
`
2
= P
`
n P
`
1
           
`            
`
1
`
2
       
  `
1
`
2
        
 `       
         
TreeCoreset(P, m):
    choose q_1 uniformly at random from P
    root ← node with q_root = q_1, size(root) = |P|, cost(root) = cost(P, q_1)
    S ← {q_1}
    for i ← 2 to m do
        starting at root, iteratively select a random child node until a leaf ℓ is chosen
        choose q_i according to d^2 from P_ℓ
        S ← S ∪ {q_i}
        create two child nodes ℓ_1, ℓ_2 of ℓ and update size(ℓ) and cost(ℓ)
        propagate the update upwards to node root
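The tree procedure can be illustrated with the following Python sketch. It is a simplified reading under stated assumptions, not the authors' implementation: a child is selected with probability proportional to its cost, an inner node's cost is kept equal to the sum of its children's costs, and a split assigns each point of P_ℓ to whichever of q_ℓ and the new sample is nearer. The names `Node`, `pick_leaf`, and `tree_coreset` are hypothetical.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance d^2(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


class Node:
    def __init__(self, points, rep, parent=None):
        self.points = points      # P_v, kept only while v is a leaf
        self.rep = rep            # representative point q_v
        self.parent = parent
        self.children = None      # (left, right) once v has been split
        self.size = len(points)   # size(v)
        self.cost = sum(sq_dist(p, rep) for p in points)  # cost(v)


def pick_leaf(root, rng):
    """Walk down from the root, choosing child v of u with
    probability cost(v) / cost(u)."""
    node = root
    while node.children is not None:
        left, right = node.children
        node = left if rng.random() * node.cost < left.cost else right
    return node


def tree_coreset(points, m, seed=0):
    rng = random.Random(seed)
    root = Node(list(points), rng.choice(points))
    leaves = [root]
    while len(leaves) < m and root.cost > 0.0:
        leaf = pick_leaf(root, rng)
        d2 = [sq_dist(p, leaf.rep) for p in leaf.points]
        q_new = rng.choices(leaf.points, weights=d2, k=1)[0]
        # points strictly closer to the old representative stay with it
        near_old = [p for p in leaf.points
                    if sq_dist(p, leaf.rep) < sq_dist(p, q_new)]
        near_new = [p for p in leaf.points
                    if sq_dist(p, leaf.rep) >= sq_dist(p, q_new)]
        leaf.children = (Node(near_old, leaf.rep, leaf),
                         Node(near_new, q_new, leaf))
        leaf.points = None
        leaves.remove(leaf)
        leaves.extend(leaf.children)
        node = leaf
        while node is not None:   # propagate updated costs to the root
            node.cost = node.children[0].cost + node.children[1].cost
            node = node.parent
    return [v.rep for v in leaves], [v.size for v in leaves]
```

The returned representatives and leaf sizes play the role of the coreset points and their weights. Each new sample only requires distance computations inside the selected leaf, which is where the saving over recomputing distances for all of P comes from.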
   
InsertPoint(p):
    put p into B_0
    if B_0 is full then
        create empty bucket Q
        move points from B_0 to Q
        empty B_0
        i ← 1
        while B_i is not empty do
            merge points from B_i and Q, store the merged points in Q
            empty B_i
            i ← i + 1
        move points from Q to B_i
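The bucket update behaves like incrementing a binary counter, which the sketch below makes explicit. The reduction step is passed in as a callable because the sketch does not fix how 2m weighted points are reduced to m; `pairwise_reduce` is a toy stand-in for one-dimensional data that only preserves total weight and is not the coreset reduction. The class name `BucketStream` and all method names are assumptions. Here B_0 is flushed as soon as it holds m points.

```python
class BucketStream:
    """Merge-and-reduce buckets B_0, B_1, ... for a point stream."""

    def __init__(self, m, reduce):
        self.m = m
        self.reduce = reduce      # (weighted points, m) -> <= m weighted points
        self.buckets = [[]]       # buckets[0] is B_0

    def insert(self, p):
        b0 = self.buckets[0]
        b0.append((p, 1))         # every raw point has weight 1
        if len(b0) < self.m:
            return
        q = b0[:]                 # move the points of B_0 into Q
        b0.clear()
        i = 1
        while i < len(self.buckets) and self.buckets[i]:
            # merge B_i with Q and reduce the union back to m points
            q = self.reduce(self.buckets[i] + q, self.m)
            self.buckets[i] = []
            i += 1
        if i == len(self.buckets):
            self.buckets.append([])
        self.buckets[i] = q       # move Q into the first empty bucket


def pairwise_reduce(weighted, m):
    """Toy reducer for 1-D data: merge sorted neighbours into their
    weighted mean. Expects an even number of (value, weight) pairs."""
    pts = sorted(weighted)
    out = []
    for a, b in zip(pts[0::2], pts[1::2]):
        w = a[1] + b[1]
        out.append(((a[0] * a[1] + b[0] * b[1]) / w, w))
    return out
```

After inserting 100 points with m = 4 there have been 25 flushes; 25 is 11001 in binary, so B_1, B_4 and B_5 are occupied and hold total weights 4, 32 and 64.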
   
        
   m     q
1
;q
2
;:::;q
m
 
           
S = fq
1
;q
2
;:::;q
m
g     q
i
  
            
   q
i

  
         
       m     
         m
       
         
           
       
         k
            
k       m  
           
            
      
        
  k      
          
        

        
            
        
For a data stream containing $n$ points, the algorithm maintains $L = \lceil \log_2(n/m) + 2 \rceil$ buckets $B_0, B_1, \ldots, B_{L-1}$. Bucket $B_0$ can store any number between $0$ and $m$ points. In contrast, for $i \ge 1$, bucket $B_i$ is either empty or contains exactly $m$ points. If bucket $B_i$ is full, it contains a coreset of size $m$ representing $2^{i-1} m$ points from the data stream.
 
         
     B
0
   B
0
  
 m      B
0
    
  B
1
   B
1
    
   B
1
  m   
    Q   m    
 2m     B
0
 B
1
   
         B
0
 B
1
    m     Q
    B
2
    B
2
            
         
        
            
   m         
           
         mdlog
2
(
n
m
) +2e
       B
0
;B
1
;:::;B
L1

       
    m
      
      
A single merge-and-reduce step takes time $O(dm^2)$, or $\Theta(dm \log m)$ if the coreset tree is balanced. For a stream of $n$ points, $\lceil n/m \rceil$ such steps are needed, so the amortized running time of all merge-and-reduce steps is at most $O(dnm)$. The final merge of all buckets into a coreset of size $m$ can be done in time $O(dm^2 \log \frac{n}{m})$. The $k$-means++ algorithm is then applied to a coreset of size $m$, which requires $\Theta(dkm)$ time per iteration. The algorithm uses $\Theta(dm \log \frac{n}{m})$ memory units.
         
  d       
 
        
        m 
       m= 200k  
          
  
  
       
      

           
      
          
       
         2:6:9
       
          
         
2:6:18 
     
      
       
         
         
       
     k 
       
 k     
    k  
       
       
          
       
      
       
  R
d
        
      
         
    

    


        


         
         

         
        


        

       

     
       

  

 

Data set     Number of points    Dimension
spambase                4 601           57
intrusion             311 079           34
covertype             581 012           54
tower               4 915 200            3
census              2 458 285           68
bigcross           11 620 300           57

     
     1:5     
    11 620300     57 
         
        
   
       
        
      
        
         
         
          
         
          
    k       
         
        
        
          
         m = 200k 
        
         
         
         
     20%    
 
     
   

   
      
         k 
           
        
       
        
         
         
     
       
        

      
[Figure: bar charts for the larger data sets. In every panel the x-axis is the number of centers k. Running-time panels show average time in seconds; cost panels show average cost. Bar labels are listed in legend order for increasing k.]

covertype, average running time (k = 10, 20, 30, 40, 50):
    StreamLS:    147, 460, 1027, 1773, 2588
    StreamKM++:  245, 297, 378, 454, 617
    BIRCH:       44, 44, 44, 44, 44
    kmeans++:    3389, 5160, 14933, 16713, 25803
covertype, average cost: StreamLS, StreamKM++, BIRCH, kmeans++ (no bar labels).

tower, average running time (k = 20, 40, 60, 80, 100):
    StreamLS:    679, 1989, 3849, 6212, 8946
    StreamKM++:  157, 168, 187, 211, 248
    BIRCH:       77, 78, 77, 77, 77
    kmeans++:    2960, 6902, 11247, 19206, 17161
tower, average cost: StreamLS, StreamKM++, BIRCH, kmeans++ (no bar labels).

census, average running time (k = 10, 20, 30, 40, 50):
    StreamLS:    631, 2362, 5504, 10054, 11842
    StreamKM++:  1571, 1724, 1839, 1956, 2057
    BIRCH:       271, 271, 271, 272, 272
census, average cost: StreamLS, StreamKM++, BIRCH (no bar labels).

bigcross, average running time (k = 15, 20, 25, 30):
    StreamLS:    6239, 10502, 15780, 22779
    StreamKM++:  5486, 5738, 5933, 6076
    BIRCH:       1006, 998, 996, 996
bigcross, average cost: StreamLS, StreamKM++, BIRCH (no bar labels).

           
[Figure: bar charts for the smaller data sets. In every panel the x-axis is the number of centers k. Bar labels are listed in legend order for increasing k.]

spambase, average running time in seconds (k = 10, 20, 30, 40, 50):
    StreamKM++:  3.06, 7.04, 16.45, 28.93, 44.48
    kmeans++:    3.57, 8.22, 19.05, 20.54, 25.9
    kmeans:      19.02, 59.85, 88.8, 132.03, 182.08
spambase, average cost: StreamKM++, kmeans++, kmeans (no bar labels).

intrusion, average running time in seconds, logarithmic scale (k = 10, 20, 30, 40, 50):
    StreamKM++:  74, 103, 144, 198, 250
    kmeans++:    51, 262, 1973, 1257, 1340
    kmeans:      409, 2711, 4389, 10734, 14282
intrusion, average cost, logarithmic scale: StreamKM++, kmeans++, kmeans (no bar labels).

        
     
Let $C = \{c_1, \dots, c_k\}$ be an optimal set of $k$ centers for $P$ with $|P| = n$, that is, $\mathrm{cost}(P, C) = \mathrm{opt}_k(P)$. We place an exponential grid around each center $c_i$. Let $R = \frac{1}{n}\,\mathrm{opt}_k(P)$. For $j = 1, 2, \dots, \lceil \log(n) + 2 \rceil$ and each $c_i$, let $Q_{ij}$ be the axis-parallel cube centered at $c_i$ with side length $\sqrt{2^j R}$. Let $U_{i0} = Q_{i0}$ and $U_{ij} = Q_{ij} \setminus Q_{i,j-1}$ for $j \ge 1$. Every $p \in P$ is contained in some $U_{ij}$, since otherwise

\[
d^2(p, C) > \frac{1}{4}\, 2^{\lceil \log(n) + 2 \rceil} R \ge \mathrm{opt}_k(P),
\]

a contradiction. For each $i, j$ we partition $U_{ij}$ into grid cells of side length $\sqrt{\frac{\varepsilon}{9d}\, 2^j R}$. For each cell that contains a point of $P$ we choose one representative point in the cell, and we let $G$ be the set of all representatives. For $m \ge \left(\frac{9d}{\varepsilon}\right)^{d/2} k \lceil \log(n) + 2 \rceil$ we have $|G| \le m$. Let $g_p$ denote the representative in $G$ of the cell containing $p \in P$. Then

\[
\mathrm{opt}_m(P) \le \mathrm{cost}(P, G) \le \sum_{p \in P} d^2(p, g_p).
\]

For $p \in U_{i0}$ we have $d^2(p, g_p) \le \frac{\varepsilon}{9} R$. For $p \in U_{ij}$ with $j \ge 1$ we have $d^2(p, C) \ge 2^{j-3} R$ and therefore

\[
d^2(p, g_p) \le \frac{\varepsilon}{9}\, 2^j R \le \frac{8}{9}\, \varepsilon\, d^2(p, C).
\]

Hence

\[
\mathrm{opt}_m(P) \le n\, \frac{\varepsilon}{9} R + \frac{8}{9}\, \varepsilon \sum_{p \in P} d^2(p, C)
= \frac{\varepsilon}{9}\, \mathrm{opt}_k(P) + \frac{8}{9}\, \varepsilon\, \mathrm{opt}_k(P) = \varepsilon\, \mathrm{opt}_k(P).
\]
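The grid argument above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it snaps every point to a cell of the exponential grid around its nearest center and compares the summed squared snapping distance with $\varepsilon \cdot \mathrm{cost}(P, C)$. Several things here are assumptions made for illustration: the function names (`sq_dist`, `grid_representative`, `snapping_cost`), the choice of the cell center as representative (the proof only needs some point of the cell), the anchoring of each grid at its center $c_i$, and the use of an arbitrary center set instead of an optimal one. The per-point bounds in the proof only use $R = \mathrm{cost}(P, C)/n$, so the inequality should hold for any $C$ with positive cost.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def grid_representative(p, centers, R, eps):
    """Snap p to the centre of its cell in the exponential grid around its nearest centre."""
    d = len(p)
    c = min(centers, key=lambda ctr: sq_dist(p, ctr))
    linf = max(abs(x - y) for x, y in zip(p, c))
    # Smallest level j >= 0 whose cube Q_ij (side sqrt(2^j R), centred at c) contains p.
    j = 0
    while linf > 0.5 * math.sqrt(2 ** j * R):
        j += 1
    side = math.sqrt(eps * 2 ** j * R / (9 * d))  # cell side length at level j
    # Centre of the cell containing p, with the grid anchored at c (an assumption).
    return tuple(y + (math.floor((x - y) / side) + 0.5) * side for x, y in zip(p, c))


def snapping_cost(points, centers, eps):
    """Return (sum of d^2(p, g_p), cost(P, C)) for the grid built from R = cost / n."""
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    R = cost / len(points)  # the sketch requires cost > 0
    snapped = sum(sq_dist(p, grid_representative(p, centers, R, eps)) for p in points)
    return snapped, cost


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(500)]
centers = [(rng.random(), rng.random()) for _ in range(3)]
snapped, cost = snapping_cost(points, centers, eps=0.25)
print(snapped <= 0.25 * cost)
```

Because the cell center is at most half a cell diagonal away from any point of the cell, the measured sum is usually well below the bound; the bound in the proof covers the worst case of an arbitrary representative inside the cell.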

      
First suppose $d(p, C) \le d(q_p, C)$, and let $c_p$ be the center in $C$ closest to $p$. Then

\[
d(q_p, C) \le d(q_p, c_p) \le d(p, c_p) + d(p, q_p) \le (1 + \varepsilon)\, d(p, C).
\]

Since $\varepsilon \le 1$, squaring gives

\[
d^2(q_p, C) \le (1 + \varepsilon)^2\, d^2(p, C) \le (1 + 3\varepsilon)\, d^2(p, C),
\]

and hence $d^2(q_p, C) - d^2(p, C) \le 3\varepsilon\, d^2(p, C)$.

Now suppose $d(q_p, C) < d(p, C)$, and let $c_s$ be the center in $C$ closest to $q_p$. Then

\[
d(p, C) \le d(p, c_s) \le d(q_p, c_s) + d(p, q_p) \le d(q_p, C) + \varepsilon\, d(p, C),
\]

since $p \in P'$. Hence $(1 - \varepsilon)\, d(p, C) \le d(q_p, C)$. Squaring gives

\[
d^2(q_p, C) \ge (1 - 2\varepsilon + \varepsilon^2)\, d^2(p, C) > (1 - 2\varepsilon)\, d^2(p, C),
\]

and therefore

\[
d^2(p, C) - d^2(q_p, C) \le 2\varepsilon\, d^2(p, C) < 3\varepsilon\, d^2(p, C).
\]

      
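If $d(p, q_p) > \varepsilon\, d(p, C)$ and $\varepsilon \le 1$, then

\[
\begin{aligned}
\left| d^2(p, C) - d^2(q_p, C) \right|
&= \left| d(p, C) - d(q_p, C) \right| \cdot \bigl( d(p, C) + d(q_p, C) \bigr) \\
&\le d(p, q_p) \cdot \bigl( 2\, d(p, C) + d(p, q_p) \bigr) \\
&\le \left( \frac{2}{\varepsilon} + 1 \right) d^2(p, q_p) \le \frac{3}{\varepsilon}\, d^2(p, q_p).
\end{aligned}
\]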
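The two bounds on $|d^2(p, C) - d^2(q_p, C)|$ derived above lend themselves to a quick randomized sanity check. The sketch below is illustrative only: the function names `squared_distance_gap` and `run_trials`, the sampling ranges, and the floating-point tolerance are all assumptions, and $q$ is simply a random perturbation of $p$ instead of an actual coreset point. For each sample it picks the bound that applies ($3\varepsilon\, d^2(p, C)$ when $d(p, q) \le \varepsilon\, d(p, C)$, otherwise $\frac{3}{\varepsilon}\, d^2(p, q)$) and counts violations.

```python
import math
import random


def dist_to_set(x, centers):
    """Euclidean distance from x to its nearest centre."""
    return min(math.dist(x, c) for c in centers)


def squared_distance_gap(p, q, centers, eps):
    """Return (case, |d^2(p,C) - d^2(q,C)|, applicable upper bound) for 0 < eps <= 1."""
    dp = dist_to_set(p, centers)
    dq = dist_to_set(q, centers)
    gap = abs(dp ** 2 - dq ** 2)
    dpq = math.dist(p, q)
    if dpq <= eps * dp:
        return "near", gap, 3 * eps * dp ** 2   # case d(p,q) <= eps * d(p,C)
    return "far", gap, (3 / eps) * dpq ** 2     # case d(p,q) >  eps * d(p,C)


def run_trials(seed, trials):
    """Sample random configurations and count how often each case occurs or fails."""
    rng = random.Random(seed)
    counts = {"near": 0, "far": 0}
    violations = 0
    for _ in range(trials):
        centers = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(4)]
        p = tuple(rng.uniform(-1, 1) for _ in range(3))
        scale = rng.choice([0.001, 0.5])  # small and large perturbations of p
        q = tuple(x + rng.uniform(-scale, scale) for x in p)
        eps = rng.uniform(0.05, 1.0)
        case, gap, bound = squared_distance_gap(p, q, centers, eps)
        counts[case] += 1
        if gap > bound * (1 + 1e-9) + 1e-12:  # small slack for rounding errors
            violations += 1
    return counts, violations


counts, violations = run_trials(seed=0, trials=2000)
print(counts, violations)
```

A passing run is evidence that the reconstructed inequalities are stated with the right constants; it is not a substitute for the proof.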
