wenbin@cse.ust.hk

Frequent Itemset Mining on Graphics Processors

Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He (1), Qiong Luo

Hong Kong Univ. of Sci. and Tech.
(1) Microsoft Research Asia

Presenter: Wenbin Fang

Outline

- Contribution
- Introduction
- Design
- Evaluation
- Conclusion

Contribution

Accelerate the Apriori algorithm for Frequent Itemset Mining using Graphics Processors (GPUs).

Two GPU implementations:

1. Pure Bitmap-based implementation (PBI): processing entirely on the GPU.
2. Trie-based implementation (TBI): GPU/CPU co-processing.

Frequent Itemset Mining (FIM)

Aims at finding groups of items, or itemsets, that co-occur frequently in a transaction database.

Transaction ID | Items
1              | A, B, C, D
2              | A, B, D
3              | A, C, D
4              | C, D

Minimum support: 2

1-itemsets (frequent items): A: 3, B: 2, C: 3, D: 4

2-itemsets: AB: 2, AC: 2, AD: 3, BD: 2, CD: 3

3-itemsets: ABD: 2, ACD: 2
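The example above can be checked with a tiny brute-force miner. This is illustrative Python (names are ours, not the paper's code): it enumerates every itemset, counts its support against the four transactions, and keeps those meeting the minimum support.

```python
from itertools import combinations

# The slide's example database: transaction ID -> set of items.
transactions = {
    1: {"A", "B", "C", "D"},
    2: {"A", "B", "D"},
    3: {"A", "C", "D"},
    4: {"C", "D"},
}
min_support = 2

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions.values() if itemset <= t)

items = sorted(set().union(*transactions.values()))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= min_support:
            frequent["".join(combo)] = s

print(frequent)  # includes A:3, AD:3, ABD:2, ACD:2; excludes ABC (support 1)
```

Brute force is exponential in the number of items, which is exactly why Apriori prunes the search space instead.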

Graphics Processors (GPUs)

- Exist in commodity machines, mainly for graphics rendering.
- Specialized for compute-intensive, highly data-parallel applications.
- Compared with CPUs, GPUs provide roughly 10x the computational horsepower and 10x the memory bandwidth.

(Figure: CPU vs. GPU architecture, from the NVIDIA CUDA Programming Guide.)

Programming on GPUs

- OpenGL/DirectX
- AMD CTM
- NVIDIA CUDA

The many-core architecture model of the GPU: SIMD parallelism (Single Instruction, Multiple Data).

Hierarchical multi-threading in NVIDIA CUDA

- A warp = 32 GPU threads => the SIMD schedule unit.
- A kernel launch specifies the number of thread blocks and the number of threads in a thread block; each thread block executes as a set of warps.

(Figure: thread blocks, each composed of several warps.)

General-Purpose GPU Computing (GPGPU)

Applications utilizing GPUs:

- Scientific computing
  - Molecular dynamics simulation
  - Weather forecasting
  - Linear algebra
  - Computational finance
  - Folding@home, SETI@home
- Database applications
  - Basic DB operators [SIGMOD'04]
  - Sorting [SIGMOD'06]
  - Join [SIGMOD'08]

Our work

As a first step, we consider GPU-based Apriori, with the intention to extend to another efficient FIM algorithm, FP-growth.

Why Apriori?

1. A classic algorithm for mining frequent itemsets.
2. Also applied in other data mining tasks, e.g., clustering and functional dependency discovery.

The Apriori Algorithm

Input: 1) Transaction database 2) Minimum support
Output: All frequent itemsets

L1 = {all frequent 1-itemsets}
k = 2
while (L[k-1] != empty) {
    // Generate candidate k-itemsets.
    C[k] <- self join on L[k-1]
    C[k] <- (k-1)-subset test on C[k]
    // Generate frequent k-itemsets.
    L[k] <- support counting on C[k]
    k += 1
}

The iteration alternates: frequent 1-itemsets -> candidate 2-itemsets -> frequent 2-itemsets -> ... -> candidate k-itemsets -> frequent k-itemsets.
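The loop above can be sketched sequentially as follows. This is an illustrative Python rendering of the pseudocode (names are ours), run on the slide's example database; the paper's implementations parallelize each step.

```python
from itertools import combinations

transactions = [{"A","B","C","D"}, {"A","B","D"}, {"A","C","D"}, {"C","D"}]
min_support = 2

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: all frequent 1-itemsets, kept as sorted tuples.
items = sorted(set().union(*transactions))
L = [(i,) for i in items if support({i}) >= min_support]
all_frequent = list(L)

k = 2
while L:
    # Self join: merge two (k-1)-itemsets sharing their first k-2 items.
    C = sorted({a + (b[-1],) for a in L for b in L
                if a[:-1] == b[:-1] and a[-1] < b[-1]})
    # (k-1)-subset test: prune candidates with an infrequent subset.
    prev = set(L)
    C = [c for c in C if all(s in prev for s in combinations(c, k - 1))]
    # Support counting: keep candidates meeting minimum support.
    L = [c for c in C if support(set(c)) >= min_support]
    all_frequent += L
    k += 1

print(all_frequent)
```

On the example data this yields the eleven frequent itemsets from the FIM slides; the candidate ABC is eliminated by the subset test because BC is infrequent.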

Outline

- Contribution
- Introduction
- Design
- Evaluation
- Conclusion

GPU-based Apriori

Input: 1) Transaction database 2) Minimum support
Output: All frequent itemsets

L1 = {all frequent 1-itemsets}
k = 2
while (L[k-1] != empty) {
    // Generate candidate k-itemsets.
    C[k] <- self join on L[k-1]
    C[k] <- (k-1)-subset test on C[k]
    // Generate frequent k-itemsets.
    L[k] <- support counting on C[k]
    k += 1
}

Pure Bitmap-based Impl. (PBI):
- Itemsets: bitmap; candidate generation on the GPU.
- Transactions: bitmap; support counting on the GPU.

Trie-based Impl. (TBI):
- Itemsets: trie; candidate generation on the CPU.
- Transactions: bitmap; support counting on the GPU.

Horizontal and vertical data layout

Horizontal data layout (support counting scans all transactions):

TID | Items
1   | A, B, C, D
2   | A, B, D
3   | A, C, D
4   | C, D

Vertical data layout (support counting is done on specific itemsets):

Itemset | TID list
AB      | 1, 2
AC      | 1, 3
AD      | 1, 2, 3
BD      | 1, 2
CD      | 1, 3, 4

Itemset | TID list
ABD     | 1, 2
ACD     | 1, 3

With the vertical layout:
1. Intersect two transaction (TID) lists.
2. Count the number of transactions in the intersection result.
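The two vertical-layout steps can be sketched directly on the table above (illustrative Python; `tidlists` and `support_of_join` are our names): the support of a joined itemset is the size of the intersection of its constituents' TID lists.

```python
# TID lists from the vertical-layout table above.
tidlists = {
    "AB": {1, 2},
    "AC": {1, 3},
    "AD": {1, 2, 3},
    "CD": {1, 3, 4},
}

def support_of_join(x, y):
    """Support of the joined itemset = size of the TID-list intersection."""
    return len(tidlists[x] & tidlists[y])

print(support_of_join("AB", "AD"))  # ABD: {1, 2} -> 2
print(support_of_join("AC", "AD"))  # ACD: {1, 3} -> 2
```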

Bitmap representation for transactions

Rows are itemsets; columns are transactions.

     T1  T2  T3  T4
AB    1   1   0   0
AC    1   0   1   0
AD    1   1   1   0
BD    1   1   0   0
CD    1   0   1   1

Intersection = bitwise AND operation.
Counting = number of 1's in a string of bits.

Lookup table

Counting 1's uses a precomputed lookup table. A 4-bit example:

Index | Count
0000  | 0
0001  | 1
...   | ...
1000  | 1
1010  | 2
1011  | 3
1100  | 2

The actual table indexes 16-bit strings (2^16 = 65536 entries, 1 byte each):

Index                       | Count
0000 0000 0000 0000 (0)     | 0
0000 0000 0000 0001 (1)     | 1
...                         | ...
1111 1111 1111 1110 (65534) | 15
1111 1111 1111 1111 (65535) | 16

The table resides in constant memory:
1. Cacheable
2. 64 KB
3. Shared by all GPU threads

Usage: # of 1's = TABLE[12];  // decimal 12 = binary 1100 (a string of bits)
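The table and its use can be sketched as follows (illustrative Python; on the GPU each entry is one byte, so 2^16 entries fit in the 64 KB of constant memory):

```python
# TABLE[i] = number of 1-bits in the 16-bit value i (2**16 entries).
TABLE = [bin(i).count("1") for i in range(1 << 16)]

def popcount32(x):
    """Count 1-bits of a 32-bit word with two 16-bit table lookups."""
    return TABLE[x & 0xFFFF] + TABLE[(x >> 16) & 0xFFFF]

print(TABLE[12])     # decimal 12 = binary 1100 -> 2
print(TABLE[65535])  # 1111 1111 1111 1111 -> 16
print(popcount32(0xF000000C))
```

Splitting a wider word into 16-bit halves keeps the table small while still counting an arbitrary bit string in a constant number of lookups.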

Support Counting on the GPU

Itemsets are distributed over thread blocks. In the example, thread block 1 counts the 2-itemsets (AB, AC, AD, BD, CD) and thread block 2 counts the 3-itemsets over the transaction bitmap:

     T1  T2  T3  T4
ABD   1   1   0   0
ACD   1   0   1   0

For each itemset:
1. Intersect two transaction lists (bitwise AND).
2. Count the number of transactions in the intersection result, using the lookup table.

Support Counting on the GPU (Cont.)

Within a thread block, each thread ANDs a pair of integers from the two transaction bitmaps, accessing four integers in one instruction via the vector type int4. Each thread then looks up the count of 1's for every 16-bit integer in the lookup table, and a parallel reduce sums the per-thread counts into the support for the itemset.

Example: AB = 1100, AD = 1110; AB AND AD = ABD = 1100; counts: 2; support: 2.
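The kernel's logic can be simulated sequentially (illustrative Python; each list element stands in for the 32-bit word one GPU thread would process, and the pairwise loop stands in for the parallel reduce):

```python
# 16-bit lookup table, as on the previous slide.
TABLE = [bin(i).count("1") for i in range(1 << 16)]

def count_word(w):
    """Count 1-bits of one 32-bit word via two table lookups."""
    return TABLE[w & 0xFFFF] + TABLE[(w >> 16) & 0xFFFF]

def support(bitmap_x, bitmap_y):
    # Each "thread" ANDs one word of the two bitmaps and counts its 1's.
    partial = [count_word(x & y) for x, y in zip(bitmap_x, bitmap_y)]
    # Parallel reduce: pairwise summation tree over the partial counts.
    while len(partial) > 1:
        partial = [partial[i] + (partial[i + 1] if i + 1 < len(partial) else 0)
                   for i in range(0, len(partial), 2)]
    return partial[0]

# AB and AD transaction bitmaps for T1..T4, packed into the high bits
# of a single 32-bit word (bit layout is ours, for illustration).
AB = [0b1100 << 28]
AD = [0b1110 << 28]
print(support(AB, AD))  # ABD occurs in T1 and T2 -> 2
```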

GPU-based Apriori: Candidate Generation

Support counting runs on the GPU; candidate generation consists of the join and the subset test.

1. Join
   e.g., join two 2-itemsets to obtain a candidate 3-itemset:
   AC JOIN AD => ACD
2. Subset test
   e.g., test all 2-subsets of ACD: {AC, AD, CD}
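The two steps, on the slide's example, can be sketched as (illustrative Python; `frequent_2` holds the frequent 2-itemsets from the earlier slides):

```python
from itertools import combinations

# Frequent 2-itemsets as frozensets of items.
frequent_2 = {frozenset(s) for s in ["AB", "AC", "AD", "BD", "CD"]}

# Join: AC JOIN AD -> ACD.
candidate = frozenset("AC") | frozenset("AD")

# Subset test: every 2-subset of the candidate must be frequent.
subsets = [frozenset(c) for c in combinations(sorted(candidate), 2)]
passes = all(s in frequent_2 for s in subsets)

print("".join(sorted(candidate)), passes)
```

ACD passes because all of {AC, AD, CD} are frequent; a candidate like ABC would fail, since BC is not frequent.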

Pure Bitmap-based Impl. (PBI): candidate generation on the GPU

Itemsets are represented as bitmaps; one GPU thread generates one candidate itemset.

Bitmap representation for itemsets

Rows are itemsets; columns are items.

     A  B  C  D
AB   1  1  0  0
AC   1  0  1  0
AD   1  0  0  1
BD   0  1  0  1
CD   0  0  1  1

     A  B  C  D
ABD  1  1  0  1
ACD  1  0  1  1

Bitwise OR in the join (e.g., AB JOIN AD = ABD).
Binary search in the subset test (e.g., 2-subsets {AB, AD, BD}).
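The bitmap operations can be sketched as follows (illustrative Python; the bit layout A=bit 3 ... D=bit 0 mirrors the columns of the table above and is our choice for the example):

```python
import bisect

# Itemset bitmaps over items A B C D (A = bit 3, ..., D = bit 0).
itemsets = {"AB": 0b1100, "AC": 0b1010, "AD": 0b1001,
            "BD": 0b0101, "CD": 0b0011}

# Join = bitwise OR of the two itemset bitmaps: AB JOIN AD = ABD.
abd = itemsets["AB"] | itemsets["AD"]
print(bin(abd))  # 0b1101 -> items {A, B, D}

# Subset test: each (k-1)-subset bitmap must appear among the frequent
# itemsets; like the GPU, we binary-search a sorted array of bitmaps.
sorted_bitmaps = sorted(itemsets.values())

def is_frequent(bm):
    i = bisect.bisect_left(sorted_bitmaps, bm)
    return i < len(sorted_bitmaps) and sorted_bitmaps[i] == bm

# 2-subsets of ABD: clear one set bit at a time (AB, AD, BD).
bits = [1 << b for b in range(4) if abd >> b & 1]
ok = all(is_frequent(abd ^ b) for b in bits)
print(ok)
```

Representing itemsets as fixed-width bitmaps keeps both the join (one OR) and each subset probe (one XOR plus a binary search) branch-free and uniform, which suits the GPU's SIMD execution.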

Trie-based Impl. (TBI): candidate generation on the CPU

Itemsets are represented as a trie; candidate generation runs on the CPU.

Trie-based Impl. (TBI)

1-itemsets: {A, B, C, D}
2-itemsets: {AB, AC, AD, BD, CD}

The itemsets are stored in a trie: the root (depth 0) branches to A, B, C, D (depth 1); at depth 2, A has children B, C, D; B has child D; C has child D.

Joining siblings under the same parent generates candidates, each checked by the subset test:

AB JOIN AC = ABC; 2-subsets {AB, AC, BC}: BC is not frequent, so ABC is pruned.
AB JOIN AD = ABD; 2-subsets {AB, AD, BD}: all frequent.
AC JOIN AD = ACD; 2-subsets {AC, AD, CD}: all frequent.

Candidate 3-itemsets: {ABD, ACD}

This is done on the CPU, because trie traversal exhibits:
1. Irregular memory access
2. Branch divergence
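The sibling-join logic can be sketched without building an explicit trie (illustrative Python; sorted tuples sharing a prefix play the role of siblings under the same trie parent):

```python
from itertools import combinations

# Frequent itemsets by size, as sorted tuples.
frequent = {1: [("A",), ("B",), ("C",), ("D",)],
            2: [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")]}

prev = set(frequent[2])
candidates = []
for a in frequent[2]:
    for b in frequent[2]:
        # Same trie parent = same 1-item prefix; join the two leaf items.
        if a[:-1] == b[:-1] and a[-1] < b[-1]:
            c = a + (b[-1],)
            # Subset test: ABC is pruned here because BC is not frequent.
            if all(s in prev for s in combinations(c, 2)):
                candidates.append(c)

print(candidates)  # [('A', 'B', 'D'), ('A', 'C', 'D')]
```

In the real trie the shared prefix is implicit in the tree structure, so the join only touches sibling leaves; the pointer-chasing that implies is precisely the irregular access pattern that keeps this step on the CPU.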

Outline

- Contribution
- Introduction
- Design
- Evaluation
- Conclusion

Experimental setup

Platform:

                        | Intel Core2 quad-core CPU | NVIDIA GTX 280 GPU
Processors              | 2.66 GHz * 4              | 1.35 GHz * 30 * 8
Memory bandwidth (GB/s) | 5.6                       | 141.7
Development env.        | Windows XP + Visual Studio 2005 + CUDA

Experimental datasets:

Dataset                 | #Items | Avg. length | #Trans  | Density
T40I10D100K (synthetic) | 1,000  | 40          | 100,000 | 4%
Retail                  | 16,469 | 10.3        | 88,162  | 0.6%
Chess                   | 75     | 37          | 3,196   | 49%

Density = avg. length / #items

Apriori Implementations

Impl.    | Candidate generation | Support counting     | Itemsets | Transactions
BORGELT  | single-threaded, CPU | single-threaded, CPU | trie     | trie
GOETHALS | single-threaded, CPU | multi-threaded, CPU  | trie     | horizontal layout
TBI-CPU  | single-threaded, CPU | multi-threaded, CPU  | trie     | bitmap
TBI-GPU  | single-threaded, CPU | multi-threaded, GPU  | trie     | bitmap
PBI-GPU  | multi-threaded, GPU  | multi-threaded, GPU  | bitmap   | bitmap

BORGELT is the best Apriori implementation in the FIMI repository (Frequent Itemset Mining Implementations Repository).

TBI-CPU vs GOETHALS

Impl.    | Itemset / candidate generation | Transactions / support counting
TBI-CPU  | trie / CPU                     | bitmap / CPU
GOETHALS | trie / CPU                     | horizontal layout / CPU

(Charts: dense dataset Chess; sparse dataset Retail.)

Shows the impact of the bitmap representation for transactions in support counting: 1.2x ~ 25.7x speedup.

TBI-GPU vs TBI-CPU

Impl.   | Itemset / candidate generation | Transactions / support counting
TBI-GPU | trie / CPU                     | bitmap / GPU
TBI-CPU | trie / CPU                     | bitmap / CPU

(Chart: dense dataset Chess.)

Shows the impact of GPU acceleration in support counting: 1.1x ~ 7.8x speedup.

PBI-GPU vs TBI-GPU

Impl.   | Itemset / candidate generation | Transactions / support counting
PBI-GPU | bitmap / GPU                   | bitmap / GPU
TBI-GPU | trie / CPU                     | bitmap / GPU

(Charts: sparse dataset Retail; dense dataset Chess.)

Shows the impact of bitmap-based vs. trie-based itemsets in candidate generation: PBI-GPU is faster on the dense dataset; TBI-GPU is better on the sparse dataset.

PBI-GPU/TBI-GPU vs BORGELT

Impl.   | Itemset / candidate generation | Transactions / support counting
PBI-GPU | bitmap / GPU                   | bitmap / GPU
TBI-GPU | trie / CPU                     | bitmap / GPU
BORGELT | trie / CPU                     | trie / CPU

(Charts: sparse dataset Retail; dense dataset Chess.)

Comparison to the best Apriori implementation in FIMI: 1.2x ~ 24.2x speedup.

Comparison to FP-growth

With minsup 1%, 60%, and 0.01%; the FP-growth implementation is from the PARSEC benchmark.

Conclusion

GPU-based Apriori:

- Pure Bitmap-based impl. (PBI)
  - Bitmap representation for itemsets.
  - Bitmap representation for transactions.
  - GPU processing.
- Trie-based impl. (TBI)
  - Trie representation for itemsets.
  - Bitmap representation for transactions.
  - GPU + CPU co-processing.

Better than CPU-based Apriori, but still worse than CPU-based FP-growth.

Backup Slides: Time Breakdown

- Time breakdown on dense dataset Chess
- Time breakdown on sparse dataset Retail