An Implementation of FP-Growth Algorithm Based on High Level Data Structures of Weka-JUNG Framework

Journal of Convergence Information Technology
Volume 5, Number 9. November 2010

Shui Wang (1, *corresponding author), Le Wang (2)
(1) Software School, Nanyang Institute of Technology, seawan@163.com
(2) School of Innovation Experiment, Dalian University of Technology, wangleboro@163.com
doi:10.4156/jcit.vol5.issue9.30

Abstract
FP-Growth is a classical data mining algorithm. Most of its current implementations build their
data structures on the programming language's primitive data types, which leads to poor readability
and reusability of the code. Weka is an open source platform for data mining, but it lacks the
ability to deal with tree-structured data; JUNG is a network/graph computation framework. Starting
from an analysis of Weka's foundation classes, this paper builds a concise implementation of the
FP-Growth algorithm based on high-level object-oriented data objects of the Weka-JUNG framework.
Comparison experiments against Weka's built-in Apriori implementation are carried out to verify its
correctness. This implementation has been published as an open source Google Code project.

Keywords: FP-Growth Algorithm, Frequent Itemset Mining, Weka, JUNG

1. Introduction

FP-growth (frequent pattern growth) [1] uses an extended prefix-tree (FP-tree) structure to store the
database in a compressed form. It adopts a divide-and-conquer approach to decompose both the mining
tasks and the databases, and uses a pattern fragment growth method to avoid the costly process of
candidate generation and testing used by Apriori.
Weka [2] is an open source data mining framework that integrates multiple algorithms for
classification, clustering, association rule mining, etc., and supports abundant data I/O and
visualization functionality. However, it lacks direct support for tree-structured data types, and up
to version 3.6 it has not implemented the FP-Growth algorithm [5]. The accompanying data mining
monograph [3] gives little information about Weka's internal data structures or data processing
workflow, which makes it difficult to build customized algorithms on this platform.
JUNG [6][9] is a universal graph/network framework; its functionality includes construction,
computation and visualization of graphs, trees and forests.
FP-growth implementations based on the primitive data types of programming languages lack reusable
high-level data structures such as tree and itemset, and are therefore hard to read, migrate, or
modify for customized algorithms.
This paper analyzes the basic data structures and fundamental classes of both the Weka and JUNG
frameworks, gives a concise implementation of the FP-Growth algorithm based on high-level
object-oriented data objects of the two frameworks, and compares its results against Weka's built-in
Apriori implementation to verify its correctness. It thereby provides a "cloneable" template for data
mining programmers to build their own algorithms on this integrated platform.

2. Related work

Although many papers discuss various derivatives or improvements of the FP-Growth algorithm, only a
few of them cover implementation details beyond the skeleton description of the algorithm itself.
Some student implementations can be found, such as [11], but they are usually poorly documented, not
generally applicable, and/or lack thorough testing. This situation makes it difficult for learners to
study existing coding methods: they have to begin from scratch even if they just want to make a small
modification to the original algorithm.
Xinyu Wang et al. [12] tested three different approaches for constructing the tree node: the vector
approach, the linked list approach and the binary tree approach. They found that (on their test
datasets), contrary to common belief, the vector approach had the best performance. However, neither
a vector nor the "binary tree" approach is a "natural" way to represent a tree.
C. Borgelt [13] gave a C implementation of the FP-Growth algorithm, with his own specialized
memory allocation management module. The initial FP-tree is built as a simple list of integer arrays.
This list is sorted lexicographically and can be turned into an FP-tree with a recursive procedure.
The two proposed projection approaches do not need parent-to-child pointers, so the structure of a
tree node can be more compact. Despite this implementation's technical merits, its C coding style and
complex data structures make it difficult to use for educational purposes or for fast application
prototyping.
Zi-guang Sun [14] discussed an implementation approach using the STL (Standard Template Library)
in the C++ programming language. He argued that STL's "set" data type is implemented with a red-black
tree with O(log N) search time complexity, and could help boost performance when constructing the
header table and FP-trees despite its relatively higher memory cost. Although high-level data types
were used in this implementation, these types were not intuitive.
The purpose of this paper is to provide an intuitive, concise source code implementation of the
FP-Growth algorithm, using high-level data types (at the price of an affordable performance loss), so
that it can easily be adopted for education or application prototyping.

3. High-level data types and functions in Weka & JUNG framework

3.1. Weka's layered data structure

To deal with data transactions in a unified way, Weka provides several data types; these data
objects are analogous to database terms such as transaction, table, record and field, and they can be
categorized into several layers as follows [4]:
(1) DataSource: the source from which the data is obtained; usually a data file;
(2) Instances: the collection of data transactions, or database;
(3) Instance: a single transaction, or record;
(4) Item: a unique value for a field. This is an abstract class; in practice its subclasses such as
NominalItem or NumericalItem are used for nominal items and numerical items respectively.
Association rule mining uses NominalItem as its data structure. The built-in "equals()" function
determines whether two items belong to the same field (i.e. getAttribute() returns the same
value) and their values (or precisely, the indices of their values) are also the same. Note that
the frequencies (or "supports") of the two items are not compared. Item's innate "compareTo()"
function compares frequency and attribute name in descending order; that is, the natural order
of Items is the descending order of their frequencies.
In Weka, an Item's "value" is represented by an "index" into the value domain; the real meaning
of this index can only be obtained by referencing the underlying Attribute object. Class
"Attribute" contains the attribute information of a data field, including its name and value
domain; e.g., if the attribute's value domain is {"Li", "Wang", "Zhang"}, a nominal item with a
value index of 0 corresponds to "Li", and a value index of 1 corresponds to "Wang".
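The relationship between an item, its value index and the attribute's value domain can be illustrated with a small sketch. The classes SimpleAttribute and SimpleItem below are hypothetical stand-ins written for this illustration; they are not Weka classes and only mimic the behaviour described above (equality on attribute and value index, with frequency ignored).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins (not Weka classes): an attribute owns the value
// domain, and an item only stores an index into that domain.
final class ValueIndexDemo {

    static final class SimpleAttribute {
        final String name;
        final List<String> domain;

        SimpleAttribute(String name, String... values) {
            this.name = name;
            this.domain = Arrays.asList(values);
        }

        // Resolve a value index to the label it stands for.
        String value(int index) {
            return domain.get(index);
        }
    }

    static final class SimpleItem {
        final SimpleAttribute attribute;
        final int valueIndex;
        final int frequency;

        SimpleItem(SimpleAttribute attribute, int valueIndex, int frequency) {
            this.attribute = attribute;
            this.valueIndex = valueIndex;
            this.frequency = frequency;
        }

        // Equal when attribute and value index match; frequency is ignored.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SimpleItem)) return false;
            SimpleItem other = (SimpleItem) o;
            return attribute == other.attribute && valueIndex == other.valueIndex;
        }

        @Override
        public int hashCode() {
            return 31 * attribute.name.hashCode() + valueIndex;
        }

        // Human-readable form, resolved through the attribute.
        String label() {
            return attribute.name + "=" + attribute.value(valueIndex);
        }
    }

    public static void main(String[] args) {
        SimpleAttribute surname = new SimpleAttribute("surname", "Li", "Wang", "Zhang");
        SimpleItem a = new SimpleItem(surname, 1, 5);
        SimpleItem b = new SimpleItem(surname, 1, 9);
        System.out.println(a.label() + " equals " + b.label() + ": " + a.equals(b));
    }
}
```

The two items in main() differ only in frequency, so they compare as equal, which is the property the text attributes to NominalItem.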
Besides the four layers of data objects mentioned above, an "ItemSet" object represents a collection
of one or more data items; its inner structure is an array of integers, each of which represents the
value index of one item; the size of this array is the length of the itemset.
Apriori-like algorithms use a horizontal representation for transactions, in which the basic data
element is the "Instance" (i.e. a transaction or record); the FP-Growth algorithm uses a vertical
representation of data, i.e., it uses data "Items" to construct the FP-trees, so its implementation
requires tree computation capabilities.

3.2. Tree computation in JUNG

JUNG has powerful support for network/graph computation and visualization [6]; Tree and Forest
are special cases of Graph, and JUNG provides dedicated APIs for them. A straightforward
implementation of the Tree interface is "DelegateTree", which is in fact also an implementation
of DirectedGraph. Core methods of this class include:
• addVertex(V vertex): add a vertex as the root node.
• addChild(E edge, V parent, V child): add a child node under vertex "parent"; the "edge"
object must be specified.
• getPath(V vertex): get all the vertices from the root to node "vertex".
The idea of this paper is to use Weka's NominalItem data object as the node element of a JUNG
Tree, in order to code a concise implementation of the FP-Growth algorithm.
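To make the three tree operations concrete without depending on the JUNG library, the following minimal sketch mirrors them with a plain child-to-parent map. SimpleTree is a hypothetical class written for this illustration, not JUNG's DelegateTree, and its addChild() omits the edge object that JUNG requires.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a rooted tree (NOT JUNG's DelegateTree). It only
// mirrors addVertex / addChild / getPath using a child -> parent map.
final class SimpleTree<V> {
    private V root;
    private final Map<V, V> parentOf = new HashMap<V, V>();

    // Add the root vertex; succeeds only while the tree is empty.
    boolean addVertex(V vertex) {
        if (root != null) return false;
        root = vertex;
        return true;
    }

    // Add "child" under "parent"; the parent must exist and the child must be new.
    boolean addChild(V parent, V child) {
        if (!contains(parent) || contains(child)) return false;
        parentOf.put(child, parent);
        return true;
    }

    // Membership is decided by equals(), so two "equal" vertices cannot coexist.
    boolean contains(V v) {
        return v != null && (v.equals(root) || parentOf.containsKey(v));
    }

    // Vertices from the root down to "vertex", both inclusive.
    List<V> getPath(V vertex) {
        List<V> path = new ArrayList<V>();
        if (!contains(vertex)) return path;
        for (V v = vertex; v != null; v = parentOf.get(v)) {
            path.add(v);
        }
        Collections.reverse(path);
        return path;
    }

    public static void main(String[] args) {
        SimpleTree<String> tree = new SimpleTree<String>();
        tree.addVertex("root");
        tree.addChild("root", "bread");
        tree.addChild("bread", "milk");
        System.out.println(tree.getPath("milk"));
    }
}
```

Because membership relies on equals(), a second vertex that is "equal" to an existing one is rejected; this is the same kind of constraint that later motivates a node class whose instances are all distinct.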

4. Constructing the header table

A "Header Table" in the FP-Growth algorithm is a map from an item to its total support; the map
is sorted in descending order of support. Constructing a header table requires searching,
inserting, modifying and deleting items in the map; to keep these operations efficient, a data
structure that supports fast retrieval of data items (such as a hash table or a tree) is required.
Also, because the map should be sorted in descending order of support, it needs a mechanism
that keeps it sorted automatically. We choose Apache's TreeBidiMap [7] for this purpose.
TreeBidiMap establishes a bi-directional map between the key and the value elements.
Bi-directional means that the key and the value are exchangeable: you can look up the value
corresponding to a specified key, and you can also look up the key corresponding to a specified
value; both operations are performed efficiently.
These features of the TreeBidiMap class require that both key and value be comparable (i.e.,
implement the Comparable interface and its compareTo() method) and that there be a 1-to-1
relationship between keys and values. Because different items might have the same support, we
define a customized HeaderCount class with a random-valued attribute to impose the 1-to-1
relationship between items and supports.
Figure 1 shows the class used in the header table to represent an item's support; the "link"
attribute is the link table required by the FP-Growth algorithm. We can then define the header
table as:
    TreeBidiMap<HeaderCount, NominalItem>
Note that here we use the HeaderCount object as the "key" of the map merely for the
convenience of coding; for a bi-directional map, key and value are exchangeable.
TreeBidiMap is one of the collection classes in the “Apache Commons” project [7]; it is a
bi-directional map implemented with red-black trees. Comparisons are performed during its
construction, so all keys and values must implement the Comparable interface. The detailed
structure of the NominalItem class is discussed in the next section.
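The need for a tie-breaker in the key class can be seen in a small sketch. It is a simplified illustration under two stated assumptions: java.util.TreeMap stands in for TreeBidiMap (so only the key side is sorted), and a creation sequence number replaces the random value as the tie-breaker. The class names HeaderTableDemo and Count are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: java.util.TreeMap stands in for TreeBidiMap, and a sequence
// number (an assumption of this sketch) replaces the random tie-breaker.
final class HeaderTableDemo {

    static final class Count implements Comparable<Count> {
        private static long nextId = 0;
        final int count;            // total support of the item
        final long id = nextId++;   // makes every key distinct

        Count(int count) {
            this.count = count;
        }

        // Descending by count; ties are broken by creation order, so two
        // different objects never compare as 0 and never overwrite each other.
        @Override
        public int compareTo(Count other) {
            if (count != other.count) return Integer.compare(other.count, count);
            return Long.compare(id, other.id);
        }
    }

    // Build a header table sorted by descending support.
    static TreeMap<Count, String> build(Map<String, Integer> supports) {
        TreeMap<Count, String> header = new TreeMap<Count, String>();
        for (Map.Entry<String, Integer> e : supports.entrySet()) {
            header.put(new Count(e.getValue()), e.getKey());
        }
        return header;
    }

    public static void main(String[] args) {
        Map<String, Integer> supports = new TreeMap<String, Integer>();
        supports.put("bread", 4);
        supports.put("milk", 4);   // same support as bread
        supports.put("beer", 2);
        for (Map.Entry<Count, String> e : build(supports).entrySet()) {
            System.out.println(e.getValue() + " " + e.getKey().count);
        }
    }
}
```

Without the tie-breaker, "bread" and "milk" would map to keys that compare as equal, and the second put() would silently replace the first entry.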

Data type 1: header table:

    import java.util.Vector;

    class HeaderCount implements Comparable<HeaderCount> {
        int count = 1; // total support
        double random = Math.random(); // tie-breaker for equal supports
        Vector<NominalItem> link = new Vector<NominalItem>(); // link table

        public int compareTo(HeaderCount arg0) {
            if (arg0 == this) return 0;
            long r = count - arg0.count;
            if (r < 0) return 1;       // larger support sorts first
            else if (r > 0) return -1;
            else { // impose "unequal" for different objects
                return (random <= arg0.random) ? 1 : -1;
            }
        }
    }

Figure 1. Class HeaderCount for the header table

5. Constructing FP-trees

There are two types of data stored in an FP-tree: items and supports. Items that come from
different transactions but belong to the same field and have the same value may share one node
(if they have a common prefix) or reside on different nodes. However, the JUNG framework does
not allow different nodes to be "equal", so the "item" object in an FP-Growth implementation
must satisfy the following two conditions:
(1) It must be able to distinguish between items that have the same attribute and value but
come from different transactions.
(2) It must also be able to identify such items as a special kind of "equal".
Because the NominalItem class only overrides Item's equals() method, in which its attribute is
compared, it cannot distinguish items from different transactions. We therefore define a new
class for nominal items named "OrderedNominalItem", with a "serial" property indicating its
transaction id; the equals() method is also rewritten so that it works correctly when
constructing a JUNG Tree object with OrderedNominalItem nodes.
During the construction of the header tables (data type TreeBidiMap) and FP-trees (data type
JUNG's DelegateTree), comparisons are performed when adding or deleting nodes. The original
compareTo() method only compares the item's frequency and the attribute's name; this strategy
gives an "equal" result for items that belong to the same field (attribute) but have different
values. So our comparison strategy is to compare the item's frequency, the attribute's name and
the item's value in turn, as shown in Figure 2.
Condition (2) is met by the method equalsWithoutOrder(), which compares items without
involving the serial property. It is called when seeking a specified item in a collection, where
only the item's attribute and value are compared; see the seekItemInCollection() method in class
FpTree [8]. This seek operation is needed when adding a new node to the existing FP-tree.
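The two notions of equality can be sketched as follows. This is a simplified illustration rather than the project's actual OrderedNominalItem: the class SerialItem, its fields and the helper seek() are hypothetical, and only the method name equalsWithoutOrder() is taken from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of the two kinds of equality described above.
final class SerialItemDemo {

    static final class SerialItem {
        final String attribute;
        final int valueIndex;
        final long serial;   // distinguishes occurrences of the same item

        SerialItem(String attribute, int valueIndex, long serial) {
            this.attribute = attribute;
            this.valueIndex = valueIndex;
            this.serial = serial;
        }

        // Strict equality, used by the tree so that every node stays unique.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SerialItem)) return false;
            SerialItem other = (SerialItem) o;
            return serial == other.serial && equalsWithoutOrder(other);
        }

        @Override
        public int hashCode() {
            return 31 * (31 * attribute.hashCode() + valueIndex) + Long.hashCode(serial);
        }

        // Loose equality: same attribute and value, serial ignored.
        boolean equalsWithoutOrder(SerialItem other) {
            return attribute.equals(other.attribute) && valueIndex == other.valueIndex;
        }
    }

    // Return the first element that loosely matches "wanted", or null if none does.
    static SerialItem seek(List<SerialItem> candidates, SerialItem wanted) {
        for (SerialItem c : candidates) {
            if (c.equalsWithoutOrder(wanted)) return c;
        }
        return null;
    }

    public static void main(String[] args) {
        List<SerialItem> children = new ArrayList<SerialItem>();
        children.add(new SerialItem("surname", 1, 100L));
        SerialItem incoming = new SerialItem("surname", 1, 200L);
        System.out.println("strict: " + children.get(0).equals(incoming));
        System.out.println("loose match found: " + (seek(children, incoming) != null));
    }
}
```

When a transaction is inserted into the tree, the loose match decides whether an existing child can be shared, while strict equality keeps the underlying tree nodes distinct.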


Data type 2: comparator of OrderedNominalItem:

    public boolean equals(Object o) {
        // guard added: an object of another type is never equal
        if (!(o instanceof OrderedNominalItem))
            return false;
        if (serial != ((OrderedNominalItem) o).serial)
            return false;
        return super.equals(o);
    }

    public int compareTo(Item o) {
        OrderedNominalItem comp = (OrderedNominalItem) o;
        // 1. first, frequency (descending)
        if (comp.getFrequency() < m_frequency) {
            return -1;
        }
        if (comp.getFrequency() > m_frequency) {
            return 1;
        }
        // 2. then, by attribute name
        int c = m_attribute.name()
                .compareTo(comp.getAttribute().name());
        if (c != 0)
            return -1 * c;
        // 3. last, by value index
        if (m_valueIndex < comp.getValueIndex())
            return 1;
        else if (m_valueIndex > comp.getValueIndex())
            return -1;
        else
            return 0;
    }

Figure 2. Comparator used in OrderedNominalItem

6. Algorithm description

Using the high-level data types of the Weka and JUNG frameworks along with data objects from
Apache Collections, the computation process of the FP-Growth algorithm can be described
concisely.

Figure 3 shows the workflow of the initial processing of the transaction database; it constructs the
first FP-tree and header table with two scans. From this point on, the mining process operates on
FP-trees.
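The preparation done by the two scans (counting supports, discarding infrequent items, and reordering each transaction by descending support) can be sketched with plain Java collections. This is an illustrative sketch, not the project's code: TwoScanDemo and its method names are hypothetical, items are plain strings, each item is assumed to occur at most once per transaction, and ties in support are broken alphabetically so that the order is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the preprocessing behind the two scans; names are hypothetical.
final class TwoScanDemo {

    // First scan: count the support of every item, then drop infrequent ones.
    static Map<String, Integer> countSupport(List<List<String>> db, int minCount) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (List<String> transaction : db) {
            for (String item : transaction) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }

    // Before the second scan: keep only frequent items of one transaction and
    // sort them by descending support (alphabetical order breaks ties).
    static List<String> sortTransaction(List<String> transaction,
                                        final Map<String, Integer> counts) {
        List<String> kept = new ArrayList<String>();
        for (String item : transaction) {
            if (counts.containsKey(item)) kept.add(item);
        }
        kept.sort((a, b) -> {
            int bySupport = Integer.compare(counts.get(b), counts.get(a));
            return bySupport != 0 ? bySupport : a.compareTo(b);
        });
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> db = new ArrayList<List<String>>();
        db.add(java.util.Arrays.asList("a", "b", "c"));
        db.add(java.util.Arrays.asList("b", "c"));
        db.add(java.util.Arrays.asList("b", "d"));
        Map<String, Integer> counts = countSupport(db, 2);
        for (List<String> t : db) {
            System.out.println(t + " -> " + sortTransaction(t, counts));
        }
    }
}
```

Sorting every transaction into the same global order is what allows transactions with common frequent items to share a prefix when they are inserted into the tree.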

Algorithm 1: construction of the initial FP-tree:

    // 1. Initialization:
    Initialize instances object from a data source (e.g. a file);

    // 2. First scan, create initial header table:
    For each instance in instances:
        Split it into items (OrderedNominalItem objects)
    Construct header table
        (data type: TreeBidiMap<HeaderCount, OrderedNominalItem>)
    Delete infrequent items from header table

    // 3. Second scan, create initial FP-tree and link table:
    Construct dbTree (data type: DelegateTree<OrderedNominalItem, Long>)
    Construct link table (data type: Vector)

    // 4. Now do the mining on dbTree
    mineDbtree(dbTree, dbHeader)

Figure 3. Initial process for transaction database

The main mining method is defined as mineDbtree(dbTree, dbHeader); it is a recursive function
with two basic steps:
(1) Traverse the header table and construct subtrees;
(2) Mine each subtree (with a recursive call to mineDbtree()).
Each subtree is a new transaction database, which can be handled in the same way as the
original database:
(1) First scan: construct the header table;
(2) For each transaction, sort its items in the header table's order;
(3) Second scan: construct the transaction tree;
(4) Mine the resulting transaction tree (recursive call to mineDbtree()).
This process is demonstrated in Figure 4.

Algorithm 2: mining sub-tree:

    mineSubtree(Vector<List<OrderedNominalItem>> subtree) {
        // 1. Traverse subtree, build header table
        TreeBidiMap<HeaderCount, OrderedNominalItem> header;
        // 2. Sort items in each transaction
        // 3. Rebuild transaction tree
        DelegateTree<OrderedNominalItem, Long> fptree;
        // 4. Mine the resulting tree
        mineDbtree(fptree, header);
    }

Figure 4. Mining process on subtrees

Mining the transaction tree comprises two main steps:
(1) Obtaining the subtree corresponding to each header table item (getSubtree);
(2) Mining the resulting subtree (mineSubtree).
Although the actual code may look a little different, the idea inside getSubtree() is quite
simple: traverse the link table and get all the branches corresponding to each item; each branch
is obtained by simply invoking (where linkItem is one entry of the link table):
    List<OrderedNominalItem> branch = tree.getPath(linkItem);

After its root and leaf nodes are removed, each branch is a list of nodes that constitutes one
transaction of the subtree. All such branches form a new transaction database, which can be mined
recursively using mineDbtree. The recursion terminates when the resulting subtree is empty.
The mining process of the subtree is illustrated in Figure 5.
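The branch extraction described above can be sketched on a small hand-written prefix tree. This is a hedged illustration and not the project's getSubtree(): BranchDemo, Node, insert() and branches() are hypothetical names, parent pointers replace the JUNG getPath() call, items are plain strings, and support counts are left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Sketch of collecting, for one item, every path from the root to a node of
// that item, with the root and the node itself removed. Names are hypothetical.
final class BranchDemo {

    static final class Node {
        final String item;   // null marks the root
        final Node parent;
        final Map<String, Node> children = new HashMap<String, Node>();

        Node(String item, Node parent) {
            this.item = item;
            this.parent = parent;
        }
    }

    final Node root = new Node(null, null);
    // Link table: item -> every tree node that carries this item.
    final Map<String, List<Node>> links = new HashMap<String, List<Node>>();

    // Insert one already-sorted transaction, sharing existing prefixes.
    void insert(List<String> transaction) {
        Node current = root;
        for (String item : transaction) {
            Node child = current.children.get(item);
            if (child == null) {
                child = new Node(item, current);
                current.children.put(item, child);
                List<Node> list = links.get(item);
                if (list == null) {
                    list = new ArrayList<Node>();
                    links.put(item, list);
                }
                list.add(child);
            }
            current = child;
        }
    }

    // One branch per linked node: the path above it, excluding root and node.
    List<List<String>> branches(String item) {
        List<List<String>> result = new ArrayList<List<String>>();
        List<Node> nodes = links.get(item);
        if (nodes == null) return result;
        for (Node leaf : nodes) {
            LinkedList<String> branch = new LinkedList<String>();
            for (Node n = leaf.parent; n != root; n = n.parent) {
                branch.addFirst(n.item);
            }
            if (!branch.isEmpty()) result.add(branch);
        }
        return result;
    }

    public static void main(String[] args) {
        BranchDemo tree = new BranchDemo();
        tree.insert(java.util.Arrays.asList("b", "c", "a"));
        tree.insert(java.util.Arrays.asList("b", "a"));
        System.out.println(tree.branches("a"));
    }
}
```

An item whose nodes all hang directly under the root yields no branches, which corresponds to the empty-subtree termination condition of the recursion.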

Algorithm 3: mining transaction-tree:

    mineDbtree(dbtree, dbheader) {
        Vector<List<OrderedNominalItem>> subtree;
        HeaderCount headerItem = dbheader.lastKey();
        // traverse from the tail of the header table
        while (headerItem != null) {
            // 1. get subtree for this item
            subtree = getSubtree(dbtree, headerItem);
            // if subtree is null, go to next
            if (subtree == null) {
                headerItem = dbheader.previousKey(headerItem);
                continue;
            }
            // 2. subtree mining
            mineSubtree(subtree);
            // next item in header table
            headerItem = dbheader.previousKey(headerItem);
        }
    }

Figure 5. Mining the transaction tree

7. Implementation and experiments

The approach described in this paper has been implemented and published as an open source
project on Google Code™; the project URL is:
http://code.google.com/p/weka-jung-fpgrowth/
The JUNG and Weka supporting packages needed for compiling can be downloaded from [5][9];
the Apache Collections support is included in the JUNG package.
We created a simple GUI (see Figure 6) for testing different data sets. The preferred data file
type is "arff", the standard format in Weka [5]; other types such as "csv" are also supported.
As shown in Figure 6, the button "Read Datafile" loads data from a file into a DataSource object
and parses it into an Instances object; the button "Apriori->FP" uses Weka's built-in Apriori
algorithm to find all the frequent patterns, and the button "Show FP" lists these patterns in the
window. "FP-Tree" is our implementation: it applies the FP-Growth algorithm to the
above-mentioned Instances and shows the results.
Experiments have been performed to check its correctness; the data sets used in these
experiments were downloaded from the UCI Machine Learning Repository [10]. The experiments
ran on Windows XP with JDK 1.6.0_14, on an Intel® Core2 Duo CPU (2.8GHz) with 3GB of
memory. The source code was compiled with MyEclipse V7.5.
Because we only want to check the correctness (rather than the performance) of this
implementation, a simple comparison with the result of the built-in Apriori algorithm should
suffice. Table 1 gives the test results on the classical "Breast Cancer" dataset; in these runs our
implementation clearly outperforms the Apriori implementation (which is also a Weka-based
program). Figure 6 is a runtime screenshot of this test with the minimum support set to 0.5.
We have not tested our implementation against C++ implementations of FP-Growth because we
consider the two approaches not comparable, and the overhead of our complex data objects is
obvious.
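A correctness check of the kind described in the experiments, comparing the frequent itemsets found by two miners, can be sketched as an order-insensitive set comparison. The class ResultCheckDemo and its methods are hypothetical, itemsets are represented as lists of strings, and supports are not compared; this is only one possible way to perform such a check, not the procedure used in the experiments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: two mining results agree when they contain the same itemsets,
// regardless of the order of itemsets or of items inside an itemset.
final class ResultCheckDemo {

    // Convert a list of itemsets into a set of sets so that order is irrelevant.
    static Set<Set<String>> normalize(List<List<String>> itemsets) {
        Set<Set<String>> out = new HashSet<Set<String>>();
        for (List<String> itemset : itemsets) {
            out.add(new HashSet<String>(itemset));
        }
        return out;
    }

    // True when both results contain exactly the same itemsets.
    static boolean sameResult(List<List<String>> first, List<List<String>> second) {
        return normalize(first).equals(normalize(second));
    }

    public static void main(String[] args) {
        List<List<String>> fromApriori = new ArrayList<List<String>>();
        fromApriori.add(Arrays.asList("b"));
        fromApriori.add(Arrays.asList("b", "c"));

        List<List<String>> fromFpGrowth = new ArrayList<List<String>>();
        fromFpGrowth.add(Arrays.asList("c", "b"));
        fromFpGrowth.add(Arrays.asList("b"));

        System.out.println("results agree: " + sameResult(fromApriori, fromFpGrowth));
    }
}
```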


Figure 6. A graphic user interface of our implementation

Table 1. Runtime Comparison with Weka-Apriori Implementation

Min. Support    Apriori (ms)    FP-Tree/JUNG-Weka (ms)
0.5             156             63
0.2             375             78
0.1             906             219
0.02            overflow        1594


8. Conclusion and discussion

Using the combined framework of Weka and JUNG, together with other high-level data types
from Apache Collections, algorithms (such as FP-Growth) that need sophisticated data structures
(such as trees and graphs) can be implemented concisely, with less effort and higher
reusability; the overhead of complex data objects and its negative impact on runtime efficiency
can be tolerated when human labor cost is the more important factor.
This implementation does not use Weka's built-in NominalItem data type directly as the JUNG
tree's node class, because our experiments show that if an internal attribute such as
m_frequency is changed via its public method, the object is no longer considered a node of the
tree. This odd behavior forced the customized OrderedNominalItem class to carry a redundant
data attribute for the item's support; the cause of this behavior remains for further study.
Using the high-level data types of JUNG has other benefits: JUNG provides powerful
visualization functionality, which can be used to present graphical illustrations of the mining
results, for example when dealing with visualization requirements [15]. In fact, setting the
boolean variable "showtrees" to "true" causes our program to visualize all the FP-trees it
creates; one example of such a tree is shown in Figure 7.
Further research work may include implementing other algorithms that need tree computation
and visualization, such as the "cluster first" strategy proposed in [16], and text mining for
mind maps [17].


Figure 7. Visualization of an FP-tree

9. References

[1] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao, “Mining frequent patterns without candidate
generation”, Data Mining and Knowledge Discovery, vol. 8, no. 1, pp.53-87, 2004.
[2] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten,
“The WEKA Data Mining Software: An Update”, SIGKDD Explorations, vol.11, no. 1, 2009.
[3] Eibe Frank, Ian H. Witten, “Data Mining: Practical Machine Learning Tools and Techniques”,
Morgan Kaufmann, 2005.
[4] Guang-li Yu, Ying Zhan and Shui Wang, “Analysis on Weka Foundation Classes and Algorithm
Extending Method”, Journal of Nanyang Institute of Technology, vol. 1, no. 6, pp. 9-11, 2009.
[5] Weka (Machine Learning Group at University of Waikato). Data mining with open source
machine learning software in Java [http://www.cs.waikato.ac.nz/~ml/weka/]. Accessed in Sep.
2010.
[6] Shui Wang, Yu-jun Ma, “Introduction to JUNG: Network/Graph Computation Framework on Java
platform”. (to be published).
[7] Apache Software Foundation. Apache Commons [http://commons.apache.org/collections/].
Accessed in Sep 2010.
[8] Shui Wang, Le Wang. Source code of this paper [http://code.google.com/p/weka-jung-
fpgrowth/downloads/]. 2010.
[9] Joshua O'Madadhain, Danyel Fisher, Tom Nelson et al. Java Universal Network/Graph
Framework [http://jung.sourceforge.net/]. Accessed in Sep 2010.
[10] Frank, A. & Asuncion, A. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine,
CA: University of California, School of Information and Computer Science. Accessed in Sep 2010.
[11] CSDN, Java source for FP-Growth [http://download.csdn.net/source/665781]. Accessed in Sep
2010.
[12] Xinyu Wang, Xiaoping Du, Kunqing Xie, “Research on Implementation of the FP-Growth
Algorithm”, Computer Engineering and Applications, vol. 40, no. 9, pp. 174-176, 2004. (in
Chinese).
[13] C. Borgelt, “An implementation of the FP-growth algorithm”, In Proceedings of OSDM 2005,
pp. 1-5, 2005.
[14] Zi-guang Sun, “Analysis and implementation of the algorithm of FP-growth”, Journal of Guangxi
Institute of Technology, vol. 16, no. 3, pp. 64-67, 2004. (in Chinese).
[15] Jinlong Wang, Can Wen, Shunyao Wu, Huy Quan Vu, “A Visual Mining System for Theme
Development Evolution Analysis of Scientific Literature”, JDCTA: International Journal of Digital
Content Technology and its Applications, vol. 4, no. 3, pp. 21-23, 2010.
[16] Lilin FAN, “Research on Classification Mining Method of Frequent Itemset”, JCIT: Journal of
Convergence Information Technology, vol. 5, no. 8, pp. 71-77, 2010.
[17] Shui Wang, Le Wang, “Mindmap-NG: A novel framework for modeling effective thinking”, In
Proceedings of the 3rd IEEE International Conference on Computer Science and Information
Technology (ICCSIT), vol. 2, pp. 480-483, July 2010.