group1


14 Dec 2013


By Team 1

Jun-Duo Chen
Ying Luo
Yichen Li

COMP 6521
Advanced Database Systems and Applications



Project 1: External Sorting Algorithm



Problem Statement



Design Principles



Implementation details



Optimization



Results


Problem Statement



Goal: External sorting



Specifications:

Algorithm: two-phase multiway merge sort (2PMMS)

Language: Java

Restriction: 5 MB of main memory

Design Principles


Automatic integration of multi-pass Phase 2


Modular design for areas intended for optimization


Optimization (Wish List)


Buffer management
  Total buffer size
  Data type
  Number/size of buffers (2PMMS)

I/O module
  Selection of optimal reader / writer

Sorting algorithm
  Quicksort, radix sort, etc.


Implementation Details


Fixed total buffer size

Int data type

Variable maximum size for each buffer

Fixed minimum buffer size

Single-pass Phase 2 for the large sample dataset

BufferedReader & BufferedWriter

Java's built-in Arrays.sort (Quicksort, according to the documentation)
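
The two phases listed above can be sketched roughly as follows. This is only an illustration, not the team's code: it assumes the input holds one integer per line, and the class name TwoPhaseSortSketch and the runSize parameter (how many integers fit in the memory budget) are hypothetical. Phase 1 sorts each chunk with Arrays.sort and writes it out as a run; Phase 2 merges all runs in a single pass with a min-heap.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class TwoPhaseSortSketch {

    /** Phase 1: read up to runSize integers at a time, sort them, write each chunk as a run. */
    static List<Path> createSortedRuns(Path input, int runSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            int[] buffer = new int[runSize];
            int filled = 0;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                buffer[filled++] = Integer.parseInt(line);
                if (filled == runSize) {
                    runs.add(writeRun(buffer, filled));
                    filled = 0;
                }
            }
            if (filled > 0) runs.add(writeRun(buffer, filled));
        }
        return runs;
    }

    private static Path writeRun(int[] buffer, int length) throws IOException {
        Arrays.sort(buffer, 0, length); // in-memory sort of one chunk
        Path run = Files.createTempFile("run", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(run)) {
            for (int i = 0; i < length; i++) {
                out.write(Integer.toString(buffer[i]));
                out.newLine();
            }
        }
        return run;
    }

    /** Phase 2: single-pass k-way merge; heap entries are {value, index of source run}. */
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (int i = 0; i < runs.size(); i++) {
                BufferedReader reader = Files.newBufferedReader(runs.get(i));
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) heap.add(new int[] {Integer.parseInt(first), i});
            }
            while (!heap.isEmpty()) {
                int[] smallest = heap.poll();
                out.write(Integer.toString(smallest[0]));
                out.newLine();
                // Refill the heap from the run that just supplied the smallest value.
                String next = readers.get(smallest[1]).readLine();
                if (next != null) heap.add(new int[] {Integer.parseInt(next), smallest[1]});
            }
        } finally {
            for (BufferedReader reader : readers) reader.close();
            for (Path run : runs) Files.deleteIfExists(run);
        }
    }

    static void sort(Path input, Path output, int runSize) throws IOException {
        mergeRuns(createSortedRuns(input, runSize), output);
    }
}
```

A call such as TwoPhaseSortSketch.sort(input, output, runSize) runs both phases. With a 5 MB limit, runSize would have to be chosen so that the int buffer plus the reader and writer buffers stay within the budget; the slides do not give the value the team used.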


Results

Execution time is 14-15 s



Project 2: Mining Frequent Itemsets from Secondary Memory



Problem Statement



Algorithms Considered



Chosen Algorithm and Motivation


Description of Algorithm


Design Principles


Result


Problem Statement



Compute frequent itemsets of all sizes (pairs, triples, quadruples, etc.) given a file containing all transactions



Restricted to 5 MB of main memory and one disk for secondary storage




Algorithms Considered

Apriori




Generates a large number of candidates, which is a problem under limited main memory



Requires multiple scans of the file to obtain candidates and counts, which incurs heavy I/O


Algorithms Considered

PCY



Designing the hash function is non-trivial



Potential I/O degradation due to increased processing time between transactions



Algorithms Considered

FP Tree



Tree nodes consume a lot of memory



Better performance on large data files, since the data is compressed



Requires enough main memory to hold the compressed data


Chosen Algorithm and Motivation

Improved Apriori algorithm


A triangular matrix is used to save memory


Description of algorithm



Customized algorithm for the first three passes to reduce memory consumption.



Uniform algorithm for all remaining passes.



Description of algorithm

Pass one: generate frequent items



Read and parse the input file.



Update the candidate item list and counts.



When an item reaches the support threshold, add it to the frequent-item list.



Output: frequent items and counts.
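
As a rough sketch of pass one, assuming each line of the input is one transaction of whitespace-separated integer item IDs (the slides do not specify the file format). The class name PassOneSketch and the minSupport parameter are hypothetical, and for simplicity the sketch filters by threshold after the scan rather than while counting.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class PassOneSketch {

    /** Counts every item in the file and returns the items whose count reaches minSupport. */
    static Map<Integer, Integer> frequentItems(BufferedReader in, int minSupport) throws IOException {
        Map<Integer, Integer> counts = new HashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) continue;
            // Assumed format: item IDs separated by whitespace, one transaction per line.
            for (String token : line.split("\\s+")) {
                counts.merge(Integer.parseInt(token), 1, Integer::sum);
            }
        }
        // Keep only the items that meet the support threshold, ordered by item ID.
        Map<Integer, Integer> frequent = new TreeMap<>();
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minSupport) {
                frequent.put(entry.getKey(), entry.getValue());
            }
        }
        return frequent;
    }
}
```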




Description of algorithm

Pass two: generate frequent pairs (A-Priori algorithm)



Read and parse the input file.



Loop over the frequent items with a double loop to generate candidate pairs.


Store candidate pair counts in a triangular matrix:

$k = (i-1)\left(n - \frac{i}{2}\right) + j - i$, where $k$ is the index into the count array for the pair $\{i, j\}$ with $1 \le i < j \le n$.



When the support threshold is reached, add the pair to the frequent-pair list.



Output: frequent-pair list and counts.
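
The triangular-matrix layout from pass two can be sketched as below, under the assumption that items are renumbered 1..n and that the textbook form of the index, k = (i - 1)(n - i/2) + j - i for 1 <= i < j <= n, is the one intended. The class name TriangularCounts is hypothetical. The product (i - 1)(2n - i) is always even, so the integer division is exact.

```java
final class TriangularCounts {
    private final int n;        // number of items, numbered 1..n
    private final int[] counts; // one slot per unordered pair: n(n-1)/2 in total

    TriangularCounts(int n) {
        this.n = n;
        this.counts = new int[n * (n - 1) / 2];
    }

    /** 1-based position k of the pair {i, j}, for 1 <= i < j <= n. */
    static int index(int i, int j, int n) {
        if (i < 1 || i >= j || j > n) {
            throw new IllegalArgumentException("need 1 <= i < j <= n");
        }
        // (i - 1)(n - i/2) rewritten as (i - 1)(2n - i)/2 to stay in integer arithmetic.
        return (i - 1) * (2 * n - i) / 2 + j - i;
    }

    void increment(int i, int j) {
        counts[index(i, j, n) - 1]++; // shift to the 0-based Java array
    }

    int get(int i, int j) {
        return counts[index(i, j, n) - 1];
    }
}
```

For n = 4 the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) map to k = 1 through 6, so every slot is used exactly once and no space is spent on the i >= j half of a full n x n matrix.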




Description of algorithm

Pass three: generate frequent triples (A-Priori algorithm)



Store the frequent pairs in a HashSet.



Read and parse each line of the input file.



Loop over the items with a double loop, checking each pair against the HashSet to generate candidate pairs.



Generate candidate triples from the candidate pairs and count them.



When the support threshold is reached, add the triple to the frequent-triple list.



Output: frequent-triple list and counts.
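
One possible reading of pass three, as a sketch only: a triple is counted for a transaction when all three of its pairs are in the HashSet of frequent pairs. The class name PassThreeSketch, the pairKey encoding of a pair as a single long, and the in-memory list of transactions are assumptions made for illustration; the slides do not show how the team encoded pairs or streamed the file. Items within a transaction are assumed to be distinct.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PassThreeSketch {

    /** Packs an ordered pair (a < b) into one long so it can live in a HashSet<Long>. */
    static long pairKey(int a, int b) {
        return ((long) a << 32) | (b & 0xFFFFFFFFL);
    }

    static Map<List<Integer>, Integer> frequentTriples(
            List<int[]> transactions, Set<Long> frequentPairs, int minSupport) {
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (int[] raw : transactions) {
            int[] t = raw.clone();
            Arrays.sort(t); // sorted items give each triple one canonical order
            for (int a = 0; a < t.length; a++) {
                for (int b = a + 1; b < t.length; b++) {
                    // Double loop: keep only pairs that survived pass two.
                    if (!frequentPairs.contains(pairKey(t[a], t[b]))) continue;
                    for (int c = b + 1; c < t.length; c++) {
                        // Extend the pair to a triple only if the other two pairs are frequent too.
                        if (frequentPairs.contains(pairKey(t[a], t[c]))
                                && frequentPairs.contains(pairKey(t[b], t[c]))) {
                            counts.merge(Arrays.asList(t[a], t[b], t[c]), 1, Integer::sum);
                        }
                    }
                }
            }
        }
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }
}
```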


Description of algorithm

Remaining passes: generate frequent itemsets (A-Priori algorithm)



Read and parse the input file.



Generate candidates from the output of the previous pass and count each candidate set.



When the support threshold is reached, add the itemset to the frequent-set list.



Output: frequent-set list and counts.
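
The slides do not say how candidates are derived from the previous pass. A common choice, shown here only as an assumption, is the classic Apriori join-and-prune step: two frequent (k-1)-itemsets that share their first k-2 items are joined into a k-candidate, which is kept only if every (k-1)-subset is also frequent. The class name CandidateGenerationSketch is hypothetical, and itemsets are assumed to be non-empty sorted lists of equal size.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CandidateGenerationSketch {

    /** Builds k-item candidates from the frequent (k-1)-itemsets of the previous pass. */
    static Set<List<Integer>> nextCandidates(Set<List<Integer>> previous) {
        Set<List<Integer>> candidates = new HashSet<>();
        for (List<Integer> x : previous) {
            for (List<Integer> y : previous) {
                int last = x.size() - 1;
                // Join step: same prefix, and x's last item smaller than y's last item.
                if (!x.subList(0, last).equals(y.subList(0, last))) continue;
                if (x.get(last) >= y.get(last)) continue;
                List<Integer> candidate = new ArrayList<>(x);
                candidate.add(y.get(last));
                // Prune step: drop the candidate if any (k-1)-subset is not frequent.
                if (allSubsetsFrequent(candidate, previous)) {
                    candidates.add(candidate);
                }
            }
        }
        return candidates;
    }

    private static boolean allSubsetsFrequent(List<Integer> candidate, Set<List<Integer>> previous) {
        for (int skip = 0; skip < candidate.size(); skip++) {
            List<Integer> subset = new ArrayList<>(candidate);
            subset.remove(skip); // removes the element at index skip
            if (!previous.contains(subset)) return false;
        }
        return true;
    }
}
```

The surviving candidates would then be counted in one scan of the input file, in the same way as in the earlier passes.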




Design Principles



Data structures:
  Item, Pair, Triple and FrequentItemSet
  ArrayList
  int[]

Memory management:
  Only the output of the previous pass is kept in main memory.

I/O design:
  BufferedReader and BufferedWriter





Demo Result

Program execution time is 121 seconds.


Thank you!