A Relational Algebra Processor

tenderlaSoftware and s/w Development

Dec 13, 2013 (3 years and 7 months ago)

71 views

A Relational Algebra Processor

6.375 Final Project

Ming Liu,
Shuotao

Xu

Motivation


Today’s Database Management Systems (DBMS): software
running on a standard operating system on a general
purpose CPU


DBMS frequently used in analytics and scientific
computing, but bottlenecked by:


Processor speed, software overhead, latency & bandwidth


Proposal: FPGA
Based Relational Algebra Processor



Host PC

(DBMS)

FPGA

Relational
Algebra
Processor

Physical
Storage

2

Background
|Relational

Algebra (RA)


Many database queries are fundamentally decomposable
to five basic RA operators


Although SQL is capable of much more

Operator


Functions

Selection

Filter rows based on a Boolean condition

Projection

Eliminate selected attributes (columns) of a table; remove
duplicated results

Cartesian
Product

Combine several tables with unique attributes

Union

Combine several tables with the same attributes

Difference

Select rows of several tables where the rows do not match

Design dedicated processors on the FPGA for each operator

3

Project Goal


Design and implement an in
-
memory relational algebra
processor on the FPGA


Explore the types of queries that can benefit from FPGA
acceleration



Secondary: Outperform SQLite!



Some assumptions:


32
-
bit wide table entries


Tables fit in memory


Max number of columns is 32


Read only


4

Microarchitecture |
Host Software

5

FPGA

Microarchitecture |
Top
-
Level
RAProcessor

Host PC

(C++
functions)

RA
Processor

DRAM

PCIe

Host PC

(DBMS)

RA
Processor

Physical
Storage

6

Microarchitecture |
Row
Marshaller

7


Exposes a simple interface for
operators to access tables in
DRAM


Address translation, burst
aggregation, truncation &
alignment


Multiplexes requests


Table values sent/received as
32
-
bit bursts

Microarchitecture |
Selection

8


Filters rows based on predicates
(e.g. age < 40)


16 predicate evaluators


Internally comparators


A tree of gates to qualify the
predicates


Max: 4 ORs of 4 ANDs

Microarchitecture |
Projection

9


Select columns of a table


Column mask one
-
hot encoded


Do not need to buffer row; operate directly on data bursts

Microarchitecture |
Binary Operators

10


Cartesian
Product, Union, Difference and
Deduplication


Nested loop implementation

Microarchitecture|
Inter
-
operator

Bypassing

11


Operators enabled concurrently; data
passed between operators


No intermediate storage


Conditions:

1.
A singly
link of unary
operators

2.
Each operator has a single target output

3.
No structural
hazard


Software reorders and schedules the
RA commands


Data source/destination encoded in
command

Microarchitecture|
Inter
-
operator

Bypassing

12


Multiple 32
-
bit wide output FIFOs to other operators


Implementation Evaluation

13


Timing


Maximum Frequency: 55.786MHz


Critical Path: Row
Marshaller

mux


Area


Slice Registers: 50%


LUTs: 85%


BRAM/FIFOs: 47%

Modules

Slice Registers

LUTs

BRAM/FIFOs


TOTAL

34649
(50%)

59328
(85%)

71 (47%)


Row
Mashaller

2804

6627

0


Controller

4570

6277

29


Selection

3137

19633

0


Projection

739

654

0


Cartesian

Product

1935

1478

0


Union

1939

1983

0


Difference

1875

1949

0


Deduplication

1822

1970

0

Performance Benchmark |
Setup

14


SQLite


Internal
SQLite timer to report execution time of the query


Thinkpad

T430, Core
i7
-
3520M @ 2.90Ghz, 1x8GB DDR3
-
1600


RA
Processor


Performance counters: cycles from start to
ack

of an operator

Table

Relational Algebra Query

SQL Query

1 table

100k
x 30

SELECT,starLong,tableOut
,


mass
,>,
80000,AND,pos_x
,>,
10,


OR,pos_x
,<,
pos_z
,


OR,col12
,>,
col14, AND,col20
,<,col21

SELECT

* FROM
starLong

WHERE mass
> 80000
AND
pos_x

> 10


OR
pos_x

<
pos_z


OR col12
>
col14 AND col20
<
col21;

1 table

100k x
30

PROJECT,starLong,tableOut
,


pos_x,col19,col25,col29

SELECT pos_x,col19
, col25,
col29 FROM
starLong
;

2 tables

1k
x 30

UNION,starMed1,starMed2,starUnion

SELECT * FROM starMed1

UNION

SELECT * FROM starMed2;


2 tables

1k
x 30

XPROD,starMed1,starMed2,starXprod

RENAME,starXprod,0,iOrder0,1,mass0,8,phi0

SELECT,starXprod,starFiltered
, iOrder0,=,
iOrder
,
AND,phi0
,>,1
,


AND,mass0,>,mass

PROJECT,starFiltered,starOut,mass0

SELECT

s1.mass

FROM starMed1
s1
, starMed2
s2

WHERE s1.vx
> s2.vx


AND s1.phi
> 1


AND s1.mass
>
s2.mass;

Performance Benchmark |
Results

15

0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Select
Project
Union
Difference
Xprod
Dedup
Complex Join
Time (s)
-

Lower is better

Query

Query Execution Time

FPGA RA Processor
SW SQLite

Limitation: Memory Bandwidth: 200MB/s
vs

12.8GB/s



Performance Benchmark |
Results

16

0
0.02
0.04
0.06
0.08
0.1
0.12
1
2
4
8
16
Time (s)
-

Lower is better

Number of Predicates

Select (Filter) Execution Time with Varying
Number of Predicates

FPGA RA Processor
SW SQLite

Select operator
most competitive with SQLite


What happens with more predicates?

Improvements

17


Increasing data burst width


32
-
bit to 256
-
bit:
potential 8x speedup


Area/critical path increase


Maximizing memory bandwidth


Additional row buffers to
buffer data
from DDR2 Memory


Larger, faster DRAM; Higher clock
speed




Conclusion & Future Work

18


Complex filtering operations performs well on the FPGA


Better than SQLite with sufficient memory bandwidth


Data intensive operators do not perform well


Future opportunities:


An accelerator alongside SQLite


Integration with HDD/SSD controller