David Gibbs and Govardhan Tanniru




Georgia State University

Department of Computer Science

P.O. Box 3965
Atlanta, GA 30302-3965


Big Data does not relate only to the size of the data


Complexity: missing information, dummy data,
organization


Processing: Software, processing power, parallel and
distributed computing


Data Transfer: Limitations of current systems, CPU
intensive


Storage: Data sets beyond relational database,
clusters, data centers, distributed data


User Interaction: Non-programmers need to perform complex information processing, real-time GUI interfaces, visualization of data


Primary sources of big data


Meteorology


Complex physics simulations


Biology


Business


Web searching


Social networking


Telecommunications


Many programs for storage and processing


Most Popular: HDFS, GFS, Hadoop, and MapReduce


No standard for processing/storing data



No common “off the shelf” software


Increases the difficulty in mining data within a field or
industry



Storage


Developing a system in which very large amounts of
data can be stored securely and accessed quickly


Transfer


Transfer from the storage site to the processing site


Moving large amounts of data over TCP is costly


Processing


How powerful a system is needed?


“There is a lot of data but no information”


Processing the data in an efficient manner and
obtaining the correct information


NoSQL



Allows storage of massive data sets without the need for
overwhelming tables and indexing


Each cluster stores part of the data and replicates it on
other clusters


Master/Slave architecture

HDFS (Hadoop Distributed File System)


P2P architecture


Cassandra


ColumnFamily data model


Increased difficulty for data mining


No Join operations; data is typically denormalized instead (see the sketch below)


Pulling in more data than needed


Increased transfer times, processing power
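
To make the no-join point concrete, here is a minimal sketch (plain Python dicts, not tied to any particular product) of how a Cassandra-style ColumnFamily layout is typically denormalized: data that a relational design would split into tables and join at query time is written redundantly under one row key, so a single key lookup answers the query, at the cost of pulling back more data than strictly needed.

# Relational design: two tables, joined on user_id at read time.
users = {"u1": {"name": "Alice"}}
posts = [{"post_id": "p1", "user_id": "u1", "text": "hello"},
         {"post_id": "p2", "user_id": "u1", "text": "world"}]

# ColumnFamily-style design: row key -> column name -> value.
# The user's name is copied into the row instead of joined in.
posts_by_user = {
    "u1": {
        "name": "Alice",        # duplicated instead of joined
        "post:p1": "hello",
        "post:p2": "world",
    }
}

# A "query" is just a key lookup: no join, but the whole row comes back,
# which is where the extra transfer time and processing cost come from.
row = posts_by_user["u1"]
print([v for k, v in row.items() if k.startswith("post:")])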



The key advantage of schema-free design is that it enables applications to quickly upgrade the structure of their data without table rewrites.


Data validity and integrity must then be enforced in the application's data management layer rather than by a database schema, as sketched below.
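
A small illustration of that point, using plain Python dicts as stand-ins for schema-free documents (no real database API): new records can carry a new field without rewriting existing ones, and the checks a relational schema would have done are performed by the application instead.

# Hypothetical document store: schema-free records keyed by id.
store = {
    1: {"name": "sensor-a", "reading": 21.5},   # old structure
}

# "Upgrading" the structure needs no table rewrite: new documents simply
# carry the extra field, and readers tolerate its absence in old ones.
store[2] = {"name": "sensor-b", "reading": 19.0, "unit": "C"}

def validate(doc):
    # Integrity checks live in the application's data-management layer,
    # since the store itself enforces no schema.
    assert isinstance(doc.get("reading"), (int, float)), "reading must be numeric"

for doc in store.values():
    validate(doc)
    print(doc.get("unit", "C"), doc["reading"])   # default for old documents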



NoSQL typically does not maintain complete consistency across distributed servers because of the burden this places on databases, particularly in distributed systems.


The Consistency, Availability, Partition tolerance (CAP) theorem states that of consistency, availability, and partition tolerance, only two can be guaranteed at any one time.


Traditional relational databases enforce strict transactional semantics to preserve consistency, but many NoSQL databases have more scalable architectures that relax the consistency requirement.


Some NoSQL databases put objects into a conflict state when conflicting updates occur. However, it is ultimately the responsibility of the application to deal with these conflicts; a sketch of one resolution strategy follows.
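
This is a hedged sketch of application-level conflict resolution using a simple last-write-wins rule keyed on a timestamp carried with each version; real systems often use vector clocks or application-specific merge logic instead.

# Two replicas accepted different writes for the same key while partitioned.
replica_a = {"cart:42": {"value": ["book"], "ts": 1700000010}}
replica_b = {"cart:42": {"value": ["book", "pen"], "ts": 1700000020}}

def resolve(v1, v2):
    # Application-level resolution: last-write-wins on timestamp.
    # A real application might instead merge the two shopping carts.
    return v1 if v1["ts"] >= v2["ts"] else v2

merged = resolve(replica_a["cart:42"], replica_b["cart:42"])
print(merged["value"])   # ['book', 'pen']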


Google File System


Map Reduce


Big Table


Google has reexamined traditional choices and assumptions and explored radically different points in the design space.



First, component failures are the norm rather than the exception.



-> The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.


Second, files are huge by traditional standards. Multi-GB files are common.


Third, most files are mutated by appending new data rather than
overwriting existing data.


Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility.


Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially.

A variety of data share these characteristics.


Appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.


Google has introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them.


Snapshot: creates a copy of a file or a directory tree at low cost.


Record append: allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append (without additional locking).
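
A minimal sketch of what record append looks like from a client's point of view; the AppendOnlyFile class and append_record method are invented stand-ins, not Google's actual client API. The point is that the file system, not the clients, serializes the appends and chooses the offset, so concurrent clients need no coordination among themselves.

import threading

# Hypothetical stand-in for a GFS-style file supporting atomic record append.
class AppendOnlyFile:
    def __init__(self):
        self._records = []
        self._lock = threading.Lock()   # the "file system" serializes appends, not the clients

    def append_record(self, data):
        with self._lock:
            offset = len(self._records)
            self._records.append(data)
        return offset                    # the system chooses and returns the offset

log = AppendOnlyFile()

def producer(name):
    for i in range(3):
        log.append_record(f"{name}:{i}")   # no coordination between the clients themselves

threads = [threading.Thread(target=producer, args=(n,)) for n in ("c1", "c2")]
for t in threads: t.start()
for t in threads: t.join()
print(len(log._records))   # 6 records, each appended exactly once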








Master servers keep metadata on the various data files.


Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes.



Once directed by a master server, a client application retrieves file data directly from chunk servers, as in the sketch below.
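
A rough model of that read path; the master, the chunk servers, and the 64 MB chunk size are represented with plain Python objects, and the method names are illustrative rather than GFS's real RPC interface.

CHUNK_SIZE = 64 * 1024 * 1024   # GFS uses 64 MB chunks

class Master:
    """Keeps only metadata: which chunk servers hold which chunk of which file."""
    def __init__(self, chunk_locations):
        self.chunk_locations = chunk_locations   # (filename, chunk_index) -> [server, ...]

    def lookup(self, filename, offset):
        index = offset // CHUNK_SIZE
        return index, self.chunk_locations[(filename, index)]

class ChunkServer:
    """Stores the actual chunk bytes (a dict here stands in for local disk)."""
    def __init__(self, chunks):
        self.chunks = chunks                     # (filename, chunk_index) -> bytes

    def read(self, filename, index, start, length):
        return self.chunks[(filename, index)][start:start + length]

# One chunk replicated on three servers; the client asks the master once,
# then talks to a chunk server directly for the data.
servers = [ChunkServer({("log", 0): b"hello big data"}) for _ in range(3)]
master = Master({("log", 0): servers})

index, replicas = master.lookup("log", offset=6)
data = replicas[0].read("log", index, start=6, length=8)
print(data)   # b'big data'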





MapReduce is a programming model and an associated implementation for processing and generating large data sets.


Users specify a map function that processes a key/value pair to generate
a set of intermediate key/value pairs.


A reduce function then merges all intermediate values associated with the same intermediate key.


The MapReduce system has three different types of servers.

The master server assigns user tasks to map and reduce servers. It also tracks the state of those tasks.

The map servers accept user input and perform map operations on it. The results are written to intermediate files.

The reduce servers accept the intermediate files produced by the map servers and perform the reduce operation on them.


The steps look like: GFS -> Map -> Shuffle -> Reduce -> Store results back into GFS.

In MapReduce, a map maps one view of the data to another, producing key/value pairs.

Data transferred between map and reduce servers is compressed. The idea is that because the servers aren't CPU bound, it makes sense to spend CPU time on compression and decompression in order to save on bandwidth and I/O, as in the sketch below.
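
A rough illustration of that trade-off: compressing the intermediate key/value records with gzip before they are written and shipped to the reducers spends some CPU to shrink what crosses the network. This uses Python's standard gzip and json modules; the record layout is made up for the example.

import gzip, json

# Intermediate output of a map task: (word, "1") pairs, as in the example below.
intermediate = [("the", "1"), ("data", "1"), ("the", "1")] * 1000

raw = json.dumps(intermediate).encode()
compressed = gzip.compress(raw)          # CPU spent here...
print(len(raw), "->", len(compressed))   # ...to save bandwidth and I/O

# The reduce side pays the matching decompression cost when it reads the file.
restored = json.loads(gzip.decompress(compressed))
assert restored[0] == ["the", "1"]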



map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
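
The same word-count example written as a small runnable Python sketch, so the shuffle step between map and reduce is visible; this is a single-process stand-in, not the distributed implementation.

from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Merge all intermediate values that share the same key.
    return word, sum(counts)

documents = {"doc1": "big data big compute", "doc2": "big table"}

# Shuffle: group intermediate pairs by key before reducing.
groups = defaultdict(list)
for name, text in documents.items():
    for word, count in map_fn(name, text):
        groups[word].append(count)

results = dict(reduce_fn(w, c) for w, c in groups.items())
print(results)   # {'big': 3, 'data': 1, 'compute': 1, 'table': 1}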



BigTable is a large-scale, fault-tolerant, self-managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.


BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL-type queries.



It provides a lookup mechanism to access structured data by key. GFS stores opaque data, but many applications need data with structure.


Machines can be added and removed while the system is running, and the whole system just works.


Each data item is stored in a cell that is addressed by a row key, a column key, and a timestamp (see the sketch below).



BigTable has three different types of servers: master, tablet, and lock servers.
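
As referenced above, here is a hedged sketch of the (row key, column key, timestamp) addressing scheme, using an ordinary Python dict in place of a real tablet server; the function name and example keys are invented for the illustration.

# Cells keyed by (row key, column key); each cell keeps timestamped versions.
table = {
    ("com.example.www", "contents:html"): {1700000001: "<html>v1</html>",
                                           1700000100: "<html>v2</html>"},
    ("com.example.www", "anchor:cnn.com"): {1700000050: "Example"},
}

def read_cell(row, column, timestamp=None):
    # Return the newest version at or before `timestamp` (newest overall if None).
    versions = table[(row, column)]
    eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
    return versions[max(eligible)]

print(read_cell("com.example.www", "contents:html"))              # newest version
print(read_cell("com.example.www", "contents:html", 1700000050))  # older version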





Use ultra-cheap commodity hardware and build software on top to handle its failures.



A 1,000-fold increase in computing power can be had for a 33-times lower cost if you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work.



Many papers focus on the integration of traditional and Big Data architectures.

We need architectures that can handle both types of data.

Below is the diagram from the Oracle white paper.



Knowledge Discovery in Databases.



Bringing the big data and
big compute
communities together is an active area of research.


Hybrid ways of storing unstructured data (file systems and DBMSs).


Efficient data transfer protocols for Big Data (high-performance network data movement).


Use of cloud computing for Big Data.


Compression aspects: I/O Performance Analysis
for Big Data Clustering.


Privacy implications on social networking sites (e.g., friends tagging another person).


Studying known faults in Hadoop might help our research.