Big Data – Extract-Transform-Load (ETL) 002 - DBBest

siberiaskeinΔιαχείριση Δεδομένων

20 Νοε 2013 (πριν από 3 χρόνια και 10 μήνες)

488 εμφανίσεις

Extract
-
Transform
-
Load (ETL)
T
echnologies



Part 2

Sunday, June 09, 2013



By: Dale T. Anderson



Principal Consultant




DB Best, Technologies, LLC


M
y
last

blog (
Extract
-
Transform
-
Load (ETL) Technologies


Part 1
)
discussed
the purpose, function, and some of the
inherent benefits of ETL technolog
ies

f
o
r

moving data from source appl
ication
s

into target reporting

and
analy
tic

environment
s
.


Hopefully f
rom that

discussion one can

gain some general
understanding of ETL; what it is, how it can be used, and why we would want to
use it.
Now l
et’s dive a bit deeper and discover who some of

the key vendors
are in the ETL Tools marketplace.


ETL tools have been around for some time, have evolved
,

matured, and now
present us with productiv
e environments for
Big Data, Data Warehouse, Business Intelli
gence
, and
analytics

processing
.

The ETL
dat
a
processing cycle itself can be simple or highly complex.
Most vendor
offerings address th
e

diverse requirement
s
well enough

although

some do better than others. The
challenges with ETL
tools
come from the other important considerations, like:




Function
ality



Scalability



Performance



Programmability



Parallel Processing



Data Virtualization



Rerunability



Recoverability



Extensibility



Handling Keys (Primary/Foreign)



Attribute Mapping


While the first
three

on this list are essential, p
ersonally I think that programmability
could
easily
be at
the top of the list
.

Obviously my developer bias is showing! d;)
Since all of the vendors below offer a
graphical user interface for creating ETL code they pass my first test for programmability.
It is very nice
to finally
paint the

code with pictures rather than
typing out
syntax
.
Having a GUI design tool for ETL
programming beats out PHP
, Perl,

or SQL scripting any day of the week and twice on Sunday.


There
is

however some important
vendor
diff
erences under the hood that one must consider. Some
ETL
tools
generate actual
compiled
code (like Java) while others generate intermediate code that must be
processed further. This intermediate code is proprietary and requires interpretation by the ETL e
ngine
impacting the execution performance to some degree. While this can be acceptable, I prefer
to work
with

tools that
generate

code any
skilled
programmer can read and understand.



Informatica
(
www.informatic
a.com
)

A comprehensive integration platform that
promotes code
standardization,

unifies
collaboration

between business and IT roles,
and provides capabilities that handle the high volume and wide variety of
today’s business data. The Informatica Platform

has eight distinct technologies
designed to
be
a true industrial strength ETL solution. These
include
:


o

Messaging

o

Complex Event Processing

o

B2B Data Exchange

o

Cloud Data Integration

o

Enterprise Data Integration

o

Application Lifecycle Management

o

Data Quality

o

Master Data Management





Informatica PowerCenter is the ETL tool that empowers an IT organization to
implement highly scalable, high
-
performance data processing maps using a
graphical interface that generates proprietary intermediate code.

This mapping

code, when coupled with a defined
data workflow
plan can then be scheduled
for a variety of execution strategies.


Widely

accepted as an industry front runner, Informatica boasts high
productivity and low cost. In truth, this may be just the opposite. Perhaps high
productivity at a cost is more accurate; in terms of experienced developers,
administrators, and license fees.

Companies who choose to use Informatica
usually have very large IT teams and budgets.


The Informatica Cloud
Integration version however does
offer
easy to use
ETL
programmability without the need for an extensive team or
the
hosting an IT
infrastructur
e. While very cost effective, its features are simply a subset of the
PowerCenter
that

may not provide all the

required

functionality.
Yet while
basic

it may be

a very suitable alternative for many projects at a much lower cost.





Microsoft SSIS
(
http://msdn.microsoft.com/en
-
us/library/ms141026.aspx
)


SQL Server Integration Services, aka: SSIS is
an enterprise
-
level data integration and
transformation solution from Microsoft.
W
idely

used by customers to solve a variety of complex business problems this
ETL tool

does it all
.
As part of the SQL
Server evolution
,

SSIS has its roots in
the BCP (Bulk Copy Program) and DTS
(Data Transformation Services) which
have been around MS SQL Server for
many years.

SSIS works in
direct
correlation with SSAS (SQL Server
Analytics Service) and SSRS (SQL
Server Reporting Service
) but can be
used for standalone ETL process
ing
.

MS SQL Server Management Studio provides the graphical interface for the
creation of

ETL

‘packages’ against MS SQL Server or other data sources. These
packages are executed manually or can be scheduled.
F
undamentally, under
the hood, SSIS is creating T
-
SQL queries coupled with the process logic defined
in the

package.

With the many robust components that are dropped into the
package the developer can easily craft sophisticated ETL processes.

Microsoft
’s

o
verall Business Intelligence stack
places SSIS on the front line for
data access and processing
. As discussed in my previous blog moving data from
a source to a target can

quickly

become highly complex. Understanding the goal
of that data is critical to
how the ETL process
should be

constructed. Mastering
th
is s
tack is something
else
altogether
;

p
erhaps another blog?




Talend
(
www.talend.com
)

Talend has recently surged as an open source
ETL tool that offers a wide variety of data
integration components and a streamlined
graphical user interface.
O
ffer
ing

freely downloadable
flavors

of the Talend
Open Studio

for Big Data, Data Integration and
E
SB;

these fully functional

version
s

allow for the
design, development, of deployable
execution of real data
processing capabilities
.
The
subscription based
Talend Integration Suite
presents the enhanced version providing the additional enterprise function
ality
and features
you
might

expect f
or

more comprehensive projects. These include:


o

Collaboration (with integrated SVN)

o

Automated Deployment (scheduler)

o

Load Balancing (multiple job processing servers)

o

Monitoring (administration console)

o

Data Quality Pro
filing

o

Master Data Management

o

Big Data connectors (ie: Hadoop
, Cassandra, MongoDB
)

o

Context Variable Management (Global/Local)


As an Open Source technology, Talend
’s latest release

now boasts the first and
only ETL vendor to support 100% Map Reduce functionality. Additional support
for Hadoop, Cassandra, MongoDB, Hive, and even PIG is available.

Expect more
to come!


Talend’s architecture
demonstrates a practical approach to the c
omplexities
involved in deployment and execution of the data integration process. Highly
scalable, this architecture is not only easy to understand but through the Talend
Administration Center is a
straightforward

to configure and maintain.



Unlike othe
r tools
(
which separate these two tightly interwoven functionalities
)
,
as
a
n Eclipse
-
based editor, Talend Integration Studio provides a heterogeneous
application and data integration viewpoint for crafting ETL
processes.
Talend
code generation options for

either Perl

or Java eliminates proprietary
intermediate code and streamlines job execution. Talend

also
boast
s

more than
400 connectors providing virtually unlimited connectivity to enterprise systems,
Big Data,
databases, software packages, mainframes,
files, web services, and the
like.

The tMap component delivers highly complex transformation

capabilities
allowing for multiple data source integration, lookups, error handling, and a
wide variety of data processing options.
I have been using Talend for
over three
years now and admittedly find it the most
productive ETL tool
at a more
competitive price.


Actual product offerings from Talend are
packaged in segments designed to
focus on project needs. These include:


o

Big Data

o

Data Integration

o

Business
Process Management

o

Data Quality

o

Enterprise Service Bus

o

Master Data Management


These open source products are available as free use, non
-
supported downloads
(
http://www.talend.com/products/talend
-
open
-
studio
) and offer considerable
features and functionality for developing working ETL processes.

Talend ‘On Demand’ is the
one of the
first ETL vendor
s

to offer an Open Source
data integration solution in the cloud

(
the
Informatica
C
loud offering is not
Open Source). This SaaS model is free for up to two users and presents a great
starting point for smaller projects or prototyping.






Pentaho
(
www.pentaho.com
)

Coupling

data integration with embedded
analytics, Pentaho sets itself apart from
most of its competitors. This platform offers
business users an interactive and visualized approach that can be deployed in
the cloud or on
-
premise.
Focused on the business analyti
cs aspects of data
integration, Pentaho can manage very large volumes and variety of data.
Visualization, data mining, reporting, and interactive dashboards simplify the
development process that provides rapid deployment platform for a broad set
of users.


Pentaho Data Integration (PDI) can extract data from complex and diverse data
sources offering high performance ETL capabilities, including:


o

Rich graphical visual design environment

o

Broad connectivity to any type or source of data

o

Enterprise
scalability and performance

o

High performance multi
-
threaded processing

o

Data Profiling and Quality

o

Open and Standards
-
based Architecture

o

Big Data integration for Hadoop, NoSQL, and Analytic databases

o

Integrated reporting, data visualization/analysis, and pr
edictive analytics

o

Salesforce and SugarCRM plug
-
ins


ETL developers will appreciate the PDI designer which offers an intuitive, drag
and drop GUI, a
robust
library of pre
-
built components, data transformation
mappings, content versioning and locking, and a
n integrated debugger. With
the centralized monitoring, scheduler, and security, PDI

delivers a reliable
solution for data migration and synchronization, application consolidation,
business intelligence, and data warehousing.



The PDI Community Edition (PDI CE), also known as

Kettle

, is available as a free
download (
http://kettle.pentaho.com
). This self
-
supported open source
software provides many of the key features and functionality

found in the
Enterprise Edition (PDI EE).

Along with Kettle you can find ‘Mondrain’ the
Pentaho Analysis Services Community Edition which, as an OLAP Server, claims
to enable business users to analyze large amounts of data in real
-
time. I am
unable to d
ispute or confirm this claim, however in my experience, achieving
even near
-
real
-
time is difficult making it (IMHO) somewhat dubious.

To wrap up it is fair to point out that there are several other ETL contenders in today’s marketplace. For
practical reas
ons this blog does not dive into them further, yet let’s at least give

some of

them mention.
They include:




Astera
(
www.astera.com
)

A relative

new comer to the crowded ETL
marketplace, Astera’s Centerprise Data
Integrator is focused on easy to use, easy to
manage, enterprise
-
level integration
tool
for everyone.
As
a
100% managed
.NET code

product
, Astera provides a reliable, scalable, and
secure platform

for
flexible, reusable, high
-
performance processing
; at a cost. However, it is
perhaps the only ETL tool that supports EDI transactions.




Clover
(
www.cloveretl.com
)

Clover’s slick
graphical
design
er
helps visualize data flows while
offering a wide variety of flexible components for complex data
integration requirements. As an open
-
source software product,
their Community Edition get
s

you started right away, yet its
proprietary intermediate process

layer may not provide the
most robust development environment.




Adeptia
(
www.adeptia.com
)

Offering a wizard
-
driven, “configure, not
code”
ETL
approach, Adeptia eliminates
complexity thus reducing data integration
c
osts while
improving

deployment

and ROI
for business users. Adeptia’s low cost commercial product provides many
features;

however is not open
-
source source software and thus only trial period
evaluations are available.


So there you have it. More than y
ou might ever want
to know about ETL and Data Integration vendors. Yet, I
think ETL data processing and all its complexities cannot
be ignored. My hope is that this information, while
incomplete at its best, and perhaps
slightly
inaccurate
in its
technical reflection at its worse, still edifies you,
the gentle reader, in a constructive way.


I look forward
to
any
comment, question, or debate


ETL ON

~ Dudes
!