IBM Information Server Training

Aug 7, 2012



IBM Software Group

©IBM Corporation

IBM Information Server

Cleanse - QualityStage


IBM Information Server

Delivering information you can trust

Discover, model, and govern information structure and content
Standardize, merge, and correct information
Combine and restructure information for new uses
Synchronize, virtualize and move information for in-line delivery







The IBM Solution: IBM Information Server

Delivering information you can trust

IBM Information Server

Unified Deployment

Unified Metadata Management

WebSphere QualityStage

Data cleansing, standardization, matching, and survivorship for enhancing data quality and creating coherent business views


Need for Data Quality


Critical Problems

Need to create & maintain 360 degree views of customers, suppliers, products, locations, events
Need to leverage data - make reliable decisions, comply with regulations, meet service agreements

Why?

No common standards across organization
Unexpected values stored in fields
Required information buried in free-form fields
Fields evolve - used for multiple purposes
No reliable keys for consolidated views
Operational data degrades 2% per month

Alternative Approaches

Denial - problem misunderstood and ignored until too late; load and explode
Hand-coding - clerical exception processing; very time consuming and resource intensive
Simplistic cleansing apps - evolved from direct marketing & list hygiene, lack flexibility

Kent Fried Chick

Kentucky Fried

Kentucky Fried Chicken

KFC

Molly Talber DBA KFC

Mrs. M. Talber

John & Molly Talber

Talber, KFC, ATIMA

Data Sources Data Values

227G CB&NATURAL STICK

MOZZ WRAPPER

227G CB&NAT STICK P

QUE/MOZZ WRAPP.


Why Should I Care About Cleansing Information?

Lack of information standards
Different formats & structures across different systems
Data surprises in individual fields
Data misplaced in the database
Information buried in free-form fields
Data myopia
Lack of consistent identifiers inhibits a single view
The redundancy nightmare
Duplicate records with a lack of standards
Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116
Catherine Roberts Four sixteen Columbus APT2, Boston, MA 02116
Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116

Name                    | Tax ID       | Telephone
J Smith DBA Lime Cons.  | 228-02-1975  | 6173380300
Williams & Co. C/O Bill | 025-37-1888  | 415-392-2000
1st Natl Provident      | 34-2671434   | 3380321
HP 15 State St.         | 508-466-1200 | Orlando

WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
WING ASSEMBY, USE 5J868-A HEX BOLT .25 - DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)

19-84-103 RS232 Cable 6' M-F CandS
CS-89641 6 ft. Cable Male-F, RS232 #87951 C&SUCH6
Male/Female 25 PIN 6 Foot Cable

90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. Bostno MA 04106

Importance of Data Quality

Low data quality impacts an organization in several ways
Poor data quality leads to misguided marketing promotions
Cross-sell opportunities may be missed because the same customer appears several times in slightly different ways
Valued customers may not be recognized during support calls or other important touchpoints
Data mining is difficult because related items are not detected as related

What is good data quality?
Two percent of “bad” data doesn’t sound that bad?
Two percent of 10M rows means that you have 200K errors
200K errors add up to a big problem for analytics/operations/anything!




Enterprise initiatives…

Supply chain collaboration & item synchronization
Inventory consolidation
Single view of a customer or supplier
ERP Implementations
ERP instance consolidation
IT System renovation
Consolidation resulting from M&A activity
Enterprise Data Warehouse
Compliance & Regulatory projects (SOX, HIPAA, ACCORD, etc.)

…need high quality data…

…to satisfy critical business requirements.

Compliance
Business to Business Standards
Risk Management
Reduce Costs & Increase Productivity
Increase Revenue / CRM Payoff
Business Intelligence Payoff


IBM WebSphere QualityStage

Shared design environment with DataStage increases functionality and reduces development time
Visual match rule interface simplifies match tuning
Service orientation provides ‘continuous’ quality & delivers confidence in your data
Parallel architecture shortens execution time


How will you get an accurate, consolidated view of your business?

Sources: Customers, Transactions, Vendors / Suppliers, Products / Materials

WebSphere QualityStage Process

1. Free Form Investigation
2. Data Standardization
3. Data Matching
4. Data Survivorship

Target: Database with Consolidated Views


Why Investigate

Discover trends and potential anomalies in the data
100% visibility of single domain and free-form fields
Identify invalid and default values
Reveal undocumented business rules and common terminology
Verify the reliability of the data in the fields to be used as matching criteria
Gain complete understanding of data within context




Investigation - Free Form

“The instructions for handling the data are inherent within the data itself.”

Parsing: Separating multi-valued fields into individual pieces

123 | St. | Virginia | St.

Lexical analysis: Determining business significance of individual pieces

123    | St.         | Virginia | St.
number | street type | state    | street type

Context Sensitive: Identifying various data structures and content

123          | St. Virginia | St.
House Number | Street Name  | Street Type
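
The three investigation steps can be sketched in a few lines of Python. This is only an illustration of the idea on the slide's "123 St. Virginia St." example, not QualityStage's rule language: the lookup tables `STREET_TYPES` and `STATES`, the function names, and the single context rule are all assumptions made for the sketch.

```python
# Hypothetical lookup tables, for illustration only.
STREET_TYPES = {"ST", "AVE", "RD", "BLVD"}
STATES = {"VIRGINIA", "GEORGIA", "OHIO"}


def parse(text):
    """Parsing: split a free-form field into individual tokens."""
    return text.split()


def classify(token):
    """Lexical analysis: label one token without looking at its neighbours."""
    word = token.rstrip(".").upper()
    if word.isdigit():
        return "number"
    if word in STREET_TYPES:
        return "street type"
    if word in STATES:
        return "state"
    return "word"


def interpret(tokens):
    """Context-sensitive step (one assumed rule): a leading number is the
    house number, a trailing street type is the street type, and everything
    in between is the street name, whatever its lexical label."""
    labels = [classify(t) for t in tokens]
    if len(tokens) >= 3 and labels[0] == "number" and labels[-1] == "street type":
        return {
            "house_number": tokens[0],
            "street_name": " ".join(tokens[1:-1]),
            "street_type": tokens[-1],
        }
    return None  # pattern not recognised by this sketch


tokens = parse("123 St. Virginia St.")
print(tokens)
print([classify(t) for t in tokens])
print(interpret(tokens))
```

Taken token by token, "St." and "Virginia" look like a street type and a state; only the position of each token in the whole pattern shows that together they form the street name "St. Virginia".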


Rule Sets

Pre-defined rules for parsing and standardizing:
Name
Address
Area (City, State and Zip)
Multi-national address processing

Validate structure:
Tax ID
US Phone
Date
Email

Append ISO country codes
Pre-process or filter name, address and area
Rule sets are stored in the common repository
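
To make "validate structure" concrete, here is a minimal Python sketch that checks whether a value has the shape expected for its field. The regular expressions and the `is_valid` helper are assumptions for illustration, not the product's rule sets; they cover only tax ID, US phone, and email, and accept only the formats seen in the sample data earlier in this deck.

```python
import re

# Illustrative patterns only; not the product's rule definitions.
PATTERNS = {
    # 228-02-1975 style or 34-2671434 style
    "tax_id": re.compile(r"\d{3}-\d{2}-\d{4}|\d{2}-\d{7}"),
    # 415-392-2000 style or ten digits run together
    "us_phone": re.compile(r"\d{3}-\d{3}-\d{4}|\d{10}"),
    # something@something.something, with no spaces
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def is_valid(kind, value):
    """Return True when the whole value matches the pattern for its kind."""
    return PATTERNS[kind].fullmatch(value.strip()) is not None


# Values from the Name / Tax ID / Telephone sample earlier in the deck.
for kind, value in [
    ("tax_id", "228-02-1975"),
    ("tax_id", "508-466-1200"),   # a phone number stored in the Tax ID field
    ("us_phone", "415-392-2000"),
    ("us_phone", "3380321"),      # area code missing
    ("us_phone", "Orlando"),      # a city stored in the Telephone field
]:
    print(kind, value, is_valid(kind, value))
```

A structural check like this only says that a value looks right; it cannot tell whether the number actually belongs to the customer.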




Standardization - Example

Input File:

Address Line 1                    | Address Line 2
639 N MILLS AVENUE                | ORLANDO, FLA 32803
306 W MAIN STR, CUMMING, GA 30130 |
3142 WEST CENTRAL AV              | TOLEDO OH 43606
843 HEARD AVE                     | AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234         | AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 EVANS   | GA 30809

Result File:

House # | Dir | Str. Name | Type | Unit | No. | NYSIIS  | City    | SOUNDEX | State | Zip   | ACCT#
639     | N   | MILLS     | AVE  |      |     | MAL     | ORLANDO | O645    | FL    | 32803 |
306     | W   | MAIN      | ST   |      |     | MAN     | CUMMING | C552    | GA    | 30130 |
3142    | W   | CENTRAL   | AVE  |      |     | CANTRAL | TOLEDO  | T430    | OH    | 43606 |
843     |     | HEARD     | AVE  |      |     | HAD     | AUGUSTA | A223    | GA    | 30904 |
1139    |     | GREENE    | ST   |      |     | GRAN    | AUGUSTA | A223    | GA    | 30901 | 1234
4275    |     | OWENS     | RD   | STE  | 536 | ON      | EVANS   | E152    | GA    | 30809 |
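
The SOUNDEX values in the result file (O645 for ORLANDO, C552 for CUMMING, and so on) are phonetic codes for the city name. A minimal sketch of the standard American Soundex algorithm is shown below; it is a generic textbook version written for this example, and the function name is an assumption, not a QualityStage API. NYSIIS, which the result file applies to the street name, is a different phonetic code and is not covered here.

```python
# Letter groups of the standard American Soundex coding.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (first + "".join(digits) + "000")[:4]


for city in ["ORLANDO", "CUMMING", "TOLEDO", "AUGUSTA", "EVANS"]:
    print(city, soundex(city))
# ORLANDO O645
# CUMMING C552
# TOLEDO T430
# AUGUSTA A223
# EVANS E152
```

Because similar-sounding spellings collapse to the same code, such a field is useful as a blocking or comparison key in the matching step.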


Why Match

Identify duplicate entities within one or more files
Perform householding
Create consolidated view of customer
Establish cross-reference linkage
Enrich existing data with new attributes from external sources


Two Methods to Decide a Match

WILLIAM J KAZANGIAN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN KAZANGIAN 128 MAINE AVE 02110 12/8/62

Are these two records a match?

Deterministic Decision Tables:
Fields are compared
Letter grade assigned
Combined letter grades are compared to a vendor delivered file
Result: Match; Fail; Suspect

B B A A B D B A = BBAABDBA

Probabilistic Record Linkage:
Fields are evaluated for degree-of-match
Weight assigned: represents the “information content” by value
Weights are summed to derive a total score
Result: Statistical probability of a match

+5 +2 +20 +3 +4 -1 +7 +9 = +49
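
The difference between the two methods can be sketched in Python using the per-field results from the slide. The letter grades and weights are the slide's own; the decision table contents, the cutoff values, and the function names are invented for illustration and do not come from the product.

```python
# Per-field results for the two KAZANGIAN records, as shown on the slide.
GRADES = ["B", "B", "A", "A", "B", "D", "B", "A"]
WEIGHTS = [5, 2, 20, 3, 4, -1, 7, 9]

# Hypothetical decision table: grade pattern -> outcome.
DECISION_TABLE = {"AAAAAAAA": "Match", "BBAABDBA": "Suspect"}


def deterministic(grades, table=DECISION_TABLE):
    """Concatenate the letter grades and look the pattern up in a table.
    A pattern that is not listed falls through to "Fail"."""
    pattern = "".join(grades)
    return pattern, table.get(pattern, "Fail")


def probabilistic(weights, match_cutoff=40, clerical_cutoff=20):
    """Sum the field weights and compare the total with two cutoffs.
    Both cutoff values are assumptions made for this sketch."""
    score = sum(weights)
    if score >= match_cutoff:
        outcome = "Match"
    elif score >= clerical_cutoff:
        outcome = "Suspect"
    else:
        outcome = "Fail"
    return score, outcome


print(deterministic(GRADES))   # ('BBAABDBA', 'Suspect')
print(probabilistic(WEIGHTS))  # (49, 'Match')
```

The table lookup only recognises grade patterns someone listed in advance, while the summed score degrades smoothly: a single disagreeing field (the -1 here) lowers the total without forcing a failure.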


Why Survive

Provide consolidated view of data
Provide consolidated view containing the “best-of-breed” data
Resolve conflicting values and fill missing values
Cross-populate best available data
Implement business and mapping rules
Create cross-reference keys




Survivorship - Example

Survivorship Input (Match Output)

Group | Legacy | First  | Middle | Last    | No.  | Dir. | Str. Name  | Type | Unit | No.
1     | D150   | Bob    |        | Dixon   | 1500 | SE   | ROSS CLARK | CIR  |      |
1     | A1367  | Robert |        | Dickson | 1500 |      | ROSS CLARK | CIR  |      |
23    | D689   | Ernest | A      | Obrian  | 5901 | SW   | 74TH       | ST   | STE  | 202
23    | A436   | Ernie  | Alex   | O’Brian | 5901 | SW   | 74TH       | ST   |      |
23    | D352   | Ernie  |        | Obrian  | 5901 |      | 74         | ST   | #    | 202

Consolidated Output

Group | First  | Middle | Last    | No.  | Dir. | Str. Name  | Type | Unit | No.
1     | Robert |        | Dickson | 1500 | SE   | ROSS CLARK | CIR  |      |
23    | Ernie  | Alex   | O’Brian | 5901 | SW   | 74TH       | ST   | STE  | 202

Group | Legacy
1     | D150
1     | A1367
23    | D689
23    | A436
23    | D352
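
One way to read the example is as a set of per-field rules applied to each matched group. The Python sketch below uses two assumed rules, "most frequent value, ties broken by length" for the first name and "longest non-empty value" for every other field. These rules were chosen because they reproduce the consolidated output shown in the example; the slides do not state which rules were actually configured.

```python
from collections import Counter

FIELDS = ["first", "middle", "last", "no", "dir", "street", "type", "unit", "unit_no"]

# Survivorship input rows from the example (group, legacy key, then FIELDS).
ROWS = [
    (1, "D150", "Bob", "", "Dixon", "1500", "SE", "ROSS CLARK", "CIR", "", ""),
    (1, "A1367", "Robert", "", "Dickson", "1500", "", "ROSS CLARK", "CIR", "", ""),
    (23, "D689", "Ernest", "A", "Obrian", "5901", "SW", "74TH", "ST", "STE", "202"),
    (23, "A436", "Ernie", "Alex", "O’Brian", "5901", "SW", "74TH", "ST", "", ""),
    (23, "D352", "Ernie", "", "Obrian", "5901", "", "74", "ST", "#", "202"),
]
records = [dict(zip(["group", "legacy"] + FIELDS, row)) for row in ROWS]


def longest(values):
    """Assumed rule: keep the longest non-empty value."""
    filled = [v for v in values if v]
    return max(filled, key=len) if filled else ""


def most_frequent(values):
    """Assumed rule: keep the most common non-empty value; ties go to the longer one."""
    counts = Counter(v for v in values if v)
    if not counts:
        return ""
    return max(counts, key=lambda v: (counts[v], len(v)))


RULES = {"first": most_frequent}  # every other field falls back to longest()


def survive(recs):
    """Build one consolidated record per group plus a group-to-legacy-key cross-reference."""
    groups = {}
    for rec in recs:
        groups.setdefault(rec["group"], []).append(rec)
    consolidated, xref = [], []
    for group, members in groups.items():
        best = {"group": group}
        for field in FIELDS:
            rule = RULES.get(field, longest)
            best[field] = rule([m[field] for m in members])
        consolidated.append(best)
        xref.extend((group, m["legacy"]) for m in members)
    return consolidated, xref


consolidated, xref = survive(records)
for row in consolidated:
    print(row)
print(xref)
```

The cross-reference list keeps every legacy key attached to its group, so the source systems can still be traced from the consolidated record.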


How Does WebSphere QualityStage Integrate

Database: DB2, Oracle, Sybase, Onyx, IDMS, etc.

Data Extraction and Load Routines

QualityStage

1. Investigation
2. Standardization
3. Integration
4. Survivorship

Target: DB2, Oracle, Sybase, Onyx, IDMS, etc.


WebSphere DataStage and WebSphere QualityStage: Fully Integrated!


QualityStage: Data Quality Extensions

IBM WebSphere QualityStage GeoLocator

IBM WebSphere QualityStage Postal Verification Products
WAVES (WorldWide): IBM WebSphere Worldwide Address Verification Solution

IBM WebSphere QualityStage Postal Certification Products
CASS (United States)
SERP (Canada)
DPID (Australia)

IBM Information Server Data Quality Module for SAP

IBM WebSphere QualityStage for Siebel


Key Strengths for IBM QualityStage


Intuitive, “Design as you think” User Interface


Simple rule design & fine tuning


Seamless Data Flow integration


Intuitive rule design & fine tuning


Defining the technology standard with SOA


Industry leading probabilistic matching engine


Thank You