Love the Data


Love the Data

By Neil Hepburn (Dir. of Education, IRMAC)

Three Stories About Data Management


Love the Data. Three Stories About Data Management by IRMAC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Canada License. Based on a work at wikipedia.org.

2

Neil Hepburn

Speaker Bio and Relevant Experience


Bio


Data Architect for Empathica Inc.


Education:


Honours Bachelor of Mathematics in Computer Science from the University of Waterloo


Certified Data Management Professional (Mastery Level)


PMI Certified


18 years in IS/IT, in both full-time and external consulting capacities, with a focus on Data Management over the past 7 years


GM of Marketing for an innovative iPhone app for Internet radio discovery


Director of Education for IRMAC (Toronto chapter of DAMA-I)


Relevant Experience


Consultant to Bell Mobility assisting in a reboot of their Market Analytics and Intelligence programme


Developed and implemented Predictive Analytics model for Call Genie, directly advising their CEO and SVP of Marketing


Technical lead on Business Intelligence Modernization project at Empathica


3

Neil Hepburn

Presentation Roadmap


Why am I giving this presentation?


The Story of the Whiz Kids


The Story of the Relational Model


The Story of Twitter Analytics

4

Neil Hepburn

Why am I giving this presentation?


Data Management is an important discipline as we move to an increasingly data-driven society that relies on quality data to make fact-based decisions


Data Management exists at the intersection between technology and
business


Requires understanding the underlying meaning of the data and how
it relates to the business


Requires mastery of technology required to assist in the production,
transformation, and consumption of information


Most IT personnel have a Computer Science degree or similar educational
background


Computer Science and IT programs don’t generally teach data
management


data is regarded as little more than a “stage prop”


Databases are regarded as “bit buckets”


“Garbage in garbage out” is the prevailing attitude in IT departments


Data management is seen as a techno-bureaucracy


5

Neil Hepburn

Story of The Whiz Kids: The World Today


Current wave of “Cultures of Analytics” has begun to capture the popular imagination. In the last three years we have seen the following books released:


The Numerati (by Stephen Baker)


Competing on Analytics (by Thomas Davenport & Jeanne
Harris)


Supercrunchers (by Ian Ayres)


Data Driven: Profiting from your most Important Asset (by
Thomas C. Redman)


The Information (by James Gleick)


Much of the inspiration behind these books originates from “Moneyball: The Art of Winning an Unfair Game” (by Michael M. Lewis), which documents the success of the Oakland ‘A’s through “Sabermetrics”: taking an analytical approach to team picks and real-time game strategy


It’s all good stuff, but really nothing new…


6

Neil Hepburn

Where did Evidence Based Management Begin?


Some companies were using data analytics to gain a
competitive advantage


The very use of analytics was regarded as a secret weapon,
and those employed in statistical analysis were warned not to
discuss their work


In 1908, William Sealy Gosset was employed by Arthur
Guinness


Gosset applied statistics to both farm and brewery to
determine the best yielding varieties of barley


Gosset also invented the “Student’s t-distribution”, which got its name from Gosset’s pseudonym “Student”


7

Neil Hepburn

Who were The Whiz Kids? (Pt. I)


The Whiz Kids trace their roots back to the US Air Force under the command of Robert A. Lovett (Assistant Secretary of War)


In 1939 Tex Thornton (who was the first Whiz Kid), hired nine other Whiz Kids from the
Harvard Business School including Robert McNamara and Edward Lundy


The team called themselves “Statistical Control” and committed themselves to a new
managerial discipline, basing all decisions on numbers


Statistical Control saved $3.6 billion for the Air Force in 1943 alone, while at the same
time improving pilot and troop morale



After WWII, Tex Thornton sold all 10 Whiz Kids as a team to the Ford Motor Co., reporting directly to then-president Henry Ford II


Upon arrival, The Whiz Kids discovered the finance department was designed solely for
IRS tax purposes, and was not a tool of management

8

Neil Hepburn

Who were The Whiz Kids? (Pt. II)


The Whiz Kids got off to a rocky start when two layers of management were inserted between them and Henry Ford II


Tex Thornton left the company, going on to head Litton Industries


Were ridiculed as “The Quiz Kids” (after a popular game show)


Nevertheless, through the techniques and discipline learned from Statistical Control, The Whiz Kids were able to provide substantial cost savings, while at the same time growing market share


After turning Ford around, they were relabelled “The Whiz Kids”


McNamara was the first to recognize safety as a feature and attempted to
introduce seat belts as a standard feature (tragically, this decision was
collectively snubbed by the entire auto industry, delaying their introduction)


Ed Lundy transformed finance from an IRS compliance cost centre into a reporting powerhouse, establishing the CFO as the right-hand man of the CEO


By 1960, Robert McNamara had been promoted to president of the company and was the first-ever non-family member to run the company


McNamara left the company shortly after to become JFK’s Secretary of
Defence

9

Neil Hepburn

A Tale of Two Whiz Kids


Jack Reith was “a car guy”


Robert McNamara saw automobiles as consumer appliances, like a washing machine or refrigerator: simply a means of transportation


Jack Reith took it upon himself to get involved in design decisions with the Ford Edsel, and conceived the Mercury Comet


The Mercury Comet reflected Reith’s own convictions about driving as a romantic pastime


Both cars bombed, leading to Reith’s departure


McNamara learned that Volkswagens were gaining market share.


It was common wisdom among auto execs that only “beatniks” were purchasing Volkswagens


McNamara commissioned a market research study, discovering that customers were often doctors and lawyers


Also learned that buyers purchased Volkswagens due to a design that made them easier to repair in one’s own driveway


McNamara commissioned the Ford Falcon, which went on to be a
top selling car


McNamara continued to rise at Ford, soon becoming president

10

Neil Hepburn

Lessons Learned From The Whiz Kids


They had the buy-in and full support of president Henry Ford II


They were disciplined and forced themselves to adhere to their own principles


As measured by IQ, they were the most intelligent persons Ford had ever hired. Robert McNamara in
particular was “off the charts”


They acted as specialized generalists (i.e. Versatilists):


Were as adept at data collection and statistical analysis as they were at leading and negotiating


Could perform each other’s tasks, but were focussed on a particular role


Continued to learn and seek out best practices


E.g. They implemented some of Peter Drucker’s teachings, such as divisional structuring


Their experience in the Air Force infused them with a humility and maturity allowing them to operate effectively within a large organization


In spite of their nickname “The Whiz Kids”, they were not Prima Donnas


They were competitive amongst themselves and were fiercely driven to demonstrate “bottom line”
measurable improvements

11

Neil Hepburn

Smart People will Always Make Bad Decisions


Jonah Lehrer’s book “How We Decide” should be required reading for
all analysts


The book explains why even the best of us are prone to make bad decisions


All too often, good information is wilfully ignored



Even McNamara made some famously bad decisions after he left Ford


As Secretary of Defence during the Vietnam War, Robert McNamara continued to order the use of Agent Orange, in spite of a report from the RAND Corporation showing that it did not help


McNamara disagreed with Edward Lansdale (a general who successfully led a counter-insurgency campaign in the Philippines), and ignored all his “unconventional wisdom”


McNamara (under LBJ) arguably rationalized poor decisions he had already made on poor information and refused to consider any new information



Therefore, if we are to truly act in a rational manner we must above all else embrace humility

12

Neil Hepburn

The Story of the Relational Model


Relational Databases such as Oracle, DB2, SQLServer,
PostgreSQL, MySQL, and Access have all been around for a
while


at least since the 1970s


What came before relational databases?


Who invented the relational model and why?


Why is there a “holy war” between the relational purists and
object oriented purists?


What are NOSQL (Not Only SQL) databases?


Why were they invented?

13

Neil Hepburn

Punched Card Era


pre magnetic storage


In 1725 punched cards were used in France by Basile Bouchon and Jean-Baptiste Falcon to control textile looms


Technique was improved by Jacquard in 1801


In 1832 Semen Korsakov (a Ukrainian working for the Russian government) invented a search system using punched cards


In 1890 Herman Hollerith invented a punch card and tabulating
machine for the United States Census


Size was the same as an 1887 dollar bill


Enough for 80 columns and 12 rows (80x25 still exists in terminals, e.g. the Windows 7 DOS terminal)


Hollerith left the US government and founded the Tabulating Machine Company in 1896. This company became IBM


14

Neil Hepburn

1930s and 1940s: The Information Age Begins


In 1936 Alan Turing introduced the “Universal Turing Machine”
as a thought experiment


Demonstrated that all computers are fundamentally the same


Divorcing the concept of computing from all physical
implementations


In 1947 at AT&T’s Bell Labs, the first working transistor was
created


In 1948 Claude E. Shannon, working at Bell Labs, published the seminal paper “A Mathematical Theory of Communication”


Shannon introduced the concept of “bit” and showed how all
information could be reduced to a stream of bits


Shannon’s paper sparked new thinking in practically every domain, and in particular led to huge paradigm shifts in physics, chemistry, biology, psychology, and anthropology


Randomness = Complexity = Information




15

Neil Hepburn

Early 1950s


pre generalization era


Scientific applications dominated early 1950s
with a shift to business administrative systems by
the end of the decade


Standard application packages were rare; most software was written for the customer (the money was in the hardware)


Payroll was the first “killer app”


General Electric set the standard for payroll
processing in 1954 running on a Univac


Difficult to deal with special-case handling.


Was more complicated than missile control systems


Essential Complexity!


Programmers spent much of their time writing
low level data access and manipulation routines


A need to hide the complexity of data
manipulation and retrieval from application
programmers was well recognized

16

Neil Hepburn

Late 1950s


pre DBMS era (Pt. 2)


Software abstraction (known as
“generalization”) began to take hold


Sorting was one of the first things to be “generalized” into re-usable code across customer installations


Report Generation Program was first
developed in 1957 by GE’s team at the
Hanford Nuclear Reservation on its IBM 702


Consumed as input a data dictionary and a
file containing desired report format
(including calculated fields)


Share Data Processing Committee (like
today’s Open Source communities)


First met October 2nd, 1957, chaired by Charles Bachman


17

Neil Hepburn


1960s

General Electric’s IDS

In 1961 Charles Bachman first developed IDS (Integrated Data Store) at General Electric


Was made possible by new “random access” disk
technology


as opposed to sequential tapes


Developed as the DBMS for a Manufacturing Information
and Control System (MIACS) used for GE’s High Voltage
Switchgear (HVSG) Department


Later sold externally to Mack Trucks and Weyerhaeuser


World’s first true transaction-oriented DBMS


Followed a Network Model


Data Element relationships were explicitly encoded and
had to be explicitly traversed


Application programs had to be modified to take
advantage of new indexes


Was later sold to B.F. Goodrich


Was modernized to behave more like an RDBMS and was rebranded IDMS (Integrated Database Management System)


Currently being sold by CA, running on IBM mainframes


18

Neil Hepburn

1960s


IBM’s IMS/360


In 1963 IBM was asked to build a “data base” for the
Apollo space mission, to manage parts inventory


IMS (Information Management System) was originally
built in collaboration with Rockwell Space Division
and released in 1965 for IBM 7000 series hardware


Utilized a hierarchical data model


In 1966 IMS was moved under the development of OS/360 (under the leadership of Fred “Mythical Man Month” Brooks), and IMS was rebranded as IMS/360


Available for routine use at Rockwell on August 14th, 1968


IMS/360 led to many changes to OS/360 itself to
provide nonstop operation and recovery


IBM also developed an alternative DBMS called GIS
(Generalized Information System). GIS supported
more flexible querying, but never achieved the
success of IMS


IMS 11 currently runs on IBM’s system z mainframes,
and continues to sell well in telecom, airlines, and
finance

19

Neil Hepburn

1965–1973: DBTG and System/360 years


In 1965 Codasyl (Conference on Data Systems Languages) forms
the DBTG (Data Base Task Group)


Was led by Charles Bachman (inventor of IDS)


DBTG’s mission was to create a DBMS standard


Standardized terms such as “record”, “set” and “database”, and
added the term “schema” to describe logical format of data.


Some terms would later change. (e.g. “Data Structure Class”
is now referred to as a “Data Model”)


In 1964 IBM’s System/360 was designed to support software
compatibility between varying hardware platforms


In 1968 IBM began unbundling software, consulting services, and
training services


In 1969, the DBTG published a language specification for a
“Network Database Model” known as the “Codasyl Data Model”


ANSI and ISO adopted the “Codasyl Data Model” calling it
“Network Database Language” (NDL). ISO 8907:1987


Standard was eventually withdrawn in 1998 due to being
superseded by SQL standardization


Confluence of the DBTG recommendations, System/360 and IBM’s
unbundling of software led to an explosion of DBMS vendors


In 1972 there were 82 vendors offering 275 packages for the life
insurance industry


Major DBMSs were: IDMS, IMS, Cincom Total, System 2000, Adabas, and Datacom/DB

Fun fact: In 1971, the Data Base Users Group was formed in Toronto (later renamed to IRMAC [Information Resource Management Association of Canada]), which went on to become part of DAMA-I, and is still recognized as the first operating chapter of DAMA-I.

Fun fact: Tom Nies, CEO of Cincom, is the longest serving CEO of any IT company.

20

Neil Hepburn

1969: Enter the Relational Model


In 1969, Edgar F. Codd, working out of IBM’s San Jose Research Laboratory, internally published a paper titled “A Relational Model of Data for Large Shared Data Banks”


The paper was published externally in 1970 in Communications of the ACM


The Relational Model was grounded in pure mathematics.


Set Theory (relational algebra) and First Order Logic (relational calculus)


The Relational Model proved to be better aligned with how the business viewed data


Perspective Neutral:

Shifted responsibility of specifying relationships between tables
from the person designing them to the person querying them


Necessary for establishing large, general purpose databases shared between
different departments and computer systems


Non-procedural (i.e. declarative). Tell the RDBMS WHAT you want, not HOW to get the data (see the short example after this list)


IBM initially passed on implementing Codd’s recommendations for fear of cannibalizing sales of IMS


In 1973 IBM began working on System R, based on Codd’s relational model, but the software architects were cut off from Codd and did not entirely understand the relational model


IBM eventually released a relational database, DB2, which is to this date their de facto database solution
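To make the declarative point concrete, here is a minimal sketch (the table and column names are hypothetical, not from the talk): the query states WHAT is wanted and the optimizer decides HOW to retrieve it, in contrast to a navigational DBMS where the programmer walks owner/member chains record by record.

-- Hypothetical SQL example: names of Toronto customers with open orders
SELECT c.customer_name
FROM   customer c
JOIN   orders   o ON o.customer_id = c.customer_id
WHERE  c.city   = 'Toronto'
  AND  o.status = 'OPEN';
-- The RDBMS, not the application programmer, chooses the access path.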


Fun fact: Codd was born in England, and moved to the US in 1948 to work for IBM as a programmer. In 1953, fed up with McCarthyism, he moved to Ottawa, Ontario and lived there for a decade before moving back to the US.

21

Neil Hepburn

1970s Commercial Implementations (RDMS and INGRES)


The first relational database was RDMS (Relational Data Management System) at MIT by
L.A. Kraning and A.I. Fillat


Written in PL/1 for Multics OS


The relation concept is implemented as a matrix of reference numbers which refer to character-string datums stored elsewhere in distinct dataclass files


In 1973 two scientists at Berkeley, Michael Stonebraker and Eugene Wong, learned of the System R project and sought funding to create a relational database of their own


Stonebraker and Wong already had funding for a geographic database called Ingres (INteractive Graphics REtrieval System). They decided to abandon this project and pursue an RDBMS


Additional funding came from the National Science Foundation, Air Force Office of
Scientific Research, the Army Research Office, and the Navy Electronic Systems
Command


INGRES was developed at UC Berkeley by a rotating group of students and staff. An initial prototype was released in 1974.


Ran on DEC UNIX machines


INGRES was quasi open source. You could purchase the source code for a fee, and
build on it.


Used a query language called Quel (as opposed to SQL)


Many companies released products based on the INGRES source code.


Most successful company was Relational Technology Inc (RTI)


Robert Epstein was one of the lead developers who went on to found Sybase


Sybase’s flagship RDBMS was eventually licensed by Microsoft and lives on as MS SQL Server

range of e is employee
retrieve (comp = e.salary / (e.age - 18))
    where e.name = "Jones"

22

Neil Hepburn

1980s Commercial Implementations (Oracle and Informix)


Oracle was founded in 1977 by Larry Ellison, Bob Miner, and Ed Oates.


The original name of the company was Software Development Laboratories
(SDL), which became Relational Software Inc (RSI), and eventually was
named after their flagship product Oracle.


Ellison wanted to make a product that was compatible with IBM’s System R, although this was not possible since IBM kept the error codes secret.


Oracle derived early success because it was written in C, and was easier to
port to other hardware platforms


Oracle beat out Ingres by 1985 since it had standardized on SQL (as opposed
to Ingres’ Quel), which was more popular.


SQL was in fact based on IBM System R’s non-relational “SEQUEL” (Structured English Query Language)


Oracle out-marketed Ingres


Informix (INFORMation on unIX) was founded in 1981 by Roger Sippl and
Laura King


In 1985 Informix introduced a new product, ISQL, which moved database access code into the query engine (as opposed to requiring the client to perform direct C-ISAM manipulations)


Informix was a pioneer in, and set the stage for, client-server computing, which came to dominate in the 1990s

Fun fact: The name Oracle comes from the code name of a CIA project which the Oracle founders had all worked on while at the Ampex Corporation.

23

Neil Hepburn

The 90s: Object Oriented Databases (OODBMS)


In 1988 Versant became the first company to introduce an
OODBMS (object oriented data base management system)


The Object Data Management Group was formed in 1991, and ratified the Object Definition Language (ODL) and the Object Query Language (OQL)


Sybase took a different approach and introduced Stored
Procedures


Coupling code and data inside the RDBMS, a key OOP principle


ANSI SQL (and RDBMS vendors) continue to add complex
datatypes and operators to their offerings:


Geometric datatypes and operators


Spatial datatypes and operators


Hierarchy datatypes and operators


Oracle added Binary Large Objects (BLOBS) and recently
Microsoft has added FILESTREAM support


OODBMSs have made a comeback in cloud computing


Salesforce.com / force.com / database.com


Access 2010 Web DB (running on SharePoint 2010)

SELECT manufacturer, AVG(SELECT part.pc.ram FROM partition part)
FROM PCs pc
GROUP BY manufacturer: pc.manufacturer

Type Date Tuple {year, day, month}
Type year, day, month integer

Class manager attributes(id : string unique name : string phone : string
    set employees : Tuple { [Employee], Start_Date : Date })

Class Employee attributes(id : string unique name : string
    Start_Date : Date manager : [Manager])

24

Neil Hepburn

Codd’s 12 Rules and Date’s Third Manifesto


Codd observed that no vendor had correctly implemented the relational
model. To clarify his intent he published 13 (0 to 12) basic conditions
that must be met in order for a DBMS to be considered “relational”


To this date, no vendor can satisfy all 13 rules. E.g.:


Updatable VIEWs are nigh impossible to implement


“Completeness constraint” cannot easily be implemented


In spite of the popularity of RDBMS, starting in the 1980s, and
continuing through to the present, Christopher Date (who worked with
Codd on the relational model) believed that commercial RDBMSs were
not truly relational


In 1995 Christopher Date and Hugh Darwen published the Third Manifesto


Major theme of Third Manifesto is that the relational model is not
flawed. Rather RDBMS vendors have not correctly implemented it.


In particular, SQL is flawed


Describes a new language called “D” to address SQL’s shortcomings


Dataphor is a DBMS implemented with D4 (a later version of “D”)




“Rel” is implemented in Java as an interpretation of Date’s manifesto


SQL continues to evolve in order to address its deficiencies

D4:
T group by { A } add { Concat(B, C order by { A, B }) Bs }

Oracle 11.2 SQL:
select A, listagg(B, C) within group (order by B) as Bs from T group by A

25

Neil Hepburn

The Object Relational Impedance Mismatch “Holy War”

Fun fact: The term object-relational impedance mismatch is derived from the electrical engineering term impedance matching.


A philosophical, never-ending, and somewhat imagined debate exists between the “relationalists” and the “object-oriented” camp


Set-oriented vs. Graph-oriented


Thinking in sets vs. thinking in fine-grained, discrete objects


Data models within object oriented programs (e.g. Java, C#) don’t align
with relational data models


Much time is spent interfacing with relational databases


ORM (Object Relational Mapping) layers like Hibernate and ADO.NET
Entity Framework allow OOPs developers to persist data from their own
object models within an RDBMS


Creates a “virtual object database”


Some limitations still exist


Performance issues can arise (especially in joins and batch
deletions)


Often leads to application-centric data models


Key data elements required for downstream reporting are often left
out


ORM-centric languages also exist (e.g. Groovy)


RDBMS-centric people prefer accessing data via stored procedures (a small sketch follows this list)


Creates clean separation between RDBMS and application


Some RDBMSs support extensions in non-RDBMS languages (e.g. SQLServer allows functions and stored procs to be written in C# or VB.NET, as well as custom built-in scalar and aggregate functions)
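As a hedged illustration of the stored-procedure approach (object names are hypothetical, shown in SQL Server’s T-SQL dialect), the application calls a procedure and never touches the tables directly:

-- Hypothetical T-SQL sketch: callers only see the procedure's contract
CREATE PROCEDURE dbo.GetCustomerOrders
    @CustomerId INT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT o.order_id, o.order_date, o.total_amount
    FROM   dbo.orders AS o
    WHERE  o.customer_id = @CustomerId
    ORDER BY o.order_date DESC;
END;
-- Application code: EXEC dbo.GetCustomerOrders @CustomerId = 42;
-- The table layout can change without breaking callers, giving the clean
-- separation between RDBMS and application described above.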

26

Neil Hepburn

The Semantic Web and Linked Data


Relational Model generally operates under the “Closed World
Assumption”: What is not known to be true is assumed to be false


NULL values are the exception that proves the rule


Semantic Web is based on the opposite, the “Open World Assumption”


Because relational databases are centralized they guarantee data
integrity, and users can safely apply First Order Logic to derive new
facts


The Semantic Web, which is decentralized, cannot provide the same
guarantees of integrity. However, it more closely resembles the organic
(warts and all) nature of the Internet, and in turn the benefits that come
with decentralization


Semantic Web is a set of technologies under the purview of the W3C.
They include:


RDF (Resource Description Framework): metamodel based on a subject, predicate, object pattern


SPARQL (SPARQL Protocol and RDF Query Language): SQL-like language for querying RDF data


Triplestore: database for storing RDF data


Semantic Web projects:


DBPedia (converting Wikipedia into RDF)


FOAF (Friend of a Friend)


Linking Open Data (one project to rule them all)


Ad hoc integration through web APIs seems to be more popular

PREFIX abc: <nul://sparql/exampleOntology#>

SELECT ?capital ?country
WHERE {
    ?x abc:cityname ?capital ;
       abc:isCapitalOf ?y .
    ?y abc:countryname ?country ;
       abc:isInContinent abc:Africa .
}

27

Neil Hepburn

The Big Data Challenge and NOSQL


Big Data represents a class of problems which were hitherto seen as
unrelated, but can in fact be solved with the same tools


Tracking (Geo Location and Proximity, Ads, RFID, you name it)


Causal Factor Discovery


Smart Utility Meters


Genomics Analysis


Data bag (Entity Attribute Value, on-the-fly data modeling)


Two basic approaches:


Extended RDBMS (e.g. Columnar MPP RDBMS)


Leverages existing data warehouse tools, skills, and data models


Slower load times


Does not work well with unstructured data


NOSQL, Hadoop/MapReduce


Evolving set of tools, both low-level and high-level


Can deal with any kind of data, including BLOBs


Still cannot solve the problem of joining 1 billion dimensions to 1 trillion facts


Other NOSQL DBs


MongoDB, CouchDB: Document Oriented (JSON). Supports ad hoc
data models and flexible querying


Redis, HBase: Key Value, Real Time analytics, Complex Event
Processing


Cassandra, Riak: Work well under heavy writes. Cassandra started at Facebook

28

Neil Hepburn

The Real Challenge of Data Management


Consider the challenge of managing your own personal data and optimizing your own life; everything here is related:


Finances


Courses


Home property (and all your possessions)


Telephone, Television, Internet


Personal Computer


Automobile, expenses, maintenance


Groceries


Dependents


Is an ultra-powerful, ultra-flexible database the solution?


Maintaining quality data requires tremendous discipline and sacrifices


Most companies can barely manage their Customer Master Data


Duplication of data is still commonplace


The real solutions are unglamorous, but separate the winners from losers:


Master Data Management


Metadata Management


Data Governance


Enterprise Frameworks and Data Models


Cloud-based RDBMS: a good Swiss Army knife. Even MS Access will do.

29

Neil Hepburn

The Story of TopGun Twitter Analytics


Or... How To Build a Twitter Data Warehouse from public APIs and open
source RDBMS and ETL tools


...and keep the Open Source Code and run your own Twitter monitoring
program

30

Neil Hepburn

Step 1: Choose your Subjects


Subjects are the most important WHATs


Always nouns


Our subjects?


Tweets


Twitterers

31

Neil Hepburn

The Art of Analytics: Deciding on Facts


In general, it’s difficult to know what questions to ask of our subjects; that is the art of Analytics


KPIs (Key Performance Indicators) help us determine which facts (quantitative data) to track


Also helps us think about how we would like to pivot
around these facts. I.e. What qualitative (dimension)
data we wish to also capture


Altimeter Group has some fancy sounding ones:


share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocate impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact


Let’s start simple:


Follower count, following count, number of URL click-thrus


Decide on a partition key


Tweet Date (UTC) is an obvious one


For now this is not a priority

32

Neil Hepburn

The Art of Analytics: Deciding on Dimensions


Dimensions represent the qualitative attributes pertaining to the
subject


If our subject is a tweet, the following dimensions are useful:


Keyword searched for to find Tweet


Time of Tweet (both GMT and local time)


Text of Tweet


Twitterer who tweeted tweet


Location of Twitterer


Software Client used to send out tweet (e.g. TweetDeck)


Web sites referenced by Tweet


We can continue to add dimensions, as we see necessary


Once we have our facts and dimensions, we can now create a data
model


Denormalized “Star Schema” is a tried-and-true approach to data warehouse modeling

33

Neil Hepburn

The Science of Analytics: Build out our Schema in RDBMS
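The original slide showed the schema as a diagram. As a stand-in, here is a minimal MySQL-style sketch of what such a star schema might look like; the table and column names are assumptions for illustration, not the project’s actual model.

-- Hypothetical star schema sketch (MySQL syntax)
CREATE TABLE dim_twitterer (
    twitterer_key   INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key
    twitter_user_id BIGINT NOT NULL,                 -- natural key from the API
    screen_name     VARCHAR(50),
    location        VARCHAR(100),
    follower_count  INT,
    following_count INT,
    valid_from      DATETIME,                        -- used for Type II SCD handling
    valid_to        DATETIME                         -- NULL = current row
);

CREATE TABLE dim_keyword (
    keyword_key INT AUTO_INCREMENT PRIMARY KEY,
    keyword     VARCHAR(100)
);

CREATE TABLE fact_tweets (
    tweet_id       BIGINT PRIMARY KEY,   -- Twitter's own ID
    tweet_date_utc DATETIME,             -- candidate partition key
    twitterer_key  INT,                  -- FK to dim_twitterer
    keyword_key    INT,                  -- FK to dim_keyword
    tweet_text     VARCHAR(140),
    client_name    VARCHAR(100),         -- e.g. TweetDeck
    url_clickthrus INT
);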

34

Neil Hepburn

The Science of Analytics: Data Definitions


Always include a plain English definition for every data element


Ideally the data definition is unambiguous, accurate, states what it is (as opposed to what it isn’t), and means the same to everybody (a small example follows)
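One lightweight way to keep a plain-English definition next to the data element itself is a column comment; a small hedged sketch in MySQL syntax (names assumed, matching the earlier schema sketch):

-- Hypothetical sketch: attach the business definition directly to the column
ALTER TABLE fact_tweets
    MODIFY COLUMN url_clickthrus INT
    COMMENT 'Number of times any URL contained in this tweet was clicked, as reported by the URL shortener';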

35

Neil Hepburn

The Science of Analytics: Use Staging Tables


Staging tables are mirrors of your fact tables


(e.g. Staging_fact_tweets = fact_tweets)


Staging tables allow you to prepare your fact table data without incurring the performance hits that normally occur when manipulating massive tables (see the sketch below)
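A hedged sketch of the staging pattern (table names assumed, matching the earlier schema sketch): bulk-load the raw pull into the staging mirror, clean it there, then publish to the fact table in one set-based statement.

-- Hypothetical staging-table sketch (MySQL syntax)
CREATE TABLE staging_fact_tweets LIKE fact_tweets;   -- same structure as the fact table

-- 1. Bulk-load the latest API pull into staging (fast, no impact on reporting)
-- 2. Clean/deduplicate in staging, then publish only the new rows:
INSERT INTO fact_tweets
SELECT s.*
FROM   staging_fact_tweets AS s
LEFT JOIN fact_tweets AS f ON f.tweet_id = s.tweet_id
WHERE  f.tweet_id IS NULL;

TRUNCATE TABLE staging_fact_tweets;  -- ready for the next load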

36

Neil Hepburn

The Science of Analytics: Use an ETL tool to load data


ETL (Extract, Transform, Load) tools are purpose-built for loading data warehouses. Advantages include:


Easy to write code that runs safely in parallel


Configuration-oriented: safer to change in live production environments


Visual metaphor: self-documenting code. Easier for others to understand and support

37

Neil Hepburn

Instrument ETL with job and data checkpointing


Data checkpointing ensures we only retrieve data that is new or that which has
changed


All Tweets have an auto-incrementing ID, and we can restrict our searches by this, using it as a data checkpoint (as sketched after this list)


Since pulling data from web APIs is inherently unreliable, our jobs must be designed
so that they keep track of the last successful operation, and can recover from that
point


We also need to wait on Twitter’s Rate Limit (of 150) to reset
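A hedged sketch of how the data checkpoint might be stored and used (table and column names assumed): keep the last successfully loaded tweet ID per topic and pass it to the search API as the lower bound on the next run.

-- Hypothetical checkpoint table (MySQL syntax)
CREATE TABLE etl_checkpoint (
    topic         VARCHAR(100) PRIMARY KEY,
    last_tweet_id BIGINT NOT NULL,      -- highest tweet ID successfully loaded
    last_run_utc  DATETIME NOT NULL
);

-- Read the checkpoint before calling the search API (used as the since_id bound):
SELECT last_tweet_id FROM etl_checkpoint WHERE topic = 'NoSQL';

-- After a successful load, advance the checkpoint:
UPDATE etl_checkpoint
SET    last_tweet_id = (SELECT MAX(tweet_id) FROM fact_tweets),
       last_run_utc  = UTC_TIMESTAMP()
WHERE  topic = 'NoSQL';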

38

Neil Hepburn

Ensure history is accurate with Type II SCDs


An SCD (Slowly Changing Dimension) is a dimension (e.g. a tweet’s Twitterer)
whose attributes change gradually over time


By adhering to a Type II SCD policy, we archive all changes to a Twitterer and match each tweet to the attributes that were valid when it occurred (see the sketch below).
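A hedged sketch of a Type II SCD change on the Twitterer dimension (column names assumed, matching the earlier schema sketch): when an attribute such as follower count changes, the current row is closed off and a new row is inserted, so old facts keep pointing at the attribute values that were in effect when they occurred.

-- Hypothetical Type II SCD sketch (MySQL syntax)
-- 1. Expire the currently valid row for this Twitterer:
UPDATE dim_twitterer
SET    valid_to = UTC_TIMESTAMP()
WHERE  twitter_user_id = 12345
  AND  valid_to IS NULL;

-- 2. Insert a new row carrying the changed attributes:
INSERT INTO dim_twitterer
    (twitter_user_id, screen_name, location, follower_count, following_count,
     valid_from, valid_to)
VALUES
    (12345, 'example_user', 'Toronto', 1042, 180, UTC_TIMESTAMP(), NULL);
-- New facts reference the new surrogate key; old facts keep the old one.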

39

Neil Hepburn

Start your engines


Set up some topics (e.g. NoSQL)


Enter some keywords for the topic


Begin running TopGun Twitter Analytics to commence data collection

40

Neil Hepburn

Load Data into BI Tool (or just query using SQL)


Some BI tools may require you to build an OLAP data model


OLAP tools build “cubes” which contain the aggregation of every fact, for every
combination of dimension value


MOLAP tools handle sparsity well, and can achieve excellent compression, even
for billions of dimension tuples
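If you skip the BI tool and just query with SQL, a hedged example of the kind of question the star schema can answer (names assumed, matching the earlier schema sketch):

-- Hypothetical query: tweet volume and average follower count per keyword per day
SELECT k.keyword,
       DATE(f.tweet_date_utc)  AS tweet_day,
       COUNT(*)                AS tweet_count,
       AVG(d.follower_count)   AS avg_followers
FROM   fact_tweets   AS f
JOIN   dim_keyword   AS k ON k.keyword_key   = f.keyword_key
JOIN   dim_twitterer AS d ON d.twitterer_key = f.twitterer_key
GROUP BY k.keyword, DATE(f.tweet_date_utc)
ORDER BY tweet_day, tweet_count DESC;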

41

Neil Hepburn

Presentation Over: Download the source code


Includes Pentaho DI ETL source code, MySQL data model, and QlikView v10 load script.


Licensed as Open Source under the GNU General Public License v3.0



Can be downloaded from SourceForge


https://sourceforge.net/projects/topgun/


NB: Requires bit.ly developer key (free)


IP address is used to rate limit Twitter user lookups (be aware if you’re sharing an
IP, or using Twitter for other purposes)


Questions can be e-mailed to: education@irmac.ca