Boston KM Forum

Feb 21, 2014


How big data becomes actionable information


Tweaked version of Gilbane big data presentation


Other Gilbane Conference impressions


And some open source/content management market dynamics slides


Discussion



Big Data 101 Agenda


Big data in context


Recap


Risks


Recommendations





Big Data in Context


What is “big data”?


Unhelpfully, both “big data” and “NoSQL,” generally considered a key part of the big data wave, are defined more in terms of what they aren’t than what they are


A typical big data definition (Wikipedia):


“[…] data sets that grow so large that they become awkward to work with using on-hand database management tools”


Often associated with Gartner’s volume, variety (and complexity), and velocity model


Also value and veracity considerations




Big Data in Context


Why is big data a big deal now?


The need to deal with really big data sources, e.g., Web site logs, social network activities, and sensor network feeds


Commoditized hardware, software, and networking


Capability and price/performance curves that continue to defy all economic “laws”


Cloud services with radical new capability/cost equations


Maturation and uptake of related open source software, especially Hadoop


Powerful and often no- or low-cost





Big Data in Context


Why is big data a big deal now (continued)?


Market enthusiasm for “NoSQL” systems


Which often simply means Hadoop


Useful and often “open source”/public domain data sources and services


Mainstreaming of semantic tools and techniques


Overall: many things that used to be complex, expensive, and scarce


Are now relatively straightforward, inexpensive, and abundant


Big Data in Context


Big data reality checks


Most decision-makers don’t want big data per se; instead, they probably want


Relevant, accurate, and timely answers to big questions


Including alerts pertaining to questions they may or may not have asked yet


The ability to purposefully analyze information without having to master arcane technologies


It’s more about the ability to formulate and ask big questions (and to effectively analyze and act on answers) than it is about related technologies



A Prime Minicomputer, c1982


Fast-Forward to 2012


Google BigQuery


Hadoop


Hadoop is often considered central to big data


Originating with Google’s MapReduce architecture, Apache Hadoop is an open source architecture for distributed processing on networks of commodity hardware


From Wikipedia:


“‘Map’ step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes


‘Reduce’ step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve”
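
The two steps quoted above can be sketched in a few lines of plain Python. This is only an illustration of the idea on a single machine, not Hadoop's API: the function names (split_input, map_worker, reduce_step, word_count) and the sample lines are assumptions made up for the example, and the "workers" here simply run one after another.

```python
from collections import defaultdict
from itertools import chain


def split_input(records, n_workers):
    """Master: divide the input into n_workers smaller sub-problems."""
    return [records[i::n_workers] for i in range(n_workers)]


def map_worker(chunk):
    """Worker: emit a (word, 1) pair for every word in its chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_step(mapped_chunks):
    """Master: collect the workers' answers and combine them into totals."""
    totals = defaultdict(int)
    for word, count in chain.from_iterable(mapped_chunks):
        totals[word] += count
    return dict(totals)


def word_count(records, n_workers=3):
    chunks = split_input(records, n_workers)
    # On a real cluster each chunk would go to a different node;
    # here the hypothetical workers run sequentially.
    mapped = [map_worker(chunk) for chunk in chunks]
    return reduce_step(mapped)


# Made-up sample input
lines = ["big data big questions", "big answers", "data services"]
counts = word_count(lines)
print(counts)
```

The result does not depend on how many workers the input is split across, which is what lets the same job scale out over commodity hardware.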



Hadoop commercial application domains (from Wikipedia) include


Log and/or clickstream analysis of various kinds


Marketing analytics


Machine learning and/or sophisticated data mining


Image processing


Processing of XML messages


Web crawling and/or text processing


General archiving, including of relational/tabular data, e.g., for compliance


Hadoop


Hadoop is popular and rapidly evolving


Most leading information management vendors have embraced Hadoop


There is now a Hadoop ecosystem




Meanwhile, Back in the Googleplex


Dremel, BigQuery, Spanner, and other really big data projects



Google Now


A NoSQL Taxonomy


From the NoSQL Wikipedia article:







A View of the NoSQL Landscape


Another NoSQL Landscape View

NoSQL Perspectives


The “NoSQL” meme confusingly conflates


Document database requirements


Best served by XML DBMS (XDBMS)


Physical database model decisions on which only DBAs and systems architects should focus


And which are more complementary than competitive with DBMS


Object databases, which have floundered for decades


But with which some application developers are nonetheless enamored, for minimized “impedance mismatch,” despite significant information management compromises


Semantic (e.g., RDF) models


Also more complementary than competitive with RDBMS/XDBMS


Also consider: the “traditional” DBMS players can leverage the same underlying technology power curves






Modeling Abstractions



Conceptual

Resources: documents and links; documents focused primarily on narrative, hierarchy, and sequence

Relations: entities, attributes, relationships, and identifiers

Logical

Resources: model: hypertext; language: XQuery (ideally…)

Relations: model: extended relational; language: SQL

Physical

Indexing (e.g., scalar data types, XML, and full-text), locking and isolation levels (for transactions), federation, replication/synchronization, in-memory databases, columnar storage, table spaces, caching, and more

Data as a Service


The (single source of) truth is out there?...


High-quality data sources are being commoditized


Value is shifting to the ability to discern and leverage conceptual connections, not just to manage big databases


Some resources and developments to explore


Social networking graphs and activities


Data.com (Salesforce.com)


Data.gov


Google Knowledge Graph


Linked Data


Microsoft Windows Azure Data Marketplace


Wikidata.org


Wolfram Alpha





Mainstreaming Semantics



Tools and techniques applied in search of more meaning, e.g.,


Vocabulary management


Disambiguation and auto-categorization


Text mining and analysis


Context and relationship analysis


It’s still ideal to help people capture and apply data and metadata in context


Semantic tools/techniques are complementary



Mainstreaming Semantics



The Semantic Web is still more vision than reality


But Google, Microsoft, Yahoo, and Yandex, for example, are improving Web searches by capturing and applying more metadata and relationships via schema.org schemas in Web pages


And Google’s Knowledge Graph is about “things, not strings,” with, as of mid-2012, “500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects”
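
As a rough illustration of what "schema.org schemas in Web pages" can look like, the sketch below builds a small schema.org description and wraps it for embedding in a page. It assumes JSON-LD, which is one of the syntaxes schema.org markup can be written in; the helper names (event_markup, script_tag) and all the values are placeholders invented for the example, not a real event.

```python
import json


def event_markup(name, start_date, place_name):
    """Build a schema.org Event description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Event",
        "name": name,
        "startDate": start_date,  # ISO 8601 date
        "location": {"@type": "Place", "name": place_name},
    }


def script_tag(markup):
    """Wrap the JSON-LD in the script element that a Web page would embed."""
    body = json.dumps(markup, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values, not a real event
tag = script_tag(event_markup("Example Big Data Talk", "2030-01-15", "Example Hall"))
print(tag)
```

The point is that the page states explicitly that it describes an Event at a Place, so a search engine does not have to infer that relationship from the surrounding text.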




Recap


Commoditization and cloud


Very significant new opportunities


Hadoop and related frameworks


Complementary to RDBMS and XDBMS


NoSQL


Likely headed for meme-bust…


Data services


Game-changing potential


Semantic tools and techniques


Rapidly gaining momentum





Risks


The potential for an ever-expanding set of information silos


Focus on minimized redundancy and optimized integration


GIGO (garbage in, garbage out) at super-scale


New opportunities for unprecedented self-inflicted damage, for organizations that don’t model or query effectively


Cognitive overreach


The potential for information workers to create and act on nonsensical queries based on poorly-designed and/or misunderstood information models


Skills gaps can create competitive disadvantages


Modeling, query formulation, and data analysis


Critical thinking and information literacy




Recommendations


Aim high: big data is in many respects just getting started…


A lot of technology recycling but also significant and disruptive innovation


Work to build consensus among stakeholders on the opportunities and risks


Focus on human skills


e.g., critical thinking and information literacy


For now, an instance of the most creative and powerful type of semantic big data processor we know of is between your ears




[End of tweaked Gilbane presentation]

Gilbane 2012 Impressions


The big themes


Cloud


Social


Mobile


Big data


Web


Other recurring themes


Open source: enterprise-ready for many domains




Gilbane 2012 Impressions


Projections


Consolidation ahead for W*M and ECM vendors


Likely to be accelerated by market uptake of native XML information management systems


And rediscovery of the utility of modern DBMSs

» Along with SQL/XML (e.g., XQuery) synergy


Cloud as accelerator


Ridiculously low entry cost and complexity, relative to earlier on-premises alternatives


Tipping point with other shifts to cloud, e.g., for social, CRM/SFA, and public data sources



Gilbane 2012 Impressions


Projections


New challenges and opportunities for IT groups


Potential to derive unprecedented value from both existing and new information resources


Transition systems to “the cloud”


With or without IT assistance…


Blurring boundaries


Application, document, page…


Ability to apply and capture data and metadata in context, e.g., activity streams



Gilbane 2012 Impressions


Projections


The next critical IT scarcity is not about technology


It is instead the number of people who can


Think critically and structure problems/scenarios


Understand and apply conceptual models


Formulate queries and objectively analyze results

» And generally get into an event/action routine, for work and personal activities


Growing awareness of the critical need for information responsibility


Producer: information quality, integrity, context…


Consumer: information literacy; critical and purposeful thinking



Reference Slides



Content management + open source


Hypertext

Open source examples




Hypertext



Criteria from a 2006 Burton Group report:


A content model based on collections of information items and links


Pervasive support for info item labels


Typed and bidirectional info item relationships


A means of creating, organizing, and sharing info item collections


Journaling (tracking info item changes)


Robust access control privilege management
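
To make the criteria above a little more concrete, here is a minimal sketch of a content model with items, labels, typed links that can be followed in both directions, and a simple journal. It is an assumption-laden illustration, not anything taken from the Burton Group report: the class and method names (Item, Link, Collection, outgoing, incoming) are hypothetical, and access control is left out entirely.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Item:
    item_id: str
    content: str
    labels: set[str] = field(default_factory=set)  # pervasive item labels


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    link_type: str  # typed relationship, e.g. "cites"


class Collection:
    """A named collection of information items and links (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.items = {}
        self.links = []
        self.journal = []  # (timestamp, action, detail) entries

    def _log(self, action, detail):
        self.journal.append((datetime.now(timezone.utc), action, detail))

    def add_item(self, item):
        self.items[item.item_id] = item
        self._log("add_item", item.item_id)

    def label(self, item_id, label):
        self.items[item_id].labels.add(label)
        self._log("label", f"{item_id}:{label}")

    def link(self, source, target, link_type):
        if source not in self.items or target not in self.items:
            raise KeyError("both ends of a link must be in the collection")
        self.links.append(Link(source, target, link_type))
        self._log("link", f"{source}-{link_type}->{target}")

    def outgoing(self, item_id):
        return [l for l in self.links if l.source == item_id]

    def incoming(self, item_id):
        # Bidirectional: a link can also be found from its target
        return [l for l in self.links if l.target == item_id]


notes = Collection("big-data-notes")
notes.add_item(Item("slides", "Big Data 101 deck"))
notes.add_item(Item("report", "Hypertext criteria"))
notes.label("slides", "presentation")
notes.link("slides", "report", "cites")
print(notes.incoming("report"))
```

Storing each link once and querying it from either end is one simple way to get bidirectional relationships; a production system would likely index both directions and enforce privileges on every operation.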

Discussion




peter@okellyassociates.com