Big Data

Kirk Borne

George Mason University


LSST All Hands Meeting

August 13-17, 2012

Characteristics of Big Data (1a)

- Big quantities of data are acquired everywhere.
- It is now a big issue in all aspects of life: science, business, healthcare, government, social networks, etc.

Characteristics of Big Data (2)

- Big quantities of data are acquired everywhere.
- It is now a big issue in all aspects of life: science, business, healthcare, government, social networks, etc.
- But … what do we mean by “big”?
  - Gigabytes? Terabytes? Petabytes? Exabytes?
  - The meaning of “big” is domain-specific and resource-dependent (data storage, I/O throughput, computation cycles, communication costs).
  - I say … we all are dealing with our own “tonnabytes”.

Characteristics of Big Data (3)

- There are 4 dimensions to the Big Data challenge:
  1. Volume (“tonnabytes” data challenge)
  2. Variety (complexity, curse of dimensionality)
  3. Velocity (rate of data and information flowing at us)
  4. Verification (verifying inference-based models from data)
- Therefore, we need something better to cope with the tonnabytes: Data Science = Informatics & Statistics

This graphic says it all …

(Graphic provided by S. G. Djorgovski, Caltech)


- Clustering: examine the data and find the data clusters (clouds), without considering what the items are = Characterization!
- Classification: for each new data item, try to place it within a known class (i.e., a known category or cluster) = Classify!
- Outlier Detection: identify those data items that don’t fit into the known classes or clusters = Surprise!
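
The three tasks above can be tied together in a small sketch. It assumes the clusters have already been found and are summarized by their centroids. A new item is then classified by its nearest centroid, or flagged as an outlier when it is too far from every known class. The function names, the sample data, and the `max_distance` cutoff are all hypothetical choices made for illustration, not part of the talk.

```python
import math

def centroid(points):
    """Mean position of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def classify_or_flag(item, known_classes, max_distance):
    """Return the label of the nearest class centroid, or None for an outlier.

    known_classes maps a label to the points already assigned to that class.
    max_distance is an illustrative cutoff, not a recommended value.
    """
    best_label, best_dist = None, math.inf
    for label, members in known_classes.items():
        d = math.dist(item, centroid(members))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= max_distance else None

# Two made-up clusters in a 2-D parameter space
known = {
    "A": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    "B": [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)],
}

print(classify_or_flag((0.4, 0.6), known, max_distance=2.0))    # classify
print(classify_or_flag((5.0, 5.0), known, max_distance=2.0))    # surprise
```

Nearest-centroid assignment is only one of many possible classifiers. It is used here because it makes the link between "known cluster" and "known class" explicit.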

Data-Enabled Science: Scientific KDD (Knowledge Discovery from Data)

- Characterize the new (clustering, unsupervised learning)
- Assign the known (classification, supervised learning)
- Discover the unknown (outlier detection, semi-supervised learning)

The two major benefits of BIG DATA:

1. best statistical analysis of “typical” events
2. automated search for “rare” events

Graphic from S. G. Djorgovski

Basic Astronomical Knowledge Problems (1)

- The clustering problem:
  - Finding clusters of objects within a data set
  - What is the significance of the clusters (statistically and scientifically)?
  - What is the optimal algorithm for finding friends-of-friends or nearest neighbors?
    - N is > 10^10, so what is the most efficient way to sort?
    - Number of dimensions ~ 1000, therefore we have an enormous subspace search problem
  - Are there pair-wise (2-point) or higher-order (N-way) correlations?
    - N is > 10^10, so what is the most efficient way to do an N-point correlation?
    - Algorithms that scale as N^2 log N won’t get us there
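
To make the scaling concern concrete: a brute-force comparison of all pairs among N = 10^10 objects needs N(N-1)/2, about 5 x 10^19, distance evaluations. The sketch below shows one common way to avoid that for friends-of-friends grouping. Points are hashed into grid cells whose side equals the linking length, so each point is compared only with points in adjacent cells, and a union-find structure merges friends into groups. This is an illustration under stated assumptions, not the algorithm used by any LSST pipeline, and the function name and sample data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def friends_of_friends(points, linking_length):
    """Group point indices into clusters. Two points are 'friends' when they
    lie within linking_length of each other; friendship is transitive."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Hash every point into a cell whose side equals the linking length,
    # so any friend of a point must sit in the same or an adjacent cell.
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[tuple(math.floor(c / linking_length) for c in p)].append(idx)

    offsets = list(product((-1, 0, 1), repeat=len(points[0])))
    for cell, members in grid.items():
        for off in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in grid.get(neighbor, ()):
                    if i < j and math.dist(points[i], points[j]) <= linking_length:
                        union(i, j)

    clusters = defaultdict(list)
    for idx in range(len(points)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())

sample = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.1), (20.0, 20.0)]
print(friends_of_friends(sample, linking_length=0.6))
```

Note the limitation: the number of adjacent cells grows as 3^d, so this grid trick only helps in a few dimensions. In a ~1000-dimensional parameter space it is useless as written, which is exactly the subspace search problem raised above.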

Basic Astronomical Knowledge Problems (2)

- Outlier detection (unknown unknowns):
  - Finding the objects and events that are outside the bounds of our expectations (outside known clusters)
  - These may be real scientific discoveries or garbage
  - Outlier detection is therefore useful for:
    - Novelty Discovery: is my Nobel prize waiting?
    - Anomaly Detection: is the detector system working?
    - Data Quality Assurance: is the data pipeline working?
  - How does one optimally find outliers in 10^3-D parameter space, or in interesting subspaces (in lower dimensions)?
  - How do we measure their “interestingness”?
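
One simple candidate for an "interestingness" measure is the distance from each object to its k-th nearest neighbor: isolated objects get large scores. The sketch below is a brute-force illustration of that idea. The choice of score, the value of `k`, and the sample data are assumptions made here, not something the talk prescribes.

```python
import math

def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor.

    Larger scores mean more isolated points. This brute-force version
    compares every pair, so it is only suitable for small samples.
    """
    if k < 1 or k >= len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Five made-up objects near the origin plus one far away
sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
scores = knn_outlier_scores(sample, k=3)
most_isolated = max(range(len(sample)), key=scores.__getitem__)
print(most_isolated, scores[most_isolated])
```

A score like this says nothing about why an object is isolated. Separating a real discovery from a detector or pipeline fault still needs follow-up.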


Basic Astronomical Knowledge Problems (3)

- The dimension reduction problem:
  - Finding correlations and “fundamental planes” of parameters
  - Number of attributes can be hundreds or thousands: the Curse of High Dimensionality!
  - Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another?
  - Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
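
For the linear case, the question about eigenvectors can be illustrated with a minimal principal-component sketch: build the covariance matrix of the parameters and find its leading eigenvector by power iteration. This covers linear combinations only. The function names and the toy data are hypothetical, and a production analysis would use a numerical library instead of this pure-Python version.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def leading_component(cov, iterations=200):
    """Approximate the largest eigenvalue and its unit eigenvector.

    Starts from a uniform vector, so it can fail if that start happens
    to be orthogonal to the true leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm
    return eigenvalue, v

def explained_fraction(cov, eigenvalue):
    """Share of the total variance carried by one component."""
    return eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Toy data: the second parameter is exactly twice the first
rows = [(float(x), 2.0 * x) for x in range(10)]
cov = covariance_matrix(rows)
value, vector = leading_component(cov)
print(vector, explained_fraction(cov, value))
```

In this toy case a single component carries all of the variance, so the two measured parameters collapse to one underlying quantity. That is the low-dimensional analogue of a "fundamental plane".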

Basic Astronomical Knowledge Problems (4)

- The superposition / decomposition problem:
  - Finding the defining features that separate different classes of objects that overlap in simple parameter spaces
  - What if there are 10^10 objects that overlap in a 10^3-D parameter space?
  - What is the optimal way to separate and extract the different unique classes of objects?

The LSST Big Data Manifesto

- More data is not just more data … more is different!
- Discover the unknown unknowns.
- Massive Data-to-Knowledge challenge.

The LSST Big Data Challenges

1. Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years).
2. Massive 20-Petabyte database: more than 20 billion objects need to be classified, and most will be monitored for important variations in real time.
3. Massive event stream: knowledge extraction in real time for ~2,000,000 events each night.

Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.

Look at these in more detail ...

LSST big data challenges #1, 2

- Each night for 10 years LSST will obtain roughly the same amount of data that the entire Sloan Digital Sky Survey obtained
- Our grad students will be asked to mine these data (~20 TB each night ≈ 40,000 CDs filled with data):
  - A truckload of CDs each and every day for 10 yrs
  - Cumulatively, a football stadium full of 100 million CDs after 10 yrs
- The challenge is to find the new, the novel, the interesting, and the surprises (the unknown unknowns) within all of these data.
- Yes, more is most definitely different!
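
The round numbers quoted for the LSST data volume can be cross-checked with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 10^12 bytes) and takes the quoted figures literally. The derived values (bytes per CD, observing hours per night, observing nights per year) are implications of those figures, not numbers stated in the talk.

```python
TB = 10**12  # bytes, decimal convention (an assumption)
MB = 10**6

nightly_bytes = 20 * TB          # "~20 TB each night"
cds_per_night = 40_000           # "≈ 40,000 CDs"
hourly_bytes = 2 * TB            # "~2 Terabytes of image data per hour"
total_cds = 100_000_000          # "100 million CDs after 10 yrs"
survey_years = 10

# Capacity each CD must hold for 40,000 of them to carry one night
mb_per_cd = nightly_bytes // cds_per_night // MB            # 500

# Hours of data taking per night implied by the hourly and nightly rates
hours_per_night = nightly_bytes // hourly_bytes             # 10

# Observing nights implied by the cumulative CD count
implied_nights = total_cds // cds_per_night                 # 2500
nights_per_year = implied_nights // survey_years            # 250

print(mb_per_cd, hours_per_night, implied_nights, nights_per_year)
```

The implied 500 MB per CD is somewhat below the nominal 650 to 700 MB capacity of a CD-ROM, so the comparison is a rounded one. The implied 250 observing nights per year suggests the cumulative figure allows for nights lost to weather and maintenance, although the talk does not say so.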

LSST big data challenge #3

- Approximately 2,000,000 times each night for 10 years LSST will detect a new sky event, and the astronomical community will be challenged with classifying these events. What will we do with all of these events?

[Figure: flux versus time]

Characterize first! (Unsupervised Learning)

Classify later.

Characterization includes …

- Feature Detection and Extraction:
  - Identifying and describing features in the data, via machine algorithms or human inspection (including the potentially huge contributions from Citizen Science)
  - Extracting feature descriptors from the data
  - Curating these features for search, re-use, & discovery
  - Finding other parameters and features from other archives, other databases, other information sources, and using those to help characterize (ultimately classify) each new event.
- … hence, coping with a highly multivariate parameter space
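
As an illustration of "extracting feature descriptors from the data", the sketch below turns one event's flux-versus-time series into a small set of numbers that could be stored and searched later. The particular descriptors, the function name, and the sample values are hypothetical. Real characterization would use many more features and would account for measurement errors.

```python
import statistics

def light_curve_features(times, fluxes):
    """Return a few simple descriptors of one flux-versus-time series.

    Assumes times are strictly increasing and matched one-to-one with fluxes.
    """
    if len(times) != len(fluxes) or len(times) < 2:
        raise ValueError("need at least two matched (time, flux) samples")
    samples = list(zip(times, fluxes))
    slopes = [(f1 - f0) / (t1 - t0)
              for (t0, f0), (t1, f1) in zip(samples, samples[1:])]
    peak_index = max(range(len(fluxes)), key=lambda i: fluxes[i])
    return {
        "mean_flux": statistics.fmean(fluxes),
        "std_flux": statistics.stdev(fluxes),
        "amplitude": max(fluxes) - min(fluxes),
        "time_of_peak": times[peak_index],
        "max_abs_slope": max(abs(s) for s in slopes),
    }

# A made-up transient: quiet, sudden brightening, then fading
features = light_curve_features([0.0, 1.0, 2.0, 3.0, 4.0],
                                [1.0, 1.0, 5.0, 2.0, 1.0])
print(features)
```

Each event then becomes one row in a feature table. With many descriptors per event, that table is the highly multivariate parameter space referred to above.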

Data-driven Discovery (Unsupervised Learning)

i.e., What can I do with characterizations?

1. Class Discovery: Clustering
2. Principal Component Analysis: Dimension Reduction
3. Outlier Detection: Surprise / Anomaly / Deviation / Novelty Discovery
4. Link Analysis: Association Analysis, Network Analysis
5. and more.
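
Of the items listed, association analysis is perhaps the least familiar, so here is a minimal sketch of it. Each object is described by a set of feature labels, and the code counts which pairs of labels co-occur often enough (support) and how reliably one label implies the other (confidence). The labels, thresholds, and function name are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(records, min_support=0.5):
    """Find label pairs that co-occur in at least min_support of the records.

    records is a list of sets of feature labels. Returns a sorted list of
    (antecedent, consequent, support, confidence) tuples.
    """
    n = len(records)
    item_counts = Counter(item for r in records for item in set(r))
    pair_counts = Counter(
        pair for r in records for pair in combinations(sorted(set(r)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules)

# Made-up characterizations of four objects
records = [
    {"variable", "blue"},
    {"variable", "blue"},
    {"variable", "red"},
    {"steady", "red"},
]
for rule in pair_rules(records, min_support=0.5):
    print(rule)
```

In this toy sample every "blue" object is also "variable", while only two of the three "variable" objects are "blue", so the rule is stronger in one direction than the other.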

Addressing the D2K (Data-to-Knowledge) Challenge

Complete end-to-end application of Data Science:

- Data management, metadata management, data search, information extraction, data mining, knowledge discovery
- Applies to any discipline (not just science disciplines)
- Skilled workforce needed to take data to knowledge

Informatics in Education and An Education in Informatics

Data Science Education: Two Perspectives

- Informatics in Education: working with data in all learning settings
  - Informatics (Data Science) enables transparent reuse and analysis of data in inquiry-based classroom learning.
  - Learning is enhanced when students work with real data and information (especially online data) that are related to the topic (any topic) being studied.
  - http://serc.carleton.edu/usingdata/ (“Using Data in the Classroom”)
  - Example: CSI The Cosmos
- An Education in Informatics: students are specifically trained:
  - … to access large distributed data repositories
  - … to conduct meaningful inquiries into the data
  - … to mine, visualize, and analyze the data
  - … to make objective data-driven inferences, discoveries, and decisions
  - Numerous Data Science programs now exist at several universities (GMU, Caltech, RPI, Michigan, Cornell, U. Illinois, Indiana U., … )
  - http://cds.gmu.edu/ (Computational & Data Sciences @ GMU)

Responses to Big Data (1)

2.5 approaches to dealing with Big Data:

- Data Science = Informatics & Statistics (and data-intensive computing)
- Citizen Science = Human Computation
- Or else … (where possible) combine these two: use the very effective human cognitive skills of pattern recognition and anomaly detection to generate training sets of relevant features (characterizations) to improve the machine algorithms.
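
A minimal sketch of the combined approach: many volunteers label the same objects, and only objects with enough votes and strong agreement are passed on as training examples for a machine classifier. The vote thresholds, labels, object identifiers, and function name are hypothetical, and real citizen-science projects use more careful weighting of individual volunteers.

```python
from collections import Counter

def build_training_set(votes_by_object, min_votes=3, min_agreement=0.8):
    """Turn volunteer votes into consensus labels.

    votes_by_object maps an object id to the list of labels it received.
    An object is kept only if it has at least min_votes votes and the most
    common label reaches the min_agreement fraction of them.
    """
    training = {}
    for obj_id, votes in votes_by_object.items():
        if len(votes) < min_votes:
            continue
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            training[obj_id] = label
    return training

# Made-up volunteer classifications
votes = {
    "obj1": ["spiral"] * 5,
    "obj2": ["spiral", "elliptical", "spiral", "elliptical"],
    "obj3": ["merger", "merger"],
    "obj4": ["elliptical"] * 4 + ["spiral"],
}
print(build_training_set(votes))
```

Objects that fail the agreement test are not wasted. Strong disagreement among human classifiers can itself be a hint that an object is unusual.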

Responses to Big Data (2)

- LSST Informatics & Statistics Science Collaboration:
  - breakout @ 11am in TB-A
- New Journal: Astronomy & Computing
  - Poster and flyers available in hallway
  - http://www.journals.elsevier.com/astronomy-and-computing/
- New AAS Working Group on Astroinformatics & Astrostatistics
  - Members: “Bill” Zeljko Ivezic (chair), Kirk Borne, George Djorgovski, Eric Feigelson, Eric Ford, Alyssa Goodman, Aneta Siemiginowska, Alex Szalay, Rick White.
  - Visit https://www.facebook.com/AstroInformatics

LSST Informatics & Statistics Breakout Session

Brief “lightning” talks by 7 team members:

- Jogesh Babu: Statistical Resources
- Kirk Borne: Outlier Detection for Surprise Discovery in Big Data
- Matthew Graham: Characterizing and Classifying CRTS
- Joseph Richards: Time-Domain Discovery and Classification
- Sam Schmidt: Upcoming Challenges for Photometric Redshifts
- Lior Shamir: Automatic Analysis of Galaxy Morphology
- John Wallin: Citizen Science and Machine Learning

Open Discussion:

- LSST Publication Reviews: informatics & statistics participation
- LSST Science Book chapter
- Research Roadmap document

11:00am-12:30pm today, Tortolita Ballroom A