Big Data
Kirk Borne
George Mason University
LSST All Hands Meeting
August 13-17, 2012
Characteristics of Big Data – 1a
• Big quantities of data are acquired everywhere.
• It is now a big issue in all aspects of life: science, business, healthcare, government, social networks, etc.
Characteristics of Big Data – 2
• Big quantities of data are acquired everywhere.
• It is now a big issue in all aspects of life: science, business, healthcare, government, social networks, etc.
• But…
• What do we mean by “big”?
• Gigabytes? Terabytes? Petabytes? Exabytes?
• The meaning of “big” is domain-specific and resource-dependent (data storage, I/O throughput, computation cycles, communication costs).
• I say … we all are dealing with our own “tonnabytes”.
Characteristics of Big Data – 3
• There are 4 dimensions to the Big Data challenge:
  1. Volume (“tonnabytes” data challenge)
  2. Variety (complexity, curse of dimensionality)
  3. Velocity (rate of data and information flowing at us)
  4. Verification (verifying inference-based models from data)
• Therefore, we need something better to cope with the tonnabytes …
Data Science – Informatics & Statistics
This graphic says it all …
Graphic provided by S. G. Djorgovski, Caltech
• Clustering – examine the data and find the data clusters (clouds), without considering what the items are = Characterization!
• Classification – for each new data item, try to place it within a known class (i.e., a known category or cluster) = Classify!
• Outlier Detection – identify those data items that don’t fit into the known classes or clusters = Surprise!
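The three tasks above can be sketched in a few lines of Python. This is only an illustration, not a method from the talk: it assumes k-means as the clustering algorithm, nearest-centroid assignment as the classifier, and a fixed distance cut (`max_distance`, a hypothetical parameter) as the outlier test. The toy 2-D data are made up.

```python
import math

def init_centroids(points, k):
    """Farthest-point seeding: deterministic and spreads the seeds apart."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iters=20):
    """Clustering: group unlabeled points around k centroids."""
    centroids = init_centroids(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def classify(point, centroids):
    """Classification: return (index of the nearest known cluster, distance to it)."""
    distances = [math.dist(point, c) for c in centroids]
    label = min(range(len(distances)), key=distances.__getitem__)
    return label, distances[label]

def is_outlier(point, centroids, max_distance):
    """Outlier detection: the point is far from every known cluster."""
    return classify(point, centroids)[1] > max_distance

# Toy 2-D data: two well-separated clouds (made-up values).
cloud_a = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2), (-0.2, -0.2)]
cloud_b = [(10.0, 10.1), (9.8, 9.9), (10.2, 10.0), (10.1, 9.8), (9.9, 10.2)]
centroids = kmeans(cloud_a + cloud_b, k=2)

print(classify((9.5, 10.4), centroids)[0])    # label of the nearer cloud
print(is_outlier((5.0, -5.0), centroids, 3))  # far from both clouds
```

The clustering step never looks at labels, the classification step reuses what clustering found, and the outlier step flags whatever neither cloud explains.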
Data-Enabled Science: Scientific KDD (Knowledge Discovery from Data)
• Characterize the new (clustering, unsupervised learning)
• Assign the known (classification, supervised learning)
• Discover the unknown (outlier detection, semi-supervised learning)
• The two major benefits of BIG DATA:
  1. best statistical analysis of “typical” events
  2. automated search for “rare” events
Graphic from S. G. Djorgovski
Basic Astronomical Knowledge Problems – 1
• The clustering problem:
  – Finding clusters of objects within a data set
  – What is the significance of the clusters (statistically and scientifically)?
  – What is the optimal algorithm for finding friends-of-friends or nearest neighbors?
    • N is > 10^10, so what is the most efficient way to sort?
    • Number of dimensions ~ 1000 – therefore, we have an enormous subspace search problem
  – Are there pair-wise (2-point) or higher-order (N-way) correlations?
    • N is > 10^10, so what is the most efficient way to do an N-point correlation?
      – algorithms that scale as N^2 log N won’t get us there
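One common way to avoid comparing every pair of objects in a friends-of-friends search is to bin the points on a grid and test only neighbouring cells. The sketch below assumes that approach, plus a union-find structure to merge linked points; the function name and `linking_length` parameter are illustrative, not taken from the talk.

```python
import math
from collections import defaultdict
from itertools import product

def friends_of_friends(points, linking_length):
    """Group points so that any two within linking_length share a group.

    Cells have side linking_length, so friends can only sit in the same
    cell or in an adjacent one; no all-pairs comparison is needed.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    grid = defaultdict(list)
    for idx, p in enumerate(points):
        cell = tuple(math.floor(c / linking_length) for c in p)
        grid[cell].append(idx)

    offsets = list(product((-1, 0, 1), repeat=len(points[0])))
    for cell, members in grid.items():
        for off in offsets:
            neighbour = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in grid.get(neighbour, ()):
                    # i < j ensures each pair is tested once.
                    if i < j and math.dist(points[i], points[j]) <= linking_length:
                        parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    return sorted(groups.values())

# Made-up 2-D positions: a chain of three, a pair, and an isolated point.
positions = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.2), (20.0, 20.0)]
print(friends_of_friends(positions, linking_length=0.6))
```

The number of neighbouring cells grows as 3^d, so this trick only suits low-dimensional positions (2-D or 3-D). It does not address the ~1000-dimensional subspace search raised above.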
Basic Astronomical Knowledge Problems – 2
• Outlier detection: (unknown unknowns)
  – Finding the objects and events that are outside the bounds of our expectations (outside known clusters)
  – These may be real scientific discoveries or garbage
  – Outlier detection is therefore useful for:
    • Novelty Discovery – is my Nobel prize waiting?
    • Anomaly Detection – is the detector system working?
    • Data Quality Assurance – is the data pipeline working?
  – How does one optimally find outliers in a 10^3-D parameter space, or in interesting subspaces (in lower dimensions)?
  – How do we measure their “interestingness”?
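One possible (by no means the only) measure of “interestingness” is how isolated an object is from its neighbours. The sketch below assumes that definition and scores each point by its mean distance to its k nearest neighbours. It is brute force, O(N^2), so it illustrates the idea rather than a method that would scale to 10^10 objects.

```python
import math

def knn_outlier_scores(points, k=3):
    """Score each point by its mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores

def rank_by_interestingness(points, k=3):
    """Return point indices, most isolated first."""
    scores = knn_outlier_scores(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)

# Made-up data: five clustered points and one far from the rest (index 5).
sample = [(0.0, 0.0), (0.1, 0.2), (-0.2, 0.1), (0.2, -0.1), (0.0, 0.3), (8.0, 8.0)]
print(rank_by_interestingness(sample)[0])  # index of the most isolated point
```

Whether the top-ranked object is a discovery, a detector glitch, or a pipeline fault still has to be decided by follow-up; the score only says where to look first.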
Basic Astronomical Knowledge Problems – 3
• The dimension reduction problem:
  – Finding correlations and “fundamental planes” of parameters
  – Number of attributes can be hundreds or thousands
    • The Curse of High Dimensionality!
  – Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another?
  – Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
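The eigenvector question in the dimension reduction problem can be illustrated with a minimal principal-component sketch. It assumes plain power iteration on the sample covariance matrix, which finds only the leading component, and uses made-up three-parameter data in which the second parameter is roughly twice the first.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length parameter tuples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]

def leading_eigenpair(matrix, iters=200):
    """Power iteration: (largest eigenvalue, unit eigenvector).

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
    return sum(x * y for x, y in zip(v, mv)), v

# Made-up observations: parameter 2 is about twice parameter 1; parameter 3 is noise.
observations = [
    (0.0, 0.1, 0.3), (1.0, 1.9, -0.2), (2.0, 4.1, 0.1),
    (3.0, 5.9, -0.3), (4.0, 8.1, 0.2), (5.0, 9.9, -0.1),
]
cov = covariance_matrix(observations)
eigenvalue, direction = leading_eigenpair(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print(direction, explained)
```

Here a single direction carries almost all of the variance, so the three measured parameters condense to one. Real catalogs with hundreds of attributes would call for a full eigendecomposition from a numerical library instead.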
Basic Astronomical Knowledge Problems – 4
• The superposition / decomposition problem:
  – Finding the defining features that separate different classes of objects that overlap in simple parameter spaces
  – What if there are 10^10 objects that overlap in a 10^3-D parameter space?
  – What is the optimal way to separate and extract the different unique classes of objects?
The LSST Big Data Manifesto
• More data is not just more data … more is different!
• Discover the unknown unknowns.
• Massive Data-to-Knowledge challenge.
The LSST Big Data Challenges
1. Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years).
2. Massive 20-Petabyte database: more than 20 billion objects need to be classified, and most will be monitored for important variations in real time.
3. Massive event stream: knowledge extraction in real time for ~2,000,000 events each night.
• Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.
• Look at these in more detail ...
LSST big data challenges # 1, 2
• Each night for 10 years LSST will obtain roughly as much data as was obtained by the entire Sloan Digital Sky Survey.
• Our grad students will be asked to mine these data (~20 TB each night ≈ 40,000 CDs filled with data):
  – A truckload of CDs each and every day for 10 yrs
  – Cumulatively, a football stadium full of 100 million CDs after 10 yrs
• The challenge is to find the new, the novel, the interesting, and the surprises (the unknown unknowns) within all of these data.
• Yes, more is most definitely different!
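The round numbers quoted in the talk can be checked against each other with a little arithmetic. The sketch below takes the slide figures as given and assumes decimal units (1 TB = 10^12 bytes); the derived quantities are implications of those figures, not numbers stated in the talk.

```python
TB = 10**12  # decimal terabyte in bytes (assumption)
MB = 10**6   # decimal megabyte in bytes (assumption)

nightly_bytes = 20 * TB    # "~20 TB each night"
hourly_bytes = 2 * TB      # "~2 Terabytes of image data per hour"
cds_per_night = 40_000     # "≈ 40,000 CDs"
total_cds = 100_000_000    # "100 million CDs after 10 yrs"
survey_years = 10

implied_cd_megabytes = nightly_bytes / cds_per_night / MB  # 500.0
implied_hours_per_night = nightly_bytes / hourly_bytes     # 10.0
implied_nights = total_cds / cds_per_night                 # 2500.0
implied_nights_per_year = implied_nights / survey_years    # 250.0

print(implied_cd_megabytes, implied_hours_per_night,
      implied_nights, implied_nights_per_year)
```

The figures hang together as order-of-magnitude estimates: about 500 MB per CD (a nominal CD holds roughly 650 to 700 MB), about 10 observing hours per night, and about 250 observing nights per year.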
LSST big data challenge # 3
• Approximately 2,000,000 times each night for 10 years LSST will detect a new sky event, and the astronomical community will be challenged with classifying these events. What will we do with all of these events?
[Figure: flux versus time]
Characterize first! (Unsupervised Learning)
Classify later.
Characterization includes …
• Feature Detection and Extraction:
  • Identifying and describing features in the data – via machine algorithms or human inspection (including the potentially huge contributions from Citizen Science)
  • Extracting feature descriptors from the data
  • Curating these features for search, re-use, & discovery
  • Finding other parameters and features from other archives, other databases, other information sources – and using those to help characterize (ultimately classify) each new event.
• … hence, coping with a highly multivariate parameter space
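As a rough illustration of extracting feature descriptors, the sketch below reduces a flux-versus-time series to a handful of summary numbers. The particular features (median, amplitude, skewness, steepest change, time of peak) are assumptions chosen for the example, not a feature set prescribed in the talk, and the light curve is made up.

```python
import statistics

def light_curve_features(times, fluxes):
    """Summarise a (time, flux) series as a small dictionary of descriptors.

    Assumes times are strictly increasing and at least two samples exist.
    """
    mean = statistics.fmean(fluxes)
    std = statistics.pstdev(fluxes)
    third_moment = statistics.fmean([(f - mean) ** 3 for f in fluxes])
    slopes = [
        abs((f2 - f1) / (t2 - t1))
        for t1, t2, f1, f2 in zip(times, times[1:], fluxes, fluxes[1:])
    ]
    return {
        "median": statistics.median(fluxes),
        "amplitude": max(fluxes) - min(fluxes),
        "skewness": third_moment / std**3 if std > 0 else 0.0,
        "max_slope": max(slopes),
        "peak_time": times[fluxes.index(max(fluxes))],
    }

# Made-up event: a flat source with one brief brightening.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
fluxes = [10.0, 10.0, 10.0, 30.0, 12.0, 10.0]
features = light_curve_features(times, fluxes)
print(features)
```

Each event becomes one row in a feature table. Joining in parameters from other archives adds further columns, which is how the parameter space quickly becomes highly multivariate.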
Data-driven Discovery (Unsupervised Learning)
i.e., What can I do with characterizations?
1. Class Discovery – Clustering
2. Principal Component Analysis – Dimension Reduction
3. Outlier Detection – Surprise / Anomaly / Deviation / Novelty Discovery
4. Link Analysis – Association Analysis – Network Analysis
5. and more.
Addressing the D2K (Data-to-Knowledge) Challenge
Complete end-to-end application of Data Science:
• Data management, metadata management, data search, information extraction, data mining, knowledge discovery
• Applies to any discipline (not just science disciplines)
• Skilled workforce needed to take data to knowledge
Informatics in Education and An Education in Informatics
Data Science Education: Two Perspectives
• Informatics in Education – working with data in all learning settings
  • Informatics (Data Science) enables transparent reuse and analysis of data in inquiry-based classroom learning.
  • Learning is enhanced when students work with real data and information (especially online data) that are related to the topic (any topic) being studied.
  • http://serc.carleton.edu/usingdata/ (“Using Data in the Classroom”)
  • Example: CSI The Cosmos
• An Education in Informatics – students are specifically trained:
  • … to access large distributed data repositories
  • … to conduct meaningful inquiries into the data
  • … to mine, visualize, and analyze the data
  • … to make objective data-driven inferences, discoveries, and decisions
  • Numerous Data Science programs now exist at several universities (GMU, Caltech, RPI, Michigan, Cornell, U. Illinois, Indiana U., … )
  • http://cds.gmu.edu/ (Computational & Data Sciences @ GMU)
Responses to Big Data – 1
2.5 approaches to dealing with Big Data:
  – Data Science = Informatics & Statistics (and data-intensive computing)
  – Citizen Science = Human Computation
  – Or else … (where possible) combine these two – use the very effective human cognitive skills of pattern recognition and anomaly detection to generate training sets of relevant features (characterizations) to improve the machine algorithms.
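One simple way to turn volunteer input into a training set is to keep only the items on which enough volunteers agree. The sketch below assumes majority voting with a minimum vote count and a minimum agreement fraction. Both thresholds, the object identifiers, and the labels are hypothetical and do not describe any particular citizen science project.

```python
from collections import Counter

def consensus_labels(votes, min_votes=3, min_agreement=0.8):
    """Keep items whose most common volunteer label is sufficiently agreed on.

    votes maps an item identifier to the list of labels volunteers gave it.
    """
    training_set = {}
    for item, labels in votes.items():
        if len(labels) < min_votes:
            continue  # too few opinions to trust
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            training_set[item] = label
    return training_set

# Made-up volunteer classifications.
volunteer_votes = {
    "obj1": ["spiral"] * 5,
    "obj2": ["spiral", "elliptical", "spiral", "merger"],
    "obj3": ["merger", "merger"],
    "obj4": ["elliptical"] * 4 + ["spiral"],
}
print(consensus_labels(volunteer_votes))
```

Items that fail the cut are not discarded as useless: strong disagreement among volunteers can itself be a hint that an object is unusual.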
Responses to Big Data – 2
LSST Informatics & Statistics Science Collaboration:
  – breakout @ 11am in TB-A
New Journal: Astronomy & Computing
  – Poster and flyers available in hallway
  – http://www.journals.elsevier.com/astronomy-and-computing/
New AAS Working Group on Astroinformatics & Astrostatistics
  – Members: “Bill” Zeljko Ivezic (chair), Kirk Borne, George Djorgovski, Eric Feigelson, Eric Ford, Alyssa Goodman, Aneta Siemiginowska, Alex Szalay, Rick White.
  – Visit https://www.facebook.com/AstroInformatics
LSST Informatics & Statistics Breakout Session
Brief “lightning” talks by 7 team members:
  – Jogesh Babu: Statistical Resources
  – Kirk Borne: Outlier Detection for Surprise Discovery in Big Data
  – Matthew Graham: Characterizing and Classifying CRTS
  – Joseph Richards: Time-Domain Discovery and Classification
  – Sam Schmidt: Upcoming Challenges for Photometric Redshifts
  – Lior Shamir: Automatic Analysis of Galaxy Morphology
  – John Wallin: Citizen Science and Machine Learning
Open Discussion:
  – LSST Publication Reviews: informatics & statistics participation
  – LSST Science Book chapter
  – Research Roadmap document
11:00am-12:30pm today – Tortolita Ballroom A