Distributed Data Mining Research in the NASA Intelligent Systems Program

Kirk D. Borne

School of Computational Sciences, George Mason University

Fairfax, Virginia

kborne@gmu.edu

NASA AISRP Meeting, NASA Ames, Moffett Field, CA, April 4-6, 2005


4/5/05

Outline


What is ISP?


Short Descriptions


Space Science Project


Earth Science Project


What is (was) ISP?


NASA Code R / Ames ISP (Intelligent Systems Program) had 2 components:

Open competition (low TRL = pure research)

NASA Center-to-Center Mission Infusion Tasks (mid-TRL = development and application)

Mission Infusion Teams were tasked with taking the low-TRL technology to mid-TRL, by infusing the technology into a mission, project, or existing program.



Funded ISP Mission Infusion Projects


Two funded projects (both are now completed):

Distributed Data Mining in the NVO*

PI: K. Borne, GMU

Co-I: Cynthia Cheung, NASA

Collaborator: Hillol Kargupta, UMBC

Space Science infusion task

Automated Wildfire Detection (and Prediction) through Artificial Neural Networks

PI: Jerry Miller, NASA

Co-I: K. Borne, GMU

Collaborators: NOAA-NESDIS staff

Earth Science infusion task

Projects were funded by the IDU (Intelligent Data Understanding) component of ISP

* NVO = National Virtual Observatory


AISRP Projects: still more data mining

Machine Learning and Data Mining for Automatic Detection and Interpretation of Solar Events

Develop an automatic system for CME detection, tracking, characterization, and source region location

Discover specific associations between solar events (CME) and Earth events (Space Weather)

PI: Art Poland, GMU

Co-I’s: Jie Zhang, K. Borne, Harry Wechsler

Novel Approaches to Semi-supervised Data Exploration

Develop an efficient and effective automated system for astronomical object classification, with an emphasis on star-galaxy discrimination and morphological galaxy classification

PI: David Bazell, Eureka

Co-I’s: K. Borne (GMU), David Miller (Penn St.)


Short Descriptions of ISP Projects


Distributed Data Mining (DDM) in the NVO

Search for examples of interacting/colliding/merging galaxies across multiple distributed databases

Apply Distributed Learning

Apply Distributed Classification

Use DDM algorithms being developed by UMBC group

Apply algorithms within NVO data environment

Automated Wildfire Detection (and Prediction) through Artificial Neural Networks (ANN)

Identify all wildfires in Earth-observing satellite images

Train ANN to mimic human analysts’ classifications

Apply ANN to new data (from 3 remote-sensing satellites: GOES, AVHRR, MODIS)

Extend NOAA fire product from USA to the whole Earth


Searching, retrieving, mining, integrating, and
analyzing geographically distributed data
repositories is one of the major challenges in
data mining today



Why so many telescopes and databases? …

Because …

Many great astronomical discoveries have come from inter-comparisons of various wavelengths:

- Quasars
- Gamma-ray bursts
- Ultraluminous IR galaxies
- X-ray black-hole binaries
- Radio galaxies
- . . .



Distributed Data Mining


2 perspectives:



robust statistical analysis of “typical” events



automated search for “rare” events

Figure: The clustering of data clouds (dc#) within a multidimensional parameter space (p#). Such a mapping can be used to search for and identify clusters, voids, outliers, one-of-kinds, relationships, and associations among arbitrary parameters in a database (or among various parameters in geographically distributed databases).

Credit: S. G. Djorgovski


Space Science Project:

Distributed Data Mining in the NVO


A case study to find colliding, interacting, and merging galaxies among the IR-luminous galaxy population

Examine several distributed databases (HST, 2MASS, Sloan, FIRST, IRAS)

Solve a particular science problem (an NVO science scenario)

[Figure label: NASA Information Power Grid (IPG)]


NVO Science Cases & Drivers (from Aspen 2001 NVO Workshop)

Solar System: NEOs, Long-Period Comets, TNOs, Killer Asteroids!!!

The Digital Galaxy: Find star streams and populations -- relics of past/present assembly phase. Identify components of disk, thick disk, bulge, halo, arms, ??

The Low-Surface Brightness Universe: spatial filtering, multi-wavelength searches, intersection of the image and catalog domains

Panchromatic Census of AGN (Active Galactic Nuclei): Complete sample of the AGN zoo, their emission mechanisms, and their environments

Precision Cosmology & Large-Scale Structure: **Hierarchical Assembly History of Galaxies and Structure**, Cosmological Parameters, Dark Matter and Galaxy Biasing as f(z)

Precision science of any kind that depends on very large sample sizes

"Survey Science Deluxe"

Search for rare and exotic objects (e.g., high-z QSOs, high-z SNe, L/T dwarfs)

Serendipity: Explore new domains of parameter space (e.g., time domain, or "color-color space" of all kinds)

**This is the scientific goal of the ISP-funded project described here.


Colliding and Merging Galaxies:
Building Blocks of the Universe


Ultra-Luminous Infrared Galaxies (ULIRGs) and other IR-Luminous Galaxies (LIRGs):

Nearly 100% are involved in collisions and mergers

ULIRGs: the most luminous galaxies in the Universe


Merger Tree - Galaxy Merger Family History

[Figure: merger tree; axis labels: Past, Present]

The goal of this study is to identify collision and merger remnant candidates at increasing redshift, in order to measure the galaxy hierarchical mass assembly rate as a function of cosmic epoch.


Distributed Data Mining in the NVO


1. Identify classes of galaxies among several large photometric catalogs (e.g., 2MASS, Sloan DSS, FIRST, NVSS, etc.): the galaxy class is either normal or IR-luminous (the latter being indicative of collision/merger activity)

2. Identify all known examples of ULIRGs: linked to Starburst Galaxies, Gamma-Ray Bursts, Quasars, Hierarchical Galaxy Assembly, etc.

3. Learn new properties of ULIRGs (e.g., Association Rule Mining) by examining multiple distributed databases.

4. Build a classifier from these rules.

5. Find new cases of ULIRGs in the distributed databases.

6. Results will contribute to understanding of many classes of astronomical phenomena.

7. Techniques will be applicable to NVO, LWS, other VxOs, JWST science program, ..., and E/PO projects (e.g., mining Kepler mission catalog by students; or VO@Home)



An example of clustering in a 3-dimensional color-color parameter space using data from two different (distributed) astronomical databases. In this case, the 3 colors are pairings of 2MASS near-IR and Sloan optical magnitudes.

Plot provided by H. Kargupta (UMBC)
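How such a color-color space could be assembled from two catalogs can be sketched as follows. This is only an illustration: the slide does not say which magnitude pairings were used, so the `COLOR_PAIRS` list, the shared object identifiers, the magnitude values, and the function name `color_vectors` are all hypothetical. Real catalogs would normally be cross-matched by sky position rather than by a shared key.

```python
# Toy catalogs keyed by a hypothetical shared object identifier.
# All magnitudes below are made-up illustrative values.
sloan_optical = {
    "obj1": {"g": 17.2, "r": 16.5, "i": 16.1},
    "obj2": {"g": 19.0, "r": 18.1, "i": 17.6},
    "obj3": {"g": 18.4, "r": 17.9, "i": 17.7},  # no near-IR counterpart
}
twomass_near_ir = {
    "obj1": {"J": 14.9, "H": 14.2, "K": 13.9},
    "obj2": {"J": 15.8, "H": 15.0, "K": 14.4},
}

# Hypothetical optical / near-IR band pairings; the slide does not list them.
COLOR_PAIRS = [("g", "J"), ("r", "H"), ("i", "K")]


def color_vectors(optical, near_ir, pairs):
    """Return {object id: 3-tuple of colors} for objects present in both catalogs.

    Each color is an optical magnitude minus a near-IR magnitude.
    """
    vectors = {}
    for obj_id in optical.keys() & near_ir.keys():
        vectors[obj_id] = tuple(
            optical[obj_id][opt_band] - near_ir[obj_id][ir_band]
            for opt_band, ir_band in pairs
        )
    return vectors


vectors = color_vectors(sloan_optical, twomass_near_ir, COLOR_PAIRS)
for obj_id in sorted(vectors):
    print(obj_id, [round(c, 2) for c in vectors[obj_id]])
```

Each resulting 3-tuple is one point in the color-color space, and a clustering algorithm would then operate on those points.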


Science Result


Successfully completed one true proof-of-concept science case within a small subset of the IRAS, HST, and FIRST databases.

We re-discovered exactly the type of object that we are hoping to find automatically with our data mining tools:

We found a very distant hyper-luminous infrared galaxy, one of the brightest galaxies in the known Universe.

This particular galaxy was previously known (catalogued), but we re-discovered it serendipitously.

References:

K. Borne, "Distributed Data Mining in the National Virtual Observatory", SPIE Data Mining & Knowledge Discovery V, vol. 5098, p. 211 (2003).

K. Borne, "A National Virtual Observatory (NVO) Science Case: Properties of Very Luminous IR Galaxies (VLIRGs)", in "The Emergence of Cosmic Structure", p. 307 (2003).

IRAS F12509+3122

Redshift = 0.780



Additional application areas of ISP-funded NVO data mining project

Application of XML to distributed data mining:

ADQL (Astronomical Data Query Language)

XMLA (XML for Analysis)

PMML (Predictive Model Markup Language)

Application of different data mining techniques:

Bayes classification

Neural nets

Decision trees

Association rule mining

Genetic Algorithms for rapid data modeling

Supervised and Unsupervised Learning algorithms for robust classification

Application of Beowulfs to parallel high-performance data mining

Application to new mission data sets: GALEX, Spitzer, WISE, JWST, LWS, Sensor Webs, Constellations (distributed Sciencecraft)


NASA Intelligent Systems (IS) Project

Intelligent Data Understanding (IDU)

Earth Science Project



Automated Wildfire Detection Through Artificial Neural Networks

Jerry Miller (P.I.), NASA, GSFC

Dr. Kirk Borne (Co-I), GMU

Dr. Brian Thomas, University of Maryland

Dr. Zhenping Huang, University of Maryland

Yuechen Chi, GMU

Donna McNamara, NOAA-NESDIS, Camp Springs, MD

George Serafino, NOAA-NESDIS, Camp Springs, MD



NOAA’S HAZARD MAPPING SYSTEM


NOAA’s Hazard Mapping System (HMS) is an interactive processing system that allows trained satellite analysts to manually integrate data from 3 automated fire detection algorithms corresponding to the GOES, AVHRR and MODIS sensors. The result is a quality controlled fire product in graphic (Fig 1), ASCII (Table 1) and GIS formats for the continental US.

Figure: Hazard Mapping System (HMS) Graphic Fire Product for day 5/19/2003








OVERALL TASK OBJECTIVES



To mimic the NOAA-NESDIS Fire Analysts’ subjective decision-making and fire detection algorithms with a Neural Network in order to:

remove subjectivity in results

improve automation & consistency

allow NESDIS to expand coverage globally





OLD FORMAT

Lon, Lat
-80.531, 25.351
-81.461, 29.072
-83.388, 30.360
-95.004, 30.949
-93.579, 30.459
-108.264, 27.116
-108.195, 28.151
-108.551, 28.413
-108.574, 28.441
-105.987, 26.549
-106.328, 26.291
-106.762, 26.152
-106.488, 26.006
-106.516, 25.828

NEW FORMAT (as of May 16, 2003)

Lon, Lat, Time, Satellite, Method of Detection
-80.597, 22.932, 1830, MODIS AQUA, MODIS
-79.648, 34.913, 1829, MODIS, ANALYSIS
-81.048, 33.195, 1829, MODIS, ANALYSIS
-83.037, 36.219, 1829, MODIS, ANALYSIS
-83.037, 36.219, 1829, MODIS, ANALYSIS
-85.767, 49.517, 1805, AVHRR NOAA-16, FIMMA
-84.465, 48.926, 2130, GOES-WEST, ABBA
-84.481, 48.888, 2230, GOES-WEST, ABBA
-84.521, 48.864, 2030, GOES-WEST, ABBA
-84.557, 48.891, 1835, MODIS AQUA, MODIS
-84.561, 48.881, 1655, MODIS TERRA, MODIS
-84.561, 48.881, 1835, MODIS AQUA, MODIS
-89.433, 36.827, 1700, MODIS TERRA, MODIS
-89.750, 36.198, 1845, GOES, ANALYSIS


Hazard Mapping System (HMS) ASCII Fire Product
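Records in the new five-column format (Lon, Lat, Time, Satellite, Method of Detection) could be read with a few lines of Python. This is a minimal sketch, not the project's actual reader: the `FireDetection` class, the `parse_hms_ascii` function, and the treatment of the time field as an opaque string are assumptions made for illustration. The three sample rows are taken from the table on this slide.

```python
import csv
from dataclasses import dataclass
from io import StringIO


@dataclass
class FireDetection:
    """One row of the new-format HMS ASCII fire product (names are illustrative)."""
    lon: float
    lat: float
    time: str        # kept as text; the slide does not define the time convention
    satellite: str
    method: str


def parse_hms_ascii(text):
    """Parse comma-separated HMS rows, skipping blank, header, and malformed lines."""
    detections = []
    for row in csv.reader(StringIO(text), skipinitialspace=True):
        if len(row) != 5:
            continue  # blank line or a row in the old two-column format
        try:
            lon, lat = float(row[0]), float(row[1])
        except ValueError:
            continue  # header line such as "Lon, Lat, Time, ..."
        detections.append(
            FireDetection(lon, lat, row[2].strip(), row[3].strip(), row[4].strip())
        )
    return detections


sample = """Lon, Lat, Time, Satellite, Method of Detection
-80.597, 22.932, 1830, MODIS AQUA, MODIS
-85.767, 49.517, 1805, AVHRR NOAA-16, FIMMA
-84.465, 48.926, 2130, GOES-WEST, ABBA
"""

detections = parse_hms_ascii(sample)
for d in detections:
    print(d.satellite, d.method, d.lon, d.lat)
```

Rows in the old two-column format are skipped here because they carry no satellite or detection-method information.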


GOES CH2 (3.78 - 4.03 μm)

Northern Florida Fire

2003: Day 126, 82.10 Deg West Longitude, 30.49 Deg North Latitude

File: florida_ch2.png






NOAA-NESDIS FIRE DETECTION SYSTEM

[Flow diagram; recoverable labels grouped below]

Sensors:
GOES EAST-WEST IMAGER: 5 CHAN, 10-BIT WDS (GVAR FORMAT)
NOAA 14-17 AVHRR: 5 CHAN, 10-BIT WDS (HRPT FORMAT)
TERRA-AQUA MODIS: 36 CHAN, 12-BIT WDS (HDF FORMAT)

Spacecraft groups: NOAA S/C, NASA S/C

Fire detection algorithms:
WF-ABBA FIRE DET: CH’s 1, 2, 4 (0.62, 3.9, 10.7 μm)
FIMMA FIRE DET: CH’s 2, 3b, 4, 5 (0.91, 3.7, 10.8, 12 μm)
MODIS MOD14 FIRE PRODUCT: CH’s 2, 22, 31 (0.86, 3.9, 11 μm)

Imagery channels (8-BIT WDS, LCC):
CH’s 1, 2, 4 (0.62, 3.9, 10.7 μm)
CH’s 1, 2, 3b (0.63, 0.91, 3.7 μm)
CH’s 1, 2, 22 (0.66, 0.86, 3.96 μm)

Other labels: MCIDAS (COTS), TERASCAN (COTS), 10-bit, Geo-correction, Bow-Tie Effect Removal, NASA TAP-OFF POINT FOR IMAGERY

Downstream: HAZARD MAPPING SYSTEM (HMS), ENVI; FIRE ANALYSTS; DAILY NOAA FIRE PRODUCT (automated algorithms and manual additions)

LCC = Lambert Conformal Conic Projection

FIMMA = Fire Identification Mapping and Monitoring Alg

WF-ABBA = Wildfire Automated Biomass Burning Alg

MCIDAS = Man Computer Interactive Data Access System

FOR IMAGERY


SIMPLIFIED DATA EXTRACTION PROCEDURE

[Flow diagram; labels in reading order]

Daily HMS ASCII Fire Product

Geographic Coords (lat/lon)

ENVI Function Call: Conversion to Image Coords (row/col)

Image Ref’s

DATA: GOES (96 Files/day), AVHRR (25 Files/day), MODIS (14 Files/day)

Filter Out Bad data points

Image Coords

Spectral Data

Neural Network Training Set
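The step from image coordinates to spectral data for the training set might look like the sketch below. It assumes a 7x7 pixel window per band, which would match the 49 inputs per band shown on the network configuration slide but is not stated in the deck. It also treats a window that falls off the image edge as a "bad data point". The function names and the toy images are hypothetical, and the lat/lon to row/col conversion (done in the project through an ENVI function call) is not reproduced here.

```python
def extract_window(band, row, col, half=3):
    """Return the (2*half+1)^2 pixel values around (row, col), row-major, or None.

    `band` is a 2-D list of pixel values. None is returned when the window
    would extend past the image edge.
    """
    n_rows, n_cols = len(band), len(band[0])
    if row - half < 0 or col - half < 0:
        return None
    if row + half >= n_rows or col + half >= n_cols:
        return None
    return [
        band[r][c]
        for r in range(row - half, row + half + 1)
        for c in range(col - half, col + half + 1)
    ]


def build_input_vector(bands, row, col, half=3):
    """Concatenate one window per band into a single input vector, or None."""
    vector = []
    for band in bands:
        window = extract_window(band, row, col, half)
        if window is None:
            return None  # filtered out as a bad data point
        vector.extend(window)
    return vector


# Three toy 10x10 "bands"; pixel value encodes band, row, and column.
bands = [
    [[b * 100 + r * 10 + c for c in range(10)] for r in range(10)]
    for b in range(3)
]

vec = build_input_vector(bands, row=5, col=5)
edge = build_input_vector(bands, row=0, col=0)
print(len(vec), edge)
```

With three bands and a 7x7 window the vector has 147 elements, one per network input.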


DECISION REGIONS AND BOUNDARIES FOR HIGHLY IDEAL SCATTER PLOT CLUSTERING PATTERNS

[Two scatter-plot panels, axes X1 and X2]

Single Fire Signature: Fire, Background

Multiple Fire Signatures: Crown Fire, Surface Fire, Ground Fire, Background


Scatter Plot of Background-Subtracted GOES CH 1 vs. CH 2

Fire (lower) and non-fire (upper) separation of clusters

2003: June 2, Northern Florida

File: scatter_fires12.png

(GOES CH 1, CH 2, CH 4 are input to neural network)



Scatter Plot of Background-Subtracted GOES CH 2 vs. CH 4

Fire (left) and non-fire (right) separation of clusters

2003: June 2, Northern Florida

File: scatter_fires22.png

(GOES CH 1, CH 2, CH 4 are input to neural network)



Neural Network Configuration



[Network diagram; labels from input to output]

Input Layer 0

Band A Inputs: 1-49

Band B Inputs: 50-98

Band C Inputs: 99-147

Connections (weights)

Hidden Layer 1

Connections (weights)

Output Layer 2

Output Classification
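The three-layer layout on this slide can be sketched as a forward pass in plain Python. This assumes a fully connected feed-forward network with sigmoid units. The slide gives only the 147 inputs (49 per band), so the hidden-layer size of 10, the single fire-score output, the 0.5 decision threshold, the random untrained weights, and all function names are illustrative assumptions.

```python
import math
import random

N_INPUTS = 147   # 3 bands x 49 inputs, as on the slide
N_HIDDEN = 10    # assumption: the slide does not give the hidden-layer size


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """One weight row per unit; the last element of each row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward_layer(weights, inputs):
    """Apply weights, bias, and sigmoid activation for every unit in the layer."""
    return [
        sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
        for row in weights
    ]


def classify(hidden_w, output_w, inputs):
    """Input layer 0 -> hidden layer 1 -> output layer 2; returns a score in (0, 1)."""
    hidden = forward_layer(hidden_w, inputs)
    return forward_layer(output_w, hidden)[0]


rng = random.Random(0)                       # fixed seed for repeatability
hidden_w = init_layer(N_INPUTS, N_HIDDEN, rng)
output_w = init_layer(N_HIDDEN, 1, rng)

score = classify(hidden_w, output_w, [0.5] * N_INPUTS)
label = "fire" if score >= 0.5 else "nonfire"
print(round(score, 3), label)
```

The weights here are random, so the score carries no meaning until the network is trained on analyst-labeled examples.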


Typical Error Matrix (for MODIS instrument)

                   TRAINING DATA
             Fire         NonFire      Totals
Fire         2834 (TP)    173 (FP)     3007
NonFire      318 (FN)     3103 (TN)    3421
Totals       3152         3276         6428

TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative


Typical Measures of Accuracy

Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)

Producer’s Accuracy (fire) = TP/(TP+FN)

Producer’s Accuracy (nonfire) = TN/(FP+TN)

User’s Accuracy (fire) = TP/(TP+FP)

User’s Accuracy (nonfire) = TN/(TN+FN)

Accuracy of our NN Classification

Overall Accuracy = 92.4%

Producer’s Accuracy (fire) = 89.9%

Producer’s Accuracy (nonfire) = 94.7%

User’s Accuracy (fire) = 94.2%

User’s Accuracy (nonfire) = 90.7%
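The five measures can be checked directly against the error matrix counts given for MODIS (TP = 2834, FP = 173, FN = 318, TN = 3103). The function name and dictionary keys below are illustrative; the formulas are the ones listed on this slide.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Compute the five accuracy measures from error matrix counts."""
    total = tp + fp + fn + tn
    return {
        "overall": (tp + tn) / total,
        "producers_fire": tp / (tp + fn),
        "producers_nonfire": tn / (fp + tn),
        "users_fire": tp / (tp + fp),
        "users_nonfire": tn / (tn + fn),
    }


# Counts from the "Typical Error Matrix (for MODIS instrument)" slide.
measures = accuracy_measures(tp=2834, fp=173, fn=318, tn=3103)

for name, value in measures.items():
    print(f"{name}: {value:.1%}")
# overall: 92.4%
# producers_fire: 89.9%
# producers_nonfire: 94.7%
# users_fire: 94.2%
# users_nonfire: 90.7%
```

The computed values reproduce the percentages quoted for the NN classification.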


Summary


Summary: NVO Data Mining Applications

Sample Data Mining Applications within the NVO:

Discover data stored in geographically distributed heterogeneous systems.

Search huge databases for trends and correlations in high-dimensional parameter spaces: identify new properties or new classes of objects.

Search for rare, one-of-a-kind, and exotic objects in huge databases.

Identify temporal variations in objects from millions or billions of observations.

Identify moving objects in huge survey catalogs and image databases.

Identify parameter glitches / anomalies / deviations either in static databases (e.g., archives) or in dynamic data (e.g., science / telemetry / engineering data streams from remote satellites).

Find clusters, nearest neighbors, outliers, and/or zones of avoidance in the distribution of astrophysical objects or other observables in arbitrary parameter spaces.

Serendipitously explore the huge databases that will be part of the NVO, through access to distributed, autonomous, federated, heterogeneous, multi-wavelength, multi-mission astrophysics data archives.

Data Mining Resource Guide for Space Science:

http://nvo.gsfc.nasa.gov/nvo_datamining.html

http://www.us-vo.org/


Addressing NASA Exploration Challenges through Intelligent Data Understanding

Autonomy: “making systems more intelligent”

Robotic Networks: “enabling networks of cooperating robotic systems”

Data-Rich Virtual Presence: “local and remote, both real-time and asynchronous virtual presence to enable effective science and robust operations (including tele-presence, tele-science, tele-supervision)”

Source: Human & Robotic Technology, Program Formulation Plan, 15 May 2004