Scientific Data Mining


Principles and applications with astronomical data.


Amos Storkey





Institute for Adaptive and Neural Computation

Division of Informatics and Institute for Astronomy


University of Edinburgh

Collaborators and Thanks


Collaborative work with Nigel Hambly,
Chris Williams and Bob Mann.


Thanks also to many others at the Royal
Observatory, Edinburgh for their help in
clarifying many of the things that an
astronomical outsider might
misunderstand or falsely presume!

Astro-informatics


Problems in Astronomy increasingly require use
of machine learning, data mining and
informatics techniques.


Detection of spurious objects


Record linkage


Object classification and clustering


Source separation


Compression


Information about techniques

Galaxy spectra


James Riden, with Alan Heavens and Ben
Panter. Chris Williams.


Given spectra, what can be said about the
generation history and metallicity of a galaxy?


Data exploration techniques: ISOMAP and LLE


find data manifold and project to low
dimension.


Develop probabilistic model for galaxy
generation, infer history and metallicity
parameters from spectra.

Exploratory Data Analysis


Record Linkage


Problem of linking records from different
datasets.


There is an ambiguity in matches.


Room for new techniques.
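
The ambiguity in matches can be illustrated with a minimal positional cross-match. This is only a sketch under stated assumptions: flat (x, y) coordinates, a fixed match radius, and the hypothetical names `cross_match`, `epoch1` and `epoch2`; it is not the linkage method used for any particular survey.

```python
import math

def cross_match(cat_a, cat_b, radius):
    """Link each (x, y) record in cat_a to records in cat_b within `radius`.

    Returns a dict: index in cat_a -> list of candidate indices in cat_b,
    nearest first. An empty list is a non-match; more than one candidate
    is an ambiguous match.
    """
    links = {}
    for i, (xa, ya) in enumerate(cat_a):
        candidates = []
        for j, (xb, yb) in enumerate(cat_b):
            d = math.hypot(xa - xb, ya - yb)
            if d <= radius:
                candidates.append((d, j))
        links[i] = [j for d, j in sorted(candidates)]
    return links

# Hypothetical toy catalogues in flat plate coordinates (arbitrary units).
epoch1 = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
epoch2 = [(0.1, 0.0), (5.0, 5.2), (5.1, 4.9), (20.0, 20.0)]
links = cross_match(epoch1, epoch2, radius=0.5)

for i, candidates in links.items():
    kind = ("non-match" if not candidates
            else "unique" if len(candidates) == 1 else "ambiguous")
    print(i, candidates, kind)
```

Record 1 has two candidates inside the radius, which is exactly the case where distance alone cannot decide the link and a probabilistic treatment becomes attractive.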

Super-resolution


Improving resolution of a single image, or
combining images from different sources
to provide an increased resolution.


Image cleaning and characterisation.


H alpha survey. Matches in short red.


Examples.

Part II


Main Problem


Locating junk objects in astronomical
databases.

Makes finding non-matches across epochs or colours hard.

SuperCOSMOS Sky Survey Data


UK, ESO and Palomar Schmidt sky survey plates.


Optical: 3 colours and 2 epochs, 894 fields for
each covering the Southern sky.


Digitised using SuperCOSMOS to 10 micron
(0.7 arcsec). 5 × 10^5 to 10^7 objects on the plate.


Objects and features extracted from plates to form
a catalogue of stars and galaxies and
characteristics (e.g. ellipses), but also spurious
objects, e.g. from satellite tracks.


Average of 2 satellite tracks per plate, a few
hundred to a few thousand objects per track.


Aeroplanes, diffraction spikes, halos, scratches...

Satellite track problem


Some satellite tracks tend to be
recognised as a line of objects:


Optical Artefacts


Can be halos about
bright stars. High
density of spurious
points local to the star.


(Almost) horizontal
and (almost) vertical
diffraction spikes are
possible.

Spurious object characteristics


Spurious objects cover all the ranges of
magnitude measurements, they often (but
not always) have characteristics
resembling those of galaxies.


In fact their characteristics are wide and
various. They are not easy to detect from
their characteristics alone.

Machine Learning Methods


Hough Transform and Circular Hough
Transform


See http://www.anc.ed.ac.uk/~amos/hough.html


Circular Hough Transform


Hough Example: UKJ005

[Figure: Hough accumulator for UKJ005. Axes: angle, and distance from origin (up to d_max). Inset: data space corresponding to a bin.]


However:


Can’t find short lines


Curves are problematic


Background star/galaxy density changes can
cause errors.
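
A minimal sketch of the line Hough transform on a point catalogue may help make these limitations concrete. It assumes the normal parameterisation rho = x cos(theta) + y sin(theta), with angle and distance from origin as the two accumulator axes; the function name `hough_lines`, the bin counts and the toy data are illustrative assumptions, not the settings used on the survey plates.

```python
import math
from collections import Counter

def hough_lines(points, d_max, n_angle=180, n_dist=30):
    """Vote for lines through (x, y) points.

    Each line is described by an angle theta in [0, 2*pi) and a distance
    rho in [0, d_max] from the origin. Every point votes for all
    (angle bin, distance bin) pairs whose line passes through it.
    """
    votes = Counter()
    for x, y in points:
        for a in range(n_angle):
            theta = 2.0 * math.pi * a / n_angle
            rho = x * math.cos(theta) + y * math.sin(theta)
            if 0.0 <= rho <= d_max:
                d = min(int(rho * n_dist / d_max), n_dist - 1)
                votes[(a, d)] += 1
    return votes

# Hypothetical data: a vertical "track" at x = 10.5 plus three background objects.
track = [(10.5, float(y)) for y in range(20)]
background = [(3.0, 7.0), (25.0, 4.0), (17.0, 22.0)]
votes = hough_lines(track + background, d_max=30.0)
(best_bin, count) = votes.most_common(1)[0]
print(best_bin, count)  # the bin holding the track collects the most votes
```

The bin width sets the effective line width, a short line contributes few votes, and a dense background raises the vote count everywhere, which is where the problems listed above come from.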

Renewal Strings


Hidden-Markov renewal processes.


Look at all possible line segments in terms
of renewal processes.


If local density is closer in signature to a
satellite track than the background stars and
galaxies, then flag as a satellite track.
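
The density comparison can be sketched as a log-likelihood ratio on the gaps between consecutive objects along a candidate line segment. This sketch assumes both background and track are Poisson processes on the line (exponentially distributed gaps) with known rates; the rates and the names `log_lik_exponential` and `track_log_odds` are hypothetical, and the full renewal string model is richer than this.

```python
import math

def log_lik_exponential(gaps, rate):
    """Log-likelihood of inter-object gaps under a Poisson process of given rate."""
    return sum(math.log(rate) - rate * g for g in gaps)

def track_log_odds(positions, background_rate, track_rate):
    """Log-odds that objects at `positions` along a line come from a track.

    Positive values favour the (denser) track process, negative values
    favour the background star/galaxy process.
    """
    positions = sorted(positions)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return (log_lik_exponential(gaps, track_rate)
            - log_lik_exponential(gaps, background_rate))

# Hypothetical rates (objects per unit length along the line).
dense = track_log_odds([0.0, 0.5, 1.0, 1.5, 2.0], background_rate=0.1, track_rate=2.0)
sparse = track_log_odds([0.0, 10.0, 20.0, 30.0], background_rate=0.1, track_rate=2.0)
print(dense > 0, sparse < 0)
```

Because the background rate enters the ratio directly, the same test adapts to plates or regions with different star/galaxy densities.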


Benefits


Can use line widths thirty times narrower than
with Hough.


Copes with curves by using local linearity
rather than restricted to global linearity.


Deals with local star/galaxy density differences.


Copes with partial lines, dashed lines etc.
Flexible model.


Can use other data (eg ellipticity) to strengthen
classification.


Bayesian.

Generative renewal string


Can generate from model.

To use


Don’t use generative model! Too hard.


Look at all line segments. Transform
star/galaxy model to Poisson process on line.
Run Markov chain along each line.


Simplest case: class 0 is the background process.
Class 1 defines a renewal process
corresponding to a scratch, satellite track etc.
Processing is fully Markovian.
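
Running a Markov chain along a line can be sketched as a two-state hidden Markov model over the gaps between consecutive objects, solved with the forward-backward algorithm. Everything specific here is an assumption for illustration: exponential gap densities for both classes, the rates, the `stay` probability, and the function name `posterior_track`.

```python
import math

def posterior_track(gaps, rates=(0.1, 2.0), stay=0.9, prior=(0.5, 0.5)):
    """Posterior probability that each gap belongs to class 1 (track).

    Class 0 is the background process, class 1 the track process; `rates`
    are their assumed Poisson rates and `stay` is the probability of
    keeping the same class from one gap to the next.
    """
    n = len(gaps)
    trans = [[stay, 1.0 - stay], [1.0 - stay, stay]]
    emit = [[r * math.exp(-r * g) for r in rates] for g in gaps]

    # Forward pass, normalised at each step to avoid underflow.
    alpha, prev = [], None
    for t in range(n):
        if t == 0:
            a = [prior[s] * emit[0][s] for s in (0, 1)]
        else:
            a = [emit[t][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                 for s in (0, 1)]
        z = sum(a)
        prev = [v / z for v in a]
        alpha.append(prev)

    # Backward pass, also normalised.
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        b = [sum(trans[s][r] * emit[t + 1][r] * beta[t + 1][r] for r in (0, 1))
             for s in (0, 1)]
        z = sum(b)
        beta[t] = [v / z for v in b]

    # Combine the two passes into per-gap posteriors.
    post = []
    for t in range(n):
        w = [alpha[t][s] * beta[t][s] for s in (0, 1)]
        post.append(w[1] / (w[0] + w[1]))
    return post

# Hypothetical gaps: sparse background, a dense run, then background again.
gaps = [10.0, 8.0, 0.4, 0.5, 0.3, 0.5, 9.0, 12.0]
post = posterior_track(gaps)
print([round(p, 3) for p in post])
```

The dense run in the middle receives a high track probability while the wide gaps at either end stay with the background class, even though no global straight line has been fitted.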

Results


Get probabilistic results. Two
possibilities:


Probability of a given point being a spurious
point.


Most probable classification of points.
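
The second kind of result, the most probable joint classification, can be sketched with the Viterbi algorithm on a two-state chain. As before this is a hedged illustration: the exponential gap model, the rates, the `stay` probability and the name `most_probable_labels` are assumptions, not the settings behind the results shown here.

```python
import math

def most_probable_labels(gaps, rates=(0.1, 2.0), stay=0.9):
    """Most probable sequence of classes (0 = background, 1 = track) for the gaps."""
    if not gaps:
        return []
    log_trans = [[math.log(stay), math.log(1.0 - stay)],
                 [math.log(1.0 - stay), math.log(stay)]]

    def log_emit(g, s):
        # Log-density of gap g under an exponential with rate rates[s].
        return math.log(rates[s]) - rates[s] * g

    score = [math.log(0.5) + log_emit(gaps[0], s) for s in (0, 1)]
    back = []
    for g in gaps[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            best = max((0, 1), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][s] + log_emit(g, s))
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    labels = [state]
    for pointers in reversed(back):
        state = pointers[state]
        labels.append(state)
    return labels[::-1]

# Hypothetical gaps: sparse background, a dense run, then background again.
labels = most_probable_labels([10.0, 8.0, 0.4, 0.5, 0.3, 0.5, 9.0, 12.0])
print(labels)
```

Per-point probabilities are useful when a catalogue user wants to choose their own threshold; the single most probable labelling is simpler to store as a flag.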

Results


Two examples. The left example is a
small scratch or track in the corner of
ukj005. Right is a track on a dense plate.

Further examples


Further examples can be found at


http://www.anc.ed.ac.uk/~amos/sattrackres.html


A flythrough movie of one plate can be
found at


http://www.anc.ed.ac.uk/~amos/demos/flythroughnew3c0.avi (36MB)

Conclusions


Machine Learning and Data Mining methods
are proving, and will continue to prove, useful with
astronomical databases.


Methods do not always work automatically.
Some thought is needed.


Circular Hough transforms and renewal strings
have proven effective in locating a variety of
spurious objects in astronomical databases.


So far, these have been run on a quarter of one colour of
the SuperCOSMOS data.

Contact and URLs

http://www.anc.ed.ac.uk/~amos/


a.storkey@ed.ac.uk


http://www.roe.ac.uk/cosmos/scosmos.html