Implementing VO Services, some technical experiences. - ANU ...

numbergrandioseInternet and Web Development

Feb 5, 2013 (4 years and 6 months ago)

277 views

IMPLEMENTING VO SERVICES:

SOME TECHNICAL EXPERIENCES

eScience

Challenges in Astronomy and Astrophysics

Brisbane 2010

Jon.Smillie@anu.edu.au

ANU Supercomputer Facility / NCI National Facility

Context


Sustainable Optical Astronomy Archive


Fronted by IVOA
-
compliant VO services


ANU Supercomputer Facility / NCI NF


Funded by LIEF, ANDS


Including:


MACHO


SkyMapper


WiggleZ


GAMA


HAT
-
South


WiFeS

2

VO Pathfinder: MACHO


Stable legacy project, last data taken in 2003


LMC, SMC, Bulge observed over 10 years


100,000+ 74MB FITS Images


Catalogue of 75 million stars


Light curves for most of these stars (~300
data
-
points per star)


http://wwwmacho.anu.edu.au


3

What is a VO Service?

1.
A web
-
service.

2.
Fronting a database and/or data collection.

3.
Processing a URL conforming to an IVOA
standard (Encoding a search on the data)

4.
Returning a document conforming to an
IVOA standard (Encoding a subset of the
data representing the results of a search).

IVOA Standards:
http://www.ivoa.net/Documents

4

What VO Services were
implemented for MACHO?

1.
ConeSearch

1.
Image metadata

2.
Star catalogue

3.
Star lightcurves

2.
SIAP (incorporates image download)

1.
Image metadata and associated FITS images


http://macho.anu.edu.au


5

How did we implement the
MACHO VO/web
-
services?


With a web
-
service framework!


URL parsing


Object
-
Relation
Mapper

(ORM)


Forms


Web server


All the “boiler
-
plate” code you need to rapidly
build database
-
backed web services.


Rails (Ruby) (v 1.2.6) used for MACHO.


Backed by a
PostgreSQL

DB (v 8.1.4)


6

Biggest issue with Rails?


Database ideology:


“all tables must have in integer column called ‘id’
which acts as the table’s primary key”


“thou shalt not use composite primary keys”


MACHO tables had composite primary keys


Eg
: photometry: star id + observation id


Third
-
party
plugin

“composite
-
primary
-
keys”


v 0.9.0 worked with Rails 1.2.6 for MACHO


v 2.3.5.1 would not work with Rails 2.3.8


Latest v 3.0.3 claims to work with Rails 3

7

Rails Issue 2:
PostgreSQL



Postgres

provides simple geometric types:


Point, line, box, circle, etc.


Can formulate useful queries based on
overlapping regions, intersecting lines, etc.


Rails (v 1.2.6) won’t deal with these types


Casts them to ‘string’


We manually modified our Rails installation


Forced it to pass these types through to DB

http://www.anusf.anu.edu.au/~jgs900/rails
-
postgres.html

8

...that was the easy part.


With a good web
-
service framework,
developing the web
-
service was simple


Lots of detail to attend to with constructing the
output VOTable/XML documents.


UCDs/ontology etc


Needed to formulate the right queries to the
database.


Eg
: Distance on a sphere in SQL (
Haversine
).


Very time consuming due to detail, but recipe was
clear given IVOA service standards.

9

Data Management was harder


Data sets in this area are considered “legacy”
for the purposes of building VO.


Data structures, formats and schemas
preexist
, defined/developed by project team.
We overlay VO services.


Choice is to build VO services on top of
existing DB tables etc, or dump and
reingest
.


With MACHO we
reingested

to
PostgreSQL
.


Proprietary format was unsupportable


Off
-
the
-
shelf, no extensions,
Postgres

v 8.1.4

10

MACHO Data Management


Load data into a
PostgreSQL

8.1 DB


Observation metadata


1 table, 130,000 rows, 66MB


Star catalogue


3 tables, 76 million rows, 61GB


Star time
-
series photometry


1
partioned

table, ~20 billion rows


By
Postgres

standards, 10s of millions of rows,
or hundreds of GB are “large”.

11

Data problem 0: pre
-
ingest


A dynamic data
-
set, evolved over 10 years


Metadata overhaul


Correct omissions and errors, standardise format


FITS header overhaul


Align with metadata, correct errors


FITS WCS augmentation


Image coordinates aligned to UCAC catalogue


Repackage images as ME
-
FITS files


Many individual problem cases identified and
rectified before release to the VO ....

12

Data Problem 1: ingest


Initial estimate of time to ingest: ~1 year


Eventually got that down to ~ 2 months


Key lesson: don’t modify the DB data!


Attempted to add integer primary keys to tables


Changing the data provokes a “vacuum”


Vacuum on 10+ TB can take months!


Orders of magnitude faster to copy a very
large table and delete original, than to modify
and vacuum.


Found these out the hard way ...

13

Data Problem 2: querying


Indexes, indexes, indexes...


Time to do a select (with index in place):


Observation catalogue: 1/3 second


Star catalogue: 19 seconds


Time to do a select (with no index):


Star catalogue: > 10 minutes


Time to build indexes:


Observation catalogue: 1/3 second


Star catalogue: 233 seconds


Photometry: a significant fraction of 2 month ingest

14

Querying, continued:


Data types, indexes and queries related


Eg
: How to search for objects in a region?


Option 1: sequential comparison of all
ra
/
dec


Option 2:
Postgres

geometric functions:


Distance between points, overlap of regions, etc


Specific indexes which support these queries on
these types

So:

1.
Convert coordinates to geometric type

2.
Create appropriate index, and search on this

15

Summary


Using a web
-
service framework (Rails)


Was invaluable


huge
kickstart


Introduced various idiosyncrasies


Data management was harder


Data pre
-
processing took more effort than data
ingest and web
-
service development


Database design (indexes) critical to achieving
acceptable (or any ..) query performance


Query
-
> index
-
> table structure


Data ingest, index construction etc a far from
trivial task


16

If we were to do it again?


Store floats as scaled integers


Halve required storage space?


4 bytes
int

vs. 8 byte float


Third
-
party
Postgres

extension packages


Better query options/performance?


Investigate alternative DB architectures?


Column
-
oriented,
eg
:
InfiniDB

(SQL front
-
end)

17

Next: SkyMapper


Expecting ~1TB image data / night for 5 years.


1 million+ 512MB (FITS) observations.


Catalogue of 1 billion objects.


Photometry DB will be 5
-
> 10 times larger
than MACHO: ~100TB


IVOA Compliant web services


Using
CherryPy

(Python) framework this time


Much more lightweight than Rails


SQLAlchemy

(Python) SQL ORM


Much less
ideological than Rails

18