SWIRED
Spitzer Wide-area InfraRed Extragalactic Database

Project Report

Brian Smith
June 6, 2006
















1. Abstract

Astronomers can sometimes find very interesting objects for follow-up observation by systematically examining a broad, unbiased survey of the sky. Spitzer's SWIRE Survey is an excellent opportunity to carry out this kind of search for objects with unusual characteristics. By combining Spitzer data with catalogs from instruments that observed other wavelengths, astronomers can construct a more complete graph of an object's flux.

This project is a database-driven web application that provides access to these astronomical catalogs to the scientific community. The catalogs have been pre-matched using several features of the PostgreSQL database management software, which increases efficiency and reduces server load when users query the database. To query the database, users follow a succession of three steps on the website to choose the field, catalogs and properties they wish to return, and can then download the comma-delimited results file from the site.


2. Introduction

2.1 SWIRE Survey

The Spitzer Wide-area InfraRed Extragalactic Survey is a high galactic latitude survey being carried out for the benefit of the astronomical community by teams of scientists from the Spitzer Science Center at Caltech in Pasadena, California. It covers nearly fifty square degrees of the sky, approximately equivalent to the area covered by 250 full moons. The fields have been carefully selected for the best possible infrared observing conditions for detecting faint infrared galaxies and quasars. They lie away from the direction of the Milky Way's disk of stars and contain the least possible amount of interplanetary or interstellar gas and dust. Three of the fields were surveyed by the European ISO satellite's Extragalactic Large Area Infrared Survey (ELAIS), two in the northern and one in the southern galactic hemisphere. The two deep survey fields come from the United States' Chandra X-Ray satellite and the European XMM satellite.

The primary data for the survey are infrared images taken over about a month of total observing time with the Spitzer Space Telescope. The images are obtained in all seven infrared colors by instruments on the telescope, namely the IRAC mid-infrared camera and the MIPS far-infrared camera. Object catalogs are created from these images, recording the position, flux and many other features of the objects in large tables of data.

The Spitzer Science Center provides four primary SWIRE catalogs. The first and largest is a band-merged catalog consisting of fluxes and other information from the optical, IRAC and MIPS-24 bands. It requires detection in the first two IRAC bands, 3.6 and 4.5 microns, above specific signal-to-noise ratios. The other three catalogs contain the three MIPS bands: 24, 70 and 160 microns respectively. These catalogs are single-band only and are cut based only on each MIPS waveband.



[Figure: Equal-area projection in galactic coordinates with the SWIRE fields shown in red.]


2.2 Other Catalogs

The SWIRE survey is an excellent foundation for finding follow-up objects. In addition to the Spitzer Space Telescope, many other telescopes have made observations of one or more of the six SWIRE fields. Some of these telescopes, like the InfraRed Astronomical Satellite, observed in infrared wavelengths, while others viewed the sky in optical wavelengths.

These other catalogs are incredibly useful when searching a field for objects with unusual properties. By matching objects across catalogs astronomers can construct a graph called a spectral energy distribution (SED). This graph shows the amount of flux detected for the object at many different wavelengths and usually looks similar to a bell curve with a shortened left slope. The peak of the SED indicates the wavelength at which the star is brightest. There is a wealth of information in the SED of an object, but constructing it can be a tedious process. This project is an attempt to simplify that process and make it much more efficient by pre-matching the catalogs for the user and allowing them to construct complex queries and retrieve results using the web interface. The results produced by the application can then be used to construct SEDs and look for anomalous features that can lead to major astronomical breakthroughs.


3. Technical Background

3.1 Hardware

The platform is a Dell PC running Microsoft Windows 2000 Server SP4 with a 2.2 GHz Pentium 4 and 1 GB of RAM. The machine was provided by the Spitzer project scientist as the basis for developing astronomical applications. The extra speed and RAM have been very helpful in managing the large catalog tables in the database. The machine is currently at the Jet Propulsion Laboratory in Pasadena, California, behind the firewall.


3.2 Software

3.2.1 DBMS

The database management software that we chose for the project is PostgreSQL 8.1. PostgreSQL is an object-relational database server released to the public under a license similar to the Berkeley Software Distribution. It is an open source project maintained by a global community of developers.

PostgreSQL 8.1 suits our needs well for several reasons. It is a very mature DBMS, supporting many advanced features natively and making it comparable to commercial software packages. It preserves ACID rules and supports referential integrity, transactions and full Unicode. PostgreSQL provides the widest range of index types among the leading DBMSes, commercial and free, including R-tree, hash, expression, partial, reverse and even GiST indexes. It also provides triggers, functions and procedures in all of the popular procedural languages. It also has excellent portability if we acquire new hardware in the future, as it runs on Windows, Mac OS X, Linux, BSD and Unix.


Of particular interest for this project are the native geometric data types. In addition to the usual data types used in databases like integers, floats and character data, PostgreSQL provides points, boxes, circles and more complex paths and polygons. There are also numerous geometric functions and operators available to act on these data types. For example, one can determine whether two shapes overlap using the overlap operator (&&). Columns that contain geometric data types can be spatially indexed using R-tree instead of B-tree indexes. Finding overlapping shapes is very efficient once this is accomplished.
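
A minimal sketch of how these features fit together is shown below; the table and column names are illustrative placeholders, not the project's actual schema:

    -- Illustrative table with native geometric columns (hypothetical names).
    CREATE TABLE demo_objects (
        id      serial PRIMARY KEY,
        ra      double precision,   -- right ascension in degrees
        decl    double precision,   -- declination in degrees
        pos     point,              -- geometric point built from (ra, decl)
        err_box box                 -- small box around pos used for matching
    );

    -- Spatially index the box column; PostgreSQL 8.1 supports R-tree indexes.
    CREATE INDEX demo_objects_err_box_idx ON demo_objects USING rtree (err_box);

    -- The overlap operator (&&) can use the R-tree index, so finding boxes
    -- that overlap a region of the sky does not require scanning every row.
    SELECT id, ra, decl
    FROM demo_objects
    WHERE err_box && box(point(150.00, 2.00), point(150.05, 2.05));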


One other feature of PostgreSQL well-suited to this application is the partitioning of tables. Some of the tables that hold astronomical catalog data are very large, one with several million records. PostgreSQL allows tables to be partitioned by treating the tables as objects in a parent-child relationship using inheritance. The columns in the parent table are defined with the usual data definition language, but none of the records are stored in this table. Instead, any number of child tables are created that inherit the structure definition from the parent table. A record is stored in a particular child table based on check constraints that group records into partitions. In our case there is a child table for each of the six SWIRE fields for the largest catalogs, and check constraints on the field column partition the data as required.













[Diagram: Parent-child partitioning of the 2MASS catalog]

2MASS catalog (parent table): all columns are defined here; no records are stored in this table.

Child tables, one per SWIRE field:
  2MASS - Chandra South field only, created with
      create table catalogs.twomass_chs ( CHECK ( field = 'chs' ) )
          INHERITS (catalogs.twomass);
  2MASS - ELAIS N1 field only
  2MASS - ELAIS N2 field only
  2MASS - ELAIS S1 field only
  2MASS - Lockman Hole field only
  2MASS - XMM-LSS field only




Partitioning the large catalog tables keeps the indexes to a manageable size instead of a single gigantic index. Also, the database administrator can turn on a property called constraint exclusion. If a query selects records from the parent and specifies a constant in the column with the child's check constraint, the DBMS will not scan child tables that cannot satisfy the constraint. For example, if the query requests records from the Chandra South field, the DBMS will only scan records in the child table that holds the Chandra South field and not the records in any other children.
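
A minimal sketch of how constraint exclusion is used against the parent and child tables shown above; the query itself is only illustrative:

    -- Let the planner skip child tables whose check constraints rule them out.
    SET constraint_exclusion = on;

    -- Query the parent with a constant value in the partitioning column;
    -- only catalogs.twomass_chs needs to be scanned, not the other five children.
    SELECT count(*)
    FROM catalogs.twomass
    WHERE field = 'chs';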


3.2.2 Web Server

We are hosting the web application with Apache Tomcat 5.5. This web server supports Sun Microsystems' specifications for Java servlets (2.4) and JavaServer Pages (2.0). It also comes with the Jasper compiler that compiles the JSPs into servlets. The software is maintained by the Apache Software Foundation and released under the Apache License, which is less strict about open source code than others like the GPL.

Release 5 of Apache Tomcat reduced the need for object garbage collection through the implementation of a new request URI mapper. It also increased the efficiency of the JSP tag library through tag pooling, and added native Windows and Unix wrappers for platform integration.


4. Database Design and Implementation

4.1 Astronomical Catalog Tables

The majority of the data in the database for our application is composed of astronomical catalogs. The images returned by telescopes in space and on the ground are analyzed using a process called source extraction. Large tables are created from the images, and each record contains the position, the energy from the object called the flux, and many other attributes.

These catalogs are available to download from several public websites hosting astronomical data. Before importing we created a data definition script to create the table structures for the appropriate catalogs. We added a column named field to specify which of the six SWIRE fields contains each object. The three largest catalogs were created with the parent-child partitioning structure previously discussed, as sketched below.
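
The data definition script might look roughly like the following for one catalog; the column list is a hypothetical abbreviation for illustration, not the actual 2MASS schema:

    -- Parent table for a catalog; the added field column records which of the
    -- six SWIRE fields contains each object (columns other than field are
    -- illustrative placeholders).
    CREATE TABLE catalogs.twomass (
        id      bigint PRIMARY KEY,
        field   varchar(3) NOT NULL,   -- 'chs', 'en1', 'en2', 'es1', 'lck', 'xmm'
        ra      double precision,
        decl    double precision,
        j_m     real                   -- one example flux/magnitude column
    );

    -- One child table per SWIRE field, as discussed in section 3.2.1, e.g.:
    CREATE TABLE catalogs.twomass_chs ( CHECK ( field = 'chs' ) )
        INHERITS (catalogs.twomass);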




Object counts per SWIRE field for each catalog (CHS = Chandra South, EN1/EN2 = ELAIS N1/N2, ES1 = ELAIS S1, LCK = Lockman Hole, XMM = XMM-LSS).

Catalog        CHS       EN1       EN2       ES1       LCK       XMM    Total objects
spitzer*     495,292   631,355   307,855   430,526   720,603   558,553      3,144,184
twomass*      14,681    24,300    15,037    26,994    26,886    17,064        124,962
gsc2*         40,037    37,632    44,117    35,963    31,102    39,454        228,305
spitzer70      1,723     2,397     1,130       801     2,485     1,499         10,035
spitzer160       696     1,074       439       256     1,106       627          4,198
tycho            260       446       280       601       443       437          2,467
hip               48        56        40       116        78        94            432
irasp             18        20        15        25        41        14            133
irasf             62        76        45        82       126        39            430
simbad         1,042     1,117     1,461     2,856     1,546       445          8,467

All catalogs                                                                 3,523,613

* Partitioned catalogs: the per-field counts are stored in child tables with suffixes _chs, _en1, _en2, _es1, _lck and _xmm (e.g. spitzer_chs, twomass_chs, gsc2_chs).



4.2 Matching Tables

The main objective of this project is to make analysis of objects across several catalogs efficient and simple. There is no way to do this matching between catalogs simply by identifiers and foreign keys. The objects must be matched based on their position in the sky, using two-dimensional coordinates that in astronomy are called right ascension and declination, or RA and DEC for short. Depending on the accuracy of the telescope that made the observations and how long ago the observations were made, there is always some inaccuracy in the position of an object as recorded by different catalogs. However, there is usually a threshold distance between the positions of an object from one catalog to the next under which we can safely assume they are in fact the same object.


To do positional matching like this on the fly when the user makes a request would be wildly inefficient, if it succeeded at all. The user may request many catalogs, each of which may have tens or hundreds of thousands of objects. Even if the objects are matched beforehand, how does one go about finding objects that match? It again introduces incredible inefficiency to test every object from one catalog against another.

PostgreSQL offers several ways to avoid these pitfalls. When each catalog is imported we also create two geometric data type columns from the RA and DEC of the object: a point and a box. The point simply records the RA and DEC in a convenient data type. The box is centered on this point, and the half-size of the box is approximately twice the threshold distance considered when matching to other catalogs. The exact half-size is not critical because the box is only used to filter the objects that are tested.
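
A minimal sketch of how the point and box columns could be derived from RA and DEC at import time; the half-size value and column names are illustrative:

    -- Add the geometric columns to an already-imported catalog table.
    ALTER TABLE catalogs.twomass ADD COLUMN pos point;
    ALTER TABLE catalogs.twomass ADD COLUMN err_box box;

    -- Populate them from RA and DEC; 0.002 degrees here stands in for a
    -- half-size of roughly twice the matching threshold.
    UPDATE catalogs.twomass
    SET pos     = point(ra, decl),
        err_box = box(point(ra - 0.002, decl - 0.002),
                      point(ra + 0.002, decl + 0.002));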


The other way that PostgreSQL offers to avoid inefficiency is spatial indexes. Once the boxes are generated for a catalog table, the box column is spatially indexed using an R-tree index. This orders and relates elements of the box column by proximity instead of the usual comparisons of a B-tree index. From then on, when a query needs to find objects from one catalog that are in close proximity to objects in another catalog, the DBMS can use the R-tree indexes on the box columns to compare the positions of the objects very quickly. All the user has to provide are the box columns related by the overlap operator (&&). Thus, instead of doing a very expensive cross join from one table to another and effectively multiplying the row counts, we can use geometric data types and spatial indexes to filter our results.


As mentioned, the box/index method of filtering is only a rough cut of the objects from different catalogs that should be tested to see if they are in fact the same object. The heavy work on the database is already avoided by using the box/index filter, though. All that remains is to remove records where the point data of the objects are farther apart than the allowable threshold. This step is the only one that is still done on the fly when responding to a user request. All the matches that remain after the box/index process are stored in a table called matches. All the records in this table are duplicated with the order of the two catalogs reversed. This accommodates the fact that either catalog may be the primary catalog when matching to others.
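
A rough sketch of how the matches table could be filled with the box/index filter, and how the final point-distance test is applied at request time; the layout of the matches table and the threshold value are assumptions for illustration:

    -- Store every pair of objects whose boxes overlap (uses the R-tree indexes);
    -- a second insert with the two sides swapped adds the reverse ordering so
    -- that either catalog can later act as the primary.
    INSERT INTO matches (cat_a, id_a, cat_b, id_b)
    SELECT 'twomass', t.id, 'gsc2', g.id
    FROM catalogs.twomass t
    JOIN catalogs.gsc2 g ON t.err_box && g.err_box;

    -- At request time, keep only pairs whose points are within the threshold
    -- (0.001 degrees is a stand-in value); <-> is the point distance operator.
    SELECT t.id, g.id
    FROM matches m
    JOIN catalogs.twomass t ON t.id = m.id_a
    JOIN catalogs.gsc2    g ON g.id = m.id_b
    WHERE m.cat_a = 'twomass' AND m.cat_b = 'gsc2'
      AND (t.pos <-> g.pos) < 0.001;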


4.3 Web Application Tables

In addition to the tables that hold real astronomical data the database also contains
several tables that hold what may be called catalog metadata. These tables ar
e used by
the web interface to store information about the fields, catalogs and properties that are
not included in the catalogs themselves. For instance, the full name and description of
each field is stored in the table
fields
, and likewise for the catal
ogs and properties. This
data provides some context and extra information to the user when they are making
choices about their database query.
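
A minimal sketch of what such a metadata table might look like; this report does not specify its actual columns, so these are assumptions:

    -- Hypothetical layout for the fields metadata table used by the web interface.
    CREATE TABLE fields (
        code        varchar(3) PRIMARY KEY,   -- e.g. 'chs', 'en1', ...
        full_name   text NOT NULL,            -- e.g. 'Chandra South'
        description text
    );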



5. Interface Design and Implementation

The web interface is built around a succession of steps that allow the user to build a custom query of the catalogs in the database.






5.1 Model-View-Controller Architecture

In an effort to keep the code clean and separate our business logic, data and interface, we chose a Model-View-Controller architecture for the project. Each step in the query-building process is handled first by the appropriate controller. The controller handles the ongoing query being built by the user, as well as any data that must be displayed at the current step. It then dispatches to a JSP of the same name to present the information and a form to the user. When the user is done making choices the form is submitted to the controller that handles the next step. This continues until the final results JSP is reached. The user can also backtrack through the steps as desired, so the controllers must handle users coming from the previous step and those coming from the following step.





















[Diagram: Flow of controllers and JSP views through the query-building steps]

  Index.jsp             - simple intro, button to go to first step
  Fields Controller     -> fields.jsp      (Step 1: choose a field and spatial constraints)
  Catalogs Controller   -> catalogs.jsp    (Step 2: choose catalogs)
  Properties Controller -> properties.jsp  (Step 3: choose properties)
  Results Controller    -> results.jsp     (final page, save results file)





5.2 Persistence of Query Details

The interface as a whole is a single large form that can be used to customize queries on the catalog database. A choice at the beginning of the form determines the content of subsequent parts of the form. To make this possible we broke the form up into several steps. Submitting one step to the appropriate controller allows the controller to handle the choice made by the user and to direct the JSP view to display the adapted information.

There is an unfortunate side effect of splitting forms across multiple pages. Whereas with a single form the user's browser will handle revisions and backtracking sufficiently, in a multi-page form it is very common to lose information by going backward and forward through the steps. It would be best to track the choices the user has made from the server side to ensure the integrity of the data and to make the form as easy as possible for the user.

When the user reaches the first step of the form, the first controller creates a Java bean that will contain all the choices and data entered by the user. This query bean has methods and variables to track the fields, spatial limits, catalogs and properties that the user may choose. The bean is read and updated by each controller along the path of steps, so it is always up to date. It is also immune to the problems that arise when a user clicks "Back" or "Forward" while completing a multi-page form. If a user goes back to any previous step, the page displays a form based on the contents of the query bean and does not rely on the user's browser to persist information.

The contents of this query bean can be seen at any time in the left sidebar titled "My Query". This sidebar has areas that display the chosen field, spatial limits, catalogs and the number of properties chosen for each catalog. When a user has passed a particular step, the sidebar displays the information they have chosen as well as a green check mark next to the title of that information. For example, after finishing step 1 and submitting the form, the sidebar will display check marks next to "Fields" and "Spatial Limits" because these have now been chosen. This does not mean that they cannot be changed, however; the user can simply return to step 1, make a new choice and submit again to make changes.





5.3 General Use Case

The steps progress from the most general criteria to the most specific. In the first step the user must choose one of the six SWIRE fields to query. They may also choose whether to query the entire field or to perform a cone search, meaning to return only those objects within a given radius of a given point.
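
As an illustrative sketch, a cone search of this kind can be expressed with the point column and the geometric distance operator; the centre, radius and table here are placeholders, and a full cone search would also correct for the cos(DEC) factor in RA:

    -- Return only objects within 0.1 degrees of the point (RA, DEC) = (161.0, 58.0);
    -- <-> gives the planar distance between two points.
    SELECT id, ra, decl
    FROM catalogs.twomass
    WHERE field = 'lck'
      AND (pos <-> point(161.0, 58.0)) < 0.1;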


In step 2 the user is presented with a list of the catalogs that are available for the field they have chosen. As mentioned previously, that field and the spatial limits they have set are displayed on the left-hand side. Each catalog may be selected by a checkbox at the beginning of the line. The user may select one catalog if they only want to return objects from that catalog. The most interesting use is in selecting multiple catalogs, and there is no limit; the user may select all available catalogs on the page. If they choose multiple catalogs they must pick one of them to be the primary catalog for matching. In other words, objects from all the secondary catalogs will be considered matches if they are within the threshold distance of objects in the primary catalog. A primary must be chosen if the user has requested multiple catalogs.


The next step presents the user with the properties available for each of the catalogs they selected. In essence this is a way of choosing which columns from the catalog tables they would like to include in the results. The ID, RA and DEC columns from each catalog are mandatory, but all others are optional. Some of the most used properties of each catalog are checked by default as a convenience for the user. Users can also enter a lower bound, an upper bound, or both for each property. This can be used to filter results by putting restrictions on the data in a particular column.

Once the user has selected the properties they want and submitted the form, the final controller constructs the complicated SQL query, executes it and writes the results to a comma-delimited text file in the results directory on the web server. The user is presented with a link to their results file along with some metadata about the file. If there were any errors in processing the request they are displayed here as well. On most systems with modern browsers and the Excel spreadsheet application, clicking the link to the CSV file will automatically open it in Excel, ready to use.
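
A simplified sketch of the kind of query the final controller might build for a two-catalog request, combining the matches table, the distance threshold and a property bound; all names and values here are placeholders rather than the application's actual generated SQL:

    -- Primary catalog twomass joined to secondary catalog gsc2 via the
    -- pre-computed matches table, restricted to one field, the point-distance
    -- threshold and a user-supplied bound on one property.
    SELECT t.id, t.ra, t.decl, g.id, g.ra, g.decl, t.j_m
    FROM catalogs.twomass t
    JOIN matches m        ON m.cat_a = 'twomass' AND m.id_a = t.id
                          AND m.cat_b = 'gsc2'
    JOIN catalogs.gsc2 g  ON g.id = m.id_b
    WHERE t.field = 'chs'
      AND g.field = 'chs'
      AND (t.pos <-> g.pos) < 0.001
      AND t.j_m < 15.0;            -- example upper bound on a property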






6. Evaluation

The main set of catalogs has been imported, and all the existing catalogs have been matched pair-wise, creating records in the matches table. The web interface has been created and is functional, though there are several features that we may add in the near future. Some would allow the user more fine-grained control when building the query. Others would provide automation of some of the analytical processes that users will carry out on the result data. For example, it may be possible to automatically graph the fluxes from the catalogs chosen by the user, saving them valuable time.

The performance of the final controller in generating the results data is very good in most cases. Using the matches table, in which objects between two catalogs have been pre-matched, makes querying the database very efficient: the query only has to join catalogs to the matches table based on their primary keys. Queries returning thousands of records take less than a second. Those returning tens of thousands took several seconds and returned a results file of several megabytes. Despite the pre-matched objects and partitioned tables, queries involving the Spitzer catalog take upwards of several minutes to complete. This may be inherent when using this catalog, as its row counts are an order of magnitude beyond all other catalogs. Also, retrieving this catalog from the GATOR system at the Infrared Processing and Analysis Center at Caltech exhibited the same extended duration. Nevertheless we hope to improve this time in the future to bring it closer to the responsiveness of smaller queries.


7. Conclusion

This project was designed to help scientists do broad analysis of the fields in the SWIRE Survey. Along with the catalogs from the Spitzer Space Telescope we included many other infrared and optical catalogs. Through the web interface this data is now available for use, but more than that, the cross-matching between catalogs has already been completed for the user. By completing three steps to choose the field, catalogs and properties they wish to return, the user can receive their matched catalog results in seconds. Along the way we have learned the value of pre-matching objects between catalogs and have successfully used the geometric data types and spatial indexes of PostgreSQL to accomplish it. We were also able to persist the user's choices across multiple forms by utilizing a session-scope Java bean to store the information. Our project has created a solid foundation that can be easily expanded with other catalogs and even other fields in the future.


8. Appendices

8.1 Astronomical Data

The SWIRE Survey -> http://www.ipac.caltech.edu/SWIRE/
Spitzer -> http://irsa.ipac.caltech.edu/Missions/spitzer.html
2MASS -> http://irsa.ipac.caltech.edu/Missions/2mass.html
IRAS -> http://irsa.ipac.caltech.edu/Missions/iras.html
Tycho/Hipparcos -> http://www.rssd.esa.int/Hipparcos/catalog.html
SIMBAD -> http://simbad.u-strasbg.fr/Simbad
GSC II -> http://archive.stsci.edu/gsc/

8.2 Software

PostgreSQL 8.1.3 is available from http://www.postgresql.org
Apache Tomcat 5.5 is available from http://tomcat.apache.org
The Formatted Dataset classes that we used to easily create the comma-delimited results file are available from http://fdsapi.sourceforge.net