Automation components in UsefulChem

grapedraught (Software and s/w Development)

Dec 2, 2013 (3 years and 7 months ago)


grapedraught_1648b13f-92aa-45e7-8a50-05b7c264ba43.doc, Page 1 of 3





This page describes the evolution of software tools which process the usefulchem-molecules blog into a variety of useful formats, e.g., spreadsheets, RSS feeds, and CML for molecular visualization/manipulation tools such as Jmol, and which add further chemical information (InChIs, MWs, supplier info) for the molecules in the UsefulChem project.

I will also discuss the ongoing development of an automated RSS feed reader for extracting and further processing this chemical information, and potential future work in these areas.

For more information on this work, and to follow new developments, please refer to my blog entries at http://usefulchem.blogspot.com.



Initial work with Excel / Excel VBA:


Molecule entries in http://usefulchem-molecules.blogspot.com are characterized primarily by a UC number (e.g., UC0188), a SMILES notation, and an image, although other information, such as CAS number, is often added. To summarize and expand on this data in a convenient format, a program in Microsoft Excel Visual Basic for Applications (VBA) (http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/MoleculeBlogInfo.zip) was developed which downloads this page, parses out the desired information, and generates a spreadsheet (http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/usefulchem-molecules/usefulchem-molecules.xls) in which each row represents one blog entry.

Given that the blog format itself is rather loose (for example, the SMILES entry might be prefixed by “SMILES” or “SMILES:”) and can change over time, the search criteria for fields were made fully configurable by placing them in an initialization (.ini) file.
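As an illustration of the idea, here is a minimal sketch in Java (the language of the later rewrite) of field search criteria read from configuration text rather than hard-coded. It is not the original VBA code: the class name FieldExtractor, the "field.pattern" key convention, and the sample patterns are all hypothetical, and the actual .ini layout is not described here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: search criteria live in configuration text, not in code.
class FieldExtractor {
    private final Properties patterns = new Properties();

    // iniText holds lines such as: smiles.pattern=SMILES:? *([^ <]+)
    FieldExtractor(String iniText) {
        try {
            patterns.load(new StringReader(iniText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the first capture group of the configured pattern, if it matches.
    Optional<String> extract(String field, String entryText) {
        String regex = patterns.getProperty(field + ".pattern");
        if (regex == null) {
            return Optional.empty(); // no criteria configured for this field
        }
        Matcher m = Pattern.compile(regex).matcher(entryText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed configuration: an optional colon after the "SMILES" prefix.
        String ini = "uc.pattern=(UC[0-9]+)\n"
                   + "smiles.pattern=SMILES:? *([^ <]+)\n";
        FieldExtractor fx = new FieldExtractor(ini);
        System.out.println(fx.extract("uc", "UC0188 SMILES: CCO"));
        System.out.println(fx.extract("smiles", "UC0188 SMILES CCO"));
    }
}
```

With this arrangement, a change in how the blog labels a field would only require editing the pattern text, not the program.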


Additional information beyond that provided by the blog, such as links to suppliers, was desired, and for this purpose several different freely available software packages and libraries were used. Molecular weight information and molecular format files (CML, MOL) were generated from the SMILES using the CDK Java libraries, while InChI descriptors were produced by Open Babel. Image files were at first generated using ChemSketch, although these are now simply downloaded directly from the blog itself. Supplier information was acquired by sending HTTP GET requests to chmoogle.com (now eMolecules.com) and processing the responses returned by this service.
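The general shape of such a lookup can be sketched as follows. This is only an illustration of sending a SMILES string in an HTTP GET query: the base URL and parameter name are placeholders passed in by the caller, since the actual query format of the supplier service is not given here, and the class name SupplierLookup is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a GET-based lookup; the endpoint is supplied by the caller.
class SupplierLookup {
    // Builds a GET URL carrying the SMILES string as a URL-encoded query parameter.
    // SMILES characters such as '(', '=' and '#' must be encoded to survive in a URL.
    static String buildQueryUrl(String baseUrl, String paramName, String smiles) {
        try {
            return baseUrl + "?" + paramName + "=" + URLEncoder.encode(smiles, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Sends the GET request and returns the response body as text for later parsing.
    static String fetch(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            } finally {
                conn.disconnect();
            }
            return body.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would combine the two, e.g. fetch(buildQueryUrl(base, "smiles", "CC(=O)O")), and then search the returned text for supplier links.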


In addition to the spreadsheet, this software also creates HTML and CML files (e.g., http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/usefulchem-molecules/UC0088.htm) for each blog entry, which in combination allow the molecules in the blog to be viewed with the Jmol applet.



RSS feeds and Automation Software in Java:


The spreadsheet format for the usefulchem-molecules blog was a useful beginning. It was, however, not very amenable to automated data processing or to other kinds of display desired, particularly for the internet/web. An initial attempt to address these deficiencies involved modifying the Excel VBA software to generate an RSS 1.0 feed (http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/usefulchem-molecules/usefulchem-molecules.rss) of the blog data in addition to its other output. The advantage of having the data in a feed is that it can then be viewed using any number of available desktop or web-based readers, such as RSS Bandit (http://www.rssbandit.org) or Bloglines (http://www.bloglines.com). Furthermore, as RSS is simply XML, feeds can contain other XML formatted data, such as Chemical Markup Language (CML). Thus, a feed can be downloaded and parsed for its CML by software such as Bioclipse (http://www.bioclipse.net) or Jmol (http://jmol.sourceforge.net).
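To make the point about embedded XML concrete, the following sketch pulls CML molecule elements out of an RSS 1.0 document with Java's standard namespace-aware DOM parser. The sample feed is invented for illustration: the exact placement of CML inside the usefulchem-molecules items is not shown here, so the structure below (a cml:molecule element directly inside each item) is an assumption, as is the class name CmlFeedScanner.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: find CML molecules embedded in an RSS 1.0 (RDF) feed.
class CmlFeedScanner {
    static final String CML_NS = "http://www.xml-cml.org/schema";

    // Invented sample; the real feed layout may differ.
    static final String SAMPLE_FEED =
        "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
      + " xmlns=\"http://purl.org/rss/1.0/\""
      + " xmlns:cml=\"http://www.xml-cml.org/schema\">"
      + "<item rdf:about=\"http://example.org/UC0188\">"
      + "<title>UC0188</title>"
      + "<cml:molecule id=\"UC0188\"/>"
      + "</item>"
      + "</rdf:RDF>";

    // Returns the id attribute of every CML <molecule> element in the feed.
    static List<String> moleculeIds(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required to match elements by namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(feedXml)));
            NodeList molecules = doc.getElementsByTagNameNS(CML_NS, "molecule");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < molecules.getLength(); i++) {
                ids.add(((Element) molecules.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(moleculeIds(SAMPLE_FEED));
    }
}
```

Matching on the CML namespace rather than on a prefix means the scan works whatever prefix a particular feed happens to use.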


A shortcoming of using Excel VBA is that it does not easily lend itself to automation. Also, it is neither truly an open source development platform nor portable to other operating systems such as Unix or Macintosh. Therefore, to address these shortcomings, I rewrote the VBA code in the Java programming language, which is both free (see http://java.sun.com/javase/downloads/index.jsp to download the Java Development Kit) and implemented on all major operating systems. Once in Java, it was straightforward to set the software up as a service to be run periodically. As a result, the RSS feed and associated files are now regenerated automatically whenever additions or changes are made to the usefulchem-molecules blog.
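One way such a periodic service could be arranged is sketched below. How the actual software is scheduled and how it detects blog changes is not described here, so both the fixed-interval polling and the content-digest comparison are assumptions, and the names BlogWatcher, fetchPage, and regenerate are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: poll the blog page and regenerate output only when it changes.
class BlogWatcher {
    private byte[] lastDigest; // digest of the page as seen on the previous check

    // True on the first call and whenever the content differs from the previous call.
    synchronized boolean hasChanged(String pageContent) {
        byte[] digest = sha256(pageContent);
        boolean changed = !Arrays.equals(digest, lastDigest);
        lastDigest = digest;
        return changed;
    }

    private static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is a required algorithm
        }
    }

    // Checks the page every intervalMinutes; regenerate runs only after a change.
    // Note: if the task throws, scheduleAtFixedRate cancels later runs, so a real
    // service would catch and log failures inside the task.
    ScheduledExecutorService start(Supplier<String> fetchPage, Runnable regenerate,
                                   long intervalMinutes) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            if (hasChanged(fetchPage.get())) {
                regenerate.run();
            }
        }, 0, intervalMinutes, TimeUnit.MINUTES);
        return timer;
    }
}
```

The caller supplies the download step and the regeneration step, and can stop the service by calling shutdown() on the returned executor.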


A zip file containing both the source and compiled code for the Java software to convert the usefulchem-molecules blog to an RSS feed can be found at http://showme.physics.drexel.edu/usefulchem/Software/Java/MoleculeBlogInfo/MoleculeBlogInfo.zip.



CMLRSSReader:


Having an RSS feed with special fields provides a launching platform for essentially unlimited opportunities for further treatment of chemical information. Standard RSS readers, however, rarely display more than the <description> and several other standard fields in a feed. Furthermore, they are not extendable or configurable to include additional processing via plug-ins or hook programs on a feed, its entries, or the various specialized fields it can contain. Thus, a specialized reader seemed necessary.


Writing a simple feed reader is actually not a particularly difficult software project, and there is a lot of help available in books and web sites (I used “RSS and Atom Programming” from Wrox books (Wrox.com) as a guide for all my RSS programming). I have developed such a reader, again using Java, which begins to address some of our specialized requirements for feeds containing CML and other chemical information.

This reader and associated software, which can be downloaded from http://showme.physics.drexel.edu/usefulchem/Software/Java/CMLRSSReader/CMLRSSReader.zip, is still at an early stage in development and can currently handle only RSS 1.0 feeds (and so far has only been tested on the usefulchem-molecules and two other closely related feeds), but demonstrates some of what can be done along the lines described above.

In addition to the standard reader features of automatically downloading and managing multiple feeds, displaying information contained in their item entries, and tracking new or changed items, the software also allows specialized programs to be executed on the feeds themselves and their contents. In its current form, programs can be configured to run after feed file download and/or processing. These programs can be written in any language, even DOS BAT files (although Java must be used on processed feeds, as they are stored via Java serialization), and can perform any processing/reporting desired, such as calculations using the CML in the feed, internet searches, database entry, and/or e-mailing results to the interested parties.
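A rough sketch of how configurable external programs might be launched at these two points follows. The reader's real configuration format is not described here, so the FeedHook class, the Stage names, and the "{feed}" placeholder convention are all hypothetical; only the use of Java's standard ProcessBuilder to start an arbitrary program is taken as given.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an external program bound to a point in the feed lifecycle.
class FeedHook {
    enum Stage { AFTER_DOWNLOAD, AFTER_PROCESSING }

    final Stage stage;
    private final List<String> template; // command line with "{feed}" placeholders

    FeedHook(Stage stage, List<String> template) {
        this.stage = stage;
        this.template = new ArrayList<>(template);
    }

    // Substitutes the feed file path for each "{feed}" placeholder.
    List<String> buildCommand(String feedFile) {
        List<String> command = new ArrayList<>();
        for (String arg : template) {
            command.add(arg.replace("{feed}", feedFile));
        }
        return command;
    }

    // Starts the program, waits for it to finish, and returns its exit code.
    int run(String feedFile) {
        try {
            Process process = new ProcessBuilder(buildCommand(feedFile))
                    .inheritIO() // show the program's output in the reader's console
                    .start();
            return process.waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for hook", e);
        }
    }
}
```

Because the hook is just a command line, the program behind it can be a Java class, a script, or a BAT file; the reader only needs to run every hook registered for the stage it has just completed.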


Two examples of this capability are already being used to automatically generate and upload information for display on the web. One, ExtractHTMLPages, is a Java program that parses the usefulchem-molecules feed file for its item <description> fields and generates an HTML file for each item. ExtractHTMLPages also generates an index file (http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/usefulchem-molecules/Items/UsefulChemistryMolecules.html) of the item HTML files which, using a combination of JavaScript and HTML iframes, allows any of them to be selected for viewing from a drop-down list.

When CMLRSSReader downloads a feed, which it does whenever the feed has been updated (which, in the case of usefulchem-molecules, occurs whenever the blog is updated), it automatically runs ExtractHTMLPages, generating and uploading all of these files to the web server.
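The drop-down index described above can be approximated with very little markup. The sketch below is not the ExtractHTMLPages source; it only illustrates the select-plus-iframe technique, and the class name IndexPageBuilder, the element id "viewer", and the file names are invented for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: an index page whose drop-down list loads item pages into an iframe.
class IndexPageBuilder {
    // Escapes characters that are special in HTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String build(String title, List<String> itemFiles) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>").append(escape(title))
            .append("</title></head><body>\n");
        // Choosing an option points the iframe at the selected item page.
        html.append("<select onchange=\"document.getElementById('viewer').src = this.value\">\n");
        for (String file : itemFiles) {
            String f = escape(file);
            html.append("  <option value=\"").append(f).append("\">")
                .append(f).append("</option>\n");
        }
        html.append("</select>\n");
        // Show the first item initially, if there is one.
        String first = itemFiles.isEmpty() ? "" : escape(itemFiles.get(0));
        html.append("<iframe id=\"viewer\" src=\"").append(first)
            .append("\" width=\"100%\" height=\"600\"></iframe>\n");
        html.append("</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("UsefulChem molecules",
                Arrays.asList("UC0001.html", "UC0188.html")));
    }
}
```

Keeping each item in its own HTML file means the index page stays small and only the selected item is fetched by the browser.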


The other example, ExtractNewItems, is a Java program which works with processed feeds to record and detail changes to the feed. When new items are added to the usefulchem-molecules feed, or new information about an item is added or modified, ExtractNewItems generates and uploads two files: newItems.html (http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/usefulchem-molecules/newItems.html) and newItems.xls. True to their names, these files list items that have been added or updated since the last time the program was run. Ultimately, the reason for a new listing will also be given, such as new supplier information, but this is not currently implemented.
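The core of this kind of change tracking is a comparison between two snapshots of the feed. The sketch below shows one simple way to do that and is not the ExtractNewItems source: it assumes each item can be reduced to an identifier (such as its UC number) and a content string, and the names FeedDiff and Change are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: classify feed items as added or updated between two runs.
class FeedDiff {
    enum Change { ADDED, UPDATED }

    // Both maps are keyed by item id, with the item's content as the value.
    // Items that are unchanged are left out of the result.
    static Map<String, Change> changes(Map<String, String> previous,
                                       Map<String, String> current) {
        Map<String, Change> result = new TreeMap<>(); // sorted by id for stable reports
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String old = previous.get(entry.getKey());
            if (old == null) {
                result.put(entry.getKey(), Change.ADDED);
            } else if (!old.equals(entry.getValue())) {
                result.put(entry.getKey(), Change.UPDATED);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> lastRun = new HashMap<>();
        lastRun.put("UC0001", "SMILES: CCO");
        lastRun.put("UC0002", "SMILES: CC(=O)O");

        Map<String, String> thisRun = new HashMap<>(lastRun);
        thisRun.put("UC0002", "SMILES: CC(=O)O; supplier info added");
        thisRun.put("UC0003", "SMILES: c1ccccc1");

        System.out.println(changes(lastRun, thisRun));
    }
}
```

Recording which field differed, rather than only that the content differed, would be one route to reporting the reason for each listing.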



Future Directions:


Quite a bit of ground has been covered, and a lot of evolution has occurred, since the initial work with Excel VBA. A certain amount of consolidation and strategic consideration would seem to be worthwhile at this point. To begin, the numerous web sites and pages generated would benefit from some organization. This can be done with a single page, or small set of pages, providing links to and descriptions of the various software tools and the pages they generate.


Second, although I have tried to make the CML RSS reader software highly flexible, it needs to be tested for compatibility with other RSS 1.0 feeds containing CML if it is to become of general use to the scientific community. Additional development is almost certainly going to be needed here (no one should expect to be that lucky!). I am also eager to see how the reader might interact with other software, such as Bioclipse, for example in providing CML and other data in automated fashion. This should prove fruitful, as Bioclipse obviously provides so much more in the way of processing and visualization tools than the reader itself. Other enhancements include a replacement for Java’s JEditorPane for displaying item data (JEditorPane’s handling of HTML is fairly primitive), other improvements to the user interface, and more configurable program extensions and/or plug-ins.


Finally, a lot of technologies have yet to be explored in this area. One excellent candidate is the combination of Ajax in HTML pages with chemical information web services. Ajax provides the ability to dynamically query web sites and services without the overhead in time and resources of retransmitting/reloading entire pages. In conjunction with JavaScript events and dynamic HTML, this can essentially turn an ordinary browser into a full-featured software user interface. Ajax also appears quite easy to use. For some simple examples of what can be done with Ajax, see http://showme.physics.drexel.edu/usefulchem/Software/Ajax/UsefulChemistryMolecules/UsefulChemistryMolecules.htm and http://showme.physics.drexel.edu/usefulchem/Software/Ajax/UsefulChemistryMolecules/UsefulChemistryMolecules2.htm (simply hover over any of the UC numbers).