Writing Simple Perl Scripts to Create, Convert and Analyze XML Documents




Presented for:

APIII - Advancing Practice Instruction and Innovation through Informatics

Marriott City Center, Pittsburgh, PA

Friday, October 10, 2003

Session E2


Perl and Python Programming Workshop

Session Organizers: Jules Berman and Jim Harrison



Jules J. Berman, Ph.D., M.D.

Program Director for Pathology Informatics

Cancer Diagnosis Program

National Cancer Institute

National Institutes of Health

Rockville, MD


Virtually everything presented can be
reviewed at your leisure at:


http://65.222.228.150/jjb/tutor.htm


This site contains literally hundreds of Perl
programming tips and scripts

What is the purpose of XML?

XML allows heterogeneous systems to
communicate and exchange their data

It achieves this through metadata (data about
data).

XML can produce an ideal document that
completely describes itself, including all data
and all metadata.

COMMON XML TASKS


1. Converting an HTML file to an XML file.


2. Converting an XML file to an HTML file (e.g. making an
XML file presentable while preserving its information content)


3. Converting an Excel file to an XML file, or converting an XML
file to a different data structure (e.g. moving XML into a
standard database)


4. Querying an XML file


5. Querying multiple XML files for related information

Let’s do a simple conversion of an HTML file to an XML file.

Here’s the HTML file (notice that the top header
information has been removed):

<body>

<h1>Simple HTML document</h1>

<br>List to follow:

<ul>

<li>First

<li>Second

<li>Third

</ul>

</body>

</html>

open (TEXT, "html.htm") || die "Cannot open html.htm";      #substitute your html page
open (STDOUT, ">html.xml") || die "Cannot open html.xml";   #substitute your output file

print "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n";

%dictionary = (
    "body" => "document",
    "h1" => "title",
    "ul" => "list",
    "ol" => "list"
);

@keysarray = keys(%dictionary);

#read to the end of the file, so that a blank line does not stop the loop
while (defined($line = <TEXT>))
{
    $line =~ s/<\/html>//;    #the opening <html> tag was removed with the header
    $line =~ s/\n//;
    if ($line =~ /^<br>/)
    {
        #$' holds the text that follows the match
        print "<line>$'</line>\n";
        next;
    }
    if ($line =~ /^<li>/)
    {
        print "<item>$'</item>\n";
        next;
    }
    foreach $key (@keysarray)
    {
        #rename both the opening and the closing tag
        $line =~ s/(<\/?)$key/$1$dictionary{$key}/g;
    }
    print "$line\n";
}

exit;

Most important parts of the HTML->XML script:

%dictionary = (
    "body" => "document",
    "h1" => "title",
    "ul" => "list",
    "ol" => "list"
);

@keysarray = keys(%dictionary);

foreach $key (@keysarray)
{
    $line =~ s/(<\/?)$key/$1$dictionary{$key}/g;
}
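The dictionary substitution can be tried on its own. The following is a minimal sketch, not the presenter's code: the helper name rename_tags is hypothetical, and the added \b (word boundary) is an assumption that keeps a short key such as "ul" from matching the start of a longer tag name.

```perl
use strict;
use warnings;

# Tag dictionary from the script above: HTML tag name => XML tag name.
my %dictionary = (
    "body" => "document",
    "h1"   => "title",
    "ul"   => "list",
    "ol"   => "list",
);

# rename_tags is a hypothetical helper, not part of the original script.
sub rename_tags {
    my ($line) = @_;
    foreach my $key (keys %dictionary) {
        # (<\/?) captures "<" or "</" so opening and closing tags are both renamed;
        # \b stops a key from matching the start of a longer tag name
        $line =~ s/(<\/?)$key\b/$1$dictionary{$key}/g;
    }
    return $line;
}

print rename_tags("<h1>Simple HTML document</h1>"), "\n";   # <title>Simple HTML document</title>
print rename_tags("<ul>"), "\n";                            # <list>
```

Because the pattern requires a "<" immediately before the key, ordinary text that happens to contain "ul" or "ol" is left alone.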




Converting an XML file to an HTML file
(there are many different ways to do this)






Converting an XML file to an HTML file:


use XML::Parser;    #calls an external module

open (STDOUT, ">output.htm");

my $parser = XML::Parser->new( Handlers => {
    Init  => \&handle_doc_start,
    Final => \&handle_doc_end,
    Start => \&handle_elem_start,
    End   => \&handle_elem_end,
    Char  => \&handle_char_data,
});

my $file = "presum.xml";

$parser->parsefile($file);



sub handle_doc_start
{
    my $header = <<HEADER;
<html>
<head>
<title>
Precancer Classification
</title>
</head>
<body>
<center><h1>Precancer Classification</h1></center>
<br>
<br>
HEADER
    print $header;
}

sub handle_doc_end
{
    my $header = <<HEADER;
<br>
</body>
</html>
HEADER
    print $header;
}

sub handle_elem_start
{
    my ($expat, $name, %atts) = @_;
    if ($name eq "concept")
    {
        $count++;
        print "<br><font color=\"0000ff\">$name $count</font><ul>\n";
        return;
    }
}




Etc., etc., etc.,
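The remaining handlers are not shown on the slides. As a rough sketch of how the start, character-data and end handlers might cooperate, the hypothetical helper below (concepts_to_html, not part of the original script) collects the HTML in a string instead of printing it. It assumes that each concept element directly contains its text, which may not match the real layout of presum.xml.

```perl
use strict;
use warnings;

# concepts_to_html is a hypothetical helper; the element layout is assumed.
sub concepts_to_html {
    my ($xml) = @_;
    require XML::Parser;    # external module, loaded when the sub is first called
    my ($html, $count, $inside, $buffer) = ("", 0, 0, "");
    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my ($expat, $name, %atts) = @_;
            return unless $name eq "concept";
            $count++;
            $inside = 1;
            $buffer = "";
            $html .= "<br><font color=\"0000ff\">$name $count</font><ul>\n";
        },
        Char => sub {
            my ($expat, $text) = @_;
            # the parser may deliver one run of text in several pieces
            $buffer .= $text if $inside;
        },
        End => sub {
            my ($expat, $name) = @_;
            return unless $name eq "concept";
            $buffer =~ s/^\s+|\s+$//g;
            # the parser returns decoded text, so escape it again for HTML
            $buffer =~ s/&/&amp;/g;
            $buffer =~ s/</&lt;/g;
            $html .= "<li>$buffer</li>\n" if $buffer ne "";
            $html .= "</ul>\n";
            $inside = 0;
        },
    });
    $parser->parse($xml);
    return $html;
}
```

Calling concepts_to_html on a string such as "<list><concept>adenoma</concept></list>" returns one colored heading and one bulleted list per concept. Text is buffered until the end tag because the character handler can be called more than once for a single element.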




Remember: Perl XML-related modules can be
downloaded/installed at no cost from the
www.activestate.com ppm service.


PPM> search xml

Packages available from http://www.ActiveState.com/PPMPackages/5.6:

CGI-Form2XML                   [1.3]    Render CGI form input as XML
CGI-ToXML                      [0.02]   Converts CGI to an XML structure
CGI-XML                        [0.1]    Perl extension for converting CGI.pm variables to/from XML
CGI-XMLForm                    [0.10]   Extension of CGI.pm which reads/generates formated XML.
CGI-XMLPost                    [1.3]    receive XML file as an HTTP POST
DBIx-XML-DataLoader            [1.1b]
DBIx-XMLMessage                [0.05]   XML Message exchange between DBI data sources
DBIx-XML_RDB                   [0.05]   Perl extension for creating XML from existing DBI datasources
Data-DumpXML                   [1.05]   Dump arbitrary data structures as XML
GoXML-XQI                      [1.1.4]  Perl extension for the XML Query Interface at xqi.goxml.com.
HTTP-WebTest-Plugin-XMLReport  [1.01]   Report plugin for HTTP::WebTest generates output in XML format
Tk-XMLViewer                   [0.15]   Tk widget to display XML
XML-AutoWriter                 [0.37]   DOCTYPE based XML output
XML-Beautify                   [0.05]   Beautifies XML output from XML::Writer (soon to do any XML).
XML-DOM                        [1.25]   A perl module for building DOM Level 1 compliant document structures
XML-DOMHandler                 [1]      Implements a call-back interface to DOM.
XML-DTDParser                  [1.7]    quick and dirty DTD parser
XML-Excel                      [0.02]   Perl extension converting Excel files to XML
XML-Node                       [0.11]   Node-based XML parsing: an simplified interface to XML::Parser
XML-SAX                        [0.12]   Simple API for XML
XML-SAX-Base                   [1.02]   Base class SAX Drivers and Filters
XML-SAX-Builder                [0.02]   build XML documents using SAX
XML-SAX-Expat                  [0.37]   SAX Driver for Expat
XML-SAX-Machines               [0.4]    manage collections of SAX processors
XML-SAX-PurePerl               [0.80]   Pure Perl XML Parser with SAX2 interface
XML-SAX-RTF                    [0.1]    SAX Driver for Microsoft's Rich Text Format (RTF)
XML-SAX-Simple                 [0.02]   SAX version of XML::Simple
XML-SAX-Writer                 [0.44]   SAX2 XML Writer
XML-SAXDriver-CSV              [0.07]   SAXDriver for converting CSV files to XML
XML-Writer                     [0.4]    Perl extension for writing XML documents.
XML-Writer-String              [0.1]    Capture output from XML::Writer.
XML-XPath                      [1.12]   a set of modules for parsing and evaluating XPath statements
XML-XPath-Simple               [0.05]   Very simple interface for XPaths
XML-XPathScript                [0.03]   Stand alone XPathScript
XML-XQL                        [0.68]   A perl module for querying XML tree structures with XQL
XML-XSLT                       [0.40]   A perl module for processing XSLT

Creating an XML file from an Excel file


1. The example is done in Windows. Because it uses a Windows-based
application and the Windows API, it won’t work in Linux (not Perl’s
fault).

2. There are plenty of other approaches that will work in Linux.

3. The example also requires Excel to be installed.

4. The complete Perl script is opener7.pl and is found
in the Perl tutorial:
http://65.222.228.150/jjb/tutor.htm
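One approach that should also work in Linux, sketched here under stated assumptions, is to save the spreadsheet as a CSV file and convert that with core Perl. The helper names (escape_text, csv_to_xml) and the sample column names are hypothetical. The sketch assumes that the first line holds column headers that are already valid XML names and that no field contains a comma.

```perl
use strict;
use warnings;

# Escape the markup characters in element content.
sub escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # ampersand first, or the entities below get escaped again
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# csv_to_xml is a hypothetical helper. Assumptions: the first line holds
# column headers that are valid XML names, and no field contains a comma.
sub csv_to_xml {
    my ($csv, $row_tag) = @_;
    my @lines = grep { /\S/ } split /\r?\n/, $csv;
    return "" unless @lines;
    my @tags = split /,/, shift @lines;
    my $xml  = "";
    foreach my $line (@lines) {
        my @cells = split /,/, $line, -1;    # -1 keeps trailing empty cells
        $xml .= "<$row_tag>\n";
        for my $i (0 .. $#tags) {
            my $cell = defined $cells[$i] ? $cells[$i] : "";
            $xml .= "  <$tags[$i]>" . escape_text($cell) . "</$tags[$i]>\n";
        }
        $xml .= "</$row_tag>\n";
    }
    return $xml;
}

print csv_to_xml("core_id,diagnosis\n1,adenocarcinoma\n2,benign & normal\n", "core");
```

Each data row becomes one element named by the second argument, with one child element per column.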


Creates a Windows OLE object for Excel - NON_PERLISH

#"or" binds more loosely than "||", so die runs if CreateObject fails
my $app = CreateObject OLE "Excel.Application" or die "Can't open";

$app->Workbooks->Open($xlfile);


Creates the XML tags by collecting a list of the column headers

foreach my $column_place (@column_array)
{
    $thing = $app->Range("${column_place}1")->{'Value'};
    if ($thing ne "")
    {
        $thing =~ s/ /_/g;
        $thing =~ s/[^\w0-9]//g;
        $thing =~ s/2nd/Second/g;
        $nextthing = "$column_place||$thing";
        print "$nextthing\n";
        push(@index, $nextthing);
    }
    else
    {
        last;
    }
}
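The header clean-up above handles one special case ("2nd") by hand. A more general version might look like the sketch below. The helper name make_tag_name and the sample headers are hypothetical; the underscore prefix is one possible way to satisfy the XML rule that a name may not begin with a digit.

```perl
use strict;
use warnings;

# make_tag_name is a hypothetical helper that generalizes the clean-up above.
sub make_tag_name {
    my ($header) = @_;
    $header =~ s/^\s+|\s+$//g;        # trim surrounding spaces
    $header =~ s/\s+/_/g;             # spaces become underscores
    $header =~ s/[^A-Za-z0-9_]//g;    # drop every other character
    # XML names may not begin with a digit, so prefix an underscore
    $header = "_$header" if $header =~ /^[0-9]/;
    return $header;
}

print make_tag_name("Gleason Score (2nd)"), "\n";   # Gleason_Score_2nd
print make_tag_name("2nd opinion"), "\n";           # _2nd_opinion
```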

Creates a Windows OLE object for Excel - NON_PERLISH

foreach my $arrayvalue (@index)
{
    $arrayvalue =~ /\|\|/;
    my $key = $`;      #column letter, the text before the "||"
    my $value = $';    #tag name, the text after the "||"
    $thing = $app->Range($key . $row)->{'Value'};
    #substitute &amp; for & (this must come first)
    $thing =~ s/&/&amp;/g;
    #substitute &gt; for >
    $thing =~ s/>/&gt;/g;
    #substitute &lt; for <
    $thing =~ s/</&lt;/g;
    #substitute &apos; for '
    $thing =~ s/'/&apos;/g;
    #substitute &quot; for "
    $thing =~ s/"/&quot;/g;
    #delete everything except letters, digits, spaces and the & and ; of the entities
    $thing =~ tr/a-zA-Z0-9 &;//cd;
    print "<$value>$thing</$value>\n";
}

$row++;


BUILDING THE COOPERATIVE PROSTATE CANCER TISSUE RESOURCE TISSUE MICROARRAY FILE


1. Get xls file with core information

TMACPCTR.XLS 98,816 7-24-03 11:17am A


2. Convert the xls file to an xml file using opener7.pl

OPENER7.PL 3,663 7-24-03 11:39am A

This produces file block2.xml

BLOCK2.XML 328,263 7-24-03 11:39am A


3. Add header and trailer information to the xml file

Header information is basically:

<?xml version="1.0"?>

<histo>

<tma>

<header>

<title>CPCTR Microarray 1</title>

<creator>CPCTR</creator>

<subject>Tissue Microarrays</subject>

<description>CPCTR TMA XML</description>

<rights>public domain</rights>

<filename>tmacpctr.xml</filename>

</header>


Trailer information is basically:

</core>

</block>

</tma>

</histo>

This produces:

TMACPCTR.XML 331,636 7-28-03 10:58am A


4. Check validity of the tmacpctr.xml file using validtma.pl

VALIDTMA.PL 9,132 5-21-03 3:06pm A

The TMA validating Perl script can be obtained by going to
the TMA specification paper:

The tissue microarray data exchange specification: A community-based, open source tool for
sharing tissue microarray data

Jules J Berman, Mary E Edgerton and Bruce A Friedman

BMC Medical Informatics and Decision Making 2003 3:5

http://www.biomedcentral.com/1472-6947/3/5


The validating protocol produces a screen output that
includes:


c:\tmacpctr.xml

Begining to parse c:\tmacpctr.xml now.

Finished. c:\tmacpctr.xml is a valid Tissue Microarray File.

The one-way hash of your file is
e2ad62a75974628b7499bd7d771b82f0
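A one-way hash like this can be computed with a core Perl module. The sketch below assumes the value is an MD5 digest, because it is 32 hexadecimal digits long; the slides do not say which algorithm validtma.pl uses. The helper name file_md5 is hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5;

# file_md5 is a hypothetical helper. MD5 is an assumption based on the
# 32-hex-digit value shown above.
sub file_md5 {
    my ($path) = @_;
    open(my $fh, "<", $path) or die "Cannot open $path: $!";
    binmode($fh);    # hash the raw bytes, without newline translation
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}
```

Calling file_md5 with the path of the XML file returns the digest as a lowercase hexadecimal string. Any change to the file, even a single character, produces a different value, which is what makes the hash useful for checking that two parties hold the same file.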

Querying an XML file


1. There are many ways. Most people use XSLT
(Extensible Stylesheet Language Transformations).


2. When you haven’t converted your XML into
another data structure (like a database structure)
and you’re querying the XML document directly,
a query is the same as a transformation in which
you throw everything away except the content
that matches your query.
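The idea of a query as a transformation that throws away everything except the matches can be sketched in plain Perl. The helper name query_elements and the sample data are hypothetical, and the regular expression assumes simple, non-nested elements without attributes. For anything more complex, a parser module or XSLT would be the safer choice.

```perl
use strict;
use warnings;

# query_elements is a hypothetical helper. It assumes simple, non-nested
# elements without attributes.
sub query_elements {
    my ($xml, $tag, $pattern) = @_;
    my @hits;
    # visit every <tag>...</tag> pair and keep only the matching contents
    while ($xml =~ /<\Q$tag\E>(.*?)<\/\Q$tag\E>/gs) {
        my $content = $1;
        push @hits, $content if $content =~ /$pattern/i;
    }
    return @hits;
}

my $xml = "<list><concept>colon adenoma</concept><concept>actinic keratosis</concept></list>";
print "$_\n" for query_elements($xml, "concept", "adenoma");   # colon adenoma
```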

HETEROGENEOUS XML MERGES/QUERIES


1. Can be thought of as a special form of XSLT


2. Or as a data structure conversion


3. Or as a straightforward Perl programming job
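As an illustration of the third view, the sketch below finds values that appear in every one of several XML documents, even when each document uses a different tag name for the same kind of data. Everything here is hypothetical: the helper name shared_values, the document names and the tag names. The pattern again assumes flat elements without attributes.

```perl
use strict;
use warnings;

# shared_values is a hypothetical helper. Assumes flat elements without
# attributes; the caller says which tag to read in each document.
sub shared_values {
    my ($docs, $tags) = @_;    # $docs: { name => xml string }, $tags: { name => tag }
    my %seen;
    foreach my $name (sort keys %$docs) {
        my $tag = $tags->{$name};
        while ($docs->{$name} =~ /<\Q$tag\E>(.*?)<\/\Q$tag\E>/gs) {
            $seen{lc $1}{$name} = 1;    # compare values without regard to case
        }
    }
    # keep only the values that were found in every document
    my $total  = keys %$docs;
    my @common = sort grep { keys(%{ $seen{$_} }) == $total } keys %seen;
    return @common;
}

my %docs = (
    lab_a => "<cases><diagnosis>Adenoma</diagnosis><diagnosis>Carcinoma</diagnosis></cases>",
    lab_b => "<records><dx>carcinoma</dx><dx>hyperplasia</dx></records>",
);
my %tags = (lab_a => "diagnosis", lab_b => "dx");
print join(", ", shared_values(\%docs, \%tags)), "\n";   # carcinoma
```

The per-document tag table is what makes the query heterogeneous: the caller has to state which element in each file carries the comparable data.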




This is where namespaces become important