XML Processing with Python



Sean McGrath

December 06, 1999

As part of our XML'99 coverage, we are pleased to bring you this taster from the
"Working with XML in Python" tutorial led by Sean McGrath.


A century ago, when HTML and CGI ruled the waves, Perl dominated the Web
programming scene. As the transition to XML on the Web gathers pace, competition
for the hearts and minds of Web developers is heating up. One language attracting
a lot of attention at the moment is Python.

In this article we will take a high-level look at Python. We will use the
time-honored "Hello world" example program to illustrate the principal features
of the language. We will then examine the XML processing capabilities of Python.

Python is free

Python is free. You will find downloadable source code plus pre-compiled
executables on the Python website. As you know, "free" is one of those words that
is often heavily loaded on the Internet. Fear not. Python is free with a capital
"F". You are free to do essentially anything you like with Python, including make
commercial use of it or derivatives created from it.

Python is interpreted

Python is an interpreted language. Programs can execute directly from the plain
text files that house them. Typically Python files have a .py extension. There is
no compilation phase as far as the programmer is concerned. Just edit and run!

Python is portable

Python is portable. It runs on basically every computing platform of note, from
mainframes to Palm Pilots and everything in between. Python uses a virtual
machine architecture, similar in concept to Java's virtual machine. The Python
interpreter "compiles" programs to virtual machine code on the fly. These
compiled files (typically having a .pyc extension) are also portable. That is to
say, if you wish to keep your source files hidden from your end-users you can
simply ship the compiled .pyc files.
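The compile step described above can also be triggered by hand. The following is a minimal sketch, assuming a current Python 3 interpreter rather than the 1.5.2 release discussed in this article; the temporary directory and file names are chosen purely for illustration. It uses the standard py_compile module to turn a source file into a .pyc file.

```python
import pathlib
import py_compile
import tempfile

# Hypothetical working directory and file names, used only for this sketch.
workdir = pathlib.Path(tempfile.mkdtemp())
source = workdir / "Hello.py"
compiled = workdir / "Hello.pyc"

# Write a one-line program to compile (Python 3 print syntax).
source.write_text('print("Hello world")\n')

# Translate the source into virtual machine code and store it in Hello.pyc.
py_compile.compile(str(source), cfile=str(compiled), doraise=True)

print(compiled.name, compiled.stat().st_size > 0)
```

In current Python releases a .pyc file is generally tied to the interpreter version that wrote it, so this kind of portability is across platforms rather than across releases.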


Python is easy to understand

Python is very easy to understand. Here is a Python program that prints the
string "Hello world":

print "Hello world"

I think you will agree that programming a "Hello world" application cannot get
much simpler than that! To execute this program, you put it in a text file, say
Hello.py, and feed it to the Python interpreter like this:

python Hello.py

The output is, surprise, surprise:

Hello world

Note the complete lack of syntactic baggage in the Hello.py program. There are
no mandatory keywords or semi-colons required to get this simple job done. This
spartan, no-nonsense approach to syntax is one of the hallmarks of Python and
applies equally well to large Python programs.

Python is interactive

By invoking the Python interpreter (typically by typing python on a UNIX/Linux
system, or running the "IDLE" application on Windows), you will find yourself in
an environment where you can execute Python statements interactively. As an
example, here is the "Hello world" application again:

>>> print "Hello world"

This will output:

Hello world

Note that the ">>>" above is Python's command prompt. The interactive mode is
an excellent environment for playing around with Python. It is also indispensable
as a fully programmable calculator!

Python is WYSIWYG

Python is sometimes referred to as a WYSIWYG programming language. This is
because the indentation of Python code controls how the code is executed.
Python does not have begin/end keywords or braces for grouping code
statements. It simply does not need them. Take a look at the following Python
code:

if x > y:
    print x
    if y > z:
        print y
        print z
else:
    print z

The indentation of the code is used to control how statements are grouped for
execution purposes. There can be no ambiguity as to which if clause is
associated with the else clause in the above code because both statements
have the same level of indentation.
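To see the effect of indentation on grouping, here is a small sketch written for a current Python 3 interpreter. The function names outer_else and inner_else are invented for illustration. The two functions differ only in how far the else line is indented, and that alone decides which if it belongs to.

```python
def outer_else(x, y, z):
    """The else lines up with the outer if, so it pairs with it."""
    printed = []
    if x > y:
        printed.append(x)
        if y > z:
            printed.append(y)
    else:
        printed.append(z)
    return printed


def inner_else(x, y, z):
    """The else lines up with the inner if, so it pairs with it."""
    printed = []
    if x > y:
        printed.append(x)
        if y > z:
            printed.append(y)
        else:
            printed.append(z)
    return printed


print(outer_else(3, 2, 5))  # [3]
print(inner_else(3, 2, 5))  # [3, 5]
```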

Functions in Python

We can turn the "Hello world" program into a Python function like this:

def Hello():
    print "Hello world"

Note that statements within the body of a function are indented beneath the
def line which introduces the function. The parentheses are a placeholder
for function parameters. Here is a function that prints its parameters as
well as the string "Hello world":

def Hello(x,y):
    print "Hello world",x,y

Python modules

A Python program typically consists of a number of modules. Any Python source
file can serve as a module and be imported into another Python program. For
example, assuming the Hello function above is housed in the file Greeting.py,
we can import the function into a Python program and call it as follows:

# Import the Hello function from the Greeting module
from Greeting import Hello

# Call the Hello function
Hello()

Programs as modules to larger programs

Python makes it easy to write programs that can be used both as stand-alone
programs and as modules to other programs.

Here is a modified version of Greeting.py which will print "Hello world" but can
also still be imported into other programs:

def Hello():
    print "Hello world"

if __name__ == "__main__":
    # Test Hello Function if running as
    # main program
    Hello()

Note the special __name__ variable above. This variable is automatically set
to "__main__" when a program is being executed directly. If it is being imported
into another program, __name__ is set to the name of the module, which in this
case would be "Greeting".
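The behaviour of the module name variable can be checked directly. The sketch below assumes a current Python 3 interpreter. It writes a Greeting.py file to a temporary directory and imports it with the standard importlib machinery. The seen_name variable is added purely for illustration, so that the value the module saw during import can be inspected afterwards.

```python
import importlib.util
import pathlib
import tempfile

# Write a small module to a temporary directory. The file name follows the
# article; the seen_name variable exists only for this sketch.
module_path = pathlib.Path(tempfile.mkdtemp()) / "Greeting.py"
module_path.write_text(
    "seen_name = __name__\n"
    "def Hello():\n"
    "    return 'Hello world'\n"
    "if __name__ == '__main__':\n"
    "    print(Hello())\n"
)

# Import the file as a module named "Greeting".
spec = importlib.util.spec_from_file_location("Greeting", str(module_path))
greeting = importlib.util.module_from_spec(spec)
spec.loader.exec_module(greeting)

# The guarded block did not run during the import, because the module saw
# its own name rather than "__main__".
print(greeting.seen_name)
print(greeting.Hello())
```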

Python is object-oriented

Python is a very object-oriented language. Here is an extended version of the
"Hello world" program, called Message.py, that can print any message via
a MessageHolder object:

# Create a class called MessageHolder
class MessageHolder:

    # Constructor - called automatically
    # when an object of this class is created
    def __init__(self,msg):
        self.msg = msg

    # Function to return the stored message string
    def getMsg(self):
        return self.msg

Note how indentation is used to structure the source code. The getMsg function is
associated with objects of the MessageHolder class because it is indented
beneath the class statement. Functions associated with objects are more
generally known as methods.
Suppose now that I need a variation on the MessageHolder class in which all
messages are returned in upper-case. I can do that by sub-classing, specifying
the class I wish to inherit from in parentheses after the class name:

# Import existing MessageHolder class from Message.py
from Message import MessageHolder
import string

# Create a sub-class of MessageHolder called MessageUpper
class MessageUpper(MessageHolder):

    # Constructor
    def __init__(self,msg):
        # Call constructor of superclass
        MessageHolder.__init__(self,msg)

    # Override getMsg with new
    # functionality
    def getMsg(self):
        return string.upper(self.msg)
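For readers on a current interpreter, here is a sketch of the same two classes in Python 3 syntax, kept in one file so that it runs on its own. It assumes Python 3, where the string.upper function no longer exists and the upper method of string objects does the same job.

```python
class MessageHolder:
    # Constructor, called automatically when an object is created
    def __init__(self, msg):
        self.msg = msg

    # Return the stored message string
    def getMsg(self):
        return self.msg


class MessageUpper(MessageHolder):
    def __init__(self, msg):
        # Call constructor of superclass
        super().__init__(msg)

    # Override getMsg to return the message in upper-case
    def getMsg(self):
        return self.msg.upper()


for holder in (MessageHolder("Hello world"), MessageUpper("Hello world")):
    print(holder.getMsg())
```

Code that calls getMsg does not need to know which of the two classes it has been handed.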

Python is extensible

The Python language consists of a small core and a large collection of modules.
Some of these modules are written in Python and some are written in C. As a
user of Python modules, you cannot tell the difference. For example:

import xml

import pyexpat

The first statement imports Lars Marius Garshol's implementation of an XML
parser that is written purely in Python. The second statement imports the Python
wrapping of James Clark's expat XML parser which is written in C.

Python programs using these modules cannot tell what language they have been
implemented in. As you would expect, programs based on pyexpat are typically
faster owing to the speed advantages of a pure C implementation of an XML
parser.

It is remarkably easy to write a Python module in C. This facility is very useful
for critical parts of large Python systems. It is also easy to "wrap" existing C
libraries as Python modules, as has been done with expat. Many technologies
exposing a C API have been wrapped as Python modules, for example Oracle,
the Win32 API, and the wxWindows GUI toolkit, to name a few.
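As an illustration of how a C-backed module looks from the Python side, here is a minimal sketch that assumes a current Python 3 interpreter, where the expat wrapping is exposed as xml.parsers.expat. The Greeting element name and the handle_text function are invented for the example.

```python
import xml.parsers.expat

chunks = []


def handle_text(data):
    # Called by the parser for each run of character data
    chunks.append(data)


# Create an expat parser and register the character data callback
parser = xml.parsers.expat.ParserCreate()
parser.CharacterDataHandler = handle_text

# The second argument marks this as the final piece of input
parser.Parse("<Greeting>Hello world</Greeting>", True)

text = "".join(chunks)
print(text)
```

Nothing in this code reveals that the parsing itself happens in C. The module is imported and called like any other.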

XML programming support

The core Python distribution (currently at version 1.5.2) has a simple
non-validating XML parser module called xmllib. The vast bulk of Python's XML
support is in the form of an add-on module under active development by the
Special Interest Group for XML Processing in Python (known as XML-SIG). To
illustrate Python's XML support, we will switch to an XML 1.0 version of the
"Hello world" program processing the following file:

<?xml version = "1.0"?>
<Greeting>
Hello world
</Greeting>




SAX is a simple API for XML, spearheaded by David Megginson and developed
as a collaborative effort on the XML-DEV mail list. The Python implementation
was developed by Lars Marius Garshol.

A Python SAX application to count the words in Greeting.xml looks like this:

from xml.sax import saxexts, saxlib, saxutils
import string

# Create a class to handle document events
class docHandler(saxlib.DocumentHandler):

    # Start of document handler
    def startDocument(self):
        # Initialize storage for character data
        self.Storage = ""

    # End of document handler
    def endDocument(self):
        # Print approximate number of words
        # by counting the number of elements in
        # the list of words returned by the
        # string.split function
        print len(string.split(self.Storage))

    def characters(self,ch,start,length):
        # Accumulate character data; SAX passes a start
        # offset and a length
        self.Storage = self.Storage + ch[start:start+length]

# Create a parser
parser = saxexts.make_parser()

# Provide the parser with a document handler
parser.setDocumentHandler(docHandler())

# Parse the Greeting.xml file
parser.parseFile(open("Greeting.xml"))
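The same event-driven word count can be sketched against the xml.sax package shipped with current Python 3 releases. This is an illustration under stated assumptions, not the API used in the listing above. The document is supplied as an in-memory byte string with an assumed Greeting root element, the WordCounter class name is invented, and the handler stores the count in an attribute instead of printing it.

```python
import xml.sax


class WordCounter(xml.sax.ContentHandler):
    # Start of document handler
    def startDocument(self):
        self.storage = ""
        self.words = 0

    # Accumulate character data as the parser reports it
    def characters(self, content):
        self.storage += content

    # End of document handler: count whitespace-separated words
    def endDocument(self):
        self.words = len(self.storage.split())


document = b'<?xml version="1.0"?>\n<Greeting>\nHello world\n</Greeting>\n'

handler = WordCounter()
xml.sax.parseString(document, handler)
print(handler.words)  # 2
```

In this version of the API the characters callback receives the text itself, so no slicing is needed.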



The DOM is a W3C initiative to standardize an API to XML (and HTML)
documents. Python has two DOM implementations. The one in the XML
modules is the work of Andrew Kuchling and Stéfane Fermigier. The other is
called 4DOM and is the work of Fourthought, who have also created XSLT and
XPath implementations in Python.

Here is a sample DOM application to count the words in Greeting.xml:


from xml.dom import utils,core
import string

# Read an XML document into a DOM object
reader = utils.FileReader('Greeting.xml')

# Retrieve top level DOM document object
doc = reader.document

Storage = ""

# Walk over the nodes
for n in doc.documentElement.childNodes:
    if n.nodeType == core.TEXT_NODE:
        # Accumulate contents of text nodes
        Storage = Storage + n.nodeValue

print len(string.split(Storage))
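A comparable tree-walking word count can be sketched with xml.dom.minidom from the current Python 3 standard library. This assumes Python 3 and an in-memory document with an assumed Greeting root element. It looks only at the direct children of the root element, which is enough for this file but would need a recursive walk for nested markup.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

document = '<?xml version="1.0"?>\n<Greeting>\nHello world\n</Greeting>\n'

# Read the XML document into a DOM object
doc = parseString(document)

storage = ""

# Walk over the child nodes of the root element
for n in doc.documentElement.childNodes:
    if n.nodeType == Node.TEXT_NODE:
        # Accumulate contents of text nodes
        storage += n.nodeValue

word_count = len(storage.split())
print(word_count)  # 2
```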

Native Python APIs

As well as industry standard APIs, there is a native Python XML processing
library known as Pyxie.

Pyxie is an open source XML processing library for Python which will be made
publicly available in January 2000. Pyxie tries to make the best of Python's
features to simplify XML processing.
Here is the word counting application developed using Pyxie:

from pyxie import *
import string

# Load XML into tree structure
t = File2xTree("Greeting.xml")

Storage = ""

# Iterate over list of data nodes
for n in Data(t):
    Storage = Storage + t.Data

print len(string.split(Storage))

In conclusion

We have looked at some of the main features of Python in a high-level way. Also,
we have glimpsed some of the XML processing facilities available. For further
information on programming with Python, I suggest you start with