Chapter 15

Hybrid Approaches

Previous chapters have discussed ways of using XML together with a single type of database. This chapter talks about combining multiple techniques. Some people seem to want to see everything as an XML document, and others prefer to view the world as being composed of relational tables, or of objects. The real world is not an elegant abstract concept, so it is no surprise that no single one of these models meets every need.

Taking a "toolkit" approach, this chapter gives a number of suggested solutions to problems, and mentions software that is also listed in the Resource Guide at the end of the book.

Files and Databases

The fastest commercial SGML repositories have been around for years longer than XML itself; of these, the fastest store data in flat files. That doesn't make much of a repository, though, so they add other software. One interesting approach is to use a relational database to store metadata such as filename, permissions, author and title, but to keep the actual documents in external files. The files can be accessed with the Network File System (NFS), or through a web server, or with some other (generally proprietary) mechanism.

There are a number of difficulties with this approach, although they can all be surmounted.

The biggest is a Referential Integrity problem, as the database people call it. You have two sets of information that are interrelated, but that can be updated independently. What if I delete a file that's still listed in the database? What if I change the title in the database but don't edit the document? What if I rename a file? What if a file changes while I'm backing up the database, or the database changes while I'm backing up files?

The obvious way to surmount this is to deny users all access to the files except through your software. Write a server that controls all aspects of file management. Where there is derived information stored in the database, make sure that it is only updated by actually reading and parsing a document file.
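Even with such a server in place, an occasional consistency audit is cheap insurance. Here is a minimal sketch in the Unix shell; the docdb database, the documents table with its File column, and the /docs directory are all invented names standing in for your own schema.

#! /bin/sh
# Sketch: report disagreements between the files the database
# knows about and the files actually on disk. Names are invented.
mysql -N -e 'SELECT File FROM documents' docdb | sort > /tmp/in-db
( cd /docs && ls ) | sort > /tmp/on-disk

echo "Listed in the database but missing from /docs:"
comm -23 /tmp/in-db /tmp/on-disk

echo "Present in /docs but unknown to the database:"
comm -13 /tmp/in-db /tmp/on-disk

Run nightly from cron, a report like this catches referential integrity drift before your users do.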

A second possible problem is that a file system may be less "robust" than a database. Since databases are stored on the same disks used for file systems, this is generally bogus. Back up all of your data daily and weekly, and use a high-quality RAID if the data is important.

A real problem is that if you need to do a lot of structure-based searches on your XML data, you'll end up representing the XML structure in the database and lose the biggest advantage of this strategy: speed. If you are planning to ask for all stores in a certain geographical region containing magazines which in turn contain articles describing socks manufactured in your Aberdeen factory, this isn't the way to go. If you need XML-sensitive searching at all, see the next section, Databases and Text Retrieval, for a solution.

All of this is further complicated by the variation in support for external files in different databases. Some databases (including Oracle 8i) have support for backing up a database and external files together so that no data is lost, but this level of support is the exception rather than the rule.

Another possibility is to use a text management system such as RCS or CVS to store the documents, and keep metadata (of which more in Part Four) in the relational database. If you are not familiar with RCS and CVS, the Revision Control System and the Concurrent Versions System, check them out; they are pretty useful. The author keeps all of his text and configuration files in RCS. The next section explores this in more detail.

Using RCS and a Relational Database

The Revision Control System, RCS, stores text files. Each time you change the file, you "check it in" to RCS; RCS then stores the changes. As a result, the size of an RCS log file grows slowly, and yet RCS can quickly show you the differences between any two versions. The following log shows checking a file in, checking it out again, making changes, and then asking RCS what the changes were:

bash-2.03$ mkdir RCS
bash-2.03$ ci -u Morality.html
RCS/Morality.html,v  <--  Morality.html
enter description, terminated with single '.' or end of file:
NOTE: This is NOT the log message!
>> a poem by Matthew Arnold
>> .
initial revision: 1.1
done
bash-2.03$ co -l Morality.html
RCS/Morality.html,v  -->  Morality.html
revision 1.1 (locked)
done
bash-2.03$ (edit the file Morality.html here)
bash-2.03$ rcsdiff -c Morality.html
===================================================================
RCS file: RCS/Morality.html,v
retrieving revision 1.1
diff -c -r1.1 Morality.html
*** Morality.html   2000/03/02 21:36:02   1.1
--- Morality.html   2000/03/02 21:45:57
***************
*** 10,15 ****
--- 10,18 ----
  <p>
  We cannot kindle when we will<br>
  The fire that in the heart resides,<br>
+ <!--* http://www.library.utoronto.ca/utel/rp/poems/arnold2b.html
+  * has which instead of that
+  *-->
  The spirit bloweth and is still,<br>
  In mystery our soul abides:<br>
  &nbsp; But tasks in hours of insight will'd<br>
***************
*** 21,28 ****
  We dig and heap, lay stone on stone;<br>
  We bear the burden and the heat<br>
  Of the long day, and wish 'twere done.<br>
! &nbsp Not till the hours of light return<br>
  All we have built do we discern.<br>
  </p>
--- 24,66 ----
  We dig and heap, lay stone on stone;<br>
  We bear the burden and the heat<br>
  Of the long day, and wish 'twere done.<br>
! &nbsp Not till the hours of light return,<br>
  All we have built do we discern.<br>
  </p>
bash-2.03$ ci -u Morality.html
RCS/Morality.html,v  <--  Morality.html
new revision: 1.2; previous revision: 1.1
enter log message, terminated with single '.' or end of file:
>> added the rest of the verses, fixed a typo,
>> and commented a difference.
>> .
done

The bold lines after prompts are what I typed; I've also shown the changed lines in the diff output in bold. A "+" at the start of a line shows an added line, and a "!" shows a changed line. Had there been any deleted lines, they would have been marked with a "-".

The rcsdiff command can compare any two versions of the file, with the -r option:

rcsdiff -r1.2 -r1.9 Morality.html

It would be fairly easy to write a CGI script, perhaps using PHP, which checked a file into RCS and
updated the database. You might keep a simple table like this:

+-----+-----+----+-----------+---------+---------------------------+
|DocID|Owner|Rev |ModDate    |File     |Title                      |
+-----+-----+----+-----------+---------+---------------------------+
|104  |james|1.14|2000-03-14 |r419.xml |Book Review: MySQL & mSQL  |
+-----+-----+----+-----------+---------+---------------------------+

Your script could extract title and other information from the file each time it's changed, and update the database. You would be able to do title, owner and date searches quickly, and could retrieve any version of any document.
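As a sketch of that idea, here is a small shell version rather than PHP; the docdb database and the documents table match the sample table above, but the whole script is illustrative, not a finished CGI program.

#! /bin/sh
# Sketch: check a document into RCS and record the new revision
# in the database. Usage: checkin.sh file docid
file="$1"
docid="$2"

ci -u -m"checked in via the web" "$file" || exit 1

# The newest revision number, e.g. 1.14, from the RCS header
rev=`rlog -h "$file" | sed -n 's/^head: //p'`

# Crude title extraction; assumes <title>...</title> on one line
title=`sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$file"`

mysql docdb <<EOF
UPDATE documents
   SET Rev = '$rev', Title = '$title', ModDate = CURDATE()
 WHERE DocID = $docid;
EOF

A real version would need to quote the title carefully before handing it to SQL, and to check the user's permissions first.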

The main reason to put files into the database is to be able to back up your data safely. If the data is external, you may need to shut down the database in order to ensure that no files change during the operation.

The main advantages to having the files external are:

* Database size: most databases have a larger storage overhead than the file system.

* Performance: the database may slow down when there's a lot of stored data.

* External access: you could make a text retrieval index of the files, for example, or do a read-only CVS fetch of a project without having to use the database.

A hybrid system that stores files externally and indexes them with a text retrieval package is described in the next section, but even the ability to use the Unix grep command should not be undervalued.
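For example, a case-insensitive list of which external files mention Aberdeen is a one-liner (with /docs standing in, as before, for wherever the files live):

bash-2.03$ grep -il 'aberdeen' /docs/*.xml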

Databases and Text Retrieval

Most text retrieval systems work with files, not database fields. If you want to provide the most powerful searching possible, you may end up with one of two hybrid solutions: a relational database and text retrieval, or an object-oriented database and text retrieval.

Whichever sort of database you are using, the simplest technique is to generate an external file corresponding to each document, and hand the files to the text retrieval package to index. Give the files significant names, perhaps composed of a document number and a revision number. When you use the text retrieval package to search, it will give you a list of filenames that match your query. You can then map these filenames back to internal objects or database rows, carefully checking document access permissions before returning the number of matches and document list to the user.
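The mapping step can be as simple as taking the document number back out of each matched filename. The sketch below assumes files named like 104-1.14.xml (document number, then revision) and the same invented documents table as before; the permission check shown, matching the CGI REMOTE_USER variable against an Owner column, is the simplest possible policy, and real systems will want something richer.

#! /bin/sh
# Sketch: turn matched filenames such as 104-1.14.xml back into
# database rows, showing only documents this user may see.
for f in "$@"
do
    docid=`basename "$f" .xml | sed 's/-.*//'`
    mysql -N docdb <<EOF
SELECT DocID, Title FROM documents
 WHERE DocID = $docid AND Owner = '$REMOTE_USER';
EOF
done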

The author's own lq-text package has been used in this way, as have many commercial packages. At the time of writing there are no open source text retrieval packages that handle XML; if you are interested in working on lq-text, contact the author. Commercial vendors such as Open Text and Verity have support for SGML and/or XML searching. See the Resource Guide for a number of freely available text retrieval packages for plain text.

Here is an example of a result of a query using lq-text; the format is entirely configurable, and is shown here in an SQL-like format:

+---+-----------------------+--------------+---------------------------+
|Doc|Prefix                 |Match         |Postfix                    |
+---+-----------------------+--------------+---------------------------+
|12 |nal) the ``XForms'' X11|user interface|toolkit. You need to have a|
|12 |on a freely distributed|user interface|toolkit called the XForms L|
|13 |w programming language.|User interface|design. MPI was designed af|
|96 |r PostgreSQL (Graphical|User Interface|) 9. Integrated Development|
|96 |r PostgreSQL (Graphical|User Interface|) PostgreSQL has Tcl/Tk int|
+---+-----------------------+--------------+---------------------------+

+---+--------------------------+---------------+
|Doc|Filename                  |Location       |
+---+--------------------------+---------------+
|12 |PCMCIA-HOWTO              |/usr/doc/HOWTO |
|13 |Parallel-Processing-HOWTO |/usr/doc/HOWTO |
|96 |PostgreSQL-HOWTO          |/usr/doc/HOWTO |
+---+--------------------------+---------------+

The results are returned as lines of text using the command-line API; alternatives include using the C API to extract individual fields, or using a unique separator string between the fields and then extracting them from the text. One way to do this would be to generate XML, and lq-text can already escape tags in several ways. The following example shows a possible XML rendition of the same matches:

<Results>
  <File Name="/usr/doc/HOWTO/PCMCIA-HOWTO">
    <Match>
      <Before>tem before you begin: ∙ A 2.0.*, 2.1.*, or 2.2.* series
kernel source tree. ∙ An appropriate set of module utilities.
(Optional) the ``XForms'' X11 </Before>
      <Text>user interface</Text>
      <After> toolkit. You need to have a complete linux source tree for
your kernel, not just an up-to-date kernel image. The driver
modules</After>
    </Match>
    <Match>
      <Before>module "misc/serial", "serial_cs" This package includes an
X-based card status utility called cardinfo. This utility is based on a
freely distributed</Before>
      <Text>user interface</Text>
      <After> toolkit called the XForms Library. This library is
available as a separate package with most Linux distributions. If you
would li</After>
    </Match>
  </File>
  <File Name="/usr/doc/HOWTO/Parallel-Processing-HOWTO">
    <Match>
      <Before>nd parallel file I/O. Are these things useful? Of course
they are... but learning MPI 2.0 is a lot like learning a complete new
programming language. </Before>
      <Text>User interface</Text>
      <After> design. MPI was designed after PVM, and clearly learned
from it. MPI offers simpler, more efficient, buffer handling a</After>
    </Match>
  </File>
  <File Name="/usr/doc/HOWTO/PostgreSQL-HOWTO">
    <Match>
      <Before>reater than 200 Gig 7. How can I trust PostgreSQL ?
Regression Test Package builds customer confidence 8. GUI FrontEnd Tool
for PostgreSQL (Graphical </Before>
      <Text>User Interface</Text>
      <After>) 9. Integrated Development Environment Tools for
PostgreSQL (GUI IDE) 10. Interface Drivers for PostgreSQL 10.1 ODBC
Driver</After>
    </Match>
    <Match>
      <Before>ge MAY not be supported by PostgreSQL!! You may need to
verify those and add it to regression package. 8. GUI FrontEnd Tool for
PostgreSQL (Graphical </Before>
      <Text>User Interface</Text>
      <After>) PostgreSQL has Tcl/Tk interface library in the
distribution called 'pgTcl'. Tcl/Tk is a Rapid Application Development
tool and is</After>
    </Match>
  </File>
</Results>


To give an idea of how a text retrieval package might be coerced into cooperating in that way, the
command used to generate the above list is shown below.

#! /bin/sh

echo "<Results>"

lqphrase "$@" |
lqkwic \
    -S '<File Name="${FileName}">\n' \
    -A '</File>\n' \
    -s '<Match>
    <Before>${TextBefore}</Before>
    <Text>${MatchedText}</Text>
    <After>${TextAfter}</After>
    </Match>\n' \
    -f -

echo "</Results>"

The lqkwic program substitutes the ${TextBefore} and other variables given in the -s option for each match, and uses the -S and -A values before and after each group of matches for a given file. Other options (not shown here) support the replacement of XML entities and encoding of < and & in the output. This is a pretty complex example, and the purpose isn't to teach lq-text commands, but to give an idea of how it can be fairly easy to integrate a text retrieval package into other software at the Unix (or NT) shell level.

A fairly current source-code release of lq-text is included on the CD-ROM for you to play with.

Things to Watch For

Not all text retrieval packages can cope with documents that change, unless you build a new index of all the documents, often a very lengthy process. If this is an issue for you, check first. The lq-text package can "unindex" a file, if you still have the original. Space in the index is made available for reuse, but the index never actually shrinks, so it's a good idea to rebuild it completely once a month or so.

Since most text retrieval databases are unavailable, or give incomplete answers, during rebuilding, you may want to build a new index alongside the old one and then switch. This avoids several hours of downtime while a large index is rebuilt. A short shell script can usually do this, as the sketch below shows.
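In the sketch, mkindex stands in for whatever indexing command your own package provides (lq-text has its own), and the directory layout is invented.

#! /bin/sh
# Sketch: rebuild the index beside the live one, then swap.
# "mkindex" is a stand-in for your package's indexing command.
set -e

rm -rf /indexes/new
mkdir -p /indexes/new
mkindex -d /indexes/new /docs/*.xml

# The swap itself takes only an instant:
mv /indexes/live /indexes/old
mv /indexes/new /indexes/live
rm -rf /indexes/old

Searches see either the old index or the new one; the window during the two renames is tiny compared with the hours a rebuild can take.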

Some packages can handle compressed or gzip'd files, and some can't. Some can handle them at retrieval
time but not at indexing time, oddly.

Most text retrieval packages can write a database to a networked drive, but do so very slowly. A factor of ten or more performance loss is not at all unheard of. It's therefore common to run indexing software on the server even if the retrieval is done on a client.

You should always consider a text retrieval package to be a sort of cache: it is not the primary repository for your data, and often doesn't store the data itself at all. As with all caches, there are consistency problems. One is that the text retrieval package may grant access to information about documents even if the user doing the query should not be able to see the data. This is how the Dead Sea Scrolls were leaked: someone had access to a text retrieval index, and was able to piece the documents together word by word! Another issue is that the text retrieval system might not notice that a document has been changed, and give incorrect results.

When you are choosing (or implementing) a text retrieval system, note that the more documents you have, the more you will value precise and correct answers. Higher precision usually means storing more information, so that more accurate retrieval packages tend to have larger indexes -- often larger than the actual data. If you also need temporary storage to build the index, and enough room for a spare copy of the index while it's being rebuilt, remember to budget for the extra disk space. Usually, the more memory you have, the faster indexing goes, so having extra memory on the server may be useful too.

Using ndbm as a Cache Manager

Relational databases are very slow beasts compared to most other storage technologies: few databases can handle a paltry 100,000 transactions per second even on a Pentium or SPARC system. If you are serving up data from a database onto the web, you may find things are too slow, taking several seconds to send a complete web page. Users will generally wait three seconds or so for a response, but not ten or twenty unless they really need the information.

One way to speed performance is to design your database so that as few SQL JOIN statements are needed as possible. Sometimes performance constraints mean you end up abandoning the Third Normal Form that we mentioned in Chapter 3, Just Enough SQL.
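As an illustration of the trade (with an invented schema, not the one from Chapter 3): a normalized design joins two tables to show a document with its owner's full name, while a denormalized design copies the name into the documents table and accepts the duplication in exchange for a single-table lookup.

# Normalized: a JOIN on every page view
mysql docdb <<EOF
SELECT d.Title, o.FullName
  FROM documents d, owners o
 WHERE d.DocID = 104 AND d.Owner = o.Login;
EOF

# Denormalized: the owner's name is stored redundantly, so one
# single-table SELECT serves the page
mysql docdb <<EOF
SELECT Title, OwnerName FROM documents WHERE DocID = 104;
EOF

The cost is that renaming an owner now means updating every one of their documents.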

If the database isn't your own, or if you decide that you need to keep the current schema, you could speed things up by keeping a copy of recently accessed documents. This copy is called a cache. Caches are very difficult to implement:

* If the database changes, documents in the cache become incorrect.

* If multiple processes write to the cache at the same time, it may become corrupt.

* Different users might get different views of the database; the cache must take this into account.

* In a multi-user server environment, the documents in the cache itself must be protected so people can't look at what other users are reading.

* Problems in the database code become hard to debug, because you have to work out whether the document you see was from the cache or was freshly generated. A separate interface that bypasses the cache can help, as can a program to give a detailed table of contents of the cache itself.

The simplest way to implement a cache in the HTTP environment is to cache the generated HTML pages, making sure that the URL of a page is sufficient to generate the page. Some sites use cookies or the HTTP Realm-based authentication described in Chapter 2, and generate user-specific pages for a given URL based on the realm or cookie. Don't do this. Put a session identifier into the URL if you have to, because otherwise the browser's own cache will make debugging nearly impossible.

Once you've made your URLs unique, you can use Apache's cache, or investigate Squid. There are links to both of these packages in the Resource Guide in the Client/Server section. Figure 15.1 shows this architecture.

Figure 15.1
An External HTTP-Based Cache

If the URL is not sufficient to produce the correct page, you will have to use a different technique. You may want to do this anyway if you build up pages by assembling fragments, because caching the fragments may be more efficient or easier to implement. For example, if a fragment changes, you will know which parts of the cache to delete. Figure 15.2 shows where this cache goes in the architecture.

Figure 15.2

An Internal Cache

If you have lots of very small files, you'll end up with a lot of wasted disk space: each file has to be stored in an integral number of blocks on disk. The size of a disk block varies depending on the type of disk and the operating system. Most Unix systems can use disk blocks as small as 512 bytes even on a large disk, whereas Microsoft Windows systems tend to use 16Kbyte or larger blocks. For a million files of 500 bytes each, you've just lost fifteen gigabytes of space.
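The arithmetic is easy to check: with 16Kbyte blocks, each 500-byte file wastes 16,384 - 500 = 15,884 bytes, and a million such files waste about fifteen gigabytes.

bash-2.03$ echo $(( 1000000 * (16384 - 500) ))
15884000000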

In a client/server environment, a server can keep track of the latest version of an object, and can often avoid going to the database at all if it's asked for the same object twice. You have to remember to check permissions, of course: even if it's the same user, the database permission table might have been updated.

One way to implement a cache is to use the Berkeley db package (see Chapter 12 for more on ndbm and related packages); this is what at least some versions of Netscape use. You use the URL or document path as a key. The corresponding value might be the data itself, or it might be a structure that describes the data. If you store the data in the db file, you get the benefit of losing less file system space and having faster access, but it's harder to control the size of the cache.

Other Hybrid Applications with db

Elsewhere in this book we show an example called the AutoLinker, which adds hypertext links and annotations to documents on the fly. That example used an in-memory database built by reading an XML file, but an alternative is to split the process into two parts. The first part reads the XML file or uses DBI to perform a database query, and saves the results in an ndbm (or db) database. The second part can then be run at any time on any document, without the overhead of having to scan for possible targets. A variation of the AutoLinker described in Chapter 18 uses a text retrieval database to add links to documents in a similar two-stage strategy.

Documents as Objects

A number of XML-based systems use an object-oriented database to store XML, usually representing each element as an individual object. An alternative is to consider an entire document to be an object. The object could be stored externally, on the file system or in a different database, with the object-oriented database holding a "surrogate object" that represents the actual document. This is similar to the approaches described above with relational databases, of course.

Document Management and Work Flow

If you have a team of people working on the same document, it might sound useful to let anyone edit any element. In practice, people usually want to edit a chapter or a section, not just a single emphasis element. This lets you make an optimization and store the "editable objects" as single objects, either internal to the database or external to it.

Once someone has edited an object, they may need to get approval from management before it can be published. Clearly this can be automated, perhaps by sending email to the manager concerned as well as storing a "Needs Approval" state for the object.

Document Management and Work Flow are two related subject areas that could fill whole books by themselves. All that can be said here is that although there are few open source content management systems, that is changing. The Resource Guide has pointers to several, and the web site for this book may list more as and when they appear.

Error Handling

This may seem like an odd subject! Error handling is one of the hardest things to do well when creating a hybrid system. Each component will probably report errors in its own way, and you will need to trap the failures and report the detailed system-level information, but do so in a way that also helps the user to diagnose the problem.

One system the author has used issues the error "you may have too many processes running" whenever one of the components dumps core or crashes. A better message might be "The program start-revision crashed; it was called from internal task: Copy Configuration To Make New Revision, as part of revising document DZ40196 from revision 4 to revision 5", because at least then the users and staff would have a clue about where to start debugging.

It is worth spending a lot of time on a user interface, because that's all the user sees. The user rarely looks at the code itself. Error handling is an important part of that interface, and a difficult one.

Summary

This has been a short chapter, and yet the ideas here are among the most important in the book.
Combining the strengths of two or more systems leads to a stronger one.

It isn't always easy to combine packages: there are issues relating to backup, to integrity, to security and to error reporting. But these issues are worth overcoming.