Using Monarch Pro
A helpful method for pulling data from text documents,
PDFs, and HTML files into tables.
IRE 2006 | Fort Worth, Texas
Paula Lavigne | The Dallas Morning News |
What is Monarch?
Monarch is a “data
mining” software that can turn text documents, PDFs, HTML files, formatted
ASCII files and other report formats (.prn, .prt., .txt, .rpt) into usable data tables. It does this by
recognizing patterns in the
document, so it works best with reports that have a repetitive style and
It also has spreadsheet, database, analysis, and chart
making tools. But our main use of it is to
get that icky, jumbled data from a text file into a database we
Where do I get Monarch?
Monarch is sold by Datawatch at www.datawatch.com . Version 8 is the most current. Here’s the
downside. The standard edition is about $500; the professional edition is $600. (The professional
edition reads a greater ra
nge of report types, such as PDF and HTML.)
Here’s an example of what Monarch can do. It can take a document that looks like this:
And turn them into:
How it works
Monarch does this by creating several templates and setting “t
raps” in each template to capture
each level of information. You can create several fields from one level, but the number of levels
you have is determined by how your data are organized.
Monarch assumes that all information is hierarchical from the bottom
of a pattern set of
There are four types of templates based on levels of information.
Detail Templates: This the lowest level of data on your document. In the crime example, it
would include the number of different types of crimes and th
e categories on the left. You
can have only one detail template. Monarch creates one record for each detail line
extracted from the report.
Append Templates: These include information you want to add to each line of detail. In
the crime file, it would incl
ude the county. You can have up to nine append templates.
Page Header Templates: This attaches information at the page header to each line of
detail and recognizes page headers as they appear throughout the document. In the
crime example, the only piece of
data we needed from the page header was the year.
Footer Templates: You can create a footer template to extract data that follow the detail
lines, such as page numbers, totals, or certain comment lines.
Setting a trap
Once you pick which template you
need, you extract the data by setting a trap. The trap
recognizes a certain pattern in the data. In the two examples above, you can see how certain
types of data are always in the same, or at least similar, positions.
There are seven types of traps:
ical: The trap can match exactly a character or set of characters. On the music
example, this could be Account Number or Contact.
Numeric: For any numeric character.
Alpha: For any letter.
Blank: For any space or blank.
blank: For any type of characte
r (alpha, numeric, punctuation, etc.)
Postal Trap: A special trap that recognizes the format of address blocks.
Floating trap: Sometimes data won’t be in the exact same horizontal position. In this
case, you can specify a pattern using a combination of the
other traps that gets picked up
at any position along the line. For example, say your data look like this:
Owner: Jane Smith
Owner: John Doe
Owner: Jim Johnson
In this case, you coul
d use the floating trap to look for the date pattern NN/NN/NNNN, or
the Owner: pattern.
A sample trap
The first step in setting a trap is to pick one or more lines of sample text. Then you determine the
best pattern to set to trap your data. You can scr
oll through your document to see whether you’ve
missed lines that need to be trapped or included those that shouldn’t be.
In this case, we’ve set a numeric trap followed by two blank traps to match the unique pattern in
Each trap template b
uilds on the other. You can assign fields as you go, by highlighting the field
on the sample text line, and double clicking to bring up the field properties box.
Or you can finish creating your templates and then go to Edit | Field List, where you can nam
your fields and change the order.
This is just a sample of one type of trap. Monarch can also work with multiple line fields and data
in columns (Template | Multi Column Region).
When you’re finished, click on the Table Window icon to view your tab
You can export your table into a variety of formats, including, but not limited to, Microsoft Excel,
Dbase, and text.
The basic instruction manual with Monarch offers some standard data extraction samples, and it’s
quite easy to
follow. Datawatch also offers training throughout the nation, but it’s actually more
expensive than the product itself.
Good news is that you get some tech support from the company with the software purchase, and
there’s an active users forum at
or just go to
and click on “Support” and “Product Forums.”