Brief project overview: ----------------------- To take HTML files on a local drive/path, including subfolders, and compile a master SQLite file containing a summary of the HTML files. Two Delphi programs are needed. One program will index the HTML files. The second program will use the SQLite database to locate the HTML files based upon certain user input parameters.

tenderlaSoftware and s/w Development

Dec 13, 2013 (3 years and 5 days ago)

60 views

Brief project overview:

-----------------------


To take HTML files on a local drive/path, including subfolders, and compile a master
SQLite file containing a summary of the HTML files. Two Delphi programs are needed.
One program will index the HTML files
. The second program will use the SQLite
database to locate the HTML files based upon certain user input parameters.



Details:

--------


Source code will be provided and liberally commented. Source code m
u
st run in Dephi

7
or above
. EXE files will NOT be

submitted to the contractor at any time.


Tens of t
housands

of HTML documents, residing in multiple folders, will need to be
indexed.


The HTML files that will be indexed will either be a single file, or most often than not, a
series of files
that return

from

a wildcard search
filter
, a
nd/or a

folder

path
, or a series of
nested folders.



Program 1 of 2 (the Indexer tool):

==================================


Execution:

----------


The program will allow the user to specify the location of the HTML files t
hat need to be
indexed. Below is
roughly
how the interface that allows this to happen should look like:




For instance, if the file
name

filter is:


MyFile.html


...and the TCheckBox "Include
sub
folders" is checked off, the app will look in the current
ap
p's default path for the MyFile.html file. If the file does not exist, the program will
display an appropriate error message.


If the file filter is "MyFile*.html"
,

then the program will look in the current app's default
path for any files that match the "
MyFile*.html" filter.


If the file filter is "*.htm*" then the program will look in the current app's default path for
any files that match the "*.htm*" filter.


If the file filter is "*.htm*" and the user clicks the "..." TButton, then the TOpenDialog
wil
l appear and allow the user to specify a beginning search folder path (e.g., 'c:
\
my
files
\
start here'). The program will look in the selected TOpenDialog's path for any files
that match the "*.htm*" filter. If the TCheckBox "Include
sub
folders" is checked
ON,
then all subfolders containing files matching "*.htm*" will be included.


Once the file or files have been determined, each file will be opened and a search will be
performed on certain HTML tags found within the <head></head> section of the files.


Th
e
5
tags that th
is

program will utilize are the following:


<meta http
-
equiv="FileID" content="" />

<meta http
-
equiv="Reference" content="" />

<meta http
-
equiv="Keywords" content="" />

<meta http
-
equiv="Author" content="" />

<title></title>


The data found

in the content="" parameters of these tags will be used to create the
subsequent SQLite table indexes.


As each HTML file is analyzed, any "irregularities" should be shown in the TMemo,
which will serve as a log. The log data should be automatically save
d to disk when the
program closes. When the program launches, the TMemo should be blank.



Irregularities include:

-----------------------


+ HTML files that have one or more of the

5
tags
(above)
missing.


+ HTML files where the <title></title> tag exists
, but there is no data found between
these tags.


+ "Reference" content that is out of range or empty. ("Out of range" will be explained
later.)


+
Whenever BOTH
these situations exist: 1)
the
"Keywords"
tag
content contains no
keywords

AND
2)
where the “R
eferences” tag content is empty.


+ More than one HTML file with the same FileID
. (FileID will be explained later.)



If the chkBoxStopOnError TCheckBox 'Checked' property is True, then processing will
come to a stop and a message dialog will report an err
or has been found. No additional
HTML files will be processed.



Content parameters:

-------------------


In the case of the following <head> tags...


<meta http
-
equiv="Reference" content="" />

<meta http
-
equiv="Keywords" content="" />


...the content para
meter
for one or the other will

always be associated with a value or
collection of values. Examples:


<meta http
-
equiv="Reference" content="Isaiah 110:1|Matthew 2:22
0
-
23" />

<meta http
-
equiv="Keywords" content="Nazarene|Nazareth" />


If both content fields

are empty, then this is an error and needs to be flagged/logged as
such.


It is permissible to have the “Keywords” content empty IF there is content for the
“Reference” tag. For example, the following is acceptable:


<meta http
-
equiv="Reference" con
tent="
Isaiah 110:1|Matthew 2:22
0
-
23" />

<meta http
-
equiv="Keywords" content="" />


You will find that often more

than one value exists with
in the "content" parameter. In
these cases
the values will be separated by pipe/filter ( | ) characters. The program will
need to parse the values between the pipes. These
parsed
values will be used later to
create the SQLite tables.


But before we create these files, it will be necessary to verify that the values found
associated with the "Reference" content are not "Out of
range".


An "out of range" value, in the case of the "Reference" content, would be where the Bible
reference does not exist in the Bible. For instance, in the above example, "Isaiah 110:1"
is an out of range reference. This is because the book of Isaiah
only contains 66 chapters,
not 110. With respect to the other "Reference" content value (Matthew 2:220
-
23), this
value is also "out of range". This is because 1) Matthew chapter 2 doesn't have 220 verses
(it contains 23 verses), and 2) a Bible reference r
ange cannot be backwards (220 is
greater than 23). It turns out that the "220" is a typo, and should be 22, not 220. IOW,
"Matthew 2:220
-
23" should be "Matthew 2:22
-
23".

Such errors need to be
flagged/logged so that the user can make the necessary correc
tions.


The program will need to parse the values in the "Reference" content tag so that a
comparison can be made against a table of valid book names, chapters and verses. The
buyer/client will provide the consultant programmer with CSV files that contain

all the
valid book names, maximum number of chapters per book, and the maximum number of
verses in each chapter. Example:





CSV file #1:

------------


Gen,Genesis,1

Ge,Genesis,1

Gens,Genesis,1

Gn,Genesis,1

Genesis,Genesis,1

Genes,Genesis,1

Exo,Exodus,2

Ex,Exodus,2


The above will allow the program to determine whether a "Refere
n
ce" content book name
value is valid or not.


CSV file #2:

------------


1,1,31

1,2,25

1,3,24

1,4,26

1,5,32

1,6,22

1,7,24

1,8,22

1,9,29

..

..

2,1,22

2,2,25


This second CSV file i
dentifies the book by number (1 = Genesis, 2 = Exodus, etc.) The
second value identifies the chapter number, and the third value identifies the maximum
number of verses found in this chapter. Example "1,1,31" means that Genesis chapter 1
has a maximum of
31 verses.


Using this data, the app will first confirm whether any of the "Reference" content values
are "out of range". Any out of range values that are found will be listed in the TMemo
log. If the chkBoxStopOnError TCheckBox 'Checked' property is Tru
e, then processing
will come to a stop and a message dialog will report an error has been found. No
additional HTML files will be processed.


Note: There are two special instances that the “Out of range” function must take into
consideration. These would b
e when the verse range token in the “Reference” content is a
* or # symbol. Examples:


<meta http
-
equiv="Reference" content="Isaiah 53:*" />


<meta http
-
equiv="Reference" content="Matthew 2:#" />



In these cases where a * and # are found in the “verse” se
ction of the Bible reference,
only the book chapter needs to be verified as to whether it is out of range or not.


Also, the program needs to verify whether any duplicate references exist in the
“content=” field. Example,


<meta http
-
equiv="Reference" cont
ent="
Gen. 3:5|Exo. 5:7|Num. 5:1|Gen. 3:5”

/>


In the above example, “Gen. 3:5” exists more than once. In such cases, the duplicate
value will be discarded.

Such a duplicate value should be logged, however, but the
program execution will continue.


Also, t
here will be instances where the book number rather than the book name will have
been used in the “content=” parameter. E.g.,


<meta http
-
equiv="Reference" content="
1:3:5|Exodus 5:7|4:5:1|Gen. 3:5”

/>


In the case of “1:3:5” and “4:5:1”, the program will
still need to verify whether the
chapter and verse is “out of range”.



Each HTML file that has been indexed will be saved with its new file name in a folder
specified by the RenamedHTMLFilePath TEdit.


The HTML files will be given a random file name if a
value has not been assigned to the
"content" parameter in the tag "<meta http
-
equiv="FileID" content="" />". The random
file name will be alphanumeric (meaning having only numbers or lowercase letters), with
a ".htm" extension (although the "content" para
meter itself will not include the ".htm" file
extension). It is vital that no two files shall have the same random file name/FileID.


The random file name will be 10 characters in length (excluding the .htm extension).
Examples: h23ip81bn3.htm 3uy98z12
7n.htm (The FileIDs would be h23ip81bn3 and
3uy98z127n respectively.) The files will be saved in the folder specified by the
RenamedHTMLFilePath TEdit, and placed in a subfolder below the
RenamedHTMLFilePath folder identified by the first 2 characters
of the file name.
Assuming the RenamedHTMLFilePath = 'c:
\
users
\
Joe
\
Export' then the files...




h23ip81bn3.htm will be saved in folder c:
\
users
\
Joe
\
Export
\
h2


3uy98z127n.htm will be saved in folder c:
\
users
\
Joe
\
Export
\
3u


If a value is found associated wit
h the "content=" parameter in the following tag:


"<meta http
-
equiv="FileID" content="" />"


...then that value will be used as the file name instead (with a ".htm" extension). For
example...


"<meta http
-
equiv="FileID" content="w87sj20zj2" />"


The HTML
file that is found with the above tag will be copied and saved with the new
file name "w87sj20zj2.htm" in folder c:
\
users
\
Joe
\
Export
\
w8


If the FileID tag does not exist in the HTML file, or if the content value is empty, then
the newly chosen random 10 ch
aracter file name will be inserted, along with the tag, into
the original HTML file, within the <head> section, just below the <head> tag.



Important:



When posting your bid, make sure that the
dollar
or

cent amount contains the number 2 in
your bid. Ex
amples: $2, or $11.02
, or $22
.92
, or $
1
42, etc. This will tell me that you
have indeed read this document before posting your bid.

If you are uncertain whether
your bid will end up displaying the number 2, then simply place the number 2 somewhere
in your
message that accompanies your bid.


Now, back to our regularly scheduled program…




SQLite tables:

--------------


There will be several SQLite tables created. They will contain the data gleaned/parsed
from the HTML files (Keywords, Authors, Titles, Bible

references).



One SQLite table will contain:

------------------------------


1. The 10 character FileID


2. The author ID number (this will be created in another table
-

see next section
describing the second SQLite table)


3. The title data found betwee
n the <title></title> tags.


Example of the content of this table:


FileID,AuthorID,Title


h23ip81bn3,109,Once Upon A Time

3uy98z127n,22,Analysis of the Word "Radar"

8shhs8soa2,0,The Book of Tobit


This table will be indexed by the 10 character FileID.


If

more than one FileID is found during the processing of the HTML files, the program
should log this to the TMemo and display a message dialog. If the chkBoxStopOnError
TCheckBox is True, then the program execution should come to an end. In other words,
we

don't want to have two HTML files containing the same FileID.



A second SQLite table will contain:

-----------------------------------


1. AuthorID number


2. Author name


This table will be indexed by AuthorID.


The AuthorID number will need to be creat
ed after analyzing all the HTML files. The
only way to create this table will be to open and parse all the HTML files in advance,
temporarily storing all the content found in the "content" parameter in the tag: <meta
http
-
equiv="Author" content="" />), th
en removing all redundant/duplicate values, then
assigning numbers to the remaining values.


Example:


Given this <head> data...


<head>

<meta http
-
equiv="Reference" content="Isaiah 11:1|Matthew 2:22
-
23"

<meta http
-
equiv="Keywords" content="king|kingdom|d
omain|government" />

<meta http
-
equiv="Author" content="Ernest Hemmingway" />

<title>The Man and the River</title>

</head>


..."Ernest Hemmingway" would be the author name that would eventually be added to the
SQLite table. Note: It is possible, and accept
able, for the author content value to be
blank/nil.


This SQLite table's contents would look something like this:


0,

1,Abel Johnson

2,Harry Jones

3,John Smith

4,Wikipedia


The authors' names should be sorted A
-
Z prior to assigning AuthorID numbers.



A th
ird SQLite table will contain:

-----------------------------------


1. Each keyword found in the values assigned to "content" in the tag "<meta http
-
equiv="Keywords" content="" />" Example, if content="King|Kingdom|Eminent
domain|Government" then 4 record
s will be created in the SQLite table. One for the
keyword "King", another record for "Kingdom", etc.


2. The 10 character FileID.


This table will be indexed alphabetically, by the keywords



Example:


Given this <head> data...


<head>

<meta http
-
equiv="
FileID" content="3uy98z127n" />

<meta http
-
equiv="Reference" content="Isaiah 11:1|Matthew 2:22
-
23"

<meta http
-
equiv="Keywords" content="King|Kingdom|Eminent domain|Government"
/>

<meta http
-
equiv="Author" content="Ernest Hemmingway" />

<title>The Man and
the River</title>

</head>


...the second SQLite table would contain 4 records containing:


Eminent domain,3uy98z127n

Government,3uy98z127n

King,3uy98z127n

Kingdom,3uy98z127n



A fourth SQLite table will contain:

-----------------------------------


The fou
rth SQLite table, using the following <head> data:


<head>

<meta http
-
equiv="Reference" content=
"
Amos 1:*|
Isaiah 11:1|Ma
rk 2:
2
-
6|Luke 3:#
"

<meta http
-
equiv="Keywords" content="king|kingdom|domain|government" />

<meta http
-
equiv="Author" content="Ernest He
mmingway" />

<title>The Man and the River</title>

</head>


...would contain the following information:


23,11,

1,

3uy98z127n

30, 1, 0, 3uy98z127n

4
1
,

2,

2,

3uy98z127n

4
1
,

2,

3,

3uy98z127n

41, 2, 4, 3uy98z127n

41, 2, 5, 3uy98z127n

41, 2, 6, 3uy98z127n

42, 3
,
-
1, 3uy98z127n



You should notice a few things about how the verse range in the “Reference content=”
field is rendered in the table:


1.

If there is a specified range of verses, such as Mark 2:2
-
6, then a record for each
verse needs to be entered into the
table. In the case of Mark 2:2
-
6, a total of 5
records should be entered into the table, one record for each verse. For each of
these 5 records, the same FileID should be displayed.

2.

If there is a * in the verse portion of the Reference, then only one recor
d will be
entered into the table, and a zero (0) will become the verse number. In the
example above, with Amos 1:*, the record that should be added to the table would
be: 30,1,0,3uy98z127n

3.

If there is a # in the verse portion of the Reference, then only on
e record will be
entered into the table, and a negative one (
-
1) will become the verse number. In
the example above, with Luke 3:#, the record that should be added to the table
would be: 42,3,
-
1,3uy98z127n


What do the * and # represent? These markers repr
esent that these records
,

during a
search/query of the table, should be retrieved no matter what the user inputs for the verse.
IOW, if the user later wishes to perform a query on book 42 (Luke), chapter 3, verse 11,
then the record
42,3,
-
1,3uy98z127n

woul
d be treated as a match. This record
should be retrieved. In the case of a search for book 30 (Amos), if the user searches for
chapter 1 verse 5, the record represented by
30,1,0,3uy98z127n

would be
considered a match. So why are there two symbols (* and
#) if queries will treat them the
same? Because the file in which a # is found means that the HTML file has internal
bookmarks, which can be accessed, based upon the verse number that is being queried.
E.g., the HTML file with the FileID of 3uy98z127n has

<a href= “#”></a> tags within
them (where # = a number representing a verse), which can be accessed from the URL
string (e.g.,
http://mySite.com/3u/3uy98z127n.htm#5
). This URL would then cause the
“5
” bookmark to be displayed within the user’s web browser.


This table would be sorted first by Bible book number (first field), then by chapter
number (second field), then by verse number (third field), then by FileID (fourth field).


Example:


15,5,18,si8
9z12jas

15,5,22,ji12zks023

15,6,3,ji12zks023

23,11,1,3uy98z127n

23,11,1,78hj2ik9a8

40,2,22,3uy98z127n

40,2,23,3uy98z127n

40,2,23,ikj28zu27s

40,2,23,ji12zks023

40,2,23,vsk28slas23



SQLite file names:

--------------------------


The SQLite files will be nam
ed based upon the date and time of creation. Example:


2013
-
05
-
23_23
-
10.db


This will follow the format: yyyy
-
mm
-
dd_hh
-
mm*.extension



SQLite file and table protection

--------------------------------------


The data within the SQLite tables must not be ab
le to be accessed by a third party,
outside of the second program that will be created (see next page). Using a password to
open the SQLite table will be necessary. Also, I would want to see the FileID value either
scrambled or encrypted. I do not want som
eone to be able to easily obtain a list of the
available HTML files on the server.




The Goal:

---------


The goal of this first application is to index all of the HTML files on a local disk so that
the .htm files can be located later based upon either: 1
) A keyword, or 2) A Bible
reference (e.g., Genesis 3:10). The .htm files are being copied from a local drive and
renamed so that they can be placed on a web server. The SQLite tables will be used by a
second application to find a HTML file based upon a
keyword or Bible reference.


If the consultant programmer believes there is a better method of laying out the SQLite
tables, then I am open to such a discussion.




Delphi components to be used:

-------------------------------


On the TForm will exist:


TC
heckBox (Caption := 'Include sub folders'; Name := 'chkBoxIncludeFolders')


TMemo (Name := 'memoLog'; this will contain a running log of events that occur during
the execution of the program). The log data will be saved to disk when the program
closes.


TEdit with TLabel (Caption := 'File name filter and path:')


TEdit with TLabel (Caption := 'HTML files will be renamed and saved to path:';
TEdit.Name = 'RenamedHTMLFilePath')


TEdit with TLabel (Caption := 'SQLite table path:')


TButtons: 1. (Caption := '
Begin insertion of reference tags') 2. (Caption := '...') 3.
(Caption := 'Begin indexing of HTML files')


TOpenDialog


TCheckBox (Caption := 'Stop processing if an irregularity is encountered'; Name :=
'chkBoxStopOnError')


Font size should be 11 and the f
ace should be Calibri for all text that appears in this app.


TButtons with "..." will trigger a dialog window that allows the user to choose the path
for the RenamedHTMLFilePath text field.


All of the properties associated with any of the TEdit.Text comp
onents will be saved to
an INI file at the program's close, and restored when the program is launched. The INI
file will be placed in the folder where the EXE file resides.




Program 2 of 2 (the Retrieval tool):

====================================


Once
the SQLite tables have been created, a second separate program will be needed to
query the tables. This will be the second app that needs to be written.



What this second app should do is:


1) Allow the user to choose a SQLite .db file.


2) Allow the use
r to input a Bible reference string using numbers (e.g., 3:4:23 = book
number:chapter number:verse number(s)) and have all the FileIDs,
Verse number
,
Author
names and Titles appear in a TMemo. Example:


Input: 40:1:3


Output:


8ak2j129ak
|
3
|
John Smith|Once
Upon A Time

ia82oa91js
|
0
|
Mike Wilson|Today Is Now

ak12ja0992
|
-
1
|
Wikipedia|Uranium


Note that the verse input
will

be a single verse,
not

a range of verses (e.g., 40:1:3). In this
case,
any records that contain 40,1,3,… or 40,1,
-
1,… or 40,1,0…. should
be f
etched.



3) Permit the user to input a keyword
or search phrase
and have all the FileIDs, Author
name and Titles associated with
the user's input

appear in a TMemo. The search should
be case insensitive, IOW, searching for "italy" or "iTALy" should retur
n find "Italy" in
the SQLite table.

There will need to be some "fuzzy search" performed also, so that
entering "kings" will return any records associated with "king". Other fuzzy searches
would include:


Search input: the dog

Acceptable results: "the dog",

"dog", "the dogs", "the dog's", "dog's"


Search input: Beth Abel

Acceptable results: "Beth Abel", "Beth
-
Abel","Beths Abel","Beth Abels","Beth
-
Abels","Beth Abel's","Beth
-
Abel's","Beths
-
Abel"


The search results must consider a match to exist if adding a le
tter "s", or apostrophe "s"
exists, or if stripping a trailing "s" from the search input finds a match:


Search input: swallows

Acceptable results: "swallow","swallows","swallows'","swallow's"


It may be necessary to have a list of characters which will be

ignored during the indexing
process, so as to make the search process work in the way I've described above. In other
words, perhaps during the indexing/creation of the tables the following characters should
be removed:


' "
-

( )


! `
[

] = _ ? , . /
\

# $ & +



4) Display the amount of time it takes to retrieve the data.