
HarvestMan for Accessibility Assessment

Anand B Pillai

abpillai@gmail.com

(SETLABS, Infosys Technologies, Bangalore)

Workshop on web accessibility and meta-modeling



Agder University College, Grimstad, Norway

April 15 2005

Outline


Introduction


Applications (Uses)


Architecture


Protocols & Tags


Features


Threading Architecture


Thread co-operation


Flow of information


Modular design


Modules


HarvestMan in EIAO


EIAO extensions


Plans for a distributed version


Distributed operation


Distributed architecture


Plans for EIAO


Framework for developing web accessibility applications

HarvestMan


Introduction


HarvestMan is a web crawler program


HarvestMan is a console application


HarvestMan is written completely in the Python
programming language


HarvestMan is an open source project, released under
the GNU General Public License (GPL)


Version 1.4 is the current version available to the public


Version 1.4.1 is the development version


Project page is
http://harvestman.freezope.org


Development is hosted at http://developer.berlios.de


HarvestMan


Applications (Uses)


HarvestMan can be used to:


Download files from a website or
many websites


Download files from websites
matching certain patterns (regular
expressions)


Search a web site for keywords and
download web pages containing them

HarvestMan - Architecture


Fully Multithreaded


Uses the 'Producer-Consumer' design pattern with co-operating thread classes and multiple queues


Highly Configurable


Reads options from a text or xml configuration file.


Supports up to 60 different kinds of configuration options


Command-line options are also supported


However, the preferred way is to use the configuration files.


Downloads are organized into 'projects'


Each HarvestMan project has a unique name. It also has a
starting url and a download directory


HarvestMan writes a project file before the start of a download, using the Python pickle protocol

The project file can be read back later to continue or re-start an abandoned/finished project
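For illustration, a minimal sketch of such a project-file round trip with pickle; the file name and field names here are hypothetical, not HarvestMan's actual project format:

    # Minimal sketch of writing/reading a project file with pickle.
    # The file name and fields are illustrative, not HarvestMan's
    # actual project-file format.
    import pickle

    def write_project(path, name, start_url, download_dir):
        project = {"name": name, "url": start_url, "dir": download_dir}
        with open(path, "wb") as f:
            pickle.dump(project, f)

    def read_project(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    write_project("demo.project", "demo", "http://example.com/", "./demo")
    print(read_project("demo.project"))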

HarvestMan


Protocols & Tags

Protocols


Supports the HTTP, FTP, File and Gopher protocols


HTTPS support depends on the Python version used
(Supported for versions >= 2.3)

HTML Tags


Parses and downloads links pointed to by the following tags


Hyperlinks of the form <a href="...">


Image links of the form <img href=...>


Image links of the form <img src=...>


Links of the form <link href=...>


Stylesheet links of the form <link rel="stylesheet" ...>


Server-side JavaScript links of the form <script src="...">


Server-side Java applets (.class files) of the form <applet ...>
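As a sketch of this kind of tag parsing, here is a minimal link extractor built on Python's standard html.parser module (the modern stdlib parser; HarvestMan's own pageparser.py is far more elaborate):

    # Minimal link extractor for the tags listed above, built on the
    # standard library html.parser module. A sketch only; HarvestMan's
    # pageparser.py is far more elaborate.
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        # which attribute(s) carry a link, per tag
        LINK_ATTRS = {"a": ("href",), "img": ("src", "href"),
                      "link": ("href",), "script": ("src",),
                      "applet": ("code",)}   # applet .class file

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in self.LINK_ATTRS.get(tag, ()) and value:
                    self.links.append(value)

    p = LinkExtractor()
    p.feed('<a href="index.html"><img src="logo.png"></a>')
    print(p.links)   # ['index.html', 'logo.png']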

HarvestMan - Features

Filters


Filter urls based on regular expression patterns


Supports patterns based on url file extensions, url
path components and server names


Sample url filter:

-*.jpg-*.doc-*/exclude-this-path/*


Filter urls based on url 'scopes'. The scopes supported are:


Url depth scopes (length of a url w.r.t. the root server or the parent url)


Url boundaries (based on server names/IP addresses)


Url extents (based on url 'directories')


Advertisement (Junk url) filter


Version 1.4.1 has a full-fledged junk filter which can filter out junk (advertisement/banner/flash) urls
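As an illustration of such pattern filters, a sketch that translates wildcard patterns like the sample above into regular expressions with fnmatch; this translation is an assumption, not HarvestMan's actual filter code:

    # Sketch of a wildcard url filter translated to regular expressions
    # with fnmatch (an assumption; not HarvestMan's implementation).
    import fnmatch
    import re

    exclude_patterns = ["*.jpg", "*.doc", "*/exclude-this-path/*"]
    exclude_regexes = [re.compile(fnmatch.translate(p)) for p in exclude_patterns]

    def allowed(url):
        # a url is allowed if it matches none of the exclusion patterns
        return not any(r.match(url) for r in exclude_regexes)

    print(allowed("http://example.com/pic.jpg"))                   # False
    print(allowed("http://example.com/index.html"))                # True
    print(allowed("http://example.com/exclude-this-path/a.html"))  # False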

HarvestMan


Features (Contd.)

Limits


Maximum number of web servers can be specified (in
a multiserver project)


Maximum number of directories in a webserver can be
specified


Maximum number of files can be specified in a given
project


A limit can be set on the maximum size of a file downloaded in a project


Time-limits can be set for a project


The maximum number of simultaneous live connections (downloads) can be set
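A sketch of how such limits might be checked before a url is scheduled; the limit names below are hypothetical, not actual HarvestMan configuration options:

    # Sketch of enforcing per-project limits before scheduling a
    # download. The limit names are hypothetical.
    import time

    class ProjectLimits:
        max_files = 5000             # maximum files per project
        max_file_size = 1048576      # bytes; larger files are skipped
        max_duration = 3600.0        # seconds; project time limit

    def may_download(files_done, file_size, started_at, limits=ProjectLimits):
        if files_done >= limits.max_files:
            return False                                 # file-count limit reached
        if file_size > limits.max_file_size:
            return False                                 # file too large
        if time.time() - started_at > limits.max_duration:
            return False                                 # project time limit exceeded
        return True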


HarvestMan


Features (Contd.)

Controls


Obeys the Robot Exclusion Protocol (robots.txt) used
by certain servers


Can be turned on/off


Priorities for urls can be specified


Based on file extensions


Based on server names/IP addresses


Priorities can be specified in the range (-5, 5)


HarvestMan will schedule download of urls with a
higher priority before those with a lower priority


Sample priority specification setting:

jpg+3,png-2,doc-5,html+5
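A sketch of how a priority string like the one above could be parsed and used to order downloads with a priority queue (illustrative; not HarvestMan's actual scheduler):

    # Sketch of parsing a priority string like "jpg+3,png-2,doc-5,html+5"
    # and scheduling urls on a priority queue. Illustrative only.
    import heapq
    import re

    def parse_priorities(spec):
        # "jpg+3,png-2" -> {"jpg": 3, "png": -2}
        return {m.group(1): int(m.group(2))
                for m in re.finditer(r"(\w+)([+-]\d+)", spec)}

    priorities = parse_priorities("jpg+3,png-2,doc-5,html+5")

    heap = []
    for url in ["a.doc", "b.html", "c.jpg", "d.png"]:
        ext = url.rsplit(".", 1)[-1]
        # heapq pops the smallest item first, so negate the priority
        heapq.heappush(heap, (-priorities.get(ext, 0), url))

    while heap:
        _, url = heapq.heappop(heap)
        print(url)   # b.html, c.jpg, d.png, a.doc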

HarvestMan


Features (Contd.)

Storage (Persistence)


Files are saved to the disk and the website is recreated, preserving the original hyperlink structure of the website(s)


A cache file is created for every project


The cache file is a binary file containing the data &
metadata of all files downloaded in a project


The cache file is written using the Python pickle protocol


The cache file consists of the url, the timestamp (at the web server), the location on disk and the actual content of all files downloaded during a project.


Caching allows the program to download only the urls that have been modified (at the web server) when a project with a cache is re-run.


Caching can be turned on/off.
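The caching idea in a minimal sketch: a url is re-downloaded only when its timestamp at the web server differs from the one stored in the cache. The cache layout shown is illustrative, not HarvestMan's actual format:

    # Sketch of the cache check described above. The cache layout is
    # illustrative, not HarvestMan's actual cache format.
    import pickle

    def load_cache(path):
        # cache: {url: {"timestamp": ..., "location": ..., "data": ...}}
        try:
            with open(path, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return {}

    def needs_download(cache, url, server_timestamp):
        entry = cache.get(url)
        return entry is None or entry["timestamp"] != server_timestamp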

HarvestMan


Threading Architecture


Consists of co-operating 'fetcher' and 'crawler' threads


Fetcher threads do the job of actually
downloading files and saving them to the
disk


Crawler threads do the job of parsing web
page data and extracting the list of urls to
be crawled according to the HarvestMan
rules & limits specified in the configuration
file


HarvestMan


Thread Co-operation


Fetcher and crawler threads co-operate by following the 'producer-consumer' paradigm


HarvestMan uses a symmetric, synergic producer-consumer design pattern


There are two queues for data flow


A data queue which stores raw web
page data (html), and a url queue which stores urls


Fetcher threads obtain their urls from the url queue. They download the urls and save them to the disk. If a url is a web page (html file), its contents are posted to the data queue


Crawlers get their data from the data queue. They parse the html data, get
the new urls and post them to the url queue


Thus fetchers are the consumers of the url queue and producers for the data
queue. Crawlers are consumers of the data queue and producers for the url
queue.


This mutual producer-consumer dependency creates a symmetric and synergic data flow


Apart from these thread types, there are additional 'worker' or 'slave' threads to which fetcher threads can delegate the actual job of downloading files from urls.
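A minimal sketch of this symmetric producer-consumer arrangement with two queues; fetching and parsing are stubbed out, and the real threads also save files, apply rules and handle shutdown:

    # Minimal sketch of the symmetric producer-consumer pattern:
    # fetchers consume the url queue and feed the data queue; crawlers
    # consume the data queue and feed the url queue.
    import queue
    import threading
    import time

    url_queue = queue.Queue()    # urls waiting to be downloaded
    data_queue = queue.Queue()   # raw web-page data waiting to be parsed

    def parse_links(page):
        return []                # stub: extract and filter new urls

    def fetcher():
        while True:
            url = url_queue.get()              # consume the url queue
            page = "<html>%s</html>" % url     # stub: download + save to disk
            data_queue.put((url, page))        # produce for the data queue

    def crawler():
        while True:
            url, page = data_queue.get()       # consume the data queue
            for child in parse_links(page):
                url_queue.put(child)           # produce for the url queue

    for target in (fetcher, crawler):
        threading.Thread(target=target, daemon=True).start()
    url_queue.put("http://example.com/")
    time.sleep(1)                # let the threads run briefly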

HarvestMan


Flow of information



[Diagram: Crawler threads and fetcher threads linked by two queues. Fetchers get urls from the url queue, download the files and save them to disk, and post web-page data to the data queue; crawlers get web-page data from the data queue, parse it, and post the extracted urls to the url queue. Caption: Symmetric/Synergic Producer-Consumer threading paradigm]

HarvestMan


Modular Design


HarvestMan is designed in a modular fashion, each module doing a specific task.


This facilitates greater re-use of the program's code for other projects, such as EIAO.


HarvestMan can be used as a general framework/library for web crawling; application-specific functionality can be plugged in by writing your application code in Python and hooking it into the right module of HarvestMan (see the sketch below)
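A hypothetical sketch of such a plug-in: subclass a crawler class and override a per-page hook. The class and method names are illustrative, not actual HarvestMan API:

    # Hypothetical sketch of plugging application code into the
    # framework: subclass a crawler class and override the hook that
    # sees every downloaded page. Names are illustrative only.
    class CrawlerBase:                       # stand-in for a framework class
        def process_page(self, url, data):
            pass                             # default: no extra processing

    class AccessibilityCrawler(CrawlerBase):
        def process_page(self, url, data):
            # application-specific check on every downloaded page
            if "<img" in data and "alt=" not in data:
                print("possible missing alt text:", url)

    AccessibilityCrawler().process_page("http://example.com/", "<img src='x.png'>")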

HarvestMan


Modules (As of version 1.4.1)


crawler.py: Code for the fetcher/crawler threads

urlqueue.py: Module containing the data/url queues

urlthread.py: Code for the worker/slave threads

pageparser.py: High-level web-page parser

urlparser.py: Module to parse urls, get information about them and construct the local filenames for the urls; also handles relative urls

config.py: Holds all configuration options and maintains the state of the program

connector.py: Network connection configuration, management and url downloads

rules.py: Applies the HarvestMan scoping, limit, control and filter rules to urls to decide whether to download them or not

datamgr.py: Manages download requests from the different fetcher threads, downloads the urls and maintains state information for downloaded urls such as the cache, url download status, statistical information etc.

utils.py: A collection of utility functions and classes

xmlparser.py: Parses the HarvestMan xml config file

htmlparser.py: HTML parser module, borrowed from the Python library and customized for HarvestMan

robotparser.py: Manages robots.txt rules; borrowed from the Python library and customized for HarvestMan

strptime.py: Pure-Python strptime module used to write timestamps of downloads into the cache files

common.py: Functions that don't fit anywhere else, placed in the global namespace

harvestman.py: Main application module

HarvestMan in EIAO


Crawler component of the EIAO ROBACC


Crawls the web, obtaining URLs from the URL repository

Applies scoping rules according to a scoping scheme in the repository databases to limit the number of links crawled

Stores downloaded files and HTTP headers in the local repository

Version 1.4.1 is being used for EIAO. It adds a few
new features


XML configuration option


Advertisement (Junk url) filter


A few performance enhancements

HarvestMan


EIAO extensions


Persistency extensions


Ability to load urls from a database repository (currently read from the config file); see the sketch after this list


Ability to save files and metadata such as HTTP headers to a database repository (currently saved to the file system)


Ability to load a scoping schema from a database repository (currently specified as rules in the config file)


Most of the changes in datamgr.py module


Url scoping extensions


Temporal scoping


Content-aware scoping


Scoping rules should be dynamically modifiable


Most of the changes to rules.py module


Scheduling extensions


A url scheduling extension/modification to the current best-effort priority queue of urls. This is to support temporal scoping
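As a sketch of the persistency extensions, loading start urls from, and saving metadata to, an sqlite3 database from the standard library; the table layout is hypothetical:

    # Sketch of the planned persistency extensions using sqlite3.
    # The table layout here is hypothetical.
    import sqlite3

    def load_start_urls(db_path):
        # read start urls from a database repository instead of the config file
        conn = sqlite3.connect(db_path)
        try:
            return [u for (u,) in conn.execute("SELECT url FROM start_urls")]
        finally:
            conn.close()

    def save_metadata(db_path, url, headers, location):
        # save a downloaded file's metadata (e.g. HTTP headers) to the repository
        conn = sqlite3.connect(db_path)
        with conn:
            conn.execute(
                "INSERT INTO downloads (url, headers, location) VALUES (?, ?, ?)",
                (url, str(headers), location))
        conn.close()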






HarvestMan


Plans for a Distributed version


Use multiple instances of the crawler running in different
machines


Scale out using multiple co-operating crawler and fetcher instances on multiple machines, instead of the current scale-in architecture with multiple threads in the same process


Use a master-slave kind of distributed architecture, with a master crawler running on a central server and slave fetchers running on slave machines


The crawler instance is a process which performs the job of
the existing crawler threads


Fetcher instances are slave processes which perform the job of the existing fetcher threads


Communication is via distributed message queues


Rules are loaded from a central repository which can be modified over time.

HarvestMan


Distributed Operation


Master (crawler instance) downloads the starting url, parses it and
gets the new urls


The new urls are sent to the distributed url queue


Fetcher instances are started up on slave machines configured for this purpose


Fetchers wait at the url queue and get the new urls. They download
the urls, save data/metadata to a central repository


Web-page data is posted to a distributed data queue by the fetchers


The crawler instance gets web-page data from the data queue, parses it and gets the new urls


It then loads the scoping rules from a repository, applies them to the urls and filters out urls that don't satisfy the scoping scheme


Urls which pass the test are posted to the distributed url queue


The process continues...
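The master loop described above, as a minimal sketch against two abstract distributed queues (e.g. proxies to the queues on the next slide); parse_urls and the rules object are placeholders:

    # Sketch of the master (crawler) loop described above. The queue
    # arguments are stand-ins for distributed message queues, and
    # parse_urls / load_rules are placeholder helpers.
    def parse_urls(page):
        return []                    # stub: extract urls from web-page data

    class ScopingRules:
        def allows(self, url):
            return True              # stub: apply the scoping scheme

    def load_rules(repository):
        return ScopingRules()        # stub: load rules from the repository

    def master_loop(url_queue, data_queue, repository):
        while True:
            url, page = data_queue.get()       # web-page data from fetchers
            rules = load_rules(repository)     # rules are modifiable over time
            for u in parse_urls(page):
                if rules.allows(u):
                    url_queue.put(u)           # urls that pass go to the fetchers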

HarvestMan


Distributed version architecture


Currently no code, only a plan!


A basic proof of concept implementation can be done
using Python Remote Objects (Pyro) as the distributed
computing middleware


Pyro provides a very simple RPC framework for
distributed Python programs


It supports a master/slave architecture


Allows fast and easy porting of non-distributed code to a basic distributed prototype


Written in pure Python; no external dependencies


Can also take a look at using tuple spaces


PyLinda provides a framework for this.
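A proof-of-concept sketch of the distributed url queue served over Pyro. This uses the later Pyro4 API (the 2005-era Pyro API differed) and is illustrative only, not HarvestMan code:

    # Proof-of-concept sketch of a distributed url queue served over
    # Pyro4. Illustrative only; the 2005-era Pyro API differed.
    import queue
    import Pyro4

    @Pyro4.expose
    class UrlQueue(object):
        """Url queue shared by the master crawler and slave fetchers."""
        def __init__(self):
            self._q = queue.Queue()

        def put(self, url):
            self._q.put(url)

        def get(self):
            return self._q.get()

    daemon = Pyro4.Daemon(host="localhost")
    uri = daemon.register(UrlQueue(), objectId="urlqueue")
    print("url queue served at", uri)
    daemon.requestLoop()        # serve requests until interrupted

A slave fetcher would then obtain the queue with Pyro4.Proxy(uri) and call get()/put() on it as if it were a local object.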

HarvestMan


Plans for EIAO


EIAO version 1.0 (Proof of concept)


Write the persistency, scoping and
scheduling extensions


Add any more plugins as needed


EIAO version 2.0


Use the distributed HarvestMan architecture to distribute crawling tasks more efficiently across multiple machines in a cluster


Allow distributed fetching of scoping rules
from repositories


This should fit in with the EIAO performance requirements at that time.


HarvestMan


Framework for creating web accessibility applications


Pluggable design, hence very customizable


Can be used as a framework for developing applications with very specific processing capabilities on top of the basic web crawling provided by HarvestMan


Suitable for university courses on web mining or web accessibility application development

Questions?





THANK YOU!



Anand B Pillai

abpillai@gmail.com

References


HarvestMan web crawler


http://harvestman.freezope.org


Pyro (Python Remote Objects)


http://pyro.sourceforge.net