Open Inside: The Open Source Tools that Power Archive-It

Archive-It Partners 2009

Gordon Mohr, Internet Archive

November 4, 2009

Archive-It Unifies Many Tools

Archive-It: managing, designing, monitoring, scheduling, reporting


Integrated Tools: collecting, storing, displaying, searching

Open Source & Standards from IA


3 open source software projects


Heritrix: collecting


Wayback: displaying


NutchWAX: searching


1 co-developed ISO standard


WARC File Format: storing

Open Source from Elsewhere


Linux


Apache/Tomcat


MySQL


Lucene-Nutch-Hadoop

Why Open Source?


Open Source Initiative says:


Open source is a development method for software that harnesses the
power of distributed peer review and transparency of process. The
promise of open source is better quality, higher reliability, more
flexibility, lower cost, and an end to predatory vendor lock-in.




More than access to source code:


Right to change, reuse, extend


Wins:


Harmonize formats, practices


Avoid duplication of effort


Reduce costs

Projects Genesis: 2003


Internet Archive wanted more control over its
own software & collections


Discussions with national libraries

USA, Canada, UK, France, Iceland, Sweden, Norway, Finland,
Denmark, Italy, Australia


Desire to share tools, formats, experiences

avoid duplicated effort, closed & inflexible tools


Formed:

International Internet Preservation Consortium (IIPC)



http://www.netpreserve.org

Heritrix

What is Heritrix?


Open-source


Extensible


Web-scale


Archival-quality


Web crawling software


http://crawler.archive.org

Heritrix Motivations


Deeper, specialized, in-house crawling


Open source


Encourage collaboration on features and best
practices


Avoid duplication of work, incompatibilities


Archival-quality


Perfect copies


Keep up with changing web


Meet evolving needs of Internet Archive and
International Internet Preservation Consortium

Heritrix Overview


Heritrix means "heiress"


Java, modular


Project website:

http://crawler.archive.org


News, downloads, documentation, issue-tracking


Sourceforge: open source hosting site


Source-code control (SVN)


Official downloads



Lesser GPL or Apache license: easy reuse


Outside contributions welcome

Milestones


1.0 release in March 2004


Major releases since:


1.2 new scope options (2004)


1.4 improved memory use (2005)


1.6 remote control (2005)


1.8 scaling (2006)


1.10 protocols, formats, fixes (2006)


1.12 "smart" duplicate reduction (2007)


2.0 "smart" prioritization (2008)


1.14 WARC, performance (2008-2009)

Archive-It Uses Heritrix 1.14.3+


AKA 1.15.4



WARC/1.0


Many minor fixes


Same as all contract/national crawls


Available as developer build


Will become 1.14.4

Heritrix: future


Next major release: Heritrix 3.0


Crawl configuration by Spring



Scriptable configuration


Web-service remote control


Other upcoming priorities



"Smart" continuous/automatic revisits (3.2)
(from change detection to prediction)


Rich media improvements


Spam/trap/mirror suppression


Automate ever-larger crawls

Heritrix: more info


Project website


http://crawler.archive.org


Source code


Sourceforge SVN



Discussion


http://tech.groups.yahoo.com/group/archive-crawler/


Issues/Bugs


http://webarchive.jira.com/browse/HER


Key IA staff:


Steve Sisney, Gordon Mohr


Wayback

What is Wayback?

Open Source

Java

Modular

Scalable

Customizable

Web Archive Access Tool

http://archive-access.sourceforge.net/projects/wayback

Wayback: the beginning


Inception in 2005


Aim: URL-based browsing "as if" at previous dates


Contrasts with classic:


Open source, diverse installs


Java vs. Perl/C


Refactored:


Many extension points


Basis for new features & experiments


First release: 0.2.0, December 2005

Now at 1.4.2 (July 2009)

Wayback Features


Starting with a URL:


See list of captures by date


See extension URLs (same site)


View a capture


Once browsing ("replay"):


Browse web "as it was"


Best-match clickthroughs
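
One way to picture a best-match clickthrough is as a nearest-timestamp lookup: when a link in a replayed page is followed, the capture of the target URL closest to the date being browsed is shown. The sketch below is illustrative only and is not Wayback's code. It assumes captures are identified by 14-digit `YYYYMMDDhhmmss` timestamps; the function name `closest_capture` and the sample data are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # assumed 14-digit timestamp, e.g. 20091104120000


def closest_capture(captures, wanted):
    """Return the capture timestamp nearest in time to `wanted`.

    `captures` is a sorted list of 14-digit timestamp strings for one URL.
    Returns None when there are no captures; ties go to the earlier capture.
    """
    if not captures:
        return None
    target = datetime.strptime(wanted, TS_FORMAT)
    i = bisect_left(captures, wanted)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = captures[max(i - 1, 0):i + 1]
    return min(
        candidates,
        key=lambda ts: abs(datetime.strptime(ts, TS_FORMAT) - target),
    )


# Hypothetical captures of a single URL.
captures = ["20050101000000", "20070601000000", "20091104000000"]
best = closest_capture(captures, "20070101000000")
```

Keeping the list sorted means each lookup needs only a binary search plus a comparison of two neighbours, which matters when a URL has many captures.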

Wayback: Modular Components


Query User Interface: Calendar, Search Engine, XML


Replay User Interface: Archival URL, Timeline, Proxy


Resource Index: CDX, BDB, Remote, Nutch, Aggregated


Resource Store: Local ARC, HTTP 1.1 Remote ARC
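
In Archival URL replay, the capture date and the original URL both travel in the path of the replay URL. The sketch below assumes a `<prefix>/<14-digit timestamp>/<original URL>` layout, as seen on the public Wayback Machine; the prefix `http://wayback.example.org/wayback` and both helper names are hypothetical, and a real deployment may configure this differently.

```python
import re

# Assumed layout: <prefix>/<YYYYMMDDhhmmss>/<original URL>
ARCHIVAL_URL = re.compile(r"^(?P<prefix>.*?)/(?P<timestamp>\d{14})/(?P<original>.+)$")


def make_archival_url(prefix, timestamp, original_url):
    """Build a replay URL that embeds the capture date and original URL."""
    if not re.fullmatch(r"\d{14}", timestamp):
        raise ValueError("timestamp must be 14 digits (YYYYMMDDhhmmss)")
    return f"{prefix.rstrip('/')}/{timestamp}/{original_url}"


def parse_archival_url(url):
    """Split a replay URL back into (timestamp, original URL)."""
    match = ARCHIVAL_URL.match(url)
    if match is None:
        raise ValueError(f"not an archival URL: {url}")
    return match.group("timestamp"), match.group("original")


replay = make_archival_url(
    "http://wayback.example.org/wayback/",  # hypothetical install prefix
    "20091104000000",
    "http://example.com/page",
)
```

Because the date is part of the URL, links rewritten this way keep the reader at the same point in time while browsing.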

Archive-It Uses Wayback 1.4.2+


UI customized


Adds server-side rewriting mode


Available from project source control


Next major release: 1.6.0

Wayback: more info


Website



http://archive-access.sourceforge.net/projects/wayback/



Source code



Sourceforge SVN



Discussion


https://lists.sourceforge.net/lists/listinfo/archive-access-discuss


Issues/Bugs



https://webarchive.jira.com/browse/ACC


Key IA staff:



Brad Tofel


NutchWAX

What is NutchWAX?

Open Source

Java

Full-Text Indexing

End-User Querying


for Web Archives

Built on Lucene/Nutch/Hadoop

http://archive-access.sourceforge.net/projects/nutch

NutchWAX Background


Lucene


Open-source Java full-text indexing


Popular, mature


Nutch


Extensions to Lucene


For web content, access, scale


Hadoop


Spun off from Nutch


Inspired by Google's Map-Reduce
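
For readers new to the Map-Reduce model: a job is split into a map step that emits key/value pairs and a reduce step that combines all values sharing a key. The single-process sketch below builds a tiny inverted index in that style. It uses plain Python rather than any Hadoop or Nutch API, and every name in it is hypothetical.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct word."""
    for term in sorted(set(text.lower().split())):
        yield term, doc_id


def reduce_term(term, doc_ids):
    """Reduce step: combine every doc id emitted for one term."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    """Run map, group by key (the "shuffle"), then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, value in map_doc(doc_id, text):
            grouped[term].append(value)
    return dict(reduce_term(term, ids) for term, ids in grouped.items())


# Hypothetical two-document collection.
index = build_index({
    "a": "web archive tools",
    "b": "open source web crawler",
})
```

In a real cluster the map and reduce calls run on many machines and the grouping happens over the network; the division of work is the same.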

NutchWAX


Inception in 2005


Nutch Web Archive eXtensions


Utilities for using (W)ARCs as Nutch input


Configuration for date dimension


Handle repeated URLs


First release: 0.2.1, July 2005


Now at 0.12.8 (September 2009)
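
The date dimension and repeated URLs go together: an archive may hold many captures of one URL, so the capture date has to be part of each indexed document's identity. The sketch below illustrates that idea with a toy in-memory index keyed by `(url, capture date)`. It is not NutchWAX code; the class name `ArchiveIndex`, its methods, and the sample data are hypothetical.

```python
from collections import defaultdict


class ArchiveIndex:
    """Toy full-text index where a document is a (url, capture date) pair."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> {(url, date), ...}

    def add(self, url, date, text):
        """Index one capture; the same URL may be added for many dates."""
        for term in text.lower().split():
            self._postings[term].add((url, date))

    def search(self, term, start=None, end=None):
        """Return matching captures, optionally limited to a date range."""
        hits = self._postings.get(term.lower(), set())
        return sorted(
            (url, date)
            for url, date in hits
            if (start is None or date >= start) and (end is None or date <= end)
        )


idx = ArchiveIndex()
idx.add("http://example.com/", "20050101000000", "old news page")
idx.add("http://example.com/", "20090101000000", "fresh news page")
```

Searching for `news` returns both captures of the same URL, and passing `start` or `end` narrows the results to one period.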

Archive-It Uses NutchWAX 0.12.8


Latest official release


Recent changes driven by Archive-It


Caching support


Index maintenance processes (merging)



"Reboost" for reranking


NutchWAX: more info


Website



http://archive-access.sourceforge.net/projects/nutchwax/



Source code



Sourceforge SVN



Discussion


https://lists.sourceforge.net/lists/listinfo/archive-access-discuss


Issues/Bugs



https://webarchive.jira.com/browse/WAX


Key IA staff:



Aaron Binns

WARC

What is WARC?

IIPC

ISO

Standard

Flexible

Simple

Format for Web Archive Files

http://tinyurl.com/2eusle (drafts)

WARC Overview


WARC = Web ARChive file format


Next generation of ARC, called for by
IIPC


ARC format created by the Internet Archive


Over 1PB of ARCs gathered since 1996

WARC Goals


Store arbitrary metadata (e.g., subject
classifier, discovered language, encoding)


Data compression and record integrity


Store all control information from the
harvesting protocol (e.g., request headers)


Store the results of data migrations


Store a duplicate detection event


Distinguishable from the legacy ARC


Globally unique record identifiers


Deterministic handling of long records (e.g.,
truncation, segmentation).
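
The duplicate-detection goal can be illustrated with a small sketch: when a URL is fetched again and its payload is unchanged, a crawler can note a revisit instead of storing the content a second time. The code below assumes duplicates are detected by comparing SHA-1 payload digests per URL, written as `sha1:` plus a Base32 string. The function names and the in-memory `seen` map are hypothetical, and this is not Heritrix's implementation.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Label a payload with its SHA-1 digest in Base32, e.g. 'sha1:...'."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def choose_record_type(seen, url, payload):
    """Return ('revisit' or 'response', digest) for a freshly fetched payload.

    `seen` maps each URL to the digest of its most recently stored payload.
    """
    digest = payload_digest(payload)
    if seen.get(url) == digest:
        return "revisit", digest  # unchanged: record the event, skip the content
    seen[url] = digest
    return "response", digest  # new or changed: store the full response


seen = {}
first = choose_record_type(seen, "http://example.com/", b"hello")
second = choose_record_type(seen, "http://example.com/", b"hello")
```

Here the first fetch is stored as a full response and the identical second fetch is marked as a revisit.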

ARC vs. WARC


Both are a simple sequence of content
blocks, each introduced by a small text
header


ARCs only 1-line header + protocol response


WARCs add:


multi-line header with extensible fields


New record types:


Request, Response, Resource


Metadata, Revisit, Conversion, Warcinfo, Continuation
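
To make the contrast concrete, the sketch below builds both kinds of record header: a single space-separated line in the ARC style (URL, IP address, date, content type, length), and a WARC/1.0 header made of a version line plus named fields. The field names follow WARC/1.0; the helper names and all sample values are hypothetical, and the sketch omits compression and the other record types.

```python
import uuid
from datetime import datetime, timezone


def arc_header(url, ip, when, content_type, length):
    """One-line ARC-style header: five space-separated fields."""
    return f"{url} {ip} {when:%Y%m%d%H%M%S} {content_type} {length}\n"


def warc_record(record_type, url, when, content_type, block, extra=None):
    """Serialize one WARC/1.0 record: version line, named fields, block."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),  # globally unique id
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", url),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    headers.extend((extra or {}).items())  # extensible: more named fields
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    return head.encode("utf-8") + block + b"\r\n\r\n"


when = datetime(2009, 11, 4, 12, 0, 0, tzinfo=timezone.utc)
arc_line = arc_header("http://example.com/", "192.0.2.1", when, "text/html", 5)
record = warc_record(
    "response", "http://example.com/", when,
    "application/http; msgtype=response", b"hello",
)
```

New metadata can be added to the WARC record through `extra` without changing the layout, which a fixed one-line header does not allow.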

What does the future hold?

Expand and improve toolset



Driven by user requests, contributions,
sponsors



Unify access tools



Verify and improve internationalization

What does the future hold?

Keep up with the web



New formats, protocols, design techniques



Content challenges:


Deep content


Spam


Interactive applications / AJAX / JavaScript

Thank You

Gordon Mohr

Internet Archive Web Group

gojomo@archive.org

