Hui Zhang Jon Dunn - Indiana University Digital Library Program

Feb 5, 2013 (4 years and 4 months ago)

Overview of IU Digital Collections Search

Hui Zhang

Jon Dunn

Indiana University Digital Library Program


IU Digital Library Brown Bag

October 19, 2011


Outline


Introduction and motivation (Jon)

Demo (Jon)

Technical implementation (Hui)

Next steps and future work (Jon)


Why cross-collection search?


Support discovery across multiple
content formats, collections, and
repositories at IU


Use cases:


Multiple formats/collections within a single
thematic grouping (e.g. Hoagy Carmichael)


Show off the richness and diversity of IU's
digital collections (PR; see open.iu.edu)


Find digital content at IU for teaching or
research use

Digital collections evolution: Discrete collection web sites

Digital collections evolution: Services

METS Navigator

Archives Online

PhotoCat

Video Streaming Service

Variations

Digital collections evolution: Services


Advantages


Can develop workflows for content ingestion and
description that are both optimized and scalable


Content stored in a common repository (Fedora)


Can develop discovery interfaces optimized for
particular content (e.g. images vs. music)


Common services to expose content into other
platforms (e.g. Google)


Disadvantages



"Siloing" discovery by content type can be an
issue


Cross-collection search: First iteration


Only selected collections with metadata in
Fedora


Includes Archives Online and most image collections


Not video streaming, Variations, encoded text,
IUScholarWorks, various "legacy" collections



Metadata only (MODS)


Stored natively as MODS in Fedora


Disseminated on the fly from other formats
(PhotoCat2)


Transformed via XSLT from EAD (Archives Online)




Cross-collection search: First iteration


Demonstration

Challenge: Item-level records from EAD

Apache Solr Overview


A Java-based web application and open source
search server, with Apache Lucene at its core


Demonstration


Solr vs. relational database


Pros: full-text search, text analysis, flexible fields


Cons: no relational operations on fields


Solr vs. Lucene


Pros: web application, centralized configuration,
faceting


Cons: security concerns, slower
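
Because Solr is a web application, a faceted search is just an HTTP request. The sketch below only builds such a request URL, using Solr's standard `q`, `rows`, `wt`, `facet`, and `facet.field` parameters. The host, port, and core name (`dcs`) are assumptions, not taken from the slides; the facet field names are borrowed from the sample document later in the deck.

```python
from urllib.parse import urlencode

# Hypothetical Solr select URL; the slides do not give the real one.
SOLR_SELECT_URL = "http://localhost:8983/solr/dcs/select"


def build_facet_query(text, facet_fields, rows=10):
    """Return a Solr select URL for a keyword query with field facets."""
    params = [
        ("q", text),          # full-text query string
        ("rows", str(rows)),  # number of results to return
        ("wt", "json"),       # ask for a JSON response
        ("facet", "true"),    # turn faceting on
    ]
    # facet.field may be repeated, once per field to facet on.
    params.extend(("facet.field", name) for name in facet_fields)
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_facet_query(
    "medical students",
    ["type_of_resource_facet", "genre_facet"],
)
print(url)
```

Fetching that URL from a running Solr instance would return the matching documents along with value counts for each facet field.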








Solr Schema and Configuration


Schema: specifies how the index is built


field, field type


dynamicField, copyField, uniqueKey


Text analysis: stop words, stemming, synonyms,
tokenization


Configuration: settings for Solr itself, query handling,
and data import
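
To illustrate how dynamicField patterns let a schema accept fields it never names explicitly, here is a rough Python sketch of suffix-based field resolution. It is a simplification, not Solr's actual matching logic. The patterns `*_t` and `*_facet` and the type names are assumptions, inferred from the field names in the sample document later in the deck rather than from the real schema.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: explicit fields plus suffix-based dynamic fields.
EXPLICIT_FIELDS = {"id": "string", "year": "int"}
DYNAMIC_FIELDS = [("*_t", "text"), ("*_facet", "string")]


def resolve_field_type(name):
    """Return the field type a name would get, or None if nothing matches."""
    # Explicitly declared fields take precedence over dynamic patterns.
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    # Otherwise fall back to the first dynamic pattern that matches.
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None


for name in ["id", "title_t", "genre_facet", "unknown"]:
    print(name, "->", resolve_field_type(name))
```

Under this kind of convention, a tokenized `_t` field supports keyword search while an untokenized `_facet` copy of the same value keeps facet labels intact.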

Converting MODS to Solr XML


Solr XML


<add><doc><field>…</field>…</doc></add>


Can simply be POSTed into the Solr index


Translation of MODS to Solr XML


Use XSLT


Called by the indexing program


Extract facet values


Format: mods:typeOfResource


Collection: customized based on item's Fedora
PID

<add>

<doc>

<field name="id">iudl:10000</field>

<field name="title_t">Women Medical Students</field>

<field name="name_t">Photographic Services, Photographer</field>

<field name="name_facet">Photographic Services</field>

<field name="subject_topic_t">Medical students</field>

<field name="subject_topic_facet">Medical students</field>

<field name="subject_city_t">Bloomington</field>

<field name="subject_city_facet">Bloomington</field>

<field name="subject_state_t">Indiana</field>

<field name="subject_state_facet">Indiana</field>

<field name="type_of_resource_t">still image</field>

<field name="type_of_resource_facet">still image</field>

<field name="genre_t">Photographs</field>

<field name="genre_facet">Photographs</field>

<field name="w3c_taken_date">04-13-1956</field>

<field name="year">1956</field>

<field name="item_id">P0028020</field>

<field name="coll_id_mods">/archives/photos/</field>



</doc>

</add>
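
The slides describe doing the MODS-to-Solr translation with XSLT. As a rough illustration of the same mapping, the sketch below uses Python's ElementTree instead; it is a stand-in, not the DLP stylesheet. The MODS record is a made-up fragment modeled on the sample document above, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Small made-up MODS record, loosely modeled on the sample document above.
SAMPLE_MODS = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo><mods:title>Women Medical Students</mods:title></mods:titleInfo>
  <mods:typeOfResource>still image</mods:typeOfResource>
  <mods:genre>Photographs</mods:genre>
</mods:mods>
"""


def add_field(doc, name, value):
    """Append <field name="...">value</field> to a Solr <doc> element."""
    field = ET.SubElement(doc, "field", {"name": name})
    field.text = value


def mods_to_solr_xml(pid, mods_xml):
    """Map a few MODS elements to a Solr <add><doc> message."""
    mods = ET.fromstring(mods_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    add_field(doc, "id", pid)

    title = mods.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
    if title:
        add_field(doc, "title_t", title.strip())

    # Emit a searchable (_t) and a facet (_facet) copy of each value.
    for path, base in [("mods:typeOfResource", "type_of_resource"),
                       ("mods:genre", "genre")]:
        for element in mods.findall(path, MODS_NS):
            value = (element.text or "").strip()
            if value:
                add_field(doc, base + "_t", value)
                add_field(doc, base + "_facet", value)

    return ET.tostring(add, encoding="unicode")


solr_xml = mods_to_solr_xml("iudl:10000", SAMPLE_MODS)
print(solr_xml)
```

The collection facet is left out here because the slides say only that it is derived from the item's Fedora PID, without giving the rule.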

Solr Indexing


Carried out by two Java programs running under
DLP's Fedora Index Service framework


The service can be invoked by a RESTful
HTTP request; Solr indexing is triggered
based on conditions specified in the
properties file


The MODS records are extracted from the
Fedora repository (natively stored) or
generated by the getMODS disseminator
(PhotoCat2 collections)
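
The last step of indexing, sending a Solr XML message to the index, can be sketched in a few lines. The DLP programs are written in Java; this Python version only illustrates the HTTP side. The update URL (host, port, and core name `dcs`) is an assumption, and the request is built but not sent so the sketch runs without a Solr server.

```python
import urllib.request

# Hypothetical update URL; the slides do not name the real Solr endpoint.
SOLR_UPDATE_URL = "http://localhost:8983/solr/dcs/update?commit=true"


def build_update_request(solr_xml):
    """Build (but do not send) an HTTP POST carrying a Solr XML message."""
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=solr_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


request = build_update_request(
    '<add><doc><field name="id">iudl:10000</field></doc></add>'
)
print(request.get_method(), request.full_url)

# To actually index, send it against a running Solr instance:
# urllib.request.urlopen(request)
```

The `commit=true` parameter asks Solr to make the added documents searchable right away; a batch indexer would more likely commit once at the end of a run.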

Overview of Blacklight


An open source project developed for
libraries, with many potential uses:


As a library catalog


As the discovery interface to a digital repository


Optimized to handle diverse content
(facet browsing)


Originally developed at the University of
Virginia; has a growing community of active
contributors and users


Now part of the Hydra Project


Written in Ruby, runs on Rails, requires
Solr

Customize Blacklight for DLP
Collections


Integrate Blacklight with MODS-based index


Blacklight by default expects MARC fields


New functions and features


Render thumbnail in result view


Use collection website as the landing page


Style and layout


Standard IU banner and footer


Color, font, and window size


Future Improvements


Automatic update of Solr index


Fedora repository notifies the Solr
indexing program via JMS about item
updates


Include full-text content


It is challenging to have full-text content and
metadata in one index


Optimize the indexing and search algorithms


Search against full text and use metadata as
facets

Future Improvements (cont'd)


Add more collections


Other collections from Fedora


Non-Fedora DLP collections


Archives of Institutional Memory


IUScholarWorks Repository?


IUPUI Digital Collections (ContentDM)?


Conduct usability evaluation


Explore integration w/ new Blacklight-based
discovery layer for IUCAT


Variations on Video IMLS grant


Hydra/Blacklight-based discovery on PBCore

Questions?


Beta:

http://webapp1.dlib.indiana.edu/dcs/



Send comments to:

diglib@indiana.edu