In this paper, we introduce Peek@u, a webcam search engine that locates and classifies webcam pages and displays query results in a map-based interface. The prototype is available at amlia.cs.washington.edu:4242. We use the open-source search engine Nutch as our backbone.
Finding and classifying webcams is a difficult task. There is a plethora of images on the web, in numerous file types, of varying sizes and behaviors. Webcams themselves have great variety: there is no uniform method for implementation, leading to webcams that are similar in appearance but are implemented with entirely different technologies.
Locating a webcam geographically is equally difficult. Some webcam webpages reference 20 different locations in their content. Others do not have a single word that can be linked to a location. Many pages do contain locational information, but what they lack is precision, often providing only a state or city. Despite these difficulties, we created a search engine that can return meaningful results from this space.
In tackling this project, our first challenge was to understand the webcam space: specifically, which types of webcams exist and which would most commonly be searched. We found that adult webcams were the most common type of webcam on the web, but because the vast majority of them are password protected, they were not our focus. What we chose to focus on was the class of webcams that have everyday utility: traffic, weather, and travel webcams. These webcams are not password protected for the most part, and are also quite numerous; all major cities have many such webcams. We also recognize that there are a large number of webcams that display indoor locations (personal or business webcams), and these are accounted for in our search results as well.
As a result, we built a webcam search engine that crawls the internet and creates a database of webcams. This database can be searched using two different forms of queries. The first is a locational focus: if the query is locational (it contains locational keywords), we fire off a spatial search, looking for results within a geographic radius, and then display them in a location-centered interface. The second is a non-locational focus: if the query does not contain locational keywords, results are generated using the standard keyword matching method, and the interface is similar to the common interfaces seen in other search engines.
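The dispatch between the two query forms can be sketched as follows; the three-state list, class name, and the two return labels are illustrative stand-ins for the real Nutch-backed implementation:

```java
import java.util.Set;

/**
 * Minimal sketch of the query dispatch described above: a query that
 * mentions a locational keyword (here, a US state name) triggers a
 * spatial search with the map interface; anything else falls through
 * to standard keyword matching with a list interface.
 */
public class QueryDispatcher {
    // In the real system this set would cover all US states.
    static final Set<String> STATES = Set.of("washington", "oregon", "california");

    public static String dispatch(String query) {
        for (String word : query.toLowerCase().split("\\s+")) {
            if (STATES.contains(word)) {
                return "spatial";   // spatial search, location-centered interface
            }
        }
        return "keyword";           // standard keyword matching, list interface
    }

    public static void main(String[] args) {
        System.out.println(dispatch("traffic cams Seattle Washington")); // prints spatial
        System.out.println(dispatch("funny office webcam"));             // prints keyword
    }
}
```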
The primary challenges that we faced were classification of a page as a webcam page, association of a webpage with a location, and displaying the yield of the previous two with an interface that was intuitive and information rich.
A parallel pipelined architecture is the overall design theme of Peek@u. Our engine is broken up into five distinct parallel phases, as shown in Figure 1.
In the fetching stage, the Nutch fetcher crawls the web periodically and link analysis is done. When a crawl is complete, a segment is written to the database; we refer to this as a raw segment. When fetched pages are parsed for outlinks, we also parse out the <img> tags (currently only for jpegs) and store them in the database as part of the page.

During image fetching, the <img> tags are read and fetching begins. The images are fetched multiple times to determine if they are changing, and any missing height and width data is filled in. This is logically the first step in classification. A new segment with updated image status is written to the database and the old one deleted.
In classification, the image-checked segments are read and passed through a Naïve Bayes classifier. The criteria that the classifier checks can vary, as described in the 'Classification' section. Pages that have been identified as webcams are written to the database as a new segment.
After image links have been identified as webcams, we pass them through location parsing, searching for city and state names in the US as described in the 'GeoFinder' section. A new segment is written to the database with the labeled latitude and longitude values.
In addition to the standard indexing process, we run our GeoFinder plugin to index latitude and
longitude values for later querying.
Each section of this pipeline is tied together via scripts executed by the cron daemon.
GeoFinder is our module that implements geographical search. The functionality of GeoFinder can be broken down into three parts: the parser, the indexer, and the searcher.

The primary function of the parser is to find locations in web pages and translate them into longitudes and latitudes. Our parser relies on Geocoder for this translation. Geocoder is a project by Dan Egnor, written for the Google programming contest in 2002. It contains an index which it builds from United States census data. Geocoder can query this index for specific street addresses and receive corresponding longitudes and latitudes in return. We realized that webcam pages do not typically have exact street addresses that describe their locations. Thus, we extended Geocoder to accept queries in the form of city and state combinations (e.g. Seattle Washington).
To extract suitable candidates for querying with Geocoder, we implemented the following steps:

1. Scan through the document and compile a list of distinct states (we do this by checking against a file of all US states). In the same scan, also build a list of sequences of capitalized words.
2. For each distinct state, execute our 1-3 algorithm (described below) with the sequences of words and the state.
3. Send all candidate queries to Geocoder, and store successfully returned longitudes and latitudes in a hash map.
4. Choose the 10 highest-occurring longitudes and latitudes from the hash map and associate them with the document.
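The first step, a single scan that records states seen and collects maximal runs of capitalized words, might look like the following sketch; the three-state set and the class name are stand-ins for the full US state file used by the real parser:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of the scanning step: one pass over the document tokens that
 * records any distinct US states seen and collects maximal runs of
 * consecutive capitalized words for the 1-3 algorithm.
 */
public class DocumentScanner {
    // Stand-in for the file of all US states.
    static final Set<String> STATES = Set.of("Washington", "Oregon", "Idaho");

    public static List<List<String>> capitalizedRuns(String[] tokens, Set<String> statesSeen) {
        List<List<String>> runs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String t : tokens) {
            if (STATES.contains(t)) statesSeen.add(t);   // record distinct states
            if (!t.isEmpty() && Character.isUpperCase(t.charAt(0))) {
                current.add(t);                          // extend the current run
            } else if (!current.isEmpty()) {
                runs.add(current);                       // a run just ended
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) runs.add(current);
        return runs;
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        String[] tokens = "our Washington locations are Seattle Bellevue Tacoma".split(" ");
        System.out.println(capitalizedRuns(tokens, seen)); // two runs
        System.out.println(seen);                          // [Washington]
    }
}
```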
The 1-3 algorithm operates on a list of sequences of words and a state to generate queries for Geocoder. From each sequence of words, it generates 1-, 2-, and 3-tuples of consecutive words. We chose 3 because we assumed that city names will be at most 3 words long. We then append the state to the end of these tuples to form query strings. Here is a simulation of the algorithm for this document:

"welcome to our homepage! our Washington locations are: Seattle, Bellevue, Tacoma, and Walla Walla."

Note: we would have identified Washington as a state from the earlier step. Each of the generated strings (e.g. Seattle Washington, Seattle Bellevue Washington, Seattle Bellevue Tacoma Washington, and so on) will be sent to Geocoder for querying. Seattle Washington, Bellevue Washington, Tacoma Washington, and Walla Walla Washington will all return longitudes and latitudes from Geocoder, and thus be associated with the page.
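The tuple generation at the heart of the 1-3 algorithm can be sketched as follows (the class and method names are our own, for illustration):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the 1-3 algorithm: from one sequence of capitalized words,
 * generate all 1-, 2-, and 3-tuples of consecutive words and append
 * the state to each to form candidate Geocoder query strings.
 */
public class OneThreeAlgorithm {
    public static List<String> candidateQueries(List<String> words, String state) {
        List<String> queries = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder tuple = new StringBuilder();
            // Grow the tuple one word at a time, up to length 3.
            for (int len = 1; len <= 3 && start + len <= words.size(); len++) {
                if (len > 1) tuple.append(' ');
                tuple.append(words.get(start + len - 1));
                queries.add(tuple + " " + state);
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> seq = List.of("Seattle", "Bellevue", "Tacoma");
        // Prints all six candidates, from "Seattle Washington" through
        // "Seattle Bellevue Tacoma Washington".
        System.out.println(candidateQueries(seq, "Washington"));
    }
}
```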
The indexer is implemented via a Nutch indexing filter plugin. The filter allows us to add fields to our document which will be indexed by the Nutch indexer. For any given document, we first check to see if there are any longitudes and latitudes associated with it (given by the parser from above). If longitudes and latitudes exist, we add them all to the document.
Like the indexer, our searcher is also implemented via a Nutch plugin, in this case a query filter. The query filter allows us to add required search criteria to the query. We start by looking for US state names in the query. If one is found, we look up the longitude and latitude coordinates of the rectangle that encloses the state. At this point, the searcher also sends a message to the UI, indicating which state map should be displayed. Next, we add range queries to the search that require all resulting pages to have longitude and latitude values that fall within the range of our state rectangle.
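The containment test behind these range queries amounts to a simple bounding-box check; the Washington coordinates below are rough illustrative values, not the ones from our state table:

```java
/**
 * Sketch of the bounding-box test behind our range queries: a page
 * matches only if its labeled coordinates fall inside the rectangle
 * that encloses the queried state.
 */
public class StateBounds {
    public final double minLat, maxLat, minLon, maxLon;

    public StateBounds(double minLat, double maxLat, double minLon, double maxLon) {
        this.minLat = minLat; this.maxLat = maxLat;
        this.minLon = minLon; this.maxLon = maxLon;
    }

    /** True if the page's coordinates lie inside the state rectangle. */
    public boolean contains(double lat, double lon) {
        return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
    }

    public static void main(String[] args) {
        // Approximate rectangle around Washington state.
        StateBounds wa = new StateBounds(45.5, 49.0, -124.8, -116.9);
        System.out.println(wa.contains(47.6, -122.3)); // a point near Seattle: true
        System.out.println(wa.contains(34.0, -118.2)); // a point near LA: false
    }
}
```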
Peek@u used a Naïve Bayes classifier for labeling webcams. When training it, several criteria are available. A generic function parses a series of words and returns the best set of words to use during classification. This can be coupled with image property criteria such as width, height, and aspect ratio, and it allows us to build and test classifiers from different sources, such as body text and alt text. Split points for numeric criteria are found by choosing the split with the best information gain.
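Choosing such a split point by information gain can be sketched as follows; the widths and labels in the example are made up for illustration:

```java
import java.util.Arrays;

/**
 * Sketch of choosing a split point for a numeric image property
 * (e.g. width) by maximizing information gain, as used to discretize
 * criteria for the Naïve Bayes classifier.
 */
public class SplitChooser {
    /** Binary entropy of a pos/neg count, in bits. */
    static double entropy(int pos, int neg) {
        int n = pos + neg;
        if (n == 0 || pos == 0 || neg == 0) return 0.0;
        double p = (double) pos / n, q = (double) neg / n;
        return -p * Math.log(p) / Math.log(2) - q * Math.log(q) / Math.log(2);
    }

    /** Returns the threshold between adjacent values with the highest gain. */
    public static double bestSplit(double[] values, boolean[] isWebcam) {
        int n = values.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(values[a], values[b]));

        int totalPos = 0;
        for (boolean b : isWebcam) if (b) totalPos++;
        double baseEntropy = entropy(totalPos, n - totalPos);

        double bestGain = -1, bestThreshold = values[idx[0]];
        int leftPos = 0;
        for (int i = 0; i < n - 1; i++) {
            if (isWebcam[idx[i]]) leftPos++;
            int left = i + 1, right = n - left, rightPos = totalPos - leftPos;
            double gain = baseEntropy
                - (double) left / n * entropy(leftPos, left - leftPos)
                - (double) right / n * entropy(rightPos, right - rightPos);
            if (gain > bestGain) {
                bestGain = gain;
                // Midpoint between the two adjacent sorted values.
                bestThreshold = (values[idx[i]] + values[idx[i + 1]]) / 2;
            }
        }
        return bestThreshold;
    }

    public static void main(String[] args) {
        double[] widths = {100, 120, 140, 320, 352, 640};
        boolean[] isWebcam = {false, false, false, true, true, true};
        System.out.println(bestSplit(widths, isWebcam)); // prints 230.0
    }
}
```

A perfectly separating split, as in the example, gets the maximum gain; with noisier data the chosen threshold is the one that best reduces label entropy.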
What we might have done differently
We used the default Nutch crawler and link analyzer for our crawls. We seeded the crawler with our positive webcam examples and let it go from there. The crawler found fewer and fewer pages that ended up having webcams on them as time progressed. It also turned out that checking whether an image is changing was a great way of identifying webcams. Our time would probably have been better spent coding a focused crawler than building an in-depth Naïve Bayes filter with many different criteria.

Several different classifiers are ready to be tested. Unfortunately, we left the testing until late, and our server crashed, preventing us from gathering comparison data.
What we did right
Efficient Geographical Searcher
We have successfully implemented an efficient way to conduct geographical searches. Geocoder has full coverage of the United States and can be run locally on our machine. Additionally, our parsing algorithm runs in at most O(3nm) time, where n is the number of capitalized words and m is the number of distinct states. Actual run times are significantly lower, as the capitalized words are rarely connected in one large chain.
Creation of an Innovative Interface
One of the goals from the start was to break the mold that most search engines adhere to when displaying query results. Specifically, we wanted to create something different than simply listing each result, with some accompanying text, in a list down the page. This might be acceptable for basic internet searches, but is lacking for webcam searches. When performing a webcam search, we decided that each result must have a thumbnail of the actual webcam image. The information provided by the summary text pales in comparison to the information a user can gather from the actual webcam image: "a picture is worth a thousand words." In both our interfaces (locational and non-locational) we make an effort to show not only a webcam from the resulting page, but all the webcams on that page (should there be more than one).

To help add a greater scope of information to the locational results page, we include a map of the state that the query was based on, and then place markers on the locations of the resulting webcams (our precision is at the city level). Moving your mouse over a city displays all of the results for that city in a tool tip window; for each hit, we display the webcam image and the title of the webpage, both of which are clickable and will take the user to the webcam page.
Parallel Pipeline Architecture
Our parallel pipeline architecture worked quite well for us. It allowed us to work within the Nutch framework for the most part and take advantage of the existing code. It let us keep cranking out raw crawls and image checking while we modified the code for other stages, as well as stop crawling but keep the latter stages running. In a 'real' environment we could have used the distributed NutchFileSystem and had dedicated machines working on each piece.
What we learned
Without a doubt, we have gained a much clearer picture of how search engines operate (core processes) and an understanding of the difficulties faced when trying to improve a search engine. We learned that the web is conceptually a big, dirty database that can be used in many interesting ways. The web's 'dirtiness' is what makes it such an interesting problem to try and solve. We also learned not to wait until the end to run key tests, and that each group needs its own server.
Proximity Search with More Options
Our current geographical search only filters results by state. It can easily be extended to search at a city or county level. One good improvement would be to allow the user to set the radius of the search. Another improvement would be to order results based on their proximity to the search location.
Options with Google Maps
When first working on the interface for displaying the query results, we envisioned something that was built off of Google Maps, similar to Paul Rademacher's project that displays housing information from Craigslist with a modified Google Maps interface. This proved problematic, because Google Maps is still a Beta product and the stability of the underlying code could not be counted on. Once Google Maps moves to a stable version, there are many interesting possibilities for displaying locational information.

An advantage to using Google Maps is that you can display locational information with a high level of precision. To fully leverage that, more specific locational information in the content of the webcam pages would be needed, or some better way of resolving the location of the webcam (e.g. using image recognition to analyze the position of the sun).
Appendix A: Attribution
- Created the interface
- Gathered maps, charted latitudes and longitudes
- Created marker code
- Created code for the aggregation of results located in the same city (into the same tooltip, as well as the matching of colored markers to the locational results)
- Created code for changing the thumbnail window to display multiple webcams from the same page
- Reworking of Search.jsp
- Created State.java class
- Designed and implemented interface prototypes, with limited user testing
- Handled any of the .jsp work that needed to be done
- Created all custom images
- Created basic content for the Report and Presentation
- Designed the parallel segmented architecture for the project
- Wrote the utilities and made significant modifications to the database classes to support our needs
- Implemented the image fetching stage in a scalable way
- Wrote a flexible Naïve Bayes classifier
- Added optional stemming options (through a library)
- Wrote a generic criteria chooser
- Extended Geocoder to accept city state queries
- Communicated with the author of Geocoder to acquire advice
- Implemented a document parsing algorithm that:
  - Parses documents for location information
  - Extracts longitude and latitude information via Geocoder
- Implemented an indexing filter that allows:
  - Longitude and latitude values to be added as fields to the document
  - Those fields to become indexed by Nutch
- Implemented a query filter to:
  - Extract location information from search queries
  - Construct range queries to check if pages lie within the search box
  - Communicate with the UI to display the correct maps
Appendix B: Other Code
wz_tooltip.js
- Used to display the tooltips (the windows that appear when you hover your mouse over a marker in the locational interface)
- No modification to wz_tooltip.js

State images taken from Google Maps

Image library written by Marco Schmidt
- Used to get height and width information from jpegs

Stemmer written by Martin Porter
- Used for the Porter stemming algorithm during classification

Part of the Naïve Bayes code from a CSE 473 project written by Lee Faris and Don Kim
- Included in distribution

Geocoder
- Used to query city state combinations and receive longitudes and latitudes
- Information about Geocoder at:

GeoPosition
- Is a Nutch plugin that serves a similar function as GeoFinder
- GeoFinder is modeled after GeoPosition in many aspects
- Information regarding GeoPosition at: