Search Engine spamming

keckdonkeyInternet and Web Development

Nov 18, 2013 (3 years and 6 months ago)


















Information Retrieval

Assignment 1







Commercial Systems



Sylvia King 9901516


Table of Contents

Introduction
Crawler-Based Search Engine - Google
Location & Frequency
Advantages of crawler based search engines
Disadvantages of crawler based search engines
Search Engine spamming
Types of Spamming Techniques
Penalties for search engine Spamming
Google
Forecasted growth of Google
Technology used for Google
PigeonRank
Human-Powered Directories - AskJeeves / Yahoo
AskJeeves & Yahoo
Advantages of Directories
Disadvantages of Directories
AskJeeves
Technology used for AskJeeves
AskJeeves & natural language Processing
Yahoo!
Technology used in Yahoo
Hybrid Search Engines Or Mixed Results Engines
Meta Search Engines
Illustration of Metacrawler
Conclusion
Reference















Introduction


Most users are familiar with only a handful of search engines, such as Google, Yahoo, Alta Vista, Lycos, AskJeeves, Excite, and HotBot, although many more exist. The aim of this project is to study three popular search engines, namely Google, Yahoo and AskJeeves. It examines the technologies used in these information retrieval systems and explores why Google has recently been ranked number one. I first discuss the origins of these engines and then review each one independently, comparing their basic retrieval techniques.


Search engines can be divided into three categories: crawler-based, human-powered, and hybrid, which is a mixture of the first two. This report studies the three categories for differences and similarities. Google does what is known in the industry as crawling or spidering the web; a user then searches through what the crawler has found. This report studies the technologies the company uses, namely the PigeonRank system.


The project also discusses human-powered directories, such as AskJeeves and Yahoo, which depend on humans for their listings. In this case, a site owner submits a short description of the entire site to the directory, and a search looks for matches only in the descriptions submitted. Both types of search engine are analysed for effectiveness, looking at the advantages and disadvantages of each.






Crawler-Based Search Engine


Google


Indexers, or crawler-based search engines, collect information automatically: they crawl through the Internet, cataloguing and indexing websites.


These search engines have three main components. The first is the spider, also called the crawler, which visits a web page, reads it, and then follows links to other pages within that site. The spider returns to the site every month or two to check for changes, and any amendments it detects are transferred into the index.


The second is the index, sometimes known as the catalogue. It contains a copy of every web page that the spider finds, so when a web page is amended, the index is updated with the changes. It may take a while for new pages or changes found by the spider to be added to the index, so a web page may have been spidered but not yet indexed [1]. Until it is added to the index, it is not available to those searching with the crawler-based search engine. The third component is the search software, which sifts through the index to find matches to a query and ranks them by relevance.
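The spider and index described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SITE` dictionary is a made-up, in-memory stand-in for a website (nothing is fetched over the network), and `crawl` is a hypothetical helper name, not part of any real search engine.

```python
from collections import deque

# Hypothetical in-memory "website": page name -> (text, links to other pages).
SITE = {
    "home": ("welcome to our travel site", ["flights", "hotels"]),
    "flights": ("cheap flights for travel", ["home"]),
    "hotels": ("hotels near the beach", ["home", "flights"]),
    "orphan": ("no page links here", []),
}

def crawl(start):
    """Visit pages breadth-first from `start`, copying each page's text into the index."""
    index = {}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in index or page not in SITE:
            continue  # already spidered, or a dead link
        text, links = SITE[page]
        index[page] = text    # the index keeps a copy of the page
        queue.extend(links)   # follow links to other pages on the site
    return index

index = crawl("home")
```

A page that nothing links to (here `orphan`) never reaches the index, so it cannot be found by searching.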


Location & Frequency



Crawler-based search engines determine the relevancy of a document to a query by following sets of rules called algorithms. These concentrate on the location and frequency of keywords on a web page. As an analogy, a librarian asked to find a customer books relating to "travel" would probably look first at books with travel in the title; search engines operate in the same manner.


The search engine also checks whether the search keywords appear near the top of a web page, a test known as location. It looks in the headline or in the first few paragraphs of text, on the assumption that any page relevant to the topic will mention the words at the beginning of the document.


Frequency is the other main factor in how search engines determine relevance. The engine analyses how often the search keywords appear in relation to other words on a web page, and pages where they appear with a higher frequency are viewed as more relevant. All the main search engines seem to follow the location and frequency process to some degree, but each adds a little extra technology of its own. No search engine does it quite the same as another, which explains why the same search topic produces different results on different search engines.
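As a rough illustration of the location and frequency idea, the sketch below scores a page for a single keyword. The weights (2.0 for the title, 1.0 for the top of the text) and the `top_words` cut-off are arbitrary assumptions chosen for the example; real engines do not publish their values, and `relevance` is a hypothetical function name.

```python
def relevance(query, title, body, top_words=50):
    """Score a page for a one-word query by keyword location and frequency."""
    term = query.lower()
    title_words = title.lower().split()
    body_words = body.lower().split()
    if not body_words:
        return 0.0
    score = 0.0
    if term in title_words:
        score += 2.0  # location: keyword in the headline (assumed weight)
    if term in body_words[:top_words]:
        score += 1.0  # location: keyword near the top of the text (assumed weight)
    score += body_words.count(term) / len(body_words)  # frequency relative to other words
    return score

pages = {
    "guide": ("Travel guide", "travel tips for cheap travel in europe"),
    "recipes": ("Cooking at home", "easy recipes you can cook before you travel"),
}
ranked = sorted(pages, key=lambda name: relevance("travel", *pages[name]), reverse=True)
```

Changing the weights changes the ordering, which is one simple way to picture why two engines can rank the same pages differently.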


Some search engines index more web pages than others; an article on search engine popularity shows the difference in the volume of indexed pages. How frequently pages are indexed also makes a difference. My findings show that no two search engines have exactly the same collection of web pages to search through.

Ref [1]


Advantages of crawler based search engines



Offers larger searchable databases of web sites.





The full text of individual web pages is often searchable.





Good for searching obscure terms or phrases.





Disadvantages of crawler based search engines




No human quality control to weed out duplicates and junk.





The size of the database can produce unmanageably high numbers of search results.





The search command languages can be complicated and confusing.



Other examples of crawler based search engines would include Alta Vista,
Excite, HotBot, and Magellan.




Search Engine spamming


Search engines are also able to penalize or exclude pages from the index if they detect a technique known as spamming. Spamming is when a word is repeated many times on one page to increase the frequency of that word and so push the page higher in the listings. Search engines watch for spamming methods with various tools, and most also follow up complaints from users.


Types of Spamming Techniques


Keyword stuffing.

This is the repeated use of words to increase their frequency. Search engines analyse pages to determine whether the frequency is above the normal level.

Invisible text.

Webmasters are known to insert keywords toward the bottom of the page, and unscrupulous persons then set the text colour to match the page background. This type of spamming is also detectable by the engines.

Tiny text.

This process is the same as invisible text but with tiny, unreadable text.

Page redirects.

Some engines, especially Infoseek, do not like pages that take the user to another page without his or her intervention, e.g. using META refresh tags, CGI scripts, Java, JavaScript, or server-side techniques.

Meta tag stuffing.

Do not repeat your keywords in the Meta tags more than once, and do not use keywords that are unrelated to your site's content.

Ref [1]
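The keyword stuffing check, where an engine tests whether a word's frequency is above the normal level, might look something like the sketch below. The 20 percent threshold and the minimum page length are assumptions made up for the example, and `stuffed_words` is a hypothetical name; the source does not say what level engines treat as normal.

```python
from collections import Counter

def stuffed_words(text, threshold=0.2, min_words=10):
    """Return words whose share of the page's words exceeds `threshold`."""
    words = text.lower().split()
    if len(words) < min_words:
        return []  # too short to judge
    counts = Counter(words)
    return sorted(w for w, n in counts.items() if n / len(words) > threshold)

spammy = "cheap flights " * 6 + "book now"
normal = "we compare prices on flights and hotels so you can plan a holiday"

flagged_spammy = stuffed_words(spammy)
flagged_normal = stuffed_words(normal)
```

Here the repeated words are flagged on the spammy page, while the normal sentence produces no flags.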



Penalties for search engine Spamming


Search engines penalize pages to different degrees: some refuse to index pages that contain spam, whilst others still index them but rank them lower. In extreme cases the search engine can choose to ban the whole website. The common aim of all search engines is to provide the most accurate and up-to-date pages for their users; spamming clutters their indexes with irrelevant or even misleading information.


Spamming is monitored by directories such as Yahoo and AskJeeves and also by search engines such as Google, which incidentally has introduced a page solely dedicated to reporting sites that use spam. This sits alongside engine-spam.com, a neighbourhood watch program dedicated to catching search engine spammers.







Google


Google is currently number one on the list of search engines and indexes over four billion web pages. It handles both simple and advanced queries, and its user interface is easy to navigate. To search, the user enters a few words about their topic into the text box and hits the search button, and Google finds a list of relevant web pages. Google returns the web pages that contain all the words the user entered. Refining a search is simply a matter of adding more words to the search topic; the new query returns a subset of the pages that Google found for the original search.
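The "all the words" behaviour, and the way adding a word narrows the result, can be illustrated with a small inverted index. This is a generic sketch of the technique, not Google's implementation; `build_index`, `search_all` and the sample pages are hypothetical.

```python
def build_index(pages):
    """Map each word to the set of page names that contain it (an inverted index)."""
    index = {}
    for name, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(name)
    return index

def search_all(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # keep only pages that also have this word
    return result

pages = {
    "a": "cheap travel insurance",
    "b": "travel guide to spain",
    "c": "cheap flights to spain",
}
idx = build_index(pages)
broad = search_all(idx, "travel")
narrow = search_all(idx, "travel spain")
```

Because each extra word can only remove pages from the intersection, `narrow` is always a subset of `broad`.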


Google has an advanced search page that allows the user to search using result filters. The user chooses options such as "with all the words", "with the exact phrase", "with at least one of the words" and "without the words", and the search is narrowed according to their choice. The page also includes filters for date, file format, numeric range, and language, and lets the user expand the list of results from 10 to 100 hits.

Ref [2]
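The four word filters can be pictured as simple tests applied to a page's text. The function below is a hedged sketch with hypothetical names (`advanced_match` and its parameters); the exact phrase test is a plain substring check, which is cruder than what a real engine would do.

```python
def advanced_match(text, all_words=(), exact_phrase="", any_words=(), without_words=()):
    """Return True if `text` passes every filter that has been filled in."""
    lowered = text.lower()
    words = set(lowered.split())
    if any(w.lower() not in words for w in all_words):
        return False  # "with all the words"
    if exact_phrase and exact_phrase.lower() not in lowered:
        return False  # "with the exact phrase" (simple substring test)
    if any_words and not any(w.lower() in words for w in any_words):
        return False  # "with at least one of the words"
    if any(w.lower() in words for w in without_words):
        return False  # "without the words"
    return True

page = "Cheap flights to Spain"
```

For example, `advanced_match(page, all_words=["cheap", "spain"])` accepts the sample page, while adding `without_words=["spain"]` rejects it.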



Forecasted growth of Google

Ref[3]


Technology used for Google


Google's hardware consists of more than 10,000 servers, which index more than 4 billion web documents whilst handling thousands of queries per second with sub-second response times. This section of the report looks at how Google finds the results for queries so quickly.


Google uses a search technology for ranking web pages that was designed by its founders, Larry Page and Sergey Brin, at Stanford University. Its real name is PageRank; "PigeonRank", the name used in this report, comes from a parody of the system that Google published on 1 April 2002 as an April Fools' joke.
Ref [4]


In the PigeonRank account, Google processes its search queries at a speed much greater than traditional search engines by collecting pigeons in dense clusters.


Forecasted growth: internal Google documents include advertising forecasts that have not been publicly disclosed.

According to the documents, Google predicted that the number of advertiser accounts would rise from 280,000 this year (2004) to 378,000 in 2005. From 2004 to 2008, the number of accounts is expected to more than double, to 652,050.

Google expects its advertiser accounts to grow 35 percent between 2004 and 2005; however, it estimates that the growth rate will decline to 15 percent between 2007 and 2008.

Ref [3]
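The forecast figures can be checked against each other with a little arithmetic. The three account numbers and the two percentages come from the text above; the 2007 figure is not given in the source and is only derived here from the stated 15 percent rate.

```python
accounts_2004 = 280_000
accounts_2005 = 378_000
accounts_2008 = 652_050

# Growth from 2004 to 2005, as a fraction of the 2004 figure.
growth_2004_2005 = (accounts_2005 - accounts_2004) / accounts_2004

# Ratio of the 2008 forecast to the 2004 figure ("more than double").
ratio_2004_2008 = accounts_2008 / accounts_2004

# If 2007 -> 2008 growth is 15 percent, the implied 2007 figure is 2008 / 1.15.
implied_2007 = accounts_2008 / 1.15

print(f"{growth_2004_2005:.0%}")
print(f"{ratio_2004_2008:.2f}x")
print(round(implied_2007))
```

The first result confirms the quoted 35 percent growth, the second shows the 2008 forecast is about 2.33 times the 2004 figure, and the third gives an implied 567,000 accounts for 2007.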





PigeonRank


The PigeonRank account, which Google presents tongue-in-cheek, describes the process as follows:

First, a user submits a query to Google.

The query is then routed to what is known as a data coop.

In the data coop, monitors flash result pages at incredible speed.

When a relevant result is located by one of the pigeons in the cluster, it strikes a rubber-coated steel bar, which assigns the page a PigeonRank value of one.

For each peck, the PigeonRank value is increased.

The pages that get the most pecks are prioritised and shown at the top of the user's results page.

The remaining results are displayed in order of this pecking system.
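Taken literally, the steps above reduce to counting votes per page and sorting by the count. The sketch below shows only that tally-and-sort idea; `rank_by_votes` and the sample data are hypothetical, and this is not Google's actual ranking code.

```python
from collections import Counter

def rank_by_votes(votes):
    """Order pages by how many votes (pecks) each one received, most first."""
    tally = Counter(votes)
    return [page for page, _ in tally.most_common()]

pecks = ["page_b", "page_a", "page_b", "page_c", "page_b", "page_a"]
results = rank_by_votes(pecks)
```

With three votes for `page_b`, two for `page_a` and one for `page_c`, the pages come back in that order.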


According to the same account, the PigeonRank method makes it difficult to tamper with results: some websites are known in the industry to have attempted to boost rankings by including images on their pages, but Google's PigeonRank technology is not fooled by such techniques. The graphs below show the efficiencies of the pigeon cluster:


Ref [5]






Human-Powered Directories



AskJeeves / Yahoo


AskJeeves & Yahoo



A human-powered directory, as the name suggests, depends on humans for its listings. Examples are Yahoo and AskJeeves, which are actually directories that depend on humans to collect the data.


Normally an editor compiles all the listings that a directory has. Getting listed with the web's major directories is extremely important because so many people see the listings. A site owner submits a short description of the site to the directory, and a search then looks for matches only in the descriptions submitted.
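The point that a directory search matches only the submitted description, and never the site's own text, can be shown with a small sketch. The entries, the `.example` site names and `directory_search` are all made up for illustration.

```python
directory = [
    # (site, description submitted to the directory, full text of the site)
    ("sunnytrips.example", "Package holidays to Spain", "flights hotels car hire spain portugal"),
    ("bikeshop.example", "Bicycles and repairs", "we also sell travel bags for cycling holidays"),
]

def directory_search(entries, keyword):
    """Match the keyword against submitted descriptions only, never the site text."""
    key = keyword.lower()
    return [site for site, description, _ in entries if key in description.lower().split()]

hits_holidays = directory_search(directory, "holidays")
hits_travel = directory_search(directory, "travel")
```

The bike shop's pages mention both "holidays" and "travel", but neither word is in its description, so it is not returned for either search.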


It is common knowledge in the industry that most services, such as Google, MSN Search, AOL Search and Teoma, offer both search engine and directory information, although they will generally feature the directories as opposed to the websites.


Advantages of Directories




For browsing, when the user is not entirely sure what they are looking for.

When the user is unsure of which keywords to use to find information.

Because these directories use human editors, the general standards are higher than what's found in search engines.

Good for finding commercial sites. This can also be viewed as a disadvantage, as it indicates that non-commercial sites are not as common in directories as they are in engines.

Keyword searches can be used within any category, thus improving efficiency.


Disadvantages of Directories




It could take the user longer to locate a suitable website.

Directories tend to be smaller than search engine databases, and tend to index only the top-level pages of a site.

Because directories are maintained by people rather than by spiders, and because they point to sites rather than compiling databases containing pages, the content of a site or page can change without the directory being updated.

Dead links, which are links that do not go to the pages they are intended to but instead produce an error message, are a problem, as it is up to the human editors to maintain the content of the directory.

Ref [6]














AskJeeves


The AskJeeves search engine was founded in 1996 by David Warthen, a software developer, and Garrett Gruener, the founder of Virtual Microsystems.

AskJeeves has a sister company operating in the UK and Ireland, Ask.co.uk, which is now among the ten most popular search engines in the UK.

Ref [7]



Technology used for AskJeeves


AskJeeves is a human-powered directory search engine known for its ability to interpret natural language queries, and it has now acquired the privately held Teoma Technologies. AskJeeves assists the user with questions that help narrow the search, and it is also known to search up to six other search sites simultaneously for relevant web pages. Teoma is the backbone technology, started by scientists at Rutgers University; it is said to be "the next big thing in search engines". [2]


Teoma places strong emphasis on site popularity in its ranking algorithms. The search engine ranks a site based on what are known as:

Subject-Specific Popularity: the number of web pages about the subject that reference this page.

General Popularity: the number of all web pages that reference this page.
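The two popularity measures differ only in which referring pages are counted. A minimal sketch, assuming each page has already been labelled with a subject (the `web` data and both function names are hypothetical, and how Teoma actually decides a page's subject is not described in the source):

```python
# Hypothetical web: page -> (subject of the page, pages it links to).
web = {
    "p1": ("travel", ["target"]),
    "p2": ("travel", ["target"]),
    "p3": ("cooking", ["target"]),
    "p4": ("travel", ["p1"]),
}

def general_popularity(web, page):
    """Number of all pages that reference `page`."""
    return sum(page in links for _, links in web.values())

def subject_popularity(web, page, subject):
    """Number of pages about `subject` that reference `page`."""
    return sum(page in links for topic, links in web.values() if topic == subject)

general = general_popularity(web, "target")
on_subject = subject_popularity(web, "target", "travel")
```

In this sample three pages link to `target`, but only two of them are about travel, so the subject-specific count is lower than the general one.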


Teoma also presents what are called "communities" of expert sites; these are relevant knowledge hubs that may guide the user through their search. AskJeeves, via Teoma, is said to index over 1.5 billion web pages. Searching AskJeeves is done using the simple or advanced search page.


AskJeeves & natural language Processing


As illustrated in the diagram above, AskJeeves is noted for its use of natural language processing. This technique avoids forcing searchers to use Boolean or other query languages: AskJeeves allows the user to type in a question and uses this question for the search.


AskJeeves uses the Ask natural language processing search algorithm, which goes through the user's question and finds the most relevant words. Other search engines have since moved into the natural language processing field as well.
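One very simple way to pull the most relevant words out of a question is to drop punctuation and common "stop words". The sketch below uses that stand-in technique purely for illustration; the source does not describe how the Ask algorithm actually works, and the `STOP_WORDS` list and `keywords` function are made up.

```python
import string

# A tiny, made-up stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "where", "how", "do", "i",
              "can", "in", "of", "to", "for", "on"}

def keywords(question):
    """Strip punctuation and common words, keeping the remaining words in order."""
    cleaned = question.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS]

terms = keywords("Where can I find cheap flights to Spain?")
```

The question is reduced to the words "find", "cheap", "flights" and "spain", which can then be passed to an ordinary keyword search.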




Yahoo!


Yahoo is the oldest search engine website still in operation, having started in 1994. It has been concentrating on developing a new search engine technology for a few years and now has its own search engine database. Its inventors are David Filo and Jerry Yang, who both studied at Stanford University. During 1994, they customised the Yahoo database to serve the needs of its growing number of users.


Technology used in Yahoo


Yahoo has recently acquired its own search engine, with its own indexing and ranking methods. This move is said to create competition in the industry, starting a new race for first place.


Yahoo was surrounded by speculation as regards the Inktomi index: would Yahoo use it to replace the Google-powered search technology it was originally using? Journal reports indicated that Yahoo had built a new search technology of its own. An article published in February 2004 stated that Yahoo had dropped Google and introduced its new algorithmic search technology, called Yahoo Slurp, which is used for indexing its web pages.

Ref [6]


Yahoo searches start on the main page: typing a description begins a simple search, and clicking the search shortcut links can quickly narrow it. There are also links to yellow pages, weather, news, and various products. The advanced search facility gives the user dropdown filters to guide their search.



Hybrid Search Engines Or Mixed Results Engines


In the early days of the web, a search engine would show either crawler-based results or human-powered listings; in today's environment, however, it is extremely common to find a merger of both. Usually, a hybrid search engine will favour one type of listing over the other. For example, Yahoo is more likely to present human-powered listings and Google its crawled listings.




Meta Search Engines


A meta search engine works as an agent between the user and the search engines. These engines use what is known as a metacrawler, which searches the databases of various search engines. Meta search engines do not build or maintain their own web indexes; they use the indexes built by others. Sometimes it is very difficult to retrieve results from search engines, and in a quest to find that vital piece of information, people often search several engines. Doing this by hand is time-consuming, and the problem lies in sifting through all the duplicated documents.










Illustration of Metacrawler

Ref [7]



An illustration of how MetaCrawler queries multiple search engines in parallel on the World Wide Web.



The main feature of a meta search engine is its ability to save time: it searches various engines simultaneously and also removes duplicated documents. The user's query is sent to multiple search engines, and meta search engines generally present the first 10 to 30 results from each engine's results page.

The advantage here is that a meta search engine can single-handedly search several databases for the required topic.

The disadvantage associated with this type of search engine is that it may return a limited number of hits.
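The behaviour described above (query several engines at once, keep the first few results from each, drop duplicates) can be sketched as follows. The two engines are stubs that return canned URLs on `.example` domains; a real metacrawler would call remote services, and `meta_search` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub engines returning canned URLs; a real metacrawler would call remote services.
def engine_a(query):
    return ["a.example/1", "shared.example/x", "a.example/2"]

def engine_b(query):
    return ["shared.example/x", "b.example/1"]

def meta_search(query, engines, per_engine=10):
    """Query all engines in parallel, keep the top results of each, drop duplicates."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged, seen = [], set()
    for results in result_lists:  # pool.map preserves engine order
        for url in results[:per_engine]:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

merged = meta_search("travel", [engine_a, engine_b])
```

The `per_engine` cap is also where the stated disadvantage comes from: anything an engine ranks below the cut-off never reaches the merged list.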


Examples: Webcrawler and Query Server




Conclusion


"Search engine" is a term used to describe both true search engines, such as Google and AltaVista, and web directories, such as Yahoo and AskJeeves. This assignment highlighted the fact that there is a distinct difference between the two.


Google and AltaVista are search engines that spider or crawl through websites compiling data, so hits are found based on the information held in their databases. Google's page-ranking technology is an effective tool, and I feel that for this reason Google has been granted number one status among search engines.


Yahoo and AskJeeves are actually directories that rely on humans for their listings. The creators of websites can submit a short description in an attempt to have their site included. The directory editors write descriptions for the sites being reviewed, so a search on a directory finds hits based on matches in the submitted descriptions.


The major difference between the search engines is their varied use of technology. AskJeeves was developed with a view to incorporating natural language processing: searches can be made by submitting questions, and the use of Boolean operators is unnecessary. The Teoma technology it uses places a strong emphasis on site popularity in its ranking algorithms; this technology seems to be the backbone of the corporation and its secret tool for propelling AskJeeves into first place on the ranking lists.












Reference


[1] How Search Engines Rank Web Pages / By Danny Sullivan, Editor / July 31, 2003 / searchenginewatch.com/webmasters/article.php/2167961 / Intro to Search Engine Optimization

[2] The search engines guide / Kansas Public Library / Online 31 March 2004 / Articles: a Review of Search Engines

[3] Google Forecasts growth / Searchenginewatch.com/blog/041020-111337 / Nov 20, 2004

[4] SearchEngineWatch / The Technology Behind Google / By Chris Sherman, Associate Editor / August 12, 2002

[5] Search Engine Watch Journal / The Technology Behind Google / By Chris Sherman, Associate Editor / August 12, 2002

[6] Search Engine Watch Journal / By Chris Sherman / February 18, 2004

[7] MetaCrawler/Husky Search group / An illustration of how MetaCrawler, washington.edu/research/projects/ai/metacrawler/