SWE 363: WEB ENGINEERING & DEVELOPMENT

Dr. Nasir Al-Darwish
Computer Science Department
King Fahd University of Petroleum and Minerals
darwish@kfupm.edu.sa

Spring Semester 2013-2014 (2014-1)

Module 1-2: Web Basics


Outline

- Finding Information on the Web
  - Search Engines
  - Other means
- Web 2.0


Search Engines

- The Web is a rich repository of information on almost any subject.
- To help find information quickly on a specific subject, use:
  - Subject indexes (known as search directories); note that the Web lacks a central directory.
  - Search engines: the primary tools for finding information on the Web that is relevant to a user query.
- There are many search engines.
  - Some are used for general search, while others (called vertical search engines) focus on specific topics (e.g., bioinformatics, medicine, jobs, business, real estate, travel).
- Examples of major search engines:
  - Google, Yahoo, Bing, Ask.com, About, etc.
  - See a list of them at http://www.thesearchenginelist.com/
  - They differ in their capabilities and in the way they work.


Search Engines Market Share

[Figure: search engine market share, Jan. 2013 statistics. Source: http://www.karmasnack.com/about/search-engine-market-share/]


How Web Search Engines Work

- Search engine
  - Searches a database of previously stored information about web pages.
  - Provides a web-based GUI that allows users to enter their search criteria (the search query).
  - Finds the best matches for the search criteria and returns the results to the user.
- Web crawler (spider)
  - Navigates the Web, retrieving web pages that satisfy certain criteria.
  - Starts with a popular website containing lots of links, such as Yahoo, then continues until it finds a logical stop, e.g., a dead end with no external links, or after reaching a set number of levels inside a website's structure.
- Indexing
  - Pages are analyzed, and a list of words (extracted from titles, headings, and other special meta tags) is stored in indexes to facilitate quick retrieval.
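To make the crawl step above concrete, here is a minimal sketch in Python using only the standard library. It is not any engine's actual crawler: the breadth-first strategy and the depth limit (one form of the "logical stop") are illustrative assumptions, and real crawlers add politeness delays, robots.txt handling, and robust parsing.

```python
# A minimal crawl sketch (an illustration, not a production crawler:
# no politeness delays, robots.txt checks, or duplicate-content handling).
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_depth=2):
    """Breadth-first crawl from `seed`, stopping after `max_depth`
    levels: one form of the 'logical stop' described above."""
    seen, frontier = {seed}, [(seed, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page: a dead end
        yield url, html  # hand the fetched page to the indexing step
        if depth < max_depth:
            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append((absolute, depth + 1))
```

The queue-based frontier keeps the crawl breadth-first, so shallow, well-linked pages are fetched before deeply nested ones.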


How Web Search Engines Work …

- Large search engines, such as Google, index hundreds of millions of web pages involving a large number of distinct terms (in different spoken languages), and answer millions of queries every hour.
- Some search engines pre-process the user query to improve retrieval performance (a process called query expansion; see the sketch after this list), e.g.:
  - Correcting spelling errors
  - Searching for synonyms of the specified keywords
  - Stemming the given keywords to find all their morphological forms
- Some search engines rank (sort according to relevance) the pages that satisfy the criteria specified in the user query.
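The following toy sketch illustrates the expansion step. The synonym table and suffix list are invented for the example; a real engine would use a curated thesaurus and a proper stemmer (e.g., Porter's) instead.

```python
# Toy query expansion: synonyms plus crude suffix stemming.
# SYNONYMS and SUFFIXES are illustrative assumptions, not real data.
SYNONYMS = {"car": ["automobile", "vehicle"], "quick": ["fast", "rapid"]}
SUFFIXES = ("ing", "ed", "es", "s")  # naive stand-in for a real stemmer

def stem(word):
    """Strip the first matching suffix, keeping a minimal stem length."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def expand(query):
    """Return the original terms plus stems and synonyms of the stems."""
    terms = set()
    for word in query.lower().split():
        root = stem(word)
        terms.update({word, root})            # morphological forms
        terms.update(SYNONYMS.get(root, []))  # synonyms
    return terms

print(expand("searched cars"))
# {'searched', 'search', 'cars', 'car', 'automobile', 'vehicle'}
```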


How Google Works

- Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing.
- Parallel processing allows computations to execute concurrently, significantly speeding up data processing.
- Google has three distinct parts (the last two are sketched below):
  - Googlebot, a web crawler that finds and fetches web pages.
  - The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
  - The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
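Below is a minimal sketch of the indexer/query-processor pair, under the simplifying assumption that matching means "the page contains every query word". This is not Google's implementation; real indexers also store word positions, weights, and much more.

```python
# Sketch of the indexer and query processor: the indexer builds an
# inverted index (word -> pages containing it); the query processor
# intersects the posting sets of the query words.
from collections import defaultdict

index = defaultdict(set)  # word -> set of page URLs

def index_page(url, text):
    """Indexer: record every word of the page. A real indexer would
    weight words from titles, headings, and meta tags more heavily."""
    for word in text.lower().split():
        index[word].add(url)

def query(text):
    """Query processor: return the pages containing all query words."""
    postings = [index.get(w, set()) for w in text.lower().split()]
    return set.intersection(*postings) if postings else set()

# Hypothetical example pages:
index_page("http://example.com/a", "web search engines index pages")
index_page("http://example.com/b", "web crawlers fetch pages")
print(query("web pages"))  # both pages contain 'web' and 'pages'
```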


Google’s PageRank

- PageRank uses a mathematical algorithm based on a graph, the webgraph, with HTML pages as nodes and hyperlinks as edges.
- Google uses several measures, including PageRank, to order the hits of a search query. For example, Google offers personalized search results based on your web history and location.


Google’s PageRank …

- A hyperlink pointing to a web page counts as a vote of support. If there are no links pointing to a page, there is no support for that page.
- The PageRank of a page is defined recursively and depends on the number and the PageRank of all the pages that link to it (its "incoming links").
- A page that is linked to by many pages with high PageRank receives a high rank itself.
- In the figure, Page C has a higher PageRank than Page E, even though there are fewer links to C; this is because the one link to C comes from a page with a high ranking.
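The recursive definition above can be computed by power iteration. Below is a minimal sketch: the damping factor 0.85 follows Brin and Page's paper, while the tiny webgraph is a hypothetical stand-in for the slide's figure, built so that C has one strong incoming link while E has three weak ones.

```python
# Minimal power-iteration sketch of PageRank: each page spreads its
# rank evenly over its outgoing links on every iteration.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to.
    Dangling pages (no outgoing links) are simply skipped for brevity."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}  # teleport share
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new[target] += share
        rank = new
    return rank

# Hypothetical webgraph: three low-rank pages (D, F, G) link to E,
# while C's single incoming link comes from the well-linked hub B.
graph = {
    "B": ["C"],
    "C": ["B"],
    "E": ["B"],
    "D": ["E"], "F": ["E"], "G": ["E"],
}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 3))
# C ends up with a higher rank than E, mirroring the slide's figure.
```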


Challenges Faced by Search Engines

- Size of the Web
  - The indexed Web contained at least 9.45 billion HTML pages as of September 5, 2012.
- Currency (freshness)
  - Many web pages are updated frequently, which forces the search engine to revisit them periodically.
- Relevancy
  - The queries one can make are currently limited to searching for exact words; this can result in many false positives.
  - Better results might be achieved by using a proximity-search option or organic search engines (a toy proximity-scoring sketch follows this list).
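As a toy illustration of proximity search, the sketch below scores a document higher when two query terms occur closer together. The scoring formula (inverse of the smallest word distance) is an invented example, not a standard measure.

```python
# Toy proximity scoring: documents where the query terms appear
# close together score higher than documents where they are far apart.
def proximity_score(text, term_a, term_b):
    """Return 1/d for the smallest word distance d between the terms,
    or 0.0 if either term is missing from the document."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return 0.0
    best = min(abs(a - b) for a in pos_a for b in pos_b)
    return 1.0 / best if best else 1.0

print(proximity_score("search engines rank web pages", "search", "engines"))  # 1.0
print(proximity_score("search many large web engines", "search", "engines"))  # 0.25
```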


Shortcomings of Search Engines

- Problems with dynamically generated websites
  - Such sites may be slow or difficult to index, or may produce an excessive number of results.
- Search engines can be tricked
  - Into returning pages, in favor of the trick makers, that contain little or no information about the matching phrases.
  - This pushes the more relevant web pages further down in the results list.
- Indexing secured pages
  - Content hosted on HTTPS URLs poses a challenge for crawlers, which either can't browse the content for technical reasons or won't index it for privacy reasons.


Invisible Web

- Also called the deep Web, dark Web, or hidden Web.
- Refers to web content that is not seen (or indexed) by general search engines.
- “It would be a site that's possibly reasonably designed, but they didn't bother to register it with any of the search engines. So, no one can find them! You're hidden. I call that the invisible Web.” (Frank Garcia, quoted at http://en.wikipedia.org/wiki/Deep_Web)


Invisible Web …

Deep Web resources may be classified into one or more of the following categories:

- Dynamic content: dynamic pages that are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
- Unlinked content: pages that are not linked to by other pages, which may prevent web-crawling programs from accessing the content.
- Private Web: sites that require registration and login (password-protected resources).
- Contextual Web: pages with content that varies for different access contexts (e.g., ranges of client IP addresses or previous navigation sequences).
- Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from web servers via Flash or Ajax.


Enhancing Site Visibility

- Search engine optimization (SEO)
  - Developing or tuning a website to improve its ranking in non-paid search engine results, in order to maximize traffic to the site.
  - SEO can be done using white hat SEO or black hat SEO.
- White hat SEO: methods approved by search engines (i.e., methods that do not attempt to deceive them), e.g.:
  - Offering quality content
  - Using proper metadata and effective keywords
  - Having inbound links from relevant, high-quality pages
- Black hat SEO (spamdexing): methods used to deceive search engines (e.g., repeating certain words, especially within links); these can result in only a temporary improvement.
  - A Google bomb is a black hat SEO technique that attempts to trick the search engine into promoting a certain page (by creating a large number of links to the page from various places) so that it ranks highly for searches on unrelated or off-topic keyword phrases.


Other Ways to Find Info. on the Web

- Meta-search engines
  - Have no databases or indexes of their own; instead, they query multiple other search engines and aggregate their results, e.g., WebCrawler, DogPile (a minimal aggregation sketch follows this list).
- Web directories
  - Human-edited databases that store information about links in a categorized manner (e.g., Yahoo! Directory, DMOZ, Business.com); they can also be created automatically by mining the output of search engines.
- Public web portals
  - Multi-service websites that provide a single point of access to a variety of content and services, e.g., http://msn.com
  - Often include customizable pages, calendars, discussion groups, announcements, reports, searches, email and address books, and access to news, weather, maps, and shopping, as well as bookmarks.
  - Often organize information into channels, where specific information or an application appears, to facilitate locating information of interest by content category.
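The sketch below illustrates the meta-search idea with a Borda-count-style merge of ranked lists. The two engine functions are hypothetical stubs returning canned results; a real meta-search engine would call each engine's API or parse its results pages.

```python
# Meta-search sketch: query several engines, then merge their ranked
# lists. engine_a and engine_b are invented stand-ins for real engines.
def engine_a(query):
    return ["http://a.example/1", "http://shared.example", "http://a.example/2"]

def engine_b(query):
    return ["http://shared.example", "http://b.example/1"]

def metasearch(query, engines):
    """Borda-style merge: a result earns more points the higher it
    ranks in each engine's list; points are summed across engines."""
    scores = {}
    for engine in engines:
        results = engine(query)
        for rank, url in enumerate(results):
            scores[url] = scores.get(url, 0) + (len(results) - rank)
    return sorted(scores, key=scores.get, reverse=True)

print(metasearch("web engineering", [engine_a, engine_b]))
# shared.example scores 2 + 2 = 4 and tops the merged list.
```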


Web 2.0

- Web 1.0 was mostly a “brochure web”: companies and advertisers produced content for users to access.
- Web 2.0 provides collaborative, community-based platforms that allow more user participation, interaction, and community contribution.
  - Users create content, help organize it, critique it, update it, etc.
  - Users create open-source software and make it available for anyone to use and modify.
  - Users direct how media is delivered and which news and information outlets to trust.
  - Examples: wikis, YouTube, Flickr, MySpace, Facebook, LinkedIn, Google, etc.
- The growth of Web 2.0 can be attributed to several key factors:
  - Improvements in hardware: cheaper and faster machines
  - Memory capacities and speeds increasing at a rapid rate
  - Faster Internet access
  - The availability of open-source software, which has resulted in cheaper (and often free) customizable software options


Web 2.0 …

- User-generated content
  - Users can edit existing content and add new information.
  - Collaboration can result in smart ideas.
  - But users might also deliberately submit false or faulty information.
- Web 2.0 companies rely on collaborative filtering to help police their sites:
  - Users promote valuable material and flag offensive or inappropriate material.
- Wikis (from the Hawaiian “wiki”, meaning quick) and social networks, e.g., Wikipedia, MySpace, YouTube, Facebook, LinkedIn, Second Life, etc.


Web 2.0 …

- Blogs (“web logs”)
  - Websites consisting of entries listed in reverse chronological order; blogging has grown exponentially; bloggers = blog authors.
  - Blogs also incorporate media, such as music or videos, e.g., Xanga, LiveJournal.
- Social media
  - Allows users to decide which content (or news articles) is most significant, e.g., Digg, Reddit, StumbleUpon.
- Social bookmarking
  - Allows users to recommend their favorite sites, e.g., del.icio.us (http://delicious.com/), ma.gnolia (http://gnolia.com/, now defunct).
- Tagging
  - Labeling existing web content by subject or keywords so that anyone can locate information more effectively.
- RSS feeds (RSS = Rich Site Summary)
  - Allow users to receive new information as it is updated, pushing the content right to the user’s desktop (a minimal feed-reading sketch follows this list).
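Below is a minimal feed-reading sketch using Python's standard library. The feed URL is a placeholder; real aggregators also handle Atom feeds, caching, and conditional requests.

```python
# Sketch: fetch an RSS 2.0 feed and list its newest items.
import urllib.request
import xml.etree.ElementTree as ET

def latest_items(feed_url, limit=5):
    """Yield (title, link) pairs for the first `limit` items of the feed."""
    with urllib.request.urlopen(feed_url, timeout=5) as response:
        root = ET.fromstring(response.read())
    # RSS 2.0 layout: <rss><channel><item>...</item></channel></rss>
    for item in root.findall("channel/item")[:limit]:
        yield item.findtext("title"), item.findtext("link")

# Hypothetical usage (placeholder URL):
# for title, link in latest_items("http://example.com/news.rss"):
#     print(title, "->", link)
```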


Web 2.0 - Summary

- The term “Web 2.0” defines an era, like “Dot Com”.
- Read/write, two-way; anyone can be a publisher.
- Social Web, social networks (MySpace, Facebook, OpenSocial)
- Enhanced search (e.g., Google search by file type, and specialized search engines)
- Online media (YouTube, Hulu, Last.fm)
- Content aggregation / syndication (Bloglines, Google Reader, Techmeme, Topix)
- Mashups (Google Maps, Flickr, Amazon)
- Personalized web pages, widgets (e.g., iGoogle, My Yahoo)


Web 3.0

- Semantic web (attaching meaning to data), personalization (e.g., iGoogle), intelligent search, and behavioral advertising, among other things.


References

- B. A. Forouzan, Data Communications and Networking, 4/e, McGraw-Hill Higher Education, 2007. http://www.mhhe.com/forouzan
- The World Wide Web Consortium (W3C)
- S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Stanford University
- Dive Into Web 2.0, Deitel, http://www.deitel.com/freeWeb20ebook/
- “Crawling the Hidden Web,” Proc. of the 27th International Conf. on Very Large Data Bases (VLDB), pp. 129-138