Web Mining & Entropy

Oct 21, 2013


“The technique of navigating & mapping cyberspace to bring order to the
internet and determine relationships that were previously undiscovered.”

Web Mining

By Peter Smith

The Structure of the Internet, Mining Architectures & Entropy

The Structure of the Web

Search Engines Vs. Entropy

Mining Architectures

The Degradation of Web Space

The 3 Types of Web Mining

Everything Tends to Disorder

Video Segment







The Structure of the Web

The 3 Types of Web Mining

Content Mining, Structural Mining and Usage Mining

Content mining is the process of screening for relevant information in
order to produce superior results for web queries.

Spiders and robots, known as web crawlers, are used to traverse the
indexes and hyperlinks of a site to gather data.

This data is linked with keywords to determine the accuracy and
relevance of web pages and search results.

See Content Mining Web Crawler Architecture Diagram
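The crawling and keyword steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not the architecture in the diagram: the in-memory SITE dictionary, the page names and the keyword-count score are all assumptions made so the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical three-page site held in memory so the sketch runs offline.
SITE = {
    "/index.html": '<a href="/mining.html">Techniques</a> '
                   '<a href="/about.html">About</a> web mining home',
    "/mining.html": '<a href="/index.html">Home</a> '
                    'web mining uses crawlers and mining finds patterns',
    "/about.html": "about the author",
}


class LinkAndTextParser(HTMLParser):
    """Collects hyperlink targets and visible words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(start, site):
    """Breadth-first traversal of hyperlinks; returns {url: list of words}."""
    seen = {start}
    queue = deque([start])
    index = {}
    while queue:
        url = queue.popleft()
        parser = LinkAndTextParser()
        parser.feed(site[url])
        parser.close()
        index[url] = parser.words
        for link in parser.links:
            # Only follow links that exist in the site and are not yet queued.
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def rank(index, keyword):
    """Order pages by keyword frequency (a deliberately crude relevance score)."""
    scores = {url: words.count(keyword) for url, words in index.items()}
    return sorted(scores, key=lambda u: scores[u], reverse=True)


index = crawl("/index.html", SITE)
ranking = rank(index, "mining")
print(ranking)
```

A real crawler would fetch pages over HTTP, respect robots.txt and use a far richer relevance measure; the sketch only shows the traverse, gather and link-to-keywords sequence.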

Structural mining looks at the links between pages and sites, the
organisation of data and the structure of a web site as a whole.

Its main objective is to create a more connected environment and so
provide a more complete structure of the web.

See Robots Mining Example
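As a rough illustration of looking at the links between pages, the sketch below counts inbound links and finds pages that cannot be reached from a start page. The LINKS graph and its page names are hypothetical, and inbound-link counting is just one simple way to examine link structure, not necessarily the method the referenced example uses.

```python
# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["products", "blog"],
    "products": ["home"],
    "blog": ["home", "products"],
    "old-promo": ["home"],  # links out, but nothing links to it
}


def inbound_counts(links):
    """Count how many other pages link to each page."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts


def reachable(links, start):
    """Return every page that can be reached by following links from start."""
    seen = set()
    stack = [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        stack.extend(links.get(page, []))
    return seen


counts = inbound_counts(LINKS)
orphans = sorted(set(LINKS) - reachable(LINKS, "home"))
print(counts)
print(orphans)
```

Pages with no inbound links, such as the hypothetical "old-promo" page here, are the poorly connected parts of a site that structural mining would flag.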

Usage mining looks at
operational data or user

Usage mining analyses this data and turns it into useful, readable
information.

Google Analytics is a primary example of a tool that uses usage mining.
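To make the idea of turning raw usage data into readable information concrete, here is a small sketch that summarises web server log lines. The log lines are invented samples in Common Log Format and the summarise function is a hypothetical helper; this is not how Google Analytics itself collects or processes data.

```python
import re
from collections import Counter

# Hypothetical access-log lines in Common Log Format.
LOG_LINES = [
    '10.0.0.1 - - [21/Oct/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [21/Oct/2013:10:00:05 +0000] "GET /mining.html HTTP/1.1" 200 2048',
    '10.0.0.1 - - [21/Oct/2013:10:01:10 +0000] "GET /mining.html HTTP/1.1" 200 2048',
    '10.0.0.3 - - [21/Oct/2013:10:02:00 +0000] "GET /missing.html HTTP/1.1" 404 0',
]

# host, identity, user, [timestamp], "method path protocol", status, size
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \S+$'
)


def summarise(lines):
    """Return (successful hits per page, set of distinct visitor hosts)."""
    hits = Counter()
    visitors = set()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the expected format
        host, _method, path, status = match.groups()
        visitors.add(host)
        if status == "200":
            hits[path] += 1
    return hits, visitors


hits, visitors = summarise(LOG_LINES)
print(hits.most_common())
print(len(visitors), "distinct visitors")
```

Counting successful page views and distinct hosts is about the simplest possible summary; real usage mining would also look at sessions, navigation paths and time on page.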

Mining Architectures (1)

Content Mining

Web Crawler Architecture

See Content Mining Web Crawler Architecture Diagram and definition by
Marc Najork.

Mining Architectures (2)

Usage Mining Architecture

See Usage Mining Architecture Diagram and Steps.

Everything Tends to Disorder (1)

What do sandcastles have to do with the internet?

Everything Tends to Disorder (2)

Professor Brian Cox: “How a sandcastle reveals the end of all things”,
Wonders of the Universe


The Degradation of Web Space (1)


In this context, the sandcastle represents the internet, and the mould
that shapes and structures it is the search engines.


The sand and wind represent the continuous stream of web pages being
added to the web, which slowly batters this attempted ordering of the
internet.


Search engines invest a great deal of expertise in refining their
search algorithms to produce better results. In effect, one could say
that this is the primary role of a search engine: to reduce entropy
across cyberspace and ensure the web remains a usable store of
information.

The Degradation of Web Space (2)


Entropy, however, has a powerful ally: time. As time progresses, more
and more pages will be created, and no matter how well search engines
try to organise the web, entropy always increases. The internet will
eventually become too vast and disordered a medium to put in order.


The world wide web itself could in all reason become


Search Engines Vs. Entropy


Search engines are trying to provide that form of structure, and the
main tools they use against the degradation of web space are the mining
techniques we have discussed here.


Along with the algorithms used to search the web, humans
are the greatest weapon we have against increasing entropy.

Tim Berners-Lee understands that there is a need to create some type of
order, and that is what the Semantic Web is trying to do: create a form
of control & order.


A master algorithm could also be the answer. As we know, Google are
constantly refining theirs in order to keep up with the ever-changing
web, but the question is: will it work?