White Hat Cloaking


Six Practical Applications

Presented by Hamlet Batista

Page 2

Why white hat cloaking?


“Good” vs. “bad” cloaking is all about your intention


Always weigh the risks versus the rewards of cloaking


Ask permission


or just don’t call it cloaking!


Cloaking vs “IP delivery”



Page 3

1. Crash course in white hat cloaking
2. When to cloak? Practical scenarios where good cloaking makes sense
3. How do we cloak? Practical scenarios and alternatives
4. How can cloaking be detected?
5. Risks and next steps

Page 4

When is it practical to cloak?

Content accessibility
- Search-unfriendly Content Management Systems
- Rich media sites
- Content behind forms

Membership sites
- Free and paid content

Site structure improvements
- Alternative to PR sculpting via “nofollow”

Geolocation/IP delivery

Multivariate testing



Page 5

Practical scenario #1

Proprietary website management systems that are not search-engine friendly

Regular users see
- URLs with many dynamic parameters
- URLs with session IDs
- URLs with canonicalization issues
- Missing titles and meta descriptions

Search engine robot sees
- Search-engine-friendly URLs
- URLs without session IDs
- URLs with a consistent naming convention
- Automatically generated titles and meta descriptions

Page 6

Practical scenario #2

Sites built completely in Flash, Silverlight or any other rich media technology

Search engine robot sees
- A text representation of all graphical (image) elements
- A text representation of all motion (video) elements
- A text transcription of all audio in the rich media content


Page 7

Practical scenario #3

Membership sites

Search users see
- Snippets of premium content on the SERPs
- When they land on the site, they are faced with a registration form

Members see
- The same content search engine robots see



Page 8

Practical scenario #4

Sites requiring massive site structure changes to improve index penetration

Regular users follow a link structure designed for ease of navigation
(Diagram: Step 1 → Step 2 → Step 3 → Step 4 → Step 5)

Search engine robots follow a link structure designed for ease of crawling and deeper index penetration of the most important content
(Diagram: Step 1 → Step 3 → Step 2 → Step 5 → Step 4)

Page 9

Practical scenario #5

Sites using geolocation technology

Regular users see
- Content tailored to their geographical location and/or language

Search engine robot sees
- The same content consistently



Page 10

Practical scenario #6

Split testing organic search landing pages

Each regular user sees
- One of the content experiment alternatives

Search engine robot sees
- The same content consistently



Page 11

How do we cloak?

Search robot detection
- By HTTP user agent
- By IP address
- By HTTP cookie test
- By JavaScript/CSS test
- By DNS double check
- By visitor behavior
- By combining all the techniques

Content delivery
- Presenting the equivalent of the inaccessible content to robots
- Presenting the search-engine-friendly content to robots
- Presenting the content behind forms to robots


Cloaking is performed with a web server script or module
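The deck does not include server code, so here is a minimal sketch of that idea in Python, using the Flask microframework purely for illustration. The route, template names and the is_search_robot helper are placeholders rather than anything from the original presentation; detection itself is left to the techniques covered on the following slides.

Python code (illustrative)

# Minimal sketch (not from the original deck): one handler decides whether to
# serve the regular page or its crawler-friendly equivalent.
from flask import Flask, request, render_template

app = Flask(__name__)

def is_search_robot(req) -> bool:
    """Placeholder: combine the detection techniques from the following
    slides (user agent, IP address, cookies, JS/CSS, double DNS check)."""
    return "googlebot" in req.headers.get("User-Agent", "").lower()

@app.route("/article/<slug>")
def article(slug):
    if is_search_robot(request):
        # Equivalent, accessible, search-engine-friendly version.
        return render_template("article_crawler.html", slug=slug)
    # Regular rich/interactive version for human visitors.
    return render_template("article.html", slug=slug)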

Page 12

Robot detection by HTTP user agent

Search robot HTTP request

66.249.66.1 - - [04/Mar/2008:00:20:56 -0500] "GET /2007/11/13/game-plan-what-marketers-can-learn-from-strategy-games/ HTTP/1.1" 200 61477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"



A very simple robot detection technique
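As a rough illustration (not part of the slide), a user-agent check can be as small as a substring match against known crawler tokens. The token list below is an assumption and would need maintaining; user agents are also trivially spoofed, which is why later slides combine techniques.

Python code (illustrative)

# Illustrative user-agent test; the token list is an assumption, not exhaustive.
KNOWN_BOT_TOKENS = ("googlebot", "msnbot", "slurp", "yandexbot", "baiduspider")

def ua_looks_like_robot(user_agent: str) -> bool:
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)

# The user agent from the log line above:
ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(ua_looks_like_robot(ua))  # True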

Page 13

Robot detection by HTTP cookie test

Search robot HTTP request

66.249.66.1 - - [04/Mar/2008:00:20:56 -0500] "GET /2007/11/13/game-plan-what-marketers-can-learn-from-strategy-games/ HTTP/1.1" 200 61477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Missing cookie info



Another simple robot detection technique, but weaker
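One way to implement the idea, sketched here as an assumption rather than the presenter's code: set a cookie on the first response and treat clients that never send it back as possible robots. Flask is used only for brevity.

Python code (illustrative)

# Sketch of the HTTP cookie test: crawlers generally do not return cookies,
# but neither do privacy-conscious users, so this signal is weak on its own.
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/")
def index():
    has_cookie = request.cookies.get("seen") == "1"
    resp = make_response("returning browser" if has_cookie
                         else "first visit or possible robot")
    if not has_cookie:
        resp.set_cookie("seen", "1")  # real browsers echo this back next time
    return resp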

HTML Code


<div id="header"><h1><a href="http://www.example.com" title="Example Site">Example site</a></h1></div>



The CSS code is pretty straightforward: it swaps out anything in the h1 tag in the header with an image.


CSS Code


/* CSS image replacement */
#header h1 {margin:0; padding:0;}
#header h1 a {
  display: block;
  padding: 150px 0 0 0;
  background: url(path to image) top right no-repeat;
  overflow: hidden;
  font-size: 1px;
  line-height: 1px;
  height: 0px !important;
  height /**/: 150px;
}

Page 14

Robot detection by JavaScript/CSS test

DHTML Content

Another option for robot detection
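A hypothetical sketch of the JavaScript variant of this test (the endpoint names and markup are assumptions, not from the deck): embed a tiny script that real browsers execute; sessions that never hit the beacon are likely robots, since crawlers of this era generally did not run JavaScript.

Python code (illustrative)

# Sketch of a JavaScript-execution test; robots that do not run JS never
# request the beacon, so their sessions stay unmarked.
from flask import Flask, session

app = Flask(__name__)
app.secret_key = "replace-with-a-real-secret"  # placeholder

PAGE = """<html><body>
<h1>Example page</h1>
<script>new Image().src = '/beacon';</script>
</body></html>"""

@app.route("/")
def index():
    return PAGE

@app.route("/beacon")
def beacon():
    session["ran_js"] = True   # only clients that executed the script get here
    return "", 204

@app.route("/is-robot")
def is_robot():
    return "probably human" if session.get("ran_js") else "possible robot"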

Page 15

Robot detection by IP address

Search robot HTTP request

66.249.66.1 - - [04/Mar/2008:00:20:56 -0500] "GET /2007/11/13/game-plan-what-marketers-can-learn-from-strategy-games/ HTTP/1.1" 200 61477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"



A more robust robot detection technique
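A minimal sketch of an IP check, assuming a maintained list of crawler networks; the range shown is an example, not an official list.

Python code (illustrative)

# Illustrative IP-range test; real deployments need an up-to-date list, which
# is exactly the weakness discussed on the cloaking-detection slide later on.
import ipaddress

KNOWN_CRAWLER_NETWORKS = [
    ipaddress.ip_network("66.249.64.0/19"),  # example Googlebot range (assumption)
]

def ip_is_known_robot(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in KNOWN_CRAWLER_NETWORKS)

print(ip_is_known_robot("66.249.66.1"))  # True for the IP in the log above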

Page 16

Robot detection by double DNS check

Verifying the search robot IP address with nslookup


nslookup 66.249.66.1
Name:    crawl-66-249-66-1.googlebot.com
Address: 66.249.66.1

nslookup crawl-66-249-66-1.googlebot.com
Non-authoritative answer:
Name:    crawl-66-249-66-1.googlebot.com
Address: 66.249.66.1



A more robust robot detection technique
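The same reverse-then-forward check can be automated; this is a sketch of the idea in Python, with the allowed hostname suffixes chosen as an assumption for illustration.

Python code (illustrative)

# Double DNS check: reverse-resolve the IP, require a crawler hostname,
# then forward-resolve the hostname and require the original IP.
import socket

ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")  # assumption for illustration

def verify_robot_ip(ip: str) -> bool:
    try:
        host, _aliases, _addrs = socket.gethostbyaddr(ip)    # reverse lookup
        if not host.endswith(ALLOWED_SUFFIXES):
            return False
        forward_ips = socket.gethostbyname_ex(host)[2]       # forward lookup
        return ip in forward_ips                             # must resolve back to the same IP
    except OSError:                                          # NXDOMAIN, timeouts, etc.
        return False

print(verify_robot_ip("66.249.66.1"))  # True when both lookups agree, as above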

Page 17

Robot detection by visitor behavior

Robots differ substantially from regular users when visiting a website
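The deck leaves the specific heuristics to the reader. As a hedged example, one might flag visitors that fetch robots.txt, request pages unusually fast, or never download images and CSS; the signals and thresholds below are arbitrary placeholders, not the presenter's rules.

Python code (illustrative)

# Illustrative behaviour-based flagging; thresholds and signals are assumptions.
import time
from collections import defaultdict

page_hits = defaultdict(list)        # ip -> timestamps of HTML page requests
fetched_assets = defaultdict(bool)   # ip -> ever requested images/CSS/JS?
fetched_robots_txt = defaultdict(bool)

def record_request(ip: str, path: str) -> None:
    if path == "/robots.txt":
        fetched_robots_txt[ip] = True
    elif path.endswith((".css", ".js", ".png", ".jpg", ".gif")):
        fetched_assets[ip] = True
    else:
        page_hits[ip].append(time.time())

def behaves_like_robot(ip: str) -> bool:
    recent = [t for t in page_hits[ip] if time.time() - t < 60]
    crawling_fast = len(recent) > 30   # many pages per minute
    return fetched_robots_txt[ip] or (crawling_fast and not fetched_assets[ip])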

Page 18

Combining the best of all techniques

Maintain a cache with a list of known search robots to reduce the number of verification attempts

Label as a possible robot any visitor with suspicious behavior

Label as a robot anything that identifies itself as such

Confirm it is a robot by doing a double DNS check; also confirm suspected robots
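Pulling the pieces together, a minimal sketch of that flow might look like the following. It reuses the hypothetical helpers from the earlier sketches (ua_looks_like_robot, behaves_like_robot, verify_robot_ip) and a simple in-memory cache; none of this is the presenter's code.

Python code (illustrative)

# Combined detection (illustrative): cache verified results, label candidates
# by user agent or behaviour, and confirm them with the double DNS check.
verified: dict[str, bool] = {}   # ip -> cached result of the DNS verification

def is_search_robot(ip: str, user_agent: str) -> bool:
    if ip in verified:                      # known robot (or known impostor)
        return verified[ip]
    candidate = ua_looks_like_robot(user_agent) or behaves_like_robot(ip)
    if candidate:
        verified[ip] = verify_robot_ip(ip)  # confirm before trusting it
        return verified[ip]
    return False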

Page 19

Clever cloaking detection

A clever detection technique is to check the caches at the newest datacenters
- IP-based detection techniques rely on an up-to-date list of robot IPs
- Search engines change IPs on a regular basis
- It is possible to identify those new IPs and check the cache


Page 20

Risks of cloaking

Search engines do not want to accept any type of cloaking

Survival tips
- The safest way to cloak is to ask for permission from each of the search engines that you care about
- Refer to it as “IP delivery”

“Cloaking: Serving different content to users than to Googlebot. This is a violation of our webmaster guidelines. If the file that Googlebot sees is not identical to the file that a typical user sees, then you're in a high-risk category. A program such as md5sum or diff can compute a hash to verify that two different files are identical.”

http://googlewebmastercentral.blogspot.com/2008/06/how-google-defines-ip-delivery.html
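Following the quote's suggestion, here is a hypothetical way to run that comparison yourself with a hash; the URL and user agents are placeholders. Note that it only catches user-agent-based differences, since an IP-delivery setup would still serve the same page for both requests.

Python code (illustrative)

# Compare what a Googlebot user agent receives with what a browser receives.
import hashlib
import urllib.request

def fetch(url: str, user_agent: str) -> bytes:
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

URL = "http://www.example.com/page"  # placeholder
BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

bot_hash = hashlib.md5(fetch(URL, BOT_UA)).hexdigest()
user_hash = hashlib.md5(fetch(URL, BROWSER_UA)).hexdigest()
print("identical" if bot_hash == user_hash else "different content served")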


Page 21

Next Steps


Make sure clients understand the risks/rewards of implementing white hat cloaking

More information and how to get started
- How Google defines IP delivery, geolocation and cloaking
  http://googlewebmastercentral.blogspot.com/2008/06/how-google-defines-ip-delivery.html
- First Click Free
  http://googlenewsblog.blogspot.com/2007/09/first-click-free.html
- Good Cloaking, Evil Cloaking and Detection
  http://searchengineland.com/070301-065358.php
- YADAC: Yet Another Debate About Cloaking Happens Again
  http://searchengineland.com/070304-231603.php
- Cloaking is OK Says Google
  http://blog.venture-skills.co.uk/2007/07/06/cloaking-is-ok-says-google/
- Advanced Cloaking Technique: How to feed password-protected content to search engine spiders
  http://hamletbatista.com/2007/09/03/advanced-cloaking-technique-how-to-feed-password-protected-content-to-search-engine-spiders/

Blog: http://hamletbatista.com
LinkedIn: http://www.linkedin.com/in/hamletbatista
Facebook: http://www.facebook.com/people/Hamlet_Batista/613808617
Twitter: http://twitter.com/hamletbatista
E-mail: hamlet@hamletbatista.com




Page 22

I would be happy to help. Feel free to contact me.