Karolina Lewandowska Boguslawa Piekarska Ioannis Zografakis

26 Jun 2012

HoneyPot for Web Crawlers
Karolina Lewandowska
Boguslawa Piekarska
Ioannis Zografakis
Honey Pot:
•Is a trap set to detect, deflect, or in some manner
counteract attempts at unauthorized use of information
systems.
•It consists of a computer, data, or a network site that
appears to be part of a network, but is actually isolated,
(un)protected, and monitored, and which seems to
contain information or a resource of value to attackers.
Web Crawler:
•Is a computer program that browses the World Wide
Web in a methodical, automated manner.
•This process is called Web crawling or spidering. Many
sites, in particular search engines, use spidering as a
means of providing up-to-date data.
•The process is initiated with the addition of a list of
hypertext documents (the seeds) to the crawling frontier
•Documents in the frontier are ranked; the next document
to crawl is selected and removed from the frontier
•An HTTP request is issued for the document and after it
has been retrieved, its contents are processed and its
outward links are extracted
•All links that are not already in the frontier and were not
yet crawled are added to the frontier
•The process continues recursively
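The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the crawler of any real search engine: the link graph `TOY_WEB` and the helper `fetch_links` are hypothetical stand-ins for issuing HTTP requests, the "ranking" is plain first-in, first-out, and the loop is written iteratively rather than recursively.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
# A real crawler would issue an HTTP request and parse the HTML instead.
TOY_WEB = {
    "/index": ["/diets", "/exercises"],
    "/diets": ["/index", "/calorie_table"],
    "/exercises": ["/index"],
    "/calorie_table": [],
}


def fetch_links(url):
    """Stand-in for 'retrieve the document and extract its outward links'."""
    return TOY_WEB.get(url, [])


def crawl(seeds):
    frontier = deque(seeds)   # documents waiting to be crawled
    seen = set(seeds)         # everything queued or already crawled
    order = []                # order in which documents were crawled
    while frontier:
        url = frontier.popleft()      # simplest possible ranking: oldest first
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:      # skip links already queued or crawled
                seen.add(link)
                frontier.append(link)
    return order


print(crawl(["/index"]))
```

Replacing the `deque` with a priority queue would give a crawler that ranks the frontier by some score instead of arrival order.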
Web Crawler – How they work
Web Crawler – Politeness policy
•Robots exclusion protocol: a standard that lets
administrators indicate which parts of their Web
servers should not be accessed by crawlers.
•Resource-level exclusion: set through a META tag in individual HTML files.
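Both politeness mechanisms can be checked with the Python standard library. The sketch below is illustrative only: the rules in `ROBOTS_TXT`, the page in `PAGE`, the user agent name `ExampleBot` and the helper names are all assumptions made for the example, not material from the project.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical site-level rules, as an administrator might publish them.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots_txt(path, user_agent="ExampleBot"):
    """Site level: ask a robots.txt parser whether the path may be fetched."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


class RobotsMetaParser(HTMLParser):
    """Resource level: collect directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def meta_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks crawlers not to index it or follow its links.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'

print(allowed_by_robots_txt("/private/report.html"))
print(allowed_by_robots_txt("/index.html"))
print(sorted(meta_directives(PAGE)))
```

Both mechanisms are advisory: a crawler has to choose to run checks like these, which is exactly what a honeypot for crawlers can observe.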
Honey Pots – more details
•Honey Pots serve several purposes: distracting
attackers from more valuable machines on the network,
providing early warning of an attack, and allowing an in-depth
examination of adversaries during and after the exploitation
of the Honey Pot.
•Honey Pots are supposed to interact only with intruders, so
all transactions and interactions with the Honey Pot are
unauthorized and the information gathered should be
•Honey Pots can be classified into types based on their
purpose and level of interaction.
Honey Pots – distinguished by purpose
•Research Honey Pots: designed to gain information
about the blackhat community
•Production Honey Pots: mostly used within
organizations to protect them and help mitigate risk
•Honeytoken: like a Honey Pot, but not a computer;
it is a digital entity
Honey Pots – distinguished by level of interaction
•Low-interaction honeypots: honeypots with no
operating system for the attacker to interact with.
•Medium-interaction honeypots: provide the attacker with
an illusion of an operating system and give the attacker
more to interact with.
•High-interaction honeypots: the most advanced
honeypots; they provide the attacker with a real operating system
to interact with, where nothing is simulated or restricted.
Milestones of the project
•Create a Web page
•Get the Web page a high search ranking (it will be easier for
Web Crawlers to find it)
•Create a PHP script that records website visits and
separates normal guests from Web crawlers
•Observe attacks and compile statistics about them
•Compare attacks on the two servers
Honey Pot – Web Page "Healthy lifestyle"
Content of Web Page:
•Good advice about a healthy lifestyle
•What to eat
•How to prepare food
•Calculating BMI
•Exercises (available after log in)
  o Plank on Elbows and Toes
  o Long arm crunch
  o Bicycle exercise
•Diets (available after log in)
  o Blood type diet
  o Grapefruit diet
  o Low fat diet
•Calorie Table
•Language: XHTML,
elements of Flash
•Servers: orfi.uwm.edu.pl
•URL addresses: orfi.uwm.edu.pl/~bagietka/int
Our steps to make the page attractive to Web Crawlers
•Text links instead of buttons
  o Crawlers can’t read text from an image
•Not too many images; crawlers prefer text
Our steps to make the page attractive to Web Crawlers - cont'd
•Links from page to page within the same web site
  o Crawlers move from one web page to another
    through the navigational links.
•Minimize JavaScript effects
  o Crawlers don't get the content between
    <script> ... </script> tags
•Search engine optimization (SEO)
  o Getting a good ranking for the web site: many
    links from pages that already rank well
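The points about text links and JavaScript can be illustrated with a small parser that models a crawler which does not execute scripts. This is a sketch under that assumption (some modern crawlers do run JavaScript); the class name `SimpleCrawlerView` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class SimpleCrawlerView(HTMLParser):
    """Collects the links and text a script-ignoring crawler would see."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script>/<style> is ignored; images contribute no text.
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


# Hypothetical page: one plain text link, one image button, one scripted link.
PAGE = """<html><body>
<a href="index.php?content=diets">Diets</a>
<img src="button.png" alt="">
<script>document.write('<a href="index.php?content=menu">Menu</a>');</script>
<p>Healthy lifestyle</p>
</body></html>"""

view = SimpleCrawlerView()
view.feed(PAGE)
print(view.links)
print(view.text)
```

Only the plain text link and the visible paragraph text are collected; the link written by the script and the image button leave nothing for this kind of crawler to follow or index.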
Our steps to make the page attractive to Web Crawlers - cont'd
Search engine optimization (SEO)
•SEO directories
•Blog about health – presell page
•Finding the most popular keywords with
Google AdWords
•Keywords: health, healthy, exercises, diets,
exercise fitness, weight loss diet, diet plan, health
food, healthy lifestyle, calorie table, diet food
Examples of directories and blog used by us
•Directories for the page,
e.g.: dmoz.org, click4choice.com, internet-web-directory.com,
thalesdirectory.com, canlinks.net, politicalforecast.net,
nashvillebbb.org and more
•Blog: http://health4us.yolasite.com
•Directories for the blog,
e.g.: shane-english.com, pegasusdirectory.com, skoobe.biz,
tsection.com and more
Hidden links – Home Page
Hidden links – left-bottom side of the footer
•"Robots.txt" is a regular text file that, through its name,
has special meaning to the majority of "honorable"
robots on the web.
•By defining a few rules in this text file, we can instruct
robots not to crawl and index certain files or directories
within our site, or the site as a whole.
•The robots.txt file is uploaded to the root
directory of our site
Robots.txt – our file
User-agent: *
Disallow: http://orfi.uwm.edu.pl/~bagietka/int/
Disallow: /bagietka/int/calorie_table.php
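To see how a standards-following parser reads these rules, they can be fed to Python's `urllib.robotparser`. This is only one interpretation of the file, and real crawlers may parse it differently; the helper `is_allowed` and the user agent name are assumptions for the example, while the two paths are taken from the slides (the first from the crawler log, the second from the rule itself).

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as shown on the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: http://orfi.uwm.edu.pl/~bagietka/int/
Disallow: /bagietka/int/calorie_table.php
"""


def is_allowed(path, user_agent="Googlebot"):
    """Return True if this parser would let user_agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


for path in (
    "/~bagietka/int/index.php?content=calorie_table",  # path seen in the crawler log
    "/bagietka/int/calorie_table.php",                  # path as written in the rule
):
    print(path, "->", "allowed" if is_allowed(path) else "disallowed")
```

This parser matches rules as path prefixes, so a `Disallow` value that starts with `http://` or omits the `~` may not cover the paths that actually appear in the logs. If the logged path is reported as allowed, a compliant crawler visiting it would not necessarily be ignoring the file.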
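Logfiles – filtering the users
// Returns true when the User-Agent string contains a known crawler name.
function getIsCrawler($userAgent) {
    // Crawler names separated by "|"; the list is cut off on the slide after "AnnoMille".
    $crawlers = 'Webduniabot|UnChaos|SitiDi|DIE-OT|aipbot|Aladin|Aleksika Spider|AlkalineBOT|' .
                'Allesklar|AltaVista Intranet|AmfibiBOT|Amfibibot|AnnoMille';
    // Case-insensitive match of any listed name anywhere in the User-Agent.
    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);
    return $isCrawler;
}

$isCrawler = getIsCrawler($_SERVER['HTTP_USER_AGENT']);
if ($isCrawler) {
    // handling of crawler visits (not shown on the slide)
} else {
    // handling of normal guests (not shown on the slide)
}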
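Logfiles – writing into a file
// $IP, $user, $where and $filea are set earlier (not shown on the slide).
$date = date("H:i:s d-m-Y");
$data = "IP: $IP, Date: $date, UserAgent: $user\nWhere: $where\n";
$fpa = fopen($filea, "r+");
flock($fpa, LOCK_EX);            // lock before reading so concurrent visits are not lost
$size = filesize($filea);
$old = ($size > 0) ? fread($fpa, $size) : "";
rewind($fpa);                    // write from the start: newest entry first
fwrite($fpa, $data . $old);
fflush($fpa);
flock($fpa, LOCK_UN);
fclose($fpa);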
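For the statistics step, entries in this format can be read back with a small parser. The sketch below assumes each entry is written as `IP: ..., Date: ..., UserAgent: ...` followed by a `Where: ...` line, as in the string built above; the sample entries use hypothetical addresses from the 192.0.2.0/24 documentation range, since the real IPs are not shown in the slides.

```python
import re
from datetime import datetime

# One log entry: a header line followed by a "Where:" line.
ENTRY = re.compile(
    r"IP: (?P<ip>[^,]*), Date: (?P<date>\d{2}:\d{2}:\d{2} \d{2}-\d{2}-\d{4}), "
    r"UserAgent: (?P<agent>.*)\n\s*Where: (?P<where>.*)"
)


def parse_log(text):
    """Return a list of dicts, one per log entry found in text."""
    entries = []
    for match in ENTRY.finditer(text):
        entries.append({
            "ip": match.group("ip").strip(),
            "date": datetime.strptime(match.group("date"), "%H:%M:%S %d-%m-%Y"),
            "agent": match.group("agent").strip(),
            "where": match.group("where").strip(),
        })
    return entries


# Hypothetical sample in the assumed format.
SAMPLE = (
    "IP: 192.0.2.10, Date: 13:56:13 28-11-2009, UserAgent: "
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\n"
    "Where: /~bagietka/int/\n"
    "IP: 192.0.2.20, Date: 20:39:06 25-11-2009, UserAgent: W3C_Validator/1.654\n"
    "Where: /~bagietka/int/index.php\n"
)

entries = parse_log(SAMPLE)
for entry in entries:
    print(entry["date"], entry["agent"], entry["where"])
```

The user agent is matched up to the end of the line because it can itself contain commas and semicolons.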
Logs – examples
•IP:, Date: 20:16:30 25-11-2009, UserAgent:
Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:
Gecko/2009101601 Firefox/3.0.15 (.NET CLR 3.5.30729)
Where: /~bagietka/int/index.php?content=diets
•IP:, Date: 20:39:06 25-11-2009, UserAgent:
W3C_Validator/1.654 Where: /~bagietka/int/index.php
•IP:, Date: 20:46:10 25-11-2009, UserAgent:
Mozilla/5.0 (Windows; U; Windows NT 6.0; pl; rv:
Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729) Where:
Logs – Web crawlers – orfi.uwm.edu.pl
•IP:, Date: 13:56:13 28-11-2009, UserAgent:
Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html) Where: /~bagietka/int/
•Next pages visited by Googlebot:
•15:55:33 28-11-2009 /~bagietka/int/index.php
•16:04:59 28-11-2009 /~bagietka/int/index.php?content=flash
•16:10:08 28-11-2009 /~bagietka/int/index.php?content=contact
•16:20:25 28-11-2009 /~bagietka/int/index.php?content=calorie_table
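The time Googlebot spent between these page visits can be computed directly from the four timestamps listed above. This is a small sketch for checking such figures; the function name `gaps_in_seconds` is an assumption, and the calculation covers only the visits shown on this slide.

```python
from datetime import datetime

# Timestamps of the follow-up Googlebot visits, copied from the log above.
VISITS = [
    "15:55:33 28-11-2009",
    "16:04:59 28-11-2009",
    "16:10:08 28-11-2009",
    "16:20:25 28-11-2009",
]


def gaps_in_seconds(stamps):
    """Seconds elapsed between each pair of consecutive visits."""
    times = [datetime.strptime(s, "%H:%M:%S %d-%m-%Y") for s in stamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


gaps = gaps_in_seconds(VISITS)
mean_gap = sum(gaps) / len(gaps)
print("gaps in minutes:", [round(g / 60, 1) for g in gaps])
print("mean gap in minutes:", round(mean_gap / 60, 1))
```

The same function works on timestamps extracted from a full log file, which makes it easy to compare the pacing of different crawlers.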
Remarks – Web crawlers – orfi.uwm.edu.pl:
•visited a page that was forbidden in robots.txt
•went to the page "flash", which is invisible to people
•"jumping" between pages was slow: the bot read each
page for about 10 minutes on average
•did not follow the order in which the links appear, where
calorie_table comes first and then contact
•supposition: the contact page contains much more
Logs – Web crawlers – x10hosting.com
•IP:, Date: 14:30:03 25-11-2009, UserAgent: -;
•IP:, Date: 12:07:26 26-11-2009, UserAgent:
Custom Spider www.homepageseek.com/1.0;
•IP:, Date: 05:29:05 28-11-2009, UserAgent:
Mozilla/5.0 (compatible; Googlebot/2.1;
Remarks – Web crawlers –
•Some crawlers do not identify themselves by name
•Crawlers use more than one IP address
•So far no crawler has gone from one server to the
other, but we are still waiting…