Search Engine Crawler Bot Traffic Detection Tutorial

Search Engine Crawler Bot Traffic Detection Tutorial

Search Engine Crawler Bot Traffic Detection Tutorial

We came across this good precis of What is the aim of a search engine crawling bot?

A web crawler, spider, or search engine bot downloads and indexes content from all over the Internet. The goal of such a bot is to learn what (almost) every webpage on the web is about, so that the information can be retrieved when it’s needed. They’re called “web crawlers” because crawling is the technical term for automatically accessing a website and obtaining data via a software program.

As you might imagine, if your website is crawled by a search engine, such as Google, your website may need to cater for short periods of more intense interest. Perhaps similar for feeds. Perhaps similar for hacking “denial of service” attacks, alas.

Can we identify the first of those sources of traffic? Well, we did some research and development and got to this excellent PHP advice.

Why, for RJM Programming, do we want to identify the first of those sources of traffic? Well, we think the recent WordPress Blog TwentyTen theme 404.php background image work is asking a lot of our web server, and if there is a burst of traffic, we think it might be adversely affecting the website. Given the aims of a search engine crawling bot, above, we don’t think denying the bot those background images is a huge problem. And so, can we change 404.php to lessen that impost on our web server? We think so, as per …

<?php

function bot_detected() { // thanks to https://stackoverflow.com/questions/677419/how-to-detect-search-engine-bots-with-php
return (
isset($_SERVER['HTTP_USER_AGENT'])
&& preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
);
}

?>

used below

<?php

$uparts=explode("/", $_SERVER['REQUEST_URI']);
if (sizeof($uparts) >= 2) {
if (trim(explode('#',explode('?',$uparts[-1 + sizeof($uparts)])[0])[0]) == '') {
$ioff=-1;
}
if (1 == 1 || ('' . $_SERVER['QUERY_STRING']) == '') {
$usz=sizeof($uparts);
if (str_replace('?' . $_SERVER['QUERY_STRING'],'',trim($uparts[-1 + sizeof($uparts)])) == '') { $usz--; }
if ($usz == 3 && strpos($uparts[-1 + $usz], "%20") !== false || strpos($uparts[-1 + $usz], "+") !== false) { // fix /ITblog/Linux%20mailx%20Primer%20Tutorial/ 18/1/2022 RM
$oky=true;
if (substr(trim($uparts[$ioff - 1 + sizeof($uparts)]) . ' ',0,1) >= '0' && substr(trim($uparts[$ioff - 1 + sizeof($uparts)]) . ' ',0,1) <= '9') {
if (substr(trim($uparts[$ioff - 2 + sizeof($uparts)]) . ' ',0,1) >= '0' && substr(trim($uparts[$ioff - 2 + sizeof($uparts)]) . ' ',0,1) <= '9') {
$oky=false;
}
}
if ($oky) {
if (('' . $_SERVER['QUERY_STRING']) == '') {
header('Location: ' . str_replace('~``','/ITblog/',str_replace('/','',str_replace('/ITBLOG/','~``',str_replace('/itblog/','~``',str_replace('/ITblog/','~``',str_replace('--','-',str_replace('---','-',str_replace('+','-',str_replace('%20','-',$_SERVER['REQUEST_URI']))))))))));
} else {
header('Location: ' . explode('?',str_replace('~``','/ITblog/',str_replace('/','',str_replace('/ITBLOG/','~``',str_replace('/itblog/','~``',str_replace('/ITblog/','~``',str_replace('--','-',str_replace('---','-',str_replace('+','-',str_replace('%20','-',$_SERVER['REQUEST_URI']))))))))))[0] . '?' . $_SERVER['QUERY_STRING']);
}
exit;
}
}
}
if (str_replace("category","cat",strtolower($uparts[-2 + sizeof($uparts)])) == "cat" || strtolower($uparts[-2 + sizeof($uparts)]) == "category") {
$catsare=["","Not Categorised","Ajax","Android","Animation","Anything You Like","Code::Blocks","Colour Matching","Data Integration","Database","Delphi","Eclipse","eLearning","ESL","Event-Driven Programming","Games","GIMP","GUI","Hradware","Installers","iOS","Land Surveying","Moodle","Music Poll","NetBeans","Networking","News","ontop","OOP","Operating System","Photography","Projects","Signage Poll","Software","SpectroPhotometer","Tiki Wiki","Trips","Tutorials","Uncategorized","Visual Studio","Xcode"];
for ($ibh=1; $ibh<sizeof($catsare); $ibh++) {
if (explode("&",strtolower($uparts[-1 + sizeof($uparts)]))[0] == strtolower($catsare[$ibh])) {
if (strtolower($catsare[$ibh]) == "ontop") {
header('Location: https://www.rjmprogramming.com.au/ITblog/category/' . str_replace(" ","-",explode("&",strtolower($uparts[-1 + sizeof($uparts)]))[0])) . '#' . $ibh;
} else {
header('Location: https://www.rjmprogramming.com.au/ITblog/category/' . str_replace(" ","-",explode("&",strtolower($uparts[-1 + sizeof($uparts)]))[0])) . '#' . $ibh;
}
} else if (explode("&",strtolower($uparts[-1 + sizeof($uparts)]))[0] == ('' . $ibh)) {
if (strtolower($catsare[$ibh]) == "ontop") {
header('Location: https://www.rjmprogramming.com.au/ITblog/?cat=' . str_replace(" ","-",explode("&",strtolower($uparts[-1 + sizeof($uparts)]))[0])) . '#' . $ibh;
} else {
header('Location: https://www.rjmprogramming.com.au/ITblog/?cat=' . str_replace(" ","-",explode("&",strtolower($uparts[-1 + sizeof($uparts)]))[0])) . '#' . $ibh;
}
}
}
} else if (!bot_detected() && substr(trim($uparts[$ioff - 1 + sizeof($uparts)]) . ' ',0,1) >= '0' && substr(trim($uparts[$ioff - 1 + sizeof($uparts)]) . ' ',0,1) <= '9') {
if (substr(trim($uparts[$ioff - 2 + sizeof($uparts)]) . ' ',0,1) >= '0' && substr(trim($uparts[$ioff - 2 + sizeof($uparts)]) . ' ',0,1) <= '9') {
$uwidth=trim($uparts[$ioff - 2 + sizeof($uparts)]);
$uheight=trim(explode('#',explode('?',$uparts[$ioff - 1 + sizeof($uparts)])[0])[0]);
$imfnameafterdomainsep="random_background_fadeinout.jpg";
$ptitle="Random Background Webpage Fade Tutorial";
selectNewBlogPostingTutorialPicture();
$postingiurl=explode('ITblog' . DIRECTORY_SEPARATOR, dirname(__FILE__) . DIRECTORY_SEPARATOR)[0] . $imfnameafterdomainsep;
list($iwidth, $iheight, $itype, $iattr) = getimagesize($postingiurl);
$amime = getimagesize($postingiurl);
if ($ioff == 0) {
//header('Content-Type: image/jpeg');
echo "<html>" . $tonl . "<body" . $bonl . " style=\"background:linear-gradient(rgba(255,255,255,0.7),rgba(255,255,255,0.7)),Url('data:image/jpeg;base64," . base64_encode(createScaledImage($uwidth,$uheight,$postingiurl,true)) . "#" . str_replace('+','%20',urlencode($ptitle)) . "') 0px 30px no-repeat;background-size:contain;background-repeat:no-repeat;background-position:0px 30px;\"><pre>data:image/jpeg;base64," . base64_encode(createScaledImage($uwidth,$uheight,$postingiurl,true)) . "#" . str_replace('+','%20',urlencode($ptitle)) . "</pre><br><iframe id=preif style='display:none;width:100%;height:1200px;' src=''></iframe><br><img onclick=\"document.getElementsByTagName('pre')[0].click();\" id=moimg style='display:none;border-width: 28px;border-style: solid; border-image: linear-gradient(to right, lightblue, lightgreen) 1;' src='data:image/jpeg;base64," . base64_encode(createScaledImage($uwidth,$uheight,$postingiurl,true)) . "#" . str_replace('+','%20',urlencode($ptitle)) . "'></img></body></html>";
} else if (1 == 2) {
//header('Content-Type: image/jpeg');
echo '<img src="' . "data:image/jpeg;base64," . base64_encode(file_get_contents($postingiurl)) . "#" . str_replace('+','%20',urlencode($ptitle)) . '"></img>';
} else {
createScaledImage($uwidth,$uheight,$postingiurl,false); //imagecreatefromjpeg($postingiurl);
}
exit;
}
}
}

?>

… as a means of differentiating users “surfing the net” from “search engine crawling bot” web traffic to our website’s WordPress blog (affecting search and tag and category and month list and day list query URLs) you are reading.

If this was interesting you may be interested in this too.

This entry was posted in eLearning, Software, Tutorials and tagged , , , , , , , , , , , , , , , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>