{"id":56097,"date":"2022-06-13T03:01:38","date_gmt":"2022-06-12T17:01:38","guid":{"rendered":"http:\/\/www.rjmprogramming.com.au\/ITblog\/?p=56097"},"modified":"2022-06-13T07:34:13","modified_gmt":"2022-06-12T21:34:13","slug":"search-engine-crawler-bot-traffic-detection-tutorial","status":"publish","type":"post","link":"https:\/\/www.rjmprogramming.com.au\/ITblog\/search-engine-crawler-bot-traffic-detection-tutorial\/","title":{"rendered":"Search Engine Crawler Bot Traffic Detection Tutorial"},"content":{"rendered":"<div style=\"width: 230px\" class=\"wp-caption alignnone\"><a target=_blank href=\"https:\/\/www.rjmprogramming.com.au\/ITblog\/http_user_agent.jpg\"><img decoding=\"async\" style=\"border: 15px solid pink;\" alt=\"Search Engine Crawler Bot Traffic Detection Tutorial\" src=\"https:\/\/www.rjmprogramming.com.au\/ITblog\/http_user_agent.jpg\" title=\"Search Engine Crawler Bot Traffic Detection Tutorial\"  style=\"float:left;\" \/><\/a><p class=\"wp-caption-text\">Search Engine Crawler Bot Traffic Detection Tutorial<\/p><\/div>\n<p>We came across this good precis of <a target=_blank title='what is the aim of a search engine crawling bot' href='https:\/\/www.google.com\/search?client=safari&#038;rls=en&#038;q=what+is+the+aim+of+a+search+engine+crawling+bot&#038;ie=UTF-8&#038;oe=UTF-8'>What is the aim of a search engine crawling bot<\/a>?<\/p>\n<blockquote cite='https:\/\/www.cloudflare.com\/en-au\/learning\/bots\/what-is-a-web-crawler\/'><p>\nA web crawler, spider, or search engine bot downloads and indexes content from all over the Internet. The goal of such a bot is to learn what (almost) every webpage on the web is about, so that the information can be retrieved when it&#8217;s needed. 
They&#8217;re called &#8220;web crawlers&#8221; because crawling is the technical term for automatically accessing a website and obtaining data via a software program.\n<\/p><\/blockquote>\n<p>As you might imagine, if your website is crawled by a search engine, such as <a target=_blank title='Google' href='https:\/\/google.com'>Google<\/a>, your web server may need to cope with short bursts of heavier traffic.   Feeds can cause similar bursts.  So, alas, can &#8220;denial of service&#8221; attacks.<\/p>\n<p>Can we identify the first of those sources of traffic?  Well, we did <a target=_blank title='How to detect search engine bots with php? (Google search)' href='https:\/\/www.google.com\/search?q=how+to+detect+search+engine+bots+with+php&#038;rlz=1C5CHFA_enAU973AU973&#038;oq=how+to+detect+search+engine+bots+with+php&#038;aqs=chrome..69i57j69i65.1000j0j7&#038;sourceid=chrome&#038;ie=UTF-8'>some research and development<\/a> and got to <a target=_blank title='Great advice' href='https:\/\/stackoverflow.com\/questions\/677419\/how-to-detect-search-engine-bots-with-php'>this excellent PHP advice<\/a>.<\/p>\n<p>Why, for RJM Programming, do we want to identify the first of those sources of traffic?  Well, we think the recent WordPress Blog TwentyTen theme <a target=_blank title='404.php work' href='https:\/\/www.rjmprogramming.com.au\/ITblog\/tag\/404.php'>404.php<\/a> background image work asks a lot of our web server and, during a burst of traffic, might adversely affect the website.  Given the aims of a search engine crawling bot, above, we don&#8217;t think denying the bot those background images is a huge problem.   And so, can we change 404.php to lessen that load on our web server?   
We think so, as per &#8230;<\/p>\n<p>&lt;?php<br \/>\n<code><br \/>\nfunction bot_detected() {   \/\/ thanks to https:\/\/stackoverflow.com\/questions\/677419\/how-to-detect-search-engine-bots-with-php<br \/>\n  return (<br \/>\n    isset($_SERVER['HTTP_USER_AGENT'])<br \/>\n    && preg_match('\/bot|crawl|slurp|spider|mediapartners\/i', $_SERVER['HTTP_USER_AGENT'])<br \/>\n  );<br \/>\n}<br \/>\n<\/code><br \/>\n?&gt;<\/p>\n<p> &#8230; <font color=blue>used below<\/font> &#8230;<\/p>\n<p>&lt;?php<br \/>\n<code><br \/>\n$uparts=explode(\"\/\", $_SERVER['REQUEST_URI']);<br \/>\nif (sizeof($uparts) &gt;= 2) {<br \/>\n  if (trim(explode('#',explode('?',$uparts[-1 + sizeof($uparts)])[0])[0]) == '') {<br \/>\n    $ioff=-1;<br \/>\n  }<br \/>\n  if (1 == 1 || ('' . $_SERVER['QUERY_STRING']) == '') {<br \/>\n    $usz=sizeof($uparts);<br \/>\n    if (str_replace('?' . $_SERVER['QUERY_STRING'],'',trim($uparts[-1 + sizeof($uparts)])) == '') { $usz--; }<br \/>\n    if ($usz == 3 && (strpos($uparts[-1 + $usz], \"%20\") !== false || strpos($uparts[-1 + $usz], \"+\") !== false)) { \/\/ fix \/ITblog\/Linux%20mailx%20Primer%20Tutorial\/ 18\/1\/2022 RM<br \/>\n     $oky=true;<br \/>\n     if (substr(trim($uparts[$ioff - 1 + sizeof($uparts)]) . ' ',0,1) &gt;= '0' && substr(trim($uparts[$ioff - 1 + sizeof($uparts)]) . ' ',0,1) &lt;= '9') {<br \/>\n     if (substr(trim($uparts[$ioff - 2 + sizeof($uparts)]) . ' ',0,1) &gt;= '0' && substr(trim($uparts[$ioff - 2 + sizeof($uparts)]) . ' ',0,1) &lt;= '9') {<br \/>\n     $oky=false;<br \/>\n     }<br \/>\n     }<br \/>\n     if ($oky) {<br \/>\n     if (('' . $_SERVER['QUERY_STRING']) == '') {<br \/>\n       header('Location: ' . 
str_replace('~``','\/ITblog\/',str_replace('\/','',str_replace('\/ITBLOG\/','~``',str_replace('\/itblog\/','~``',str_replace('\/ITblog\/','~``',str_replace('--','-',str_replace('---','-',str_replace('+','-',str_replace('%20','-',$_SERVER['REQUEST_URI']))))))))));<br \/>\n     } else {<br \/>\n       header('Location: ' . explode('?',str_replace('~``','\/ITblog\/',str_replace('\/','',str_replace('\/ITBLOG\/','~``',str_replace('\/itblog\/','~``',str_replace('\/ITblog\/','~``',str_replace('--','-',str_replace('---','-',str_replace('+','-',str_replace('%20','-',$_SERVER['REQUEST_URI']))))))))))[0] . '?' . $_SERVER['QUERY_STRING']);<br \/>\n     }<br \/>\n     exit;<br \/>\n     }<br \/>\n    }<br \/>\n  }<br \/>\n  if (str_replace(\"category\",\"cat\",strtolower($uparts[-2 + sizeof($uparts)])) == \"cat\" || strtolower($uparts[-2 + sizeof($uparts)]) == \"category\") {<br \/>\n    $catsare=[\"\",\"Not Categorised\",\"Ajax\",\"Android\",\"Animation\",\"Anything You Like\",\"Code::Blocks\",\"Colour Matching\",\"Data Integration\",\"Database\",\"Delphi\",\"Eclipse\",\"eLearning\",\"ESL\",\"Event-Driven Programming\",\"Games\",\"GIMP\",\"GUI\",\"Hardware\",\"Installers\",\"iOS\",\"Land Surveying\",\"Moodle\",\"Music Poll\",\"NetBeans\",\"Networking\",\"News\",\"ontop\",\"OOP\",\"Operating System\",\"Photography\",\"Projects\",\"Signage Poll\",\"Software\",\"SpectroPhotometer\",\"Tiki Wiki\",\"Trips\",\"Tutorials\",\"Uncategorized\",\"Visual Studio\",\"Xcode\"];<br \/>\n    for ($ibh=1; $ibh&lt;sizeof($catsare); $ibh++) {<br \/>\n      if (explode(\"&\",strtolower($uparts[-1 + sizeof($uparts)]))[0] == strtolower($catsare[$ibh])) {<br \/>\n        if (strtolower($catsare[$ibh]) == \"ontop\") {<br \/>\n          header('Location: https:\/\/www.rjmprogramming.com.au\/ITblog\/category\/' . str_replace(\" \",\"-\",explode(\"&\",strtolower($uparts[-1 + sizeof($uparts)]))[0]) . '#' . 
$ibh);<br \/>\n        } else {<br \/>\n          header('Location: https:\/\/www.rjmprogramming.com.au\/ITblog\/category\/' . str_replace(\" \",\"-\",explode(\"&\",strtolower($uparts[-1 + sizeof($uparts)]))[0]) . '#' . $ibh);<br \/>\n        }<br \/>\n      } else if (explode(\"&\",strtolower($uparts[-1 + sizeof($uparts)]))[0] == ('' . $ibh)) {<br \/>\n        if (strtolower($catsare[$ibh]) == \"ontop\") {<br \/>\n          header('Location: https:\/\/www.rjmprogramming.com.au\/ITblog\/?cat=' . str_replace(\" \",\"-\",explode(\"&\",strtolower($uparts[-1 + sizeof($uparts)]))[0]) . '#' . $ibh);<br \/>\n        } else {<br \/>\n          header('Location: https:\/\/www.rjmprogramming.com.au\/ITblog\/?cat=' . str_replace(\" \",\"-\",explode(\"&\",strtolower($uparts[-1 + sizeof($uparts)]))[0]) . '#' . $ibh);<br \/>\n        }<br \/>\n      }<br \/>\n    }<br \/>\n  } else if (<font color=blue>!bot_detected() && <\/font>substr(trim($uparts[$ioff - 1 + sizeof($uparts)]) . ' ',0,1) &gt;= '0' && substr(trim($uparts[$ioff - 1 + sizeof($uparts)]) . ' ',0,1) &lt;= '9') {<br \/>\n    if (substr(trim($uparts[$ioff - 2 + sizeof($uparts)]) . ' ',0,1) &gt;= '0' && substr(trim($uparts[$ioff - 2 + sizeof($uparts)]) . ' ',0,1) &lt;= '9') {<br \/>\n      $uwidth=trim($uparts[$ioff - 2 + sizeof($uparts)]);<br \/>\n      $uheight=trim(explode('#',explode('?',$uparts[$ioff - 1 + sizeof($uparts)])[0])[0]);<br \/>\n      $imfnameafterdomainsep=\"random_background_fadeinout.jpg\";<br \/>\n      $ptitle=\"Random Background Webpage Fade Tutorial\";<br \/>\n      selectNewBlogPostingTutorialPicture();<br \/>\n      $postingiurl=explode('ITblog' . DIRECTORY_SEPARATOR, dirname(__FILE__) . DIRECTORY_SEPARATOR)[0] . $imfnameafterdomainsep;<br \/>\n      list($iwidth, $iheight, $itype, $iattr) = getimagesize($postingiurl);<br \/>\n      $amime = getimagesize($postingiurl);<br \/>\n      if ($ioff == 0) {<br \/>\n      \/\/header('Content-Type: image\/jpeg');<br \/>\n      echo \"&lt;html&gt;\" . 
$tonl . \"&lt;body\" . $bonl . \" style=\\\"background:linear-gradient(rgba(255,255,255,0.7),rgba(255,255,255,0.7)),Url('data:image\/jpeg;base64,\" . base64_encode(createScaledImage($uwidth,$uheight,$postingiurl,true)) . \"#\" . str_replace('+','%20',urlencode($ptitle)) . \"') 0px 30px no-repeat;background-size:contain;background-repeat:no-repeat;background-position:0px 30px;\\\"&gt;&lt;pre&gt;data:image\/jpeg;base64,\" . base64_encode(createScaledImage($uwidth,$uheight,$postingiurl,true)) . \"#\" . str_replace('+','%20',urlencode($ptitle)) . \"&lt;\/pre&gt;&lt;br&gt;&lt;iframe id=preif style='display:none;width:100%;height:1200px;' src=''&gt;&lt;\/iframe&gt;&lt;br&gt;&lt;img onclick=\\\"document.getElementsByTagName('pre')[0].click();\\\" id=moimg style='display:none;border-width: 28px;border-style: solid; border-image: linear-gradient(to right, lightblue, lightgreen) 1;' src='data:image\/jpeg;base64,\" . base64_encode(createScaledImage($uwidth,$uheight,$postingiurl,true)) . \"#\" . str_replace('+','%20',urlencode($ptitle)) . \"'&gt;&lt;\/img&gt;&lt;\/body&gt;&lt;\/html&gt;\";<br \/>\n      } else if (1 == 2) {<br \/>\n      \/\/header('Content-Type: image\/jpeg');<br \/>\n      echo '&lt;img src=\"' . \"data:image\/jpeg;base64,\" . base64_encode(file_get_contents($postingiurl)) . \"#\" . str_replace('+','%20',urlencode($ptitle)) . 
'\"&gt;&lt;\/img&gt;';<br \/>\n      } else {<br \/>\n      createScaledImage($uwidth,$uheight,$postingiurl,false); \/\/imagecreatefromjpeg($postingiurl);<br \/>\n      }<br \/>\n      exit;<br \/>\n    }<br \/>\n  }<br \/>\n}<br \/>\n<\/code><br \/>\n?&gt; <\/p>\n<p> &#8230; as a means of differentiating users &#8220;surfing the net&#8221; from &#8220;search engine crawling bot&#8221; web traffic to the WordPress blog you are reading (affecting search, tag, category, month list and day list query URLs).<\/p>\n<p>If this was interesting you may be interested in <a title='Click here to see topics in which you might be interested' href='#d56097' onclick='var dv=document.getElementById(\"d56097\"); dv.innerHTML = \"&lt;iframe width=670 height=600 src=\" + \"https:\/\/www.rjmprogramming.com.au\/ITblog\/tag\/crawl\" + \"&gt;&lt;\/iframe&gt;\"; dv.style.display = \"block\";'>this<\/a> too.<\/p>\n<div id='d56097' style='display: none; border-left: 2px solid green; border-top: 2px solid green;'><\/div>\n","protected":false},"excerpt":{"rendered":"<p>We came across this good precis of What is the aim of a search engine crawling bot? A web crawler, spider, or search engine bot downloads and indexes content from all over the Internet. 
The goal of such a bot &hellip; <a href=\"https:\/\/www.rjmprogramming.com.au\/ITblog\/search-engine-crawler-bot-traffic-detection-tutorial\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12,33,37],"tags":[3994,3849,151,2121,3993,1980,513,2540,932,997,1121,1319,1324,1345,1349,3992,1411,1456],"class_list":["post-56097","post","type-post","status-publish","format-standard","hentry","category-elearning","category-software","category-tutorials","tag-_serverhttp_user_agent","tag-404-php","tag-blog","tag-bot","tag-burst","tag-crawl","tag-google","tag-load","tag-php","tag-programming","tag-serach-engine","tag-tutorial","tag-twentyten","tag-url","tag-user-agent","tag-web-crawler","tag-web-server","tag-wordpress"],"_links":{"self":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/56097"}],"collection":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/comments?post=56097"}],"version-history":[{"count":7,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/56097\/revisions"}],"predecessor-version":[{"id":56104,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/56097\/revisions\/56104"}],"wp:attachment":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/media?parent=56097"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/categories?post=56097"},{"taxonomy":"post_tag","embe
ddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/tags?post=56097"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}