Detect User-Agents: Cloak and Dagger for Web Sites - Part 2

by Scott Allen

“I’ve heard of User-Agents…”

In a previous post, I introduced you to User-Agents. Now let’s find out why you need to detect them, and how.

According to Wikipedia:

When Internet users visit a web site, a text string is generally sent to identify the user agent to the server. This forms part of the HTTP request, prefixed with User-agent: or User-Agent: and typically includes information such as the application name, version, host operating system, and language. Bots, such as web crawlers, often also include a URL and/or e-mail address so that the webmaster can contact the operator of the bot.
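
To make that description concrete, here is a minimal JavaScript sketch that pulls those pieces out of a User-Agent string. The function name parseUserAgent and the shape of the object it returns are my own assumptions for illustration, not a standard API. Real-world strings are messy (nested parentheses, for one, are not handled here), so treat this as a rough illustration rather than a complete parser.

```javascript
// Hypothetical helper: split a User-Agent string into "name/version" product
// tokens, parenthesized comments, and the first contact URL found in a comment.
function parseUserAgent(ua) {
  const products = [];
  const comments = [];
  // Each match is either a "(comment)" or a "name/version" (or bare "name") token.
  const tokenPattern = /\(([^)]*)\)|([^\s()\/]+)(?:\/([^\s()]+))?/g;
  let match;
  while ((match = tokenPattern.exec(ua)) !== null) {
    if (match[1] !== undefined) {
      comments.push(match[1]);
    } else {
      products.push({ name: match[2], version: match[3] || null });
    }
  }
  // Bots often put a contact URL inside a comment, prefixed with "+".
  const urlMatch = comments.join(" ").match(/https?:\/\/[^\s;)]+/);
  return { products, comments, contactUrl: urlMatch ? urlMatch[0] : null };
}

const parsed = parseUserAgent("CCBot/1.0 (+http://www.commoncrawl.org/bot.html)");
console.log(parsed.products[0].name);    // CCBot
console.log(parsed.products[0].version); // 1.0
console.log(parsed.contactUrl);          // http://www.commoncrawl.org/bot.html
```

A typical browser string yields several product tokens (Mozilla, AppleWebKit, and so on), with the operating system details in the first comment and no contact URL.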

4 Reasons Why You Need to Detect User-Agents

  1. Browsers Have Quirks - Every web page, no matter how strictly its XHTML validates, will render differently in some browsers. Sometimes a document needs minor adjustments to look uniform across all of them. Keep these adjustments to a minimum and code your documents according to best practices and standards first, but when a browser still doesn’t cooperate, you have to tweak.
  2. Personalize Content - It may be appropriate to provide different versions of your content depending on the type of browser or user-agent. For example, you may have specialized content such as podcasts, wallpaper, and downloads for mobile browsers and portable video game systems. Serving content appropriately gives each visitor the most relevant experience at your site. As long as the intent is not to deceive search engines, this is not considered cloaking.
  3. Keep Bad Visitors Out of Your Site - Do you often have bandwidth problems because unscrupulous visitors are downloading your entire site, or because devious webmasters are sending scraper bots to steal your data for use on their spammy sites? If so, use your .htaccess file to block them. Place the following lines at the beginning of your .htaccess file:

    # Bad User-Agent List :: BEGIN
    SetEnvIfNoCase User-Agent "Bad\ User\ Agent\ Here" bad_user_agent
    SetEnvIfNoCase User-Agent "BadUserAgentHere" bad_user_agent
    # Bad User-Agent List :: END

    Then place this near the end:

    <Files *>
    # Apache 2.2 access-control syntax; Apache 2.4 honors it only through mod_access_compat
    Order Allow,Deny
    Allow from all
    Deny from env=bad_user_agent
    </Files>

    Replace “Bad\ User\ Agent\ Here” and “BadUserAgentHere” with a key identifying phrase from the User-Agent string of the offending visitor(s). The phrase is treated as a regular expression, so place a backslash (\) before spaces and punctuation. To block more than one user agent, add one SetEnvIfNoCase line per phrase. For more info, visit the .Htaccess Reference. This will not block all scrapers, since a scraper can send any User-Agent string it likes, but it will eliminate quite a few.

  4. Guide Search Engine Spiders - Every search engine has a web robot, called a spider, that visits your web site. You can tell these spiders which parts of your site they may access with a robots.txt file, which addresses each spider by its User-Agent name. Well-behaved spiders follow it; it does nothing to stop the bad visitors from reason 3. As basic as this may be to some of you, you’d be surprised how many webmasters don’t use robots.txt correctly. If you need help creating a robots.txt file, visit the Robots.txt Generator.
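
If your site doesn’t run on Apache, the blocking idea from reason 3 can be sketched in application code instead. This JavaScript example mirrors what the SetEnvIfNoCase lines do: a case-insensitive match of identifying phrases against the User-Agent string. The names escapeRegExp and buildBadAgentMatcher are hypothetical helpers of my own, and the phrases are the same placeholders used in the .htaccess example, so substitute real ones from your logs.

```javascript
// Escape regex metacharacters so each phrase is matched literally
// (the same job the backslashes do by hand in the .htaccess version).
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a case-insensitive matcher from a list of phrases.
function buildBadAgentMatcher(phrases) {
  if (phrases.length === 0) {
    return () => false; // an empty list blocks nobody
  }
  const pattern = new RegExp(phrases.map(escapeRegExp).join("|"), "i");
  // A missing User-Agent is treated as an empty string, so it is not blocked here.
  return (userAgent) => pattern.test(userAgent || "");
}

// Placeholder phrases; replace them with phrases from the offending visitors.
const isBadUserAgent = buildBadAgentMatcher(["Bad User Agent Here", "BadUserAgentHere"]);

console.log(isBadUserAgent("Mozilla/5.0 (compatible; bad user agent here/2.1)")); // true
console.log(isBadUserAgent("Mozilla/5.0 (Windows NT 10.0) Firefox/121.0"));       // false
```

On a server you would call isBadUserAgent with the request’s User-Agent header and return a 403 response when it comes back true. As with the .htaccess version, this only catches visitors who identify themselves honestly.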

How Do I Detect My User-Agent?
That’s a great question. Here’s how to detect your User-Agent, in PHP, ASP, and JavaScript.

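PHP:
<?php
// The header is supplied by the client and may be missing, so default it and escape it.
$MyUserAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
echo 'Your User-Agent is: ' . htmlspecialchars($MyUserAgent, ENT_QUOTES, 'UTF-8');
?>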

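ASP:
<%@ Language=VBScript %>
<%
' The header is supplied by the client, so HTML-encode it before writing it to the page.
MyUserAgent = Request.ServerVariables("HTTP_USER_AGENT")
%>
Your User-Agent is: <%=Server.HTMLEncode(MyUserAgent)%>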

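JavaScript:
<script>
// Insert the string as plain text, so any markup in it is not interpreted.
const myUserAgent = navigator.userAgent;
const line = document.createTextNode('Your User-Agent is: ' + myUserAgent);
document.currentScript.before(line);
</script>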

Detailed Browser Detection:

  • Browscap PHP Project
    An excellent standalone class that you can install in minutes and use to detect the latest browsers easily.
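
When a full browser database is more than you need, for example when you only want to decide which version of your content to serve (reason 2), a rough classification can be sketched in a few lines of JavaScript. The function name classifyUserAgent, the category labels, and the keyword lists are all assumptions of mine for illustration. They are nowhere near exhaustive, and any user agent can lie about what it is.

```javascript
// Hypothetical helper: sort a User-Agent string into a coarse category
// using simple keyword heuristics (illustrative, not exhaustive).
function classifyUserAgent(ua) {
  if (!ua) return "unknown";
  // Check for bots first: many crawlers also imitate browser strings.
  if (/bot|crawl|spider|slurp/i.test(ua)) return "bot";
  if (/mobi|android|iphone|ipad|ipod/i.test(ua)) return "mobile";
  return "desktop";
}

console.log(classifyUserAgent("CCBot/1.0 (+http://www.commoncrawl.org/bot.html)")); // bot
console.log(classifyUserAgent(
  "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 " +
  "(KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1"
)); // mobile
console.log(classifyUserAgent(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0"
)); // desktop
```

The same function works in the browser with navigator.userAgent or on the server with the User-Agent request header.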
