
Using a robots.txt file

  • Saturday, October 02 2004 @ 10:45 am EDT
Looking through the server logs of a low-traffic Geeklog site, I couldn't help noticing that Googlebot and its colleagues had been busy indexing each and every variation of Geeklog's submission forms, e.g.
/submit.php?type=event&mode=&month=08&day=07&year=2004&hour=2
/submit.php?type=event&mode=&month=06&day=17&year=2004&hour=8
etc.

/comment.php?sid=20020513230754519&pid=0&type=article
/comment.php?sid=20020427185655276&pid=0&type=article
etc.

Obviously, it doesn't make a lot of sense to index these particular pages, or the submission forms in general.
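To get a feel for how much of a spider's activity goes to these forms, the requested paths can be grouped by script name. The following is only a minimal sketch: it assumes the paths have already been pulled out of the server log (the log format is not shown here), and `count_by_script` is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_script(request_paths):
    """Count requests per script, ignoring the query string."""
    return Counter(urlsplit(path).path for path in request_paths)


# Paths as they appeared in the log excerpt above.
paths = [
    "/submit.php?type=event&mode=&month=08&day=07&year=2004&hour=2",
    "/submit.php?type=event&mode=&month=06&day=17&year=2004&hour=8",
    "/comment.php?sid=20020513230754519&pid=0&type=article",
    "/comment.php?sid=20020427185655276&pid=0&type=article",
]

for script, hits in count_by_script(paths).most_common():
    print(script, hits)
```

Scripts that show up with many hits but only differ in their query strings are good candidates for a robots.txt rule.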

There's an easy way to prevent this: Create a robots.txt file.

Every decent search engine will look for a robots.txt file before it starts indexing your site to see if there are any files or directories it shouldn't include in its index. So here's how to tell the search engine spiders to leave comment.php and submit.php alone:

  User-agent: *
  Disallow: /comment.php
  Disallow: /submit.php
  Disallow: /forum/createtopic.php

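Put these lines in a simple text file, name it "robots.txt" and upload it to the root of your site, i.e. usually where Geeklog's index.php file resides. Spiders only request /robots.txt at the top level of the host, so if Geeklog is installed in a subdirectory, the file still belongs in the web root and the paths in it need that subdirectory as a prefix.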

With the "User-agent" line it's possible to set rules for certain spiders. We use a '*' so that the rules apply to all of them. The next three lines tell the spiders that they should not request these particular files (the third one is for the forum, so you don't need it if you don't have the forum plugin installed). Each rule is matched as a prefix of the URL path, so "Disallow: /comment.php" covers every URL that starts with /comment.php, whatever the query string.
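If you'd like to check the rules before uploading the file, Python's standard `urllib.robotparser` module can evaluate them locally. This is a rough sketch rather than a full validator: `is_allowed` is a hypothetical helper, and the parser only approximates what a real spider does.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /comment.php
Disallow: /submit.php
Disallow: /forum/createtopic.php
"""


def is_allowed(path, agent="Googlebot"):
    """Return True if the given agent may fetch the path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


for path in (
    "/index.php",
    "/comment.php?sid=20020513230754519&pid=0&type=article",
    "/submit.php?type=event&mode=&month=08&day=07&year=2004&hour=2",
):
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Because the rules sit under "User-agent: *", the result is the same for any agent name you pass in.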

More information about the robots.txt file can be found on robotstxt.org. There's also a robots.txt validator to ensure your robots.txt doesn't have any syntax errors.

It's also worth thinking about adding other files or even directories there as well. For example, does it make sense to index the search form? Probably not. But then again, if you have a lot of links to specific search results, you may want those to be indexed.
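The trade-off for the search form comes from how rules are matched: a rule for the search script blocks the empty form and every search result URL alike. Here is a small sketch of that, assuming the search script lives at /search.php; the query parameter in the example URL is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = [
    "User-agent: *",
    "Disallow: /search.php",  # assumed location of the search script
]

parser = RobotFileParser()
parser.parse(RULES)

# Hypothetical URLs: the bare form and a link to a specific search result.
search_form = "/search.php"
search_result = "/search.php?query=robots"

for url in (search_form, search_result):
    print(url, "allowed" if parser.can_fetch("*", url) else "blocked")
```

Both URLs come out as blocked, so excluding the form also gives up any indexed links to specific search results.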

Another benefit of using a robots.txt file, apart from avoiding unnecessary traffic, is that your site is harder to find for comment spammers (who are known to search for the key phrases that can be found on the comment submission form).