Using a robots.txt file

Geeklog.net
  • Saturday, October 02 2004 @ 10:45 AM EDT
  • Contributed by:
  • Views:
    20,495
Looking through the server logs of a low-traffic Geeklog site, I couldn't help noticing that Googlebot and its colleagues had been busy there indexing each and every variation of Geeklog's submission forms, e.g.
/submit.php?type=event&mode=&month=08&day=07&year=2004&hour=2
/submit.php?type=event&mode=&month=06&day=17&year=2004&hour=8
etc.

/comment.php?sid=20020513230754519&pid=0&type=article
/comment.php?sid=20020427185655276&pid=0&type=article
etc.

Obviously, it doesn't make a lot of sense to index these particular pages, or the submission forms in general.
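To get a feel for how many of these variations a spider has requested, the logged paths can be grouped by script with the query string ignored. The following is only a minimal Python sketch: it uses the four sample requests quoted above as its input, and the helper name `count_by_script` is made up for illustration. Reading the paths out of a real server log is left out because it depends on the log format.

```python
from collections import Counter
from urllib.parse import urlsplit

# Request paths copied from the log excerpt above.
logged_requests = [
    "/submit.php?type=event&mode=&month=08&day=07&year=2004&hour=2",
    "/submit.php?type=event&mode=&month=06&day=17&year=2004&hour=8",
    "/comment.php?sid=20020513230754519&pid=0&type=article",
    "/comment.php?sid=20020427185655276&pid=0&type=article",
]


def count_by_script(paths):
    """Count requests per script, ignoring everything after the '?'."""
    return Counter(urlsplit(path).path for path in paths)


counts = count_by_script(logged_requests)
for script, hits in counts.most_common():
    print(script, hits)
```

Scripts that show up with many hits but only differ in their query string are good candidates for exclusion.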

There's an easy way to prevent this: Create a robots.txt file.

Every decent search engine will look for a robots.txt file before it starts indexing your site to see if there are any files or directories it shouldn't include in its index. So here's how to tell the search engine spiders to leave comment.php and submit.php alone:

  User-agent: *
  Disallow: /comment.php
  Disallow: /submit.php
  Disallow: /forum/createtopic.php

Put these lines in a simple text file, name it "robots.txt" and upload it to the root of your site, i.e. usually where Geeklog's index.php file resides.

The "User-agent" line makes it possible to set rules for specific spiders. We use a '*' so that the rules apply to all of them. The next three lines tell the spiders not to index these particular files (the third one is for the forum, so you don't need it if you don't have the forum plugin installed).
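One way to check what these rules do before uploading the file is to feed them to a robots.txt parser. The sketch below uses Python's standard `urllib.robotparser` module. The spider name "ExampleBot" and the helper `load_rules` are made up for illustration, and real spiders may interpret the file slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from the article, exactly as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /comment.php
Disallow: /submit.php
Disallow: /forum/createtopic.php
"""


def load_rules(text):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is a hypothetical spider; the '*' entry applies to it.
for path in [
    "/index.php",
    "/comment.php?sid=20020513230754519&pid=0&type=article",
    "/submit.php?type=event&mode=&month=08&day=07&year=2004&hour=2",
    "/forum/createtopic.php",
]:
    verdict = "allowed" if rules.can_fetch("ExampleBot", path) else "blocked"
    print(path, verdict)
```

Because the rules are matched as path prefixes, every query-string variation of comment.php and submit.php is covered by a single Disallow line each.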

More information about the robots.txt file can be found on robotstxt.org. There's also a robots.txt validator to ensure your robots.txt doesn't have any syntax errors.

It's also worth thinking about adding other files or even directories there as well. For example, does it make sense to index the search form? Probably not. But then again, if you have a lot of links to specific search results, you may want those to be indexed.
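The trade-off with the search form can be seen in a small Python sketch using the standard `urllib.robotparser` module. It assumes the search script is called /search.php and takes a `query` parameter; both names, like the spider name "ExampleBot", are assumptions for illustration. Since a Disallow rule matches by path prefix, blocking the form also blocks every linked search result.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set that excludes the search script.
SEARCH_RULES = """\
User-agent: *
Disallow: /search.php
"""

search_parser = RobotFileParser()
search_parser.parse(SEARCH_RULES.splitlines())

# The empty form and a specific result page share the same path prefix,
# so both end up blocked; other pages are unaffected.
form_allowed = search_parser.can_fetch("ExampleBot", "/search.php")
result_allowed = search_parser.can_fetch("ExampleBot", "/search.php?query=robots")
index_allowed = search_parser.can_fetch("ExampleBot", "/index.php")

print("search form:", form_allowed)
print("search result:", result_allowed)
print("index page:", index_allowed)
```

So if links to specific search results should stay indexable, the simplest option under plain prefix matching is to leave the search script out of robots.txt.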

Another benefit of using a robots.txt, apart from avoiding unnecessary traffic, is that your site is harder for comment spammers to find (they are known to search for the key phrases that can be found on the comment submission form).

The following comments are owned by whomever posted them. This site is not responsible for what they say.

  • Using a robots.txt file
  • Authored by:ScurvyDawg on Saturday, October 02 2004 @ 03:10 PM EDT
Great post Dirk.
  • Using a robots.txt file
  • Authored by:Marites on Saturday, October 02 2004 @ 05:59 PM EDT
A lot of interesting material. Have you run the checker on Geeklog.net or any other G/L site? My own sites, like geeklog.net, give an error on every line; surely all our coding can't be that bad.

I have a degree in computer science and, to be honest, cannot see that many errors in the PHP > HTML generated code.

-Marites-
  • Using a robots.txt file
  • Authored by:Marites on Saturday, October 02 2004 @ 06:09 PM EDT
I forgot that this particular validator needs URL/robots.txt, not the actual site URL. I will look up the URL of the one I normally use, which accepts the site URL and seeks out the robots.txt itself. Sorry about the previous post.

Marites
  • Using a robots.txt file
  • Authored by:JohnVanVliet on Sunday, October 03 2004 @ 04:44 AM EDT
Hi, last month I had to ban 2 IPs because the MSN bot ignored my robots.txt and ate up 3.5 gig (yes, GIG) of bandwidth.
  • Using a robots.txt file
  • Authored by:NeoNecro on Sunday, October 03 2004 @ 02:48 PM EDT
Wow, what do you have on your site?
In two months' time Googlebot only used 25 MB of my bandwidth, and my site is quite active and has over 200 articles.
Your site has to be gigantic.

---
webside.info
  • Using a robots.txt file
  • Authored by:JohnVanVliet on Monday, October 04 2004 @ 02:32 AM EDT
Yes, it is BIG, approx. 4 gig. It is for maps of the solar system for Celestia (a 3D space sim).
  • Using a robots.txt file
  • Authored by:Dirk on Monday, October 04 2004 @ 04:23 PM EDT
I had a similar problem with Googlebot last year, when it downloaded one particular file over 1000 times for no apparent reason. Fortunately, it was a small file and it did abide by the robots.txt that I set up ...

bye, Dirk

  • Using a robots.txt file
  • Authored by:malamute on Monday, December 05 2005 @ 01:26 AM EST
Well, as far as Google goes, I have blocked the whole range of Google's bots.
