robots.txt Setup for Search Engine Spiders

robots.txt is a simple text file that tells search engine (SE from here on) crawlers what not to read. This article explains how to write a robots.txt file for your website and why it is important.

A common misunderstanding among webmasters is that robots.txt tells spiders both what to read and what not to read. Wrong. It is a file that only tells spiders what not to read.

Reasons you want to have a robots.txt file on your website:

1. Let search engine crawlers read only the files you want indexed

When you don't want SE crawlers to index certain resources or files on your website, you tell them "Don't read these files". For example, if you don't want your images indexed by search engines, you say so in your robots.txt.

2. Save server resources

SE bots read files much as a browser does when you view a page. Every time an SE bot requests a file from the server, it consumes some server resources. Reading unnecessary files is a waste of those resources and slows the site down. If your site has many visitors using it simultaneously, this becomes a big issue.

3. Save Bandwidth 

Every time an SE bot requests a file, the server has to send it, just as it would to a browser, and that consumes bandwidth. If you have limited bandwidth, you can use your robots.txt to keep SE crawlers from reading unnecessary files.

4. Restrict robots for copyright reasons

You may want to exclude some SE spiders for copyright or other reasons. For example, http://www.picsearch.com will download your images and create thumbnail versions of them for people to search. Those thumbnails are saved on their web server. If you don't want spiders like this to access certain resources on your website, you can specify that in your robots.txt.

 

Creating robots.txt File: 

To create this file, just create a text file named "robots.txt". It must be uploaded to the root of your website, not to a subdirectory. For example, if your site is http://www.mySite.com, you should be able to access the robots.txt file at http://www.mySite.com/robots.txt.
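Because the file always lives at the root, its address can be derived from any page URL on the same site. The following is a minimal Python sketch of that idea; the function name robots_txt_url and the sample page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page somewhere below the site root.
print(robots_txt_url("http://www.mySite.com/products/list.html?page=2"))
```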

Now that you know what robots.txt is, let's see what to put in it and how. You can think of a robots.txt file as a set of rules for spiders. You can use the same rules for all spiders or specify different rules for certain spiders.

There are two major elements you include in your robots.txt file:

1. User-agent:

Specifies the name of the SE spider the rules apply to. To define rules for all spiders, use the wildcard character *

2. Disallow:

Tells the spiders what not to read. To tell a spider (a specific one, or all of them) not to read anything, put / to disallow everything. You can have multiple "Disallow:" lines.

Spaces after User-agent: and Disallow: are optional.

 

Examples: 

(1) 

User-agent: *
Disallow: /

* here means all spiders. / means all your directories and pages.

Together, the rule says: "All robots (indicated by "*") are instructed not to read any of your pages (indicated by "/")."
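One way to check a rule like this before uploading it is Python's standard urllib.robotparser module. This is only a sketch of how one well-behaved parser reads the rule; "ExampleBot" is a made-up crawler name, and real spiders may interpret files differently.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical crawler name used only for illustration.
blocked_home = not parser.can_fetch("ExampleBot", "/")
blocked_page = not parser.can_fetch("ExampleBot", "/about/team.html")
print(blocked_home, blocked_page)
```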

 (2)

User-agent: psbot
Disallow: /

This specifies a rule for one specific spider, picsearch's in the example above (its robot is named psbot). The rule says: "The picsearch robot should not read any content of the website".
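A short Python sketch, again using the standard urllib.robotparser module, shows that a rule naming one spider leaves the others unaffected. The image path and the name "ExampleBot" are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: psbot", "Disallow: /"])

# Hypothetical image path; "ExampleBot" stands in for any other crawler.
psbot_allowed = parser.can_fetch("psbot", "/images/logo.png")
other_allowed = parser.can_fetch("ExampleBot", "/images/logo.png")
print(psbot_allowed, other_allowed)
```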

(3) 

User-agent: *
Disallow: /scripts/
Disallow: /privatedir/
Disallow: /error/blank.html

The above disallows all search engine spiders from crawling the selected directories and pages.

This rule specifies that the directories "scripts" and "privatedir" and the file "error/blank.html" should not be read.

The leading / in /scripts/ means the location is relative to the root of the site.

The / at the end means all contents or files inside that directory.
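To see which paths these three Disallow lines cover, the rule can be tried against a few sample paths. The sketch below uses Python's standard urllib.robotparser; the sample paths and the crawler name "ExampleBot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /scripts/
Disallow: /privatedir/
Disallow: /error/blank.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The sample paths and the crawler name "ExampleBot" are hypothetical.
sample_paths = [
    "/scripts/run.cgi",
    "/privatedir/notes.txt",
    "/error/blank.html",
    "/error/404.html",
    "/index.html",
]
allowed = {path: parser.can_fetch("ExampleBot", path) for path in sample_paths}
for path, ok in allowed.items():
    print(f"{path}: {'allowed' if ok else 'blocked'}")
```

Only the first three paths should come out as blocked; other files under /error/ and the rest of the site stay readable.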

(4) 

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /scripts/
Disallow: /privatedir/ 

This is an example of targeting multiple spiders.

The above rules specify: "No spider is allowed to visit any part of the website, except Google, which is allowed to crawl the entire site except the directories scripts and privatedir". Although the first rule says that no spider should crawl anything, the later rule is the one that applies to Google, because it names Google's spider.

When you specify a rule that names a spider, that spider follows that rule instead of the general one.
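The precedence of a named group over the * group can be checked with a small Python sketch based on the standard urllib.robotparser module. The file names and "ExampleBot" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /scripts/
Disallow: /privatedir/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is matched by its own group, so the "*" group is ignored for it.
google_home = parser.can_fetch("Googlebot", "/index.html")
google_scripts = parser.can_fetch("Googlebot", "/scripts/a.js")
# "ExampleBot" (a hypothetical crawler) falls back to the "*" group.
other_home = parser.can_fetch("ExampleBot", "/index.html")
print(google_home, google_scripts, other_home)
```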

Remember to add the trailing slash ("/") if you are indicating a directory.

User-agent: *
Disallow: /private

If you add the rule as above, without the trailing slash, robots will be disallowed from accessing every path that begins with /private: not only the directory tree under /private/ but also, for example, /private.html and /privatedir/. In other words, there is an implied wildcard character following whatever you list in the Disallow line.
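The difference the trailing slash makes can be sketched in Python with the standard urllib.robotparser module, which applies the same prefix matching. The helper is_blocked, the sample paths, and the crawler name "ExampleBot" are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(disallow_value: str, path: str) -> bool:
    """Check path against a one-rule robots.txt that applies to all spiders."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    # "ExampleBot" is a hypothetical crawler name.
    return not parser.can_fetch("ExampleBot", path)


for rule in ("/private", "/private/"):
    for path in ("/private/report.html", "/private.html", "/privatedir/a.txt"):
        state = "blocked" if is_blocked(rule, path) else "allowed"
        print(f"Disallow: {rule:<10} {path:<22} {state}")
```

With /private all three paths are blocked; with /private/ only the file inside the directory is.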

 
Where Do You Get the Name of the Robots?

If you have a particular spider in mind which you want to block, you have to find out its name. To do this, the best way is to check out the website of the search engine. Search engines will usually have a page that gives you details on how you can prevent their spiders from accessing certain files or directories.

Common Mistakes in Robots.txt:

1. It's Not Guaranteed to Work

Although the robots.txt format is described in a document called "A Standard for Robots Exclusion", not all spiders and robots actually obey it. Listing something in your robots.txt is no guarantee that it will be excluded. If you really need to protect something, you should use a .htaccess file (if you are running your site on an Apache server) or the equivalent access control on other web servers.

2. Don't List Your Secret Directories

Anyone can access your robots file, not just robots. For example, typing http://www.WebCosmo.com/robots.txt into a browser will get you WebCosmo's robots.txt file. Some webmasters think they can list their secret directories in their robots.txt file to prevent those directories from being accessed. That is not the case. Listing a directory in a robots.txt file instead attracts attention to it. In fact, some spiders (like certain spammers' email harvesting robots) make it a point to check the robots.txt for excluded directories to spider.

3. Only One Directory/File per Disallow line

Do not put multiple directories on a single Disallow line. This will probably not work, since the Robots Exclusion Standard provides for only one path per Disallow line.
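If you generate the file from a script, it is easy to guarantee one path per line. The following is a minimal Python sketch; the function name build_robots_txt and its input format (a dict mapping each spider name to a list of paths) are assumptions, not part of any standard tool.

```python
def build_robots_txt(groups: dict) -> str:
    """Render {user_agent: [paths]} with exactly one path per Disallow line."""
    blocks = []
    for agent, paths in groups.items():
        lines = [f"User-agent: {agent}"]
        # An empty path list becomes a bare "Disallow:", which blocks nothing.
        lines += [f"Disallow: {path}" for path in paths] or ["Disallow:"]
        blocks.append("\n".join(lines))
    # Groups are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


print(build_robots_txt({"*": ["/scripts/", "/privatedir/"], "psbot": ["/"]}))
```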

4. Put a robots.txt file on your site anyway

Even if you want all your directories to be accessed by spiders, a simple robots file with the following may be useful:

User-agent: *
Disallow:
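
That an empty Disallow line blocks nothing can be confirmed with a short Python sketch using the standard urllib.robotparser module. The path and the crawler name "ExampleBot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow:"])

# An empty Disallow value excludes nothing, so every path stays readable.
everything_allowed = parser.can_fetch("ExampleBot", "/any/page.html")
print(everything_allowed)
```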
 
The Allow Field: 

Some crawlers, most notably Google's, now support an additional field called "Allow:". "Allow:" lets you explicitly specify which files and folders can be crawled. However, this field was not part of the original "robots.txt" protocol (it was only formalized later, in RFC 9309), so crawlers that follow only the original standard may not understand it. I think it's better not to rely on it, since it may confuse such crawlers.

The following is an example of using the Allow field:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
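
Python's standard urllib.robotparser module is one parser that understands Allow lines, so it can be used to sketch how such a file is read. This assumes the Googlebot group reads "Allow: /"; "ExampleBot" is a made-up crawler name, and other parsers may treat Allow differently.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own group, whose only rule allows every path.
google_allowed = parser.can_fetch("Googlebot", "/index.html")
# Any other crawler (hypothetical "ExampleBot") is covered by the "*" group.
other_allowed = parser.can_fetch("ExampleBot", "/index.html")
print(google_allowed, other_allowed)
```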

Posted by: Manik
Posted on: 10/8/2007 at 6:57 AM
Categories: Advertising / Marketing | Programming / Coding | SEO
