robots.txt is a simple text file that tells search engine (SE from
here on) crawlers what not to read. This article explains how to write
a robots.txt file for your website and why it is important.
A common misunderstanding among webmasters is that robots.txt tells spiders both what to read and what not to read. Wrong. It is a file that only tells spiders what not to read.
Reasons you want to have a robots.txt file on your website:
1. Let search engine crawlers read only the files you want indexed
When you don't want SE crawlers to index certain resources or files on
your website, you tell them "Don't read these files". For example, if
you don't want your images to be indexed by SEs, you say so in your
robots.txt.
2. Save server resources
SE bots request files much as a browser does when you view a page.
Every time an SE bot requests a file from the server, it consumes some
server resources. Reading unnecessary files is a waste of those
resources and slows the site down. If your site has many visitors
using it simultaneously, this becomes a big issue.
3. Save bandwidth
Every time an SE bot requests a file, the server has to send it, just
as it does for a browser, and that consumes bandwidth. If you have
limited bandwidth, you can use your robots.txt to keep SE crawlers from
reading unnecessary files.
4. Restrict robots for copyright reasons
You may want to exclude some SE spiders for copyright or other
reasons. For example, http://www.picsearch.com will download your
images and create thumbnail versions of them for people to search.
Those thumbnails are saved on their web server. If you don't want
spiders like this to access certain resources of your website, you can
specify that in your robots.txt.
Creating robots.txt File:
To create this file, just create a text file named "robots.txt". The
file must be uploaded to the root of your website, not to any
subdirectory. For example, if your site is http://www.mySite.com you
should be able to access the robots.txt file at
http://www.mySite.com/robots.txt.
Now that you know what robots.txt is, let's see what to put in it and how. You can think of a robots.txt file as a set of rules for spiders. You can use the same rules for all spiders or specify different rules for certain spiders.
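Since spiders only look in the root, the robots.txt address can be worked out from any page address on the same site. Here is a minimal Python sketch of that idea; the helper name robots_url and the sample page path are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.mySite.com/blog/2010/post.html"))
# http://www.mySite.com/robots.txt
```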
There are two major elements you include in your robots.txt file:
1. User-agent:
Specifies the name of the SE spider the rules apply to. If you want to define rules for all spiders, use the wildcard character *
2. Disallow:
Tells the spiders what not to read. To tell a spider (a specific one, or all of them) not to read anything, put / to disallow everything. You can have multiple "Disallow:" lines.
Spaces after User-agent: and Disallow: are optional.
Examples:
(1)
User-agent: *
Disallow: /
Here * means all spiders, and / means all your directories and pages.
Together the rule says: all robots (indicated by "*") are instructed not to read any of your pages (indicated by "/").
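If you want to check a rule before uploading it, Python's standard urllib.robotparser module can read the lines and answer "may this spider fetch this URL?". The sketch below feeds it example (1); the helper load_rules, the spider name "AnyBot" and the page addresses are made up for illustration, and real crawlers may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules("""\
User-agent: *
Disallow: /
""")

# Every spider is refused every page.
print(rules.can_fetch("AnyBot", "http://www.mySite.com/"))            # False
print(rules.can_fetch("Googlebot", "http://www.mySite.com/a/b.html"))  # False
```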
(2)
User-agent: psbot
Disallow: /
This specifies a rule for a specific spider, picsearch's in the
example above. The rule says "the picsearch robot should not read any
content of the website".
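The same kind of check works for example (2). In this sketch (again using Python's urllib.robotparser, with a made-up helper load_rules and made-up page addresses) only psbot is turned away, because no rule is defined for anyone else.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules("""\
User-agent: psbot
Disallow: /
""")

print(rules.can_fetch("psbot", "http://www.mySite.com/images/logo.png"))      # False
print(rules.can_fetch("Googlebot", "http://www.mySite.com/images/logo.png"))  # True
```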
(3)
User-agent: *
Disallow: /scripts/
Disallow: /privatedir/
Disallow: /error/blank.html
The above disallows all search engine spiders from crawling the listed directories and pages: the directories "scripts" and "privatedir" and the file "error/blank.html" should not be read.
The / at the start of /scripts/ means the location is relative to the root of the site.
The / at the end means everything inside that directory.
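To see which pages example (3) actually blocks, you can run a few addresses through Python's urllib.robotparser. This is only a sketch: the helper load_rules, the spider name "AnyBot" and the file names under each directory are invented for the test.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules("""\
User-agent: *
Disallow: /scripts/
Disallow: /privatedir/
Disallow: /error/blank.html
""")

site = "http://www.mySite.com"
for path in ("/scripts/run.cgi", "/privatedir/notes.txt",
             "/error/blank.html", "/error/404.html", "/index.html"):
    verdict = "allowed" if rules.can_fetch("AnyBot", site + path) else "blocked"
    print(path, verdict)
```

Only the first three addresses come back as blocked; /error/404.html is still allowed because the rule names a single file, not the whole /error/ directory.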
(4)
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /scripts/
Disallow: /privatedir/
This is an example of targeting multiple spiders.
The above rules say: "No spider is allowed to visit any part of the
website, except Google, which may crawl the entire site apart from the
directories scripts and privatedir". Although the first rule says that
no spider should crawl anything, the later rule is the one that applies
to Google.
When you write rules under a specific spider's name, that spider follows those rules instead of the general ones.
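Example (4) is the easiest one to get wrong, so it is worth testing. The sketch below uses Python's urllib.robotparser to compare Googlebot with some other spider; the helper load_rules, the name "OtherBot" and the page addresses are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules("""\
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /scripts/
Disallow: /privatedir/
""")

site = "http://www.mySite.com"
# Googlebot follows its own rules, not the general "*" rules.
print(rules.can_fetch("Googlebot", site + "/index.html"))       # True
print(rules.can_fetch("Googlebot", site + "/scripts/run.cgi"))  # False
# Every other spider falls back to the "*" rules.
print(rules.can_fetch("OtherBot", site + "/index.html"))        # False
```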
Remember to add the trailing slash ("/") if you are indicating a directory.
User-agent: *
Disallow: /private
If you add the rule without the trailing slash, as above, robots are
disallowed from every URL whose path begins with /private. That covers
the directory tree under /private/, but also files such as
/private.html or /privatestuff/index.html. In other words, there is an
implied wildcard character following whatever you list in the Disallow
line.
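The difference the trailing slash makes is easy to see side by side. This sketch uses Python's urllib.robotparser, which matches a Disallow value as a prefix of the path; the helper names load_rules and blocked, the spider name "AnyBot" and all four paths are made up for the comparison.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def blocked(rules: RobotFileParser, paths: list[str]) -> list[str]:
    """Return the paths that a generic spider may not fetch."""
    return [p for p in paths if not rules.can_fetch("AnyBot", p)]


paths = ["/private/secret.html", "/private.html",
         "/privatestuff/index.html", "/public/index.html"]

no_slash = load_rules("User-agent: *\nDisallow: /private")
with_slash = load_rules("User-agent: *\nDisallow: /private/")

print(blocked(no_slash, paths))
# ['/private/secret.html', '/private.html', '/privatestuff/index.html']
print(blocked(with_slash, paths))
# ['/private/secret.html']
```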
Where Do You Get the Name of the Robots?
If you have a particular spider in mind that you want to block, you have to find out its name.
The best way to do this is to check the website of the search engine. Search
engines usually have a page that explains how you can prevent
their spiders from accessing certain files or directories.
Common Mistakes in Robots.txt:
1. It's Not Guaranteed to Work
Although the robots.txt format is described in a document called
"A Standard for Robots Exclusion", not all spiders and robots actually obey it.
Listing something in your robots.txt is no guarantee that it will be excluded.
If you really need to protect something, you should use a .htaccess file (if you are running
your site on an Apache server) or the equivalent access control on other web servers.
2. Don't List Your Secret Directories
Anyone can access your robots file, not just robots. For example,
typing http://www.WebCosmo.com/robots.txt will get you WebCosmo's
robots.txt file. Some webmasters think they can list their secret
directories in their robots.txt file to prevent those directories from
being accessed. That is not the case. Listing a directory in a
robots.txt file actually draws attention to it. In fact, some spiders
(like certain spammers' email harvesting robots) make a point of
checking the robots.txt for excluded directories to spider.
3. Only One Directory/File per Disallow line
Do not put multiple directories on a single Disallow line. This will probably not
work, since the Robots Exclusion Standard provides for only one directory
per Disallow statement.
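You can see the problem with a quick test. The sketch below shows how Python's urllib.robotparser happens to treat two directories on one line: it reads them as a single odd path that matches neither directory, so nothing is blocked. Other crawlers may react differently, which is exactly why you should not rely on it. The helper load_rules, the spider name "AnyBot" and the file names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Wrong: two directories squeezed onto one Disallow line.
combined = load_rules("User-agent: *\nDisallow: /scripts/ /privatedir/")
# Right: one directory per Disallow line.
separate = load_rules("User-agent: *\nDisallow: /scripts/\nDisallow: /privatedir/")

for path in ("/scripts/run.cgi", "/privatedir/notes.txt"):
    print(path,
          "combined:", combined.can_fetch("AnyBot", path),
          "separate:", separate.can_fetch("AnyBot", path))
# With this parser, "combined" lets both paths through (True),
# while "separate" blocks both (False).
```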
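4. Put a robots.txt file on your site anyway
Even if you want all your directories to be accessed by spiders, a simple robots file is still
useful, because it saves your server from logging a "404 Not Found" error each time a spider
asks for the file. It only needs the following: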
User-agent: *
Disallow:
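An empty Disallow line means "nothing is disallowed". If you want to convince yourself, this short sketch runs the two lines above through Python's urllib.robotparser; the helper load_rules, the spider name "AnyBot" and the page addresses are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules("""\
User-agent: *
Disallow:
""")

# Nothing is excluded, so every page may be fetched.
print(rules.can_fetch("AnyBot", "http://www.mySite.com/"))                  # True
print(rules.can_fetch("AnyBot", "http://www.mySite.com/scripts/run.cgi"))   # True
```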
The Allow Field:
Some crawlers now support an additional field called "Allow:", most
notably Google's. "Allow:" lets you explicitly specify which
files/folders can be crawled. However, this field was not part of the
original "robots.txt" protocol, so I think it's better not to use it where a Disallow rule will do, since it may confuse crawlers that do not understand it.
The following is an example of using the Allow field:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow:
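Python's urllib.robotparser is one parser that understands the Allow field, so it can be used to try the example above. Treat this as a sketch of how that one parser reads the rules, not as a promise about any particular search engine; the helper load_rules, the spider name "OtherBot" and the page address are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules("""\
User-agent: *
Disallow: /
User-agent: Googlebot
Allow:
""")

page = "http://www.mySite.com/index.html"
print(rules.can_fetch("Googlebot", page))  # True
print(rules.can_fetch("OtherBot", page))   # False
```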