Web Host Auditor - Independent Web Hosting Reviews

Constructing a robots.txt file

Introduction

When building a website, you may create an area that you don’t want to appear in search engines. That is, you don’t want a search engine’s spider to crawl those pages. People “ban” search engines from part of a website for a variety of reasons. Most of the time, the reason is the strain that spiders put on a web server. If you have a large website with a lot of dynamically generated content, it may be wise to restrict how deep a spider can crawl your website, or how fast it can do so.

 

The File

All the files viewable from your website are kept in your document root folder, usually public_html/. This is where you upload the robots.txt file once you’ve created it, so that it is reachable at yourdomain.com/robots.txt. Keep in mind that you should upload the file in ASCII mode (this option usually appears in your FTP program); otherwise some search engines may be unable to read it.
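ASCII mode in an FTP program mainly governs line endings, but stray non-ASCII characters, such as curly quotes pasted from a word processor, can also trip up a parser. As a rough sketch, the following Python check flags such lines before you upload; the function name non_ascii_lines and the sample text are made up for illustration.

```python
def non_ascii_lines(text):
    """Return (line_number, line) pairs for lines containing non-ASCII characters."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.isascii()  # str.isascii() requires Python 3.7 or later
    ]

# Hypothetical file content: line 2 contains a curly apostrophe (U+2019).
sample = "User-agent: *\nDisallow: /don\u2019t_go/\nCrawl-delay: 5\n"

print([number for number, _ in non_ascii_lines(sample)])  # prints [2]
```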

 

Construction

It might be helpful to start with a very small example of a robots.txt file.

 

User-agent: *
Disallow: /dont_go/
Crawl-delay: 5

Let’s break this down, line by line. As you can see, the basic format of these files is [rule]: options [line break].
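To make the [rule]: options format concrete, here is a small Python sketch that splits a robots.txt text into rule and option pairs. It is an illustration only, not how any particular search engine parses the file, and the helper name parse_rules is an assumption of this example.

```python
def parse_rules(text):
    """Split robots.txt text into (rule, value) pairs, skipping blanks and comments."""
    pairs = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop anything after a '#'
        if not line or ":" not in line:
            continue
        rule, value = line.split(":", 1)
        pairs.append((rule.strip().lower(), value.strip()))
    return pairs

example = """\
User-agent: *
Disallow: /dont_go/
Crawl-delay: 5
"""

print(parse_rules(example))
# prints [('user-agent', '*'), ('disallow', '/dont_go/'), ('crawl-delay', '5')]
```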

 

The first line selects which search engine spider the restrictions that follow apply to. Spiders identify themselves with the User-agent header, the same string that lets websites determine which web browser you’re using. If you have access to raw logs on your server, look for the column that contains this field and scroll down until you find an entry containing Googlebot. This is Google’s spider crawling your site for information.
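If scrolling through raw logs by hand is tedious, a short script can pick out a spider’s visits. The sketch below assumes the common “combined” log format, where the user agent is the last quoted field on each line; your server’s format may differ, and the two log lines are invented samples.

```python
import re

# Hypothetical log lines in the "combined" format; real logs may look different.
LOG_LINES = [
    '192.0.2.10 - - [01/Jan/2005:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.20 - - [01/Jan/2005:10:00:05 +0000] "GET /about.html HTTP/1.1" 200 1024 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

# Matches the last double-quoted field on a line (the user agent in this format).
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')

def lines_from_spider(lines, name):
    """Return the log lines whose user-agent field contains `name` (case-insensitive)."""
    hits = []
    for line in lines:
        match = USER_AGENT_FIELD.search(line)
        if match and name.lower() in match.group(1).lower():
            hits.append(line)
    return hits

print(len(lines_from_spider(LOG_LINES, "Googlebot")))  # prints 1
```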

 

Note: If you don’t see an entry from Google or any other search engine, I suggest you take a moment to submit your site for crawling.

 

There is a list of known spiders and their user-agent names at http://www.searchenginedictionary.com/spider-names.shtml

 

As the example above shows, the wildcard * is accepted in the user-agent entry. It lets you address all search engines at once instead of naming each one. But if you only wanted to block the spider whose user-agent is “Scooter” (which happens to be AltaVista’s spider), you could replace the first line with

 

User-agent: Scooter
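To see the effect of naming a single spider, you can test the rules locally with Python’s standard urllib.robotparser module. This is only a sketch: the module is one parser among many, real spiders may interpret the file differently, and example.com is a placeholder domain.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Scooter",
    "Disallow: /dont_go/",
])

url = "http://example.com/dont_go/page.html"  # placeholder URL
print(parser.can_fetch("Scooter", url))    # prints False
print(parser.can_fetch("Googlebot", url))  # prints True
```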

 

Underneath each user-agent line come all the rules that apply to that spider. The two rules covered here are “Disallow” and “Crawl-delay”. Disallow controls which parts of your website spiders are blocked from. In the first example, no robot that follows the robots.txt file could access the directory /dont_go/ (and, consequently, any file in that directory). You can have as many Disallow lines as you need.
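A quick way to check what a Disallow line blocks is to feed the rules to Python’s standard urllib.robotparser module and ask it about specific URLs. Treat this as a sketch of one parser’s reading of the file; “AnyBot” is a made-up spider name and example.com is a placeholder domain.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /dont_go/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Anything under /dont_go/ is blocked; the rest of the site stays open.
print(parser.can_fetch("AnyBot", "http://example.com/dont_go/page.html"))  # prints False
print(parser.can_fetch("AnyBot", "http://example.com/index.html"))         # prints True
```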

 

The last line of the example is the Crawl-delay rule. This is a fairly new extension that lets large websites be indexed without fear of slowing down the server. Because it is not part of the original robots.txt standard, not every spider honors it. The number after Crawl-delay is the number of seconds a spider should wait before retrieving another page from your website.
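As an illustration of what the delay means in practice, the sketch below reads the value with Python’s standard urllib.robotparser module (its crawl_delay method is available from Python 3.6) and works out when a polite spider would request each page. The page list and the spider name “AnyBot” are hypothetical.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 5"])

delay = parser.crawl_delay("AnyBot")  # seconds between requests, or None if unset
pages = ["/a.html", "/b.html", "/c.html"]  # hypothetical pages to fetch

# Offset in seconds from the start of the crawl at which each page is requested.
schedule = [(index * delay, page) for index, page in enumerate(pages)]
print(schedule)  # prints [(0, '/a.html'), (5, '/b.html'), (10, '/c.html')]
```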

 

One Final Example

For this example, two spiders are targeted by name: Googlebot and Fluffy the spider (from SearchHippo). Every other spider is addressed with a wildcard. Googlebot is asked to crawl no more than one page every five seconds, and Fluffy the spider is kept out of the /secret/ directory. All other spiders get a 100-second crawl-delay and may not visit any path that begins with /~.

 

User-agent: Googlebot
Crawl-delay: 5

User-agent: Fluffy the spider
Disallow: /secret/

User-agent: *
Crawl-delay: 100
Disallow: /~
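The three records can be checked against each other with Python’s standard urllib.robotparser module. This is a sketch of how that one parser reads the file, not a guarantee of how each spider behaves; “OtherBot” is a made-up name standing in for any spider not listed, and example.com is a placeholder domain. Note that a spider named in its own record follows only that record, so the wildcard rules do not apply to it.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Crawl-delay: 5

User-agent: Fluffy the spider
Disallow: /secret/

User-agent: *
Crawl-delay: 100
Disallow: /~
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://example.com"  # placeholder domain

print(parser.crawl_delay("Googlebot"))                               # prints 5
print(parser.can_fetch("Fluffy the spider", site + "/secret/x.html"))  # prints False
print(parser.crawl_delay("OtherBot"))                                # prints 100
print(parser.can_fetch("OtherBot", site + "/~bob/"))                 # prints False
print(parser.can_fetch("Googlebot", site + "/~bob/"))                # prints True
```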

 

Final Thoughts

One important thing to consider when writing a robots.txt file is that anyone can read it. If you don’t want a search engine crawling a directory that holds very sensitive material, listing it in robots.txt may not be a good idea, because everyone will be able to see the path simply by visiting yourdomain.com/robots.txt. (Look at www.whitehouse.gov/robots.txt for fun.)

 

Also, just because you went to the trouble of writing the file doesn’t mean that a search engine spider will choose to obey it. Most, if not all, large search engines obey the rules you write in your file, but a few spiders simply don’t care. If you find one, there are proper places to report it.

 


 

 

Copyright © 2005 WebHostAuditor.com