When building a website, you may create an area that you don’t want to appear in search engine results. That is, you don’t want search engine spiders to crawl those pages. People choose to “ban” search engines from part of their website for a variety of reasons. Most of the time, the reason is the strain that search engine spiders put on a web server. If you have a large website with a lot of dynamically generated content, it might be wise to restrict how deep a spider can crawl your website, or how fast it can do so.
The File
All the files viewable from your website are kept in your document root folder, usually public_html/. This folder is where you will upload the robots.txt file once you’ve created it. One very important thing to keep in mind is that you must upload this file in ASCII mode (this option usually appears in your FTP program); otherwise some search engines will be unable to read it.
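As a rough sketch of the ASCII requirement, the Python below writes a rules file as plain ASCII bytes and defines an upload helper that uses FTP's ASCII (line-based) transfer. The host, user name, password, and the helper names are hypothetical placeholders, and the remote folder name public_html is only the common default mentioned above; your host may differ.

```python
import tempfile
from ftplib import FTP
from pathlib import Path

# Example rules; plain ASCII only (straight quotes, no "smart" punctuation).
RULES = "User-agent: *\nDisallow: /dont_go/\nCrawl-delay: 5\n"


def write_robots_txt(path, rules):
    """Write the rules as ASCII bytes; raises UnicodeEncodeError on non-ASCII text."""
    data = rules.encode("ascii")
    Path(path).write_bytes(data)
    return data


def upload_ascii(host, user, password, local_path, remote_dir="public_html"):
    """Upload in ASCII mode. All arguments are placeholders for your own account."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        with open(local_path, "rb") as fh:
            # storlines() transfers the file line by line in ASCII mode.
            ftp.storlines("STOR robots.txt", fh)


# Local demonstration only: write the file to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "robots.txt"
    written = write_robots_txt(target, RULES)
    round_trip = target.read_bytes()

print(all(byte < 128 for byte in round_trip))
```

The upload helper is defined but not called here, since it needs a real server. Encoding to ASCII before writing is a cheap way to catch stray characters, such as a curly apostrophe pasted from a word processor, before a spider ever sees the file.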
Construction
It might be helpful to start with a very small example of a robots.txt file.
User-agent: *
Disallow: /dont_go/
Crawl-delay: 5
Let’s break this down, line by line. As you can see, the basic format of these files is [rule]: options [line break].
The first line picks out which search engine the following restrictions apply to. Search engine spiders identify themselves with the User-agent header. It’s the same string that allows websites to determine what web browser you’re using. If you have access to the raw logs on your server, look for the column that contains this record and scroll down until you find an entry containing Googlebot. This is Google’s spider crawling your site for information.
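If scrolling through raw logs by hand is tedious, a short script can do the looking for you. The sketch below assumes your server writes the common “combined” log format, where the user-agent is the last quoted field on each line; the two sample lines and the function name are made up for illustration.

```python
import re

# Made-up sample lines in the "combined" access-log format (an assumption;
# check how your own server formats its logs).
SAMPLE_LOG = [
    '198.51.100.7 - - [10/Oct/2005:13:55:36 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [10/Oct/2005:13:56:01 -0700] "GET /about.html HTTP/1.1" '
    '200 1043 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/1.0"',
]

# The user-agent is the last double-quoted field on the line.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def find_spider_hits(lines, spider_name):
    """Return the log lines whose user-agent field mentions spider_name."""
    hits = []
    for line in lines:
        match = USER_AGENT_FIELD.search(line)
        if match and spider_name.lower() in match.group(1).lower():
            hits.append(line)
    return hits


googlebot_hits = find_spider_hits(SAMPLE_LOG, "Googlebot")
print(len(googlebot_hits), "request(s) from Googlebot")
```

To run it against a real log, pass the open file instead of SAMPLE_LOG, since a file object yields its lines one at a time.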
Note: If you don’t see an entry from Google or any other search engine, I suggest you take a moment to submit your site for crawling.
As you can see from the example above, the wildcard * is fine for the user-agent entry. This allows you to address all search engines at once instead of having to name each one. But if you only wanted to block a search engine spider with the user-agent listed as “Scooter” (which happens to be AltaVista’s spider), you could replace that first line with:
User-agent: Scooter
Underneath each user-agent specifier are all the rules that apply to that spider. The two rules used here are “Disallow” and “Crawl-delay”. Disallow controls which parts of your website spiders are blocked from. In my previous example, no robot that follows the robots.txt file could access the directory /dont_go/ (or, consequently, any file in that directory). You can have as many Disallow lines as you need.
The last line of the example is the Crawl-delay directive. This is a fairly new development, and it allows large websites to be indexed without fear of slowing down server performance. The number after Crawl-delay is the number of seconds that a spider will wait before retrieving another page from your website. Because it is an extension to the original robots.txt format, not every spider honors it.
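One way to check that a file says what you think it says is to feed it to a parser. The sketch below uses Python’s standard urllib.robotparser module on the small example from this section; “ExampleBot” is a made-up spider name, and the two page paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The small example file from this section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dont_go/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical spider; the wildcard group applies to it.
blocked = parser.can_fetch("ExampleBot", "/dont_go/page.html")
allowed = parser.can_fetch("ExampleBot", "/index.html")
delay = parser.crawl_delay("ExampleBot")

print("may fetch /dont_go/page.html:", blocked)
print("may fetch /index.html:", allowed)
print("seconds between requests:", delay)
```

This only shows how one particular parser reads the rules. A real spider may interpret edge cases differently, so treat it as a sanity check rather than a guarantee.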
One Final Example
For this example, two spiders will be targeted: Googlebot and Fluffy the spider (from SearchHippo). Every other spider will be addressed with a wildcard. Googlebot will be unable to crawl more than one page every five seconds, and Fluffy the spider will be unable to go into the /secret/ directory. All other spiders will have a 100-second crawl delay and will not be able to visit any directory that begins with ~.
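A file matching that description might look like the text embedded in the sketch below, which then checks each rule with Python’s standard urllib.robotparser. The user-agent token “fluffy” is an assumption about how that spider identifies itself, “OtherBot” is a made-up name standing in for every other spider, and the page paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# One group per spider, separated by blank lines. The token "fluffy" is an
# assumed user-agent name, not one confirmed by the spider's operator.
ROBOTS_TXT = """\
User-agent: Googlebot
Crawl-delay: 5

User-agent: fluffy
Disallow: /secret/

User-agent: *
Crawl-delay: 100
Disallow: /~
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot: slowed down, but free to go anywhere.
google_delay = parser.crawl_delay("Googlebot")
google_secret = parser.can_fetch("Googlebot", "/secret/file.html")

# Fluffy: kept out of /secret/, with no delay set in its own group.
fluffy_secret = parser.can_fetch("fluffy", "/secret/file.html")
fluffy_delay = parser.crawl_delay("fluffy")

# Everyone else ("OtherBot" is a made-up name): long delay, no ~ directories.
other_delay = parser.crawl_delay("OtherBot")
other_home = parser.can_fetch("OtherBot", "/~alice/index.html")
other_secret = parser.can_fetch("OtherBot", "/secret/file.html")

print(google_delay, google_secret)
print(fluffy_secret, fluffy_delay)
print(other_delay, other_home, other_secret)
```

Notice that a spider with its own group follows only that group. In this sketch Googlebot and Fluffy do not inherit the wildcard rules, so if you want them kept out of the ~ directories as well, repeat that Disallow line in their groups.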
One important thing to consider when writing a robots.txt file is that anyone can read it. If you don’t want a search engine crawling a directory that holds very sensitive material, it might not be a good idea to list that directory in your robots.txt file, because everyone will be able to see it simply by visiting yourdomain.com/robots.txt. (Look at www.whitehouse.gov/robots.txt for fun.)
Also, just because you went to the trouble of writing the file doesn’t necessarily mean that a search engine spider will choose to obey it. Most, if not all, large search engines do obey the rules you write in your file. Only a few spiders just don’t care. If you find one, there are proper places to report it.
Web Host Auditor is an independent web hosting review web site. We invite you to send us a review of your current or past hosting company to help us in our rankings of hosts on this site.