Greenguy and Jim Webmaster Tip - Weekly Tip #43

Tip #43: Understanding Bots
Have you ever wondered how Google and other search engines find all the sites they list? Yes, some of them are submitted by the authors, but most are found through automated processes. These processes are called robots, or 'bots. A robot will come to your site, parse through all the pages it can find, store the data in a database, and move on.
Why Control the Robots?
- Sometimes you don't want the robots just roaming anywhere they like on your site.
- You may have semi-private areas that are not password protected but that you still don't want scanned.
- Some areas of your Web site may contain programs or other non-content (like the cgi-bin directory) that don't need to be scanned.
- Or perhaps you just don't have a lot of bandwidth and don't want what you have wasted on a robot.
It's Easy to Communicate with Web 'Bots
The first thing a robot does when it comes to a new Web site is look for a file called "robots.txt" in the root of the Web server. If there is no such file, it assumes that robots are allowed anywhere they can find on the site.
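To make the "root of the Web server" rule concrete, here is a minimal Python sketch of how a robot could work out where to look for the file, given any page address on a site. The function name robots_txt_url and the example.com address are made up for illustration; only the standard urllib.parse module is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page deep inside a site.
print(robots_txt_url("http://www.example.com/private/page.html?x=1"))
# prints http://www.example.com/robots.txt
```

Whatever directory the page lives in, the robot only ever asks for the one file at the top level of the host.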
This file consists of two or more lines:
The name of the robot or user-agent that is not allowed on the site. Usually this is set to "*", meaning all robots:
User-agent: *
The area of the site that the agent is not allowed into. All files and sub-directories under that directory will be off limits to the robot. If there is more than one directory you want to disallow, repeat this line once for each directory.
Disallow: /private/
So, if you wanted to prevent all robots from going to any area of your site, your robots.txt file would read:
User-agent: *
Disallow: /
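If you want to check what a set of rules will do before uploading it, Python's standard urllib.robotparser module can read the lines and answer "may this robot fetch this page?". The sketch below is only an illustration: the helper name allowed, the robot name ExampleBot, the example.com addresses, and the extra /cgi-bin/ line are all assumptions, and real robots may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Two disallowed directories, one Disallow line each.
PARTIAL = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

# The whole site is off limits.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

site = "http://www.example.com"  # hypothetical site
print(allowed(PARTIAL, "ExampleBot", site + "/private/notes.html"))  # False
print(allowed(PARTIAL, "ExampleBot", site + "/index.html"))          # True
print(allowed(BLOCK_ALL, "ExampleBot", site + "/index.html"))        # False
```

Note that /private/notes.html is blocked even though the file only names the directory: a Disallow line covers everything whose path starts with it.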
If you want to prevent only a specific Web crawler from crawling your site, you need to list it by name in the User-agent line. For example, to prevent Google from spidering your site, you would write:
User-agent: Googlebot
Disallow: /
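A named User-agent line only affects the robot it names. As a rough check, the same standard-library parser can be asked about two different robots against the Googlebot rules above. OtherBot and the example.com address are hypothetical names used only for this sketch.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://www.example.com/index.html"  # hypothetical page
googlebot_ok = parser.can_fetch("Googlebot", page)
otherbot_ok = parser.can_fetch("OtherBot", page)

print(googlebot_ok)  # False: the rules name Googlebot
print(otherbot_ok)   # True: no rule applies to other robots
```

Because there is no "User-agent: *" section in this file, every robot other than Googlebot is treated as if no robots.txt existed.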
Some Important Things To Remember
The robots.txt filename is case-sensitive and must be all lowercase. If you create a file called Robots.txt or robots.TXT, the spiders may never find it and will ignore whatever it says.
The robots.txt file has to be in the root of your Web server. This means that if your pages live in a subdirectory of someone else's host rather than at the top level of their own site, you will need to ask your administrator to add your disallows to the robots.txt in the server's root directory.
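The original robots.txt convention has no way to "allow" a spider; you can only disallow. So if you have one page in a group of 160 others that you want spidered, you should move it out of the disallowed directory. Or, you can explicitly name every file you want disallowed. Some newer crawlers, including Googlebot, also understand an Allow line, but you can't count on every robot honoring it.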
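Naming every disallowed file by hand gets tedious, so that list can be generated instead. The following is a minimal sketch under made-up assumptions: the helper disallow_all_but, the /private/ directory, and the three file names are hypothetical, and in practice you would feed in your real file list.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def disallow_all_but(directory: str, filenames: Iterable[str], keep: str) -> str:
    """Build robots.txt text that disallows every file except keep."""
    lines = ["User-agent: *"]
    for name in filenames:
        if name != keep:
            lines.append(f"Disallow: {directory}{name}")
    return "\n".join(lines) + "\n"


# Hypothetical directory contents; b.html is the page we want spidered.
files = ["a.html", "b.html", "c.html"]
robots_txt = disallow_all_but("/private/", files, keep="b.html")
print(robots_txt)

# Sanity-check the result with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
site = "http://www.example.com"  # hypothetical site
print(parser.can_fetch("ExampleBot", site + "/private/a.html"))  # False
print(parser.can_fetch("ExampleBot", site + "/private/b.html"))  # True
```

The catch with this approach is upkeep: any file added to the directory later is open to robots until the list is regenerated, which is why moving the one public page elsewhere is usually simpler.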
Thank You For Your Support
Greenguy & Jim
Check out GreenguyandJim.com for even more webmaster tips, tricks, resources & support - and don't forget to come back to Klixxx.com every Tuesday for another exclusive webmaster tip from GreenguyandJim.com!