There are many good reasons to use a robots.txt file, but an increasingly important one is to prevent unwanted visits from robots. If your logs show a lot of user-agents that you don’t recognize, you may be getting visits from crawlers that add no value to your site and simply consume bandwidth.
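One way to spot unfamiliar user-agents is to tally them straight from the access log. The following is only a rough sketch: it assumes the common Apache/Nginx "combined" log format, where the user-agent is the last quoted field on each line, and the function name and sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user-agent is the
# last double-quoted field on each line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping user-agent string -> number of requests."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines; in practice, pass an open log file instead.
sample_log = [
    '192.0.2.1 - - [10/Oct/2006:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.7 - - [10/Oct/2006:13:55:40 -0700] "GET /a HTTP/1.1" 200 512 "-" "UnknownCrawler/0.1"',
    '192.0.2.7 - - [10/Oct/2006:13:55:41 -0700] "GET /b HTTP/1.1" 200 512 "-" "UnknownCrawler/0.1"',
]

# Print the busiest user-agents first.
for agent, hits in count_user_agents(sample_log).most_common():
    print(hits, agent)
```

Agents near the top of the list that you cannot match to a search engine or service you care about are the candidates to look up before deciding whether to block them.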
A comprehensive list of robots can help educate you about which crawlers are out there and which may bring you the most value.
You can also find a list of robots commands and a robots.txt file generator.
As an example, the section below allows certain crawlers while shooing away others:
# For domain: http://www.domain.com
User-agent: Googlebot
Disallow:
User-agent: Googlebot-Image
Disallow:
User-agent: MSNBot
Disallow:
User-agent: Slurp
Disallow:
User-agent: Teoma
Disallow:
User-agent: Gigabot
Disallow:
User-agent: Scrubby
Disallow:
User-agent: Robozilla
Disallow:
User-agent: Nutch
Disallow:
User-agent: ia_archiver
Disallow:
User-agent: baiduspider
Disallow:
User-agent: yahoo-mmcrawler
Disallow:
User-agent: psbot
Disallow:
User-agent: asterias
Disallow:
User-agent: yahoo-blogs/v3.9
Disallow:
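# Shoo: every crawler not named above is blocked from the whole site
User-agent: *
Disallow: /
# The narrower rules below are already covered by "Disallow: /" above;
# they only matter if that line is removed.
Disallow: /cgi-bin/
# Disallow: /images/ - uncomment line with correct path for images
# File exclusions
Disallow: /dir/Privacy-Policy
Disallow: /dir/Security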
# Sitemap declaration
sitemap: http://www.domain.com/sitemap.xml
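To check how rules like these are likely to be read, Python's standard urllib.robotparser module can evaluate them locally. This is a minimal sketch, not a guarantee of how any particular crawler behaves: it uses a trimmed-down copy of the file (one named crawler plus the catch-all group), and "SomeOtherBot" and the helper name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Trimmed-down copy of the rules: one allowed crawler plus the catch-all.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
Disallow: /dir/Security
"""


def build_parser(text):
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
site = "http://www.domain.com"

# A named crawler with an empty Disallow may fetch everything.
print(parser.can_fetch("Googlebot", site + "/"))
print(parser.can_fetch("Googlebot", site + "/dir/Security"))
# "SomeOtherBot" is a made-up name that falls through to the "*" group.
print(parser.can_fetch("SomeOtherBot", site + "/"))
```

A crawler follows only the group that matches it, so the file exclusions listed under `User-agent: *` do not apply to the crawlers that have their own empty `Disallow:` group. If those paths should be hidden from the allowed crawlers too, the exclusions would need to be repeated in each of their groups.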

