There are many good reasons to use a robots.txt file, and an increasingly important one is keeping unwanted robots away (at least those that honor the file). If your logs show a lot of user-agents that you don’t recognize, you may be getting visits from crawlers that add no value to your site and simply consume bandwidth.
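One rough way to see which user-agents are hitting your site is to tally them from the access log. The sketch below assumes the common "combined" log format, where the user-agent is the last quoted field on each line. The sample lines, IP addresses, and the name UnknownScraper are made up for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user-agent is the
# last double-quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Tally requests per user-agent string."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines; in practice, pass an open log file instead,
# e.g. count_user_agents(open("access.log")).
sample = [
    '192.0.2.10 - - [10/Oct/2008:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 "-" "Googlebot/2.1"',
    '192.0.2.11 - - [10/Oct/2008:13:56:01 -0700] "GET /a HTTP/1.1" 200 512 "-" "UnknownScraper/0.1"',
    '192.0.2.10 - - [10/Oct/2008:13:57:12 -0700] "GET /b HTTP/1.1" 200 734 "-" "Googlebot/2.1"',
]

counts = count_user_agents(sample)
for agent, hits in counts.most_common():
    print(f"{hits:6d}  {agent}")
```

Agents near the top of the list that you don’t recognize are the ones worth looking up before deciding whether to allow them.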

A comprehensive list of robots can help educate you about which crawlers are out there and which may bring you the most value.

You can also find a list of robots commands and a robots.txt file generator.

As an example, the file below allows certain crawlers while shooing away all others:

# For domain: http://www.domain.com

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Teoma
Disallow:

User-agent: Gigabot
Disallow:

User-agent: Scrubby
Disallow:

User-agent: Robozilla
Disallow:

User-agent: Nutch
Disallow:

User-agent: ia_archiver
Disallow:

User-agent: baiduspider
Disallow:

User-agent: yahoo-mmcrawler
Disallow:

User-agent: psbot
Disallow:

User-agent: asterias
Disallow:

User-agent: yahoo-blogs/v3.9
Disallow:

# Shoo

User-agent: *
Disallow: /
Disallow: /cgi-bin/
# Disallow: /images/ - uncomment line with correct path for images
# File exclusions
Disallow: /dir/Privacy-Policy
Disallow: /dir/Security

# Sitemap declaration

Sitemap: http://www.domain.com/sitemap.xml
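Before deploying a file like this, it can help to check that it behaves the way you expect. The sketch below uses Python’s standard urllib.robotparser module on a shortened version of the rules above. SomeUnknownBot is a made-up name standing in for any crawler not on the allow list, and the page URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the rules above: one allowed crawler,
# everyone else shooed away.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An empty "Disallow:" means nothing is off limits for that crawler.
print(parser.can_fetch("Googlebot", "http://www.domain.com/page.html"))

# Hypothetical crawler name; it falls through to the "*" group.
print(parser.can_fetch("SomeUnknownBot", "http://www.domain.com/page.html"))
```

The first check should report that the allowed crawler may fetch the page, and the second that the unlisted one may not. Note that each Disallow line sits directly under its User-agent line here. Some parsers treat a blank line as the end of a group, so a blank line between the two can cause the rule to be ignored.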

