Robots.txt
Have you ever wondered why the error log of your web server keeps
showing entries like this?
[error] [client 204.62.245.187] File does not exist: /usr/local/etc/httpd/htdocs/mysupersite/robots.txt
When you submit your website to a spider engine, the engine's robot
"visits" your site to register it. Most spider engines automatically
look for the robots.txt file first. If this file is not found, the
web server logs the above error.
However, the robots.txt file is not compulsory. Instead of a file,
you can use the "robots" meta tag in your pages. But if you do
not include a robots.txt file and submit your page to hundreds of
spider engines (e.g. with Hello Engines!), you will also receive
hundreds of error messages. Please note that your website is probably
visited by several search engines every day. Without a robots.txt
file, the error log can therefore quickly grow very large as it
fills up with irrelevant error messages.
In the robots.txt file of your site, you can define the pages that
are to be excluded from indexing. Please note that only one
robots.txt file is taken into account per server and that it must be
located at the top level of the site. On a UNIX system, for example,
it could be stored as
/usr/local/etc/httpd/htdocs/robots.txt
The syntax of the robots.txt is extremely simple and generally
looks like this:
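A minimal sketch of such a file follows. The directory names /cgi-bin/
and /images/ are placeholders chosen for illustration and do not come
from any particular site:

```text
# Hypothetical example: the two directory names are placeholders.
# "User-agent: *" addresses all robots.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
```

The User-agent line names the robot the rules apply to, and each
Disallow line gives one path prefix that this robot is asked not to
fetch.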
In the above case, two directories are excluded from indexing.
For each directory that is not to be indexed by the spider engine,
you must add a separate "Disallow" line.
Example: to block all robots from accessing and indexing your website,
enter the following lines in the robots.txt file:
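Under the standard robots exclusion syntax, a single slash after
Disallow covers every path on the site, so a sketch of this case is:

```text
# Applies to all robots; "/" matches every path on the site.
User-agent: *
Disallow: /
```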
To allow all robots to access and index all pages of your website,
enter the following lines in robots.txt:
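In the standard syntax, a Disallow line with an empty value excludes
nothing, so a sketch of this case is:

```text
# Applies to all robots; the empty Disallow value excludes nothing.
User-agent: *
Disallow:
```

This file also gives robots something to find, so the "File does not
exist" entries no longer appear in the error log.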
To prevent a specific robot from accessing your directories, enter
the following:
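A sketch of this case, assuming the robot identifies itself with the
hypothetical name "BadBot" (replace it with the name the robot you
want to exclude actually uses):

```text
# "BadBot" is a hypothetical robot name used only for illustration.
User-agent: BadBot
Disallow: /
```

Robots that are not named in any User-agent line are not affected by
this record.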
To allow only one specific robot to index your directories (thus
blocking all others), enter the following lines:
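A sketch of this case, assuming the permitted robot identifies itself
with the hypothetical name "GoodBot". The file contains two records
separated by a blank line: one for the permitted robot and one for
everyone else:

```text
# "GoodBot" is a hypothetical robot name used only for illustration.
# Record 1: the named robot may fetch everything.
User-agent: GoodBot
Disallow:

# Record 2: all other robots are excluded from the whole site.
User-agent: *
Disallow: /
```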
Similarly, you can exclude specific pages from indexing:
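A sketch of this case, in which both page paths are placeholders
chosen for illustration. Each page gets its own Disallow line, written
as the path from the top level of the site:

```text
# Hypothetical example: both page paths are placeholders.
User-agent: *
Disallow: /private/secret.html
Disallow: /drafts/old-page.html
```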