This Search Engine


Locale: en-US
Page: Robots Behaviors

The '''Robots Behaviors''' dropdown controls how strictly your Yioop crawler respects '''robots.txt''' files. A '''robots.txt''' file is placed by a site operator in the document root of their website, so it typically has a URL like:
https://some_host_name/robots.txt<br>
or<br>
http://some_host_name/robots.txt.
It specifies which files a particular kind of crawler is allowed to download from a site and at what rate. For example, it might contain separate instructions for how Googlebot and Bingbot are allowed to crawl the site. The available options are:
* '''Always Follow''', which follows the robots.txt instructions to the best of Yioop's abilities.
* '''Allow Landing Page Crawl''', which allows Yioop to download URLs of the form
https://some_host_name/<br>
or<br>
http://some_host_name/ but otherwise respects the robots.txt file.
* '''Ignore''', which lets Yioop ignore the robots.txt file completely. Use this option at your own risk. It can be appropriate in a few cases, such as crawling part of a site that you own but whose robots.txt you do not control. For the most part, you should not use this option.
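
The three options can be illustrated with a minimal Python sketch. This is not Yioop's code (Yioop is written in PHP). The constant names, the <code>may_download</code> and <code>is_landing_page</code> helpers, the <code>ExampleBot</code> user agent, and the sample robots.txt are all hypothetical and exist only for illustration. The sketch uses Python's standard <code>urllib.robotparser</code> module and assumes that a URL with an empty path counts as a landing page.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical labels for the three dropdown options (not Yioop identifiers).
ALWAYS_FOLLOW = "always_follow"
ALLOW_LANDING_PAGE = "allow_landing_page"
IGNORE = "ignore"

# Made-up robots.txt: every crawler may fetch /public/ and nothing else.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /
"""


def is_landing_page(url):
    """Return True for URLs of the form http(s)://some_host_name/.

    Assumption: an empty path (no trailing slash) is treated the same way.
    """
    parts = urlsplit(url)
    return (
        parts.scheme in ("http", "https")
        and parts.path in ("", "/")
        and not parts.query
    )


def may_download(url, robots_txt, behavior, user_agent="ExampleBot"):
    """Decide whether a crawler may fetch url under the given behavior."""
    if behavior == IGNORE:
        # robots.txt is never consulted.
        return True
    if behavior == ALLOW_LANDING_PAGE and is_landing_page(url):
        # The landing page is exempt; every other URL falls through.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for behavior in (ALWAYS_FOLLOW, ALLOW_LANDING_PAGE, IGNORE):
        for url in ("https://example.com/", "https://example.com/private/a.html"):
            print(behavior, url, may_download(url, SAMPLE_ROBOTS_TXT, behavior))
```

With this sample file, only the landing page changes between the first two options: it is blocked under the hypothetical <code>ALWAYS_FOLLOW</code> setting and allowed under <code>ALLOW_LANDING_PAGE</code>, while <code>/private/a.html</code> stays blocked under both.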

- Help: Robots Behaviors
