Robots Behaviors
The '''Robots Behaviors''' dropdown controls the degree to which your Yioop crawler respects '''robots.txt''' files. A '''robots.txt''' file is placed by a site operator in the document root of their web site, so it typically has a url such as https://some_host_name/robots.txt or http://some_host_name/robots.txt. It specifies which files a particular kind of crawler is allowed to download from the site and at what rate. For example, it might contain separate instructions for how GoogleBot and BingBot are allowed to crawl the site.

The available options are:
* '''Always Follow''': Yioop follows the robots.txt instructions to the best of its abilities.
* '''Allow Landing Page Crawl''': Yioop may download urls of the form https://some_host_name/ or http://some_host_name/ but otherwise respects the robots.txt file.
* '''Ignore''': Yioop completely ignores the robots.txt file. Use this option at your own risk. It may suit cases such as crawling part of a site that you yourself own but whose robots.txt you do not control. For the most part, you should not use this option.
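
The difference between the three options can be illustrated with a small sketch. This is not Yioop's own code. It is a minimal illustration that uses Python's standard `urllib.robotparser` module, and the names `RobotsBehavior`, `is_landing_page`, `may_download`, and the user agent `ExampleBot` are hypothetical. It also assumes that a landing page is a url with an empty or `/` path and no query string, which may not match Yioop's exact rule.

```python
from enum import Enum
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsBehavior(Enum):
    """Hypothetical names mirroring the three dropdown options."""
    ALWAYS_FOLLOW = "always_follow"
    ALLOW_LANDING_PAGE_CRAWL = "allow_landing_page_crawl"
    IGNORE = "ignore"


def is_landing_page(url: str) -> bool:
    """Assumption: a landing page is http(s)://host/ with no query string."""
    parts = urlsplit(url)
    return (
        parts.scheme in ("http", "https")
        and bool(parts.netloc)
        and parts.path in ("", "/")
        and not parts.query
    )


def may_download(url: str, robots_txt: str, behavior: RobotsBehavior,
                 user_agent: str = "ExampleBot") -> bool:
    """Decide whether url may be fetched under the chosen behavior."""
    if behavior is RobotsBehavior.IGNORE:
        # robots.txt is not consulted at all.
        return True
    if (behavior is RobotsBehavior.ALLOW_LANDING_PAGE_CRAWL
            and is_landing_page(url)):
        # The landing page is exempt; every other url is still checked.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A robots.txt that forbids all crawlers from the whole site.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

for behavior in RobotsBehavior:
    for url in ("https://some_host_name/", "https://some_host_name/docs/a.html"):
        print(behavior.name, url, may_download(url, ROBOTS_TXT, behavior))
```

With this robots.txt, only '''Ignore''' permits the inner page, while '''Allow Landing Page Crawl''' permits the landing page alone and '''Always Follow''' permits neither.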