How the search robot works

Yandex has a detailed FAQ on this subject at http://help.yandex.ru/webmaster/?id=995296
Detailed, but not informative enough. For example, to the direct question it poses itself, "What is a search engine robot and what does it do?", Yandex answers:

The robot (crawler) maintains a list of URLs that it can index and regularly downloads the corresponding documents. If the robot finds a new link while analyzing a document, it adds the link to its list. Thus any document or site that is linked to can be found by the robot, and therefore by Yandex search.
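The loop described in this answer (keep a list of URLs, download each document, add any new links to the list) can be sketched in a few lines of Python. This is only an illustration, not Yandex's implementation: the PAGES dictionary, the example.com addresses and the crawl function are hypothetical stand-ins, and the "download" is a dictionary lookup rather than a real HTTP request.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> links found on that page.
PAGES = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
    "http://example.com/orphan": [],  # no page links here
}


def crawl(seed_urls, fetch_links):
    """Return URLs in the order they are visited, starting from seed_urls."""
    frontier = deque(seed_urls)  # the robot's list of URLs still to visit
    known = set(seed_urls)       # every URL ever added to the list
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in known:  # only new links are added to the list
                known.add(link)
                frontier.append(link)
    return order


discovered = crawl(["http://example.com/"], lambda url: PAGES.get(url, []))
print(discovered)
```

Note that the page with no inbound links is never visited, which matches the point of the quote: the robot only finds what is linked to.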

As you can see, this answers only the second part of the question. What a robot actually is, we still have not learned. Let us turn to the independent experts at Wikipedia.

A search robot ("web spider", crawler) is a program that is part of a search engine and is designed to traverse Internet pages and enter information about them (keywords) into the search engine's database. In principle, a spider most resembles an ordinary web browser. It scans the content of a page, sends it to the server of the search engine it belongs to, and follows the links to the next pages. Search engine owners typically limit how deep a spider goes into a site and the maximum size of the scanned text, so very large sites may not be fully indexed. Besides ordinary spiders, there are so-called "woodpeckers": robots that "tap" an indexed site to determine whether it is still connected to the Internet.

The order in which pages are traversed, the frequency of visits, protection against loops, and the criteria for selecting keywords are determined by the search engine's algorithms.

In most cases the robot moves from one page to another by following the links contained on the first and subsequent pages.

Many search engines also let users add a site to the indexing queue themselves. This usually speeds up indexing considerably, and when no external links lead to the site it is practically the only way to announce its existence.

You can restrict indexing of a site with the robots.txt file, but some search engines may ignore this file. Full protection from indexing can only be provided by other mechanisms that spiders cannot yet get around. Usually this means setting a password on the page or requiring a registration form to be filled in before the content can be accessed.
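The limits mentioned in the quoted description (how deep a spider goes into a site and how much text it scans per page) can be illustrated with a small sketch. Everything here is assumed for the example: the SITE dictionary, the limited_crawl function and the particular limit values are hypothetical, and real search engines choose their own limits.

```python
from collections import deque

# Hypothetical site: URL -> (page text, outgoing links).
SITE = {
    "/": ("home page", ["/docs"]),
    "/docs": ("x" * 50, ["/docs/deep"]),
    "/docs/deep": ("deep page", ["/docs/deep/deeper"]),
    "/docs/deep/deeper": ("never reached", []),
}


def limited_crawl(start, max_depth, max_chars):
    """Index pages at most max_depth links away from start,
    keeping at most max_chars characters of each page's text."""
    index = {}
    frontier = deque([(start, 0)])
    seen = {start}  # also protects against link loops
    while frontier:
        url, depth = frontier.popleft()
        text, links = SITE.get(url, ("", []))
        index[url] = text[:max_chars]  # size limit on the scanned text
        if depth == max_depth:
            continue  # depth limit: do not follow links any further
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return index


index = limited_crawl("/", max_depth=2, max_chars=20)
print(sorted(index))
```

With these values the page three links away from the start is never indexed, and the long page is stored only in truncated form, which is why a very large site may end up only partly in the index.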

That is much clearer. A robot is a program, built into the search engine as an integral part and subordinate to the search engine's algorithms. The robot also obeys the site's author or administrator. To subdue a search robot, the site admin has to competently perform a dance with a tambourine, that is, write instructions in the robots.txt file, which tells the robot which pages should not appear in its index.

Note that the robot can still access these pages if there are inbound links to them; it simply does not put them in the index. However, the robot is subordinate to the algorithms of a particular search engine, and those change often. If you want to be absolutely sure that your sensitive data will not become public property by mistake, it is better to play it safe and put a password on the page or some other obstacle in the robot's way, for example an SMS lock :) Robots are, of course, constantly getting smarter, but something tells me they will never learn to pay by card or by SMS.

Below is a link to a script that checks which pages on the server are closed to Yandex robots by the instructions in robots.txt: the script
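A similar check can be sketched locally with Python's standard urllib.robotparser module. This is not the script referred to above, only a rough equivalent under stated assumptions: the robots.txt contents, the example.com URLs and the "OtherBot" name are made up for illustration, and the standard-library parser may interpret some rules differently from Yandex's own robot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, made up for this example.
ROBOTS_TXT = """\
User-agent: Yandex
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text, no network access

urls = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",
    "http://example.com/tmp/cache.html",
]

# For each URL, ask whether a robot identifying itself as "Yandex" may fetch it.
allowed_for_yandex = {url: parser.can_fetch("Yandex", url) for url in urls}

# A robot without its own section falls back to the "*" rules.
allowed_for_other = {url: parser.can_fetch("OtherBot", url) for url in urls}

for url in urls:
    print(url, allowed_for_yandex[url], allowed_for_other[url])
```

A robot that has its own User-agent section follows only that section, so in this sketch /tmp/ stays open to Yandex while /private/ is closed to it.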
