After 25 Years, There Can Finally Be an Official Standard for Using Robots.txt, Thanks to Google

Around 25 years ago, the rules for robots.txt files were defined in the Robots Exclusion Protocol (REP), which to this day is still considered an unofficial standard. Google has now put forth a draft to turn it into an official internet standard.

Although search engines have endorsed REP over the last 25 years, developers have often interpreted its rules in their own ways because it is unofficial. Moreover, it has become outdated over time, failing to cover the use cases of today.

Even Google admitted that the ambiguous nature of the standard makes it difficult for website owners to implement the rules correctly.

The tech giant then proceeded with a solution by documenting how the REP should be applied on the modern web, and submitted the resulting draft to the Internet Engineering Task Force (IETF) for evaluation.

As Google put it in its announcement: "In 25 years, robots.txt has been widely adopted; in fact, over 500 million websites use it! While user-agent, disallow, and allow are the most popular lines in all robots.txt files, we've also seen rules that allowed Googlebot to 'Learn Emotion' or 'Assimilate The Pickled Pixie'."

According to Google, the draft includes extensive details on the real-world experience of relying on robots.txt rules, drawn from Googlebot, various other crawlers, and the more than half a billion websites that depend on REP. These rules give website publishers the power to decide what on their site they would like to be crawled and shown to interested users.
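For readers unfamiliar with the format, a minimal robots.txt using those directives might look like the following (the paths and crawler rules below are illustrative examples, not taken from Google's draft):

```text
# Rules that apply to all crawlers
User-agent: *
Disallow: /private/
Allow: /private/press-release.html

# Rules that apply only to Googlebot
User-agent: Googlebot
Disallow: /drafts/
```

The file is served from the root of the site, e.g. https://example.com/robots.txt, and each User-agent group lists the paths that crawler may or may not fetch.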

It should be noted that the draft doesn't change the already defined rules; it merely updates them to suit the modern web.

The updated rules include (but are not limited to):

  1. Robots.txt is no longer limited to just HTTP and can now be used with any URI-based transfer protocol (for example, FTP).
  2. Developers must parse at least the first 500 kibibytes of a robots.txt file.
  3. To give website owners flexibility in updating their robots.txt, a maximum caching time of 24 hours (or the value of a cache directive, if available) is set forth.
  4. When server failures render a robots.txt file inaccessible, known disallowed pages are not crawled for a reasonably long period of time.
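As a rough sketch of how such rules behave in practice, Python's standard-library urllib.robotparser implements REP semantics and can evaluate a robots.txt against a URL. The crawler name, domain, and paths below are made-up examples:

```python
from urllib.robotparser import RobotFileParser

# Parse an illustrative robots.txt (normally fetched from the site root
# via rp.set_url(...) and rp.read()).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A path under a Disallow rule: a compliant crawler must not fetch it.
print(rp.can_fetch("MyCrawler", "https://example.com/private/report.html"))  # False

# A path matched by no Disallow rule may be crawled.
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
```

A real crawler would combine this check with the caching and failure-handling rules described above before fetching any page.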

Google plans on trying its best to make this standard official, and for this purpose it is open to suggestions regarding the proposed document.
