Before worrying about the most advanced techniques and trends for positioning your site well on search engine results pages, every site administrator needs to understand a much more basic element that should be present on every site – the robots.txt file.
Although it is quite simple and has been in use since the advent of the commercial Internet, it still fulfills an important function and can help you administer the site and ensure its proper indexing by search engines.
Therefore, today we are going to deal with everything you need to know about robots.txt.
What is robots.txt?
For those of you who are familiar with website development at a deeper level, with knowledge of HTML and PHP for example, it will probably be unnecessary to answer this question.
But the massive adoption of CMSs and no-code tools, and the habit of simply looking for a plugin that solves a problem or meets a need, have made it not uncommon to find site administrators who do not know in depth everything that makes a site work or what each part of it is for.
The robots.txt is a plain text file that must be stored in the root folder of the site and written according to certain conventions, containing rules and instructions for the internet robots that access the site.
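In practice, “root folder” means the file must be reachable directly under the domain. Assuming a hypothetical site at example.com, bots will look for it at:

https://www.example.com/robots.txt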
So what purpose does the robots.txt file serve?
Much is said about applying SEO and Content Marketing techniques to websites for search engine robots, especially Googlebot, but long before it became so popular and received so much of our attention, the robots.txt file was already used to guide any robot on what to do – or not to do – when “visiting” the respective site.
Even now, after so many changes on the Internet, robots.txt is still the first file that Googlebot – and many other bots – check when accessing a domain.
So let’s suppose that, for some reason, you don’t want a certain URL or page to be accessed, or you don’t want the images on a page to be displayed in search results; it is possible to add an instruction so that the robot does not include them in the results.
Some may wonder – especially those who are just starting out – why someone would not want all of their content to appear in search engine results, right?
Because it may be a page or an entire directory that is still under development or being tested. Images can contain restricted information, as in the case of an infographic. In other words, there may be justified and important reasons, which concern only the site administrator, for preventing crawling.
What is the importance of robots.txt?
In addition to what we’ve already seen, robots.txt is important for a reasonable list of reasons:
- Performance – by declaring only what is important, you prevent bots from consuming bandwidth, making too many requests and otherwise affecting the performance and experience of legitimate website users;
- Sitemap – the sitemap file, which is one of the main ways to ensure complete indexing of your site, is declared in robots.txt (see the example after this list);
- Indexing – very large sites with many pages, such as content sites and portals, have a lot of files, which makes crawling time-consuming and slow; in the case of Google – but not only – there is a limit to how much gets crawled (the Crawl Budget). By telling the bot only what should be “read”, you avoid wasting that budget on scanning what is irrelevant;
- Download – content for download that is restricted, for example, to users who have registered to receive it;
- Special pages – thank-you pages, duplicate pages (e.g. print versions), PDF files and any pages you do not want or cannot have crawled must be declared in robots.txt.
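As an illustration, here is a minimal sketch of a robots.txt covering some of these cases – the paths and the sitemap URL are hypothetical and would need to match your own site:

# applies to any bot
User-agent: *
# keep special pages and restricted downloads out of the crawl
Disallow: /thank-you/
Disallow: /print/
Disallow: /downloads/
# tell bots where the sitemap lives
Sitemap: https://www.example.com/sitemap.xml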
How does robots.txt work?
Like HTML or even PHP files, it is a text file that needs to be written according to certain rules – the syntax – in order to be “understood” by a robot and to make it act according to what you want.
As this is a convention, robots that “respect” it will, by default, look for the file when accessing a site and “read” it before proceeding to scan the site and its contents, observing what is allowed (the allow statement) and what is not (the disallow statement).
The “allow” rule is a redundancy, which means that in most cases it is not necessary to define what the bot is allowed to scan. The “disallow” rule is a different matter.
As an example, the instruction guiding any robot to disregard a specific directory is:
User-agent: *
Disallow: /users/
The term “User-agent” refers to the name of the robot; in this case, because it is an asterisk, it should be interpreted as meaning any robot, bot, crawler or spider – the last three terms being other ways of referring to an internet robot.
If we wanted the instruction to be directed exclusively at Googlebot, instead of “*” we would use its name – Googlebot.
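In that case, our earlier example would read:

User-agent: Googlebot
Disallow: /users/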
There is, in fact, a list of the best-known bots and their respective names: The Web Robots Pages.
Still following our example, the “Disallow” instruction tells bots that they are not allowed to scan the contents of the “users” folder, located in the root of the site.
But what if we eventually had a third line like this?
Allow: /users/public/
In this context, there is no permission to scan the files and other directories inside the “users” folder, except for the one named “public”, which has the “allow” statement associated with it, indicating that its files and any subfolders are released – and this is a case in which allow makes sense as an instruction.
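Putting the lines together, the complete file in our hypothetical example would read:

User-agent: *
Disallow: /users/
Allow: /users/public/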
When not to use robots.txt?
The robots.txt file should always be present; however, there are some purposes and situations for which it is not the most suitable method.
If you have just discovered that it is possible to block access to certain files and directories using the disallow instruction, know that this is not a mechanism for “protecting” or securing sensitive content.
Even if you intended to use it that way, it would have the opposite effect in the face of a possible intruder, who, upon reading what you intend to hide, will know exactly where to look.
For this type of content, it is necessary to use other methods, such as password-protected directories, among other measures.
The robots.txt file is based on conventions
As we mentioned earlier, its use and reading by internet robots is based on conventions.
The first direct consequence of this is that a bot is not “forced” to follow what is stated as an instruction, and does not even need to consult the file before scanning your site.
The biggest example of this – crawling a site while disregarding robots.txt – is web scraping tools, which work like robots and often don’t even have a known name. Practically all of them ignore the existence of the file.
Different bots, different behaviors
As we said, because it is just a convention, not every bot will exhibit the same behavior.
In addition, the interpretation of instructions may have particularities depending on the robot. Therefore, if you have included instructions and need a certain tool to behave properly, you need to make sure that the syntax you used is the one it understands.
Some bots, such as Bingbot, offer pages with documentation about their crawling and indexing parameters, which can be useful for writing efficient instructions.
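A classic illustration of these differences is the Crawl-delay directive: Bing documents support for it, while Googlebot simply ignores it. A sketch of a rule aimed only at Bingbot:

User-agent: Bingbot
# ask Bing to pause between requests; Googlebot ignores this directive
Crawl-delay: 10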
Prevent indexing
“Ethical” robots that follow the conventions will not “disobey” an instruction, but whether or not content gets indexed does not depend only on what is in robots.txt.
If, for example, an external site or a user’s post on a social network contains a link to a page that you would not want indexed, it could end up indexed anyway.
Resolving issues like this depends on another method, which consists of using the robots-specific HTML meta tags directly on the pages where you want them to take effect, as in the following example:
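These are the standard robots meta tags, placed in the page’s <head>:

<meta name="robots" content="noindex">
<meta name="robots" content="nofollow">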
The first informs search engines that the content of the page must not be indexed. The second says that the links present on the page must not be followed.
Alternatively, when you want both, you can use a single line, as below:
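<meta name="robots" content="noindex, nofollow">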
Conclusion
The robots.txt is the first and most basic aid you give search engines in properly ranking your content in search results.