The robots.txt file, which implements what is known as the robots exclusion protocol, is a particularly useful file. It provides search engine robots with instructions about crawling your web pages.
Contrary to what you might think, it is very easy to create and configure a robots.txt file. You don’t need any special knowledge of web development. The only thing you’ll need is a little free time. And believe us, the result is well worth the effort!
In this article, we explain why this file is important for your SEO and we will show you how to best configure it.
What is the robots.txt file?
In the digital world, robots are programs that visit websites. The most common examples are undoubtedly search engine robots. The robots used by Google, Bing, Yahoo, Baidu and others are designed to explore all the content on websites and then index it.
This indexing is what allows your pages to appear, more or less favourably, in search results for specific queries.
The robots.txt file allows you to influence this exploration. In other words, it is a very powerful tool!
Crawlers always check the robots.txt file for instructions before exploring a website.
If your site does not have this file, or if it is poorly configured, robots will simply crawl the entirety of your website.
Why is the robots.txt file important for SEO?
There are two main reasons why the robots.txt file is important for your SEO:
First, it allows you to choose which of your site’s resources you want explored. Anything you consider irrelevant can easily be removed from the robots’ exploration process, so that they can focus on the essential: your most relevant content (service pages, blog articles, etc.).
Second, thanks to this tool, you can control the amount of content explored by search engines. Remember, without this valuable file, they will explore your entire site. If you have a lot of pages, Google may allocate too little crawl time to your site, in which case its robot could spend that time exploring the least important pages.
The idea is to make your website easier to explore by eliminating low value-added URLs, in order to optimize the robots’ crawl budget (the limit on the number of pages explored on a website).
Here are some examples of what Google considers to be low value-added URLs:
- Those generated by faceted navigation (refining a search using a filter)
- Those generated by session IDs. For example, logging into your account on an online store.
Removing these URLs will ensure that your high value pages are explored and indexed. This will greatly increase your chances of being well ranked.
Now that you have a better understanding of the importance of the robots.txt file for your SEO, we can move on to the next step, namely creating it and configuring it.
Create and configure your robots.txt file
Start by creating your robots.txt file. No special program is required and you can use a basic text editor: Notepad if you are on Windows or TextEdit if you are on macOS.
Make sure you name it “robots.txt” and don’t forget the “s” at the end, otherwise it won’t work.
Next, you need to place it at the root of your site. In concrete terms, if your site is accessible via the address https://mysite.com, the robots.txt file must be located at the following address: https://mysite.com/robots.txt
There are two ways to do this. You can connect to your website’s host and then access the dedicated file manager. Alternatively, you can use an FTP (“File Transfer Protocol”) client, like FileZilla, to communicate with your site’s server.
With your robots.txt file in place, all you have to do is fill it in. To give you an idea, here is what a configured robots.txt file can look like:
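For instance (the directory names below are just placeholders):

```
User-agent: *
Disallow: /admin/
Disallow: /cart/

Sitemap: https://mysite.com/sitemap.xml
```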
Don’t quite understand it? Don’t worry, we’ll take the time to define everything ;).
First, it should be noted that there are two main rules governing this file:
- The “User-agent” directive. This designates the search engine robots that must follow the instructions in the file.
- The “Disallow” directive. This is used to indicate that a directory or page on the site should not be explored by the User-agent. Without this directive, the robot explores your website normally.
This robots.txt rule is especially useful for SEO, since you can ask robots not to explore your low value-added pages.
To optimize the robots’ crawl budget, you can use this directive to instruct them not to explore those parts of your site that are not displayed to the public.
For example, you can prevent access to your login page:
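Assuming your login page lives at /login, the rule would read:

```
User-agent: *
Disallow: /login
```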
This will ensure that robots do not waste time exploring this page and can focus on the most important thing.
Let’s continue with the basic rules overview. Note that there are generic characters associated with the rules:
- The asterisk * is what is known as a “wildcard”. Placed after “User-agent”, it means that the instructions apply to all robots.
- The slash / placed after “Disallow” indicates that all directories and pages of the website are off limits to those robots.
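Put together, these two characters produce the most restrictive file possible, one that bars every robot from the entire site:

```
User-agent: *
Disallow: /
```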
Now that the foundations have been laid, we will now turn to the additional rules.
The “Allow” directive is the opposite of the “Disallow” directive. It is only supported by Google and Bing. Generally, it is used like this:
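A typical pattern looks like this (the /media directory and formulaire.pdf file are examples):

```
User-agent: *
Disallow: /media/
Allow: /media/formulaire.pdf
```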
In this example, all robots should avoid the /media directory, except for the formulaire.pdf file it contains.
Prevent access to a specific search engine
Suppose you want to block access to the directories and pages of your site from the Bing robot (Bingbot). This is how you should proceed:
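Bing’s crawler identifies itself as “Bingbot”, so the file would contain:

```
User-agent: Bingbot
Disallow: /
```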
Note that robots from other search engines can explore your entire site.
You can, if you wish, create different rules for different robots. To help you, here is a list:
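Here are the user-agent names of some of the best-known crawlers:

```
Google        →  Googlebot
Google Images →  Googlebot-Image
Bing          →  Bingbot
Yahoo         →  Slurp
Baidu         →  Baiduspider
Yandex        →  YandexBot
DuckDuckGo    →  DuckDuckBot
```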
In addition to the * wildcard, you can mark the end of a URL with the $ sign.
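For instance, to target every URL ending in .php:

```
User-agent: *
Disallow: /*.php$
```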
In effect, this example means that all search engine robots should avoid URLs that end in .php.
Note: URLs with specific parameters, such as https://mysite.com/page.php?lang=en, will still be accessible, since the URL does not end directly after the .php extension.
This is not required, but if you wish, you can also use your robots.txt file to drive search engines to your XML sitemap. Most search engines support it (Google, Bing, Yahoo). This will help them better understand your site structure.
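The directive takes the absolute URL of your sitemap. Assuming it sits at the root of your site, the line looks like this:

```
Sitemap: https://mysite.com/sitemap.xml
```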
Comments are not taken into account by robots. However, they help to clarify your robots.txt file, especially if it contains a large number of instructions.
Any comment must be preceded by the # symbol, so that the robots understand it is to be ignored.
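For example (the path here is illustrative):

```
# Keep robots out of the back office
User-agent: *
Disallow: /admin/
```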
Recall, we mentioned earlier that the “Disallow” instruction was useful for your SEO. Well, that’s not quite the whole story: even though it prevents the exploration of your pages, those pages may still be indexed, for example if other sites link to them.
The “noindex” directive prevents this. Be aware, however, that Google no longer supports “noindex” rules inside the robots.txt file itself; the reliable way to keep a page out of the index is a robots meta tag placed on the page.
Take the example of thank-you pages: if you don’t want them to be indexed, add such a tag to each of them.
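A minimal sketch of such a tag, placed in the <head> of each thank-you page:

```
<meta name="robots" content="noindex">
```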
You can also ask robots not to follow the links on a page, using the “nofollow” directive. Since its configuration is not part of the robots.txt file, we will not discuss it here. For those who are curious, Google has a dedicated page on the subject.
Some special features to remember
The robots.txt file, as simple as it may be, has a number of special features that you should be aware of in order to avoid any errors that could harm your site.
- Don’t put everything on the same line: make sure that each of your instructions is on a new line. Several instructions on the same line will cause problems and misunderstandings for robots.
- Order of precedence: each robot treats the robots.txt file differently. By default, the first matching directive prevails. Google and Bing behave a little differently, however: for them, it is the most specific instruction that wins.
- Beware of malicious robots: the robots.txt file is only an indication of what search engines should do. While “benevolent” robots play the game, this is not necessarily the case for “malevolent” robots who will simply ignore your file.
- The file cannot exceed 500 KB: keep this in mind when configuring your robots.txt file. If it exceeds the maximum size, it may not be taken into account.
- Case sensitive: the file name must be “robots.txt”, all in lowercase, and the paths in your rules are case sensitive too. So pay close attention to how your directories, links, etc. are capitalized.
- One robots.txt file per domain or subdomain: the instructions in a file can only be applied to the host where the file is located.
Check if your robots.txt file is working
Now that your file is configured, you need to test it to see if it works.
You can do this by going to Google Search Console. Log in to your account, then open the robots.txt report, which you will find under “Settings”. (It replaced the old standalone robots.txt Tester tool.)
The report lists the robots.txt files Google has found for your site, when each was last crawled, and any warnings or errors raised while parsing them.
If your file appears as fetched without errors, we are happy to inform you that your file is valid! =)
All you have to do is place it at the root of your website.
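If you’d like to sanity-check your rules locally as well, Python’s built-in urllib.robotparser module can evaluate a robots.txt file for you (a quick sketch; the rules and URLs are examples):

```python
from urllib.robotparser import RobotFileParser

# Example rules: block the /login page for every robot
rules = """\
User-agent: *
Disallow: /login
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Public pages are crawlable...
print(parser.can_fetch("*", "https://mysite.com/blog"))   # True
# ...but the login page is not
print(parser.can_fetch("*", "https://mysite.com/login"))  # False
```

Once your file is online, the same parser can load it directly with set_url() followed by read().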
Now that you know how to create and configure a robots.txt file, you should quickly see an increase in your visibility in search results.
This file will greatly help search engine robots understand your site. They can then explore your website more intelligently and display your most relevant pages in search results.
However, if you need help setting up your robots.txt file, please do not hesitate to contact our team of experts!