The Google search engine indexes web pages according to a number of criteria and a complex algorithm. While many of these criteria are known to specialists, some of Google’s SEO criteria are not transparent and require careful attention and monitoring from site owners who want to be indexed as effectively as possible.
A website’s indexing can be defined as its inclusion in a search engine or directory database. This makes it possible to show up in users’ search results, especially on Google. However, while it is easy to index a page on Google, certain criteria must be taken into account to optimize your ranking and benefit from increased online visibility.
How does Google work?
In order to understand indexing, you first need to understand how Google works. Google’s search engine relies on robots (also called crawlers or spiders). These robots visit your site, explore your pages, index them and position them.
Exploration, indexing and positioning are the three major steps in getting a web page ranked. In practice, a robot analyzes your site, explores its pages and follows all the links found on each of these pages. The robot then copies each page to Google’s servers, and Google refers to these copies when it decides to index your site, i.e. to include it in its search engine.
Therefore, indexing your site means that it can appear anywhere in Google’s results, whether on the first, second, tenth, twentieth or hundredth page. Google then positions your pages. Positioning is the placement of your web page on the 1st, 2nd, 3rd page etc. of the search results, based on the criteria you meet and the relevance Google assigns to your page for given keywords. Positioning is the key to getting traffic: the higher your ranking, the more likely users are to come across your page. It is therefore essential that Google crawls your site regularly, otherwise the algorithm will only remember the version of each page it first visited and cached on its servers, and will not take into account any updates made since.
How to communicate with Google
The SEO specialist’s job is to analyze how Google interprets your site, then to provide, in dedicated files, the information that lets you communicate with Google’s robots and optimize your SEO.
There are several types of files that are used to communicate with Google.
The robots.txt file
On any website, a file should be available at the root of the site so that robots can easily access it: the robots.txt file. This file is made for robots, and they are supposed to visit it first when they arrive on the site. It provides all the information that is useful to them. Within this file, you can notably tell robots not to crawl this or that part of the site, for example secure sections or profile pages.
You do not always want all the pages of your site to be indexed, because some pages are not designed to be public. The robots.txt file can therefore instruct Google’s robots to stay away from pages that you do not want to make public.
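As an illustration, here is a minimal robots.txt sketch (the domain and paths are hypothetical examples) that asks all robots to stay out of a few private sections while leaving the rest of the site open to crawling:

```txt
# robots.txt — served at https://mysite.com/robots.txt (example domain)
User-agent: *          # these rules apply to all robots
Disallow: /admin/      # do not crawl the secure admin section
Disallow: /profile/    # do not crawl user profile pages

# It is also common to point robots to the sitemap here:
Sitemap: https://mysite.com/sitemap.xml
```

Keep in mind that these are directives that well-behaved robots choose to follow, not a security mechanism.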
Nevertheless, it should be noted that Google is free to decide how it takes into account the information transmitted in the robots.txt file. In other words, while in most cases the robots follow the instructions, Google may decide not to comply with them and to index pages according to its own criteria.
Next, there are tags that can be placed on the web pages themselves to provide information to robots. For example, if you don’t want a page of your site to be indexed, there is a “noindex” tag that can be placed in the page’s HTML. If it appears on one of your pages, the robot is not supposed to index that page, and it will never appear on Google. Naturally, if a tag like this appears on one of your main pages, that page will never be indexed, so it will never be positioned and you will lose visibility. So be sure to carefully select the pages of your site that you don’t want indexed.
Example of a noindex tag
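In practice, the noindex tag is a single line in the page’s head. A minimal sketch (the page title and content are placeholders):

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Private page (example)</title>
    <!-- Tells robots not to index this page -->
    <meta name="robots" content="noindex">
  </head>
  <body>
    <p>This page can be crawled but should not appear in search results.</p>
  </body>
</html>
```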
The sitemap.xml file
In addition, we have a file called the “sitemap.xml”. This file is exactly what it sounds like: on many sites, you will find a file that maps all the URLs of the site, or at least the most important ones. There are two types of sitemap: the user sitemap and the technical sitemap.
There is a sitemap specifically for Internet users where all URLs are listed. This sitemap allows Internet users to understand the organization of your site, its different sections, its different pages, etc.
User sitemap on Apple’s website
The second type of sitemap is specially designed for Google: the “sitemap.xml” file. It lets you communicate to Google all the pages you want indexed in the search engine.
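A minimal sitemap.xml looks like the following sketch (the URLs and dates are hypothetical examples):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://mysite.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://mysite.com/products/</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```

Each url entry lists one page you want Google to know about; lastmod hints at when it was last updated.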
To find out which aspects of your site need improvement in terms of indexing, it can be useful to check how many of your site’s URLs have been indexed, then compare this number with the total number of pages on your site and with the figures reported by Google Search Console.
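A quick way to estimate how many of your pages Google has indexed is the site: search operator, typed directly into Google’s search box (the domain below is a hypothetical example; the count Google shows is approximate):

```txt
site:mysite.com          shows the pages Google has indexed for the domain
site:mysite.com/blog/    restricts the results to the blog section
```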
Google Search Console
Google Search Console is a free tool provided by Google, similar to Google Analytics but offering other types of data. Google Analytics shows your website traffic, how long visitors stay, and so on. Google Search Console, for its part, lets you see the indexing status of your pages, the keywords on which you are positioned, and any problems on your site related to positioning and indexing, and it gives you a great deal of information about your site’s presence on the search engine.
The most common indexing issues
What are the most common cases of indexing problems for websites?
Three fairly common types of website indexing issue can be identified.
Sites that aren’t indexed
Some clients come to us wondering why their site doesn’t appear on the search engine despite their SEO efforts. In some cases – quite often, in fact – we notice that the site is not indexed at all. Indeed, there is a difference between indexing and positioning: a site can be indexed on Google yet positioned so far back in the results that you would have to browse many pages before finding it. As SEO specialists, we can use techniques to determine how many pages of your site are actually indexed, and it can happen that none of them are. Often, this is due to a noindex tag that has been left on the pages, or a blanket Disallow rule left in the robots.txt file. As long as these instructions tell Google to stay away from the site, none of your pages will appear on the search engine.
In many instances, when a website is being built, it is developed on a completely different URL, such as “dev.mysite.com” or “programming.mysite.com”. These URLs are not intended to be indexed, so programmers put a noindex tag on them. However, as soon as the pages go online at the correct URL, the tag must be removed to allow indexing.
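A common way to keep a staging subdomain out of Google, alongside per-page noindex tags, is to serve it its own robots.txt that blocks everything (the subdomain below is a hypothetical example):

```txt
# robots.txt served only at https://dev.mysite.com/robots.txt
User-agent: *
Disallow: /        # block crawling of the entire staging site
```

When the site moves to its final URL, this file must be replaced and the noindex tags removed, or the live site will never be indexed.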
Sites with an insufficient number of indexed pages
Google’s robots allocate a limited amount of resources to your website, often called a crawl budget. If your site contains 50,000 pages, it is unlikely that robots will visit each of these pages at once; they will only visit a percentage of them. Therefore, if the robots choose to visit only low-quality or secondary pages, only those pages will actually be indexed and positioned on the search engine, to the detriment of the other pages on your site.
To solve this problem, it is recommended to rework the site’s internal links to ensure that robots understand which pages to visit. It can also be useful to deindex some less important pages so that Google doesn’t linger on them.
Sites with too many indexed pages
Most of the time this concerns e-commerce sites, but also some showcase sites: we sometimes find that too many pages are indexed. For an e-commerce site, these can be pages that are accessible via several different URLs, i.e. the same page can be reached through several URLs. This often happens when your site offers the same product in different colors or with different features.
In the case of a t-shirt, for example, the access URL could be “mysite.com/name-of-t-shirt”, followed by a number of URL parameters appearing at the end of the URL after a question mark, sometimes indicating colors, sometimes numbers, depending on how your site was built.
Example of URL search parameters
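The pattern looks like the following sketch (hypothetical product URLs; the page content is essentially identical each time):

```txt
https://mysite.com/name-of-t-shirt                      main product page
https://mysite.com/name-of-t-shirt?color=red            same page, red variant
https://mysite.com/name-of-t-shirt?color=black          same page, black variant
https://mysite.com/name-of-t-shirt?color=blue&size=42   same page, two parameters
```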
So the same page will appear with the red t-shirt, the black t-shirt and the blue t-shirt, with only a simple parameter changing in the URL. The problem in this case is that Google may consider these different URLs as duplicated content that should not be indexed in the search engine. You should therefore tell Google that these pages are not duplicated content, for example by blocking the parameterized URLs in the robots.txt file or, more commonly, by adding a canonical tag that points to the main version of the page.
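The canonical approach is a single line in the head of each variant page, pointing to the main URL (the URLs below are hypothetical examples):

```html
<!-- Placed in the <head> of mysite.com/name-of-t-shirt?color=red -->
<link rel="canonical" href="https://mysite.com/name-of-t-shirt">
```

Google then treats the parameterized variants as copies of the canonical URL and consolidates indexing onto that one page.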
In the case of showcase sites, if you have a WordPress site with a blog section, for example, you can choose categories for your articles, and it often happens that more than one category is checked for a given post. Since WordPress can be configured to display the category in the URL, these articles then become available at several URLs, one per checked category, and appear as duplicated content.
To avoid this problem, you can change the URLs of your articles, or deindex the category. If you decide to change your URLs, you should be very careful because it can have a negative impact on your site and your SEO if you don’t have enough expertise to perform these actions.
Get help improving your site’s indexing
Proper indexing has a significant impact on your site’s SEO and therefore on your positioning and traffic. We often find mistakes made when programmers or marketers try to improve performance. If you want to analyze the indexing status of your site’s pages, we recommend working with an SEO specialist. Of course, you can contact us and we will be happy to support you in your SEO strategy.