Chapter 2: How do search engines work: Crawling, Indexing and Ranking
Search engines are answer machines, as I described in Chapter 1. They work to discover, understand and organize the content of the internet in order to give the most relevant results to the questions asked by searchers.
Your content first needs to be visible to search engines in order to show up in search results. This is probably the most important piece of the SEO puzzle: if your site can’t be found, there’s no way you’ll ever show up in the SERPs (Search Engine Results Pages).
How do search engines work?
Search engines execute three main functions:
Crawl: Scour the internet for content, looking over the code and content for each URL they find.
Index: Store and organize the content found during crawling. Once a page is in the index, it is in the running to be displayed as a result for relevant queries.
Rank: Provide the pieces of content that best answer a searcher’s query, meaning results are ordered from most relevant to least relevant.
What is search engine crawling?
Crawling is the discovery process in which search engines send out a team of robots (called crawlers or spiders) to find new and updated content. Content can vary (it might be a webpage, an image, a video, a PDF, etc.), but regardless of format, content is discovered through links.
Googlebot begins by fetching a few web pages, and then follows the links on those pages to find new URLs. By hopping along this path of links, the crawler is able to find new content and add it to its index, called Caffeine (a massive database of discovered URLs), to be retrieved later when a searcher is looking for information that the content on that URL is a good match for.
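The link-hopping discovery described above can be sketched as a breadth-first walk over a link graph. This is only a toy illustration, not how Googlebot is actually built: the `LINKS` dictionary stands in for the web, and every URL in it is made up.

```python
from collections import deque

# Hypothetical link graph: each URL maps to the URLs it links to.
LINKS = {
    "https://example.com/": ["https://example.com/about", "https://example.com/blog"],
    "https://example.com/about": ["https://example.com/"],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
    "https://example.com/blog/post-1": [],
    "https://example.com/orphan": [],  # no page links here
}

def crawl(seeds):
    """Return URLs in discovery order, following links breadth-first."""
    discovered = list(seeds)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                discovered.append(link)
                queue.append(link)
    return discovered

found = crawl(["https://example.com/"])
print(found)
```

Notice that the `/orphan` URL is never discovered, because no page in the graph links to it. That is the same reason an unlinked page on a real site can stay out of sight.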
What is a search engine index?
Search engines compile and store the information they find in an index, a vast database of all the content they have discovered and deem good enough to serve to searchers.
Search engine ranking
When someone performs a search, search engines scan their index for highly relevant content and then order that content in the hopes of solving the searcher’s query. This ordering of search results by relevance is called ranking. Generally speaking, the higher a website is ranked, the more relevant the search engine believes that site is to the query.
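To make the idea of indexing and ranking a little more concrete, here is a minimal sketch. It assumes a tiny set of made-up pages and scores relevance purely by counting how often the query words appear, which is a stand-in I chose for illustration; real search engines weigh far more signals than this.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
PAGES = {
    "https://example.com/running-shoes": "running shoes for trail running",
    "https://example.com/dress-shoes": "leather dress shoes",
    "https://example.com/socks": "wool socks",
}

def build_index(pages):
    """Map each word to {url: number of times the word appears on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def rank(index, query):
    """Order URLs by how often the query words appear (a toy relevance score)."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

index = build_index(PAGES)
results = rank(index, "running shoes")
print(results)
```

The socks page never appears in `results` because none of the query words are stored against it in the index.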
It is possible to block search engine crawlers from part or all of your site, or to instruct search engines to avoid storing certain pages in their index. While there may be reasons to do this, if you want your content to be found by searchers, you first need to make sure it is accessible to crawlers and is indexable. Otherwise, it is as good as invisible.
By the end of this chapter, you’ll have the information you need to work with, rather than against, the search engine!
Crawling: Can search engines find your pages?
As you have just learned, making sure that your site is crawled and indexed is a prerequisite for appearing in the SERPs. If you already have a website, it might be a good idea to start by seeing how many of your pages are in the index. This will offer some great insights into whether Google is crawling and finding all the pages you want it to, and none that you don’t.
One way to check your indexed pages is the advanced search operator “site:yourdomain.com”. Head to Google and type “site:yourdomain.com” into the search bar. This will return the results that Google has in its index for the specified site.
The number of results Google displays (the “About XX results” line on the results page) is not exact, but it gives you a solid idea of which pages on your site are indexed and how they currently show up in the search results.
For more accurate results, monitor and use the Index Coverage report in Google Search Console. You can sign up for a free Google Search Console account if you don’t currently have one. With this tool, you can submit sitemaps for your site and monitor, among other things, how many submitted pages have actually been added to Google’s index.
If you do not show up in the search results anywhere, then there are a few possible reasons why:
Your site is brand new and it hasn’t been crawled yet.
Your site isn’t linked to from any external websites.
Your site’s navigation makes it difficult for a robot to crawl it effectively.
Your site contains some basic code called a crawler directive that blocks search engines.
Your site has been penalized by Google for spamming tactics.
Most people think about making sure that Google can find their important pages, but it’s easy to forget that there are likely pages you don’t want Googlebot to find. These might include things like old URLs with thin content, duplicate URLs (such as e-commerce sort-and-filter parameters), special promo code pages, staging or test pages, etc.
To direct Googlebot away from certain pages and sections of your site, use robots.txt.
How Googlebot handles robots.txt files
If Googlebot can’t find a robots.txt file for a site, it proceeds to crawl the site.
If Googlebot finds a robots.txt file for a site, it will usually abide by the suggestions and proceed to crawl the site.
If Googlebot encounters an error while trying to access a site’s robots.txt file and can’t determine whether one exists or not, it won’t crawl the site.
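As a rough sketch of how a well-behaved crawler reads these suggestions, the example below uses Python’s standard `urllib.robotparser` module. The robots.txt content and the URLs are invented for illustration, and this shows the Python library’s handling of the file rather than Googlebot’s own implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/
Disallow: /promo-codes/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a crawler identifying itself as "Googlebot" may fetch each URL.
allowed_blog = parser.can_fetch("Googlebot", "https://example.com/blog/")
allowed_staging = parser.can_fetch("Googlebot", "https://example.com/staging/new-design")
print(allowed_blog, allowed_staging)
```

Here the blog URL is allowed and the staging URL is not, because the staging path starts with a disallowed prefix.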
Not all web robots follow robots.txt. People with bad intentions (e.g., email address scrapers) build bots that don’t follow this protocol. In fact, some bad actors use robots.txt files to find where you have located your private content. Although it might seem logical to block crawlers from private pages such as login and administration pages so that they don’t show up in the index, placing the location of those URLs in a publicly accessible robots.txt file also means that people with malicious intent can find them more easily. Rather than listing these pages in your robots.txt file, it is better to NoIndex them and gate them behind a login form.
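The text above doesn’t spell out how a page is marked NoIndex. One common form is a robots meta tag in the page’s head, and the sketch below checks for it using Python’s built-in `html.parser`. The two sample pages are made up, and the helper names (`RobotsMetaParser`, `is_noindex`) are my own, not part of any SEO tool.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                self.directives.add(directive.strip().lower())

def is_noindex(html):
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

# Hypothetical pages for illustration.
ADMIN_PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head><body>Admin</body></html>'
PUBLIC_PAGE = "<html><head><title>Blog</title></head><body>Hello</body></html>"

print(is_noindex(ADMIN_PAGE), is_noindex(PUBLIC_PAGE))
```

Unlike a robots.txt entry, this tag sits on the page itself, so it doesn’t advertise the page’s location in a public file.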
Defining URL parameters in GSC
Some sites (most commonly e-commerce sites) make the same content available on multiple different URLs by appending certain parameters to their URLs. If you’ve ever shopped online, you have likely narrowed down your search through filters. On Amazon, for example, you can search for “shoes,” and then refine your search by size, color, and style. Each time you refine, the URL changes slightly.
How does Google know which version of the URL to serve to searchers? Google does a pretty good job of figuring out the representative URL on its own, but in Google Search Console, you can use the URL Parameters feature to tell Google exactly how you want it to treat your pages. When you use this feature to tell Googlebot to “crawl no URLs with a given parameter,” you are essentially asking Googlebot to hide this content, which may lead to the removal of those pages from the search results. That’s what you want if those parameters create duplicate pages, but it is not ideal if you want those pages to be indexed.
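To see why parameters create duplicates, here is a small sketch that strips sort-and-filter parameters from URLs and shows how several addresses collapse into one. The parameter names in `IGNORED_PARAMS` and the shop URLs are assumptions made up for this example. It is not what the URL Parameters feature does internally, just a way to picture the problem.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names of parameters that only sort or filter the same content.
IGNORED_PARAMS = {"sort", "color", "size"}

def strip_ignored_params(url):
    """Return the URL with the sort/filter parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))

# Hypothetical URLs that all lead to largely the same product listing.
urls = [
    "https://shop.example.com/shoes",
    "https://shop.example.com/shoes?color=red",
    "https://shop.example.com/shoes?color=red&sort=price",
    "https://shop.example.com/shoes?page=2&size=9",
]
unique = sorted({strip_ignored_params(u) for u in urls})
print(unique)
```

Four URLs reduce to two: the plain listing and its second page. The `page` parameter is kept because, in this sketch, it is assumed to lead to different content.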
Can crawlers find all your important content?
Now that you know some tactics for ensuring search engine crawlers stay away from your unimportant content, let’s learn about the optimizations that can help Googlebot find your important pages. Sometimes a search engine will be able to find parts of your site by crawling, but other pages or sections might be obscured for one reason or another. It is important to make sure that search engines are able to discover all the content you want indexed, and not just your homepage. Ask yourself: Can the bot crawl through your website, and not just to it?
Is your content hidden behind login forms?
If you require users to log in, fill out forms, or answer surveys before accessing certain content, search engines won’t see those protected pages. A crawler is definitely not going to log in.
Are you relying on search forms?
Robots cannot use search forms. Some people believe that if they place a search box on their site, search engines will be able to find everything that their visitors search for. This is not the case.
Is text hidden within non-text content?
Non-text media forms (images, video, GIFs, etc.) should not be used to display text that you wish to be indexed. While search engines are getting better at recognizing images, there is no guarantee that they will be able to read and understand the text in them just yet. It is always best to add text within the <HTML> markup of your webpage.
Can search engines follow the navigation of your site?
Just as a crawler needs to discover your site through links from other sites, it needs a path of links on your own site to guide it from page to page. If you have a page that you want search engines to find but it isn’t linked to from any other pages, it is as good as invisible. Many sites make the critical mistake of structuring their navigation in ways that are inaccessible to search engines, hindering their ability to get listed in search results.
Common navigation errors which can prevent crawlers from seeing all of your site:
Having a mobile navigation that shows different results than your desktop navigation
Personalization, or showing unique navigation to a specific type of visitor versus others, which could appear to be cloaking to a search engine crawler
Forgetting to link to a primary page on your website through your navigation. Remember, links are the paths crawlers follow to new pages!
That’s why it is vital that your website has clear navigation and helpful folder structures for URLs.
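One simple way to spot pages that crawlers can’t reach through your own navigation is to look for pages with no inbound internal links. The sketch below does that over a made-up site map of links. The page paths and the helper name `pages_without_inbound_links` are assumptions for illustration only.

```python
# Hypothetical site: each page maps to the internal pages it links to.
SITE_LINKS = {
    "/": ["/products", "/about"],
    "/products": ["/", "/products/shoes"],
    "/products/shoes": ["/products"],
    "/about": ["/"],
    "/landing/spring-sale": ["/products"],  # links out, but nothing links in
}

def pages_without_inbound_links(site_links, homepage="/"):
    """Return pages that no other page on the site links to."""
    linked_to = set()
    for source, targets in site_links.items():
        for target in targets:
            if target != source:  # a self-link does not help discovery
                linked_to.add(target)
    # The homepage is excluded because crawlers usually arrive there from outside.
    return sorted(
        page for page in site_links
        if page not in linked_to and page != homepage
    )

orphans = pages_without_inbound_links(SITE_LINKS)
print(orphans)
```

In this example the spring sale landing page is flagged: it links to other pages, but no page links to it, so a crawler following your navigation would never arrive there.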
Do you have clean information architecture?
Information architecture is the practice of organizing and labeling content on a website to improve efficiency and findability for users. The best information architecture is intuitive, meaning users shouldn’t have to think very hard to move through your website or to find something.
Are you using Sitemaps?
A sitemap is just what it sounds like: a list of URLs on your website that crawlers can use to discover and index your content. One of the best ways to ensure that Google finds your highest priority pages is to create a file that meets Google’s standards and submit it through Google Search Console. Although submitting a sitemap does not replace the need for good site navigation, it can certainly help crawlers follow a path to all of your important pages.
If your site has no other sites linking to it, you might still be able to get it indexed by submitting your XML sitemap in Google Search Console. There is no guarantee that Google will include a submitted URL in its index, but it is worth a try!
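For a sense of what an XML sitemap contains, here is a minimal sketch that builds one with Python’s standard library. It uses the `urlset`, `url` and `loc` elements from the sitemaps.org protocol. The URLs are placeholders, and a real file would usually also start with an XML declaration and be saved as something like sitemap.xml before you submit it.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return sitemap XML (as a string) listing each URL in a <loc> element."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Placeholder URLs for illustration.
sitemap_xml = build_sitemap([
    "https://example.com/",
    "https://example.com/products",
])
print(sitemap_xml)
```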