200 Indexable URLs
Hreflang Builder has numerous features that work to generate XML output that is as valid as possible. To ensure maximum indexation by search engines, the hreflang XML sitemaps should contain as few errors as possible so that search engines do not waste crawl resources. URLs that meet all of the validity criteria are commonly referred to as “200 Indexable URLs.” Most SEO diagnostic tools have a report that filters out URLs that do not meet these criteria.
It is essential to understand how your source files are created. Surprisingly, many CMS systems and even advanced SEO diagnostic tools do not apply all of the following criteria when generating output. Unless you filter the results, most SEO diagnostic tools will export the entire dataset of URLs, using various tags to indicate each URL’s status.
For the tools we have API connections to (DeepCrawl, OnCrawl, and Content King), Hreflang Builder’s API calls specifically request the list of 200 Indexable URLs. This is why a tool may indicate you have 500 total or unique URLs while Hreflang Builder only has 300. The 200 missing URLs most likely fail one or more of the following criteria.
200 Indexable URLs have all of the following:
1. 200 Status Header
The 200 status header refers to a page that, when visited, returns a “200 Success” HTTP header status. This means the page loads successfully at the requested URL in the browser. URLs with any other header status fail this first criterion and would require extra steps from the search engine to render a page. Google explains how they handle HTTP status codes.
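As an illustration only (not Hreflang Builder’s internal implementation), here is a minimal Python sketch that issues a single HEAD request and reports the raw status code without following redirects, so a 301 or 302 is not silently reported as a 200; the URL is a hypothetical example:

```python
# Minimal sketch: report the raw HTTP status of a URL without following
# redirects. A URL is only "200 Indexable" if it answers 200 itself.
import http.client
from urllib.parse import urlparse

def status_code(url: str) -> int:
    """Issue a single HEAD request and return the HTTP status code."""
    parts = urlparse(url)
    connection_class = (http.client.HTTPSConnection
                        if parts.scheme == "https"
                        else http.client.HTTPConnection)
    connection = connection_class(parts.netloc)
    connection.request("HEAD", parts.path or "/")
    status = connection.getresponse().status
    connection.close()
    return status

url = "https://www.example.com/en-us/"  # hypothetical URL for the example
print(f"{url} -> {status_code(url)} (include only if 200)")
```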
2. No Robots Exclusions
No Robots Exclusions means no elements are used to restrict search engines from accessing or indexing the page. If the website has a directive telling the Search Engine NOT to visit or index the page, we should not include that page in the XML sitemaps from Hreflang Builder. There are currently two specific types of exclusions; both are described below.
Robots.txt Exclusions – The robots.txt is a file placed at the root of the website. This file traditionally contains rules indicating which URLs or sections of the website are off-limits to spiders. When a reputable search engine or bot visits the site, it will check the robots.txt file for restrictions. Depending on the SEO diagnostic tool, it will either respect these settings but allow you to override them for testing, or it will ignore them but indicate with a tag that the page was restricted. Some SEO agencies have been using a technique called “robots.txt sculpting” to try to create specific paths through sites. Unfortunately, when done incorrectly, it can block key parts of the website from being indexed. How Google handles robots.txt exclusions.
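For illustration, this is a minimal Python sketch of the kind of robots.txt check a crawler performs, using the standard library’s robot parser; the site, the URL, and the Googlebot user-agent are assumptions made up for the example:

```python
# Minimal sketch: ask the live robots.txt whether a given user-agent may
# fetch a URL before treating that URL as indexable.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")  # hypothetical site
parser.read()  # fetch and parse the robots.txt file

url = "https://www.example.com/en-us/private/page.html"  # hypothetical URL
if parser.can_fetch("Googlebot", url):
    print("Allowed by robots.txt - the URL can stay in the sitemap")
else:
    print("Disallowed by robots.txt - drop the URL from the hreflang XML")
```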
Robots Meta Tags – The second form of exclusion happens at the individual URL level. You can add a meta tag to the <head> of the page to indicate whether the page can be indexed and whether links deeper into the website should be followed. If you do not want specific pages indexed by Google, then we should not include these pages in the output hreflang XML. How Google handles robots meta tag exclusions.
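As a rough illustration, the sketch below scans a page’s HTML for a robots meta tag and flags a noindex directive; the HTML snippet is made up for the example, and Hreflang Builder’s own parsing may differ:

```python
# Minimal sketch: collect the directives from any <meta name="robots"> tag
# and exclude the page from the hreflang XML if "noindex" is present.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of <meta name="robots"> tags in the page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("name", "").lower() == "robots":
            self.directives.extend(
                d.strip().lower()
                for d in attr_map.get("content", "").split(",")
            )

# Hypothetical page source for the example.
html = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
parser = RobotsMetaParser()
parser.feed(html)

if "noindex" in parser.directives:
    print("Page is excluded with noindex - leave it out of the hreflang XML")
```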
3. Self-Referencing Canonical
This is often one of the big problems we encounter: a URL meets the first two criteria but has a canonical pointing to another page. The canonical tag is used to tell search engines there is a preferred version of a URL. For example, if you are using tracking tags, you do not need dozens of variations of the same URL indexed; with a canonical tag, you can indicate the preferred version. Since you are telling the search engines to ignore the variations and use the canonical, we should not have those variations in the hreflang XML sitemaps. How Google handles canonical tags.
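For illustration, here is a minimal sketch of a self-referencing canonical check, assuming the canonical URL has already been extracted from the page’s <link rel="canonical"> tag; the URLs are hypothetical examples:

```python
# Minimal sketch: a URL only passes the third criterion when its canonical
# tag points back at the URL itself.
from urllib.parse import urlsplit

def is_self_canonical(page_url: str, canonical_url: str) -> bool:
    """True when the canonical URL points back at the crawled URL itself
    (treating host case and a trailing slash as insignificant)."""
    def normalize(url: str) -> tuple:
        parts = urlsplit(url)
        return (parts.scheme, parts.netloc.lower(),
                parts.path.rstrip("/") or "/", parts.query)
    return normalize(page_url) == normalize(canonical_url)

# Tracking-tag variation canonicalized to the clean URL: exclude it.
print(is_self_canonical("https://www.example.com/en-us/?utm_source=ad",
                        "https://www.example.com/en-us/"))   # False
# Clean URL whose canonical points to itself: keep it.
print(is_self_canonical("https://www.example.com/en-us/",
                        "https://www.example.com/en-us/"))    # True
```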