The ultimate guide to robots.txt (2023)

The robots.txt file is one of the main ways of telling a search engine where it can and can’t go on your website. All major search engines support the basic functionality it offers, but some of them respond to some additional rules, which can be helpful too. This guide covers all the ways to use robots.txt on your website.

Warning!

Any mistakes you make in your robots.txt can seriously harm your site, so make sure you read and understand the whole of this article before you dive in.

Table of contents

  • What is a robots.txt file?
  • What does the robots.txt file do?
  • Where should I put my robots.txt file?
  • Pros and cons of using robots.txt
  • Robots.txt syntax
  • Don’t block CSS and JS files in robots.txt
  • Test and fix in Google Search Console
  • Validate your robots.txt
  • See the code

What is a robots.txt file?

Crawl directives

The robots.txt file is one of a number of crawl directives. We have guides on all of them and you’ll find them here.

A robots.txt file is a text file read by search engines (and other systems). Also called the Robots Exclusion Protocol, the robots.txt file results from a consensus among early search engine developers. It’s not an official standard set by any standards organization, although all major search engines adhere to it.

A basic robots.txt file might look something like this:

User-Agent: *
Disallow:

Sitemap: https://www.example.com/sitemap_index.xml

What does the robots.txt file do?

Caching

Search engines typically cache the contents of the robots.txt so that they don’t need to keep downloading it, but will usually refresh it several times a day. That means that changes to instructions are typically reflected fairly quickly.

Search engines discover and index the web by crawling pages. As they crawl, they discover and follow links. This takes them from site A to site B to site C, and so on. But before a search engine visits any page on a domain it hasn’t encountered before, it will open that domain’s robots.txt file. That file tells it which URLs on that site it’s allowed to visit (and which ones it’s not).

Read more: Bot traffic: What it is and why you should care about it »
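
If you’re curious what that check looks like in practice, here’s a minimal sketch using Python’s built-in urllib.robotparser module; the domain and user-agent are just placeholders, and this parser only implements the original prefix-matching rules rather than every search engine extension.

from urllib import robotparser

# Fetch and parse the robots.txt file at the root of the (placeholder) domain
parser = robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# The same question a well-behaved crawler asks before requesting a URL
print(parser.can_fetch("Googlebot", "https://www.example.com/some-page/"))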

Where should I put my robots.txt file?

The robots.txt file should always be at the root of your domain. So if your domain is www.example.com, the crawler should find it at https://www.example.com/robots.txt.

It’s also essential that your robots.txt file is called robots.txt. The name is case-sensitive, so get that right, or it won’t work.

Pros and cons of using robots.txt

Pro: managing crawl budget

It’s generally understood that a search spider arrives at a website with a pre-determined “allowance” for how many pages it will crawl (or, how much resource/time it’ll spend, based on a site’s authority/size/reputation, and how efficiently the server responds). SEOs call this the crawl budget.

If you think your website has problems with crawl budget, then blocking search engines from ‘wasting’ energy on unimportant parts of your site might mean that they focus instead on the sections that do matter. Use the crawl cleanup settings in Yoast SEO Premium to help Google crawl what matters.

It can sometimes be beneficial to block the search engines from crawling problematic sections of your site, especially on sites where a lot of SEO clean-up has to be done. Once you’ve tidied things up, you can let them back in.

A note on blocking query parameters

One situation where crawl budget is crucial is when your site uses a lot of query string parameters to filter or sort lists. Let’s say you have ten different query parameters, each with different values that can be used in any combination (like t-shirts in multiple colors and sizes). This leads to many possible valid URLs, all of which might get crawled. Blocking query parameters from being crawled will help ensure the search engine only spiders your site’s main URLs and won’t go into the enormous spider trap you’d otherwise create.
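
To get a feel for how quickly this adds up, here’s a rough back-of-the-envelope sketch in Python; the numbers (ten parameters, five values each) are purely hypothetical.

# Hypothetical: 10 filter parameters, each either absent or set to one of 5 values.
# Every combination produces a distinct, crawlable URL.
parameters = 10
options_per_parameter = 5 + 1  # five values plus "not set"

possible_urls = options_per_parameter ** parameters
print(f"{possible_urls:,} possible URL variations")  # 60,466,176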

Con: not removing a page from search results

Even though you can use the robots.txt file to tell a crawler where it can’t go on your site, you can’t use it to tell a search engine which URLs not to show in the search results – in other words, blocking a URL won’t stop it from being indexed. If the search engine finds enough links to that URL, it will include it; it will just not know what’s on that page, so the listing will appear in the search results without a useful description.

If you want to reliably block a page from appearing in the search results, you need to use a meta robots noindex tag. That means that to find the noindex tag, the search engine has to be able to access that page, so don’t block it with robots.txt.

Noindex directives

It used to be possible to add ‘noindex’ directives to your robots.txt to remove URLs from Google’s search results and to avoid these ‘fragments’ showing up. This is no longer supported (and, officially, it never was).

Con: not spreading link value

If a search engine can’t crawl a page, it can’t spread the link value across the links on that page. It’s a dead-end when you’ve blocked a page in robots.txt. Any link value which might have flowed to (and through) that page is lost.

Robots.txt syntax

WordPress robots.txt

We have an entire article on how best to set up your robots.txt for WordPress. Don’t forget you can edit your site’s robots.txt file in the Yoast SEO Tools → File editor section.

A robots.txt file consists of one or more blocks of directives, each starting with a user-agent line. The “user-agent” is the name of the specific spider it addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or particular blocks for particular search engines. A search engine spider will always pick the block that best matches its name.

These blocks look like this (don’t be scared, we’ll explain below):

User-agent: * 
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /not-for-bing/

Directives like Allow and Disallow aren’t case-sensitive, so it’s up to you whether you write them in lowercase or capitalize them. The values are case-sensitive, however: /photo/ is not the same as /Photo/. We like to capitalize directives because it makes the file easier (for humans) to read.

The user-agent directive

The first bit of every block of directives is the user-agent, which identifies a specific spider. The user-agent field matches that spider’s (usually longer) user-agent string, so, for instance, the most common spider from Google has the following user-agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

If you want to tell this crawler what to do, a relatively simple User-agent: Googlebot line will do the trick.

Most search engines have multiple spiders. They will use a specific spider for their normal index, ad programs, images, videos, etc.

Search engines always choose the most specific block of directives they can find. Say you have three sets of directives: one for *, one for Googlebot and one for Googlebot-News. If a bot comes by whose user-agent is Googlebot-Video, it would follow the Googlebot restrictions. A bot with the user-agent Googlebot-News would use the more specific Googlebot-News directives.
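
To make that selection rule concrete, here’s a small, hypothetical sketch of “pick the best-matching block”: among the declared user-agent names that the crawler’s name starts with, the longest one wins, and * is the fallback. It mirrors the behavior described above; it’s not any search engine’s actual code.

def pick_block(crawler_name, declared_agents):
    # Among declared user-agents that prefix the crawler's name, the longest wins;
    # "*" is the fallback when nothing more specific matches.
    name = crawler_name.lower()
    matches = [agent for agent in declared_agents
               if agent != "*" and name.startswith(agent.lower())]
    return max(matches, key=len) if matches else "*"

blocks = ["*", "Googlebot", "Googlebot-News"]
print(pick_block("Googlebot-Video", blocks))  # Googlebot
print(pick_block("Googlebot-News", blocks))   # Googlebot-News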

The most common user agents for search engine spiders

Here’s a list of the user-agents you can use in your robots.txt file to match the most commonly used search engines:

Search engine   Field            User-agent
Baidu           General          baiduspider
Baidu           Images           baiduspider-image
Baidu           Mobile           baiduspider-mobile
Baidu           News             baiduspider-news
Baidu           Video            baiduspider-video
Bing            General          bingbot
Bing            General          msnbot
Bing            Images & Video   msnbot-media
Bing            Ads              adidxbot
Google          General          Googlebot
Google          Images           Googlebot-Image
Google          Mobile           Googlebot-Mobile
Google          News             Googlebot-News
Google          Video            Googlebot-Video
Google          AdSense          Mediapartners-Google
Google          AdWords          AdsBot-Google
Yahoo!          General          slurp
Yandex          General          yandex

The disallow directive

The second line in any block of directives is the Disallow line. You can have one or more of these lines, specifying which parts of the site the specified spider can’t access. An empty Disallow line means you’re not disallowing anything, so a spider can access all sections of your site.

The example below would block all search engines that “listen” to robots.txt from crawling your site.

User-agent: * 
Disallow: /

The example below would allow all search engines to crawl your entire site by dropping a single character.

User-agent: * 
Disallow:

The example below would block Google from crawling the Photo directory on your site – and everything in it.

User-agent: googlebot 
Disallow: /Photo

This means all the subdirectories of the /Photo directory would also not be spidered. It wouldnotblock Google from crawling the /photo directory, as these lines are case-sensitive.

This would also block Google from accessing URLs beginning with /Photo, such as /Photography/.

How to use wildcards/regular expressions

“Officially,” the robots.txt standard doesn’t support regular expressions or wildcards; however, all major search engines understand them. This means you can use lines like this to block groups of files:

Disallow: /*.php 
Disallow: /copyrighted-images/*.jpg

In the example above, * is expanded to whatever filename it matches. Note that the rest of the line is still case-sensitive, so the second line above will not block a file called /copyrighted-images/example.JPG from being crawled.

Some search engines, like Google, allow for more complicated regular expressions, but be aware that other search engines might not understand this logic. The most useful feature this adds is the $, which indicates the end of a URL. In the following example, you can see what this does:

Disallow: /*.php$

This means /index.php can’t be crawled, but /index.php?p=1 could be. Of course, this is only useful in very specific circumstances and pretty dangerous: it’s easy to unblock things you didn’t want to.

Non-standard robots.txt crawl directives

As well as the Disallow and User-agent directives, there are a couple of other crawl directives you can use. Not all search engine crawlers support these directives, so make sure you know their limitations.

The allow directive

While not in the original “specification,” there was talk very early on of an allow directive. Most search engines seem to understand it, and it allows for simple and very readable directives like this:

Disallow: /wp-admin/ 
Allow: /wp-admin/admin-ajax.php

The only other way of achieving the same result without an allow directive would have been to specifically disallow every single file in the wp-admin folder.

The crawl-delay directive

Crawl-delay is an unofficial addition to the standard, and not many search engines adhere to it. At least Google and Yandex don’t use it, with Bing being unclear. In theory, as crawlers can be pretty crawl-hungry, you could try the crawl-delay directive to slow them down.

A line like the one below would instruct those search engines to change how frequently they’ll request pages on your site.

crawl-delay: 10

Do take care when using the crawl-delay directive. By setting a crawl delay of ten seconds, you only allow these search engines to access 8,640 pages a day. This might seem plenty for a small site, but it isn’t very much for large sites. On the other hand, if you get next to no traffic from these search engines, it might be a good way to save some bandwidth.
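
If you want to double-check that arithmetic, or read the directive programmatically, here’s a quick sketch with Python’s urllib.robotparser; the rules string is a made-up example.

from urllib import robotparser

rules = """User-agent: *
Crawl-delay: 10
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

delay = parser.crawl_delay("bingbot")  # 10 seconds, from the "*" block
print(86400 / delay)                   # 8640.0 page requests a day, at most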

The sitemap directive for XML Sitemaps

Using the sitemap directive, you can tell search engines – Bing, Yandex, and Google – where to find your XML sitemap. You can, of course, submit your XML sitemaps to each search engine using their webmaster tools. We strongly recommend you do so because webmaster tools will give you a ton of information about your site. If you don’t want to do that, adding a sitemap line to your robots.txt is a good quick alternative. Yoast SEO automatically adds a link to your sitemap if you let it generate a robots.txt file. On an existing robots.txt file, you can add the rule by hand via the file editor in the Tools section.

Sitemap: https://www.example.com/my-sitemap.xml
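
As a side note, the sitemap line can also be read programmatically. Here’s a small sketch using Python’s urllib.robotparser (its site_maps() method needs Python 3.8 or newer); the URL is a placeholder.

from urllib import robotparser

rules = """User-agent: *
Disallow:

Sitemap: https://www.example.com/my-sitemap.xml
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())
print(parser.site_maps())  # ['https://www.example.com/my-sitemap.xml']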

Don’t block CSS and JS files in robots.txt

Since 2015, Google Search Console has warned site owners not to block CSS and JS files. We’ve been telling you the same thing for ages: don’t block CSS and JS files in your robots.txt. Let us explain why you shouldn’t block these specific files from Googlebot.

By blocking CSS and JavaScript files, you’re preventing Google from checking if your website works correctly. If you block CSS and JavaScript files in your robots.txt file, Google can’t render your website as intended. Now, Google can’t understand your website, which might result in lower rankings. What’s more, even tools like Ahrefs render web pages and execute JavaScript. So, don’t block JavaScript if you want your favorite SEO tools to work.

This aligns perfectly with the general assumption that Google has become more “human”. Google wants to see your website like a human visitor would, so it can distinguish the main elements from the extras. Google wants to know if JavaScript enhances the user experience or ruins it.

Test and fix in Google Search Console

Google helps you find and fix issues with your robots.txt, for instance, in the Page Indexing section in Google Search Console. Simply select the Blocked by robots.txt option.

Unblocking blocked resources comes down to changing your robots.txt file. You need to set that file up so that it doesn’t disallow Google access to your site’s CSS and JavaScript files anymore. If you’re on WordPress and use Yoast SEO, you can do this directly with our Yoast SEO plugin.

Validate your robots.txt

Various tools can help you validate your robots.txt, but when it comes to validating crawl directives, we always prefer to go to the source. Google has a robots.txt testing tool in its Google Search Console (under the ‘Old version’ menu), and we’d highly recommend using that.

Be sure to test your changes thoroughly before you put them live! You wouldn’t be the first to accidentally use robots.txt to block your entire site and slip into search engine oblivion!
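
If you’d like an extra, scriptable safety net on top of Google’s tool, here’s a rough sketch that parses a draft robots.txt and checks that a few must-stay-crawlable URLs aren’t accidentally blocked; the draft and URL list are placeholders. Keep in mind that Python’s built-in parser doesn’t implement Google’s wildcard extensions, so treat this as a sanity check, not a replacement for Google’s own tester.

from urllib import robotparser

draft = """User-agent: *
Disallow: /wp-admin/
"""

must_stay_crawlable = [
    "https://www.example.com/",
    "https://www.example.com/important-page/",
]

parser = robotparser.RobotFileParser()
parser.parse(draft.splitlines())

for url in must_stay_crawlable:
    if not parser.can_fetch("Googlebot", url):
        print(f"Warning: {url} would be blocked!")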

Behind the scenes of a robots.txt parser

In July 2019, Google announced they were making their robots.txt parser open source. If you want to get into the nuts and bolts, you can see how their code works (and even use it yourself or propose modifications to it).
