The `robots.txt` file tells search engines which of your site's pages they can crawl. An invalid `robots.txt` configuration can cause two types of problems:

- It can keep search engines from crawling public pages, causing your content to show up less often in search results.
- It can cause search engines to crawl pages you may not want shown in search results.
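For context, here's a minimal sketch of a `robots.txt` file; the /private/ path and the sitemap URL are placeholders, not recommendations for your site:

```
# Applies to all crawlers.
user-agent: *
disallow: /private/

sitemap: https://example.com/sitemap.xml
```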
How the Lighthouse `robots.txt` audit fails
Lighthouse flags invalid `robots.txt` files.

Expand the "robots.txt is not valid" audit in your report to learn what's wrong with your `robots.txt`.
Common errors include:

- No user-agent specified
- Pattern should either be empty, start with "/" or "*"
- Unknown directive
- Invalid sitemap URL
- $ should only be used at the end of the pattern
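As an illustration, here's a hypothetical fragment that would trigger several of these errors; the directive names and paths are invented, and the missing `user-agent` line would also be flagged:

```
disallow: downloads/      # Pattern should either be empty, start with "/" or "*"
noindex: /private/        # Unknown directive
sitemap: sitemap.xml      # Invalid sitemap URL
allow: /blog$/archive     # $ should only be used at the end of the pattern
```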
Lighthouse doesn't check that your `robots.txt` file is in the correct location. To function correctly, the file must be in the root of your domain or subdomain.
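For example, assuming your site is served from https://example.com, crawlers only request the file at the root of that origin; a copy in a subdirectory (the /blog/ path here is hypothetical) won't be used:

```
https://example.com/robots.txt        # read by crawlers
https://example.com/blog/robots.txt   # not used by crawlers
```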
How to fix problems with `robots.txt`

Make sure `robots.txt` doesn't return an HTTP 5XX status code
If your server returns a server error (an HTTP status code in the 500s) for `robots.txt`, search engines won't know which pages should be crawled. They may stop crawling your entire site, which would prevent new content from being indexed.

To check the HTTP status code, open `robots.txt` in Chrome and check the request in Chrome DevTools.
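Alternatively, here's a minimal sketch you can paste into the DevTools console while on your site; it assumes `robots.txt` is served from the current origin and simply logs the status code:

```js
// Request robots.txt from the current origin and log its HTTP status code.
// A code in the 500s means crawlers are receiving a server error.
const response = await fetch('/robots.txt');
console.log(`robots.txt returned ${response.status} ${response.statusText}`);
```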
Keep `robots.txt` smaller than 500 KiB
Search engines may stop processing `robots.txt` midway through if the file is larger than 500 KiB. This can confuse the search engine, leading to incorrect crawling of your site.

To keep `robots.txt` small, focus less on individually excluded pages and more on broader patterns. For example, if you need to block crawling of PDF files, don't disallow each individual file. Instead, disallow all URLs containing .pdf by using `disallow: /*.pdf`.
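For example, a single wildcard rule can replace hundreds of per-file entries (a sketch, assuming you want to block PDFs for every crawler):

```
user-agent: *
# One pattern covers every URL containing .pdf.
disallow: /*.pdf
```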
Fix any format errors
- Only empty lines, comments, and directives matching the "name: value" format are allowed in `robots.txt`.
- Make sure `allow` and `disallow` values are either empty or start with `/` or `*`.
- Don't use `$` in the middle of a value (for example, `allow: /file$html`).
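Putting those rules together, a correctly formatted file might look like this; the paths are placeholders:

```
user-agent: *
# Values are empty or start with "/" or "*".
disallow: /private/
allow: /private/public-page.html
# "$" appears only at the end of a pattern.
disallow: /*.pdf$
```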
Make sure there's a value for `user-agent`

User-agent names tell search engine crawlers which directives to follow. You must provide a value for each instance of `user-agent` so search engines know whether to follow the associated set of directives.

To specify a particular search engine crawler, use a user-agent name from its published list. (For example, here's Google's list of user-agents used for crawling.)

Use `*` to match all otherwise unmatched crawlers.
```
user-agent:
disallow: /downloads/
```

No user agent is defined.

```
user-agent: *
disallow: /downloads/

user-agent: magicsearchbot
disallow: /uploads/
```

A general user agent and a `magicsearchbot` user agent are defined.
Make sure there are no `allow` or `disallow` directives before `user-agent`

User-agent names define the sections of your `robots.txt` file. Search engine crawlers use those sections to determine which directives to follow. Placing a directive before the first user-agent name means that no crawlers will follow it.
```
# start of file
disallow: /downloads/

user-agent: magicsearchbot
allow: /
```

No search engine crawler will read the `disallow: /downloads` directive.

```
# start of file
user-agent: *
disallow: /downloads/
```

All search engines are disallowed from crawling the /downloads folder.
Search engine crawlers only follow directives in the section with the most specific user-agent name. For example, if you have directives for `user-agent: *` and `user-agent: Googlebot-Image`, Googlebot Images will only follow the directives in the `user-agent: Googlebot-Image` section.
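For instance, in the following sketch (the paths are placeholders), Googlebot Images ignores the `user-agent: *` section and can still crawl /archive/:

```
user-agent: *
disallow: /archive/

user-agent: Googlebot-Image
# The most specific matching section wins, so only this one applies to Googlebot Images.
disallow: /images/private/
```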
Provide an absolute URL for `sitemap`

Sitemap files are a great way to let search engines know about pages on your website. A sitemap file generally includes a list of the URLs on your website, together with information about when they were last changed.

If you choose to submit a sitemap file in `robots.txt`, make sure to use an absolute URL.
```
sitemap: /sitemap-file.xml
```

The sitemap URL is relative, so it will be flagged as invalid.

```
sitemap: https://example.com/sitemap-file.xml
```

The sitemap URL is absolute.