A Simple Guide to robots.txt Files (and How Not to Mess Them Up)

The robots.txt file is a simple but powerful tool in technical SEO. It tells search engines which parts of your site they can and can’t crawl. Handled well, it helps search engines focus on the right content. Handled poorly, it can quietly block important pages from ever being indexed.

Where It Lives

Your robots.txt file should always be placed in the root of your domain:
https://example.com/robots.txt

Basic Syntax

Here are the essentials:

User-agent: *
Disallow: /private/
  • User-agent: specifies which crawler you’re targeting (* means all bots).
  • Disallow: tells bots not to crawl specific paths.
  • Use Allow: to override broader disallow rules for specific files or folders.

For example:

User-agent: Googlebot
Disallow: /test/ 
Allow: /test/public-page.html

If you want bots to be able to crawl everything, use:

User-agent: * 
Disallow:

What It Doesn’t Do

  • robots.txt does NOT prevent indexing. If a page is linked to elsewhere, Google may still index it—even if it’s disallowed from crawling.
    • This is one of the most common mistakes I see when a site is trying to prevent content from being indexed.
  • To prevent indexing, use a noindex meta tag or remove the page entirely (see the example below).
    • You need to allow crawling of the page so that search spiders can see the noindex directive.
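
For example, a page you want kept out of search results would stay crawlable in robots.txt and carry the noindex directive in its HTML (the URL and filename here are just placeholders):

<!-- https://example.com/thank-you.html: crawlable, but excluded from the index -->
<head>
  <meta name="robots" content="noindex">
</head>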

Common Mistakes to Avoid with robots.txt Files

Despite being a simple file, robots.txt can cause major SEO issues if misused. Here are some of the most common mistakes and how to avoid them:

1. Blocking Important Pages or Directories by Accident
It’s easy to accidentally block content that should be crawled and indexed.

Example:

Disallow: /blog/

This prevents search engines from crawling all blog posts — even though those are likely some of your most valuable content pages.

Tip:
Before blocking any directory or URL path, ask yourself: Do I want this to appear in search results? If yes, don’t disallow it.
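
If only part of a section should be off-limits, scope the rule to the narrowest path that does the job. A quick sketch (the /blog/drafts/ path is purely illustrative):

User-agent: *
Disallow: /blog/drafts/

Everything else under /blog/ stays crawlable.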

2. Trying to Use robots.txt to Prevent Indexing
One of the most misunderstood parts of robots.txt: it only controls crawling, not indexing.

If a page is linked to from somewhere else, Google can still index it even if you’ve blocked it in robots.txt. Worse, if it’s blocked, Google can’t see your noindex tag—so it might index it anyway.

Wrong:

Disallow: /checkout/

Better:
Let Google crawl the page and use a <meta name="robots" content="noindex"> tag in the HTML. Or use an X-Robots-Tag HTTP header if it’s a non-HTML file, such as a PDF.
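
For instance, to keep PDFs out of the index you could send that header from the server. A minimal sketch assuming Apache with mod_headers enabled (the .pdf pattern is just an illustration):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>

The response then carries X-Robots-Tag: noindex, which Google honors the same way it honors the meta tag, as long as the file itself isn’t blocked in robots.txt.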

3. Blocking All Crawlers (Intentionally or Not)
It’s shockingly common to find this in production:

User-agent: *
Disallow: /

This tells all search engines not to crawl anything. It’s often used during staging or development and accidentally left live when the site goes public.

Always check robots.txt before launching.

4. Disallowing JavaScript, CSS, or Assets Needed for Rendering
Modern websites rely heavily on CSS and JavaScript. Blocking these assets can prevent Google from rendering the page correctly, which can affect how it understands and ranks your content.

Wrong:

Disallow: /wp-content/

That directory often contains stylesheets and scripts essential to how your site looks and works.

Tip:
Make sure critical assets are crawlable—especially if you’re optimizing for Core Web Vitals or mobile usability.
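
If a parent directory really does need to be disallowed, you can carve the assets back out with more specific Allow rules. A sketch (Google and Bing support the * wildcard, and the paths are illustrative):

User-agent: *
Disallow: /wp-content/plugins/
Allow: /wp-content/plugins/*.css
Allow: /wp-content/plugins/*.js

The longer, more specific Allow rules win over the shorter Disallow, so the stylesheets and scripts stay crawlable.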

5. Typos or Misplaced Rules
Simple typos like Disalow: instead of Disallow: or placing the file in the wrong directory (example.com/pages/robots.txt instead of example.com/robots.txt) render your directives useless.

Tip:
Use tools like Google Search Console’s robots.txt Tester to validate the file and catch errors before they go live.

6. Forgetting That robots.txt Is Public
Anyone can visit yourdomain.com/robots.txt and see what you’ve blocked. It’s not a security feature. Don’t try to use it to hide sensitive URLs.

If something truly needs to stay private, use authentication, password protection, or remove it from the web entirely.

Avoid These, and You’re Ahead of the Pack

Most robots.txt mistakes are unintentional—but they can quietly kill your rankings. With just a little care and testing, you can make sure your file helps search engines do their job without getting in the way of your own.

TL;DR

  • Use robots.txt to control crawling, not indexing.
  • Be cautious with Disallow:—especially on folders.
  • Don’t try to hide sensitive info with it. It’s public.
  • Test before you publish.

Set it up right, and it’ll help search engines focus on what matters most.
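
For reference, here’s a minimal file that pulls the examples above together (the paths are placeholders for your own low-value sections):

# Everything not listed below stays crawlable
User-agent: *
Disallow: /private/
Allow: /private/public-page.html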

