
Robots.txt and meta robots. Two tools, two jobs, regularly confused.

Robots.txt blocks crawl. Meta robots controls indexing. Mixing them up is the single most common technical SEO mistake we see, and the one that does the most damage when it goes wrong.

What each tool actually does

Robots.txt is a plain-text file that lives at the root of your domain. Its job is to tell crawlers (Googlebot, Bingbot, the lot) which URLs they are allowed to fetch. Think of it as a sign on the front door. The crawler reads the sign before walking in. If the sign says "do not enter", the crawler does not enter. The URL is never fetched, never rendered, never analysed.

Meta robots is a tag in the <head> of an individual HTML page. It looks like this: <meta name="robots" content="noindex">. Its job is to tell search engines what to do with the page after they have already crawled it. Index it or not. Follow the links on it or not. Show a cached copy in search results or not. The crawler reads the tag during the fetch, then acts on the directive.

The two tools intersect awkwardly. If you block a URL in robots.txt, Googlebot cannot fetch it, which means Googlebot also cannot see any meta robots tag on the page. So combining both on the same URL does not work: the noindex is never read. You pick one tool depending on what you actually want.

One more nuance most people miss. A page blocked in robots.txt can still appear in Google's index. If other sites link to that blocked URL, Google knows it exists and may show it in results, just without a meta description and with a generic title. To fully remove a page from the index, you need meta robots noindex, which requires the page to be crawlable in the first place.
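The interaction is easier to see as a truth table. Here is a minimal Python sketch of the logic described above. The function name and the outcome strings are ours, for illustration only; real search engines weigh many more signals than these three flags.

```python
def index_outcome(blocked_by_robots_txt: bool,
                  has_noindex: bool,
                  linked_externally: bool = False) -> str:
    """Rough model of what happens to one URL (illustrative, not Google's logic)."""
    if blocked_by_robots_txt:
        # The page is never fetched, so any noindex tag on it is never seen.
        if linked_externally:
            return "not crawled, may still be indexed as a bare URL"
        return "not crawled"
    if has_noindex:
        return "crawled, kept out of the index"
    return "crawled, eligible for the index"


# The classic conflict: blocked AND noindexed. The noindex has no effect.
print(index_outcome(blocked_by_robots_txt=True, has_noindex=True, linked_externally=True))
```

Note that has_noindex only changes the result when the URL is crawlable, which is the whole point of the section above.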

Why the distinction matters

Two reasons it matters enough to be its own chapter. First, the failure modes are severe. A misplaced robots.txt line can de-index hundreds of pages. A missed noindex on a thank-you page can result in your conversion funnel showing up in Google. Both are recoverable. Neither is fun.

Second, the right choice depends on intent. Pages you do not want Google to spend crawl time on (filter URLs, search results, large faceted catalogues) belong in robots.txt. Pages you want fully removed from the index (thank-you pages, internal admin URLs, staging copies) need a meta robots noindex and must stay crawlable. Pages you want kept private (members-only content, internal-only documents) belong behind authentication, not behind a robots directive. The three are not interchangeable.

Twice a year on average, across our Australian client base, a developer pushes a staging robots.txt to production. Daily monitoring catches it before traffic drops. Without monitoring, the recovery takes weeks.

Robots.txt syntax in 5 minutes

The file is plain text. UTF-8 encoded. Lives at https://yourdomain.com.au/robots.txt and nowhere else. Each block of rules targets a user-agent. The most common pattern:

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /*?utm_

Sitemap: https://yourdomain.com.au/sitemap.xml

What that means: for all crawlers (*), do not fetch URLs under /admin/, do not fetch URLs under /cart/, and do not fetch URLs containing ?utm_. The sitemap URL at the end is a separate directive that tells crawlers where the XML sitemap lives.

Three syntax notes that catch people out:

  1. Pattern matching uses * as a wildcard and $ to anchor to the end. Disallow: /*.pdf$ blocks all PDF files. Disallow: /*? blocks every URL containing a question mark.
  2. Allow rules can override Disallow rules, but only when they are more specific. Google applies the longest matching rule regardless of the order the lines appear in, so Disallow: /products/ combined with Allow: /products/featured/ opens up the subdirectory.
  3. Comments start with #. Use them. Future-you will thank past-you for explaining why a rule exists.
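To see how these rules combine, here is a rough Python sketch of the matching behaviour as Google documents it: * matches any run of characters, $ anchors the end of the URL, and the longest matching rule wins, with Allow winning a tie. The function names are hypothetical. The sketch ignores user-agent groups, percent-encoding and other details a real parser handles, so treat it as a teaching aid, not a replacement for testing.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the path start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape everything, then turn the escaped wildcard back into ".*".
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """rules: list of ("allow" | "disallow", pattern) pairs for one user-agent."""
    best = None  # (pattern length, is_allow); longer wins, allow wins ties
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty Disallow matches nothing
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


rules = [
    ("disallow", "/admin/"),
    ("disallow", "/*?utm_"),
    ("disallow", "/*.pdf$"),
    ("disallow", "/products/"),
    ("allow", "/products/featured/"),
]

for path in ["/admin/login", "/products/shoes", "/products/featured/shoes",
             "/files/guide.pdf", "/blog/post?page=2&utm_source=x"]:
    print(path, is_allowed(rules, path))
```

One thing the last test path shows: /*?utm_ only matches when utm_ is the first query parameter. A URL where it appears after an ampersand slips through unless you add a second rule for it.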

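Test every change before it ships. Google retired the standalone Robots.txt Tester in late 2023; Search Console now has a robots.txt report that shows the file Google last fetched and any errors in it, and the URL Inspection tool reports whether a specific URL is blocked. Screaming Frog's "Custom Robots" feature lets you run a draft file against a crawl before it goes live. Never ship a robots.txt change directly to production without testing.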

Meta robots directives in 5 minutes

Meta robots is a single line in the <head>:

<meta name="robots" content="noindex,follow">

The directives you actually use:

  • index / noindex. Index the page or do not. Default is index.
  • follow / nofollow. Pass authority through the links on the page or do not. Default is follow.
  • noarchive. Do not show a cached copy of the page in search results.
  • nosnippet. Do not show a snippet of the page text in search results.
  • max-snippet:120. Limit the snippet to a specific character count.
  • noimageindex. Do not index images on the page.

You can combine directives with commas. noindex,nofollow is the strongest "leave this alone" combo and the right call for admin pages, internal search results and staging copies. The noindex/nofollow glossary entry covers the nuances.
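If you want to check which directives a page actually ships, a few lines of Python are enough. This is a minimal sketch using only the standard library; the class and function names are ours, and it only reads the generic name="robots" tag, not crawler-specific variants such as name="googlebot".

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def meta_robots(html: str) -> set:
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(sorted(meta_robots(page)))
```

An empty set means the page has no meta robots tag, so the defaults (index, follow) apply.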

For non-HTML files like PDFs, use the HTTP X-Robots-Tag header instead:

X-Robots-Tag: noindex

Most servers can set this in the response config. Same directives apply.
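As one illustration, here is a minimal sketch for a site served by a Python WSGI application. That setup is our assumption, and the function names are hypothetical; on Apache or nginx you would set the header in the server config instead. The wrapper adds X-Robots-Tag: noindex to any response whose path ends in .pdf.

```python
def add_x_robots_tag(app, suffixes=(".pdf",), value="noindex"):
    """WSGI middleware: append X-Robots-Tag to responses for matching paths."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start(status, headers, exc_info=None):
            if path.lower().endswith(suffixes):
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application used only for this example."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"demo"]


application = add_x_robots_tag(demo_app)
```

Whichever layer sets the header, confirm it on the live URL by inspecting the response headers, since a CDN or proxy in front of the server can strip or override it.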

Common mistakes

Do
  • Read your robots.txt out loud once a quarter.
  • Block faceted-navigation parameters in robots.txt on e-commerce sites.
  • Use noindex on thank-you pages, internal search results and admin URLs.
  • Link to your sitemap from inside robots.txt.
  • Monitor the live robots.txt daily with an uptime tool that alerts on content changes.
Do not
  • Ship Disallow: / to production. Ever. Check twice before deploys.
  • Block a URL in robots.txt and also try to noindex it. Google cannot see the noindex if it cannot crawl the page.
  • Use robots.txt to hide sensitive URLs. Anyone can read robots.txt, so listing private paths there hands attackers a directory of what to target.
  • Forget to allow Googlebot access to CSS and JavaScript files. Blocking these breaks rendering.
  • Leave staging robots.txt rules in production. Audit after every deploy.
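The worst item on that list, a stray Disallow: /, is easy to catch automatically before a deploy. Below is a minimal Python sketch of such a check. The function name is hypothetical, and the parsing is deliberately simplified: it does not consider Allow rules that might partly re-open the site, so treat a positive result as a prompt to look, not as a verdict.

```python
def blocks_whole_site(robots_txt: str, agent: str = "*") -> bool:
    """True if the group for `agent` contains a bare 'Disallow: /'."""
    agents, in_rules = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A user-agent line after rules starts a new group.
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and agent.lower() in agents:
                return True
    return False


staging = "User-agent: *\nDisallow: /  # keep staging out of search\n"
production = "User-agent: *\nDisallow: /admin/\nSitemap: https://yourdomain.com.au/sitemap.xml\n"
print(blocks_whole_site(staging), blocks_whole_site(production))
```

Wired into a deploy pipeline, a check like this can fail the build when the production file blocks everything.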

Tools and checklists

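  1. Search Console robots.txt report and URL Inspection. Free. The report shows the robots.txt file Google last fetched and any errors in it; URL Inspection tells you whether Googlebot is allowed to fetch a specific URL. Together they replace the standalone Robots.txt Tester, which Google retired in late 2023.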
  2. Screaming Frog. Free up to 500 URLs. The "Response Codes" and "Directives" reports show which URLs are blocked by robots.txt and which meta robots directives each crawled page carries.
  3. A change-monitoring tool. Set up Uptime Robot, Checkly or a similar service to alert when /robots.txt changes. We have caught two production-robots-txt disasters in the last year just from this alert.
  4. Our free SEO audit tool. Flags robots.txt issues and meta robots conflicts in one pass. Run a free audit.
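If you would rather script the change monitoring from item 3 yourself, the core of it is a fingerprint comparison. This is a minimal Python sketch; the function names are ours, and fetching the live file, storing the last fingerprint and sending the alert are left out.

```python
import hashlib


def fingerprint(robots_txt: str) -> str:
    """Hash the file after normalising line endings and trailing whitespace."""
    lines = robots_txt.replace("\r\n", "\n").strip().split("\n")
    normalised = "\n".join(line.rstrip() for line in lines)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_robots_txt: str) -> bool:
    """Compare the live file against the fingerprint saved on the last check."""
    return fingerprint(current_robots_txt) != previous_fingerprint


known_good = fingerprint("User-agent: *\nDisallow: /admin/\n")
print(has_changed(known_good, "User-agent: *\nDisallow: /\n"))
```

Normalising first means a change in line endings alone does not trigger an alert, while any change to a rule does.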

Perth and WA context

The three robots.txt patterns we see most often on Australian sites:

The Shopify default. Shopify ships with a robots.txt that blocks /collections/ filter parameters by default, which is sensible. It also blocks a couple of paths some merchants actually need indexed (collection sort orders for very large catalogues). Customising Shopify's robots.txt is now possible by editing the robots.txt.liquid template in the theme code; most Perth Shopify stores have never done it. Worth a 20-minute audit on any Fremantle, Cottesloe or Cockburn boutique.

The WordPress + plugin combo. WordPress writes a virtual robots.txt by default, then plugins like Yoast, Rank Math and All-in-One SEO override it. The result is usually fine. The exception is when two plugins are active and fighting each other, or when a theme has been customised to ship its own physical robots.txt that conflicts with the plugin. We see this on small business sites where multiple developers have touched the codebase over the years.

The custom-CMS legacy site. Mining-supply businesses in Karratha and Port Hedland running legacy ASP.NET or PHP CMS platforms often have a robots.txt set up a decade ago by a developer long gone. It blocks paths that no longer exist, allows paths that should be blocked, and never gets revisited. A morning's audit usually finds two or three real wins.

The ecommerce SEO industry pattern and mining SEO industry pattern both cover these recurring shapes in more detail.

Frequently asked

What is robots.txt in plain English?
Robots.txt is a text file at the root of your website that tells search engine crawlers which URLs they are allowed to fetch. It is the front-door sign. It blocks crawl, not indexing. A page blocked in robots.txt can still appear in Google if other sites link to it, just without a description.
What is the difference between robots.txt and noindex?
Robots.txt prevents Googlebot from crawling a URL. A noindex meta tag tells Google not to keep the URL in the index. You cannot use both on the same URL because Google has to crawl the page to see the noindex tag. The two tools work at different layers of the funnel.
Where does the robots.txt file live?
At the root of your domain, always. https://example.com.au/robots.txt is the only location Google checks. Subdirectory variants like /folder/robots.txt are ignored.
Can a wrong robots.txt kill my SEO?
Yes, and quickly. A single line of Disallow: / stops Google crawling your entire site, and your pages lose their snippets and rankings over the following crawl cycles. We see this happen on average twice a year across Australian client sites. The fix is one character. The damage takes a fortnight to reverse.
What is the X-Robots-Tag header?
It is a way to send robots directives via HTTP headers instead of the HTML meta tag. Handy when the file is not HTML at all, such as PDFs, images and videos where you cannot add a meta tag. Same directives apply: noindex, nofollow, noarchive.