What each tool actually does
Robots.txt is a plain-text file that lives at the root of your domain. Its job is to tell crawlers (Googlebot, Bingbot, the lot) which URLs they are allowed to fetch. Think of it as a sign on the front door. The crawler reads the sign before walking in. If the sign says "do not enter", the crawler does not enter. The URL is never fetched, never rendered, never analysed.
Meta robots is a tag in the <head> of an individual HTML page. It looks like this: <meta name="robots" content="noindex">. Its job is to tell search engines what to do with the page after they have already crawled it. Index it or not. Follow the links on it or not. Show a cached copy in search results or not. The crawler reads the tag during the fetch, then acts on the directive.
The two tools intersect awkwardly. If you block a URL in robots.txt, Googlebot cannot fetch it, which means Googlebot also cannot see any meta robots tag on the page. So combining both on the same URL does not work: the block hides the tag. You pick one tool depending on what you actually want.
One more nuance most people miss. A page blocked in robots.txt can still appear in Google's index. If other sites link to that blocked URL, Google knows it exists and may show it in results, just without a meta description and with a generic title. To fully remove a page from the index, you need meta robots noindex, which requires the page to be crawlable in the first place.
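The interaction described above can be written down as a small decision function. This is a simplified model of the behaviour this chapter describes, not Google's actual pipeline; the function name `index_outcome`, its parameters and its return strings are all illustrative.

```python
def index_outcome(blocked_in_robots_txt: bool,
                  has_noindex: bool,
                  has_external_links: bool) -> str:
    """Simplified model of how robots.txt and meta robots interact for one URL."""
    if blocked_in_robots_txt:
        # The page is never fetched, so any noindex tag on it is invisible.
        if has_external_links:
            return "may be indexed as a bare URL"
        return "not crawled"
    if has_noindex:
        # The crawler fetched the page, read the tag, and acts on it.
        return "crawled, kept out of the index"
    return "crawled and eligible for indexing"
```

Note that the `has_noindex` argument has no effect once `blocked_in_robots_txt` is true, which is the whole point of the section.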
Why the distinction matters
Two reasons it matters enough to be its own chapter. First, the failure modes are severe. A misplaced robots.txt line can de-index hundreds of pages. A missed noindex on a thank-you page can result in your conversion funnel showing up in Google. Both are recoverable. Neither is fun.
Second, the right choice depends on intent. Pages you do not want Google to spend crawl time on (filter URLs, search results, large faceted catalogues) belong in robots.txt. Pages you want fully removed from the index (thank-you pages, internal admin URLs, staging copies) need a meta robots noindex. Pages you want kept private (members-only content, internal-only documents) belong behind authentication, not behind a robots directive. The three are not interchangeable.
Robots.txt syntax in 5 minutes
The file is plain text. UTF-8 encoded. Lives at https://yourdomain.com.au/robots.txt and nowhere else. Each block of rules targets a user-agent. The most common pattern:
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /*?utm_
Sitemap: https://yourdomain.com.au/sitemap.xml
What that means: for all crawlers (*), do not fetch URLs under /admin/, do not fetch URLs under /cart/, and do not fetch URLs containing ?utm_. The sitemap URL at the end is a separate directive that tells crawlers where the XML sitemap lives.
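To sanity-check the prefix rules in that file, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The wildcard line is left out on purpose: as far as we know, the standard-library parser matches plain path prefixes and does not evaluate `*` the way Googlebot does. The helper name `allowed` is an assumption made for this example.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, minus the wildcard rule (see the note above).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Sitemap: https://yourdomain.com.au/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str, agent: str = "Googlebot") -> bool:
    """Return True if the parsed rules let `agent` fetch `path`."""
    return parser.can_fetch(agent, "https://yourdomain.com.au" + path)


print(allowed("/admin/users"))   # a URL under /admin/
print(allowed("/products/"))     # a URL no rule mentions
print(parser.site_maps())        # sitemap URLs declared in the file
```

This shows how Python's parser reads the file, which is a useful first check but not a guarantee of how any particular search engine will behave.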
Three syntax notes that catch people out:
- Pattern matching uses * as a wildcard and $ to anchor to the end. Disallow: /*.pdf$ blocks all PDF files. Disallow: /*? blocks every URL containing a question mark.
- Allow rules can override Disallow rules but only if they are more specific. Disallow: /products/ followed by Allow: /products/featured/ opens up the subdirectory.
- Comments start with #. Use them. Future-you will thank past-you for explaining why a rule exists.
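The wildcard and "more specific wins" rules above can be sketched in a few lines of Python. This is an illustrative matcher, not a full robots.txt implementation: it takes "more specific" to mean the longest matching pattern, lets Allow win a tie, and ignores user-agent grouping. The names `pattern_to_regex` and `is_allowed` are made up for this example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex matched from the start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape everything, then turn the escaped * back into "match anything".
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(path: str, rules: list) -> bool:
    """`rules` is a list of ("allow" | "disallow", pattern) pairs.

    `path` is the URL path including any query string.
    """
    best_len, verdict = -1, True  # no matching rule means allowed
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            longer = len(pattern) > best_len
            tie_allow = len(pattern) == best_len and kind == "allow"
            if longer or tie_allow:
                best_len, verdict = len(pattern), kind == "allow"
    return verdict


RULES = [
    ("disallow", "/*.pdf$"),
    ("disallow", "/*?"),
    ("disallow", "/products/"),
    ("allow", "/products/featured/"),
]
```

With these rules, `is_allowed("/products/featured/widget", RULES)` is allowed because the Allow pattern is longer than the Disallow pattern that also matches.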
Test every change in the Robots.txt Tester (still available inside Search Console at the time of writing) or with Screaming Frog's "Custom Robots" feature. Never ship a robots.txt change directly to production without testing.
Meta robots directives in 5 minutes
Meta robots is a single line in the <head>:
<meta name="robots" content="noindex,follow">
The directives you actually use:
- index / noindex. Index the page or do not. Default is index.
- follow / nofollow. Pass authority through the links on the page or do not. Default is follow.
- noarchive. Do not show a cached copy of the page in search results.
- nosnippet. Do not show a snippet of the page text in search results.
- max-snippet:120. Limit the snippet to a specific character count.
- noimageindex. Do not index images on the page.
You can combine directives with commas. noindex,nofollow is the strongest "leave this alone" combo and the right call for admin pages, internal search results and staging copies. The noindex/nofollow glossary entry covers the nuances.
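To audit what a page actually declares, a short script can pull the directives out of the HTML. The sketch below uses Python's standard-library `html.parser`; the class and function names are assumptions for illustration, and it only reads the generic robots tag, not crawler-specific ones such as a tag named googlebot.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated; normalise case and whitespace.
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def meta_robots_directives(html: str) -> set:
    """Return the set of meta robots directives found in `html`."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(meta_robots_directives(page))
```

An empty set means the page declares nothing, so the defaults (index, follow) apply.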
For non-HTML files like PDFs, use the HTTP X-Robots-Tag header instead:
X-Robots-Tag: noindex
Most servers can set this in the response config. Same directives apply.
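How you set the header depends on your server, so treat the following as a sketch of the logic rather than a config to copy. It assumes a hypothetical Python application that builds its own response headers; the function name `extra_headers` and the choice to match on the .pdf suffix are illustrative.

```python
def extra_headers(path: str) -> dict:
    """Return extra response headers for the file at `path`.

    PDFs cannot carry a meta robots tag, so the directive travels
    in the HTTP response header instead.
    """
    headers = {}
    if path.lower().endswith(".pdf"):
        headers["X-Robots-Tag"] = "noindex"
    return headers


print(extra_headers("/downloads/price-list.pdf"))
print(extra_headers("/index.html"))
```

As with the meta tag, the crawler only sees this header if the file is not blocked in robots.txt.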
Common mistakes
Do:
- Read your robots.txt out loud once a quarter.
- Block faceted-navigation parameters in robots.txt on e-commerce sites.
- Use noindex on thank-you pages, internal search results and admin URLs.
- Link to your sitemap from inside robots.txt.
- Monitor the live robots.txt daily with an uptime tool that alerts on content changes.
Don't:
- Ship Disallow: / to production. Ever. Check twice before deploys.
- Block a URL in robots.txt and also try to noindex it. Google cannot see the noindex if it cannot crawl the page.
- Use robots.txt to hide sensitive URLs. Anyone can read robots.txt, so listing private paths there hands attackers a directory of what to target.
- Forget to allow Googlebot access to CSS and JavaScript files. Blocking these breaks rendering.
- Leave staging robots.txt rules in production. Audit after every deploy.
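Two of these mistakes are easy to catch automatically before a deploy. The sketch below is a deliberately small lint, assuming you can read the robots.txt you are about to ship as a string; the function name `lint_robots_txt` and the warning wording are made up for this example, and it does not replace testing individual URLs.

```python
def lint_robots_txt(text: str) -> list:
    """Flag a blanket Disallow: / and rules that block CSS or JS files."""
    warnings = []
    agents = []        # user-agents of the group currently being read
    in_rules = False   # True once the current group has a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A user-agent line after rules starts a new group.
                agents, in_rules = [], False
            agents.append(value)
        elif field == "disallow":
            in_rules = True
            if value == "/":
                who = ", ".join(agents) or "unknown agent"
                warnings.append(f"blanket Disallow: / for {who}")
            if value.lower().rstrip("$").endswith((".css", ".js")):
                warnings.append(f"blocks stylesheet or script files: {value}")
        elif field == "allow":
            in_rules = True
    return warnings


print(lint_robots_txt("User-agent: *\nDisallow: /\n"))
```

Running a check like this in the deploy pipeline and failing the build on any warning is one way to make "check twice" automatic.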
Tools and checklists
- Search Console Robots.txt Tester. Free. Lets you test specific URLs against your live robots.txt and see whether Googlebot would be allowed to fetch them.
- Screaming Frog. Free up to 500 URLs. The "Response Codes" and "Directives" reports show each page's meta robots tag and whether the URL is blocked by robots.txt.
- A change-monitoring tool. Set up Uptime Robot, Checkly or a similar service to alert when /robots.txt changes. We have caught two production-robots-txt disasters in the last year just from this alert.
- Our free SEO audit tool. Flags robots.txt issues and meta robots conflicts in one pass. Run a free audit.
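If you would rather roll your own change alert than use a hosted service, the core of it is a content comparison. The sketch below shows only that comparison, under the assumption that something else fetches the live file and stores the last known fingerprint; the function names are illustrative and it says nothing about how the hosted tools above work internally.

```python
import hashlib


def fingerprint(body: str) -> str:
    """Hash the file content, ignoring line-ending and trailing-space noise."""
    normalised = "\n".join(line.rstrip() for line in body.strip().splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_body: str) -> bool:
    """Return True if the live robots.txt differs from the stored baseline."""
    return fingerprint(current_body) != previous_fingerprint


baseline = fingerprint("User-agent: *\nDisallow: /admin/\n")
print(has_changed(baseline, "User-agent: *\nDisallow: /\n"))
```

Normalising line endings before hashing avoids false alarms when a deploy changes nothing but whitespace.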
Perth and WA context
The three robots.txt patterns we see most often on Australian sites:
The Shopify default. Shopify ships with a robots.txt that blocks /collections/ filter parameters by default, which is sensible. It also blocks a couple of paths some merchants actually need indexed (collection sort orders for very large catalogues). Customising Shopify's robots.txt is now possible by editing the robots.txt.liquid template in the theme code editor; most Perth Shopify stores have never done it. Worth a 20-minute audit on any Fremantle, Cottesloe or Cockburn boutique.
The WordPress + plugin combo. WordPress writes a virtual robots.txt by default, then plugins like Yoast, Rank Math and All-in-One SEO override it. The result is usually fine. The exception is when two plugins are active and fighting each other, or when a theme has been customised to ship its own physical robots.txt that conflicts with the plugin. We see this on small business sites where multiple developers have touched the codebase over the years.
The custom-CMS legacy site. Mining-supply businesses in Karratha and Port Hedland running legacy ASP.NET or PHP CMS platforms often have a robots.txt set up a decade ago by a developer long gone. It blocks paths that no longer exist, allows paths that should be blocked, and never gets revisited. A morning's audit usually finds two or three real wins.
The ecommerce SEO industry pattern and mining SEO industry pattern both cover these recurring shapes in more detail.
Related guides
- Back to the Technical SEO pillar for the full 12-chapter index.
- The technical SEO audit. Stage two covers robots.txt and meta robots.
- Crawl budget explained. Robots.txt is the primary crawl-budget lever.
- Canonical tags explained. The other tool for managing duplicate content.
- XML sitemaps explained. Linked from robots.txt as the sitemap declaration.
- JavaScript SEO essentials. Why blocking CSS or JS breaks rendering.
- Crawling, indexing and ranking. The conceptual prequel.
- Noindex and nofollow glossary entry. The directives in detail.
