
XML sitemaps explained. What goes in, what stays out, and why most are wrong.

An XML sitemap is the list of URLs you are explicitly asking Google to crawl. Most CMS-generated sitemaps include URLs you do not want indexed. Cleaning this up is usually a 30-minute job, and it stops Google spending crawl requests on pages you never wanted in the index.

What an XML sitemap actually is

An XML sitemap is a file at the root of your domain (almost always /sitemap.xml or /sitemap_index.xml) that lists every URL you want a search engine to know about. The format is structured XML so machines can parse it, but the content is just a list of URLs with optional metadata for each one.

A minimal sitemap entry looks like this:

<url>
  <loc>https://example.com.au/services/seo/</loc>
  <lastmod>2026-05-20</lastmod>
</url>

The <loc> is the URL. The <lastmod> is the date the content last changed. The sitemap protocol also defines two optional tags, <changefreq> and <priority>, but Google has publicly said it ignores both. We do not include them anymore.
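For a complete file, the entry above sits inside a <urlset> root element that declares the sitemap namespace. The sketch below builds such a file with Python's standard library. The function name build_sitemap is ours, for illustration only; a real CMS or plugin will generate the file its own way.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for an iterable of (url, lastmod) pairs.

    Only <loc> and <lastmod> are written; <changefreq> and <priority>
    are deliberately left out.
    """
    ET.register_namespace("", SITEMAP_NS)  # serialise as the default namespace
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:  # lastmod is optional
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


xml_text = build_sitemap([("https://example.com.au/services/seo/", "2026-05-20")])
print(xml_text)
```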

The sitemap is best understood as a request, not a command. You are saying to Google: here is the list of URLs I think are important, please prioritise them when crawling. Google may still choose not to index some of them. The sitemap is a strong hint, not a directive.

Sitemaps come in a few flavours: a plain URL sitemap (most common), image sitemaps (the URL of every image you want indexed), video sitemaps (for video-heavy sites), news sitemaps (for publishers in the Google News programme), and HTML sitemaps (for human navigation, not for search engines). For most Perth businesses, the plain XML URL sitemap is the only one that matters.

Why the sitemap matters

Three reasons.

First, discovery. Without a sitemap, Google has to find every URL on your site by following internal links from the homepage. New pages, deeply nested pages and pages with no inbound links can sit undiscovered for weeks. The sitemap gives Google a direct list, so new content shows up in the index much faster (often within hours rather than days).

Second, canonical signalling. A URL included in your sitemap is implicitly being voted canonical by you. If you list /services/seo/ in the sitemap but not /services/seo?utm=email, you are telling Google which one to treat as the master. This works with the canonical tag to keep duplicate variants out of the index.
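As a rough illustration of the idea, the sketch below collapses a tracked variant and its clean URL into a single sitemap entry. It assumes every query string on the site is a non-canonical variant, which does not hold for sites where a parameter selects the content; the helper name sitemap_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def sitemap_url(url):
    """Drop the query string and fragment so only the clean URL is listed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


variants = [
    "https://example.com.au/services/seo/",
    "https://example.com.au/services/seo/?utm=email",
]

# Both variants collapse to one sitemap entry: the clean URL.
unique = sorted({sitemap_url(u) for u in variants})
print(unique)
```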

Third, diagnostics. Search Console reports sitemap submission, sitemap fetch errors, and the difference between submitted URLs and indexed URLs. The gap is one of the most informative numbers in technical SEO. A site with 800 URLs submitted and 200 indexed has a quality or crawl problem worth investigating.
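The gap is simple arithmetic on the two numbers Search Console reports. A minimal sketch using the 800 and 200 figures from the example above (the function name index_coverage is ours):

```python
def index_coverage(submitted, indexed):
    """Return the share of submitted URLs that are indexed, as a fraction."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    return indexed / submitted


coverage = index_coverage(submitted=800, indexed=200)
not_indexed = 800 - 200
print(f"{coverage:.0%} indexed, {not_indexed} URLs not indexed")
```

Here only a quarter of the submitted URLs are indexed, which leaves 600 URLs to investigate.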

90% of WordPress sitemaps we audit on new Australian client sites include at least one category of URL that should not be there. The fix is almost always a 10-minute plugin setting change.

What belongs in the sitemap

One rule covers 95 percent of cases. A URL belongs in your sitemap only if it is all of:

  1. Returning a 200 OK status code. No 3xx, no 4xx, no 5xx.
  2. Indexable. Not marked noindex, not blocked by robots.txt.
  3. Canonical. Either self-referencing or the target of other canonicals, never a duplicate URL.
  4. Worth ranking. A page you would be happy for a stranger to land on from search.

If a URL fails any of those tests, it does not belong in the sitemap. Period.
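The four tests can be sketched as a single filter. The Page record and its field names below are hypothetical; in practice the values would come from a crawl export or your CMS, and "worth ranking" remains a human judgement. For simplicity the sketch treats a URL as canonical only when its canonical tag points at itself.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    status: int              # HTTP status code returned by the URL
    noindex: bool            # True if the page carries a noindex directive
    blocked_by_robots: bool  # True if robots.txt disallows the URL
    canonical: str           # URL named in the page's canonical tag
    worth_ranking: bool      # editorial judgement, set by a human


def belongs_in_sitemap(page: Page) -> bool:
    """Apply the four tests: 200 OK, indexable, canonical, worth ranking."""
    return (
        page.status == 200
        and not page.noindex
        and not page.blocked_by_robots
        and page.canonical == page.url
        and page.worth_ranking
    )


clean = "https://example.com.au/services/seo/"
pages = [
    Page(clean, 200, False, False, clean, True),
    Page("https://example.com.au/old-service/", 301, False, False,
         "https://example.com.au/old-service/", True),
    Page("https://example.com.au/thank-you/", 200, True, False,
         "https://example.com.au/thank-you/", False),
    Page(clean + "?ref=footer", 200, False, False, clean, True),
]

keep = [p.url for p in pages if belongs_in_sitemap(p)]
print(keep)
```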

For most sites, "what belongs in" looks like:

  • The homepage.
  • Service pages, product pages or category pages (whichever your business uses).
  • Location pages where you have them.
  • Blog posts, articles or resources.
  • Long-form content like guides and case studies.
  • Pillar pages and cluster pages in a learn hub like this one.

That is it. No drafts, no archives, no parameter URLs, no thank-you pages, no admin pages.

What stays out

The list of URLs that almost always end up in WordPress sitemaps and should not:

  • Tag archives (and any category archives) that duplicate content already on other pages. Many sites set tag pages to noindex in their SEO plugin; either way, the sitemap should not include them.
  • Attachment pages. WordPress generates a URL for every uploaded image. They have no content and should not be indexed. Make sure your SEO plugin disables this.
  • Author archives on sites with a single author, where the author archive duplicates the blog index page.
  • Paginated comment threads (/post-name/comment-page-2/).
  • Date-based archives (/2024/05/, /2024/05/15/) for blog content.
  • Internal search-results pages (/?s=keyword).
  • Parameter URLs (?utm_, ?ref=, ?sort=). These should canonicalise to the clean URL and stay out of the sitemap.
  • Thank-you and confirmation pages. Anything in the conversion funnel. Indexing these is embarrassing.
  • Admin or member-only pages. Hide them behind auth, not just in robots.txt.
  • 301-redirected URLs. The destination belongs in the sitemap; the old URL does not.

Crawl your own sitemap with Screaming Frog: switch to list mode and load the sitemap as the URL source rather than crawling the whole site. Look for any of the above and strip them.
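If you would rather script a first pass, several of the patterns above can be spotted from the URL alone. The sketch below parses a sitemap and flags suspect entries. The patterns are assumptions based on WordPress default URL structures (for example a /tag/ base); adjust them for your own site. It does not check status codes or noindex tags, which still need a crawl.

```python
import re
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Checked in order; the first matching label wins.
SUSPECT_PATTERNS = {
    "tag archive": re.compile(r"/tag/"),
    "comment pagination": re.compile(r"/comment-page-\d+/?$"),
    "date archive": re.compile(r"/\d{4}/\d{2}/(\d{2}/)?$"),
    "internal search": re.compile(r"[?&]s="),
    "parameter URL": re.compile(r"\?"),
}


def flag_suspects(sitemap_xml):
    """Return (url, reason) pairs for sitemap entries that look out of place."""
    root = ET.fromstring(sitemap_xml)
    flagged = []
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(url):
                flagged.append((url, label))
                break
    return flagged


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com.au/services/seo/</loc></url>
  <url><loc>https://example.com.au/tag/news/</loc></url>
  <url><loc>https://example.com.au/2024/05/</loc></url>
  <url><loc>https://example.com.au/?s=keyword</loc></url>
  <url><loc>https://example.com.au/post-name/comment-page-2/</loc></url>
</urlset>"""

flagged = flag_suspects(SAMPLE)
for url, reason in flagged:
    print(f"{reason}: {url}")
```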

Submitting and monitoring the sitemap

Two steps.

One, declare the sitemap URL inside your robots.txt:

Sitemap: https://example.com.au/sitemap.xml

This is a passive discovery signal. Crawlers reading your robots.txt will find it.
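To confirm the declaration is readable the way a crawler would read it, Python's standard robots.txt parser can list the sitemaps it finds. This is a minimal sketch against an inline sample file; the Disallow rule is only there to make the sample realistic, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://example.com.au/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

To check a live site instead, call parser.set_url() with the robots.txt address and parser.read() before site_maps().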

Two, submit it in Search Console under "Sitemaps". This gives Google an explicit pointer and unlocks the reporting on submitted vs indexed URLs. Resubmit any time the sitemap structure changes significantly (after a migration, a platform change or a major content reorganisation).

Once submitted, monitor it. Check Search Console's Sitemaps report monthly. Watch for fetch errors (which mean Google could not even read the file) and for sudden drops in "discovered URLs" (which suggest the sitemap is missing pages it used to include).

Common mistakes

Do
  • Audit the sitemap content quarterly. CMS upgrades sometimes silently add new URL types.
  • Use the same URL format (HTTPS, trailing slash, lowercase) as the rest of your site.
  • Submit to Search Console and Bing Webmaster Tools.
  • Reference the sitemap URL inside robots.txt.
  • Split sitemaps into themed files for larger sites (one for blog posts, one for products, one for pages).
Do not
  • Trust the default WordPress or plugin sitemap without reading it.
  • Include URLs that return non-200 status codes. Each one is a wasted crawl request.
  • Include noindex pages. The two signals contradict each other.
  • Forget to update the sitemap after a site migration. Old URLs hanging around are a slow leak.
  • Use <priority> or <changefreq>. Google ignores them.
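Splitting into themed files means publishing a sitemap index that points at each child sitemap. The sketch below chunks each theme's URLs under the 50,000-URL limit and builds the index. The file naming scheme (sitemap-posts-1.xml and so on) is an assumption for illustration; plugins use their own names.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # protocol limit per sitemap file


def chunk(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_urls):
    """Return sitemap index XML that links to each child sitemap file."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for url in sitemap_urls:
        entry = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(index, encoding="unicode")


# Hypothetical themed groups; real lists would come from the CMS.
themes = {
    "posts": [f"https://example.com.au/blog/post-{n}/" for n in range(3)],
    "pages": ["https://example.com.au/", "https://example.com.au/contact/"],
}

child_files = []
for theme, urls in themes.items():
    for number, _group in enumerate(chunk(urls), start=1):
        child_files.append(f"https://example.com.au/sitemap-{theme}-{number}.xml")

index_xml = build_index(child_files)
print(index_xml)
```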

Tools and checklists

  1. Search Console Sitemaps report. Free. Submission, fetch status, errors, discovered-vs-indexed counts.
  2. Screaming Frog in sitemap-crawl mode. Crawls the URLs listed in your sitemap, flags status codes, indexability, canonical mismatches.
  3. Yoast or Rank Math (WordPress). Both ship with sensible sitemap defaults. Check the settings: posts in, pages in, tags and authors out (usually).
  4. XML Sitemap Validator. Free online tools that check your sitemap parses correctly. Useful after manual edits.
  5. Our free SEO audit tool. Flags sitemap structure and discoverability issues alongside the broader audit.

Perth and WA context

The three sitemap patterns we see most often on Australian sites.

The WordPress small business. A Perth tradie, accountant or healthcare practice running WordPress with Yoast or Rank Math. The plugin generates a sitemap automatically and almost always includes a few categories of URL that should not be there: attachment pages, tag archives if the site uses tags, and sometimes paginated comment URLs. The fix is a settings change inside the plugin. The audit takes 10 minutes. The benefit is consistent: Google starts spending crawl budget on the actual content instead of the noise.

The Shopify ecommerce. A Fremantle or Cockburn boutique on Shopify. Shopify generates the sitemap automatically and the defaults are usually fine. The exception is when the merchant has a very large catalogue with seasonal stock changes; the sitemap includes URLs for unpublished products and Google wastes time crawling 404s. Custom sitemap logic via a Shopify app is the fix.

The legacy custom CMS. A WA mining-services or industrial-supply business in Kalgoorlie or Karratha running a custom or legacy CMS where the sitemap was set up once and never revisited. Stale URLs, missing new content, parameter URLs included. The fix is either rewriting the sitemap generator or replacing it with a script that pulls only canonical URLs from the database. A morning's work for a competent developer.
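The replacement script can be quite small. The sketch below runs against an in-memory SQLite database with a hypothetical pages table; the table, its columns and the sample rows are all assumptions, since every legacy CMS stores this differently. The point is the WHERE clause: only live, indexable, self-canonical URLs are selected.

```python
import sqlite3

# In-memory database standing in for the CMS database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (
    url TEXT, status INTEGER, noindex INTEGER,
    canonical_url TEXT, updated_at TEXT
);
INSERT INTO pages VALUES
 ('https://example.com.au/services/', 200, 0,
  'https://example.com.au/services/', '2026-05-20'),
 ('https://example.com.au/services/?sort=asc', 200, 0,
  'https://example.com.au/services/', '2026-05-20'),
 ('https://example.com.au/old-page/', 301, 0,
  'https://example.com.au/old-page/', '2023-01-10'),
 ('https://example.com.au/thank-you/', 200, 1,
  'https://example.com.au/thank-you/', '2025-02-01');
""")

# Keep only URLs that are live, indexable and their own canonical.
rows = conn.execute("""
    SELECT url, updated_at
    FROM pages
    WHERE status = 200 AND noindex = 0 AND canonical_url = url
    ORDER BY url
""").fetchall()

for url, lastmod in rows:
    print(url, lastmod)
```

Each returned row supplies the <loc> and <lastmod> values for one sitemap entry.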

One pattern across all three. The sitemap was set up automatically and never re-read by a human. The audit is the first time anyone has actually opened the file and looked at what it contains. The findings are not exotic; they are the basics. The fixes earn their cost back inside a month.

Frequently asked

What is an XML sitemap?
An XML sitemap is a file at the root of your website that lists the URLs you want search engines to crawl and index. It is an explicit signal: these are the canonical URLs that matter, please prioritise them. Most CMS platforms generate one automatically.

Do I need an XML sitemap?
Yes, for sites larger than about 20 pages. Smaller sites can usually be crawled fully through internal links alone, but the sitemap costs nothing and helps Google discover new pages faster. For e-commerce, news sites or any catalogue with hundreds of URLs, the sitemap is essential.

How big can an XML sitemap be?
50,000 URLs or 50MB uncompressed, whichever comes first. Above that you need to split into multiple sitemap files and link them from a sitemap index file. Most CMS platforms handle this split automatically.

Should I submit my sitemap to Google?
Yes, through Search Console. Submission tells Google explicitly where the sitemap lives and surfaces any errors fetching it. You should also reference the sitemap URL inside your robots.txt as a backup discovery method.

Why are pages in my sitemap not being indexed?
The sitemap is a request, not a command. Google still decides which URLs to index based on quality, duplication, canonical signals and crawl priority. If a URL is in the sitemap but Search Console reports it as "Discovered - currently not indexed" or "Crawled - currently not indexed", the issue is upstream of the sitemap.