What Are XML Sitemaps and Why Do They Matter for SEO?

David Galvin
12 February 2025
Read Time: 12 Minutes
Article Summary

XML sitemaps help search engines discover and understand your site’s structure, especially for large or complex websites. This guide covers creation, submission, specialised sitemaps, and what to include or exclude.

An XML sitemap is a file that lists the URLs on your website you want search engines to find. It’s written in XML format, usually lives at `yoursite.com/sitemap.xml`, and acts as a roadmap for crawlers like Googlebot. Rather than relying entirely on links to discover your content, search engines can read the sitemap and go straight to the pages that matter.

That’s the short version. The longer version involves understanding when sitemaps are genuinely useful, what should and shouldn’t be in yours, and how to check whether it’s actually working. That’s what this guide covers: the practical knowledge you need to make sure your sitemap is helping your SEO rather than gathering dust. If you want to understand how this fits into the broader picture of technical SEO, we’ve got a full breakdown of that too.

What Exactly Is an XML Sitemap?

An XML sitemap is a structured file that communicates directly with search engines. It follows the sitemap protocol (defined at sitemaps.org) and uses a standardised XML format to list URLs along with optional metadata about each one.

The most basic version looks something like this:

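```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-one/</loc>
    <lastmod>2026-02-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/page-two/</loc>
    <lastmod>2026-01-20</lastmod>
  </url>
</urlset>
```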

Each `<url>` entry contains a `<loc>` tag (the page’s full URL) and optionally a `<lastmod>` tag (when it was last updated). There are two other tags in the protocol – `<changefreq>` and `<priority>` – but Google has confirmed it ignores both of them. Don’t waste time setting those values. They’re relics from an earlier era of the protocol and have no bearing on how Google crawls or ranks your pages.

The file itself must use UTF-8 encoding and can contain up to 50,000 URLs, with a maximum file size of 50MB uncompressed. If your site has more than 50,000 URLs, you’ll need a sitemap index file that references multiple individual sitemaps. More on that shortly.
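If you ever need to produce that structure yourself, here is a minimal sketch in Python using only the standard library. The function name `build_sitemap` and the example URLs are our own inventions for illustration; the namespace and the 50,000-URL limit come from the protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # protocol limit per sitemap file


def build_sitemap(pages):
    """Return sitemap XML (bytes) for a list of (url, lastmod) pairs.

    lastmod is optional, so it may be None.
    """
    if len(pages) > MAX_URLS:
        raise ValueError("more than 50,000 URLs: split the list and use a sitemap index")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in pages:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & when it serialises the text.
        ET.SubElement(entry, "loc").text = url
        if lastmod:
            ET.SubElement(entry, "lastmod").text = lastmod
    # Encoding to UTF-8 satisfies the protocol's encoding requirement.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


xml_bytes = build_sitemap([
    ("https://example.com/page-one/", "2026-02-15"),
    ("https://example.com/page-two/", None),
])
print(xml_bytes.decode("utf-8"))
```

Building the file with an XML library rather than string concatenation is a deliberate choice here: it takes care of escaping, which is one of the more common reasons a hand-built sitemap fails validation.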

How Is an XML Sitemap Different from an HTML Sitemap?

They serve completely different audiences. An XML sitemap is for search engines – it’s a machine-readable file that crawlers process automatically. An HTML sitemap is a webpage designed for human visitors, usually listing key pages in a navigable format.

XML sitemaps support crawl efficiency and indexing. HTML sitemaps support usability and internal linking. Some sites have both, some have one or the other. For SEO purposes, the XML version is the one that matters.

Does Your Site Actually Need One?

Honest answer: not always. Google is generally very good at discovering pages through internal links. If your site has a few dozen pages, a clear navigation structure, and solid internal linking, Googlebot will probably find everything without a sitemap.

But there are specific situations where a sitemap becomes genuinely important:

Large sites (hundreds or thousands of pages). The more pages you have, the harder it is for crawlers to find everything through links alone. E-commerce sites with thousands of product pages are a classic example.

New sites with few external backlinks. When a site is new, search engines have fewer paths to discover it. A sitemap gives Googlebot an immediate list of everything worth crawling.

Sites with orphan pages. An orphan page is one that has no internal links pointing to it. Without a sitemap, a crawler following your internal links has no way to find it. This happens more often than you’d think, especially on large sites where pages get created but never properly linked from the navigation or related content.

Sites with rich media content. If images or videos are central to your content strategy, specialised sitemaps help Google discover and index that media.

Sites that update frequently. News sites, blogs with daily publishing, job boards – the `<lastmod>` tag signals to search engines that content has changed and may be worth re-crawling.

When do you probably not need one? If you’ve got a small brochure site – say, under 50 pages – with clean navigation and good internal linking, a sitemap isn’t going to make or break your SEO. It won’t hurt to have one, but it’s not where you should be spending your time.

Sitemap Index Files: When One Sitemap Isn’t Enough

Once your site goes beyond 50,000 URLs or your sitemap file exceeds 50MB, you need to split it up. That’s where a sitemap index comes in.

A sitemap index is essentially a sitemap of sitemaps. It references multiple individual sitemap files, each containing a subset of your URLs. Here’s what the structure looks like:

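```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-03-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-02-28</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-03-12</lastmod>
  </sitemap>
</sitemapindex>
```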

In practice, most CMS-generated sitemaps use an index file by default, splitting URLs into logical groups – posts, pages, products, categories – even on smaller sites. WordPress plugins like Yoast and RankMath both do this. It’s a sensible approach regardless of site size because it keeps individual files manageable and makes it easier to spot issues with specific content types.

Yoast caps individual sitemaps at 1,000 URLs per file rather than the protocol maximum of 50,000. That’s a deliberate choice to keep file sizes small and server response times fast, and it’s worth knowing about if you’re wondering why your site has so many sitemap files.
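To make the splitting logic concrete, here is a rough Python sketch that chunks a URL list and builds an index pointing at the resulting files. The helper names, the product URLs, and the `sitemap-products-N.xml` naming scheme are all made up for illustration; a real plugin will organise its files differently.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=50_000):
    """Split a URL list into consecutive groups of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_urls, lastmod):
    """Return sitemap index XML (bytes) pointing at each child sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# 120,000 made-up product URLs need three files at the protocol limit.
urls = [f"https://example.com/product-{n}/" for n in range(120_000)]
groups = chunk(urls)

# Hypothetical naming scheme for the child sitemap files.
children = [
    f"https://example.com/sitemap-products-{i + 1}.xml" for i in range(len(groups))
]
index_xml = build_index(children, "2026-03-12")
print([len(group) for group in groups])
```

Passing `size=1000` to `chunk` would mimic the smaller per-file cap described above, at the cost of many more child files.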

What Should Be in Your Sitemap (and What Shouldn’t)

This is where a lot of sitemaps go wrong. The goal isn’t to list every URL on your site. It’s to list the URLs you want search engines to crawl and index.

Include

Pages you want ranking in search results (service pages, blog posts, product pages, key landing pages)

Pages that are live, return a 200 HTTP status code, and contain useful content

The canonical version of each URL (not duplicates or parameter variations)

Exclude

Pages blocked by robots.txt or marked with a noindex tag – including these creates a mixed signal that confuses crawlers

Redirect URLs (301s, 302s) – only the final destination should be in the sitemap

URLs returning 404 or 410 errors

Paginated archive pages, tag pages, and other low-value index pages (unless they genuinely serve your SEO strategy)

Internal search results pages

Staging or development URLs that accidentally made it into production

A clean sitemap should be a curated list of your best, most indexable content. You’re telling search engines: “These are the pages worth your time.” Including broken, redirected, or noindexed URLs undermines that signal and wastes crawl budget – the finite amount of time and resources Googlebot allocates to your site.
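The include and exclude rules above boil down to a simple filter. The sketch below shows one way to express them in Python; the `Page` record and its field names are hypothetical and stand in for whatever your CMS or crawl export actually provides.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    status: int = 200
    noindex: bool = False
    blocked_by_robots: bool = False
    canonical: str = ""  # empty means the page is its own canonical


def belongs_in_sitemap(page):
    """Keep only live, indexable, canonical URLs."""
    if page.status != 200:
        return False  # redirects, 404s, 410s and server errors stay out
    if page.noindex or page.blocked_by_robots:
        return False  # avoids the mixed signal described above
    if page.canonical and page.canonical != page.url:
        return False  # only the canonical destination is listed
    return True


pages = [
    Page("https://example.com/services/"),
    Page("https://example.com/old-page/", status=301),
    Page("https://example.com/thank-you/", noindex=True),
    Page("https://example.com/shoes/?sort=price", canonical="https://example.com/shoes/"),
    Page("https://example.com/gone/", status=404),
]
sitemap_urls = [page.url for page in pages if belongs_in_sitemap(page)]
print(sitemap_urls)
```

Of the five example pages, only the services page passes every check.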

If you’ve been dealing with crawl issues more broadly, our guide to crawl errors covers the diagnostic side in detail.

How to Create an XML Sitemap

There are three common approaches, and the right one depends on your setup.

CMS Plugins (WordPress, Shopify, and Others)

If your site runs on WordPress, you’ve probably already got a sitemap. WordPress has generated a basic sitemap at `/wp-sitemap.xml` since version 5.5. But the built-in version is fairly limited – it doesn’t give you much control over what’s included or excluded.

That’s why most SEO plugins replace it with their own. Yoast SEO generates sitemaps at `/sitemap_index.xml` and automatically excludes noindexed content, splits by content type, and updates dynamically when you publish or remove pages. RankMath does broadly the same thing. Both give you the ability to include or exclude specific post types and taxonomies from the sitemap through their settings panels.

Shopify generates a sitemap automatically at `/sitemap.xml`. You can’t edit it directly, but it updates as you add products, collections, and pages.

Sitemap Generator Tools

If you’re not on a CMS with built-in generation, standalone tools like Screaming Frog, XML-Sitemaps.com, or Yoast’s free sitemap generator can crawl your site and produce a sitemap file you upload manually. This works well for static sites or custom-built platforms.

The downside is that these don’t update automatically. Every time you add, remove, or change pages, you need to regenerate and re-upload the file. For sites that change infrequently, that’s fine. For anything dynamic, it’s a maintenance headache.

Dynamic Sitemaps

Larger or more complex sites often generate sitemaps dynamically – the sitemap is created on the fly by the server each time it’s requested, pulling from the site’s database. This is common on custom-built platforms and headless CMS setups. It means the sitemap always reflects the current state of the site without manual intervention.

The trade-off is that dynamic generation requires development work to set up, and poorly implemented dynamic sitemaps can be slow to respond or miss pages if the query logic isn’t right.
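As an illustration of the idea, the sketch below renders a sitemap straight from a database query, so the output changes as soon as the data does. The in-memory SQLite database, the `pages` table, and its columns are invented for the example; a real site would query its own schema and would usually cache the result rather than hit the database on every request.

```python
import sqlite3
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def render_sitemap(conn):
    """Build the sitemap from whatever is published in the database right now."""
    rows = conn.execute(
        "SELECT url, updated FROM pages "
        "WHERE published = 1 AND noindex = 0 ORDER BY url"
    )
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, updated in rows:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = updated
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Stand-in for the site's real database (table and columns are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (url TEXT, updated TEXT, published INTEGER, noindex INTEGER)"
)
conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", [
    ("https://example.com/a/", "2026-03-01", 1, 0),
    ("https://example.com/draft/", "2026-03-02", 0, 0),
    ("https://example.com/private/", "2026-03-03", 1, 1),
])

first = render_sitemap(conn)
conn.execute("INSERT INTO pages VALUES ('https://example.com/b/', '2026-03-04', 1, 0)")
second = render_sitemap(conn)  # the new page appears with no manual step
print(first.count(b"<url>"), second.count(b"<url>"))
```

The `WHERE` clause is where the query-logic risk mentioned above lives: get the filter wrong and pages silently go missing or unwanted ones creep in.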

Specialised Sitemaps: Images, Video, and Hreflang

The standard sitemap covers your HTML pages. But there are extensions for specific content types that help search engines discover and understand media and multilingual content.

Image Sitemaps

If images play a significant role in your site – think e-commerce product photography, portfolio sites, or image-heavy editorial content – image sitemap markup helps Google find and index those images. You add image-specific tags within the standard sitemap, pointing to the image URLs. The extension originally allowed optional metadata like captions and titles too, but Google has since deprecated those tags, so the image URL is the part that counts.

This is particularly useful when images are loaded dynamically via JavaScript, since crawlers might not discover them through the page HTML alone.
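For reference, this is roughly what generating image markup looks like in Python. The sketch emits only the image location inside each `<url>` entry, which as far as we’re aware is the one image tag Google still reads. The function name and example URLs are our own; the `image` namespace is the one Google documents for this extension.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Control the prefixes ElementTree writes: none for the core tags, "image" for the extension.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def build_image_sitemap(pages):
    """pages maps each page URL to the list of image URLs shown on that page."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(entry, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


image_xml = build_image_sitemap({
    "https://example.com/products/red-shoe/": [
        "https://example.com/img/red-shoe-front.jpg",
        "https://example.com/img/red-shoe-side.jpg",
    ],
})
print(image_xml)
```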

Video Sitemaps

Similar principle. If you host video content, video sitemap markup provides metadata like the video title, description, thumbnail URL, duration, and whether it’s embeddable. This helps Google surface your videos in video search results and rich snippets.
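A hedged sketch of the same idea for video follows. It writes a small set of tags from Google’s video extension (thumbnail, title, description, file location, and duration in seconds) and leaves out everything else, including any embed settings. The function name, the dictionary keys, and the example values are assumptions made for illustration, so check Google’s current documentation before relying on the exact tag list.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("video", VIDEO_NS)

# Child tags written for each video, in the order they are emitted.
VIDEO_TAGS = ("thumbnail_loc", "title", "description", "content_loc", "duration")


def build_video_sitemap(videos):
    """videos is a list of dicts with a page_url key plus one key per VIDEO_TAGS entry."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for video in videos:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = video["page_url"]
        node = ET.SubElement(entry, f"{{{VIDEO_NS}}}video")
        for tag in VIDEO_TAGS:
            ET.SubElement(node, f"{{{VIDEO_NS}}}{tag}").text = str(video[tag])
    return ET.tostring(urlset, encoding="unicode")


video_xml = build_video_sitemap([{
    "page_url": "https://example.com/guides/lacing-boots/",
    "thumbnail_loc": "https://example.com/thumbs/lacing-boots.jpg",
    "title": "How to lace hiking boots",
    "description": "A short walkthrough of three lacing techniques.",
    "content_loc": "https://example.com/video/lacing-boots.mp4",
    "duration": 95,  # seconds
}])
print(video_xml)
```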

Hreflang Sitemaps

For multilingual or multi-regional sites, you can declare hreflang annotations directly in your sitemap rather than (or in addition to) using HTML link elements or HTTP headers. Each URL entry includes references to its equivalent pages in other languages or regions.

This keeps all your international targeting signals in one place and is often easier to manage at scale than adding hreflang tags to every page’s HTML. For sites with dozens of language variants across hundreds of pages, sitemap-based hreflang is usually the most practical approach.
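Here is a minimal sketch of how those entries can be generated. Each language variant gets its own `<url>` entry, and every entry lists the full set of alternates, itself included. The function name and the example URLs are hypothetical; the `xhtml:link` element with `rel`, `hreflang`, and `href` attributes is the format Google documents for sitemap-based hreflang.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_hreflang_sitemap(groups):
    """groups is a list of {hreflang: url} dicts, one dict per piece of content."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for variants in groups:
        for url in variants.values():
            entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
            # Every variant lists all alternates, including a reference to itself.
            for lang, href in variants.items():
                ET.SubElement(
                    entry,
                    f"{{{XHTML_NS}}}link",
                    rel="alternate",
                    hreflang=lang,
                    href=href,
                )
    return ET.tostring(urlset, encoding="unicode")


hreflang_xml = build_hreflang_sitemap([
    {
        "en-gb": "https://example.com/en-gb/pricing/",
        "de-de": "https://example.com/de-de/preise/",
    },
])
print(hreflang_xml)
```

Generating the annotations from one mapping per page is what makes this approach manageable at scale: add a language to the mapping and every related entry picks it up.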

How to Submit Your Sitemap

Creating a sitemap is only half the job. You also need to tell search engines where to find it.

Google Search Console

The most direct method. Log into Google Search Console, navigate to the Sitemaps section in the left sidebar, enter your sitemap URL (usually `https://yoursite.com/sitemap.xml` or `/sitemap_index.xml`), and hit submit. Google will fetch it, process it, and report back on how many URLs were discovered and how many were successfully indexed.

This is also where you’ll see errors – if URLs in your sitemap return errors, are blocked, or can’t be indexed for other reasons, Search Console flags them. Check back after a few days to review the status.

If you update your sitemap significantly (after a site migration or major content overhaul, for instance), resubmit it through Search Console to prompt Google to re-process it.

Bing Webmaster Tools

Same process, different platform. Bing Webmaster Tools has its own sitemap submission section. If you care about Bing traffic (and for some industries and demographics, it’s more significant than you’d think), submit your sitemap there too.

The Robots.txt Directive

You can also reference your sitemap in your robots.txt file by adding a line like:

```
Sitemap: https://yoursite.com/sitemap.xml
```

This helps any crawler that reads your robots.txt discover the sitemap automatically, without needing a manual submission in each search engine’s webmaster tools. It’s not a substitute for submitting through Search Console – you won’t get the reporting and error feedback – but it’s a useful belt-and-braces measure. We’ll cover robots.txt in full in a dedicated article, including how it interacts with sitemaps and crawl directives.
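If you want to confirm that a robots.txt file really does expose the sitemap, Python’s standard `urllib.robotparser` module can read it the way a crawler would. The robots.txt content below is a made-up example, and `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; in practice you would fetch the live file.
robots_txt = """\
User-agent: *
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)

# The same parser can tell you whether a URL is disallowed for a given crawler.
print(parser.can_fetch("*", "https://yoursite.com/admin/settings"))
print(parser.can_fetch("*", "https://yoursite.com/blog/"))
```

The `can_fetch` check is handy when auditing a sitemap, since any URL it reports as blocked shouldn’t be listed in the first place.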

Ping (Mostly Deprecated)

Older guides mention “pinging” Google by requesting a URL like `google.com/ping?sitemap=yoursitemapurl`. Google deprecated this endpoint in 2023 and it no longer works. Search Console submission and robots.txt reference are the two methods worth using now.

How to Check If Your Sitemap Is Working

Having a sitemap and having a sitemap that’s actually doing its job are two different things. Here’s how to verify yours.

Check It Loads Correctly

Start simple. Open your sitemap URL in a browser. Does it load? Does it display valid XML? If you get a 404, a blank page, or an error, your sitemap either doesn’t exist or has a configuration problem.

Validate the XML

A sitemap needs to be valid XML. Malformed tags, encoding issues, or missing namespace declarations will cause parsers – including Googlebot – to reject it. Free validators like XML Sitemaps Validator or the W3C Markup Validation Service can check this for you. Common issues include non-UTF-8 characters, unclosed tags, and URLs with unescaped ampersands.
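The checks described above can be approximated in a few lines of Python. This is a sketch of basic sanity checks, not a full schema validation, and the function name and messages are our own.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_BYTES = 50 * 1024 * 1024  # 50MB uncompressed
MAX_URLS = 50_000


def validate_sitemap(xml_text):
    """Return a list of problems found; an empty list means the basic checks pass."""
    problems = []
    if len(xml_text.encode("utf-8")) > MAX_BYTES:
        problems.append("file is larger than 50MB uncompressed")
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Unclosed tags and unescaped ampersands both end up here.
        return problems + [f"not well-formed XML: {exc}"]
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        problems.append("root element is not <urlset> in the sitemap namespace")
    locs = root.findall(f"{{{SITEMAP_NS}}}url/{{{SITEMAP_NS}}}loc")
    if len(locs) > MAX_URLS:
        problems.append("more than 50,000 URLs in one file")
    for loc in locs:
        parts = urlsplit((loc.text or "").strip())
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"not an absolute URL: {loc.text!r}")
    return problems


good = (
    f'<urlset xmlns="{SITEMAP_NS}">'
    "<url><loc>https://example.com/?a=1&amp;b=2</loc></url>"
    "</urlset>"
)
bad = good.replace("&amp;", "&")  # simulate an unescaped ampersand

print(validate_sitemap(good))
print(validate_sitemap(bad))
```

The first call returns an empty list. The second reports a well-formedness error, which is exactly what a strict parser does with a raw `&` in a URL.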

Review in Google Search Console

The Sitemaps report in Search Console is your most important diagnostic tool. It tells you:

Submitted vs indexed: How many URLs you submitted versus how many Google actually indexed. A large gap here means something is preventing pages from being indexed – it could be noindex tags, canonical issues, thin content, or crawl problems.

Errors and warnings: Specific issues Google found when processing your sitemap, like unreachable URLs or format errors.

Last read date: When Googlebot last fetched your sitemap. If this was months ago and you’ve been publishing new content, something may be preventing regular re-crawling.

Cross-Reference with Your Site

Your sitemap should be a reliable mirror of your live, indexable content. Periodically compare the URLs in your sitemap against what’s actually on your site:

Are there live pages missing from the sitemap?

Are there URLs in the sitemap that no longer exist or now redirect?

Are noindexed pages accidentally included?

A site crawl tool like Screaming Frog makes this comparison straightforward. Run a crawl, export your sitemap URLs, and compare the two lists. Any discrepancies point to maintenance that needs doing.
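Once you have the two lists exported, the comparison itself is a pair of set differences. A minimal sketch, assuming you’ve already loaded both lists into Python (the function name and example URLs are made up):

```python
def compare(sitemap_urls, crawled_indexable_urls):
    """Report URLs that appear in only one of the two lists."""
    in_sitemap = set(sitemap_urls)
    on_site = set(crawled_indexable_urls)
    return {
        # Live, indexable pages the sitemap doesn't mention.
        "missing_from_sitemap": sorted(on_site - in_sitemap),
        # Sitemap entries the crawl didn't find as live, indexable pages.
        "stale_in_sitemap": sorted(in_sitemap - on_site),
    }


report = compare(
    sitemap_urls=[
        "https://example.com/",
        "https://example.com/services/",
        "https://example.com/old-offer/",
    ],
    crawled_indexable_urls=[
        "https://example.com/",
        "https://example.com/services/",
        "https://example.com/blog/new-post/",
    ],
)
print(report)
```

This only works if both lists use the same URL form, so normalise trailing slashes and protocol before comparing.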

Common Sitemap Mistakes (and How to Avoid Them)

After auditing hundreds of sites, we see certain problems come up repeatedly. Most are easy to fix once you know to look for them.

Including Noindexed or Blocked URLs

If a page has a noindex meta tag or is disallowed in robots.txt, it shouldn’t be in your sitemap. Including it creates a contradictory signal – the sitemap says “please crawl and index this” while the page or robots.txt says “please don’t.” Google will typically follow the noindex or disallow directive, but the mixed signal wastes crawl resources and clutters your Search Console reports.

Listing Non-Canonical URLs

Every URL in your sitemap should be the canonical version of that page. If you have duplicate pages with canonical tags pointing elsewhere, the duplicates shouldn’t appear in the sitemap. Only the canonical destination belongs there.

Canonical tag management is its own topic – we’ll cover that properly in a separate piece.

Letting the Sitemap Go Stale

A sitemap that was generated once and never updated is worse than it sounds. As you publish new content, remove old pages, and restructure the site, the sitemap drifts further from reality. CMS-generated sitemaps usually handle this automatically, but if you’re maintaining yours manually or using a static file, schedule regular regeneration.

Forgetting to Submit After Migration

Site migrations are one of the most common times sitemaps get overlooked. URLs change, structures shift, and the old sitemap becomes a list of broken links overnight. After any significant migration, regenerate your sitemap with the new URL structure and resubmit it through Search Console immediately.

Ignoring HTTP Status Codes

Every URL in your sitemap should return a 200 status code. URLs returning 301 redirects, 404 errors, or 5xx server errors don’t belong there. Regularly audit your sitemap URLs against their actual HTTP responses and clean out anything that isn’t returning a clean 200.
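A status-code audit can be scripted with the standard library alone. The sketch below sends a HEAD request to each URL without following redirects, so a 301 is reported as a 301 rather than as the 200 of its destination. The names are our own, connection failures and timeouts are not handled, and some servers answer HEAD requests differently from GET, so treat this as a starting point.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None stops urllib following the redirect; it raises HTTPError instead.
        return None


def fetch_status(url, timeout=10):
    """Return the HTTP status for a HEAD request without following redirects."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 3xx, 4xx and 5xx responses arrive as HTTPError


def audit(urls, fetch=fetch_status):
    """Map every URL that does not return 200 to the status it did return."""
    failures = {}
    for url in urls:
        code = fetch(url)
        if code != 200:
            failures[url] = code
    return failures


# Offline demonstration with canned statuses instead of real requests.
fake_statuses = {
    "https://example.com/": 200,
    "https://example.com/old-page/": 301,
    "https://example.com/gone/": 404,
}
print(audit(list(fake_statuses), fetch=fake_statuses.get))
```

To run it against a live site, call `audit(urls)` with the URLs extracted from your sitemap and leave the `fetch` argument at its default.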

XML Sitemaps and Crawl Efficiency

Sitemaps play a specific role in how efficiently search engines crawl your site. Googlebot has a finite crawl budget for each site – the number of pages it will crawl within a given period. On small sites, this rarely matters. On larger sites, it becomes a real consideration.

A well-structured sitemap helps Googlebot spend its crawl budget on the pages that matter. Instead of following every link chain across your entire site, Googlebot can consult the sitemap for a direct list of priority URLs. This is especially valuable for:

Deep pages that are many clicks from the homepage

Orphan pages with no internal links pointing to them

Newly published content that hasn’t been linked from other pages yet

Sites with complex URL structures where parameters and filters create thousands of variations

That said, a sitemap isn’t a fix for poor site architecture. If your important pages are buried six clicks deep with no internal links, the real solution is fixing your information architecture and internal linking, not relying on the sitemap to compensate. The sitemap is a safety net, not a substitute for good structure.

Sitemaps and SEO: Keeping Perspective

It’s easy to overthink sitemaps. They’re a foundational piece of technical SEO – like having a robots.txt file or an SSL certificate – but they’re not a ranking factor in themselves. Having a perfect sitemap won’t push you up the results. Not having one (or having a broken one) can hold you back by preventing pages from being discovered and indexed.

The sites that get the most value from sitemaps are the ones that also invest in the content, authority, and technical foundations that actually drive rankings. A sitemap ensures your content can be found. Everything else determines whether it deserves to rank.

If your sitemap’s in good shape but you’re not seeing the search performance you’d expect, the issue is almost certainly elsewhere – content quality, backlink profile, keyword strategy, or broader technical health. For a proper assessment of where things stand, a technical SEO audit will give you the full picture. At Gorilla Marketing, sitemap analysis is part of every audit we run. Senior strategists review your sitemap structure, cross-reference it against your indexed pages, and flag anything that’s leaking crawl budget or leaving content undiscovered. If something’s off, we’ll tell you what it is and what to do about it.

David Galvin
David has been in search marketing for over 8 years, specialising in technical SEO. He focuses on the technical foundations that impact visibility, including site structure, performance, and tracking. With a solid technical grounding and hands-on experience across Linux, PHP, JavaScript, and CSS, he works to identify and resolve the issues that genuinely hold websites back. If he’s not in front of a laptop, you’ll usually find him hiking up a mountain or visiting his son in Dublin.
