XML sitemap
An XML sitemap enumerates the URLs you want crawled. Agents prefer to seed their crawl queue from a sitemap rather than guess at links.
On this page
- What is an XML sitemap and why does every site need one?
- Why do AI agents rely on XML sitemaps for discovery?
- Is an XML sitemap required for AI agent readiness?
- What the Sitemaps 0.9 protocol says
- What good XML sitemap implementation looks like
- How do I add an XML sitemap to my site?
- How can I test my XML sitemap?
- Frequently asked questions
- Does my single-page app need an XML sitemap if all content is client-rendered?
- Is sitemap.xml still relevant when I already have robots.txt and llms.txt?
- Do e-commerce sites need separate XML sitemaps for product pages and category pages?
- Can I use a sitemap index even if my site has fewer than 50,000 URLs?
- Should SaaS documentation sites include logged-in or gated pages in their XML sitemap?
- Does Vercel automatically generate an XML sitemap for Next.js sites?
- What's the difference between an XML sitemap and an HTML sitemap for AI agents?
- Do news publishers need to update their XML sitemap every time they publish an article?
What is an XML sitemap and why does every site need one?
An XML sitemap is a machine-readable file that lists the URLs on your site you want search engines and crawlers to discover. It lives at a conventional path—typically /sitemap.xml—and follows the Sitemaps 0.9 protocol. Instead of forcing a crawler to spider your entire site by following links, you hand it a structured manifest up front.
The format is straightforward XML: a <urlset> root element containing one <url> entry per page, each with a <loc> tag for the URL itself and optional metadata like <lastmod> (when it last changed), <changefreq>, and <priority>. If a sitemap would exceed 50,000 URLs or 50 MB uncompressed, you split it across multiple sitemaps and reference them from a sitemap index file. The protocol is old (Google introduced it in 2005, and version 0.9 followed in 2006 with Yahoo and Microsoft on board), but it works, and all major search-engine crawlers support it.
Why do AI agents rely on XML sitemaps for discovery?
AI agents don't browse your site the way humans do. When ChatGPT's web-browsing feature, Perplexity, or an enterprise RAG pipeline needs to index your content, a sitemap is the cheapest place for it to start. Without one, the agent either guesses at URLs (slow, incomplete) or falls back to recursive crawling (expensive in tokens, time, and your server load). A sitemap seeds the crawl queue with exactly the pages you want indexed: no dead ends, no guessing at pagination patterns.
This directly affects citation rates and agent-assisted conversions. If your pricing page or API documentation isn't in the sitemap, an agent might never find it, and users asking "How much does X cost?" get an "I don't know" instead of a quote. Similarly, agentic commerce flows (ChatGPT deciding which SaaS tool to recommend) rely on structured discovery. A missing sitemap means you're hoping the agent gets lucky. You're also risking false positives from WAFs and rate limiters: an agent that can't find your sitemap will hammer your site with exploratory requests.
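To make the queue seeding described above concrete, here is a minimal sketch of how a crawler might turn a sitemap document into a deduplicated list of start URLs. The function names (extractLocs, seedCrawlQueue) are hypothetical, and the regex-based extraction is an assumption made to keep the example dependency-free; a production agent would more likely use a real XML parser.

```typescript
// Decode the five entities the Sitemaps protocol requires writers to escape.
function decodeEntities(value: string): string {
  return value
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&apos;/g, "'")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" becomes "&lt;" rather than "<"
}

// Pull every <loc> value out of a sitemap document (hypothetical helper).
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(decodeEntities(match[1]));
  }
  return locs;
}

// Seed a crawl queue: every listed URL once, in document order.
function seedCrawlQueue(sitemapXml: string): string[] {
  const locs = extractLocs(sitemapXml);
  return locs.filter((url, index) => locs.indexOf(url) === index);
}
```

The queue contains only what the site owner listed, which is the point: no link-following is needed before the first useful fetch.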
Is an XML sitemap required for AI agent readiness?
This check is required for any site that wants to be taken seriously by AI agents. If you publish more than a dozen pages—documentation, a blog, a product catalog—you need a sitemap. The only exceptions are single-page marketing sites or internal tools never meant to be crawled.
Even static sites benefit. Agents don't distinguish between "important" and "unimportant" sites; they just look for the sitemap at the conventional path. If it's missing, you're opting out of structured discovery.
What the Sitemaps 0.9 protocol says
The Sitemaps 0.9 protocol defines a few simple rules:
- Required: Each <url> must contain a <loc> element with the full URL.
- Recommended: Include <lastmod> (ISO 8601 date) so agents can skip unchanged pages.
- Optional: <changefreq> (always, hourly, daily, weekly, monthly, yearly, never) and <priority> (0.0 to 1.0).
- Limits: Up to 50,000 URLs or 50 MB per sitemap file. Beyond that, use a sitemap index.
- Encoding: UTF-8. Entity-escape special characters (&, <, >, ', ").
Minimum valid example:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2025-01-10</lastmod>
  </url>
</urlset>
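If you generate this file from a build script rather than a framework plugin, the escaping rule is the easiest one to get wrong. The following is a minimal sketch, not a reference implementation: the SitemapEntry shape and the function names are assumptions for illustration, and it only emits <loc> and <lastmod>.

```typescript
// Hypothetical entry shape; only `loc` is required by the protocol.
interface SitemapEntry {
  loc: string;
  lastmod?: string; // e.g. "2025-01-15"
}

// Entity-escape the five special characters. "&" goes first so the
// ampersands introduced by later replacements are not escaped twice.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/'/g, "&apos;")
    .replace(/"/g, "&quot;");
}

// Build a <urlset> document from a list of entries.
function buildUrlset(entries: SitemapEntry[]): string {
  const urls = entries.map((entry) => {
    const lines = [`    <loc>${escapeXml(entry.loc)}</loc>`];
    if (entry.lastmod !== undefined) {
      lines.push(`    <lastmod>${escapeXml(entry.lastmod)}</lastmod>`);
    }
    return `  <url>\n${lines.join("\n")}\n  </url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}
```

A URL such as https://example.com/?a=1&b=2 comes out with &amp; in place of the bare ampersand, which keeps the file parseable.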
For large sites, you use a sitemap index:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2025-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2025-01-14</lastmod>
  </sitemap>
</sitemapindex>
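The 50,000-URL limit translates into a simple chunking step. Below is a rough sketch of how a generator might split a flat URL list and emit the matching index. The sitemap-N.xml naming scheme, the baseUrl parameter, and the function names are all hypothetical, baseUrl is assumed to contain no characters that need escaping, and the sketch does not check the 50 MB size limit.

```typescript
// Protocol limit on entries per sitemap file.
const MAX_URLS_PER_SITEMAP = 50000;

// Split a flat URL list into chunks that each fit in one sitemap file.
function chunkUrls(urls: string[], size: number = MAX_URLS_PER_SITEMAP): string[][] {
  const chunks: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    chunks.push(urls.slice(i, i + size));
  }
  return chunks;
}

// Build the index that points at each child sitemap.
// The sitemap-N.xml file names are an illustrative assumption.
function buildSitemapIndex(baseUrl: string, chunkCount: number, lastmod: string): string {
  const items: string[] = [];
  for (let n = 1; n <= chunkCount; n++) {
    items.push(
      "  <sitemap>\n" +
        `    <loc>${baseUrl}/sitemap-${n}.xml</loc>\n` +
        `    <lastmod>${lastmod}</lastmod>\n` +
        "  </sitemap>"
    );
  }
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...items,
    "</sitemapindex>",
  ].join("\n");
}
```

For a catalog of 120,001 URLs this yields three child sitemaps: two full files and one holding the remaining 20,001 entries.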
What good XML sitemap implementation looks like
Stripe publishes a clean, well-maintained sitemap at stripe.com/sitemap.xml that includes documentation, blog posts, and product pages with accurate <lastmod> dates. Shopify's sitemap setup uses a sitemap index to partition store pages, help docs, and blog content across multiple files, staying under the 50,000-URL limit per file.
Both examples demonstrate the core principle: every URL you want an agent to see is present, and metadata is up to date. Neither wastes time with low-value URLs (login pages, search results).
How do I add an XML sitemap to my site?
- Generate the sitemap. Most frameworks can do this for you:
  - Next.js: Add a sitemap.ts file (or a static sitemap.xml) in your /app directory (App Router), or use a library like next-sitemap.
  - Astro: Use the @astrojs/sitemap integration.
  - WordPress: Enabled by default since version 5.5 (/wp-sitemap.xml).
  - Static sites: Use your build tool (Hugo has built-in sitemap generation; Eleventy needs a plugin).
- Place it at /sitemap.xml. If you use a sitemap index, point /sitemap.xml to that index.
- Include <lastmod> for every URL. Pull from Git commit dates, CMS update timestamps, or filesystem mtime.
- Exclude noise. No search result pages, pagination variants (unless canonical), or gated content that requires login.
- Validate and test. Ensure the file returns HTTP 200, parses as valid XML, and contains at least one <url> entry.
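The "exclude noise" step is usually a small filter run over your route list before the sitemap is written. Here is one possible sketch. The excluded path prefixes and the page= query parameter are assumptions about a hypothetical site, so swap in your own routes.

```typescript
// Hypothetical exclusion rules; adjust to your own routes.
const EXCLUDED_PATH_PREFIXES = ["/search", "/login", "/account"];

// Split an absolute http(s) URL into path and query without relying on
// runtime-specific globals. Returns an empty path for anything else.
function splitUrl(url: string): { path: string; query: string } {
  const match = /^https?:\/\/[^\/?#]+([^?#]*)(\?[^#]*)?/.exec(url);
  if (match === null) {
    return { path: "", query: "" };
  }
  return { path: match[1] || "/", query: match[2] || "" };
}

// Decide whether a URL belongs in the sitemap.
function isSitemapWorthy(url: string): boolean {
  const { path, query } = splitUrl(url);
  if (path === "") {
    return false; // not an absolute http(s) URL
  }
  // Match whole path segments so "/searchlight" is not caught by "/search".
  const excluded = EXCLUDED_PATH_PREFIXES.some(
    (prefix) => path === prefix || path.indexOf(prefix + "/") === 0
  );
  if (excluded) {
    return false;
  }
  // Drop paginated variants such as ?page=2 (assumed query parameter name).
  if (/[?&]page=\d+/.test(query)) {
    return false;
  }
  return true;
}
```

Running your candidate URLs through a filter like this before generation keeps login, search, and pagination URLs out of the file without hand-editing it.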
Example Next.js App Router sitemap:
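// app/sitemap.ts
import type { MetadataRoute } from 'next'

// Next.js serves the array returned here as /sitemap.xml.
export default function sitemap(): MetadataRoute.Sitemap {
  return [
    {
      url: 'https://example.com',
      // new Date() marks the page as modified at generation time;
      // prefer a real content timestamp when you have one.
      lastModified: new Date(),
    },
    {
      url: 'https://example.com/about',
      lastModified: new Date('2025-01-10'),
    },
  ]
}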
How can I test my XML sitemap?
curl -sI https://yoursite.com/sitemap.xml
curl -s https://yoursite.com/sitemap.xml | head -n 20
Check that the first command returns a 200 status (HTTP/1.1 200 OK, or HTTP/2 200 on servers that negotiate HTTP/2) and Content-Type: application/xml (or text/xml), and that the second shows valid XML with <urlset> and at least one <url> entry.
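The same checks can be scripted, for example in CI. The sketch below is a pure function over a response's status, Content-Type header, and body, so it can be fed from any HTTP client. The function and interface names are hypothetical, and the pattern matching is a shallow heuristic rather than real XML validation. It also accepts a <sitemapindex> root, since /sitemap.xml may be an index.

```typescript
interface SitemapCheck {
  ok: boolean;
  problems: string[];
}

// Apply the manual curl checks to an already-fetched response.
function checkSitemapResponse(status: number, contentType: string, body: string): SitemapCheck {
  const problems: string[] = [];
  if (status !== 200) {
    problems.push(`expected HTTP 200, got ${status}`);
  }
  // Ignore parameters such as "; charset=utf-8".
  const mime = (contentType.split(";")[0] || "").trim().toLowerCase();
  if (mime !== "application/xml" && mime !== "text/xml") {
    problems.push(`unexpected Content-Type: ${contentType}`);
  }
  // A sitemap index served at /sitemap.xml is also acceptable.
  if (!/<(urlset|sitemapindex)[\s>]/.test(body)) {
    problems.push("no <urlset> or <sitemapindex> root element found");
  }
  // Require at least one non-empty <loc>.
  if (!/<loc>\s*[^<\s][^<]*<\/loc>/.test(body)) {
    problems.push("no <loc> entry found");
  }
  return { ok: problems.length === 0, problems };
}
```

With the standard fetch API you would pass res.status, the content-type response header (or an empty string if it is absent), and the text of the body.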
Frequently asked questions
Does my single-page app need an XML sitemap if all content is client-rendered?
Yes, if you want AI agents to discover your content. Even SPAs with client-side routing should generate a sitemap.xml listing all routes. Use pre-rendering or SSG to ensure each route has a crawlable URL. Without a sitemap, agents can't map your route structure and will miss most pages.
Is sitemap.xml still relevant when I already have robots.txt and llms.txt?
robots.txt only grants or denies access; it doesn't list your URLs, though it can point crawlers at your sitemap with a Sitemap: line. llms.txt provides context but isn't a discovery mechanism. sitemap.xml remains the canonical way to tell crawlers and agents which pages exist. All three serve different purposes and should coexist on every site.
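One place the files meet: the Sitemaps protocol lets robots.txt advertise a sitemap location with a Sitemap: line, which is useful when the sitemap is not at the conventional path. As a rough sketch, a crawler might read those lines like this. The function name is hypothetical, and stripping everything after "#" as a comment is a simplifying assumption.

```typescript
// Collect sitemap URLs advertised in a robots.txt body.
function sitemapsFromRobots(robotsTxt: string): string[] {
  const found: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    // Treat "#" as the start of a comment (simplification).
    const line = (rawLine.split("#")[0] || "").trim();
    // The directive name is matched case-insensitively.
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    if (match !== null) {
      found.push(match[1]);
    }
  }
  return found;
}
```

An agent that finds nothing here can still fall back to requesting /sitemap.xml directly.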
Do e-commerce sites need separate XML sitemaps for product pages and category pages?
For large catalogs (>10,000 URLs), yes. Use a sitemap index at /sitemap.xml pointing to /sitemap-products.xml and /sitemap-categories.xml. This keeps each file under the 50,000-URL limit and lets you set different <lastmod> dates for product inventory updates versus category reshuffles, helping agents prioritize fresh content.
Can I use a sitemap index even if my site has fewer than 50,000 URLs?
Yes. Sitemap indexes are valid for any site size and help you organize URLs by type (blog, docs, products). This makes maintenance easier and lets you update one section without regenerating the entire sitemap. Just ensure your index file is at /sitemap.xml.
Should SaaS documentation sites include logged-in or gated pages in their XML sitemap?
No. Only include publicly accessible URLs. Agents can't authenticate, so listing gated docs wastes their crawl budget and yours. If you want agents to know about premium features, create public marketing pages or a feature overview that summarizes what's behind the login.
Does Vercel automatically generate an XML sitemap for Next.js sites?
No. Vercel hosts your site but doesn't auto-generate sitemaps. You must add a sitemap.ts file in your Next.js /app directory (App Router) or use a library like next-sitemap for the Pages Router. Vercel will serve the generated sitemap at /sitemap.xml once you deploy.
What's the difference between an XML sitemap and an HTML sitemap for AI agents?
An XML sitemap (/sitemap.xml) is a machine-readable file that agents parse to discover all URLs. An HTML sitemap is a human-facing page with links. A crawler that follows links can still read an HTML sitemap, but agents seeding a crawl queue look for the Sitemaps 0.9 XML format. Always provide an XML sitemap for agent readiness.
Do news publishers need to update their XML sitemap every time they publish an article?
Ideally, yes. CMS platforms that render your site (WordPress, Ghost) regenerate sitemaps on publish; with a headless CMS such as Contentful, the sitemap comes from your front end, so make sure a publish triggers a rebuild. For high-frequency publishers, use a sitemap index and update only the /sitemap-news.xml section. Accurate <lastmod> dates help agents prioritize breaking news over archive content, improving citation timeliness.
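To illustrate why accurate <lastmod> values matter, here is a sketch of how an agent might use them to decide what to re-fetch. It is one plausible policy, not a description of any particular crawler. The DatedEntry shape and the function name are hypothetical, and entries with a missing or unparseable date are treated as changed.

```typescript
// Hypothetical shape for a parsed sitemap entry.
interface DatedEntry {
  loc: string;
  lastmod?: string; // e.g. "2025-01-15"
}

// Timestamp used for ordering; unknown dates sort last.
function lastmodMillis(entry: DatedEntry): number {
  const parsed = entry.lastmod === undefined ? NaN : Date.parse(entry.lastmod);
  return isNaN(parsed) ? 0 : parsed;
}

// Keep entries modified after the last crawl, newest first.
function changedSince(entries: DatedEntry[], lastCrawl: Date): DatedEntry[] {
  return entries
    .filter((entry) => {
      if (entry.lastmod === undefined) {
        return true; // no date: cannot prove it is unchanged
      }
      const parsed = Date.parse(entry.lastmod);
      return isNaN(parsed) || parsed > lastCrawl.getTime();
    })
    .sort((a, b) => lastmodMillis(b) - lastmodMillis(a));
}
```

Under this policy a page with a stale or missing <lastmod> is either skipped when it should not be or re-fetched every time, which is why keeping the dates honest pays off.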