XML Sitemap and robots.txt: Configuring Them Correctly
The XML sitemap lists your priority pages to facilitate their discovery by Googlebot. The robots.txt file controls which sections the bot can crawl. These two files are complementary and must be kept up to date to avoid indexing errors.
The sitemap and robots.txt are the two most fundamental SEO configuration files. Misconfigured, they can accidentally exclude key pages or waste the crawl budget on useless URLs.
The XML sitemap: structure and best practices
An XML sitemap lists the URLs you want indexed, optionally accompanied by metadata (last modification date, update frequency, priority). Google treats this metadata as a hint at most: it ignores update frequency and priority, and uses the modification date only when it is consistently accurate.
A single sitemap file is limited to 50,000 URLs and 50 MB uncompressed. Beyond either limit, create a sitemap index pointing to several thematic sitemap files (articles, products, categories).
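As an illustration of the splitting described above, here is a minimal Python sketch that cuts a URL list into files of at most 50,000 entries and builds the matching sitemap index. The function names, the example.com domain, and the sitemap-N.xml file names are assumptions made for the example. The sketch does not check the 50 MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit of the sitemap protocol


def chunk(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive slices of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap(urls):
    """Return the XML text of one sitemap file listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = loc
    return ET.tostring(urlset, encoding="unicode")


def build_sitemap_index(sitemap_urls):
    """Return the XML text of a sitemap index pointing to each sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = loc
    return ET.tostring(index, encoding="unicode")


# Hypothetical usage with a tiny chunk size so the split is visible.
urls = [f"https://example.com/articles/{n}" for n in range(5)]
parts = chunk(urls, size=2)
files = {f"sitemap-{i}.xml": build_sitemap(part) for i, part in enumerate(parts, 1)}
index_xml = build_sitemap_index(f"https://example.com/{name}" for name in files)
print(len(files), "sitemap files")
print(index_xml)
```

In a real setup, each entry in files would be written to disk and the index would be the only URL submitted.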
- Only include canonical, indexable URLs returning a 200 code.
- Exclude noindex pages, redirected URLs, and parameterized URLs that duplicate a canonical page.
- Submit your sitemap in Search Console and reference it in robots.txt.
- Update the sitemap automatically with each new publication.
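The inclusion rules in this list can be expressed as a simple filter. The sketch below assumes each page has already been crawled into a small record; the Page class and its fields are hypothetical names chosen for the example, not part of any tool.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Page:
    """Hypothetical crawl record for one URL."""
    url: str
    status: int          # HTTP status code returned by the URL
    canonical: str       # URL declared in the page's canonical link
    noindex: bool = False


def sitemap_eligible(page):
    """Return True only for canonical, indexable URLs answering 200."""
    if page.status != 200:
        return False                    # redirects and errors stay out
    if page.noindex:
        return False                    # pages excluded from indexing
    if page.canonical != page.url:
        return False                    # the page points to another canonical
    if urlsplit(page.url).query:
        return False                    # parameterized URL
    return True


pages = [
    Page("https://example.com/", 200, "https://example.com/"),
    Page("https://example.com/old", 301, "https://example.com/new"),
    Page("https://example.com/cart", 200, "https://example.com/cart", noindex=True),
    Page("https://example.com/shoes?sort=price", 200, "https://example.com/shoes"),
]
print([p.url for p in pages if sitemap_eligible(p)])
```

Treating every URL with a query string as ineligible is a deliberate simplification; a site whose canonical URLs carry parameters would need a looser rule.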
The robots.txt file: directives and limits
The robots.txt file sits at the root of the domain and uses a simple syntax of Allow and Disallow rules grouped by user-agent. It tells Googlebot which parts of the site not to crawl, but it does not guarantee exclusion from the index.
A page blocked by robots.txt can still appear in results if external links point to it. To keep a page out of the index, use the noindex tag and leave the page crawlable: if robots.txt blocks the URL, Googlebot never fetches the page and never sees the noindex.
- Block administration, staging, and test folders.
- Block internal search URLs that generate thousands of variations.
- Never block CSS and JS files necessary for page rendering.
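- Reference the sitemap URL with a Sitemap line in robots.txt. The line is independent of the user-agent groups, so it can go anywhere; the end of the file is the usual convention.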
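To see these rules in action, the following sketch parses a sample robots.txt with Python's standard urllib.robotparser module and tests a few URLs. The paths /admin/, /staging/ and /search and the example.com domain are assumptions for illustration. Python's parser applies rules in file order, which can differ from Google's most-specific-rule matching when Allow and Disallow overlap, so the sample avoids overlapping rules.

```python
from urllib.robotparser import RobotFileParser

# Sample file: the blocked paths and the domain are illustrative assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /search

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
for url in (
    "https://example.com/admin/login",
    "https://example.com/search?q=shoes",
    "https://example.com/assets/site.css",
    "https://example.com/products/shoes",
):
    verdict = "crawlable" if rules.can_fetch("Googlebot", url) else "blocked"
    print(f"{verdict:9} {url}")

# Sitemap lines declared in the file (site_maps() needs Python 3.8+).
print(rules.site_maps())
```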
Critical errors and how to avoid them
The most serious error: accidentally blocking the entire site with 'Disallow: /' in robots.txt following a migration or poorly cleaned staging configuration. Check this file as a priority after each deployment.
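One way to automate this check is a small guard run after each deployment. The sketch below is an assumption about how such a guard could look, not a standard tool: it reports whether the given robots.txt content forbids the home page for a given user-agent, again using Python's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt, user_agent="Googlebot"):
    """Return True if the robots.txt content forbids crawling the root URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


staging_leftover = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /admin/\n"

print(blocks_entire_site(staging_leftover))  # staging rule left in place
print(blocks_entire_site(production))        # normal production rules
```

In a deployment pipeline, the content would come from the live file, and a True result would fail the release.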
Including error or redirected URLs (404, 301) in the sitemap is a common mistake: it signals a lack of rigor to Google and wastes crawl budget on resources that no longer exist or have moved.
In SEO audits, between 15% and 40% of sites show inconsistencies between their sitemap and the pages that are actually indexable, often because the sitemap was not maintained after site updates.
Source: sector studies on technical SEO audits, 2025-2026.
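Such inconsistencies can be surfaced by comparing the URLs listed in the sitemap with the set of URLs known to be indexable. The sketch below assumes that set comes from your own crawl; the function names and sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_text):
    """Extract the set of <loc> values from a sitemap's XML text."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)}


def audit(xml_text, indexable_urls):
    """Report the mismatches between a sitemap and the indexable URL set."""
    listed = sitemap_locs(xml_text)
    indexable = set(indexable_urls)
    return {
        "listed_not_indexable": sorted(listed - indexable),
        "indexable_not_listed": sorted(indexable - listed),
    }


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/deleted-page</loc></url>"
    "</urlset>"
)
report = audit(sample, ["https://example.com/", "https://example.com/new-article"])
print(report)
```

Both lists should be empty on a well-maintained site: the first points to stale sitemap entries, the second to pages missing from the sitemap.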
FAQ
Should priority and frequency be indicated in the sitemap?
Google ignores these tags (priority and changefreq) and relies on its own signals to estimate crawl frequency. Including them does no harm, and leaving them out costs nothing.
How long does Google take to read a submitted sitemap?
After submission in Search Console, Google generally reads the sitemap within 24 to 72 hours. Discovering the new URLs and actually indexing them takes longer, depending on site authority.
Does robots.txt work for all search engines?
Well-behaved bots that follow the Robots Exclusion Protocol respect robots.txt. Malicious bots (scrapers, non-compliant crawlers) ignore it. Robots.txt is therefore a crawl management tool, not a security tool.