SD004 – Canonical URL Validation

Overview

Checks that the page has exactly one valid canonical URL and that it aligns with the page URL (same domain, similar path).

Probability that AI systems use this signal: Crawlers and generative AI pipelines consume structured and visible page signals for training and inference. Assessment-specific signals (e.g. schema, canonicals, trust cues) affect how likely a page is to be indexed and surfaced. The probability that this assessment's signal influences AI behavior is high when the page is in a product or compliance context.

Impact on geo compliance: Passing this assessment supports geo compliance by ensuring that machine-readable and visible content meets standards that reduce the risk of AI-generated answers presenting the wrong locale, pricing, or trust information. Failing can lead to non-compliant or misleading surfacing.

What We Check

All `<link rel="canonical">` hrefs are collected.
Zero canonical: fail. Multiple canonicals: fail.
URL format validated via `urlparse`: scheme and netloc required.
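**Canonical vs page URL:** the netloc must be the same. Paths are normalized (trailing `/` removed via rstrip) and must then be either equal or similar according to `_paths_are_similar`, which requires the overlapping path parts to cover at least 50% of the shorter path. A path length difference of more than 2 parts is always treated as not similar.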
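
The URL format and path checks above can be sketched roughly as follows. This is a minimal illustration, not the assessment's actual code: `is_valid_url`, `paths_are_similar`, and `canonical_matches_page` are hypothetical names, and the way overlap is counted (position by position) and the handling of an empty path are assumptions, since the real `_paths_are_similar` is not shown here.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Treat a URL as valid only if it has both a scheme and a netloc."""
    parsed = urlparse(url)
    return bool(parsed.scheme and parsed.netloc)


def paths_are_similar(path_a: str, path_b: str) -> bool:
    """Hypothetical stand-in for `_paths_are_similar`; the real helper may differ."""
    parts_a = [p for p in path_a.split("/") if p]
    parts_b = [p for p in path_b.split("/") if p]
    # A length difference of more than 2 parts is never similar.
    if abs(len(parts_a) - len(parts_b)) > 2:
        return False
    min_len = min(len(parts_a), len(parts_b))
    if min_len == 0:
        # Assumption: an empty path is only similar to another empty path.
        return len(parts_a) == len(parts_b)
    # Assumption: overlap is counted position by position.
    overlap = sum(1 for a, b in zip(parts_a, parts_b) if a == b)
    return overlap >= 0.5 * min_len


def canonical_matches_page(canonical: str, page_url: str) -> bool:
    """Same netloc, then equal or similar paths after stripping trailing slashes."""
    c, p = urlparse(canonical), urlparse(page_url)
    if c.netloc != p.netloc:
        return False
    c_path, p_path = c.path.rstrip("/"), p.path.rstrip("/")
    return c_path == p_path or paths_are_similar(c_path, p_path)
```

Under these assumptions, `https://example.com/shop/shoes/` matches the page `https://example.com/shop/shoes` (equal after normalization), while a canonical on a different netloc, or one whose path differs in length by more than 2 parts, does not.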

Pass / Fail and Score

**PASSED** if score >= 60.
**Score:** 100 if the canonical matches the page URL pattern; 80 if the canonical is valid but doesn’t match the page URL pattern.
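
As a small sketch of how the two documented scores relate to the pass threshold (the names `PASS_THRESHOLD`, `score_valid_canonical`, and `is_passed` are hypothetical, and only the case of exactly one valid canonical is modeled, because scores for the failing cases are not listed in this section):

```python
PASS_THRESHOLD = 60  # from "PASSED if score >= 60"


def score_valid_canonical(matches_page_pattern: bool) -> int:
    """Score for a page with exactly one valid canonical (hypothetical helper)."""
    return 100 if matches_page_pattern else 80


def is_passed(score: int) -> bool:
    """Apply the documented pass threshold."""
    return score >= PASS_THRESHOLD
```

Both documented scores clear the threshold, so a page with a single valid canonical that does not match the page URL pattern still passes, only with a lower score.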

How to Fix When It Fails

Add canonical: `<link rel="canonical" href="...">`.
Use only one canonical per page.
Use valid URL format.
Ensure canonical matches page URL structure (same domain, consistent path).

Common Issues

No canonical link.
Multiple canonical links.
Invalid URL (missing scheme or netloc).
Canonical points to different domain or very different path.

Dependencies

None.

How to Verify

View the page source and confirm there is a single `<link rel="canonical" href="...">` with a valid absolute URL that matches the page.
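
For checking saved page source without reading it by eye, a rough sketch using only the Python standard library could look like this. `CanonicalCollector` and `find_canonicals` are hypothetical names for illustration and are not part of the assessment itself.

```python
from html.parser import HTMLParser
from typing import List


class CanonicalCollector(HTMLParser):
    """Collects the href of every <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_canonicals(html: str) -> List[str]:
    """Return all canonical hrefs found in the given HTML source."""
    collector = CanonicalCollector()
    collector.feed(html)
    return collector.hrefs
```

A page is in good shape when `find_canonicals(source)` returns exactly one entry and that entry is an absolute URL on the same domain as the page. Note that this only inspects the HTML as served; canonicals injected later by client-side scripts would not be seen.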

Related Assessments

SD001–SD010 (structured data); SEO best practice.

Additional Resources

Google: Consolidate duplicate URLs (canonical). Use a single canonical URL per page, on the same domain and with an aligned path. SI008 (SPA routing) and SI007 (SSR) can affect canonical handling.