← Back to assessments

SI004 – URL Deduplication

Overview

Validates URL normalization and deduplication: no duplicate URLs in processed list; current URL not a duplicate of a previous one; problematic URL patterns flagged.

Probability that AI systems use this signal: Crawlers and generative AI pipelines consume structured and visible page signals for training and inference. Assessment-specific signals (e.g. schema, canonicals, trust cues) affect how likely a page is indexed and surfaced. The probability that this assessment's signal influences AI behavior is high when the page is in a product or compliance context.

Impact on geo compliance: Passing this assessment supports geo compliance by ensuring machine-readable and visible content meet standards that reduce the risk of wrong locale, pricing, or trust in AI-generated answers. Failing can lead to non-compliant or misleading surfacing.

What We Check

  • **_normalize_url:** urlparse; lowercase; path rstrip /; query sorted; fragment removed (implementation may strip or normalize query). Normalized URL compared to context.url.
  • **context.metadata["processed_urls"]:** if present, all normalized; duplicate list length vs set length → −50; if current normalized URL in processed (excluding last) → −30.
  • **_check_url_patterns:** implementation-specific checks for problematic patterns (e.g. session params, duplicate keys). −10 per issue.
  • If context.url != normalized_url, recommendation to note normalization.

Pass / Fail and Score

  • **PASSED** if score >= 60.

How to Fix When It Fails

  • Implement URL deduplication before processing; skip duplicate URL processing; review URL structure and parameters.

Common Issues

  • Same URL processed twice (e.g. with/without trailing slash or query); problematic query params.

Dependencies

None. Relies on context.url and optional context.metadata.processed_urls.

How to Verify

  • Normalize URLs before adding to processed_urls; compare with current URL.

Additional Resources

  • URL normalization; canonical URLs