URL Standardization and Cleaning

This step provides the cleaned version of the website domain for URLs in source datasets.

This service is included without additional licensing in the following data product templates:

  • B2B Customers
  • B2B Customers with Firmographics
  • Legal Entities

Unlike other data quality services, URL standardization and cleaning is performed through a transformation step, Standardize URL.

After mastering, the cleaned domain produced by this step is available in the Website Domain field of the mastered entity.

In this case, "domain" refers to the entire domain section of the URL. URLs are constructed from three basic parts, as shown in the diagram below:

  1. Protocol, such as http://, https://, or mailto:. In this example, the protocol is https://.
  2. Full domain information identified and cleaned by this step. In this example:
    • .com is the top-level domain, which identifies the type of organization (such as .com for corporate entities or .edu for educational institutions).
    • tamr is the second-level domain, and is the name of the website.
    • cloud and docs are subdomains. Subdomains organize the website content into different categories. In this example, docs is a subdomain of the main website, and cloud is a subdomain of docs.
  3. Subdirectory. In this example, user-roles is a subdirectory of the website.

Parts of a URL
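
The three parts described above can be illustrated with a short Python sketch built on the standard library's urllib.parse module. The example URL, https://cloud.docs.tamr.com/user-roles, is an assumption reconstructed from the description in the list, and url_parts is a hypothetical helper name, not part of the Standardize URL step. The sketch also assumes a single-label top-level domain such as .com, so it would not split multi-label suffixes such as .co.uk correctly.

```python
from urllib.parse import urlsplit


def url_parts(url: str) -> dict:
    """Split a URL into the parts named in the list above (illustration only)."""
    parts = urlsplit(url)
    # hostname is the full domain section, e.g. "cloud.docs.tamr.com"
    labels = parts.hostname.split(".")
    return {
        "protocol": parts.scheme,             # e.g. "https"
        "top_level_domain": labels[-1],       # e.g. "com"
        "second_level_domain": labels[-2],    # e.g. "tamr"
        "subdomains": labels[:-2],            # e.g. ["cloud", "docs"]
        "subdirectory": parts.path.strip("/"),
    }


# Assumed example URL, reconstructed from the description above.
print(url_parts("https://cloud.docs.tamr.com/user-roles"))
```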

The Standardize URL step extracts the domain information, and standardizes, or cleans, it as follows:

  • Removes the protocol, a leading "www", and subdirectories.
  • Removes special characters, except dash (-) characters within the domain.
  • Converts alphabetic characters to lowercase.
  • Retains subdomains.
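
To make these rules concrete, the following Python sketch approximates them with regular expressions. It is not the Standardize URL implementation: clean_domain is a hypothetical function name, the order of operations is a guess, and treating numbered prefixes such as "www1" like "www" is an assumption based on the examples on this page.

```python
import re


def clean_domain(url: str) -> str:
    """Approximate the cleaning rules listed above (illustration only)."""
    # Convert alphabetic characters to lowercase.
    text = url.strip().lower()
    # Remove the protocol: any run of scheme characters followed by "://".
    text = re.sub(r"^[a-z][a-z0-9+.-]*://", "", text)
    # Remove subdirectories: keep only what precedes the first "/".
    text = text.split("/", 1)[0]
    # Remove special characters, keeping letters, digits, dots, and dashes.
    text = re.sub(r"[^a-z0-9.-]", "", text)
    # Remove a leading "www" label (assumption: "www1", "www2", ... as well).
    text = re.sub(r"^www\d*\.", "", text)
    # Drop stray dots or dashes left at either end.
    return text.strip(".-")


print(clean_domain("https://www.tamr.com/dataproducts"))  # tamr.com
print(clean_domain("https://cloud.docs.tamr.com"))         # cloud.docs.tamr.com
```

Subdomains are retained because the sketch never removes labels other than the leading "www" prefix.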

The table below provides examples of cleaned domains returned by this step.

Type of Cleaning                      Source URL                       Cleaned Domain
------------------------------------  -------------------------------  -------------------
Removes protocol and "www"            https://www.tamr.com             tamr.com
                                      https://tamr.com                 tamr.com
                                      http://tamr.com                  tamr.com
                                      https://www1.tamr.com            tamr.com
                                      httpssss://tamr.com              tamr.com
Removes subdirectories                https://tamr.com/dataproducts    tamr.com
Removes special characters            tamr?.com                        tamr.com
                                      t,am%r.com                       tamr.com
                                      [@tamr.com]                      tamr.com
                                      tamr.com//                       tamr.com
Converts to lowercase                 Tamr.COM                         tamr.com
                                      TAMR.com                         tamr.com
Retains - characters in the domain    https://tamr-example.com         tamr-example.com
Retains subdomain                     https://docs.tamr.com            docs.tamr.com
                                      https://cloud.docs.tamr.com      cloud.docs.tamr.com