URL Standardization and Cleaning
This step provides the cleaned version of the website domain for URLs in source datasets.
This service is included without additional licensing in the following data product templates:
- B2B Customers
- B2B Customers with Firmographics
- Legal Entities
Unlike other data quality services, URL standardization and cleaning is performed through a transformation step.
The Standardize URL transformation step provides the cleaned version of the website domain for URLs in source datasets. After mastering, the cleaned domain is available in the Website Domain field in the mastered entity.
In this case, "domain" refers to the entire domain section of the URL. URLs are constructed from three basic parts, as shown in the diagram below:
- Protocol, such as `http://`, `https://`, or `mailto:`. In this example, the protocol is `https://`.
- Full domain information, which is identified and cleaned by this step. In this example:
  - `.com` is the top-level domain, which identifies the type of organization (such as .com for corporate entities or .edu for educational institutions).
  - `tamr` is the second-level domain, and is the name of the website.
  - `cloud` and `docs` are subdomains. Subdomains organize the website content into different categories. In this example, `docs` is a subdomain of the main website, and `cloud` is a subdomain of `docs`.
- Subdirectory. In this example, `user-roles` is a subdirectory of the website.
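The three parts described above can be illustrated with a short Python sketch. It is not part of the Standardize URL step. The example URL `https://cloud.docs.tamr.com/user-roles` is assembled from the parts named in the list and is an assumption, as is the helper name `split_url`. The sketch also assumes a well-formed URL with a single-label top-level domain, so a suffix such as `.co.uk` would not be split correctly.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Break a well-formed URL into protocol, domain parts, and subdirectory."""
    parts = urlsplit(url)
    # hostname is the full domain, e.g. "cloud.docs.tamr.com"
    labels = parts.hostname.split(".")
    return {
        "protocol": parts.scheme + "://",
        "top_level_domain": "." + labels[-1],
        "second_level_domain": labels[-2],
        "subdomains": labels[:-2],  # outermost subdomain first
        "subdirectory": parts.path.strip("/"),
    }


# Hypothetical example URL reconstructed from the parts listed above.
print(split_url("https://cloud.docs.tamr.com/user-roles"))
```

The Standardize URL step keeps the entire domain portion (subdomains, second-level domain, and top-level domain) and discards the protocol and subdirectory.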
The Standardize URL step extracts the domain information, and standardizes, or cleans, it as follows:
- Removes the protocol, a leading "www" (including variants such as "www1"), and subdirectories.
- Removes special characters, other than dash (-) characters within the domain.
- Converts alphabetic characters to lowercase.
- Retains subdomains.
The table below provides examples of cleaned domains returned by this step.
| Type of Cleaning | Source URL | Cleaned Domain |
|---|---|---|
| Removes protocol and "www" | `https://www.tamr.com` | tamr.com |
| | `https://tamr.com` | tamr.com |
| | `http://tamr.com` | tamr.com |
| | `https://www1.tamr.com` | tamr.com |
| | `httpssss://tamr.com` | tamr.com |
| Removes subdirectories | `https://tamr.com/dataproducts` | tamr.com |
| Removes special characters | `tamr?.com` | tamr.com |
| | `t,am%r.com` | tamr.com |
| | `[@tamr.com]` | tamr.com |
| | `tamr.com//` | tamr.com |
| Converts to lowercase | `Tamr.COM` | tamr.com |
| | `TAMR.com` | tamr.com |
| Retains - characters in the domain | `https://tamr-example.com` | tamr-example.com |
| Retains subdomain | `https://docs.tamr.com` | docs.tamr.com |
| | `https://cloud.docs.tamr.com` | cloud.docs.tamr.com |
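For readers who want to preview the behavior on their own data, the rules and examples above can be approximated with the following Python sketch. It is an illustrative approximation, not Tamr's implementation: the function name `clean_domain` and the order of operations are assumptions chosen to reproduce the rows in the table, and inputs such as ports or query strings are not handled.

```python
import re


def clean_domain(url: str) -> str:
    """Approximate the documented cleaning rules for a single URL string."""
    s = url.strip().lower()  # convert alphabetic characters to lowercase
    # Drop anything shaped like a protocol, even a malformed one ("httpssss://").
    s = re.sub(r"^[a-z][a-z0-9+.-]*://", "", s)
    # Drop subdirectories: keep only the text before the first "/".
    s = s.split("/", 1)[0]
    # Remove special characters, keeping letters, digits, dots, and dashes.
    s = re.sub(r"[^a-z0-9.-]", "", s)
    # Drop a leading "www" label, including numbered variants such as "www1".
    s = re.sub(r"^www\d*\.", "", s)
    return s


for source in ["https://www1.tamr.com", "[@tamr.com]", "https://cloud.docs.tamr.com"]:
    print(source, "->", clean_domain(source))
```

Subdomains other than "www" are left in place, which matches the "Retains subdomain" rows of the table.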