URL Standardization and Cleaning
This step provides the cleaned version of the website domain for URLs in source datasets.
Unlike other enrichment services, URL standardization and cleaning is performed through a transformation step.
After mastering, the cleaned domain produced by the Standardize URL transformation step is available in the Website Domain field in the mastered entity.
In this case, "domain" refers to the entire domain section of the URL. URLs are constructed from three basic parts, as shown in the diagram below:
- Protocol, such as http://, https://, or mailto:.
- Full domain information, which this step identifies and cleans. In this example:
  - .com is the top-level domain, which identifies the type of organization (such as .com for corporate entities or .edu for educational institutions).
  - tamr is the second-level domain, and is the name of the website.
  - cloud and docs are subdomains. Subdomains organize the website content into different categories. In this example, docs is a subdomain of the main website, and cloud is another subdomain.
- Subdirectory. In this example, user-roles is a subdirectory of the website.
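To make the three parts concrete, the following is a minimal Python sketch that splits a URL into protocol, full domain, and subdirectory using the standard library. The function name `split_url` and the URL `https://help.docs.example.com/user-roles` are hypothetical and chosen only for illustration; they are not the URL from the diagram, and this is not how the Standardize URL step is implemented.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into the three basic parts described above (illustrative only)."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return {
        "protocol": parts.scheme,        # e.g. "https"
        "full_domain": parts.hostname,   # everything between the protocol and the path
        # Naive assumption: the last label is the top-level domain. This does not
        # handle multi-part suffixes such as ".co.uk".
        "top_level": labels[-1],
        "second_level": labels[-2],
        "subdomains": labels[:-2],
        "subdirectory": parts.path,
    }


# Hypothetical example URL, not taken from the documentation.
example = split_url("https://help.docs.example.com/user-roles")
print(example["protocol"])      # https
print(example["full_domain"])   # help.docs.example.com
print(example["subdomains"])    # ['help', 'docs']
print(example["subdirectory"])  # /user-roles
```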
The Standardize URL step extracts the domain information, and standardizes, or cleans, it as follows:
- Removes the protocol, a leading "www", and subdirectories.
- Removes special characters, other than dash (-) characters within the domain.
- Converts alphabetic characters to lowercase.
- Retains subdomains.
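The rules above can be approximated with a short Python sketch. This is an illustration under stated assumptions, not the actual logic of the Standardize URL step: the function name `standardize_url` is hypothetical, stripping a port number and treating everything outside ASCII letters, digits, dots, and dashes as a "special character" are assumptions, and the sample URLs are invented.

```python
import re


def standardize_url(url: str) -> str:
    """Approximate the cleaning rules listed above (hypothetical sketch)."""
    # Convert alphabetic characters to lowercase.
    text = url.strip().lower()
    # Remove the protocol, for example "http://" or "https://".
    text = re.sub(r"^[a-z][a-z0-9+.-]*://", "", text)
    # Remove subdirectories (and any query string or fragment).
    text = re.split(r"[/?#]", text, maxsplit=1)[0]
    # Assumption: a trailing port such as ":8080" is not part of the domain.
    text = re.sub(r":\d+$", "", text)
    # Remove special characters, keeping letters, digits, dots, and dashes.
    # Assumption: non-ASCII characters are treated as special characters.
    text = re.sub(r"[^a-z0-9.-]", "", text)
    # Remove a leading "www." while retaining any other subdomains.
    if text.startswith("www."):
        text = text[len("www."):]
    return text


# Invented sample URLs, used only to exercise the sketch.
print(standardize_url("https://www.Example.com/about/team"))        # example.com
print(standardize_url("http://Docs.My-Site!.example.org/a?b=1"))    # docs.my-site.example.org
```

Removing special characters before checking for "www." means stray punctuation in front of the prefix does not prevent it from being stripped.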
The table below provides examples of cleaned domains returned by this step.
|Type of Cleaning|Source URL|Cleaned Domain|
|---|---|---|
|Removes protocol and "www"| | |
|Removes subdirectories| | |
|Removes special characters| | |
|Converts to lowercase| | |
|Retains dash (-) characters| | |
|Retains subdomain| | |