Understanding Clustering

For each data product, Tamr Cloud uses both a trained machine-learning model and rules to identify and cluster source records that refer to the same real-world entity. For each cluster, the flow creates a single entity (a company, contact, patient, and so on) with the most appropriate values from the clustered records. These entities are referred to as either mastered entities or golden records.

The model and rules are applied in clustering. Open clustering in your flow to see the list of specific attributes used by the machine-learning model for the data product and the list of deterministic rules that are used to cluster records.

Clustering:

Uses the trained model to cluster records based on similarity of key attributes.
Applies any rules to match or split those clusters, based on matching or non-matching values for specific attributes.
Finally, applies any changes you have made as part of curation. See Adjusting Mastering Results to learn more.

Clustering Model

First, the trained machine-learning model clusters similar records together.

Models are specific to each data product, and are trained to accurately identify similar companies, contacts, patients, and so on. Each model compares the values of the attributes that are relevant to the data product to determine which records should be clustered together.

For example, the B2B Customers data product, which masters and enriches company data, considers the similarities in these attributes to determine if source records represent the same company:

Company name
Alternate company names
Full address and address components
Phone number
Website

To learn more about the clustering model for your data product, refer to the data product documentation.

Clustering Rules

After the records have been clustered by the model, clustering rules are applied to refine the results. Clustering rules deterministically identify records that should or should not be clustered together based on values in specific attributes that reliably indicate unique entities.

Types of Rules

A match rule matches clusters with matching non-null values for specified attributes, such as a trusted_id. Match rules will not match clusters that contain only null or empty values for the specified attribute.

A split rule splits clusters that contain records with different non-null values for specified attributes. The rule splits the cluster so that each new cluster contains records with matching values for the attributes.
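
As a rough illustration of the two rule types, the following Python sketch expresses each rule as a check over clusters. It assumes that a cluster is a list of record dictionaries, that None and empty strings count as null, and that each rule looks at a single attribute. The function names are hypothetical and do not reflect Tamr's implementation.

```python
def non_null_values(cluster, attribute):
    """Collect the distinct non-null values of an attribute within a cluster."""
    values = set()
    for record in cluster:
        value = record.get(attribute)
        # Assumption: None and blank strings are both treated as null.
        if value is not None and str(value).strip() != "":
            values.add(value)
    return values


def match_rule_applies(cluster_a, cluster_b, attribute):
    """A match rule joins two clusters that share a non-null value."""
    shared = non_null_values(cluster_a, attribute) & non_null_values(cluster_b, attribute)
    return bool(shared)


def split_rule_applies(cluster, attribute):
    """A split rule fires when a cluster holds more than one distinct non-null value."""
    return len(non_null_values(cluster, attribute)) > 1
```

Because only non-null values are compared, two clusters whose records all have an empty trusted_id never satisfy the match check, and a cluster in which one record has an empty value and the rest agree is never split.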

Rule Priority

Most data products include multiple rules, which are listed in descending order of priority. Rules with higher priority will take precedence over other rules if there is a conflict in the clustering logic.

Identifying Applied Clustering Rules

On the Configure Data Product page, each rule is numbered. After the data product runs, the Applied Clustering Rules (clustering_metadata.applied_clustering_rules) attribute in the source records dataset provides the number assigned to any rules applied to the record and the rule type. This helps you understand the specific rules that determined the cluster for a given source record.

Note: After running the data product, the Source Records page also includes an Applied Rules attribute, which provides the internal zero-based number of the first applied rule. For example, if Rule 1 is applied, this attribute value is "0". The Applied Clustering Rules attribute provides more information about the clustering rules that were applied to the records.
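
The zero-based offset described in the note can be sketched as a small conversion. The function name is hypothetical, and the sketch assumes the Applied Rules value is a string holding a single integer, or is empty when no rule was applied.

```python
def rule_number_from_applied_rules(applied_rules_value):
    """Map a zero-based Applied Rules value to the rule number shown on the
    Configure Data Product page.

    Returns None for an empty value (assumed to mean that no rule was applied).
    """
    if applied_rules_value is None or str(applied_rules_value).strip() == "":
        return None
    return int(applied_rules_value) + 1


print(rule_number_from_applied_rules("0"))  # 1, that is, Rule 1
```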

Clustering Rule Example

A Healthcare Provider data product includes a rule to always cluster together records with the same value for a custom trusted_id field, and to never cluster records with different trusted_id values. It also applies another, lower-priority rule that prevents healthcare providers who have different middle names from being clustered together.

Consider the rules in a Healthcare Provider data product, listed here in priority order:

Match and split on trusted_id: Records with matching values are clustered together; records with non-matching non-null values are put into different clusters.
Split on middle_name: Records with non-matching non-null values are put into different clusters.

Here is how these rules are applied to a healthcare provider cluster:

First, Tamr splits any records with different, non-null middle_name values into different clusters.
Then, Tamr splits any records in a cluster with non-matching, non-null trusted_id values into different clusters.
Finally, Tamr matches any clusters where records have the same trusted_id value.

This means that records with different middle_name values but the same trusted_id value end up clustered together, because the rule to match clusters with the same trusted_id has a higher priority than the rule to split on middle_name.

Based on these rules, the records in the table below are grouped into 3 clusters as follows:

Cluster A: Records 1, 2, and 3. These records are clustered together because:
- Records 1 and 2 have the same trusted_id. Although the middle_name values are different between the records, the trusted_id rule takes priority.
- Since Record 3 has a blank trusted_id, it is included with the records with the most common trusted_id within the cluster.
Cluster B: Record 4. This record is put into its own cluster because:
- It has a different trusted_id value than Records 1 and 2 and therefore is not included in Cluster A.
- It has a different middle name than Record 5 and therefore cannot be clustered with Record 5 despite high similarity in other attribute values.
Cluster C: Record 5. This record is put into its own cluster because it has a different middle name value than the other similar records.

Record	Cluster	trusted_id	address_line_1	city	first_name	last_name	middle_name	provider_specialty	region	Applied Clustering Rules
1	A	ab1cd2	123 Main Street	Springfield	Christopher	Rogers	Adam	Internal Medicine	OH	Rule 1 (Match)
2	A	ab1cd2	3 Spruce Ave.		Chris	Rogers	Arthur	Internal Medicine	MA	Rule 1 (Match)
3	A		123 Main Street	Springfield	Christopher	Rogers	Adam	Internal Medicine	OH
4	B	ef3hi4	Main St.	Springfield	Chris	Rodgers	Adam	Internal Medicine	OH	Rule 2 (Split)
5	C		123 Main Street	Springfield	Chris	Rodgers	Brian	Internal Medicine	OH	Rule 2 (Split)
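
The example can be traced with a short Python sketch that applies the three steps in the order described above. This is an illustration under stated assumptions, not Tamr's implementation: it assumes the model initially placed all five records in one candidate cluster, it keeps null-valued records with the largest non-null group when splitting, and it breaks ties in favor of the first value seen (which is why Record 3 stays with Record 1 here). All function names are hypothetical.

```python
from collections import defaultdict


def is_null(value):
    # Assumption: None and blank strings are both treated as null.
    return value is None or str(value).strip() == ""


def split_on(clusters, attribute):
    """Split every cluster so records with different non-null values are separated."""
    result = []
    for cluster in clusters:
        groups, nulls = defaultdict(list), []
        for record in cluster:
            value = record[attribute]
            if is_null(value):
                nulls.append(record)
            else:
                groups[value].append(record)
        if not groups:
            result.append(cluster)  # only null values: nothing to split on
            continue
        # Null-valued records join the most common value (ties: first seen).
        max(groups.values(), key=len).extend(nulls)
        result.extend(groups.values())
    return result


def match_on(clusters, attribute):
    """Merge clusters that share a non-null value for the attribute."""
    merged = []  # list of (non-null values, records) pairs
    for cluster in clusters:
        values = {r[attribute] for r in cluster if not is_null(r[attribute])}
        records, remaining = list(cluster), []
        for other_values, other_records in merged:
            if values & other_values:
                values |= other_values
                records = other_records + records
            else:
                remaining.append((other_values, other_records))
        remaining.append((values, records))
        merged = remaining
    return [records for _, records in merged]


RECORDS = [
    {"record": 1, "trusted_id": "ab1cd2", "middle_name": "Adam"},
    {"record": 2, "trusted_id": "ab1cd2", "middle_name": "Arthur"},
    {"record": 3, "trusted_id": "", "middle_name": "Adam"},
    {"record": 4, "trusted_id": "ef3hi4", "middle_name": "Adam"},
    {"record": 5, "trusted_id": "", "middle_name": "Brian"},
]


def cluster_example(records):
    clusters = [list(records)]                   # assumed single candidate cluster
    clusters = split_on(clusters, "middle_name")  # lower-priority split first
    clusters = split_on(clusters, "trusted_id")   # then the trusted_id split
    clusters = match_on(clusters, "trusted_id")   # the trusted_id match is applied last
    return sorted(sorted(r["record"] for r in c) for c in clusters)


print(cluster_example(RECORDS))  # [[1, 2, 3], [4], [5]]
```

Applying the highest-priority rule last lets it override the earlier middle_name split, which is how Records 1 and 2 end up together in Cluster A.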