Understanding Clustering
For each data product, Tamr Cloud uses both a trained machine-learning model and rules to identify and cluster source records that refer to the same real-world entity. For each cluster, the flow creates a single entity (a company, contact, patient, and so on) with the most appropriate values from the clustered records. These entities are referred to as either mastered entities or golden records.
The model and rules are applied in clustering. Open clustering in your flow to see the list of specific attributes used by the machine-learning model for the data product and the list of deterministic rules that are used to cluster records.
Clustering:
- Uses the trained model to cluster records based on similarity of key attributes.
- Applies any rules to match or split those clusters, based on matching or non-matching values for specific attributes.
- Finally, applies any changes you have made as part of curation. See Adjusting Mastering Results to learn more.
Clustering Model
First, similar records are clustered together by the trained machine-learning model.
Models are specific to each data product and are trained to identify similar companies, contacts, patients, and so on. Each model considers how similar the values are for the attributes relevant to the data product to determine which records should be clustered together.
For example, the B2B Customers data product, which masters and enriches company data, considers the similarities in these attributes to determine if source records represent the same company:
- Company name
- Alternate company names
- Full address and address components
- Phone number
- Website
To learn more about the clustering model for your data product, refer to the data product documentation.
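To make the idea of comparing attribute values concrete, here is a minimal sketch. It is not the trained model: it replaces it with a simple weighted string similarity built on Python's standard `difflib`. The attribute names, the weights, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical attribute names and weights, chosen only for illustration.
WEIGHTS = {"company_name": 0.5, "address": 0.2, "phone": 0.15, "website": 0.15}


def attribute_similarity(a, b):
    """Return a 0..1 similarity for two values, or None if either is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of per-attribute similarities, skipping missing values."""
    total, weight_sum = 0.0, 0.0
    for attr, weight in weights.items():
        sim = attribute_similarity(rec_a.get(attr), rec_b.get(attr))
        if sim is None:
            continue  # a missing value neither helps nor hurts the score
        total += weight * sim
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


acme = {"company_name": "Acme Corp", "address": "1 Main St"}
acme_long = {"company_name": "Acme Corporation", "address": "1 Main Street"}
print(record_similarity(acme, acme_long))
```

A trained model learns how much each attribute matters from labeled examples, whereas the weights here are fixed by hand.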
Clustering Rules
After the records have been clustered by the model, clustering rules are applied to refine the results. Clustering rules deterministically identify records that should or should not be clustered together based on values in specific attributes that reliably indicate unique entities.
Types of Rules
A match rule merges clusters that share a value for a specified attribute, such as a trusted_id. Match rules will not match clusters that contain only null or empty values for the specified attribute.
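The behavior of a match rule can be sketched as follows. This is an illustration under stated assumptions, not Tamr's implementation: clusters are modeled as lists of record dictionaries, and `apply_match_rule` is a hypothetical name. Null and empty values are skipped, so clusters that contain only blank values are never merged.

```python
def apply_match_rule(clusters, attribute):
    """Merge clusters that share a non-empty value for `attribute`.

    `clusters` is a list of clusters; each cluster is a list of record dicts.
    """
    parent = list(range(len(clusters)))  # union-find over cluster indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    first_seen = {}  # attribute value -> index of the first cluster holding it
    for idx, cluster in enumerate(clusters):
        for record in cluster:
            value = record.get(attribute)
            if value in (None, ""):
                continue  # blank values never trigger a match
            if value in first_seen:
                parent[find(idx)] = find(first_seen[value])
            else:
                first_seen[value] = idx

    merged = {}
    for idx, cluster in enumerate(clusters):
        merged.setdefault(find(idx), []).extend(cluster)
    return list(merged.values())
```

The union-find structure keeps merges transitive: if cluster X shares one value with Y and another with Z, all three end up together.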
A split rule splits clusters that contain records with different values for a specific attribute. The rule splits the cluster so that each new cluster contains records with matching values for the attribute.
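A split rule can be sketched in the same spirit. The function name `apply_split_rule` is hypothetical, and the handling of blank values is an assumption: blank records are placed with the largest group, following the healthcare provider example later in this section, with ties going to the value seen first.

```python
def apply_split_rule(cluster, attribute):
    """Split one cluster so each new cluster holds a single non-empty value.

    `cluster` is a list of record dicts; the result is a list of clusters.
    """
    groups = {}   # attribute value -> records with that value
    blanks = []   # records whose value is null or empty
    for record in cluster:
        value = record.get(attribute)
        if value in (None, ""):
            blanks.append(record)
        else:
            groups.setdefault(value, []).append(record)

    if not groups:
        return [list(cluster)]  # only blank values: nothing to split on

    # Assumption: blank records join the most common value (ties: first seen).
    largest = max(groups, key=lambda value: len(groups[value]))
    groups[largest].extend(blanks)
    return list(groups.values())
```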
Rule Priority
Most data products include multiple rules, which are listed in descending order of priority. Rules with higher priority will take precedence over other rules if there is a conflict in the clustering logic.
Clustering Rule Example
A Healthcare Provider data product includes a rule to always cluster together records with the same value for a trusted_id field, and to never cluster records with different trusted_id values. It also applies another, lower-priority rule that prevents healthcare providers who have different middle names from being clustered together.
Consider the rules in a Healthcare Provider data product, listed in this priority order:
- trusted_id: Records with matching values are clustered together; records with non-matching, non-null values are put into different clusters.
- middle_name: Records with non-matching, non-null values are put into different clusters.
Here is how these rules are applied to a healthcare provider cluster:
- First, Tamr splits any records with different, non-null middle name values into different clusters.
- Then, Tamr splits any records in a cluster with non-matching, non-null trusted_id values into different clusters.
- Finally, Tamr matches any clusters where records have the same trusted_id value.
This means that records with different middle_name values but the same trusted_id value end up clustered together, because the rule to match clusters with the same trusted_id has a higher priority than the rule to split on middle_name.
Based on these rules, the records in the table below are grouped into 3 clusters as follows:
- Cluster A: Records 1, 2, and 3. These records are clustered together because:
  - Records 1 and 2 have the same trusted_id. Although the middle_name values are different between the records, the trusted_id rule takes priority.
  - Since Record 3 has a blank trusted_id, it is included with the records with the most common trusted_id within the cluster.
- Cluster B: Record 4. This record is put into its own cluster because:
  - It has a different trusted_id value than Records 1 and 2 and therefore is not included in Cluster A.
  - It has a different middle name than Record 5 and therefore cannot be clustered with Record 5 despite high similarity in other attribute values.
- Cluster C: Record 5. This record is put into its own cluster because it has a different middle name value than the other similar records.
Cluster | A | A | A | B | C |
---|---|---|---|---|---|
Attribute | Record 1 | Record 2 | Record 3 | Record 4 | Record 5 |
trusted_id | ab1cd2 | ab1cd2 | blank | ef3hi4 | blank |
address_line_1 | 123 Main Street | 3 Spruce Ave. | 123 Main Street | Main St. | 123 Main Street |
city | Springfield | blank | Springfield | Springfield | Springfield |
first_name | Christopher | Chris | Christopher | Chris | Chris |
last_name | Rogers | Rogers | Rogers | Rodgers | Rodgers |
middle_name | Adam | Arthur | Adam | Adam | Brian |
provider_specialty | Internal Medicine | Internal Medicine | Internal Medicine | Internal Medicine | Internal Medicine |
region | OH | MA | OH | OH | OH |
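The three steps can be traced in code against the table above. This is a sketch under stated assumptions, not Tamr's implementation: it assumes the model initially placed all five records in one cluster, that a record with a blank value joins the most common value in its cluster (ties going to the value seen first), and the helper names `split` and `match` are hypothetical. Only the two attributes used by the rules are included.

```python
# Records from the table; None stands for a blank value.
RECORDS = {
    1: {"trusted_id": "ab1cd2", "middle_name": "Adam"},
    2: {"trusted_id": "ab1cd2", "middle_name": "Arthur"},
    3: {"trusted_id": None, "middle_name": "Adam"},
    4: {"trusted_id": "ef3hi4", "middle_name": "Adam"},
    5: {"trusted_id": None, "middle_name": "Brian"},
}


def split(clusters, attribute):
    """Split each cluster so records with different non-blank values separate."""
    result = []
    for cluster in clusters:
        groups, blanks = {}, []
        for rid in cluster:
            value = RECORDS[rid][attribute]
            if value is None:
                blanks.append(rid)
            else:
                groups.setdefault(value, []).append(rid)
        if not groups:
            result.append(list(cluster))  # only blanks: nothing to split on
            continue
        # Assumption: blanks join the most common value; ties go to first seen.
        largest = max(groups, key=lambda value: len(groups[value]))
        groups[largest].extend(blanks)
        result.extend(groups.values())
    return result


def match(clusters, attribute):
    """Merge clusters that share a non-blank value for the attribute."""
    merged = []  # list of (values, record ids) with pairwise disjoint values
    for cluster in clusters:
        values = {RECORDS[rid][attribute] for rid in cluster} - {None}
        ids = list(cluster)
        keep = []
        for other_values, other_ids in merged:
            if values & other_values:      # shared non-blank value: merge
                values |= other_values
                ids = other_ids + ids
            else:
                keep.append((other_values, other_ids))
        keep.append((values, ids))
        merged = keep
    return sorted(sorted(ids) for _, ids in merged)


clusters = [[1, 2, 3, 4, 5]]                 # assumed output of the model
clusters = split(clusters, "middle_name")    # lower-priority split runs first
clusters = split(clusters, "trusted_id")     # higher-priority split
clusters = match(clusters, "trusted_id")     # match runs last, so it wins conflicts
print(clusters)  # [[1, 2, 3], [4], [5]]
```

Running the lower-priority rule first and the highest-priority match last is what lets the trusted_id match bring Records 1 and 2 back together after the middle_name split separated them.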