About Persistent Identifier Fields for Legacy Data Products

Tamr Cloud assigns a unique, persistent identifier to each entity. The same identifier is added to each source record in a cluster and to the mastered entity record.

When you add a source dataset to Tamr Cloud, the source dataset must include a field for the unique primary key. During the mastering process, Tamr Cloud also assigns a unique, persistent identifier to each business entity. The same identifier is added to each source record in a source record cluster and to the entity.

Primary Key Fields from Source Datasets

The following field stores the unique primary key from the source datasets. It is available in step output on the Configure Flow page, in source record tables in the Tamr Cloud UI, and in exported datasets, as described in the table below.

| Field | Step Output or Exported Dataset | Notes |
| --- | --- | --- |
| entityId | Apply Clustering Model Step; Source Records by Entities Dataset; Source Record tables | This is the unique identifier for the source record, generated by Tamr. This generated identifier is the tamr_record_id, which is a 128-bit hash value of the source dataset name and the source primary key. Note: This is not the same as the entity_id field in the Consolidate Records step output, which stores the Tamr ID for the clustered source records and their mastered entities. |

Persistent Identifiers Assigned by Tamr Cloud

The following fields store the Tamr ID, which is the Tamr Cloud-generated identifier for a mastered entity record and its clustered source records. The Tamr ID is a 128-bit universally unique identifier (UUID).

These fields are available in step output on the Configure Flow page, in source record tables in the Tamr Cloud UI, and in published datasets, as described in the table below.

| Field | Step Output or Published Dataset | Notes |
| --- | --- | --- |
| clusterId1 | Entities by Similarity Dataset | This is the first entity in the pair of entities being compared for similarity. See Datasets Available for Export for more information. |
| clusterId2 | Entities by Similarity Dataset | This is the second entity in the pair of entities being compared for similarity. See Datasets Available for Export for more information. |
| entity_id | Consolidate Records Step | |
| Entity ID | Entities Dataset | This is the default field name of the entity_id output field as configured in the Configure Attributes step. Depending on your step configuration, this field might have a different name. |
| persistentId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | |
| suggestedClusterId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | This is the persistent ID assigned by the clustering model. If a user overrides clusters, the persistent ID for the new source record cluster is available in the verifiedClusterId field. |
| tamr_id | Consolidate Records Step | |
| Tamr ID | Entities Dataset | This is the default field name of the tamr_id output field as configured in the Configure Attributes step. Depending on your step configuration, this field might have a different name. |
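
Because a Tamr ID is a 128-bit UUID, a consumer of exported data can sanity-check ID values with a standard UUID parser. The following is a minimal Python sketch that uses only the standard library's uuid module. The helper name parse_tamr_id is hypothetical and is not part of any Tamr API; the sample value is the identifier shown in the cluster override example in this topic.

```python
import uuid


def parse_tamr_id(value: str) -> uuid.UUID:
    """Parse a Tamr ID string as a UUID.

    Raises ValueError if the string is not a well-formed UUID.
    (Hypothetical helper for illustration only.)
    """
    return uuid.UUID(value)


tamr_id = parse_tamr_id("1ed1f847-e992-3fa6-a143-70a1a9cbd0d5")

# A UUID is 16 bytes, that is, 128 bits.
print(len(tamr_id.bytes) * 8)

# Parsing normalizes case, so comparisons are safe even if an export
# writes the same ID in uppercase.
print(tamr_id == parse_tamr_id("1ED1F847-E992-3FA6-A143-70A1A9CBD0D5"))
```

Comparing parsed UUID objects rather than raw strings avoids false mismatches caused by letter case or surrounding braces.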

Example: Persistent Identifiers through the Clustering and Consolidation Process

The following diagram illustrates how source primary key values and Tamr-generated persistent identifiers are created and retained during the clustering and record consolidation process. Note that in the diagram, the ID values are shortened for readability.

  1. Source records: Before being added to Tamr, each source record must have a unique 'primaryKey'.
  2. Create tamr_record_id step: The Create tamr_record_id step in the mastering flow assigns each source record a tamr_record_id to ensure that it has a unique identifier across all source datasets. The tamr_record_id is a 128-bit hash value of the source dataset name and the source primary key.
  3. Apply Clustering and Consolidate Records steps: The Apply Clustering step in the mastering flow groups source records that refer to the same entity into a cluster, and assigns the same persistentId (or Tamr ID) to all source records in the same cluster. In the diagram, the three source records are grouped into the same cluster and are assigned the same persistentId.
  4. Published mastered entity: The mastered entity for each cluster is assigned the same persistentId as the records in the cluster.
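
The four steps above can be sketched in Python. This is an illustration under stated assumptions, not Tamr's implementation: the documentation does not name the hash function, so BLAKE2b with a 16-byte digest stands in here as a 128-bit hash. The dataset names, primary key values, the function make_record_id, and the randomly generated persistentId are all hypothetical.

```python
import hashlib
import uuid


def make_record_id(dataset_name: str, primary_key: str) -> str:
    """Stand-in for tamr_record_id: a 128-bit hash of the source dataset
    name and the source primary key.

    ASSUMPTION: BLAKE2b with a 16-byte digest is used only because it
    yields 128 bits; the actual hash used by Tamr Cloud is not documented here.
    """
    payload = f"{dataset_name}:{primary_key}".encode("utf-8")
    return hashlib.blake2b(payload, digest_size=16).hexdigest()


# Step 1: each source record has a primary key that is unique within its dataset.
source_records = [
    {"dataset": "crm", "primaryKey": "1001"},
    {"dataset": "erp", "primaryKey": "1001"},  # same key, different dataset
    {"dataset": "billing", "primaryKey": "A-7"},
]

# Step 2: hashing the dataset name together with the key makes the record
# identifier unique across all source datasets.
for record in source_records:
    record["tamr_record_id"] = make_record_id(record["dataset"], record["primaryKey"])

# Step 3: all records in one cluster share one persistentId.
# ASSUMPTION: a random UUID is a placeholder for the ID Tamr Cloud assigns.
persistent_id = str(uuid.uuid4())
for record in source_records:
    record["persistentId"] = persistent_id

# Step 4: the mastered entity carries the same persistentId as its cluster.
mastered_entity = {
    "persistentId": persistent_id,
    "sourceRecordCount": len(source_records),
}

for record in source_records:
    print(record["tamr_record_id"], record["persistentId"])
```

Note that the two records with primaryKey 1001 receive different record identifiers because the dataset name is part of the hashed input.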

Additional Fields for Clustering Rules

The following fields store values only for source records to which a clustering rule has been applied. If clustering rules have not been applied, these fields are empty. These fields are available by viewing step output in the Configure Flow page, and in source record tables and exported datasets, as described in the table below.

| Field | Step Output or Exported Dataset | Notes |
| --- | --- | --- |
| ruleClusterId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | The new clusterId for a record after clustering rules are applied. Note: The source record cluster assigned by the model is stored in the suggestedClusterId field. |
| appliedRules | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | For internal use by Tamr; a value in this field indicates that a clustering rule was applied. |
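
Because ruleClusterId is empty unless a clustering rule was applied, exported rows can be split into rule-affected and unaffected records by testing that field. The following Python sketch assumes the export has been loaded into a list of dictionaries keyed by field name; the function name, the sample rows, and the placeholder ID values are hypothetical.

```python
def records_changed_by_rules(rows):
    """Return (entityId, suggestedClusterId, ruleClusterId) for every row
    where a clustering rule assigned a cluster.

    ASSUMPTION: an unset field is exported as an empty string or is missing.
    """
    changed = []
    for row in rows:
        rule_cluster = row.get("ruleClusterId")
        if rule_cluster:  # empty or missing means no rule was applied
            changed.append((row["entityId"], row["suggestedClusterId"], rule_cluster))
    return changed


# Hypothetical rows with placeholder IDs, for illustration only.
rows = [
    {"entityId": "rec-1", "suggestedClusterId": "cluster-a", "ruleClusterId": "cluster-b"},
    {"entityId": "rec-2", "suggestedClusterId": "cluster-a", "ruleClusterId": ""},
    {"entityId": "rec-3", "suggestedClusterId": "cluster-c"},
]

for entity_id, suggested, ruled in records_changed_by_rules(rows):
    print(f"{entity_id}: model suggested {suggested}, rule assigned {ruled}")
```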

Additional Fields for Cluster Overrides

The following fields store values only for source records to which a user has applied cluster overrides. If cluster overrides have not been applied, these fields are empty. These fields are available by viewing step output in the Configure Flow page, and in source record tables and exported datasets, as described in the table below.

| Field | Step Output or Exported Dataset | Notes |
| --- | --- | --- |
| verifiedClusterId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | If a user overrides clusters, this field stores the persistentId of the source record cluster to which the record was moved. Note: The source record cluster assigned by the model is stored in the suggestedClusterId field. |
| verificationType | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | The value in this field indicates how Tamr Cloud applied the override. This field is always set to suggest, meaning that Tamr Cloud applied the override after applying the clustering model. |

Example: Persistent Identifiers Available after Clustering Overrides

In the following example, a user has moved a source record for the A&H Automotive Industries company from the source record cluster suggested by the clustering model into a source record cluster for the A&L Sanchez Painting company. Before the user applied cluster overrides, the A&L Sanchez Painting company source record cluster included two source records.

This example shows the output of the Apply Clustering Model step, filtered to the relevant fields. The same fields are also included in the source record tables and in the Source Records by Entities exported dataset.

Persistent identifier example

  1. suggestedClusterId: This field provides the persistent identifier for the source record cluster created by the clustering model. The clustering model assigned the A&H Automotive Industries record to a different source record cluster than the records for A&L Sanchez Painting and Construction Company:
    • For A&H Automotive Industries, the suggestedClusterId is e6ef4554-21a8-38d6-96b7-423d87640455.
    • For the two A&L Sanchez Painting and Construction Company records, the suggestedClusterId is 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
  2. verificationType: This field indicates how Tamr Cloud applied the cluster override. This field is always set to suggest, meaning that Tamr Cloud applied the override after applying the clustering model.
  3. verifiedClusterId: This field provides the persistent identifier for the source record cluster after overrides were applied. The records for A&H Automotive Industries and A&L Sanchez Painting and Construction have been assigned to the same source record cluster through cluster overrides: 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
  4. entityId: This field provides the primary key value for each record from the input dataset.
  5. persistentId: This field provides the final persistent identifier for all records in the source record cluster. Note that the field value is the same as the value of the verifiedClusterId: 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
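
The relationship among these fields can be checked with a short Python sketch. The rows below are reconstructed from the description in this example rather than copied from an actual export: the two cluster IDs are the ones quoted above, while the entityId values and the function name are hypothetical. The sketch deliberately ignores ruleClusterId, because this example involves overrides only.

```python
A_H_CLUSTER = "e6ef4554-21a8-38d6-96b7-423d87640455"
A_L_CLUSTER = "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5"

# One moved A&H record plus the two original A&L records.
# entityId values are hypothetical placeholders.
rows = [
    {"entityId": "ah-1", "suggestedClusterId": A_H_CLUSTER,
     "verificationType": "suggest", "verifiedClusterId": A_L_CLUSTER,
     "persistentId": A_L_CLUSTER},
    {"entityId": "al-1", "suggestedClusterId": A_L_CLUSTER,
     "verificationType": "suggest", "verifiedClusterId": A_L_CLUSTER,
     "persistentId": A_L_CLUSTER},
    {"entityId": "al-2", "suggestedClusterId": A_L_CLUSTER,
     "verificationType": "suggest", "verifiedClusterId": A_L_CLUSTER,
     "persistentId": A_L_CLUSTER},
]


def expected_persistent_id(row):
    """Use the override cluster when present, else the model's suggestion."""
    return row.get("verifiedClusterId") or row["suggestedClusterId"]


# Every row's persistentId should match the resolved cluster.
consistent = all(expected_persistent_id(r) == r["persistentId"] for r in rows)

# Records whose override differs from the model's suggestion were moved by a user.
moved = [r["entityId"] for r in rows
         if r.get("verifiedClusterId")
         and r["verifiedClusterId"] != r["suggestedClusterId"]]

print(consistent, moved)
```

Comparing suggestedClusterId with verifiedClusterId is a simple way to list the records a user reassigned, since the model's original suggestion is preserved alongside the override.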