Persistent Identifier Fields in Tamr Cloud

When you add a source dataset to Tamr Cloud, the source dataset must include a field for the unique primary key. During the mastering process, Tamr Cloud also assigns a unique, persistent identifier to each business entity. The same identifier is added to each source record in a cluster and to the entity.

Primary Key Fields from Source Datasets

The following field stores the unique primary key from the source datasets. This field is available in step output in Designer, in entity source record tables in the Tamr Cloud UI, and in published datasets, as described in the table below.

| Field | Step Output or Published Dataset | Notes |
|---|---|---|
| entityId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables in Studio and Curator | Unlike the entity_id field in the Consolidate Records step output, which stores the Tamr-generated persistent identifier for entities and source record clusters, this field stores the original unique primary key value from the input dataset. If the entity type template includes the Create tamr_record_id transformation step, this field value is the concatenation of the source dataset name and the unique key from the source record (datasetname_uniqueKeyValue). |
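
The datasetname_uniqueKeyValue format described above can be illustrated with a minimal Python sketch. The function name and the sample values are hypothetical and are not part of Tamr Cloud; the sketch only mirrors the underscore-joined format shown in the notes.

```python
def make_record_key(dataset_name: str, unique_key_value: str) -> str:
    """Join a source dataset name and a source primary key with an underscore.

    Hypothetical helper: it reproduces the datasetname_uniqueKeyValue shape
    described in the documentation, not Tamr Cloud's own implementation.
    """
    return f"{dataset_name}_{unique_key_value}"


# Hypothetical dataset name and key value, for illustration only.
example_key = make_record_key("crm", "1001")
```

Because the dataset name itself can contain underscores, splitting such a value back into its two parts is not reliable in general; treat the combined value as an opaque key.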

Persistent Identifiers Assigned by Tamr Cloud

The following fields store the same Tamr-generated identifier for a mastered entity and its clustered source records. These fields are available in step output in Designer, in entity source record tables in the Tamr Cloud UI, and in published datasets, as described in the table below.

| Field | Step Output or Published Dataset | Notes |
|---|---|---|
| clusterId1 | Entities by Similarity Dataset | This is the first entity in the pair of entities being compared for similarity. See Available Published Datasets for more information. |
| clusterId2 | Entities by Similarity Dataset | This is the second entity in the pair of entities being compared for similarity. See Available Published Datasets for more information. |
| entity_id | Consolidate Records Step | |
| Entity ID | Entities Dataset | This is the default field name of the entity_id output field as configured in the Deliver to Studio step. Depending on your step configuration, this field might have a different name. |
| persistentId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables in Studio and Curator | |
| suggestedClusterId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables in Studio and Curator | This is the persistent ID assigned by the clustering model. If a user overrides source record clusters in Curator, the persistent ID for the new cluster is available in the verifiedClusterId field. |
| tamr_id | Consolidate Records Step | |
| Tamr_ID | Entities Dataset | This is the default field name of the tamr_id output field as configured in the Deliver to Studio step. Depending on your step configuration, this field might have a different name. |
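
Because the same identifier is written to the entity and to each of its source records, rows exported from the Source Records by Entities dataset can be grouped by persistentId to recover each cluster. The following Python sketch assumes the rows have already been loaded as dictionaries; the function name and the entityId values are hypothetical, while the two identifiers are taken from the example later on this page.

```python
from collections import defaultdict


def group_by_persistent_id(source_records, id_field="persistentId"):
    """Map each persistent identifier to the source primary keys in its cluster."""
    clusters = defaultdict(list)
    for record in source_records:
        clusters[record[id_field]].append(record["entityId"])
    return dict(clusters)


# Hypothetical rows; only the identifier values come from the documentation.
source_records = [
    {"entityId": "crm_1001", "persistentId": "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5"},
    {"entityId": "erp_77", "persistentId": "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5"},
    {"entityId": "crm_2002", "persistentId": "e6ef4554-21a8-38d6-96b7-423d87640455"},
]

clusters = group_by_persistent_id(source_records)
```

To look up the matching entity, compare each key against the Entity ID column of the Entities dataset, or against whatever name your Deliver to Studio step assigns to that field.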

Additional Fields for Source Record Overrides

The following fields store values only for source records to which a curator has applied cluster overrides. If source record overrides have not been applied, these fields are empty. These fields are available in step output in Designer, in source record tables in Curator and Studio, and in published datasets, as described in the table below.

| Field | Step Output or Published Dataset | Notes |
|---|---|---|
| verifiedClusterId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables in Studio and Curator | If a user overrides source record clusters in Curator, this field stores the persistentId of the cluster to which the record was moved. Note: The cluster assigned by the model is stored in the suggestedClusterId field. |
| verificationType | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables in Studio and Curator | The value in this field indicates how Tamr applied the override. This field is always set to suggest, meaning that Tamr applied the override after applying the clustering model. |
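
Since these fields are populated only for overridden records, a non-empty verifiedClusterId can serve as a simple filter for curator overrides. The sketch below is a minimal illustration in Python. It assumes that an "empty" field arrives as None, an empty string, or a missing key, which depends on how you export the data; the function name and sample rows are hypothetical.

```python
def overridden_records(records):
    """Return only the records that carry a cluster override.

    Assumption: an empty verifiedClusterId is None, "", or absent.
    """
    return [r for r in records if r.get("verifiedClusterId")]


# Hypothetical rows for illustration.
records = [
    {"entityId": "a", "verifiedClusterId": "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5"},
    {"entityId": "b", "verifiedClusterId": ""},
    {"entityId": "c", "verifiedClusterId": None},
    {"entityId": "d"},
]

overridden = overridden_records(records)
```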

Example: Persistent Identifiers Available after Clustering

In the following example, a curator has moved a source record for the A&H Automotive Industries company from the source record cluster suggested by the clustering model into a source record cluster for the A&L Sanchez Painting company. Before the curator applied source record overrides, the A&L Sanchez Painting company cluster included two source records.

This example shows the output of the Apply Clustering Model step, filtered to the relevant fields. The same fields are also included in the entity source record tables in Studio and Curator, and in the Source Records by Entities published dataset.

Figure: Persistent identifier example

  1. suggestedClusterId: This field provides the persistent identifier for the source record cluster created by the clustering model. The clustering model assigned the A&H Automotive Industries record to a different cluster than the records for A&L Sanchez Painting and Construction Company:
    • For A&H Automotive Industries, the suggestedClusterId is e6ef4554-21a8-38d6-96b7-423d87640455.
    • For the two A&L Sanchez Painting and Construction Company records, the suggestedClusterId is 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
  2. verificationType: This field indicates how Tamr Cloud applied the source record override. This field is always set to suggest, meaning that Tamr Cloud applied the override after applying the clustering model.
  3. verifiedClusterId: This field provides the persistent identifier for the source record cluster after overrides were applied. The records for A&H Automotive Industries and A&L Sanchez Painting and Construction have been assigned to the same cluster through source record overrides: 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
  4. entityId: This field provides the primary key value for each record from the input dataset.
  5. persistentId: This field provides the final persistent identifier for all records in the source record cluster. Note that the field value is the same as the value of the verifiedClusterId: 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
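
The relationship among the fields in this example can be checked with a short Python sketch. The identifier values are the ones listed above; the entityId values and function names are hypothetical. The fallback to suggestedClusterId when no override exists is an assumption made for this sketch and does not describe how Tamr Cloud computes persistentId.

```python
AH_SUGGESTED = "e6ef4554-21a8-38d6-96b7-423d87640455"
AL_CLUSTER = "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5"

# Three rows modeled on the example; entityId values are made up.
rows = [
    {"entityId": "ah-1", "suggestedClusterId": AH_SUGGESTED,
     "verifiedClusterId": AL_CLUSTER, "persistentId": AL_CLUSTER},
    {"entityId": "al-1", "suggestedClusterId": AL_CLUSTER,
     "verifiedClusterId": AL_CLUSTER, "persistentId": AL_CLUSTER},
    {"entityId": "al-2", "suggestedClusterId": AL_CLUSTER,
     "verifiedClusterId": AL_CLUSTER, "persistentId": AL_CLUSTER},
]


def effective_cluster_id(row):
    """Prefer the override cluster; otherwise use the model's cluster (assumed)."""
    return row.get("verifiedClusterId") or row["suggestedClusterId"]


def moved_by_curator(row):
    """True when an override placed the record in a different cluster than the model did."""
    verified = row.get("verifiedClusterId")
    return bool(verified) and verified != row["suggestedClusterId"]


# Every row's persistentId should agree with its effective cluster.
consistent = all(effective_cluster_id(r) == r["persistentId"] for r in rows)
# Only the A&H Automotive Industries record changed clusters.
moved = [r["entityId"] for r in rows if moved_by_curator(r)]
```

A check like this can be useful when validating an export: any row where persistentId disagrees with the override cluster is worth a closer look.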