Persistent Identifier Fields in Tamr Cloud

Tamr Cloud assigns a unique, persistent identifier to each entity. The same identifier is added to each source record in a cluster and to the mastered entity record.

When you add a source dataset to Tamr Cloud, the source dataset must include a field for the unique primary key. During the mastering process, Tamr Cloud also assigns a unique, persistent identifier to each business entity. The same identifier is added to each source record in a source record cluster and to the entity.

Primary Key Fields from Source Datasets

The following field stores the unique primary key from the source datasets. You can view this field in step output in Designer, in source record tables in the Tamr Cloud UI, and in published datasets, as described in the table below.

| Field | Step Output or Published Dataset | Notes |
| --- | --- | --- |
| entityId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables in Studio and Curator | Unlike the entity_id field in the Consolidate Records step output, which stores the Tamr Cloud-generated persistent identifier for entities and source record clusters, this field stores the original unique primary key value from the input dataset. If the data product template includes the Create tamr_record_id transformation step, this field value is the concatenation of the source dataset name and the unique key from the source record (datasetname_uniqueKeyValue). |
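As a minimal sketch of the datasetname_uniqueKeyValue pattern described above, the following Python function joins a dataset name and a key with an underscore. The function name, dataset name, and key value are hypothetical and only illustrate the pattern; this is not Tamr Cloud's implementation of the Create tamr_record_id step.

```python
def make_tamr_record_id(dataset_name: str, unique_key: str) -> str:
    """Join a source dataset name and a record's primary key with an underscore."""
    return f"{dataset_name}_{unique_key}"


# Hypothetical dataset name and key value, for illustration only.
record_id = make_tamr_record_id("crm_accounts", "10045")
print(record_id)  # crm_accounts_10045
```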

Persistent Identifiers Assigned by Tamr Cloud

The following fields store the same Tamr Cloud-generated identifier for a mastered entity record and its clustered source records. The Tamr Cloud-generated identifier is a 128-bit universally unique identifier (UUID).
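Because the identifier is a UUID, a downstream consumer can sanity-check values before using them as keys. The sketch below uses Python's standard uuid module. The helper name is_uuid is hypothetical, and the sample value is taken from the example later on this page.

```python
import uuid


def is_uuid(value: str) -> bool:
    """Return True if the string parses as a UUID, False otherwise."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


sample = "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5"
parsed = uuid.UUID(sample)

print(is_uuid(sample))        # True
print(len(parsed.bytes) * 8)  # 128
print(is_uuid("not-an-id"))   # False
```

Note that uuid.UUID also accepts some alternate spellings, such as a value without hyphens, so this check is looser than a strict format match.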

You can view these fields in step output in Designer, in source record tables in the Tamr Cloud UI, and in published datasets, as described in the table below.

| Field | Step Output or Published Dataset | Notes |
| --- | --- | --- |
| clusterId1 | Entities by Similarity Dataset | This is the first entity in the pair of entities being compared for similarity. See Available Published Datasets for more information. |
| clusterId2 | Entities by Similarity Dataset | This is the second entity in the pair of entities being compared for similarity. See Available Published Datasets for more information. |
| entity_id | Consolidate Records Step | |
| Entity ID | Entities Dataset | This is the default field name of the entity_id output field as configured in the Deliver to Studio step. Depending on your step configuration, this field might have a different name. |
| persistentId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables in Studio and Curator | |
| suggestedClusterId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables in Studio and Curator | This is the persistent ID assigned by the clustering model. If a user overrides clusters in Curator, the persistent ID for the new source record cluster is available in the verifiedClusterId field. |
| tamr_id | Consolidate Records Step | |
| Tamr_ID | Entities Dataset | This is the default field name of the tamr_id output field as configured in the Deliver to Studio step. Depending on your step configuration, this field might have a different name. |
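Because the entity record and its source records carry the same identifier, the two can be linked on that value. The following sketch assumes the datasets have already been exported and loaded as lists of dictionaries. The rows, the name field, and the src-00x primary keys are hypothetical; only the identifier field names (Entity ID, persistentId, entityId) come from this page.

```python
from collections import defaultdict

# Hypothetical rows from the Entities dataset.
entities = [
    {"Entity ID": "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5", "name": "A&L Sanchez Painting"},
]

# Hypothetical rows from the Source Records by Entities dataset.
source_records = [
    {"entityId": "src-001", "persistentId": "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5"},
    {"entityId": "src-002", "persistentId": "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5"},
    {"entityId": "src-003", "persistentId": "e6ef4554-21a8-38d6-96b7-423d87640455"},
]


def group_by_persistent_id(records):
    """Map each persistentId to the primary keys of its source records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["persistentId"]].append(record["entityId"])
    return dict(grouped)


clusters = group_by_persistent_id(source_records)
for entity in entities:
    # Look up the source records that share the entity's identifier.
    members = clusters.get(entity["Entity ID"], [])
    print(entity["name"], members)
```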

Additional Fields for Cluster Overrides

The following fields store values only for source records to which a curator has applied cluster overrides. If cluster overrides have not been applied, these fields are empty. You can view these fields in step output in Designer, in source record tables in Curator and Studio, and in published datasets, as described in the table below.

| Field | Step Output or Published Dataset | Notes |
| --- | --- | --- |
| verifiedClusterId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables in Studio and Curator | If a user overrides clusters in Curator, this field stores the persistentId of the source record cluster to which the record was moved. Note: The source record cluster assigned by the model is stored in the suggestedClusterId field. |
| verificationType | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables in Studio and Curator | The value in this field indicates how Tamr Cloud applied the override. This field is always set to suggest, meaning that Tamr Cloud applied the override after applying the clustering model. |
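Since these fields are empty unless an override was applied, a populated verifiedClusterId can serve as a filter for overridden records. The sketch below assumes the records are loaded as dictionaries and that an empty field is represented as None or an empty string. The sample rows and primary keys are hypothetical.

```python
def overridden_records(records):
    """Return the records whose verifiedClusterId is populated."""
    return [r for r in records if r.get("verifiedClusterId")]


# Hypothetical rows; empty override fields are represented as None.
rows = [
    {"entityId": "src-001", "verifiedClusterId": None, "verificationType": None},
    {"entityId": "src-002", "verifiedClusterId": "", "verificationType": None},
    {
        "entityId": "src-003",
        "verifiedClusterId": "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5",
        "verificationType": "suggest",
    },
]

moved = overridden_records(rows)
print([r["entityId"] for r in moved])  # ['src-003']
```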

Example: Persistent Identifiers Available after Clustering

In the following example, a curator has moved a source record for the A&H Automotive Industries company from the source record cluster suggested by the clustering model into a source record cluster for the A&L Sanchez Painting company. Before the curator applied cluster overrides, the A&L Sanchez Painting company source record cluster included two source records.

This example shows the output of the Apply Clustering Model step, filtered to the relevant fields. The same fields are also included in the source record tables in Studio and Curator, and in the Source Records by Entities published dataset.

Figure: Persistent identifier example

  1. suggestedClusterId: This field provides the persistent identifier for the source record cluster created by the clustering model. The clustering model assigned the A&H Automotive Industries record to a different source record cluster than the records for A&L Sanchez Painting and Construction Company:
    • For A&H Automotive Industries, the suggestedClusterId is e6ef4554-21a8-38d6-96b7-423d87640455.
    • For the two A&L Sanchez Painting and Construction Company records, the suggestedClusterId is 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
  2. verificationType: This field indicates how Tamr Cloud applied the cluster override. This field is always set to suggest, meaning that Tamr Cloud applied the override after applying the clustering model.
  3. verifiedClusterId: This field provides the persistent identifier for the source record cluster after overrides were applied. The records for A&H Automotive Industries and A&L Sanchez Painting and Construction have been assigned to the same source record cluster through cluster overrides: 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
  4. entityId: This field provides the primary key value for each record from the input dataset.
  5. persistentId: This field provides the final persistent identifier for all records in the source record cluster. Note that the field value is the same as the value of the verifiedClusterId: 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
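The relationship among the fields in this example can be sketched in Python. The rule used here (take verifiedClusterId when it is populated, otherwise suggestedClusterId) is an inference from this example, not a documented Tamr Cloud algorithm. The primary keys key-1 to key-3 are hypothetical, and the sketch assumes that the two records that were not moved have an empty verifiedClusterId.

```python
A_AND_L_CLUSTER = "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5"
A_AND_H_CLUSTER = "e6ef4554-21a8-38d6-96b7-423d87640455"

# Hypothetical primary keys; cluster IDs are the ones from the example.
records = [
    {"entityId": "key-1", "suggestedClusterId": A_AND_L_CLUSTER, "verifiedClusterId": None},
    {"entityId": "key-2", "suggestedClusterId": A_AND_L_CLUSTER, "verifiedClusterId": None},
    # The A&H Automotive Industries record, moved by the curator.
    {"entityId": "key-3", "suggestedClusterId": A_AND_H_CLUSTER, "verifiedClusterId": A_AND_L_CLUSTER},
]


def resolve_persistent_id(record):
    """Assumed rule: a curator override takes precedence over the model's suggestion."""
    return record["verifiedClusterId"] or record["suggestedClusterId"]


resolved = {r["entityId"]: resolve_persistent_id(r) for r in records}
for key, persistent_id in resolved.items():
    print(key, persistent_id)
```

Under these assumptions, all three records resolve to the same identifier, which matches the persistentId shown in the example.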