About Persistent Identifier Fields for Legacy Data Products
Tamr Cloud assigns a unique, persistent identifier to each entity. The same identifier is added to each source record in a cluster and to the mastered entity record.
When you add a source dataset to Tamr Cloud, the source dataset must include a field for the unique primary key. During the mastering process, Tamr Cloud also assigns a unique, persistent identifier to each business entity. The same identifier is added to each source record in a source record cluster and to the entity.
Primary Key Fields from Source Datasets
The following field stores the unique primary key from the source datasets. This field is available in step output on the Configure Flow page, in source record tables in the Tamr Cloud UI, and in exported datasets, as described in the table below.
Field | Step Output or Exported Dataset | Notes |
---|---|---|
entityId | Apply Clustering Model Step; Source Records by Entities Dataset; Source Record tables | This is the unique identifier for the source record, generated by Tamr. This generated identifier is the tamr_record_id, which is a 128-bit hash value of the source dataset name and the source primary key. Note: This is not the same as the entity_id field in the Consolidate Records step output, which stores the Tamr ID for the clustered source records and their mastered entities. |
Persistent Identifiers Assigned by Tamr Cloud
The following fields store the Tamr ID, which is the Tamr Cloud-generated identifier for a mastered entity record and its clustered source records. The Tamr ID is a 128-bit universally unique identifier (UUID).
These fields are available in the step output in the Configure Flow page, source record tables in the Tamr Cloud UI, and in published datasets, as described in the table below.
Field | Step Output or Published Dataset | Notes |
---|---|---|
clusterId1 | Entities by Similarity Dataset | This is the first entity in the pair of entities being compared for similarity. See Datasets Available for Export for more information. |
clusterId2 | Entities by Similarity Dataset | This is the second entity in the pair of entities being compared for similarity. See Datasets Available for Export for more information. |
entity_id | Consolidate Records Step | |
Entity ID | Entities Dataset | This is the default field name of the entity_id output field as configured in the Configure Attributes step. Depending on your step configuration, this field might have a different name. |
persistentId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | |
suggestedClusterId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | This is the persistent ID assigned by the clustering model. If a user overrides clusters, the persistent ID for the new source record cluster is available in the verifiedClusterId field. |
tamr_id | Consolidate Records Step | |
Tamr ID | Entities Dataset | This is the default field name of the tamr_id output field as configured in the Configure Attributes step. Depending on your step configuration, this field might have a different name. |
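Because the Tamr ID is described as a 128-bit UUID, a consumer of published datasets might check that a value is well formed before using it as a join key. The following is a minimal sketch using only the Python standard library; the function name parse_tamr_id is hypothetical and is not part of any Tamr API. The sample value is taken from the override example later on this page.

```python
import uuid


def parse_tamr_id(value: str) -> uuid.UUID:
    """Parse a Tamr ID string; raises ValueError if it is not a well-formed UUID."""
    return uuid.UUID(value)


tamr_id = parse_tamr_id("1ed1f847-e992-3fa6-a143-70a1a9cbd0d5")

# A UUID is 16 bytes, that is, 128 bits.
print(len(tamr_id.bytes) * 8)
```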
Example: Persistent Identifiers through the Clustering and Consolidation Process
The following diagram illustrates how source primary key values and Tamr-generated persistent identifiers are created and retained during the clustering and record consolidation process. Note that in the diagram, the ID values are shortened for readability.
- Source records: Before being added to Tamr, each source record must have a unique 'primaryKey'.
- Create tamr_record_id step: The Create tamr_record_id step in the mastering flow assigns each source record a tamr_record_id to ensure that it has a unique identifier across all source datasets. The tamr_record_id is a 128-bit hash value of the source dataset name and the source primary key.
- Apply Clustering and Consolidate Records steps: The Apply Clustering step in the mastering flow groups source records that refer to the same entity into a cluster, and assigns the same persistentId (or Tamr ID) to all source records in the same cluster. In the image above, the three source records are grouped into the same cluster and are assigned the same persistentId.
- Published mastered entity: The mastered entity for each cluster is assigned the same persistentId as the records in the cluster.
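The flow of identifiers described above can be sketched in a few lines of Python. This is an illustration only: the page does not state which hash function or input encoding Tamr uses, so BLAKE2b with a 16-byte digest and a unit-separator character between the two inputs are assumptions chosen to produce a 128-bit value. The dataset names, primary keys, and the function name make_record_id are hypothetical, and a random UUID stands in for the persistentId that the clustering step would assign.

```python
import hashlib
import uuid


def make_record_id(dataset_name: str, primary_key: str) -> str:
    """Return a 128-bit hex digest derived from the dataset name and primary key.

    ASSUMPTION: BLAKE2b and the unit-separator delimiter are stand-ins;
    Tamr's actual hash function and input format are not documented here.
    """
    payload = f"{dataset_name}\x1f{primary_key}".encode("utf-8")
    return hashlib.blake2b(payload, digest_size=16).hexdigest()


# Hypothetical source records as (dataset name, primaryKey) pairs.
# Two datasets reuse the key "1001", yet the record IDs still differ
# because the dataset name is part of the hashed input.
source_records = [("crm", "1001"), ("erp", "1001"), ("web", "A-7")]
record_ids = [make_record_id(ds, pk) for ds, pk in source_records]

# Clustering itself is not modeled. All three records are treated as one
# cluster, so they share one persistentId (a random UUID as a placeholder).
persistent_id = str(uuid.uuid4())
clustered = [
    {"tamr_record_id": rid, "persistentId": persistent_id} for rid in record_ids
]

# The mastered entity carries the same persistentId as its source records.
mastered_entity = {"persistentId": persistent_id}
```

Including the dataset name in the hashed input is what keeps identical primary keys from different source datasets from colliding.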
Additional Fields for Clustering Rules
The following fields store values only for source records to which a clustering rule has been applied. If clustering rules have not been applied, these fields are empty. These fields are available in step output on the Configure Flow page, and in source record tables and exported datasets, as described in the table below.
Field | Step Output or Exported Dataset | Notes |
---|---|---|
ruleClusterId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | The new clusterId for a record after clustering rules are applied. Note: The source record cluster assigned by the model is stored in the suggestedClusterId field. |
appliedRules | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | For internal use by Tamr; a value in this field indicates that a clustering rule was applied. |
Additional Fields for Cluster Overrides
The following fields store values only for source records to which a user has applied cluster overrides. If cluster overrides have not been applied, these fields are empty. These fields are available by viewing step output in the Configure Flow page, and in source record tables and exported datasets, as described in the table below.
Field | Step Output or Exported Dataset | Notes |
---|---|---|
verifiedClusterId | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | If a user overrides clusters, this field stores the persistentId of the source record cluster to which the record was moved. Note: The source record cluster assigned by the model is stored in the suggestedClusterId field. |
verificationType | Apply Clustering Model Step; Source Records by Entities Dataset; Entity Source Record tables | The value in this field indicates how Tamr Cloud applied the override. This field is always set to suggest, meaning that Tamr Cloud applied the override after applying the clustering model. |
Example: Persistent Identifiers Available after Clustering Overrides
In the following example, a user has moved a source record for the A&H Automotive Industries company from the source record cluster suggested by the clustering model into a source record cluster for the A&L Sanchez Painting company. Before the user applied cluster overrides, the A&L Sanchez Painting company source record cluster included two source records.
This example shows the output of the Apply Clustering Model step, filtered to the relevant fields. The same fields are also included in the source record tables and in the Source Records by Entities exported dataset.
- suggestedClusterId: This field provides the persistent identifier for the source record cluster created by the clustering model. The clustering model assigned the A&H Automotive Industries record to a different source record cluster than the records for A&L Sanchez Painting and Construction Company:
  - For A&H Automotive Industries, the suggestedClusterId is e6ef4554-21a8-38d6-96b7-423d87640455.
  - For the two A&L Sanchez Painting and Construction Company records, the suggestedClusterId is 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
- verificationType: This field indicates how Tamr Cloud applied the cluster override. This field is always set to suggest, meaning that Tamr Cloud applied the override after applying the clustering model.
- verifiedClusterId: This field provides the persistent identifier for the source record cluster after overrides were applied. The records for A&H Automotive Industries and A&L Sanchez Painting and Construction have been assigned to the same source record cluster through cluster overrides: 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
- entityId: This field provides the primary key value for each record from the input dataset.
- persistentId: This field provides the final persistent identifier for all records in the source record cluster. Note that the field value is the same as the value of the verifiedClusterId: 1ed1f847-e992-3fa6-a143-70a1a9cbd0d5.
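The relationship between these fields in the example can be sketched as a small resolution rule: when a verifiedClusterId is present it becomes the persistentId, otherwise the suggestedClusterId is used. This is a sketch of the behavior shown in this example, not Tamr's implementation. The function name resolve_persistent_id and the record labels are hypothetical, clustering rules (ruleClusterId) are deliberately left out because their precedence is not described here, and leaving verifiedClusterId empty for the two unmoved records is an assumption.

```python
from typing import Optional

AH_SUGGESTED = "e6ef4554-21a8-38d6-96b7-423d87640455"
AL_CLUSTER = "1ed1f847-e992-3fa6-a143-70a1a9cbd0d5"


def resolve_persistent_id(suggested: str, verified: Optional[str]) -> str:
    """Prefer the override target when one exists, else the model's suggestion."""
    return verified if verified else suggested


# Cluster IDs come from the example above; the labels are placeholders.
# ASSUMPTION: the two A&L records were not moved, so their
# verifiedClusterId is modeled as empty (None).
rows = [
    {"label": "A&H record", "suggestedClusterId": AH_SUGGESTED,
     "verifiedClusterId": AL_CLUSTER},
    {"label": "A&L record 1", "suggestedClusterId": AL_CLUSTER,
     "verifiedClusterId": None},
    {"label": "A&L record 2", "suggestedClusterId": AL_CLUSTER,
     "verifiedClusterId": None},
]

for row in rows:
    row["persistentId"] = resolve_persistent_id(
        row["suggestedClusterId"], row["verifiedClusterId"]
    )
    print(row["label"], row["persistentId"])
```

Under these assumptions, all three rows resolve to the A&L cluster identifier, matching the persistentId value described in the example.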
Updated 4 months ago