Datasets Available to Publish for Legacy Data Products
You can publish several different datasets for each data product, including mastered entities, source records by cluster, and cluster similarity details.
When you create a publish destination, you select one or more datasets to publish to that destination: Mastered Entities, Source Records by Cluster, or Cluster by Similarity.
Mastered Entities Dataset
Published Dataset Name: tamr
This dataset includes mastered entities.
In this dataset, the Tamr ID for each mastered entity (and its clustered source records) is stored in both the Tamr_ID and Entity_Id fields.
If you have applied any attribute value overrides, those changes are included in this published dataset. See Editing Mastered Entity Attribute Values for more information on attribute overrides.
Source Records by Cluster
Published Dataset Name: cluster_details
This dataset includes records with their cluster (entity) IDs, and is the output of the Clustering step in the mastering flow.
In this dataset, the Tamr ID for the record's cluster (and related mastered entity) is stored in both the persistentId and suggestedClusterId fields.
If you have made any record overrides and re-run the mastering flow to apply those changes, the changes are included in this published dataset. See Modifying Source Record Clusters for more information on record overrides.
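As an illustration of how the cluster ID in this dataset can be used, the following minimal sketch groups source records by cluster and counts the records in each. It assumes the published rows have already been loaded as Python dictionaries. Only the persistentId and suggestedClusterId field names come from this page; the recordId field and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows from the cluster_details dataset. Only persistentId and
# suggestedClusterId are documented field names; recordId and all values
# are made up for illustration.
cluster_details = [
    {"recordId": "r1", "persistentId": "e-100", "suggestedClusterId": "e-100"},
    {"recordId": "r2", "persistentId": "e-100", "suggestedClusterId": "e-100"},
    {"recordId": "r3", "persistentId": "e-200", "suggestedClusterId": "e-200"},
]


def group_by_cluster(rows):
    """Map each cluster ID (persistentId) to the list of its source rows."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["persistentId"]].append(row)
    return dict(clusters)


records_by_cluster = group_by_cluster(cluster_details)

# Number of source records in each cluster.
cluster_sizes = {cid: len(rows) for cid, rows in records_by_cluster.items()}
print(cluster_sizes)
```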
Cluster by Similarity
Published Dataset Name: cluster_similarities
Cluster, or entity, similarity refers to the similarity between the clustered source records for a pair of entities. The similarity can range from 0 to 1; during the mastering process, Tamr Cloud automatically merges any entities whose similarity is greater than 0.5.
The published dataset includes similarity metrics for pairs of entities whose cluster similarity is greater than 0 and less than 0.5.
This dataset can help you identify the most important entities to review. By reviewing entities whose similarity scores are close to 0.5, you may identify entities that should be merged or records that should be moved between the entities.
In this dataset, the Tamr IDs for the compared clusters are stored in the clusterId1 and clusterId2 fields.
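One way to build a review queue from this dataset is to keep the pairs whose score is nearest the 0.5 merge threshold and sort them so the closest come first. The sketch below assumes the rows are loaded as Python dictionaries. The similarity field name, the min_similarity cutoff, and all values are hypothetical; this page documents only the clusterId1 and clusterId2 fields.

```python
# Hypothetical rows from the cluster_similarities dataset. clusterId1 and
# clusterId2 are documented; the "similarity" field name and all values
# are assumptions made for this example.
cluster_similarities = [
    {"clusterId1": "e-100", "clusterId2": "e-200", "similarity": 0.12},
    {"clusterId1": "e-100", "clusterId2": "e-300", "similarity": 0.47},
    {"clusterId1": "e-200", "clusterId2": "e-400", "similarity": 0.35},
]

# Entities above this similarity are merged automatically, per the text above.
MERGE_THRESHOLD = 0.5


def review_candidates(rows, min_similarity=0.3):
    """Return pairs scoring at least min_similarity, nearest the threshold first.

    min_similarity is an arbitrary cutoff chosen for illustration.
    """
    picked = [
        row for row in rows
        if min_similarity <= row["similarity"] < MERGE_THRESHOLD
    ]
    return sorted(picked, key=lambda row: MERGE_THRESHOLD - row["similarity"])


candidates = review_candidates(cluster_similarities)
for pair in candidates:
    print(pair["clusterId1"], pair["clusterId2"], pair["similarity"])
```

Raising or lowering min_similarity trades a shorter review list against the risk of missing a pair that should be merged.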
Joining Records across Published Datasets
The following fields store the entity/cluster ID, which you can use to join records across two or more datasets:
- Mastered Entities dataset: Tamr_ID and Entity_Id fields.
- Source Records by Cluster: persistentId and suggestedClusterId fields.
- Cluster by Similarity: clusterId1 and clusterId2 fields.
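The joins described above can be sketched in plain Python, assuming each published dataset has been loaded as a list of dictionaries. The ID field names come from this page; the name and recordId fields, the helper function names, and all values are hypothetical.

```python
# Hypothetical rows. Only the ID field names (Tamr_ID, Entity_Id,
# persistentId, suggestedClusterId, clusterId1, clusterId2) are documented;
# every other field and value is made up for illustration.
mastered_entities = [
    {"Tamr_ID": "e-100", "Entity_Id": "e-100", "name": "Acme Corp"},
    {"Tamr_ID": "e-200", "Entity_Id": "e-200", "name": "Globex"},
]
cluster_details = [
    {"recordId": "r1", "persistentId": "e-100", "suggestedClusterId": "e-100"},
    {"recordId": "r2", "persistentId": "e-100", "suggestedClusterId": "e-100"},
    {"recordId": "r3", "persistentId": "e-200", "suggestedClusterId": "e-200"},
]
cluster_similarities = [
    {"clusterId1": "e-100", "clusterId2": "e-200"},
]


def join_records_to_entities(entities, records):
    """Inner-join source records to mastered entities on the cluster ID."""
    entity_by_id = {entity["Tamr_ID"]: entity for entity in entities}
    joined = []
    for record in records:
        entity = entity_by_id.get(record["persistentId"])
        if entity is not None:
            joined.append({**record, "entity_name": entity["name"]})
    return joined


def label_similarity_pairs(entities, pairs):
    """Attach the entity name for each side of a compared pair."""
    name_by_id = {entity["Tamr_ID"]: entity["name"] for entity in entities}
    return [
        {
            **pair,
            "name1": name_by_id.get(pair["clusterId1"]),
            "name2": name_by_id.get(pair["clusterId2"]),
        }
        for pair in pairs
    ]


records_with_entities = join_records_to_entities(mastered_entities, cluster_details)
labeled_pairs = label_similarity_pairs(mastered_entities, cluster_similarities)
```

The same joins could be written in SQL or a dataframe library at the publish destination. The dictionary lookup here only makes the join key explicit.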