Available Published Datasets
You can publish several different datasets for each data product, including mastered entity records, source records by cluster, and cluster similarity details.
When you create a publish destination, you select one or more datasets to publish for that destination: Entities, Source Records by Entity, or Entities by Similarity.
In the published dataset name, Tamr Cloud converts spaces and hyphens in the data product name to underscores.
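The conversion rule above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not Tamr Cloud's implementation: the function name `to_published_name_fragment` is hypothetical, and the sketch assumes that each space or hyphen becomes one underscore and that all other characters (including letter case) are left unchanged.

```python
import re


def to_published_name_fragment(data_product_name: str) -> str:
    """Replace every space and hyphen with an underscore.

    Assumption: one underscore per replaced character, and no other
    normalization (such as lowercasing) is applied.
    """
    return re.sub(r"[ -]", "_", data_product_name)


# Hypothetical data product name used only for illustration.
print(to_published_name_fragment("Customer Accounts-EU"))
```

A helper like this can be useful for predicting the name to look for in the destination before the first publish completes.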
Entities Dataset
Published Dataset Name: tamr_<entity_type_name>
This dataset includes mastered entity records.
If you have applied any field value overrides in Curator, those changes are included in this published dataset. See Editing Field Values for more information on field overrides.
Source Records by Entity
Published Dataset Name: tamr_<entity_type_name_>cluster_details
This dataset includes records with their cluster (entity) IDs, and is the output of the Clustering step in the Designer mastering flow.
If you have made any record overrides in Curator and re-run the mastering flow to apply those changes, the changes are included in this published dataset. See Managing Entity Record Clusters for more information on record overrides.
Entities by Similarity
Published Dataset Name: tamr_<entity_type_name_>cluster_similarities
Entity, or cluster, similarity refers to the similarity between the clustered records for a pair of entities. The similarity can range from 0 to 1; during the mastering process, Tamr Cloud automatically merges any entities whose similarity is greater than 0.5.
The published dataset includes similarity metrics for pairs of entities whose similarity is greater than 0 and less than 0.5.
This dataset can help you identify the most important entities to review in Curator. By reviewing entities whose similarity scores are close to 0.5, you may identify entities that should be merged or records that should be moved between the entities.
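One possible way to build such a review list is to sort the published pairs by how close their similarity is to 0.5. The sketch below assumes the dataset has been loaded as a list of dictionaries; `clusterId1` and `clusterId2` are the fields described on this page, while the `similarity` field name, the function name, and the sample values are assumptions made for illustration.

```python
def rank_pairs_for_review(rows, threshold=0.5, top_n=10):
    """Return the entity pairs whose similarity is closest to the threshold.

    Assumption: each row is a dict with a numeric "similarity" value
    (hypothetical field name) plus clusterId1 and clusterId2.
    """
    # Keep only pairs in the published range: above 0, below the threshold.
    candidates = [r for r in rows if 0 < r["similarity"] < threshold]
    # The smaller the gap to the threshold, the earlier the pair is listed.
    candidates.sort(key=lambda r: threshold - r["similarity"])
    return candidates[:top_n]


# Made-up sample rows, for illustration only.
sample_pairs = [
    {"clusterId1": "A", "clusterId2": "B", "similarity": 0.10},
    {"clusterId1": "A", "clusterId2": "C", "similarity": 0.48},
    {"clusterId1": "B", "clusterId2": "C", "similarity": 0.35},
]

for pair in rank_pairs_for_review(sample_pairs, top_n=2):
    print(pair["clusterId1"], pair["clusterId2"], pair["similarity"])
```

The pairs at the top of this list are the ones most likely to be worth opening in Curator first.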
Joining Records across Published Datasets
The following fields store the entity/cluster ID, by which you can join records across two or more datasets:
- Entities dataset: Tamr_ID and Entity Id fields
- Source Records by Entity: persistentId and suggestedClusterId fields
- Entities by Similarity: clusterId1 and clusterId2 fields
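As a rough sketch of such a join, the following Python attaches source records to their mastered entity. It assumes both datasets have been loaded as lists of dictionaries and that the `Tamr_ID` value in the Entities dataset matches the `persistentId` value in Source Records by Entity. This page does not state which ID fields correspond to each other, so verify that pairing against your own data; the function name and sample values are hypothetical.

```python
from collections import defaultdict


def join_entities_to_sources(entities, source_records,
                             entity_key="Tamr_ID",
                             source_key="persistentId"):
    """Attach to each entity the source records that share its ID.

    Assumption: entity_key and source_key hold the same entity/cluster ID.
    Pass different key names if your data pairs the fields differently.
    """
    records_by_id = defaultdict(list)
    for record in source_records:
        records_by_id[record[source_key]].append(record)
    # Entities without matching source records get an empty list.
    return [
        {**entity, "source_records": records_by_id.get(entity[entity_key], [])}
        for entity in entities
    ]


# Made-up sample data, for illustration only.
entities = [{"Tamr_ID": "e1", "name": "Acme Corp"},
            {"Tamr_ID": "e2", "name": "Globex"}]
source_records = [{"persistentId": "e1", "name": "ACME Corporation"},
                  {"persistentId": "e1", "name": "Acme Corp."}]

for row in join_entities_to_sources(entities, source_records):
    print(row["Tamr_ID"], len(row["source_records"]))
```

The same pattern applies to Entities by Similarity by passing `clusterId1` or `clusterId2` as the `source_key`, under the same assumption about matching IDs.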