Datasets Available to Publish for Legacy Data Products

You can publish several different datasets for each data product, including mastered entities, source records by cluster, and cluster similarity details.

When you create a publish destination, you select one or more datasets to publish to that destination: Mastered Entities, Source Records by Cluster, or Cluster by Similarity.

Mastered Entities Dataset

Published Dataset Name: tamr

This dataset includes mastered entities.

In this dataset, the Tamr ID for each mastered entity (and its clustered source records) is stored in both the Tamr_ID and Entity_Id fields.

If you have applied any attribute value overrides, those changes are included in this published dataset. See Editing Mastered Entity Attribute Values for more information on attribute overrides.

Source Records by Cluster

Published Dataset Name: cluster_details

This dataset includes records with their cluster (entity) IDs, and is the output of the Clustering step in the mastering flow.

In this dataset, the Tamr ID for the record's cluster (and related mastered entity) is stored in both the persistentId and suggestedClusterId fields.

If you have made any record overrides and re-run the mastering flow to apply those changes, the changes are included in this published dataset. See Modifying Source Record Clusters for more information on record overrides.

Cluster by Similarity

Published Dataset Name: cluster_similarities

Cluster, or entity, similarity refers to the similarity between the clustered source records for a pair of entities. The similarity can range from 0 to 1; during the mastering process, Tamr Cloud automatically merges any entities whose similarity is greater than 0.5.

The published dataset includes similarity metrics for pairs of entities whose cluster similarity is greater than 0 and less than 0.5.

This dataset can help you identify the most important entities to review. By reviewing entities whose similarity scores are close to 0.5, you may identify entities that should be merged or records that should be moved between the entities.

In this dataset, the Tamr IDs for the compared clusters are stored in the clusterId1 and clusterId2 fields.
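As a rough illustration of the review workflow described above, the following sketch ranks entity pairs by how close their similarity is to the 0.5 merge threshold. It assumes the published rows have already been loaded as dictionaries. The clusterId1 and clusterId2 field names come from this dataset, but the "similarity" field name, the margin parameter, and the sample values are hypothetical, so check the column names in your own published output.

```python
def pairs_to_review(rows, threshold=0.5, margin=0.1):
    """Return entity pairs whose similarity falls within `margin` below
    `threshold`, ordered from closest to the threshold to furthest."""
    candidates = [
        row for row in rows
        # "similarity" is an assumed column name, not confirmed by the docs.
        if threshold - margin <= float(row["similarity"]) < threshold
    ]
    return sorted(candidates, key=lambda row: float(row["similarity"]), reverse=True)


# Hypothetical sample rows standing in for the published dataset.
sample_pairs = [
    {"clusterId1": "a", "clusterId2": "b", "similarity": 0.12},
    {"clusterId1": "a", "clusterId2": "c", "similarity": 0.48},
    {"clusterId1": "b", "clusterId2": "d", "similarity": 0.41},
]

review_queue = pairs_to_review(sample_pairs)
for pair in review_queue:
    print(pair["clusterId1"], pair["clusterId2"], pair["similarity"])
```

Widening or narrowing the margin trades review effort against the chance of missing a pair that should have been merged.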

Joining Records across Published Datasets

The following fields store the entity (cluster) ID. Use them as join keys to combine records across two or more published datasets:

  • Mastered Entities dataset: Tamr_ID and Entity_Id fields.
  • Source Records by Cluster: persistentId and suggestedClusterId fields.
  • Cluster by Similarity: clusterId1 and clusterId2 fields.
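The join described by these fields can be sketched in plain Python as follows. This is a minimal example under stated assumptions: it treats each published dataset as a list of dictionaries and matches Tamr_ID in the Mastered Entities dataset against persistentId in Source Records by Cluster. The helper name, the "name" and "source_id" fields, and the sample values are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def join_entities_to_records(entities, records):
    """Group source records under their mastered entity.

    Matches entities["Tamr_ID"] against records["persistentId"].
    Returns a dict keyed by Tamr_ID.
    """
    records_by_entity = defaultdict(list)
    for record in records:
        records_by_entity[record["persistentId"]].append(record)

    return {
        entity["Tamr_ID"]: {
            "entity": entity,
            # Entities with no matching source records get an empty list.
            "records": records_by_entity.get(entity["Tamr_ID"], []),
        }
        for entity in entities
    }


# Hypothetical sample rows; "name" and "source_id" are illustrative fields.
mastered_entities = [
    {"Tamr_ID": "e1", "Entity_Id": "e1", "name": "Acme Corp"},
    {"Tamr_ID": "e2", "Entity_Id": "e2", "name": "Globex"},
]
source_records = [
    {"persistentId": "e1", "suggestedClusterId": "e1", "source_id": "crm-17"},
    {"persistentId": "e1", "suggestedClusterId": "e1", "source_id": "erp-04"},
    {"persistentId": "e2", "suggestedClusterId": "e2", "source_id": "crm-52"},
]

joined = join_entities_to_records(mastered_entities, source_records)
for tamr_id, group in joined.items():
    print(tamr_id, group["entity"]["name"], len(group["records"]))
```

The same approach extends to Cluster by Similarity by looking up clusterId1 and clusterId2 in the joined result to pull the entity and its source records for each side of a pair.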