Available Published Datasets

You can publish several different datasets for each data product, including mastered entity records, source records by cluster, and cluster similarity details.

When you create a publish destination, you select one or more datasets to publish for that destination: Entities, Source Records by Entity, or Entities by Similarity.

In the published dataset name, Tamr Cloud converts spaces and hyphens in the data product name to underscores.
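The conversion rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above: the function name is hypothetical, and it assumes no other normalization (such as case changes) is applied.

```python
def to_published_name_part(data_product_name: str) -> str:
    """Replace spaces and hyphens with underscores.

    Assumption: this is the only change made to the name; the source
    text does not mention case or other characters.
    """
    return data_product_name.replace(" ", "_").replace("-", "_")


print(to_published_name_part("Customer Accounts-EU"))  # Customer_Accounts_EU
```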

Entities Dataset

Published Dataset Name: tamr_<entity_type_name>

This dataset includes mastered entity records.

If you have applied any field value overrides in Curator, those changes are included in this published dataset. See Editing Field Values for more information on field overrides.

Source Records by Entity

Published Dataset Name: tamr_<entity_type_name>_cluster_details

This dataset includes records with their cluster (entity) IDs, and is the output of the Clustering step in the Designer mastering flow.

If you have made any record overrides in Curator and re-run the mastering flow to apply those changes, the changes are included in this published dataset. See Managing Entity Record Clusters for more information on record overrides.

Entities by Similarity

Published Dataset Name: tamr_<entity_type_name>_cluster_similarities

Entity, or cluster, similarity refers to the similarity between the clustered records for a pair of entities. The similarity ranges from 0 to 1. During the mastering process, Tamr Cloud automatically merges any entities whose similarity is greater than 0.5.

The published dataset includes similarity metrics for pairs of entities whose similarity is greater than 0 and less than 0.5.

This dataset can help you identify the most important entities to review in Curator. By reviewing entities whose similarity scores are close to 0.5, you may identify entities that should be merged or records that should be moved between the entities.
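As a rough sketch of that review workflow, the Python below picks out the entity pairs closest to the 0.5 merge threshold from rows of the published dataset. The `similarity` column name, the 0.4 cutoff, and the sample rows are assumptions made for illustration; check the column names in your own published dataset.

```python
def pairs_to_review(pairs, min_similarity=0.4):
    """Return pairs with min_similarity <= similarity < 0.5, highest first.

    `pairs` is a list of dicts, one per row of the published dataset.
    The "similarity" key is a hypothetical column name.
    """
    candidates = [p for p in pairs if min_similarity <= p["similarity"] < 0.5]
    return sorted(candidates, key=lambda p: p["similarity"], reverse=True)


# Made-up sample rows for illustration only.
sample_pairs = [
    {"clusterId1": "e1", "clusterId2": "e2", "similarity": 0.48},
    {"clusterId1": "e1", "clusterId2": "e3", "similarity": 0.12},
    {"clusterId1": "e2", "clusterId2": "e4", "similarity": 0.41},
]

for pair in pairs_to_review(sample_pairs):
    print(pair["clusterId1"], pair["clusterId2"], pair["similarity"])
```

Raising or lowering `min_similarity` widens or narrows the review list; the right cutoff depends on how much review time you have.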

Joining Records across Published Datasets

The following fields store the entity/cluster ID, by which you can join records across two or more datasets:

  • Entities dataset: Tamr_ID and Entity Id fields
  • Source Records by Entity: persistentId and suggestedClusterId fields
  • Entities by Similarity: clusterId1 and clusterId2 fields
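A join across the three datasets might look like the following Python sketch. It assumes that `Tamr_ID`, `persistentId`, and `clusterId1`/`clusterId2` hold matching entity ID values; the `name` and `recordId` columns and all sample rows are hypothetical and stand in for whatever attributes your datasets contain.

```python
from collections import defaultdict

# Made-up sample rows; only the ID field names come from the list above.
entities = [
    {"Tamr_ID": "e1", "name": "Acme Corp"},
    {"Tamr_ID": "e2", "name": "Acme Corporation"},
]
source_records = [
    {"recordId": "r1", "persistentId": "e1"},
    {"recordId": "r2", "persistentId": "e1"},
    {"recordId": "r3", "persistentId": "e2"},
]
similarities = [{"clusterId1": "e1", "clusterId2": "e2"}]


def group_source_records(records):
    """Map each entity ID to the list of its source record IDs."""
    by_entity = defaultdict(list)
    for record in records:
        by_entity[record["persistentId"]].append(record["recordId"])
    return dict(by_entity)


def describe_pairs(entities, records, pairs):
    """Attach entity names and source record IDs to each similarity pair."""
    names = {e["Tamr_ID"]: e["name"] for e in entities}
    grouped = group_source_records(records)
    described = []
    for pair in pairs:
        id1, id2 = pair["clusterId1"], pair["clusterId2"]
        described.append({
            "entity1": names.get(id1),
            "entity2": names.get(id2),
            "records1": grouped.get(id1, []),
            "records2": grouped.get(id2, []),
        })
    return described


for row in describe_pairs(entities, source_records, similarities):
    print(row)
```

Using `dict.get` keeps the join tolerant of IDs that appear in one dataset but not another, which behaves like a left join from the similarity pairs.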