Curating Data Product Data

You can override Tamr Cloud-computed clusters and field values.

As part of the data product mastering flow, Tamr Cloud groups records that refer to the same entity into a cluster, using a trained model. Cluster size can range from one record to thousands of records.

Tamr Cloud also applies rules that produce a single entity that best represents each cluster. These rules determine the most appropriate value for each mastered entity field.

If you have curator (or higher) permissions for a data product, you can review and, if necessary, override the Tamr Cloud-computed clusters and field values for that data product.

Note: Curator is read-only while a flow is running. You cannot make changes until the flow finishes.

Curating Data

Overriding Field Values

When you override a field value, the value updates in Studio automatically. The override is persistent: it is not overwritten when mastering flows deliver updated data to Studio, and it remains in Studio, Curator, and published datasets when the mastering flow is rerun.
See Editing Field Values.
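
The persistence behavior described above can be pictured as overrides layered on top of whatever values each flow run computes. The following is a minimal conceptual sketch, not Tamr Cloud's implementation or API; the function name `apply_field_overrides` and the dictionary layout are assumptions made for illustration.

```python
# Conceptual model only: names and data layout are hypothetical,
# not part of any Tamr Cloud API.

def apply_field_overrides(computed, overrides):
    """Layer curator overrides on top of the values computed by a flow run.

    computed:  entity_id -> {field: computed value}
    overrides: entity_id -> {field: curator-supplied value}
    """
    result = {}
    for entity_id, fields in computed.items():
        merged = dict(fields)                        # start from computed values
        merged.update(overrides.get(entity_id, {}))  # overrides take precedence
        result[entity_id] = merged
    return result


# A curator overrides the name of entity "e1" once.
overrides = {"e1": {"name": "Acme Corporation"}}

# Two successive flow runs compute different values for the same entity.
run_1 = {"e1": {"name": "Acme Corp", "city": "Boston"}}
run_2 = {"e1": {"name": "ACME", "city": "Cambridge"}}

after_run_1 = apply_field_overrides(run_1, overrides)
after_run_2 = apply_field_overrides(run_2, overrides)
print(after_run_2)
```

In this sketch the overridden `name` survives the second run, while the non-overridden `city` field picks up the newly computed value.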

Overriding Clusters

When you override clusters by merging entities, creating new entities, or moving source records between entities, the Clustering step in Designer applies these changes the next time the flow runs. The changes then persist on subsequent mastering flow runs. See Managing Clusters.
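
As a rough mental model, cluster overrides can be thought of as corrections applied to the model's proposed record-to-entity assignments each time the flow runs. The sketch below is illustrative only and does not reflect how the Clustering step is actually implemented; the function `apply_cluster_overrides`, the `moves` and `merges` structures, and the order in which they are applied are all assumptions.

```python
# Conceptual model only: structures and ordering are hypothetical.

def apply_cluster_overrides(model_assignments, moves, merges):
    """Apply curator cluster overrides to model-proposed assignments.

    model_assignments: record_id -> entity_id proposed by the trained model
    moves:  record_id -> entity_id chosen by a curator (moving a record,
            including into a newly created entity)
    merges: entity_id -> surviving entity_id after a curator merge
    """
    final = {}
    for record_id, entity_id in model_assignments.items():
        entity_id = moves.get(record_id, entity_id)   # record-level override
        entity_id = merges.get(entity_id, entity_id)  # entity-level merge
        final[record_id] = entity_id
    return final


model = {"r1": "e1", "r2": "e1", "r3": "e2", "r4": "e3"}
moves = {"r2": "e_new"}   # curator moved r2 into a new entity
merges = {"e3": "e2"}     # curator merged e3 into e2

final = apply_cluster_overrides(model, moves, merges)
print(final)
```

Because the same `moves` and `merges` are reapplied on every run in this model, the curated assignments persist even if the trained model proposes the original clusters again.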

Field Override Process

This diagram shows how Tamr Cloud applies overrides as part of the mastering process. When you merge entities, move records between entities, or create new entities, the Clustering step in Designer applies your changes the next time the flow runs.

For field overrides, Tamr Cloud automatically updates the mastered entity data in Studio with the override value. The override value is also included in published data.

Metrics in Curator

Data Product Metrics

The following metrics are available in Curator for each data product, based on the last mastering flow run:

  • Source Datasets: The number of source (input) datasets.
  • Source Records: The total number of records from all source datasets.
  • Entities: The number of entities resulting from data mastering.
  • % Duplicates: The percentage of records in source datasets that are part of a multi-record cluster.
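
The metric definitions above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, assuming a hypothetical flattened view of mastering output as `(record_id, source_dataset, entity_id)` rows; it is not how Curator computes or exposes these values.

```python
from collections import Counter

# Hypothetical flattened mastering output: (record_id, source_dataset, entity_id).
records = [
    ("r1", "crm", "e1"),
    ("r2", "crm", "e1"),
    ("r3", "erp", "e1"),
    ("r4", "erp", "e2"),
    ("r5", "web", "e3"),
]


def data_product_metrics(rows):
    """Compute the four data product metrics from (record, dataset, entity) rows."""
    cluster_sizes = Counter(entity for _, _, entity in rows)
    # Records that belong to a cluster containing more than one record.
    in_multi_record_cluster = sum(
        1 for _, _, entity in rows if cluster_sizes[entity] > 1
    )
    return {
        "source_datasets": len({dataset for _, dataset, _ in rows}),
        "source_records": len(rows),
        "entities": len(cluster_sizes),
        "pct_duplicates": 100.0 * in_multi_record_cluster / len(rows) if rows else 0.0,
    }


metrics = data_product_metrics(records)
print(metrics)
```

In this sample, three of the five records (r1, r2, r3) share entity e1, so % Duplicates is 60, while the two single-record clusters do not count toward it.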

To view data product metrics:
Select a data product tile in Curator to open the data product. The metrics appear at the top of the Entities and Fields tabs.

Entity Metrics

For each entity, you can review the following metrics, and sort and filter entities by them:

  • Source Records: The number of records in the cluster for that entity.
  • Source Datasets: The number of source datasets from which the clustered records originated.
  • Similar Entities: The number of similar entities.
  • Has Value Overrides: Whether the entity has any field value overrides (yes/no).
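
A minimal sketch of how the per-entity counts relate to the underlying records, followed by the kind of sorting and filtering the Entities table supports. The row layout, the `overridden` set, and the function name are hypothetical. Similar Entities is left out because this page does not define how it is calculated.

```python
from collections import defaultdict

# Hypothetical flattened mastering output: (record_id, source_dataset, entity_id).
rows = [
    ("r1", "crm", "e1"),
    ("r2", "crm", "e1"),
    ("r3", "erp", "e1"),
    ("r4", "erp", "e2"),
    ("r5", "web", "e3"),
]

# Hypothetical set of (entity_id, field) pairs that a curator has overridden.
overridden = {("e2", "name")}


def entity_metrics(rows, overridden):
    """Build one metrics row per entity from record-level data."""
    record_counts = defaultdict(int)
    datasets = defaultdict(set)
    for _, dataset, entity in rows:
        record_counts[entity] += 1
        datasets[entity].add(dataset)
    entities_with_overrides = {entity for entity, _ in overridden}
    return [
        {
            "entity": entity,
            "source_records": record_counts[entity],
            "source_datasets": len(datasets[entity]),
            "has_value_overrides": entity in entities_with_overrides,
        }
        for entity in record_counts
    ]


table = entity_metrics(rows, overridden)

# Sort by cluster size, largest first, and filter to entities with overrides.
by_size = sorted(table, key=lambda row: row["source_records"], reverse=True)
with_overrides = [row for row in table if row["has_value_overrides"]]
print(by_size[0], with_overrides)
```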

To view entity metrics:
Select a data product tile in Curator to open the data product. On the Entities tab, the table includes columns with these metrics for each entity. You can sort these columns.

Field Metrics

For each data product field, you can view the percentage of source records that are complete for that field, meaning that they have a non-null value.
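
The completeness percentage is a simple ratio of non-null values to total source records. A small sketch under the assumption that source records are available as dictionaries (the sample data and the helper name `pct_non_null` are made up for illustration):

```python
# Hypothetical source records; None stands for a null value.
source_records = [
    {"name": "Acme Corp", "phone": "555-0100", "website": None},
    {"name": "Acme Corporation", "phone": None, "website": None},
    {"name": "Globex", "phone": "555-0199", "website": "globex.example"},
    {"name": None, "phone": "555-0142", "website": None},
]


def pct_non_null(records, field):
    """Percentage of records whose value for `field` is not null."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field) is not None)
    return 100.0 * filled / len(records)


completeness = {
    field: pct_non_null(source_records, field)
    for field in ("name", "phone", "website")
}
print(completeness)
```

Here `name` and `phone` are each populated in three of four records (75%), while `website` is populated in only one (25%).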

To view field metrics:
Select a data product tile in Curator to open the data product. Select the Fields tab. The table includes a % Records with non null values column.