Curation Guidebook
Best practices for curation in Tamr Cloud.
Follow these curation best practices to help you get started with reviewing and curating your data. The goal of curation is to continuously improve and validate your results.
Identify Key Entities
When reviewing your data, a good place to start is by looking at your most important entities.
How do you decide which mastered entities to review?
You might want to review your largest clusters, or the clusters you consider the most important to the accuracy of your data. You can quickly:
- Search for key entities for your organization, such as critical accounts.
- Search for common entities that might have errors in how they are clustered.
- Sort by the different curation attributes:
  - Sort by Number of Source Records to view your largest clusters. You might want to review mastered entities with a large number of clustered source records to ensure that the most appropriate value was selected for each attribute.
  - Sort by Number of Similar Clusters to find clusters that have similar clusters. You can further filter by Maximum Similarity to quickly find the clusters that are most similar to other clusters. You might want to compare similar source record clusters to ensure that the records were clustered correctly. If a cluster has many similar clusters, Tamr recognized the similarity but judged the cluster different enough not to be clustered with any of these similar entities. Curate similar clusters by merging those that represent the same entity and by verifying that source records belong to the correct cluster.
- Filter to entities that were changed in the last flow run to review these updates.
- Sort or filter by cluster or attribute uniformity. The uniformity score for the cluster indicates how similar clustered source records are to each other. In the Configure Data Product page, you can also select attributes for which to calculate a uniformity score. At the attribute level, the uniformity score provides insight into how similar the values for this attribute are within the source record cluster. Uniformity scores range from 0 to 1. For example, a uniformity score of 1 for an attribute means that all records in the cluster have the same value for this attribute, while a uniformity score of 0 indicates that all records in the cluster have different values for this attribute.
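To build intuition for the two endpoints of an attribute uniformity score, here is a minimal Python sketch. The formula is an assumption chosen only because it returns 1 when all values match and 0 when all values differ. It is not Tamr's actual calculation, and the function name `uniformity_sketch` is hypothetical.

```python
def uniformity_sketch(values):
    """Illustrative score in [0, 1]; NOT Tamr's actual formula (assumption).

    Returns 1.0 when every value is identical and 0.0 when every value
    is different, scaling linearly with the number of distinct values.
    """
    n = len(values)
    if n <= 1:
        # Assumption: a cluster with zero or one record is treated as uniform.
        return 1.0
    distinct = len(set(values))
    return 1.0 - (distinct - 1) / (n - 1)


# All records agree on the attribute value.
print(uniformity_sketch(["Boston", "Boston", "Boston"]))
# Every record has a different value.
print(uniformity_sketch(["Boston", "Cambridge", "Somerville"]))
# Partial agreement falls between the two extremes.
print(uniformity_sketch(["Boston", "Boston", "Cambridge"]))
```

Under this assumed formula, low-scoring attributes are the ones where clustered records disagree most, which is why sorting by uniformity can surface clusters worth reviewing first.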
Other tips:
- Once you’ve identified a number of key entities, bookmark them so that you can find them easily.
- Save your table configuration settings showing your key entities with table views, as described in the section below.
Create Table Views
As you review and curate, table views allow you to save your customized table configuration settings. By filtering, sorting, or reordering columns, you can create a view that highlights your most relevant data. If you save your view, you can find it in the view dropdown, and you won’t have to reconfigure your table each time.
You can save up to six views at a time, meaning you can switch between different data presentations as needed. For example, you might use one view for a high-level overview, while another might highlight a particular subset of data.
Save a view that targets your key entities. For example, you could save a view with the Source Records column sorted in descending order to view your largest clusters first, or with the Similar Entities column sorted in descending order to view the clusters with the most similar entities.
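Conceptually, a table view is a saved combination of a filter and a sort order. The Python sketch below models that idea outside the product. The `TableView` class, the `save_view` helper, and the row field names (`entity`, `source_records`, `similar_entities`) are all hypothetical and are not part of any Tamr API. Only the six-view limit comes from this guide.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_VIEWS = 6  # this guide states that up to six views can be saved


@dataclass
class TableView:
    """Hypothetical model of a saved view: one sort column plus an optional filter."""
    name: str
    sort_key: str
    descending: bool = True
    row_filter: Optional[Callable[[dict], bool]] = None

    def apply(self, rows):
        # Keep rows that pass the filter, then sort by the chosen column.
        kept = [r for r in rows if self.row_filter is None or self.row_filter(r)]
        return sorted(kept, key=lambda r: r[self.sort_key], reverse=self.descending)


def save_view(saved, view):
    """Append a view to the saved list, enforcing the six-view limit."""
    if len(saved) >= MAX_VIEWS:
        raise ValueError("at most six views can be saved at a time")
    saved.append(view)


# Made-up sample rows for illustration only.
rows = [
    {"entity": "Acme Corp", "source_records": 42, "similar_entities": 3},
    {"entity": "Globex", "source_records": 7, "similar_entities": 9},
    {"entity": "Initech", "source_records": 18, "similar_entities": 0},
]

saved_views = []
largest = TableView("Largest clusters", sort_key="source_records")
most_similar = TableView(
    "Most similar",
    sort_key="similar_entities",
    row_filter=lambda r: r["similar_entities"] > 0,
)
save_view(saved_views, largest)
save_view(saved_views, most_similar)

largest_names = [r["entity"] for r in largest.apply(rows)]
similar_names = [r["entity"] for r in most_similar.apply(rows)]
print(largest_names)
print(similar_names)
```

Keeping each view as a named filter-plus-sort pair mirrors how you might reserve one view for your largest clusters and another for clusters with many similar entities.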
Curate Key Entities
Select one of your key entities to review its data. On the Entity Details page, check that the data is correct. If you see an incorrect value, navigate to the Attribute Overrides page to correct it. To override incorrect values, see Editing Mastered Entity Attribute Values.
Next, open a key entity and go to the Manage Cluster Details page. Review the records in this entity. Do all records belong? If you’re sure they do belong, verify them as belonging to the correct cluster. To move records out that don’t belong in the key entity, you can create new entities for them or you can move them to other clusters. See Modifying Source Record Clusters for more information on moving records between clusters.
On the left, open the Filter Source Records by Entity panel to see similar clusters.
Select the checkbox next to the clusters with the highest percent similarity to review the records of similar clusters underneath your key entity’s records. If a similar record belongs in the key entity, you can use drag and drop to merge it into your key entity. Select the record from the bottom of your screen and drag the record into your key entity.
Take Action
When you notice any incorrect data, you should take action to correct it. This might mean making value overrides or adjusting clusters as described above. If you are not sure whether something is accurate, you can leave feedback on records or clusters and assign the feedback for review.
Refresh Your Data Product and Iterate
Once you have made changes to refine your results, refresh your data product to save your changes. Then, review your key entities again. When you see fewer inaccuracies in your key entities’ data, you can stop rigorous curation. As you bring in more source data, the curation cycle continues. Remember, the goal of curation is to continuously improve and validate your results.