Curation Guidebook for Legacy Data Products
Best practices for curation in Tamr Cloud.
Follow these curation best practices to help you get started with reviewing and curating your data. The goal of curation is to continuously improve and validate your results.
Identify Key Entities
When reviewing your data, a good place to start is by looking at your most important entities.
How do you decide which mastered entities to review?
You might want to review your largest clusters, or the clusters you consider the most important to the accuracy of your data. You can quickly:
- Search for your most important entities.
- Search for common entities that might have errors in how they are clustered.
- Sort the Source Records column to view your largest clusters. You might want to review mastered entities with a large number of clustered source records to ensure that the most appropriate value was selected for each attribute.
- Sort the Similar Entities column to find clusters that have many similar clusters. You might want to compare similar source record clusters to ensure that the records were clustered correctly. When a cluster has many similar clusters, Tamr recognized the similarity but found the cluster different enough to keep it separate from those similar entities. Curate similar clusters by merging the ones that represent the same entity and by verifying that source records belong to the correct cluster.
Or you might want to review entities with pending or applied changes, or with source records that have not yet been verified. You can quickly filter to:
- Entities with pending changes, to review the changes that will be applied the next time the flow runs.
- Entities that changed in the last flow run, to review the changes to record clusters and newly created entities.
- Entities with attribute overrides, to review the changed attribute values.
- Entities with unverified or partially verified source records, to determine whether the unverified source records belong in the cluster.
Other tips:
- Once you’ve identified a number of key entities, bookmark them so that you can find them easily.
- Use table views to save the table configuration settings that show your key entities, as described in Create Table Views.
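The sorting and filtering criteria in this section can also be expressed as a short script, which may help if you want to reason about review priority outside the UI. The following is a minimal sketch only: the `Entity` fields, the `review_queue` function, and the sample data are all hypothetical and do not correspond to any Tamr API or export format.

```python
from dataclasses import dataclass


# Hypothetical record shape: these field names are illustrative only and
# are not taken from any Tamr API or export format.
@dataclass
class Entity:
    name: str
    source_records: int          # number of clustered source records
    similar_entities: int        # number of similar clusters
    pending_changes: bool = False
    unverified_records: int = 0


def review_queue(entities, top_n=3):
    """Return the names of entities to review first, without duplicates."""
    # Largest clusters first (descending sort on source record count).
    largest = sorted(entities, key=lambda e: e.source_records, reverse=True)[:top_n]
    # Clusters with the most similar clusters.
    most_similar = sorted(entities, key=lambda e: e.similar_entities, reverse=True)[:top_n]
    # Entities with pending changes or unverified source records.
    flagged = [e for e in entities if e.pending_changes or e.unverified_records > 0]

    queue = []
    for entity in largest + most_similar + flagged:
        if entity.name not in queue:
            queue.append(entity.name)
    return queue


entities = [
    Entity("Acme Corp", source_records=42, similar_entities=1),
    Entity("Globex", source_records=7, similar_entities=9),
    Entity("Initech", source_records=3, similar_entities=0, pending_changes=True),
    Entity("Umbrella", source_records=2, similar_entities=0),
]

print(review_queue(entities, top_n=1))  # ['Acme Corp', 'Globex', 'Initech']
```

The ordering here (largest clusters, then most similar, then flagged) is one possible choice; adjust it to match whichever entities matter most to the accuracy of your data.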
Create Table Views
As you review and curate, table views allow you to save your customized table configuration settings. By filtering, sorting, or reordering columns, you can create a view that highlights your most relevant data. If you save your view, you can find it in the view dropdown, and you won’t have to reconfigure your table each time.
You can save up to six views at a time, meaning you can switch between different data presentations as needed. For example, you might use one view for a high-level overview, while another might highlight a particular subset of data.
Save a view that targets your key entities. For example, you could save a view with the Source Records column sorted in descending order to show your largest clusters first, or with the Similar Entities column sorted in descending order to show the clusters with the most similar entities first.
Curate Key Entities
Select one of your key entities to review its data. On the Entity Details page, check that the data is correct. If you see an incorrect value, navigate to the Attribute Overrides page to correct it. To override incorrect values, see Editing Mastered Entity Attribute Values.
Next, open a key entity and go to the Manage Cluster Details page. Review the records in this entity. Do all records belong? If you’re sure they do, verify them as belonging to the correct cluster. To move out records that don’t belong in the key entity, you can create new entities for them or move them to other clusters. See Modifying Source Record Clusters for more information on moving records between clusters.
On the left, open the Filter Source Records by Entity panel to see similar clusters.
Select the checkbox next to the clusters with the highest percent similarity to review the records of similar clusters underneath your key entity’s records. If a similar record belongs in the key entity, you can use drag and drop to merge it into your key entity. Select the record from the bottom of your screen and drag the record into your key entity.
Take Action
When you notice any incorrect data, you should take action to correct it. This might mean making value overrides or adjusting clusters as described above. If you are not sure whether something is accurate, you can leave feedback on records or clusters and assign the feedback for review.
Run Your Data Product and Iterate
Once you have made changes to refine your results, run your data product to save your changes. Then, review your key entities again. Once you are seeing fewer inaccuracies in your key entities’ data, you can stop rigorous curation. As you bring in more source data, the curation cycle continues. Remember, the goal of curation is to continuously improve and validate your results.