Adding Attributes from Source Datasets to a Data Product
Best practices for adding more attributes (fields) to your data product.
The mastering flow provided by each data product template includes a pre-defined unified schema with industry-standard attributes. When the mastering flow is run, these attributes, as well as any attributes added by the data quality and enrichment services, are included in the data product output.
Your source datasets might contain additional columns (fields) that you would like to include as attributes in your final data product. Follow the steps and best practices in this topic to add these columns and ensure that the attributes appear correctly in your data product.
Select the scenario in which you want to add attributes:
Adding Attributes to New Data Products
The documentation for each data product template explains how to configure the flow for a new data product. Refer to the topics for your data product for template-specific guidance.
In general, follow these steps when adding attributes to a new data product flow:
-
In the Schema Mapping step, add and map new attributes to the unified schema. Follow the instructions in Mapping Input Fields to a Unified Schema.
-
In the Consolidate Records transformation step, add rules to select the attribute values for the mastered entities.
For example, you might want the mastered entity attribute value to be the most common value from the clustered source records, or you might prefer to use a value from a specific dataset that you know to be highly reliable. Follow the instructions in Modifying Record Consolidation Transformations.
-
If you are creating a Company Mastering with D&B or Supplier Mastering with D&B data product, add the new attributes to the Consolidate Record Fields transformation step so that they will be passed through to the next step in the flow.
To add the attributes, enter the attribute names in the comma-separated list before this line:
dnb_match_candidates_match_quality_information_confidence_code as confidence_code;
-
In the Configure Attributes step, add and map the new attributes to include them in your final mastered entity output. Follow the instructions in Configuring Data Display.
Note: If you want the new attributes to be included in the entity tables in Tamr Cloud, add and map them to the first, or primary, attribute group in this step.
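The two consolidation strategies mentioned in the steps above (most common value, or the value from a trusted dataset) can be illustrated outside of Tamr Cloud with a short Python sketch. This is not Tamr transformation syntax. The function names, the "source" key, and the sample records are hypothetical and only show the selection logic that a consolidation rule expresses.

```python
from collections import Counter

def most_common_value(records, attribute):
    """Return the most frequent non-empty value of `attribute` in a cluster."""
    values = [r[attribute] for r in records if r.get(attribute)]
    if not values:
        return None
    # most_common(1) returns a list holding one (value, count) pair
    return Counter(values).most_common(1)[0][0]

def value_from_preferred_source(records, attribute, preferred_source):
    """Prefer the value from a trusted dataset; otherwise use the most common value."""
    for r in records:
        if r.get("source") == preferred_source and r.get(attribute):
            return r[attribute]
    return most_common_value(records, attribute)

# Hypothetical cluster of source records that resolve to one mastered entity
cluster = [
    {"source": "crm", "industry": "Retail"},
    {"source": "erp", "industry": "Retail"},
    {"source": "billing", "industry": "Wholesale"},
]

by_frequency = most_common_value(cluster, "industry")
by_source = value_from_preferred_source(cluster, "industry", "billing")
```

In this sketch, the frequency rule selects "Retail" and the preferred-source rule selects "Wholesale", which shows why the choice of rule matters for each new attribute.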
Adding Attributes to Configured Data Products
After your data product is configured and in use, you might need to add attributes for source dataset columns that:
- Already exist in a source dataset but are not currently mapped to attributes in the data product.
- Are available in a new source dataset.
- Were added to an existing source dataset for the data product when that dataset was updated.
The steps to add attributes in each of these cases are largely the same. However, for the third case, you also need to remove the source dataset from Tamr Cloud and then re-add it, as described in the procedures below.
Note: When working with any new source datasets or datasets in which new columns have been added, confirm that the columns meet the Requirements for Source Datasets.
Step 1: Create a development version of the data product
Important: When updating a configured data product, we recommend that you create a copy of the data product and test all changes in the copy. Once you have verified your changes using a sample of your data, you can then duplicate the changes in the existing data product.
Save a copy of the data product to use for development. Follow the instructions in Adding a Data Product.
The copy includes most of the flow configuration of the original data product, but does not include the source data.
Step 2: Replace the existing source dataset with a new version
This step is only necessary if you are adding attributes because a source dataset for the data product has been updated to include new columns. Tamr Cloud cannot refresh source datasets if the schema has changed.
Skip this step if you are adding columns that already exist in the source dataset, or if you are adding a new source dataset to a data product.
To replace the source dataset with the new version:
-
In Admin > Sources, locate the source dataset. Note its name, and then delete it.
-
Recreate the source with the EXACT same name. This is critical to retain the same Tamr IDs for your mastered entities in the original data product.
Note: Using a different source name or using a source with different primary keys than the source being replaced will result in NEW Tamr IDs in the original data product. The copied (development) version of the data product will have new Tamr IDs regardless of the source name.
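To decide whether this step applies, it can help to compare the column names of the old and new versions of the source dataset before changing anything in Tamr Cloud. The following Python sketch assumes you have both column lists available (for example, from file headers). The function names and column names are hypothetical, and the check runs outside of Tamr Cloud.

```python
def added_columns(old_columns, new_columns):
    """Columns present in the new version but not in the old one, in order."""
    old = set(old_columns)
    return [c for c in new_columns if c not in old]

def removed_columns(old_columns, new_columns):
    """Columns present in the old version but missing from the new one."""
    new = set(new_columns)
    return [c for c in old_columns if c not in new]

# Hypothetical column lists for two versions of the same source dataset
old_version = ["id", "name", "address"]
new_version = ["id", "name", "address", "website", "phone"]

added = added_columns(old_version, new_version)
removed = removed_columns(old_version, new_version)

# Any difference means the schema changed, so the source must be replaced
schema_changed = bool(added or removed)
```

If `removed` includes the primary key column, stop and review the new version first, because replacing a source with different primary keys results in new Tamr IDs.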
Step 3: Update the flow to add the new attributes
-
In the Add Data step, add the same source datasets used in the original data product. Follow the instructions in Adding Data to Your Data Product.
-
In the Schema Mapping step:
- Select the source datasets as the input, following the instructions in Adding Data to Your Data Product. For the sources used in the original data product, the schema maps automatically.
- Add and map new attributes to the unified schema. Follow the instructions in Mapping Input Fields to a Unified Schema.
-
Use a small data sample for testing purposes. In the Create tamr_record_id transformation step, set the flow to randomly select 100 records for testing. After the first line, which is:
use input;
enter:
sample 100;
Important: Only use this line in the data product copy and NOT in the original.
-
In the Consolidate Records transformation step, add rules to select the attribute values for the mastered entities. For example, you might want the mastered entity attribute value to be the most common value from the clustered source records, or you might prefer to use a value from a specific dataset that you know to be highly reliable. Follow the instructions in Modifying Record Consolidation Transformations.
-
If you are creating a Company Mastering with D&B or Supplier Mastering with D&B data product, add the new attributes to the Consolidate Record Fields transformation step so that they will be passed through to the next step in the flow.
To add the attributes, enter the attribute names in the comma-separated list before this line:
dnb_match_candidates_match_quality_information_confidence_code as confidence_code;
-
In the Configure Attributes step, add and map the new attributes to include them in your final mastered entity output. Follow the instructions in Configuring Data Display.
Note: If you want the new attributes to be included in the entity tables in Tamr Cloud, add and map them to the first, or primary, attribute group in this step.
-
Run the flow and review the output in Tamr Cloud. Make any necessary adjustments.
-
Once you have verified your changes using a sample of your data, duplicate the changes you made in all flow steps in the original data product. The exception is the Create tamr_record_id step, which contains the sample line used only for testing.
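When reviewing the sample output from the development copy, one simple check is whether each new attribute is actually populated. The Python sketch below assumes the mastered entities have been exported as a list of records. The function name, attribute names, and sample records are hypothetical, and this check is not a Tamr Cloud feature.

```python
def fill_rates(records, attributes):
    """Return the fraction of records with a non-empty value for each attribute."""
    total = len(records)
    rates = {}
    for attribute in attributes:
        filled = sum(1 for r in records if r.get(attribute) not in (None, ""))
        rates[attribute] = filled / total if total else 0.0
    return rates

# Hypothetical sample of mastered entities from the development copy
entities = [
    {"name": "Acme", "website": "acme.example", "phone": ""},
    {"name": "Globex", "website": "globex.example", "phone": ""},
    {"name": "Initech", "website": "", "phone": ""},
    {"name": "Umbrella", "website": "umbrella.example", "phone": None},
]

rates = fill_rates(entities, ["website", "phone"])
```

Here `website` is populated for three of four entities, while `phone` is never populated. A result like that usually points to a missing mapping or consolidation rule for the attribute.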
Step 4: Update the published mastered entity dataset to include the attributes
Edit the publish destination for the data product to add the new attributes to the Mastered Entities dataset. See Managing Publish Destinations.