Publishing Data Products
You publish data product datasets to make your updated mastered data available in connected cloud storage and database destinations.
You can publish complete data product datasets to connected storage or database locations, or you can publish a downloadable CSV file with up to 1 million records.
Adding a Publish Configuration
When adding a publish configuration, you select which data product dataset to publish, which attributes to include, and the connected location to which to publish the data.
For ADLS Gen2, Amazon S3, and Google Cloud Storage connections, you can also choose whether to publish the dataset as multiple files or as a single file. Note that publishing as a single file may increase processing time.
Available data product datasets are described in the table below.
Data Product Dataset | Description |
---|---|
Golden Records | Includes mastered entities, also known as golden records. A golden record is the single record that best represents a cluster of records referring to the same real-world entity. In this dataset, the Tamr ID for each golden record (and its clustered source records) is stored in the attribute with the internal name `tamr_id`. You can use this attribute to join records across datasets (see the example after this table). If you have applied any attribute value overrides, those changes are included in this published view dataset. See Editing Mastered Entity Attribute Values for more information on attribute overrides. |
Source Records | Includes records with their cluster (Tamr) IDs. The Tamr ID is stored in the attribute with the internal name `tamr_id`. If you have made any record overrides and re-run the mastering flow to apply those changes, the changes are included in this published dataset. See Modifying Source Record Clusters for more information on record overrides. |
Enhanced Source Records | Similar to the source records dataset, this dataset includes records with their Tamr IDs and any applied record overrides. However, values in this dataset have been standardized and enhanced using the data quality services provided in the data product. For example, if the data product includes the Address Standardization, Validation, and Geocoding service, address attribute values may be replaced with the standardized and validated values returned by that service. |
Enrichment Results | Available for B2B Customers data products, this dataset includes the enrichment attributes provided by a selected data provider. For each record, this dataset includes the Tamr ID for the corresponding golden record, the Enrich ID, the Tamr Enrich match status, and the attributes provided by the data provider. |
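Because the golden records and source records datasets both store the cluster ID in the `tamr_id` attribute, you can join published datasets on that attribute. The following is a minimal sketch using pandas; the file paths are illustrative placeholders for wherever you publish each dataset.

```python
import pandas as pd

# Illustrative paths; substitute the locations you publish each dataset to.
golden = pd.read_csv("golden_records/output-00000.csv")
sources = pd.read_csv("source_records/output-00000.csv")

# Join each source record to its golden record on the shared tamr_id attribute.
joined = sources.merge(golden, on="tamr_id", suffixes=("_source", "_golden"))
```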
To add a publish configuration:
- Navigate to the Publish page for the data product.
- Select Add Configuration.
- Select which data product view to publish: Golden Records, Source Records, or Enhanced Source Records.
- Enter a name for the configuration and select Create.
- Select which attributes to include in your output. You can search for specific attributes, and also filter to selected or unselected attributes.
- Optionally, edit the publish name values for selected attributes to change the name of these attributes for downstream systems. Select the Save icon next to each publish name to save your change.
- Optionally, reorder columns in your output by dragging and dropping fields.
- From the Destination dropdown, select a configured connection.
To publish a CSV file of up to 1 million records for download, select the File Download connection.
- If you selected an ADLS Gen2, Amazon S3, or Google Cloud Storage connection:
  - Choose whether to publish the dataset to a single file (the default is multiple files).
  - Enter the Path. This is the path for the published output within the configured bucket and prefix of the connection. After entering the path, the full URI to the published output displays (see the sketch after this list). Do not include a trailing or leading `/` unless it is intended to be part of the path. For example:
    - Specify `/path/to/object` if the object is stored at `<bucket>/<prefix>//path/to/object`.
    - Specify `path/to/object` if the object is stored at `<bucket>/<prefix>/path/to/object`.
- If you selected a BigQuery or Snowflake connection, enter the Table to which to publish the output.
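As a reference for the Path format, the following is a minimal sketch of how the full URI is assembled from the connection's bucket and prefix plus the Path you enter. The bucket, prefix, path, and s3:// scheme shown are illustrative assumptions, not values from your connection.

```python
# All values below are illustrative assumptions.
bucket = "my-bucket"      # from the connection configuration
prefix = "tamr-publish"   # from the connection configuration
path = "path/to/object"   # the Path entered in the publish configuration

# The published output lands at <bucket>/<prefix>/<path>.
# A leading "/" in the Path would produce a double slash:
# <bucket>/<prefix>//path/to/object
uri = f"s3://{bucket}/{prefix}/{path}"
print(uri)  # s3://my-bucket/tamr-publish/path/to/object
```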
Publishing Data Product Views
Before You Begin:
When you publish data, any data already published to the destination for the data product is overwritten. Before publishing, back up the target file or table.
Important Notes for Snowflake:
- If you are publishing to Snowflake and have added or removed output fields since the last publish job, you must either update the destination Snowflake table to match the updated schema or delete the destination table; otherwise, publishing will fail. If you delete the destination table before exporting, the table is recreated with the updated schema when published (see the sketch after these notes).
- In order to view the published dataset, you must have read access in Snowflake to the table to which it was published. Contact your Snowflake administrator if you are not able to view the published datasets.
Important Note for BigQuery:
- If you are publishing to BigQuery and have added or removed output fields since the last publish job, you must either update the destination BigQuery table to match the updated schema or delete the destination table; otherwise, publishing will fail. If you delete the destination table before exporting, the table is recreated with the updated schema when published. (This is not applicable for publishing legacy data products.)
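If you choose to delete the destination table so that the next publish job recreates it with the updated schema, you can do so with your warehouse's own client tools. The following is a minimal sketch using the snowflake-connector-python package; the connection parameters and table name are placeholders, and BigQuery has an analogous drop operation through its own client.

```python
import snowflake.connector

# Placeholder connection parameters; use your own account, credentials, and warehouse.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
)

# Dropping the destination table lets the next publish job
# recreate it with the updated schema.
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS MY_DB.MY_SCHEMA.PUBLISHED_GOLDEN_RECORDS")
cur.close()
conn.close()
```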
To publish data product views:
Navigate to the Publish page to configure your publish destination. Then:
- To publish all configured data product views, select Publish All.
- To publish an individual data product view, select the Publish to [Connection Name] option for that publish configuration.
- For publish configurations configured with the File Download connection, select Generate File. When the file is ready for download, the Download option is enabled. Note that downloadable files contain up to 1 million records.
About Published Files
For cloud storage connections (Amazon S3, GCS, and ADLSv2), the dataset is published by default as multiple files. You can choose to publish the dataset as a single file, which may increase the processing time.
Output File Format
The following are applicable for all published CSV files (see the parsing sketch after this list):
- Delimiter: Comma (,)
- Encoding: UTF-8
- Header row: The first row is treated as a header.
- Trailing and leading spaces: Trimmed in both headers and values.
- Quote characters within values: The quote character (`"`) in values is escaped with another quote character (`"`). For example, `Hello my name is "John Doe"` is changed to `Hello my name is ""John Doe""` in the published file.
- Commas within values: Values that contain commas are enclosed in quotes (`"`). For example, `Company, Inc` is converted to `"Company, Inc"`, and `Hello, my name is "John Doe"` is converted to `"Hello, my name is ""John Doe"""`.
- Multi-line values: Multi-line values are converted to single-line values.
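These are standard CSV quoting conventions, so most CSV libraries read the published files without extra configuration. The following is a minimal sketch using Python's built-in csv module; the file name is illustrative, not a value from your configuration.

```python
import csv

# Illustrative file name; published parts are named output-<file number>.csv.
with open("output-00000.csv", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)  # the first row is treated as the header
    for row in reader:
        # Doubled quotes and quoted commas are unescaped automatically, so a value
        # published as "Company, Inc" is returned as Company, Inc.
        print(row)
```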
Multipart File Output
If you select to publish to multiple files (the default), Tamr Cloud publishes the files to a directory in multiple parts, named `output-<file number>.csv`. When publishing is complete, an additional file named `_SUCCESS` is written to the directory; it indicates that the publish job is complete and that the data is ready to be consumed. This file is empty and does not contain any data.
Example directory after multipart file publish:
```
<provided_path>
├── output-00000.csv
├── output-00001.csv
├── output-00002.csv
├── output-00003.csv
├── output-00004.csv
└── _SUCCESS
```
Single File Output
If you select to publish a single file, the filename is `output-00000.csv`. When publishing is complete, an additional file named `_SUCCESS` is written to the directory; it indicates that the publish job is complete and that the data is ready to be consumed (see the readiness-check sketch after the example below). This file is empty and does not contain any data.
Example directory after single file output:
```
<provided_path>
├── output-00000.csv
└── _SUCCESS
```
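Because the empty `_SUCCESS` marker is written only after all output files are complete, downstream jobs can use it as a readiness check before reading the data. The following is a minimal sketch assuming the output directory is accessible as a local or mounted filesystem path; the path itself is illustrative.

```python
import glob
import os

# Illustrative path to the directory your publish configuration writes to.
output_dir = "/mnt/published/golden_records"

# Only read the part files once the _SUCCESS marker exists, which indicates
# that the publish job has finished writing all output files.
if os.path.exists(os.path.join(output_dir, "_SUCCESS")):
    part_files = sorted(glob.glob(os.path.join(output_dir, "output-*.csv")))
    print(f"Ready to consume {len(part_files)} part files")
else:
    print("Publish job is not complete yet")
```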
Guidance for Writing Integrations
If you are writing an integration, configure your program to parse all CSV files in the expected output folder, because the number of output files can change as the volume of records grows.
Example integration in Python:
```python
# Import libraries
import glob
import pandas as pd

# Get the list of CSV files from the output folder
path = '/dataProduct/tamr'
csv_files = glob.glob(path + "/*.csv")

# Read each CSV file into a DataFrame
# (this builds a generator of DataFrames, consumed by pd.concat below)
df_list = (pd.read_csv(file) for file in csv_files)

# Concatenate all DataFrames into one
big_df = pd.concat(df_list, ignore_index=True)
```