Publishing Data Products

You publish data product datasets to make your updated mastered data available in cloud storage destinations.

You can publish complete data product view datasets to connected storage or database locations, or you can publish a downloadable CSV file with up to 1 million records.

Adding a Publish Configuration

When adding a publish configuration, you select which data product view to publish, which attributes to include, and the connected location to which to publish the data.

Available data product views are described in the table below.

Data Product View: Description

Golden Records: Includes mastered entities, also known as golden records. A golden record is the single record that best represents a cluster of records referring to the same real-world entity.

In this view, the Tamr ID for each golden record (and its clustered source records) is stored in the attribute with the internal name tamr_id. You can use this attribute to join records across datasets.

If you have applied any attribute value overrides, those changes are included in this published view dataset. See Editing Mastered Entity Attribute Values for more information on attribute overrides.

Source Records: Includes records with their cluster (Tamr) IDs. The Tamr ID is stored in the attribute with the internal name tamr_id.

If you have made any record overrides and re-run the mastering flow to apply those changes, the changes are included in this published view dataset. See Modifying Source Record Clusters for more information on record overrides.

Enhanced Source Records: Similar to the source records dataset, includes records with their Tamr IDs and includes any applied record overrides.

However, values in this dataset have been standardized and enhanced using the data quality services provided in the data product. For example, if the data product includes the Address Standardization, Validation, and Geocoding service, address attribute values may be replaced with the standardized and validated values returned by that service.
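
As an illustration of joining published views on tamr_id, the sketch below groups source records under their golden record using only the Python standard library. Only the tamr_id column name comes from the descriptions above; the other column names (name, source), the ID values, and the inline sample data are hypothetical stand-ins for published Golden Records and Source Records files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for two published CSV files.
# Only the tamr_id column name comes from the documentation.
GOLDEN_CSV = """tamr_id,name
101,Acme Corporation
102,Globex
"""

SOURCE_CSV = """tamr_id,name,source
101,ACME Corp,crm
101,Acme Corporation,erp
102,Globex Inc,crm
"""


def group_sources_by_golden(golden_text, source_text):
    """Return {tamr_id: {"golden": row, "sources": [rows]}}."""
    # Index source records by the Tamr ID they were clustered under.
    sources = defaultdict(list)
    for row in csv.DictReader(io.StringIO(source_text)):
        sources[row["tamr_id"]].append(row)

    # Attach each golden record to the source records sharing its Tamr ID.
    joined = {}
    for row in csv.DictReader(io.StringIO(golden_text)):
        tamr_id = row["tamr_id"]
        joined[tamr_id] = {"golden": row, "sources": sources.get(tamr_id, [])}
    return joined


joined = group_sources_by_golden(GOLDEN_CSV, SOURCE_CSV)
for tamr_id, entry in joined.items():
    print(tamr_id, entry["golden"]["name"], len(entry["sources"]))
```

The IDs are kept as strings here so that the join does not depend on how the values happen to be formatted.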

To add a publish configuration:

  1. Navigate to the Publish page for the data product.
  2. Select Add Configuration and enter a name for the configuration.
  3. Select which data product view to publish: Golden Records, Source Records, or Enhanced Source Records.
  4. Select which attributes to include in your output.
  5. Optionally, edit the publish name values for selected attributes to change the name of these attributes for downstream systems. Select the Save icon next to each publish name to save your change.
  6. Optionally, reorder columns in your output by dragging and dropping fields.
  7. From the Destination dropdown, select a configured connection.
    To publish a CSV file of up to 1 million records for download, select the File Download connection.
  8. If you selected a connection other than File Download:
    • For ADLS Gen2, Amazon S3, and Google Cloud Storage connections, enter the Path. This is the path for the published output within the configured bucket and prefix of the connection. After entering the path, the full URI to the published output displays.
      Do not include trailing or leading / unless they are intended to be part of the path. For example:
      • Specify /path/to/object if the object is stored at <bucket>/<prefix>//path/to/object.
      • Specify path/to/object if the object is stored at <bucket>/<prefix>/path/to/object.
    • For BigQuery and Snowflake connections, enter the Table to which to publish the output.
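
The leading-slash rule for the Path field in step 8 can be illustrated with a small sketch. The function below is hypothetical and not part of any Tamr API; it assumes the object location is formed by joining the bucket, prefix, and entered Path with single / separators, which is what the two Path examples in step 8 imply.

```python
def published_location(bucket: str, prefix: str, path: str) -> str:
    """Join bucket, prefix, and Path with '/' separators, leaving Path untouched."""
    return f"{bucket}/{prefix}/{path}"


# A leading slash in Path is preserved and yields a double slash.
with_leading_slash = published_location("<bucket>", "<prefix>", "/path/to/object")
print(with_leading_slash)

# Without the leading slash, the location contains single separators only.
without_leading_slash = published_location("<bucket>", "<prefix>", "path/to/object")
print(without_leading_slash)
```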

Publishing Data Product Views

Before You Begin:
When you publish data, any data already published to the destination for the data product is overwritten. Before publishing, back up the target file or table.

Important Notes for Snowflake:

  • If you are publishing to Snowflake and have added or removed output fields since the last publish job, you must either update the destination Snowflake table to match the updated schema or delete the destination table; otherwise, publishing will fail. If you delete the destination table before publishing, the table is recreated with the updated schema when you publish.
  • In order to view the published dataset, you must have read access in Snowflake to the table to which it was published. Contact your Snowflake administrator if you are not able to view the published datasets.

To publish data product views:

Navigate to the Publish page to configure your publish destination. Then:

  • To publish all configured data product views, select Publish All.
  • To publish an individual data product view, select the Publish to [Connection Name] option for that publish configuration.
  • For publish configurations configured with the File Download connection, select Generate File. When the file is ready for download, the Download option is enabled. Note that downloadable files contain up to 1 million records.

About Published Files

For cloud storage connections (Amazon S3, Google Cloud Storage, and ADLS Gen2), the following applies. If your dataset is under one million records, it is published as a single file. If your dataset is over one million records, it is published as multiple files.

Output File Format

The following apply to all published CSV files:

  • Delimiter: Comma (,)
  • Encoding: UTF-8
  • Header row: The first row is treated as a header.
  • Trailing and leading spaces: Trimmed in both headers and values.
  • Quote characters within values: Each quote character (") in a value is escaped with a second quote character (").
    For example, Hello my name is "John Doe" is changed to Hello my name is ""John Doe"" in the published file.
  • Commas within values: Values that contain commas are enclosed in quotes (" ").
    For example:
    • Company, Inc is converted to "Company, Inc"
    • Hello, my name is "John Doe" is converted to "Hello, my name is ""John Doe"""
  • Multi-line values: Multi-line values are converted to single-line values.
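
As a quick check of the quoting rules above, the following sketch parses the two comma-containing examples with Python's standard csv module. Its default dialect uses a comma delimiter and reads a doubled quote inside a quoted field as one literal quote. The header names and the sample row are hypothetical.

```python
import csv
import io

# Hypothetical two-column file built from the quoted examples above.
published = (
    "company,greeting\n"
    '"Company, Inc","Hello, my name is ""John Doe"""\n'
)

reader = csv.reader(io.StringIO(published))
header = next(reader)  # the first row is the header
row = next(reader)     # the first data row

print(header)
print(row)
```

A standard CSV parser recovers the original values here, so an integration should not need custom unescaping logic for these cases.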

Single File Output

If your dataset is under one million records, Tamr publishes it as a single file. The name of the file is output-00000.csv. When publishing is complete, there will be an additional file in your directory named _SUCCESS; it is used to indicate that your publish job is complete and that data is ready to be consumed. This file is empty and does not contain any data.

Example directory after single file output:

<provided_path>
    ├── output-00000.csv
    └── _SUCCESS

Multipart File Output

For larger datasets, Tamr Cloud publishes the files to a directory in multiple parts. The files are named output-<file number>.csv. When publishing is complete, there will be an additional file in your directory named _SUCCESS; it is used to indicate that your publish job is complete and that data is ready to be consumed. This file is empty and does not contain any data.

Example directory after multipart file publish:

<provided_path>
    ├── output-00000.csv
    ├── output-00001.csv
    ├── output-00002.csv
    ├── output-00003.csv
    ├── output-00004.csv
    └── _SUCCESS
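
One way a consumer might use the _SUCCESS marker is to refuse to read a directory until the marker exists. The sketch below assumes the published output has been synced to, or mounted at, a local directory. The function name, the error handling, and the temporary demo directory are illustrative and not part of Tamr.

```python
import tempfile
from pathlib import Path


def list_published_parts(directory):
    """Return sorted output-*.csv paths, or raise if _SUCCESS is missing."""
    folder = Path(directory)
    if not (folder / "_SUCCESS").exists():
        raise FileNotFoundError(
            f"No _SUCCESS marker in {folder}; the publish job may be incomplete"
        )
    # Sorting keeps the parts in output-00000, output-00001, ... order.
    return sorted(folder.glob("output-*.csv"))


# Demo against a temporary directory that mimics a finished multipart publish.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("output-00001.csv", "output-00000.csv", "_SUCCESS"):
        (Path(tmp) / name).touch()
    parts = [p.name for p in list_published_parts(tmp)]

print(parts)
```

Because the glob pattern matches only output-*.csv, the _SUCCESS marker itself is never returned as a data file.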

Guidance for Writing Integrations

If you are writing an integration, configure your program to parse all CSV files in the expected output folder, because the number of files published increases with the volume of records.

Example integration in Python:

# Import libraries
import glob
import pandas as pd

# Get the list of CSV files in the output folder, sorted by part number
path = '/dataProduct/tamr'
csv_files = sorted(glob.glob(path + "/*.csv"))

# Read each CSV file into a DataFrame
# This creates a list of DataFrames
df_list = [pd.read_csv(file) for file in csv_files]

# Concatenate all DataFrames into one
big_df = pd.concat(df_list, ignore_index=True)