Publishing Data Product Data

You make data product datasets available to downstream systems by publishing them to a cloud storage destination.

Your ability to publish datasets depends on your user role and permissions.

The datasets that you can publish for a data product include Mastered Entities, Source Records by Cluster, and Cluster by Similarity. When configuring a publish destination, you select which datasets to publish to that location and which columns to include in those datasets. See Datasets Available for Export for information on these datasets.

When you publish data, any data already published to the destination for the data product is overwritten. You can also download published datasets.

If you have publisher (or higher) permissions for your data product, you can select Export Sample to quickly download the table data on the page. Configure the number of rows to download at the bottom of the table.

Before You Begin:
Because publishing overwrites any data already published to the destination, back up the target file or table before publishing.

Important Notes for Snowflake:

  • If you are publishing to Snowflake and have added or removed output fields since the last publish job, you must either update the destination Snowflake table to match the updated schema or delete the destination table; otherwise, publishing will fail. If you delete the destination table before exporting, the table is recreated with the updated schema when published.
  • In order to view the published dataset, you must have read access in Snowflake to the table to which it was published. Contact your Snowflake administrator if you are not able to view the published datasets.

To publish data product datasets:

  1. Open the data product from the home page.
  2. Select the Publish page.
  3. If the Publish table does not include the destination that you want to use, add a new destination. See Adding a Publish Destination for instructions.
  4. In the table, select Publish for the destination to which to publish the data and confirm.
    The datasets configured for that destination are published to the cloud storage location.

You can monitor the progress of the publish job.

File Output

For cloud storage connections (Amazon S3, GCS, and ADLSv2), the following applies. If your dataset is under one million records, it is published as a single file. If your dataset is over one million records, it is published as multiple files, as described below.

In the published dataset name, spaces and hyphens in the data product name are converted to underscores.

To read more about the published datasets, see Datasets Available to Publish.

Output File Format

The following apply to all published CSV files:

  • Delimiter: Comma (,)
  • Encoding: UTF-8
  • Header row: The first row is treated as a header.
  • Trailing and leading spaces: Trimmed in both headers and values.
  • Quote characters within values: A quote (") character in a value is escaped with a second quote (").
    For example, Hello my name is "John Doe" is changed to Hello my name is ""John Doe"" in the published file.
  • Commas within values: Values that contain commas are enclosed in quotes (" ").
    For example:
    • Company, Inc is converted to "Company, Inc"
    • Hello, my name is "John Doe" is converted to "Hello, my name is ""John Doe"""
  • Multi-line values: Multi-line values are converted to single-line values.
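These conventions match standard CSV quoting, so off-the-shelf parsers handle them automatically. As a quick illustration (the sample line below is fabricated, not real published output), Python's built-in csv module unescapes doubled quotes and quoted commas:

```python
import csv
import io

# A fabricated line following the conventions above:
# values with commas are quoted, and inner quotes are doubled.
published = 'name,description\n"Company, Inc","Hello, my name is ""John Doe"""\n'

reader = csv.reader(io.StringIO(published))
header = next(reader)
row = next(reader)
print(row)  # ['Company, Inc', 'Hello, my name is "John Doe"']
```

Libraries such as pandas apply the same quoting rules by default, so no special configuration is needed to read the published files.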

Single File Output

If your dataset is under one million records, Tamr publishes it as a single file named <dataproduct name>/<dataset name>/output.csv. When publishing is complete, Tamr also writes an empty file named _SUCCESS to the directory to indicate that the publish job is complete and the data is ready to be consumed. This file does not contain any data.

Example directory after single file output:

dataProduct
└── tamr
    ├── output.csv
    └── _SUCCESS
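Because _SUCCESS is written only after the output files are fully published, a consuming job can treat its presence as a readiness check before reading the data. A minimal sketch (the function name is illustrative):

```python
import os

def is_publish_complete(output_dir: str) -> bool:
    # The empty _SUCCESS marker appears only after the publish job
    # finishes, so its presence means the data is safe to consume.
    return os.path.exists(os.path.join(output_dir, "_SUCCESS"))
```

A scheduled integration might poll this check and skip its run (or retry later) while a publish job is still in progress.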

Multipart File Output

For larger datasets, Tamr Cloud publishes the dataset to a directory as multiple files named output-<file number>.csv. When publishing is complete, Tamr also writes an empty file named _SUCCESS to the directory to indicate that the publish job is complete and the data is ready to be consumed. This file does not contain any data.

Example directory after multipart file publish:

dataProduct
├── cluster_details
│   ├── output-00001.csv
│   ├── output-00002.csv
│   ├── output-00003.csv
│   ├── output-00004.csv
│   └── _SUCCESS
├── cluster_similarities
│   ├── output-00001.csv
│   ├── output-00002.csv
│   ├── output-00003.csv
│   ├── output-00004.csv
│   └── _SUCCESS
└── tamr
    ├── output.csv
    └── _SUCCESS

Guidance for Writing Integrations

If you are writing an integration, configure your program to parse all CSV files in the expected output folder, as the behavior changes with increased volume of records.

Example integration in Python:

# Import libraries
import glob
import os
import pandas as pd

# Get the list of CSV files from the output folder
path = "/dataProduct/tamr"
csv_files = sorted(glob.glob(os.path.join(path, "*.csv")))

# Read each CSV file into a DataFrame
# This creates a generator of DataFrames
df_list = (pd.read_csv(file) for file in csv_files)

# Concatenate all DataFrames into one
big_df = pd.concat(df_list, ignore_index=True)

Database Table Output

For database connections, the published output table names are as follows.

  • BigQuery: <configured_prefix>_<data_product_name>_<dataset_name>
  • Snowflake: <data_product_name>_<dataset_name>
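If a downstream job needs to locate these tables programmatically, it can build the name from its configuration. The sketch below is illustrative: it assumes dataset names follow the same space-and-hyphen-to-underscore normalization described earlier for data product names, so verify the result against your actual tables.

```python
import re

def published_table_name(data_product_name: str, dataset_name: str,
                         prefix: str = "") -> str:
    # Illustrative helper (not a Tamr API): replace spaces and hyphens
    # with underscores, mirroring the naming convention described above.
    def norm(s: str) -> str:
        return re.sub(r"[ -]", "_", s)

    parts = [prefix] if prefix else []  # BigQuery uses a configured prefix
    parts += [norm(data_product_name), norm(dataset_name)]
    return "_".join(parts)

print(published_table_name("My Data-Product", "cluster_details"))
# My_Data_Product_cluster_details
```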