Requirements for Source Datasets

Before you add a source dataset to your staging zone, ensure that it meets the requirements for use in your data product.

The source datasets for your data products might be stored in your organization's applications, databases, or file systems.

Before adding this data to Tamr Cloud:

  1. Perform any extract, transform, and load (ETL) steps needed to denormalize your datasets and ensure that they meet the general requirements described in this topic. Your datasets must also meet the specific schema requirements for the data product you are creating. See Data Product Guides (or Data Product Templates for legacy templates).
  2. Move or copy these prepared source datasets into a staging zone.
    A staging zone is a cloud storage system, together with the bucket or database within that system, that you have connected to Tamr Cloud. Your organization fully manages the staging zone and allows Tamr Cloud to import data from it through a configured, secure connection. See Connecting to External Data Repositories for more information about staging zones.

General Requirements

The following requirements apply to all source datasets, regardless of whether the source is a file or a database table or view:

  • Flat file, table, or view: All datasets must be flat tables, views, or files, with row-level granularity.
  • Unique name: Each source dataset must have a unique name.
  • Primary Key: Each source dataset must have a unique primary key field. See the About Primary Keys section below for more information.
  • Column Header Values:
    • Maximum length: 300 characters
    • Cannot begin with a digit. Column names that begin with a digit are prefixed with an underscore. For example, the column name 1_address is converted to _1_address.
    • Allowed characters: Letters (a-z, A-Z), numbers (0-9), and underscores.
    • Spaces and non-alphanumeric characters in column names are converted to underscores. Leading and trailing spaces are trimmed.
      Note: If you are using a GCS connection, replace any spaces in the file path with %20. For example, you would replace source /file source.csv with source%20/file%20source.csv.
    • Column names must be unique (case-sensitive).
    • Column names cannot include any of the following prefixes:
      • _TABLE_
      • _FILE_
      • _PARTITION_
      • _ROW_TIMESTAMP_
      • __ROOT__
      • _COLIDENTIFIER_
    • Column names cannot be the following exact name: NO
  • Column Data Values:
    • Data type:
      • BigQuery datasets: See Requirements for BigQuery
      • Snowflake datasets: all types supported by Snowflake
      • All other datasets: string only
        Note: Tamr Cloud converts all data to strings.
    • Encoding: UTF-8 or Windows-1252
    • Double-quoted values are allowed.
    • Data in each row must map to the header columns
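
The column header rules above can be checked before a dataset reaches the staging zone. The following Python sketch is illustrative only and is not part of Tamr Cloud. The function names normalize_column_name and check_headers are hypothetical, and the order in which the conversions are applied, as well as the case-sensitive matching of reserved prefixes and names, are assumptions.

```python
import re

MAX_HEADER_LENGTH = 300
RESERVED_PREFIXES = (
    "_TABLE_", "_FILE_", "_PARTITION_",
    "_ROW_TIMESTAMP_", "__ROOT__", "_COLIDENTIFIER_",
)
RESERVED_NAMES = {"NO"}


def normalize_column_name(name):
    """Approximate the documented conversions (order is an assumption)."""
    cleaned = name.strip(" ")                          # trim leading/trailing spaces
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", cleaned)   # other characters become "_"
    if cleaned[:1].isdigit():                          # leading digit gets "_" prefix
        cleaned = "_" + cleaned
    return cleaned


def check_headers(names):
    """Return a list of problems found in the normalized header names."""
    problems = []
    seen = set()
    for raw in names:
        name = normalize_column_name(raw)
        if len(name) > MAX_HEADER_LENGTH:
            problems.append(f"{raw!r}: longer than {MAX_HEADER_LENGTH} characters")
        if name.startswith(RESERVED_PREFIXES):
            problems.append(f"{raw!r}: starts with a reserved prefix")
        if name in RESERVED_NAMES:
            problems.append(f"{raw!r}: reserved column name")
        if name in seen:                               # exact, case-sensitive match
            problems.append(f"{raw!r}: duplicates another column after conversion")
        seen.add(name)
    return problems
```

For example, normalize_column_name("1_address") returns "_1_address", which matches the conversion described above. Checking for duplicates after conversion catches pairs such as "first name" and "first_name", which would otherwise collide.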

Requirements for Source Files

  • File size limit:
    • 20 GB per file added through AWS S3, Google Cloud Storage, or another connected cloud storage location
    • 500 MB per uploaded file
  • File formats (extensions): CSV (.csv)
  • Delimiters: Any single-character delimiter, such as a comma, semicolon, or pipe. Whitespace delimiters, such as space and tab, are not supported.
  • Row separators:
    • Newline
    • Carriage return followed by a newline
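
A source file can be checked against these requirements locally before it is uploaded or copied to the staging zone. The sketch below uses only the Python standard library. The function name check_csv is hypothetical, the default limit assumes the 500 MB upload case with 1 MB taken as 2**20 bytes, and reporting one problem per mismatched row is a design choice made for this example rather than Tamr Cloud behavior.

```python
import csv
import os

MAX_UPLOAD_BYTES = 500 * 1024 * 1024   # assumption: 1 MB = 2**20 bytes
UNSUPPORTED_DELIMITERS = {" ", "\t"}   # space and tab are not supported


def check_csv(path, delimiter=",", encoding="utf-8", max_bytes=MAX_UPLOAD_BYTES):
    """Return a list of problems found in a CSV source file."""
    problems = []
    if len(delimiter) != 1 or delimiter in UNSUPPORTED_DELIMITERS:
        return [f"unsupported delimiter: {delimiter!r}"]
    if not path.lower().endswith(".csv"):
        problems.append("file extension is not .csv")
    if os.path.getsize(path) > max_bytes:
        problems.append(f"file is larger than {max_bytes} bytes")

    # newline="" lets the csv module handle both "\n" and "\r\n" row separators.
    # Pass encoding="cp1252" for Windows-1252 files.
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        if header is None:
            problems.append("file is empty")
            return problems
        for row in reader:
            # Each data row should have one value per header column.
            if len(row) != len(header):
                problems.append(
                    f"line {reader.line_num}: expected {len(header)} "
                    f"fields, found {len(row)}"
                )
    return problems
```

An empty list means that no problems were found. For files added through a connected cloud storage location, pass a larger max_bytes value that corresponds to the 20 GB limit.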

Requirements for Source Database Tables and Views

  • Table size limit: 20 GB

Note for Snowflake sources: Snowflake treats all numeric values as 128-bit integers (Int128). Tamr Cloud supports up to 64-bit integers (Int64), and automatically converts numeric values to Int64. If your data includes numeric values that exceed the Int64 range and you need to preserve their precision, convert these values to strings in your downstream systems before loading the data into Tamr Cloud.
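
To decide whether a numeric column needs this conversion, you can test its values against the Int64 range, which is -2^63 through 2^63 - 1. The following Python sketch is a minimal illustration. The helper names fits_int64 and stringify_if_out_of_range are hypothetical, and converting the whole column (rather than only the out-of-range values) is an assumption made to keep the column's type consistent.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def fits_int64(value):
    """Return True if an integer can be stored as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def stringify_if_out_of_range(values):
    """Return the column unchanged if every value fits in Int64.

    Otherwise return every value as a string so that no precision is lost.
    None (NULL) values are passed through unchanged.
    """
    if all(v is None or fits_int64(v) for v in values):
        return list(values)
    return [None if v is None else str(v) for v in values]
```

Python integers have arbitrary precision, so the comparison itself does not overflow. In a database, the equivalent step would be a cast to a string type in the query or view that feeds the staging zone.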

About Primary Keys

A primary key is a single field that uniquely identifies a record in a source dataset.

Primary keys are unique and stable over time:

  • Unique: each primary key appears only once in the dataset.
  • Stable: the key for a given record does not change over time.

Tamr recommends choosing a primary key that is meaningful to the data, as this reduces the likelihood of breaking changes upstream. For example, if the source system has a designated primary key, it is usually best to use that key rather than another unique field.
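
The uniqueness property can be verified before a dataset is added. The sketch below is one way to do so in Python, assuming the records have already been read into a list of dictionaries (for example, with csv.DictReader). The function name duplicate_keys and the field name used in the test are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key_field):
    """Return the primary key values that appear more than once, sorted."""
    counts = Counter(row[key_field] for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)
```

An empty result means that every key value appears exactly once. Stability cannot be confirmed from a single snapshot. Comparing the keys of the same records across two extracts taken at different times is one way to spot keys that change.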