The source datasets for your data products might be stored in your organization's applications, databases, or file systems.

Before adding this data to Tamr Cloud:

Perform any necessary Extract, Transform, or Load (ETL) steps to denormalize your datasets and ensure that the source datasets meet the general requirements described in this topic. Your datasets must also meet the specific schema requirements for the data product you are creating. See Data Product Guides (or Data Product Templates for legacy templates).
Move or copy these prepared source datasets into a staging zone.
A staging zone is a cloud storage system and the bucket or database within that system that you have connected to Tamr Cloud. Your organization fully manages the staging zone, and allows Tamr Cloud to import data from that system through a configured, secure connection. See Connecting to External Data Repositories for more information about staging zones.

General Requirements

The following requirements apply to all source datasets, regardless of whether the source is a file or a database table or view:

Flat file, table, or view: All datasets must be flat tables, views, or files, with row-level granularity.
Unique name: Each source dataset must have a unique name.
Primary Key: Each source dataset must have a unique primary key field. See the About Primary Keys section below for more information.
Column Header Values:
- Maximum length: 300 characters
- Cannot begin with number characters. Columns beginning with number characters will be prepended by an underscore. For example, the column name 1_address will be converted to _1_address.
- Allowed characters: Letters (a-z, A-Z), numbers (0-9), and underscores.
- Spaces and non-alphanumeric characters in column names are converted to underscores. Leading and trailing spaces are trimmed.
  Note: If you are using a GCS connection, replace any spaces in the file path with %20. For example, you would replace source /file source.csv with source%20/file%20source.csv.
- Column names must be unique (case-insensitive).
- Column names cannot include any of the following prefixes:
  - _TABLE_
  - _FILE_
  - _PARTITION_
  - _ROW_TIMESTAMP_
  - __ROOT__
  - _COLIDENTIFIER_
- Column names cannot be the following exact name: NO
Column Data Values:
- Data type:
  - BigQuery datasets: See Requirements for BigQuery
  - Snowflake datasets: all types supported by Snowflake
  - All other datasets: string only
    Note: Tamr Cloud converts all data to to string
- Format: UTF-8, Windows 1252
- Double-quoted values are allowed.
- Data in each row must map to the header columns
Supported Date and Time Formats:. Tamr supports all combinations of the following:
- Date formats:
  - yyyy-MM-dd
  - dd.MM.yyyy
  - d.MM.yyyy
  - dd/MM/yyyy
  - d/MM/yyyy
- Time formats:
  - HH:mm
  - HH:mm:ss
  - HH:mm:ss.SSS
- Time format postfixes:
  - none
  - Z
  - X
  - XXX
  - 'XXX'
- Date-time delimiters:
  - space (" ")
  - 'T'

Supported date and time examples:

yyyy-MM-dd
dd.MM.yyyy
yyyy-MM-dd HH:mm:ss
yyyy-MM-dd'T'HH:mm:ss
yyyy-MM-dd'T'HH:mm:ss.SSS
yyyy-MM-dd'T'HH:mm:ss.SSSX
yyyy-MM-dd'T'HH:mm:ss.SSSXXX
yyyy-MM-dd'T'HH:mm:ss.SSS'XXX'

Requirements for Source Files

Note: For file upload, only CSV format is supported.

File size limit:
- 20GB per file added through AWS S3, Google Cloud Storage, or other connected cloud storage location
- 500 MB per uploaded file
File formats:
- Avro
- CSV (.csv)
- Delta Lake
- New-line delimited JSON (NDJSON)
- Parquet
Specific to CSV files
- Delimiters: Any single-character delimiter such as comma, semi-colon, and pipe. Space-based delimiters such as space and tab are not supported.
- Row separators:
  - Newline
  - Carriage return followed by a newline
Specific to Delta Lake sources
- Ensure the version of the Delta table protocol you are using is compatible with Tamr Cloud. You can ensure compatibility by using only features supported by Delta table basic functionality, as described in the Table Protocol Versioning Delta Lake Documentation. Tamr is compatible with the following:
  - minReaderVersion: 1
  - minWriterVersion: 2
- Additionally, if using Databricks ensure the enableDeletionVectors setting is disabled at the workspace level.
Specific to GCS sources:
- The filepath for GCS sources cannot include colons.
Specific to Parquet sources:
- For Parquet files, Tamr does not support datetime fields that have been generated using INT96 data type.

Requirements for Source Database Tables and Views

Table size limit: 20 GB

Note for Snowflake sources: Snowflake treats all numeric values as 128-bit integers (Int128). Tamr Cloud supports up to 64-bit integers (Int64), and automatically converts numeric values to Int64. If your data includes numeric values over Int64 and you need to preserve precision above Int64, covert these values to strings in your downstream systems before loading the data into Tamr Cloud.

About Primary Keys

A primary key is a single field that uniquely identifies a record in a source dataset.

Primary keys are unique and stable over time:

Unique: each primary key appears only once in the dataset.
Stable: the key for a given record does not change over time.

Tamr suggests the primary key to be meaningful to the data, as this reduces the likelihood of breaking changes upstream. For example, if there is a designated primary key in the source system, it may be best to use this as the primary key, rather than another unique key.