The source datasets for your data products might be stored in your organization's applications, databases, or file systems.
Before adding this data to Tamr Cloud:
- Perform any necessary extract, transform, and load (ETL) steps to denormalize your datasets and ensure that they meet the general requirements described in this topic. Your datasets must also meet the specific schema requirements for the data product you are creating. See Data Product Templates.
- Move or copy these prepared source datasets into a staging zone.
A staging zone is a cloud storage system, together with the bucket or database within it, that you have connected to Tamr Cloud. Your organization fully manages the staging zone and allows Tamr Cloud to import data from it through a configured, secure connection. See Connecting to External Data Repositories for more information about staging zones.
The following requirements apply to all source datasets, regardless of whether the source is a file or a database table or view:
- Flat file, table, or view: All datasets must be flat tables, views, or files, with row-level granularity.
- Unique name: Each source dataset must have a unique name.
- Primary Key: Each source dataset must have a unique primary key field. See the About Primary Keys section below for more information.
- Column Header Values:
- Maximum length: 300 characters
- Allowed characters: Letters (a-z, A-Z), numbers (0-9), and underscores.
Spaces and non-alphanumeric characters in column names are converted to underscores. Leading and trailing spaces are trimmed.
- Column names must be unique (case-sensitive).
- Column names cannot include any of the following prefixes:
- Column names cannot be the following exact name:
- Column Data Values:
- Data type:
- BigQuery datasets: See Requirements for BigQuery
- Snowflake datasets: all types supported by Snowflake
- All other datasets:
Note: Tamr Cloud converts all data to strings.
- Format: UTF-8, Windows 1252
- Double-quoted values are allowed.
- Data in each row must map to the header columns.
- File size limit:
- 20GB per file added through AWS S3, Google Cloud Storage, or other connected cloud storage location
- 500 MB per uploaded file
- File formats (extensions): CSV (.csv)
- Delimiters: Comma
- Row separators:
- Carriage return followed by a newline
- Table size limit: 20 GB
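The column header and row rules above can be checked locally before a file is moved to the staging zone. The following is a minimal sketch, not Tamr's implementation: the function names `normalize_column_name` and `check_csv` are hypothetical, and the normalization only approximates the conversion described above by trimming surrounding spaces and replacing every character outside a-z, A-Z, 0-9, and underscore with an underscore. It does not check the prefix or exact-name restrictions.

```python
import csv
import io
import re

MAX_HEADER_LENGTH = 300  # maximum column header length stated in the requirements


def normalize_column_name(name: str) -> str:
    """Approximate the documented conversion (an assumption, not Tamr's code):
    trim surrounding spaces, then replace every character that is not a
    letter, digit, or underscore with an underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name.strip(" "))


def check_csv(text: str) -> list[str]:
    """Return a list of problems found in comma-delimited CSV text."""
    problems = []
    # newline="" leaves row separators untouched so the csv module parses them.
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return ["file is empty"]

    header = [normalize_column_name(h) for h in rows[0]]

    for name in header:
        if len(name) > MAX_HEADER_LENGTH:
            problems.append(
                f"header longer than {MAX_HEADER_LENGTH} characters: {name[:20]}..."
            )

    # Uniqueness is case-sensitive, so "ID" and "id" count as different names.
    seen = set()
    for name in header:
        if name in seen:
            problems.append(f"duplicate column name after normalization: {name}")
        seen.add(name)

    # Every data record must map onto the header columns.
    for record_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"record {record_no} has {len(row)} values, expected {len(header)}"
            )
    return problems
```

For example, `check_csv("id,first name,first_name\r\n1,a\r\n")` reports two problems: the headers `first name` and `first_name` collide after normalization, and the single data record has two values where three are expected.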
A primary key is a single field that uniquely identifies a record in a source dataset.
Primary keys are unique and stable over time:
- Unique: each primary key appears only once in the dataset.
- Stable: the key for a given record does not change over time.
Tamr recommends choosing a primary key that is meaningful to the data, because this reduces the likelihood of breaking changes upstream. For example, if the source system already has a designated primary key, it is usually best to use that key rather than another unique field.
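Uniqueness can be verified before a dataset is added; stability cannot be confirmed from a single snapshot. The sketch below assumes a comma-delimited CSV source and uses a hypothetical helper, `find_duplicate_keys`, to list key values that appear more than once. It illustrates the uniqueness rule only and is not part of Tamr Cloud.

```python
import csv
import io
from collections import Counter


def find_duplicate_keys(text: str, key_column: str) -> dict[str, int]:
    """Return each primary key value that appears more than once, with its count."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    if reader.fieldnames is None or key_column not in reader.fieldnames:
        raise ValueError(f"column {key_column!r} not found in header")
    # Count how many records carry each key value.
    counts = Counter(row[key_column] for row in reader)
    return {key: n for key, n in counts.items() if n > 1}
```

An empty result means every key value in the file is unique. To check stability as well, one option is to compare the key sets of two exports taken at different times.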