SnowPro Core Certification: Data Transformations (5/8)
✅Transforming Data During a Load
Snowflake supports transforming data while loading it into a table using the COPY INTO <table> command, dramatically simplifying your ETL pipeline for basic transformations. This feature helps you avoid the use of temporary tables to store pre-transformed data when reordering columns during a data load. This feature applies to both bulk loading and Snowpipe.
The COPY command supports:
Column reordering, column omission, and casts using a SELECT statement. There is no requirement for your data files to have the same number and ordering of columns as your target table.
The ENFORCE_LENGTH | TRUNCATECOLUMNS option, which can truncate text strings that exceed the target column length.
While loading the data, complex transformations such as joins, filters, aggregations, and the use of FLATTEN are not supported. Joining, filtering, and aggregating the data are possible ONLY after the data has been loaded into a table.
Table stages do not support transformations during the COPY process; transformations may only be performed while loading data from named stages (internal or external) or user stages.
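For example, a minimal sketch of a transforming load from a named stage (the table, stage, and file names here are illustrative):

  -- Reorder and cast two columns from a staged CSV file while loading.
  COPY INTO mytable (id, name)
  FROM (SELECT t.$2::NUMBER, t.$1 FROM @mystage/data.csv.gz t)
  FILE_FORMAT = (TYPE = 'CSV');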
👉Supported File Formats
The following file format types are supported for COPY transformations:
CSV
JSON
Avro
ORC
Parquet
XML
To parse a staged data file, it is necessary to describe its file format:
The default format is character-delimited UTF-8 text. The default field delimiter is a comma (,). The default record delimiter is the newline character. If the source data is in another format, specify the file format type and options.
When querying staged data files, the ERROR_ON_COLUMN_COUNT_MISMATCH option is ignored. There is no requirement for your data files to have the same number and ordering of columns as your target table.
To transform JSON data during a load operation, you must structure the data files in NDJSON (“Newline delimited JSON”) standard format; otherwise, you might encounter the following error:
Error parsing JSON: more than one document in the input
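For illustration, an NDJSON file contains one complete JSON document per line (the records below are made up):

  {"id": 1, "name": "Alice"}
  {"id": 2, "name": "Bob"}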
Specify the format type and options that match your data files.
To explicitly specify file format options, set them in one of the following ways:
Querying staged data files using a SELECT statement:
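A sketch, assuming a previously created file format object named my_csv_format:

  SELECT t.$1, t.$2
  FROM @mystage/data.csv.gz (FILE_FORMAT => 'my_csv_format') t;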
Loading columns from staged data files using a COPY INTO <table> statement:
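A sketch of the COPY form with inline file format options (names are illustrative):

  COPY INTO mytable
  FROM (SELECT t.$1, t.$2 FROM @mystage/data.csv.gz t)
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '|' SKIP_HEADER = 1);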
👉Parameters in COPY INTO
ENFORCE_LENGTH:
If TRUE, the COPY statement produces an error if a loaded string exceeds the target column length.
If FALSE, strings are automatically truncated to the target column length.
TRUNCATECOLUMNS:
If TRUE, strings are automatically truncated to the target column length.
If FALSE, the COPY statement produces an error if a loaded string exceeds the target column length.
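A sketch of the truncation behavior (table and stage names are illustrative):

  -- Strings longer than the target column are silently truncated rather than failing the load.
  COPY INTO mytable
  FROM @mystage/data.csv.gz
  TRUNCATECOLUMNS = TRUE;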
✅Transformation Parameters
( SELECT [alias.]$file_col_num[.element] [ , [alias.]$file_col_num[.element] ... ] FROM ... [ alias ] )
Required for transforming data during loading
Specifies an explicit set of fields/columns (separated by commas) to load from the staged data files. The fields/columns are selected from the files using a standard SQL query (i.e. SELECT list), where:
alias
Specifies an optional alias for the FROM value (e.g. d in COPY INTO t1 (c1) FROM (SELECT d.$1 FROM @mystage/file1.csv.gz d);).
file_col_num
Specifies the positional number of the field/column (in the file) that contains the data to be loaded (1 for the first field, 2 for the second field, etc.).
element
Specifies the path and element name of a repeating value in the data file (applies only to semi-structured data files).
The SELECT list defines a numbered set of fields/columns in the data files you are loading from. The list must match the sequence of columns in the target table. You can use the optional ( col_name [ , col_name ... ] ) parameter to map the list to specific columns in the target table.
Note that the actual field/column order in the data files can be different from the column order in the target table. It is only important that the SELECT list maps fields/columns in the data files to the corresponding columns in the table.
Note
The SELECT statement used for transformations does not support all functions. For a complete list of the supported functions and more details about data loading transformations, including examples, see the usage notes in Transforming Data During a Load.
Also, data loading transformation only supports selecting data from user stages and named stages (internal or external).
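Putting the pieces together, a sketch that maps file columns to specific table columns (all names illustrative):

  -- File column 3 feeds table column c1; file column 1 feeds c2.
  COPY INTO t1 (c1, c2)
  FROM (SELECT d.$3, d.$1 FROM @mystage/file1.csv.gz d);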
✅Supported Functions
Snowflake currently supports the following subset of functions for COPY transformations:
TO_DATE / DATE
Note that when this function is used to explicitly cast a value, neither the DATE_FORMAT file format option nor the DATE_INPUT_FORMAT parameter is applied.
TO_TIME / TIME
Note that when this function is used to explicitly cast a value, neither the TIME_FORMAT file format option nor the TIME_INPUT_FORMAT parameter is applied.
TO_TIMESTAMP / TO_TIMESTAMP_*
Note that when this function is used to explicitly cast a value, neither the TIMESTAMP_FORMAT file format option nor the TIMESTAMP_INPUT_FORMAT parameter is applied.
TRY_TO_DATE
Note that the COPY INTO <table> command does not support the optional format argument for this function.
TRY_TO_TIME
Note that the COPY INTO <table> command does not support the optional format argument for this function.
TRY_TO_TIMESTAMP / TRY_TO_TIMESTAMP_*
Note that the COPY INTO <table> command does not support the optional format argument for this function.
Note in particular that the VALIDATE function ignores the SELECT list in a COPY INTO <table> statement. The function parses the files referenced in the statement and returns any parsing errors. This behavior can be surprising if you expect the function to evaluate the files in the context of the COPY INTO <table> expressions.
Note that COPY transformations do not support the FLATTEN function, or JOIN or GROUP BY (aggregate) syntax.
The list of supported functions might expand over time.
The following categories of functions are also supported:
Scalar SQL UDFs.
Note
For Scalar SQL UDFs, Snowflake has limited support for transformation error handling, and you may encounter inconsistent or unexpected ON_ERROR copy option behavior.
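For example, a minimal sketch using a hypothetical scalar SQL UDF in the transformation:

  CREATE OR REPLACE FUNCTION clean_name(s STRING)
  RETURNS STRING
  AS 'TRIM(UPPER(s))';

  -- The UDF is applied to each field value as it is loaded.
  COPY INTO mytable (name)
  FROM (SELECT clean_name(t.$1) FROM @mystage/names.csv.gz t);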
👉Filtering Results
Filtering the results of a FROM clause using a WHERE clause is not supported. The ORDER BY, LIMIT, FETCH, and TOP keywords in SELECT statements are also not supported.
The DISTINCT keyword in SELECT statements is not fully supported. Specifying the keyword can lead to inconsistent or unexpected ON_ERROR copy option behavior.
👉VALIDATION_MODE Parameter
The VALIDATION_MODE parameter is not supported in COPY statements that transform data during a load.
👉CURRENT_TIME, CURRENT_TIMESTAMP Default Column Values
Instead of using CURRENT_TIME or CURRENT_TIMESTAMP default column values to capture load time, we recommend querying METADATA$START_SCAN_TIME, which gives an accurate representation of when each record was loaded. For more information, refer to Querying Metadata for Staged Files.
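A sketch of capturing the scan time alongside the data (table and stage names are illustrative):

  COPY INTO mytable (col1, load_time)
  FROM (SELECT t.$1, METADATA$START_SCAN_TIME FROM @mystage/data.csv.gz t);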
✅Transforming CSV Data
👉Load a Subset of Table Data
Load a subset of data into a table. For any missing columns, Snowflake inserts the default values. The following example loads data from columns 1, 2, 6, and 7 of a staged CSV file:
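A sketch of such a load, assuming an illustrative home_sales table and stage:

  -- File columns 1, 2, 6, and 7 feed the four listed table columns;
  -- any table columns not listed receive their default values.
  COPY INTO home_sales (city, zip, sale_date, price)
  FROM (SELECT t.$1, t.$2, t.$6, t.$7 FROM @mystage/sales.csv.gz t);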
👉Reorder CSV Columns During a Load
The following example reorders the column data from a staged CSV file before loading it into a table. Additionally, the COPY statement uses the SUBSTR (SUBSTRING) function to remove the first few characters of a string before inserting it:
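A sketch under the same assumptions:

  -- The file columns are reordered to match the table; SUBSTR(t.$1, 4) drops a 3-character prefix.
  COPY INTO home_sales (city, zip, sale_date, price)
  FROM (SELECT t.$2, SUBSTR(t.$1, 4), t.$6, t.$7 FROM @mystage/sales.csv.gz t);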
👉Convert Data Types During a Load
Convert staged data into other data types during a data load. All conversion functions are supported.
For example, convert strings to binary values, decimals, or timestamps using the TO_BINARY, TO_DECIMAL / TO_NUMBER / TO_NUMERIC, and TO_TIMESTAMP / TO_TIMESTAMP_* functions, respectively.
Sample CSV file:
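For illustration, a row along these lines (values are made up: a hex string, a decimal, and a timestamp):

  ab12,2.90,2024-01-15 09:30:00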
SQL statements:
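A sketch converting each field during the load (table and stage names are illustrative):

  CREATE OR REPLACE TABLE conv_demo (b BINARY, d NUMBER(5,2), ts TIMESTAMP);

  COPY INTO conv_demo (b, d, ts)
  FROM (SELECT TO_BINARY(t.$1, 'HEX'), TO_DECIMAL(t.$2, 5, 2), TO_TIMESTAMP(t.$3)
        FROM @mystage/convert.csv t);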
👉Include Sequence Columns in Loaded Data
Create a sequence object using CREATE SEQUENCE. When loading data into a table using the COPY command, access the object using a NEXTVAL expression to sequence the data in a target number column. For more information about using sequences in queries, see Using Sequences.
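A sketch, assuming an illustrative sequence seq1:

  CREATE OR REPLACE SEQUENCE seq1;

  -- seq1.NEXTVAL supplies a fresh sequence value for each loaded row.
  COPY INTO mytable (id, col1)
  FROM (SELECT seq1.NEXTVAL, t.$1 FROM @mystage/data.csv.gz t);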
👉Include AUTOINCREMENT / IDENTITY Columns in Loaded Data
Set the AUTOINCREMENT or IDENTITY default value for a number column. When loading data into a table using the COPY command, omit the column in the SELECT statement. The statement automatically populates the column.
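A sketch (names illustrative):

  CREATE OR REPLACE TABLE mytable (
    id NUMBER AUTOINCREMENT START 1 INCREMENT 1,
    col1 VARCHAR
  );

  -- The id column is omitted from the column list and the SELECT; Snowflake populates it automatically.
  COPY INTO mytable (col1)
  FROM (SELECT t.$1 FROM @mystage/data.csv.gz t);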
✅Transforming Semi-structured Data
The examples in this section apply to any semi-structured data type except where noted.
👉Load semi-structured Data into Separate Columns
The following example loads repeating elements from a staged semi-structured file into separate table columns with different data types.
This example loads the following semi-structured data into separate columns in a relational table, with the location object values loaded into a VARIANT column and the remaining values loaded into relational columns:
The following SQL statements load the file sales.json from the internal stage mystage:
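A sketch of what the data and statements might look like (field, table, and stage names are illustrative):

  {"location": {"city": "Houston", "zip": "77002"}, "type": "Residential", "sale_date": "2017-03-05", "price": "443250"}

  CREATE OR REPLACE TABLE home_sales_json (
    location  VARIANT,
    type      STRING,
    sale_date DATE,
    price     NUMBER
  );

  -- The location object lands in a VARIANT column; the remaining fields are cast to relational types.
  COPY INTO home_sales_json (location, type, sale_date, price)
  FROM (SELECT $1:location, $1:type, $1:sale_date::DATE, $1:price::NUMBER
        FROM @mystage/sales.json)
  FILE_FORMAT = (TYPE = 'JSON');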
Note
This example loads JSON data, but the SQL statements are similar when loading semi-structured data of other types (e.g. Avro, ORC, etc.).
For an additional example using Parquet data, see Load Parquet Data into Separate Columns (in this topic).
👉Load Parquet Data into Separate Columns
Similar to the previous example, but loads semi-structured data from a file in the Parquet format. This example is provided for users who are familiar with Apache Parquet:
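A sketch, assuming Parquet files with illustrative field names:

  -- Each Parquet record arrives as a single VARIANT ($1); fields are addressed by name and cast.
  COPY INTO parquet_sales (city, price)
  FROM (SELECT $1:city::STRING, $1:price::NUMBER FROM @mystage/sales.parquet)
  FILE_FORMAT = (TYPE = 'PARQUET');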
👉Flatten Semi-structured Data
FLATTEN is a table function that produces a lateral view of a VARIANT, OBJECT, or ARRAY column. Using the sample data from Load semi-structured Data into Separate Columns, create a table with a separate row for each element in the objects.
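Because COPY does not support FLATTEN, the lateral view is produced with an ordinary query; a sketch against the staged file, assuming an illustrative JSON file format object:

  CREATE OR REPLACE TABLE flattened_sales AS
    SELECT t.$1:type::STRING AS type, f.key, f.value
    FROM @mystage/sales.json (FILE_FORMAT => 'my_json_format') t,
         LATERAL FLATTEN(INPUT => t.$1:location) f;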
👉Split Semi-structured Elements and Load as VARIANT Values into Separate Columns
Following the instructions in Load semi-structured Data into Separate Columns, you can load individual elements from semi-structured data into different columns in your target table. Additionally, using the SPLIT function, you can split element values that contain a separator and load them as an array.
For example, split IP addresses on the dot separator in repeating elements. Load the IP addresses as arrays in separate columns:
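A sketch, assuming JSON records with an illustrative ip_address field:

  -- SPLIT turns '192.168.1.1' into the array ['192','168','1','1'], loaded into an ARRAY/VARIANT column.
  COPY INTO splittable (ip_array)
  FROM (SELECT SPLIT($1:ip_address::STRING, '.') FROM @mystage/ips.json)
  FILE_FORMAT = (TYPE = 'JSON');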