SnowPro Core Certification: Data Transformations (5/8)

SnowPro Core Certification: Data Transformations (5/8)



Please subscribe YouTube Channel(请订阅油管频道): Data Driven Wealth 数说财富 DDW - YouTube

✅Transforming Data During a Load

https://docs.snowflake.com/en/user-guide/data-load-transform

Snowflake supports transforming data while loading it into a table using the COPY INTO <table> command, dramatically simplifying your ETL pipeline for basic transformations. This feature helps you avoid the use of temporary tables to store pre-transformed data when reordering columns during a data load. This feature applies to both bulk loading and Snowpipe.

The COPY command supports:

  • Column reordering, column omission, and casts using a SELECT statement. There is no requirement for your data files to have the same number and ordering of columns as your target table.

  • The ENFORCE_LENGTH | TRUNCATECOLUMNS option, which can truncate text strings that exceed the target column length.

When loading data into a table using the COPY command, Snowflake allows you to do simple transformations on the data as it is being loaded. During the load process, the COPY command allows for modifying the order of columns, omitting one or more columns, casting data into specified data types, and truncating values. 

While loading the data, complex transformations such as joins, filters, aggregations, and the use of FLATTEN are not supported as they are not essential data transformations. Therefore, joining, filtering, and aggregating the data are supported ONLY after the data has been loaded into a table.

The table stages do not allow basic transformations during the COPY process; thus, basic transformations may only be performed while loading data from external stages, named internal stages or user stages.

👉Supported File Formats

The following file format types are supported for COPY transformations:

  • CSV

  • JSON

  • Avro

  • ORC

  • Parquet

  • XML

To parse a staged data file, it is necessary to describe its file format:

CSV

The default format is character-delimited UTF-8 text. The default field delimiter is a comma character (,). The default record delimiter is the new line character. If the source data is in another format, specify the file format type and options.

When querying staged data files, the ERROR_ON_COLUMN_COUNT_MISMATCH option is ignored. There is no requirement for your data files to have the same number and ordering of columns as your target table.

JSON

To transform JSON data during a load operation, you must structure the data files in NDJSON (“Newline delimited JSON”) standard format; otherwise, you might encounter the following error:

Error parsing JSON: more than one document in the input

All other file format types

Specify the format type and options that match your data files.

To explicitly specify file format options, set them in one of the following ways:

Querying staged data files using a SELECT statement:

  • As file format options specified for a named file format or stage object. The named file format/stage object can then be referenced in the SELECT statement.

Loading columns from staged data files using a COPY INTO <table> statement:

  • As file format options specified directly in the COPY INTO <table> statement.

  • As file format options specified for a named file format or stage object. The named file format/stage object can then be referenced in the COPY INTO <table> statement.

If no File Format object or options are not provided to either the stage or copy statement, the default behaviour will be to try and interpret the contents of a stage as a CSV with UTF-8 encoding.

👉Parameters in copy into

ENFORCE_LENGTH:   
  • If TRUE, the COPY statement produces an error if a loaded string exceeds the target column length.

  • If FALSE, strings are automatically truncated to the target column length. 

TRUNCATECOLUMNS:

  • If TRUE, strings are automatically truncated to the target column length.

  • If FALSE, the COPY statement produces an error if a loaded string exceeds the target column length.


Transformation Parameters

( SELECT [alias.]$file_col_num[.element] [ , [alias.]$file_col_num[.element] ... ] FROM ... [ alias ] )

Required for transforming data during loading

Specifies an explicit set of fields/columns (separated by commas) to load from the staged data files. The fields/columns are selected from the files using a standard SQL query (i.e. SELECT list), where:

alias

Specifies an optional alias for the FROM value (e.g. d in COPY INTO t1 (c1) FROM (SELECT d.$1 FROM @mystage/file1.csv.gz d);).

file_col_num

Specifies the positional number of the field/column (in the file) that contains the data to be loaded (1 for the first field, 2 for the second field, etc.)

element

Specifies the path and element name of a repeating value in the data file (applies only to semi-structured data files).

The SELECT list defines a numbered set of field/columns in the data files you are loading from. The list must match the sequence of columns in the target table. You can use the optional ( col_name [ , col_name ... ] ) parameter to map the list to specific columns in the target table.

Note that the actual field/column order in the data files can be different from the column order in the target table. It is only important that the SELECT list maps fields/columns in the data files to the corresponding columns in the table.

Note

The SELECT statement used for transformations does not support all functions. For a complete list of the supported functions and more details about data loading transformations, including examples, see the usage notes in Transforming Data During a Load.

Also, data loading transformation only supports selecting data from user stages and named stages (internal or external).

Supported Functions

Snowflake currently supports the following subset of functions for COPY transformations:

Note in particular that the VALIDATE function ignores the SELECT list in a COPY INTO <table> statement. The function parses the files referenced in the statement and returns any parsing errors. This behavior can be surprising if you expect the function to evaluate the files in the context of the COPY INTO <table> expressions.

Note that COPY transformations do not support the FLATTEN function, or JOIN or GROUP BY (aggregate) syntax.

The list of supported functions might expand over time.

The following categories of functions are also supported:

Note

For Scalar SQL UDFs, Snowflake has limited support for transformation error handling, and you may encounter inconsistent or unexpected ON_ERROR copy option behavior.



Filtering Results

Filtering the results of a FROM clause using a WHERE clause is not supported. The ORDER BY, LIMIT,FETCH,TOP keywords in SELECT statements are also not supported.

The DISTINCT keyword in SELECT statements is not fully supported. Specifying the keyword can lead to inconsistent or unexpected ON_ERROR copy option behavior.

VALIDATION_MODE Parameter

The VALIDATION_MODE parameter does not support COPY statements that transform data during a load.

CURRENT_TIME, CURRENT_TIMESTAMP Default Column Values

Instead of using CURRENT_TIME, CURRENT_TIMESTAMP default column values to capture load time, we recommend that you query METADATA$START_SCAN_TIME to get an accurate time value of record loading. For more information, refer to Querying Metadata for Staged Files.

Transforming CSV Data

👉Load a Subset of Table Data

Load a subset of data into a table. For any missing columns, Snowflake inserts the default values. The following example loads data from columns 1, 2, 6, and 7 of a staged CSV file:

copy into home_sales(city, zip, sale_date, price)
   from (select t.$1, t.$2, t.$6, t.$7 from @mystage/sales.csv.gz t)
   FILE_FORMAT = (FORMAT_NAME = mycsvformat);

Reorder CSV Columns During a Load

The following example reorders the column data from a staged CSV file before loading it into a table. Additionally, the COPY statement uses the SUBSTR , SUBSTRING function to remove the first few characters of a string before inserting it:

copy into home_sales(city, zip, sale_date, price)
   from (select SUBSTR(t.$2,4), t.$1, t.$5, t.$4 from @mystage t)
   FILE_FORMAT = (FORMAT_NAME = mycsvformat);

Convert Data Types During a Load

Convert staged data into other data types during a data load. All conversion functions are supported.

For example, convert strings as binary values, decimals, or timestamps using the TO_BINARYTO_DECIMAL , TO_NUMBER , TO_NUMERIC, and TO_TIMESTAMP / TO_TIMESTAMP_* functions, respectively.

Sample CSV file:

snowflake,2.8,2016-10-5
warehouse,-12.3,2017-01-23

SQL statements:

-- Stage a data file in the internal user stage
PUT file:///tmp/datafile.csv @~;

-- Query the staged data file
select t.$1,t.$2,t.$3 from @~/datafile.csv.gz t;

-- Create the target table
create or replace table casttb (
  col1 binary,
  col2 decimal,
  col3 timestamp_ntz
  );

-- Convert the staged CSV column data to the specified data types before loading it into the destination table
copy into casttb(col1, col2, col3)
from (
  select to_binary(t.$1, 'utf-8'),to_decimal(t.$2, '99.9', 9, 5),to_timestamp_ntz(t.$3)
  from @~/datafile.csv.gz t
)
file_format = (type = csv);

-- Query the target table
select * from casttb;

+--------------------+------+-------------------------+
| COL1               | COL2 | COL3                    |
|--------------------+------+-------------------------|
| 736E6F77666C616B65 |    3 | 2016-10-05 00:00:00.000 |
| 77617265686F757365 |  -12 | 2017-01-23 00:00:00.000 |
+--------------------+------+-------------------------+

Include Sequence Columns in Loaded Data

Create a sequence object using CREATE SEQUENCE. When loading data into a table using the COPY command, access the object using a NEXTVAL expression to sequence the data in a target number column. For more information about using sequences in queries, see Using Sequences.

-- Create a sequence
create sequence seq1;

-- Create the target table
create or replace table mytable (
  col1 number default seq1.nextval,
  col2 varchar,
  col3 varchar
  );

-- Stage a data file in the internal user stage
PUT file:///tmp/myfile.csv @~;

-- Query the staged data file
select $1, $2 from @~/myfile.csv.gz t;

+-----+-----+
| $1  | $2  |
|-----+-----|
| abc | def |
| ghi | jkl |
| mno | pqr |
| stu | vwx |
+-----+-----+

-- Include the sequence nextval expression in the COPY statement
copy into mytable (col1, col2, col3)
from (
  select seq1.nextval, $1, $2
  from @~/myfile.csv.gz t
)
;

select * from mytable;

+------+------+------+
| COL1 | COL2 | COL3 |
|------+------+------|
|    1 | abc  | def  |
|    2 | ghi  | jkl  |
|    3 | mno  | pqr  |
|    4 | stu  | vwx  |
+------+------+------+

Include AUTOINCREMENT / IDENTITY Columns in Loaded Data

Set the AUTOINCREMENT or IDENTITY default value for a number column. When loading data into a table using the COPY command, omit the column in the SELECT statement. The statement automatically populates the column.

-- Create the target table
create or replace table mytable (
  col1 number autoincrement start 1 increment 1,
  col2 varchar,
  col3 varchar
  );

-- Stage a data file in the internal user stage
PUT file:///tmp/myfile.csv @~;

-- Query the staged data file
select $1, $2 from @~/myfile.csv.gz t;

+-----+-----+
| $1  | $2  |
|-----+-----|
| abc | def |
| ghi | jkl |
| mno | pqr |
| stu | vwx |
+-----+-----+

-- Omit the sequence column in the COPY statement
copy into mytable (col2, col3)
from (
  select $1, $2
  from @~/myfile.csv.gz t
)
;

select * from mytable;

+------+------+------+
| COL1 | COL2 | COL3 |
|------+------+------|
|    1 | abc  | def  |
|    2 | ghi  | jkl  |
|    3 | mno  | pqr  |
|    4 | stu  | vwx  |
+------+------+------+

Transforming Semi-structured Data

The examples in this section apply to any semi-structured data type except where noted.

Load semi-structured Data into Separate Columns

The following example loads repeating elements from a staged semi-structured file into separate table columns with different data types.

This example loads the following semi-structured data into separate columns in a relational table, with the location object values loaded into a VARIANT column and the remaining values loaded into relational columns:

-- Sample data:
{"location": {"city": "Lexington","zip": "40503"},"dimensions": {"sq_ft": "1000"},"type": "Residential","sale_date": "4-25-16","price": "75836"},
{"location": {"city": "Belmont","zip": "02478"},"dimensions": {"sq_ft": "1103"},"type": "Residential","sale_date": "6-18-16","price": "92567"},
{"location": {"city": "Winchester","zip": "01890"},"dimensions": {"sq_ft": "1122"},"type": "Condo","sale_date": "1-31-16","price": "89921"}

The following SQL statements load the file sales.json from the internal stage mystage:

Note

This example loads JSON data, but the SQL statements are similar when loading semi-structured data of other types (e.g. Avro, ORC, etc.).

For an additional example using Parquet data, see Load Parquet Data into Separate Columns (in this topic).

 -- Create an internal stage with the file type set as JSON.
 CREATE OR REPLACE STAGE mystage
   FILE_FORMAT = (TYPE = 'json');

 -- Stage a JSON data file in the internal stage.
 PUT file:///tmp/sales.json @mystage;

 -- Query the staged data. The data file comprises three objects in NDJSON format.
 SELECT t.$1 FROM @mystage/sales.json.gz t;

 +------------------------------+
 | $1                           |
 |------------------------------|
 | {                            |
 |   "dimensions": {            |
 |     "sq_ft": "1000"          |
 |   },                         |
 |   "location": {              |
 |     "city": "Lexington",     |
 |     "zip": "40503"           |
 |   },                         |
 |   "price": "75836",          |
 |   "sale_date": "2022-08-25", |
 |   "type": "Residential"      |
 | }                            |
 | {                            |
 |   "dimensions": {            |
 |     "sq_ft": "1103"          |
 |   },                         |
 |   "location": {              |
 |     "city": "Belmont",       |
 |     "zip": "02478"           |
 |   },                         |
 |   "price": "92567",          |
 |   "sale_date": "2022-09-18", |
 |   "type": "Residential"      |
 | }                            |
 | {                            |
 |   "dimensions": {            |
 |     "sq_ft": "1122"          |
 |   },                         |
 |   "location": {              |
 |     "city": "Winchester",    |
 |     "zip": "01890"           |
 |   },                         |
 |   "price": "89921",          |
 |   "sale_date": "2022-09-23", |
 |   "type": "Condo"            |
 | }                            |
 +------------------------------+

 -- Create a target table for the data.
 CREATE OR REPLACE TABLE home_sales (
   CITY VARCHAR,
   POSTAL_CODE VARCHAR,
   SQ_FT NUMBER,
   SALE_DATE DATE,
   PRICE NUMBER
 );

 -- Copy elements from the staged file into the target table.
 COPY INTO home_sales(city, postal_code, sq_ft, sale_date, price)
 FROM (select
 $1:location.city::varchar,
 $1:location.zip::varchar,
 $1:dimensions.sq_ft::number,
 $1:sale_date::date,
 $1:price::number
 FROM @mystage/sales.json.gz t);

 -- Query the target table.
 SELECT * from home_sales;

+------------+-------------+-------+------------+-------+
| CITY       | POSTAL_CODE | SQ_FT | SALE_DATE  | PRICE |
|------------+-------------+-------+------------+-------|
| Lexington  | 40503       |  1000 | 2022-08-25 | 75836 |
| Belmont    | 02478       |  1103 | 2022-09-18 | 92567 |
| Winchester | 01890       |  1122 | 2022-09-23 | 89921 |
+------------+-------------+-------+------------+-------+

Load Parquet Data into Separate Columns

Similar to the previous example, but loads semi-structured data from a file in the Parquet format. This example is provided for users who are familiar with Apache Parquet:

-- Create a file format object that sets the file format type. Accept the default options.
create or replace file format my_parquet_format
  type = 'parquet';

-- Create an internal stage and specify the new file format
create or replace temporary stage mystage
  file_format = my_parquet_format;

-- Create a target table for the data.
create or replace table parquet_col (
  custKey number default NULL,
  orderDate date default NULL,
  orderStatus varchar(100) default NULL,
  price varchar(255)
);

-- Stage a data file in the internal stage
put file:///tmp/mydata.parquet @mystage;

-- Copy data from elements in the staged Parquet file into separate columns
-- in the target table.
-- Note that all Parquet data is stored in a single column ($1)
-- SELECT list items correspond to element names in the Parquet file
-- Cast element values to the target column data type
copy into parquet_col
  from (select
  $1:o_custkey::number,
  $1:o_orderdate::date,
  $1:o_orderstatus::varchar,
  $1:o_totalprice::varchar
  from @mystage/mydata.parquet);

-- Query the target table
SELECT * from parquet_col;

+---------+------------+-------------+-----------+
| CUSTKEY | ORDERDATE  | ORDERSTATUS | PRICE     |
|---------+------------+-------------+-----------|
|   27676 | 1996-09-04 | O           | 83243.94  |
|  140252 | 1994-01-09 | F           | 198402.97 |
...
+---------+------------+-------------+-----------+

Flatten Semi-structured Data

FLATTEN is a table function that produces a lateral view of a VARIANT, OBJECT, or ARRAY column. Using the sample data from Load semi-structured Data into Separate Columns, create a table with a separate row for each element in the objects.

-- Create an internal stage with the file delimiter set as none and the record delimiter set as the new line character
create or replace stage mystage
  file_format = (type = 'json');

-- Stage a JSON data file in the internal stage with the default values
put file:///tmp/sales.json @mystage;

-- Create a table composed of the output from the FLATTEN function
create or replace table flattened_source
(seq string, key string, path string, index string, value variant, element variant)
as
  select
    seq::string
  , key::string
  , path::string
  , index::string
  , value::variant
  , this::variant
  from @mystage/sales.json.gz
    , table(flatten(input => parse_json($1)));

  select * from flattened_source;

+-----+-----------+-----------+-------+-------------------------+-----------------------------+
| SEQ | KEY       | PATH      | INDEX | VALUE                   | ELEMENT                     |
|-----+-----------+-----------+-------+-------------------------+-----------------------------|
| 1   | location  | location  | NULL  | {                       | {                           |
|     |           |           |       |   "city": "Lexington",  |   "location": {             |
|     |           |           |       |   "zip": "40503"        |     "city": "Lexington",    |
|     |           |           |       | }                       |     "zip": "40503"          |
|     |           |           |       |                         |   },                        |
|     |           |           |       |                         |   "price": "75836",         |
|     |           |           |       |                         |   "sale_date": "2017-3-5",  |
|     |           |           |       |                         |   "sq__ft": "1000",         |
|     |           |           |       |                         |   "type": "Residential"     |
|     |           |           |       |                         | }                           |
...
| 3   | type      | type      | NULL  | "Condo"                 | {                           |
|     |           |           |       |                         |   "location": {             |
|     |           |           |       |                         |     "city": "Winchester",   |
|     |           |           |       |                         |     "zip": "01890"          |
|     |           |           |       |                         |   },                        |
|     |           |           |       |                         |   "price": "89921",         |
|     |           |           |       |                         |   "sale_date": "2017-3-21", |
|     |           |           |       |                         |   "sq__ft": "1122",         |
|     |           |           |       |                         |   "type": "Condo"           |
|     |           |           |       |                         | }                           |
+-----+-----------+-----------+-------+-------------------------+-----------------------------+

Split Semi-structured Elements and Load as VARIANT Values into Separate Columns

Following the instructions in Load semi-structured Data into Separate Columns, you can load individual elements from semi-structured data into different columns in your target table. Additionally, using the SPLIT function, you can split element values that contain a separator and load them as an array.

For example, split IP addresses on the dot separator in repeating elements. Load the IP addresses as arrays in separate columns:

-- Create an internal stage with the file delimiter set as none and the record delimiter set as the new line character
create or replace stage mystage
  file_format = (type = 'json');

-- Stage a semi-structured data file in the internal stage
put file:///tmp/ipaddress.json @mystage auto_compress=true;

-- Query the staged data
select t.$1 from @mystage/ipaddress.json.gz t;

+----------------------------------------------------------------------+
| $1                                                                   |
|----------------------------------------------------------------------|
| {"ip_address": {"router1": "192.168.1.1","router2": "192.168.0.1"}}, |
| {"ip_address": {"router1": "192.168.2.1","router2": "192.168.3.1"}}  |
+----------------------------------------------------------------------+

-- Create a target table for the semi-structured data
create or replace table splitjson (
  col1 array,
  col2 array
  );

-- Split the elements into individual arrays using the SPLIT function and load them into separate columns
-- Note that all JSON data is stored in a single column ($1)
copy into splitjson(col1, col2)
from (
  select split($1:ip_address.router1, '.'),split($1:ip_address.router2, '.')
  from @mystage/ipaddress.json.gz t
);

-- Query the target table
select * from splitjson;

+----------+----------+
| COL1     | COL2     |
|----------+----------|
| [        | [        |
|   "192", |   "192", |
|   "168", |   "168", |
|   "1",   |   "0",   |
|   "1"    |   "1"    |
| ]        | ]        |
| [        | [        |
|   "192", |   "192", |
|   "168", |   "168", |
|   "2",   |   "3",   |
|   "1"    |   "1"    |
| ]        | ]        |
+----------+----------+


No comments:

Post a Comment

Featured Posts

SnowPro Badges and Certificates

SnowPro Badges and Certificates Online Verification https://achieve.snowflake.com/profile/richardhou888/wallet

Popular Posts Recommended